# **Data Labeling**
Datasets are labeled with an ensemble method that combines Majority Voting and Averaging over the following 4 pre-trained models and libraries.

1. Twitter-roBERTa-base (https://arxiv.org/abs/2010.12421)
2. XLM-T (https://arxiv.org/abs/2104.12250)
3. VADER (https://ojs.aaai.org/index.php/ICWSM/article/view/14550)
4. TextBlob

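Majority Voting is used primarily. If there is a tie in the top vote count, Averaging is used to make the decision. In the ensemble cell further down, XLM-T, VADER, and TextBlob vote first, and Twitter-roBERTa-base casts an additional vote only when those three tie.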
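The voting rule can be sketched in isolation as follows. This is a minimal illustration rather than the notebook's exact code: the helper name `ensemble_label` and the `WEIGHTS` table are assumptions, with the +1 / 0 / -1 mapping taken from the averaging step of the ensemble cell.

```python
from collections import Counter

# Hypothetical helper; weights mirror the +1 / 0 / -1 mapping used for averaging.
WEIGHTS = {"positive": 1, "neutral": 0, "negative": -1}

def ensemble_label(votes):
    """Return the majority label, falling back to the mean vote weight on a tie."""
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    if len(winners) == 1:
        return winners[0]
    # Tie in the top count: average the signed weights of all votes.
    avg = sum(WEIGHTS[v] for v in votes) / len(votes)
    if avg > 0:
        return "positive"
    if avg < 0:
        return "negative"
    return "neutral"

print(ensemble_label(["positive", "positive", "neutral", "negative"]))   # positive
print(ensemble_label(["positive", "positive", "neutral", "neutral"]))    # positive (avg = 0.5)
print(ensemble_label(["positive", "negative", "positive", "negative"]))  # neutral (avg = 0)
```

Averaging only decides the label when two or more labels share the top count; otherwise the plain majority wins.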

In [None]:
!pip install transformers
!pip install sentencepiece
!pip install vaderSentiment
!pip install textblob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple

In [None]:

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

from transformers import pipeline

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from textblob import TextBlob

In [None]:
def preprocess(text):
    text = str(text)
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
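To make the effect of this normalization concrete, here is a small sketch. `normalize_tweet` is a hypothetical, more compact variant written for illustration; it is intended to behave like `preprocess` above, masking user mentions as `@user` and links as `http`.

```python
def normalize_tweet(text):
    """Hypothetical compact variant of preprocess(): mask mentions and links."""
    def mask(token):
        if token.startswith("@") and len(token) > 1:
            return "@user"
        if token.startswith("http"):
            return "http"
        return token
    # Split on single spaces, as the original does, so spacing is preserved.
    return " ".join(mask(t) for t in str(text).split(" "))

print(normalize_tweet("@NATO see https://t.co/abc for details"))
# @user see http for details
print(normalize_tweet("email me @ noon"))
# email me @ noon
```

A lone `@` is left untouched because of the `len(token) > 1` check, and running the function twice gives the same result as running it once.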

# **Twitter-roBERTa-base(2022)**

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
def sentiment_trb(text):
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    return config.id2label[ranking[0]]
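The post-processing in `sentiment_trb` (softmax over the logits, then picking the highest-scoring class) can be illustrated without downloading the model. In this sketch the logits are made-up numbers, the pure-Python `softmax` stands in for `scipy.special.softmax`, and the `id2label` order is an assumption; the real mapping comes from `config.id2label`.

```python
import math

def softmax(logits):
    """Numerically stable softmax; stand-in for scipy.special.softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label order for illustration only.
id2label = {0: "negative", 1: "neutral", 2: "positive"}

logits = [-1.2, 0.3, 2.1]   # made-up raw model outputs for one tweet
scores = softmax(logits)    # probabilities that sum to 1
# Index of the largest probability, equivalent to np.argsort(scores)[::-1][0].
best = max(range(len(scores)), key=lambda i: scores[i])

print(id2label[best])       # positive
```

Since softmax is monotonic, the predicted label is the same whether the argmax is taken over the logits or over the probabilities; the softmax only matters if the scores themselves are needed.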

# **XLM-T**

In [None]:
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

In [None]:
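def sentiment_xlmt(text):
    # Lowercase the label so it always matches the voting keys
    # ("positive", "negative", "neutral") used by the ensemble.
    return sentiment_task(text)[0]["label"].lower()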

# **VADER**

In [None]:
# Build the analyzer once; constructing it on every call reloads the lexicon.
vader_analyzer = SentimentIntensityAnalyzer()

def sentiment_vader(text):
    # The compound score is a normalized value in [-1, 1].
    compound = vader_analyzer.polarity_scores(text)['compound']
    if compound >= 0.05:
        overall_sentiment = "positive"
    elif compound <= -0.05:
        overall_sentiment = "negative"
    else:
        overall_sentiment = "neutral"
    return overall_sentiment

# **TextBlob**

In [None]:
def sentiment_texblob(text):
    classifier = TextBlob(text)
    polarity = classifier.sentiment.polarity
    if polarity >= 0.05:
        overall_sentiment = "positive"
    elif polarity <= -0.05:
        overall_sentiment = "negative"
    else:
        overall_sentiment = "neutral"
        overall_sentiment = "neutral"
    return overall_sentiment
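VADER's compound score and TextBlob's polarity both lie in [-1, 1], and both functions above apply the same ±0.05 cut-offs. The shared rule can be sketched as one helper; `polarity_to_label` and its `threshold` parameter are hypothetical names introduced here for illustration.

```python
def polarity_to_label(score, threshold=0.05):
    """Map a score in [-1, 1] to a sentiment label using symmetric cut-offs."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

for score in (0.6, 0.05, 0.0, -0.049, -0.05):
    print(score, polarity_to_label(score))
# 0.6 positive
# 0.05 positive
# 0.0 neutral
# -0.049 neutral
# -0.05 negative
```

The boundaries are inclusive, so a score of exactly 0.05 or -0.05 counts as positive or negative rather than neutral.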

# **Ensemble Method**

In [None]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import datetime
import os
from tqdm import tqdm
pd.options.mode.chained_assignment = None  # default='warn'

x = datetime.datetime(2022, 2, 24)  # 24 Feb 2022 (first day to label)
nowx = datetime.datetime(2022, 3, 20)  # 20 Mar 2022 (exclusive end date)

while x != nowx:
  month = x.strftime("%m")
  day = x.strftime("%d")
  filename = '/content/drive/MyDrive/FSTweets/fs-tweets-'+day+'-'+month+'-2022.csv'
  isFile = os.path.isfile(filename)
  if isFile:
        desfilename = '/content/drive/MyDrive/LabeledTweets/labeled-tweets-'+day+'-'+month+'-2022.csv'
        desFile = os.path.isfile(desfilename)
        if desFile:
            print(desfilename+" already exists...")
            x=x+datetime.timedelta(days=1)
            continue
        df = pd.read_csv(filename)

        labels = []
        limit = df.shape[0]
        pbar = tqdm(total=limit, position=0, leave=True)
        for idx in df.index:
            text = df['rawContent'][idx]
            text = preprocess(text)

            voting_cnt = {
                "positive": 0,
                "negative": 0,
                "neutral": 0,
            }
            
            vote = sentiment_xlmt(text)
            voting_cnt[vote] = voting_cnt[vote] + 1
            vote = sentiment_vader(text)
            voting_cnt[vote] = voting_cnt[vote] + 1
            vote = sentiment_texblob(text)
            voting_cnt[vote] = voting_cnt[vote] + 1

            mx_vote = 0

            for ct in voting_cnt.values():
                mx_vote = max(mx_vote,ct)

            mx_vote_ct=0

            for label in voting_cnt:
                if voting_cnt[label] == mx_vote:
                    mx_vote_ct = mx_vote_ct + 1

            if mx_vote_ct == 1:
                for label in voting_cnt:
                    if voting_cnt[label] == mx_vote:
                        final_label = label
                        break
            else:
                vote = sentiment_trb(text)
                voting_cnt[vote] = voting_cnt[vote] + 1

                mx_vote = 0

                for ct in voting_cnt.values():
                    mx_vote = max(mx_vote,ct)

                mx_vote_ct=0

                for label in voting_cnt:
                    if voting_cnt[label] == mx_vote:
                        mx_vote_ct = mx_vote_ct + 1

                if mx_vote_ct == 1:
                    for label in voting_cnt:
                        if voting_cnt[label] == mx_vote:
                            final_label = label
                            break
                else:
                    avg_prediction = 0
                    for label in voting_cnt:
                        if label == 'positive':
                            weight = 1
                        elif label == 'neutral':
                            weight = 0
                        else:
                            weight = -1
                        avg_prediction = avg_prediction + (weight * voting_cnt[label])

                    avg_prediction = avg_prediction / 4

                    if(avg_prediction > 0):
                        final_label = 'positive'
                    elif(avg_prediction < 0):
                        final_label = 'negative'
                    else:
                        final_label = 'neutral'

            labels.append(final_label)
            pbar.update(1)
            
        pbar.close()
        df['label'] = labels
        display(df)
        df.to_csv('/content/drive/MyDrive/LabeledTweets/labeled-tweets-'+day+'-'+month+'-2022.csv',index=False)
        print("Done: "+day+"/"+month)
  # Advance at loop level so a missing source file cannot stall the loop.
  x=x+datetime.timedelta(days=1)

/content/drive/MyDrive/LabeledTweets/labeled-tweets-24-02-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-25-02-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-26-02-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-27-02-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-28-02-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-01-03-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-02-03-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-03-03-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-04-03-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-05-03-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-06-03-2022.csv already exists...
/content/drive/MyDrive/LabeledTweets/labeled-tweets-07-03-2022.cs

  6%|▌         | 554/10000 [03:12<54:43,  2.88it/s]
100%|██████████| 10000/10000 [45:13<00:00,  3.69it/s]


Unnamed: 0,id,rawContent,replyCount,retweetCount,likeCount,quoteCount,hashtags,label
0,1505333150641233920,@WarOnTheRocks Thank you! Subscribed. You had ...,0,0,0,0,,negative
1,1505333150150447105,@BillAckman @Ukraine You continue to post gros...,0,0,0,0,,positive
2,1505333109423702016,"Russia: literally bombs a children’s hospital,...",0,0,1,0,,negative
3,1505333048237346822,"@RajaChemayel 2014 - sigh, from all your tweet...",0,0,0,0,,negative
4,1505333027370639360,"“It’s time to meet, it’s time to talk,” said P...",0,1,1,0,"['Zelenskiy', 'Russia', 'thefallofputin', 'get...",neutral
...,...,...,...,...,...,...,...,...
9995,1505113456797798402,@BovayNicolas @GeorgeSzamuely @TuckerCarlson w...,0,0,2,0,"['Russia', 'Ukraine']",negative
9996,1505113452985163776,Ukraine War update: Turkey &amp; Israel offers...,0,2,0,0,,neutral
9997,1505113444764340226,"“We are engaged in a conflict here, it’s a pro...",1,4,8,0,"['Russia', 'Ukraine']",negative
9998,1505113436757311490,Why has Russia invaded Ukraine? The conflict e...,8,1,7,2,,negative


Done: 19/03
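The day-by-day file iteration used by the labeling loop can also be written as a generator, which keeps the date arithmetic separate from the labeling work. This is a sketch under assumptions: `daily_paths` is a hypothetical helper and the template is a shortened relative path, not the Drive path used above.

```python
import datetime

def daily_paths(start, end, template):
    """Yield (date, path) for each day in the half-open range [start, end)."""
    day = start
    while day < end:
        # datetime supports strftime codes inside str.format specs.
        yield day, template.format(day=day)
        day += datetime.timedelta(days=1)

template = "fs-tweets-{day:%d-%m-%Y}.csv"  # hypothetical relative path
start = datetime.datetime(2022, 2, 24)
end = datetime.datetime(2022, 3, 20)
paths = [path for _, path in daily_paths(start, end, template)]

print(len(paths))   # 24
print(paths[0])     # fs-tweets-24-02-2022.csv
print(paths[-1])    # fs-tweets-19-03-2022.csv
```

Because the date is advanced in one place and the loop condition is `day < end`, skipping a missing or already-labeled file cannot leave the loop stuck on the same date.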
