<a href="https://colab.research.google.com/github/ceydab/NLP_Projects/blob/main/SentimentIntensityAnalysiswithTransformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning for Natural Language Processing: Sentiment Intensity with Transformers
This notebook shows how to use a pretrained transformer model to predict sentiment intensity for the emotion anger on an English tweet dataset.

The notebook follows the structure below:

- dataset import
- dataset preparation
- model setup
- evaluation using the Pearson correlation coefficient (r) and its p-value

In [None]:
!pip install simpletransformers
!pip install emoji



In [None]:
import pandas as pd
import emoji
import re
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import logging
import torch
from scipy.stats import pearsonr


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
path_train = "/content/EI-reg-En-anger-train.txt"
path_test = "/content/2018-EI-reg-En-anger-test-gold.txt"


train_df = pd.read_csv(path_train, sep="\t")
test_df = pd.read_csv(path_test, sep="\t")
print(train_df.head())
print(train_df['Affect Dimension'].unique())
print(test_df['Affect Dimension'].unique())

              ID                                              Tweet  \
0  2017-En-10264  @xandraaa5 @amayaallyn6 shut up hashtags are c...   
1  2017-En-10072  it makes me so fucking irate jesus. nobody is ...   
2  2017-En-11383         Lol Adam the Bull with his fake outrage...   
3  2017-En-11102  @THATSSHAWTYLO passed away early this morning ...   
4  2017-En-11506  @Kristiann1125 lol wow i was gonna say really?...   

  Affect Dimension  Intensity Score  
0            anger            0.562  
1            anger            0.750  
2            anger            0.417  
3            anger            0.354  
4            anger            0.438  
['anger']
['anger']


Since we already know that anger is the only affect dimension in both files, we can drop that column. We also drop the ID column, since it carries no information useful for prediction.

In [None]:
train_df = train_df.drop(['ID', 'Affect Dimension'], axis=1)
test_df = test_df.drop(['ID', 'Affect Dimension'], axis=1)
print(train_df.head())

                                               Tweet  Intensity Score
0  @xandraaa5 @amayaallyn6 shut up hashtags are c...            0.562
1  it makes me so fucking irate jesus. nobody is ...            0.750
2         Lol Adam the Bull with his fake outrage...            0.417
3  @THATSSHAWTYLO passed away early this morning ...            0.354
4  @Kristiann1125 lol wow i was gonna say really?...            0.438


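Next, we define a preprocessing function that applies the following steps in order:
- lowercasing
- number removal
- URL removal
- mention removal and `#` symbol stripping (the hashtag word itself is kept)
- non-alphabetic character removal (this also removes emoji)
- user ID removal (a second pass; mentions are already gone by this point)
- emoji conversion with `emoji.demojize` (has no effect here, since emoji were removed in the earlier step)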

In [None]:
def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'\@\w+|\#', '', text)  # Remove mentions; strip '#' but keep the hashtag word
    text = re.sub(r'[^A-Za-z\s]', '', text)  # Remove non-alphabetic characters (including emoji)
    text = re.sub(r"@\S+", '', text)  # Remove user IDs (no-op here: '@' was already removed above)
    text = emoji.demojize(text)  # Convert emoji to text names (no-op here: emoji were already removed)
    return text

In [None]:
text = "`:.j c)=xZO?F-`l;@Jk]8?xDm)b0vCQg6{6-g~qH9KX'3dA~Y.o%W~?gFfJ{)W~bJ;9^&:>M$^8U1A7r~)"
preprocess_text(text)

'j cxzoflxdmbvcqggqhkxdayowgffjwbjmuar'
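To see what each step contributes, the sketch below applies the same regular expressions one at a time to a tweet-like string. The sample text and the `steps` and `trace` names are made up for illustration and are not part of the dataset or the notebook's pipeline.

```python
import re

# Hypothetical sample, invented for illustration (not taken from the dataset).
# \U0001F621 is an angry-face emoji.
sample = "@user123 I'm SO mad!!! \U0001F621 #angry see https://t.co/abc 2day"

# The same substitutions as preprocess_text, in the same order.
steps = [
    ("lowercase", lambda t: t.lower()),
    ("remove digits", lambda t: re.sub(r'\d+', '', t)),
    ("remove URLs", lambda t: re.sub(r"http\S+|www\S+|https\S+", '', t)),
    ("remove mentions, strip '#'", lambda t: re.sub(r'\@\w+|\#', '', t)),
    ("remove non-alphabetic", lambda t: re.sub(r'[^A-Za-z\s]', '', t)),
]

def trace(text):
    """Apply each step in order and return (step name, intermediate text) pairs."""
    history = []
    for name, step in steps:
        text = step(text)
        history.append((name, text))
    return history

for name, text in trace(sample):
    print(f"{name:28s} {text!r}")
```

Two things are worth noticing in the trace: the word after `#` survives because only the symbol is stripped, and the emoji disappears at the non-alphabetic step, so nothing is left for a later `emoji.demojize` call to convert.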

In [None]:
train_df['Tweet'] = train_df['Tweet'].apply(preprocess_text)
test_df['Tweet'] = test_df['Tweet'].apply(preprocess_text)
print(train_df[:5])

Now that the data is preprocessed, we set up model training.

We follow the regression guide at https://simpletransformers.ai/docs/regression/

However, we use a different pretrained model: cardiffnlp/twitter-roberta-base-sentiment

In [None]:
# Define Pearson correlation coefficient function
def pearson_corr_coef(labels, preds):
    return pearsonr(labels, preds)[0]
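For intuition about what `pearsonr` returns as its first element, here is a minimal pure-Python sketch of the Pearson correlation coefficient: the covariance of the two series divided by the product of their standard deviations. The function name `pearson_r` and the toy numbers are assumptions for illustration; the notebook itself relies on `scipy.stats.pearsonr`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the root of the product of squared deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy gold scores and predictions (hypothetical values, not model output).
gold = [0.562, 0.750, 0.417, 0.354, 0.438]
pred = [0.50, 0.70, 0.45, 0.40, 0.41]
print(pearson_r(gold, pred))
```

Because the coefficient only measures linear association, a model whose predictions are consistently shifted or scaled relative to the gold scores can still reach a high r.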

In [None]:
# Set up logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Create a ClassificationModel
model_args = ClassificationArgs(num_train_epochs=5, manual_seed=42, regression=True, overwrite_output_dir=True)
model = ClassificationModel(
    "roberta",
    "cardiffnlp/twitter-roberta-base-sentiment",
    num_labels=1,
    args=model_args,
    ignore_mismatched_sizes=True
)
# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df, pearson_corr_coef=pearson_corr_coef)
print(result)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([1, 768]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([1]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'pearson_corr_coef': 0.8064287743001394, 'eval_loss': 0.012588972056453878}


In [None]:
predictions, raw_outputs = model.predict(test_df['Tweet'].tolist())
pearson_r, p_value = pearsonr(test_df['Intensity Score'], predictions)

print(f"Pearson-R: {pearson_r}", f"P-value: {p_value}")


Pearson-R: 0.8064287743197975 P-value: 2.1176488809515034e-230


We have obtained the following evaluation metrics. Both are computed on the test set, which is why the two coefficients agree:

Test set (eval_model): pearson_corr_coef: 0.8064287743001394, eval_loss: 0.012588972056453878

Test set (predict + pearsonr): Pearson-R: 0.8064287743197975, P-value: 2.1176488809515034e-230

The Pearson r is about 0.81, which indicates a strong positive linear correlation between the predicted and actual intensity scores. The p-value is extremely small (about 2.1e-230), suggesting that the correlation is statistically significant.
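As a rough aid for reading these numbers, the sketch below computes two quantities derived from r: the squared correlation, and the t-statistic that underlies the usual significance test for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The helper names and the sample size `n_hypothetical` are assumptions; the notebook does not print the size of the test set.

```python
import math

def r_squared(r):
    """Squared correlation: share of variance explained by a linear fit."""
    return r * r

def t_statistic(r, n):
    """t-statistic for testing r != 0 with n paired samples (n - 2 degrees of freedom)."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

r = 0.8064287743197975   # Pearson r reported above
n_hypothetical = 1000    # placeholder sample size, NOT taken from the notebook

print(f"r^2 = {r_squared(r):.3f}")
print(f"t   = {t_statistic(r, n_hypothetical):.1f} (for the assumed n)")
```

With r around 0.81, r squared is about 0.65, so a linear relationship accounts for roughly two thirds of the variance in the intensity scores. The t-statistic grows with the sample size, which is why a correlation this strong on a large test set yields such a small p-value.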

Note that the results could potentially be improved by trying other pretrained models, such as roberta-base or xlm-roberta, but since we find a Pearson r of about 0.81 satisfactory, we do not tune the model further.