<a href="https://colab.research.google.com/github/celelunar/Sentiment-Analysis-RoBERTa/blob/main/Sentiment%20Analysis%20Classification%20RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name: Diva Nabila Henryka
<br>
Student ID (NIM): 2501975620
<br>


---



You are a data scientist tasked with developing a sentiment
analysis system for a hospital in Indonesia. This system aims to discern emotions from
questionnaire responses. You have access to various datasets containing information such as
Text and sentiment labels.

### Preparation

#### Import libraries needed

In [None]:
!pip install transformers



In [None]:
import re
import numpy as np
import pandas as pd
import seaborn as sns
from tabulate import tabulate
import matplotlib.pyplot as plt

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from tensorflow.keras.metrics import CategoricalAccuracy, Precision, Recall, F1Score

from transformers import RobertaModel, TFAutoModel, AdamWeightDecay, RobertaTokenizer

import tensorflow as tf

#### Accessing the dataset through Google Drive

I store the dataset in my Google Drive to avoid the hassle of re-uploading it every time the runtime gets disconnected.
<br>
To access the dataset, my Google Drive has to be mounted first; the dataset can then be read using:

```
pd.read_csv("/path to file/file.csv")
```



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data  =  pd.read_csv('/content/drive/MyDrive/Final/Emotion.csv')
data.head()

| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


### A. Data Exploration, Cleaning, and Preprocessing

#### Familiarize with the dataset
The first step of data exploration is to familiarize ourselves with the dataset we'll be dealing with: its attributes (the columns, in this case), its shape, and the data type of each attribute. We can get all of this using:
```
.info()
```


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    20000 non-null  object
 1   label   20000 non-null  object
dtypes: object(2)
memory usage: 312.6+ KB


From the result above, we can see that the data has 2 columns, `text` and `label` (the sentiment label of the text), and 20000 entries. Thus the data shape is (20000, 2).

#### Checking for missing values
Checking for missing values is a crucial step in data exploration, since incomplete data can bias the model and/or reduce its accuracy. To check for missing values we can use either of the following (in pandas, `.isnull()` is an alias of `.isna()`, so both return the same result):
```
.isna()
.isnull()
```

In [None]:
print("Missing Values: ")
Na = data.isna().sum().sort_values(ascending = False)
Null = data.isnull().sum().sort_values(ascending = False)
missingData = pd.concat([Na, Null], axis = 1, keys = ['Total Na', 'Total Null'])
missingData.head()

Missing Values: 


| | Total Na | Total Null |
|---|---|---|
| text | 0 | 0 |
| label | 0 | 0 |


From the result above, it can be concluded that the data has no missing values.
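
To make the check above more concrete, here is a minimal sketch on a toy DataFrame (the `toy` frame and its values are made up for illustration and are not part of the real dataset). It shows how the per-column counts look when a value is actually missing, and that both methods give the same counts.

```python
import pandas as pd

# Hypothetical toy frame with one missing text value (not the real dataset)
toy = pd.DataFrame({"text": ["i feel fine", None], "label": ["joy", "sadness"]})

na_counts = toy.isna().sum()        # missing values per column
null_counts = toy.isnull().sum()    # same check through the alias

print(na_counts.to_dict())
print(na_counts.equals(null_counts))
```

Since both calls report the same numbers, computing only one of them would be enough in practice.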

#### Check for duplicates
Duplicated rows give the repeated samples extra weight and can bias the model, so it is important to check for and handle them. To check for duplicates, we can use:
```
.duplicated()
```

In [None]:
print("Before removing duplicated data:")
data.duplicated().sum()

Before removing duplicated data:


1

In [None]:
df_duplicate = data[data.duplicated(keep = False)]
df_duplicate

| | text | label |
|---|---|---|
| 4975 | i feel more adventurous willing to take risks ... | joy |
| 13846 | i feel more adventurous willing to take risks ... | joy |


Since the data turned out to have duplicated rows, we can resolve it by dropping one of the duplicated rows using:
```
.drop_duplicates()
```

In [None]:
data = data.drop_duplicates(keep = 'first')

print("After removing duplicated data:")
data.duplicated().sum()

After removing duplicated data:


0
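
The count of 1 and the two rows displayed earlier are not a contradiction: by default `.duplicated()` flags only the later copies, while `keep = False` flags every copy. A small sketch on a made-up frame (the `toy` data is an assumption for illustration) shows the difference:

```python
import pandas as pd

# Hypothetical toy frame in which row 2 repeats row 0
toy = pd.DataFrame({
    "text": ["i feel calm", "i feel lost", "i feel calm"],
    "label": ["joy", "sadness", "joy"],
})

later_copies = toy.duplicated().sum()            # keep='first' is the default
all_copies = toy.duplicated(keep=False).sum()    # flags both members of the pair
deduplicated = toy.drop_duplicates(keep="first")

print(later_copies, all_copies, len(deduplicated))
```

Dropping with `keep = 'first'` removes only the later copy, so exactly one of the two rows survives.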

#### Sentiment label modification
To make the sentiment labels usable by the model, we have to convert the categorical text values to numerical values. To do so we can use:
```
.apply(lambda x: [numerical value] if x == '[categorical text value]' else x)
```

In [None]:
data['label'].unique()

array(['sadness', 'anger', 'love', 'surprise', 'fear', 'joy'],
      dtype=object)

In [None]:
def modify_label(data):
  # Integer id for each emotion label; any other value is left unchanged
  label_to_id = {'sadness': 0, 'anger': 1, 'love': 2, 'surprise': 3, 'fear': 4, 'joy': 5}

  df = data.copy()
  df['label'] = df['label'].apply(lambda x: label_to_id.get(x, x))

  return df

In [None]:
def info_data(df, name):
  sadness = df[df.label == 0].shape[0]
  anger = df[df.label == 1].shape[0]
  love = df[df.label == 2].shape[0]
  surprise = df[df.label == 3].shape[0]
  fear = df[df.label == 4].shape[0]
  joy = df[df.label == 5].shape[0]

  amount = [
      ["sadness", sadness],
      ["anger", anger],
      ["love", love],
      ["surprise", surprise],
      ["fear", fear],
      ["joy", joy]
  ]

  print(name, "dataset info:")
  print("Shape: ", df.shape)
  print("Amount per category:")
  print(tabulate(amount, headers = ["Category", "Amount"], tablefmt = "psql"))

In [None]:
modified_data = modify_label(data)
info_data(modified_data, "Original")

Original dataset info:
Shape:  (19999, 2)
Amount per category:
+------------+----------+
| Category   |   Amount |
|------------+----------|
| sadness    |     5797 |
| anger      |     2709 |
| love       |     1641 |
| surprise   |      719 |
| fear       |     2373 |
| joy        |     6760 |
+------------+----------+


#### Text cleaning
Besides converting the categorical labels to numerical values, we also have to standardize the format of all text inputs. The function below lowercases the text, pads every punctuation mark and special character with spaces so that it becomes a separate token, and masks runs of two or more digits with `#` characters.

In [None]:
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…',
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─',
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞',
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

def text_cleaning(text):
  def clean_text(x):
    x = str(x)
    for punct in puncts:
      if punct in x:
        x = x.replace(punct, f' {punct} ')
    return x

  def clean_num(x):
    if bool(re.search(r'\d', x)):
      x = re.sub('[0-9]{5,}', '#####', x)
      x = re.sub('[0-9]{4}', '####', x)
      x = re.sub('[0-9]{3}', '###', x)
      x = re.sub('[0-9]{2}', '##', x)
    return x

  text = text.lower()
  text = clean_text(text)
  text = clean_num(text)

  return text

In [None]:
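def preprocess(df, text_col_name):
  # Clean only real strings so that missing values survive until fillna() below
  df[text_col_name] = df[text_col_name].apply(lambda x: text_cleaning(x) if isinstance(x, str) else x)
  df[text_col_name] = df[text_col_name].fillna("_##_")

  return df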
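
To see what the cleaning does to a single sentence, here is a reduced sketch of the same steps. `DEMO_PUNCTS` and `demo_text_cleaning` are hypothetical names, and the punctuation list is shortened on purpose; the logic otherwise follows `text_cleaning` above.

```python
import re

# Assumption: a short punctuation list for illustration; the notebook's list is much longer
DEMO_PUNCTS = [",", ".", "!", "?", "'"]

def demo_text_cleaning(text):
    text = text.lower()
    # Pad each punctuation mark with spaces so it becomes its own token
    for punct in DEMO_PUNCTS:
        text = text.replace(punct, f" {punct} ")
    # Mask digit runs, longest pattern first; single digits are left untouched
    text = re.sub(r"[0-9]{5,}", "#####", text)
    text = re.sub(r"[0-9]{4}", "####", text)
    text = re.sub(r"[0-9]{3}", "###", text)
    text = re.sub(r"[0-9]{2}", "##", text)
    return text

cleaned = demo_text_cleaning("I paid 1500 dollars, twice!")
print(cleaned.split())
```

Note that nothing is deleted: punctuation is kept as separate tokens, and the number is replaced by a mask of the same length.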

#### Tokenization with RoBERTa Tokenizer
Tokenization is an essential preprocessing step in NLP pipelines: it transforms unstructured text into numerical token IDs that can be fed into the RoBERTa-based neural network that will be built later on.

In [None]:
SEQ_LEN = 80

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")


In [None]:
def tokenize(df):
  input_id = []
  attention_mask = []

  for i, text in enumerate(df["text"]):
    token = tokenizer.encode_plus(text, max_length = SEQ_LEN, truncation = True, padding = 'max_length',  add_special_tokens = True, return_attention_mask = True,
                                  return_token_type_ids = False, return_tensors = 'tf')
    input_id.append(np.asarray(token["input_ids"]).reshape(SEQ_LEN,))
    attention_mask.append(np.asarray(token["attention_mask"]).reshape(SEQ_LEN,))

  return (np.asarray(input_id), np.asarray(attention_mask))
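
The effect of `truncation = True` and `padding = 'max_length'` can be pictured with a small pure-Python sketch. This is not the tokenizer's actual code: `pad_and_mask` is a hypothetical helper, the content token ids are made up, and the special ids (0 for `<s>`, 2 for `</s>`, 1 for `<pad>`) are the ones used by the `roberta-base` vocabulary.

```python
def pad_and_mask(token_ids, seq_len, bos_id=0, eos_id=2, pad_id=1):
    """Illustrative only: wrap, truncate, and pad a list of token ids."""
    # Leave room for the two special tokens, then wrap the content in <s> ... </s>
    content = token_ids[: seq_len - 2]
    ids = [bos_id] + content + [eos_id]
    mask = [1] * len(ids)
    # Pad up to seq_len; padded positions get attention mask 0
    n_pad = seq_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = pad_and_mask([100, 200, 300], seq_len=8)
print(ids)
print(mask)
```

Every sequence therefore ends up with exactly `SEQ_LEN` ids, and the attention mask tells the model which positions hold real tokens.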

#### One hot encoding for the sentiment labels
To make the response variable suitable for training with categorical cross-entropy, we have to perform one-hot encoding on the labels. One-hot encoding is a technique that represents a categorical variable as a binary vector with a single 1 at the index of its class. With only 6 classes the vectors stay small, but if the number of categories were much larger I'd suggest other techniques such as embeddings or other encoding schemes, since the one-hot vectors would then have a very high dimension.

In [None]:
def one_hot_encoding(df, num_classes = 6):
  # num_classes is fixed so that every split yields vectors of the same length,
  # even if a class happens to be missing from that split
  labels = []

  for index, row in df.iterrows():
    label = np.zeros(num_classes)
    label[row["label"]] = 1
    labels.append(label)

  return np.asarray(labels)
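
As a quick illustration of what the encoded labels look like, the same result can be obtained by indexing an identity matrix. This is only an alternative sketch with made-up label ids, not the function used in this notebook.

```python
import numpy as np

NUM_CLASSES = 6
label_ids = np.array([0, 5, 3])  # hypothetical ids: sadness, joy, surprise

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(NUM_CLASSES)[label_ids]
print(one_hot)
```

Each row contains a single 1, and its position is the integer label assigned earlier.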

### B. Data Splitting (Training, Validation, Testing)

#### Split the data
The next part is splitting the data into 3 parts: a 70% training set, a 15% validation set, and a 15% testing set. To do so we call the `train_test_split` function from the `sklearn` library twice: first to hold out 30% of the data, then to split that held-out portion in half. Because no `random_state` is set, the rows that land in each split differ between runs.
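
The sizes of the three sets follow from simple arithmetic on the 19999 rows left after deduplication. The sketch below assumes the held-out portion is rounded up to a whole number of rows, which is consistent with the shapes reported in this notebook.

```python
import math

n_samples = 19999                       # rows after removing the duplicate

n_temp = math.ceil(0.3 * n_samples)     # first split: 30% held out
n_train = n_samples - n_temp

n_test = math.ceil(0.5 * n_temp)        # second split: half of the held-out rows
n_val = n_temp - n_test

print(n_train, n_val, n_test)
```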

In [None]:
train, temp = train_test_split(modified_data, test_size = 0.3, shuffle = True)
val, test = train_test_split(temp, test_size = 0.5, shuffle = True)

In [None]:
info_data(train, "Training")
print("")
info_data(val, "Validation")
print("")
info_data(test, "Testing")

Training dataset info:
Shape:  (13999, 2)
Amount per category:
+------------+----------+
| Category   |   Amount |
|------------+----------|
| sadness    |     4144 |
| anger      |     1871 |
| love       |     1116 |
| surprise   |      511 |
| fear       |     1623 |
| joy        |     4734 |
+------------+----------+

Validation dataset info:
Shape:  (3000, 2)
Amount per category:
+------------+----------+
| Category   |   Amount |
|------------+----------|
| sadness    |      828 |
| anger      |      448 |
| love       |      259 |
| surprise   |      109 |
| fear       |      356 |
| joy        |     1000 |
+------------+----------+

Testing dataset info:
Shape:  (3000, 2)
Amount per category:
+------------+----------+
| Category   |   Amount |
|------------+----------|
| sadness    |      825 |
| anger      |      390 |
| love       |      266 |
| surprise   |       99 |
| fear       |      394 |
| joy        |     1026 |
+------------+----------+


#### Divide the data into X and Y
The next step is to separate each split into the independent variable X (the `text` column) and the response variable Y (the `label` column).

In [None]:
trainX = train[["text"]]
trainY = train[["label"]]

valX = val[["text"]]
valY = val[["label"]]

testX = test[["text"]]
testY = test[["label"]]

#### Preprocess the data
The last step of data preparation is to preprocess the training and validation data by calling the functions defined above. The testing data is prepared later, at prediction time.

In [None]:
trainX = preprocess(trainX.copy(), "text")
valX = preprocess(valX.copy(), "text")

In [None]:
train_input_id, train_attention_mask = tokenize(trainX)
val_input_id, val_attention_mask = tokenize(valX)

In [None]:
trainY = one_hot_encoding(trainY)
valY = one_hot_encoding(valY)

### C. RoBERTa Model
Since the last 2 digits of my student ID are 20 (2 + 0 = 2), I'll be building the RoBERTa model.

In [None]:
roberta = TFAutoModel.from_pretrained("roberta-base")


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'roberta.embeddings.position_ids', 'lm_head.dense.bias']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaModel were not initialized from the PyTorch model and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infe

#### Set callbacks for early stopping, saving, and reducing learning rate
Since the task asks us to train the model until it achieves satisfactory accuracy, we need callbacks that stop the training once the validation accuracy has not improved for 10 epochs (this can be changed as you please), save the best model so far, and reduce the learning rate when progress stalls. Also, because we want to maximize the validation accuracy, we use `monitor = 'val_accuracy'` and `mode = 'max'` in all three callbacks.
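
The patience rule can be pictured with a few lines of plain Python. This is a simplified sketch of the idea, not Keras's implementation; `early_stop_epoch` and the accuracy values in `demo_history` are made up for illustration.

```python
def early_stop_epoch(val_accuracies, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            # Improvement: remember it and reset the counter
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

demo_history = [0.50, 0.56, 0.55, 0.54, 0.53]   # hypothetical validation accuracies
print(early_stop_epoch(demo_history, patience=3))
print(early_stop_epoch(demo_history, patience=10))
```

With a patience of 3 the run stops after three epochs without a new best value; with the patience of 10 used here, this short history would not trigger a stop at all.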

In [None]:
earlystopping = EarlyStopping(monitor = 'val_accuracy', mode = 'max', verbose = 1, patience = 10)
checkpointer = ModelCheckpoint(monitor = 'val_accuracy', mode = 'max', filepath = "model.h5", verbose = 1, save_best_only = True)
reduce_lr = ReduceLROnPlateau(monitor = 'val_accuracy', mode = 'max', verbose = 1, patience = 5, min_lr = 0.00001, factor = 0.2)
callbacks = [checkpointer, earlystopping, reduce_lr]
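
To get a feeling for `factor = 0.2` and `min_lr = 0.00001`, the sketch below lists the learning rates that successive reductions would produce, starting from the 2e-3 passed to the optimizer later in this notebook. It only reproduces the arithmetic (multiply by the factor, never go below the minimum), not the callback itself.

```python
initial_lr = 2e-3   # learning rate given to the optimizer in this notebook
factor = 0.2
min_lr = 1e-5

lr = initial_lr
schedule = [lr]
while lr > min_lr:
    # Each plateau multiplies the rate by `factor`, clipped at `min_lr`
    lr = max(lr * factor, min_lr)
    schedule.append(lr)

print(schedule)
```

Under these settings the floor is reached after four reductions, so later plateaus would no longer change the learning rate.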

In [None]:
input_id = tf.keras.layers.Input(shape = (SEQ_LEN,), name = 'input_ids', dtype = 'int32')
mask = tf.keras.layers.Input(shape = (SEQ_LEN,), name = 'attention_mask', dtype = 'int32')

#### Build the model
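To build the model, we call the `roberta` model loaded above and take the hidden state of the first token (`<s>`, RoBERTa's counterpart of `[CLS]`) as the sentence representation. To enhance the quality of the model I also added other layers such as `BatchNormalization()`, `Dense()`, `Activation()`, and `Dropout()`. The RoBERTa layer itself is frozen (`trainable = False`), so only these added layers are trained.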
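
The slicing `embeddings[:, 0, :]` in the next cell is easiest to understand through shapes. Here is a minimal NumPy sketch with a random stand-in array (not real RoBERTa output), assuming a batch of 2 sequences, the sequence length of 80, and the hidden size of 768 used by `roberta-base`.

```python
import numpy as np

batch_size, seq_len, hidden = 2, 80, 768

# Stand-in for the encoder's last hidden state (random values, not real model output)
last_hidden_state = np.random.rand(batch_size, seq_len, hidden)

# Keep only position 0 of every sequence: one 768-dimensional vector per text
first_token = last_hidden_state[:, 0, :]
print(last_hidden_state.shape, first_token.shape)
```

The classification head therefore receives one vector per text, regardless of how many real tokens the text contains.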

In [None]:
embeddings = roberta(input_id, attention_mask = mask)[0]
roberta_output = embeddings[:, 0, :]

X = tf.keras.layers.BatchNormalization()(roberta_output)
X = tf.keras.layers.Dense(768)(X)
X = tf.keras.layers.Activation("relu")(X)
X = tf.keras.layers.Dense(768)(X)
X = tf.keras.layers.Dropout(0.1)(X)
y = tf.keras.layers.Dense(6, activation='softmax', name='outputs')(X)

model = tf.keras.Model(inputs = [input_id, mask], outputs = y)
model.layers[2].trainable = False  # freeze the pretrained RoBERTa layer; only the new head is trained

optimizer = AdamWeightDecay(2e-03, beta_1 = 0.8, beta_2 = 0.9, weight_decay_rate = 0.0001)
loss = tf.keras.losses.CategoricalCrossentropy()

metrics = [
    CategoricalAccuracy(name = 'accuracy'),
    Precision(name = 'precision'),
    Recall(name = 'recall'),
    F1Score(name = 'f1_score')
]

model.compile(optimizer = optimizer, loss = loss, metrics = metrics)

model.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, 80)]                 0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, 80)]                 0         []                            
 )                                                                                                
                                                                                                  
 tf_roberta_model (TFRobert  TFBaseModelOutputWithPooli   1246456   ['input_ids[0][0]',           
 aModel)                     ngAndCrossAttentions(last_   32         'attention_mask[0][0]']      
                             hidden_state=(None, 80, 76                                     

#### Train the model
After building the model, we train it on the training set and use the validation set to monitor its performance after every epoch. I'll be using up to 100 epochs, a batch size of 32, and of course the callbacks defined above to stop the training if the validation accuracy has not improved in 10 epochs.
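
As a rough sense of scale, the number of weight updates per epoch follows from the training set size and the batch size. The sketch below is plain arithmetic on the numbers used in this notebook; the last batch of each epoch is simply smaller than the others.

```python
import math

n_train = 13999      # rows in the training split
batch_size = 32
max_epochs = 100

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
max_steps = steps_per_epoch * max_epochs   # upper bound if early stopping never triggers

print(steps_per_epoch, last_batch_size, max_steps)
```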

In [None]:
history = model.fit((train_input_id, train_attention_mask), trainY, validation_data = ((val_input_id,val_attention_mask), valY), epochs = 100, batch_size = 32, callbacks = callbacks)

Epoch 1/100
Epoch 1: val_accuracy improved from -inf to 0.50667, saving model to model.h5
Epoch 2/100
Epoch 2: val_accuracy improved from 0.50667 to 0.56100, saving model to model.h5
Epoch 3/100
Epoch 3: val_accuracy improved from 0.56100 to 0.56800, saving model to model.h5
Epoch 4/100
Epoch 4: val_accuracy improved from 0.56800 to 0.57400, saving model to model.h5
Epoch 5/100
Epoch 5: val_accuracy improved from 0.57400 to 0.58667, saving model to model.h5
Epoch 6/100
Epoch 6: val_accuracy did not improve from 0.58667
Epoch 7/100
Epoch 7: val_accuracy did not improve from 0.58667
Epoch 8/100
Epoch 8: val_accuracy did not improve from 0.58667
Epoch 9/100
Epoch 9: val_accuracy improved from 0.58667 to 0.60067, saving model to model.h5
Epoch 10/100
Epoch 10: val_accuracy did not improve from 0.60067
Epoch 11/100
Epoch 11: val_accuracy did not improve from 0.60067
Epoch 12/100
Epoch 12: val_accuracy did not improve from 0.60067
Epoch 13/100
Epoch 13: val_accuracy did not improve from 0.60

The model stops at epoch 32 with a training accuracy of 63.28% and a validation accuracy of 60.33%. These percentages are "good enough", but if we want to improve them we can modify the architecture above by adding or removing layers, or by changing the hyperparameters.

### D. Performance Analysis

#### Prepare the testing data
To make predictions with the trained model, we first have to prepare the testing data using the same method as we used for the training and validation sets.

In [None]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

In [None]:
def testing_preparation(text):
  text = text_cleaning(text)

  token = tokenizer.encode_plus(text, max_length = SEQ_LEN,
                                   truncation = True, padding = 'max_length',
                                   add_special_tokens = True, return_token_type_ids = False,
                                   return_tensors = 'tf')

  # Cast to int32 to match the dtype of the model's Input layers
  return {'input_ids': tf.cast(token['input_ids'], tf.int32),
            'attention_mask': tf.cast(token['attention_mask'], tf.int32)}

#### Predict the testing data
To start the prediction we can use:
```
[model name].predict([testing set])
```

In [None]:
def prediction(text):
    testing_data = testing_preparation(text)
    predictions = model.predict(testing_data)[0]
    return (np.argmax(predictions))

In [None]:
predict = np.asarray(testX["text"].apply(lambda x: prediction(x)))
actual = np.asarray(testY["label"])
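
Calling `model.predict` once per row works, but Keras can also predict a whole batch in a single call, which returns one row of 6 probabilities per text. The sketch below shows only the post-processing step on such a matrix; the probability values are made up, and `ID_TO_LABEL` is a hypothetical helper that mirrors the label ids assigned earlier.

```python
import numpy as np

# Index i holds the label name that was mapped to integer i
ID_TO_LABEL = ["sadness", "anger", "love", "surprise", "fear", "joy"]

# Hypothetical softmax outputs for two texts (each row sums to 1)
probs = np.array([
    [0.05, 0.05, 0.10, 0.05, 0.05, 0.70],
    [0.60, 0.10, 0.05, 0.05, 0.15, 0.05],
])

pred_ids = np.argmax(probs, axis=1)               # most likely class per row
pred_labels = [ID_TO_LABEL[i] for i in pred_ids]  # back to readable names
print(pred_ids, pred_labels)
```

Taking the argmax along `axis = 1` handles all rows at once, so the per-text `apply` is not needed for this step.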



#### Analyze the performance
I'll be using the `classification_report` function from the `sklearn` library to print out the model's performance on the testing data.

In [None]:
report = classification_report(actual, predict, digits = 4, output_dict = False, target_names = ["sadness", "anger", "love", "surprise", "fear", "joy"],)
print(report)

              precision    recall  f1-score   support

     sadness     0.5355    0.8412    0.6544       825
       anger     0.6605    0.3641    0.4694       390
        love     0.6000    0.2030    0.3034       266
    surprise     0.6154    0.1616    0.2560        99
        fear     0.6801    0.4695    0.5556       394
         joy     0.7084    0.7602    0.7334      1026

    accuracy                         0.6237      3000
   macro avg     0.6333    0.4666    0.4954      3000
weighted avg     0.6382    0.6237    0.6001      3000



Accuracy:
- The overall accuracy of the model is 62.37%, which means that the model's prediction is correct for around 62% of the testing data.
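
The accuracy can be cross-checked from the per-class recall and support in the report, since recall times support gives the number of correctly predicted texts in each class. The sketch assumes the reported recalls are rounded to four decimals, so the products are rounded back to whole counts.

```python
support = {"sadness": 825, "anger": 390, "love": 266, "surprise": 99, "fear": 394, "joy": 1026}
recall = {"sadness": 0.8412, "anger": 0.3641, "love": 0.2030, "surprise": 0.1616, "fear": 0.4695, "joy": 0.7602}

# Correct predictions per class, recovered from recall * support
correct = {label: round(recall[label] * support[label]) for label in support}
accuracy = sum(correct.values()) / sum(support.values())

print(correct)
print(round(accuracy, 4))
```

The recovered counts also make the imbalance visible: most of the correct predictions come from 'sadness' and 'joy', the two largest classes.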

<br>

Precision:
- Out of all texts that the model predicted as 'sadness', 53.55% are actually 'sadness'.
- Out of all texts that the model predicted as 'anger', 66.05% are actually 'anger'.
- Out of all texts that the model predicted as 'love', 60.00% are actually 'love'.
- Out of all texts that the model predicted as 'surprise', 61.54% are actually 'surprise'.
- Out of all texts that the model predicted as 'fear', 68.01% are actually 'fear'.
- Out of all texts that the model predicted as 'joy', 70.84% are actually 'joy'.

<br>

Recall:
- Out of all texts that are actually 'sadness', the model predicted 84.12% correctly.
- Out of all texts that are actually 'anger', the model predicted 36.41% correctly.
- Out of all texts that are actually 'love', the model predicted 20.30% correctly.
- Out of all texts that are actually 'surprise', the model predicted 16.16% correctly.
- Out of all texts that are actually 'fear', the model predicted 46.95% correctly.
- Out of all texts that are actually 'joy', the model predicted 76.02% correctly.

<br>

F1 Score:
- Out of all categories, the model is better at predicting 'sadness', 'fear', and 'joy', as they have F1 scores greater than 0.5.
- The model does best at predicting 'joy', which has the largest F1 score of all categories at 0.73.
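
For reference, the F1 score is the harmonic mean of precision and recall, and the two average rows of the report are a plain mean (macro) and a support-weighted mean (weighted) of the per-class scores. The sketch below recomputes them from the numbers in the report; because the inputs are already rounded to four decimals, the recomputed F1 values agree with the reported ones only up to rounding.

```python
labels = ["sadness", "anger", "love", "surprise", "fear", "joy"]
precision = dict(zip(labels, [0.5355, 0.6605, 0.6000, 0.6154, 0.6801, 0.7084]))
recall = dict(zip(labels, [0.8412, 0.3641, 0.2030, 0.1616, 0.4695, 0.7602]))
reported_f1 = dict(zip(labels, [0.6544, 0.4694, 0.3034, 0.2560, 0.5556, 0.7334]))
support = dict(zip(labels, [825, 390, 266, 99, 394, 1026]))

# F1 = 2 * P * R / (P + R), computed per class
recomputed_f1 = {k: 2 * precision[k] * recall[k] / (precision[k] + recall[k]) for k in labels}

# Macro average: every class counts equally
macro_f1 = sum(reported_f1.values()) / len(labels)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(reported_f1[k] * support[k] for k in labels) / sum(support.values())

print(round(macro_f1, 4), round(weighted_f1, 4))
```

The macro average is noticeably lower than the weighted one because the small classes ('love' and 'surprise') have the weakest F1 scores but count just as much as the large classes in the macro mean.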