# Main Notebook: NLP Series Workshop 2: Diving Deeper into Sentiment Analysis Techniques

TODO:
- include graphic for pipeline
- visuals for everything
- finish the entire notebook
- remove dropout, embedding (all the complicated stuff)
- better explanations
  - Vincent: I explained everything up to the modeling part.
- need an evaluation section

Credit to this wonderful notebook: https://www.kaggle.com/code/dorgavra/emotion-classification-nlp/notebook

<span style="color:red">__DISCLAIMER__</span> : This dataset contains hateful speech and explicit content. 

Conventions used:

❗ - Required <br>
❓ - Question <br>
🛑 - Stop and Think

# 1. Setup

The dataset we'll use can be found here: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp

In [None]:
import gdown
!mkdir emotion-sentiment
%cd emotion-sentiment
gdown.download('https://drive.google.com/uc?export=download&id=1qkkiX1X5udWEUxbBBLKPH-xMP_upG0QX')
!unzip -q archive.zip
!rm archive.zip

/content/emotion-sentiment


Downloading...
From: https://drive.google.com/uc?export=download&id=1qkkiX1X5udWEUxbBBLKPH-xMP_upG0QX
To: /content/emotion-sentiment/archive.zip
100%|██████████| 738k/738k [00:00<00:00, 78.2MB/s]


🛑: Stop and take a look at the data (txt files)! 

Don't know how to check the data?

Click on that small folder icon on the left.
Then click the `emotion-sentiment` folder and just double-click any of the text files!

🛑: What do you notice?

It's just a bunch of sentences! Whoever published this Kaggle dataset split the data into 3 parts: train (for training our model), validation (for checking how well our model performs while we tune it), and test (for seeing how our model performs on data it has never seen). Each row in a `.txt` file is one sentence, followed by a semicolon and the emotion it expresses.
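To make the row layout concrete, here is a minimal sketch that splits a couple of `sentence;emotion` rows using only the standard library. The two sample rows are copied from the training data; the helper name `parse_rows` is ours, not part of the notebook, which uses `pd.read_csv` with `sep=";"` for the same job.

```python
import csv
import io

# Two sample rows in the "sentence;emotion" layout used by the .txt files.
sample = "i didnt feel humiliated;sadness\ni am feeling grouchy;anger\n"

def parse_rows(text):
    """Split each 'sentence;emotion' line into a (comment, emotion) pair."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [(row[0], row[1]) for row in reader if row]

rows = parse_rows(sample)
for comment, emotion in rows:
    print(f"{emotion:>8} | {comment}")
```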

In [None]:
import re
import nltk
import numpy as np
import pandas as pd

from nltk.stem import PorterStemmer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
# Import everything from tensorflow.keras so we don't mix two Keras installations.
from tensorflow.keras import backend as K
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding, Flatten, Dropout

In [None]:
test_data = pd.read_csv("/content/emotion-sentiment/test.txt", header=None, sep=";", names=["Comment","Emotion"], encoding="utf-8")
train_data = pd.read_csv("/content/emotion-sentiment/train.txt", header=None, sep=";", names=["Comment","Emotion"], encoding="utf-8")
validation_data = pd.read_csv("/content/emotion-sentiment/val.txt", header=None, sep=";", names=["Comment","Emotion"], encoding="utf-8")

print("Train : ", train_data.shape)
print("Test : ", test_data.shape)
print("Validation : ", validation_data.shape)

Train :  (16000, 2)
Test :  (2000, 2)
Validation :  (2000, 2)


Here we download our data, import the relevant libraries, and load the semicolon-separated `.txt` files into tables.

Let's take a quick look at the train data again just for a refresher!

In [None]:
train_data

| | Comment | Emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |
| ... | ... | ... |
| 15995 | i just had a very brief time in the beanbag an... | sadness |
| 15996 | i am now turning and i feel pathetic that i am... | sadness |
| 15997 | i feel strong and good overall | joy |
| 15998 | i feel like this was such a rude comment and i... | anger |


# Preprocessing

Let's get the length of each tweet (in characters) and put it in a new `length` column of the `train_data` table.

In [None]:
train_data['length'] = [len(x) for x in train_data['Comment']]

🛑: Stop and take a look at the `train_data` table now.

In [None]:
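lb = LabelEncoder()
# Fit the encoder once, on the training labels only.
train_data['Emotion'] = lb.fit_transform(train_data['Emotion'])
# Reuse the same mapping for the other splits so each number means the same emotion.
test_data['Emotion'] = lb.transform(test_data['Emotion'])
validation_data['Emotion'] = lb.transform(validation_data['Emotion'])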

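Let's take a look at the data!

What we basically did was map each emotion to a number. `LabelEncoder` numbers the emotion names in alphabetical order, so in the table below anger is 0, joy is 2, love is 3, and sadness is 4.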

In [None]:
train_data

| | Comment | Emotion | length |
|---|---|---|---|
| 0 | i didnt feel humiliated | 4 | 23 |
| 1 | i can go from feeling so hopeless to so damned... | 4 | 108 |
| 2 | im grabbing a minute to post i feel greedy wrong | 0 | 48 |
| 3 | i am ever feeling nostalgic about the fireplac... | 3 | 92 |
| 4 | i am feeling grouchy | 0 | 20 |
| ... | ... | ... | ... |
| 15995 | i just had a very brief time in the beanbag an... | 4 | 101 |
| 15996 | i am now turning and i feel pathetic that i am... | 4 | 102 |
| 15997 | i feel strong and good overall | 2 | 30 |
| 15998 | i feel like this was such a rude comment and i... | 0 | 59 |
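The integer codes in the table above can be reproduced with a small plain-Python sketch of what a label encoder does: sort the distinct labels, then number them. The list of six emotion names is our assumption about this dataset (in the notebook, `lb.classes_` shows the actual fitted classes), and `fit_label_map` is a hypothetical helper, not a scikit-learn function.

```python
# Assumed label set for this dataset; check lb.classes_ in the notebook to confirm.
emotions = ["sadness", "anger", "love", "surprise", "fear", "joy"]

def fit_label_map(labels):
    """Assign each distinct label an integer, in alphabetical order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

label_map = fit_label_map(emotions)
print(label_map)

# Reuse the SAME mapping for every split, so a code always means the same emotion.
first_rows = ["sadness", "sadness", "anger", "love", "anger"]
encoded = [label_map[e] for e in first_rows]
print(encoded)
```

Building the mapping once and reusing it is the important design choice: if each split built its own mapping, a split that happened to be missing one emotion would shift every later code by one.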


English has lots of words that we say all the time but that carry little meaning on their own (ouch), such as "the", "is", and "and". These are called __stopwords__. They tell a model like an RNN very little about the emotion of a tweet, so we will use the `nltk` package to get a list of stopwords for later use.

In [None]:
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


🛑: Take a look at `stopwords`. What is it? What do these words mean?

In [None]:
max_len=train_data['length'].max()
print("Longest tweet length: ", max_len)

Longest tweet length:  300


In [None]:
vocab_size = 11000  # We decided on 11000 for how many words we will have in our vocabulary.

The `text_cleaning` function below is fairly complex, and we won't cover every little detail. Basically, it goes through every single tweet and "cleans" it. 

🛑: What do you think "cleaning" a tweet means? 

It means removing special characters, converting everything to lowercase, dropping stopwords, and stemming (trimming each word down to its root form). The function then turns each word into a number and pads every tweet to the same length. In the end, we are just trying to convert the data (both the tweets and the labels) into a format acceptable for training the model. This is the essence of __preprocessing__.
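To see the cleaning steps on a single sentence, here is a minimal sketch that is separate from the notebook's `text_cleaning`. It uses a tiny, made-up stopword set (`DEMO_STOPWORDS`) instead of NLTK's full list, and it skips stemming, hashing, and padding.

```python
import re

# Hypothetical mini stopword list for illustration; the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"i", "am", "a", "to", "the", "so", "from"}

def clean_sentence(text, stopwords=DEMO_STOPWORDS):
    """Keep letters only, lowercase, split into words, and drop stopwords."""
    text = re.sub("[^a-zA-Z]", " ", text)  # replace anything that is not a letter with a space
    words = text.lower().split()           # lowercase, then split on whitespace
    return [w for w in words if w not in stopwords]

cleaned = clean_sentence("I am feeling GROUCHY!!! :(")
print(cleaned)  # ['feeling', 'grouchy']
```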

In [None]:
def text_cleaning(df, column):
    """Clean each text, stem it, hash words to integer IDs, and pad to max_len."""
    stemmer = PorterStemmer()
    corpus = []

    for text in df[column]:
        text = re.sub("[^a-zA-Z]", " ", text)  # keep letters only
        text = text.lower()
        words = text.split()
        # Drop stopwords, then reduce each remaining word to its stem.
        words = [stemmer.stem(word) for word in words if word not in stopwords]
        corpus.append(" ".join(words))
    # Despite its name, one_hot hashes each word to an integer in [1, vocab_size).
    encoded = [one_hot(input_text=sentence, n=vocab_size) for sentence in corpus]
    # Left-pad with zeros so every row has exactly max_len entries.
    padded = pad_sequences(sequences=encoded, maxlen=max_len, padding='pre')
    print(padded.shape)
    return padded

Here we run the above function on all the tweets in train, test, and val.

In [None]:
x_train = text_cleaning(train_data, "Comment")
x_test = text_cleaning(test_data, "Comment")
x_val = text_cleaning(validation_data, "Comment")

(16000, 300)
(2000, 300)
(2000, 300)


Notice how our `x_train` is just a matrix now. Each row is a tweet, stored as a vector of 300 numbers: one number for each word left in the tweet after cleaning, with zeros filling the remaining positions at the front. Our `x_train` has 16000 tweets, and `x_test` and `x_val` have 2000 tweets each.

🛑: How is it that all tweets are the same length? 

Our function chose a length (300 in this case), padded every tweet shorter than 300 words with zeros, and would cut off any tweet longer than 300 words. The 300 comes from `max_len`, the longest training tweet measured in characters, so no training tweet is actually long enough to be cut. The same padding is applied to `x_val` and `x_test`.
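The pad-or-cut behaviour can be sketched in plain Python. This assumes the same settings the notebook relies on (`padding='pre'`, with truncation also taken from the front, which we believe is the `pad_sequences` default); `pad_pre` is a hypothetical helper written for this illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([8862, 10669, 9104], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 8862, 10669, 9104]
print(long)   # [3, 4, 5, 6, 7]
```

Padding at the front means the real words sit at the end of each row, so they are the last thing the RNN reads before it produces its output.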

In [None]:
x_train

array([[    0,     0,     0, ...,  8862, 10669,  9104],
       [    0,     0,     0, ...,  4452,  7059,  1579],
       [    0,     0,     0, ..., 10669,   681,  3611],
       ...,
       [    0,     0,     0, ...,  9396,  9049,  5322],
       [    0,     0,     0, ...,  1398,  1738,  9796],
       [    0,     0,     0, ..., 10669,  5424,  6161]], dtype=int32)
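The non-zero numbers in the array above are word IDs. As far as we understand it, `one_hot` does not build one-hot vectors; it hashes each word to an integer between 1 and `vocab_size - 1`, leaving 0 free for padding. The sketch below illustrates that hashing idea with `hashlib.md5` so the result is repeatable. The choice of hash is our assumption for the demo and not necessarily what Keras uses, so the IDs will not match the array.

```python
import hashlib

def word_to_index(word, n):
    """Map a word to a stable integer in [1, n-1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode_words(words, n=11000):
    """Hash every word in a cleaned tweet to its integer ID."""
    return [word_to_index(w, n) for w in words]

ids = encode_words(["feel", "grouchi", "feel"])
print(ids)  # the first and last IDs are equal because the word is the same
```

Because many words are squeezed into a fixed range, two different words can occasionally land on the same ID. A larger `vocab_size` makes such collisions rarer.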

In [None]:
y_train = train_data["Emotion"]
y_test = test_data["Emotion"]
y_val = validation_data["Emotion"]

In [None]:
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
y_val = to_categorical(y_val)

Our labels for train, test, and val are now in a different format. Notice again that train has 16000 labels (one label for each tweet), and test and val have 2000 labels each. 

What's special is that each label is now a vector of 6 values, one per emotion. So the label for the first train tweet is a vector of 6 numbers. 

In [None]:
print(y_train.shape)
print(y_test.shape)
print(y_val.shape)

(16000, 6)
(2000, 6)
(2000, 6)


Below you can take a look at what `y_train` looks like (`y_test` and `y_val` have the same format, just with a different number of rows).

We did something called __one-hot encoding__. Simply put, every tweet in the dataset falls into one of 6 emotions, and each emotion gets its own position in the vector. The first row in `y_train` has a 1 at index 4 (counting from 0) and zeros everywhere else. That means the first tweet in the train dataset has the emotion encoded as 4, which is sadness. 

In [None]:
y_train

array([[0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.]], dtype=float32)
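The one-hot encoding shown above can be sketched in a few lines of plain Python. The helper `to_one_hot` is ours and only mimics the idea behind `to_categorical`; the input codes 4, 4, 0 are the first three training labels.

```python
def to_one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

encoded = [to_one_hot(y, num_classes=6) for y in [4, 4, 0]]
for row in encoded:
    print(row)
```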

### Why are we using RNNs?
RNNs are useful in NLP because they read a tweet as an ordered sequence of words rather than as a bag of individual words.

i.e. We can use an RNN model to take word order into account when reading the sentences in the tweets!

RNNs often achieve better accuracy than an MLP on text because, at each word, they consider both the current input and what they have learned from the words that came before it.

# Modeling

## ❓What is an RNN?

A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.
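Here is a toy sketch of that recurrence with a single number as the hidden state. The update rule `h = tanh(w_x * x + w_h * h_prev + b)` is the standard simple-RNN step; the weights `0.5` and `0.8` are made-up values for illustration, whereas a real layer learns them and works with vectors and matrices.

```python
import math

def simple_rnn_scalar(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run h_t = tanh(w_x * x_t + w_h * h_prev + b) over a sequence; return all states."""
    h = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states_a = simple_rnn_scalar([1.0, 0.0, 0.0])
states_b = simple_rnn_scalar([0.0, 0.0, 1.0])
print(states_a)
print(states_b)
```

Both sequences contain the same values, yet their final states differ because the order is different. That sensitivity to order is what lets an RNN treat a tweet as a sentence rather than a pile of words.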

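The following code creates the model, adds layers to it, and compiles it so it can be trained.

In [None]:
model = Sequential()
# Turn each word ID into a 150-dimensional vector.
model.add(Embedding(input_dim=vocab_size, output_dim=150, input_length=300))
model.add(Dropout(0.2))
# Read the sequence one step at a time, carrying a 128-unit hidden state.
model.add(SimpleRNN(128))
model.add(Dropout(0.2))
model.add(Dense(64,activation='sigmoid'))
model.add(Dropout(0.2))
# One output per emotion; softmax turns the scores into probabilities.
model.add(Dense(6,activation='softmax'))
# Assumed training settings: categorical cross-entropy suits one-hot labels, and the
# accuracy metric matches the numbers reported below. The Adam optimizer is our assumption.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])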

❓ What does the dropout layer do?
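As a hint, here is a rough plain-Python sketch of the usual "inverted dropout" idea: during training, each value is zeroed with probability `rate` and the survivors are scaled up by `1 / (1 - rate)`. At prediction time nothing is dropped. The function below is our own illustration, not the Keras implementation.

```python
import random

def dropout(values, rate, training=True, rng=None):
    """Inverted dropout: zero each value with probability `rate`, scale the rest by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

activations = [1.0, 2.0, 3.0, 4.0, 5.0]
train_out = dropout(activations, rate=0.2, rng=random.Random(0))
infer_out = dropout(activations, rate=0.2, training=False)
print(train_out)  # some entries may be 0.0; the rest are scaled by 1.25
print(infer_out)  # unchanged
```

Randomly silencing units means the model cannot lean too heavily on any single one, which tends to reduce overfitting.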

# Training

Now we will train the model.

In [None]:
hist = model.fit(x_train,y_train,epochs=10,batch_size=64,
                 validation_data=(x_val,y_val), verbose=1)

In [None]:
model.evaluate(x_val,y_val,verbose=1)



[0.550441324710846, 0.843999981880188]

In [None]:
model.evaluate(x_test,y_test,verbose=1)



[0.5513563752174377, 0.8330000042915344]

`model.evaluate` returns the loss followed by the accuracy, so our model is about 84% accurate on the validation data and 83% accurate on the test data.
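To see where an accuracy number comes from, here is a minimal sketch: take the most likely class from each softmax output, compare it with the true class from the one-hot label, and count the matches. The probabilities below are made-up numbers for three tweets, not results from our model, and the helper names are ours.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(probabilities, one_hot_labels):
    """Fraction of rows where the most likely class equals the true class."""
    correct = sum(argmax(p) == argmax(y) for p, y in zip(probabilities, one_hot_labels))
    return correct / len(one_hot_labels)

# Hypothetical softmax outputs for three tweets (illustrative values only).
probs = [
    [0.05, 0.05, 0.10, 0.05, 0.70, 0.05],
    [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.50, 0.20, 0.05, 0.05],
]
labels = [
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
acc = accuracy(probs, labels)
print(acc)  # 2 of 3 predictions match
```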

# Evaluation

How did the RNN model perform compared to how an MLP classifier would perform?