# Main Notebook: NLP Series Workshop 2: Diving Deeper into Sentiment Analysis Techniques

TODO:
- include graphic for pipeline
- visuals for everything
- finish the entire notebook
- remove dropout, embedding (all the complicated stuff)
- better explanations
  - Vincent: I explained everything up to the modeling part.
- need an evaluation section

Credit to this wonderful notebook: https://www.kaggle.com/code/isidronavarrooporto/hate-speech-tweet-classification

<span style="color:red">__DISCLAIMER__</span> : This dataset contains hateful speech and explicit content. 

Conventions used:

❗ - Required <br>
❓ - Question

# 1. Setup

The dataset we'll use can be found here: https://www.kaggle.com/datasets/arkhoshghalb/twitter-sentiment-analysis-hatred-speech

In [1]:
import gdown
!mkdir twitter-sentiment
%cd twitter-sentiment
gdown.download('https://drive.google.com/uc?export=download&id=1tMrkYFAuzjCWjhDCJRGqVNLd4j0XrlVK')
!unzip -q twitter-sentiment-analysis-hatred-speech.zip
!rm twitter-sentiment-analysis-hatred-speech.zip

/content/twitter-sentiment


Downloading...
From: https://drive.google.com/uc?export=download&id=1tMrkYFAuzjCWjhDCJRGqVNLd4j0XrlVK
To: /content/twitter-sentiment/twitter-sentiment-analysis-hatred-speech.zip
100%|██████████| 1.98M/1.98M [00:00<00:00, 162MB/s]


In [2]:
import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import tensorflow as tf
import keras.backend as K
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import Sequential
from keras.layers import Dense, SimpleRNN, Embedding, Flatten, Dropout

In [11]:
train_csv = pd.read_csv("/content/twitter-sentiment/train.csv")
test_csv = pd.read_csv("/content/twitter-sentiment/test.csv")

Here we download our data, import the relevant libraries, and load in the `.csv` files again.

Let's take a quick look at the train data again just for a refresher!

In [12]:
train_csv

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


# Preprocessing

Computers don't understand English! It's as simple as that. They understand numbers. So how do we turn our table of tweets into sequences of numbers?

This process isn't that easy. But no worries! We will thoroughly walk you through the steps of text preprocessing, with code for you to toy with.

Here are the steps we will take to turn our string tweets into number sequences:
1. Clean the text by:
  - lowercasing all text
  - stripping the `'s` ending from words (e.g. `user's` to `user`)
  - breaking contractions into their components (e.g. `can't` to `can not`, `what's` to `what is`)
  - formalizing slang (e.g. `'scuse` to `excuse`)
  - removing special characters (anything that isn't a letter or a number)
    - this includes punctuation!
  - stripping excessive white space
2. Tokenize the cleaned tweets, turning each one into a sequence of numbers
3. Pad the sequences so that they all have the same length

In [13]:
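def clean_text(text):
    """Lowercase a tweet, expand contractions, and keep only letters, digits, and single spaces."""
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    # Handle 'scuse before the generic 's rule; otherwise it would become " cuse".
    text = re.sub(r"'scuse", " excuse ", text)
    text = re.sub(r"'s", " ", text)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"'re", " are ", text)
    text = re.sub(r"'d", " would ", text)
    text = re.sub(r"'ll", " will ", text)
    # Raw strings keep Python from treating \W and \s as string escapes.
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"\s+", " ", text)
    # Replace every remaining run of non-alphanumeric characters with one space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    return text.strip(" ")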
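
To see the cleaning rules in action, here is a small self-contained sketch that applies a few of them to a made-up tweet. The sample string and the `demo_clean` helper are illustrative assumptions; they are not part of the dataset or the workshop code.

```python
import re

def demo_clean(text):
    # Lowercase first so the contraction patterns below can match.
    text = text.lower()
    # Expand two contractions, in the same way clean_text does.
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"can't", "can not ", text)
    # Turn every run of characters that is not a letter or digit into one space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()

sample = "@user I can't believe what's happening!!   #wow"
print(demo_clean(sample))  # user i can not believe what is happening wow
```

Cleaning the result a second time leaves it unchanged, because no apostrophes or special characters remain after the first pass.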

Now we can remove the irrelevant `id` column and apply the `clean_text` function we just defined to every tweet in the dataset. We run it twice so that the data is extra tidy (one pass already does the whole job, so the second pass is just a safety net)!

__Note__: Even though we have a function that cleans up all the textual mess, it is neither comprehensive nor perfect. There will always be some small textual problems (e.g. maybe you couldn't break all the contractions up). But, for the purposes of our simple model, this is perfectly fine!

In [14]:
# Drop irrelevant column.
train_csv.drop("id", axis=1, inplace=True)

# Run through the text cleaning pipeline twice.
train_csv['tweet'] = train_csv['tweet'].map(clean_text)
train_csv['tweet'] = train_csv['tweet'].map(clean_text)

In [15]:
train_csv

Unnamed: 0,label,tweet
0,0,user when a father is dysfunctional and is so ...
1,0,user user thanks for lyft credit i can not use...
2,0,bihday your majesty
3,0,model i love u take with u all the time in ur
4,0,factsguide society now motivation
...,...,...
31957,0,ate user isz that youuu
31958,0,to see nina turner on the airwaves trying to w...
31959,0,listening to sad songs on a monday morning otw...
31960,1,user sikh temple vandalised in in calgary wso ...


Notice we dropped the `id` column and also the text is a lot more readable. 

Basically, we now have a train dataset of tweets that consist solely of lowercase words, with no punctuation, no special characters, no weird spacing, etc.!

Next, we define a tokenizer.

❓: What is a __tokenizer__?

> __tokenizer__ : an NLP tool that converts sentences (text, more generally) into a sequence of tokens; in this case, we want the model to train on this data, so our tokens take the form of numbers!

Essentially, a tokenizer will build a __vocabulary__, which is a dictionary like the one below.

```py
vocabulary = {
  0: "<e>",    # end token
  1: "<s>",    # start token
  2: "<UNK>",  # unknown token
  3: "the",
  4: "a",
  5: "how",
  # ...
}
```

We see that a tokenizer's vocabulary usually reserves the first 3 spots for special tokens that mark where a sentence ends, where it starts, and words that are not in the vocabulary. (The Keras `Tokenizer` we use below is simpler: it reserves only index 0, which is used for padding, and numbers the words from 1 in order of frequency.)

Right below, we define a tokenizer with a max vocabulary size of 2000. The `split` parameter simply says that the words in each of our clean and tidy tweets are separated by spaces!

In [16]:
# Define a tokenizer.
vocabulary_size = 2000
tokenizer = Tokenizer(num_words=vocabulary_size, split=' ')

In [18]:
train_csv['tweet'].values

array(['user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run',
       'user user thanks for lyft credit i can not use cause they do not offer wheelchair vans in pdx disapointed getthanked',
       'bihday your majesty', ...,
       'listening to sad songs on a monday morning otw to work is sad',
       'user sikh temple vandalised in in calgary wso condemns act',
       'thank you user for you follow'], dtype=object)

Next, we fit this tokenizer to the text with the convenient `fit_on_texts` method. 

In case you're curious what `train_csv['tweet'].values` is:

```py
[
  'user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run',
  'user user thanks for lyft credit i can not use cause they do not offer wheelchair vans in pdx disapointed getthanked',
  'bihday your majesty',
  ...
]
```

Essentially, the `fit_on_texts` method takes in a list of clean, tidy strings. It does not output anything; instead, the tokenizer uses this call to build the vocabulary we described above.
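
Conceptually, fitting a tokenizer boils down to counting words and numbering them by frequency. The sketch below illustrates that idea in plain Python. `build_vocabulary` is a hypothetical helper written for this explanation; it is not how Keras implements `fit_on_texts` internally.

```python
from collections import Counter

def build_vocabulary(texts):
    # Count how often each space-separated word appears across all texts.
    counts = Counter(word for text in texts for word in text.split(" "))
    # Number the words from 1, most frequent first (index 0 is left free for padding).
    ranked_words = [word for word, _ in counts.most_common()]
    return {word: index for index, word in enumerate(ranked_words, start=1)}

toy_tweets = ["user thanks user", "thanks for the credit user"]
toy_vocabulary = build_vocabulary(toy_tweets)
print(toy_vocabulary["user"], toy_vocabulary["thanks"])  # 1 2
```

The most frequent word gets the smallest index, which matches the encoded tweets in this notebook, where the very common word `user` shows up as `1`.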

In [19]:
# Fit the tokenizer on all the train tweets to establish a vocabulary. 
tokenizer.fit_on_texts(train_csv['tweet'].values)

Afterwards, our tokenizer will have a vocabulary. With this vocabulary, it will convert the list of clean, tidy tweets into numbers.

If you're curious what `X` looks like:

```py
[[1, 37, 5, 71, 10, 7, 10, 26, 73, 95, 250, 255, 95, 456],
 [1, 1, 169, 9, 4, 35, 14, 439, 649, 62, 27, 14, 1522, 8],
 [57, 31],
 [141, 4, 15, 38, 75, 19, 38, 24, 2, 41, 8, 111],
 [1480, 47, 293],
 [74, 74, 1034, 705, 7, 260, 706, 243, 62, 366, 7, 189, 37, 62, 54, 83],
 [1, 112, 1, 1, 1, 1, 1, 1, 1],
 ...
]
```

In short, it turned our list of clean, tidy tweets into lists of numbers, where each number corresponds to a word in the vocabulary.

In [20]:
# Using this vocabulary, the tokenizer converts each tweet into a sequence of numbers
# where each number corresponds to a word in the vocabulary. 
X = tokenizer.texts_to_sequences(train_csv['tweet'].values)
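
As a rough picture of what this conversion does, the sketch below looks each word up in a small hand-written vocabulary. The `toy_vocabulary` dictionary and the `encode` helper are assumptions for illustration only. Words missing from the vocabulary are simply skipped here, which is one way rare words can drop out when the vocabulary size is capped.

```python
toy_vocabulary = {"user": 1, "thanks": 2, "for": 3, "credit": 4}

def encode(texts, vocabulary):
    sequences = []
    for text in texts:
        # Keep the index of every known word; unknown words are dropped.
        sequences.append([vocabulary[word] for word in text.split(" ") if word in vocabulary])
    return sequences

encoded = encode(["user user thanks for lyft credit"], toy_vocabulary)
print(encoded)  # [[1, 1, 2, 3, 4]]
```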

Notice that these tweets mostly have different lengths! Models don't like that. So we make every tweet the same length: any tweet shorter than the longest one gets padded with 0s at the front. (If you pass a `maxlen`, longer tweets are cut off at that point instead.) That's what `pad_sequences` does!

If you're curious what `X` looks like after running the below cell:

```py
[[   0,    0,    0, ...,  255,   95,  456],
 [   0,    0,    0, ...,   14, 1522,    8],
 [   0,    0,    0, ...,    0,   57,   31],
 ...,
 [   0,    0,    0, ...,   76,   10,  120],
 [   0,    0,    0, ..., 1608, 1609,  672],
 [   0,    0,    0, ...,    9,    6,  152]
]
```

Notice how it is a rectangular matrix now.

In [22]:
# Pad the sequences.
X = pad_sequences(X)
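
The padding step can be pictured with a few lines of plain Python. This is a simplified sketch of the default behaviour (zeros added at the front, up to the length of the longest sequence). `pad_front` is a hypothetical helper, not the Keras function itself.

```python
def pad_front(sequences, maxlen=None, value=0):
    # Without an explicit maxlen, use the length of the longest sequence.
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Sequences that are too long keep only their last maxlen tokens.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]
        # Shorter sequences get zeros in front until they reach maxlen.
        padded.append([value] * (maxlen - len(seq)) + list(seq))
    return padded

padded = pad_front([[1, 37, 5], [57, 31], [1480]])
print(padded)  # [[1, 37, 5], [0, 57, 31], [0, 0, 1480]]
```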

Lastly, we will also get labels for all the tweets.

If you're curious what `y` looks like:

```py
[
  0,
  1,
  0,
  1,
  1,
  ...
]
```

In [24]:
# Get all the labels.
y = train_csv["label"]

So we now have a list of labels and a rectangular matrix where each row is a tweet and all tweets are not only clean and tidy but in their numeric form (tokenized form). We will apply the train and test split from before to partition the data.

__Note__: remember we have been working with the train data all this time, so we are actually splitting the train dataset.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=66)

In [28]:
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (27167, 39)
y_train shape:  (27167,)
X_test shape:  (4795, 39)
y_test shape:  (4795,)
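
The shapes above can be checked with a little arithmetic. The sketch assumes that the test size is rounded up to a whole number of rows, which is consistent with the printed shapes.

```python
import math

n_samples = 27167 + 4795  # total number of tweets in the train file
test_size = 0.15

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 27167 4795
```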


# Building a Model

We will be building our model using `tf.keras.Sequential`.

Our tokenizer has already done the encoding step, converting the text to sequences of token indices, so the model itself starts with an embedding layer.

An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors.

These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.
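
One way to picture an embedding layer is as a lookup table with one row per token index. The sketch below uses tiny, made-up sizes and random values in plain Python. `embedding_table` and `embed` are illustrative names, and a real layer would adjust the values during training.

```python
import random

random.seed(0)
vocab_size, embed_size = 6, 4  # tiny sizes for readability

# One row of embed_size numbers per token index.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_size)]
                   for _ in range(vocab_size)]

def embed(sequence):
    # Look up the row that belongs to each token index.
    return [embedding_table[index] for index in sequence]

padded_tweet = [0, 0, 3, 5]
vectors = embed(padded_tweet)
print(len(vectors), len(vectors[0]))  # 4 4
```

A sequence of 4 token indices becomes 4 vectors of length `embed_size`, and the table itself holds `vocab_size * embed_size` trainable numbers.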

### ❓ What is an RNN?
A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.
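
The recurrence can be written out by hand for a tiny case. The sketch below is a plain-Python illustration of a simple recurrent layer with a ReLU activation. The function name, the weights, and the inputs are all made up for this example.

```python
def simple_rnn(inputs, kernel, recurrent_kernel, bias):
    """Run a ReLU recurrent layer over a sequence and return the state at every timestep."""
    units = len(bias)
    state = [0.0] * units  # the hidden state starts at zero
    states = []
    for x in inputs:
        new_state = []
        for j in range(units):
            total = bias[j]
            # Contribution of the current input.
            total += sum(x[i] * kernel[i][j] for i in range(len(x)))
            # Contribution of the previous timestep's output.
            total += sum(state[k] * recurrent_kernel[k][j] for k in range(units))
            new_state.append(max(0.0, total))  # ReLU activation
        state = new_state
        states.append(state)
    return states

# One input feature, one unit: each step adds the input to half of the previous state.
states = simple_rnn([[1.0], [1.0], [1.0]], kernel=[[1.0]], recurrent_kernel=[[0.5]], bias=[0.0])
print(states)  # [[1.0], [1.5], [1.75]]
```

Returning every state corresponds to `return_sequences=True`, while keeping only `states[-1]` corresponds to `return_sequences=False`.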


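TensorFlow also offers a `tf.keras.layers.Bidirectional` wrapper for RNN layers.

It propagates the input forward and backwards through the RNN layer and then concatenates the final output. To keep things simple, the model below does not use it and instead stacks two plain `SimpleRNN` layers.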

In [None]:
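embed_size = 128     # length of each word vector
vocab_size = 3000    # rows in the embedding table; must exceed the largest token index
simplernn_out = 64   # number of units in each SimpleRNN layer

def f1_metric(y_true, y_pred):
    """Batch-wise F1 score: the harmonic mean of precision and recall."""
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    # K.epsilon() guards against division by zero.
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2 * (precision * recall) / (precision + recall + K.epsilon())
    return f1_val

def build_model():
    model = Sequential()
    # Turn each token index into a trainable vector of length embed_size.
    model.add(Embedding(vocab_size, embed_size, input_shape=(X_train.shape[1],)))
    # The first RNN returns its output at every timestep so the second RNN can read a sequence.
    model.add(SimpleRNN(simplernn_out, activation="relu", return_sequences=True))
    # The second RNN returns only its final output.
    model.add(SimpleRNN(simplernn_out, activation="relu", return_sequences=False))
    # The output is already flat at this point, so Flatten leaves the shape unchanged.
    model.add(Flatten())
    # A single sigmoid unit gives the probability of the positive (label 1) class.
    model.add(Dense(1, activation='sigmoid'))
    print(model.summary())

    model.compile(
        optimizer='Adam',
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), f1_metric],
    )

    return model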
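
To make the F1 calculation concrete, here is a small worked example in plain Python. It mirrors the precision and recall steps of `f1_metric` on made-up labels and predicted probabilities. `f1_from_predictions` is a hypothetical helper, and the 0.5 threshold stands in for the rounding done in the Keras version.

```python
def f1_from_predictions(y_true, y_prob):
    # Turn probabilities into hard 0/1 predictions.
    y_pred = [1 if p > 0.5 else 0 for p in y_prob]
    true_positives = sum(t * p for t, p in zip(y_true, y_pred))
    predicted_positives = sum(y_pred)
    possible_positives = sum(y_true)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / possible_positives if possible_positives else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

labels = [1, 0, 1, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.8, 0.6]
f1 = f1_from_predictions(labels, probabilities)
print(round(f1, 3))  # 0.667
```

Here 2 of the 3 predicted positives are correct and 2 of the 3 actual positives are found, so precision, recall, and F1 all equal 2/3.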

In [None]:
model = build_model()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 39, 128)           384000    
                                                                 
 simple_rnn_14 (SimpleRNN)   (None, 39, 64)            12352     
                                                                 
 simple_rnn_15 (SimpleRNN)   (None, 64)                8256      
                                                                 
 flatten_6 (Flatten)         (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
Total params: 404,673
Trainable params: 404,673
Non-trainable params: 0
_________________________________________________________________
None
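
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for these layer types (a weight per input, a weight per recurrent connection, and one bias per unit) with the sizes defined in the notebook.

```python
embed_size, vocab_size, units = 128, 3000, 64

# Embedding: one vector of length embed_size per vocabulary entry.
embedding = vocab_size * embed_size
# SimpleRNN: input weights + recurrent weights + one bias, for each unit.
rnn_1 = units * (embed_size + units + 1)
rnn_2 = units * (units + units + 1)
# Dense: one weight per incoming unit plus one bias.
dense = units * 1 + 1

total = embedding + rnn_1 + rnn_2 + dense
print(embedding, rnn_1, rnn_2, dense, total)  # 384000 12352 8256 65 404673
```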


# Training the Model

In [None]:
batch_size = 32
history = model.fit(X_train, y_train, epochs = 7, batch_size=batch_size, validation_split=0.2)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
