# Main Notebook: NLP Series Workshop 2: Diving Deeper into Sentiment Analysis Techniques

TODO:
- include graphic for pipeline
- visuals for everything
- finish the entire notebook
- remove dropout, embedding (all the complicated stuff)
- better explanations
- need an evaluation section

Credit to this wonderful notebook: https://www.kaggle.com/code/isidronavarrooporto/hate-speech-tweet-classification

<span style="color:red">__DISCLAIMER__</span>: This dataset contains hateful speech and explicit content.

Conventions used:

❗ - Required <br>
❓ - Question

# 1. Setup

The dataset we'll use can be found here: https://www.kaggle.com/datasets/arkhoshghalb/twitter-sentiment-analysis-hatred-speech

In [None]:
import gdown
!mkdir twitter-sentiment
%cd twitter-sentiment
gdown.download('https://drive.google.com/uc?export=download&id=1tMrkYFAuzjCWjhDCJRGqVNLd4j0XrlVK')
!unzip -q twitter-sentiment-analysis-hatred-speech.zip
!rm twitter-sentiment-analysis-hatred-speech.zip

/content/twitter-sentiment


Downloading...
From: https://drive.google.com/uc?export=download&id=1tMrkYFAuzjCWjhDCJRGqVNLd4j0XrlVK
To: /content/twitter-sentiment/twitter-sentiment-analysis-hatred-speech.zip
100%|██████████| 1.98M/1.98M [00:00<00:00, 115MB/s]


In [None]:
import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import tensorflow as tf
import keras.backend as K
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import Sequential
from keras.layers import Dense, SimpleRNN, Embedding, Flatten, Dropout

In [None]:
train_csv = pd.read_csv("/content/twitter-sentiment/train.csv")
test_csv = pd.read_csv("/content/twitter-sentiment/test.csv")

Here we download our data, import the relevant libraries, and load in the `.csv` files again.

# Preprocessing

Computers don't understand English! It's as simple as that. They understand numbers. So how do we turn our table of tweets into sequences of numbers?

This process isn't trivial, but no worries! We will walk you through each step of text preprocessing, with code for you to toy with.

Here are the steps we will take to turn our string tweets into number sequences:
1. Clean the text by:
  - lowercasing all text
  - expanding contractions into their components (e.g. `can't` to `can not`, `what's` to `what is`)
  - stripping any remaining `'s` endings (e.g. `today's` to `today`)
  - formalizing slang (e.g. `'scuse` to `excuse`)
  - removing special characters (anything that isn't a letter or a digit)
  - stripping excess whitespace
2. Tokenize the cleaned tweets, mapping each word to an integer index.
3. Pad the resulting sequences so they all have the same length.

In [None]:
def clean_text(text):
    text = text.lower()
    # Expand contractions and slang. Specific patterns must run before the
    # generic "'s" rule, otherwise "'scuse" would be turned into " cuse".
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"'scuse", " excuse ", text)
    text = re.sub(r"'s", " ", text)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"'re", " are ", text)
    text = re.sub(r"'d", " would ", text)
    text = re.sub(r"'ll", " will ", text)
    # Replace every run of characters that are not letters or digits
    # (including whitespace) with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()
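
To see why the order of the replacement rules matters, here is a minimal sketch with a trimmed-down rule table. `RULES` and `clean_sketch` are hypothetical names used only for illustration and are not part of the workshop's pipeline. If the generic `n't` rule ran before the `can't` rule, `can't` would come out as `ca not`.

```python
import re

# Hypothetical, trimmed-down rule table for illustration. Order matters.
RULES = [
    (r"can't", "can not "),   # must run before the generic n't rule
    (r"n't", " not "),
    (r"'ve", " have "),
    (r"[^a-z0-9]+", " "),     # anything left that is not a letter or digit
]

def clean_sketch(text):
    """Lowercase the text, apply each rule in order, and trim the ends."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text.strip()

print(clean_sketch("I CAN'T believe it!!"))          # i can not believe it
print(clean_sketch("They don't   know @user #sad"))  # they do not know user sad
```

Try swapping the first two rules and rerunning to watch the first output change.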

In [None]:
# Start from the training data loaded above and drop the irrelevant "id" column.
data = train_csv.drop(columns="id")

# Run every tweet through the text cleaning pipeline.
# One pass is enough: cleaned text only contains lowercase letters, digits, and single spaces.
data['tweet'] = data['tweet'].map(clean_text)

In [None]:
# Tokenize and convert tweets to sequences of numbers.
max_features = 2000  # only the most frequent words are kept
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(data['tweet'].values)
X = tokenizer.texts_to_sequences(data['tweet'].values)
# Left-pad with zeros so every sequence is as long as the longest tweet.
X = pad_sequences(X)

# Get all the labels.
y = data["label"]
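
To build some intuition for the tokenizing and padding step, here is a rough pure-Python sketch of the idea. It only approximates what `Tokenizer` and `pad_sequences` do with their default settings, and the helpers `fit_word_index`, `to_sequences`, and `pad_left` are hypothetical names made up for this illustration.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # Index 0 is reserved for padding, so only indices 1..num_words-1 are kept.
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad_left(sequences):
    """Left-pad with zeros up to the length of the longest sequence."""
    width = max(len(seq) for seq in sequences)
    return [[0] * (width - len(seq)) + seq for seq in sequences]

tweets = ["good day", "good good vibes today", "bad day"]
word_index = fit_word_index(tweets, num_words=4)
padded = pad_left(to_sequences(tweets, word_index))

print(word_index)  # {'good': 1, 'day': 2, 'vibes': 3}
print(padded)      # [[0, 1, 2], [1, 1, 3], [0, 0, 2]]
```

Notice that the rare words `today` and `bad` simply disappear, which is what happens to every word outside the top of the vocabulary in our real data too.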

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=66)

In [None]:
print("X_train shape: ",X_train.shape)
print("y_train shape: ",y_train.shape)
print("X_test shape: ",X_test.shape)
print("y_test shape: ",y_test.shape)

X_train shape:  (27167, 39)
y_train shape:  (27167,)
X_test shape:  (4795, 39)
y_test shape:  (4795,)
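
Where do these row counts come from? As a quick sanity check, the sketch below redoes the arithmetic by hand, relying on the fact that scikit-learn rounds the test share up when `test_size` is a fraction. The second dimension, 39, is simply the length of the longest tokenized tweet after padding.

```python
import math

n_samples = 27167 + 4795               # rows in X, taken from the shapes printed above
n_test = math.ceil(0.15 * n_samples)   # the test share is rounded up
n_train = n_samples - n_test           # everything else is used for training

print(n_samples, n_train, n_test)      # 31962 27167 4795
```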


# Building a Model

We will build our model with `tf.keras.Sequential`.

Our tweets have already been converted to padded sequences of token indices by the `Tokenizer`, so the model's first layer is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors.

These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.
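
Under the hood, an embedding lookup is little more than indexing into a table. The NumPy sketch below illustrates the idea with random values standing in for the learned vectors. It is not how Keras implements the layer, and the toy `batch` is made up for the example. The sizes match the `vocab_size` and `embed_size` we use when building the model.

```python
import numpy as np

vocab_size, embed_size = 3000, 128
rng = np.random.default_rng(0)
# One row per token index; in the real layer these values are learned.
embedding_table = rng.normal(size=(vocab_size, embed_size))

# A toy batch of two padded sequences of token indices (0 is padding).
batch = np.array([[0, 0, 5, 17],
                  [3, 5, 5, 42]])
vectors = embedding_table[batch]   # indexing with an array performs the lookup

print(vectors.shape)                                 # (2, 4, 128)
print(embedding_table.size)                          # 384000 trainable numbers
print(np.array_equal(vectors[0, 2], vectors[1, 1]))  # True: same index, same vector
```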

### ❓ What is an RNN?
A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.
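
The recurrence can be sketched in a few lines of NumPy. This is a simplified illustration for a single sequence with random weights, not the Keras implementation. `simple_rnn` and its argument names are hypothetical, and the sizes mirror the ones used by the model in this notebook.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b, return_sequences=False):
    """Run one sequence of shape (timesteps, features) through a ReLU RNN."""
    h = np.zeros(W_h.shape[0])            # the initial state is all zeros
    outputs = []
    for x_t in inputs:                    # one timestep at a time
        # The new state mixes the current input with the previous state.
        h = np.maximum(0.0, x_t @ W_x + h @ W_h + b)
        outputs.append(h)
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
timesteps, features, units = 39, 128, 64
W_x = rng.normal(scale=0.1, size=(features, units))  # input-to-state weights
W_h = rng.normal(scale=0.1, size=(units, units))     # state-to-state weights
b = np.zeros(units)
sequence = rng.normal(size=(timesteps, features))

all_steps = simple_rnn(sequence, W_x, W_h, b, return_sequences=True)
last_step = simple_rnn(sequence, W_x, W_h, b)

print(all_steps.shape, last_step.shape)   # (39, 64) (64,)
print(W_x.size + W_h.size + b.size)       # 12352 weights in total
```

With `return_sequences=True` we get the state at every timestep, otherwise only the final one. The weight count also matches the number of parameters Keras reports for a 64-unit `SimpleRNN` fed 128-dimensional vectors.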


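We will stack two `SimpleRNN` layers.

The first sets `return_sequences=True`, so it outputs a vector at every timestep for the second layer to consume. The second returns only its final output, which a `Dense` layer with a sigmoid activation turns into a single probability for the positive label.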

In [None]:
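embed_size = 128     # length of each word vector
vocab_size = 3000    # rows in the embedding table; must be at least the Tokenizer's num_words
simplernn_out = 64   # units in each SimpleRNN layer

def f1_metric(y_true, y_pred):
    # F1 is computed per batch from predictions rounded to 0 or 1.
    # K.epsilon() guards against division by zero.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val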

def build_model():
    model = Sequential()
    # Map each token index to a trainable vector of length embed_size.
    model.add(Embedding(vocab_size, embed_size, input_shape=(X_train.shape[1],)))
    # The first RNN layer returns its output at every timestep for the next RNN layer.
    model.add(SimpleRNN(simplernn_out, activation="relu", return_sequences=True))
    # The second RNN layer returns only its final output.
    model.add(SimpleRNN(simplernn_out, activation="relu", return_sequences=False))
    model.add(Flatten())  # the output is already flat (64 values per tweet), so this changes nothing
    # A single sigmoid unit outputs the probability of the positive label.
    model.add(Dense(1, activation='sigmoid'))
    model.summary()  # summary() prints the table itself and returns None

    model.compile(optimizer='Adam', loss='binary_crossentropy',
                  metrics=['accuracy',
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall(),
                           f1_metric])

    return model

In [None]:
model = build_model()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 39, 128)           384000    
                                                                 
 simple_rnn_14 (SimpleRNN)   (None, 39, 64)            12352     
                                                                 
 simple_rnn_15 (SimpleRNN)   (None, 64)                8256      
                                                                 
 flatten_6 (Flatten)         (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
Total params: 404,673
Trainable params: 404,673
Non-trainable params: 0
_________________________________________________________________
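
The `f1_metric` function can look cryptic at first, so here is a small worked example of the same calculation in NumPy. The labels and probabilities are made up for illustration, and `f1_from_probabilities` is a hypothetical helper, not part of the model.

```python
import numpy as np

def f1_from_probabilities(y_true, y_prob, eps=1e-7):
    """Round probabilities to 0 or 1, then compute precision, recall, and F1."""
    y_pred = np.round(np.clip(y_prob, 0, 1))
    true_positives = np.sum(y_true * y_pred)
    precision = true_positives / (np.sum(y_pred) + eps)  # share of flagged tweets that are correct
    recall = true_positives / (np.sum(y_true) + eps)     # share of positive tweets that were found
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.7, 0.6])

precision, recall, f1 = f1_from_probabilities(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```

Keep in mind that Keras evaluates a custom metric like this one batch by batch and averages the results, so the value it reports can differ slightly from the F1 score computed over the whole dataset at once.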


# Training the Model

In [None]:
batch_size = 32
history = model.fit(X_train, y_train, epochs = 7, batch_size=batch_size, validation_split=0.2)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
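
How much data does each epoch actually see? The sketch below works it out by hand, assuming Keras rounds the training share down and holds out the last rows of `X_train` for validation (taken before any shuffling). The row count comes from the shapes printed earlier.

```python
import math

n_train = 27167            # rows in X_train, from the shapes printed earlier
validation_split = 0.2
batch_size = 32

n_fit = math.floor(n_train * (1 - validation_split))  # rows used for weight updates
n_val = n_train - n_fit                                # last rows, held out for validation
steps_per_epoch = math.ceil(n_fit / batch_size)        # batches per epoch

print(n_fit, n_val, steps_per_epoch)                   # 21733 5434 680
```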
