# Recurrent Neural Nets for NLP


##1. Setup & Exploration





###**a. Setup** 

In [None]:
import gdown
!mkdir -p /content/emotion-sentiment
%cd /content/emotion-sentiment
gdown.download('https://drive.google.com/uc?export=download&id=1EFpJf3GblKvBzutrykHZvoBVPdqFrTh_')
!unzip -q archive.zip
!rm -f archive.zip

/content/emotion-sentiment


Downloading...
From: https://drive.google.com/uc?export=download&id=1EFpJf3GblKvBzutrykHZvoBVPdqFrTh_
To: /content/emotion-sentiment/archive.zip
100%|██████████| 738k/738k [00:00<00:00, 44.8MB/s]


Import libraries

In [None]:

import re
import nltk
import numpy as np
import pandas as pd

from nltk.stem import PorterStemmer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import tensorflow as tf
import keras.backend as K
from tensorflow import keras
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras import Sequential
from keras.layers import Dense, SimpleRNN, Embedding, Flatten, Dropout

Load data with pandas

In [None]:
test_data = pd.read_csv("/content/emotion-sentiment/test.txt", header=None, sep=";", names=["Comment","Emotion"], encoding ="utf-8")
train_data = pd.read_csv("/content/emotion-sentiment/train.txt", header=None, sep=";", names=["Comment","Emotion"], encoding ="utf-8")
validation_data = pd.read_csv("/content/emotion-sentiment/val.txt", header=None, sep=";", names=["Comment","Emotion"], encoding ="utf-8")

### b. Exploration

The data comes in three parts: the training set, used to fit the model; the validation set, used to check how well the model performs during training; and the test set, used to measure performance on data the model has never seen.

Examining the size of the data we are given:

In [None]:
print("Train size:\t", train_data.shape)
print("Test size:\t", test_data.shape)
print("Validation size:\t", validation_data.shape)

Train size:	 (16000, 2)
Test size:	 (2000, 2)
Validation size:	 (2000, 2)


Contents of the training data:

In [None]:
train_data

Unnamed: 0,Comment,Emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger
...,...,...
15995,i just had a very brief time in the beanbag an...,sadness
15996,i am now turning and i feel pathetic that i am...,sadness
15997,i feel strong and good overall,joy
15998,i feel like this was such a rude comment and i...,anger


What are the emotions labeling the comments?

In [None]:
print(train_data["Emotion"].unique())

['sadness' 'anger' 'love' 'surprise' 'fear' 'joy']


-> there are six emotion classes, so we must train a multi-class classification model
  
  ❓What is **multi-class classification**?

 **multi-class classification:**
  classifying each instance into one of more than 2 classes
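
A common companion check for a multi-class problem is how evenly the classes are represented. The sketch below uses a tiny made-up DataFrame (`toy`, not the real dataset) with the same column names as `train_data`; on the real data the same calls would apply to `train_data["Emotion"]`.

```python
import pandas as pd

# hypothetical toy data with the same columns as train_data
toy = pd.DataFrame({
    "Comment": ["i feel sad", "i feel great", "i am furious", "i feel happy"],
    "Emotion": ["sadness", "joy", "anger", "joy"],
})

# number of distinct classes (more than 2 means a multi-class problem)
n_classes = toy["Emotion"].nunique()

# share of each class; very uneven shares suggest an imbalanced dataset
class_share = toy["Emotion"].value_counts(normalize=True)

print(n_classes)
print(class_share)
```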

#2. Preprocessing

##a. Dataset Modifications

Adding a new column to the training data containing the length (in characters) of each comment

In [None]:
train_data["Length"] = [len(x) for x in train_data["Comment"]]

replacing emotions with integer representations

In [None]:
lb = LabelEncoder()
# fit the encoder on the training labels only, then reuse the same mapping for the other splits
train_data['Emotion'] = lb.fit_transform(train_data['Emotion'])
test_data['Emotion'] = lb.transform(test_data['Emotion'])
validation_data['Emotion'] = lb.transform(validation_data['Emotion'])

modified train_data:

In [None]:
train_data

Unnamed: 0,Comment,Emotion,Length
0,i didnt feel humiliated,4,23
1,i can go from feeling so hopeless to so damned...,4,108
2,im grabbing a minute to post i feel greedy wrong,0,48
3,i am ever feeling nostalgic about the fireplac...,3,92
4,i am feeling grouchy,0,20
...,...,...,...
15995,i just had a very brief time in the beanbag an...,4,101
15996,i am now turning and i feel pathetic that i am...,4,102
15997,i feel strong and good overall,2,30
15998,i feel like this was such a rude comment and i...,0,59
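
The integer codes in the Emotion column come from `LabelEncoder`, which sorts the class names alphabetically and numbers them from 0. The sketch below rebuilds that mapping from the six emotion names listed earlier; `encoder` is a separate, illustrative instance rather than the notebook's `lb`.

```python
from sklearn.preprocessing import LabelEncoder

# the six emotion names found in the training data
emotions = ['sadness', 'anger', 'love', 'surprise', 'fear', 'joy']

encoder = LabelEncoder()
encoder.fit(emotions)

# classes_ is sorted alphabetically; the position of each name is its integer code
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into emotion names
decoded = [str(name) for name in encoder.inverse_transform([4, 0, 3])]
print(decoded)
```

Keeping the fitted encoder around makes it possible to translate model predictions back into emotion names later.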


##b. Data Cleaning

A few techniques for cleaning text data

###i. Stop Words

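NLTK's English stop word list. Stop words are very common words ("is", "and", "some") that carry little meaning on their own, so they are often removed.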

In [None]:
sentences = ['Caleb brought some donuts and is driving', 'He is very kind for doing all that']

# download the stop word list once and keep it as a set for fast lookups
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))
print()

# the list is all lowercase, so capitalized words such as "He" are not filtered out here
for sentence in sentences:
  stopped = " ".join([word for word in sentence.split() if word not in stopwords])
  print(f"{sentence}\n{stopped}\n")



Caleb brought some donuts and is driving
Caleb brought donuts driving

He is very kind for doing all that
He kind



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
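
In the output above, "He" survives because the comparison is case-sensitive and the stop word list is lowercase. A minimal sketch of a case-insensitive filter follows; `mini_stopwords` and `remove_stopwords` are hypothetical names, and the tiny hand-written list stands in for NLTK's much longer one so the example runs without a download.

```python
# hypothetical mini stop word list, for illustration only
mini_stopwords = {"he", "is", "very", "for", "doing", "all", "that", "some", "and"}

def remove_stopwords(sentence, stopwords):
    """Drop every word whose lowercase form is in the stop word set."""
    return " ".join(w for w in sentence.split() if w.lower() not in stopwords)

filtered_a = remove_stopwords("Caleb brought some donuts and is driving", mini_stopwords)
filtered_b = remove_stopwords("He is very kind for doing all that", mini_stopwords)
print(filtered_a)
print(filtered_b)
```

The cleaning function later in this notebook lowercases the text before filtering, so the issue does not arise there.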


###ii. Stemming

Reducing each word to its stem by stripping suffixes such as plural and verb endings (e.g. "donuts" -> "donut", "running" -> "run")

In [None]:
sentences = ['Caleb brought some donuts and is driving', 'He is running and jumping']
stemmer = PorterStemmer()
for sentence in sentences:
  stemmed = " ".join([stemmer.stem(word) for word in sentence.split()])
  print(f"{sentence}\n{stemmed}\n")

Caleb brought some donuts and is driving
caleb brought some donut and is drive

He is running and jumping
he is run and jump



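###iii. One-hot encoding

A Keras function (**one_hot**) that hashes each word to an integer between 1 and n-1 (1-19 in the example below, since n=20). Within a run the same word always gets the same number, but two different words can collide on the same number. Zeros will be added to the front of shorter sentences as padding so they match the longest one in the array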

In [None]:
sentences = ["Caleb ate some donuts", "Caleb ate some","Some ate"]
max_len = max([len(sentence.split()) for sentence in sentences])

#convert each sentence into a list of integer word indices
encoded = []
for sentence in sentences:
  encoded.append(one_hot(input_text=sentence, n=20))

#left-pad with zeros so every sequence has length max_len
padded = pad_sequences(sequences=encoded, maxlen=max_len,padding="pre")

for i in range(len(padded)): print(f"{sentences[i]}\n{padded[i]}\n")


Caleb ate some donuts
[17  4 15  8]

Caleb ate some
[ 0 17  4 15]

Some ate
[ 0  0 15  4]
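
To make the encoding less of a black box, here is a rough pure-Python sketch of the same idea: hash each word into the range 1..n-1, then left-pad with zeros. The helper names (`hash_encode`, `pad_pre`) are made up for illustration, and md5 is used as an assumption so the numbers are stable across runs; the actual indices will differ from the Keras output above.

```python
import hashlib

def hash_encode(sentence, n):
    """Map each lowercased word to an integer in 1..n-1 using a stable md5 hash."""
    words = sentence.lower().split()
    return [int(hashlib.md5(w.encode()).hexdigest(), 16) % (n - 1) + 1 for w in words]

def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros up to maxlen."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

demo_sentences = ["Caleb ate some donuts", "Caleb ate some", "Some ate"]
demo_encoded = [hash_encode(s, n=20) for s in demo_sentences]
longest = max(len(e) for e in demo_encoded)
demo_padded = [pad_pre(e, longest) for e in demo_encoded]

for s, p in zip(demo_sentences, demo_padded):
    print(s, p)
```

Because the range is only n-1 values wide, a small n makes collisions between different words likely, which is one reason the full pipeline uses a much larger n.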



Next step: incorporating all the above techniques into a function called **text_cleaning**

This function will:
1. remove special characters
2. convert everything to lowercase
3. remove stop words
4. stem all the text
5. one-hot encode all the sentences and pad them to the same length

This converts the text into a numeric format usable by the model, which is the aim of **preprocessing**

In [None]:
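#size of the hashing space: word indices fall in 1..vocab_size-1
vocab_size = 1100
#longest comment in characters; a safe upper bound on the number of words per comment
max_len = train_data['Length'].max()

#downloading stopwords from nltk and saving them for use
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))

#the text_cleaning function
def text_cleaning(df,column):
  """Lowercase, strip special characters, remove stop words, stem, encode, and pad"""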
  stemmer = PorterStemmer()
  corpus = []

  for text in df[column]:
    #converts to lowercase & removes special chars
    text = text_to_word_sequence(text)

    #apply stemming while removing stop words
    text = [stemmer.stem(word) for word in text if word not in stopwords]
    text = " ".join(text)

    corpus.append(text)

  #one-hot encode each sentence (convert it into a vector)
  one_hot_word = [one_hot(input_text=sentence,n=vocab_size) for sentence in corpus]
  #apply padding to make all vector representations of sentences of equal length
  pad = pad_sequences(sequences=one_hot_word,maxlen=max_len,padding='pre')

  return pad
    

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
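
Note that `max_len` above is measured in characters, so the padded sequences are longer than any comment strictly needs. One possible alternative, sketched here on made-up comments and not what this notebook does, is to measure length in words instead. `toy_comments` is a hypothetical stand-in for `train_data["Comment"]`.

```python
import pandas as pd

# hypothetical toy comments; the real ones live in train_data["Comment"]
toy_comments = pd.Series([
    "i didnt feel humiliated",
    "i am feeling grouchy",
    "i feel strong and good overall",
])

char_len = toy_comments.str.len().max()               # longest comment in characters
word_len = toy_comments.str.split().str.len().max()   # longest comment in words

print(char_len, word_len)
```

Shorter padded sequences mean fewer timesteps for the recurrent layer to process, at the cost of having to pick the length after tokenization.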


Next, it's time to apply this function to clean up the data in our dataset

In [None]:
x_train = text_cleaning(train_data, "Comment")
x_test = text_cleaning(test_data,"Comment")
x_val = text_cleaning(validation_data,"Comment")

In [None]:
print(x_train)
print(x_train.shape)

[[  0   0   0 ... 917 137  76]
 [  0   0   0 ... 172 759 869]
 [  0   0   0 ... 137 725 974]
 ...
 [  0   0   0 ... 816 719 301]
 [  0   0   0 ... 284 437 772]
 [  0   0   0 ... 137 183 554]]
(16000, 300)


16,000 sentences are in the training dataset, each padded to a length of 300. Next step: create **y_train**, **y_val**, and **y_test** out of the emotion labels (we need these to see how well the model predicts the emotion behind each sentence).

In [None]:
# saving the emotion columns into arrays
y_train_nums = train_data["Emotion"]
y_val_nums = validation_data["Emotion"]
y_test_nums = test_data["Emotion"]

#converting the integer labels into vectors (to_categorical uses traditional one-hot encoding: label k becomes a vector with a 1 at position k)
y_train = to_categorical(y_train_nums)
y_val = to_categorical(y_val_nums)
y_test = to_categorical(y_test_nums)

print("As numbers:\n" + str(y_train_nums.head(5)))
print()
print("As vectors:\n"+str(y_train[0:5]))

print()
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)


As numbers:
0    4
1    4
2    0
3    3
4    0
Name: Emotion, dtype: int64

As vectors:
[[0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0.]]

(16000, 6)
(2000, 6)
(2000, 6)
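
For intuition, the same label-to-vector conversion can be sketched with plain NumPy. This is an illustrative equivalent, not what `to_categorical` does internally; the labels are the first five training labels printed above.

```python
import numpy as np

labels = np.array([4, 4, 0, 3, 0])   # first five training labels shown above
num_classes = 6

# row k of the identity matrix is the one-hot vector for class k
one_hot_labels = np.eye(num_classes)[labels]
print(one_hot_labels)

# argmax recovers the integer label from each vector
recovered = one_hot_labels.argmax(axis=1)
print(recovered)
```

The `argmax` step is also how a predicted probability vector is turned back into a single class index.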


#3. Modeling Recurrent Neural Networks

Standard neural nets are "feed-forward", meaning they process each input in isolation. They do not remember prior pieces of data.

**Sequential data:** Where data points rely on other pieces of data

e.g. a sentence: the meaning of each word depends on its context, that is, on the other words in the sentence

**Recurrent Neural Network:** A type of neural network that is built to process sequential data (it comes with internal memory!)

RNNs process sequence input by iterating through the elements. At each timestep they combine the current element with a hidden state carried over from the previous timestep, and pass the updated state on to the next timestep.
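
The iteration described above can be sketched in a few lines of NumPy. This is a toy forward pass under stated assumptions: the sizes are arbitrary, the weights are random rather than learned, and tanh is used as the activation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, units, timesteps = 3, 4, 5

# randomly initialized weights, standing in for values a real layer would learn
W_x = rng.normal(size=(input_dim, units))   # input -> hidden
W_h = rng.normal(size=(units, units))       # previous hidden -> hidden
b = np.zeros(units)

sequence = rng.normal(size=(timesteps, input_dim))  # one toy input sequence
h = np.zeros(units)                                 # initial hidden state (the "memory")

for x_t in sequence:
    # the new state depends on the current input AND the previous state
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)
```

After the loop, `h` summarizes the whole sequence, and it is this final state that later layers use for classification.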



Layers of our neural net:
1. Embedding: maps each word index to a dense vector in a latent space much smaller than the vocabulary
2. Dropout: randomly zeros out a fraction of the activations during training to reduce overfitting
3. SimpleRNN: a basic recurrent layer whose hidden state is fed back into itself at every timestep
4. Dense: fully connected layer of neurons outputting a tensor of a specific size

In [None]:
def build_model():
  model = Sequential()
  #putting words in a smaller latent space to make the sentences easier to process
  model.add(Embedding(input_dim=vocab_size,input_length=max_len,output_dim=150))
  #adding Dropout layer to prevent overfitting
  model.add(Dropout(0.2))
  model.add(SimpleRNN(128))
  model.add(Dropout(0.2))
  model.add(Dense(64,activation='sigmoid'))
  model.add(Dropout(0.2))
  #use Dense(6, activation='softmax') because we have 6 classes (6 emotions) so a tensor of dimension 6 is generated
  #softmax turns the 6 raw scores into one probability per class, totaling 1
  model.add(Dense(6, activation = 'softmax'))

  model.compile(optimizer='Adam',
                loss=tf.keras.losses.CategoricalCrossentropy(),
                metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

  return model
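
To illustrate what the final softmax layer and the categorical cross-entropy loss compute, here is a small NumPy sketch. The raw scores are made up, and the two helper functions are simplified stand-ins for the Keras versions, which add numerical safeguards of their own.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def categorical_crossentropy(y_true, y_prob):
    """Negative log-probability assigned to the true class."""
    return -np.sum(y_true * np.log(y_prob))

scores = np.array([2.0, 0.5, 0.1, 0.1, 3.0, 0.2])   # made-up raw outputs for 6 classes
probs = softmax(scores)
y_true = np.array([0, 0, 0, 0, 1, 0])                # one-hot label for class 4

loss = categorical_crossentropy(y_true, probs)
print(probs.round(3), probs.sum())
print(loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, which is what training pushes the model toward.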

Next: training the model using .fit()

In [None]:
model = build_model()
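
As a rough sanity check on model size, the trainable parameters can be counted by hand using the standard formulas for each layer type and the sizes from `build_model`. The variable names below are illustrative; Dropout layers add no parameters. The totals can be compared against what `model.summary()` prints.

```python
vocab, embed_dim, rnn_units, dense_units, n_out = 1100, 150, 128, 64, 6

embedding_params = vocab * embed_dim                    # one vector per word index
rnn_params = rnn_units * (embed_dim + rnn_units + 1)    # input weights + recurrent weights + bias
dense_params = rnn_units * dense_units + dense_units    # weights + bias
output_params = dense_units * n_out + n_out             # weights + bias

total_params = embedding_params + rnn_params + dense_params + output_params
print(embedding_params, rnn_params, dense_params, output_params, total_params)
```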


In [None]:
hist = model.fit(x_train,y_train,epochs=10,batch_size=64,
                 validation_data=(x_val,y_val), verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
