# Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are used when data is sequential. For example, a video is a sequence of frames, an audio wave is a sequence of amplitudes, and language is a sequence of letters and words. In all of these scenarios, each step of the data is related to the steps around it; the data points are not independent of each other. <br>
To analyze a sequence as a whole, an RNN keeps an additional hidden state, which tracks the "state" of the flow. Every data point produces two outputs: the result output for that step, and the state output, which is passed on and combined with the next data point. <br>
For a detailed explanation you can check: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
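
The idea of carrying a state from one data point to the next can be sketched in a few lines. This is a minimal illustration of a plain recurrent step, not the LSTM used later: the names `W_x`, `W_h`, `b`, `rnn_step` and all sizes are assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim = 3, 4  # illustrative sizes, not taken from the lecture

# Randomly initialised (untrained) parameters of a single recurrent layer
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> state
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # previous state -> state
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Combine the current data point with the previous state into a new state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = rng.normal(size=(5, input_dim))  # 5 time steps
h = np.zeros(hidden_dim)                    # initial state
states = []
for x_t in sequence:
    h = rnn_step(x_t, h)  # the new state depends on the input and the old state
    states.append(h)

states = np.stack(states)
print(states.shape)  # (5, 4): one state vector per time step
```

In this simplified sketch the state itself doubles as the per-step output; a real model would usually pass it through a further layer to obtain the result output.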

## Movie Sentiment Analysis Using LSTM

We are going to perform sentiment analysis on IMDB movie reviews. We will use a specialized RNN architecture called LSTM (Long Short-Term Memory). In this lecture, we will go through the basic methods used in NLP. **Note that some steps are specific to this basic task and should not be generalized.**
You can find the dataset here: <br>
https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

### Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device


### Data Preprocessing

In [2]:
df = pd.read_csv('IMDB Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


To treat words consistently, we convert all of them to the same case (lower case here). For a clean target, we map the sentiment to 0 or 1.

In [3]:
## convert the reviews to lower case
df['review'] = df['review'].str.lower()
## convert sentiments to integers: positive = 1, negative = 0
df['sentiment'] = (df['sentiment'] == 'positive').astype(int)
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | 1 |
| 1 | a wonderful little production. `<br /><br />`the... | 1 |
| 2 | i thought this was a wonderful way to spend ti... | 1 |
| 3 | basically there's a family where a little boy ... | 0 |
| 4 | petter mattei's "love in the time of money" is... | 1 |


As you can see, the reviews contain extra content such as HTML tags and punctuation. We remove these as well to get a clean input.

In [4]:
## clean review text from html tags like <br/> 
df['review'] = df['review'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
## replace every non-letter character (punctuation, digits, special characters) with a space
df['review'] = df['review'].apply(lambda x: re.sub(r'[^a-zA-Z]',' ',x))
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | 1 |
| 1 | a wonderful little production the filming tec... | 1 |
| 2 | i thought this was a wonderful way to spend ti... | 1 |
| 3 | basically there s a family where a little boy ... | 0 |
| 4 | petter mattei s love in the time of money is... | 1 |
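
To see what the cleaning steps do to a single review, here is a small standalone sketch. To stay dependency-free it strips tags with a simple regular expression instead of BeautifulSoup, which is a swapped-in simplification; the `clean_review` helper and the sample sentence are made up for illustration.

```python
import re

def clean_review(text):
    """Lower-case, drop HTML-like tags, and replace non-letters with spaces."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # crude tag removal (assumes no '>' inside a tag)
    text = re.sub(r'[^a-z]', ' ', text)    # every non-letter becomes a space
    return text

sample = "There's a family.<br /><br />It's GREAT!"
cleaned = clean_review(sample)
print(cleaned.split())
# ['there', 's', 'a', 'family', 'it', 's', 'great']
```

Note that apostrophes are replaced too, so contractions are split: "there's" becomes "there s", as in row 3 of the output above.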


English has many stopwords, which connect other words but add little meaning to a sentence on their own. We remove them as well to reduce the workload of our model.

In [5]:
## remove stop words
## NLTK English stopwords for the preprocessing step
## (the NLTK stopword list and tokenizer data must be downloaded once with nltk.download)
STOPWORDS = set(stopwords.words('english'))  ## a set makes membership tests fast

def remove_stopwords(sentence, stop_words):
    word_tokens = word_tokenize(sentence)

    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    return ' '.join(filtered_sentence)

df['review'] = df['review'].apply(lambda x: remove_stopwords(x, STOPWORDS))

In [6]:
## example of a review after preprocessing
df['review'][0]

'one reviewers mentioned watching oz episode hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street skill
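
One side effect worth keeping in mind is that common stopword lists contain negations. The sketch below uses a tiny hand-written stopword set (an assumption for illustration; NLTK's English list is much longer and also contains words such as "not" and "no") and plain `str.split` instead of `word_tokenize`.

```python
# Tiny illustrative stopword set; NOT the full NLTK list
STOPWORDS_DEMO = {"the", "was", "a", "this", "is", "not", "no", "and"}

def remove_stopwords_demo(sentence, stop_words):
    """Drop every whitespace-separated token that is in stop_words."""
    return ' '.join(w for w in sentence.split() if w not in stop_words)

negative = remove_stopwords_demo("this movie was not good", STOPWORDS_DEMO)
positive = remove_stopwords_demo("this movie was good", STOPWORDS_DEMO)
print(negative)  # movie good
print(positive)  # movie good
```

Both sentences collapse to the same text, which is one reason stopword removal is a task-dependent choice rather than a step to apply everywhere.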

### Train-Test Split

Our data contains movie reviews, so there are many character names, place names, etc. After splitting the data, we keep only the 1000 most common words of the training set, so that our model does not waste capacity on rare words.

In [7]:
## take 80% of data as train data, the rest is for test
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

from collections import Counter

## count how often each word occurs in the training reviews
corpus = Counter()
X_train.str.split().apply(corpus.update)

## keep only the 1000 most common words as the vocabulary
words = [word for word, _ in corpus.most_common(1000)]

## map every word to an index starting from 1; index 0 is left for padding and unknown words
words_dict = {word: index + 1 for index, word in enumerate(words)}
    
df['word_count'] = df['review'].apply(lambda x: len(x.split()))
print(f'Length of the longest review: {df["word_count"].max()}')

Length of the longest review: 1416
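
The vocabulary construction can be tried on a toy corpus. This is a minimal sketch with made-up reviews and a vocabulary size of 3 instead of 1000; `train_reviews`, `VOCAB_SIZE` and `word_to_index` are illustrative names, not part of the lecture code.

```python
from collections import Counter

# Made-up, already preprocessed training reviews
train_reviews = [
    "great movie great cast",
    "bad movie",
    "great fun",
]

counts = Counter()
for review in train_reviews:
    counts.update(review.split())  # count every word occurrence

VOCAB_SIZE = 3  # the lecture uses 1000
vocab = [word for word, _ in counts.most_common(VOCAB_SIZE)]

# Indices start at 1 so that 0 stays free for padding and unknown words
word_to_index = {word: i + 1 for i, word in enumerate(vocab)}

print(word_to_index)  # {'great': 1, 'movie': 2, 'cast': 3}
```

Words outside the kept vocabulary ("bad" and "fun" here) have no entry and will later be mapped to 0.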


### Tokenization and Padding
Neural networks cannot process raw text. If we want a computer to understand words, we need to convert them to numbers. This is where tokenization comes in. Here, tokenization means converting words (or letters) to numbers: every word in the vocabulary is mapped to an index, and that index is later used to turn the word into a vector.
<br>For example, the sequence ['I', 'love', 'data', 'science'] is transformed into [5, 64, 22, 103], so that the model knows which vocabulary entry each word refers to.<br>
In more depth, each index stands for a one-hot vector. However, PyTorch lets us train the model directly on the index numbers, which is more memory efficient.
<br>
Padding, on the other hand, is not as fundamental as tokenization. It solves the batching problem: when we work with batches, every row must have the same length. However, different reviews have different lengths, so we compensate for the difference in size with zeros.
<br>
For example, with a maximum length of 10, the series [5, 64, 22, 103] becomes [0, 0, 0, 0, 0, 0, 5, 64, 22, 103].
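
Both steps can be reproduced in plain Python. The sketch below uses a hypothetical vocabulary (`vocab_demo`) chosen so that the indices match the example in the text; `encode` and `left_pad` are illustrative helpers, not the functions used in the cells that follow.

```python
# Hypothetical vocabulary reproducing the indices used in the text
vocab_demo = {"i": 5, "love": 64, "data": 22, "science": 103}

def encode(sentence, vocab):
    """Map each word to its index; unknown words become 0."""
    return [vocab.get(word, 0) for word in sentence.lower().split()]

def left_pad(indices, max_length):
    """Truncate to max_length, then fill the front with zeros."""
    indices = indices[:max_length]
    return [0] * (max_length - len(indices)) + indices

tokens = encode("I love data science", vocab_demo)
print(tokens)                  # [5, 64, 22, 103]
print(left_pad(tokens, 10))    # [0, 0, 0, 0, 0, 0, 5, 64, 22, 103]
print(left_pad(tokens, 2))     # [5, 64]
print(encode("I love pizza", vocab_demo))  # [5, 64, 0]
```

In this setup the index 0 plays two roles: it marks padding and it stands in for every word outside the vocabulary.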

In [8]:
MAX_SEQ_LENGTH = df["word_count"].max() ## this can be changed according to the longest review

In [9]:
## assign an index to each word; words outside the vocabulary map to 0
def tokenize(word):
    return words_dict.get(word, 0)
    
def preprocess(seq):
    return [tokenize(word) for word in seq.split()]

## pad (on the left, with zeros) or truncate every review to max_length so that all rows of the review matrix have the same length
def pad_seq(seq, max_length):
    tensor = np.zeros((len(seq), max_length), dtype=np.int64)
    for i, sentence in enumerate(seq):
        sentence = sentence[:max_length]  ## truncate reviews that are too long
        if sentence:
            tensor[i, -len(sentence):] = sentence  ## left-pad: tokens go at the end of the row

    return torch.tensor(tensor, dtype=torch.long).to(device)

## tokenization
X_train = [preprocess(x) for x in X_train]
X_test = [preprocess(x) for x in X_test]

## padding
X_train = pad_seq(X_train, MAX_SEQ_LENGTH)
X_test = pad_seq(X_test, MAX_SEQ_LENGTH)

## convert labels from dataframe to tensor
y_train = torch.tensor(y_train.to_numpy().reshape((-1,1)), dtype=torch.float).to(device)
y_test = torch.tensor(y_test.to_numpy().reshape((-1,1)), dtype=torch.float).to(device)

## data loaders with a batch size of 64
train_set = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_set = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False) ## no need to shuffle for evaluation

### Model Building

In [10]:
## Hyper parameters
EMB_DIM = 64 ## dimension of 1 embedding vector
HIDDEN_DIM = 256 ## dimension of the LSTM hidden state
EPOCHS = 5
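
`EMB_DIM` is the length of the vector that each word index is mapped to. As mentioned in the tokenization section, an index stands for a one-hot vector; the sketch below checks that an embedding lookup gives the same result as multiplying the one-hot vector by the embedding matrix, without ever building the one-hot vector. The sizes are small illustrative assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 6, 4                 # illustrative sizes (the lecture uses 1001 and 64)
embedding = nn.Embedding(vocab_size, emb_dim)

indices = torch.tensor([2, 5, 0])          # three word indices

# Direct lookup: row i of the weight matrix for each index i
looked_up = embedding(indices)

# Same result via explicit one-hot vectors
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(looked_up.shape)                        # torch.Size([3, 4])
print(torch.allclose(looked_up, via_matmul))  # True
```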

In [11]:
class lstm_model(nn.Module):
    def __init__(self):
        super().__init__()
        global words
        input_dim = len(words)+1
        emb_dim = EMB_DIM
        hidden_dim = HIDDEN_DIM
  
        self.embedding = nn.Embedding(input_dim, emb_dim) ## embedding layer [numberofwords, embeddingdimension]
        self.lstm = nn.LSTM(emb_dim,hidden_dim,batch_first=True) ## lstm layer
        self.dropout = nn.Dropout(0.3) ## drop 30% of neural units
        self.fc = nn.Linear(hidden_dim, 1) ## fully connected layer
    
    def forward(self, batch):
        embeds = self.embedding(batch) ## [batch, seq_len, emb_dim]
        lstm_out, _ = self.lstm(embeds) ## [batch, seq_len, hidden_dim]
        out = self.dropout(lstm_out[:, -1, :]) ## keep only the output of the last time step
        out = torch.sigmoid(self.fc(out)) ## output layer is sigmoid because it is a binary classification problem
        return out
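
To make the tensor shapes in the forward pass concrete, the following sketch pushes a fake batch through the same kinds of layers. The layer sizes follow the hyperparameters above; the batch size, sequence length and random token indices are assumptions for illustration, and the layers are untrained.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seq_len = 2, 7                        # illustrative sizes
vocab_size, emb_dim, hidden_dim = 1001, 64, 256   # as in the lecture

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

batch = torch.randint(0, vocab_size, (batch_size, seq_len))  # fake token indices

embeds = embedding(batch)              # [2, 7, 64]
lstm_out, (h_n, c_n) = lstm(embeds)    # lstm_out: [2, 7, 256], h_n: [1, 2, 256]
last_step = lstm_out[:, -1, :]         # [2, 256]
probs = torch.sigmoid(fc(last_step))   # [2, 1], values between 0 and 1

print(embeds.shape, lstm_out.shape, probs.shape)
# For a single-layer, unidirectional LSTM the last output equals the final hidden state
print(torch.allclose(last_step, h_n[-1]))  # True
```

Because the reviews are padded on the left, the last time step always corresponds to the final real word of the review, which is why reading the last output is a reasonable summary here.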

### Training

In [12]:
model = lstm_model().to(device) ## load model to device (if gpu exists, it loads to gpu, otherwise, to cpu)
loss_function = nn.BCELoss() ## loss function is binary cross entropy
optimizer = optim.Adam(model.parameters()) ## optimizer is adam

epochs = EPOCHS

for epoch in range(epochs):
    loss_sum = 0
    model.train()
    for i, (sentence, sentiment) in enumerate(train_loader):
        model.zero_grad()
        output = model(sentence)
        loss = loss_function(output, sentiment)
        loss.backward()
        loss_sum += loss.item()
        optimizer.step()
        print(f"Epoch {epoch+1}/{epochs}\tLoss:{loss_sum/(i+1):.4f}\t({100*(i+1)/len(train_loader):.0f}%)", end='\r')
    model.eval()
    valid_loss = 0
    with torch.no_grad():
        for i, (sentence, sentiment) in enumerate(test_loader):
            output = model(sentence)
            loss = loss_function(output, sentiment)
            valid_loss += loss.item()
    print(f"Epoch {epoch+1}/{epochs}\tLoss:{loss_sum/len(train_loader):.4f}\tValidation Loss:{valid_loss/len(test_loader):.4f}")

Epoch 1/5	Loss:0.6035	Validation Loss:0.5514
Epoch 2/5	Loss:0.4735	Validation Loss:0.3941
Epoch 3/5	Loss:0.3648	Validation Loss:0.3386
Epoch 4/5	Loss:0.3285	Validation Loss:0.3400
Epoch 5/5	Loss:0.3045	Validation Loss:0.3296
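
The loss values printed above are binary cross entropy averaged over the batches. As a small worked example, the sketch below computes the loss by hand for three made-up predictions and compares it with `nn.BCELoss`; the probabilities and labels are illustrative, not model outputs.

```python
import math
import torch
from torch import nn

probs = torch.tensor([[0.9], [0.2], [0.6]])    # made-up model outputs
targets = torch.tensor([[1.0], [0.0], [0.0]])  # made-up labels

# Binary cross entropy by hand: -[y*log(p) + (1-y)*log(1-p)], averaged over the batch
manual = -(math.log(0.9) + math.log(1 - 0.2) + math.log(1 - 0.6)) / 3

loss = nn.BCELoss()(probs, targets).item()
print(f"manual={manual:.4f} torch={loss:.4f}")
```

The confident mistake (0.6 for a negative review) contributes the largest term, which is how the loss pushes the model away from wrong but confident predictions.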


### Test

In [13]:
y_true = [] ## ground truth labels
y_pred = [] ## predictions

In [14]:
## in this loop, prediction is performed batch by batch
model.eval() ## disable dropout for inference
with torch.no_grad(): ## gradients are not needed for prediction
    for sentence, sentiment in test_loader:
        ## make prediction for 1 batch
        pred = model(sentence)

        y_true.extend(sentiment.cpu().numpy())
        y_pred.extend(pred.cpu().numpy())

y_pred = np.asarray(y_pred)
y_pred = np.where(y_pred>0.5, 1, 0) ## since predictions are probabilities,
                                  ## we threshold them at 0.5 to get integer labels

### Evaluation

In [15]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [16]:
print("Accuracy: " + str(accuracy_score(y_true, y_pred)))
print("Conf Matrix: " + str(confusion_matrix(y_true, y_pred)))

Accuracy: 0.8583
Conf Matrix: [[4004  957]
 [ 460 4579]]
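
The confusion matrix contains everything needed for the other metrics imported above. The sketch below derives them by hand from the four reported counts, relying on scikit-learn's layout in which rows are true labels and columns are predicted labels.

```python
# Confusion matrix reported above; rows = true label, columns = predicted label
tn, fp = 4004, 957    # true negatives, false positives
fn, tp = 460, 4579    # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the reviews predicted positive, how many really are
recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy agrees with the 0.8583 printed above. The model makes more false positives (957) than false negatives (460), so its recall is higher than its precision.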


### Save Model

Lastly, do not forget to save your model if you are happy with your results :)

In [17]:
torch.save(model, "sentiment_imdb.pt")
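
Saving the whole model object pickles it together with a reference to its class, so loading it later requires the same class definition to be available. A common alternative is to save only the learned parameters with `state_dict()`. The sketch below shows the round trip on a small stand-in `nn.Linear` model; the file name and the stand-in model are assumptions for illustration, and in this lecture the model would be the trained `lstm_model` instance.

```python
import os
import tempfile
import torch
from torch import nn

# Small stand-in model; in the lecture this would be the trained lstm_model instance
demo_model = nn.Linear(4, 1)

path = os.path.join(tempfile.mkdtemp(), "demo_state.pt")  # hypothetical file name
torch.save(demo_model.state_dict(), path)   # save only the learned parameters

# Loading requires re-creating the architecture first
restored = nn.Linear(4, 1)
restored.load_state_dict(torch.load(path))
restored.eval()                              # switch to inference mode before predicting

x = torch.ones(1, 4)
print(torch.equal(demo_model(x), restored(x)))  # True
```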

# Exercise
Using the methods above, create a model that classifies messages as spam or not spam. The dataset is given below.
<br>
https://www.kaggle.com/ozlerhakan/spam-or-not-spam-dataset