# Sentiment Analysis with an RNN

#### Contents
    1. Introduction
    2. Import Data and Pre-processing
    3. Make DataLoaders
    4. Implementation of an RNN Model
    5. Train and Validation
    6. Test and Prediction
    7. Reference

## 1. Introduction
With the advent of deep learning, many problems that earlier machine learning techniques could not solve well have become tractable. **Sentiment analysis** is a representative example. While traditional techniques rely on hand-crafted features, deep learning learns from a large amount of pre-processed data to make predictions. This notebook pre-processes the raw IMDB dataset and builds a simple deep neural network that predicts the sentiment of a given review.

## 2. Import Data and Pre-processing
### 2.1 Import Data
Download the data from [here](https://github.com/udacity/deep-learning-v2-pytorch/tree/master/sentiment-rnn/data). 
<br> The original dataset can be downloaded from [here](https://ai.stanford.edu/~amaas/data/sentiment/).

In [1]:
import numpy as np

# read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [2]:
print(reviews[:1000])
print()
print(labels[:99])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

### 2.2 Encoding

Before the data is fed into a deep learning model, it must be converted to numerical values, because the model cannot process language the way a human does. This conversion is called **encoding**: each word is mapped to an integer. The data should be cleaned before encoding; as the well-known maxim in deep learning goes, **"Garbage In, Garbage Out."**

The pre-processing steps are as follows:
1. Convert the text to lowercase and remove punctuation.
2. Split the text into reviews using `\n` as the delimiter.
3. Combine all the reviews back together into one big string and split it into words.

In [3]:
from string import punctuation

# remove punctuation
reviews = reviews.lower()
text = ''.join([c for c in reviews if c not in punctuation])
print(punctuation)

# split by new lines and spaces
reviews_split = text.split('\n')
text = ' '.join(reviews_split)

# create a list of words
words = text.split()

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [4]:
words[:30]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me']

### 2.3 Build a Dictionary and Encode the Reviews

The embedding lookup requires integers, so build a **dictionary** that maps each word in the vocabulary to an integer. With this dictionary, the reviews can be converted into integer sequences before being fed to the network. Indices start at 1 so that 0 can be reserved for padding.

In [5]:
from collections import Counter

# make a dictionary that maps vocabs to integers
word_counts = Counter(words)
vocab = sorted(word_counts, key = word_counts.get, reverse = True)

vocab2idx = {word: idx for idx, word in enumerate(vocab, 1)}

In [6]:
print("Size of Vocabulary: ", len(vocab))

Size of Vocabulary:  74072


In [7]:
encoded_reviews = []
for review in reviews_split:
    encoded_reviews.append([vocab2idx[vocab] for vocab in review.split()])

In [8]:
print("The number of reviews: ", len(encoded_reviews))

The number of reviews:  25001
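
The dictionary-and-encode steps can be easier to follow on a tiny corpus. The sketch below uses three made-up reviews (`toy_reviews` and the other `toy_*` names are illustrative and not part of the notebook) and follows the same pattern: count words, sort by frequency, and number them starting from 1.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the IMDB data)
toy_reviews = ["good movie", "bad movie", "good good plot"]

# Count words across all reviews and sort by descending frequency
toy_counts = Counter(" ".join(toy_reviews).split())
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)

# Start indices at 1 so that 0 stays free for padding
toy_vocab2idx = {word: idx for idx, word in enumerate(toy_vocab, 1)}

# Encode each review as a list of integers
toy_encoded = [[toy_vocab2idx[w] for w in review.split()] for review in toy_reviews]

print(toy_vocab2idx)
print(toy_encoded)
```

The most frequent word receives index 1, and no word is ever mapped to 0.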


### 2.4 Encode the labels
Negative and positive labels are encoded as the integers 0 and 1, respectively, so that they can be fed into the deep neural network.

In [9]:
splitted_labels = labels.split("\n")
encoded_labels = np.array([
    1 if label == "positive" else 0 for label in splitted_labels
])

In [10]:
encoded_labels

array([1, 0, 1, ..., 1, 0, 0])

In [11]:
print("The number of  labels: ", len(encoded_labels))

The number of  labels:  25001


### 2.5 Remove Outliers
Reviews of length 0 should be removed before further processing. Padding is then applied to the remaining reviews so that they all have the same length.

In [12]:
length_reviews = Counter([len(x) for x in encoded_reviews])
print("Zero-length reviews: ", length_reviews[0])
print("Maximum review length: ", max(length_reviews))

Zero-length reviews:  1
Maximum review length:  2514


In [13]:
# get indices of any reviews with length 0
non_zero_idx = [i for i, review in enumerate(encoded_reviews) if len(review) != 0]

# remove 0-length reviews and their labels
encoded_reviews = [encoded_reviews[i] for i in non_zero_idx]
encoded_labels = np.array([encoded_labels[i] for i in non_zero_idx])

print("The number of reviews: ", len(encoded_reviews))
print("The number of  labels: ", len(encoded_labels))

The number of reviews:  25000
The number of  labels:  25000


### 2.6 Padding Sequences

To handle both very long and very short reviews, short reviews are left-padded with 0s and long reviews are truncated to their first `seq_length` words, so that every review has the same length. `seq_length` is set to 200 in this notebook.
<br>For example, a review of length 4 with a `seq_length` of 10 is converted as follows:
- Before: [64, 128, 256, 512]
- After: [0, 0, 0, 0, 0, 0, 64, 128, 256, 512]

In [14]:
def text_padding(encoded_reviews, seq_length):
    
    reviews = []
    
    for review in encoded_reviews:
        if len(review) >= seq_length:
            reviews.append(review[:seq_length])
        else:
            reviews.append([0]*(seq_length-len(review)) + review)
        
    return np.array(reviews)

In [15]:
seq_length = 200
padded_reviews = text_padding(encoded_reviews, seq_length)

In [16]:
print(padded_reviews[:12, :12])

[[    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35  1819    16]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819    56    17]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5     1   730]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]]
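
As a quick check of the padding and truncation rule, here is a minimal pure-Python sketch. `pad_or_truncate` is a hypothetical helper that mirrors the per-review logic of `text_padding` without NumPy.

```python
def pad_or_truncate(tokens, seq_length):
    """Left-pad with 0s, or keep only the first seq_length tokens (hypothetical helper)."""
    if len(tokens) >= seq_length:
        return tokens[:seq_length]
    return [0] * (seq_length - len(tokens)) + tokens

short_review = [64, 128, 256, 512]   # shorter than seq_length: gets padded
long_review = list(range(1, 16))     # 15 tokens, longer than seq_length: gets truncated

print(pad_or_truncate(short_review, 10))
print(pad_or_truncate(long_review, 10))
```

Padding on the left keeps the real words at the end of the sequence, which suits a model that reads its prediction from the last time step.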


## 3. Make DataLoaders

Split the data into training, validation, and test sets with a ratio of 8:1:1. Then wrap the reviews and labels with `TensorDataset` and `DataLoader`.

In [17]:
ratio = 0.8
train_length = int(len(padded_reviews) * ratio)

X_train = padded_reviews[:train_length]
y_train = encoded_labels[:train_length]

remaining_x = padded_reviews[train_length:]
remaining_y = encoded_labels[train_length:]

test_length = int(len(remaining_x)*0.5)

X_val = remaining_x[: test_length]
y_val = remaining_y[: test_length]

X_test = remaining_x[test_length :]
y_test = remaining_y[test_length :]

In [18]:
print("Feature shape of train review set: ", X_train.shape)
print("Feature shape of   val review set: ", X_val.shape)
print("Feature shape of  test review set: ", X_test.shape)

Feature shape of train review set:  (20000, 200)
Feature shape of   val review set:  (2500, 200)
Feature shape of  test review set:  (2500, 200)
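
The arithmetic behind the 8:1:1 split can be sketched on its own. `split_sizes` is a hypothetical helper, not part of the notebook; it repeats the two-stage slicing used above (take 80% for training, then halve the remainder).

```python
def split_sizes(n_samples, train_ratio=0.8):
    """Return (train, validation, test) sizes for a two-stage split (hypothetical helper)."""
    n_train = int(n_samples * train_ratio)
    n_val = int((n_samples - n_train) * 0.5)
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test

print(split_sizes(25000))
```

Note that the split slices the arrays in their stored order, without shuffling them first.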


In [19]:
import torch
from torch.utils.data import TensorDataset, DataLoader

batch_size = 50
device = "cuda" if torch.cuda.is_available() else "cpu"

In [20]:
train_dataset = TensorDataset(torch.from_numpy(X_train).to(device), torch.from_numpy(y_train).to(device))
valid_dataset = TensorDataset(torch.from_numpy(X_val).to(device), torch.from_numpy(y_val).to(device))
test_dataset = TensorDataset(torch.from_numpy(X_test).to(device), torch.from_numpy(y_test).to(device))

train_loader = DataLoader(train_dataset, batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(valid_dataset, batch_size = batch_size, shuffle = True)
test_loader = DataLoader(test_dataset, batch_size = batch_size, shuffle = True)

In [21]:
data_iter = iter(train_loader)
X_sample, y_sample = next(data_iter)

In [22]:
print('Sample input size: ', X_sample.size())
print('Sample input: \n', X_sample)
print()
print('Sample label size: ', y_sample.size())
print('Sample label: \n', y_sample)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[   10,   206,   133,  ...,  1335,     8,     1],
        [   59,   150,    69,  ...,    35,  1816, 44390],
        [   10,   216,     3,  ...,     3, 16548,  1557],
        ...,
        [  136,     1,   846,  ...,     1,  1393,   281],
        [    1,   421,     4,  ...,    29,     1,  8363],
        [    0,     0,     0,  ...,   213,   280,     3]], dtype=torch.int32)

Sample label size:  torch.Size([50])
Sample label: 
 tensor([1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1], dtype=torch.int32)


## 4. Implementation of an RNN Model

![SentimentAnalysis](./images/SentimentAnalysis.png)

Pre-processing, including tokenizing, is now done. The next step is to build a neural network model that predicts the sentiment of reviews. 

First, an **embedding layer** converts word tokens into embeddings of a specific size. It acts as a lookup table and replaces one-hot encoding, whose vectors would be as large as the vocabulary and suffer from the curse of dimensionality.
<br> Second, an **LSTM** layer is defined by `hidden_size` and `num_layers`.
<br> Third, **a fully connected layer** maps the output of the LSTM layer to the desired output size.
<br> Lastly, **a sigmoid activation layer** turns the outputs into probabilities between 0 and 1.

The above image is from [here](https://towardsdatascience.com/reading-between-the-layers-lstm-network-7956ad192e58).
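
To illustrate why an embedding layer can be seen as a lookup table, the pure-Python sketch below compares a direct row lookup with multiplying a one-hot vector by the same table. The table values and helper names are made up for illustration; in the real model the table is learned.

```python
# Hypothetical embedding table: 4 token indices, 3 dimensions (values are arbitrary)
embedding_table = [
    [0.9, 0.1, 0.5],   # index 0 (used for padding in this notebook)
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def row_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]

token = 2
via_lookup = embedding_table[token]
via_one_hot = row_times_matrix(one_hot(token, len(embedding_table)), embedding_table)

print(via_lookup)
print(via_one_hot)
```

Both routes give the same vector, but the lookup never builds a vocabulary-sized one-hot vector.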


In [23]:
import torch.nn as nn
from torch.autograd import Variable

In [24]:
class Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, num_layers):
        super(Model, self).__init__()
        
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        
        # embedding and LSTM
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        self.lstm = nn.LSTM(input_size = embedding_dim, 
                            hidden_size = hidden_dim, 
                            num_layers = num_layers, 
                            batch_first = True, 
                            dropout = 0.5, 
                            bidirectional = False)
        
        # fully connected layers
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, output_dim),
            nn.Sigmoid()
        )
        
    def forward(self, token, hidden):
        
        batch_size = token.size(0)
        
        # embedding and lstm output
        out = self.embedding(token.long())
        out, hidden = self.lstm(out, hidden)
        
        # stack up lstm outputs
        out = out.contiguous().view(-1, self.hidden_dim)
        
        # fully connected layer
        out = self.fc(out)
        
        # reshape to be batch_size first
        out = out.view(batch_size, -1)
        
        # keep only the output of the last time step for each review
        out = out[:, -1]
    
        return out
    
    def init_hidden(self, batch_size):
        # zero-initialized hidden state and cell state for every LSTM layer
        return (torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=device), 
                torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=device))

### Constants

- `vocab_size` : size of the vocabulary, plus one for the padding index 0
- `embedding_dim` : number of columns in the embedding lookup table
- `hidden_dim` : number of units in the hidden layers of the LSTM cells
- `output_dim` : size of the desired output, in this case 1 (positive/negative)
- `num_layers` : number of stacked LSTM layers

In [25]:
vocab_size = len(vocab)+1 # +1 because index 0 is reserved for padding
embedding_dim = 400
hidden_dim = 256
output_dim = 1
num_layers = 2

In [26]:
model = Model(vocab_size, embedding_dim, hidden_dim, output_dim, num_layers).to(device)

In [27]:
model

Model(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=256, out_features=1, bias=True)
    (2): Sigmoid()
  )
)
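
The printed layer sizes make it possible to estimate the number of trainable parameters by hand. The sketch below is a back-of-the-envelope calculation, assuming PyTorch's `nn.LSTM` layout of four gates per layer, each with input weights, recurrent weights, and two bias vectors. `lstm_layer_params` is a hypothetical helper.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one unidirectional LSTM layer (hypothetical helper).

    Assumes 4 gates, each with input weights, recurrent weights, and two bias vectors.
    """
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

# Sizes taken from the model printout above
vocab_size, embedding_dim, hidden_dim, output_dim = 74073, 400, 256, 1

embedding_params = vocab_size * embedding_dim
lstm_params = (lstm_layer_params(embedding_dim, hidden_dim)   # first layer reads embeddings
               + lstm_layer_params(hidden_dim, hidden_dim))   # second layer reads layer 1
fc_params = hidden_dim * output_dim + output_dim              # weights plus bias
total_params = embedding_params + lstm_params + fc_params

print("Embedding:", embedding_params)
print("LSTM:     ", lstm_params)
print("Linear:   ", fc_params)
print("Total:    ", total_params)
```

If the model is in memory, `sum(p.numel() for p in model.parameters())` should give the same total. The embedding table accounts for the vast majority of the parameters.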

## 5. Train and Validation

For the loss function, `BCELoss` (**Binary Cross Entropy Loss**) is used, because the model outputs a probability between 0 and 1 for a binary label.<br/>
The Adam optimizer is used with a learning rate of 0.001. <br/>
In addition, `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 5)` mitigates the exploding gradient problem in RNNs (it does not address vanishing gradients). `max_norm` is the maximum allowed norm of the gradients; when the overall gradient norm exceeds it, the gradients are scaled down to that norm.
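
The idea behind clipping gradients by their norm can be sketched in plain Python. `clip_by_global_norm` is a hypothetical, simplified helper that works on a flat list of numbers. The library function operates on parameter tensors in place and differs in details such as numerical safeguards.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm (simplified, hypothetical helper)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# A large gradient is scaled down; its direction is preserved
clipped, norm_before = clip_by_global_norm([6.0, 8.0], max_norm=5)
print(norm_before, clipped)

# A small gradient is left untouched
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=5)
print(unchanged)
```

Because every component is multiplied by the same factor, clipping changes the step size but not the update direction.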

In [28]:
# Loss function and Optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)
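
For intuition about the values `nn.BCELoss` produces, here is a small hand-rolled version of mean binary cross entropy. `binary_cross_entropy` is a hypothetical helper for illustration, and the probabilities are made up.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples (hypothetical helper)."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

targets = [1, 0]
confident_right = binary_cross_entropy([0.9, 0.1], targets)
confident_wrong = binary_cross_entropy([0.1, 0.9], targets)
uninformed = binary_cross_entropy([0.5, 0.5], targets)

print("confident and right: {:.4f}".format(confident_right))
print("confident and wrong: {:.4f}".format(confident_wrong))
print("always 0.5:          {:.4f}".format(uninformed))
```

A model that always outputs 0.5 scores ln 2 (about 0.693), which is a useful baseline when reading the training log.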

In [29]:
# Constants
num_epochs = 3

train_losses = []
val_losses = []

In [30]:
for epoch in range(num_epochs):
    
    model.train()
    hidden = model.init_hidden(batch_size)
    
    for i, (review, label) in enumerate(train_loader):
        review, label = review.to(device), label.to(device)
        
        # Initialize Optimizer #
        optimizer.zero_grad()
        
        # Detach the hidden state from the previous batch's graph #
        hidden = tuple([h.data for h in hidden])
        
        # Feed Forward #
        output = model(review, hidden)
        
        # Calculate the Loss
        loss = criterion(output.squeeze(), label.float())
        
        # Back Propagation #
        loss.backward()
        
        # Prevent Exploding Gradient Problem #
        nn.utils.clip_grad_norm_(model.parameters(), 5)
        
        # Update #
        optimizer.step()
        
        train_losses.append(loss.item())
        
        # Print Statistics #
        if (i+1) % 100 == 0:
            
            ##################
            ### Evaluation ###
            ##################
            
            # initialize hidden state
            val_h = model.init_hidden(batch_size)
            val_losses = []

            model.eval()
            
            # gradients are not needed for validation
            with torch.no_grad():
                for review, label in valid_loader:
                    review, label = review.to(device), label.to(device)
                    val_h = tuple([h.data for h in val_h])
                    output = model(review, val_h)
                    val_loss = criterion(output.squeeze(), label.float())
                    
                    val_losses.append(val_loss.item())
                
            print("Epoch: {}/{} | Step {}, Train Loss {:.4f}, Val Loss {:.4f}".
                  format(epoch+1, num_epochs, i+1, np.mean(train_losses), np.mean(val_losses)))
            
            # switch back to training mode so that dropout is active again
            model.train()

Epoch: 1/3 | Step 100, Train Loss 0.6720, Val Loss 0.6383
Epoch: 1/3 | Step 200, Train Loss 0.6623, Val Loss 0.6645
Epoch: 1/3 | Step 300, Train Loss 0.6594, Val Loss 0.6032
Epoch: 1/3 | Step 400, Train Loss 0.6522, Val Loss 0.6602
Epoch: 2/3 | Step 100, Train Loss 0.6360, Val Loss 0.5446
Epoch: 2/3 | Step 200, Train Loss 0.6201, Val Loss 0.6153
Epoch: 2/3 | Step 300, Train Loss 0.6072, Val Loss 0.5183
Epoch: 2/3 | Step 400, Train Loss 0.5962, Val Loss 0.5881
Epoch: 3/3 | Step 100, Train Loss 0.5802, Val Loss 0.5452
Epoch: 3/3 | Step 200, Train Loss 0.5650, Val Loss 0.5022
Epoch: 3/3 | Step 300, Train Loss 0.5458, Val Loss 0.4594
Epoch: 3/3 | Step 400, Train Loss 0.5299, Val Loss 0.4412


## 6. Test and Prediction

The trained model is evaluated on the test dataset to compute the average loss and accuracy.

In [31]:
def test(net, loader):
    
    # to calculate average of loss at the end
    losses = []
    corrects = 0
    
    # initialize hidden state
    hidden = net.init_hidden(batch_size)
    
    # declare evaluation mode
    net.eval()
    
    for review, label in loader:
        
        review, label = review.to(device), label.to(device)
        hidden = tuple([h.data for h in hidden])
        output = net(review, hidden)
        
        # calculate losses
        loss = criterion(output.squeeze(), label.float())
        losses.append(loss.item())
        
        # convert output probabilities to classes, 0 or 1
        pred = torch.round(output.squeeze())
        
        # compare with true labels
        correct_tensor = pred.eq(label.float().view_as(pred))
        correct = np.squeeze(correct_tensor.cpu().numpy())
        corrects += np.sum(correct)
        
    accuracy = corrects / len(loader.dataset)
        
    print("Test Loss: {:.6f}".format(np.mean(losses)))
    print("Test Accuracy: {:.3f}".format(accuracy))

In [32]:
test(model, test_loader)

Test Loss: 0.449302
Test Accuracy: 0.804
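
The accuracy computation boils down to thresholding the predicted probabilities and counting matches. A minimal sketch with made-up numbers follows; `accuracy_from_probs` is a hypothetical helper. It treats a probability of exactly 0.5 as class 0, on the assumption that `torch.round` rounds ties to the nearest even integer.

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label (hypothetical helper)."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return correct / len(labels)

# Made-up probabilities and labels, for illustration only
probs = [0.92, 0.31, 0.55, 0.08]
labels = [1, 0, 0, 0]

print(accuracy_from_probs(probs, labels))
```

Here the third sample is predicted positive but labelled negative, so three out of four predictions are correct.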


In [33]:
def inference(net, sentence):
    
    # tokenize the given sentence the same way as the training data
    sentence = sentence.lower()
    sentence = ''.join([c for c in sentence if c not in punctuation])
    
    words = sentence.split()
    
    # encode the words, skipping any that are not in the vocabulary
    tokens = []
    tokens.append([vocab2idx[word] for word in words if word in vocab2idx])
    
    # padding for constant length
    sequence_length = 200
    padded_sentence = text_padding(tokens, sequence_length)
    
    # convert to tensor in order to feed into the neural network
    sentence_tensor = torch.from_numpy(padded_sentence).to(device)

    # calculate the batch size
    batch_size = sentence_tensor.size(0)
    
    # initialize hidden state and switch to evaluation mode (disables dropout)
    h = net.init_hidden(batch_size)
    net.eval()
    
    # feed forward
    output = net(sentence_tensor, h)
    
    # calculate rounded prediction
    pred = torch.round(output.squeeze())
    print("Predicted value is {:.4f}".format(output.item()))
    
    if pred.item() == 0:
        print("It is a negative review.")
    else:
        print("It is a positive review.")    

In [34]:
sentence = "This movie was horrible, don't wanna waste my time anymore so I went out of a theater!"

In [35]:
inference(model, sentence)

Predicted value is 0.0573
It is a negative review.


## 7. Reference
- [Udacity Deep Learning V2 PyTorch Repository](https://github.com/udacity/deep-learning-v2-pytorch/tree/master/sentiment-rnn)