# Exercise 5 (NLP): Very Deep Learning

**Natural language processing (NLP)** is the ability of a computer program to understand human language as it is spoken. It involves a pipeline of steps and by the end of the exercise, we would be able to classify the sentiment of a given review as POSITIVE or NEGATIVE.


Before starting, it is important to understand the need for RNNs and the lecture from Stanford is a must to see before starting the exercise:

https://www.youtube.com/watch?v=iX5V1WpxxkY

When done, let's begin. 

In [1]:
# In this exercise, we will import libraries when needed so that we understand the need for it. 
# However, this is a bad practice and don't get used to it.
import numpy as np

# read data from reviews and labels file.
with open('data/reviews.txt', 'r') as f:
    reviews_ = f.readlines()
with open('data/labels.txt', 'r') as f:
    
    labels = f.readlines()

In [2]:
# One of the most important task is to visualize data before starting with any ML task. 
for i in range(5):
    print(labels[i] + "\t: " + reviews_[i][:100] + "...")

positive
	: bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life...
negative
	: story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terr...
positive
	: homelessness  or houselessness as george carlin stated  has been an issue for years but never a plan...
negative
	: airport    starts as a brand new luxury    plane is loaded up with valuable paintings  such belongin...
positive
	: brilliant over  acting by lesley ann warren . best dramatic hobo lady i have ever seen  and love sce...




We can see there are a lot of punctuation marks like fullstop(.), comma(,), new line (\n) and so on and we need to remove it. 

Here is a list of all the punctuation marks that needs to be removed 
```
(!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~)
```


## Task 1: Remove all the punctuation marks from the reviews.
Many ways of doing it: Regex, Spacy, import punctuation from string.

In [3]:
# Make everything lower case to make the whole dataset even. 
reviews = ''.join(reviews_).lower()


In [4]:
# complete the function below to remove punctuations and save it in no_punct_text
import re

def text_without_punct(reviews):
    spl_char = '[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+'
    #return re.sub('[^A-Za-z0-9]+','',reviews)
    return re.sub(spl_char,'',reviews).strip()


no_punct_text = text_without_punct(reviews)
reviews_split = no_punct_text.split('\n')
print('labels-',len(labels),'reviews ',len(reviews_split))
reviews_split[0]

('labels-', 25000, 'reviews ', 25000)


'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   '

In [5]:
# split the formatted no_punct_text into words
def split_in_words(no_punct_text):
    return no_punct_text.split()

words = split_in_words(no_punct_text)

In [6]:
# once you are done print the ten words that should yield the following output
words[:10]

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']

In [7]:
# print the total length of the words
len(words)

6020196

In [8]:
# Total number of unique words
len(set(words))

74072


Next step is to create a vocabulary. This way every word is mapped to an integer number.
```
Example: 1: hello, 2: I, 3: am, 4: Robo and so on...
```


In [9]:
# Lets create a vocab out of it

# feel free to use this import 
from collections import Counter

## Let's keep a count of all the words and let's see how many words are there. 
def word_count(words):
    return Counter(words)

counts=word_count(words)

In [10]:
# If you did everything correct, this is what you should get as output. 
print (counts['wonderful'])

print (counts['bad'])


1658
9308


## Task 2: Word to Integer and Integer to word
The task is to map every word to an integer value and then vice-versa. 


In [11]:
# define a vocabulary for the words
def vocabulary(counts):
    vocab = []
    for c in counts:
        vocab.append(c)
    return vocab

vocab = vocabulary(counts)
print(len(vocab))
vocab[1]

74072


'tsukino'

In [12]:
# map each vocab word to an integer. Also, start the indexing with 1 as we will use 
# '0' for padding and we dont want to mix the two.
def vocabulary_to_integer(vocab):
    vocab_to_int = {}
    for i in range(len(vocab)):
        vocab_to_int[vocab[i]] = i
    return vocab_to_int

vocab_to_int = vocabulary_to_integer(vocab)

In [13]:
# verify if the length is same and if 'and' is mapped to the correct integer value.
print(len(vocab_to_int))
vocab_to_int['tsukino']

74072


1

Let's see what positve words in positive reviews we have and what we have in negative reviews. 

In [14]:
positive_counts = Counter()
negative_counts = Counter()

In [15]:
# loop over each sentence
for i in range(len(reviews_)):
    # if the sentence has positive review, all the words contained in the sentence contribute +1 to positive_counts
    if(labels[i] == 'positive\n'):
        for word in reviews_[i].split(" "):
            positive_counts[word] += 1
    # if the sentence has negative review, all the words contained in the sentence contribute +1 to negative_counts
    else:
        for word in reviews_[i].split(" "):
            negative_counts[word] += 1

In [16]:
labels

['positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive\n',
 'negative\n',
 'positive

In [17]:
positive_counts.most_common()

[('', 537968),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('\n', 12500),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('

In [18]:
negative_counts.most_common()

[('', 548962),
 ('.', 167538),
 ('the', 163389),
 ('a', 79321),
 ('and', 74385),
 ('of', 69009),
 ('to', 68974),
 ('br', 52637),
 ('is', 50083),
 ('it', 48327),
 ('i', 46880),
 ('in', 43753),
 ('this', 40920),
 ('that', 37615),
 ('s', 31546),
 ('was', 26291),
 ('movie', 24965),
 ('for', 21927),
 ('but', 21781),
 ('with', 20878),
 ('as', 20625),
 ('t', 20361),
 ('film', 19218),
 ('you', 17549),
 ('on', 17192),
 ('not', 16354),
 ('have', 15144),
 ('are', 14623),
 ('be', 14541),
 ('he', 13856),
 ('one', 13134),
 ('they', 13011),
 ('\n', 12500),
 ('at', 12279),
 ('his', 12147),
 ('all', 12036),
 ('so', 11463),
 ('like', 11238),
 ('there', 10775),
 ('just', 10619),
 ('by', 10549),
 ('or', 10272),
 ('an', 10266),
 ('who', 9969),
 ('from', 9731),
 ('if', 9518),
 ('about', 9061),
 ('out', 8979),
 ('what', 8422),
 ('some', 8306),
 ('no', 8143),
 ('her', 7947),
 ('even', 7687),
 ('can', 7653),
 ('has', 7604),
 ('good', 7423),
 ('bad', 7401),
 ('would', 7036),
 ('up', 6970),
 ('only', 6781),
 ('m

The above is just to show the most common words in the positive and negative sentences. However, there are a lot of unnecessary words like `the`, `a`, `was`, and so on. Can you find a way to show the relevant words and not these words? 

```
Hint: Stop Words removal or normalizing each term.
```

In [19]:
words[:30]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me']

In [20]:
[vocab_to_int[word] for word in words[:30]]

[43732,
 59198,
 28537,
 62650,
 52828,
 3699,
 28540,
 69932,
 20610,
 23859,
 29728,
 56652,
 20607,
 63343,
 36918,
 9940,
 50681,
 9415,
 19130,
 56915,
 20607,
 33161,
 32588,
 1442,
 62078,
 23859,
 64734,
 5599,
 73870,
 32565]

In [21]:
vocab_to_int['bromwell']

43732

## One hot encoding

We need one hot encoding for the labels. Think of a reason why we need one hot encoded labels for classes?

## Task 3: Create one hot encoding for the labels. 

* Write the one hot encoding logic in the `one_hot` function.
* Use 1 for positive label and 0 for negative label.
* Save all the values in the `encoded_labels` function.

In [22]:
# 1 for positive label and 0 for negative label
#from sklearn.preprocessing import LabelEncoder
def one_hot(labels):
    encoded_labels = np.asarray(labels)    
    condlist = [encoded_labels=='positive\n', encoded_labels=='negative\n']
    choicelist = [1, 0]
    one_hot = np.select(condlist, choicelist)
    return one_hot
    
 
encoded_labels = one_hot(labels)
encoded_labels


array([1, 0, 1, ..., 0, 1, 0])

In [23]:
#print the length of your label and uncomment next line only if the encoded_labels size is 25001.
# If you dont get the intuition behind this step, print encoded_labels to see it.
#encoded_labels = encoded_labels[:25000]

In [24]:
len(encoded_labels)

25000

In [25]:
# reviews_ints: list like reviews_split but containing corresponding integer instead of word. contains 25000 reviews
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])


In [26]:
# This step is to see if any review is empty and we remove it. Otherwise the input will be all zeroes.
# review_lens: how many similar length reviews occur and length of reviews
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))
review_lens[2514]

Zero-length reviews: 0
Maximum review length: 2514


1

In [27]:
print('Number of reviews before removing outliers: ', len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.

# get indices of any reviews with length 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))

('Number of reviews before removing outliers: ', 25000)
('Number of reviews after removing outliers: ', 25000)


In [28]:
len(encoded_labels)


25000

## Task 4: Padding the data

> Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network. 
* The data should come from `review_ints`, since we want to feed integers to the network. 
* Each row should be `seq_length` elements long. 
* For reviews shorter than `seq_length` words, **left pad** with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. 
* For reviews longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if the `seq_length=10` and an input review is: 
```
[117, 18, 128]
```
The resultant, padded sequence should be: 

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```

**Your final `features` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.**

In [29]:
# Write the logic for padding the data

def pad_features(reviews_ints, seq_length):
    padded = []
    for review in reviews_ints:
        
        if (len(review) >= seq_length):
            review = review[:seq_length]
        else:
            review = [0 for _ in range(seq_length-len(review))]+review[:]
        padded.append(review)
    
    return np.asarray(padded)

In [30]:
# Verify if everything till now is correct. 

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 batches 
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [54757 34566 45716 20607 71033 43935 20874   886 33120 20604]
 [30208 65137 20607 62650 69307 27812 42136 58102 28537 46955]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [14822 27396 42637 47823 32588  3213 32848 32565 56777 39300]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [23859 55573 46491 21663 62650 27812 73293 28753 62078 59825]
 [62078 59205 21153 51903 55651 37505 45643 34541 42661 34367]
 [    0     0     0     0     0     0     0     0     0

In [31]:
import random

a = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
# random.seed(101)
random.shuffle(a)
print(a)

[[7, 8, 9], [1, 2, 3, 4], [5, 6]]


Now we have everything ready. It's time to split our dataset into `Train`, `Test` and `Validate`. 

Read more about the train-test-split here : https://cs230-stanford.github.io/train-dev-test-split.html

## Task 5: Lets create train, test and val split in the ratio of 8:1:1.  

Hint: Either use shuffle and slicing in Python or use train-test-val split in Sklearn. 

In [32]:


import random

train_frac = 0.8
val_frac = 0.1
test_frac = 0.1


def train_test_val_split(features):
    random.seed(101)
    random.shuffle(features)
    split_1 = int(0.8 * len(features))
    split_2 = int(0.9 * len(features))
    train_x = features[:split_1]
    val_x = features[split_1:split_2]
    test_x = features[split_2:]
    return train_x, val_x, test_x

def train_test_val_labels(encoded_labels):
    random.seed(102)
    random.shuffle(features)
    split_1 = int(0.8 * len(encoded_labels))
    split_2 = int(0.9 * len(encoded_labels))
    train_y = encoded_labels[:split_1]
    val_y = encoded_labels[split_1:split_2]
    test_y = encoded_labels[split_2:]
    return train_y, val_y, test_y

train_x, val_x, test_x = train_test_val_split(features)
train_y, val_y, test_y = train_test_val_labels(encoded_labels)

In [33]:
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
('Train set: \t\t(20000, 200)', '\nValidation set: \t(2500, 200)', '\nTest set: \t\t(2500, 200)')


## DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.

### Task 6: Create a generator function for the dataset. 
See the above link for more info.

In [34]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets for train, test and val
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50 

# make sure to SHUFFLE your training data. Keep Shuffle=True.
train_loader = DataLoader(train_data, batch_size=batch_size,shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)

In [35]:
# obtain one batch of training data and label. 
dataiter = iter(train_loader)
sample_x, sample_y = dataiter.next()

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

('Sample input size: ', torch.Size([50, 200]))
('Sample input: \n', tensor([[    0,     0,     0,  ..., 34027, 34400, 42395],
        [27396, 57185, 59205,  ..., 61788,  7060, 53166],
        [65992,  6935, 71669,  ..., 21657, 29500, 40500],
        ...,
        [34296, 23859,  4088,  ...,   886, 68128,  2210],
        [    0,     0,     0,  ..., 59205, 58499, 22288],
        [    0,     0,     0,  ..., 66700, 19331, 28540]]))
()
('Sample label size: ', torch.Size([50]))
('Sample label: \n', tensor([1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 0]))


In [36]:
# Check if GPU is available.
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


## Creating the Model 

Here we are creating a simple RNN in PyTorch and pass the output to the a Linear layer and Sigmoid at the end to get the probability score and prediction as POSITIVE or NEGATIVE. 

The network is very similar to the CNN network created in Exercise 2. 

More info available at: https://pytorch.org/docs/0.3.1/nn.html?highlight=rnn#torch.nn.RNN

Read about the parameters that the RNN takes and see what will happen when `batch_first` is set as `True`.

In [37]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # RNN layer
        self.rnn = nn.RNN(vocab_size, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # RNN out layer
        print ('X-',x.shape, 'hidden',hidden.shape)
        rnn_out, hidden = self.rnn(x, hidden)
        print ('out-',rnn_out, 'hidden',hidden)
    
        # stack up lstm outputs
        rnn_out = rnn_out.view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(rnn_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # get last batch of labels
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

    


## Task 7 : Know the shape

Given a batch of 64 and input size as 1 and a sequence length of 200 to a RNN with 2 stacked layers and 512 hidden layers, find the shape of input data (x) and the hidden dimension (hidden) specified in the forward pass of the network. Note, the batch_first is kept to be True. 



In [38]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
hidden_dim = 256
n_layers = 2

#input shape = (64,200,vocab_size)
#hidden shape = (2*1,64,256)

net = SentimentRNN(vocab_size, output_size, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (rnn): RNN(74073, 256, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)



## Task 8: LSTM 

Before we start creating the LSTM, it is important to understand LSTM and to know why we prefer LSTM over a Vanilla RNN for this task. 
> Here are some good links to know about LSTM:
* [Colah Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [Understanding LSTM](http://blog.echen.me/2017/05/30/exploring-lstms/)
* [RNN effectiveness](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)


Now create a class named SentimentLSTM with `n_layers=2`, and rest all hyperparameters same as before. Also, create an embedding layer and feed the output of the embedding layer as input to the LSTM model. Dont forget to add a regularizer (dropout) layer after the LSTM layer with p=0.4 to prevent overfitting. 

In [103]:
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """
    The LSTM model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentLSTM, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # define embedding, LSTM, dropout and Linear layers here
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
       
        self.lstm= nn.LSTM(embedding_dim,hidden_dim,num_layers=self.n_layers)
       
        self.fc = nn.Linear(in_features=hidden_dim, out_features=output_size)
        
        self.dropout = nn.Dropout(p=drop_prob)
        # define embedding, LSTM, dropout and Linear layers here
        self.sig = nn.Sigmoid()

    def forward(self, input, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        # input = B x S . size(0) = B
        batch_size = input.size(0)
        seq_len = input.size(1)

        # input:  B x S  -- (transpose) --> S x B
        input = input.t()
        
        # Embedding Seq X Batch (50 x 200) ---->  Seq X Batch x E (50 x 200 x 300) (embedding size)
        #print("  input", input.size())
        embedded = self.embedding(input)
        #print("  embedding", embedded.size())
        
       
        lstm_out, hidden = self.lstm(embedded, hidden)
        #print('lstm output',lstm_out.size()) #('lstm hidden output', torch.Size([200, 50, 256]))
    
        # stack up lstm outputs
        out = lstm_out.view(-1, self.hidden_dim)
        #print('shaped output',out.size()) 
        
        # dropout and fully-connected layer
        out = self.dropout(out)
        fc_output = self.fc(out)
        #print('fc_output',fc_output.size()) 

        # sigmoid function
        sig_out = self.sig(fc_output)
        #print('sig_out',sig_out.size(),sig_out) 
       
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # get last batch of labels
        #print('reshaped sig_out',sig_out.size(),sig_out)
        # return last sigmoid output and hidden state
        return sig_out, hidden
      
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        

## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1-3

In [104]:
# Instantiate the model with these hyperparameters
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 300
hidden_dim = 256
n_layers = 2

net = SentimentLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentLSTM(
  (embedding): Embedding(74073, 300)
  (lstm): LSTM(300, 256, num_layers=2)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (dropout): Dropout(p=0.5)
  (sig): Sigmoid()
)


In [105]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)


### Task 9: Loss Functions
We are using `BCELoss (Binary Cross Entropy Loss)` since we have two output classes. 

Can Cross Entropy Loss be used instead of BCELoss? 

If no, why not? If yes, how?

Is `NLLLoss()` and last layer as `LogSoftmax()` is same as using `CrossEntropyLoss()` with a Softmax final layer? Can you get the mathematical intuition behind it?

In [107]:
#Training and Validation

epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()
        #print (inputs.shape)
        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
        #break
        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
    #break

('Epoch: 1/4...', 'Step: 100...', 'Loss: 0.691256...', 'Val Loss: 0.692942')
('Epoch: 1/4...', 'Step: 200...', 'Loss: 0.684352...', 'Val Loss: 0.694444')
('Epoch: 1/4...', 'Step: 300...', 'Loss: 0.691322...', 'Val Loss: 0.693709')
('Epoch: 1/4...', 'Step: 400...', 'Loss: 0.690494...', 'Val Loss: 0.693814')
('Epoch: 2/4...', 'Step: 500...', 'Loss: 0.692708...', 'Val Loss: 0.694503')
('Epoch: 2/4...', 'Step: 600...', 'Loss: 0.694232...', 'Val Loss: 0.694022')
('Epoch: 2/4...', 'Step: 700...', 'Loss: 0.695717...', 'Val Loss: 0.693767')
('Epoch: 2/4...', 'Step: 800...', 'Loss: 0.691160...', 'Val Loss: 0.693639')
('Epoch: 3/4...', 'Step: 900...', 'Loss: 0.698660...', 'Val Loss: 0.693980')
('Epoch: 3/4...', 'Step: 1000...', 'Loss: 0.699789...', 'Val Loss: 0.693398')
('Epoch: 3/4...', 'Step: 1100...', 'Loss: 0.699838...', 'Val Loss: 0.694568')
('Epoch: 3/4...', 'Step: 1200...', 'Loss: 0.694633...', 'Val Loss: 0.692897')
('Epoch: 4/4...', 'Step: 1300...', 'Loss: 0.692366...', 'Val Loss: 0.6931

## Inference
Once we are done with training and validating, we can improve training loss and validation loss by playing around with the hyperparameters. Can you find a better set of hyperparams? Play around with it. 

### Task 10: Prediction Function
Now write a prediction function to predict the output for the test set created. Save the results in a CSV file with one column as the reviews and the prediction in the next column. Calculate the accuracy of the test set.

In [115]:
def predict():
    net.eval()
    label_dict = {}
    label_dict[0] = 'positive'
    label_dict[1] = 'negative'
    test_h = net.init_hidden(batch_size)
    prediction = []
    for inputs,labels in test_loader:
        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        test_h = tuple([each.data for each in val_h])
        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()
        output, test_h = net(inputs, val_h)
        print(output)
        #prediction.append(label_dict[])
        

                

predict()                

       

tensor([0.5031, 0.5016, 0.5016, 0.5016, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015,
        0.5028, 0.5044, 0.5190, 0.5139, 0.5145, 0.5106, 0.5304, 0.5155, 0.5016,
        0.5067, 0.5245, 0.5079, 0.5089, 0.5094, 0.5155, 0.5145, 0.5014, 0.5047,
        0.5152, 0.4993, 0.5212, 0.4933, 0.5165, 0.5407, 0.5122, 0.5141, 0.5274,
        0.5021, 0.5181, 0.5097, 0.5109, 0.5120, 0.5068, 0.5136, 0.5147, 0.5160,
        0.5149, 0.5259, 0.5042, 0.4945, 0.5350],
       device='cuda:0', grad_fn=<SelectBackward>)
tensor([0.5151, 0.5007, 0.5117, 0.5189, 0.5186, 0.5030, 0.5257, 0.5162, 0.5203,
        0.5163, 0.5156, 0.5143, 0.4883, 0.5007, 0.5112, 0.5063, 0.5249, 0.5198,
        0.5442, 0.5204, 0.5130, 0.5217, 0.5190, 0.5169, 0.5140, 0.5118, 0.4942,
        0.5242, 0.5247, 0.5287, 0.5073, 0.5222, 0.5046, 0.5275, 0.5114, 0.5287,
        0.5200, 0.5207, 0.5146, 0.5128, 0.5052, 0.5100, 0.4934, 0.5064, 0.5078,
        0.4988, 0.4946, 0.5110, 0.5186, 0.5173],
       device='cuda:0', grad_fn=<SelectBackward>)
te

tensor([0.5031, 0.5016, 0.5016, 0.5016, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015,
        0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5083, 0.5195, 0.5113,
        0.5355, 0.5137, 0.5239, 0.5018, 0.5189, 0.5142, 0.5263, 0.5151, 0.5234,
        0.5083, 0.5144, 0.5075, 0.5092, 0.5111, 0.5081, 0.5293, 0.5179, 0.5122,
        0.5008, 0.5124, 0.5076, 0.5149, 0.5004, 0.5268, 0.5169, 0.5164, 0.5025,
        0.5169, 0.5183, 0.5153, 0.5047, 0.5177],
       device='cuda:0', grad_fn=<SelectBackward>)
tensor([0.5031, 0.5016, 0.5016, 0.5016, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015,
        0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015,
        0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015,
        0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015, 0.5015,
        0.5015, 0.5135, 0.5103, 0.5190, 0.5028, 0.4988, 0.5012, 0.5140, 0.5293,
        0.5133, 0.5055, 0.5183, 0.5203, 0.5216],
       device='cuda:0', grad_fn=<SelectBackward>)
te

## Bonus Question: Create an app using Flask

> Extra bonus points if someone attempts this question:
* Save the trained model checkpoints.
* Create a Flask app and load the model. A similar work in the field of CNN has been done here : https://github.com/kumar-shridhar/Business-Card-Detector (Check `app.py`)
* You can use hosting services like Heroku and/or with Docker to host your app and show it to everyone. 
Example here: https://github.com/selimrbd/sentiment_analysis/blob/master/Dockerfile
