# Artificial Neural Networks project: Sentiment Analysis
This is an NLP (**Natural Language Processing**) learning project whose task is to perform sentiment analysis on movie reviews. In this project we will be using an IMDB dataset containing 50,000 reviews classified as ***positive*** or ***negative***.

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k ***positive*** and 25k ***negative***). Also included are an additional 50,000 unlabeled documents for unsupervised learning.

**Our intent in this project is to try different methods, i.e. different ANN models and different text encodings/embeddings, in search of the best accuracy.**

With regard to the ANN framework, we'll be using **PyTorch 1.5**.

This dataset has been made publicly available by the authors of the following paper:

***maas-EtAl:2011:ACL-HLT2011***
* **author**    = Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher
* **title**     = Learning Word Vectors for Sentiment Analysis
* **booktitle** = Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
* **date**      = June 2011
* **publisher** = Association for Computational Linguistics
* **pages**     = 142--150
* **url**       = http://www.aclweb.org/anthology/P11-1015

Also, the **best performances** of ANNs on this dataset are listed [here](https://paperswithcode.com/sota/sentiment-analysis-on-imdb). The different methods are often listed with a companion paper, which makes the list a great source of inspiration.

# References
A list of references for the technologies used in this notebook.
* (py)**Torch and Torchtext**: https://pytorch.org/
* **spaCy**: https://spacy.io/
* **GloVe**: https://nlp.stanford.edu/projects/glove/

In [1]:
import torch
import torchtext
from torchtext import data, datasets
import torch.nn as nn
import torch.optim as optim

In [2]:
import spacy

In [3]:
import random as rd
from datetime import datetime, timedelta
from itertools import product

## 1) Model: RNN, Encoding: one hot vectors
For this first method, we are using a "standard" **Recurrent Neural Network** (RNN) and One Hot Encoding.

**RNNs** are a family of sequence-based neural networks, which makes them well suited to processing language. We will use the basic RNN here.

In **One Hot Encoding**, each word is encoded as a vector of 0s with a single 1. The dimension of each vector is the size of the vocabulary (i.e. if our corpus is made of 30,000 distinct words, a word will be represented with a vector of dimension 30,000).

The drawbacks of such a representation are well known:
* Sparse vectors of high dimension. The bigger the vocabulary, the bigger the dimension of each vector, which makes deep learning training very costly (in terms of time and resources), or even unmanageable.
* The encoding is totally uninformed, that is, similar words are not placed closer to each other in the encoding space.

It is however a simple encoding method which is worth a try.
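
To make the encoding concrete, here is a minimal plain-Python sketch. The five-word vocabulary is hypothetical and only serves the illustration; the notebook itself never builds explicit one hot vectors, since the embedding layer is fed word indexes directly.

```python
# Toy vocabulary (hypothetical); the notebook's real one has about 30,000 entries.
vocab = ["<unk>", "the", "movie", "was", "great"]
stoi = {word: index for index, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index (index 0 if unknown)."""
    vector = [0] * len(vocab)
    vector[stoi.get(word, 0)] = 1
    return vector

great = one_hot("great")
movie = one_hot("movie")

# Any two distinct words have a dot product of 0: the encoding carries
# no notion of similarity between words.
dot = sum(a * b for a, b in zip(great, movie))
print(great, movie, dot)
```

The zero dot product between any two distinct words is the "uninformed" drawback listed above: every word is equally far from every other word.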

### 1.1) Download and encode IMDB data
We use TorchText as it provides easy access (download) to some popular datasets, including the IMDB dataset.

First, we need to define our data Field and Label. We're using the default TorchText tokenisation for sequences (sentences), i.e. `str.split` (splitting a sequence simply on whitespace).

We will later improve the text processing/cleaning to verify the impact on the quality of the model.

In [4]:
reviews = data.Field()
labels = data.LabelField(dtype = torch.float)

**datasets.IMDB** will download the IMDB dataset (if not already present in the ***.data*** directory) and split it into a train and a test set. The data is processed using the **data.Field** settings defined above.

In [5]:
train_data, test_data = datasets.IMDB.splits(reviews, labels)

In [6]:
print("Dataset has {0} training samples and {1} test samples".format(len(train_data), len(test_data)))

Dataset has 25000 training samples and 25000 test samples


Some examples of tokenised reviews:

In [7]:
print(train_data.examples[rd.randint(1,100)].__dict__)

{'text': ['If', 'there', 'is', 'one', 'thing', 'to', 'recommend', 'about', 'this', 'film', 'is', 'that', 'it', 'is', 'intriguing.', 'The', 'premise', 'certainly', 'draws', 'the', 'audience', 'in', 'because', 'it', 'is', 'a', 'mystery,', 'and', 'throughout', 'the', 'film', 'there', 'are', 'hints', 'that', 'there', 'is', 'something', 'dark', 'lurking', 'about.', 'However,', 'there', 'is', 'not', 'much', 'tension,', 'and', "Williams'", 'mild', 'mannered', 'portrayal', "doesn't", 'do', 'much', 'to', 'makes', 'us', 'relate', 'to', 'his', 'obsession', 'with', 'the', 'boy.<br', '/><br', '/>Collete', 'fares', 'much', 'better', 'as', 'the', 'woman', 'whose', 'true', 'nature', 'and', 'intentions', 'are', 'not', 'very', 'clear.', 'The', 'production', 'felt', 'rushed', 'and', 'holes', 'are', 'apparent.', 'It', 'certainly', 'feels', 'like', 'a', 'preview', 'for', 'a', 'much', 'more', 'complete', 'and', 'better', 'effort.', 'The', 'book', 'is', 'probably', 'better.<br', '/><br', '/>One', 'thing', 'i

Let's also define a **validation set** (20% of the training data, the remaining 80% being kept for training):

In [8]:
train_data, validation_data = train_data.split(split_ratio = 0.8)

#### Building the vocabulary of our corpus
We set a maximum size for our vocabulary to keep the dimension of the one hot vectors manageable. Let's set this limit at **30,000 words** (tokens!).

We only use the training set for building the vocabulary. Indeed, we want to avoid leaking information from the validation/test data. That is, our model should be able to give a *sentiment* even though a review contains words that it has **not** seen during training. Words not included in the vocabulary will be encoded as **<UNK>**, i.e. **UNKNOWN**.

In [9]:
maximum_vocab_size = 30_000
reviews.build_vocab(train_data, max_size=maximum_vocab_size)
labels.build_vocab(train_data)

The reviews **Field** can then show diverse information on the vocabulary, e.g.:

* most frequent words

In [10]:
print("The five most frequent words: {0}".format(reviews.vocab.freqs.most_common(5)))

The five most frequent words: [('the', 228701), ('a', 123678), ('and', 121873), ('of', 114229), ('to', 105933)]


***Something to note here is that the most frequent words are not necessarily those that are the most useful for identifying a sentiment. There is definitely something to improve here.***

* the vocabulary (here, first 10 elements)

In [11]:
print(reviews.vocab.itos[:10])

['<unk>', '<pad>', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 'I']


The ```<unk>``` and ```<pad>``` tokens are not part of the initial vocabulary. They are tokens added for UNKNOWN vocabulary (*i.e. when limiting the size of the vocabulary, some words will not be included in the vocabulary, hence they will be tagged as UNKNOWN*) and for PADDING (*i.e. adding one or several PADDING tokens so that all sequences in a batch have the same length*).

As a result here, our vocabulary's size is **not 30,000 but 30,002**.
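
The mechanics of a capped vocabulary can be sketched in plain Python. This is only an illustration of the idea under simplified assumptions; `build_vocab` and `encode` below are hypothetical helpers, not TorchText functions, and tie-breaking between equally frequent tokens may differ from TorchText.

```python
from collections import Counter

def build_vocab(tokenised_texts, max_size):
    """Keep the max_size most frequent tokens, after two special tokens."""
    counts = Counter(token for text in tokenised_texts for token in text)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

texts = [["a", "good", "film"], ["a", "bad", "film"], ["a", "film"]]
itos, stoi = build_vocab(texts, max_size=2)

def encode(tokens):
    # Tokens outside the vocabulary map to index 0, i.e. "<unk>".
    return [stoi.get(tok, 0) for tok in tokens]

print(itos)
print(encode(["a", "good", "film"]))
```

With `max_size=2`, the vocabulary holds four entries (two special tokens plus two words), which mirrors the 30,000 + 2 count above; the rarer word "good" is encoded as `<unk>`.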

We then need to define an **iterator** that will be used for batch training, i.e. during training, our model will be fed with batches of batch_size sequences. batch_size is one of the (many) deep learning hyper-parameters, and must be chosen wisely. Small batch sizes are often reported to generalise better, although a batch size of 1 is usually a poor choice.

Let's use 64.

We also define the device (CPU or GPU) on which all computations will run.

In [12]:
batch_size = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [13]:
train_iter, valid_iter, test_iter = data.BucketIterator.splits((train_data, validation_data, test_data),
                                                               batch_size=batch_size,
                                                               device=device)
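
**BucketIterator** batches examples of similar lengths together, which minimises the amount of padding needed. The plain-Python sketch below only conveys that idea: `bucket_batches` is a hypothetical helper, and the real iterator also shuffles the data, which is ignored here.

```python
def bucket_batches(sequences, batch_size, pad_index=1):
    """Sort by length, cut into batches, and pad each batch to its longest item."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_index] * (longest - len(seq)) for seq in chunk])
    return batches

# Hypothetical reviews, already encoded as word indexes.
encoded_reviews = [[5, 9, 4, 7], [3], [8, 2, 6, 4, 9, 5], [7, 7]]
batches = bucket_batches(encoded_reviews, batch_size=2)
for batch in batches:
    print(batch)
```

Grouping the two short reviews together costs a single padding token, whereas pairing the shortest review with the longest one would have cost five.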

### 1.2) Build our simple RNN model
Our RNN model will be comprised of:
#### An embedding layer
It transforms the sparse one hot vector into a denser vector. It is basically a linear layer to which one can feed indexes instead of vectors, i.e. a word/token is fed as the index of its one hot vector instead of as the one hot vector itself. Using an embedding not only reduces the dimension, but during training "similar words" (similarity here depending on the task the embedding layer is trained for) will also move closer to each other.

The layer takes as input dimension the size of the vocabulary. The output dimension is the desired dense vector dimension. The output dimension can be seen as a hyper-parameter and be tuned using cross-validation.

***Note***: embedding layer can also be pre-trained, we'll use that later.
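
The claim that an embedding layer is a linear layer fed with indexes can be checked on a toy example. The sketch below uses plain Python and a hypothetical 4 x 3 weight matrix: multiplying a one hot vector by the matrix gives the same result as picking the corresponding row.

```python
# Hypothetical 4-word vocabulary embedded in 3 dimensions.
weights = [
    [0.1, 0.2, 0.3],   # row 0
    [0.4, 0.5, 0.6],   # row 1
    [0.7, 0.8, 0.9],   # row 2
    [1.0, 1.1, 1.2],   # row 3
]

def one_hot_times_matrix(index):
    """Multiply a one hot row vector by the weight matrix, the long way."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(weights))]
    return [sum(one_hot[i] * weights[i][j] for i in range(len(weights)))
            for j in range(len(weights[0]))]

def lookup(index):
    """What an embedding layer does: pick the row directly."""
    return weights[index]

print(one_hot_times_matrix(2))
print(lookup(2))
```

The lookup avoids both storing the one hot vectors and multiplying by thousands of zeros, which is why the model is fed indexes.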
#### An RNN layer
The RNN layer takes as input dimension the embedding dimension (output of the embedding layer). The output dimension is the dimension of the hidden state. This dimension can also be seen as a hyper-parameter of the model, i.e. it could be tuned using cross-validation.
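
As an illustration of what the layer computes, here is a plain-Python sketch of the basic (Elman) recurrence with a tanh non-linearity, which is the default of `nn.RNN`. The sizes and weights are hypothetical, and the two bias vectors of `nn.RNN` are merged into a single one for brevity.

```python
import math

def rnn_step(x, h_prev, w_ih, w_hh, bias):
    """One Elman RNN step: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)."""
    h_new = []
    for row_ih, row_hh, b in zip(w_ih, w_hh, bias):
        total = b
        total += sum(w * xi for w, xi in zip(row_ih, x))
        total += sum(w * hi for w, hi in zip(row_hh, h_prev))
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical sizes: embedding_dim = 2, hidden_dim = 3.
w_ih = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
w_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0], [0.3, 0.0, 0.1]]
bias = [0.0, 0.1, -0.1]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three embedded tokens
hidden = [0.0, 0.0, 0.0]                          # initial hidden state
for x in sequence:
    hidden = rnn_step(x, hidden, w_ih, w_hh, bias)

print(hidden)  # final hidden state, of dimension hidden_dim
```

The same weights are reused at every time step; only the hidden state changes as the sequence is read.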
#### A linear layer
Our output layer is a classic linear (fully connected) layer. It takes as input the final hidden state of the RNN layer, and therefore its input dimension is the dimension of the RNN hidden state. The output dimension is one: since the problem is a binary classification, a single scalar is enough. This scalar is an unbounded real, which is mapped to the [0,1] range with a sigmoid when computing the loss and the accuracy.

#### Defining our model

In [14]:
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, review):

        embeddings = self.embedding(review)
        output, hidden = self.rnn(embeddings)
        output = self.fc(hidden.squeeze(0))
        
        return(output)

#### Setting our model parameters:

In [15]:
input_dim = len(reviews.vocab)
embedding_dim = 128
hidden_dim = 128
output_dim = 1

Initialising the model:

In [16]:
model_rnn = RNN(input_dim, embedding_dim, hidden_dim, output_dim)

We can count the number of parameters of our model, which gives an insight into its complexity/simplicity.

In [17]:
def model_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [18]:
print("RNN model has {0:,} parameters".format(model_parameters(model_rnn)))

RNN model has 3,873,409 parameters
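
This total can be reproduced by hand from the layer dimensions. The sketch below assumes the settings used above (vocabulary of 30,002, embedding and hidden dimensions of 128) and the default `nn.RNN` layout of two weight matrices and two bias vectors.

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 30_002, 128, 128, 1

embedding_params = vocab_size * embedding_dim
# nn.RNN holds input-to-hidden and hidden-to-hidden weights, each with a bias.
rnn_params = (embedding_dim * hidden_dim + hidden_dim * hidden_dim
              + hidden_dim + hidden_dim)
linear_params = hidden_dim * output_dim + output_dim

total = embedding_params + rnn_params + linear_params
print("{0:,} + {1:,} + {2:,} = {3:,}".format(
    embedding_params, rnn_params, linear_params, total))
```

The embedding layer alone accounts for about 99% of the parameters, so the vocabulary size largely drives the size of this model.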


Defining the optimiser and the loss function:

In [19]:
optimiser = optim.Adam(model_rnn.parameters())

In [20]:
loss_fn = nn.BCEWithLogitsLoss()

Moving computation to the appropriate device (GPU or CPU)

In [21]:
model_rnn = model_rnn.to(device)
loss_fn = loss_fn.to(device)

### 1.3) Running the model
We define here some helper functions to run and evaluate the model.

**A function to compute accuracy**: the output of our last layer being an unbounded real, we first transform it to a real bounded to [0,1]. Values greater than 0.5 are then considered to be positive sentiment, whereas those less than 0.5 are considered to be negative sentiment.

In [22]:
def get_accuracy(predictions, labels):
    '''
    accuracy of a batch of predictions.
    Use sigmoid for transforming unbounded
    predictions to [0,1]
    Parameters
    ----------
    - predictions: a tensor of predictions
    - labels: the correct labels for these predictions
    Return
    ------
    - Accuracy = correct predictions / number of predictions
    '''
    predictions = torch.round(torch.sigmoid(predictions))
    
    # Correct prediction will be set to 1 from boolean True
    # Allowing to compute accuracy
    accuracy = (predictions == labels).float()
    accuracy = accuracy.sum()/len(accuracy)
    
    return(accuracy)
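
A small worked example of the same computation, in plain Python so that each step is visible. The logits and labels are made-up values, and `batch_accuracy` is a hypothetical stand-in for `get_accuracy()` that thresholds at 0.5 instead of rounding.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def batch_accuracy(logits, labels):
    """Share of logits whose thresholded sigmoid matches the 0/1 label."""
    predicted = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]   # hypothetical raw model outputs
labels = [1.0, 0.0, 0.0, 0.0]
accuracy = batch_accuracy(logits, labels)
print(accuracy)
```

Since the sigmoid of a positive logit is above 0.5, the third prediction (logit 0.3) is classified as positive and is the only mistake, giving three correct predictions out of four.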

**A function to train the model**: This function trains the model for one epoch.

In [23]:
def train_model(model, data, optimiser, loss_fn):
    '''
    Train the model on a set of data (training set as batch)
    That is, it performs the forward and backward propagation
    steps as well as gradient descent for an epoch.
    Parameters
    ----------
    - the model to train
    - the data: an iterator yielding batches of size
        batch_size
    - the chosen optimiser
    - the chosen loss function
    '''
    epoch_loss = 0
    epoch_accuracy = 0
    
    model.train()
    
    for data_batch in data:
        optimiser.zero_grad()
        predictions = model(data_batch.text).squeeze(1)
        loss = loss_fn(predictions, data_batch.label)
        accuracy = get_accuracy(predictions, data_batch.label)
        loss.backward()
        optimiser.step()
        
        # accumulate loss and accuracy of each batch
        epoch_loss = epoch_loss + loss.item()
        epoch_accuracy = epoch_accuracy + accuracy.item()
        
    epoch_loss = epoch_loss / len(data)
    epoch_accuracy = epoch_accuracy / len(data)
    
    return(epoch_loss, epoch_accuracy)

**A function to evaluate the model**: Evaluation -> we're using the validation set. Also, and **very important**, this is **NOT** training, so we do not compute the gradients. Apart from this, the function is quite similar to ***train_model()***.

In [24]:
def evaluate_model(model, data, loss_fn):
    '''
    Evaluate the model (-> using the validation set)
    This is not training, i.e. we do not compute gradient
    Parameters
    ----------
    - the model to evaluate
    - the data: an iterator yielding batches of size
        batch_size
    - the chosen loss function
    '''
    epoch_loss = 0
    epoch_accuracy = 0
    
    model.eval()
    
    with torch.no_grad():
        for data_batch in data:
            predictions = model(data_batch.text).squeeze(1)
            loss = loss_fn(predictions, data_batch.label)
            accuracy = get_accuracy(predictions, data_batch.label)
        
            # accumulate loss and accuracy of each batch
            epoch_loss = epoch_loss + loss.item()
            epoch_accuracy = epoch_accuracy + accuracy.item()
        
    epoch_loss = epoch_loss / len(data)
    epoch_accuracy = epoch_accuracy / len(data)
    
    return(epoch_loss, epoch_accuracy)

**A function to "run" the model**: loop through the number of epochs and run training and evaluation on the model.

In [25]:
def run_model(nb_epochs, model, train_iter, valid_iter, optimiser, loss_fn):
    '''
    Run training, validation for all epochs
    Output the loss and accuracy values for both
    training and validation steps
    Parameters
    ----------
    - the number of epochs
    - the model to train
    - the train and validation data: (iterators)
    - the chosen optimiser
    - the chosen loss function
    '''

    best_loss = float("inf")

    for an_epoch in range(nb_epochs):
    
        startt = datetime.now()
    
        # use the model passed as argument rather than the global model_rnn
        train_loss, train_accuracy = train_model(model, train_iter, optimiser, loss_fn)
        validation_loss, validation_accuracy = evaluate_model(model, valid_iter, loss_fn)
    
        duration = (datetime.now() - startt)
        duration_str = "{0}mn:{1}s".format(duration.seconds//60, duration.seconds%60)
    
        print("epoch {0}: {1}".format(an_epoch+1, duration_str))
        print("\t  Training loss: {0:.4f} |   Training accuracy: {1:.2f} %".
              format(train_loss, train_accuracy*100))
        print("\tValidation loss: {0:.4f} | Validation accuracy: {1:.2f} %".
              format(validation_loss, validation_accuracy*100))

In [26]:
nb_epochs = 5
run_model(nb_epochs, model_rnn, train_iter, valid_iter, optimiser, loss_fn)

epoch 1: 0mn:14s
	  Training loss: 0.6965 |   Training accuracy: 50.20 %
	Validation loss: 0.6959 | Validation accuracy: 49.94 %
epoch 2: 0mn:14s
	  Training loss: 0.6939 |   Training accuracy: 50.79 %
	Validation loss: 0.6973 | Validation accuracy: 51.11 %
epoch 3: 0mn:13s
	  Training loss: 0.6925 |   Training accuracy: 50.61 %
	Validation loss: 0.6991 | Validation accuracy: 51.46 %
epoch 4: 0mn:13s
	  Training loss: 0.6921 |   Training accuracy: 50.22 %
	Validation loss: 0.7026 | Validation accuracy: 49.86 %
epoch 5: 0mn:13s
	  Training loss: 0.6915 |   Training accuracy: 50.06 %
	Validation loss: 0.7052 | Validation accuracy: 52.22 %
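
A useful reference point for reading these numbers: a model that outputs a probability of 0.5 for every review has a binary cross-entropy of ln 2, about 0.6931, whatever the label. The sketch below checks this in plain Python; `bce_with_logits` is a hypothetical helper written for the illustration, not the PyTorch implementation.

```python
import math

def bce_with_logits(logit, label):
    """Binary cross-entropy computed from a raw logit."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(probability)
             + (1.0 - label) * math.log(1.0 - probability))

# A logit of 0 means a predicted probability of 0.5, whatever the review.
loss_positive = bce_with_logits(0.0, 1.0)
loss_negative = bce_with_logits(0.0, 0.0)
print(loss_positive, loss_negative, math.log(2))
```

The losses reported above stay within a few hundredths of this value, which suggests that the model's predictions remain close to 0.5.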


### 1.4) Predicting sentiment on a review
Let's pick a review from test_data, i.e. a review never seen by our model, and see how the model classifies it. We're picking a review at random among the 25,000 reviews of the test set.

But first let's define the function to predict:

In [27]:
def predict_review_sentiment_with_RNN(model, review_as_tokens):
    '''
    Predict the sentiment of a movie review
    Parameters
    ----------
    - the ANN model to use to predict
    - A review as a list of tokens
    Return
    ------
    A sentiment prediction, as a [0,1] real
    A value < 0.5 meaning negative review
    '''
    model.eval()
    review_indexed = [reviews.vocab.stoi[token] for token in review_as_tokens]
    tensor = torch.LongTensor(review_indexed).to(device)
    tensor = tensor.unsqueeze(1)
    sentiment_prediction = torch.sigmoid(model(tensor))
    return (sentiment_prediction.item())

Let's now pick up at random a review amongst the 25000 reviews from the **test_data**:

In [28]:
# randrange excludes the upper bound, so the index is always valid
a_review = test_data.examples[rd.randrange(len(test_data.examples))].__dict__
a_review_label = a_review["label"]
a_review_text = a_review["text"]
print("{0}\n".format(a_review_text))
if a_review_label == "neg":
    print("This is a NEGATIVE review !")
else:
    print("This is a POSITIVE review !")

['I', 'bought', 'this', 'film', 'from', 'e-bay', 'as', 'part', 'of', 'a', 'lot', 'of', 'about', 'twenty', 'horror', 'flicks,', 'all', 'about', 'a', 'dollar', 'a', 'piece.', 'When', 'watching', 'this,', 'my', 'first', 'impression', 'was', 'that', 'it', 'probably', 'was', 'from', 'the', 'late', '80s.', 'Later', 'on', 'I', 'began', 'thinking', '-', 'the', 'Linkin', 'Park', 'posters', 'on', 'the', 'wall', 'and', 'everything', 'else', 'seemed', 'to', 'hint', 'that', 'I', 'was', 'dealing', 'with', 'a', 'more', 'recent', 'film.', 'Realizing', 'that,', 'the', 'flick', 'became', 'an', 'unbearable', 'torment.', 'The', 'last', '3', 'minutes', 'were', 'the', 'longest', 'in', 'the', 'movie', 'history', '-', 'the', 'film', 'just', 'refused', 'to', 'end.', 'Is', 'there', 'a', 'genre', 'such', 'as', '"horror', 'for', 'children"?', 'In', 'that', 'case', 'this', 'film', 'is', 'definitely', 'it.', 'If', 'there', 'are', 'parents,', 'perverse', 'enough', 'to', 'want', 'to', 'introduce', 'their', 'offspring

And predict ...

In [29]:
sentiment = predict_review_sentiment_with_RNN(model_rnn, a_review_text)
if sentiment < 0.5:
    print("This is a NEGATIVE review ! {0:.4f}".format(sentiment))
else:
    print("This is a POSITIVE review ! {0:.4f}".format(sentiment))

This is a POSITIVE review ! 0.5929


We can also make up our own review ... (*we're using spaCy here only for tokenisation of our review. We will discuss spaCy in the next section.*)

In [30]:
nlp = spacy.load("en_core_web_sm")

In [31]:
#my_review = "The best movie of the year"
#my_review = "The best movie of the year. Very good, I like it, I enjoy it a lot. It's fun."
my_review = "This is such a bad movie!"

my_review_tokens = [token.text for token in nlp.tokenizer(my_review)]
print(my_review_tokens)

['This', 'is', 'such', 'a', 'bad', 'movie', '!']


And the prediction of this review is ...

In [32]:
sentiment = predict_review_sentiment_with_RNN(model_rnn, my_review_tokens)
if sentiment < 0.5:
    print("This is a NEGATIVE review ! {0:.4f}".format(sentiment))
else:
    print("This is a POSITIVE review ! {0:.4f}".format(sentiment))

This is a NEGATIVE review ! 0.2163


### 1.5) Conclusion on this RNN / One Hot Encoding model
The accuracy of this model is pretty poor, basically equivalent to flipping a coin!

Let's see if we can improve the accuracy with a better pre-processing of the text (reviews). In the next section, we will use **spaCy**, a powerful open source library for NLP, which includes many features (e.g. POS tagging, NER, etc.) and supports many languages.

___

## 2) Model: RNN, Text processing using spaCy
We will be using the same ANN architecture ("standard" RNN), but we will use **spaCy** for tokenisation and for removing ***stopwords***. Stopwords are usually the most common words in a language, e.g. "**a**", "**the**", "**and**", "**of**", etc., which may not bring much benefit to an NLP task.

**However**, negation words could be very important, typically in a sentiment analysis task.

Let's see how it goes.
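
The risk with negation words can be shown on a toy example. The stop word list below is hypothetical and deliberately tiny; it is not spaCy's list, and only illustrates what happens when a negation is treated as a stop word.

```python
# Hypothetical stop word list, for illustration only; it is NOT spaCy's list.
stop_words = {"this", "is", "a", "not", "the"}

def remove_stop_words(tokens):
    """Drop every token found in the stop word list (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

positive = "this is a good movie".split()
negative = "this is not a good movie".split()

print(remove_stop_words(positive))
print(remove_stop_words(negative))
```

Both sentences collapse to the same two tokens, so a model trained on the filtered text can no longer tell them apart.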

In [33]:
nlp = spacy.load("en_core_web_sm")

Some examples of stop words:

In [34]:
list(nlp.Defaults.stop_words)[:5]

['two', 'almost', 'twelve', 'doing', 'here']

### 2.1) Building and encoding the IMDB data
We re-create our train, validation and test data using **spaCy**. This is done through parameters of the **Field** object. In the configuration below, we're using spaCy's tokeniser and spaCy's defined stop words.

We're also adding some punctuation and a few specific tokens to spaCy's default stopwords.

In [35]:
nlp.Defaults.stop_words |= {",", ".", "-", ";", ":", "_", "/><br", "(", ")", "!", "?", "...", "br", "/>", "'", '"'}

In [36]:
#reviews = data.Field(tokenize = "spacy")
reviews = data.Field(tokenize = "spacy", stop_words = nlp.Defaults.stop_words)
labels = data.LabelField(dtype = torch.float)

In [37]:
%%time
train_data, test_data = datasets.IMDB.splits(reviews, labels)

Wall time: 1min 28s


In [38]:
train_data, validation_data = train_data.split(split_ratio = 0.8)

In [39]:
maximum_vocab_size = 30_000
reviews.build_vocab(train_data, max_size=maximum_vocab_size)
labels.build_vocab(train_data)

In [40]:
print("The five most frequent words: {0}".format(reviews.vocab.freqs.most_common(5)))

The five most frequent words: [('I', 62192), ('movie', 34141), ('film', 30974), ('The', 30223), ('like', 15535)]


In [41]:
print(reviews.vocab.itos[:10])

['<unk>', '<pad>', 'I', 'movie', 'film', 'The', 'like', 'It', 'good', 'This']


***This is an improvement compared to the previous encoding, as the most frequent tokens are no longer stopwords or punctuation***. Let's see if it has an impact on the accuracy of our model.

In [42]:
batch_size = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [43]:
train_iter, valid_iter, test_iter = data.BucketIterator.splits((train_data, validation_data, test_data),
                                                               batch_size=batch_size,
                                                               device=device)

### 2.2) Re-initialising the model

In [44]:
input_dim = len(reviews.vocab)
embedding_dim = 128
hidden_dim = 128
output_dim = 1

In [45]:
model_rnn = RNN(input_dim, embedding_dim, hidden_dim, output_dim)

Optimiser, loss and moving computation to the appropriate device (GPU or CPU)

In [46]:
optimiser = optim.Adam(model_rnn.parameters())
loss_fn = nn.BCEWithLogitsLoss()

model_rnn = model_rnn.to(device)
loss_fn = loss_fn.to(device)

### 2.3) Running the model

In [47]:
nb_epochs = 5
run_model(nb_epochs, model_rnn, train_iter, valid_iter, optimiser, loss_fn)

epoch 1: 0mn:9s
	  Training loss: 0.6971 |   Training accuracy: 49.34 %
	Validation loss: 0.6990 | Validation accuracy: 49.33 %
epoch 2: 0mn:9s
	  Training loss: 0.6947 |   Training accuracy: 50.43 %
	Validation loss: 0.7058 | Validation accuracy: 48.77 %
epoch 3: 0mn:9s
	  Training loss: 0.6916 |   Training accuracy: 50.67 %
	Validation loss: 0.7193 | Validation accuracy: 50.79 %
epoch 4: 0mn:9s
	  Training loss: 0.6914 |   Training accuracy: 50.65 %
	Validation loss: 0.7166 | Validation accuracy: 48.32 %
epoch 5: 0mn:9s
	  Training loss: 0.6910 |   Training accuracy: 50.46 %
	Validation loss: 0.7241 | Validation accuracy: 51.34 %


### 2.4) Conclusion on this RNN / spaCy encoding model
This model, using the **spaCy** tokeniser and **stopwords**, shows no improvement compared to the previous one using the "standard" TorchText tokeniser. If anything, the accuracy seems even a bit lower.

We've also checked that the **stopwords** configuration has no significant impact, i.e. encoding our corpus with or without the stopwords option does not significantly change the accuracy of the model.

___

## 3) Model: RNN, Pre-trained embeddings
For this new model, we will again be using the same ANN architecture ("standard" RNN). The change here is that we will be using pre-trained embeddings. Instead of learning our embeddings from scratch **during** the training step, we will be using word embeddings that have been previously trained.

**GloVe**, for Global Vectors for Word Representation, is an unsupervised learning algorithm for obtaining vector representations for words. In addition to the algorithm, some pre-trained word vectors (embeddings) are available for download. The idea here is that, by having embeddings that have been extensively trained on large corpora, our model will be better able to identify sentiments in our IMDB review corpus, more particularly on words that have not been seen during training.

### 3.1) Building and encoding the IMDB data
We re-create our train, validation and test data. We keep using **spaCy** for tokenisation, but do not use the **stopwords** feature. When building the vocabulary, we then specify the pre-trained embeddings from **GloVe**. That is, instead of having the embedding layer initialised randomly, we initialise it with the GloVe embeddings.

There are different GloVe embeddings available; we will be using **glove.6B.100d**. These embeddings were trained on a corpus of **6 billion tokens**, with a **vocabulary of 400,000** words, each word being represented as a vector of **dimension 100**.
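
The initialisation can be pictured as follows: for each word of our vocabulary, copy its pre-trained vector if one exists, and fall back to random values otherwise (the role played by `unk_init`). The plain-Python sketch below uses made-up 3-dimensional vectors in place of GloVe and a hypothetical toy vocabulary.

```python
import random

# Hypothetical 3-dimensional "pre-trained" vectors standing in for GloVe.
pretrained = {"good": [0.9, 0.1, 0.0], "bad": [-0.9, 0.1, 0.0]}
itos = ["<unk>", "<pad>", "good", "bad", "imdb"]   # toy vocabulary
dim = 3

rng = random.Random(0)
embedding_matrix = []
for token in itos:
    if token in pretrained:
        embedding_matrix.append(list(pretrained[token]))   # copy the known vector
    else:
        # Same role as unk_init: words missing from the table get random values.
        embedding_matrix.append([rng.gauss(0.0, 1.0) for _ in range(dim)])

for token, row in zip(itos, embedding_matrix):
    print(token, row)
```

The resulting matrix has one row per vocabulary entry, which is the shape the embedding layer expects for its weights.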

In [48]:
reviews = data.Field(tokenize = "spacy")
labels = data.LabelField(dtype = torch.float)

In [49]:
%%time
train_data, test_data = datasets.IMDB.splits(reviews, labels)

Wall time: 1min 26s


In [50]:
train_data, validation_data = train_data.split(split_ratio = 0.8)

In [51]:
maximum_vocab_size = 30_000
reviews.build_vocab(train_data, max_size=maximum_vocab_size,
                    vectors = "glove.6B.100d", unk_init = torch.Tensor.normal_)
labels.build_vocab(train_data)

In [52]:
batch_size = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [53]:
train_iter, valid_iter, test_iter = data.BucketIterator.splits((train_data, validation_data, test_data),
                                                               batch_size=batch_size,
                                                               device=device)

### 3.2) Re-initialising the model
We change the embedding dimension to match the dimension of the GloVe embedding we're using (**dim=100**)

In [54]:
input_dim = len(reviews.vocab)
embedding_dim = 100
hidden_dim = 128
output_dim = 1

In [55]:
model_rnn = RNN(input_dim, embedding_dim, hidden_dim, output_dim)

#### Replacing the embedding layer with GloVe embeddings
We replace the initial weights of the embedding layer (initialised during the instantiation of the RNN model in the above cell), with the GloVe embeddings weights

In [56]:
GloVe_embeddings = reviews.vocab.vectors
model_rnn.embedding.weight.data.copy_(GloVe_embeddings)

tensor([[-0.6585, -0.8961,  2.6492,  ..., -0.4100,  0.5340,  0.6607],
        [ 2.2799,  0.1549, -1.5530,  ..., -1.0601,  0.0717,  0.0137],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7447, -0.2493, -0.0636,  ...,  0.1555,  1.0456,  0.1147],
        [-0.5160,  0.0312,  0.0855,  ..., -0.8593,  0.0180,  0.7858],
        [-0.2045,  0.0965, -0.4298,  ..., -0.4046,  0.4052,  0.3617]])

We also re-initialise the `<PAD>` and `<UNK>` weights to zero

In [57]:
pad_idx = reviews.vocab.stoi[reviews.pad_token]
unk_idx = reviews.vocab.stoi[reviews.unk_token]

model_rnn.embedding.weight.data[pad_idx] = torch.zeros(embedding_dim)
model_rnn.embedding.weight.data[unk_idx] = torch.zeros(embedding_dim)
model_rnn.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7447, -0.2493, -0.0636,  ...,  0.1555,  1.0456,  0.1147],
        [-0.5160,  0.0312,  0.0855,  ..., -0.8593,  0.0180,  0.7858],
        [-0.2045,  0.0965, -0.4298,  ..., -0.4046,  0.4052,  0.3617]])

Optimiser, loss and moving computation to the appropriate device (GPU or CPU)

In [58]:
optimiser = optim.Adam(model_rnn.parameters())
loss_fn = nn.BCEWithLogitsLoss()

model_rnn = model_rnn.to(device)
loss_fn = loss_fn.to(device)

### 3.3) Running the model
The embedding dimension has already been set to 100 in section 3.2 to match the GloVe embeddings, so we can run the training as before.

In [59]:
nb_epochs = 5
run_model(nb_epochs, model_rnn, train_iter, valid_iter, optimiser, loss_fn)

epoch 1: 0mn:16s
	  Training loss: 0.6940 |   Training accuracy: 50.11 %
	Validation loss: 0.6946 | Validation accuracy: 50.10 %
epoch 2: 0mn:16s
	  Training loss: 0.6931 |   Training accuracy: 50.08 %
	Validation loss: 0.7110 | Validation accuracy: 50.06 %
epoch 3: 0mn:16s
	  Training loss: 0.6963 |   Training accuracy: 50.25 %
	Validation loss: 0.6961 | Validation accuracy: 49.76 %
epoch 4: 0mn:15s
	  Training loss: 0.6955 |   Training accuracy: 49.82 %
	Validation loss: 0.6973 | Validation accuracy: 49.05 %
epoch 5: 0mn:16s
	  Training loss: 0.6954 |   Training accuracy: 49.57 %
	Validation loss: 0.6937 | Validation accuracy: 49.58 %


### 3.4) Conclusion on this RNN / GloVe embeddings model
Perhaps surprisingly, using GloVe pre-trained embeddings does not improve the accuracy of our model on the training and validation sets. The limiting factor is likely the model itself. Keeping the vanilla RNN, we can still try different options, such as:
* size of the hidden dimension
* learning rate of the optimiser

Before moving to a completely different ANN architecture, let's try a grid search over some hyperparameters of the model, i.e. a combination of:
* hidden_dim = [128, 256]
* learning_rate = [0.01, 0.001, 0.0001]

In [60]:
hidden_dims = [128, 256]
learning_rates = [0.01, 0.001, 0.0001]

In [61]:
nb_epochs = 3
for hidden_dim, lr in product(hidden_dims, learning_rates):
    
    # change model settings
    model_rnn = RNN(input_dim, embedding_dim, hidden_dim, output_dim)
    model_rnn = model_rnn.to(device)
    optimiser = optim.Adam(model_rnn.parameters(), lr=lr)
    
    print("Running model with hidden_dim={0} and learning_rate={1:.6f}".
          format(hidden_dim, lr))
    
    run_model(nb_epochs, model_rnn, train_iter, valid_iter, optimiser, loss_fn)
    print("---------")

Running model with hidden_dim=128 and learning_rate=0.010000
epoch 1: 0mn:16s
	  Training loss: 0.7115 |   Training accuracy: 49.55 %
	Validation loss: 0.7041 | Validation accuracy: 51.21 %
epoch 2: 0mn:15s
	  Training loss: 0.7086 |   Training accuracy: 50.32 %
	Validation loss: 0.7424 | Validation accuracy: 50.36 %
epoch 3: 0mn:17s
	  Training loss: 0.7135 |   Training accuracy: 50.37 %
	Validation loss: 0.7146 | Validation accuracy: 51.34 %
---------
Running model with hidden_dim=128 and learning_rate=0.001000
epoch 1: 0mn:16s
	  Training loss: 0.6971 |   Training accuracy: 49.43 %
	Validation loss: 0.7104 | Validation accuracy: 48.38 %
epoch 2: 0mn:16s
	  Training loss: 0.6948 |   Training accuracy: 50.45 %
	Validation loss: 0.7032 | Validation accuracy: 50.36 %
epoch 3: 0mn:15s
	  Training loss: 0.6943 |   Training accuracy: 50.93 %
	Validation loss: 0.6991 | Validation accuracy: 49.86 %
---------
Running model with hidden_dim=128 and learning_rate=0.000100
epoch 1: 0mn:15s
	  Tra

### 3.5) Conclusion on grid search and general conclusion on this RNN model
The grid search on the hyperparameters **(hidden dimension, learning rate)** doesn't show any improvement in terms of accuracy either.
It seems clear that, at least for this specific task of sentiment analysis, this particular, simple RNN architecture is the limiting factor. Indeed, despite the different tunings and improvements we've tried on this model, we keep getting a poor accuracy, and a loss that does not decrease during training.

In the next section we will build on what we've done so far (spaCy and GloVe embeddings) and try another architecture (**LSTM**).

___

## 4) Model: LSTM, Pre-trained embeddings
Long Short-Term Memory (LSTM) networks are RNNs designed to overcome the main drawbacks of standard RNNs:
* standard RNNs have short-term memory. When a sequence is too long, RNNs struggle to propagate information from earlier steps to later ones. That is definitely an issue with IMDB reviews, which span multiple sentences.
* standard RNNs suffer from vanishing (and exploding) gradients due to how backpropagation is performed (backpropagation through time).

LSTMs add components to the architecture that mitigate these two drawbacks, in particular the memory cell C and the gates (input/update, forget and output gates) that control how the memory cell is updated and read.
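To make the role of the memory cell and the gates more concrete, here is a minimal sketch of a single LSTM time step written out by hand and compared with PyTorch's `nn.LSTMCell`. The toy dimensions and the helper name `manual_lstm_step` are assumptions made for this illustration only; they are not used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim, batch_size = 4, 3, 2
cell = nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(batch_size, input_dim)         # input at the current time step
h_prev = torch.randn(batch_size, hidden_dim)   # previous hidden state
c_prev = torch.randn(batch_size, hidden_dim)   # previous memory cell C

def manual_lstm_step(x, h_prev, c_prev, cell):
    """One LSTM time step, gate by gate (hypothetical helper)."""
    # all four gate pre-activations are computed in one matrix product
    gates = (x @ cell.weight_ih.t() + cell.bias_ih
             + h_prev @ cell.weight_hh.t() + cell.bias_hh)
    # PyTorch stores the gates in the order: input, forget, candidate, output
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_new = f * c_prev + i * g      # keep part of the old memory, write new content
    h_new = o * torch.tanh(c_new)   # expose a gated view of the memory
    return h_new, c_new

with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(x, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5))
print(torch.allclose(c_manual, c_ref, atol=1e-5))
```

The update of the memory cell is additive (`f * c_prev + i * g`), which is what lets information and gradients flow over many time steps more easily than in a standard RNN.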

Another change we're introducing is ***packed padded sequences***. Batches are still padded with `<pad>` tokens so that all sequences in a batch have the same length, but we pack them before passing them to the LSTM, so that the padded positions are skipped. We therefore need to tell the LSTM how long each sequence really is. The torchtext `data.Field` class returns these lengths along with the padded batch when its ***include_lengths*** parameter is set to `True`. This is what we will do.
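The effect of packing can be illustrated on a toy batch. The sketch below uses made-up dimensions and random "embeddings", not the IMDB data. It checks that, with a packed batch, the final hidden state of a short sequence is the same as if that sequence had been run alone, whereas without packing the LSTM keeps stepping through the padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

embedding_dim, hidden_dim = 5, 4
lstm = nn.LSTM(embedding_dim, hidden_dim)

# toy batch of 2 already-embedded sequences: [seq_len, batch_size, embedding_dim]
# the second sequence has only 1 real step; its steps 1 and 2 are padding (zeros)
lengths = torch.tensor([3, 1])   # sorted in decreasing order, as required by default
batch = torch.randn(3, 2, embedding_dim)
batch[1:, 1, :] = 0.0

with torch.no_grad():
    packed = pack_padded_sequence(batch, lengths)
    _, (hidden_packed, _) = lstm(packed)

    # reference: run the short sequence alone, without any padding
    _, (hidden_alone, _) = lstm(batch[:1, 1:2, :])

    # without packing, the LSTM also steps through the padded positions
    _, (hidden_padded, _) = lstm(batch)

print(packed.batch_sizes)   # number of sequences still active at each time step
print(torch.allclose(hidden_packed[-1, 1], hidden_alone[-1, 0], atol=1e-6))
print(torch.allclose(hidden_padded[-1, 1], hidden_alone[-1, 0], atol=1e-6))
```

Because `pack_padded_sequence` expects the lengths in decreasing order by default, the sequences of a batch need to be sorted by length before being packed.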

### 4.1) Building and encoding the IMDB data
As previously, using **GloVe**.

In [62]:
reviews = data.Field(tokenize = "spacy", include_lengths = True)
labels = data.LabelField(dtype = torch.float)

In [63]:
%%time
train_data, test_data = datasets.IMDB.splits(reviews, labels)

Wall time: 1min 27s


In [64]:
train_data, validation_data = train_data.split(split_ratio = 0.8)

In [65]:
maximum_vocab_size = 30_000
reviews.build_vocab(train_data, max_size=maximum_vocab_size,
                    vectors = "glove.6B.100d", unk_init = torch.Tensor.normal_)
labels.build_vocab(train_data)

In [66]:
batch_size = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [67]:
train_iter, valid_iter, test_iter = data.BucketIterator.splits((train_data, validation_data, test_data),
                                                               batch_size=batch_size,
                                                               sort_within_batch=True,
                                                               device=device)

### 4.1) Build our LSTM model

Our LSTM model will be comprised of:
#### An embedding layer
We'll use **GloVe** pre-trained embeddings, as previously. The input dimension of the layer is the size of the vocabulary. The output dimension must match the pre-trained embeddings used, i.e. **100** here.
#### An LSTM layer
The LSTM layer takes as input dimension the embedding dimension (the output of the embedding layer). Its output dimension is the dimension of the hidden state.
#### A linear layer
Our output layer is a classic linear (fully connected) layer. It takes as input the final hidden state of the LSTM layer, so its input dimension is the dimension of the LSTM hidden state. The output dimension is one: since this is a binary classification problem, a single scalar is enough. The layer outputs a raw score (logit), which a sigmoid then maps into the [0,1] range.
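As a quick sanity check of the dimensions described above, the following sketch pushes a random batch of token indices through the three layers and prints the tensor shapes. The tiny vocabulary size, sequence length and batch size are arbitrary values chosen for the illustration.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim, output_dim = 50, 100, 256, 1
seq_len, batch_size = 7, 4

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
fc = nn.Linear(hidden_dim, output_dim)

# random token indices standing in for a batch of reviews: [seq_len, batch_size]
tokens = torch.randint(0, vocab_size, (seq_len, batch_size))

with torch.no_grad():
    embedded = embedding(tokens)               # [seq_len, batch_size, embedding_dim]
    outputs, (hidden, cell) = lstm(embedded)   # hidden: [1, batch_size, hidden_dim]
    logits = fc(hidden[-1])                    # [batch_size, output_dim]

for name, tensor in [("tokens", tokens), ("embedded", embedded),
                     ("outputs", outputs), ("hidden", hidden), ("logits", logits)]:
    print(name, tuple(tensor.shape))
```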

#### Defining our model

In [68]:
class LSTM(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim,
                 output_dim, padding_idx):
        super(LSTM, self).__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
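    def forward(self, review, review_lengths):
        # review: [seq_len, batch_size] padded token indices
        embeddings = self.embedding(review)
        
        # pack the batch so that the LSTM skips the padded positions
        p_embeddings = nn.utils.rnn.pack_padded_sequence(embeddings, review_lengths)
        p_output, (hidden, cell) = self.lstm(p_embeddings)
        
        # unpack the per-step outputs (not used below, kept for reference)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(p_output)
        # hidden[-1]: final hidden state of the last layer, [batch_size, hidden_dim]
        hidden = self.fc(hidden[-1])
        
        return hidden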

#### Setting our model parameters:

In [69]:
input_dim = len(reviews.vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = 1
padding_idx = reviews.vocab.stoi[reviews.pad_token]

In [70]:
model_lstm = LSTM(input_dim, embedding_dim, hidden_dim,
                  output_dim, padding_idx)
model_lstm = model_lstm.to(device)

In [71]:
def model_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [72]:
print("RNN model has {0:,} parameters".format(model_parameters(model_lstm)))

RNN model has 3,367,049 parameters
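The parameter count printed above can be recomputed by hand. The sketch below assumes a vocabulary size of 30,002 (the 30,000 most frequent tokens plus the unknown and padding tokens) and a single-layer, unidirectional LSTM; the function name is made up for this example.

```python
def lstm_model_parameter_count(vocab_size, embedding_dim, hidden_dim, output_dim):
    """Trainable parameters of embedding + single-layer LSTM + linear layer."""
    embedding = vocab_size * embedding_dim
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    lstm = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    linear = hidden_dim * output_dim + output_dim
    return embedding + lstm + linear

print("{0:,}".format(lstm_model_parameter_count(30002, 100, 256, 1)))  # 3,367,049
```

Most of the parameters (3,000,200 out of 3,367,049) sit in the embedding layer.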


#### Replacing the embedding layer with GloVe embeddings.

In [73]:
GloVe_embeddings = reviews.vocab.vectors
print(GloVe_embeddings.shape)

torch.Size([30002, 100])


In [74]:
model_lstm.embedding.weight.data.copy_(GloVe_embeddings)

tensor([[ 0.6106, -2.3774,  0.2805,  ..., -0.4891, -0.1276, -0.2461],
        [ 1.0234, -0.3995, -0.7861,  ..., -0.1443, -0.6640, -0.7930],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7447, -0.2493, -0.0636,  ...,  0.1555,  1.0456,  0.1147],
        [-0.5160,  0.0312,  0.0855,  ..., -0.8593,  0.0180,  0.7858],
        [-0.2045,  0.0965, -0.4298,  ..., -0.4046,  0.4052,  0.3617]],
       device='cuda:0')

In [75]:
unknown_idx = reviews.vocab.stoi[reviews.unk_token]

model_lstm.embedding.weight.data[padding_idx] = torch.zeros(embedding_dim)
model_lstm.embedding.weight.data[unknown_idx] = torch.zeros(embedding_dim)
model_lstm.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7447, -0.2493, -0.0636,  ...,  0.1555,  1.0456,  0.1147],
        [-0.5160,  0.0312,  0.0855,  ..., -0.8593,  0.0180,  0.7858],
        [-0.2045,  0.0965, -0.4298,  ..., -0.4046,  0.4052,  0.3617]],
       device='cuda:0')
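The two rows set to zero above do not behave the same way during training. The toy sketch below uses a hypothetical 5-word vocabulary and random vectors standing in for the GloVe vectors. It suggests that the row at `padding_idx` receives no gradient and therefore stays at zero, whereas the unknown-token row is updated like any other embedding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical toy vocabulary: index 0 = <unk>, index 1 = <pad>, then 3 real words
unknown_idx, padding_idx = 0, 1
pretrained = torch.randn(5, 4)   # stands in for reviews.vocab.vectors

embedding = nn.Embedding(5, 4, padding_idx=padding_idx)
embedding.weight.data.copy_(pretrained)   # overwrites every row, <pad> included
embedding.weight.data[padding_idx] = torch.zeros(4)
embedding.weight.data[unknown_idx] = torch.zeros(4)

# one backward pass over a sequence containing <unk>, <pad> and a real word
tokens = torch.tensor([unknown_idx, padding_idx, 3])
embedding(tokens).sum().backward()

print(embedding.weight.grad[padding_idx])   # zeros: the <pad> row is never updated
print(embedding.weight.grad[unknown_idx])   # non-zero: the <unk> row is learned
```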

In [76]:
optimiser = optim.Adam(model_lstm.parameters())

In [77]:
loss_fn = nn.BCEWithLogitsLoss()
loss_fn = loss_fn.to(device)
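`nn.BCEWithLogitsLoss` expects the raw outputs (logits) of the linear layer and applies the sigmoid internally, which is why the model itself has no sigmoid layer. A minimal check on made-up values:

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.0])    # raw model outputs (any real number)
targets = torch.tensor([1.0, 0.0, 1.0])    # 1 = positive, 0 = negative

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_two_steps = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_with_logits.item(), loss_two_steps.item())
```

Both values agree up to floating-point error; the fused version is numerically more stable for large positive or negative logits.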

### 4.2) Running the model
We need to redefine the train and evaluate functions, as there is an additional input to take into account during training/validation (the review lengths), due to our change to ***packed padded sequences***.

In [78]:
def train_lstm_model(model, data, optimiser, loss_fn):
    '''
    Train the model on a set of data (training set as batch)
    That is, it performs the forward and backward propagation
    steps as well as gradient descent for an epoch.
    Parameters
    ----------
    - the model to train
    - the data: an iterator yielding batches of size
        batch_size
    - the chosen optimiser
    - the chosen loss function
    '''
    epoch_loss = 0
    epoch_accuracy = 0
    
    model.train()
    
    for data_batch in data:
        optimiser.zero_grad()
        review, review_lengths = data_batch.text
        predictions = model(review, review_lengths).squeeze(1)
        loss = loss_fn(predictions, data_batch.label)
        accuracy = get_accuracy(predictions, data_batch.label)
        loss.backward()
        optimiser.step()
        epoch_loss = epoch_loss + loss.item()
        epoch_accuracy = epoch_accuracy + accuracy.item()
    
    epoch_loss = epoch_loss / len(data)
    epoch_accuracy = epoch_accuracy / len(data)
    
    return(epoch_loss, epoch_accuracy)

In [79]:
def evaluate_lstm_model(model, data, loss_fn):
    '''
    Evaluate the model (-> using the validation set)
    This is not training, i.e. we do not compute gradient
    Parameters
    ----------
    - the model to evaluate
    - the data: an iterator yielding batches of size
        batch_size
    - the chosen loss function
    '''
    epoch_loss = 0
    epoch_accuracy = 0
    
    model.eval()
    
    with torch.no_grad():
        for data_batch in data:
            review, review_lengths = data_batch.text
            predictions = model(review, review_lengths).squeeze(1)
            loss = loss_fn(predictions, data_batch.label)
            accuracy = get_accuracy(predictions, data_batch.label)
        
            # accumulate loss and accuracy of each batch
            epoch_loss = epoch_loss + loss.item()
            epoch_accuracy = epoch_accuracy + accuracy.item()
        
    epoch_loss = epoch_loss / len(data)
    epoch_accuracy = epoch_accuracy / len(data)
    
    return(epoch_loss, epoch_accuracy)

In [80]:
def run_lstm_model(nb_epochs, model, train_iter, valid_iter, optimiser, loss_fn):
    '''
    Run training, validation for all epochs
    Output the loss and accuracy values for both
    training and validation steps
    Parameters
    ----------
    - the number of epochs
    - the model to train
    - the train and validation data: (iterators)
    - the chosen optimiser
    - the chosen loss function
    '''

    best_loss = float("inf")

    for an_epoch in range(nb_epochs):
    
        startt = datetime.now()
    
        # use the model passed as argument, not the global model_lstm
        train_loss, train_accuracy = train_lstm_model(model, train_iter, optimiser, loss_fn)
        validation_loss, validation_accuracy = evaluate_lstm_model(model, valid_iter, loss_fn)
    
        duration = (datetime.now() - startt)
        duration_str = "{0}mn:{1}s".format(duration.seconds//60, duration.seconds%60)
    
        print("epoch {0}: {1}".format(an_epoch+1, duration_str))
        print("\t  Training loss: {0:.4f} |   Training accuracy: {1:.2f} %".
              format(train_loss, train_accuracy*100))
        print("\tValidation loss: {0:.4f} | Validation accuracy: {1:.2f} %".
              format(validation_loss, validation_accuracy*100))

In [81]:
nb_epochs = 5
run_lstm_model(nb_epochs, model_lstm, train_iter, valid_iter, optimiser, loss_fn)

epoch 1: 0mn:16s
	  Training loss: 0.6380 |   Training accuracy: 63.04 %
	Validation loss: 0.5946 | Validation accuracy: 68.55 %
epoch 2: 0mn:16s
	  Training loss: 0.4870 |   Training accuracy: 77.68 %
	Validation loss: 0.5528 | Validation accuracy: 72.05 %
epoch 3: 0mn:16s
	  Training loss: 0.3658 |   Training accuracy: 84.43 %
	Validation loss: 0.3614 | Validation accuracy: 84.89 %
epoch 4: 0mn:16s
	  Training loss: 0.2346 |   Training accuracy: 90.88 %
	Validation loss: 0.3268 | Validation accuracy: 87.38 %
epoch 5: 0mn:16s
	  Training loss: 0.1689 |   Training accuracy: 93.80 %
	Validation loss: 0.4158 | Validation accuracy: 81.45 %
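In this run the validation loss is lowest at epoch 4 and goes back up at epoch 5. `run_lstm_model` initialises a `best_loss` variable but does not use it; one possible use, sketched below on the validation losses logged above, is to remember the best epoch and save the model at that point. The checkpoint file name in the comment is hypothetical.

```python
# validation losses copied from the 5-epoch run above
validation_losses = [0.5946, 0.5528, 0.3614, 0.3268, 0.4158]

best_loss = float("inf")
best_epoch = None
for epoch, validation_loss in enumerate(validation_losses, start=1):
    if validation_loss < best_loss:
        best_loss = validation_loss
        best_epoch = epoch
        # in a real training loop, the weights could be saved here, e.g.
        # torch.save(model.state_dict(), "best_lstm.pt")  (hypothetical file name)

print("best epoch: {0} (validation loss {1:.4f})".format(best_epoch, best_loss))
# best epoch: 4 (validation loss 0.3268)
```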


### 4.3) Predicting sentiment on a review
Let's again use some test_data reviews and a made-up review, as previously, and see whether this LSTM model predicts better on reviews it has never seen.

But first we need to change the predict function to include the length of our review:

In [82]:
def predict_review_sentiment_with_LSTM(model, review_as_tokens):
    '''
    Predict the sentiment of a movie review
    Parameters
    ----------
    - the ANN model to use to predict
    - A review as a list of tokens
    Return
    ------
    A sentiment prediction, as a [0,1] real
    A value < 0.5 meaning negative review
    '''
    model.eval()
    review_indexed = [reviews.vocab.stoi[token] for token in review_as_tokens]
    review_length = [len(review_indexed)]
    review_tensor = torch.LongTensor(review_indexed).to(device)
    review_tensor = review_tensor.unsqueeze(1)
    review_length_as_tensor = torch.LongTensor(review_length)
    sentiment_prediction = torch.sigmoid(model(review_tensor, review_length_as_tensor))
    return (sentiment_prediction.item())

Let's now pick a review at random among the 25,000 reviews of the **test_data**:

In [83]:
# valid indices run from 0 to len(test_data.examples) - 1
a_review = test_data.examples[rd.randint(0, len(test_data.examples) - 1)].__dict__
a_review_label = a_review["label"]
a_review_text = a_review["text"]
print("{0}\n".format(a_review_text))
if a_review_label == "neg":
    print("This is a NEGATIVE review !")
else:
    print("This is a POSITIVE review !")

['Rainy', 'day', 'with', 'not', 'much', 'to', 'do', '.', 'We', 'were', 'surfing', 'the', 'movie', 'network', 'channels', 'and', 'found', 'this', 'one', 'just', 'starting', ',', 'so', 'we', 'gave', 'it', 'a', 'chance.<br', '/><br', '/>The', 'more', 'we', 'watched', ',', 'the', 'more', 'we', 'became', 'engrossed', 'in', 'the', 'story', '.', 'Its', 'the', 'old', 'story', 'of', 'working', 'class', 'underdog', 'trying', 'to', 'make', 'it', 'in', 'a', 'sport', 'which', 'at', 'the', 'time', '(', '1913', 'I', 'think', ')', 'was', 'usually', 'played', 'by', 'the', 'wealthy', 'upper', 'class', 'but', 'this', 'movie', 'was', 'every', 'bit', 'as', 'interesting', 'as', 'Seabiscuit', 'and', 'this', 'is', 'also', 'based', 'on', 'a', 'true', 'story.<br', '/><br', '/>The', 'acting', 'is', 'believable', 'and', 'the', 'casting', 'is', 'brilliant', '.', 'AND', '.', '.', '.', '.', 'we', 'are', 'NOT', 'golfers', ',', 'so', 'please', 'do', "n't", 'miss', 'this', 'one', 'just', 'because', 'its', 'about', 'gol

And predict ...

In [84]:
sentiment = predict_review_sentiment_with_LSTM(model_lstm, a_review_text)
if sentiment < 0.5:
    print("This is a NEGATIVE review ! {0:.4f}".format(sentiment))
else:
    print("This is a POSITIVE review ! {0:.4f}".format(sentiment))

This is a POSITIVE review ! 0.5116


We can also make up our own review ... (*we're using spaCy here only to tokenise our review, as in the previous sections.*)

In [85]:
#my_review = "The best movie of the year"
my_review = "The best movie of the year. Very good, I like it, I enjoy it a lot. It's fun."
#my_review = "This is such a bad movie!"

my_review_tokens = [token.text for token in nlp.tokenizer(my_review)]
print(my_review_tokens)

['The', 'best', 'movie', 'of', 'the', 'year', '.', 'Very', 'good', ',', 'I', 'like', 'it', ',', 'I', 'enjoy', 'it', 'a', 'lot', '.', 'It', "'s", 'fun', '.']


And the prediction of this review is ...

In [86]:
sentiment = predict_review_sentiment_with_LSTM(model_lstm, my_review_tokens)
if sentiment < 0.5:
    print("This is a NEGATIVE review ! {0:.4f}".format(sentiment))
else:
    print("This is a POSITIVE review ! {0:.4f}".format(sentiment))

This is a POSITIVE review ! 0.9791


### 4.4) Conclusion on this LSTM / GloVe embeddings model
The results of this model are significantly better than those of all previous models! First of all, the training loss gets smaller at each epoch. We also reach an accuracy greater than **90%** on the training set, and up to 87% on the validation set. That's a huge jump from the accuracy of around 50% obtained with all the other models.

The LSTM is clearly a better model for this sentiment analysis task. Given the structure of the IMDB review data, it is very likely that the jump in performance is due to the LSTM's ability to handle long sequences.

However, not everything is perfect here. There is quite a gap between the accuracy on the training set and on the validation set (for this particular run, more than 10 points at the last epoch: 93.8% accuracy on the training set versus 81.5% on the validation set), and the validation loss goes back up at the last epoch. This is a typical sign of our model **overfitting**. We can try to correct overfitting by using **regularisation**. One computationally cheap and effective way of regularising an ANN is **dropout**. We'll add dropout to our LSTM model in the next section.

## 5) Model: Adding Dropout to LSTM & Pre-trained embeddings
Here we're simply adding **dropout** to our previous **LSTM/GloVe** model. Dropout consists of **randomly** dropping (zeroing) neurons during training. This forces the network to be more robust and prevents it from building complex co-adaptations on the training data.

We will add dropout on the output of the embedding layer and on the final hidden state of the LSTM. We'll use a dropout probability of **0.5**, which is often cited as giving close to the maximum regularisation for large networks. However, it would be worth trying different values for the "input" layer (embeddings) and the LSTM hidden layer.
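The behaviour of `nn.Dropout` can be checked on a tensor of ones (a toy input chosen for the illustration). In training mode a random subset of the values is zeroed and the remaining ones are scaled by 1 / (1 - p), so that the expected value is unchanged; in evaluation mode the layer does nothing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(0.5)
x = torch.ones(2, 8)

dropout.train()         # training mode: random units are zeroed
y_train = dropout(x)    # survivors are scaled by 1 / (1 - 0.5) = 2

dropout.eval()          # evaluation mode: dropout is a no-op
y_eval = dropout(x)

print(y_train)
print(y_eval)
```

This is why calling `model.train()` before training and `model.eval()` before validation or prediction matters once dropout is part of the model.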

### 5.1) Adding dropout to our LSTM model

In [87]:
class LSTM(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim,
                 output_dim, dropout, padding_idx):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, review, review_lengths):

        embeddings = self.dropout(self.embedding(review))
        
        p_embeddings = nn.utils.rnn.pack_padded_sequence(embeddings, review_lengths)
        p_output, (hidden, cell) = self.lstm(p_embeddings)
        hidden = self.fc(self.dropout(hidden[-1]))
        
        return(hidden)

#### Setting our model parameters:

In [88]:
input_dim = len(reviews.vocab)
embedding_dim = 100
hidden_dim = 256
output_dim = 1
dropout = 0.5
padding_idx = reviews.vocab.stoi[reviews.pad_token]

In [89]:
model_lstm = LSTM(input_dim, embedding_dim, hidden_dim,
                  output_dim, dropout, padding_idx)
model_lstm = model_lstm.to(device)

#### Replacing the embedding layer with GloVe embeddings.

In [90]:
GloVe_embeddings = reviews.vocab.vectors
print(GloVe_embeddings.shape)

torch.Size([30002, 100])


In [91]:
model_lstm.embedding.weight.data.copy_(GloVe_embeddings)

tensor([[ 0.6106, -2.3774,  0.2805,  ..., -0.4891, -0.1276, -0.2461],
        [ 1.0234, -0.3995, -0.7861,  ..., -0.1443, -0.6640, -0.7930],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7447, -0.2493, -0.0636,  ...,  0.1555,  1.0456,  0.1147],
        [-0.5160,  0.0312,  0.0855,  ..., -0.8593,  0.0180,  0.7858],
        [-0.2045,  0.0965, -0.4298,  ..., -0.4046,  0.4052,  0.3617]],
       device='cuda:0')

In [92]:
unknown_idx = reviews.vocab.stoi[reviews.unk_token]

model_lstm.embedding.weight.data[padding_idx] = torch.zeros(embedding_dim)
model_lstm.embedding.weight.data[unknown_idx] = torch.zeros(embedding_dim)
model_lstm.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7447, -0.2493, -0.0636,  ...,  0.1555,  1.0456,  0.1147],
        [-0.5160,  0.0312,  0.0855,  ..., -0.8593,  0.0180,  0.7858],
        [-0.2045,  0.0965, -0.4298,  ..., -0.4046,  0.4052,  0.3617]],
       device='cuda:0')

In [93]:
optimiser = optim.Adam(model_lstm.parameters())

In [94]:
loss_fn = nn.BCEWithLogitsLoss()
loss_fn = loss_fn.to(device)

### 5.2) Running the model

In [95]:
nb_epochs = 5
run_lstm_model(nb_epochs, model_lstm, train_iter, valid_iter, optimiser, loss_fn)

epoch 1: 0mn:16s
	  Training loss: 0.6625 |   Training accuracy: 60.62 %
	Validation loss: 0.5660 | Validation accuracy: 73.14 %
epoch 2: 0mn:15s
	  Training loss: 0.6040 |   Training accuracy: 68.31 %
	Validation loss: 0.5217 | Validation accuracy: 75.14 %
epoch 3: 0mn:16s
	  Training loss: 0.4768 |   Training accuracy: 78.14 %
	Validation loss: 0.3679 | Validation accuracy: 84.41 %
epoch 4: 0mn:16s
	  Training loss: 0.4832 |   Training accuracy: 77.18 %
	Validation loss: 0.4357 | Validation accuracy: 80.10 %
epoch 5: 0mn:16s
	  Training loss: 0.3338 |   Training accuracy: 86.24 %
	Validation loss: 0.3146 | Validation accuracy: 87.58 %


### 5.3) Conclusion on this LSTM with dropout / GloVe embeddings model

The regularisation using dropout works pretty well: it lowers the accuracy on the training set (86.24% instead of 93.80% after 5 epochs), **but** it definitely **increases** the accuracy on the validation set at the last epoch (87.58% instead of 81.45%), which is what really matters. Our model is now better on reviews it has never seen.

## Next steps ...

We made good progress in this notebook, but there are still some limitations, and therefore room for improvement. For instance, our model is not very good at predicting the sentiment of very short sentences/reviews (like "*This movie is not good*"), which in some of our tries has been classified as a positive review!
There is therefore still a lot that can be done, starting with trying other (more complex and more recent) network architectures, such as:
* Bidirectional RNNs, allowing the network to build predictions from both the ***past*** (start of the sequence) and the ***future*** (end of the sequence).
* More complex RNN architectures building on the LSTM (Long Short-Term Memory) used here, which is known to better capture/learn from long sequences (typically the case for the movie reviews in this dataset).
* Transformers (because it is a trendy architecture! Not sure it will work well on this problem, need to try).
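As a starting point for the bidirectional idea, here is a minimal sketch of how the LSTM and linear layers could be adapted. It has not been trained or evaluated in this notebook, and the toy batch below is random. With `bidirectional=True`, the LSTM returns one final hidden state per direction, so the linear layer takes `2 * hidden_dim` inputs.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim, output_dim = 100, 256, 1
seq_len, batch_size = 7, 4

lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
fc = nn.Linear(2 * hidden_dim, output_dim)

# random tensor standing in for a batch of embedded reviews
embedded = torch.randn(seq_len, batch_size, embedding_dim)

with torch.no_grad():
    outputs, (hidden, cell) = lstm(embedded)
    # hidden[-2]: final state of the forward pass, hidden[-1]: of the backward pass
    combined = torch.cat((hidden[-2], hidden[-1]), dim=1)
    logits = fc(combined)

print(tuple(hidden.shape), tuple(combined.shape), tuple(logits.shape))
```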