In [None]:
%%HTML
<link rel="stylesheet" type="text/css" href="../css/custom.css">

# Recurrent Neural Networks for Sentiment Classification


![footer_logo](../images/logo.png)

History matters for a lot of problems.
If you're looking at a video with a tiny dog house, you're probably more likely to think that the weird object in the next frame is a chihuahua and not a muffin.

<p style="text-align:center;">
    <img src="../images/chihuahua-muffin.png" alt="Drawing" style="width: 40%;"/>
</p>

> Source: https://twitter.com/teenybiscuit/status/707727863571582978

Feedforward networks learn their parameters once and have a fixed state, so they cannot take context in the input data into account.
Recurrent neural networks (RNNs) also learn their parameters once, but keep a state that depends on the sequence they have seen so far.
This makes RNNs well suited for problems with sequences, like converting speech to text: transcribing a word is easier when you know the words that came before.

In this exercise we'll apply an RNN to sentiment classification.
We'll get some movie reviews and try to classify whether they have a positive or negative sentiment.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os
import pandas as pd
import seaborn as sns

%matplotlib inline

## 1 Data

Like many other libraries, `keras` includes some standard datasets to play around with.
We'll use the IMDB dataset.
This section shows what this dataset contains.

From the [website](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) (emphasis ours): 

> "Dataset of __25,000 movies reviews from IMDB, labeled by sentiment (positive/negative)__. Reviews have been preprocessed, and __each review is encoded as a sequence of word indexes__ (integers). For convenience, __words are indexed by overall frequency__ in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
>
> As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.


We'll load reviews with only the 20,000 most frequent words:

In [None]:
from tensorflow.keras.datasets import imdb

NUM_WORDS = 20000

print("Loading data...")
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=NUM_WORDS)

print(len(x_train), "train sequences")
print(len(x_test), "test sequences")

`x_train` and `x_test` are `numpy.ndarray`s of Python lists, with one list of word indices per review.

An example is shown below: the samples don't all have the same length and the words are encoded as integers.

In [None]:
x_train[0]

This may be a valid way to represent text for machines, but it's not really useful for humans.
Let's try to get the original text back.

The `imdb` module ships with a function `get_word_index()` that maps words to integers, which we can invert to decode the reviews. We'll have to do some extra work, though: there are some special tokens that are not in the word index.
See the arguments `start_char`, `oov_char` and `index_from` of the function `imdb.load_data()` for more details.
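
To see what the offset does, here is a toy sketch with a made-up three-word vocabulary (the names `toy_word_index` and `toy_review` are hypothetical and only used here). It assumes the default settings of `imdb.load_data()`, where indices 0, 1 and 2 are reserved for padding, start-of-sequence and unknown words, and every word index is shifted by 3:

```python
# Toy stand-in for imdb.get_word_index(): words ranked by frequency, starting at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

OFFSET = 3  # Assumed default of `index_from` in imdb.load_data().

# Shift every word index and add the reserved special tokens.
toy_word_to_index = {word: rank + OFFSET for word, rank in toy_word_index.items()}
toy_word_to_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})

# Invert the mapping to decode integers back to words.
toy_index_to_word = {index: word for word, index in toy_word_to_index.items()}

toy_review = [1, 4, 5, 6, 2]  # An encoded "review" in the shifted index space.
decoded = " ".join(toy_index_to_word[i] for i in toy_review)
print(decoded)
```

The cell below applies the same idea to the real word index.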

In [None]:
INDEX_FROM = 3  # Offset added to every word index; actual words start at 4.

word_index = imdb.get_word_index()
word_to_index = {k: (v + INDEX_FROM) for k, v in word_index.items()}
# Add special words.
word_to_index["<PAD>"] = 0  # Padding
word_to_index["<START>"] = 1  # Starting of sequence
word_to_index["<UNK>"] = 2  # Unknown word

index_to_word = {v: k for k, v in word_to_index.items()}

With our dictionary `index_to_word` we can display the original reviews:

In [None]:
I_SHOW = 4

" ".join([index_to_word[w] for w in x_train[I_SHOW]])

In [None]:
x_train[I_SHOW]

> #### Exercise 
> 
> Play around with `I_SHOW` and read some other reviews!

## 2 Preprocessing

The previous section showed that the text is encoded as integers, but we need to do some more processing: `keras` needs all sequences (reviews) to be of equal length.

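We can choose to pad all sequences to the longest length, or we can choose a maximum review length and cut longer reviews.
We'll cut reviews to `MAXLEN=80` words and pad shorter ones. With its default settings, `pad_sequences` truncates and pads at the start of a sequence, so we keep the last 80 words of long reviews: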

In [None]:
from tensorflow.keras.preprocessing import sequence

MAXLEN = 80

X_train = sequence.pad_sequences(x_train, maxlen=MAXLEN)
X_test = sequence.pad_sequences(x_test, maxlen=MAXLEN)

print("Size of X_train", X_train.shape)
print("Size of X_test", X_test.shape)
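
The effect of padding and truncating is easiest to see on tiny sequences. The helper below is a pure-Python sketch, not the Keras implementation; the name `pad_or_truncate` is hypothetical. It mimics what `pad_sequences` does with its default `padding="pre"` and `truncating="pre"` settings:

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    kept = list(seq)[-maxlen:]  # Truncate at the start ("pre").
    return [value] * (maxlen - len(kept)) + kept  # Pad at the start ("pre").


short_review = [11, 12, 13]
long_review = [21, 22, 23, 24, 25, 26]

print(pad_or_truncate(short_review, maxlen=5))
print(pad_or_truncate(long_review, maxlen=4))
```

Short sequences get zeros in front; long sequences lose their first items.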

Is this a valid threshold?

The figure below shows that we'll be cutting most texts and padding only some.
However, this can be fine: with its default settings `pad_sequences` keeps the last 80 words of a review, and much of the sentiment could be expressed there.
If we find out that it's not enough, we'll come back and increase the text length!

In [None]:
lengths = [len(s) for s in x_train]

fig, ax = plt.subplots(figsize=(10, 6))
sns.kdeplot(lengths, cumulative=True, label="x_train", ax=ax)
ax.plot((MAXLEN, MAXLEN), ax.get_ylim(), "--k")  # Mark the chosen maximum length.
ax.set_xlabel("Sequence lengths [# words]")
ax.set_ylabel("Cumulative fraction")
ax.set_title("Occurrence of sequence lengths in reviews")
ax.legend(["x_train", "Max length"])

We're now done with the preprocessing!
In our training set we have 25,000 reviews of 80 words each (some of them padded).
All words are encoded by integers:

In [None]:
print("Size of X_train:", X_train.shape)

X_train[3, :]

## 3 Theory

We're done with all the data manipulation, so what is our neural network going to look like?
Our initial model will consist of three layers: an embedding layer, a recurrent layer and a dense layer.
The embedding layer learns the relations between words, the recurrent layer learns what the document is about and the dense layer translates that to sentiment.

### 3.1 Embedding layer

The embedding layer will embed our original word vectors in a dense, lower-dimensional space.
This embedding can capture complicated relationships between words and make it easier to learn.

We'll see in a minute what we mean by that, but let's first start with the traditional approach of __one-hot encoding__.
One-hot encoding assigns an index to every word and represents each word as a big vector of zeros with a single one at that index.


With one-hot encoding, the vocabulary "$\textsf{code - console - cry - cat - dog}$" would be represented like this:

![](../images/one_hot_encoding.png)


The three text snippets "$\textsf{code console}$", "$\textsf{cry cat}$" and "$\textsf{dog}$" are represented by combining these word vectors:

![](../images/one_hot_document.png)
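
A minimal NumPy sketch of this encoding for the toy vocabulary. It assumes that a snippet is represented by summing the one-hot vectors of its words; the names `word_to_column` and `encode_document` are made up for this example:

```python
import numpy as np

vocabulary = ["code", "console", "cry", "cat", "dog"]
word_to_column = {word: i for i, word in enumerate(vocabulary)}

# Row i of the identity matrix is the one-hot vector of word i.
one_hot = np.eye(len(vocabulary), dtype=int)


def encode_document(text):
    """Sum the one-hot vectors of all words in a whitespace-separated snippet."""
    vector = np.zeros(len(vocabulary), dtype=int)
    for word in text.split():
        vector += one_hot[word_to_column[word]]
    return vector


documents = ["code console", "cry cat", "dog"]
document_matrix = np.array([encode_document(d) for d in documents])
print(document_matrix)
```

Every new word in the vocabulary adds a column, and most entries stay zero.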


This representation has some problems.

This matrix will be very large for a large vocabulary and also very empty.
Many statistical models have problems learning from such big and sparse data. 
There are __too many features__ to learn from and __not enough samples__ to understand every feature.
Combining words in an intelligent way could solve this.

Treating words as __atomic units__ throws away a lot of information.
"$\textsf{cat}$" is more similar to "$\textsf{dog}$" than to "$\textsf{code}$", and "$\textsf{console}$" has a different meaning when occurring next to "$\textsf{code}$" than when it's next to "$\textsf{cry}$".
These complex relationships cannot be represented by our simple one-hot encoding.

Instead of learning from one-hot encoding, we first let the neural network __embed__ words in a smaller, continuous vector space where similar words are close to each other.
The smaller space is easier to learn from, and a continuous representation allows the network to learn complex relationships.

Such an embedding for our vocabulary could look like this: 

![](../images/embedding_encoding.png)


We only need two dimensions for our words instead of five, "$\textsf{cat}$" is close to "$\textsf{dog}$", and "$\textsf{console}$" is somewhere between "$\textsf{code}$" and "$\textsf{cry}$".
Closeness in this space indicates similarity.
Encoding our documents with the average of their word vectors also makes a lot of sense:
![](../images/embedding_document.png)
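
The averaging can be sketched with NumPy. The 2-D coordinates below are chosen by hand for illustration and are not read from the figures; the helper names `embed_document` and `distance` are hypothetical:

```python
import numpy as np

# Hypothetical 2-D embeddings, picked by hand so that similar words are close.
embedding = {
    "code": np.array([0.9, 0.1]),
    "console": np.array([0.6, 0.4]),
    "cry": np.array([0.3, 0.7]),
    "cat": np.array([0.1, 0.9]),
    "dog": np.array([0.15, 0.95]),
}


def embed_document(text):
    """Average the word vectors of a whitespace-separated snippet."""
    return np.mean([embedding[word] for word in text.split()], axis=0)


documents = ["code console", "cry cat", "dog"]
doc_vectors = {d: embed_document(d) for d in documents}


def distance(a, b):
    """Euclidean distance between two embedded snippets."""
    return float(np.linalg.norm(doc_vectors[a] - doc_vectors[b]))


print(distance("dog", "cry cat"))
print(distance("dog", "code console"))
```

With one-hot encoding, the three snippets share no columns, so no pair is closer than any other in a meaningful way.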


The snippet "$\textsf{dog}$" is now closer to "$\textsf{cry cat}$" than to "$\textsf{code console}$".
How is this different from the one-hot encoding?

These vectors are thus a __lower dimensional__, __denser representation__ of our words and they also capture __semantic information__ about words and their relationships to one another.
Certain directions in the vector space embed certain semantic relationships between words, such as male-female, verb tense and country-capital.

<img src="../images/linear-relationships.png" alt="Drawing" style="width: 90%;"/>

> Source: https://www.tensorflow.org/tutorials/word2vec

Build an embedding layer in `keras` using `keras.layers.Embedding`.
`keras` can learn this layer for you, but you can also use pretrained embeddings generated by others.
More on this later.

### 3.2 Recurrent layer

__Recurrent Neural Nets__ naturally deal with word order because they can go over a __sequence__ of words and keep a __memory__ of the information that has been calculated so far.

This could help when trying to assign sentiment to sentences, as shown in the figure below.
A word can trigger a sentiment that carries on for one or multiple sentences.

<img src="../images/sentiment-neuron.gif" style="width: 75%;"/>


> Source: [Unsupervised Sentiment Neuron](https://blog.openai.com/unsupervised-sentiment-neuron/)

If we were interested in __understanding a document__ like in the previous example, we could use the following architecture:

<img src="../images/rnn-architecture.png" style="width: 75%;"/>

> Source: [Goodfellow, 2016]

The left side of the figure shows a short-hand of the neural network, the right side shows the unrolled version.
In the figure we have:

* $\mathbf{x}^{(t-1)}$, $\mathbf{x}^{(t)}$, $\mathbf{x}^{(t+1)}$: input word vectors at time-steps $t-1$, $t$ and $t+1$.
* $\mathbf{h}^{(t-1)}$, $\mathbf{h}^{(t)}$, $\mathbf{h}^{(t+1)}$: outputs (states) of the network at time-steps $t-1$, $t$ and $t+1$.

At each time-step, the input is the output of the previous time-step $\mathbf{h}^{(t-1)}$ and a new input word vector $\mathbf{x}^{(t)}$.
Over time we adjust our idea of the document $\mathbf{h}^{(t)}$ until we've seen all words in the document.
This is illustrated in the figure above: we get a new word vector at each time-step and __carry over a score__.

The final score $\mathbf{h}^{(T)}$ represents what the neural network has learned about the document after having seen every word.
We could, for instance, use the final scores to detect sentiments - and that's exactly what we'll be doing!
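
The recurrence can be sketched in a few lines of NumPy. This is a plain ("vanilla") RNN update, $\mathbf{h}^{(t)} = \tanh(W_x \mathbf{x}^{(t)} + W_h \mathbf{h}^{(t-1)} + \mathbf{b})$, with random untrained weights and made-up dimensions; it is meant to show how the state is carried over, not how the layer we'll use in the model works internally:

```python
import numpy as np

rng = np.random.default_rng(0)

EMBEDDING_DIM = 4  # Size of each word vector x_t (hypothetical).
HIDDEN_DIM = 3  # Size of the state h_t (hypothetical).
SEQ_LEN = 5  # Number of words in the toy document.

# Random, untrained parameters; training would normally learn these.
W_x = rng.normal(scale=0.5, size=(HIDDEN_DIM, EMBEDDING_DIM))
W_h = rng.normal(scale=0.5, size=(HIDDEN_DIM, HIDDEN_DIM))
b = np.zeros(HIDDEN_DIM)


def rnn_forward(inputs):
    """Return the state after every time-step for a sequence of word vectors."""
    h = np.zeros(HIDDEN_DIM)  # Initial state: nothing seen yet.
    states = []
    for x_t in inputs:
        # The new state depends on the current word and the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.array(states)


word_vectors = rng.normal(size=(SEQ_LEN, EMBEDDING_DIM))
states = rnn_forward(word_vectors)
final_state = states[-1]  # Summary of the whole sequence.
print(final_state)
```

The same weights are reused at every time-step; only the state changes as more words come in.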

We'll use a specific kind of recurrent layer: an LSTM.
Long Short-Term Memory (LSTM) units are able to learn long-term dependencies and often perform better than standard RNNs.
Read [this blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) if you'd like more info.

The LSTM layer can be found in `keras.layers.LSTM`.

### 3.3 Dense layer

The first layer learns a good representation of words, the second learns to combine words in a single idea, and the final layer turns this idea into a classification.
We will use a simple dense layer from `keras.layers.Dense` that transforms the idea vector into a score between 0 and 1.
The layer will consist of a single neuron that takes all connections and outputs this score, which we threshold to get the class 0 or 1.
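
As a rough sketch of what such a single neuron computes, assuming a sigmoid activation (a common choice for binary classification, not something fixed by the text above). The weights and the input vector are made up:

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_neuron(h, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid."""
    z = sum(w * h_i for w, h_i in zip(weights, h)) + bias
    return sigmoid(z)


final_state = [0.2, -0.5, 0.9]  # Hypothetical output of the recurrent layer.
weights = [1.5, -2.0, 0.5]  # Hypothetical learned weights.
bias = 0.1

probability = dense_neuron(final_state, weights, bias)
label = int(probability >= 0.5)  # Threshold the score to get a class.
print(probability, label)
```

The score can be read as the probability of class 1, which fits the binary cross-entropy loss used in the next section.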

## 4 Model

Now that we know what the architecture looks like and we have our data in `X_train` and `X_test`, it's time to build a model.

> #### Exercise: LSTM for sentiment classification
>
> * Build a sequential model with three layers: embedding, LSTM and dense layers. 
>     * Don't make the embedding layer *larger than* 200 units. Use `NUM_WORDS` as the vocabulary size.
>     * Use *at most* 256 LSTM units. Play around with parameters `dropout` and `recurrent_dropout`.
> * Compile the model with `'binary_crossentropy'` as the loss and use `'accuracy'` as validation metric.
> * Add callbacks to the fitting: use `keras.callbacks.ModelCheckpoint()` and `keras.callbacks.EarlyStopping()`.
> * Reasonable test scores are 0.42 for the binary cross-entropy and 0.81 for the accuracy.

In [None]:
# %load ../answers/sentiment_lstm.py


If you reached the benchmarks, you've successfully trained a recurrent neural network for text classification!

The next section gives some pointers on what to do next.

## 5 Going deeper

We've only scratched the surface of applying RNNs to text.
This section contains some more exercises to deepen your understanding.

> #### Exercise: Baseline
>
> This dataset is small, so it might not really benefit from the complexity of deep learning.
> Use [`sklearn`](http://scikit-learn.org/stable/) to create a baseline: create a `Pipeline` using the `TfidfVectorizer()` and `BernoulliNB()`.
> What's your best score?
>
> Hint:
> - Use `X_train_translated` and `X_test_translated` from below; `TfidfVectorizer()` works better with real strings than integers.
> You can still use `y_train` and `y_test`.

In [None]:
# Convert indices to text.
X_train_translated = [" ".join(index_to_word[w] for w in s) for s in x_train]
X_test_translated = [" ".join(index_to_word[w] for w in s) for s in x_test]

In [None]:
# %load ../answers/sentiment_sklearn.py



> #### Exercise: Minimum Viable Network
> 
> You probably don't need a big network for this small dataset: there's not enough data to learn really complex relations.
What's the smallest network you still get good results with?

In [None]:
# %load ../answers/sentiment_mvn.py



> #### Exercise: Visualizing Embeddings
>
> Section 3 argued that the embeddings would learn relations between words.
> However, we didn't prove this for this solution: we just provided an architecture for the network and told the network to learn sentiment.
> This means that it didn't necessarily use the embedding to learn a representation that makes sense to us: everything was conditioned on sentiment classification.
>
> Visualize the embeddings: are they any good?
> * Get the weights from the right layer:
>     * See the attribute `.layers` of the network.
>     * Use the method `.get_weights()` of the layer.
> * Use [TSNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to map the weights into a 2D representation.

In [None]:
# %load ../answers/sentiment_tsne.py


> #### Exercise: Transfer Learning
>
> Instead of training the embeddings, you can use word embeddings pretrained on a large corpus.
This allows you to leverage complex relations learned from large corpora on your smaller datasets.
>
> Read [this Keras blog](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) and use word embeddings from GloVe.
Note that not all words may be present, so you'll have to do some preprocessing.
Does this improve your model?