# Lab 3: Sentiment Analysis with LSTMs using Keras

In this lab session, you'll implement an **RNN-based sentence classifier**. Plain old RNNs aren't very good at sentiment classification, so we will use Keras to implement a more advanced __Long Short Term Memory__ (LSTM) based sentiment classifier. 


__Objectives__: In this lab session you will learn the following:

- Word embedding representation
- Preprocessing data for recurrent architectures (sequence padding)
- Implementation of LSTM-based classifier



----

Remember from the theory that a Recurrent Neural Network applies the same function (the recurrent cell) over and over to every token in the sequence. In a simplified version, each new token is combined with the previous _state_ (which contains the information about what has been seen so far) inside the recurrent function, so that the whole sequence ends up represented in a single vector. 
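The recurrence described above can be sketched in a few lines of NumPy. This is only an illustration with a plain (non-LSTM) recurrent cell and random weights; the function name `simple_rnn_encode` and the toy dimensions are assumptions made for this example, not part of the lab code.

```python
import numpy as np

def simple_rnn_encode(inputs, W_x, W_h, b):
    """Fold a sequence of token vectors into a single state vector."""
    state = np.zeros(W_h.shape[0])           # initial state: nothing seen yet
    for x_t in inputs:                       # one token at a time
        # The same weights are reused at every time-step
        state = np.tanh(W_x @ x_t + W_h @ state + b)
    return state                             # summary of the whole sequence

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 4, 3     # toy sizes (assumed)
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sentence_vector = simple_rnn_encode(inputs, W_x, W_h, b)
print(sentence_vector.shape)  # (3,)
```

Whatever the sequence length, the result is a vector of size `hidden_dim`, which is what a classifier layer can consume.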

The figure below shows an unrolled LSTM architecture, in which the input text sequence, once tokenized and mapped to word indexes, is represented by word embeddings. The embedding lookup layer takes a list of word indexes and returns a list of word embeddings (low-dimensional dense vectors that represent words). These word embeddings are what is actually fed to the sentence encoder. Finally, the last output of the LSTM is fed to a fully connected Dense layer. As we'll learn, this is fairly easy to code with Keras.


![](http://ixa2.si.ehu.es/~jibloleo/uc3m_dl4nlp/img/LSTM_sentiment.png)


Advantages of these types of architectures:

- We do not need to show the whole sequence to the model at once. Each input token is processed one at a time, and the current state is kept in memory for the next step.
- We save memory because the weights are shared across time-steps.

## 1. Loading the data
We'll use the same data as in the previous session.

In [0]:
# Mount Drive files
from google.colab import drive
drive.mount('/content/drive')

In [0]:
sst_home = 'drive/My Drive/kschool-nlp/data/trees/'

In [0]:
import numpy as np
import pandas as pd
import re

import tensorflow as tf
from sklearn.utils import shuffle

## for replicability of results
np.random.seed(1)
tf.set_random_seed(2)

# Let's do 2-way positive/negative classification instead of 5-way    
def load_sst_data(path,
                  easy_label_map={0:0, 1:0, 2:None, 3:1, 4:1}):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            example['label'] = easy_label_map[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)
    data = pd.DataFrame(data)
    return data

def pretty_print(example):
    print('Label: {}\nText: {}'.format(example['label'], example['text']))

training_set = load_sst_data(sst_home+'/train.txt')
dev_set = load_sst_data(sst_home+'/dev.txt')
test_set = load_sst_data(sst_home+'/test.txt')

# Shuffle dataset
training_set = shuffle(training_set)
dev_set = shuffle(dev_set)
test_set = shuffle(test_set)

# Obtain text and label vectors
train_texts = training_set.text
train_labels = training_set.label

dev_texts = dev_set.text
dev_labels = dev_set.label

test_texts = test_set.text
test_labels = test_set.label


print('Training size: {}'.format(len(training_set)))
print('Dev size: {}'.format(len(dev_set)))
print('Test size: {}'.format(len(test_set)))

## 2. Preprocessing: Tokenization, Sequence Padding


**Word representation**

Once the data is loaded, the next step is to preprocess it to obtain its vectorized form (i.e. to transform text into numeric tensors), which basically consists of:

- Tokenization: typically, segmenting the text into words. (Alternatively, we could segment text into characters, or extract n-grams of words or characters.)
- Definition of the dictionary index and vocabulary size (in this case we keep the 10,000 most frequent words)
- Transformation of each **word** into a vector. 


There are multiple ways to vectorize tokens. The two main ones are ___one-hot encoding___ and ___word embedding___. In this lab, we'll first use Keras basic tools to obtain the one-hot encoding, and we'll leave word embeddings for the next section.

### 2.1. One-hot encoding of the data

One-hot encoding is the most basic way to convert a token into a vector. Here, we'll turn the input tokens into (0,1)-vectors. The process consists of associating a unique integer index with every word in the vocabulary.

>>>>>![](http://ixa2.si.ehu.es/~jibloleo/uc3m_dl4nlp/img/vectorize_small.png)


For example, if the tokenized text contains a word whose dictionary index is 14, then in the processed vector the 14th entry will be 1 and the rest will be set to 0.

Note that when using Keras built-in tools for indexing, ```0``` is a reserved index that won't be assigned to any word.
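As a small illustration of the idea, the sketch below one-hot encodes a toy sentence with NumPy. The word index is made up for this example (it is not the one built by the Keras tokenizer), and index ```0``` is left unassigned to follow the same convention.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a (0,1)-vector with a single 1 at position `index`."""
    vec = np.zeros(vocab_size, dtype=int)
    vec[index] = 1
    return vec

# Toy word index (assumed for illustration); index 0 is reserved
word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}
vocab_size = len(word_index) + 1

tokens = 'the movie was great'.split()
encoded = np.stack([one_hot(word_index[t], vocab_size) for t in tokens])
print(encoded.shape)  # (4, 5): one row per token, one column per index
print(encoded)
```

Each row contains exactly one 1, and column 0 stays empty because no word is mapped to the reserved index.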


**Sentence representation**

When processing data to feed a recurrent architecture, we need to do it differently from what we have seen so far. Unrolling each sequence one by one would take forever, as we would lose the ability to parallelize. In deep learning frameworks, learning is done by mini-batching the training data, which requires all the input examples in a mini-batch to have the same sequence length. 

In order to do mini-batching:
- Choose a single unrolling constant N (e.g. max sequence length)
- Pad the shorter sequences with zeros at the beginning (shifting the words to the right)


There are more sophisticated alternatives, like grouping examples by sentence length (setting N to the maximum length in each mini-batch). 
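To make the padding step concrete, here is a minimal pure-Python sketch. The helper `pad_left` is hypothetical and written only for this example; it pads at the beginning and, for sequences longer than N, keeps the last N tokens, which is how the default settings of the Keras padding function behave.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # keep the last `maxlen` tokens
        padding = [value] * (maxlen - len(seq))   # zeros go in front
        padded.append(padding + seq)
    return padded

batch = [[5, 8], [3, 9, 2, 7, 4, 6], [1]]
padded_batch = pad_left(batch, maxlen=4)
print(padded_batch)
# [[0, 0, 5, 8], [2, 7, 4, 6], [0, 0, 0, 1]]
```

After padding, every example has the same length, so the batch can be stored as a single 2D integer tensor.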

In the following chunk of code, we will use Keras built-in functions for tokenization and padding sequence.


In [0]:
from keras import preprocessing

max_words = 10000
max_seq = 40

# Create a tokenizer that keeps the 10000 most common words
tokenizer = preprocessing.text.Tokenizer(num_words=max_words)

# Build the word index (dictionary)
tokenizer.fit_on_texts(train_texts) # Create word index using only training part
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Get data as a lists of integers
train_sequences = tokenizer.texts_to_sequences(train_texts)
dev_sequences = tokenizer.texts_to_sequences(dev_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Padding data: Turn the lists of integers into a 2D integer tensor of shape `(samples, max_seq)`
x_train = preprocessing.sequence.pad_sequences(train_sequences, maxlen=max_seq)
x_dev = preprocessing.sequence.pad_sequences(dev_sequences, maxlen=max_seq)
x_test = preprocessing.sequence.pad_sequences(test_sequences, maxlen=max_seq)

y_train = train_labels
y_dev = dev_labels
y_test = test_labels

print('Shape of the training set (nb_examples, vector_size): {}'.format(x_train.shape))
print('Shape of the validation set (nb_examples, vector_size): {}'.format(x_dev.shape))
print('Shape of the test set (nb_examples, vector_size): {}'.format(x_test.shape))

# Print some examples
print()
print('TEXT: {}\nPADDED: {}'.format(train_texts.iloc[0], x_train[0]))
print()
print('TEXT: {}\nPADDED: {}'.format(train_texts.iloc[1], x_train[1]))

## 3. First LSTM-based classifier

In this section we will build our first LSTM-based classifier in Keras. Note that the LSTM layer, like the other recurrent layers in Keras, takes inputs of shape ```[batch_size, sequence_length, input_features]```, which is why we padded the data when preprocessing it in the previous section.

Note that after padding, our data is still a 2D tensor of shape ```[batch_size, sequence_length]```. The transformation from 2D to 3D is done by the ```Embedding``` layer in Keras, which takes the 2D tensor (```[batch_size, sequence_length]```) as input. Each entry of the batch is a list of word indexes of length ```sequence_length```, and all the entries in the batch have the same length. (That's why we padded the shorter sequences with zeros.)

The ```Embedding``` layer returns a tensor of shape ```[batch_size, sequence_length, embedding_size]```. This can be understood as replacing each word index in the sequence with its corresponding embedding vector. The layer can be initialized at random and learned with backpropagation, or it can use precomputed embedding vectors like _Word2vec_ or _GloVe_.
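Conceptually, the embedding lookup is just row indexing into a table. The NumPy sketch below illustrates the 2D to 3D transformation with a random table and toy sizes; it is an assumption-laden illustration of the idea, not how Keras implements the layer internally.

```python
import numpy as np

vocab_size, embedding_size = 6, 3            # toy sizes (assumed)
rng = np.random.default_rng(0)
# One row per word index; in a real model these values are learned
embedding_table = rng.normal(size=(vocab_size, embedding_size))

# A padded batch of word indexes: [batch_size, sequence_length]
x = np.array([[0, 0, 2, 5],
              [1, 3, 4, 2]])

# Indexing with an integer array replaces every index by its row
embedded = embedding_table[x]
print(x.shape, '->', embedded.shape)  # (2, 4) -> (2, 4, 3)
```

The vector at position `[0, 2]` of the output is simply row 2 of the table, because the word index at that position is 2.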

Once our data is represented with 3D tensors we can use directly the ```LSTM``` layer. For this task we will combine three Keras layers in this specific order: 

- ```Embedding``` layer: it will transform the data from 2D to 3D by mapping the words in the sequences to their associated embeddings. In the constructor we need to specify two arguments: 
   - input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
   - output_dim: int >= 0. Dimension of the dense embedding.
   
- ```LSTM``` layer: It will encode the input sequences and return the output tensor. For the LSTM we need to specify the number of units:
   - units: Positive integer, dimensionality of the output space.
   
- ```Dense```: It will take the output of the LSTM as input and perform the classification. 

In [0]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_words = 10000
embedding_size = 128
lstm_hidden_size = 128 #128

model = Sequential()

# 1. Define and add Embedding layer to the model. 
#    Note we are using mask_zero=True as we want to ignore the '0' words in the padding
model.add(Embedding(max_words, embedding_size, mask_zero=True))
# After the Embedding layer, 
# our activations have shape `(batch_size, max_seq, embedding_size)`.

# 2. Define and add LSTM layer to the model.
model.add(LSTM(lstm_hidden_size))

# 3. Define and add Dense layer to the model
model.add(Dense(1, activation='sigmoid'))

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.summary()

In [0]:
history = model.fit(x_train, y_train, epochs=20, batch_size=128, validation_data=(x_dev, y_dev), verbose=1)

In [0]:
import matplotlib.pyplot as plt

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

score = model.evaluate(x_test, y_test, verbose=1)
print("Accuracy: ", score[1])

### Exercise 1

- Try different embedding sizes and numbers of LSTM units. Do you see any differences in the loss and accuracy curves? What happens with a very small embedding size (e.g. ```embedding_size = 8```) or very few LSTM units (e.g. ```units = 16```)? And what happens when we do the opposite?

- __Hint:__ Plotting the original model's loss curve next to the curve of your own configuration will help you analyse the differences.

- __Hint__: Experiments take longer than in the previous labs (model complexity has increased, as well as the vocabulary size); you can increase ```batch_size``` to speed up the experiments.

### Exercise 2

- It is sometimes useful to stack several recurrent layers one after the other in order to increase the representational power of a network. How would you do it? 

- __Hint:__ You have to get all intermediate layers to return full sequences. Recurrent layers in Keras run in two modes that return different types of tensors:
  - The first one returns the last output for each input sequence $\rightarrow$ ```[batch_size, output_features]```
  - The second one returns the full sequence of successive outputs, one per time-step $\rightarrow$ ```[batch_size, sequence_length, output_features]```
  
- These two modes are controlled by the ```return_sequences``` argument in the layer constructor.

## 4. Second LSTM: Initialized with GloVe

When the training set is not large enough, it is usually a good idea to use precomputed word embeddings, as they add external knowledge that is very difficult to acquire from the training set alone.

### 4.1 Reading precomputed embeddings





In [0]:
def read(file, threshold=0, dim=50, word_index=None):
    # Cap the vocabulary at `threshold` words (never above 400000)
    max_words = 400000 if threshold <= 0 else min(threshold, 400000)

    # Parse the embeddings file: each line is a word followed by its vector
    embeddings = {}
    lines = file.read()
    file.close()
    lines = lines.decode('utf8')
    for line in lines.split('\n'):
        vec = line.split(' ')
        word = vec[0]
        coefs = np.asarray(vec[1:], dtype='float32')
        embeddings[word] = coefs

    # Row i holds the vector of the word whose tokenizer index is i.
    # Words not found in the embedding index will be all-zeros.
    matrix = np.zeros((max_words, dim))
    for word, i in word_index.items():
        embedding_vector = embeddings.get(word)
        if i < max_words and embedding_vector is not None:
            matrix[i] = embedding_vector
    return matrix

In [0]:
import bz2

# Read input embeddings
glove_home = 'drive/My Drive/kschool-nlp/data/word-embeddings/'
embsfile = bz2.open(glove_home + 'glove.6B.50d.txt.bz2')
embedding_matrix = read(embsfile, threshold=max_words, word_index=word_index)

print(embedding_matrix.shape)

In [0]:
embedding_matrix[1]

In [0]:
word_index

### 4.2 Build LSTM based model

We will use the same architecture as before. The only difference is that now we are going to use an ```embedding_size``` of 50, and the model will be compiled later (after we load the GloVe embeddings).

In [0]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_words = 10000
embedding_size = 50
lstm_hidden_size = 128

model = Sequential()

# 1. Define and add Embedding layer to the model
model.add(Embedding(max_words, embedding_size, mask_zero=True))
# After the Embedding layer, 
# our activations have shape `(batch_size, max_seq, embedding_size)`.

# 2. Define and add LSTM layer to the model.
model.add(LSTM(lstm_hidden_size))

# 3. Define and add Dense layer to the model
model.add(Dense(1, activation='sigmoid'))

model.summary()

### 4.3 Load precomputed weights

Once we have created the embedding matrix (in Section 4.1) in the correct format, we can easily load it into the Embedding layer. Remember that the matrix has shape ```[max_words, embedding_size]```, where each row ```i``` contains the embedding vector of the word with index ```i``` in the dictionary created by the tokenizer. Note that index 0 is not supposed to stand for any word or token.
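The expected matrix layout can be checked on a toy example. The sketch below uses a made-up two-dimensional "embedding file" and word index (both are assumptions for illustration, as is the helper name `build_embedding_matrix`) to show which rows get filled and which stay at zero.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, max_words, dim):
    """Place each known word's vector in the row given by its index."""
    matrix = np.zeros((max_words, dim))
    for word, i in word_index.items():
        vec = vectors.get(word)
        if i < max_words and vec is not None:
            matrix[i] = vec
    return matrix

# Toy precomputed vectors and toy tokenizer index (assumed values)
toy_vectors = {'good': np.array([0.1, 0.2]), 'bad': np.array([-0.3, 0.4])}
toy_word_index = {'good': 1, 'unseen': 2, 'bad': 3}

toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors,
                                    max_words=4, dim=2)
print(toy_matrix)
```

Row 0 (the reserved index) and row 2 (a word missing from the precomputed vectors) remain all-zeros, while rows 1 and 3 hold the vectors of `good` and `bad`.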

The Embedding layer is the first layer in our model, and we can access it through the ```model.layers``` list. Once we have it, we can initialize its weights as in the code below. 

In addition, we can freeze the weights so that they are not updated during training and do not forget what they already know. This is done by setting the ```trainable``` attribute of the layer to ```False```.

In [0]:
model.layers[0].set_weights([embedding_matrix]) # This is the key step!!!
model.layers[0].trainable = False

### 4.4. Train and evaluate

Training and evaluation are done as in the previous models. First we need to compile the model (specifying the loss, optimizer and evaluation metric), then fit it on the training data.

In [0]:
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=20, batch_size=128, validation_data=(x_dev, y_dev), verbose=2)

In [0]:
import matplotlib.pyplot as plt

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

score = model.evaluate(x_test, y_test, verbose=1)
print("Accuracy: ", score[1])

In [0]:
history10 = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_dev, y_dev), verbose=2)

In [0]:

# summarize history for loss
plt.plot(history.history['loss'] + history10.history['loss'])
plt.plot(history.history['val_loss'] + history10.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

# summarize history for accuracy
plt.plot(history.history['acc'] + history10.history['acc'])
plt.plot(history.history['val_acc'] + history10.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

#score = model.evaluate(x_test, y_test, verbose=1)
#print("Accuracy: ", score[1])

The model does not show any improvement over the models of the previous section. There can be many reasons for that, for example:
  - The vocabulary size is not big enough, so we can expect many unknown words when encoding the sentences (not checked).
  - The maximum sequence length might be too small, so we are leaving out some important information (not checked).
  - The training set is not very large, and in such cases simpler models usually perform better (due to overfitting).
 
 
The loss plot shows heavy overfitting: the training loss decreases very fast while the development loss increases over time. 
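The two hypotheses marked as "not checked" could be examined with a couple of simple statistics. The sketch below is one possible way to do it, assuming plain whitespace tokenization (the Keras tokenizer also strips punctuation, so numbers on the real data would differ somewhat); the helper names and the toy data are made up for this example.

```python
def oov_rate(texts, vocabulary):
    """Fraction of tokens that are not covered by the vocabulary."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    unknown = sum(1 for tok in tokens if tok not in vocabulary)
    return unknown / len(tokens)

def truncated_fraction(texts, max_seq):
    """Fraction of sentences longer than `max_seq` tokens."""
    too_long = sum(1 for text in texts if len(text.split()) > max_seq)
    return too_long / len(texts)

# Toy data (assumed) just to show the calls
toy_texts = ['a great movie', 'a dull and lifeless movie']
toy_vocab = {'a', 'great', 'movie', 'dull'}

print(oov_rate(toy_texts, toy_vocab))            # 0.25
print(truncated_fraction(toy_texts, max_seq=4))  # 0.5
```

In the notebook, the vocabulary could be taken from the words of `word_index` whose index is below `max_words`, and the texts from `dev_texts`; a high value for either statistic would support the corresponding hypothesis.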

### Exercise 3

- In the previous lab session we learned different techniques to avoid overfitting. In this case ```Dropout``` seems promising. Try adding a dropout layer to the LSTM-based model. 


### Exercise 4
- Another technique to fight overfitting is to reduce the model size. Gated Recurrent Units (```GRU```) are a simpler version of LSTMs. They have fewer gates, and therefore fewer parameters for the same number of units. Try learning a new model based on GRU layers. 

- __Hint__: Check the API for the GRU layer $\rightarrow$ https://keras.io/layers/recurrent/#gru
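To get a feeling for the size difference, the parameter counts of the two layer types can be estimated by hand. The sketch below assumes the textbook formulation, in which an LSTM has four weight blocks and a GRU has three, each with an input kernel, a recurrent kernel and one bias vector. Some GRU implementations keep an extra bias vector, so the number reported by ```model.summary()``` may be slightly higher.

```python
def recurrent_params(input_dim, units, n_blocks):
    """Parameters of a gated recurrent layer with `n_blocks` weight blocks."""
    # Each block: input kernel + recurrent kernel + bias
    per_block = units * input_dim + units * units + units
    return n_blocks * per_block

# Same sizes as the GloVe-based model: 50-d embeddings, 128 units
lstm_params = recurrent_params(input_dim=50, units=128, n_blocks=4)
gru_params = recurrent_params(input_dim=50, units=128, n_blocks=3)

print(lstm_params)  # 91648
print(gru_params)   # 68736
```

With the same number of units, the GRU layer has about three quarters of the parameters of the LSTM layer.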
    

## 5. Bidirectional LSTMs (Exercise to run)

Another well-known architecture is the bidirectional RNN. It is an extension of the regular RNN and usually performs really well. Nowadays it is a standard architecture in NLP.

Note that RNNs (LSTMs and GRUs to a lesser degree) are sensitive to word order and tend to remember the last words they have seen. A solution to this is to read the sentence both from left to right and from right to left, which can be implemented with bidirectional LSTMs.
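The idea can be sketched with NumPy using a plain recurrent cell in place of the LSTM: one set of weights reads the sequence in order, a second set reads it reversed, and the two final states are concatenated. The helper name, the random weights and the toy sizes are assumptions for this illustration only.

```python
import numpy as np

def rnn_last_state(inputs, W_x, W_h):
    """Run a plain recurrent cell over `inputs` and return the final state."""
    state = np.zeros(W_h.shape[0])
    for x_t in inputs:
        state = np.tanh(W_x @ x_t + W_h @ state)
    return state

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 4, 3     # toy sizes (assumed)
inputs = rng.normal(size=(seq_len, input_dim))

# Each direction has its own, independent weights
fw_Wx, fw_Wh = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))
bw_Wx, bw_Wh = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))

forward_state = rnn_last_state(inputs, fw_Wx, fw_Wh)         # left-to-right
backward_state = rnn_last_state(inputs[::-1], bw_Wx, bw_Wh)  # right-to-left

sentence_vector = np.concatenate([forward_state, backward_state])
print(sentence_vector.shape)  # (6,)
```

The sentence representation is twice as large as in the unidirectional case, which is why the layer that follows a bidirectional encoder sees `2 * hidden_dim` input features when the two directions are concatenated.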


In [0]:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

max_words = 10000
embedding_size = 50

model = Sequential()
model.add(Embedding(max_words, embedding_size, mask_zero=True))
model.add(Bidirectional(LSTM(lstm_hidden_size)))
model.add(Dense(1, activation='sigmoid'))

model.summary()

In [0]:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=20, batch_size=128, validation_data=(x_dev, y_dev), verbose=2)

In [0]:
import matplotlib.pyplot as plt

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

score = model.evaluate(x_test, y_test, verbose=1)
print("Accuracy: ", score[1])

### Exercise 6
You can try going further with, for example:
- Try adjusting the number of units in the recurrent layer. 
- Try ```GRU``` instead of ```LSTM```
- Try to adjust the learning rate used in ```RMSprop```
- Try some regularization techniques.