# Lab 2: Sentiment Analysis with a Multi Layer Perceptron using Keras


### Introduction

In this lab session we will implement a classification model for __sentence classification__ using Keras. Given a sentence, our model will predict whether it is a positive or a negative piece of text. The dataset we are going to use annotates polarity on a scale from 0 to 4, where 0 denotes extremely negative sentiment and 4 is the most positive.

Nevertheless, for this lab we'll simplify the task and translate the 5-way classification task into a 2-way classification task (0 $\rightarrow$ _negative_; 1 $\rightarrow$ _positive_). Original labels 0 and 1 become negative, 3 and 4 become positive, and neutral sentences (label 2) are discarded.

In addition, we will review some of the regularization techniques seen in class. 

All in all, the main __objectives__ of this laboratory are the following: 
- Learn how to build, train and evaluate a Model in Keras
- Explore hyperparameters like:
  - Optimizers: SGD, ADAGRAD, etc.
  - Learning Rates
  - Regularization
- Plot learning curves for model selection

## 1. Loading the data
We'll use the same data as in the previous session. You need to follow the same steps specified in Lab 1.

In [0]:
# Mount Drive files
from google.colab import drive
drive.mount('/content/drive')

In [0]:
## for replicability of results
import numpy as np
import tensorflow as tf

np.random.seed(1)
tf.set_random_seed(2)

In [0]:
sst_home = 'drive/My Drive/kschool-nlp/data/trees/'

In [0]:
import re
import pandas as pd

# Let's do 2-way positive/negative classification instead of 5-way    
def load_sst_data(path,
                  easy_label_map={0:0, 1:0, 2:None, 3:1, 4:1}):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            example['label'] = easy_label_map[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)
    data = pd.DataFrame(data)
    return data

training_set = load_sst_data(sst_home + 'train.txt')
dev_set = load_sst_data(sst_home + 'dev.txt')
test_set = load_sst_data(sst_home + 'test.txt')

print('Training size: {}'.format(len(training_set)))
print('Dev size: {}'.format(len(dev_set)))
print('Test size: {}'.format(len(test_set)))
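
To see what the loader does to a single line, here is a small self-contained sketch that applies the same regular expression and label map to a made-up tree string. The sample sentences and the helper name `parse_sst_line` are illustrative assumptions, not part of the dataset or of the lab code.

```python
import re

# Same mapping as in load_sst_data: labels 0-1 become negative (0),
# labels 3-4 become positive (1), and the neutral label 2 is dropped.
EASY_LABEL_MAP = {0: 0, 1: 0, 2: None, 3: 1, 4: 1}

def parse_sst_line(line):
    """Return (binary_label, text) for one tree line, or None if it is neutral."""
    label = EASY_LABEL_MAP[int(line[1])]
    if label is None:
        return None
    # Remove every "(<digit>" opener and every ")" closer, keeping the words.
    text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
    return label, text[1:]  # drop the space left before the first word

# Hypothetical lines in the same bracketed format as the tree files.
sample = "(3 (2 It) (4 (2 's) (3 good)))"
neutral_sample = "(2 (2 It) (2 exists))"
print(parse_sst_line(sample))
print(parse_sst_line(neutral_sample))
```

The root label (the digit right after the first parenthesis) decides the class of the whole sentence; the phrase-level labels inside the tree are simply stripped.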

## 2. Preprocessing and vectorization

Once the data is loaded, the next step is to preprocess it to obtain its vectorized form (i.e. to transform text into numeric tensors). This basically consists of:

- Tokenization: typically, segmenting the text into words. (Alternatively, we could segment text into characters, or extract n-grams of words or characters.)
- Defining the dictionary index and the vocabulary size (in this case we keep the 1000 most frequent words).
- Transforming each sentence into a vector.

In this lab, we will follow the Bag of Words approach. 
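
As a rough illustration of the binary Bag of Words idea, the sketch below builds a word index from a toy corpus and turns a sentence into a 0/1 vector. It is a simplified stand-in written in plain Python, not the Keras `Tokenizer` itself: the function names and the toy corpus are assumptions, and index 0 is left unused to mimic the convention the Keras tokenizer appears to follow.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1 (0 is left unused)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked[:num_words - 1])}

def to_binary_vector(text, word_index, num_words):
    """Return a 0/1 vector marking which indexed words occur in the text."""
    vector = [0] * num_words
    for word in text.lower().split():
        index = word_index.get(word)  # None for out-of-vocabulary words
        if index is not None:
            vector[index] = 1
    return vector

# Toy corpus (made up for illustration).
toy_texts = ["a good movie", "a bad movie", "good good fun"]
toy_index = build_word_index(toy_texts, num_words=4)
print(toy_index)
print(to_binary_vector("a good plot", toy_index, num_words=4))
```

Note that word order and word counts are lost: the vector only records whether each vocabulary word is present, and out-of-vocabulary words such as "plot" leave no trace.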

In [0]:
from sklearn.utils import shuffle

# Shuffle dataset
training_set = shuffle(training_set)
dev_set = shuffle(dev_set)
test_set = shuffle(test_set)

# Obtain text and label vectors, and tokenize the text
train_texts = training_set.text
train_labels = training_set.label

dev_texts = dev_set.text
dev_labels = dev_set.label

test_texts = test_set.text
test_labels = test_set.label

In [0]:
from keras.preprocessing import text

# Create a tokenizer that takes the 1000 most common words
tokenizer = text.Tokenizer(num_words=1000)

# Build the word index (dictionary)
tokenizer.fit_on_texts(train_texts) # Create word index using only training part

# Vectorize texts into one-hot encoding representations
x_train = tokenizer.texts_to_matrix(train_texts, mode='binary')
x_dev = tokenizer.texts_to_matrix(dev_texts, mode='binary')
x_test = tokenizer.texts_to_matrix(test_texts, mode='binary')
          
y_train = train_labels
y_dev = dev_labels
y_test = test_labels

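# Use .iloc (positional access): after shuffling, index label 0 is no longer the first row
print('Text of the first example: \n{}\n'.format(train_texts.iloc[0]))
print('Vector of the first example:\n{}\n'.format(x_train[0]))
print('Binary representation of the output:\n{}\n'.format(y_train.iloc[0]))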


print('Shape of the training set (nb_examples, vector_size): {}'.format(x_train.shape))
print('Shape of the validation set (nb_examples, vector_size): {}'.format(x_dev.shape))
print('Shape of the test set (nb_examples, vector_size): {}'.format(x_test.shape))

In [0]:
# Recover the word index that was created with the tokenizer
word_index = tokenizer.word_index
print('Found {} unique tokens.\n'.format(len(word_index)))

word_count = tokenizer.word_counts
print("Show the most frequent word index:")

for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=True)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 9: 
        print('')
        break

for i, word in enumerate(sorted(word_count, key=word_count.get, reverse=False)):
    print('   {} ({}) --> {}'.format(word, word_count[word], word_index[word]))
    if i == 9: 
        print('')
        break

Check what we obtain when we vectorize words that are out of the index (out of vocabulary words).

In [0]:
oov_sample = ['saddam', 'plausible'] 
sequences = tokenizer.texts_to_matrix(oov_sample)
print(sum(sequences[0]))

It is also possible to obtain lists of integer indices instead of the one-hot binary representation.

In [0]:
word_index_inverse = {word_index[k]:k for k in word_index}

In [0]:
# Turns strings into list of integer indices
one_hot_results = tokenizer.texts_to_sequences(train_texts)
print(one_hot_results[0])
print(train_texts.iloc[0])
print([word_index_inverse[x] for x in one_hot_results[0]])

## 3. Building the model

When we build a neural network we usually take into account the following points:
- The __layers__, and how they are combined (that is, the structure and parameters of the model)
- The __input__ and the __labeled output__ data that the model needs to map.
- The __loss function__ that signals how well the model is doing.
- The __optimizer__, which defines the learning procedure.

In this session we'll keep all of this very simple. Keras provides a simple framework for combining layers. There are two ways of building a model: the _Sequential_ class and the _functional_ API. The former is for linear stacks of layers, which is the most common and simplest architecture. The latter supports directed acyclic graphs (DAGs) of layers, which lets you build arbitrary models.

In this session, we will build a __Multi Layer Perceptron__ with one hidden layer, which is one of the simplest neural network models. More complicated models, like LSTMs and CNNs, will be covered in the next lab sessions.

For that we'll make use of fully connected (```Dense```) layers with a ```relu``` activation. In this case the hidden layer will have 16 hidden units (feel free to explore different numbers of hidden units).

Remember that by applying a ```Dense``` layer with ```relu``` activation we are implementing the following tensor operation:

```
output = relu(dot(W, input) + b)
```

where `relu` is the element-wise activation function, ```W``` is a weights matrix created by the layer, and ```b``` is a bias vector created by the layer.

Remember from the slides that mathematically this can be written as follows:
> $relu(W^{T}X + b)$
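
The `Dense` plus `relu` operation described above can be sketched in plain Python for a tiny layer. This is only an illustration under stated assumptions: the weights, bias and input are hand-picked toy values, `dense_forward` is a hypothetical helper, and the weight matrix is stored with shape (inputs, units).

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def dense_forward(x, weights, bias):
    """Compute relu(W^T x + b), where weights has shape (inputs, units)."""
    n_units = len(bias)
    pre_activation = [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(n_units)
    ]
    return relu(pre_activation)

# Toy layer with 3 inputs and 2 hidden units (values chosen by hand).
x = [1.0, 0.0, 1.0]
W = [[0.5, -1.0],
     [0.3, 0.8],
     [0.25, -0.5]]
b = [0.25, 0.5]
print(dense_forward(x, W, b))
```

Each hidden unit owns one column of `W` and one entry of `b`; the second unit ends up with a negative pre-activation, so `relu` clips it to zero.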


Regarding input data, we will use the __one-hot encoding__. We'll set ```(binary) cross-entropy``` as a __loss function__ and ```rmsprop```, a variant of the _Stochastic Gradient Descent_, as the __optimizer__.
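
To make the loss function more concrete, here is a minimal sketch of binary cross-entropy in plain Python. The clipping constant `eps` and the toy predictions are assumptions made for the example; Keras computes the same quantity on tensors with its own numerical safeguards.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident_right = [0.9, 0.1, 0.8]  # toy predicted probabilities
confident_wrong = [0.1, 0.9, 0.2]
print(binary_cross_entropy(labels, confident_right))
print(binary_cross_entropy(labels, confident_wrong))
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong.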

Feel free to explore different loss functions (e.g. MSE) and optimizers (e.g. Adam) to see whether you can improve the model (see Exercise 2, below).

### Exercise 1
Answer the following questions:
- What does having 16 hidden units mean? What is the size of matrix ```W```?

-----

By increasing the number of hidden units we allow the network to learn more complex representations, but at the same time we make the network more computationally expensive and more likely to overfit the training data.

Regarding the architecture, there are three main decisions that we need to take:
- The number of layers
- The number of hidden units for each layer
- Activation function of the layers

The code below implements a fully connected architecture with only one intermediate layer and an output layer that predicts the sentiment of the input review. 

>>>>>>>![](http://ixa2.si.ehu.es/~jibloleo/uc3m_dl4nlp/img/Two_layers_NN.png)

In [0]:
from keras.models import Sequential
from keras.layers import Dense

input_size = x_train[0].shape[0] ## vector length equals to vocabulary size.

# Define the model
model = Sequential()
model.add(Dense(units=16, activation='relu', input_shape=(input_size,)))
model.add(Dense(units=1, activation='sigmoid'))  
# Note that we do not need to indicate the input shape for the successive layers

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

In [0]:
history = model.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=1)

In [0]:

import matplotlib.pyplot as plt

# summarize history for accuracy
# (newer Keras versions store this metric under 'accuracy' / 'val_accuracy')
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')  # the second curve is the dev (validation) set
plt.show()


# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

In [0]:
def plot_curve(history_, metric_, title_, legend_):
  import matplotlib.pyplot as plt

  plt.plot(history_.history[metric_])
  plt.plot(history_.history['val_'+metric_])
  plt.title(title_)
  plt.ylabel(metric_)
  plt.xlabel('epoch')
  plt.legend(legend_, loc='upper left')
  plt.show()


# The second curve comes from validation_data, i.e. the dev set
plot_curve(history, 'acc', 'Accuracy', ['train', 'dev'])
plot_curve(history, 'loss', 'Loss', ['train', 'dev'])

(mini-) **Exercise**: Train the model with different optimizers: SGD, Adam, Adagrad, etc.

In [0]:
#to-do

## 4. Evaluating the model

Once we have fit the model, we can use the method ```model.evaluate()``` to obtain the accuracy on the test set.
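
For intuition about the accuracy figure, the sketch below computes accuracy by hand from predicted probabilities. The 0.5 decision threshold and the toy values are assumptions made for the example; it is not the Keras implementation.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for y, y_hat in zip(y_true, predictions) if y == y_hat)
    return correct / len(y_true)

toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.4, 0.3, 0.2]  # hypothetical sigmoid outputs
print(accuracy(toy_labels, toy_probs))
```

Three of the four toy predictions fall on the correct side of the threshold, so the accuracy is 0.75.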

In [0]:
score = model.evaluate(x_test, y_test, verbose=1)
print("Accuracy: ", score[1])
print(score)

### Exercise 2 (home)

The plots show that the model ends up overfitting the training data. One way to prevent overfitting is to stop training once accuracy on the validation set starts decreasing. 
- Could you retrain the model from scratch for only four epochs? 

- Optionally, Keras provides an early stopping mechanism as a callback object (https://keras.io/callbacks/#earlystopping) that can be used when fitting the model:

```
from keras.callbacks import EarlyStopping
...
early_stop = EarlyStopping(monitor='val_acc', patience=1)  # monitor the validation metric, not the training one
...
history = model.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=1, callbacks=[early_stop])

```
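
The idea behind the patience parameter can be sketched without Keras. The function below is a simplified patience rule written for illustration, with made-up validation accuracies; it is an assumption about the general mechanism rather than a copy of the Keras callback.

```python
def early_stopping_epoch(val_scores, patience=1):
    """Return the 0-based index of the last epoch that would be run.

    Training stops as soon as the validation score has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    best = float('-inf')
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_scores) - 1  # never triggered: all epochs are run

# Hypothetical validation accuracies, one per epoch.
toy_val_acc = [0.70, 0.74, 0.76, 0.75, 0.77, 0.73]
print(early_stopping_epoch(toy_val_acc, patience=1))
print(early_stopping_epoch(toy_val_acc, patience=2))
```

With a patience of 1 the small dip at the fourth epoch already stops training, while a patience of 2 tolerates it and lets the model reach the later, better epoch.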

---
- Optionally, you can try different activations (e.g. ```tanh```, ```sigmoid```) instead of ```relu```.
- Or try a different loss function, like mean squared error (```mse```).

---

## 5. Model Tuning

### 5.1. Effect of Learning Rate 

The model in Section 3 uses default values of learning rate, and does not use any type of regularization.

You can check the Keras API to learn how to use and set up different optimizers and regularizers: 
- https://keras.io/optimizers/
- https://keras.io/regularizers/



### Exercise 3

In this exercise we'll focus on the importance of the learning rate. We'll compare a large and a small learning rate with the default one. 
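
As a warm-up, the effect of the step size can be seen on a toy problem that needs no neural network. The sketch below uses plain gradient descent (not RMSprop) to minimise $f(w) = w^2$; the function name, the starting point and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the loss per step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # gradient step
        losses.append(w ** 2)
    return losses

# A tiny, a moderate and a too-large learning rate.
for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr)[-1])
```

The tiny rate barely moves the loss in 20 steps, the moderate rate drives it close to zero, and the too-large rate overshoots the minimum on every step so the loss grows.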


__Run the following cells and answer the next questions:__

- Why do we obtain such different plots with each learning rate?

- What is the difference when comparing the following curves:
   - ```train large```_vs_ ```train orig```
   - ```dev large```_vs_ ```dev orig```

- And the following ones:
   - ```train small```_vs_ ```train orig```
   - ```dev small```_vs_ ```dev orig```



In [0]:
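# Example of using optimizer objects

from keras import optimizers, regularizers

def build_logistic_regression():
    # Logistic regression: a single sigmoid unit on top of the input.
    # The L2 factor is 0 here, so no penalty is actually applied.
    new_model = Sequential()
    new_model.add(Dense(units=1, activation='sigmoid', input_shape=(input_size,),
                        kernel_regularizer=regularizers.l2(0.)))
    return new_model

# Init rmsprop with a very small and a very large learning rate
rmsprop_small = optimizers.RMSprop(lr=0.000001)
rmsprop_large = optimizers.RMSprop(lr=0.5)

# Train a fresh model per learning rate: recompiling an already trained model
# keeps its weights, so the second run would not start from scratch.
model_small = build_logistic_regression()
model_small.compile(loss='binary_crossentropy', optimizer=rmsprop_small, metrics=['accuracy'])
history_small_lr = model_small.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)

model_large = build_logistic_regression()
model_large.compile(loss='binary_crossentropy', optimizer=rmsprop_large, metrics=['accuracy'])
history_large_lr = model_large.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)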

In [0]:
# summarize history for accuracy
plt.plot(history_large_lr.history['acc'])
plt.plot(history_large_lr.history['val_acc'], linestyle='--')

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'], linestyle='--')

plt.plot(history_small_lr.history['acc'])
plt.plot(history_small_lr.history['val_acc'], linestyle='--')

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train large', 'dev large', 'train orig', 'dev orig', 'train small', 'dev small'], loc='center right')
plt.show()

### 5.2. Effect of regularization (more in Section 7)

Regularization is a method to avoid overfitting. It adds a penalty term on the weights to the loss, so that they are kept small.



#### Exercise 4

In this exercise we'll focus on the effect of regularization. We'll compare a regularized model against a non-regularized one.

__Please run the following cell and answer the next question__:

What is the effect of including a regularization term? Is it always a good idea to include it?


------

The plots might not look as expected, but note that we reduced the vocabulary to only the 1000 most frequent words in training. In any case, you should see differences between the learning curves when training with and without regularization.

----

In [0]:
import matplotlib.pyplot as plt
from keras import optimizers, regularizers

model2 = Sequential()

# add L2 weight regularization to logistic regression
regularizer = regularizers.l2(0.0001)
model2.add(Dense(units=1, activation='sigmoid', input_shape=(input_size,), kernel_regularizer=regularizer))

# Init rmsprop
rmsprop = optimizers.RMSprop() 
model2.compile(loss='binary_crossentropy', optimizer=rmsprop, metrics=['accuracy'])
history_reg = model2.fit(x_train, y_train, epochs=50, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)

In [0]:
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'], linestyle='--')

plt.plot(history_reg.history['acc'])
plt.plot(history_reg.history['val_acc'], linestyle='--')

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train orig', 'dev orig', 'train reg', 'dev reg'], loc='lower right')
plt.show()

## 6. Adding new layers
In this section we will extend the model by adding a new fully connected layer. By adding new layers we increase the capacity of the model; this does not reduce overfitting, but for the moment we do not care about this. 

### Exercise 5
- The code below defines a model with two hidden layers. Add a third intermediate layer with 16 hidden units and ```relu``` activation.

In [0]:
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

input_size = x_train[0].shape[0] ## vector length equals to vocabulary size.

# Define the model
model = Sequential()
model.add(Dense(units=16, activation='relu', input_shape=(input_size,)))
model.add(Dense(units=16, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
# Note that we do not need to indicate the input shape for the successive layers

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

# Train the model
history = model.fit(x_train, y_train, epochs=20, batch_size=32, validation_data=(x_dev, y_dev), verbose=1)

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'dev'], loc='upper left')
plt.show()

## 7. Regularization techniques

We will apply the following ones:
- Reducing the network size
- Adding weight regularization
- Adding dropout

### 7.1 Reducing the network size
One way to prevent overfitting is to reduce the size of the model. As we know, the size of a model is measured by the number of parameters that it needs to learn. Remember that the number of parameters is determined by the number of layers and the number of units per layer.
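
To get a feeling for how quickly the parameter count grows, the sketch below counts the weights and biases of a stack of fully connected layers. The helper `count_dense_parameters` is hypothetical, and the input size of 1000 assumes the vocabulary size used in this lab.

```python
def count_dense_parameters(input_size, layer_units):
    """Total number of weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = input_size
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total

# Input size 1000 matches the vocabulary size used in this lab.
for units in ([4, 4, 1], [16, 16, 1], [512, 512, 1]):
    print(units, count_dense_parameters(1000, units))
```

The totals can be checked against the "Total params" line printed by `model.summary()` for the corresponding models.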

### Exercise 6
Run the following cell of code and try to answer the following question:
- Can you describe the relation between the training and validation loss curves when training with fewer parameters?

In [0]:
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

input_size = x_train[0].shape[0] ## vector length equals to vocabulary size.

# Define the model
model = Sequential()
model.add(Dense(units=4, activation='relu', input_shape=(input_size,)))
model.add(Dense(units=4, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
# Note that we do not need to indicate the input shape for the successive layers

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

# Train the model
history = model.fit(x_train, y_train, epochs=20, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)


# Define and train the "original" model of two hidden layers of 16 output units each
model = Sequential()
model.add(Dense(units=16, activation='relu', input_shape=(input_size,)))
model.add(Dense(units=16, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history_orig = model.fit(x_train, y_train, epochs=20, batch_size=32, validation_data=(x_dev, y_dev), verbose=1)

In [0]:


# summarize history for loss
plt.plot(history_orig.history['loss'])
plt.plot(history.history['loss'], linestyle='--')

plt.plot(history_orig.history['val_loss'])
plt.plot(history.history['val_loss'], linestyle='--')

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train: orig model', 'train: small model', 'dev: orig model', 'dev: small model'], loc='upper left')
plt.show()

We can also compare the loss of the original model with that of a bigger model with a higher _capacity_. Note that a model with more parameters has more memorization capacity and can consequently generalize worse, with a higher risk of overfitting the training data.

### Exercise 7
Run the following cell of code and try to answer the following question:
- Can you describe the relation between the training and validation loss curves when training with more parameters?

In [0]:
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

input_size = x_train[0].shape[0] ## vector length equals to vocabulary size.

# Define the model
model = Sequential()
model.add(Dense(units=512, activation='relu', input_shape=(input_size,)))
model.add(Dense(units=512, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
# Note that we do not need to indicate the input shape for the successive layers

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

# Train the model
history = model.fit(x_train, y_train, epochs=20, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)

In [0]:
# summarize history for loss
plt.plot(history_orig.history['loss'])
plt.plot(history.history['loss'], linestyle='--')

plt.plot(history_orig.history['val_loss'])
plt.plot(history.history['val_loss'], linestyle='--')

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train: orig model', 'train: bigger model', 'dev: orig model', 'dev: bigger model'], loc='upper left')
plt.show()

In [0]:
# summarize history for loss
plt.plot(history_orig.history['loss'])
plt.plot(history_orig.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train: orig model', 'dev: orig model'], loc='upper left')
plt.show()

-----

### 7.2 Adding weight regularization
 
Another common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more "regular". This is called _weight regularization_, and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

- __L1 regularization__, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).

- __L2 regularization__, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to the squared "L2 norm" of the weights).

In Keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. In this case, we'll use the L2 norm to regularize the weights of the model.
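
The two penalties can be written down in a few lines of plain Python. In this sketch the weight values, the data loss and the factor of 0.5 are made-up numbers chosen so the arithmetic is easy to follow (the cells in this lab use much smaller factors such as 0.001).

```python
def l1_penalty(weights, factor):
    """factor * sum of absolute weight values."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """factor * sum of squared weight values."""
    return factor * sum(w * w for w in weights)

toy_weights = [0.5, -2.0, 0.0, 1.0]
data_loss = 0.25  # hypothetical cross-entropy value

# The regularized loss is the data loss plus the penalty.
print(data_loss + l1_penalty(toy_weights, 0.5))
print(data_loss + l2_penalty(toy_weights, 0.5))
```

Because L2 squares each weight, the single large weight (-2.0) dominates the L2 penalty, whereas L1 charges every weight in proportion to its magnitude.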

In [0]:
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers

input_size = x_train[0].shape[0] ## vector length equals to vocabulary size.

# Define the model
model = Sequential()
model.add(Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu', input_shape=(input_size,)))
model.add(Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Train the model
history_l2 = model.fit(x_train, y_train, epochs=20, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)

In [0]:
# summarize history for loss
plt.plot(history_orig.history['loss'])
plt.plot(history_l2.history['loss'], linestyle='--')

plt.plot(history_orig.history['val_loss'])
plt.plot(history_l2.history['val_loss'], linestyle='--')

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train: orig', 'train: l2', 'dev: orig', 'dev: l2'], loc='upper left')
plt.show()

### Exercise 8

- Explore different regularization weights (e.g. 0.001, 0.01, 0.1). Do you see any difference in the learning curves?
- You can try __L1 regularization__, or both together. 

```
# L1 regularization
regularizers.l1(0.001)

# L1 and L2 regularization at the same time
regularizers.l1_l2(l1=0.001, l2=0.001)
```

-----

### 7.3 Adding dropout
Another popular regularization technique for deep learning is _dropout_. It has proven to be very successful in many cases, where it can improve the accuracy of state-of-the-art architectures by around 1-2%. 

The algorithm is simple: at every training step, every unit has a probability $p$ of being dropped out (its output is set to zero, so it is not taken into account during that training step). In Keras the user needs to set a _dropout rate_, which is the fraction of the features that are zeroed out; it is usually set between 0.2 and 0.5. At test time no units are dropped out. To balance for the fact that more units are active than at training time, the outputs must be rescaled: either the test-time outputs are scaled down by the keep probability $1 - p$, or, as Keras does, the surviving training-time outputs are scaled up by $1/(1 - p)$ so that nothing changes at test time.
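
A minimal sketch of the mechanism in plain Python follows. It implements the "inverted" variant, in which surviving units are scaled up by $1/(1 - p)$ during training and the layer is the identity at test time; to my understanding this is what the Keras `Dropout` layer does, and it matches test-time rescaling in expectation. The function signature and toy activations are assumptions made for the example.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero units with probability `rate` while training
    and scale the survivors by 1 / (1 - rate); do nothing at test time."""
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, rate=0.5, training=True, rng=rng))
print(dropout(activations, rate=0.5, training=False, rng=rng))
```

A different random subset of units is silenced at every training step, which prevents the network from relying too heavily on any single unit.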

In [0]:
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout

input_size = x_train[0].shape[0] ## vector length equals to vocabulary size.

# Define the model
model = Sequential()
model.add(Dense(units=16, activation='relu', input_shape=(input_size,)))
# we add a drop-out layer after the first fully connected layer
model.add(Dropout(0.5))

model.add(Dense(units=16, activation='relu'))
# we add a drop-out layer after the second fully connected layer
model.add(Dropout(0.5))
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model using a loss function and an optimizer.
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

# Train the model
history = model.fit(x_train, y_train, epochs=20, batch_size=32, validation_data=(x_dev, y_dev), verbose=0)

# summarize history for loss
plt.plot(history_orig.history['loss'])
plt.plot(history.history['loss'])

plt.plot(history_orig.history['val_loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train: orig model', 'train: dropout', 'dev: orig model', 'dev: dropout'], loc='upper left')
plt.show()

### Exercise 9
- Try different dropout rates and decide which one is the best.