# Homework 7: Text classification with Keras

The homework consists of a coding part only (25 pts).
 - All theoretical questions must be answered in your own words; do not copy-paste text from the internet. Points can be deducted for terrible formatting or incomprehensible English.
 - Code must be commented. If you use code you found online, you have to add the link to the source you used. There is no penalty for using outside sources as long as you convince us you understand the code.

**Note that the coding part consists of two different notebooks.**

*Once completed, zip the entire directory containing this exercise and upload it to https://courses.cs.ut.ee/2020/nn/spring/Main/Practices.*

**If you did this homework together with course mates, please write their names here (answers still have to be your own!).**

**Name(s):** fill this in if applicable

## Introduction

In this practice session we look into text classification. This means we are going to touch on topics like word embeddings and recurrent neural networks.

In [None]:
%matplotlib inline

from keras.models import Model
from keras.layers import Input, Dense, Embedding, SimpleRNN, LSTM, GRU
from keras.preprocessing.text import text_to_word_sequence
from keras.datasets import imdb
import numpy as np

## Loading data

We are going to use the IMDB movie review dataset for sentiment classification. Luckily, this dataset is again included with Keras.

In [None]:
(X_train_seq, y_train), (X_test_seq, y_test) = imdb.load_data()

print('Train data shape: ', X_train_seq.shape)
print('Train labels shape: ', y_train.shape)
print('Test data shape: ', X_test_seq.shape)
print('Test labels shape: ', y_test.shape)

Inputs are the movie reviews encoded as sequences of word indexes. A couple of "word indexes" have a special meaning: 1 marks the start of a sequence, 2 stands for an unknown word (a word not in the vocabulary) and 0 means padding (empty) and should be ignored. Index 3 is simply never used (a quirk of `imdb.load_data()`).

All other numbers signify actual words, with smaller numbers coding for more frequently used words. Therefore 4 corresponds to the most used word in our dataset, 5 to the second most used, etc.

Below you can see one encoded review. As expected, it starts with 1, and 4 and 5 do indeed appear frequently.

In [None]:
print(X_train_seq[0])

Our model should learn to produce the sentiment of the review: 1 means the movie is good, 0 means the movie is bad.

In [None]:
y_train[0]

As a data scientist you should always be a little curious about what the data really looks like. In this case we might want to know what the text actually looked like. To decode the sequences of numbers into words, we need the number-to-word correspondences. Luckily, the Keras IMDB dataset has a function for that.

In [None]:
word2num = imdb.get_word_index()
word2num

But this gives us only the word-to-number encoding. We need to produce the inverse mapping (number-to-word) ourselves.

**Task 1 (2pts):**

In [None]:
################################################################################
# TODO: Construct a dictionary for the opposite mapping, from numbers to words.#
################################################################################
num2word = {} # fill this dict with the inverse mapping
################################################################################
#                             END OF YOUR CODE                                 #
################################################################################
num2word

Print out the most frequent words. Notice that this vocabulary indexing still starts from 1 (and not 4). Later, when translating text into numbers or the other way round, we will have to shift the indexes by 3.

In [None]:
for i in range(1,11):
    print(str(i) + ':', num2word[i])
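
The shift by 3 can be illustrated on a toy vocabulary. The sketch below uses a hypothetical three-word `toy_word_index` (not the real IMDB index) and assumes the conventions described above: 1 marks the start of a sequence, 2 marks an unknown word, and real words are shifted by 3.

```python
# Hypothetical toy vocabulary, ranked by frequency starting from 1,
# in the same style as the dictionary returned by imdb.get_word_index().
toy_word_index = {"the": 1, "and": 2, "movie": 3}

START, UNKNOWN, OFFSET = 1, 2, 3  # conventions described in the text above


def toy_encode(words):
    """Encode a list of words the way the IMDB sequences are encoded."""
    encoded = [START]
    for w in words:
        if w in toy_word_index:
            encoded.append(toy_word_index[w] + OFFSET)  # rank 1 becomes 4, etc.
        else:
            encoded.append(UNKNOWN)  # word not in the vocabulary
    return encoded


print(toy_encode(["the", "movie", "and", "popcorn"]))  # [1, 4, 6, 5, 2]
```

Decoding has to undo the same offset, which is why the special values 1 and 2 need separate handling.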

Now we can finally see what those movie reviews were about.

In [None]:
# function to decode reviews
def nums2sentence(nums):
    # the encoding has 0, 1 and 2 as special numbers and 3 is unused, therefore words start from 4
    # word indexes start from 1, so to match the two systems we have to shift by 3
    # nums[1:] skips the start marker (1) at the beginning of each review
    return " ".join(["UNK" if i == 2 else num2word[i-3] for i in nums[1:]]) 

# print the first 3 reviews
for i in range(3):
    print(nums2sentence(X_train_seq[i]), "- GOOD" if y_train[i] == 1 else "- BAD")
    print()

## Training model

Our goal in this exercise is to train a model to predict sentiment from the movie review text. However, our neural networks expect fixed-size inputs, so we need to do some preprocessing first. In particular:
 - we need to make all sentences the same length, so that the result can be represented as a matrix (tensor),
 - we might want to limit the vocabulary to a certain number of the most frequent words to save resources.

In the following you will write a function to do those preprocessing steps.
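
To make the intended behaviour concrete, here are two hand-worked input/output pairs. This is only an illustration with hypothetical toy settings (`toy_vocab_size = 10`, `toy_maxlen = 5`), not an implementation of the preprocessing itself; it assumes zeros are added at the beginning, the end of long reviews is kept, and out-of-vocabulary indexes become 2.

```python
# Toy settings, much smaller than the ones used in the homework.
toy_vocab_size = 10
toy_maxlen = 5

# (raw sequence, expected preprocessed row), worked out by hand.
examples = [
    # short review: padded with zeros in front; 12 >= vocab size, so it becomes 2
    ([1, 4, 12], [0, 0, 1, 4, 2]),
    # long review: only the last 5 entries are kept, and 12 becomes 2
    ([1, 9, 4, 12, 5, 6, 7], [4, 2, 5, 6, 7]),
]

for raw, expected in examples:
    # Every expected row has the fixed length and only in-vocabulary indexes.
    assert len(expected) == toy_maxlen
    assert max(expected) < toy_vocab_size
    print(raw, "->", expected)
```

Pairs like these can be reused as quick checks of your own function once it is written.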

**Task 2 (5pts):**

In [None]:
def preprocess_sequences(seq, vocab_size, maxlen):
    out = []
    ##############################################################################
    # TODO: Write a function to preprocess sequences:                            #
    #       - all sequences should have the same length - pad them with 0s at    #
    #         the beginning (the extra zeros go in the beginning of the list)    #
    #       - if sequence is too long (above maxlen), then keep only the maxlen  #
    #         words at the end of the review (works better than the beginning)   #
    #       - remove all words with index >= vocab_size, replace them with 2s.   #
    #         (Words are ordered by frequency, so you are in fact removing less  #
    #         frequent words.)                                                   #
    ##############################################################################
    pass
    ##############################################################################
    #                             END OF YOUR CODE                               #
    ##############################################################################
    return np.array(out)

vocab_size = 20000
maxlen = 80

X_train = preprocess_sequences(X_train_seq, vocab_size, maxlen)
assert X_train.shape[1] == maxlen, "Second dimension of training set must be equal to maximum sentence length."
assert np.max(X_train) < vocab_size, "The training set should not contain words with index >= vocab_size"
X_test = preprocess_sequences(X_test_seq, vocab_size, maxlen)

Now we are ready to create a network. Unusually, our input layer receives only integers this time, each corresponding to a word. These are fed to the Embedding layer, where they are transformed into word vectors (embeddings) that are passed on to the subsequent layers. Using `mask_zero=True` tells the network that 0 is a special value that denotes padding. Timesteps where the input is 0 will not be considered when backpropagating the error signal (no learning happens at those timesteps).

The output of the Embedding layer goes to the RNN layer, which walks over its inputs while keeping a hidden state as memory. RNN layers can produce an output at each timestep, but here we have instructed the layer with `return_sequences=False` to produce an output only at the last timestep.
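
The idea of walking over the inputs with a hidden state can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it assumes a tanh activation, uses hypothetical toy weights, processes a single sequence, and ignores masking.

```python
import math


def simple_rnn_forward(inputs, w_in, w_rec, bias, return_sequences=False):
    """Run a tanh RNN over `inputs`, a list with one input vector per timestep."""
    units = len(bias)
    h = [0.0] * units  # the hidden state (memory) starts at zero
    outputs = []
    for x in inputs:  # one update per timestep
        # new state = tanh(input contribution + previous state contribution + bias)
        h = [
            math.tanh(
                sum(x[i] * w_in[i][j] for i in range(len(x)))
                + sum(h[k] * w_rec[k][j] for k in range(units))
                + bias[j]
            )
            for j in range(units)
        ]
        outputs.append(h)
    # either the state at every timestep, or only the final one
    return outputs if return_sequences else outputs[-1]


# Toy example: 4 timesteps, 2 input features, 3 recurrent units.
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w_in = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
w_rec = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.1]]
bias = [0.0, 0.1, -0.1]

all_states = simple_rnn_forward(toy_inputs, w_in, w_rec, bias, return_sequences=True)
last_state = simple_rnn_forward(toy_inputs, w_in, w_rec, bias)
print(len(all_states), len(all_states[0]))  # 4 3
print(len(last_state))  # 3
```

With `return_sequences=False` the time axis disappears from the output, because only the final hidden state is returned.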

As we have a binary classification task, we use a sigmoid as the final activation and binary cross-entropy as the loss function this time. If $y$ is the target value (1 or 0) and $p$ is the predicted probability (between 0 and 1), then the loss function looks like this:

$$
L = -\left( y \log p + (1 - y) \log (1 - p) \right)
$$

This is equivalent to the usual softmax categorical loss with two classes, where $y$ is the one-hot vector of target values and $p$ is the vector of probabilities that sums up to 1:

$$
L = -\sum_i y_i \log p_i
$$
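
A small numeric check may help here. The sketch below evaluates both losses for a single example, written with the conventional leading minus sign so that a lower value means a better prediction. The function names are illustrative and are not Keras functions.

```python
import math


def binary_crossentropy(y, p):
    """Loss for one example: y is the label (0 or 1), p the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_crossentropy(y_onehot, probs):
    """Loss for one example with a one-hot target and a vector of class probabilities."""
    return -sum(t * math.log(q) for t, q in zip(y_onehot, probs))


p = 0.8  # the model predicts "good" with probability 0.8

loss_if_good = binary_crossentropy(1, p)  # correct and confident: small loss
loss_if_bad = binary_crossentropy(0, p)   # wrong and confident: larger loss
print(loss_if_good, loss_if_bad)

# The same prediction written as two classes [bad, good] with a one-hot target.
two_class_loss = categorical_crossentropy([0, 1], [1 - p, p])
print(math.isclose(two_class_loss, loss_if_good))  # True
```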


For more information see Keras documentation:

* [Embedding layers](https://keras.io/layers/embeddings/)
* [Recurrent layers](https://keras.io/layers/recurrent/)
* [Loss functions](https://keras.io/losses/)

In [None]:
# Defining a simple RNN
embed_size = 128
rnn_size = 128

x = Input(shape=(None,), dtype='int32')
e = Embedding(vocab_size, embed_size, mask_zero=True)(x)
r = SimpleRNN(rnn_size, return_sequences=False)(e)
p = Dense(1, activation='sigmoid')(r)

rnn_model = Model(x, p)
rnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
rnn_model.summary()

**Task 3 (2pts):** What are the three dimensions of the Embedding layer output? Why is the output of the recurrent layer (SimpleRNN above) missing one of them?

**Your Answer:** *fill this in*

Let's train the model for 1 epoch.

In [None]:
# Training the simple RNN, should get accuracy above 50% (maybe even 60%)
# you can change nr of epochs to train longer, if you can afford spending more time
history = rnn_model.fit(X_train, y_train, batch_size=32, epochs=1, validation_data=(X_test, y_test))

A SimpleRNN layer is limited in how long it can keep its memory. The ability to remember can be improved with gated recurrent layers like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). Let's try to compare them!

But for the comparison to be fair, the networks need to have approximately the same number of parameters. LSTM and GRU use additional gates that have additional weights. Your task is to create LSTM and GRU networks that have approximately the same number of parameters as the SimpleRNN network above and train them.
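
As a rough guide to where the extra weights come from, the sketch below counts parameters under the standard formulations of these layers: every set of weights has an input kernel, a recurrent kernel and a bias; an LSTM has four such sets and a GRU three. The helper `recurrent_params` is hypothetical, and the GRU count depends on the `reset_after` setting of your Keras version, so treat `model.summary()` as the authority.

```python
def recurrent_params(units, input_dim, weight_sets=1, biases_per_set=1):
    """Count the trainable weights of a recurrent layer.

    Each weight set has an input kernel (input_dim x units), a recurrent
    kernel (units x units) and `biases_per_set` bias vectors of size `units`.
    """
    per_set = input_dim * units + units * units + biases_per_set * units
    return weight_sets * per_set


# Toy sizes, deliberately different from the ones used in the homework.
input_dim, units = 10, 8

simple_rnn_count = recurrent_params(units, input_dim, weight_sets=1)
lstm_count = recurrent_params(units, input_dim, weight_sets=4)
gru_count = recurrent_params(units, input_dim, weight_sets=3)
# With reset_after=True, Keras's GRU keeps two bias vectors per weight set.
gru_reset_after_count = recurrent_params(units, input_dim, weight_sets=3, biases_per_set=2)

print("SimpleRNN:", simple_rnn_count)                      # 152
print("LSTM:", lstm_count)                                 # 608
print("GRU:", gru_count)                                   # 456
print("GRU (reset_after=True):", gru_reset_after_count)    # 480
```

At equal layer size, the gated layers therefore have several times as many parameters as a SimpleRNN, so they need fewer units to match it.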

**Task 4 (1.5pts):**

In [None]:
##############################################################################
# TODO: 1) Construct a network with LSTM layer that has approximately the    #
#       same number of parameters as previous SimpleRNN network.             #                  
#       2) compile the model and print out summary                           #
#       3) train the model for one epoch (should get accuracy >80%)          #
##############################################################################
pass
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################

**Task 5 (1.5pts):**

In [None]:
##############################################################################
# TODO: 1) Construct a network with GRU layer that has approximately the     #
#       same number of parameters as previous SimpleRNN network.             #                  
#       2) compile the model and print out summary                           #
#       3) train the model for one epoch (should get accuracy >80%)          #
##############################################################################
pass
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################

**Task 6 (1.5pt):** Which recurrent unit produced the best results? How much smaller did the LSTM or GRU layer have to be to be comparable to SimpleRNN in the number of parameters?

**Your Answer:** *fill this in*

In [None]:
##############################################################################
# TODO: Assign the best model to model variable.                             #
##############################################################################
model = None
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################

## Try it out

Let's try out the best model you have! Your job is to create your own movie reviews and make the model classify them as either good or bad.

**Task 7 (3pts):**

In [None]:
def words2sequences(words):
    return [word2num[w]+3 if w in word2num else 2 for w in text_to_word_sequence(words)]
def sentences2sequences(sentences):
    return [[1] + words2sequences(s) for s in sentences]

##############################################################################
# TODO: Write one positive (> 0.9) and one negative (< 0.1) movie review.    #
#       Try to write it yourself, do not just copy paste reviews from        #
#       somewhere until you find one that works.                             #
##############################################################################
myreviews = [
    "", # fill this with good review
    "" # fill this with bad review
]
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################

myreviews_seq = sentences2sequences(myreviews)
X_myreviews = preprocess_sequences(myreviews_seq, vocab_size, maxlen)
model.predict(X_myreviews)


**Task 8 (1pt):** How would you grade your handmade artificial intelligence?

**Your Answer:** *fill this in*

## Word embeddings

As described above, the network receives each word as an integer index. Conceptually this is the same as feeding it a one-hot vector: multiplying the one-hot vector with the first weight matrix transforms the word into a real-valued vector, and we say the layer performs an embedding. Notice that, as there is only one 1 in the one-hot input vector, each word effectively corresponds to one row of the weight matrix.
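
The equivalence between multiplying by a one-hot vector and selecting a row can be checked on a tiny example. The matrix `toy_embedding` below is hypothetical (4 words, 2 dimensions) and stands in for the much larger weight matrix of the Embedding layer.

```python
# Hypothetical embedding matrix: one row per word index, 2 dimensions per word.
toy_embedding = [
    [0.0, 0.0],
    [0.5, -1.0],
    [1.5, 2.0],
    [-0.3, 0.7],
]


def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def vector_times_matrix(vec, matrix):
    """Multiply a row vector with a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(n_cols)]


word_index = 2
via_multiplication = vector_times_matrix(one_hot(word_index, 4), toy_embedding)
via_lookup = toy_embedding[word_index]

print(via_multiplication)                 # [1.5, 2.0]
print(via_multiplication == via_lookup)   # True
```

The lookup gives the same result without multiplying by all the zeros, which is why an embedding layer can take integer indexes directly.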

We can think of the weights that correspond to each word as the representation the network has given to that word. This representation depends on the task the network is performing, the dataset we use, etc. The network learns the representation that is most useful for its task.

In this section we want to see whether there is some consistency or logic in the representations that the network gives to the words. Are some words represented more similarly than others? To find out, we first need to compress the 128-dimensional representation (the number of nodes in the Embedding layer) into 2D in order to plot it. We can achieve this by using Principal Component Analysis (read more about it if you don't know what it is).

In [None]:
# let's extract the weight matrix and make sure its size makes sense
embed = model.layers[1].get_weights()[0]
embed.shape

Let's perform Principal Component Analysis on the rows of this matrix. We want to extract only the two most important axes of variance, because we want to plot the points in that space.


In [None]:
from sklearn.decomposition import PCA
embed_2d = PCA(n_components = 2).fit_transform(embed)
embed_2d.shape

Now we can look at where different words are positioned in the space defined by the first two components of PCA.

We define three sets of words that we color differently.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(16,8))
word_idx=[]
for word in ["great","awesome","beautiful","magnificent","masterpiece"]:
    idx = words2sequences(word)[0]
    word_idx.append(idx)
    plt.scatter(embed_2d[idx,0],embed_2d[idx,1],color="red",s=100)
    plt.text(embed_2d[idx, 0], embed_2d[idx, 1], num2word[idx- 3],fontsize=20)
for word in ["bad", "terrible", "boring", "lame"]:
    idx = words2sequences(word)[0]
    word_idx.append(idx)
    plt.scatter(embed_2d[idx,0],embed_2d[idx,1],color="blue",s=100)
    plt.text(embed_2d[idx, 0], embed_2d[idx, 1], num2word[idx- 3],fontsize=20)

for word in ["actor", "producer", "director", "dog","and","the"]:
    idx = words2sequences(word)[0]
    word_idx.append(idx)
    plt.scatter(embed_2d[idx,0],embed_2d[idx,1],color="gray",s=100)
    plt.text(embed_2d[idx, 0], embed_2d[idx, 1], num2word[idx- 3],fontsize=20)

plt.show()


**Task 9 (1pt):** What meaning does the first dimension of the PCA (the x-axis) seem to represent?

**Your Answer:** *fill this in*