# Deep Learning - Day 5 - Sentiment analysis with Word2Vec

### Exercise objectives:
- Convert words to vectors with Word2Vec
- Discover Sentiment analysis

<hr>
<hr>

In the previous exercise, you learnt how to transform sentences into vector representations that can be fed to a neural network. Let's use them to do some sentiment analysis.


# The data

Let's first load the IMDB dataset: it consists of movie reviews, each labeled as positive (label 1) or negative (label 0).

❓ **Question** ❓ Just load the data. 

⚠️ **Warning - Reminder** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, and you can run out of RAM. For that reason, start with 20% of the sentences and see if your computer handles it; if not, rerun with a lower number. Conversely, you can increase the number if your machine allows it.
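
To get a feel for why the percentage matters, here is a rough back-of-the-envelope sketch of the memory taken by a padded `float32` array of shape `(n_sentences, max_len, embedding_dim)`. The Keras IMDB training split contains 25,000 reviews, so 20% is 5,000. The `max_len` and `embedding_dim` values below are assumptions picked for illustration, and `padded_array_megabytes` is a hypothetical helper, not part of any library.

```python
def padded_array_megabytes(n_sentences, max_len, embedding_dim, bytes_per_value=4):
    """Approximate size in MB of a dense array (n_sentences, max_len, embedding_dim)."""
    n_values = n_sentences * max_len * embedding_dim
    return n_values * bytes_per_value / 1e6


# Assumed values: reviews padded to 1,000 words and a 60-dimensional embedding
size_20 = padded_array_megabytes(n_sentences=5_000, max_len=1_000, embedding_dim=60)
size_100 = padded_array_megabytes(n_sentences=25_000, max_len=1_000, embedding_dim=60)

print(f"20% of the data : ~{size_20:.0f} MB")
print(f"100% of the data: ~{size_100:.0f} MB")
```

Under these assumptions the array grows linearly with the number of sentences, which is why lowering the percentage is the first thing to try when memory runs short.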

In [None]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} mapping and shift the ids by 3 to make room for the special tokens
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    X_test = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_test]
    
    return X_train, y_train, X_test, y_test


### Just run this cell to load the data
X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=20)

❓ **Question** ❓ Here, let's reuse exactly what you did in the previous exercise. Reuse the previous functions to get data that you can feed to a neural network. To do that, you have to:

- **Step #1**: convert `X_train` and `X_test` from list of strings (sentences) to list of list of strings (words)
- **Step #2**: import gensim and train a word2vec algorithm on the training sentences. You are free to choose your hyperparameters, but do not load a pretrained model here.
- **Step #3**: convert your list of list of strings to list of list of vectors thanks to the trained word2vec embedding.
- **Step #4**: pad your input sequences and store the results in `X_train_pad` and `X_test_pad`

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############
from gensim.models import Word2Vec
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences


# –– Step #1
def convert_sentences(X):
    return [sentence.split(' ') for sentence in X]

X_train_words = convert_sentences(X_train)
X_test_words = convert_sentences(X_test)


# –– Step #2
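# Train on the tokenized sentences (lists of words), not on the raw strings
# `vector_size` is the gensim >= 4.0 name of this argument (it was `size` in gensim 3.x)
word2vec = Word2Vec(sentences=X_train_words, vector_size=60, min_count=10, window=10)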


# –– Step #3
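def embed_sentence(word2vec, sentence):
    # Accept either a full Word2Vec model (vectors live in `.wv`)
    # or a KeyedVectors object such as the one returned by `api.load`
    wv = getattr(word2vec, 'wv', word2vec)

    embedded_sentence = []
    for word in sentence:
        # Words unknown to the embedding are skipped
        if word in wv:
            embedded_sentence.append(wv[word])
        
    return np.array(embedded_sentence)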

def embedding(word2vec, sentences):
    embed = []
    
    for sentence in sentences:
        embedded_sentence = embed_sentence(word2vec, sentence)
        embed.append(embedded_sentence)
        
    return embed

X_train_embed = embedding(word2vec, X_train_words)
X_test_embed = embedding(word2vec, X_test_words)


# –– Step #4
X_train_pad = pad_sequences(X_train_embed, dtype='float32', padding='post')
X_test_pad = pad_sequences(X_test_embed, dtype='float32', padding='post')
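
To make the effect of `padding='post'` concrete, here is a minimal numpy sketch of what the padding step produces on toy data. `pad_post` is a hypothetical helper written for illustration; it mimics the result of the padding, not the actual Keras implementation, and the toy vectors are made up.

```python
import numpy as np


def pad_post(sequences, dim, value=0.0):
    """Pad variable-length sequences of `dim`-dimensional vectors at the end (illustration only)."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len, dim), value, dtype="float32")
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # a sentence without any known word stays fully padded
        padded[i, :len(seq)] = seq
    return padded


# Two toy "sentences" of 3 words and 1 word, embedded in a 2-dimensional space
toy = [np.ones((3, 2)), np.ones((1, 2))]
toy_pad = pad_post(toy, dim=2)

print(toy_pad.shape)  # (2, 3, 2)
print(toy_pad[1])     # one row of ones followed by two rows of zeros
```

Every sentence ends up with the length of the longest one, which is what turns the list of arrays into a single 3-dimensional array.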

❓ **Question** ❓ To be sure that it worked, please check the following for `X_train_pad` and `X_test_pad` :
- they are numpy arrays
- they are 3-dimensional
- the last dimension is of the size of your word2vec embedding space (you can get it with `word2vec.wv.vector_size`)
- the first dimension is of the size of your `X_train` and `X_test`

✅ **Good Practice** ✅ Such tests are quite important! Not only in this exercise, but in real-life applications. They let you catch errors early instead of letting them propagate through the entire notebook.

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

for X in [X_train_pad, X_test_pad]:
    assert isinstance(X, np.ndarray)
    assert X.ndim == 3
    assert X.shape[-1] == word2vec.wv.vector_size


assert X_train_pad.shape[0] == len(X_train)
assert X_test_pad.shape[0] == len(X_test)
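
Since the same checks are useful for any embedded and padded dataset, they can be wrapped in a small helper. The sketch below is one possible way to do it; `check_padded` is a hypothetical name and the toy array only stands in for the real data.

```python
import numpy as np


def check_padded(X_pad, n_sentences, vector_size):
    """Raise an AssertionError if X_pad is not a (n_sentences, max_len, vector_size) array."""
    assert isinstance(X_pad, np.ndarray), "expected a numpy array"
    assert X_pad.ndim == 3, f"expected 3 dimensions, got {X_pad.ndim}"
    assert X_pad.shape[0] == n_sentences, "expected one row per sentence"
    assert X_pad.shape[-1] == vector_size, "last dimension must match the embedding size"


# Toy array standing in for a padded dataset: 4 sentences, 7 words, 60-dimensional vectors
check_padded(np.zeros((4, 7, 60), dtype="float32"), n_sentences=4, vector_size=60)
```

In this notebook, the call would look like `check_padded(X_train_pad, len(X_train), word2vec.wv.vector_size)`.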

# Baseline model

❓ **Question** ❓ What is your baseline accuracy? In this case, your baseline can be to always predict the most frequent label in `y_train` (of course, if the dataset is balanced, the baseline accuracy is 1/n where n is the number of classes - 2 here).

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

from sklearn.metrics import accuracy_score

unique, counts = np.unique(y_train, return_counts=True)
counts = dict(zip(unique, counts))
print('Number of labels in train set', counts)

y_pred = 0 if counts[0] > counts[1] else 1

print('Baseline accuracy: ', accuracy_score(y_test, [y_pred]*len(y_test)))

# The model

❓ **Question** ❓ Write an RNN with the following layers:
- a masking layer
- an LSTM with 20 units and tanh activation function
- a Dense with 10 units
- an output layer that depends on your task

Then, compile your model (we advise you to use `rmsprop` as the optimizer - at least to begin with).

In [None]:
def init_model():
    ### YOUR CODE HERE
    return model

model = init_model()

In [None]:
##############
### Answer ###
##############

from tensorflow.keras import Sequential
from tensorflow.keras import layers

def init_model():
    model = Sequential()
    model.add(layers.Masking())
    model.add(layers.LSTM(20, activation='tanh'))
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    
    return model

model = init_model()
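
The masking layer is what lets the LSTM ignore the padding. As a rough numpy sketch (an illustration of the idea, not Keras' actual code), a timestep is skipped when every value of its vector equals the mask value, which is `0.` by default in `layers.Masking`. The toy values below are made up.

```python
import numpy as np

# Toy padded batch: 2 sentences, 3 timesteps, 2-dimensional vectors
toy_pad = np.array([
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # 3 real words
    [[0.7, 0.8], [0.0, 0.0], [0.0, 0.0]],   # 1 real word + 2 padded steps
], dtype="float32")

# True where at least one value differs from the mask value, i.e. the step is kept
kept = np.any(toy_pad != 0.0, axis=-1)

print(kept.tolist())  # [[True, True, True], [True, False, False]]
```

This is also why the sequences are padded with zeros: the padding value and the mask value have to match.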

❓ **Question** ❓ Fit the model on your embedded and padded data - do not forget the early stopping criterion.

❗ **Remark** ❗ Your accuracy will greatly depend on your training corpus. Here, just make sure that your performance is above the baseline model (which should be the case even if you loaded only 20% of the initial IMDB data).

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

#X_train_pad_short = X_train_pad[:500] # These two lines are just to accelerate the cell run
#y_train_short = y_train[:500]

from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, restore_best_weights=True)

model.fit(X_train_pad, y_train, 
          batch_size = 32,
          epochs=100,
          validation_split=0.3,
          callbacks=[es]
         )
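
As a reminder of what the early stopping criterion does, here is a simplified pure-Python sketch of the patience logic: training stops once the monitored value (`val_loss` by default) has not improved for `patience` consecutive epochs, and `restore_best_weights=True` brings back the weights of the best epoch. `early_stopping_epoch` is a hypothetical helper and the losses are made up; the real callback has more options.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The criterion was never triggered: training ran through all epochs
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is reached at epoch 2
stop_epoch, best_epoch = early_stopping_epoch([0.70, 0.60, 0.55, 0.56, 0.57, 0.58], patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

With `patience=5` as in the cell above, the model is given five extra epochs to beat its best validation loss before training stops.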

❓ **Question** ❓ Evaluate your model on the test set

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

res = model.evaluate(X_test_pad, y_test, verbose=0)

print(f'The accuracy evaluated on the test set is {res[1]*100:.3f}%')

# Trained Word2Vec - Transfer Learning

Your accuracy, while above the baseline model, might be quite low. There are multiple options to improve it, such as cleaning the data and improving the quality of the embedding.

We won't dig into data cleaning strategies here. Instead, let's try to improve the quality of our embedding. Rather than just training on a larger corpus, why not benefit from an embedding that others have already learnt? After all, a good embedding, i.e. one where related words end up close to each other, can be learnt on a different task and corpus than ours. This is exactly what transfer learning is.

❓ **Question** ❓ As shown in the previous exercise, load a pretrained word2vec embedding space.

The list of the different models is available with : 

```
import gensim.downloader as api
print(list(api.info()['models'].keys()))
```

Then you can `api.load(the-model-of-your-choice)`.

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

import gensim.downloader as api
print(list(api.info()['models'].keys()))


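# Note: `api.load` returns the word vectors directly (a KeyedVectors object),
# here GloVe vectors trained on Wikipedia + Gigaword, not a full Word2Vec model
word2vec_wiki = api.load("glove-wiki-gigaword-50")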

❓ **Question** ❓ Use the embedding that you just loaded to embed `X_train` and `X_test`! Do not forget to pad your results and store them in `X_train_pad_2` and `X_test_pad_2`.

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

# –– Convert list of sentences to list of list of words
X_train_words_2 = convert_sentences(X_train)
X_test_words_2 = convert_sentences(X_test)


# –– Embed the sentences thanks to the new embedding
X_train_embed_2 = embedding(word2vec_wiki, X_train_words_2)
X_test_embed_2 = embedding(word2vec_wiki, X_test_words_2)


# –– Pad the sentences
X_train_pad_2 = pad_sequences(X_train_embed_2, dtype='float32', padding='post')
X_test_pad_2 = pad_sequences(X_test_embed_2, dtype='float32', padding='post')

❓ **Question** ❓ Reinitialize a model and fit it on your new embedded (and padded) data!  Evaluate it on your test set and compare it to your previous accuracy.

❗ **Remark** ❗ The training could take some time here. You can just run 10 epochs (this is **not** a good practice, it is only to avoid waiting too long) and go to the next exercise while it trains - or take a break, you probably deserve it ;)

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, restore_best_weights=True)

model = init_model()

model.fit(X_train_pad_2, y_train, 
          batch_size = 32,
          epochs=10,
          validation_split=0.3,
          callbacks=[es]
         )

In [None]:
##############
### Answer ###
##############

res = model.evaluate(X_test_pad_2, y_test, verbose=0)

print(f'The accuracy evaluated on the test set is {res[1]*100:.3f}%')

❓ **Question** ❓ According to you, what causes the model to take so much time to train, especially compared to the first training? To understand it, you can check the size of `X_train_pad` compared to `X_train_pad_2`.

In [None]:
### YOUR CODE HERE

In [None]:
##############
### Answer ###
##############

print(np.shape(X_train_pad))
print(np.shape(X_train_pad_2))

Because your new embedding has been trained on a very large corpus, it has a representation for many more words than the word2vec you trained on your small dataset, especially as you discarded the words that did not appear more than a given number of times in the train set. As a result, far fewer words are skipped when embedding each sentence, so the sequences (and therefore the padded arrays) are longer, which makes each iteration slower than before.
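
The effect can be reproduced on toy data. In the sketch below, the two vocabularies are made up: the small one stands in for the embedding trained in this notebook and the large one for a pretrained embedding. `kept_lengths` is a hypothetical helper that counts, for each sentence, the words that would receive a vector.

```python
def kept_lengths(sentences, vocabulary):
    """Number of words per sentence that have a vector in `vocabulary`."""
    return [sum(word in vocabulary for word in sentence) for sentence in sentences]


sentences = [
    "this movie was a surprisingly touching masterpiece".split(),
    "boring plot and wooden acting".split(),
]

# Made-up vocabularies for illustration
small_vocab = {"this", "movie", "was", "a", "and", "plot", "acting"}
large_vocab = small_vocab | {"surprisingly", "touching", "masterpiece", "boring", "wooden"}

print(kept_lengths(sentences, small_vocab))  # [4, 3]
print(kept_lengths(sentences, large_vocab))  # [7, 5]
```

The padded length is the largest of these counts, so in this toy case the time dimension would grow from 4 to 7 steps, and the LSTM has that many more steps to process for every sentence.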