# Deep Learning - Day 5 - Sentiment analysis with Word2Vec

### Exercise objectives:
- Convert words to vectors with Word2Vec
- Discover Sentiment analysis

<hr>

In the previous exercise, you have learnt how to transform sentences into vector representations that can be fed to a neural network. Let's use it to do some Sentiment Analysis.


# The data

Let's first load the IMDB dataset: it consists of movie reviews, each labelled as positive (label 1) or negative (label 0).

❓ **Question** ❓ Just load the data. 

⚠️ **Warning - Reminder** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, and you can run out of RAM. For that reason, start with 20% of the sentences and see if your computer handles it. If not, rerun with a lower number. Conversely, you can increase the number if your machine can take it.

In [1]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} mapping, shifted by 3 to make room for the special tokens
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    X_test = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_test]
    
    return X_train, y_train, X_test, y_test


### Just run this cell to load the data
X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=10)

In [3]:
len(X_train)

2500

❓ **Question** ❓ Here, let's reuse what you did in the previous exercise to get data that you can feed to a neural network. To do that, you have to:

- Step #1: import gensim and train a word2vec algorithm on the training sentences. You can definitely choose your hyperparameters. But do not load a pretrained model here.
- Step #2: convert `X_train` and `X_test` from list of strings (sentences) to list of list of strings (words)
- Step #3: convert your list of list of strings to list of list of vectors thanks to the trained word2vec embedding.
- Step #4: pad your input sequences and store the results in `X_train_pad` and `X_test_pad`
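Step #4 is what makes the sentences stackable into one array. As a rough, library-free sketch of post-padding (the helper `pad_post` and the toy vectors are made up for illustration; the exercise itself uses Keras `pad_sequences`):

```python
# Toy illustration of post-padding; `pad_post` is a hypothetical helper,
# not the Keras `pad_sequences` function used in the exercise.

def pad_post(sequences, vector_size):
    """Append zero vectors to each sequence until all have the same length."""
    max_len = max(len(seq) for seq in sequences)
    zero = [0.0] * vector_size
    return [list(seq) + [zero] * (max_len - len(seq)) for seq in sequences]

# Two "sentences" embedded in a 2-dimensional space
toy = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 words
    [[0.7, 0.8]],                          # 1 word
]
padded = pad_post(toy, vector_size=2)

# Every sentence now has the length of the longest one
print(len(padded), len(padded[0]), len(padded[1]))
print(padded[1])
```

The shorter sentence ends with zero vectors, which is what the masking layer later relies on to ignore the padding.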

In [4]:
from nltk.tokenize import word_tokenize
def convert_string(x):
    word_tokens = word_tokenize(x) 
    text = [w for w in word_tokens ]
    return text

def convert(X):
    output = []
    for x in X:
        output.append(convert_string(x))
    return output

sentences_train = convert(X_train)
sentences_test = convert(X_test)

In [7]:
from gensim.models import Word2Vec
# `size` is the gensim 3.x argument name; gensim >= 4.0 renamed it to `vector_size`
word2vec = Word2Vec(sentences=sentences_train, size=50, min_count = 8)

In [10]:
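import numpy as np

def embed_sentence(word2vec, sentence):
    # Accept either a full Word2Vec model (vectors live in `.wv`) or bare KeyedVectors
    wv = getattr(word2vec, "wv", word2vec)
    # Keep only the words known to the embedding; unknown words are skipped
    return np.array([wv[word] for word in sentence if word in wv])

def embedding(word2vec, sentences):
    # Return a plain list: sentences have different lengths, so they cannot
    # be stacked into a single array before padding
    return [embed_sentence(word2vec, sentence) for sentence in sentences]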
    

In [13]:
X_trainf = embedding(word2vec, sentences_train)
X_testf = embedding(word2vec, sentences_test)

from tensorflow.keras.preprocessing.sequence import pad_sequences
X_train_pad = pad_sequences(X_trainf, dtype='float32', padding='post')
X_test_pad = pad_sequences(X_testf, dtype='float32', padding='post')

❓ **Question** ❓ To be sure that it worked, please check the following for `X_train_pad` and `X_test_pad` :
- they are numpy arrays
- they are 3-dimensional
- the last dimension is of the size of your word2vec embedding space (you can get it with `word2vec.wv.vector_size`)
- the first dimension is of the size of your `X_train` and `X_test`

✅ **Good Practice** ✅ Such tests are quite important, not only in this exercise but in real-life applications. They catch errors early, before they propagate through the entire notebook.

In [18]:
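import numpy as np

# Both outputs must be 3-dimensional numpy arrays
assert isinstance(X_train_pad, np.ndarray) and isinstance(X_test_pad, np.ndarray)
assert X_train_pad.ndim == 3 and X_test_pad.ndim == 3
# Last dimension: size of the embedding space
assert X_train_pad.shape[2] == word2vec.wv.vector_size
assert X_test_pad.shape[2] == word2vec.wv.vector_size
# First dimension: number of sentences
assert X_train_pad.shape[0] == len(X_train)
assert X_test_pad.shape[0] == len(X_test)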

# Baseline model

❓ **Question** ❓ What is your baseline accuracy? In this case, your baseline can be to predict the label that is the most present in `y_train` (of course, if the dataset is balanced, the baseline accuracy is 1/n where n is the number of classes - 2 here).
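As a minimal sketch of that idea, assuming labels are plain Python lists (the helper name `majority_baseline_accuracy` and the toy labels are made up for illustration):

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Always predict the most frequent training label and score it on the test labels."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    correct = sum(1 for y in y_test if y == majority_label)
    return majority_label, correct / len(y_test)

toy_y_train = [1, 1, 1, 0, 0]  # label 1 is the majority in the train set
toy_y_test = [1, 0, 0, 0]      # but not in the test set

label, acc = majority_baseline_accuracy(toy_y_train, toy_y_test)
print(label, acc)
```

Note that the majority label is chosen on the train set only, so the baseline accuracy measured on the test set can end up below 1/n when the two sets are not distributed identically.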

In [19]:
import pandas as pd
df = pd.DataFrame(y_train)

In [23]:
df[0].value_counts()

1    1282
0    1218
Name: 0, dtype: int64

In [24]:
y_pred = np.ones(len(y_test))  # always predict the majority label of y_train (1)

In [26]:
from sklearn.metrics import accuracy_score
baseline_score = accuracy_score(y_test, y_pred)
print('baseline accuracy is:', baseline_score)

baseline accuracy is: 0.4768


# The model

❓ **Question** ❓ Write an RNN with the following layers:
- a masking layer
- an LSTM with 20 units and tanh activation function
- a Dense with 10 units
- an output layer that depends on your task

Then, compile your model (we advise you to use `rmsprop` as the optimizer, at least to begin with).
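To see why the masking layer matters after padding, here is a small sketch of the rule it applies: a timestep is skipped when all of its features equal the mask value (0 by default in Keras `Masking`). The helper `timestep_mask` is hypothetical and only mimics that rule on plain lists:

```python
def timestep_mask(sequence, mask_value=0.0):
    """Return True for timesteps to keep, False for timesteps made only of mask_value."""
    return [any(x != mask_value for x in vector) for vector in sequence]

# One padded sentence: two real word vectors followed by two padding vectors
padded_sentence = [[0.7, 0.8], [0.0, 0.3], [0.0, 0.0], [0.0, 0.0]]

mask = timestep_mask(padded_sentence)
print(mask)
```

A word vector that contains a single zero coordinate is still kept; only all-zero timesteps, i.e. the padding, are ignored by the layers that follow.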

In [49]:
from tensorflow.keras.utils import to_categorical
test = to_categorical(y_train)

In [61]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

def init_model():
    model = Sequential()
    model.add(layers.Masking())
    model.add(layers.LSTM(20, activation='tanh'))
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    return model

model = init_model()
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop', 
              metrics=['accuracy'])

❓ **Question** ❓ Fit the model on your embedded and padded data - do not forget the early stopping criterion.

❗ **Remark** ❗ Your accuracy will greatly depend on your training corpus. Here, just make sure that your performance is above the baseline model (which should be the case even if you loaded only 20% of the initial IMDB data).
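As a reminder of how the early stopping criterion behaves, here is a simplified sketch of the patience logic, assuming the monitored quantity is the validation loss (the Keras default). The function `early_stopping_epoch` and the loss values are invented for illustration:

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch (1-indexed) at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_val_losses = [0.69, 0.65, 0.66, 0.67, 0.68, 0.70, 0.71]

stop_patience_2 = early_stopping_epoch(toy_val_losses, patience=2)
stop_5_epochs = early_stopping_epoch(toy_val_losses[:5], patience=5)
print(stop_patience_2, stop_5_epochs)
```

Under this logic, a patience that is as large as the number of epochs can never trigger, because the first epoch always counts as an improvement. For the callback to have an effect, the number of epochs has to exceed the patience.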

In [35]:
type(y_train[0])

numpy.int64

In [44]:
test = y_train.astype('int32')
type(test[0])

numpy.int32

In [62]:
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(patience = 5)

model.fit(X_train_pad, y_train,
          validation_split=0.3,
          batch_size=30,
          epochs=5,
          callbacks=[es],
          verbose=1,
         )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f32d845bd00>

❓ **Question** ❓ Evaluate your model on the test set

In [64]:
model.evaluate(X_test_pad, y_test)



[0.6678482890129089, 0.5983999967575073]

# Trained Word2Vec - Transfer Learning

Your accuracy, while above the baseline model, might be quite low. There are multiple options to improve it, such as cleaning the data or improving the quality of the embedding.

We won't dig into data cleaning strategies here. Instead, let's try to improve the quality of our embedding. Rather than training on a larger corpus ourselves, why not benefit from an embedding that others have already learnt? The quality of an embedding, i.e. how well the proximity of word vectors reflects the proximity of word meanings, can be learnt on one task and reused on another. This is exactly what transfer learning is.
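One common way to measure the proximity of two word vectors is cosine similarity. The sketch below uses invented 2-dimensional vectors (they do not come from any trained model) simply to show the computation:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embedding, for illustration only
toy_embedding = {
    "good": [0.9, 0.1],
    "great": [0.8, 0.2],
    "awful": [-0.7, 0.3],
}

sim_close = cosine_similarity(toy_embedding["good"], toy_embedding["great"])
sim_far = cosine_similarity(toy_embedding["good"], toy_embedding["awful"])
print(round(sim_close, 2), round(sim_far, 2))
```

In a good embedding, words with related meanings should score higher than unrelated or opposite ones, which is the property a pretrained model brings along.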

❓ **Question** ❓ As shown in the previous exercise, load a pretrained word2vec embedding space.

The list of the different models is available with : 

```
import gensim.downloader as api
print(list(api.info()['models'].keys()))
```

Then you can `api.load(the-model-of-your-choice)`.

In [65]:
import gensim.downloader as api
print(list(api.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [66]:
word2vec_2 = api.load('glove-twitter-25')

❓ **Question** ❓ Embed `X_train` and `X_test` in your new embedding space (with the new word2vec that you loaded)!
Store the results in `X_train_pad_2` and `X_test_pad_2`.

In [67]:
X_trainf = embedding(word2vec_2, sentences_train)
X_testf = embedding(word2vec_2, sentences_test)


X_train_pad_2 = pad_sequences(X_trainf, dtype='float32', padding='post')
X_test_pad_2 = pad_sequences(X_testf, dtype='float32', padding='post')



In [68]:
del X_trainf
del X_testf

In [69]:
del X_test_pad
del X_train_pad

❓ **Question** ❓ Reinitialize a model and fit it on your new embedded (and padded) data!  Evaluate it on your test set and compare it to your previous accuracy.

❗ **Remark** ❗ The training could take some time here. You can just compute 10 epochs (this is **not** a good practice, it is only to avoid waiting too long) and go to the next exercise while it trains, or take a break, you probably deserve it ;)

In [71]:
def init_model():
    model = Sequential()
    model.add(layers.Masking())
    model.add(layers.LSTM(20, activation='tanh'))
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    return model

model = init_model()
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop', 
              metrics=['accuracy'])

In [72]:
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(patience = 5)

model.fit(X_train_pad_2, y_train,
          validation_split=0.3,
          batch_size=30,
          epochs=5,
          callbacks=[es],
          verbose=1,
         )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f33839fe940>

❓ **Question** ❓ In your opinion, what causes the model to take so much time to train, especially compared to the first training? To understand it, you can compare the shape of `X_train_pad` with that of `X_train_pad_2`.

In [75]:
model.evaluate(X_test_pad_2, y_test)



[0.6110473871231079, 0.696399986743927]

In [77]:
X_train_pad_2.shape

(2500, 1635, 25)

Because your new word2vec has been trained on a large corpus, it has a representation for many, many words! Far more than the embedding trained on your small dataset, especially as you discarded words that did not appear at least a given number of times in the train set (the `min_count` argument). As a result, far fewer words are skipped, so the embedded sentences in your train and test sets are much longer, which makes each iteration slower than before.
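The effect can be reproduced on a toy example. The sketch below counts how many words of each sentence survive the vocabulary filter; the sentences, the two vocabularies and the helper `kept_lengths` are all made up for illustration:

```python
def kept_lengths(sentences, vocabulary):
    """Number of words per sentence that have a vector in the given vocabulary."""
    return [sum(1 for word in sentence if word in vocabulary) for sentence in sentences]

toy_sentences = [
    "this movie was a truly mesmerising experience".split(),
    "a dull movie".split(),
]

# Small vocabulary: rare words were dropped (as with `min_count`)
small_vocab = {"this", "movie", "was", "a"}
# Large vocabulary: a pretrained embedding knows the rare words too
large_vocab = small_vocab | {"truly", "mesmerising", "experience", "dull"}

small = kept_lengths(toy_sentences, small_vocab)
large = kept_lengths(toy_sentences, large_vocab)

# After padding, every sentence takes the length of the longest one
print(max(small), max(large))
```

Since padding aligns every sentence on the longest one, the second dimension of the padded array grows with the vocabulary coverage, and the LSTM has that many more timesteps to process per sentence.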