# <!-- TITLE --> Sentiment analysis
<!-- DESC --> This notebook is an example of sentiment analysis using a dataset from the Internet Movie Database (IMDB). 
<!-- AUTHOR : Jean-Luc Parouty (CNRS/SIMaP) -->

## Objectives :
 - The objective is to guess whether film reviews are **positive or negative** based on the analysis of the text. 
 - Understand the management of **textual data** and **sentiment analysis**

It is divided into 3 parts:

- **Part 1**: Build a classifier with one-hot encoding
- **Part 2**: Replace one-hot encodings by word embeddings
- **Part 3**: Combine word embedding and a recurrent architecture

The original dataset can be found **[here](http://ai.stanford.edu/~amaas/data/sentiment/)**  
Note that [IMDb.com](https://imdb.com) offers several easy-to-use [datasets](https://www.imdb.com/interfaces/)  
For simplicity's sake, we'll use the dataset directly [embedded in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets)

## What we're going to do :

 - Retrieve data
 - Prepare the data
 - Build a model
 - Train the model
 - Evaluate the result

Disclaimer: This notebook is based on [fidle-cnrs](https://gricad-gitlab.univ-grenoble-alpes.fr/talks/fidle/-/tree/master)

# Preliminaries: import and init

In [None]:
import numpy as np

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.datasets.imdb as imdb

import matplotlib.pyplot as plt
import matplotlib

import pandas as pd

from sklearn.metrics import confusion_matrix

import os,sys,h5py,json
from importlib import reload

In [None]:
run_dir = os.getcwd()
output_dir='data'

# Part 1: Model based on one-hot encoding

## Step 1 - Parameters

The words in the vocabulary are classified from the most frequent to the rarest.\
`vocab_size` is the number of words we will remember in our vocabulary (the other words will be considered as unknown).\
`hide_most_frequently` is the number of ignored words, among the most common ones\
`fit_verbosity` is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch

In [None]:
vocab_size           = 5000
hide_most_frequently = 0

epochs               = 10
batch_size           = 512
fit_verbosity        = 1
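
As a rough illustration of what `vocab_size` and `hide_most_frequently` do, the sketch below mimics in plain Python the filtering applied to already-indexed reviews: indices outside `[hide_most_frequently, vocab_size)` are replaced by the unknown-word code. The helper `filter_indices` and the toy review are hypothetical, and the code 2 for unknown words is assumed to be the Keras default.

```python
# Hypothetical helper: keep indices in [skip_top, num_words), replace the others by oov_char.
def filter_indices(review, num_words, skip_top=0, oov_char=2):
    return [w if skip_top <= w < num_words else oov_char for w in review]

toy_review = [1, 14, 22, 4999, 5000, 16, 8731]   # made-up indices, 1 = <start>

print(filter_indices(toy_review, num_words=5000))               # rare words become 2
print(filter_indices(toy_review, num_words=5000, skip_top=15))  # very frequent words become 2 as well
```

Note that with a non-zero `skip_top`, the \<start\> code itself falls below the threshold and is replaced too.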

## Step 2 - Understanding one-hot encoding
#### We have a **sentence** and a **dictionary** :

In [None]:
sentence = "I've never seen a movie like this before"

dictionary  = {"a":0, "before":1, "fantastic":2, "i've":3, "is":4, "like":5, "movie":6, "never":7, "seen":8, "this":9}

#### We encode our sentence as a **numerical vector** :

In [None]:
sentence_words = sentence.lower().split()

sentence_vect  = [ dictionary[w] for w in sentence_words ]

print('Words sentence are         : ', sentence_words)
print('Our vectorized sentence is : ', sentence_vect)

#### Next, we **one-hot** encode our vectorized sentence as a tensor :

In [None]:
# ---- We get a (dictionary size x sentence length) matrix of zeros
#
onehot = np.zeros( (10,8) )

# ---- We set a 1 for each word: row = word index, column = position in the sentence
#
for i,w in enumerate(sentence_vect):
    onehot[w,i]=1

# --- Show it
#
print('In a basic way :\n\n', onehot, '\n\nWith a pandas view :\n')
data={ f'{sentence_words[i]:.^10}':onehot[:,i] for i,w in enumerate(sentence_vect) }
df=pd.DataFrame(data)
df.index=dictionary.keys()
# --- Pandas Warning 
# 
df.style.format('{:1.0f}').highlight_max(axis=0).set_properties(**{'text-align': 'center'})
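
The loop above can also be written as a single indexing operation. This is only an equivalent sketch: it rebuilds `sentence_vect` from the same sentence and dictionary so that the block runs on its own, and `onehot_eye` is a name introduced here.

```python
import numpy as np

sentence   = "I've never seen a movie like this before"
dictionary = {"a":0, "before":1, "fantastic":2, "i've":3, "is":4, "like":5, "movie":6, "never":7, "seen":8, "this":9}
sentence_vect = [dictionary[w] for w in sentence.lower().split()]

# Row k of the identity matrix is the one-hot vector of word k.
# Selecting one row per word, then transposing, gives one column per word.
onehot_eye = np.eye(len(dictionary))[sentence_vect].T

print(onehot_eye.shape)   # (dictionary size, sentence length)
```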

## Step 3 - Load dataset
For simplicity, we will use a pre-formatted dataset - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data)  
However, Keras offers some useful tools for formatting textual data - See [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text)  

By default : 
 - Start of a sequence will be marked with : 1
 - Out of vocabulary word will be : 2
 - First index will be : 3

In [None]:
# ----- Retrieve x,y
#
(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words=vocab_size, skip_top=hide_most_frequently)

y_train = np.asarray(y_train).astype('float32')
y_test  = np.asarray(y_test ).astype('float32')

# ---- About
#
print("x_train : {}  y_train : {}".format(x_train.shape, y_train.shape))
print("x_test  : {}  y_test  : {}".format(x_test.shape,  y_test.shape))

## Step 4 - About our dataset
When we loaded the dataset, the default settings encoded \<start\> as 1 and \<unknown word\> as 2  
The word indices are therefore shifted by 3 (default parameter `index_from=3`)
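
A minimal sketch of this shift, using a made-up frequency ranking (`toy_rank`) rather than the real IMDB dictionary: each word's rank is offset by 3, and the sequence is prefixed with the \<start\> code. The helper `encode` is hypothetical.

```python
# Made-up frequency ranks (1 = most frequent); not the real IMDB dictionary.
toy_rank = {"the": 1, "movie": 2, "was": 3, "great": 4}

START, UNKNOWN, INDEX_FROM = 1, 2, 3

def encode(text):
    # Known words get rank + 3, unknown words get the code 2
    codes = [toy_rank[w] + INDEX_FROM if w in toy_rank else UNKNOWN for w in text.split()]
    return [START] + codes

print(encode("the movie was great"))
print(encode("the movie was dull"))
```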

### 4.1 - Sentences encoding

In [None]:
print('\nReview example (x_train[12]) :\n\n',x_train[12])
print('\nOpinions (y_train) :\n\n',y_train)

### 4.2 - Load dictionary

In [None]:
# ---- Retrieve dictionary {word:index}
#
word_index = imdb.get_word_index()

# ---- Shift the dictionary by +3
#
word_index = {w:(i+3) for w,i in word_index.items()}

# ---- Add <pad>, <start>, <unknown> and <undef> tags
#
word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )

# ---- Create a reverse dictionary : {index:word}
#
index_word = {index:word for word,index in word_index.items()} 

# ---- About dictionary
#
print('\nDictionary size     : ', len(word_index))
print('\nSmall extract :\n')
for k in range(440,455):print(f'    {k:2d} : {index_word[k]}' )

# ---- Add a helper function to translate an encoded review back to text :
#
def dataset2text(review):
    return ' '.join([index_word.get(i, '?') for i in review])

### 4.3 - Have a look, in human-readable form

In [None]:
print(f'Review example : \n{x_train[12]}')

print(f'After translation : \n{dataset2text(x_train[12])}')

### 4.4 - A few statistics

In [None]:
sizes=[len(i) for i in x_train]
plt.figure(figsize=(16,6))
plt.hist(sizes, bins=400)
plt.gca().set(title='Distribution of reviews by size - [{:5.2f}, {:5.2f}]'.format(min(sizes),max(sizes)), 
              xlabel='Size', ylabel='Count', xlim=[0,1500])
plt.show()

In [None]:
unk=[ 100*(s.count(2)/len(s)) for s in x_train]
plt.figure(figsize=(16,6))
plt.hist(unk, bins=100)
plt.gca().set(title='Percent of unknown words - [{:5.2f}, {:5.2f}]'.format(min(unk),max(unk)), 
              xlabel='% unknown', ylabel='Count', xlim=[0,30])
plt.show()

## Step 5 - Basic approach with "one-hot" vector encoding

Each sentence is encoded as a single **vector** whose length is equal to the **size of the dictionary**.  
The value of each component is 1 if the corresponding word is present in the sentence, and 0 otherwise.

For a sentence s=[3,4,7] and a dictionary of 10 words...    
We will have a vector v=[0,0,0,1,1,0,0,1,0,0]
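
This small example can be checked directly with NumPy. It is only a sketch for the sentence s=[3,4,7] and a 10-word dictionary:

```python
import numpy as np

s = [3, 4, 7]          # word indices of the sentence
v = np.zeros(10)       # one component per dictionary word
v[s] = 1               # mark the words that are present

print(v.astype(int).tolist())
```

With this encoding, word order and repetitions are lost: only the presence of each word is kept.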

### 5.1 - Our one-hot encoder

In [None]:
def one_hot_encoder(x, vector_size=10000):
    
    # ---- Set all to 0
    #
    x_encoded = np.zeros((len(x), vector_size))
    
    # ---- For each sentence
    #
    for i,sentence in enumerate(x):
        for word in sentence:
            x_encoded[i, word] = 1.

    return x_encoded

### 5.2 - Encoding

In [None]:
x_train_one_hot = one_hot_encoder(x_train, vector_size=vocab_size)
x_test_one_hot  = one_hot_encoder(x_test,  vector_size=vocab_size)

print("To have a look, x_train[12] became :", x_train_one_hot[12] )
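
To get a feel for the cost of this encoding, the sketch below estimates the size of the encoded training set without allocating it. It assumes the 25,000 training reviews of the IMDB dataset and the `vocab_size` of 5000 used above. `np.zeros` produces `float64` values by default, and the other dtypes are shown only for comparison.

```python
import numpy as np

n_reviews, vector_size = 25000, 5000   # IMDB training set size, vocab_size used above

for dtype in (np.float64, np.float32, np.uint8):
    n_bytes = n_reviews * vector_size * np.dtype(dtype).itemsize
    print(f'{np.dtype(dtype).name:8s} : {n_bytes / 1e9:5.3f} GB')
```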

## Step 6 - Build the model

In [None]:
def get_model(vector_size=10000):
    
    model = keras.Sequential()
    model.add(keras.layers.Input( shape=(vector_size,) ))
    model.add(keras.layers.Dense( 32, activation='relu'))
    model.add(keras.layers.Dense( 32, activation='relu'))
    model.add(keras.layers.Dense( 1, activation='sigmoid'))
    
    model.compile(optimizer = 'rmsprop',
                  loss      = 'binary_crossentropy',
                  metrics   = ['accuracy'])
    return model
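
As a sanity check on what `model.summary()` will report, the number of trainable parameters of such a stack of Dense layers can be computed by hand. The sketch below assumes an input of size 5000 (the `vocab_size` used above), and `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    # One weight per (input, output) pair, plus one bias per output
    return n_in * n_out + n_out

layer_sizes = [5000, 32, 32, 1]   # input size, then the three Dense layers
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts, sum(counts))
```

Almost all parameters sit in the first layer, whose size grows linearly with the vocabulary.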

## Step 7 - Train the model
### 7.1 - Get it

In [None]:
model = get_model(vector_size=vocab_size)

model.summary()

### 7.2 - Add callback

In [None]:
os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model_one_hot.h5'
savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)

### 7.3 - Train it

In [None]:
%%time

history = model.fit(x_train_one_hot,
                    y_train,
                    epochs          = epochs,
                    batch_size      = batch_size,
                    validation_data = (x_test_one_hot, y_test),
                    verbose         = fit_verbosity,
                    callbacks       = [savemodel_callback])


## Step 8 - Evaluate
### 8.1 - Training history

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

### 8.2 - Reload and evaluate best model

In [None]:
model = keras.models.load_model(f'{run_dir}/models/best_model_one_hot.h5')

# ---- Evaluate
score  = model.evaluate(x_test_one_hot, y_test, verbose=0)

print('x_test / loss      : {:5.4f}'.format(score[0]))
print('x_test / accuracy  : {:5.4f}'.format(score[1]))

values=[score[1], 1-score[1]]

# ---- Confusion matrix

y_sigmoid = model.predict(x_test_one_hot)

y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1    

confusion_matrix(y_test, y_pred, labels=range(2))
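
To help read the confusion matrix, here is a small sketch on made-up predictions. It builds the matrix by hand with the same layout as scikit-learn (rows are true classes, columns are predicted classes) and derives accuracy, precision and recall from it.

```python
import numpy as np

y_true    = np.array([0, 0, 1, 1, 1, 0])
y_sigmoid = np.array([[0.2], [0.7], [0.9], [0.4], [0.6], [0.1]])   # made-up model outputs, shape (n, 1)

# Threshold at 0.5 and flatten to a 1D vector of class labels
y_pred = (y_sigmoid >= 0.5).astype(int).ravel()

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

(tn, fp), (fn, tp) = cm
accuracy  = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(cm)
print(f'accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}')
```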

## Questions

1. What are the drawbacks of this encoding approach?
2. Does the model seem to overfit/underfit the data? On which hyperparameter(s) can you act to resolve this?

# Part 2: Word embeddings

## Step 1 - Preprocess the data

### 1.1 - Parameters
The words in the vocabulary are classified from the most frequent to the rarest.  
`review_len` is the review length  
`dense_vector_size` is the size of the generated dense vectors  
`output_dir` is where we will go to save our dictionaries. (./data is a good choice)\
`fit_verbosity` is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch

In [None]:
review_len           = 256
dense_vector_size    = 32

epochs               = 30
batch_size           = 512

output_dir           = './data'
fit_verbosity        = 1

### 1.2 - Padding

In order to be processed by an embedding neural network, all entries must have the **same length.**  
We chose a review length of **review_len**  
We will therefore complete them with a padding (of \<pad\>)  

In [None]:
x_train_pad = keras.preprocessing.sequence.pad_sequences(x_train,
                                                     value   = 0,
                                                     padding = 'post',
                                                     maxlen  = review_len)

x_test_pad  = keras.preprocessing.sequence.pad_sequences(x_test,
                                                     value   = 0 ,
                                                     padding = 'post',
                                                     maxlen  = review_len)

print(f'After padding: {x_train_pad[12]}')
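
The sketch below reproduces, on toy sequences, what this call does to short and long reviews. It assumes the default `truncating='pre'` of `pad_sequences`, which is not overridden above, and `pad_post` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_post(seq, maxlen, value=0):
    # Keep the last `maxlen` tokens (truncation at the start), then pad at the end.
    kept = list(seq)[-maxlen:]
    return np.array(kept + [value] * (maxlen - len(kept)))

print(pad_post([1, 14, 22], maxlen=5))                     # short review: zeros added at the end
print(pad_post([1, 14, 22, 16, 43, 530, 973], maxlen=5))   # long review: beginning is dropped
```

Under that assumption, reviews longer than `review_len` lose their first words, including the \<start\> code.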

## Step 2 - Build the model

More documentation about the layers used in this model :
 - [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)
 - [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D)

In [None]:
def get_model(vocab_size=10000, dense_vector_size=32, review_len=256):
    
    model = keras.Sequential()
    model.add(keras.layers.Input( shape=(review_len,) ))
    model.add(keras.layers.Embedding(input_dim    = vocab_size, 
                                     output_dim   = dense_vector_size, 
                                     input_length = review_len))
    model.add(keras.layers.GlobalAveragePooling1D())
    model.add(keras.layers.Dense(dense_vector_size, activation='relu'))
    model.add(keras.layers.Dense(1,                 activation='sigmoid'))

    model.compile(optimizer = 'adam',
                  loss      = 'binary_crossentropy',
                  metrics   = ['accuracy'])
    return model
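
What the first two layers compute can be sketched in plain NumPy: the embedding is a table lookup, and the global average pooling is a mean over the words of each review. The sizes and the random `embedding_matrix` below are toy stand-ins for the trained weights, and masking is assumed to be disabled, as in the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dense_vector_size, review_len = 20, 4, 6     # toy sizes, much smaller than in the notebook

embedding_matrix = rng.normal(size=(vocab_size, dense_vector_size))   # stands in for the trained weights
batch = np.array([[1, 5, 7, 0, 0, 0],
                  [1, 9, 3, 12, 8, 2]])                               # 2 padded reviews

embedded = embedding_matrix[batch]      # lookup: (batch, review_len, dense_vector_size)
pooled   = embedded.mean(axis=1)        # average over the words: (batch, dense_vector_size)

print(embedded.shape, pooled.shape)
```

Without masking, the vector learned for \<pad\> (index 0) takes part in the average, so heavily padded reviews are pulled towards it.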

## Step 3 - Train the model
### 3.1 - Get it

In [None]:
model = get_model(vocab_size, dense_vector_size, review_len)

model.summary()

### 3.2 - Add callback

In [None]:
os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model_embeddings.h5'
savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)

### 3.3 - Train it

In [None]:
%%time

history = model.fit(x_train_pad,
                    y_train,
                    epochs          = epochs,
                    batch_size      = batch_size,
                    validation_data = (x_test_pad, y_test),
                    verbose         = fit_verbosity,
                    callbacks       = [savemodel_callback])


## Step 4 - Evaluate
### 4.1 - Training history

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

### 4.2 - Reload and evaluate best model

In [None]:
model = keras.models.load_model(f'{run_dir}/models/best_model_embeddings.h5')

# ---- Evaluate
score  = model.evaluate(x_test_pad, y_test, verbose=0)

print('x_test / loss      : {:5.4f}'.format(score[0]))
print('x_test / accuracy  : {:5.4f}'.format(score[1]))

values=[score[1], 1-score[1]]

# ---- Confusion matrix

y_sigmoid = model.predict(x_test_pad)

y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1    

confusion_matrix(y_test, y_pred, labels=range(2))

## Questions

1. Empirically compare this model with the one from the previous part (with one-hot encoded features): which one is best in terms of performance, memory and runtime? Which one would you choose and why?
2. What are the theoretical advantages and disadvantages of each approach (in terms of performance, computation, interpretability...)?
3. Change the dimension of the embedding space. How does it influence performance? How would you choose this dimension?

## Step 5 - Have a look at the embeddings
### 5.1 - Retrieve embeddings

In [None]:
embeddings = model.layers[0].get_weights()[0]
print('Shape of embeddings : ',embeddings.shape)

### 5.2 - Build a nice dictionary

In [None]:
word_embedding = { index_word[i]:embeddings[i] for i in range(vocab_size) }

### 5.3 - Show embedding of a word :

In [None]:
word_embedding['nice']

#### A few useful functions to play with

In [None]:
# Return the L2 (Euclidean) distance between 2 words
#
def l2w(w1,w2):
    v1=word_embedding[w1]
    v2=word_embedding[w2]
    return np.linalg.norm(v2-v1)

# Show distance between 2 words 
#
def show_l2(w1,w2):
    print(f'\nL2 between [{w1}] and [{w2}] : ',l2w(w1,w2))

# Display the 15 closest words to a given word,
# searched among the most frequent words (indices 4 to 999)
#
def neighbors(w1):
    dd={}
    for i in range(4, 1000):
        w2=index_word[i]
        if w2!=w1: dd[w2]=l2w(w1,w2)      # skip the word itself
    closest=sorted(dd, key=dd.get)[:15]   # sort by increasing distance
    print(f'\nNeighbors of [{w1}] : ', closest)
    

### 5.4 - Examples

In [None]:
show_l2('nice', 'pleasant')
show_l2('nice', 'horrible')

neighbors('horrible')
neighbors('great')
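
The L2 distance depends on the length of the vectors as well as on their direction. As a possible complement, not used in this notebook, cosine similarity compares directions only. The sketch below uses made-up vectors rather than trained embeddings, and `cosine_similarity` is a helper introduced here.

```python
import numpy as np

def cosine_similarity(v1, v2):
    # 1 = same direction, 0 = orthogonal, -1 = opposite direction
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

a = np.array([1.0, 2.0, 0.0])     # made-up vectors, not trained embeddings
b = np.array([2.0, 4.0, 0.0])     # same direction as a, twice as long
c = np.array([-2.0, 1.0, 0.0])    # orthogonal to a

print('L2(a,b)     :', np.linalg.norm(b - a))
print('cosine(a,b) :', cosine_similarity(a, b))
print('cosine(a,c) :', cosine_similarity(a, c))
```

Here `a` and `b` are far apart in L2 terms although they point in exactly the same direction.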


## Questions

1. Is this method designed to force the embeddings to have semantic meaning?
2. Give a visualization of the word embeddings by doing a PCA on the embedding matrix. Plot some of the words in the space of the first two Principal Components. 
3. Give some interpretation on the PCs of these embeddings.

# Part 3: Train a GRU

## Step 1 - Parameters

In [None]:
epochs               = 10
batch_size           = 128

## Step 2 - Build the model

In [None]:
def get_model(dense_vector_size=128):
    
    model = keras.Sequential()
    model.add(keras.layers.Embedding(input_dim = vocab_size, output_dim = dense_vector_size))
    model.add(keras.layers.GRU(50))
    model.add(keras.layers.Dense(1, activation='sigmoid'))

    model.compile(optimizer = 'rmsprop',
                  loss      = 'binary_crossentropy',
                  metrics   = ['accuracy'])
    return model
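
To give an idea of what the GRU layer does with the sequence of word vectors, here is a NumPy sketch of one common formulation of the GRU update. It is not the exact Keras implementation, which differs in details such as bias handling. The weights are random, the sizes are toy values, and `gru_step` is a hypothetical helper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    z = sigmoid(p['Wz'] @ x + p['Uz'] @ h + p['bz'])            # update gate
    r = sigmoid(p['Wr'] @ x + p['Ur'] @ h + p['br'])            # reset gate
    h_new = np.tanh(p['Wh'] @ x + p['Uh'] @ (r * h) + p['bh'])  # candidate state
    return z * h + (1.0 - z) * h_new                            # blend old and candidate states

rng = np.random.default_rng(0)
input_dim, units = 4, 3     # toy sizes (32 and 50 in the model above)

p = {}
for g in 'zrh':
    p['W' + g] = rng.normal(size=(units, input_dim))
    p['U' + g] = rng.normal(size=(units, units))
    p['b' + g] = np.zeros(units)

h = np.zeros(units)
for x in rng.normal(size=(5, input_dim)):   # a sequence of 5 word vectors
    h = gru_step(x, h, p)                   # the state is updated word after word

print(h.shape)
```

Only the final state `h` is passed to the Dense layer, so it must summarize the whole review, in reading order.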

## Step 3 - Train the model
### 3.1 - Get it

In [None]:
model = get_model(32)

model.summary()

### 3.2 - Add callback

In [None]:
os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model_gru.h5'
savemodel_callback = tf.keras.callbacks.ModelCheckpoint(filepath=save_dir, verbose=0, save_best_only=True)

### 3.3 - Train it

In [None]:
%%time

history = model.fit(x_train_pad,
                    y_train,
                    epochs          = epochs,
                    batch_size      = batch_size,
                    validation_data = (x_test_pad, y_test),
                    verbose         = fit_verbosity,
                    callbacks       = [savemodel_callback])


## Step 4 - Evaluate
### 4.1 - Training history

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

### 4.2 - Reload and evaluate best model

In [None]:
model = keras.models.load_model(f'{run_dir}/models/best_model_gru.h5')

# ---- Evaluate
score  = model.evaluate(x_test_pad, y_test, verbose=0)

print('x_test / loss      : {:5.4f}'.format(score[0]))
print('x_test / accuracy  : {:5.4f}'.format(score[1]))

values=[score[1], 1-score[1]]

# ---- Confusion matrix

y_sigmoid = model.predict(x_test_pad)

y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1    

confusion_matrix(y_test, y_pred, labels=range(2))

## Questions

1. What is the main advantage of this model compared to the previous ones?
2. Empirically compare the performance of the 3 models. Which one would you choose and why?
3. Change the model to use one-hot encoding instead of word embeddings, while keeping a GRU layer. Train it and compare the performance.
4. Play with other recurrent architectures, as described in the [Documentation](https://keras.io/api/layers/recurrent_layers/) and compare their performance. In particular you can compare a bidirectional GRU to a simple GRU, and also use a Long Short-Term Memory network (LSTM).