## General instructions
- Please answer the questions by inserting your text or code into new cells of this notebook.
- The test aims to evaluate several aspects:
  - your knowledge of data science
  - your ability to write clean code and produce visualizations that respect standards
  - your ability to find and use functions from third-party libraries
  - your ability to find answers in all the resources at your disposal.
  
This test is adapted from this tutorial: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
  

## Movie review classification

In this notebook, you will learn to classify movie reviews as positive or negative, based on the text content of the reviews.

### About the dataset

The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

The notebook is organized as follows:
1. **Loading the IMDB dataset**
<br>

2. **Preparing the data**
<br>

3. **Building the network**
<br>

4. **Evaluating the network**
<br>

5. **Go further**

<br>

We first import all the packages in a single place, then load the dataset.

### 1. Loading the IMDB dataset

In [1]:
import numpy as np # linear algebra

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam, RMSprop

import matplotlib.pyplot as plt # data visualization
plt.style.use('ggplot')
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
np.random.seed(7)

Using TensorFlow backend.


In [2]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

Passing the argument **num_words=10000** means that we only keep the 10,000 most frequently occurring words in the training data. Rarer words are discarded, which keeps the vocabulary at a manageable size.

Note that **X_train** and **X_test** are lists of reviews. Each review is a list of word indices (encoding a sequence of words).
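
As a rough illustration of what the vocabulary cap does, the sketch below mimics the filtering on a toy review. The helper `cap_vocabulary` and the toy indices are hypothetical. The sketch assumes the Keras default of replacing out-of-vocabulary words with the index 2 (`oov_char`), and it is not the library's actual implementation.

```python
OOV_INDEX = 2  # assumption: the default `oov_char` of imdb.load_data


def cap_vocabulary(review, num_words, oov_index=OOV_INDEX):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [index if index < num_words else oov_index for index in review]


# Toy review: 1 marks the start of the sequence, the rest are made-up word indices.
toy_review = [1, 14, 22, 12045, 43, 530, 10000]
capped = cap_vocabulary(toy_review, num_words=10000)

print(capped)       # [1, 14, 22, 2, 43, 530, 2]
print(max(capped))  # 530
```

Under this assumption, every index left in a review is strictly smaller than `num_words`.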

In [None]:
# Display the first review and the corresponding label.

In [None]:
# Check that no word index exceeds 10 000

#### Quick look at the first review

In [3]:
word_index = imdb.get_word_index()  # dictionary mapping words to integer indices
reverse_word_index = {value: key for key, value in word_index.items()}  # reversed: maps integer indices to words
# Decode the review. Indices are offset by 3 because 0, 1 and 2 are reserved
# for "padding", "start of sequence" and "unknown".
decoded_review = ' '.join([reverse_word_index.get(i - 3, '') for i in X_train[0]])
decoded_review

" this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert  is an amazing actor and now the same being director  father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for  and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also  to the two little boy's that played the  of norman and paul they were just brilliant children are often left out of the  list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the

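The `i - 3` offset used when decoding can be checked on a toy example. This is a minimal sketch with a made-up three-word vocabulary (`toy_word_index` and `decode` are hypothetical names). It assumes the Keras convention that indices 0, 1 and 2 are reserved for padding, start of sequence and unknown words, and it uses `'?'` rather than an empty string as the placeholder so that the reserved positions stay visible.

```python
# Hypothetical vocabulary in the same format as imdb.get_word_index():
# words mapped to their frequency rank, starting at 1.
toy_word_index = {"the": 1, "film": 2, "great": 3}
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded_review, reverse_index, offset=3, placeholder="?"):
    """Map encoded indices back to words, undoing the reserved-index offset."""
    return " ".join(reverse_index.get(i - offset, placeholder) for i in encoded_review)


# 1 = start of sequence, 2 = unknown word, 4/5/6 = ranks 1/2/3 shifted by 3.
toy_review = [1, 4, 5, 2, 6]
print(decode(toy_review, reverse_toy_index))  # ? the film ? great
```

With an empty string as the placeholder, as in the cell above, the reserved positions show up as doubled spaces in the decoded text.
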
### 2. Preparing the data

Remember that each review is a list of integers, and lists of varying length cannot be fed directly into a neural network. We have to turn our lists into tensors.

We can either use an **Embedding** layer to handle that or apply **one-hot encoding**.
One-hot encoding would mean, for instance, turning the sequence [3, 5] into a 10,000-dimensional vector that is all 0s except at indices 3 and 5, which are 1s.
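
The [3, 5] example can be reproduced on a small scale. The sketch below uses a dimension of 10 instead of 10,000 so that the whole vector fits on one line; `multi_hot` is a hypothetical helper written only for this illustration.

```python
import numpy as np


def multi_hot(sequence, dimension):
    """Return a vector of zeros with ones at the positions listed in `sequence`."""
    vector = np.zeros(dimension)
    vector[sequence] = 1.0  # fancy indexing sets all listed positions at once
    return vector


encoded = multi_hot([3, 5], dimension=10)
print(encoded)  # [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]

# Word order and repetitions are lost: [5, 3, 3] gives the same vector.
print(np.array_equal(encoded, multi_hot([5, 3, 3], dimension=10)))  # True
```

This encoding only records which words occur in a review, not how often or in which order.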

The function below encodes the integer sequences into a binary matrix.

In [4]:
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

X_train = vectorize_sequences(X_train)
X_test = vectorize_sequences(X_test)

In [5]:
# We should also vectorize our labels
y_train = np.asarray(y_train).astype('float32')
y_test = np.asarray(y_test).astype('float32')

### 3. Building the network

In [6]:
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(10000,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
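
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a Dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The sketch below does this arithmetic. `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input dimension followed by the number of units of each Dense layer.
layer_sizes = [10000, 16, 16, 1]
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [160016, 272, 17]
print(sum(counts))  # 160305
```

Almost all parameters sit in the first layer, because of the 10,000-dimensional input.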

#### Questions

1. How many layers does this network have?
<br>

2. Explain the arguments passed to each Dense layer.
<br>

3. What is a sigmoid? Explain why it is well suited for the last layer.
<br>

4. What is the role of each layer?
<br>

5. Write the cross-entropy equation and explain in a few sentences why it is indeed a loss (i.e. it decreases as the predictions for the batch samples get closer to the true labels).
<br>

6. The Adam optimizer is a variant of batch stochastic gradient descent in which the learning rate is adjusted according to the dynamics of the learning process. Assuming we used the batch stochastic gradient descent (BSGD) algorithm to optimize the loss, please explain in a few sentences the underlying principle of BSGD.

In [7]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [8]:
x_val = X_train[:10000]
partial_x_train = X_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

The following cell runs the training (please be patient).

While waiting for training to finish:
1. Explain the meaning of the two parameters (epochs, batch_size) and their potential impact on model training.
2. Explain the purpose of the validation_data argument.
3. Is it relevant to use the test set (X_test, y_test) as the value of the validation_data argument?

In [None]:
history = model.fit(partial_x_train, partial_y_train, epochs=10, batch_size=128, validation_data=(x_val, y_val))

### 4. Evaluating the network 

model.fit() returns a **History** object. Its **history** attribute is a dictionary containing data about everything that happened during training.
<br> Let’s use Matplotlib to plot the training and validation loss side by side.

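A minimal sketch of such a plot is given below, drawing both curves on the same axes. It assumes the dictionary has the keys `'loss'` and `'val_loss'`, which Keras records when validation_data is passed to fit(). `plot_loss` is a hypothetical helper, and the numbers in `example_history` are made-up placeholders, not results from this notebook.

```python
import matplotlib.pyplot as plt


def plot_loss(history_dict):
    """Plot training and validation loss per epoch and return the axes."""
    epochs = range(1, len(history_dict["loss"]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history_dict["loss"], "o-", label="Training loss")
    ax.plot(epochs, history_dict["val_loss"], "s--", label="Validation loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.set_title("Training and validation loss")
    ax.legend()
    return ax


# Placeholder values for illustration only (NOT results from this notebook).
example_history = {"loss": [0.6, 0.4, 0.3], "val_loss": [0.5, 0.4, 0.45]}
ax = plot_loss(example_history)
```

In the notebook itself, the call would be `plot_loss(history.history)` once training has finished.
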
In [32]:
score = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (score[1]*100))

Accuracy: 84.93%


### 5. Go further

#### a. What is the importance of having sufficiently large intermediate layers?
Let’s see what happens when you introduce an information bottleneck by using intermediate layers that are significantly smaller than 16-dimensional, for example 4-dimensional.

#### b. Suggest ways to improve this performance.

#### c. Regularization techniques

The goal of this section is to evaluate your knowledge of regularization.

1. Explain underfitting and overfitting in a few sentences.
2. Explain the following regularization techniques in a few sentences:
 - L1 regularization
 - L2 regularization (how does its effect differ from that of L1 regularization?)
 - dropout (principle of dropout and impact on parameters)
3. Are you aware of other ways to regularize deep neural networks?