# 01-Movie-Review

![](https://images.unsplash.com/photo-1524985069026-dd778a71c7b4?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1051&q=80)

Photo by [Erik Witsoe](https://unsplash.com/photos/GF8VvBgcJ4o)

In this exercise, you will compare a classical NLP approach (bag-of-words features) with a sequential approach (RNN) on the IMDB movie review dataset.

First, download the dataset from `tensorflow.keras.datasets.imdb`, keeping the 10000 most frequent words (if you run into memory issues, you can go down to 5000 words, as the solution below does).

In [1]:
# TODO: Load the dataset
### STRIP_START ###
from tensorflow.keras import datasets
imdb = datasets.imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)

X_train.shape, y_train.shape
### STRIP_END ###

((25000,), (25000,))

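Explore the dataset: you can use `imdb.get_word_index()` to map indices back to words and display some reviews. Be careful: in the encoded reviews, the indices `0`, `1`, `2` and `3` are reserved (padding, start of sequence, out-of-vocabulary word, unused) and do not correspond to a word, so every word index is shifted by 3 relative to `imdb.get_word_index()`.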

In [2]:
# TODO: Explore the data, display some sentences
### STRIP_START ###
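# Map each word to its rank (1 = most frequent word)
word_index = imdb.get_word_index()

# Invert the mapping: rank -> word
index_to_word = {index: word for word, index in word_index.items()}

# Indices 0 to 3 are reserved, so real words start at 4 and are shifted by 3
output = [index_to_word[w - 3] for w in X_train[0] if w > 3]

print(' '.join(output))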
### STRIP_END ###

this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for and would recommend it to everyone to watch and the fly was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also to the two little that played the of norman and paul they were just brilliant children are often left out of the list i think because the stars that play them all grown up are such a big for the whole film but these children are amazing and should be for what they have done don't you think the whole story was so lovely because it 
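The index shift used when decoding can be checked on a toy example. The sketch below assumes the default Keras encoding (`index_from=3`, with 0 for padding, 1 for start of sequence and 2 for out-of-vocabulary words); `toy_word_index` and `decode_review` are hypothetical names for illustration, not part of the Keras API.

```python
# Hypothetical miniature vocabulary, in the same format as imdb.get_word_index():
# word -> rank, where rank 1 is the most frequent word.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}


def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer tokens back into a string.

    Assumes the default Keras encoding: tokens 0..index_from are reserved
    and every word rank is shifted by `index_from`.
    """
    index_to_word = {rank: word for word, rank in word_index.items()}
    words = []
    for token in encoded:
        if token <= index_from:
            continue  # padding, start marker, out-of-vocabulary marker or unused index
        words.append(index_to_word.get(token - index_from, "?"))
    return " ".join(words)


# 1 = start marker, 2 = out-of-vocabulary word, the other tokens are shifted ranks
encoded_review = [1, 4, 5, 6, 2, 7]
print(decode_review(encoded_review, toy_word_index))  # the film was brilliant
```

Out-of-vocabulary words are silently dropped here, which is why the decoded review above has gaps such as "released for and would recommend".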

## Classical NLP

Make a prediction using classical NLP tools: build BOW or TF-IDF features, then train a classification model on them. Choose a random forest or gradient boosting, and perform a grid search for hyperparameter optimization.

*Warning: you are used to manipulating words, but here they are already encoded as integers.*

In [3]:
### TODO: Perform classification using NLP tools
### STRIP_START ###
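from sklearn.feature_extraction.text import TfidfVectorizer

# Reviews are already lists of integer tokens, so the analyzer returns them unchanged
tfidf = TfidfVectorizer(analyzer=(lambda x: x))

tfidf.fit(X_train)

# Dense conversion is memory hungry; random forests also accept the sparse matrix
X_train = tfidf.transform(X_train).toarray()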
### STRIP_END ###
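To see what BOW and TF-IDF mean when the tokens are integers, here is a minimal NumPy sketch on made-up reviews. The helper names `bag_of_words` and `tfidf_weights` are hypothetical. The weighting follows the smoothed idf formula that scikit-learn documents as its default, followed by L2 normalization, but this is a simplified illustration, not the library's implementation.

```python
import numpy as np


def bag_of_words(sequences, vocab_size):
    """Count how often each integer token occurs in each sequence."""
    counts = np.zeros((len(sequences), vocab_size), dtype=np.float64)
    for row, seq in enumerate(sequences):
        tokens = np.asarray(seq, dtype=np.int64)
        # Tokens >= vocab_size (if any) are dropped by the slice
        counts[row] = np.bincount(tokens, minlength=vocab_size)[:vocab_size]
    return counts


def tfidf_weights(counts):
    """Weight counts by smoothed idf, then L2-normalize each row."""
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)          # number of reviews containing each token
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # avoid dividing by zero for empty reviews
    return weighted / norms


# Made-up reviews, already encoded as integers
toy_reviews = [[4, 5, 5, 6], [4, 7], [5, 5, 5]]
bow = bag_of_words(toy_reviews, vocab_size=8)
features = tfidf_weights(bow)
print(bow[0])          # raw counts for the first review
print(features.shape)  # one row per review, one column per token
```

Each column corresponds to one integer token, so the word order of the review is lost; this is the main difference with the sequential approach.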

In [4]:
### TODO: Perform classification using NLP tools
### STRIP_START ###
# Only the train dataset is used here for memory reasons (optional step):
# the last 5000 training reviews serve as the test set
X_test = X_train[20000:]
y_test = y_train[20000:]
X_train = X_train[:20000]
y_train = y_train[:20000]
### STRIP_END ###

In [5]:
### TODO: Perform classification using NLP tools
### STRIP_START ###
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(X_train, y_train)
### STRIP_END ###



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [6]:
### TODO: Perform classification using NLP tools
### STRIP_START ###
print("accuracy on train:", rf.score(X_train, y_train))
print("accuracy on test:", rf.score(X_test, y_test))
### STRIP_END ###

accuracy on train: 0.99275
accuracy on test: 0.7706


In [7]:
### TODO: Perform classification using NLP tools
### STRIP_START ###
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 30],
              'max_depth': [None, 5, 10]}

grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)

grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("accuracy on train:", grid.score(X_train, y_train))
print("accuracy on test:", grid.score(X_test, y_test))
### STRIP_END ###

best params: {'max_depth': None, 'n_estimators': 30}
accuracy on train: 1.0
accuracy on test: 0.8142
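To get a feel for the cost of a grid search, the sketch below enumerates the candidates of the grid used above with plain Python. It only counts model fits and assumes the scikit-learn default `refit=True`, which retrains the best candidate once on the whole training set; the variable names are illustrative.

```python
from itertools import product

param_grid = {'n_estimators': [10, 30],
              'max_depth': [None, 5, 10]}
cv = 3  # number of cross-validation folds

# Every combination of the listed values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_cv_fits = len(candidates) * cv   # each candidate is trained once per fold
n_total_fits = n_cv_fits + 1       # plus the final refit of the best candidate (assumed refit=True)

for candidate in candidates:
    print(candidate)
print("candidates:", len(candidates), "fits:", n_total_fits)
```

Adding a third hyperparameter with three values would triple the number of fits, so keep the grid small on this dataset.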


What accuracy did you reach? Let's see if we can do better with an RNN.

## RNN

Since you will use sequences, you will have to choose a sequence length.

First, you can check the min, max and average length of the sequences.

In [50]:
# TODO: compute basic descriptive statistics of the length of sequences
### STRIP_START ###
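import numpy as np

# Reload the raw integer sequences (X_train was overwritten by the TF-IDF matrix)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)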

lengths = [len(seq) for seq in X_train]
print('min length:', np.min(lengths))
print('max length:', np.max(lengths))
print('mean length:', np.mean(lengths))
print('median length:', np.median(lengths))

### STRIP_END ###

min length: 11
max length: 2494
mean length: 238.71364
median length: 178.0
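One way to turn these statistics into a sequence length is to check which fraction of reviews would fit entirely within a candidate length. The sketch below uses made-up lengths and a hypothetical helper `coverage`; on the real data you would pass the `lengths` list computed above.

```python
import numpy as np


def coverage(lengths, maxlen):
    """Fraction of sequences that fit entirely within `maxlen` tokens."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= maxlen))


# Made-up review lengths, for illustration only
toy_lengths = [11, 90, 120, 178, 200, 260, 400, 2494]

for maxlen in (128, 256, 512):
    print(maxlen, coverage(toy_lengths, maxlen))
```

A longer `maxlen` truncates fewer reviews but makes each training step slower, since the RNN processes one token per time step.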


Now pad the sequences: choose a length related to the mean or median length.

In [51]:
# TODO: Make the padding
### STRIP_START ###
from tensorflow.keras.preprocessing import sequence

X_train = sequence.pad_sequences(X_train,
                                 value=0,
                                 padding='post', # to add zeros at the end
                                 maxlen=128) # the length we want

# Same split as for the classical model, so that performances are compared on the same data
# (not necessary if you have enough memory to use the real test set)
X_test = X_train[20000:]
y_test = y_train[20000:]
X_train = X_train[:20000]
y_train = y_train[:20000]

### STRIP_END ###
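To make the effect of padding concrete, here is a simplified NumPy stand-in for `pad_sequences` with `padding='post'`. It assumes the Keras default `truncating='pre'`, meaning that a review longer than `maxlen` keeps its last `maxlen` tokens. The helper `pad_post` is hypothetical and only meant for illustration.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate long sequences from the start."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]        # keep only the last `maxlen` tokens
        padded[row, :len(kept)] = kept    # the rest of the row stays at `value`
    return padded


toy_sequences = [[1, 2, 3, 4, 5], [6, 7]]
print(pad_post(toy_sequences, maxlen=3))
# [[3 4 5]
#  [6 7 0]]
```

With the call used above, a long review therefore loses its beginning; pass `truncating='post'` to `pad_sequences` if you prefer to keep the first tokens instead.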

Now build an RNN, for example with two layers of 32 units. Do not forget the embedding layer first, and a final one-unit layer with sigmoid activation for binary classification. Warning: the training might take several minutes! You can choose fewer layers and/or units.

In [54]:
# TODO: Build your model
### STRIP_START ###
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding, Dropout


def my_RNN():

    model = Sequential()
    model.add(Embedding(input_dim=5000, output_dim=32, input_length=128))
    model.add(SimpleRNN(units=16, return_sequences=False))
    model.add(Dense(units=1, activation='sigmoid'))

    return model
### STRIP_END ###
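To see what the recurrent layer computes, the sketch below runs the forward pass of a basic tanh RNN in NumPy, with the same sizes as the model above (128 time steps, 32-dimensional embeddings, 16 units). The weights are random and the function name `simple_rnn_forward` is hypothetical; it illustrates the recurrence, not the Keras implementation.

```python
import numpy as np


def simple_rnn_forward(embedded, W, U, b):
    """Return the last hidden state of a tanh RNN.

    embedded: (timesteps, input_dim) sequence of embedded tokens
    W: (input_dim, units) input weights
    U: (units, units) recurrent weights
    b: (units,) bias
    """
    h = np.zeros(U.shape[0])                  # initial hidden state
    for x_t in embedded:                      # one token per time step
        h = np.tanh(x_t @ W + h @ U + b)      # new state depends on input and previous state
    return h


rng = np.random.default_rng(0)
timesteps, input_dim, units = 128, 32, 16

embedded = rng.normal(size=(timesteps, input_dim))   # stands in for the Embedding output
W = rng.normal(scale=0.1, size=(input_dim, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

h_last = simple_rnn_forward(embedded, W, U, b)
print(h_last.shape)  # (16,)

# Trainable parameters of such a layer: W, U and b
n_rnn_params = input_dim * units + units * units + units
print(n_rnn_params)  # 784
```

With `return_sequences=False`, only this last hidden state is passed to the final `Dense` layer.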

Finally compile and train the model on the training data.

In [56]:
# TODO: Compile and fit your model
### STRIP_START ###
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras import optimizers

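# Alternative optimizer to experiment with (recent TensorFlow versions use
# `learning_rate` instead of `lr`). It is not used below: the model is compiled with 'rmsprop'.
optimizer = optimizers.Adam(learning_rate=0.005)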

model = my_RNN()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Define the callbacks: stop when val_loss has not improved for 3 epochs, and log to TensorBoard
callbacks = [EarlyStopping(monitor='val_loss', patience=3),
             TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)]

model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=30, batch_size=64, callbacks=callbacks)

### STRIP_END ###

Train on 20000 samples, validate on 5000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30


<tensorflow.python.keras.callbacks.History at 0x7f7a975b0e48>

You can have a look at TensorBoard as usual (the logs are written to `./Graph`).
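Training stopped well before the 30 requested epochs because of the `EarlyStopping` callback. Its patience logic can be sketched in plain Python as follows; the validation losses are made up for illustration, and `early_stopping_epoch` is a hypothetical helper that assumes `min_delta=0`.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 3
toy_val_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.48, 0.44]
print(early_stopping_epoch(toy_val_losses, patience=3))  # 6
```

Note that the model keeps the weights of the last epoch, not of the best one, unless you pass `restore_best_weights=True` to `EarlyStopping`.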

As usual, compute the accuracy.

In [57]:
# TODO: Compute the accuracy of your model
### STRIP_START ###
# model.evaluate returns [loss, accuracy], so index 1 is the accuracy

print('accuracy on train with NN:', model.evaluate(X_train, y_train)[1])
print('accuracy on test with NN:', model.evaluate(X_test, y_test)[1])
### STRIP_END ###

accuracy on train with NN: 0.9556
accuracy on test with NN: 0.8522
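The accuracy reported by `model.evaluate` comes from thresholding the sigmoid output. As a minimal sketch, assuming the usual 0.5 threshold, the same computation on made-up probabilities looks like this (`binary_accuracy` is a hypothetical helper name):

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


# Made-up labels and sigmoid outputs
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6]
print(binary_accuracy(y_true, y_prob))  # 0.6
```

On the real model, `y_prob` would be `model.predict(X_test).ravel()`.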


You might want to improve your results by tuning the hyperparameters: change the number of layers and units, add dropout, or try a different optimizer, mini-batch size or data preprocessing (for example the sequence length).