# IMDB movie reviews

We'll see how to train neural networks to do sentiment analysis on movie reviews. We'll train our own embeddings and compare MLP, CNN, LSTM and GRU models.

In [1]:
from __future__ import division, print_function
%matplotlib inline
from utils import *

Using TensorFlow backend.


In [2]:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Let TensorFlow claim at most 90% of the GPU memory for this process
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
set_session(tf.Session(config=config))

# NLP: Sentiment Analysis: Is the review positive or negative?
We are doing sentiment analysis on the labelled IMDB review data set. Each review is labelled either positive or negative, and is encoded as a sequence of word ids.

In [3]:
from keras.datasets import imdb

In [4]:
# download data
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
print('Shape of X _train: ', np.shape(x_train))
print('Shape of X _test: ', np.shape(x_test))

Shape of X _train:  (25000,)
Shape of X _test:  (25000,)


#### Explore data

In [5]:
def get_text(data_ids, len_data):
    # Word ids are shifted by 3 (index_from=3) because
    # 0, 1 and 2 are reserved for the special tokens below
    word_to_id = imdb.get_word_index()
    word_to_id = {k: (v + 3) for k, v in word_to_id.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    id_to_word = {value: key for key, value in word_to_id.items()}

    # Decode the first len_data reviews back into text
    text = []
    for i in range(len_data):
        text.append(' '.join(id_to_word[idx] for idx in data_ids[i]))
    return text
        
def get_label_txt(data):
    values = []
    for idx in range(len(data)):
        if data[idx] == [1]:
            values.append('Positive')
        else:
            values.append('Negative')
    return values
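
To make the id convention concrete, here is a minimal sketch of the same decoding on a toy vocabulary. The three-word index is hypothetical (it is not the real IMDB index); only the shift by 3 and the three reserved ids follow the `load_data` arguments used above.

```python
# Toy illustration of the id convention implied by index_from=3.
# The three-word index below is hypothetical, not the real IMDB index.
toy_word_index = {"the": 1, "film": 2, "great": 3}

INDEX_FROM = 3  # ids 0, 1 and 2 are reserved for the special tokens
word_to_id = {word: rank + INDEX_FROM for word, rank in toy_word_index.items()}
word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
id_to_word = {idx: word for word, idx in word_to_id.items()}

# An encoded review: start marker, three known words, one unknown word
encoded_review = [1, 4, 5, 6, 2]
decoded_review = " ".join(id_to_word[idx] for idx in encoded_review)
print(decoded_review)  # <START> the film great <UNK>
```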

In [6]:
text = get_text(x_train, 10)
label = get_label_txt(y_train[:10])

for i in range(3):
    print('*******************************************************************************')
    print('TEXT n°', i + 1, ' -- LABEL:',label[i])
    print('-------------------------------------------------------------------------------')
    print(text[i])
    print('*******************************************************************************')

*******************************************************************************
TEXT n° 1  -- LABEL: Positive
-------------------------------------------------------------------------------
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just bri

# Preparing Data
We'll keep only the 5,000 most frequent word ids: every id at or above `vocab_size - 1` is replaced by the single id `vocab_size - 1`. We'll then learn embeddings for this reduced vocabulary.

In [7]:
vocab_size = 5000
trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
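
The capping step can also be written with `np.minimum`. The sketch below uses a made-up review (the ids are hypothetical) and checks that it gives the same result as the list comprehension above. Because the IMDB ids are ranked by frequency, a larger id means a rarer word.

```python
import numpy as np

vocab_size = 5000
# Hypothetical review: the last three ids fall at or beyond the cap
toy_review = np.array([1, 14, 4998, 4999, 5000, 88584])

# Vectorized cap: every id >= vocab_size - 1 collapses to vocab_size - 1
capped = np.minimum(toy_review, vocab_size - 1)

# Same operation written as in the cell above
capped_loop = np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in toy_review])

print(capped)  # [   1   14 4998 4999 4999 4999]
```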

In [8]:
lens = np.array(list(map(len, trn)))
print('Maximum text length:', lens.max(),' -- Minimum length:', lens.min(), '-- Mean length of text:',lens.mean())

Maximum text length: 2494  -- Minimum length: 11 -- Mean length of text: 238.71364


In [9]:
# we'll pad all inputs to obtain homogeneous inputs of dim 500
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen = seq_len,value=0)
test = sequence.pad_sequences(test, maxlen = seq_len,value=0)

In [10]:
trn.shape

(25000, 500)
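
With its default arguments, `pad_sequences` pads at the front and, for sequences longer than `maxlen`, also truncates at the front (it keeps the last tokens). The NumPy sketch below mimics that default behaviour on two toy sequences; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Front-pad or front-truncate each sequence to exactly maxlen ids (maxlen >= 1)."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        kept = list(seq)[-maxlen:]       # keep only the last maxlen tokens
        if kept:                         # guard against empty sequences
            out[row, -len(kept):] = kept
    return out

toy = [[1, 7, 9], [1, 3, 4, 5, 6, 8]]
padded = pad_pre(toy, maxlen=4)
print(padded)
# [[0 1 7 9]
#  [4 5 6 8]]
```

One consequence is that reviews longer than 500 tokens lose their beginning, including the `<START>` marker.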

## Models - Training our own embeddings
### It's entirely feasible; it just takes a lot of patience!

# A. Using MLP to classify

In [11]:
def MLP():
    model = Sequential()
    model.add(Embedding(vocab_size,32,input_length=seq_len))
    
    model.add(Flatten())
    model.add(Dense(100,activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    
    model.add(Dense(100,activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    
    model.add(Dense(1,activation='sigmoid'))

    model.compile(loss='binary_crossentropy',optimizer=Adam(),metrics=['accuracy'])
    print(model.summary())
    
    return model

model = MLP()  

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               1600100   
_________________________________________________________________
batch_normalization_1 (Batch (None, 100)               400       
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
__________
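
As a sanity check, the parameter counts in the summary above can be reproduced by hand. This is plain arithmetic on the layer sizes used in `MLP()`; the variable names are only for illustration.

```python
vocab_size, embed_dim, seq_len, hidden = 5000, 32, 500, 100

embedding_params = vocab_size * embed_dim          # one 32-d vector per word id
flat_width = seq_len * embed_dim                   # Flatten: 500 steps x 32 dims
dense_1_params = flat_width * hidden + hidden      # weights + biases
batch_norm_params = 4 * hidden                     # gamma, beta, moving mean, moving variance
dense_2_params = hidden * hidden + hidden

print(embedding_params, flat_width, dense_1_params, batch_norm_params, dense_2_params)
# 160000 16000 1600100 400 10100
```

Almost all of the weights sit in the first dense layer, because flattening turns every (position, embedding dimension) pair into a separate input feature.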

#### Fitting

In [12]:
model.fit(trn, y_train, validation_data=(test, y_test), epochs=5, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f7e10ce25f8>

#### Evaluating

In [13]:
scores = model.evaluate(test,y_test,verbose=0)
print('loss: ', scores[0],'- accuracy: ', scores[1])

loss:  0.570253434144 - accuracy:  0.84924


# B. Using CNN
A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.

In [14]:
def CNN():
    model = Sequential()
    model.add(Embedding(vocab_size, 32, input_length=seq_len))
    
    model.add(Dropout(0.2))
    model.add(Conv1D(64, 5, padding='same', activation='relu'))
    model.add(Dropout(0.2))
    model.add(MaxPooling1D())
    
    model.add(Flatten())
    
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(loss='binary_crossentropy',optimizer=Adam(),metrics=['accuracy'])
    print(model.summary())
    
    return model

model = CNN()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
dropout_3 (Dropout)          (None, 500, 32)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 64)           10304     
_________________________________________________________________
dropout_4 (Dropout)          (None, 500, 64)           0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 64)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 100)               1600100   
__________
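
To see where the `(None, 500, 64)` and `(None, 250, 64)` shapes and the 10,304 parameters come from, here is a rough NumPy sketch of a single 1D convolution with `'same'` padding followed by max pooling of size 2 (the `MaxPooling1D` default). The random input and weights are placeholders, and `conv1d_same` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def conv1d_same(x, kernel, bias):
    """x: (steps, in_channels); kernel: (width, in_channels, filters)."""
    width = kernel.shape[0]
    left = (width - 1) // 2
    right = width - 1 - left
    # Zero-pad along the time axis so the output keeps the same number of steps
    padded = np.pad(x, ((left, right), (0, 0)), mode="constant")
    steps = x.shape[0]
    out = np.empty((steps, kernel.shape[2]))
    for t in range(steps):
        window = padded[t:t + width]  # (width, in_channels)
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)       # ReLU

rng = np.random.RandomState(0)
x = rng.randn(500, 32)                # one embedded review: 500 steps, 32 dims
kernel = rng.randn(5, 32, 64) * 0.01  # filter width 5, 64 filters
bias = np.zeros(64)

features = conv1d_same(x, kernel, bias)
pooled = features.reshape(250, 2, 64).max(axis=1)  # max over pairs of steps
n_params = kernel.size + bias.size

print(features.shape, pooled.shape, n_params)  # (500, 64) (250, 64) 10304
```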

#### Fitting

In [15]:
model.fit(trn, y_train, validation_data=(test, y_test), epochs=6, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f7de974ff28>

#### Evaluating

In [16]:
scores = model.evaluate(test,y_test,verbose=0)
print('loss: ', scores[0],'- accuracy: ', scores[1])

loss:  0.351344470656 - accuracy:  0.87908


# C. Using multi-size CNN

In [17]:
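from keras.layers import concatenate

# Three parallel convolutions (filter widths 3, 4 and 5) over the embedded sequence.
# The input is one embedded review: seq_len steps of 32 dimensions.
graph_in = Input((seq_len, 32))
convs = []
for fsz in range(3, 6):
    x = Conv1D(64, fsz, padding='same', activation="relu")(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
# Concatenate the flattened branches into a single feature vector
out = concatenate(convs)
graph = Model(graph_in, out)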



In [18]:
def multisize_CNN():
    model = Sequential()
    model.add(Embedding(vocab_size, 32,input_length=seq_len))
    model.add(Dropout (0.2))
    model.add(graph)
    model.add(Dropout(0.5))
    model.add(Dense(100, activation="relu"))
    model.add(Dropout(0.7))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(loss='binary_crossentropy',optimizer=Adam(),metrics=['accuracy'])
    print(model.summary())
    
    return model

model = multisize_CNN()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
dropout_6 (Dropout)          (None, 500, 32)           0         
_________________________________________________________________
model_1 (Model)              multiple                  24768     
_________________________________________________________________
dropout_7 (Dropout)          (None, 48000)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 100)               4800100   
_________________________________________________________________
dropout_8 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 101       
Total para
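
The numbers in this summary can be checked by hand: each branch is a convolution of width 3, 4 or 5 over 32-dimensional embeddings with 64 filters, pooled from 500 to 250 steps and flattened. The sketch below is plain arithmetic, assuming the default pooling size of 2.

```python
embed_dim, filters, seq_len, hidden = 32, 64, 500, 100
widths = range(3, 6)                  # filter widths 3, 4 and 5

# Conv1D parameters per branch: width x in_channels x filters, plus one bias per filter
branch_params = [w * embed_dim * filters + filters for w in widths]
graph_params = sum(branch_params)

branch_width = (seq_len // 2) * filters       # 250 pooled steps x 64 filters, flattened
concat_width = branch_width * len(widths)     # three branches concatenated
dense_params = concat_width * hidden + hidden

print(branch_params, graph_params)   # [6208, 8256, 10304] 24768
print(concat_width, dense_params)    # 48000 4800100
```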

#### Fitting

In [19]:
model.fit(trn, y_train, validation_data=(test, y_test), epochs=5, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f7de8672fd0>

#### Evaluating

In [20]:
scores = model.evaluate(test,y_test,verbose=0)
print('loss: ', scores[0],'- accuracy: ', scores[1])

loss:  0.311121658216 - accuracy:  0.88412


## Visual Checking

In [33]:
predictions = model.predict(test[:10],1)
predictions = np.round(predictions).astype('int')
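
For sigmoid outputs in [0, 1], rounding amounts to thresholding at 0.5. A small sketch with made-up probabilities (not actual model outputs) shows this, including the tie case: `np.round` rounds halves to the nearest even integer, so a probability of exactly 0.5 becomes 0.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like model.predict
probs = np.array([[0.03], [0.5], [0.51], [0.97]])

rounded = np.round(probs).astype("int")       # what the cell above does
thresholded = (probs > 0.5).astype("int")     # explicit threshold

print(rounded.ravel())       # [0 0 1 1]
print(thresholded.ravel())   # [0 0 1 1]
```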

In [34]:
text = get_text(x_test, 10)
preds = get_label_txt(predictions[:10])
true = get_label_txt(y_test[0:10])

In [40]:
for i in range(10):
    print('*******************************************************************************')
    print('TEXT n°', i + 1, ' -- TRUE label:', true[i], ' -- PREDICTED label:', preds[i])
    print('-------------------------------------------------------------------------------')
    print(text[i])
    print('*******************************************************************************')

*******************************************************************************
TEXT n° 1  -- TRUE label: Positive  -- PREDICTED label: Positive
-------------------------------------------------------------------------------
<START> how his charter evolved as both man and ape was outstanding not to mention the scenery of the film christopher lambert was astonishing as lord of greystoke christopher is the soul to this masterpiece i became so with his performance i could feel my heart pounding the of the movie still moves me to this day his portrayal of john was oscar worthy as he should have been nominated for it
*******************************************************************************
*******************************************************************************
TEXT n° 2  -- TRUE label: Positive  -- PREDICTED label: Positive
-------------------------------------------------------------------------------
<START> bride of chucky starts late one night as officer bob bailey vince s

# D. Using LSTM

In [19]:
def RNN_LSTM():
    model = Sequential()
    model.add(Embedding(vocab_size,5,input_length=seq_len))
    model.add(LSTM(50))
    model.add(Dropout(0.25))
    model.add(Dense(1,activation='sigmoid'))

    model.compile(loss='binary_crossentropy',optimizer=Adam(),metrics=['accuracy'])
    print(model.summary())
    
    return model

model = RNN_LSTM()       

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 5)            25000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                11200     
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 51        
Total params: 36,251.0
Trainable params: 36,251.0
Non-trainable params: 0.0
_________________________________________________________________
None
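
The 11,200 LSTM parameters follow from the standard LSTM layout: four blocks (input gate, forget gate, output gate, candidate cell), each with an input kernel, a recurrent kernel and a bias. The helper below is a hypothetical function written for this check.

```python
def lstm_param_count(input_dim, units):
    # 4 blocks, each with: input kernel (input_dim x units),
    # recurrent kernel (units x units) and bias (units)
    return 4 * (units * (input_dim + units) + units)

embedding_params = 5000 * 5           # vocab_size x embedding size
lstm_params = lstm_param_count(input_dim=5, units=50)
dense_params = 50 * 1 + 1             # weights + bias of the sigmoid output

total = embedding_params + lstm_params + dense_params
print(lstm_params, total)  # 11200 36251
```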


#### Fitting

In [20]:
model.fit(trn, y_train, validation_data=(test, y_test), epochs=4, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f7f22b9fef0>

#### Evaluating

In [21]:
scores = model.evaluate(test,y_test,verbose=0)
print('loss: ', scores[0],'- accuracy: ', scores[1])

loss:  0.310951983457 - accuracy:  0.8774


## Visual Checking

In [22]:
predictions = model.predict(test[:10],1)
predictions = np.round(predictions).astype('int')

In [23]:
text = get_text(x_test, 10)
preds = get_label_txt(predictions[:10])
true = get_label_txt(y_test[:10])

In [24]:
for i in range(10):
    print('*******************************************************************************')
    print('TEXT n°', i + 1, ' -- TRUE label:',  true[i], ' -- PREDICTED label:', preds[i])
    print('-------------------------------------------------------------------------------')
    print(text[i])
    print('*******************************************************************************')

*******************************************************************************
TEXT n° 1  -- TRUE label: Positive  -- PREDICTED label: Positive
-------------------------------------------------------------------------------
<START> how his charter evolved as both man and ape was outstanding not to mention the scenery of the film christopher lambert was astonishing as lord of greystoke christopher is the soul to this masterpiece i became so with his performance i could feel my heart pounding the of the movie still moves me to this day his portrayal of john was oscar worthy as he should have been nominated for it
*******************************************************************************
*******************************************************************************
TEXT n° 2  -- TRUE label: Positive  -- PREDICTED label: Negative
-------------------------------------------------------------------------------
<START> bride of chucky starts late one night as officer bob bailey vince s

# E. Using GRU

In [12]:
def RNN_GRU():
    model = Sequential()
    model.add(Embedding(vocab_size,5,input_length=seq_len))
    model.add(GRU(50))
    model.add(Dropout(0.25))
    model.add(Dense(1,activation='sigmoid'))

    model.compile(loss='binary_crossentropy',optimizer=Adam(),metrics=['accuracy'])
    print(model.summary())
    
    return model

model = RNN_GRU()  

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 5)            25000     
_________________________________________________________________
gru_1 (GRU)                  (None, 50)                8400      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51        
Total params: 33,451.0
Trainable params: 33,451.0
Non-trainable params: 0.0
_________________________________________________________________
None
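
A GRU has three blocks (update gate, reset gate, candidate state) instead of the LSTM's four, which accounts for the 8,400 parameters above. The helper is hypothetical; the `reset_after` variant, which as far as I know is the default in more recent Keras versions, keeps two bias vectors per block and would give a slightly larger count.

```python
def gru_param_count(input_dim, units, reset_after=False):
    # 3 blocks, each with an input kernel and a recurrent kernel;
    # one bias vector per block, or two in the reset_after variant
    n_biases = 2 if reset_after else 1
    return 3 * (units * (input_dim + units) + n_biases * units)

gru_params = gru_param_count(input_dim=5, units=50)
total = 5000 * 5 + gru_params + (50 * 1 + 1)   # embedding + GRU + dense

print(gru_params, total)                           # 8400 33451
print(gru_param_count(5, 50, reset_after=True))    # 8550
```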


#### Fitting

In [13]:
model.fit(trn, y_train, validation_data=(test, y_test), epochs=4, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f7f404c1eb8>

#### Evaluating

In [14]:
scores = model.evaluate(test,y_test,verbose=0)
print('loss: ', scores[0],'- accuracy: ', scores[1])

loss:  0.351130843825 - accuracy:  0.86724


## Visual Checking

In [15]:
predictions = model.predict(test[:10],1)
predictions = np.round(predictions).astype('int')

In [16]:
text = get_text(x_test, 10)
preds = get_label_txt(predictions[:10])
true = get_label_txt(y_test[:10])

In [18]:
for i in range(10):
    print('*******************************************************************************')
    print('TEXT n°', i + 1, ' -- TRUE label:', true[i], ' -- PREDICTED label:', preds[i])
    print('-------------------------------------------------------------------------------')
    print(text[i])
    print('*******************************************************************************')

*******************************************************************************
TEXT n° 1  -- TRUE label: Positive  -- PREDICTED label: Positive
-------------------------------------------------------------------------------
<START> how his charter evolved as both man and ape was outstanding not to mention the scenery of the film christopher lambert was astonishing as lord of greystoke christopher is the soul to this masterpiece i became so with his performance i could feel my heart pounding the of the movie still moves me to this day his portrayal of john was oscar worthy as he should have been nominated for it
*******************************************************************************
*******************************************************************************
TEXT n° 2  -- TRUE label: Positive  -- PREDICTED label: Negative
-------------------------------------------------------------------------------
<START> bride of chucky starts late one night as officer bob bailey vince s