# **Assignment 5**
Build CNN model for sentiment analysis (binary classification) of IMDB Reviews (https://www.kaggle.com/utathya/imdb-review-dataset). You can use data with label="unsup" for pretraining of embeddings. Here you are forbidden to use test dataset for pretraining of embeddings.
Your quality metric is accuracy score on test dataset. Look at "type" column for train/test split.
You can use pretrained embeddings from external sources.
You have to provide data for trials with different hyperparameter values.

You have to beat following baselines:

[3 points] acc = 0.75

[5 points] acc = 0.8

[8 points] acc = 0.9

[2 points] for using unsupervised data

In [1]:
import os
import numpy as np
import tensorflow as tf
import random as rn
import keras.backend as K
import pandas as pd
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.callbacks import EarlyStopping
from keras.datasets import imdb
from sklearn.model_selection import StratifiedKFold
from gensim.models import Word2Vec
# to remove warnings about deprecated functions on Colab
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

Using TensorFlow backend.


# Parameters & Data

In [2]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-02-02 18:25:02--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-02-02 18:25:02--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-02-02 18:25:03--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [3]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [0]:
max_features = 5000
embedding_dims = 50
maxlen = 400

batch_size = 32
epochs = 2

num_filters = 128
kernel_size = 3
hidden_dims = 250

In [5]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
25000 train sequences
25000 test sequences


In [6]:
x_train_seq = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test_seq = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000,)
x_test shape: (25000,)


# Modelling

## Basic CNN model

In [7]:
model = Sequential() # an empty container




In [8]:
# start with an embedding layer to map vocab indices into vectors
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))




Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [0]:
# FEATURE EXTRACTION
# we add a Convolution1D, which will learn filters
# word group filters of size kernel_size:
model.add(Conv1D(num_filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

In [0]:
# CLASSIFICATION
# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

In [11]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 50)           250000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 128)          19328     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               32250     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 250)              

In [12]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train_seq, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test_seq, y_test))



Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Train on 25000 samples, validate on 25000 samples
Epoch 1/2





Epoch 2/2


<keras.callbacks.History at 0x7efd3d29df60>

# Pretrained embeddings from external sources

https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset

In [0]:
INDEX_FROM = 3
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features, index_from=INDEX_FROM)

In [14]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in x_train[0] ))

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly <UNK> was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big <UNK> f

In [0]:
train_texts = [[id_to_word[i] for i in t] for t in x_train]

In [16]:
print(train_texts[0])

['<START>', 'this', 'film', 'was', 'just', 'brilliant', 'casting', 'location', 'scenery', 'story', 'direction', "everyone's", 'really', 'suited', 'the', 'part', 'they', 'played', 'and', 'you', 'could', 'just', 'imagine', 'being', 'there', 'robert', '<UNK>', 'is', 'an', 'amazing', 'actor', 'and', 'now', 'the', 'same', 'being', 'director', '<UNK>', 'father', 'came', 'from', 'the', 'same', 'scottish', 'island', 'as', 'myself', 'so', 'i', 'loved', 'the', 'fact', 'there', 'was', 'a', 'real', 'connection', 'with', 'this', 'film', 'the', 'witty', 'remarks', 'throughout', 'the', 'film', 'were', 'great', 'it', 'was', 'just', 'brilliant', 'so', 'much', 'that', 'i', 'bought', 'the', 'film', 'as', 'soon', 'as', 'it', 'was', 'released', 'for', '<UNK>', 'and', 'would', 'recommend', 'it', 'to', 'everyone', 'to', 'watch', 'and', 'the', 'fly', '<UNK>', 'was', 'amazing', 'really', 'cried', 'at', 'the', 'end', 'it', 'was', 'so', 'sad', 'and', 'you', 'know', 'what', 'they', 'say', 'if', 'you', 'cry', 'at'

In [17]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 4998 unique tokens.


In [18]:
embeddings_index = {}
with open('glove.6B.50d.txt') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [19]:
embeddings_index['am'].shape

(50,)

In [0]:
# prepare embedding matrix
num_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dims))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [21]:
embedding_matrix.shape

(4999, 50)

In [22]:
print('Build model...')
model = Sequential()
model.add(Embedding(num_words,
                            embedding_dims,
                            weights=[embedding_matrix],
                            input_length=maxlen,
                            trainable=False))
# this layer is frozen and contains Glove embeddings
model.add(Dropout(0.2))
model.add(Conv1D(num_filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
es = EarlyStopping(monitor='val_loss', patience = 2)
model.fit(x_train_seq, y_train,
          batch_size=32,
          epochs=100, callbacks = [es],
          validation_data=(x_test_seq, y_test))

Build model...
Train on 25000 samples, validate on 25000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100


<keras.callbacks.History at 0x7efd39c4dbe0>

In [23]:
model.evaluate(x_test_seq, y_test)



[0.4065890553855896, 0.8138]

# Hyperparameters tuning

Grid-search + cross-validation in Keras

In [0]:
def get_model(embedding_matrix=None, strides = 1, hidden_dims = 64,
              dropout_rate=0.2, embedding_dims=50, num_words=5000, maxlen=400,
              num_filters = 64, kernel_size = 2, verbose = 0):
    K.clear_session()
    model = Sequential()
    if embedding_matrix is None:
        model.add(Embedding(max_features,
                            embedding_dims,
                            input_length=maxlen))
    else:
        model.add(Embedding(num_words,
                        embedding_dims,
                        weights=[embedding_matrix],
                        input_length=maxlen,
                        trainable=False)) 
    model.add(Dropout(dropout_rate))
    model.add(Conv1D(num_filters,
                    kernel_size,
                    padding='valid',
                    activation='relu',
                    strides=strides))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(hidden_dims))
    model.add(Dropout(dropout_rate))
    model.add(Activation('relu'))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    os.environ['PYTHONHASHSEED'] = '0'
    np.random.seed(42)
    rn.seed(12345)
    tf.set_random_seed(1234)

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    if verbose > 0:
        model.summary()
    return model

In [0]:
# cross-validation
def cross_validate_model(model, x_train_seq, y_train, seed = 42):
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    cvscores = []
    for i, (train, test) in enumerate(kfold.split(x_train_seq, y_train)):
        print('Fold {}'.format(i))
        es = EarlyStopping(monitor='val_loss', patience = 2)
        model.fit(x_train_seq[train], y_train[train], epochs=100, callbacks = [es], validation_split=0.2, batch_size=32, verbose = 0)
        scores = model.evaluate(x_train_seq[test], y_train[test], verbose = 0)
        print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
    return cvscores

In [26]:
model = get_model()
cvscores = cross_validate_model(model, x_train_seq, y_train)


Fold 0
acc: 87.08%
Fold 1
acc: 94.96%
Fold 2
acc: 96.64%
Fold 3
acc: 96.64%
Fold 4
acc: 96.96%
Fold 5
acc: 97.28%
Fold 6
acc: 96.76%
Fold 7
acc: 96.72%
Fold 8
acc: 96.96%
Fold 9
acc: 97.44%
97.44% (+/- 0.00%)


In [30]:
cv_scores = {}
for num_filters in [32, 64, 128, 256]:
    for strides in [1, 2, 3, 4]:
        for kernel_size in [2, 3, 4]:
            print('num_filters{}_strides{}_ks'.format(num_filters, strides, kernel_size))
            model = get_model(num_filters = num_filters, strides=strides)
            score = cross_validate_model(model, x_train_seq, y_train)
            cv_scores[(num_filters, strides, kernel_size)] = score

num_filters32_strides1_ks
Fold 0
acc: 87.44%
Fold 1
acc: 93.64%
Fold 2
acc: 95.76%
Fold 3
acc: 96.08%
Fold 4
acc: 96.40%
Fold 5
acc: 96.84%
Fold 6
acc: 97.04%
Fold 7
acc: 97.32%
Fold 8
acc: 96.84%
Fold 9
acc: 97.40%
97.40% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 87.12%
Fold 1
acc: 92.92%
Fold 2
acc: 95.76%
Fold 3
acc: 96.12%
Fold 4
acc: 96.80%
Fold 5
acc: 96.92%
Fold 6
acc: 97.04%
Fold 7
acc: 97.32%
Fold 8
acc: 96.84%
Fold 9
acc: 97.44%
97.44% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 87.00%
Fold 1
acc: 93.88%
Fold 2
acc: 96.16%
Fold 3
acc: 96.32%
Fold 4
acc: 97.00%
Fold 5
acc: 97.36%
Fold 6
acc: 96.76%
Fold 7
acc: 97.12%
Fold 8
acc: 96.96%
Fold 9
acc: 97.00%
97.00% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 85.04%
Fold 1
acc: 93.32%
Fold 2
acc: 95.68%
Fold 3
acc: 96.60%
Fold 4
acc: 96.52%
Fold 5
acc: 97.04%
Fold 6
acc: 96.88%
Fold 7
acc: 97.00%
Fold 8
acc: 97.04%
Fold 9
acc: 96.76%
96.76% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 86.00%
Fold 1
acc: 93.

In [0]:
import operator
best_param = max(cv_scores.items(), key=operator.itemgetter(1))[0]

In [32]:
best_model = get_model(num_filters = best_param[0], strides = best_param[1], kernel_size = best_param[2])
es = EarlyStopping(monitor='val_loss', patience = 2)
best_model.fit(x_train_seq, y_train, validation_split=0.2, epochs = 10, callbacks=[es])
best_model.evaluate(x_test_seq, y_test)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


[0.2978860420632362, 0.88296]

In [0]:
#glove.6B.100d

In [33]:
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [34]:
embedding_matrix.shape

(4999, 50)

In [0]:
# prepare embedding matrix
embedding_dims = 100
num_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dims))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [36]:
print('Build model...')
model = Sequential()
model.add(Embedding(num_words,
                            embedding_dims,
                            weights=[embedding_matrix],
                            input_length=maxlen,
                            trainable=False))
# this layer is frozen and contains Glove embeddings
model.add(Dropout(0.2))
model.add(Conv1D(num_filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
es = EarlyStopping(monitor='val_loss', patience = 2)
model.fit(x_train_seq, y_train,
          batch_size=32,
          epochs=100, callbacks = [es],
          validation_data=(x_test_seq, y_test))

Build model...
Train on 25000 samples, validate on 25000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100


<keras.callbacks.History at 0x7f629c8d4710>

In [37]:
model.evaluate(x_test_seq, y_test)



[0.4222362138080597, 0.8258]

In [38]:
model = get_model()
cvscores = cross_validate_model(model, x_train_seq, y_train)

Fold 0
acc: 88.56%
Fold 1
acc: 94.08%
Fold 2
acc: 96.36%
Fold 3
acc: 96.64%
Fold 4
acc: 97.60%
Fold 5
acc: 97.08%
Fold 6
acc: 97.04%
Fold 7
acc: 96.20%
Fold 8
acc: 97.12%
Fold 9
acc: 97.24%
97.24% (+/- 0.00%)


In [40]:
cv_scores = {}
for num_filters in [32, 64, 128, 256]:
    for strides in [1, 2, 3, 4]:
        for kernel_size in [2, 3, 4]:
            print('num_filters{}_strides{}_ks'.format(num_filters, strides, kernel_size))
            model = get_model(num_filters = num_filters, strides=strides)
            score = cross_validate_model(model, x_train_seq, y_train)
            cv_scores[(num_filters, strides, kernel_size)] = score

num_filters32_strides1_ks
Fold 0
acc: 87.72%
Fold 1
acc: 93.04%
Fold 2
acc: 94.96%
Fold 3
acc: 96.64%
Fold 4
acc: 95.48%
Fold 5
acc: 97.28%
Fold 6
acc: 96.52%
Fold 7
acc: 97.32%
Fold 8
acc: 97.16%
Fold 9
acc: 97.04%
97.04% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 87.88%
Fold 1
acc: 93.68%
Fold 2
acc: 96.36%
Fold 3
acc: 96.64%
Fold 4
acc: 96.84%
Fold 5
acc: 97.12%
Fold 6
acc: 96.84%
Fold 7
acc: 97.12%
Fold 8
acc: 97.40%
Fold 9
acc: 97.32%
97.32% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 87.28%
Fold 1
acc: 93.64%
Fold 2
acc: 95.40%
Fold 3
acc: 96.56%
Fold 4
acc: 96.56%
Fold 5
acc: 97.40%
Fold 6
acc: 97.04%
Fold 7
acc: 97.60%
Fold 8
acc: 97.08%
Fold 9
acc: 96.84%
96.84% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 85.48%
Fold 1
acc: 92.76%
Fold 2
acc: 96.04%
Fold 3
acc: 96.16%
Fold 4
acc: 96.24%
Fold 5
acc: 96.88%
Fold 6
acc: 96.64%
Fold 7
acc: 97.28%
Fold 8
acc: 96.68%
Fold 9
acc: 96.56%
96.56% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 86.32%
Fold 1
acc: 93.

In [0]:
best_param = max(cv_scores.items(), key=operator.itemgetter(1))[0]

In [42]:
best_model = get_model(num_filters = best_param[0], strides = best_param[1], kernel_size = best_param[2])
es = EarlyStopping(monitor='val_loss', patience = 2)
best_model.fit(x_train_seq, y_train, validation_split=0.2, epochs = 10, callbacks=[es])
best_model.evaluate(x_test_seq, y_test)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


[0.344524891974926, 0.86756]

In [0]:
#glove.6B.200d

In [27]:
embeddings_index = {}
with open('glove.6B.200d.txt') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [28]:
embedding_matrix.shape

(4999, 50)

In [0]:
# prepare embedding matrix
embedding_dims = 200
num_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dims))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [30]:
print('Build model...')
model = Sequential()
model.add(Embedding(num_words,
                            embedding_dims,
                            weights=[embedding_matrix],
                            input_length=maxlen,
                            trainable=False))
# this layer is frozen and contains Glove embeddings
model.add(Dropout(0.2))
model.add(Conv1D(num_filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
es = EarlyStopping(monitor='val_loss', patience = 2)
model.fit(x_train_seq, y_train,
          batch_size=32,
          epochs=100, callbacks = [es],
          validation_data=(x_test_seq, y_test))

Build model...
Train on 25000 samples, validate on 25000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100


<keras.callbacks.History at 0x7efd3aa460b8>

In [31]:
model.evaluate(x_test_seq, y_test)



[0.45022787885665894, 0.80696]

In [32]:
model = get_model()
cvscores = cross_validate_model(model, x_train_seq, y_train)

Fold 0
acc: 87.92%
Fold 1
acc: 94.60%
Fold 2
acc: 96.20%
Fold 3
acc: 97.52%
Fold 4
acc: 97.16%
Fold 5
acc: 96.80%
Fold 6
acc: 96.72%
Fold 7
acc: 97.60%
Fold 8
acc: 97.12%
Fold 9
acc: 97.24%
97.24% (+/- 0.00%)


In [33]:
cv_scores = {}
for num_filters in [32, 64, 128, 256]:
    for strides in [1, 2, 3, 4]:
        for kernel_size in [2, 3, 4]:
            print('num_filters{}_strides{}_ks'.format(num_filters, strides, kernel_size))
            model = get_model(num_filters = num_filters, strides=strides)
            score = cross_validate_model(model, x_train_seq, y_train)
            cv_scores[(num_filters, strides, kernel_size)] = score

num_filters32_strides1_ks
Fold 0
acc: 87.52%
Fold 1
acc: 93.28%
Fold 2
acc: 95.96%
Fold 3
acc: 96.00%
Fold 4
acc: 97.36%
Fold 5
acc: 96.84%
Fold 6
acc: 96.88%
Fold 7
acc: 97.16%
Fold 8
acc: 96.60%
Fold 9
acc: 97.24%
97.24% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 86.88%
Fold 1
acc: 92.52%
Fold 2
acc: 95.56%
Fold 3
acc: 96.84%
Fold 4
acc: 96.92%
Fold 5
acc: 97.36%
Fold 6
acc: 96.60%
Fold 7
acc: 97.08%
Fold 8
acc: 97.08%
Fold 9
acc: 97.20%
97.20% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 88.24%
Fold 1
acc: 93.12%
Fold 2
acc: 95.80%
Fold 3
acc: 96.44%
Fold 4
acc: 96.84%
Fold 5
acc: 96.68%
Fold 6
acc: 96.36%
Fold 7
acc: 97.24%
Fold 8
acc: 97.56%
Fold 9
acc: 97.32%
97.32% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 85.12%
Fold 1
acc: 93.04%
Fold 2
acc: 95.56%
Fold 3
acc: 96.84%
Fold 4
acc: 96.80%
Fold 5
acc: 96.88%
Fold 6
acc: 96.68%
Fold 7
acc: 96.64%
Fold 8
acc: 96.96%
Fold 9
acc: 96.60%
96.60% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 85.60%
Fold 1
acc: 92.

In [0]:
best_param = max(cv_scores.items(), key=operator.itemgetter(1))[0]

In [38]:
best_model = get_model(num_filters = best_param[0], strides = best_param[1], kernel_size = best_param[2])
es = EarlyStopping(monitor='val_loss', patience = 2)
best_model.fit(x_train_seq, y_train, validation_split=0.2, epochs = 10, callbacks=[es])
best_model.evaluate(x_test_seq, y_test)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


[0.3033345353007317, 0.88788]

In [0]:
#glove.6B.300d

In [39]:
embeddings_index = {}
with open('glove.6B.300d.txt') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [40]:
embedding_matrix.shape

(4999, 200)

In [0]:
# prepare embedding matrix
embedding_dims = 300
num_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dims))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [42]:
print('Build model...')
model = Sequential()
model.add(Embedding(num_words,
                            embedding_dims,
                            weights=[embedding_matrix],
                            input_length=maxlen,
                            trainable=False))
# this layer is frozen and contains Glove embeddings
model.add(Dropout(0.2))
model.add(Conv1D(num_filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
es = EarlyStopping(monitor='val_loss', patience = 2)
model.fit(x_train_seq, y_train,
          batch_size=32,
          epochs=100, callbacks = [es],
          validation_data=(x_test_seq, y_test))

Build model...
Train on 25000 samples, validate on 25000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100


<keras.callbacks.History at 0x7efd3e447780>

In [43]:
model.evaluate(x_test_seq, y_test)



[0.3971713258457184, 0.84848]

In [44]:
model = get_model()
cvscores = cross_validate_model(model, x_train_seq, y_train)

Fold 0
acc: 88.72%
Fold 1
acc: 93.68%
Fold 2
acc: 96.68%
Fold 3
acc: 96.96%
Fold 4
acc: 97.08%
Fold 5
acc: 97.96%
Fold 6
acc: 97.24%
Fold 7
acc: 97.32%
Fold 8
acc: 96.44%
Fold 9
acc: 97.52%
97.52% (+/- 0.00%)


In [45]:
cv_scores = {}
for num_filters in [32, 64, 128, 256]:
    for strides in [1, 2, 3, 4]:
        for kernel_size in [2, 3, 4]:
            print('num_filters{}_strides{}_ks'.format(num_filters, strides, kernel_size))
            model = get_model(num_filters = num_filters, strides=strides)
            score = cross_validate_model(model, x_train_seq, y_train)
            cv_scores[(num_filters, strides, kernel_size)] = score

num_filters32_strides1_ks
Fold 0
acc: 87.44%
Fold 1
acc: 93.64%
Fold 2
acc: 95.76%
Fold 3
acc: 95.72%
Fold 4
acc: 96.80%
Fold 5
acc: 97.44%
Fold 6
acc: 96.84%
Fold 7
acc: 96.52%
Fold 8
acc: 96.88%
Fold 9
acc: 97.04%
97.04% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 87.24%
Fold 1
acc: 94.24%
Fold 2
acc: 94.96%
Fold 3
acc: 96.60%
Fold 4
acc: 97.08%
Fold 5
acc: 97.36%
Fold 6
acc: 96.48%
Fold 7
acc: 97.48%
Fold 8
acc: 97.08%
Fold 9
acc: 97.36%
97.36% (+/- 0.00%)
num_filters32_strides1_ks
Fold 0
acc: 87.04%
Fold 1
acc: 94.12%
Fold 2
acc: 96.32%
Fold 3
acc: 96.64%
Fold 4
acc: 96.80%
Fold 5
acc: 97.36%
Fold 6
acc: 96.80%
Fold 7
acc: 97.60%
Fold 8
acc: 97.08%
Fold 9
acc: 97.28%
97.28% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 86.00%
Fold 1
acc: 93.00%
Fold 2
acc: 95.64%
Fold 3
acc: 96.68%
Fold 4
acc: 96.24%
Fold 5
acc: 96.68%
Fold 6
acc: 96.84%
Fold 7
acc: 97.32%
Fold 8
acc: 97.36%
Fold 9
acc: 96.84%
96.84% (+/- 0.00%)
num_filters32_strides2_ks
Fold 0
acc: 85.08%
Fold 1
acc: 93.

In [0]:
best_param = max(cv_scores.items(), key=operator.itemgetter(1))[0]

In [47]:
best_model = get_model(num_filters = best_param[0], strides = best_param[1], kernel_size = best_param[2])
es = EarlyStopping(monitor='val_loss', patience = 2)
best_model.fit(x_train_seq, y_train, validation_split=0.2, epochs = 10, callbacks=[es])
best_model.evaluate(x_test_seq, y_test)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


[0.30502273245334627, 0.88444]

## Unsupervised data for embedding pretraining

In [0]:
df = pd.read_csv('imdb_master.csv', encoding="latin-1")
df.head()

Unnamed: 0.1,Unnamed: 0,type,review,label,file
0,0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt


In [0]:
len(df), len(df[df.type == 'test']), len(df[df['type'] == 'train'])

(100000, 25000, 75000)

In [0]:
texts = df[df['label'] == 'unsup']['review']

In [0]:
sentences = [t.split() for t in texts] # TODO: use Keras tokenizer
model = Word2Vec(sentences, min_count=1)
print(model)
words = list(model.wv.vocab)
print(words)
print(model['sentence'])
model.save('model.bin')
new_model = Word2Vec.load('model.bin')

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

  
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
