# L665 ML for NLP, Spring 2018

## Assignment 3 - Task 1.1 

Sentence classification based on Yoon Kim, "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014) <br>
Model variation: CNN-rand

Dataset used: MR - Movie Reviews <br>
Reference: Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, 2005

Author: Carlos Sathler

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline  

## Get data

In [2]:
from sklearn.utils import shuffle

SEED = 0

df_neg = pd.read_table('input/rt-polarity.neg', names=['review'],  header=None, encoding='latin-1')
df_pos = pd.read_table('input/rt-polarity.pos', names=['review'],  header=None, encoding='latin-1')
df_neg['rating'] = 0
df_pos['rating'] = 1
df_all = shuffle(pd.concat((df_neg, df_pos), axis=0), random_state=SEED)
print('Dataset size: {}'.format(df_all.index.size))
print('Count of positive reviews: {}'.format(df_all[df_all['rating']==1].index.size))
print('Count of negative reviews: {}'.format(df_all[df_all['rating']==0].index.size))
df_all.head()

Dataset size: 10662
Count of positive reviews: 5331
Count of negative reviews: 5331


| | review | rating |
|---|---|---|
| 1837 | the sentimental cliches mar an otherwise excel... | 1 |
| 3318 | if you love the music , and i do , its hard to... | 1 |
| 3381 | though harris is affecting at times , he canno... | 0 |
| 3387 | poignant japanese epic about adolescent anomie... | 1 |
| 36 | cantet perfectly captures the hotel lobbies , ... | 1 |


## Create input sequences

In [3]:
# following guidelines outlined here:
# https://keras.io/preprocessing/text/
# https://github.com/keras-team/keras/blob/master/keras/preprocessing/text.py#L134
# https://keras.io/preprocessing/sequence/
# https://github.com/keras-team/keras/blob/master/keras/preprocessing/sequence.py#L248

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# tokenize text and create dictionary mapping tokens to integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_all.review)

# create sequences of integers to represent reviews, and find longest sentence
seqs = tokenizer.texts_to_sequences(df_all.review)
max_len = max([len(seq) for seq in seqs])

# pad sequences to feed to the embedding layer
seqs = pad_sequences(seqs, maxlen=max_len, dtype='int32', padding='pre', truncating='pre', value=0.0)

Using TensorFlow backend.


In [4]:
print('Number of documents  = {}'.format(tokenizer.document_count))
print('Size of vocabulary   = {}'.format(len(tokenizer.word_index)))
print('Maximum sequence len = {}'.format(max_len))

Number of documents  = 10662
Size of vocabulary   = 19498
Maximum sequence len = 51
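
To make the preprocessing above concrete, here is a minimal pure-Python sketch of its two steps: mapping words to integer ids by frequency, then left-padding each sequence with zeros. It only approximates what the Keras `Tokenizer` and `pad_sequences` do (the real `Tokenizer` also strips punctuation and may break ties differently). The helper names `build_word_index`, `to_sequences` and `pad_pre` and the toy sentences are made up for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sort by descending count; ties are broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def to_sequences(texts, index):
    """Replace every known word by its integer id."""
    return [[index[w] for w in text.lower().split() if w in index] for text in texts]

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and, if needed, left-truncate) every sequence to exactly maxlen ids."""
    return [[value] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

# hypothetical toy corpus, not taken from the MR dataset
toy_texts = ["a good movie", "a bad bad movie indeed"]
toy_index = build_word_index(toy_texts)
toy_seqs = to_sequences(toy_texts, toy_index)
toy_max_len = max(len(s) for s in toy_seqs)
toy_padded = pad_pre(toy_seqs, toy_max_len)

print(toy_index)
print(toy_padded)
```

Because id 0 never stands for a word, the embedding layer needs one more row than the vocabulary size, which is why the model below uses `VOC_SIZE+1`.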


## Create model

The layer layout follows the "ConvNet Architectures" guidelines in the Stanford "CS231n: Convolutional Neural Networks for Visual Recognition" class notes (http://cs231n.github.io/convolutional-networks/): <br>
"INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC <br>
where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3)..."
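
As a quick illustration of the quoted pattern, the sketch below expands it into an explicit layer sequence for given N, M and K. The function `expand_pattern` is a hypothetical helper written for this note, not part of Keras or of the CS231n material.

```python
def expand_pattern(n, m, k, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC into a list."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n   # N convolution/activation pairs
        if pool:
            layers.append("POOL")        # the optional pooling layer
    layers += ["FC", "RELU"] * k         # K hidden fully connected layers
    layers.append("FC")                  # final fully connected (output) layer
    return layers

pattern = expand_pattern(1, 1, 1)
print(" -> ".join(pattern))
```

With N=1, M=1 and K=1, the setting used by the model in the next cell, this prints `INPUT -> CONV -> RELU -> POOL -> FC -> RELU -> FC`.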

In [5]:
MAX_SEQ = max_len
VOC_SIZE = len(tokenizer.word_index)

from keras.models import Sequential
from keras.optimizers import SGD
from keras.layers import Embedding, Conv1D, Dropout, MaxPooling1D, Flatten, Dense
from keras import regularizers

# model where N=1, M=1 and K=1
def get_model(output_dim=64, filter_size=128, window_size=3, stride=1, pool_size=2, dense_dim=16):

    model = Sequential()
    model.add(Embedding(VOC_SIZE+1, output_dim, input_length=MAX_SEQ, embeddings_initializer='random_uniform'))
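    # convolution layer with filter_size filters (filter_size is the number of filters)
    # NOTE: Conv1D's kernel_size is measured in time steps (tokens), so the value passed
    # here is output_dim*window_size = 192 tokens, not window_size = 3 tokens; this is what
    # produces the 1,572,992 parameters in the summary (192*64*128 + 128).
    # A window of window_size tokens would be kernel_size=window_size.
    model.add(Conv1D(filter_size, kernel_size=(output_dim*window_size),\
                     padding='same', strides=(stride), activation='relu'))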
    model.add(Dropout(0.5))
    # do max pooling
    model.add(MaxPooling1D(pool_size=(pool_size), padding='same'))
    # flatten tensor
    model.add(Flatten())
    # add one fully connected layer
    model.add(Dense(dense_dim,activation='relu'))
    model.add(Dropout(0.5))
    # add output layer
    model.add(Dense(1, kernel_regularizer=regularizers.l2(0.01),\
                       activity_regularizer=regularizers.l1(0.01),\
                       activation='sigmoid'))
    
    #sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model
    
get_model().summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 51, 64)            1247936   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 51, 128)           1572992   
_________________________________________________________________
dropout_1 (Dropout)          (None, 51, 128)           0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 26, 128)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3328)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                53264     
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
__________
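
The shapes and parameter counts in the summary above can be reproduced by hand, which is a useful sanity check on the layer arguments. The sketch below uses the standard formulas for embedding, 1D convolution and dense layers; the variable names are chosen for this note, and the kernel size of 192 is the `output_dim*window_size` value that `get_model` passes to `Conv1D`.

```python
import math

vocab_size, embed_dim, seq_len = 19498, 64, 51   # from the tokenizer output and get_model defaults
kernel_size, n_filters = 64 * 3, 128             # kernel_size = output_dim * window_size
pool_size, dense_dim = 2, 16

# embedding: one row per word id, plus one row for the padding id 0
embedding_params = (vocab_size + 1) * embed_dim

# Conv1D: one (kernel_size x input_channels) weight block per filter, plus one bias per filter
conv_params = kernel_size * embed_dim * n_filters + n_filters

# MaxPooling1D with padding='same' and the default stride (= pool_size) rounds the length up
pooled_len = math.ceil(seq_len / pool_size)
flat_dim = pooled_len * n_filters

# Dense: weights plus biases
dense_params = flat_dim * dense_dim + dense_dim

print(embedding_params, conv_params, pooled_len, flat_dim, dense_params)
```

These values match the `embedding_1`, `conv1d_1`, `max_pooling1d_1`, `flatten_1` and `dense_1` rows of the summary.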

In [6]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score

EPOCHS = 20
BATCH_SIZE = 128

# evaluate using 10-fold CV as in Yoon Kim article
FOLDS = 10

model = get_model()
X = seqs
y = np.array(df_all.rating.tolist())
kfold = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
acc = list()
i = 0
for train_index, valid_index in kfold.split(X, y):
    i += 1
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    model = get_model()
    model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=2, validation_data=(X_valid, y_valid),\
              callbacks=[EarlyStopping(patience=1, monitor='val_loss')])
    y_hat = model.predict(X_valid)
    y_pred = (y_hat.reshape(-1) > 0.5).astype(int)  # threshold the sigmoid outputs at 0.5
    acc.append(accuracy_score(y_valid, y_pred))
    print('\n\t>> Score for split {}: {}\n'.format(i, acc[-1]))

print('Average accuracy = {}'.format(np.mean(np.array(acc))))
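
The loop above uses stratified 10-fold cross-validation, so every validation fold keeps the 50/50 class balance of the full dataset. The sketch below estimates the resulting fold sizes, assuming each class is spread as evenly as possible and any remainder goes to the earliest folds. That assumption agrees with the "Train on ... validate on ..." lines in the training log, but the exact assignment inside scikit-learn may differ by version. `stratified_fold_sizes` is a hypothetical helper, not a scikit-learn function.

```python
def stratified_fold_sizes(class_counts, n_splits):
    """Validation-fold sizes when each class is divided as evenly as possible across folds."""
    sizes = [0] * n_splits
    for count in class_counts:
        base, extra = divmod(count, n_splits)
        for fold in range(n_splits):
            # the first `extra` folds receive one additional sample of this class
            sizes[fold] += base + (1 if fold < extra else 0)
    return sizes

# 5331 negative and 5331 positive reviews, 10 folds
valid_sizes = stratified_fold_sizes([5331, 5331], 10)
total = sum(valid_sizes)
for fold, n_valid in enumerate(valid_sizes[:3], start=1):
    print('fold {}: train on {}, validate on {}'.format(fold, total - n_valid, n_valid))
```

Under this assumption the first fold validates on 1068 reviews and the remaining folds on 1066 each.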

Train on 9594 samples, validate on 1068 samples
Epoch 1/20
 - 72s - loss: 1.1969 - acc: 0.5009 - val_loss: 1.1187 - val_acc: 0.5000
Epoch 2/20
 - 70s - loss: 0.9929 - acc: 0.6153 - val_loss: 1.0060 - val_acc: 0.7163
Epoch 3/20
 - 73s - loss: 0.7927 - acc: 0.8888 - val_loss: 1.0022 - val_acc: 0.7622
Epoch 4/20
 - 71s - loss: 0.7045 - acc: 0.9553 - val_loss: 1.0828 - val_acc: 0.7537

	>> Score for split 1: 0.7537453183520599

Train on 9596 samples, validate on 1066 samples
Epoch 1/20
 - 73s - loss: 1.2202 - acc: 0.4995 - val_loss: 1.1603 - val_acc: 0.5000
Epoch 2/20
 - 70s - loss: 1.0770 - acc: 0.5649 - val_loss: 1.0185 - val_acc: 0.6632
Epoch 3/20
 - 72s - loss: 0.9042 - acc: 0.8433 - val_loss: 1.0009 - val_acc: 0.7411
Epoch 4/20
 - 71s - loss: 0.8058 - acc: 0.9235 - val_loss: 1.0282 - val_acc: 0.7261

	>> Score for split 2: 0.726078799249531
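
Both folds above stop after epoch 4 even though `EPOCHS = 20`, because of the `EarlyStopping(patience=1, monitor='val_loss')` callback. The sketch below mimics that rule on the validation losses copied from the two logs. It is a simplified stand-in for the callback's logic (no `min_delta`, no weight restoring), and `stopping_epoch` is a name invented for this note.

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch after which training stops under a patience rule."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: remember it and reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)            # never triggered: all epochs were run

fold1_val_loss = [1.1187, 1.0060, 1.0022, 1.0828]
fold2_val_loss = [1.1603, 1.0185, 1.0009, 1.0282]
print(stopping_epoch(fold1_val_loss), stopping_epoch(fold2_val_loss))
```

Note that the score reported for split 1 (0.7537) equals the `val_acc` of epoch 4, the last epoch trained, rather than the 0.7622 reached at epoch 3, where `val_loss` was lowest.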

Train on 9596 samples, validate on 1066 samples
Epoch 1/20
 - 74s - loss: 1.2087 - acc: 0.4982 - val_loss: 1.1463 - val_acc: 0.5000
Epoch 2/20
 -