# L665 ML for NLP, Spring 2018

## Assignment 3 - Task 2

Sentence classification with a recurrent neural network (LSTM). <br>
I compare my results with those reported for CNN-rand in the paper by Yoon Kim, "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014). <br>

Dataset used: MR - Movie Reviews <br>
Reference: Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, 2005

Author: Carlos Sathler

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline  

## Get data

In [2]:
from sklearn.utils import shuffle

SEED = 0

df_neg = pd.read_table('input/rt-polarity.neg', names=['review'],  header=None, encoding='latin-1')
df_pos = pd.read_table('input/rt-polarity.pos', names=['review'],  header=None, encoding='latin-1')
df_neg['rating'] = 0
df_pos['rating'] = 1
df_all = shuffle(pd.concat((df_neg, df_pos), axis=0), random_state=SEED)
print('Dataset size: {}'.format(df_all.index.size))
print('Count of positive reviews: {}'.format(df_all[df_all['rating']==1].index.size))
print('Count of negative reviews: {}'.format(df_all[df_all['rating']==0].index.size))
df_all.head()

Dataset size: 10662
Count of positive reviews: 5331
Count of negative reviews: 5331


| | review | rating |
|---|---|---|
| 1837 | the sentimental cliches mar an otherwise excel... | 1 |
| 3318 | if you love the music , and i do , its hard to... | 1 |
| 3381 | though harris is affecting at times , he canno... | 0 |
| 3387 | poignant japanese epic about adolescent anomie... | 1 |
| 36 | cantet perfectly captures the hotel lobbies , ... | 1 |


## Create input sequences

In [3]:
# following guidelines outlined here:
# https://keras.io/preprocessing/text/
# https://github.com/keras-team/keras/blob/master/keras/preprocessing/text.py#L134
# https://keras.io/preprocessing/sequence/
# https://github.com/keras-team/keras/blob/master/keras/preprocessing/sequence.py#L248

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# tokenize text and create dictionary mapping tokens to integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_all.review)

# create sequences of integers to represent reviews, and find longest sentence
seqs = tokenizer.texts_to_sequences(df_all.review)
max_len = max([len(seq) for seq in seqs])

# pad sequences to feed to the embedding layer
seqs = pad_sequences(seqs, maxlen=max_len, dtype='int32', padding='pre', truncating='pre', value=0.0)

Using TensorFlow backend.


In [4]:
print('Number of documents  = {}'.format(tokenizer.document_count))
print('Size of vocabulary   = {}'.format(len(tokenizer.word_index)))
print('Maximum sequence len = {}'.format(max_len))

Number of documents  = 10662
Size of vocabulary   = 19498
Maximum sequence len = 51
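
To make the padding step concrete, here is a minimal pure-Python sketch of what `padding='pre'` and `truncating='pre'` do to integer sequences. The helper `pad_pre` and the toy sequences are hypothetical and written only for illustration. This is not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length.

    Hypothetical helper that mimics the effect of
    pad_sequences(..., padding='pre', truncating='pre').
    """
    padded = []
    for seq in seqs:
        kept = list(seq)[-maxlen:]                 # 'pre' truncation keeps the last maxlen tokens
        fill = [value] * (maxlen - len(kept))      # 'pre' padding puts the fill values in front
        padded.append(fill + kept)
    return padded

# toy token-id sequences (made up), padded to length 3
toy = [[5, 2], [7, 1, 9, 4], [3]]
padded = pad_pre(toy, maxlen=3)
for row in padded:
    print(row)
```

Because `Tokenizer` assigns word indices starting at 1, the value 0 is free to act as the padding id. That is why the embedding layer in the model below is sized `VOC_SIZE+1` rather than `VOC_SIZE`.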


## Create model

LSTM model.

In [15]:
MAX_SEQ = max_len
VOC_SIZE = len(tokenizer.word_index)

from keras.models import Model
from keras.optimizers import SGD
from keras.layers import Input, Embedding, LSTM, Dropout, Bidirectional, Dense, Flatten
from keras import regularizers

# model where N=1, M=1 and K=1
def get_model(output_dim=64, dense_dim=16):
    
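    # token ids are integers, so the input tensor is declared as int32
    inputs = Input(shape=(MAX_SEQ,), dtype='int32')
    # VOC_SIZE+1 rows because index 0 is reserved for padding
    embed = Embedding(VOC_SIZE+1, output_dim, input_length=MAX_SEQ, embeddings_initializer='random_uniform')(inputs)
    lstm = Bidirectional(LSTM(units=100, return_sequences=True))(embed)
    lstm = Flatten()(lstm)

    # NOTE: this dense/dropout branch is built but never connected to the output.
    # The prediction layer below reads from `lstm`, so dense_dim has no effect and
    # the summary shows no hidden Dense or Dropout layer.
    dense = Dense(dense_dim, activation='relu')(lstm)
    dense = Dropout(0.5)(dense)

    pred = Dense(1, kernel_regularizer=regularizers.l2(0.01),
                 activity_regularizer=regularizers.l1(0.01),
                 activation='sigmoid')(lstm)

    model = Model(inputs, pred)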
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model
    
get_model().summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_21 (InputLayer)        (None, 51)                0         
_________________________________________________________________
embedding_21 (Embedding)     (None, 51, 64)            1247936   
_________________________________________________________________
bidirectional_4 (Bidirection (None, 51, 200)           132000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 10200)             0         
_________________________________________________________________
dense_25 (Dense)             (None, 1)                 10201     
Total params: 1,390,137
Trainable params: 1,390,137
Non-trainable params: 0
_________________________________________________________________
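
As a sanity check on the summary above, the parameter counts can be reproduced by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias) and the sizes printed earlier in this notebook. The constant names are my own and chosen for illustration.

```python
VOC_SIZE = 19498    # vocabulary size printed earlier
MAX_SEQ = 51        # maximum sequence length printed earlier
EMBED_DIM = 64      # output_dim default in get_model
LSTM_UNITS = 100    # units per LSTM direction

# embedding: one row per word index, plus one row for the padding index 0
embedding_params = (VOC_SIZE + 1) * EMBED_DIM

# one LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)
bilstm_params = 2 * lstm_one_direction

# flattened output: every time step contributes 2 * LSTM_UNITS features
flat_size = MAX_SEQ * 2 * LSTM_UNITS

# final sigmoid unit: one weight per flattened feature, plus a bias
output_params = flat_size + 1

total_params = embedding_params + bilstm_params + output_params
print(embedding_params, bilstm_params, flat_size, output_params, total_params)
```

These values agree with the summary: 1,247,936 for the embedding, 132,000 for the bidirectional LSTM, 10,201 for the output layer, and 1,390,137 in total. The output layer's 10,201 parameters also confirm that it is connected directly to the 10,200 flattened LSTM features.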


In [17]:
from sklearn.model_selection import StratifiedKFold
from keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score

EPOCHS = 20
BATCH_SIZE = 128

# evaluate using 10-fold CV as in Yoon Kim article
FOLDS = 10

# a fresh model is built inside the loop for every fold
X = seqs
y = np.array(df_all.rating.tolist())
kfold = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
acc = list()
i = 0
for train_index, valid_index in kfold.split(X, y):
    i += 1
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    model = get_model()
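    # note: the held-out fold is also used for early stopping, so the fold
    # scores may be slightly optimistic
    model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=2,
              validation_data=(X_valid, y_valid),
              callbacks=[EarlyStopping(patience=2, monitor='val_loss')])
    # threshold the sigmoid outputs at 0.5 to obtain class labels
    y_hat = model.predict(X_valid)
    y_pred = (y_hat.reshape(-1) > 0.5).astype(int)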
    acc.append(accuracy_score(y_valid, y_pred))
    print('\n\t>> Score for split {}: {}\n'.format(i, acc[-1]))

print('Average accuracy = {}'.format(np.mean(np.array(acc))))

Train on 9594 samples, validate on 1068 samples
Epoch 1/20
 - 15s - loss: 1.1537 - acc: 0.5175 - val_loss: 1.0334 - val_acc: 0.6470
Epoch 2/20
 - 12s - loss: 0.9032 - acc: 0.7753 - val_loss: 1.0090 - val_acc: 0.7425
Epoch 3/20
 - 12s - loss: 0.7460 - acc: 0.9138 - val_loss: 1.1125 - val_acc: 0.7612
Epoch 4/20
 - 12s - loss: 0.6919 - acc: 0.9557 - val_loss: 1.2229 - val_acc: 0.7491

	>> Score for split 1: 0.7490636704119851

Train on 9596 samples, validate on 1066 samples
Epoch 1/20
 - 16s - loss: 1.1566 - acc: 0.5142 - val_loss: 1.0464 - val_acc: 0.6445
Epoch 2/20
 - 12s - loss: 0.9033 - acc: 0.7693 - val_loss: 0.9929 - val_acc: 0.7120
Epoch 3/20
 - 12s - loss: 0.7506 - acc: 0.9100 - val_loss: 1.0859 - val_acc: 0.7373
Epoch 4/20
 - 12s - loss: 0.6915 - acc: 0.9598 - val_loss: 1.0984 - val_acc: 0.7326

	>> Score for split 2: 0.7326454033771107

Train on 9596 samples, validate on 1066 samples
Epoch 1/20
 - 15s - loss: 1.1512 - acc: 0.5206 - val_loss: 1.0431 - val_acc: 0.5854
Epoch 2/20
 