# L665 ML for NLP, Spring 2018

## Assignment 3 - Task 2.4

Sentence classification with a recurrent neural network (LSTM). <br>
I compare my results with those reported for CNN-rand in Yoon Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882 (2014). <br>

In this notebook, I combine an LSTM with a CNN for the classification task.

Dataset used: MR - Movie Reviews <br>
Reference: Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, 2005

Author: Carlos Sathler

In [1]:
import numpy as np
import pandas as pd

import spacy

import matplotlib.pyplot as plt
%matplotlib inline  

## Get data

In [2]:
from sklearn.utils import shuffle

SEED = 0

df_neg = pd.read_table('input/rt-polarity.neg', names=['review'],  header=None, encoding='latin-1')
df_pos = pd.read_table('input/rt-polarity.pos', names=['review'],  header=None, encoding='latin-1')
df_neg['rating'] = 0
df_pos['rating'] = 1
df_all = shuffle(pd.concat((df_neg, df_pos), axis=0), random_state=SEED)
print('Dataset size: {}'.format(df_all.index.size))
print('Count of positive reviews: {}'.format(df_all[df_all['rating']==1].index.size))
print('Count of negative reviews: {}'.format(df_all[df_all['rating']==0].index.size))
df_all.head()

Dataset size: 10662
Count of positive reviews: 5331
Count of negative reviews: 5331


| | review | rating |
|---|---|---|
| 1837 | the sentimental cliches mar an otherwise excel... | 1 |
| 3318 | if you love the music , and i do , its hard to... | 1 |
| 3381 | though harris is affecting at times , he canno... | 0 |
| 3387 | poignant japanese epic about adolescent anomie... | 1 |
| 36 | cantet perfectly captures the hotel lobbies , ... | 1 |


## Create input sequences for text tokens

In [3]:
# following guidelines outlined here:
# https://keras.io/preprocessing/text/
# https://github.com/keras-team/keras/blob/master/keras/preprocessing/text.py#L134
# https://keras.io/preprocessing/sequence/
# https://github.com/keras-team/keras/blob/master/keras/preprocessing/sequence.py#L248

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# tokenize text and create dictionary mapping tokens to integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_all.review)

# create sequences of integers to represent reviews, and find longest sentence
seqs = tokenizer.texts_to_sequences(df_all.review)
max_len = max([len(seq) for seq in seqs])

# pad sequences to feed to the embedding layer
seqs = pad_sequences(seqs, maxlen=max_len, dtype='int32', padding='pre', truncating='pre', value=0.0)

Using TensorFlow backend.


In [4]:
print('Number of documents  = {}'.format(tokenizer.document_count))
print('Size of vocabulary   = {}'.format(len(tokenizer.word_index)))
print('Maximum sequence len = {}'.format(max_len))

Number of documents  = 10662
Size of vocabulary   = 19498
Maximum sequence len = 51
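
To make the padding step concrete, here is a minimal pure-Python sketch of what `padding='pre'` and `truncating='pre'` do to a batch of integer sequences. The helper `pad_pre` and the toy sequences are hypothetical and only mimic the behaviour; the notebook itself relies on Keras' `pad_sequences`. Because the Keras tokenizer numbers words from 1, the value 0 is free to act as the padding symbol, which is why the embedding layers further down are sized `VOC_SIZE + 1`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # 'pre' truncation keeps the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # 'pre' padding adds zeros in front
    return padded

# Toy token-id sequences (made up for illustration)
toy_seqs = [[5, 2], [7, 1, 4, 3], [9]]
toy_max_len = max(len(s) for s in toy_seqs)
toy_padded = pad_pre(toy_seqs, toy_max_len)
print(toy_padded)  # [[0, 0, 5, 2], [7, 1, 4, 3], [0, 0, 0, 9]]
```

Padding in front keeps the real tokens at the end of the sequence, so they are the last thing a left-to-right recurrent layer reads.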


## Create model

The model has two branches whose outputs are concatenated along axis 1 and then fed to a final MLP with a single fully connected hidden layer.

Branch 1 is a bidirectional LSTM on top of an embedding layer that learns vector representations of the tokens <br>
Branch 2 is a CNN on top of its own embedding layer of the same kind <br>

For branch 2 I follow the "ConvNet Architectures" guidelines from the Stanford class "CS231n: Convolutional Neural Networks for Visual Recognition": http://cs231n.github.io/convolutional-networks/ <br>
"INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC <br>
where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3)..."

In this case N=1, M=1 and K=1
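
As a rough check of how the two branches fit together, the sketch below walks the tensor shapes through each branch by hand. It assumes the default arguments of `get_model` in the next cell (64-dimensional embeddings, 100 LSTM units per direction, 128 filters, pool size 2, 16 dense units) and the Keras defaults that `Bidirectional` concatenates both directions and that `MaxPooling1D` strides by its pool size. It is plain arithmetic and does not build the model.

```python
import math

MAX_SEQ = 51        # longest review, from the tokenizer output above
lstm_units = 100    # per direction
filters = 128       # named filter_size in get_model
pool_size = 2
dense_dim = 16

# Branch 1: Bidirectional LSTM with return_sequences=True, then Flatten
lstm_out = (MAX_SEQ, 2 * lstm_units)   # forward and backward states concatenated
flat1 = lstm_out[0] * lstm_out[1]

# Branch 2: Conv1D (padding='same', stride 1), MaxPooling1D (padding='same'), Flatten
conv_out = (MAX_SEQ, filters)                          # 'same' padding keeps the length
pool_out = (math.ceil(MAX_SEQ / pool_size), filters)   # 'same' pooling rounds up
flat2 = pool_out[0] * pool_out[1]

# Each branch ends in Dense(dense_dim); the two outputs are concatenated on axis 1
merged = 2 * dense_dim

print(lstm_out, flat1)   # (51, 200) 10200
print(pool_out, flat2)   # (26, 128) 3328
print(merged)            # 32
```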

In [8]:
MAX_SEQ = max_len
VOC_SIZE1 = len(tokenizer.word_index)
VOC_SIZE2 = VOC_SIZE1

from keras.models import Model
from keras.optimizers import SGD
from keras.layers import Input, Embedding, Conv1D, Dropout, MaxPooling1D, Flatten, Dense, concatenate
from keras.layers import LSTM, Bidirectional
from keras import regularizers

def get_model(output_dim=64, filter_size=128, window_size=3, stride=1, pool_size=2, dense_dim=16):
    
    # Branch 1
    input1 = Input(shape=(MAX_SEQ,), dtype='int32')  # token ids are integers, same as branch 2
    embed1 = Embedding(VOC_SIZE1+1, output_dim, input_length=MAX_SEQ, embeddings_initializer='random_uniform')(input1)
    lstm = Bidirectional(LSTM(units=100, return_sequences=True))(embed1)
    lstm = Flatten()(lstm)
    D1 = Dense(dense_dim, activation='relu')(lstm)
    D1 = Dropout(0.5)(D1)
    
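    # Branch 2
    # NOTE: Conv1D's kernel_size counts timesteps (tokens); the embedding dimension is
    # always spanned in full. With the defaults this product is 192, wider than the
    # 51-token input. It is kept as-is because the summary and scores below were
    # produced with it; passing window_size alone would give a 3-token window.
    krn_size = output_dim * window_size
    input2 = Input(shape=(MAX_SEQ,), dtype='int32')
    embed2 = Embedding(VOC_SIZE2+1, output_dim, input_length=MAX_SEQ, embeddings_initializer='random_uniform')(input2)
    C2 = Conv1D(filter_size, kernel_size=krn_size, padding='same', strides=stride, activation='relu')(embed2)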
    C2 = Dropout(0.5)(C2)
    M2 = MaxPooling1D(pool_size=(pool_size), padding='same')(C2)
    F2 = Flatten()(M2)
    D2 = Dense(dense_dim, activation='relu')(F2)
    D2 = Dropout(0.5)(D2)
    
    # Merge branches by concatenating along axis 1
    L1 = concatenate([D1, D2], axis=1)
    G1 = Dense(dense_dim, activation='relu')(L1)
    G1 = Dropout(0.5)(G1)

    pred = Dense(1, kernel_regularizer=regularizers.l2(0.01),
                 activity_regularizer=regularizers.l1(0.01),
                 activation='sigmoid')(G1)
    
    model = Model(inputs=[input1, input2], outputs=pred)
    
    #sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model
    
get_model().summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 51)           0                                            
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, 51, 64)       1247936     input_6[0][0]                    
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 51)           0                                            
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 51, 128)      1572992     embedding_6[0][0]                
__________________________________________________________________________________________________
embedding_
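
The two parameter counts visible in the summary above can be reproduced by hand, which also shows how much of the model sits in the convolution. The sketch below assumes the standard formulas (an embedding has one row per index; a `Conv1D` layer has `kernel_size * input_channels * filters` weights plus one bias per filter). The last line, using a 3-token kernel, is a hypothetical comparison and not something the notebook ran.

```python
vocab_size = 19498   # size of vocabulary reported by the tokenizer
output_dim = 64      # embedding dimension
filters = 128
window_size = 3

# Embedding: one extra row because index 0 is reserved for padding
embedding_params = (vocab_size + 1) * output_dim

# Conv1D as configured in get_model: kernel_size = output_dim * window_size = 192
krn_size = output_dim * window_size
conv_params = krn_size * output_dim * filters + filters

# Hypothetical alternative: a kernel that spans only window_size tokens
conv_params_window = window_size * output_dim * filters + filters

print(embedding_params)     # 1247936
print(conv_params)          # 1572992
print(conv_params_window)   # 24704
```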

In [11]:
%%time

from sklearn.model_selection import StratifiedKFold
from keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score

EPOCHS = 20
BATCH_SIZE = 128

# evaluate using 10-fold CV as in Yoon Kim article
FOLDS = 10

# stack the inputs of both branches side by side so one k-fold split indexes both
X = np.hstack((seqs, seqs))
y = np.array(df_all.rating.tolist())
kfold = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
acc = list()
i = 0
for train_index, valid_index in kfold.split(X, y):
    i += 1
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    model = get_model()
    # split the stacked array back into the two branch inputs (both hold the same token sequences)
    model.fit([X_train[:,:MAX_SEQ], X_train[:,MAX_SEQ:]], y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=2,
              validation_data=([X_valid[:,:MAX_SEQ], X_valid[:,MAX_SEQ:]], y_valid),
              callbacks=[EarlyStopping(patience=1, monitor='val_loss')])
    y_hat = model.predict([X_valid[:,:MAX_SEQ], X_valid[:,MAX_SEQ:]])
    y_pred = [round(pred) for pred in y_hat.reshape(-1)]
    acc.append(accuracy_score(y_valid, y_pred))
    print('\n\t>> Score for split {}: {}\n'.format(i, acc[-1]))

print('Average accuracy = {}'.format(np.mean(np.array(acc))))

Train on 9594 samples, validate on 1068 samples
Epoch 1/20
 - 89s - loss: 1.2232 - acc: 0.5006 - val_loss: 1.1278 - val_acc: 0.5028
Epoch 2/20
 - 84s - loss: 1.0532 - acc: 0.6356 - val_loss: 1.0058 - val_acc: 0.5974
Epoch 3/20
 - 82s - loss: 0.8934 - acc: 0.7575 - val_loss: 0.9998 - val_acc: 0.7022
Epoch 4/20
 - 85s - loss: 0.7856 - acc: 0.8499 - val_loss: 1.1399 - val_acc: 0.7612

	>> Score for split 1: 0.7612359550561798

Train on 9596 samples, validate on 1066 samples
Epoch 1/20
 - 86s - loss: 1.2783 - acc: 0.4972 - val_loss: 1.1533 - val_acc: 0.5000
Epoch 2/20
 - 93s - loss: 1.1266 - acc: 0.5719 - val_loss: 1.0305 - val_acc: 0.6623
Epoch 3/20
 - 92s - loss: 0.9601 - acc: 0.7208 - val_loss: 1.0314 - val_acc: 0.7111

	>> Score for split 2: 0.7110694183864915

Train on 9596 samples, validate on 1066 samples
Epoch 1/20
 - 86s - loss: 1.2195 - acc: 0.4991 - val_loss: 1.0849 - val_acc: 0.5000
Epoch 2/20
 - 85s - loss: 1.0407 - acc: 0.6110 - val_loss: 1.0058 - val_acc: 0.6323
Epoch 3/20
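
The per-split score in the loop above comes from rounding the sigmoid outputs to 0 or 1 and comparing them with the true labels. A minimal pure-Python sketch of that step follows; the helpers `to_labels` and `accuracy` and the toy probabilities are made up for illustration, while the notebook itself uses `round` and scikit-learn's `accuracy_score`. For values between 0 and 1, thresholding at 0.5 should give the same labels as Python's `round`, which sends an exact 0.5 to 0.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to hard 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Toy values, not taken from the model
y_hat = [0.91, 0.12, 0.55, 0.49]
y_true = [1, 0, 0, 0]

y_pred = to_labels(y_hat)
score = accuracy(y_true, y_pred)
print(y_pred)   # [1, 0, 1, 0]
print(score)    # 0.75
```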
 