# NLP with LSTMs for sentiment classification (w2v and char based)

This notebook experiments with long short-term memory (LSTM) networks for natural language processing (NLP).

Tasks performed with LSTMs:
- 1) Sentiment analysis for the IMDB movie review dataset (with a word2vec-based model)
- 2) Sentiment analysis for the IMDB movie review dataset (with a character-based model)

In [0]:
# Load libraries
import numpy as np
import pandas as pd
pd.options.display.width=120
#pd.set_option('display.width',75)
#pd.options.display.max_columns=8
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize
from collections import Counter
from collections import OrderedDict
import copy
from sklearn.feature_extraction.text import TfidfVectorizer

There are two alternatives, listed below. Let's use tf.keras here.
- multi-backend Keras
- tf.keras

In [3]:
import tensorflow as tf
from tensorflow import keras
tf.__version__

'1.15.0'

In [4]:
keras.__version__

'2.2.4-tf'

In [0]:
# These are for word based model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten,LSTM

In [6]:
# This is an addition for character based model
from tensorflow.keras.layers import Embedding


## 1) Sentiment analysis with LSTMs (based on word2vec embeddings)

Stanford's AI department provides a dataset of IMDB movie reviews at https://ai.stanford.edu/%7eamaas/data/sentiment
- This is a dataset for binary sentiment classification  
- 25,000 highly polarised movie reviews for training, and 25,000 for testing. 
- There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.   

Published papers based on this dataset:
- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

### a) Load and preprocess the imdb data

Download the original dataset. We'll use the train directory only, which contains text files in pos and neg folders.

In [0]:
import glob
import os

from random import shuffle

def preprocess_data(filepath):
    """
    This is dependent on your training data source but the idea is to have it as general as possible.
    """
    positive_path=os.path.join(filepath,'pos')
    negative_path=os.path.join(filepath,'neg')
    pos_label=1
    neg_label=0
    dataset=[]
    for filename in glob.glob(os.path.join(positive_path,'*.txt')):
        with open(filename,'r') as f:
            dataset.append((pos_label,f.read()))
    for filename in glob.glob(os.path.join(negative_path,'*.txt')):
        with open(filename,'r') as f:
            dataset.append((neg_label,f.read()))  
    shuffle(dataset)
    return(dataset)

In [0]:
dataset=preprocess_data('imdb')
dataset[0]

(1,
 "FUTZ is the only show preserved from the experimental theatre movement in New York in the 1960s (the origins of Off Off Broadway). Though it's not for everyone, it is a genuinely brilliant, darkly funny, even more often deeply disturbing tale about love, sex, personal liberty, and revenge, a serious morality tale even more relevant now in a time when Congress wants to outlaw gay marriage by trashing our Constitution. The story is not about being gay, though -- it's about love and sex that don't conform to social norms and therefore must be removed through violence and hate. On the surface, it tells the story of a man who falls in love with a pig, but like any great fable, it's not really about animals, it's about something bigger -- stifling conformity in America.<br /><br />The stage version won international acclaim in its original production, it toured the U.S. and Europe, and with others of its kind, influenced almost all theatre that came after it. Luckily, we have preserved

In [0]:
import pickle

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/imdb_dataset','wb') as fp:
  pickle.dump(dataset,fp)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/imdb_dataset','rb') as fp:
  dataset=pickle.load(fp) 

### b) Tokenize and vectorise the imdb data

In [0]:
from nltk.tokenize import TreebankWordTokenizer
import os

In [29]:
import gensim.downloader as api
wv=api.load('word2vec-google-news-300')





In [0]:
def tokenize_and_vectorize(dataset):
    """Tokenize each review and look up the word2vec vector of every known token"""
    tokenizer=TreebankWordTokenizer()
    vectorized_data=[]
    for sample in dataset:
        tokens=tokenizer.tokenize(sample[1])
        sample_vecs=[]
        for token in tokens:
            try:
                sample_vecs.append(wv[token])
            except KeyError:
                pass # No matching token in the Google w2v vocab
        vectorized_data.append(sample_vecs)
    return vectorized_data
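
The lookup-and-skip behaviour of `tokenize_and_vectorize` can be tried out without downloading the Google News vectors. The following is a minimal sketch under stated assumptions: `toy_wv` is a hypothetical three-dimensional dictionary standing in for `wv`, and `str.split` replaces the Treebank tokenizer.

```python
import numpy as np

# Hypothetical 3-dimensional toy embeddings standing in for the 300-dimensional
# Google News vectors loaded into `wv` above.
toy_wv = {
    "good": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.2, 0.7, 0.1]),
}

def toy_vectorize(text, embeddings):
    """Look up each whitespace-separated token and skip unknown ones."""
    vectors = []
    for token in text.split():  # simplification: the notebook uses TreebankWordTokenizer
        try:
            vectors.append(embeddings[token])
        except KeyError:
            pass  # token not in the vocabulary, so it contributes nothing
    return vectors

vecs = toy_vectorize("a good movie", toy_wv)
print(len(vecs))  # 2, because "a" has no vector in the toy vocabulary
```

A consequence of skipping is that the vectorized sequence can be shorter than the token sequence, and even empty if no token is known.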

In [0]:
def collect_expected(dataset):
    """Peel off the target values from the dataset"""
    expected=[]
    for sample in dataset:
        expected.append(sample[0])
    return expected

In [0]:
# Pass the imdb data into the two functions
vectorized_data=tokenize_and_vectorize(dataset)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/vectorized_data','wb') as fp:
  pickle.dump(vectorized_data,fp)

In [0]:
expected=collect_expected(dataset)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/expected','wb') as fp:
  pickle.dump(expected,fp)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/vectorized_data','rb') as fp:
  vectorized_data=pickle.load(fp)

In [0]:
# Let's reduce the dataset, otherwise there is not enough RAM available
vectorized_data=vectorized_data[0:4999]

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/expected','rb') as fp:
  expected=pickle.load(fp)

In [0]:
# Let's reduce the dataset, otherwise there is not enough RAM available
expected=expected[0:4999]
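
A rough back-of-the-envelope calculation shows why the full dataset does not fit into memory. The sketch below assumes that every review ends up as a dense array of 400 tokens with 300 values each, which is what the padding step later in this notebook produces. The helper name `dense_array_bytes` is made up for illustration.

```python
def dense_array_bytes(n_samples, maxlen, embedding_dims, bytes_per_value):
    """Size in bytes of a dense (n_samples, maxlen, embedding_dims) array."""
    return n_samples * maxlen * embedding_dims * bytes_per_value

GIB = 1024 ** 3
for n_samples in (25000, 4999):
    for bytes_per_value in (4, 8):  # float32 and float64
        size = dense_array_bytes(n_samples, 400, 300, bytes_per_value)
        print(n_samples, bytes_per_value, round(size / GIB, 2))
```

With all 25,000 reviews the array alone takes about 11 GiB in float32, before counting the Python lists it is built from.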

### c) Create training and test set

In [0]:
# Data is already shuffled so the splitting can be done through slicing.
split_point=int(len(vectorized_data)*0.8)
x_train=vectorized_data[:split_point]
x_test=vectorized_data[split_point:]
y_train=expected[:split_point]
y_test=expected[split_point:]

### d) Padding and truncating the token sequences

In [0]:
# LSTM parameters (otherwise the same as in CNN case, except no filters nor kernels nor hidden_dims)
maxlen=400 # max length of the sequences (to be padded/truncated to this length)
batch_size=32 # number of samples before backpropagating and updating the weights
embedding_dims=300
epochs=2
num_neurons=50 # number of units in the LSTM layer (the size of its hidden state)

In [0]:
def pad_trunc(data,maxlen):
    """ For a given dataset pad with zero vectors or truncate to maxlength"""
    new_data=[]
    # create a vector of 0s the length of word embedding vectors
    zero_vector=[]
    for _ in range(len(data[0][0])):
        zero_vector.append(0.0)
    for sample in data:
        if len(sample) > maxlen:
            temp=sample[:maxlen]
        elif len(sample)<maxlen:
            temp=list(sample) # copy, so that the original sample is not extended in place
            # Append the appropriate number of zero_vectors to the list
            additional_elems=maxlen-len(sample)
            for _ in range(additional_elems):
                temp.append(zero_vector)
        else:
            temp=sample
        new_data.append(temp)
    return new_data

In [0]:
#Alternative way to define the pad_trunc function
#def pad_trunc_2(data,maxlen,emb_dim):
#    new_data=[smp[:maxlen]+[[0.]*emb_dim]*(maxlen-len(smp)) for smp in data]
#    return new_data

In [0]:
# Perform the padding and truncation
# Recurrent layers (SimpleRNN, LSTM or GRU) can in principle process sequences of varying length,
# so truncating/padding is not normally needed for them.
# Here we do it to get results that can be compared with the CNN case, and because the
# Flatten and Dense layers used below require fixed-length input (as the Dense network did in the CNN case).
x_train=pad_trunc(x_train,maxlen)
x_test=pad_trunc(x_test,maxlen)
x_train=np.reshape(x_train,(len(x_train),maxlen,embedding_dims))
x_test=np.reshape(x_test,(len(x_test),maxlen,embedding_dims))
y_train=np.array(y_train)
y_test=np.array(y_test)
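
The padding, truncation and reshaping can also be done in one step by filling a preallocated NumPy array. This is a sketch of an alternative, not the function used above. The name `pad_trunc_array` is hypothetical, and float32 is an assumed dtype chosen to save memory.

```python
import numpy as np

def pad_trunc_array(data, maxlen, emb_dim):
    """Return a (len(data), maxlen, emb_dim) float32 array, zero-padded at the end."""
    out = np.zeros((len(data), maxlen, emb_dim), dtype=np.float32)
    for i, sample in enumerate(data):
        n = min(len(sample), maxlen)
        if n:  # samples without any known token stay all-zero
            out[i, :n] = np.asarray(sample[:n], dtype=np.float32)
    return out

toy = [
    [[1.0, 1.0]],                                       # shorter than maxlen: padded
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],   # longer than maxlen: truncated
]
batch = pad_trunc_array(toy, maxlen=3, emb_dim=2)
print(batch.shape)  # (2, 3, 2)
```

This variant never modifies the input lists and also copes with samples that contain no vectors at all.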

### e) Build the LSTM network

In [21]:
model=Sequential([
    LSTM(num_neurons,return_sequences=True,  # we want output at each time step
        input_shape=(maxlen,embedding_dims)),
    Dropout(0.2),
    Flatten(), # Data needs to be flattened, since output from RNN/LSTM network is two dimensional: 400*50
    Dense(1,activation="sigmoid") # If several classes to predict: Dense(num_classes,activation="softmax")
])



In [22]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 400, 50)           70200     
_________________________________________________________________
dropout (Dropout)            (None, 400, 50)           0         
_________________________________________________________________
flatten (Flatten)            (None, 20000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 20001     
Total params: 90,201
Trainable params: 90,201
Non-trainable params: 0
_________________________________________________________________
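
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with one weight per input value, one weight per recurrent unit and one bias. The sketch below reproduces the numbers for this model. The two helper functions are written for this illustration only.

```python
def lstm_param_count(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs

lstm_params = lstm_param_count(units=50, input_dim=300)
dense_params = dense_param_count(inputs=400 * 50, outputs=1)
print(lstm_params, dense_params, lstm_params + dense_params)  # 70200 20001 90201
```

Note that the LSTM count does not depend on `maxlen`, whereas the Dense count does, because the Dense layer sees the flattened output of all 400 time steps.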


In [23]:
model.compile(loss='binary_crossentropy',optimizer='rmsprop',metrics=['accuracy']) # If several classes: loss='categorical_crossentropy'



### f) Train and save/load the model

In [0]:
# Set the NumPy seed to make runs more reproducible.
# Note: the initial weights are drawn by TensorFlow's own random generator, which this seed does not control.
np.random.seed(1337)

In [25]:
# Train the model
model.fit(x_train,y_train,batch_size=batch_size, epochs=epochs,validation_data=(x_test,y_test))

Train on 3999 samples, validate on 1000 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f8d5dc406d8>

In [0]:
# Save the model
model_structure=model.to_json()
with open("/content/gdrive/My Drive/ColabFolder/imdb/lstm_model1.json","w") as json_file:
    json_file.write(model_structure)  # this only saves the structure, not the weights
model.save_weights("/content/gdrive/My Drive/ColabFolder/imdb/lstm_weights1.h5")

In [0]:
# Load the model
from tensorflow.keras.models import model_from_json
with open("/content/gdrive/My Drive/ColabFolder/imdb/lstm_model1.json","r") as json_file:
    json_string=json_file.read()  # this only restores the structure, not the weights
model=model_from_json(json_string)
model.load_weights("/content/gdrive/My Drive/ColabFolder/imdb/lstm_weights1.h5")

### g) Test the model by predicting

In [0]:
sample1="I hate that the dismal weather had me down for so long, \
when will it break! Ugh, when does happiness return? The sun is blinding \
and the puffy clouds are too thin. I can't wait for the weekend."

In [0]:
del vectorized_data
del expected
del x_train; del x_test; del y_train; del y_test;

In [0]:
vec_list=tokenize_and_vectorize([(1,sample1)])  # target value = 1 is just dummy value, not used here
test_vec_list=pad_trunc(vec_list,maxlen)
test_vec=np.reshape(test_vec_list,(len(test_vec_list),maxlen,embedding_dims))


In [37]:
print("Sample's sentiment, 1-pos, 0-neg : ")
model.predict_classes(test_vec) # returns class 

Sample's sentiment, 1-pos, 0-neg : 


array([[0]], dtype=int32)

In [38]:
print("Raw output of sigmoid function : ")
model.predict(test_vec) # returns probability (>0.5 ->1, <0.5 ->0)

Raw output of sigmoid function : 


array([[0.38133448]], dtype=float32)
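
The relation between the two outputs is a simple threshold at 0.5. The following sketch applies it to the probability printed above, assuming that the class prediction for a single sigmoid output amounts to this comparison.

```python
import numpy as np

probs = np.array([[0.38133448]], dtype=np.float32)  # raw sigmoid output shown above
classes = (probs > 0.5).astype("int32")             # 1 = positive, 0 = negative
print(classes)  # [[0]]
```

A probability of 0.38 is not far from 0.5, so the model is only moderately confident that the sample is negative.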

### h) Hyperparameter tuning

What is possible:
- Padding/truncating is in fact not generally required for LSTMs. It is required when a Dense network follows directly, as in the CNN case or with the Flatten and Dense layers used here, since a Dense network requires fixed-length input.

A Dense network can of course be combined with an LSTM network as well.
- Then the LSTM can be considered to create a thought vector that it feeds to the Dense network.
- In that case it is good to understand which sequence length (maxlen) is optimal, since it determines the length of the vector passed to the Dense network.
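
To illustrate the thought vector idea, the sketch below runs a single LSTM layer written in plain NumPy over two sequences of different length and keeps only the final hidden state. It is an illustration under assumptions, not the Keras implementation: the weights `W`, `U` and `b` are random and untrained, and all names are made up. In Keras the corresponding setting is `return_sequences=False`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_state(sequence, W, U, b):
    """Run one LSTM layer over a (timesteps, input_dim) array and return the final hidden state."""
    units = U.shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    for x_t in sequence:
        z = x_t @ W + h @ U + b        # pre-activations of all four gates
        i, f, g, o = np.split(z, 4)    # input gate, forget gate, candidate, output gate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

rng = np.random.default_rng(0)
input_dim, units = 300, 50
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))  # random, untrained weights
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

short_review = rng.normal(size=(7, input_dim))
long_review = rng.normal(size=(120, input_dim))
print(lstm_last_state(short_review, W, U, b).shape)  # (50,)
print(lstm_last_state(long_review, W, U, b).shape)   # (50,)
```

Both reviews end up as a vector of 50 values, so a Dense layer placed after such a layer would not need padded input.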

In [0]:
# Check how many samples would be padded or truncated for a given maxlen, to find a better sequence length.
def test_len(data,maxlen):
    total_len=truncated=exact=padded=0
    for sample in data:
        total_len+=len(sample)
        if len(sample)>maxlen:
            truncated+=1
        elif len(sample)<maxlen:
            padded+=1
        else:
            exact+=1
    print('Padded: {}'.format(padded))
    print('Equal: {}'.format(exact))
    print('Truncated: {}'.format(truncated))
    print('Avg length: {}'.format(total_len/len(data)))

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/vectorized_data','rb') as fp:
  vectorized_data=pickle.load(fp)

In [0]:
# Let's reduce the dataset, otherwise there is not enough RAM available
vectorized_data=vectorized_data[0:4999]

In [42]:
#dataset=preprocess_data('imdb')
#vectorized_data=tokenize_and_vectorize(dataset)
test_len(vectorized_data,400)

Padded: 4489
Equal: 0
Truncated: 510
Avg length: 206.71374274854972
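
The same kind of report can be computed for several candidate values of `maxlen` at once. The sketch below uses made-up token counts, not IMDB statistics, and the function name `length_report` is hypothetical.

```python
import numpy as np

def length_report(lengths, maxlen):
    """Fraction of samples that would be truncated or padded at a given maxlen."""
    lengths = np.asarray(lengths)
    return {
        "truncated": float(np.mean(lengths > maxlen)),
        "padded": float(np.mean(lengths < maxlen)),
        "mean": float(lengths.mean()),
        "median": float(np.median(lengths)),
    }

toy_lengths = [50, 120, 180, 210, 400, 900]  # made-up token counts, for illustration only
for candidate in (200, 400):
    print(candidate, length_report(toy_lengths, candidate))
```

For the real data, `lengths` could be built as `[len(sample) for sample in vectorized_data]`. The median is worth comparing with the mean, since a few very long reviews can pull the mean upwards.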


In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/expected','rb') as fp:
  expected=pickle.load(fp)

In [0]:
# Let's reduce the dataset, otherwise there is not enough RAM available
expected=expected[0:4999]

In [0]:
# Let's define max length to be 200, which is close to average length
maxlen=200  # other parameters remain the same.

# reperform the padding /truncation for x_train and x_test (not needed for y_train nor y_test)
split_point=int(len(vectorized_data)*0.8)
x_train=vectorized_data[:split_point]
x_test=vectorized_data[split_point:]
y_train=expected[:split_point]
y_test=expected[split_point:]

x_train=pad_trunc(x_train,maxlen)
x_test=pad_trunc(x_test,maxlen)
x_train=np.reshape(x_train,(len(x_train),maxlen,embedding_dims))
x_test=np.reshape(x_test,(len(x_test),maxlen,embedding_dims))


In [0]:
# More optimally sized LSTM network. Structure is the same, just maxlen value is different.
model=Sequential([
    LSTM(num_neurons,return_sequences=True,  # we want output at each time step
        input_shape=(maxlen,embedding_dims)),
    Dropout(0.2),
    Flatten(), # Data needs to be flattened, since output from RNN/LSTM network is two dimensional: 200*50
    Dense(1,activation="sigmoid") # If several classes to predict: Dense(num_classes,activation="softmax")
])

In [0]:
model.compile(loss='binary_crossentropy',optimizer='rmsprop',metrics=['accuracy']) # If several classes: loss='categorical_crossentropy'

In [48]:
# Train the optimized LSTM network
np.random.seed(1337)
model.fit(x_train,y_train,batch_size=batch_size, epochs=epochs,validation_data=(x_test,y_test))

Train on 3999 samples, validate on 1000 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f8cba96bef0>

In [0]:
# Benefit of the shorter maxlen: training is faster, while accuracy did not change very much.

In [0]:
# Save the model with a different name
model_structure=model.to_json()
with open("/content/gdrive/My Drive/ColabFolder/imdb/lstm_model2.json","w") as json_file:
    json_file.write(model_structure)  # this only saves the structure, not the weights
model.save_weights("/content/gdrive/My Drive/ColabFolder/imdb/lstm_weights2.h5")

In [0]:
del x_train; del x_test; del y_train; del y_test;
del vectorized_data

In [0]:
del wv

In [0]:
del model

## 2) Sentiment analysis with LSTMs (based on characters)

Here the sequence of characters is used for predicting the sentiment, instead of words (or word2vec embeddings).

### a) Load the data

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/imdb_dataset','rb') as fp:
  dataset=pickle.load(fp) 

In [0]:
# Let's reduce the dataset, otherwise there is not enough RAM available
dataset=dataset[0:4999]

In [0]:
# Load the data
#dataset=preprocess_data('imdb')
# Extract the target (y values)
#expected=collect_expected(dataset)

In [0]:
# Calculate the average sample length
def avg_len(data):
    total_len=0
    for sample in data:
        total_len+=len(sample[1])
    return total_len/len(data)

In [61]:
avg_len(dataset) # This gives the average length in characters for the 4999 samples that we use here
# Result with the whole dataset: 1325.1. Thus the sequences for the character-based LSTM are much longer than for the word-based one.

1334.4052810562112

### b) Insert UNK (unknown) characters

In [0]:
# Prepare the strings for a character based model
# UNK is used as a single character for everything that doesn't match the VALID list (could be e.g. HTML tags)
def clean_data(data):
    """Shift to lower case, replace unknowns with UNK and listify"""
    new_data=[]
    VALID='abcdefghijklmnopqrstuvwxyz0123456789"\'?!.,:; '
    for sample in data:
        new_sample=[]
        for char in sample[1].lower():
            if char in VALID:
                new_sample.append(char)
            else:
                new_sample.append('UNK')
        new_data.append(new_sample)
    return new_data
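
A short string shows what the cleaning step does to HTML residue such as `<br />`. This is a minimal sketch that works on one string instead of the dataset tuples. Here `VALID` is written out with all 26 lower-case letters, and the name `clean_text` is made up for the example.

```python
VALID = set('abcdefghijklmnopqrstuvwxyz0123456789"\'?!.,:; ')

def clean_text(text):
    """Lower-case a string and replace every character outside VALID with 'UNK'."""
    return [char if char in VALID else 'UNK' for char in text.lower()]

print(clean_text("Wow!<br />"))
# ['w', 'o', 'w', '!', 'UNK', 'b', 'r', ' ', 'UNK', 'UNK']
```

Only the characters `<`, `/` and `>` become `UNK`. The letters `b` and `r` inside the tag are kept as ordinary characters.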

### c) Pad and truncate the character sequences

In [0]:
# Let's use maxlen that is a bit higher than the avg length.
def char_pad_trunc(data,maxlen=1500):
    """Truncate to maxlen or add in PAD tokens"""
    new_dataset=[]
    for sample in data:
        if len(sample) > maxlen:
            new_data=sample[:maxlen]
        elif len(sample) < maxlen:
            pads=maxlen-len(sample)
            new_data=sample+['PAD']*pads
        else:
            new_data=sample
        new_dataset.append(new_data)
    return new_dataset

### d) Create character based model vocabulary

In [0]:
# Create characters mapped to integer indices, and vice versa.
def create_dicts(data):
    """Modified from Keras LSTM example"""
    chars=set()
    for sample in data:
        chars.update(set(sample))
    char_indices=dict((c,i) for i,c in enumerate(chars))
    indices_char=dict((i,c) for i,c in enumerate(chars))
    return char_indices,indices_char
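
One thing to keep in mind is that the iteration order of a Python set of strings can differ between interpreter sessions, so the indices produced above are not guaranteed to match when saved weights are reloaded later. A possible variant, sketched here under the hypothetical name `create_sorted_dicts`, sorts the tokens first.

```python
def create_sorted_dicts(data):
    """Variant of create_dicts that sorts the tokens so indices are stable across runs."""
    tokens = sorted({token for sample in data for token in sample})
    token_indices = {token: i for i, token in enumerate(tokens)}
    indices_token = {i: token for token, i in token_indices.items()}
    return token_indices, indices_token

toy = [['h', 'i', 'PAD'], ['o', 'h', 'UNK']]
token_indices, indices_token = create_sorted_dicts(toy)
print(token_indices)  # {'PAD': 0, 'UNK': 1, 'h': 2, 'i': 3, 'o': 4}
```

Alternatively, the dictionaries could be pickled together with the model, in the same way as the dataset is pickled in this notebook.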

### e) One-hot encoding for characters

In [0]:
def onehot_encode(dataset,char_indices,maxlen=1500):
    """
    One-hot encode the tokens
    
    Args:
        dataset list of lists of tokens
        char_indices dict of (key=character,value=index)
        maxlen int Length of each sample
    Return:
        np array of shape (samples,tokens,encoding length)
    """
    X=np.zeros((len(dataset),maxlen,len(char_indices.keys())))
    for i,sentence in enumerate(dataset):
        for t,char in enumerate(sentence):
            X[i,t,char_indices[char]]=1
    return X                               
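
A tiny example makes the shape of the encoded array easier to see. The sketch below repeats the same loop on a hypothetical three-token vocabulary with two samples of length 3.

```python
import numpy as np

toy_indices = {'PAD': 0, 'a': 1, 'b': 2}          # hypothetical 3-token vocabulary
toy_data = [['a', 'b', 'PAD'], ['b', 'b', 'a']]   # two samples, already padded to length 3

X = np.zeros((len(toy_data), 3, len(toy_indices)))
for i, sample in enumerate(toy_data):
    for t, token in enumerate(sample):
        X[i, t, toy_indices[token]] = 1

print(X.shape)        # (2, 3, 3)
print(X[0].tolist())  # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

Every time step holds exactly one 1, including padded positions, since `PAD` is a token of the vocabulary like any other.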

### f) Preprocess the data (clean,pad/trunc,vocabulary,one-hot encoding)

In [0]:
listified_data=clean_data(dataset) #Insert also the UNKs
common_length_data=char_pad_trunc(listified_data,maxlen=1500)
char_indices,indices_char=create_dicts(common_length_data)
encoded_data=onehot_encode(common_length_data,char_indices,1500)

### g) Create training and test sets

In [0]:
split_point=int(len(encoded_data)*0.8)
x_train=encoded_data[:split_point]
x_test=encoded_data[split_point:]
# expected still holds the 4999 labels loaded in section 1 h)
y_train=np.array(expected[:split_point])
y_test=np.array(expected[split_point:])

### h) Build a character-based LSTM

In [0]:
num_neurons=40
maxlen=1500

In [0]:
# Otherwise the same LSTM model except new values for num_neurons and maxlen
# Note, also the second value in input_shape is no longer word2vec dimension of 300
model=Sequential([
    LSTM(num_neurons,return_sequences=True,  # we want output at each time step
        input_shape=(maxlen,len(char_indices.keys()))),  # length of sequences * length of one-hot encoding
    Dropout(0.2),
    Flatten(), # Data needs to be flattened, since output from RNN/LSTM network is two dimensional: 1500*40
    Dense(1,activation="sigmoid") # If several classes to predict: Dense(num_classes,activation="softmax")
])

In [0]:
model.compile(loss='binary_crossentropy',optimizer='rmsprop',metrics=['accuracy']) # If several classes: loss='categorical_crossentropy'

### i) Train the character based LSTM

In [0]:
batch_size=32
epochs=10

In [72]:
np.random.seed(1337)
model.fit(x_train,y_train,batch_size=batch_size, epochs=epochs,validation_data=(x_test,y_test))

Train on 3999 samples, validate on 1000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f8d524bd898>

In [0]:
# The training set is apparently too small for this case.
# Training accuracy is 99 % but validation accuracy only 52.9 %, which clearly indicates overfitting.

In [0]:
# Save the model with a different name
model_structure=model.to_json()
with open("/content/gdrive/My Drive/ColabFolder/imdb/lstm_model3.json","w") as json_file:
    json_file.write(model_structure)  # this only saves the structure, not the weights
model.save_weights("/content/gdrive/My Drive/ColabFolder/imdb/lstm_weights3.h5")