# CNNs with word2vec embedding for sentiment classification

This notebook experiments with convolutional neural networks (CNNs) for natural language processing (NLP).

Tasks performed with CNNs:
- Sentiment analysis with the IMDB movie review dataset

In [0]:
# Load libraries
import numpy as np
import pandas as pd
pd.options.display.width=120
#pd.set_option('display.width',75)
#pd.options.display.max_columns=8
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
#import nlpia
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize
from collections import Counter
from collections import OrderedDict
import copy
from sklearn.feature_extraction.text import TfidfVectorizer

Keras is available in the two variants below. This notebook uses tf.keras.
- multi-backend Keras
- tf.keras

In [14]:
import tensorflow as tf
from tensorflow import keras
tf.__version__

'1.15.0'

In [15]:
keras.__version__

'2.2.4-tf'

In [0]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D

## Sentiment analysis with CNNs

The Stanford AI department provides a dataset of IMDB movie reviews at https://ai.stanford.edu/%7eamaas/data/sentiment
- This is a dataset for binary sentiment classification.
- It contains 25,000 highly polarised movie reviews for training and 25,000 for testing.
- Additional unlabeled data is available as well. Both raw text and an already processed bag-of-words format are provided.

Published papers based on this dataset:
- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

### a) Load and preprocess the imdb data

Download the original dataset. We'll use the train directory only, which contains text files in pos and neg folders.

In [0]:
import glob
import os

from random import shuffle

def preprocess_data(filepath):
    """Load the pos and neg reviews under filepath as shuffled (label, text) tuples"""
    positive_path=os.path.join(filepath,'pos')
    negative_path=os.path.join(filepath,'neg')
    pos_label=1
    neg_label=0
    dataset=[]
    for filename in glob.glob(os.path.join(positive_path,'*.txt')):
        with open(filename,'r',encoding='utf-8') as f:
            dataset.append((pos_label,f.read()))
    for filename in glob.glob(os.path.join(negative_path,'*.txt')):
        with open(filename,'r',encoding='utf-8') as f:
            dataset.append((neg_label,f.read()))
    shuffle(dataset)
    return dataset

In [0]:
dataset=preprocess_data('/content/gdrive/My Drive/ColabFolder/imdb')

In [0]:
import pickle

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/imdb_dataset','wb') as fp:
  pickle.dump(dataset,fp)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/imdb_dataset','rb') as fp:
  dataset=pickle.load(fp)

In [0]:
dataset[0]

(0,
 'The Comeback starts off looking promising, with a brutal death scene by a mask wearing killer. The mask itself is pretty cool too, and looks almost identical to the one used in the 1990\'s slasher film "Granny". From then on the film is mostly boring. We get a few more deaths, which again are good, but there\'s not enough of them. The reason the deaths are so good is because they are frenzied and bloody. The story behind the film is actually rather interesting and would have worked very well had it not been so boring for the most part. <br /><br />I would avoid this unless you\'re a die-hard collector - there\'s not enough here to even make it an average slasher flick.')

In [0]:
dataset[1]

(1,
 'I found this movie hilarious. The spoofs on other popular movies of that time were some of the funniest I have seen in this sort of movie. Give it a try. If you saw the movies that this movie is spoofing, and you get the humor, you should enjoy the movie.<br /><br />I (and the others who watched the movie with me) felt the funniest part in the movie (this is not a spoiler because I will NOT tell you what actually happens) was a scene with the "flashy thingy" from MIB. When they first discover the device and do not know what it is does... and then again later in the movie... you\'ll understand when you get there.<br /><br />My only complaint about the movie is that I have never been able to find it in DVD so that I could buy a copy.')

In [0]:
len(dataset)

25000

### b) Tokenize and vectorise the imdb data

In [0]:
import gensim.downloader as api

In [35]:
wv=api.load('word2vec-google-news-300')




In [0]:
def tokenize_and_vectorize(dataset):
    """Tokenize each review and replace every known token with its word2vec vector"""
    tokenizer=TreebankWordTokenizer()
    vectorized_data=[]
    for sample in dataset:
        tokens=tokenizer.tokenize(sample[1])  # sample is a (label, text) tuple
        sample_vecs=[]
        for token in tokens:
            try:
                sample_vecs.append(wv[token])
            except KeyError:
                pass # No matching token in the Google w2v vocab, skip it
        vectorized_data.append(sample_vecs)
    return vectorized_data
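
The effect of skipping out-of-vocabulary tokens can be illustrated with a small sketch. It assumes a plain dictionary of 4-dimensional toy vectors (`toy_wv`, hypothetical) in place of the 300-dimensional word2vec model, and a simple whitespace split in place of the Treebank tokenizer.

```python
import numpy as np

# Hypothetical toy embeddings standing in for the word2vec model (assumption, not the real vocab)
toy_wv = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "movie": np.array([0.5, 0.6, 0.7, 0.8]),
}

def vectorize_tokens(tokens, embeddings):
    """Return the vectors of the tokens found in `embeddings`, dropping unknown tokens."""
    return [embeddings[t] for t in tokens if t in embeddings]

tokens = "a good movie !".split()
vectors = vectorize_tokens(tokens, toy_wv)
print(len(tokens), len(vectors))  # 4 2
```

A vectorized review can therefore be shorter than its token list, and can even be empty if none of its tokens are in the vocabulary.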

In [0]:
def collect_expected(dataset):
    """Peel off the target values from the dataset"""
    expected=[]
    for sample in dataset:
        expected.append(sample[0])
    return expected

In [0]:
# Vectorize the IMDB reviews (the labels are collected separately below)
vectorized_data=tokenize_and_vectorize(dataset)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/vectorized_data','wb') as fp:
  pickle.dump(vectorized_data,fp)

In [0]:
expected=collect_expected(dataset)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/expected','wb') as fp:
  pickle.dump(expected,fp)

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/vectorized_data','rb') as fp:
  vectorized_data=pickle.load(fp)

In [0]:
# Keep only the first 4,999 samples, otherwise there is not enough RAM available
vectorized_data=vectorized_data[0:4999]

In [0]:
with open ('/content/gdrive/My Drive/ColabFolder/imdb/expected','rb') as fp:
  expected=pickle.load(fp)

In [0]:
# Keep the labels of the same 4,999 samples so that data and labels stay aligned
expected=expected[0:4999]

### c) Create training and test set

In [0]:
# Data is already shuffled so the splitting can be done through slicing.
split_point=int(len(vectorized_data)*0.8)
x_train=vectorized_data[:split_point]
x_test=vectorized_data[split_point:]
y_train=expected[:split_point]
y_test=expected[split_point:]
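
As a quick check of the split arithmetic, the sketch below recomputes the set sizes for the 4,999 samples kept above. The variable names `n_samples`, `n_train` and `n_test` are introduced here for illustration only.

```python
n_samples = 4999                    # number of samples kept by the [0:4999] slices
split_point = int(n_samples * 0.8)  # int() truncates the fractional part
n_train = split_point
n_test = n_samples - split_point
print(n_train, n_test)  # 3999 1000
```

These sizes match the "Train on 3999 samples, validate on 1000 samples" line printed during training.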

### d) Padding and truncating the token sequences

In [0]:
# CNN parameters
maxlen=400 # max length of the sequences (to be padded/truncated to this length)
batch_size=32 # number of samples before backpropagating and updating the weights
embedding_dims=300
filters=250
kernel_size=3
hidden_dims=250 # number of neurons in the dense network at the end of the chain
epochs=2

In [0]:
def pad_trunc(data,maxlen):
    """For a given dataset, pad with zero vectors or truncate to maxlen"""
    new_data=[]
    # Create a zero vector with the same length as the word embedding vectors
    # (assumes that the first sample contains at least one vector)
    zero_vector=[0.0]*len(data[0][0])
    for sample in data:
        if len(sample)>maxlen:
            temp=sample[:maxlen]
        elif len(sample)<maxlen:
            temp=list(sample)  # copy, so that padding does not modify the caller's sample
            # Append the appropriate number of zero vectors to the list
            additional_elems=maxlen-len(sample)
            for _ in range(additional_elems):
                temp.append(zero_vector)
        else:
            temp=sample
        new_data.append(temp)
    return new_data

In [0]:
#Alternative way to define the pad_trunc function
#def pad_trunc_2(data,maxlen,emb_dim):
#    new_data=[smp[:maxlen]+[[0.]*emb_dim]*(maxlen-len(smp)) for smp in data]
#    return new_data
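
Another possible variant, shown here only as a sketch, writes the samples directly into a preallocated NumPy array. This gives the `(samples, maxlen, embedding_dims)` shape without a separate reshape and also handles samples with no vectors at all. The function name `pad_trunc_array` and the toy data are assumptions for illustration.

```python
import numpy as np

def pad_trunc_array(data, maxlen, emb_dim):
    """Pad with zero vectors or truncate each sample to maxlen; returns a float32 array."""
    out = np.zeros((len(data), maxlen, emb_dim), dtype=np.float32)
    for i, sample in enumerate(data):
        n = min(len(sample), maxlen)
        if n > 0:
            # Copy at most maxlen vectors; the remaining rows stay zero
            out[i, :n] = np.asarray(sample[:maxlen], dtype=np.float32)
    return out

toy = [
    [[1.0, 1.0]] * 5,  # longer than maxlen: truncated
    [[2.0, 2.0]] * 2,  # shorter than maxlen: zero-padded
    [],                # no known tokens: all zeros
]
padded = pad_trunc_array(toy, maxlen=3, emb_dim=2)
print(padded.shape)  # (3, 3, 2)
```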

In [0]:
# Perform the padding and truncation
x_train=pad_trunc(x_train,maxlen)
x_test=pad_trunc(x_test,maxlen)
x_train=np.reshape(x_train,(len(x_train),maxlen,embedding_dims))
x_test=np.reshape(x_test,(len(x_test),maxlen,embedding_dims))
y_train=np.array(y_train)
y_test=np.array(y_test)

### e) Build the CNN architecture

In [18]:
model=Sequential([
    Conv1D(filters=filters,kernel_size=kernel_size,padding="valid",activation="relu",strides=1,input_shape=(maxlen,embedding_dims)),
    GlobalMaxPooling1D(), # takes the maximum over the whole sequence for each filter (no pool size)
    Dense(hidden_dims,activation="relu"),
    Dropout(0.2),
    Dense(1,activation="sigmoid")   # For several mutually exclusive classes: Dense(num_classes,activation="softmax")
])


In [19]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d (Conv1D)              (None, 398, 250)          225250    
_________________________________________________________________
global_max_pooling1d (Global (None, 250)               0         
_________________________________________________________________
dense (Dense)                (None, 250)               62750     
_________________________________________________________________
dropout (Dropout)            (None, 250)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 251       
Total params: 288,251
Trainable params: 288,251
Non-trainable params: 0
_________________________________________________________________
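
The output shapes and parameter counts in the summary can be reproduced by hand from the CNN parameters. The sketch below assumes the standard counting of weights plus one bias per output unit; the variable names are introduced here for illustration.

```python
maxlen, embedding_dims = 400, 300
filters, kernel_size, hidden_dims = 250, 3, 250

# "valid" padding with stride 1 shortens the sequence by kernel_size - 1
conv_out_len = maxlen - kernel_size + 1
# Each filter has kernel_size * embedding_dims weights plus one bias
conv_params = kernel_size * embedding_dims * filters + filters
# Global max pooling leaves one value per filter, which feeds the dense layer
dense_params = filters * hidden_dims + hidden_dims
output_params = hidden_dims * 1 + 1
total = conv_params + dense_params + output_params
print(conv_out_len, conv_params, dense_params, output_params, total)
# 398 225250 62750 251 288251
```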


In [20]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy']) # If several classes: loss='categorical_crossentropy'


### f) Train the model

In [0]:
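# Set the NumPy seed to help reproduce results. Note that the model above is already built,
# so for reproducible initial weights the seeds should be set before building it, and
# TensorFlow's own seed must be set as well (tf.compat.v1.set_random_seed).
np.random.seed(1337)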

In [23]:
# Train the model
model.fit(x_train,y_train,batch_size=batch_size, epochs=epochs,validation_data=(x_test,y_test))

Train on 3999 samples, validate on 1000 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f5c6c968f98>

In [0]:
# Save the model
model_structure=model.to_json()
with open("/content/gdrive/My Drive/ColabFolder/imdb/cnn_model.json","w") as json_file:
    json_file.write(model_structure)  # this only saves the structure, not the weights
model.save_weights("/content/gdrive/My Drive/ColabFolder/imdb/cnn_weights.h5")

### g) Test the model by predicting

In [0]:
sample1="I hate that the dismal weather had me down for so long, \
when will it break! Ugh, when does happiness return? The sun is blinding \
and the puffy clouds are too thin. I can't wait for the weekend."

In [38]:
vec_list=tokenize_and_vectorize([(1,sample1)])  # target value = 1 is just dummy value, not used here
test_vec_list=pad_trunc(vec_list,maxlen)
test_vec=np.reshape(test_vec_list,(len(test_vec_list),maxlen,embedding_dims))
model.predict(test_vec) # returns probability (>0.5 ->1, <0.5 ->0)

array([[0.11028806]], dtype=float32)

In [39]:
model.predict_classes(test_vec) # returns predicted class (0 or 1)

array([[0]], dtype=int32)
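
For a single sigmoid output, the predicted class is the probability thresholded at 0.5. The sketch below applies that threshold with NumPy to the probability printed above. This is useful as a fallback because, to my knowledge, `predict_classes` exists in the TF 1.15 used here but was removed from later tf.keras releases.

```python
import numpy as np

# Probability copied from the model.predict output above
probs = np.array([[0.11028806]], dtype=np.float32)
classes = (probs > 0.5).astype("int32")  # 1 = positive, 0 = negative
print(classes)  # [[0]]
```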