Text Classification using multi-channel CNN Model
===================================

The IMDB dataset consists of 25,000 reviews, each with a binary label (1 = positive, 0 = negative). The task is to predict the sentiment, based on the review. Here is an example review:

“Okay, sorry, but I loved this movie. I just love the whole 80’s genre of these kind of movies, because you don’t see many like this...” 

A standard CNN text classifier uses an Embedding layer followed by Conv1D layers. This model can be extended with multiple parallel channels that read the same document using different kernel sizes. We will use this multi-channel Conv1D architecture to classify the IMDB movie reviews as positive or negative. We will use TensorFlow 2.0 with tf.keras and its Functional API to feed the channels in parallel and classify the documents. The same architecture can be used for text classification in many other scenarios. The embedding vectors are trained from scratch together with the rest of the model, which works here because the number of samples is fairly large (12,500 for each sentiment).

The steps are:

*   Pre-processing and encoding text data using tf.keras Tokenizer
*   Building the Model with tf.keras Functional API
*   Training the Model
*   Evaluating the Model

Import the necessary libraries. We will use the GPU build of TensorFlow 2.0 for this.

Install the TensorFlow GPU version, and make sure you select a GPU-enabled runtime in Google Colab.

In [0]:
#!pip install --upgrade tensorflow-gpu

Collecting tensorflow-gpu
  Downloading https://files.pythonhosted.org/packages/25/44/47f0722aea081697143fbcf5d2aa60d1aee4aaacb5869aee2b568974777b/tensorflow_gpu-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (380.8MB)
     |████████████████████████████████| 380.8MB 46kB/s
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-2.0.0


In [0]:
import tensorflow as tf
print(tf.__version__)
import string
import re
import nltk
from os import listdir
from collections import Counter
from numpy import array
import numpy as np
import pandas as pd
from nltk.corpus import stopwords

2.0.0


In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Text pre-processing**

Tokenize and clean each document by removing punctuation, numerals, and stop words.


In [0]:
# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    # NOTE: the spaces around the character class are part of the pattern, so it
    # never matches inside a whitespace-split token; tokens that still contain
    # punctuation (e.g. 'excellent,') are dropped by the isalpha() filter below
    re_punc = re.compile(' [%s] ' % re.escape(string.punctuation))
    # apply the pattern to each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words (case-sensitive: 'the' is removed, 'The' is kept)
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
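
A note on the punctuation pattern in `clean_doc`: the character class there is wrapped in spaces, whereas the pattern used later in `clean_doc_wVocab` is not. The following is a minimal standard-library sketch of the difference on a short sample sentence; the variable names `spaced`, `unspaced`, `kept_spaced` and `kept_unspaced` are illustrative and not part of the notebook.

```python
import re
import string

# Pattern as written in clean_doc: the character class is wrapped in spaces.
spaced = re.compile(' [%s] ' % re.escape(string.punctuation))
# Pattern as written in clean_doc_wVocab: the character class only.
unspaced = re.compile('[%s]' % re.escape(string.punctuation))

tokens = "The cast is excellent, the acting good".split()

# The spaced pattern cannot match inside whitespace-split tokens, so
# 'excellent,' keeps its comma and is then rejected by isalpha().
kept_spaced = [w for w in (spaced.sub('', t) for t in tokens) if w.isalpha()]

# The unspaced pattern strips the comma, so 'excellent' survives.
kept_unspaced = [w for w in (unspaced.sub('', t) for t in tokens) if w.isalpha()]

print(kept_spaced)    # ['The', 'cast', 'is', 'the', 'acting', 'good']
print(kept_unspaced)  # ['The', 'cast', 'is', 'excellent', 'the', 'acting', 'good']
```

This matches the sample output further down, where words that were followed by a comma in the review (such as "excellent") do not appear among the cleaned tokens.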


Read the train and test datasets from the CSV files at the links provided. Upload the files using the File -> Upload option in Google Colab and refer to them as /content/filename. Alternatively, upload the files to your Google Drive, mount the drive, and read them from there.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
train_df=pd.read_csv("/content/gdrive/My Drive/Imdb_train.csv") # Replace with /content/Imdb_train.csv if reading from File uploaded into Google colab
test_df=pd.read_csv("/content/gdrive/My Drive/Imdb_test.csv")

In [0]:
train_df[['sentence','polarity']].head()

Unnamed: 0,sentence,polarity
0,Okay at first this movie seemed pretty good ev...,0
1,"Some of the worst, least natural acting perfor...",0
2,THAT'S certainly a strange way to promote a fi...,0
3,"OK, the story - a simpleminded loony enters a ...",0
4,I would rather have 20 root canals than go thr...,0


The train data has 12,500 rows of positive sentiment (polarity=1) and 12,500 rows of negative sentiment (polarity=0).

In [0]:
train_df['polarity'].value_counts()

1    12500
0    12500
Name: polarity, dtype: int64

Create separate lists for positive reviews and negative reviews

In [0]:
train_pos_sentences = train_df['sentence'].loc[train_df.polarity==1]
train_neg_sentences = train_df['sentence'].loc[train_df.polarity==0]
train_pos_sentences = train_pos_sentences.reset_index(drop=True)
train_neg_sentences = train_neg_sentences.reset_index(drop=True)

In [0]:
test_pos_sentences = test_df['sentence'].loc[test_df.polarity==1]
test_neg_sentences = test_df['sentence'].loc[test_df.polarity==0]
test_pos_sentences = test_pos_sentences.reset_index(drop=True)
test_neg_sentences = test_neg_sentences.reset_index(drop=True)

Check a few values

In [0]:
print(train_pos_sentences[0:5].values)

['The cast is excellent, the acting good, the plot interesting, the evolvement full of suspense...but it is hard to cram all those elements into a film that is barely 80 minutes long. If more time was taken to develop the plot and subplots, it would have a much better effect. Another 30 minutes of substance would have made this a very good film rather then just a good one.'
 "The 3rd and in my view the best of the Blackadder series.<br /><br />The only downside is that there is no Lord Percy who was the funniest character from the previous series but Hugh Laurie's Prince Regent is suitably madcap laugh a line.<br /><br />As a package it's quality through and through with convincing regency sets, superb cutting sarcasm and little bits of the wacky, the 'macbeth' actors standing out and Prince Georges 'lucky us' chicken impression, and the missing words from Dr Johnson's dictionary.<br /><br />Few comedies have been quite as both clever as they are funny, okay the odd lame observation or

In [0]:
print(train_neg_sentences[0:5].values)

['Okay at first this movie seemed pretty good even though it was moving rather quick and even though they only had a $60,000 budget it was good but if you found your sister dead in a lake and found out who might have killed her why would you go chase him around and pull a gun on him with only one bullet and waste it and end up running from him all retarded and get yourself killed? Plus after you found your sister dead in the lake and found a clue and figured out who the killer was why wouldn\'t you hand that clue over to the police who think you killed her? And at the end of the movie when she acts like her sister who was a waitress and she is talking to the bad guy she should of met him somewhere and recorded him saying she was dead and what happened for her "proof". I don\'t know I was not happy with the ending. This movie could of been so much better if it lasted longer and the acting was better and if the ending did not suck so bad! Do not waste your money on this movie because if 

Check the result of tokenization and cleaning on one sample document

In [0]:
text = train_pos_sentences[0]
tokens = clean_doc(text)
print(tokens)

['The', 'cast', 'acting', 'plot', 'evolvement', 'full', 'hard', 'cram', 'elements', 'film', 'barely', 'minutes', 'If', 'time', 'taken', 'develop', 'plot', 'would', 'much', 'better', 'Another', 'minutes', 'substance', 'would', 'made', 'good', 'film', 'rather', 'good']


In [0]:
# clean a sentence and add its tokens to the vocab
def add_sent_to_vocab(text, vocab):
    # clean text
    tokens = clean_doc(text)
    # update counts
    vocab.update(tokens)

# add every sentence in the list to the vocab
def process_docs_to_vocab(sentList, vocab):
    for text in sentList:
        add_sent_to_vocab(text, vocab)

# clean a doc and keep only the tokens that are in the vocab
def clean_doc_wVocab(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# clean all sentences in a list, returning one string per sentence
def process_docs_to_tokens(sentList, vocab):
    documents = list()
    # walk through all sentences
    for doc in sentList:
        # clean doc
        tokens = clean_doc_wVocab(doc, vocab)
        # add to list
        documents.append(tokens)
    return documents

# integer encode and pad documents
def encode_docs(tokenizer, max_length, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    # pad sequences with zeros at the end
    padded = tf.keras.preprocessing.sequence.pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

Create the vocabulary from the train and test corpora, then check the 50 most frequent words in it.

In [0]:
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs_to_vocab(train_df['sentence'], vocab)
process_docs_to_vocab(test_df['sentence'], vocab)
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

103269
[('The', 67317), ('movie', 60762), ('film', 54277), ('one', 41334), ('like', 36028), ('This', 24329), ('would', 23578), ('good', 22582), ('It', 21475), ('really', 21322), ('even', 20979), ('see', 20364), ('get', 17333), ('much', 16827), ('story', 16443), ('time', 15312), ('make', 14859), ('could', 14689), ('also', 14607), ('people', 14414), ('great', 14385), ('first', 14283), ('made', 13418), ('think', 13083), ('bad', 12847), ('many', 12175), ('never', 11984), ('two', 11489), ('But', 11435), ('little', 11206), ('way', 11089), ('And', 11060), ('well', 10789), ('watch', 10688), ('know', 10656), ('seen', 10554), ('characters', 10448), ('character', 10307), ('movies', 10118), ('love', 10033), ('best', 9899), ('ever', 9897), ('In', 9482), ('films', 9406), ('still', 9405), ('plot', 9305), ('acting', 9182), ('show', 9095), ('He', 8894), ('better', 8866)]
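
Capitalized words such as 'The', 'This' and 'It' appear at the top of this list because `clean_doc` does not lowercase its tokens, while the NLTK English stop-word list is lowercase, so only the lowercase forms are filtered out. The counting itself is case-sensitive too. A small sketch of how `Counter.update` accumulates tokens, using a hypothetical `vocab_demo` counter rather than the notebook's `vocab`:

```python
from collections import Counter

# Hypothetical counter, separate from the notebook's `vocab`.
vocab_demo = Counter()

# Each call adds the tokens of one cleaned document to the running counts.
vocab_demo.update(['The', 'movie', 'good'])
vocab_demo.update(['the', 'movie', 'bad'])

# 'The' and 'the' are distinct keys, so there are five entries in total.
print(len(vocab_demo))            # 5
print(vocab_demo.most_common(1))  # [('movie', 2)]
```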


In [0]:
# clean the training sentences and build their labels
def load_clean_dataset(vocab):
    # clean documents
    neg = process_docs_to_tokens(train_neg_sentences, vocab)
    pos = process_docs_to_tokens(train_pos_sentences, vocab)
    docs = neg + pos
    # prepare labels: 0 for negative, 1 for positive
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels

# clean the test sentences and build their labels
def load_clean_dataset_test(vocab):
    # clean documents
    neg = process_docs_to_tokens(test_neg_sentences, vocab)
    pos = process_docs_to_tokens(test_pos_sentences, vocab)
    docs = neg + pos
    # prepare labels: 0 for negative, 1 for positive
    labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels

Create the list of cleaned training documents together with their labels (0 / 1). Create a tf.keras Tokenizer instance and use it to encode the training documents. Set max_length to the largest number of tokens found in any document of train_docs.


In [0]:
# load training data
train_docs, ytrain = load_clean_dataset(vocab)
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)

Vocabulary size: 63542
Maximum length: 1441


Compare the original review with the cleaned review

In [0]:
print(train_df['sentence'][0])

Okay at first this movie seemed pretty good even though it was moving rather quick and even though they only had a $60,000 budget it was good but if you found your sister dead in a lake and found out who might have killed her why would you go chase him around and pull a gun on him with only one bullet and waste it and end up running from him all retarded and get yourself killed? Plus after you found your sister dead in the lake and found a clue and figured out who the killer was why wouldn't you hand that clue over to the police who think you killed her? And at the end of the movie when she acts like her sister who was a waitress and she is talking to the bad guy she should of met him somewhere and recorded him saying she was dead and what happened for her "proof". I don't know I was not happy with the ending. This movie could of been so much better if it lasted longer and the acting was better and if the ending did not suck so bad! Do not waste your money on this movie because if you 

In [0]:
train_docs[0]

'Okay first movie seemed pretty good even though moving rather quick even though budget good found sister dead lake found might killed would go chase around pull gun one bullet waste end running retarded get killed Plus found sister dead lake found clue figured killer wouldnt hand clue police think killed And end movie acts like sister waitress talking bad guy met somewhere recorded saying dead happened proof dont know happy ending This movie could much better lasted longer acting better ending suck bad Do waste money movie writing review happy'

Encode the docs using the tf.keras Tokenizer. This replaces each token with its index in the Tokenizer's word index. Check a sample encoded document and make sure the token indexes correspond to the words in the review.

In [0]:
Xtrain = encode_docs(tokenizer, max_length, train_docs)

In [0]:
print(Xtrain.shape)
print(Xtrain[0])

(25000, 1441)
[784  22   2 ...   0   0   0]
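
To make the encoding step concrete, here is a dependency-free sketch of the idea: words are numbered by descending frequency starting at 1, and shorter documents are padded with zeros at the end. It is a simplified illustration, not the Keras implementation; `build_word_index` and `encode_and_pad` are hypothetical helper names, and the handling of overly long documents is a simplifying assumption (it does not arise for the training data here, since max_length is the longest training document).

```python
from collections import Counter

def build_word_index(docs):
    """Map each lowercased word to an integer, most frequent word first (index 1)."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(docs, word_index, max_length):
    """Replace words by their index, then pad with zeros at the end."""
    rows = []
    for doc in docs:
        ids = [word_index[w] for w in doc.lower().split() if w in word_index]
        ids = ids[:max_length]  # simplification: cut overly long docs at the end
        rows.append(ids + [0] * (max_length - len(ids)))
    return rows

docs = ["Okay first movie", "movie good movie"]
word_index = build_word_index(docs)
encoded = encode_and_pad(docs, word_index, max_length=5)

print(word_index)  # {'movie': 1, 'okay': 2, 'first': 3, 'good': 4}
print(encoded)     # [[2, 3, 1, 0, 0], [1, 4, 1, 0, 0]]
```

Index 0 is never assigned to a word because it is used for padding, which is why the notebook sets vocab_size to `len(tokenizer.word_index) + 1`.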


Check how the Tokenizer has arranged the tokens. Several important properties of the Tokenizer can be printed after fitting it on the corpus of reviews.

In [0]:
t = tokenizer
print(t.word_counts)
#print(t.document_count)
#print(t.word_index)
#print(t.word_docs)



In [0]:
print(t.word_index)



Make sure the encoded document matches the cleaned review. The word index above maps the first tokens of the first movie review as follows: 784 -> okay, 22 -> first, 2 -> movie (the Tokenizer lowercases words by default). This confirms that the encoding worked correctly.


**Multi-channel Conv1D Model using tf.keras Functional API**

Define the Conv1D model using the Functional API. There are 3 input channels, and each channel consists of Embedding, Conv1D, Dropout, MaxPooling1D and Flatten layers. The kernel size differs between channels (4, 6 and 8). The outputs of the channels are concatenated and passed through a Dense layer to the output layer, whose sigmoid activation generates the prediction.

In [0]:
def define_model(length, vocab_size):
    # channel 1
    inputs1 = tf.keras.layers.Input(shape=(length,))
    embedding1 = tf.keras.layers.Embedding(vocab_size, 100)(inputs1)
    conv1 = tf.keras.layers.Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
    drop1 = tf.keras.layers.Dropout(0.5)(conv1)
    pool1 = tf.keras.layers.MaxPooling1D(pool_size=2)(drop1)
    flat1 = tf.keras.layers.Flatten()(pool1)
    # channel 2
    inputs2 = tf.keras.layers.Input(shape=(length,))
    embedding2 = tf.keras.layers.Embedding(vocab_size, 100)(inputs2)
    conv2 = tf.keras.layers.Conv1D(filters=32, kernel_size=6, activation='relu')(embedding2)
    drop2 = tf.keras.layers.Dropout(0.5)(conv2)
    pool2 = tf.keras.layers.MaxPooling1D(pool_size=2)(drop2)
    flat2 = tf.keras.layers.Flatten()(pool2)
    # channel 3
    inputs3 = tf.keras.layers.Input(shape=(length,))
    embedding3 = tf.keras.layers.Embedding(vocab_size, 100)(inputs3)
    conv3 = tf.keras.layers.Conv1D(filters=32, kernel_size=8, activation='relu')(embedding3)
    drop3 = tf.keras.layers.Dropout(0.5)(conv3)
    pool3 = tf.keras.layers.MaxPooling1D(pool_size=2)(drop3)
    flat3 = tf.keras.layers.Flatten()(pool3)
    # merge
    merged = tf.keras.layers.concatenate([flat1, flat2, flat3])
    # interpretation
    dense1 = tf.keras.layers.Dense(10, activation='relu')(merged)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dense1)
    model = tf.keras.models.Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs)
    # compile
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize (summary() prints directly; wrapping it in print() adds a stray "None")
    model.summary()
    #tf.keras.utils.plot_model(model, show_shapes=True, to_file='multichannel.png')
    return model

In [0]:
print(vocab_size)
print(max_length)
# define model
model = define_model( max_length, vocab_size)

63542
1441
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 1441)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 1441)]       0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 1441)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1441, 100)    6354200     input_1[0][0]                    
___________________________________________________________________________________
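
Since the summary above is cut off, the remaining shapes and parameter counts can be worked out by hand from the layer settings in `define_model`. The sketch below assumes the Keras defaults for the arguments the notebook does not set (Conv1D with 'valid' padding and stride 1, MaxPooling1D with stride equal to the pool size); `channel_shapes` is a hypothetical helper.

```python
def channel_shapes(length, kernel_size, filters=32, pool_size=2, embed_dim=100):
    """Return (conv length, pooled length, flattened size, Conv1D params) for one channel."""
    conv_len = length - kernel_size + 1                    # 'valid' convolution, stride 1
    pool_len = conv_len // pool_size                       # non-overlapping max pooling
    flat = pool_len * filters                              # Flatten: steps x filters
    conv_params = (kernel_size * embed_dim + 1) * filters  # weights + one bias per filter
    return conv_len, pool_len, flat, conv_params

length, vocab_size = 1441, 63542
embedding_params = vocab_size * 100  # one 100-dimensional vector per vocabulary index

flat_sizes = []
for k in (4, 6, 8):
    conv_len, pool_len, flat, conv_params = channel_shapes(length, k)
    flat_sizes.append(flat)
    print(k, conv_len, pool_len, flat, conv_params)

merged = sum(flat_sizes)         # size of the concatenated vector
dense_params = merged * 10 + 10  # Dense(10): weights + biases

print(embedding_params, merged, dense_params)
```

The embedding count of 6,354,200 agrees with the `embedding` row shown in the summary. The three channels flatten to 23,008, 22,976 and 22,944 values, so the Dense layer receives a 68,928-dimensional vector.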

In [0]:
# fit model
model.fit([Xtrain,Xtrain,Xtrain], ytrain, epochs=10)

Train on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fe7e9d3c2b0>

Save the trained model so that it can be loaded later and you do not need to retrain it every time you want to test it. On the first run, loading is unnecessary because the trained model is already in memory.

In [0]:
model.save('/content/gdrive/My Drive/sent_model_multi_input.h5')

In [0]:
model = tf.keras.models.load_model('/content/gdrive/My Drive/sent_model_multi_input.h5')
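
The .h5 file holds the network only. In a fresh session, the pre-processing state (the Tokenizer's word index, `max_length`, and the `vocab` used by `clean_doc_wVocab`) has to be rebuilt or stored separately. One possible approach, sketched here with the standard `json` module, is to save the word index and the maximum length next to the model. The dictionary contents and the file name `preprocessing.json` are hypothetical stand-ins, not values from the notebook.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for tokenizer.word_index and max_length.
word_index = {"movie": 1, "film": 2, "okay": 3}
max_length = 1441

# Hypothetical location; in Colab this could be a path on the mounted drive.
path = os.path.join(tempfile.mkdtemp(), "preprocessing.json")

# Save the pre-processing state.
with open(path, "w") as f:
    json.dump({"word_index": word_index, "max_length": max_length}, f)

# Restore it in a later session.
with open(path) as f:
    restored = json.load(f)

print(restored["max_length"], restored["word_index"]["movie"])
```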

**Pre-process the Test documents using the same tokenizer instance and vocabulary as used while processing the Train documents**

In [0]:
test_docs, ytest = load_clean_dataset_test(vocab)

Encode the test documents in the same way

In [0]:
Xtest = encode_docs(tokenizer, max_length, test_docs)

**Model Evaluation**

In [0]:
# evaluate model on training dataset
_, acc = model.evaluate([Xtrain,Xtrain,Xtrain], ytrain, verbose=0)
print(' Train Accuracy: %f' % (acc*100))

 Train Accuracy: 99.975997


In [0]:
# evaluate model on test dataset
_, acc = model.evaluate([Xtest,Xtest,Xtest], ytest, verbose=0)
print( ' Test Accuracy: %f ' % (acc*100))

 Test Accuracy: 84.675997 
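
The gap between the train accuracy (about 99.98%) and the test accuracy (about 84.68%) suggests that the model fits the training reviews much more closely than unseen ones. Accuracy alone does not show which class the errors fall into. As a rough sketch, confusion counts can be derived from predicted probabilities as follows; `confusion_counts` is a hypothetical helper and the sample labels and probabilities are made up for illustration, not model output.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up labels and probabilities, for illustration only.
y_true = [1, 1, 0, 0, 1]
y_prob = [0.9, 0.4, 0.2, 0.7, 0.8]

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)

print(tp, fp, tn, fn)  # 2 1 1 1
print(accuracy)        # 0.6
```

With the notebook's variables, the probabilities would come from `model.predict([Xtest, Xtest, Xtest])` and the labels from `ytest`.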


Create a review of your own or take a sample review and check the prediction. The review might be plainly positive or negative, or it can start on a positive note and end up being very negative. Ideally, the model should predict the sentiment correctly in all these cases.

In [0]:
#Positive simple review
review = "The move is enjoyable. Recommended for all ages. The storyline is good and direction is good"
#"Okay at first this movie seemed pretty good even though it was moving rather quick and even though they only had a $60,000 budget it was good but if you found your sister dead in a lake and found out who might have killed her why would you go chase him around and pull a gun on him with only one bullet and waste it and end up running from him all retarded and get yourself killed? Plus after you found your sister dead in the lake and found a clue and figured out who the killer was why wouldn\'t you hand that clue over to the police who think you killed her? And at the end of the movie when she acts like her sister who was a waitress and she is talking to the bad guy she should of met him somewhere and recorded him saying she was dead and what happened for her proof. I don\'t know I was not happy with the ending. This movie could of been so much better if it lasted longer and the acting was better and if the ending did not suck so bad! Do not waste your money on this movie because if you do you will be writing a review on here too and will not be happy."
#"Beautiful attracts excellent idea, but ruined with a bad selection of the actors. The main character is a loser and his woman friend and his friend upset viewers. Apart from the first episode all the other become more boring and boring. First, it considers it illogical behavior. No one normal would not behave the way the main character behaves. It all represents a typical Halmark way to endear viewers to the reduced amount of intelligence. Does such a scenario, or the casting director and destroy this question is on Halmark producers. Cat is the main character is wonderful. The main character behaves according to his friend selfish.";
#"The movie started good. But after half-time, the story line faded, there was too much theatrical element. Not recommended"
#"The characters voices were very good. I was only really bothered by Kanga. The music, however, was twice as loud in parts than the dialog, and incongruous to the film. As for the story, it was a bit preachy and militant in tone. Overall, I was disappointed, but I would go again just to see the same excitement on my child's face. I liked Lumpy's laugh..."
#"The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty."
#'This is a bad movie. Do not watch it. It sucks.'
#'Everyone will enjoy this film. I love it, recommended!'
#'Okay at first this movie seemed pretty good even though it was moving rather quick and even though they only had a $60,000 budget it was good but if you found your sister dead in a lake and found out who might have killed her why would you go chase him around and pull a gun on him with only one bullet and waste it and end up running from him all retarded and get yourself killed? Plus after you found your sister dead in the lake and found a clue and figured out who the killer was why wouldn\'t you hand that clue over to the police who think you killed her? And at the end of the movie when she acts like her sister who was a waitress and she is talking to the bad guy she should of met him somewhere and recorded him saying she was dead and what happened for her "proof". I don\'t know I was not happy with the ending. This movie could of been so much better if it lasted longer and the acting was better and if the ending did not suck so bad! Do not waste your money on this movie because if you do you will be writing a review on here too and will not be happy.'


In [0]:
#Review that started with a positive note, but ended with a very negative impression
review = "Okay at first this movie seemed pretty good even though it was moving rather quick and even though they only had a $60,000 budget it was good but if you found your sister dead in a lake and found out who might have killed her why would you go chase him around and pull a gun on him with only one bullet and waste it and end up running from him all retarded and get yourself killed? Plus after you found your sister dead in the lake and found a clue and figured out who the killer was why wouldn\'t you hand that clue over to the police who think you killed her? And at the end of the movie when she acts like her sister who was a waitress and she is talking to the bad guy she should of met him somewhere and recorded him saying she was dead and what happened for her proof. I don\'t know I was not happy with the ending. This movie could of been so much better if it lasted longer and the acting was better and if the ending did not suck so bad! Do not waste your money on this movie because if you do you will be writing a review on here too and will not be happy."


Encode the review in the same way and using the same tokenizer instance

In [0]:
line = clean_doc_wVocab(review, vocab)
print(line)
X_encoded = encode_docs(tokenizer, max_length, [line])
print(X_encoded)

Okay first movie seemed pretty good even though moving rather quick even though budget good found sister dead lake found might killed would go chase around pull gun one bullet waste end running retarded get killed Plus found sister dead lake found clue figured killer wouldnt hand clue police think killed And end movie acts like sister waitress talking bad guy met somewhere recorded saying dead happened proof dont know happy ending This movie could much better lasted longer acting better ending suck bad Do waste money movie writing review happy
[[784  22   2 ...   0   0   0]]


Predict the sentiment of this single review. The prediction is a probability score: if it is >= 0.5, the review is labelled 1 (POSITIVE), otherwise 0 (NEGATIVE). Check whether the prediction is accurate, even for reviews that start on a positive note and end with a negative overall impression.

In [0]:
yhat = model.predict([X_encoded,X_encoded,X_encoded], verbose=0)
# retrieve the predicted probability of the positive class
percent_pos = yhat[0,0]
print(percent_pos)
# compare the probability itself with the threshold (rounding first would send exactly 0.5 to 0)
if percent_pos >= 0.5:
    sentiment = 'POSITIVE'
else:
    sentiment = 'NEGATIVE'

print(sentiment)

8.37729e-06
NEGATIVE
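
The thresholding rule described above can be written as a small reusable helper. This is a minimal sketch; `label_from_probability` is a hypothetical name. It compares the probability directly with the threshold instead of rounding it first, because Python's built-in `round` rounds halves to the nearest even integer, so `round(0.5)` is 0 and a score of exactly 0.5 would be labelled NEGATIVE.

```python
def label_from_probability(prob, threshold=0.5):
    """Map a sigmoid output to a sentiment label."""
    return 'POSITIVE' if prob >= threshold else 'NEGATIVE'

print(label_from_probability(8.37729e-06))  # NEGATIVE
print(label_from_probability(0.5))          # POSITIVE
print(round(0.5))                           # 0
```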
