Sentiment Analysis of Movie Reviews
===================================

The IMDB dataset consists of 25,000 reviews, each with a binary label (1 = positive, 0 = negative). Here is an example review:

> “Okay, sorry, but I loved this movie. I just love the whole 80’s genre of these kind of movies, because you don’t see many like this...” -~CupidGrl~

In [1]:
import string
import re
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
from nltk.corpus import stopwords
import string
import re
# turn a doc into clean tokens
def clean_doc(doc):
# split into tokens by white space
    tokens = doc.split()
# prepare regex for char filtering
    re_punc = re.compile( ' [%s] ' % re.escape(string.punctuation))
# remove punctuation from each word
    tokens = [re_punc.sub( '' , w) for w in tokens]
# remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
    stop_words = set(stopwords.words( 'english' ))
    tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r',encoding="utf8")
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

In [10]:
# load the document
filename = 'C:/Users/GCNDP/SentimentTrain/train/neg/10_2.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

['This', 'film', 'lot', 'plot', 'relatively', 'however', 'director', 'editors', 'seriously', 'let', 'film', 'feel', 'bad', 'could', 'The', 'acting', 'characters', 'ever', 'edited', 'clearly', 'learnt', 'new', 'edit', 'techniques', 'wanted', 'splash', 'There', 'lots', 'quick', 'edits', 'almost', 'every', 'clearly', 'meant', 'symbolic', 'end', 'wanted', 'like', 'film', 'expected', 'decent', 'resolution', 'breakdown', 'equilibrium', 'alas', 'left', 'feeling', 'like', 'wasted', 'time', 'film', 'makers', 'wasted']


In [3]:
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
# load doc
    doc = load_doc(filename) #encoding="utf8"
# clean doc
    tokens = clean_doc(doc)
# update counts
    vocab.update(tokens)
    #print(tokens)

# load all docs in a directory
def process_docs_to_vocab(directory, vocab):
# walk through all files in the folder
    for filename in listdir(directory):
# create the full path of the file to open
        path = directory + '/' + filename
# add doc to vocab
        #print(path)
        add_doc_to_vocab(path, vocab)
        
# turn a doc into clean tokens
def clean_doc_wVocab(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens

# load all docs in a directory, into tokens
def process_docs_to_tokens(directory, vocab):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc_wVocab(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents

# integer encode and pad documents
def encode_docs(tokenizer, max_length, docs):
    # integer encode
    encoded = tokenizer.texts_to_sequences(docs)
    # pad sequences
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded

# fit a tokenizer
def create_tokenizer(lines):
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

In [4]:
from os import listdir
from collections import Counter

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs_to_vocab('C:/Users/GCNDP/SentimentTrain/train/pos', vocab)
process_docs_to_vocab('C:/Users/GCNDP/SentimentTrain/train/neg', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

75355
[('The', 33762), ('movie', 30506), ('film', 27402), ('one', 20692), ('like', 18133), ('This', 12279), ('would', 11923), ('good', 11436), ('It', 10952), ('really', 10815), ('even', 10607), ('see', 10155), ('get', 8777), ('story', 8527), ('much', 8507), ('time', 7765), ('make', 7485), ('could', 7462), ('also', 7422), ('first', 7339), ('people', 7335), ('great', 7191), ('made', 6962), ('think', 6659), ('bad', 6506), ('many', 6062), ('never', 6043), ('But', 5897), ('two', 5869), ('little', 5790), ('way', 5649), ('And', 5590), ('well', 5420), ('watch', 5314), ('seen', 5304), ('know', 5270), ('character', 5215), ('characters', 5180), ('movies', 5128), ('best', 4975), ('love', 4974), ('ever', 4924), ('still', 4863), ('In', 4788), ('films', 4740), ('plot', 4698), ('acting', 4648), ('show', 4472), ('He', 4466), ('better', 4406)]


In [5]:
# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename, 'w', encoding='utf-8')
    # write text
    file.write(data)
    # close file
    file.close()
    
min_occurane = 2
tokens = [k for k,c in vocab.items() if c >= min_occurane]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, "C:/Users/GCNDP/SentimentTrain/vocab.txt")

46000


In [6]:
# load the vocabulary
vocab_filename = 'C:/Users/GCNDP/SentimentTrain/vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())

In [7]:
# load and clean a dataset
def load_clean_dataset(vocab):
	# load documents
	neg = process_docs_to_tokens('C:/Users/GCNDP/SentimentTrain/train/neg', vocab)
	pos = process_docs_to_tokens('C:/Users/GCNDP/SentimentTrain/train/pos', vocab)
	docs = neg + pos
	# prepare labels
	labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
	return docs, labels

# load and clean a dataset
def load_clean_dataset_test(vocab):
	# load documents
	neg = process_docs_to_tokens('C:/Users/GCNDP/SentimentTrain/test/neg', vocab)
	pos = process_docs_to_tokens('C:/Users/GCNDP/SentimentTrain/test/pos', vocab)
	docs = neg + pos
	# prepare labels
	labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
	return docs, labels

In [None]:
# load training data
train_docs, ytrain = load_clean_dataset(vocab)
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)
# calculate the maximum sequence length
max_length = max([len(s.split()) for s in train_docs])
print('Maximum length: %d' % max_length)

In [7]:
Xtrain = encode_docs(tokenizer, max_length, train_docs)

In [15]:
print(Xtrain.shape)
print(Xtrain[5])


(25000, 1406)
[  9 631  37 ...   0   0   0]


In [26]:
from keras import models
from keras import layers
from keras import optimizers
# define the model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add(layers.Embedding(vocab_size, 100, input_length=max_length))
    model.add(layers.Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(10, activation='relu' ))
    model.add(layers.Dense(1, activation='sigmoid' ))
# compile network
    model.compile(loss='binary_crossentropy' , optimizer=optimizers.Adam() , metrics=['acc'])
# summarize defined model
    model.summary()
    plot_model(model, to_file= 'C:/users/GCNDP/model.png' , show_shapes=True)
    return model

In [34]:
print(vocab_size)
print(max_length)
# define model
model = define_model(vocab_size, max_length)


38428
1406
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1406, 100)         3842800   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1399, 32)          25632     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 699, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 22368)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                223690    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 4,092,133
Trainable params: 4,092,133
Non-trainable params: 0
_______________________________________________________

<keras.callbacks.History at 0x1375a668>

In [None]:
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)

In [35]:
model.save('C:/Users/GCNDP/SentimentTrain/sent_model_2.h5')

In [6]:
import keras
from keras import models
model = models.load_model('C:/Users/GCNDP/SentimentTrain/sent_model_2.h5')

In [9]:
test_docs, ytest = load_clean_dataset_test(vocab)

In [10]:
Xtest = encode_docs(tokenizer, max_length, test_docs)

In [11]:
# evaluate model on training dataset
_, acc = model.evaluate(Xtrain, ytrain, verbose=0)
print( ' Train Accuracy: %f ' % (acc*100))
# evaluate model on test dataset
_, acc = model.evaluate(Xtest, ytest, verbose=0)
print( ' Test Accuracy: %f ' % (acc*100))

 Train Accuracy: 100.000000 
 Test Accuracy: 85.080000 


In [22]:
review = "the movie had a good plot to it, and the actors did a magnificent job. Overall, i would recomend it.";
#"The move is enjoyable. Recommended for all ages. The storyline is good and direction is good";
#log, and incongruous to the film. As for the story, it was a bit preachy and militant in tone. Overall, I was disappointed, but I would go again just to see the same excitement on my child's face. I liked Lumpy's laugh...";
#"The move is enjoyable. Recommended for all ages. The storyline is good and direction is good"
#"The movie started good. But after half-time, the story line faded, there was too much theatrical element. Not recommended"
#"The characters voices were very good. I was only really bothered by Kanga. The music, however, was twice as loud in parts than the dialog, and incongruous to the film. As for the story, it was a bit preachy and militant in tone. Overall, I was disappointed, but I would go again just to see the same excitement on my child's face. I liked Lumpy's laugh..."
#"Beautiful attracts excellent idea, but ruined with a bad selection of the actors. The main character is a loser and his woman friend and his friend upset viewers. Apart from the first episode all the other become more boring and boring. First, it considers it illogical behavior. No one normal would not behave the way the main character behaves. It all represents a typical Halmark way to endear viewers to the reduced amount of intelligence. Does such a scenario, or the casting director and destroy this question is on Halmark producers. Cat is the main character is wonderful. The main character behaves according to his friend selfish."
#"The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty."
#'This is a bad movie. Do not watch it. It sucks.'
#'Everyone will enjoy this film. I love it, recommended!'


In [23]:
line = clean_doc_wVocab(review, vocab)
print(line)
X_encoded = encode_docs(tokenizer, max_length, [line])
print(X_encoded)

movie good plot actors magnificent job Overall would recomend
[[ 2  7 43 ...  0  0  0]]


In [24]:
yhat = model.predict(X_encoded, verbose=0)
	# retrieve predicted percentage and label
percent_pos = yhat[0,0]
print(percent_pos)
if round(percent_pos) >= 0.5:
	sentiment = 'POSITIVE'
else:
	sentiment = 'NEGATIVE'
    
print(sentiment)

0.9391502
POSITIVE
