<b>Chris Huber\
CSC820, Prof Anagha Kulkarni\
Spring 2022\
HA15\
ha15.ipynb</b>

<b>Description</b>: This notebook implements a lesson posted at https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing/. I manually create train and test datasets based on a prefix contained in the filenames, which splits the corpus into a train set of 58,109 sentences and a test set of 6,611 sentences. I then create an embedding using Word2Vec and the vocabulary from the corpus. The Keras CNN model combines an embedding layer with a convolutional layer and a pooling layer to form a movie review sentiment predictor. I then train that model for 10 epochs and evaluate its accuracy on the test dataset.
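
As a rough illustration of the filename-based split described above, the sketch below sends files whose names start with a given prefix to the test set and everything else to the training set. The prefix `cv9` is the one used later in this notebook; the helper name `split_by_prefix` and the sample filenames are hypothetical.

```python
def split_by_prefix(filenames, test_prefix='cv9'):
    """Split filenames into (train, test) lists based on a name prefix."""
    train, test = [], []
    for name in filenames:
        # files whose name starts with the prefix are held out for testing
        if name.startswith(test_prefix):
            test.append(name)
        else:
            train.append(name)
    return train, test

# made-up filenames in the style of the review corpus
names = ['cv000_a.txt', 'cv123_b.txt', 'cv900_c.txt', 'cv999_d.txt']
train_files, test_files = split_by_prefix(names)
print(train_files)  # ['cv000_a.txt', 'cv123_b.txt']
print(test_files)   # ['cv900_c.txt', 'cv999_d.txt']
```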

<h3>Lesson 07: Movie Review Sentiment Analysis Project</h3>

In [20]:
# import libraries
from numpy import array
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Embedding
from keras.layers import Conv1D, MaxPooling1D
from gensim.models import Word2Vec

<h3>Define Utility Functions</h3>

In [21]:
# load doc into memory
def load_doc(filename):
    # open the file read-only; the context manager closes it on exit
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text
 
# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
 
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    
    print(tokens)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

# turn a doc into a list of cleaned token lists, one per line
def doc_to_clean_lines(doc, vocab):
    clean_lines = list()
    lines = doc.splitlines()
    for line in lines:
        # split into tokens by white space
        tokens = line.split()
        # remove punctuation from each token
        table = str.maketrans('', '', punctuation)
        tokens = [w.translate(table) for w in tokens]
        # filter out tokens not in vocab
        tokens = [w for w in tokens if w in vocab]
        clean_lines.append(tokens)
    return clean_lines

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)
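
To make the cleaning steps in `clean_doc` concrete, here is a small self-contained sketch applied to one made-up sentence. It assumes a tiny hand-written stop word set (`STOP_WORDS`) as a stand-in for NLTK's English list, so the result will differ from what the real list produces on other inputs.

```python
from string import punctuation

# hypothetical stand-in for stopwords.words('english')
STOP_WORDS = {'the', 'a', 'is', 'it', 'was', 'and', 'but'}

def clean_tokens(text, stop_words=STOP_WORDS):
    # strip punctuation characters from every whitespace-separated token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in text.split()]
    # drop tokens that are empty or not purely alphabetic
    tokens = [w for w in tokens if w.isalpha()]
    # drop stop words
    tokens = [w for w in tokens if w not in stop_words]
    # drop single-character tokens
    return [w for w in tokens if len(w) > 1]

sample = "the plot was thin , but it 's a fun film !"
cleaned = clean_tokens(sample)
print(cleaned)  # ['plot', 'thin', 'fun', 'film']
```

Note how `'s` becomes `s` once the apostrophe is stripped and is then removed by the length filter.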

<h3>Define Vocabulary</h3>

In [22]:
# define vocab
vocab = Counter()

# add all docs to vocab
process_docs('./txt_sentoken/pos', vocab)
process_docs('./txt_sentoken/neg', vocab)

# print the size of the vocab
print(len(vocab))

# print the top words in the vocab
print(vocab.most_common(50))

44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('could', 1248), ('bad', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]


In [23]:
# keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

25767
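
The minimum-occurrence filter above can be seen on a toy `Counter`. The words and counts here are invented purely for illustration.

```python
from collections import Counter

# toy vocabulary counts (made up)
counts = Counter(['film', 'film', 'plot', 'film', 'plot', 'cameo'])

min_occurrence = 2
# keep only words seen at least min_occurrence times
kept = [word for word, c in counts.items() if c >= min_occurrence]
print(kept)  # ['film', 'plot']
```

In the notebook the same filter reduces the vocabulary from 44276 to 25767 words.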


<h3>Create Vocabulary and Export as Text File</h3>

In [24]:
# load all docs in a directory
def process_docs(directory, vocab, is_train):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # when building the training set, skip reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        # when building the test set, skip reviews in the training set
        if not is_train and not filename.startswith('cv9'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        doc = load_doc(path)
        doc_lines = doc_to_clean_lines(doc, vocab)
        # add lines to list
        lines += doc_lines
    return lines

# turn a doc into a space-separated string of in-vocabulary tokens
def clean_doc(doc, vocab):
    # split into tokens by white space
    tokens = doc.split()
    
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    
    # filter out tokens not in vocab
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

# save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open the file for writing; the context manager closes it on exit
    with open(filename, 'w') as file:
        # write text
        file.write(data)
 
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

In [25]:
# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
 
# load training data
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
print('Number of negative docs: %d' % len(negative_docs))
print('Number of positive docs: %d' % len(positive_docs))
sentences = negative_docs + positive_docs
print('Total training sentences: %d' % len(sentences))

Number of negative docs: 28584
Number of positive docs: 29525
Total training sentences: 58109


In [26]:
# train word2vec model
model = Word2Vec(sentences, vector_size=100, window=5, workers=8, min_count=1)
# summarize vocabulary size in model
# words = list(model.wv.vocab)
print('Vocabulary size: %d' % len(model.wv))
 
# save model in ASCII (word2vec) format
filename = 'embedding_word2vec.txt'
model.wv.save_word2vec_format(filename, binary=False)

Vocabulary size: 25767
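
The text file written by `save_word2vec_format(..., binary=False)` begins with a header line giving the vocabulary size and the vector dimension, followed by one line per word holding the word and its space-separated vector components. The sketch below parses a tiny in-memory example of that layout; the two words and their 3-dimensional vectors are made up.

```python
import io

# hypothetical two-word, 3-dimensional embedding in word2vec text format
sample = io.StringIO(
    "2 3\n"
    "film 0.1 0.2 0.3\n"
    "plot -0.4 0.5 0.6\n"
)

# first line: "<number of words> <vector dimension>"
header = sample.readline().split()
n_words, n_dims = int(header[0]), int(header[1])

# remaining lines: "<word> <v1> <v2> ... <vN>"
vectors = {}
for line in sample:
    parts = line.split()
    vectors[parts[0]] = [float(x) for x in parts[1:]]

print(n_words, n_dims)   # 2 3
print(vectors['plot'])   # [-0.4, 0.5, 0.6]
```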


In [27]:
# Create concatenated train docs with negative reviews first, then positive
train_docs = negative_docs + positive_docs
len(train_docs)

58109

<h3>Create Keras Tokenizer and Fit It on the Training Sentences</h3>

In [28]:
# create the tokenizer
tokenizer = Tokenizer()

# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)

In [29]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
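
The following sketch roughly mirrors what the tokenizer does in the two cells above: words are ranked by frequency, indices start at 1 (0 is left free for padding), and words that were not seen during fitting are dropped when encoding. The helpers `build_word_index` and `encode` are hypothetical simplifications, not the Keras implementation, and tie-breaking between equally frequent words may differ.

```python
from collections import Counter

def build_word_index(docs):
    """Map each word to an integer rank, starting at 1 for the most frequent."""
    counts = Counter(word for doc in docs for word in doc)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(doc, word_index):
    """Replace known words by their index; unknown words are dropped."""
    return [word_index[w] for w in doc if w in word_index]

docs = [['good', 'film'], ['bad', 'film']]
word_index = build_word_index(docs)
encoded = encode(['bad', 'film', 'unseen'], word_index)

print(word_index)           # {'film': 1, 'good': 2, 'bad': 3}
print(encoded)              # [3, 1]
print(len(word_index) + 1)  # 4, the size needed to cover indices 0..3
```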

In [30]:

# pad sequences
max_length = max([len(s) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

# generate training labels for 58109 sentences using counts from prelabelled sets
ytrain = array([0 for _ in range(28584)] + [1 for _ in range(29525)])
print(ytrain)

[0 0 0 ... 1 1 1]
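
For intuition, post-padding can be sketched in plain Python as below. The helper `pad_post` is hypothetical; it appends zeros on the right and, for sequences longer than `maxlen`, keeps the last `maxlen` items, which is my understanding of the default truncation behavior of `pad_sequences`. It assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        # keep only the last maxlen items when a sequence is too long
        seq = list(seq)[-maxlen:]
        # append the padding value on the right up to maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

result = pad_post([[3, 1], [5, 2, 7, 4, 9]], maxlen=4)
print(result)  # [[3, 1, 0, 0], [2, 7, 4, 9]]
```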


In [31]:
# load selected test reviews and get counts to assign labels
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
print('Number of negative docs: %d' % len(negative_docs))
print('Number of positive docs: %d' % len(positive_docs))
test_docs = negative_docs + positive_docs

Number of negative docs: 3199
Number of positive docs: 3412


In [32]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(3199)] + [1 for _ in range(3412)])

In [33]:
# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
vocab_size

25768

<h3>Add Embedding Layer Created Using Exported Word2Vec Text File</h3>

In [35]:
from numpy import asarray, zeros
from keras.layers import Embedding

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip the header line
    with open(filename, 'r') as file:
        lines = file.readlines()[1:]
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 1, because index 0 is reserved (used here for padding)
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        # leave the row as zeros when the word has no vector
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, tokenizer.word_index)
# create the embedding layer
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=max_length, trainable=False)
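
The row layout of the weight matrix can be checked on a toy example. This sketch uses plain Python lists in place of the NumPy array, and the helper `build_weight_rows`, the words, and the 2-dimensional vectors are all hypothetical. Row `i` holds the vector of the word whose tokenizer index is `i`; row 0 and rows for words without a vector stay at zero.

```python
def build_weight_rows(embedding, word_index, dim):
    # one row per word index, plus row 0, which stays all zeros
    rows = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embedding.get(word)
        # words missing from the embedding keep their zero row
        if vector is not None:
            rows[i] = list(vector)
    return rows

embedding = {'film': [0.1, 0.2], 'plot': [0.3, 0.4]}
word_index = {'film': 1, 'plot': 2, 'cameo': 3}
rows = build_weight_rows(embedding, word_index, dim=2)
for row in rows:
    print(row)
# [0.0, 0.0]
# [0.1, 0.2]
# [0.3, 0.4]
# [0.0, 0.0]
```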

In [37]:
# define model
model = Sequential()
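# note: this is a new, trainable Embedding layer; the Word2Vec-initialized
# embedding_layer built in the previous cell is not added to the model here
model.add(Embedding(vocab_size, 100, input_length=max_length))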
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 75, 100)           2576800   
                                                                 
 conv1d_2 (Conv1D)           (None, 68, 32)            25632     
                                                                 
 max_pooling1d_2 (MaxPooling  (None, 34, 32)           0         
 1D)                                                             
                                                                 
 flatten_2 (Flatten)         (None, 1088)              0         
                                                                 
 dense_4 (Dense)             (None, 10)                10890     
                                                                 
 dense_5 (Dense)             (None, 1)                 11        
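
The output shapes and parameter counts in this summary can be reproduced by hand from the layer settings. The sketch below assumes the Keras defaults of `valid` padding and stride 1 for `Conv1D`; the total is derived from the per-layer numbers, since the summary footer is not shown in the output above.

```python
vocab_size, embed_dim, seq_len = 25768, 100, 75
filters, kernel_size, pool_size, dense_units = 32, 8, 2, 10

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim                # 2576800
# Conv1D with 'valid' padding shortens the sequence by kernel_size - 1
conv_len = seq_len - kernel_size + 1                     # 68
conv_params = (kernel_size * embed_dim + 1) * filters    # 25632
# MaxPooling1D halves the length; Flatten merges length and filters
pool_len = conv_len // pool_size                         # 34
flat_size = pool_len * filters                           # 1088
# Dense layers: weights plus one bias per unit
dense_params = flat_size * dense_units + dense_units     # 10890
output_params = dense_units * 1 + 1                      # 11

total_params = embedding_params + conv_params + dense_params + output_params
print(total_params)  # 2613333
```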
                                                      

In [38]:
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)

Epoch 1/10
1816/1816 - 35s - loss: 0.6170 - accuracy: 0.6468 - 35s/epoch - 19ms/step
Epoch 2/10
1816/1816 - 35s - loss: 0.4241 - accuracy: 0.7998 - 35s/epoch - 19ms/step
Epoch 3/10
1816/1816 - 37s - loss: 0.2042 - accuracy: 0.9107 - 37s/epoch - 20ms/step
Epoch 4/10
1816/1816 - 36s - loss: 0.1009 - accuracy: 0.9555 - 36s/epoch - 20ms/step
Epoch 5/10
1816/1816 - 36s - loss: 0.0629 - accuracy: 0.9721 - 36s/epoch - 20ms/step
Epoch 6/10
1816/1816 - 37s - loss: 0.0476 - accuracy: 0.9782 - 37s/epoch - 20ms/step
Epoch 7/10
1816/1816 - 36s - loss: 0.0394 - accuracy: 0.9822 - 36s/epoch - 20ms/step
Epoch 8/10
1816/1816 - 36s - loss: 0.0368 - accuracy: 0.9836 - 36s/epoch - 20ms/step
Epoch 9/10
1816/1816 - 37s - loss: 0.0342 - accuracy: 0.9849 - 37s/epoch - 20ms/step
Epoch 10/10
1816/1816 - 37s - loss: 0.0313 - accuracy: 0.9859 - 37s/epoch - 20ms/step


<keras.callbacks.History at 0x232293a5100>
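
The `1816` steps per epoch in the log follow from the training set size, assuming the Keras default batch size of 32 (no `batch_size` is passed to `fit` above):

```python
import math

n_train = 58109   # training sentences reported earlier
batch_size = 32   # assumed Keras default

# the last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 1816
```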

In [39]:
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Test Accuracy: 61.911964
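
One way to put this number in context, offered as a rough sanity check rather than a full analysis, is to compare it with a majority-class baseline on the test set and with the final training accuracy from the log. The counts and accuracies below are copied from the outputs above.

```python
n_neg, n_pos = 3199, 3412          # test sentences per class
n_test = n_neg + n_pos

# accuracy of always predicting the more frequent class
majority_baseline = 100 * max(n_neg, n_pos) / n_test

train_acc = 98.59                  # epoch 10 training accuracy, in percent
test_acc = 61.911964               # reported test accuracy, in percent
gap = train_acc - test_acc

print(n_test)                      # 6611
print(round(majority_baseline, 2)) # 51.61
print(round(gap, 2))               # 36.68
```

The test accuracy is about ten points above the baseline, while the large gap to the training accuracy suggests the model is overfitting the training sentences.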
