In this project, I process text with an RNN in a 2x2 experimental design. The experiment uses a train/test split, training for 50 epochs with a batch size of 100 and the Adam optimizer. Two pretrained word embeddings, glove.6B.50d.txt and glove.6B.100d.txt, are combined with two vocabulary sizes to create the following 2x2 experimental design:

1. 10000 words with glove50d
2. 20000 words with glove50d
3. 10000 words with glove100d
4. 20000 words with glove100d
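
The four cells of this design can be enumerated programmatically. The following is only a minimal sketch: the names `embedding_files`, `vocab_sizes`, and `design` are illustrative and are not used elsewhere in this notebook.

```python
from itertools import product

# File names and vocabulary sizes as listed above
embedding_files = ['glove.6B.50d.txt', 'glove.6B.100d.txt']
vocab_sizes = [10000, 20000]

# Cartesian product: every embedding paired with every vocabulary size
design = [
    {'model': n, 'embedding': emb, 'vocab_size': size}
    for n, (emb, size) in enumerate(
        product(embedding_files, vocab_sizes), start=1)
]

for cell in design:
    print(cell)
```

The order produced by `product` matches the numbering of the models in the list above.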

As an optional experiment, I also ran the AdaDelta optimizer with 10000 and 20000 words from the glove100d embedding, because the Adam optimizer performed better with this embedding than with glove50d.

<b>Management Problem:</b>
How would you advise senior management?
Based on the results of this experiment, I would advise senior management to use an RNN model with the glove.6B.100d.txt embedding vectors and the Adam optimizer. The model would be trained on existing customer feedback for 50 epochs with a batch size of 100. Once trained, it can identify negative customer sentiment. To reduce processing time, the model relies on pretrained word vectors, either the full set or only a subset. The model with the highest mean training accuracy (0.79) used a subset of 20000 word vectors, although the 10000-word subset gave a slightly higher mean test accuracy (0.62 versus 0.61).

## Importing the required packages

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import pandas as pd

import os  # operating system functions
import os.path  # for manipulation of file path names

import re  # regular expressions

from collections import defaultdict

import nltk
from nltk.tokenize import TreebankWordTokenizer

import tensorflow as tf

# Scikit Learn for random splitting of the data  
from sklearn.model_selection import train_test_split

# Setting seed for reproducible results
RANDOM_SEED = 9999

# To remove all future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# To make output stable across runs
def reset_graph(seed= RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
    
# Setting the embeddings directory
embeddings_directory = 'embeddings/gloVe.6B'

## Functions for Importing Embeddings

In [2]:
# A function to create a word dictionary with or without indexes.
# It takes the embeddings file name and a flag
# indicating whether indexes are needed.

def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    """
    Read an embeddings txt file. If `with_indexes=True`,
    we return a tuple of two objects
    `(word_to_index_dict, index_to_embedding_array)`,
    otherwise we return only a direct
    `word_to_embedding_dict` dictionary mapping
    from a string to a numpy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Keep the loaded vectors; only unseen words fall back to zeros
        word_to_embedding_dict = defaultdict(
            lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict
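
The unknown-word handling above can be illustrated on a tiny in-memory file. This is a sketch under assumptions: the two-line `toy_file` and its three-dimensional vectors are made up for illustration, and plain Python lists stand in for the numpy array.

```python
import io
from collections import defaultdict

# Hypothetical two-word file in the GloVe text layout: word, then floats
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

word_to_index = {}
vectors = []
for i, line in enumerate(toy_file):
    parts = line.rstrip().split(' ')
    word_to_index[parts[0]] = i
    vectors.append([float(v) for v in parts[1:]])

# Append one all-zero row and send every unseen word to it
unknown_index = len(vectors)
vectors.append([0.0] * len(vectors[0]))
word_to_index = defaultdict(lambda: unknown_index, word_to_index)

print(word_to_index['cat'])           # 1
print(vectors[word_to_index['dog']])  # [0.0, 0.0, 0.0]
```

Note that looking up a missing key in a `defaultdict` also stores that key, so the dictionary grows as unseen words are queried.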

In [3]:
def get_embeddings(filename):
    # Getting the embeddings
    embeddings_filename = os.path.join(embeddings_directory, filename)
    
    # Loading the embeddings
    print('\nLoading embeddings from', embeddings_filename)
    word_to_index, index_to_embedding = \
        load_embedding_from_disks(embeddings_filename, with_indexes=True)
    print("Embedding loaded from disks.")
    
    # Getting the vocab size and embedding dimensions
    embedding_dim = index_to_embedding.shape[1]
    print("Embedding is of shape: {}".format(index_to_embedding.shape))
    return word_to_index, index_to_embedding, embedding_dim

## Utility functions for limiting the vocabulary and preparing the data

In [4]:
# Function to create smaller vocab lists based on vocabulary size.

def get_index_for_vocabsize(EVOCABSIZE, word_to_index,
                            index_to_embedding, embedding_dim):
    
    # Need a callable function for dictionary
    def default_factory():
        return EVOCABSIZE
    
    # Keep only the words whose index is below EVOCABSIZE;
    # items() yields (key, value) pairs
    limited_word_to_index = defaultdict(default_factory,
        {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

    # Select the first EVOCABSIZE rows to the index_to_embedding
    limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
    # Set the unknown-word row to be all zeros as previously
    limited_index_to_embedding = np.append(limited_index_to_embedding, 
        index_to_embedding[index_to_embedding.shape[0] - 1, :].\
            reshape(1,embedding_dim), 
        axis = 0)
    print("Limited indexes and word dict created with shape:",
          limited_index_to_embedding.shape)
    return limited_word_to_index, limited_index_to_embedding
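
To see what the vocabulary limit does to individual words, here is a small sketch with a hypothetical five-word vocabulary; `vocab_size` plays the role of `EVOCABSIZE`, and the words and indexes are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy vocabulary, ordered by index as in the GloVe files
word_to_index = {'the': 0, 'of': 1, 'movie': 2, 'plot': 3, 'zeitgeist': 4}
vocab_size = 3  # plays the role of EVOCABSIZE

# Words at or beyond the cutoff are dropped and fall back to vocab_size,
# which is the row index of the unknown-word vector
limited = defaultdict(
    lambda: vocab_size,
    {w: i for w, i in word_to_index.items() if i < vocab_size})

indexes = [limited[w] for w in ['the', 'plot', 'movie', 'zeitgeist']]
print(indexes)  # [0, 3, 2, 3]
```

Both 'plot' and 'zeitgeist' share index 3 here, so the network would receive the same unknown-word vector for each of them.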

In [5]:
# Utility function to get file names within a directory

def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)

In [6]:
# define list of codes to be dropped from document
# carriage-returns, line-feeds, tabs

codelist = ['\r', '\n', '\t']

In [7]:
# Text parsing function for creating text documents

def text_parse(string):
    # replace non-alphabetic characters (digits included) with spaces
    temp_string = re.sub('[^a-zA-Z]', '  ', string)    
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)      
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)
    # convert uppercase to lowercase
    temp_string = temp_string.lower()    
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)
    return(temp_string)
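
The effect of the main cleaning steps can be checked on a short string. The `clean` function below is a hypothetical stand-in that mirrors the regular-expression steps of `text_parse` (it leaves out the `codelist` loop) so that the sketch runs on its own.

```python
import re

def clean(text):
    # hypothetical stand-in mirroring the main regex steps of text_parse
    text = re.sub('[^a-zA-Z]', '  ', text)   # non-letters become spaces
    text = re.sub(r'\s.\s', ' ', text)       # drop single-character words
    text = text.lower()
    return re.sub(r'\s+', ' ', text)         # collapse runs of whitespace

tokens = clean("It is a great film!\n").split()
print(tokens)  # ['it', 'is', 'great', 'film']
```

The single-character word "a" and the punctuation are removed, and everything is lowercased before tokenization.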

In [8]:
# Read data function that takes filename as a param and reads it into 
# a text list which is then split into words.

def read_data(filename):
    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)
    return data

## Data Setup - Processing Movie Reviews

In [9]:
def gather_movie_reviews(dir_end):

    dir_start = '/Assignment 8/'
    dir_name = os.path.join(dir_start, dir_end)
    docs = []

    filenames = listdir_no_hidden(path=dir_name)
    num_files = len(filenames)

    for i in range(len(filenames)):
        file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
        assert file_exists
    
    print('\nDirectory:',dir_end)
    print('%d files found' % len(filenames))
    
    for i in range(num_files):
        words = read_data(os.path.join(dir_name, filenames[i]))
        docs.append(words)
    
    return docs

In [10]:
# gather data for 500 negative movie reviews

neg_dir_end = 'run-jump-start-rnn-sentiment-v002/movie-reviews-negative'

negative_documents = gather_movie_reviews(neg_dir_end)

# gather data for 500 positive movie reviews

pos_dir_end = 'run-jump-start-rnn-sentiment-v002/movie-reviews-positive'

positive_documents = gather_movie_reviews(pos_dir_end)


Directory: run-jump-start-rnn-sentiment-v002/movie-reviews-negative
500 files found

Directory: run-jump-start-rnn-sentiment-v002/movie-reviews-positive
500 files found


In [11]:
# Checking the minimum and maximum lengths of the movie reviews

max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length)

max_review_length: 1052
min_review_length: 22


In [12]:
# Construct a list of 1000 lists with 40 words in each list:
# the first 20 and the last 20 words of every review

from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))
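
The head-and-tail slicing always yields 40 tokens for the reviews in this data set, even for the shortest review of 22 words, because the two slices are allowed to overlap. A minimal sketch, where `head_tail` is a hypothetical helper and integers stand in for words:

```python
def head_tail(doc, n=20):
    # hypothetical helper: first n and last n tokens of a tokenized review
    return doc[:n] + doc[len(doc) - n:]

long_doc = list(range(100))
short_doc = list(range(22))  # same length as the shortest review

print(len(head_tail(long_doc)), len(head_tail(short_doc)))  # 40 40
print(head_tail(short_doc)[18:22])  # [18, 19, 2, 3]
```

For the 22-word document, tokens 2 through 19 appear twice, so short reviews contribute some repeated words to the RNN input.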

## Building RNN Model Shell

It takes an optimizer, the limited word-to-index dictionary, and the limited embedding array as parameters, and returns a dataframe of train and test accuracies after running the model for 50 epochs with a batch size of 100.

In [13]:
def gen_RNN_model(optimizer, limited_word_to_index, limited_index_to_embedding):

    # create list of lists of lists for embeddings
    embeddings = []    
    for doc in documents:
        embedding = []
        for word in doc:
            embedding.append(limited_index_to_embedding[\
                                                limited_word_to_index[word]]) 
        embeddings.append(embedding)

    # Make embeddings a numpy array for use in an RNN 
    # Create training and test sets with Scikit Learn

    embeddings_array = np.array(embeddings)

    # Define the labels to be used 500 negative (0) and 500 positive (1)
    thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                          np.ones((500), dtype = np.int32)), axis = 0)

    # Random splitting of the data into training (80%) and test (20%)
    X_train, X_test, y_train, y_test = \
        train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, 
                         random_state = RANDOM_SEED)

    reset_graph()

    # number of words per document
    n_steps = embeddings_array.shape[1]
    # dimension of  pre-trained embeddings
    n_inputs = embeddings_array.shape[2]
    n_neurons = 20  # analyst specified number of neurons
    n_outputs = 2  # thumbs-down or thumbs-up

    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    y = tf.placeholder(tf.int32, [None])

    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

    logits = tf.layers.dense(states, n_outputs)
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    loss = tf.reduce_mean(xentropy)
    
    training_op = optimizer.minimize(loss)
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    init = tf.global_variables_initializer()

    n_epochs = 50
    batch_size = 100

    # one [train accuracy, test accuracy] pair per epoch
    result_list = [None] * n_epochs

    with tf.Session() as sess:
        init.run()
        for epoch in range(n_epochs):
            for iteration in range(y_train.shape[0] // batch_size):          
                X_batch = X_train[iteration*batch_size:(iteration + 1)\
                                  *batch_size,:]
                y_batch = y_train[iteration*batch_size:(iteration + 1)\
                                  *batch_size]
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            # note: training accuracy is measured on the last mini-batch only
            acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
            result_list[epoch] = [acc_train,acc_test]
    result_list_df = pd.DataFrame(result_list)
    result_list_df.columns=['Train Accuracy','Test Accuracy']
    return result_list_df
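
The mini-batch arithmetic in the training loop can be checked in isolation. This sketch assumes 800 training examples, which follows from the 80% split of the 1000 reviews; the name `slices` is illustrative only.

```python
n_train = 800      # 80% of the 1000 reviews
batch_size = 100

# Integer division: any leftover examples beyond a full batch are skipped
n_batches = n_train // batch_size

# (start, stop) bounds used to slice X_train and y_train in each iteration
slices = [(i * batch_size, (i + 1) * batch_size) for i in range(n_batches)]

print(n_batches)              # 8
print(slices[0], slices[-1])  # (0, 100) (700, 800)
```

With these sizes every training example is used once per epoch, giving 8 weight updates per epoch and 400 over the 50 epochs.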

## Setting up the 2 x 2 design

In [22]:
# Optimizer
OPT = tf.train.AdamOptimizer(learning_rate=0.001)

# Vocabsize
EVOCABSIZE1 = 10000
EVOCABSIZE2 = 20000

# Embedding
FILENAME1 = 'glove.6B.50d.txt'
FILENAME2 = 'glove.6B.100d.txt'

word_to_index1, index_to_embedding1, embedding_dim1 = \
    get_embeddings(FILENAME1)
word_to_index2, index_to_embedding2, embedding_dim2 = \
    get_embeddings(FILENAME2)


Loading embeddings from embeddings/gloVe.6B\glove.6B.50d.txt
Embedding loaded from disks.
Embedding is of shape: (400001, 50)

Loading embeddings from embeddings/gloVe.6B\glove.6B.100d.txt
Embedding loaded from disks.
Embedding is of shape: (400001, 100)


## Model 1 using Vocab size 10000, glove.6B.50d.txt embedding vector and Adam Optimizer

In [23]:
# Get the limited word index array and embedding using vocabsize
limited_word_to_index, limited_index_to_embedding = \
    get_index_for_vocabsize(EVOCABSIZE1, word_to_index1,
                            index_to_embedding1, embedding_dim1)

# Generate and run the model
result_df_model1 = gen_RNN_model(OPT, limited_word_to_index,
                                 limited_index_to_embedding)

print('\nMean Train and Test Accuracies from Model 1\n')
print(result_df_model1.mean())

Limited indexes and word dict created with shape: (10001, 50)

Mean Train and Test Accuracies from Model 1

Train Accuracy    0.6710
Test Accuracy     0.6162
dtype: float64


## Model 2 using Vocab size 20000, glove.6B.50d.txt embedding vector and Adam Optimizer

In [24]:
# Get the limited word index array and embedding using vocabsize
limited_word_to_index, limited_index_to_embedding = \
    get_index_for_vocabsize(EVOCABSIZE2, word_to_index1,
                            index_to_embedding1, embedding_dim1)

# Generate and run the model
result_df_model2 = gen_RNN_model(OPT, limited_word_to_index,
                                 limited_index_to_embedding)

print('\nMean Train and Test Accuracies from Model 2\n')
print(result_df_model2.mean())

Limited indexes and word dict created with shape: (20001, 50)

Mean Train and Test Accuracies from Model 2

Train Accuracy    0.6772
Test Accuracy     0.6084
dtype: float64


## Model 3 using Vocab size 10000, glove.6B.100d.txt embedding vector and Adam Optimizer

In [25]:
# Get the limited word index array and embedding using vocabsize
limited_word_to_index, limited_index_to_embedding = \
    get_index_for_vocabsize(EVOCABSIZE1, word_to_index2,
                            index_to_embedding2, embedding_dim2)

# Generate and run the model
result_df_model3 = gen_RNN_model(OPT, limited_word_to_index,
                                 limited_index_to_embedding)

print('\nMean Train and Test Accuracies from Model 3\n')
print(result_df_model3.mean())

Limited indexes and word dict created with shape: (10001, 100)

Mean Train and Test Accuracies from Model 3

Train Accuracy    0.7522
Test Accuracy     0.6233
dtype: float64


## Model 4 using Vocab size 20000, glove.6B.100d.txt embedding vector and Adam Optimizer

In [26]:
# Get the limited word index array and embedding using vocabsize
limited_word_to_index, limited_index_to_embedding = \
    get_index_for_vocabsize(EVOCABSIZE2, word_to_index2,
                            index_to_embedding2, embedding_dim2)

# Generate and run the model
result_df_model4 = gen_RNN_model(OPT, limited_word_to_index,
                                 limited_index_to_embedding)

print('\nMean Train and Test Accuracies from Model 4\n')
print(result_df_model4.mean())

Limited indexes and word dict created with shape: (20001, 100)

Mean Train and Test Accuracies from Model 4

Train Accuracy    0.7898
Test Accuracy     0.6108
dtype: float64


## Testing two additional models with AdaDelta Optimizer

In [27]:
OPT = tf.train.AdadeltaOptimizer(learning_rate=0.001)

## Model 5 using Vocab size 10000, glove.6B.100d.txt embedding vector and AdaDelta Optimizer

In [28]:
# Get the limited word index array and embedding using vocabsize
limited_word_to_index, limited_index_to_embedding = \
    get_index_for_vocabsize(EVOCABSIZE1, word_to_index2,
                            index_to_embedding2, embedding_dim2)

# Generate and run the model
result_df_model5 = gen_RNN_model(OPT, limited_word_to_index,
                                 limited_index_to_embedding)

print('\nMean Train and Test Accuracies from Model 5\n')
print(result_df_model5.mean())

Limited indexes and word dict created with shape: (10001, 100)

Mean Train and Test Accuracies from Model 5

Train Accuracy    0.4600
Test Accuracy     0.5342
dtype: float64


## Model 6 using Vocab size 20000, glove.6B.100d.txt embedding vector and AdaDelta Optimizer

In [29]:
# Get the limited word index array and embedding using vocabsize
limited_word_to_index, limited_index_to_embedding = \
    get_index_for_vocabsize(EVOCABSIZE2, word_to_index2,
                            index_to_embedding2, embedding_dim2)

# Generate and run the model
result_df_model6 = gen_RNN_model(OPT, limited_word_to_index,
                                 limited_index_to_embedding)

print('\nMean Train and Test Accuracies from Model 6\n')
print(result_df_model6.mean())

Limited indexes and word dict created with shape: (20001, 100)

Mean Train and Test Accuracies from Model 6

Train Accuracy    0.460
Test Accuracy     0.493
dtype: float64
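
The mean accuracies printed above can be gathered into one structure for a side-by-side comparison. This is only a sketch: the numbers are copied from the outputs of Models 1 to 6, and the tuple layout and variable names are illustrative.

```python
# (model, vocab size, embedding, optimizer, mean train acc, mean test acc)
results = [
    (1, 10000, 'glove50d', 'Adam', 0.6710, 0.6162),
    (2, 20000, 'glove50d', 'Adam', 0.6772, 0.6084),
    (3, 10000, 'glove100d', 'Adam', 0.7522, 0.6233),
    (4, 20000, 'glove100d', 'Adam', 0.7898, 0.6108),
    (5, 10000, 'glove100d', 'AdaDelta', 0.4600, 0.5342),
    (6, 20000, 'glove100d', 'AdaDelta', 0.460, 0.493),
]

best_test = max(results, key=lambda r: r[5])
best_train = max(results, key=lambda r: r[4])

print('Highest mean test accuracy:  model', best_test[0])   # model 3
print('Highest mean train accuracy: model', best_train[0])  # model 4

# Train-minus-test gap per model, a rough indicator of overfitting
for model, _, _, _, train_acc, test_acc in results:
    print(model, round(train_acc - test_acc, 4))
```

Model 4 has the highest training accuracy but also the largest gap between training and test accuracy, while Model 3 has the highest test accuracy.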
