**The original study intended to test how the type of pretrained word vector and the vocabulary size affect a basic RNN's performance on sentiment analysis. However, the analyst had difficulty obtaining fastText in .txt format and was not able to find a solution before the deadline. Therefore, the study was modified to test how vocabulary size and embedding dimension affect a basic RNN's performance on sentiment analysis. Since the embedding dimension determines the number of inputs of the RNN, this study also examines the impact of RNN design on the sentiment analysis result.** 


**The study used a 2x2 factorial design. Factor 1: vocabulary size; Factor 2: embedding dimension. To satisfy the design, four (4) pretrained word vectors were used: gloVe.6b.50d (50 dimensions, 400K vocabulary size), gloVe.6b.100d (100 dimensions, 400K vocabulary size), gloVe.Twitter.50d (50 dimensions, 1.2M vocabulary size), and gloVe.Twitter.100d (100 dimensions, 1.2M vocabulary size).** 


**The initial analysis of the movie reviews data shows that the reviews vary from 22 to 1052 words. It was decided to use the first 20 and last 20 words of each review as the word sequence for analysis. The 500 negative reviews and 500 positive reviews were converted into a list of 1000 lists of 40 words each.**


**The 10000 most frequently used words from each of the four (4) pretrained word vectors were used to construct a compressed embedding, which maps the preprocessed review data (a list of 1000 lists of 40 words) to a numpy array of shape (1000, 40, embedding dimension). This numpy array was used as the input of the RNN. Another numpy array of 500 0s (thumbs down) and 500 1s (thumbs up) was constructed as the dependent variable. A simple train/test split was used as the validation method for this study: 80% of the data was randomly chosen as the train set and the rest was used as the test set.**


**A total of four (4) tests were run under the 2x2 factorial design. The model's accuracy on the train set and on the test set were used as performance indices. The results show that the basic RNN is prone to overfitting. When the number of inputs was held constant, the RNN built with an embedding compressed from a pre-trained word vector with a larger vocabulary achieved better test accuracy than the one built with an embedding from a smaller-vocabulary vector. The results also show that the RNN with more inputs, in other words the one using embeddings of higher dimension, achieved better accuracy on both the train and test sets than the one with fewer inputs (lower dimension).** 


**For this study, we preprocessed the review data to include only the first 20 and last 20 words of each review document. We also used compressed embeddings instead of the full pre-trained word vectors to build the RNN, and we observed overfitting with the basic RNN structure. Therefore, as the next step, management should have data scientists (1) test RNN performance with different word sequences; (2) test RNN performance with full pre-trained vectors of different vocabulary sizes; (3) use a more sophisticated RNN structure including LSTM/GRU cells, a dropout layer, an additional fully-connected layer, etc.; and (4) consider a multiclass classification in which the most critical reviews, either the most positive or the most negative ones, can be identified.** 

**The following code is borrowed and slightly modified from Miller (2018) run-jump-start-rnn-sentiment-v002.**

# SET UP

Importing needed packages for this analysis. 

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import pandas as pd

import os  # operating system functions
import os.path  # for manipulation of file path names

import re  # regular expressions

from collections import defaultdict, OrderedDict

import nltk
from nltk.tokenize import TreebankWordTokenizer

import tensorflow as tf

from sklearn.model_selection import train_test_split

In [2]:
RANDOM_SEED = 9999

Setup to make output stable across runs.

In [3]:
def reset_graph(seed= RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

No stopword removal. 

In [4]:
REMOVE_STOPWORDS = False 

# MOVIE REVIEWS DATA

Code for working with movie reviews data. Source: Miller, T. W. (2016). Web and Network Data Science. Upper Saddle River, N.J.: Pearson Education. ISBN-13: 978-0-13-388644-3
The original study used a simple bag-of-words approach to sentiment analysis, along with pre-defined lists of negative and positive words. Code available at: https://github.com/mtpa/wnds

Define utility function to get file names within a directory.

In [5]:
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)

Define the list of codes to be dropped from each document: carriage returns, line feeds, and tabs.

In [6]:
codelist = ['\r', '\n', '\t']  

We will not remove stopwords in this research, as they are important for keeping sentences intact.

In [7]:
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))

Add a number of words along with contractions and other word strings to the usual English stopwords to be dropped from a document collection.

In [9]:
more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br',\
        've', 're', 'vs'] 

some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

stoplist = nltk.corpus.stopwords.words('english') + more_stop_words + some_proper_nouns_to_remove

Define text parsing function for creating text documents.

In [10]:
def text_parse(string):
    # replace non-alphabetic characters with space 
    temp_string = re.sub(r'[^a-zA-Z]', '  ', string)    
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)      
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)   
    # convert uppercase to lowercase
    temp_string = temp_string.lower()    
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for i in range(len(stoplist)):
            stopstring = ' ' + str(stoplist[i]) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)        
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)    
    return(temp_string)
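
As a quick check of what these cleaning steps do, the sketch below applies the same sequence of substitutions to a made-up review fragment. The function name `clean_text` and the sample string are assumptions for illustration only; stopword removal is left out because `REMOVE_STOPWORDS` is `False` in this study.

```python
import re

def clean_text(raw):
    """Hypothetical, simplified stand-in for text_parse (no stopword removal)."""
    text = re.sub(r'[^a-zA-Z]', '  ', raw)   # every non-letter becomes two spaces
    text = re.sub(r'\s.\s', ' ', text)       # drop isolated single characters
    text = text.lower()                      # convert to lowercase
    text = re.sub(r'\s+', ' ', text)         # collapse runs of whitespace
    return text

# Made-up sample text, not taken from the review corpus
sample = "It's a GREAT film!<br />10/10, would watch again."
cleaned = clean_text(sample)
print(cleaned.split())
```

Because punctuation is replaced before single characters are dropped, a contraction such as "It's" loses its trailing "s", and markup residue such as "br" survives unless stopword removal is switched on.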

Define the read_data function, which cleans the text of one movie review and breaks it into words. The results are stored in a list of lists, where each inner list represents a document as a list of words.

In [11]:
def read_data(filename):
    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank
    return data

Gather data for 500 negative movie reviews.

In [12]:
dir_name = 'movie-reviews-negative'
    
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: movie-reviews-negative
500 files found


In [13]:
negative_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    negative_documents.append(words)


Processing document files under movie-reviews-negative


Gather data for 500 positive movie reviews.

In [14]:
dir_name = 'movie-reviews-positive'  
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: movie-reviews-positive
500 files found


In [15]:
positive_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    positive_documents.append(words)


Processing document files under movie-reviews-positive


Check the maximum length and minimum length of the reviews.

In [16]:
max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length) 

max_review_length: 1052
min_review_length: 22


Note that reviews vary from 22 to 1052 words. We will use the first 20 and last 20 words of each review as our word sequences for analysis. Convert the negative/positive documents into a list of 1000 lists (500 negative reviews followed by 500 positive reviews) with 40 words in each list. 

In [17]:
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    

In [18]:
len(documents)

1000
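
The head-and-tail truncation above can be sketched with a small helper. The function name `head_tail` and the placeholder tokens are assumptions for illustration. The sketch also shows a side effect worth keeping in mind: for the shortest review (22 words), the first 20 and last 20 words overlap heavily.

```python
def head_tail(tokens, n=20):
    """Return the first n and last n tokens of a document as one list."""
    return tokens[:n] + tokens[-n:]

# Placeholder documents with the minimum and maximum lengths found above
short_doc = ['w%d' % i for i in range(22)]
long_doc = ['w%d' % i for i in range(1052)]

short_seq = head_tail(short_doc)
long_seq = head_tail(long_doc)

# Both sequences have 40 tokens regardless of the original length
print(len(short_seq), len(long_seq))

# Number of distinct tokens that appear in both halves of the short sequence
overlap = len(set(short_seq[:20]) & set(short_seq[20:]))
print(overlap)
```

For the 22-word document, 18 of the 20 positions in the second half repeat words already seen in the first half, so very short reviews give the RNN less distinct information than the fixed length of 40 suggests.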

# PRETRAINED WORD VECTORS

To test how vocabulary size and embedding dimension may influence the performance of sentiment analysis with an RNN, we will use four (4) embeddings: gloVe.6b.50d (50 dimensions, 400K vocabulary size), gloVe.6b.100d (100 dimensions, 400K vocabulary size), gloVe.Twitter.50d (50 dimensions, 1.2M vocabulary size), and gloVe.Twitter.100d (100 dimensions, 1.2M vocabulary size). 

Set up a utility function for loading embeddings, following the method described in https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer.

In [19]:
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    """
    Read an embeddings txt file. If `with_indexes=True`, 
    we return a tuple of a dictionary and a numpy array
    `(word_to_index_dict, index_to_embedding_array)`, 
    otherwise we return only a direct 
    `word_to_embedding_dict` dictionary mapping 
    from a string to a numpy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # keep the loaded vectors; fall back to zeros only for unknown words
        word_to_embedding_dict = defaultdict(
            lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict
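
To make the file layout concrete, here is a minimal sketch of the parsing idea on three in-memory lines. It assumes the GloVe text layout the loader expects, with a word followed by space-separated numbers on each line. The vectors are made-up 3-dimensional values, and plain lists are used in place of numpy arrays to keep the sketch dependency-free.

```python
from collections import defaultdict

# Hypothetical lines in the layout "<word> <v1> <v2> <v3>"
lines = [
    "the 0.1 0.2 0.3\n",
    "movie 0.4 0.5 0.6\n",
    "good 0.7 0.8 0.9\n",
]

word_to_index = {}
vectors = []
for i, line in enumerate(lines):
    parts = line.rstrip().split(' ')
    word_to_index[parts[0]] = i                     # row number of this word
    vectors.append([float(v) for v in parts[1:]])   # its embedding vector

# One extra all-zero row, shared by every word that is not in the file
unknown_index = len(vectors)
vectors.append([0.0] * len(vectors[0]))
word_to_index = defaultdict(lambda: unknown_index, word_to_index)

print(word_to_index['movie'], vectors[word_to_index['movie']])
print(word_to_index['zzz'], vectors[word_to_index['zzz']])
```

Known words resolve to their own row, while any unseen word falls through to the final zero row. This is the same fallback that `load_embedding_from_disks` builds with `_LAST_INDEX` and `_WORD_NOT_FOUND`.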

Load the first embedding: gloVe.6b.50d

In [20]:
embeddings_directory = 'embeddings/gloVe.6B'  # GloVe 6B vectors (400K vocabulary size)
filename = 'glove.6B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('\nLoading embeddings from', embeddings_filename)
word_to_index_6B5, index_to_embedding_6B5 = load_embedding_from_disks(embeddings_filename, 
                                                                      with_indexes=True)
print("Embedding loaded from disks.")


Loading embeddings from embeddings/gloVe.6B\glove.6B.50d.txt
Embedding loaded from disks.


Load the second embedding: gloVe.6b.100d

In [22]:
embeddings_directory = 'embeddings/gloVe.6B' 
filename = 'glove.6B.100d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('\nLoading embeddings from', embeddings_filename)
word_to_index_6B10, index_to_embedding_6B10 = load_embedding_from_disks(embeddings_filename, 
                                                                        with_indexes=True)
print("Embedding loaded from disks.")


Loading embeddings from embeddings/gloVe.6B\glove.6B.100d.txt
Embedding loaded from disks.


Load the third embedding: gloVe.Twitter.50d

In [24]:
embeddings_directory = 'embeddings/gloVe.twitter.27B' 
filename = 'glove.twitter.27B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('\nLoading embeddings from', embeddings_filename)
word_to_index_T5, index_to_embedding_T5 = load_embedding_from_disks(embeddings_filename, 
                                                                    with_indexes=True)
print("Embedding loaded from disks.")


Loading embeddings from embeddings/gloVe.twitter.27B\glove.twitter.27B.50d.txt
Embedding loaded from disks.


Load the fourth embedding: gloVe.Twitter.100d

In [26]:
embeddings_directory = 'embeddings/gloVe.twitter.27B' 
filename = 'glove.twitter.27B.100d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('\nLoading embeddings from', embeddings_filename)
word_to_index_T10, index_to_embedding_T10 = load_embedding_from_disks(embeddings_filename, 
                                                                      with_indexes=True)
print("Embedding loaded from disks.")


Loading embeddings from embeddings/gloVe.twitter.27B\glove.twitter.27B.100d.txt
Embedding loaded from disks.


# BENCHMARK TEST WITH 2 X 2 FACTORIAL DESIGN

Set up lists to hold the results. 

In [28]:
Train_Accuracy=[]
Test_Accuracy=[]

Define the vocabulary size for the language model, reducing the vocabulary to the 10000 most frequently used words.

In [29]:
EVOCABSIZE = 10000

In [30]:
def default_factory():
    return EVOCABSIZE 
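
The role of `default_factory` can be sketched on a toy vocabulary. The names `VOCAB_LIMIT`, `default_index`, and `full_index` are assumptions for illustration, with a limit of 3 standing in for `EVOCABSIZE = 10000`. The sketch assumes, as the notebook does, that a lower index means a more frequent word.

```python
from collections import defaultdict

VOCAB_LIMIT = 3  # stands in for EVOCABSIZE = 10000

def default_index():
    # every word outside the limited vocabulary maps to this single index
    return VOCAB_LIMIT

# Hypothetical full index, ordered from most to least frequent
full_index = {'the': 0, 'movie': 1, 'good': 2, 'cinematography': 3, 'auteur': 4}

# Keep only the words whose index is below the limit
limited = defaultdict(default_index,
                      {w: i for w, i in full_index.items() if i < VOCAB_LIMIT})

tokens = ['the', 'cinematography', 'good', 'unseenword']
indices = [limited[t] for t in tokens]
print(indices)
```

Both a rare word that was cut from the vocabulary and a word that never appeared in the embedding file end up at index `VOCAB_LIMIT`, so the model cannot tell them apart.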

A total of four (4) tests are run in the 2x2 factorial design. Test 1: original vocabulary size = 400K, embedding dimension = 50; Test 2: original vocabulary size = 1.2M, embedding dimension = 50; Test 3: original vocabulary size = 400K, embedding dimension = 100; Test 4: original vocabulary size = 1.2M, embedding dimension = 100.
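
For reference, the four cells of the design can be enumerated programmatically. This is only an illustrative sketch; the dictionary keys are hypothetical names, and the ordering is chosen to match the test numbering above.

```python
from itertools import product

dimensions = [50, 100]
vocab_sizes = ['400K', '1.2M']

# Dimension varies slowest, matching the Test 1..4 order listed above
design = [
    {'test': n, 'dimension': d, 'vocabulary': v}
    for n, (d, v) in enumerate(product(dimensions, vocab_sizes), start=1)
]
for cell in design:
    print(cell)
```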

Construct RNN for Test 1 and Test 2 with number of neurons set to be 20 and learning rate set to be 0.001.

In [31]:
n_steps = 40  # number of words per document 
n_inputs = 50  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

In [32]:
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
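
Because the embedding dimension sets `n_inputs`, it also changes the number of trainable weights in the network. The sketch below counts them under the assumption that the basic RNN cell holds one weight matrix of shape (n_inputs + n_neurons, n_neurons) plus a bias of length n_neurons, and that the dense output layer holds n_neurons x n_outputs weights plus n_outputs biases. The function name is hypothetical.

```python
def basic_rnn_param_count(n_inputs, n_neurons, n_outputs):
    """Count trainable parameters of a basic RNN cell plus a dense output layer."""
    recurrent = (n_inputs + n_neurons) * n_neurons + n_neurons
    dense = n_neurons * n_outputs + n_outputs
    return recurrent + dense

counts = {n_inputs: basic_rnn_param_count(n_inputs, n_neurons=20, n_outputs=2)
          for n_inputs in (50, 100)}
for n_inputs, total in counts.items():
    print(n_inputs, total)
```

Under these assumptions the 100-dimensional model has 1000 more weights than the 50-dimensional one (2462 versus 1462), all of them in the input-to-hidden connections. The comparison across dimensions is therefore also a comparison between a smaller and a larger network.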

In [33]:
init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 100

Conduct Test 1 and Test 2: in each loop, we first define the limited dictionary based on the desired vocabulary size of 10000 and the word vector, then create a list of lists of lists of embeddings, which is converted to a numpy array to be used in a basic RNN for sentiment analysis. For each test, we run 50 epochs. The accuracies on the train set and test set are recorded for later comparison.
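
The lookup step can be pictured on a toy example. All names and values below are assumptions for illustration: two documents of four tokens each and 3-dimensional vectors, with nested lists standing in for the numpy array.

```python
# Hypothetical limited embedding table
limited_vectors = [
    [0.1, 0.2, 0.3],   # index 0
    [0.4, 0.5, 0.6],   # index 1
    [0.0, 0.0, 0.0],   # index 2: shared row for out-of-vocabulary words
]
word_index = {'good': 0, 'movie': 1}
UNKNOWN = 2

docs = [['good', 'movie', 'good', 'plot'],
        ['movie', 'was', 'not', 'good']]

# Replace every token by its vector: documents x tokens x dimensions
embedded = [[limited_vectors[word_index.get(w, UNKNOWN)] for w in doc]
            for doc in docs]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)
```

In the notebook the same nesting yields an array of shape (1000, 40, n_inputs), which matches the `[None, n_steps, n_inputs]` placeholder `X`.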

In [34]:
word2index = [word_to_index_6B5, word_to_index_T5]
index2embd = [index_to_embedding_6B5, index_to_embedding_T5]

In [35]:
for word_to_index, index_to_embedding in zip(word2index, index2embd):
    limited_word_to_index = defaultdict(default_factory, {k: v for k, v in word_to_index.items() 
                                                          if v < EVOCABSIZE})
    limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
    limited_index_to_embedding = np.append(limited_index_to_embedding, 
                                           index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1,
                                           index_to_embedding.shape[1]), axis=0)
    del index_to_embedding
    embeddings = []
    for doc in documents:
        embedding = []
        for word in doc:
            embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
        embeddings.append(embedding)
    embeddings_array = np.array(embeddings)
    thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                                     np.ones((500), dtype = np.int32)), axis = 0)
    X_train, X_test, y_train, y_test = train_test_split(embeddings_array, 
                                                        thumbs_down_up, test_size=0.20, 
                                                        random_state = RANDOM_SEED)
    with tf.Session() as sess:
        init.run()
        for epoch in range(n_epochs):
            for iteration in range(y_train.shape[0] // batch_size):
                X_batch = X_train[iteration*batch_size:(iteration + 1)*batch_size,:]
                y_batch = y_train[iteration*batch_size:(iteration + 1)*batch_size]
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        Train_Accuracy.append(acc_train)
        Test_Accuracy.append(acc_test)
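
The mini-batch slicing used in the loop above can be sketched on its own. The helper name `batch_slices` is hypothetical; the sizes are those of this study.

```python
def batch_slices(n_samples, batch_size):
    """Yield (start, stop) pairs for consecutive full mini-batches."""
    for iteration in range(n_samples // batch_size):
        yield iteration * batch_size, (iteration + 1) * batch_size

n_train = 800      # 80% of the 1000 reviews
batch_size = 100
n_epochs = 50

slices = list(batch_slices(n_train, batch_size))
print(slices[0], slices[-1])

# Total number of weight updates per test
total_updates = len(slices) * n_epochs
print(total_updates)
```

Two details of the loop are worth noting when reading the results: the batches are visited in the same order in every epoch, and the recorded train accuracy is the value computed on the last mini-batch of 100 reviews in the final epoch, not on all 800 training reviews.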

Construct RNN for Test 3 and Test 4. 

In [36]:
tf.reset_default_graph()

n_steps = 40  # number of words per document 
n_inputs = 100  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

In [37]:
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [38]:
init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 100

Conduct Test 3 and Test 4: in each loop, we first define the limited dictionary based on the desired vocabulary size of 10000 and the word vector, then create a list of lists of lists of embeddings, which is converted to a numpy array to be used in a basic RNN for sentiment analysis. For each test, we run 50 epochs. The accuracies on the train set and test set are recorded for later comparison.

In [39]:
word2index = [word_to_index_6B10, word_to_index_T10]
index2embd = [index_to_embedding_6B10, index_to_embedding_T10]

In [40]:
for word_to_index, index_to_embedding in zip(word2index, index2embd):
    limited_word_to_index = defaultdict(default_factory, {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
    limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
    limited_index_to_embedding = np.append(limited_index_to_embedding, 
                                           index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1,
                                           index_to_embedding.shape[1]), axis=0)
    del index_to_embedding
    embeddings = []
    for doc in documents:
        embedding = []
        for word in doc:
            embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
        embeddings.append(embedding)
    embeddings_array = np.array(embeddings)
    thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                                     np.ones((500), dtype = np.int32)), axis = 0)
    X_train, X_test, y_train, y_test = train_test_split(embeddings_array, 
                                                        thumbs_down_up, test_size=0.20, 
                                                        random_state = RANDOM_SEED)
    with tf.Session() as sess:
        init.run()
        for epoch in range(n_epochs):
            for iteration in range(y_train.shape[0] // batch_size):
                X_batch = X_train[iteration*batch_size:(iteration + 1)*batch_size,:]
                y_batch = y_train[iteration*batch_size:(iteration + 1)*batch_size]
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
                acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        Train_Accuracy.append(acc_train)
        Test_Accuracy.append(acc_test)

In [41]:
print(Train_Accuracy)

[0.82, 0.84, 0.86, 0.86]


In [42]:
print(Test_Accuracy)

[0.62, 0.69, 0.67, 0.765]


In [43]:
dimensions = [50, 50, 100, 100]
vocabularysize = ['400K', '1.2M', '400K', '1.2M']

In [44]:
results = pd.DataFrame(OrderedDict([
                        ('Dimensions', dimensions),
                        ('Vocabulary_Size', vocabularysize),
                        ('Train_Accuracy', Train_Accuracy),
                        ('Test_Accuracy', Test_Accuracy),
                        ]))

In [45]:
print(results)

   Dimensions Vocabulary_Size  Train_Accuracy  Test_Accuracy
0          50            400K            0.82          0.620
1          50            1.2M            0.84          0.690
2         100            400K            0.86          0.670
3         100            1.2M            0.86          0.765


# CONCLUSIONS

We first notice that the train accuracy was higher than the test accuracy across all four (4) tests. This indicates that the current RNN structure is overfitting. We should consider using techniques such as dropout, batch normalization, or LSTM/GRU cells to reduce overfitting. 
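
The size of the train/test gap can be computed directly from the accuracies reported in the results table. The labels are shorthand introduced here for the four tests.

```python
# Accuracies as reported in the results table above
train_acc = [0.82, 0.84, 0.86, 0.86]
test_acc = [0.62, 0.69, 0.67, 0.765]
labels = ['50d/400K', '50d/1.2M', '100d/400K', '100d/1.2M']

# Gap between train and test accuracy for each test
gaps = [round(tr - te, 3) for tr, te in zip(train_acc, test_acc)]
for label, gap in zip(labels, gaps):
    print(label, gap)
```

The gap ranges from about 0.095 to 0.20 and is smallest for the 100-dimension, 1.2M-vocabulary test.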

When n_inputs = 50, the RNN that used an embedding compressed from a pre-trained word vector with a larger vocabulary outperformed the one whose embedding was drawn from a smaller-vocabulary word vector on both the train set and the test set. When n_inputs = 100, the RNN that used an embedding from the larger vocabulary outperformed the one with an embedding from the smaller vocabulary on the test set, while the train accuracies were equal. Based on the results of the four (4) tests, it seems that the vocabulary size of the pre-trained word vectors is positively associated with the predictive accuracy of the RNN on sentiment analysis. 

When the vocabulary size was held constant, the RNN built with embeddings of higher dimension, in other words the RNN with more inputs, consistently outperformed the one with lower dimension (fewer inputs). This is evident from comparing Test 1 with Test 3 and Test 2 with Test 4. 
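
One way to summarize a 2x2 design is the main effect of each factor: the mean accuracy at the high level minus the mean accuracy at the low level. The sketch below computes this from the reported test accuracies. The helper names are hypothetical, and since each cell comes from a single run, the figures describe these four tests only and carry no estimate of run-to-run variation.

```python
# Test accuracy per cell, keyed by (dimension, vocabulary size), from the results table
test_acc = {
    (50, '400K'): 0.62, (50, '1.2M'): 0.69,
    (100, '400K'): 0.67, (100, '1.2M'): 0.765,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def main_effect(factor_pos, low, high):
    """Mean accuracy at the high level minus mean accuracy at the low level."""
    hi = mean(a for k, a in test_acc.items() if k[factor_pos] == high)
    lo = mean(a for k, a in test_acc.items() if k[factor_pos] == low)
    return hi - lo

dimension_effect = main_effect(0, 50, 100)
vocabulary_effect = main_effect(1, '400K', '1.2M')
print(round(dimension_effect, 4), round(vocabulary_effect, 4))
```

On these numbers, moving from 50 to 100 dimensions is associated with about 0.0625 higher test accuracy on average, and moving from the 400K to the 1.2M vocabulary with about 0.0825.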

Therefore, management is advised to use an RNN purposely designed with LSTM/GRU cells and a dropout layer, instead of a basic structure, to avoid overfitting. Management should also consider using embeddings of higher dimension, as this improved the model's performance in these tests. Depending on the field, management may find that none of the pre-trained word vectors is useful, because they may not provide the unique details required by the specific industry. 