\setlength{\parindent}{0pt}  

Christopher Wilbar   
MSDS_422-DL_SEC55    
Assignment 8: Language Modeling with an RNN   

## 1. Summary and Problem Definition  

**Problem Definition:**   
Management is considering using a language model to classify written customer reviews and call and complaint logs, and is exploring Recurrent Neural Networks (RNNs) that use pretrained word vectors. As an initial exploration, management wants to compare different pretrained vectors, different embedding dimensions, and different vocabulary sizes by training a model on positive/negative movie reviews.


If the most critical customer messages can be identified, then customer support personnel can be assigned to contact those customers.

How would you advise senior management? What kinds of systems and methods would be most relevant to the customer services function? Considering the results of this assignment in particular, what is needed to make an automated customer support system that is capable of identifying negative customer feelings? What can data scientists do to make language models more useful in a customer service function?


    
**Summary**:   
In this notebook, we explore GloVe.6B and GloVe.Twitter.27B pretrained word vectors in 50- and 200-dimensional versions, with the vocabulary limited to the top 25,000 versus the top 250,000 words, by fitting a 25-neuron Recurrent Neural Network. The models have varying success predicting whether a reviewer gives a thumbs-up or thumbs-down to a movie based on the first 20 and last 20 words of a review.


## 2. Results and Recommendations

**Results and Recommendations**

RNNs are well suited to natural language processing because they read a review as an ordered sequence of words rather than as an unordered bag of words.

My results suggest that using fewer dimensions makes the model slightly more generalizable on average (mean test accuracy of about 0.668 with 50 dimensions versus 0.660 with 200) and shortens processing time (roughly 12 versus 20 seconds), though better tuning might be needed for the higher-dimensional embeddings.

In general, a larger vocabulary size produced better test accuracy for the GloVe Twitter vectors though, interestingly, not for the 6B word vectors. Further analysis is needed to determine why.

To identify negative customer feelings in a specific customer service setting, it would likely be beneficial to train custom word vectors on text from that setting, so that application-specific words are represented.

By training an RNN with custom word vectors, it would be possible to identify the reviews with the highest probability of negativity and automate filtering these messages to support personnel, who could then reply most quickly to the most negative reviews. The model should continue to learn from new reviews, with adjusted ratings and an adjusted vocabulary, to get the best results.
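
One way the routing step could look, as a minimal sketch: it assumes the classifier emits two logits per message (thumbs-down, thumbs-up), as the RNN in this notebook does, converts them to a probability of negativity with a softmax, and sorts the flagged messages so the most negative come first. The message IDs, logit values, threshold, and helper names are hypothetical and for illustration only.

```python
import math

def negative_probability(logits):
    """Softmax probability of class 0 (thumbs-down) from two logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return exps[0] / sum(exps)

def rank_by_negativity(scored_messages, threshold=0.5):
    """Return (message_id, p_negative) pairs above threshold, most negative first."""
    ranked = [(msg_id, negative_probability(logits))
              for msg_id, logits in scored_messages]
    flagged = [pair for pair in ranked if pair[1] >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical model outputs: (message id, [logit_down, logit_up])
scored = [("msg-1", [0.2, 1.5]), ("msg-2", [2.0, -1.0]), ("msg-3", [0.9, 0.1])]
queue = rank_by_negativity(scored)
print([msg_id for msg_id, _ in queue])
```

In practice the threshold would be tuned against the cost of missing an unhappy customer versus the cost of unnecessary follow-up calls.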

In [1]:
# Benchmark Experiment: Scikit Learn Artificial Neural Networks

#   Pretrained Word Vector  Dimensions  Vocab Size  Processing Time (s)  Training Set Accuracy  Test Set Accuracy
# 0               gloVe.6B          50       25000            11.872679                   0.98              0.645
# 1               gloVe.6B          50      250000            12.016687                   0.98              0.625
# 2      GloVe.Twitter.27B          50       25000            12.118693                   0.94              0.675
# 3      GloVe.Twitter.27B          50      250000            12.106693                   0.97              0.725


# Benchmark Experiment: Scikit Learn Artificial Neural Networks

#   Pretrained Word Vector  Dimensions  Vocab Size  Processing Time (s)  Training Set Accuracy  Test Set Accuracy
# 0               gloVe.6B         200       25000            20.247158                   1.00              0.645
# 1               gloVe.6B         200      250000            21.438226                   0.95              0.640
# 2      GloVe.Twitter.27B         200       25000            20.269160                   1.00              0.670
# 3      GloVe.Twitter.27B         200      250000            20.046147                   1.00              0.685
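
The comparison between 50 and 200 dimensions can be checked with a few lines of arithmetic. This is a small sketch that hard-codes the test accuracies and processing times from the tables above; the variable names are illustrative, and the processing times are assumed to be in seconds because the notebook times its runs with `time()`.

```python
from statistics import mean

# Test accuracies copied from the results tables (row order 0-3)
test_acc = {
    50:  [0.645, 0.625, 0.675, 0.725],
    200: [0.645, 0.640, 0.670, 0.685],
}
# Processing times copied from the same tables (assumed to be seconds)
proc_time = {
    50:  [11.872679, 12.016687, 12.118693, 12.106693],
    200: [20.247158, 21.438226, 20.269160, 20.046147],
}

for dim in (50, 200):
    print(dim, "dimensions:",
          "mean test accuracy = %.4f," % mean(test_acc[dim]),
          "mean processing time = %.1f s" % mean(proc_time[dim]))
```

With only 200 test reviews per condition, a gap of under one percentage point in mean accuracy is small, so the comparison should be read as suggestive rather than conclusive.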

## 3. Research Design and Methods Used

**Research Design**   

500 thumbs-up and 500 thumbs-down movie reviews were collected randomly for analysis. Pretrained word vectors were downloaded using the chakin package in Python.

**Methods Used**  
A Python 3 Jupyter Notebook was created to perform the analysis.
The following packages were used:
pandas, numpy, sklearn, os, tensorflow, time, re, collections, nltk.

Because the response variable has two classes, binary classification methods are appropriate.

Review text was cleaned by removing non-alphabetic characters and single-character words and by converting to lowercase. Stopword removal is implemented but was switched off (`REMOVE_STOPWORDS = False`) to keep sentences intact.
TensorFlow was the primary tool for analysis.

An RNN built in TensorFlow with 25 neurons and one fully connected (FC) layer is used across the different pretrained vectors, vocabulary sizes, and dimensions.
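
To make the architecture concrete, the following is a minimal pure-Python sketch of what a basic RNN cell followed by one fully connected layer computes: for each word vector the hidden state is updated as h_t = tanh(W_x x_t + W_h h_(t-1) + b), and the final state is mapped to two logits. The toy sizes, random weights, and helper names are assumptions for illustration; the notebook itself uses TensorFlow with 25 neurons.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_forward(sequence, w_x, w_h, b, w_out, b_out):
    """Run a basic tanh RNN over a sequence and return the output logits."""
    h = [0.0] * len(b)  # initial hidden state is all zeros
    for x_t in sequence:
        pre = [a + c + d for a, c, d in zip(matvec(w_x, x_t), matvec(w_h, h), b)]
        h = [math.tanh(p) for p in pre]
    # One fully connected layer applied to the final hidden state
    return [z + c for z, c in zip(matvec(w_out, h), b_out)]

def rand_matrix(rows, cols):
    """Small random weight matrix (illustrative initialisation only)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(88)
n_inputs, n_neurons, n_outputs, n_steps = 4, 3, 2, 5  # toy sizes (assumed)
w_x = rand_matrix(n_neurons, n_inputs)
w_h = rand_matrix(n_neurons, n_neurons)
w_out = rand_matrix(n_outputs, n_neurons)
b, b_out = [0.0] * n_neurons, [0.0] * n_outputs
sequence = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_steps)]

logits = rnn_forward(sequence, w_x, w_h, b, w_out, b_out)
print(len(logits))  # one logit per class
```

Only the final hidden state feeds the output layer, which is why word order and the choice of which 40 words to keep both matter for this model.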

## 4. Programming Work

In [2]:
# coding: utf-8

# Modified version of a program by Thomas W. Miller, August 16, 2018

# Previous work involved gathering embeddings via chakin
# Following methods described in
#    https://github.com/chakki-works/chakin
# The previous program, run-chakin-to-get-embeddings-v001.py
# downloaded pre-trained GloVe embeddings, saved them in a zip archive,
# and unzipped that archive to create the four word-to-embeddings
# text files for use in language models. 

# This program uses word embeddings to set up defaultdict 
# dictionary data structures that can then be employed in language
# models. This is demonstrated with a simple RNN model for predicting
# sentiment (thumbs-down versus thumbs-up) for movie reviews.

In [3]:
RANDOM_SEED = 88

import pandas as pd
import numpy as np
from time import time #time tracking

import os  # operating system functions
import os.path  # for manipulation of file path names

import re  # regular expressions

from collections import defaultdict

import nltk
from nltk.tokenize import TreebankWordTokenizer

import tensorflow as tf


In [4]:
# To make output stable across runs
def reset_graph(seed= RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [5]:
REMOVE_STOPWORDS = False  # no stopword removal 

In [6]:
# Select the pre-defined embeddings sources
# Define the vocabulary sizes for the language model
# Prepare for a 2x2 design: 2 embedding sources x 2 vocabulary sizes,
# run at one embedding dimension (DIMENSIONS) at a time

EVOCABSIZE = [25000,250000]
DIMENSIONS = 200

embeddings_directory = './embeddings/'
vectors = ['gloVe.6B','GloVe.Twitter.27B']
embeddings_filenames = [os.path.join(embeddings_directory + vectors[0], vectors[0].lower()+'.'+str(DIMENSIONS)+'d.txt'),\
                       os.path.join(embeddings_directory + vectors[1], vectors[1].lower()+'.'+str(DIMENSIONS)+'d.txt')]

In [7]:
embeddings_filenames

['./embeddings/gloVe.6B\\glove.6b.200d.txt',
 './embeddings/GloVe.Twitter.27B\\glove.twitter.27b.200d.txt']

In [8]:
word_vectors = [vectors[0],vectors[0],vectors[1],vectors[1]]
vocab_size = [EVOCABSIZE[0],EVOCABSIZE[1],EVOCABSIZE[0],EVOCABSIZE[1]]
dimensions = [str(DIMENSIONS)]*4

In [9]:
# Utility function for loading embeddings follows methods described in
# https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# Creates the Python defaultdict dictionary word_to_embedding_dict
# for the requested pre-trained word embeddings
# 
# Note the use of defaultdict data structure from the Python Standard Library
# collections_defaultdict.py lets the caller specify a default value up front
# The default value will be returned if the key is not a known dictionary key
# For word embeddings, this default value is a vector of zeros,
# that is, unknown words are represented by a vector of zeros
# Documentation for the Python standard library:
#   Hellmann, D. 2017. The Python 3 Standard Library by Example. Boston: 
#     Addison-Wesley. [ISBN-13: 978-0-13-429105-5]
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    """
    Read an embeddings txt file. If `with_indexes=True`, 
    we return a tuple of two dictionaries
    `(word_to_index_dict, index_to_embedding_array)`, 
    otherwise we return only a direct 
    `word_to_embedding_dict` dictionary mapping 
    from a string to a numpy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Pass the loaded dict so known words keep their embeddings
        word_to_embedding_dict = defaultdict(
            lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict
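
The unknown-word behaviour described in the comments above can be seen on a toy example. This sketch parses two made-up lines in the GloVe text format (a word followed by space-separated floats) and wraps the index in a `defaultdict` so that any unknown word maps to a final all-zero row. The sample lines and values are invented for illustration.

```python
from collections import defaultdict

# Two made-up lines in the GloVe text format: word, then its vector
sample_lines = ["the 0.1 -0.2 0.3\n", "movie 0.4 0.5 -0.6\n"]

word_to_index = {}
index_to_embedding = []
for i, line in enumerate(sample_lines):
    parts = line.split(' ')
    word_to_index[parts[0]] = i
    index_to_embedding.append([float(val) for val in parts[1:]])

# Append an all-zero row and point every unknown word at it
unknown_index = len(index_to_embedding)
index_to_embedding.append([0.0] * len(index_to_embedding[0]))
word_to_index = defaultdict(lambda: unknown_index, word_to_index)

print(word_to_index["movie"])         # known word
print(word_to_index["worsdfkljsdf"])  # unknown word falls back to the last row
print(index_to_embedding[word_to_index["worsdfkljsdf"]])
```

One side effect to keep in mind: looking up a missing key with `[]` stores that key in the `defaultdict`, so the dictionary grows as unknown words are encountered, whereas `.get()` does not trigger the default.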

In [10]:
def load_embeddings(embedding):
    print('\nLoading embeddings from', embeddings_filenames[embedding])
    load_start = time()
    word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filenames[embedding], with_indexes=True)
    load_end = time()
    load_time = load_end-load_start
    return word_to_index, index_to_embedding, load_time

In [11]:
word_to_index0, index_to_embedding0, load_times0 = load_embeddings(0)
word_to_index1, index_to_embedding1, load_times1 = load_embeddings(1)
print("Embedding loaded from disks.")


Loading embeddings from ./embeddings/gloVe.6B\glove.6b.200d.txt

Loading embeddings from ./embeddings/GloVe.Twitter.27B\glove.twitter.27b.200d.txt
Embedding loaded from disks.


In [12]:
word_to_indexes = [word_to_index0, word_to_index1]
index_to_embeddings = [index_to_embedding0,index_to_embedding1]
load_times = [load_times0,load_times0,load_times1,load_times1]
del index_to_embedding0
del index_to_embedding1

In [13]:
vocab_sizes  = [index_to_embeddings[embedding].shape[0] for embedding in range(len(embeddings_filenames))]

In [14]:
embedding_dim = [index_to_embeddings[embedding].shape[1] for embedding in range(len(embeddings_filenames))]

In [15]:
# # Note: unknown words have representations with values [0, 0, ..., 0]

# # Additional background code from
# # https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# # shows the general structure of the data structures for word embeddings
# # This code is modified for our purposes in language modeling 

complete_vocabulary_size = [0]*len(embeddings_filenames)

def embedding_test(embedding):
    print("\n---Testing Vocabulary : ", vectors[embedding])
    print("Embedding is of shape: {}".format(index_to_embeddings[embedding].shape))
    print("This means (number of words, number of dimensions per word)\n")
    print("The first words are words that tend to occur more often.")

    print("Note: for unknown words, the representation is an empty vector,\n"
      "and the index is the last one. The dictionary has a limit:")
    print("    {} --> {} --> {}".format("A word", "Index in embedding", 
      "Representation"))
    word = "worsdfkljsdf"  # a word obviously not in the vocabulary
    idx = word_to_indexes[embedding][word] # index for word obviously not in the vocabulary
    complete_vocabulary_size[embedding] = idx 
    embd = list(np.array(index_to_embeddings[embedding][idx], dtype=int)) # "int" compact print
    print("    {} --> {} --> {}".format(word, idx, embd))
    word = "the"
    idx = word_to_indexes[embedding][word]
    embd = list(index_to_embeddings[embedding][idx])  # full float representation
    print("    {} --> {} --> {}".format(word, idx, embd))

In [16]:
for i in range(len(embeddings_filenames)):    
    embedding_test(i) 


---Testing Vocabulary :  gloVe.6B
Embedding is of shape: (400001, 200)
This means (number of words, number of dimensions per word)

The first words are words that tend to occur more often.
Note: for unknown words, the representation is an empty vector,
and the index is the last one. The dictionary has a limit:
    A word --> Index in embedding --> Representation
    worsdfkljsdf --> 400000 --> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [17]:
def sentence_test(embeddingno, test_sentence):
    print('\nTest sentence: ', test_sentence, '\n')
    words_in_test_sentence = test_sentence.split()

    print('Test sentence embeddings from complete vocabulary of', 
          complete_vocabulary_size[embeddingno], 'words:\n')
    for word in words_in_test_sentence:
        word_ = word.lower()
        # Look the word up in the index that belongs to this embedding source
        embedding = index_to_embeddings[embeddingno][word_to_indexes[embeddingno][word_]]
        print(word_ + ": ", embedding)
    

In [18]:
# Show how to use embeddings dictionaries with a test sentence
# This is a famous typing exercise with all letters of the alphabet
# https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog
a_typing_test_sentence = 'The quick brown fox jumps over the lazy dog'

for i in range(len(embeddings_filenames)):   
    sentence_test(i,a_typing_test_sentence)


Test sentence:  The quick brown fox jumps over the lazy dog 

Test sentence embeddings from complete vocabulary of 400000 words:

the:  [-7.1549e-02  9.3459e-02  2.3738e-02 -9.0339e-02  5.6123e-02  3.2547e-01
 -3.9796e-01 -9.2139e-02  6.1181e-02 -1.8950e-01  1.3061e-01  1.4349e-01
  1.1479e-02  3.8158e-01  5.4030e-01 -1.4088e-01  2.4315e-01  2.3036e-01
 -5.5339e-01  4.8154e-02  4.5662e-01  3.2338e+00  2.0199e-02  4.9019e-02
 -1.4132e-02  7.6017e-02 -1.1527e-01  2.0060e-01 -7.7657e-02  2.4328e-01
  1.6368e-01 -3.4118e-01 -6.6070e-02  1.0152e-01  3.8232e-02 -1.7668e-01
 -8.8153e-01 -3.3895e-01 -3.5481e-02 -5.5095e-01 -1.6899e-02 -4.3982e-01
  3.9004e-02  4.0447e-01 -2.5880e-01  6.4594e-01  2.6641e-01  2.8009e-01
 -2.4625e-02  6.3302e-01 -3.1700e-01  1.0271e-01  3.0886e-01  9.7792e-02
 -3.8227e-01  8.6552e-02  4.7075e-02  2.3511e-01 -3.2127e-01 -2.8538e-01
  1.6670e-01 -4.9707e-03 -6.2714e-01 -2.4904e-01  2.9713e-01  1.4379e-01
 -1.2325e-01 -5.8178e-02 -1.0290e-03 -8.2126e-02  3.6935e-01

In [19]:
def default_factory0():
    return EVOCABSIZE[0]  # last/unknown-word row in limited_index_to_embedding
def default_factory1():
    return EVOCABSIZE[1]  # last/unknown-word row in limited_index_to_embedding

In [20]:
def create_limited_word_to_index(embeddingno, defaultfactory, vocabsizeidx):
    limited_word_to_index = defaultdict(defaultfactory, \
        {k: v for k, v in word_to_indexes[embeddingno].items() if v < EVOCABSIZE[vocabsizeidx]})
    return limited_word_to_index

def create_limited_index_to_embedding(embeddingno, vocabsizeidx):
    # Select the first EVOCABSIZE rows of the index_to_embedding array
    limited_index_to_embedding = index_to_embeddings[embeddingno][0:EVOCABSIZE[vocabsizeidx],:]
    # Set the unknown-word row to be all zeros as previously
    limited_index_to_embedding = np.append(limited_index_to_embedding, 
        index_to_embeddings[embeddingno][index_to_embeddings[embeddingno].shape[0] - 1, :].\
            reshape(1,embedding_dim[embeddingno]), 
        axis = 0)
    return limited_index_to_embedding
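
The vocabulary restriction can be illustrated on a toy embedding table. The sketch keeps only the first `limit` rows (the embedding files list more frequent words first), appends an all-zero unknown-word row, and sends every out-of-vocabulary word to that row. The five-word vocabulary, the vectors, and the `limit_vocabulary` helper are hypothetical.

```python
from collections import defaultdict

def limit_vocabulary(word_to_index, index_to_embedding, limit):
    """Keep the first `limit` words; map all others to a zero row at index `limit`."""
    dim = len(index_to_embedding[0])
    limited_embedding = index_to_embedding[:limit] + [[0.0] * dim]
    kept = {w: i for w, i in word_to_index.items() if i < limit}
    limited_index = defaultdict(lambda: limit, kept)
    return limited_index, limited_embedding

# Toy table: words ordered from most to least frequent (made-up vectors)
words = ["the", "a", "movie", "plot", "karloff"]
word_to_index = {w: i for i, w in enumerate(words)}
index_to_embedding = [[float(i), float(i) + 0.5] for i in range(len(words))]

limited_index, limited_embedding = limit_vocabulary(word_to_index, index_to_embedding, 3)
print(limited_index["movie"], limited_embedding[limited_index["movie"]])
print(limited_index["karloff"], limited_embedding[limited_index["karloff"]])
```

A smaller limit saves memory but turns more review words into zero vectors, which is the trade-off the 25,000 versus 250,000 comparison probes.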


In [21]:
limited_word_to_indexes = [[0]]*len(word_vectors)
limited_index_to_embeddings = [[0]]*len(word_vectors)

In [22]:
for i in range(0,len(word_vectors),len(EVOCABSIZE)):    
    limited_word_to_indexes[i] = create_limited_word_to_index(int(i/2),default_factory0,0)
    limited_index_to_embeddings[i] = create_limited_index_to_embedding (int(i/2),0)
for i in range(1,len(word_vectors),len(EVOCABSIZE)):    
    limited_word_to_indexes[i] = create_limited_word_to_index(int(i/2),default_factory1,1)
    limited_index_to_embeddings[i] = create_limited_index_to_embedding (int(i/2),1)
    
del index_to_embeddings

In [23]:
# Verify the new vocabulary: should get same embeddings for test sentence
def sentence_test_limited(conditionno, test_sentence):
    print('\nTest sentence: ', test_sentence, '\n')
    words_in_test_sentence = test_sentence.split()

    print('Test sentence embeddings from limited vocabulary of', 
          len(limited_index_to_embeddings[conditionno])-1, 'words:\n')
    for word in words_in_test_sentence:
        word_ = word.lower()
        embedding = limited_index_to_embeddings[conditionno][limited_word_to_indexes[conditionno][word_]]
        print(word_ + ": ", embedding)

In [24]:
for i in range(len(word_vectors)):
    sentence_test_limited(i,a_typing_test_sentence)


Test sentence:  The quick brown fox jumps over the lazy dog 

Test sentence embeddings from limited vocabulary of 25000 words:

the:  [-7.1549e-02  9.3459e-02  2.3738e-02 -9.0339e-02  5.6123e-02  3.2547e-01
 -3.9796e-01 -9.2139e-02  6.1181e-02 -1.8950e-01  1.3061e-01  1.4349e-01
  1.1479e-02  3.8158e-01  5.4030e-01 -1.4088e-01  2.4315e-01  2.3036e-01
 -5.5339e-01  4.8154e-02  4.5662e-01  3.2338e+00  2.0199e-02  4.9019e-02
 -1.4132e-02  7.6017e-02 -1.1527e-01  2.0060e-01 -7.7657e-02  2.4328e-01
  1.6368e-01 -3.4118e-01 -6.6070e-02  1.0152e-01  3.8232e-02 -1.7668e-01
 -8.8153e-01 -3.3895e-01 -3.5481e-02 -5.5095e-01 -1.6899e-02 -4.3982e-01
  3.9004e-02  4.0447e-01 -2.5880e-01  6.4594e-01  2.6641e-01  2.8009e-01
 -2.4625e-02  6.3302e-01 -3.1700e-01  1.0271e-01  3.0886e-01  9.7792e-02
 -3.8227e-01  8.6552e-02  4.7075e-02  2.3511e-01 -3.2127e-01 -2.8538e-01
  1.6670e-01 -4.9707e-03 -6.2714e-01 -2.4904e-01  2.9713e-01  1.4379e-01
 -1.2325e-01 -5.8178e-02 -1.0290e-03 -8.2126e-02  3.6935e-01 

In [25]:
# ------------------------------------------------------------
# code for working with movie reviews data 
# Source: Miller, T. W. (2016). Web and Network Data Science.
#    Upper Saddle River, N.J.: Pearson Education.
#    ISBN-13: 978-0-13-388644-3
# This original study used a simple bag-of-words approach
# to sentiment analysis, along with pre-defined lists of
# negative and positive words.        
# Code available at:  https://github.com/mtpa/wnds       
# ------------------------------------------------------------

In [26]:
# Utility function to get file names within a directory
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)

In [27]:
# define list of codes to be dropped from document
# carriage-returns, line-feeds, tabs
codelist = ['\r', '\n', '\t']   

# We will not remove stopwords in this exercise because they are
# important to keeping sentences intact
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))

# previous analysis of a list of top terms showed a number of words, along 
# with contractions and other word strings to drop from further analysis, add
# these to the usual English stopwords to be dropped from a document collection
    more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br',\
        've', 're', 'vs'] 

    some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

    # start with the initial list and add to it for movie text work 
    stoplist = nltk.corpus.stopwords.words('english') + more_stop_words +\
        some_proper_nouns_to_remove

In [28]:
# text parsing function for creating text documents 
# there is more we could do for data preparation 
# stemming... looking for contractions... possessives... 
# but we will work with what we have in this parsing function
# if we want to do stemming at a later time, we can use
#     porter = nltk.PorterStemmer()  
# in a construction like this
#     words_stemmed =  [porter.stem(word) for word in initial_words]  
def text_parse(string):
    # replace non-alphabetic characters with spaces
    temp_string = re.sub(r'[^a-zA-Z]', '  ', string)
    # replace codes with space
    for code in codelist:
        stopstring = ' ' + code + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)
    # convert uppercase to lowercase
    temp_string = temp_string.lower()
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for stopword in stoplist:
            stopstring = ' ' + str(stopword) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)
    return(temp_string)   
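
As a quick check of what the cleaning steps do, here is a condensed, self-contained sketch of the same sequence (strip non-letters, drop single-character words, lowercase, collapse whitespace) applied to a made-up review fragment. The function name `clean_text` and the sample string are illustrative, and stopword removal is left out because it is disabled in this notebook.

```python
import re

def clean_text(text):
    """Condensed version of the parsing steps used for the reviews."""
    text = re.sub(r'[^a-zA-Z]', '  ', text)  # non-letters become spaces
    text = re.sub(r'\s.\s', ' ', text)       # drop single-character words
    text = text.lower()
    return re.sub(r'\s+', ' ', text)         # collapse runs of whitespace

sample = "It's a GREAT movie!<br />10/10, would watch again."
tokens = clean_text(sample).split()
print(tokens)
```

The digits and punctuation disappear, the possessive `s` and the article `a` are dropped as single-character words, and the HTML line break leaves a stray `br` token behind, which is why `br` appears in the extra stopword list.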

In [29]:
# -----------------------------------------------
# gather data for 500 negative movie reviews
# -----------------------------------------------
dir_name = 'movie-reviews-negative'
    
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: movie-reviews-negative
500 files found


In [30]:
# Read data for negative movie reviews
# Data will be stored in a list of lists where each list represents 
# a document and a document is a list of words.
# We then break the text into words.

def read_data(filename):

  with open(filename, encoding='utf-8') as f:
    data = tf.compat.as_str(f.read())
    data = data.lower()
    data = text_parse(data)
    data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

  return data

negative_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    negative_documents.append(words)
    #print('Data size (Characters) (Document %d) %d' %(i,len(words)))
    #print('Sample string (Document %d) %s'%(i,words[:50]))


Processing document files under movie-reviews-negative


In [31]:
# -----------------------------------------------
# gather data for 500 positive movie reviews
# -----------------------------------------------
dir_name = 'movie-reviews-positive'  
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: movie-reviews-positive
500 files found


In [32]:
# Read data for positive movie reviews
# Data will be stored in a list of lists where each list 
# represents a document and a document is a list of words.
# We then break the text into words.
# read_data() is reused as defined for the negative reviews above.

positive_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    positive_documents.append(words)
    # print('Data size (Characters) (Document %d) %d' %(i,len(words)))
    # print('Sample string (Document %d) %s'%(i,words[:50]))


Processing document files under movie-reviews-positive


In [33]:
# -----------------------------------------------------
# convert positive/negative documents into numpy array
# note that reviews vary from 22 to 1052 words   
# so we use the first 20 and last 20 words of each review 
# as our word sequences for analysis
# -----------------------------------------------------
max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length) 

max_review_length: 1052
min_review_length: 22


In [34]:
# construct a list of 1000 lists with 40 words in each list
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))  
    
# create list of lists of lists for embeddings
embeddings = [[],[],[],[]]   
for i in range(len(limited_index_to_embeddings)):
    for doc in documents:
        embedding = []
        for word in doc:
            embedding.append(limited_index_to_embeddings[i][limited_word_to_indexes[i][word]]) 
        embeddings[i].append(embedding)
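
The head-and-tail truncation is worth a small illustration, because reviews shorter than 40 words contribute overlapping words. The sketch below applies the same slicing to toy token lists; the helper name `head_and_tail` and the toy documents are assumptions for illustration.

```python
def head_and_tail(tokens, n=20):
    """Return the first n and last n tokens of a document as one list."""
    return tokens[:n] + tokens[len(tokens) - n:]

long_doc = ["w%d" % i for i in range(100)]   # 100 tokens
short_doc = ["w%d" % i for i in range(22)]   # 22 tokens, the shortest review length

print(len(head_and_tail(long_doc)), len(head_and_tail(short_doc)))
print(head_and_tail(short_doc)[18:22])  # tokens around the join
```

For the 22-word document, 18 of the 40 tokens appear twice. A document with fewer than 20 tokens would produce fewer than 40 tokens and break the fixed-length array, which does not occur here because the shortest review has 22 words.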

In [35]:
limited_index_to_embeddings[i][limited_word_to_indexes[i]['story']]

array([ 8.6258e-02, -1.1147e-01, -4.2871e-01, -2.8312e-01, -1.0809e-01,
        1.1587e-01,  4.4988e-01,  3.8744e-01, -3.0769e-01, -8.0108e-02,
       -3.4105e-02, -8.1727e-02, -2.6876e-01, -7.4932e-02,  1.0842e-02,
        2.6350e-01,  4.7296e-01,  7.2050e-01, -2.2647e-02,  2.0183e-01,
        1.7395e-01, -4.6669e-01, -7.1247e-02,  5.4053e-01, -5.1786e-02,
        2.1078e-01, -1.7229e-01, -3.4428e-01,  1.9561e-01, -1.9806e-01,
        8.4391e-02,  2.2984e-01, -1.7583e-01, -1.2812e-01, -4.4452e-02,
        5.3369e-02,  4.1649e-01, -1.5325e-01,  2.3733e-01, -5.5631e-01,
        3.1745e-01, -5.4240e-01,  7.7504e-01, -2.9218e-01, -2.7242e-01,
        3.7939e-01, -1.2790e-01,  2.3380e-01,  5.6552e-01, -2.0144e-01,
        3.0116e-03,  3.3248e-01, -1.4910e-01,  8.4401e-02, -8.2072e-02,
       -4.8399e-01,  3.2357e-01,  2.6116e-01,  1.4982e-01,  6.5079e-02,
        8.2326e-02, -6.1386e-02, -2.3392e-01,  1.5088e-02, -2.5994e-01,
        7.5520e-02, -1.2213e-01,  2.3010e-01,  5.0098e-01,  3.95

In [36]:
# -----------------------------------------------------    
# Check on the embeddings list of list of lists 
# -----------------------------------------------------
for j in range(len(embeddings)):
    print('\n---------Results for Embedding: ',j, )
    
    # Show the first word in the first document
    test_word = documents[0][0]    
    print('First word in first document:', test_word)    
    print('Embedding 00 for this word:\n', 
          limited_index_to_embeddings[j][limited_word_to_indexes[j][test_word]])
    print('Corresponding embedding from embeddings list of list of lists\n',
          embeddings[j][0][0][:])

    # Show the tenth word in the seventh document
    test_word = documents[6][9]    
    print('\nTenth word in seventh document:', test_word)    
    print('Embedding for this word:\n', 
          limited_index_to_embeddings[j][limited_word_to_indexes[j][test_word]])
    print('Corresponding embedding from embeddings list of list of lists\n',
          embeddings[j][6][9][:])

    # Show the last word in the last document
    test_word = documents[999][39]    
    print('\nLast word in last document:', test_word)    
    print('Embedding for this word:\n', 
          limited_index_to_embeddings[j][limited_word_to_indexes[j][test_word]])
    print('Corresponding embedding from embeddings list of list of lists\n',
          embeddings[j][999][39][:])    


---------Results for Embedding:  0
First word in first document: story
Embedding 00 for this word:
 [-0.35058    0.58245   -0.065584  -0.41768    0.22424   -0.018073
 -1.3356    -0.47482    0.23183    0.10959    0.83464    0.37482
  0.38829   -0.59514    0.37206   -0.058546   0.13618    0.68434
 -0.43494    0.45186    0.14374    2.0624    -0.094351  -0.049338
  0.53175    0.17554    0.12168   -0.33087   -0.21675   -0.1083
 -0.56548    0.34792    0.28643   -0.094931  -0.12818   -0.25668
 -0.53174   -0.22513    0.19393   -0.65102   -0.67151    0.026963
 -0.44885    0.13237   -0.31898    0.18187    0.74176    0.079775
  0.43745    0.051039  -0.098213  -0.10155    0.60516    0.32244
  0.0046332  0.47619   -0.50256   -0.16815    0.64469   -0.16853
 -0.85942    0.45803   -0.63957   -0.33473   -0.45855   -1.1143
  0.27468   -0.14675   -0.22982    0.047588   0.40896    0.051899
 -0.31375    0.65951   -0.18551    0.22501   -0.79543   -0.43312
 -0.32808    0.13564   -0.081915   0.30187   -0.319

In [37]:
# -----------------------------------------------------    
# Make embeddings a numpy array for use in an RNN 
# Create training and test sets with Scikit Learn
# -----------------------------------------------------
embeddings_array = np.array(embeddings)

# Define the labels to be used 500 negative (0) and 500 positive (1)
thumbs_down_up = [np.concatenate((np.zeros((500), dtype = np.int32), 
                      np.ones((500), dtype = np.int32)), axis = 0)]*4

In [38]:
# --------------------------------------------------------------------------      
# We use a very simple Recurrent Neural Network for this assignment
# Géron, A. 2017. Hands-On Machine Learning with Scikit-Learn & TensorFlow: 
#    Concepts, Tools, and Techniques to Build Intelligent Systems. 
#    Sebastopol, Calif.: O'Reilly. [ISBN-13 978-1-491-96229-9] 
#    Chapter 14 Recurrent Neural Networks, pages 390-391
#    Source code available at https://github.com/ageron/handson-ml
#    Jupyter notebook file 14_recurrent_neural_networks.ipynb
#    See section on Training a Sequence Classifier, # In [34]:
#    which uses the MNIST case data...  we revise to accommodate
#    the movie review data in this assignment    
# --------------------------------------------------------------------------  


In [39]:
reset_graph()

In [40]:
n_neurons = 25  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

In [41]:
# Scikit Learn for random splitting of the data  
from sklearn.model_selection import train_test_split

In [42]:
def model_run(conditions,embeddings,thumbs):
    reset_graph()
    
    print('\nResults for conditions = ', conditions)
    
    embeddings_array = embeddings[conditions,:,:,:]
    thumbs_down_up = thumbs[conditions]

    # Random splitting of the data into training (80%) and test (20%)  
    X_train, X_test, y_train, y_test = \
        train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, 
                         random_state = RANDOM_SEED)

    n_steps = embeddings_array.shape[1]  # number of words per document 
    n_inputs = embeddings_array.shape[2]  # dimension of  pre-trained embeddings

    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    y = tf.placeholder(tf.int32, [None])

    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

    logits = tf.layers.dense(states, n_outputs)
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    init = tf.global_variables_initializer()

    n_epochs = 100
    batch_size = 100

    start_time = time()
    with tf.Session() as sess:
        init.run()
        for epoch in range(n_epochs):            
            for iteration in range(y_train.shape[0] // batch_size):          
                X_batch = X_train[iteration*batch_size:(iteration + 1)*batch_size,:]
                y_batch = y_train[iteration*batch_size:(iteration + 1)*batch_size]
                #print('  Batch ', iteration, ' training observations from ',  
                 #     iteration*batch_size, ' to ', (iteration + 1)*batch_size-1,)
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            # Training accuracy is measured on the last mini-batch of the epoch only
            acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})

            if epoch % 10 == 0 or epoch == n_epochs - 1:
                print('\n  ---- Epoch ', epoch, ' ----\n')
                print('\n  Train accuracy:', acc_train, 'Test accuracy:', acc_test)
    end_time = time()       
    runtime = end_time - start_time        
    
    return  runtime, acc_train, acc_test
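The training loop in `model_run` walks through consecutive mini-batches using integer division for the loop bound. The sketch below isolates that slicing logic with plain Python so the batch boundaries are easy to inspect. `iterate_batches` is a hypothetical helper, not part of the notebook.

```python
def iterate_batches(n_rows, batch_size):
    """Yield (start, stop) index pairs for consecutive full mini-batches.

    Rows beyond the last full batch are skipped, matching the
    `n_rows // batch_size` loop bound used in model_run.
    """
    for iteration in range(n_rows // batch_size):
        yield iteration * batch_size, (iteration + 1) * batch_size

# 80% of 1000 reviews gives 800 training rows; batch size is 100
batches = list(iterate_batches(800, 100))
print(len(batches), batches[0], batches[-1])  # 8 (0, 100) (700, 800)

# With a size that does not divide evenly, the tail is never visited
print(list(iterate_batches(250, 100)))  # [(0, 100), (100, 200)]
```

Because `acc_train` in `model_run` is evaluated on `X_batch`, the reported training accuracy reflects the final 100 reviews of each epoch rather than all 800 training reviews; evaluating on `X_train` instead would likely give a steadier figure.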

In [43]:
# Initialize Tracking Vectors
processingtime = [0]*len(word_vectors)
train_accuracy = [0]*len(word_vectors)
test_accuracy = [0]*len(word_vectors)

In [44]:
for condition in range(len(word_vectors)):
    processingtime[condition], train_accuracy[condition], test_accuracy[condition] =\
        model_run(condition,embeddings_array,thumbs_down_up)


Results for conditions =  0

  ---- Epoch  0  ----


  Train accuracy: 0.48 Test accuracy: 0.525

  ---- Epoch  10  ----


  Train accuracy: 0.7 Test accuracy: 0.58

  ---- Epoch  20  ----


  Train accuracy: 0.87 Test accuracy: 0.695

  ---- Epoch  30  ----


  Train accuracy: 0.93 Test accuracy: 0.665

  ---- Epoch  40  ----


  Train accuracy: 0.94 Test accuracy: 0.65

  ---- Epoch  50  ----


  

In [45]:
from collections import OrderedDict  

results = pd.DataFrame(OrderedDict([('Pretrained Word Vector', word_vectors),
                        ('Dimensions', dimensions),
                        ('Vocab Size', vocab_size),
                        ('Processing Time', processingtime),
                        ('Training Set Accuracy', train_accuracy),
                        ('Test Set Accuracy', test_accuracy)]))

print('\nBenchmark Experiment: RNN with Pretrained GloVe Word Vectors\n')
print(results) 


Benchmark Experiment: RNN with Pretrained GloVe Word Vectors

  Pretrained Word Vector Dimensions  Vocab Size  Processing Time  \
0               gloVe.6B        200       25000        20.247158   
1               gloVe.6B        200      250000        21.438226   
2      GloVe.Twitter.27B        200       25000        20.269160   
3      GloVe.Twitter.27B        200      250000        20.046147   

   Training Set Accuracy  Test Set Accuracy  
0                   1.00              0.645  
1                   0.95              0.640  
2                   1.00              0.670  
3                   1.00              0.685  
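The gap between training and test accuracy in the table above gives a rough indication of how much each model overfits. The sketch below recomputes that gap from the printed values; the variable names are illustrative, and the numbers are copied from the table rather than produced by rerunning the model.

```python
# (pretrained vectors, vocab size, training accuracy, test accuracy),
# copied from the results table above
conditions = [
    ("gloVe.6B", 25000, 1.00, 0.645),
    ("gloVe.6B", 250000, 0.95, 0.640),
    ("GloVe.Twitter.27B", 25000, 1.00, 0.670),
    ("GloVe.Twitter.27B", 250000, 1.00, 0.685),
]

for vectors, vocab, train_acc, test_acc in conditions:
    gap = round(train_acc - test_acc, 3)
    print(f"{vectors:<18} vocab={vocab:<7} train-test gap={gap}")

# Condition with the highest test accuracy
best = max(conditions, key=lambda row: row[3])
print("Highest test accuracy:", best[0], best[1])
```

All four conditions show a gap of roughly 0.3 or more, which suggests that regularization or earlier stopping may matter more here than the choice of embedding.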


In [46]:
# word_to_indexes, index_to_embeddings, load_times = [load_embeddings(embedding) for embedding in range(len(embeddings_filenames))]

In [47]:
# print('\nLoading embeddings from', embeddings_filenames[0])
# word_to_index0, index_to_embedding0 = \
#     load_embedding_from_disks(embeddings_filenames[0], with_indexes=True)

# print('\nLoading embeddings from', embeddings_filenames[1])
# word_to_index1, index_to_embedding1 = \
#     load_embedding_from_disks(embeddings_filenames[1], with_indexes=True)

# print("Embedding loaded from disks.")

In [48]:
# # Note: unknown words have representations with values [0, 0, ..., 0]

# # Additional background code from
# # https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# # shows the general structure of the data structures for word embeddings
# # This code is modified for our purposes in language modeling 
# vocab_size0, embedding_dim0 = index_to_embedding0.shape
# print("Embedding is of shape: {}".format(index_to_embedding0.shape))
# print("This means (number of words, number of dimensions per word)\n")
# print("The first words are words that tend to occur more often.")

# print("Note: for unknown words, the representation is an empty vector,\n"
#       "and the index is the last one. The dictionary has a limit:")
# print("    {} --> {} --> {}".format("A word", "Index in embedding", 
#       "Representation"))
# word = "worsdfkljsdf"  # a word obviously not in the vocabulary
# idx = word_to_index0[word] # index for word obviously not in the vocabulary
# complete_vocabulary_size0 = idx 
# embd = list(np.array(index_to_embedding0[idx], dtype=int)) # "int" compact print
# print("    {} --> {} --> {}".format(word, idx, embd))
# word = "the"
# idx = word_to_index0[word]
# embd = list(index_to_embedding0[idx])  # "int" for compact print only.
# print("    {} --> {} --> {}".format(word, idx, embd))

In [49]:

# vocab_size1, embedding_dim1 = index_to_embedding1.shape
# print("Embedding is of shape: {}".format(index_to_embedding1.shape))
# print("This means (number of words, number of dimensions per word)\n")
# print("The first words are words that tend to occur more often.")

# print("Note: for unknown words, the representation is an empty vector,\n"
#       "and the index is the last one. The dictionary has a limit:")
# print("    {} --> {} --> {}".format("A word", "Index in embedding", 
#       "Representation"))
# word = "worsdfkljsdf"  # a word obviously not in the vocabulary
# idx = word_to_index1[word] # index for word obviously not in the vocabulary
# complete_vocabulary_size1 = idx 
# embd = list(np.array(index_to_embedding1[idx], dtype=int)) # "int" compact print
# print("    {} --> {} --> {}".format(word, idx, embd))
# word = "the"
# idx = word_to_index1[word]
# embd = list(index_to_embedding1[idx])  # "int" for compact print only.
# print("    {} --> {} --> {}".format(word, idx, embd))

In [50]:
# a_typing_test_sentence = 'The quick brown fox jumps over the lazy dog'
# print('\nTest sentence: ', a_typing_test_sentence, '\n')
# words_in_test_sentence = a_typing_test_sentence.split()

# print('Test sentence embeddings from complete vocabulary of', 
#       complete_vocabulary_size0, 'words:\n')
# for word in words_in_test_sentence:
#     word_ = word.lower()
#     embedding = index_to_embedding0[word_to_index0[word_]]
#     print(word_ + ": ", embedding)

In [51]:
# print('Test sentence embeddings from complete vocabulary of', 
#       complete_vocabulary_size1, 'words:\n')
# for word in words_in_test_sentence:
#     word_ = word.lower()
#     embedding = index_to_embedding1[word_to_index1[word_]]
#     print(word_ + ": ", embedding)

In [52]:
# # Define vocabulary size for the language model    
# # To reduce the size of the vocabulary to the n most frequently used words

# # dictionary has the items() function, returns list of (key, value) tuples
# limited_word_to_index00 = defaultdict(default_factory0, \
#     {k: v for k, v in word_to_index0.items() if v < EVOCABSIZE[0]})

# # Select the first EVOCABSIZE rows to the index_to_embedding
# limited_index_to_embedding00 = index_to_embedding0[0:EVOCABSIZE[0],:]
# # Set the unknown-word row to be all zeros as previously
# limited_index_to_embedding00 = np.append(limited_index_to_embedding00, 
#     index_to_embedding0[index_to_embedding0.shape[0] - 1, :].\
#         reshape(1,embedding_dim0), 
#     axis = 0)

# # dictionary has the items() function, returns list of (key, value) tuples
# limited_word_to_index01 = defaultdict(default_factory1, \
#     {k: v for k, v in word_to_index0.items() if v < EVOCABSIZE[1]})

# # Select the first EVOCABSIZE rows to the index_to_embedding
# limited_index_to_embedding01 = index_to_embedding0[0:EVOCABSIZE[1],:]
# # Set the unknown-word row to be all zeros as previously
# limited_index_to_embedding01 = np.append(limited_index_to_embedding01, 
#     index_to_embedding0[index_to_embedding0.shape[0] - 1, :].\
#         reshape(1,embedding_dim0), 
#     axis = 0)

# # Delete large numpy array to clear some CPU RAM
# del index_to_embedding0
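The commented-out cell above limits the vocabulary to the most frequent words and sends everything else to a trailing zero vector. The sketch below shows the same idea on a toy table. It assumes the embedding rows are ordered by word frequency with a final all-zero row for unknown words, and that the default factory returns the index of that trailing row; `limit_vocabulary` and the five-word table are hypothetical stand-ins, not the notebook's data.

```python
from collections import defaultdict
import numpy as np

# Toy stand-in for a loaded GloVe table: rows ordered by word frequency,
# followed by one all-zero row reserved for unknown words (assumed layout)
words = ["the", "movie", "was", "great", "dreadful"]
embedding_dim = 3
known_rows = np.arange(len(words) * embedding_dim, dtype=float) + 1.0
index_to_embedding = np.vstack([
    known_rows.reshape(len(words), embedding_dim),
    np.zeros((1, embedding_dim)),
])
word_to_index = {w: i for i, w in enumerate(words)}

def limit_vocabulary(word_to_index, index_to_embedding, vocab_size):
    """Keep the vocab_size most frequent words and map all other words
    to a trailing zero vector stored at index vocab_size."""
    limited_embedding = np.append(
        index_to_embedding[:vocab_size, :],
        index_to_embedding[-1:, :],  # unknown-word row
        axis=0,
    )
    limited_index = defaultdict(
        lambda: vocab_size,  # unknown words point at the last row
        {w: i for w, i in word_to_index.items() if i < vocab_size},
    )
    return limited_index, limited_embedding

limited_index, limited_embedding = limit_vocabulary(
    word_to_index, index_to_embedding, 3)
print(limited_embedding.shape)                       # (4, 3)
print(limited_embedding[limited_index["dreadful"]])  # [0. 0. 0.]
```

With a vocabulary of 3, "great" and "dreadful" both collapse to the zero vector, which illustrates how a small vocabulary can discard exactly the sentiment-bearing words a review classifier depends on.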

In [53]:
# # dictionary has the items() function, returns list of (key, value) tuples
# limited_word_to_index10 = defaultdict(default_factory0, \
#     {k: v for k, v in word_to_index1.items() if v < EVOCABSIZE[0]})

# # Select the first EVOCABSIZE rows to the index_to_embedding
# limited_index_to_embedding10 = index_to_embedding1[0:EVOCABSIZE[0],:]
# # Set the unknown-word row to be all zeros as previously
# limited_index_to_embedding10 = np.append(limited_index_to_embedding10, 
#     index_to_embedding1[index_to_embedding1.shape[0] - 1, :].\
#         reshape(1,embedding_dim1), 
#     axis = 0)

# # dictionary has the items() function, returns list of (key, value) tuples
# limited_word_to_index11 = defaultdict(default_factory1, \
#     {k: v for k, v in word_to_index1.items() if v < EVOCABSIZE[1]})

# # Select the first EVOCABSIZE rows to the index_to_embedding
# limited_index_to_embedding11 = index_to_embedding1[0:EVOCABSIZE[1],:]
# # Set the unknown-word row to be all zeros as previously
# limited_index_to_embedding11 = np.append(limited_index_to_embedding11, 
#     index_to_embedding1[index_to_embedding1.shape[0] - 1, :].\
#         reshape(1,embedding_dim1), 
#     axis = 0)

# # Delete large numpy array to clear some CPU RAM
# del index_to_embedding1

In [54]:
# # Note that a small EVOCABSIZE may yield some zero vectors for embeddings
# print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE[0], 'words:\n')
# for word in words_in_test_sentence:
#     word_ = word.lower()
#     embedding = limited_index_to_embedding[0][limited_word_to_index[0][word_]]
#     print(word_ + ": ", embedding)
    
# print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE[1], 'words:\n')
# for word in words_in_test_sentence:
#     word_ = word.lower()
#     embedding = limited_index_to_embedding[1][limited_word_to_index[1][word_]]
#     print(word_ + ": ", embedding)

In [55]:
# # Verify the new vocabulary: should get same embeddings for test sentence
# # Note that a small EVOCABSIZE may yield some zero vectors for embeddings
# print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE[0], 'words:\n')
# for word in words_in_test_sentence:
#     word_ = word.lower()
#     embedding = limited_index_to_embedding10[limited_word_to_index10[word_]]
#     print(word_ + ": ", embedding)
    
# print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE[1], 'words:\n')
# for word in words_in_test_sentence:
#     word_ = word.lower()
#     embedding = limited_index_to_embedding11[limited_word_to_index11[word_]]
#     print(word_ + ": ", embedding)

In [56]:
# limited_index_to_embeddings = [limited_index_to_embedding00,limited_index_to_embedding01,\
#                                limited_index_to_embedding10,limited_index_to_embedding11]
# limited_word_to_indexes = [limited_word_to_index00,limited_word_to_index01,\
#                                limited_word_to_index10,limited_word_to_index11]