### **Sentiment Analysis of Movie Reviews**
 Using the Bag of Words Meets Bags of Popcorn Kaggle Tutorial Competition

**Mandey Brown, Sam Fortescue, Mark Gardner, Emma Hamilton**

Primary Packages Utilized

In [None]:
#used to remove html tags
from bs4 import BeautifulSoup   

#for uploading files (if done using below code)
import io

#remove punctuation and numbers
import re

#Natural Language Toolkit (NLTK) --> used to remove stop words
import nltk     

#other packages
import sklearn    
import pandas as pd 
import numpy as np 

Reading in the Data

In [None]:
unlab_train = pd.read_csv("/path-to/unlabeledTrainData.tsv", header=0, \
                    delimiter="\t", quoting=3)

lab_train = pd.read_csv("/path-to/labeledTrainData.tsv", header=0, \
                    delimiter="\t", quoting=3)

test = pd.read_csv("/path-to/testData.tsv", header=0, delimiter="\t", \
                   quoting=3 )
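
The `quoting=3` argument is the integer value of `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters instead of field delimiters. The sketch below shows that behaviour with the standard-library `csv` module on a made-up two-row TSV string (the id and text are invented for illustration, not taken from the competition files):

```python
import csv
import io

# Hypothetical tab-separated sample; the id and review text are invented.
sample = 'id\treview\n"toy_1"\t"He said "great" twice"\n'

# csv.QUOTE_NONE has the integer value 3, the number passed as quoting=3 above.
assert csv.QUOTE_NONE == 3

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# With QUOTE_NONE the double quotes stay in the field text untouched.
print(rows[1])
```

This is why a review read this way can still begin with a literal `"` character.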

# Part 1 Bag of Words

In [None]:
print('The first review is:')
print(lab_train["review"][0])

The first review is:
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bi

In [None]:
#download and import the stopwords from the natural language toolkit library
nltk.download('stopwords')
from nltk.corpus import stopwords

#function to preprocess the data:
  #removes html tags, punctuation, and numbers
  #removes stop words
#converts the review to a string made of "important" words
def review_to_words( raw_review ):
    # 1. Remove HTML (name the parser explicitly to avoid a bs4 warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text() 
    
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    
    # 3. Convert to lower case and split into individual words
    words = letters_only.lower().split()                             
    
    # 4. Convert the stop word list to a set
      # sets are faster to search
    stops = set(stopwords.words("english"))                  
    
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]   
    
    # 6. Join the words back into one string separated by space
        # return the result.
    return( " ".join( meaningful_words ))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
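
To make the cleaning steps concrete, here is a minimal standard-library sketch of the same pipeline. It is an approximation under stated assumptions: `TOY_STOPS` is a tiny invented stop list standing in for NLTK's English list, and a regex replaces BeautifulSoup for tag stripping, which is less robust on real HTML.

```python
import re

# Tiny stand-in stop list; the notebook uses NLTK's full English list instead.
TOY_STOPS = {"the", "is", "a", "of", "and", "this", "but"}

def toy_review_to_words(raw_review):
    # Crude tag stripping with a regex (an assumption; BeautifulSoup is more robust).
    no_tags = re.sub(r"<[^>]+>", " ", raw_review)
    # Replace anything that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)
    # Lower-case, split on whitespace, and drop stop words.
    words = [w for w in letters_only.lower().split() if w not in TOY_STOPS]
    return " ".join(words)

cleaned = toy_review_to_words("Visually impressive!<br /><br />But this is a 2-hour bore.")
print(cleaned)  # visually impressive hour bore
```

Note that the digit in "2-hour" disappears along with the punctuation, so only "hour" survives.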


In [None]:
#use the previously defined function to clean all of the movie reviews
print("Cleaning the labeled training set movie reviews...")

clean_train_reviews = []

for i in range( 0, len(lab_train["review"])):
   clean_train_reviews.append( review_to_words( lab_train["review"].iloc[i] ))

print("Done Cleaning.")

Cleaning the labeled training set movie reviews...
Done Cleaning.


In [None]:
# create the bag of words
print("Creating the Bag-of-words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer"
# CountVectorizer is used to transform the cleaned review
# into a vector based on the frequency of each word that occurs

#Note: max_features limits the vocabulary to the 5000 most frequent
#words across the whole corpus (not the first 5000 words per review)
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000)
print("Bag-of-words Created.")

Creating the Bag-of-words...

Bag-of-words Created.
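
The effect of `max_features` can be illustrated without scikit-learn. The sketch below mimics the documented behaviour (keep the most frequent terms across the corpus, then count them per document) on an invented three-document corpus with a hypothetical cap of 4. It is not scikit-learn's implementation.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie", "good acting bad plot"]  # toy corpus
max_features = 4  # hypothetical small cap; the notebook uses 5000

# Count every word across the whole corpus, not per review.
corpus_counts = Counter(word for doc in docs for word in doc.split())

# Keep the max_features most frequent words, then order them alphabetically.
vocab = sorted(word for word, _ in corpus_counts.most_common(max_features))

# One count vector per document, with one column per vocabulary word.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)    # ['bad', 'good', 'movie', 'plot']
print(vectors)  # [[0, 2, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1]]
```

The word "acting" occurs only once in the corpus, so it falls outside the cap and gets no column.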


In [None]:
#fit_transform() has two tasks: 
#First, it fits the model and learns the vocab
#Second, it transforms training data into feature vectors
    #feature vectors are used to represent the frequency of each word in each review
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features

<25000x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 1975048 stored elements in Compressed Sparse Row format>
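
The numbers in this output show why the features are kept in a sparse format. A short worked calculation, using only the shape and stored-element count reported above:

```python
n_reviews, n_features = 25000, 5000   # matrix shape reported above
stored = 1975048                      # non-zero entries reported above

# Fraction of cells that actually hold a count.
density = stored / (n_reviews * n_features)

# Each stored entry is one distinct vocabulary word present in one review.
avg_distinct_words = stored / n_reviews

print(f"density: {density:.4%}")                                          # density: 1.5800%
print(f"distinct vocabulary words per review: {avg_distinct_words:.1f}")  # 79.0
```

Fewer than 2% of the cells are non-zero, so a dense array would mostly store zeros.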

In [None]:
#Fit the RF model using the labeled training set
#the bag of words are the features 
#the sentiment labels are the response variable

from sklearn.ensemble import RandomForestClassifier
print("Training the Random Forest (this may take a while)...")

forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( train_data_features, lab_train["sentiment"] )

print("Done training RF.")

Training the Random Forest (this may take a while)...
Done training RF.


In [None]:
#need to also clean the test data so that it can be
#compared to the training data
clean_test_reviews = []

print("Cleaning the test set movie reviews...")
for i in range(0,len(test["review"])):
  clean_test_reviews.append( review_to_words( test["review"].iloc[i] ))

print("Done cleaning test data.")

Cleaning the test set movie reviews...
Done cleaning test data.


In [None]:
#also get a bag of words for the test data
#and convert to feature vectors
#once again to be able to compare with the training data
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features

<25000x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 1927864 stored elements in Compressed Sparse Row format>

In [None]:
#predict the sentiment scores now using the test data feature and the RF
print("Predicting test labels...")
result = forest.predict(test_data_features)
print("Predicted Sentiment scores calculated.")

Predicting test labels...
Predicted Sentiment scores calculated.


In [None]:
#build the output to submit to the competition; the Kaggle score gives an accuracy
#to compare with the other approaches
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

In [None]:
#Use pandas to write the comma-separated csv file
#this csv will be submitted to the kaggle competition
output.to_csv( "Bag_of_Words_model.csv", index=False, quoting=3 )
print("CSV for Part 1 created.")

CSV for Part 1 created.


# Part 2 Word Vectors

In [None]:
#utilizes the cleaning function from part1 with 
#the option of not removing the stopwords 
#(which is more ideal for Word2Vec b/c relies on broader context of sentences)

#function will clean the data and convert review/sentence
#into a list of words
def review_to_wordlist( review, remove_stopwords=False ):
    # 1. Remove HTML (name the parser explicitly to avoid a bs4 warning)
    review_text = BeautifulSoup(review, "html.parser").get_text()
     
    # 2. Remove non-letters (i.e numbers etc.)
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    
    # 4. Remove stop words (less ideal for Word2Vec)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    
    # 5. Return a list of words
    return(words)

In [None]:
#Word2Vec expects one sentence per input,
    #each as a list of words
#the punkt tokenizer accounts for the variety of ways to end sentences in 
    #the English language (?, !, etc.)
    #it also does not rely on spacing and capitalization to do so

#download and load the tokenizer
nltk.download('punkt') 
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
#function to split a review into sentences

#Note: this may produce a warning that some input looks like a URL; this is not an issue
def review_to_sentences( review, tokenizer, remove_stopwords=False ):
    # 1. Use the NLTK tokenizer to split paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append( review_to_wordlist( raw_sentence, remove_stopwords ))
    
    # Return the list of sentences (each sentence is a list of words),
        # i.e. a list of lists
    return sentences

sentences_lab = []  # Initialize an empty list of sentences
sentences_unlab = []

print("Parsing sentences from labeled training set")
for review in lab_train["review"]:
    sentences_lab += review_to_sentences(review, tokenizer)
print("Finished parsing sentences in the labeled training set!")

print("Parsing sentences from unlabeled set")
for review in unlab_train["review"]:
    sentences_unlab += review_to_sentences(review, tokenizer)
print("Finished parsing sentences in the unlabeled training set!")
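
The list-of-lists shape that Word2Vec consumes is easier to see on a small example. The sketch below is a simplified stand-in: it splits sentences with a naive regex instead of the punkt tokenizer, so it will mishandle abbreviations and similar edge cases that punkt covers.

```python
import re

def toy_review_to_sentences(review):
    # Naive split after ., ! or ? followed by whitespace (an assumption; punkt
    # handles abbreviations and other edge cases that this regex does not).
    raw_sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    sentences = []
    for raw in raw_sentences:
        # Same letter-only, lower-case word split as review_to_wordlist.
        words = re.sub(r"[^a-zA-Z]", " ", raw).lower().split()
        if words:  # skip fragments that contain no letters
            sentences.append(words)
    return sentences

toy = toy_review_to_sentences("Great cast. Was it worth it? Not really!")
print(toy)  # [['great', 'cast'], ['was', 'it', 'worth', 'it'], ['not', 'really']]
```

Each review contributes several inner lists, which is why the loops above extend the result with `+=` and do not append it as one item.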

In [None]:
#logging will allow for Word2Vec to create a nice output message
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
        level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

In [None]:
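#initialize and train the Word2Vec model
#Note: `size` is the gensim 3.x parameter name; gensim 4.0+ calls it `vector_size`
#Note: only sentences_lab is passed in; sentences_unlab is parsed above but not used here
from gensim.models import Word2Vec

print("Training Word2Vec model...")
model = Word2Vec(sentences_lab, workers=num_workers, \
                size=num_features, min_count = min_word_count, \
                window = context, sample = downsampling, seed=1)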

In [None]:
#This makes the model more memory efficient
model.init_sims(replace=True)

# name the model to use in part 3
model_name = "word2vec_model"
model.save(model_name)

2022-03-01 00:22:41,831 : INFO : precomputing L2-norms of word weight vectors
2022-03-01 00:22:41,913 : INFO : saving Word2Vec object under word2vec_model, separately None
2022-03-01 00:22:41,916 : INFO : not storing attribute vectors_norm
2022-03-01 00:22:41,919 : INFO : not storing attribute cum_table
2022-03-01 00:22:42,327 : INFO : saved word2vec_model


# Part 3 Clustering with Word2Vec

In [None]:
#load the model created in Part2
model = Word2Vec.load("word2vec_model")

2022-03-01 00:23:57,927 : INFO : loading Word2Vec object from word2vec_model
2022-03-01 00:23:58,251 : INFO : loading wv recursively from word2vec_model.wv.* with mmap=None
2022-03-01 00:23:58,255 : INFO : setting ignored attribute vectors_norm to None
2022-03-01 00:23:58,258 : INFO : loading vocabulary recursively from word2vec_model.vocabulary.* with mmap=None
2022-03-01 00:23:58,264 : INFO : loading trainables recursively from word2vec_model.trainables.* with mmap=None
2022-03-01 00:23:58,266 : INFO : setting ignored attribute cum_table to None
2022-03-01 00:23:58,268 : INFO : loaded word2vec_model


In [None]:
#Word2Vec naturally creates clusters of semantically related words
#use a K-means approach to find the centers of the word clusters

#Note: This code does take a while to compute
    #To see how long it takes, a timer is set
    #It can take around 20 minutes (about 7.6 minutes in the run shown below)
import time
start = time.time() # Start time

#Set "k" (the number of clusters) to 1/5th of the vocab size
    #This way there are only about 5 words per cluster on average
      #This was found to be the most effective setting via the 
      #trial and error done by the group who originally made
      #the kaggle competition
word_vectors = model.wv.vectors  # named model.wv.syn0 in older gensim releases
num_clusters = word_vectors.shape[0] // 5

#Initialize a K-Means object
  #use it to extract the centroids
from sklearn.cluster import KMeans
print("Running K means")
kmeans_clustering = KMeans( n_clusters = num_clusters )

#store the cluster assignment for each word
idx = kmeans_clustering.fit_predict( word_vectors )

#see how long it took to compute the K-Means clustering
end = time.time()
elapsed = end - start
print("Time taken for K Means clustering: ", elapsed, "seconds.")




Running K means
Time taken for K Means clustering:  456.1618766784668 seconds.
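
For readers unfamiliar with what `fit_predict` computes, the following is a minimal pure-Python sketch of Lloyd's algorithm, the basic K-means procedure. It is illustrative only: the points and starting centroids are hypothetical, and scikit-learn's `KMeans` uses a smarter initialisation and many optimisations that are not reproduced here.

```python
def toy_kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm on 2-D points with fixed starting centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D "word vectors"; the real ones have 300 dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = toy_kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

The returned `labels` list plays the same role as `idx` above: one cluster id per input vector, in input order.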


In [None]:
#idx contains the cluster assignment for each word
#index2word (from the model created in part 2) contains the vocab from that model

#They are zipped into one dictionary
word_centroid_map = dict(zip( model.wv.index2word, idx ))
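
A small sketch of how this mapping works, and how it can be inverted to list the words in each cluster in a single pass. The vocabulary and cluster ids below are hypothetical stand-ins for `index2word` and `idx`.

```python
from collections import defaultdict

# Hypothetical vocabulary and cluster ids standing in for index2word and idx.
toy_vocab = ["good", "great", "bad", "awful", "film"]
toy_idx = [0, 0, 1, 1, 2]

# word -> cluster id, built the same way as word_centroid_map.
toy_word_centroid_map = dict(zip(toy_vocab, toy_idx))

# cluster id -> list of words, built in a single pass over the map.
cluster_to_words = defaultdict(list)
for word, cluster in toy_word_centroid_map.items():
    cluster_to_words[cluster].append(word)

print(toy_word_centroid_map["great"])  # 0
print(dict(cluster_to_words))          # {0: ['good', 'great'], 1: ['bad', 'awful'], 2: ['film']}
```

With the inverted dictionary, looking up the members of any cluster does not require rescanning the whole vocabulary.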

In [None]:
# Print the first 20 clusters

# Note: clusters can differ b/c Word2Vec relies on a random number seed
for cluster in range(0,20):
    # Print the cluster number
    print("\nCluster %d" % cluster)
        
    # Find all of the words for that cluster number, and print them out
    words = [word for word, c in word_centroid_map.items() if c == cluster]
    print(words)

In [None]:
#function to convert reviews into bag-of-centroids
  #similar to bag-of-words
    #but uses semantically related clusters instead of individual words

#will return an array for each review
  #each with a number of features equal to the number of clusters

def create_bag_of_centroids( wordlist, word_centroid_map ):
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max( word_centroid_map.values() ) + 1
    
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
    
    # Return the "bag of centroids"
    return bag_of_centroids
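
The function expects an iterable of words. `review_to_words` from Part 1 returns one space-joined string, and iterating over a string yields single characters, so such a review needs `.split()` first. The sketch below shows the difference with a hypothetical three-cluster map and a list-based stand-in for the NumPy version above.

```python
# Hypothetical word -> cluster map with three clusters.
toy_map = {"good": 0, "great": 0, "bad": 1, "awful": 1, "film": 2}

def toy_bag_of_centroids(wordlist, word_centroid_map):
    # One counter per cluster; a plain list is used here instead of a NumPy array.
    bag = [0] * (max(word_centroid_map.values()) + 1)
    for word in wordlist:
        if word in word_centroid_map:
            bag[word_centroid_map[word]] += 1
    return bag

review = "good film great film"

# Passing a list of words counts clusters as intended.
print(toy_bag_of_centroids(review.split(), toy_map))  # [2, 0, 2]

# Passing the raw string iterates over single characters, so nothing matches.
print(toy_bag_of_centroids(review, toy_map))          # [0, 0, 0]
```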

In [None]:
print("Creating Bag-of-Centroids Training data")
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros( (lab_train["review"].size, num_clusters), dtype="float32" )

# Transform the training set reviews into bags of centroids
# (each cleaned review is a single string, so split it into words first)
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids( review.split(), \
        word_centroid_map )
    counter += 1

print("Done Creating Bag-of-Centroids Training data")

Creating Bag-of-Centroids Training data
Done Creating Bag-of-Centroids Training data


In [None]:
# Repeat for test reviews (so the data can be compared)
print("Creating Bag-of-Centroids Test data")

test_centroids = np.zeros(( test["review"].size, num_clusters), \
        dtype="float32" )

counter = 0
for review in clean_test_reviews:
  # each cleaned review is a single string, so split it into words first
  test_centroids[counter] = create_bag_of_centroids( review.split(), \
    word_centroid_map )
  counter += 1

print("Done Creating Bag-of-Centroids Test data")

Creating Bag-of-Centroids Test data
Done Creating Bag-of-Centroids Test data


In [None]:
# Fit a random forest and extract predictions
forest = RandomForestClassifier(n_estimators = 100)

# Fitting the forest may take a few minutes
print("Fitting a Random Forest to labeled training data")
forest = forest.fit(train_centroids, lab_train["sentiment"])
result = forest.predict(test_centroids)

print("Done fitting the RF")

Fitting a Random Forest to labeled training data
Done fitting the RF


In [None]:
# Write the test results
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})

In [None]:
#create a CSV file from the output/results that will be submitted to the kaggle competition
output.to_csv( "BagOfCentroids.csv", index=False, quoting=3 )

# Additional Exploration: Bag-of-Words

This section expands on the Bag-of-Words approach from Part 1. It splits the labeled training data into a training portion and a held-out portion, so the model can be fit on one and evaluated on the other. That gives further insight into how well the Bag-of-Words model predicts the sentiment of a movie review.

Split the labeled training data

In [None]:
#Split the labeled training data here 
#so that model accuracy can be measured later
X = lab_train.drop('sentiment', axis = 1)
y = lab_train['sentiment']

#hold out 40% of the labeled data for measuring model accuracy
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=11)
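
With 25000 labeled reviews and `test_size=0.4`, the split leaves 15000 rows for training and 10000 for evaluation. The sketch below reproduces only that size arithmetic with a seeded shuffle from the standard library. `toy_split` is a hypothetical helper, and it will not select the same rows as scikit-learn's `train_test_split`.

```python
import math
import random

def toy_split(n_rows, test_size, seed):
    # Shuffle the row positions with a fixed seed, then cut off the test share.
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = math.ceil(n_rows * test_size)
    return positions[n_test:], positions[:n_test]

train_pos, test_pos = toy_split(n_rows=25000, test_size=0.4, seed=11)
print(len(train_pos), len(test_pos))  # 15000 10000
```

Fixing the seed makes the split repeatable, which is the purpose of `random_state=11` in the cell above.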

Data Cleaning and Text Preprocessing

In [None]:
#Note: this utilizes the review_to_words function defined in the Part 1 section
num_reviews = X_train["review"].size

print("Cleaning and parsing the training set movie reviews...\n")

clean_train_reviews = []

for i in range( 0, num_reviews ):
    # Print a progress message every 1000 reviews
    if( (i+1)%1000 == 0 ):
        print("Review %d of %d\n" % ( i+1, num_reviews ))
    clean_train_reviews.append( review_to_words( X_train["review"].iloc[i] ))

Cleaning and parsing the training set movie reviews...

Review 1000 of 15000

Review 2000 of 15000

Review 3000 of 15000

Review 4000 of 15000

Review 5000 of 15000

Review 6000 of 15000

Review 7000 of 15000

Review 8000 of 15000

Review 9000 of 15000

Review 10000 of 15000

Review 11000 of 15000

Review 12000 of 15000

Review 13000 of 15000

Review 14000 of 15000

Review 15000 of 15000



Creating the Bag of Words

In [None]:
print("Creating the Bag-of-Words")

#Note: this utilizes the CountVectorizer (vectorizer) previously defined in Part 1
#fit_transform() refits it here, so the vocabulary is learned from the 15000 training reviews only

train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()

print("Successfully created Bag-of-Words")

Creating the Bag-of-Words
Successfully created Bag-of-Words


In [None]:
#print the vocabulary that the Bag-of-Words identified
#Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vocab = vectorizer.get_feature_names()
print(vocab)





In [None]:
# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)
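
Printing every vocabulary word produces thousands of lines. One possible refinement is to rank the words by count and show only the top few. The sketch below uses hypothetical words and totals in place of `vocab` and `dist`.

```python
# Hypothetical vocabulary and per-word totals standing in for vocab and dist.
toy_vocab = ["acting", "bad", "good", "movie"]
toy_dist = [3, 7, 12, 20]

# Pair each word with its total and sort by count, largest first.
ranked = sorted(zip(toy_vocab, toy_dist), key=lambda pair: pair[1], reverse=True)

# Show only the three most frequent words.
for word, count in ranked[:3]:
    print(count, word)
```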

Data Cleaning and Text Preprocessing for Test Data

In [None]:
# Create an empty list and append the clean reviews one by one
num_reviews = len(X_test["review"])
clean_test_reviews = [] 

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,num_reviews):
    if( (i+1) % 1000 == 0 ):
        print("Review %d of %d\n" % (i+1, num_reviews))
    clean_review = review_to_words( X_test["review"].iloc[i] )
    clean_test_reviews.append( clean_review )

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

Cleaning and parsing the test set movie reviews...

Review 1000 of 10000

Review 2000 of 10000

Review 3000 of 10000

Review 4000 of 10000

Review 5000 of 10000

Review 6000 of 10000

Review 7000 of 10000

Review 8000 of 10000

Review 9000 of 10000

Review 10000 of 10000



Random Forest

In [None]:
print("Training the Random Forest...")
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 150 trees
forest = RandomForestClassifier(n_estimators=150)

# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable

# This may take a few minutes to run
forest = forest.fit( train_data_features, y_train )

print("Done training RF")

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

Training the Random Forest...
Done training RF


Model Accuracy

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

cm = confusion_matrix(y_test, result)
print("Confusion Matrix:")
print(cm)

f1 = f1_score(y_test, result)
print("F1 score:")
print(f1)

print("Accuracy score:")
accuracy_score(y_test, result)

Confusion Matrix:
[[4276  765]
 [ 748 4211]]
F1 score:
0.8477101157523905
Accuracy score:


0.8487
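
The reported scores can be checked by hand from the confusion matrix. scikit-learn puts true labels on the rows and predicted labels on the columns, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. A short worked calculation using the values printed above:

```python
# Values taken from the confusion matrix printed above: [[TN, FP], [FN, TP]].
tn, fp = 4276, 765
fn, tp = 748, 4211

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))   # 0.8487
print(round(precision, 4))  # 0.8463
print(round(recall, 4))     # 0.8492
print(round(f1, 4))         # 0.8477
```

Precision and recall are nearly equal here, which matches the balanced off-diagonal counts (765 false positives against 748 false negatives).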