---
Bag of Words Meets Bags of Popcorn
===

![](images/bag_of_popcorn.png)

Use word2vec to model movie reviews.

Based on [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial)

---
By the end of this exercise, you should be able to
---

1. Apply word2vec to a dataset
2. Make comparisons between models
3. Enter a Kaggle contest

---
Overview
---

1. Skim the [introduction section](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words) and the [word vector](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors) section
1. Work through [more fun with word vectors](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors)
1. Create models with the given architecture
    1. Start with `bag_of_words.py` to run a Random Forest
    1. Submit your model [here](https://www.kaggle.com/c/word2vec-nlp-tutorial/submissions/) to get a score
1. Tune (aka hyperparameter search)
1. Submit your best score to Slack
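
The tuning step can be sketched with scikit-learn's `GridSearchCV`. This is a minimal sketch, not the lab's own tuning code: the synthetic data from `make_classification` stands in for the bag-of-words features, and the grid values and the `roc_auc` scoring choice are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the bag-of-words matrix and sentiment labels (assumption)
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Hypothetical grid; the values are illustrative, not recommendations
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
}

# Try every combination in the grid with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `X` would be `train_data_features` and `y` would be `train["sentiment"]`; expect each extra grid value to multiply the run time.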

---
Code & Data
----

The code is in the `lab` folder and the `data` is in a subfolder.

You __always__ want to visually inspect your raw data. However, this data is fairly "big".   
I suggest a text editor designed for large files, such as [Ultra Edit](http://www.ultraedit.com/), or command-line tools such as `head`.
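
If you prefer to stay in Python, the same kind of peek can be done without loading the whole file. The helper below is a sketch: `peek_tsv` is a hypothetical name, and the demo runs on a tiny made-up file rather than the lab data.

```python
import csv
import os
import tempfile
from itertools import islice


def peek_tsv(path, n_rows=5):
    """Return the first n_rows rows of a tab-separated file without reading the rest."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps double quotes as literal characters
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, n_rows))


# Demo on a tiny made-up file; pass the path to labeledTrainData.tsv for the real data
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('id\tsentiment\treview\n"1_1"\t1\t"Loved it."\n"2_2"\t0\t"Dull."\n')

rows = peek_tsv(tmp.name, n_rows=2)
os.remove(tmp.name)

for row in rows:
    print(row)
```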

In [9]:
# cd ~/github/DSCI6004-instructor/week_7/7_1_word2vec/lab/data

In [10]:
%%bash
cd ~/OneDrive/DSCI6004-student/week_7/7_1_word2vec/lab/data  # `~` already expands to /Users/justw
head labeledTrainData.tsv


In [11]:
%%bash
pwd

/Users/justw/OneDrive/DSCI6004-student/week_7/7_1_word2vec


In [12]:
# Import the pandas package, then use the "read_csv" function to read
# the labeled training data
import pandas as pd       
train = pd.read_csv("/Users/justw/OneDrive/DSCI6004-student/week_7/7_1_word2vec/lab/data/labeledTrainData.tsv", header=0, \
                    delimiter="\t", quoting=3)

train.tail()

|       | id        | sentiment | review |
|-------|-----------|-----------|--------|
| 24995 | "3453_3"  | 0 | "It seems like more consideration has gone int... |
| 24996 | "5064_1"  | 0 | "I don't believe they made this film. Complete... |
| 24997 | "10905_3" | 0 | "Guy is a loser. Can't get girls, needs to bui... |
| 24998 | "10194_3" | 0 | "This 30 minute documentary Buñuel made in the... |
| 24999 | "8478_8"  | 1 | "I saw this movie as a child and it broke my h... |


Errata
-----

- Install Beautiful Soup via `conda install -c anaconda beautifulsoup4=4.5.1`
- Unzip files before fitting models
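
Unzipping can also be done from Python with the standard-library `zipfile` module. This is a sketch under assumptions: `unzip_all` is a hypothetical helper, and the archive name in the demo is made up, so check the actual file names in your `data` folder.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_all(data_dir):
    """Extract every .zip archive in data_dir next to the archive; return member names."""
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
            extracted.extend(zf.namelist())
    return extracted


# Demo in a throwaway directory with a made-up archive
with tempfile.TemporaryDirectory() as tmp_dir:
    with zipfile.ZipFile(Path(tmp_dir) / "example.tsv.zip", "w") as zf:
        zf.writestr("example.tsv", "id\tsentiment\n")
    names = unzip_all(tmp_dir)
    unzipped_exists = (Path(tmp_dir) / "example.tsv").exists()

print(names, unzipped_exists)
```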

In [13]:
import warnings
warnings.filterwarnings('ignore')

# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup             

# Initialize the BeautifulSoup object on a single movie review,
# naming the parser explicitly so bs4 does not have to guess one
example1 = BeautifulSoup(train["review"][0], "html.parser")

# Print the raw review and then the output of get_text(), for 
# comparison

# print (train["review"][0])
# print (example1.get_text())

import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search
print (letters_only)

lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi

In [14]:
import nltk
# nltk.download()  # Download text data sets, including stop words

In [15]:
from nltk.corpus import stopwords # Import the stop word list
print (stopwords.words("english"))

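# Remove stop words from "words"; build the stop word set once,
# because calling stopwords.words() for every word is slow
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]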
print(words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [18]:
def review_to_words( raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)

clean_review = review_to_words( train["review"][0] )
print(clean_review)

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working

In [22]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

print ("Cleaning and parsing the training set movie reviews...\n")

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review once; the index i goes from 0 to num_reviews - 1
for i in range( 0, num_reviews ):
    # If the index is evenly divisible by 1000, print a progress message
    if( (i+1)%1000 == 0 ):
        print ("Review %d of %d\n" % ( i+1, num_reviews ))
    # Call our function for each review, and add the result to the list of
    # clean reviews
    clean_train_reviews.append( review_to_words( train["review"][i] ))

Cleaning and parsing the training set movie reviews...

Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000



In [27]:
print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()

print("Done!")

Creating the bag of words...

Done!


In [26]:
train_data_features.shape

(25000, 5000)
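
To see what those 25,000 x 5,000 counts mean, here is the same `CountVectorizer` applied to two made-up "reviews". The toy sentences and the variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up documents
docs = ["great movie great cast", "boring movie"]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index; order the words by that index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(columns)  # ['boring', 'cast', 'great', 'movie']
print(counts)   # row 0 -> [0 1 2 1], row 1 -> [1 0 0 1]
```

Each row is one document and each column is one vocabulary word, so "great" appearing twice in the first document shows up as a 2 in that row.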

In [31]:
# Take a look at the words in the vocabulary
# (scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2)
vocab = vectorizer.get_feature_names()
print(vocab[:50], end="")

['abandoned', 'abc', 'abilities', 'ability', 'able', 'abraham', 'absence', 'absent', 'absolute', 'absolutely', 'absurd', 'abuse', 'abusive', 'abysmal', 'academy', 'accent', 'accents', 'accept', 'acceptable', 'accepted', 'access', 'accident', 'accidentally', 'accompanied', 'accomplished', 'according', 'account', 'accuracy', 'accurate', 'accused', 'achieve', 'achieved', 'achievement', 'acid', 'across', 'act', 'acted', 'acting', 'action', 'actions', 'activities', 'actor', 'actors', 'actress', 'actresses', 'acts', 'actual', 'actually', 'ad', 'adam']

In [37]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For the first 100 vocabulary words, print the number of times the word
# appears in the training set, followed by the word itself
for tag, count in zip(vocab[:100], dist):
    print (count, tag, end=" ")

187 abandoned 125 abc 108 abilities 454 ability 1259 able 85 abraham 116 absence 83 absent 352 absolute 1485 absolutely 306 absurd 192 abuse 91 abusive 98 abysmal 297 academy 485 accent 203 accents 300 accept 130 acceptable 144 accepted 92 access 318 accident 200 accidentally 88 accompanied 124 accomplished 296 according 186 account 81 accuracy 284 accurate 123 accused 179 achieve 139 achieved 124 achievement 90 acid 971 across 1251 act 658 acted 6490 acting 3354 action 311 actions 83 activities 2389 actor 4486 actors 1219 actress 369 actresses 394 acts 793 actual 4237 actually 148 ad 302 adam 98 adams 453 adaptation 80 adaptations 154 adapted 810 add 439 added 166 adding 347 addition 337 adds 113 adequate 124 admire 621 admit 134 admittedly 101 adorable 510 adult 376 adults 100 advance 90 advanced 153 advantage 510 adventure 204 adventures 91 advertising 259 advice 90 advise 346 affair 93 affect 113 affected 104 afford 126 aforementioned 343 afraid 212 africa 255 african 187 afternoon
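
The listing above is alphabetical. To rank words by frequency instead, `np.argsort` can be used, as in this sketch on four word/count pairs copied from the output above (the `_toy` names are hypothetical).

```python
import numpy as np

# Four word/count pairs taken from the printed listing
vocab_toy = ["abc", "able", "acting", "actually"]
counts_toy = np.array([125, 1259, 6490, 4237])

# argsort is ascending, so reverse it to get the most frequent words first
order = np.argsort(counts_toy)[::-1]
top = [(vocab_toy[i], int(counts_toy[i])) for i in order[:3]]

print(top)  # [('acting', 6490), ('actually', 4237), ('able', 1259)]
```

On the full data, `vocab_toy` and `counts_toy` would be replaced by `vocab` and `dist`.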

In [39]:
print ("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit( train_data_features, train["sentiment"] )
"Done!"

Training the random forest...


'Done!'

In [42]:
# Read the test data
test = pd.read_csv("/Users/justw/OneDrive/DSCI6004-student/week_7/7_1_word2vec/lab/data/testData.tsv", header=0, delimiter="\t", \
                   quoting=3 )

# Verify that there are 25,000 rows and 2 columns
print (test.shape)

# Create an empty list and append the clean reviews one by one
num_reviews = len(test["review"])
clean_test_reviews = [] 

print ("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,num_reviews):
    if( (i+1) % 1000 == 0 ):
        print ("Review %d of %d\n" % (i+1, num_reviews))
    clean_review = review_to_words( test["review"][i] )
    clean_test_reviews.append( clean_review )

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv( "Bag_of_Words_model.csv", index=False, quoting=3 )

(25000, 2)
Cleaning and parsing the test set movie reviews...

Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000



---
How to evaluate and submit your model
---

[Evaluation](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/evaluation)
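
A held-out split gives a quick local estimate before you upload anything. The sketch below assumes the metric is area under the ROC curve (confirm this on the Evaluation page linked above) and uses synthetic data as a stand-in for the bag-of-words features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_data_features and train["sentiment"] (assumption)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Hold out 25% of the labeled data for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# AUC is computed from scores, so use the predicted probability of class 1
valid_scores = model.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, valid_scores)

print(round(auc, 3))
```

Because AUC ranks scores, probabilities from `predict_proba` may score better than the hard 0/1 labels from `predict`; it is worth trying both.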

Hints:
---

- The competition is closed, so you cannot enter it, but your submissions can still be scored.
- Start with a very basic model then try to improve:
    1. Random Forest with bag of words
    2. See if word2vec helps
- `word2vec_average_vectors.py` takes a while to run.
- `word2vec_bag_of_centroids.py` takes a looooong time 😴 to run.
- While your models are running, brainstorm ways of making them better!
- [Compare your process and results to Stanford students](https://cs224d.stanford.edu/reports/SadeghianAmir.pdf)
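
One common way to turn word vectors into review-level features is to average the vectors of the words in a review. The sketch below illustrates that idea with made-up 3-dimensional vectors; it is an assumption that `word2vec_average_vectors.py` works along these lines, and `average_vector` is a hypothetical helper, not the script's code.

```python
import numpy as np

# Made-up 3-dimensional vectors; a trained word2vec model would supply real ones
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.1, 0.8, 0.1]),
    "boring": np.array([-0.7, 0.2, 0.1]),
}


def average_vector(words, vectors, dim):
    """Average the vectors of the words found in `vectors`; zeros if none are found."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "xyzzy" is not in the toy vocabulary, so it is skipped
features = average_vector("great movie xyzzy".split(), toy_vectors, dim=3)
print(features)
```

Every review becomes a fixed-length vector regardless of its word count, which is what lets a Random Forest consume it.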

<br>
<br> 
<br>

----