## Bag of Words Meets Bags of Popcorn


## I. How do we get the data?


In [1]:
import pandas as pd
# quoting=3 is csv.QUOTE_NONE, so double quotes inside reviews are kept as ordinary characters
train = pd.read_csv('/Users/aiswaryasankar/Desktop/Kaggle/kaggle_winter/Contest1_Bag_of_Words/labeledTrainData.tsv', header=0, delimiter='\t',
                   quoting=3)

print('\nWhat are the column names?\n')
print(train.columns.values)

print('\n\nWhat does the data look like?\n')
print(train['review'][0])


What are the column names?

['id' 'sentiment' 'review']


What does the data look like?

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which

## II. Data Cleaning and Text Preprocessing

In [2]:
print('How do we remove the HTML tags?\n')
print('We use a package called BeautifulSoup\n')
from bs4 import BeautifulSoup

print('How do we use BeautifulSoup?\n')
example1 = BeautifulSoup(train['review'][0], 'html.parser')

print(train['review'][0])
print(example1.get_text())

print('How do we deal with punctuation and numbers?')
print('We use a package called re - regular expressions\n\n')
print('re.sub("[^a-zA-Z]", " ", example1.get_text())')
print('This says: find anything that is NOT a lowercase or uppercase letter and replace it with a space')


import re
letters_only = re.sub("[^a-zA-Z]", " ", example1.get_text())

print(letters_only)


How do we remove the HTML tags?

We use a package called BeautifulSoup

How do we use BeautifulSoup?

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for th

In [3]:
print('\nHow do we extract individual words?')
lower_case = letters_only.lower()
words = lower_case.split()
print(words)


How do we extract individual words?
['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'co
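
The three cleaning steps so far (strip markup, keep letters only, lowercase and split) can be tried on a toy string without loading the dataset. This is a minimal sketch under stated assumptions: the sample text is made up, and a simple regular expression stands in for BeautifulSoup so that the snippet needs only the standard library. It is not a robust HTML parser.

```python
import re

# Made-up sample text (not taken from the dataset)
sample = 'Great film!<br /><br />10/10, would watch again.'

# Stand-in for BeautifulSoup: replace anything that looks like a tag with a space
no_tags = re.sub(r"<[^>]+>", " ", sample)

# Replace every character that is not a letter with a space
letters_only_sample = re.sub("[^a-zA-Z]", " ", no_tags)

# Lowercase, then split on whitespace to get individual words
sample_words = letters_only_sample.lower().split()
print(sample_words)
```

Note that the digits and punctuation disappear entirely, which is why "10/10" leaves no trace in the word list.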

In [None]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [None]:
# How can we remove stop words?
from nltk.corpus import stopwords
print(stopwords.words('english'))

# Build the stop-word set once, then keep only the words that aren't in it
stops = set(stopwords.words('english'))
words = [w for w in words if w not in stops]
print(words)
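
Stop-word filtering can be illustrated without downloading the NLTK corpus. The sketch below assumes a tiny hand-picked stop list (`toy_stops` is hypothetical; NLTK's English list is much longer) and a helper name, `remove_stop_words`, that does not appear in the notebook.

```python
# Hypothetical, hand-picked stop list for illustration only
toy_stops = {'this', 'is', 'a', 'the', 'and'}

def remove_stop_words(tokens, stops):
    """Return the tokens that are not in the stop-word set, keeping their order."""
    return [t for t in tokens if t not in stops]

tokens = ['this', 'movie', 'is', 'a', 'delightful', 'surprise']
kept = remove_stop_words(tokens, toy_stops)
print(kept)
```

Storing the stop words in a set rather than a list keeps each membership test fast, which matters once the filter runs over every word of every review.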

In [None]:
# Now let's put this into a function
# What are the questions we asked?
# 1) How do we get rid of HTML?
# 2) How do we get rid of punctuation?
# 3) How do we split data into individual words?
# 4) How do we remove words with little meaning?

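def review_to_words(raw_review):
    # 1) Strip the HTML markup
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    # 2) Replace everything that is not a letter with a space
    letters_only = re.sub("[^a-zA-Z]", ' ', review_text)
    # 3) Lowercase and split into individual words
    words = letters_only.lower().split()
    # 4) Drop stop words; a set makes each lookup fast
    stops = set(stopwords.words('english'))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)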

# Now we can clean all the reviews at once
num_reviews = train['review'].size
clean_train_reviews = []

# If the index is divisible by 1000 print a message
for i in range(0, num_reviews):
    if (i % 1000 == 0):
        print('Review %d of %d\n' %(i+1, num_reviews))
    clean_train_reviews.append(review_to_words(train['review'][i]))
    


## III. Creating Features from a Bag of Words

In [None]:
# Now what do we do with this word list?
# Let's figure out how many of each word is in each review.

# What does a CountVectorizer do?
# It takes in all of the words, builds the vocabulary (the dictionary),
# then counts how many times each vocabulary word occurs in each review
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', tokenizer=None,
                            preprocessor=None, stop_words=None,
                            max_features = 5000)

# fit_transform() will learn the vocabulary and transform data into
# feature vectors

train_data_features = vectorizer.fit_transform(clean_train_reviews)

# convert the results to an array
train_data_features = train_data_features.toarray()
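
To make the idea of learning a vocabulary and then counting against it concrete, here is a conceptual sketch in plain Python. It is not scikit-learn's implementation: the function names `build_vocabulary` and `count_words` are made up for illustration, and CountVectorizer's own tokenization and tie-breaking rules differ.

```python
from collections import Counter

def build_vocabulary(docs, max_features=None):
    """Learn a sorted vocabulary, optionally keeping only the most frequent words."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Rank by descending frequency; break ties alphabetically (an assumption)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [word for word, _ in ranked[:max_features]]
    return sorted(kept)

def count_words(docs, vocab):
    """Turn each document into a row of counts, one column per vocabulary word."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in index:  # words outside the vocabulary are ignored
                row[index[word]] += 1
        rows.append(row)
    return rows

toy_reviews = ["great movie great cast", "boring movie"]
toy_vocab = build_vocabulary(toy_reviews)
toy_features = count_words(toy_reviews, toy_vocab)
print(toy_vocab)
print(toy_features)

# A new review is counted against the SAME vocabulary; unseen words are dropped
new_features = count_words(["great unseen plot"], toy_vocab)
print(new_features)
```

The last call shows why new data is only counted, never used to rebuild the vocabulary: the columns must mean the same thing for every review the model sees.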

In [None]:
# Now how can we get a list of the top 5000 words in all the reviews?
# First the vocabulary

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vocab = vectorizer.get_feature_names_out()
print(vocab)

# Now let's sum up all the columns of train_data_features
import numpy as np
word_counts = np.sum(train_data_features, axis=0)

for tag, count in zip(vocab, word_counts):
    print(count, tag)
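
Summing along axis 0 adds up each column, giving one total per vocabulary word across all reviews. The same arithmetic can be sketched in plain Python on a made-up 2 x 4 count matrix (the values and names below are illustrative, not dataset results):

```python
# Toy count matrix: rows are reviews, columns are vocabulary words
example_vocab = ['boring', 'cast', 'great', 'movie']
example_counts = [
    [0, 1, 2, 1],
    [1, 0, 0, 1],
]

# zip(*rows) regroups the values by column, so each sum is one word's total
totals = [sum(column) for column in zip(*example_counts)]

for word, total in zip(example_vocab, totals):
    print(total, word)
```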


## IV. Random Forest

In [None]:
# What are we doing here?
# For example, if the word 'delightful' appears in every positive
# training review, seeing it in a test review strongly suggests that
# review is positive too. We want to know which groups of words
# best separate positive reviews from negative ones.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
# n_estimators=100 means the forest is built from 100 decision trees

# Now we have to train the model on the training data
# How do we train it? Call fit with the features and the known labels

forest = forest.fit(train_data_features, train['sentiment'])

# Thus we now have our Random Forest model trained.
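
How do many trees produce one answer? scikit-learn's documentation describes its forests as averaging the trees' probabilistic predictions rather than counting hard votes. The sketch below illustrates that idea for a single review, under loud assumptions: the five probabilities are made up, `forest_predict` is a hypothetical helper, and a fixed 0.5 threshold is a two-class simplification.

```python
# Made-up P(positive) estimates from five hypothetical trees for one review
tree_probabilities = [0.9, 0.6, 0.4, 0.8, 0.7]

def forest_predict(probabilities, threshold=0.5):
    """Average the per-tree probabilities and return (label, average)."""
    average = sum(probabilities) / len(probabilities)
    label = 1 if average > threshold else 0
    return label, average

label, average = forest_predict(tree_probabilities)
print(label, round(average, 2))
```

One tree leans negative here, but the average still favours a positive label. Adding more trees tends to make that average less sensitive to any single tree's mistakes.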

## V. Submission

In [None]:
# The purpose of creating a submission is to check how the model
# performs on test data it has not seen
# Why does the vectorizer have to use transform and not fit_transform?
# fit_transform learns the vocabulary first, while transform
# uses the vocabulary it has already learned

# What are the steps we took to prepare the data?
# We start with noisy data.

# 1. Read the data
test = pd.read_csv('/Users/aiswaryasankar/Desktop/Kaggle/kaggle_winter/Contest1_Bag_of_Words/testData.tsv', header=0, delimiter='\t',
                   quoting=3)
                   
# 2. Create a list for the cleaned reviews
# review_to_words passes each review through BeautifulSoup's get_text()
# Cleaned - no HTML, then remove punctuation, numbers and stop words
# Then use the previous model that you trained
# Predict using that model
# Write the results to a csv file
                   
cleaned_test_reviews = []
num_reviews = len(test['review'])

for i in range(0, num_reviews):
    if ((i+1) % 1000 == 0):
        print('Review %d of %d\n' %(i+1, num_reviews))
    clean_review = review_to_words(test['review'][i])
    cleaned_test_reviews.append(clean_review)

# Now run the test reviews through the already fitted bag of words
# 'Training' a bag of words means learning its vocabulary, which
# fit_transform did on the training reviews
test_data_features = vectorizer.transform(cleaned_test_reviews)
test_data_features = test_data_features.toarray()
                   
# Now pass the features to the model we fitted earlier and predict
result = forest.predict(test_data_features)

# Now we want to write out results to a file
# Copy the results to a pandas dataframe with an 'id' column
# and a sentiment column
# Where do we get the ids from? From the original test file
# Where do we get sentiment from? From the result we just predicted
output = pd.DataFrame(data={"id": test['id'], 'sentiment': result})
                   
output.to_csv("Bag_of_Words_model.csv", index=False, quoting=3)
                            