The UCI Machine Learning Repository hosts a [dataset of sentiment-labelled sentences](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences) for us to use. This dataset was created for the paper [From Group to Individual Labels using Deep Features](http://mdenil.com/media/papers/2015-deep-multi-instance-learning.pdf), Kotzias et al., KDD 2015.

Pick one of the company data files and build your own classifier. When you're satisfied with its performance (at this point just using the accuracy measure shown in the example), test it on one of the other datasets to see how well these kinds of classifiers translate from one context to another.

The zip file provided at the link above contains one tab-separated text file of labelled review sentences for each of three sources: Amazon, IMDB, and Yelp. For this project, I will be working with the IMDB dataset.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [155]:
#Extract the TXT from the ZIP file
import zipfile
filename = "Documents/Thinkful/Module 18/sentiment_labelled_sentences.zip"

zf = zipfile.ZipFile(filename)
imdb_raw = pd.read_csv(zf.open('sentiment labelled sentences/imdb_labelled.txt'), delimiter = '\t', header=None)
imdb_raw.columns = ['message', 'sentiment']
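
A note on parsing: as far as I recall, the dataset's README describes 500 positive and 500 negative sentences per source, yet the row count reported further down for IMDB is 748. One possible cause, offered as an assumption rather than a verified diagnosis, is that `pd.read_csv` treats a double quote at the start of a field as opening a quoted field and keeps reading across line breaks until it finds a closing quote. The sketch below reproduces that effect on a made-up sample (the sentences are not from the dataset) and shows `quoting=csv.QUOTE_NONE` as one way to read quotes literally.

```python
import csv
import io

import pandas as pd

# Made-up sample in the same "sentence<TAB>label" layout; not taken from the dataset.
sample = (
    '"Terrible acting.\t0\n'
    'Good plot.\t1\n'
    'He said "wow.\t1\n'
    'Not bad at all.\t1\n'
)

# Default parsing: a double quote at the start of a field opens a quoted field,
# which can swallow the following lines until the next double quote.
default_rows = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None)

# QUOTE_NONE: double quotes are treated as ordinary characters.
literal_rows = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None,
                           quoting=csv.QUOTE_NONE)

print(len(default_rows), len(literal_rows))
```

If the two counts differ on the real file as well, passing `quoting=csv.QUOTE_NONE` to the `read_csv` call above would be worth trying; note that it would also change every number reported in this notebook.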

In [156]:
imdb_raw.head()

| | message | sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [157]:
#Instead of defining lists of positive and negative words manually, let's extract them.

#pos_words = [' best ', ' good ', ' great ', ' greatest ', ' love ', ' loved ', ' perfect ', '']
#neg_words = [' disappoint ', ' disappointed ', ' disappointing ', ' disappointment ', ' horrible ', ' worst ', '  ']

#First, we clean the data so we can extract similar words, i.e. turn everything lowercase and remove punctuation
clean_msg = [row.lower()
           .replace(",", "").replace(".", "").replace("!", "").replace("?", "")
           .replace(";", "").replace(":", "").replace("*", "")
           .replace(" - ", " ").replace("(", "")
           .replace(")", "").replace("/", "")
           for row in imdb_raw['message']]

imdb_raw['message'] = clean_msg
imdb_raw.head()

| | message | sentiment |
|---|---|---|
| 0 | a very very very slow-moving aimless movie abo... | 0 |
| 1 | not sure who was more lost the flat characters... | 0 |
| 2 | attempting artiness with black & white and cle... | 0 |
| 3 | very little music or anything to speak of | 0 |
| 4 | the best scene in the movie was when gerardo i... | 1 |
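
The same cleaning step is repeated later for the Amazon data, so it may be worth wrapping in a function. The sketch below is one way to do that with `str.translate`; the name `clean_text` is my own, and the result should match the chained `.replace()` calls for ordinary sentences, though I have not checked every edge case (for example, punctuation directly next to a spaced hyphen).

```python
# Characters the cleaning step above deletes outright.
PUNCTUATION = ',.!?;:*()/'
_DELETE_TABLE = str.maketrans('', '', PUNCTUATION)


def clean_text(text):
    """Lowercase, drop the listed punctuation, and turn ' - ' into a single space."""
    return text.lower().translate(_DELETE_TABLE).replace(' - ', ' ')


cleaned = clean_text("A very, very, very slow-moving, aimless movie!")
print(cleaned)
```

It could then be applied to a whole column with something like `imdb_raw['message'].map(clean_text)`.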


In [158]:
#Split into two datasets: positive and negative
pos_msg = imdb_raw['message'][imdb_raw['sentiment']==1]
pos_msg = list(pos_msg.reset_index(drop=True))

neg_msg = imdb_raw['message'][imdb_raw['sentiment']==0]
neg_msg = list(neg_msg.reset_index(drop=True))

#Combine all strings into one long string for each dataset
separator = ' '
pos_long = separator.join(pos_msg)
neg_long = separator.join(neg_msg)

#Extract words and number of repeats in each dataset
from collections import Counter
pos_counts = Counter(pos_long.split())
neg_counts = Counter(neg_long.split())

In [159]:
#Get list of positive and negative words that appear at least twice
pos_words = [x[0] for x in pos_counts.most_common() if x[1] >= 2]
neg_words = [x[0] for x in neg_counts.most_common() if x[1] >= 2]

#Remove any words that appear in both lists
for word in pos_words[:]:
    if word in neg_words:
        pos_words.remove(word)
        neg_words.remove(word)
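
The keyword extraction above can also be written with set operations, which avoids removing items from lists while looping. This is a sketch of the same idea on toy sentences; the helper names `frequent_words` and `distinctive_words` are hypothetical and not part of the notebook.

```python
from collections import Counter


def frequent_words(messages, min_count=2):
    """Words that occur at least min_count times across all messages."""
    counts = Counter(' '.join(messages).split())
    return [word for word, count in counts.most_common() if count >= min_count]


def distinctive_words(pos_messages, neg_messages, min_count=2):
    """Frequent words that appear in only one of the two groups."""
    pos_frequent = frequent_words(pos_messages, min_count)
    neg_frequent = frequent_words(neg_messages, min_count)
    shared = set(pos_frequent) & set(neg_frequent)
    pos_only = [w for w in pos_frequent if w not in shared]
    neg_only = [w for w in neg_frequent if w not in shared]
    return pos_only, neg_only


# Toy data, made up for illustration.
toy_pos = ['great great film', 'loved the film']
toy_neg = ['awful awful film film']
toy_pos_words, toy_neg_words = distinctive_words(toy_pos, toy_neg)
print(toy_pos_words, toy_neg_words)
```

Here "film" is frequent in both groups, so it is dropped from both lists, just as in the loop above.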


In [160]:
#Now that we have a list of positive and negative keywords, compare them to all messages
for word in pos_words:
    imdb_raw['Positive: ' + str(word)] = imdb_raw.message.str.contains(' ' + str(word) + ' ', case=False)
    
for word in neg_words:
    imdb_raw['Negative: ' + str(word)] = imdb_raw.message.str.contains(' ' + str(word) + ' ', case=False)
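
One limitation of matching `' ' + word + ' '` is that a keyword at the very start or end of a message has no space on one side, so it is not detected. A possible workaround, shown here on made-up messages, is to pad each message with spaces before matching; `regex=False` is my own addition so that the keyword is treated as a literal string. Applying this to the notebook would change the feature columns and therefore the results reported below.

```python
import pandas as pd

# Made-up, already lowercased messages for illustration.
messages = pd.Series(['great film', 'a great film', 'not greatness'])
word = 'great'

# Matching as in the cell above: needs a space on both sides of the keyword.
spaced_only = messages.str.contains(' ' + word + ' ')

# Pad each message so keywords at the start or end are also surrounded by spaces.
padded = (' ' + messages + ' ').str.contains(' ' + word + ' ', regex=False)

print(spaced_only.tolist())
print(padded.tolist())
```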
       

In [161]:
#Convert sentiment column into boolean for positive sentiment
imdb_raw['sentiment'] = (imdb_raw['sentiment'] == 1)

In [162]:
#Separate the keyword features (data) from the label (target) for the Bernoulli Naive Bayes model
data = imdb_raw[imdb_raw.columns.drop(['message','sentiment'])]
target = imdb_raw['sentiment']

In [163]:
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 748 points : 104
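
The count above is measured on the same rows the model was fit to, so it may be optimistic. A common alternative is to hold out part of the data for evaluation. The sketch below shows the pattern with scikit-learn's `train_test_split` on synthetic boolean features (random data, not the IMDB features), so the accuracy it prints says nothing about this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the keyword matrix: 200 rows, 5 boolean features.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(200, 5)).astype(bool)
labels = features[:, 0]  # the label simply copies the first feature

# Keep 25% of the rows aside; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

model = BernoulliNB().fit(X_train, y_train)
holdout_accuracy = (model.predict(X_test) == y_test).mean()
print(holdout_accuracy)
```

With the notebook's variables, `data` and `target` would take the place of `features` and `labels`.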


### Test Other Data
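Now that we have a working model for the IMDB data, let's apply the same keyword features to reviews from Amazon to see how well they transfer. Note that the code below fits a new classifier on the Amazon data, so it tests the IMDB-derived keywords rather than the fitted IMDB model itself.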

In [164]:
#Extract the TXT from the ZIP file
amazon_raw = pd.read_csv(zf.open('sentiment labelled sentences/amazon_cells_labelled.txt'), delimiter = '\t', header=None)
amazon_raw.columns = ['message', 'sentiment']

clean_msg = [row.lower()
           .replace(",", "").replace(".", "").replace("!", "").replace("?", "")
           .replace(";", "").replace(":", "").replace("*", "")
           .replace(" - ", " ").replace("(", "")
           .replace(")", "").replace("/", "")
           for row in amazon_raw['message']]

amazon_raw['message'] = clean_msg
amazon_raw.head()

| | message | sentiment |
|---|---|---|
| 0 | so there is no way for me to plug it in here i... | 0 |
| 1 | good case excellent value | 1 |
| 2 | great for the jawbone | 1 |
| 3 | tied to charger for conversations lasting more... | 0 |
| 4 | the mic is great | 1 |


In [165]:
#Now that we have a list of positive and negative keywords, compare them to all messages
for word in pos_words:
    amazon_raw['Positive: ' + str(word)] = amazon_raw.message.str.contains(' ' + str(word) + ' ', case=False)
    
for word in neg_words:
    amazon_raw['Negative: ' + str(word)] = amazon_raw.message.str.contains(' ' + str(word) + ' ', case=False)
       
#Convert sentiment column into boolean for positive sentiment
amazon_raw['sentiment'] = (amazon_raw['sentiment'] == 1)        

#Separate the keyword features (data) from the label (target)
data = amazon_raw[amazon_raw.columns.drop(['message','sentiment'])]
target = amazon_raw['sentiment']

# Instantiate a new model. It is re-fit on the Amazon data below, so only the
# IMDB-derived keyword features carry over, not the model fitted on IMDB.
bnb = BernoulliNB()

# Fit the model to the Amazon data.
bnb.fit(data, target)

# Classify the same Amazon data, storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 339
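
A stricter transfer test would fit the classifier on one source and predict on the other without refitting. The sketch below shows that pattern on a handful of made-up sentences; `keyword_features` is a hypothetical helper (it pads messages with spaces before matching, which differs slightly from the cells above), and the numbers it produces are not results for the real datasets.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB


def keyword_features(messages, keywords):
    """One boolean column per keyword: does the message contain it as a whole word?"""
    padded = ' ' + messages.str.lower() + ' '
    return pd.DataFrame(
        {word: padded.str.contains(' ' + word + ' ', regex=False) for word in keywords}
    )


# Made-up "source" and "target" reviews, for illustration only.
source = pd.DataFrame({
    'message': ['great film', 'awful plot', 'great acting', 'awful pacing'],
    'sentiment': [True, False, True, False],
})
target_domain = pd.DataFrame({
    'message': ['great phone', 'awful battery', 'the screen is great'],
    'sentiment': [True, False, True],
})
keywords = ['great', 'awful']

# Fit once on the source domain only.
model = BernoulliNB().fit(keyword_features(source['message'], keywords),
                          source['sentiment'])

# Predict on the target domain with the already fitted model (no second fit).
target_pred = model.predict(keyword_features(target_domain['message'], keywords))
transfer_accuracy = (target_pred == target_domain['sentiment']).mean()
print(transfer_accuracy)
```

In the notebook this would mean keeping the `bnb` fitted on the IMDB features and calling `bnb.predict` on the Amazon features directly.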


### Conclusions:
The keyword features derived from the IMDB data gave much higher accuracy on the IMDB dataset than on the Amazon dataset: ~14% of the IMDB messages (104 of 748) were mislabeled, while ~34% of the Amazon messages (339 of 1000) were mislabeled. Both figures are measured on the same data each classifier was fit to. The gap is likely because IMDB reviews are about movies while Amazon reviews are about products, so the positive and negative words used in each probably differ greatly.
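
For reference, the percentages quoted above follow directly from the two printed counts; the variable names here are my own.

```python
# Mislabeled counts and totals as printed by the two cells above.
imdb_error = 104 / 748
amazon_error = 339 / 1000

# Accuracy is the complement of the error rate.
imdb_accuracy = 1 - imdb_error
amazon_accuracy = 1 - amazon_error

print(f"IMDB:   {imdb_error:.1%} mislabeled, {imdb_accuracy:.1%} accurate")
print(f"Amazon: {amazon_error:.1%} mislabeled, {amazon_accuracy:.1%} accurate")
```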