# Challenge - Feedback Analysis

###### Citations:
   - Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." 
       Proceedings of the ACM SIGKDD International Conference on Knowledge 
       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, 
       Washington, USA.
   - Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing 
       and Comparing Opinions on the Web." Proceedings of the 14th 
       International World Wide Web conference (WWW-2005), May 10-14, 
       2005, Chiba, Japan.

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [11]:
label_amazon = 'amazon_cells_labelled.txt'
label_imdb = 'imdb_labelled.txt'
label_yelp = 'yelp_labelled.txt'

In [12]:
site = label_amazon

In [13]:
pos_words = 'positive-words.txt'
neg_words = 'negative-words.txt'

pos = []
with open(pos_words) as p:
    for line in p:
        if not line.startswith(';') and line != '\n':
            pos.append(line.strip('\n'))
            
neg = []
with open(neg_words) as n:
    for line in n:
        if not line.startswith(';') and line != '\n':
            neg.append(line.strip('\n'))
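The two loading loops above differ only in the filename, so they could be folded into one helper. The sketch below is one possible way to do that, not part of the original notebook. `load_lexicon` is a hypothetical name, and the `latin-1` default is an assumption, made because some copies of the lexicon files are reported not to decode as UTF-8. The demo writes a tiny made-up file so it runs without the real lexicon.

```python
import os
import tempfile


def load_lexicon(path, encoding="latin-1"):
    """Return the words in a lexicon file, skipping ';' comments and blank lines.

    Hypothetical helper; the encoding default is an assumption.
    Unlike the loops above, it strips all surrounding whitespace, not just '\n'.
    """
    words = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            word = line.strip()
            if word and not word.startswith(";"):
                words.append(word)
    return words


# Demo on a tiny stand-in file (made-up contents, not the real lexicon).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("; header comment\n\ngood\ngreat\n")
    demo_path = tmp.name

demo_words = load_lexicon(demo_path)
os.remove(demo_path)
print(demo_words)
```

With the real files this would be called as `load_lexicon(pos_words)` and `load_lexicon(neg_words)`.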

In [14]:
reviews = []
sentiments = []

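with open(site) as s:
    for line in s:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        # Each line ends with a single-digit label (0 or 1);
        # everything before it is the review text.
        reviews.append(line[:-1].strip())
        sentiments.append(int(line[-1]))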

df = pd.DataFrame({'sentiment': sentiments, 'review': reviews})
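If the labelled files are tab-separated (an assumption here, since the loop above only relies on the label being the last character), pandas could parse them directly. The sketch below shows this on two made-up lines held in memory; the column names match the DataFrame built above.

```python
import csv
import io

import pandas as pd

# Two made-up lines in the assumed "<review>\t<label>" layout.
sample = io.StringIO("Great phone.\t1\nBattery died in a day.\t0\n")

demo_df = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["review", "sentiment"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)
print(demo_df)
```

Disabling quoting matters for review text, where a stray `"` could otherwise make the parser merge several lines into one field.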

In [15]:
sent_words = pos + neg

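# One boolean column per lexicon word: True if the lowercased review
# contains the word as a plain substring (so 'bad' also matches 'badge').
for word in sent_words:
    df[str(word)] = df['review'].str.lower().str.contains(str(word), regex=False)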
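The loop above uses plain substring matching, so a short word can fire inside a longer, unrelated one. A possible variation, sketched below on made-up reviews and a tiny stand-in word list, matches whole words only and builds all columns in one `pd.concat` call instead of inserting thousands of columns one at a time. Whether whole-word matching improves the classifier on these datasets is not tested here.

```python
import re

import pandas as pd

# Made-up reviews and a tiny stand-in word list (not the real lexicon).
demo = pd.DataFrame({"review": ["A good badge holder.", "Bad reception, sadly."]})
demo_words = ["good", "bad", "a+"]

lowered = demo["review"].str.lower()
features = {}
for word in demo_words:
    # Lookarounds require that the word is not glued to other word characters;
    # re.escape keeps symbols such as "+" literal.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    features[word] = lowered.str.contains(pattern, regex=True)

# Build all feature columns at once instead of inserting them one by one.
demo_features = pd.concat(features, axis=1)
print(demo_features)
```

Here "bad" is flagged in the second review but not inside "badge" in the first.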

In [16]:
df

,sentiment,review,a+,abound,abounds,abundance,abundant,accessable,accessible,acclaim,...,wrongly,wrought,yawn,zap,zapped,zaps,zealot,zealous,zealously,zombie
0,0,So there is no way for me to plug it in here i...,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,1,"Good case, Excellent value.",False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1,Great for the jawbone.,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,0,Tied to charger for conversations lasting more...,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1,The mic is great.,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,0,I have to jiggle the plug to get it to line up...,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,0,If you have several dozen or several hundred c...,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,1,If you are Razr owner...you must have this!,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,0,"Needless to say, I wasted my money.",False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,0,What a waste of money and time!.,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [17]:
data = df[sent_words]
target = df['sentiment']

In [18]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()

# Fit our model to the data.
bnb.fit(data, target)

# Classify the same data the model was trained on,
# storing the result in a new variable.
y_pred = bnb.predict(data)

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 137
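The count above is measured on the same rows the model was fitted on, so it probably understates the error on unseen reviews. A minimal sketch of a hold-out evaluation follows. It uses synthetic boolean features and labels (made up for illustration, not the review data), and the 80/20 split and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic boolean features standing in for the word-indicator columns.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 20)) < 0.3
# Synthetic labels loosely tied to the first feature so there is signal to learn.
y_demo = (X_demo[:, 0] ^ (rng.random(200) < 0.1)).astype(int)

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = BernoulliNB()
model.fit(X_train, y_train)
test_pred = model.predict(X_test)

n_wrong = int((y_test != test_pred).sum())
print("Mislabeled test points: {} out of {}".format(n_wrong, len(y_test)))
```

With the notebook's variables, `data` and `target` would take the place of `X_demo` and `y_demo`.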


### Mislabeled points (predicting on the training data):
 - Amazon - 137
 - IMDB   - 123
 - Yelp   - 144