Pick one of the company data files and build your own classifier. When you're satisfied with its performance (at this point just using the accuracy measure shown in the example), test it on one of the other datasets to see how well these kinds of classifiers translate from one context to another.

Include your model and a brief writeup of your feature engineering and selection process to submit and review with your mentor.

## Yelp

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

#For each website (Yelp in this case), there exist 500 positive and 500 negative sentences. 
#Those were selected randomly for larger datasets of reviews. 
#We attempted to select sentences that have a clearly positive or negative connotaton, 
#the goal was for no neutral sentences to be selected.

In [3]:
# Grab and process the RAW data.
data_path = ("https://raw.githubusercontent.com/browsingATM/Thinkful/master/Unit%202/yelp_labelled.txt")
            
yelp_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
yelp_raw.columns = ['review', 'score']
#score of 1 = positive, score of 0 = negative


In [17]:

keywords = ['cold', 'bad service', 'expensive', 'mean', 'not affordable', 'empty'
           'terrible', 'angry', 'boring', ]


In [18]:
for key in keywords:
    yelp_raw[str(key)] = yelp_raw.review.str.contains(
    ' ' + str(key) + ' ',
    case=False
    )

In [19]:
data = yelp_raw[keywords]
target = yelp_raw['review']

In [20]:
#binary data so using BNB
from sklearn.naive_bayes import BernoulliNB

#instantiate our model and store it in a new variable
bnb = BernoulliNB()

#fit data to our model
bnb.fit(data, target)

#classify, storing the result in a new variable
y_pred = bnb.predict(data)

# display results
print("Number of mislabeled points out of a total {} points : {}".format(
     data.shape[0],
     (target != y_pred).sum()
     ))

Number of mislabeled points out of a total 1000 points : 998


## Amazon 

In [12]:
# Grab and process the RAW data.
data_path = ("https://raw.githubusercontent.com/browsingATM/Thinkful/master/Unit%202/amazon_cells_labelled.txt")
            
amazon_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
amazon_raw.columns = ['review', 'score']
#score of 1 = positive, score of 0 = negative

In [21]:

keywords = ['cold', 'bad service', 'expensive', 'mean', 'not affordable', 'empty'
           'terrible', 'angry', 'boring', ]

In [22]:
for key in keywords:
    amazon_raw[str(key)] = amazon_raw.review.str.contains(
    ' ' + str(key) + ' ',
    case=False
    )

In [23]:
data = amazon_raw[keywords]
target = amazon_raw['review']

In [24]:
#binary data so using BNB
from sklearn.naive_bayes import BernoulliNB

#instantiate our model and store it in a new variable
bnb = BernoulliNB()

#fit data to our model
bnb.fit(data, target)

#classify, storing the result in a new variable
y_pred = bnb.predict(data)

# display results
print("Number of mislabeled points out of a total {} points : {}".format(
     data.shape[0],
     (target != y_pred).sum()
     ))

Number of mislabeled points out of a total 1000 points : 998


In [None]:
#questions
# feature engineering with 2 columns, one binary and the other strings?
# i was surprised that the model yielded the same results twice with different sets of keywords.
# I would have thought the opposite.