# Challenge - Feedback Analysis
------------------------------------------------

### Source of sentiment words data (gathered November 2018):
 - [http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar](http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar)

###### Citations:
   - Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." 
       Proceedings of the ACM SIGKDD International Conference on Knowledge 
       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, 
       Washington, USA.
   - Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing 
       and Comparing Opinions on the Web." Proceedings of the 14th 
       International World Wide Web conference (WWW-2005), May 10-14, 
       2005, Chiba, Japan.
---------------------

### Import modules and enable the display of plots in this notebook

In [1]:
import io
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns

%matplotlib inline

### Suppress `FutureWarning` messages (harmless here; raised by seaborn)

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load the positive and negative sentiment word datasets into DataFrames

In [3]:
pos_words = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/positive_words.txt'
neg_words = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/negative_words.txt'

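# Note: by default read_csv treats the first line of each file as the column
# header, so that line is not among the words iterated over later.
pos = pd.read_csv(pos_words)
neg = pd.read_csv(neg_words)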

In [4]:
sent_words = [pos, neg]

### Store the URLs of the website review datasets

In [5]:
label_amazon = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/amazon_cells_labelled.txt'
label_imdb = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/imdb_labelled.txt'
label_yelp = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/yelp_labelled.txt'

In [6]:
sites = [label_amazon, label_imdb, label_yelp]

### Create a function to build an initial DataFrame from website reviews

In [7]:
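def build_site_dataframe(website_reviews):
    # Download the raw file and stop early on an HTTP error.
    response = requests.get(website_reviews)
    response.raise_for_status()
    # The delimiter pattern is not meant to split anything: each line is kept
    # whole in column 0. The raw string avoids the invalid '\d' string escape.
    df = pd.read_table(io.StringIO(response.content.decode('utf-8')), 
                       delimiter=r'\t\d\n', header=None, 
                       engine='python')
    # Split each line on the tab into review text and label (labels stay strings).
    df = df[0].str.split(pat='\t', n=-1, expand=True)
    df = df.rename(columns={0: 'review', 1: 'sentiment'})
    df = df[['sentiment', 'review']]
    return df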
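
The parsing step above can be illustrated without pandas or a network call. The following is a minimal sketch, assuming each line of a review file has the form `review<TAB>label`; the helper name `parse_labelled_lines` and the sample text are hypothetical and not part of the original notebook.

```python
def parse_labelled_lines(text):
    """Split each 'review<TAB>label' line into a (label, review) pair."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so tabs inside the review text are preserved.
        review, label = line.rsplit('\t', 1)
        rows.append((label.strip(), review.strip()))
    return rows


# Hypothetical sample in the assumed file format.
sample = "Great phone, works well.\t1\nBattery died in a week.\t0\n"
rows = parse_labelled_lines(sample)
print(rows)
```

As in `build_site_dataframe`, the label comes back as a string (`'0'` or `'1'`) and is placed before the review text.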

### Create a function to add positive and negative sentiment words to the website review DataFrame

In [8]:
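def add_sent_words_to_df(df, sent_words):
    # Lowercase the reviews once rather than once per sentiment word.
    reviews = df['review'].str.lower()
    for sent in sent_words:
        for word in sent.iloc[:,0]:
            # One boolean column per word, using a plain substring test.
            df[str(word)] = reviews.str.contains(str(word), regex=False)
    return df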
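
Because the features come from a substring test, a word also counts as present when it appears inside a longer word. The sketch below shows this in plain Python and contrasts it with a whole-word variant. Both helper names, the word list, and the tokenizing pattern are assumptions made for illustration; the notebook itself uses only the substring test.

```python
import re


def keyword_features(review, words):
    """Substring test, mirroring str.contains(word, regex=False)."""
    lowered = review.lower()
    return {word: word in lowered for word in words}


def keyword_features_whole_word(review, words):
    """Whole-word test: a word must match an entire token."""
    tokens = set(re.findall(r"[a-z0-9+'-]+", review.lower()))
    return {word: word in tokens for word in words}


words = ['good', 'bad', 'fun']  # hypothetical mini lexicon
review = 'Not a GOOD idea: the plot is dysfunctional.'

substring = keyword_features(review, words)
whole_word = keyword_features_whole_word(review, words)
print(substring)
print(whole_word)
```

Here `'fun'` is flagged by the substring test because it occurs inside "dysfunctional", while the whole-word variant does not flag it. Which behavior is preferable depends on the lexicon, since substring matching also catches inflected forms.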

### Create a function to run a Bernoulli Naive Bayes classifier on a prepped DataFrame with additional features

In [9]:
def run_bernoulli(prepped_df, site):
    data = prepped_df.iloc[:,2:]
    target = prepped_df['sentiment']

    # Our data is binary / boolean, so we're importing the Bernoulli classifier.
    from sklearn.naive_bayes import BernoulliNB

    # Instantiate our model and store it in a new variable.
    bnb = BernoulliNB()

    # Fit our model to the data.
    bnb.fit(data, target)

    # Classify the same rows the model was fitted on, storing the result in a new variable.
    y_pred = bnb.predict(data)

    # Display our results, labelled with the dataset name parsed from the URL.
    print(site[site.rfind('/')+1:site.rfind('_')])
    print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], 
                                                                             (target != y_pred).sum()))
    print('Testing on Sample (Accuracy): ' + str(bnb.score(data, target)))
    print('\n')
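
To make the classifier less of a black box, here is a minimal pure-Python sketch of the calculation behind a Bernoulli Naive Bayes model: class priors plus per-feature presence probabilities with additive smoothing. It is an illustration under stated assumptions, not scikit-learn's implementation. The function names and the toy data are hypothetical, and `alpha=1.0` is chosen to match the default of `BernoulliNB`.

```python
import math


def fit_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate a log prior and smoothed feature probabilities per class."""
    model = {}
    n_features = len(rows[0])
    for cls in sorted(set(labels)):
        members = [row for row, label in zip(rows, labels) if label == cls]
        # Smoothed estimate of P(feature present | class).
        probs = [
            (sum(row[j] for row in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (math.log(len(members) / len(rows)), probs)
    return model


def predict_bernoulli_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(cls):
        log_prior, probs = model[cls]
        # Absent features contribute log(1 - p), which is what makes it Bernoulli.
        return log_prior + sum(
            math.log(p if present else 1.0 - p)
            for present, p in zip(row, probs)
        )
    return max(model, key=log_posterior)


# Hypothetical toy data: columns are [contains 'good', contains 'bad'].
rows = [[True, False], [True, False], [False, True], [False, True]]
labels = ['1', '1', '0', '0']

model = fit_bernoulli_nb(rows, labels)
predictions = [predict_bernoulli_nb(model, row) for row in rows]
print(predictions)
```

Unlike a multinomial model, the absence of a word also carries evidence here, which suits the boolean word-presence columns built earlier.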

### Run the above functions in a data pipeline for all website review datasets

In [10]:
for site in sites:
    run_bernoulli(add_sent_words_to_df(build_site_dataframe(site), sent_words), site)

amazon_cells
Number of mislabeled points out of a total 1000 points : 137
Testing on Sample (Accuracy): 0.863


imdb
Number of mislabeled points out of a total 1000 points : 123
Testing on Sample (Accuracy): 0.877


yelp
Number of mislabeled points out of a total 1000 points : 144
Testing on Sample (Accuracy): 0.856
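
The accuracies above are measured on the same rows that were used to fit each model, so they may overstate how well the classifier handles unseen reviews. One common check is to hold out part of the data. The sketch below shows only the index split, in plain Python; the helper name `train_test_indices`, the 20% test fraction, and the seed are assumptions for illustration.

```python
import random


def train_test_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row positions and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_indices(1000)
print(len(train_idx), len(test_idx))
```

With a prepped DataFrame, `df.iloc[train_idx]` and `df.iloc[test_idx]` would select the two subsets, so the model can be fitted on one and scored on the other. scikit-learn's `train_test_split` offers the same idea as a ready-made function.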


