# Challenge - Iterate and Evaluate Your Classifier
------------------------------------------------

### Source of sentiment words data (gathered November 2018):
 - [http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar](http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar)

###### Citations:
   - Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." 
       Proceedings of the ACM SIGKDD International Conference on Knowledge 
       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, 
       Washington, USA.
   - Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing 
       and Comparing Opinions on the Web." Proceedings of the 14th 
       International World Wide Web conference (WWW-2005), May 10-14, 
       2005, Chiba, Japan.
---------------------

### Import modules and enable the display of plots in this notebook

In [1]:
import io
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns

%matplotlib inline

### Ignore harmless seaborn warnings

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load the positive and negative sentiment word datasets into DataFrames

In [3]:
pos_words = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/positive_words.txt'
neg_words = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/negative_words.txt'

pos = pd.read_csv(pos_words)
neg = pd.read_csv(neg_words)

In [4]:
sent_words = [pos, neg]

### Create variable names for website review datasets

In [5]:
label_amazon = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/amazon_cells_labelled.txt'
label_imdb = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/imdb_labelled.txt'
label_yelp = 'https://raw.githubusercontent.com/djrgit/coursework/master/thinkful/data_science/my_progress/unit_2_supervised_learning/yelp_labelled.txt'

In [6]:
sites = [label_amazon, label_imdb, label_yelp]

# Original Model:

### Create a function to build an initial DataFrame from website reviews

In [7]:
def build_site_dataframe(website_reviews):
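    # Download the raw bytes of the review file.
    wr = requests.get(website_reviews).content
    # The delimiter pattern is not expected to match within a line, so each
    # line is read into a single column; the tab split happens in the next step.
    df = pd.read_table(io.StringIO(wr.decode('utf-8')), 
                       delimiter=r'\t\d\n', header=None, 
                       engine='python')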
    df = df[0].str.split(pat='\t', n=-1, expand=True)
    df = df.rename(columns={0: 'review', 1: 'sentiment'})
    df = df[['sentiment', 'review']]
    return df
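
The loader above reads each line as a single column and then splits it on the tab. A shorter route may be to let pandas split on the tab directly. The sketch below is only an illustration: it assumes every line has the form `review<TAB>label`, and it uses a made-up in-memory `sample` string in place of the real URLs. `quoting=csv.QUOTE_NONE` is passed so that stray double quotes inside a review are kept as literal text.

```python
import csv
import io

import pandas as pd

# Hypothetical two-line sample in the same "review<TAB>label" layout.
sample = 'Great phone, "really" loud speaker.\t1\nBattery died in a day.\t0\n'

df = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    header=None,
    names=['review', 'sentiment'],
    quoting=csv.QUOTE_NONE,  # keep literal double quotes in the review text
)
df = df[['sentiment', 'review']]
print(df)
```

One difference to keep in mind: read this way, `sentiment` is parsed as an integer column, whereas the string split in `build_site_dataframe` leaves the labels as strings.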

### Create a function to add positive and negative sentiment words to the website review DataFrame

In [8]:
def add_sent_words_to_df(df, sent_words):
    for sent in sent_words:
        for word in sent.iloc[:,0]:
            df[str(word)] = df['review'].str.lower().str.contains(str(word), regex=False)
    return df
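
Because `str.contains(..., regex=False)` is a substring test, a sentiment word also matches when it appears inside a longer word. The sketch below contrasts that with a whole-word match; the two reviews and the word `work` are invented for illustration and are not taken from the datasets or the lexicon.

```python
import re

import pandas as pd

# Hypothetical reviews, chosen so that "work" appears inside "artwork".
reviews = pd.Series(['The artwork is lovely.', 'This does not work at all.'])
word = 'work'

# Substring test, as in add_sent_words_to_df: also fires inside "artwork".
substring_hit = reviews.str.lower().str.contains(word, regex=False)

# Whole-word test: \b anchors the match at word boundaries.
pattern = r'\b' + re.escape(word) + r'\b'
whole_word_hit = reviews.str.lower().str.contains(pattern, regex=True)

print(substring_hit.tolist())   # [True, True]
print(whole_word_hit.tolist())  # [False, True]
```

Whether whole-word matching actually improves accuracy here is an open question; lexicon entries that begin or end with punctuation would also need different handling, since `\b` only marks boundaries next to word characters.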

### Create a function to run a Bernoulli Naive Bayes classifier on a prepped DataFrame with additional features

In [9]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB

# Test your model with different holdout groups.
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score

from sklearn.metrics import confusion_matrix

In [10]:
def run_bernoulli(prepped_df, site):
    
    # Include all additional word features in the data
    data = prepped_df.iloc[:,2:]
    
    target = prepped_df['sentiment']

    # Instantiate our model and store it in a new variable.
    bnb = BernoulliNB()

    # Fit our model to the data.
    bnb.fit(data, target)

    # Classify, storing the result in a new variable.
    y_pred = bnb.predict(data)

    # Display our results.
    print(site[site.rfind('/')+1:site.rfind('_')])
    print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], 
                                                                             (target != y_pred).sum()))
    
    # Use train_test_split to create the necessary training and test groups
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=101)
    fit_train = bnb.fit(X_train, y_train)
    print('With 20% Holdout: ' + str(fit_train.score(X_test, y_test)))
    predictions = bnb.predict(X_test)
    print('Confusion matrix: ')
    # Predictions are passed first, so rows are predicted labels and columns are true labels.
    print(confusion_matrix(predictions, y_test))
    print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

    print('Cross validation scores (5 folds): ')
    print(np.array(cross_val_score(bnb, data, target, cv=5)))
    print('\n')

### Run the above functions in a data pipeline for all website review datasets

In [11]:
for site in sites:
    run_bernoulli(add_sent_words_to_df(build_site_dataframe(site), sent_words), site)

amazon_cells
Number of mislabeled points out of a total 1000 points : 137
With 20% Holdout: 0.66
Confusion matrix: 
[[46  1]
 [67 86]]
Testing on Sample: 0.863
Cross validation scores (5 folds): 
[0.77  0.82  0.785 0.745 0.785]


imdb
Number of mislabeled points out of a total 1000 points : 123
With 20% Holdout: 0.735
Confusion matrix: 
[[84 38]
 [15 63]]
Testing on Sample: 0.877
Cross validation scores (5 folds): 
[0.75  0.76  0.805 0.74  0.78 ]


yelp
Number of mislabeled points out of a total 1000 points : 144
With 20% Holdout: 0.725
Confusion matrix: 
[[56  9]
 [46 89]]
Testing on Sample: 0.856
Cross validation scores (5 folds): 
[0.795 0.775 0.75  0.8   0.805]
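
The confusion matrices above are produced by `confusion_matrix(predictions, y_test)`, so each row is a predicted label and each column is a true label. As a worked example, the sketch below reads the Amazon holdout matrix that way and recovers the 0.66 holdout accuracy. It assumes label 0 means a negative review and label 1 a positive one; the variable names are illustrative.

```python
# Amazon holdout matrix as printed above; rows = predicted, columns = actual.
matrix = [[46, 1],
          [67, 86]]

pred0_true0, pred0_true1 = matrix[0]
pred1_true0, pred1_true1 = matrix[1]

total = pred0_true0 + pred0_true1 + pred1_true0 + pred1_true1
accuracy = (pred0_true0 + pred1_true1) / total

# Share of actual negatives (label 0) that were predicted as 0.
specificity = pred0_true0 / (pred0_true0 + pred1_true0)
# Share of actual positives (label 1) that were predicted as 1.
sensitivity = pred1_true1 / (pred0_true1 + pred1_true1)

print(round(accuracy, 3), round(specificity, 3), round(sensitivity, 3))
# 0.66 0.407 0.989
```

Read this way, the original model on the Amazon holdout set catches nearly all positive reviews but labels more than half of the negative reviews as positive.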




# Revised Models:

### Create a function to add positive and negative sentiment word counts to the website review DataFrame

In [12]:
def add_sent_words_to_df_r1(df, sent_words):
    
    df[sent_words[0].columns[0]] = 0
    df[sent_words[1].columns[0]] = 0
    df['pos_minus_neg'] = 0
    df['pos_minus_neg_gt_0'] = 0
    
    for sent in sent_words:
        for word in sent.iloc[:,0]:
            df[str(word)] = df['review'].str.lower().str.contains(str(word), regex=False)
            df[sent.columns[0]] += df['review'].map(lambda r: r.lower().count(str(word)))
            
    df['pos_minus_neg'] = df[sent_words[0].columns[0]] - df[sent_words[1].columns[0]]
    df['pos_minus_neg_gt_0'] = df['pos_minus_neg'] > 0
            
    return df
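
To make the derived columns concrete, here is a toy version of the same idea. The word lists, the reviews, and the column names `pos_count` and `neg_count` are made up for illustration (the function above names those columns after the header of each word file).

```python
import pandas as pd

# Hypothetical miniature word lists and reviews, for illustration only.
pos_list = ['good', 'great']
neg_list = ['bad', 'poor']
toy = pd.DataFrame({'review': ['Good sound, great price, bad case.',
                               'Poor battery and bad screen.']})

lowered = toy['review'].str.lower()
# Add up the per-word occurrence counts for each list.
toy['pos_count'] = sum(lowered.str.count(w) for w in pos_list)
toy['neg_count'] = sum(lowered.str.count(w) for w in neg_list)
toy['pos_minus_neg'] = toy['pos_count'] - toy['neg_count']
toy['pos_minus_neg_gt_0'] = toy['pos_minus_neg'] > 0
print(toy.drop(columns='review'))
```

The first review scores 2 positive and 1 negative word, so its flag is `True`; the second scores 0 and 2, so its flag is `False`. A review with equal counts, or with no sentiment words at all, also gets `False`, which is why the flag is described as positive versus negative/neutral.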

## Revised Model 1:

### Base Bernoulli model training data on whether sentiment is more positive or negative/neutral

In [13]:
def run_bernoulli_r1(prepped_df, site):
    
    # Use a single feature: whether a review contains more positive than negative sentiment words
    data = prepped_df.iloc[:,5:6]
    
    target = prepped_df['sentiment']
    

    # Instantiate our model and store it in a new variable.
    bnb = BernoulliNB()

    # Fit our model to the data.
    bnb.fit(data, target)

    # Classify, storing the result in a new variable.
    y_pred = bnb.predict(data)

    # Display our results.
    print(site[site.rfind('/')+1:site.rfind('_')])
    print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], 
                                                                             (target != y_pred).sum()))
    
    # Use train_test_split to create the necessary training and test groups
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=101)
    fit_train = bnb.fit(X_train, y_train)
    print('With 20% Holdout: ' + str(fit_train.score(X_test, y_test)))
    predictions = bnb.predict(X_test)
    print('Confusion matrix: ')
    # Predictions are passed first, so rows are predicted labels and columns are true labels.
    print(confusion_matrix(predictions, y_test))
    print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

    print('Cross validation scores (5 folds): ')
    print(np.array(cross_val_score(bnb, data, target, cv=5)))
    print('\n')

In [14]:
for site in sites:
    run_bernoulli_r1(add_sent_words_to_df_r1(build_site_dataframe(site), sent_words), site)

amazon_cells
Number of mislabeled points out of a total 1000 points : 212
With 20% Holdout: 0.775
Confusion matrix: 
[[90 22]
 [23 65]]
Testing on Sample: 0.788
Cross validation scores (5 folds): 
[0.78  0.825 0.775 0.785 0.775]


imdb
Number of mislabeled points out of a total 1000 points : 285
With 20% Holdout: 0.685
Confusion matrix: 
[[87 51]
 [12 50]]
Testing on Sample: 0.715
Cross validation scores (5 folds): 
[0.69  0.72  0.71  0.7   0.755]


yelp
Number of mislabeled points out of a total 1000 points : 277
With 20% Holdout: 0.745
Confusion matrix: 
[[80 29]
 [22 69]]
Testing on Sample: 0.723
Cross validation scores (5 folds): 
[0.745 0.75  0.69  0.715 0.715]




## Revised Model 2:

### Combine 'Revised Model 1' with the original model

In [15]:
def run_bernoulli_r2(prepped_df, site):
    
    # Base data on combined features from 'Revised Model 1' with the original model
    data = prepped_df.iloc[:,5:]
    
    target = prepped_df['sentiment']

    # Instantiate our model and store it in a new variable.
    bnb = BernoulliNB()

    # Fit our model to the data.
    bnb.fit(data, target)

    # Classify, storing the result in a new variable.
    y_pred = bnb.predict(data)

    # Display our results.
    print(site[site.rfind('/')+1:site.rfind('_')])
    print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], 
                                                                             (target != y_pred).sum()))
    
    # Use train_test_split to create the necessary training and test groups
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=101)
    fit_train = bnb.fit(X_train, y_train)
    print('With 20% Holdout: ' + str(fit_train.score(X_test, y_test)))
    predictions = bnb.predict(X_test)
    print('Confusion matrix: ')
    # Predictions are passed first, so rows are predicted labels and columns are true labels.
    print(confusion_matrix(predictions, y_test))
    print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

    print('Cross validation scores (5 folds): ')
    print(np.array(cross_val_score(bnb, data, target, cv=5)))
    print('\n')

In [16]:
for site in sites:
    run_bernoulli_r2(add_sent_words_to_df_r1(build_site_dataframe(site), sent_words), site)

amazon_cells
Number of mislabeled points out of a total 1000 points : 157
With 20% Holdout: 0.795
Confusion matrix: 
[[85 13]
 [28 74]]
Testing on Sample: 0.843
Cross validation scores (5 folds): 
[0.805 0.85  0.77  0.795 0.8  ]


imdb
Number of mislabeled points out of a total 1000 points : 139
With 20% Holdout: 0.755
Confusion matrix: 
[[85 35]
 [14 66]]
Testing on Sample: 0.861
Cross validation scores (5 folds): 
[0.74 0.78 0.82 0.77 0.8 ]


yelp
Number of mislabeled points out of a total 1000 points : 179
With 20% Holdout: 0.795
Confusion matrix: 
[[78 17]
 [24 81]]
Testing on Sample: 0.821
Cross validation scores (5 folds): 
[0.765 0.77  0.76  0.78  0.77 ]




## Revised Model 3:

In [17]:
def add_sent_words_to_df_r3(df, sent_words):
    
    df[sent_words[0].columns[0]] = 0
    df[sent_words[1].columns[0]] = 0
    df['pos_minus_neg'] = 0
    df['pos_minus_neg_gt_0'] = 0
    
    for sent in sent_words:
        for word in sent.iloc[:,0]:
            df[str(word)] = df['review'].map(lambda r: r.lower().count(str(word)))
            df[sent.columns[0]] += df['review'].map(lambda r: r.lower().count(str(word)))
            
    df['pos_minus_neg'] = df[sent_words[0].columns[0]] - df[sent_words[1].columns[0]]
    df['pos_minus_neg_gt_0'] = df['pos_minus_neg'] > 0
            
    return df

### Create a function to run a Multinomial Naive Bayes classifier on a prepped DataFrame with additional features

In [18]:
from sklearn.naive_bayes import MultinomialNB

In [19]:
def run_multinomial(prepped_df, site):
    
    # Include only the per-word count columns (positions 6 onward)
    data = prepped_df.iloc[:,6:]
    
    target = prepped_df['sentiment']

    # Instantiate our model and store it in a new variable.
    mnb = MultinomialNB()

    # Fit our model to the data.
    mnb.fit(data, target)

    # Classify, storing the result in a new variable.
    y_pred = mnb.predict(data)

    # Display our results.
    print(site[site.rfind('/')+1:site.rfind('_')])
    print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], 
                                                                             (target != y_pred).sum()))
    
    # Use train_test_split to create the necessary training and test groups
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=101)
    fit_train = mnb.fit(X_train, y_train)
    print('With 20% Holdout: ' + str(fit_train.score(X_test, y_test)))
    predictions = mnb.predict(X_test)
    print('Confusion matrix: ')
    # Predictions are passed first, so rows are predicted labels and columns are true labels.
    print(confusion_matrix(predictions, y_test))
    print('Testing on Sample: ' + str(mnb.fit(data, target).score(data, target)))

    print('Cross validation scores (5 folds): ')
    print(np.array(cross_val_score(mnb, data, target, cv=5)))
    print('\n')

In [20]:
for site in sites:
    run_multinomial(add_sent_words_to_df_r3(build_site_dataframe(site), sent_words), site)

amazon_cells
Number of mislabeled points out of a total 1000 points : 132
With 20% Holdout: 0.68
Confusion matrix: 
[[55  6]
 [58 81]]
Testing on Sample: 0.868
Cross validation scores (5 folds): 
[0.745 0.81  0.78  0.74  0.79 ]


imdb
Number of mislabeled points out of a total 1000 points : 121
With 20% Holdout: 0.76
Confusion matrix: 
[[83 32]
 [16 69]]
Testing on Sample: 0.879
Cross validation scores (5 folds): 
[0.765 0.755 0.825 0.74  0.785]


yelp
Number of mislabeled points out of a total 1000 points : 145
With 20% Holdout: 0.73
Confusion matrix: 
[[59 11]
 [43 87]]
Testing on Sample: 0.855
Cross validation scores (5 folds): 
[0.785 0.755 0.75  0.785 0.785]




## Revised Model 4:

In [21]:
def add_sent_words_to_df_r4(df, sent_words):
    
    df['pos_minus_neg'] = 0
    df[sent_words[0].columns[0]] = 0
    df[sent_words[1].columns[0]] = 0
    df['pos_minus_neg_gt_0'] = 0
    
    for sent in sent_words:
        for word in sent.iloc[:,0]:
            df[str(word)] = df['review'].map(lambda r: r.lower().count(str(word)))
            df[sent.columns[0]] += df['review'].map(lambda r: r.lower().count(str(word)))
            
    df['pos_minus_neg'] = df[sent_words[0].columns[0]] - df[sent_words[1].columns[0]]
    df['pos_minus_neg_gt_0'] = df['pos_minus_neg'] > 0
            
    return df

### Include more features in the Multinomial model

In [22]:
def run_multinomial_r4(prepped_df, site):
    
    # Include every feature column whose values are never negative, which skips
    # 'pos_minus_neg' at position 2 (MultinomialNB expects non-negative features)
    data = prepped_df.iloc[:,3:]
    
    target = prepped_df['sentiment']
    

    # Instantiate our model and store it in a new variable.
    mnb = MultinomialNB()

    # Fit our model to the data.
    mnb.fit(data, target)

    # Classify, storing the result in a new variable.
    y_pred = mnb.predict(data)

    # Display our results.
    print(site[site.rfind('/')+1:site.rfind('_')])
    print("Number of mislabeled points out of a total {} points : {}".format(data.shape[0], 
                                                                             (target != y_pred).sum()))
    
    # Use train_test_split to create the necessary training and test groups
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=101)
    fit_train = mnb.fit(X_train, y_train)
    print('With 20% Holdout: ' + str(fit_train.score(X_test, y_test)))
    predictions = mnb.predict(X_test)
    print('Confusion matrix: ')
    # Predictions are passed first, so rows are predicted labels and columns are true labels.
    print(confusion_matrix(predictions, y_test))
    print('Testing on Sample: ' + str(mnb.fit(data, target).score(data, target)))

    print('Cross validation scores (5 folds): ')
    print(np.array(cross_val_score(mnb, data, target, cv=5)))
    print('\n')
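
The `run_*` functions in this notebook differ only in the estimator and in the column slice they use. One possible way to factor that out is sketched below. The helper name `evaluate`, its return format, and the synthetic data are assumptions made for this example, not part of the notebook. The sketch also passes the true labels first to `confusion_matrix`, which is scikit-learn's documented argument order, so its matrix is the transpose of the ones printed in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB


def evaluate(model, data, target, test_size=0.2, random_state=101, cv=5):
    """Return in-sample, holdout, and cross-validated accuracy for one model."""
    X_train, X_test, y_train, y_test = train_test_split(
        data, target, test_size=test_size, random_state=random_state)
    model.fit(X_train, y_train)
    holdout = model.score(X_test, y_test)
    # True labels first, predictions second: rows = actual, columns = predicted.
    matrix = confusion_matrix(y_test, model.predict(X_test))
    # Refit on everything to get the in-sample accuracy.
    in_sample = model.fit(data, target).score(data, target)
    cv_scores = cross_val_score(model, data, target, cv=cv)
    return {'in_sample': in_sample, 'holdout': holdout,
            'confusion_matrix': matrix, 'cv_scores': cv_scores}


# Synthetic boolean features, made up purely to exercise the helper.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(bool)
y = X[:, 0].astype(int)  # the label simply copies the first feature
results = evaluate(BernoulliNB(), X, y)
print(results['holdout'], results['cv_scores'].mean())
```

With a helper along these lines, each model variant would reduce to a call such as `evaluate(MultinomialNB(), prepped_df.iloc[:, 3:], prepped_df['sentiment'])`, and the returned values could be collected into one comparison table instead of being printed.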

In [23]:
for site in sites:
    run_multinomial_r4(add_sent_words_to_df_r4(build_site_dataframe(site), sent_words), site)

amazon_cells
Number of mislabeled points out of a total 1000 points : 153
With 20% Holdout: 0.745
Confusion matrix: 
[[68  6]
 [45 81]]
Testing on Sample: 0.847
Cross validation scores (5 folds): 
[0.8   0.855 0.795 0.8   0.81 ]


imdb
Number of mislabeled points out of a total 1000 points : 152
With 20% Holdout: 0.775
Confusion matrix: 
[[82 28]
 [17 73]]
Testing on Sample: 0.848
Cross validation scores (5 folds): 
[0.78  0.775 0.865 0.79  0.795]


yelp
Number of mislabeled points out of a total 1000 points : 181
With 20% Holdout: 0.77
Confusion matrix: 
[[65  9]
 [37 89]]
Testing on Sample: 0.819
Cross validation scores (5 folds): 
[0.8   0.8   0.77  0.78  0.775]




### It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

   - Do any of your classifiers seem to overfit?
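    >- _None of these models appears to overfit badly: the 5-fold cross-validation scores are fairly consistent within each model. The in-sample accuracies are higher than the 20% holdout scores, though (for example, 0.863 vs. 0.66 for the original model on Amazon), so comparing error curves for the training set against a cross-validation or test set would provide deeper insight._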
    
   - Which seem to perform the best? Why?
    >- _The **original model (Bernoulli)** and **Revised Model 3 (Multinomial)** seem to perform best when judged by in-sample accuracy (see the values below). All features in these two models come directly from the raw text, either as flags for the presence of individual sentiment words or as counts of them._
        - Amazon, IMDB, Yelp  - Sample Accuracies
        - 0.863, 0.877, 0.856 - Original **
        - 0.788, 0.715, 0.723 - Rev 1
        - 0.843, 0.861, 0.821 - Rev 2
        - 0.868, 0.879, 0.855 - Rev 3 **
        - 0.847, 0.848, 0.819 - Rev 4

   
   - Which features seemed to be most impactful to performance?
    >- _Count-based features, as used in the Multinomial models, seem to carry more information about each review (how often a word occurs) than boolean flags that only record whether it occurs._
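
One rough way to probe the overfitting question with the numbers already reported is to compare each model's in-sample accuracy with the mean of its 5-fold cross-validation scores. The sketch below does this for the Amazon results only; the figures are copied from the outputs above, and the dictionary layout is just an illustration.

```python
# Accuracies copied from the Amazon outputs above: (in-sample, 5-fold CV scores).
amazon = {
    'Original': (0.863, [0.77, 0.82, 0.785, 0.745, 0.785]),
    'Rev 1': (0.788, [0.78, 0.825, 0.775, 0.785, 0.775]),
    'Rev 2': (0.843, [0.805, 0.85, 0.77, 0.795, 0.8]),
    'Rev 3': (0.868, [0.745, 0.81, 0.78, 0.74, 0.79]),
    'Rev 4': (0.847, [0.8, 0.855, 0.795, 0.8, 0.81]),
}

gaps = {}
for name, (in_sample, cv_scores) in amazon.items():
    cv_mean = sum(cv_scores) / len(cv_scores)
    # Positive gap: the model scores better on data it was trained on.
    gaps[name] = round(in_sample - cv_mean, 3)
    print(f'{name:8s} in-sample={in_sample:.3f} '
          f'cv_mean={cv_mean:.3f} gap={gaps[name]:+.3f}')
```

A larger positive gap suggests that a model fits its training data more closely than it generalizes. On these numbers the original model and Revised Model 3 show the widest gaps, while Revised Model 1 shows essentially none, so ranking models by in-sample accuracy alone may favor the ones that overfit most.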

### Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.