## Challenge Description

Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:

- Do any of your classifiers seem to overfit?
- Which seem to perform the best? Why?
- Which features seemed to be most impactful to performance?

In [7]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import string
from sklearn.naive_bayes import BernoulliNB
import itertools
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

pd.set_option('display.max_rows', 100)

In [8]:
data_amazon = "amazon_cells_labelled.txt"
amazon_raw = pd.read_csv(data_amazon, sep='\t', header=None)
amazon_raw.columns = ['sentence', 'score']

In [9]:
amazon_raw.head()

|   | sentence | score |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [10]:
#https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
word_dict = {}
punc_trans = str.maketrans('', '', string.punctuation)
num_trans = str.maketrans('', '', string.digits)
for sentence in amazon_raw['sentence']:
    words = [word.lower() \
             .translate(punc_trans) \
             .translate(num_trans) \
             .strip()
             for word in sentence.split(' ')]
    for word in words:
        word_dict[word] = word_dict.get(word, 0) + 1

In [11]:
word_series = pd.Series(word_dict)
word_series.head()

           128
a          218
abhor        1
ability      2
able         4
dtype: int64

In [12]:
keywords = ['great', 'good', 'quality', 'recommend', 'excellent',
           'best', 'like', 'nice']
stretch_keys = ['great', 'good', 'quality', 'recommend', 'excellent',
           'best', 'service', 'price', 'like', 'nice', 'works', 'work',
               'dont', 'price', 'really', 'service']
all_words = list(word_series.drop('').index)

for key in all_words:
    amazon_raw[str(key)] = amazon_raw.sentence.str.contains(
        ' ' + str(key) + ' ',
        case=False)
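
Padding each key with spaces only matches words that sit strictly inside a sentence, so a keyword at the very start or end of a review, or next to punctuation, is not flagged. A minimal sketch of one alternative, assuming regex word boundaries are acceptable here; the `toy` frame and the `contains_word` helper are hypothetical names made up for illustration, and this is not the matching used for the results in this notebook:

```python
import re
import pandas as pd

# Hypothetical toy data standing in for amazon_raw['sentence'].
toy = pd.DataFrame({'sentence': ["Great phone, works well.",
                                 "Not great",
                                 "It greatly annoys me."]})

def contains_word(sentences, key):
    """Flag sentences containing `key` as a whole word, ignoring case."""
    # \b matches at word boundaries, including the start and end of the string.
    pattern = r'\b' + re.escape(key) + r'\b'
    return sentences.str.contains(pattern, case=False, regex=True)

# Space-padded matching, as in the cell above.
spaced = toy['sentence'].str.contains(' great ', case=False)
# Word-boundary matching.
bounded = contains_word(toy['sentence'], 'great')

print(spaced.tolist())
print(bounded.tolist())
```

Switching to this kind of matching would change the feature columns, so the scores reported below would not carry over unchanged.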

In [13]:
amazon_raw['score'] = (amazon_raw['score'] == 1)

Below I refine the "greedy tuning" workflow from the first challenge in two ways: each candidate model is scored by its mean cross-validation accuracy, and a final hold-out test set is only used after cross validation has decided which model to keep.

In [16]:
target = amazon_raw['score']

CONSTANT_SPLIT = 42
def greedy_tuning(start_error_rate, keys, cap=None):
    # start_error_rate is not used in this version of the function.
    if cap is None:
        cap = len(keys)
    bnb_best = None   
    best_score_mean = 0
    greedy_list = []
    test_keys = keys.copy()
    for i in range(cap):
        best_add = None
        for keyword in test_keys:
            test_list = greedy_list.copy()
            test_list.append(keyword)
            data = amazon_raw[test_list]
            X_train, X_test, y_train, y_test = train_test_split(
                data, target, test_size=0.1, random_state=CONSTANT_SPLIT)
            bnb = BernoulliNB()
            scores = cross_val_score(bnb, X_train, y_train, cv=5)
            score_mean = scores.mean()
            if score_mean > best_score_mean:
                best_add, best_score_mean = keyword, score_mean
                bnb_best = bnb
        if best_add is None:
            break
        else:
            test_keys.remove(best_add)
            greedy_list.append(best_add)
    data = amazon_raw[greedy_list]
    X_train, X_test, y_train, y_test = train_test_split(
                data, target, test_size=0.1, random_state=CONSTANT_SPLIT)
    bnb_best.fit(X_train, y_train)
    y_pred = bnb_best.predict(X_test)
    print(greedy_list)
    print("CV Score: ", best_score_mean)
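    # Despite the "Score" label, this is the misclassification rate
    # on the hold-out set (lower is better).
    print("Score on Test Set: ", (y_pred != y_test).sum()/len(y_test))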
    true_pos = ((y_test == 1) & (y_pred == 1)).sum()
    true_neg = ((y_test == 0) & (y_pred == 0)).sum()
    false_pos = ((y_test == 0) & (y_pred == 1)).sum()
    false_neg = ((y_test == 1) & (y_pred == 0)).sum()

    specificity = true_neg / (true_neg + false_pos)
    sensitivity = true_pos / (true_pos + false_neg)

    print("Confusion Matrix: ")
    print(true_neg, false_neg)
    print(false_pos, true_pos)

    print("Sensitivity: ", sensitivity)
    print("Specificity: ", specificity)
    return greedy_list, bnb_best
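
The hand-computed counts in the function can be cross-checked against `sklearn.metrics`. The following is only a sketch on made-up labels (the `y_true` and `y_pred` arrays are assumptions for illustration, not data from this notebook). Boolean targets such as `amazon_raw['score']` would first be converted with `.astype(int)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only (1 = positive review).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# scikit-learn puts actual classes on rows and predicted classes on
# columns, so for binary labels ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Sensitivity is recall of the positive class: tp / (tp + fn).
sensitivity = recall_score(y_true, y_pred)
# Specificity is recall of the negative class: tn / (tn + fp).
specificity = recall_score(y_true, y_pred, pos_label=0)

print(tn, fp, fn, tp)
print("Sensitivity: ", sensitivity)
print("Specificity: ", specificity)
```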

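Running the function with the same parameters as before, we get the following results. Each confusion matrix is printed with predicted classes on rows and actual classes on columns, negative class first.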

In [17]:
greedy_list_min, bnb_min = greedy_tuning(1000, keywords)
greedy_list_med, bnb_med = greedy_tuning(1000, stretch_keys)
greedy_list_all, bnb_all = greedy_tuning(1000, all_words)
greedy_list_cap, bnb_cap = greedy_tuning(1000, all_words, cap=10)

['great', 'good', 'best', 'recommend', 'like', 'excellent', 'nice']
CV Score:  0.6022222222222222
Score on Test Set:  0.41
Confusion Matrix: 
46 37
4 13
Sensitivity:  0.26
Specificity:  0.92
['great', 'works', 'good', 'best', 'recommend', 'price', 'like', 'excellent', 'nice']
CV Score:  0.6288888888888888
Score on Test Set:  0.38
Confusion Matrix: 
46 34
4 16
Sensitivity:  0.32
Specificity:  0.92
['not', 'of', 'that', 'only', 'then', 'too', 'and', 'you', 'waste', 'if', 'completely', 'had', 'works', 'buy', 'difficult', 'make', 'old', 'plug', 'anything', 'disappointed', 'price', 'bad', 'best', 'after', 'service', 'buying', 'blackberry', 'came', 'customer', 'being', 'cases', 'plugged', 'color', 'definitely', 'there', 'finally', 'pull', 'good', 'will', 'worst', 'reviews', 'return', 'job', 'hate', 'keyboard', 'wasted', 'area', 'pretty', 'enough', 'any', 'forced', 'hours', 'consumer', 'razr', 'below']
CV Score:  0.78
Score on Test Set:  0.39
Confusion Matrix: 
22 11
28 39
Sensitivity:  0.78
Specificity:  0.44


For comparison, the old results are as follows, on a test set that is 20% of the total data, with a random seed of 42:

```
['good', 'quality', 'recommend', 'excellent', 'best', 'great'] 0.405
Confusion Matrix: 
89 82
4 25
Sensitivity:  0.2336448598130841
Specificity:  0.956989247311828
['good', 'works', 'quality', 'recommend', 'price', 'best'] 0.355
Confusion Matrix: 
90 68
3 39
Sensitivity:  0.3644859813084112
Specificity:  0.967741935483871
['not', 'get', 'buy', 'me', 'right', 'do', 'last', 'return', 'service', 'terrible', 'too', 'ask', 'bad', 'cannot'] 0.255
Confusion Matrix: 
44 4
49 103
Sensitivity:  0.9626168224299065
Specificity:  0.4731182795698925
['not', 'get', 'buy', 'me', 'right', 'do', 'last', 'return', 'service', 'terrible'] 0.28
Confusion Matrix: 
40 5
53 102
Sensitivity:  0.9532710280373832
Specificity:  0.43010752688172044
```

Cross validation appears to encourage the classifier to use more words, and it gives a slightly better balance between sensitivity and specificity. However, it does not match the earlier performance, although the two sets of results are measured on different test sets.

TODO: Write an optimizer that balances sensitivity and specificity.
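
One possible starting point, sketched under assumptions rather than implemented in this notebook: scikit-learn's `'balanced_accuracy'` scorer is the mean of sensitivity and specificity, so passing it to `cross_val_score` would make a greedy search favour classifiers that do well on both classes. The synthetic `X` and `y` below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Balanced accuracy on made-up labels: sensitivity is 2/4 and
# specificity is 3/4, so the balanced accuracy is their mean.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])
balanced = balanced_accuracy_score(y_true, y_pred)
print(balanced)

# Synthetic boolean features and labels, for illustration only.
rng = np.random.RandomState(42)
X = rng.rand(200, 5) > 0.5
y = (X[:, 0] | (rng.rand(200) > 0.8)).astype(int)

# Same cross_val_score call shape as in greedy_tuning, but scored by
# balanced accuracy instead of the default plain accuracy.
scores = cross_val_score(BernoulliNB(), X, y, cv=5,
                         scoring='balanced_accuracy')
print(scores.mean())
```

In `greedy_tuning`, this would amount to adding `scoring='balanced_accuracy'` to the existing `cross_val_score` call. Whether that improves the hold-out results here is untested.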

In [46]:
spec_improvers = ['good', 'quality', 'recommend', 'excellent', 'best', 'great']

In [52]:
for imp in spec_improvers:
    print(imp, word_series[imp])
    print(amazon_raw.groupby(imp)['score'].mean())
    
# Idea: a greedy tuner that penalizes imbalance between
# specificity and sensitivity.

good 77
good
False    0.486316
True     0.760000
Name: score, dtype: float64
quality 49
quality
False    0.495346
True     0.636364
Name: score, dtype: float64
recommend 26
recommend
False    0.493852
True     0.750000
Name: score, dtype: float64
excellent 27
excellent
False    0.495968
True     1.000000
Name: score, dtype: float64
best 23
best
False    0.492386
True     1.000000
Name: score, dtype: float64
great 97
great
False    0.483971
True     0.969697
Name: score, dtype: float64
