### Problem Statement
It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate by engineering new features, removing poor features, or tuning parameters. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:
<br>Do any of your classifiers seem to overfit?
<br>Which seem to perform the best? Why?
<br>Which features seemed to be most impactful to performance?
### Outline
Version 1: Original
<br>Version 2: Words in both sentiments
<br>Version 3: Only positive adjectives
<br>Version 4: Only positive verbs
<br>Version 5: Only positive adverbs
<br>Write-up

In [1]:
import random

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

# Tab-separated file: one review and its sentiment label per line
data_path = "data/imdb_labelled.txt"
reviews_raw = pd.read_csv(data_path, delimiter='\t', header=None)
reviews_raw.columns = ['review', 'positive']

In [2]:
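def keyword_features(df, keywords):
    """Return a copy of df with one boolean column per keyword."""
    df_keywords = df.copy()
    for key in keywords:
        # Pad the key with spaces so that we match the whole word rather than
        # a substring. This misses a keyword at the very start or end of a
        # review, or one that is directly followed by punctuation.
        df_keywords[str(key)] = df_keywords.review.str.contains(
            ' ' + str(key) + ' ',
            case=False
        )
    return df_keywords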

def evaluate_version(df, keywords):
    random.seed(120)
    bnb = BernoulliNB()
    data = df[keywords]
    target = df['positive']
    total_points = data.shape[0]
    # baseline: `baseline` counts the positive reviews, so the printed value
    # 1 - baseline / total_points is the share of negative reviews
    baseline = (target == True).sum()
    # score on the same rows the model was fit on (no hold-out)
    train_test_score = bnb.fit(data, target).score(data, target)
    print("Baseline positive: {}".format(round(1 - baseline / total_points, 3)))
    print("Train = Test Score: {}".format(round(train_test_score, 3)))
    # folds
    random_array = list(range(total_points))
    random.shuffle(random_array)
    random_series = pd.Series(random_array)
    folds = 10
    fold_scores = np.zeros(folds)
    fold_sens = np.zeros(folds)
    fold_spec = np.zeros(folds)
    fold_size = int(total_points / folds)
    
    # cross validation
    for fold in range(folds):
        test_vals = list(random_series[fold * fold_size:(fold + 1) * fold_size])
        train_vals = list(random_series)
        train_vals = list(set(train_vals) - set(test_vals))
        fold_X_train = data.iloc[train_vals]
        fold_X_test = data.iloc[test_vals]
        fold_Y_train = target.iloc[train_vals]
        fold_Y_test = target.iloc[test_vals]
        fold_fit = bnb.fit(fold_X_train, fold_Y_train)
        fold_pred = bnb.predict(fold_X_test)
        fold_scores[fold] = fold_fit.score(fold_X_test, fold_Y_test)
        # confusion matrix: rows are true labels, columns are predictions, with
        # the negative class first, so the layout is [[TN, FP], [FN, TP]]
        fold_confusion_matrix = confusion_matrix(fold_Y_test, fold_pred)
        # Row 0 gives TN / (TN + FP), the true negative rate, stored under
        # 'Sensitivity'. Row 1 gives FN / (FN + TP), the false negative rate,
        # stored under 'Specificity'.
        fold_sens[fold] = round(fold_confusion_matrix[0][0] / (fold_confusion_matrix[0][0] + fold_confusion_matrix[0][1]), 3)
        fold_spec[fold] = round(fold_confusion_matrix[1][0] / (fold_confusion_matrix[1][0] + fold_confusion_matrix[1][1]), 3)
    fold_df = pd.DataFrame(data=np.transpose([fold_scores, fold_sens, fold_spec]), columns=['Score', 'Sensitivity', 'Specificity'])
    print(fold_df)
    print("Means\n{}".format(fold_df.mean()))
    print("Standard Deviations\n{}".format(fold_df.std()))
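
For comparison with the per-fold rates computed above, the textbook definitions are sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The following is a minimal plain-Python sketch of those definitions, assuming labels are 0/1 with 1 as the positive class. The helper name `sensitivity_specificity` and the toy labels are hypothetical and not part of the notebook. For 0/1 labels, scikit-learn's `confusion_matrix` arranges the same four counts as `[[TN, FP], [FN, TP]]`.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives flagged positive
    # Guard against a fold that happens to contain only one class.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels, made up for illustration (not taken from the review data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.25 0.75
```

In this toy case the classifier finds one of four positives and three of four negatives, which is the pattern of a model that predicts negative most of the time.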

### Version 1: Original

In [3]:
keywords_v1 = ['superb', 'amazing', 'interesting', 'good', 'great', 'best', 'love', 'perfect', 'masterpiece', 'beautiful', 'excellent', 'wonderful', 'art', 'like', 'liked', 'enjoy', 'enjoyed']

reviews_v1 = keyword_features(reviews_raw, keywords_v1)

evaluate_version(reviews_v1, keywords_v1)

Baseline positive: 0.484
Train = Test Score: 0.618
      Score  Sensitivity  Specificity
0  0.581081        0.933        0.659
1  0.594595        0.971        0.725
2  0.540541        1.000        0.791
3  0.689189        0.907        0.613
4  0.635135        0.919        0.649
5  0.662162        1.000        0.641
6  0.581081        0.923        0.800
7  0.662162        1.000        0.641
8  0.608108        0.865        0.649
9  0.540541        0.917        0.816
Means
Score          0.609459
Sensitivity    0.943500
Specificity    0.698400
dtype: float64
Standard Deviations
Score          0.051537
Sensitivity    0.046739
Specificity    0.077280
dtype: float64


### Version 2: Words in both sentiments
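Our first iteration gives a mean score of 0.609, with means of 0.944 in the 'Sensitivity' column and 0.698 in the 'Specificity' column. As computed in `evaluate_version`, the first is the share of negative reviews labelled negative and the second is the share of positive reviews labelled negative, so the classifier rarely flags a negative review as positive but misses most of the positive ones. Some keywords in our list may still appear in negative reviews as well and blur the signal. For version 2, let's take out the words that might be present in both negative and positive sentiments ('interesting', 'good', 'art' and 'like').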
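
One way to check which keywords occur in both classes is to count, per keyword, how many negative and how many positive reviews contain it. The sketch below is plain Python so that it runs without the data file. The function `keyword_counts_by_class` and the toy reviews are hypothetical, and the match uses the same space-padded, case-insensitive test as `keyword_features`.

```python
def keyword_counts_by_class(reviews, keywords):
    """Count, per keyword, the negative and positive reviews that contain it.

    `reviews` is an iterable of (text, label) pairs, with label 1 for positive.
    """
    counts = {key: {"negative": 0, "positive": 0} for key in keywords}
    for text, positive in reviews:
        lowered = text.lower()
        for key in keywords:
            # Space-padded match, mirroring keyword_features.
            if " " + key + " " in lowered:
                label = "positive" if positive else "negative"
                counts[key][label] += 1
    return counts


# Toy reviews, made up for illustration (not taken from the data set).
toy_reviews = [
    ("I did not like this movie at all.", 0),
    ("Not a good film, sadly.", 0),
    ("A great cast and a good story here.", 1),
    ("We really like the ending a lot.", 1),
]
counts = keyword_counts_by_class(toy_reviews, ["like", "good", "great"])
for key, by_class in counts.items():
    print(key, by_class)
```

A keyword whose negative count is close to its positive count carries little information about sentiment and is a candidate for removal. On the actual DataFrame, a similar tally could presumably be obtained by grouping the boolean keyword columns by `positive` and summing them.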

In [4]:
keywords_v2 = ['superb', 'amazing', 'great', 'best', 'love', 'perfect', 'masterpiece', 'beautiful', 'excellent', 'wonderful', 'liked', 'enjoy', 'enjoyed']

reviews_v2 = keyword_features(reviews_raw, keywords_v2)

evaluate_version(reviews_v2, keywords_v2)

Baseline positive: 0.484
Train = Test Score: 0.588
      Score  Sensitivity  Specificity
0  0.527027        0.933        0.750
1  0.581081        1.000        0.775
2  0.527027        1.000        0.814
3  0.675676        0.953        0.710
4  0.608108        0.946        0.730
5  0.594595        1.000        0.769
6  0.581081        0.949        0.829
7  0.581081        1.000        0.795
8  0.594595        0.919        0.730
9  0.527027        0.972        0.895
Means
Score          0.57973
Sensitivity    0.96720
Specificity    0.77970
dtype: float64
Standard Deviations
Score          0.045694
Sensitivity    0.031272
Specificity    0.055730
dtype: float64


### Version 3: Only positive adjectives

In [5]:
keywords_v3 = ['superb', 'amazing', 'great', 'perfect', 'masterpiece', 'beautiful', 'excellent', 'wonderful', 'outstanding', 'exceptional', 'marvelous', 'magnificent', 'preeminent', 'first-rate', 'terrific', 'tremendous', 'fantastic', 'fabulous']

reviews_v3 = keyword_features(reviews_raw, keywords_v3)

evaluate_version(reviews_v3, keywords_v3)

Baseline positive: 0.484
Train = Test Score: 0.56
      Score  Sensitivity  Specificity
0  0.459459        0.967        0.886
1  0.527027        1.000        0.875
2  0.527027        1.000        0.814
3  0.675676        0.977        0.742
4  0.594595        1.000        0.811
5  0.554054        1.000        0.846
6  0.554054        0.974        0.914
7  0.554054        1.000        0.846
8  0.567568        0.946        0.811
9  0.500000        0.972        0.947
Means
Score          0.551351
Sensitivity    0.983600
Specificity    0.849200
dtype: float64
Standard Deviations
Score          0.057615
Sensitivity    0.019161
Specificity    0.059117
dtype: float64


### Version 4: Only positive verbs

In [6]:
keywords_v4 = ['like', 'liked', 'love', 'loved', 'enjoy', 'enjoyed', 'adore', 'adored', 'appreciate', 'appreciated', 'respect', 'respected', 'approve', 'approved']

reviews_v4 = keyword_features(reviews_raw, keywords_v4)

evaluate_version(reviews_v4, keywords_v4)

Baseline positive: 0.484
Train = Test Score: 0.524
      Score  Sensitivity  Specificity
0  0.445946        0.967        0.909
1  0.472973        0.882        0.875
2  0.472973        1.000        0.907
3  0.418919        0.023        0.032
4  0.513514        0.054        0.027
5  0.500000        0.000        0.051
6  0.459459        0.051        0.086
7  0.513514        1.000        0.923
8  0.527027        0.946        0.892
9  0.472973        0.944        0.974
Means
Score          0.47973
Sensitivity    0.58670
Specificity    0.56760
dtype: float64
Standard Deviations
Score          0.033859
Sensitivity    0.478765
Specificity    0.447322
dtype: float64


### Version 5: Only positive adverbs

In [7]:
keywords_v5 = ['superbly', 'amazingly', 'greatly', 'perfectly', 'beautifully', 'excellently', 'wonderfully', 'outstandingly', 'exceptionally', 'marvelously', 'magnificently', 'preeminently', 'terrifically', 'tremendously', 'fantastically', 'fabulously']

reviews_v5 = keyword_features(reviews_raw, keywords_v5)

evaluate_version(reviews_v5, keywords_v5)

Baseline positive: 0.484
Train = Test Score: 0.516
      Score  Sensitivity  Specificity
0  0.594595        0.000        0.000
1  0.527027        0.000        0.025
2  0.540541        0.000        0.070
3  0.405405        0.000        0.032
4  0.500000        0.000        0.000
5  0.527027        0.029        0.026
6  0.472973        0.000        0.000
7  0.527027        0.000        0.000
8  0.500000        0.000        0.000
9  0.500000        0.000        0.026
Means
Score          0.509459
Sensitivity    0.002900
Specificity    0.017900
dtype: float64
Standard Deviations
Score          0.048952
Sensitivity    0.009171
Specificity    0.022845
dtype: float64
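
With all five versions evaluated, the overfitting question from the problem statement can be checked by comparing each version's score on its own training data ("Train = Test Score") with its mean cross-validated score. The sketch below copies both numbers from the outputs above. The dictionary name `reported` is only for illustration, and a small gap is a rough indicator rather than proof that a model generalizes.

```python
# (train-on-all score, mean 10-fold score), copied from the printed outputs.
reported = {
    "Version 1": (0.618, 0.609459),
    "Version 2": (0.588, 0.579730),
    "Version 3": (0.560, 0.551351),
    "Version 4": (0.524, 0.479730),
    "Version 5": (0.516, 0.509459),
}

# A large positive gap would suggest the model fits the training data
# better than it generalizes to held-out folds.
gaps = {name: round(train - cv, 3) for name, (train, cv) in reported.items()}
for name, gap in gaps.items():
    print(name, gap)
```

The gap stays below 0.01 for Versions 1, 2, 3 and 5 and is largest for Version 4 at about 0.044, so none of the versions shows strong overfitting. With so few boolean features, the models are more likely to underfit.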


### Write-up
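The original version, which combines several different ways of expressing praise, remains the most accurate, with a mean cross-validated score of 0.609. Its keywords are mostly positive and mix adjectives, nouns and verbs. When reading the tables, keep in mind how `evaluate_version` fills the two rate columns: 'Sensitivity' comes from row 0 of the confusion matrix (negative reviews labelled negative) and 'Specificity' from row 1 (positive reviews labelled negative). For Version 1 this means that about 94% of negative reviews are classified correctly while about 70% of positive reviews are missed, most likely because many positive reviews contain none of the keywords.
<br>Every later version scores lower than the original, but each one teaches something different. Removing the ambiguous words (Version 2) lowers the mean score from 0.609 to 0.580 and raises both rate columns (0.944 to 0.967 and 0.698 to 0.780): fewer negative reviews are flagged as positive, but more positive reviews are missed.
<br>The adjective-only version has the highest means in both columns (0.984 and 0.849). This is probably because strong positive adjectives almost never occur in negative reviews, but they are also absent from most positive reviews, so the model labels nearly everything negative. The verb-only version has the lowest mean score (0.480), and its rate columns swing between nearly 0 and nearly 1 from fold to fold (standard deviations of about 0.48 and 0.45). This is most likely due to the scarcity of these verbs in the reviews: when almost no review contains a keyword, the default prediction can flip between folds.
<br>The adverb-only version is the least useful. Its mean score of 0.509 is above the printed baseline of 0.484, but that baseline is the share of negative reviews. About 51.6% of the reviews are positive, and both rate columns are almost zero, which means the model predicts positive for nearly every review and does no better than always guessing positive.
<br>If I were to use this model in the wild, I would choose Version 2. It gives up about 0.03 in mean score relative to Version 1 in exchange for fewer false positives (Type 1 errors), at the price of more false negatives (Type 2 errors).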