To-do:
1. Add random baseline
2. Add interpretation of confusion matrix and classification report
3. Compare custom and pre-built naive bayes (count same, visualize per book, correlation)
4. Visualize accuracy/book for each model
5. Add final interpretation

# Evaluate Models

This notebook compares three models on two criteria: their accuracy in predicting which book a sentence belongs to, and their efficiency, measured as the time needed to make all predictions. The models are compared on the validation dataset. Once a final model is chosen, hyperparameter tuning and error analysis will be used to try to improve its score. Finally, the improved model will be evaluated on the unseen testing dataset to obtain a final accuracy metric.

In [1]:
# import required libraries
import pandas as pd
import numpy as np

import json
from time import time

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics

## Import Training, Validation and Testing Datasets

In [2]:
# define the path for the processed datasets
PATH = "data/processed/"

# read the training dataset
hp_sentences_train = pd.read_csv(f"{PATH}training_df.csv")

# read the validation dataset
hp_sentences_val = pd.read_csv(f"{PATH}validation_df.csv")

# read the testing dataset
hp_sentences_test = pd.read_csv(f"{PATH}testing_df.csv")

In [3]:
# show the first 5 rows of the training dataset
hp_sentences_train.head()

Unnamed: 0,sentence,book
0,A wild-looking old woman dressed all in green ...,1
1,Harry was thinking about this time yesterday a...,1
2,"He had been down at Hagrid’s hut, helping him ...",1
3,"“We’re looking for a big, old-fashioned one — ...",1
4,I forbid you to tell the boy anything!” A brav...,1


In [4]:
# show the first 5 rows of the validation dataset
hp_sentences_val.head()

Unnamed: 0,sentence,book
0,“She obviously makes more of an effort if you’...,1
1,We’ve eaten all our food and you still seem to...,1
2,"Please cheer up, Hagrid, we saved the Stone, i...",1
3,He gave his father a sharp tap on the head wit...,1
4,He kept threatening to tell her what really bi...,1


In [5]:
# show the first 5 rows of the testing dataset
hp_sentences_test.head()

Unnamed: 0,sentence,book
0,"Excuse me, I’m a prefect!” “How could a troll ...",1
1,Harry wasn’t sure he could explain.,1
2,There was a tabby cat standing on the corner o...,1
3,"Peeves threw the chalk into a bin, which clang...",1
4,It didn’t so much as quiver when a car door sl...,1


## Re-Train Models
Following the steps described in the other two notebooks, I will re-create the three models used in this notebook.

### Custom Naive Bayes

In [6]:
# read the frequency JSON file as a dictionary
with open(f"{PATH}freq_dict.json", "r") as freq_dict_file:
    freq = json.load(freq_dict_file)

# read the book counts JSON file as a dictionary
with open(f"{PATH}book_counts_dict.json", "r") as book_counts_dict_file:
    book_counts = json.load(book_counts_dict_file)

In [7]:
# import sentence preprocessing function
from utils import process_sentence

In [8]:
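def predict_book(df, sentence, freq, book_counts, process_sentence=process_sentence):
    """
    Predicts the book in which a sentence appears using the Naive Bayes technique.
    
    Parameters:
        df (dataframe): dataframe of labeled Harry Potter sentences, used to compute each book's prior probability
        sentence (string): sentence from a Harry Potter book
        freq (dict): frequencies keyed by a token concatenated with a book number (token + str(book))
        book_counts (dict): per-book counts, keyed by the book number as a string, used to normalize the token frequencies
        process_sentence (function): converts a sentence into a list of processed tokens
        
    Returns:
        book (integer in the range 1-7): Harry Potter book in which the sentence is predicted to appear
    """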
    
    # get the list of processed tokens for the sentence
    tokens = process_sentence(sentence)
    
    # initiate dictionary that will hold the probability of the sentence appearing in each book
    prob_books = {}
    
    # iterate through the seven book possibilities
    for book in range(1, 8):
        
        # store the total number of sentences in the dataframe and the number of sentences in the iterated book
        total_sentences = len(df)
        book_sentences = len(df[df["book"] == book])
        
        # calculate the probability of a random sentence appearing in the iterated book
        prob_books[book] = book_sentences / total_sentences
        
        # iterate through the tokens in the processed sentence
        for token in tokens:
            
            # calculate the probability that the word appears in the iterated book
            token_book_prob = freq.get(token + str(book), 0) / book_counts[str(book)]
            
            # multiply the running probability of the sentence appearing in the iterated book 
            # by the probability of the word appearing in the book
            prob_books[book] *= token_book_prob
    
    # return the book with the highest probability for the given sentence
    return max(prob_books, key=prob_books.get)

In [9]:
# create new column in dataframe with the predicted book and measure how long it took in seconds
start_time = time()
hp_sentences_val["CustomNB"] = hp_sentences_val["sentence"].apply(lambda sentence: predict_book(hp_sentences_val, sentence, freq, book_counts))
elapsed_time_customNB = time() - start_time
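
The custom predictor above multiplies raw probabilities, so a single token that never appears in a book sets that book's score to zero, and long sentences can underflow toward zero. A common alternative is add-one (Laplace) smoothing combined with summing log-probabilities. The sketch below is not the method used in this notebook. It assumes that `book_counts` holds the total token count per book, that the priors are computed once beforehand (for example from the training dataframe), and that `vocab_size` is the number of distinct tokens. The function name `predict_book_log` and the toy dictionaries are made up for illustration.

```python
import math

def predict_book_log(tokens, freq, book_counts, priors, vocab_size):
    """Return the book with the highest smoothed log-posterior for a list of tokens."""
    scores = {}
    for book, prior in priors.items():
        # start from the log prior instead of the raw prior
        score = math.log(prior)
        for token in tokens:
            # same key scheme as the notebook: token concatenated with the book number
            count = freq.get(token + str(book), 0)
            # add-one smoothing keeps unseen tokens from zeroing out the score
            score += math.log((count + 1) / (book_counts[str(book)] + vocab_size))
        scores[book] = score
    return max(scores, key=scores.get)

# toy data, invented for illustration only (not taken from the project files)
freq = {"wand1": 3, "wand2": 1, "dobby2": 4}
book_counts = {"1": 10, "2": 10}
priors = {1: 0.5, 2: 0.5}

# "owl" is unseen in both books; without smoothing both scores would be zero
print(predict_book_log(["dobby", "owl"], freq, book_counts, priors, vocab_size=3))
```

Passing the priors in as a dictionary also avoids filtering the dataframe once per book for every sentence, which is likely a large part of the custom model's prediction time.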

### Pre-Built Naive Bayes

In [10]:
# create a pipeline with the three steps required to train the classifier and make predictions
hp_classifier_nb = Pipeline([
    ('count_vect', CountVectorizer()), # create a word count vector
    ('freq_vect', TfidfTransformer()), # normalize the term frequencies
    ('classify', MultinomialNB()) # use a Naive Bayes multinomial classifier
])

In [11]:
# train the model on the sentences in the training dataset
hp_classifier_nb.fit(hp_sentences_train["sentence"], hp_sentences_train["book"])

Pipeline(steps=[('count_vect', CountVectorizer()),
                ('freq_vect', TfidfTransformer()),
                ('classify', MultinomialNB())])

In [12]:
# create new column in dataframe with the predicted book and measure how long it took in seconds
start_time = time()
hp_sentences_val["PrebuiltNB"] = hp_classifier_nb.predict(hp_sentences_val["sentence"])
elapsed_time_prebuiltNB = time() - start_time

### Linear SVC

In [13]:
# create a pipeline with the three steps required to train the classifier and make predictions
hp_classifier_svc = Pipeline([
    ('count_vect', CountVectorizer()), # create a word count vector
    ('freq_vect', TfidfTransformer()), # normalize the term frequencies
    ('classify', LinearSVC()) # use a Linear SVC classifier
])

In [14]:
# train the model on the sentences in the training dataset
hp_classifier_svc.fit(hp_sentences_train["sentence"], hp_sentences_train["book"])

Pipeline(steps=[('count_vect', CountVectorizer()),
                ('freq_vect', TfidfTransformer()), ('classify', LinearSVC())])

In [15]:
# create new column in dataframe with the predicted book and measure how long it took in seconds
start_time = time()
hp_sentences_val["LinearSVC"] = hp_classifier_svc.predict(hp_sentences_val["sentence"])
elapsed_time_linearSVC = time() - start_time

## Evaluate & Compare Models
Now that the models have been trained and produced predictions on our validation dataset, it is time to evaluate their results and compare their performances. I will perform the following activities to identify the model to keep and to better understand their behaviors:
1. Compare each model's accuracy and efficiency
2. Analyze each model's classification report and confusion matrix

### 1. Compare each model's accuracy and efficiency

In [16]:
# import accuracy function
from utils import calc_accuracy
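
The implementation of `calc_accuracy` lives in `utils` and is not shown in this notebook. As a rough sketch only, assuming it returns the fraction of predictions that match the true labels, an equivalent function could look like the one below. The name `calc_accuracy_sketch` is hypothetical.

```python
import pandas as pd

def calc_accuracy_sketch(y_true, y_pred):
    """Fraction of predictions that equal the true label (assumed behavior)."""
    # align both inputs by position rather than by index label
    y_true = pd.Series(y_true).reset_index(drop=True)
    y_pred = pd.Series(y_pred).reset_index(drop=True)
    return float((y_true == y_pred).mean())

# two of the four toy predictions match the true labels
acc = calc_accuracy_sketch([1, 1, 2, 3], [1, 5, 2, 7])
print(acc)
```

Under that assumption, `metrics.accuracy_score` from scikit-learn, which is already imported above, would give the same number.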

In [17]:
# show the first 5 rows of the validation dataset with the predictions from each model
hp_sentences_val.head()

Unnamed: 0,sentence,book,CustomNB,PrebuiltNB,LinearSVC
0,“She obviously makes more of an effort if you’...,1,5,5,7
1,We’ve eaten all our food and you still seem to...,1,1,5,1
2,"Please cheer up, Hagrid, we saved the Stone, i...",1,1,5,1
3,He gave his father a sharp tap on the head wit...,1,1,4,1
4,He kept threatening to tell her what really bi...,1,1,5,3


In [21]:
# create list with the accuracy metric of each model
MODEL_NAMES = ["CustomNB", "PrebuiltNB", "LinearSVC"]
accuracy_list = []

for model in MODEL_NAMES:
    accuracy_list.append(calc_accuracy(hp_sentences_val["book"], hp_sentences_val[model]))

    
# create dataframe with the accuracy and efficiency (time to predict) of each model
model_performance_df = pd.DataFrame(data = {
                                        "Accuracy": accuracy_list,
                                        "Efficiency": [elapsed_time_customNB, elapsed_time_prebuiltNB, elapsed_time_linearSVC]
                                    }, index = MODEL_NAMES)

# display the dataframe
model_performance_df

Unnamed: 0,Accuracy,Efficiency
CustomNB,0.371639,192.95371
PrebuiltNB,0.387878,0.226398
LinearSVC,0.453368,0.214321


The custom model is the least accurate, slightly behind the pre-built Naive Bayes model (37.16% vs. 38.79%), and it is by far the least efficient, having taken about 193 seconds to make all predictions. The two pre-built models took approximately the same amount of time to make their predictions (~0.2 seconds), but the Linear SVC model was the clear winner in terms of accuracy at 45.34%. Therefore, this is the model I will keep to complete the project, though I will still analyze each model's performance in more detail.
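
To put these accuracies in context, it helps to know what trivial strategies would score. The sketch below derives three simple baselines from the validation class sizes, which are copied from the support column of the classification reports in the next section. It is only arithmetic on those counts, not a trained model, and the variable names are my own.

```python
import numpy as np

# validation sentences per book (books 1-7), copied from the "support" column
support = np.array([888, 983, 1297, 2082, 2382, 1644, 1993])
total = support.sum()

# guessing one of the seven books uniformly at random
uniform_baseline = 1 / len(support)

# always predicting the most frequent book (book 5)
majority_baseline = support.max() / total

# guessing in proportion to each book's share of the sentences
stratified_baseline = float(((support / total) ** 2).sum())

print(f"uniform: {uniform_baseline:.4f}")
print(f"stratified: {stratified_baseline:.4f}")
print(f"majority: {majority_baseline:.4f}")
```

Uniform guessing gives about 14% and always predicting book 5 gives about 21%, so all three models beat these baselines, although the Naive Bayes models do so by a smaller margin than the Linear SVC.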

### 2. Analyze each model's classification report and confusion matrix

In [26]:
# print classification report for custom naive bayes
print("Custom Naive Bayes\n\n")
print("Classification Report:")
print(metrics.classification_report(hp_sentences_val["book"], hp_sentences_val["CustomNB"]))

# print confusion matrix for custom naive bayes
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(hp_sentences_val["book"], hp_sentences_val["CustomNB"]))

Custom Naive Bayes


Classification Report:
              precision    recall  f1-score   support

           1       0.15      0.39      0.22       888
           2       0.39      0.25      0.30       983
           3       0.41      0.31      0.35      1297
           4       0.47      0.38      0.42      2082
           5       0.40      0.42      0.41      2382
           6       0.40      0.32      0.36      1644
           7       0.46      0.43      0.44      1993

    accuracy                           0.37     11269
   macro avg       0.38      0.36      0.36     11269
weighted avg       0.40      0.37      0.38     11269


Confusion Matrix:
[[ 348   61  104  102  132   57   84]
 [ 203  243   95  120  161   74   87]
 [ 203   71  408  145  251   92  127]
 [ 325   85  122  791  406  148  205]
 [ 483   75  112  225 1010  209  268]
 [ 321   46   88  133  291  533  232]
 [ 363   48   76  153  284  214  855]]


In [27]:
# print classification report for pre-built naive bayes
print("Pre-Built Naive Bayes\n\n")
print("Classification Report:")
print(metrics.classification_report(hp_sentences_val["book"], hp_sentences_val["PrebuiltNB"]))

# print confusion matrix for pre-built naive bayes
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(hp_sentences_val["book"], hp_sentences_val["PrebuiltNB"]))

Pre-Built Naive Bayes


Classification Report:
              precision    recall  f1-score   support

           1       0.63      0.01      0.03       888
           2       0.74      0.03      0.06       983
           3       0.79      0.09      0.15      1297
           4       0.47      0.46      0.46      2082
           5       0.30      0.83      0.44      2382
           6       0.64      0.19      0.29      1644
           7       0.52      0.49      0.50      1993

    accuracy                           0.39     11269
   macro avg       0.58      0.30      0.28     11269
weighted avg       0.54      0.39      0.33     11269


Confusion Matrix:
[[  12    1    6  173  585   14   97]
 [   0   29    5  197  616   36  100]
 [   0    1  111  199  839   24  123]
 [   1    2    3  957  933   22  164]
 [   2    0    2  192 1984   27  175]
 [   3    4    6  155  916  307  253]
 [   1    2    7  164  800   48  971]]
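
In scikit-learn's confusion matrix, rows are the true books and columns are the predicted books. The sketch below copies the pre-built Naive Bayes matrix printed above and derives how often each book is predicted, along with the per-book recall and precision. The variable names are my own.

```python
import numpy as np

# confusion matrix for the pre-built Naive Bayes model, copied from the output above
cm_prebuilt_nb = np.array([
    [12, 1, 6, 173, 585, 14, 97],
    [0, 29, 5, 197, 616, 36, 100],
    [0, 1, 111, 199, 839, 24, 123],
    [1, 2, 3, 957, 933, 22, 164],
    [2, 0, 2, 192, 1984, 27, 175],
    [3, 4, 6, 155, 916, 307, 253],
    [1, 2, 7, 164, 800, 48, 971],
])

# column sums: how many sentences were assigned to each book
predicted_counts = cm_prebuilt_nb.sum(axis=0)
predicted_share = predicted_counts / cm_prebuilt_nb.sum()

# diagonal over row sums = recall, diagonal over column sums = precision
recall = np.diag(cm_prebuilt_nb) / cm_prebuilt_nb.sum(axis=1)
precision = np.diag(cm_prebuilt_nb) / predicted_counts

for book in range(7):
    print(f"book {book + 1}: predicted share={predicted_share[book]:.2f}, "
          f"recall={recall[book]:.2f}, precision={precision[book]:.2f}")
```

Book 5 receives 6,673 of the 11,269 predictions (about 59%), which explains its high recall (0.83) alongside its low precision (0.30), and the very low recall for books 1 to 3.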


In [28]:
# print classification report for linear SVC
print("Linear SVC\n\n")
print("Classification Report:")
print(metrics.classification_report(hp_sentences_val["book"], hp_sentences_val["LinearSVC"]))

# print confusion matrix for linear SVC
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(hp_sentences_val["book"], hp_sentences_val["LinearSVC"]))

Linear SVC


Classification Report:
              precision    recall  f1-score   support

           1       0.38      0.34      0.36       888
           2       0.45      0.31      0.37       983
           3       0.42      0.39      0.40      1297
           4       0.48      0.50      0.49      2082
           5       0.45      0.52      0.48      2382
           6       0.44      0.40      0.42      1644
           7       0.49      0.53      0.51      1993

    accuracy                           0.45     11269
   macro avg       0.44      0.43      0.43     11269
weighted avg       0.45      0.45      0.45     11269


Confusion Matrix:
[[ 305   49  102  130  130   72  100]
 [  86  306  107  128  175   82   99]
 [  97   66  509  174  211   89  151]
 [  88   85  147 1037  347  161  217]
 [  98   73  145  306 1232  236  292]
 [  63   47  111  175  326  656  266]
 [  61   52   96  205  308  207 1064]]
