# Model 1 - Custom Naive Bayes Model

The first model I will train for this use case is a Naive Bayes model built from scratch, as taught in the DeepLearning.AI NLP specialization. I will refrain from looking at the provided code and instead try to code the model based solely on the theory. Afterwards, I will compare its accuracy and performance with a random baseline and two models from the scikit-learn library (including a Naive Bayes classifier) to decide on a final model. I will then try to improve the accuracy of the best model by analyzing its errors and performing hyperparameter tuning (if applicable).

In [1]:
# import required libraries
import pandas as pd
import numpy as np

import random

import json
from os import path

In [2]:
# define the path for the processed datasets
PATH = "data/processed/"

In [3]:
# define a random seed for repeatability
random.seed(50)

In [4]:
# read the training dataset
hp_sentences = pd.read_csv(f"{PATH}training_df.csv")

In [5]:
# show the first 5 rows of the training dataset
hp_sentences.head()

Unnamed: 0,sentence,book
0,A wild-looking old woman dressed all in green ...,1
1,Harry was thinking about this time yesterday a...,1
2,"He had been down at Hagrid’s hut, helping him ...",1
3,"“We’re looking for a big, old-fashioned one — ...",1
4,I forbid you to tell the boy anything!” A brav...,1


## Pre-Process Sentences
The first step is to pre-process all the sentences by completing the following tasks:
1. Lowercase every word
2. Remove punctuation
3. Remove digits
4. Tokenize the sentence
5. Remove stop words
6. Stem each word

The function lives in a separate `utils` module so that every notebook applies the same pre-processing.
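The six steps above can be sketched as follows. This is only an illustration, not the actual `process_sentence` from `utils`: the name `process_sentence_sketch`, the tiny stop-word set and the `crude_stem` helper are all assumptions made here, and a real pipeline would more likely use a full stop-word list and a proper stemmer such as Porter's.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); a real pipeline would use a full list.
STOP_WORDS = {"i", "a", "an", "the", "and", "but", "was", "he", "she", "it",
              "at", "that", "this", "when", "about", "all", "here"}

# Curly quotes, dashes and the ellipsis character, which string.punctuation lacks.
EXTRA_PUNCTUATION = "\u201c\u201d\u2018\u2019\u2014\u2013\u2026"


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_sentence_sketch(sentence):
    """Apply the six pre-processing steps and return a list of tokens."""
    text = sentence.lower()                                    # 1. lowercase
    table = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)
    text = text.translate(table)                               # 2. remove punctuation
    text = re.sub(r"\d+", "", text)                            # 3. remove digits
    tokens = text.split()                                      # 4. tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]        # 5. remove stop words
    return [crude_stem(t) for t in tokens]                     # 6. stem each word


print(process_sentence_sketch("“I don’t care,” said Harry, looking at the 3 owls."))
```

Removing punctuation before tokenizing is why contractions collapse into single tokens such as `dont`.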

In [6]:
# import sentence preprocessing function
from utils import process_sentence

In [7]:
# validate the preprocessing function on 20 random sentences

# generate a list of 20 indices from the dataframe
indices = random.sample(list(hp_sentences.index), 20)

# iterate through the random indices
for i, index in enumerate(indices):
    
    # print the raw sentence and its processed version
    raw_sentence = hp_sentences["sentence"].loc[index]
    print(f"Sentence #{i+1}:")
    print(raw_sentence) 
    print(process_sentence(raw_sentence), "\n\n")

Sentence #1:
Wormy was here last weekend, I thought he seemed down, but that was probably the news about the McKinnons; I cried all evening when I heard.
['wormi', 'last', 'weekend', 'thought', 'seem', 'probabl', 'news', 'mckinnon', 'cri', 'even', 'heard'] 


Sentence #2:
“I don’t care ... I’ll breathe freely again when this tournament’s over, and that’s not until June.
['dont', 'care', 'ill', 'breath', 'freeli', 'tournament', 'that', 'june'] 


Sentence #3:
If only Arthur could have got us cars from the Ministry again ... but Fudge wouldn’t let him borrow so much as an empty ink bottle these days... How Muggles can stand traveling without magic ...” But the great black dog gave a joyful bark and gamboled around them, snapping at pigeons, and chasing its own tail.
['arthur', 'could', 'got', 'us', 'car', 'ministri', 'fudg', 'wouldnt', 'let', 'borrow', 'much', 'empti', 'ink', 'bottl', 'day', 'muggl', 'stand', 'travel', 'without', 'magic', 'great', 'black', 'dog', 'gave', 'joy', 'bark', '

## Create Frequency Dictionary
The next step is to create a frequency dictionary that shows how many times each word appears in each book. I also count the total number of words (i.e. tokens) that appear in each book in order to calculate the probabilities using Bayes' theorem. To speed up the process when making changes to the model and when reading the model in other notebooks, I save the generated dictionaries to JSON files.
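To make the layout of the two dictionaries concrete, here is a toy example that uses the same `token + book number` key scheme as `create_freq_dict`. The rows and the names `toy_rows`, `freq_toy` and `book_counts_toy` are invented for illustration.

```python
# Toy rows of (processed tokens, book number); the tokens are made up.
toy_rows = [
    (["harri", "wand", "harri"], 1),
    (["harri", "dobbi"], 2),
]

freq_toy = {}         # token + book number -> occurrences of the token in that book
book_counts_toy = {}  # book number (as a string) -> total tokens in that book

for tokens, book_num in toy_rows:
    for token in tokens:
        # count every token towards the total for its book
        book_counts_toy[str(book_num)] = book_counts_toy.get(str(book_num), 0) + 1

        # count the token under its combined token + book key
        key = token + str(book_num)
        freq_toy[key] = freq_toy.get(key, 0) + 1

print(freq_toy)         # {'harri1': 2, 'wand1': 1, 'harri2': 1, 'dobbi2': 1}
print(book_counts_toy)  # {'1': 3, '2': 2}
```

Because digits are stripped during pre-processing, appending the book number to a token cannot collide with another token. Keeping the book numbers as strings also matches what comes back from JSON, where object keys are always strings.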

In [8]:
def create_freq_dict(df, process_sentence=process_sentence):
    """
    Creates a frequency dictionary based on the sentences and the book in which they appear.
    
        Parameters:
            df (dataframe): dataframe with the sentences from the Harry Potter books
            process_sentence (function): function that turns a raw sentence into a list of tokens
            
        Returns:
            freq_dict (dict): number of times each word appears in each book, keyed by word + book number
            book_counts (dict): total number of tokens in each book, keyed by book number as a string
    """

    # initialize the dictionaries that will hold the word frequencies and number of words per book
    freq_dict = {}
    book_counts = {}
    
    # iterate through the rows in the dataframe
    for i in df.index:
        
        # store the book number and processed sentence for the row (label-based lookup)
        book_num = df["book"].loc[i]
        sentence = process_sentence(df["sentence"].loc[i])
        
        # iterate through the processed tokens for the row
        for token in sentence:
            # add 1 to count of words in book
            book_counts[str(book_num)] = book_counts.get(str(book_num), 0) + 1
            
            # add 1 to the existing frequency count for the word or add it to the dictionary if not already there
            freq_dict[token+str(book_num)] = freq_dict.get(token+str(book_num), 0) + 1
    
    # return both dictionaries with the word frequencies and the book counts
    return freq_dict, book_counts

In [9]:
# if the dictionaries were not already saved to disk
if not (path.exists(f"{PATH}freq_dict.json") and path.exists(f"{PATH}book_counts_dict.json")):
    
    # generate the frequency dictionary and book counts dictionary
    freq, book_counts = create_freq_dict(hp_sentences)
    
    # save the frequency dictionary as a JSON file
    with open(f"{PATH}freq_dict.json", "w") as freq_dict_file:
        json.dump(freq, freq_dict_file)
    
    # save the book counts dictionary as a JSON file
    with open(f"{PATH}book_counts_dict.json", "w") as book_counts_dict_file:
        json.dump(book_counts, book_counts_dict_file)
        
# read the frequency JSON file as a dictionary
with open(f"{PATH}freq_dict.json", "r") as freq_dict_file:
    freq = json.load(freq_dict_file)

# read the book counts JSON file as a dictionary
with open(f"{PATH}book_counts_dict.json", "r") as book_counts_dict_file:
    book_counts = json.load(book_counts_dict_file)

## Train Model to Predict Book
Using the frequency dictionary, it is possible to create a function that predicts the book in which a sentence appears. The sentence is first processed into standardized tokens. For each book, the prior probability that a random sentence belongs to that book is then multiplied by the probability of each token given that book. The book with the highest resulting score becomes the prediction. The probability of a token given a book is estimated by dividing the number of times the token appears in the book by the total number of tokens in the book.
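Written out, the rule described above looks like this. The symbols are introduced here for illustration only: $b$ is a book, $w_1, \dots, w_n$ are the processed tokens of the sentence, $\mathrm{count}(w, b)$ is the number of times token $w$ appears in book $b$, $N_b$ is the total number of tokens in book $b$, $S_b$ is the number of sentences in book $b$, and $S$ is the total number of sentences.

```latex
\hat{b} = \arg\max_{b \in \{1, \dots, 7\}} \; P(b) \prod_{i=1}^{n} P(w_i \mid b),
\qquad
P(w_i \mid b) = \frac{\mathrm{count}(w_i, b)}{N_b},
\qquad
P(b) = \frac{S_b}{S}
```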

As the name of the method suggests, this is a naive way of predicting the book: it assumes that the tokens in a sentence occur independently of one another given the book, which is not true of natural language. However, it should give us a good enough baseline to compare with other models.
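The independence assumption is what lets the score for a book factor into a simple product of per-token probabilities. The toy example below works through that product for two made-up books. All counts, priors and names (`toy_freq`, `toy_book_counts`, `toy_priors`, `score_books`) are invented for illustration and are not taken from the real data.

```python
# Toy counts using the token + book number key scheme; all numbers are invented.
toy_freq = {"wand1": 2, "owl1": 1, "wand2": 1, "elf2": 3}
toy_book_counts = {"1": 10, "2": 8}   # total tokens per book
toy_priors = {1: 0.5, 2: 0.5}         # share of sentences per book


def score_books(tokens, freq, book_counts, priors):
    """Return prior * product of per-token probabilities for each book."""
    scores = {}
    for book, prior in priors.items():
        score = prior
        for token in tokens:
            # estimated probability of the token given the book (no smoothing)
            score *= freq.get(token + str(book), 0) / book_counts[str(book)]
        scores[book] = score
    return scores


scores = score_books(["wand", "owl"], toy_freq, toy_book_counts, toy_priors)
print(scores)                       # book 1: 0.5 * 2/10 * 1/10, book 2: 0.5 * 1/8 * 0/8
print(max(scores, key=scores.get))  # the book with the highest score
```

Note that "owl" never appears in toy book 2, so that book's score drops to zero no matter what the other tokens are. With unsmoothed estimates like these, a single unseen token rules a book out entirely.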

In [10]:
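def predict_book(df, sentence, freq, book_counts, process_sentence=process_sentence):
    """
    Predicts the book in which a sentence appears using the Naive Bayes technique.
    
    Parameters:
        df (dataframe): dataframe with the sentences from the Harry Potter books, used to compute the prior probability of each book
        sentence (string): sentence from a Harry Potter book
        freq (dict): number of times each word appears in each book, keyed by word + book number
        book_counts (dict): total number of tokens in each book, keyed by book number as a string
        process_sentence (function): function that turns a raw sentence into a list of tokens
        
    Returns:
        book (integer in the range 1-7): Harry Potter book in which the sentence is predicted to appear
    """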
    
    # get the list of processed tokens for the sentence
    tokens = process_sentence(sentence)
    
    # initialize the dictionary that will hold the probability of the sentence appearing in each book
    prob_books = {}
    
    # iterate through the seven book possibilities
    for book in range(1, 8):
        
        # store the total number of sentences in the dataframe and the number of sentences in the iterated book
        total_sentences = len(df)
        book_sentences = len(df[df["book"] == book])
        
        # calculate the probability of a random sentence appearing in the iterated book
        prob_books[book] = book_sentences / total_sentences
        
        # iterate through the tokens in the processed sentence
        for token in tokens:
            
            # calculate the probability that the word appears in the iterated book
            token_book_prob = freq.get(token + str(book), 0) / book_counts[str(book)]
            
            # multiply the running probability of the sentence appearing in the iterated book 
            # by the probability of the word appearing in the book
            prob_books[book] *= token_book_prob
    
    # return the book with the highest probability for the given sentence
    return max(prob_books, key=prob_books.get)

In [11]:
# create new column in dataframe with the predicted book
hp_sentences["predicted_book"] = hp_sentences["sentence"].apply(lambda sentence: predict_book(hp_sentences, sentence, freq, book_counts))

In [12]:
# import accuracy calculation function
from utils import calc_accuracy

In [13]:
# calculate accuracy of the classification model on the training dataset
calc_accuracy(hp_sentences["book"], hp_sentences["predicted_book"])

0.6355455282904077

An accuracy of 63.6% on the training data is weak for this type of problem, as expected. The model should perform even worse on the validation set.

## Test Model on Validation Set
Now that we have trained our model and confirmed that it works, let's measure its accuracy on unseen data from our validation set.

In [14]:
# import validation dataset and see first five rows
hp_sentences_val = pd.read_csv(f"{PATH}validation_df.csv")
hp_sentences_val.head()

Unnamed: 0,sentence,book
0,“She obviously makes more of an effort if you’...,1
1,We’ve eaten all our food and you still seem to...,1
2,"Please cheer up, Hagrid, we saved the Stone, i...",1
3,He gave his father a sharp tap on the head wit...,1
4,He kept threatening to tell her what really bi...,1


In [15]:
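# create new column in dataframe with the predicted book
# note: predict_book computes the book priors from the dataframe it receives, here the validation set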
hp_sentences_val["predicted_book"] = hp_sentences_val["sentence"].apply(lambda sentence: predict_book(hp_sentences_val, sentence, freq, book_counts))

In [16]:
# calculate accuracy of the classification model on the validation dataset
calc_accuracy(hp_sentences_val["book"], hp_sentences_val["predicted_book"])

0.37163900967255303

Our model achieved a final accuracy of 37.2%, meaning that it predicts the correct book for only slightly more than one sentence in three. This is a poor result, but it is still better than our random baseline of 1/7 = 14.3%. The next step is to train two other models using the scikit-learn library and to compare the results of all three models to find the most accurate.
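As a quick sanity check on the 1/7 figure, the sketch below simulates uniform random guessing over seven books. The `accuracy` helper is a hypothetical stand-in written for this example (the actual `calc_accuracy` in `utils` is not shown here), and the labels are synthetic rather than drawn from the real dataset.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (hypothetical helper)."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


rng = random.Random(50)  # local generator, so the global random seed is untouched
n = 70_000               # arbitrary sample size for the simulation

# Synthetic labels; uniform guessing is right 1/7 of the time for any label mix.
true_books = [rng.randint(1, 7) for _ in range(n)]
random_guesses = [rng.randint(1, 7) for _ in range(n)]

baseline = accuracy(true_books, random_guesses)
print(f"random baseline: {baseline:.3f} (expected about {1/7:.3f})")
```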

# Notebook Complete!