# Model 1 - Custom Naive Bayes Model

The first model I will train on the training data is a custom Naive Bayes model, as taught in the DeepLearning.AI NLP specialization. Rather than looking at the provided code, I will try to implement the model based solely on the theory. Afterwards, I will compare its accuracy and performance with two models from the scikit-learn library to decide on a final model.

In [1]:
# import required libraries
import pandas as pd
import numpy as np
import random

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import string

In [2]:
PATH = "data/processed/"

In [3]:
random.seed(50)

In [4]:
hp_sentences = pd.read_csv(f"{PATH}training_df.csv")

In [5]:
hp_sentences.head()

Unnamed: 0,sentence,book
0,A wild-looking old woman dressed all in green ...,1
1,Harry was thinking about this time yesterday a...,1
2,"He had been down at Hagrid’s hut, helping him ...",1
3,"“We’re looking for a big, old-fashioned one — ...",1
4,I forbid you to tell the boy anything!” A brav...,1


## Pre-Process Sentences
The first step is to pre-process all the sentences by completing the following tasks:
1. Lowercase every word
2. Remove punctuation
3. Remove digits
4. Tokenize the sentence
5. Remove stop words
6. Stem each word

In [6]:
def process_sentence(txt):
    """
    Returns the processed version of the sentence in a list.
    
        Parameters:
            txt (str): A sentence in the form of a string
            
        Returns:
            txt_processed (list of str): A list of the processed tokens that make up the sentence
    """
    
    # lowercase every word
    txt_lower = txt.lower()
    
    # remove punctuation (including curly quotes and long dashes) and digits
    removable = set(string.punctuation + "“’”—")
    txt_punc_digit = "".join(
        char for char in txt_lower
        if char not in removable and not char.isdigit()
    )
            
    # tokenize the sentence
    txt_token = word_tokenize(txt_punc_digit)
    
    # load the stop word list once instead of once per token
    stop_words = set(stopwords.words('english'))
    
    # remove stop words and stem each word
    stemmer = PorterStemmer()
    txt_processed = []
    for token in txt_token:
        if token not in stop_words:
            txt_processed.append(stemmer.stem(token))
            
    return txt_processed

In [7]:
# validate the preprocessing function on 20 random sentences

# generate a list of 20 indices from the dataframe
indices = random.sample(list(hp_sentences.index), 20)

# iterate through the random indices
for i, index in enumerate(indices):
    
    # print the raw sentence and its processed version
    # the sampled values are index labels, so look them up with .loc rather than .iloc
    raw_sentence = hp_sentences["sentence"].loc[index]
    print(f"Sentence #{i+1}:")
    print(raw_sentence) 
    print(process_sentence(raw_sentence), "\n\n")

Sentence #1:
Wormy was here last weekend, I thought he seemed down, but that was probably the news about the McKinnons; I cried all evening when I heard.
['wormi', 'last', 'weekend', 'thought', 'seem', 'probabl', 'news', 'mckinnon', 'cri', 'even', 'heard'] 


Sentence #2:
“I don’t care ... I’ll breathe freely again when this tournament’s over, and that’s not until June.
['dont', 'care', 'ill', 'breath', 'freeli', 'tournament', 'that', 'june'] 


Sentence #3:
If only Arthur could have got us cars from the Ministry again ... but Fudge wouldn’t let him borrow so much as an empty ink bottle these days... How Muggles can stand traveling without magic ...” But the great black dog gave a joyful bark and gamboled around them, snapping at pigeons, and chasing its own tail.
['arthur', 'could', 'got', 'us', 'car', 'ministri', 'fudg', 'wouldnt', 'let', 'borrow', 'much', 'empti', 'ink', 'bottl', 'day', 'muggl', 'stand', 'travel', 'without', 'magic', 'great', 'black', 'dog', 'gave', 'joy', 'bark', '

## Create Frequency Dictionary
The next step is to create a frequency dictionary that shows how many times each word appears in each book.

In [13]:
def create_freq_dict(df, process_sentence=process_sentence):
    """
    Creates a frequency dictionary based on the sentences and the book in which they appear.
    
        Parameters:
            df (dataframe): dataframe with at least a sentence and book column
            process_sentence (function): function that turns a sentence into a list of tokens
            
        Returns:
            freq_dict (dict): maps (token, book) to the number of times the token appears in that book
            book_counts (dict): maps each book to its total number of processed tokens
    """

    # initiate dictionaries that will hold the word frequencies and number of words per book
    freq_dict = {}
    book_counts = {}
    
    # iterate through the rows in the dataframe
    # (zipping the columns avoids mixing index labels with integer positions)
    for raw_sentence, book_num in zip(df["sentence"], df["book"]):
        
        # process the sentence for the row
        sentence = process_sentence(raw_sentence)
        
        # iterate through the processed tokens for the row
        for token in sentence:
            # add 1 to count of words in book
            book_counts[book_num] = book_counts.get(book_num, 0) + 1
            
            # add 1 to the existing frequency count for the word or add it to the dictionary if not already there
            freq_dict[(token, book_num)] = freq_dict.get((token, book_num), 0) + 1
        
    return freq_dict, book_counts

In [14]:
# store the frequency dictionary and book counts
freq, book_counts = create_freq_dict(hp_sentences)
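
To make the structure of the two returned dictionaries concrete, here is a minimal sketch on a made-up two-sentence corpus. It assumes a simplified whitespace tokenizer (`toy_tokenize`, a hypothetical stand-in for `process_sentence`) so that it runs without the NLTK data; the counting logic mirrors `create_freq_dict`.

```python
# toy illustration only: the sentences and the tokenizer below are made up
toy_rows = [
    ("owl owl wand", 1),
    ("wand broom", 2),
]

def toy_tokenize(sentence):
    # hypothetical stand-in for process_sentence: split on whitespace
    return sentence.split()

toy_freq = {}
toy_book_counts = {}
for sentence, book in toy_rows:
    for token in toy_tokenize(sentence):
        # total number of tokens seen in this book
        toy_book_counts[book] = toy_book_counts.get(book, 0) + 1
        # number of times this token was seen in this book
        toy_freq[(token, book)] = toy_freq.get((token, book), 0) + 1

print(toy_freq)
print(toy_book_counts)
```

Each key of the frequency dictionary is a `(token, book)` pair, so the same word can carry a different count for every book, while the second dictionary holds one token total per book.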

In [39]:
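def predict_book(sentence, process_sentence=process_sentence):
    """
    Predicts the book a sentence most likely comes from using the Naive Bayes rule.
    Relies on the global hp_sentences, freq and book_counts objects created above.

        Parameters:
            sentence (str): A sentence in the form of a string
            process_sentence (function): function that turns a sentence into a list of tokens

        Returns:
            book (int): the book number (1 to 7) with the highest score
    """

    tokens = process_sentence(sentence)

    prob_books = {}
    total_sentences = len(hp_sentences)

    for book in range(1, 8):

        # prior: share of the training sentences that come from this book
        book_sentences = len(hp_sentences[hp_sentences["book"] == book])
        prob_books[book] = book_sentences / total_sentences

        # likelihood: multiply by P(token | book) for each token
        # note: without smoothing, a token never seen in a book sets that book's score to zero
        for token in tokens:
            token_book_prob = freq.get((token, book), 0) / book_counts[book]
            prob_books[book] *= token_book_prob

    # return the book with the highest score
    return max(prob_books, key=prob_books.get)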
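
The function above multiplies raw relative frequencies, so a single token that never appears in a book drives that book's score to zero, and long sentences can underflow towards zero for every book. A common remedy is add-one (Laplace) smoothing combined with summing log-probabilities. The sketch below is not the code used in this notebook: `predict_book_smoothed`, its parameters, and the toy counts are all hypothetical and only illustrate the idea.

```python
import math

def predict_book_smoothed(tokens, freq, book_counts, priors, vocab_size):
    """Hypothetical variant of predict_book with add-one smoothing and log-probabilities."""
    scores = {}
    for book, prior in priors.items():
        # start from the log prior of the book
        score = math.log(prior)
        for token in tokens:
            # add-one smoothing keeps an unseen token from zeroing the score
            count = freq.get((token, book), 0)
            score += math.log((count + 1) / (book_counts[book] + vocab_size))
        scores[book] = score
    # the highest log score corresponds to the most probable book
    return max(scores, key=scores.get)

# made-up toy counts for two books
toy_freq = {("owl", 1): 2, ("wand", 1): 1, ("wand", 2): 1, ("broom", 2): 1}
toy_book_counts = {1: 3, 2: 2}
toy_priors = {1: 0.5, 2: 0.5}
toy_vocab_size = len({token for token, _ in toy_freq})  # 3 distinct tokens

toy_prediction = predict_book_smoothed(
    ["owl", "owl", "broom"], toy_freq, toy_book_counts, toy_priors, toy_vocab_size
)
print(toy_prediction)
```

In this toy case the unsmoothed product is zero for both books ("broom" is missing from book 1 and "owl" from book 2), whereas the smoothed scores still separate them and book 1 wins.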

In [40]:
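# draw 100 random sentences from the training data to spot-check the model
# (these sentences were also used to build the frequency dictionary)
indices = random.sample(list(hp_sentences.index), 100)

test_subset = hp_sentences.iloc[indices]
test_subset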

Unnamed: 0,sentence,book
11723,Ron looked as though he was suffering some sor...,4
32778,"“Alecto, Amycus’s sister, teaches Muggle Studi...",7
3461,"“A memory,” said Riddle quietly.",2
15793,"“There’s a Common Welsh Green over there, the ...",4
37952,"“Don’t worry, Dumbledore,” he said coolly.",7
...,...,...
14709,There was a great scraping and banging as all ...,4
9554,“For ... for some things ...” He would have li...,3
7031,...” Harry stared up into the grave face and f...,3
28842,"Meanwhile, Lavender kept sidling up to Harry t...",6


In [41]:
# work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
test_subset = test_subset.copy()
test_subset["prediction"] = test_subset["sentence"].apply(predict_book)


In [43]:
# share of the 100 sampled sentences whose predicted book matches the true book
len(test_subset[test_subset["book"] == test_subset["prediction"]]) / 100

0.65
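
A single accuracy figure hides which books get confused with one another. As a minimal sketch, assuming two parallel lists of true and predicted book numbers (standing in for the `book` and `prediction` columns of `test_subset`), the pairs can be counted to see where the errors fall. The labels below are made up and are not results from this notebook.

```python
from collections import Counter

# made-up labels standing in for the "book" and "prediction" columns
true_books = [1, 1, 2, 2, 3]
predicted_books = [1, 2, 2, 2, 1]

# count how often each (true book, predicted book) pair occurs
confusion = Counter(zip(true_books, predicted_books))
for (true_book, predicted_book), count in sorted(confusion.items()):
    print(f"true {true_book} -> predicted {predicted_book}: {count}")

# accuracy is the share of pairs where the prediction matches the true book
toy_accuracy = sum(t == p for t, p in zip(true_books, predicted_books)) / len(true_books)
print(toy_accuracy)
```

Pairs whose two numbers differ are the misclassifications, so large off-diagonal counts point to books the model struggles to tell apart.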