# Models 2 & 3 - Pre-built Models

The next two models I will train come from the scikit-learn open source library. The first is a Naive Bayes classifier similar to the one I previously built from scratch. The second is a Linear SVC, the estimator recommended by the library's documentation for this type of problem (classification with fewer than 100k labeled samples). I expect these models to be more efficient and accurate than the one built from scratch, as they have been optimized by experts in the community. The next step will be to evaluate and compare all the models on the unseen testing dataset.

![SciKit Learn Estimator Decision Tree](img/EDA/estimator_decision_tree.png)

In [1]:
# import required libraries
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics

## Import Training and Validation Datasets

In [2]:
# define the path for the processed datasets
PATH = "data/processed/"

# read the training dataset
hp_sentences_train = pd.read_csv(f"{PATH}training_df.csv")

# read the validation dataset
hp_sentences_val = pd.read_csv(f"{PATH}validation_df.csv")

In [3]:
# show the first 5 rows of the training dataset
hp_sentences_train.head()

| Unnamed: 0 | sentence | book |
|---|---|---|
| 0 | A wild-looking old woman dressed all in green ... | 1 |
| 1 | Harry was thinking about this time yesterday a... | 1 |
| 2 | He had been down at Hagrid’s hut, helping him ... | 1 |
| 3 | “We’re looking for a big, old-fashioned one — ... | 1 |
| 4 | I forbid you to tell the boy anything!” A brav... | 1 |


In [4]:
# show the first 5 rows of the validation dataset
hp_sentences_val.head()

| Unnamed: 0 | sentence | book |
|---|---|---|
| 0 | “She obviously makes more of an effort if you’... | 1 |
| 1 | We’ve eaten all our food and you still seem to... | 1 |
| 2 | Please cheer up, Hagrid, we saved the Stone, i... | 1 |
| 3 | He gave his father a sharp tap on the head wit... | 1 |
| 4 | He kept threatening to tell her what really bi... | 1 |


## Naive Bayes Classification Model
The first pre-built model I used is a Naive Bayes text classifier similar to the one I built from scratch. Recall that my custom model achieved an accuracy of 63.6% on the training data and 37.2% on the validation data. I expect this model to be slightly more accurate and considerably more efficient. Using pre-built models and functions also significantly reduces development time.

Similar to the custom model, the first step is to preprocess the data. The tasks previously performed manually are handled by `CountVectorizer`, which lowercases and tokenizes the raw sentences. Note that with its default settings, as used here, it does not filter stop words; that requires setting its `stop_words` parameter. It then creates a sparse matrix where each row represents a sentence and each column holds the number of times a given word appears in that sentence.
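To make the shape of that count matrix concrete, here is a minimal sketch on two made-up sentences. The sentences and the `toy_` variable names are hypothetical and are not part of the project dataset; only the default `CountVectorizer()` settings match the pipeline used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy sentences, not taken from the project dataset
toy_sentences = [
    "Harry saw the owl",
    "The owl saw Harry and Ron",
]

# default settings: lowercase the text, split it into word tokens, keep every word
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_sentences)

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# dense view of the sparse matrix: one row per sentence, one column per word
toy_count_rows = toy_counts.toarray().tolist()

print(toy_vocabulary)
print(toy_count_rows)
```

With the default settings, a common word such as "the" still gets its own column, since no stop-word list is applied unless `stop_words` is set.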

The second step is to use `TfidfTransformer`, which converts the raw counts into TF-IDF weights. It downscales the weights of words that occur in many sentences, as they are less informative than words that appear in few sentences. With its default settings it also normalizes each sentence vector to unit length, which avoids discrepancies between shorter and longer sentences.
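The effect of this step can be checked on a tiny count matrix. This is only a sketch: the counts below are hypothetical, and the transformer is used with its default settings, as in the pipeline in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# hypothetical counts: 2 sentences x 3 words
# the word in column 0 appears in both sentences, the other two in one sentence each
toy_counts = np.array([
    [1, 1, 0],
    [1, 0, 1],
])

toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

# with the default norm="l2", every sentence vector has Euclidean length 1
row_lengths = np.linalg.norm(toy_tfidf, axis=1)

# in the first sentence, the shared word gets a smaller weight than the rarer word
shared_weight = toy_tfidf[0, 0]
unique_weight = toy_tfidf[0, 1]

print(row_lengths)
print(shared_weight < unique_weight)
```

Both words occur once in the first sentence, so the difference in their weights comes only from the inverse document frequency term.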

The final step is to pass the resulting matrix to the multinomial Naive Bayes classifier, either to train the model or to generate predictions.

These three steps are consolidated into one pipeline that can be used to train the model and also make predictions.

In [5]:
# create a pipeline with the three steps required to train the classifier and make predictions
hp_classifier_nb = Pipeline([
    ('count_vect', CountVectorizer()), # create a word count vector
    ('freq_vect', TfidfTransformer()), # normalize the term frequencies
    ('classify', MultinomialNB()) # use a Naive Bayes multinomial classifier
])

In [6]:
# train the model on the sentences in the training dataset
hp_classifier_nb.fit(hp_sentences_train["sentence"], hp_sentences_train["book"])

Pipeline(steps=[('count_vect', CountVectorizer()),
                ('freq_vect', TfidfTransformer()),
                ('classify', MultinomialNB())])

In [7]:
# predict the books for the sentences in the training dataset
book_predictions_train_nb = hp_classifier_nb.predict(hp_sentences_train["sentence"])

# test the classifier's accuracy on the training data
np.mean(book_predictions_train_nb == hp_sentences_train["book"])

0.496907321030217

In [9]:
# predict the books for the sentences in the validation dataset
book_predictions_val_nb = hp_classifier_nb.predict(hp_sentences_val["sentence"])

# test the classifier's accuracy on the validation data
np.mean(book_predictions_val_nb == hp_sentences_val["book"])

0.38787825006655424

As will almost always be the case, the model is less accurate on the validation dataset than on the training dataset. The model was fitted to the training sentences, and some words in the validation dataset either don't appear in the training data or are used in different contexts.

The model's accuracy on the validation data is 38.8%, which is slightly higher than the accuracy achieved by the custom model (37.2%). It was also significantly faster to train and to make predictions. It is worth noting, however, that the custom model was more accurate on the training data (63.6%) than the pre-built version (49.7%).
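The accuracy computed above with `np.mean` can also be obtained from the `metrics` module imported at the top of this notebook. The sketch below uses hypothetical labels, not the project's predictions, to show that `metrics.accuracy_score` gives the same number and that `metrics.confusion_matrix` breaks the errors down by class.

```python
import numpy as np
from sklearn import metrics

# hypothetical true and predicted book labels, for illustration only
toy_true = np.array([1, 1, 2, 3, 3, 3])
toy_pred = np.array([1, 2, 2, 3, 3, 1])

# share of predictions that match the true label, computed two ways
manual_accuracy = np.mean(toy_pred == toy_true)
library_accuracy = metrics.accuracy_score(toy_true, toy_pred)

# rows are true labels, columns are predicted labels (sorted label order)
toy_confusion = metrics.confusion_matrix(toy_true, toy_pred)

print(manual_accuracy, library_accuracy)
print(toy_confusion)
```

On the real data, the same calls would take `hp_sentences_val["book"]` and `book_predictions_val_nb` as arguments.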

## Linear SVC Classification Model
The second pre-built model I used is a Linear SVC (Support Vector Classifier), as recommended by the scikit-learn documentation referenced at the start of this notebook. The steps used to train the model and make predictions are the same as for the Naive Bayes model, the only difference being the final estimator (Linear SVC instead of Naive Bayes).

This is another benefit of using pre-built models from a library: I can compare two or more models in very little time to see which one is the most accurate and efficient.
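One way to express that comparison is to build the same pipeline around each candidate classifier in a loop. This is a sketch only: `build_text_pipeline`, the toy sentences and the toy labels are hypothetical and are not part of the project code.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def build_text_pipeline(classifier):
    """Hypothetical helper: same preprocessing steps, interchangeable final estimator."""
    return Pipeline([
        ('count_vect', CountVectorizer()),
        ('freq_vect', TfidfTransformer()),
        ('classify', classifier),
    ])


# hypothetical toy data, not taken from the project dataset
toy_sentences = [
    "the owl delivered a letter",
    "a letter arrived by owl",
    "the dragon guarded the egg",
    "the egg hatched a dragon",
]
toy_books = [1, 1, 2, 2]

candidates = [("naive_bayes", MultinomialNB()), ("linear_svc", LinearSVC())]

toy_scores = {}
for name, classifier in candidates:
    model = build_text_pipeline(classifier)
    model.fit(toy_sentences, toy_books)
    # Pipeline.score delegates to the classifier's score, i.e. accuracy
    toy_scores[name] = model.score(toy_sentences, toy_books)

print(toy_scores)
```

The scores here are training accuracies on four sentences, so they say nothing about real performance. The point is that swapping the estimator takes one line.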

In [10]:
# create a pipeline with the three steps required to train the classifier and make predictions
hp_classifier_svc = Pipeline([
    ('count_vect', CountVectorizer()), # create a word count vector
    ('freq_vect', TfidfTransformer()), # normalize the term frequencies
    ('classify', LinearSVC()) # use a Linear SVC classifier
])

In [11]:
# train the model on the sentences in the training dataset
hp_classifier_svc.fit(hp_sentences_train["sentence"], hp_sentences_train["book"])

Pipeline(steps=[('count_vect', CountVectorizer()),
                ('freq_vect', TfidfTransformer()), ('classify', LinearSVC())])

In [14]:
# predict the books for the sentences in the training dataset
book_predictions_train_svc = hp_classifier_svc.predict(hp_sentences_train["sentence"])

# test the classifier's accuracy on the training data
np.mean(book_predictions_train_svc == hp_sentences_train["book"])

0.7676688298519571

In [15]:
# predict the books for the sentences in the validation dataset
book_predictions_val_svc = hp_classifier_svc.predict(hp_sentences_val["sentence"])

# test the classifier's accuracy on the validation data
np.mean(book_predictions_val_svc == hp_sentences_val["book"])

0.453367645753838

With an accuracy of 45.3% on the validation dataset, the Linear SVC model appears to be the most accurate of the three options. While this accuracy is still relatively low, it is a significant improvement over the baseline and the two previous models.
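The accuracies reported in this notebook can be put side by side with a few lines of arithmetic. The values below are the rounded figures quoted above; the dictionary layout and the names `reported`, `gaps` and `best_model` are only an illustrative assumption.

```python
# rounded accuracies reported in this notebook (custom model figures as recalled above)
reported = {
    "custom_naive_bayes": {"train": 0.636, "validation": 0.372},
    "sklearn_naive_bayes": {"train": 0.497, "validation": 0.388},
    "sklearn_linear_svc": {"train": 0.768, "validation": 0.453},
}

# gap between training and validation accuracy for each model
gaps = {
    name: round(acc["train"] - acc["validation"], 3)
    for name, acc in reported.items()
}

# model with the highest validation accuracy
best_model = max(reported, key=lambda name: reported[name]["validation"])

print(gaps)
print(best_model)
```

The Linear SVC has the highest validation accuracy but also the largest gap between training and validation accuracy, which suggests there may be room to improve it by reducing overfitting.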

The next notebook will seek to confirm that this is the most accurate model and attempt to improve its accuracy through hyperparameter tuning and error analysis.

# Notebook Complete!