# Optimize Final Model
As determined in the previous notebook, the best-performing of the three models was the pre-built Linear SVC, with an accuracy of 45% on the validation set. In this notebook, I will try to improve its performance through hyperparameter tuning and error analysis, and then obtain a final accuracy score on the unseen testing dataset.

In [1]:
# import required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn import metrics

## Import Training, Validation and Testing Datasets

In [2]:
# define the path for the processed datasets
PATH = "data/processed/"

# read the training dataset
hp_sentences_train = pd.read_csv(f"{PATH}training_df.csv")

# read the validation dataset
hp_sentences_val = pd.read_csv(f"{PATH}validation_df.csv")

# read the testing dataset
hp_sentences_test = pd.read_csv(f"{PATH}testing_df.csv")

In [3]:
# show the first 5 rows of the training dataset
hp_sentences_train.head()

Unnamed: 0,sentence,book
0,A wild-looking old woman dressed all in green ...,1
1,Harry was thinking about this time yesterday a...,1
2,"He had been down at Hagrid’s hut, helping him ...",1
3,"“We’re looking for a big, old-fashioned one — ...",1
4,I forbid you to tell the boy anything!” A brav...,1
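
The `Unnamed: 0` column in the output above usually appears when a DataFrame's index was written to the CSV and then read back as an ordinary column. The following is a minimal sketch of one way to avoid it, assuming that is the cause here. The in-memory CSV is a hypothetical stand-in with a single row, not the real file.

```python
import io
import pandas as pd

# stand-in for one of the processed CSV files (hypothetical content, not the real data)
csv_text = ",sentence,book\n0,Harry wasn’t sure he could explain.,1\n"

# default read: the saved index comes back as a column named "Unnamed: 0"
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

If the assumption holds, passing `index_col=0` to the three `pd.read_csv` calls above would have the same effect on the real datasets.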


In [4]:
# show the first 5 rows of the validation dataset
hp_sentences_val.head()

Unnamed: 0,sentence,book
0,“She obviously makes more of an effort if you’...,1
1,We’ve eaten all our food and you still seem to...,1
2,"Please cheer up, Hagrid, we saved the Stone, i...",1
3,He gave his father a sharp tap on the head wit...,1
4,He kept threatening to tell her what really bi...,1


In [5]:
# show the first 5 rows of the testing dataset
hp_sentences_test.head()

Unnamed: 0,sentence,book
0,"Excuse me, I’m a prefect!” “How could a troll ...",1
1,Harry wasn’t sure he could explain.,1
2,There was a tabby cat standing on the corner o...,1
3,"Peeves threw the chalk into a bin, which clang...",1
4,It didn’t so much as quiver when a car door sl...,1


## Load Linear SVC Model

In [6]:
# create a pipeline with the three steps required to train the classifier and make predictions
hp_classifier_svc = Pipeline([
    ('count_vect', CountVectorizer()), # create a word count vector
    ('freq_vect', TfidfTransformer()), # convert the counts to normalized TF-IDF weights
    ('classify', LinearSVC()) # use a Linear SVC classifier
])

In [7]:
# train the model on the sentences in the training dataset
hp_classifier_svc.fit(hp_sentences_train["sentence"], hp_sentences_train["book"])

Pipeline(steps=[('count_vect', CountVectorizer()),
                ('freq_vect', TfidfTransformer()), ('classify', LinearSVC())])

In [8]:
# create a new column in the dataframe with the predicted book
hp_sentences_val["LinearSVC"] = hp_classifier_svc.predict(hp_sentences_val["sentence"])

In [9]:
# show the first 5 rows of the validation dataset with the predictions
hp_sentences_val.head()

Unnamed: 0,sentence,book,LinearSVC
0,“She obviously makes more of an effort if you’...,1,7
1,We’ve eaten all our food and you still seem to...,1,1
2,"Please cheer up, Hagrid, we saved the Stone, i...",1,1
3,He gave his father a sharp tap on the head wit...,1,1
4,He kept threatening to tell her what really bi...,1,3
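
With the predictions stored next to the true labels, accuracy and the misclassified rows can be derived directly from the two columns. The sketch below uses only the five rows displayed above, so its result is not representative of the full validation set. The `is_error` column name is an assumption introduced for illustration.

```python
import pandas as pd
from sklearn import metrics

# the five validation rows displayed above (true book vs. LinearSVC prediction)
val_head = pd.DataFrame({
    "book": [1, 1, 1, 1, 1],
    "LinearSVC": [7, 1, 1, 1, 3],
})

# fraction of rows where the prediction matches the true label
accuracy = metrics.accuracy_score(val_head["book"], val_head["LinearSVC"])

# flag the misclassified rows so they can be inspected later (hypothetical column name)
val_head["is_error"] = val_head["book"] != val_head["LinearSVC"]

print(f"accuracy on these 5 rows: {accuracy:.2f}")
print(val_head[val_head["is_error"]])
```

On the full data, the same calls would take `hp_sentences_val["book"]` and `hp_sentences_val["LinearSVC"]` as inputs.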


## Analyze errors
The first step toward improving the model's performance is to analyze the sentences it misclassified, identify patterns, and modify the model accordingly. I will do this in three ways:
1. Assess the relationship between sentence length and error rate.
2. Determine whether some words/tokens are associated with a higher error rate.
3. Manually identify patterns in a sample of 50 errors.

**TODO: find a way to pre-process and tokenize the sentences exactly as the pipeline does, so that sentence lengths and the most frequent tokens can be computed.**
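
One possible way to address the note above is `CountVectorizer.build_analyzer()`, which returns the callable the vectorizer itself uses for pre-processing and tokenization. The sketch below is a suggestion rather than the approach taken in this notebook. The pipeline is redefined only to keep the snippet self-contained, and the sample sentence is one of the complete rows shown earlier.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# same pipeline definition as above (unfitted; the analyzer does not require fitting)
hp_classifier_svc = Pipeline([
    ('count_vect', CountVectorizer()),
    ('freq_vect', TfidfTransformer()),
    ('classify', LinearSVC())
])

# build_analyzer() returns the function CountVectorizer applies to each document:
# pre-processing (lowercasing by default) followed by tokenization
analyzer = hp_classifier_svc.named_steps['count_vect'].build_analyzer()

sentence = "Harry wasn’t sure he could explain."
tokens = analyzer(sentence)

print(tokens)       # the tokens as the model sees them
print(len(tokens))  # sentence length measured in tokens
```

With the default settings, only tokens of two or more word characters are kept and punctuation acts as a separator, so "wasn’t" contributes "wasn" and the lone "t" is dropped. Applying `analyzer` to a whole column, for example with `hp_sentences_val["sentence"].apply(analyzer)`, would give token lists from which lengths and token frequencies can be computed.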

### 1. Assess the relationship between sentence length and error rate