# Different models

Let's see how different models predict on the *Simpsons* data.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

## Pre-processing ##


First, let's start with the code to generate a document-feature matrix:

In [3]:
df = pd.read_csv('simpsons.csv')
df = df.loc[(df['raw_character_text'] == 'Lisa Simpson') | (df['raw_character_text'] == 'Bart Simpson')]
text = df['spoken_words'].values.astype('U') # take the text from the df and convert it to Unicode strings
vect = CountVectorizer(stop_words='english') # create the CountVectorizer object, with English stop words removed
vect = vect.fit(text) # learn the vocabulary from the spoken lines
docu_feat = vect.transform(text) # build the sparse document-feature matrix
df.head()

| Unnamed: 0 | raw_character_text | spoken_words |
|---|---|---|
| 1 | Lisa Simpson | Where's Mr. Bergstrom? |
| 3 | Lisa Simpson | That life is worth living. |
| 7 | Bart Simpson | Victory party under the slide! |
| 9 | Lisa Simpson | Mr. Bergstrom! Mr. Bergstrom! |
| 11 | Lisa Simpson | Do you know where I could find him? |
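To make the idea of a document-feature matrix more concrete, here is a minimal sketch on two made-up lines (they are not taken from `simpsons.csv`). The names `toy_lines`, `toy_vect` and `toy_matrix` are only for illustration, and stop-word removal is left off so the whole toy vocabulary stays visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up lines, not taken from simpsons.csv
toy_lines = ["Eat my shorts!", "Eat your vegetables, Bart."]

toy_vect = CountVectorizer()                    # default settings: lowercases and keeps words of 2+ characters
toy_matrix = toy_vect.fit_transform(toy_lines)  # fit and transform in one step

# Column labels, ordered by their column index in the matrix
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
counts = toy_matrix.toarray()                   # dense copy; fine for a tiny toy matrix

print(vocab)   # one entry per column
print(counts)  # one row per line, one count per word
```

Each row is a line of dialogue and each column is a word from the vocabulary; the real matrix has the same layout, just with many more rows and columns.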


## Building the models ##

Now we will use three different classifiers from `sklearn`: *Naïve Bayes*, *k-NN* and *Random Forest*. One approach is to simply fit all the models on the same data and see how they perform. For a more reliable estimate we would repeat this over several train/test splits and average the scores, but for now let's stick to one split.

In [4]:
X = docu_feat #the document-feature matrix is the X matrix
y = df['raw_character_text'] #creating the y vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it
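Averaging over several splits, as mentioned above, could look roughly like the sketch below. It uses `ShuffleSplit` with `cross_val_score` on synthetic data; `X_demo`, `y_demo` and `mean_scores` are hypothetical stand-ins, not the Simpsons matrices, and the labels are random, so accuracies near 0.5 are to be expected.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for X and y (assumption: small random count data)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 4, size=(150, 30))
y_demo = rng.choice(["Bart Simpson", "Lisa Simpson"], size=150)

# Five random 70/30 splits, mirroring test_size=0.3 used in this notebook
splitter = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

mean_scores = {}
for name, model in [("Naive Bayes", MultinomialNB()), ("k-NN", KNeighborsClassifier())]:
    scores = cross_val_score(model, X_demo, y_demo, cv=splitter)  # accuracy on each split
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {mean_scores[name]:.3f} over {len(scores)} splits")
```

The same pattern should work with the real `X` and `y`, since `cross_val_score` accepts sparse matrices for these estimators.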


In [None]:
#Naive Bayes
nb = MultinomialNB() #create the model
nb = nb.fit(X_train, y_train) #fit the model X=features, y=character

#k-NN
knn = KNeighborsClassifier() #create the model
knn = knn.fit(X_train, y_train) #fit the model X=features, y=character

#Random Forest
rf = RandomForestClassifier() #create the model
rf = rf.fit(X_train, y_train) #fit the model X=features, y=character

## Evaluating the models ##

In [8]:
nb_score = nb.score(X_test, y_test)
knn_score = knn.score(X_test, y_test)
rf_score = rf.score(X_test, y_test)

print(f"The accuracy for Naive Bayes: {nb_score}")
print(f"The accuracy for k-NN: {knn_score}")
print(f"The accuracy for Random Forest: {rf_score}")


The accuracy for Naive Bayes: 0.6361716171617162
The accuracy for k-NN: 0.5566996699669967
The accuracy for Random Forest: 0.6035643564356435


Naive Bayes has the highest accuracy (about 0.64), followed by Random Forest (about 0.60) and then k-NN (about 0.56). We could also look into precision and recall if we want to, but I'll leave that as an exercise for the reader.
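As a starting point for that exercise, here is a small sketch of how precision and recall relate to the confusion matrix. The label lists are hypothetical; on the real data they would be `y_test` and something like `nb.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (not real model output)
y_true = ["Bart", "Bart", "Bart", "Lisa", "Lisa", "Lisa", "Lisa", "Bart"]
y_pred = ["Bart", "Lisa", "Bart", "Lisa", "Lisa", "Bart", "Lisa", "Bart"]

# Rows are the true labels, columns the predicted labels, in the order given
cm = confusion_matrix(y_true, y_pred, labels=["Bart", "Lisa"])

# Treat "Lisa" as the positive class
lisa_precision = precision_score(y_true, y_pred, pos_label="Lisa")  # of the lines predicted as Lisa, how many are Lisa's
lisa_recall = recall_score(y_true, y_pred, pos_label="Lisa")        # of Lisa's lines, how many were found

print(cm)
print(lisa_precision, lisa_recall)
```

In this toy case three of the four lines predicted as Lisa really are hers, and three of her four lines are found, so both values come out at 0.75.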

So one approach would be to take Naive Bayes as our model for now based on this exploration.

If we wanted to improve its performance, we might do some further pre-processing on the data (e.g., stemming) or add extra features such as the length of the utterance.

## Adding another feature to our chosen model ##

As an example, let's calculate the length of the dialogue line and add it to the dataframe:

In [95]:
df["spoken_words"] = df["spoken_words"].fillna("") #There are NaN values in this column; fill them with an empty string so that len() works
df["length"] = df["spoken_words"].apply(len)
df.head()

| Unnamed: 0 | raw_character_text | spoken_words | length |
|---|---|---|---|
| 1 | Lisa Simpson | Where's Mr. Bergstrom? | 22 |
| 3 | Lisa Simpson | That life is worth living. | 26 |
| 7 | Bart Simpson | Victory party under the slide! | 30 |
| 9 | Lisa Simpson | Mr. Bergstrom! Mr. Bergstrom! | 29 |
| 11 | Lisa Simpson | Do you know where I could find him? | 35 |


Let's see whether the average length differs between the two characters:

In [71]:
df.groupby("raw_character_text")["length"].mean()

raw_character_text
Bart Simpson    41.49662
Lisa Simpson    46.29698
Name: length, dtype: float64

Yes, Lisa's lines are longer on average (about 46 characters versus about 41 for Bart). Let's add the variable to the document-feature matrix. It's not a word, but it can still be used as a column. We'll have to use numpy syntax for that, because the document-feature matrix is a SciPy sparse matrix, not a dataframe:

In [75]:
import numpy as np #needed for np.c_

#X.toarray() converts the sparse matrix to a normal (dense) array. This takes up more memory, but not as much as making it into a dataframe. np.c_ then appends the length column.
X_plus = np.c_[X.toarray(), df["length"]]
X_train_plus, X_test_plus, y_train_plus, y_test_plus = train_test_split(X_plus, y, test_size=0.3) #split the data and store it
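As an alternative to `X.toarray()`, the extra column can be appended while keeping everything sparse. The sketch below uses `scipy.sparse.hstack` on a toy matrix; `X_toy`, `lengths` and `X_toy_plus` are illustrative names, and this is not what the notebook's timings were measured with.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: a 3x4 sparse word-count matrix and one length value per row
X_toy = csr_matrix(np.array([[1, 0, 0, 2],
                             [0, 1, 0, 0],
                             [0, 0, 3, 0]]))
lengths = np.array([22, 26, 30])

# hstack needs 2-D blocks, so reshape the lengths into a single column first
length_col = csr_matrix(lengths.reshape(-1, 1))
X_toy_plus = hstack([X_toy, length_col], format="csr")  # the result stays sparse

print(X_toy_plus.shape)      # one more column than X_toy
print(X_toy_plus.toarray())  # dense view, only for inspecting the toy result
```

On the real data this would be along the lines of `hstack([X, csr_matrix(df["length"].values.reshape(-1, 1))])`. Keeping the matrix sparse may also shorten the run time of the models below, though I have not measured that here.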


Let's run another model:

In [78]:
#Naive Bayes
nb = MultinomialNB() #create the model
nb = nb.fit(X_train_plus, y_train_plus) #fit the model X=features, y=character
nb_score = nb.score(X_test_plus, y_test_plus)
nb_score

0.6357755775577558

This score is almost identical to the earlier one (0.636), so any difference may depend entirely on the train-test split. So let's run a few more splits for both the original and the extended model and calculate an average (this will take some time, about 5-10 minutes on my laptop):

In [87]:
nb1_score = [] #the original data
nb2_score = [] #the original data with extra length column

for i in range(0,10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #the original data
    X_train_plus, X_test_plus, y_train_plus, y_test_plus = train_test_split(X_plus, y, test_size=0.3) #data with extra column
    
    nb1 = MultinomialNB() #create the model
    nb1 = nb1.fit(X_train, y_train) #train with original data
    nb1_score.append(nb1.score(X_test, y_test)) #calculate scores and add to list
    
    nb2 = MultinomialNB() #create the model
    nb2 = nb2.fit(X_train_plus, y_train_plus) #train with original data + length column
    nb2_score.append(nb2.score(X_test_plus, y_test_plus)) #calculate scores and add to list
    

    

In [98]:
from numpy import mean, around

print(f"The scores for the original model are: {around(nb1_score, 3)}, mean {round(mean(nb1_score),3)}")
print(f"The scores for the extended model are: {around(nb2_score, 3)}, mean {round(mean(nb2_score),3)}")      

The scores for the original model are: [0.635 0.642 0.633 0.632 0.632 0.647 0.631 0.644 0.638 0.644], mean 0.638
The scores for the extended model are: [0.645 0.627 0.637 0.628 0.64  0.637 0.64  0.647 0.639 0.649], mean 0.639
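Note that the loop above draws a separate random split for each model, so part of the variation between the two lists comes from the splits themselves. A possible refinement is to split both matrices on the same rows in every round by passing them to `train_test_split` together. The sketch below shows this on synthetic data; `X_demo`, `X_demo_plus`, `y_demo` and the score lists are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-ins for X, X_plus and y (assumption: random count data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(200, 20))    # plays the role of the word counts
extra = rng.integers(1, 50, size=(200, 1))     # plays the role of the length column
X_demo_plus = np.hstack([X_demo, extra])
y_demo = rng.integers(0, 2, size=200)          # random binary labels

base_scores, plus_scores = [], []
for seed in range(10):
    # Passing both matrices at once splits them on identical rows
    Xa_tr, Xa_te, Xb_tr, Xb_te, y_tr, y_te = train_test_split(
        X_demo, X_demo_plus, y_demo, test_size=0.3, random_state=seed)
    base_scores.append(MultinomialNB().fit(Xa_tr, y_tr).score(Xa_te, y_te))
    plus_scores.append(MultinomialNB().fit(Xb_tr, y_tr).score(Xb_te, y_te))

# Per-split difference: positive means the extra column helped on that split
diffs = np.array(plus_scores) - np.array(base_scores)
print(diffs.round(3), diffs.mean().round(3))
```

With paired splits, the per-split differences show directly whether the extra column helps, rather than comparing two averages that were computed on different test sets.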


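The scores are hardly different at all (mean 0.638 versus 0.639), which I don't really understand. Given the difference in average length between Lisa's and Bart's lines, I expected the scores to differ noticeably. A mean difference of about 5 characters (41.5 versus 46.3) does not by itself tell us how much the two length distributions overlap, though, so the new column may carry less signal than the averages suggest. It's also possible that I did something wrong. Either way, this is an example of how you could try to improve your model with some extra features after you've chosen a model.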