### To Be or Not To Be

In this notebook, we try to predict the player (the speaker) of a line in a Shakespeare play from the Play, PlayerLinenumber, Act, Scene, and Line.
<br>
<br>
First we import our packages, including the two classifiers we will apply to the data set.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing

In [3]:
df = pd.read_csv('./Data/Shakespeare_data.csv')
print(df.head())

   Dataline      Play  PlayerLinenumber ActSceneLine         Player  \
0         1  Henry IV               NaN          NaN            NaN   
1         2  Henry IV               NaN          NaN            NaN   
2         3  Henry IV               NaN          NaN            NaN   
3         4  Henry IV               1.0        1.1.1  KING HENRY IV   
4         5  Henry IV               1.0        1.1.2  KING HENRY IV   

                                          PlayerLine  
0                                              ACT I  
1                       SCENE I. London. The palace.  
2  Enter KING HENRY, LORD JOHN OF LANCASTER, the ...  
3             So shaken as we are, so wan with care,  
4         Find we a time for frighted peace to pant,  


### Feature Engineering
<br>
The first rows (act headings, scene headings, and stage directions) contain NaN values, so we remove every row that contains a NaN to make the data usable.
<br><br>
PlayerLine holds free text that would require natural language processing, so we remove it to avoid this complexity.
<br>
Speech patterns could also bias the model, since the same person wrote every line.
<br>
We also drop Dataline because it is redundant given the row index.


In [4]:
df = df.dropna()
df = df.drop(columns = ['PlayerLine'])
df = df.drop(columns = ['Dataline'])

df.head()

       Play  PlayerLinenumber  ActSceneLine         Player
3  Henry IV               1.0         1.1.1  KING HENRY IV
4  Henry IV               1.0         1.1.2  KING HENRY IV
5  Henry IV               1.0         1.1.3  KING HENRY IV
6  Henry IV               1.0         1.1.4  KING HENRY IV
7  Henry IV               1.0         1.1.5  KING HENRY IV
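
As a quick check of what this cleanup removes, the sketch below applies the same `dropna` and `drop` steps to a tiny hand-made frame. The three rows are made-up stand-ins for the real CSV, and the names `toy`, `missing_per_column`, and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the structure of the real data set
toy = pd.DataFrame({
    'Dataline': [1, 2, 3],
    'Play': ['Henry IV', 'Henry IV', 'Henry IV'],
    'PlayerLinenumber': [np.nan, 1.0, 1.0],
    'ActSceneLine': [np.nan, '1.1.1', '1.1.2'],
    'Player': [np.nan, 'KING HENRY IV', 'KING HENRY IV'],
    'PlayerLine': ['ACT I',
                   'So shaken as we are, so wan with care,',
                   'Find we a time for frighted peace to pant,'],
})

# Count missing values per column before removing anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same cleanup as above: drop incomplete rows, then the unused columns
cleaned = toy.dropna().drop(columns=['PlayerLine', 'Dataline'])
print(len(toy) - len(cleaned), 'row(s) removed')
```

Running `df.isna().sum()` on the real frame before the `dropna` call would show, in the same way, which columns are responsible for the removed rows.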


ActSceneLine has to be split into three separate columns because a value such as 1.1.1 is not a number that a model can use.
<br>
Then the original column can be dropped.

In [5]:
dfActSceneLine = df['ActSceneLine'].str.split(pat='.', n=2, expand=True)
df = df.drop(columns = ['ActSceneLine'])

df['Act'] = dfActSceneLine[0]
df['Scene'] = dfActSceneLine[1]
df['Line'] = dfActSceneLine[2]
df

        Play  PlayerLinenumber         Player  Act  Scene  Line
3   Henry IV               1.0  KING HENRY IV    1      1     1
4   Henry IV               1.0  KING HENRY IV    1      1     2
5   Henry IV               1.0  KING HENRY IV    1      1     3
6   Henry IV               1.0  KING HENRY IV    1      1     4
7   Henry IV               1.0  KING HENRY IV    1      1     5
8   Henry IV               1.0  KING HENRY IV    1      1     6
9   Henry IV               1.0  KING HENRY IV    1      1     7
10  Henry IV               1.0  KING HENRY IV    1      1     8
11  Henry IV               1.0  KING HENRY IV    1      1     9
12  Henry IV               1.0  KING HENRY IV    1      1    10
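
One detail worth checking: `str.split` returns strings, so Act, Scene, and Line are still text even though they print like numbers. The sketch below, on made-up values, shows one way to convert them to integers. It assumes every piece is a plain whole number, which has not been verified against the full data set; `toy` and `parts` are hypothetical names.

```python
import pandas as pd

# Made-up values in the same 'act.scene.line' format as the data set
toy = pd.DataFrame({'ActSceneLine': ['1.1.1', '1.1.2', '3.2.15']})

# A single-character pattern is treated as a literal dot, not a regex
parts = toy['ActSceneLine'].str.split(pat='.', n=2, expand=True)
parts.columns = ['Act', 'Scene', 'Line']

# The pieces are strings at this point; convert them to integers
# (assumes every piece is a whole number)
parts = parts.astype(int)
print(parts.dtypes)
print(parts)
```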


Note PlayerLinenumber can't help us predict player beyond knowing that a player likely has consecutive lines.
<br>

We want to predict Player by using Play, PlayerLinenumber, Act, Scene, and Line
<br>
To do this, we need numeric values for the model, but Player and Play are strings in our dataframe
<br><br>
Label encoding Player and Play solves this problem

In [6]:
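# use one encoder per column so each mapping can be inverted later
playEncoder = preprocessing.LabelEncoder()
playerEncoder = preprocessing.LabelEncoder()

df['Play'] = playEncoder.fit_transform(df['Play'])
df['Player'] = playerEncoder.fit_transform(df['Player'])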
df.head()

   Play  PlayerLinenumber  Player  Act  Scene  Line
3     9               1.0     457    1      1     1
4     9               1.0     457    1      1     2
5     9               1.0     457    1      1     3
6     9               1.0     457    1      1     4
7     9               1.0     457    1      1     5
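
Encoded labels such as 457 are hard to read, so it helps to be able to map them back to names. The sketch below, on made-up rows, keeps one fitted encoder per column and uses `inverse_transform` to recover the original strings. The names `toy`, `play_encoder`, and `player_encoder` are hypothetical.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up rows; the real data set has many more plays and players
toy = pd.DataFrame({
    'Play': ['Henry IV', 'Hamlet', 'Hamlet'],
    'Player': ['KING HENRY IV', 'HAMLET', 'HORATIO'],
})

# One encoder per column, so neither mapping overwrites the other
play_encoder = preprocessing.LabelEncoder()
player_encoder = preprocessing.LabelEncoder()

toy['Play'] = play_encoder.fit_transform(toy['Play'])
toy['Player'] = player_encoder.fit_transform(toy['Player'])

# Labels are assigned in sorted order of the original strings
print(toy)

# Map the integer labels back to player names
decoded_players = player_encoder.inverse_transform(toy['Player'])
print(list(decoded_players))
```

The same call would turn a model's integer predictions back into player names.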


### Classify

Now we will try to predict player using the other columns

In [7]:
indepVars = df[['Play', 'PlayerLinenumber', 'Act', 'Scene', 'Line']]
depVars = df['Player']

In [8]:
# split into training set and test set
# train on 70 percent of the data and test on 30
X_train, X_test, Y_train, Y_test = train_test_split(indepVars, depVars, test_size=0.3, random_state=1)

First we try to use a Random Forest

In [9]:
# create random forest classifier
classifier = RandomForestClassifier(n_estimators=10)

# train classifier on the training data
classifier.fit(X_train, Y_train)

# predict on test to see what percentage is correct
randomForestPrediction = classifier.predict(X_test)

In [10]:
print( metrics.accuracy_score(Y_test, randomForestPrediction) )

0.7866290496417929
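
An accuracy figure is easier to judge next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` to always predict the most frequent class. It runs on synthetic data, not the Shakespeare data, so its number says nothing about this notebook's result; it only illustrates the check. The class proportions are made up.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with an imbalanced target (made-up proportions)
rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 3))
y = rng.choice([0, 1, 2], size=200, p=[0.6, 0.3, 0.1])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Baseline that ignores the features and predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, Y_train)

baseline_accuracy = metrics.accuracy_score(Y_test, baseline.predict(X_test))
print(baseline_accuracy)
```

Fitting the same baseline on the notebook's own `X_train` and `Y_train` would show how much of the classifier's accuracy comes from the features.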


This is good, but a random forest fits several trees (10 here), so it costs more to train and to query than a single tree.
<br>
So we will try a single decision tree classifier, which is cheaper

In [11]:
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train, Y_train)
decisionTreePrediction = classifier.predict(X_test)

In [12]:
print( metrics.accuracy_score(Y_test, decisionTreePrediction) )

0.7712863754517213
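
The cost difference between the two models can be measured rather than assumed. The sketch below times one fit of each on synthetic data; the data shape and the helper `fit_seconds` are hypothetical, and timings vary by machine, so no expected values are shown.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 2000 rows, 5 integer features, 20 classes
rng = np.random.default_rng(1)
X = rng.integers(0, 50, size=(2000, 5))
y = rng.integers(0, 20, size=2000)

def fit_seconds(model):
    """Return the wall-clock time in seconds taken to fit the model."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

forest_seconds = fit_seconds(
    RandomForestClassifier(n_estimators=10, random_state=1))
tree_seconds = fit_seconds(DecisionTreeClassifier(random_state=1))

print(f'forest: {forest_seconds:.4f}s, tree: {tree_seconds:.4f}s')
```

Replacing `X` and `y` with the notebook's training split would give the timings that matter for this data set.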


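The accuracy of the decision tree classifier (about 0.771) is within two percentage points of the random forest classifier (about 0.787).
<br>
If training and prediction cost matter, the decision tree classifier is the better choice here because it gives nearly the same accuracy with a single tree.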
<br><br>
As noted earlier, PlayerLinenumber may not be very helpful, so we will try both classifiers without this attribute.

In [13]:
indepVars = df[['Play', 'Act', 'Scene', 'Line']]
depVars = df['Player']

X_train, X_test, Y_train, Y_test = train_test_split(indepVars, depVars, test_size=0.3, random_state=1)

In [14]:
classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(X_train, Y_train)
randomForestPrediction = classifier.predict(X_test)
print( metrics.accuracy_score(Y_test, randomForestPrediction) )

0.711437266214417


In [15]:
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train, Y_train)
decisionTreePrediction = classifier.predict(X_test)
print( metrics.accuracy_score(Y_test, decisionTreePrediction) )

0.7105496734926774


Again, the accuracy of the two classifiers is very close.<br>
Without the PlayerLinenumber column, the random forest loses about 7.5 percentage points of accuracy and the decision tree about 6. This isn't huge, but the extra accuracy could be valuable depending on the application.
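
The size of the drop can be worked out directly from the four accuracies printed above. The dictionary names below are hypothetical; the numbers are copied from the notebook outputs.

```python
# Accuracies copied from the outputs above
with_linenumber = {
    'random forest': 0.7866290496417929,
    'decision tree': 0.7712863754517213,
}
without_linenumber = {
    'random forest': 0.711437266214417,
    'decision tree': 0.7105496734926774,
}

# Difference in percentage points for each classifier
drops = {}
for name in with_linenumber:
    drops[name] = (with_linenumber[name] - without_linenumber[name]) * 100
    print(f'{name}: {drops[name]:.1f} percentage points lost')

# random forest: 7.5 percentage points lost
# decision tree: 6.1 percentage points lost
```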