# To Be Or Not To Be

A project for EECS 731 by Benjamin Wyss

Examining Shakespeare play data to build a classification model that predicts which character speaks a given line.

###### Python Imports

In [173]:
import numpy as np
import pandas as pd
import sklearn as skl
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
plt.close('all')

### Reading Data Set From CSV

The data set contains all of Shakespeare's plays, characters, lines, and acts.

It was taken from https://www.kaggle.com/kingburrito666/shakespeare-plays on 9/16/20.

In [174]:
df = pd.read_csv('../data/raw/Shakespeare_data.csv')

In [175]:
df

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


## Exploratory Data Analysis

### Cleaning the data set

Rows with NaN values are removed because they correspond to stage directions rather than characters' lines, so they add no value to the target classification model.

Additionally, the Dataline column is removed because it is only a running row counter and does not describe the characters' lines. Hence, it will not add value to the target classification model.

In [176]:
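# Drop stage directions (rows containing NaN), then keep only the columns of interest.
df = df.dropna()
# .copy() makes the result an independent frame, avoiding a SettingWithCopyWarning
# when new columns are assigned to it later.
df = df[['Play', 'PlayerLinenumber', 'ActSceneLine', 'Player', 'PlayerLine']].copy()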
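
The effect of the cleaning step can be checked on a small example. The following is a minimal sketch on a hypothetical three-row frame (`toy` and `spoken` are illustrative names, not part of the notebook) that mimics the raw layout, where a stage direction has no Player or ActSceneLine value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first row is a stage direction, so most of its fields are NaN.
toy = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "Henry IV"],
    "PlayerLinenumber": [np.nan, 1.0, 1.0],
    "ActSceneLine": [np.nan, "1.1.1", "1.1.2"],
    "Player": [np.nan, "KING HENRY IV", "KING HENRY IV"],
    "PlayerLine": [
        "ACT I",
        "So shaken as we are, so wan with care,",
        "Find we a time for frighted peace to pant,",
    ],
})

# With default arguments, dropna() removes every row that contains at least one NaN.
spoken = toy.dropna()
print(len(toy), "rows before,", len(spoken), "rows after")
```

Only rows that have a speaker survive, which is the behavior the cleaning step relies on.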

In [177]:
df

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...
111390,A Winters Tale,38.0,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
111391,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first


### Transforming the data set

###### Column Splitting

The ActSceneLine column is split into three numeric columns (Act, Scene, and Line) so that the target classification model can use this positional information as numeric features.

In [178]:
actSceneLine = df['ActSceneLine'].str.split('.', n = 2, expand = True)
df['Act'] = pd.to_numeric(actSceneLine[0])
df['Scene'] = pd.to_numeric(actSceneLine[1])
df['Line'] = pd.to_numeric(actSceneLine[2])
df = df[['Play', 'PlayerLinenumber', 'Act', 'Scene', 'Line', 'Player', 'PlayerLine']]
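
As a small illustration of the split, here is a sketch on two hypothetical `ActSceneLine` values (the frame `toy` is an assumption for demonstration only). With `expand=True`, `str.split` returns one column per piece, and `pd.to_numeric` converts each piece from string to integer:

```python
import pandas as pd

# Hypothetical toy frame holding two "act.scene.line" strings.
toy = pd.DataFrame({"ActSceneLine": ["1.1.1", "5.3.182"]})

# Split on the literal "." at most twice, giving three string columns labelled 0, 1, 2.
parts = toy["ActSceneLine"].str.split(".", n=2, expand=True)

# Convert each piece to a numeric column.
toy["Act"] = pd.to_numeric(parts[0])
toy["Scene"] = pd.to_numeric(parts[1])
toy["Line"] = pd.to_numeric(parts[2])
print(toy)
```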

In [179]:
df

Unnamed: 0,Play,PlayerLinenumber,Act,Scene,Line,Player,PlayerLine
3,Henry IV,1.0,1,1,1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,Henry IV,1.0,1,1,2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,Henry IV,1.0,1,1,3,KING HENRY IV,And breathe short-winded accents of new broils
6,Henry IV,1.0,1,1,4,KING HENRY IV,To be commenced in strands afar remote.
7,Henry IV,1.0,1,1,5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...,...,...
111390,A Winters Tale,38.0,5,3,179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
111391,A Winters Tale,38.0,5,3,180,LEONTES,"Lead us from hence, where we may leisurely"
111392,A Winters Tale,38.0,5,3,181,LEONTES,Each one demand an answer to his part
111393,A Winters Tale,38.0,5,3,182,LEONTES,Perform'd in this wide gap of time since first


###### One-Hot Encoding

The Play column is encoded into multiple columns via one-hot encoding because the target classification model cannot work with string data directly. One-hot encoding is chosen over label encoding because label encoding assigns each play an arbitrary integer, which implies an ordering and a notion of closeness between plays that does not exist.

In [180]:
df = pd.get_dummies(df, columns=['Play'])
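
The difference between the two encodings can be seen on a small example. The sketch below uses a hypothetical four-row frame; the label encoding shown (pandas category codes) is just one possible way to label-encode and is included only for contrast:

```python
import pandas as pd

# Hypothetical toy frame with three distinct plays.
toy = pd.DataFrame({
    "Play": ["Hamlet", "Macbeth", "Othello", "Hamlet"],
    "Act": [1, 2, 3, 4],
})

# Label encoding: each play becomes an arbitrary integer (here, alphabetical order).
# Numerically, Othello (2) looks "further" from Hamlet (0) than Macbeth (1) does.
label_codes = toy["Play"].astype("category").cat.codes

# One-hot encoding: each play gets its own indicator column, with no implied ordering.
one_hot = pd.get_dummies(toy, columns=["Play"])

print(label_codes.tolist())
print(one_hot)
```

Each row of the one-hot result has exactly one indicator set, so no play is numerically closer to another.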

In [181]:
df

Unnamed: 0,PlayerLinenumber,Act,Scene,Line,Player,PlayerLine,Play_A Comedy of Errors,Play_A Midsummer nights dream,Play_A Winters Tale,Play_Alls well that ends well,...,Play_Richard III,Play_Romeo and Juliet,Play_Taming of the Shrew,Play_The Tempest,Play_Timon of Athens,Play_Titus Andronicus,Play_Troilus and Cressida,Play_Twelfth Night,Play_Two Gentlemen of Verona,Play_macbeth
3,1.0,1,1,1,KING HENRY IV,"So shaken as we are, so wan with care,",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1.0,1,1,2,KING HENRY IV,"Find we a time for frighted peace to pant,",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1.0,1,1,3,KING HENRY IV,And breathe short-winded accents of new broils,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1.0,1,1,4,KING HENRY IV,To be commenced in strands afar remote.,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1.0,1,1,5,KING HENRY IV,No more the thirsty entrance of this soil,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111390,38.0,5,3,179,LEONTES,"Is troth-plight to your daughter. Good Paulina,",0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
111391,38.0,5,3,180,LEONTES,"Lead us from hence, where we may leisurely",0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
111392,38.0,5,3,181,LEONTES,Each one demand an answer to his part,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
111393,38.0,5,3,182,LEONTES,Perform'd in this wide gap of time since first,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### Visualizing the data set



### Feature Engineering



In [182]:
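# Count whitespace-separated tokens in the first five lines only.
# Series.value_counts() is used because the top-level pd.value_counts is deprecated in recent pandas.
wordCounts = df['PlayerLine'].head().str.split().apply(lambda words: pd.Series(words).value_counts())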

In [183]:
wordCounts

Unnamed: 0,with,wan,shaken,we,So,"are,",as,so,"care,",frighted,...,To,in,strands,thirsty,this,the,soil,more,No,entrance
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,...,,,,,,,,,,
4,,,,1.0,,,,,,1.0,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,1.0,1.0,1.0,,,,,,,
7,,,,,,,,,,,...,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0
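
In the table above, NaN marks a word that does not occur in a line. As a sketch of how such counts could be made usable as numeric features, the NaN entries can be replaced with zero. The two-line series `lines` is a hypothetical sample, and this step is an assumption about a possible next step, not something the notebook performs:

```python
import pandas as pd

# Hypothetical sample of two lines.
lines = pd.Series([
    "So shaken as we are, so wan with care,",
    "Find we a time for frighted peace to pant,",
])

# One row per line, one column per distinct token; tokens absent from a line are NaN.
counts = lines.str.split().apply(lambda words: pd.Series(words).value_counts())

# Replace NaN with 0 so that every cell is an integer count.
counts = counts.fillna(0).astype(int)
print(counts)
```

Splitting on whitespace keeps punctuation and case, so `are,` and `So`/`so` are counted as distinct tokens, as in the output above.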


### Model Construction

In [184]:
# Move the target column, Player, to the front so that it sits at column index 0.
playerCol = df.pop('Player')
df.insert(0, 'Player', playerCol)

In [185]:
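# Drop the raw text column and the Line column before modeling.
df.pop('PlayerLine')
df.pop('Line')
# Split the remaining columns into features (X) and the target label (Y).
array = df.values
X = array[:, 1:]  # every column except Player
Y = array[:, 0]   # Player, moved to column 0 above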

In [186]:
X_train, X_validate, Y_train, Y_validate = train_test_split(X, Y, test_size=0.3, shuffle=True)
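
Because the split above is shuffled without a fixed seed, the rows in each partition, and therefore the reported accuracies, can differ between runs. The sketch below shows how passing `random_state` makes the split repeatable. The value 42 and the toy arrays are arbitrary assumptions, not values used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Two calls with the same random_state yield identical partitions.
split_a = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

X_train, X_validate, y_train, y_validate = split_a
print(X_train.shape, X_validate.shape)
```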

In [187]:
kFold = StratifiedKFold(n_splits=10, shuffle=True)
crossValPredictions = cross_val_predict(DecisionTreeClassifier(), X_train, Y_train, cv=kFold)



In [188]:
accuracy_score(Y_train, crossValPredictions)

0.8275548189006331
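
The cross-validation step can be sketched on synthetic data. In the example below, the data, the label rule, `n_splits=5`, and `random_state=0` are all assumptions chosen for illustration. `cross_val_predict` returns one prediction per sample, each made by a model trained on the folds that did not contain that sample:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 samples, 3 small-integer features.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 3))
# Hypothetical label rule: the class depends only on the first feature.
y = (X[:, 0] >= 3).astype(int)

# Stratified folds keep the class proportions roughly equal across folds.
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted by a tree that never saw it during training.
preds = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=k_fold)
score = accuracy_score(y, preds)
print(score)
```

Note that `StratifiedKFold` warns when a class has fewer members than `n_splits`, which can happen for characters who speak only a handful of lines.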

In [189]:
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)

DecisionTreeClassifier()

In [190]:
predictions = model.predict(X_validate)

In [191]:
accuracy_score(Y_validate, predictions)

0.8357636467380968