# To Be Or Not To Be

A project for EECS 731 by Benjamin Wyss

Examining Shakespeare play data to build a classification model that predicts which character speaks a given line.

###### python imports

In [62]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.close('all')

### Reading Data Set From CSV

The data set covers all of Shakespeare's plays, with their characters, lines, and acts.

Taken from https://www.kaggle.com/kingburrito666/shakespeare-plays on 9/16/20

In [63]:
df = pd.read_csv('../data/raw/Shakespeare_data.csv')

In [64]:
df

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


## Exploratory Data Analysis

### Cleaning the data set

Rows with NaN values are removed because they correspond to stage directions and act or scene headings rather than characters' lines, so they add no value to the target classification model.

The Dataline column is also removed. It is only a running row counter, unrelated to the content of a character's line, so it will not add value to the target classification model.

In [65]:
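# Drop every row that has a NaN in any column (stage directions and headings)
df = df.dropna()
# Keep the model-relevant columns (drops Dataline); copy so later column assignments modify an independent frame
df = df[['Play', 'PlayerLinenumber', 'ActSceneLine', 'Player', 'PlayerLine']].copy()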

In [66]:
df

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...
111390,A Winters Tale,38.0,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
111391,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
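
The cleaning step above can be checked on a small example. The sketch below uses a hypothetical four-row frame (`toy`) modelled on the first rows of the data set; it is an illustration of `dropna` followed by dropping Dataline, not the notebook's own cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame modelled on the first rows shown earlier
toy = pd.DataFrame({
    'Dataline': [1, 2, 3, 4],
    'Play': ['Henry IV'] * 4,
    'PlayerLinenumber': [np.nan, np.nan, 1.0, 1.0],
    'ActSceneLine': [np.nan, np.nan, '1.1.1', '1.1.2'],
    'Player': [np.nan, np.nan, 'KING HENRY IV', 'KING HENRY IV'],
    'PlayerLine': ['ACT I', 'SCENE I. London. The palace.',
                   'So shaken as we are, so wan with care,',
                   'Find we a time for frighted peace to pant,'],
})

# Count missing values per column before dropping anything
nan_counts = toy.isna().sum()

# Drop rows with at least one NaN, then drop the Dataline counter
cleaned = toy.dropna().drop(columns='Dataline')
```

Here only the two spoken lines survive, and they keep their original index labels, which is why the cleaned output above starts at index 3 rather than 0.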


### Transforming the data set

###### Column Splitting

ActSceneLine is split into three columns (Act, Scene, and Line) to obtain a numeric representation of this data that the target classification model can analyze.

In [67]:
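# Split 'Act.Scene.Line' strings such as '1.1.1' into three string columns
actSceneLine = df['ActSceneLine'].str.split('.', n=2, expand=True)
# Convert each piece to a numeric column
df['Act'] = pd.to_numeric(actSceneLine[0])
df['Scene'] = pd.to_numeric(actSceneLine[1])
df['Line'] = pd.to_numeric(actSceneLine[2])
# Replace ActSceneLine with the three new columns
df = df[['Play', 'PlayerLinenumber', 'Act', 'Scene', 'Line', 'Player', 'PlayerLine']]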

In [68]:
df

Unnamed: 0,Play,PlayerLinenumber,Act,Scene,Line,Player,PlayerLine
3,Henry IV,1.0,1,1,1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,Henry IV,1.0,1,1,2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,Henry IV,1.0,1,1,3,KING HENRY IV,And breathe short-winded accents of new broils
6,Henry IV,1.0,1,1,4,KING HENRY IV,To be commenced in strands afar remote.
7,Henry IV,1.0,1,1,5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...,...,...
111390,A Winters Tale,38.0,5,3,179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
111391,A Winters Tale,38.0,5,3,180,LEONTES,"Lead us from hence, where we may leisurely"
111392,A Winters Tale,38.0,5,3,181,LEONTES,Each one demand an answer to his part
111393,A Winters Tale,38.0,5,3,182,LEONTES,Perform'd in this wide gap of time since first
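
The column split can be seen in isolation on two sample values. This is a minimal sketch; the names `acts` and `parts` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Two ActSceneLine values in the same 'Act.Scene.Line' format as the data set
acts = pd.Series(['1.1.1', '5.3.182'], name='ActSceneLine')

# Split on the literal '.' at most twice, expanding into three columns
parts = acts.str.split('.', n=2, expand=True)
parts.columns = ['Act', 'Scene', 'Line']

# Convert every column from strings to integers
parts = parts.apply(pd.to_numeric)
```

If some values might not be numeric, `pd.to_numeric(..., errors='coerce')` turns them into NaN instead of raising an error.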


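###### Label Encoding

Player is copied into a new Player_Category column stored with the pandas category dtype, so that the speaker can serve as a categorical label for the target classification model.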
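
The cells that create Player_Category are not included here. The sketch below shows one way such a column could be produced, assuming a plain `astype('category')` conversion. The `players` frame and the `Player_Code` column are hypothetical and only illustrate how integer labels can be read from a categorical column.

```python
import pandas as pd

# Hypothetical miniature of the Player column
players = pd.DataFrame({'Player': ['KING HENRY IV', 'KING HENRY IV', 'LEONTES']})

# Assumption: the categorical column is a dtype conversion of Player
players['Player_Category'] = players['Player'].astype('category')

# Integer code for each category (categories are sorted, so codes follow that order)
players['Player_Code'] = players['Player_Category'].cat.codes
```

The categorical column still displays the character names, which matches the output below, while `.cat.codes` exposes the underlying integers if a model needs numeric labels.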



In [73]:
df

Unnamed: 0,Play,PlayerLinenumber,Act,Scene,Line,Player,PlayerLine,Player_Category
3,Henry IV,1.0,1,1,1,KING HENRY IV,"So shaken as we are, so wan with care,",KING HENRY IV
4,Henry IV,1.0,1,1,2,KING HENRY IV,"Find we a time for frighted peace to pant,",KING HENRY IV
5,Henry IV,1.0,1,1,3,KING HENRY IV,And breathe short-winded accents of new broils,KING HENRY IV
6,Henry IV,1.0,1,1,4,KING HENRY IV,To be commenced in strands afar remote.,KING HENRY IV
7,Henry IV,1.0,1,1,5,KING HENRY IV,No more the thirsty entrance of this soil,KING HENRY IV
...,...,...,...,...,...,...,...,...
111390,A Winters Tale,38.0,5,3,179,LEONTES,"Is troth-plight to your daughter. Good Paulina,",LEONTES
111391,A Winters Tale,38.0,5,3,180,LEONTES,"Lead us from hence, where we may leisurely",LEONTES
111392,A Winters Tale,38.0,5,3,181,LEONTES,Each one demand an answer to his part,LEONTES
111393,A Winters Tale,38.0,5,3,182,LEONTES,Perform'd in this wide gap of time since first,LEONTES


In [74]:
df.dtypes

Play                  object
PlayerLinenumber     float64
Act                    int64
Scene                  int64
Line                   int64
Player                object
PlayerLine            object
Player_Category     category
dtype: object