# To Be Or Not To Be
In this project we look at a Shakespeare dataset that contains every line spoken in Shakespeare's works, along with the play it was spoken in, the act, scene, and line number, the player line number, and the player who spoke the line. We will add some features to make this dataset more useful. We will separate the ActSceneLine column into its own Act, Scene, and Line columns. We will also one-hot encode the Play column, which may help the classifier tell which play a line is from. Another important feature is a column for every unique character in Shakespeare's works: for each row, we cross-reference the players in the current scene and put a 1 in a character's column if they are in the scene and a 0 if not. The idea is that there are far fewer players to choose from once we know who is in the current scene, which gives the classifier better odds of predicting the right player.

We will start by importing the essential libraries and loading the Shakespeare dataset into pandas.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import OneHotEncoder
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('Shakespeare_data.csv', index_col = "Dataline")

Let's check the data for missing values.

In [3]:
# count the missing values in each column
for col in data.columns:
    print(f"{col}: {data[col].isnull().sum()}")

data.head()

Play: 0
PlayerLinenumber: 3
ActSceneLine: 6243
Player: 7
PlayerLine: 0


Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
1,Henry IV,,,,ACT I
2,Henry IV,,,,SCENE I. London. The palace.
3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


Some columns have missing data. This is fine, but we are going to set these missing values to 0 (and to "0.0.0" for ActSceneLine, so that it can still be split into three parts).

In [4]:
data["ActSceneLine"] = data["ActSceneLine"].fillna("0.0.0")
data = data.fillna(0)
data.head()

Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
1,Henry IV,0.0,0.0.0,0,ACT I
2,Henry IV,0.0,0.0.0,0,SCENE I. London. The palace.
3,Henry IV,0.0,0.0.0,0,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


We are going to split ActSceneLine into three columns: Act, Scene, and Line.

In [5]:
Act = []
Scene = []
Line = []
for el in data.ActSceneLine:
    text  = str(el).split(".")
    Act.append(text[0])
    Scene.append(text[1])
    Line.append(text[2])
data["Act"] = Act
data["Scene"] = Scene
data["Line"] = Line
data = data.drop(columns=["ActSceneLine"])
data.head()

Dataline,Play,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line
1,Henry IV,0.0,0,ACT I,0,0,0
2,Henry IV,0.0,0,SCENE I. London. The palace.,0,0,0
3,Henry IV,0.0,0,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ...",0,0,0
4,Henry IV,1.0,KING HENRY IV,"So shaken as we are, so wan with care,",1,1,1
5,Henry IV,1.0,KING HENRY IV,"Find we a time for frighted peace to pant,",1,1,2
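
The same split can also be written without an explicit loop. The following is a minimal sketch, assuming every ActSceneLine value has exactly three dot-separated parts (which holds after the "0.0.0" fill above). The small `demo` frame is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
demo = pd.DataFrame({"ActSceneLine": ["0.0.0", "1.1.1", "1.1.2"]})

# str.split with expand=True returns one column per dot-separated part
parts = demo["ActSceneLine"].str.split(".", expand=True)
parts.columns = ["Act", "Scene", "Line"]

# Unlike the loop above, this also converts the parts from strings to integers
demo = demo.drop(columns=["ActSceneLine"]).join(parts.astype(int))
print(demo)
```

Converting to integers here is a design choice: it keeps Act, Scene, and Line numeric from the start instead of relying on the classifier to cast the strings.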


We want to keep the Play column, but in a format the classifier can work with, so we will one-hot encode it with pd.get_dummies.

In [6]:
y = pd.get_dummies(data.Play)
for col in y.columns:
    data[col] = y[col]
data = data.drop(columns = ["Play"])
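
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up three-row frame (the `demo` name and its rows are illustrative only). Each distinct play becomes its own indicator column, and `pd.concat` attaches them all at once instead of column by column.

```python
import pandas as pd

# Hypothetical rows, for illustration only
demo = pd.DataFrame({"Play": ["Henry IV", "macbeth", "Henry IV"],
                     "Line": [1, 1, 2]})

# One indicator column per distinct play; dtype=int gives 0/1 rather than booleans
dummies = pd.get_dummies(demo["Play"], dtype=int)

# Replace the text column with its indicator columns
demo = pd.concat([demo.drop(columns=["Play"]), dummies], axis=1)
print(demo)
```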



Now let's label-encode the Player column so it can be used as the target for the classifier.

In [7]:
data["Player_cat"] = data["Player"].astype('category').cat.codes
data.head(2)

Dataline,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line,A Comedy of Errors,A Midsummer nights dream,A Winters Tale,Alls well that ends well,...,Romeo and Juliet,Taming of the Shrew,The Tempest,Timon of Athens,Titus Andronicus,Troilus and Cressida,Twelfth Night,Two Gentlemen of Verona,macbeth,Player_cat
1,0.0,0,ACT I,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0.0,0,SCENE I. London. The palace.,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
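
The integer codes are only useful if they can be turned back into names when inspecting predictions. A minimal sketch of that round trip, using a made-up list of players (the `players` series is illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical player column, for illustration only
players = pd.Series(["KING HENRY IV", "WESTMORELAND", "KING HENRY IV", "FALSTAFF"])

cat = players.astype("category")
codes = cat.cat.codes          # one integer label per row
names = cat.cat.categories     # names[code] recovers the player name

# Map every code back to its player name
decoded = [names[c] for c in codes]
print(list(codes), decoded)
```

Keeping `names` around makes it possible to translate a predicted Player_cat value back into a character.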


Let's split our data into training and testing sets.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = ["Player","PlayerLine", "Player_cat"]), data["Player_cat"], test_size=.2, random_state=41)

Let's make a naive Bayes classifier and run it on our dataset. The goal is to see how it performs before we add more features, and then compare the results with the same classifier after the features have been added.

In [9]:
clf =  GaussianNB()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)

print(score)

0.2192549371633752


We get an accuracy score of around 22%, which is very low.
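
One way to put a score like this in context is to compare it with a baseline that always predicts the most frequent class. The sketch below is not part of the original analysis: it uses scikit-learn's `DummyClassifier` on synthetic features and labels that are made up purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # synthetic features, ignored by the baseline
y = np.array([0] * 120 + [1] * 50 + [2] * 30)   # synthetic, imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Always predicts the most frequent label seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_score = baseline.score(X_te, y_te)
print(baseline_score)
```

Running the same comparison on the real training and test sets would show how much of the classifier's accuracy is explained by class frequency alone.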

Let's get the characters listed in the stage direction at the beginning of each scene.

In [10]:
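data["actors"] = None  # object column, so that each cell can hold a list
actors = [" "]
for el in range(len(data["Line"])):
    # a row whose Line is "1" opens a scene; the row before it is assumed to hold the stage direction
    if data.at[el+1, "Line"] == "1":
        actors = []
        parts = data.at[el, "PlayerLine"].replace("and", ",").replace("of", "").replace("LORD", "").split(",")
        for sent in parts:
            # blank out every word that is not fully upper case, keeping only the names
            for word in sent.split():
                if not word.isupper():
                    sent = sent.replace(word, "")
            actors.append(sent)
    # every row carries the actor list of its current scene
    data.at[el+1, "actors"] = actors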
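
The extraction step can be tried in isolation on a single stage direction. The function below is a simplified sketch, not the exact logic of the cell above: `extract_actors` is a hypothetical helper that keeps titles such as LORD and splits only on commas and the standalone word "and".

```python
def extract_actors(stage_direction):
    """Return the all-caps name fragments found in a stage direction."""
    actors = []
    for chunk in stage_direction.replace(" and ", ",").split(","):
        # keep only the words written entirely in capitals
        caps = [w.strip(".;:") for w in chunk.split() if w.isupper()]
        if caps:
            actors.append(" ".join(caps))
    return actors

print(extract_actors("Enter KING HENRY, LORD JOHN OF LANCASTER, and others"))
```

Returning joined name fragments rather than strings with blanked-out words avoids the stray whitespace left behind by repeated `replace` calls.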

        

Let's one-hot encode the Player names so we can match them later with the characters in the current scene.

In [11]:
y = pd.get_dummies(data.Player)
for col in y.columns:
    data[col] = y[col]

Next we need to check whether the actors match a player's name.


In [12]:
pnames = data['Player'].unique()
pnames[0] = ""
print(pnames)

['' 'KING HENRY IV' 'WESTMORELAND' 'FALSTAFF' 'PRINCE HENRY' 'POINS'
 'EARL OF WORCESTER' 'NORTHUMBERLAND' 'HOTSPUR' 'SIR WALTER BLUNT'
 'First Carrier' 'Ostler' 'Second Carrier' 'GADSHILL' 'Chamberlain'
 'BARDOLPH' 'PETO' 'First Traveller' 'Thieves' 'Travellers' 'LADY PERCY'
 'Servant' 'FRANCIS' 'Vintner' 'Hostess' 'Sheriff' 'Carrier' 'MORTIMER'
 'GLENDOWER' 'EARL OF DOUGLAS' 'Messenger' 'VERNON' 'WORCESTER'
 'ARCHBISHOP OF YORK' 'SIR MICHAEL' 'LANCASTER' 'BEDFORD' 'GLOUCESTER'
 'EXETER' 'OF WINCHESTER' 'CHARLES' 'ALENCON' 'REIGNIER'
 'BASTARD OF ORLEANS' 'JOAN LA PUCELLE' 'First Warder' 'Second Warder'
 'WOODVILE' 'Mayor' 'Officer' 'Boy' 'SALISBURY' 'TALBOT' 'GARGRAVE'
 'GLANSDALE' 'Sergeant' 'First Sentinel' 'BURGUNDY' 'Sentinels' 'Soldier'
 'Captain' 'OF AUVERGNE' 'Porter' 'PLANTAGENET' 'SUFFOLK' 'SOMERSET'
 'WARWICK' 'Lawyer' 'First Gaoler' 'KING HENRY VI' 'ALL' 'First Soldier'
 'Watch' 'FASTOLFE' 'BASSET' 'YORK' 'General' 'LUCY' 'JOHN TALBOT'
 'Legate' 'Scout' 'MARGARET' 'SU FFOL

Now we check the actors in the current scene, and for every actor in the scene we put a 1 in their name column. The idea behind this feature is that the classifier will hopefully learn to pick only from characters who are in the current scene, which greatly reduces the number of characters to choose from.

In [13]:
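for el in range(len(data["actors"])):
    matching = []
    sn = []
    for name in data.at[el+1, "actors"]:
        # matching is overwritten on every pass, so only the player names
        # containing the last word of this actor name are kept
        for word in name.split():
            matching = [s for s in pnames if word in s]

        sn.append(matching)
        # flatten the per-actor match lists into one list of player names
        flat_list = [item for sublist in sn for item in sublist]
    # mark every matched player as present in this row
    for thing in flat_list:
        data.at[el+1, thing] = 1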
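
The matching idea can be illustrated on a handful of names. This is a sketch under stated assumptions rather than the cell's exact behaviour: `players_in_scene` is a hypothetical helper that considers every word of each actor fragment and compares whole words instead of substrings, and the names are a small made-up sample.

```python
def players_in_scene(actors, player_names):
    """Return the player names that share at least one word with an actor fragment."""
    present = set()
    for fragment in actors:
        for word in fragment.split():
            for player in player_names:
                # whole-word comparison instead of substring matching
                if word in player.split():
                    present.add(player)
    return present

# Hypothetical sample of player names and extracted actor fragments
player_names = ["KING HENRY IV", "WESTMORELAND", "FALSTAFF", "PRINCE HENRY"]
present = players_in_scene(["KING HENRY", "WESTMORELAND"], player_names)

# 0/1 presence vector in the same order as player_names
row = [1 if p in present else 0 for p in player_names]
print(row)
```

Matching on single words is deliberately loose: "HENRY" marks both KING HENRY IV and PRINCE HENRY as present, so the feature narrows the candidates without identifying a speaker exactly.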
    
        

Now let's try our classifier on the new dataset and see whether it has improved.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = ["Player","PlayerLine", "Player_cat","actors"]), data["Player_cat"], test_size=.2, random_state=40)

In [17]:
clf =  GaussianNB()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)

print(score)

0.5834380610412926


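After adding our features, the classifier's accuracy score has gone up to around 58%, a significant increase from our previous score of around 22%. This increase suggests that the new features added value to the dataset. One caveat: the one-hot Player columns created in In [11] are still among the features, and each row has a 1 in its own speaker's column, so part of the gain may come from the target leaking into the inputs rather than from scene membership alone.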