# To be, or not to be

#### Submitted By: Arman Ghasemi KU ID: 2970754 Email: arman.ghasemi@ku.edu


This project uses a dataset of all of Shakespeare's plays. The goal is to build one or more classification models that predict the player (the character speaking a line) using the other columns as features.

#### Data Source : https://www.kaggle.com/kingburrito666/shakespeare-plays

## Importing Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

### Reading Dataset

In [2]:
DF = pd.read_csv('https://raw.githubusercontent.com/armangh67/To-be-or-not-to-be/master/Datasest/Shakespeare_data.csv' , low_memory = False)
DF.head(10)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
8,9,Henry IV,1.0,1.1.6,KING HENRY IV,Shall daub her lips with her own children's bl...
9,10,Henry IV,1.0,1.1.7,KING HENRY IV,"Nor more shall trenching war channel her fields,"


### Cleaning the Dataset

In [3]:
DF.isnull().sum()

Dataline               0
Play                   0
PlayerLinenumber       3
ActSceneLine        6243
Player                 7
PlayerLine             0
dtype: int64

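All rows that contain a null value are dropped. As the first rows of the dataset show, these include act headings, scene headings, and stage directions, which have no player.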

In [4]:
DF = DF.dropna()
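
To make the effect of `dropna()` concrete, here is a minimal sketch on a small hand-made frame modelled on the first rows shown above. The frame `sample` is illustrative and is not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset: two structural rows and two spoken lines.
sample = pd.DataFrame({
    "Play": ["Henry IV"] * 4,
    "ActSceneLine": [None, None, "1.1.1", "1.1.2"],
    "Player": [None, None, "KING HENRY IV", "KING HENRY IV"],
    "PlayerLine": [
        "ACT I",
        "SCENE I. London. The palace.",
        "So shaken as we are, so wan with care,",
        "Find we a time for frighted peace to pant,",
    ],
})

# With default arguments, dropna() removes every row that holds at least one null.
cleaned = sample.dropna()
print(len(sample), "->", len(cleaned))
```

Only the two spoken lines survive; the original row labels (2 and 3) are kept, which is why the index in the later tables starts at 3.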

## Feature Engineering

The "ActSceneLine" column holds values of the form act.scene.line (for example 1.1.3), so it can be split into three columns as shown below.

In [5]:
acts = []
scenes = []
lines = []
for row in DF["ActSceneLine"]:
    act = row.split('.')[0]
    acts.append(act)

    scene = row.split('.')[1]
    scenes.append(scene)

    line = row.split('.')[2]
    lines.append(line)

In [6]:
DF.insert(len(DF.columns), 'Act', acts, True)
DF.insert(len(DF.columns), 'Scene', scenes, True)
DF.insert(len(DF.columns), 'Line', lines, True)
DF = DF.drop(['ActSceneLine'] , axis = 1)
DF.head(10)

Unnamed: 0,Dataline,Play,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line
3,4,Henry IV,1.0,KING HENRY IV,"So shaken as we are, so wan with care,",1,1,1
4,5,Henry IV,1.0,KING HENRY IV,"Find we a time for frighted peace to pant,",1,1,2
5,6,Henry IV,1.0,KING HENRY IV,And breathe short-winded accents of new broils,1,1,3
6,7,Henry IV,1.0,KING HENRY IV,To be commenced in strands afar remote.,1,1,4
7,8,Henry IV,1.0,KING HENRY IV,No more the thirsty entrance of this soil,1,1,5
8,9,Henry IV,1.0,KING HENRY IV,Shall daub her lips with her own children's bl...,1,1,6
9,10,Henry IV,1.0,KING HENRY IV,"Nor more shall trenching war channel her fields,",1,1,7
10,11,Henry IV,1.0,KING HENRY IV,Nor bruise her flowerets with the armed hoofs,1,1,8
11,12,Henry IV,1.0,KING HENRY IV,"Of hostile paces: those opposed eyes,",1,1,9
12,13,Henry IV,1.0,KING HENRY IV,"Which, like the meteors of a troubled heaven,",1,1,10
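
The same split can also be written without an explicit loop. The following is a sketch of one alternative, assuming every value has exactly three dot-separated integer parts; the frame `sample` and its values are illustrative. Unlike the loop above, which stores the parts as strings, this version casts them to integers.

```python
import pandas as pd

# Illustrative values in the same "act.scene.line" format as the dataset.
sample = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "Hamlet"],
    "ActSceneLine": ["1.1.1", "1.1.2", "3.2.45"],
})

# Split on '.', spread the three parts into columns, and cast them to int.
parts = sample["ActSceneLine"].str.split(".", expand=True).astype(int)
parts.columns = ["Act", "Scene", "Line"]

# Replace the combined column with the three new ones.
result = pd.concat([sample.drop(columns="ActSceneLine"), parts], axis=1)
print(result)
```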


The "Player" and "Play" columns can also be encoded as integer codes so that they are usable by the classification models.

In [7]:
LE = LabelEncoder()
DF['PlayerCode'] = LE.fit_transform(DF['Player'].astype('str')) #generating PlayerCode column

In [8]:
DF['PlayCode'] = LE.fit_transform(DF['Play'].astype('str')) #generating PlayCode column
DF.head(10)

Unnamed: 0,Dataline,Play,PlayerLinenumber,Player,PlayerLine,Act,Scene,Line,PlayerCode,PlayCode
3,4,Henry IV,1.0,KING HENRY IV,"So shaken as we are, so wan with care,",1,1,1,457,9
4,5,Henry IV,1.0,KING HENRY IV,"Find we a time for frighted peace to pant,",1,1,2,457,9
5,6,Henry IV,1.0,KING HENRY IV,And breathe short-winded accents of new broils,1,1,3,457,9
6,7,Henry IV,1.0,KING HENRY IV,To be commenced in strands afar remote.,1,1,4,457,9
7,8,Henry IV,1.0,KING HENRY IV,No more the thirsty entrance of this soil,1,1,5,457,9
8,9,Henry IV,1.0,KING HENRY IV,Shall daub her lips with her own children's bl...,1,1,6,457,9
9,10,Henry IV,1.0,KING HENRY IV,"Nor more shall trenching war channel her fields,",1,1,7,457,9
10,11,Henry IV,1.0,KING HENRY IV,Nor bruise her flowerets with the armed hoofs,1,1,8,457,9
11,12,Henry IV,1.0,KING HENRY IV,"Of hostile paces: those opposed eyes,",1,1,9,457,9
12,13,Henry IV,1.0,KING HENRY IV,"Which, like the meteors of a troubled heaven,",1,1,10,457,9
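
Because the single `LE` object is fitted a second time on `Play`, it no longer holds the `Player` classes afterwards. A possible variation, sketched here on illustrative data, keeps one encoder per column so that codes can be mapped back to names later. The names `player_encoder` and `play_encoder` are assumptions and do not appear in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows; not taken from the real dataset.
sample = pd.DataFrame({
    "Play": ["Henry IV", "Hamlet", "Hamlet"],
    "Player": ["KING HENRY IV", "HAMLET", "HORATIO"],
})

# One encoder per column, so each keeps its own classes_ mapping.
player_encoder = LabelEncoder()
play_encoder = LabelEncoder()
sample["PlayerCode"] = player_encoder.fit_transform(sample["Player"])
sample["PlayCode"] = play_encoder.fit_transform(sample["Play"])

# Codes can be turned back into the original names.
decoded = player_encoder.inverse_transform(sample["PlayerCode"].to_numpy())
print(list(decoded))
```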


With feature engineering we added the columns Act, Scene, Line, PlayerCode, and PlayCode, which give the classification models more information to work with than the raw text columns alone.

## Classification Models

## k-nearest neighbors algorithm

In [9]:
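# Features: every column except the text columns.
# Note: PlayerCode, which is an encoding of the target Player, remains in X.
X = DF.drop(['Play', 'Player', 'PlayerLine'], axis=1)
# Target: the player speaking the line.
Y = DF['Player']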

In [10]:
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.4)
knn = KNeighborsClassifier(leaf_size = 50 , n_neighbors = 7)
knn.fit(xTrain, yTrain)

KNeighborsClassifier(algorithm='auto', leaf_size=50, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')

In [11]:
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(xTest , yTest)))

Accuracy of K-NN classifier on test set: 0.92


In [12]:
yPrediction = knn.predict(xTest)
Prediction_Table = pd.DataFrame ({'Actual':yTest , 'Prediction': yPrediction})
Prediction_Table.head(20)

Unnamed: 0,Actual,Prediction
31697,First Brother,First Brother
71636,DOGBERRY,DON PEDRO
75739,OTHELLO,OTHELLO
79249,KING RICHARD II,KING RICHARD II
10703,KING HENRY VI,KING HENRY VI
39218,FLUELLEN,FLUELLEN
96472,SATURNINUS,SATURNINUS
17369,JAQUES,JAQUES
18851,MARK ANTONY,MARK ANTONY
33356,HAMLET,HAMLET
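
The value returned by `score` is the mean accuracy, that is, the fraction of test rows whose prediction equals the actual label. As a small worked example, the sketch below recomputes that fraction by hand for only the ten rows displayed above; it says nothing about the rest of the test set.

```python
# The ten (actual, predicted) pairs shown in the K-NN output table above.
pairs = [
    ("First Brother", "First Brother"),
    ("DOGBERRY", "DON PEDRO"),
    ("OTHELLO", "OTHELLO"),
    ("KING RICHARD II", "KING RICHARD II"),
    ("KING HENRY VI", "KING HENRY VI"),
    ("FLUELLEN", "FLUELLEN"),
    ("SATURNINUS", "SATURNINUS"),
    ("JAQUES", "JAQUES"),
    ("MARK ANTONY", "MARK ANTONY"),
    ("HAMLET", "HAMLET"),
]

# Accuracy = number of matching pairs divided by the number of pairs.
correct = sum(actual == predicted for actual, predicted in pairs)
accuracy = correct / len(pairs)
print(f"{correct} of {len(pairs)} correct, accuracy {accuracy:.2f}")
# prints "9 of 10 correct, accuracy 0.90"
```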


## Random Forest

In [13]:
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.4)
RF = RandomForestClassifier(n_estimators=50)
RF.fit(xTrain, yTrain)
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(RF.score(xTest, yTest)))

Accuracy of Random Forest classifier on test set: 0.99


In [14]:
yPrediction = RF.predict(xTest)
Prediction_Table = pd.DataFrame ({'Actual':yTest , 'Prediction': yPrediction})
Prediction_Table.head(20)

Unnamed: 0,Actual,Prediction
38955,KING HENRY V,KING HENRY V
99358,ALEXANDER,ALEXANDER
80842,DUCHESS OF YORK,DUCHESS OF YORK
57407,MACBETH,MACBETH
86860,ROMEO,ROMEO
13989,BERTRAM,BERTRAM
78851,KING RICHARD II,KING RICHARD II
102239,THERSITES,THERSITES
40328,BUCKINGHAM,BUCKINGHAM
73684,DESDEMONA,DESDEMONA


## Conclusion

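This was a really interesting dataset to work with. First, we split ActSceneLine into three separate columns (Act, Scene, and Line) and encoded the Play and Player columns so they could be used as attributes in our models. We then trained two classification models, K-Nearest Neighbors and Random Forest. As shown above, both reached very high accuracy (0.92 and 0.99), which is questionable: the feature matrix X still contains PlayerCode, a direct encoding of the target Player, so the models can read the label from the features. The two models were also scored on different random splits, so their accuracies are not strictly comparable.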
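
One way to probe these results would be to rebuild the features without `PlayerCode`, which the `X` defined in cell In [9] still contains, and to reuse a single fixed split for both models. The sketch below shows only that setup on a hypothetical miniature frame; the values, the `leaky` list, and `random_state=0` are assumptions, and no accuracy is claimed for the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the engineered frame (values are illustrative).
sample = pd.DataFrame({
    "Dataline": list(range(4, 14)),
    "PlayerLinenumber": [1.0] * 5 + [2.0] * 5,
    "Act": [1] * 10,
    "Scene": [1] * 10,
    "Line": list(range(1, 11)),
    "Player": ["KING HENRY IV"] * 5 + ["WESTMORELAND"] * 5,
    "PlayerCode": [0] * 5 + [1] * 5,
    "PlayCode": [9] * 10,
})

target = "Player"
leaky = ["PlayerCode"]  # columns derived directly from the target

# Drop the target and anything that encodes it.
X = sample.drop(columns=[target] + leaky)
Y = sample[target]

# A fixed random_state yields one split that both models could share.
xTrain, xTest, yTrain, yTest = train_test_split(
    X, Y, test_size=0.4, random_state=0
)
print(list(X.columns))
```

Fitting both classifiers on this shared `xTrain` and scoring them on the same `xTest` would make their accuracies directly comparable.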