## EECS 731 Project 2
### Adam Podgorny

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.cluster import KMeans

In [2]:
shakespeare = pd.read_csv("Shakespeare_data.csv")

In [3]:
shakespeare

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
...,...,...,...,...,...,...
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first
111394,111395,A Winters Tale,38.0,5.3.183,LEONTES,We were dissever'd: hastily lead away.


So now that we have the dataset imported and a cursory glance at it, let's see what's here. First and foremost, we should get rid of the extraneous lines that don't have any characters associated with them, as those will not contribute to the analysis at all.

I have a bit of an unfair advantage here, since I took a class specifically on Shakespeare in my high school, with a teacher who wore a Shakespeare tie most days. So with this in mind, and just thinking about literature in general, the number of lines, and the length of those lines, are probably indicative to some degree of who is speaking. While this may not be a perfect indicator, a ranking of linespace by actor may give a slight probabilitistic advantage. We would expect central characters to speak more, and have longer lines. Implicitly, longer runs of lines may indicate a significant player. 

As Shakespeare characters like to think about themselves a lot as well, the number of times per line a character uses first person references {I, my, mine}, this may be a useful fallback indicator.

The first obvious partition point would be in the play itself. Players from other plays will obvious not be appearing, so that is logical as a first categorization point. I will be using OneHotEncoding to try to make this play nicely with KMeans, as there shouldn't be too many extracted features from the lines themselves. From this I think that so long as the play itself sort of creates the hyperplane, the dimensionality shouldn't be too much of an issue.


To that end, I think a decision tree should work pretty well, and I am screwing around with K Means just to see what it can do. Mostly because I like to use KMeans as a basic approach and test against that. 

Obviously, I _could_ feature engineer by say, assigning the number of lines a character has to the speaker's line, which will more or less map uniquely to the target class, but that feels cheap, so I won't do it. 

In [4]:
shakespeare_ = shakespeare.dropna()
shakespeare_ = shakespeare_.reset_index()
shakespeare_ = shakespeare_.drop(['Dataline'], axis=1)
shakespeare_ = shakespeare_.drop(['index'], axis=1)
# First off, let's drop the lines that don't contain any lines,
#or characters, as these aren't really relevant to our analysis
##And the dataline isn't really useful anymore, nor is the spliced indices, so, they're gone
shakespeare_

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
1,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
2,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
3,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
4,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...
105147,A Winters Tale,38.0,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
105148,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
105149,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
105150,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first


In [5]:
shakespeare_['line_length'] = shakespeare['PlayerLine'].apply(lambda x : len(x))

Much better

In [6]:
plays = shakespeare_['Play'].unique()
len(plays)
#That's managable

36

In [7]:
players = shakespeare_['Player'].unique()
len(players)

# That's a lot. K means may not work, but at least it will be funny. And this shows a decision tree approach may be best anyway

934

In [8]:
#Thanks person on stackoverflow
#shakespeare_["SceneLength"] = 0
#shakespeare_

players = shakespeare_['Player'].tolist()
scene_length = []
players

c = 1
bf = False
s = False
lp = len(players)
for i in range(0, len(players)):
    if ((players[i] != players[i-1]) or (s == False)):
        c = 1
        s = True
        pcurrent = players[i]
        pnext = players[i+1]
        while ((pcurrent == pnext) and (bf == False)):
            if ((i + c) < lp):
                pcurrent = players[i + c - 1]
            else: 
                bf = True
            c = c + 1
            if ((i + c) < lp):
                pcurrent = players[i + c - 1]
        scene_length.append(c-1)
    else:
        scene_length.append(c-1)

In [9]:
shakespeare_["SceneLength"] = scene_length

In [10]:
shakespeare_['firstpersoncount'] = shakespeare_["PlayerLine"].apply(lambda x: x.count(" I ") + x.count(" my ") + x.count(" My"))

In [11]:
#shakespeare_['Act', "Scene", "Line"] = shakespeare_['ActSceneLine'].apply(lambda x: x.split(".")[0], x.split(".")[1], x.split(".")[2])

shakespeare_['Act'] = shakespeare_['ActSceneLine'].apply(lambda x: x.split(".")[0])
shakespeare_['Act'] = pd.to_numeric(shakespeare_['Act'])
shakespeare_['Scene'] = shakespeare_['ActSceneLine'].apply(lambda x: x.split(".")[1])
shakespeare_['Scene'] = pd.to_numeric(shakespeare_['Scene'])
shakespeare_['Line'] = shakespeare_['ActSceneLine'].apply(lambda x: x.split(".")[2])
shakespeare_['Line'] = pd.to_numeric(shakespeare_['Line'])

Interestingly, these features correlate very little with each other (except the obvious). This is either very scary or very useful in that they don't have a lot of mutual entropy between them.

In [12]:
shakespeare_.corr()

Unnamed: 0,PlayerLinenumber,line_length,SceneLength,firstpersoncount,Act,Scene,Line
PlayerLinenumber,1.0,0.000148,-0.05982,0.010162,0.092598,-0.115289,0.902706
line_length,0.000148,1.0,-0.004532,0.005971,0.005451,-0.011869,-0.003301
SceneLength,-0.05982,-0.004532,1.0,-0.027933,-0.019013,-0.009795,0.015393
firstpersoncount,0.010162,0.005971,-0.027933,1.0,-0.00399,-0.001828,0.009116
Act,0.092598,0.005451,-0.019013,-0.00399,1.0,0.076692,0.061356
Scene,-0.115289,-0.011869,-0.009795,-0.001828,0.076692,1.0,-0.121451
Line,0.902706,-0.003301,0.015393,0.009116,0.061356,-0.121451,1.0


You know, I'm not sure K Means will work here. Perhaps a better approach to benchmark against DTs will be best, so I will use random forests

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

I am also going to save this so I can try to facet the data and see what cool things shake out

In [14]:
shakespeare_.to_csv("p_shakespeare.csv")

If intuition about the attached screenshots of the facets holds, there is in fact some utility in the derived features, or at the very least, some probabilistic reason to try to use them. While the individual blocks look small, this is also over the entire works of Shakespeare, and as such, will artificially make the contribution of any one character look small. I am trying to get more faceting done, but I believe the site is unhappy with me as it keeps crashing after increasingly shorter intervals. Indeed, though, we do see some elevated line counts as being characteristic of some players. This however means, predictions may be more confident towards the end of the play, rather than the beginning.

On to Model Building

I'm going to use the sklearn train_test split to train these. I am not too worried about the fact things that should be sequential will be out of sequence. Theoretically, we could do time series on this. In fact, implicitly time series is done due to having the scene length assigned to any given players speaking time. This in turn, could be used to try to assemble sequence of a play without the line numbers, if we were so interested.

In [17]:
players_true = shakespeare_['Player']
subspeare = shakespeare_.drop(['Player'], axis=1)
subspeare = subspeare.drop(['ActSceneLine'], axis=1)
subspeare = subspeare.drop(['PlayerLine'], axis=1)
subspeare = pd.get_dummies(subspeare, columns=['Play'])
#players_true = pd.get_dummies(players_true, columns=['Player'])

train_x, test_x, train_y, test_y = train_test_split(subspeare, players_true, test_size=0.15, random_state=1337)

In [18]:
d_tree = tree.DecisionTreeClassifier()
d_tree.fit(train_x, train_y)

rf = RandomForestClassifier(random_state=0)
rf.fit(train_x, train_y)

#km = KMeans(n_clusters = 934, random_state=0)
#km.fit(train_x, train_y)

##K means excluded for time factors as training 934 clusters is time intensive

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

A smarter way to avoid the K means problem would have been to run an analysis for each play. But then creating a unified model of that may be difficult unless a preprocessing step is added to the selection for evaluation. Which is doable, but at this juncture, comparing one decision tree against a random forest felt like a smarter use of time

In [20]:
d_tree.score(test_x, test_y)

0.8241298421352945

In [25]:
rf.score(test_x, test_y)

0.7844417675775058

As we can see, the decision tree was able to very effectively use the data we've seen in the columns, which considering how little real context there is of the original text itself, is fairly impressive. This is especially true consideirng that there were 934 players involved. Reduction of the players who only had say, one line, would have potentially given more accuracy.

Interestingly, the random forest performed _worse_ somehow. This leads to to believe that there was some overfitting that occured in the course of training, or that some selective feature reduction may have been necessary. It is also possible there will simply too many trees and there was some sort of wierd consensus poisoning going on. That isn't to say the Random Forest performed poorly, and perhaps a different seed would perform differently. This may also indicate the need for a different train/test split. 
