This project seeks to create a model that can figure out the player in a Shakespeare play based on lines. 

# Brainstorming

## Exploring the Data
* Find most commonly used words accross all plays
* Find most commonly used words within each play
* Extract all play names
* Extract the names of each player in each play

## Classification Ideas
* Play  
 * Allows to quickly cut out large chunks of possible players
 * Extremely coarse grained, cannot find player within a play (unless only one player)
 * Cannot choose a player not in the Play
 * Machine learning unnecessary for this one
* Act and Scene
 * Can be derived from 
* 'Word Cloud' of PlayLine
 * Possible issue: overlap of common words between players
 * Refinements
  * Remove common words across all plays
  * Remove common words within a play
* Combine the two ideas into a two stage process
 1. Create a different model for each play
 2. Choose model based on whichever play the lines came from

In [1]:
import numpy as np
import pandas as pd
import string
import collections

# Cleaning the Data
* Remove datapoints where Player = NaN
* Remove punctuation from PlayerLine
* Make all words in PlayerLine lowercase

In [2]:
shakespeare = pd.read_csv("../data/Shakespeare_data.csv")
shakespeare = shakespeare.dropna()
asl = list(map(lambda x: x.split('.'), shakespeare['ActSceneLine'].to_list()))
act = list(map(lambda x: x[0], asl))
scene = list(map(lambda x: x[1], asl))
line  = list(map(lambda x: x[2], asl))
shakespeare['Act'] = act
shakespeare['Scene'] = scene
shakespeare['Line'] = line
lines = shakespeare['PlayerLine'].to_list()
clean_lines = []
for l in lines:
    lp = l
    for p in string.punctuation:
        lp = lp.replace(p,'')
    clean_lines.append(lp.lower().strip())
shakespeare['CleanLine'] = clean_lines
shakespeare

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,Act,Scene,Line,CleanLine
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,",1,1,1,so shaken as we are so wan with care
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",1,1,2,find we a time for frighted peace to pant
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils,1,1,3,and breathe shortwinded accents of new broils
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.,1,1,4,to be commenced in strands afar remote
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil,1,1,5,no more the thirsty entrance of this soil
...,...,...,...,...,...,...,...,...,...,...
111390,111391,A Winters Tale,38.0,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,",5,3,179,is trothplight to your daughter good paulina
111391,111392,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely",5,3,180,lead us from hence where we may leisurely
111392,111393,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part,5,3,181,each one demand an answer to his part
111393,111394,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first,5,3,182,performd in this wide gap of time since first


In [3]:
WordTuple = collections.namedtuple('WordTuple','Play PlayerLinenumber Player PlayerLine Act Scene Line CleanLine Word')
newtuples = []
for t in shakespeare.itertuples(index=False):
    for l in t.CleanLine.split():
        nt = WordTuple(t.Play, t.PlayerLinenumber, t.Player, t.PlayerLine, t.Act, t.Scene, t.Line, t.CleanLine, l)
        newtuples.append(nt)
shakespeare = pd.DataFrame(newtuples, columns =['Play', 'PlayerLinenumber', 'Player', 'PlayerLine', 'Act', 'Scene', 'Line', 'CleanLine', 'Word'])

0                so
1            shaken
2                as
3                we
4               are
            ...    
788773         were
788774    disseverd
788775      hastily
788776         lead
788777         away
Name: Word, Length: 788778, dtype: object

In [13]:
player_words = shakespeare.loc[:,['Player', 'Word']]
new_table_p = player_words.pivot_table(index='Player', columns='Word', aggfunc=len, fill_value=0)

In [6]:
cols = list(new_table.columns)
ind = list(new_table.index)
new_table[cols] = new_table_p[cols].div(new_table_p[cols].sum(axis=1), axis=0)

In [14]:
new_table

Word,1,10,2,2d,2s,3,4,4d,5,5s,...,zenelophon,zenith,zephyrs,zir,zo,zodiac,zodiacs,zone,zounds,zwaggered
Player,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A Lord,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A Patrician,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A Player,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AARON,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
ABERGAVENNY,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Young MARCIUS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
of BUCKINGHAM,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
of King Henry VI,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
of Prince Edward,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Unfortunately
I spent too much time trying to wrangle the data into a usable format. After getting to the point where I had a matrix of Players and Words, I couldn't turn it into a form that made sense to scikit-learn. 