## Trimming Player Line 

The goal of this notebook is to trim the `PlayerLine` column of the data so that it can be used to develop a model. Some of the operations, such as removing stop words, are expensive. To avoid rerunning them each time, I saved the results of the word trimming to a new file, `PlayerLine_trimmed.csv`.

The steps in this file are taken from: https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f

In [1]:
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn import tree

In [2]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from bs4 import BeautifulSoup
import string 
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer 
from nltk.stem import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arfritzz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/arfritzz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
data = pd.read_csv('data/1028_2124_bundle_archive/Shakespeare_data.csv')

In [4]:
data.columns

Index(['Dataline', 'Play', 'PlayerLinenumber', 'ActSceneLine', 'Player',
       'PlayerLine'],
      dtype='object')

In [5]:
data.dtypes

Dataline              int64
Play                 object
PlayerLinenumber    float64
ActSceneLine         object
Player               object
PlayerLine           object
dtype: object

What if we encoded each player as a number, so that we could model which player is saying what?
What if we only looked at the major players? We could count how many times each player speaks and keep only the ones that appear most often.

First, we need to delete the rows where `Player` is NaN. The call below drops every row that has a NaN in any column, which includes those rows.

In [6]:
data = data.dropna()
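
With no arguments, `dropna()` removes a row if any of its columns is NaN. If only a missing `Player` should disqualify a row, the `subset` parameter restricts the check to that column. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the Shakespeare data.
toy = pd.DataFrame({
    "Player": ["HAMLET", np.nan, "IAGO"],
    "PlayerLinenumber": [1.0, 2.0, np.nan],
    "PlayerLine": ["To be", "Exit", "I am not what I am"],
})

dropped_any = toy.dropna()                      # drops rows 1 and 2 (any NaN)
dropped_player = toy.dropna(subset=["Player"])  # drops only row 1 (NaN Player)
```

Which variant is appropriate depends on whether rows with a missing `PlayerLinenumber` are still useful for the model.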

In [7]:
data = data.drop(columns="Dataline")

In [8]:
data

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
5,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
6,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
7,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
...,...,...,...,...,...
111390,A Winters Tale,38.0,5.3.179,LEONTES,"Is troth-plight to your daughter. Good Paulina,"
111391,A Winters Tale,38.0,5.3.180,LEONTES,"Lead us from hence, where we may leisurely"
111392,A Winters Tale,38.0,5.3.181,LEONTES,Each one demand an answer to his part
111393,A Winters Tale,38.0,5.3.182,LEONTES,Perform'd in this wide gap of time since first


#### Naming Issue 
In "A Comedy of Errors", there are two characters named ANTIPHOLUS OF SYRACUSE and ANTIPHOLUS OF EPHESUS. I think the ANTIPHOLUS part of the name got separated from the rest, so we can delete the rows whose entry is just ANTIPHOLUS (the code matches on the `PlayerLine` column) and rename the `Player` entries OF SYRACUSE and OF EPHESUS to ANTIPHOLUS OF SYRACUSE and ANTIPHOLUS OF EPHESUS.

In "Love's Labour's Lost", the character Don Armado appears as DON in the `PlayerLine` column and as ADRIANO DE ARMADO in the `Player` column. So we can delete the rows whose `PlayerLine` is DON and rename the character to DON ARMADO in the `Player` column.

In [9]:
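data.drop(data.loc[data['PlayerLine']=="ANTIPHOLUS"].index, inplace=True)
# replace() returns a new Series, so assign it back to keep the renamed values
data['Player'] = data['Player'].replace({'OF SYRACUSE': 'ANTIPHOLUS OF SYRACUSE', 'OF EPHESUS': 'ANTIPHOLUS OF EPHESUS'})
data['Player']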

3         KING HENRY IV
4         KING HENRY IV
5         KING HENRY IV
6         KING HENRY IV
7         KING HENRY IV
              ...      
111390          LEONTES
111391          LEONTES
111392          LEONTES
111393          LEONTES
111394          LEONTES
Name: Player, Length: 104977, dtype: object

In [10]:
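data.drop(data.loc[data['PlayerLine']=="DON"].index, inplace=True)
# replace() returns a new Series, so assign it back to keep the renamed values
data['Player'] = data['Player'].replace({'ADRIANO DE ARMADO': 'DON ARMADO'})
data['Player']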

3         KING HENRY IV
4         KING HENRY IV
5         KING HENRY IV
6         KING HENRY IV
7         KING HENRY IV
              ...      
111390          LEONTES
111391          LEONTES
111392          LEONTES
111393          LEONTES
111394          LEONTES
Name: Player, Length: 104876, dtype: object
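
One detail worth checking in the two cells above: `Series.replace` returns a new Series and leaves the original column untouched unless the result is assigned. A small sketch with made-up rows that mimic the split names (the frame `toy` and the dictionary `renames` are illustrative):

```python
import pandas as pd

# Hypothetical rows mimicking the split names described above.
toy = pd.DataFrame(
    {"Player": ["OF SYRACUSE", "OF EPHESUS", "ADRIANO DE ARMADO", "HAMLET"]}
)

renames = {
    "OF SYRACUSE": "ANTIPHOLUS OF SYRACUSE",
    "OF EPHESUS": "ANTIPHOLUS OF EPHESUS",
    "ADRIANO DE ARMADO": "DON ARMADO",
}

toy["Player"].replace(renames)        # result is discarded; toy is unchanged
unchanged = toy["Player"].tolist()

toy["Player"] = toy["Player"].replace(renames)  # assign back to persist
renamed = toy["Player"].tolist()
```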

## Preprocessing text data

### Goal 

My goal was to preprocess the text data for analysis in the `ShakespeareAnalysis` notebook. I wanted to determine whether a line was positive or negative and see whether that could help increase the accuracy when identifying the player.

First, I removed punctuation from each of the lines. We really only care about the important words, not the punctuation around them. Note that this also fuses hyphenated words: "short-winded" becomes "shortwinded".

In [11]:
def remove_punctuation(text):
    # Keep every character that is not ASCII punctuation
    no_punct = "".join([c for c in text if c not in string.punctuation])
    return no_punct

In [12]:
data['PlayerLine'] = data['PlayerLine'].apply(lambda x: remove_punctuation(x))
data.head()

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
3,Henry IV,1.0,1.1.1,KING HENRY IV,So shaken as we are so wan with care
4,Henry IV,1.0,1.1.2,KING HENRY IV,Find we a time for frighted peace to pant
5,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe shortwinded accents of new broils
6,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote
7,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
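
The same result can be obtained with `str.translate`, which avoids the character-by-character Python loop. This is an alternative sketch, not what the notebook ran; the name `remove_punctuation_fast` is made up for the example:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Hypothetical variant of remove_punctuation built on str.translate."""
    return text.translate(_PUNCT_TABLE)

line = "And breathe short-winded accents of new broils"
cleaned = remove_punctuation_fast(line)
```

Like the original function, this deletes the hyphen rather than replacing it with a space, so compound words are merged.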


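Here, I lowercased each line and broke it into a list of words. The pattern `\w+` keeps each run of letters, digits, and underscores as one token, so anything else (including spaces) acts as a separator.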

In [13]:
tokenizer = RegexpTokenizer(r'\w+')

In [14]:
data['PlayerLine'] = data['PlayerLine'].apply(lambda x: tokenizer.tokenize(x.lower()))
data['PlayerLine'].head(10)

3        [so, shaken, as, we, are, so, wan, with, care]
4     [find, we, a, time, for, frighted, peace, to, ...
5     [and, breathe, shortwinded, accents, of, new, ...
6        [to, be, commenced, in, strands, afar, remote]
7     [no, more, the, thirsty, entrance, of, this, s...
8     [shall, daub, her, lips, with, her, own, child...
9     [nor, more, shall, trenching, war, channel, he...
10    [nor, bruise, her, flowerets, with, the, armed...
11           [of, hostile, paces, those, opposed, eyes]
12    [which, like, the, meteors, of, a, troubled, h...
Name: PlayerLine, dtype: object
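
For a pattern like `\w+`, the tokenizer behaves like a plain regular-expression search. A standard-library sketch of the same idea, assuming `re.findall` is an acceptable stand-in for `RegexpTokenizer` here (the helper name `tokenize` is made up):

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())

tokens = tokenize("So shaken as we are so wan with care")
# Punctuation that is still present simply splits the tokens.
split_on_hyphen = tokenize("troth-plight to your daughter.")
```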

Next, we must remove stop words. Stop words are words like "the" that get in the way of understanding the meaning of a line. 

In [15]:
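# Load the stop words once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text): 
    words = [w for w in text if w not in stop_words]
    return words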

In [16]:
data['PlayerLine'] = data['PlayerLine'].apply(lambda x: remove_stopwords(x))
data['PlayerLine'].head(5)

3                             [shaken, wan, care]
4             [find, time, frighted, peace, pant]
5    [breathe, shortwinded, accents, new, broils]
6              [commenced, strands, afar, remote]
7                       [thirsty, entrance, soil]
Name: PlayerLine, dtype: object
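
The English stop-word list targets modern English, so archaic pronouns such as "thee" pass through the filter (one appears in the training sample further down). A minimal sketch of extending the list; the small base set and the extra words are illustrative assumptions, not NLTK's actual list:

```python
# Stand-in for set(stopwords.words('english')); only a few entries, for illustration.
base_stop_words = {"the", "of", "a", "to", "with"}

# Hypothetical additions for Early Modern English; adjust to suit the analysis.
archaic = {"thee", "thou", "thy", "thine"}
extended_stop_words = base_stop_words | archaic

def filter_tokens(tokens, stop_words):
    """Return the tokens that are not in stop_words."""
    return [w for w in tokens if w not in stop_words]

kept = filter_tokens(["fair", "behavior", "thee", "captain"], extended_stop_words)
```

Whether to drop these words is a modelling choice: they may carry signal about which player is speaking.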

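Now, we reduce each word to its dictionary form (its lemma). For example, a conjugated word like "slept" can become "sleep". This helps us reason about the words without caring about every tense the word could be in. Two caveats about the cells below: `lemmatize` treats every word as a noun unless a part of speech is passed (for example `pos='v'`), and the result of `apply` is not assigned back to `data['PlayerLine']`, so the column is unchanged in the output.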

In [17]:
lemmatizer = WordNetLemmatizer() 

def word_lemmatizer(text): 
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text

In [18]:
data['PlayerLine'].apply(lambda x : word_lemmatizer(x))
data['PlayerLine'].head(5)

3                             [shaken, wan, care]
4             [find, time, frighted, peace, pant]
5    [breathe, shortwinded, accents, new, broils]
6              [commenced, strands, afar, remote]
7                       [thirsty, entrance, soil]
Name: PlayerLine, dtype: object

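Finally, we stem each word with the Porter stemmer and put the words of each line back together into a single string using the `join` function. Note that the result of `apply` is not assigned back, so `data['PlayerLine']` still holds the unstemmed token lists shown below.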

In [19]:
stemmer = PorterStemmer()

def word_stemmer(text): 
    stem_text = " ".join([stemmer.stem(i) for i in text])
    return stem_text

In [20]:
data['PlayerLine'].apply(lambda x : word_stemmer(x))
data['PlayerLine'].head(5)

3                             [shaken, wan, care]
4             [find, time, frighted, peace, pant]
5    [breathe, shortwinded, accents, new, broils]
6              [commenced, strands, afar, remote]
7                       [thirsty, entrance, soil]
Name: PlayerLine, dtype: object
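
To keep the result of a step like this, the Series returned by `apply` has to be assigned to a column. A minimal sketch on two made-up rows, joining token lists into strings without any stemming; the column name `PlayerLine_text` is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame(
    {"PlayerLine": [["shaken", "wan", "care"], ["thirsty", "entrance", "soil"]]}
)

# apply() returns a new Series; assign it to keep the joined strings.
toy["PlayerLine_text"] = toy["PlayerLine"].apply(" ".join)
```

Writing to a separate column keeps the token lists available for later steps.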

## Exporting data

After trimming, I exported the data to a CSV file. I used the first CSV, `PlayerLine_trimmed.csv`, when performing the analysis.

In [21]:
data.to_csv('PlayerLine_trimmed_2.csv')

There are 104876 rows of data. For model training, we only want to use about 80% of the data, so I create one subset for training and one for testing. `data_train = data.sample(frac=0.8)` returns 80% of the rows, but the other 20% for testing would then have to be recovered separately, so I shuffle the data and slice it instead.

In [22]:
data_shuf = shuffle(data)
data_train = data_shuf[:84000]
# Take the remaining 20876 rows; the last 21052 rows would overlap the training set
data_test = data_shuf[84000:]
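
If `sample` is preferred, the held-out rows can be recovered by dropping the sampled index from the full frame. A sketch on a made-up ten-row frame, assuming the index is unique (the `random_state` value is arbitrary and only makes the split repeatable):

```python
import pandas as pd

# Hypothetical 10-row frame standing in for the full dataset.
toy = pd.DataFrame({"Player": [f"P{i}" for i in range(10)]})

train = toy.sample(frac=0.8, random_state=0)  # 80% of the rows, chosen at random
test = toy.drop(train.index)                  # every row not selected for training
```

Because the test set is defined as the complement of the training set, the two cannot overlap and together cover every row.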

In [23]:
data_train

Unnamed: 0,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
49081,King Lear,17.0,1.1.50,KING LEAR,"[interest, territory, cares, state]"
598,Henry IV,42.0,1.3.269,EARL OF WORCESTER,"[scottish, prisoners]"
43085,Henry VIII,11.0,5.3.14,Chancellor,"[good, lord, archbishop, im, sorry]"
17094,As you like it,34.0,3.2.148,CELIA,"[heaven, would, gifts]"
20863,Antony and Cleopatra,15.0,3.13.45,CLEOPATRA,"[blown, rose, may, stop, nose]"
...,...,...,...,...,...
103000,Twelfth Night,19.0,1.2.49,VIOLA,"[fair, behavior, thee, captain]"
57048,macbeth,13.0,3.3.23,BANQUO,"[rain, tonight]"
43046,Henry VIII,7.0,5.2.22,CRANMER,"[mong, boys, grooms, lackeys, pleasures]"
47362,Julius Caesar,30.0,2.2.127,CAESAR,"[cinna, metellus, trebonius]"


### Mindless Musings

The following are just mindless things I did when brainstorming for this project. I kept them in case I wanted to reference them in future projects. 


Let's just look at the play and the player to find the most frequent players in each play. We can then say: if I am in play A, this player will speak the most.

In [24]:
# Keep only the Play and Player columns, then restrict to one play
play_player = data.drop(columns=["PlayerLinenumber", "ActSceneLine", "PlayerLine"])
play_player = play_player[play_player["Play"] == "Henry IV"]

In [25]:
play_player

Unnamed: 0,Play,Player
3,Henry IV,KING HENRY IV
4,Henry IV,KING HENRY IV
5,Henry IV,KING HENRY IV
6,Henry IV,KING HENRY IV
7,Henry IV,KING HENRY IV
...,...,...
3199,Henry IV,KING HENRY IV
3200,Henry IV,KING HENRY IV
3201,Henry IV,KING HENRY IV
3202,Henry IV,KING HENRY IV
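
Rather than filtering one play at a time, the most frequent speaker of every play can be computed in one pass with `groupby`. A sketch on made-up rows (the frame `toy` is illustrative; on the real data it would be the `Play` and `Player` columns):

```python
import pandas as pd

toy = pd.DataFrame({
    "Play": ["Hamlet", "Hamlet", "Hamlet", "Othello", "Othello", "Othello"],
    "Player": ["HAMLET", "HAMLET", "HORATIO", "IAGO", "OTHELLO", "IAGO"],
})

# For each play, count the lines per player and keep the most frequent one.
top_player = toy.groupby("Play")["Player"].agg(
    lambda players: players.value_counts().idxmax()
)
```

In case of a tie, `idxmax` returns a single player, so ties would need separate handling if they matter.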


`count_player` holds the number of lines spoken by each player in the training set, sorted from most to least frequent. Passing `normalize=True` to `value_counts` would return proportions instead of counts.

In [26]:
count_player = data_train['Player'].value_counts()
count_player

Player
GLOUCESTER       1446
HAMLET           1207
IAGO              893
FALSTAFF          828
KING HENRY V      804
                 ... 
Second Gaoler       1
JOHN MORTIMER       1
Tutor               1
LADY  CAPULET       1
NICHOLAS            1
Name: count, Length: 926, dtype: int64
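
A small sketch of the difference between raw and normalized counts, using a made-up Series of four lines:

```python
import pandas as pd

players = pd.Series(["HAMLET", "HAMLET", "IAGO", "HAMLET"], name="Player")

counts = players.value_counts()                # number of lines per player
shares = players.value_counts(normalize=True)  # fraction of all lines per player
```

The normalized values sum to 1, which makes plays or datasets of different sizes easier to compare.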

We can also put the counts into bins to better see the distribution of how many times a player speaks.

In [27]:
count_player.value_counts(bins=20)

(-0.446, 73.25]      616
(73.25, 145.5]       118
(145.5, 217.75]       71
(217.75, 290.0]       55
(290.0, 362.25]       20
(434.5, 506.75]       12
(506.75, 579.0]        9
(362.25, 434.5]        8
(651.25, 723.5]        6
(579.0, 651.25]        5
(795.75, 868.0]        2
(1157.0, 1229.25]      1
(723.5, 795.75]        1
(868.0, 940.25]        1
(1373.75, 1446.0]      1
(940.25, 1012.5]       0
(1012.5, 1084.75]      0
(1084.75, 1157.0]      0
(1229.25, 1301.5]      0
(1301.5, 1373.75]      0
Name: count, dtype: int64

We see that most players (616 of 926) speak at most 73 lines in the training set. I guess we could make a model that says: in any Shakespeare play, there will be x players who speak more than once and y players who speak only once.
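
The two quantities x and y can be read directly off a per-player count Series with boolean masks. A sketch with made-up counts (the values below are illustrative, not the real totals):

```python
import pandas as pd

# Toy stand-in for count_player: lines spoken per player.
toy_counts = pd.Series({"GLOUCESTER": 5, "HAMLET": 3, "Tutor": 1, "NICHOLAS": 1})

speak_once = int((toy_counts == 1).sum())  # players with exactly one line
speak_more = int((toy_counts > 1).sum())   # players with more than one line
```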