# The Oscars

Dalton Hahn (2762306)

## Shakespearean Play Data

https://www.kaggle.com/kingburrito666/shakespeare-plays/download

## Data Visualization and Storytelling

I want to examine Shakespeare's plays and check for a few trends that I suspect are present in the data. Specifically, I will examine the following:
1. What is the ratio/trend in "airtime" that Shakespeare gives to men vs. women?
2. Does Shakespeare become more verbose in his later plays than in his earlier ones?
3. What proportion of "airtime" does Shakespeare grant to his main characters vs. his auxiliary characters?
4. What are the most-used words in the entire dataset (shown as a word cloud)?

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
import math
from statistics import mean, stdev

In [2]:
df = pd.read_csv("../data/external/Shakespeare_data.csv")

In [3]:
df.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


In [4]:
# Remove NaN rows from the dataset (these represent stage directions/non-dialogue)
print("With NaNs = ", df.shape)
df = df.dropna()
print("Without NaNs = ", df.shape)
df = df.reset_index(drop=True)
df.head()

With NaNs =  (111396, 6)
Without NaNs =  (105152, 6)


Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
1,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
2,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
3,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
4,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
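
The `dropna()` call above drops a row if *any* column is missing. A minimal sketch of a more targeted variant follows. It assumes that non-dialogue rows are the ones missing `Player` or `ActSceneLine`, which should be checked against the real file. The toy frame is made up for illustration and is not the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw dataset (hypothetical rows, same column names).
toy = pd.DataFrame({
    "Play": ["Henry IV", "Henry IV", "Henry IV"],
    "ActSceneLine": [np.nan, "1.1.1", "1.1.2"],
    "Player": [np.nan, "KING HENRY IV", "KING HENRY IV"],
    "PlayerLine": ["ACT I", "So shaken as we are,", "Find we a time,"],
})

# Drop only rows that lack a speaker or a line reference, then renumber the index.
dialogue = toy.dropna(subset=["Player", "ActSceneLine"]).reset_index(drop=True)
print(dialogue.shape)
```

Naming the columns makes the filter independent of missing values that might appear in unrelated columns later on.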


## Augmenting Data

We need to augment the data with some additional information in order to create the visualizations described earlier in the notebook.  Specifically, we need to add:

- Year each play was written/initially performed
- Gender of each character
- Role of each character (primary vs. auxiliary)

### Adding year to play

In [5]:
# Get the unique play set
print(df["Play"].unique())

['Henry IV' 'Henry VI Part 1' 'Henry VI Part 2' 'Henry VI Part 3'
 'Alls well that ends well' 'As you like it' 'Antony and Cleopatra'
 'A Comedy of Errors' 'Coriolanus' 'Cymbeline' 'Hamlet' 'Henry V'
 'Henry VIII' 'King John' 'Julius Caesar' 'King Lear' 'Loves Labours Lost'
 'macbeth' 'Measure for measure' 'Merchant of Venice'
 'Merry Wives of Windsor' 'A Midsummer nights dream'
 'Much Ado about nothing' 'Othello' 'Pericles' 'Richard II' 'Richard III'
 'Romeo and Juliet' 'Taming of the Shrew' 'The Tempest' 'Timon of Athens'
 'Titus Andronicus' 'Troilus and Cressida' 'Twelfth Night'
 'Two Gentlemen of Verona' 'A Winters Tale']


In [6]:
# Create dictionary with year corresponding to play and add as new column
# NOTE: Years of plays were taken from Wikipedia articles regarding the history of the plays (All Henry VI Part X written 1591)

year_dict = {
    "Henry IV": 1597, "Henry VI Part 1": 1591, "Henry VI Part 2": 1591,"Henry VI Part 3": 1591,
    "Alls well that ends well": 1598,"As you like it": 1599,"Antony and Cleopatra": 1607,"A Comedy of Errors": 1594,
    "Coriolanus": 1605,"Cymbeline": 1611,'Hamlet': 1599,'Henry V': 1599,'Henry VIII': 1613,'King John': 1595,
    'Julius Caesar': 1599,'King Lear': 1605,'Loves Labours Lost': 1598,'macbeth': 1606,'Measure for measure': 1603,
    'Merchant of Venice': 1596,'Merry Wives of Windsor': 1602,'A Midsummer nights dream': 1595,
    'Much Ado about nothing': 1598,'Othello': 1603,'Pericles': 1607,'Richard II': 1595,'Richard III': 1593,
    'Romeo and Juliet': 1591,'Taming of the Shrew': 1590,'The Tempest': 1610,'Timon of Athens': 1605,
    'Titus Andronicus': 1588,'Troilus and Cressida': 1602,'Twelfth Night': 1601,'Two Gentlemen of Verona': 1589,
    'A Winters Tale': 1610
}

# NOTE: df_years is another name for df (not a copy), so df also gains the "Year" column
df_years = df

# Look up each row's play in year_dict in one vectorized step
df_years["Year"] = df_years["Play"].map(year_dict)

In [7]:
df_years.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,Year
0,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,",1597
1,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",1597
2,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils,1597
3,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.,1597
4,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil,1597


In [8]:
df_years.to_csv("../data/processed/play_year.csv")
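
With a year attached to every line, the verbosity question can be approached by counting lines and words per play. The sketch below is one possible way to do that, not a result. The `sample` frame and its values are hypothetical, and "words" here simply means whitespace-separated tokens.

```python
import pandas as pd

# Hypothetical sample in the same shape as df_years (not real counts).
sample = pd.DataFrame({
    "Play": ["A", "A", "B", "B", "B"],
    "Year": [1591, 1591, 1605, 1605, 1605],
    "PlayerLine": ["one two", "three", "a b c", "d e", "f"],
})

# Words per line, splitting on whitespace.
sample["Words"] = sample["PlayerLine"].str.split().str.len()

# Total lines and words per play, ordered by year.
verbosity = (
    sample.groupby(["Year", "Play"])
    .agg(Lines=("PlayerLine", "size"), Words=("Words", "sum"))
    .reset_index()
    .sort_values("Year")
)
print(verbosity)
```

Plotting `Words` or `Lines` against `Year` would then show whether later plays tend to be longer.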

### Adding gender to characters

In [9]:
# Create dictionary with gender of character corresponding to play they are in

# NOTE: Uses work by Douglas Duhaime (https://github.com/duhaime/mining_the_bard), who previously
# wrote a script to populate an XML file with the gender information of 842 of Shakespeare's characters.
# We match these against our dataset where possible.

gender = pd.read_csv("../data/external/shakespeare_gender.txt", sep='\t', lineterminator='\n',
                       names=["File", "Character", "Num_Lines", "Play_Type", "Year", "Play", "Gender"])
gender = gender.drop(columns=["File", "Num_Lines", "Play_Type", "Year"])
gender.head()

Unnamed: 0,Character,Play,Gender
0,Lennox,Macbeth (1606),male
1,Duncan,Macbeth (1606),male
2,Seyton,Macbeth (1606),male
3,YoungSiward,Macbeth (1606),male
4,Banquo,Macbeth (1606),male


In [10]:
print(gender["Play"].unique())

['Macbeth (1606)' 'Much Ado about Nothing (1598)'
 'Antony and Cleopatra (1607)' "'The Winter''s Tale (1610)'"
 'Romeo and Juliet (1595)' 'Othello (1604)' 'Two Noble Kinsmen (1613)'
 'Two Gentlemen of Verona (1593)' 'King Lear (1605)' '1 Henry VI (1592)'
 'King John (1596)' 'As You Like It (1599)' 'Titus Andronicus (1594)'
 'Hamlet (1601)' 'Measure for Measure (1604)' '3 Henry VI (1591)'
 'The Merry Wives of Windsor (1600)' "'All''s Well that Ends Well (1602)'"
 '1 Henry IV (1597)' 'Henry V (1599)'
 "'A Midsummer Night''s Dream (1595)'" 'Richard II (1595)'
 'Julius Caesar (1599)' '2 Henry IV (1597)' 'Coriolanus (1608)'
 'The Taming of the Shrew (1594)' 'Timon of Athens (1607)'
 "'Love''s Labour Lost (1591)'" 'The Comedy of Errors (1592)'
 'Pericles (1608)' 'The Merchant of Venice (1596)'
 'Troilus and Cressida (1602)' 'Richard III (1593)' '2 Henry VI (1591)'
 'The Tempest (1611)' 'Cymbeline (1609)' 'Henry VIII (1613)'
 'Twelfth Night (1600)']


In [11]:
# Cleaning steps:
# 1. Remove years from all titles ("Macbeth (1606)" -> "Macbeth")
# 2. Rename 1 Henry VI -> Henry VI Part 1, etc.
# 3. Make all titles and characters uppercase in both dataframes
# 4. Remove unnecessary apostrophes
# 5. Encode female as 1 and male as 0

for index,row in gender.iterrows():
    row = row.copy()
    row["Play"] = row["Play"].split('(')[0][:-1].upper()
    row["Play"] = row["Play"].replace("'", "")
    if "1 HENRY VI" == row["Play"]:
        row["Play"] = "HENRY VI PART 1"
    elif "2 HENRY VI" == row["Play"]:
        row["Play"] = "HENRY VI PART 2"
    elif "3 HENRY VI" == row["Play"]:
        row["Play"] = "HENRY VI PART 3"
    elif "1 HENRY IV" == row["Play"] or "2 HENRY IV" == row["Play"]:
        row["Play"] = "HENRY IV"
    
    if row["Gender"] == "male":
        row["Gender"] = 0
    elif row["Gender"] == "female":
        row["Gender"] = 1
        
    gender.loc[index, "Play"] = row["Play"]
    gender.loc[index, "Character"] = row["Character"].upper()
    gender.loc[index, "Gender"] = row["Gender"]

In [12]:
gender.head()

Unnamed: 0,Character,Play,Gender
0,LENNOX,MACBETH,0
1,DUNCAN,MACBETH,0
2,SEYTON,MACBETH,0
3,YOUNGSIWARD,MACBETH,0
4,BANQUO,MACBETH,0
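
The same title clean-up can also be written with pandas string methods instead of a row-by-row loop. This is a minimal sketch on a few titles copied from the list printed above. The names `titles`, `renames`, and `clean` are illustrative and not part of the notebook.

```python
import pandas as pd

titles = pd.Series([
    "Macbeth (1606)",
    "'The Winter''s Tale (1610)'",
    "1 Henry VI (1592)",
    "2 Henry IV (1597)",
])

# Keep the text before "(", trim whitespace, drop apostrophes, uppercase.
clean = (
    titles.str.split("(").str[0]
    .str.strip()
    .str.replace("'", "", regex=False)
    .str.upper()
)

# Exact-match renames so the titles line up with the main dataset.
renames = {
    "1 HENRY VI": "HENRY VI PART 1",
    "2 HENRY VI": "HENRY VI PART 2",
    "3 HENRY VI": "HENRY VI PART 3",
    "1 HENRY IV": "HENRY IV",
    "2 HENRY IV": "HENRY IV",
}
clean = clean.replace(renames)

# Encode gender as integers: female = 1, male = 0.
encoded = pd.Series(["male", "female"]).map({"male": 0, "female": 1})
print(clean.tolist())
```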


In [13]:
df_gender = df

df_gender["Player"] = df_gender["Player"].str.upper()

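# Match on character name only: the play titles are formatted differently in the two sources.
# A name shared across plays produces several matches; only the first is kept when
# duplicates are dropped on the next line.
merged = pd.merge(df_gender,gender, left_on='Player', right_on="Character")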
merged = merged.drop_duplicates(subset=["Dataline"])
merged = merged.reset_index(drop=True)
merged = merged.drop(columns=["Character", "Play_y"])
merged.columns = ['Dataline', 'Play', 'PlayerLinenumber', 'ActSceneLine', 'Player',
       'PlayerLine', 'Year', 'Gender']

In [14]:
merged.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,Year,Gender
0,37,Henry IV,2.0,1.1.34,WESTMORELAND,"My liege, this haste was hot in question,",1597,0
1,38,Henry IV,2.0,1.1.35,WESTMORELAND,And many limits of the charge set down,1597,0
2,39,Henry IV,2.0,1.1.36,WESTMORELAND,But yesternight: when all athwart there came,1597,0
3,40,Henry IV,2.0,1.1.37,WESTMORELAND,"A post from Wales loaden with heavy news,",1597,0
4,41,Henry IV,2.0,1.1.38,WESTMORELAND,"Whose worst was, that the noble Mortimer,",1597,0


In [15]:
print(df_gender.shape)
print(merged.shape)

(105152, 7)
(71625, 8)
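
The shapes above show that 33,527 of the 105,152 dialogue lines found no gender match and were dropped by the inner merge. One way to see which speakers are affected is a left merge with `indicator=True`. The sketch below uses two small made-up frames (`lines` and `lookup`), not the notebook's data.

```python
import pandas as pd

# Hypothetical dialogue lines and gender lookup.
lines = pd.DataFrame({
    "Dataline": [1, 2, 3, 4],
    "Player": ["HAMLET", "HAMLET", "FIRST CLOWN", "OPHELIA"],
})
lookup = pd.DataFrame({
    "Character": ["HAMLET", "OPHELIA"],
    "Gender": [0, 1],
})

# A left merge keeps every line; the _merge column records whether a match was found.
checked = lines.merge(
    lookup, how="left", left_on="Player", right_on="Character", indicator=True
)

# Count the lines of each speaker that has no gender entry.
unmatched = checked.loc[checked["_merge"] == "left_only", "Player"].value_counts()
print(unmatched)
```

Sorting the unmatched speakers by line count shows where a manual gender entry would recover the most data.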


In [16]:
print("Able to match: ", len(list(set(gender["Character"]).intersection(merged["Player"]))), " characters with gender")

Able to match:  473  characters with gender


In [17]:
merged.to_csv("../data/processed/genders.csv")
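
Because gender is encoded as 0/1, the share of lines spoken by women in each play is the mean of the `Gender` column. A minimal sketch on made-up values follows. It casts with `astype(int)` on the assumption that the real column may be stored as a generic object column after the row-by-row assignment.

```python
import pandas as pd

# Hypothetical lines: one row per spoken line, Gender 1 = female, 0 = male.
sample = pd.DataFrame({
    "Play": ["X", "X", "X", "X", "Y", "Y"],
    "Gender": [0, 0, 0, 1, 1, 0],
})

# Make sure the column is numeric before averaging.
sample["Gender"] = sample["Gender"].astype(int)

# Mean of a 0/1 column = fraction of lines spoken by female characters.
female_share = sample.groupby("Play")["Gender"].mean()
print(female_share)
```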

### Adding role to characters

In [18]:
# Create dictionary with role (primary vs secondary) of character corresponding to play they are in

# NOTE: Uses work by Martin Grandjean http://www.martingrandjean.ch/network-visualization-shakespeare/
# who previously did network visualization work on Shakespeare's tragedies to describe the "true" main character
# of each work.  We use those findings to populate the character roles of the matching works in our
# original dataset.

# IF THE CHARACTER IS THE MAIN CHARACTER, THEN VALUE FOR MAIN COL WILL BE 1, ELSE 0

roles = {
    "Titus Andronicus": "Lavinia",
    "Romeo and Juliet": "Romeo",
    "Julius Caesar": "Brutus",
    "Hamlet": "Hamlet",
    "Troilus and Cressida": "Troilus",
    "Othello": "Othello",
    "King Lear": "King Lear",
    "Macbeth": "Rosse",
    "Timon of Athens": "Timon",
    "Antony and Cleopatra": "Mark Antony",
    "Coriolanus": "Coriolanus"
}


df_role = df

df_role["Play"] = df_role["Play"].str.upper()

roles =  {k.upper(): v for k, v in roles.items()}

roles_df = pd.DataFrame(list(roles.items()), columns=["Play", "Main"])

mer_role = pd.merge(df_role,roles_df, left_on='Play', right_on="Play")
mer_role = mer_role.drop_duplicates(subset=["Dataline"])
mer_role = mer_role.reset_index(drop=True)

# Flag each line: 1 if the speaker is the play's main character, else 0
main_names = mer_role["Main"].str.replace("'", "", regex=False).str.upper()
mer_role["Main"] = (mer_role["Player"] == main_names).astype(int)
    
mer_role.head(10)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,Year,Main
0,18569,ANTONY AND CLEOPATRA,1.0,1.1.1,PHILO,"Nay, but this dotage of our general's",1607,0
1,18570,ANTONY AND CLEOPATRA,1.0,1.1.2,PHILO,"O'erflows the measure: those his goodly eyes,",1607,0
2,18571,ANTONY AND CLEOPATRA,1.0,1.1.3,PHILO,That o'er the files and musters of the war,1607,0
3,18572,ANTONY AND CLEOPATRA,1.0,1.1.4,PHILO,"Have glow'd like plated Mars, now bend, now turn,",1607,0
4,18573,ANTONY AND CLEOPATRA,1.0,1.1.5,PHILO,The office and devotion of their view,1607,0
5,18574,ANTONY AND CLEOPATRA,1.0,1.1.6,PHILO,"Upon a tawny front: his captain's heart,",1607,0
6,18575,ANTONY AND CLEOPATRA,1.0,1.1.7,PHILO,Which in the scuffles of great fights hath burst,1607,0
7,18576,ANTONY AND CLEOPATRA,1.0,1.1.8,PHILO,"The buckles on his breast, reneges all temper,",1607,0
8,18577,ANTONY AND CLEOPATRA,1.0,1.1.9,PHILO,And is become the bellows and the fan,1607,0
9,18578,ANTONY AND CLEOPATRA,1.0,1.1.10,PHILO,To cool a gipsy's lust.,1607,0


In [19]:
mer_role.to_csv("../data/processed/roles.csv")
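
With the `Main` flag in place, the main vs. auxiliary "airtime" split per play can be read from a row-normalized cross-tabulation. The sketch below uses a small hypothetical frame rather than `mer_role` itself.

```python
import pandas as pd

# Hypothetical lines: Main = 1 when the speaker is the play's main character.
sample = pd.DataFrame({
    "Play": ["HAMLET"] * 4 + ["OTHELLO"] * 2,
    "Player": ["HAMLET", "HAMLET", "HAMLET", "HORATIO", "OTHELLO", "IAGO"],
    "Main": [1, 1, 1, 0, 1, 0],
})

# Each row sums to 1: column 1 is the main character's share of lines,
# column 0 is the share spoken by everyone else.
shares = pd.crosstab(sample["Play"], sample["Main"], normalize="index")
print(shares)
```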