# Statspeare

The following is a series of data explorations and analyses of the works of William Shakespeare. The data comes from the public Folger Shakespeare API, which can be accessed here: https://www.folgerdigitaltexts.org/api?_ga=2.253225545.110838517.1589207285-56030160.1526399691.

The main quantity analyzed here is the number of words a given character speaks in a play. The analyses cover how words are distributed among the characters within a play and how those distributions compare between plays.

## Shortcomings and Disclaimers
### Word Count
There are a few shortcomings in using word counts to measure how much a character speaks in a play. For instance, characters who speak in simple prose with short words will reach a high count more easily than characters who speak in longer, more elaborate words.
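
The effect can be illustrated with a small sketch. The two speeches below are invented sample strings, not quotations from any play, and `word_count` and `letter_count` are hypothetical helper names used only for this illustration.

```python
# Invented sample speeches (not quotations from any play).
plain = "I do not know what I am to do now"
ornate = "Multitudinous considerations overwhelm deliberation"


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def letter_count(text):
    """Count alphabetic characters only, ignoring spaces and punctuation."""
    return sum(ch.isalpha() for ch in text)


for label, speech in [("plain", plain), ("ornate", ornate)]:
    print(label, word_count(speech), letter_count(speech))
```

Here the plain speech has more words than the ornate one even though it contains fewer letters, which is the bias described above.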

Some might ask why I do not use line count instead of word count. However, I do not think line count is any more accurate than word count as a quantitative measure of how much a character speaks in a play. A major issue with line count is the inconsistency in counting lines of prose. Whereas it is easy to count the verse lines in a play (i.e. one line of iambic pentameter = one line), there is no accurate way to carry that over to prose, where the line breaks are not fixed by the meter. Second, there is no verified database from which to retrieve line count information. While a few online resources publish such statistics, like ShakespearesWords.com (https://www.shakespeareswords.com/Public/LanguageCompanion/CharactersParts.aspx), they are not from verified Shakespeare institutions. Furthermore, I cannot verify how these sources derive their numbers or how they count lines of prose, so I have set those numbers aside.

### Genre
When discussing the Shakespeare canon, it is common to group the plays by genre. In the First Folio (the first collected edition of Shakespeare's plays), the plays were grouped into three categories: Comedies, Histories, and Tragedies. However, two plays now considered part of the canon, *Pericles* and *The Two Noble Kinsmen*, were not in the First Folio.

As Shakespeare scholarship progressed, plays that sit on the border between comedy and tragedy came to be given their own genres and subgenres. Shakespeare's late plays, such as *Cymbeline*, *The Winter's Tale*, and *The Tempest*, are often listed in a separate fourth genre, usually dubbed *Romances* and occasionally *Tragicomedies*. Another popular subgenre is the *Problem Play*, which refers to plays from the middle of Shakespeare's career that mix comedy and melodrama with a pessimistic outlook, controversial subject matter, and ambiguous endings, such as *Measure for Measure*, *All's Well That Ends Well*, and *Troilus and Cressida*. Problem plays are often treated as a subgenre of the Comedies, so these three plays are typically listed with the Comedies in modern Complete Works editions. (The First Folio did the same, except for *Troilus and Cressida*, which was placed with the tragedies but left out of the table of contents.)

When adding genre as a column, I defaulted to my personal copy of the Complete Works (the 7th edition, edited by David Bevington), which uses Comedies, Histories, Tragedies, and Romances as its four genres.

## Libraries

Python libraries used throughout this notebook:

In [3]:
import requests
import pandas
from bs4 import BeautifulSoup
import numpy as np

## Play Codes
Plays have a dedicated letter code attached to them. They will be used frequently so they are up here for reference.

*Author's personal note: they are also super inconsistent, which frustrates me to no end.*

In [4]:
# official codes for plays in the Folger system
PlayCodes = ['AWW', 'Ant', 'AYL', 'Err', 'Cor', 'Cym', 'Ham',
'1H4', '2H4', 'H5', '1H6', '2H6', '3H6', 'H8', 'JC', 'Jn', 'Lr',
'LLL', 'Mac', 'MM', 'MV', 'Wiv', 'MND', 'Ado', 'Oth', 'Per', 'R2',
'R3', 'Rom', 'Shr', 'Tmp', 'Tim', 'Tit', 'Tro', 'TN', 'TGV', 'TNK', 'WT']
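
For readers less familiar with the abbreviations, the codes can be paired with play titles. The mapping below is a reference sketch added for convenience, not something returned by the Folger API. The titles follow common usage, so the Folger's own spellings may differ slightly, and `PLAY_TITLES` is a hypothetical name.

```python
# Hypothetical lookup from Folger play code to a commonly used play title.
PLAY_TITLES = {
    "AWW": "All's Well That Ends Well",
    "Ant": "Antony and Cleopatra",
    "AYL": "As You Like It",
    "Err": "The Comedy of Errors",
    "Cor": "Coriolanus",
    "Cym": "Cymbeline",
    "Ham": "Hamlet",
    "1H4": "Henry IV, Part 1",
    "2H4": "Henry IV, Part 2",
    "H5": "Henry V",
    "1H6": "Henry VI, Part 1",
    "2H6": "Henry VI, Part 2",
    "3H6": "Henry VI, Part 3",
    "H8": "Henry VIII",
    "JC": "Julius Caesar",
    "Jn": "King John",
    "Lr": "King Lear",
    "LLL": "Love's Labor's Lost",
    "Mac": "Macbeth",
    "MM": "Measure for Measure",
    "MV": "The Merchant of Venice",
    "Wiv": "The Merry Wives of Windsor",
    "MND": "A Midsummer Night's Dream",
    "Ado": "Much Ado About Nothing",
    "Oth": "Othello",
    "Per": "Pericles",
    "R2": "Richard II",
    "R3": "Richard III",
    "Rom": "Romeo and Juliet",
    "Shr": "The Taming of the Shrew",
    "Tmp": "The Tempest",
    "Tim": "Timon of Athens",
    "Tit": "Titus Andronicus",
    "Tro": "Troilus and Cressida",
    "TN": "Twelfth Night",
    "TGV": "The Two Gentlemen of Verona",
    "TNK": "The Two Noble Kinsmen",
    "WT": "The Winter's Tale",
}

# Example lookup for a single code
print(PLAY_TITLES["Wiv"])
```

A lookup like this can be used to label tables and plots with full titles instead of codes.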

## Data Retrieval
The following code scrapes the data from the Folger API and saves it as a CSV file.

In [22]:

# create table
df = pandas.DataFrame(columns=["Character", "Words", "Play"])

# actual webscraping
def webscrape(df, pc):
    # grab site and set scraper
    URL = "https://www.folgerdigitaltexts.org/" + pc + "/charText/"
    site = requests.get(URL)
    soup = BeautifulSoup(site.content, "html.parser")

    # grab table elements, extract strings
    table_entries = soup.find_all("div")
    table_entries_text = []
    for te in table_entries:
        table_entries_text.append(te.text)

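    # loop through the entries in (word count, character name) pairs,
    # skipping the first two div elements, and append each pair to the dataframe
    i = 2
    while i + 1 < len(table_entries_text):
        row = [table_entries_text[i+1], int(table_entries_text[i]), pc]
        df.loc[len(df.index)] = row
        i += 2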

# loop through each play getting wordcounts by character
for pc in PlayCodes:
    webscrape(df, pc)

# convert to csv
df.to_csv("statspeare.csv", index=False) 

df.head()

|   | Character | Words | Play |
|---|-----------|-------|------|
| 0 | Helen | 3586 | AWW |
| 1 | Parolles | 2995 | AWW |
| 2 | King | 2831 | AWW |
| 3 | Lafew | 2219 | AWW |
| 4 | Countess | 2175 | AWW |
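
The pairing step in the scraper can also be sketched on its own, which makes it testable without a network call. This is a minimal sketch, assuming the extracted texts alternate word count and character name after two leading entries, as the loop above implies. `pair_entries` and the sample list are hypothetical; the sample values are taken from the output above.

```python
import pandas as pd


def pair_entries(texts, play_code, start=2):
    """Turn a flat [count, name, count, name, ...] list into row dicts.

    `start` skips leading entries that are not part of the table. The
    default of 2 mirrors the scraping loop and is an assumption about the page.
    """
    body = texts[start:]
    rows = []
    # zip stops at the shorter slice, so a trailing unpaired entry is
    # dropped instead of raising an IndexError
    for count, name in zip(body[0::2], body[1::2]):
        rows.append({"Character": name, "Words": int(count), "Play": play_code})
    return rows


# Hypothetical flat list shaped like table_entries_text
sample = ["", "", "3586", "Helen", "2995", "Parolles", "2831", "King"]

# Building the dataframe once from a list of rows avoids growing it row by row
demo_df = pd.DataFrame(pair_entries(sample, "AWW"),
                       columns=["Character", "Words", "Play"])
print(demo_df)
```

Collecting the rows for every play in one list and constructing the dataframe at the end is a common alternative to appending with `df.loc[len(df.index)]` inside the loop.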


## Additional Data Calculations and Fields

### Percent (of words in respective play)
The following code calculates the share of a play's words that each character speaks. The values are stored as fractions between 0 and 1 (for example, 0.158392 is about 15.8%).

In [23]:
# calculate a character's percentage of words in a particular play
def calcPercentages(playData):
    totalWords = playData['Words'].sum()
    wordCounts = playData['Words']
    playPercents = []
    for e in wordCounts:
        playPercents.append(e / totalWords)
    return playPercents

allPercents = []

# grab all characters in a play
for play in PlayCodes:
    playData = df[df['Play'] == play]
    allPercents = allPercents + calcPercentages(playData)

# add percentage column to dataframe
df['Percent'] = allPercents
df.head()

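|   | Character | Words | Play | Percent |
|---|-----------|-------|------|---------|
| 0 | Helen | 3586 | AWW | 0.158392 |
| 1 | Parolles | 2995 | AWW | 0.132288 |
| 2 | King | 2831 | AWW | 0.125044 |
| 3 | Lafew | 2219 | AWW | 0.098012 |
| 4 | Countess | 2175 | AWW | 0.096069 |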
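
The loop above relies on the rows being stored in the same order as `PlayCodes`. An alternative that does not depend on row order is a grouped transform, sketched below on a small illustrative frame. The two AWW rows reuse word counts from the output above, while the "XYZ" play and its characters are made up, so the resulting fractions will not match the real table.

```python
import pandas as pd

# Small illustrative frame; "XYZ" and its characters are invented.
demo = pd.DataFrame({
    "Character": ["Helen", "CharA", "Parolles", "CharB"],
    "Words": [3586, 300, 2995, 100],
    "Play": ["AWW", "XYZ", "AWW", "XYZ"],
})

# Total words per play, broadcast back onto every row of that play
play_totals = demo.groupby("Play")["Words"].transform("sum")

# Each character's share of their own play, as a fraction between 0 and 1
demo["Percent"] = demo["Words"] / play_totals
print(demo)
```

Because `transform` returns a result aligned to the original index, the division works even when the rows of different plays are interleaved.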


### Gender

A gender column records the gender traditionally assigned to each role: 0 for roles traditionally assigned male and 1 for roles traditionally assigned female. Some characters have a history of gender fluidity or ambiguity, such as Ariel in *The Tempest*; these characters are given 0 in the column below because they most often default to male pronouns. A 1 is reserved for characters who are referred to with female pronouns or who have a gendered word in their title (e.g. 'LADY'). Women who cross-dress, such as Viola in *Twelfth Night* and Rosalind in *As You Like It*, are also given 1 despite their gender fluidity.

To my knowledge, no online resource has a full list of the female parts in Shakespeare, so this may be the first. I have created a separate dataframe of just the women and their respective plays for others to use for their own purposes. I hope others looking at women in Shakespeare can use this list to help with their studies.

On that note, as of writing, no other party has verified this list, so it is possible that a role is missing or there is an accidental inclusion of a male role. Please use this list with caution. 

### Creation of Gender Column

In [28]:
# Self-created list of the women's roles in Shakespeare.
# IMPORTANT: characters are matched by name only, so a name shared by
# characters in different plays is not listed once per play. Do not use
# this list by itself as a complete list of the women in Shakespeare.
# (Example: Katherine from Henry V and Katherine from The Taming of the
# Shrew are not listed separately.)
Women = ['Countess', 'Diana', 'Helen', 'Mariana', 'Widow', 
         'Charmian', 'Cleopatra', 'Iras', 'LADIES', 'Octavia', 
         'Audrey', 'Celia', 'Phoebe', 'Rosalind', 'Abbess', 
         'Adriana', 'Luce', 'Luciana', 'Gentlewoman', 'Valeria',
         'Virgilia', 'Volumnia', 'GHOSTS.Mother', 'Imogen', 'LADIES.IMOGEN',
         'LADIES.QUEEN', 'LADIES.QUEEN.0.1', 'Queen', 'Gertrude', 'Ophelia',
         'LadyMortimer', 'LadyPercy', 'MistressQuickly', 'DollTearsheet', 
         'LadyNorthumberland', 'Alice', 'Katherine', 'QueenOfFrance', 'QueenMargaret',
         'Pucelle', 'DuchessOfGloucester', 'Jourdain', 'SimpcoxWife', 'LadyBona',
         'Nurse', 'QueenElizabeth', 'Anne', 'ElizabethI.Infant', 'Katherine',
         'LADIES.KATHERINE', 'LADIES.KATHERINE.0.1', 'LADIES.KATHERINE.Patience',
         'OldLady', 'Calphurnia', 'Portia', 'LadyFaulconbridge', 'QueenEleanor',
         'Cordelia', 'Goneril', 'Regan', 'Jaquenetta', 'Maria', 
         'Princess', 'Rosaline', 'Gentlewoman', 'Hecate', 'LadyMacbeth', 
         'LadyMacduff', 'WITCHES.1', 'WITCHES.2', 'WITCHES.3', 'Bawd',
         'Isabella', 'Juliet', 'Nun', 'Jessica',
         'Portia', 'Nerissa', 'AnnePage', 'MistressFord', 'MistressPage',
         'Helena', 'Hermia', 'Hippolyta', 'Titania', 'Beatrice',
         'Hero', 'Margaret', 'Ursula', 'Bianca', 'Desdemona', 'Emilia', 
         'Daughter', 'Diana', 'Dionyza', 'Lychorida', 'Marina', 'Thaisa', 
         'DuchessOfYork', 'LADIES.0.1', 'ClarencesDaughter', 'LadyAnne', 'LadyCapulet',
         'LadyMontague', 'Hostess', 'Miranda', 'Phrynia', 'Timandra',
         'Lavinia', 'Tamora', 'Andromache', 'Cassandra', 'Cressida',
         'Olivia', 'Viola', 'Julia', 'Lucetta', 'SERVANTS.Ursula', 'Sylvia',
         'MAIDS', 'Maid', 'QUEENS.1', 'QUEENS.2', 'QUEENS.3', 'Woman', 
         'COUNTRYWOMEN.0.1', 'Dorcas', 'Hermione', 'LADIES.0.1', 'LADIES.0.2', 'LADIES.Emilia',
         'Mopsa', 'Paulina', 'Perdita']

df['Gender'] = 0

for i in range(len(df)):
    if(df.loc[i, 'Character'] in Women):
        df.loc[i, 'Gender'] = 1

df.head()

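|   | Character | Words | Play | Percent | Gender |
|---|-----------|-------|------|---------|--------|
| 0 | Helen | 3586 | AWW | 0.158392 | 1 |
| 1 | Parolles | 2995 | AWW | 0.132288 | 0 |
| 2 | King | 2831 | AWW | 0.125044 | 0 |
| 3 | Lafew | 2219 | AWW | 0.098012 | 0 |
| 4 | Countess | 2175 | AWW | 0.096069 | 1 |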
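
The row-by-row loop can also be written as a single membership test. The sketch below uses a shortened stand-in for the full `Women` list and the five character names from the output above; `women_sample` is a hypothetical name.

```python
import pandas as pd

# Shortened stand-in for the full Women list
women_sample = ["Helen", "Countess"]

demo = pd.DataFrame(
    {"Character": ["Helen", "Parolles", "King", "Lafew", "Countess"]}
)

# isin() gives True/False per row; astype(int) turns that into 1/0
demo["Gender"] = demo["Character"].isin(women_sample).astype(int)
print(demo)
```

As with the loop, the match is on the character name alone, so any character in another play who shares a name with an entry in the list is also marked 1.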


### Complete List of Women characters in Shakespeare
Use the following dataframe to get a list of all the women in Shakespeare, with one row per role and play.

In [41]:
list_women = df[df.Gender != 0]
list_women = list_women[['Character', 'Words', 'Play']]
list_women.head(10)
# Uncomment the following line to create a csv of the list of women
#list_women.to_csv('list_of_women_Shakespeare.csv', index=False)

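|   | Character | Words | Play |
|---|-----------|-------|------|
| 0 | Helen | 3586 | AWW |
| 4 | Countess | 2175 | AWW |
| 8 | Diana | 1024 | AWW |
| 11 | Widow | 435 | AWW |
| 13 | Mariana | 172 | AWW |
| 30 | Cleopatra | 4695 | Ant |
| 34 | Charmian | 632 | Ant |
| 43 | Octavia | 234 | Ant |
| 51 | Iras | 160 | Ant |
| 100 | LADIES | 3 | Ant |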


### Genre

As stated earlier in the document, genre is a controversial category. For the purposes of this document, I have categorized the plays into the genres they are assigned in David Bevington's Complete Works. Anyone who disagrees with these categories can easily move the play codes around in the following code to wherever they think a play should be categorized.

In [48]:
Comedies = ['AWW', 'AYL', 'Err', 'LLL', 'MM', 'MV', 'Wiv', 'MND', 'Ado', 'Shr', 'Tro', 'TN', 'TGV']
Histories = ['1H4', '2H4', 'H5', '1H6', '2H6', '3H6', 'H8', 'Jn', 'R2', 'R3']
Tragedies = ['Ant', 'Cor', 'Ham', 'JC', 'Lr', 'Mac', 'Oth', 'Rom', 'Tim', 'Tit']
Romances = ['Cym', 'Per', 'Tmp', 'TNK', 'WT']

Genre = []

for i in range(len(df)):
    if(df.loc[i, 'Play'] in Comedies):
        Genre.append('Comedy')
    elif(df.loc[i, 'Play'] in Histories):
        Genre.append('History')
    elif(df.loc[i, 'Play'] in Tragedies):
        Genre.append('Tragedy')
    else:
        Genre.append('Romance')

df['Genre'] = Genre
df.head()

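|   | Character | Words | Play | Percent | Gender | Genre |
|---|-----------|-------|------|---------|--------|-------|
| 0 | Helen | 3586 | AWW | 0.158392 | 1 | Comedy |
| 1 | Parolles | 2995 | AWW | 0.132288 | 0 | Comedy |
| 2 | King | 2831 | AWW | 0.125044 | 0 | Comedy |
| 3 | Lafew | 2219 | AWW | 0.098012 | 0 | Comedy |
| 4 | Countess | 2175 | AWW | 0.096069 | 1 | Comedy |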
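
The same assignment can be expressed as a lookup table, which keeps the genre lists in one place. This is a sketch on a four-row illustrative frame; `GENRES` and `play_to_genre` are hypothetical names, and the play codes are grouped exactly as in the lists above.

```python
import pandas as pd

GENRES = {
    "Comedy": ['AWW', 'AYL', 'Err', 'LLL', 'MM', 'MV', 'Wiv', 'MND',
               'Ado', 'Shr', 'Tro', 'TN', 'TGV'],
    "History": ['1H4', '2H4', 'H5', '1H6', '2H6', '3H6', 'H8', 'Jn', 'R2', 'R3'],
    "Tragedy": ['Ant', 'Cor', 'Ham', 'JC', 'Lr', 'Mac', 'Oth', 'Rom', 'Tim', 'Tit'],
    "Romance": ['Cym', 'Per', 'Tmp', 'TNK', 'WT'],
}

# Invert the grouping into a play code -> genre dictionary
play_to_genre = {code: genre for genre, codes in GENRES.items() for code in codes}

demo = pd.DataFrame({"Play": ["AWW", "Ham", "R3", "WT"]})

# map() looks up each play code; a code missing from the dictionary
# becomes NaN instead of falling through to 'Romance'
demo["Genre"] = demo["Play"].map(play_to_genre)
print(demo)
```

One difference from the `if`/`elif` chain is that a mistyped or unlisted play code shows up as a missing value rather than being labeled a Romance by default.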


## Data Exploration and Visualizations

The following section uses the data created above to calculate summary measurements and build visualizations.