# Introduction

For my final project, I wanted to compare Shakespeare's plays across two genres: comedies and tragedies. Shakespeare is best known for his plays, most of which were written between the 1590s and the 1610s. His most famous work is arguably the tragedy Romeo and Juliet, but he wrote many comedies alongside his tragedies. His plays are usually divided into those two genres; however, some of his later works are considered "problem plays" because they do not fit the established conventions of either comedy or tragedy.

So, what makes a play a comedy or a tragedy? Are there quantifiable differences in word usage between the two genres? These are the questions I aim to answer in this final project.

The tools I used were Pandas DataFrames, natural language processing with spaCy, and Scattertext. For the scatterplots, I compared the usage of verbs and adjectives in comedies versus tragedies. I compiled the dataset from Project Gutenberg: six of Shakespeare's most famous comedies and six of his tragedies, stored as two lists of dictionaries in which each play is recorded with its genre, a short title, and the URL of its text.

# Code

## Collecting text into DataFrames and tokenizing

In [1]:
# imports
import requests
import pandas as pd
import spacy

In [2]:
# set up nlp pipeline (NER and the parser are disabled because only POS tags and lemmas are needed)
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')

['ner', 'parser']

In [3]:
def get_text(url):
    response = requests.get(url)
    text = response.text
    return text

In [4]:
def divide_paras(text, start, end, para_break):
    text = text[start:end]
    paras = text.split(para_break)
    return paras
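
The slicing step can be tried on a small stand-in string without downloading anything. This is a minimal sketch, assuming a made-up `sample` text that only imitates the layout of a Project Gutenberg file. It also shows one detail worth knowing: `str.find` returns the position where the marker begins, so the marker line itself ends up as the first "paragraph" unless it is dropped.

```python
def divide_paras(text, start, end, para_break):
    # Same logic as the cell above: slice, then split on the paragraph break.
    text = text[start:end]
    return text.split(para_break)


# Hypothetical stand-in for a downloaded Project Gutenberg file.
sample = (
    "Header text\r\n\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\r\n\r\n"
    "ACT I\r\n\r\n"
    "FIRST SPEAKER.\r\nA line of verse.\r\n\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\r\n"
)

start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

start_index = sample.find(start_marker)
end_index = sample.find(end_marker)

paras = divide_paras(sample, start_index, end_index, "\r\n\r\n")

# The first piece is the START marker line; the last piece is empty.
# Dropping both leaves only the text of the play.
clean_paras = [p for p in paras[1:] if p.strip()]
print(clean_paras)
```

In the notebook itself, the marker line and empty pieces are later removed anyway by the filter on empty noun, adjective, and verb columns, so this extra cleaning is optional.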

In [5]:
tragedies = [
    {
        'genre': 'tragedy',
        'title': 'hamlet',
        'url': 'https://www.gutenberg.org/cache/epub/1524/pg1524.txt',
    },
    {
        'genre': 'tragedy',
        'title': 'lear',
        'url': 'https://www.gutenberg.org/cache/epub/1532/pg1532.txt',
    },
    {
        'genre': 'tragedy',
        'title': 'romeo_juliet',
        'url': 'https://www.gutenberg.org/cache/epub/1513/pg1513.txt',
    },
    {
        'genre': 'tragedy',
        'title': 'macbeth',
        'url': 'https://www.gutenberg.org/cache/epub/1533/pg1533.txt',
    },
    {
        'genre': 'tragedy',
        'title': 'othello',
        'url': 'https://www.gutenberg.org/cache/epub/1531/pg1531.txt',
    },
    {
        'genre': 'tragedy',
        'title': 'coriolanus',
        'url': 'https://www.gutenberg.org/cache/epub/1535/pg1535.txt',
    }
]

In [6]:
start = '*** START OF THE PROJECT GUTENBERG EBOOK'
end = '*** END OF THE PROJECT GUTENBERG EBOOK'
para_break = '\r\n\r\n'
data = {'genre': [], 'title': [], 'text': []}
for item in tragedies:
  genre = item['genre']
  title = item['title']
  text = get_text(item['url'])
  start_index = text.find(start)
  end_index = text.find(end)
  paras = divide_paras(text=text, start=start_index, end=end_index, para_break=para_break)
  for para in paras:
    data['genre'].append(genre)
    data['title'].append(title)
    data['text'].append(para)
# build the DataFrame once, after all plays have been collected
tragedies_df = pd.DataFrame.from_dict(data)

In [7]:
tragedies_df.sample(10)

Unnamed: 0,genre,title,text
2878,tragedy,romeo_juliet,MONTAGUE.\r\nI would thou wert so happy by thy...
6466,tragedy,coriolanus,Go sound thy trumpet in the marketplace.\r\nCa...
4388,tragedy,macbeth,LADY MACBETH.\r\nA kind good night to all!
4563,tragedy,macbeth,Enter a Doctor.
3236,tragedy,romeo_juliet,"BENVOLIO.\r\nStop there, stop there."
5290,tragedy,othello,[_Exeunt._]
4754,tragedy,macbeth,SCENE VIII. The same. Another part of the field.
3332,tragedy,romeo_juliet,\r\nACT III
2798,tragedy,romeo_juliet,\r\n Enter Chorus.
5103,tragedy,othello,IAGO.\r\nI warrant thee. Meet me by and by at ...


In [8]:
def get_noun_lemmas(text):
  doc = nlp(text)
  tokens = [token for token in doc if token.pos_ == 'NOUN']
  lemmas = [token.lemma_ for token in tokens]
  results_str = ' '.join(lemmas)
  return results_str

In [9]:
def get_adj_lemmas(text):
  doc = nlp(text)
  tokens = [token for token in doc if token.pos_ == 'ADJ']
  lemmas = [token.lemma_ for token in tokens]
  results_str = ' '.join(lemmas)
  return results_str

In [10]:
def get_verb_lemmas(text):
  doc = nlp(text)
  tokens = [token for token in doc if token.pos_ == 'VERB']
  lemmas = [token.lemma_ for token in tokens]
  results_str = ' '.join(lemmas)
  return results_str
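
Each of the three functions above calls `nlp(text)` on its own, so every paragraph is processed three times. A possible alternative, sketched here under assumptions, is to collect all three parts of speech in one pass. The helper name `lemmas_by_pos` and the `Token` stand-in are hypothetical and only there so the sketch runs without a spaCy model; in the notebook, `nlp(text)` would be passed in instead of `fake_doc`.

```python
from collections import namedtuple


def lemmas_by_pos(tokens, wanted=('NOUN', 'ADJ', 'VERB')):
    """Group lemmas by part of speech in a single pass over `tokens`.

    `tokens` is any iterable of objects with `.pos_` and `.lemma_`
    attributes, such as a spaCy Doc.
    """
    grouped = {pos: [] for pos in wanted}
    for token in tokens:
        if token.pos_ in grouped:
            grouped[token.pos_].append(token.lemma_)
    # Join each list into a space-separated string, like the cells above.
    return {pos: ' '.join(lemmas) for pos, lemmas in grouped.items()}


# Hypothetical stand-in tokens (not real spaCy output).
Token = namedtuple('Token', ['lemma_', 'pos_'])
fake_doc = [
    Token('good', 'ADJ'),
    Token('pilgrim', 'NOUN'),
    Token('you', 'PRON'),
    Token('wrong', 'VERB'),
    Token('hand', 'NOUN'),
]

result = lemmas_by_pos(fake_doc)
print(result)
```

The returned dictionary has one string per part of speech, so it could fill the `nouns`, `adjectives`, and `verbs` columns from a single call per paragraph.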

In [11]:
tragedies_df['nouns'] = tragedies_df['text'].apply(get_noun_lemmas)

In [12]:
tragedies_df['adjectives'] = tragedies_df['text'].apply(get_adj_lemmas)

In [13]:
tragedies_df['verbs'] = tragedies_df['text'].apply(get_verb_lemmas)

In [14]:
# remove rows that have only whitespace in the nouns, adjectives, or verbs column
tragedies_df = tragedies_df[tragedies_df['nouns'].str.strip().astype(bool)]
tragedies_df = tragedies_df[tragedies_df['adjectives'].str.strip().astype(bool)]
tragedies_df = tragedies_df[tragedies_df['verbs'].str.strip().astype(bool)]

In [15]:
tragedies_df.sample(10)

Unnamed: 0,genre,title,text,nouns,adjectives,verbs
3053,tragedy,romeo_juliet,"JULIET.\r\nGood pilgrim, you do wrong your han...",juliet pilgrim hand devotion saint hand hand t...,good mannerly holy,wrong show pilgrim
4164,tragedy,macbeth,LENNOX.\r\nMy young remembrance cannot paralle...,remembrance fellow,young,parallel
3457,tragedy,romeo_juliet,ROMEO.\r\nThou canst not speak of that thou do...,ROMEO thou thou love hour doting hair ground m...,canst dost wert young married,speak feel murder banish tear fall take
5039,tragedy,othello,"DESDEMONA.\r\nI thank you, valiant Cassio.\r\n...",tiding,valiant,thank tell
4621,tragedy,macbeth,"Lo you, here she comes! This is her very guise...",life,guise asleep close,come observe stand
2422,tragedy,lear,EDGAR.\r\nY’are much deceiv’d: in nothing am I...,deceiv’d garment,much,y’are chang’d
2202,tragedy,lear,"GLOUCESTER.\r\nGood friend, I prythee, take hi...",friend prythee arm plot death litter friend sh...,good ready thou assured quick,take o’erheard lay drive meet take shouldst of...
3492,tragedy,romeo_juliet,"LADY CAPULET.\r\nI will, and know her mind ear...",mind tomorrow tonight heaviness,early,know ’
5387,tragedy,othello,OTHELLO.\r\nWhy did I marry? This honest creat...,creature doubtless,honest more more,marry know unfold
2566,tragedy,lear,KENT.\r\nReport is changeable. ’Tis time to lo...,time power kingdom approach,changeable,look


In [16]:
tragedies_df.to_csv('tragedies.csv', index=False)

In [17]:
comedies = [
    {
        'genre': 'comedy',
        'title': 'midsummer',
        'url': 'https://www.gutenberg.org/cache/epub/1514/pg1514.txt',
    },
    {
        'genre': 'comedy',
        'title': 'shrew',
        'url': 'https://www.gutenberg.org/cache/epub/1508/pg1508.txt',
    },
    {
        'genre': 'comedy',
        'title': 'twelfth',
        'url': 'https://www.gutenberg.org/cache/epub/1526/pg1526.txt',
    },
    {
        'genre': 'comedy',
        'title': 'winters',
        'url': 'https://www.gutenberg.org/cache/epub/1539/pg1539.txt',
    },
    {
        'genre': 'comedy',
        'title': 'much_ado',
        'url': 'https://www.gutenberg.org/cache/epub/1519/pg1519.txt',
    },
    {
        'genre': 'comedy',
        'title': 'tempest',
        'url': 'https://www.gutenberg.org/cache/epub/1540/pg1540.txt',
    }
]

In [20]:
start = '*** START OF THE PROJECT GUTENBERG EBOOK'
end = '*** END OF THE PROJECT GUTENBERG EBOOK'
para_break = '\r\n\r\n'
data = {'genre': [], 'title': [], 'text': []}
for item in comedies:
  genre = item['genre']
  title = item['title']
  text = get_text(item['url'])
  start_index = text.find(start)
  end_index = text.find(end)
  paras = divide_paras(text=text, start=start_index, end=end_index, para_break=para_break)
  for para in paras:
    data['genre'].append(genre)
    data['title'].append(title)
    data['text'].append(para)
# build the DataFrame once, after all plays have been collected
comedies_df = pd.DataFrame.from_dict(data)
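
The loading cells for tragedies and comedies differ only in the list they iterate over, so the shared logic could live in one function. The following is a rough sketch rather than the code I actually ran: `build_rows` is a hypothetical helper, and `fake_texts` is made-up data that replaces the real downloads so the example works offline.

```python
def build_rows(plays, fetch, start, end, para_break):
    """Collect one row per paragraph for every play in `plays`.

    `fetch` is any callable that maps a URL to the play's full text
    (for example, the notebook's `get_text`).
    """
    data = {'genre': [], 'title': [], 'text': []}
    for item in plays:
        text = fetch(item['url'])
        body = text[text.find(start):text.find(end)]
        for para in body.split(para_break):
            data['genre'].append(item['genre'])
            data['title'].append(item['title'])
            data['text'].append(para)
    return data


# Hypothetical in-memory "downloads" keyed by made-up URLs.
fake_texts = {
    'url_a': 'junk*** START ***\r\n\r\nFirst para\r\n\r\nSecond para*** END ***junk',
    'url_b': '*** START ***\r\n\r\nOnly para*** END ***',
}
plays = [
    {'genre': 'tragedy', 'title': 'play_a', 'url': 'url_a'},
    {'genre': 'comedy', 'title': 'play_b', 'url': 'url_b'},
]

rows = build_rows(plays, fake_texts.get, '*** START', '*** END', '\r\n\r\n')
print(rows['title'])
print(rows['text'])
```

With the real lists, something like `pd.DataFrame(build_rows(comedies, get_text, start, end, para_break))` should give the same kind of frame as the cell above.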

In [21]:
comedies_df.sample(10)

Unnamed: 0,genre,title,text
3493,comedy,winters,CLOWN.\r\nWhat hast here? Ballads?
5388,comedy,tempest,"SEBASTIAN.\r\nThy case, dear friend,\r\nShall ..."
2956,comedy,winters,
1320,comedy,shrew,"CURTIS.\r\nThis ’tis to feel a tale, not to he..."
1629,comedy,shrew,"PETRUCHIO.\r\n[_To Vincentio_] Why, how now, ..."
425,comedy,midsummer,PUCK.\r\nFollow me then to plainer ground.
1876,comedy,twelfth,"MARIA.\r\nBy my troth, Sir Toby, you must come..."
687,comedy,shrew,\r\nTHE TAMING OF THE SHREW
4289,comedy,much_ado,LEONATO.\r\nNo; and swears she never will: tha...
1321,comedy,shrew,GRUMIO.\r\nAnd therefore ’tis called a sensibl...


In [22]:
comedies_df['nouns'] = comedies_df['text'].apply(get_noun_lemmas)

In [23]:
comedies_df['adjectives'] = comedies_df['text'].apply(get_adj_lemmas)

In [24]:
comedies_df['verbs'] = comedies_df['text'].apply(get_verb_lemmas)

In [25]:
# remove rows that have only whitespace in the nouns, adjectives, or verbs column
comedies_df = comedies_df[comedies_df['nouns'].str.strip().astype(bool)]
comedies_df = comedies_df[comedies_df['adjectives'].str.strip().astype(bool)]
comedies_df = comedies_df[comedies_df['verbs'].str.strip().astype(bool)]

In [26]:
comedies_df.sample(10)

Unnamed: 0,genre,title,text,nouns,adjectives,verbs
4951,comedy,much_ado,"You know your office, brother;\r\nYou must be ...",office brother father brother daughter,young,know give
2431,comedy,twelfth,"VIOLA.\r\nNo, not a grize; for ’tis a vulgar p...",viola grize tis proof enemy,vulgar,pity
615,comedy,midsummer,"LION.\r\nYou, ladies, you, whose gentle hearts...",lady heart mouse floor quake lion rage roar jo...,gentle small monstrous rough wildest,fear creep tremble know fall come twere
64,comedy,midsummer,HERMIA.\r\nGod speed fair Helena! Whither away?,whither,fair,speed
1027,comedy,shrew,"BAPTISTA.\r\nA thousand thanks, Signior Gremio...",thank stranger cause,good gentle bold,methink walk know come
2113,comedy,twelfth,"SEBASTIAN.\r\nO good Antonio, forgive me your ...",trouble,good,forgive
5033,comedy,tempest,Other Spirits attending on Prospero,spirit,other,attend
775,comedy,shrew,THIRD SERVANT.\r\nOr Daphne roaming through a ...,wood leg sight weep blood tear,thorny sad,roam scratch swear bleed draw
3617,comedy,winters,"CAMILLO.\r\nHow now, good fellow! why shakest ...",fellow thou man harm,good shak,fear ’ intend
3845,comedy,winters,PAULINA.\r\nIt is requir’d\r\nYou do awake you...,requir’d faith business,unlawful,awake stand think let depart


In [27]:
comedies_df.to_csv('comedies.csv', index=False)

## Using Scattertext to compare verbs and adjectives

In [28]:
%%capture
! pip install scattertext

In [29]:
import scattertext as st
from IPython.display import HTML

In [30]:
print(tragedies_df.shape)
print(comedies_df.shape)

(2822, 6)
(2337, 6)


In [31]:
tragedies_df_adj = tragedies_df
comedies_df_adj = comedies_df

In [32]:
df_combined = pd.concat([tragedies_df_adj, comedies_df_adj])

In [33]:
adj_df = df_combined[['genre', 'adjectives']]

In [34]:
# sanity check
print(adj_df.shape) # should be 5159, 2
adj_df.sample(10)

(5159, 2)


Unnamed: 0,genre,adjectives
5073,comedy,wide
2447,tragedy,free patient
780,tragedy,holy religious many many safe
1031,comedy,mighty welcome lute
268,tragedy,incestuous traitorous wicked shameful most see...
5080,comedy,barren long brown dry
2277,tragedy,old
1813,comedy,curst
5652,comedy,wise
3181,tragedy,mine sudden thy holy blessed


In [35]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(adj_df, category_col='genre', text_col='adjectives').build()

In [36]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='comedy',  # this sets the y-axis
                                       category_name='Comedy', # label y-axis
                                       not_category_name='Tragedy',  # label x-axis
                                       minimum_term_frequency=10, # I used a minimum frequency of 10 to populate the scatterplot more
                                       width_in_pixels=900)
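
To make the idea behind the plot more concrete, here is a small sketch of the kind of per-genre counts that such a comparison rests on. It is not Scattertext's implementation. It assumes terms are split on whitespace and that the minimum-frequency threshold applies to a term's combined count across both genres. The function name and the toy rows are hypothetical.

```python
from collections import Counter


def term_counts_by_category(rows, min_total=2):
    """Count whitespace-separated terms per category.

    `rows` is an iterable of (category, text) pairs. Terms whose combined
    count across all categories is below `min_total` are dropped.
    """
    counts = {}
    for category, text in rows:
        counts.setdefault(category, Counter()).update(text.split())

    # Combined counts across categories decide which terms are kept.
    totals = Counter()
    for counter in counts.values():
        totals.update(counter)
    kept = {term for term, n in totals.items() if n >= min_total}

    return {
        category: {term: counter[term] for term in sorted(kept)}
        for category, counter in counts.items()
    }


# Hypothetical toy rows in the same shape as the genre/adjectives columns.
toy_rows = [
    ('comedy', 'good gentle fair'),
    ('comedy', 'good merry'),
    ('tragedy', 'good bloody'),
    ('tragedy', 'bloody old'),
]

table = term_counts_by_category(toy_rows, min_total=2)
print(table)
```

A term that appears often in one genre and rarely in the other, like `bloody` in the toy data, is the kind of word that would sit far from the diagonal in the scatterplot.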

In [37]:
HTML(html)

In [38]:
# Save this visualization as an html file
file_name = 'final_proj_Vis_adj.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)

In [39]:
tragedies_df_verb = tragedies_df
comedies_df_verb = comedies_df

In [40]:
df_combined_2 = pd.concat([tragedies_df_verb, comedies_df_verb])

In [41]:
verb_df = df_combined_2[['genre', 'verbs']]

In [42]:
# sanity check
print(verb_df.shape) # should be 5159, 2
verb_df.sample(10)

(5159, 2)


Unnamed: 0,genre,verbs
3998,comedy,prove lose get pick hang
1834,comedy,attend attend
2006,comedy,speak
3058,tragedy,move take kiss
4846,tragedy,answer beseech find transport be know do know ...
497,tragedy,come let comply tell show appear deceive
4346,comedy,know
7635,tragedy,take let make
1393,tragedy,slay lie rise blame
2644,tragedy,GONERIL exalt


In [43]:
# create a scattertext corpus
corpus_2 = st.CorpusFromPandas(verb_df, category_col='genre', text_col='verbs').build()

In [44]:
# transform corpus into html-based visualization with scattertext
html_2 = st.produce_scattertext_explorer(corpus_2,
                                       category='comedy',  # this sets the y-axis
                                       category_name='Comedy', # label y-axis
                                       not_category_name='Tragedy',  # label x-axis
                                       minimum_term_frequency=10,
                                       width_in_pixels=900)

In [45]:
HTML(html_2)

In [46]:
# Save this visualization as an html file
file_name = 'final_proj_Vis_verb.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html_2)  # write the verb visualization, not the adjective one

# Discussion and Analysis

#### So, what makes a play a comedy or a tragedy? Are there quantifiable differences in word usage between the two genres?

