# <ins>Predicting Poetic Movements</ins>

# Data Cleaning

- In this notebook, I look at the data I scraped from [PoetryFoundation.org](https://www.poetryfoundation.org/) in the previous [notebook](01_webscraping.ipynb).
- I check for and rectify duplicate (or near-duplicate) entries, NaN values, and poems that need to be rescraped.

#### Important note
As in the [previous notebook](01_webscraping.ipynb), due to the imperfections and idiosyncrasies of scraping text from images, a lot of rescraping was necessary. Sometimes this had to be done in a manner that is best described, rather unfortunately, as nonprogrammatic. As a result, this notebook is at times messy, which is not a reflection on the other notebooks for this project.

Thank you for understanding :)

## Table of contents

1. [Load packages and data](#Load-packages-and-data)
2. [Check for duplicates](#Check-for-duplicates)

    - [Drop duplicates](#Drop-duplicates)
    - [Scrape extra pages](#Scrape-extra-pages)
    - [Add titles](#Add-titles)
  
  
3. [Check for NaN values](#Check-for-NaN-values)
4. [Check for other bad scrapes](#Check-for-other-bad-scrapes)

    - [Finishing touches](#Finishing-touches)
    - [Save DataFrame](#💾-Save-DataFrame)
    
    
5. [Drop certain genres](#Drop-certain-genres)

    - [Save/Load trimmed DataFrame](#💾-Save/Load-trimmed-DataFrame)
    
    
6. [Next notebook: NLP, Feature Engineering, and EDA](#Next-notebook:-NLP,-Feature-Engineering,-and-EDA)

    
## Load packages and data

[[go back to the top](#Predicting-Poetic-Movements)]

In [1]:
# custom functions for webscraping
from functions_webscraping import *

# standard dataframe packages
import pandas as pd
import numpy as np

# timekeeping/progress packages
import time
from tqdm import tqdm

# saving packages
import gzip
import pickle

# reload functions/libraries when edited
%load_ext autoreload
%autoreload 2

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# increase column width of dataframe
pd.set_option('max_colwidth', 150)

In [2]:
df = pd.read_csv('data/poems_df_pre_clean.csv', index_col=0)
df.shape

(5168, 6)

In [3]:
df.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Alexander Pope,https://www.poetryfoundation.org/poems/44896/an-essay-on-criticism-part-1,An Essay on Criticism: Part 1,"['PART 1', ""'Tis hard to say, if greater want of skill"", 'Appear in writing or in judging ill;', ""But, of the two, less dang'rous is th' offence"",...","PART 1\n'Tis hard to say, if greater want of skill\nAppear in writing or in judging ill;\nBut, of the two, less dang'rous is th' offence\nTo tire ...",augustan
1,Alexander Pope,https://www.poetryfoundation.org/poems/44897/an-essay-on-criticism-part-2,An Essay on Criticism: Part 2,"['Of all the causes which conspire to blind', ""Man's erring judgment, and misguide the mind,"", 'What the weak head with strongest bias rules,', 'I...","Of all the causes which conspire to blind\nMan's erring judgment, and misguide the mind,\nWhat the weak head with strongest bias rules,\nIs pride,...",augustan
2,Alexander Pope,https://www.poetryfoundation.org/poems/44898/an-essay-on-criticism-part-3,An Essay on Criticism: Part 3,"['Learn then what morals critics ought to show,', ""For 'tis but half a judge's task, to know."", ""'Tis not enough, taste, judgment, learning, join;...","Learn then what morals critics ought to show,\nFor 'tis but half a judge's task, to know.\n'Tis not enough, taste, judgment, learning, join;\nIn a...",augustan
3,Alexander Pope,https://www.poetryfoundation.org/poems/44899/an-essay-on-man-epistle-i,An Essay on Man: Epistle I,"['Awake, my St. John! leave all meaner things', 'To low ambition, and the pride of kings.', 'Let us (since life can little more supply', 'Than jus...","Awake, my St. John! leave all meaner things\nTo low ambition, and the pride of kings.\nLet us (since life can little more supply\nThan just to loo...",augustan
4,Alexander Pope,https://www.poetryfoundation.org/poems/44900/an-essay-on-man-epistle-ii,An Essay on Man: Epistle II,"['I.', 'Know then thyself, presume not God to scan;', 'The proper study of mankind is man.', ""Plac'd on this isthmus of a middle state,"", 'A being...","I.\nKnow then thyself, presume not God to scan;\nThe proper study of mankind is man.\nPlac'd on this isthmus of a middle state,\nA being darkly wi...",augustan


- Saving to CSV converts the list of poem_lines to a string, so I'll use my destringify function.

In [4]:
df['poem_lines'] = df['poem_lines'].apply(destringify)
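
The `destringify` helper is imported from `functions_webscraping` and is not shown in this notebook. As a rough sketch of what such a helper might do, assuming the CSV cell holds the Python `repr` of a list, `ast.literal_eval` can parse it back. The name `destringify_sketch` and the NaN handling are assumptions made for illustration, not the project's actual code.

```python
import ast


def destringify_sketch(value):
    """Turn the CSV string form of a list back into a real list."""
    # Hypothetical stand-in: the real `destringify` lives in
    # functions_webscraping and may behave differently.
    if isinstance(value, list):
        return value  # already a list, nothing to do
    if not isinstance(value, str):
        return []     # assumption: treat NaN/empty cells as an empty poem
    # literal_eval only parses Python literals, so it is safer than eval()
    return ast.literal_eval(value)


raw = """['PART 1', "'Tis hard to say, if greater want of skill"]"""
lines = destringify_sketch(raw)
```

Applied column-wise, this would be used the same way as the cell above: `df['poem_lines'].apply(destringify_sketch)`.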

## Check for duplicates

[[go back to the top](#Predicting-Poetic-Movements)]

- First check if any poems were scraped twice.
- Then look for poems that are the same, but may have been scraped slightly differently.

In [5]:
len(df[df.duplicated(subset=['poet', 'poem_string'])])

21
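
A note on reading that count: with its default `keep='first'`, `duplicated` flags only the later copies, while `keep=False` flags every row that has a twin. The toy frame below (made-up values, not the poem data) shows the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    'poet': ['A', 'A', 'B', 'B', 'C'],
    'poem_string': ['x', 'x', 'y', 'y', 'z'],
})
subset = ['poet', 'poem_string']

# Default keep='first': only the second copy of each pair is flagged.
later_copies = toy.duplicated(subset=subset)
# keep=False: every row that has a twin is flagged, which is handier
# for eyeballing both copies side by side.
all_copies = toy.duplicated(subset=subset, keep=False)

n_later = int(later_copies.sum())
n_all = int(all_copies.sum())
```

So a count taken with the default is the number of extra copies, while a `keep=False` view will list more rows than that count.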

In [6]:
df[df.duplicated(subset=['poet', 'poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
119,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"[I, I saw the best minds of my generation destroyed by madness, starving hysterical naked,, dragging themselves through the negro streets at dawn ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
120,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"[I, I saw the best minds of my generation destroyed by madness, starving hysterical naked,, dragging themselves through the negro streets at dawn ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
249,Richard Brautigan,https://www.poetryfoundation.org/poems/48576/a-boat,A Boat,"[O beautiful, was the werewolf, in his evil forest., We took him, to the carnival, and he started, crying, when he saw, the Ferris wheel., Electri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
250,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/56423/a-boat-56d238e754f45,A Boat,"[O beautiful, was the werewolf, in his evil forest., We took him, to the carnival, and he started, crying, when he saw, the Ferris wheel., Electri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
613,Robert Creeley,https://www.poetryfoundation.org/poems/42840/the-rescue-56d2217b24ec4,The Rescue,"[The man sits in a timelessness, with the horse under him in time, to a movement of legs and hooves, upon a timeless sand., Distance comes in from...",The man sits in a timelessness\nwith the horse under him in time\nto a movement of legs and hooves\nupon a timeless sand.\nDistance comes in from ...,black_mountain
614,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/28665/the-rescue,The Rescue,"[The man sits in a timelessness, with the horse under him in time, to a movement of legs and hooves, upon a timeless sand., Distance comes in from...",The man sits in a timelessness\nwith the horse under him in time\nto a movement of legs and hooves\nupon a timeless sand.\nDistance comes in from ...,black_mountain
738,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29165/four-dream-songs,Four Dream Songs,"[I, To Ralph Ross, The greens of the Ganges delta foliate., Of heartless youth made late aware he pled:, Brownies, please come., To Henry in his s...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional
741,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29167/henrys-pelt-was-put-on,Henrys Pelt Was Put On,"[I, To Ralph Ross, The greens of the Ganges delta foliate., Of heartless youth made late aware he pled:, Brownies, please come., To Henry in his s...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional
852,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"[Sweet beast, I have gone prowling,, a proud rejected man, who lived along the edges, catch as catch can;, in darkness and in hedges, I sang my so...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1845,Alfred Kreymborg,https://www.poetryfoundation.org/poetrymagazine/poems/14702/cradle,Cradle,"[The blue-eyed youngster, And the fat old man, Play ball in me;, And music—, The one on his penny flute,, The other on his bassoon., Their tolerat...","The blue-eyed youngster\nAnd the fat old man\nPlay ball in me;\nAnd music—\nThe one on his penny flute,\nThe other on his bassoon.\nTheir tolerati...",modern


- I'm actually somewhat happy about this because it shows that my image scraper worked pretty darn well, if the strings are exactly the same.
- That said, it may also mean that there are even more near-duplicates, but I can check for duplicates across poet and title next.
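
One way to surface near-duplicates directly, instead of relying on matching titles, would be a string-similarity score. This is a sketch using the standard library's `difflib`, not what this notebook does; the sample strings are toy inputs modeled on a scrape that lost its first two characters.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-1 similarity ratio between two poem strings."""
    return SequenceMatcher(None, a, b).ratio()


clean = "Bless\nsomething small\nbut infinite\nand quiet."
clipped = "ess\nsomething small\nbut infinite\nand quiet."
other = "There will be no simple\nway to avoid what\nconfronts me."

near_dupe_score = similarity(clean, clipped)   # close to 1
unrelated_score = similarity(clean, other)     # much lower
```

Comparing every pair of poems this way would be slow on about 5,000 rows, so it would make most sense within a single poet's poems.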

In [7]:
# drop duplicates
to_drop = [120, 250, 614, 2338, 2358, 2367, 2931, 2995, 3454, 3455, 3642, 4481, 5159]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5155, 6)

- For some multi-part poems, my scraper captured every part in each row, so those rows need a more targeted fix: each one is trimmed below to just its own part.

In [8]:
df.loc[738, 'poem_lines'] = df.loc[738, 'poem_lines'][:-4]
df.loc[738, 'poem_string'] = '\n'.join(df.loc[738, 'poem_lines'])

df.loc[741, 'poem_lines'] = df.loc[741, 'poem_lines'][-3:]
df.loc[741, 'poem_string'] = '\n'.join(df.loc[741, 'poem_lines'])

In [9]:
df.loc[1845, 'poem_lines'] = df.loc[1845, 'poem_lines'][:19]
df.loc[1845, 'poem_string'] = '\n'.join(df.loc[1845, 'poem_lines'])

df.loc[1867, 'poem_lines'] = df.loc[1867, 'poem_lines'][20:]
df.loc[1867, 'poem_string'] = '\n'.join(df.loc[1867, 'poem_lines'])

In [10]:
df.loc[2195, 'poem_lines'] = df.loc[2195, 'poem_lines'][:12]
df.loc[2195, 'poem_string'] = '\n'.join(df.loc[2195, 'poem_lines'])

df.loc[2236, 'poem_lines'] = df.loc[2236, 'poem_lines'][12:51]
df.loc[2236, 'poem_string'] = '\n'.join(df.loc[2236, 'poem_lines'])

In [11]:
df.loc[2571, 'poem_lines'] = df.loc[2571, 'poem_lines'][:-1]
df.loc[2571, 'poem_string'] = '\n'.join(df.loc[2571, 'poem_lines'])
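
The cells above repeat the same two steps: slice `poem_lines`, then rebuild `poem_string` from the result. A small helper could keep the two in sync. `trim_poem` is a hypothetical name and the sample lines are invented.

```python
def trim_poem(lines, start=None, stop=None):
    """Slice a list of poem lines and rebuild the matching string."""
    trimmed = lines[start:stop]
    return trimmed, '\n'.join(trimmed)


lines = ['I', 'first part', 'II', 'second part', 'III', 'third part']

# Keep everything except the last four lines (like [:-4] above).
first_lines, first_string = trim_poem(lines, stop=-4)
# Keep only the last two lines (like a negative start index above).
last_lines, last_string = trim_poem(lines, start=-2)
```

The returned pair could then be written back to the `poem_lines` and `poem_string` cells of the row in question.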

- A couple of total rescrapes, because the URL went to the wrong page.

In [12]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=83&issue=6&page=2'
url = df.loc[2567, 'poem_url']
rescrape = scan_poem_scraper(actual_url, input_poet=df.loc[2567, 'poet'], input_title=df.loc[2567, 'title'])
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2567, 'genre']
df.loc[2567, 'poem_lines'] = rescrape['poem_lines']
df.loc[2567, 'poem_string'] = rescrape['poem_string']

In [13]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=83&issue=6&page=3'
url = df.loc[2575, 'poem_url']
rescrape = scan_poem_scraper(actual_url, input_poet=df.loc[2575, 'poet'], input_title=df.loc[2575, 'title'])
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2575, 'genre']
df.loc[2575, 'poem_lines'] = rescrape['poem_lines']
df.loc[2575, 'poem_string'] = rescrape['poem_string']

- A slight title adjustment so all the words in the first line will be accounted for.

In [14]:
df.loc[3666, 'title'] = 'Young in Fall I said: the birds'

- Check for duplicates again.

In [15]:
df[df.duplicated(subset=['poet', 'poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
2355,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19926/the-urn-enrich-my-resignation,The Urn Enrich My Resignation,[],,modern
2356,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio,The Urn Purgatorio,[],,modern
2359,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn Reply,[],,modern
2360,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian,The Urn The Sad Indian,[],,modern


- I know I scraped these in the [rescrape](01_webscraping.ipynb#Rescrape) portion of the previous notebook, so I'll confirm those are somewhere in the DataFrame.

In [16]:
df[df.poet == 'Hart Crane']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
2341,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19925/a-postscript,A Postscript,"[Friendship agony! words came to me, At last shyly. My only final friends,, The wren and thrush, made solid print for me, Across dawn’s broken arc...","Friendship agony! words came to me\nAt last shyly. My only final friends,\nThe wren and thrush, made solid print for me\nAcross dawn’s broken arc....",modern
2342,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/17345/at-melvilles-tomb,At Melvilles Tomb,"[Often beneath the wave, wide from this ledge,, The dice of drowned men’s bones he saw bequeath, An embassy. Their numbers, as he watched,, Beat o...","Often beneath the wave, wide from this ledge,\nThe dice of drowned men’s bones he saw bequeath\nAn embassy. Their numbers, as he watched,\nBeat on...",modern
2343,Hart Crane,https://www.poetryfoundation.org/poems/43260/at-melvilles-tomb-56d221f8f2f82,At Melville’s Tomb,"[Often beneath the wave, wide from this ledge, The dice of drowned men’s bones he saw bequeath, An embassy. Their numbers as he watched,, Beat on ...","Often beneath the wave, wide from this ledge\nThe dice of drowned men’s bones he saw bequeath\nAn embassy. Their numbers as he watched,\nBeat on t...",modern
2344,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19923/by-nilus-once,By Nilus Once,"[Some old Egyptian joke is in the air,, Dear lady, the poet said, release your hair;, Come, search the marshes for a friendly bed,, Or let us bump...","Some old Egyptian joke is in the air,\nDear lady, the poet said, release your hair;\nCome, search the marshes for a friendly bed,\nOr let us bump ...",modern
2345,Hart Crane,https://www.poetryfoundation.org/poems/43257/chaplinesque,Chaplinesque,"[We make our meek adjustments,, Contented with such random consolations, As the wind deposits, In slithered and too ample pockets., For we can sti...","We make our meek adjustments,\nContented with such random consolations\nAs the wind deposits\nIn slithered and too ample pockets.\nFor we can stil...",modern
2346,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/17746/the-bridge-cutty-sark,Cutty Sark,"[I met a man in South Street, tall—, a nervous shark tooth swung on his chain., His eyes pressed through green glass, —green glasses, or bar light...","I met a man in South Street, tall—\na nervous shark tooth swung on his chain.\nHis eyes pressed through green glass\n—green glasses, or bar lights...",modern
2347,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/18754/eldorado,Eldorado,"[The morning glory, climbing the morning long, Over the lintel on its wiry vine,, Closes before the dusk, furls in its song, As I close mine..., A...","The morning glory, climbing the morning long\nOver the lintel on its wiry vine,\nCloses before the dusk, furls in its song\nAs I close mine...\nAn...",modern
2348,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19919/havana-rose,Havana Rose,"[Let us strip the desk for action, now we have a house in, Mexico. . . . That night in Vera Cruz—verily for me “the, True Cross”—let us remember t...","Let us strip the desk for action, now we have a house in\nMexico. . . . That night in Vera Cruz—verily for me “the\nTrue Cross”—let us remember th...",modern
2349,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19918/imperator-victus,Imperator Victus,"[Big guns again, No speakee well, But plain., Again, again—, And they shall tell, The Spanish Main, The Dollar from the Cross., Big guns again—, B...","Big guns again\nNo speakee well\nBut plain.\nAgain, again—\nAnd they shall tell\nThe Spanish Main\nThe Dollar from the Cross.\nBig guns again—\nBu...",modern
2350,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/17747/o-carib-isle,O Carib Isle!,"[O Carib Isle!, The tarantula rattling at the lily’s foot, Across the feet of the dead, laid in white sand, Near the coral beach—nor zigzag fiddle...","O Carib Isle!\nThe tarantula rattling at the lily’s foot\nAcross the feet of the dead, laid in white sand\nNear the coral beach—nor zigzag fiddle ...",modern


- Found them! Interestingly, I also found a duplicate ("At Melvilles Tomb") that my image-text scraper must not have scraped quite properly.
- I'll go ahead and drop all of those now.

In [17]:
# drop duplicates
to_drop = [2342, 2355, 2356, 2359, 2360]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5150, 6)

- Check for poems with same poet and title, which may mean there are differently scraped duplicate poems.
- Since there were so many, I did these in batches of 40, using the code in the following cell.
- The subsequent cells detail my process for fixing any legitimate duplicates, by either dropping, rescraping, adding titles, etc.

In [18]:
df[df.duplicated(subset=['poet', 'title'], keep=False)].head(40)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
176,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,[],,beat
177,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,"[XVI, No sooner had the clowns got a new house built,, a worse wind than the first blew it down. And it also, re-blew down the old house which the...","XVI\nNo sooner had the clowns got a new house built,\na worse wind than the first blew it down. And it also\nre-blew down the old house which they...",beat
461,Denise Levertov,https://www.poetryfoundation.org/poetrymagazine/poems/55624/invocation-56d2376543226,Invocation,"[Silent, about-to-be-parted-from house., Wood creaking, trying to sigh, impatient., Clicking of squirrel-teeth in the attic., Denuded beds, couche...","Silent, about-to-be-parted-from house.\nWood creaking, trying to sigh, impatient.\nClicking of squirrel-teeth in the attic.\nDenuded beds, couches...",black_mountain
462,Denise Levertov,https://www.poetryfoundation.org/poetrymagazine/poems/31377/invocation-56d214d8e4ca6,Invocation,"[Silent, about-to-be-parted-from house., Wood creaking, trying to sigh, impatient., Clicking of squirrel-teeth in the attic., Denuded beds, couche...","Silent, about-to-be-parted-from house.\nWood creaking, trying to sigh, impatient.\nClicking of squirrel-teeth in the attic.\nDenuded beds, couches...",black_mountain
546,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/55314/a-prayer-56d236c6bb760,A Prayer,"[Bless, something small, but infinite, and quiet., There are senses, make an object, in their simple, feeling for one., ]",Bless\nsomething small\nbut infinite\nand quiet.\nThere are senses\nmake an object\nin their simple\nfeeling for one.\n,black_mountain
547,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/30222/a-prayer-56d213b6bd3ec,A Prayer,"[ess, something small, but infinite, and quiet., There are senses, make an object, in their simple, feeling for one.]",ess\nsomething small\nbut infinite\nand quiet.\nThere are senses\nmake an object\nin their simple\nfeeling for one.,black_mountain
603,Robert Creeley,https://www.poetryfoundation.org/poems/49024/the-language-56d22abc283f2,The Language,"[Locate, love you, where in, teeth and, eyes, bite, it but, take care not, to hurt, you, want so, much so, little. Words, say everything., I, love...","Locate\nlove you\nwhere in\nteeth and\neyes, bite\nit but\ntake care not\nto hurt, you\nwant so\nmuch so\nlittle. Words\nsay everything.\nI\nlove ...",black_mountain
604,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/29780/the-language,The Language,"[Locate I, love you some-, where in, teeth and, eyes, bite, it but, take care not, to hurt, you, want so, much so, little. Words, say everything,,...","Locate I\nlove you some-\nwhere in\nteeth and\neyes, bite\nit but\ntake care not\nto hurt, you\nwant so\nmuch so\nlittle. Words\nsay everything,\n...",black_mountain
622,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/29778/the-window-56d2134a5739a,The Window,"[Position is where you, put it, where it is,, did you, for example, that, large tank there, silvered,, with the white church along-, side, lift, a...","Position is where you\nput it, where it is,\ndid you, for example, that\nlarge tank there, silvered,\nwith the white church along-\nside, lift\nal...",black_mountain
623,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/30219/the-window-56d213b62fd88,The Window,"[There will be no simple, way to avoid what, confronts me. Again and, again I know it, but, take heart, hopefully,, in the world unavoidably, pres...","There will be no simple\nway to avoid what\nconfronts me. Again and\nagain I know it, but\ntake heart, hopefully,\nin the world unavoidably\nprese...",black_mountain
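
For paging through a long list of candidate duplicates, one option is to freeze the filtered frame once and slice it by position. This is only a sketch on toy data; `batch` is a made-up helper, and the batch size is shrunk to 4 so the example stays small.

```python
import pandas as pd

toy = pd.DataFrame({
    'poet': ['A'] * 6 + ['B'],
    'title': ['t1', 't1', 't2', 't2', 't3', 't3', 'solo'],
})

# All rows whose (poet, title) pair occurs more than once.
dupes = toy[toy.duplicated(subset=['poet', 'title'], keep=False)]


def batch(frame, number, size=40):
    """Return the `number`-th batch (0-based) of `size` rows."""
    start = number * size
    return frame.iloc[start:start + size]


first = batch(dupes, 0, size=4)
second = batch(dupes, 1, size=4)
```

Because `dupes` is computed once, the batches do not shift as rows are dropped from the main DataFrame between reviews.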


### Drop duplicates

[[go back to the top](#Predicting-Poetic-Movements)]

- Drop duplicate values.

In [19]:
# drop duplicates
to_drop = [176, 461, 547, 603, 737, 803, 823, 930, 939, 1267, 1506, 1556, 1930, 1936]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5136, 6)

In [20]:
# drop duplicates
to_drop = [2010, 2052, 2119, 2236, 2315, 2397, 2504, 2505, 2574, 2831, 2863, 2866, 2900]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5123, 6)

In [21]:
# drop duplicates
to_drop = [2903, 2914, 2969, 3238, 3264, 3287, 3902]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5116, 6)

In [22]:
# drop duplicates
to_drop = [4041]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5115, 6)
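
The drop-then-check-shape pattern above could also be wrapped in a helper that verifies the row count itself. This is a sketch with a hypothetical function name and a toy frame.

```python
import pandas as pd


def drop_and_confirm(frame, labels):
    """Drop rows by index label and verify the row count shrank as expected."""
    before = len(frame)
    # drop() raises KeyError by default if a label is not in the index.
    trimmed = frame.drop(index=labels)
    # Catches cases such as a label listed twice or a non-unique index.
    assert len(trimmed) == before - len(labels), 'unexpected row count'
    return trimmed


toy = pd.DataFrame({'title': list('abcde')}, index=[10, 11, 12, 13, 14])
toy = drop_and_confirm(toy, [11, 13])
```

Returning a new frame instead of using `inplace=True` also makes it harder to drop the same batch twice by re-running a cell.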

### Scrape extra pages

[[go back to the top](#Predicting-Poetic-Movements)]

- I noticed that some poems were incomplete because only their first page was scraped, so I'll rescrape the later pages and append them.

In [23]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=88&issue=5&page=10',
                  input_poet=df.loc[824, 'poet'], input_title='Tailfever was a bawdreur good')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])

In [24]:
temp_rescrape2 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=88&issue=5&page=11',
                  input_poet=df.loc[824, 'poet'], input_title='They caught each other by the body')
temp_rescrape_lines2 = [temp_rescrape2['title']]
temp_rescrape_lines2.extend(temp_rescrape2['poem_lines'])

In [25]:
temp_lines = df.loc[824, 'poem_lines'].copy()
temp_lines.extend(temp_rescrape_lines)
temp_lines.extend(temp_rescrape_lines2)
temp_string = '\n'.join(temp_lines)

In [26]:
df.loc[824, 'poem_lines'] = temp_lines
df.loc[824, 'poem_string'] = temp_string

In [27]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=55&issue=5&page=4',
                  input_poet='Gertrude Stein', input_title='Before')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])

In [28]:
temp_lines = df.loc[2316, 'poem_lines'].copy()
temp_lines.extend(temp_rescrape_lines[:-1])
temp_string = '\n'.join(temp_lines)

In [29]:
df.loc[2316, 'poem_lines'] = temp_lines
df.loc[2316, 'poem_string'] = temp_string
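
The append steps above can be summarized as: copy the first page's lines, then add each extra page's title and lines in order. The sketch below assumes each rescrape result is dict-like with `'title'` and `'poem_lines'` keys, as in the cells above. `append_pages` is a hypothetical name and the sample pages are invented.

```python
def append_pages(first_page_lines, extra_pages):
    """Return poem lines with each extra page's title and lines appended."""
    combined = list(first_page_lines)  # copy, so the original stays untouched
    for page in extra_pages:
        combined.append(page['title'])
        combined.extend(page['poem_lines'])
    return combined, '\n'.join(combined)


page_one = ['line 1', 'line 2']
pages = [
    {'title': 'Second section', 'poem_lines': ['line 3']},
    {'title': 'Third section', 'poem_lines': ['line 4', 'line 5']},
]
all_lines, all_string = append_pages(page_one, pages)
```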

### Add titles

[[go back to the top](#Predicting-Poetic-Movements)]

- Some titles are mysteriously difficult to scrape, so I'll impute those here.

In [30]:
df.loc[1167, 'title'] = "Lady’s Boogie"
df.loc[1168, 'title'] = 'Flatted Fifths'

In [31]:
df.loc[3416, 'title'] = 'Christmas Eve'
df.loc[3417, 'title'] = 'The Obvious Tradition'
df.loc[3568, 'title'] = 'Deerfield:1703'
df.loc[3569, 'title'] = 'Slave Sale: New Orleans'
df.loc[3702, 'title'] = 'Song: to Celia [Come, my Celia, let us prove]'
df.loc[3703, 'title'] = 'Song: to Celia [“Drink to me only with thine eyes”]'
df.loc[3704, 'title'] = '“Though I am young, and cannot tell”'
df.loc[3705, 'title'] = 'Ode to Himself [“Come leave the loathéd stage”]'
df.loc[3900, 'title'] = 'Delia 36: But love whilst that thou mayst be loved again'
df.loc[3901, 'title'] = 'Delia 2: Go wailing verse, the infants of my love'
df.loc[3903, 'title'] = 'Delia 53: Unhappy pen and ill accepted papers'

In [32]:
df.loc[3904, 'title'] = 'Delia 47: Read in my face a volume of despairs'
df.loc[3905, 'title'] = 'Delia 1: Unto the boundless Ocean of thy beauty'
df.loc[3906, 'title'] = "Delia 6: Fair is my love, and cruel as she's fair"
df.loc[3907, 'title'] = 'Delia 46: Let others sing of knights and paladins'
df.loc[3908, 'title'] = 'Delia 37: When men shall find thy flower, thy glory pass'
df.loc[3909, 'title'] = 'Delia 45: Care-charmer Sleep, son of the sable Night'
df.loc[3917, 'title'] = 'Astrophil and Stella 3: Let dainty wits cry on the sisters nine'
df.loc[3918, 'title'] = 'Astrophil and Stella 41: Having this day my horse, my hand, my lance'
df.loc[3919, 'title'] = 'Astrophil and Stella 63: O Grammar rules, O now your virtues show'
df.loc[3920, 'title'] = 'Astrophil and Stella 64: No more, my dear, no more these counsels try'
df.loc[3921, 'title'] = 'Astrophil and Stella 52: A strife is grown between Virtue and Love'
df.loc[3922, 'title'] = 'Astrophil and Stella 21: Your words my friend (right healthful caustics) blame'
df.loc[3923, 'title'] = 'Astrophil and Stella 15: You that do search for every purling spring'
df.loc[3924, 'title'] = 'Astrophil and Stella 72: Desire, though thou my old companion art'
df.loc[3925, 'title'] = 'Astrophil and Stella 90: Stella, think not that I by verse seek fame'
df.loc[3926, 'title'] = 'Astrophil and Stella 92: Be your words made, good sir, of Indian ware'
df.loc[3927, 'title'] = 'Astrophil and Stella 49: I on my horse, and Love on me, doth try'
df.loc[3928, 'title'] = 'Astrophil and Stella 47: What, have I thus betrayed my liberty?'
df.loc[3929, 'title'] = 'Astrophil and Stella 107: Stella, since thou so right a princess art'
df.loc[3930, 'title'] = 'Astrophil and Stella 20: Fly, fly, my friends, I have my death wound, fly'
df.loc[3931, 'title'] = 'Astrophil and Stella 23: The curious wits, seeing dull pensiveness'
df.loc[3932, 'title'] = 'Astrophil and Stella 25: The wisest scholar of the wight most wise'
df.loc[3933, 'title'] = 'Astrophil and Stella 48: Soul’s joy, bend not those morning stars from me'
df.loc[3934, 'title'] = 'Astrophil and Stella 71: Who will in fairest book of nature know'

In [33]:
df.loc[3935, 'title'] = 'Astrophil and Stella 84: Highway, since you my chief Parnassus be'
df.loc[3936, 'title'] = "Astrophil and Stella 31: With how sad steps, O Moon, thou climb'st the skies"
df.loc[3937, 'title'] = 'Astrophil and Stella 33: I might!—unhappy word—O me, I might'
df.loc[3938, 'title'] = 'Astrophil and Stella 1: Loving in truth, and fain in verse my love to show'
df.loc[3939, 'title'] = "Astrophil and Stella 7: When Nature made her chief work, Stella's eyes"
df.loc[3940, 'title'] = 'Astrophil and Stella 14: Alas, have I not pain enough, my friend'
df.loc[3941, 'title'] = 'Song from Arcadia: “My True Love Hath My Heart”'
df.loc[3942, 'title'] = 'Astrophil and Stella 39: Come Sleep! O Sleep, the certain knot of peace'
df.loc[3997, 'title'] = 'Book 1, Epigram 39: Ad librum suum.'
df.loc[3998, 'title'] = 'Book 1, Epigram 5: Ad lectorem de subjecto operis sui.'
df.loc[3999, 'title'] = 'Book 7, Epigram 9: De senectute & iuuentute.'
df.loc[4000, 'title'] = 'Book 2, Epigram 4: Ad Henricum Wottonum.'
df.loc[4001, 'title'] = 'Book 5, Epigram 20: In Misum & Mopsam.'
df.loc[4002, 'title'] = 'Book 1, Epigram 34: Ad. Thomam Freake armig. de veris adventu.'
df.loc[4009, 'title'] = 'Book 6, Epigram 17: In Sextum.'
df.loc[4010, 'title'] = 'Book 2, Epigram 21: In Momum.'
df.loc[4011, 'title'] = 'Book 6, Epigram 7: In prophanationem nominis Dei.'
df.loc[4012, 'title'] = 'Book 7, Epigram 36: De puero balbutiente.'
df.loc[4013, 'title'] = 'Book 7, Epigram 47: De Hominis Ortu & Sepultura.'
df.loc[4014, 'title'] = 'Book 2, Epigram 40: De libro suo.'
df.loc[4015, 'title'] = 'Book 6, Epigram 14: De Piscatione.'
df.loc[4042, 'title'] = 'Song: “Come away, come away, death”'
df.loc[4043, 'title'] = 'Song: “Where the bee sucks, there suck I”'

In [34]:
df.loc[4044, 'title'] = 'Song: “When daisies pied and violets blue”'
df.loc[4045, 'title'] = 'Song: “Sigh no more, ladies, sigh no more”'
df.loc[4046, 'title'] = 'Song: “Orpheus with his lute made trees”'
df.loc[4047, 'title'] = 'Song: “Fear no more the heat o’ the sun”'
df.loc[4048, 'title'] = 'Sonnet 135: Whoever hath her wish, thou hast thy Will'
df.loc[4049, 'title'] = 'Song: “O Mistress mine where are you roaming?”'
df.loc[4050, 'title'] = 'Song: “Who is Silvia? what is she”'
df.loc[4051, 'title'] = 'Song: “When that I was and a little tiny boy (With hey, ho, the wind and the rain)”'
df.loc[4052, 'title'] = "Song: “Hark, hark! the lark at heaven's gate sings”"
df.loc[4113, 'title'] = 'Speech: “O Romeo, Romeo, wherefore art thou Romeo?”'
df.loc[4114, 'title'] = 'Speech: “Is this a dagger which I see before me”'
df.loc[4115, 'title'] = 'Speech: “No matter where; of comfort no man speak”'
df.loc[4116, 'title'] = 'Speech: “To be, or not to be, that is the question”'
df.loc[4117, 'title'] = 'Speech: “This day is called the feast of Crispian”'
df.loc[4118, 'title'] = 'Speech: “Friends, Romans, countrymen, lend me your ears”'
df.loc[4119, 'title'] = 'Speech: “Once more unto the breach, dear friends, once more”'
df.loc[4120, 'title'] = 'Speech: “Tomorrow, and tomorrow, and tomorrow”'
df.loc[4121, 'title'] = 'Speech: “The raven himself is hoarse”'
df.loc[4122, 'title'] = 'Song: “Take, oh take those lips away”'
df.loc[4123, 'title'] = 'Speech: “Time hath, my lord, a wallet at his back”'
df.loc[4124, 'title'] = 'Song: “It was a lover and his lass”'
df.loc[4125, 'title'] = "Speech: All the world's a stage."
df.loc[4126, 'title'] = 'Song: Blow blow thou winter wind'

In [35]:
df.loc[4139, 'title'] = "Sonnet 92: Behold that tree, in Autumn’s dim decay"
df.loc[4140, 'title'] = 'Sonnet 91: On the fleet streams, the Sun, that late arose'
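
A long run of single-cell assignments like these could also be driven from one mapping of index label to title. This is a sketch on a toy frame, not the notebook's approach. The check before the loop matters because assigning with `.loc` to a label that no longer exists (for example, a row dropped earlier) would add a new row instead of raising an error.

```python
import pandas as pd

toy = pd.DataFrame(
    {'title': [None, None, 'Already fine']},
    index=[1167, 1168, 1169],
)

# index label -> title, kept in one place so it is easy to review
new_titles = {
    1167: 'Lady’s Boogie',
    1168: 'Flatted Fifths',
}

# Guard against labels that are not in the frame.
missing = set(new_titles) - set(toy.index)
assert not missing, f'labels not in frame: {missing}'

for label, title in new_titles.items():
    toy.loc[label, 'title'] = title
```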

#### Miscellaneous (remove extra scrape)

In [36]:
df.loc[1931, 'poem_lines'] = df.loc[1931, 'poem_lines'][:4]
df.loc[1931, 'poem_string'] = '\n'.join(df.loc[1931, 'poem_lines'])

## Check for NaN values

[[go back to the top](#Predicting-Poetic-Movements)]

- Check for NaN values.
- Impute, rescrape, or drop as necessary.

In [37]:
df.isna().sum()

poet           0
poem_url       0
title          9
poem_lines     0
poem_string    8
genre          0
dtype: int64
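
For the missing titles, one possible starting point is the `poem_url` column, which has no NaN values and ends in a slug of the title. The sketch below guesses a title from that slug. It is an illustration, not the method used in this notebook, and the 13-character hex suffix pattern is an assumption based on the URLs seen in this dataset.

```python
import re
from urllib.parse import urlparse


def title_from_url(url):
    """Guess a title from the last path segment of a poem URL."""
    slug = urlparse(url).path.rstrip('/').split('/')[-1]
    # Drop a trailing id such as '-56d23cb395a01' (assumed: 13 hex characters).
    slug = re.sub(r'-[0-9a-f]{13}$', '', slug)
    return slug.replace('-', ' ').title()


guess = title_from_url(
    'https://www.poetryfoundation.org/poems/48215/gloria-mundi')
guess_with_id = title_from_url(
    'https://www.poetryfoundation.org/poems/58377/riot-56d23cb395a01')
```

Slugs drop punctuation and capitalization (apostrophes, colons, quotation marks), so any guess made this way would still need a manual check.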

In [38]:
df[df.title.isna()]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
114,Allen Ginsberg,https://www.poetryfoundation.org/poems/50123/new-stanzas-for-amazing-grace,,"[I dreamed I dwelled in a homeless place, Where I was lost alone, Folk looked right through me into space, And passed with eyes of stone, O homele...",I dreamed I dwelled in a homeless place\nWhere I was lost alone\nFolk looked right through me into space\nAnd passed with eyes of stone\nO homeles...,beat
264,Amiri Baraka,https://www.poetryfoundation.org/poems/52777/an-agony-as-now,,"[I am inside someone, who hates me. I look, out from his eyes. Smell, what fouled tunes come in, to his breath. Love his, wretched women., Slits i...",I am inside someone\nwho hates me. I look\nout from his eyes. Smell\nwhat fouled tunes come in\nto his breath. Love his\nwretched women.\nSlits in...,black_arts_movement
306,Gwendolyn Brooks,https://www.poetryfoundation.org/poems/58377/riot-56d23cb395a01,,"[A Poem in Three Parts, , John Cabot, out of Wilma, once a Wycliffe, , Because the “Negroes” were coming down the street. , Because ...","A Poem in Three Parts\n \nJohn Cabot, out of Wilma, once a Wycliffe, \nBecause the “Negroes” were coming down the street. \nBecause t...",black_arts_movement
1045,Walter de La Mare,https://www.poetryfoundation.org/poems/48215/gloria-mundi,,"[Upon a bank, easeless with knobs of gold,, Beneath a canopy of noonday smoke,, I saw a measureless Beast, morose and bold,, With eyes like one fr...","Upon a bank, easeless with knobs of gold,\nBeneath a canopy of noonday smoke,\nI saw a measureless Beast, morose and bold,\nWith eyes like one fro...",georgian
2186,Edgar Lee Masters,https://www.poetryfoundation.org/poems/56348/archibald-higbie,,"[I loathed you, Spoon River. I tried to rise above you,, I was ashamed of you. I despised you, As the place of my nativity., And there in Rome, am...","I loathed you, Spoon River. I tried to rise above you,\nI was ashamed of you. I despised you\nAs the place of my nativity.\nAnd there in Rome, amo...",modern
3874,Mary Sidney Herbert Countess of Pembroke,https://www.poetryfoundation.org/poems/55249/o-56d2369e67a1d,,"[Oh, what a lantern, what a lamp of light, Is thy pure word to me, To clear my paths and guide my goings right!, I swore and swear again,, I of th...","Oh, what a lantern, what a lamp of light\nIs thy pure word to me\nTo clear my paths and guide my goings right!\nI swore and swear again,\nI of the...",renaissance
3985,Sir Walter Ralegh,https://www.poetryfoundation.org/poems/57130/on-the-cards-and-dice,,"[Before the sixth day of the next new year,, Strange wonders in this kingdom shall appear:, Four kings shall be assembled in this isle,, Where the...","Before the sixth day of the next new year,\nStrange wonders in this kingdom shall appear:\nFour kings shall be assembled in this isle,\nWhere they...",renaissance
4216,John Keats,https://www.poetryfoundation.org/poems/44468/bright-star-would-i-were-stedfast-as-thou-art,,"[Bright star, would I were stedfast as thou art—, Not in lone splendour hung aloft the night, And watching, with eternal lids apart,, Like nature'...","Bright star, would I were stedfast as thou art—\nNot in lone splendour hung aloft the night\nAnd watching, with eternal lids apart,\nLike nature's...",romantic
4766,Elizabeth Barrett Browning,https://www.poetryfoundation.org/poems/43733/sonnets-from-the-portuguese-1-i-thought-once-how-theocritus-had-sung,,"[I thought once how Theocritus had sung, Of the sweet years, the dear and wished for years,, Who each one in a gracious hand appears, To bear a gi...","I thought once how Theocritus had sung\nOf the sweet years, the dear and wished for years,\nWho each one in a gracious hand appears\nTo bear a gif...",victorian


- Impute titles.

In [39]:
df.loc[114, 'title'] = 'New Stanzas for Amazing Grace'
df.loc[264, 'title'] = 'An Agony. As Now.'
df.loc[306, 'title'] = 'RIOT'
df.loc[1045, 'title'] = 'Gloria Mundi'
df.loc[2186, 'title'] = 'Archibald Higbie'
df.loc[3874, 'title'] = 'O'
df.loc[3985, 'title'] = 'On the Cards and Dice'
df.loc[4216, 'title'] = '“Bright star, would I were stedfast as thou art”'
df.loc[4766, 'title'] = 'Sonnets from the Portuguese 1: I thought once how Theocritus had sung'
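
The same kind of imputation can also be written as a lookup table, which keeps each row label next to its title and makes the final check explicit. This is a minimal sketch on a toy frame; the `title_fixes` name and the two-row frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# toy stand-in for the real DataFrame, mimicking two of the rows above
toy = pd.DataFrame(
    {'poet': ['Walter de La Mare', 'Edgar Lee Masters'],
     'title': [None, None]},
    index=[1045, 2186],
)

# hypothetical mapping of row label -> title to impute
title_fixes = {
    1045: 'Gloria Mundi',
    2186: 'Archibald Higbie',
}

for idx, title in title_fixes.items():
    toy.loc[idx, 'title'] = title

# count the titles that are still missing
remaining = int(toy['title'].isna().sum())
```

If `remaining` is not zero, a row label was left out of the mapping.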

In [40]:
df[df.poem_string.isna()]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
239,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,[],,beat
2339,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/25655/toward-the-south-tr-by-harry-duncan,Toward The South Tr By Harry Duncan,[],,modern
2526,Malcolm Cowley,https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968,A Countryside 1918 1968,[],,modern
2886,Stephen Spender,https://www.poetryfoundation.org/poetrymagazine/poems/22310/poem-after-the-wrestling,Poem After The Wrestling,[],,modern
3007,William Butler Yeats,https://www.poetryfoundation.org/poetrymagazine/poems/20737/a-full-moon-in-march,A Full Moon In March,[],,modern
3143,Frank O'Hara,https://www.poetryfoundation.org/poetrymagazine/poems/31123/places-for-oscar-salvador,Places For Oscar Salvador,[],,new_york_school
3386,Anne Waldman,https://www.poetryfoundation.org/poetrymagazine/poems/56845/history-will-decide,History Will Decide,[],,new_york_school_2nd_generation
3516,Tom Clark,https://www.poetryfoundation.org/poetrymagazine/poems/30773/fig-1,Fig 1,[],,new_york_school_2nd_generation


- Some of these poems have already been rescraped, so I drop them below.
- The rest I rescrape.

In [41]:
# drop already re-scraped poems
to_drop = [239, 2339, 2526, 3007, 3143]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5110, 6)

#### Rescrapes

In [42]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=54&issue=1&page=17'
url = df.loc[2886, 'poem_url']
rescrape = scan_poem_scraper(actual_url, input_poet=df.loc[2886, 'poet'], input_title='Poem')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2886, 'genre']
df.loc[2886, 'poem_lines'] = rescrape['poem_lines']
df.loc[2886, 'poem_string'] = rescrape['poem_string']

In [43]:
df.loc[3386, 'poem_lines']

[]

In [44]:
rescrape = rescraper(df.loc[3386, 'poem_url'], 'justify')
df.loc[3386, 'poem_lines'] = rescrape[0]
df.loc[3386, 'poem_string'] = rescrape[1]

In [45]:
# rescrape (NOTE: only grabs first page)
url = df.loc[3516, 'poem_url']
rescrape = scan_poem_scraper(url, input_poet=df.loc[3516, 'poet'], input_title='FIG. 1: Weakly cuddling the telephone as a last')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[3516, 'genre']

In [46]:
# scrape second page
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=111&issue=3&page=6',
                  input_poet='Tom Clark', input_title='POETRY')
temp_rescrape_lines = temp_rescrape['poem_lines']

In [47]:
# combine pages
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = '\n'.join(rescrape['poem_lines'])
df.loc[3516, 'poem_lines'] = rescrape['poem_lines']
df.loc[3516, 'poem_string'] = rescrape['poem_string']
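
The combine step above recurs for every poem that spans several scanned pages. A small helper could gather it in one place. The sketch below assumes each scrape result is a dict with a `poem_lines` list, which is how the results are used in this notebook; `combine_pages` itself is a hypothetical name and the page contents are placeholders.

```python
def combine_pages(first_page, *later_pages):
    """Merge per-page scrape results into one poem (hypothetical helper).

    Each argument is assumed to be a dict holding a 'poem_lines' list.
    The first page's other keys (poet, title, ...) are carried over.
    """
    lines = list(first_page['poem_lines'])  # copy, so the input is not mutated
    for page in later_pages:
        lines.extend(page['poem_lines'])
    combined = dict(first_page)
    combined['poem_lines'] = lines
    combined['poem_string'] = '\n'.join(lines)
    return combined


# placeholder pages standing in for real scrape results
page_one = {'poem_lines': ['line a', 'line b'], 'poem_string': 'line a\nline b'}
page_two = {'poem_lines': ['line c']}
poem = combine_pages(page_one, page_two)
```

Unlike the in-place `extend` used in the cells, this version leaves the first page's list untouched, so a cell can be rerun without appending the same lines twice.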

- Re-check for NaN values.

In [48]:
df.isna().sum()

poet           0
poem_url       0
title          0
poem_lines     0
poem_string    0
genre          0
dtype: int64

- Re-sort and reset indices.

In [49]:
# sort dataframe
df.sort_values(by=['genre', 'poet', 'title'], inplace=True)

# reset indices
df.reset_index(drop=True, inplace=True)

#### 💾 Save a copy

In [50]:
# # uncomment to save
# df.to_csv('data/poems_df_cleaner.csv')

## Check for other bad scrapes

[[go back to the top](#Data-Cleaning)]

- I'll look for any poems that seem suspiciously short.
- Impute, rescrape, or drop as necessary.
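
The length check is repeated with several thresholds in this section. One way to avoid recomputing a temporary column each time is a small filter function. This is a sketch on toy data; `short_poems` is a hypothetical name and the rows are placeholders.

```python
import pandas as pd


def short_poems(frame, max_chars):
    """Return rows whose poem_string has at most max_chars characters."""
    lengths = frame['poem_string'].str.len()
    return frame[lengths <= max_chars]


# placeholder rows: two suspiciously short strings and one normal one
toy = pd.DataFrame({
    'title': ['placeholder A', 'placeholder B', 'placeholder C'],
    'poem_string': [' ', 'I', 'a much longer string of text that passes the check'],
})
flagged = short_poems(toy, max_chars=10)
```

Note that the comparison is inclusive, matching the `<=` used in the cells of this section.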

In [51]:
# length of poem string
df['temp_len'] = df.poem_string.apply(len)

# take a look at the numbers
df.temp_len.describe()

count     5110.000000
mean      1487.130920
std       2715.646359
min          1.000000
25%        475.000000
50%        715.000000
75%       1405.500000
max      53241.000000
Name: temp_len, dtype: float64

In [52]:
# poem strings of 10 characters or fewer
df[df.temp_len <= 10]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre,temp_len
1072,Wilfred Owen,https://www.poetryfoundation.org/poems/57369/the-send-off,The Send-Off,[ ],,georgian,1
1473,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/28309/3-stances,3 Stances,[I],I,imagist,1
2731,Paul Valéry,https://www.poetryfoundation.org/poetrymagazine/poems/27998/to-the-plane-tree-tr-by-louise-bogan-and-may-sarton,To The Plane Tree Tr By Louise Bogan And May Sarton,[ ],,modern,1
2844,Stephen Spender,https://www.poetryfoundation.org/poetrymagazine/poems/29238/journal-leaves,Journal Leaves,[I],I,modern,1
3308,Alice Notley,https://www.poetryfoundation.org/poems/58243/gift-56d23c725d4d9,Gift,"[, ]",\n,new_york_school_2nd_generation,1
3347,Aram Saroyan,https://www.poetryfoundation.org/poetrymagazine/poems/30722/cham-pagne,Cham Pagne,[cham.],cham.,new_york_school_2nd_generation,5
3544,Gaius Valerius Catullus,https://www.poetryfoundation.org/poetrymagazine/poems/31271/peliaco-quondam-with-celia-zukofsky,Peliaco Quondam With Celia Zukofsky,[ ],,objectivist,1
4474,A. E. Housman,https://www.poetryfoundation.org/poems/58269/a-shropshire-lad-52-far-in-a-western-brookland-,A Shropshire Lad 52: Far in a western brookland,[ ],,victorian,1
4876,Katharine Tynan,https://www.poetryfoundation.org/poems/57349/a-lament-56d23ac7ae84a,A Lament,[ ],,victorian,1


In [53]:
# drop already re-scraped poems
to_drop = [1473, 2731, 3544]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5107, 7)

#### Rescrapes

- Due to a mysterious length error, each list of lines has to be stored as a string (these are converted back to lists under Finishing touches).
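
The traceback is not shown, so the cause is a guess: pandas can raise a length-related `ValueError` when `.loc` receives a list and tries to spread it across cells instead of storing it as one value. If that is what happened here, `DataFrame.at` is one possible alternative, since it always targets a single cell. A sketch on a toy frame with placeholder text:

```python
import pandas as pd

# toy stand-in with the same two columns; row labels borrowed from above
toy = pd.DataFrame(
    {'poem_lines': [[' '], ['placeholder']],
     'poem_string': [' ', 'placeholder']},
    index=[1072, 4474],
)

new_lines = ['first line', 'second line']

# .at addresses exactly one cell, so the list is stored as a single value
toy.at[1072, 'poem_lines'] = new_lines
toy.at[1072, 'poem_string'] = '\n'.join(new_lines)
```

This relies on `poem_lines` being an object column, which is the case for a column that holds lists.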

In [54]:
url = df.loc[1072, 'poem_url']
rescrape = rescraper(url, 'poempara')
df.loc[1072, 'poem_lines'] = str(rescrape[0])
df.loc[1072, 'poem_string'] = rescrape[1]

In [55]:
url = df.loc[4474, 'poem_url']
rescrape = rescraper(url, 'poempara')
df.loc[4474, 'poem_lines'] = str(rescrape[0])
df.loc[4474, 'poem_string'] = rescrape[1]

In [56]:
url = df.loc[4876, 'poem_url']
rescrape = rescraper(url, 'poempara')
df.loc[4876, 'poem_lines'] = str(rescrape[0])
df.loc[4876, 'poem_string'] = rescrape[1]

In [57]:
url = df.loc[2844, 'poem_url']
rescrape = scan_poem_scraper(url, 
                             input_poet='Stephen Spender',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!CHARLES TOMLINSON).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
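
The `next_pattern` above uses a negative lookahead to stop collecting lines once the next poet's name appears on the page. How `scan_poem_scraper` applies the pattern internally is not shown here, so the sketch below only demonstrates the regular expression on its own, with the standard `re` module and placeholder page text.

```python
import re

# placeholder page text: a poem followed by the next poet's heading
page_text = (
    "SOME TITLE\n"
    "first line\n"
    "second line\n"
    "CHARLES TOMLINSON\n"
    "a line from the next poem"
)

# same shape as next_pattern: keep consuming whole lines until one
# begins with the stop marker
pattern = r"SOME TITLE((?:\r?\n(?!CHARLES TOMLINSON).*)*)"

match = re.search(pattern, page_text)
poem_lines = match.group(1).strip().splitlines()
```

Each repetition of the inner group consumes a line break plus the line after it, but only if that line does not start with the marker, so the capture ends just before `CHARLES TOMLINSON`.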

In [58]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=101&issue=1&page=169',
                  input_poet='Stephen Spender', input_title='Lothar lit a log fire in the yard', 
                                  first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!CHARLES TOMLINSON).*)*)')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])

In [59]:
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = '\n'.join(rescrape['poem_lines'])
df.loc[2844, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[2844, 'poem_string'] = rescrape['poem_string']

In [60]:
url = df.loc[3308, 'poem_url']
rescrape = rescraper(url, 'p_all')
df.loc[3308, 'poem_lines'] = str(rescrape[0])
df.loc[3308, 'poem_string'] = rescrape[1]

#### Re-check for bad scrapes

- This time, I check for anything of 50 characters or fewer.

In [61]:
df['temp_len'] = df.poem_string.apply(len)
df[df.temp_len <= 50]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre,temp_len
188,Kenneth Rexroth,https://www.poetryfoundation.org/poetrymagazine/poems/29229/air-and-angels,Air And Angels,"[Four Poems for Susan, I]",Four Poems for Susan\nI,beat,22
571,Robert Creeley,https://www.poetryfoundation.org/poems/54508/one-day-56d234ece07b2,One Day,"[One day after another—, Perfect., They all fit.]",One day after another—\nPerfect.\nThey all fit.,black_mountain,45
996,Robert Nichols,https://www.poetryfoundation.org/poetrymagazine/poems/14477/modern-love-song,Modern Love Song,[For L. F. 8.],For L. F. 8.,georgian,12
1068,Wilfred Owen,https://www.poetryfoundation.org/poems/57347/smile-smile-smile,"Smile, Smile, Smile","[Head to limp head, the sunk-eyed wounded scanned]","Head to limp head, the sunk-eyed wounded scanned",georgian,48
1221,Melvin B. Tolson,https://www.poetryfoundation.org/poems/56036/delta,DELTA,"[Doubt not, the Siamese twin]",Doubt not\nthe Siamese twin,harlem_renaissance,26
1223,Melvin B. Tolson,https://www.poetryfoundation.org/poems/56037/eta,ETA,[Her neon sign blared two Harlem blocks.],Her neon sign blared two Harlem blocks.,harlem_renaissance,39
1588,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/21822/weasel-snout,Weasel Snout,"[Staring she, kindles, the street windows]",Staring she\nkindles\nthe street windows,imagist,38
2266,Ford Madox Ford,https://www.poetryfoundation.org/poetrymagazine/poems/19554/buckshee-i-v,Buckshee I V,"[NO. V, , A Magazine of Verse]",NO. V\n \nA Magazine of Verse,modern,27
2285,Gertrude Stein,https://www.poetryfoundation.org/poems/51213/a-white-hunter,A White Hunter,[A white hunter is nearly crazy.],A white hunter is nearly crazy.,modern,31
2308,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/25652/le-pont-mirabeau-tr-by-harry-duncan,Le Pont Mirabeau Tr By Harry Duncan,[four poems from the french of apollinaire],four poems from the french of apollinaire,modern,41


In [62]:
# drop already re-scraped poems and un-scrapables
to_drop = [996, 1068, 2308, 2352, 3361, 3633, 3639, 3963, 4132]
df.drop(index=to_drop, inplace=True)

# confirm
df.shape

(5098, 7)

#### Rescrapes

In [63]:
url = df.loc[188, 'poem_url']
rescrape = scan_poem_scraper(url, 
                             input_poet='Kenneth Rexroth',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!MURIEL RUKEYSER).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'

In [64]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=101&issue=1&page=147',
                  input_poet='Kenneth Rexroth', input_title='An Easy Song', 
                                  first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!MURIEL RUKEYSER).*)*)')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])

In [65]:
temp_rescrape2 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=101&issue=1&page=148',
                  input_poet='Kenneth Rexroth', input_title='Coming', 
                                  first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!MURIEL RUKEYSER).*)*)')
temp_rescrape_lines2 = [temp_rescrape2['title']]
temp_rescrape_lines2.extend(temp_rescrape2['poem_lines'])

In [66]:
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_lines'].extend(temp_rescrape_lines2)
rescrape['poem_string'] = '\n'.join(rescrape['poem_lines'])
df.loc[188, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[188, 'poem_string'] = rescrape['poem_string']

In [67]:
url = df.loc[1221, 'poem_url']
rescrape = rescraper(url, 'p_all')
df.loc[1221, 'poem_lines'] = str(rescrape[0])
df.loc[1221, 'poem_string'] = rescrape[1]

In [68]:
url = df.loc[1223, 'poem_url']
rescrape = rescraper(url, 'p_all')
df.loc[1223, 'poem_lines'] = str(rescrape[0])
df.loc[1223, 'poem_string'] = rescrape[1]

In [69]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=51&issue=1&page=20',
                  input_poet=df.loc[1588, 'poet'], input_title='to daintiness')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])

In [70]:
df.loc[1588, 'poem_lines'].extend(temp_rescrape_lines[:-1])
df.loc[1588, 'poem_string'] = '\n'.join(df.loc[1588, 'poem_lines'])

In [71]:
url = df.loc[2266, 'poem_url']
rescrape = scan_poem_scraper(url, input_poet=df.loc[2266, 'poet'], input_title='Buckshee')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2266, 'genre']

In [72]:
df.loc[2266, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[2266, 'poem_string'] = rescrape['poem_string']

In [73]:
url = df.loc[3270, 'poem_url']
rescrape = scan_poem_scraper(url, input_poet=df.loc[3270, 'poet'], input_title='The Art of')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[3270, 'genre']

In [74]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=125&issue=4&page=10',
                  input_poet=df.loc[3270, 'poet'], input_title='(More about this a little later) 6) Is it in my own')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])

In [75]:
temp_rescrape2 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=125&issue=4&page=11',
                  input_poet=df.loc[3270, 'poet'], input_title='Indeed. For the original "inspration" is not there. Some')
temp_rescrape_lines2 = [temp_rescrape2['title']]
temp_rescrape_lines2.extend(temp_rescrape2['poem_lines'])

In [76]:
temp_rescrape3 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=125&issue=4&page=19',
                  input_poet=df.loc[3270, 'poet'], input_title='Total absorption in poetry is one of the finest things in')
temp_rescrape_lines3 = [temp_rescrape3['title']]
temp_rescrape_lines3.extend(temp_rescrape3['poem_lines'])

In [77]:
temp_rescrape4 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=125&issue=4&page=20',
                  input_poet=df.loc[3270, 'poet'], 
                  input_title='Natural part of writing. Your poetry, if possible')
temp_rescrape_lines4 = [temp_rescrape4['title']]
temp_rescrape_lines4.extend(temp_rescrape4['poem_lines'])

In [78]:
temp_rescrape5 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=125&issue=4&page=21',
                  input_poet=df.loc[3270, 'poet'], 
                  input_title='Or "almost" being friends with someone, or hanging')
temp_rescrape_lines5 = [temp_rescrape5['title']]
temp_rescrape_lines5.extend(temp_rescrape5['poem_lines'])

In [79]:
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_lines'].extend(temp_rescrape_lines2)
rescrape['poem_lines'].extend(temp_rescrape_lines3)
rescrape['poem_lines'].extend(temp_rescrape_lines4)
rescrape['poem_lines'].extend(temp_rescrape_lines5)

In [80]:
df.loc[3270, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[3270, 'poem_string'] = '\n'.join(rescrape['poem_lines'])

In [81]:
url = df.loc[3342, 'poem_url']
rescrape = rescraper(url, 'center')
df.loc[3342, 'poem_lines'] = str(rescrape[0])
df.loc[3342, 'poem_string'] = rescrape[1]

In [82]:
df.loc[3347, 'poem_lines'] = str(df.loc[3351, 'poem_lines'][-2:])
df.loc[3347, 'poem_string'] = 'pagne\nchamp'

In [83]:
df.loc[3351, 'poem_lines'] = str(['I crazy.'])
df.loc[3351, 'poem_string'] = 'I crazy.'

In [84]:
df.loc[3357, 'title'] = 'Untitled'
df.loc[3357, 'poem_lines'] = str(['night night night night night night night night night night night night night night night'])
df.loc[3357, 'poem_string'] = 'night night night night night night night night night night night night night night night'

In [85]:
df.loc[3358, 'poem_lines'] = str(['room now', 'door Humphrey', 'Bogart'])
df.loc[3358, 'poem_string'] = 'room now\ndoor Humphrey\nBogart'

In [86]:
df.loc[3359, 'poem_lines'] = str(df.loc[3359, 'poem_lines'][:2])
df.loc[3359, 'poem_string'] = 'tragedy\nbodies'

#### Re-check for bad scrapes

- This time, I check for anything of 100 characters or fewer.

In [87]:
df['temp_len'] = df.poem_string.apply(len)
df[df.temp_len <= 100].tail(50)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre,temp_len
1624,Charles Bernstein,https://www.poetryfoundation.org/poetrymagazine/poems/51175/two-stones-with-one-bird,Two Stones with One Bird,"[Re-, demption, comes, &, redemp-, tion, goes, but, trans-, ience, is, here, for-, ever.]",Re-\ndemption\ncomes\n&\nredemp-\ntion\ngoes\nbut\ntrans-\nience\nis\nhere\nfor-\never.,language_poetry,74
1682,Michael Palmer,https://www.poetryfoundation.org/poems/54871/autobiography-2-hellogoodby,Autobiography 2 (hellogoodby),"[The Book of Company which, I put down and can’t pick up, ]",The Book of Company which\nI put down and can’t pick up\n,language_poetry,56
1683,Michael Palmer,https://www.poetryfoundation.org/poems/54872/autobiography-3,Autobiography 3,"[Yes, I was born on the street known as Glass—as Paper, Scissors or Rock., ]","Yes, I was born on the street known as Glass—as Paper, Scissors or Rock.\n",language_poetry,74
1685,Michael Palmer,https://www.poetryfoundation.org/poems/54870/eighth-sky,Eighth Sky,"[It is scribbled along the body, Impossible even to say a word]",It is scribbled along the body\nImpossible even to say a word,language_poetry,60
1687,Michael Palmer,https://www.poetryfoundation.org/poems/54868/h,H,"[Yet the after is still a storm, as witness bent shadbush]",Yet the after is still a storm\nas witness bent shadbush,language_poetry,55
1695,Michael Palmer,https://www.poetryfoundation.org/poems/54869/twenty-four-logics-in-memory-of-lee-hickman,Twenty-four Logics in Memory of Lee Hickman,"[The bend in the river followed us for days, and above us the sun]",The bend in the river followed us for days\nand above us the sun,language_poetry,63
1703,Rae Armantrout,https://www.poetryfoundation.org/poems/46580/anti-short-story,Anti-Short Story,"[A girl is running., “She’s running for her bus.”, All that aside!]",A girl is running.\n“She’s running for her bus.”\nAll that aside!,language_poetry,63
1798,A. M. Klein,https://www.poetryfoundation.org/poetrymagazine/poems/18561/doubt,Doubt,"[And yet the doubt is hither-thither cast—, Will the last kiss I gave her be the last? . . .]",And yet the doubt is hither-thither cast—\nWill the last kiss I gave her be the last? . . .,modern,90
1808,A. M. Klein,https://www.poetryfoundation.org/poetrymagazine/poems/18555/love-call,Love Call,"[Now she awaits me at this time we made—, I'll ring the door-bell as my serenade.]",Now she awaits me at this time we made—\nI'll ring the door-bell as my serenade.,modern,79
1838,Alfred Kreymborg,https://www.poetryfoundation.org/poetrymagazine/poems/17680/manhattan-epitaphs,Manhattan Epitaphs,"[The one lone truth I’m certain of, this side the grave:, I haven't so long to live, as I used to have.]",The one lone truth I’m certain of\nthis side the grave:\nI haven't so long to live\nas I used to have.,modern,99


In [88]:
url = df.loc[1682, 'poem_url']
rescrape = rescraper(url, 'p_all')
df.loc[1682, 'poem_lines'] = str(rescrape[0])
df.loc[1682, 'poem_string'] = rescrape[1]

In [89]:
url = df.loc[1683, 'poem_url']
rescrape = rescraper(url, 'p_all')
df.loc[1683, 'poem_lines'] = str(rescrape[0])
df.loc[1683, 'poem_string'] = rescrape[1]

In [90]:
url = df.loc[1685, 'poem_url']
rescrape = rescraper(url, 'p_all')
df.loc[1685, 'poem_lines'] = str(rescrape[0])
df.loc[1685, 'poem_string'] = rescrape[1]

In [91]:
url = df.loc[1687, 'poem_url']
rescrape = rescraper(url, 'p_all')
df.loc[1687, 'poem_lines'] = str(rescrape[0])
df.loc[1687, 'poem_string'] = rescrape[1]

In [92]:
url = df.loc[1695, 'poem_url']
rescrape = rescraper(url, 'p_all')
# grab first line
rescrape2 = [rescraper(url, 'PoemView')[0][0]]
# add other lines
rescrape2.extend(rescrape[0])
temp_string = '\n'.join(rescrape2)
df.loc[1695, 'poem_lines'] = str(rescrape2)
df.loc[1695, 'poem_string'] = temp_string

In [93]:
url = df.loc[1940, 'poem_url']
rescrape = scan_poem_scraper(url, input_poet=df.loc[1940, 'poet'], input_title='Joy: Let a joy keep you')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[1940, 'genre']
df.loc[1940, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[1940, 'poem_string'] = rescrape['poem_string']

In [94]:
url = df.loc[2106, 'poem_url']
rescrape = scan_poem_scraper(url, 
                             input_poet=df.loc[2106, 'poet'],
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!POETRY).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2106, 'genre']
df.loc[2106, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[2106, 'poem_string'] = rescrape['poem_string']

In [95]:
url = df.loc[2536, 'poem_url']
rescrape = scan_poem_scraper(url, 
                             input_poet=df.loc[2536, 'poet'],
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!THE HERO).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2536, 'genre']

In [96]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=40&issue=3&page=6',
                  input_poet=df.loc[2536, 'poet'],
                    input_title='says, "but the French do not think that all can have it', 
                    first_pattern='.*((?:\r?\n.*)*)',
                    next_pattern='\n((?:\r?\n(?!THE HERO).*)*)')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])

In [97]:
temp_rescrape2 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=40&issue=3&page=8',
                  input_poet=df.loc[2536, 'poet'],
                  input_title='of the eye showing; or as sea-lions keep going round and', 
                  first_pattern='.*((?:\r?\n.*)*)',
                  next_pattern='\n((?:\r?\n(?!THE HERO).*)*)')
temp_rescrape_lines2 = [temp_rescrape2['title']]
temp_rescrape_lines2.extend(temp_rescrape2['poem_lines'])

In [98]:
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_lines'].extend(temp_rescrape_lines2)
rescrape['poem_string'] = '\n'.join(rescrape['poem_lines'])
df.loc[2536, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[2536, 'poem_string'] = rescrape['poem_string']

In [99]:
url = df.loc[2586, 'poem_url']
rescrape = scan_poem_scraper(url, 
                             input_poet=df.loc[2586, 'poet'],
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!MUSICIAN).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[2586, 'genre']
df.loc[2586, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[2586, 'poem_string'] = rescrape['poem_string']

In [100]:
df.loc[3343, 'poem_string'] = '\n'.join(df.loc[3343, 'poem_lines'][:2])
df.loc[3343, 'poem_lines'] = str(df.loc[3343, 'poem_lines'][:2])

df.loc[3346, 'poem_string'] = '\n'.join(df.loc[3346, 'poem_lines'][:2])
df.loc[3346, 'poem_lines'] = str(df.loc[3346, 'poem_lines'][:2])

df.loc[3360, 'poem_string'] = '\n'.join(df.loc[3360, 'poem_lines'][:2])
df.loc[3360, 'poem_lines'] = str(df.loc[3360, 'poem_lines'][:2])

In [101]:
url = df.loc[3568, 'poem_url']
rescrape = scan_poem_scraper(url, 
                             input_poet=df.loc[3568, 'poet'],
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern="\n((?:\r?\n(?!GIOVANNI'S RAPE OF THE SABINE).*)*)")
rescrape['poem_url'] = url
rescrape['genre'] = df.loc[3568, 'genre']

In [102]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=104&issue=2&page=16',
                  input_poet=df.loc[3568, 'poet'],
                    input_title='beach Curves to a bay. Here the dug-out is hauled', 
                    first_pattern=r".*((?:\r?\n(?!Crying\.).*)*)",
                    next_pattern="\n((?:\r?\n(?!GIOVANNI'S RAPE OF THE SABINE).*)*)")
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines.extend(['Crying.'])

In [103]:
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = '\n'.join(rescrape['poem_lines'])
df.loc[3568, 'poem_lines'] = str(rescrape['poem_lines'])
df.loc[3568, 'poem_string'] = rescrape['poem_string']

#### Re-check for bad scrapes

- Again, I'll look at anything of 100 characters or fewer.

In [104]:
df['temp_len'] = df.poem_string.apply(len)
df[df.temp_len <= 100].head(50)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre,temp_len
179,Kenneth Patchen,https://www.poetryfoundation.org/poems/46346/the-murder-of-two-men-by-a-young-kid-wearing-lemon-colored-gloves,The Murder of Two Men by a Young Kid Wearing Lemon-colored Gloves,"[Wait., Wait., Wait., Wait. Wait., Wait., Wait., W a i t., Wait., Wait., Wait., Wait., Wait., NOW.]",Wait.\nWait.\nWait.\nWait. Wait.\nWait.\nWait.\nW a i t.\nWait.\nWait.\nWait.\nWait.\nWait.\nNOW.,beat,85
249,Richard Brautigan,https://www.poetryfoundation.org/poems/48581/haiku-ambulance,Haiku Ambulance,"[A piece of green pepper, fell, off the wooden salad bowl:, so what?]",A piece of green pepper\nfell\noff the wooden salad bowl:\nso what?,beat,64
256,Richard Brautigan,https://www.poetryfoundation.org/poems/48585/the-pill-versus-the-springhill-mine-disaster,The Pill Versus the Springhill Mine Disaster,"[When you take your pill, it’s like a mine disaster., I think of all the people, lost inside of you.]",When you take your pill\nit’s like a mine disaster.\nI think of all the people\nlost inside of you.,beat,96
543,Robert Creeley,https://www.poetryfoundation.org/poems/49023/a-token,A Token,"[My lady, fair with, soft, arms, what, can I say to, you—words, words, as if all, worlds were there.]","My lady\nfair with\nsoft\narms, what\ncan I say to\nyou—words, words\nas if all\nworlds were there.",black_mountain,92
560,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/30527/here,Here,"[What, has happened, makes, the world., Live, on the edge,, looking.]","What\nhas happened\nmakes\nthe world.\nLive\non the edge,\nlooking.",black_mountain,61
571,Robert Creeley,https://www.poetryfoundation.org/poems/54508/one-day-56d234ece07b2,One Day,"[One day after another—, Perfect., They all fit.]",One day after another—\nPerfect.\nThey all fit.,black_mountain,45
604,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/30990/the-puritan-ethos,The Puritan Ethos,"[Happy the man who loves what, he has and worked for it also.]",Happy the man who loves what\nhe has and worked for it also.,black_mountain,59
1107,Claude McKay,https://www.poetryfoundation.org/poems/56983/the-lynching,The Lynching,"[His spirit is smoke ascended to high heaven., His father, by the cruelest way of pain,]","His spirit is smoke ascended to high heaven.\nHis father, by the cruelest way of pain,",harlem_renaissance,85
1157,Langston Hughes,https://www.poetryfoundation.org/poetrymagazine/poems/28738/blues-in-stereo,Blues In Stereo,"[YOUR NUMBER’S COMING OUT!, BOUQUETS I'LL SEND YOU, DREAMS I'LL SEND YOU]",YOUR NUMBER’S COMING OUT!\nBOUQUETS I'LL SEND YOU\nDREAMS I'LL SEND YOU,harlem_renaissance,69
1158,Langston Hughes,https://www.poetryfoundation.org/poetrymagazine/poems/24640/blues-on-a-box,Blues On A Box,"[Play your guitar, boy,, Till yesterday's, Black cat, Runs out tomorrow’s, Back door]","Play your guitar, boy,\nTill yesterday's\nBlack cat\nRuns out tomorrow’s\nBack door",harlem_renaissance,79


- During EDA, I found a couple of poems that need adjusting.

In [105]:
# these were scraped all as one line, so split on newline characters
# and drop empty strings
df.loc[216, 'poem_lines'] = str([line for line in
                                 df.loc[216, 'poem_string'].split('\n')
                                 if line.strip()])

df.loc[1774, 'poem_lines'] = str([line for line in
                                  df.loc[1774, 'poem_string'].split('\n')
                                  if line.strip()])

- One last check for duplicates.

In [106]:
df[df.duplicated(subset=['poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre,temp_len


### Finishing touches

[[go back to the top](#Data-Cleaning)]

- Drop ```temp_len``` column.
- Sort and reset indices.
- Convert ```poem_lines``` back to lists from string.

In [107]:
# drop temp_len column
df.drop(columns='temp_len', inplace=True)

# confirm
df.shape

(5098, 6)

In [108]:
# sort dataframe
df.sort_values(by=['genre', 'poet', 'title'], inplace=True)

# reset indices
df.reset_index(drop=True, inplace=True)

In [109]:
# convert stringified lists back into actual lists
df['poem_lines'] = df['poem_lines'].apply(destringify)
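
`destringify` is imported from `functions_webscraping` and its body is not shown in this notebook. A minimal sketch of what such a helper might do, assuming the strings were produced with `str(list)` as in the cells above, uses `ast.literal_eval` from the standard library. The name `destringify_sketch` is hypothetical.

```python
import ast


def destringify_sketch(value):
    """Turn a stringified list back into a list; leave real lists untouched."""
    if isinstance(value, list):
        return value
    return ast.literal_eval(value)


# round trip on one of the hand-entered poems from above
stored = str(['room now', 'door Humphrey', 'Bogart'])
restored = destringify_sketch(stored)
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for text that came from a scrape.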

### 💾 Save DataFrame

[[go back to the top](#Predicting-Poetic-Movements)]

In [110]:
# # uncomment to save
# with gzip.open('data/poems_df_clean_all.pkl', 'wb') as goodbye:
#     pickle.dump(df, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

In [111]:
# # uncomment to save as csv
# df.to_csv('data/poems_df_cleaned_all.csv')

In [112]:
df.columns

Index(['poet', 'poem_url', 'title', 'poem_lines', 'poem_string', 'genre'], dtype='object')

## Drop certain genres

[[go back to the top](#Predicting-Poetic-Movements)]

- I'll look at a breakdown of genres and see if there are any I should get rid of.
- My initial thought is to limit the data by time period, to avoid a language barrier between earlier forms of English and modern English.
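
Once a set of genres has been chosen, dropping them is a one-line filter. The sketch below runs on a toy frame, and the `genres_to_drop` list is purely illustrative rather than the selection made in this notebook.

```python
import pandas as pd

# toy stand-in for the real DataFrame
toy = pd.DataFrame({
    'title': ['poem A', 'poem B', 'poem C', 'poem D'],
    'genre': ['modern', 'middle_english', 'victorian', 'middle_english'],
})

# hypothetical list of genres to exclude
genres_to_drop = ['middle_english']

# keep every row whose genre is not in the list, then renumber the rows
trimmed = toy[~toy['genre'].isin(genres_to_drop)].reset_index(drop=True)
```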

In [113]:
# # uncomment to load
# with gzip.open('data/poems_df_clean_all.pkl', 'rb') as hello:
#     df = pickle.load(hello)

In [114]:
df.shape

(5098, 6)

In [115]:
# look at quantity within each genre
df.genre.value_counts()

modern                            1244
victorian                          642
renaissance                        423
romantic                           395
imagist                            350
new_york_school                    259
black_mountain                     256
language_poetry                    194
new_york_school_2nd_generation     188
confessional                       172
black_arts_movement                159
georgian                           158
objectivist                        155
harlem_renaissance                 147
beat                               146
augustan                           114
fugitive                            86
middle_english                      10
Name: genre, dtype: int64

In [116]:
# check a sample Middle English poem
print(df[df.genre == 'middle_english'].iloc[0, -2])

Whan that Aprille with his shour
The droghte of March hath perc
And bath
Of which vertú engendr
Whan Zephirus eek with his swet
Inspir
The tendr
Hath in the Ram his half
And smal
That slepen al the nyght with open y
So priketh hem Natúre in hir corag
Thanne longen folk to goon on pilgrimag
And palmeres for to seken straung
To fern
And specially, from every shir
Of Eng
The hooly blisful martir for to sek
That hem hath holpen whan that they were seek
Bifil that in that seson on a day,
In Southwerk at the Tabard as I lay,
Redy to wenden on my pilgrymag
To Caunterbury with ful devout corag
At nyght were come into that hostelry
Wel nyne and twenty in a compaigny
Of sondry folk, by áventure y-fall
In felaweshipe, and pilgrimes were they all
That toward Caunterbury wolden ryd
The chambr
And wel we weren es
And shortly, whan the sonn
So hadde I spoken with hem everychon,
That I was of hir felaweshipe anon,
And mad
To take oure wey, ther as I yow devys
But nath
Er that I ferther in this tal


- Indeed, Middle English is definitely out.

In [117]:
# drop genre
df = df[df.genre != 'middle_english']

# confirm
df.shape

(5088, 6)

In [118]:
# check a sample Renaissance poem
print(df[df.genre == 'renaissance'].iloc[1, -2])

See the chariot at hand here of Love,
Wherein my lady rideth!
Each that draws is a swan or a dove,
And well the car Love guideth.
As she goes, all hearts do duty
Unto her beauty;
And enamour'd, do wish, so they might
But enjoy such a sight,
That they still were to run by her side,
Through swords, through seas, whither she would ride.
Do but look on her eyes, they do light
All that Love's world compriseth!
Do but look on her hair, it is bright
As Love's star when it riseth!
Do but mark, her forehead's smoother
Than words that soothe her;
And from her arched brows, such a grace
Sheds itself through the face
As alone there triumphs to the life
All the gain, all the good, of the elements' strife.
Have you seen but a bright lily grow,
Before rude hands have touch'd it?
Ha' you mark'd but the fall o' the snow
Before the soil hath smutch'd it?
Ha' you felt the wool o' the beaver?
Or swan's down ever?
Or have smelt o' the bud o' the briar?
Or the nard in the fire?
Or have tasted the bag of the

In [119]:
# check a sample Augustan poem
print(df[df.genre == 'augustan'].iloc[2, -2])

Learn then what morals critics ought to show,
For 'tis but half a judge's task, to know.
'Tis not enough, taste, judgment, learning, join;
In all you speak, let truth and candour shine:
That not alone what to your sense is due,
All may allow; but seek your friendship too.
Be silent always when you doubt your sense;
And speak, though sure, with seeming diffidence:
Some positive, persisting fops we know,
Who, if once wrong, will needs be always so;
But you, with pleasure own your errors past,
And make each day a critic on the last.
'Tis not enough, your counsel still be true;
Blunt truths more mischief than nice falsehoods do;
Men must be taught as if you taught them not;
And things unknown proposed as things forgot.
Without good breeding, truth is disapprov'd;
That only makes superior sense belov'd.
Be niggards of advice on no pretence;
For the worst avarice is that of sense.
With mean complacence ne'er betray your trust,
Nor be so civil as to prove unjust.
Fear not the anger of the wis

- According to Poetry Foundation's website, Renaissance and Augustan poems are from the years 1500-1780, and the differences from modern English are fairly clear.
- I'm going to go ahead and drop these.

In [120]:
# drop genres
df_trim = df[df.genre != 'renaissance']
df_trim = df_trim[df_trim.genre != 'augustan']

# confirm
df_trim.shape

(4551, 6)

In [121]:
# check a sample Victorian poem
print(df[df.genre == 'victorian'].iloc[1,-2])

The time you won your town the race
We chaired you through the market-place;
Man and boy stood cheering by,
And home we brought you shoulder-high.
To-day, the road all runners come,
Shoulder-high we bring you home,
And set you at your threshold down,
Townsman of a stiller town.
Smart lad, to slip betimes away
From fields where glory does not stay
And early though the laurel grows
It withers quicker than the rose.
Eyes the shady night has shut
Cannot see the record cut,
And silence sounds no worse than cheers
After earth has stopped the ears:
Now you will not swell the rout
Of lads that wore their honours out,
Runners whom renown outran
And the name died before the man.
So set, before its echoes fade,
The fleet foot on the sill of shade,
And hold to the low lintel up
The still-defended challenge-cup.
And round that early-laurelled head
Will flock to gaze the strengthless dead,
And find unwithered on its curls
The garland briefer than a girl's.


In [122]:
# check a sample Romantic poem
print(df[df.genre == 'romantic'].iloc[1,-2])

I once rejoiced, sweet evening gale,
To see thy breath the poplar wave;
But now it makes my cheek turn pale,
It waves the grass o’er Henry’s grave.
Ah! setting sun! how changed I seem!
I to thy rays prefer deep gloom, —
Since now, alas! I see them beam
Upon my Henry’s lonely tomb.
Sweet evening gale, howe’er I seem,
I wish thee o’er my sod to wave;
Ah! setting sun! soon mayst thou beam
On mine, as well as Henry’s grave!


- Romantic and Victorian poems are from 1781-1900, but the language seems fairly similar to modern English.
- Plus, these are some very formative genres for poetry in English. I'll keep these.
- All other genres are from after 1900.
- I'd never heard of the Georgian and Fugitive genres, however, and after doing some research, I've decided to drop them due to some associations with white supremacy.

In [123]:
# drop genres
df_trim = df_trim[df_trim.genre != 'georgian']
df_trim = df_trim[df_trim.genre != 'fugitive']

# confirm
df_trim.shape

(4307, 6)
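
- As a sanity check on the shapes above, the same drops can be expressed as a single `isin` filter. The sketch below rebuilds a toy DataFrame from the genre counts printed earlier (it contains only a `genre` column, so it is a stand-in for `df`, not the real data) and confirms the row counts before and after trimming.

```python
import pandas as pd

# genre counts copied from the value_counts() output above
genre_counts = pd.Series({
    'modern': 1244, 'victorian': 642, 'renaissance': 423, 'romantic': 395,
    'imagist': 350, 'new_york_school': 259, 'black_mountain': 256,
    'language_poetry': 194, 'new_york_school_2nd_generation': 188,
    'confessional': 172, 'black_arts_movement': 159, 'georgian': 158,
    'objectivist': 155, 'harlem_renaissance': 147, 'beat': 146,
    'augustan': 114, 'fugitive': 86, 'middle_english': 10,
})

# toy DataFrame with one row per poem, genre column only
toy = pd.DataFrame({'genre': genre_counts.index.repeat(genre_counts.values)})

# all five dropped genres in one filter
dropped = ['middle_english', 'renaissance', 'augustan', 'georgian', 'fugitive']
toy_trim = toy[~toy.genre.isin(dropped)].reset_index(drop=True)

print(toy.shape[0], toy_trim.shape[0])
```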

- Reset indices and save.

In [124]:
# reindex
df_trim.reset_index(drop=True, inplace=True)

### 💾 Save/Load trimmed DataFrame

[[go back to the top](#Predicting-Poetic-Movements)]

In [125]:
# # uncomment to save
# with gzip.open('data/poems_df_clean_trim.pkl', 'wb') as goodbye:
#     pickle.dump(df_trim, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# # uncomment to load
# with gzip.open('data/poems_df_clean_trim.pkl', 'rb') as hello:
#     df = pickle.load(hello)
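
- For anyone unfamiliar with the gzip-plus-pickle pattern used in these cells, here is a minimal round-trip sketch. It writes a toy DataFrame to a temporary directory (the path and data are made up for illustration) and shows that list-valued cells come back as lists.

```python
import gzip
import os
import pickle
import tempfile

import pandas as pd

# toy DataFrame with a list-valued column, like poem_lines
toy = pd.DataFrame({'genre': ['beat'],
                    'poem_lines': [['line one', 'line two']]})

# write to a throwaway location rather than the project's data folder
path = os.path.join(tempfile.mkdtemp(), 'toy_df.pkl')

# save: compress the pickled bytes with gzip
with gzip.open(path, 'wb') as outfile:
    pickle.dump(toy, outfile, protocol=pickle.HIGHEST_PROTOCOL)

# load: decompress and unpickle
with gzip.open(path, 'rb') as infile:
    loaded = pickle.load(infile)

print(type(loaded.loc[0, 'poem_lines']))
```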

In [126]:
# save as csv
df_trim.to_csv('data/poems_df_cleaned_trim.csv')
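
- One thing to keep in mind with the CSV version: `to_csv` writes the index as an unnamed first column by default, and list-valued cells are written as strings. The sketch below demonstrates both effects on a toy DataFrame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# toy DataFrame with a list-valued column, like poem_lines
toy = pd.DataFrame({'title': ['A'],
                    'poem_lines': [['line one', 'line two']]})

# write to an in-memory buffer with the default settings
buffer = io.StringIO()
toy.to_csv(buffer)

# a plain read_csv brings the index back as an extra column
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# index_col=0 restores the index instead
buffer.seek(0)
reloaded_indexed = pd.read_csv(buffer, index_col=0)

print(list(reloaded.columns))
print(type(reloaded.loc[0, 'poem_lines']))
```

- This is why a DataFrame loaded from CSV can show an `Unnamed: 0` column and why `poem_lines` needs converting back to lists after loading.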

## Next notebook: [NLP, Feature Engineering, and EDA](03_nlp_features_eda.ipynb)

[[go back to the top](#Predicting-Poetic-Movements)]

- The next notebook covers natural language processing, feature engineering, and exploratory data analysis.