# Data Cleaning

- In this notebook, I look at the data I scraped from [PoetryFoundation.org](https://www.poetryfoundation.org/) in the previous [notebook](01_webscraping.ipynb).

#### NOTE: WORK IN PROGRESS
Because I overhauled my original webscrape, I am currently working on re-running, organizing, and cleaning up this notebook.

Thank you for understanding :)

## Table of contents

1. [Import necessary packages](#Import-necessary-packages)
    
## Import necessary packages

[[go back to the top](#Data-Cleaning)]

In [12]:
# custom functions for webscraping
from functions_webscraping import *
from functions import destringify

# standard dataframe packages
import pandas as pd
import numpy as np

# timekeeping/progress packages
import time
from tqdm import tqdm

# saving packages
import gzip
import pickle

# reload functions/libraries when edited
%load_ext autoreload
%autoreload 2

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# increase column width of dataframe
pd.set_option('max_colwidth', 150)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
df = pd.read_csv('data/poems_df_pre_clean.csv', index_col=0)
df.shape

(5168, 6)

In [4]:
df.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Alexander Pope,https://www.poetryfoundation.org/poems/44896/an-essay-on-criticism-part-1,An Essay on Criticism: Part 1,"['PART 1', ""'Tis hard to say, if greater want of skill"", 'Appear in writing or in judging ill;', ""But, of the two, less dang'rous is th' offence"",...","PART 1\n'Tis hard to say, if greater want of skill\nAppear in writing or in judging ill;\nBut, of the two, less dang'rous is th' offence\nTo tire ...",augustan
1,Alexander Pope,https://www.poetryfoundation.org/poems/44897/an-essay-on-criticism-part-2,An Essay on Criticism: Part 2,"['Of all the causes which conspire to blind', ""Man's erring judgment, and misguide the mind,"", 'What the weak head with strongest bias rules,', 'I...","Of all the causes which conspire to blind\nMan's erring judgment, and misguide the mind,\nWhat the weak head with strongest bias rules,\nIs pride,...",augustan
2,Alexander Pope,https://www.poetryfoundation.org/poems/44898/an-essay-on-criticism-part-3,An Essay on Criticism: Part 3,"['Learn then what morals critics ought to show,', ""For 'tis but half a judge's task, to know."", ""'Tis not enough, taste, judgment, learning, join;...","Learn then what morals critics ought to show,\nFor 'tis but half a judge's task, to know.\n'Tis not enough, taste, judgment, learning, join;\nIn a...",augustan
3,Alexander Pope,https://www.poetryfoundation.org/poems/44899/an-essay-on-man-epistle-i,An Essay on Man: Epistle I,"['Awake, my St. John! leave all meaner things', 'To low ambition, and the pride of kings.', 'Let us (since life can little more supply', 'Than jus...","Awake, my St. John! leave all meaner things\nTo low ambition, and the pride of kings.\nLet us (since life can little more supply\nThan just to loo...",augustan
4,Alexander Pope,https://www.poetryfoundation.org/poems/44900/an-essay-on-man-epistle-ii,An Essay on Man: Epistle II,"['I.', 'Know then thyself, presume not God to scan;', 'The proper study of mankind is man.', ""Plac'd on this isthmus of a middle state,"", 'A being...","I.\nKnow then thyself, presume not God to scan;\nThe proper study of mankind is man.\nPlac'd on this isthmus of a middle state,\nA being darkly wi...",augustan


- Saving to CSV converts the list of poem_lines to a string, so I'll use my destringify function.

In [13]:
df['poem_lines'] = df['poem_lines'].apply(destringify)

In [7]:
len(df[df.duplicated(subset=['poet', 'poem_string'])])

21

In [9]:
df[df.duplicated(subset=['poet', 'poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
119,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"['I', 'I saw the best minds of my generation destroyed by madness, starving hysterical naked,', 'dragging themselves through the negro streets at ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
120,Allen Ginsberg,https://www.poetryfoundation.org/poems/49303/howl,Howl,"['I', 'I saw the best minds of my generation destroyed by madness, starving hysterical naked,', 'dragging themselves through the negro streets at ...","I\nI saw the best minds of my generation destroyed by madness, starving hysterical naked,\ndragging themselves through the negro streets at dawn l...",beat
249,Richard Brautigan,https://www.poetryfoundation.org/poems/48576/a-boat,A Boat,"['O beautiful', 'was the werewolf', 'in his evil forest.', 'We took him', 'to the carnival', 'and he started', 'crying', 'when he saw', 'the Ferri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
250,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/56423/a-boat-56d238e754f45,A Boat,"['O beautiful', 'was the werewolf', 'in his evil forest.', 'We took him', 'to the carnival', 'and he started', 'crying', 'when he saw', 'the Ferri...",O beautiful\nwas the werewolf\nin his evil forest.\nWe took him\nto the carnival\nand he started\ncrying\nwhen he saw\nthe Ferris wheel.\nElectric...,beat
613,Robert Creeley,https://www.poetryfoundation.org/poems/42840/the-rescue-56d2217b24ec4,The Rescue,"['The man sits in a timelessness', 'with the horse under him in time', 'to a movement of legs and hooves', 'upon a timeless sand.', 'Distance come...",The man sits in a timelessness\nwith the horse under him in time\nto a movement of legs and hooves\nupon a timeless sand.\nDistance comes in from ...,black_mountain
614,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/28665/the-rescue,The Rescue,"['The man sits in a timelessness', 'with the horse under him in time', 'to a movement of legs and hooves', 'upon a timeless sand.', 'Distance come...",The man sits in a timelessness\nwith the horse under him in time\nto a movement of legs and hooves\nupon a timeless sand.\nDistance comes in from ...,black_mountain
738,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29165/four-dream-songs,Four Dream Songs,"['I', 'To Ralph Ross', 'The greens of the Ganges delta foliate.', 'Of heartless youth made late aware he pled:', 'Brownies, please come.', 'To Hen...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional
741,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29167/henrys-pelt-was-put-on,Henrys Pelt Was Put On,"['I', 'To Ralph Ross', 'The greens of the Ganges delta foliate.', 'Of heartless youth made late aware he pled:', 'Brownies, please come.', 'To Hen...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional
852,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"['Sweet beast, I have gone prowling,', 'a proud rejected man', 'who lived along the edges', 'catch as catch can;', 'in darkness and in hedges', 'I...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1845,Alfred Kreymborg,https://www.poetryfoundation.org/poetrymagazine/poems/14702/cradle,Cradle,"['The blue-eyed youngster', 'And the fat old man', 'Play ball in me;', 'And music—', 'The one on his penny flute,', 'The other on his bassoon.', '...","The blue-eyed youngster\nAnd the fat old man\nPlay ball in me;\nAnd music—\nThe one on his penny flute,\nThe other on his bassoon.\nTheir tolerati...",modern


- I'm actually somewhat happy about this because it shows that my image scraper worked pretty darn well, if the strings are exactly the same.
- That said, it may also mean that there are even more near-duplicates, but I can check for duplicates across poet and title next.

In [26]:
df.loc[2195, 'poem_lines']

['I went up and down the streets',
 'Here and there by day an',
 'Thro ough all hours of the night caring for the poor who were',
 'Do 0 you know why?',
 'My wife hated me, my son went to the dogs.',
 'And I turned to the people and poured out my love to them.',
 'Sweet it was to see the crowds about the lawns on the day of my',
 ',',
 'And hear them murmur their love and sorrow.',
 'When I saw Em Stanton behind the oak tree',
 'rave,',
 'Hiding herself, and her grief!',
 'Where are Elmer, Herman, Bert, "Tom and Charley,',
 'The weal of will, the strong of arm, the clown, ithe boozer, the',
 'r?',
 'All, all, are sleeping on the hill.',
 'One passed in a fever,',
 'ne was burned in a ,',
 'One was ies in a brawl,',
 'One died ina',
 'One fell from fo bridge toiling for children and wife—',
 'All, all are sleeping, sleeping, sleeping on the hill',
 'Where are Ella, Kate, Mag, Lizzie and Edith',
 'The tender heart, the simple soul, the loud, the proud, the happy',
 'one ?—',
 'All, all, 

In [16]:
'\n'.join(df.loc[741, 'poem_lines'][-3:])

'Henry’s pelt was put on sundry walls\nwhere it did much resemble Henry and\nthem persons was delighted.'

In [20]:
df.loc[738, 'poem_lines']

['I',
 'To Ralph Ross',
 'The greens of the Ganges delta foliate.',
 'Of heartless youth made late aware he pled:',
 'Brownies, please come.',
 'To Henry in his sparest times sometimes',
 'the little people spread, & did friendly things;',
 'then he was glad.',
 'Pleased, at the worst, except with man, he shook',
 'the brightest winter sun.',
 'All the green lives',
 'of the great delta, hours, hurt his migrant heart',
 'in a safety of the steady plane. Please, please',
 'come.',
 "My friends,—he has been known to mourn,—I'll die;",
 'live you, in the most wild, kindly, green',
 'partly forgiving wood,',
 'sort of forever and all those human sings',
 'close not your better ears to, while good Spring',
 'returns with a dance and a sigh.',
 'i']

In [21]:
df.loc[738, 'poem_lines'] = df.loc[738, 'poem_lines'][:-1]
df.loc[738, 'poem_string'] = '\n'.join(df.loc[738, 'poem_lines'])
df.loc[738, 'poem_string']

"I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sparest times sometimes\nthe little people spread, & did friendly things;\nthen he was glad.\nPleased, at the worst, except with man, he shook\nthe brightest winter sun.\nAll the green lives\nof the great delta, hours, hurt his migrant heart\nin a safety of the steady plane. Please, please\ncome.\nMy friends,—he has been known to mourn,—I'll die;\nlive you, in the most wild, kindly, green\npartly forgiving wood,\nsort of forever and all those human sings\nclose not your better ears to, while good Spring\nreturns with a dance and a sigh."

In [17]:
df.loc[741, 'poem_lines'] = df.loc[741, 'poem_lines'][-3:]
df.loc[741, 'poem_string'] = '\n'.join(df.loc[741, 'poem_lines'])
df.loc[741, 'poem_string']

'Henry’s pelt was put on sundry walls\nwhere it did much resemble Henry and\nthem persons was delighted.'

In [25]:
df.loc[1845, 'poem_lines'] = df.loc[1845, 'poem_lines'][:19]
df.loc[1845, 'poem_string'] = '\n'.join(df.loc[1845, 'poem_lines'])

df.loc[1867, 'poem_lines'] = df.loc[1867, 'poem_lines'][20:]
df.loc[1867, 'poem_string'] = '\n'.join(df.loc[1867, 'poem_lines'])

In [None]:
to_drop = [120, 250, 614, ]

In [515]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=83&issue=6&page=3'

poems.loc[2629,'poem_lines'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2629,'poet'], 
                                                input_title=poems.loc[2629,'title'])['poem_lines']

poems.loc[2629,'poem_string'] = scan_poem_scraper(actual_url, input_poet=poems.loc[2629,'poet'], 
                                                input_title=poems.loc[2629,'title'])['poem_string']

In [519]:
poems.drop_duplicates(subset=['poet', 'poem_string', 'genre'], inplace=True)

In [524]:
poempara_rescraper('https://www.poetryfoundation.org/poems/57369/the-send-off')

(['Down the close, darkening lanes they sang their way',
  'To the siding-shed,',
  'And lined the train with faces grimly gay.',
  '',
  'Their breasts were stuck all white with wreath and spray',
  "As men's are, dead.",
  '',
  'Dull porters watched them, and a casual tramp',
  'Stood staring hard,',
  'Sorry to miss them from the upland camp.',
  'Then, unmoved, signals nodded, and a lamp',
  'Winked to the guard.',
  '',
  'So secretly, like wrongs hushed-up, they went.',
  'They were not ours:',
  'We never heard to which front these were sent.',
  '',
  'Nor there if they yet mock what women meant',
  'Who gave them flowers.',
  '',
  'Shall they return to beatings of great bells',
  'In wild trainloads?',
  'A few, a few, too few for drums and yells,',
  'May creep back, silent, to still village wells',
  'Up half-known roads.'],
 "Down the close, darkening lanes they sang their way\nTo the siding-shed,\nAnd lined the train with faces grimly gay.\n\nTheir breasts were stuck all

In [526]:
poems[poems.duplicated(subset=['poet', 'poem_string'], keep=False)].head(20)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
743,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"[Sweet beast, I have gone prowling,, a proud rejected man, who lived along the edges, catch as catch can;, in darkness and in hedges, I sang my so...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1231,Richard Aldington,https://www.poetryfoundation.org/poems/53969/le-maudit,Le Maudit,"[Women’s tears are but water;, The tears of men are blood., He sits alone in the firelight, And on either side drifts by, Sleep, like a torrent wh...","Women’s tears are but water;\nThe tears of men are blood.\nHe sits alone in the firelight\nAnd on either side drifts by\nSleep, like a torrent whi...",imagist
1270,Ezra Pound,https://www.poetryfoundation.org/poems/54314/canto-i,Canto I,"[And then went down to the ship,, Set keel to breakers, forth on the godly sea, and, We set up mast and sail on that swart ship,, Bore sheep aboar...","And then went down to the ship,\nSet keel to breakers, forth on the godly sea, and\nWe set up mast and sail on that swart ship,\nBore sheep aboard...",imagist
1271,Ezra Pound,https://www.poetryfoundation.org/poems/52318/cantico-del-sole,Cantico del Sole,"[The thought of what America would be like, If the Classics had a wide circulation, Troubles my sleep,, The thought of what America,, The thought ...","The thought of what America would be like\nIf the Classics had a wide circulation\nTroubles my sleep,\nThe thought of what America,\nThe thought o...",imagist
1272,Ezra Pound,https://www.poetryfoundation.org/poems/54317/canto-xvi-56d234860e2a1,Canto XVI,"[And before hell mouth; dry plain, and two mountains;, On the one mountain, a running form,, and another, In the turn of the hill; in hard steel, ...","And before hell mouth; dry plain\nand two mountains;\nOn the one mountain, a running form,\nand another\nIn the turn of the hill; in hard steel\nT...",imagist
1273,Ezra Pound,https://www.poetryfoundation.org/poems/54321/from-canto-cxv,Canto CXV,"[The scientists are in terror, and the European mind stops, Wyndham Lewis chose blindness, rather than have his mind stop., Night under wind mid g...",The scientists are in terror\nand the European mind stops\nWyndham Lewis chose blindness\nrather than have his mind stop.\nNight under wind mid ga...,imagist
1274,Ezra Pound,https://www.poetryfoundation.org/poems/44915/hugh-selwyn-mauberley-part-i,Hugh Selwyn Mauberley [Part I],"[E. P. ODE POUR L’ÉLECTION DE SON SÉPULCHRE, , For three years, out of key with his time,, He strove to resuscitate the dead art, Of poetry; to...","E. P. ODE POUR L’ÉLECTION DE SON SÉPULCHRE\n \nFor three years, out of key with his time,\nHe strove to resuscitate the dead art\nOf poetry; to ...",imagist
1275,Ezra Pound,https://www.poetryfoundation.org/poems/54315/canto-iii-56d234851afde,Canto III,"[I sat on the Dogana’s steps, For the gondolas cost too much, that year,, And there were not “those girls”, there was one face,, And the Buccentor...","I sat on the Dogana’s steps\nFor the gondolas cost too much, that year,\nAnd there were not “those girls”, there was one face,\nAnd the Buccentoro...",imagist
1276,Ezra Pound,https://www.poetryfoundation.org/poems/57353/hugh-selwyn-mauberley-part-ii,Hugh Selwyn Mauberley [Part II],"[Par Jaquemart”, To the strait head, Of Messalina:, “His True Penelope, Was Flaubert,”, And his tool, The engraver's., Firmness,, Not the full smi...","Par Jaquemart”\nTo the strait head\nOf Messalina:\n“His True Penelope\nWas Flaubert,”\nAnd his tool\nThe engraver's.\nFirmness,\nNot the full smil...",imagist
1277,Ezra Pound,https://www.poetryfoundation.org/poems/54320/canto-lxxxi,Canto LXXXI,"[Zeus lies in Ceres’ bosom, Taishan is attended of loves, under Cythera, before sunrise, And he said: “Hay aquí mucho catolicismo—(sounded, catol...","Zeus lies in Ceres’ bosom\nTaishan is attended of loves\nunder Cythera, before sunrise\nAnd he said: “Hay aquí mucho catolicismo—(sounded\ncatoli...",imagist


In [531]:
poems_double_backup = poems.copy()

In [536]:
poems.shape

(5151, 6)

In [549]:
rial_mod_ind = list(poems[(poems.poet == 'Richard Aldington') & (poems.genre == 'modern')].index)

poems.drop(rial_mod_ind, inplace=True)

poems.shape

(5048, 6)

In [551]:
poems[poems.poet == 'Li Bai']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1285,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs’ Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",imagist
2003,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs’ Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",modern


In [550]:
poems[poems.duplicated(subset=['poet', 'poem_string'], keep=False)].head(20)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
743,W. D. Snodgrass,https://www.poetryfoundation.org/poems/52643/song-56d2314775fcc,Song,"[Sweet beast, I have gone prowling,, a proud rejected man, who lived along the edges, catch as catch can;, in darkness and in hedges, I sang my so...","Sweet beast, I have gone prowling,\na proud rejected man\nwho lived along the edges\ncatch as catch can;\nin darkness and in hedges\nI sang my sou...",confessional
1285,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs’ Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",imagist
2003,Li Bai,https://www.poetryfoundation.org/poems/48687/the-jewel-stairs-grievance,The Jewel Stairs’ Grievance,"[The jewelled steps are already quite white with dew,, It is so late that the dew soaks my gauze stockings,, And I let down the crystal curtain, A...","The jewelled steps are already quite white with dew,\nIt is so late that the dew soaks my gauze stockings,\nAnd I let down the crystal curtain\nAn...",modern
2133,Henry Wadsworth Longfellow,https://www.poetryfoundation.org/poems/44637/the-landlords-tale-paul-reveres-ride,The Landlord's Tale. Paul Revere's Ride,"[Listen, my children, and you shall hear, Of the midnight ride of Paul Revere,, On the eighteenth of April, in Seventy-five;, Hardly a man is now ...","Listen, my children, and you shall hear\nOf the midnight ride of Paul Revere,\nOn the eighteenth of April, in Seventy-five;\nHardly a man is now a...",modern
2196,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58343/ocean-of-earth,Ocean of Earth,"[I have built a house in the middle of the Ocean, Its windows are the rivers flowing from my eyes, Octopi are crawling all over where the walls ar...",I have built a house in the middle of the Ocean\nIts windows are the rivers flowing from my eyes\nOctopi are crawling all over where the walls are...,modern
2197,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58342/the-lady,The Lady,"[Knock knock He has closed his door, The garden’s lilies have started to rot, So who is the corpse being carried from the house, You just knocked ...",Knock knock He has closed his door\nThe garden’s lilies have started to rot\nSo who is the corpse being carried from the house\nYou just knocked o...,modern
2292,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58341/the-seasons-56d23ca091a25,The Seasons,"[It was a blessèd time we were at the beach, Go out early in the morning no shoes no hats no ties, And quick as a toad’s tongue can reach, Love w...",It was a blessèd time we were at the beach\nGo out early in the morning no shoes no hats no ties\nAnd quick as a toad’s tongue can reach\nLove wo...,modern
3478,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58343/ocean-of-earth,Ocean of Earth,"[I have built a house in the middle of the Ocean, Its windows are the rivers flowing from my eyes, Octopi are crawling all over where the walls ar...",I have built a house in the middle of the Ocean\nIts windows are the rivers flowing from my eyes\nOctopi are crawling all over where the walls are...,new_york_school_2nd_generation
3480,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58342/the-lady,The Lady,"[Knock knock He has closed his door, The garden’s lilies have started to rot, So who is the corpse being carried from the house, You just knocked ...",Knock knock He has closed his door\nThe garden’s lilies have started to rot\nSo who is the corpse being carried from the house\nYou just knocked o...,new_york_school_2nd_generation
3529,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/58341/the-seasons-56d23ca091a25,The Seasons,"[It was a blessèd time we were at the beach, Go out early in the morning no shoes no hats no ties, And quick as a toad’s tongue can reach, Love w...",It was a blessèd time we were at the beach\nGo out early in the morning no shoes no hats no ties\nAnd quick as a toad’s tongue can reach\nLove wo...,new_york_school_2nd_generation


In [552]:
poems.drop([2003, 2133, 3478, 3480, 3529, 4976], inplace=True)

poems.shape

(5042, 6)

In [554]:
poems[poems.duplicated(subset=['poem_string'], keep=False)]

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
4754,A. E. Housman,https://www.poetryfoundation.org/poems/58269/a-shropshire-lad-52-far-in-a-western-brookland-,A Shropshire Lad 52: Far in a western brookland,[ ],,victorian
5142,Katharine Tynan,https://www.poetryfoundation.org/poems/57349/a-lament-56d23ac7ae84a,A Lament,[ ],,victorian


In [555]:
poems.loc[988,'poem_lines'] = poempara_rescraper(poems.loc[988,'poem_url'])[0]
poems.loc[988,'poem_string'] = poempara_rescraper(poems.loc[988,'poem_url'])[1]

poems.loc[4754,'poem_lines'] = poempara_rescraper(poems.loc[4754,'poem_url'])[0]
poems.loc[4754,'poem_string'] = poempara_rescraper(poems.loc[4754,'poem_url'])[1]

poems.loc[5142,'poem_lines'] = poempara_rescraper(poems.loc[5142,'poem_url'])[0]
poems.loc[5142,'poem_string'] = poempara_rescraper(poems.loc[5142,'poem_url'])[1]

In [560]:
poems.to_csv('data/poems_df.csv')

In [404]:
poems_backup = poems.copy()

In [559]:
# uncomment to save
with gzip.open('data/poems_df.pkl', 'wb') as goodbye:
    pickle.dump(poems, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# uncomment to load
with gzip.open('data/poems_df.pkl', 'rb') as hello:
    poems_df = pickle.load(hello)

RecursionError: maximum recursion depth exceeded while getting the str of an object

In [389]:
error_poems = error_poems_orig.copy()
len(error_poems)

223

In [563]:
poems[poems.poet == 'Michael McClure']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
169,Michael McClure,https://www.poetryfoundation.org/poems/54613/dream-the-night-of-december-23rd-,Dream: The Night of December 23rd ﻿,"[—ALL HUGE LIKE GIANT FLIGHTLESS KIWIS TWICE THE, SIZE OF OSTRICHES,, they turned and walked away from us, and you were there Jane and you were t...","—ALL HUGE LIKE GIANT FLIGHTLESS KIWIS TWICE THE\nSIZE OF OSTRICHES,\nthey turned and walked away from us\nand you were there Jane and you were tw...",beat
170,Michael McClure,https://www.poetryfoundation.org/poems/54612/the-chamber,The Chamber,"[IN LIGHT ROOM IN DARK HELL IN UMBER IN CHROME,, I sit feeling the swell of the cloud made about by movement, of arm leg and tongue. In reflection...","IN LIGHT ROOM IN DARK HELL IN UMBER IN CHROME,\nI sit feeling the swell of the cloud made about by movement\nof arm leg and tongue. In reflections...",beat
171,Michael McClure,https://www.poetryfoundation.org/poems/54614/mexico-seen-from-the-moving-car-,Mexico Seen from the Moving Car ﻿,"[THERE ARE HILLS LIKE SHARKFINS, and clods of mud., The mind drifts through, in the shape of a museum,, in the guise of a museum, dreaming dead fr...","THERE ARE HILLS LIKE SHARKFINS\nand clods of mud.\nThe mind drifts through\nin the shape of a museum,\nin the guise of a museum\ndreaming dead fri...",beat
172,Michael McClure,https://www.poetryfoundation.org/poems/54611/the-mystery-of-the-hunt,The Mystery of the Hunt,"[It’s the mystery of the hunt that intrigues me,, That drives us like lemmings, but cautiously—, The search for a bright square cloud—the scent of...","It’s the mystery of the hunt that intrigues me,\nThat drives us like lemmings, but cautiously—\nThe search for a bright square cloud—the scent of ...",beat
218,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,"[PREMONITION, My bones ascend by arsenics of sight., Where noise is all the sound there is to hear,, Beginning in the heart I work towards light.,...","PREMONITION\nMy bones ascend by arsenics of sight.\nWhere noise is all the sound there is to hear,\nBeginning in the heart I work towards light.\n...",beat
219,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/29414/the-child,The Child,"[Who were the Lion Men who walked in my dreams, when I was a fat and sleeping babe, in a room whose walls were miracles?, Who were the lion men wi...",Who were the Lion Men who walked in my dreams\nwhen I was a fat and sleeping babe\nin a room whose walls were miracles?\nWho were the lion men wit...,beat


In [573]:
poems.loc[574, 'poem_string']

'I\nIt is possible, in words, to speak\nof what has happened—a sense\nof there and here, now\nand then. It is some other\nway of being, prized enough,\nthat it makes a common\nground. Once\nyou were\nalone and I\nmet you. It was late\nat night.\nT never'

In [576]:
scan_poem_scraper(poems.loc[574, 'poem_url'])

{'poet': 'Robert Creeley',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/30524/enough-56d214013c576',
 'title': 'Enough',
 'poem_lines': ['I',
  'It is possible, in words, to speak',
  'of what has happened—a sense',
  'of there and here, now',
  'and then. It is some other',
  'way of being, prized enough,',
  'that it makes a common',
  'ground. Once',
  'you were',
  'alone and I',
  'met you. It was late',
  'at night.',
  'T never'],
 'poem_string': 'I\nIt is possible, in words, to speak\nof what has happened—a sense\nof there and here, now\nand then. It is some other\nway of being, prized enough,\nthat it makes a common\nground. Once\nyou were\nalone and I\nmet you. It was late\nat night.\nT never'}

In [582]:
text = pytesseract.image_to_string('data/temp.png')
text

'POETRY\n\nleft after that,\nnot to my own mind,\n\nbut stayed\nand stayed. Years\n\nwent by. What\nwere they. Days—\n\nsome happy,\nbut some bitter\n\nand sad. If I walked\nacross the room, then,\n\nand saw you un-\nexpected, saw the particular\n\nwhiteness of\nyour body, a little\n\nolder, more\ntired—in words\n\nI possessed it, in\nmy mind I thought, and\n\nyou never knew\nit, there I danced\n\nfor you, stumbling, in\nthe corner of my eye.'

In [591]:
poems[poems.title == 'Four Dream Songs']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
758,John Berryman,https://www.poetryfoundation.org/poetrymagazine/poems/29165/four-dream-songs,Four Dream Songs,"[I, To Ralph Ross, The greens of the Ganges delta foliate., Of heartless youth made late aware he pled:, Brownies, please come., To Henry in his s...","I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sp...",confessional


In [592]:
poems.loc[758, 'poem_string']

"I\nTo Ralph Ross\nThe greens of the Ganges delta foliate.\nOf heartless youth made late aware he pled:\nBrownies, please come.\nTo Henry in his sparest times sometimes\nthe little people spread, & did friendly things;\nthen he was glad.\nPleased, at the worst, except with man, he shook\nthe brightest winter sun.\nAll the green lives\nof the great delta, hours, hurt his migrant heart\nin a safety of the steady plane. Please, please\ncome.\nMy friends,—he has been known to mourn,—I'll die;\nlive you, in the most wild, kindly, green\npartly forgiving wood,\nsort of forever and all those human sings\nclose not your better ears to, while good Spring\nreturns with a dance and a sigh.\ni\nHenry’s pelt was put on sundry walls\nwhere it did much resemble Henry and\nthem persons was delighted."

In [645]:
error_poems

['https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29169/spellbound-held-subtle-henry',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29525/viii-he-yelled-at-me-in-greek',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29166/the-greens-of-the-ganges',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29552/you-search-in-rome-for-rome',
 'https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d2296992

In [637]:
%%time

simple_rescrapes = []
still_errors = []
for url in tqdm(error_poems):
    try:
        info = scan_poem_scraper(url)
        info['poem_url'] = url
        info['genre'] = ''
        simple_rescrapes.append(info)
    except:
        still_errors.append(url)

100%|██████████| 102/102 [05:06<00:00,  3.00s/it]

CPU times: user 14.6 s, sys: 45.9 s, total: 1min
Wall time: 5min 6s





In [638]:
simple_rescrapes

[{'poet': 'John Berryman',
  'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/29169/spellbound-held-subtle-henry',
  'title': 'Spellbound Held Subtle Henry',
  'poem_lines': ['the little people spread, & did friendly things;',
   'then he was glad.',
   'Pleased, at the worst, except with man, he shook',
   'the brightest winter sun.',
   'All the green lives',
   'of the great delta, hours, hurt his migrant heart',
   'in a safety of the steady plane. Please, please',
   'come.',
   "My friends,—he has been known to mourn,—I'll die;",
   'live you, in the most wild, kindly, green',
   'partly forgiving wood,',
   'sort of forever and all those human sings',
   'close not your better ears to, while good Spring',
   'returns with a dance and a sigh.',
   'i',
   'Henry’s pelt was put on sundry walls',
   'where it did much resemble Henry and',
   'them persons was delighted.'],
  'poem_string': "the little people spread, & did friendly things;\nthen he was glad.\nPleas

In [639]:
still_errors

['https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond',
 'https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d22969928f0',
 'https://www.poetryfoundation.org/poetrymagazine/poems/19645/persons-seen',
 'https://www.poetryfoundation.org/poetrymagazine/poems/32401/sonnets-of-the-blood',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14310/an-evening-meeting-tr-by-amy-lowell-and-florence-ayscough',
 'https://www.poetryfoundation.org/poetrymagazine/poems/14321/the-inn-at-the-w

In [648]:
poems[poems.poet == 'William Carlos Williams']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1289,William Carlos Williams,https://www.poetryfoundation.org/poems/148460/spring-and-all-chapter-xiii-thus-weary-of-life,"Spring and All: Chapter XIII [Thus, weary of life]","[Thus, weary of life, in view of the great consummation which awaits us — tomorrow, we rush among our friends congratulating ourselves upon the jo...","Thus, weary of life, in view of the great consummation which awaits us — tomorrow, we rush among our friends congratulating ourselves upon the joy...",imagist
1290,William Carlos Williams,https://www.poetryfoundation.org/poems/56159/this-is-just-to-say,This Is Just To Say,"[I have eaten, the plums, that were in, the icebox, and which, you were probably, saving, for breakfast, Forgive me, they were delicious, so sweet...",I have eaten\nthe plums\nthat were in\nthe icebox\nand which\nyou were probably\nsaving\nfor breakfast\nForgive me\nthey were delicious\nso sweet\...,imagist
1291,William Carlos Williams,https://www.poetryfoundation.org/poems/148462/spring-and-all-xi-in-passing-with-my-mind,Spring and All: XI [In passing with my mind],"[In passing with my mind, on nothing in the world, but the right of way, I enjoy on the road by, virtue of the law —, I saw, an elderly man...",In passing with my mind\non nothing in the world\nbut the right of way\nI enjoy on the road by\nvirtue of the law —\nI saw\nan elderly man ...,imagist
1292,William Carlos Williams,https://www.poetryfoundation.org/poems/53078/flowers-by-the-sea-56d23210587cf,Flowers by the Sea,"[When over the flowery, sharp pasture’s, edge, unseen, the salt ocean, lifts its form—chicory and daisies, tied, released, seem hardly flowers alo...","When over the flowery, sharp pasture’s\nedge, unseen, the salt ocean\nlifts its form—chicory and daisies\ntied, released, seem hardly flowers alon...",imagist
1293,William Carlos Williams,https://www.poetryfoundation.org/poems/49849/between-walls,Between Walls,"[the back wings, of the, hospital where, nothing, will grow lie, cinders, in which shine, the broken, pieces of a green, bottle]",the back wings\nof the\nhospital where\nnothing\nwill grow lie\ncinders\nin which shine\nthe broken\npieces of a green\nbottle,imagist
...,...,...,...,...,...,...
1571,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/14364/the-dark-day,The Dark Day,"[A three-day-long rain from the east—, An interminable talking, talking, Of no consequence—patter, patter, patter., Hand in hand little winds, Blo...","A three-day-long rain from the east—\nAn interminable talking, talking\nOf no consequence—patter, patter, patter.\nHand in hand little winds\nBlow...",imagist
1572,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/13517/summer-song,Summer Song,"[Wanderer moon,, Smiling, A faintly ironical smile, At this brilliant,, Dew-moistened, Summer morning—, A detached,, Sleepily indifferent, Smile,,...","Wanderer moon,\nSmiling\nA faintly ironical smile\nAt this brilliant,\nDew-moistened\nSummer morning—\nA detached,\nSleepily indifferent\nSmile,\n...",imagist
1573,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/22430/the-forgotten-city,The Forgotten City,"[When I was coming down from the country, with my mother, the day of the storm,, trees were across the road and small branches, kept rattling on t...","When I was coming down from the country\nwith my mother, the day of the storm,\ntrees were across the road and small branches\nkept rattling on th...",imagist
1574,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/18898/the-unfrocked-priest,The Unfrocked Priest,"[I, When a man had gone, in Russia from a small, town, to the University, he, returned a hero—, people, bowed down to him—, his, ego, nourished by...","I\nWhen a man had gone\nin Russia from a small\ntown\nto the University\nhe\nreturned a hero—\npeople\nbowed down to him—\nhis\nego, nourished by ...",imagist


In [640]:
error_rescrapes = []

In [641]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=18'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: We Shall Be Free')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [642]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=17'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: When Spirit Has No Edge')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [643]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=5&page=6'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Enough: Left After That')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [644]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=104&issue=3&page=19'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Walking: In My Head')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [654]:
img_data = rq.get('https://static.poetryfoundation.org/jstor/i20572016/pages/13.png').content
with open('data/temp.png', 'wb') as handle:
    handle.write(img_data)
text = pytesseract.image_to_string('data/temp.png')
text

'William Carlos Williams\n\nEPITAPH\n\nAn old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n\n“Love is a young green willow\nShimmering at the bare wood’s edge.”\n\nSPRING\n\nO my grey hairs!\nYou are truly white as plum blossoms.\n\nSTROLLER\n\nI have seen the hills blue,\n\nI have seen them purple;\n\nAnd it is as hard to know\n\nThe words of a woman\n\nAs to straighten the crumpled branch\nOf an old willow.\n\nMEMORY OF APRIL\n\nYou say love is this, love is that:\nPoplar tassels, willow tendrils\n\nThe wind and the rain comb,\nTinkle and drip, tinkle and drip—\nBranches drifting apart. Hagh!\nLove has not even visited this country.\n\n[303]'

In [699]:
text = 'William Carlos Williams\n\nEPITAPH\n\nAn old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n\n“Love is a young green willow\nShimmering at the bare wood’s edge.”\n\nSPIRIT\n\nO my grey hairs!\nYou are truly white as plum blossoms.\n\nSTROLLER\n\nI have seen the hills blue,\n\nI have seen them purple;\n\nAnd it is as hard to know\n\nThe words of a woman\n\nAs to straighten the crumpled branch\nOf an old willow.\n\nMEMORY OF APRIL\n\nYou say love is this, love is that:\nPoplar tassels, willow tendrils\n\nThe wind and the rain comb,\nTinkle and drip, tinkle and drip—\nBranches drifting apart. Hagh!\nLove has not even visited this country.\n\n[303]'

In [700]:
title = 'Epitaph'
scan_pattern = fr'{title.split()[-1].upper()}\b.*((?:\r?\n(?![A-HJ-Z][A-HJ-Z ][A-Z ]+$).*)*)'
lines = re.search(scan_pattern, text, re.MULTILINE).group(1).splitlines()
lines

['',
 '',
 'An old willow with hollow branches',
 'Slowly swayed his few high bright tendrils',
 'And sang:',
 '',
 '“Love is a young green willow',
 'Shimmering at the bare wood’s edge.”']

In [701]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=13&issue=6&page=13'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Epitaph')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
rescrape
# error_rescrapes.append(rescrape)

{'poet': 'William Carlos Williams',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow',
 'title': 'Epitaph',
 'poem_lines': ['An old willow with hollow branches',
  'Slowly swayed his few high bright tendrils',
  'And sang:',
  '“Love is a young green willow',
  'Shimmering at the bare wood’s edge.”'],
 'poem_string': 'An old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n“Love is a young green willow\nShimmering at the bare wood’s edge.”',
 'genre': 'imagist'}

In [374]:
from tqdm import tqdm

In [391]:
%%time

rescraped = []
for url in tqdm(error_poems):
    try:
        poem = scan_poem_rescrape(url)
        rescraped.append(poem)
        error_poems.remove(url)
    except:
        continue

 65%|██████▌   | 145/223 [06:41<03:36,  2.77s/it]

CPU times: user 28 s, sys: 52.3 s, total: 1min 20s
Wall time: 6min 41s





In [392]:
len(error_poems)

145

In [393]:
pd.DataFrame(rescraped)[pd.DataFrame(rescraped).poet == 'Kenneth Rexroth']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string


In [384]:
error_poems

['https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27130/in-my-childhood-when-i-first',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29522/v-tell-it-to-the-forest-fire',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29169/spellbound-held-subtle-henry',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29525/viii-he-yelled-at-me-in-greek',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29166/the-greens

In [400]:
poem_url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892'

jumble_pattern = r'-[0-9]+[a-z][0-9a-z]*$'
clean_url = re.sub(jumble_pattern, '', poem_url)
# try:
#     title = soup.find('h1').contents[-1].strip()
# except:
title_pattern = r'[a-z0-9\-]*$'
title = re.search(
    title_pattern,
    clean_url,
    re.I).group().replace(
    '-',
    ' ').title()
    
title

'Walking'

In [403]:
title.split()[-1].upper()

'WALKING'

In [402]:
scan_poem_rescrape('https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892')

AttributeError: 'NoneType' object has no attribute 'group'

In [346]:
type(poems.loc[157,'poem_lines'])

list

In [345]:
# rescrape poem based on index from above 
poems.loc[157,'poem_lines'] = PoemView_rescraper(poems.loc[157,'poem_url'])[0]
poems.loc[157,'poem_string'] = PoemView_rescraper(poems.loc[157,'poem_url'])[1]

In [None]:
# rescrape poem based on index from above 
poems.loc[157,'poem_lines'] = PoemView_rescraper(poems.loc[157,'poem_url'])[0]
poems.loc[157,'poem_string'] = PoemView_rescraper(poems.loc[157,'poem_url'])[1]

poems.loc[165,'poem_lines'] = PoemView_rescraper(poems.loc[165,'poem_url'])[0]
poems.loc[165,'poem_string'] = PoemView_rescraper(poems.loc[165,'poem_url'])[1]

poems.loc[210,'poem_lines'] = PoemView_rescraper(poems.loc[210,'poem_url'])[0]
poems.loc[210,'poem_string'] = PoemView_rescraper(poems.loc[210,'poem_url'])[1]

df_trim.loc[165,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[703,'poem_url'])[0])
df_trim.loc[703,'poem_string'] = PoemView_rescraper(df_trim.loc[703,'poem_url'])[1]

df_trim.loc[952,'poem_lines'] = str(poempara_rescraper(df_trim.loc[952,'poem_url'])[0])
df_trim.loc[952,'poem_string'] = poempara_rescraper(df_trim.loc[952,'poem_url'])[1]

df_trim.loc[953,'poem_lines'] = str(modified_regular_rescraper(df_trim.loc[953,'poem_url'])[0])
df_trim.loc[953,'poem_string'] = modified_regular_rescraper(df_trim.loc[953,'poem_url'])[1]

df_trim.loc[1231,'poem_lines'] = str(justify_rescraper(df_trim.loc[1231,'poem_url'])[0])
df_trim.loc[1231,'poem_string'] = justify_rescraper(df_trim.loc[1231,'poem_url'])[1]

df_trim.loc[1234,'poem_lines'] = str(justify_rescraper(df_trim.loc[1234,'poem_url'])[0])
df_trim.loc[1234,'poem_string'] = justify_rescraper(df_trim.loc[1234,'poem_url'])[1]

df_trim.loc[1389,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1389,'poem_url'])[0])
df_trim.loc[1389,'poem_string'] = PoemView_rescraper(df_trim.loc[1389,'poem_url'])[1]

df_trim.loc[1603,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1603,'poem_url'])[0])
df_trim.loc[1603,'poem_string'] = PoemView_rescraper(df_trim.loc[1603,'poem_url'])[1]

df_trim.loc[2514,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2514,'poem_url'])[0])
df_trim.loc[2514,'poem_string'] = PoemView_rescraper(df_trim.loc[2514,'poem_url'])[1]

df_trim.loc[2517,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2517,'poem_url'])[0])
df_trim.loc[2517,'poem_string'] = PoemView_rescraper(df_trim.loc[2517,'poem_url'])[1]

df_trim.loc[3335,'poem_lines'] = str(ranged_rescraper(df_trim.loc[3335,'poem_url'])[0])
df_trim.loc[3335,'poem_string'] = ranged_rescraper(df_trim.loc[3335,'poem_url'])[1]

df_trim.loc[3418,'poem_lines'] = str(center_rescraper(df_trim.loc[3418,'poem_url'])[0])
df_trim.loc[3418,'poem_string'] = center_rescraper(df_trim.loc[3418,'poem_url'])[1]

df_trim.loc[3421,'poem_lines'] = str(justify_rescraper(df_trim.loc[3421,'poem_url'])[0])
df_trim.loc[3421,'poem_string'] = justify_rescraper(df_trim.loc[3421,'poem_url'])[1]

df_trim.loc[4217,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4217,'poem_url'])[0])
df_trim.loc[4217,'poem_string'] = poempara_rescraper(df_trim.loc[4217,'poem_url'])[1]

df_trim.loc[4611,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4611,'poem_url'])[0])
df_trim.loc[4611,'poem_string'] = poempara_rescraper(df_trim.loc[4611,'poem_url'])[1]

In [338]:
PoemView_rescraper('https://www.poetryfoundation.org/poems/54566/kora-in-hell-improvisations-xiv')

(['XIV1',
  'The brutal Lord of All will rip us from each other—leave the one to suffer here alone. No need belief in god or hell to postulate that much. The dance: hands touching, leaves touching—eyes looking, clouds rising—lips touching, cheeks touching, arm about . . . Sleep. Heavy head, heavy arm, heavy dream—: Of Ymir’s flesh the earth was made and of his thoughts were all the gloomy clouds created. Oya!  ________________',
  'Out of bitterness itself the clear wine of the imagination will be pressed and the dance prosper thereby.  2',
  'To you! whoever you are, wherever you are! (But I know where you are!) There’s Dürer’s “Nemesis” naked on her sphere over the little town by the river—except she’s too old. There’s a dancing burgess by Tenier and Villon’s maitresse—after he’d gone bald and was skin pocked and toothless: she that had him ducked in the sewage drain. Then there’s that miller’s daughter of “buttocks broad and breastes high.” Something of Nietzsche, something of the 

In [333]:
from unicodedata import normalize

In [337]:
normalize('NFKD', lines_raw[1]).replace('\ufeff', '')

'\n       The brutal Lord of All will rip us from each other—leave the one to suffer here alone. No need belief in god or hell to postulate that much. The dance: hands touching, leaves touching—eyes looking, clouds rising—lips touching, cheeks touching, arm about . . . Sleep. Heavy head, heavy arm, heavy dream—: Of Ymir’s flesh the earth was made and of his thoughts were all the gloomy clouds created. Oya!  ________________'

In [None]:
[line.normalize('NFKD', )]

In [331]:
poem_url = 'https://www.poetryfoundation.org/poems/54566/kora-in-hell-improvisations-xiv'

page = rq.get(poem_url)
soup = bs(page.content, 'html.parser')
lines_raw = soup.find(
                    'div', {
                        'data-view': 'PoemView'}).get_text().split('\r')

lines_raw

['\nXIV1',
 '\n\xa0 \xa0\xa0\xa0\xa0 The brutal Lord of All will rip us from each other—leave the one to suffer here alone. No need belief in god or hell to postulate that much. The dance: hands touching, leaves touching—eyes looking, clouds rising—lips touching, cheeks touching, arm about . . . Sleep. Heavy head, heavy arm, heavy dream—: Of Ymir’\ufeff\ufeffs flesh the earth was made and of his thoughts were all the gloomy clouds created. Oya! \xa0________________',
 '\n\xa0\xa0\xa0\xa0\xa0 Out of bitterness itself the clear wine of the imagination will be pressed and the dance prosper thereby. \xa02',
 '\n\xa0 \xa0\xa0\xa0\xa0 To you! whoever you are, wherever you are! (But I know where you are!) There’\ufeff\ufeffs Dü\ufeffrer’\ufeff\ufeffs “Nemesis” naked on her sphere over the little town by the river—except she’\ufeff\ufeffs too old. There’\ufeff\ufeff\ufeffs a dancing burgess by Tenier and Villon’\ufeff\ufeff\ufeffs maitresse—after he’\ufeff\ufeff\ufeffd gone bald and was skin p

In [None]:
page = rq.get(poem_url)
soup = bs(page.content, 'html.parser')

In [135]:
text_poems = text_poems[text_poems.poem_string != ''].reset_index(drop=True)
text_poems.shape

(3082, 6)

In [136]:
text_poems.to_csv('data/text_poems.csv')

In [137]:
text_poems[text_poems.poet == 'William Carlos Williams']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
810,William Carlos Williams,https://www.poetryfoundation.org/poems/148460/spring-and-all-chapter-xiii-thus-weary-of-life,"Spring and All: Chapter XIII [Thus, weary of life]","[Thus, weary of life, in view of the great consummation which awaits us — tomorrow, we rush among our friends congratulating ourselves upon the jo...","Thus, weary of life, in view of the great consummation which awaits us — tomorrow, we rush among our friends congratulating ourselves upon the joy...",imagist
811,William Carlos Williams,https://www.poetryfoundation.org/poems/56159/this-is-just-to-say,This Is Just To Say,"[I have eaten, the plums, that were in, the icebox, and which, you were probably, saving, for breakfast, Forgive me, they were delicious, so sweet...",I have eaten\nthe plums\nthat were in\nthe icebox\nand which\nyou were probably\nsaving\nfor breakfast\nForgive me\nthey were delicious\nso sweet\...,imagist
812,William Carlos Williams,https://www.poetryfoundation.org/poems/148462/spring-and-all-xi-in-passing-with-my-mind,Spring and All: XI [In passing with my mind],"[In passing with my mind, on nothing in the world, but the right of way, I enjoy on the road by, virtue of the law —, I saw, an elderly man...",In passing with my mind\non nothing in the world\nbut the right of way\nI enjoy on the road by\nvirtue of the law —\nI saw\nan elderly man ...,imagist
813,William Carlos Williams,https://www.poetryfoundation.org/poems/53078/flowers-by-the-sea-56d23210587cf,Flowers by the Sea,"[When over the flowery, sharp pasture’s, edge, unseen, the salt ocean, lifts its form—chicory and daisies, tied, released, seem hardly flowers alo...","When over the flowery, sharp pasture’s\nedge, unseen, the salt ocean\nlifts its form—chicory and daisies\ntied, released, seem hardly flowers alon...",imagist
814,William Carlos Williams,https://www.poetryfoundation.org/poems/49849/between-walls,Between Walls,"[the back wings, of the, hospital where, nothing, will grow lie, cinders, in which shine, the broken, pieces of a green, bottle]",the back wings\nof the\nhospital where\nnothing\nwill grow lie\ncinders\nin which shine\nthe broken\npieces of a green\nbottle,imagist
815,William Carlos Williams,https://www.poetryfoundation.org/poems/54566/kora-in-hell-improvisations-xiv,Kora in Hell: Improvisations XI﻿V,"[XIV1, The brutal Lord of All will rip us from each other—leave the one to suffer here alone. No need belief in god or hell to postulate that much...",XIV1\nThe brutal Lord of All will rip us from each other—leave the one to suffer here alone. No need belief in god or hell to postulate that much....,imagist
816,William Carlos Williams,https://www.poetryfoundation.org/poems/46485/to-elsie,To Elsie,"[The pure products of America, go crazy—, mountain folk from Kentucky, or the ribbed north end of, Jersey, with its isolate lakes and, valleys, it...","The pure products of America\ngo crazy—\nmountain folk from Kentucky\nor the ribbed north end of\nJersey\nwith its isolate lakes and\nvalleys, its...",imagist
817,William Carlos Williams,https://www.poetryfoundation.org/poems/54564/kora-in-hell-improvisations-xxvii,Kora in Hell: Improvisations XXVII,"[XXVII 1, This particular thing, whether it be four pinches of four divers white powders cleverly compounded to cure surely, safely, pleasantly ...","XXVII 1\nThis particular thing, whether it be four pinches of four divers white powders cleverly compounded to cure surely, safely, pleasantly a...",imagist
818,William Carlos Williams,https://www.poetryfoundation.org/poems/54326/love-song-56d2348bab385,Love Song,"[I lie here thinking of you:— the stain of love is upon the world! Yellow, yellow, yellow it eats into the leaves, smears with saffron the horn...","I lie here thinking of you:— the stain of love is upon the world! Yellow, yellow, yellow it eats into the leaves, smears with saffron the horne...",imagist
819,William Carlos Williams,https://www.poetryfoundation.org/poems/46484/queen-annes-lace,Queen-Anne’s Lace,"[Her body is not so white as, anemony petals nor so smooth—nor, so remote a thing. It is a field, of the wild carrot taking, the field by force; t...",Her body is not so white as\nanemony petals nor so smooth—nor\nso remote a thing. It is a field\nof the wild carrot taking\nthe field by force; th...,imagist


In [202]:
poet_poems_url_dict

{'augustan': [{'https://www.poetryfoundation.org/poets/mary-barber': (['https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage'],
    [])},
  {'https://www.poetryfoundation.org/poets/susanna-blamire': (['https://www.poetryfoundation.org/poems/50534/auld-robin-forbes',
     'https://www.poetryfoundation.org/poems/50532/the-siller-croun',
     'https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man'],
    [])},
  {'https://www.poetryfoundation.org/poets/henry-carey': (['https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley'],
    [])},
  {'https://www.poetryfoundation.org/poets/thomas-chatterton': (['https://www.poetryfoundation.org/poems/43925/an-excelente-balade-of-charitie',
     'https://www.poetryfoundation.org/poems/43924/aella-a-tragical-interlude'],
    [])},
  {'https://www.poetryfoundation.org/poets/william-collins': (['https://www.poetryfoundation.org/poems/44003/ode-to-evening',
     'https://www.poetryfoundation.

In [204]:
poet_poems_url_dict['augustan']

[{'https://www.poetryfoundation.org/poets/mary-barber': (['https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage'],
   [])},
 {'https://www.poetryfoundation.org/poets/susanna-blamire': (['https://www.poetryfoundation.org/poems/50534/auld-robin-forbes',
    'https://www.poetryfoundation.org/poems/50532/the-siller-croun',
    'https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man'],
   [])},
 {'https://www.poetryfoundation.org/poets/henry-carey': (['https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley'],
   [])},
 {'https://www.poetryfoundation.org/poets/thomas-chatterton': (['https://www.poetryfoundation.org/poems/43925/an-excelente-balade-of-charitie',
    'https://www.poetryfoundation.org/poems/43924/aella-a-tragical-interlude'],
   [])},
 {'https://www.poetryfoundation.org/poets/william-collins': (['https://www.poetryfoundation.org/poems/44003/ode-to-evening',
    'https://www.poetryfoundation.org/poems/44002/an-ode-on

In [207]:
test = {genre:{'text_urls':[],'scan_urls':[]} for genre in poet_poems_url_dict}
for genre,poets in poet_poems_url_dict.items():
    for poet in poets:
        for poet_url, poems in poet.items():
            test[genre]['text_urls'].extend(poems[0])
            test[genre]['scan_urls'].extend(poems[1])
            
test

{'augustan': {'text_urls': ['https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage',
   'https://www.poetryfoundation.org/poems/50534/auld-robin-forbes',
   'https://www.poetryfoundation.org/poems/50532/the-siller-croun',
   'https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man',
   'https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley',
   'https://www.poetryfoundation.org/poems/43925/an-excelente-balade-of-charitie',
   'https://www.poetryfoundation.org/poems/43924/aella-a-tragical-interlude',
   'https://www.poetryfoundation.org/poems/44003/ode-to-evening',
   'https://www.poetryfoundation.org/poems/44002/an-ode-on-the-popular-superstitions-of-the-highlands-of-scotland-considered-as-the-subject-of-poetry',
   'https://www.poetryfoundation.org/poems/52293/eclogue-the-second-hassan-or-the-camel-driver',
   'https://www.poetryfoundation.org/poems/44001/ode-on-the-poetical-character',
   'https://www.poetryfoundation.org

In [201]:
%%time

poem_dicts = []
error_poems = []
for genre,poets in poet_poems_url_dict.items():
    for poet in poets:
        for poet_url, poems in poet.items():
            for text_url in poems[0]:
                poem = text_poem_scraper(text_url)
                poem['genre'] = genre
                poem['poem_url'] = text_url
                poem_dicts.append(poem)
            
            if poems[1]:
                for scan_url in poems[1]:
                    try:
                        poem = text_poem_scraper(scan_url)
                        poem['genre'] = genre
                        poem['poem_url'] = scan_url
                        poem_dicts.append(poem)
                        poems[0].append(scan_url)
                        poems[1].remove(scan_url)
                    except:
                        try:
                            poem = scan_poem_scraper(scan_url)
                            poem['genre'] = genre
                            poem['poet_url'] = scan_url
                            poem_dicts.append(poem)
                        except:
                            error_poems.append(scan_url)

KeyboardInterrupt: 

In [197]:
'https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool' in image_urls

True

In [196]:
text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool')

{'poet': 'H. D.',
 'poem_url': 'https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool',
 'title': 'The Pool',
 'poem_lines': ['Are you alive?',
  'I touch you.',
  'You quiver like a sea-fish.',
  'I cover you with my net.',
  'What are you—banded one?'],
 'poem_string': 'Are you alive?\nI touch you.\nYou quiver like a sea-fish.\nI cover you with my net.\nWhat are you—banded one?'}

In [195]:
poet_poems_url_dict['imagist'][1]

{'https://www.poetryfoundation.org/poets/h-d': (['https://www.poetryfoundation.org/poems/47927/leda-56d228c3a5948',
   'https://www.poetryfoundation.org/poems/51856/evening-56d22fe15dc07',
   'https://www.poetryfoundation.org/poems/44133/cassandra-56d2231be6015',
   'https://www.poetryfoundation.org/poems/48186/oread',
   'https://www.poetryfoundation.org/poems/48187/sea-poppies',
   'https://www.poetryfoundation.org/poems/46541/helen-56d22674d6e41',
   'https://www.poetryfoundation.org/poems/48189/sheltered-garden',
   'https://www.poetryfoundation.org/poems/44134/cities',
   'https://www.poetryfoundation.org/poems/48188/sea-rose',
   'https://www.poetryfoundation.org/poems/53970/sea-heroes',
   'https://www.poetryfoundation.org/poems/51870/sea-iris',
   'https://www.poetryfoundation.org/poems/51869/eurydice-56d22fe6d049d',
   'https://www.poetryfoundation.org/poems/48190/wash-of-cold-river'],
  ['https://www.poetryfoundation.org/poetrymagazine/poems/13056/the-pool',
   'https://www.p

In [194]:
len(image_urls)

2221

In [193]:
text_poems = pd.DataFrame(poem_dicts)
text_poems.shape

(3084, 7)

In [190]:
text_poems = text_poems[text_poems.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1321,Dylan Thomas,https://www.poetryfoundation.org/poems/26804/poem-on-his-birthday-facs-drafts,Poem on His Birthday [Facs. drafts],[],,modern
1433,Barbara Guest,https://www.poetryfoundation.org/poems/49367/imagined-room,Imagined Room,[],,new_york_school


In [184]:
poem_url = 'https://www.poetryfoundation.org/poems/51653/to-a-poor-old-woman'

# load a page and soupify it
page = rq.get(poem_url)
soup = bs(page.content, 'html.parser')

# most frequent formatting
lines_raw = soup.find_all('div', {'style': 'text-indent: -1em; padding-left: 1em;'})
# normalize text (from unicode)
lines = [normalize('NFKD', str(line.contents[0])) for line in lines_raw if line.contents]
# remove some hanging html
lines = [line.replace('<br/>', '') for line in lines]
line_pattern = '>(.*?)<'
lines = [re.search(line_pattern, line, re.I).group(1) if '<' in line else line for line in lines]
# scrape poem
# lines_raw = soup.find('div', {'data-view': 'PoemView'}).strings
# lines = [line.strip() for line in lines_raw if line.strip()]

# if not lines:
#     lines_raw = soup.find_all('div', {'style': 'text-indent: -1em; padding-left: 1em;'})
#     lines = [line.get_text().strip() for line in lines_raw if line.get_text().strip()]

# # create string version of poem
# poem_string = '\n'.join(lines)

# info = {'poet': poet,
#         'poem_url': poem_url,
#         'title': title,
#         'poem_lines': lines,
#         'poem_string': poem_string}

lines

['munching a plum on   ',
 '\r the street a paper bag',
 '\r of them in her hand',
 '',
 '\r They taste good to her',
 '\r They taste good   ',
 '\r to her. They taste',
 '\r good to her',
 '',
 '\r You can see it by',
 '\r the way she gives herself',
 '\r to the one half',
 '\r sucked out in her hand',
 '',
 'Comforted',
 '\r a solace of ripe plums',
 '\r seeming to fill the air',
 '\r They taste good to her',
 '']

In [181]:
lines_raw

[<div style="text-indent: -1em; padding-left: 1em;">munching a plum on   <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  the street a paper bag<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  of them in her hand<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;"><br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  They taste good to her<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  They taste good   <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  to her. They taste<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  good to her<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;"><br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  You can see it by<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  the way she gives herself<br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">
  to the one half<br/></div>,
 <div

In [177]:
lines2 = [str(line) for line in lines_raw]
lines2

['\n',
 '\n',
 '\n',
 '\n',
 'Highlight Actions',
 '\n',
 'Enable or disable annotations',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'munching a plum on\xa0\xa0\xa0',
 '\r the street a paper bag',
 '\r of them in her hand',
 '\r They taste good to her',
 '\r They taste good\xa0\xa0\xa0',
 '\r to her. They taste',
 '\r good to her',
 '\r You can see it by',
 '\r the way she gives herself',
 '\r to the one half',
 '\r sucked out in her hand',
 'Comforted',
 'Comforted',
 ' When originally published in the journal ',
 'Smoke',
 ' (Autumn 1934), the line read: “Comforted, Relieved—”',
 '\r a solace of ripe plums',
 '\r seeming to fill the air',
 '\r They taste good to her',
 '\n']

In [178]:
[line.strip() for line in lines2 if line.strip()]

['Highlight Actions',
 'Enable or disable annotations',
 'munching a plum on',
 'the street a paper bag',
 'of them in her hand',
 'They taste good to her',
 'They taste good',
 'to her. They taste',
 'good to her',
 'You can see it by',
 'the way she gives herself',
 'to the one half',
 'sucked out in her hand',
 'Comforted',
 'Comforted',
 'When originally published in the journal',
 'Smoke',
 '(Autumn 1934), the line read: “Comforted, Relieved—”',
 'a solace of ripe plums',
 'seeming to fill the air',
 'They taste good to her']

In [113]:
pd.read_csv('data/text_poems.csv', index_col=0)

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Mary Barber,https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage,Advice to Her Son on Marriage,[],,augustan
1,Susanna Blamire,https://www.poetryfoundation.org/poems/50534/auld-robin-forbes,Auld Robin Forbes,"['And auld Robin Forbes hes gien tem a dance,', 'I pat on my speckets to see them aw prance;', 'I thout o’ the days when I was but fifteen,', 'And...","And auld Robin Forbes hes gien tem a dance,\nI pat on my speckets to see them aw prance;\nI thout o’ the days when I was but fifteen,\nAnd skipp’d...",augustan
2,Susanna Blamire,https://www.poetryfoundation.org/poems/50532/the-siller-croun,The Siller Croun,"['And ye shall walk in silk attire,', 'And siller hae to spare,', 'Gin ye’ll consent to be his bride,', 'Nor think o’ Donald mair.', 'O wha wad bu...","And ye shall walk in silk attire,\nAnd siller hae to spare,\nGin ye’ll consent to be his bride,\nNor think o’ Donald mair.\nO wha wad buy a silken...",augustan
3,Susanna Blamire,https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man,O Donald! Ye Are Just the Man,"['O Donald! ye are just the man', 'Who, when he’s got a wife,', 'Begins to fratch— nae notice ta’en—', 'They’re strangers a’ their life.', 'The fa...","O Donald! ye are just the man\nWho, when he’s got a wife,\nBegins to fratch— nae notice ta’en—\nThey’re strangers a’ their life.\nThe fan may drop...",augustan
4,Henry Carey,https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley,The Ballad of Sally in our Alley,"['The ARGUMENT. A Vulgar Error having long prevailed among many Persons, who imagine Sally Salisbury the Subject of this Ballad, the Author begs ...","The ARGUMENT. A Vulgar Error having long prevailed among many Persons, who imagine Sally Salisbury the Subject of this Ballad, the Author begs le...",augustan
...,...,...,...,...,...,...
3079,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45487/in-school-days,In School-days,"['Still sits the school-house by the road, \xa0\xa0\xa0A ragged beggar sleeping; Around it still the sumachs grow, \xa0\xa0\xa0And blackberry-vine...","Still sits the school-house by the road, A ragged beggar sleeping; Around it still the sumachs grow, And blackberry-vines are creeping. With...",victorian
3080,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45483/barbara-frietchie,Barbara Frietchie,"['Up from the meadows rich with corn,', 'Clear in the cool September morn,', 'The clustered spires of Frederick stand', 'Green-walled by the hills...","Up from the meadows rich with corn,\nClear in the cool September morn,\nThe clustered spires of Frederick stand\nGreen-walled by the hills of Mary...",victorian
3081,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45489/skipper-iresons-ride,Skipper Ireson’s Ride,"['Of all the rides since the birth of time,', 'Told in story or sung in rhyme, —', 'On Apuleius’s Golden Ass,', 'Or one-eyed Calender’s horse of b...","Of all the rides since the birth of time,\nTold in story or sung in rhyme, —\nOn Apuleius’s Golden Ass,\nOr one-eyed Calender’s horse of brass,\nW...",victorian
3082,John Greenleaf Whittier,https://www.poetryfoundation.org/poems/45493/the-worship-of-nature,The Worship of Nature,['The harp at Nature’s advent strung \xa0\xa0\xa0\xa0\xa0\xa0Has never ceased to play; The song the stars of morning sung \xa0\xa0\xa0\xa0\xa0\xa0...,The harp at Nature’s advent strung Has never ceased to play; The song the stars of morning sung Has never died away. And prayer is mad...,victorian


In [110]:
# uncomment to save
with gzip.open('data/text_poems.pkl', 'wb') as goodbye:
    pickle.dump(text_poems, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# uncomment to load
with gzip.open('data/text_poems.pkl', 'rb') as hello:
    df = pickle.load(hello)

RecursionError: maximum recursion depth exceeded while getting the str of an object

In [37]:
page = rq.get(list(poet_urls_dict['augustan'][10].values())[0][0][0])
soup = bs(page.content, 'html.parser')

In [38]:
poet = soup.find('a', href=re.compile('.*/poets/.*')).contents[0]
title = soup.find('h1').contents[-1].strip()
poet,title

('Thomas Gray', 'On the Death of Richard West')

In [80]:
pd.DataFrame(text_poem_scraper(list(poet_poems_url_dict['black_mountain'][1].values())[0][0][0]))

Unnamed: 0,0
0,Robert Creeley
1,After Frost
2,"[He comes here, by whatever way he can,, not too late,, not too soon., He sits, waiting., He doesn’t know, why he should, have such a patience., H..."
3,"He comes here\nby whatever way he can,\nnot too late,\nnot too soon.\nHe sits, waiting.\nHe doesn’t know\nwhy he should\nhave such a patience.\nHe..."


In [39]:
# most frequent formatting
lines_raw = soup.find_all('div', {'style': 'text-indent: -1em; padding-left: 1em;'})
lines_raw

[<div style="text-indent: -1em; padding-left: 1em;">In vain to me the smiling Mornings shine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And reddening Phœbus lifts his golden fire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">The birds in vain their amorous descant join;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    Or cheerful fields resume their green attire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">These ears, alas! for other notes repine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    A different object do these eyes require;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">My lonely anguish melts no heart but mine;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And in my breast the imperfect joys expire.
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">Yet Morning smiles the busy race to cheer,
 <br/></div>

In [41]:
# if 'text-align' is justified
lines_raw = soup.find_all('div', {'style': 'text-align: justify;'})
lines_raw

[]

In [50]:
lines_raw

['\n',
 <div style="text-indent: -1em; padding-left: 1em;">In vain to me the smiling Mornings shine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And reddening Phœbus lifts his golden fire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">The birds in vain their amorous descant join;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    Or cheerful fields resume their green attire;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">These ears, alas! for other notes repine,
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    A different object do these eyes require;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">My lonely anguish melts no heart but mine;
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">    And in my breast the imperfect joys expire.
 <br/></div>,
 <div style="text-indent: -1em; padding-left: 1em;">Yet Morning smiles the busy race to cheer,
 <br/

In [65]:
lines_raw = soup.find('div', {'data-view': 'PoemView'}).get_text().split('\r')
lines = [line.strip() for line in lines_raw if line.strip()]
lines

['In vain to me the smiling Mornings shine,',
 'And reddening Phœbus lifts his golden fire;',
 'The birds in vain their amorous descant join;',
 'Or cheerful fields resume their green attire;',
 'These ears, alas! for other notes repine,',
 'A different object do these eyes require;',
 'My lonely anguish melts no heart but mine;',
 'And in my breast the imperfect joys expire.',
 'Yet Morning smiles the busy race to cheer,',
 'And new-born pleasure brings to happier men;',
 'The fields to all their wonted tribute bear;',
 'To warm their little loves the birds complain;',
 'I fruitless mourn to him that cannot hear,',
 'And weep the more because I weep in vain.']

In [64]:
lines_raw

['\nIn vain to me the smiling Mornings shine,',
 '    And reddening Phœbus lifts his golden fire;',
 'The birds in vain their amorous descant join;',
 '    Or cheerful fields resume their green attire;',
 'These ears, alas! for other notes repine,',
 '    A different object do these eyes require;',
 'My lonely anguish melts no heart but mine;',
 '    And in my breast the imperfect joys expire.',
 'Yet Morning smiles the busy race to cheer,',
 '    And new-born pleasure brings to happier men;',
 'The fields to all their wonted tribute bear;',
 '    To warm their little loves the birds complain;',
 'I fruitless mourn to him that cannot hear,',
 '    And weep the more because I weep in vain.',
 '\n']

In [49]:
# scrape 'PoemView' html type
lines_raw = soup.find('div', {'data-view': 'PoemView'})

line_pattern = '>(.*?)<'
lines = [re.search(line_pattern, line, re.I).group(1) if '<' in line else line for line in lines]

# normalize text (from unicode)
lines = [normalize('NFKD', str(line)) for line in lines_raw if line]

# lines = [line.replace('<br/>', '') for line in lines]
lines = [line.strip() for line in lines if line]
lines

['',
 '<div style="text-indent: -1em; padding-left: 1em;">In vain to me the smiling Mornings shine,\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    And reddening Phœbus lifts his golden fire;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">The birds in vain their amorous descant join;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    Or cheerful fields resume their green attire;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">These ears, alas! for other notes repine,\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    A different object do these eyes require;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">My lonely anguish melts no heart but mine;\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">    And in my breast the imperfect joys expire.\r<br/></div>',
 '<div style="text-indent: -1em; padding-left: 1em;">Yet Morning smiles the busy race 

- **Check for duplicate values**

In [4]:
# create dataframe from poet_urls_dict
poet_df = pd.DataFrame([(genre,v) for genre in poet_urls_dict.keys() for v in poet_urls_dict[genre]])

# check if any URLs appear more than once
pd.concat(g for _, g in poet_df.groupby(1) if len(g) > 1)

Unnamed: 0,0,1
126,imagist,https://www.poetryfoundation.org/poets/ezra-pound
186,modern,https://www.poetryfoundation.org/poets/ezra-pound
122,imagist,https://www.poetryfoundation.org/poets/richard-aldington
150,modern,https://www.poetryfoundation.org/poets/richard-aldington


- **I'll give those poets to the imagist genre, since it has so few already.**

In [5]:
# list of duplicate URLs
dups = [value for value in poet_df[poet_df.duplicated(1)][1]]
dups

['https://www.poetryfoundation.org/poets/richard-aldington',
 'https://www.poetryfoundation.org/poets/ezra-pound']

In [6]:
# number of modern poets before
len(poet_urls_dict['modern'])

54

In [7]:
# re-listify the modernist URLs without Pound and Aldington
poet_urls_dict['modern'] = [url for url in poet_urls_dict['modern'] if url not in dups]

# number of modern poets after
len(poet_urls_dict['modern'])

52

## Build a dataframe
- **Scrape poems and other info.**

In [15]:
%%time

# instantiate an empty dataframe
df = pd.DataFrame()

# loop over each genre, create dataframe with desired information,
# concat to original dataframe, then save it before looping again
for genre in list(poet_urls_dict.keys()):
    genre_df = pf_scraper(poet_urls_dict, genre, 0.5)
    df = pd.concat([df, genre_df])
    df.to_csv('data/poetry_foundation_raw.csv')

KeyboardInterrupt: 

### Save/load dataframe

In [2]:
# # uncomment to save
# df.to_csv('data/poetry_foundation_raw.csv')

# # uncomment to load
# df = pd.read_csv('data/poetry_foundation_raw.csv', index_col=0)

In [3]:
# rename the columns
df.columns = ['poet_url', 'genre', 'poem_url', 'poet', 'title', 'year', 'poem_lines', 'poem_string']
df.head()

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
0,https://www.poetryfoundation.org/poets/mary-barber,augustan,https://www.poetryfoundation.org/poems/50523/advice-to-her-son-on-marriage,Mary Barber,Advice to Her Son on Marriage,,"['When you gain her Affection, take care to preserve it;\r', 'Lest others persuade her, you do not deserve it.\r', 'Still study to heighten the Jo...","When you gain her Affection, take care to preserve it;\r\nLest others persuade her, you do not deserve it.\r\nStill study to heighten the Joys of ..."
1,https://www.poetryfoundation.org/poets/susanna-blamire,augustan,https://www.poetryfoundation.org/poems/50534/auld-robin-forbes,Susanna Blamire,Auld Robin Forbes,,"['And auld Robin Forbes hes gien tem a dance,\r', 'I pat on my speckets to see them aw prance;\r', 'I thout o’ the days when I was but fifteen,\r'...","And auld Robin Forbes hes gien tem a dance,\r\nI pat on my speckets to see them aw prance;\r\nI thout o’ the days when I was but fifteen,\r\nAnd s..."
2,https://www.poetryfoundation.org/poets/susanna-blamire,augustan,https://www.poetryfoundation.org/poems/50533/o-donald-ye-are-just-the-man,Susanna Blamire,O Donald! Ye Are Just the Man,,"['O Donald! ye are just the man\r', ' Who, when he’s got a wife,\r', 'Begins to fratch— nae notice ta’en—\r', ' They’re strangers a’ their life....","O Donald! ye are just the man\r\n Who, when he’s got a wife,\r\nBegins to fratch— nae notice ta’en—\r\n They’re strangers a’ their life.\r\n\nTh..."
3,https://www.poetryfoundation.org/poets/susanna-blamire,augustan,https://www.poetryfoundation.org/poems/50532/the-siller-croun,Susanna Blamire,The Siller Croun,,"['And ye shall walk in silk attire,\r', ' And siller hae to spare,\r', 'Gin ye’ll consent to be his bride,\r', ' Nor think o’ Donald mair.\r'...","And ye shall walk in silk attire,\r\n And siller hae to spare,\r\nGin ye’ll consent to be his bride,\r\n Nor think o’ Donald mair.\r\nO wha w..."
4,https://www.poetryfoundation.org/poets/henry-carey,augustan,https://www.poetryfoundation.org/poems/43884/the-ballad-of-sally-in-our-alley,Henry Carey,The Ballad of Sally in our Alley,,"['Of all the Girls that are so smart\r', ' There’s none like pretty SALLY,\r', 'She is the Darling of my Heart,\r', ' And she lives in our...","Of all the Girls that are so smart\r\n There’s none like pretty SALLY,\r\nShe is the Darling of my Heart,\r\n And she lives in our Alley.\..."


- **Explore how the data looks.**

In [4]:
df.shape

(5295, 8)

In [5]:
df.genre.unique()

array(['augustan', 'beat', 'black_arts_movement', 'black_mountain',
       'confessional', 'fugitive', 'georgian', 'harlem_renaissance',
       'imagist', 'language_poetry', 'middle_english', 'modern',
       'new_york_school', 'new_york_school_2nd_generation', 'objectivist',
       'renaissance', 'romantic', 'victorian'], dtype=object)

In [6]:
df.genre.value_counts()

modern                            1324
victorian                          674
renaissance                        430
romantic                           407
imagist                            370
new_york_school                    265
black_mountain                     257
new_york_school_2nd_generation     193
language_poetry                    192
confessional                       176
georgian                           167
black_arts_movement                165
objectivist                        159
harlem_renaissance                 148
beat                               147
augustan                           121
fugitive                            90
middle_english                      10
Name: genre, dtype: int64

- **Check for duplicate values across multiple columns and drop those rows.**

In [7]:
df.duplicated(subset=['poet_url', 'genre', 'poem_url', 'poet', 'title', 'year', 'poem_string'], keep='last').sum()

98

In [8]:
# drop duplicates
df.drop_duplicates(subset=['poet_url', 'genre', 'poem_url', 'poet', 'title', 'year', 'poem_string'],
                   keep='last',
                   inplace=True)

# reset index
df.reset_index(drop=True, inplace=True)

In [9]:
# check changes
df.shape

(5197, 8)

In [10]:
df.genre.value_counts()

modern                            1284
victorian                          643
renaissance                        427
romantic                           398
imagist                            370
new_york_school                    265
black_mountain                     257
new_york_school_2nd_generation     192
language_poetry                    192
confessional                       176
black_arts_movement                165
georgian                           160
objectivist                        159
harlem_renaissance                 148
beat                               147
augustan                           114
fugitive                            90
middle_english                      10
Name: genre, dtype: int64

- **Looks like the poem_lines column converted to a list inside of a string while saving to CSV.**
- **I'll wait to convert it until I can fill some missing values for that column, a process I found to be more easily done as a list inside of a string.**

In [11]:
df.loc[0,'poem_lines']

"['When you gain her Affection, take care to preserve it;\\r', 'Lest others persuade her, you do not deserve it.\\r', 'Still study to heighten the Joys of her Life;\\r', 'Not treat her the worse, for her being your Wife.\\r', 'If in Judgment she errs, set her right, without Pride:\\r', '’Tis the Province of insolent Fools, to deride.\\r', 'A Husband’s first Praise, is a ', 'Then change not these Titles, for ', 'Let your Person be neat, unaffectedly clean,\\r', 'Tho’ alone with your wife the whole Day you remain.\\r', 'Chuse Books, for her study, to fashion her Mind,\\r', 'To emulate those who excell’d of her Kind.\\r', 'Be Religion the principal Care of your Life,\\r', 'As you hope to be blest in your Children and Wife:\\r', 'So you, in your Marriage, shall gain its true End;\\r', 'And find, in your Wife, a ', '', '']"

- **Check for missing values.**

In [12]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet             13
title           215
year           1649
poem_lines      410
poem_string     412
dtype: int64

In [13]:
df[df.poet.isna()]

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
858,https://www.poetryfoundation.org/poets/w-d-snodgrass,confessional,https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d22969928f0,,,2006.0,"['ILEANA MALANCIOIU', '', 'Road', '', 'I walk on a dark road so that I won’t see', '', 'The way my young oxen limp so much;', '', 'The horseshoes ...",ILEANA MALANCIOIU\n\nRoad\n\nI walk on a dark road so that I won’t see\n\nThe way my young oxen limp so much;\n\nThe horseshoes gouging into their...
1409,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14311/after-how-many-years-tr-by-amy-lowell-and-florence-ayscough,,After How Many Years Tr By Amy Lowell And Florence Ayscough,1919.0,,
1410,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14312/calligraphy-tr-by-amy-lowell-and-florence-ayscough,,Calligraphy Tr By Amy Lowell And Florence Ayscough,1919.0,,
1411,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14322/the-emperors-return-from-a-journey-to-the-south-tr-by-amy-lowell-and-florence-ayscough,,The Emperors Return From A Journey To The South Tr By Amy Lowell And Florence Ayscough,1919.0,,
1412,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14310/an-evening-meeting-tr-by-amy-lowell-and-florence-ayscough,,An Evening Meeting Tr By Amy Lowell And Florence Ayscough,1919.0,,
1413,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14314/from-the-straw-hut-among-the-seven-peaks-tr-by-amy-lowell-and-florence-ayscough,,From The Straw Hut Among The Seven Peaks Tr By Amy Lowell And Florence Ayscough,1919.0,,
1414,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14321/the-inn-at-the-western-lake-tr-by-amy-lowell-and-florence-ayscough,,The Inn At The Western Lake Tr By Amy Lowell And Florence Ayscough,1919.0,,
1415,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14296/on-seeing-the-portrait-of-a-beautiful-concubine-tr-by-amy-lowell-and-florence-ayscough,,On Seeing The Portrait Of A Beautiful Concubine Tr By Amy Lowell And Florence Ayscough,1919.0,,
1416,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14316/on-the-classic-of-the-hills-and-sea-tr-by-amy-lowell-and-florence-ayscough,,On The Classic Of The Hills And Sea Tr By Amy Lowell And Florence Ayscough,1919.0,,
1417,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/14313/one-goes-a-journey-tr-by-amy-lowell-and-florence-ayscough,,One Goes A Journey Tr By Amy Lowell And Florence Ayscough,1919.0,,


- **The Amy Lowell and Ben Jonson entries appear unuseable, so I'll drop those rows.**
- **I'll go ahead and fill in the missing info for the Snodgrass poem (which is actually a translation of another poet, but a Confessional translator will probably produce a Confessional work).**

In [14]:
# manually load in information to the poet and title column
df.loc[858,'poet'] = 'ILEANA MALANCIOIU'.title()
df.loc[858,'title'] = 'Road'
df[df.index == 858]

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
858,https://www.poetryfoundation.org/poets/w-d-snodgrass,confessional,https://www.poetryfoundation.org/poetrymagazine/poems/48292/road-56d22969928f0,Ileana Malancioiu,Road,2006.0,"['ILEANA MALANCIOIU', '', 'Road', '', 'I walk on a dark road so that I won’t see', '', 'The way my young oxen limp so much;', '', 'The horseshoes ...",ILEANA MALANCIOIU\n\nRoad\n\nI walk on a dark road so that I won’t see\n\nThe way my young oxen limp so much;\n\nThe horseshoes gouging into their...


In [15]:
# drop the rows with missing values in the poet column
df.dropna(subset=['poet'], inplace=True)

In [16]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines      398
poem_string     400
dtype: int64

## Rescraping
- **After reworking the scraping function a bit, I can try to fill in some missing poem_lines and poem_string values.**

### Round 1

In [17]:
# create a list of index numbers with NaN values in the poem_lines column
lookups = list(df[df.poem_lines.isna()].index)
lookups

[158,
 168,
 169,
 171,
 175,
 183,
 184,
 200,
 203,
 210,
 229,
 254,
 283,
 324,
 325,
 336,
 351,
 354,
 361,
 458,
 466,
 482,
 484,
 487,
 490,
 503,
 511,
 512,
 513,
 531,
 532,
 542,
 558,
 568,
 576,
 578,
 624,
 626,
 648,
 660,
 661,
 663,
 664,
 694,
 701,
 702,
 703,
 704,
 705,
 707,
 708,
 711,
 714,
 715,
 716,
 717,
 719,
 727,
 736,
 749,
 751,
 753,
 769,
 770,
 817,
 834,
 853,
 872,
 881,
 885,
 886,
 892,
 897,
 900,
 917,
 921,
 940,
 942,
 943,
 944,
 945,
 946,
 947,
 1004,
 1025,
 1123,
 1163,
 1169,
 1171,
 1184,
 1186,
 1192,
 1234,
 1297,
 1299,
 1319,
 1326,
 1345,
 1348,
 1363,
 1367,
 1371,
 1379,
 1383,
 1392,
 1395,
 1404,
 1440,
 1446,
 1452,
 1456,
 1467,
 1468,
 1477,
 1482,
 1489,
 1495,
 1496,
 1498,
 1500,
 1502,
 1503,
 1505,
 1515,
 1516,
 1517,
 1518,
 1519,
 1551,
 1552,
 1553,
 1554,
 1555,
 1556,
 1560,
 1565,
 1566,
 1587,
 1591,
 1594,
 1602,
 1604,
 1617,
 1618,
 1623,
 1631,
 1711,
 1731,
 1732,
 1743,
 1748,
 1770,
 1786,
 1815,
 1816

In [18]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I was getting a 'ValueError: Must have equal len keys and value when setting with an iterable', but converting
# the list to a string first seemed to make that go away. I have to convert this entire column anyway next.
for i in lookups:
    info = poem_scraper(df.loc[i, 'poem_url'])
    try:
        df.loc[i,'poem_lines'] = str(info[3])
        df.loc[i,'poem_string'] = info[4]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 158
Success -- 168
Success -- 169
Success -- 171
Success -- 175
Success -- 183
Success -- 184
Success -- 200
Success -- 203
Success -- 210
Success -- 229
Success -- 254
Success -- 283
Success -- 324
Success -- 325
Success -- 336
Success -- 351
Success -- 354
Success -- 361
Success -- 458
Success -- 466
Success -- 482
Success -- 484
Success -- 487
Success -- 490
Success -- 503
Success -- 511
Success -- 512
Success -- 513
Success -- 531
Success -- 532
Success -- 542
Success -- 558
Success -- 568
Success -- 576
Success -- 578
Success -- 624
Success -- 626
Success -- 648
Success -- 660
Success -- 661
Success -- 663
Success -- 664
Success -- 694
Success -- 701
Success -- 702
Success -- 703
Success -- 704
Success -- 705
Success -- 707
Success -- 708
Success -- 711
Success -- 714
Success -- 715
Success -- 716
Success -- 717
Success -- 719
Success -- 727
Success -- 736
Success -- 749
Success -- 751
Success -- 753
Success -- 769
Success -- 770
Success -- 817
Success -- 834
Success --

- **Looks like the loop was somewhat successful though it did turn NaN values into the string 'nan'.**
- **I'll look first for other NaNs I may want to get rid of.**

In [20]:
df['poem_lines'] = df['poem_lines'].apply(destringify)

In [21]:
df.loc[0,'poem_lines']

['When you gain her Affection, take care to preserve it;\r',
 'Lest others persuade her, you do not deserve it.\r',
 'Still study to heighten the Joys of her Life;\r',
 'Not treat her the worse, for her being your Wife.\r',
 'If in Judgment she errs, set her right, without Pride:\r',
 '’Tis the Province of insolent Fools, to deride.\r',
 'A Husband’s first Praise, is a ',
 'Then change not these Titles, for ',
 'Let your Person be neat, unaffectedly clean,\r',
 'Tho’ alone with your wife the whole Day you remain.\r',
 'Chuse Books, for her study, to fashion her Mind,\r',
 'To emulate those who excell’d of her Kind.\r',
 'Be Religion the principal Care of your Life,\r',
 'As you hope to be blest in your Children and Wife:\r',
 'So you, in your Marriage, shall gain its true End;\r',
 'And find, in your Wife, a ',
 '',
 '']

In [23]:
# convert the string 'nan' back to NaN value
df['poem_lines'] = np.where(df['poem_lines'] == 'nan', np.nan, df['poem_lines'])

# check
df.loc[169,'poem_lines']

nan

In [24]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines      344
poem_string     346
dtype: int64

### Round 2

In [34]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups2 = list(df[df.poem_lines.isna()].index)
lookups2

[169,
 171,
 183,
 184,
 200,
 203,
 210,
 229,
 254,
 283,
 324,
 325,
 458,
 466,
 482,
 484,
 487,
 490,
 503,
 511,
 512,
 513,
 531,
 532,
 558,
 568,
 576,
 578,
 624,
 626,
 648,
 660,
 661,
 663,
 664,
 694,
 701,
 702,
 703,
 704,
 705,
 707,
 708,
 711,
 714,
 715,
 716,
 717,
 719,
 727,
 736,
 749,
 751,
 753,
 769,
 770,
 834,
 853,
 872,
 881,
 885,
 886,
 892,
 897,
 900,
 917,
 921,
 940,
 942,
 943,
 944,
 945,
 946,
 947,
 1004,
 1025,
 1163,
 1169,
 1171,
 1184,
 1186,
 1234,
 1297,
 1299,
 1319,
 1363,
 1367,
 1371,
 1379,
 1383,
 1392,
 1395,
 1404,
 1440,
 1446,
 1452,
 1456,
 1467,
 1468,
 1477,
 1482,
 1489,
 1495,
 1496,
 1498,
 1500,
 1502,
 1503,
 1505,
 1551,
 1552,
 1553,
 1554,
 1555,
 1556,
 1560,
 1565,
 1566,
 1587,
 1591,
 1594,
 1602,
 1604,
 1617,
 1618,
 1623,
 1711,
 1834,
 1836,
 1837,
 1839,
 1844,
 1865,
 1867,
 1870,
 1875,
 1876,
 1877,
 1906,
 1914,
 1915,
 1940,
 1965,
 1975,
 1976,
 1977,
 1978,
 1979,
 1993,
 1994,
 1997,
 1999,
 2000,
 20

In [41]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I was getting a 'ValueError: Must have equal len keys and value when setting with an iterable', but converting
# the list to a string first seemed to make that go away. I have to convert this entire column anyway next.
for i in lookups2:
    try:
        info = image_rescraper_poet(df.loc[i, 'poem_url'], df.loc[i, 'poet'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 169
Success -- 171
Failure -- 183
Failure -- 184
Failure -- 200
Failure -- 203
Success -- 210
Failure -- 229
Failure -- 254
Failure -- 283
Failure -- 324
Failure -- 325
Success -- 458
Success -- 466
Success -- 482
Success -- 484
Success -- 487
Success -- 490
Success -- 503
Success -- 511
Success -- 512
Failure -- 513
Success -- 531
Success -- 532
Success -- 558
Success -- 568
Failure -- 576
Failure -- 578
Success -- 624
Success -- 626
Failure -- 648
Success -- 660
Success -- 661
Success -- 663
Success -- 664
Success -- 694
Success -- 701
Success -- 702
Success -- 703
Failure -- 704
Success -- 705
Failure -- 707
Success -- 708
Success -- 711
Success -- 714
Failure -- 715
Success -- 716
Failure -- 717
Success -- 719
Success -- 727
Success -- 736
Success -- 749
Failure -- 751
Failure -- 753
Failure -- 769
Failure -- 770
Success -- 834
Success -- 853
Success -- 872
Failure -- 881
Success -- 885
Success -- 886
Failure -- 892
Success -- 897
Failure -- 900
Failure -- 917
Success --

### Round 3

In [42]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups3 = list(df[df.poem_lines.isna()].index)
lookups3

[183,
 184,
 200,
 203,
 229,
 254,
 283,
 324,
 325,
 513,
 576,
 578,
 648,
 704,
 707,
 715,
 717,
 751,
 753,
 769,
 770,
 881,
 892,
 900,
 917,
 940,
 943,
 945,
 946,
 947,
 1025,
 1163,
 1169,
 1184,
 1234,
 1297,
 1299,
 1319,
 1363,
 1367,
 1371,
 1383,
 1392,
 1404,
 1440,
 1446,
 1456,
 1467,
 1468,
 1477,
 1482,
 1489,
 1495,
 1496,
 1498,
 1500,
 1502,
 1503,
 1505,
 1552,
 1554,
 1587,
 1594,
 1604,
 1617,
 1618,
 1623,
 1711,
 1834,
 1836,
 1837,
 1839,
 1865,
 1870,
 1915,
 1975,
 1976,
 1977,
 1978,
 1979,
 1993,
 1997,
 2003,
 2008,
 2011,
 2013,
 2019,
 2021,
 2023,
 2026,
 2032,
 2037,
 2042,
 2044,
 2050,
 2055,
 2091,
 2092,
 2093,
 2117,
 2122,
 2123,
 2156,
 2163,
 2165,
 2171,
 2193,
 2206,
 2240,
 2249,
 2293,
 2307,
 2310,
 2336,
 2349,
 2412,
 2417,
 2421,
 2424,
 2425,
 2434,
 2444,
 2451,
 2452,
 2457,
 2458,
 2461,
 2464,
 2488,
 2492,
 2528,
 2546,
 2572,
 2647,
 2648,
 2649,
 2728,
 2730,
 2744,
 2746,
 2776,
 2787,
 2803,
 2829,
 2851,
 2869,
 2877,
 

In [46]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I was getting a 'ValueError: Must have equal len keys and value when setting with an iterable', but converting
# the list to a string first seemed to make that go away. I have to convert this entire column anyway next.
for i in lookups3:
    try:
        info = image_rescraper_POETRY(df.loc[i, 'poem_url'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Failure -- 183
Failure -- 184
Success -- 200
Success -- 203
Success -- 229
Success -- 254
Failure -- 283
Success -- 324
Success -- 325
Success -- 513
Success -- 576
Success -- 578
Success -- 648
Success -- 704
Success -- 707
Success -- 715
Success -- 717
Success -- 751
Success -- 753
Success -- 769
Success -- 770
Success -- 881
Failure -- 892
Success -- 900
Success -- 917
Success -- 940
Success -- 943
Failure -- 945
Success -- 946
Success -- 947
Success -- 1025
Success -- 1163
Success -- 1169
Success -- 1184
Success -- 1234
Failure -- 1297
Success -- 1299
Success -- 1319
Failure -- 1363
Success -- 1367
Failure -- 1371
Failure -- 1383
Failure -- 1392
Success -- 1404
Failure -- 1440
Failure -- 1446
Success -- 1456
Success -- 1467
Success -- 1468
Success -- 1477
Failure -- 1482
Failure -- 1489
Failure -- 1495
Failure -- 1496
Success -- 1498
Failure -- 1500
Success -- 1502
Success -- 1503
Success -- 1505
Failure -- 1552
Success -- 1554
Success -- 1587
Success -- 1594
Success -- 1604
Failur

In [47]:
df.loc[200,'poem_lines']

"['© SHE IS AS LOVELY-OFTEN', 'And tallness stood upon the sky like a sparkling mane', 'O she is as lovely-often as every day; the day', 'following the day . . the day of our lives, the brief day.', 'Within this moving room, this shadowy often-', 'ness of days where the little hurry of our lives is said. .', 'O as lovely-often as the moving wing of a bird.', 'But ah, alas, sooner or later each of us must', 'stand before that Roman Court, and be judged free of', 'even such lies as I told about the imperishable beauty of', 'her hair. But that time is not now, and even such lies as', 'I said about the enduring wonder of her grace, are lies', 'that contain within them the only truth by which a', 'man may live in this world.', 'she is as lovely-often as every day; the day', 'following the little day . . the day of our lives, ah, alas,', 'the brief day.', 'FIRST CAME THE LION-RIDER', 'First came the Lion-Rider, across the green', 'fields of the morning, holding golden in his golden', 'hands 

### Round 4

In [48]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups4 = list(df[df.poem_lines.isna()].index)
lookups4

[183,
 184,
 283,
 892,
 945,
 1297,
 1363,
 1371,
 1383,
 1392,
 1440,
 1446,
 1482,
 1489,
 1495,
 1496,
 1500,
 1552,
 1617,
 1836,
 1839,
 1865,
 1870,
 1975,
 1976,
 1977,
 1978,
 1979,
 2003,
 2013,
 2050,
 2093,
 2122,
 2123,
 2412,
 2424,
 2434,
 2451,
 2452,
 2457,
 2458,
 2546,
 2572,
 2728,
 2776,
 2933,
 3004,
 3327,
 3335,
 3336,
 3452,
 4309]

In [60]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I reworked the image_rescraper_poet function from earlier, so I'm running that again
for i in lookups4:
    try:
        info = image_rescraper_poet(df.loc[i, 'poem_url'], df.loc[i, 'poet'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 183
Success -- 184
Failure -- 283
Failure -- 892
Failure -- 945
Success -- 1297
Failure -- 1363
Failure -- 1371
Failure -- 1383
Failure -- 1392
Failure -- 1440
Failure -- 1446
Failure -- 1482
Failure -- 1489
Failure -- 1495
Failure -- 1496
Failure -- 1500
Failure -- 1552
Failure -- 1617
Failure -- 1836
Failure -- 1839
Failure -- 1865
Failure -- 1870
Failure -- 1975
Failure -- 1976
Failure -- 1977
Failure -- 1978
Failure -- 1979
Success -- 2003
Success -- 2013
Failure -- 2050
Success -- 2093
Failure -- 2122
Failure -- 2123
Failure -- 2412
Failure -- 2424
Success -- 2434
Failure -- 2451
Failure -- 2452
Failure -- 2457
Failure -- 2458
Failure -- 2546
Failure -- 2572
Failure -- 2728
Failure -- 2776
Failure -- 2933
Failure -- 3004
Success -- 3327
Success -- 3335
Success -- 3336
Failure -- 3452
Failure -- 4309
CPU times: user 5.96 s, sys: 798 ms, total: 6.75 s
Wall time: 1min 13s


### Round 5

In [61]:
# again, create a list of index numbers with NaN values in the poem_lines column
lookups5 = list(df[df.poem_lines.isna()].index)
lookups5

[283,
 892,
 945,
 1363,
 1371,
 1383,
 1392,
 1440,
 1446,
 1482,
 1489,
 1495,
 1496,
 1500,
 1552,
 1617,
 1836,
 1839,
 1865,
 1870,
 1975,
 1976,
 1977,
 1978,
 1979,
 2050,
 2122,
 2123,
 2412,
 2424,
 2451,
 2452,
 2457,
 2458,
 2546,
 2572,
 2728,
 2776,
 2933,
 3004,
 3452,
 4309]

In [69]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I reworked the image_rescraper_poet function from earlier, so am running that again
for i in lookups5:
    try:
        info = image_rescraper_title(df.loc[i, 'poem_url'], df.loc[i, 'title'])
        df.loc[i,'poem_lines'] = str(info[0])
        df.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 283
Success -- 892
Success -- 945
Success -- 1363
Success -- 1371
Success -- 1383
Success -- 1392
Failure -- 1440
Success -- 1446
Failure -- 1482
Success -- 1489
Failure -- 1495
Success -- 1496
Failure -- 1500
Success -- 1552
Failure -- 1617
Success -- 1836
Failure -- 1839
Success -- 1865
Success -- 1870
Failure -- 1975
Failure -- 1976
Success -- 1977
Failure -- 1978
Failure -- 1979
Failure -- 2050
Success -- 2122
Failure -- 2123
Failure -- 2412
Success -- 2424
Failure -- 2451
Success -- 2452
Success -- 2457
Failure -- 2458
Success -- 2546
Success -- 2572
Success -- 2728
Failure -- 2776
Success -- 2933
Success -- 3004
Success -- 3452
Success -- 4309
CPU times: user 4.89 s, sys: 663 ms, total: 5.56 s
Wall time: 58.8 s


### A little excessive, but not bad!

In [73]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines        7
poem_string       9
dtype: int64

- **I'll drop the remaining rows with missing poem_lines values.**

In [75]:
# drop the rows with missing values in the poem_lines column
df.dropna(subset=['poem_lines'], inplace=True)

In [76]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title           214
year           1649
poem_lines        0
poem_string       2
dtype: int64

- **The pages for the rows with missing poem_string values appear to be blank so I'll drop those.**

In [77]:
df[df.poem_string.isna()]

Unnamed: 0,poet_url,genre,poem_url,poet,title,year,poem_lines,poem_string
2941,https://www.poetryfoundation.org/poets/dylan-thomas,modern,https://www.poetryfoundation.org/poems/26804/poem-on-his-birthday-facs-drafts,Dylan Thomas,Poem on His Birthday [Facs. drafts],,[],
3230,https://www.poetryfoundation.org/poets/barbara-guest,new_york_school,https://www.poetryfoundation.org/poems/49367/imagined-room,Barbara Guest,Imagined Room,,[],


In [78]:
# drop the rows with missing values in the poem_string column, the pages for which do appear blank
df.dropna(subset=['poem_string'], inplace=True)

- **I'll try to fill in the title column using Regex.**

In [79]:
# create a list of index numbers with NaN values in the title column
lookups_title = list(df[df.title.isna()].index)
lookups_title

[166,
 251,
 275,
 285,
 306,
 459,
 460,
 462,
 463,
 469,
 470,
 471,
 472,
 514,
 517,
 521,
 522,
 523,
 552,
 556,
 557,
 559,
 561,
 563,
 567,
 619,
 631,
 639,
 641,
 642,
 696,
 710,
 779,
 780,
 830,
 831,
 906,
 908,
 922,
 924,
 986,
 999,
 1012,
 1046,
 1112,
 1136,
 1143,
 1164,
 1174,
 1261,
 1262,
 1296,
 1349,
 1455,
 1539,
 1540,
 1586,
 1588,
 1596,
 1599,
 1609,
 1757,
 1842,
 1848,
 1849,
 1903,
 1907,
 1908,
 1930,
 1935,
 1946,
 1947,
 1955,
 2028,
 2034,
 2118,
 2159,
 2160,
 2167,
 2177,
 2182,
 2188,
 2198,
 2210,
 2211,
 2212,
 2219,
 2223,
 2291,
 2363,
 2415,
 2426,
 2428,
 2460,
 2466,
 2493,
 2494,
 2522,
 2757,
 2758,
 2760,
 2767,
 2778,
 2781,
 2796,
 2806,
 2816,
 2820,
 2830,
 2845,
 2847,
 2858,
 2862,
 2864,
 2871,
 2953,
 2955,
 2969,
 2996,
 2997,
 3002,
 3008,
 3167,
 3271,
 3309,
 3346,
 3360,
 3369,
 3380,
 3381,
 3390,
 3430,
 3431,
 3433,
 3449,
 3456,
 3533,
 3592,
 3593,
 3641,
 3644,
 3677,
 3696,
 3704,
 3705,
 3707,
 3708,
 3709,
 3714,

In [80]:
%%time

# create regex pattern to capture the ending of the url
title_pattern = '.+/([a-z\-]*).*$'

# iterate over the list, attempting to fill in the title with re-stylized url ending
for i in lookups_title:
    title = re.search(title_pattern, df.loc[i,'poem_url'], re.I).group(1).replace('-', ' ').title()
    try:
        df.loc[i,'title'] = title
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 166
Success -- 251
Success -- 275
Success -- 285
Success -- 306
Success -- 459
Success -- 460
Success -- 462
Success -- 463
Success -- 469
Success -- 470
Success -- 471
Success -- 472
Success -- 514
Success -- 517
Success -- 521
Success -- 522
Success -- 523
Success -- 552
Success -- 556
Success -- 557
Success -- 559
Success -- 561
Success -- 563
Success -- 567
Success -- 619
Success -- 631
Success -- 639
Success -- 641
Success -- 642
Success -- 696
Success -- 710
Success -- 779
Success -- 780
Success -- 830
Success -- 831
Success -- 906
Success -- 908
Success -- 922
Success -- 924
Success -- 986
Success -- 999
Success -- 1012
Success -- 1046
Success -- 1112
Success -- 1136
Success -- 1143
Success -- 1164
Success -- 1174
Success -- 1261
Success -- 1262
Success -- 1296
Success -- 1349
Success -- 1455
Success -- 1539
Success -- 1540
Success -- 1586
Success -- 1588
Success -- 1596
Success -- 1599
Success -- 1609
Success -- 1757
Success -- 1842
Success -- 1848
Success -- 1849
Su

In [81]:
df.isna().sum()

poet_url          0
genre             0
poem_url          0
poet              0
title             0
year           1647
poem_lines        0
poem_string       0
dtype: int64

- **I'll drop the year column, as that didn't seem to be too successful.**

In [83]:
df.drop(columns='year', inplace=True)
df.isna().sum()

poet_url       0
genre          0
poem_url       0
poet           0
title          0
poem_lines     0
poem_string    0
dtype: int64

In [84]:
df.shape

(5176, 7)

### Save a copy

In [87]:
df.to_csv('data/poetry_foundation_raw_rescrape.csv')


- **I'll look at a breakdown of genres and see if there are any I should get rid of.**
- **My initial thoughts are to limit it in time period, so as to remove any language barriers, so to speak (between, say, Shakespearean English and modern English).**

In [88]:
df.genre.value_counts()

modern                            1279
victorian                          643
renaissance                        426
romantic                           398
imagist                            356
new_york_school                    264
black_mountain                     257
new_york_school_2nd_generation     192
language_poetry                    192
confessional                       176
black_arts_movement                165
georgian                           160
objectivist                        159
harlem_renaissance                 148
beat                               147
augustan                           114
fugitive                            90
middle_english                      10
Name: genre, dtype: int64

In [89]:
# check a sample Middle English poem
print(df[df.genre == 'middle_english'].iloc[0,-1])

Whan that Aprille with his shour
The droghte of March hath perc
And bath
Of which vertú engendr
Whan Zephirus eek with his swet
Inspir
The tendr
Hath in the Ram his half
And smal
That slepen al the nyght with open y
So priketh hem Natúre in hir corag
Thanne longen folk to goon on pilgrimag
And palmeres for to seken straung
To fern
And specially, from every shir
Of Eng
The hooly blisful martir for to sek
That hem hath holpen whan that they were seek

Bifil that in that seson on a day, 
In Southwerk at the Tabard as I lay, 
Redy to wenden on my pilgrymag
To Caunterbury with ful devout corag
At nyght were come into that hostelry
Wel nyne and twenty in a compaigny
Of sondry folk, by áventure y-fall
In felaweshipe, and pilgrimes were they all
That toward Caunterbury wolden ryd
The chambr
And wel we weren es
And shortly, whan the sonn
So hadde I spoken with hem everychon, 
That I was of hir felaweshipe anon, 
And mad
To take oure wey, ther as I yow devys

But nath
Er that I ferther in thi

- **Indeed, Middle English is definitely out.**

In [90]:
df = df[df.genre != 'middle_english']
df.shape

(5166, 7)

In [91]:
# check a sample Renaissance poem
print(df[df.genre == 'renaissance'].iloc[0,-1])

Long have I long’d to see my love againe,
   Still have I wisht, but never could obtaine it;
   Rather than all the world (if I might gaine it)
Would I desire my love’s sweet precious gaine.
Yet in my soule I see him everie day,
   See him, and see his still sterne countenaunce,
   But (ah) what is of long continuance,
Where majestie and beautie beares the sway?
Sometimes, when I imagine that I see him,
   (As love is full of foolish fantasies)
   Weening to kisse his lips, as my love’s fees,
I feele but aire: nothing but aire to bee him.
   Thus with Ixion, kisse I clouds in vaine:
   Thus with Ixion, feele I endles paine.





In [92]:
# check a sample Augustan poem
print(df[df.genre == 'augustan'].iloc[1,-1])

And auld Robin Forbes hes gien tem a dance,
I pat on my speckets to see them aw prance;
I thout o’ the days when I was but fifteen,
And skipp’d wi’ the best upon Forbes’s green.
Of aw things that is I think thout is meast queer,
It brings that that’s by-past and sets it down here;
I see Willy as plain as I dui this bit leace,
When he tuik his cwoat lappet and deeghted his feace.

The lasses aw wonder’d what Willy cud see
In yen that was dark and hard featur’d leyke me;
And they wonder’d ay mair when they talk’d o’ my wit,
And slily telt Willy that cudn’t be it:
But Willy he laugh’d, and he meade me his weyfe,
And whea was mair happy thro’ aw his lang leyfe?
It’s e’en my great comfort, now Willy is geane,
The he offen said— nae place was leyke his awn heame!

I mind when I carried my wark to yon steyle
Where Willy was deykin, the time to beguile,
He wad fling me a daisy to put i’ my breast,
And I hammer’d my noddle to mek out a jest.
But merry or grave, Willy often w

- **According to Poetry Foundation's website, Renaissance and Augustan poems are from the years 1500 - 1780, and the differences in the English are fairly clear.**
- **For now, I'll drop these.**

In [93]:
df_trim = df[df.genre != 'renaissance']
df_trim = df_trim[df_trim.genre != 'augustan']
df_trim.shape

(4626, 7)

In [94]:
# check a sample Victorian poem
print(df[df.genre == 'victorian'].iloc[1,-1])

I
The evening comes, the fields are still. 
The tinkle of the thirsty rill, 
Unheard all day, ascends again; 
Deserted is the half-mown plain, 
Silent the swaths! the ringing wain, 
The mower's cry, the dog's alarms, 
All housed within the sleeping farms! 
The business of the day is done, 
The last-left haymaker is gone. 
And from the thyme upon the height, 
And from the elder-blossom white 
And pale dog-roses in the hedge, 
And from the mint-plant in the sedge, 
In puffs of balm the night-air blows 
The perfume which the day forgoes. 
And on the pure horizon far, 
See, pulsing with the first-born star, 
The liquid sky above the hill! 
The evening comes, the fields are still. 

       Loitering and leaping, 
       With saunter, with bounds— 
       Flickering and circling 
       In files and in rounds— 
       Gaily their pine-staff green 
       Tossing in air, 
       Loose o'er their shoulders white 
       Showering their hair— 
       See! the wild Maenads 
       Break from the

In [95]:
# check a sample Romantic poem
print(df[df.genre == 'romantic'].iloc[1,-1])

Now in thy dazzling half-oped eye, 
Thy curled nose and lip awry, 
Uphoisted arms and noddling head, 
And little chin with crystal spread, 
Poor helpless thing! what do I see, 
That I should sing of thee? 

From thy poor tongue no accents come, 
Which can but rub thy toothless gum: 
Small understanding boasts thy face, 
Thy shapeless limbs nor step nor grace: 
A few short words thy feats may tell, 
And yet I love thee well. 

When wakes the sudden bitter shriek, 
And redder swells thy little cheek 
When rattled keys thy woes beguile, 
And through thine eyelids gleams the smile, 
Still for thy weakly self is spent 
Thy little silly plaint. 

But when thy friends are in distress. 
Thou’lt laugh and chuckle n’ertheless, 
Nor with kind sympathy be smitten, 
Though all are sad but thee and kitten; 
Yet puny varlet that thou art, 
Thou twitchest at the heart. 

Thy smooth round cheek so soft and warm; 
Thy pinky hand and dimpled arm; 
Thy silken locks that scantly peep, 
With gold tipped end

- **Romantic and Victorian poems are from 1781-1900, but the language seems fairly similar.**
- **Plus, these are some very formative genres for poetry in English. For now, I'll keep these.**

- **All other genres are from after 1900.**

In [96]:
# let's reindex
df_trim.reset_index(drop=True, inplace=True)

## Rescraping (again)
- **Look more closely at how the scraping went.**
- **Eventually, I'll want to create some new features, like number of lines and average line length.**
    - **Since I can't divide by zero, this is a good opportunity to look for any unsuccessful scrapes--those where 0 or too few lines were scraped.**
    - **NOTE: I'm checking if length of poem_lines is less than or equal to 1 because that yielded the desired results, whereas seeing if length equaled 0 did not.**

In [97]:
df_trim[df_trim['poem_lines'].map(lambda x: len(x)) <= 1]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
222,https://www.poetryfoundation.org/poets/henry-dumas,black_arts_movement,https://www.poetryfoundation.org/poems/53477/kef-21,Henry Dumas,Kef 21,"[First there was the earth in my mouth. It was there like a running stream, the July fever sweating the delirium of August, and the green buckling...","First there was the earth in my mouth. It was there like a running stream, the July fever sweating the delirium of August, and the green buckling ..."
428,https://www.poetryfoundation.org/poets/robert-duncan,black_mountain,https://www.poetryfoundation.org/poems/46316/a-poem-beginning-with-a-line-by-pindar,Robert Duncan,A Poem Beginning with a Line by Pindar,[I],I
703,https://www.poetryfoundation.org/poets/anne-sexton,confessional,https://www.poetryfoundation.org/poems/152252/o-ye-tongues,Anne Sexton,O Ye Tongues,[First Psalm],First Psalm
952,https://www.poetryfoundation.org/poets/wilfred-owen,georgian,https://www.poetryfoundation.org/poems/57369/the-send-off,Wilfred Owen,The Send-Off,[ ],
953,https://www.poetryfoundation.org/poets/wilfred-owen,georgian,https://www.poetryfoundation.org/poems/57347/smile-smile-smile,Wilfred Owen,"Smile, Smile, Smile","[Head to limp head, the sunk-eyed wounded scanned]","Head to limp head, the sunk-eyed wounded scanned"
1231,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poems/53772/spring-day-56d233626c49b,Amy Lowell,Spring Day,[<em> Bath</em>],<em> Bath</em>
1234,https://www.poetryfoundation.org/poets/amy-lowell,imagist,https://www.poetryfoundation.org/poems/53773/towns-in-colour,Amy Lowell,Towns in Colour,"[Red slippers in a shop-window, and outside in the street, flaws of grey, windy sleet!]","Red slippers in a shop-window, and outside in the street, flaws of grey, windy sleet!"
1389,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poems/54567/kora-in-hell-improvisations-xi,William Carlos Williams,Kora in Hell: Improvisations XI,[XI],XI
1603,https://www.poetryfoundation.org/poets/lyn-hejinian,language_poetry,https://www.poetryfoundation.org/poems/47892/my-life-a-name-trimmed-with-colored-ribbons,Lyn Hejinian,My Life: A name trimmed with colored ribbons,[A name trimmed],A name trimmed
1615,https://www.poetryfoundation.org/poets/fanny-howe,language_poetry,https://www.poetryfoundation.org/poems/46762/everythings-a-fake,Fanny Howe,Everything’s a Fake,"[Coyote scruff in canyons off Mulholland Drive. Fragrance of sage and rosemary, now it’s spring. At night the mockingbirds ring their warnings of ...","Coyote scruff in canyons off Mulholland Drive. Fragrance of sage and rosemary, now it’s spring. At night the mockingbirds ring their warnings of c..."


- **After building out some specific rescraping functions, I can replace the poem_lines and poem_string values.**

In [100]:
# rescrape poem based on index from above 
df_trim.loc[428,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[428,'poem_url'])[0])
df_trim.loc[428,'poem_string'] = PoemView_rescraper(df_trim.loc[428,'poem_url'])[1]

df_trim.loc[703,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[703,'poem_url'])[0])
df_trim.loc[703,'poem_string'] = PoemView_rescraper(df_trim.loc[703,'poem_url'])[1]

df_trim.loc[952,'poem_lines'] = str(poempara_rescraper(df_trim.loc[952,'poem_url'])[0])
df_trim.loc[952,'poem_string'] = poempara_rescraper(df_trim.loc[952,'poem_url'])[1]

df_trim.loc[953,'poem_lines'] = str(modified_regular_rescraper(df_trim.loc[953,'poem_url'])[0])
df_trim.loc[953,'poem_string'] = modified_regular_rescraper(df_trim.loc[953,'poem_url'])[1]

df_trim.loc[1231,'poem_lines'] = str(justify_rescraper(df_trim.loc[1231,'poem_url'])[0])
df_trim.loc[1231,'poem_string'] = justify_rescraper(df_trim.loc[1231,'poem_url'])[1]

df_trim.loc[1234,'poem_lines'] = str(justify_rescraper(df_trim.loc[1234,'poem_url'])[0])
df_trim.loc[1234,'poem_string'] = justify_rescraper(df_trim.loc[1234,'poem_url'])[1]

df_trim.loc[1389,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1389,'poem_url'])[0])
df_trim.loc[1389,'poem_string'] = PoemView_rescraper(df_trim.loc[1389,'poem_url'])[1]

df_trim.loc[1603,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1603,'poem_url'])[0])
df_trim.loc[1603,'poem_string'] = PoemView_rescraper(df_trim.loc[1603,'poem_url'])[1]

df_trim.loc[2514,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2514,'poem_url'])[0])
df_trim.loc[2514,'poem_string'] = PoemView_rescraper(df_trim.loc[2514,'poem_url'])[1]

df_trim.loc[2517,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[2517,'poem_url'])[0])
df_trim.loc[2517,'poem_string'] = PoemView_rescraper(df_trim.loc[2517,'poem_url'])[1]

df_trim.loc[3335,'poem_lines'] = str(ranged_rescraper(df_trim.loc[3335,'poem_url'])[0])
df_trim.loc[3335,'poem_string'] = ranged_rescraper(df_trim.loc[3335,'poem_url'])[1]

df_trim.loc[3418,'poem_lines'] = str(center_rescraper(df_trim.loc[3418,'poem_url'])[0])
df_trim.loc[3418,'poem_string'] = center_rescraper(df_trim.loc[3418,'poem_url'])[1]

df_trim.loc[3421,'poem_lines'] = str(justify_rescraper(df_trim.loc[3421,'poem_url'])[0])
df_trim.loc[3421,'poem_string'] = justify_rescraper(df_trim.loc[3421,'poem_url'])[1]

df_trim.loc[4217,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4217,'poem_url'])[0])
df_trim.loc[4217,'poem_string'] = poempara_rescraper(df_trim.loc[4217,'poem_url'])[1]

df_trim.loc[4611,'poem_lines'] = str(poempara_rescraper(df_trim.loc[4611,'poem_url'])[0])
df_trim.loc[4611,'poem_string'] = poempara_rescraper(df_trim.loc[4611,'poem_url'])[1]

In [104]:
# found some more...
df_trim.loc[1388,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1388,'poem_url'])[0])
df_trim.loc[1388,'poem_string'] = PoemView_rescraper(df_trim.loc[1388,'poem_url'])[1]

df_trim.loc[1390,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1390,'poem_url'])[0])
df_trim.loc[1390,'poem_string'] = PoemView_rescraper(df_trim.loc[1390,'poem_url'])[1]

df_trim.loc[1391,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1391,'poem_url'])[0])
df_trim.loc[1391,'poem_string'] = PoemView_rescraper(df_trim.loc[1391,'poem_url'])[1]

df_trim.loc[1392,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1392,'poem_url'])[0])
df_trim.loc[1392,'poem_string'] = PoemView_rescraper(df_trim.loc[1392,'poem_url'])[1]

In [106]:
# another one...
df_trim.loc[3399,'poem_lines'] = str(image_rescraper(df_trim.loc[3399,'poem_url'])[0])
df_trim.loc[3399,'poem_string'] = image_rescraper(df_trim.loc[3399,'poem_url'])[1]

- **Some scrapings contain only BeautifulSoup garbage, so I'll try to re-scrape those.**

In [108]:
# check if html tags are in the string
df_trim[df_trim.poem_string.str.contains('<div')]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
237,https://www.poetryfoundation.org/poets/nikki-giovanni,black_arts_movement,https://www.poetryfoundation.org/poems/90181/no-complaints,Nikki Giovanni,No Complaints,"[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">(For Gwendolyn Brooks, 1917—2001)</span></p><...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">(For Gwendolyn Brooks, 1917—2001)</span></p></..."
1687,https://www.poetryfoundation.org/poets/ron-silliman,language_poetry,https://www.poetryfoundation.org/poems/55563/you-part-i,Ron Silliman,"You, part I","[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</di...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</div>\n"
1688,https://www.poetryfoundation.org/poets/ron-silliman,language_poetry,https://www.poetryfoundation.org/poems/55564/you-part-xii,Ron Silliman,"You, part XII","[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</di...","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><p><span style=""font-style:normal"">for Pat Silliman</span></p></div>\n</p>\n</div>\n"
4260,https://www.poetryfoundation.org/poets/emma-lazarus,victorian,https://www.poetryfoundation.org/poems/46791/by-the-waters-of-babylon,Emma Lazarus,By the Waters of Babylon,"[, <div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><div align=""center"">Little Poems in Prose</div></div>\n</p>\n</div>, ]","\n<div class=""c-epigraph"">\n<p>\n<div style=""font-style:italic;""><div align=""center"">Little Poems in Prose</div></div>\n</p>\n</div>\n"


In [159]:
# rescrape poem based on index from above 
df_trim.loc[237,'poem_lines'] = str(PoemView_rescraper_2(df_trim.loc[237,'poem_url'])[0])
df_trim.loc[237,'poem_string'] = PoemView_rescraper_2(df_trim.loc[237,'poem_url'])[1]

df_trim.loc[1687,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1687,'poem_url'])[0])
df_trim.loc[1687,'poem_string'] = PoemView_rescraper(df_trim.loc[1687,'poem_url'])[1]

df_trim.loc[1688,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[1688,'poem_url'])[0])
df_trim.loc[1688,'poem_string'] = PoemView_rescraper(df_trim.loc[1688,'poem_url'])[1]

df_trim.loc[4260,'poem_lines'] = str(PoemView_rescraper(df_trim.loc[4260,'poem_url'])[0])
df_trim.loc[4260,'poem_string'] = PoemView_rescraper(df_trim.loc[4260,'poem_url'])[1]

In [160]:
# re-run the destringify function
df_trim['poem_lines'] = df_trim['poem_lines'].apply(destringify)

- **Re-check for any missing poem_lines values that aren't NaNs.**

In [165]:
df_trim[df_trim['poem_lines'].map(lambda d: len(d)) == 0]

Unnamed: 0,poet_url,genre,poem_url,poet,title,poem_lines,poem_string
783,https://www.poetryfoundation.org/poets/randall-jarrell,fugitive,https://www.poetryfoundation.org/poetrymagazine/poems/25237/goodbye-wendover-goodbye-mountain-home,Randall Jarrell,Goodbye Wendover Goodbye Mountain Home,[],
1326,https://www.poetryfoundation.org/poets/ezra-pound,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/13071/dogmatic-statement-concerning-the-game-of-chess-theme-for-a-series-of-pictures,Ezra Pound,Dogmatic Statement Concerning The Game Of Chess Theme For A Series Of Pictures,[],
1433,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/20226/a-foot-note,William Carlos Williams,A Foot Note,[],
1438,https://www.poetryfoundation.org/poets/william-carlos-williams,imagist,https://www.poetryfoundation.org/poetrymagazine/poems/24855/paterson-book-ii,William Carlos Williams,Paterson Book Ii,[],
1736,https://www.poetryfoundation.org/poets/w-h-auden,modern,https://www.poetryfoundation.org/poetrymagazine/poems/22702/poem-he-watched-with-all-his,W. H. Auden,Poem He Watched With All His,[],
1738,https://www.poetryfoundation.org/poets/w-h-auden,modern,https://www.poetryfoundation.org/poetrymagazine/poems/21500/poem-o-who-can-ever-praise-enough-the-price,W. H. Auden,Poem O Who Can Ever Praise Enough The Price,[],
1775,https://www.poetryfoundation.org/poets/louise-bogan,modern,https://www.poetryfoundation.org/poetrymagazine/poems/21807/untitled-tender-and-insolent,Louise Bogan,Untitled Tender And Insolent,[],
1826,https://www.poetryfoundation.org/poets/hart-crane,modern,https://www.poetryfoundation.org/poetrymagazine/poems/17345/at-melvilles-tomb,Hart Crane,At Melvilles Tomb,[],
2056,https://www.poetryfoundation.org/poets/a-m-klein,modern,https://www.poetryfoundation.org/poetrymagazine/poems/23448/come-two-like-shadows,A. M. Klein,Come Two Like Shadows,[],
2582,https://www.poetryfoundation.org/poets/wallace-stevens,modern,https://www.poetryfoundation.org/poetrymagazine/poems/19837/good-man-bad-woman,Wallace Stevens,Good Man Bad Woman,[],


In [169]:
# create a list of indices
lookups6 = list(df_trim[df_trim['poem_lines'].map(lambda d: len(d)) == 0].index)
lookups6

[783,
 1326,
 1433,
 1438,
 1736,
 1738,
 1775,
 1826,
 2056,
 2582,
 2685,
 2790,
 2817,
 3191]

In [174]:
%%time

# iterate over the list, attempting to re-scrape the lines and string
# NOTE: I reworked the image_rescraper_poet function from earlier, so I'm running that again
for i in lookups6:
    try:
        info = image_rescraper_title(df_trim.loc[i, 'poem_url'], df_trim.loc[i, 'title'])
        df_trim.loc[i,'poem_lines'] = str(info[0])
        df_trim.loc[i,'poem_string'] = info[1]
        print(f'Success -- {i}')
    except:
        print(f'Failure -- {i}')
        continue

Success -- 783
Success -- 1326
Success -- 1433
Success -- 1438
Success -- 1736
Success -- 1738
Success -- 1775
Success -- 1826
Success -- 2056
Success -- 2582
Success -- 2685
Success -- 2790
Success -- 2817
Failure -- 3191
CPU times: user 1.58 s, sys: 214 ms, total: 1.79 s
Wall time: 51.6 s


In [177]:
# one final one to redo
df_trim.loc[3191,'title'] = 'Radio'
info = image_rescraper_title(df_trim.loc[3191, 'poem_url'], df_trim.loc[3191, 'title'])
df_trim.loc[3191,'poem_lines'] = str(info[0])
df_trim.loc[3191,'poem_string'] = info[1]

In [181]:
# re-run destringify
df_trim['poem_lines'] = df_trim['poem_lines'].apply(destringify)

## 💾 SAVE IT!

In [182]:
df_trim.to_csv('data/poetry_foundation_raw_rescrape.csv')

## Next notebook: [NLP, Feature Engineering, and EDA](03_nlp_features_eda.ipynb)

[[go back to the top](#Data-Cleaning)]

- The next notebook includes natural language processing, engineering of features, exploring data, and analyzing data.
⏰