# <ins>Predicting Poetic Movements</ins>

# Webscraping

### Using Selenium and BeautifulSoup

- This notebook details my process for scraping genre-labeled poetry from [PoetryFoundation.org](https://www.poetryfoundation.org/).

#### Important note
Due to the imperfection and idiosyncracies of scraping text from images, a lot of rescraping was necessary, sometimes in a manner that is best described, rather unfortunately, as nonprogrammatic. As a result, this notebook is extremely messy, which is not a reflection on the other notebooks for this project.

Thank you for understanding :)

## Table of contents

1. [Import necessary packages](#Import-necessary-packages)
2. [Initial scrape](#Initial-scrape)

    - [Text poems](#Text-poems)
    - [Scanned poems](#Scanned-poems)


3. [Rescrape](#Rescrape)

    - [Combine DataFrames](#Combine-DataFrames)
    
    
4. [Next notebook: Data Cleaning](#Next-notebook:-Data-Cleaning)
    
## Import necessary packages

[[go back to the top](#Predicting-Poetic-Movements)]

In [170]:
# custom functions for webscraping
from functions_webscraping import *

# standard dataframe packages
import pandas as pd
import numpy as np

# timekeeping/progress packages
import time
from tqdm import tqdm

# reload functions/libraries when edited
%load_ext autoreload
%autoreload 2

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# increase column width of dataframe
pd.set_option('max_colwidth', 150)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Initial scrape

[[go back to the top](#Predicting-Poetic-Movements)]

- Load URL codes for each genre.
- Create dictionary of URLs to poets' pages within each genre.
    - *NOTE: The function for this process uses Selenium, which will open dummy browser windows, so to speak.*
- Scrape URLs to poems' pages within each genre, separating into two groups:
    - Poems known to be in text format on the site.
    - Poem suspected to be within scanned images.
- Attempt to scrape each variety of poem.

In [536]:
# dictionary of genre codes found in poetryfoundation.org urls
genre_codes = load_genre_codes()
genre_codes

{'augustan': 149,
 'beat': 150,
 'black_arts_movement': 304,
 'black_mountain': 151,
 'confessional': 152,
 'fugitive': 153,
 'georgian': 154,
 'harlem_renaissance': 155,
 'imagist': 156,
 'language_poetry': 157,
 'middle_english': 158,
 'modern': 159,
 'new_york_school': 160,
 'new_york_school_2nd_generation': 161,
 'objectivist': 162,
 'renaissance': 163,
 'romantic': 164,
 'victorian': 165}

- Run function in a loop to create dictionary of poet urls.

In [None]:
# dictionary creation using custom function
poet_urls = {genre: poet_urls_by_genre(genre_code, 3) for genre, genre_code in genre_codes.items()}

# check a genre
poet_urls['augustan']

- Selenium can be finicky, so the loop only partially worked.
- I'll re-run sections in which some URLs are missing.

In [196]:
# re-run on genre
poet_urls['black_arts_movement'] = poet_urls_by_genre(genre_codes['black_arts_movement'])

In [198]:
# re-run on genre
poet_urls['modern'] = poet_urls_by_genre(genre_codes['modern'])

In [200]:
# re-run on genre
poet_urls['renaissance'] = poet_urls_by_genre(genre_codes['renaissance'])

In [203]:
# re-run on genre
poet_urls['romantic'] = poet_urls_by_genre(genre_codes['romantic'])

In [206]:
# re-run on genre
poet_urls['victorian'] = poet_urls_by_genre(genre_codes['victorian'])

In [207]:
# confirm all urls have been grabbed
url_lens = {k:len(v) for k,v in poet_urls.items()}
url_lens

{'augustan': 23,
 'beat': 13,
 'black_arts_movement': 23,
 'black_mountain': 10,
 'confessional': 7,
 'fugitive': 7,
 'georgian': 22,
 'harlem_renaissance': 17,
 'imagist': 6,
 'language_poetry': 18,
 'middle_english': 3,
 'modern': 54,
 'new_york_school': 9,
 'new_york_school_2nd_generation': 16,
 'objectivist': 5,
 'renaissance': 41,
 'romantic': 51,
 'victorian': 55}

- Ezra Pound and Richard Aldington both appear in two genres: Imagist and Modern.
- Since Modern has so many poets within it, and Imagist so few, I'll give them to the Imagists.

In [541]:
# remove urls that appear in two genres
poet_urls['modern'] = [url for url in poet_urls['modern'] \
                        if url not in \
                        ['https://www.poetryfoundation.org/poets/richard-aldington', 
                         'https://www.poetryfoundation.org/poets/ezra-pound']]

In [544]:
# confirm drop
url_lens = {k:len(v) for k,v in poet_urls.items()}
url_lens['modern']

52

#### 💾 Save/Load poet URLs dictionary

In [545]:
# # uncomment to save
# with gzip.open('data/poet_url_dict.pkl', 'wb') as goodbye:
#     pickle.dump(poet_urls, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# # uncomment to load
# with gzip.open('data/poet_url_dict.pkl', 'rb') as hello:
#     poet_url_dict = pickle.load(hello)

- Scrape poem URLs.

⏰ *NOTE*: Next cell takes about 10 minutes to run.

In [7]:
%%time

# loop over keys and values of dictionary
for genre, poet_urls in poet_url_dict.items():
    # scrape poem urls (text and scan poems) for each poet in each genre
    # now each poet's url with be a key and
    # the value will be a tuple of their text poems' urls and their scan poems' urls 
    poet_url_dict[genre] = [{poet_url: poem_url_scraper(poet_url)} for poet_url in poet_urls]

CPU times: user 41.1 s, sys: 699 ms, total: 41.8 s
Wall time: 9min 32s


- Simplify the structure of the dictionary.

In [8]:
# instantiate dictionaries of text and scan urls in each genre
poem_url_dict = {genre:{'text_urls':[],'scan_urls':[]} for genre in poet_url_dict}

# fill in empty lists with each type of url
for genre, poets in poet_url_dict.items():
    for poet in poets:
        for poet_url, poems in poet.items():
            poem_url_dict[genre]['text_urls'].extend(poems[0])
            poem_url_dict[genre]['scan_urls'].extend(poems[1])

In [12]:
#-------DATA STRUCTURE--------#
#
# genre ==> 'text_urls' ==> list of urls known to be text-based
#    \
#     ==> 'scan_urls' ==> list of urls thought to be scanned images

#### 💾 Save/Load poem URLs dictionary

In [9]:
# # uncomment to save
# with gzip.open('data/poem_url_dict.pkl', 'wb') as goodbye:
#     pickle.dump(poem_url_dict, goodbye, protocol=pickle.HIGHEST_PROTOCOL)

# # uncomment to load
# with gzip.open('data/poem_url_dict.pkl', 'rb') as hello:
#     poem_url_dict = pickle.load(hello)

In [10]:
# confirm everything's there
poem_url_dict.keys()

dict_keys(['augustan', 'beat', 'black_arts_movement', 'black_mountain', 'confessional', 'fugitive', 'georgian', 'harlem_renaissance', 'imagist', 'language_poetry', 'middle_english', 'modern', 'new_york_school', 'new_york_school_2nd_generation', 'objectivist', 'renaissance', 'romantic', 'victorian'])

### Text poems

[[go back to the top](#Predicting-Poetic-Movements)]

- Scrape poems I know are already in text format.

⏰ *NOTE*: Next cell takes about an hour and a half to run.

In [17]:
%%time

# instantiate list for dictionaries
text_poems = []

# loop through text_urls of each genre
for genre in tqdm(poem_url_dict.keys()):
    for text_url in poem_url_dict[genre]['text_urls']:
        
        # scrape poem, title, and poet
        poem = text_poem_scraper(text_url)
        
        # add genre and url
        poem['genre'] = genre
        poem['poem_url'] = text_url
        
        # add to big list
        text_poems.append(poem)
        
        # timeout so I don't get flagged
        time.sleep(0.01)

100%|██████████| 18/18 [1:26:33<00:00, 288.51s/it]

CPU times: user 5min 46s, sys: 38.3 s, total: 6min 25s
Wall time: 1h 26min 33s





In [18]:
# convert to dataframe
text_poems_df = pd.DataFrame(text_poems)
text_poems_df.shape

(3261, 6)

In [19]:
# check for bad scrapes
text_poems_df[text_poems_df.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
154,Allen Ginsberg,https://www.poetryfoundation.org/poems/47660/a-supermarket-in-california,A Supermarket in California,[],,beat
166,Bob Kaufman,https://www.poetryfoundation.org/poems/55713/a-terror-is-more-certain-,A Terror is More Certain . . .,[],,beat
210,Lawrence Ferlinghetti,https://www.poetryfoundation.org/poetrymagazine/poems/58150/beatitudes-visuales-mexicanas,Beatitudes Visuales Mexicanas,[],,beat
268,Henry Dumas,https://www.poetryfoundation.org/poems/53477/kef-21,Kef 21,[],,black_arts_movement
288,Nikki Giovanni,https://www.poetryfoundation.org/poems/90181/no-complaints,No Complaints,[],,black_arts_movement
290,Nikki Giovanni,https://www.poetryfoundation.org/poems/90180/rosa-parks,Rosa Parks,[],,black_arts_movement
298,Etheridge Knight,https://www.poetryfoundation.org/poems/51371/a-fable-56d22f0fa5920,A Fable,[],,black_arts_movement
401,Robert Duncan,https://www.poetryfoundation.org/poems/46316/a-poem-beginning-with-a-line-by-pindar,A Poem Beginning with a Line by Pindar,[],,black_mountain
505,Anne Sexton,https://www.poetryfoundation.org/poems/152252/o-ye-tongues,O Ye Tongues,[],,confessional
683,W. E. B. Du Bois,https://www.poetryfoundation.org/poems/43026/my-country-tis-of-thee,My Country ’Tis of Thee,[],,harlem_renaissance


In [20]:
# use custom function to rescrape poems that were in a slightly different format
# NOTE: overwrites the rows shown above
for index in text_poems_df[text_poems_df.poem_string == ''].index:
    try:
        text_poems_df.loc[index,'poem_lines'] = PoemView_rescraper(text_poems_df.loc[index,'poem_url'])[0]
        text_poems_df.loc[index,'poem_string'] = PoemView_rescraper(text_poems_df.loc[index,'poem_url'])[1]
    except:
        print(index)

In [21]:
# confirm it worked
text_poems_df[text_poems_df.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
1388,Dylan Thomas,https://www.poetryfoundation.org/poems/26804/poem-on-his-birthday-facs-drafts,Poem on His Birthday [Facs. drafts],[],,modern
1540,Barbara Guest,https://www.poetryfoundation.org/poems/49367/imagined-room,Imagined Room,[],,new_york_school


- The remaining two poems are blank on the website, so I'll just get rid of those.

In [548]:
# remove empty poems
text_poems_df = text_poems_df[text_poems_df.poem_string != '']
text_poems_df.shape

(3259, 6)

#### 💾 Save/Load text poems DataFrame

In [556]:
# # uncomment to save
# text_poems_df.to_csv('data/text_poems_df.csv')

# # uncomment to load
# text_poems_df = pd.read_csv('data/text_poems_df.csv', index_col=0)

### Scanned poems

[[go back to the top](#Predicting-Poetic-Movements)]

- Scrape poems I suspect are in image format (magazine page scans) because of their URLs, though some may be already in text format.

#### Important note
The cell immediately below is what I originally ran. The cell below that (marked by stars) is what I would run if I did it over again, as that would scrape poems if they were indeed in text format and expedite some of the later process.

This would also require some additional work such as converting these text poems to a dataframe and saving separately.

⏰ *NOTE*: Next cell takes about three and a half hours to run.

In [203]:
%%time

# instantiate list for dictionaries
scan_poem_dicts = []

# instantiate list of urls that need to be rescraped
need_to_rescrape = []

# loop through scan_urls of each genre
for genre in tqdm(poem_url_dict.keys()):
    for scan_url in poem_url_dict[genre]['scan_urls']:
        try:
            # attempt to scrape poem, title, and poet
            poem = scan_poem_scraper(scan_url)
            
            # add genre and url
            poem['genre'] = genre
            poem['poem_url'] = scan_url
            
            # add to big list
            scan_poem_dicts.append(poem)
            
        except:
            # add to list of rescrape urls if an error occurs
            need_to_rescrape.append(scan_url)

100%|██████████| 18/18 [3:25:25<00:00, 684.77s/it]   

CPU times: user 7min 49s, sys: 5min 10s, total: 12min 59s
Wall time: 3h 25min 25s





In [204]:
# check the numbers
len(scan_poem_dicts), len(need_to_rescrape)

(1775, 161)

⏰ *NOTE*: Next cell takes about three and a half hours to run.

In [None]:
# ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐ #

%%time

# instantiate list for dictionaries
more_text_poem_dicts = []

# instantiate list for dictionaries
scan_poem_dicts = []

# instantiate list of urls that need to be rescraped
need_to_rescrape = []

# loop through scan_urls of each genre
for genre in tqdm(poem_url_dict.keys()):
    for scan_url in poem_url_dict[genre]['scan_urls']:
        try:
            # attempt to scrape poem, title, and poet as text format
            poem = text_poem_scraper(text_url)

            # add genre and url
            poem['genre'] = genre
            poem['poem_url'] = text_url

            # add to big list
            more_text_poem_dicts.append(poem)
            
        except:
            # if not text format (or error)
            try:
                # attempt to scrape poem, title, and poet as image format
                poem = scan_poem_scraper(scan_url)

                # add genre and url
                poem['genre'] = genre
                poem['poem_url'] = scan_url

                # add to big list
                scan_poem_dicts.append(poem)

            except:
                # add to list of rescrape urls if an error occurs
                need_to_rescrape.append(scan_url)

In [205]:
# convert to dataframe
scan_poems_df = pd.DataFrame(scan_poem_dicts)
scan_poems_df.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Richard Brautigan,https://www.poetryfoundation.org/poetrymagazine/poems/31338/wood,Wood,"[We age in darkness like wood, and watch our phantoms change, eir clothes, of shingles and boards, for a purpose that can only be, described as wo...",We age in darkness like wood\nand watch our phantoms change\neir clothes\nof shingles and boards\nfor a purpose that can only be\ndescribed as wood.,beat
1,William Everson,https://www.poetryfoundation.org/poetrymagazine/poems/21676/dust-and-the-glory,Dust And The Glory,"[On a low Lorrainian knoll a leaning peasant sinking a pit, Meets rotted rock and a slab., The slab cracks and is split, the old grave opened,, Hi...","On a low Lorrainian knoll a leaning peasant sinking a pit\nMeets rotted rock and a slab.\nThe slab cracks and is split, the old grave opened,\nHis...",beat
2,William Everson,https://www.poetryfoundation.org/poetrymagazine/poems/21675/we-in-the-fields,We In The Fields,"[Dawn and a high film, the sun burned it,, But noon had a thick sheet, and the clouds coming,, The low rain-bringers, trooping in from the north,,...","Dawn and a high film, the sun burned it,\nBut noon had a thick sheet, and the clouds coming,\nThe low rain-bringers, trooping in from the north,\n...",beat
3,Allen Ginsberg,https://www.poetryfoundation.org/poetrymagazine/poems/36505/written-in-my-dream-by-w-c-williams,Written In My Dream By W C Williams,"[“As Is, you're bearing, a common, Truth, Commonly known, as desire, No need, to dress, it up, as beauty, No need, to distort, what’s not, standar...",“As Is\nyou're bearing\na common\nTruth\nCommonly known\nas desire\nNo need\nto dress\nit up\nas beauty\nNo need\nto distort\nwhat’s not\nstandard...,beat
4,Jack Hirschman,https://www.poetryfoundation.org/poetrymagazine/poems/30162/the-baseball-poem,The Baseball Poem,"[A wrist (to repeat, with a shift, of ac-, cent, mood, of emphasis, attentive to) now, needed, The wrist I lost, hold of, of, what was most, loved...","A wrist (to repeat\nwith a shift\nof ac-\ncent, mood, of emphasis\nattentive to) now\nneeded\nThe wrist I lost\nhold of, of\nwhat was most\nloved ...",beat


#### 💾 Save/Load text poems DataFrame

In [522]:
# # uncomment to save
# scan_poems_df.to_csv('data/scan_poems_df.csv')

# # uncomment to load
# scan_poems_df = pd.read_csv('data/scan_poems_df.csv', index_col=0)

## Rescrape

[[go back to the top](#Predicting-Poetic-Movements)]

- So here is the messy bit. In the interest of my sanity, I will refrain from commenting on this code for the time being.
- Click [here](#Load-pre-cleaned-DataFrame) to skip to the post-rescraped product.

#### Part 1

In [263]:
scan_poem_df[scan_poem_df.poem_string == '']

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
6,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke,[],,beat
23,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,[],,beat
606,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/27969/some-simple-measures-in-the-american-idiom-and-the-variable-foot,Some Simple Measures In The American Idiom And The Variable Foot,[],,imagist
723,Guillaume Apollinaire,https://www.poetryfoundation.org/poetrymagazine/poems/25655/toward-the-south-tr-by-harry-duncan,Toward The South Tr By Harry Duncan,[],,modern
775,Malcolm Cowley,https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968,A Countryside 1918 1968,[],,modern
778,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19926/the-urn-enrich-my-resignation,The Urn Enrich My Resignation,[],,modern
779,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio,The Urn Purgatorio,[],,modern
780,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn Reply,[],,modern
782,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian,The Urn The Sad Indian,[],,modern
1170,Stephen Spender,https://www.poetryfoundation.org/poetrymagazine/poems/22310/poem-after-the-wrestling,Poem After The Wrestling,[],,modern


In [71]:
rescrapes = []

In [72]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke'
rescrape = scan_poem_scraper(url, input_poet='Michael McClure', input_title='2 For Theodore Roethke: Premonition')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
rescrapes.append(rescrape)

In [73]:
url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=87&issue=4&page=28'
rescrape = scan_poem_scraper(url, input_poet='Michael McClure', input_title='2 For Theodore Roethke: 2')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
rescrapes.append(rescrape)

In [74]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968'
rescrape = scan_poem_scraper(url, input_poet='Malcolm Cowley', input_title='A Countryside 1918 1968: Boy in Sunlight')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [75]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/27969/some-simple-measures-in-the-american-idiom-and-the-variable-foot'
rescrape = scan_poem_scraper(url, 
                             input_poet='William Carlos Williams',
                             input_title='Some Simple Measures In The American Idiom And The Variable Foot',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!COMMENT).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
rescrapes.append(rescrape)

In [76]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes'
rescrape = scan_poem_scraper(url, 
                             input_poet='Kenneth Patchen',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!comment).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
rescrapes.append(rescrape)

In [79]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19915/the-urn-reliquary'
rescrape = scan_poem_scraper(url, input_poet='Hart Crane')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [81]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=2'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: Purgatorio')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [83]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=6'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: The Sad Indian')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [85]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=7'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: Reply')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [87]:
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=10'
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply'
rescrape = scan_poem_scraper(actual_url, input_poet='Hart Crane', input_title='The Urn: Enrich My Resignation')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrapes.append(rescrape)

In [94]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/31123/places-for-oscar-salvador'
rescrape = scan_poem_scraper(url, 
                             input_poet="Frank O'Hara",
                             input_title='Places for Oscar Salvador',
                             first_pattern='.*((?:\r?\n.*)*)',
                             next_pattern='\n((?:\r?\n(?!SUDDEN SNOW).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school'
rescrapes.append(rescrape)

In [523]:
rescrapes_pt1 = pd.DataFrame(rescrapes)
rescrapes_pt1

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/26838/2-for-theodore-roethke,2 For Theodore Roethke: Premonition,"[My bones ascend by arsenics of sight., Where noise is all the sound there is to hear,, Beginning in the heart I work towards light., My toes are ...","My bones ascend by arsenics of sight.\nWhere noise is all the sound there is to hear,\nBeginning in the heart I work towards light.\nMy toes are c...",beat
1,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/browse?volume=87&issue=4&page=28,2 For Theodore Roethke: 2,"[This copse is earth’s cockade, this corpse my drum, To beat upon and play the mole a dance;, These hands are my defeat, these eyes my thumb., Opp...","This copse is earth’s cockade, this corpse my drum\nTo beat upon and play the mole a dance;\nThese hands are my defeat, these eyes my thumb.\nOppo...",beat
2,Malcolm Cowley,https://www.poetryfoundation.org/poetrymagazine/poems/30954/a-countryside-1918-1968,A Countryside 1918 1968: Boy in Sunlight,"[The boy having fished alone, down Empfield Run from where it started on stony ground,, in oak and chestnut timber,, then crossed the Nicktown Roa...","The boy having fished alone\ndown Empfield Run from where it started on stony ground,\nin oak and chestnut timber,\nthen crossed the Nicktown Road...",modern
3,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/27969/some-simple-measures-in-the-american-idiom-and-the-variable-foot,Some Simple Measures In The American Idiom And The Variable Foot,"[EXERCISE IN TIMING, Oh, the sumac died, it’s, the first time, I, noticed it, HISTOLOGY, There is, the, microscopic, anatomy, of, the whale, this ...",EXERCISE IN TIMING\nOh\nthe sumac died\nit’s\nthe first time\nI\nnoticed it\nHISTOLOGY\nThere is\nthe\nmicroscopic\nanatomy\nof\nthe whale\nthis i...,imagist
4,Kenneth Patchen,https://www.poetryfoundation.org/poetrymagazine/poems/27128/poemscapes,Poemscapes,"[XVI, No sooner had the clowns got a new house built,, a worse wind than the first blew it down. And it also, re-blew down the old house which the...","XVI\nNo sooner had the clowns got a new house built,\na worse wind than the first blew it down. And it also\nre-blew down the old house which they...",beat
5,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19915/the-urn-reliquary,The Urn Reliquary,"[ENDERNESS and resolution!, What is our life without a sudden pillow,, What is death without a ditch?, The harvest laugh of bright Apollo, And the...","ENDERNESS and resolution!\nWhat is our life without a sudden pillow,\nWhat is death without a ditch?\nThe harvest laugh of bright Apollo\nAnd the ...",modern
6,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19916/the-urn-purgatorio,The Urn: Purgatorio,"[My country, O my land, my friends—, Am I apart—here from you in a land, Where all your gas-lights, faces, sputum gleam, Like something left, fors...","My country, O my land, my friends—\nAm I apart—here from you in a land\nWhere all your gas-lights, faces, sputum gleam\nLike something left, forsa...",modern
7,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19920/the-urn-the-sad-indian,The Urn: The Sad Indian,"[Sad heart, the gymnast of inertia, does not count, Hours, days—and scarcely sun and moon., The warp is in his woof, and his keen vision, Spells w...","Sad heart, the gymnast of inertia, does not count\nHours, days—and scarcely sun and moon.\nThe warp is in his woof, and his keen vision\nSpells wh...",modern
8,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn: Reply,"[Thou canst read nothing except through appetite,, And here we join eyes in that sanctity, Where brother passes brother without sight,, But finall...","Thou canst read nothing except through appetite,\nAnd here we join eyes in that sanctity\nWhere brother passes brother without sight,\nBut finally...",modern
9,Hart Crane,https://www.poetryfoundation.org/poetrymagazine/poems/19922/the-urn-reply,The Urn: Enrich My Resignation,"[Enrich my resignation as I usurp those far, Feints of control, hear rifles blown out on the stag, Below the aeroplane, and see the fox’s brush, W...","Enrich my resignation as I usurp those far\nFeints of control, hear rifles blown out on the stag\nBelow the aeroplane, and see the fox’s brush\nWh...",modern


In [265]:
rescrapes_pt1.to_csv('data/temp_rescrapes_pt1.csv')

#### Part 2

In [264]:
need_to_rescrape

['https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29577/valery-as-dictator',
 'https://www.poetryfoundation.org/poetrymagazine/poems/146231/haiku-and-tanka-for-harriet-tubman',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30270/ritual-ix',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you',
 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30550/the-sundering-up-tracks',
 'https://www.poetryfoundation.org/poetrymagazine/poems/30551/the-first-note',
 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law',
 'https://www.poetryfoundation.org/poetrym

In [266]:
error_rescrapes = []
still_errors = need_to_rescrape.copy()

In [299]:
%%time

for url in tqdm(still_errors):
    try:
        rescrape = text_poem_scraper(url)
        error_rescrapes.append(rescrape)
        still_errors.remove(url)
    except:
        continue

 72%|███████▏  | 106/148 [07:49<03:05,  4.43s/it] 

CPU times: user 55.4 s, sys: 1min 40s, total: 2min 35s
Wall time: 7min 49s





In [270]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=18'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: We Shall Be Free')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [271]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=3&page=17'
rescrape = scan_poem_scraper(actual_url, input_poet='Michael McClure', input_title='Mad Sonnet: When Spirit Has No Edge')
rescrape['poem_url'] = url
rescrape['genre'] = 'beat'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [272]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=5&page=6'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Enough: Left After That')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [273]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=104&issue=3&page=19'
rescrape = scan_poem_scraper(actual_url, input_poet='Robert Creeley', input_title='Walking: In My Head')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [274]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=13&issue=6&page=13'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Epitaph')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [275]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28310/elaine'
rescrape = scan_poem_scraper(url, input_poet='William Carlos Williams', input_title='Elainb')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [276]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28312/emily'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=3'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Emily')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [277]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28311/erica'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=95&issue=6&page=2'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Erica')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [278]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/18899/poem-as-the-cat'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=36&issue=4&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='William Carlos Williams', input_title='Poem: As the cat')
rescrape['poem_url'] = url
rescrape['genre'] = 'imagist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [279]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/13202/from-discordants-iv'
rescrape = scan_poem_scraper(url, input_poet='Conrad Aiken', input_title='Discordants')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [280]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/13202/from-discordants-iv'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=6&issue=6&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='Conrad Aiken', input_title='Discordants IV')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)

In [283]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25512/jacks-white-horseup'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="jack's white")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [290]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25263/imc-a-tmo'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title='E. E. Cummings')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrape['title'] = 'Untitled [5]'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [298]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29577/valery-as-dictator'
rescrape = text_poem_scraper(url)
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [308]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30270/ritual-ix'
rescrape = scan_poem_scraper(url, input_poet='Paul Blackburn', input_title='Ritual IX: Gathering Winter Fuel')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [305]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=108&issue=1&page=42',
                  input_poet='Paul Blackburn', input_title='the same barrels')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines

['the same barrels',
 '& cans & older men in long',
 'overcoats from the mission,',
 '& here, the scene unabated, 20-odd years later',
 'the fruit & vegetable market, First Ave. & Ninth, using',
 'wood from crates',
 'New Jersey, Delaware, Cali-',
 'for-ni-yay,',
 'Florida, New Mexico, Georgia, Louisiana, Texas, all',
 'e same fire, how',
 'reunite the South & North, the West & East',
 'IN SUNLIGHT YOU NEVER SEE Ir, ry just walking by &',
 'feel the warmth e.',
 'Fire in a barrel, burning',
 'the hands,',
 'the hands / the italian',
 'bakery next door is still discreet,',
 'but the kosher butcher shop next to',
 'that comes out for a word or two, the',
 'gesture',
 'palms stiff out at arms’ length, passing',
 'the time of day, their magic hands',
 'reddened & liverspotted maybe,',
 'no peyis or beard, sti',
 'here at First Ave. & Ninth St. it’s',
 'the jews uniting the world, the country, the city,',
 'mankind down geological time perhaps,',
 'to keep their hands warm']

In [312]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30550/the-sundering-up-tracks'
rescrape = scan_poem_scraper(url, 
                             input_poet='Edward Dorn', 
                             input_title='The Sundering U.P. Tracks: The End of the North Atlantic Turbine Poem')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [310]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=6&page=8',
                  input_poet='Edward Dorn', input_title='"Compared to the majestic legal thievery')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines

['"Compared to the majestic legal thievery',
 'of Commodore Vanderbilt men like Jay G Gould',
 'and Jim Fisk were second-story workers .',
 'Each side of the shining double knife',
 'from Chicago to Fri',
 'to Denver, the Cheyenne cutoff',
 'the Right of Way they called it',
 'and still it runs that way',
 'right through the heart',
 'the Union Pacific rails run also to Portland.',
 'Even through the heart of the blue beech',
 'hard as it is.',
 'each hamlet',
 'the winter sanctuar',
 'of the rare Jailbird',
 'and the Ishmaelite',
 'the esoteric summer firebombs',
 'of Chicago',
 'the same scar tissue',
 'I saw in Pocatello',
 'made',
 'by the rapacious geo-economic',
 'surgery of Harriman, the old isolator',
 'that ambassador-at-large',
 'You talk of color?',
 'Ob cosmological america, how well',
 'and with what geometry',
 'you teach your citizens']

In [315]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30551/the-first-note'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=6&page=9'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Edward Dorn', 
                             input_title='The First Note: From London')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [320]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30225/song-i-wouldnt-embarrass-you'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=107&issue=5&page=46'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Creeley', 
                             input_title='Song')
rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [330]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
rescrape = scan_poem_scraper(url, 
                             input_poet='Robert Duncan', 
                             input_title='The Law: A Series in Variation')

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [331]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=3&page=33'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="The Law: Song's Fateful Crime")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)

In [332]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=3&page=34'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="The Law: Cursed be he that")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)

In [333]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28862/the-law'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=3&page=35'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="The Law: No! Took an Other way as its law")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)

In [337]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/27415/poem-when-the-immortal-blond'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=90&issue=6&page=20'
rescrape = scan_poem_scraper(actual_url, 
                             input_poet='Robert Duncan', 
                             input_title="Poem")

rescrape['poem_url'] = url
rescrape['genre'] = 'black_mountain'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [340]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/55733/what-next'))

In [342]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/41677/pacemaker'))

In [344]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/13075/the-dead'))

In [352]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/55313/god-56d236c65624c'))

In [354]:
# shrug, should've worked in earlier loop
error_rescrapes.append(text_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/poems/55537/and'))

In [357]:
remove_list = [
    'https://www.poetryfoundation.org/poetrymagazine/poems/55733/what-next',
    'https://www.poetryfoundation.org/poetrymagazine/poems/41677/pacemaker',
    'https://www.poetryfoundation.org/poetrymagazine/poems/13075/the-dead',
    'https://www.poetryfoundation.org/poetrymagazine/poems/55313/god-56d236c65624c',
    'https://www.poetryfoundation.org/poetrymagazine/poems/55537/and'
]

for item in remove_list:
    still_errors.remove(item)

In [356]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/54278/accounts'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [359]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/150946/elsewhere-5d70274a8beed'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [363]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/53487/paragraph'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [364]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/89349/object-permanence'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [365]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/92669/natural-histories'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [370]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/144605/my-house-59df845fd8d77'
error_rescrapes.append(text_poem_scraper(url))
still_errors.remove(url)

In [373]:
%%time

for url in tqdm(still_errors[20:]):
    try:
        rescrape = text_poem_scraper(url)
        error_rescrapes.append(rescrape)
        still_errors.remove(url)
    except:
        continue

100%|██████████| 54/54 [00:52<00:00,  1.02it/s]

CPU times: user 4.2 s, sys: 248 ms, total: 4.45 s
Wall time: 52.8 s





In [383]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28835/pity-his-how-illimitable-plight'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=9'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title='pity his how illimitable plight')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
rescrape['poem_lines'].extend(temp_rescrape_lines)
rescrape['poem_lines'].extend(temp_rescrape_lines2)
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [379]:
temp_rescrape = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=10',
                  input_poet='E. E. Cummings', input_title='without the mercy of')
temp_rescrape_lines = [temp_rescrape['title']]
temp_rescrape_lines.extend(temp_rescrape['poem_lines'])
temp_rescrape_lines

['without the mercy of',
 'your eyes your',
 'voice your',
 'ways (o very most my shining love)',
 'how more than dark i am,',
 'no song (no',
 'thing) no',
 'silence ever told; it has no name—',
 'but should this namelessness',
 '(completely',
 'fleetly',
 'vanish, at the infinite precise',
 'thrill of your beauty, then',
 'my lost my',
 'my',
 'whereful selves they put on here again',
 '—to livingest one star',
 'as small these',
 'all these',
 'thankful (hark) birds singing wholly are']

In [381]:
temp_rescrape2 = scan_poem_scraper('https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=11',
                  input_poet='E. E. Cummings', input_title='annie died the other day')
temp_rescrape_lines2 = [temp_rescrape2['title']]
temp_rescrape_lines2.extend(temp_rescrape2['poem_lines'])
temp_rescrape_lines2

['annie died the other day',
 'never was there such a lay—',
 'whom, among her dollies, dad',
 'first (“don’t tell your mother”) had;',
 'making annie slightly mad',
 'but very wonderful in bed',
 '—-saints and satyrs, go your way',
 'youths and maidens: let us pray']

In [392]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/22223/six'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=53&issue=4&page=5'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title='six')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [396]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25960/springmay0151'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title='spring! may')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [399]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/25513/-oroundmoonhow'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title='o(rounD)moon, how')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [403]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28566/why-dont-be'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="why don't be sil ly o no in deed; money")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [405]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/24807/thislets-rememberday'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="this(let's remember)day died again and")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [409]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28834/for-any-ruffian-of-the-sky'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=8'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title="for any ruffian of the sky")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [410]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/28833/if-seventy-were-young'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=99&issue=2&page=7'
rescrape = scan_poem_scraper(actual_url, input_poet='E. E. Cummings', input_title="if seventy were young")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [412]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/26048/rosetreerosetree'
rescrape = scan_poem_scraper(url, input_poet='E. E. Cummings', input_title="rosetree, rosetree")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [415]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/15533/lenvoi'
rescrape = scan_poem_scraper(url, input_poet='Marion Strobel', input_title="envoi")
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [421]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/17874/discus-thrower'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=31&issue=5&page=3'
rescrape = scan_poem_scraper(actual_url, input_poet='Marion Strobel', input_title="Discus-Thrower",
                            next_pattern='\n((?:\r?\n(?!SURF-BOARDING).*)*)')
rescrape['poem_url'] = url
rescrape['genre'] = 'modern'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [423]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/33672/poem-green-things-are-flowers'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=130&issue=2&page=10'
rescrape = scan_poem_scraper(actual_url, input_poet="Frank O'Hara", input_title="Poem")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [426]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30246/poem-to-simply-talk'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=107&issue=6&page=10'
rescrape = scan_poem_scraper(actual_url, input_poet="Tom Clark", input_title="Poem")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [428]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30549/poem-like-musical-instruments'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=109&issue=6&page=14'
rescrape = scan_poem_scraper(actual_url, input_poet="Tom Clark", input_title="Poem")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [429]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30775/sonnet-five-am-on-east-fourteenth'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=111&issue=3&page=8'
rescrape = scan_poem_scraper(actual_url, input_poet="Tom Clark", input_title="Sonnet")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [434]:
# i've truly lost my mind
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30723/jungle-56d21432117a5'
rescrape = scan_poem_scraper(url, input_poet="Aram Saroyan", input_title="saroyan")
rescrape['poem_url'] = url
rescrape['genre'] = 'new_york_school_2nd_generation'
rescrape['poem_lines'] = ['j;u;n;g;l;e']
rescrape['poem_string'] = ' '.join(rescrape['poem_lines'])
rescrape['title'] = 'Untitled'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [439]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30140/spring-stood-there'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=29'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Spring")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [440]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30139/march-56d213a2b802b'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=28'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="March")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [443]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29475/now-in-one-year'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=102&issue=5&page=27'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Now in one year")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [446]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30141/the-park-a-darling-walk'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=30'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="The park")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [449]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30138/consider-at-the-outset'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=5&page=28'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Consider at the outset")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [450]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/30790/smile-to-see-the-lake'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=111&issue=3&page=22'
rescrape = scan_poem_scraper(actual_url, input_poet='Lorine Niedecker', input_title="Smile")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [455]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/29745/giovannis-rape-of-the-sabine-women-at-wildensteins'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?contentId=29745'
rescrape = scan_poem_scraper(actual_url, input_poet='George Oppen', input_title="Giovanni's Rape of the Sabine Women at Wildenstein")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [457]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19125/1930s'
rescrape = scan_poem_scraper(url, input_poet='George Oppen', input_title="1930")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [461]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/19940/along-the-flat-roofs'
actual_url = 'https://www.poetryfoundation.org/poetrymagazine/browse?volume=41&issue=4&page=17'
rescrape = scan_poem_scraper(actual_url, input_poet='Charles Reznikoff', input_title="Along the flat roofs beneath our window")
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [467]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/31024/a-21-rudens-act-iii'
rescrape = scan_poem_scraper(url, input_poet='Louis Zukofsky', input_title='"A"21: Rudens, Dads')
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [468]:
url = 'https://www.poetryfoundation.org/poetrymagazine/poems/32643/a-22-an-era-any-time'
rescrape = scan_poem_scraper(url, input_poet='Louis Zukofsky', input_title='"A"22: An Era Any Time Of Year')
rescrape['poem_url'] = url
rescrape['genre'] = 'objectivist'
error_rescrapes.append(rescrape)
still_errors.remove(url)

In [558]:
# 38 poems that are not worth rescraping
len(still_errors)

38

In [470]:
rescrapes_pt2 = pd.DataFrame(error_rescrapes)
rescrapes_pt2

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/29416/mad-sonnet-we-shall-be-free,Mad Sonnet: We Shall Be Free,"[IN THE WORLD OF DESTINY, when Heaven and Hell are a dream, left draped like blue silk upon a useless chair., But now we have come together by ato...",IN THE WORLD OF DESTINY\nwhen Heaven and Hell are a dream\nleft draped like blue silk upon a useless chair.\nBut now we have come together by atom...,beat
1,Michael McClure,https://www.poetryfoundation.org/poetrymagazine/poems/29415/mad-sonnet-when-spirit-has-no-edge,Mad Sonnet: When Spirit Has No Edge,"[the Human frame. Men swell to blindness, without pain and are stupefied. Smooth fingertips, receive no pleasure. They become what they call Soul,...",the Human frame. Men swell to blindness\nwithout pain and are stupefied. Smooth fingertips\nreceive no pleasure. They become what they call Soul\n...,beat
2,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/30530/song-how-simply-for-another,Enough: Left After That,"[not to my own mind,, but stayed, and stayed. Years, went by. What, were they. Days—, some happy,, but some bitter, and sad. If I walked, across t...","not to my own mind,\nbut stayed\nand stayed. Years\nwent by. What\nwere they. Days—\nsome happy,\nbut some bitter\nand sad. If I walked\nacross th...",black_mountain
3,Robert Creeley,https://www.poetryfoundation.org/poetrymagazine/poems/29779/walking-56d2134a84892,Walking: In My Head,"[is there to walk,, not thought of, is, the road itself more, than seen. I think, it might be, feel, as my feet do, and, continue, and, at last re...","is there to walk,\nnot thought of, is\nthe road itself more\nthan seen. I think\nit might be, feel\nas my feet do, and\ncontinue, and\nat last rea...",black_mountain
4,William Carlos Williams,https://www.poetryfoundation.org/poetrymagazine/poems/14358/epitaph-an-old-willow,Epitaph,"[An old willow with hollow branches, Slowly swayed his few high bright tendrils, And sang:, “Love is a young green willow, Shimmering at the bare ...",An old willow with hollow branches\nSlowly swayed his few high bright tendrils\nAnd sang:\n“Love is a young green willow\nShimmering at the bare w...,imagist
...,...,...,...,...,...,...
122,George Oppen,https://www.poetryfoundation.org/poetrymagazine/poems/29745/giovannis-rape-of-the-sabine-women-at-wildensteins,Giovanni's Rape of the Sabine Women at Wildenstein,"[Showing the girl, On the shoulder of the warrior, calling, Behind her in the young body’s triumph, With its slight, despairing arms aloft, And th...","Showing the girl\nOn the shoulder of the warrior, calling\nBehind her in the young body’s triumph\nWith its slight, despairing arms aloft\nAnd the...",objectivist
123,George Oppen,https://www.poetryfoundation.org/poetrymagazine/poems/19125/1930s,1930,"[Thus, Hides the, Parts—the prudery, Of Frigidaire, of, Soda-jerking—, Thus, Above the, Plane of lunch, of wives,, Removes itself, (As soda-jerkin...","Thus\nHides the\nParts—the prudery\nOf Frigidaire, of\nSoda-jerking—\nThus\nAbove the\nPlane of lunch, of wives,\nRemoves itself\n(As soda-jerking...",objectivist
124,Charles Reznikoff,https://www.poetryfoundation.org/poetrymagazine/poems/19940/along-the-flat-roofs,Along the flat roofs beneath our window,"[in the morning sunshine, I read the signature of last night’s rain., v, The squads, platoons, and regiments, of lighted windows,, ephemeral under...","in the morning sunshine\nI read the signature of last night’s rain.\nv\nThe squads, platoons, and regiments\nof lighted windows,\nephemeral under ...",objectivist
125,Louis Zukofsky,https://www.poetryfoundation.org/poetrymagazine/poems/31024/a-21-rudens-act-iii,"""A""21: Rudens, Dads","[Miraculously gods playfellows dream in, men, don’t let us sleep, like me last night dreaming, this weird and silly dream:, a swallow’s nest, a mo...","Miraculously gods playfellows dream in\nmen, don’t let us sleep\nlike me last night dreaming\nthis weird and silly dream:\na swallow’s nest, a mon...",objectivist


In [521]:
rescrapes_pt2.to_csv('data/temp_rescrapes_pt2.csv')

### Combine DataFrames

[[go back to the top](#Predicting-Poetic-Movements)]

- Combine texts, scans, and rescrapes.
- Save.

In [549]:
# check out the sizes of each
text_poems_df.shape, scan_poem_df.shape, rescrapes_pt1.shape, rescrapes_pt2.shape

((3259, 6), (1774, 6), (11, 6), (124, 6))

In [550]:
# combine and ignore the indices
df = pd.concat([text_poems_df, scan_poem_df, rescrapes_pt1, rescrapes_pt2], axis=0, ignore_index=True)

# confirm
df.shape

(5168, 6)

In [552]:
# sort dataframe
df.sort_values(by=['genre', 'poet', 'title'], inplace=True)

# reset indices
df.reset_index(drop=True, inplace=True)

In [553]:
# confirm top
df.head()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
0,Alexander Pope,https://www.poetryfoundation.org/poems/44896/an-essay-on-criticism-part-1,An Essay on Criticism: Part 1,"[PART 1, 'Tis hard to say, if greater want of skill, Appear in writing or in judging ill;, But, of the two, less dang'rous is th' offence, To tire...","PART 1\n'Tis hard to say, if greater want of skill\nAppear in writing or in judging ill;\nBut, of the two, less dang'rous is th' offence\nTo tire ...",augustan
1,Alexander Pope,https://www.poetryfoundation.org/poems/44897/an-essay-on-criticism-part-2,An Essay on Criticism: Part 2,"[Of all the causes which conspire to blind, Man's erring judgment, and misguide the mind,, What the weak head with strongest bias rules,, Is pride...","Of all the causes which conspire to blind\nMan's erring judgment, and misguide the mind,\nWhat the weak head with strongest bias rules,\nIs pride,...",augustan
2,Alexander Pope,https://www.poetryfoundation.org/poems/44898/an-essay-on-criticism-part-3,An Essay on Criticism: Part 3,"[Learn then what morals critics ought to show,, For 'tis but half a judge's task, to know., 'Tis not enough, taste, judgment, learning, join;, In ...","Learn then what morals critics ought to show,\nFor 'tis but half a judge's task, to know.\n'Tis not enough, taste, judgment, learning, join;\nIn a...",augustan
3,Alexander Pope,https://www.poetryfoundation.org/poems/44899/an-essay-on-man-epistle-i,An Essay on Man: Epistle I,"[Awake, my St. John! leave all meaner things, To low ambition, and the pride of kings., Let us (since life can little more supply, Than just to lo...","Awake, my St. John! leave all meaner things\nTo low ambition, and the pride of kings.\nLet us (since life can little more supply\nThan just to loo...",augustan
4,Alexander Pope,https://www.poetryfoundation.org/poems/44900/an-essay-on-man-epistle-ii,An Essay on Man: Epistle II,"[I., Know then thyself, presume not God to scan;, The proper study of mankind is man., Plac'd on this isthmus of a middle state,, A being darkly w...","I.\nKnow then thyself, presume not God to scan;\nThe proper study of mankind is man.\nPlac'd on this isthmus of a middle state,\nA being darkly wi...",augustan


In [554]:
# confirm bottom
df.tail()

Unnamed: 0,poet,poem_url,title,poem_lines,poem_string,genre
5163,William Barnes,https://www.poetryfoundation.org/poems/52365/tokens,Tokens,"[Green mwold on zummer bars do show, That they’ve a-dripp’d in winter wet;, The hoof-worn ring o’ groun’ below, The tree, do tell o’ storms or het...","Green mwold on zummer bars do show\nThat they’ve a-dripp’d in winter wet;\nThe hoof-worn ring o’ groun’ below\nThe tree, do tell o’ storms or het;...",victorian
5164,William Barnes,https://www.poetryfoundation.org/poems/52362/zun-zet,Zun-zet,"[Where the western zun, unclouded,, Up above the grey hill-tops,, Did sheen drough ashes, lofty sh’ouded,, On the turf beside the copse,, In zumme...","Where the western zun, unclouded,\nUp above the grey hill-tops,\nDid sheen drough ashes, lofty sh’ouded,\nOn the turf beside the copse,\nIn zummer...",victorian
5165,William Ernest Henley,https://www.poetryfoundation.org/poems/51642/invictus,Invictus,"[Out of the night that covers me,, Black as the pit from pole to pole,, I thank whatever gods may be, For my unconquerable soul., In the fell clut...","Out of the night that covers me,\nBlack as the pit from pole to pole,\nI thank whatever gods may be\nFor my unconquerable soul.\nIn the fell clutc...",victorian
5166,William Makepeace Thackeray,https://www.poetryfoundation.org/poems/52711/the-cane-bottomd-chair,The Cane-Bottom’d Chair,"[In tattered old slippers that toast at the bars,, And a ragged old jacket perfumed with cigars,, Away from the world and its toils and its cares,...","In tattered old slippers that toast at the bars,\nAnd a ragged old jacket perfumed with cigars,\nAway from the world and its toils and its cares,\...",victorian
5167,William Miller,https://www.poetryfoundation.org/poems/46949/willie-winkie-56d2271169ef7,Willie Winkie,"[Wee Willie Winkie, Rins through the toun,, Up stairs and doun stairs, In his nicht-gown,, Tirling at the window,, Crying at the lock,, “Are the w...","Wee Willie Winkie\nRins through the toun,\nUp stairs and doun stairs\nIn his nicht-gown,\nTirling at the window,\nCrying at the lock,\n“Are the we...",victorian


#### Save pre-cleaned DataFrame

In [555]:
# # uncomment to save
# df.to_csv('data/poems_df_pre_clean.csv')

#### Load pre-cleaned DataFrame

[[go back to the top](#Webscraping)]

In [560]:
# # uncomment to load
# df = pd.read_csv('data/poems_df_pre_clean.csv', index_col=0)

# confirm
df.shape

(5168, 6)

## Next notebook: [Data Cleaning](02_data_cleaning.ipynb)

[[go back to the top](#Webscraping)]

- The next notebook includes data cleaning and some more rescraping.