# Scraper:  famouspoetsandpoems.com

This script borrows heavily from https://github.com/chaimgluck.  It scrapes a large number of poems of all sorts, but is rather indiscriminate:  some poems are huge, some tiny, and themes are all over the place.  All told there are about 35,000 poems on the site.  The script also scrapes theme labels where available.  (About 2,600 poems are labelled.)

This script is divided into two parts:

1.  <a href='#scraping'> Scraping </a>
2.  <a href='#cleaning'> Cleaning </a>

<a id="scraping"></a>
## Part I:  Scraping

**Path names:**  As the script scrapes data, it will write the data to a file, and eventually read and write dataframes to/from a data directory.  The following cell should be modified to align with your desired directory configuration.  Note the use of relative path names.

In [1]:
rootdir = "../data/poems"         # root directory for all poetry data
csvdir   = rootdir + '/csv'       # subdirectory for csv files
pkldir   = rootdir + '/pkl'       # subdirectory for pkl files
main_website = 'http://www.famouspoetsandpoems.com' # source of all these poems
years_fname = 'famouspoets_years'  # place to store year information (can be deleted after script runs)
themes_fname = 'famouspoets_themes'# place to store theme information (can be deleted after script runs)
rawpoems_fname = 'famouspoets_raw'      # base name of variables and files containing raw, uncleaned poems
cleanpoems_fname = 'famouspoets_clean'  # base name for cleaned poems
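
The later cells write into `csvdir` and `pkldir`, so those directories need to exist before the script runs. A minimal sketch, assuming the same directory layout as the cell above (the helper name `ensure_data_dirs` is hypothetical, not part of the original script):

```python
import os

def ensure_data_dirs(rootdir):
    """Create the csv and pkl subdirectories under rootdir if they are missing."""
    csvdir = os.path.join(rootdir, 'csv')
    pkldir = os.path.join(rootdir, 'pkl')
    for path in (csvdir, pkldir):
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
    return csvdir, pkldir
```

Calling `csvdir, pkldir = ensure_data_dirs(rootdir)` before scraping would avoid a failed write at the very end of a long run.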

**Import statements:** mostly pandas, Beautiful Soup and some string manipulation.

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re, os
import pickle
import numpy as np

**Years for each poet**.  The code below scrapes the web page that has the years of each poet's life.  This is one way to separate the poems into eras. 

In [3]:
# connect to main webpage
all_poets = requests.get(main_website + '/poets.html')
poets = BeautifulSoup(all_poets.text, 'lxml')

poet_years = []

# for every row, extract year information
for tag in poets.findAll('td'):
    try:
        if '(' in tag.get_text():
            poet_years.append(tag.get_text().strip())
    except:
        pass

In [4]:
# take a peek at what we've got
poet_years[3:7]

['Maya Angelou (18)(1928 - present)',
 'Maya Angelou (18)(1928 - present)',
 'Margaret Atwood (28)(1939 - present)',
 'Margaret Atwood (28)(1939 - present)']

Let's clean this up and store it in a dataframe.

In [5]:
poet_years2 = poet_years[3:]
poet_years2 = [x.strip() for x in poet_years2]
poet_years = poet_years2[::2]
poets_and_years = []
for i in poet_years:
    poets_and_years.append(i.split('('))
df_poets_and_years = pd.DataFrame(poets_and_years, columns=['name', 'number', 'years', 'blech'])
df_poets_and_years.drop(['blech'], axis=1, inplace=True)
df_poets_and_years.head()

Unnamed: 0,name,number,years
0,Maya Angelou,18),1928 - present)
1,Margaret Atwood,28),1939 - present)
2,Matthew Arnold,45),1822 - 1888)
3,Yehuda Amichai,38),1924 - 2000)
4,Anna Akhmatova,26),1889 - 1966)
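
The trailing `)` left in the `number` and `years` columns comes from splitting on `(` alone. As an alternative sketch (not the approach used in this script), a regular expression can pull the three fields out in one pass. `parse_poet_entry` is a hypothetical helper, and the pattern assumes every entry looks like the samples printed above:

```python
import re

# name, then "(count)", then "(years)"
ENTRY_PATTERN = re.compile(
    r'^(?P<name>.+?)\s*\((?P<number>\d+)\)\s*\((?P<years>[^)]*)\)\s*$'
)

def parse_poet_entry(entry):
    """Return (name, number, years) for one entry, or None if it does not match."""
    m = ENTRY_PATTERN.match(entry.strip())
    if m is None:
        return None
    return m.group('name'), int(m.group('number')), m.group('years').strip()

print(parse_poet_entry('Maya Angelou (18)(1928 - present)'))
```

Entries that do not fit the pattern come back as `None`, which flags them for manual attention rather than letting them slip into the dataframe.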


Let's store the year information to file.  We'll use this information after we scrape the poems themselves, but this process is long, and possibly error-prone, so saving is a wise precaution.  We'll store it in the current directory and delete it when we're done.

In [6]:
df_poets_and_years.to_csv(years_fname + '.csv', index=False)
df_poets_and_years.head()

Unnamed: 0,name,number,years
0,Maya Angelou,18),1928 - present)
1,Margaret Atwood,28),1939 - present)
2,Matthew Arnold,45),1822 - 1888)
3,Yehuda Amichai,38),1924 - 2000)
4,Anna Akhmatova,26),1889 - 1966)


**Scraping a list of themes**:  Some poems are explicitly labelled on the `famouspoetsandpoems` site.  Here I scrape these themes, along with the associated poems.  

In [7]:
# landing page for list of themes
all_themes = requests.get(main_website + '/' + 'thematic_poems.html')

# get the html from this page
all_themes_html = BeautifulSoup(all_themes.text, 'lxml')

In [8]:
# collect urls for landing page for each theme
theme_urls = []
for tag in all_themes_html.find_all('li',style="padding-bottom:5px;list-style-image: url(/images/_li.gif);padding-right:10px;"):
    theme_urls.append(tag.contents[0].get('href'))
# take a peek to make sure things are kosher
theme_urls[0:5]

['/thematic_poems/abortion_poems.html',
 '/thematic_poems/angel_poems.html',
 '/thematic_poems/animal_poems.html',
 '/thematic_poems/anniversary_poems.html',
 '/thematic_poems/april_poems.html']

In [9]:
# extract tags to store as labels in database
theme_tags = []
for url in theme_urls:
    m = re.search(r'(/)(\w+)_poems\.html',url)
    theme_tags.append(m.group(2))
theme_tags[0:5]

['abortion', 'angel', 'animal', 'anniversary', 'april']

Here we need to loop over all the "themes", for each one scraping the title and author of every poem. We'll store everything as a big list of lists, turning it into a pandas dataframe later.

In [10]:
title_author_theme = []

for url in theme_urls:
    # serve up html for theme landing page
    theme_page = requests.get(main_website+url)
    theme_page_soup = BeautifulSoup(theme_page.text, 'lxml')
    
    # extract simple string for theme name
    m = re.search(r'(/)(\w+)_poems\.html',url)
    theme_name = m.group(2)

    # extract author, title, and poem_url for each poem in this theme, and store, along with theme label
    for poem in theme_page_soup.find_all('div',style="width:440px;padding-left:12px;padding-top:14px;"):
        title = poem.a.get_text()
        author = poem.span.get_text()
        poem_url = poem.a.get('href')
        poem_page = requests.get(main_website+poem_url)
        poem_page_soup = BeautifulSoup(poem_page.text, 'lxml')
        text_blob = poem_page_soup.find('div',style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
        if text_blob:
            text = text_blob.get_text()
        else:
            print("no text for poem {}".format(poem_url))  
            text = ''
        title_author_theme.append([title,author,theme_name,poem_url,text])

no text for poem /poets//poems/10095.html
no text for poem /poets//poems/10106.html
no text for poem /poets//poems/20956.html
no text for poem /poets//poems/10109.html
no text for poem /poets//poems/10127.html
no text for poem /poets//poems/10119.html
no text for poem /poets//poems/10158.html


Convert to a dataframe, and store as csv. (Store in current directory...we'll delete it when we're done.)

In [73]:
df_themes_and_poems = pd.DataFrame(title_author_theme,columns=['title','author','theme','poem_url','poem'])
df_themes_and_poems.to_csv(themes_fname + '.csv',index=False)
df_themes_and_poems.head()

Unnamed: 0,title,author,theme,poem_url,poem
0,Inferno (English),by Dante Alighieri,abortion,/poets/dante_alighieri/poems/39.html,"\n\t\t\t\t\t\tCANTO I ONE night, when half my..."
1,Part 7 of Trout Fishing in America,by Richard Brautigan,abortion,/poets/richard_brautigan/poems/4006.html,\n\t\t\t\t\t\tTHE PUDDING MASTER OF ...
2,The Mother,by Gwendolyn Brooks,abortion,/poets/gwendolyn_brooks/poems/4136.html,\n\t\t\t\t\t\tAbortions will not let you forge...
3,The Glove,by Robert Browning,abortion,/poets/robert_browning/poems/4929.html,\n\t\t\t\t\t\t(PETER RONSARD _loquitur_.)``Hei...
4,130. Natureâ€™s Law: A Poem,by Robert Burns,abortion,/poets/robert_burns/poems/4989.html,\n\t\t\t\t\t\tLET other heroes boast their sca...


**Scrape a list of links to each poet's page**. First collect links to each poet's landing page. These pages each contain links to all of that poet's poems. Then collect all the links for all of their poems. Then scrape the actual poem from each of those pages.

In [75]:
poet_links = []

for tag in poets.findAll('td'):
    try:
        link = tag.find('a')['href']
        if '/poets/' in link:
            poet_links.append(link)
    except (TypeError, KeyError):
        pass

poet_links = list(set(poet_links))
poet_links[0:5]

['/poets/laurie_lee',
 '/poets/ezra_pound',
 '/poets/charles_webb',
 '/poets/jonas_mekas',
 '/poets/charles_simic']

This function collects all the links for individual poems from each poet's landing page. 

In [78]:
def get_poems(link):
    poetry = requests.get(link)
    bib_soup = BeautifulSoup(poetry.text, 'lxml')
     
    poem_links = []
    for poems in bib_soup.findAll('td'):
        try:
            poem = poems.find('a')['href']
            if '/poems/' in poem:
                poem_links.append(poem)
        except (TypeError, KeyError):
            pass
        
    poem_links = list(set(poem_links))
    poems = []
    for poem in poem_links:
        poem_link = main_website + poem      
        poems.append(poem_link)
    return poems

In [84]:
raw = main_website
poem_links = []
for poet in poet_links:
    link = raw + poet + '/poems'
    poem_links.append(get_poems(link))
poem_links[0]

['http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6843',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6839',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6841',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6840',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6842',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6838']

Each element of "poem_links" is a list of links to a set of poems from some particular poet. Let's unwind each sublist into a single list. 

In [85]:
poem_links_list = [item for sublist in poem_links for item in sublist]
poem_links_list[0:5]

['http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6843',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6839',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6841',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6840',
 'http://www.famouspoetsandpoems.com/poets/laurie_lee/poems/6842']
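
`get_poems` deduplicates with `set`, which does not keep the links in page order. If the order matters, flattening and deduplicating can be done together while preserving first-seen order. A small sketch with made-up links (the `example.com` URLs are placeholders, and `flatten_unique` is a hypothetical helper):

```python
from itertools import chain

def flatten_unique(list_of_lists):
    """Flatten nested lists and drop repeats, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(chain.from_iterable(list_of_lists)))

nested = [['http://example.com/a/1', 'http://example.com/a/2'],
          ['http://example.com/b/1', 'http://example.com/a/1']]
print(flatten_unique(nested))
```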

**Scraping all the poems from site.** Below, I go to each link in the list of poem pages and scrape the actual poem from the page. 

In [91]:
df_rawpoems = pd.DataFrame(columns=['url','title','poem','author'])
df_rawpoems['url']=poem_links_list
df_rawpoems.reset_index(inplace=True,drop=True)

In [142]:
# temporary file to store output as it is read...useful for crashes mid-loop
outfile = 'temp.csv'
# rate at which file is written to and updates reported
batchsize = 100
nbatches = np.floor(len(df_rawpoems)/batchsize)
# for each url...
for row in df_rawpoems[7816:].itertuples():
    # try to get the poem information
    try:
        url = requests.get(row.url,timeout=100)
    except requests.exceptions.RequestException:
        # otherwise, just keep going
        continue
    soup = BeautifulSoup(url.text, 'lxml')
    poem = soup.find('div', style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
    title_tag = soup.find('span',style="font-weight:bold;font-size:16px;color:#3C605B;font-family:Times New Roman;")
    # guard against pages with no title span, which would otherwise crash the loop
    title = title_tag.get_text() if title_tag else ''
    df_rawpoems.loc[row.Index,['title','poem']] = [title, str(poem)]
    # write to file and display progress once in a while
    if row.Index % batchsize == 0:
        print("{:1.0f} of {:1.0f} batches completed".format(row.Index/batchsize,nbatches))
        # overwrite the checkpoint with the current snapshot (appending would duplicate earlier rows)
        df_rawpoems.to_csv(outfile, index=False)

79 of 344 batches completed
80 of 344 batches completed
81 of 344 batches completed
82 of 344 batches completed
83 of 344 batches completed
84 of 344 batches completed
85 of 344 batches completed
86 of 344 batches completed
87 of 344 batches completed
88 of 344 batches completed
89 of 344 batches completed
90 of 344 batches completed
91 of 344 batches completed
92 of 344 batches completed
93 of 344 batches completed
94 of 344 batches completed
95 of 344 batches completed
96 of 344 batches completed
97 of 344 batches completed
98 of 344 batches completed
99 of 344 batches completed
100 of 344 batches completed
101 of 344 batches completed
102 of 344 batches completed
103 of 344 batches completed
104 of 344 batches completed
105 of 344 batches completed
106 of 344 batches completed
107 of 344 batches completed
108 of 344 batches completed
109 of 344 batches completed
110 of 344 batches completed
111 of 344 batches completed
112 of 344 batches completed
113 of 344 batches completed
114 of
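
The loop above silently skips any URL whose request fails, which leaves that row empty. One possible refinement is to retry a few times with a growing pause before giving up. The sketch below is an illustration under stated assumptions, not part of the original script: `fetch_with_retries` is a hypothetical helper, and the `fetch` argument stands in for something like `lambda u: requests.get(u, timeout=100)`.

```python
import time

def fetch_with_retries(fetch, url, retries=3, pause=1.0):
    """Call fetch(url); on an exception wait and retry, up to `retries` attempts.

    Returns the result of fetch, or None if every attempt failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            # wait a little longer after each failure: pause, 2*pause, ...
            if attempt < retries - 1:
                time.sleep(pause * (attempt + 1))
    return None
```

Passing the fetching function in as an argument keeps the helper independent of `requests` and easy to try out with a stub.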

In [148]:
# write to file
df_rawpoems.to_csv(csvdir + '/' + rawpoems_fname + '.csv',index=False)
df_rawpoems.to_pickle(pkldir + '/' + rawpoems_fname + '.pkl')

In [None]:
# eliminate temporary file
try:
    os.remove(outfile)
except OSError:
    pass

In [150]:
## Just a peek at what we're dealing with. It will need cleaning.
df_rawpoems.loc[0,:]

url       http://www.famouspoetsandpoems.com/poets/ezra_...
title                        Cantico del Sole by Ezra Pound
poem      <div style="padding-left:14px;padding-top:20px...
author                                     by Carl Sandburg
Name: 0, dtype: object

<a id="cleaning"></a>
## Part II:  Cleaning

**Load data**:  Let's take a quick peek at the data.

In [22]:
df_cleanpoems = pd.read_csv(csvdir + '/' + rawpoems_fname + '.csv')
print("size of data:  ", len(df_cleanpoems))
df_cleanpoems.head()

size of data:   34472


Unnamed: 0,url,title,poem,author
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole by Ezra Pound,"<div style=""padding-left:14px;padding-top:20px...",by Carl Sandburg
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend by Ezra Pound,"<div style=""padding-left:14px;padding-top:20px...",by Carl Sandburg
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case by Ezra Pound,"<div style=""padding-left:14px;padding-top:20px...",by Carl Sandburg
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet by Ezra Pound,"<div style=""padding-left:14px;padding-top:20px...",by Carl Sandburg
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return by Ezra Pound,"<div style=""padding-left:14px;padding-top:20px...",by Carl Sandburg


**Separate title and author information.**  The 'title' column contains both title and author.  Let's split them.  "By" would be a good word to split on.  Of course it could show up in titles.  But it most likely won't show up in author names.  So we'll do a right split, and take the last element for the author name.

In [23]:
df_cleanpoems[['title','author']] = df_cleanpoems.title.str.rsplit('by',n=1,expand=True)
df_cleanpoems.head()

Unnamed: 0,url,title,poem,author
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
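
One caveat with splitting on the bare string `'by'`: it also matches inside words, so a name containing those letters would be cut in two. Splitting on `' by '` with surrounding spaces is a stricter variant. The sketch below uses an invented title and author purely for illustration:

```python
# hypothetical entry, not taken from the site
entry = 'Lullaby by Abby Example'

# bare 'by' splits inside the author's name
loose = entry.rsplit('by', 1)
# ' by ' only matches the separator between title and author
strict = entry.rsplit(' by ', 1)

print(loose)   # ['Lullaby by Ab', ' Example']
print(strict)  # ['Lullaby', 'Abby Example']
```

With pandas the same change would be `str.rsplit(' by ', n=1, expand=True)`, assuming every title on the site uses that separator.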


**Join  theme information.**  We scraped theme information separately from the poems themselves.  We'll need to join them.

In [24]:
# load the theme data
df_themes = pd.read_csv(themes_fname + '.csv')
print('number of poems with theme info:  ',len(df_themes))
df_themes.head()

number of poems with theme info:   2661


Unnamed: 0,title,author,theme,poem_url,poem
0,Inferno (English),by Dante Alighieri,abortion,/poets/dante_alighieri/poems/39.html,"\n\t\t\t\t\t\tCANTO I ONE night, when half my..."
1,Part 7 of Trout Fishing in America,by Richard Brautigan,abortion,/poets/richard_brautigan/poems/4006.html,\n\t\t\t\t\t\tTHE PUDDING MASTER OF ...
2,The Mother,by Gwendolyn Brooks,abortion,/poets/gwendolyn_brooks/poems/4136.html,\n\t\t\t\t\t\tAbortions will not let you forge...
3,The Glove,by Robert Browning,abortion,/poets/robert_browning/poems/4929.html,\n\t\t\t\t\t\t(PETER RONSARD _loquitur_.)``Hei...
4,130. Natureâ€™s Law: A Poem,by Robert Burns,abortion,/poets/robert_burns/poems/4989.html,\n\t\t\t\t\t\tLET other heroes boast their sca...


We have theme information for less than 10% of the poems, but it's still useful info, so let's add it.  We have urls for both themes and the full poem df, but they'll need some doctoring to be brought into the same format for subsequent joining.  

In [25]:
# remove prefix from first set
df_cleanpoems['simple_url'] = df_cleanpoems.url.str.replace('http://www.famouspoetsandpoems.com','',regex=False)
# remove suffix from second set
df_themes['simple_url'] = df_themes.poem_url.str.strip().str.replace('.html','',regex=False)

In [26]:
df_cleanpoems.simple_url.head()

0    /poets/ezra_pound/poems/18819
1    /poets/ezra_pound/poems/18808
2    /poets/ezra_pound/poems/18827
3    /poets/ezra_pound/poems/18825
4    /poets/ezra_pound/poems/18786
Name: simple_url, dtype: object

In [27]:
df_themes.simple_url.head()

0        /poets/dante_alighieri/poems/39
1    /poets/richard_brautigan/poems/4006
2     /poets/gwendolyn_brooks/poems/4136
3      /poets/robert_browning/poems/4929
4         /poets/robert_burns/poems/4989
Name: simple_url, dtype: object
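
The same normalization can be written as one function that handles both forms, which makes the matching rule explicit. This is a sketch using only the standard library; `simplify_url` is a hypothetical name:

```python
from urllib.parse import urlparse

def simplify_url(url):
    """Reduce a full or relative poem URL to its path without the .html suffix."""
    path = urlparse(url.strip()).path
    if path.endswith('.html'):
        path = path[:-len('.html')]
    return path

print(simplify_url('http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18819'))
print(simplify_url('/poets/dante_alighieri/poems/39.html'))
```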

OK, the urls are in the same format.  Now we'll use a left join to tack on theme labels if they exist.  But we'll need to gather themes as a list, since one poem might be assigned to more than one theme.

In [28]:
df_themes.set_index('simple_url',inplace=True)
df_themes['themes'] = df_themes.groupby('simple_url').apply(lambda x: x.theme.tolist())
df_themes.reset_index(inplace=True)
df_themes.drop_duplicates(subset='simple_url',inplace=True)
del df_themes['theme']
df_cleanpoems = pd.merge(df_cleanpoems, df_themes[['simple_url','themes']], how='left', on='simple_url')
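
The grouping step can also be expressed with `groupby(...).agg(list)`, which returns one row per url directly. A toy sketch with invented urls and themes, assuming pandas is available as in the rest of the notebook:

```python
import pandas as pd

# invented sample: the first poem carries two theme labels
toy_themes = pd.DataFrame({
    'simple_url': ['/poets/x/poems/1', '/poets/x/poems/1', '/poets/y/poems/2'],
    'theme': ['love', 'death', 'nature'],
})

# collect all themes for each url into a single list
theme_lists = (toy_themes.groupby('simple_url')['theme']
               .agg(list)
               .rename('themes')
               .reset_index())
print(theme_lists)
```

Because the result already has one row per url, no separate `drop_duplicates` step is needed in this variant.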

In [29]:
print("The length of the merged data set:  ", len(df_cleanpoems))
df_cleanpoems.head()

The length of the merged data set:   34472


Unnamed: 0,url,title,poem,author,simple_url,themes
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18819,
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18808,
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18827,
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18825,
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18786,


Good!  Looks like the resulting df is the same size as the original, meaning that there were no duplicate urls in the right hand frame.  
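
Comparing lengths after the fact works; pandas can also enforce the same condition during the merge through the `validate` argument. A toy sketch with invented frames:

```python
import pandas as pd

left = pd.DataFrame({'simple_url': ['/a', '/b', '/c'], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'simple_url': ['/a'], 'themes': [['love', 'death']]})

# 'many_to_one' raises pandas.errors.MergeError if the right-hand keys are not unique
merged = pd.merge(left, right, how='left', on='simple_url', validate='many_to_one')
print(len(merged) == len(left))
```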

**Drop nulls**

In [33]:
df_cleanpoems = df_cleanpoems.loc[~df_cleanpoems.poem.isnull(),:]
df_cleanpoems = df_cleanpoems.loc[~df_cleanpoems.author.isnull(),:]
df_cleanpoems = df_cleanpoems.loc[~df_cleanpoems.title.isnull(),:]
len(df_cleanpoems)

34471

**Encoding and stripping**

Note above that some of the characters in these poems are non-ASCII.  Let's strip such characters, as well as leading and trailing white space.  This is why the nulls were dropped in the previous step: the `encode` call below fails on missing values.

In [35]:
str_fields = ['poem','title','author']

for field in str_fields:
    df_cleanpoems[field]=df_cleanpoems[field].apply(lambda x:  x.encode("ascii",errors='ignore').decode())
    df_cleanpoems[field]=df_cleanpoems[field].str.strip()
    
df_cleanpoems.head()

Unnamed: 0,url,title,poem,author,simple_url,themes
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18819,
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18808,
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18827,
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18825,
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18786,
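
Encoding to ASCII with `errors='ignore'` drops accented letters entirely. If keeping the base letter is preferable, Unicode normalization can decompose the character first. This is an optional variant, not what the script does; `to_ascii` is a hypothetical helper and the sample word is invented:

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode()

sample = 'caf\u00e9'  # "cafe" with an acute accent on the e
print(sample.encode('ascii', errors='ignore').decode())  # caf
print(to_ascii(sample))                                   # cafe
```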


**Drop lead html**

Let's see what work each poem is going to need.  

In [37]:
df_cleanpoems.poem[0]

'<div style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;">\n\t\t\t\t\t\tThe thought of what America would be like<br/>If the Classics had a wide circulation<br/>       Troubles my sleep,<br/>The thought of what America,<br/>The thought of what America,The thought of what America would be like<br/>If the Classics had a wide circulation<br/>      Troubles my sleep.<br/>Nunc dimittis, now lettest thou thy servant,<br/>Now lettest thou thy servant<br/>       Depart in peace.<br/>The thought of what America,<br/>The thought of what America,<br/>The thought of what America would be like<br/>If the Classics had a wide circulation...<br/>       Oh well!<br/>       It troubles my sleep.\t\t\t\t\t\t</div>'

In general, each line is separated by a `<br/>`.  Not sure we can bank on this for all poems, especially those of a more narrative form.  It also looks like the poem proper begins after a succession of six tabs.  We'll start by splitting on the tabs to drop the leading and trailing html, and deal with the breaks afterwards.  Let's see to what extent this pattern runs through the data.

In [38]:
idx = df_cleanpoems.poem.str.contains('\t\t\t\t\t\t')
np.sum(idx)

34413

OK, most of the poems contain the six-fold tab, but some don't.  We'll need to process the two batches differently.

In [39]:
df_cleanpoems.loc[idx,'poem_clean'] = df_cleanpoems.loc[idx,'poem'].apply(lambda x: x.split('\t\t\t\t\t\t')[1])
df_cleanpoems.head()

Unnamed: 0,url,title,poem,author,simple_url,themes,poem_clean
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18819,,The thought of what America would be like<br/>...
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18808,,"Blue mountains to the north of the walls,<br/>..."
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18827,,"These fought in any case,<br/>and some believi..."
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18825,,"When I behold how black, immortal ink<br/>Drip..."
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound,/poets/ezra_pound/poems/18786,,"See, they return; ah, see the tentative<br/> ..."


It remains to get rid of html tags.  Let's first see what's going on with the poems without the six-fold tab.

In [40]:
#ix = np.where(~idx==True)[0]
len(df_cleanpoems.loc[~idx,:])

58

There are only 58 poems that don't fit the paradigm.  Out of 35000, that's not many.  Let's drop these.

In [41]:
df_cleanpoems = df_cleanpoems.loc[idx,:]

**Clean out remaining junk**  We need to get rid of returns, newlines, and breaks.

In [42]:
df_cleanpoems.poem_clean = df_cleanpoems.poem_clean.replace({r'\r|\n|<br/>': ' '}, regex=True)
df_cleanpoems.poem_clean.head()

0    The thought of what America would be like If t...
1    Blue mountains to the north of the walls, Whit...
2    These fought in any case, and some believing p...
3    When I behold how black, immortal ink Drips fr...
4    See, they return; ah, see the tentative      M...
Name: poem_clean, dtype: object
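
Replacing `<br/>` with a space discards the line structure of each poem. If line breaks are wanted later (say, for counting lines), a variant could turn breaks into newlines and strip any other tags. A rough sketch using the standard library; `html_to_lines` is a hypothetical helper, and the regexes assume markup as simple as the sample shown earlier:

```python
import html
import re

def html_to_lines(fragment):
    """Convert a poem's html fragment to plain text, one poem line per text line.

    Blank lines (including stanza breaks) are dropped.
    """
    text = re.sub(r'<br\s*/?>', '\n', fragment, flags=re.IGNORECASE)  # breaks -> newlines
    text = re.sub(r'<[^>]+>', '', text)                                # drop remaining tags
    text = html.unescape(text)                                         # &amp; -> &, etc.
    lines = [line.strip() for line in text.split('\n')]
    return '\n'.join(line for line in lines if line)

raw = ('<div style="padding-left:14px;">\n\t\t\t\t\t\tTroubles my sleep,<br/>'
       'The thought of what America,\t\t\t\t\t\t</div>')
print(html_to_lines(raw))
```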

Looks OK!  Let's delete the messy poems, and rename the clean ones.

In [43]:
del df_cleanpoems['poem']
df_cleanpoems.rename(columns={'poem_clean':'poem'},inplace=True)
df_cleanpoems.head()

Unnamed: 0,url,title,author,simple_url,themes,poem
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,Ezra Pound,/poets/ezra_pound/poems/18819,,The thought of what America would be like If t...
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,Ezra Pound,/poets/ezra_pound/poems/18808,,"Blue mountains to the north of the walls, Whit..."
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,Ezra Pound,/poets/ezra_pound/poems/18827,,"These fought in any case, and some believing p..."
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet,Ezra Pound,/poets/ezra_pound/poems/18825,,"When I behold how black, immortal ink Drips fr..."
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return,Ezra Pound,/poets/ezra_pound/poems/18786,,"See, they return; ah, see the tentative M..."


**Add dates**

Let's add the years of each poet's life to the cleaned data.

In [45]:
years = pd.read_csv(years_fname + '.csv')
years.head(3)

Unnamed: 0,name,number,years
0,Maya Angelou,18),1928 - present)
1,Margaret Atwood,28),1939 - present)
2,Matthew Arnold,45),1822 - 1888)


In [46]:
years = years.apply(lambda x: x.str.replace(')', '', regex=False))
years['birth'] = years.years.apply(lambda x: x.split('-')[0])
years['death'] = years.years.apply(lambda x: x.split('-')[-1])
years.drop('years', axis=1, inplace=True)
#years.to_csv('years-cleaned.csv', index=False)
years.head(3)

Unnamed: 0,name,number,birth,death
0,Maya Angelou,18,1928,present
1,Margaret Atwood,28,1939,present
2,Matthew Arnold,45,1822,1888
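
Splitting on `-` yields strings with stray spaces, and `present` still has to be handled separately (the script later substitutes 2016 for it). A compact alternative sketch parses a cleaned range in one step and returns `None` for living poets; `parse_years` is a hypothetical helper and assumes ranges shaped like the ones above:

```python
def parse_years(years):
    """Turn '1822 - 1888' into (1822, 1888) and '1928 - present' into (1928, None)."""
    birth, _, death = years.partition('-')
    death = death.strip()
    # int() tolerates the surrounding spaces left by the split
    return int(birth), (None if death == 'present' else int(death))

print(parse_years('1822 - 1888'))
print(parse_years('1928 - present'))
```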


In [47]:
years.name = years.name.apply(lambda x: x.strip())
years.head(3)

Unnamed: 0,name,number,birth,death
0,Maya Angelou,18,1928,present
1,Margaret Atwood,28,1939,present
2,Matthew Arnold,45,1822,1888


In [48]:
years[years.name == 'Vlanes']

Unnamed: 0,name,number,birth,death
425,Vlanes,Vladislav Nekliaev,13,13


In [49]:
years.iloc[425] = ['Vlanes', 13, '1969', ' present'] 
years.iloc[425]

name        Vlanes
number          13
birth         1969
death      present
Name: 425, dtype: object

In [50]:
years.number = years.number.astype(int)
years.birth = years.birth.astype(int)
years.death = [' 2016' if x == ' present' else x for x in years.death]
years.death = years.death.astype(int)
years.dtypes

name      object
number     int64
birth      int64
death      int64
dtype: object

In [51]:
df_cleanpoems = df_cleanpoems.merge(years, left_on='author',right_on='name')

In [52]:
df_cleanpoems[0:2]

Unnamed: 0,url,title,author,simple_url,themes,poem,name,number,birth,death
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,Ezra Pound,/poets/ezra_pound/poems/18819,,The thought of what America would be like If t...,Ezra Pound,71,1885,1972
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,Ezra Pound,/poets/ezra_pound/poems/18808,,"Blue mountains to the north of the walls, Whit...",Ezra Pound,71,1885,1972


**Add word count information**

In [53]:
df_cleanpoems['length']=df_cleanpoems.poem.apply(lambda x: len(x.split()))
df_cleanpoems.head(3)

Unnamed: 0,url,title,author,simple_url,themes,poem,name,number,birth,death,length
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,Ezra Pound,/poets/ezra_pound/poems/18819,,The thought of what America would be like If t...,Ezra Pound,71,1885,1972,91
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,Ezra Pound,/poets/ezra_pound/poems/18808,,"Blue mountains to the north of the walls, Whit...",Ezra Pound,71,1885,1972,60
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,Ezra Pound,/poets/ezra_pound/poems/18827,,"These fought in any case, and some believing p...",Ezra Pound,71,1885,1972,97


Done!  Let's get rid of the `simple_url` column and the redundant `name` column, then save and exit.

In [54]:
# delete simplified url columns
del df_cleanpoems['simple_url']
del df_cleanpoems['name']

In [55]:
# save dataframe as both csv and pkl files
df_cleanpoems.to_csv(csvdir + '/' + cleanpoems_fname + '.csv',index=False)
df_cleanpoems.to_pickle(pkldir + '/' + cleanpoems_fname + '.pkl')

One last thing:  we stored theme and date information in files, just in case the program crashed mid-scrape, but we don't need these files anymore.  Let's clean up our mess.

In [56]:
# eliminate theme file
try:
    os.remove(themes_fname+'.csv')
except OSError:
    pass

# eliminate years file
try:
    os.remove(years_fname+'.csv')
except OSError:
    pass