# poeml_scraper

At the heart of this project is a corpus of poems.  I was unable to find a decent database of modern, short verse, and so decided to scrape my own set of poems from various websites.  Specifically, I scraped poems from:

www.poets.org (mostly relatively modern, sophisticated, complex poems)
www.poemhunter.com (a wildly diverse set of poems from many different eras)
www.poetrysoup.com (from here I took the list "Top 100 Famous Poems"

Of course the rub is that in order to have a well-functioning recommender system, the poems should be stylistically "suited" to the task.  Really that requires some custom curating:  my taste is not yours, and the "matches" will be more satisfying if the underlying set of poems is palatable.  For this project, I'll end up focusing on modern, short poems and hope for the best, but this is a good example of a project for which at least some "human in the loop" could make a big difference.  

----

## Part I:  www.poets.org


The poems on this site are not directly labeled, but it is possible to browse them by theme.  This script scrolls through all of the themes, and scrapes all the resultant poems.  Some poems belong to more than one theme, and are thus included multiple times in the resulting dataset, once for each theme-word.

The script takes a while to run (on the order of hours.)  For a shorter run time, edit the list of themes and only scrape themes of interest.

**Gather theme labels and poem urls**

Import statements

In [136]:
from bs4 import BeautifulSoup
import numpy as np
import requests
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import os
%matplotlib inline

Establish connection to landing page at www.poets.org

In [101]:
main = requests.get("https://www.poets.org/poetsorg/poems")
main_soup = BeautifulSoup(main.text,'lxml') 

Scrape urls of landing page for each theme 

In [146]:
theme_urls = []
for theme in main_soup.find_all('div',class_="themes")[0].find_all('li'):
    theme_url = theme.contents[0].get('href')
    theme_name = theme.get_text()
    theme_urls.append([theme_url,theme_name])
    
themes = pd.DataFrame(theme_urls,columns=['url','theme'])
themes[0:15]

Unnamed: 0,url,theme
0,/poetsorg/poems?field_poem_themes_tid=851,Afterlife
1,/poetsorg/poems?field_poem_themes_tid=856,Aging
2,/poetsorg/poems?field_poem_themes_tid=861,Ambition
3,/poetsorg/poems?field_poem_themes_tid=866,America
4,/poetsorg/poems?field_poem_themes_tid=871,American Revolution
5,/poetsorg/poems?field_poem_themes_tid=1691,Americana
6,/poetsorg/poems?field_poem_themes_tid=876,Ancestry
7,/poetsorg/poems?field_poem_themes_tid=881,Anger
8,/poetsorg/poems?field_poem_themes_tid=886,Animals
9,/poetsorg/poems?field_poem_themes_tid=1531,Anxiety


*Audio* as actually not a theme...it just means that there is an audio recording of this poem.  Since this fact is irrelevant for this classification scheme, we'll drop that row.  Ditto for *Public domain*.

In [103]:
themes = themes.loc[(themes.theme != 'Audio') & (themes.theme != 'Public Domain'),:]

*Optional:* Scrape url of landing page for each occasion. (I didn't actually use these, but someone might want to, so I've kept the code here.)

In [106]:
occasion_urls = []
for occasion in main_soup.find_all('div',class_="occasions")[0].find_all('li'):
    occasion_url = occasion.contents[0].get('href')
    occasion_name = occasion.get_text()
    occasion_urls.append([occasion_url,occasion_name])
    
occasions = pd.DataFrame(occasion_urls,columns=['url','occasion'])
occasions.head()

Unnamed: 0,url,occasion
0,/poetsorg/poems?field_occasion_tid=476,Anniversary
1,/poetsorg/poems?field_occasion_tid=1671,Asian/Pacific American Heritage Month
2,/poetsorg/poems?field_occasion_tid=484,Autumn
3,/poetsorg/poems?field_occasion_tid=478,Birthdays
4,/poetsorg/poems?field_occasion_tid=1501,Black History Month


Now build a master list of poem/theme combinations.  This involves some clever webscraping...for each theme, the results are displayed in a variable number of pages.  We need to figure out the number of pages, and use that number to predict the urls of the pages listing poems ('landing pages').  Once we have these landing pages, we scrape poem info from them.

In order to implement this scheme, we'll need a helper function that returns the individual poem urls from each landing subpage for a given theme.

In [143]:
def get_poem_urls(landing_page_url,theme):# assemble list of poems from a landing page
    year_title_url_poet_theme = []
    main = requests.get(landing_page_url)
    main_soup = BeautifulSoup(main.text,'lxml') 
    odd_entries = main_soup.find_all('tr', class_="odd")
    even_entries = main_soup.find_all('tr', class_="even")
    all_entries = odd_entries + even_entries
    for entry in all_entries:
        year = entry.find('span',class_="date-display-single")
        if year:
            year = year.get_text()
        else:
            year = ''
        titleblob = entry.find('td',class_="views-field views-field-title")
        title = titleblob.a.get_text()
        poem_url = titleblob.a.get('href')
        poet = entry.find('td',class_="views-field views-field-field-first-name")
        if poet.find('a'):
            poet = poet.a.get_text()
        else:
            poet = ''
        year_title_url_poet_theme.append([year,title,poem_url,poet,theme])

    return year_title_url_poet_theme

We'll also need to write to files.  The following functions provide a little error checking to make sure we don't overwrite stuff.

In [172]:
def modify_filename(filename):
    """Add .bak suffix to string """
    return filename + '.bak'

def check_filename(filename):
    """If file that we're going to write to already exists, resave it with a .bak extension"""
    if os.path.isfile(filename):
        filename_mod = modify_filename(filename)
        while os.path.isfile(filename_mod):
            filename_mod = modify_filename(filename_mod)
        os.rename(filename,filename_mod)

Now we'll simply crawl along each theme, and for each theme crawl along each page, and for each page grab all the poem info.  To manage store, we'll write the output for each page to a file as we go along.

**Caution:** this cell takes a while to run (on the order of half an hour for all themes.)

In [150]:
# define output file, and rename if it already exists
outfile = 'poetsorg_urls.csv'
check_filename(outfile)

# comment this section out if you want all themes
theme_subset = ['Cities','Nature','Love']
themes = themes.loc[themes.theme.isin(mythemes)]

# for each theme....
for index, row in themes.iterrows():
    url='https://www.poets.org'+row['url']
    theme = row['theme']
    
    # ...figure out how many pages of output there are
    main = requests.get(url)
    main_soup = BeautifulSoup(main.text,'lxml') 
    npages=main_soup.find('li',class_="pager-item last")
    if npages:
        npages = npages.get_text()
    else:
        npages=main_soup.find('li',class_="pager-last last")
        if npages:
            npages = npages = npages.get_text()
        else:
            npages = 1
    npages_int = int(npages)
    
    # then extract the urls for each page, and print the theme for monitoring
    page_urls = [url, theme]
    print('theme:  ',theme)
    
    # then run over each such page and get the poem urls
    for i in np.arange(1,npages_int):
        page_url = url + "&page=" + str(i)
        page_content = get_poem_urls(page_url,theme)
        page_df = pd.DataFrame(page_content,columns=['year','title','url','author','theme'])
        
        # writing to disc after each page.
        with open(outfile, 'a') as f:
            page_df.to_csv(f, header=False,index=False)

theme:   Cities
theme:   Dogs
theme:   Love


Finally, turn all this info into a data frame and save it as a pickle file. 

In [151]:
df_poetsorg_urls = pd.read_csv(outfile,names=['year','title','url','author','theme'])
df_poetsorg_urls.to_pickle('poetsorg_urls.pkl')
df_poetsorg_urls.to_csv('poetsorg_urls.csv',index=False)

In [152]:
df_poetsorg_urls

Unnamed: 0,year,title,url,author,theme
0,1917.0,Dawn,/poetsorg/poem/dawn-2,John Gould Fletcher,Cities
1,2017.0,Daily Conscription,/poetsorg/poem/daily-conscription,Kyle Dargan,Cities
2,1920.0,Recuerdo,/poetsorg/poem/recuerdo-0,Edna St. Vincent Millay,Cities
3,1917.0,"[Immortal?... No,]",/poetsorg/poem/immortal-no,F. S. Flint,Cities
4,2017.0,from “The Last Bohemian of Avenue A”,/poetsorg/poem/last-bohemian-avenue,Yusef Komunyakaa,Cities
5,2009.0,The City Outside My Ear,/poetsorg/poem/city-outside-my-ear,Michael Luis Medrano,Cities
6,1922.0,On Broadway,/poetsorg/poem/broadway-0,Claude McKay,Cities
7,1917.0,"[London, my beautiful]",/poetsorg/poem/london-my-beautiful,F. S. Flint,Cities
8,1919.0,Solitare,/poetsorg/poem/solitare,Amy Lowell,Cities
9,2017.0,In the City,/poetsorg/poem/city-2,Chen Chen,Cities


### Part II:  scrape the poems


OK, now we've assembled an exhaustive list of poem urls, we can scrape the poems themselves!  This will really take a long time:  there are over 10000 labelled poems.  (*Suggestion:  run the cell just before you go to bed and check it in the morning.*)

The script writes poems to file every so often.  You can adjust how often it writes and prints out progress information. If y

In [176]:
# define batch size for writing to file and printing progress
batchsize=2
outfile = 'poetsorg_poems.csv'
check_filename(outfile)

# create a dataframe to hold the poems, indexed by url
df_poetsorg_poems = pd.DataFrame(columns=['url','poem'])
df_poetsorg_poems['url']=poetsorg_urls['url']

# iterate over all poems
counter = 0
for row in poetsorg_poems[0:2*batchsize+1].itertuples():
    
    # connecting poem page
    url = 'https://www.poets.org' + row.url
    main = requests.get(url)
    main_soup = BeautifulSoup(main.text,'lxml') 

    # finding the relevant div snippet, printing error message if we can't find it
    poem_div=main_soup.find('div',class_="field field-name-body field-type-text-with-summary field-label-hidden")
    if poem_div:
        text = poem_div.get_text(" ")
    else:
        print('missed {}'.format(url))
        
    # store poem in dataframe
    df_poetsorg_poems.loc[row.Index,'poem'] = text
    
    counter+=1
    #every 50 poems, write to file and print out counter 
    if np.mod(counter,batchsize)==0:
        with open(outfile, 'a') as f:
            df_poetsorg_poems[counter-batchsize:counter].to_csv(f, header=False,index=False)
        print("counter: ", counter)

# one last print statement to get residuals
with open(outfile, 'a') as f:
            df_poetsorg_poems[counter-np.mod(counter,batchsize):counter+1].to_csv(f, header=False,index=False)

does poetsorg_poems.csv.bak exist? False
counter:  2
counter:  4


Turn list into a datafame and save.  Note that linebreaks are saved.  These might be useful for subsequent formatting, but will need to be cleaned out for processing.

In [175]:
df = pd.DataFrame(all_poems,columns=['url','text'])
df.to_csv('poetsorg_full.csv',index=False)
df.to_pickle('poetsorg_full.pkl',index=False)
df.head()

NameError: name 'all_poems' is not defined

## Part III:  clean the data!

In [None]:
df = pd.read_csv('poetsorg_full.csv')
print("A total of {} poems!".format(len(df)))

Drop any record with missing values.  

In [None]:
df = df.dropna(axis=0,subset=['poem','theme'],how='any')
len(df)

Remove carriage returns.

In [None]:
df = df.replace({r'\r|\n': ' '}, regex=True)
df.poem = df.poem.str.strip()
df.head()

Generate and store length information for each poem

In [None]:
# calculate length for each poem, and plot it in a histogram
df['lengths'] = df['poem'].map(lambda x: len(x))
max_length = 3000
plt.hist(df.lengths.loc[df.lengths<max_length],20)

In [None]:
grouped = df.groupby('theme')
grouped.count().lengths[0:5]

Store the clean data

In [None]:
df.to_csv('poetsorg_clean.csv',index=False)
df.to_pickle('poetsorg_clean.pkl')

In [99]:
poetsorg_poems[0:3]

Unnamed: 0,url,poem
0,/poetsorg/poem/dark-tree-cold-sea,although I know you can never be found althoug...
1,/poetsorg/poem/sky-0,"Whatever I care for, someone else loves it mor..."
2,/poetsorg/poem/afterlife-fame,is dark a neglected ...


In [98]:
df_poetsorg_poems.iloc[0,:]

url                     /poetsorg/poem/dark-tree-cold-sea
poem    "Nam Sibyllam quidem Cumis ego ipse oculis mei...
Name: 0, dtype: object