# Webscraper:
A general rule of data science is that no project can begin without data, and this one is no exception. Unfortunately, I could not find a suitable digital corpus of poetry spanning different eras, so I collected my own. What follows is the code for my webscraper and some initial data cleaning.

I settled on a website that contained many poems from different eras. The challenge was that neither the HTML structure nor the URL scheme was designed very cleanly, so this was not a simple scrape. To collect the URLs for each poem, I first scraped the URLs for each poet's landing page, each of which leads to a page containing the links to that poet's poems. I then scraped those links, ending with a list of about 32,000 URLs. Finally, I scraped each of those URLs to collect the text of the poem embedded in the HTML. BeautifulSoup, the Python package I used to parse the HTML, has a convenient method (`get_text`) for pulling all the text out of a tag, but in this situation it would have pushed the ends of the lines of each poem together, fusing the last word of one line with the first word of the next and costing me words. Instead, I decided to write my own Python functions to extract the poem text from the HTML. This was more time-consuming, but since every word matters in this stylistic analysis, it was worth the time.
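
To make the line-joining problem concrete, here is a minimal sketch on a made-up HTML fragment (not taken from the site). It assumes the poem lines are separated by `<br/>` tags, which is an assumption about the markup. The `separator` argument to `get_text` is shown only for comparison; the cleaning in this project was done with custom functions instead.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two lines of verse separated by a <br/> tag.
html = '<div>first line ends here<br/>second line starts</div>'
div = BeautifulSoup(html, 'html.parser').div

# With no separator, the text nodes are concatenated directly,
# so "here" and "second" fuse into one token.
joined = div.get_text()

# With a separator, each text node stays on its own line.
separated = div.get_text(separator='\n')

print(repr(joined))
print(repr(separated))
```

Whether a separator alone would be enough depends on how consistently the site marks line breaks, which is why a manual cleaning pass can still be the safer choice.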

In [1]:
## Imports.
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
all_poets = requests.get('http://www.famouspoetsandpoems.com/poets.html')

In [3]:
poets = BeautifulSoup(all_poets.text, 'lxml')

### Getting years for each poet
In the code below, I scrape the web page that lists the years of each poet's life. I used these years to separate the poems into eras. This is not as clean as using the publication date of each poem, but I felt it was justified for two reasons:
+ Retrieving the publication year of each poem would be very time-consuming, assuming the dates even exist for every poem.
+ I made what I believe to be a safe assumption: a poet's style likely follows the style of their era. It is unlikely that a poet would change style as the styles around them change; more likely, their style stays fairly consistent throughout their life. Going forward, I intend to find a better way to classify the poets and their works.

In [4]:
poet_years = []

# Keep only the table cells whose text contains a parenthesis.
for tag in poets.findAll('td'):
    text = tag.get_text()
    if '(' in text:
        poet_years.append(text.strip())

In [5]:
poet_years2 = [x.strip() for x in poet_years]

In [6]:
poet_years2 = poet_years2[3:]

In [7]:
poet_years = poet_years2[::2]

In [8]:
poet_years

[u'Maya Angelou (18)(1928 - present)',
 u'Margaret Atwood (28)(1939 - present)',
 u'Matthew Arnold (45)(1822 - 1888)',
 u'Yehuda Amichai (38)(1924 - 2000)',
 u'Anna Akhmatova (26)(1889 - 1966)',
 u'Ai (1)(1947 - 2010)',
 u'John Ashbery (9)(1927 - present)',
 u'A. R. Ammons (26)(1926 - 2001)',
 u'Jane Austen (13)(1775 - 1817)',
 u'Guillaume Apollinaire (7)(1880 - 1918)',
 u'Dante Alighieri (10)(1265 - 1321)',
 u'Louisa May Alcott (16)(1832 - 1888)',
 u'Richard Aldington (12)(1892 - 1962)',
 u'Lascelles Abercrombie (6)(1881 - 1938)',
 u'Sarah Flower Adams (6)(1805 - 1848)',
 u'Delmira Agustini (1)(1886 - 1914)',
 u'William Allingham (31)(1821 - 1889)',
 u'Deborah Ager (10)(1971 - present)',
 u'Conrad Aiken (64)(1889 - 1973)',
 u'Catherine Anderson (3)(1954 - present)',
 u'Maggie Anderson (1)(1948 - present)',
 u'Ralph Angel (7)(1951 - present)',
 u'Kingsley Amis (1)(1922 - 1995)',
 u'Jean Hans Arp (1)(1886 - 1966)',
 u'Kelli Russell Agodon (6)(1969 - present)',
 u'Julie Hill Alger (4)(19

In [9]:
poets_and_years = []
for i in poet_years:
    poets_and_years.append(i.split('('))

In [10]:
poets_and_years_df = pd.DataFrame(poets_and_years, columns=['name', 'number', 'years', 'blech'])

In [11]:
poets_and_years_df.drop(['blech'], axis=1, inplace=True)

In [12]:
poets_and_years_df.to_csv('poets_years.csv', index=False)
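
Splitting on `'('` leaves a trailing `)` on the count and the years, and a trailing space on the name. As an alternative, here is a minimal sketch that parses an entry with a regular expression. The function name `parse_entry` and the pattern are my own illustration, and they assume every entry has the shape `Name (count)(years)` seen in the output above; entries that do not fit return `None`.

```python
import re

# Assumed entry shape: 'Name (count)(years)', e.g. 'Ai (1)(1947 - 2010)'.
ENTRY = re.compile(r'^(?P<name>.+?)\s*\((?P<number>\d+)\)\((?P<years>[^)]*)\)$')

def parse_entry(entry):
    """Return a dict with name, number and years, or None if the shape differs."""
    match = ENTRY.match(entry.strip())
    if match is None:
        return None
    return {
        'name': match.group('name'),
        'number': int(match.group('number')),
        'years': match.group('years'),
    }

print(parse_entry('Maya Angelou (18)(1928 - present)'))
print(parse_entry('not a poet entry'))
```

A list of these dicts, with the `None` results filtered out, can be passed to `pd.DataFrame` directly, which would also make the throwaway fourth column unnecessary.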

### Scraping a list of links to each poet's page.
The site required a multi-step scrape. First, I collected the links to each poet's landing page. Each landing page contains links to all of that poet's poems, so I then collected those links from every poet page. Finally, I scraped the actual poem from each of those pages. In total, I scraped over 32,000 pages.
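
The link-collection step can be sketched on a small, made-up HTML table. This is an illustration rather than the code used for the scrape: `extract_links` and the sample markup are hypothetical, it is written for Python 3 (`urllib.parse`), and it looks at every anchor on the page instead of only the first anchor in each table cell.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = 'http://www.famouspoetsandpoems.com'

def extract_links(html, marker, base=BASE):
    """Return sorted, de-duplicated absolute URLs whose href contains `marker`."""
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    # href=True skips anchors without an href, so no try/except is needed.
    for anchor in soup.find_all('a', href=True):
        if marker in anchor['href']:
            links.add(urljoin(base, anchor['href']))
    return sorted(links)

# Hypothetical markup with a duplicate link, an unrelated link and an empty cell.
sample = '''<table><tr>
<td><a href="/poets/conrad_aiken">Conrad Aiken</a></td>
<td><a href="/poets/conrad_aiken">Conrad Aiken (64)</a></td>
<td><a href="/about.html">About</a></td>
<td>no link here</td>
</tr></table>'''

print(extract_links(sample, '/poets/'))
```

The same helper would serve both stages of the crawl by changing the marker from `'/poets/'` to `'/poems/'`.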

In [13]:
poet_links = []

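for tag in poets.findAll('td'):
    anchor = tag.find('a')
    # Skip cells with no link, or with an anchor that has no href.
    if anchor is None or not anchor.get('href'):
        continue
    if '/poets/' in anchor['href']:
        poet_links.append(anchor['href'])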

poet_links = list(set(poet_links))
poet_links

['/poets/rebecca_elson',
 '/poets/james_thomson',
 '/poets/anne_kingsmill_finch',
 '/poets/julia_ward_howe',
 '/poets/conrad_aiken',
 '/poets/forrest_hamer',
 '/poets/david_lehman',
 '/poets/maxine_kumin',
 '/poets/thomas_chatterton',
 '/poets/patricia_goedicke',
 '/poets/margaret_atwood',
 '/poets/vanessa_perkins',
 '/poets/robert_william_service',
 '/poets/dimitris_varos',
 '/poets/gertrude_stein',
 '/poets/carolyn_forche',
 '/poets/edwin_muir',
 '/poets/n__k__osho',
 '/poets/howard_nemerov',
 '/poets/seamus_heaney',
 '/poets/amy_levy',
 '/poets/mark_strand',
 '/poets/william_vaughn_moody',
 '/poets/sir_thomas_wyatt',
 '/poets/william_ernest_henley',
 '/poets/john_trumbull',
 '/poets/elizabeth_barrett_browning',
 '/poets/hayden_carruth',
 '/poets/joy_harjo',
 '/poets/marianne_moore',
 '/poets/ruth_padel',
 '/poets/graham_burchell',
 '/poets/d__c__berry',
 '/poets/galway_kinnell',
 '/poets/lawrence_ferlinghetti',
 '/poets/robert_louis_stevenson',
 '/poets/catherine_anderson',
 '/poets

This function collects all the links for individual poems from each poet's landing page. 

In [14]:
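def get_poems(link):
    """Return the absolute URLs of all poems linked from a poet's poem-list page."""
    poetry = requests.get(link)
    bib_soup = BeautifulSoup(poetry.text, 'lxml')

    poem_links = []
    for cell in bib_soup.findAll('td'):
        anchor = cell.find('a')
        # Skip cells with no link, or with an anchor that has no href.
        if anchor is None or not anchor.get('href'):
            continue
        if '/poems/' in anchor['href']:
            poem_links.append(anchor['href'])

    # Remove duplicates, then prepend the site root to each relative link.
    poem_links = list(set(poem_links))
    return ['http://www.famouspoetsandpoems.com' + poem for poem in poem_links]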

In [16]:
from time import time
start = time()
raw = 'http://www.famouspoetsandpoems.com'
poem_links = []
for poet in poet_links:
    link = raw + poet + '/poems'
    poem_links.append(get_poems(link))
print((time() - start) / 60)

4.58135058482


In [17]:
poem_links[:5]

[['http://www.famouspoetsandpoems.com/poets/rebecca_elson/poems/22156',
  'http://www.famouspoetsandpoems.com/poets/rebecca_elson/poems/22157'],
 ['http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4078',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4088',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4080',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4083',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4082',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4085',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4084',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4087',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4086',
  'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4089'],
 ['http://www.famouspoetsandpoems.com/poets/anne_kingsmill_finch/poems/7382',
  'http://www.famouspoetsandpoems.com/poets/anne_kingsmill_finch/p

In [18]:
poem_links_list = [item for sublist in poem_links for item in sublist]

In [19]:
poem_links_list[:5]

['http://www.famouspoetsandpoems.com/poets/rebecca_elson/poems/22156',
 'http://www.famouspoetsandpoems.com/poets/rebecca_elson/poems/22157',
 'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4078',
 'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4088',
 'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4080']

### Scraping all the poems from the site:
Below, I go to each link in the list of 32,000 pages and scrape the actual poem from the page. 

In [20]:
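start = time()
poems_soup = []
count = 0
# Slicing by len(poems_soup) lets the loop resume where it stopped,
# provided poems_soup is not reset to [] before re-running.
for link in poem_links_list[len(poems_soup):]:
    url = requests.get(link)
    soup = BeautifulSoup(url.text, 'lxml')
    # The poem text sits in a div that is identified only by its inline style.
    poem = soup.find('div', style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
    # Reset so a page without a byline does not reuse the previous poet.
    poet = None
    for tag in soup('span'):
        if 'by' in tag.get_text():
            poet = tag.get_text().strip()
    poems_soup.append([poem, poet])
    if count % 500 == 0:
        print(count)
    count += 1
print((time() - start) / 60)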

0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
6000
6500
7000
7500
8000
8500
9000
9500
10000
10500
11000
11500
12000
12500
13000
13500
14000
14500
15000
15500
16000
16500
17000
17500
18000
18500
19000
19500
20000
20500
21000
21500
22000
22500
23000
23500
24000
24500
25000
25500
26000
26500
27000
27500
28000
28500
29000
29500


ConnectionError: HTTPConnectionPool(host='www.famouspoetsandpoems.com', port=80): Max retries exceeded with url: /poets/william_henry_davies/poems/3054 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x4784fd8d0>: Failed to establish a new connection: [Errno 60] Operation timed out',))
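
The run above stopped on a `ConnectionError` shortly after page 29,500. One way to keep a long scrape alive is to retry each request a few times and record a placeholder when a page still fails. The sketch below is a hedged illustration, not the code used here: `fetch_with_retry`, `scrape_all`, `attempts` and `wait` are hypothetical names, and `fetch` stands in for whatever function downloads a page, so the example runs without network access.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, wait=1.0):
    """Call fetch(url), retrying on any exception; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:  # with requests, narrow this to requests.exceptions.RequestException
            if attempt == attempts:
                raise
            time.sleep(wait * attempt)  # wait a little longer after each failure

def scrape_all(fetch, links, results=None, attempts=3, wait=1.0):
    """Fetch every link not yet covered by `results`; failed links are stored as None."""
    results = [] if results is None else results
    for link in links[len(results):]:  # resume after the entries already collected
        try:
            results.append(fetch_with_retry(fetch, link, attempts, wait))
        except Exception:
            results.append(None)  # keep positions aligned with `links`
    return results

# Stand-in for a real download: 'b' fails twice before succeeding, 'c' always fails.
calls = {}
def flaky_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url == 'b' and calls[url] < 3:
        raise IOError('timed out')
    if url == 'c':
        raise IOError('timed out')
    return url.upper()

pages = scrape_all(flaky_fetch, ['a', 'b', 'c'], wait=0)
print(pages)
```

With `requests`, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. Passing the partially filled list back in as `results` continues from the first link that has not been attempted yet.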

In [None]:
## Creating a DataFrame from all the poems and poets.
df_all = pd.DataFrame(poems_soup)

In [None]:
df_all.head()

In [None]:
df_all.to_csv('all_poets.csv')

In [None]:
## Just a peek at what we're dealing with. It will need cleaning.
for i in df_all[0][0:5]:
    print(i)

In [None]:
len(poems_soup)