# Webscraper:
A general rule of data science is that no project can begin without data, and this one is no exception. Unfortunately, I could not find a suitable digital corpus of poetry spanning different eras, so I had to collect my own. What follows is the code for my webscraper, along with some initial data cleaning.

I settled on a website that contained many poems from different eras. Neither its HTML structure nor its URL scheme was designed very cleanly, so this wasn't a simple scrape. To collect the URL for each poem, I first scraped the URLs of the poets' landing pages, each of which links to that poet's poems. I then scraped those poem links, ending up with a list of about 32,000 URLs, and finally scraped each of those URLs to collect the poem text embedded in the HTML.

BeautifulSoup, the Python package I used for parsing HTML, has a convenient `get_text()` method that returns all the text in a piece of HTML. In this situation, though, it would have pushed the end of each line of a poem up against the start of the next, fusing words together and costing me some of them. Instead, I wrote my own Python functions to extract the poem text from the HTML. This was more time-consuming, but since every word matters in this stylistic analysis, it was worth the time.
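To make the line-joining problem concrete, here is a minimal sketch on a made-up HTML fragment (the markup and the text are assumptions, not copied from the site). It shows `get_text()` gluing two lines together at a `<br/>` tag, and the `separator` argument as one possible workaround; the rest of this notebook cleans the HTML manually instead.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two lines of verse separated only by a <br/> tag.
html = '<div>first line of verse<br/>second line of verse</div>'
soup = BeautifulSoup(html, 'html.parser')

# Default get_text() concatenates the text nodes with no separator,
# so the last word of one line fuses with the first word of the next.
joined = soup.get_text()

# Passing a separator keeps the line boundary visible.
separated = soup.get_text(separator='\n')

print(repr(joined))      # 'first line of versesecond line of verse'
print(repr(separated))   # 'first line of verse\nsecond line of verse'
```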

In [1]:
## Imports.
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
from time import time
# tqdm_notebook is deprecated in recent tqdm releases; tqdm.notebook is its replacement.
from tqdm.notebook import tqdm

In [2]:
# Url contains list of all poets on site.
all_poets_page = requests.get('http://www.famouspoetsandpoems.com/poets.html')
all_poets_page = BeautifulSoup(all_poets_page.text, 'lxml')

### Getting years for each poet
In the code below, I scrape the web page that lists the years of each poet's life. This is how I separated the poems into eras. It isn't as clean as using the publication date of each poem, but I felt it was justified for two reasons:
+ Retrieving the publication year of each poem would be very time-consuming, assuming the dates even exist for every poem.
+ I made what I believe to be a safe assumption: a poet's style likely follows the style of the era they lived in. It is unlikely that they would change their style as fashions change around them; more likely, their style stays fairly consistent throughout their life. Going forward, I intend to find a better way to classify the poets and their works.
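One possible way to turn a poet's `years` string into something usable for bucketing is sketched below. The helper names, the `present_year` default, and the midpoint heuristic are illustrative assumptions, not necessarily the method used later in this project.

```python
import re

def parse_years(years, present_year=2018):
    """Turn a string such as '1822 - 1888' into (birth, death) integers.

    present_year is an assumed stand-in for living poets ('present').
    """
    found = [int(year) for year in re.findall(r'\d{3,4}', years)]
    if not found:
        return None, None
    birth = found[0]
    if len(found) > 1:
        death = found[1]
    elif 'present' in years.lower():
        death = present_year
    else:
        death = None
    return birth, death

def representative_year(birth, death):
    """Midpoint of the poet's life, used here as a rough proxy for era."""
    if birth is None or death is None:
        return None
    return (birth + death) // 2

print(parse_years('1822 - 1888'))       # (1822, 1888)
print(representative_year(1822, 1888))  # 1855
```

A single representative year per poet could then be compared against whatever era boundaries the analysis settles on.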

In [3]:
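def grab_poets_and_years(all_poets_page):
    # Cells containing '(' hold entries such as 'Name (18)(1928 - present)'.
    poets = list()
    for tag in all_poets_page.find_all('td'):
        if '(' in tag.get_text():
            poets.append(tag.get_text().strip())
    # Site-specific cleanup: drop the first three matching cells,
    # then keep every other entry.
    poets = poets[3:]
    poets = poets[::2]
    return poets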

In [4]:
poets = grab_poets_and_years(all_poets_page)
poets[0:5]

[u'Maya Angelou (18)(1928 - present)',
 u'Margaret Atwood (28)(1939 - present)',
 u'Matthew Arnold (45)(1822 - 1888)',
 u'Yehuda Amichai (38)(1924 - 2000)',
 u'Anna Akhmatova (26)(1889 - 1966)']

In [5]:
def extract_poet_info(poet_string):
    # Everything before the first '(' is the poet's name.
    poet_name = re.findall(r'^[^(]+', poet_string)[0].strip()
    # The two parenthesized groups are the poem count and the life span.
    groups = re.findall(r'\((.*?)\)', poet_string)
    number_of_poems, poet_years = groups[0], groups[1]
    return poet_name, number_of_poems, poet_years

In [6]:
poets_info = [extract_poet_info(poet) for poet in poets]
poets_df = pd.DataFrame(poets_info, columns=['name', 'number', 'years'])
poets_df.shape

(631, 3)

In [7]:
poets_df.to_csv('poets_info.csv', index=False)

### Scraping a list of links to each poet's page.
The website I'm retrieving my poems from required a multi-stage scrape. First, I collect the links to each poet's landing page, which in turn links to all of that poet's poems. Next, I collect the links to every poem from those landing pages. Finally, I scrape the poem itself from each poem page. In total, I scraped over 32,000 pages.

In [8]:
def extract_poet_links(all_poets_page):
    poet_links = list()
    for tag in all_poets_page.find_all('td'):
        try:
            link = tag.find('a')['href']
        except (TypeError, KeyError):
            # No <a> tag in this cell, or the <a> tag has no href.
            continue
        if '/poets/' in link:
            poet_links.append(link)
    # Remove duplicates (note that a set does not preserve page order).
    poet_links = list(set(poet_links))
    base = 'http://www.famouspoetsandpoems.com'
    poet_pages = [base + poet + '/poems' for poet in poet_links]
    return poet_pages

In [9]:
all_poet_pages = extract_poet_links(all_poets_page)
all_poet_pages[:5]

['http://www.famouspoetsandpoems.com/poets/rebecca_elson/poems',
 'http://www.famouspoetsandpoems.com/poets/james_thomson/poems',
 'http://www.famouspoetsandpoems.com/poets/anne_kingsmill_finch/poems',
 'http://www.famouspoetsandpoems.com/poets/julia_ward_howe/poems',
 'http://www.famouspoetsandpoems.com/poets/conrad_aiken/poems']

**This function collects all the links to individual poems from a single poet's landing page.**

In [10]:
def get_poems(poet_page):
    response = requests.get(poet_page)
    bib_soup = BeautifulSoup(response.text, 'lxml')

    raw_poem_links = list()
    for cell in bib_soup.find_all('td'):
        try:
            poem = cell.find('a')['href']
        except (TypeError, KeyError):
            # No <a> tag in this cell, or the <a> tag has no href.
            continue
        if '/poems/' in poem:
            raw_poem_links.append(poem)
    raw_poem_links = list(set(raw_poem_links))
    poem_links = ['http://www.famouspoetsandpoems.com' + poem for poem in raw_poem_links]
    return poem_links

In [11]:
start = time()
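# A list comprehension runs the requests immediately, so the timing below
# is meaningful on Python 3 as well (where map() is lazy).
all_poem_links = [get_poems(page) for page in all_poet_pages]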
print((time() - start) / 60)

4.41807373365


In [12]:
all_poem_links_list = [item for sublist in all_poem_links for item in sublist]
all_poem_links_list[:5]

['http://www.famouspoetsandpoems.com/poets/rebecca_elson/poems/22156',
 'http://www.famouspoetsandpoems.com/poets/rebecca_elson/poems/22157',
 'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4078',
 'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4088',
 'http://www.famouspoetsandpoems.com/poets/james_thomson/poems/4080']

### Scraping all the poems from site:
Below, I go to each link in the list of 32,000 pages and scrape the actual poem from the page. 

In [None]:
start = time()
poems_soup = []
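for link in tqdm(all_poem_links_list, desc='Scrape all poems'):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'lxml')
    # The poem body sits in a div identified only by its inline style.
    poem = soup.find('div', style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
    # Reset the byline for every page so that a page without one
    # does not inherit the poet from the previous page.
    poet = None
    for tag in soup('span'):
        if 'by' in tag.get_text():
            poet = tag.get_text().strip()
    poems_soup.append([poem, poet])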
print((time() - start) / 60)
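A scrape of this size will likely hit the occasional timeout or dropped connection. A minimal sketch of a retry wrapper follows; `fetch_with_retries`, its parameters, and the growing pause between attempts are hypothetical, and the `get` argument is meant to be something like `requests.get`.

```python
import time

def fetch_with_retries(url, get, retries=3, delay=1.0, sleep=time.sleep):
    """Call get(url), retrying after a pause if it raises an exception.

    get   -- any callable that takes a URL, e.g. requests.get (assumed).
    sleep -- injectable so the pause can be skipped when testing.
    """
    for attempt in range(1, retries + 1):
        try:
            return get(url)
        except Exception:
            if attempt == retries:
                raise  # Out of attempts: surface the last error.
            sleep(delay * attempt)  # Wait a little longer each time.
```

In the loop above, `requests.get(link)` could then become `fetch_with_retries(link, requests.get)`, so a single failed request no longer stops the whole run.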

In [None]:
# If the scrape was interrupted, continue from where you left off.
# Do not reset poems_soup here: its length marks how far the scrape got.
start = time()
remaining = all_poem_links_list[len(poems_soup):]
for link in tqdm(remaining, desc='Resume scrape'):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'lxml')
    poem = soup.find('div', style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
    # Reset the byline for every page so a missing one is not carried over.
    poet = None
    for tag in soup('span'):
        if 'by' in tag.get_text():
            poet = tag.get_text().strip()
    poems_soup.append([poem, poet])
print((time() - start) / 60)
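Resuming by list length only helps while `poems_soup` is still in memory; if the kernel dies, the progress is gone. One option, sketched here under assumptions, is to append every result to a file as it arrives and skip the URLs already recorded on the next run. The function names, the JSON-lines layout, and the `scrape_one` callable are all hypothetical.

```python
import json
import os

def load_done(path):
    """Return the set of URLs already recorded in a JSON-lines file."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return {json.loads(line)['url'] for line in handle if line.strip()}

def scrape_with_checkpoint(links, scrape_one, path):
    """Scrape only the links not yet in `path`, appending as we go.

    scrape_one must return something JSON-serializable, for example
    the poem's HTML as a string rather than a BeautifulSoup object.
    """
    done = load_done(path)
    with open(path, 'a') as handle:
        for link in links:
            if link in done:
                continue
            record = {'url': link, 'html': scrape_one(link)}
            handle.write(json.dumps(record) + '\n')
            handle.flush()  # Push each record out promptly.
```

Calling `scrape_with_checkpoint` a second time with the same file then only fetches the links that are still missing.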

In [None]:
## Creating a DataFrame from all the poems and poets.
df_all = pd.DataFrame(poems_soup)
df_all.head()

In [None]:
df_all.to_csv('all_poets.csv', index=False)

In [None]:
## Just a peek at what we're dealing with. It will need cleaning.
for i in df_all[0][0:5]:
    print(i)

In [None]:
len(poems_soup)