# Webscraper:
A general rule of data science is that no project can begin without data, and this project is no exception. Unfortunately, I could not find a suitable digital corpus of poetry with text from different eras, so I collected my own. What follows is the code for my webscraper, along with some initial data cleaning.

I settled on a website that contained many poems from different eras. The challenge with scraping this website was that neither the HTML structure nor the URL scheme was designed very cleanly, so it wasn't a simple scrape. To collect the URLs for each poem, I first scraped the URLs for each poet's landing page, which led to a page containing the links for that poet's poems. I then scraped the links for each poem, ending with a list of about 32,000 URLs. Finally, I scraped each of those URLs to collect the text of the poem embedded in the HTML. BeautifulSoup, the Python package I used for manipulating HTML, has a convenient method (`get_text()`) for getting all the actual text from HTML, but in this situation, calling it with its defaults would have pushed the ends of the lines of each poem together, causing me to lose some words. Instead, I decided to clean each poem manually, using Python functions to extract the poem text from the HTML. This process was more time-consuming, but considering the importance of every word in this stylistic analysis, I decided it was worth the time.
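
To make the line-joining problem concrete, here is a minimal sketch on a made-up snippet (the markup is illustrative, not copied from the site). It also shows the `separator` argument of `get_text()`, which is one possible alternative to manual cleaning; whether it would cope with this site's actual markup is not something this sketch tests.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two lines of verse separated only by a <br> tag.
html = "<div>roses are red<br/>violets are blue</div>"
div = BeautifulSoup(html, "html.parser").div

# With the defaults, <br> contributes no text, so the two lines fuse
# and the words at the line boundary run together.
joined = div.get_text()

# With a separator, a newline is inserted between adjacent text nodes.
separated = div.get_text(separator="\n")

print(repr(joined))
print(repr(separated))
```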

In [None]:
## Imports.
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [None]:
all_poets = requests.get('http://www.famouspoetsandpoems.com/poets.html')

In [None]:
poets = BeautifulSoup(all_poets.text, 'lxml')

### Getting years for each poet
In the code below, I scrape the web page that lists the years of each poet's life. This is how I was able to separate the poems into eras. It isn't as clean as using the publication date of each poem, but I felt it was justified for two reasons:
+ Retrieving the publication year of each poem would be a very time-consuming process, assuming that the dates even exist for each poem.
+ I made what I believe to be a safe assumption: a poet's style likely follows the style of their era. It is unlikely that they would change their style as the style changes around them; more likely, their style stays consistent throughout their life.
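
As a rough illustration of how a poet's years could be turned into an era label, here is a minimal sketch. It assumes each scraped entry looks like `name (number) (birth - death)`, which is inferred from the column names used later in this notebook. The sample string, the helper names `life_midpoint` and `era_for`, the use of the life midpoint, and the era boundaries are all hypothetical and not necessarily what the analysis uses.

```python
import re

# Assumed entry format: "<name> (<number>) (<birth> - <death>)".
# Only a pair of four-digit years inside parentheses is matched.
YEARS = re.compile(r"\((\d{4})\s*-\s*(\d{4})\)")


def life_midpoint(entry):
    """Return the midpoint year of a poet's life, or None if no year range is found."""
    match = YEARS.search(entry)
    if match is None:
        return None
    birth, death = int(match.group(1)), int(match.group(2))
    return (birth + death) // 2


def era_for(entry, boundaries=((1700, "pre-1700"), (1900, "1700-1899"))):
    """Map an entry to an era label; the default boundaries are hypothetical."""
    mid = life_midpoint(entry)
    if mid is None:
        return None
    for upper, label in boundaries:
        if mid < upper:
            return label
    return "1900 onward"


# Made-up entry, used only to exercise the functions.
print(era_for("Example Poet (12) (1757 - 1827)"))
```

Entries without a recognizable year range return `None`, so they can be inspected by hand rather than silently assigned to an era.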

In [None]:
poet_years = []

for tag in poets.find_all('td'):
    # Keep only the cells whose text contains a parenthesis,
    # which is where the years appear.
    text = tag.get_text()
    if '(' in text:
        poet_years.append(text.strip())

In [None]:
poet_years2 = [x.strip() for x in poet_years]

In [None]:
poet_years2 = poet_years2[3:]

In [None]:
poet_years = poet_years2[::2]

In [None]:
poet_years

In [None]:
poets_and_years = []
for i in poet_years:
    poets_and_years.append(i.split('('))

In [None]:
poets_and_years_df = pd.DataFrame(poets_and_years, columns=['name', 'number', 'years', 'blech'])

In [None]:
poets_and_years_df.drop(['blech'], axis=1, inplace=True)

In [None]:
poets_and_years_df.to_csv('poets_years.csv', index=False)

### Scraping a list of links to each poet's page.
The website I'm retrieving my poems from required a multi-step scrape. First, I collected links to each poet's landing page. Each of these pages contains links to all of that poet's poems, so once I had all the poet pages, I collected the links to all of their poems. Then I scraped the actual poem from each of those pages. In total, I scraped over 32,000 pages.
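
The link-collection step can be sketched on a small, made-up table as follows. The helper name `collect_links`, the constant `BASE`, and the sample markup are hypothetical; unlike the notebook code, this version looks at every `<a>` tag rather than the first one in each `<td>`, and it uses `urljoin` to build absolute URLs.

```python
# Python 3 import; on Python 2 the equivalent is `from urlparse import urljoin`.
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://www.famouspoetsandpoems.com"

# Made-up index markup; the real page layout may differ.
html = """
<table><tr>
  <td><a href="/poets/example_poet">Example Poet</a></td>
  <td><a href="/about.html">About</a></td>
  <td>no link here</td>
  <td><a href="/poets/example_poet">Example Poet</a></td>
</tr></table>
"""


def collect_links(markup, marker):
    """Return sorted, de-duplicated absolute URLs whose href contains `marker`."""
    soup = BeautifulSoup(markup, "html.parser")
    # href=True skips <a> tags that have no href attribute at all.
    hrefs = {a["href"] for a in soup.find_all("a", href=True) if marker in a["href"]}
    return sorted(urljoin(BASE, href) for href in hrefs)


print(collect_links(html, "/poets/"))
```

The same helper could be reused with `"/poems/"` as the marker on a poet's poem-list page.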

In [None]:
poet_links = []

for tag in poets.find_all('td'):
    try:
        link = tag.find('a')['href']
    except (TypeError, KeyError):
        # No <a> in this cell, or the <a> has no href.
        continue
    if '/poets/' in link:
        poet_links.append(link)

poet_links = list(set(poet_links))
poet_links

This function collects all the links to individual poems from a single poet's landing page.

In [None]:
def get_poems(link):
    """Return the absolute URLs of all poems linked from a poet's poem-list page."""
    poetry = requests.get(link)
    bib_soup = BeautifulSoup(poetry.text, 'lxml')

    poem_links = set()
    for cell in bib_soup.find_all('td'):
        try:
            href = cell.find('a')['href']
        except (TypeError, KeyError):
            # No <a> in this cell, or the <a> has no href.
            continue
        if '/poems/' in href:
            poem_links.add(href)

    # The hrefs are site-relative, so prepend the site root.
    return ['http://www.famouspoetsandpoems.com' + href for href in poem_links]

In [None]:
raw = 'http://www.famouspoetsandpoems.com'
poem_links = []
for poet in poet_links:
    link = raw + poet + '/poems'
    poem_links.append(get_poems(link))

In [None]:
poem_links[0:5]

In [None]:
poem_links_list = [item for sublist in poem_links for item in sublist]

In [None]:
poem_links_list

### Scraping all the poems from the site:
Below, I go to each link in the list of about 32,000 pages and scrape the actual poem from the page.
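
Because a scrape of this size can be interrupted partway through, the loop in this section slices the link list with `poem_links_list[len(poems_soup):]` so that re-running the cell picks up where it left off. The sketch below isolates that idea and adds basic error handling. The function `scrape_all`, its `fetch` and `delay` parameters, and the fake fetcher are hypothetical names introduced here for illustration; this is not the notebook's actual loop.

```python
import time


def scrape_all(links, results, fetch, delay=0.0):
    """Fetch every link not yet processed, appending (link, page) pairs to `results`.

    `fetch` is any callable that takes a URL and returns the page text. A failed
    fetch is recorded as None so positions in `results` stay aligned with `links`.
    """
    for link in links[len(results):]:  # skip links handled on an earlier run
        try:
            page = fetch(link)
        except Exception:
            page = None  # keep going; the gap can be retried later
        results.append((link, page))
        if delay:
            time.sleep(delay)  # pause between requests
    return results


# Demo with a fake fetcher so the sketch runs without network access.
def fake_fetch(url):
    if url.endswith("bad"):
        raise IOError("simulated failure")
    return "<html>" + url + "</html>"


done = [("u1", "<html>u1</html>")]  # pretend one page was scraped earlier
scrape_all(["u1", "u2", "bad"], done, fake_fetch)
print(done)
```

With the real scraper, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, and a nonzero `delay` keeps the request rate modest.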

In [None]:
poems_soup = []

In [None]:
# Slicing by len(poems_soup) lets this cell resume after an interruption.
for count, link in enumerate(poem_links_list[len(poems_soup):]):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'lxml')
    poem = soup.find('div', style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
    poet = None  # reset so a page without a byline doesn't reuse the previous poet
    for tag in soup('span'):
        if 'by' in tag.get_text():
            poet = tag.get_text().strip()
    poems_soup.append([poem, poet])
    # Progress indicator.
    if count % 500 == 0:
        print(count)

In [None]:
## Creating a DataFrame from all the poems and poets.
df_all = pd.DataFrame(poems_soup)

In [None]:
df_all.head()

In [None]:
df_all.to_csv('all_poets.csv')

In [None]:
## Just a peek at what we're dealing with. It will need cleaning.
for i in df_all[0][0:5]:
    print(i)

In [None]:
len(poems_soup)