# Scraper:  poetrysoup.com

This notebook scrapes the "Top 100 Short Poems" from www.PoetrySoup.com.

This notebook is divided into two parts:

1.  <a href='#scraping'> Scraping </a>
2.  <a href='#cleaning'> Cleaning </a>

<a id='scraping'></a>
## Part I:  Scraping

**Path and file names**

In [1]:
rootdir = '../data/poems'                    # root directory for data
csvdir   = rootdir + '/csv'            # subdirectory for csv files
pkldir   = rootdir + '/pkl'            # subdirectory for pkl files
rawpoems_fname = 'top100_raw'          # base name of variables and files containing raw, uncleaned poems
cleanpoems_fname = 'top100_clean'      # base name for cleaned poems
main_website = 'https://www.poetrysoup.com' # main website for these poems
top100 = '/famous/poems/short/top_100_famous_short_poems' # url extension with this collection
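
The save steps later in this notebook write into the csv and pkl subdirectories, and those writes fail if the folders do not exist yet. A minimal sketch of creating them up front, assuming the standard library `pathlib` module; `ensure_dirs` is a hypothetical helper name that is not part of the original notebook.

```python
from pathlib import Path

def ensure_dirs(root):
    """Create root/csv and root/pkl if they are missing; return both paths."""
    csv_path = Path(root) / 'csv'
    pkl_path = Path(root) / 'pkl'
    for path in (csv_path, pkl_path):
        # parents=True also creates the root; exist_ok=True makes reruns harmless
        path.mkdir(parents=True, exist_ok=True)
    return csv_path, pkl_path
```

Calling `ensure_dirs(rootdir)` once, before the first save, should be enough.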

**Import statements**:  pandas for the tables, Beautiful Soup for parsing HTML, and requests for fetching pages.

In [2]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import urllib

**Get landing page for poem links**

In [14]:
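# fetch the landing page that lists the top 100 poems
main = requests.get(main_website + top100 + '.aspx')
main.raise_for_status()                      # stop early on an HTTP error
main_soup = BeautifulSoup(main.text, 'lxml')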

**Scrape url, title, and author**

In [16]:
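rows = []                                  # one [url, title, author] entry per poem
all_items = main_soup.find_all('tr')
for item in all_items[1:]:                 # skip the first row (presumably the table header)
    info = item.find_all('a')              # first link: poem, second link: author
    url = info[0].get('href')
    title = info[0].get_text().strip()
    author = info[1].get_text().strip()
    rows.append([url, title, author])
df_rawpoems = pd.DataFrame(rows, columns=['url', 'title', 'author'])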
df_rawpoems.head()

Unnamed: 0,url,title,author
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali


**Now scrape off poems**

In [19]:
# store the text of each poem in a list
poems = []
for url in df_rawpoems['url']:
    page = requests.get(main_website + url)
    page_soup = BeautifulSoup(page.text, 'lxml')
    poem = page_soup.find_all('pre')       # the poem text is taken from the first <pre> tag
    poems.append(poem[0].get_text())
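
The loop above sends 100 requests back to back and stops at the first page that fails. A sketch of a gentler variant follows. It assumes a hypothetical `fetch` callable that returns the page text or raises (in this notebook that could be something like `lambda u: requests.get(u, timeout=10).text`); `fetch_all` and `delay` are illustrative names, not part of the original code.

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each url, pausing between requests.

    fetch is a hypothetical callable that returns the page text or raises.
    Returns one entry per url; pages that failed are stored as None.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)          # pause so the site is not hit in a tight loop
        try:
            pages.append(fetch(url))
        except Exception as err:       # broad on purpose: record the failure and move on
            print('failed:', url, err)
            pages.append(None)
    return pages
```

Keeping one entry per URL, even for failures, means the result still lines up with the rows of `df_rawpoems`.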

In [20]:
# add the poems as a new column
df_rawpoems['poem'] = poems

# take a peek
df_rawpoems.head()

Unnamed: 0,url,title,author,poem
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you?\r\nAre you nobody, to..."
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again.\r\nVeins collapse, open..."
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill\r\nand the power o...
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day\r\nknocked livi...
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee"


**Save to disk**

In [24]:
df_rawpoems.to_pickle(pkldir + '/' + rawpoems_fname + '.pkl')
df_rawpoems.to_csv(csvdir + '/' + rawpoems_fname + '.csv',index=False)

<a id='cleaning'></a>
## Part II:  Clean the Poems

**Load the data**

In [3]:
df_cleanpoems = pd.read_pickle(pkldir + '/' + rawpoems_fname + '.pkl')
print("length of data:  ", len(df_cleanpoems))
df_cleanpoems.head()

length of data:   100


Unnamed: 0,url,title,author,poem
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you?\r\nAre you nobody, to..."
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again.\r\nVeins collapse, open..."
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill\r\nand the power o...
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day\r\nknocked livi...
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee"


**Replace carriage returns and newlines.**

The poems are full of carriage returns and newlines (\r\n).  Let's replace each one with a space.


In [5]:
# replace every carriage return and newline with a space
df_cleanpoems['poem']=df_cleanpoems['poem'].replace({r'\r|\n': ' '}, regex=True)
df_cleanpoems.head()

Unnamed: 0,url,title,author,poem
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you? Are you nobody, too?..."
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again. Veins collapse, openin..."
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill and the power of ...
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day knocked living...
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee"
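
Because the pattern above matches \r and \n separately, each \r\n pair becomes two spaces. A small sketch of an alternative that collapses any run of whitespace into a single space, using the standard library `re` module; `normalize_whitespace` is a hypothetical helper name.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, CR, LF) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

raw = "I'm nobody! Who are you?\r\nAre you nobody, too?"
print(normalize_whitespace(raw))   # I'm nobody! Who are you? Are you nobody, too?
```

On the dataframe, the same idea could be written as `df_cleanpoems['poem'].str.replace(r'\s+', ' ', regex=True).str.strip()`.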


**Count words**

In [29]:
df_cleanpoems['length']=df_cleanpoems.poem.apply(lambda x: len(x.split()))
df_cleanpoems.head(3)

Unnamed: 0,url,title,author,poem,length
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you? Are you nobody, too?...",44
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again. Veins collapse, openin...",56
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill and the power of ...,53
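
The count above relies on `str.split()` with no arguments, which splits on any run of whitespace, so doubled spaces left over from the newline replacement do not inflate the result. A tiny worked example on one of the scraped poems; `word_count` is a hypothetical name for the same expression.

```python
def word_count(text):
    """Count whitespace-separated tokens, like len(x.split()) in the cell above."""
    return len(text.split())

print(word_count("I float like a butterfly, sting like a bee"))     # 9
print(word_count("I float  like a butterfly,\r\nsting like a bee")) # 9
```

Punctuation stays attached to its word ("butterfly," is one token), which is fine for a rough length measure.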


**Encode into ASCII, just in case**

In [31]:
# errors='ignore' silently drops every character that is not ASCII
df_cleanpoems['poem'] = df_cleanpoems['poem'].apply(lambda x: x.encode("ascii", errors='ignore').decode())
df_cleanpoems.head(3)

Unnamed: 0,url,title,author,poem,length
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you? Are you nobody, too?...",44
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again. Veins collapse, openin...",56
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill and the power of ...,53
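
With `errors='ignore'`, any non-ASCII character is removed outright, so an accented letter disappears rather than turning into its plain form. A sketch of one possible alternative, assuming that keeping the base letter is preferable: decompose with the standard library `unicodedata` module first, then encode. `to_ascii` is a hypothetical helper and is not what the notebook does.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode()

sample = 'caf\u00e9'                                       # "café" with a precomposed é
print(sample.encode('ascii', errors='ignore').decode())   # caf
print(to_ascii(sample))                                    # cafe
```

Characters with no ASCII decomposition, such as curly quotes, are still dropped by this approach.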


**Add full url** 

In [6]:
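# On a first top-to-bottom run the clean pickle does not exist yet (it is written in the last cell),
# so keep using the in-memory frame; uncomment the next line to reload a previously saved copy.
# df_cleanpoems = pd.read_pickle(pkldir + '/' + cleanpoems_fname + '.pkl')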
print("length of data:  ", len(df_cleanpoems))
df_cleanpoems.head()

length of data:   100


Unnamed: 0,url,title,author,poem,length
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you? Are you nobody, too?...",44
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again. Veins collapse, openin...",56
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill and the power of ...,53
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day knocked living...,31
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee",9


In [8]:
# run this cell only once: rerunning it would prepend the site address a second time
df_cleanpoems['url'] = main_website + df_cleanpoems['url']
df_cleanpoems.head()

Unnamed: 0,url,title,author,poem,length
0,https://www.poetrysoup.com/famous/poem/im_nobo...,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you? Are you nobody, too?...",44
1,https://www.poetrysoup.com/famous/poem/the_les...,The Lesson,Maya Angelou,"I keep on dying again. Veins collapse, openin...",56
2,https://www.poetrysoup.com/famous/poem/the_pow...,The Power of a Smile,Tupac Shakur,The power of a gun can kill and the power of ...,53
3,https://www.poetrysoup.com/famous/poem/a_total...,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day knocked living...,31
4,https://www.poetrysoup.com/famous/poem/i_float...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee",9
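
Plain string concatenation prepends the site address every time the cell runs. A sketch of a variant that is safe to rerun, using `urljoin` from the standard library, which leaves an already absolute URL unchanged; `full_url` and `BASE` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

BASE = 'https://www.poetrysoup.com'

def full_url(path_or_url, base=BASE):
    """Join a site-relative path to base; absolute URLs pass through unchanged."""
    return urljoin(base, path_or_url)

once = full_url('/famous/poem/the_lesson_115')
twice = full_url(once)          # simulates rerunning the cell
print(once)                     # https://www.poetrysoup.com/famous/poem/the_lesson_115
print(once == twice)            # True
```

On the dataframe this could be applied as `df_cleanpoems['url'].apply(full_url)`.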


**Save to disk**

In [9]:
df_cleanpoems.to_pickle(pkldir + '/' + cleanpoems_fname + '.pkl')
df_cleanpoems.to_csv(csvdir + '/' + cleanpoems_fname + '.csv',index=False)

**Done!**