### This notebook scrapes the "Top 100 Short Poems" from PoetrySoup.com

In [1]:
from bs4 import BeautifulSoup  # HTML parsing
import requests                # HTTP requests
import pandas as pd            # tabular storage and export

**Get landing page for poem links**

In [2]:
# Download the landing page that lists the poems and parse it with lxml.
main = requests.get("https://www.poetrysoup.com/famous/poems/short/top_100_famous_short_poems.aspx")
main_soup = BeautifulSoup(main.text, 'lxml')
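
The request above assumes the page loads successfully. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html` (not part of the original notebook) and an arbitrary 10-second timeout:

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper, not part of the original notebook.
    """
    # Reuse a caller-supplied session if given; otherwise create one.
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With such a helper, the landing page could be parsed as `BeautifulSoup(fetch_html(url), 'lxml')`, and a failed download would stop the notebook at the fetch rather than later in the parsing code.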

**Scrape url, title, and author**

In [3]:
# Each table row after the header holds a poem link followed by an author link.
rows = []
all_items = main_soup.find_all('tr')
for item in all_items[1:]:  # skip the header row
    info = item.find_all('a')
    url = info[0].get('href')
    title = info[0].get_text().strip()
    author = info[1].get_text().strip()
    rows.append([url, title, author])
df = pd.DataFrame(rows, columns=['url', 'title', 'author'])
df.head()

Unnamed: 0,url,title,author
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali
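
The extraction loop assumes every table row after the header contains at least two links. A hedged sketch of the same extraction that skips rows which do not fit that shape. The HTML snippet is a made-up stand-in (the real page markup is not shown in this notebook), `parse_rows` is a hypothetical helper, and `html.parser` is used here only so the example runs without lxml:

```python
from bs4 import BeautifulSoup

# Illustrative markup only: the real page structure may differ.
sample_html = """
<table>
  <tr><th>Poem</th><th>Poet</th></tr>
  <tr><td><a href="/famous/poem/the_lesson_115"> The Lesson </a></td>
      <td><a href="#">Maya Angelou</a></td></tr>
  <tr><td>row without links</td></tr>
</table>
"""


def parse_rows(soup):
    """Collect [url, title, author] for each row with at least two links."""
    records = []
    for row in soup.find_all('tr'):
        links = row.find_all('a')
        if len(links) < 2:
            # Header or filler row: nothing to extract.
            continue
        records.append([
            links[0].get('href'),
            links[0].get_text().strip(),
            links[1].get_text().strip(),
        ])
    return records


records = parse_rows(BeautifulSoup(sample_html, 'html.parser'))
print(records)
# [['/famous/poem/the_lesson_115', 'The Lesson', 'Maya Angelou']]
```

Checking the link count removes the need to slice off the header row by position, at the cost of silently dropping rows that might deserve a closer look.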


**Now scrape the poems**

In [5]:
# Fetch each poem page and keep the text of its first <pre> element.
poems = []
for row in df.itertuples():
    page = requests.get('https://www.poetrysoup.com' + row.url)
    page_soup = BeautifulSoup(page.text, 'lxml')
    poem = page_soup.find_all('pre')
    poems.append(poem[0].get_text())
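
The loop above builds each poem URL by string concatenation and assumes every page contains a `<pre>` element. A sketch of two small safeguards, assuming hypothetical helper names `absolute_url` and `first_pre_text` (neither is in the original notebook):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.poetrysoup.com'


def absolute_url(path):
    """Resolve a site-relative path such as '/famous/poem/...' against BASE_URL."""
    return urljoin(BASE_URL, path)


def first_pre_text(html):
    """Return the text of the first <pre> element, or None if there is none."""
    pre = BeautifulSoup(html, 'html.parser').find('pre')
    return pre.get_text() if pre is not None else None


print(absolute_url('/famous/poem/the_lesson_115'))
# https://www.poetrysoup.com/famous/poem/the_lesson_115
```

Inside the loop, a `None` result could be appended as a placeholder so that `poems` keeps one entry per row of `df`. A short `time.sleep` between requests would also keep the load on the site low.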

**Add the raw (still messy) poems to the dataframe and save it**

In [6]:
df['poem'] = poems
df.to_pickle('top100_full.pkl')
df.to_csv('top100_full.csv',header=['url','title','author','poem'],index=False)
df.head()

Unnamed: 0,url,title,author,poem
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you?\r\nAre you nobody, to..."
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again.\r\nVeins collapse, open..."
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill\r\nand the power o...
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day\r\nknocked livi...
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee"


**Let's clean them up**

In [7]:
# Replace each line break (\r\n, \r, or \n) with a single space.
df['poem'] = df['poem'].replace({r'\r\n|\r|\n': ' '}, regex=True)
df.head()

Unnamed: 0,url,title,author,poem
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you? Are you nobody, too?..."
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again. Veins collapse, openin..."
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill and the power of ...
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day knocked living...
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee"
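
Text taken from `<pre>` blocks may also contain runs of spaces or tabs besides line breaks. A sketch of a stricter cleanup that collapses any run of whitespace into one space, assuming a hypothetical helper `flatten_whitespace`. It also removes deliberate indentation inside a poem, which may or may not be wanted:

```python
import re


def flatten_whitespace(text):
    """Collapse every run of whitespace, line breaks included, into one space."""
    return re.sub(r'\s+', ' ', text).strip()


raw = "I'm nobody! Who are you?\r\nAre you nobody, too?"
print(flatten_whitespace(raw))
# I'm nobody! Who are you? Are you nobody, too?
```

Applied to the dataframe, this would read `df['poem'] = df['poem'].map(flatten_whitespace)`.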


In [8]:
df.to_pickle('top100_clean.pkl')
df.to_csv('top100_clean.csv',header=['url','title','author','poem'],index=False)
df.head()

Unnamed: 0,url,title,author,poem
0,/famous/poem/im_nobody!_who_are_you_41,Im nobody! Who are you?,Emily Dickinson,"I'm nobody! Who are you? Are you nobody, too?..."
1,/famous/poem/the_lesson_115,The Lesson,Maya Angelou,"I keep on dying again. Veins collapse, openin..."
2,/famous/poem/the_power_of_a_smile_21350,The Power of a Smile,Tupac Shakur,The power of a gun can kill and the power of ...
3,/famous/poem/a_total_stranger_one_black_day_408,a total stranger one black day,Edward Estlin (E E) Cummings,a total stranger one black day knocked living...
4,/famous/poem/i_float_like_a_butterfly_sting_li...,"I float like a butterfly, sting like a bee",Muhammad Ali,"I float like a butterfly, sting like a bee"
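
A quick way to confirm that a CSV export is readable is to load it back and compare. A sketch using an in-memory buffer rather than the real `top100_clean.csv`, with two title and author pairs copied from the table above; the title containing a comma checks that quoting survives the round trip:

```python
import io

import pandas as pd

sample = pd.DataFrame(
    [
        ['The Lesson', 'Maya Angelou'],
        ['I float like a butterfly, sting like a bee', 'Muhammad Ali'],
    ],
    columns=['title', 'author'],
)

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind and read the same text back into a new dataframe.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored['title'].tolist() == sample['title'].tolist())
# True
```

The same comparison could be run against the saved file by replacing the buffer with `pd.read_csv('top100_clean.csv')`.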


**Done!**
