# scraper_famouspoetsandpoems

This script is a modification of a notebook provided by Chaim Gluck: https://github.com/chaimgluck. It scrapes a large number of poems of all sorts, but is rather indiscriminate: some poems are huge, some tiny, and themes are all over the place. All told there are about 35,000 poems on the site.

In [2]:
## Imports.
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re, os
import pickle

In [2]:
all_poets = requests.get('http://www.famouspoetsandpoems.com/poets.html')

In [3]:
poets = BeautifulSoup(all_poets.text, 'lxml')

### Part I:  Scrape the data

**Years for each poet**
In the code below, I scrape the web page that lists the years of each poet's life. This is how I was able to separate the poems into eras. It isn't as clean as using the publication date of each poem, but I felt it was justified for two reasons:
+ Retrieving the publication year of each poem would be very time-consuming, assuming the dates even exist for each poem.
+ I made what I believe to be a safe assumption: a poet's style likely follows the style of their era. It is unlikely that a poet would change style as the styles around them change; more likely, their style stays fairly consistent throughout their life. Going forward, I intend to find a better way to classify the poets and their works.

In [4]:
poet_years = []

for tag in poets.find_all('td'):
    # keep only cells whose text contains a parenthesis, e.g. "(1822 - 1888)"
    text = tag.get_text()
    if '(' in text:
        poet_years.append(text.strip())

In [6]:
poet_years[0]

'document.write(\'<scr\' + \'ipt type="text/javascript">(function () {try{VCM.media.render({sid:53785,media_id:2,media_type:2,version:"1.2",pfc:900000});} catch(e){}}());</scr\' + \'ipt>\');\n\n\n\nFamous Poets and Poems:\xa0 Home\xa0\xa0|\xa0\xa0Poets\xa0\xa0|\xa0\xa0Poem of the Month\xa0\xa0|\xa0\xa0Poet of the Month\xa0\xa0|\xa0\xa0Top 50 Poems\xa0\xa0|\xa0\xa0Famous Quotes\xa0\xa0|\xa0\xa0Famous Love Poems\n\n\n \n\n\n\n\nSearch for: PoemsPoets\n\n\n\n\n\nvar vclk_options = {sid:53785,media_type:5,version:"1.4"};\n\n\n\n\n\nFamousPoetsAndPoems.com / List of Poets\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPoets\n\n\n\n\n\n\n\n\n\n\nPoet of the Month\n\n\n\n\n\n\n\n\n\n\nPoem of the Month\n\n\n\n\n\n\n\n\n\n\nTop 50 Poems\n\n\n\n\n\n\n\n\n\n\nFamous Quotes\n\n\n\n\n\n\n\n\n\n\nThematic Poems\n\n\n\n\n\n\n\n\n\n\nThematic Quotes\n\n\n\n\n \nvar vclk_options = {sid:53785,media_type:7,version:"1.4"};\n\n\n\nPopular Poets\n\n\n\n\n\n\nLangston Hughes\nShel Silverstein\nPablo Neruda\nMaya Angelou\

In [7]:
poet_years2 = [x.strip() for x in poet_years]

In [8]:
poet_years2 = poet_years2[3:]

In [9]:
poet_years = poet_years2[::2]

In [10]:
poets_and_years = []
for i in poet_years:
    poets_and_years.append(i.split('('))

In [11]:
poets_and_years_df = pd.DataFrame(poets_and_years, columns=['name', 'number', 'years', 'blech'])

In [12]:
poets_and_years_df.drop(['blech'], axis=1, inplace=True)

In [13]:
poets_and_years_df.to_csv('poets_years.csv', index=False)

In [14]:
poets_and_years_df.head()

Unnamed: 0,name,number,years
0,Maya Angelou,18),1928 - present)
1,Margaret Atwood,28),1939 - present)
2,Matthew Arnold,45),1822 - 1888)
3,Yehuda Amichai,38),1924 - 2000)
4,Anna Akhmatova,26),1889 - 1966)
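
The `number` and `years` columns above still carry a trailing `)` left over from the split on `(`. Before poets can be grouped into eras, the years need to become integers. The following is a minimal sketch of one way to do that with the standard library; `parse_years` is a hypothetical helper, not part of the original notebook, and it assumes every entry contains a four-digit birth year followed by either a death year or the word "present".

```python
import re

def parse_years(raw):
    """Return (birth, death) as ints from a string such as '1822 - 1888)'.

    death is None when no second year is found (e.g. '1928 - present)').
    Hypothetical helper; assumes years are written with four digits.
    """
    found = re.findall(r'\d{4}', raw)
    birth = int(found[0]) if found else None
    death = int(found[1]) if len(found) > 1 else None
    return birth, death

# Values copied from the table above
for raw in ['1822 - 1888)', '1928 - present)']:
    print(raw, '->', parse_years(raw))
```

Applied to the dataframe, this could look like `poets_and_years_df['years'].apply(parse_years)`. How birth years map to eras is a separate decision that the notebook does not spell out.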


### Scraping a list of themes

In [15]:
# landing page for list of themes
all_themes = requests.get('http://www.famouspoetsandpoems.com/thematic_poems.html')

# get the html from this page
all_themes_html = BeautifulSoup(all_themes.text, 'lxml')

In [16]:
# collect urls for landing page for each theme
theme_urls = []
for tag in all_themes_html.find_all('li',style="padding-bottom:5px;list-style-image: url(/images/_li.gif);padding-right:10px;"):
    theme_urls.append(tag.contents[0].get('href'))
# take a peek to make sure things are kosher
theme_urls[0:5]

['/thematic_poems/abortion_poems.html',
 '/thematic_poems/angel_poems.html',
 '/thematic_poems/animal_poems.html',
 '/thematic_poems/anniversary_poems.html',
 '/thematic_poems/april_poems.html']

In [17]:
# extract tags to store as labels in database
theme_tags = []
for url in theme_urls:
    m = re.search(r'(/)(\w+)_poems\.html',url)
    theme_tags.append(m.group(2))
theme_tags[0:5]

['abortion', 'angel', 'animal', 'anniversary', 'april']

Here we need to loop over all the "themes", for each one scraping the title and author of every poem. We'll store everything as a big list of lists, turning it into a pandas dataframe later.

In [21]:
title_author_theme = []

for url in theme_urls:
    # serve up html for theme landing page
    theme_page = requests.get('http://www.famouspoetsandpoems.com'+url)
    theme_page_soup = BeautifulSoup(theme_page.text, 'lxml')
    
    # extract simple string for theme name
    m = re.search(r'(/)(\w+)_poems\.html',url)
    theme_name = m.group(2)

    # extract author, title, and poem_url for each poem in this theme, and store, along with theme label
    for poem in theme_page_soup.find_all('div',style="width:440px;padding-left:12px;padding-top:14px;"):
        title = poem.a.get_text()
        author = poem.span.get_text()
        poem_url = poem.a.get('href')
        poem_page = requests.get('http://www.famouspoetsandpoems.com'+poem_url)
        poem_page_soup = BeautifulSoup(poem_page.text, 'lxml')
        text_blob = poem_page_soup.find('div',style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
        if text_blob:
            text = text_blob.get_text()
        else:
            print("no text for poem {}".format(poem_url))  
            text = ''
#        text = [x.strip() for x in poem.get_text().split('by')]
#        text.append(theme_name)
        title_author_theme.append([title,author,theme_name,poem_url,text])

no text for poem /poets//poems/10095.html
no text for poem /poets//poems/10106.html
no text for poem /poets//poems/20956.html
no text for poem /poets//poems/10109.html
no text for poem /poets//poems/10127.html
no text for poem /poets//poems/10119.html
no text for poem /poets//poems/10158.html
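
Every URL in the log above contains a doubled slash (`/poets//poems/...`), which suggests, though this is only an assumption, that the poet's slug is missing from the link on the theme page. Here is a small sketch of how such links could be flagged before any request is made; `absolute_url` and `has_empty_segment` are hypothetical helpers, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.famouspoetsandpoems.com'

def absolute_url(href):
    """Resolve a site-relative href against the site root."""
    return urljoin(BASE_URL, href)

def has_empty_segment(href):
    """True when the path has an empty segment, as in '/poets//poems/10095.html'."""
    return '' in href.strip('/').split('/')

hrefs = ['/thematic_poems/angel_poems.html', '/poets//poems/10095.html']
suspect = [h for h in hrefs if has_empty_segment(h)]
print(suspect)
print(absolute_url(hrefs[0]))
```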


Convert to a dataframe, and store as csv.

In [51]:
df_themes_and_poems = pd.DataFrame(title_author_theme,columns=['title','author','theme','poem_url','poem'])
df_themes_and_poems.to_csv('themes_and_poems.csv',index=False)
df_themes_and_poems.theme.unique().tolist()

['abortion',
 'angel',
 'animal',
 'anniversary',
 'april',
 'autumn',
 'baby',
 'ballad',
 'baptism',
 'beach',
 'beautiful',
 'beauty',
 'birthday',
 'brother',
 'butterfly',
 'cat',
 'child',
 'childhood',
 'christian',
 'christmas',
 'courage',
 'dad',
 'dance',
 'dark',
 'daughter',
 'death',
 'depression',
 'dog',
 'dream',
 'easter',
 'faith',
 'family',
 'farewell',
 'father',
 'flower',
 'football',
 'freedom',
 'friendship',
 'funeral',
 'funny',
 'girl',
 'god',
 'goodbye',
 'graduation',
 'grandfather',
 'grandmother',
 'haiku',
 'halloween',
 'happy',
 'hate',
 'heaven',
 'history',
 'hockey',
 'holiday',
 'holocaust',
 'home',
 'hope',
 'horse',
 'humorous',
 'husband',
 'inspiration',
 'january',
 'jesus',
 'journey',
 'life',
 'lonely',
 'loss',
 'lost',
 'lyric',
 'marriage',
 'memory',
 'metaphor',
 'miracle',
 'mom',
 'moon',
 'mother',
 'music',
 'name',
 'nature',
 'ocean',
 'pain',
 'passion',
 'patriotic',
 'peace',
 'people',
 'prayer',
 'rain',
 'relationship',

### Scraping a list of links to each poet's page.
The website required a multi-stage scrape. First, I collect links to each poet's landing page, each of which links to all of that poet's poems. Next, I collect the links to every poem from those pages. Finally, I scrape the actual poem from each poem page. In total, that came to roughly 34,500 pages.

In [34]:
poet_links = []

for tag in poets.find_all('td'):
    anchor = tag.find('a')
    # skip cells without a usable link
    if anchor is None or not anchor.get('href'):
        continue
    link = anchor['href']
    if '/poets/' in link:
        poet_links.append(link)

poet_links = list(set(poet_links))
poet_links

['/poets/ezra_pound',
 '/poets/matthew_arnold',
 '/poets/dante_alighieri',
 '/poets/jorie_graham',
 '/poets/charles_sorley',
 '/poets/paul_verlaine',
 '/poets/tony_hoagland',
 '/poets/geraldine_connolly',
 '/poets/derek_walcott',
 '/poets/brooks_haxton',
 '/poets/pablo_neruda',
 '/poets/john_ashbery',
 '/poets/maggie_anderson',
 '/poets/michael_lally',
 '/poets/katherine_mansfield',
 '/poets/wilfred_owen',
 '/poets/wislawa_szymborska',
 '/poets/ambrose_bierce',
 '/poets/wallace_stevens',
 '/poets/eileen_carney_hulme',
 '/poets/francis_scott_key',
 '/poets/cesar_vallejo',
 '/poets/tadeusz_rozewicz',
 '/poets/stephen_crane',
 '/poets/giacomo_leopardi',
 '/poets/sara_teasdale',
 '/poets/claire_nixon',
 '/poets/li-young_lee',
 '/poets/forrest_hamer',
 '/poets/yusef_komunyakaa',
 '/poets/hilda_doolittle',
 '/poets/annie_dillard',
 '/poets/hilaire_belloc',
 '/poets/gwendolyn_brooks',
 '/poets/edith_nesbit',
 '/poets/william_henry_davies',
 '/poets/aleksandr_blok',
 '/poets/louis_macneice',
 

This function collects all the links for individual poems from each poet's landing page. 

In [35]:
def get_poems(link):
    """Return absolute URLs for every poem linked from a poet's poem-list page."""
    poetry = requests.get(link)
    bib_soup = BeautifulSoup(poetry.text, 'lxml')

    poem_links = []
    for cell in bib_soup.find_all('td'):
        anchor = cell.find('a')
        # skip cells without a usable link
        if anchor is None or not anchor.get('href'):
            continue
        if '/poems/' in anchor['href']:
            poem_links.append(anchor['href'])

    # drop duplicate links, then prepend the site root
    poem_links = list(set(poem_links))
    return ['http://www.famouspoetsandpoems.com' + poem for poem in poem_links]

In [36]:
raw = 'http://www.famouspoetsandpoems.com'
poem_links = []
for poet in poet_links:
    link = raw + poet + '/poems'
    poem_links.append(get_poems(link))

In [37]:
poem_links[0:5]

[['http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18819',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18808',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18827',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18825',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18786',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18839',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18823',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18801',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18781',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18844',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18790',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18803',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18791',
  'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18809',
  'http://www.famouspoetsandpoems.

In [38]:
poem_links_list = [item for sublist in poem_links for item in sublist]

In [39]:
poem_links_list

['http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18819',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18808',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18827',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18825',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18786',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18839',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18823',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18801',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18781',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18844',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18790',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18803',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18791',
 'http://www.famouspoetsandpoems.com/poets/ezra_pound/poems/18809',
 'http://www.famouspoetsandpoems.com/poets/ezra_
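
One caveat with `list(set(...))`: a set has no defined order, so the order of `poet_links` and of each poet's poem links can differ between runs. That matters if a later step resumes from a row number. Below is a minimal sketch of an order-preserving alternative using only the standard library; `flatten_unique` is a hypothetical helper, not part of the original notebook.

```python
from itertools import chain

def flatten_unique(nested):
    """Flatten a list of lists, dropping duplicates but keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(chain.from_iterable(nested)))

nested = [['/a/poems/1', '/a/poems/2', '/a/poems/1'], ['/b/poems/3', '/a/poems/2']]
print(flatten_unique(nested))
```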

### Scraping all the poems from site:
Below, I go to each link in the list of roughly 34,500 pages and scrape the actual poem from the page.

In [41]:
url = requests.get(poem_links_list[0])
soup = BeautifulSoup(url.text,'lxml')
title = soup.find('span',style="font-weight:bold;font-size:16px;color:#3C605B;font-family:Times New Roman;").get_text()
title

'Cantico del Sole by Ezra Pound'

In [44]:
len(poem_links_list[5:])

34467

In [91]:
df_poems_soup = pd.DataFrame(columns=['url','title','poem','author'])
df_poems_soup['url']=poem_links_list
df_poems_soup.reset_index(inplace=True,drop=True)

In [142]:
# temporary file to store output as it is read...useful for crashes mid-loop
outfile = 'temp.csv'
# rate at which file is written to and updates reported
batchsize = 100
nbatches = len(df_poems_soup) // batchsize
# for each url, starting from row 7816...
for row in df_poems_soup[7816:].itertuples():
    # try to get the poem information
    try:
        url = requests.get(row.url,timeout=100)
    except requests.exceptions.RequestException:
        # otherwise, just keep going
        continue
    soup = BeautifulSoup(url.text, 'lxml')
    poem = soup.find('div', style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;")
    title = soup.find('span',style="font-weight:bold;font-size:16px;color:#3C605B;font-family:Times New Roman;")
    # skip pages that lack the expected title element
    if title is None:
        continue
    df_poems_soup.loc[row.Index,['title','poem']] = [title.get_text(), str(poem)]
    # write to file and display progress once in a while
    if row.Index % batchsize == 0:
        print("{:1.0f} of {:1.0f} batches completed".format(row.Index/batchsize,nbatches))
        # overwrite the checkpoint rather than appending the whole dataframe again
        df_poems_soup.to_csv(outfile, index=False)


79 of 344 batches completed
80 of 344 batches completed
81 of 344 batches completed
82 of 344 batches completed
83 of 344 batches completed
84 of 344 batches completed
85 of 344 batches completed
86 of 344 batches completed
87 of 344 batches completed
88 of 344 batches completed
89 of 344 batches completed
90 of 344 batches completed
91 of 344 batches completed
92 of 344 batches completed
93 of 344 batches completed
94 of 344 batches completed
95 of 344 batches completed
96 of 344 batches completed
97 of 344 batches completed
98 of 344 batches completed
99 of 344 batches completed
100 of 344 batches completed
101 of 344 batches completed
102 of 344 batches completed
103 of 344 batches completed
104 of 344 batches completed
105 of 344 batches completed
106 of 344 batches completed
107 of 344 batches completed
108 of 344 batches completed
109 of 344 batches completed
110 of 344 batches completed
111 of 344 batches completed
112 of 344 batches completed
113 of 344 batches completed
114 of
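
The loop above periodically saves progress so that a crash does not lose everything. Below is a minimal sketch of the same idea with the standard library's `csv` module, written so that a restarted run skips URLs that are already on disk. `scrape_with_checkpoint`, `append_rows`, and the `fetch` argument are hypothetical names; `fetch` stands in for the `requests` plus `BeautifulSoup` step and is faked here so the example runs offline.

```python
import csv
import os
import tempfile

def append_rows(outfile, rows):
    """Append rows to the checkpoint file."""
    with open(outfile, 'a', newline='') as f:
        csv.writer(f).writerows(rows)

def scrape_with_checkpoint(urls, fetch, outfile, batchsize=100):
    """Fetch each URL once, appending [url, text] rows to outfile in batches."""
    done = set()
    if os.path.exists(outfile):
        with open(outfile, newline='') as f:
            done = {row[0] for row in csv.reader(f) if row}
    batch = []
    for url in urls:
        if url in done:
            continue  # already saved by an earlier run
        batch.append([url, fetch(url)])
        if len(batch) >= batchsize:
            append_rows(outfile, batch)
            batch = []
    if batch:
        append_rows(outfile, batch)  # flush the final partial batch

# Offline demonstration with a fake fetch function
calls = []
def fake_fetch(url):
    calls.append(url)
    return 'poem text for ' + url

urls = ['u{}'.format(i) for i in range(5)]
checkpoint = os.path.join(tempfile.mkdtemp(), 'temp.csv')
scrape_with_checkpoint(urls[:3], fake_fetch, checkpoint, batchsize=2)  # "interrupted" run
scrape_with_checkpoint(urls, fake_fetch, checkpoint, batchsize=2)      # resumed run
print(len(calls), 'fetches for', len(urls), 'urls')
```

Keying the checkpoint on the URL rather than on a row number means the restart does not depend on the dataframe keeping the same order between sessions.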

In [148]:
# write to file
df_poems_soup.to_csv('famouspoets.csv',index=False)
# eliminate temporary file
try:
    os.remove(outfile)
except OSError:
    pass

In [150]:
## Just a peek at what we're dealing with. It will need cleaning.
df_poems_soup.loc[0,:]

url       http://www.famouspoetsandpoems.com/poets/ezra_...
title                        Cantico del Sole by Ezra Pound
poem      <div style="padding-left:14px;padding-top:20px...
author                                     by Carl Sandburg
Name: 0, dtype: object

### Part II:  Clean the poems

Rename dataframe for easy typing, and drop NAs

In [4]:
#df = pd.read_csv('famouspoets.csv')

In [151]:
df = df_poems_soup
del df_poems_soup
df.dropna(inplace=True)

Separate title and author information

In [5]:
# split on the last standalone ' by ' so that a 'by' inside a word is never matched
df[['title','author']] = df.title.str.rsplit(' by ',n=1,expand=True)
df.head()

Unnamed: 0,url,title,poem,author
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return,"<div style=""padding-left:14px;padding-top:20px...",Ezra Pound
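
Splitting on the bare substring `by` is fragile: the last `by` in a string can sit inside a word, and the surrounding spaces stay attached to the title and author. Here is a minimal sketch in plain Python of splitting on the last standalone ` by ` instead; `split_title_author` is a hypothetical helper, and the second heading is an illustrative string, not taken from the scraped data.

```python
def split_title_author(heading):
    """Split 'Title by Author' on the last ' by ' surrounded by spaces."""
    parts = heading.rsplit(' by ', 1)
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    return heading.strip(), None  # no author found

print(split_title_author('Cantico del Sole by Ezra Pound'))
print(split_title_author('Stopping by Woods on a Snowy Evening by Robert Frost'))
```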


Let's see what work each poem is going to need.  We'll look at the first.

In [6]:
#df_temp.poem.replace({r'\r|\n': ' '}, regex=True)
df.poem[0]

'<div style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;">\n\t\t\t\t\t\tThe thought of what America would be like<br/>If the Classics had a wide circulation<br/>       Troubles my sleep,<br/>The thought of what America,<br/>The thought of what America,The thought of what America would be like<br/>If the Classics had a wide circulation<br/>      Troubles my sleep.<br/>Nunc dimittis, now lettest thou thy servant,<br/>Now lettest thou thy servant<br/>       Depart in peace.<br/>The thought of what America,<br/>The thought of what America,<br/>The thought of what America would be like<br/>If the Classics had a wide circulation...<br/>       Oh well!<br/>       It troubles my sleep.\t\t\t\t\t\t</div>'

In general, each line is separated by a `<br/>`. I'm not sure we can bank on this for all poems, especially those of a more narrative form. It also looks like the poem proper begins after a succession of tabs. We'll start by splitting on breaks and trying to strip the leading and trailing HTML.

In [188]:
df.poem = df.poem.str.split('<br/>')
df.head(2)

Unnamed: 0,url,title,poem,author
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound


Store first and last lines separately.

In [266]:
df['last_line'] = df.poem.apply(lambda x: x[-1].split('\t')[0])
df['first_line'] = df.poem.apply(lambda x: x[0].split('\t')[-1])
df.head(2)

Unnamed: 0,url,title,poem,author,last_line,first_line
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound,It troubles my sleep.,The thought of what America would be like
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound,as we are departing.,"Blue mountains to the north of the walls,"


In [270]:
df.poem=df.poem.apply(lambda x: x[1:-1])
df.poem = df.poem.apply(lambda x: ' '.join(x))
df['full_poem'] = df.first_line + ' ' + df.poem + ' ' + df.last_line
df.full_poem[0]

In [276]:
df.full_poem

0        The thought of what America would be like If t...
1        Blue mountains to the north of the walls, Whit...
2        These fought in any case, and some believing p...
3        When I behold how black, immortal ink Drips fr...
4        See, they return; ah, see the tentative      M...
5        Empty are the ways,  Empty are the ways of thi...
6        Green arsenic smeared on an egg-white cloth,  ...
7        The small dogs look at the big dogs; They obse...
8        All the while they were talking the new morali...
9        Lady of rich allure,  Queen of the spring's em...
10       Italian Campagna 1309, the open road   Bah! I ...
11       And the days are not full enough And the night...
12       Winter is icummen in,  Lhude sing Goddamm.  Ra...
13       O woe, woe,  People are born and die,  We also...
14       O God, O Venus, O Mercury, patron of thieves, ...
15       When I am old I will not have you look apart F...
16       I am a grave poetic hen That lays poetic eggs .

Looks like things largely worked OK, but there are some poems with `</div>` still in them. Let's see why.

In [360]:
t=df_temp.loc[df.first_line.str.contains('</div>'),:]
t.head()

Unnamed: 0,url,title,poem,author,last_line,first_line
163,http://www.famouspoetsandpoems.com/poets/derek...,The Sea Is History,"[<div style=""padding-left:14px;padding-top:20p...",Derek Walcott,"<div style=""padding-left:14px;padding-top:20px...",</div>
809,http://www.famouspoetsandpoems.com/poets/raymo...,The Current,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
810,http://www.famouspoetsandpoems.com/poets/raymo...,Late Fragment,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
811,http://www.famouspoetsandpoems.com/poets/raymo...,An Afternoon,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
812,http://www.famouspoetsandpoems.com/poets/raymo...,Bobber,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>


In [364]:
t.loc[809,'poem']

['<div style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;">\n\t\t\t\t\t\tUnfortunately this poem has been removed from our archives at the insistence of the copyright holder.\t\t\t\t\t\t</div>']
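
This explains the stray tags: when a page holds a single line with no `<br/>` (here, a removal notice), the list has one element, so the first-line and last-line logic ends up grabbing pieces of the `<div>` wrapper itself. Below is a minimal sketch of an alternative that strips the wrapper before splitting and flags removed poems; `poem_lines`, `is_removed`, and `REMOVED_NOTICE` are hypothetical names, and the regular expressions assume the markup looks like the examples in this notebook.

```python
import re

REMOVED_NOTICE = 'this poem has been removed from our archives'

def poem_lines(raw_html):
    """Return the poem's lines with the <div> wrapper and <br/> tags removed."""
    # strip the opening <div ...> and the closing </div>
    inner = re.sub(r'^\s*<div[^>]*>|</div>\s*$', '', raw_html).strip()
    # split on <br>, <br/> or <br />; note that leading indentation is discarded
    return [line.strip() for line in re.split(r'<br\s*/?>', inner)]

def is_removed(raw_html):
    """True when the page holds the site's copyright-removal notice."""
    return REMOVED_NOTICE in raw_html.lower()

wrapper = '<div style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;">'
poem = wrapper + ('\n\t\t\t\t\t\tThe thought of what America would be like<br/>'
                  'If the Classics had a wide circulation\t\t\t\t\t\t</div>')
notice = wrapper + ('\n\t\t\t\t\t\tUnfortunately this poem has been removed from our '
                    'archives at the insistence of the copyright holder.\t\t\t\t\t\t</div>')
print(poem_lines(poem))
print(is_removed(poem), is_removed(notice))
```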

In [310]:
df_temp[['title','author']] = df_temp.title.str.rsplit('by',n=1,expand=True)
df.head()

Unnamed: 0,url,title,poem,author,last_line,first_line,full_poem
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,If the Classics had a wide circulation ...,Ezra Pound,It troubles my sleep.,The thought of what America would be like,The thought of what America would be like If t...
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,White river winding about them; Here we must m...,Ezra Pound,as we are departing.,"Blue mountains to the north of the walls,","Blue mountains to the north of the walls, Whit..."
2,http://www.famouspoetsandpoems.com/poets/ezra_...,These Fought in Any Case,"and some believing pro domo, in any case ........",Ezra Pound,laughter out of dead bellies.,"These fought in any case,","These fought in any case, and some believing p..."
3,http://www.famouspoetsandpoems.com/poets/ezra_...,Silet,"Drips from my deathless pen - ah, well-away! W...",Ezra Pound,To plague to-morrow with a testament!,"When I behold how black, immortal ink","When I behold how black, immortal ink Drips fr..."
4,http://www.famouspoetsandpoems.com/poets/ezra_...,The Return,"Movements, and the slow feet, The tr...",Ezra Pound,pallid the leash-men!,"See, they return; ah, see the tentative","See, they return; ah, see the tentative M..."


In [348]:
df_temp.poem = df_temp.poem.str.split('<br/>')
df_temp.head(2)

Unnamed: 0,url,title,poem,author,last_line,first_line
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound,>,<
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound,>,<


In [349]:
df_temp['last_line'] = df_temp.poem.apply(lambda x: x[-1].split('\t')[0])
df_temp['first_line'] = df_temp.poem.apply(lambda x: x[0].split('\t')[-1])
df_temp.head(2)

Unnamed: 0,url,title,poem,author,last_line,first_line
0,http://www.famouspoetsandpoems.com/poets/ezra_...,Cantico del Sole,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound,It troubles my sleep.,The thought of what America would be like
1,http://www.famouspoetsandpoems.com/poets/ezra_...,Taking Leave of a Friend,"[<div style=""padding-left:14px;padding-top:20p...",Ezra Pound,as we are departing.,"Blue mountains to the north of the walls,"


In [354]:
t=df_temp.loc[df.first_line.str.contains('</div>'),:]

In [356]:
#t.reset_index(inplace=True,drop=True)
t.loc[163,'poem']

['<div style="padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;">\n\t\t\t\t\t\tThe Sea Is History \t\t\t\t\t\t</div>']

In [355]:
t

Unnamed: 0,url,title,poem,author,last_line,first_line
163,http://www.famouspoetsandpoems.com/poets/derek...,The Sea Is History,"[<div style=""padding-left:14px;padding-top:20p...",Derek Walcott,"<div style=""padding-left:14px;padding-top:20px...",</div>
809,http://www.famouspoetsandpoems.com/poets/raymo...,The Current,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
810,http://www.famouspoetsandpoems.com/poets/raymo...,Late Fragment,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
811,http://www.famouspoetsandpoems.com/poets/raymo...,An Afternoon,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
812,http://www.famouspoetsandpoems.com/poets/raymo...,Bobber,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
813,http://www.famouspoetsandpoems.com/poets/raymo...,The Scratch,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
814,http://www.famouspoetsandpoems.com/poets/raymo...,Drinking While Driving,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
815,http://www.famouspoetsandpoems.com/poets/raymo...,The Cobweb,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
816,http://www.famouspoetsandpoems.com/poets/raymo...,The Best Time Of The Day,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
817,http://www.famouspoetsandpoems.com/poets/raymo...,Photograph of My Father in His Twenty-Second Y...,"[<div style=""padding-left:14px;padding-top:20p...",Raymond Carver,"<div style=""padding-left:14px;padding-top:20px...",</div>
