# Web Scraping with BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)

What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.imdb.com/robots.txt
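
A scraper can check robots.txt rules in code before requesting a page. The sketch below (Python 3, standard library only) parses a made-up robots.txt string rather than downloading a real one; the rules and the `example.com` URLs are assumptions for illustration and do not reflect IMDb's actual file.

```python
from urllib import robotparser

# Hypothetical robots.txt content, used here instead of a live download
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /private/
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

rules = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules allow the request
allowed_title = rules.can_fetch('*', 'http://www.example.com/title/tt0111161/')
allowed_search = rules.can_fetch('*', 'http://www.example.com/search/title')
print(allowed_title, allowed_search)
```

For a live site, `RobotFileParser` also offers `set_url()` and `read()` to fetch the file before calling `can_fetch()`.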

What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at example.html
- Tags are opened and closed
- Tags have optional attributes
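
To make "tags are opened and closed" and "tags have optional attributes" concrete, here is a small sketch (Python 3) using the standard library's `html.parser`, not BeautifulSoup. The `TagLogger` class is a hypothetical helper; the HTML snippet reuses two tags that appear in example.html.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record each opening tag with its attributes, and each closing tag."""

    def __init__(self):
        super().__init__()
        self.opened = []   # list of (tag name, attributes dict)
        self.closed = []   # list of tag names

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.opened.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.closed.append(tag)

snippet = ('<h1 id="main">DAT10 Class 6</h1>'
           '<p class="topic" id="api">First, we are covering APIs, '
           'which are useful for getting data.</p>')

logger = TagLogger()
logger.feed(snippet)
print(logger.opened)
print(logger.closed)
```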

How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's inspect example.html

### Acquire Your Data

In [305]:
# read the HTML code for a web page and save as a string
with open('../data/example.html', 'r') as f:
    html = f.read()

### Look at / Explore Your Data (HTML)

In [306]:
#print html

In [308]:
# convert HTML into a structured Soup object
from bs4 import BeautifulSoup
b = BeautifulSoup(html, 'html.parser')

In [309]:
# print out the object
#print b
#print b.prettify()
type(b)

bs4.BeautifulSoup

#### 'find' method returns the first matching Tag (and everything inside of it)

In [310]:
b.find(name='body')
b.find(name='h1')

<h1 id="main">DAT10 Class 6</h1>

In [311]:
# Tags allow you to access the 'inside text'
h1_text = b.find(name='h1').text
print type(h1_text)
h1_text

<type 'unicode'>


u'DAT10 Class 6'

In [312]:
# Tags also allow you to access their attributes
b.find(name='h1')['id']

u'main'

#### 'find_all' method is useful for finding all matching Tags

In [313]:
list_p = b.find_all(name='p')    # returns a ResultSet (like a list of Tags)
len(list_p)

5

### Quiz: What is the datatype returned by 'find_all'? What kinds of operations can we do on that datatype?

In [314]:
# Note you can't do b.find_all(name='p')['id'], but you can do this with a 
# single find result (see above or example below)

Hint: ResultSets can be sliced

In [315]:
print len(b.find_all(name='p'))
print b.find_all(name='p')[0]
print b.find_all(name='p')[0].text
b.find_all(name='p')[0]['id']
b.find_all(name='p')[1]['class']

5
<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>
First, we are covering APIs, which are useful for getting data.


[u'topic']

In [316]:
# iterate over a ResultSet
results = b.find_all(name='p')
for tag in results:
    print tag.text

First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, I will ask you to fill out yet another feedback form!
Here are some helpful API resources:
Here are some helpful web scraping resources:


### Quiz: How would you write the above as a list comprehension?

In [317]:
# avoid naming the list 'str', which would shadow the built-in str type
lines = [tag.text+'\n' for tag in b.find_all(name='p')]
print ''.join(lines)

First, we are covering APIs, which are useful for getting data.
Then, we are covering web scraping, which is a more flexible way to get data.
Finally, I will ask you to fill out yet another feedback form!
Here are some helpful API resources:
Here are some helpful web scraping resources:
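
A side note on joining: appending `'\n'` to every item leaves a trailing newline, while joining on `'\n'` places the separator only between items. A minimal sketch in Python 3, using plain strings copied from the output above as stand-ins for `tag.text` values:

```python
# Stand-in strings; in the notebook these would come from tag.text
texts = [
    'First, we are covering APIs, which are useful for getting data.',
    'Then, we are covering web scraping, which is a more flexible way to get data.',
    'Finally, I will ask you to fill out yet another feedback form!',
]

# Appending '\n' to each item, then joining on the empty string
with_trailing = ''.join([t + '\n' for t in texts])

# Joining on '\n' directly: no newline after the last item
joined = '\n'.join(texts)

print(joined)
```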



### Limit search by Tag attribute

In [318]:
b.find(name='p', attrs={'id':'scraping'})

<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>

In [319]:
# both calls return the same result, because only 'p' tags have the class 'topic' on this page
b.find_all(name='p', attrs={'class':'topic'})
b.find_all(attrs={'class':'topic'})

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>]

### Limit search to specific sections

In [320]:
b.find_all(name='li')
b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')

[<li>Web scraping resource 1</li>, <li>Web scraping resource 2</li>]

## In Class Exercise

1) Find the 'h2' tag and then print its text

In [321]:
b.find('h2').text

u'Resource List'

2) Find the 'p' tag with an 'id' value of 'feedback' and then print its text


In [322]:
b.find(name='p', attrs={'id':'feedback'}).text

u'Finally, I will ask you to fill out yet another feedback form!'

3) Find the first 'p' tag and then print the value of the 'id' attribute


In [323]:
print b.find('p')
print b.find('p')['id']

<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>
api


4) Print the text of all four resources

In [324]:
[li.text for li in b.find_all('li')]

[u'API resource 1',
 u'API resource 2',
 u'Web scraping resource 1',
 u'Web scraping resource 2']

5) Using a list comprehension can you extract the text of only the API resources?

In [325]:
api = b.find_all(name='ul', attrs={'id':'api'})[0].find_all('li')
[a.text for a in api] 

[u'API resource 1', u'API resource 2']

### Tool: Selector Gadget
http://selectorgadget.com/

## Scraping IMDB

#### First, open your browser and look at the website and its HTML structure

http://www.imdb.com/title/tt0111161/

#### Get the HTML from the Shawshank Redemption page

In [326]:
import requests
r = requests.get('http://www.imdb.com/title/tt0111161/')
r.status_code

200

#### What is r? What can we do with it?

In [327]:
#r.text

#### Convert HTML into Soup

In [328]:
b = BeautifulSoup(r.text, 'html.parser')
#print b

In [329]:
# run this code if you have encoding errors
# (Python 2 only workaround: reload() and sys.setdefaultencoding() are not available in Python 3)
import sys
reload(sys)
sys.setdefaultencoding('utf8')

#### Get the title

In [330]:
b.find('h1').text

u'The Shawshank Redemption\n                   (1994)\n                   \n'
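
The heading text comes back with the year and a lot of whitespace attached. One possible way to separate the two, sketched in Python 3 with only the standard library; `split_title_and_year` is a hypothetical helper, and it assumes the heading ends with a four-digit year in parentheses, as in the output above.

```python
import re

def split_title_and_year(raw_heading):
    """Return (title, year) from heading text that ends with '(YYYY)'."""
    # Collapse every run of whitespace (including newlines) into one space
    flat = ' '.join(raw_heading.split())
    match = re.match(r'^(?P<title>.*?)\s*\((?P<year>\d{4})\)$', flat)
    if match is None:
        return flat, None   # no trailing "(YYYY)" found
    return match.group('title'), int(match.group('year'))

raw = 'The Shawshank Redemption\n                   (1994)\n                   \n'
title, year = split_title_and_year(raw)
print(title, year)
```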

#### Get the Star Rating (as a float)

In [331]:
# get the star rating (as a float)
float(b.find(name='span', attrs={'itemprop':'ratingValue'}).text)
# Couldn't seem to find this:
# <div class="titlePageSprite star-box-giga-star"> 9.3 </div>
b.find_all('div', {'class':'titlePageSprite star-box-giga-star'})

[]

#### Get the Movie Rating

In [332]:
panel = b.find('meta', attrs={'itemprop':'contentRating'}) # .text returns too much: the rating plus the rest of the info bar
panel.text

u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '
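
Since the text above is a run of fields separated by `|`, one way to pull the pieces apart is to split on that character and normalize the whitespace in each piece. This is a sketch in Python 3; `parse_info_bar` is a hypothetical helper, and the field order (rating, duration, genres, release date) is assumed from this single output, so other pages may differ.

```python
def parse_info_bar(raw_text):
    """Split '|'-separated text into fields with whitespace collapsed."""
    return [' '.join(part.split()) for part in raw_text.split('|')]

# The exact text returned by panel.text above
raw_info = ('R\n| \n                        2h 22min\n'
            '                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n ')

fields = parse_info_bar(raw_info)
content_rating = fields[0]
print(fields)
```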

### In-Class Exercise

#### Intro Level:

Using the OMDb API, request the release year of each of the 1000 movies in the CSV. Answer the questions below.

#### Challenge Level:

Can you scrape the IMDB Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year, and the unique movie identifier (e.g. 'tt0111161')? Use the function above to scrape each of the movie pages.

#### Questions:
* How many of the Top movies are rated 'R'?
* What is the average duration of movies with a star_rating above 9?
* What is the average duration of movies before 1985 and after?

In [338]:
# Using the OMDb API, request the release year of each of the 1000 movies in the CSV. Answer the questions below.
# http://www.omdbapi.com/?t=Minion&y=&plot=short&r=json

import requests
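def getYearOMDB(title):
    # pass the query through 'params' so that requests URL-encodes the title
    params = {'t': title, 'y': '', 'plot': 'short', 'r': 'json'}
    try:
        r = requests.get('http://www.omdbapi.com/', params=params)
        if r.status_code == 200:
            a = int(r.json()['Year'])
            print title, a
            return a
        else:
            a = 0
            print title, a
            return a
    except Exception:   # network error, missing 'Year' key, or a year that is not an integer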
        a = 0
        print "Exiting...", title, a
        return a
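
Gluing a title straight into a query string can go wrong when the title contains characters such as `&`, which separates query parameters. The sketch below (Python 3, standard library only) compares a naive URL with one built by `urlencode`; `build_omdb_url` is a hypothetical helper, and 'Fast & Furious' is just a made-up test title containing an ampersand.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OMDB_ENDPOINT = 'http://www.omdbapi.com/'

def build_omdb_url(title):
    """Build the query URL with every parameter value percent-encoded."""
    query = urlencode({'t': title, 'y': '', 'plot': 'short', 'r': 'json'})
    return OMDB_ENDPOINT + '?' + query

title = 'Fast & Furious'   # made-up title, chosen only for the ampersand
naive_url = 'http://www.omdbapi.com/?t=' + title + '&y=&plot=short&r=json'
safe_url = build_omdb_url(title)

# Decode both query strings to see which title the server would receive
naive_title = parse_qs(urlsplit(naive_url).query, keep_blank_values=True)['t']
safe_title = parse_qs(urlsplit(safe_url).query, keep_blank_values=True)['t']
print(safe_url)
print(naive_title, safe_title)
```

In the naive URL the title is cut off at the ampersand, so the lookup would be for a different movie.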

In [339]:
import pandas as pd
movies = pd.read_csv('/home/anna/DAT-DC-10/data/imdb_1000.csv', header=0, na_filter=False)
movies.shape

(979, 6)

In [340]:
getYearOMDB("12 Angry Men")

1957

In [341]:
movies.loc[0,'title']
movies.shape

(979, 6)

In [16]:
import time

for i,row in movies.iterrows():
    # pause between requests: 20 seconds on every 42nd row, 3 seconds otherwise
    time.sleep(20 if i % 42 == 0 else 3)
    a = getYearOMDB(row['title'])
    movies.loc[i,'Year'] = a


The Shawshank Redemption 1994
The Godfather 1972
The Godfather: Part II 1974
The Dark Knight 2008
Pulp Fiction 1994
12 Angry Men 1957
The Good, the Bad and the Ugly 1966
The Lord of the Rings: The Return of the King 2003
Schindler's List 1993
Fight Club 1999
The Lord of the Rings: The Fellowship of the Ring 2001
Inception 2010
Star Wars: Episode V - The Empire Strikes Back 1980
Forrest Gump 1994
The Lord of the Rings: The Two Towers 2002
Interstellar 2014
One Flew Over the Cuckoo's Nest 1975
Seven Samurai 1954
Goodfellas 1990
Star Wars 1983
The Matrix 1999
City of God 2002
It's a Wonderful Life 1946
The Usual Suspects 1995
Se7en 1995
Life Is Beautiful 1997
Once Upon a Time in the West 1968
The Silence of the Lambs 1991
Leon: The Professional 1994
City Lights 1931
Spirited Away 2001
The Intouchables 2011
Casablanca 1942
Whiplash 2014
American History X 1998
Modern Times 1936
Saving Private Ryan 1998
Raiders of the Lost Ark 1981
Rear Window 1954
Psycho 1960
The Green Mile 1999
Sunset Blv
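
It helps to estimate how long a throttled loop like this will spend sleeping before starting it. A rough sketch in Python 3, assuming the DataFrame has 979 rows with the default index 0 through 978 (per `movies.shape` above) and ignoring the time of the requests themselves; `pause_seconds` is a hypothetical helper that mirrors the sleep rule in the loop.

```python
def pause_seconds(i, long_every=42, long_pause=20, short_pause=3):
    """Seconds to sleep before row i: a long pause on every 42nd row."""
    return long_pause if i % long_every == 0 else short_pause

n_rows = 979   # assumed from movies.shape
total_seconds = sum(pause_seconds(i) for i in range(n_rows))
total_minutes = total_seconds / 60.0
print(total_seconds, total_minutes)
```

That is 24 long pauses plus 955 short ones, 3345 seconds in total, so close to an hour of waiting before any network time is counted.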

In [337]:
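# count the rows whose 'Year' value is filled in
# (needs the 'Year' column created by the OMDb loop; without it this raises an AttributeError)
# note: 'movies.Year != None' is True for every row, so use notnull() instead
null_bool = movies.Year.notnull()
sum(null_bool)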


AttributeError: 'DataFrame' object has no attribute 'Year'

#### Questions for all movies in IMDB_1000.csv
* How many of the movies are rated 'R'? 
* What is the average duration of movies with a star_rating above 9? 
* What is the average duration of movies before 1985 and after?

In [342]:
# How many of the movies are rated 'R'? 
# a boolean comparison on the column avoids looping over rows
(movies.content_rating == 'R').sum()

460

In [343]:
# What is the average duration of movies with a star_rating above 9?
rates = movies[movies.star_rating > 9].duration
print rates.sum(), '/', len(rates), ' ==> ', rates.mean()

In [344]:
# What is the average duration of movies before 1985 and after?
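# needs the 'Year' column filled in by the OMDb loop; without it this raises a KeyError
# a Year of 0 marks a failed lookup, so those rows are left out of the "before" group
before = [movies.loc[i, 'duration'] for i, r in movies.iterrows() if 0 < movies.loc[i, 'Year'] < 1985]
after = [movies.loc[i, 'duration'] for i, r in movies.iterrows() if movies.loc[i, 'Year'] >= 1985]
print 'before 1985: ', sum(before), '/', len(before), ' ==> ', sum(before) / float(len(before))
print '1985 and after: ', sum(after), '/', len(after), ' ==> ', sum(after) / float(len(after))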

KeyError: 'the label [Year] is not in the [index]'

In [74]:
yr_1985 = movies.Year == 1985
sum(yr_1985)

9

### Optional Web Scraping Homework

First, define a function that accepts an IMDb ID and returns a dictionary of
movie information: title, star_rating, description, content_rating, duration.
The function should gather this information by scraping the IMDb website, not
by calling the OMDb API. (This is really just a wrapper of the web scraping
code we wrote above.)

For example, `get_movie_info('tt0111161')` should return:
```
{'content_rating': 'R',
 'description': u'Two imprisoned men bond over a number of years...',
 'duration': 142,
 'star_rating': 9.3,
 'title': u'The Shawshank Redemption'}
 ```

Then, open the file imdb_ids.txt using Python, and write a for loop that builds
a list in which each element is a dictionary of movie information.
Finally, convert that list into a DataFrame.
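
One possible shape for the file-reading loop, sketched in Python 3. The names `build_movie_records` and `fake_fetch` are hypothetical, the in-memory file stands in for imdb_ids.txt, and `fake_fetch` returns placeholder data instead of scraping, so this shows only the loop structure and not a solution to the scraping part.

```python
import io

def build_movie_records(id_file, fetch_info):
    """Call fetch_info on each non-empty line of id_file; return a list of dicts."""
    records = []
    for line in id_file:
        imdb_id = line.strip()
        if not imdb_id:
            continue   # skip blank lines
        records.append(fetch_info(imdb_id))
    return records

def fake_fetch(imdb_id):
    # Placeholder for a real get_movie_info(imdb_id); no network access here
    return {'imdb_id': imdb_id, 'title': 'placeholder for ' + imdb_id}

# In-memory stand-in for open('imdb_ids.txt')
sample_file = io.StringIO('tt0111161\ntt0068646\n\n')
records = build_movie_records(sample_file, fake_fetch)
print(records)
```

With a real file, the call would sit inside a `with open(...) as f:` block and receive the real scraping function.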




In [1]:
import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/chart/top?ref_=nv_mv_250_6'
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')

# get every element with class 'titleColumn' (these are <td> table cells, not paragraphs); there are 250
td = bs.find_all(attrs={'class':'titleColumn'})


In [2]:
td[0]

<td class="titleColumn">\n      1.\n      <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=2239792642&amp;pf_rd_r=10KHY9RXVY2XAZ6N0CDR&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=top&amp;ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>\n<span class="secondaryInfo">(1994)</span>\n</td>

In [3]:
td[0].find(name='a')['href']

u'/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2239792642&pf_rd_r=10KHY9RXVY2XAZ6N0CDR&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'

In [4]:
strings = td[0].find(name='a')['href'].split('/')

In [5]:
strings

[u'',
 u'title',
 u'tt0111161',
 u'?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2239792642&pf_rd_r=10KHY9RXVY2XAZ6N0CDR&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1']

In [6]:
strings[2]

u'tt0111161'
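
Splitting on `/` and taking element 2 works as long as the link keeps this exact layout. A slightly more defensive alternative, sketched in Python 3 with the standard library, matches the ID pattern explicitly and returns `None` for links that are not title pages. `extract_imdb_id` is a hypothetical helper, and the href is a shortened form of the one shown above.

```python
import re
from urllib.parse import urlsplit

# An IMDb title path looks like /title/tt0111161/ in the links above
IMDB_ID_PATTERN = re.compile(r'^/title/(tt\d+)/')

def extract_imdb_id(href):
    """Return the 'tt...' identifier from a title link, or None if absent."""
    path = urlsplit(href).path   # drops the ?query part
    match = IMDB_ID_PATTERN.match(path)
    return match.group(1) if match else None

href = '/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&ref_=chttp_tt_1'
movie_id = extract_imdb_id(href)
print(movie_id)
```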

In [236]:
import requests
from bs4 import BeautifulSoup

def getTop250Ids():
    url = 'http://www.imdb.com/chart/top?ref_=nv_mv_250_6'
    r = requests.get(url)
    bs = BeautifulSoup(r.text, 'html.parser')
    list_250 = []
    
    # get every element with class 'titleColumn' (these are <td> table cells, not paragraphs); there are 250
    td = bs.find_all(attrs={'class':'titleColumn'})
    for i in td:
        strings = i.find(name='a')['href'].split('/')
        list_250.append((i.find('a').text, strings[2]))
    
    return list_250


In [231]:
td[0].find('a').text

u'The Shawshank Redemption'

In [244]:
import pandas as pd
mylist = getTop250Ids()
df_movies = pd.DataFrame(mylist, columns=['title','idx'])

In [245]:
df_movies

| | title | idx |
|---|---|---|
| 0 | The Shawshank Redemption | tt0111161 |
| 1 | The Godfather | tt0068646 |
| 2 | The Godfather: Part II | tt0071562 |
| 3 | The Dark Knight | tt0468569 |
| 4 | 12 Angry Men | tt0050083 |
| 5 | Schindler's List | tt0108052 |
| 6 | Pulp Fiction | tt0110912 |
| 7 | The Good, the Bad and the Ugly | tt0060196 |
| 8 | The Lord of the Rings: The Return of the King | tt0167260 |
| 9 | Fight Club | tt0137523 |


In [251]:
#some_rsp = findMovieInfo(df_movies[0].id, df_movies[0])
print df_movies.iloc[0,0]
print df_movies.idx[0]
print df_movies.title[249]

The Shawshank Redemption
tt0111161
The Killing


In [285]:
# www.imdb.com/title/<imdb_id>
# title <span class="itemprop" itemprop="name">The Godfather</span>
# rating <meta itemprop="contentRating" content="R">
# star <span itemprop="ratingValue">9.2</span>
# duration <time itemprop="duration" datetime="PT175M">175 min</time>
# description <p itemprop="description">The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.</p>
import requests
from bs4 import BeautifulSoup

# id ==> df.id[0] or df.iloc[0][0] or df.iloc[0,0]
# Question:  Is it better to pass in the DataFrame by reference (can this be done?), or pass the values back via list
# and have the calling function populate the DataFrame?  Or make the DataFrame global parameter (yuck)
def findMovieInfo(i):  #row):
    return_list = []
    url = 'http://www.imdb.com/title/' + i
    try:
        r = requests.get(url)
        bs = BeautifulSoup(r.text, 'html.parser')
        title  = bs.find(name='h1').text.split('\n')[0]
        star   = bs.find(name='span', attrs={'itemprop':'ratingValue'})
        r_star = 0 if star is None else float(star.text)
        rating = bs.find(name='meta', attrs={'itemprop':'contentRating'})
        c_rate = 0 if rating is None else rating.text.split('\n')[0]
        descr  = bs.find(name='div', attrs={'class':'summary_text','itemprop':'description'}).text.strip()
        #This works for everything, except one entry.  Have to derive another way
        #time   = bs.find_all(name='time', attrs={'itemprop':'duration'})
        #minute = 0 if time == None else time[1].text
        #duration = int(minute.split(' ')[0])
        duration = int(bs.find(name='time')['datetime'][2:-1])
        # print i, title, bs.find(name='time')['datetime'], c_rate, r_star, duration, '\n'
    
        # Question:  This doesn't work... is row just a copy ?
        #row['title'] = title
        return_list.append(('title',title))
        #row['content_rating'] = rating
        return_list.append(('rating', c_rate))
        #row['star_rating'] = star
        return_list.append(('star', r_star))
        #row['duration'] = duration
        return_list.append(('duration', duration))
        #row['description'] = descr
        return_list.append(('description', descr))
    
        return(return_list)

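    except Exception as e:
        # 'title' may not be assigned yet if the failure happened early, so report the id and the error
        print "errors with: ", i, e
        return []   # an empty list keeps a caller's loop over the result working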
    

In [302]:
# Side note: DataFrame.iterrows() yields (index, row) pairs
# This is a test loop that scrapes every movie in df_movies
partial = df_movies[0:250].copy()   # copy, so the writes below do not target a slice of df_movies
for i, row in partial.iterrows():
    values_list = findMovieInfo(row.idx) or []   # treat a failed scrape (None) as no values
    # progress indicator: one dot per movie, a new line after every 50
    if (i+1)%50 == 0:
        print '.'
    else:
        print'.',
    for tup in values_list:
        partial.loc[i,tup[0]] = tup[1]
print '\nComplete! View the "partial" dataframe!'

 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Complete! View the "partial" dataframe!
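
The loop above writes each (name, value) pair into the DataFrame one cell at a time. An alternative, sketched here in Python 3, is to turn each list of pairs into a dict and collect one dict per movie; a list of such dicts can then be handed to `pd.DataFrame` in a single call. `pairs_to_record` is a hypothetical helper, and the sample values are the first row's values from this notebook.

```python
def pairs_to_record(pairs, imdb_id):
    """Convert [(name, value), ...] into a dict and attach the movie id."""
    record = dict(pairs)
    record['idx'] = imdb_id
    return record

# Same shape as the list of tuples returned by findMovieInfo
pairs = [('title', 'The Shawshank Redemption'), ('rating', 'R'),
         ('star', 9.3), ('duration', 142)]

record = pairs_to_record(pairs, 'tt0111161')
print(record)
```

This also sidesteps the question of whether `row` from `iterrows()` can be modified in place: the scraper returns plain values and the caller decides where they go.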


In [303]:
partial.head(5)

| | title | idx | rating | star | duration | description |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | tt0111161 | R | 9.3 | 142 | Two imprisoned men bond over a number of years... |
| 1 | The Godfather | tt0068646 | R | 9.2 | 175 | The aging patriarch of an organized crime dyna... |
| 2 | The Godfather: Part II | tt0071562 | R | 9.0 | 202 | The early life and career of Vito Corleone in ... |
| 3 | The Dark Knight | tt0468569 | PG-13 | 9.0 | 152 | When the menace known as the Joker wreaks havo... |
| 4 | 12 Angry Men | tt0050083 | Not Rated | 8.9 | 96 | A dissenting juror in a murder trial slowly ma... |


### Note A:  Ways to get fields from HTML page

In [192]:
df_movies.idx[244]

u'tt0169102'

In [265]:
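# this method of finding the duration does not always work
time   = bs.find_all(name='time', attrs={'itemprop':'duration'})
# Problematic result: this page has only one matching <time> tag, so time[1] raises an IndexError
# [<time datetime="PT146M" itemprop="duration">\n                        2h 26min\n                    </time>]
# note: find_all returns an empty ResultSet (never None) when nothing matches, so the None check never triggers
minute = 0 if time == None else time[1].text
duration = int(minute.split(' ')[0])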

[<time datetime="PT146M" itemprop="duration">\n                        2h 26min\n                    </time>]

In [268]:
int(bs.find(name='time')['datetime'][2:-1])

146

In [257]:
bs.find("time")['datetime']

u'PT146M'
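
Slicing with `[2:-1]` relies on the value always looking like `PT<minutes>M`. A somewhat more forgiving parser is sketched below in Python 3; `duration_minutes` is a hypothetical helper, and the hours form (`PT2H26M`) is an assumption about what a page might contain, since only the minutes form appears in this notebook.

```python
import re

# Accepts 'PT146M' and, as an assumption, forms with hours such as 'PT2H26M'
DURATION_PATTERN = re.compile(r'^PT(?:(\d+)H)?(?:(\d+)M)?$')

def duration_minutes(iso_duration):
    """Convert a 'PT...H...M' duration string to a whole number of minutes."""
    match = DURATION_PATTERN.match(iso_duration)
    if match is None or not any(match.groups()):
        raise ValueError('unsupported duration: %r' % iso_duration)
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print(duration_minutes('PT146M'))
print(duration_minutes('PT2H26M'))
```

Raising an error on an unexpected format makes a changed page layout visible instead of silently producing a wrong number.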

In [260]:
cr = bs.find('meta', attrs={'itemprop':'contentRating'}) #.text.split('\n')[0]
#bs.find(name='meta', attrs={'itemprop':'contentRating'}).text.split('\n')[0]
#l = bs.find_all(name='div')
#l[1]
cr.text.split('\n')[0]

u'PG-13'

In [256]:
# Using prettify() in order to figure out how to parse the html and know what attributes and tags I need.
# tt1454029 (#244)
url = 'http://www.imdb.com/title/' + 'tt1454029' #df_movies.idx[245]
r = requests.get(url)
bs = BeautifulSoup(r.text,'html.parser')


###  Note B:  How to extract data from Pandas dataframe using df.iloc and df.loc

In [152]:
# Using integer indexing, df.iloc[row index range, col index range]
df_movies.iloc[0,0]

u'tt0111161'

In [133]:
# Using integer indexing, df.iloc[single row index][col index range]
df_movies.iloc[0][0]

u'tt0111161'

In [158]:
# Using dataframe attribute name and row index range
df_movies.id[0:3]

0    tt0111161
1    tt0068646
2    tt0071562
Name: id, dtype: object

In [159]:
# Using dataframe NAME or INTEGER index, df.loc[row index range, col index range]
# Can also do the following with NAMEs,  df.loc[row index range, 'col name start':'col name end']
df_movies.loc[0:4,'id']

0    tt0111161
1    tt0068646
2    tt0071562
3    tt0468569
4    tt0050083
Name: id, dtype: object

In [174]:
#### Note: what DOES NOT work is dataframe[0].  Have to use dataframe.iloc[0,0] or dataframe.loc[0,'<column name>']
#### or dataframe.<attribute>[0]
# this does not work ==>  movies[0]
# this does not work ==>  df_movies[0]
# this DOES WORK !!! ==>  movies.title[23:27]
movies.title[23:27]

23              The Usual Suspects
24                           Se7en
25               Life Is Beautiful
26    Once Upon a Time in the West
Name: title, dtype: object