# Web Scraping with BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)

What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.imdb.com/robots.txt
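
As a rough sketch of how a scraper might honor robots.txt, the snippet below uses the `urllib.robotparser` module from the Python 3 standard library. The sample rules and the `is_allowed` helper are made up for illustration; they are not IMDb's actual rules.

```python
from urllib.robotparser import RobotFileParser  # Python 3 standard library

# Made-up robots.txt content for illustration (not IMDb's actual file)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent='*'):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, 'http://example.com/private/page.html'))
print(is_allowed(SAMPLE_ROBOTS_TXT, 'http://example.com/index.html'))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` downloads the file. Calling `time.sleep()` between page requests is a simple way to avoid scraping too fast.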

What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at example.html
- Tags are opened and closed
- Tags have optional attributes
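
To make "tags" and "attributes" concrete, here is a minimal sketch that parses a single paragraph tag (shortened from example.html) and inspects its parts. It assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# One opened-and-closed <p> tag with two attributes
snippet = "<p class='topic' id='api'>First, we are covering APIs.</p>"
tag = BeautifulSoup(snippet, 'html.parser').find('p')

print(tag.name)   # the tag name
print(tag.attrs)  # a dictionary of attributes ('class' is stored as a list)
print(tag.text)   # the text between the opening and closing tags
```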

How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's inspect example.html

### Acquire Your Data

In [114]:
# read the HTML code for a web page and save as a string
with open('../data/example.html', 'r') as f:
    html = f.read()

### Look at / Explore Your Data (HTML)

In [117]:
html

"<!DOCTYPE html>\n<html lang='en'>\n\n<head>\n    <title>Example Web Page</title>\n</head>\n\n<body>\n\n    <h1 id='main'>DAT10 Class 6</h1>\n\n    <p class='topic' id='api'>First, we are covering APIs, which are useful for getting data.</p>\n    <p class='topic' id='scraping'>Then, we are covering web scraping, which is a more flexible way to get data.</p>\n    <p class='topic' id='feedback'>Finally, I will ask you to fill out yet another feedback form!</p>\n\n    <h2>Resource List</h2>\n\n    <p>Here are some helpful API resources:</p>\n\n    <ul id='api'>\n        <li>API resource 1</li>\n        <li>API resource 2</li>\n    </ul>\n\n    <p>Here are some helpful web scraping resources:</p>\n\n    <ul id='scraping'>\n        <li>Web scraping resource 1</li>\n        <li>Web scraping resource 2</li>\n    </ul>\n\n</body>\n\n</html>\n"

In [120]:
# convert HTML into a structured Soup object
from bs4 import BeautifulSoup
b = BeautifulSoup(html, 'html.parser')

In [125]:
# print out the object
# b
b.prettify()

u'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <title>\n   Example Web Page\n  </title>\n </head>\n <body>\n  <h1 id="main">\n   DAT10 Class 6\n  </h1>\n  <p class="topic" id="api">\n   First, we are covering APIs, which are useful for getting data.\n  </p>\n  <p class="topic" id="scraping">\n   Then, we are covering web scraping, which is a more flexible way to get data.\n  </p>\n  <p class="topic" id="feedback">\n   Finally, I will ask you to fill out yet another feedback form!\n  </p>\n  <h2>\n   Resource List\n  </h2>\n  <p>\n   Here are some helpful API resources:\n  </p>\n  <ul id="api">\n   <li>\n    API resource 1\n   </li>\n   <li>\n    API resource 2\n   </li>\n  </ul>\n  <p>\n   Here are some helpful web scraping resources:\n  </p>\n  <ul id="scraping">\n   <li>\n    Web scraping resource 1\n   </li>\n   <li>\n    Web scraping resource 2\n   </li>\n  </ul>\n </body>\n</html>\n'

#### 'find' method returns the first matching Tag (and everything inside of it)

In [131]:
b.find(name='body')
title = b.find(name='h1')
title.text

u'DAT10 Class 6'

In [133]:
# displaying a Tag shows the tag itself, its attributes, and its 'inside text'
b.find(name='h1')

<h1 id="main">DAT10 Class 6</h1>

In [132]:
# Tags also allow you to access their attributes
b.find(name='h1')['id']

u'main'
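
A caveat worth sketching: square-bracket access raises a `KeyError` when the attribute is missing, and `find` returns `None` when nothing matches. The snippet below uses a small stand-in for example.html and shows the more forgiving `Tag.get` method.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1 id='main'>DAT10 Class 6</h1><h2>Resource List</h2>",
                     'html.parser')
h1 = soup.find('h1')
h2 = soup.find('h2')

print(h1['id'])               # the attribute exists, so this works
print(h2.get('id'))           # the attribute is missing: get returns None
print(h2.get('id', 'no id'))  # or a default value of your choice

try:
    h2['id']                  # square brackets raise KeyError instead
except KeyError:
    print('h2 has no id attribute')

print(soup.find('h3'))        # no matching tag: find returns None
```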

#### 'find_all' method is useful for finding all matching Tags

In [134]:
b.find_all(name='p')    # returns a ResultSet (like a list of Tags)

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>,
 <p>Here are some helpful API resources:</p>,
 <p>Here are some helpful web scraping resources:</p>]

### Quiz: What is the datatype returned by 'find_all'? What kinds of operations can we do on that datatype?

In [138]:
tag = b.find_all(name='p')[0]
tag

<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>

Hint: ResultSets can be sliced

In [142]:
# len(b.find_all(name='p'))
# b.find_all(name='p')[0]
# b.find_all(name='p')[0].text
b.find_all(name='p')[0]['id']

u'api'
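
One way to check the quiz answer: a `ResultSet` is a subclass of Python's `list`, so indexing, slicing, `len`, and iteration all work. The HTML below is a shortened stand-in for example.html.

```python
from bs4 import BeautifulSoup

html = "<p id='a'>one</p><p id='b'>two</p><p>three</p>"
results = BeautifulSoup(html, 'html.parser').find_all('p')

print(type(results).__name__)      # the class name of the returned object
print(isinstance(results, list))   # a ResultSet behaves like a list
print(len(results))                # number of matching tags
first_two = results[:2]            # slicing works as it does on a list
texts = [tag.text for tag in results]
print(texts)
```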

In [143]:
# iterate over a ResultSet
results = b.find_all(name='p')
# for tag in results:
#     print tag.text

### Quiz: How would you write the above as a list comprehension?

#### Part II - Make a string with each tag.text separated by a newline character '\n'

In [153]:
[tag.text for tag in results if len(tag.text) > 0]

[tag.text for tag in b.find_all(name='p')]

# build the string in a loop (note: this leaves a trailing '\n' at the end)
bstring = ''
for tag in results:
    bstring += tag.text + '\n'
bstring

'!!!!!'.join([tag.text for tag in b.find_all(name='p')])



u'First, we are covering APIs, which are useful for getting data.!!!!!Then, we are covering web scraping, which is a more flexible way to get data.!!!!!Finally, I will ask you to fill out yet another feedback form!!!!!!Here are some helpful API resources:!!!!!Here are some helpful web scraping resources:'
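
The output above uses `'!!!!!'` as the separator to make the joins easy to see. For Part II, a possible answer swaps in `'\n'`. This is a sketch on a shortened stand-in for example.html.

```python
from bs4 import BeautifulSoup

html = "<p>one</p><p>two</p><p>three</p>"
soup = BeautifulSoup(html, 'html.parser')

# join puts the separator between items only, so there is no trailing newline
joined = '\n'.join([tag.text for tag in soup.find_all(name='p')])
print(joined)
```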

### Limit search by Tag attribute

In [154]:
b.find(name='p', attrs={'id':'scraping'})

<p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>

In [156]:
# b.find_all(name='p', attrs={'class':'topic'})
b.find_all(attrs={'class':'topic'})

[<p class="topic" id="api">First, we are covering APIs, which are useful for getting data.</p>,
 <p class="topic" id="scraping">Then, we are covering web scraping, which is a more flexible way to get data.</p>,
 <p class="topic" id="feedback">Finally, I will ask you to fill out yet another feedback form!</p>]

### Limit search to specific sections

In [160]:
# b.find_all(name='li')
b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')

[<li>Web scraping resource 1</li>, <li>Web scraping resource 2</li>]

## In Class Exercise

1) Find the 'h2' tag and then print its text

In [161]:
b.find('h2').text

u'Resource List'

2) Find the 'p' tag with an 'id' value of 'feedback' and then print its text


In [163]:
b.find('p', attrs={'id':'feedback'}).text

u'Finally, I will ask you to fill out yet another feedback form!'

3) Find the first 'p' tag and then print the value of the 'id' attribute


In [164]:
b.find('p')['id']

u'api'

4) Print the text of all four resources

In [165]:
[tag.text for tag in b.find_all(name='li')]

[u'API resource 1',
 u'API resource 2',
 u'Web scraping resource 1',
 u'Web scraping resource 2']

5) Using a list comprehension, can you extract the text of only the API resources?

In [166]:
[tag.text for tag in b.find_all(name='li') if 'API' in tag.text]

[u'API resource 1', u'API resource 2']

### Tool: Selector Gadget
http://selectorgadget.com/
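
Selector Gadget produces CSS selectors, which BeautifulSoup can use directly through its `select` and `select_one` methods. A minimal sketch, using a shortened stand-in for example.html:

```python
from bs4 import BeautifulSoup

html = """
<p class='topic' id='api'>APIs</p>
<p class='topic' id='feedback'>Feedback</p>
<p>Plain paragraph</p>
<ul id='scraping'>
    <li>Web scraping resource 1</li>
    <li>Web scraping resource 2</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'p.topic' matches <p> tags whose class is 'topic'
topics = [p.text for p in soup.select('p.topic')]

# 'ul#scraping li' matches <li> tags inside the <ul> whose id is 'scraping'
items = [li.text for li in soup.select('ul#scraping li')]

# select_one returns only the first match (or None)
feedback = soup.select_one('#feedback').text

print(topics, items, feedback)
```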

## Scraping IMDB

#### First, open your browser and look at the website and its HTML structure

http://www.imdb.com/title/tt0111161/

#### Get the HTML from the Shawshank Redemption page

In [167]:
import requests
r = requests.get('http://www.imdb.com/title/tt0111161/')

#### What is r? What can we do with it?
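
`r` is a `requests.Response` object. Among other things, it exposes `r.status_code`, `r.headers`, `r.text` (the decoded body), and `r.content` (the raw bytes). The `fetch_html` helper below is a hypothetical wrapper, not part of the lesson's code, sketching how those pieces might be used defensively.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text; raise on an HTTP error."""
    r = requests.get(url, timeout=timeout)  # r is a requests.Response
    r.raise_for_status()                    # raises requests.HTTPError on 4xx/5xx
    return r.text


# Example usage (requires network access):
# html = fetch_html('http://www.imdb.com/title/tt0111161/')
```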

#### Convert HTML into Soup

In [170]:
b = BeautifulSoup(r.text, 'html.parser')
print(b)

In [171]:
# run this code if you have encoding errors
# (Python 2 only: the reload builtin and sys.setdefaultencoding do not exist in Python 3)
import sys
reload(sys)
sys.setdefaultencoding('utf8')

#### Get the title

In [183]:
b.find('h1').text

u'The Shawshank Redemption'

#### Get the Star Rating (as a float)

In [182]:
# get the star rating (as a float)
float(b.find(name='span', attrs={'itemprop':'ratingValue'}).text)

# alternative: find the rating by its class, e.g.
# <div class="titlePageSprite star-box-giga-star"> 9.3 </div>
# float(b.find(class_='star-box-giga-star').text)

#### Get the Movie Rating

In [95]:
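# the content rating is stored in the 'content' attribute of a <meta> tag:
# <meta itemprop="contentRating" content="R">
panel = b.find('meta', attrs={'itemprop':'contentRating'})

# panel['content']    # just the rating
panel.text            # too many: also returns the text that follows the tag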

u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '
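
The duration is buried in that text as `2h 22min`. One possible way to pull it out as a number of minutes is a pair of regular expressions. `parse_duration` is a hypothetical helper, and it assumes the "Nh Nmin" format shown in the output above.

```python
import re


def parse_duration(text):
    """Return the running time in minutes from text like '2h 22min', or None."""
    hours = re.search(r'(\d+)\s*h\b', text)
    minutes = re.search(r'(\d+)\s*min\b', text)
    if not hours and not minutes:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


# The text copied from the output above
infobar = u'R\n| \n    2h 22min\n    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '
print(parse_duration(infobar))
```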

### In-Class Exercise

#### Intro Level:
Using the OMDb API, request the year of each of the 1000 movies in the CSV. Answer the questions below.
    
#### Challenge Level:
Can you scrape the IMDB Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year, and the unique movie identifier (e.g. 'tt0111161')?
Use the scraping code above to scrape each of the movie pages.
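
The unique identifier appears in each movie's URL (as in http://www.imdb.com/title/tt0111161/), so one approach is to pull it out of a link's `href` with a regular expression. `extract_imdb_id` is a hypothetical helper, and the query string in the example is invented for illustration.

```python
import re


def extract_imdb_id(href):
    """Return the 'tt...' identifier found in an IMDb title link, or None."""
    match = re.search(r'/title/(tt\d+)', href)
    return match.group(1) if match else None


print(extract_imdb_id('http://www.imdb.com/title/tt0111161/'))
print(extract_imdb_id('/title/tt0111161/?ref_=example'))  # made-up query string
print(extract_imdb_id('/chart/top'))                      # not a title link
```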


**Questions:**

How many of the Top movies are rated 'R'?

What is the average duration of movies with a star_rating above 9?

What is the average duration of movies released before 1985, and of those released after?


### Optional Web Scraping Homework

First, define a function that accepts an IMDb ID and returns a dictionary of
movie information: title, star_rating, description, content_rating, duration.
The function should gather this information by scraping the IMDb website, not
by calling the OMDb API. (This is really just a wrapper of the web scraping
code we wrote above.)

For example, `get_movie_info('tt0111161')` should return:
```
{'content_rating': 'R',
 'description': u'Two imprisoned men bond over a number of years...',
 'duration': 142,
 'star_rating': 9.3,
 'title': u'The Shawshank Redemption'}
 ```

Then, open the file imdb_ids.txt using Python, and write a for loop that builds
a list in which each element is a dictionary of movie information.
Finally, convert that list into a DataFrame.
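
The loop-and-convert step might look roughly like the sketch below. Everything in it is a stand-in: `get_movie_info` is a placeholder that returns empty values instead of scraping, the extra `imdb_id` key is added only so the rows are distinguishable, the second ID is made up, and `io.StringIO` takes the place of `open('imdb_ids.txt')`. It assumes `pandas` is installed.

```python
import io

import pandas as pd


def get_movie_info(imdb_id):
    # Placeholder for the scraping function described above; it returns
    # empty values so the loop can run without network access.
    return {'imdb_id': imdb_id, 'title': None, 'star_rating': None,
            'description': None, 'content_rating': None, 'duration': None}


# Stand-in for open('imdb_ids.txt'); the second ID is made up
id_file = io.StringIO(u'tt0111161\ntt0000000\n')

movies = []
for line in id_file:
    imdb_id = line.strip()
    if imdb_id:  # skip blank lines
        movies.append(get_movie_info(imdb_id))

# A list of dictionaries converts directly: one row per dictionary
df = pd.DataFrame(movies)
print(df.shape)
```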


