# Web Scraping with BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

What is web scraping?
- Extracting information from websites (simulates a human copying and pasting)
- Based on finding patterns in website code (usually HTML)

What are best practices for web scraping?
- Scraping too many pages too fast can get your IP address blocked
- Pay attention to the robots exclusion standard (robots.txt)
- Let's look at http://www.imdb.com/robots.txt
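
A polite scraper can check the robots.txt rules before fetching a page and pause between requests. The sketch below uses Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helpers `is_allowed` and `polite_delay` are made up for illustration; they are not IMDb's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

def polite_delay(default_seconds=1.0, user_agent="*"):
    """Return the Crawl-delay from robots.txt, or a default if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds

for url in ["http://example.com/index.html",
            "http://example.com/private/data.html"]:
    print(url, "allowed" if is_allowed(url) else "disallowed")

# Between real requests you could call time.sleep(polite_delay()).
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of using an inline sample.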

What is HTML?
- Code interpreted by a web browser to produce ("render") a web page
- Let's look at example.html
- Tags are opened and closed
- Tags have optional attributes
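
To make opening tags, closing tags, and attributes concrete, the sketch below walks a tiny HTML snippet with Python's built-in `html.parser`. The snippet and the `TagLogger` class are invented for illustration; they are not the contents of example.html.

```python
from html.parser import HTMLParser

# A made-up snippet: <h1> has one attribute, <p> has two.
SNIPPET = '<h1 id="main">Title</h1><p id="intro" class="topic">Hello</p>'

class TagLogger(HTMLParser):
    """Record every opening tag (with its attributes) and every closing tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))

logger = TagLogger()
logger.feed(SNIPPET)
logger.close()
for event in logger.events:
    print(event)
```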

How to view HTML code:
- To view the entire page: "View Source" or "View Page Source" or "Show Page Source"
- To view a specific part: "Inspect Element"
- Safari users: Safari menu, Preferences, Advanced, Show Develop menu in menu bar
- Let's inspect example.html

### Acquire Your Data

In [None]:
# read the HTML code for a web page and save as a string
with open('../data/example.html', 'r') as f:
    html = f.read()

### Look at / Explore Your Data (HTML)

In [None]:
print(html)

In [3]:
# convert HTML into a structured Soup object
from bs4 import BeautifulSoup
b = BeautifulSoup(html, 'html.parser')

In [None]:
# print out the object (uncomment one of the lines below)
# print(b)
# print(b.prettify())

#### 'find' method returns the first matching Tag (and everything inside of it)

In [None]:
b.find(name='body')
b.find(name='h1')

In [None]:
# Tags allow you to access the 'inside text'
b.find(name='h1').text

In [None]:
# Tags also allow you to access their attributes
b.find(name='h1')['id']
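
Because `find` returns `None` when nothing matches, chaining `.text` or `['id']` onto a failed search raises an error. The sketch below repeats the calls above on a small inline document (an invented stand-in for example.html) and adds safer access through `.get()`.

```python
from bs4 import BeautifulSoup

# Invented stand-in for example.html so the example is self-contained.
html = '<html><body><h1 id="main">Example Heading</h1><p>No id here</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

h1 = soup.find(name='h1')
print(h1.text)     # text inside the tag
print(h1['id'])    # attribute access; raises KeyError if the attribute is missing
print(h1.attrs)    # all attributes as a dictionary

# .get() returns None instead of raising when the attribute is absent
p = soup.find(name='p')
print(p.get('id'))

# find() returns None when no tag matches, so check before using the result
missing = soup.find(name='h2')
print(missing is None)
```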

#### 'find_all' method is useful for finding all matching Tags

In [None]:
b.find_all(name='p')    # returns a ResultSet (like a list of Tags)

### Quiz: What is the datatype returned by 'find_all'? What kinds of operations can we do on that datatype?

Hint: ResultSets can be sliced

In [None]:
len(b.find_all(name='p'))
b.find_all(name='p')[0]
b.find_all(name='p')[0].text
b.find_all(name='p')[0]['id']
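
One way to explore the result of `find_all` is to treat it like a list. The sketch below uses an invented three-paragraph document, not example.html.

```python
from bs4 import BeautifulSoup

html = """
<p id="a" class="topic">first</p>
<p id="b" class="topic">second</p>
<p id="c">third</p>
"""
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all(name='p')
print(type(paragraphs).__name__)      # class name of the returned object
print(isinstance(paragraphs, list))   # ResultSet is a subclass of list
print(len(paragraphs))

first_two = paragraphs[:2]            # slicing works as it does on a list
print(first_two[0]['id'], first_two[1]['id'])
print(paragraphs[-1].text)            # negative indexing works too
```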

In [None]:
# iterate over a ResultSet
results = b.find_all(name='p')
for tag in results:
    print(tag.text)

### Quiz: How would you write the above as a list comprehension?

### Limit search by Tag attribute

In [None]:
b.find(name='p', attrs={'id':'scraping'})

In [None]:
b.find_all(name='p', attrs={'class':'topic'})
b.find_all(attrs={'class':'topic'})
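
BeautifulSoup also accepts attribute filters as keyword arguments, which is often shorter than an `attrs` dictionary. Because `class` is a reserved word in Python, the keyword is spelled `class_`. The document below is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p id="scraping" class="topic">Web scraping</p>
<p id="apis" class="topic">APIs</p>
<p id="other">Something else</p>
<h2 class="topic">A heading with the same class</h2>
"""
soup = BeautifulSoup(html, 'html.parser')

# These two calls find the same tag
by_dict = soup.find(name='p', attrs={'id': 'scraping'})
by_keyword = soup.find('p', id='scraping')
print(by_dict is by_keyword)

# class_ avoids the clash with Python's reserved word 'class'
topic_paragraphs = soup.find_all('p', class_='topic')
print(len(topic_paragraphs))

# Omitting the tag name matches any tag with that class
all_topics = soup.find_all(class_='topic')
print([tag.name for tag in all_topics])
```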

### Limit search to specific sections

In [None]:
b.find_all(name='li')
b.find(name='ul', attrs={'id':'scraping'}).find_all(name='li')

## In Class Exercise

1) Find the 'h2' tag and then print its text

2) Find the 'p' tag with an 'id' value of 'feedback' and then print its text


3) Find the first 'p' tag and then print the value of the 'id' attribute


4) Print the text of all four resources

5) Using a list comprehension, can you extract the text of only the API resources?

### Tool: Selector Gadget
http://selectorgadget.com/
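
SelectorGadget produces CSS selectors, and BeautifulSoup can search with them through `select` and `select_one`. A minimal sketch on an invented document (the ids and classes are illustrative only):

```python
from bs4 import BeautifulSoup

html = """
<ul id="fruits"><li>apple</li><li>banana</li></ul>
<ul id="vegetables"><li>carrot</li></ul>
<p class="note">Notes</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'ul#fruits li' means: <li> tags nested inside the <ul> whose id is "fruits"
items = soup.select('ul#fruits li')
print([li.text for li in items])

# 'p.note' matches <p> tags with class "note";
# select_one returns the first match, or None if there is none
print(soup.select_one('p.note').text)
```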

## Scraping IMDb

#### First, open your browser and look at the website and its HTML structure

http://www.imdb.com/title/tt0111161/

#### Get the HTML from the Shawshank Redemption page

In [1]:
import requests
r = requests.get('http://www.imdb.com/title/tt0111161/')
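
A request can fail or hang, so it helps to set a timeout and check the status before parsing. The sketch below wraps `requests.get` in a hypothetical helper, `fetch_html`. The helper name, the User-Agent string, and the 10-second timeout are assumptions for illustration. The `get` parameter exists only so the function can be exercised without network access.

```python
import requests

def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    headers = {'User-Agent': 'class-scraper (educational use)'}  # illustrative value
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
    return response.text

# Example call (requires network access):
# html = fetch_html('http://www.imdb.com/title/tt0111161/')
```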

#### What is r? What can we do with it?

#### Convert HTML into Soup

In [46]:
b = BeautifulSoup(r.text, 'html.parser')
print(b)

In [25]:
# run this code if you have encoding errors (Python 2 only: this workaround does not exist in Python 3)
import sys
reload(sys)
sys.setdefaultencoding('utf8')

#### Get the title

In [71]:
b.find('h1').text

u'The Shawshank Redemption'

#### Get the Star Rating (as a float)

In [72]:
# get the star rating (as a float)
float(b.find(name='span', attrs={'itemprop':'ratingValue'}).text)

9.3
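
If the page layout changes, `find` returns `None` and the `float(...)` call above fails with an `AttributeError`. A slightly more defensive version is sketched below. `parse_star_rating` is a hypothetical helper, and the inline HTML is a simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup

def parse_star_rating(soup):
    """Return the rating as a float, or None if the tag is missing."""
    tag = soup.find(name='span', attrs={'itemprop': 'ratingValue'})
    if tag is None:
        return None
    return float(tag.text.strip())

page = BeautifulSoup('<span itemprop="ratingValue"> 9.3 </span>', 'html.parser')
empty = BeautifulSoup('<p>no rating here</p>', 'html.parser')
print(parse_star_rating(page))
print(parse_star_rating(empty))
```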

#### Get the Movie Rating

In [95]:
panel = b.find('meta', attrs={'itemprop':'contentRating'}) # returns too much: the text includes the duration, genres, and release date, not just the rating
panel.text

u'R\n| \n                        2h 22min\n                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n '
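
The panel text above mixes the content rating, duration, genres, and release date, separated by `|`. One way to pull out individual fields is to split on that separator and convert the duration with regular expressions. `parse_panel` and `duration_to_minutes` below are hypothetical helpers, and the sketch assumes the field order shown in the output above.

```python
import re

# The panel text copied from the output above.
panel_text = (u'R\n| \n                        2h 22min\n'
              u'                    \n|\nCrime, \nDrama\n|\n14 October 1994 (USA)\n\n ')

def duration_to_minutes(duration):
    """Convert a string such as '2h 22min' to a number of minutes."""
    hours = re.search(r'(\d+)\s*h', duration)
    minutes = re.search(r'(\d+)\s*min', duration)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

def parse_panel(text):
    """Split the panel into content rating and duration (in minutes)."""
    fields = [field.strip() for field in text.split('|')]
    return {'content_rating': fields[0],
            'duration': duration_to_minutes(fields[1])}

print(parse_panel(panel_text))
```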

### In-Class Exercise

#### Intro Level:

Using the OMDb API, request the release year for each of the 1000 movies in the CSV. Answer the questions below.

#### Challenge Level:

Can you scrape the IMDb Top 250 list (http://www.imdb.com/chart/top?ref_=nv_mv_250_6) and return a DataFrame with the movie name, rating, year, and unique movie identifier (e.g. 'tt0111161')? Use the scraping code above to scrape each of the movie pages.

#### Questions:
- How many of the Top movies are rated 'R'?
- What is the average duration of movies with a star_rating above 9?
- What is the average duration of movies before 1985 and after?
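
For the challenge, the movie identifier can be read from each link's `href`. The sketch below assumes links of the form `/title/tt0111161/...`, which matches the movie URL used earlier. `extract_imdb_id` is a hypothetical helper and the query string in the example is illustrative.

```python
import re

def extract_imdb_id(href):
    """Return the 'tt' identifier found in a link, or None if there is none."""
    match = re.search(r'/title/(tt\d+)', href)
    return match.group(1) if match else None

print(extract_imdb_id('/title/tt0111161/?ref_=example'))
print(extract_imdb_id('/chart/top'))
```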

### Optional Web Scraping Homework

First, define a function that accepts an IMDb ID and returns a dictionary of
movie information: title, star_rating, description, content_rating, duration.
The function should gather this information by scraping the IMDb website, not
by calling the OMDb API. (This is really just a wrapper of the web scraping
code we wrote above.)

For example, `get_movie_info('tt0111161')` should return:
```
{'content_rating': 'R',
 'description': u'Two imprisoned men bond over a number of years...',
 'duration': 142,
 'star_rating': 9.3,
 'title': u'The Shawshank Redemption'}
```

Then, open the file imdb_ids.txt using Python, and write a for loop that builds
a list in which each element is a dictionary of movie information.
Finally, convert that list into a DataFrame.
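
The loop described above can be sketched as follows. `build_movie_list` is a hypothetical helper. The scraping function is passed in as a parameter so the sketch runs with a stand-in (`fake_get_movie_info`) instead of contacting IMDb, and `io.StringIO` stands in for imdb_ids.txt. The one-second default pause is an assumption, chosen to avoid sending requests too quickly.

```python
import io
import time

def build_movie_list(id_file, get_movie_info, pause_seconds=1.0):
    """Read one IMDb ID per line and return a list of movie dictionaries."""
    movies = []
    for line in id_file:
        imdb_id = line.strip()
        if not imdb_id:              # skip blank lines
            continue
        movies.append(get_movie_info(imdb_id))
        time.sleep(pause_seconds)    # be polite between requests
    return movies

def fake_get_movie_info(imdb_id):
    """Stand-in for the real scraper; returns placeholder data."""
    return {'imdb_id': imdb_id, 'title': 'placeholder'}

sample_file = io.StringIO(u'tt0111161\ntt0068646\n\n')
movies = build_movie_list(sample_file, fake_get_movie_info, pause_seconds=0)
print(movies)

# With pandas installed, pandas.DataFrame(movies) turns this list into a DataFrame.
```

With the real file, the call would look like `with open('imdb_ids.txt') as f: movies = build_movie_list(f, get_movie_info)`.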


