# Web scraping practice with requests and bs4

By Adi Bronhstein and maybe other people. Updated by Jeff Hale.

### Learning Objectives

- Become more comfortable using requests and Beautiful Soup (BS4)
- Learn the most common BS4 commands, starting from the `.html` attribute
  - .find()
  - .find_all()
  - .select()
  - .get_text()
  - .text
  - .prettify()

- Be able to select parts of an HTML website based on tags and attributes and turn them into DataFrames

#### First things first, import those libraries

In [90]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

#### Perform a `get` request on `http://dataquestio.github.io/web-scraping-pages/simple.html` and check the status code

In [91]:
url = 'http://dataquestio.github.io/web-scraping-pages/simple.html'
response = requests.get(url)
response

<Response [200]>
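
A minimal sketch of checking a response before parsing it. The `fetch` helper and its `timeout` value are illustrative assumptions, not part of the original lesson; `status_code`, `ok`, and `raise_for_status()` are standard `requests` features. The hand-built `Response` object is only there so the check can be shown without a network call.

```python
import requests


def fetch(url, timeout=10):
    """GET a URL and raise requests.HTTPError on a 4xx or 5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response


# Offline illustration: build a Response by hand instead of hitting the network
demo = requests.Response()
demo.status_code = 404

try:
    demo.raise_for_status()
    failed = False
except requests.HTTPError:
    failed = True

print(demo.status_code, demo.ok, failed)
```

In a real scrape you would call something like `fetch(url)` and only build the soup once it returns without raising.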

#### To see the content of the response, use the content attribute

In [7]:
response.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

#### When life gives you HTML, make soup. 🍜

In [92]:
soup = BeautifulSoup(response.content, 'lxml')  # name the parser explicitly to avoid a warning
soup

<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

#### We can make this look pretty, with proper indentation, using the `.prettify()` method

In [93]:
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <title>\n   A simple example page\n  </title>\n </head>\n <body>\n  <p>\n   Here is some simple content for this page.\n  </p>\n </body>\n</html>'

In [94]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


#### The fastest way to grab the `<html>` element is through the built-in `.html` attribute

In [95]:
html = soup.html
html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [96]:
type(html)

bs4.element.Tag

#### The `.text` attribute is the one you usually want

In [97]:
html.text

'\n\nA simple example page\n\n\nHere is some simple content for this page.\n\n'

#### `html.text` gives you a string that you can use string functions on 😀

In [22]:
html.text.strip()

'A simple example page\n\n\nHere is some simple content for this page.'
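
Because `.text` is an ordinary string, the usual string tools apply. The sketch below, which assumes an inline copy of the simple page and Python's built-in `html.parser` so it runs without a network call or extra installs, splits the text into clean lines.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple page so the example runs offline
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Split the text into lines and keep only the non-empty ones
lines = [line.strip() for line in soup.html.text.splitlines() if line.strip()]
print(lines)
```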

Note, there are many ways to accomplish the same thing with BS4. If you are Googling or reading the docs you might come across other ways. 😉

## Quick Practice


1. Create a new request to get the html at `http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html`    
    
2. Turn that HTML into a soup object    
    
3. Using the `.find()` method, save the title tag as its own variable. Strip the whitespace.
    
4. Do the same with the body tag  


### Question: What's the difference between the `.find()` method and the `.find_all()` method?

In [109]:
res = requests.get('http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(res.content, 'lxml')  # name the parser explicitly to avoid a warning

In [110]:
html = soup.html  # the whole <html> element, not just <head>
html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

### Title

In [111]:
title = soup.find('title').text
title

'A simple example page'

In [112]:
title = title.strip()

### Body

In [113]:
body = soup.find('body')
body

<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>

### The `.get_text()` method is handy for pulling clean text out of a tag: `strip=True` trims each piece and `separator` joins them.

In [114]:
body.get_text(strip=True, separator=' ')

'First paragraph. Second paragraph. First outer paragraph. Second outer paragraph.'

#### Now we can start ripping through these paragraphs using `.find_all()`

In [115]:
body.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
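
One way to see the difference between `.find()` and `.find_all()` is to compare what each returns, including when nothing matches. This is a small sketch on a trimmed-down, inline version of the page, parsed with the built-in `html.parser` as an assumption so it runs offline.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")            # a single Tag: the first match
all_p = soup.find_all("p")          # a list-like ResultSet of every match
missing = soup.find("table")        # None when nothing matches
no_tables = soup.find_all("table")  # an empty ResultSet when nothing matches

print(first_p.text, len(all_p), missing, len(no_tables))
```

Since `.find()` can return `None`, it is worth checking the result before calling `.text` on it.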

#### What if I only want paragraphs with the 'outer-text' class?

In [116]:
body.find_all('p', {'class':'outer-text'})

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

#### Or we can just use the class 'outer-text' if it only applies to the content we want.

In [117]:
body.find_all(class_='outer-text')  

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

#### Notice how we can pass the keyword argument 'class_' directly.

The underscore is there to differentiate it from the reserved keyword `class`.
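
The dictionary form and the `class_` keyword should find the same elements. The sketch below checks that on an inline copy of the page body (an assumption made so it runs offline, parsed with `html.parser`), and also shows that a tag's `class` attribute comes back as a list because an element can carry several classes.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_dict = soup.find_all("p", {"class": "outer-text"})
by_keyword = soup.find_all("p", class_="outer-text")

# Both searches match any <p> whose class list contains "outer-text"
texts_dict = [p.get_text(strip=True) for p in by_dict]
texts_keyword = [p.get_text(strip=True) for p in by_keyword]

# The class attribute is multi-valued, so BS4 returns it as a list
classes = by_keyword[0]["class"]

print(texts_keyword, classes)
```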

#### Alternatively, we can grab an element by its id

### Question: What's the difference between a class and an id? 

In [118]:
# Get first id
soup.find(id='first')

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

In [119]:
# Get second id
soup.find(id='second')

<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
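
As a hint for the question above: an id is meant to identify a single element on a page, while a class can be shared by many. A short sketch of reading attributes off a tag, again assuming an inline copy of the markup and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(id="first")
first_id = first["id"]            # dictionary-style lookup; KeyError if missing
first_href = first.get("href")    # .get() returns None for a missing attribute

# Only two of the four paragraphs carry an id
ids = [p.get("id") for p in soup.find_all("p")]

print(first_id, first_href, ids)
```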

#### We can use the `.select()` method with CSS selectors for more complicated searches.

In [120]:
soup.select('div')

[<div>
 <p class="inner-text first-item" id="first">
                 First paragraph.
             </p>
 <p class="inner-text">
                 Second paragraph.
             </p>
 </div>]

#### We can combine a tag with an id using `#`

In [121]:
soup.select('div p#first' )

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

#### Or just this

In [122]:
soup.select('p#first' )

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

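#### But this doesn't work 😞

The selector returns an empty list because both class names are misspelled: the page uses `inner-text` and `first-item`, not `innter-text` and `first_item`.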

In [123]:
soup.select('div p.innter-text.first_item' )

[]

### Notice we pass a string with the tag types, a `#` for an id, and a `.` for a class
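
A sketch of the selector syntax on an inline copy of the page (an assumption so it runs offline, parsed with `html.parser`). It compares a selector containing the misspelled class names `innter-text` and `first_item` with one that uses the class names actually present in the markup, and shows `.select_one()` for a single match.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Misspelled class names match nothing
typo = soup.select("div p.innter-text.first_item")

# tag.class1.class2 requires both classes on the same element
fixed = soup.select("div p.inner-text.first-item")

outer = soup.select("p.outer-text")   # every <p> with class outer-text
second = soup.select_one("#second")   # first match only, or None

print(len(typo), fixed[0]["id"], len(outer), second["id"])
```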

# Let's put this into action on a real site

Our task is to scrape the Extended Forecast for DC, found [here](https://forecast.weather.gov/MapClick.php?x=259&y=110&site=lwx&zmx=&zmy=&map_x=259&map_y=109#.XpTWY1NKhTY). Before we begin, let's explore the site using Chrome's dev tools:
- How would we select the entire extended forecast container?
- What method is best for selecting each forecast as individual elements?
- And how would we select them?

**Once we know how to drill down to the 'observations', we can begin scraping**

**Create the initial request and grab the HTML**

In [71]:
url = 'https://forecast.weather.gov/MapClick.php?lat=38.85&lon=-77.032#.Xid0i1NKi1s'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

**Grab our seven-day container**

In [74]:
dc_forecast = soup.find(id='seven-day-forecast')
dc_forecast

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    Washington DC, Reagan National Airport VA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><div class="current-hazard" id="headline-container" style="margin-left: 124px"><div id="headline-separator" style="top: 34px; height: 131px"></div><div id="headline-info" onclick="$('#headline-detail').toggle(); $('#headline-detail-now').hide()" style="margin-top: 5px"><div id="headline-detail"><div>Wind Advisory until April 13, 06:00pm</div></div><span class="fa fa-info-circle"></span>Click here for hazard details and duration</div><div class="headline-bar headline-advisory" style="top: 40px; left: 19px; height: 125px; width: 105px">
<div class="headline-title">Wind Advisory</div>
</div></div><ul class="list-unstyled" id="seven-day-forecast-list" style="padding-top: 60px"><li class="forecast-

**Select all of the tombstone-containers**

In [75]:
forecast_items = dc_forecast.find_all(attrs={'class':'tombstone-container'})

**Print the first one**

In [76]:
forecast_items[0]

<div class="tombstone-container">
<p class="period-name">NOW until<br/>6:00pm Mon</p>
<p><img alt="" class="forecast-icon" src="newimages/medium/wind_bkn.png" title=""/></p><p class="short-desc">Wind Advisory</p></div>

**Using the class names in the HTML and the `.text` attribute, select the period for which the forecast is made, the short description of the forecast, and the temperature.**

In [87]:
tonight = forecast_items[0]
period = tonight.find(class_='period-name').text
period

'NOW until6:00pm Mon'

In [88]:
desc = tonight.find(class_='short-desc').text
desc

'Wind Advisory'

In [89]:
temp = tonight.find(class_='temp').text
temp

AttributeError: 'NoneType' object has no attribute 'text'
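
The error appears because `.find()` returns `None` when no element matches, and `None` has no `.text`. One possible guard is sketched below. The helper name `text_or_none` is hypothetical, and the second forecast item with its `temp` paragraph is made-up sample markup, not copied from the live page.

```python
from bs4 import BeautifulSoup

# First item mirrors the hazard card above; the second item is invented sample data
html_doc = """
<div class="tombstone-container">
  <p class="period-name">NOW until<br/>6:00pm Mon</p>
  <p class="short-desc">Wind Advisory</p>
</div>
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Mostly Clear</p>
  <p class="temp temp-low">Low: 45 °F</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def text_or_none(parent, class_name):
    """Return the stripped text of the first matching child, or None."""
    tag = parent.find(class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


items = soup.find_all(class_="tombstone-container")
temps = [text_or_none(item, "temp") for item in items]
print(temps)
```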

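**The first item is a hazard headline with no `temp` element, so `.find()` returns `None` and calling `.text` on it raises the error above. Back on the actual webpage, we can mouse over the images and see a bit more information.**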
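
The hover text a browser shows for an image usually comes from its `title` attribute, which BS4 exposes like a dictionary key. A rough sketch, assuming the forecast icons carry their details in `title`; the markup and its values below are invented sample data, and in the hazard card printed earlier the `title` is empty.

```python
from bs4 import BeautifulSoup

# Invented sample markup; attribute values are illustrative only
html_doc = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p><img alt="Tonight: Mostly clear, with a low around 45."
          class="forecast-icon" src="newimages/medium/nfew.png"
          title="Tonight: Mostly clear, with a low around 45."/></p>
  <p class="short-desc">Mostly Clear</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

item = soup.find(class_="tombstone-container")
img = item.find("img")

# .get() with a default avoids a KeyError if the attribute is absent
detail = img.get("title", "") if img is not None else ""
print(detail)
```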

**Now let's run this process for every forecast on the page, and print the data**

In [125]:
for item in forecast_items:
    print(item.find(class_='period-name').text)
    print(item.find(class_='short-desc').text)
    print('------')

NOW until6:00pm Mon
Wind Advisory
------
ThisAfternoon
Partly Sunnyand Breezy
------
Tonight
Mostly Clear
------
Tuesday
Partly Sunny
------
TuesdayNight
Rain
------
Wednesday
Rain Likely
------
WednesdayNight
Partly Cloudy
------
Thursday
Mostly Sunny
------
ThursdayNight
Partly Cloudy
------
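
Some of the printed values run together (`NOW until6:00pm Mon`) because a `<br/>` inside the tag contributes no text of its own. A possible fix, sketched here, is `.get_text()` with a separator. The `period-name` markup is taken from the card printed earlier; the `<br/>` inside `short-desc` is an assumption inferred from the run-together output.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="tombstone-container">
<p class="period-name">NOW until<br/>6:00pm Mon</p>
<p class="short-desc">Partly Sunny<br/>and Breezy</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
item = soup.find(class_="tombstone-container")

period_tag = item.find(class_="period-name")
run_together = period_tag.text  # the pieces are joined with nothing between them

# separator=" " puts a space between the pieces; strip=True trims each one
period = period_tag.get_text(separator=" ", strip=True)
desc = item.find(class_="short-desc").get_text(separator=" ", strip=True)

print(run_together)
print(period)
print(desc)
```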


**One more time, but let's save the data and turn it into a DataFrame**

In [128]:
forecast_list = []
for item in forecast_items:
    period = item.find(class_='period-name').text
    desc = item.find(class_='short-desc').text

    
    forecast_dict = {
        'Period':period,
        'Description':desc,

    }
    
    forecast_list.append(forecast_dict)

In [129]:
forecast_df = pd.DataFrame(forecast_list)
forecast_df

|   | Period | Description |
|---|--------|-------------|
| 0 | NOW until6:00pm Mon | Wind Advisory |
| 1 | ThisAfternoon | Partly Sunnyand Breezy |
| 2 | Tonight | Mostly Clear |
| 3 | Tuesday | Partly Sunny |
| 4 | TuesdayNight | Rain |
| 5 | Wednesday | Rain Likely |
| 6 | WednesdayNight | Partly Cloudy |
| 7 | Thursday | Mostly Sunny |
| 8 | ThursdayNight | Partly Cloudy |


#### Save to a csv file

In [None]:
forecast_df.to_csv('forecast.csv', index=False)
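
A small sketch of what `index=False` changes, using an in-memory buffer instead of a file and two sample rows as assumptions. Without it, pandas writes the row index as an extra unnamed column, which comes back as `Unnamed: 0` when the CSV is read again.

```python
import io

import pandas as pd

forecast_list = [
    {"Period": "Tonight", "Description": "Mostly Clear"},
    {"Period": "Tuesday", "Description": "Partly Sunny"},
]
forecast_df = pd.DataFrame(forecast_list)

# to_csv() returns the CSV as a string when no path is given
with_index = forecast_df.to_csv()
without_index = forecast_df.to_csv(index=False)

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]

# Reading the indexed version back produces an extra "Unnamed: 0" column
reread_columns = list(pd.read_csv(io.StringIO(with_index)).columns)
round_trip = pd.read_csv(io.StringIO(without_index))

print(header_with)
print(header_without)
print(reread_columns)
```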

## Summary

You've seen how to use requests with BS4 to scrape websites and get data into DataFrames.

### Check for understanding

When would you use the following BS4 html methods?
- `.find()`?
- `.find_all()`?
- `.select()`?

Web scraping is a great way to get data for your final project!