# Week 7

## Today

- Web scraping
- More on Pandas
- Errata

## Announcements


## Review: Last Week's Lab

## Review: Questions from the Course

## Web Scraping

Two parts:
    
1. Get the HTML.

2. Parse the HTML and pull out what you need.

1. Get the HTML with the `urllib` library.
2. Parse with the `BeautifulSoup` library.

In [2]:
import urllib.request
from bs4 import BeautifulSoup

## Getting the data

We won't focus on this too much today, since it's mostly a matter of boilerplate (i.e. consistent, reusable) code.


1. Create a request.

In [3]:
url = 'http://www.du.edu'
req = urllib.request.Request(url)

2. Use the request to fetch the page.

In [4]:
page = urllib.request.urlopen(req)

The results can be read with `page.read()`, which returns the page as raw bytes (note the `b'...'` prefix in the output below).

In [5]:
content = page.read()
print(content)

b'\n<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n<head>\n    <meta charset="utf-8" /><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"NRJS-ab08c239bec33d61f91",applicationID:"497891743"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(f(a
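To work with the page as text, the bytes from `page.read()` can be decoded into a `str`. A minimal sketch, using a short stand-in byte string rather than a live response, and assuming the page is UTF-8 encoded (which is what its `<meta charset>` tag declares):

```python
# Stand-in for the bytes returned by page.read(); not a real response.
content = b'\n<!DOCTYPE html>\n<html lang="en" dir="ltr">\n<head>\n    <meta charset="utf-8" />'

# Decode the bytes into a str, assuming UTF-8.
html = content.decode('utf-8')

print(type(content).__name__)  # bytes
print(type(html).__name__)     # str
print(html.splitlines()[1])    # <!DOCTYPE html>
```

BeautifulSoup accepts either form, so decoding is only needed when you want to inspect or search the text yourself.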

## On Headers

Why create a 'request'? Sometimes you want to send information beyond the URL, which browsers send in a 'header', e.g.

- Identify a browser: some websites refuse connections that don't identify themselves as Chrome, Firefox, Safari, or Edge
- Add password information
- Specify viewing conditions, like fetching the mobile version of a site

Defining a header:

```python
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=hdr)
```
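A request object can be inspected before it is sent, which is a quick way to confirm the header was attached. The sketch below makes no network call. Note that `urllib` normalizes header names with `str.capitalize()`, so the key is stored as `User-agent`.

```python
import urllib.request

url = 'http://www.du.edu'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=hdr)

# Nothing has been sent yet; the Request only holds the URL and headers.
print(req.full_url)                  # http://www.du.edu
print(req.has_header('User-agent'))  # True
print(req.get_header('User-agent'))  # Mozilla/5.0
```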

In [1]:
import urllib.request

url = "https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%7D"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=hdr)
page = urllib.request.urlopen(req)

[LINK](https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%7D)
![](../images/congress1.png)

## Parsing the Page

In [11]:
soup = BeautifulSoup(page, 'html.parser')

e.g.

In [12]:
soup.find('a')

<a href="https://www.congress.gov"><img alt="Congress.gov" height="28" src="/img/svg/congress-gov-logo.svg" width="302"/></a>

BeautifulSoup parses an HTML page into a tree of objects, rather than leaving it as one long string. Each element offers useful methods and attributes, including:

- `el.find(name=tag)` - Return the first matching HTML element of type 'tag', or `None` if nothing matches
- `el.find_all(name=tag)` - Return a list of all matching elements
- `el.children` - Iterate over the direct children of the current element (wrap in `list()` to see them all at once)
- `el.text` - Extract only the text between tags in the element
- `el.attrs` - Dictionary (like a JSON object) of the element's attributes
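These can be tried offline on a small HTML string. The snippet below is a stand-in modeled on the page's first link; the second link (`/help`) is made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in HTML, not the real page.
html = ('<div><a href="https://www.congress.gov"><img alt="Congress.gov"/></a>'
        '<a href="/help">Help</a></div>')
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')       # first matching element
links = soup.find_all('a')   # every matching element

print(len(links))                         # 2
print(first.attrs)                        # {'href': 'https://www.congress.gov'}
print([c.name for c in first.children])   # ['img']
print(links[1].text)                      # Help
```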

In [16]:
first_link = soup.find('a')
first_link.attrs

{'href': 'https://www.congress.gov'}

In [17]:
list(first_link.children)

[<img alt="Congress.gov" height="28" src="/img/svg/congress-gov-logo.svg" width="302"/>]

Use tab autocomplete to see what methods are available!

![](../images/congress1.png)

![](../images/congress2.png)

![](../images/congress3.png)

In [18]:
ordered = soup.find('ol', attrs={'class':'basic-search-results-lists'})
ordered

<ol class="basic-search-results-lists expanded-view" start="1"><li class="expanded"> <div><span class="visualIndicator">BILL</span></div>
    1.
    <span class="result-heading"><a href="https://www.congress.gov/bill/116th-congress/house-bill/3051?s=1&amp;r=1">H.R.3051</a> — 116th Congress (2019-2020)</span>
<span class="result-title">To amend the Internal Revenue Code of 1986 to extend by 2 years the energy efficient commercial buildings deduction.</span>
<span class="result-item">
<strong>Sponsor:</strong> <a href="/member/jefferson-van-drew/V000133" target="_blank">Rep. Van Drew, Jefferson [D-NJ-2]</a> (Introduced 05/30/2019) <strong>Cosponsors:</strong> (<a href="https://www.congress.gov/bill/116th-congress/house-bill/3051/cosponsors?s=1&amp;r=1&amp;overview=closed#tabs">1</a>)        </span>
<span class="result-item">
<strong>Committees:</strong> House - Ways and Means        </span>
<span class="result-item"><strong> Latest Action:            </strong> House - 05/30/2019 Referred

In [19]:
for child in ordered.children:
    print("============")
    print(child.text)

 BILL
    1.
    H.R.3051 — 116th Congress (2019-2020)
To amend the Internal Revenue Code of 1986 to extend by 2 years the energy efficient commercial buildings deduction.

Sponsor: Rep. Van Drew, Jefferson [D-NJ-2] (Introduced 05/30/2019) Cosponsors: (1)        

Committees: House - Ways and Means        
 Latest Action:             House - 05/30/2019 Referred to the House Committee on Ways and Means. (All Actions)        

            Tracker: This bill has the status IntroducedHere are the steps for Status of Legislation:IntroducedArray
(
    [actionDate] => 2019-05-30
    [displayText] => Introduced in House
    [externalActionCode] => 1000
    [description] => Introduced
)
Passed HousePassed SenateTo PresidentBecame Law 



AttributeError: 'NavigableString' object has no attribute 'text'

What about the error?

### Debugging the error

Looking at the element that causes the error in the loop:

In [20]:
child

'\n'

Problem: not all children are list items (`<li>`). The whitespace between tags is also a child, stored as a text node (`NavigableString`).

Possible solutions:

- Catch the error
- Use `find_all('li')`
  - Possible problem: if there are more list items deeper in the hierarchy, you may accidentally pull them into the results. We want "`<li>` tags that are direct children of the element set to `ordered`"
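One further option, not covered in the cells that follow, is the `recursive=False` argument to `find_all`, which limits the search to direct children. A small sketch on stand-in markup (the nested status list here is simplified, not copied from the real page):

```python
from bs4 import BeautifulSoup

# Stand-in markup: two top-level results, the first with a nested list.
html = """<ol class="basic-search-results-lists">
<li class="expanded">H.R.3051<ol><li>Introduced</li><li>Passed House</li></ol></li>
<li class="expanded">H.R.3050</li>
</ol>"""
ordered = BeautifulSoup(html, 'html.parser').find('ol')

# .children also yields the '\n' strings between the tags.
print(len(list(ordered.children)))   # 5

# find_all('li') descends into the nested list as well.
print(len(ordered.find_all('li')))   # 4

# recursive=False keeps only the <li> tags directly under `ordered`.
direct = ordered.find_all('li', recursive=False)
print(len(direct))                   # 2
```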

**Using `find_all`**

Note that nested list items, such as the Passed House and Passed Senate statuses, come back as their own elements in addition to appearing inside their parent's text.

In [21]:
for child in ordered.find_all('li'):
    print("===================")
    print(child.text)

 BILL
    1.
    H.R.3051 — 116th Congress (2019-2020)
To amend the Internal Revenue Code of 1986 to extend by 2 years the energy efficient commercial buildings deduction.

Sponsor: Rep. Van Drew, Jefferson [D-NJ-2] (Introduced 05/30/2019) Cosponsors: (1)        

Committees: House - Ways and Means        
 Latest Action:             House - 05/30/2019 Referred to the House Committee on Ways and Means. (All Actions)        

            Tracker: This bill has the status IntroducedHere are the steps for Status of Legislation:IntroducedArray
(
    [actionDate] => 2019-05-30
    [displayText] => Introduced in House
    [externalActionCode] => 1000
    [description] => Introduced
)
Passed HousePassed SenateTo PresidentBecame Law 

IntroducedArray
(
    [actionDate] => 2019-05-30
    [displayText] => Introduced in House
    [externalActionCode] => 1000
    [description] => Introduced
)

Passed House
Passed Senate
To President
Became Law
    1.
    H.R.3051 — 116th Congress (2019-2020)
To am

To President
Became Law
 BILL
    57.
    H.R.2995 — 116th Congress (2019-2020)
To amend the Nuclear Waste Policy Act of 1982 to prioritize the acceptance of high-level radioactive waste or spent nuclear fuel from certain civilian nuclear power reactors, and for other purposes.

Sponsor: Rep. Levin, Mike [D-CA-49] (Introduced 05/23/2019) Cosponsors: (9)        

Committees: House - Energy and Commerce        
 Latest Action:             House - 05/23/2019 Referred to the House Committee on Energy and Commerce. (All Actions)        

            Tracker: This bill has the status IntroducedHere are the steps for Status of Legislation:IntroducedArray
(
    [actionDate] => 2019-05-23
    [displayText] => Introduced in House
    [externalActionCode] => 1000
    [description] => Introduced
)
Passed HousePassed SenateTo PresidentBecame Law 

IntroducedArray
(
    [actionDate] => 2019-05-23
    [displayText] => Introduced in House
    [externalActionCode] => 1000
    [description] => Introduce

**Using `find_all` with class specified**

In [22]:
for child in ordered.find_all('li', attrs={'class':'expanded'}):
    print("===================")
    print(child.text)

 BILL
    1.
    H.R.3051 — 116th Congress (2019-2020)
To amend the Internal Revenue Code of 1986 to extend by 2 years the energy efficient commercial buildings deduction.

Sponsor: Rep. Van Drew, Jefferson [D-NJ-2] (Introduced 05/30/2019) Cosponsors: (1)        

Committees: House - Ways and Means        
 Latest Action:             House - 05/30/2019 Referred to the House Committee on Ways and Means. (All Actions)        

            Tracker: This bill has the status IntroducedHere are the steps for Status of Legislation:IntroducedArray
(
    [actionDate] => 2019-05-30
    [displayText] => Introduced in House
    [externalActionCode] => 1000
    [description] => Introduced
)
Passed HousePassed SenateTo PresidentBecame Law 

 BILL
    2.
    H.R.3050 — 116th Congress (2019-2020)
To require the Securities and Exchange Commission to carry out a study of the 10 per centum threshold limitation applicable to the definition of a diversified company under the Investment Company Act of 1940,

## Extracting pertinent information

Test parsing with the first list item - then you can do the same for all the list items:

In [23]:
# Take the first <li> object
result = ordered.find('li')

In [24]:
result

<li class="expanded"> <div><span class="visualIndicator">BILL</span></div>
    1.
    <span class="result-heading"><a href="https://www.congress.gov/bill/116th-congress/house-bill/3051?s=1&amp;r=1">H.R.3051</a> — 116th Congress (2019-2020)</span>
<span class="result-title">To amend the Internal Revenue Code of 1986 to extend by 2 years the energy efficient commercial buildings deduction.</span>
<span class="result-item">
<strong>Sponsor:</strong> <a href="/member/jefferson-van-drew/V000133" target="_blank">Rep. Van Drew, Jefferson [D-NJ-2]</a> (Introduced 05/30/2019) <strong>Cosponsors:</strong> (<a href="https://www.congress.gov/bill/116th-congress/house-bill/3051/cosponsors?s=1&amp;r=1&amp;overview=closed#tabs">1</a>)        </span>
<span class="result-item">
<strong>Committees:</strong> House - Ways and Means        </span>
<span class="result-item"><strong> Latest Action:            </strong> House - 05/30/2019 Referred to the House Committee on Ways and Means. (<a href="https://ww

In [25]:
result.find('span').text

'BILL'

In [26]:
heading = result.find('span', attrs={'class':'result-heading'})
heading

<span class="result-heading"><a href="https://www.congress.gov/bill/116th-congress/house-bill/3051?s=1&amp;r=1">H.R.3051</a> — 116th Congress (2019-2020)</span>

In [27]:
heading.find('a').text

'H.R.3051'

In [28]:
heading.find('a').attrs

{'href': 'https://www.congress.gov/bill/116th-congress/house-bill/3051?s=1&r=1'}

In [29]:
heading.find('a').attrs['href'] 

'https://www.congress.gov/bill/116th-congress/house-bill/3051?s=1&r=1'

In [30]:
title = result.find('span', attrs={'class':'result-title'})
title.text

'To amend the Internal Revenue Code of 1986 to extend by 2 years the energy efficient commercial buildings deduction.'

Possible info to extract:

- Name
- URL
- Bill status
- Bill number
- Sponsors
- Date / Congress
- Committees
- House / Senate

Putting it together into a DataFrame.

- Loop over the elements, saving the extracted info for each one as a list of values

In [31]:
all_rows = []
for result in ordered.find_all('li', attrs={'class':'expanded'}):
    type_info = result.find('span').text
    heading = result.find('span', attrs={'class':'result-heading'})
    bill_no = heading.find('a').text
    url = heading.find('a').attrs['href']
    title = result.find('span', attrs={'class':'result-title'}).text
    
    row = [type_info, bill_no, url, title]
    all_rows.append(row)

In [32]:
import pandas as pd
pd.DataFrame(all_rows, columns=['type', 'billno', 'url', 'title'])

| | type | billno | url | title |
|---|---|---|---|---|
| 0 | BILL | H.R.3051 | https://www.congress.gov/bill/116th-congress/h... | To amend the Internal Revenue Code of 1986 to ... |
| 1 | BILL | H.R.3050 | https://www.congress.gov/bill/116th-congress/h... | To require the Securities and Exchange Commiss... |
| 2 | BILL | H.R.3049 | https://www.congress.gov/bill/116th-congress/h... | To ensure public disclosure of nutrition stand... |
| 3 | BILL | H.R.3048 | https://www.congress.gov/bill/116th-congress/h... | To extend the Secure Rural Schools and Communi... |
| 4 | BILL | H.R.3047 | https://www.congress.gov/bill/116th-congress/h... | To provide support to Ukraine to defend its in... |
| 5 | BILL | H.R.3046 | https://www.congress.gov/bill/116th-congress/h... | To amend title 10, United States Code, to auth... |
| 6 | BILL | H.R.3045 | https://www.congress.gov/bill/116th-congress/h... | To amend section 1065 of title 10, United Stat... |
| 7 | BILL | H.R.3044 | https://www.congress.gov/bill/116th-congress/h... | To establish the Medical Device Sterilization ... |
| 8 | BILL | H.R.3043 | https://www.congress.gov/bill/116th-congress/h... | To amend section 6906 of title 31, United Stat... |
| 9 | BILL | H.R.3042 | https://www.congress.gov/bill/116th-congress/h... | To direct the Comptroller General of the Unite... |
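A loop like this stops with an `AttributeError` if any result lacks one of the expected spans, because `find` returns `None` when nothing matches. One way to guard against that is a small helper; `text_or_none` below is a hypothetical name, and the markup is a stand-in in which the second result deliberately has no title.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, class_name):
    """Return the stripped text of the first match, or None if there is none."""
    el = parent.find(name, attrs={'class': class_name})
    return el.text.strip() if el is not None else None

# Stand-in markup: the second result has no result-title span.
html = """<ol>
<li class="expanded"><span class="result-heading"><a href="/bill/1">H.R.1</a></span><span class="result-title">A title</span></li>
<li class="expanded"><span class="result-heading"><a href="/bill/2">H.R.2</a></span></li>
</ol>"""
ordered = BeautifulSoup(html, 'html.parser').find('ol')

all_rows = []
for result in ordered.find_all('li', attrs={'class': 'expanded'}):
    bill_no = text_or_none(result, 'span', 'result-heading')
    title = text_or_none(result, 'span', 'result-title')
    all_rows.append([bill_no, title])

print(all_rows)   # [['H.R.1', 'A title'], ['H.R.2', None]]
```

Rows built this way can be passed to `pd.DataFrame` as before; missing values show up as `None` instead of halting the scrape.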


## Exercises

- Scrape the calendar from https://www.du.edu/registrar/
  - How would you gather the information from the details page?

In [54]:
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.denverlibrary.org/events/upcoming"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=hdr)
page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
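Since these fetch-and-parse lines are the same every time, they could be wrapped in a function. The sketch below is one way to do that; `get_soup` and its `opener` parameter are hypothetical names, and the `opener` argument exists only so the function can be tried without a network connection.

```python
import io
import urllib.request
from bs4 import BeautifulSoup

def get_soup(url, opener=urllib.request.urlopen):
    """Fetch `url` with a browser-like User-Agent and return the parsed page."""
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib.request.Request(url, headers=hdr)
    with opener(req) as page:
        return BeautifulSoup(page.read(), 'html.parser')

# Offline check: a stand-in opener returns canned HTML instead of fetching.
def fake_opener(req):
    return io.BytesIO(b'<div class="lc-event"><a class="lc-event__link" '
                      b'href="/event/preschool-storytime-665">Preschool Storytime</a></div>')

demo_soup = get_soup('https://www.denverlibrary.org/events/upcoming', opener=fake_opener)
print(demo_soup.find('a').attrs['href'])   # /event/preschool-storytime-665
```

In normal use the second argument is left out: `soup = get_soup(url)`.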

In [None]:
....

In [84]:
results = soup.find_all('div', attrs={'class':'lc-event'})

In [85]:
# Pick one item from the list to test the parsing before looping over every result
item = results[1]
month = item.find('span', attrs={'class':'lc-date-icon__item--month'}).text
day = item.find('span', attrs={'class':'lc-date-icon__item--day'}).text
title_link = item.find('a', attrs={'class':'lc-event__link'})
month, day, title_link.text, title_link.attrs['href']

('May',
 '30',
 'After Promontory: 150 Years of Transcontinental Railroading Exhibit\n',
 '/event/after-promontory-150-years-transcontinental-railroading-exhibit')

In [88]:
all_rows = []
for item in results:
    month = item.find('span', attrs={'class':'lc-date-icon__item--month'}).text
    day = item.find('span', attrs={'class':'lc-date-icon__item--day'}).text
    title_link = item.find('a', attrs={'class':'lc-event__link'})
    row = [month, day, title_link.text.strip(), title_link.attrs['href']]
    all_rows.append(row)
df = pd.DataFrame(all_rows, columns=['Month', 'Date', 'Name', 'URL'])
df.head()

| | Month | Date | Name | URL |
|---|---|---|---|---|
| 0 | May | 30 | Children's Exhibit: Pink is for Blobfish | /event/childrens-exhibit-pink-blobfish |
| 1 | May | 30 | After Promontory: 150 Years of Transcontinenta... | /event/after-promontory-150-years-transcontine... |
| 2 | Jun | 1 | Christine Fontenot: Kaizen, My Zen | /event/christine-fontenot-kaizen-my-zen |
| 3 | Jun | 4 | Preschool Storytime | /event/preschool-storytime-665 |
| 4 | Jun | 4 | All Ages Storytime | /event/all-ages-storytime-696 |
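The URL column holds relative links, so they need to be joined to the site's address before a details page can be fetched. A minimal sketch with the standard library's `urljoin`, assuming the links are relative to the page they were scraped from:

```python
from urllib.parse import urljoin

base = 'https://www.denverlibrary.org/events/upcoming'
href = '/event/preschool-storytime-665'

# A root-relative href replaces the path of the base URL.
full_url = urljoin(base, href)
print(full_url)
# https://www.denverlibrary.org/event/preschool-storytime-665
```

Each resulting URL can then go through the same request, fetch, and parse steps used for the listing page.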


# Thank you