Intro to Web Scraping
--------------------------

While open data portals, APIs and other publication mechanisms provide easy ways to get to the information we need for our analysis and reporting, there are plenty of other valuable data sources for us to take advantage of: web pages (HTML), PDF files, email dumps, etc. Automating the extraction of useful information from web pages is known as **"web scraping."** Terrible name aside, web scraping is very powerful and it's something you'll want to master. Today, we'll close our session by talking about some of the basics of web scraping in Python.

![Web Scraping](https://blog.hartleybrody.com/wp-content/uploads/2012/12/scraper-tool.jpg)

There are many ways to scrape information from the web, but we're going to use Python, [requests](http://docs.python-requests.org/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).

COVID Tracking State Data
---------------------

The COVID Tracking project is trying to assemble state data on testing. They collect data from each state and scrape the basic testing statistics. [Their descriptions of each state's data are here](https://covidtracking.com/data/). Let's focus on California.

[Here is the page](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/ncov2019.aspx). Have a look and tell me what kind of information is available. Compare it to data from a couple of other states.

Now, the first part of web scraping is making HTTP requests to pull the pages we need into Python. We will use the [requests](http://docs.python-requests.org/en/master/) library again.

In [None]:
# make the request to get the California COVID page HTML
from requests import get

url = 'https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/ncov2019.aspx'

headers = {
    'From': '<put your email here>',
}
response = get(url, headers=headers)

Above, we include a "header" field (represented as a dictionary). The header passes information to the web server that might change the way it returns content. In later exercises, we might need to specify the header "User-Agent", which tells the server what kind of browser the request is being made from, since some servers don't like handing pages out to bots.

For now, we are using the "From" header to announce ourselves. I like to tell a source that I'm taking data. If you want to know more about headers, have a look [here](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
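As a rough sketch of what sending both headers might look like, the cell below builds a request with a "From" and a "User-Agent" field and prints the headers without contacting the server. The email address and the User-Agent string are placeholders made up for illustration; `Request(...).prepare()` is used here only so that nothing goes over the network.

```python
from requests import Request

url = 'https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/ncov2019.aspx'

headers = {
    # who we are, so the site's maintainers can get in touch (placeholder address)
    'From': 'you@example.com',
    # a made-up User-Agent string naming our script (an assumption, not a standard value)
    'User-Agent': 'class-scraper/0.1 (educational use)',
}

# prepare() builds the request object but does not send it
prepared = Request('GET', url, headers=headers).prepare()

# show exactly which headers would accompany the request
for name, value in prepared.headers.items():
    print(name, ':', value)
```

Passing the same dictionary to `get(url, headers=headers)` would send these fields with a real request.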

In [None]:
# let's see what we have. remember that response.text will give us a string value of the page HTML

print(response.text)

This is kind of a mess. The whole web page has been read in as a string. Thankfully, one of the great things about Python is a package called [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/), designed by [Leonard Richardson](http://www.crummy.com/self/). It is truly a thing of beauty. BeautifulSoup is a parser for HTML (and XML) that creates an object that lets you interact with the components of a web page. You can search for tags, extract attributes from the tags and pull the content contained in a tag. [The documentation is pretty simple too.](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) At the time of writing, the latest version of BeautifulSoup was 4.6.0, and the package is called `bs4`.

**An example**

We will come back to the COVID data with BeautifulSoup but let's start with a simple example first. Here is some very simple HTML that I'd like to run through BeautifulSoup:

```html
<html>

    <head>
        <title>My Technology News Site</title>
    </head>

    <body>
        <div>
            <p class="title"><strong>Steve Jobs introduces the public beta of Mac OS X</strong></p>
            <div class="description">Sept 13, 2000 - Steve Jobs <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">introduces</a> the public beta of Mac OS X for US$29.95.</div>
            <div class="author">Author: Michael Young</div>
        </div>
    </body>

</html>

```

In [None]:
from bs4 import BeautifulSoup

our_html = '''
<html>

    <head>
        <title>My Technology News Site</title>
    </head>

    <body>
        <div>
            <p class="title"><strong>Steve Jobs introduces the public beta of Mac OS X</strong></p>
            <div class="description">Sept 13, 2000 - Steve Jobs <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">introduces</a> the public beta of Mac OS X for US$29.95.</div>
            <div class="author">Author: Michael Young</div>
        </div>
    </body>

</html>
'''

# BeautifulSoup takes two arguments: a string (hopefully with HTML in it) and the parser we'd like to use
soup = BeautifulSoup(our_html, 'html.parser')

# print out a pretty version of the BeautifulSoup object
print(soup.prettify())

Let's do a super-quick review of [HTML](https://en.wikipedia.org/wiki/HTML):

Hypertext Markup Language is the language used for creating web pages. HTML uses `tags`, which label and structure the data in the document. Web browsers use the tags to help render the web page but do not display the tags themselves.

`Tags` normally come in pairs and have an opening tag `<p>` and a closing tag `</p>`:
```html
<p>this is a paragraph tag</p>
```

Other tags like the image tag `<img>` don't have a closing tag:
```html
<img src="http://somesite.com/images/logo.jpg" />
```

The other important thing about HTML tags is that they can contain one or more `attributes`. In the `<img>` tag above, for example, the `src` attribute is used to specify the URL of the image. A tag with multiple attributes could look like this:
```html
<p attribute_1="value1" attribute_2="value2">
Our content goes here
</p>
```
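Looking ahead a little, here is a minimal sketch of how attributes like these can be read once a tag has been parsed with BeautifulSoup (introduced above). The `alt` attribute is added here purely for illustration.

```python
from bs4 import BeautifulSoup

# the <img> tag from above, with an extra alt attribute added for illustration
snippet = '<img src="http://somesite.com/images/logo.jpg" alt="Site logo" />'
img = BeautifulSoup(snippet, 'html.parser').img

# all of the tag's attributes, as a dictionary
print(img.attrs)

# square brackets look up one attribute (and raise KeyError if it is missing)
print(img['src'])

# .get() returns None instead of raising when the attribute is absent
print(img.get('width'))
```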

HTML documents typically have nested tags (think of a tree!) that look like this:
```html
<html>
  <head>
    <title>My First Website!</title>
  </head>

  <body>
    <p>My mom would be proud of this.</p>
  </body>
</html>
```
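To make the tree idea concrete, here is a small sketch that parses the document above (written on one line so that whitespace between tags doesn't show up as extra nodes) and walks a little of its structure. The variable names `page` and `tree` are just for this illustration.

```python
from bs4 import BeautifulSoup

page = ('<html><head><title>My First Website!</title></head>'
        '<body><p>My mom would be proud of this.</p></body></html>')
tree = BeautifulSoup(page, 'html.parser')

# the tags sitting directly under <html>
print([child.name for child in tree.html.children])

# walking up the tree: which tag contains the <p>?
print(tree.p.parent.name)

# every tag in the document, in the order it appears
print([tag.name for tag in tree.find_all(True)])
```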



**Back to BeautifulSoup...**

When we run our HTML document through BeautifulSoup, we get a Python object that allows us to traverse, query and manipulate the HTML document.

Here are a few ways to inspect our simple HTML document that we loaded above:

In [None]:
# <title> tag
print(soup.title)

In [None]:
# name of the <title> tag
print(soup.title.name)

In [None]:
# string value in the <title> tag
print(soup.title.string)

In [None]:
# how about if we want to find the first <div> tag?

soup.div

In [None]:
# the string value for the first <p> tag within the first <div>

soup.div.p.string

In [None]:
# the value of the "class" attribute of the first <div> under the first <div> (!?!)

soup.div.div['class']

In [None]:
# here is how we'd find the <a> tag

soup.div.div.a

In [None]:
#### For You To Try

# how would you find the url in the description?



We can use `find` and `find_all` to search through the HTML to find certain tags and tag/attribute combinations. Let's take a look:

In [None]:
# find all <p> tags
for p in soup.find_all('p'):
    print(p.text)

In [None]:
# find all <div class="author">...</div> tags
for author_div in soup.find_all('div', attrs={'class': 'author'}):
    print(author_div.text)
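The difference between `find` and `find_all` is easier to see when a page has more than one match. Below is a small sketch on a made-up listing with two stories (the HTML and the names in it are invented for illustration). It also uses `class_=`, a keyword shorthand for `attrs={'class': ...}`.

```python
from bs4 import BeautifulSoup

# an invented listing with two stories
listing = '''
<div class="story"><p class="title">First headline</p><div class="author">Author: A. Writer</div></div>
<div class="story"><p class="title">Second headline</p><div class="author">Author: B. Writer</div></div>
'''
stories = BeautifulSoup(listing, 'html.parser')

# find() returns only the first matching tag
first_title = stories.find('p', class_='title')
print(first_title.text)

# find_all() returns every matching tag
authors = [div.text for div in stories.find_all('div', class_='author')]
print(authors)

# when nothing matches, find() returns None and find_all() returns an empty list
print(stories.find('table'))
print(stories.find_all('table'))
```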

**COVID Tracking**

Coming back to California's COVID page, how can we use BeautifulSoup to parse out the different categories like ages and genders? And then how might we store the data for comparison over time?

Let's start by pulling the HTML for the page and parsing it with Beautiful Soup.

In [None]:
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/ncov2019.aspx'

headers = {
    'From': '<put your email here>'
}

# http request
response = get(url, headers=headers)

# run the HTML through BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# print out a pretty version of the BeautifulSoup object
print(soup.prettify())

Still a mess! Let's open up the [link](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/ncov2019.aspx) in Chrome and view the HTML there. The `View Page Source` option allows us to peek behind the scenes of any web page. Another great resource is the [Chrome Developer Tools](https://developer.chrome.com/devtools), which give you an even closer look at how a page is constructed.

Do we see any patterns in the California COVID HTML?

```html
<h2>COVID-19 By the Numbers<br/></h2><p>As of March 27, 2020, 2 p.m. Pacific Daylight Time, there are a total of 4,643&#160;positive cases and 101&#160;deaths in California (including one non-California resident).&#160;</p><ul style="list-style-type: disc;"><li>923: Community-acquired cases</li><li>3,720: Cases acquired through person-to-person transmission, travel (including cruise ship passengers), repatriation, or under investigation.</li><ul style="list-style-type: circle;"><li>This includes 73&#160;health care workers</li></ul></ul><p><strong>Ages of all confirmed positive cases:</strong></p><ul style="list-style-type: disc;"><li>Age 0-17: 54&#160;cases</li><li>Age 18-49: 2,368&#160;cases</li><li>Age 50-64: 1,184&#160;cases<br/></li><li>Age 65 and older: 1,016&#160;cases</li><li>Unknown: 21&#160;cases</li></ul><p><strong>Gender of all confirmed positive cases:</strong></p><ul style="list-style-type: disc;"><li>Female: 2,057&#160;cases</li><li>Male: 2,536&#160;cases</li><li>Unknown: 50&#160;cases<br/><br/></li></ul>
```

It seems as though most of the data are presented as unordered lists of various styles. That's a `<ul>` tag with some kind of `style=` attribute. We notice, for example, that the age breakdown has style `"list-style-type: disc;"`, but so do several other categorizations.

Let's pull all of them.

In [None]:
# we can find each category between the <ul style="list-style-type: disc;"> and </ul> tags

for category in soup.find_all('ul', attrs={"style": "list-style-type: disc;"}): 
    print(category.prettify())
    print("---"*10)

So we see some categories, and then the same style is used for a list of things you can do to protect yourself from the virus. Now, the age breakdown is the second set of `<ul>` tags. The `find_all()` method produces a list-like object, meaning we can get at the second one simply by using square brackets with the index 1.

In [None]:
soup.find_all('ul', attrs={"style": "list-style-type: disc;"})[1]

In [None]:
# or pretty

print(soup.find_all('ul', attrs={"style": "list-style-type: disc;"})[1].prettify())

Great. Now, let's store this in a variable and within this tag, we will search for all the `<li>` tags and extract the text. Each tag has a method called `get_text()` that does just what we want. 

In [None]:
# how would we find the age breakdown?

ages = soup.find_all('ul', attrs={"style": "list-style-type: disc;"})[1]

for age_range in ages.find_all('li'):
    
    # from the list item, we extract the text
    print(age_range.get_text())
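Picking the list by position (index 1) works for the page as it looks today, but it would silently grab the wrong list if another `<ul>` were added above it. One possible alternative, sketched below on a fragment of the March 27, 2020 page, is to look for the bold label first and then take the first `<ul>` that follows it with `find_next()`. The helper name `list_after_label` is made up for this example, and the approach assumes the label text stays the same.

```python
from bs4 import BeautifulSoup

# a shortened fragment of the page as it appeared on March 27, 2020
fragment = (
    '<p><strong>Ages of all confirmed positive cases:</strong></p>'
    '<ul style="list-style-type: disc;"><li>Age 0-17: 54&#160;cases</li>'
    '<li>Age 18-49: 2,368&#160;cases</li></ul>'
    '<p><strong>Gender of all confirmed positive cases:</strong></p>'
    '<ul style="list-style-type: disc;"><li>Female: 2,057&#160;cases</li>'
    '<li>Male: 2,536&#160;cases</li></ul>'
)
page = BeautifulSoup(fragment, 'html.parser')

def list_after_label(doc, label_start):
    """Return the text of each <li> in the first <ul> after a matching <strong> label."""
    for strong in doc.find_all('strong'):
        if strong.get_text().startswith(label_start):
            # find_next() moves forward in the document to the next <ul>
            return [li.get_text() for li in strong.find_next('ul').find_all('li')]
    return []

print(list_after_label(page, 'Ages'))
print(list_after_label(page, 'Gender'))
```

Note that the `&#160;` entities come back as non-breaking space characters (`\xa0`) in the extracted text.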

We can now store this in a number of ways. Right now, each row is a string. How could we clean up these strings and store them so we can track them over time?

Your suggestions here
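One possible approach, offered as a sketch rather than the only answer: use a regular expression to split each row into a label and a number, then write one dated record per category. The rows below are copied from the March 27, 2020 page; the helper `parse_row` and the column names are invented for this example, and a `StringIO` buffer stands in for a real CSV file.

```python
import csv
import io
import re

# rows as get_text() returns them (\xa0 is the non-breaking space from &#160;)
rows = ['Age 0-17: 54\xa0cases', 'Age 18-49: 2,368\xa0cases', 'Unknown: 21\xa0cases']

# label, a colon, then a number that may contain commas, then the word "cases"
pattern = re.compile(r'^(?P<label>.+?):\s*(?P<count>[\d,]+)\s*cases')

def parse_row(row):
    """Turn 'Age 18-49: 2,368 cases' into ('Age 18-49', 2368); None if no match."""
    match = pattern.match(row)
    if match is None:
        return None
    return match.group('label'), int(match.group('count').replace(',', ''))

parsed = [parse_row(row) for row in rows]
counts = dict(pair for pair in parsed if pair is not None)
print(counts)

# one record per category, tagged with the date of the data
buffer = io.StringIO()  # stand-in for open('some_file.csv', 'a')
writer = csv.writer(buffer)
writer.writerow(['date', 'category', 'cases'])
for label, n in counts.items():
    writer.writerow(['2020-03-27', label, n])
print(buffer.getvalue())
```

Appending a fresh batch of rows each day would give a long table that is easy to filter by category and plot over time.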




**Searching around in a document**

We can use BeautifulSoup to search for tags that contain certain text as well. As you might guess, the search involves specifying patterns using regular expressions. To find the header `<h2>` that has the phrase "COVID-19 By the Numbers", we can simply use this string as its own regular expression (we are matching literals). (We use the command `compile()` to create a pattern object that can be applied across the text in each tag, returning those that match.)

Because we know the phrase "COVID-19 By the Numbers" occurs only once on the page, we can use `find()` to find the first match as opposed to `find_all()`. Below we also look at the element that comes right after the text we matched, and then the element after that, by chaining the `next_element` attribute.

Eventually we end up with the string that gives the date of the data...

In [None]:
from re import compile

header = soup.find(text=compile(r"COVID-19 By the Numbers"))
print(header)

In [None]:
# the next tag after the one with the text we wanted
print(header.next_element)

In [None]:
# and the tag after that
print(header.next_element.next_element)

And finally we use `get_text()` to pull the text.

In [None]:
date = header.next_element.next_element.get_text()
date
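Chaining `next_element` depends on exactly how many nodes sit between the heading and the paragraph. As a sketch of a slightly more forgiving alternative, `find_next('p')` jumps to the first `<p>` that follows, however many nodes are in between. The fragment below is a shortened copy of the page's HTML, and `string=` is the newer spelling of the `text=` argument used above.

```python
import re
from bs4 import BeautifulSoup

# shortened copy of the heading and the paragraph that follows it
fragment = ('<h2>COVID-19 By the Numbers<br/></h2>'
            '<p>As of March 27, 2020, 2 p.m. Pacific Daylight Time</p>')
doc = BeautifulSoup(fragment, 'html.parser')

heading_text = doc.find(string=re.compile(r'COVID-19 By the Numbers'))

# next_element walks the document one node at a time: text, then <br/>, then <p>
print(repr(heading_text.next_element))
print(heading_text.next_element.next_element.name)

# find_next() skips straight to the first <p> after the heading text
print(heading_text.find_next('p').get_text())
```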

From here we have to extract the date and maybe the total case counts. Regular expressions to the rescue again!
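Here is one way that extraction might go, sketched on the sentence as it read on March 27, 2020. The patterns assume the wording "As of ..." and "a total of ... positive cases and ... deaths" stays the same from day to day, which may not hold.

```python
import re
from datetime import datetime

# the paragraph text as get_text() returns it (\xa0 is a non-breaking space)
sentence = ('As of March 27, 2020, 2 p.m. Pacific Daylight Time, there are a total of '
            '4,643\xa0positive cases and 101\xa0deaths in California '
            '(including one non-California resident).\xa0')

# month name, day and four-digit year right after "As of"
date_match = re.search(r'As of (\w+ \d{1,2}, \d{4})', sentence)
as_of = datetime.strptime(date_match.group(1), '%B %d, %Y').date()

# \s also matches the non-breaking space between the number and the word
totals = re.search(r'total of ([\d,]+)\s*positive cases and ([\d,]+)\s*deaths', sentence)
cases, deaths = (int(group.replace(',', '')) for group in totals.groups())

print(as_of, cases, deaths)
```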

&#x1f3c6; **Challenge round!** &#x1f3c6;

Pick one or two of these tasks and use your skills with web scraping to answer the question. In each case, there is a URL and a data question attached to it. These come mainly from an excellent list compiled by Dan Nguyen at Stanford.

>Site: [https://analytics.usa.gov/](https://analytics.usa.gov/)<br>
Task: Number of people visiting US Government web sites now<br><br>
Site: [http://www.state.gov/r/pa/ode/socialmedia/](http://www.state.gov/r/pa/ode/socialmedia/)<br>
Task: The number of Pinterest accounts maintained by U.S. State Department embassies and missions<br><br>
Site: [https://petitions.whitehouse.gov/](https://petitions.whitehouse.gov/)<br>
Task: Number of petitions that have reached their goal<br><br>
Site: [https://www.faa.gov/air_traffic/flight_info/aeronav/aero_data/](https://www.faa.gov/air_traffic/flight_info/aeronav/aero_data/)<br>
Task: Number of airports with existing construction related activity<br><br>
Site: [https://www.osha.gov/pls/imis/establishment.html](https://www.osha.gov/pls/imis/establishment.html)<br>
Task: Number of OSHA enforcement inspections involving Wal-Mart in California since 2014<br><br>
Site: [https://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html](https://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html)<br>
Task: Number of days until Texas's next scheduled execution <br><br>