# Web Scraping with Python

Copying and pasting is great. You should definitely use it if it's the simplest way. But if you don't want to copy and paste 10 webpages into Excel or if you have some time to kill, then web scraping is the answer or time sink you've been looking for.

We start by importing the Python modules that we will need to do web scraping. We will be using `requests` to fetch HTML pages and `BeautifulSoup` to parse them. `pandas` will be used for data manipulation. The `pd.options.display` line is for formatting purposes when printing out results in `pandas`.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

pd.options.display.max_rows = 10

# Example 1: Wikipedia Page (HTML Table)

We start by opening the <a href='https://en.wikipedia.org/wiki/List_of_Michigan_locations_by_per_capita_income'>website of interest</a> in a browser. We can see that it looks nicely formatted like a table. We start with passing the website of interest to the `requests.get` method.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_Michigan_locations_by_per_capita_income'
R = requests.get(url)
R.raise_for_status()

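**Note**: Without the `R.raise_for_status()` line, a request that comes back with an HTTP error status (such as 404) does not raise an exception. The code carries on silently with the error page, which is probably not what you want.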
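To see what the note above means without touching the network, here is a minimal sketch that builds a `requests.Response` by hand and sets its status code to 404. Constructing a response manually is only for illustration (real responses come from `requests.get`), and the variable names are made up.

```python
import requests

# Build a response object by hand to imitate a "404 Not Found" reply.
# (Illustration only: in practice the response comes from requests.get.)
fake = requests.Response()
fake.status_code = 404

# Reading attributes of a failed response does not raise anything ...
is_ok = fake.ok            # False for 4xx/5xx status codes

# ... the exception only appears when raise_for_status() is called.
try:
    fake.raise_for_status()
    raised = False
except requests.HTTPError:
    raised = True

print(is_ok, raised)
```

Catching `requests.HTTPError` like this is one way to log or skip a bad url instead of stopping the whole script.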

We then use `BeautifulSoup` to parse the contents of the response.

In [3]:
soup = BeautifulSoup(R.content, 'html.parser')

I like to use the browser's developer tool to do the detective work of figuring out where the data resides (*right click -> Inspect*). In this example, the data of interest resides in an HTML table (makes life easier).

We see that the data of interest resides in a table tag `<table>`. To grab everything between the table tags, we use the `find_all` method (one of many options but probably the only one you need and the one you will use most often).

In [4]:
tables = soup.find_all('table')

`find_all` returns a list of matches. We can use the `len` function to see how many matches came back.

In [5]:
len(tables)

4

We can check the contents of the table using the `text` attribute. It will be a formatting mess but that's okay.

In [6]:
tables[0].text

'This article is part of a series onIncome in theUnited States of America\nTopics\nHousehold\nPersonal\nAffluence\nSocial class\nIncome inequality\ngender pay gap\nethnic wage gap\n\nLists by income\nStates (by equality (Gini))\nCounties (highest\xa0/ lowest)\nLocations (lowest)\nMetropolitan statistical areas\nUrban areas\nZIP Code Tabulation Areas\nEthnic groups\n\n United States portalvte'

In [7]:
tables[1].text

'\n\nRank\n\nCounty\n\nPer capitaincome\n\nMedianhouseholdincome\n\nMedianfamilyincome\n\nPopulation\n\nNumber ofhouseholds\n\n\n1\n\nOakland\n\n$56,138\n\n$85,991\n\n$94,783\n\n1,202,362\n\n483,698\n\n\n2\n\nLeelanau\n\n$32,194\n\n$56,527\n\n$65,342\n\n21,708\n\n9,255\n\n\n3\n\nLivingston\n\n$31,609\n\n$72,129\n\n$82,637\n\n180,967\n\n67,380\n\n\n4\n\nWashtenaw\n\n$31,316\n\n$59,065\n\n$82,184\n\n344,791\n\n137,193\n\n\n5\n\nCharlevoix\n\n$28,403\n\n$48,704\n\n$57,022\n\n25,949\n\n10,882\n\n\n6\n\nMidland\n\n$28,363\n\n$51,103\n\n$63,299\n\n83,629\n\n33,437\n\n\n7\n\nEmmet\n\n$28,308\n\n$49,235\n\n$61,600\n\n32,694\n\n13,601\n\n\n\n\nUnited States\n\n$27,334\n\n$51,914\n\n$62,982\n\n308,745,538\n\n116,716,292\n\n\n8\n\nClinton\n\n$27,223\n\n$58,016\n\n$69,611\n\n75,382\n\n28,766\n\n\n9\n\nGrand Traverse\n\n$27,091\n\n$50,647\n\n$61,780\n\n86,986\n\n35,328\n\n\n10\n\nMacomb\n\n$26,524\n\n$53,996\n\n$67,423\n\n840,978\n\n331,667\n\n\n11\n\nEaton\n\n$25,963\n\n$54,885\n\n$66,788\n\n107,7

Looks like this is the table we want. 

A better way to find the table of interest is to pass in extra search terms to the `find_all` method after using the developer tool to find searchable attributes.

In [8]:
tables = soup.find_all('table', class_="wikitable sortable")
len(tables)

1

We get one result now instead.
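One detail worth knowing about the search above: when `class_` is given a string with several class names, `BeautifulSoup` compares it against the exact value of the `class` attribute, so the order of the names matters. The sketch below uses a small hand-written HTML snippet (not the Wikipedia page) to compare this with a single class name and with a CSS selector via `select`.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a page with several tables.
html = """
<table class="infobox"><tr><td>sidebar</td></tr></table>
<table class="wikitable sortable"><tr><td>A</td></tr></table>
<table class="sortable wikitable"><tr><td>B</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

all_tables = demo_soup.find_all("table")
# A multi-word class_ string must equal the class attribute exactly.
exact = demo_soup.find_all("table", class_="wikitable sortable")
# A single class name matches any table that carries that class.
single = demo_soup.find_all("table", class_="wikitable")
# A CSS selector matches both classes regardless of their order.
by_css = demo_soup.select("table.wikitable.sortable")

print(len(all_tables), len(exact), len(single), len(by_css))
```

If a search that looks right in the developer tool comes back empty, trying a single class name or a CSS selector is a reasonable next step.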

In [9]:
table = tables[0]

The data resides in HTML table cell tags `<td>` within table row tags `<tr>`. We use `find_all` to look for all the table row tags within the table tag `<table>`.

In [10]:
table_rows = table.find_all('tr')
len(table_rows)

86

We now see that there are a lot of matches for that tag, as expected. It's close to the total number of counties. Let's look at the first few entries of `table_rows`.

In [11]:
table_rows[0]

<tr valign="bottom">
<th>Rank
</th>
<th>County
</th>
<th>Per capita<br/>income
</th>
<th>Median<br/>household<br/>income
</th>
<th>Median<br/>family<br/>income
</th>
<th>Population
</th>
<th>Number of<br/>households
</th></tr>

Looks like the header row. The `<th>` tag also gives it away, fyi.

In [12]:
table_rows[1]

<tr>
<td>1
</td>
<td><a href="/wiki/Oakland_County,_Michigan" title="Oakland County, Michigan">Oakland</a>
</td>
<td>$56,138
</td>
<td>$85,991
</td>
<td>$94,783
</td>
<td>1,202,362
</td>
<td>483,698
</td></tr>

It's the row data for the __first county__.

We will use a nested `for` loop to go through the list of table rows. The inner `for` loop will go through each `<td>` tag, appending the text to a list. We will grab all the data in the tags regardless of whether we want to keep them for now.

We have two lists in the `for` loop. `cells` will contain the text of each `td` tag in one table row. Once the row is iterated through, we append `cells` to `rows`, which ends up as a list of lists (one inner list per table row).

In [13]:
rows = []
for row in table_rows[1:]:
    table_cells = row.find_all('td')
    cells = []
    for cell in table_cells:
        cells.append(cell.text)
    rows.append(cells)
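The nested loop above can also be written as a nested list comprehension. The sketch below does that on a small hand-written table (not the real page) and uses `get_text(strip=True)` instead of `.text`, which removes the trailing newline that each cell otherwise carries.

```python
from bs4 import BeautifulSoup

# Hand-written table that imitates the layout of the Wikipedia markup.
html = """
<table>
<tr><th>Rank
</th><th>County
</th></tr>
<tr><td>1
</td><td><a href="/wiki/Oakland_County,_Michigan">Oakland</a>
</td></tr>
<tr><td>2
</td><td>Leelanau
</td></tr>
</table>
"""
demo_rows = BeautifulSoup(html, "html.parser").find_all("tr")

# One inner list per data row; strip=True drops surrounding whitespace.
clean_rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in demo_rows[1:]          # skip the header row
]
print(clean_rows)
```

Whether to strip at this stage or later in `pandas` is a matter of taste; the rest of this example keeps the raw text.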

We take the list of rows and pass it to `pd.DataFrame` to build one dataframe.

In [14]:
data = pd.DataFrame(rows) 

Let's look at the first 5 rows of the dataframe.

In [15]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6
0,1\n,Oakland\n,"$56,138\n","$85,991\n","$94,783\n","1,202,362\n","483,698\n"
1,2\n,Leelanau\n,"$32,194\n","$56,527\n","$65,342\n","21,708\n","9,255\n"
2,3\n,Livingston\n,"$31,609\n","$72,129\n","$82,637\n","180,967\n","67,380\n"
3,4\n,Washtenaw\n,"$31,316\n","$59,065\n","$82,184\n","344,791\n","137,193\n"
4,5\n,Charlevoix\n,"$28,403\n","$48,704\n","$57,022\n","25,949\n","10,882\n"


We didn't get the header row because it was contained in `<th>` tags and not `<td>` tags. We use the same type of `for` loop to extract the text from the table headers.

In [16]:
headers = table_rows[0].find_all('th')
columns = []
for header in headers:
    columns.append(header.text)

Set the column headers of the dataframe.

In [17]:
data.columns = columns
data.head()

Unnamed: 0,Rank\n,County\n,Per capitaincome\n,Medianhouseholdincome\n,Medianfamilyincome\n,Population\n,Number ofhouseholds\n
0,1\n,Oakland\n,"$56,138\n","$85,991\n","$94,783\n","1,202,362\n","483,698\n"
1,2\n,Leelanau\n,"$32,194\n","$56,527\n","$65,342\n","21,708\n","9,255\n"
2,3\n,Livingston\n,"$31,609\n","$72,129\n","$82,637\n","180,967\n","67,380\n"
3,4\n,Washtenaw\n,"$31,316\n","$59,065\n","$82,184\n","344,791\n","137,193\n"
4,5\n,Charlevoix\n,"$28,403\n","$48,704\n","$57,022\n","25,949\n","10,882\n"


We now have all the data of interest. __Webscraping DONE!__ If you want to do some data cleanup and management, this is where knowledge of `pandas` becomes useful (CSCAR has a `pandas` workshop on how to do just that).
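As a taste of the cleanup mentioned above, here is a minimal sketch on two rows copied from the output (with a reduced set of columns). It strips the trailing newlines and converts the dollar and population strings to integers. The variable names are made up for the illustration.

```python
import pandas as pd

# Two rows copied from the output above, trailing newlines included.
raw = pd.DataFrame(
    [["1\n", "Oakland\n", "$56,138\n", "1,202,362\n"],
     ["2\n", "Leelanau\n", "$32,194\n", "21,708\n"]],
    columns=["Rank\n", "County\n", "Per capitaincome\n", "Population\n"],
)

clean = raw.copy()
# Strip whitespace from the column names and from every text cell.
clean.columns = clean.columns.str.strip()
for col in clean.columns:
    clean[col] = clean[col].str.strip()

# Drop "$" and "," so the numeric columns can be converted to integers.
for col in ["Rank", "Per capitaincome", "Population"]:
    clean[col] = clean[col].str.replace(r"[$,]", "", regex=True).astype(int)

print(clean.dtypes)
```

On the full table, rows without a rank (the United States summary row appears to be one) would need handling before the integer conversion, for example with `pd.to_numeric(..., errors='coerce')`.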

## `pandas` Approach to HTML Tables

The pandas `read_html` function can read in HTML tables (and only HTML tables). It relies on a parser library such as `lxml` or `BeautifulSoup` under the hood.

In [18]:
list_tables = pd.read_html(url)

**Note**: If you get an `ImportError: html5lib not found, please install it` or `ImportError: lxml not found, please install it` message, you will need to install it via `conda install html5lib` or `conda install lxml` and restart the kernel or Jupyter Notebook.

This returns a list of dataframes. One dataframe for each table in the html page. Recall how many tables there were. 

In [19]:
len(list_tables)

4

Let's check their contents.

In [20]:
list_tables[1].head()

Unnamed: 0,Rank,County,Per capitaincome,Medianhouseholdincome,Medianfamilyincome,Population,Number ofhouseholds
0,1.0,Oakland,"$56,138","$85,991","$94,783",1202362,483698
1,2.0,Leelanau,"$32,194","$56,527","$65,342",21708,9255
2,3.0,Livingston,"$31,609","$72,129","$82,637",180967,67380
3,4.0,Washtenaw,"$31,316","$59,065","$82,184",344791,137193
4,5.0,Charlevoix,"$28,403","$48,704","$57,022",25949,10882


The same table contains our data of interest (as expected).

You can also pass tag attributes to `pandas` like we did before, using the keyword argument `attrs`. There is also a keyword argument `match` that does a text search within the table.

In [21]:
list_tables = pd.read_html(url, attrs={'class':'wikitable sortable'})
list_tables[0].head()

Unnamed: 0,Rank,County,Per capitaincome,Medianhouseholdincome,Medianfamilyincome,Population,Number ofhouseholds
0,1.0,Oakland,"$56,138","$85,991","$94,783",1202362,483698
1,2.0,Leelanau,"$32,194","$56,527","$65,342",21708,9255
2,3.0,Livingston,"$31,609","$72,129","$82,637",180967,67380
3,4.0,Washtenaw,"$31,316","$59,065","$82,184",344791,137193
4,5.0,Charlevoix,"$28,403","$48,704","$57,022",25949,10882
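Both ways of narrowing down the tables can be tried offline. The sketch below feeds a small hand-written HTML string (not the Wikipedia page) to `pd.read_html` through `StringIO`, once with `match` and once with `attrs`. It assumes one of the parser libraries from the note above is installed.

```python
from io import StringIO

import pandas as pd

# Hand-written page with two tables; only the first holds county data.
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>County</th><th>Population</th></tr>
  <tr><td>1</td><td>Oakland</td><td>1,202,362</td></tr>
  <tr><td>2</td><td>Leelanau</td><td>21,708</td></tr>
</table>
<table class="navbox">
  <tr><th>Topic</th></tr>
  <tr><td>Household</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string/regex.
by_text = pd.read_html(StringIO(html), match="County")
# attrs keeps only the tables whose tag attributes match.
by_attrs = pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})

print(len(by_text), len(by_attrs))
print(by_text[0])
```

Note that `read_html` already removed the thousands separators from the population column, which is why that column came back as numbers in the output above.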


This leaves us at a similar point to the one we reached earlier in the example. `pandas` has basically done the web scraping for us and left us with the data cleanup and wrangling. You should always expect to do some data manipulation if you use the `pd.read_html()` function or any webscraping for that matter.

Q: So why did we learn a more complicated way of doing things when `pd.read_html()` can do it for you?  
A: Because not everything resides in an HTML table and you'll need to use the same techniques to get at the data. If the data resides in an HTML table, consider yourself fortunate, use `pandas` and move on.

## Exercise

Scrape Serena Williams' Wikipedia page https://en.wikipedia.org/wiki/Serena_Williams for the data in the html table **Grand Slam tournament finals**  for *Singles* in the *Career Statistics* section.

# Example 2: USA TODAY Best Selling Books (not in HTML Table)

In this example, the website of interest is https://www.usatoday.com/life/books/best-selling/. Suppose we are interested in getting some basic information about the book list. For this example, the data of interest does not exist in an HTML table.

We start with the boilerplate template of passing the website of interest to the `requests.get` method. We then use `BeautifulSoup` to parse the contents of the response.

In [36]:
url = 'https://www.usatoday.com/life/books/best-selling/'
R = requests.get(url)
R.raise_for_status()
soup = BeautifulSoup(R.content, 'html.parser')

Navigate to the website and use the browser's developer tool to inspect items of interest.

The data seems to be residing in a `div` tag. Let's search for that tag together with its class and see how many matches we get.

In [37]:
booklist = soup.find_all('div', class_="front-booklist-info-container")
len(booklist)

10

Let's look at the text associated within each `<div>` book tag.

In [38]:
for i, book in enumerate(booklist):
    print(i, book.text)

0 1Bloody GeniusbyJohn SandfordVirgil Flowers investigates the murder of a scholar at a local university; 12th in seriesGenre:General fictionDebuted:October 10 2019
                            Buy Now
                        
                                    Buy Now
                                AmazonBarnes & NobleGoogle PlayiBooksIndieboundKoboZola1Weeks Listed#1Best Week
1 2BlowoutbyRachel MaddowSubtitle: "Corrupted Democracy, Rogue State Russia, and the Richest, Most Destructive Industry on Earth"Genre:Current affairsDebuted:October 10 2019
                            Buy Now
                        
                                    Buy Now
                                AmazonBarnes & NobleGoogle PlayiBooksIndieboundKoboZola1Weeks Listed#2Best Week
2 3Where the Crawdads SingbyDelia OwensThe reclusive Kya Clark is suspected in the death of Chase AndrewsGenre:General fictionDebuted:September 13 2018
                            Buy Now
                        
              

Suppose we want the title, author and genre. Let's use the browser to inspect which tags the data lies in. Then we'll use the `find_all` method to search for it.

In [39]:
for rank, book in enumerate(booklist, start=1):
    title = book.find_all('h3', class_='books-front-meta-title')[0].text
    author = book.find_all('span', class_='books-front-meta-authorInfo')[0].text
    genre = book.find_all('div', class_='books-front-meta-genre')[0].text
    print(f'{rank}. {title}, {author}, {genre}')

1. Bloody Genius, John Sandford, Genre:General fiction
2. Blowout, Rachel Maddow, Genre:Current affairs
3. Where the Crawdads Sing, Delia Owens, Genre:General fiction
4. The Institute, Stephen King, Genre:General fiction
5. The Water Dancer, Ta-Nehisi Coates, Genre:Fantasy/Sci-fi
6. The Dutch House, Ann Patchett, Genre:General fiction
7. Dog Man: For Whom the Ball Rolls, Dav Pilkey, Genre:Youth
8. The Book of Gutsy Women, Hillary Rodham Clinton, Chelsea Clinton, Genre:Biography
9. Lethal Agent, Vince Flynn, Kyle Mills, Genre:General fiction
10. Guts, Raina Telgemeier, Genre:Youth
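The loop above indexes `find_all(...)[0]`, which raises an `IndexError` as soon as one book is missing a field. A more forgiving variant, sketched below on a hand-written snippet that reuses the same class names, uses `find` (which returns `None` when nothing matches) and also drops the `Genre:` label. The helper `text_or_none` is a made-up name for this illustration.

```python
from bs4 import BeautifulSoup

# Hand-written snippet reusing the class names from the example above;
# the second book deliberately has no genre element.
html = """
<div class="front-booklist-info-container">
  <h3 class="books-front-meta-title">Bloody Genius</h3>
  <span class="books-front-meta-authorInfo">John Sandford</span>
  <div class="books-front-meta-genre">Genre:General fiction</div>
</div>
<div class="front-booklist-info-container">
  <h3 class="books-front-meta-title">Blowout</h3>
  <span class="books-front-meta-authorInfo">Rachel Maddow</span>
</div>
"""

def text_or_none(parent, name, class_):
    """Return the stripped text of the first match, or None if absent."""
    tag = parent.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else None

books = []
demo_soup = BeautifulSoup(html, "html.parser")
for book in demo_soup.find_all("div", class_="front-booklist-info-container"):
    genre = text_or_none(book, "div", "books-front-meta-genre")
    if genre is not None and genre.startswith("Genre:"):
        genre = genre[len("Genre:"):]   # drop the label, keep the value
    books.append((
        text_or_none(book, "h3", "books-front-meta-title"),
        text_or_none(book, "span", "books-front-meta-authorInfo"),
        genre,
    ))
print(books)
```

Missing values then show up as `None` in the data instead of stopping the scrape halfway through a page.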


Webscraping DONE! At least for page one. To scrape the rest of the pages, let's navigate to the other pages. Notice anything different in the url?

Before: https://www.usatoday.com/life/books/best-selling/  
After: https://www.usatoday.com/life/books/best-selling/week/2019/41/page/2/

The **before url** is the webpage for the current bestseller list, page 1. The **after url** is for a specific week and page. We can use this format to scrape any page for any given week. Much more useful than the generic url.
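The week-and-page pattern described above can be captured in a small helper. This is only a sketch: `bestseller_url` is a hypothetical name, and the assumption that week 40 follows the same pattern as week 41 is not checked here.

```python
def bestseller_url(year, week, page):
    """Build the list URL for one week and page (hypothetical helper)."""
    return ("https://www.usatoday.com/life/books/best-selling/"
            f"week/{year}/{week}/page/{page}/")

# Three pages for each of two consecutive weeks.
urls = [bestseller_url(2019, week, page)
        for week in (40, 41)
        for page in range(1, 4)]
print(urls[4])
```

Each of these urls could then be passed to `requests.get` in a loop.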

In [40]:
rank = 0
data = []
for page in range(1,4):    
    url = f'https://www.usatoday.com/life/books/best-selling/week/2019/41/page/{page}/'
    R = requests.get(url)
    R.raise_for_status()
    soup = BeautifulSoup(R.content, 'html.parser')
    # find the element of interest matching our criteria
    booklist = soup.find_all('div', class_='front-booklist-info-container')
    for book in booklist:
        title = book.find_all('h3', class_='books-front-meta-title')[0].text
        author = book.find_all('span', class_='books-front-meta-authorInfo')[0].text
        genre = book.find_all('div', class_='books-front-meta-genre')[0].text
        rank += 1
        print(f'{rank}. {title}, {author}, {genre}')
        data.append((rank,title,author,genre))

1. Bloody Genius, John Sandford, Genre:General fiction
2. Blowout, Rachel Maddow, Genre:Current affairs
3. Where the Crawdads Sing, Delia Owens, Genre:General fiction
4. The Institute, Stephen King, Genre:General fiction
5. The Water Dancer, Ta-Nehisi Coates, Genre:Fantasy/Sci-fi
6. The Dutch House, Ann Patchett, Genre:General fiction
7. Dog Man: For Whom the Ball Rolls, Dav Pilkey, Genre:Youth
8. The Book of Gutsy Women, Hillary Rodham Clinton, Chelsea Clinton, Genre:Biography
9. Lethal Agent, Vince Flynn, Kyle Mills, Genre:General fiction
10. Guts, Raina Telgemeier, Genre:Youth
11. Talking to Strangers, Malcolm Gladwell, Genre:Current affairs
12. The United States of Trump, Bill O'Reilly, Genre:Current affairs
13. Room on the Broom, Julia Donaldson, Axel Scheffler, Genre:Children
14. The Testaments, Margaret Atwood, Genre:General fiction
15. Stillness Is the Key, Ryan Holiday, Genre:Psychology/Self-help
16. Inside Out, Demi Moore, Genre:Memoir
17. The Trials of Apollo: The Tyrant's T

Let's convert the data into a `pandas` dataframe with column headers

In [41]:
topsellers = pd.DataFrame(data, columns=['rank','title','author','genre'])
topsellers

Unnamed: 0,rank,title,author,genre
0,1,Bloody Genius,John Sandford,Genre:General fiction
1,2,Blowout,Rachel Maddow,Genre:Current affairs
2,3,Where the Crawdads Sing,Delia Owens,Genre:General fiction
3,4,The Institute,Stephen King,Genre:General fiction
4,5,The Water Dancer,Ta-Nehisi Coates,Genre:Fantasy/Sci-fi
...,...,...,...,...
25,26,On Tyranny,Timothy Snyder,Genre:Current affairs
26,27,Pete the Cat: Trick or Pete,James Dean,Genre:Children
27,28,Rhythms of Renewal,Rebekah Lyons,Genre:Religion/Inspiration
28,29,Where Do I Begin,Elvis Duran,Genre:Memoir


## Exercise

Modify the code below to scrape the entire list and add columns such as when it debuted and how long it's been on the charts.

In [42]:
rank = 0
data = []
for page in range(1,4):    
    url = r'http://www.usatoday.com/life/books/best-selling/week/2019/41/page/{}/'.format(page)    
    R = requests.get(url)
    R.raise_for_status()
    soup = BeautifulSoup(R.content, 'html.parser')
    # find the element of interest matching our criteria
    booklist = soup.find_all('div', class_='front-booklist-info-container')
    for book in booklist:
        title = book.find_all('h3', class_='books-front-meta-title')[0].text
        author = book.find_all('span', class_='books-front-meta-authorInfo')[0].text
        genre = book.find_all('div', class_='books-front-meta-genre')[0].text
        rank += 1
        print(f'{rank}. {title}, {author}, {genre}')
        data.append((rank,title,author,genre))

1. Bloody Genius, John Sandford, Genre:General fiction
2. Blowout, Rachel Maddow, Genre:Current affairs
3. Where the Crawdads Sing, Delia Owens, Genre:General fiction
4. The Institute, Stephen King, Genre:General fiction
5. The Water Dancer, Ta-Nehisi Coates, Genre:Fantasy/Sci-fi
6. The Dutch House, Ann Patchett, Genre:General fiction
7. Dog Man: For Whom the Ball Rolls, Dav Pilkey, Genre:Youth
8. The Book of Gutsy Women, Hillary Rodham Clinton, Chelsea Clinton, Genre:Biography
9. Lethal Agent, Vince Flynn, Kyle Mills, Genre:General fiction
10. Guts, Raina Telgemeier, Genre:Youth
11. Talking to Strangers, Malcolm Gladwell, Genre:Current affairs
12. The United States of Trump, Bill O'Reilly, Genre:Current affairs
13. Room on the Broom, Julia Donaldson, Axel Scheffler, Genre:Children
14. The Testaments, Margaret Atwood, Genre:General fiction
15. Stillness Is the Key, Ryan Holiday, Genre:Psychology/Self-help
16. Inside Out, Demi Moore, Genre:Memoir
17. The Trials of Apollo: The Tyrant's T

# Example 3: Geocoding (using an API) 

This example shows how to use an API to gather data of interest. Sometimes, you will need to get an API key or access token to access the website. The website should have a developers or API section to let you know how to query the API with the appropriate parameters. This section will also detail the terms of usage and any usage limits on using the API. APIs usually follow a freemium (free for a little stuff, pay for more stuff) business model. 

We'll show you how to geocode addresses using the Google Maps Geocoding API. Let's look at the documentation quickly.    
https://developers.google.com/maps/documentation/geocoding/intro

The documentation tells us the format looks something like this

`https://maps.googleapis.com/maps/api/geocode/outputFormat?parameters`

In summary, the documentation tells us about:
1. Needing an API key
2. Format of the request needed
3. Choosing an output format
4. Required parameters
5. Optional parameters

**Note**: Starting June 2018, Google requires you to enable billing to use its Google Maps APIs. It gives you $200/month of free usage before billing starts.

## Query String Parameters

A url that contains a query string will contain three parts:
1. Resource (base) URL
2. Question Mark (?)
3. Parameters (key=value pairs) separated by an ampersand (&)
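The three parts listed above can be pulled apart with the standard library. The sketch below splits a geocoding-style url into its base url and its parameters; `YOUR_API_KEY` is a placeholder, not a real key.

```python
from urllib.parse import parse_qs, urlsplit

example_url = ("https://maps.googleapis.com/maps/api/geocode/json"
               "?address=915+E+Washington%2C+Ann+Arbor&key=YOUR_API_KEY")

parts = urlsplit(example_url)
# 1. Resource (base) URL: everything before the question mark.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
# 3. Parameters: key=value pairs, decoded into a dictionary of lists.
query = parse_qs(parts.query)

print(base_url)
print(query)
```

`parse_qs` returns a list for every key because a query string may repeat the same key several times.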

Let's look at the Google geocoding example with our own API key.

In [51]:
import os
apikey = 'AIzaSyBaneriwLXc9UQy5pNGF0sThcQ9sfMUibQ'
url = f'https://maps.googleapis.com/maps/api/geocode/json?address=915+E+Washington%2C+Ann+Arbor&key={apikey}'
url

'https://maps.googleapis.com/maps/api/geocode/json?address=915+E+Washington%2C+Ann+Arbor&key=AIzaSyCmXigJXZovZrP16Yjzn8i7XbozDoibO9A'

Let's break down the different pieces for this example.
1. The resource url is `https://maps.googleapis.com/maps/api/geocode/json`
2. The output format chosen was json
3. `?`
4. First required parameter is `address=915+E+Washington%2C+Ann+Arbor`
5. `&`
6. Second required parameter is `key=<YOUR API KEY>`
7. No optional key/value pair

Note that some characters cannot be part of a URL, like the spaces in the above example. Spaces are encoded as `+` or `%20`, and the comma is encoded as `%2C`.
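The two encodings of a space mentioned above correspond to two standard-library functions. A minimal sketch, using the address from this example:

```python
from urllib.parse import quote, quote_plus

address = "915 E Washington, Ann Arbor"

# quote_plus is meant for query strings: spaces become "+".
plus_encoded = quote_plus(address)
# quote is meant for URL paths: spaces become "%20".
percent_encoded = quote(address)

print(plus_encoded)
print(percent_encoded)
```

In both cases the comma is percent-encoded as `%2C`.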

We now can make a `GET` request. Same as before.

In [52]:
R = requests.get(url)
response = R.json()
response

{'results': [{'address_components': [{'long_name': '915',
     'short_name': '915',
     'types': ['street_number']},
    {'long_name': 'East Washington Street',
     'short_name': 'E Washington St',
     'types': ['route']},
    {'long_name': 'Burns Park',
     'short_name': 'Burns Park',
     'types': ['neighborhood', 'political']},
    {'long_name': 'Ann Arbor',
     'short_name': 'Ann Arbor',
     'types': ['locality', 'political']},
    {'long_name': 'Washtenaw County',
     'short_name': 'Washtenaw County',
     'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'Michigan',
     'short_name': 'MI',
     'types': ['administrative_area_level_1', 'political']},
    {'long_name': 'United States',
     'short_name': 'US',
     'types': ['country', 'political']},
    {'long_name': '48109', 'short_name': '48109', 'types': ['postal_code']},
    {'long_name': '1070',
     'short_name': '1070',
     'types': ['postal_code_suffix']}],
   'formatted_address': '915 E Wa

Webscraping done! API conquered!

Manually encoding strings for URLs can be a pain. Thankfully, the `requests` library takes care of all this for us if we pass a dictionary to the `params` keyword argument.

First we need to construct the dictionary for the parameters and then specify the resource url.

In [53]:
params = {'address': '915 E Washington, Ann Arbor',
          'key':apikey
         }
baseurl = 'https://maps.googleapis.com/maps/api/geocode/json'

Make a `GET` request with the params keyword argument

In [54]:
R = requests.get(baseurl, params=params)
R.raise_for_status()
response = R.json()
response

{'results': [{'address_components': [{'long_name': '915',
     'short_name': '915',
     'types': ['street_number']},
    {'long_name': 'East Washington Street',
     'short_name': 'E Washington St',
     'types': ['route']},
    {'long_name': 'Burns Park',
     'short_name': 'Burns Park',
     'types': ['neighborhood', 'political']},
    {'long_name': 'Ann Arbor',
     'short_name': 'Ann Arbor',
     'types': ['locality', 'political']},
    {'long_name': 'Washtenaw County',
     'short_name': 'Washtenaw County',
     'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'Michigan',
     'short_name': 'MI',
     'types': ['administrative_area_level_1', 'political']},
    {'long_name': 'United States',
     'short_name': 'US',
     'types': ['country', 'political']},
    {'long_name': '48109', 'short_name': '48109', 'types': ['postal_code']},
    {'long_name': '1070',
     'short_name': '1070',
     'types': ['postal_code_suffix']}],
   'formatted_address': '915 E Wa

Same result. And we can also spy the actual url that was sent.

In [55]:
R.url

'https://maps.googleapis.com/maps/api/geocode/json?address=915+E+Washington%2C+Ann+Arbor&key=AIzaSyCmXigJXZovZrP16Yjzn8i7XbozDoibO9A'
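It is also possible to look at the url that `requests` would build without sending anything, which is handy when experimenting with an API that has usage limits. The sketch below uses `requests.Request(...).prepare()` with a placeholder key; the variable names are made up.

```python
import requests

demo_params = {"address": "915 E Washington, Ann Arbor",
               "key": "YOUR_API_KEY"}   # placeholder, not a real key
demo_base = "https://maps.googleapis.com/maps/api/geocode/json"

# prepare() builds the final request, including the encoded query string,
# but does not send it.
prepared = requests.Request("GET", demo_base, params=demo_params).prepare()
print(prepared.url)
```

The encoded address matches the one we typed by hand earlier.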

That's it! Pretty simple, eh! A lot of APIs work just like this. Some, of course, are a bit more complicated. 

**Note:** CSCAR occasionally runs workshops on using social media APIs.

# Traversing a JSON object

Now that we have a JSON response from the API, we need to know how to parse it for the information we are looking for. `R.json()` converts the JSON into Python objects: a JSON object becomes a dictionary of key-value pairs and a JSON array becomes a list, and the two can be nested inside each other.

To look at the variables in a JSON object hierarchically, use the `keys` method.

In [56]:
response.keys()

dict_keys(['results', 'status'])

Access the key like you would a Python dictionary

In [57]:
response['results']

[{'address_components': [{'long_name': '915',
    'short_name': '915',
    'types': ['street_number']},
   {'long_name': 'East Washington Street',
    'short_name': 'E Washington St',
    'types': ['route']},
   {'long_name': 'Burns Park',
    'short_name': 'Burns Park',
    'types': ['neighborhood', 'political']},
   {'long_name': 'Ann Arbor',
    'short_name': 'Ann Arbor',
    'types': ['locality', 'political']},
   {'long_name': 'Washtenaw County',
    'short_name': 'Washtenaw County',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'Michigan',
    'short_name': 'MI',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'United States',
    'short_name': 'US',
    'types': ['country', 'political']},
   {'long_name': '48109', 'short_name': '48109', 'types': ['postal_code']},
   {'long_name': '1070',
    'short_name': '1070',
    'types': ['postal_code_suffix']}],
  'formatted_address': '915 E Washington St, Ann Arbor, MI 48109, USA

Occasionally, you will encounter a `list` of key-value pairs in the hierarchy.

You would access the `list` the same way as a Python list.

In [58]:
response['results'][0]

{'address_components': [{'long_name': '915',
   'short_name': '915',
   'types': ['street_number']},
  {'long_name': 'East Washington Street',
   'short_name': 'E Washington St',
   'types': ['route']},
  {'long_name': 'Burns Park',
   'short_name': 'Burns Park',
   'types': ['neighborhood', 'political']},
  {'long_name': 'Ann Arbor',
   'short_name': 'Ann Arbor',
   'types': ['locality', 'political']},
  {'long_name': 'Washtenaw County',
   'short_name': 'Washtenaw County',
   'types': ['administrative_area_level_2', 'political']},
  {'long_name': 'Michigan',
   'short_name': 'MI',
   'types': ['administrative_area_level_1', 'political']},
  {'long_name': 'United States',
   'short_name': 'US',
   'types': ['country', 'political']},
  {'long_name': '48109', 'short_name': '48109', 'types': ['postal_code']},
  {'long_name': '1070',
   'short_name': '1070',
   'types': ['postal_code_suffix']}],
 'formatted_address': '915 E Washington St, Ann Arbor, MI 48109, USA',
 'geometry': {'location

Note, there is only one element in this list.

If you want to drill deeper into the `results` key, you would repeat the process to look at available keys and so forth.

In [59]:
response['results'][0].keys()

dict_keys(['address_components', 'formatted_address', 'geometry', 'place_id', 'plus_code', 'types'])
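Repeating `keys()` level by level gets tedious for deep responses. As a rough alternative, the sketch below walks nested dictionaries and lists recursively and lists the access path of every leaf value. `key_paths` is a made-up helper, and `sample` is a trimmed, hand-written stand-in for the real response (the `'OK'` status value is an assumption).

```python
def key_paths(obj, prefix=""):
    """Yield one access path per leaf value in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from key_paths(value, f"{prefix}['{key}']")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from key_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix

# Trimmed, hand-written stand-in for the geocoding response.
sample = {
    "results": [{
        "formatted_address": "915 E Washington St, Ann Arbor, MI 48109, USA",
        "geometry": {"location": {"lat": 42.2808083}},
    }],
    "status": "OK",
}

paths = list(key_paths(sample, "response"))
print("\n".join(paths))
```

Each printed line can be pasted back into the notebook to reach that value.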

To arrive at the latitude coordinate, you would need the following syntax:

In [60]:
response['results'][0]['geometry']['location']['lat']

42.2808083

**Note**: If you have a keen eye, you can just eyeball the hierarchy without needing to use the `keys` method.
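The chained lookup above fails with an `IndexError` when the API finds no match and `results` is an empty list. A minimal sketch of a guarded lookup follows; `first_latitude` is a hypothetical helper, the JSON text is a trimmed, hand-written stand-in for the real response, and the status values are assumptions.

```python
import json

raw_json = """
{"results": [{"formatted_address": "915 E Washington St, Ann Arbor, MI 48109, USA",
              "geometry": {"location": {"lat": 42.2808083}}}],
 "status": "OK"}
"""

def first_latitude(payload):
    """Return the latitude of the first result, or None if there is none."""
    results = payload.get("results", [])
    if not results:
        return None
    return results[0]["geometry"]["location"]["lat"]

parsed = json.loads(raw_json)          # JSON text -> dicts and lists
found = first_latitude(parsed)
missing = first_latitude({"results": [], "status": "ZERO_RESULTS"})
print(found, missing)
```

This matters most when geocoding a long list of addresses, where a few are bound to come back empty.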

# Example 4: Weather Data (using an API)

Here is an example of an API that doesn't follow the query string parameter format.

We will be using <a href='https://darksky.net/app/'>https://darksky.net/app/</a> to gather weather data.  The developer section is at https://darksky.net/dev/. The API documentation is at https://darksky.net/dev/docs. You will need to create an account to get an API key (or you can borrow mine).

**Note**: Dark Sky is a relatively simple API (part of the reason why I'm using it as an example). 

## Weather Forecast

APIs usually have different endpoints depending on the data you are interested in. The documentation page shows that you can make two types of API requests. 
1. The current weather forecast for the next week (forecast request)
2. Observed or forecast weather conditions for a date in the past or future (time machine request)

**Note**: BTW, two is a relatively small number. The Twitter API has over 100.

Let's look at endpoint #2.

The documentation shows that a time machine request takes the form of:

https://api.darksky.net/forecast/[key]/[latitude],[longitude],[time]

A historical weather request returns the observed weather at a given time (for many places, up to 60 years in the past).
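Because the values go into the path instead of a query string, the url has to be assembled by hand. The sketch below fills in the template from a `datetime` object; `time_machine_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder.

```python
from datetime import datetime

def time_machine_url(key, latitude, longitude, when):
    """Fill in the [key]/[latitude],[longitude],[time] template."""
    timestamp = when.isoformat(timespec="seconds")
    return (f"https://api.darksky.net/forecast/{key}/"
            f"{latitude},{longitude},{timestamp}")

demo_url = time_machine_url("YOUR_API_KEY", 42.28, -83.74,
                            datetime(2017, 1, 31, 16))
print(demo_url)
```

Working from a `datetime` makes it easy to loop over a range of days.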

Start by specifying the API key and then the GPS coordinates and timestamp.

In [61]:
apikey = 'b80fad6bb807f251eff7b0e5cca2ca21'

In [62]:
latitude = 42.28
longitude = -83.74
time = '2017-01-31T16:00:00'

In [63]:
url = f'https://api.darksky.net/forecast/{apikey}/{latitude},{longitude},{time}'
R = requests.get(url)
R.raise_for_status()

API responses consist of a JSON-formatted object (UTF-8) according to the documentation.

In [64]:
pastforecast = R.json()

Webscraping DONE!

Hourly summary and temperature for that day.

In [65]:
for i, hr in enumerate(pastforecast["hourly"]["data"]):
    print(i, hr['summary'], hr['temperature'])

0 Overcast 23.11
1 Foggy 23.74
2 Foggy 24.37
3 Foggy 24.85
4 Foggy 25.39
5 Foggy 25.96
6 Foggy 26.29
7 Overcast 26.23
8 Overcast 26.45
9 Overcast 27.3
10 Overcast 28.36
11 Overcast 29.75
12 Overcast 30.92
13 Overcast 32.4
14 Overcast 32.64
15 Overcast 32.67
16 Overcast 32.79
17 Possible Flurries 32.22
18 Overcast 31.31
19 Overcast 31.11
20 Overcast 31.21
21 Overcast 31.31
22 Mostly Cloudy 31.33
23 Mostly Cloudy 31.41
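A list of dictionaries like `pastforecast["hourly"]["data"]` converts directly into a `pandas` dataframe. The sketch below uses three records hand-copied from the output above, keeping only the two fields that were printed (real records hold more fields).

```python
import pandas as pd

# Hand-copied from the printed output above; real records hold more fields.
sample_hourly = {"hourly": {"data": [
    {"summary": "Overcast", "temperature": 23.11},
    {"summary": "Foggy", "temperature": 23.74},
    {"summary": "Foggy", "temperature": 24.37},
]}}

# Each dictionary becomes a row; the keys become the column names.
hourly = pd.DataFrame(sample_hourly["hourly"]["data"])
warmest = hourly.loc[hourly["temperature"].idxmax()]

print(hourly)
print(warmest["summary"], warmest["temperature"])
```

From there, the usual `pandas` tools (sorting, grouping, plotting) apply.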


**Note**: API responses come in a variety of flavours. Some common ones are JSON, XML, and CSV format. I recommend choosing JSON over XML whenever possible. XML is a pain to work with IMO. If you do choose XML, you can use `BeautifulSoup` or `ElementTree` to parse the data.
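For completeness, here is roughly what parsing an XML response with `ElementTree` looks like. The XML below is hypothetical (the element and attribute names are made up for the illustration and do not come from any particular API).

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload; element names are made up for illustration.
xml_text = """
<forecast>
  <hour index="0"><summary>Overcast</summary><temperature>23.11</temperature></hour>
  <hour index="1"><summary>Foggy</summary><temperature>23.74</temperature></hour>
</forecast>
"""

root = ET.fromstring(xml_text.strip())
records = [
    (int(hour.get("index")),                 # attribute on the <hour> tag
     hour.findtext("summary"),               # text of a child element
     float(hour.findtext("temperature")))
    for hour in root.findall("hour")
]
print(records)
```

Everything comes back as text, so numbers have to be converted explicitly, which is part of why JSON tends to be more convenient.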

## API Wrappers

For the more popular APIs (like Twitter and Google), the open source community has probably written a Python wrapper for the API to abstract away some of the details for you. The Dark Sky API is pretty simple as you can see but someone has written a wrapper for it because of its popularity. The Python module is called `python-forecastio`. You can install it using `pip install python-forecastio`.

In [66]:
import forecastio

lon = -83.7
lat = 42.3
forecast = forecastio.load_forecast(apikey, lat, lon)

Accessing the data depends on the structure the developer set up which is not necessarily the same as the API. You will need to read the documentation of the module to understand the structure or figure it out by trial and error.

In [67]:
for i, hr in enumerate(forecast.hourly().data):
    print(i, hr.time, hr.summary, hr.temperature)

0 2019-10-15 19:00:00 Partly Cloudy 61.38
1 2019-10-15 20:00:00 Possible Drizzle 62.41
2 2019-10-15 21:00:00 Partly Cloudy 63.43
3 2019-10-15 22:00:00 Partly Cloudy 63.64
4 2019-10-15 23:00:00 Mostly Cloudy 60.42
5 2019-10-16 00:00:00 Mostly Cloudy 57.06
6 2019-10-16 01:00:00 Overcast 57.09
7 2019-10-16 02:00:00 Mostly Cloudy 57.98
8 2019-10-16 03:00:00 Mostly Cloudy 57.87
9 2019-10-16 04:00:00 Mostly Cloudy 57.64
10 2019-10-16 05:00:00 Possible Light Rain 57.49
11 2019-10-16 06:00:00 Possible Light Rain 56.27
12 2019-10-16 07:00:00 Possible Light Rain 53.41
13 2019-10-16 08:00:00 Overcast 50.4
14 2019-10-16 09:00:00 Mostly Cloudy 47.65
15 2019-10-16 10:00:00 Overcast 45.1
16 2019-10-16 11:00:00 Overcast 43.11
17 2019-10-16 12:00:00 Overcast 41.87
18 2019-10-16 13:00:00 Mostly Cloudy 42.67
19 2019-10-16 14:00:00 Mostly Cloudy 44.72
20 2019-10-16 15:00:00 Mostly Cloudy 46.8
21 2019-10-16 16:00:00 Mostly Cloudy 48.84
22 2019-10-16 17:00:00 Mostly Cloudy 50.28
23 2019-10-16 18:00:00 Mostl

## Exercise

Using the Google Maps Distance Matrix API, get the distance from Ann Arbor, MI to San Diego, CA and Anchorage, AK.

The documentation is located at https://developers.google.com/maps/documentation/distance-matrix/start

# Example 5: POST Requests

This example was taken from the book **Web Scraping with Python** by Ryan Mitchell.

This website http://pythonscraping.com/pages/files/form.html shows a basic web form. Let's look at the page source to see info related to the `post` request.

The page source contains the variable names of the two input fields which need to be submitted in the `<form>` tag. We create a dictionary to represent these fields.

In [68]:
data = {'firstname':'Mister',
        'lastname' :'Cao'}

The `<form>` tag has an `action` attribute, which is the url where the `post` request is sent. Here it is a path **relative** to the url of the form page, so it resolves to the address used below.

In [69]:
R = requests.post("http://pythonscraping.com/pages/files/processing.php", data=data)
R.text

'Hello there, Mister Cao!'

We'll see in the next section a way to get at the same information through the browser's developer tool.

## Example 6: Farm Equipment Crashes in Ann Arbor (data that is not visible in the page source)

This example is meant to illustrate how to grab data that is visible on the webpage but not in the page source. The website of interest is https://www.michigantrafficcrashfacts.org/querytool. We are interested in grabbing the GPS coordinates of the crashes on the map. Here is our query to generate the map https://www.michigantrafficcrashfacts.org/querytool#q1;0;2017,2016,2015,2014,2013;c8189;0,42:1. The crashes also have some information in a popup tooltip when you click on them.

The goal is to find the URL where the GET/POST request is being sent. How do we do that? The answer lies in the reference link at the bottom of this example. Basically, you need your browser's developer tool and some detective work.  
[SHORT INTERLUDE TO THE WEBPAGE]

Congrats! You can consider yourself a developer now that you can use the toolbox :)

Now that we have found the URL of interest, we go back to our regularly scheduled programming (pun intended). Everything should be easy peasy moving forward. We know the url where the `post` request is being sent. There is also some data being sent with it. We will create a dictionary for that bit of data.

In [70]:
mtcf_url = 'https://www.michigantrafficcrashfacts.org/qjson'
query = {'q': '1;0;2017,2016,2015,2014,2013;c8189;0,42:1',
        'v': 'map',
        'p': '37.341651,-98.030605,50.964815,-73.289395,6',
}
R = requests.post(mtcf_url, data=query)
R.raise_for_status()

Grab the JSON response.

In [71]:
crash_data = R.json()
crash_data

 'errors': [],
 'meta': {'count': 2,
  'bounds': {'west': -83.78405039994,
   'east': -83.762084436151,
   'south': 42.2444715503,
   'north': 42.281289282301},
  'mode': 'points'},
 'headers': {'crash.id': 'Crash ID',
  'crash_day': 'Crash Day',
  'crash_month': 'Crash Month',
  'crash_year': 'Crash Year',
  'crash_time_of_day': 'Crash Hour',
  'crash_worst_injury': 'Crash Worst Injury',
  'lng': 'Crash Lat',
  'lat': 'Crash Long',
  'seg_crnt': 'Road Segment',
  'seg_orig': 'Orig Segment',
  'vehicle_id': 'Unit Count',
  'person_id': 'Person Count'},
 '0': {'crash.id': 2015113037,
  'crash_day': 26,
  'crash_month': 'February',
  'crash_year': 2015,
  'crash_time_of_day': '8:00 AM - 9:00 AM',
  'crash_worst_injury': 'No injury (O)',
  'lng': -83.78405039994,
  'lat': 42.281289282301,
  'seg_crnt': 4818100,
  'seg_orig': 0,
  'vehicle_id': 2,
  'person_id': 2},
 '1': {'crash.id': 2015156119,
  'crash_day': 14,
  'crash_month': 'July',
  'crash_year': 2015,
  'crash_time_of_day': '5:00

Webscraping DONE!
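Since the goal was the GPS coordinates, here is a minimal sketch of pulling them out of a response shaped like the one above. The `sample` dictionary is a trimmed stand-in that reproduces only the first record from the printed output, and `extract_coordinates` is a hypothetical helper name. Note that the `headers` labels for `lng` and `lat` look swapped in the response; judging by the values, `lat` holds the latitude.

```python
# Trimmed stand-in for R.json(); only the first record shown above is reproduced
sample = {
    'errors': [],
    'meta': {'count': 2, 'mode': 'points'},
    'headers': {'crash.id': 'Crash ID', 'lng': 'Crash Lat', 'lat': 'Crash Long'},
    '0': {'crash.id': 2015113037,
          'lng': -83.78405039994,
          'lat': 42.281289282301},
}

def extract_coordinates(payload):
    """Return (crash id, latitude, longitude) tuples for the numbered records."""
    coords = []
    for key, record in payload.items():
        # Crash records are stored under string keys '0', '1', ...
        if key.isdigit():
            coords.append((record['crash.id'], record['lat'], record['lng']))
    return coords

coords = extract_coordinates(sample)
print(coords)
```

With the real response, the same helper would be called as `extract_coordinates(crash_data)`.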

## Reference

http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/

## Exercise

This website generates random VINs on demand: http://randomvin.com. This is probably one of the simplest HTML pages you will ever see (and one of the reasons I picked it).

Here is some example code to grab 3 VINs from the website. Supply the appropriate url to finish the code. What file format is the response?

In [None]:
for i in range(3):
    R = requests.get('')
    R.raise_for_status()
    print(R.text)

# Retrying Module

This module is aptly named. Even though you have written valid code, sometimes it will still crash for unforeseen reasons (e.g. a bad network connection). This is where the `retry` behaviour comes in handy. You will need to install the module first via `pip install retrying`. Documentation is at https://pypi.python.org/pypi/retrying.

In [73]:
import random
from retrying import retry

## Simple Retry

Here is a simple case of how to use `retry`. The first step is to put your code of interest in a function. This nonsensical function has a 90% chance of failing when run because of the `assert` statement.

In [74]:
def is_B_equal_to_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7

is_B_equal_to_lucky7()
print('this has a 10% chance of printing')

2


AssertionError: 

Now let's add the `retry` behavior by placing the `@retry()` decorator directly above the function definition. That's it! Pretty simple.

In [75]:
@retry()
def is_B_equal_to_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7
    
is_B_equal_to_lucky7()
print('this will ALWAYS print')

1
7
this will ALWAYS print


You can also pass the `wait_fixed` argument, in milliseconds, to specify how long to wait between retries. This is good practice so that you don't bombard the server with constant requests while a connection is failing during web scraping. Gmail does an exponential version of this when it loses the network connection.
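To make the difference between fixed and exponential waiting concrete, here is a rough sketch that only computes the wait schedule. The function `backoff_delays` and its parameters `base_ms` and `cap_ms` are hypothetical names for illustration and are not part of the `retrying` module.

```python
def backoff_delays(attempts, base_ms=500, cap_ms=8000):
    """Return the wait in ms before each retry: base, 2*base, 4*base, ... capped at cap_ms."""
    return [min(base_ms * 2 ** n, cap_ms) for n in range(attempts)]

# Fixed waiting: the same pause before every retry
fixed = [500] * 6

# Exponential waiting: the pause doubles until it hits the cap
exponential = backoff_delays(6)

print(fixed)        # [500, 500, 500, 500, 500, 500]
print(exponential)  # [500, 1000, 2000, 4000, 8000, 8000]
```

The doubling schedule backs off quickly when a server stays unreachable, while the cap keeps the wait from growing without bound.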

## Retry on specific or general exceptions

We saw the function fail earlier because of an `AssertionError`. We can tell `retry` to retry only when certain exceptions occur. This requires the `retry_on_exception` argument, which takes a function. That function receives the raised exception and returns `True` to retry or `False` to let the exception propagate.

In [76]:
def checkForSpecificError(exception):
    return isinstance(exception, AssertionError)

@retry(retry_on_exception=checkForSpecificError, wait_fixed=500)
def is_it_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7

is_it_lucky7()
print("Lucky 7")

3
7
Lucky 7


If we had changed the last line of the function to `assert C == 7`, the retry behaviour would not kick in, because the function raises a `NameError` and `checkForSpecificError` returns `False` for it.

## Retry on return value

You don't necessarily need to have a program error to invoke the `retry` behaviour. You can use the `return` value to decide. This requires the `retry_on_result` argument, which takes a function. That function receives the return value and returns `True` to retry or `False` to accept the result.

In [78]:
def checkReturnValue(value):
    return value is True

# This function will never crash but we can still use retry
@retry(retry_on_result=checkReturnValue, wait_fixed=300)
def main():
    B = random.randint(1, 10)
    print(B)
    try:
        assert B == 7
        return None
    except AssertionError:
        return True

main()
print('Did I find a 7?')

10
3
10
6
2
2
7
Did I find a 7?


Other arguments of interest to `retry` which you can use are:  
- `stop_max_attempt_number`
- `stop_max_delay`
- `wait_random_min`
- `wait_random_max`
- `wait_exponential_multiplier`
- `wait_exponential_max`

**Note**: You could implement the retry behavior without this module. You can use a `while` loop in some combination with `try` and `except` too. I don't recommend it.
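For comparison, the hand-rolled version mentioned in the note might look roughly like the sketch below. The names `retry_call` and `flaky` are made up for illustration; the point is mostly to show how much bookkeeping the decorator saves.

```python
import time

def retry_call(func, max_attempts=5, wait_seconds=0.0):
    """Call func until it succeeds or max_attempts is reached, then re-raise the last error."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return func()
        except Exception:
            # Give up once the attempt budget is used
            if attempt >= max_attempts:
                raise
            time.sleep(wait_seconds)

# Simulated flaky function: fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated network failure')
    return 'ok'

result = retry_call(flaky)
print(result, calls['n'])  # ok 3
```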

## Exercise

Apply the `retry` module to this flaky code for the Dark Sky API to try 5 times waiting one second in between attempts.

In [None]:
def get_weather_from_Eastern_hemisphere(apikey):
    lon = random.randint(0,400)
    lat = random.randint(0,200)
    print(f'{lon} deg E, {lat} deg N')
    R = requests.get(f'https://api.darksky.net/forecast/{apikey}/{lat},{lon}')
    R.raise_for_status()
    return (lon,lat,R.json())
    
apikey = 'b80fad6bb807f251eff7b0e5cca2ca21'
lon, lat, forecast = get_weather_from_Eastern_hemisphere(apikey)
print(forecast["daily"]["data"][0]['summary'])

# Quiz

https://goo.gl/forms/Ym3hbKu45nzQ2puv2

# cURL to Python Requests

cURL is a command line tool for getting or sending files using URL syntax. You can always get the cURL command from the developer tool. This usually also includes *cookies and headers*. While Python can't use it directly, you can convert it to Python `requests` syntax. Googling *curl to python requests* will bring you to this page https://curl.trillworks.com/. You can paste the cURL command and it will return the equivalent Python code for you. Python does have modules that are supposed to do this for you, but I haven't gotten them to work yet. Some of these modules are `uncurl`, `runcurl` and `curl_to_requests`, but they support Python 2 only.

For example, here is what is returned for the cURL command from http://randomvin.com.

In [None]:
import requests

cookies = {
    '_ga': 'GA1.2.690271399.1483585728',
    '_gat': '1',
}

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept': '*/*',
    'Referer': 'http://randomvin.com/',
    'Connection': 'keep-alive',
}

requests.get('http://randomvin.com/getvin.php?type=real', headers=headers, cookies=cookies)

# Robots.txt

Also known as the robots exclusion standard, *robots.txt* is a standard used by websites to communicate with web crawlers, scrapers, etc. The robots.txt file tells them which parts of the website are allowed or disallowed to be scraped. You will find the robots.txt file in the root directory of the website. For the English version of Wikipedia, it is located at https://en.wikipedia.org/robots.txt.
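Python's standard library includes `urllib.robotparser` for checking a URL against these rules. The sketch below parses a small, made-up set of rules so that it runs offline; the rules and the `example.com` URLs are assumptions for illustration, not the contents of any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file would be fetched from the site's root
rules = [
    'User-agent: *',
    'Disallow: /private/',
    'Allow: /',
]

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a generic crawler ('*') may fetch each URL
blocked = parser.can_fetch('*', 'https://example.com/private/page.html')
allowed = parser.can_fetch('*', 'https://example.com/public/page.html')
print(blocked, allowed)  # False True
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches and parses the real file instead of the hard-coded rules.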

The complement to *robots.txt* is the *sitemap*, which is an XML file that lists the URLs for a site.