# Web Scraping with Python

Copying and pasting is great. You should definitely use it if it's the simplest way. But if you don't want to copy and paste 10 webpages into Excel or if you have some time to kill, then web scraping is the answer or time sink you've been looking for.

We start by importing the Python modules that we need for web scraping. We use `requests` to fetch HTML pages and `BeautifulSoup` to parse them. `pandas` is used for data manipulation. The `pd.options.display` line controls formatting when printing results in `pandas`.

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

pd.options.display.max_rows = 10

# Example 1: Wikipedia Page (HTML Table)

We start by opening the <a href='https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'>website of interest</a> in a browser. We can see that it looks nicely formatted like a table. We start with passing the website of interest to the `requests.get` method.

In [26]:
url = r'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
R = requests.get(url)
R.raise_for_status()

**Note**: Without the `R.raise_for_status()` line, a response with an error status code (for example 404) will fail silently, which is probably not what you want.

We then use `BeautifulSoup` to parse the contents of the response.

In [27]:
soup = BeautifulSoup(R.content, 'html.parser')

Let's check the web page's HTML source code to see if we can see the data within the webpage (*right click -> view page source*). A quick way to check is to search for some data using find (*Ctrl+F*). We can search for the population of the World. We find it in the page source, so we know `BeautifulSoup` can parse it. We'll see later on what to do if it's not in the page source.

I like to use the browser's developer tools to do the detective work of figuring out where the data resides (*right click -> Inspect*). In this example, the data of interest resides in an HTML table (which makes life easier). You can see a hierarchy in the HTML tags. The hierarchy is called a DOM (Document Object Model) tree.

We see that the data of interest resides in a table tag `<table>`. To grab everything between the table tags, we use the `find_all` method (one of many options, but probably the one you will use most often).

In [28]:
tables = soup.find_all('table')

`find_all` returns a list of matches. We can use the `len` function to see how many matches came back.

In [29]:
len(tables)

4

We can check the contents of each table using the `text` attribute. It will be a formatting mess but that's okay.

In [30]:
tables[0].text

'\n\n\n\n\n\nIt has been suggested that this article be merged into List of countries and dependencies by population. (Discuss) Proposed since January 2018.\n\n\n'

In [31]:
tables[1].text

'\n\n\n\n\n\n\nStatistical regions as defined by the United Nations Statistics Division.[1] Antarctica is omitted.\n\n\n\n\n'

In [32]:
tables[2].text

'\nCountries and areas ranked by population in 2017\n\nRank\nCountry or area\nUN continental\nregion[2]\nUN statistical\nregion[2]\nPopulation\n(1 July 2016)[3]\nPopulation\n(1 July 2017)[3]\nChange\n\n\n—\nWorld\n—\n—\n7,466,964,280\n7,550,262,101\n7000111555135228260♠+1.1%\n\n\n1\n\xa0China[a]\nAsia\nEastern Asia\n1,403,500,365\n1,409,517,397\n6999428716097982630♠+0.4%\n\n\n2\n\xa0India\nAsia\nSouthern Asia\n1,324,171,354\n1,339,180,127\n7000113344643460699♠+1.1%\n\n\n3\n\xa0United States\nAmericas\nNorthern America\n322,179,605\n324,459,463\n6999707635730076710♠+0.7%\n\n\n4\n\xa0Indonesia\nAsia\nSouth-eastern Asia\n261,115,456\n263,991,379\n7000110139899186970♠+1.1%\n\n\n5\n\xa0Brazil\nAmericas\nSouth America\n207,652,865\n209,288,278\n6999787570641031120♠+0.8%\n\n\n6\n\xa0Pakistan\nAsia\nSouthern Asia\n193,203,476\n197,015,955\n7000197329731272540♠+2.0%\n\n\n7\n\xa0Nigeria\nAfrica\nWestern Africa\n185,989,640\n190,886,311\n7000263276545940949♠+2.6%\n\n\n8\n\xa0Bangladesh\nAsia\nSou

Looks like this is the table we want. 

In [33]:
table = tables[2]

A better way to find the table of interest is to pass in extra search terms to the `find_all` method after using the developer tool to find searchable attributes.

In [34]:
tables = soup.find_all('table', style='text-align: right;')
len(tables)

1

We get one result instead of four.

In [35]:
table = tables[0]

The data resides in HTML table data cell tags `<td>` within table row tags `<tr>`. We use `find_all` to look for all the table row tags within the table tag `<table>`.

In [36]:
table_rows = table.find_all('tr')
len(table_rows)

235

We now see that there are a lot of matches for that tag, as expected. It's close to the total number of countries. Let's look at the first few entries of `table_rows`.

In [37]:
table_rows[0]

<tr>
<th data-sort-type="number" scope="col">Rank</th>
<th scope="col">Country or area</th>
<th scope="col"><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN continental<br/>
region</a><sup class="reference" id="cite_ref-region_2-0"><a href="#cite_note-region-2">[2]</a></sup></th>
<th scope="col"><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN statistical<br/>
region</a><sup class="reference" id="cite_ref-region_2-1"><a href="#cite_note-region-2">[2]</a></sup></th>
<th scope="col">Population<br/>
(1 July 2016)<sup class="reference" id="cite_ref-UN_3-0"><a href="#cite_note-UN-3">[3]</a></sup></th>
<th scope="col">Population<br/>
(1 July 2017)<sup class="reference" id="cite_ref-UN_3-1"><a href="#cite_note-UN-3">[3]</a></sup></th>
<th scope="col">Change</th>
</tr>

Looks like the header row. The `<th>` tag also gives it away, fyi.

In [38]:
table_rows[1]

<tr>
<td>—</td>
<td align="left"><i><a href="/wiki/World_population" title="World population">World</a></i></td>
<td align="left">—</td>
<td align="left">—</td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;">7,466,964,280</td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;">7,550,262,101</td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;"><span class="sortkey" style="display:none">7000111555135228260♠</span><span style="color:green">+1.1%</span></td>
</tr>

It's the row data for the _World_.

We will use a nested `for` loop to go through the list of table rows. The inner `for` loop will go through each `<td>` tag, appending its text to a list. For now, we will grab all the data in the tags regardless of whether we want to keep it.

We have two lists in the `for` loop. `row` will contain the text of each cell in a table row. Once the row is iterated through, we will convert it to a `pandas` dataframe. `list_df` will contain a list of those dataframes.

In [39]:
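list_df = []
for table_row in table_rows:
    # the header row only has <th> cells, so it yields an empty list here
    table_cells = table_row.find_all('td')
    row = []
    for cell in table_cells:
        row.append(cell.text)
    # one single-row dataframe per table row
    list_df.append(pd.DataFrame(row).transpose())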

We take the list of dataframes and concatenate them together into one dataframe.

In [40]:
data = pd.concat(list_df, ignore_index=True) 

Let's look at the first 5 rows of the dataframe.

In [41]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6
0,—,World,—,—,7466964280,7550262101,7000111555135228260♠+1.1%
1,1,China[a],Asia,Eastern Asia,1403500365,1409517397,6999428716097982630♠+0.4%
2,2,India,Asia,Southern Asia,1324171354,1339180127,7000113344643460699♠+1.1%
3,3,United States,Americas,Northern America,322179605,324459463,6999707635730076710♠+0.7%
4,4,Indonesia,Asia,South-eastern Asia,261115456,263991379,7000110139899186970♠+1.1%


We didn't get the header row because it was contained in `<th>` tags and not `<td>` tags. We use the same type of `for` loop to extract the text from the table headers.

In [42]:
headers = table_rows[0].find_all('th')
columns = []
for header in headers:
    columns.append(header.text)

Set the column headers of the dataframe.

In [43]:
data.columns = columns
data.head()

Unnamed: 0,Rank,Country or area,UN continental region[2],UN statistical region[2],Population (1 July 2016)[3],Population (1 July 2017)[3],Change
0,—,World,—,—,7466964280,7550262101,7000111555135228260♠+1.1%
1,1,China[a],Asia,Eastern Asia,1403500365,1409517397,6999428716097982630♠+0.4%
2,2,India,Asia,Southern Asia,1324171354,1339180127,7000113344643460699♠+1.1%
3,3,United States,Americas,Northern America,322179605,324459463,6999707635730076710♠+0.7%
4,4,Indonesia,Asia,South-eastern Asia,261115456,263991379,7000110139899186970♠+1.1%
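The two loops above (one for `<td>` cells, one for `<th>` headers) can also be written as list comprehensions. Below is a minimal, self-contained sketch on a tiny hand-written table rather than the live Wikipedia page; the HTML string and the names `sample_html`, `sample_columns` and `sample_records` are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (hand-written, not scraped).
sample_html = """
<table style="text-align: right;">
  <tr><th>Rank</th><th>Country or area</th><th>Change</th></tr>
  <tr><td>1</td><td>China</td><td>+0.4%</td></tr>
  <tr><td>2</td><td>India</td><td>+1.1%</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find_all('table', style='text-align: right;')[0]
sample_rows = sample_table.find_all('tr')

# Header cells live in <th> tags in the first row.
sample_columns = [th.text for th in sample_rows[0].find_all('th')]

# Data cells live in <td> tags in the remaining rows.
sample_records = [[td.text for td in tr.find_all('td')] for tr in sample_rows[1:]]

print(sample_columns)
print(sample_records)
```

A list of lists like `sample_records` can be passed straight to `pd.DataFrame(sample_records, columns=sample_columns)`, which avoids building one dataframe per row.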


We now have all the data of interest. __Webscraping DONE!__ Now it's a matter of data cleanup and management. This is where `pandas` becomes useful. We want to clean up the last three columns.

We split the last column by the delimiter and grab the latter part of the text.

In [44]:
data['Change'] = data['Change'].apply(lambda x: x.split('♠')[1])
data.head(5)

Unnamed: 0,Rank,Country or area,UN continental region[2],UN statistical region[2],Population (1 July 2016)[3],Population (1 July 2017)[3],Change
0,—,World,—,—,7466964280,7550262101,+1.1%
1,1,China[a],Asia,Eastern Asia,1403500365,1409517397,+0.4%
2,2,India,Asia,Southern Asia,1324171354,1339180127,+1.1%
3,3,United States,Americas,Northern America,322179605,324459463,+0.7%
4,4,Indonesia,Asia,South-eastern Asia,261115456,263991379,+1.1%
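The `[1]` index assumes every cell contains the `♠` delimiter; a cell without it would raise an `IndexError`. A slightly more defensive sketch takes the last piece instead (the helper name `strip_sortkey` is made up for illustration):

```python
def strip_sortkey(cell):
    """Return the visible part of a cell, dropping any hidden sort key before the delimiter."""
    # str.split returns the whole string as a single item when the
    # delimiter is absent, so [-1] works with or without a sort key.
    return cell.split('♠')[-1].strip()

print(strip_sortkey('7000111555135228260♠+1.1%'))
print(strip_sortkey('+0.4%'))
```

It could be applied the same way as the lambda, for example `data['Change'].apply(strip_sortkey)`.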


We use a regular expression to get rid of the references and newline feeds in the headers.

In [45]:
import re
data.columns = [re.sub(r'\W\d\W$', '', x).replace('\n',' ') for x in data.columns]
data.head(5)

Unnamed: 0,Rank,Country or area,UN continental region,UN statistical region,Population (1 July 2016),Population (1 July 2017),Change
0,—,World,—,—,7466964280,7550262101,+1.1%
1,1,China[a],Asia,Eastern Asia,1403500365,1409517397,+0.4%
2,2,India,Asia,Southern Asia,1324171354,1339180127,+1.1%
3,3,United States,Americas,Northern America,322179605,324459463,+0.7%
4,4,Indonesia,Asia,South-eastern Asia,261115456,263991379,+1.1%
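The pattern `\W\d\W$` only matches a single-digit reference such as `[2]` at the very end of a header. If a page used two-digit or lettered footnotes, a bracket-based pattern would be more forgiving. A small sketch on sample header strings (the helper `clean_header` is hypothetical, not part of the notebook):

```python
import re

def clean_header(text):
    """Drop trailing footnote markers like [2] or [12] and flatten newlines."""
    no_refs = re.sub(r'(\[\w+\])+$', '', text)
    return no_refs.replace('\n', ' ').strip()

print(clean_header('UN continental\nregion[2]'))
print(clean_header('Population\n(1 July 2017)[3]'))
print(clean_header('Change'))
```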


Get rid of the thousand separators.

In [46]:
data.iloc[:,-3] = data.iloc[:,-3].apply(lambda x: x.replace(',',''))
data.iloc[:,-2] = data.iloc[:,-2].apply(lambda x: x.replace(',',''))
data.head(5)

Unnamed: 0,Rank,Country or area,UN continental region,UN statistical region,Population (1 July 2016),Population (1 July 2017),Change
0,—,World,—,—,7466964280,7550262101,+1.1%
1,1,China[a],Asia,Eastern Asia,1403500365,1409517397,+0.4%
2,2,India,Asia,Southern Asia,1324171354,1339180127,+1.1%
3,3,United States,Americas,Northern America,322179605,324459463,+0.7%
4,4,Indonesia,Asia,South-eastern Asia,261115456,263991379,+1.1%
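After this step the population columns are still strings. A minimal sketch of converting such values to integers in plain Python is shown below; the helper name `to_int` and the `'N/A'` placeholder are assumptions for illustration.

```python
def to_int(text):
    """Convert a string like '7,466,964,280' to an int; return None if it is not numeric."""
    digits = text.replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

print(to_int('7,466,964,280'))
print(to_int('N/A'))
```

The same function could be passed to `apply` on a dataframe column in place of the lambda above.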


Good enough! Let's save the data to a txt file.

In [47]:
data.to_csv('wikipedia_countries_population.txt', sep='|', index=False, header=True)

If you look at the text file, there is some more data cleanup we could have done, but I'm not here to teach you how to do that with `pandas` (although we do have workshops to do just that).

## Pandas Approach to HTML Tables

The pandas `read_html` function can read in HTML tables (and only HTML tables). It relies on an HTML parser under the hood: `lxml` by default, falling back to `BeautifulSoup` with `html5lib`.

In [48]:
countries = pd.read_html(url)

**Note**: If you get an `ImportError: html5lib not found, please install it` message, you will need to install it via `conda install html5lib` and you might need to restart the kernel or Jupyter Notebook.

This returns a list of dataframes. One dataframe for each table in the html page. Recall how many tables there were. 

In [49]:
len(countries)

4

Let's check their contents.

In [50]:
countries[2].head()

Unnamed: 0,0,1,2,3,4,5,6
0,Rank,Country or area,UN continental region[2],UN statistical region[2],Population (1 July 2016)[3],Population (1 July 2017)[3],Change
1,—,World,—,—,7466964280,7550262101,7000111555135228260♠+1.1%
2,1,China[a],Asia,Eastern Asia,1403500365,1409517397,6999428716097982630♠+0.4%
3,2,India,Asia,Southern Asia,1324171354,1339180127,7000113344643460699♠+1.1%
4,3,United States,Americas,Northern America,322179605,324459463,6999707635730076710♠+0.7%


The third table contains our data of interest (as expected). 

Since `read_html` hands the table search off to the underlying parser, we can specify attributes to create a better filter.

In [51]:
countries = pd.read_html(url, attrs = {'style': 'text-align: right;'})

We now only get one table as before.

In [52]:
countries[0].head()

Unnamed: 0,0,1,2,3,4,5,6
0,Rank,Country or area,UN continental region[2],UN statistical region[2],Population (1 July 2016)[3],Population (1 July 2017)[3],Change
1,—,World,—,—,7466964280,7550262101,7000111555135228260♠+1.1%
2,1,China[a],Asia,Eastern Asia,1403500365,1409517397,6999428716097982630♠+0.4%
3,2,India,Asia,Southern Asia,1324171354,1339180127,7000113344643460699♠+1.1%
4,3,United States,Americas,Northern America,322179605,324459463,6999707635730076710♠+0.7%


We are left at a similar point to the one we reached earlier in the example. `pandas` has basically done the web scraping for us and left us with the data cleanup and wrangling. You should always expect to do some data manipulation if you use the `pd.read_html()` function, or any web scraping for that matter.

Q: So why did we learn a more complicated way of doing things when `pd.read_html()` can do it for you?  
A: Because not everything resides in an HTML table and you'll need to use the same techniques to get at the data. If the data resides in a HTML table, consider yourself fortunate, use `pandas` and move on.

## Exercise

Scrape Serena Williams' Wikipedia page https://en.wikipedia.org/wiki/Serena_Williams for the data in the html table **Grand Slam tournament finals**  for *Singles* in the *Career Statistics* section.

# Page Source vs. Chrome's Developer Tools Elements Tab

I want to point out a subtle difference between two things we did. Although the page source and the Elements tab look similar in this example, they are in fact not the same. A quick way to see the difference is to sort the Wikipedia table so it's no longer ordered by population. The table order in the Elements tab matches what you see on the website. However, if you look at the page source after the sort, you will notice that it has remained the same (the sort has no effect on it).

**Note:** When you make a GET request, you are asking for the page source. This is what you are parsing with `BeautifulSoup`.

# Example 2: USA TODAY Best Selling Books (not in HTML Table)

In this example, the website of interest is https://www.usatoday.com/life/books/best-selling/. Suppose we are interested in getting some basic information about the book list. For this example, the data of interest does not exist in a html table.

We start with the boilerplate template of passing the website of interest to the `requests.get` method. We then use `BeautifulSoup` to parse the contents of the response.

In [53]:
url = r'https://www.usatoday.com/life/books/best-selling/'
R = requests.get(url)
R.raise_for_status()
soup = BeautifulSoup(R.content, 'html.parser')

Navigate to the website and use the browser's developer tool to inspect items of interest.

The data seems to reside in a `div` tag. Let's search for that tag and its class, and see how many matches we get.

In [54]:
booklist = soup.find_all('div', class_="front-booklist-info-container")
len(booklist)

10

Let's look at the text associated within each `<div>` book tag.

In [55]:
for i, book in enumerate(booklist):
    print(i, book.text)

0 1The Great AlonebyKristin HannahIn 1970s Alaska, a teenage girl — the daughter of a disturbed Vietnam War POW — has to grow up in a hurryGenre:General fictionDebuted:February 15 2018
                            Buy Now
                        
                                    Buy Now
                                AmazonBarnes & NobleGoogle PlayiBooksIndieboundKoboZola1Weeks Listed#1Best Week
1 212 Rules for Life: An Antidote to ChaosbyJordan B. PetersonPsychologist proposes 12 practical rules to live by based on science, faith and human natureGenre:Psychology/Self-helpDebuted:February 01 2018
                            Buy Now
                        
                                    Buy Now
                                AmazonBarnes & NobleGoogle PlayiBooksIndieboundKoboZola#4Last Week3Weeks Listed#2Best Week
2 3A Wrinkle in TimebyMadeleine L’EngleYouth: Otherworldly visitor informs family of a “tesseract,” a wrinkle in time; classicGenre:YouthDebuted:May 20 2004
        

Suppose we want the title, author and genre. Let's use the browser to inspect which tags the data lies in. Then we'll use the `find_all` method to search for it.

In [56]:
for rank, book in enumerate(booklist, start=1):
    title = book.find_all('h3', class_='books-front-meta-title')[0].text
    author = book.find_all('span', class_='books-front-meta-authorInfo')[0].text
    genre = book.find_all('div', class_='books-front-meta-genre')[0].text
    print('{}. {}, {}, {}'.format(rank, title, author, genre))

1. The Great Alone, Kristin Hannah, Genre:General fiction
2. 12 Rules for Life: An Antidote to Chaos, Jordan B. Peterson, Genre:Psychology/Self-help
3. A Wrinkle in Time, Madeleine L’Engle, Genre:Youth
4. Look for Me, Lisa Gardner, Genre:General fiction
5. An American Marriage, Tayari Jones, Genre:General fiction
6. Fire and Fury, Michael Wolff, Genre:Current affairs
7. The Woman in the Window, A.J. Finn, Genre:General fiction
8. Llama Llama I Love You Board Book, Anna Dewdney, Genre:Children
9. Still Me, Jojo Moyes, Genre:General fiction
10. Dark in Death, J.D. Robb, Genre:Mystery


Webscraping DONE! At least for page one. To scrape the rest of the pages, let's navigate to the other pages. Notice anything different in the url?

Before: https://www.usatoday.com/life/books/best-selling/  
After: https://www.usatoday.com/life/books/best-selling/week/2018/6/page/2/

The **before url** is the webpage for the current bestseller list, page 1. The **after url** is for a specific week and page. We can use this format to scrape any page for any given week. Much more useful than the generic url.
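The URL pattern can be wrapped in a small helper. This is a sketch that assumes the two numbers after `week/` are the year and the week number, as the example URL suggests; the function name `bestseller_url` is made up for illustration.

```python
BASE_URL = 'https://www.usatoday.com/life/books/best-selling'

def bestseller_url(year, week, page):
    """Build the URL for one page of the list (assumed layout: /week/<year>/<week>/page/<page>/)."""
    return '{}/week/{}/{}/page/{}/'.format(BASE_URL, year, week, page)

# URLs for the first three pages of week 6 of 2018
page_urls = [bestseller_url(2018, 6, page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```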

In [58]:
rank = 0
data = []
for page in range(1,4):    
    url = r'http://www.usatoday.com/life/books/best-selling/week/2018/7/page/{}/'.format(page)    
    R = requests.get(url)
    R.raise_for_status()
    soup = BeautifulSoup(R.content, 'html.parser')
    # find the element of interest matching our criteria
    booklist = soup.find_all('div', class_='front-booklist-info-container')
    for book in booklist:
        title = book.find_all('h3', class_='books-front-meta-title')[0].text
        author = book.find_all('span', class_='books-front-meta-authorInfo')[0].text
        genre = book.find_all('div', class_='books-front-meta-genre')[0].text
        rank += 1
        print('{}. {}, {}, {}'.format(rank, title, author, genre))
        data.append((rank,title,author,genre))

1. The Great Alone, Kristin Hannah, Genre:General fiction
2. 12 Rules for Life: An Antidote to Chaos, Jordan B. Peterson, Genre:Psychology/Self-help
3. A Wrinkle in Time, Madeleine L’Engle, Genre:Youth
4. Look for Me, Lisa Gardner, Genre:General fiction
5. An American Marriage, Tayari Jones, Genre:General fiction
6. Fire and Fury, Michael Wolff, Genre:Current affairs
7. The Woman in the Window, A.J. Finn, Genre:General fiction
8. Llama Llama I Love You Board Book, Anna Dewdney, Genre:Children
9. Still Me, Jojo Moyes, Genre:General fiction
10. Dark in Death, J.D. Robb, Genre:Mystery
11. Dog Man and Cat Kid, Dav Pilkey, Genre:Youth
12. Darker, E.L. James, Genre:Romance
13. Little Fires Everywhere, Celeste Ng, Genre:General fiction
14. Wonder, R.J. Palacio, Genre:Youth
15. Before We Were Yours, Lisa Wingate, Genre:General fiction
16. The Masterpiece, Francine Rivers, Genre:General fiction
17. The Alchemist, Paulo Coelho, Genre:General fiction
18. The Wife Between Us, Greer Hendricks, Sara

Let's convert the data into a `pandas` dataframe with column headers.

In [59]:
topsellers = pd.DataFrame(data, columns=['rank','title','author','genre'])
topsellers

Unnamed: 0,rank,title,author,genre
0,1,The Great Alone,Kristin Hannah,Genre:General fiction
1,2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,Genre:Psychology/Self-help
2,3,A Wrinkle in Time,Madeleine L’Engle,Genre:Youth
3,4,Look for Me,Lisa Gardner,Genre:General fiction
4,5,An American Marriage,Tayari Jones,Genre:General fiction
...,...,...,...,...
25,26,Pete the Cat: Valentine's Day Is Cool,"James Dean, Kimberly Dean",Genre:Children
26,27,Live Fearless,Sadie Robertson,Genre:Psychology/Self-help
27,28,Milk and Honey,Rupi Kaur,Genre:----
28,29,The Deceivers,Alex Berenson,Genre:Mystery


## Exercise

See if you can scrape the entire list and add columns such as when each book debuted and how long it's been on the charts.

# Example 3: Geocoding (using an API) 

This example shows how to use an API to gather data of interest. Sometimes, you will need to get an API key or access token to access the service. The website should have a developers or API section that tells you how to query the API with the appropriate parameters. This section will also detail the terms of use and any usage limits. APIs usually follow a freemium business model.

We'll show you how to Geocode addresses next using the Google Maps Geocoding API. Let's look at the documentation quickly.    
https://developers.google.com/maps/documentation/geocoding/intro

The documentation tells us the format looks something like this

`https://maps.googleapis.com/maps/api/geocode/outputFormat?parameters`

In summary, the documentation tells us about:
1. Needing an API key
2. Format of the request needed
3. Choosing an output format
4. Required parameters
5. Optional parameters

## Query String Parameters

A url that contains a query string will contain three parts:
1. Resource (base) URL
2. Question Mark (?)
3. Parameters (key=value pairs) separated by an ampersand (&)
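These three parts can be pulled apart programmatically with the standard library. A small sketch, using a geocoding-style URL with a placeholder key (`YOUR_API_KEY` is not a real key, and the variable names are only for illustration):

```python
from urllib.parse import urlsplit, parse_qs

example_url = ('https://maps.googleapis.com/maps/api/geocode/json'
               '?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA'
               '&key=YOUR_API_KEY')

parts = urlsplit(example_url)

# 1. resource (base) URL: everything before the question mark
resource = '{}://{}{}'.format(parts.scheme, parts.netloc, parts.path)

# 3. parameters: key=value pairs separated by '&', decoded into a dict of lists
query = parse_qs(parts.query)

print(resource)
print(query)
```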

Let's look at the Google geocoding example with our own API key.

In [60]:
import os
apikey = os.getenv('GOOGLE_MAP_API_KEY')
url = r'https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key={}'.format(apikey)
url

'https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=<YOUR API KEY>'

Let's break down the different pieces for this example.
1. The resource url is `https://maps.googleapis.com/maps/api/geocode/json`
2. The output format chosen was json
3. `?`
4. First required parameter is `address=1600+Amphitheatre+Parkway,+Mountain+View,+CA`
5. `&`
6. Second required parameter is `key=<YOUR API KEY>`
7. No optional key/value pair

Note that some characters, like the spaces in the above example, cannot be part of a URL. Spaces are encoded as `+` or `%20`.
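To see what this encoding looks like, here is a short sketch using the standard library's `urllib.parse` module (the key value is a placeholder, not a real key):

```python
from urllib.parse import quote, quote_plus, urlencode

address = '1600 Amphitheatre Parkway, Mountain View, CA'

plus_encoded = quote_plus(address)   # spaces become '+', commas become '%2C'
pct_encoded = quote(address)         # spaces become '%20', commas become '%2C'

# urlencode builds a whole query string from key/value pairs
query_string = urlencode({'address': address, 'key': 'YOUR_API_KEY'})

print(plus_encoded)
print(pct_encoded)
print(query_string)
```

Note that these functions also encode the commas, which the hand-written URL above left as-is.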

We now can make a `GET` request. Same as before.

In [61]:
R = requests.get(url)
response = R.json()
response

{'results': [{'address_components': [{'long_name': 'Google Building 41',
     'short_name': 'Google Building 41',
     'types': ['premise']},
    {'long_name': '1600', 'short_name': '1600', 'types': ['street_number']},
    {'long_name': 'Amphitheatre Parkway',
     'short_name': 'Amphitheatre Pkwy',
     'types': ['route']},
    {'long_name': 'Mountain View',
     'short_name': 'Mountain View',
     'types': ['locality', 'political']},
    {'long_name': 'Santa Clara County',
     'short_name': 'Santa Clara County',
     'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'California',
     'short_name': 'CA',
     'types': ['administrative_area_level_1', 'political']},
    {'long_name': 'United States',
     'short_name': 'US',
     'types': ['country', 'political']},
    {'long_name': '94043', 'short_name': '94043', 'types': ['postal_code']}],
   'formatted_address': 'Google Building 41, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA',
   'geometry': {'boun

Webscraping done! API conquered!

Manually encoding strings for URLs can be a pain. Thankfully, the `requests` library takes care of all this for us if we pass a dictionary to the `params` keyword argument.

First we need to construct the dictionary for the parameters and specify the resource URL.

In [62]:
params = {'address': '915 E Washington, Ann Arbor',
          'key':apikey
         }
baseurl = 'https://maps.googleapis.com/maps/api/geocode/json'

Make a `GET` request with the params keyword argument

In [63]:
R = requests.get(baseurl, params=params)
R.raise_for_status()
response = R.json()
response

{'results': [{'address_components': [{'long_name': '915',
     'short_name': '915',
     'types': ['street_number']},
    {'long_name': 'East Washington Street',
     'short_name': 'E Washington St',
     'types': ['route']},
    {'long_name': 'Burns Park',
     'short_name': 'Burns Park',
     'types': ['neighborhood', 'political']},
    {'long_name': 'Ann Arbor',
     'short_name': 'Ann Arbor',
     'types': ['locality', 'political']},
    {'long_name': 'Washtenaw County',
     'short_name': 'Washtenaw County',
     'types': ['administrative_area_level_2', 'political']},
    {'long_name': 'Michigan',
     'short_name': 'MI',
     'types': ['administrative_area_level_1', 'political']},
    {'long_name': 'United States',
     'short_name': 'US',
     'types': ['country', 'political']},
    {'long_name': '48109', 'short_name': '48109', 'types': ['postal_code']},
    {'long_name': '1070',
     'short_name': '1070',
     'types': ['postal_code_suffix']}],
   'formatted_address': '915 E Wa

Same result. And we can also spy the actual url that was sent.

In [64]:
R.url

'https://maps.googleapis.com/maps/api/geocode/json?address=915+E+Washington%2C+Ann+Arbor&key=<YOUR API KEY>'

That's it! Pretty simple, eh? A lot of APIs work just like this, although many are a bit more complicated.

**Note:** CSCAR also has workshops on using social media APIs

# Traversing a JSON object

Now that we have a JSON response from the API, we need to know how to parse it for the information we are looking for. Once decoded with `R.json()`, a JSON object behaves like a Python dictionary in that it consists of key-value pairs. JSON responses are built from nested dictionaries and lists.

To look at the variables in a JSON object hierarchically, use the `keys` method.

In [65]:
response.keys()

dict_keys(['status', 'results'])

Access the key like you would a Python dictionary

In [66]:
response['results']

[{'address_components': [{'long_name': '915',
    'short_name': '915',
    'types': ['street_number']},
   {'long_name': 'East Washington Street',
    'short_name': 'E Washington St',
    'types': ['route']},
   {'long_name': 'Burns Park',
    'short_name': 'Burns Park',
    'types': ['neighborhood', 'political']},
   {'long_name': 'Ann Arbor',
    'short_name': 'Ann Arbor',
    'types': ['locality', 'political']},
   {'long_name': 'Washtenaw County',
    'short_name': 'Washtenaw County',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'Michigan',
    'short_name': 'MI',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'United States',
    'short_name': 'US',
    'types': ['country', 'political']},
   {'long_name': '48109', 'short_name': '48109', 'types': ['postal_code']},
   {'long_name': '1070',
    'short_name': '1070',
    'types': ['postal_code_suffix']}],
  'formatted_address': '915 E Washington St, Ann Arbor, MI 48109, USA

Occasionally, you will encounter a `list` of key-value pairs in the hierarchy.

You would access the `list` the same way as a Python list.

In [67]:
response['results'][0]

{'address_components': [{'long_name': '915',
   'short_name': '915',
   'types': ['street_number']},
  {'long_name': 'East Washington Street',
   'short_name': 'E Washington St',
   'types': ['route']},
  {'long_name': 'Burns Park',
   'short_name': 'Burns Park',
   'types': ['neighborhood', 'political']},
  {'long_name': 'Ann Arbor',
   'short_name': 'Ann Arbor',
   'types': ['locality', 'political']},
  {'long_name': 'Washtenaw County',
   'short_name': 'Washtenaw County',
   'types': ['administrative_area_level_2', 'political']},
  {'long_name': 'Michigan',
   'short_name': 'MI',
   'types': ['administrative_area_level_1', 'political']},
  {'long_name': 'United States',
   'short_name': 'US',
   'types': ['country', 'political']},
  {'long_name': '48109', 'short_name': '48109', 'types': ['postal_code']},
  {'long_name': '1070',
   'short_name': '1070',
   'types': ['postal_code_suffix']}],
 'formatted_address': '915 E Washington St, Ann Arbor, MI 48109, USA',
 'geometry': {'location

Note, there is only one element in this list.

If you want to drill deeper into the `results` key, you would repeat the process to look at available keys and so forth.

In [68]:
response['results'][0].keys()

dict_keys(['formatted_address', 'address_components', 'types', 'place_id', 'geometry'])

To arrive at the latitude coordinate, you would need the following syntax:

In [69]:
response['results'][0]['geometry']['location']['lat']

42.2807892

**Note**: If you have a keen eye, you can just eyeball the hierarchy without needing to use the `keys` method.
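Chained indexing raises a `KeyError` or `IndexError` as soon as one level is missing, for example if `results` came back empty. A minimal sketch of a more forgiving lookup is shown below; the helper `dig` and the trimmed-down `sample` dictionary are made up for illustration and only mirror the shape of the response above.

```python
def dig(obj, *path, default=None):
    """Follow a sequence of dict keys and list indexes; return default if any step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

# Hand-trimmed stand-in for the geocoding response
sample = {
    'results': [
        {'formatted_address': '915 E Washington St, Ann Arbor, MI 48109, USA',
         'geometry': {'location': {'lat': 42.2807892}}},
    ],
}

print(dig(sample, 'results', 0, 'geometry', 'location', 'lat'))
print(dig(sample, 'results', 1, 'geometry'))  # no second result, so the default is returned
```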

# Example 4: Weather Data (using an API)

Here is an example of an API that doesn't follow the query string parameter format.

We will be using <a href='https://darksky.net/app/'>https://darksky.net/app/</a> to gather weather data.  The developer section is at https://darksky.net/dev/. The API documentation is at https://darksky.net/dev/docs. You will need to create an account to get an API key (or you can borrow mine).

**Note**: Dark Sky is a relatively simple API (part of the reason why I'm using it as an example). 

## Weather Forecast

APIs usually have different endpoints depending on the data you are interested in. The documentation page shows that you can make two types of API requests. 
1. The current weather forecast for the next week (forecast request)
2. Observed or forecast weather conditions for a date in the past or future (time machine request)

**Note**: BTW, two is a relatively small number. The Twitter API has over 100.

Let's look at endpoint #2.

The documentation shows that a time machine request takes the form of:

https://api.darksky.net/forecast/[key]/[latitude],[longitude],[time]

A historical weather request returns the observed weather at a given time (for many places, up to 60 years in the past).

Start by specifying the API key and then the GPS coordinates and timestamp.

In [70]:
apikey = os.getenv('FORECASTIO_API_KEY')

In [71]:
latitude = 42.28
longitude = -83.74
time = '2017-01-31T16:00:00'

In [72]:
url = r'https://api.darksky.net/forecast/{}/{},{},{}'.format(apikey,latitude,longitude,time)
R = requests.get(url)
R.raise_for_status()

API responses consist of a JSON-formatted object (UTF-8) according to the documentation.

In [73]:
pastforecast = R.json()
pastforecast

{'currently': {'apparentTemperature': 23.94,
  'cloudCover': 1,
  'dewPoint': 27.86,
  'humidity': 0.79,
  'icon': 'cloudy',
  'precipIntensity': 0.003,
  'precipProbability': 0.15,
  'precipType': 'snow',
  'pressure': 1004.81,
  'summary': 'Overcast',
  'temperature': 33.61,
  'time': 1485896400,
  'visibility': 5.81,
  'windBearing': 261,
  'windSpeed': 14.2},
 'daily': {'data': [{'apparentTemperatureHigh': 29.11,
    'apparentTemperatureHighTime': 1485874800,
    'apparentTemperatureLow': 20.56,
    'apparentTemperatureLowTime': 1485954000,
    'apparentTemperatureMax': 29.11,
    'apparentTemperatureMaxTime': 1485874800,
    'apparentTemperatureMin': 14.57,
    'apparentTemperatureMinTime': 1485838800,
    'dewPoint': 26.02,
    'humidity': 0.86,
    'icon': 'snow',
    'moonPhase': 0.12,
    'precipAccumulation': 0.653,
    'precipIntensity': 0.0027,
    'precipIntensityMax': 0.0189,
    'precipIntensityMaxTime': 1485853200,
    'precipProbability': 0.74,
    'precipType': 'snow'

Webscraping DONE!

Hourly summary and temperature for that day.

In [74]:
for i, hr in enumerate(pastforecast["hourly"]["data"]):
    print(i, hr['summary'], hr['temperature'])

0 Clear 24.34
1 Foggy 25.03
2 Foggy 25.26
3 Light Snow 25.89
4 Snow 26.56
5 Light Snow 26.8
6 Light Snow 27.34
7 Overcast 27.42
8 Clear 27.49
9 Clear 28.38
10 Overcast 29.11
11 Clear 30.5
12 Clear 31.61
13 Overcast 33.03
14 Clear 33.76
15 Clear 34.04
16 Overcast 33.61
17 Clear 32.59
18 Clear 32.44
19 Overcast 32.25
20 Breezy 32.29
21 Breezy 32.48
22 Partly Cloudy 31.88
23 Partly Cloudy 31.86


**Note**: API responses come in a variety of flavours. Some common ones are JSON, XML, and CSV format. I recommend choosing JSON over XML whenever possible. XML is a pain to work with IMO. If you do choose XML, you can use `BeautifulSoup` or `ElementTree` to parse the data.
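
If you do end up with XML, here is a rough sketch of parsing it with `ElementTree` from the standard library. The XML payload and its element names are made up for illustration (Dark Sky returns JSON); only the two sample values are borrowed from the hourly output above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload; the element and attribute names are invented
xml_text = """
<forecast>
  <hour index="0"><summary>Clear</summary><temperature>24.34</temperature></hour>
  <hour index="1"><summary>Foggy</summary><temperature>25.03</temperature></hour>
</forecast>
""".strip()

root = ET.fromstring(xml_text)

# Walk each <hour> element and convert the text values to Python types
hours = []
for hour in root.findall('hour'):
    hours.append({
        'index': int(hour.get('index')),
        'summary': hour.findtext('summary'),
        'temperature': float(hour.findtext('temperature')),
    })

print(hours)
```

Notice that every value comes back as text and has to be converted by hand, which is part of why JSON tends to be less work.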

## API Wrappers

For the more popular APIs (like Twitter and Google), the open source community has probably written a Python wrapper for the API to abstract away some of the details for you. The Dark Sky API is pretty simple as you can see but someone has written a wrapper for it because of its popularity. The Python module is called `python-forecastio`. You can install it using `pip install python-forecastio`.

In [75]:
import forecastio

lon = -83.7
lat = 42.3
forecast = forecastio.load_forecast(apikey, lat, lon)

Accessing the data depends on the structure the wrapper's developer set up, which is not necessarily the same as the API's. You will need to read the documentation of the module to understand the structure.

In [76]:
forecast.daily().data[0].summary

'Light rain in the morning.'

## Exercise

Using the Google Maps Distance Matrix API, get the distance from Ann Arbor, MI to San Diego, CA and to Anchorage, AK.

The documentation is located at https://developers.google.com/maps/documentation/distance-matrix/start

**Tip:** Don't worry about including the API key as a parameter. For some reason, it is not needed for this API to work.

## Example 5: POST Requests

This example was taken from the book **Web Scraping with Python** by Ryan Mitchell.

This website http://pythonscraping.com/pages/files/form.html shows a basic web form. Let's look at the page source to see info related to the `post` request.

Inside the `<form>` tag, the page source contains the names of the two input fields that need to be submitted. We create a dictionary to represent these fields.

In [77]:
data = {'firstname':'Mister',
        'lastname' :'Cao'}

The `<form>` tag also has an `action` attribute. This is the URL where the `post` request is sent. Here it is a path **relative** to the current URL, so we combine it with the page's address to get the full URL.

In [78]:
R = requests.post("http://pythonscraping.com/pages/files/processing.php", data=data)
R.text

'Hello there, Mister Cao!'
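
If you would rather not read the page source by eye, the field names and the full `post` URL can also be worked out in code. This is a minimal sketch using only the standard library. The form markup below is my simplified assumption of what the page contains, and `FormInspector` is a helper name I made up, so check it against the real page source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed, simplified form markup (not copied from the live page)
form_html = """
<form method="post" action="processing.php">
  First name: <input type="text" name="firstname"><br>
  Last name: <input type="text" name="lastname"><br>
  <input type="submit" value="Submit">
</form>
"""

class FormInspector(HTMLParser):
    """Collect the form's action and the names of its input fields."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.action = attrs.get('action')
        elif tag == 'input' and 'name' in attrs:
            self.field_names.append(attrs['name'])

inspector = FormInspector()
inspector.feed(form_html)

# Resolve the relative action against the URL of the page holding the form
page_url = 'http://pythonscraping.com/pages/files/form.html'
post_url = urljoin(page_url, inspector.action)

print(inspector.field_names)  # ['firstname', 'lastname']
print(post_url)               # http://pythonscraping.com/pages/files/processing.php
```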

We'll see in the next section a way to get at the same information through the browser's developer tool.

## Example 6: Farm Equipment Crashes in Ann Arbor (data that is not visible in the page source)

This example is meant to illustrate how to grab data that is visible on the webpage but not in the page source. The website of interest is https://www.michigantrafficcrashfacts.org/querytool. We are interested in grabbing the GPS coordinates of the crashes on the map. Each crash also has some information in a popup tooltip when you click on it.

The goal is to find the URL where the GET/POST request is being sent. How do we do that? The answer lies in the reference link at the bottom of this example. Basically, you need your browser's developer tool and some detective work.  
[SHORT INTERLUDE TO THE WEBPAGE]

Congrats! You can consider yourself a developer now that you can use the toolbox :)

Now that we have found the URL of interest, we go back to our regularly scheduled programming (pun intended). Everything should be easy peasy moving forward. We know the url where the `post` request is being sent. There is also some data being sent with it. We will create a dictionary for that bit of data.

In [79]:
mtcf_url = r'https://www.michigantrafficcrashfacts.org/qjson'
query = {'q':'1;0;2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004;c8189;0,42:1',
        'v':'map',
        'p':'13,42.264652,-83.729607,0.3'}
R = requests.post(mtcf_url, data=query)
R.raise_for_status()

Grab the JSON response.

In [80]:
farm_data = R.json()
farm_data

{'0': {'crash.id': '2015156119',
  'crash_day': 14,
  'crash_month': 'July',
  'crash_time_of_day': '5:00 PM - 6:00 PM',
  'crash_worst_injury': 'Possible injury (C)',
  'crash_year': '2015',
  'gps_x_coordinate': -83.762084436151,
  'gps_y_coordinate': 42.2444715503,
  'person_id': '8',
  'seg_crnt': 10003762,
  'seg_orig': 0,
  'vehicle_id': '3'},
 '1': {'crash.id': '200578099',
  'crash_day': 9,
  'crash_month': 'February',
  'crash_time_of_day': '1:00 PM - 2:00 PM',
  'crash_worst_injury': 'No injury (O)',
  'crash_year': '2005',
  'gps_x_coordinate': -83.68753,
  'gps_y_coordinate': 42.24527,
  'person_id': '3',
  'seg_crnt': 4839525,
  'seg_orig': 0,
  'vehicle_id': '3'},
 '2': {'crash.id': '2015113037',
  'crash_day': 26,
  'crash_month': 'February',
  'crash_time_of_day': '8:00 AM - 9:00 AM',
  'crash_worst_injury': 'No injury (O)',
  'crash_year': '2015',
  'gps_x_coordinate': -83.78405039994,
  'gps_y_coordinate': 42.281289282301,
  'person_id': '2',
  'seg_crnt': 4818100,
  

Webscraping DONE!
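
Since the GPS coordinates were the goal, here is a small sketch of pulling them out of the response. To keep it self-contained it uses two records copied from the output above under the made-up name `sample_data`; in the notebook you would loop over `farm_data` instead. Judging by the values, `gps_x_coordinate` is the longitude and `gps_y_coordinate` is the latitude.

```python
# Two records copied from the response above (only the fields used here)
sample_data = {
    '0': {'crash.id': '2015156119', 'crash_year': '2015',
          'gps_x_coordinate': -83.762084436151,
          'gps_y_coordinate': 42.2444715503},
    '1': {'crash.id': '200578099', 'crash_year': '2005',
          'gps_x_coordinate': -83.68753,
          'gps_y_coordinate': 42.24527},
}

# Build (crash id, latitude, longitude) tuples; x is longitude, y is latitude
crashes = [
    (rec['crash.id'], rec['gps_y_coordinate'], rec['gps_x_coordinate'])
    for rec in sample_data.values()
]

for crash_id, lat, lon in crashes:
    print(crash_id, lat, lon)
```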

## Reference

http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/

## Exercise

This website generates random VINs on demand: http://randomvin.com. This is probably one of the simplest HTML pages you will ever see (and one of the reasons I picked it).

Here is some example code to grab 3 VINs from the website. Supply the appropriate url to finish the code. What file format is the response?

In [None]:
for i in range(3):
    R = requests.get('')
    R.raise_for_status()
    print(R.text)

# Retrying Module

This module is aptly named. Even though you have written valid code, sometimes it will still crash for unforeseen reasons (e.g. bad network connection). This is where the `retry` behaviour comes in handy. You will need to install the module first via `pip install retrying`. Documentation is at https://pypi.python.org/pypi/retrying.

In [81]:
import random
from retrying import retry

## Simple Retry

Here is a simple case of how to use `retry`. The first step is to put your code of interest in a function. This nonsensical function has a 90% chance of failing when run because of the `assert` statement.

In [82]:
def is_B_equal_to_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7

is_B_equal_to_lucky7()
print('this has a 10% chance of printing')

8


AssertionError: 

Now let's add the `retry` behavior by putting the `@retry()` decorator directly above the function definition. That's it! Pretty simple.

In [84]:
@retry()
def is_B_equal_to_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7
    
is_B_equal_to_lucky7()
print('this will ALWAYS print')

1
6
10
2
6
10
4
4
6
5
5
8
7
this will ALWAYS print


You can also pass the argument `wait_fixed` (in milliseconds) to specify how long to wait between retries. This is good practice when web scraping, so that you don't bombard the server with constant requests while a connection is failing. Gmail does an exponential version of this when it loses the network connection.

## Retry on specific or general exceptions

We saw the function fail earlier because of an `AssertionError`. We can tell `retry` to retry only when certain exceptions occur. This requires using the argument `retry_on_exception` and passing it a function. That function receives the exception and returns `True` to retry or `False` to let the exception propagate.

In [85]:
def checkForSpecificError(exception):
    return isinstance(exception, AssertionError)

@retry(retry_on_exception=checkForSpecificError, wait_fixed=500)
def is_it_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7

is_it_lucky7()
print("Lucky 7")

6
5
5
2
2
7
Lucky 7


If we had changed the last line in the function to `assert C == 7`, the retry behaviour would not kick in because the function would raise a `NameError` instead.

## Retry on return value

You don't necessarily need to have a program error to invoke the `retry` behaviour. You can use the `return` value to decide. This requires using the argument `retry_on_result` and passing it a function. That function receives the return value and returns `True` to retry or `False` to stop.

In [87]:
def checkReturnValue(value):
    return value is True

# This function will never crash but we can still use retry
@retry(retry_on_result=checkReturnValue, wait_fixed=300)
def main():
    B = random.randint(1, 10)
    print(B)
    try:
        assert B == 7
        return None
    except AssertionError:
        return True

main()
print('Did I find a 7?')

1
7
Did I find a 7?


Other arguments of interest to `retry` which you can use are:  
- `stop_max_attempt_number`
- `stop_max_delay`
- `wait_random_min`
- `wait_random_max`
- `wait_exponential_multiplier`
- `wait_exponential_max`

**Note**: You could implement the retry behavior without this module, using a `while` loop in combination with `try` and `except`. I don't recommend it.
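
For comparison, here is roughly what a hand-rolled version could look like. It is a sketch only: `call_with_retries` and its parameters are names I made up, it uses a `for` loop rather than `while`, and the doubling delay is a generic exponential backoff, not a copy of how `retrying` computes its waits.

```python
import time

def call_with_retries(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    All names and defaults here are illustrative, not part of `retrying`.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the last error propagate
            # Wait base_delay, then 2x, 4x, ... between attempts
            sleep(base_delay * 2 ** (attempt - 1))

# Demo: a function that fails twice before succeeding
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('simulated network failure')
    return 'ok'

# Record the delays in a list instead of actually sleeping
delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, len(calls), delays)  # ok 3 [0.5, 1.0]
```

That is a fair amount of bookkeeping for something the decorator gives you in one line.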

## Exercise

Apply the `retry` module to this flaky code for the Dark Sky API.
1. Convert to a Python function
2. Add the `retry` decorator with at least one argument of your choice

In [None]:
apikey = <INSERT API KEY HERE>
lon = random.randint(0,400)
lat = random.randint(0,200)
print(lon,lat) # Somewhere in the Eastern Hemisphere
R = requests.get('https://api.darksky.net/forecast/{}/{},{}'.format(apikey,lat,lon))
R.raise_for_status()
forecast = R.json()
print(forecast["daily"]["data"][0]['summary'])

# Quiz

https://goo.gl/forms/Ym3hbKu45nzQ2puv2

# Robots.txt

Also known as the robots exclusion standard, *robots.txt* is a standard used by websites to communicate with web crawlers, scrapers, etc. It is a file that states which parts of the website are allowed or disallowed to be scraped. You will find the robots.txt file in the root directory of the website. For the English version of Wikipedia, it is located at https://en.wikipedia.org/robots.txt.

The complement to *robots.txt* is the *sitemap* which is an XML file that lists the URLs for a site. 
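
You don't have to read *robots.txt* by eye. Python's standard library ships `urllib.robotparser`, which can answer "am I allowed to fetch this URL?" for you. The sketch below parses a made-up robots.txt from a string so it runs offline. The user agent name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at the site root
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# 'MyScraper' and the example.com URLs are made-up values
allowed_public = parser.can_fetch('MyScraper', 'https://example.com/public/index.html')
allowed_private = parser.can_fetch('MyScraper', 'https://example.com/private/data.html')

print(allowed_public)   # True
print(allowed_private)  # False
```

For a live site, you would call `parser.set_url(...)` with the address of the robots.txt file followed by `parser.read()` instead of `parse`.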

# cURL to Python Requests

cURL is a command line tool for getting or sending files using URL syntax. You can always copy the cURL command for a request from the browser's developer tool. This usually also includes *cookies and headers*. While Python can't use it directly, you can convert it to Python `requests` syntax. Googling *curl to python requests* will bring you to this page: https://curl.trillworks.com/. You can paste the cURL command and it will return the equivalent Python code for you. Python does have modules that are supposed to do this for you, but I haven't gotten them to work yet. Some of these modules are `uncurl`, `runcurl`, and `curl_to_requests`, but they support Python 2 only.
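
To take some of the magic out of the conversion, here is a bare-bones sketch of the idea using `shlex` from the standard library. The cURL command is a shortened, hypothetical one, and the loop only understands the URL and `-H` headers, so treat it as an illustration and not a replacement for a real converter.

```python
import shlex

# Hypothetical cURL command, shortened from what a browser would produce
curl_cmd = ("curl 'http://randomvin.com/getvin.php?type=real' "
            "-H 'Accept: */*' -H 'Referer: http://randomvin.com/'")

# shlex.split handles the shell-style quoting for us
tokens = shlex.split(curl_cmd)

url = None
headers = {}
i = 1  # skip the leading 'curl'
while i < len(tokens):
    if tokens[i] in ('-H', '--header'):
        # Header tokens look like 'Name: value'
        name, _, value = tokens[i + 1].partition(':')
        headers[name.strip()] = value.strip()
        i += 2
    else:
        url = tokens[i]
        i += 1

print(url)
print(headers)
```

The resulting `url` and `headers` could then be handed to `requests.get(url, headers=headers)`.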

For example, here is the Python code the converter returns for the cURL command captured from http://randomvin.com.

In [None]:
import requests

cookies = {
    '_ga': 'GA1.2.690271399.1483585728',
    '_gat': '1',
}

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept': '*/*',
    'Referer': 'http://randomvin.com/',
    'Connection': 'keep-alive',
}

requests.get('http://randomvin.com/getvin.php?type=real', headers=headers, cookies=cookies)