# Web Scraping with Python

Copying and pasting is great. You should definitely use it if it's the simplest way. But if you don't want to copy and paste 10 webpages into Excel or if you have some time to kill, then web scraping is the answer or time sink you've been looking for.

We start by importing the Python modules that we will need to do web scraping. We will be using `requests` to fetch HTML pages and `BeautifulSoup` to parse them. `pandas` will be used for data manipulation. The `pd.options.display` line is for formatting purposes when printing out results in `pandas`.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

pd.options.display.max_rows = 10

# <font color='green'>Example 1: Wikipedia Page (HTML Table)</font>

We start by opening the <a href='https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'>website of interest</a> in a browser. We can see that the data is nicely formatted as a table. We then pass the website of interest to the `requests.get` method.

In [None]:
url = r'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
R = requests.get(url)
R.raise_for_status()

**Note**: Without the `R.raise_for_status()` line, a request that comes back with an error status (a 404 for a bad url, for example) will fail silently, which is probably not what you want.
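
To see what that note means without touching the network, here is a minimal sketch. It builds a `requests.Response` by hand and sets its status code to 404. The `fake_response` object is purely a stand-in for illustration (in real code the response comes from `requests.get`).

```python
import requests

# Hand-built stand-in for a server reply; real code gets this from requests.get().
fake_response = requests.Response()
fake_response.status_code = 404

# Without raise_for_status(), nothing complains about the 404.
# The only hint is the .ok flag.
print(fake_response.ok)

# With raise_for_status(), the bad status becomes an exception we can catch.
try:
    fake_response.raise_for_status()
    outcome = 'no error raised'
except requests.exceptions.HTTPError as err:
    outcome = 'HTTPError: {}'.format(err)

print(outcome)
```

In a scraping script you usually want the exception, so the script stops instead of parsing an error page.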

We then use `BeautifulSoup` to parse the contents of the response.

In [None]:
soup = BeautifulSoup(R.content, 'html.parser')

I like to use the browser's developer tool to do the detective work of figuring out where the data resides (*right click -> Inspect*). In this example, the data of interest resides in an HTML table (makes life easier).

We see that the data of interest resides in a table tag `<table>`. To grab everything between the table tags, we use the `find_all` method (one of many options but probably the only one you need and the one you will use most often).

In [None]:
tables = soup.find_all('table')

`find_all` returns a list of matches. We can use the `len` function to see how many matches came back.

In [None]:
len(tables)

We can check the contents of the table using the `text` attribute. It will be a formatting mess but that's okay.

In [None]:
tables[0].text

In [None]:
tables[1].text

In [None]:
tables[2].text

Looks like the second table is the one we want. 

In [None]:
table = tables[1]

A better way to find the table of interest is to pass in extra search terms to the `find_all` method after using the developer tool to find searchable attributes.

In [None]:
tables = soup.find_all('table', style='text-align: right;')
len(tables)

We get one result instead of three.

In [None]:
table = tables[0]
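
Here is a small, self-contained sketch of the same filtering idea. The HTML string is hand-written for illustration (it is not the Wikipedia page), and the names `soup_demo`, `all_tables`, `right_aligned` and `same_filter` are made up for this example.

```python
from bs4 import BeautifulSoup

# Tiny hand-written page (not the Wikipedia HTML) with two tables.
html = """
<table class="nav"><tr><td>menu</td></tr></table>
<table style="text-align: right;"><tr><td>1,000</td></tr></table>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

# No filter: every <table> comes back.
all_tables = soup_demo.find_all('table')

# Keyword filter: only tables whose style attribute matches exactly.
right_aligned = soup_demo.find_all('table', style='text-align: right;')

# The same filter written as an attrs dictionary.
same_filter = soup_demo.find_all('table', attrs={'style': 'text-align: right;'})

print(len(all_tables), len(right_aligned), len(same_filter))
```

The `attrs` dictionary form is handy when the attribute name is not a valid Python keyword argument (for example, one with a dash in it).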

The data resides in HTML table data cell tags `<td>` within table row tags `<tr>`. We use `find_all` to look for all the table row tags within the table tag `<table>`.

In [None]:
table_rows = table.find_all('tr')
len(table_rows)

We now see that there are a lot of matches for that tag, as expected. It's close to the total number of countries. Let's look at the first few entries of `table_rows`.

In [None]:
table_rows[0]

Looks like the header row. The `<th>` tag also gives it away, fyi.

In [None]:
table_rows[1]

It's the row data for the _World_.

We will use a nested `for` loop to go through the list of table rows. The inner `for` loop will go through each `<td>` tag appending the text to a list. We will grab all the data in the tags regardless of whether we want to keep them for now. 

We have two lists in the `for` loop. `row` will contain the text of each `<td>` tag in a table row. Once the row is iterated through, we will convert it to a `pandas` dataframe. `list_df` will contain a list of those dataframes.

In [None]:
list_df = []
for table_row in table_rows:
    table_cells = table_row.find_all('td')
    row = []  # text of each cell in this table row
    for cell in table_cells:
        row.append(cell.text)
    list_df.append(pd.DataFrame(row).transpose())

We take the list of dataframes and concatenate them together into one dataframe.

In [None]:
data = pd.concat(list_df, ignore_index=True) 
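
An alternative that some people find easier to read is to collect plain lists first and build a single dataframe at the end. This is only a sketch of that idea on a hand-written table (the values are placeholders, and `demo_rows`, `records` and `demo` are names invented for this example).

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the Wikipedia table; the values are placeholders.
html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>A</td><td>1,000</td></tr>
  <tr><td>B</td><td>2,000</td></tr>
</table>
"""
demo_rows = BeautifulSoup(html, 'html.parser').find_all('tr')

# One list of cell texts per row; the header row has no <td> and is skipped.
records = []
for tr in demo_rows:
    cells = [td.text for td in tr.find_all('td')]
    if cells:
        records.append(cells)

# A list of lists converts to a dataframe in one step.
demo = pd.DataFrame(records)
print(demo)
```

Both approaches end up with the same kind of dataframe. This one just avoids creating a dataframe per row.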

Let's look at the first 5 rows of the dataframe.

In [None]:
data.head()

We didn't get the header row because it was contained in `<th>` tags and not `<td>` tags. We use the same type of `for` loop to extract the text from the table headers.

In [None]:
headers = table_rows[0].find_all('th')
columns = []
for header in headers:
    columns.append(header.text)

Set the column headers of the dataframe.

In [None]:
data.columns = columns
data.head()

We now have all the data of interest. __Webscraping DONE!__ Now it's a matter of data cleanup and management. This is where `pandas` becomes useful. We want to clean up the last three columns.

We split the last column by the delimiter and grab the latter part of the text.

In [None]:
data['Change'] = data['Change'].apply(lambda x: x.split('♠')[1])
data.head(5)

We use a regular expression to get rid of the references and newline feeds in the headers.

In [None]:
import re
data.columns = [re.sub(r'\W\d\W$', '', x).replace('\n',' ') for x in data.columns]
data.head(5)
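
To see what that regular expression does (and doesn't do), here is a small sketch on made-up header strings written in the style of the scraped ones. The second pattern, `\[\d+\]$`, is my own variation for multi-digit references and is not part of the cleanup above.

```python
import re

# Made-up header strings in the style of the scraped ones.
raw_headers = ['Country or area', 'Population\n(1 July 2016)[4]', 'Change[12]']

# Same pattern as above: one non-word character, one digit and one
# non-word character at the very end of the string.
cleaned = [re.sub(r'\W\d\W$', '', h).replace('\n', ' ') for h in raw_headers]

# Hypothetical variant that also handles multi-digit references such as [12].
cleaned_any = [re.sub(r'\[\d+\]$', '', h).replace('\n', ' ') for h in raw_headers]

print(cleaned)
print(cleaned_any)
```

The original pattern only strips single-digit references, which is enough here but worth knowing if a page has ten or more footnotes.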

Get rid of the thousand separators.

In [None]:
data['Population (1 July 2016)'] = data['Population (1 July 2016)'].apply(lambda x: x.replace(',',''))
data['Population (1 July 2015)'] = data['Population (1 July 2015)'].apply(lambda x: x.replace(',',''))
data.head(5)
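
Removing the separators still leaves the values as strings. If you want numbers, one possible next step is sketched below on placeholder values (`demo` is a throwaway dataframe made up for this example, not the scraped data).

```python
import pandas as pd

# Placeholder values, formatted like the scraped strings.
demo = pd.DataFrame({'Population': ['1,000', '2,500,000', '42']})

# Strip the separators, then convert the strings to numbers.
demo['Population'] = pd.to_numeric(
    demo['Population'].str.replace(',', '', regex=False))

print(demo['Population'].dtype)
print(demo['Population'].sum())
```

Once the column is numeric, sorting and arithmetic behave the way you would expect.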

Good enough! Let's save the data to a txt file.

In [None]:
data.to_csv('wikipedia_countries_population.txt', sep='|', index=False, header=True)

If you look at the text file, there is some more data cleanup we could have done but I'm not here to teach you how to do that with `pandas` (although we do have workshops to do just that).

## <font color='green'>Pandas Approach to HTML Tables</font>

The pandas `read_html` function can read in HTML tables (and only HTML tables). It relies on a parser library such as `lxml` or `BeautifulSoup` with `html5lib` under the hood.

In [None]:
countries = pd.read_html(url)

**Note**: If you get an `ImportError: html5lib not found, please install it` message, you will need to install it via `conda install html5lib` and you might need to restart the kernel or Jupyter Notebook.

This returns a list of dataframes. One dataframe for each table in the html page. Recall there were 3 tables. 

In [None]:
len(countries)

Let's check their contents.

In [None]:
countries[0].head()

In [None]:
countries[1].head()

The second table contains our data of interest (as expected). It leaves us at a similar point to the one we reached earlier in the example. `pandas` has basically done the web scraping for us and left us with the data cleanup and wrangling. You should always expect to do some data manipulation if you use the `pd.read_html()` function or any webscraping for that matter.

Q: So why did we learn a more complicated way of doing things when `pd.read_html()` can do it for you?  
A: Because not everything resides in an HTML table and you'll need to use the same techniques to get at the data. If the data resides in a HTML table, consider yourself fortunate, use `pandas` and move on.

## <font color='green'>Exercise</font>

Scrape Serena Williams' Wikipedia page https://en.wikipedia.org/wiki/Serena_Williams for the data in the html table **Grand Slam tournament finals**  for *Singles* in the *Career Statistics* section.

# <font color='#1b8e77'>Example 2: USA TODAY Best Selling Books (not in HTML Table)</font>

In this example, the website of interest is https://www.usatoday.com/life/books/best-selling/. Suppose we are interested in getting some basic information about the book list. For this example, the data of interest does not exist in a html table.

We start with the boilerplate template of passing the website of interest to the `requests.get` method. We then use `BeautifulSoup` to parse the contents of the response.

In [None]:
url = r'https://www.usatoday.com/life/books/best-selling/'
R = requests.get(url)
R.raise_for_status()
soup = BeautifulSoup(R.content, 'html.parser')

Navigate to the website and use the browser's developer tool to inspect items of interest.

The data seems to be residing in a `div` tag. Let's search for that tag along with its class and see how many matches we get.

In [None]:
booklist = soup.find_all('div', class_="front-booklist-info-container")
len(booklist)

Let's look at the text associated within each `<div>` book tag.

In [None]:
for i, book in enumerate(booklist):
    print(i, book.text)

Suppose we want the title, author and genre. Let's use the browser to inspect which tags the data lies in. Then we'll use the `find_all` method to search for it.

In [None]:
for rank, book in enumerate(booklist, start=1):
    title = book.find_all('h3', class_='books-front-meta-title')[0].text
    author = book.find_all('span', class_='books-front-meta-authorInfo')[0].text
    genre = book.find_all('div', class_='books-front-meta-genre')[0].text
    print('{}. {}, {}, {}'.format(rank, title, author, genre))
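
Indexing with `[0]` will raise an `IndexError` if a book happens to be missing one of those tags. Here is a hedged sketch of one way to guard against that, using `find` (which returns `None` when nothing matches). The HTML is hand-written with the same class names as above, and `text_or_default` is a helper I made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written snippet using the same class names as above; the second
# book deliberately has no genre <div>.
html = """
<div class="front-booklist-info-container">
  <h3 class="books-front-meta-title">Book One</h3>
  <div class="books-front-meta-genre">Fiction</div>
</div>
<div class="front-booklist-info-container">
  <h3 class="books-front-meta-title">Book Two</h3>
</div>
"""
demo_books = BeautifulSoup(html, 'html.parser').find_all(
    'div', class_='front-booklist-info-container')

def text_or_default(parent, tag, css_class, default='n/a'):
    """Return the stripped text of the first match, or a default if absent."""
    match = parent.find(tag, class_=css_class)  # find() returns None on no match
    return match.get_text(strip=True) if match is not None else default

demo_results = [
    (text_or_default(b, 'h3', 'books-front-meta-title'),
     text_or_default(b, 'div', 'books-front-meta-genre'))
    for b in demo_books
]
print(demo_results)
```

Whether a missing field should become a default value or stop the script is a judgment call. For a long scrape, a default plus a later check is often less painful.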

Webscraping DONE! At least for page one. To scrape the rest of the pages, let's navigate to the other pages. Notice anything different in the url?

Before: https://www.usatoday.com/life/books/best-selling/  
After: https://www.usatoday.com/life/books/best-selling/week/2017/22/page/2/

The **before url** is the webpage for the current bestseller list, page 1. The **after url** is for a specific week and page. We can use this format to scrape any page for any given week. Much more useful than the generic url.

In [None]:
rank = 0
data = []
for page in range(1,4):    
    url = r'http://www.usatoday.com/life/books/best-selling/week/2017/22/page/{}/'.format(page)    
    R = requests.get(url)
    R.raise_for_status()
    soup = BeautifulSoup(R.content, 'html.parser')
    # find the element of interest matching our criteria
    booklist = soup.find_all('div', class_='front-booklist-info-container')
    for book in booklist:
        title = book.find_all('h3', class_='books-front-meta-title')[0].text
        author = book.find_all('span', class_='books-front-meta-authorInfo')[0].text
        genre = book.find_all('div', class_='books-front-meta-genre')[0].text
        rank += 1
        print('{}. {}, {}, {}'.format(rank, title, author, genre))
        data.append((rank,title,author,genre))

Let's convert the data into a `pandas` dataframe with column headers.

In [None]:
topsellers = pd.DataFrame(data, columns=['rank','title','author','genre'])
topsellers

## <font color='#1b8e77'>Exercise</font>

See if you can scrape the entire list and add columns such as when each book debuted and how long it's been on the charts.

# Example 3: Geocoding (using an API) 

This example shows how to use an API to gather data of interest. Sometimes, you will need to get an API key or access token to access the website. The website should have a developers or API section to let you know how to query the API with the appropriate parameters. This section will also detail the terms of usage and any usage limits on using the API. APIs usually follow a freemium business model.

We'll show you how to Geocode addresses next using the Google Maps Geocoding API. Let's look at the documentation quickly.    
https://developers.google.com/maps/documentation/geocoding/intro

The documentation tells us the format looks something like this

`https://maps.googleapis.com/maps/api/geocode/outputFormat?parameters`

In summary, the documentation tells us about:
1. Needing an API key
2. Format of the request needed
3. Choosing an output format
4. Required parameters
5. Optional parameters

## Query String Parameters

A url that contains a query string will contain three parts:
1. Resource (base) URL
2. Question Mark (?)
3. Parameters (key=value pairs) separated by an ampersand (&)

Let's look at the Google geocoding example with our own API key.

In [None]:
apikey = 'AIzaSyCmXigJXZovZrP16Yjzn8i7XbozDoibO9A'
url = r'https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key={}'.format(apikey)
url

Let's break down the different pieces for this example.
1. The resource url is `https://maps.googleapis.com/maps/api/geocode/json`
2. The output format chosen was json
3. `?`
4. First required parameter is `address=1600+Amphitheatre+Parkway,+Mountain+View,+CA`
5. `&`
6. Second required parameter is `key=AIzaSyCmXigJXZovZrP16Yjzn8i7XbozDoibO9A`
7. No optional key/value pair

Note that some characters cannot be part of a URL, like the spaces in the above example. Spaces are encoded as `+` or `%20`.
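
The breakdown above can also be done in code. This sketch uses only the standard library's `urllib.parse` to split a geocoding URL into its pieces and to show the two encodings of a space. `YOUR_API_KEY` is a placeholder, not a working key, and `demo_url`, `resource_url` and `query_params` are names chosen for this example.

```python
from urllib.parse import urlsplit, parse_qs, quote, quote_plus

# YOUR_API_KEY is a placeholder, not a working key.
demo_url = ('https://maps.googleapis.com/maps/api/geocode/json'
            '?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA'
            '&key=YOUR_API_KEY')

parts = urlsplit(demo_url)

# Everything before the question mark is the resource url.
resource_url = '{}://{}{}'.format(parts.scheme, parts.netloc, parts.path)

# parse_qs splits on & and = and decodes '+' back into spaces.
query_params = parse_qs(parts.query)

print(resource_url)
print(query_params)

# The two common encodings of a space.
print(quote_plus('Mountain View'))
print(quote('Mountain View'))
```

Note that `parse_qs` returns a list for each key because a key is allowed to appear more than once in a query string.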

We now can make a `GET` request. Same as before.

In [None]:
R = requests.get(url)
response = R.json()
response

Webscraping done! API conquered!

Manually encoding strings for URLs can be a pain. Thankfully, the `requests` library takes care of all this for us if we pass a dictionary to the `params` keyword argument.

First we need to construct the dictionary for the parameters and then specify the resource url.

In [None]:
params = {'address': '915 E Washington, Ann Arbor',
          'key':apikey
         }
baseurl = 'https://maps.googleapis.com/maps/api/geocode/json'

Make a `GET` request with the params keyword argument

In [None]:
R = requests.get(baseurl, params=params)
R.raise_for_status()
response = R.json()
response

Same result. And we can also spy the actual url that was sent.

In [None]:
R.url
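
If you want to check the encoded url before spending an API call, `requests` can prepare a request without sending it. This is a minimal sketch of that, with a placeholder key and example-only names (`demo_params`, `demo_baseurl`, `prepared`, `decoded`).

```python
import requests
from urllib.parse import urlsplit, parse_qs

demo_params = {'address': '915 E Washington, Ann Arbor',
               'key': 'YOUR_API_KEY'}  # placeholder key
demo_baseurl = 'https://maps.googleapis.com/maps/api/geocode/json'

# Build the request but do not send it.
prepared = requests.Request('GET', demo_baseurl, params=demo_params).prepare()
print(prepared.url)

# Round trip: decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(prepared.url).query)
print(decoded)
```

Nothing goes over the network here, so it is a cheap way to debug a query string.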

That's it! Pretty simple, eh! A lot of APIs work just like this, although most are a bit more complicated.

**Note:** CSCAR also has workshops on using social media APIs.

# Traversing a JSON object

Now that we have a JSON response from the API, we need to know how to parse it for the information we are looking for. A JSON object behaves like a Python dictionary in that it consists of key-value pairs. Once parsed, a JSON response is made up of nested dictionaries and lists.

To look at the variables in a JSON object hierarchically, use the `keys` method.

In [None]:
response.keys()

Access the key like you would a Python dictionary.

In [None]:
response['results']

Occasionally, you will encounter a `list` of key-value pairs in the hierarchy.

You would access the `list` the same way as a Python list.

In [None]:
response['results'][0]

Note, there is only one element in this list.

If you want to drill deeper into the `results` key, you would repeat the process to look at available keys and so forth.

In [None]:
response['results'][0].keys()

To arrive at the latitude coordinate, you would need the following syntax:

In [None]:
response['results'][0]['geometry']['location']['lat']
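
Long chains of keys and indexes fail with a `KeyError` or `IndexError` as soon as one level is missing (for example, when the API returns no results). Below is a hedged sketch of a small helper that walks the hierarchy and returns a default instead. The `demo_response` dictionary is hand-written to mimic the shape used above (the coordinates are illustrative, not real API output), and `dig` is a name I made up.

```python
# Hand-written dictionary shaped like the part of the response used above.
# The coordinates are illustrative, not real API output.
demo_response = {
    'results': [
        {'geometry': {'location': {'lat': 42.28, 'lng': -83.74}}}
    ],
    'status': 'OK',
}

def dig(obj, *path, default=None):
    """Follow a sequence of dict keys and list indexes; return default on a miss."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

print(dig(demo_response, 'results', 0, 'geometry', 'location', 'lat'))
print(dig(demo_response, 'results', 1, 'geometry'))  # there is no second result
```

For a one-off lookup the plain bracket syntax is fine. A helper like this pays off when you loop over many responses of uneven quality.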

**Note**: If you have a keen eye, you can just eyeball the hierarchy without needing to use the `keys` method.

# <font color='#d95f02'>Example 4: Weather Data (using an API)</font>

Here is an example of an API that doesn't follow the query string parameter format.

We will be using <a href='https://darksky.net/app/'>https://darksky.net/app/</a> to gather weather data.  The developer section is at https://darksky.net/dev/. The API documentation is at https://darksky.net/dev/docs. You will need to create an account to get an API key (or you can borrow mine).

**Note**: Dark Sky is a relatively simple API (part of the reason why I'm using it as an example). 

## <font color='#d95f02'>Weather Forecast</font>

APIs usually have different endpoints depending on the data you are interested in. The documentation page shows that you can make two types of API requests. 
1. The current weather forecast for the next week (forecast request)
2. The observed or forecast weather conditions for a date in the past or future (time machine request)

**Note**: BTW, two is a relatively small number. The Twitter API has over 100.

Let's look at endpoint #2.

The documentation shows that a time machine request takes the form of:

https://api.darksky.net/forecast/[key]/[latitude],[longitude],[time]

A time machine request for a past date returns the observed weather at the given time (for many places, up to 60 years in the past).

Start by specifying the API key and then the GPS coordinates and timestamp.

In [None]:
apikey = "1bf42385cc58ad8ea6a6df3c78612493" #I'm sharing mine with you. BTW, this is not good practice.

In [None]:
latitude = 42.28
longitude = -83.74
time = '2017-01-31T16:00:00'

In [None]:
url = r'https://api.darksky.net/forecast/{}/{},{},{}'.format(apikey,latitude,longitude,time)
R = requests.get(url)
R.raise_for_status()
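
If you need more than one day, it helps to wrap the url template in a small function. This is only a sketch: `time_machine_url` is a helper name I made up, `YOUR_API_KEY` is a placeholder, and no request is actually sent.

```python
def time_machine_url(key, latitude, longitude, when):
    """Fill in the [key]/[latitude],[longitude],[time] template shown above."""
    return 'https://api.darksky.net/forecast/{}/{},{},{}'.format(
        key, latitude, longitude, when)

# Three consecutive days at 4 pm, same coordinates as above.
demo_days = ['2017-01-{:02d}T16:00:00'.format(day) for day in (29, 30, 31)]
demo_urls = [time_machine_url('YOUR_API_KEY', 42.28, -83.74, t) for t in demo_days]

for u in demo_urls:
    print(u)
```

Each url in the list could then be passed to `requests.get` in a loop, the same way we looped over pages in the bestseller example.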

API responses consist of a JSON-formatted object (UTF-8) according to the documentation.

In [None]:
pastforecast = R.json()
pastforecast

Webscraping DONE!

Hourly summary and temperature for that day.

In [None]:
for i, hr in enumerate(pastforecast["hourly"]["data"]):
    print(i, hr['summary'], hr['temperature'])

**Note**: API responses come in a variety of flavours. Some common ones are JSON, XML, and CSV format. I recommend choosing JSON over XML whenever possible. XML is a pain to work with IMO. If you do choose XML, you can use `BeautifulSoup` or `ElementTree` to parse the data.

## <font color='#d95f02'>API Wrappers</font>

For the more popular APIs (like Twitter and Google), the open source community has probably written a Python wrapper for the API to abstract away some of the details for you. The Dark Sky API is pretty simple as you can see but someone has written a wrapper for it because of its popularity. The Python module is called `python-forecastio`. You can install it using `pip install python-forecastio`.

In [None]:
import forecastio

lon = -83.7
lat = 42.3
forecast = forecastio.load_forecast(apikey, lat, lon)

Accessing the data depends on the structure the developer set up which is not necessarily the same as the API. You will need to read the documentation of the module to understand the structure.

In [None]:
forecast.daily().data[0].summary

## <font color='#d95f02'>Exercise</font>

Print out the weather summary on your birthday for your city of birth. 

# <font color='#e7298a'>Example 5: POST Requests</font>

This example was taken from the book **Web Scraping with Python** by Ryan Mitchell.

This website http://pythonscraping.com/pages/files/form.html shows a basic web form. Let's look at the page source to see info related to the `post` request.

The page source contains the variable names of the two input fields which need to be submitted in the `<form>` tag. We create a dictionary to represent these fields.

In [None]:
data = {'firstname':'Mister',
        'lastname' :'Cao'}

There is an `action` attribute associated with the `post` request. This is the url where the `post` is being sent. This is a **relative** path to the current url. 

In [None]:
R = requests.post("http://pythonscraping.com/pages/files/processing.php", data=data)
R.text
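
Here is a small sketch of how a relative `action` path turns into the full url used above, and of what the form data looks like once encoded. It assumes the `action` attribute is simply `processing.php` (check the page source to confirm), and it uses only the standard library.

```python
from urllib.parse import urljoin, urlencode

form_page = 'http://pythonscraping.com/pages/files/form.html'
action = 'processing.php'  # assumed value of the form's action attribute

# A relative action is resolved against the page that contains the form.
post_url = urljoin(form_page, action)
print(post_url)

# Form fields travel in the request body, encoded like a query string.
form_fields = {'firstname': 'Mister', 'lastname': 'Cao'}
encoded_body = urlencode(form_fields)
print(encoded_body)
```

`requests.post(..., data=...)` does the body encoding for you. The point here is just to show where the url comes from.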

We'll see in the next section a way to get at the same information through the browser's developer tool.

# <font color='#7570b3'>Example 6: Farm Equipment Crashes in Ann Arbor (data that is not visible in the page source)</font>

This example is meant to illustrate how to grab data that is visible on the webpage but not in the page source. The website of interest is https://www.michigantrafficcrashfacts.org/querytool. We are interested in grabbing the GPS coordinates of the crashes on the map. Each crash also has some information in a popup tooltip when you click on it.

The goal is to find the URL where the GET/POST request is being sent. How do we do that? The answer lies in the reference link at the bottom of this example. Basically, you need your browser's developer tool and some detective work.  
[SHORT INTERLUDE TO THE WEBPAGE]

Congrats! You can consider yourself a developer now that you can use the toolbox :)

Now that we have found the URL of interest, we go back to our regularly scheduled programming (pun intended). Everything should be easy peasy moving forward. We know the url where the `post` request is being sent. There is also some data being sent with it. We will create a dictionary for that bit of data.

In [None]:
mtcf_url = r'https://www.michigantrafficcrashfacts.org/qjson'
query = {'q':'1;0;2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004;c8189;0,42:1',
        'v':'map',
        'p':'13,42.264652,-83.729607,0.3'}
R = requests.post(mtcf_url, data=query)
R.raise_for_status()

Grab json response

In [None]:
farm_data = R.json()
farm_data

Webscraping DONE!

## <font color='#7570b3'>Reference</font>

http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/

## <font color='#7570b3'>Exercise</font>

This website generates random VINs upon demand: http://randomvin.com. This is probably one of the simplest HTML pages you will ever see (and one of the reasons I picked it).

Here is some example code to grab 3 VINs from the website. Supply the appropriate url to finish the code. What file format is the response?

In [None]:
for i in range(3):
    R = requests.get('')
    R.raise_for_status()
    print(R.text)

# <font color='#66a61e'>Retrying Module</font>

This module is aptly named. Even though you have written valid code, sometimes it will still crash for unforeseen reasons (e.g. bad network connection). This is where the `retry` behaviour becomes handy. You will need to install the module first via `pip install retrying`. Documentation is at https://pypi.python.org/pypi/retrying.

In [None]:
import random
from retrying import retry

## <font color='#66a61e'>Simple Retry</font>

Here is a simple case of how to use `retry`. The first step is to put your code of interest in a function. This nonsensical function has a 90% chance of failing when run because of the `assert` statement.

In [None]:
def is_B_equal_to_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7

is_B_equal_to_lucky7()
print('this has a 10% chance of printing')

Now let's add the `retry` behavior by adding an @ decorator at the beginning of the function. That's it! Pretty simple.

In [None]:
@retry()
def is_B_equal_to_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7
    
is_B_equal_to_lucky7()
print('this will ALWAYS print')

You can also add an argument `wait_fixed` in milliseconds to specify how long to wait between retries. This is good practice so you don't bombard the server with constant requests when a connection fails during webscraping. Gmail does an exponential version of this when it loses the network connection.

## <font color='#66a61e'>Retry on specific or general exceptions</font>

We saw the function fail earlier because of an `AssertionError`. We can tell `retry` to only retry when certain exceptions occur.  This requires using the argument `retry_on_exception` and passing it the name of a function. The function will return either `True` or `False`.

In [None]:
def checkForSpecificError(exception):
    return isinstance(exception, AssertionError)

@retry(retry_on_exception=checkForSpecificError, wait_fixed=500)
def is_it_lucky7():
    B = random.randint(1, 10)
    print(B)
    assert B == 7

is_it_lucky7()
print("Lucky 7")

If we had changed the last line in the function to `assert C == 7`, the retry behaviour would not kick in because the function would raise a `NameError` instead.

## <font color='#66a61e'>Retry on return value</font>

You don't necessarily need to have a program error to invoke the `retry` behaviour. You can use the `return` value to decide. This requires using the argument `retry_on_result` and passing it the name of a function. The function will return either `True` or `False`. 

In [None]:
def checkReturnValue(value):
    return value is True

# This function will never crash but we can still use retry
@retry(retry_on_result=checkReturnValue, wait_fixed=300)
def main():
    B = random.randint(1, 10)
    print(B)
    try:
        assert B == 7
        return None
    except AssertionError:
        return True

main()
print('Did I find a 7?')

Other arguments of interest to `retry` which you can use are:  
- `stop_max_attempt_number`
- `stop_max_delay`
- `wait_random_min`
- `wait_random_max`
- `wait_exponential_multiplier`
- `wait_exponential_max`

**Note**: You could implement the retry behavior without this module. You can use a `while` loop in some combination with `try` and `except` too. I don't recommend it.
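
For the curious, here is roughly what that hand-rolled version could look like. It is only a sketch to show what the module saves you from writing. `call_with_retries` and `flaky` are names invented for this example, and `flaky` fails on purpose twice before succeeding so the behaviour is repeatable.

```python
import time

def call_with_retries(func, max_attempts=5, wait_seconds=0.01):
    """Call func(); on any exception wait and try again, up to max_attempts."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return func()
        except Exception:
            if attempt >= max_attempts:
                raise  # out of attempts: let the last error surface
            time.sleep(wait_seconds)

# Deterministic stand-in for flaky code: fails twice, then succeeds.
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network hiccup')
    return 'ok'

result = call_with_retries(flaky)
print(result, calls['count'])
```

Compare that to a single `@retry(...)` line and the recommendation makes sense.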

## <font color='#66a61e'>Exercise</font>

Apply the `retry` module to this flaky code for the Dark Sky API.
1. Convert to a Python function
2. Add `retry` module with at least one argument of your choice

In [None]:
apikey = "1bf42385cc58ad8ea6a6df3c78612493"
lon = random.randint(0,400)
lat = random.randint(0,200)
print(lon,lat) # Somewhere in the Eastern Hemisphere
R = requests.get('https://api.darksky.net/forecast/{}/{},{}'.format(apikey,lat,lon))
R.raise_for_status()
forecast = R.json()
print(forecast["daily"]["data"][0]['summary'])

# Quiz

https://goo.gl/forms/Ym3hbKu45nzQ2puv2

# cURL to Python Requests

cURL is a command line tool for getting or sending files using URL syntax. You can always get the cURL command from the developer tool. This usually also includes *cookies and headers*. While Python can't use it directly, you can convert it to Python `requests` syntax. Googling *curl to python requests* will bring you to this page: https://curl.trillworks.com/. You can paste the curl code and it will return the equivalent Python code for you. Python does have modules that are supposed to do this for you, such as `uncurl`, `runcurl` and `curl_to_requests`, but I haven't gotten them to work yet and they support Python 2 only.

For example, here is what is returned for the curl command from http://randomvin.com.

In [None]:
import requests

cookies = {
    '_ga': 'GA1.2.690271399.1483585728',
    '_gat': '1',
}

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept': '*/*',
    'Referer': 'http://randomvin.com/',
    'Connection': 'keep-alive',
}

requests.get('http://randomvin.com/getvin.php?type=real', headers=headers, cookies=cookies)

# <font color='#addd8e'>Robots.txt</font>

Also known as the robots exclusion standard, *robots.txt* is a standard used by websites to communicate with web crawlers, scrapers, etc. The robots.txt file tells them which parts of the website are allowed or disallowed to be scraped. You will find the robots.txt file in the root directory of the website. For the English version of Wikipedia, it is located at https://en.wikipedia.org/robots.txt.

The complement to *robots.txt* is the *sitemap* which is an XML file that lists the URLs for a site. 
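
Python's standard library can read robots.txt rules for you. The sketch below feeds a made-up set of rules to `urllib.robotparser` and asks whether two urls may be fetched. The rules and the `example.com` urls are illustrative only (a real script would load the file from the website itself).

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file would be fetched from the site.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

allowed_page = parser.can_fetch('*', 'https://example.com/wiki/Python')
blocked_page = parser.can_fetch('*', 'https://example.com/private/data')

print(allowed_page, blocked_page)
```

Checking `can_fetch` before each request is a simple way to keep a scraper within the rules the website has published.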