# Synopsis

In this unit we will learn the basics of retrieving data from the Web using APIs.

* What is an API
* How to make a request from a URL
* How to identify the status of the request (was it successful? if not, why?)
* How to read the contents of the response
* How to pass parameters within a request
* How to authenticate requests

# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from pathlib import Path
from sys import path

path.append('../My_libraries')
path

In [None]:
import datetime
import json
import sys
import random
import requests
import scipy

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
from IPython.display import HTML, display, Image
from IPython.lib.display import YouTubeVideo


# Videos

In [None]:
vid = YouTubeVideo('LosIGgon_KM', width = 600)
display(vid)

# Interacting with the Web

The Internet is a gigantic data dump. There is all the social networking data from Facebook, Twitter, and so on. There is the news from all the traditional media sources plus Quartz, Vox, and so on. Then there is the data from organizations such as the World Bank, the Bureau of Labor Statistics, the US Census, or Chicago's Data Portal. Finally, you have all your scientific data sources: the National Cancer Institute, the Protein Data Bank, or the Kyoto Encyclopedia of Genes and Genomes.

How can you use Python to access those sites and retrieve data for your research, your business, or your hobby?

There are two main approaches to retrieving data from web sources. The preferred approach is using **Application Program Interfaces** or APIs. If an organization has decided to share its data, and it has the forethought and resources to do so, it will develop an API that lets you interact with its data.

If the organization does not have the forethought or resources to create an API (or if they do not want to share their data), then you have to **crawl** their website and **scrape** their data.


# Application Program Interfaces

**We relied heavily for these materials on https://www.dataquest.io/blog/python-api-tutorial/**

APIs simplify the process of obtaining specific information from a data source. You do not have to worry about figuring out the **format** in which the information is stored, or **where** the information is stored. All of those matters are handled seamlessly by the API.

But convenience is not the only advantage of an API. APIs are also particularly useful when:

* You want a small piece of a much larger set of data. Reddit comments are one example. What if you want to just pull your own comments on Reddit? It doesn’t make much sense to download the entire Reddit database, then filter just your own comments.
    
* There is repeated computation involved. Spotify has an API that can tell you the genre of a piece of music. You could theoretically create your own classifier, and use it to categorize music, but you’ll never have as much data as Spotify does.
    
* The data is changing quickly. An example of this is stock price data. It doesn’t really make sense to regenerate a dataset and download it every minute – this will take a lot of bandwidth, and be pretty slow.
    
    
    
## Making a request

In order to learn how APIs work, we will first use the APIs developed to retrieve data on the **International Space Station (ISS)**.  The relevant APIs can be found at http://open-notify.org/.  We will first consider the API for retrieving the location (latitude and longitude) of the ISS (http://open-notify.org/Open-Notify-API/ISS-Location-Now/). The API is hosted at http://api.open-notify.org/iss-now.json. 

So, how do we make requests for information with this API?

Like standard webpages, APIs are also hosted on web servers. When you type http://www.google.com in your browser’s address bar, your computer is actually asking the http://www.google.com server for a webpage, which the server then returns to your browser for display. That action is called a `request`. APIs work much the same way, except instead of your web browser asking for a webpage, your program asks for **data**. This data is usually returned in JSON format.

There are many possible types of requests. The most common, and the one we will be using throughout this unit, is the `GET` request. A `GET` request simply accesses and downloads the webpage found at the URL you specified as an input. 
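To make the idea of "asking a server for a resource" concrete, here is a small sketch that splits the ISS endpoint URL into its parts using the standard library's `urllib.parse.urlsplit`. It does not contact the server; it only shows which part of the URL names the server and which part names the resource being requested.

```python
from urllib.parse import urlsplit

# The ISS location endpoint used in this unit.
url = "http://api.open-notify.org/iss-now.json"

parts = urlsplit(url)
print("scheme:", parts.scheme)  # protocol used to talk to the server
print("server:", parts.netloc)  # machine that receives the request
print("path:  ", parts.path)    # resource being asked for
```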

We will use the [`requests`](http://docs.python-requests.org/en/latest/user/quickstart/) package to crawl (load) webpages and scrape (download) their contents.

In [None]:
response = requests.get("http://api.open-notify.org/iss-now.json")

print( response )
print( response.status_code )


In [None]:
print(dir(response))
print()

help(response)

Methods from the `requests` package return `Response` objects. One of the most important properties of the response is its `status code`, which is printed by default but which we can also get explicitly.

Here are some of the most common status codes you might encounter:
* 200, **OK**. Standard response for successful HTTP requests. The actual response will depend on the request method used.
* 301, **Moved Permanently**. The server is redirecting you to a different endpoint. This and all future requests should be directed to the given URL. This can happen when a company switches domain names, or an endpoint name is changed.
* 303, **See Other**. The response to the request can be found under another URI using a GET method. When received in response to a POST (or PUT/DELETE), the client should presume that the server has received the data and should issue a redirect with a separate GET message. Your web browser automatically fetches the new URL but web crawlers do not usually do this unless you specify it.
* 400, **Bad Request**. The server cannot or will not process the request due to an apparent client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).
* 401, **Unauthorized**. Similar to `403 Forbidden`, but specifically for use when authentication is required and has failed or has not yet been provided. The response must include a WWW-Authenticate header field containing a challenge applicable to the requested resource.
* 403, **Forbidden**. The request was a valid request, but the server is refusing to respond to it. `403` error semantically means "unauthorized", i.e. the user does not have the necessary permissions for the resource.
* 404, **Not Found**. The requested resource could not be found but may be available in the future. Subsequent requests by the client are permissible.
* 500, **Internal Server Error**. A generic error message, given when an `unexpected` condition was encountered and no more specific message is suitable.
* 503, **Service Unavailable**. The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.
* 504, **Gateway Timeout**. The server was acting as a gateway or proxy and did not receive a timely response from the upstream server.



More codes: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

The status code of our request was **200**. It means that all went well -- we successfully connected to the web address we wanted and downloaded its contents.
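The first digit of a status code tells you its broad category, which is often all a program needs to decide what to do next. The sketch below uses the standard library's `http.HTTPStatus` to look up the official phrase for a code; the helper name `describe_status` and the category labels are ours, chosen for illustration only.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

In a notebook you could pass `response.status_code` to a helper like this before trying to read the contents of the response.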

But the `status code` is not the only attribute available:

In [None]:
print( response.url )

In [None]:
print( response.text )
print()

print(type(response.text))

This is the content format specified at http://open-notify.org/Open-Notify-API/ISS-Location-Now/. It is in `json` format, which means that we can easily parse it using the `json` module.

In [None]:
data = json.loads(response.text)
print(type(data))
print()

print( data )


**YES**. 

The method `loads()` returns json formatted data as a dictionary. We can print whatever information we need from the dictionary using the appropriate `keys`.
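To see how nested keys work without depending on a live request, here is a minimal sketch that parses a string shaped like the ISS location response. The values in `sample_text` are made up for illustration, and we assume, as the cells in this unit do, that latitude and longitude arrive as strings that need converting to numbers.

```python
import json

# Text shaped like the ISS location response; the values are made up.
sample_text = (
    '{"message": "success", "timestamp": 1500000000, '
    '"iss_position": {"latitude": "12.3456", "longitude": "-65.4321"}}'
)

sample = json.loads(sample_text)
print(sorted(sample.keys()))

# The position is itself a dictionary nested inside the outer dictionary.
position = sample["iss_position"]
latitude = float(position["latitude"])    # stored as text, so convert
longitude = float(position["longitude"])
print(latitude, longitude)
```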

In [None]:
print(datetime.datetime(1970, 1, 1, 0, 0, 0) + datetime.timedelta(seconds = data['timestamp']))
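
The cell above treats the `timestamp` as a number of seconds counted from 1 January 1970. A possible alternative, sketched here with fixed inputs so the result is easy to check, is `datetime.fromtimestamp` with an explicit UTC time zone, which performs the same conversion and labels the result as UTC.

```python
from datetime import datetime, timezone

# Zero seconds after the epoch is the epoch itself.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch.isoformat())

# One day (86400 seconds) later.
next_day = datetime.fromtimestamp(86400, tz=timezone.utc)
print(next_day.isoformat())
```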

In [None]:
response = requests.get("http://api.open-notify.org/iss-now.json")
data = json.loads(response.text)

print( f"The ISS\'s current position is "
       f"{float(data['iss_position']['latitude']):.3f} degrees of latitude"
       f" and {float(data['iss_position']['longitude']):.3f} "
       f"degrees of longitude.")

.


.


**Contrast the niceness above with what you get back when you do a request 
on a typical webpage...**

In [None]:
my_url= 'http://www.google.com'
response = requests.get(my_url)
print(response.status_code)
print()

print(response.text)
print()

## Exercise: 

It is now time for you to try to use an API on your own. The last API available at [`Open Notify`](http://open-notify.org/) returns the number of astronauts in the ISS. Write the code to access that information.

.

.

.

# The US Census' APIs

The United States Census is a decennial census mandated by the United States Constitution. The United States Census Bureau (officially the Bureau of the Census) is responsible for the United States Census.

The first census after the American Revolution was taken in 1790, under Secretary of State Thomas Jefferson; there have been 22 federal censuses since that time. The current national census was held in 2010; the next census is scheduled for 2020 and will be largely conducted using the Internet. For years between the decennial censuses, the Census Bureau issues estimates made using surveys and statistical models.

The Census Bureau has begun rolling out their datasets via [APIs](http://www.census.gov/developers/). You can find a full list of APIs [here](http://www.census.gov/data/developers/data-sets.html).  In this unit, we will focus on the [decennial census](http://www.census.gov/data/developers/data-sets/decennial-census-data.html).

Because we are dealing with US data, we will start by loading some helpful data: US city names, their states, and their geographic codes. The relevant data is stored in `json` format in the `Data` folder.



In [None]:
data_folder = Path.cwd() / 'Data' 

with open(data_folder / 'us_state_names.json') as file_in:
    state_codes = json.load( file_in )
    
with open(data_folder / 'us_places_by_state.json') as file_in:
    places_by_state = json.load( file_in )

print(state_codes.keys())

In [None]:
print(state_codes['MA']['Name'])

**FIPS state codes** are numeric and two-letter alphabetic codes defined in U.S. Federal Information Processing Standard Publication ("FIPS PUB") 5-2 to identify U.S. states and certain other associated areas. The codes are used in the Geographic Names Information System, overseen by the U.S. Board on Geographic Names.

In [None]:
print(places_by_state.keys())

In [None]:
print(state_codes['MT'])
print()

for i in range(2):
    print(places_by_state['MT'][i])
    print()

Now that we have the basic information, we can start using the API to retrieve data. The Census Bureau has a number of helpful resources. The [decennial census page](http://www.census.gov/data/developers/data-sets/decennial-census-data.html) contains basic instructions on how to construct queries. There is also a [page with examples](http://api.census.gov/data/2010/sf1/examples.html), and a page with a list of all (and I *really* mean **all**) [variable codes](http://api.census.gov/data/2010/sf1/variables.html).

**But, before we can do anything, you need to obtain a `key` that will identify you as the person doing the queries.**

In [None]:
with open(Path.cwd().parent.parent.parent / 'My_libraries' / 'amaral_auth.json', 'r') as file_in: 
    auth = json.load( file_in )
    
print(auth.keys())
print()

print(auth['census']['my_key'][3:13])
print()

my_key = auth['census']['my_key']

In [None]:
census_url = 'http://api.census.gov/data/2010/dec/sf1?'

# P012 is the set of codes for population by age and ethnicity
# P012A is white population
# P012A001 is total white population
# P012A002 is total white male population
# P012A003 is total white male population younger than 5 
# P012A026 is total white female population

response = requests.get( census_url, params = {'key': my_key, 
                                               'get': 'P012A001,NAME', 
                                               'for': 'state:*'} )

print(response.status_code)
HTML(response.text)

In [None]:
ordered_codes = sorted( list(state_codes.keys()) )

for key in ordered_codes:
    print(key, state_codes[key]['fips_state'])

We can also write queries that obtain several data sets all at once. For example, we can obtain population by age and ethnicity using the codes:

* P012A018 -- Sex By Age (White Alone) MALE 15 yrs old
* P012A038 -- Sex By Age (White Alone) MALE 35 yrs old
* P012B018 -- Sex By Age (Black Or African American Alone) MALE 15 yrs old

And we can also restrict the query to specific states.
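When you pass a `params` dictionary to `requests.get`, the package encodes it into the query string that follows the `?` in the URL. The sketch below reproduces that step with the standard library's `urllib.parse.urlencode`, so the exact encoding may differ in small details from what `requests` produces. The key `YOUR_KEY` is a placeholder, and `02` and `17` are the FIPS codes for Alaska and Illinois.

```python
from urllib.parse import urlencode

census_url = "http://api.census.gov/data/2010/dec/sf1?"

# Placeholder key; substitute the key you obtained from the Census Bureau.
params = {
    "key": "YOUR_KEY",
    "get": "P012A018,NAME",
    "for": "state:02,17",
}

# Commas and colons are percent-encoded so they survive inside a URL.
query = urlencode(params)
full_url = census_url + query
print(full_url)
```

Printing the full URL this way can help with debugging, since you can paste it into a browser and inspect the server's reply directly.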

In [None]:
# Join the variable codes and the NAME field into one comma-separated string
data_codes = ','.join(['P012A018', 'P012A038', 'P012B018', 'NAME'])
print(data_codes)
print()

state_fips = ( f"state:{state_codes['AK']['fips_state']},"
               f"{state_codes['IL']['fips_state']}" )

response = requests.get( census_url, params = {'key': my_key, 
                                               'get': data_codes, 
                                               'for': state_fips})

print(response.status_code)
data = json.loads(response.text)
df = pd.DataFrame(data[1:], columns = data[0])
df
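
The call `pd.DataFrame(data[1:], columns = data[0])` works because the parsed response is a list of rows whose first row holds the column names. The sketch below shows the same header-plus-rows idea using only built-in types. The rows in `sample_rows` are placeholders, not real census counts, and we assume that every value arrives as a string.

```python
# Rows shaped like a parsed Census response; the counts are placeholders.
sample_rows = [
    ["P012A018", "NAME", "state"],
    ["111", "Alaska", "02"],
    ["222", "Illinois", "17"],
]

header, rows = sample_rows[0], sample_rows[1:]

# Pair each value with its column name to get one dictionary per row.
records = [dict(zip(header, row)) for row in rows]
for record in records:
    print(record)

# Counts arrive as text, so convert before doing arithmetic.
total = sum(int(record["P012A018"]) for record in records)
print(total)
```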

We can also retrieve the population for specific cities.

In [None]:
my_cities = []
for city in ['Chicago', 'Evanston', 'Wilmette', 'Aurora']:
    for i, data in enumerate(places_by_state['IL']):
        if city in data['Name']:
            print(i, data['Name'], data['GEOID'])
            my_cities.append(i)
            break

print(my_cities)
print()
     
state_fips = 'state:' + state_codes['IL']['fips_state']
# Drop the two-digit state prefix from each GEOID and join the place codes
location_codes = 'place:' + ','.join(
    places_by_state['IL'][i]['GEOID'][2:] for i in my_cities )

print(location_codes)
print()

response = requests.get( census_url, params = {'key': my_key, 
                                               'get': 'P012A001,NAME', 
                                               'for': location_codes, 
                                               'in': state_fips})



print(response.status_code)
print('---')
data = json.loads(response.text)
print(data)
print('---')

df = pd.DataFrame(data[1:], columns = data[0])
df
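
The slice `GEOID[2:]` in the cell above relies on how place identifiers are built. The sketch below assumes each `GEOID` in our data file is a seven-character string made of the two-digit state FIPS code followed by a five-digit place code. The helper name `split_place_geoid` is hypothetical, and `1714000` is the identifier for Chicago, Illinois.

```python
def split_place_geoid(geoid):
    """Split a 7-character place GEOID into (state FIPS, place code)."""
    return geoid[:2], geoid[2:]


state, place = split_place_geoid("1714000")
print(state)  # state part of the identifier
print(place)  # place part, which is what the 'place:' parameter expects
```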

## Refactor our code 

We have written code that can retrieve specific decennial census information, but that code is neither modular nor generalizable. Refactoring it into functions will let us reuse the same logic for other cities, states, and variables.
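As one illustration of the kind of small, reusable piece that refactoring produces, here is a hypothetical helper, `format_geography`, that builds the geography strings we previously assembled by hand. It is a sketch of one possible building block, not the solution to the exercise.

```python
def format_geography(level, codes):
    """Build a geography string such as 'state:02,17'.

    level - str : geographic level, e.g. 'state' or 'place'
    codes - list of str : FIPS codes at that level
    """
    return f"{level}:{','.join(codes)}"


print(format_geography("state", ["02", "17"]))
print(format_geography("place", ["14000"]))
```

A function like this can be tested on its own, without calling the API, which is one of the practical benefits of modular code.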


In [None]:
def create_query_for_census_API( ages, cities, state_code, census_key, 
                                 ethnicity_code = 'A' ):
    """
    Creates a query for retrieving male populations of given ethnicity 
    for a given set of cities
    
    input:
        ages - list : ages of male population to query
        cities - list : fips codes of cities to query 
        state_code - str : fips code of state for cities
        census_key - str : user personal key for census API
        ethnicity_code - str : ethnicity census code (A, B, C, D, H)
        
    output:
        query - dict : params for API query
    """
    # Your code here
    
    return query

response = requests.get( census_url, params = create_query_for_census_API() )