# Week 10: Application programming interfaces

Government agencies, research institutes, and non-profit organizations have experienced a push to make data public and easily available. Historically, they might have published information in relatively unstructured forms, such as written documents or coarse summary tables. Even when data were tabulated, they were often rendered in unstructured HTML or PDF files, making import difficult.

A recent trend, however, is to post data in structured forms. Because it’s difficult to anticipate what combination of variables and criteria users will want, an organization will often allow you to query its database using an **application programming interface (API)**. An API includes a set of documented commands that a database will recognize and respond to.

APIs take many forms, but perhaps the most recognizable kind are [web APIs](https://en.wikipedia.org/wiki/Application_programming_interface#Web_APIs), which are based on the hypertext transfer protocol (HTTP). Any software that implements HTTP — such as a browser, Python, R, or curl — can use the API to connect to the database and retrieve a subset of data in a structured form.

## Objectives

An API is not necessarily the best way to access data for a particular project. But learning to use APIs is helpful for three reasons:

- Using APIs helps you understand how a data source organizes its data, especially if it has a complex structure
- Having to process data from an API provides excellent practice in normalizing data sets
- In the future you’re likely to encounter the need for *streaming* data, i.e., up-to-date data provided in response to specific user demands

## Practical note

Before we begin, please observe this caveat:

- **Do not send API requests as part of a loop.** 

Most databases limit the number of requests your IP address is allowed to make in a day. Instead of requesting records one at a time, specify each request so that it retrieves multiple records.

In addition, some APIs require you to include an **API key** (unique identifier) in each request, so the database can track who is making each request. In the interest of time, I am letting you use mine. If you decide to query these data sets later, please apply for and use your own key. 

# API requests

The fastest way to understand how an API works is to dive right in. We'll start by sending an API request, using a browser.

## The *Sunlight Congress* API

The [Sunlight Foundation](http://sunlightfoundation.com) maintains a project, called Sunlight Labs, which provides data about the U.S. political process through several APIs. The following examples involve the [Sunlight Congress API](https://sunlightlabs.github.io/congress/).

### Root URL

APIs typically have a **root** URL. You'll be sending requests to this address, along with additional code.

In [96]:
# set the root url

root_url = "https://congress.api.sunlightfoundation.com/"

### Endpoints

A given API will often have a few methods, called **endpoints**, for retrieving different parts of a data set. The *Congress* API has several endpoints, including:

- `legislators`
- `legislators/locate`
- `committees`
- `bills`
- `votes`
- *etc...*

In [97]:
# Set the string for a few endpoints

endpoint_legislators = "legislators"
endpoint_locate = "legislators/locate"
endpoint_votes = "votes"

### API keys

Most APIs will severely limit the data you can retrieve, or refuse to respond at all, unless you supply an **API key**. The key is a unique string that identifies you and tracks how much you are using the API. For now, you are welcome to use my key, with the understanding that you will not use it outside of class. (Requesting your own key from the Sunlight Foundation is a fairly quick process.)

In [98]:
# Garcia's API key. Please don't get me banned!

api_key = "a1b52caa60f74a8cac8c15130de6303a"

## Sending an API request

An API request is a URL containing the endpoint address and some additional text, which together specify a **query** for the desired data. Any HTTP client, such as your browser, can send a request. Run the code below, and copy and paste the output (which should be a URL) into a new tab in your browser.

In [99]:
root_url = "https://congress.api.sunlightfoundation.com/"
endpoint_legislators = "legislators"
key = "apikey=" + api_key
query = "&state=FL&in_office=true&chamber=senate"
url = root_url + endpoint_legislators + "?" + key + query

print url

https://congress.api.sunlightfoundation.com/legislators?apikey=a1b52caa60f74a8cac8c15130de6303a&state=FL&in_office=true&chamber=senate


What did going to this URL do? It located the server at `root_url`, activated the method specified by `endpoint_legislators`, and passed along the text after the `?` symbol. This text was parsed into the following information:

- `apikey` : `a1b52caa60f74a8cac8c15130de6303a`
- `state` : `FL`
- `in_office` : `true`
- `chamber` : `senate`
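
The parsing step described above can be imitated locally with the standard library. This is only a sketch of the idea, not the server's actual code; `YOUR_KEY` is a placeholder, and the `try`/`except` import is there so the snippet runs under both Python 3 and the Python 2 used in these notes.

```python
try:
    from urllib.parse import urlsplit, parse_qsl
except ImportError:  # Python 2
    from urlparse import urlsplit, parse_qsl

# Same shape as the request URL above; YOUR_KEY is a placeholder, not a real key.
url = ("https://congress.api.sunlightfoundation.com/legislators"
       "?apikey=YOUR_KEY&state=FL&in_office=true&chamber=senate")

parts = urlsplit(url)
endpoint = parts.path.lstrip("/")      # text between the root URL and the "?"
params = dict(parse_qsl(parts.query))  # text after the "?", split on "&" and "="

print(endpoint)
for name in sorted(params):
    print("%s : %s" % (name, params[name]))
```

Each `name=value` pair after the `?` becomes one entry in `params`, which mirrors the list of parsed parameters above.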

The server verified that the `apikey` is active, and then searched through its vast trove of Congressional data, selecting the two U.S. senators from Florida who are currently in office. The `legislators` method assembled their data into a specific structure and sent it to your browser, which recognized it as a plain text file and displayed it as best it could.

Although the resulting web page looks like a mess, try to tease out its structure. The text you received is called a [JSON](https://en.wikipedia.org/wiki/JSON) document, which resembles a Python dictionary. Pay attention to the curly braces and square brackets. You should be able to infer the following nested structure:

In [100]:
senator_results = {
    'count' : 2,
    'page' : {
        'count' : 2,
        'per_page' : 20,
        'page' : 1
    },
    'results' : [
        { # dictionary of attributes of Marco Rubio ... ,
        },
        { # dictionary of attributes of Bill Nelson ... 
        }
    ]
}

The API results are a dictionary with three keys: `count`, `page`, and `results`. `count` is the total number of results matching the query. `page` is a dictionary with information about the results that were actually retrieved; the API does not necessarily send back all the results represented by `count`. Finally `results` is a list of dictionaries, one per senator, each with keys representing the variables of the data set.
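
To practice navigating this structure without sending a request, you can parse a small JSON string with the standard `json` module. The string below is a trimmed stand-in for the real response: it has the same three top-level keys, but only a few of the fields for each senator.

```python
import json

# A trimmed stand-in for the response body: same three top-level keys,
# but only a few of the fields per senator.
text = '''
{"results": [{"first_name": "Marco", "last_name": "Rubio", "state_rank": "junior"},
             {"first_name": "Bill", "last_name": "Nelson", "state_rank": "senior"}],
 "count": 2,
 "page": {"count": 2, "per_page": 20, "page": 1}}
'''

doc = json.loads(text)            # JSON object -> dict, JSON array -> list

total = doc["count"]              # number of records matching the query
retrieved = doc["page"]["count"]  # number of records on this page
last_names = [person["last_name"] for person in doc["results"]]

print("%d matching, %d on this page" % (total, retrieved))
print(", ".join(last_names))
```

Comparing `total` with `retrieved` is a quick way to tell whether the API sent back everything that matched the query.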

## Using `requests` to get data from an API

The `requests` module greatly simplifies API requests. It helps you form the request url in a consistent, repeatable way, and to process the results that come back. `requests` is included in the Anaconda package list, so you can `import` it right away.

In [101]:
import requests

We use `requests.get()` to send the request. The function takes several parameters, but we need only two: `url`, which should be the full endpoint URL, and `params`, which should be a dictionary of the conditions that define your query.

In [102]:
# Form the endpoint url

root_url = "https://congress.api.sunlightfoundation.com/"
endpoint_legislators = "legislators"

endpoint_url = root_url + endpoint_legislators

# List the query parameters in a dictionary

query = {
    'apikey'    : api_key,
    'state'     : "FL",
    'chamber'   : "senate",
    'in_office' : "true"
}

response = requests.get(url = endpoint_url, params = query)

The `get` function returns an object called a `Response`. It contains the url that was sent. Notice that it inserted the `?` and `&` symbols appropriately.

In [103]:
# Check that get() created the correct URL

print response.url

https://congress.api.sunlightfoundation.com/legislators?chamber=senate&state=FL&apikey=a1b52caa60f74a8cac8c15130de6303a&in_office=true
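
As a rough approximation of how the query string is assembled, the standard library's `urlencode` produces the same kind of text. This is a sketch for illustration, not the internals of `requests`; `YOUR_KEY` is a placeholder. The parameters are given as a list of pairs so that their order is fixed, whereas a Python 2 dictionary has no guaranteed order, which is why the URL printed above lists them in a different order than the `query` dictionary.

```python
try:
    from urllib.parse import urlencode
except ImportError:  # Python 2
    from urllib import urlencode

endpoint_url = "https://congress.api.sunlightfoundation.com/legislators"

# A list of pairs keeps the parameter order fixed; YOUR_KEY is a placeholder.
query = [
    ("apikey", "YOUR_KEY"),
    ("state", "FL"),
    ("chamber", "senate"),
    ("in_office", "true"),
]

query_string = urlencode(query)  # joins pairs with "=" and "&", escaping as needed
full_url = endpoint_url + "?" + query_string

print(full_url)
```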


`response` contains the data sent back, which you can examine directly in its `text` attribute.

In [104]:
print response.text

{"results":[{"bioguide_id":"R000595","birthday":"1971-05-28","chamber":"senate","contact_form":"http://www.rubio.senate.gov/public/index.cfm/contact","crp_id":"N00030612","district":null,"facebook_id":"178910518800987","fax":"202-228-0285","fec_ids":["S0FL00338"],"first_name":"Marco","gender":"M","govtrack_id":"412491","icpsr_id":41102,"in_office":true,"last_name":"Rubio","lis_id":"S350","middle_name":null,"name_suffix":null,"nickname":null,"oc_email":"Sen.Rubio@opencongress.org","ocd_id":"ocd-division/country:us/state:fl","office":"284 Russell Senate Office Building","party":"R","phone":"202-224-3041","senate_class":3,"state":"FL","state_name":"Florida","state_rank":"junior","term_end":"2017-01-03","term_start":"2011-01-05","thomas_id":"02084","title":"Sen","twitter_id":"SenRubioPress","votesmart_id":1601,"website":"http://www.rubio.senate.gov","youtube_id":"SenatorMarcoRubio"},{"bioguide_id":"N000032","birthday":"1942-09-29","chamber":"senate","contact_form":"http://www.billnelson.se

## Processing data from an API request

Before you can work with the API data, you need to parse it into a useful Python data structure and normalize it if necessary. 

Start by taking advantage of any existing structure, which in this case comes from the JSON formatting. The `json()` method of `response` can interpret this formatting and convert it into a Python dictionary. (Ignore the `u` prefixes, which mark unicode strings in Python 2; we won't see them later.)

In [105]:
data = response.json()
print type(data)
data

<type 'dict'>


{u'count': 2,
 u'page': {u'count': 2, u'page': 1, u'per_page': 20},
 u'results': [{u'bioguide_id': u'R000595',
   u'birthday': u'1971-05-28',
   u'chamber': u'senate',
   u'contact_form': u'http://www.rubio.senate.gov/public/index.cfm/contact',
   u'crp_id': u'N00030612',
   u'district': None,
   u'facebook_id': u'178910518800987',
   u'fax': u'202-228-0285',
   u'fec_ids': [u'S0FL00338'],
   u'first_name': u'Marco',
   u'gender': u'M',
   u'govtrack_id': u'412491',
   u'icpsr_id': 41102,
   u'in_office': True,
   u'last_name': u'Rubio',
   u'lis_id': u'S350',
   u'middle_name': None,
   u'name_suffix': None,
   u'nickname': None,
   u'oc_email': u'Sen.Rubio@opencongress.org',
   u'ocd_id': u'ocd-division/country:us/state:fl',
   u'office': u'284 Russell Senate Office Building',
   u'party': u'R',
   u'phone': u'202-224-3041',
   u'senate_class': 3,
   u'state': u'FL',
   u'state_name': u'Florida',
   u'state_rank': u'junior',
   u'term_end': u'2017-01-03',
   u'term_start': u'2011-01-

The legislator data are stored under the key `results`, so we can extract them into a new variable, `results`, which is a list of dictionaries. Using this data structure, we can access information about each senator.

In [106]:
results = data['results']

for senator in results:
    print "%s %s, the %s U.S. senator from %s, is on Twitter: @%s" %(
        senator['first_name'], 
        senator['last_name'],
        senator['state_rank'],
        senator['state_name'],
        senator['twitter_id'])


Marco Rubio, the junior U.S. senator from Florida, is on Twitter: @SenRubioPress
Bill Nelson, the senior U.S. senator from Florida, is on Twitter: @SenBillNelson


### *Exercise*

In [107]:
# Retrieve the data for all House members of your favorite state,
# as a dictionary, and print the number of results. See the 
# website below to check your number. Repeat for California.

print "https://en.wikipedia.org/wiki/United_States_congressional_apportionment"

https://en.wikipedia.org/wiki/United_States_congressional_apportionment


## Referencing entities between API endpoints

Suppose that we wanted to keep track of how California representatives voted on certain bills, and compare it against biographical information, such as their party and end of term. Or perhaps you'd like to automatically generate a list of Tweets mentioning each one by their handle and saying how they voted. These use cases require us to join data from two different endpoints: `legislators` and `votes`.

### Getting the biographical information

Start by requesting the biographical data for California representatives. To ensure that we get all the results on the first page, set `per_page` to `"all"`.

In [108]:
query = {
    'apikey'    : api_key,
    'state'     : "CA",
    'chamber'   : "house",
    'in_office' : "true",
    'per_page'  : "all"
}

response = requests.get(endpoint_url, params = query)
data = response.json()
print data['count']  # should get 53 in 2015

53
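
To see why `per_page` matters here, consider how many requests the same query would need at the page size of 20 reported in the earlier response. The helper below is hypothetical (it is not part of the API or of `requests`) and simply does the arithmetic.

```python
def pages_needed(count, per_page):
    """Number of requests needed to retrieve `count` records, `per_page` at a time."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return (count + per_page - 1) // per_page  # ceiling division with integers

print(pages_needed(53, 20))  # California's House delegation at 20 per page
print(pages_needed(2, 20))   # Florida's two senators fit on a single page
```

Asking for all results in one page keeps this to a single request, in line with the practical note about not sending requests in a loop.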


We first isolate the `results` and focus on the following fields: `bioguide_id`, `first_name`, `last_name`, `party`, `twitter_id`, and `term_end`.

`bioguide_id` is critical, because it uniquely identifies each legislator and is how we will match legislators to their votes from the `votes` endpoint later.

In [109]:
results = data['results']
print type(results), len(results)

<type 'list'> 53


### *Exercise*

In [110]:
# E: Write a loop that creates a list of dictionaries 
#    containing the keys indicated above.

If you don't feel like being selective, you can convert the entire list into a `DataFrame` using `pandas`. A list of dictionaries maps naturally onto a `DataFrame`: each dictionary becomes a row and each key becomes a column, so we can coerce `results` into a `DataFrame` type.

In [112]:
# Import the module by name, since later cells call pandas.DataFrame and pandas.merge
import pandas

df = pandas.DataFrame(results)
df.columns

Index([    u'bioguide_id',        u'birthday',         u'chamber',
          u'contact_form',          u'crp_id',        u'district',
           u'facebook_id',             u'fax',         u'fec_ids',
            u'first_name',          u'gender',     u'govtrack_id',
              u'icpsr_id',       u'in_office',       u'last_name',
       u'leadership_role',     u'middle_name',     u'name_suffix',
              u'nickname',        u'oc_email',          u'ocd_id',
                u'office',           u'party',           u'phone',
                 u'state',      u'state_name',        u'term_end',
            u'term_start',       u'thomas_id',           u'title',
            u'twitter_id',    u'votesmart_id',         u'website',
            u'youtube_id'],
      dtype='object')

In [113]:
df[['bioguide_id', 'first_name', 'last_name', 'party', 'twitter_id']].head()

| | bioguide_id | first_name | last_name | party | twitter_id |
|---|---|---|---|---|---|
| 0 | W000820 | Mimi | Walters | R | RepMimiWalters |
| 1 | T000474 | Norma | Torres | D | NormaJTorres |
| 2 | L000582 | Ted | Lieu | D | RepTedLieu |
| 3 | A000371 | Pete | Aguilar | D | reppeteaguilar |
| 4 | K000387 | Steve | Knight | R | SteveKnight25 |


### Get the voting information

How did these representatives vote on bills? Those data are provided through a different endpoint: `votes`. This endpoint allows you to query by `bill_id`, so we'll look at the contentious "Protecting Cyber Networks Act" (HR1560-114). A bill usually has several votes (for example, on amendments), but we can filter these out by setting the `vote_type` to `"passage"`. The `votes` endpoint has so many fields that you need to explicitly specify the ones you want returned, as a comma-separated string. For our purposes, it's just the `voter_ids` dictionary, but we also include info about the `bill` to verify that we've retrieved the right data. ([Detailed documentation about Votes here.](https://sunlightlabs.github.io/congress/votes.html))

In [114]:
endpoint_url = root_url + endpoint_votes

query = {
    'apikey'    : api_key,
    'bill_id'   : "hr1560-114",
    'vote_type' : "passage",
    'fields'    : "bill,voter_ids"
}

response = requests.get(endpoint_url, params = query)
data = response.json()
data.keys()

[u'count', u'results', u'page']

As with the `legislators` endpoint, `votes` returns a dictionary, with the desired results associated with the key `results`. 

In [115]:
results = data['results']
print len(results)
results[0].keys()

1


[u'voter_ids', u'bill']

Check bill info...

In [116]:
bill = results[0]['bill'] 
print "%s: %s \n\n%s" %(bill['bill_id'], 
                     bill['short_title'], 
                     bill['official_title'])

hr1560-114: Protecting Cyber Networks Act 

To improve cybersecurity in the United States through enhanced sharing of information about cybersecurity threats, to amend the Homeland Security Act of 2002 to enhance multi-directional sharing of information related to cybersecurity risks and strengthen privacy and civil liberties protections, and for other purposes.


... and get the votes. The dictionary `voter_ids` has `bioguide_id`s as keys, and strings (such as "Yea" or "Nay") as the associated values. We'll look up Norma Torres's id to check that a vote is recorded.

In [117]:
votes = results[0]['voter_ids']
print len(votes)
print votes['T000474']

431
Yea


To convert this dictionary to a DataFrame, we actually need the keys to be values in their own column. To facilitate this conversion, use a comprehension to create a list of tuples, each pairing the `bioguide_id` with the `vote`.

Pandas can coerce a list of tuples into a `DataFrame`, as long as you specify the variable names in a list, using the `columns` parameter.

In [118]:
temp = [(key, votes[key]) for key in votes.keys()]

vf = pandas.DataFrame(temp, columns = ['bioguide_id', 'vote'])
vf.head()

| | bioguide_id | vote |
|---|---|---|
| 0 | L000584 | Yea |
| 1 | L000582 | Nay |
| 2 | L000583 | Nay |
| 3 | L000580 | Yea |
| 4 | L000581 | Yea |
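
The reshaping step can also be seen in isolation, without pandas. The sketch below uses three entries copied from the outputs in this section (the real dictionary has 431) and relies on `items()`, which yields the same key and value pairs as the comprehension above.

```python
# Three entries copied from the outputs in this section; the real dictionary has 431.
votes = {"L000584": "Yea", "L000582": "Nay", "T000474": "Yea"}

# One (bioguide_id, vote) tuple per legislator; sorted so the order is repeatable.
rows = sorted(votes.items())

# The same rows as a list of dictionaries, the shape used earlier for `results`.
records = [{"bioguide_id": key, "vote": value} for key, value in rows]

for row in rows:
    print("%s %s" % row)
```

Either shape works as input for a table: with tuples you supply the column names separately, while with dictionaries the keys serve as column names.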


Now we can merge (join) these data on the common variable `bioguide_id`.

In [119]:
merged = pandas.merge(df, vf, on = 'bioguide_id')
twitter_blast = merged[['bioguide_id', 'first_name', 'last_name',
                        'party', 'twitter_id', 'vote']]
twitter_blast.head()

| | bioguide_id | first_name | last_name | party | twitter_id | vote |
|---|---|---|---|---|---|---|
| 0 | W000820 | Mimi | Walters | R | RepMimiWalters | Yea |
| 1 | T000474 | Norma | Torres | D | NormaJTorres | Yea |
| 2 | L000582 | Ted | Lieu | D | RepTedLieu | Nay |
| 3 | A000371 | Pete | Aguilar | D | reppeteaguilar | Yea |
| 4 | K000387 | Steve | Knight | R | SteveKnight25 | Yea |
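
By default, `pandas.merge` performs an inner join, so only legislators whose `bioguide_id` appears in both tables survive; the votes of representatives from other states drop out. The plain-Python sketch below illustrates that idea on a few rows copied from the tables above. It is not how pandas implements the merge.

```python
# A few rows copied from the tables above; L000584 appears only in the votes.
legislators = [
    {"bioguide_id": "W000820", "last_name": "Walters", "party": "R"},
    {"bioguide_id": "T000474", "last_name": "Torres", "party": "D"},
    {"bioguide_id": "L000582", "last_name": "Lieu", "party": "D"},
]
votes = {"W000820": "Yea", "T000474": "Yea", "L000582": "Nay", "L000584": "Yea"}

# Inner join: keep a legislator only if the key also appears in `votes`,
# and attach the matching vote to a copy of the row.
merged = []
for person in legislators:
    key = person["bioguide_id"]
    if key in votes:
        row = dict(person)
        row["vote"] = votes[key]
        merged.append(row)

for row in merged:
    print("%s (%s): %s" % (row["last_name"], row["party"], row["vote"]))
```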


In [120]:
# iterrows() yields (index label, row) pairs, so no positional lookup is needed
for _, each in twitter_blast.iterrows():
    print "%s %s (%s) voted %s on the Cybersecurity Bill. @%s" %(
           each['first_name'], each['last_name'], each['party'], 
           each['vote'], each['twitter_id'])

Mimi Walters (R) voted Yea on the Cybersecurity Bill. @RepMimiWalters
Norma Torres (D) voted Yea on the Cybersecurity Bill. @NormaJTorres
Ted Lieu (D) voted Nay on the Cybersecurity Bill. @RepTedLieu
Pete Aguilar (D) voted Yea on the Cybersecurity Bill. @reppeteaguilar
Steve Knight (R) voted Yea on the Cybersecurity Bill. @SteveKnight25
Mark DeSaulnier (D) voted Yea on the Cybersecurity Bill. @RepDeSaulnier
Scott Peters (D) voted Yea on the Cybersecurity Bill. @RepScottPeters
Juan Vargas (D) voted Yea on the Cybersecurity Bill. @RepJuanVargas
Alan Lowenthal (D) voted Nay on the Cybersecurity Bill. @RepLowenthal
Mark Takano (D) voted Nay on the Cybersecurity Bill. @RepMarkTakano
Raul Ruiz (D) voted Yea on the Cybersecurity Bill. @CongressmanRuiz
Tony Cárdenas (D) voted Yea on the Cybersecurity Bill. @RepCardenas
Julia Brownley (D) voted Yea on the Cybersecurity Bill. @JuliaBrownley26
David Valadao (R) voted Yea on the Cybersecurity Bill. @RepDavidValadao
Eric Swalwell (D) voted Yea on t