In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# Consuming APIs (and JSON)
<!-- requirement: secrets/twitter_secrets.json.sample -->


Consuming APIs is supposed to be easy (that's the point of having an API).  

Let's look at a simple example of consuming a JSON API.  The example we'll look at is a *geocoder*: That is, a service for converting between addresses and normalized geographic information (e.g. latitude and longitude).  Going from addresses to normalized form is "forward geocoding" and going the other way is "reverse geocoding".

We'll interact with a free (and non-authenticated) geocoder run by OpenStreetMap.  The geocoded information is available by sending a GET request to <tt>http:&#8203;//nominatim.openstreetmap.org/search?q=<i>address</i>&addressdetails=1&format=json</tt>.  The portion before the question mark (`http://nominatim.openstreetmap.org/search`) is the endpoint on the server, while the portion following, known as the *query string*, contains the data being sent to the server.  (Thus, a GET request can be repeated simply by requesting the same URL again.  In contrast, the data sent in a POST request is contained in the request body, not in the URL.)

As is typical, the query string consists of several key=value pairs, separated by ampersands.  The requested address is specified with the `q` key in this case.  Some characters, like spaces and commas, cannot be used in a URL, so they must be encoded with the `urllib2.quote()` function.

In [2]:
import urllib2

address = "1600 Pennsylvania Avenue, Washington, DC"
urllib2.quote(address)

'1600%20Pennsylvania%20Avenue%2C%20Washington%2C%20DC'

In [3]:
url = "http://nominatim.openstreetmap.org/search?q=%s&addressdetails=1&format=json" % urllib2.quote(address)
url

'http://nominatim.openstreetmap.org/search?q=1600%20Pennsylvania%20Avenue%2C%20Washington%2C%20DC&addressdetails=1&format=json'
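For readers on Python 3, where `urllib2` no longer exists, the same encoding can be sketched with the standard library's `urllib.parse` module. This is an illustrative equivalent, not part of the original notebook, and the variable names are chosen just for this example. It also splits the finished URL back into its endpoint and query string.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

address = "1600 Pennsylvania Avenue, Washington, DC"

# quote() percent-encodes characters that are not allowed in a URL.
encoded = quote(address)

# urlencode() builds a whole query string from key/value pairs.
# quote_via=quote encodes spaces as %20 (the default would use '+').
query = urlencode(
    {"q": address, "addressdetails": 1, "format": "json"},
    quote_via=quote,
)
url = "http://nominatim.openstreetmap.org/search?" + query

# urlsplit() separates the endpoint from the query string again,
# and parse_qs() decodes the key=value pairs.
parts = urlsplit(url)
endpoint = parts.scheme + "://" + parts.netloc + parts.path
decoded = parse_qs(parts.query)

print(encoded)
print(endpoint)
print(decoded["q"][0])
```

Building the query string from a dictionary avoids having to remember which values need escaping.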

We can request this URL with the `urlopen()` function, which returns a stream we can read from.

In [4]:
data = urllib2.urlopen(url).read()
data

'[{"place_id":"228527596","licence":"Data \xc2\xa9 OpenStreetMap contributors, ODbL 1.0. https:\\/\\/osm.org\\/copyright","osm_type":"way","osm_id":"564931814","boundingbox":["38.8957842","38.895924","-77.0309688","-77.0304609"],"lat":"38.8958536","lon":"-77.0307129","display_name":"Pennsylvania Ave, Penn Quarter, Washington, District of Columbia, 20006, United States of America","class":"highway","type":"path","importance":0.22875,"address":{"path":"Pennsylvania Ave","suburb":"Penn Quarter","city":"Washington","state":"District of Columbia","postcode":"20006","country":"United States of America","country_code":"us"}},{"place_id":"158306366","licence":"Data \xc2\xa9 OpenStreetMap contributors, ODbL 1.0. https:\\/\\/osm.org\\/copyright","osm_type":"way","osm_id":"397325778","boundingbox":["38.8633822","38.8637409","-76.9467576","-76.945632"],"lat":"38.8636383","lon":"-76.9463651","display_name":"Pennsylvania Avenue, Coral Hills, Prince George\'s County, District of Columbia, 20020, Unit

The result was returned to us in the form of JSON. JSON stands for JavaScript Object Notation: it's a human-readable, text-based format for transmitting key-value pairs (and strings, numbers, and arrays). The `json` package (here `simplejson`, imported under that name) lets us convert between this and Python's native dictionaries, lists, etc.

In [6]:
import simplejson as json

json.loads(data)

[{'address': {'city': 'Washington',
   'country': 'United States of America',
   'country_code': 'us',
   'path': 'Pennsylvania Ave',
   'postcode': '20006',
   'state': 'District of Columbia',
   'suburb': 'Penn Quarter'},
  'boundingbox': ['38.8957842', '38.895924', '-77.0309688', '-77.0304609'],
  'class': 'highway',
  'display_name': 'Pennsylvania Ave, Penn Quarter, Washington, District of Columbia, 20006, United States of America',
  'importance': 0.22875,
  'lat': '38.8958536',
  'licence': u'Data \xa9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'lon': '-77.0307129',
  'osm_id': '564931814',
  'osm_type': 'way',
  'place_id': '228527596',
  'type': 'path'},
 {'address': {'country': 'United States of America',
   'country_code': 'us',
   'county': "Prince George's County",
   'locality': 'Coral Hills',
   'postcode': '20020',
   'road': 'Pennsylvania Avenue',
   'state': 'District of Columbia'},
  'boundingbox': ['38.8633822', '38.8637409', '-76.9467576', '

In [7]:
json.loads(data)[0]['boundingbox']

['38.8957842', '38.895924', '-77.0309688', '-77.0304609']
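As a small offline illustration of the conversion, the sketch below parses a hand-trimmed fragment shaped like the response above, using the standard-library `json` module (Python 3). The `sample` string and the `center` helper are made up for this example. Note that the geocoder returns coordinates as strings, so they need converting before any arithmetic.

```python
import json

# A trimmed, hand-written fragment shaped like the geocoder response.
sample = '''[{"lat": "38.8958536", "lon": "-77.0307129",
  "boundingbox": ["38.8957842", "38.895924", "-77.0309688", "-77.0304609"],
  "importance": 0.22875,
  "address": {"city": "Washington", "country_code": "us"}}]'''

results = json.loads(sample)   # JSON array -> list, JSON object -> dict
first = results[0]


def center(result):
    """Return (lat, lon) as floats; the API delivers them as strings."""
    return float(result["lat"]), float(result["lon"])


lat, lon = center(first)
print(type(results).__name__, type(first).__name__)
print(lat, lon)

# json.dumps() goes the other way, from Python objects back to JSON text.
round_trip = json.loads(json.dumps(results))
```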

Note that this was a public API, with no authentication.  We'll go through an example of the code for an authenticated API at the end -- the example will be the free Twitter stream.  (The reason we didn't do this up front is that you can't run the code without signing up for an API key, etc.)

## Handling URL parameters


The `urllib2` module requires an enormous amount of work to perform the simplest of tasks. The `requests` library provides a higher-level way to do web requests. This is already nice in examples, like the above, where we need to encode parameters into the URL.  It is even more convenient when there are also `POST` parameters (or cookies, or authentication, or...) involved.  (Don't worry if you don't know what that means.)

In [8]:
import requests
def geocode(address):
    params = { 'format'        :'json', 
               'addressdetails': 1, 
               'q'             : address}
    return requests.get('http://nominatim.openstreetmap.org/search', params=params)

response = geocode("107 Page St., San Francisco")

The parameters are automatically encoded and assembled into the query string.

In [9]:
response.url

u'http://nominatim.openstreetmap.org/search?q=107+Page+St.%2C+San+Francisco&addressdetails=1&format=json'

The raw response is available...

In [10]:
response.text

u'[{"place_id":"186558210","licence":"Data \xa9 OpenStreetMap contributors, ODbL 1.0. https:\\/\\/osm.org\\/copyright","osm_type":"way","osm_id":"32121427","boundingbox":["37.773928208333","37.774028208333","-122.42263933333","-122.42253933333"],"lat":"37.7739782083333","lon":"-122.422589333333","display_name":"107, Page Street, Western Addition, SF, California, 94102, United States of America","class":"place","type":"house","importance":0.31025,"address":{"house_number":"107","road":"Page Street","neighbourhood":"Western Addition","city":"SF","county":"SF","state":"California","postcode":"94102","country":"United States of America","country_code":"us"}}]'

...but it can also be parsed from JSON into Python objects.

In [11]:
response.json()

[{u'address': {u'city': u'SF',
   u'country': u'United States of America',
   u'country_code': u'us',
   u'county': u'SF',
   u'house_number': u'107',
   u'neighbourhood': u'Western Addition',
   u'postcode': u'94102',
   u'road': u'Page Street',
   u'state': u'California'},
  u'boundingbox': [u'37.773928208333',
   u'37.774028208333',
   u'-122.42263933333',
   u'-122.42253933333'],
  u'class': u'place',
  u'display_name': u'107, Page Street, Western Addition, SF, California, 94102, United States of America',
  u'importance': 0.31025,
  u'lat': u'37.7739782083333',
  u'licence': u'Data \xa9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  u'lon': u'-122.422589333333',
  u'osm_id': u'32121427',
  u'osm_type': u'way',
  u'place_id': u'186558210',
  u'type': u'house'}]

In [None]:
response.json()[0]['boundingbox']

**Exercise:** The National Weather Service operates a free API for weather information.  A sample request looks like this: `http://forecast.weather.gov/MapClick.php?lat=37.7739&lon=-122.4225&FcstType=json`.

Use the geocoder to write a function

        def weather_at_address(address):
            ....
            
that gets the current weather (temperature, cloudy or not) from a human-entered address.

## Authenticated APIs


Lots of interesting APIs are free (or at least free for moderate use) but still require you to register first.  The `requests` library (together with some supporting ones, e.g. `requests_oauthlib`) makes it easy to consume these too.

**Exercise:** In order to access the Twitter API, you must first sign up: create an app on http://apps.twitter.com, get an access token, et voila, you have your shiny new credentials, consisting of four pieces of data. The file `secrets/twitter_secrets.json.sample` in the datacourse repo is a template for the format. Fill it in, then rename the file to have a `.nogit` extension to prevent it being tracked in the repository.

In [12]:
from requests_oauthlib import OAuth1

with open("secrets/twitter_secrets.json.nogit") as fh:
    secrets = json.loads(fh.read())

# create an auth object
auth = OAuth1(
    secrets["api_key"],
    secrets["api_secret"],
    secrets["access_token"],
    secrets["access_token_secret"]
)

IOError: [Errno 2] No such file or directory: 'secrets/twitter_secrets.json.nogit'
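The `IOError` above is what happens when the secrets file has not been created yet. A minimal sketch (Python 3, standard library only) of a loader that fails with a clearer message might look like this. `load_secrets` and `REQUIRED_KEYS` are hypothetical names for this example, and the four key names are taken from the cell above.

```python
import json
import os

# The four credential fields used in the OAuth1 call above.
REQUIRED_KEYS = ("api_key", "api_secret", "access_token", "access_token_secret")


def load_secrets(path):
    """Load credentials from a JSON file, checking that nothing is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            "%s not found: copy the .sample template, fill it in, "
            "and rename it" % path
        )
    with open(path) as fh:
        secrets = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in secrets]
    if missing:
        raise KeyError("missing credential fields: " + ", ".join(missing))
    return secrets
```

It would be called as `secrets = load_secrets("secrets/twitter_secrets.json.nogit")` in place of the `open`/`json.loads` pair.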

Let's see all of Michael's friends.

In [13]:
r = requests.get(
    "https://api.twitter.com/1.1/friends/ids.json",
    auth=auth,
    params={'screen_name' : 'tianhuil'}
)
michaels_friends=r.json()

r2 = requests.post(
    'https://api.twitter.com/1.1/users/lookup.json',
    auth=auth,
    data={'user_id' : michaels_friends['ids'][:50]}
)
friends_info = r2.json()
[(f['screen_name'], f['name']) for f in friends_info]

NameError: name 'auth' is not defined

Requests also makes it easy to deal with simple streaming APIs.  Let's stream 100 tweets from the Twitter feed.

In [14]:
import sys
r_stream = requests.get('https://stream.twitter.com/1.1/statuses/sample.json', auth=auth, stream=True)
counter = 0
for line in r_stream.iter_lines():
    # filter out keep-alive new lines
    if not line:
        continue
    tweet = json.loads(line)
    if 'text' in tweet:
        counter +=1
        print tweet['text']
    sys.stdout.flush()
    if counter >= 100:
        break

NameError: name 'auth' is not defined

We can restrict the location to make English-language tweets more likely.

In [15]:
from itertools import islice  # Question: what does islice do?

r_stream = requests.post('https://stream.twitter.com/1.1/statuses/filter.json', auth=auth,
                          stream=True, data={"locations" : "-125,23,-70,50"} )
for line in islice(r_stream.iter_lines(), 100):
    # filter out keep-alive new lines
    if not line:
        continue
    tweet = json.loads(line)
    if 'text' in tweet:
        print tweet['text']
    sys.stdout.flush()

NameError: name 'auth' is not defined
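To answer the question in the comment above: `islice` takes at most a given number of items from any iterator, without reading the rest. One subtlety in that cell is that the 100 lines it takes include any blank keep-alive lines, so fewer than 100 tweets may be printed. The offline sketch below (Python 3) uses a made-up `fake_stream` generator in place of `r_stream.iter_lines()` and filters before slicing.

```python
import json
from itertools import islice


def fake_stream():
    """Stand-in for r_stream.iter_lines(): endless lines, some blank."""
    n = 0
    while True:
        yield b""                                   # keep-alive newline
        yield json.dumps({"text": "tweet %d" % n}).encode("utf-8")
        n += 1


# Slicing the raw lines counts the blank keep-alives too.
raw = [line for line in islice(fake_stream(), 4) if line]

# Filtering first, then slicing, yields exactly four real messages.
non_blank = (line for line in fake_stream() if line)
tweets = [json.loads(line)["text"] for line in islice(non_blank, 4)]

print(len(raw))
print(tweets)
```

Because `islice` is lazy, it works even though `fake_stream` never ends, which is also why it suits a streaming API.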

## API Request Limitations


Some authenticated APIs have hard limits on the total number of requests that one user can make in one day. An API service that uses a freemium or paid service model will enforce a limit to encourage high-volume users to pay for better data access. API providers also do this to force software developers to be disciplined and thoughtful in their use of the API service.

Any API may also have soft limits based on some ambiguous definition of excessive use. Google, for example, will block your IP address if you make too many requests to its services too quickly. Presumably this is done with a machine learning algorithm built specifically for this purpose. Bloomberg has a Python API associated with its desktop terminal application. It will revoke access if you exceed daily or monthly hard limits, but unfortunately the specifics of those limits are not shared with users.

These limits create challenges for the cost-conscious data scientist. Happily, Python has tools to help. One of them is the [ediblepickle](https://pypi.python.org/pypi/ediblepickle/1.1.3) package. This package provides a  convenient facility for caching the results of function calls. This can help prevent unnecessary duplicate requests to an API.

In the example below, the previous `geocode` function is modified with ediblepickle's `checkpoint` decorator. It wraps the `geocode2` function with additional functionality to cache the results of the first function call in a pickle file. The results are stored in a file whose name depends on the function arguments.

If this function is called a second time with the same function arguments, the `checkpoint` decorator will intercept the call and retrieve the results from the cached pickle file.

It is important that the filename be a valid filename that is unique to the function parameters. In this example, we use `urllib2.quote` to escape characters and generate a proper filename.

In [16]:
from ediblepickle import checkpoint
import os

cache_dir = 'cache'
if not os.path.exists(cache_dir):
    os.mkdir(cache_dir)

@checkpoint(key=lambda args, kwargs: urllib2.quote(args[0]) + '.p', work_dir=cache_dir)
def geocode2(address):
    params = { 'format'        :'json', 
               'addressdetails': 1, 
               'q'             : address}
    print 'making API request...'
    result = requests.get('http://nominatim.openstreetmap.org/search', params=params)
    print 'API request complete.'
    return result
    
address = "City Hall Park, New York, NY 10007"

In [17]:
%%time

# this created the cached file. observe the creation of a new pickle file in the cache directory.
response = geocode2(address)
print response.json()

making API request...
API request complete.
[{u'display_name': u'City Hall Park, Tribeca, Manhattan Community Board 1, New York County, NYC, New York, 10005, United States of America', u'importance': 0.6375, u'place_id': u'159216540', u'lon': u'-74.0062428541059', u'lat': u'40.71262475', u'osm_type': u'way', u'licence': u'Data \xa9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', u'osm_id': u'413055246', u'boundingbox': [u'40.7118001', u'40.7139825', u'-74.0078192', u'-74.0043418'], u'type': u'park', u'class': u'leisure', u'address': {u'city': u'NYC', u'country': u'United States of America', u'park': u'City Hall Park', u'county': u'New York County', u'state': u'New York', u'postcode': u'10005', u'country_code': u'us', u'neighbourhood': u'Tribeca'}}]
CPU times: user 8.66 ms, sys: 85 µs, total: 8.74 ms
Wall time: 231 ms


In [18]:
%%time

# this reads the cached file. observe that this executes ~100x faster.
# the print statements in the geocode2 function do not appear because the function itself is not executed at all.
response = geocode2(address)
print response.json()

[{u'display_name': u'City Hall Park, Tribeca, Manhattan Community Board 1, New York County, NYC, New York, 10005, United States of America', u'importance': 0.6375, u'place_id': u'159216540', u'lon': u'-74.0062428541059', u'lat': u'40.71262475', u'osm_type': u'way', u'licence': u'Data \xa9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright', u'osm_id': u'413055246', u'boundingbox': [u'40.7118001', u'40.7139825', u'-74.0078192', u'-74.0043418'], u'type': u'park', u'class': u'leisure', u'address': {u'city': u'NYC', u'country': u'United States of America', u'park': u'City Hall Park', u'county': u'New York County', u'state': u'New York', u'postcode': u'10005', u'country_code': u'us', u'neighbourhood': u'Tribeca'}}]
CPU times: user 2.65 ms, sys: 0 ns, total: 2.65 ms
Wall time: 2.34 ms
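To make the mechanism concrete, here is a minimal sketch of a file-backed cache decorator written with the standard library only (Python 3). It is not ediblepickle's implementation, just an illustration of the idea described above. `pickle_cache`, `slow_lookup`, `demo_dir`, and the `calls` list are hypothetical names for this example.

```python
import functools
import os
import pickle
import tempfile
from urllib.parse import quote


def pickle_cache(work_dir):
    """Cache a one-argument function's results in pickle files."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(arg):
            # safe='' also escapes '/', so the key is a single filename.
            path = os.path.join(work_dir, quote(str(arg), safe="") + ".p")
            if os.path.exists(path):
                with open(path, "rb") as fh:
                    return pickle.load(fh)      # cache hit: skip the call
            result = func(arg)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)         # cache miss: store result
            return result
        return wrapper
    return decorator


calls = []                                      # records real invocations
demo_dir = tempfile.mkdtemp()                   # throwaway cache directory


@pickle_cache(demo_dir)
def slow_lookup(address):
    calls.append(address)                       # stands in for an API request
    return {"query": address}


first = slow_lookup("City Hall Park, New York, NY 10007")
second = slow_lookup("City Hall Park, New York, NY 10007")
print(len(calls))
```

Only the first call runs the function body; the second is served from the pickle file. Deleting that file from the cache directory forces a fresh call.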


### Exercises


1. Write a Python script that takes as input an address and outputs 50 tweets from within about 10 miles of it.
Now modify it to return the top 10 hashtags within that 10 mile range (based on, say, a 1000 tweet sample).
1. You can plot maps using this [Python Package](http://peak5390.wordpress.com/2012/12/08/matplotlib-basemap-tutorial-plotting-points-on-a-simple-map/).  Get geo-located tweets from the streaming API and plot them on the map.

### Further reading for this lecture


To learn more about JSON (there isn't much more to know!):
 - http://www.secretgeek.net/json_3mins.asp
 - http://en.wikipedia.org/wiki/JSON (esp. "Data types, syntax, and examples")
 - http://tools.ietf.org/html/rfc7159

A useful tool for playing with JSON on the command line is [jq](http://stedolan.github.io/jq/).

To learn more about the prevailing design pattern ("REST") for web-based APIs:
 - http://en.wikipedia.org/wiki/Representational_state_transfer
 
One wildcard is the wide variety of authentication strategies employed ("basic auth", cookies, bearer token, OAuth, OAuth 2, etc.).  For several of these, the documentation at http://docs.python-requests.org/en/latest/user/authentication/ is helpful.

### Exit Tickets

1. Explain the difference between requests.get() and requests.post().
2. What data structures do JSON objects in Python use?
3. Describe what the remote site is doing when it receives an API request from you.

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*