# Working with data-access APIs over the web

Notebook developed by Sam Maurer

In Part 1, we'll load and parse results from an API feed of earthquake data.  
In Part 2, we'll add query parameters to the workflow, using the Mapbox Geocoding API as an example.  
In Part 3, we'll use an authenticated API to query public Twitter posts. 

# Part 1: Reading from an automated data feed

### USGS real-time earthquake feeds

This is an API for near-real-time data about earthquakes. Data is provided in JSON format over the web. No authentication is needed, and there's no way to customize the output. Instead, the API has a separate endpoint for each permutation of the data that users might want.

**API documentation:**  
http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php

**Sample API endpoint, for magnitude 4.5+ earthquakes in the past day:**  
http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.geojson  
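
Since each variation of the feed has its own endpoint, choosing a data set comes down to choosing a URL. Here's a minimal sketch of that idea. It assumes the summary feeds follow the pattern `<magnitude>_<period>.geojson`, which matches the sample URLs used in this notebook; `feed_url` is a made-up helper name, and the API documentation lists which combinations actually exist.

```python
# assumption: summary feeds follow the pattern <magnitude>_<period>.geojson,
# matching the sample URLs used in this notebook
base = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/'

def feed_url(magnitude, period):
    """Build a feed URL from a magnitude threshold and a time period."""
    return base + magnitude + '_' + period + '.geojson'

print(feed_url('4.5', 'day'))
print(feed_url('2.5', 'week'))
```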


In [None]:
import pandas as pd

import json      # library for working with JSON-formatted text strings
import requests  # library for accessing content from web URLs

import pprint    # library for cleanly printing Python data structures
pp = pprint.PrettyPrinter()

In [None]:
# download data on magnitude 2.5+ quakes from the past week

endpoint_url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.geojson"
response = requests.get(endpoint_url)
results = response.text

# what's the data type of the results?

type(results)

#### New syntax -- `<list>[<position>]` and `<list>[<start>:<end>]`, RealPython tutorial [here](https://realpython.com/python-lists-tuples/)

In Python, you can retrieve an item from a list by referring to its position, using integers to enumerate the items. Counting begins with 0, as in most programming languages.

You can also get multiple consecutive items: `my_list[0:10]` will give you the first 10. (You get from item `0` to just *before* item `10`). And you can leave out an argument if you want to start at the beginning, or continue to the end.

The same syntax works with other data types that are similar to lists, like DataFrames or strings.
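
Here's a quick illustration of the position and range syntax, using a made-up list of magnitudes and a short string.

```python
# a made-up list of magnitudes, just for illustration
magnitudes = [4.6, 5.1, 4.9, 6.2, 4.5]

first = magnitudes[0]         # first item (position 0)
first_two = magnitudes[0:2]   # positions 0 and 1, stopping before position 2
from_third = magnitudes[2:]   # position 2 through the end
up_to_third = magnitudes[:3]  # beginning through position 2

# the same syntax works on strings
text = '{"type": "FeatureCollection"}'
start = text[:7]

print(first, first_two, from_third, up_to_third, start)
```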

In [None]:
# print the first 500 characters to see a sample of the data

print(results[:500])

In [None]:
# it looks like the results are a string with JSON-formatted data inside

# parse the string into a Python dictionary
data = json.loads(results)  # loads = "load string"

type(data)

#### New data type -- `dict`, RealPython tutorial [here](https://realpython.com/python-dicts/)

A "dictionary" in Python contains key-value pairs, where the keys are (usually) strings and the values can be anything. Each value must be a single "item", but the item can contain other things -- it could be a list, or even an entire nested dictionary.
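
As a small illustration, here's a hand-written dictionary loosely modeled on a single earthquake record. The keys and values are made up; the real API results have more fields and a different layout.

```python
# a hand-written dictionary; the values are made up for illustration
quake = {
    'id': 'example001',                       # a string value
    'mag': 5.1,                               # a number value
    'coordinates': [-122.26, 37.87, 10.0],    # a list value
    'properties': {'title': 'Example quake',  # a nested dictionary
                   'felt': None},
}

print(type(quake))
print(list(quake.keys()))
print(len(quake))   # number of key-value pairs at the top level
```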

In [None]:
# print the dictionary

pp.pprint(data)

#### New syntax -- `<dict>['<key>']`, RealPython tutorial [here](https://realpython.com/python-dicts/)

You can access elements of a dictionary using the keys.
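
Here's a small sketch of key lookups on nested data, using a tiny stand-in for a parsed API response (the values are made up). It also shows `.get()`, a standard dictionary method that is handy when a key might be missing.

```python
# a tiny stand-in for the parsed API response (values are made up)
data = {
    'type': 'FeatureCollection',
    'features': [
        {'properties': {'title': 'M 5.1 - example location', 'mag': 5.1}},
    ],
}

# square brackets look up a key; lookups can be chained for nested data
title = data['features'][0]['properties']['title']
print(title)

# asking for a key that doesn't exist raises a KeyError;
# .get() returns None (or a default you choose) instead
print(data.get('metadata'))
print(data.get('metadata', 'no metadata'))
```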

In [None]:
# save the list of quakes to a new variable

quakes = data['features']

# print the most recent quake

pp.pprint(quakes[0])

#### New syntax -- `for <item> in <list>:`, RealPython tutorial [here](https://realpython.com/python-for-loop/)

This creates a *loop*, running the subsequent indented code for each item in the list. `<list>` is the name of the list, and `<item>` is a new variable name you provide, which will refer to each item in turn as the loop runs.
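
Here's a minimal example with a made-up list of place names, showing which lines run once per item and which run only once.

```python
# made-up place names standing in for a list of API results
places = ['Berkeley', 'Oakland', 'Richmond']

lengths = []
for p in places:
    # this indented code runs once per item; p refers to the current item
    print(p)
    lengths.append(len(p))

# this line is not indented, so it runs once, after the loop finishes
print(lengths)
```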

In [None]:
# pull out the title from each earthquake listing

for q in quakes:
    print(q['properties']['title'])

#### New syntax -- `[<code> for <item> in <list>]`, RealPython tutorial [here](https://realpython.com/list-comprehension-python/)

This is a shortcut to create a new list where each item from an existing list is modified using a small piece of code. Just like with the `for` syntax, `<list>` is the name of a list and `<item>` is a new variable that will refer to each item in turn. The `<code>` should do something with the item variable.

This syntax is called "list comprehension". It can be hard to read, so I don't use it very much. But there are certain things it's really helpful for -- like parsing dictionary data into a table format.
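
To make the shortcut concrete, here's the same operation written both ways. The records are made up, but they're shaped like the `properties` part of an earthquake listing.

```python
# made-up records shaped like the 'properties' part of a quake listing
records = [
    {'properties': {'mag': 4.6, 'title': 'A'}},
    {'properties': {'mag': 5.1, 'title': 'B'}},
    {'properties': {'mag': 4.9, 'title': 'C'}},
]

# the long way: an explicit loop that builds up a list
mags_loop = []
for r in records:
    mags_loop.append(r['properties']['mag'])

# the shortcut: a list comprehension that does the same thing
mags_comp = [r['properties']['mag'] for r in records]

print(mags_loop == mags_comp)
```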

In [None]:
# pull out magnitudes and depths into a Pandas dataframe

d = {'magnitude': [q['properties']['mag'] for q in quakes],
     'depth': [q['geometry']['coordinates'][2] for q in quakes]}

df = pd.DataFrame.from_dict(d)

# how many earthquakes were loaded into the dataframe?

len(df)

In [None]:
# print the first few lines of data

df.head()

In [None]:
# print some descriptive statistics

pd.set_option("display.precision", 1)
df.describe()

In [None]:
# plot the depth vs. magnitude

df.plot(x='magnitude', y='depth', kind='scatter')

# Part 2: Querying an API endpoint

### Mapbox Geocoding API

Services like Google Maps and Mapbox have various APIs that let you access their services through code instead of through GUI apps. This one from Mapbox lets you look up the latitude-longitude coordinates of street addresses.

It works similarly to the earthquakes example, but with query parameters added to the URL endpoint!
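
Here's a rough sketch of what "query parameters added to the URL" means, using only Python's standard library. The endpoint and token are placeholders, not a real service. The search text is encoded into the URL path, and the parameters are joined into a `key=value` string after the `?`.

```python
from urllib.parse import quote, urlencode

# hypothetical endpoint and key, used only to show the URL structure
endpoint = 'https://api.example.com/geocoding/'
search_text = 'Wurster Hall, Berkeley'
params = {'limit': 1, 'access_token': 'YOUR_TOKEN'}

# quote() percent-encodes characters that aren't safe in a URL path
# urlencode() turns a dictionary into a key=value&key=value string
url = endpoint + quote(search_text) + '.json?' + urlencode(params)
print(url)
```

The `requests` library can do this encoding for us, which is the approach the Mapbox cells in this part take.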

**API documentation:**  
https://www.mapbox.com/api-documentation/#geocoding

**API endpoint:**  
https://api.mapbox.com/geocoding/v5/mapbox.places

**API endpoint with query parameters:**  
https://api.mapbox.com/geocoding/v5/mapbox.places/Wurster+Hall.json?access_token=pk.eyJ1IjoiY3AyNTVkZW1vIiwiYSI6ImRPcTlnTUEifQ.3C0d0Nk_rcwV-8JF29PU-w

You can get your own access key by signing up for a Mapbox account, if you'd like.

In [None]:
import json      # library for working with JSON-formatted text strings
import requests  # library for accessing content from web URLs

import pprint    # library for cleanly printing Python data structures
pp = pprint.PrettyPrinter()

In [None]:
# we have to encode the search query so that it can be passed as a URL,
# with spaces and other special characters percent-encoded

endpoint = 'https://api.mapbox.com/geocoding/v5/mapbox.places/'

address = 'Wurster Hall'

params = {'limit': 1,
          'access_token': 'pk.eyJ1IjoiY3AyNTVkZW1vIiwiYSI6ImRPcTlnTUEifQ.3C0d0Nk_rcwV-8JF29PU-w'}

url = requests.Request('GET', endpoint+address+'.json', params=params).prepare().url
print(url)

In [None]:
# download and parse the results

response = requests.get(url)
results = response.text
data = json.loads(results)

print(data)

In [None]:
# print it more nicely

pp.pprint(data)

In [None]:
# pull out the lat-lon coordinates

for r in data['features']:
    coords = r['geometry']['coordinates']
    print(coords)

### Exercises

1. Search for some other addresses or landmarks!
2. Take a look at the [API documentation](https://www.mapbox.com/api-documentation/#geocoding). Can you figure out how to retrieve other points of interest near Wurster Hall?

# Part 3: Querying an API with back-and-forth authentication

### Twitter search APIs

Twitter's APIs operate over the web as well, but they require a back-and-forth authentication process at the beginning of each connection. It's easier to have a Python library handle this than to create the query URLs ourselves.

Most Twitter APIs perform stand-alone operations: you submit a query and receive results, like in earlier examples. Twitter also has a "streaming" API that continues sending results in real time until you disconnect.
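
The streaming idea can be sketched without a network connection. In this toy example, `fake_stream` is a made-up stand-in for a live connection: it keeps producing results until we stop reading from it. This is only an analogy for how the loop over a stream behaves, not how the Twitter library works internally.

```python
import itertools

def fake_stream():
    """Stand-in for a streaming connection: yields results without end."""
    for n in itertools.count(1):
        yield {'text': 'message ' + str(n)}

LIMIT = 3
received = []

# 'enumerate' counts results as they arrive; we stop once we have LIMIT
for i, msg in enumerate(fake_stream()):
    received.append(msg['text'])
    if i + 1 >= LIMIT:
        break

print(received)
```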

**API documentation:**  
https://developer.twitter.com/en/docs/tweets/search/overview

**Documentation for the Python helper library**:  
http://geduldig.github.io/TwitterAPI/

### Setup

This part of the demo requires a file of account credentials called `keys.py`. You'll find the file in a bCourses announcement -- download it and put it into the same DataHub folder as this notebook. 

Instructions [here](https://github.com/smmaurer/api-demo/blob/master/README.md) for generating your own credential tokens later on, if you'd like to.

#### New syntax -- `!pip install <library_name>`

In a Jupyter notebook, beginning a line with `!` passes the instruction directly to the computer's operating system instead of running it with Python.

This particular command uses a program called Pip, which is a tool for managing Python libraries. `pip install` searches the Python Package Index (PyPI) for the library and then automatically downloads and installs it.

In DataHub, the Python libraries are reset every time you log in. So you'll have to install any special ones each time you begin a new session. But if you run this notebook on your own computer, new libraries will stay installed.
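
If you're not sure whether a library is already available in your current session, one way to check from Python is sketched below. `is_installed` is a made-up helper name; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util

def is_installed(library_name):
    """Return True if Python can find the library in this environment."""
    return importlib.util.find_spec(library_name) is not None

print(is_installed('json'))        # part of the standard library
print(is_installed('TwitterAPI'))  # depends on whether it's been installed
```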

In [None]:
!pip install TwitterAPI

In [None]:
from TwitterAPI import TwitterAPI

import pprint
pp = pprint.PrettyPrinter()

In [None]:
# import API credentials from keys.py file in the
# same directory as this notebook

from keys import *

In [None]:
# set up an API connection using credentials from the keys file

api = TwitterAPI(consumer_key, consumer_secret, 
                 access_token, access_token_secret)

print("Connection is set up but not tested")

### Making a simple data request

In [None]:
# most recent tweet from CED

endpoint = 'statuses/user_timeline'
params = {
    'screen_name': 'wursterlife', 
    'count': 1
}
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print(tweet['text'])

In [None]:
# what other data is there?

pp.pprint(tweet)

### Other API endpoints allow different types of searches

In [None]:
# search for recent tweets with your favorite emoji (or any other text string)

endpoint = 'search/tweets'
params = {
    'q': '💘', 
    'count': 5
}
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print(tweet['text'] + '\n')

In [None]:
# search for public tweets in Korean

endpoint = 'search/tweets'
params = {
    'q': '*', 
    'lang': 'ko', 
    'count': 5
} 
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print(tweet['text'] + '\n')

In [None]:
# search for public tweets geotagged near the UC Berkeley campus

endpoint = 'search/tweets'
params = {
    'q': '*', 
    'geocode': '37.873,-122.260,0.5km', 
    'count': 5
} 
r = api.request(endpoint, params)

for tweet in r.get_iterator():
    print(tweet['text'] + '\n')

### Exercises

1. Try some different search queries!
2. Display some more data fields in addition to the tweet text.

### Bonus: Streaming live tweets in real time 

In [None]:
# Twitter allows only one or two simultaneous streaming connections for 
# each set of API credentials, so this part may not work during class

endpoint = 'statuses/filter'
params = {'locations': '-180,-90,180,90'}
r = api.request(endpoint, params)
LIMIT = 20

# 'enumerate' lets us count tweets as we receive them

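for i, tweet in enumerate(r.get_iterator()):
    print(tweet['created_at'])
    # not every tweet has a tagged place, so check before reading it
    if tweet.get('place'):
        print(tweet['place']['full_name'] + ', ' + tweet['place']['country'])
    print(tweet['text'] + '\n')
    if i + 1 >= LIMIT: break  # stop after LIMIT tweets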

# close the streaming connection
r.close()

### Exercises for the remainder of class

Choose your favorite:

1. Make a scatter plot of the lat-lon coordinates of earthquakes.  
   &nbsp;
   
2. The earthquakes API is actually returning a specific data format called [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON), also used by many other geospatial data feeds. Try saving the raw GeoJSON file and opening it in QGIS or in the [geojson.io](http://geojson.io) web viewer.  
   &nbsp;

3. Using the geocoding example as a starting point, try searching Mapbox's Directions API or Elevation API instead. You can read more about them on the [Mapbox API documentation page](https://www.mapbox.com/api-documentation/#introduction).  
   &nbsp;

4. Try out another API that you're interested in. Can you figure out how to connect to it using Python?  

   With municipal data it's often easiest to just download a data file, but APIs are great for querying big data sets or tracking live updates. Here are some resources.

   - San Francisco:  https://data.sfgov.org/developers  
   - Alameda County:  https://data.acgov.org/developers  
   - UC Berkeley:  https://api-central.berkeley.edu  
   - US Census:  http://www.census.gov/data/developers/data-sets.html  
   - Open Data Network:  https://www.opendatanetwork.com  
   - CivicData:  http://www.civicdata.io/  
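
Returning to the GeoJSON exercise above: here's a minimal sketch of saving parsed data to a file that other tools can open. The sample feature and the filename are made up, and the file goes into the system's temporary folder, so adjust the path if you'd like to keep the file next to your notebook.

```python
import json
import os
import tempfile

# a tiny hand-written GeoJSON object standing in for the API results
# (coordinates are longitude, latitude, depth)
data = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature',
         'properties': {'title': 'Example quake', 'mag': 5.1},
         'geometry': {'type': 'Point', 'coordinates': [-122.26, 37.87, 10.0]}},
    ],
}

# hypothetical filename, saved to the system's temporary folder
path = os.path.join(tempfile.gettempdir(), 'quakes_sample.geojson')

with open(path, 'w') as f:
    json.dump(data, f)

# read the file back in to confirm it saved correctly
with open(path) as f:
    reloaded = json.load(f)

print(path)
print(reloaded == data)
```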