In [None]:
%matplotlib inline
import json
import pandas as pd

In [None]:
# Read and return the "api_key" key from a JSON file
def read_key(fn):
    with open(fn) as json_file:
        data = json.load(json_file)
        return data["api_key"]

# Using Data APIs in Python to Improve Urban Life



## Obtaining US Census Data Through Python Packages

Every 10 years the US conducts a national census. The data collected are extensive and are aggregated spatially into areas called census tracts. There is a Python package for accessing the data through the [APIs served by the Census Bureau](http://www.census.gov/developers/).

This package requires you to know a fair amount about how census data is structured and what the various attributes are called. As an example, we'll look at the American Community Survey 5-year data (referred to as [acs5](http://www.census.gov/data/developers/data-sets/acs-5year.html)) and some of the variables related to poverty measures, including [total population](http://api.census.gov/data/2014/acs5/variables/B01003_001E.json) and [population below the poverty level by work experience](http://api.census.gov/data/2014/acs5/variables/B17009_002E.json).

The Census Bureau's API requires an API key. This is a unique string that is assigned to you so that your API use can be tracked and rate limited if needed. Some data API providers charge money for use of their APIs, and keys let them bill the right person. The Python package will pass your API key along to the Census servers.

You can get a free [Census Bureau API key here](http://api.census.gov/data/key_signup.html).

API keys are secrets that, like passwords, should not be made public. Be careful about putting API keys into your code, especially if you commit it to a public code repository such as GitHub. Remember, once you have committed an API key to a version control system, it stays in the history forever. If that happens, you must cancel the old key and get a new one immediately.

One pattern for managing API keys is to put them in a separate file which is excluded from version control and read them into your code.

In [None]:
census_api_key = read_key("census.key")
print(census_api_key[0:10])
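
As a minimal sketch of that pattern, the snippet below writes a key file in the JSON layout that `read_key` expects and reads it back. The key value and the temporary directory are made up for illustration; in practice you would create `census.key` by hand next to the notebook.

```python
import json
import os
import tempfile

# Hypothetical location and key value, used only for this illustration.
key_path = os.path.join(tempfile.mkdtemp(), "census.key")

# The key file is a small JSON document with a single "api_key" property.
with open(key_path, "w") as key_file:
    json.dump({"api_key": "0123456789abcdef-not-a-real-key"}, key_file)

# Read it back the same way read_key() does.
with open(key_path) as key_file:
    loaded_key = json.load(key_file)["api_key"]

# Print only a short prefix so the full key never ends up in saved output.
print(loaded_key[:4] + "...")
```

If you use Git, adding a line such as `*.key` to your `.gitignore` file keeps these files out of your commits.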

Now that we have a key loaded, let's get a few variables from the census by census tract for Alachua County. The census uses [FIPS codes](https://en.wikipedia.org/wiki/FIPS_county_code) to identify which state and county the data belongs to.

In [None]:
# Before running you will need to install the 'census' package with
# pip install census
from census import Census

florida_fips = '12'
alachua_fips = '001'

census_api = Census(census_api_key)
census_data = census_api.acs5.state_county_tract(
    ('NAME', 'B01003_001E', 'B14006_002E', 'B17009_002E'), 
    florida_fips, alachua_fips, Census.ALL)

census_data[0:2]

What did we get back? This is a list of dictionaries. This is a common representation of data and when we talk about JSON in a few minutes you'll see why data APIs and libraries that work with them like this format.

This is such a common structure that Pandas includes a function to convert it directly to a data frame:

In [None]:
poverty_by_tract = pd.DataFrame.from_dict(census_data)
poverty_by_tract.describe()
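
Because each record is a plain dictionary, you can also work with the list directly. The sketch below uses made-up records and assumes the geography fields are returned under the keys `state`, `county` and `tract`; check the keys in your own response before relying on them. It joins the FIPS codes into the 11-digit tract identifier that many other tract-level datasets use.

```python
# Toy records shaped like a tract-level response; all values are invented.
sample_rows = [
    {"NAME": "Tract A", "B01003_001E": "4000",
     "state": "12", "county": "001", "tract": "000200"},
    {"NAME": "Tract B", "B01003_001E": "2500",
     "state": "12", "county": "001", "tract": "000301"},
]

def tract_geoid(row):
    """Join state (2 digits), county (3) and tract (6) codes into one ID."""
    return row["state"] + row["county"] + row["tract"]

geoids = [tract_geoid(row) for row in sample_rows]
print(geoids)
```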

Now let's calculate the percentage of the population living in poverty in each census tract. To do that we'll divide the total number living below the poverty line by the total population and put it in a new column of the dataframe. Because our data was returned as strings, we need to convert each one to a float before dividing.

In [None]:
# Convert the string counts to numbers, then divide by total population (B01003_001E)
poverty_by_tract['percent'] = (
    pd.to_numeric(poverty_by_tract['B17009_002E'])
    / pd.to_numeric(poverty_by_tract['B01003_001E']))

poverty_by_tract.hist(column="percent")
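
To make the string conversion explicit, here is a small sketch of the same calculation on made-up rows in plain Python. It assumes `B01003_001E` (total population) is the denominator, and it returns `None` for a tract with no residents rather than dividing by zero.

```python
def poverty_share(below_poverty, total_population):
    """Return the share as a float, or None when the population is zero."""
    total = float(total_population)
    if total == 0:
        return None
    return float(below_poverty) / total

# Invented values, stored as strings as described above.
rows = [
    {"B17009_002E": "120", "B01003_001E": "4000"},
    {"B17009_002E": "0", "B01003_001E": "0"},
]

shares = [poverty_share(r["B17009_002E"], r["B01003_001E"]) for r in rows]
print(shares)
```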

You can see that the calls to the census package and even the data formatting take just a few lines of code. However, you still need to know quite a lot about how census data is represented in order to use the Python package. Descriptive data about data are called "metadata". You can find the metadata for some of what we have used above at the Census Bureau.

## Obtaining Gainesville Regional Utilities (GRU) Billing Data Through Socrata's API from data.cityofgainesville.org

Not all APIs have packages that wrap them. Sometimes, like [Socrata's package](https://pypi.python.org/pypi/sodapy), they exist but don't include the methods you want.

When someone makes a web-based API, you can interact with it using HTTP requests from Python with the [requests library](http://docs.python-requests.org/en/master/). This is a very full-featured library that will make using the complex parts of the HTTP protocol like data encoding, HTTPS, and authentication very easy.

Let's look at how to interact with a web API without a package. We will need to determine four things to get started:

1. The API access point. This is the URL where API requests are sent.
1. The API methods. These are often called endpoints in REST APIs.
1. Method parameters that contain any data the methods need to do their job.
1. The format of the returned data.

For the Socrata API, the API access point is the server run by the City of Gainesville:

In [None]:
api_access_point = "https://data.cityofgainesville.org/resource"

This is a simple API, so the method in this case is just the name of the dataset that you want to retrieve data from. In other APIs these can be actions like "get" or "delete" or might refer to more complicated operations like "map". Notice that datasets are named with a unique string. This is the one for electricity usage by month. You can find these names by clicking the [API link on the dataset's page](https://data.cityofgainesville.org/Environment-Energy/GRU-Customer-Electric-Consumption/gk3k-9435).

In [None]:
method = "9qim-t8hy"

To get data, we put these all together into a whole URL that we can then retrieve. For the moment let's also add some additional parameters for year and month as well as a cap on how many records will be returned.

In [None]:
year = "2016"
month = "January"

api_url = ("{0}/{1}.json?year={2}&month={3}&$limit=5"
    .format(api_access_point, method, year, month))
print(api_url)
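
String formatting works here because the parameter values contain no spaces or special characters. As a hedged alternative, the standard library's `urlencode` builds the query string from a dictionary and percent-encodes anything that needs it. The variable names below are illustrative, and `safe="$"` is passed only so that the `$limit` parameter stays readable.

```python
from urllib.parse import urlencode

base = "https://data.cityofgainesville.org/resource"
dataset_id = "9qim-t8hy"

# Query parameters as a dictionary instead of hand-written text.
params = {"year": "2016", "month": "January", "$limit": 5}

# safe="$" leaves the dollar sign as is; other special characters are encoded.
query = urlencode(params, safe="$")
url = "{0}/{1}.json?{2}".format(base, dataset_id, query)
print(url)
```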

Try clicking that link in your browser and see what you get. We'll also use the requests library to retrieve it in Python.

In [None]:
import requests
r = requests.get(api_url)
gru_data = r.json()
print(type(gru_data))
print(gru_data)

What was that data we retrieved and what is this json() method? Here is where we need to know what kind of data the API server sends back.

Most of the time you will get [JSON](http://www.json.org/) data, short for JavaScript Object Notation. This is a compact way of representing simple data structures as text. It supports arrays and objects for structures and strings and numbers for data types.

When you retrieved the API URL in your browser, notice that the output starts with a "[". This indicates that you are getting a list. The next character is a "{" which starts an object, so you are getting a list of objects. The remaining strings in quotes separated by colons ":" are the property names and values in the object.

When you call the .json() method on the response object, the requests library parses this JSON into Python data structures. You can see that gru_data is now a Python list of Python dictionaries with Unicode string property names and values. You can also see that nested data is allowed:

In [None]:
print(gru_data[0]["location_1"])
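
The same parsing can be tried offline with the standard `json` module. The text below is a made-up stand-in for a response: apart from `location_1`, the field names and all of the values are assumptions, so treat it as a sketch of the structure rather than the real GRU data.

```python
import json

# Hypothetical response text: a list "[" holding one object "{" with a
# nested object under "location_1".
text = (
    '[{"month": "January", "kwh": "850", '
    '"location_1": {"latitude": "29.65", "longitude": "-82.33"}}]'
)

records = json.loads(text)

# The outer structure is a list and each element is a dictionary.
print(type(records).__name__, type(records[0]).__name__)

# Nested objects become nested dictionaries.
print(records[0]["location_1"]["latitude"])
```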

Requesting simple URLs and getting JSON back is such a common pattern that it's baked into the Pandas library. You don't need the requests library for this, although requests lets you interact with much more complex APIs.

In [None]:
gru_dataframe = pd.read_json(api_url)
gru_dataframe.head(2)

## Obtaining Property Appraiser Data Through Manual Downloads from www.acpafl.org

Sometimes you can't win. Data may be publicly available but not through an API or a package. In these cases you will need to get creative. One technique is to scrape web pages by picking out tables or formatted text and turning them into data. [Scrapy](https://scrapy.org/) is a powerful Python package for doing that.

Another technique is to find a common pattern in web URLs and increment or change part of the URL with a loop in Python to walk through a whole dataset. As an example, you can see that this URL for a publication hosted on the Internet Archive uses the barcode as part of the URL. There is a separate database of barcodes, so you can easily write a loop to download everything:

[http://www.archive.org/download/CAT31293222/CAT31293222_files.xml](http://www.archive.org/download/CAT31293222/CAT31293222_files.xml)
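
A loop over that pattern might look like the sketch below. The first barcode comes from the link above; the second is a made-up placeholder standing in for values you would read from the barcode database.

```python
# The barcode appears twice in each URL, so one placeholder is reused.
url_pattern = "http://www.archive.org/download/{0}/{0}_files.xml"

# First entry is from the example link; the second is a placeholder.
barcodes = ["CAT31293222", "CAT00000000"]

urls = [url_pattern.format(barcode) for barcode in barcodes]
for url in urls:
    print(url)
```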

You may also find that data is available in a nice format but the provider has limited what you can download, usually because they don't want their servers to be overloaded. If you want to work around this, please be considerate and limit what you take, and also keep a copy locally so you don't have to keep going back to their servers.
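
One way to follow that advice is to save each response to disk and pause before every real request. The sketch below is only an illustration: `fetch_cached` and `fake_download` are hypothetical names, and the fake function stands in for a real call such as `requests.get(url).text` so the example runs without touching anyone's server.

```python
import os
import tempfile
import time

def fetch_cached(url, cache_dir, download, delay_seconds=1.0):
    """Return the body for url, downloading only when no local copy exists."""
    filename = os.path.join(cache_dir, url.replace("/", "_").replace(":", "_"))
    if os.path.exists(filename):
        with open(filename) as cached:
            return cached.read()
    time.sleep(delay_seconds)  # be polite: pause before each real request
    body = download(url)
    with open(filename, "w") as cached:
        cached.write(body)
    return body

calls = []

def fake_download(url):
    # Stand-in for a real HTTP request; records how often it is called.
    calls.append(url)
    return "payload for " + url

cache_dir = tempfile.mkdtemp()
first = fetch_cached("http://example.com/data", cache_dir, fake_download, 0)
second = fetch_cached("http://example.com/data", cache_dir, fake_download, 0)
print(len(calls))
```

The second call is served from the local file, so the download function runs only once.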

The [Alachua County Property Appraiser](http://www.acpafl.org/advancedpropertysearch.asp) provides tax value and ownership information for all the properties in the county on their web site. But they limit downloads to 5000 records, and there isn't a nice URL pattern to use to script downloading more. To get a significant amount of data, pick some parameter that you can increment to make sure you get all records. In this example, I downloaded all single family residences (land use code 00100) in the 32605 zip code in batches by assessed value increments.

We can read the CSV files into dataframes and concatenate them.

In [None]:
data1 = pd.read_csv("data/32605_00100_0k_50k.csv")
data2 = pd.read_csv("data/32605_00100_50k_100k.csv")
data3 = pd.read_csv("data/32605_00100_100k_200k.csv")
data4 = pd.read_csv("data/32605_00100_200k_300k.csv")
data5 = pd.read_csv("data/32605_00100_300k_999k.csv")


appraiser_data = pd.concat([data1, data2, data3, data4, data5])
appraiser_data.head()
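
When batches are split on a value range, a record that sits exactly on a boundary could appear in two files, depending on how the site treats the range limits. The sketch below uses two tiny made-up batches with hypothetical column names to show one way to guard against that by dropping exact duplicate rows after concatenating.

```python
import io
import pandas as pd

# Two invented batches; the row for parcel A2 appears in both.
batch_low = io.StringIO("Parcel,Assessed\nA1,40000\nA2,50000\n")
batch_high = io.StringIO("Parcel,Assessed\nA2,50000\nA3,90000\n")

frames = [pd.read_csv(batch) for batch in (batch_low, batch_high)]

# ignore_index gives the combined frame a fresh 0..n-1 index.
combined = (pd.concat(frames, ignore_index=True)
            .drop_duplicates()
            .reset_index(drop=True))
print(len(combined))
```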

### Distribution of Single Family Residences by Square Footage

In [None]:
appraiser_data.hist("TotSqFt")


### Geocoding Residences with Google's Service With the geocoder Package

Data APIs don't have to be just a source of data to download; they can perform transformations and calculations as well.

The appraiser data above does not include latitude and longitude, so we can't plot it on a map. The process of converting a street address into map coordinates is called geocoding, and there are many APIs for doing it. There is a Python package called [geocoder](https://pypi.python.org/pypi/geocoder) that wraps many of them up into an easy-to-use interface. We will use it with the Google geocoding API to map some of the addresses.

In [None]:
import geocoder

google_api_key = read_key("google.key")
g = geocoder.google("1604 NW 21ST AVE GAINESVILLE FL 32605-4062", 
                    key=google_api_key)
g.json

You can see that the library takes a string that is the street address and returns JSON that contains a lat and lng property.

In order to do this for our dataframe, we will first make a smaller dataframe with only a few rows, because Google limits their API usage to 2000 requests per day. Then we will write a geocoding function and apply it to the dataframe to make a new column of the responses from the geocoding API. Finally we will pull the lat and lng out of those responses into their own columns.

In [None]:
# Copy the slice so that adding columns below does not touch appraiser_data
small_data = appraiser_data[appraiser_data["City_Desc"] == "Gainesville"].head().copy()

def run_geocode(address):
    g = geocoder.google(address, key=google_api_key)
    return g.json

small_data["geocode_response"] = small_data.apply(lambda row: run_geocode(" ".join(
            [row["Loc_Address"], row["City_Desc"], "FL", "USA"])), axis=1)

small_data["lat"] = small_data.apply(lambda row: row["geocode_response"]["lat"], axis=1)
small_data["lng"] = small_data.apply(lambda row: row["geocode_response"]["lng"], axis=1)
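
With a daily request limit in mind, it can help to geocode each distinct address only once and to stop before reaching the cap. The sketch below shows one way to do that. `geocode_all` and `fake_geocode` are hypothetical names, and the fake function returns fixed made-up coordinates in place of a real call such as `run_geocode`, so no requests are sent.

```python
def geocode_all(addresses, geocode, max_requests=2000):
    """Geocode each distinct address once, stopping at max_requests lookups."""
    cache = {}
    for address in addresses:
        if address in cache:
            continue  # already looked up, no new request needed
        if len(cache) >= max_requests:
            break  # stay under the daily limit
        cache[address] = geocode(address)
    return cache

lookups = []

def fake_geocode(address):
    # Stand-in for a real geocoding call; always returns the same point.
    lookups.append(address)
    return {"lat": 29.65, "lng": -82.33}

addresses = ["1 MAIN ST", "2 MAIN ST", "1 MAIN ST"]
results = geocode_all(addresses, fake_geocode)
print(len(lookups), sorted(results))
```

Saving `results` to a file between sessions would also avoid repeating lookups on later days.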

In [None]:
small_data

We can map these points using basemap or other Python packages. We'll use a shapefile of the census tracts as a background.

In [None]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

my_map = Basemap(llcrnrlon=-82.5,llcrnrlat=29.4,urcrnrlon=-82.1,urcrnrlat=29.8,
             resolution='i', projection='tmerc', lat_0 = 29.65, lon_0 = -82.33)
my_map.readshapefile("data/gz_2010_12_140_00_500k", "census_tracts")


x,y = my_map(small_data["lng"].tolist(), small_data["lat"].tolist())

my_map.plot(x, y, 'bo', markersize=7)

plt.show()