# Assignment 6
In this assignment, I import San Francisco development pipeline data from SF OpenData's API. I then use it in a blog post to explore residential development that is proposed or currently under construction in San Francisco. A link to the blog post is at the end of the notebook.


First, import the packages necessary for calling an API. 

In [None]:
#import packages
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import re
import json    # library for working with JSON-formatted text strings
import requests  # library for accessing content from web URLs
import pprint  # library for making Python data structures readable
pp = pprint.PrettyPrinter()

The SF Planning Department releases this data quarterly. Quarterly reports go back all the way to 2012. For now, I just want to take a look at the latest data, which is the second quarter of 2016. The API endpoint for this data is below.

In [None]:
Q22016 = 'https://data.sfgov.org/resource/3n2r-nn4r.json'
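
As I understand the Socrata (SODA) API behind SF OpenData, a plain request to an endpoint returns only a default-sized page of rows, and the `$limit` and `$offset` query parameters control paging. The sketch below only builds the paged URLs and does not call the API. The helper name `paged_url` and the page size of 1000 are my own assumptions, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.sfgov.org/resource/3n2r-nn4r.json'

def paged_url(base_url, limit=1000, offset=0):
    """Append SODA paging parameters ($limit, $offset) to an endpoint URL."""
    # safe='$' keeps the dollar sign readable instead of percent-encoding it
    query = urlencode({'$limit': limit, '$offset': offset}, safe='$')
    return base_url + '?' + query

first_page = paged_url(BASE_URL)                 # rows 0-999
second_page = paged_url(BASE_URL, offset=1000)   # rows 1000-1999
```

Either URL could be passed to `requests.get` in place of the bare endpoint. Whether paging is needed depends on how many records the quarter contains.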


Because I intend to eventually create longitudinal data, I define generalizable functions below to transform the data from each API call into a usable dataframe. The development pipeline data is messy and field names are inconsistent from quarter to quarter. Therefore, I define a function called "importdata" that lets the user specify the field names for each API endpoint. 

The response we get after calling the API endpoint is a list of dictionaries (one for each development project). However, not all of these dictionaries have the same set of keys (e.g. some development projects are missing an "affordable units" field). Because of this, I define two helper functions called "includekey" and "include_coor_key", which build a list with one entry per development project for a given field. If a given development is lacking that field, I fill in its entry with a nan value. This way, every key in the final dictionary used to create the dataframe has the same number of values. 

In [None]:
def importdata(quarter, field1, field2, field3, field4, field5, field6, field7, field8, field9, geogfield1, geogfield2):
    '''
    This function calls the SF open data API for the given "quarter" of development pipeline data.
    It returns a dataframe with the data. It takes 11 field names (9 regular fields and 2 geography
    fields) that name the keys in the API response holding the values I am interested in.
    '''
    
    def includekey(field):
        '''
        This function creates a list of values from each API endpoint that correspond to the given field.
        '''
        values = []  # avoid shadowing the built-in name "list"
        for item in data: 
            if field in item:
                values.append(item[field])
            else:
                values.append(np.nan)
        return values

    def include_coor_key(one, two):
        '''
        This function creates a list of values from each API endpoint that correspond to the given geography field.
        Because the geography fields typically are a dictionary within a dictionary, I create a separate function
        here to handle those cases. 
        '''
        values = []  # avoid shadowing the built-in name "list"
        for item in data: 
            if one in item:  # check the geography field itself, not field1
                values.append(item[one][two])
            else:
                values.append(np.nan)
        return values
    
    response = requests.get(quarter) #call the API
    results = response.text #put API response into a string object
    data = json.loads(results) #translate the json format of the data into a list
    
    #import fields
    d = {}
    d['lot_number'] = includekey(field1)
    d['address'] = includekey(field2)
    d['status'] = includekey(field3)
    d['latest_date'] = includekey(field4)
    d['units'] = includekey(field5)
    d['net_units'] = includekey(field6)
    d['affordable_units'] = includekey(field7)
    d['net_affordable_units'] = includekey(field8)
    d['zone'] = includekey(field9)
    d['lat_lon'] = include_coor_key(geogfield1, geogfield2)
    
    df = pd.DataFrame.from_dict(d) #create the dataframe from the above dictionary
    
    return df
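
To make the padding idea concrete without calling the API, here is a minimal sketch on two made-up records. The sample values and the helper names `column` and `nested_column` are hypothetical. They mirror what "includekey" and "include_coor_key" are meant to do, but are not the functions used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical records shaped like the API response; the values are made up.
sample = [
    {'apn': '0001', 'units': '10', 'aff': '2',
     'location': {'type': 'Point', 'coordinates': [-122.41, 37.77]}},
    {'apn': '0002', 'units': '4'},  # no 'aff' and no 'location'
]

def column(records, field):
    """Return one value per record, using NaN where the field is absent."""
    return [rec.get(field, np.nan) for rec in records]

def nested_column(records, outer, inner):
    """Same idea for a value stored one level down (e.g. location -> coordinates)."""
    return [rec[outer].get(inner, np.nan) if outer in rec else np.nan
            for rec in records]

toy = pd.DataFrame({
    'lot_number': column(sample, 'apn'),
    'affordable_units': column(sample, 'aff'),
    'lat_lon': nested_column(sample, 'location', 'coordinates'),
})
```

Every list has exactly one entry per record, so the columns line up even though the second record is missing two fields.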

After defining the function, we import the data for the second quarter of 2016.

In [None]:
Q22016df = importdata(Q22016, 'apn', 'nameaddr', 'beststat', 'bestdate', 'units', 'unitsnet', 'aff', 'affnet', 'zoning_sim', 'location', 'coordinates')   

Finally, I clean and export the data for the blog post.

In [None]:
#Clean data after importing
#coordinates are stored as [lon, lat]; strip the brackets and surrounding spaces from each half
Q22016df['lon'] = Q22016df['lat_lon'].astype(str).str.split(',').str[0].str.strip(' [')
Q22016df['lat'] = Q22016df['lat_lon'].astype(str).str.split(',').str[1].str.strip(' ]')
Q22016df['net_units'] = Q22016df['net_units'].astype(int) #convert to integer
Q22016df['units'] = Q22016df['units'].astype(int) #convert to integer
#filter out those observations that have no impact on residential construction (0 net units and 0 units)
Q22016df = Q22016df[(Q22016df['units'] != 0) | (Q22016df['net_units'] != 0)]
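
The cleaning step above works on string representations and assumes every row has a unit count. As a hedged alternative, the sketch below reads the coordinates straight from the list and converts counts with `pd.to_numeric`, on made-up rows. The helper `coord` and the choice to treat a missing count as 0 are my assumptions, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; coordinates follow the [longitude, latitude] order used above.
toy = pd.DataFrame({
    'lat_lon': [[-122.41, 37.77], np.nan],
    'units': ['10', np.nan],
})

def coord(value, index):
    """Pick one coordinate from a [lon, lat] pair, or NaN when the pair is missing."""
    return float(value[index]) if isinstance(value, (list, tuple)) else np.nan

toy['lon'] = toy['lat_lon'].apply(coord, args=(0,))
toy['lat'] = toy['lat_lon'].apply(coord, args=(1,))

# Assumption: a missing unit count is treated as 0 before converting to integer.
toy['units_int'] = pd.to_numeric(toy['units'], errors='coerce').fillna(0).astype(int)
```

This yields numeric `lon` and `lat` columns rather than strings, which may or may not matter for the mapping tool downstream.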

In [None]:
#Explore the data to figure out buckets to map in carto
Q22016df[Q22016df['net_units'] > 0].describe(percentiles = [.1, .2, .3, .4, .5, .6, .7, .8, .9])
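
Once the percentiles suggest sensible break points, one way to turn them into map buckets is `pd.cut`. The sketch below uses made-up unit counts, and the bin edges and labels are purely illustrative. They are not the buckets used in the blog post.

```python
import pandas as pd

# Hypothetical net-unit counts for six projects.
net_units = pd.Series([1, 2, 8, 25, 120, 600])

# Illustrative edges; each bin is open on the left and closed on the right,
# so (0, 1] holds single-unit projects and (199, inf] holds 200 or more.
edges = [0, 1, 9, 49, 199, float('inf')]
labels = ['1', '2-9', '10-49', '50-199', '200+']

buckets = pd.cut(net_units, bins=edges, labels=labels)
counts = buckets.value_counts().sort_index()  # projects per bucket, in label order
```

The resulting `buckets` series could be added as a column before exporting, so the same categories are available in the mapping tool.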

In [None]:
export_path = "Output/current_dev.csv"
Q22016df.to_csv(export_path)

In [None]:
#Blog Post: https://www.ocf.berkeley.edu/~bgoggin/2016/10/09/mapping-sf-residential-development/