# NYC GeoClient tutorial

In this lesson, we are going to go over how to geocode addresses using Python and NYC's GeoClient.

[Presentation on geocoding](https://docs.google.com/presentation/d/1LyM9f6icWiee1HE5ai_H73_IZ4C65YAzX52cR-flifo/edit?usp=sharing)

The Department of City Planning (DCP) maintains the official NYC geocoding application called GeoSupport. There are multiple ways of accessing this application. A web interface ([GOAT](http://a030-goat.nyc.gov/goat/Default.aspx.)) lets you query addresses one-by-one.

Another way of accessing the online version of GeoSupport is through an API maintained by the Department of Information Technology and Telecommunications (DoITT). To use this API, you need to [register for an account and request an API key](https://developer.cityofnewyork.us/api/geoclient-api). For now, you can use the keys provided here.

To make things easier, [John Krauss](https://github.com/talos/nyc-geoclient) wrote Python bindings for DoITT’s Geoclient API that allow querying it from Python. Documentation is here: [nyc_geoclient](https://nyc-geoclient.readthedocs.io/en/latest/geoclient.html).

Install this package from the command line:

> pip install nyc_geoclient

For this tutorial we will be using pandas, so if you haven't already, install that as well:

> pip install pandas

## Part 1 - geocoding single addresses

We can query GeoClient directly from a browser. This is the query for 253 Broadway in Manhattan; try it in your browser:

https://api.cityofnewyork.us/geoclient/v1/address.json?houseNumber=253&street=broadway&borough=manhattan&app_id=fb9ad04a&app_key=051f93e4125df4bae4f7c57517e62344

The query is a bit cumbersome because you have to include the app_id and app_key (these identify you once you register on DoITT's website). The output is also not easy to read in a browser, but you get the idea of how it works.
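The same URL can be assembled in Python rather than typed by hand. The following is a minimal sketch, assuming only the endpoint and parameter names visible in the URL above. `build_address_url` is a hypothetical helper name, not part of any package, and the sketch only builds the string without sending a request.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Endpoint copied from the browser query above.
BASE_URL = 'https://api.cityofnewyork.us/geoclient/v1/address.json'


def build_address_url(house_number, street, borough, app_id, app_key):
    """Hypothetical helper: build the address query URL as a string."""
    # A list of pairs keeps the parameters in a fixed order.
    params = [
        ('houseNumber', house_number),
        ('street', street),
        ('borough', borough),
        ('app_id', app_id),
        ('app_key', app_key),
    ]
    return BASE_URL + '?' + urlencode(params)


url = build_address_url(253, 'broadway', 'manhattan',
                        'fb9ad04a', '051f93e4125df4bae4f7c57517e62344')
print(url)
```

The printed URL is identical to the hand-written one, so it can be pasted into a browser in the same way.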

### python bindings
Now let's try it using the Python bindings, nyc_geoclient:

In [None]:
# import the package
from nyc_geoclient import Geoclient

# set up the app key and id (you can get your own from DoITT's website)
myAppID = 'fb9ad04a'
myKey = '051f93e4125df4bae4f7c57517e62344'

g = Geoclient(myAppID,myKey)

The nyc_geoclient package has stored our credentials and can use them to query the online API. We don't need to worry about the credentials after this; they are stored in the variable g.

The address function needs a house number, a street name, and either a borough or a zip code. Try it a few times to see what you get back.

In [None]:
g.address(253,'Broadway','manhattan')

In [None]:
g.address(253,'Broadway','10007')

As you can see, the function returns a LOT of information. The information is returned in the form of a **dictionary**.

In this example, the first **key** of the dictionary above is 'assemblyDistrict', and the associated value is '66'.
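If dictionaries are new to you, here is a small sketch of how to inspect one. The `result` dictionary is a hand-written stand-in, not real Geoclient output: only the 'assemblyDistrict' entry comes from the example above, and 'exampleField' is a made-up placeholder.

```python
# Hand-written stand-in for a geocoding result (not real Geoclient output).
result = {
    'assemblyDistrict': '66',       # value quoted in the text above
    'exampleField': 'placeholder',  # hypothetical key, for illustration only
}

# List the available keys in alphabetical order.
keys = sorted(result.keys())
print(keys)

# Square brackets raise a KeyError when a key is missing;
# .get() returns a default value instead.
missing = result.get('notAKey', 'not found')
print(missing)
```

Listing the keys first is a quick way to find the exact spelling and capitalization of the field you are after.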

Questions: 

How would you return *only* 'assemblyDistrict' or 'BBL' for instance?

What is the BIN and BBL for 100 Gold Street?

## Part 2 - geocoding a dataframe
This is great, but it only allows us to do one address at a time. What if we had a dataframe of addresses to geocode?

For this I have written a [geoclientBatch](https://github.com/deenapatel/geocode/blob/master/geoclient.py) function that loops through a dataframe, geocoding each row using Geoclient.
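The linked function is not reproduced here. As a rough illustration of the general idea only, the sketch below loops over rows, calls a geocoder on each, and keeps going when a lookup fails. Everything in it is an assumption: the rows are plain dictionaries rather than a dataframe, the field names 'houseNo', 'street' and 'borough' and the result key 'bbl' are hypothetical, and `fake_geocode` is a stand-in that returns made-up values instead of calling Geoclient.

```python
def fake_geocode(house_number, street, borough):
    """Hypothetical stand-in for a real geocoder; makes no network call."""
    if not house_number:
        return {}  # pretend the lookup found nothing
    return {'bbl': 'BBL-for-%s-%s' % (house_number, street)}


def geocode_rows(rows, geocode):
    """Geocode each row and return new rows with a 'geoBBL' field added."""
    geocoded = []
    for row in rows:
        try:
            result = geocode(row['houseNo'], row['street'], row['borough'])
        except Exception:
            # Treat any failed lookup as "no result" so the loop keeps going.
            result = {}
        new_row = dict(row)
        # Assumption: the result stores the BBL under the key 'bbl'.
        new_row['geoBBL'] = result.get('bbl', '')
        geocoded.append(new_row)
    return geocoded


# Made-up rows, for illustration only.
rows = [
    {'houseNo': '253', 'street': 'BROADWAY', 'borough': 'MANHATTAN'},
    {'houseNo': '', 'street': 'PENN PLAZA', 'borough': 'MANHATTAN'},
]
geocoded = geocode_rows(rows, fake_geocode)
print([row['geoBBL'] for row in geocoded])
```

Leaving an empty string for rows that fail makes it easy to filter for them afterwards.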

### setting up the data
First let's get some data to work with. Let's say we are interested in all the micro breweries in NYC.

The NY State Open Data portal has a [listing of active liquor licenses](https://data.ny.gov/Economic-Development/Liquor-Authority-Quarterly-List-of-Active-Licenses/hrvs-fxs2/data).

I downloaded all of the 'Micro Brewer' license types in NYC (filtering on County Names= NEW YORK, BRONX, BROOKLYN, QUEENS, RICHMOND) and saved it in the [data folder](https://github.com/deenapatel/geocode/tree/master/data).

Let's read this into a dataframe

In [None]:
# import pandas and set the options to display more rows and columns than the default
import pandas as pd
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

mblic = pd.read_csv('data/Liquor_Authority_Quarterly_List_of_Active_Licenses2018-07-30.csv')
print(mblic.shape)
mblic.head()

Notice that this dataframe has the address as a single column. We'll need to separate it into a house number column and a street column before using geoclient.

Pandas lets you use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) via the [pandas.Series.str.extract](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html) function.

This won't work for every address, but it's pretty close.
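To see what this kind of extraction does and where it falls short, the patterns can be tried on a few strings with Python's built-in `re` module before applying them to the dataframe. This is a minimal sketch: the two patterns are assumptions in the spirit of the ones used in this tutorial, `split_address` is a hypothetical helper, and the addresses are made up.

```python
import re

HOUSE_RE = re.compile(r'^([0-9-]*)')  # leading run of digits and hyphens
STREET_RE = re.compile(r'\s(.+)$')    # everything after the first whitespace


def split_address(address):
    """Return (house number, street) for one address string."""
    house = HOUSE_RE.search(address).group(1)
    street_match = STREET_RE.search(address)
    street = street_match.group(1) if street_match else ''
    return house, street


# Made-up addresses, for illustration only.
samples = ['253 BROADWAY', '40-05 SKILLMAN AVE', 'ONE PENN PLAZA']
for address in samples:
    print(split_address(address))
```

The first two addresses split cleanly. The third has a spelled-out house number, so the house number comes back empty, which is the kind of row that will need manual attention.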


In [None]:
# the address column name is a bit cumbersome, so let's store it as a variable
addressCol = 'Actual Address of Premises (Address1)'

# extract the house number: any run of digits or hyphens at the start
# (Queens house numbers contain hyphens)
mblic['houseNo'] = mblic[addressCol].str.extract(r'^([0-9-]*)', expand=False)
# extract everything after the first whitespace character as the street
mblic['street'] = mblic[addressCol].str.extract(r'\s(.+)$', expand=False)

# copy the county name into a new borough column
mblic['borough'] = mblic['County Name (Licensee)']
# let's see how it looks
mblic[[addressCol,'houseNo','street','borough']]

### running geoclient batch
Now we are ready to geocode it.

Make sure geoclient.py is in the current folder.

In [None]:
from geoclient import geoclientBatch

In [None]:
mblic = geoclientBatch(mblic, houseNo='houseNo', street='street', boro='borough')
mblic

Did they all geocode? If not, why?

What do you need to do to get most of them to geocode?

In [None]:
mblic[mblic.geoBBL=='']

#### followup exercises
1. What type of buildings are these microbreweries located in? 
DCP's PLUTO (Primary Land Use Tax Lot Output) has BBL as well as info on building class (and much more). Download this dataset and match it using BBL.

2. What neighborhoods are these breweries located in?
How would you modify geoclientBatch to include 'nta' or 'ntaName'? NTA stands for Neighborhood Tabulation Area.