# Mapping Latitude & Longitude Coordinates to NYC Neighborhoods

* This notebook maps the latitude and longitude coordinates in our data to NYC neighborhoods, using a GeoJSON file of neighborhood boundaries and matplotlib's `path` module. Each coordinate is tested against every neighborhood boundary to find the one that contains it.

* CSV files are read from Google Cloud Storage (GCS), a neighborhood column is added, and the results are written back to GCS, as new files wherever a separate output name is given.
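
The point-in-polygon test described above can be tried on a single made-up polygon before bringing in the real boundaries. This is a minimal sketch: the rectangle and the two points are hypothetical values, not taken from the neighborhood file. Note that vertices and points are both given as (longitude, latitude), i.e. (x, y).

```python
import numpy as np
from matplotlib import path

# Hypothetical rectangular "neighborhood", vertices as (longitude, latitude)
square = path.Path([(-74.00, 40.70), (-73.98, 40.70),
                    (-73.98, 40.72), (-74.00, 40.72)])

# Points must use the same (x, y) = (lon, lat) order as the polygon
points = np.array([(-73.99, 40.71),   # inside the rectangle
                   (-73.90, 40.71)])  # east of the rectangle

# Returns one boolean per point
inside = square.contains_points(points)
print(inside)
```

`contains_points` returns a boolean array, which is what makes it convenient for labelling many rows of a dataframe at once.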

## Import necessary modules

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
import json

from matplotlib import path
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

%matplotlib inline
# !pip install -q google-cloud-storage
from google.cloud import storage
from io import StringIO, BytesIO

In [2]:
# set a high display precision to make it easier
# to see different latitude and longitude values
pd.set_option('display.precision', 15)

## Connect to Google Cloud Storage (GCS)

GCS stores files as "blobs" in "buckets" that are assigned to "projects". The following code creates the client and bucket objects that point to where the data is stored, then lists the first few blob names.

In [3]:
# get blob names from GCS
gcs_client = storage.Client(project = 'airy-environs-194607')
bucket = gcs_client.bucket('metis_bucket_av')

get_lines = 5
all_blobs = bucket.list_blobs(max_results = get_lines)

for blob in list(all_blobs):
    print(blob.name)

data/
data/NYC_Transit_Subway_Entrance_And_Exit_Data.csv
data/NY_neighborhoods.geojson
data/citibike/201404-citibike-tripdata.csv
data/citibike/201405-citibike-tripdata.csv


## Functions to open, edit, and upload csv files

In [4]:
def open_file(fn):
    '''
    Args:
    -----
    fn: filepath of the csv "blob" on google storage
    
    Returns:
    --------
    pandas dataframe of opened csv file
    '''
    
    # for this function to work, the bucket and 
    # client for GCS should already be defined
    
    blob = bucket.get_blob(fn)
    # download_as_bytes() replaces the deprecated download_as_string()
    bytes_obj = BytesIO(blob.download_as_bytes())
    df = pd.read_csv(bytes_obj)
    return df

def add_neighborhood(df, lat_col, lon_col, new_col, remove_nonmatches=True):
    '''
    Args:
    -----
    df: dataframe with latitude and longitude coordinates
    lat_col: name of column with latitude information
    lon_col: name of column with longitude information
    new_col: name desired for new column with neighborhood
    remove_nonmatches: if True, any rows that are not matched
                       to a neighborhood are not included in 
                       the returned dataframe
    
    Returns:
    --------
    pandas dataframe with neighborhood info in a new column
    '''
    
    # function assumes that NYC neighborhood boundaries
    # are already loaded into an object called geofile
    
    # object dtype lets the column hold the 0 placeholder and string names
    df[new_col] = pd.Series(0, index=df.index, dtype=object)
    geo_points = list(zip(df[lon_col], df[lat_col]))

    for feature in geofile['features']:
        coords = feature['geometry']['coordinates'][0]
        p = path.Path(coords)
        inds = p.contains_points(geo_points)
        list_neighborhoods = [str(feature['properties']['neighborhood'])]*np.sum(inds)
        df.loc[df.index[inds], new_col] = list_neighborhoods
    
    # In all of the Uber files, there are pickups from 
    # NJ and outer NY suburbs - those do not match and
    # return a 0. If 'remove_nonmatches' is set to True,
    # they will be removed from the returned df.
    
    if remove_nonmatches:
        df = df[df[new_col] != 0]
    
    return df

def write_to_gcs(df, fn):
    '''
    Args:
    -----
    df: dataframe to write to GCS
    fn: file path+name for the destination csv file
    
    Warning:
    --------
    This function does not return anything. It will
    overwrite the original file stored on GCS if the 'fn'
    provided is the same as the one used to read the file.
    '''
    
    write_blob = bucket.blob(fn)
    write_blob.upload_from_string(df.to_csv(index=False))
    
def open_add_write(blobname, writename='unspecified', 
                   lat_col = 'Lat', lon_col = 'Lon', 
                   new_col = 'Loc', remove_nonmatches=True):
    '''
    Args:
    -----
    blobname: list of blobs/files on GCS to open
    writename: list of blobs/files on GCS to write to
               > if not specified, the original files 
                 will be overwritten
    lat_col, lon_col, new_col, remove_nonmatches: 
               passed through to add_neighborhood
    '''
    
    if writename == 'unspecified':
        writename = blobname
    
    read_rows = 0
    wrote_rows = 0
    
    for i, fn in enumerate(blobname):
        print(fn)
        print('>> Opening')
        df = open_file(fn)
        read_rows += len(df)
        
        print('>> Adding neighborhoods')
        df = add_neighborhood(df, lat_col, lon_col, 
                              new_col, remove_nonmatches)
        
        print('>> Writing back to GCS')
        write_to_gcs(df, writename[i])
        wrote_rows += len(df)
        print('')
    
    print('Read {:,} rows'.format(read_rows))
    print('Wrote back {:,} rows'.format(wrote_rows))
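
The open, label, and write steps can be exercised without GCS access. The sketch below is an in-memory stand-in under stated assumptions: `toy_geofile`, `raw`, and `label_points` are hypothetical names, the two squares are invented, and unmatched rows are marked with `None` rather than the `0` placeholder used in `add_neighborhood`.

```python
from io import BytesIO

import pandas as pd
from matplotlib import path

# Hypothetical stand-in for the geojson file: two square "neighborhoods"
toy_geofile = {'features': [
    {'properties': {'neighborhood': 'West Square'},
     'geometry': {'coordinates': [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
    {'properties': {'neighborhood': 'East Square'},
     'geometry': {'coordinates': [[[2, 0], [3, 0], [3, 1], [2, 1], [2, 0]]]}},
]}

# Bytes standing in for the content downloaded from a blob
raw = b'Lat,Lon\n0.5,0.5\n0.5,2.5\n5.0,5.0\n'
toy_df = pd.read_csv(BytesIO(raw))

def label_points(df, geo, lat_col='Lat', lon_col='Lon', new_col='Loc'):
    '''Return a copy of df with a neighborhood column (None if unmatched).'''
    out = df.copy()
    out[new_col] = None
    # Polygons are stored as (lon, lat), so points use the same order
    points = out[[lon_col, lat_col]].to_numpy()
    for feature in geo['features']:
        ring = feature['geometry']['coordinates'][0]
        mask = path.Path(ring).contains_points(points)
        out.loc[mask, new_col] = feature['properties']['neighborhood']
    return out

labelled = label_points(toy_df, toy_geofile)

# Equivalent of remove_nonmatches=True
matched = labelled.dropna(subset=['Loc'])

# Text that would be handed to upload_from_string
csv_text = matched.to_csv(index=False)
print(csv_text)
```

Working on a copy keeps the caller's dataframe unchanged, which is a design choice of this sketch and differs from `add_neighborhood`, which adds the column to the dataframe it receives.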

## Uber data

We can open the **NY_neighborhoods.geojson** file using Python's json module. 

In [5]:
with open('NY_neighborhoods.geojson') as f:
    geofile = json.load(f)
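
`add_neighborhood` relies on a particular layout of the loaded object: a top-level `features` list, a `neighborhood` entry under `properties`, and the boundary under `geometry['coordinates'][0]`. The sketch below shows that layout on a hypothetical one-feature collection; the name and coordinates are invented, and whether the real file contains only `Polygon` geometries is an assumption worth checking.

```python
import json

# Hypothetical one-feature FeatureCollection (not taken from the real file)
sample_text = '''
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "properties": {"neighborhood": "Example Heights"},
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-74.0, 40.7], [-73.9, 40.7],
                                  [-73.9, 40.8], [-74.0, 40.7]]]}}
 ]}
'''
sample = json.loads(sample_text)

for feature in sample['features']:
    name = feature['properties']['neighborhood']
    geom_type = feature['geometry']['type']
    # For a Polygon, element 0 is the exterior ring of [lon, lat] pairs
    outer_ring = feature['geometry']['coordinates'][0]
    print(name, geom_type, len(outer_ring))
```

For a `MultiPolygon`, `coordinates[0]` would be a whole polygon (a list of rings) rather than a ring, so printing `geom_type` for the real file is a quick way to confirm that the `[0]` indexing is appropriate.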

### 2014 Uber data

In [6]:
blobpath = 'data/uber-tlc-foil-response/uber-trip-data/'

# Uber data for 2014
months = ['apr', 'may', 'jun', 'jul', 'aug', 'sep']
file_str = 'uber-raw-data-{}14.csv'

blobname = [blobpath + file_str.format(m) for m in months]
pprint(blobname)

['data/uber-tlc-foil-response/uber-trip-data/uber-raw-data-apr14.csv',
 'data/uber-tlc-foil-response/uber-trip-data/uber-raw-data-may14.csv',
 'data/uber-tlc-foil-response/uber-trip-data/uber-raw-data-jun14.csv',
 'data/uber-tlc-foil-response/uber-trip-data/uber-raw-data-jul14.csv',
 'data/uber-tlc-foil-response/uber-trip-data/uber-raw-data-aug14.csv',
 'data/uber-tlc-foil-response/uber-trip-data/uber-raw-data-sep14.csv']


In [None]:
open_add_write(blobname)

### 2015 Uber data  

Unlike the 2014 files, the 2015 Uber data does not have latitude and longitude coordinates. Instead, it has **locationID** numbers that can be matched to a neighborhood using the **taxi-zone-lookup** file. 

In [7]:
# open uber 2015 data

uber15_fp = 'data/uber-tlc-foil-response/uber-trip-data/uber-raw-data-janjune-15.csv'
df = open_file(uber15_fp)

In [8]:
df.head()

Unnamed: 0,Dispatching_base_num,Pickup_date,Affiliated_base_num,locationID
0,B02617,2015-05-17 09:47:00,B02617,141
1,B02617,2015-05-17 09:47:00,B02617,65
2,B02617,2015-05-17 09:47:00,B02617,100
3,B02617,2015-05-17 09:47:00,B02774,80
4,B02617,2015-05-17 09:47:00,B02617,90


In [9]:
# open taxi lookup data

taxi_zone_fp = 'data/uber-tlc-foil-response/uber-trip-data/taxi-zone-lookup.csv'
taxi = open_file(taxi_zone_fp)

In [10]:
# merge information from 2 dataframes and drop 
# unnecessary columns

df = df.merge(taxi, how='left', left_on='locationID', 
              right_on='LocationID')
df.drop(labels=['Affiliated_base_num', 'LocationID'], 
        axis=1, inplace=True)
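
A left merge keeps every trip even when its ID is absent from the lookup table, so it can be useful to check for such rows. The following is a small sketch on hypothetical miniature tables (the ID `999` is invented), using the `indicator` option of `merge`.

```python
import pandas as pd

# Hypothetical miniature versions of the trips and lookup tables
trips = pd.DataFrame({'locationID': [141, 65, 999]})
zones = pd.DataFrame({'LocationID': [141, 65],
                      'Borough': ['Manhattan', 'Brooklyn'],
                      'Zone': ['Lenox Hill West',
                               'Downtown Brooklyn/MetroTech']})

# indicator=True adds a '_merge' column describing where each row came from
merged = trips.merge(zones, how='left', left_on='locationID',
                     right_on='LocationID', indicator=True)

# 'left_only' marks trips whose ID was missing from the lookup table
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['locationID'].tolist())
```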

In [11]:
df.head()

Unnamed: 0,Dispatching_base_num,Pickup_date,locationID,Borough,Zone
0,B02617,2015-05-17 09:47:00,141,Manhattan,Lenox Hill West
1,B02617,2015-05-17 09:47:00,65,Brooklyn,Downtown Brooklyn/MetroTech
2,B02617,2015-05-17 09:47:00,100,Manhattan,Garment District
3,B02617,2015-05-17 09:47:00,80,Brooklyn,East Williamsburg
4,B02617,2015-05-17 09:47:00,90,Manhattan,Flatiron


Great!  

However, these neighborhood names might not match the ones in the geojson file. We can use fuzzy text matching to replace the taxi-zone names with the neighborhood names that we used for the Uber 2014 data.

#### Fuzzy text matching

In [None]:
# installing fuzzywuzzy
# !pip install fuzzywuzzy[speedup]

In [18]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [12]:
# make a list of geographic neighborhoods from geojson file
# all trips should be translated to these neighborhoods

geo_areas = []
for i in geofile['features']:
    name = i['properties']['neighborhood']
    geo_areas.append(name)

In [15]:
# get list of neighborhoods from the taxi dataset
taxi_zones = df.Zone.unique()

In [30]:
# use fuzzy text matching to translate from taxi neighborhood
# names to geojson neighborhood names

from collections import defaultdict

top_match = defaultdict(str)
for tz in taxi_zones:
    matches = defaultdict(int)
    for gz in geo_areas:
        matches[gz] = fuzz.ratio(tz, gz)
    top_match[tz] = max(matches, key=matches.get)
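
The loop above keeps, for each taxi zone, the geojson name with the highest similarity score. The same idea can be sketched with only the standard library. This is a stand-in, not fuzzywuzzy itself: `difflib.SequenceMatcher` returns a ratio between 0 and 1, whereas `fuzz.ratio` returns an integer from 0 to 100, and the two may not always rank candidates identically. `best_match` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def best_match(name, candidates):
    '''Return the candidate with the highest similarity ratio to name.'''
    def score(candidate):
        return SequenceMatcher(None, name, candidate).ratio()
    return max(candidates, key=score)

# Candidate names taken from the outputs shown in this notebook
print(best_match('Astoria Park', ['Astoria', 'Marine Park', 'Bronxdale']))
print(best_match('Alphabet City', ['East Village', 'Grant City']))
```

The second call illustrates why the results need a manual review: character-level similarity favors a name that shares the word "City" over the geographically appropriate neighborhood.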

In [35]:
# print top matches in the following format: 
# taxi neighborhood name: geojson neighborhood name
for tz in sorted(taxi_zones):
    print(tz, end=': ')
    print(top_match[tz])

Allerton/Pelham Gardens: Pelham Gardens
Alphabet City: Grant City
Arden Heights: Arden Heights
Arrochar/Fort Wadsworth: Fort Wadsworth
Astoria: Astoria
Astoria Park: Astoria
Auburndale: Bronxdale
Baisley Park: Marine Park
Bath Beach: Bath Beach
Battery Park: Battery Park City
Battery Park City: Battery Park City
Bay Ridge: Bay Ridge
Bay Terrace/Fort Totten: Bay Terrace
Bayside: Bayside
Bedford: Bedford-Stuyvesant
Bedford Park: Borough Park
Bellerose: Bellerose
Belmont: Belmont
Bensonhurst East: Bensonhurst
Bensonhurst West: Bensonhurst
Bloomfield/Emerson Hill: Emerson Hill
Bloomingdale: Bronxdale
Boerum Hill: Boerum Hill
Borough Park: Borough Park
Breezy Point/Fort Tilden/Riis Beach: Breezy Point
Briarwood/Jamaica Hills: Jamaica Hills
Brighton Beach: Brighton Beach
Broad Channel: Broad Channel
Bronx Park: Bronx Park
Bronxdale: Bronxdale
Brooklyn Heights: Brooklyn Heights
Brooklyn Navy Yard: Navy Yard
Brownsville: Brownsville
Bushwick North: Bushwick
Bushwick South: Bushwick
Cambria Hei

A manual review of the matches turns up a number of incorrect ones. The following dictionary contains the corrections that are needed; zones that should not be matched to any neighborhood are mapped to `'0'`.

In [34]:
neighborhood_corrections = {'Alphabet City': 'East Village', 
                            'Auburndale': 'Fresh Meadows',
                            'Baisley Park': 'Jamaica',
                            'Bloomingdale': 'Upper West Side', 
                            'Central Harlem': 'Harlem',
                            'Central Harlem North': 'Harlem',
                            'East Flushing': 'Flushing',
                            'Erasmus': 'East Flatbush',
                            'Garment District': 'Midtown', 
                            'Hamilton Heights': 'Harlem',
                            'Heartland Village/Todt Hill': 'Todt Hill',
                            'Hillcrest/Pomonok': 'Flushing',
                            'Homecrest': 'Sheepshead Bay',
                            'Hudson Sq': 'SoHo', 
                            'Inwood Hill Park': 'Inwood',
                            'JFK Airport': 'John F. Kennedy International Airport',
                            'Kingsbridge Heights': 'Kingsbridge',
                            'Lenox Hill East': 'Upper East Side',
                            'Lenox Hill West': 'Upper East Side',
                            'Lincoln Square East': 'Upper West Side',
                            'Lincoln Square West': 'Upper West Side',
                            'Madison': 'Flatiron',
                            'Manhattan Valley': 'Upper West Side',
                            'Manhattanville': 'Harlem',
                            'New Dorp/Midland Beach': 'New Dorp',
                            'Newark Airport': '0',
                            'Oakland Gardens': 'Fresh Meadows',
                            'Ocean Hill': 'Bedford-Stuyvesant',
                            'Ocean Parkway South': 'Sheepshead Bay',
                            'Penn Station/Madison Sq West': 'Midtown',
                            'Queensboro Hill': 'Flushing',
                            'Queensbridge/Ravenswood': 'Astoria',
                            'Saint Michaels Cemetery/Woodside': 'Woodside',
                            'Seaport': 'Financial District',
                            'Starrett City': 'East New York',
                            'Stuy Town/Peter Cooper Village': 'Stuyvesant Town',
                            'Stuyvesant Heights': 'Bedford-Stuyvesant',
                            'Sutton Place/Turtle Bay North': 'Midtown',
                            'UN/Turtle Bay South': 'Midtown',
                            'Union Sq': 'Gramercy',
                            'Unknown': '0',
                            'West Chelsea/Hudson Yards': 'Chelsea',
                            'Willets Point': 'Flushing',
                            'Williamsburg (North Side)': 'Williamsburg',
                            'Williamsburg (South Side)': 'Williamsburg',
                            'World Trade Center': 'Financial District',
                            'Yorkville East': 'Upper East Side',
                            'Yorkville West': 'Upper East Side'}

In [42]:
# apply neighborhood corrections to the top_match dictionary
for i in neighborhood_corrections:
    top_match[i] = neighborhood_corrections[i]

pprint(top_match)

defaultdict(<class 'str'>,
            {'Allerton/Pelham Gardens': 'Pelham Gardens',
             'Alphabet City': 'East Village',
             'Arden Heights': 'Arden Heights',
             'Arrochar/Fort Wadsworth': 'Fort Wadsworth',
             'Astoria': 'Astoria',
             'Astoria Park': 'Astoria',
             'Auburndale': 'Fresh Meadows',
             'Baisley Park': 'Jamaica',
             'Bath Beach': 'Bath Beach',
             'Battery Park': 'Battery Park City',
             'Battery Park City': 'Battery Park City',
             'Bay Ridge': 'Bay Ridge',
             'Bay Terrace/Fort Totten': 'Bay Terrace',
             'Bayside': 'Bayside',
             'Bedford': 'Bedford-Stuyvesant',
             'Bedford Park': 'Borough Park',
             'Bellerose': 'Bellerose',
             'Belmont': 'Belmont',
             'Bensonhurst East': 'Bensonhurst',
             'Bensonhurst West': 'Bensonhurst',
             'Bloomfield/Emerson Hill': 'Emerson Hill',
             

#### Adding neighborhoods from GeoJSON file

In [55]:
# Now this dictionary needs to be applied to the uber 2015
# dataframe to translate the taxi neighborhoods into geojson
# neighborhoods

df['Loc'] = df['Zone'].map(lambda x: top_match[x])
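
Because `top_match` is a `defaultdict(str)`, looking up a zone that was never matched returns an empty string instead of raising an error. One way to surface such zones is to map with a plain dict, which leaves unmapped values as `NaN`. The sketch below uses a hypothetical lookup and series, with one zone deliberately left out.

```python
import pandas as pd

# Hypothetical lookup that is missing one of the zones
lookup = {'Lenox Hill West': 'Upper East Side', 'Astoria Park': 'Astoria'}
zones = pd.Series(['Lenox Hill West', 'Astoria Park', 'Not In Lookup'])

# With a plain dict, Series.map leaves unmapped values as NaN
mapped = zones.map(lookup)

# Zones that have no translation
missing = zones[mapped.isna()].unique().tolist()
print(missing)
```

With the real data, `dict(top_match)` could be passed in place of `lookup` to run the same check.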

In [60]:
df.sample(5)

Unnamed: 0,Dispatching_base_num,Pickup_date,locationID,Borough,Zone,Loc
7377734,B02682,2015-06-30 11:26:00,239,Manhattan,Upper West Side South,Upper West Side
5690482,B02764,2015-05-24 17:51:00,97,Brooklyn,Fort Greene,Fort Greene
4500371,B02764,2015-02-25 19:39:03,75,Manhattan,East Harlem South,East Harlem
76326,B02764,2015-04-15 08:27:00,141,Manhattan,Lenox Hill West,Upper East Side
9511832,B02617,2015-02-08 19:21:10,121,Queens,Hillcrest/Pomonok,Flushing


In [61]:
# drop redundant columns
df.drop(labels=['Zone'], axis=1, inplace=True)

In [69]:
# count unmatched rows (Loc == '0') before dropping them
len(df[df.Loc == '0'])

6369

In [72]:
# drop unmatched rows
df = df[df.Loc != '0']

In [73]:
# write to a new CSV file
write_name = 'data/uber-tlc-foil-response/uber-trip-data/uber-janjune-15-loc.csv'
write_to_gcs(df, write_name)

## Citibike Data

In [26]:
filename_template = 'data/citibike/{}-citibike-tripdata.csv'
writename_template = 'data/citibike/with_loc/{}-citibike-tripdata.csv'
times = ['201404', '201405', '201406', '201407', '201408', 
         '201409', '201501', '201502', '201503', '201504', 
         '201505', '201506']

filenames = [filename_template.format(t) for t in times]
writenames = [writename_template.format(t) for t in times]
pprint(filenames)

['data/citibike/201404-citibike-tripdata.csv',
 'data/citibike/201405-citibike-tripdata.csv',
 'data/citibike/201406-citibike-tripdata.csv',
 'data/citibike/201407-citibike-tripdata.csv',
 'data/citibike/201408-citibike-tripdata.csv',
 'data/citibike/201409-citibike-tripdata.csv',
 'data/citibike/201501-citibike-tripdata.csv',
 'data/citibike/201502-citibike-tripdata.csv',
 'data/citibike/201503-citibike-tripdata.csv',
 'data/citibike/201504-citibike-tripdata.csv',
 'data/citibike/201505-citibike-tripdata.csv',
 'data/citibike/201506-citibike-tripdata.csv']


In [29]:
open_add_write(filenames, writenames, 
               lat_col = 'start station latitude', 
               lon_col = 'start station longitude', 
               new_col = 'start_loc')

data/citibike/201404-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201405-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201406-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201407-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201408-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201409-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201501-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201502-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201503-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

data/citibike/201504-citibike-tripdata.csv
>> Opening
>> Adding neighborhoods
>> W

## MTA Stations

In [9]:
open_names = ['data/NYC_Transit_Subway_Entrance_And_Exit_Data.csv']
write_names = ['data/nyc_subway_with_loc.csv']

In [20]:
open_add_write(open_names, write_names, 
               lat_col = 'Station Latitude', 
               lon_col = 'Station Longitude', 
               new_col = 'loc')

data/NYC_Transit_Subway_Entrance_And_Exit_Data.csv
>> Opening
>> Adding neighborhoods
>> Writing back to GCS

Read 1,868 rows
Wrote back 1,868 rows
