## How it works:

1. We read the DPR calls.
2. We split them into geolocated and name-located calls.
3. Geolocated calls are spatially joined with the park properties and take the name of the property they fall in.
4. We then concatenate the two groups and build a list of unique names, **fuzzyNames**.
5. Next, we read the database of properties and wrap each record's Property ID in a one-element list.
6. We also create an empirically defined dataframe of large parks that contain more than one property. For each of these parks we pass a list of properties instead of a single one. The two dataframes are then joined into **prop2**: each record has a *type* attribute showing whether it came from the original database or from the homebrewed one.
7. At this point we use the ontology we built earlier (the ontology for Districts). We load it, manually perform some aggregation, and attach a **fuzz** name to each property.
8. Then we create a custom "fuzzname cleaning" function to apply to the calls.
9. We try to match the unique call locations with our properties. To improve matching, we run **fuzzywuzzy process.extractOne** on the unmatched names; this helps us improve the cleaning function.
10. However, we still fail to recognize as many as 350 fuzz names: most of them are simply not in our Properties database, or their official name has changed or differs.
11. **Because many calls are named after a large park rather than a specific zone, every call picks a random element from its *property_id* list. Most calls have only one element in the list, so they always select it.**
12. Both the ontology pairs and the matched calls are saved as CSV files.
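
Step 11 is described above but not spelled out in code in this notebook, so here is a minimal sketch of what the random pick could look like. The function name `pick_property` and the seeded `random.Random` instance are assumptions made for illustration; the IDs are taken from outputs shown further down.

```python
import random


def pick_property(property_ids, rng=random):
    """Return one property ID from the list, or None if the list is empty."""
    if not property_ids:
        return None
    # A one-element list always yields that element;
    # a multi-zone park yields one of its zones at random.
    return rng.choice(property_ids)


rng = random.Random(2015)  # seeded only so that reruns are reproducible

single = pick_property(['Q082'], rng)
zone = pick_property(['B073-02D', 'B073-02', 'B073-10'], rng)
nothing = pick_property([], rng)
```

On a dataframe this could be applied to the list column with something like `calls3.ParkID.apply(pick_property)`, provided unmatched (NaN) rows are filtered out first.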

In [1]:
__author__ = "me"
__date__ = "2015_10_13"

%pylab inline
import pandas as pd
import geopandas as gp
import numpy as np
import random

import pylab as plt
import os

from geopandas.tools import sjoin
from shapely.geometry import Point

from fuzzywuzzy import process

import requests
try:
    # optional: pull a shared matplotlib style; skip it if the request fails
    s = requests.get("https://raw.githubusercontent.com/Casyfill/CUSP_templates/master/Py/fbMatplotlibrc.json").json()
    plt.rcParams.update(s)
except Exception:
    pass


np.random.seed(2015)

PARQA = os.getenv('PARQA')

Populating the interactive namespace from numpy and matplotlib




## Split calls into named and geolocated

In [2]:
calls = pd.read_csv(PARQA + 'data/311/311DPR.csv', encoding='utf8', na_values='Unspecified')



In [3]:
# take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
myCalls = calls[['Park Facility Name','Descriptor','Created Date','Closed Date','Longitude','Latitude','Location Type', 'Complaint Type']].copy()
myCalls['Park Facility Name'] = myCalls['Park Facility Name'].str.lower()
myCalls['Park Facility Name'].head()


namedCalls = myCalls[pd.notnull(myCalls['Park Facility Name'])]
geoCalls = myCalls[(pd.isnull(myCalls['Park Facility Name'])) & (pd.notnull(myCalls.Latitude))]



In [4]:
geoCalls.head()

Unnamed: 0,Park Facility Name,Descriptor,Created Date,Closed Date,Longitude,Latitude,Location Type,Complaint Type
0,,Snow or Ice,12/31/2010 09:04:48 PM,01/03/2011 12:03:59 PM,-73.93112,40.668798,Park,Maintenance or Facility
3,,Snow or Ice,12/31/2010 03:36:37 PM,01/03/2011 09:41:24 AM,-73.962835,40.688556,Park,Maintenance or Facility
4,,Snow or Ice,12/31/2010 03:03:16 PM,01/03/2011 12:15:38 PM,-73.999809,40.636935,Park,Maintenance or Facility
6,,Snow or Ice,12/31/2010 12:59:59 PM,01/03/2011 12:23:04 PM,-73.999456,40.609951,Park,Maintenance or Facility
7,,Snow or Ice,12/31/2010 12:12:02 PM,01/03/2011 12:19:51 PM,-73.977616,40.633153,Park,Maintenance or Facility


## GeoCalls: spatial join with parks to get parkName

In [5]:
parks = gp.read_file(PARQA + 'data/SHP/DPR_ParksProperties_001/DPR_ParksProperties_001.shp')[['geometry','SIGNNAME']]

In [6]:
parks.columns

Index([u'geometry', u'SIGNNAME'], dtype='object')

In [7]:
def toGeoDataFrame(df, lat='Latitude',lon='Longitude'):
    '''dataframe to geodataframe (points in WGS84)'''
    df = df.copy()  # work on a copy so the caller's slice is not modified
    df['geometry'] = df.apply(lambda z: Point(z[lon], z[lat]), axis=1)
    df = gp.GeoDataFrame(df)
    df.crs = {'init': 'epsg:4326', 'no_defs': True}
    return df

In [8]:
geoCalls = toGeoDataFrame(geoCalls).to_crs(parks.crs)



In [9]:
geoCalls = sjoin(geoCalls, parks, how="left").to_crs(epsg=4326)
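
For readers without geopandas at hand, the per-call logic behind the left spatial join above can be approximated in plain Python. This is a simplified stand-in, not what `sjoin` does internally: it uses a ray-casting point-in-polygon test, ignores holes, multipolygons and spatial indexing, and the function names and the square "toy park" are made up for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' each time a ray going right from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / float(y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_park(lon, lat, parks):
    """Return the name of the first park containing the point, or None (like a left join)."""
    for name, polygon in parks:
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical data: one square park and two calls.
toy_parks = [('toy park', [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])]
inside_call = assign_park(1.0, 1.0, toy_parks)
outside_call = assign_park(3.0, 1.0, toy_parks)
```

Calls that fall outside every polygon get no name here, which mirrors the rows with a null `SIGNNAME` that the next cell drops.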

In [10]:
geoCalls = geoCalls[pd.notnull(geoCalls.SIGNNAME)]
geoCalls['Park Facility Name'] = geoCalls['SIGNNAME']
geoCalls = geoCalls[['Park Facility Name','Descriptor','Created Date','Closed Date','Longitude','Latitude','Location Type', 'Complaint Type']]

In [11]:
calls2 = pd.concat([namedCalls, geoCalls])

In [12]:
calls2.head()

Unnamed: 0,Park Facility Name,Descriptor,Created Date,Closed Date,Longitude,Latitude,Location Type,Complaint Type
1,geo soilan park - battery park city,Graffiti or Vandalism,12/31/2010 04:31:52 PM,12/31/2010 05:36:58 PM,,,Park,Maintenance or Facility
2,brookville park,Snow or Ice,12/31/2010 04:17:22 PM,01/06/2011 08:58:30 AM,,,Park,Maintenance or Facility
5,highland park,Snow or Ice,12/31/2010 02:57:34 PM,01/03/2011 11:31:26 AM,,,Park,Maintenance or Facility
10,prospect park - east parade grounds,Dead Animal,12/31/2010 11:26:34 AM,01/03/2011 11:12:37 AM,,,Park,Animal in a Park
11,central park - east 96th street playground,Snow or Ice,12/31/2010 11:18:31 AM,01/04/2011 12:11:02 PM,,,Park,Maintenance or Facility


In [13]:
calls2.shape

(55095, 8)

## Now match those names

In [77]:
calls2['Park Facility Name'] = calls2['Park Facility Name'].str.lower()

In [78]:
fuzzyNames = pd.DataFrame(calls2['Park Facility Name'].unique())

fuzzyNames['try'] = 'try'
fuzzyNames.rename(columns={0:'Park Facility Name'}, inplace=True)

In [99]:
# fuzzyNames[fuzzyNames['Park Facility Name'].str.contains('-')]

## ParkID

In [647]:
prop = pd.read_excel(PARQA + 'data/Input/Parks_Data/CUSP_Adjusted_Spatial_Data.xlsx')[['ParkID','Name']]
prop = prop.dropna()
prop.Name = prop.Name.str.lower()
prop['type']='pid'

In [648]:
def trySplit(x, spl='-'):
    '''get rid of addons'''
    if spl in x:
        return x.split(spl)[0].strip()
    else:
        return x

In [649]:
prop[prop.Name.str.contains('wayanda park')]

Unnamed: 0,ParkID,Name,type
2062,Q082,wayanda park,pid


In [650]:
### empirical dictionary

d = [
    {'ParkID':'B073','type':'abstr','Name': 'prospect park'},
    {'ParkID':'M010','type':'abstr','Name': 'central park'},
    {'ParkID':'Q004','type':'abstr','Name': 'astoria park'},
    {'ParkID':'X010','type':'abstr','Name': 'crotona park'},
    {'ParkID':'Q162','type':'abstr', 'Name': 'rockaway beach boardwalk'},
    {'ParkID':'M014','type':'abstr', 'Name': 'jackie robinson park'},
    {'ParkID':'M028','type':'abstr', 'Name': 'fort washington park'},
    {'ParkID':'M037','type':'abstr', 'Name': 'highbridge park'},
    {'ParkID':'M098','type':'abstr', 'Name': 'washington square park'},
    {'ParkID':'M105','type':'abstr', 'Name': 'sara d. roosevelt park'},
    {'ParkID':'M107','type':'abstr', 'Name': "hell's kitchen park"},
    {'ParkID':'M283','type':'abstr', 'Name': 'battery park city'},
    {'ParkID':'Q001','type':'abstr', 'Name': 'alley pond park'},
    {'ParkID':'Q005','type':'abstr', 'Name': 'baisley pond park'},
    {'ParkID':'Q009','type':'abstr', 'Name': 'macneil park'},
    {'ParkID':'Q012','type':'abstr', 'Name': 'crocheron park'},
    {'ParkID':'Q020','type':'abstr', 'Name': 'highland park'},
    {'ParkID':'Q021','type':'abstr', 'Name': 'cunningham park'},
    {'ParkID':'Q024','type':'abstr', 'Name': 'kissena park'},
    {'ParkID':'Q102','type':'abstr', 'Name': 'juniper valley park'},
    {'ParkID':'R129','type':'abstr', 'Name': 'greenbelt native plant center'},
    {'ParkID':'B058','type':'abstr', 'Name': 'mccarren park'},
    {'ParkID':'M071','type':'abstr', 'Name': 'riverside park'},
    {'ParkID':'M360','type':'abstr', 'Name': 'the high line'},
    {'ParkID':'X001','type':'abstr', 'Name': 'aqueduct walk'},
    {'ParkID':'X092','type':'abstr', 'Name': 'van cortlandt park'},
    {'ParkID':'Q099','type':'abstr', 'Name': 'flushing meadows corona park'},
    {'ParkID':'X039','type':'abstr', 'Name': 'pelham bay park'},
    {'ParkID':'Q015','type':'abstr', 'Name': 'forest park'},
    {'ParkID':'M042','type':'abstr', 'Name': 'inwood hill park'},
    {'ParkID':'B057','type':'abstr', 'Name': 'marine park'},
    {'ParkID':'B126','type':'abstr', 'Name': 'red hook park'},
    {'ParkID':'Q300','type':'abstr', 'Name': 'kissena corridor park'},
    {'ParkID':'X045','type':'abstr', 'Name': "st mary's playground"},
    {'ParkID':'X002','type':'abstr', 'Name': "bronx park"},
    {'ParkID':'M058','type':'abstr', 'Name': "marcus garvey park"},
    {'ParkID':'B371','type':'abstr', 'Name': "spring creek park"},
    {'ParkID':'B371','type':'abstr', 'Name': "spring creek park"},
    {'ParkID':'X039','type':'abstr', 'Name': "orchard beach and promenade"},
    {'ParkID':'M029','type':'abstr', 'Name': "fort tryon park"},
    {'ParkID':'R005','type':'abstr', 'Name': "clove lakes park"},
    {'ParkID':'X142','type':'abstr', 'Name': "riverdale park"},
    {'ParkID':'B082','type':'abstr', 'Name': "shore road park"},
    {'ParkID':'B431','type':'abstr', 'Name': "brooklyn bridge park"},
    {'ParkID':'R065','type':'abstr', 'Name': "willowbrook park"},
    {'ParkID':'B029','type':'abstr', 'Name': "eastern parkway"},
    {'ParkID':'M037','type':'abstr', 'Name': "highbridge park"},
    {'ParkID':'Q461','type':'abstr', 'Name': "powell's cove"},
    {'ParkID':'X002','type':'abstr', 'Name': "bronx river park"},
    {'ParkID':'X203','type':'abstr', 'Name': "randall's island park"},
    {'ParkID':'M077','type':'abstr', 'Name': "st nicholas park"},
    {'ParkID':'B018','type':'abstr', 'Name': "canarsie park"},
    {'ParkID':'B028','type':'abstr', 'Name': "dyker beach park"},
    {'ParkID':'B054','type':'abstr', 'Name': "lincoln terrace park / arthur s. somers plgd"}
    
    

]

abstr = pd.DataFrame(d)

In [651]:
def getIDList(ID):
    '''get list of pID for this park'''
    return prop.ParkID[prop.ParkID.str.startswith(ID)].tolist()

In [652]:
abstr.ParkID = abstr.ParkID.apply(getIDList)
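
To make the prefix expansion concrete, here is a small standalone sketch of what `getIDList` does for one large park. The helper name `get_id_list` and the short `all_ids` list are illustrative assumptions; the IDs themselves are copied from outputs elsewhere in this notebook.

```python
def get_id_list(prefix, all_ids):
    """Return every property ID that starts with the given park prefix."""
    return [pid for pid in all_ids if pid.startswith(prefix)]


# A tiny stand-in for prop.ParkID
all_ids = ['B073-02D', 'B073-02', 'B073-10', 'Q082', 'Q364']

prospect_zones = get_id_list('B073', all_ids)
wayanda = get_id_list('Q082', all_ids)
unknown = get_id_list('Z999', all_ids)  # a prefix with no properties gives an empty list
```

Two entries of the empirical dictionary that share a prefix (for example pelham bay park and orchard beach and promenade, both `X039`) end up with the same list of zones.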

In [653]:
prop.ParkID = prop.ParkID.apply(lambda x: [x])

In [654]:
prop2 = pd.concat([abstr, prop])

In [655]:
prop2[prop2.Name.str.contains('marie curie')]

Unnamed: 0,Name,ParkID,type
1954,marie curie park,[Q364],pid


## Check how ontology works

In [663]:
ontoMask = pd.read_csv(PARQA + 'parqa/311/ONTOLOGY/Ontology_districts.csv')[['cleanName','NAME']]
ontoMask.rename(columns={'NAME':'Name','cleanName':'fuzz'}, inplace=True)
ontoMask.head(2)

Unnamed: 0,fuzz,Name
0,geo soilan park - battery park city,battery park city
1,battery park city,battery park city


In [664]:
onto2 = prop2.merge(ontoMask, on='Name', how='left')
# properties without an ontology entry fall back to their own name
onto2['fuzz'] = onto2['fuzz'].fillna(onto2['Name'])

In [665]:
onto2.rename(inplace=True, columns={'newName':'Name'})

In [666]:
onto2[onto2['Name']=='long pond']

Unnamed: 0,Name,ParkID,type,fuzz


In [667]:
def cleanNames(series):
    '''clean series of park names'''
    series = series.str.lower().str.strip() #.str.replace("'","")
    
    ## to replace with loop
    series[series.str.contains('mccarren park')] = 'mccarren park'
    series[series.str.contains('hunt')] = 'hunts point riverside park'
    series[series.str.contains('waring')] = 'waring plgd'
    series[series.str.contains('red hook')] = 'red hook park'
    series[series.str.contains('rockaway beach and boardwalk')] = 'rockaway beach boardwalk'
    series[series.str.contains('fort tryon')] = 'fort tryon park'
    series[series.str.contains('clove lakes')] = 'clove lakes park'
    series[series.str.contains('riverdale park')] = 'riverdale park'
    series[series.str.contains('battery park city')] = 'battery park city'
    series[series.str.contains('riverside park')] = 'riverside park'
    series[series.str.contains('shore road park')] = 'shore road park'
    series[series.str.contains('brooklyn bridge park')] = 'brooklyn bridge park'
    series[series.str.contains('allison')] = 'allison park'
    series[series.str.contains('willowbrook')] = 'willowbrook park'
    series[series.str.contains('marcus garvey park')] = 'marcus garvey park'
    series[series.str.contains('marie curie')] = 'marie curie park'
    series[series.str.contains('eastern parkway')] = 'eastern parkway malls'
    series[series.str.contains('highbridge park')] = 'highbridge park'
    series[series.str.contains("powell's cove")] = "powell's cove"
    series[series.str.contains("powells cove")] = "powell's cove"
    series[series.str.contains("randall's island")] = "randall's island park"
    series[series.str.contains("nicholas park")] = "st nicholas park"
    series[series.str.contains("clearview park")] = "clearview park"
    series[series.str.contains("dinapoly")] = "dinapoly plgd"
    series[series.str.contains("luna park")] = "luna park"
    series[series.str.contains("john jay")] = "john jay park"
    series[series.str.contains("cooper park")] = "cooper park"
    series[series.str.contains("joan of arc")] = "joan of arc island"
    series[series.str.contains("somers")] = "lincoln terrace park / arthur s. somers plgd"
    series[series.str.contains("lincoln ter")] = "lincoln terrace park / arthur s. somers plgd"
    series[series.str.contains("lindsay")] = "lindsay triangle"
    series[series.str.contains("saratoga square park")] = "saratoga park"
    series[series.str.contains("wayanda park")] = "wayanda park"
    series[series.str.contains("barrett")] = "barretto park"
    series[series.str.contains("baisley pond park")] = "baisley pond park"
    


    series[series.str.contains('pool -')] = series[series.str.contains('pool -')].str.replace('pool -','').str.strip()
 
    
    return series
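
The `## to replace with loop` note in the function above could be addressed roughly as follows. This is only a sketch: it works on a single string rather than a Series, the names `RULES` and `clean_name` are made up, and only a handful of the rules are copied over. Rules are applied in order without an early exit, which matches the sequential assignments in `cleanNames`.

```python
# (substring to look for, canonical name) pairs, applied in order
RULES = [
    ('mccarren park', 'mccarren park'),
    ('red hook', 'red hook park'),
    ('fort tryon', 'fort tryon park'),
    ("powell's cove", "powell's cove"),
    ('powells cove', "powell's cove"),
    ('saratoga square park', 'saratoga park'),
]


def clean_name(name, rules=RULES):
    """Lower-case, strip and canonicalize one park name."""
    name = name.lower().strip()
    for needle, canonical in rules:
        if needle in name:
            name = canonical
    # same final step as cleanNames: drop the 'pool -' prefix
    if 'pool -' in name:
        name = name.replace('pool -', '').strip()
    return name


cleaned = [clean_name(n) for n in ['  Red Hook Recreation Area ',
                                   'pool - lasker in central park',
                                   'wayanda park']]
```

On a pandas Series the same function could be applied with `series.map(clean_name, na_action='ignore')`, so adding a rule means adding one tuple rather than one more line of code.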


In [668]:
fuzzyNames['fuzz'] = cleanNames(fuzzyNames['Park Facility Name'])

## Checking ontology

In [688]:
x = fuzzyNames.merge(onto2, how='left', on='fuzz')

print(len(x[(pd.isnull(x.ParkID))&(x['Park Facility Name'].str.contains('park'))]))
z = x[(pd.isnull(x.ParkID))&(x['Park Facility Name'].str.contains('park'))].sort_values('fuzz')[['Park Facility Name','fuzz']]
x[(pd.isnull(x.ParkID))].sort_values('fuzz')

60


350

## Trying to fuzzy-match names

In [671]:
z['try'] = z.fuzz.apply(lambda x: process.extractOne(x, onto2.fuzz.tolist()))

In [672]:
z['ratio'] = z['try'].str.get(1)
z['name'] = z['try'].str.get(0)
z.sort_values('ratio',ascending=False).head(10)

Unnamed: 0,Park Facility Name,fuzz,try,ratio,name
1951,wolfe's pond park,wolfe's pond park,"(wolfes pond park, 97)",97,wolfes pond park
2008,j. hood wright park,j. hood wright park,"(j hood wright park, 97)",97,j hood wright park
493,ferry point park,ferry point park,"(ferry point park zone 1, 95)",95,ferry point park zone 1
794,richman echo park,richman echo park,"(richman (echo) park upper, 95)",95,richman (echo) park upper
1931,paerdegat basin park,paerdegat basin park,"(paerdegat park, 95)",95,paerdegat park
1324,ocean parkway malls,ocean parkway malls,"(ocean parkway malls zone 1, 95)",95,ocean parkway malls zone 1
1989,mosholu parkway,mosholu parkway,"(mosholu parkway zone 1, 95)",95,mosholu parkway zone 1
403,little bay park,little bay park,"(little bay park zone 1, 95)",95,little bay park zone 1
1347,laurelton parkway,laurelton parkway,"(laurelton parkway west, 95)",95,laurelton parkway west
786,pool - lasker in central park,lasker in central park,"(pool - lasker in central park, 95)",95,pool - lasker in central park
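
The table above suggests that high-scoring candidates are mostly spelling variants or zone suffixes. One way to act on that is to accept a fuzzy match only above a score threshold. The sketch below uses the standard library's `difflib` as a stand-in for fuzzywuzzy, so its scores will not be identical to those of `process.extractOne`; the function names and the threshold of 90 are assumptions, not values used in this notebook.

```python
import difflib


def best_match(name, candidates):
    """Return (candidate, score 0-100) for the most similar candidate, or (None, 0)."""
    if not candidates:
        return None, 0
    scored = [(difflib.SequenceMatcher(None, name, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match, int(round(score * 100))


def match_or_none(name, candidates, threshold=90):
    """Keep the best match only if it clears the threshold."""
    match, score = best_match(name, candidates)
    return match if score >= threshold else None


known = ['wolfes pond park', 'paerdegat park', 'prospect park']
accepted = match_or_none("wolfe's pond park", known)
rejected = match_or_none('some unrelated name', known)
```

Names that come back as `None` would still need a manual rule in the cleaning function or a new entry in the ontology.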


## Creating the ontology file

In [674]:
ontology =  onto2.merge(fuzzyNames, how='left', on='fuzz')

In [678]:
ontology.head()

Unnamed: 0,Name,ParkID,type,fuzz
0,prospect park,"[B073-02D, B073-02, B073-10, B073-20, B073-09,...",abstr,prospect park - east parade grounds
1,prospect park,"[B073-02D, B073-02, B073-10, B073-20, B073-09,...",abstr,prospect park - grand army plaza
2,prospect park,"[B073-02D, B073-02, B073-10, B073-20, B073-09,...",abstr,mt prospect park
3,prospect park,"[B073-02D, B073-02, B073-10, B073-20, B073-09,...",abstr,prospect park
4,prospect park,"[B073-02D, B073-02, B073-10, B073-20, B073-09,...",abstr,prospect park - harmony playground


In [677]:
ontology[['Name','ParkID','type','fuzz']].to_csv(PARQA + 'parqa/311/ONTOLOGY/ontology_districts.csv', encoding='utf8')

In [681]:
calls2['Park Facility Name']  = cleanNames(calls2['Park Facility Name'])
calls3 = calls2.merge(onto2.rename(columns={'fuzz':'Park Facility Name'}), on='Park Facility Name',how='left')

In [683]:
print(len(calls3))
print(len(calls3[pd.isnull(calls3.type)]))

75212
6841


In [684]:
calls3[calls3.Descriptor == 'Garbage or Litter'].to_csv(PARQA + 'data/311/311_rPID_litter.csv')
calls3.to_csv(PARQA + 'data/311/311_rPID_all.csv')