## How it works:

1. We read the DPR calls.
2. We split them into geolocated and name-located calls.
3. For geolocated calls, we spatially join them with the park property polygons and assign each call the property name.
4. We then concatenate the two groups and create a list of unique names, **fuzzyNames**.
5. Next, we read the database of properties and, for each record, wrap its Property ID in a one-element list.
6. We also create an empirically defined dataframe of large parks that contain more than one property. For each such park we store a list of property IDs instead of a single one. We then concatenate the two dataframes into **prop2**; each record has a *type* attribute showing whether it came from the original database or from the hand-made one.
7. At this point, we use the ontology built earlier (the ontology for Districts). We load this ontology and merge it with **prop2**, so that each property carries the name used in the calls.
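
As a rough illustration of steps 2 and 4, here is a minimal sketch on a toy table. All rows are made up, and the name assigned to the geolocated call is a hard-coded placeholder standing in for the result of the spatial join in step 3.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 311 DPR table; all values are made up for illustration.
toy_calls = pd.DataFrame({
    'Park Facility Name': ['Prospect Park', None, None, 'central park'],
    'Latitude':  [np.nan, 40.67, np.nan, np.nan],
    'Longitude': [np.nan, -73.93, np.nan, np.nan],
})

has_name = toy_calls['Park Facility Name'].notnull()
has_point = toy_calls['Latitude'].notnull()

named = toy_calls[has_name].copy()             # step 2: name-located calls
geo = toy_calls[~has_name & has_point].copy()  # step 2: geolocated calls
dropped = toy_calls[~has_name & ~has_point]    # neither: cannot be located

# Step 3 would fill this in from a spatial join; here it is a placeholder.
geo['Park Facility Name'] = 'placeholder park'

# Step 4: recombine and collect the unique lower-cased names.
combined = pd.concat([named, geo])
unique_names = sorted(combined['Park Facility Name'].str.lower().unique())
print(unique_names)
```

Calls with neither a facility name nor coordinates fall into `dropped` and take no further part in the matching.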

In [108]:
__author__ = "me"
__date__ = "2015_10_13"

%pylab inline
import pandas as pd
import geopandas as gp
import numpy as np
import random

import pylab as plt
import os

from geopandas.tools import sjoin
from shapely.geometry import Point

from fuzzywuzzy import process

import requests
try:
    s = requests.get("https://raw.githubusercontent.com/Casyfill/CUSP_templates/master/Py/fbMatplotlibrc.json").json()
    plt.rcParams.update(s)
except Exception:  # styling is optional; ignore network or JSON errors
    pass


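# seed both generators: NumPy's and the stdlib `random` used for random.choice later on
np.random.seed(2015)
random.seed(2015)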

PARQA = os.getenv('PARQA')

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


## Split calls into named and geolocated

In [109]:
calls = pd.read_csv(PARQA + 'data/311/311DPR.csv', encoding='utf8', na_values='Unspecified')

In [110]:
columns = ['Park Facility Name','Descriptor','Created Date','Closed Date','Longitude','Latitude','Location Type', 'Complaint Type']
# .copy() makes myCalls an independent frame, so the assignment below does not
# trigger SettingWithCopyWarning
myCalls = calls[columns].copy()
myCalls['Park Facility Name'] = myCalls['Park Facility Name'].str.lower()

# calls that carry a facility name vs. calls that only carry coordinates
namedCalls = myCalls[pd.notnull(myCalls['Park Facility Name'])]
geoCalls = myCalls[(pd.isnull(myCalls['Park Facility Name'])) & (pd.notnull(myCalls.Latitude))]


In [111]:
geoCalls.head()

Unnamed: 0,Park Facility Name,Descriptor,Created Date,Closed Date,Longitude,Latitude,Location Type,Complaint Type
0,,Snow or Ice,12/31/2010 09:04:48 PM,01/03/2011 12:03:59 PM,-73.93112,40.668798,Park,Maintenance or Facility
3,,Snow or Ice,12/31/2010 03:36:37 PM,01/03/2011 09:41:24 AM,-73.962835,40.688556,Park,Maintenance or Facility
4,,Snow or Ice,12/31/2010 03:03:16 PM,01/03/2011 12:15:38 PM,-73.999809,40.636935,Park,Maintenance or Facility
6,,Snow or Ice,12/31/2010 12:59:59 PM,01/03/2011 12:23:04 PM,-73.999456,40.609951,Park,Maintenance or Facility
7,,Snow or Ice,12/31/2010 12:12:02 PM,01/03/2011 12:19:51 PM,-73.977616,40.633153,Park,Maintenance or Facility


## GeoCalls: spatial join with parks to get parkName

In [112]:
parks = gp.read_file(PARQA + 'data/SHP/DPR_ParksProperties_001/DPR_ParksProperties_001.shp')[['geometry','SIGNNAME']]

In [6]:
parks.columns

Index([u'geometry', u'SIGNNAME'], dtype='object')

In [7]:
def toGeoDataFrame(df, lat='Latitude', lon='Longitude'):
    '''Convert a dataframe with lat/lon columns to a GeoDataFrame (WGS84).'''
    df = df.copy()  # work on a copy so the caller's frame is not modified
    df['geometry'] = df.apply(lambda z: Point(z[lon], z[lat]), axis=1)
    df = gp.GeoDataFrame(df)
    df.crs = {'init': 'epsg:4326', 'no_defs': True}
    return df

In [8]:
geoCalls = toGeoDataFrame(geoCalls).to_crs(parks.crs)


In [9]:
geoCalls = sjoin(geoCalls, parks, how="left").to_crs(epsg=4326)
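
To make the effect of the left spatial join concrete, here is a small self-contained sketch. The two square "parks" and the three points are invented, and the sketch relies on the default join predicate of `geopandas.sjoin`. Points that fall outside every polygon keep a missing `SIGNNAME`, which is what the next cell filters out.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two made-up square "parks" and three made-up call locations.
toy_parks = gpd.GeoDataFrame(
    {'SIGNNAME': ['Park A', 'Park B']},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])],
    crs='EPSG:4326',
)
toy_points = gpd.GeoDataFrame(
    {'call_id': [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(5, 5)],
    crs='EPSG:4326',
)

# A left join keeps every call; calls outside all polygons get NaN in SIGNNAME.
joined = gpd.sjoin(toy_points, toy_parks, how='left')
located = joined[joined['SIGNNAME'].notnull()]
print(located[['call_id', 'SIGNNAME']])
```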

In [10]:
geoCalls = geoCalls[pd.notnull(geoCalls.SIGNNAME)]
geoCalls['Park Facility Name'] = geoCalls['SIGNNAME']
geoCalls = geoCalls[['Park Facility Name','Descriptor','Created Date','Closed Date','Longitude','Latitude','Location Type', 'Complaint Type']]

In [113]:
calls2 = pd.concat([namedCalls, geoCalls])

In [114]:
calls2.head()

Unnamed: 0,Park Facility Name,Descriptor,Created Date,Closed Date,Longitude,Latitude,Location Type,Complaint Type
1,geo soilan park - battery park city,Graffiti or Vandalism,12/31/2010 04:31:52 PM,12/31/2010 05:36:58 PM,,,Park,Maintenance or Facility
2,brookville park,Snow or Ice,12/31/2010 04:17:22 PM,01/06/2011 08:58:30 AM,,,Park,Maintenance or Facility
5,highland park,Snow or Ice,12/31/2010 02:57:34 PM,01/03/2011 11:31:26 AM,,,Park,Maintenance or Facility
10,prospect park - east parade grounds,Dead Animal,12/31/2010 11:26:34 AM,01/03/2011 11:12:37 AM,,,Park,Animal in a Park
11,central park - east 96th street playground,Snow or Ice,12/31/2010 11:18:31 AM,01/04/2011 12:11:02 PM,,,Park,Maintenance or Facility


In [115]:
calls2.shape

(79058, 8)

## Now match those names

In [116]:
calls2['Park Facility Name'] = calls2['Park Facility Name'].str.lower()

In [149]:
fuzzyNames = pd.DataFrame(calls2['Park Facility Name'].unique())

fuzzyNames['try'] = 'try'
fuzzyNames.rename(columns={0:'Park Facility Name'}, inplace=True)
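
The notebook imports `fuzzywuzzy.process`, while the matching further down relies on the ontology table. As a rough idea of what fuzzy matching of such names involves, here is a sketch that uses the standard library's `difflib` as a stand-in. It is not fuzzywuzzy, and not necessarily how the ontology was built; the helper `closest_name`, the cutoff of 0.8 and the candidate list are assumptions made for illustration.

```python
import difflib

def closest_name(query, candidates, cutoff=0.8):
    """Return the best candidate scoring at least `cutoff`, or None (hypothetical helper)."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

property_names = ['prospect park', 'central park', 'highland park']

# A misspelled name still finds its property.
print(closest_name('prospct park', property_names))

# A name with no close counterpart yields None instead of a wrong match.
print(closest_name('brooklyn bridge park', property_names))
```

A fairly strict cutoff trades recall for precision: unmatched names can be reviewed by hand, whereas silent wrong matches are hard to spot later.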

## ParkID

In [118]:
prop = pd.read_excel(PARQA + 'data/Input/Parks_Data/CUSP_Adjusted_Spatial_Data.xlsx')[['ParkID','Name']]
prop = prop.dropna()
prop.Name = prop.Name.str.lower()
prop['type']='pid'

In [119]:
def trySplit(x, spl='-'):
    '''get rid of addons'''
    if spl in x:
        return x.split(spl)[0].strip()
    else:
        return x
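
A quick sketch of what this helper does to call names like the ones shown in `calls2.head()`. The function is re-declared here under the hypothetical name `try_split` so that the snippet runs on its own.

```python
def try_split(name, sep='-'):
    """Keep only the text before the first separator (same idea as trySplit)."""
    return name.split(sep)[0].strip() if sep in name else name

examples = [
    'prospect park - east parade grounds',
    'central park - east 96th street playground',
    'geo soilan park - battery park city',
    'brookville park',
]
for raw in examples:
    print(repr(raw), '->', repr(try_split(raw)))
```

Note the third example: the part before the dash is `geo soilan park`, while the park-level name is the part after it. Cases like this are one reason a plain split is not enough for every name.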

In [171]:
prop[prop.Name.str.contains('fort tryon park')]

Unnamed: 0,ParkID,Name,type
457,[M029-ZN01],fort tryon park zone 1,pid
458,[M029-ZN02],fort tryon park zone 2,pid
459,[M029-ZN03],fort tryon park zone 3,pid
460,[M029-ZN04],fort tryon park zone 4,pid
461,[M029-ZN05],fort tryon park zone 5,pid
462,[M029-ZN06],fort tryon park zone 6,pid


In [120]:
### empirical dictionary

d = [
    {'ParkID':'B073','type':'abstr','Name': 'prospect park'},
    {'ParkID':'M010','type':'abstr','Name': 'central park'},
    {'ParkID':'Q004','type':'abstr','Name': 'astoria park'},
    {'ParkID':'X010','type':'abstr','Name': 'crotona park'},
    {'ParkID':'Q162','type':'abstr', 'Name': 'rockaway beach boardwalk'},
    {'ParkID':'M014','type':'abstr', 'Name': 'jackie robinson park'},
    {'ParkID':'M028','type':'abstr', 'Name': 'fort washington park'},
    {'ParkID':'M037','type':'abstr', 'Name': 'highbridge park'},
    {'ParkID':'M098','type':'abstr', 'Name': 'washington square park'},
    {'ParkID':'M105','type':'abstr', 'Name': 'sara d. roosevelt park'},
    {'ParkID':'M107','type':'abstr', 'Name': "hell's kitchen park"},
    {'ParkID':'M283','type':'abstr', 'Name': 'battery park city'},
    {'ParkID':'Q001','type':'abstr', 'Name': 'alley pond park'},
    {'ParkID':'Q005','type':'abstr', 'Name': 'baisley pond park'},
    {'ParkID':'Q009','type':'abstr', 'Name': 'macneil park'},
    {'ParkID':'Q012','type':'abstr', 'Name': 'crocheron park'},
    {'ParkID':'Q020','type':'abstr', 'Name': 'highland park'},
    {'ParkID':'Q021','type':'abstr', 'Name': 'cunningham park'},
    {'ParkID':'Q024','type':'abstr', 'Name': 'kissena park'},
    {'ParkID':'Q102','type':'abstr', 'Name': 'juniper valley park'},
    {'ParkID':'R129','type':'abstr', 'Name': 'greenbelt native plant center'},
    {'ParkID':'B058','type':'abstr', 'Name': 'mccarren park'},
    {'ParkID':'M071','type':'abstr', 'Name': 'riverside park'},
    {'ParkID':'M360','type':'abstr', 'Name': 'the high line'},
    {'ParkID':'X001','type':'abstr', 'Name': 'aqueduct walk'},
    {'ParkID':'X092','type':'abstr', 'Name': 'van cortlandt park'},
    {'ParkID':'Q099','type':'abstr', 'Name': 'flushing meadows corona park'},
    {'ParkID':'X039','type':'abstr', 'Name': 'pelham bay park'},
    {'ParkID':'Q015','type':'abstr', 'Name': 'forest park'},
    {'ParkID':'M042','type':'abstr', 'Name': 'inwood hill park'},
    {'ParkID':'B057','type':'abstr', 'Name': 'marine park'},
    {'ParkID':'B126','type':'abstr', 'Name': 'red hook park'},
    {'ParkID':'Q300','type':'abstr', 'Name': 'kissena corridor park'},
    {'ParkID':'X045','type':'abstr', 'Name': "st mary's playground"},
    {'ParkID':'X002','type':'abstr', 'Name': "bronx park"},
    {'ParkID':'M058','type':'abstr', 'Name': "marcus garvey park"},
    {'ParkID':'B371','type':'abstr', 'Name': "spring creek park"},
    {'ParkID':'X039','type':'abstr', 'Name': "orchard beach and promenade"},
    {'ParkID':'M029','type':'abstr', 'Name': "fort tryon park"}
    
] 

abstr = pd.DataFrame(d)

In [121]:
def getIDList(ID):
    '''get list of pID for this park'''
    return prop.ParkID[prop.ParkID.str.startswith(ID)].tolist()

In [122]:
abstr.ParkID = abstr.ParkID.apply(getIDList)

In [123]:
prop.ParkID = prop.ParkID.apply(lambda x: [x])

In [124]:
prop.head()

Unnamed: 0,ParkID,Name,type
0,[M058-07],marcus garvey memorial park,pid
1,[M058-06],marcus garvey memorial park,pid
2,[M058-01],mt. morris east,pid
3,[M047-03],thomas jefferson park,pid
4,[M273-01],othmar ammann playground,pid


In [125]:
prop2 = pd.concat((prop, abstr))
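
The following sketch replays the last few cells on a three-row toy table, to show why the order matters: park-level prefixes have to be expanded while `ParkID` still holds plain strings, and only afterwards is each property ID wrapped in a list. The rows are a small excerpt-style sample, and `ids_with_prefix` is a hypothetical equivalent of `getIDList` that takes the ID column as an argument.

```python
import pandas as pd

# Small toy property table: two zones of one park plus an unrelated playground.
toy_prop = pd.DataFrame({
    'ParkID': ['M029-ZN01', 'M029-ZN02', 'M273-01'],
    'Name': ['fort tryon park zone 1', 'fort tryon park zone 2',
             'othmar ammann playground'],
    'type': 'pid',
})
toy_abstr = pd.DataFrame([{'ParkID': 'M029', 'type': 'abstr',
                           'Name': 'fort tryon park'}])

def ids_with_prefix(prefix, ids):
    """All property IDs that start with the park-level prefix."""
    return ids[ids.str.startswith(prefix)].tolist()

# Expand prefixes first, while toy_prop.ParkID still holds plain strings.
toy_abstr['ParkID'] = toy_abstr['ParkID'].apply(
    lambda p: ids_with_prefix(p, toy_prop['ParkID']))
# Then wrap every single property ID in a one-element list.
toy_prop['ParkID'] = toy_prop['ParkID'].apply(lambda x: [x])

toy_prop2 = pd.concat([toy_prop, toy_abstr], ignore_index=True)
print(toy_prop2[['Name', 'ParkID', 'type']])
```

After the concat every row carries a list in `ParkID`, so downstream code can treat single properties and whole parks uniformly.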

## Check how ontology works

In [155]:
ontoMask = pd.read_csv(PARQA + 'parqa/311/ONTOLOGY/Ontology_districts.csv')[['cleanName','NAME']]
ontoMask.rename(columns={'NAME':'Name','cleanName':'fuzname'}, inplace=True)
ontoMask.head(2)

Unnamed: 0,fuzname,Name
0,geo soilan park - battery park city,battery park city
1,brookville park,brookville park


In [161]:
onto2 = prop2.merge(ontoMask, on='Name', how='left')
# where the ontology has no entry for a property, fall back to the property name
onto2['fuzname'] = onto2['fuzname'].fillna(onto2['Name'])

In [162]:
### edit Onto
# manual overrides; .loc avoids chained assignment on a possible copy
onto2.loc[onto2.Name.str.contains('mccarren park'), 'fuzname'] = 'mccarren park'
onto2.loc[onto2.fuzname.str.contains('hunt'), 'fuzname'] = 'hunts point riverside park'
onto2.loc[onto2.fuzname.str.contains('waring'), 'fuzname'] = 'waring plgd'
onto2.loc[onto2.fuzname.str.contains('red hook'), 'fuzname'] = 'red hook park'
onto2.loc[onto2.fuzname.str.contains('rockaway beach and boardwalk'), 'fuzname'] = 'rockaway beach boardwalk'
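
A compact sketch of the merge-with-fallback idea on toy rows: properties that appear in the ontology receive the name used in the calls, and all others keep their own name. The three rows are illustrative, loosely based on values that appear in this notebook's outputs.

```python
import pandas as pd

toy_prop2 = pd.DataFrame({
    'Name': ['battery park city', 'brookville park', 'mt. morris east'],
    'ParkID': [['M283A'], ['Q008-02'], ['M058-01']],
})
toy_onto = pd.DataFrame({
    'fuzname': ['geo soilan park - battery park city', 'brookville park'],
    'Name': ['battery park city', 'brookville park'],
})

# The left merge keeps every property, matched or not.
toy_onto2 = toy_prop2.merge(toy_onto, on='Name', how='left')

# Where the ontology has no entry, fall back to the property name itself.
toy_onto2['fuzname'] = toy_onto2['fuzname'].fillna(toy_onto2['Name'])
print(toy_onto2[['Name', 'fuzname']])
```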


In [163]:
onto2.head()

Unnamed: 0,Name,ParkID,type,fuzname
0,marcus garvey memorial park,[M058-07],pid,marcus garvey playground
1,marcus garvey memorial park,[M058-06],pid,marcus garvey playground
2,mt. morris east,[M058-01],pid,mt. morris east
3,thomas jefferson park,[M047-03],pid,thomas jefferson park
4,thomas jefferson park,[M047-03],pid,recreation center - thomas jefferson


In [164]:
onto2.rename(columns={'fuzname': 'Park Facility Name'}, inplace=True)

## Checking ontology

In [169]:
x = fuzzyNames.merge(onto2, how='left', on='Park Facility Name')

x[(pd.isnull(x.ParkID))&(x['Park Facility Name'].str.contains('park'))].sort_values('Park Facility Name')

Unnamed: 0,Park Facility Name,try,Name,ParkID,type
1453,9th st community garden park,try,,,
900,baisley pond park - 157th st playground,try,,,
973,barrett park,try,,,
994,bayview terrace park,try,,,
302,bloomingdale park,try,,,
965,bradhurst urban renewal park,try,,,
416,bradys pond park,try,,,
1310,breukelen park,try,,,
241,bronx river park,try,,,
1011,brooklyn bridge park,try,,,
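
An alternative way to list the unmatched names, sketched on toy data, is the `indicator` option of `merge`: it adds a `_merge` column that marks rows found only on the left side. The two small frames here are illustrative.

```python
import pandas as pd

toy_fuzzy = pd.DataFrame({'Park Facility Name':
                          ['brookville park', 'barrett park', 'bronx river park']})
toy_onto2 = pd.DataFrame({'Park Facility Name': ['brookville park'],
                          'ParkID': [['Q008-02']]})

checked = toy_fuzzy.merge(toy_onto2, how='left', on='Park Facility Name',
                          indicator=True)

# 'left_only' marks call names that found no property in the ontology.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Park Facility Name']
print(sorted(unmatched))
```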


## Get ParkID for calls

In [138]:
onto2.head()

Unnamed: 0,Park Facility Name,ParkID,type,Name
0,marcus garvey memorial park,[M058-07],pid,marcus garvey memorial park
1,marcus garvey memorial park,[M058-06],pid,marcus garvey memorial park
2,mt. morris east,[M058-01],pid,mt. morris east
3,thomas jefferson park,[M047-03],pid,thomas jefferson park
4,othmar ammann playground,[M273-01],pid,othmar ammann playground


In [141]:
calls3 = calls2.merge(onto2[['Park Facility Name','ParkID']], on='Park Facility Name', how='left')
# calls3 = calls3[pd.notnull(calls3.ParkID)]

In [144]:
len(calls3[pd.notnull(calls3.ParkID)])

43588

In [142]:
### randomly chosen ParkID in the list

# unmatched calls hold NaN instead of a list, so only sample where a list is present
calls3['rParkID'] = calls3['ParkID'].apply(lambda x: random.choice(x) if isinstance(x, list) else np.nan)
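
The left merge leaves NaN in `ParkID` for calls without a match, and `random.choice` needs a non-empty sequence. A minimal sketch of a guarded, reproducible pick follows; the helper `pick_id` and the dedicated seeded generator are assumptions of this sketch, not part of the original cell.

```python
import random

rng = random.Random(2015)  # dedicated, seeded generator for reproducibility

def pick_id(ids):
    """Pick one ID from a non-empty list; return None for anything else (e.g. NaN)."""
    return rng.choice(ids) if isinstance(ids, list) and ids else None

candidates = ['M283-01', 'M283-02', 'M283A']
print(pick_id(candidates))    # one of the three IDs
print(pick_id(['Q008-02']))   # a single-element list always yields that element
print(pick_id(float('nan')))  # unmatched call
```

It could be applied as `calls3['ParkID'].apply(pick_id)`. Keeping a separate `random.Random` instance means other code that touches the global generator cannot change the draw.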

In [517]:
calls3.head()

Unnamed: 0,Park Facility Name,Descriptor,Created Date,Closed Date,Longitude,Latitude,Location Type,Complaint Type,ParkID,rParkID
0,geo soilan park - battery park city,Graffiti or Vandalism,12/31/2010 04:31:52 PM,12/31/2010 05:36:58 PM,,,Park,Maintenance or Facility,[M283A],M283A
1,geo soilan park - battery park city,Graffiti or Vandalism,12/31/2010 04:31:52 PM,12/31/2010 05:36:58 PM,,,Park,Maintenance or Facility,"[M283-03, M283-02, M283-01, M283-ZN01, M283A]",M283-ZN01
2,geo soilan park - battery park city,Graffiti or Vandalism,12/31/2010 04:31:52 PM,12/31/2010 05:36:58 PM,,,Park,Maintenance or Facility,[M283A],M283A
3,geo soilan park - battery park city,Graffiti or Vandalism,12/31/2010 04:31:52 PM,12/31/2010 05:36:58 PM,,,Park,Maintenance or Facility,"[M283-03, M283-02, M283-01, M283-ZN01, M283A]",M283-02
4,brookville park,Snow or Ice,12/31/2010 04:17:22 PM,01/06/2011 08:58:30 AM,,,Park,Maintenance or Facility,[Q008-02],Q008-02
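
In the output above the same call appears several times, because the ontology table has more than one row for that facility name and the merge repeats the call once per row. One possible remedy, sketched here on toy data, is to collapse the ontology to a single row per name by taking the union of its ID lists before merging. The frames and the name `one_per_name` are illustrative.

```python
import pandas as pd

toy_onto2 = pd.DataFrame({
    'Park Facility Name': ['geo soilan park - battery park city'] * 2
                          + ['brookville park'],
    'ParkID': [['M283A'], ['M283-01', 'M283-02', 'M283A'], ['Q008-02']],
})
toy_calls = pd.DataFrame({
    'Park Facility Name': ['geo soilan park - battery park city',
                           'brookville park'],
    'Descriptor': ['Graffiti or Vandalism', 'Snow or Ice'],
})

# Union the ID lists per name so that each name appears exactly once.
one_per_name = (toy_onto2.groupby('Park Facility Name')['ParkID']
                .agg(lambda lists: sorted({pid for ids in lists for pid in ids}))
                .reset_index())

# With one ontology row per name, the merge keeps one row per call.
merged = toy_calls.merge(one_per_name, on='Park Facility Name', how='left')
print(merged)
```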


In [1]:
# calls3.Descriptor.value_counts()
calls3[['Created Date','rParkID']][calls3.Descriptor == 'Garbage or Litter'].to_csv(PARQA + 'data/311/311_rPID_litter.csv')
calls3[['Created Date','rParkID']].to_csv(PARQA + 'data/311/311_rPID_all.csv')
