# Investigating 311 Calls

## Requirements

[required python libraries](libraries_list.md)

## Data

- [NYC_311_open_data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)
- parks shape file
- multiple park property shapefiles
- Park district boundaries
- DPR PIP Inspections aggregated dataset


[list of open park-related data sources](PARKS_OPEN_DATA.md)

## I. Collecting parks-related 311 calls
[Notebook](1_Raw_311_to_dataset.ipynb)

All data available on [NYC_311_open_data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) was downloaded as raw csv files, one per year. The files were exported manually from the web page, using its filtering options.

The data was then saved in:

- two formats (csv and pkl),
- two attribute sets: full and "light" (only the essential columns, see below),
- three record sets: full (all calls), DPR-related, and park-related (without complaints about streets and curbs).

None of this data is included in the repository (due to its size); it is included in the offline package.
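
A minimal sketch of how the three record sets and two attribute sets could be produced with pandas. The column names, the street/curb labels in `STREET_TYPES`, and the `311_all` / `311_DPR` file names are assumptions for illustration; only `311_parks_full` and `311_parks_light` are names used in this project.

```python
from pathlib import Path

import pandas as pd

# Assumed column names; adjust them to the actual 311 export.
LIGHT_COLUMNS = ["Unique Key", "Created Date", "Agency", "Complaint Type",
                 "Location Type", "Park Facility Name", "Park Borough",
                 "Latitude", "Longitude"]
# Assumed labels that mark street and curb calls.
STREET_TYPES = {"Street", "Street/Curbside"}


def build_subsets(calls: pd.DataFrame) -> dict:
    """Return the three record sets, each with a full and a light column set."""
    dpr = calls[calls["Agency"] == "DPR"]
    # Calls without a location type are kept as park-related here.
    parks = dpr[~dpr["Location Type"].isin(STREET_TYPES)]
    light = [c for c in LIGHT_COLUMNS if c in calls.columns]
    subsets = {}
    for name, frame in {"311_all": calls, "311_DPR": dpr, "311_parks": parks}.items():
        subsets[f"{name}_full"] = frame
        subsets[f"{name}_light"] = frame[light]
    return subsets


def save_subsets(subsets: dict, out_dir: str) -> None:
    """Write every subset as both csv and pkl."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, frame in subsets.items():
        frame.to_csv(out / f"{name}.csv", index=False)
        frame.to_pickle(out / f"{name}.pkl")
```

The pkl copies preserve column dtypes, which is convenient when the same files are reloaded in later notebooks.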

## II. Exploring 311 data

[Notebook](2_311_exploration.ipynb)

### Actual numbers

The dataset consists of 10,030,671 calls, but only 459,764 of them (4.58 %) are related to DPR. Of those, only 81,685 calls (17.76 %) are complaints about parks (rather than streets and street curbs). During the research, we worked with those 81,685 calls (files **311_parks_full** and **311_parks_light**).



### Timeline of all 311 calls and of calls for DPR

![tline1](img/311_exploration_1.png)

### Timeline of DPR_311 data, by park borough

![tline3](img/exploration3.png)

### Timeline of DPR_311 data for Queens

![tline4](img/exploration4.png)

As shown above, the abnormal spikes in October 2010, 2011 and 2012 are caused by calls with "Unspecified" facilities, yet almost all calls in this category have a geolocation. Most of them are located on streets, and most of them were caused by either fallen trees or fallen leaves. However, it is still unclear why we see this periodicity, and why these specific areas (Queens and Brooklyn) are affected.
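
Monthly timelines like the ones in this section can be built with a simple aggregation. This is a sketch, not the notebook's code; the column names `Created Date` and `Park Borough` are assumed to match the 311 export.

```python
import pandas as pd


def monthly_counts(calls: pd.DataFrame, by: str = "Park Borough") -> pd.DataFrame:
    """Count calls per calendar month, with one column per value of `by`."""
    created = pd.to_datetime(calls["Created Date"])
    month = created.dt.to_period("M").rename("month")
    # Group by month and category, then pivot the categories into columns.
    return calls.groupby([month, by]).size().unstack(fill_value=0)
```

Calling `.plot()` on the result (with matplotlib installed) draws one line per borough, which makes seasonal spikes easy to spot.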

## III. Park naming ontology

All calls in the 311_DPR dataset are geocoded either by longitude and latitude or by park property name (a few calls were not geocoded, or were geocoded in some other way, so we did not use them). While coordinates are easy to join spatially, the **Park Facility Name** column holds many name variants: of 1678 unique park names in the 311 dataset, only 923 were recognized directly (1-1 match with DPR park signnames). All the others:

- had slight differences (or typos) in naming or abbreviations,
- used an old (expired) name of the park,
- referred to a property that had been converted,
- (mostly) were named not after the park itself but after a specific location within it (Grand Army Plaza) or, vice versa, after a larger park unit (Central Park).



- [Notebook: park Districts ontology](ONTOLOGY/3_Calls_to_pDistricts.ipynb)
- [Notebook: DPR property ontology](ONTOLOGY/4_Calls_to_DPR_property.ipynb)
- [README](ONTOLOGY/README.md)
- [Sources](ONTOLOGY/SOURCES.md)


- [Ontology_District_file](ONTOLOGY/Ontology_districts.csv)
- [Ontology_Park_file](ONTOLOGY/ontology_districts.csv)

To match entities we developed an ontology, stored as a key-value dataframe (fuzzy_name - clean_name, or fuzzy_name - district). To build it, we used the **fuzzywuzzy** library, which provides fuzzy string matching, and processed the data in several "cascades", trying to match names against multiple DPR property datasets. Still, a lot of the work was manual and empirical.
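
The cascade idea can be sketched as follows. To keep the example free of extra dependencies it uses the standard library's `difflib` as a stand-in for fuzzywuzzy, so the scores will differ from the ones in the notebooks; the function names and the cutoff of 85 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def build_ontology(fuzzy_names, cascades, cutoff=85):
    """Map each fuzzy name to (clean_name, score), trying each reference list in turn."""
    ontology, unmatched = {}, list(fuzzy_names)
    for clean_names in cascades:
        still_unmatched = []
        for name in unmatched:
            # Best candidate within the current reference list.
            best = max(clean_names, key=lambda c: similarity(name, c))
            score = similarity(name, best)
            if score >= cutoff:
                ontology[name] = (best, score)
            else:
                still_unmatched.append(name)
        # Only names that failed this cascade go on to the next one.
        unmatched = still_unmatched
    return ontology, unmatched
```

Sorting the result with `sorted(ontology.items(), key=lambda kv: kv[1][1])` puts the weakest matches first, which is a convenient order for manual review.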

### A. Matching to districts:
1. First, a significant part of the calls has no facility name but has a geolocation. Those calls were matched to districts using the **district boundaries** shapefile and a **spatial join** in geopandas.
2. The others were matched directly against park names from the open data parks list, then against playgrounds, school playgrounds, pools, golf courses, recreation centers, etc.
3. Every match was then checked manually, starting with those with the worst matching score.
4. In the end, we failed to find a match for only a few (6) *park name values* out of ~1600. At the final stage, every entity was checked manually.
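
The spatial join from step 1 might look like the sketch below. It assumes the calls have `Longitude` and `Latitude` columns in WGS84 and a geopandas version that accepts the `predicate` argument of `sjoin` (older releases used `op` instead); `calls_to_districts` is a hypothetical helper name.

```python
import geopandas as gpd
import pandas as pd


def calls_to_districts(calls: pd.DataFrame, districts: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Attach district attributes to every call that has coordinates."""
    located = calls.dropna(subset=["Longitude", "Latitude"])
    points = gpd.GeoDataFrame(
        located,
        geometry=gpd.points_from_xy(located["Longitude"], located["Latitude"]),
        crs="EPSG:4326",  # assumed: plain longitude/latitude
    )
    # Reproject the boundaries to the points' CRS before joining.
    return gpd.sjoin(points, districts.to_crs(points.crs), how="left", predicate="within")
```

Here `districts` would be the GeoDataFrame obtained by reading the district boundaries shapefile with `gpd.read_file`. Calls that fall outside every polygon keep empty district columns and can be passed on to name-based matching.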


### B. Matching to parks:

1. We read the DPR calls.
2. We split them into geolocated and name-located calls.
3. Geolocated calls are spatially joined with the park properties and given the property name.
4. After that, we join the two groups back together and create a list of unique names, **fuzzyNames**.
5. Now we read the database of properties and, for each record, convert its Property ID into a list containing that one Property ID.
6. We also create an empirically defined dataframe of large parks that contain more than one property. For each such park, we pass a list of properties instead of a single one. We then join the two dataframes together in **prop2**: each record has a *type* attribute showing whether it comes from the original database or from the hand-built one.
7. At this point, we use the ontology we built before (the ontology for districts). We load it, manually perform some aggregation, and attach a **fuzz** name to each property.
8. Then we create a custom "fuzzname cleaning" function to apply to the calls.
9. Next we try to match the unique call locations with our properties. We run **fuzzywuzzy process.extractOne** on the unmatched ones, which helps us improve the cleaning function.
10. However, we still fail to recognize as many as 350 fuzz names: most of them are simply not in our properties database, or their official name has changed or differs.
11. **Because many calls are named after a large park rather than a specific zone, every call picks a random element from its *property_id* list. Most calls have only one element in the list and therefore select it every time.**
12. Both the ontology pairs and the matched calls are saved as csv files.
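
Steps 5, 6 and 11 rely on every property record carrying a list of IDs. A small sketch of that idea, with made-up park names and IDs (the real values live in the properties database and the hand-built large-parks dataframe):

```python
import random


def pick_property(property_ids, rng=random):
    """Pick one property ID from a list; a one-element list always yields that element."""
    return rng.choice(property_ids)


# Hypothetical records, for illustration only.
properties = {
    "Small Park": ["P001"],                   # record from the official database
    "Large Park": ["P100", "P101", "P102"],   # hand-built record covering several properties
}
```

Passing a seeded `random.Random` instance as `rng` makes the assignment reproducible between runs.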

It seems that the final ontology can be used not only on this particular dataset but on newly collected data as well.


## IV. Calls to Park Districts
[Notebook](4_Matching_Calls_Districts.ipynb)

Using the ontology we created in chapter III and the park district borders (shapefiles), we now match all calls to specific park districts, either by **spatial join** (if the call has coordinates) or using the **ontology**.

## V. Park District Top Complaints

Using the ontology we created, we were able to assign most complaints (all but 6) to a park district. By grouping calls by park district, we created tables of the top complaint types for each district over the whole period.

[Tables](district_complain_tables/)
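
A sketch of how such tables could be derived. It assumes a `district` column (added by the matching step) and the `Complaint Type` column of the 311 export; both names, and the helper `top_complaints`, are assumptions.

```python
import pandas as pd


def top_complaints(calls: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n most frequent complaint types for each park district."""
    counts = (
        calls.groupby(["district", "Complaint Type"])
        .size()
        .reset_index(name="calls")
        .sort_values(["district", "calls"], ascending=[True, False])
    )
    # After sorting, the first n rows of each district are its top complaint types.
    return counts.groupby("district").head(n).reset_index(drop=True)
```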

## VI. 311 TimeSeries

In this step we aggregate 311 calls per district and look at yearly park district time series from 2010 to 2015.

[Notebook](6_Calls_TImeseries.ipynb)

## VII. Timeseries of Park Inspections 

Time series of park inspection scores per district were computed using a modification of a previously created script (the analysis script was also updated to compute scores per park district).

- [PIP_Timeseries_script](scripts/3_PIP_timeseries.py)
- [PIP_analysis_updated](scripts/2_PIP_Analysis_1_01.py)

Both scripts should be placed in the folder next to the original capstone scripts.

## VIII. Correlation
[Notebook](6_Timeseries_corellation.ipynb)

In this notebook we attempted to find a significant correlation between different subsets of inspections and 311 calls. However, we did not find any significant correlation, either negative or positive.

![ts](img/calls_scores.png)
![corr](img/corr311_naive.png)
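
A naive per-district comparison of the two time series can be sketched as below. It assumes two dataframes indexed by year with one column per park district, and uses the Pearson coefficient from pandas; the notebook may use different subsets or measures.

```python
import pandas as pd


def calls_vs_scores(calls_per_year: pd.DataFrame, scores_per_year: pd.DataFrame) -> pd.Series:
    """Pearson correlation between yearly calls and inspection scores, per district.

    Both inputs are indexed by year and have one column per park district.
    """
    districts = calls_per_year.columns.intersection(scores_per_year.columns)
    return pd.Series(
        # Series.corr aligns the two series on their year index.
        {d: calls_per_year[d].corr(scores_per_year[d]) for d in districts},
        name="pearson_r",
    )
```

With only a handful of yearly observations per district, coefficients computed this way are noisy and should be read with caution.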

## IX. Mapping complaints
[Notebook](10_Mapping_park_calls.ipynb)
![map](img/map1.png)
![map](img/map2.png)

## X. Results

- 311 call data is extremely fuzzy and hard to use.
- On the other hand, we found a serious lack of a structured, hierarchical database of DPR properties that includes unofficial names, old names, etc.
- There is an interesting phenomenon of complaint spikes in 2010, 2011 and 2012, most of them related to fallen trees or leaves.
- Despite our efforts, we failed to find any significant correlation between 311 calls and park inspection scores.
- However, we did not investigate the "result" column of the 311 calls: many calls were redirected to other agencies or rejected, some issues could not be found by the DPR officer, and some issues were resolved. In the future, this column might be used to filter calls, in order to provide a "cleaner" calls dataset and, perhaps, improve our correlation coefficients.