# Investigating 311 Calls

## Requirements

[Required Python libraries](libraries_list.md)

## Data

- [NYC_311_open_data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)
- Parks shapefile
- Multiple park property shapefiles
- Park district boundaries
- DPR PIP Inspections aggregated dataset


[list of open park-related data sources](PARKS_OPEN_DATA.md)

## I. Collecting parks-related 311 calls
[Notebook](1_Raw_311_to_dataset.ipynb)

- First, all data available on [NYC_311_open_data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) was collected as raw csv files, one per year.
- All data was then merged into a single file and saved as a csv file.
- Then we subset the data by **Agency**, keeping only DPR-related calls. This dataset was saved separately and is used as the main 311 dataset.
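
The merge-and-subset steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the in-memory CSV snippets, the helper name `merge_and_subset`, and the agency code `"DPR"` as the filter value are assumptions made for the example.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the per-year raw csv files (hypothetical rows).
RAW_2010 = (
    "Unique Key,Created Date,Agency,Complaint Type\n"
    "1,01/05/2010 10:00:00 AM,DPR,Damaged Tree\n"
    "2,01/06/2010 11:30:00 AM,NYPD,Noise\n"
)
RAW_2011 = (
    "Unique Key,Created Date,Agency,Complaint Type\n"
    "3,02/01/2011 09:15:00 AM,DPR,Dead Tree\n"
    "4,02/02/2011 04:45:00 PM,DOT,Street Condition\n"
)


def merge_and_subset(csv_sources, agency="DPR"):
    """Concatenate yearly files and keep only the rows for one agency."""
    frames = [pd.read_csv(source) for source in csv_sources]
    merged = pd.concat(frames, ignore_index=True)
    subset = merged[merged["Agency"] == agency].reset_index(drop=True)
    return merged, subset


merged, dpr_calls = merge_and_subset([io.StringIO(RAW_2010), io.StringIO(RAW_2011)])
print(dpr_calls[["Unique Key", "Agency", "Complaint Type"]])
```

With real data, the list passed to `merge_and_subset` would hold the paths of the yearly csv files instead of `StringIO` objects.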

## II. Exploring 311 data

[Notebook](2_311_exploration.ipynb)

### Timeline of general 311 data
![tline1](img/exploration1.png)
### Timeline of DPR_311 data
![tline2](img/exploration2.png)
### Timeline of DPR_311 data, by park borough
![tline3](img/exploration3.png)


## III. Park naming Ontology

All calls in the 311_DPR dataset are geocoded either by longitude and latitude or by park property name (a few calls were not geocoded, or were geocoded in another way, so we did not use them). While coordinates are easy to join spatially, **Park Facility Name** holds many naming variants: of the 1678 unique park names in the 311 dataset, only 923 were recognized directly (a 1-1 match with DPR park signnames). The others:

- had slight differences (or typos) in naming or abbreviations
- used an old (expired) name of the park
- referred to a property that had been converted
- (mostly) were named not by the park name but by a specific location (Grand Army Plaza) or, vice versa, by a larger park unit (Central Park)


To match entities we developed an ontology, represented as a key-value dataframe (wrong_name - clean_name or wrong_name - district). To build it, we used the **fuzzywuzzy** library, which provides fuzzy string matching, and processed the data in several "cascades", trying to match names against multiple DPR property datasets. Still, much of the work was manual and empirical. We failed to find a match for a few (6) calls.

However, it seems that the final ontology can be used not only on this particular dataset but on newly collected data as well.
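
The "cascade" idea can be illustrated with a small sketch. Note that it swaps in the standard library's `difflib` as a stand-in for fuzzywuzzy, so the scores and cutoff are not comparable to the ones used in the notebook. The function names, the `0.85` cutoff, and the sample name lists are all assumptions made for the example.

```python
from difflib import get_close_matches


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def build_ontology(raw_names, reference_lists, cutoff=0.85):
    """Map raw names to clean names, trying each reference list in turn.

    Returns the mapping plus the raw names that no cascade could match
    (those would be left for manual review).
    """
    lookups = [{normalize(ref): ref for ref in refs} for refs in reference_lists]
    ontology, unmatched = {}, []
    for raw in raw_names:
        key = normalize(raw)
        for lookup in lookups:
            if key in lookup:  # exact match after normalization
                ontology[raw] = lookup[key]
                break
            close = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
            if close:  # best fuzzy match above the cutoff
                ontology[raw] = lookup[close[0]]
                break
        else:
            unmatched.append(raw)
    return ontology, unmatched


# Hypothetical reference datasets, ordered from most to least trusted.
signnames = ["Central Park", "Prospect Park"]
other_properties = ["Grand Army Plaza"]

raw_311_names = ["central park", "Prospct Park", "Grand Army Plaza", "Unknown Lot 7"]
ontology, unmatched = build_ontology(raw_311_names, [signnames, other_properties])
print(ontology)
print(unmatched)
```

Ordering the reference lists from most to least trusted means a name is only compared against the looser datasets when the stricter ones fail, which is one plausible reading of the cascades described above.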

- [Notebook](ONTOLOGY/Ontology2.0.ipynb)
- [Onto_District_file](ONTOLOGY/onto_data/Ontology_verified2.csv)

## IV. Computing park area for park districts

For normalisation purposes, we aggregated park area per district by:

- reading two shapefiles (park boundaries and park district boundaries)
- using a spatial join on park centroids to get the **park district id** for each park
- collecting park area and number of parks for each district
- collecting park ids as an array for each district
- calculating the percentage of park area relative to total area for each district
- saving districts with the new data as csv and geojson (**data/park_Districts_computed.csv**, **data/park_Districts_computed.geojson**)
- saving parks with area and **park district id** attributes (**data/parks_computed.geojson**)
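
The aggregation part of these steps might look like the sketch below. It assumes the centroid spatial join has already attached a district id to each park, and it uses plain pandas tables in place of the shapefiles. All column names (`park_id`, `district_id`, `park_area`, `district_area`) and values are hypothetical.

```python
import pandas as pd

# Parks after the spatial join: each park already carries its district id.
parks = pd.DataFrame(
    {
        "park_id": ["P1", "P2", "P3"],
        "district_id": ["D1", "D1", "D2"],
        "park_area": [10.0, 30.0, 5.0],
    }
)

# Districts with their total area, in the same units as park_area.
districts = pd.DataFrame(
    {"district_id": ["D1", "D2"], "district_area": [200.0, 50.0]}
)

# Total park area, number of parks, and list of park ids per district.
per_district = (
    parks.groupby("district_id")
    .agg(
        park_area=("park_area", "sum"),
        park_count=("park_id", "count"),
        park_ids=("park_id", list),
    )
    .reset_index()
)

# Attach the aggregates to the districts and compute the share of park area.
computed = districts.merge(per_district, on="district_id", how="left")
computed["park_area_pct"] = 100 * computed["park_area"] / computed["district_area"]
print(computed)
```

A left merge keeps districts that contain no parks; their aggregate columns would come out as missing values and may need filling before export.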

## V. Park_District Top Complaints

Using the ontology, we were able to assign most complaints (all but 6) to a park district. By grouping calls by park district, we created tables of the top complaint types for each park district, covering the whole time period.

[Tables](district_complain_tables/)
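
One way to build such tables is sketched below. It assumes each call already carries a district id from the ontology step. The `district_id` column, the `top_complaints` helper, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical calls, already assigned to park districts.
calls = pd.DataFrame(
    {
        "district_id": ["D1", "D1", "D1", "D2", "D2"],
        "Complaint Type": [
            "Damaged Tree",
            "Damaged Tree",
            "Maintenance or Facility",
            "Dead Tree",
            "Dead Tree",
        ],
    }
)


def top_complaints(calls, n=10):
    """Return the n most frequent complaint types for every district."""
    counts = (
        calls.groupby(["district_id", "Complaint Type"])
        .size()
        .reset_index(name="calls")
    )
    counts = counts.sort_values(["district_id", "calls"], ascending=[True, False])
    return counts.groupby("district_id").head(n).reset_index(drop=True)


top = top_complaints(calls, n=1)
print(top)
```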

## VI. 311 TimeSeries

Using the ontology, we were able to assign most complaints (all but 6) to a park district. With that, we aggregated 311 calls per district and looked at the park district time series, by year, from 2010 to 2015.

[Notebook](5_Calls_TImeseries.ipynb)
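
A yearly aggregation of this kind can be sketched as follows. The `district_id` column, the date format string, and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical calls with a creation timestamp and an assigned district.
calls = pd.DataFrame(
    {
        "district_id": ["D1", "D1", "D1", "D2"],
        "Created Date": [
            "01/05/2010 10:00:00 AM",
            "07/19/2010 02:30:00 PM",
            "03/02/2011 09:15:00 AM",
            "11/23/2011 06:45:00 PM",
        ],
    }
)

# Parse the timestamps and extract the calendar year.
created = pd.to_datetime(calls["Created Date"], format="%m/%d/%Y %I:%M:%S %p")
calls["year"] = created.dt.year

# One row per year, one column per district, zero where no calls were made.
yearly = calls.groupby(["year", "district_id"]).size().unstack(fill_value=0)
print(yearly)
```

Keeping years as the index and districts as columns makes it straightforward to plot one line per district.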

## VII. Timeseries of Park Inspections 

Time series of Park Inspections scores per district were computed using a modified version of a previously created script (the script was also updated to compute scores per park district).

- [PIP_Timeseries_script](PIP_ts.py)
- [PIP_analysis](PIP_analysis.py)

## VIII. Correlation

Unfortunately, we found only a weak negative correlation between the two time series.

[Notebook](6_Timeseries_corellation.ipynb)
![ts](img/calls_scores.png)
![corr](img/corr311_naive.png)
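
A per-district comparison of the two time series could be sketched like this. The choice of Pearson correlation is an assumption, since the method is not stated here. The `district_correlations` helper and the toy values are hypothetical and do not reproduce the project's results.

```python
import pandas as pd

years = [2010, 2011, 2012]

# Hypothetical yearly call counts and inspection scores, one column per district.
calls = pd.DataFrame({"D1": [1, 2, 3], "D2": [1, 2, 3]}, index=years)
scores = pd.DataFrame({"D1": [3.0, 2.0, 1.0], "D2": [1.0, 2.0, 3.0]}, index=years)


def district_correlations(calls, scores):
    """Pearson correlation between calls and scores for each shared district."""
    common = calls.columns.intersection(scores.columns)
    return pd.Series({d: calls[d].corr(scores[d]) for d in common})


correlations = district_correlations(calls, scores)
print(correlations)
```

`Series.corr` aligns the two series on their index, so years present in only one of the tables are dropped from the calculation.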

## IX. Mapping complaints

![map](img/map1.png)
![map](img/map2.png)