# SafeChicago Data Cleaning Functions

For time series aggregation of crimes into hour-blocks.

In [11]:
from load_data import *

Raw data files needed when starting from scratch _(assumes the script runs in the same directory as these files)_:

* `crimes_raw.csv`
* `ChicagoMLB_raw.xlsx`
* `ChicagoNBA_raw.xlsx`
* `ChicagoNFL_raw.xlsx`
* `rain_snow_raw.csv`
* `historical_weather_raw.csv`

To go from raw data to fully formatted data for a <b>single beat</b>, chain the steps as follows:

### 1) Run `clean_crimes()` on the `crimes_raw.csv` file. 

This function will clean and reformat the raw crimes file and save the cleaned version to `crimes_clean.csv`. It will also return the cleaned dataframe, but this isn't needed for now.

It takes only the filename of the raw data as an argument, pulls new observations from the Chicago crimes API, and appends them to this baseline data from 2001.

In [None]:
clean_crimes('crimes_raw.csv')
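The append step described above can be sketched roughly as follows. This is a minimal illustration and not the actual body of `clean_crimes()`: the helper name `append_new_observations`, the `ID` column, and the de-duplication rule are all assumptions, and the API call itself is replaced by a small in-memory dataframe.

```python
import pandas as pd


def append_new_observations(baseline: pd.DataFrame,
                            new_rows: pd.DataFrame,
                            id_col: str = "ID") -> pd.DataFrame:
    """Hypothetical helper: stack new rows under the baseline, one row per id."""
    combined = pd.concat([baseline, new_rows], ignore_index=True)
    # Assumption: if a record appears in both sources, keep the newer copy.
    combined = combined.drop_duplicates(subset=id_col, keep="last")
    return combined.reset_index(drop=True)


# Toy stand-ins for the baseline file and the rows pulled from the API.
baseline = pd.DataFrame({"ID": [1, 2], "Beat": [111, 112]})
new_rows = pd.DataFrame({"ID": [2, 3], "Beat": [112, 111]})

combined = append_new_observations(baseline, new_rows)
print(combined)
```

Dropping duplicates on an identifier guards against the API returning rows that already exist in the baseline file, which would otherwise inflate the counts in later steps.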

### 2) Run `aggregate_crimes()` on the cleaned `crimes_clean.csv` file. 

This function takes two arguments: the filename of the cleaned crimes file and `hour_delta`, the number of hours to include in each aggregation of observations. For example, an `hour_delta` of 2 would group together 12:00 PM to 2:00 PM, 2:00 PM to 4:00 PM, etc.

It will save to the directory a file called `crimes_agg_{hour_delta}h.csv` that contains the aggregated observations. It will also save to the directory a file called `crimes_times_{hour_delta}h.csv` that contains all possible time values as defined by the `hour_delta` value for later use.
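As a rough sketch of the hour-block grouping, assuming a timestamp column named `Date` and blocks labeled by their starting hour (both are assumptions, and `add_hour_group` is a hypothetical helper rather than part of `load_data`):

```python
import pandas as pd


def add_hour_group(df: pd.DataFrame, hour_delta: int,
                   time_col: str = "Date") -> pd.DataFrame:
    """Hypothetical helper: label each row with the start hour of its block."""
    out = df.copy()
    stamps = pd.to_datetime(out[time_col])
    # Floor the hour to a multiple of hour_delta, e.g. 13 -> 12 when hour_delta=2.
    out["HourGroup"] = (stamps.dt.hour // hour_delta) * hour_delta
    return out


crimes = pd.DataFrame({
    "Date": ["2001-01-01 12:15", "2001-01-01 13:59", "2001-01-01 14:00"],
    "Beat": [111, 111, 111],
})

grouped = add_hour_group(crimes, hour_delta=2)
# Count the crimes that fall into each (beat, hour block) pair.
counts = grouped.groupby(["Beat", "HourGroup"]).size().reset_index(name="Count")
print(counts)
```

With an `hour_delta` of 2, the two crimes between 12:00 PM and 2:00 PM share one block, while the crime at exactly 2:00 PM starts the next one.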

In [None]:
aggregate_crimes('crimes_clean.csv', 4)
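One plausible use of the file of all possible time values is to fill in hour blocks where no crime was recorded. The sketch below is an assumption about how that could work, not the code in `aggregate_crimes()`; `build_time_grid` is a hypothetical helper and the column names simply mirror the output columns shown later.

```python
import pandas as pd


def build_time_grid(start: str, end: str, hour_delta: int) -> pd.DataFrame:
    """Hypothetical helper: one row per hour block between start and end."""
    stamps = pd.date_range(start=start, end=end,
                           freq=pd.Timedelta(hours=hour_delta))
    return pd.DataFrame({
        "Year": stamps.year,
        "Month": stamps.month,
        "Day": stamps.day,
        "HourGroup": stamps.hour,
    })


grid = build_time_grid("2001-01-01 00:00", "2001-01-01 23:00", hour_delta=4)

# Toy aggregated counts: only the 8:00 block has any observations.
counts = pd.DataFrame({"Year": [2001], "Month": [1], "Day": [1],
                       "HourGroup": [8], "Count": [3]})

# A left merge keeps every block in the grid; blocks without crimes get 0.
keys = ["Year", "Month", "Day", "HourGroup"]
filled = grid.merge(counts, on=keys, how="left").fillna({"Count": 0})
print(filled)
```

Keeping empty blocks as explicit zeros gives the time series a regular spacing, which most time series methods expect.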

### 3) Run `get_beat_data()` on the aggregated crimes file generated in the previous step

`get_beat_data()` takes as arguments the filename/path to the aggregated crime file and the number of the beat the data should be generated for.

It calls the helper functions `clean_nba_schedule()`, `get_mlb_schedule()`, and `clean_nfl_schedule()` to append columns for sports game occurrence. If there isn't already a cleaned version in the directory, it also cleans the `historical_weather_raw.csv` file via a call to `format_weather.py`, then merges the weather data into the crimes data for that beat.

This function returns the finalized aggregated and extended (width-wise) dataframe and saves it to the directory as `crimes_agg_ext_beat{beat_num}.csv`.

In [8]:
get_beat_data('crimes_agg_4h.csv', 111)

Merging sports...
Merging weather...


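| Unnamed: 0 | Year | Month | Day | HourGroup | Beat | Count | BullsGame | CubsGame | SoxGame | BearsGame | C | m/s | RelHum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 1 | 1 | 0 | 111 | 1.0 | 0 | 0 | 0 | 0 | -4.8 | 3.5 | |
| 1 | 2001 | 1 | 1 | 8 | 111 | 1.0 | 0 | 0 | 0 | 0 | -5.5 | 4.2 | |
| 2 | 2001 | 1 | 1 | 12 | 111 | 1.0 | 0 | 0 | 0 | 0 | -6.2 | 4.1 | |
| 3 | 2001 | 1 | 1 | 20 | 111 | 2.0 | 0 | 0 | 0 | 0 | -4.2 | 8.5 | |
| 4 | 2001 | 1 | 1 | 16 | 111 | 0.0 | 0 | 0 | 0 | 0 | -6.0 | 5.0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52631 | 2022 | 12 | 31 | 8 | 111 | 0.0 | 0 | 0 | 0 | 0 | | | |
| 52632 | 2022 | 12 | 31 | 12 | 111 | 0.0 | 0 | 0 | 0 | 0 | | | |
| 52633 | 2022 | 12 | 31 | 20 | 111 | 0.0 | 0 | 0 | 0 | 0 | | | |
| 52634 | 2022 | 12 | 31 | 16 | 111 | 0.0 | 0 | 0 | 0 | 0 | | | |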
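The widening performed in this step can be illustrated with a small sketch. It is not the code inside `get_beat_data()`: `add_game_flag` is a hypothetical helper, the schedule and weather frames are toy data, and flagging a game for every hour block of the game day is an assumption.

```python
import pandas as pd

DAY_KEYS = ["Year", "Month", "Day"]


def add_game_flag(df: pd.DataFrame, schedule: pd.DataFrame,
                  flag_col: str) -> pd.DataFrame:
    """Hypothetical helper: set flag_col to 1 on days listed in schedule."""
    marked = schedule[DAY_KEYS].drop_duplicates().assign(**{flag_col: 1})
    out = df.merge(marked, on=DAY_KEYS, how="left")
    # Days without a game come back as NaN from the left merge; make them 0.
    out[flag_col] = out[flag_col].fillna(0).astype(int)
    return out


beat = pd.DataFrame({"Year": [2001, 2001], "Month": [1, 1], "Day": [1, 2],
                     "HourGroup": [0, 0], "Beat": [111, 111],
                     "Count": [1.0, 2.0]})
bulls = pd.DataFrame({"Year": [2001], "Month": [1], "Day": [2]})
weather = pd.DataFrame({"Year": [2001], "Month": [1], "Day": [1],
                        "HourGroup": [0], "C": [-4.8]})

ext = add_game_flag(beat, bulls, "BullsGame")
# Weather is matched per hour block; blocks without a reading stay NaN.
ext = ext.merge(weather, on=DAY_KEYS + ["HourGroup"], how="left")
print(ext)
```

Left merges keep every crime row even when no game or weather reading matches, which is consistent with the empty weather cells in the final rows of the output above.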


Note: this is still a work in progress. As far as I can tell for now, it still needs the following:

* Cut off the table after the last observation available in the `crimes_agg` file
* Tweak `weather_format.py` functions to also include the rain and snow features for each day
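For the first item, one possible approach is sketched below. It is only an assumption about how the cutoff might be implemented; `trim_after_last_observation` is a hypothetical helper, and both frames are toy data using the column names from the output above.

```python
import pandas as pd


def to_stamp(df: pd.DataFrame) -> pd.Series:
    """Combine Year/Month/Day/HourGroup columns into one timestamp per row."""
    parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
    return pd.to_datetime(parts) + pd.to_timedelta(df["HourGroup"], unit="h")


def trim_after_last_observation(ext: pd.DataFrame,
                                agg: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop rows of ext later than the last row of agg."""
    last_seen = to_stamp(agg).max()
    return ext[to_stamp(ext) <= last_seen].reset_index(drop=True)


agg = pd.DataFrame({"Year": [2001, 2001], "Month": [1, 1], "Day": [1, 1],
                    "HourGroup": [0, 4], "Count": [1, 2]})
ext = pd.DataFrame({"Year": [2001, 2001, 2001], "Month": [1, 1, 1],
                    "Day": [1, 1, 2], "HourGroup": [0, 4, 0],
                    "Count": [1.0, 2.0, 0.0]})

trimmed = trim_after_last_observation(ext, agg)
print(trimmed)
```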