<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load data</a></span><ul class="toc-item"><li><span><a href="#Load-crimes" data-toc-modified-id="Load-crimes-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load crimes</a></span><ul class="toc-item"><li><span><a href="#Drop-the-ones-with-missing-geospatial-data" data-toc-modified-id="Drop-the-ones-with-missing-geospatial-data-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Drop the ones with missing geospatial data</a></span></li></ul></li><li><span><a href="#Load-blocks" data-toc-modified-id="Load-blocks-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load blocks</a></span></li></ul></li><li><span><a href="#Spatial-join-of-crimes-and-blocks" data-toc-modified-id="Spatial-join-of-crimes-and-blocks-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Spatial join of crimes and blocks</a></span></li><li><span><a href="#Add-school-year-identifier-to-crimes" data-toc-modified-id="Add-school-year-identifier-to-crimes-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Add school year identifier to crimes</a></span></li><li><span><a href="#Save" data-toc-modified-id="Save-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Save</a></span></li></ul></div>

**Description**: Performs a spatial join of crimes and blocks and saves the result
as a new SQL table in the crime database.

---

In [1]:
import pickle
import sys
from pathlib import Path

import geopandas as gpd
from shapely.geometry import Point

sys.path.append('../..')
from src.prepare_data.crime_database import load_relevant_crimes, get_engine

In [2]:
data_path = Path('../../data')

# Load data
## Load crimes

In [3]:
crimes = load_relevant_crimes(
    '2006-01-01', '2016-06-30', sqldb_path=str(data_path / 'processed/crimes.db'))

### Drop the ones with missing geospatial data

In [4]:
crimes = crimes.dropna(subset=['Longitude', 'Latitude'])

## Load blocks

In [5]:
with (data_path / 'processed/blocks.pkl').open('rb') as f:
    blocks = pickle.load(f)

# Spatial join of crimes and blocks

Convert crimes to a geopandas GeoDataFrame. First build one point per crime from its longitude and latitude.

In [6]:
locations = [
    Point(lon, lat)
    for lon, lat in zip(crimes['Longitude'], crimes['Latitude'])
]

Use the same CRS as the blocks dataset.

In [7]:
crimes = gpd.GeoDataFrame(
    data=crimes, geometry=locations, crs={
        'init': 'epsg:4326'
    }).reset_index(drop=True)
del locations

The blocks dataset currently has one entry per block and school year.
Only one entry per block is needed for the geometries.

In [8]:
blocks = blocks.drop_duplicates(subset=['tract_bloc'])[[
    'tract_bloc', 'geometry'
]]
assert isinstance(blocks, gpd.GeoDataFrame)

Perform spatial join

In [9]:
crimes_blocks = gpd.sjoin(
    # Deep copies should no longer be needed in
    # future versions of geopandas (current: 0.3.0)
    crimes.copy(),
    blocks.copy(),
    how='left',
    op='intersects').reset_index(drop=True).drop(
        'index_right', axis='columns')

assert crimes.shape[0] == crimes_blocks.shape[0]

Some crimes don't get a matched tract_bloc because
they are in the outer areas of Chicago, which are not considered
in this analysis. Drop these crimes.

In [10]:
crimes_blocks.dropna(subset=['tract_bloc'], inplace=True)
crimes_blocks.reset_index(drop=True, inplace=True)
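
The left join and the drop of unmatched crimes can be illustrated without geopandas. The following is only a sketch of the semantics, not of how `gpd.sjoin` is implemented: the helper names, the ray-casting test, and the toy squares and points are all hypothetical, and each point is attached to the first block that contains it.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices. Points exactly on an edge
    are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def left_join_points_to_blocks(points, blocks):
    """Keep every point; attach the first containing block id or None."""
    joined = []
    for point_id, (lon, lat) in points.items():
        match = None
        for block_id, polygon in blocks.items():
            if point_in_polygon(lon, lat, polygon):
                match = block_id
                break
        joined.append({'ID': point_id, 'tract_bloc': match})
    return joined


# Hypothetical toy data: two unit-square "blocks" and three "crimes".
toy_blocks = {
    'A': [(0, 0), (1, 0), (1, 1), (0, 1)],
    'B': [(1, 0), (2, 0), (2, 1), (1, 1)],
}
toy_crimes = {1: (0.5, 0.5), 2: (1.5, 0.25), 3: (5.0, 5.0)}

# Left join: same number of rows as there are crimes.
joined = left_join_points_to_blocks(toy_crimes, toy_blocks)
# Equivalent of dropna(subset=['tract_bloc']): keep matched crimes only.
matched = [row for row in joined if row['tract_bloc'] is not None]
```

In the toy data, crime 3 lies outside both squares, so it survives the left join with `tract_bloc` set to `None` and is removed by the final filter.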

# Add school year identifier to crimes
Only crimes that happened during a school year (September 1 to June 30) get an identifier.
Crimes that happened outside a school year, e.g. during the summer months, are still
saved with their block information later, i.e. school year = NA is allowed.

In [11]:
sy_range = {
    sy: (f'20{sy[2:4]}-09-01', f'20{sy[4:]}-06-30')
    for sy in [
        'SY0506', 'SY0607', 'SY0708', 'SY0809', 'SY0910', 'SY1011',
        'SY1112', 'SY1213', 'SY1314', 'SY1415', 'SY1516'
    ]
}

for sy, (start, end) in sy_range.items():
    # Label all crimes dated within this school year (bounds inclusive)
    in_school_year = ((crimes_blocks['Date'] >= start) &
                      (crimes_blocks['Date'] <= end))
    crimes_blocks.loc[in_school_year, 'school_year'] = sy
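
The labeling rule can also be written as a small pandas-free function, which makes the NA case for summer dates explicit. This is a minimal sketch: `school_year_bounds` and `school_year_for` are hypothetical names, and the function works on calendar dates, so it says nothing about how timestamps on the last day of a school year are treated by the comparison above.

```python
from datetime import date

SCHOOL_YEARS = [
    'SY0506', 'SY0607', 'SY0708', 'SY0809', 'SY0910', 'SY1011',
    'SY1112', 'SY1213', 'SY1314', 'SY1415', 'SY1516'
]


def school_year_bounds(sy):
    """Return (start, end) dates for a label such as 'SY0506'."""
    start = date(2000 + int(sy[2:4]), 9, 1)   # September 1 of the first year
    end = date(2000 + int(sy[4:]), 6, 30)     # June 30 of the second year
    return start, end


def school_year_for(day):
    """Return the school year label containing `day`, or None."""
    for sy in SCHOOL_YEARS:
        start, end = school_year_bounds(sy)
        if start <= day <= end:
            return sy
    # Dates in July and August (or outside the covered years) get no label.
    return None


first_day = school_year_for(date(2005, 9, 1))
last_day = school_year_for(date(2006, 6, 30))
summer = school_year_for(date(2006, 7, 15))
```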

# Save
Save the result as a new table in the SQL database.

In [13]:
crimes_blocks = crimes_blocks[['ID', 'tract_bloc', 'school_year']]
disk_engine = get_engine(sqldb_path=str(data_path / 'processed/crimes.db'))
crimes_blocks.to_sql(
    'crimes_blocks',
    disk_engine,
    if_exists='replace',
    index=False,
    chunksize=900)
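
The effect of `if_exists='replace'` combined with `chunksize` can be sketched with the standard library's `sqlite3` module. This is an illustration under assumptions, not how pandas implements `to_sql`: the helper `replace_table`, the column types, and the toy rows are hypothetical, and an in-memory database stands in for `crimes.db`.

```python
import sqlite3


def replace_table(conn, rows, chunksize=900):
    """Drop and recreate crimes_blocks, then insert rows chunk by chunk."""
    conn.execute('DROP TABLE IF EXISTS crimes_blocks')
    # Column types are assumed for this sketch.
    conn.execute(
        'CREATE TABLE crimes_blocks '
        '(ID INTEGER, tract_bloc TEXT, school_year TEXT)')
    for start in range(0, len(rows), chunksize):
        chunk = rows[start:start + chunksize]
        conn.executemany(
            'INSERT INTO crimes_blocks VALUES (?, ?, ?)', chunk)
    conn.commit()


conn = sqlite3.connect(':memory:')
# Hypothetical rows; None plays the role of a missing school year.
toy_rows = [(1, '0101001000', 'SY0506'), (2, '0101001001', None)]

replace_table(conn, toy_rows, chunksize=1)
replace_table(conn, toy_rows, chunksize=1)  # replaces instead of appending

stored = conn.execute(
    'SELECT ID, tract_bloc, school_year FROM crimes_blocks ORDER BY ID'
).fetchall()
```

Running the write twice leaves the table with the same two rows, which is the behavior that makes the notebook safe to re-execute.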