<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Extend-block-dataset-by-school-years-SY1314---SY1516" data-toc-modified-id="Extend-block-dataset-by-school-years-SY1314---SY1516-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Extend block dataset by school years SY1314 - SY1516</a></span></li><li><span><a href="#Add-earlier-school-years-for-which-no-route-information-exists" data-toc-modified-id="Add-earlier-school-years-for-which-no-route-information-exists-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Add earlier school years for which no route information exists</a></span></li><li><span><a href="#Aggregate-multiple-routes-per-block" data-toc-modified-id="Aggregate-multiple-routes-per-block-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Aggregate multiple routes per block</a></span></li><li><span><a href="#Create-treatment-dummy-for-census-blocks" data-toc-modified-id="Create-treatment-dummy-for-census-blocks-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Create treatment dummy for census blocks</a></span></li><li><span><a href="#Save-all-census-blocks" data-toc-modified-id="Save-all-census-blocks-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Save all census blocks</a></span></li></ul></div>

**Description**: Extends the block dataset to all relevant school years,
performs a spatial join of the block data with the routes data,
creates a preliminary treatment indicator,
and saves the data.

---

In [1]:
import pickle
from pathlib import Path

import geopandas as gpd
import pandas as pd
import numpy as np
from tqdm import tqdm, tqdm_notebook

tqdm.pandas(tqdm_notebook)

In [2]:
def agg_block_schoolyear(group):
    """Aggregates all observations of a block per school year to one.

    Makes a list of the merged routes and saves it as a new column.

    Parameters
    ----------
    group : pandas.DataFrame
        All observations of one block in one school year.

    Returns
    -------
    pandas.Series
        Aggregated group (1 observation)
    """
    # Returns one row of the following columns
    block_cols = group[[
        'statefp10', 'countyfp10', 'tractce10', 'geoid10', 'blockce10',
        'name10', 'geometry'
    ]].head(1)
    assert block_cols.shape == (1, 7)
    # Convert to a pandas series
    block_cols = block_cols.squeeze()
    # Make list of merged routes and corresponding schools
    # (the .all() just takes the truth value out of the pandas series)
    if group.shape[0] == 1 and pd.isnull(group['route_number']).all():
        route_numbers = np.nan
    else:
        route_numbers = list(group['route_number'])

    # Add everything together
    # The column name is shortened to make it better visible
    # that this is not the usual 'route_number' column
    # (pd.concat is used because Series.append was removed in pandas 2.0)
    group_new = pd.concat([block_cols, pd.Series({'r_numbers': route_numbers})])
    return group_new

# Load data

In [3]:
data_path = Path('../../data')

In [4]:
with (data_path / 'interim/blocks.pkl').open('rb') as f:
    blocks = pickle.load(f)

with (data_path / 'processed/routes.pkl').open('rb') as f:
    routes = pickle.load(f)

# Extend block dataset by school years SY1314 - SY1516
These are the school years for which we have information on
the location of routes.

In [5]:
school_years = ['SY1314', 'SY1415', 'SY1516']
n_blocks_original = blocks.shape[0]
blocks = pd.concat([blocks] * len(school_years), ignore_index=True)
blocks = blocks.sort_values('tract_bloc').reset_index(drop=True)
blocks['school_year'] = school_years * n_blocks_original
assert not blocks[['tract_bloc', 'school_year']].duplicated().any()
assert (blocks.groupby('tract_bloc').size() == blocks.shape[0] /
        blocks['tract_bloc'].nunique()).all()
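The extension step can also be illustrated on toy data. The following is a minimal sketch, not the notebook's method: it builds the same block-by-school-year panel with a key-based cross join instead of `pd.concat`. The frames `toy_blocks` and `toy_years`, the helper column `_key`, and all values in them are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two blocks identified by 'tract_bloc'
toy_blocks = pd.DataFrame({
    'tract_bloc': ['0101001000', '0101001001'],
    'name10': ['Block 1000', 'Block 1001'],
})
toy_years = pd.DataFrame({'school_year': ['SY1314', 'SY1415', 'SY1516']})

# Cross join via a constant helper key:
# every block is paired with every school year
toy_panel = (toy_blocks.assign(_key=1)
             .merge(toy_years.assign(_key=1), on='_key')
             .drop('_key', axis='columns')
             .sort_values(['tract_bloc', 'school_year'])
             .reset_index(drop=True))

# Same sanity checks as in the cell above
assert not toy_panel[['tract_bloc', 'school_year']].duplicated().any()
assert toy_panel.shape[0] == len(toy_blocks) * len(toy_years)
print(toy_panel)
```

A cross join does not depend on the row order after sorting, which may make it easier to verify than assigning a repeated list of school years.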

Spatial join of extended block dataset and routes
(for each school year separately)

In [6]:
bl_ro = []
for sy in blocks['school_year'].unique():
    blocks_temp = blocks[blocks['school_year'] == sy]
    routes_temp = routes[routes['school_year'] == sy].drop(
        'school_year', axis='columns')
    bl_ro_temp = gpd.sjoin(
        # Deep copies should not be needed anymore in
        # future version of geopandas (current 0.3.0)
        blocks_temp.copy(),
        routes_temp.copy(),
        how='left',
        op='intersects').reset_index(drop=True).drop(
            'index_right', axis='columns')
    bl_ro.append(bl_ro_temp)
bl_ro = pd.concat(bl_ro, ignore_index=True)
del blocks
del routes

# Add earlier school years for which no route information exists
For SY0910 - SY1213, the FOIA request provided information on
the implementation of the Safe Passage program.
Before that, no Safe Passage program existed in any block.

In [7]:
sy_to_add = [
    'SY0506', 'SY0607', 'SY0708', 'SY0809', 'SY0910', 'SY1011', 'SY1112',
    'SY1213'
]

For each block take the observation from SY1314

In [8]:
early_blocks = bl_ro.query('school_year == "SY1314"').copy()

Duplicate these observations once for each school year prior to SY1314
(i.e. 8 copies of each observation, one per entry in `sy_to_add`).

In [9]:
all_early_block_years = []
for sy in sy_to_add:
    early_blocks_temp = early_blocks.copy()
    early_blocks_temp['school_year'] = sy
    all_early_block_years.append(early_blocks_temp)
all_early_block_years = pd.concat(all_early_block_years, ignore_index=True)
assert all_early_block_years.shape[0] / len(
    sy_to_add) == early_blocks.shape[0]
all_early_block_years['school_year'].unique()

array(['SY0506', 'SY0607', 'SY0708', 'SY0809', 'SY0910', 'SY1011',
       'SY1112', 'SY1213'], dtype=object)

Add to existing block data

In [10]:
bl_ro = pd.concat([bl_ro, all_early_block_years], ignore_index=True)
bl_ro = bl_ro.sort_values(['tract_bloc',
                           'school_year']).reset_index(drop=True)

# Aggregate multiple routes per block
Multiple routes can intersect with a block in a given school year.
The following will aggregate these entries such that
each block has again one entry per school year.

In [11]:
assert (bl_ro.groupby('tract_bloc').size() >=
        len(school_years + sy_to_add)).all()

bl_ro = (bl_ro.groupby(['tract_bloc', 'school_year'])
         .progress_apply(agg_block_schoolyear).reset_index())

Make sure that there is exactly one observation
per block and school year

In [12]:
assert not bl_ro.duplicated(['tract_bloc', 'school_year']).any()
assert bl_ro.shape[0] == bl_ro['tract_bloc'].nunique() * bl_ro[
    'school_year'].nunique()
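The aggregation and the uniqueness check can be sketched on a small example. This is a simplified variant under stated assumptions, not the function used in this notebook: it only handles the `route_number` column, and it relies on the fact that a left join leaves exactly one row with a missing route number for unmatched blocks. The frame `toy` and the helper `collect_routes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy result of a left spatial join:
# block B intersects two routes in SY1314, block A none in SY1314
toy = pd.DataFrame({
    'tract_bloc': ['A', 'A', 'B', 'B', 'B'],
    'school_year': ['SY1314', 'SY1415', 'SY1314', 'SY1314', 'SY1415'],
    'route_number': [np.nan, 7.0, 3.0, 5.0, np.nan],
})


def collect_routes(route_numbers):
    """Return the list of non-missing route numbers, or NaN if there are none."""
    routes = route_numbers.dropna().tolist()
    return routes if routes else np.nan


# One row per block and school year, with the routes collected in a list
toy_agg = (toy.groupby(['tract_bloc', 'school_year'])['route_number']
           .apply(collect_routes)
           .rename('r_numbers')
           .reset_index())

# Exactly one observation per block and school year
assert not toy_agg.duplicated(['tract_bloc', 'school_year']).any()
assert toy_agg.shape[0] == (toy_agg['tract_bloc'].nunique() *
                            toy_agg['school_year'].nunique())
print(toy_agg)
```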

# Create treatment dummy for census blocks
Treatment is defined as 1 if a Safe Passage route intersects
with the census block.

In [13]:
bl_ro['treated'] = ~bl_ro['r_numbers'].isnull() * 1
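The construction of the dummy can be checked on a toy column. The sketch below assumes a column that holds either a list of route numbers or NaN, as `r_numbers` does; the Series `toy_r_numbers` is hypothetical. Because unary `~` binds more tightly than `*`, an expression of the form `~s.isnull() * 1` is evaluated as `(~s.isnull()) * 1`, so it gives the same values as the more explicit `s.notnull().astype(int)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: a list of route numbers or NaN per block-year
toy_r_numbers = pd.Series([[3.0, 5.0], np.nan, [7.0]], dtype=object)

# Negate the missing-value mask, then multiply by 1 to get integers
treated_mask_times_one = ~toy_r_numbers.isnull() * 1

# More explicit spelling of the same logic
treated_explicit = toy_r_numbers.notnull().astype(int)

assert treated_mask_times_one.tolist() == [1, 0, 1]
assert treated_explicit.tolist() == [1, 0, 1]
```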

# Save all census blocks

Convert to geopandas dataframe

In [14]:
bl_ro = gpd.GeoDataFrame(
    data=bl_ro.drop('geometry', axis='columns'),
    crs={'init': 'epsg:4326'},
    geometry=bl_ro['geometry'])

Sort observations

In [15]:
bl_ro = bl_ro.sort_values(['tract_bloc',
                           'school_year']).reset_index(drop=True)



Save

In [16]:
with (data_path / 'processed/blocks.pkl').open('wb') as f:
    pickle.dump(bl_ro, f)