# Get Aggregated Statistics by Neighbourhood

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [2]:
import os
import urllib
from datetime import datetime
from glob import glob
from io import BytesIO
from typing import Dict, List
from urllib.request import urlopen
from zipfile import ZipFile

import geopandas as gpd
import numpy as np
import pandas as pd
import requests



In [3]:
%aimport src.utils
from src.utils import summarize_df

## Load Inspections Data that was Filtered, Aggregated and Geocoded

We'll load the filtered and aggregated data with the missing latitudes and longitudes filled using geocoding

In [4]:
%%time
df = pd.read_csv(
    sorted(glob("data/processed/filtered_transformed_filledmissing_data__*.csv"))[-1],
    parse_dates=["inspection_date"],
)
summarize_df(df)

Unnamed: 0,dtype,num_missing,num,nunique,single_non_nan_value
establishment_id,int64,0,83702,18028,9421327
establishmenttype,object,0,83702,13,Restaurant
establishment_address,object,0,83702,9838,361 WILSON AVE
inspection_id,int64,0,83702,83515,103290970
inspection_date,datetime64[ns],0,83702,2318,2014-07-28 00:00:00
infractions_summary,object,0,83702,37683,Operator fail to properly wash surfaces in roo...
num_significant,int64,0,83702,34,0
num_crucial,int64,0,83702,15,0
num_minor,int64,0,83702,25,6
num_infractions,int64,0,83702,55,6


CPU times: user 365 ms, sys: 37.2 ms, total: 403 ms
Wall time: 403 ms


## Get Supplementary Datasets

We will download various datasets (e.g. crime, population, etc.) from the City of Toronto Open Data Portal. Crime will be aggregated by neighbourhood.

### Neighbourhood Boundary and Land Area GeoData

First we will use the `geopandas` Python library to download the geodata for the boundaries of neighbourhoods in the city of Toronto. A helper function does this; it takes the main URL for the Open Data Portal and a dictionary of request parameters containing the dataset ID

In [5]:
def get_neighbourhood_boundary_land_area_data(
    url: str, params: Dict
) -> gpd.GeoDataFrame:
    """Download neighbourhoods geodata from Toronto Open Data Portal."""
    # Get data package from Toronto Open Data Portal and Convert it to JSON
    package = requests.get(url, params=params).json()
    # Retrieve dataset URL from nested JSON object
    n_url = (
        package["result"]["resources"][0]["url"].replace(
            "datastore/dump", "download_resource"
        )
        + "?format=geojson&projection=4326"
    )
    # Load geodata from dataset URL into GeoDataFrame
    gdf = gpd.read_file(n_url)

    # # Set Co-ordinate Reference System for Toronto (2019) and access the centroid
    # gdf["centroid"] = gdf["geometry"].to_crs(epsg=2019).centroid.to_crs(epsg=4326)
    # gdf["AREA_LATITUDE"] = gdf["centroid"].y
    # gdf["AREA_LONGITUDE"] = gdf["centroid"].x

    # Check that we have 140 neighbourhoods
    assert len(gdf) == 140
    return gdf

We'll now define the main URL for the Open Data Portal. This will be used to download the geodata and also other publicly available datasets on the portal

In [6]:
# Toronto Open Data Portal
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"

We'll use the above helper function to download the Toronto neighbourhoods geodata and store the result in a `GeoDataFrame`

In [7]:
%%time
neigh_params = {"id": "4def3f65-2a65-4a4f-83c4-b2a4aed72d46"}
gdf = get_neighbourhood_boundary_land_area_data(url, neigh_params)

CPU times: user 197 ms, sys: 18.6 ms, total: 215 ms
Wall time: 1.98 s


Print the Co-Ordinate Reference System of this `GeoDataFrame`

In [8]:
print(gdf.crs)

epsg:4326


**Notes**
1. EPSG is a registry of co-ordinate reference systems (CRS), each identified by a numeric code. Since the inspected establishment locations are given as latitude and longitude, the neighbourhood boundaries need to use the same co-ordinates if we want to find the neighbourhood containing a particular establishment. EPSG code 4326 corresponds to co-ordinates in latitude and longitude. Since this is already set for the geodata we loaded with `geopandas`, we don't have to set it manually.

Fix typographic errors in the neighbourhood names in this dataset
- [North St. James Town](https://www.toronto.ca/ext/sdfa/Neighbourhood%20Profiles/pdf/2016/pdf1/cpa74.pdf) and [Cabbagetown-South St. James Town](https://www.toronto.com/community-static/4550668-cabbagetown-south-st-james-town/)
  - missing space between ...St. and Ja...
- Weston-Pelham Park
  - incorrectly listed as its old name (from 2011) of Weston-Pellam Park ([link](https://www.toronto.ca/wp-content/uploads/2017/11/900b-91-Weston-Pellam-Park.pdf))
  - replace with [new name from 2016](https://www.toronto.ca/ext/sdfa/Neighbourhood%20Profiles/pdf/2016/pdf1/cpa91.pdf)
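
The renaming cell that follows passes `regex=False`, so each key is treated as a literal string. The standard-library sketch below shows why that matters for a key such as `St.James`, where `.` would otherwise match any character. The third name in the list is hypothetical and exists only to show the difference; the first two are assumed to be the uncorrected forms described above.

```python
import re

names = [
    "North St.James Town (74)",  # assumed uncorrected form
    "Weston-Pellam Park (91)",  # assumed uncorrected form
    "StXJames Example",  # hypothetical name, for illustration only
]

# Literal replacement: only the exact text "St.James" is changed
literal = [n.replace("St.James", "St. James") for n in names]

# Regex replacement: "." matches any character, so "StXJames" is changed too
as_regex = [re.sub("St.James", "St. James", n) for n in names]

print(literal)
print(as_regex)
```

With literal matching, the hypothetical third name is left untouched, whereas the regex version rewrites it.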

In [9]:
d_renaming = {
    "St.James": "St. James",
    "Weston-Pellam": "Weston-Pelham",
}
for k, v in d_renaming.items():
    gdf["AREA_NAME"] = gdf["AREA_NAME"].str.replace(k, v, regex=False)

The incorrect names have been successfully replaced as shown below

In [10]:
# Neighbourhood GeoData columns to use
geo_cols = ["AREA_NAME", "geometry", "Shape__Area"]

In [11]:
gdf.query("AREA_NAME.str.contains('James Town|Weston-|Cabbage')")[geo_cols]

Unnamed: 0,AREA_NAME,geometry,Shape__Area
18,North St. James Town (74),"POLYGON ((-79.38057 43.67161, -79.37947 43.671...",811303.9
40,Weston-Pelham Park (91),"POLYGON ((-79.46005 43.66723, -79.46092 43.668...",2794057.0
114,Cabbagetown-South St. James Town (71),"POLYGON ((-79.37672 43.66242, -79.37721 43.663...",2711742.0


**Notes**

1. `AREA_NAME` is the name of each neighbourhood
2. `geometry` is the geo data boundary of each neighbourhood
3. `Shape__Area` is the land area of each neighbourhood

### Neighbourhood Profile Data - Population

The helper function below (`get_toronto_open_data()`) takes a dictionary with a dataset ID and the main URL for the Open Data platform and loads a non-geodata dataset corresponding to that ID into a `DataFrame`
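
The resource-selection step of this helper can be tried offline. The sketch below uses a mocked package response (hypothetical IDs, and only the keys that the helper reads) and a hypothetical function `first_datastore_resource_id()` that returns the ID of the first resource flagged as `datastore_active`. It is an illustration of the lookup logic, not a replacement for the helper.

```python
from typing import Any, Dict, Optional


def first_datastore_resource_id(package: Dict[str, Any]) -> Optional[str]:
    """Return the ID of the first datastore-active resource, if any."""
    for resource in package["result"]["resources"]:
        # .get() tolerates resources that lack the datastore_active key
        if resource.get("datastore_active"):
            return resource["id"]
    return None


# Mocked response with hypothetical resource IDs
mock_package = {
    "result": {
        "resources": [
            {"id": "aaa-file-only", "datastore_active": False},
            {"id": "bbb-tabular", "datastore_active": True},
            {"id": "ccc-tabular", "datastore_active": True},
        ]
    }
}

resource_id = first_datastore_resource_id(mock_package)
print(resource_id)
```

Returning `None` when no resource qualifies makes the "nothing found" case explicit for the caller.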

In [12]:
def get_toronto_open_data(
    url: str, params: Dict, col_rename_dict: Dict = {}
) -> pd.DataFrame:
    """Download data from Toronto Open Data Portal."""
    # Get data package from Toronto Open Data Portal and Convert it to JSON
    package = requests.get(url, params=params).json()
    # Find the first resource that is available through the datastore API
    for resource in package["result"]["resources"]:
        if resource["datastore_active"]:
            datastore_url = (
                "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/"
                "action/datastore_search"
            )
            p = {"id": resource["id"]}
            # Use the resource ID (in the GET request parameters) to download
            # the data and convert it to JSON
            data = requests.get(datastore_url, params=p).json()
            # Get list of dictionaries from result > records inside
            # nested JSON object and convert to DataFrame
            df = pd.DataFrame(data["result"]["records"])
            # Stop after the first datastore-active resource
            break
    # (Optional) Rename columns of DataFrame
    if col_rename_dict:
        df = df.rename(columns=col_rename_dict)
    return df

Another helper function `get_neighbourhood_profile_data()` is now defined. It calls the above `get_toronto_open_data()` function to download the neighbourhood population data for 2011 and 2016 (Canada's census years) and then processes it. Processing details are explained in the comments in the function below
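
Two of the processing steps (building the combined `AREA_NAME` and converting comma-formatted populations to integers) can be illustrated on a single record in plain Python. The record below is a sketch: the comma-formatted strings are assumed raw forms of values that appear in this notebook's output, and `clean_population()` is a hypothetical helper name.

```python
# Assumed raw record for one neighbourhood (populations as strings with commas)
record = {
    "name": "Agincourt North",
    "Neighbourhood Number": "129",
    "Population, 2016": "29,113",
    "Population, 2011": "30,279",
}


def clean_population(value: str) -> int:
    """Remove thousands-separator commas and convert to int."""
    return int(value.replace(",", ""))


# Combine name and number so the result matches the geodata's AREA_NAME format
area_name = f'{record["name"]} ({record["Neighbourhood Number"]})'
pop_2016 = clean_population(record["Population, 2016"])
pop_2011 = clean_population(record["Population, 2011"])

print(area_name, pop_2016, pop_2011)
```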

In [13]:
def get_neighbourhood_profile_data(url: str, params: Dict) -> pd.DataFrame:
    df_neigh_demog = get_toronto_open_data(url, params)
    # Select only the neighbourhood and population rows and
    # take the transpose so that population appears as a column (separate
    # columns for 2011 population and 2016 population)
    df_neigh_demog = (
        df_neigh_demog[
            df_neigh_demog["Characteristic"].isin(
                [
                    "Neighbourhood Number",
                    "Population, 2011",
                    "Population, 2016",
                ]
            )
        ]
        .iloc[:, slice(4, None)]
        .set_index("Characteristic")
        .T.reset_index()
        .iloc[1:]
        .reset_index(drop=True)
        .rename(columns={"index": "name"})
    )
    # Verify we have 140 neighbourhoods of data
    assert len(df_neigh_demog) == 140
    # Combine the neighbourhood name and number columns
    # - need to do this since we are joining this with data that has neighbourhood
    #   name and number combined
    df_neigh_demog["AREA_NAME"] = (
        df_neigh_demog["name"] + " (" + df_neigh_demog["Neighbourhood Number"] + ")"
    )
    # Remove thousands-separator commas from population columns and
    # convert the values to integers
    for c in ["Population, 2016", "Population, 2011"]:
        df_neigh_demog[c] = df_neigh_demog[c].str.replace(",", "").astype(int)
    # Shorten population column names
    df_neigh_demog = df_neigh_demog.rename(
        columns={"Population, 2016": "pop_2016", "Population, 2011": "pop_2011"},
    )
    return df_neigh_demog

We'll use the two helper functions above to download neighbourhood population data

In [14]:
%%time
neigh_profile_params = {"id": "6e19a90f-971c-46b3-852c-0c48c436d1fc"}
df_neigh_demog = get_neighbourhood_profile_data(url, neigh_profile_params)
df_neigh_demog.head(6)

CPU times: user 61.1 ms, sys: 2.18 ms, total: 63.3 ms
Wall time: 580 ms


Characteristic,name,Neighbourhood Number,pop_2016,pop_2011,AREA_NAME
0,Agincourt North,129,29113,30279,Agincourt North (129)
1,Agincourt South-Malvern West,128,23757,21988,Agincourt South-Malvern West (128)
2,Alderwood,20,12054,11904,Alderwood (20)
3,Annex,95,30526,29177,Annex (95)
4,Banbury-Don Mills,42,27695,26918,Banbury-Don Mills (42)
5,Bathurst Manor,34,15873,15434,Bathurst Manor (34)


### Crime Data

Next, we'll use two more helper functions, `get_mci_data()` and `transform_mci_data()`, to download Toronto crime data from the Open Data Portal and process it. Processing details are provided as comments in the two functions below and cover the following core steps
- filtering out types of crimes and locations where the crimes are committed
  - we are not interested here in crimes related to vehicle theft
- aggregating crimes by neighbourhood, date and type of crime
- reshape (pivot) data to have neighbourhood and date as rows and the count of each type of crime as columns
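
The three steps listed above can be sketched in plain Python on a handful of hypothetical records. This is not the notebook's pandas code; the records, the `area` / `date` / `mci` / `premises` keys and the variable names are all made up for illustration, and this sketch keeps the per-day counts in the pivoted result.

```python
from collections import Counter

# Hypothetical crime records (not taken from the real dataset)
records = [
    {"area": "Moss Park (73)", "date": "2015-04-04", "mci": "Assault", "premises": "Commercial"},
    {"area": "Moss Park (73)", "date": "2015-04-04", "mci": "Assault", "premises": "Outside"},
    {"area": "Moss Park (73)", "date": "2015-04-04", "mci": "Robbery", "premises": "Outside"},
    {"area": "Annex (95)", "date": "2015-04-05", "mci": "Break and Enter", "premises": "House"},
    {"area": "Annex (95)", "date": "2015-04-05", "mci": "Assault", "premises": "Other"},
]

# Step 1: filter by where the crime was committed
premises_wanted = {"Outside", "Commercial", "Other"}
kept = [r for r in records if r["premises"] in premises_wanted]

# Step 2: aggregate (count) by neighbourhood, date and type of crime
counts = Counter((r["area"], r["date"], r["mci"]) for r in kept)

# Step 3: pivot so each (neighbourhood, date) row has one column per crime type
crime_types = sorted({mci for _, _, mci in counts})
wide = {}
for (area, day, mci), n in counts.items():
    row = wide.setdefault((area, day), {f"neigh_{t}": 0 for t in crime_types})
    row[f"neigh_{mci}"] = n

for key, row in wide.items():
    print(key, row)
```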

In [15]:
def transform_mci_data(mci_dir: str, date_col_name: str = "date") -> pd.DataFrame:
    """Process Toronto Crimes dataset."""
    # Load data into GeoDataFrame
    # - data comes in .shp file, so we need to initially load it using geopandas
    #   and then convert it to a pandas DataFrame
    cols_to_ignore = [
        # non-unique ID column (we will keep the Index_ column, which is unique)
        "event_uniq",
        # police division (not relevant, we only need neighbourhood)
        "Division",
        # police-related crime codes (not relevant)
        "ucr_code",
        "ucr_ext",
        # Neighbourhood ID (we already have neighbourhood name from the
        # neighbourh column, so this is not necessary)
        "Hood_ID",
        # geodata shape ID (we are not visualizing crimes, so not needed)
        "ObjectId",
        # we want the date when crime occurred, not when it was reported
        # - we want number of crimes in the same neighbourhood as a restaurant
        #   that is inspected; crime occurrence date gives us this info, so we
        #   don't need crime reported date
        "reportedda",
        "reportedye",
        "reportedmo",
        "reported_1",
        "reported_2",
        "reported_3",
        "reportedho",
        # datetime attributes (we can get these from the occurrence column)
        "occurren_1",
        "occurren_2",
        "occurren_3",
        "occurren_4",
        "occurren_5",
        "occurren_6",
        # Location of crime is not necessary; we just need the Neighbourhood
        # which comes from the neighbourh column
        "Lat",
        "Long",
    ]
    gdf_mci = gpd.read_file(
        f"data/raw/{mci_dir}/Major_Crime_Indicators.shp",
        ignore_fields=cols_to_ignore,
    )
    # Convert GeoDataFrame to DataFrame
    # - we are not interested in plotting individual crimes, so drop the geometry
    #   column
    df_mci = pd.DataFrame(gdf_mci).drop(columns=["geometry"])
    # Filter 1
    # - we only want crimes that could have occurred inside / outside a retail
    #   food establishment (exclude crimes committed at apartment, house, school
    #   or on public transit)
    premises_wanted = ["Outside", "Commercial", "Other"]
    # Filter 2
    # - we do not want crimes related to a motor vehicle, a taxi, a home
    #   invasion or a financial institution
    exclude_offences = [
        "Robbery - Vehicle Jacking",
        "Theft From Motor Vehicle Over",
        "Robbery - Taxi",
        "Robbery - Home Invasion",
        "Robbery - Financial Institute",
        "B&E - M/Veh To Steal Firearm",
    ]
    # Apply filters
    df_mci = df_mci.query(
        "premises_t.isin(@premises_wanted) & ~offence.isin(@exclude_offences)"
    )
    # Formatting - Rename columns
    df_mci = df_mci.rename(
        columns={"Neighbourh": "AREA_NAME", "occurrence": date_col_name}
    )
    # Formatting - Convert to Datetime
    df_mci[date_col_name] = pd.to_datetime(df_mci[date_col_name])
    # Aggregate crimes (counts) by neighbourhood, date and type of crime
    # - Index_ is a unique identifier, so aggregate (count) it by neighbourhood
    df_mci_agg = (
        df_mci.groupby(
            ["AREA_NAME", date_col_name, "MCI"],
            as_index=False,
        )["Index_"]
        .count()
        .rename(columns={"Index_": "crimes"})
        .sort_values(by=["AREA_NAME", date_col_name, "MCI"])
    )
    # Pivot data - move crime counts from rows to columns (one column for
    # each type of crime committed)
    df_mci_agg_pivot = (
        df_mci_agg.pivot_table(
            index=["AREA_NAME", date_col_name],
            columns="MCI",
            values="crimes",
            aggfunc="count",
        )
        .fillna(0)
        .astype({k: int for k in df_mci_agg["MCI"].unique()})
        .add_prefix("neigh_")
    )
    # Reset index after pivoting
    df_mci_agg_pivot = df_mci_agg_pivot.reset_index()
    return df_mci_agg_pivot


def get_mci_data(
    url: str, mci_params: Dict, date_col_name: str = "date"
) -> pd.DataFrame:
    """Download Toronto Crimes dataset locally."""
    # Get URL
    package = requests.get(url, params=mci_params).json()
    mci_file_url = package["result"]["resources"][0]["url"]
    # Download .zip folder
    mci_dir = os.path.splitext(os.path.basename(mci_file_url))[0]
    if not os.path.exists(f"data/raw/{mci_dir}"):
        with urlopen(mci_file_url) as zipresp:
            with ZipFile(BytesIO(zipresp.read())) as zfile:
                zfile.extractall(f"data/raw/{mci_dir}")
    # Read .shp file from .zip folder and Aggregate crimes by
    # neighbourhood, date and type of crime
    df_mci = transform_mci_data(mci_dir, date_col_name)
    return df_mci

We'll load and process the Toronto Crimes dataset into a `DataFrame`

In [16]:
%%time
mci_params = {"id": "247788f6-ca20-42e8-b00f-894ac43053e5"}
df_mci = get_mci_data(url, mci_params, "inspection_date")
display(df_mci.head())
summarize_df(df_mci)

MCI,AREA_NAME,inspection_date,neigh_Assault,neigh_Auto Theft,neigh_Break and Enter,neigh_Robbery,neigh_Theft Over
0,Agincourt North (129),2014-01-01,1,0,0,0,0
1,Agincourt North (129),2014-01-15,1,0,0,0,0
2,Agincourt North (129),2014-01-20,0,0,0,1,0
3,Agincourt North (129),2014-01-24,0,0,0,1,0
4,Agincourt North (129),2014-01-27,0,0,0,1,0


Unnamed: 0_level_0,dtype,num_missing,num,nunique,single_non_nan_value
MCI,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AREA_NAME,object,0,78185,141,Moss Park (73)
inspection_date,datetime64[ns],0,78185,2733,2015-04-04 00:00:00
neigh_Assault,int64,0,78185,2,1
neigh_Auto Theft,int64,0,78185,2,0
neigh_Break and Enter,int64,0,78185,2,0
neigh_Robbery,int64,0,78185,2,0
neigh_Theft Over,int64,0,78185,2,0


CPU times: user 13.9 s, sys: 181 ms, total: 14.1 s
Wall time: 14.2 s


**Observations**
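1. This has given us one row per neighbourhood and date, with one column per type of crime. Note that the values are 0/1 indicators rather than counts (each crime column has only two unique values in the summary above): the crimes were already counted in the `groupby` step, so `aggfunc="count"` in the pivot counts the single aggregated row per neighbourhood, date and crime type instead of adding up its `crimes` value. Using `aggfunc="sum"` in `transform_mci_data()` would retain the number of crimes.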

### Other Neighbourhood Datasets

The city accepts 311 service requests, which could involve complaints related to an establishment (e.g. sanitation problems). These could be a useful predictor of a crucial infraction.

The city's open data portal has two datasets related to 311 service requests
- [customer-driven requests](https://open.toronto.ca/dataset/311-service-requests-customer-initiated/)
  - this dataset contains only a small fraction of requests made (see the *Limitations* section for details)
- [service requests made about potholes and graffiti](https://open.toronto.ca/dataset/311-open311-api-calls-for-service-requests/)
  - this dataset is missing all other types of service requests (such as sanitation or garbage accumulation complaints), which are likely to be stronger and more relevant predictors of food infractions

Due to these problems, the number of 311 service requests by neighbourhood and date cannot be used to create features for an ML model.

## Get Name of Neighbourhood containing Establishment

We are now ready to get the name of the neighbourhood containing each inspected establishment.

We will first get all unique inspected establishments and their `latitude` and `longitude`

In [17]:
%%time
unique_locations = df.groupby(
    ["establishment_id", "establishmenttype", "establishment_address"],
    as_index=False,
)[["latitude", "longitude"]].max()
unique_locations = unique_locations.assign(row_num=range(1, len(unique_locations)+1))
unique_locations

CPU times: user 23.4 ms, sys: 0 ns, total: 23.4 ms
Wall time: 22.9 ms


Unnamed: 0,establishment_id,establishmenttype,establishment_address,latitude,longitude,row_num
0,1222579,Food Take Out,870 MARKHAM RD,43.7680,-79.2290,1
1,1222807,Restaurant,1635 LAWRENCE AVE W,43.7046,-79.4922,2
2,9000002,Food Take Out,361 OAKWOOD AVE,43.6872,-79.4385,3
3,9000004,Food Take Out,1788 JANE ST,43.7063,-79.5050,4
4,9000026,Food Take Out,2372 EGLINTON AVE E,43.7320,-79.2710,5
...,...,...,...,...,...,...
18148,10690581,Restaurant,3560 VICTORIA PARK AVE,43.8060,-79.3375,18149
18149,10690642,Bake Shop,20 ST PATRICK ST,43.6509,-79.3890,18150
18150,10690660,Restaurant,549 BLOOR ST W,43.6652,-79.4102,18151
18151,10690679,Food Take Out,1175 ST CLAIR AVE W,43.6777,-79.4434,18152


**Notes**
1. The `row_num` column was added as a dummy / placeholder column with a unique value for each row of unique establishment locations. This column will be used for merging a little later in this notebook.

The `row_num` column is unique for each row of the `unique_locations` found above

In [18]:
print(unique_locations["row_num"].nunique(), len(unique_locations))
unique_locations.head(2)

18153 18153


Unnamed: 0,establishment_id,establishmenttype,establishment_address,latitude,longitude,row_num
0,1222579,Food Take Out,870 MARKHAM RD,43.768,-79.229,1
1,1222807,Restaurant,1635 LAWRENCE AVE W,43.7046,-79.4922,2


Get the name of the neighbourhood containing the establishment that was inspected. To do this, a helper function `get_data_with_neighbourhood()` is used. It creates a temporary `GeoDataFrame` named `df_check`, which holds the name of the neighbourhood for each `row_num`, and merges this back with the `unique_locations` `DataFrame` using the `row_num` column. Merging is required so we can
- count the number of inspections per neighbourhood
- (later) merge the `unique_locations` back with the original data `df`

This is done below

In [19]:
def get_neighbourhood_containing_point(
    gdf: gpd.GeoDataFrame,
    df: pd.DataFrame,
    lat: str = "Latitude",
    lon: str = "Longitude",
    crs: int = 4326,
) -> gpd.GeoDataFrame:
    """Get name associated with geodata shape object containing lat-lon point."""
    # Get all columns from DataFrame of inspection locations and GeoDataFrame
    # of neighbourhoods
    cols_order = list(df) + list(gdf)
    # Use spatial join to get the name associated with the polygon object (neighbourhood)
    # containing a point (inspection location), by checking if the lat-lon co-ordinate
    # is contained within the polygon
    polygons_contains = (
        gpd.sjoin(
            gdf,
            gpd.GeoDataFrame(
                df, geometry=gpd.points_from_xy(df[lon], df[lat]), crs=crs
            ),
            predicate="contains",
        )
        .reset_index(drop=True)
        .drop(columns=["index_right"])[cols_order]
    )
    return polygons_contains


def get_data_with_neighbourhood(
    gdf: gpd.GeoDataFrame,
    df: pd.DataFrame,
    lat: str,
    lon: str,
    col_to_join: str,
    crs: int = 4326,
) -> gpd.GeoDataFrame:
    """Get name of neighbourhood in which inspection was conducted."""
    # columns wanted
    cols_to_keep = [col_to_join, "AREA_NAME", "geometry", "Shape__Area"]
    # Create temporary DataFrame with the name of the neighbourhood (as a column)
    # for each inspection
    df_check = get_neighbourhood_containing_point(gdf, df, lat, lon, crs)[cols_to_keep]
    display(df_check.head(2))

    # Merge the inspections data with the temporary DataFrame so that we get the neighbourhood
    # name for each row alongside other columns from the inspections data
    df = df.merge(df_check.drop(columns=["geometry"]), on=col_to_join, how="left").drop(
        columns=["geometry"]
    )
    # Drop rows without a neighbourhood name - these lie outside the neighbourhood boundaries
    # (meaning they lie outside the city of Toronto. eg. Toronto Pearson Airport in Mississauga)
    print(f"Dropped {df['AREA_NAME'].isna().sum()} rows with a missing AREA_NAME")
    df = df.dropna(subset=["AREA_NAME"])
    return df

**Notes**
1. When a `GeoDataFrame` is created manually from latitudes and longitudes (as with `gpd.GeoDataFrame()` above), its CRS has to be set explicitly. This is done using the `crs` keyword, which is set to [EPSG 4326](https://epsg.io/4326) since, as discussed earlier, this corresponds to co-ordinates in latitude and longitude.

We will use these helper functions to get the name of the neighbourhood containing the inspections in the inspections data

In [20]:
%%time
unique_locations_new = get_data_with_neighbourhood(
    gdf[geo_cols],
    unique_locations,
    "latitude",
    "longitude",
    "row_num",
)
display(unique_locations_new.head(2))

Unnamed: 0,row_num,AREA_NAME,geometry,Shape__Area
0,5390,Casa Loma (96),"POLYGON ((-79.41469 43.67391, -79.41485 43.674...",3678385.0
1,6962,Casa Loma (96),"POLYGON ((-79.41469 43.67391, -79.41485 43.674...",3678385.0


Dropped 1 rows with a missing AREA_NAME


Unnamed: 0,establishment_id,establishmenttype,establishment_address,latitude,longitude,row_num,AREA_NAME,Shape__Area
0,1222579,Food Take Out,870 MARKHAM RD,43.768,-79.229,1,Woburn (137),23664990.0
1,1222807,Restaurant,1635 LAWRENCE AVE W,43.7046,-79.4922,2,Brookhaven-Amesbury (30),6715561.0


CPU times: user 60.5 ms, sys: 18 µs, total: 60.5 ms
Wall time: 58.8 ms


**Notes**
1. The first output shown above is a subset of the columns from the `GeoDataFrame`. The second output is the same as the `unique_locations` (that were inspected) with two extra columns added
   - `AREA_NAME`
     - neighbourhood in which the inspected establishment is located
   - `Shape__Area`
     - land area of the neighbourhood
2. Any locations outside the extreme neighbourhood boundaries will not fall within a City of Toronto neighbourhood and so will not have a neighbourhood name associated with them. Such rows will be dropped from the resulting data returned by this helper function and a message (indicating how many such rows were dropped) is printed to the screen between the two outputs.
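
To make the `contains` test behind the spatial join more concrete, here is a pure-Python ray-casting sketch. It is an illustration of the idea only, not what `geopandas` runs internally. The two square "neighbourhoods", their names and the function names are hypothetical, and points lying exactly on a boundary are not handled specially.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def polygon_contains(polygon: Sequence[Point], point: Point) -> bool:
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def neighbourhood_for(
    point: Point, neighbourhoods: Dict[str, Sequence[Point]]
) -> Optional[str]:
    """Return the first neighbourhood whose polygon contains the point."""
    for name, polygon in neighbourhoods.items():
        if polygon_contains(polygon, point):
            return name
    return None  # point lies outside every boundary


# Hypothetical square boundaries (not real neighbourhood shapes)
neighbourhoods = {
    "Hypothetical West": [(-79.5, 43.6), (-79.4, 43.6), (-79.4, 43.7), (-79.5, 43.7)],
    "Hypothetical East": [(-79.4, 43.6), (-79.3, 43.6), (-79.3, 43.7), (-79.4, 43.7)],
}

inside_west = neighbourhood_for((-79.45, 43.65), neighbourhoods)
outside_all = neighbourhood_for((-79.62, 43.68), neighbourhoods)
print(inside_west, outside_all)
```

A point that falls outside every polygon gets `None`, which corresponds to the rows with a missing `AREA_NAME` that the helper function drops.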

### Spot Checking Accuracy of Extracted Neighbourhood Names

Random checks were done to verify the neighbourhood assigned to the `AREA_NAME` column, as shown below. The latitude and longitude for a sample of the inspected establishments were looked up on Google Maps, and the resulting locations were compared to the Neighbourhood Profiles [here](https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/neighbourhood-profiles/) (see the *Alphabetical listing of neighbourhoods* tab)

In [21]:
unique_locations_new.sample(25)

Unnamed: 0,establishment_id,establishmenttype,establishment_address,latitude,longitude,row_num,AREA_NAME,Shape__Area
463,9005388,Restaurant,769 ST CLAIR AVE W,43.6809,-79.4286,464,Wychwood (94),3217960.0
9337,10468770,Restaurant,33 BALDWIN ST,43.6559,-79.3934,9338,Kensington-Chinatown (78),2933586.0
12989,10548204,Restaurant,545 KING ST W,43.6447,-79.3988,12990,Waterfront Communities-The Island (77),25629770.0
6096,10374544,Food Store (Convenience / Variety),2300 YONGE ST,43.7072,-79.3991,6097,Yonge-Eglinton (100),3160334.0
12426,10534177,Butcher Shop,2500 YONGE ST,43.712,-79.3998,12427,Yonge-Eglinton (100),3160334.0
13630,10561396,Restaurant,1265 DUNDAS ST W,43.6492,-79.4244,13631,Trinity-Bellwoods (81),3306038.0
12416,10534023,Restaurant,4771 STEELES AVE E,43.8253,-79.2988,12417,Milliken (130),18217910.0
13031,10548761,Restaurant,410 ADELAIDE ST W,43.6466,-79.3967,13032,Waterfront Communities-The Island (77),25629770.0
14757,10585086,Restaurant,2183 WESTON RD,43.7031,-79.5257,14758,Weston (113),4912768.0
16536,10633440,Restaurant,709 QUEEN ST E,43.6587,-79.3496,16537,South Riverdale (70),20956720.0


**Notes**
1. The following 25 indexes were checked from the above randomly sampled inspections, and all 25 were found to have the correct neighbourhood name associated with their latitude and longitude
   ```python
   [4805, 13273, 9075, 8435, 9831, 1340, 10349, 1590, 5089, 8445, 9100, 5230, 9506, 9864, 7588, 13403, 1630, 7450, 1481, 7088, 2507, 11690, 12173, 11102, 1587]
   ```

   An automated approach to performing these checks would be to call an API that accepts a latitude and longitude and returns the name of the neighbourhood.

## Merge Neighbourhood Aggregations with GeoData

We'll now merge all the datasets we have loaded above in this notebook
- geodata with neighbourhood metadata
  - `Shape__Area` and `Shape__Length`
    - we could use one of these to normalize neighbourhood population to area to account for differences in neighbourhood sizes
  - `CLASSIFICATION` and `CLASSIFICATION_CODE`
    - these could be useful attributes about a neighbourhood (we'll keep these now and will drop them during ML modeling if necessary)
- number of inspected establishments by neighbourhood
  - simple count of the unique establishments inspected in each neighbourhood (`establishments_inspected`)
- population by neighbourhood
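
The normalization mentioned above (population relative to area) could look like the sketch below. It uses the 2016 population and `Shape__Area` values for two neighbourhoods as they appear in this notebook's output. The units of `Shape__Area` are not stated here, so the result is simply people per unit of `Shape__Area`; `population_per_area()` is a hypothetical helper name.

```python
def population_per_area(population: int, shape_area: float) -> float:
    """Population divided by Shape__Area (units of the area are not verified here)."""
    if shape_area <= 0:
        raise ValueError("shape_area must be positive")
    return population / shape_area


# (2016 population, Shape__Area) as shown in this notebook's output
casa_loma = population_per_area(10968, 3678385.0)
annex = population_per_area(30526, 5337192.0)

# Annex has more people per unit of area than Casa Loma
print(annex > casa_loma)
```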

In [22]:
df_neigh_stats = (
    # geodata per neighbourhood
    gdf.set_index("AREA_NAME")[
        [
            "Shape__Area",
            "Shape__Length",
            # "geometry",
            "CLASSIFICATION",
            "CLASSIFICATION_CODE",
            # "AREA_LATITUDE",
            # "AREA_LONGITUDE",
        ]
    ]
    # number of inspections per neighbourhood
    .merge(
        unique_locations_new.groupby("AREA_NAME")["row_num"]
        .count()
        .rename("establishments_inspected")
        .to_frame(),
        left_index=True,
        right_index=True,
        how="left",
    )
    # population per neighbourhood
    .merge(
        df_neigh_demog.set_index("AREA_NAME")[["pop_2011", "pop_2016"]],
        left_index=True,
        right_index=True,
        how="left",
    ).add_prefix("neigh_")
)
# Clean column names
df_neigh_stats.columns = df_neigh_stats.columns.str.lower().str.replace("__", "_")
df_neigh_stats = df_neigh_stats.reset_index()
df_neigh_stats

Unnamed: 0,AREA_NAME,neigh_shape_area,neigh_shape_length,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop_2011,neigh_pop_2016
0,Casa Loma (96),3.678385e+06,8214.176485,,,29,10487,10968
1,Annex (95),5.337192e+06,10513.883143,,,432,29177,30526
2,Caledonia-Fairbank (109),2.955857e+06,6849.911724,,,28,9851,9955
3,Woodbine Corridor (64),3.052518e+06,7512.966773,,,48,11703,12541
4,Lawrence Park South (103),6.211341e+06,13530.370002,,,37,15070,15179
...,...,...,...,...,...,...,...,...
135,Dorset Park (126),1.153256e+07,14645.384509,Emerging Neighbourhood,EN,136,24363,25003
136,Centennial Scarborough (133),1.049677e+07,16683.674975,,,20,13093,13362
137,Humbermede (22),8.478390e+06,17227.580237,Neighbourhood Improvement Area,NIA,74,15853,15545
138,Willowdale West (37),5.533653e+06,10354.990437,,,122,15004,16936


Verify that no neighbourhoods are missing population data after merging

In [23]:
assert df_neigh_stats.query("neigh_pop_2011.isna() | neigh_pop_2016.isna()").empty

## Merge Modified Neighbourhood Aggregations with Inspections Data

We'll now merge the neighbourhood aggregations and geodata from the previous section with the original inspections data (`df`).

### Neighbourhood Population and GeoData

First we'll merge the modified unique locations (which includes the name of the neighbourhood for each location that was inspected) with the aggregated neighbourhood stats (population, land area, land length, classification code)

In [24]:
%%time
unique_locations_full = unique_locations_new.merge(df_neigh_stats, on="AREA_NAME", how="left")
unique_locations_full

CPU times: user 5.2 ms, sys: 91 µs, total: 5.29 ms
Wall time: 4.91 ms


Unnamed: 0,establishment_id,establishmenttype,establishment_address,latitude,longitude,row_num,AREA_NAME,Shape__Area,neigh_shape_area,neigh_shape_length,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop_2011,neigh_pop_2016
0,1222579,Food Take Out,870 MARKHAM RD,43.7680,-79.2290,1,Woburn (137),2.366499e+07,2.366499e+07,25089.815423,Neighbourhood Improvement Area,NIA,232,53350,53485
1,1222807,Restaurant,1635 LAWRENCE AVE W,43.7046,-79.4922,2,Brookhaven-Amesbury (30),6.715561e+06,6.715561e+06,12417.055559,,,65,17787,17757
2,9000002,Food Take Out,361 OAKWOOD AVE,43.6872,-79.4385,3,Oakwood Village (107),4.247608e+06,4.247608e+06,8766.961761,,,95,21073,21210
3,9000004,Food Take Out,1788 JANE ST,43.7063,-79.5050,4,Weston (113),4.912768e+06,4.912768e+06,14113.249585,Neighbourhood Improvement Area,NIA,147,18170,17992
4,9000026,Food Take Out,2372 EGLINTON AVE E,43.7320,-79.2710,5,Ionview (125),3.743076e+06,3.743076e+06,8806.027493,Neighbourhood Improvement Area,NIA,19,13091,13641
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18147,10690581,Restaurant,3560 VICTORIA PARK AVE,43.8060,-79.3375,18149,Hillcrest Village (48),1.036532e+07,1.036532e+07,13263.782407,,,82,17656,16934
18148,10690642,Bake Shop,20 ST PATRICK ST,43.6509,-79.3890,18150,Kensington-Chinatown (78),2.933586e+06,2.933586e+06,6945.056557,,,771,18495,17945
18149,10690660,Restaurant,549 BLOOR ST W,43.6652,-79.4102,18151,University (79),2.687050e+06,2.687050e+06,6872.849906,,,247,7782,7607
18150,10690679,Food Take Out,1175 ST CLAIR AVE W,43.6777,-79.4434,18152,Corso Italia-Davenport (92),3.605719e+06,3.605719e+06,8404.231261,,,127,13743,14133


**Notes**
1. This gives us the neighbourhood name containing each unique inspected location and the aggregated neighbourhood statistics for each neighbourhood.

We'll now connect the year in which the inspection was performed to the appropriate neighbourhood population column (`pop_2011` or `pop_2016`) from the above modified neighbourhood aggregation data (`unique_locations_full`). Suppose the ML model that predicts crucial infractions ahead of time is to be used by an inspector, and one of its features is the population of the neighbourhood containing the inspected location. Then this population must be available when the model is trained, when it makes its prediction and when it is evaluated. If an inspection is performed in 2016, we can't use the population from the 2016 census, since that data only became available in 2017 (at the earliest; discussed below), so we would have to use the population from the 2011 census.

The following timeline explains which census' results can be used and when
- 2013, 2014, 2015, 2016, 2017
  - 2011 census population data was posted on *statcan.gc.ca* in [February 2012](https://en.wikipedia.org/wiki/2011_Canadian_census#Data_releases)
  - Statistics Canada posts population data from the most recent census at the level of [Forward Sortation Area](https://www.ic.gc.ca/eic/site/bsf-osb.nsf/eng/br03396.html), which the city then aggregates by neighbourhood (see the *PLEASE NOTE:* section from [the *Neighbourhood Profiles* dataset](https://open.toronto.ca/dataset/neighbourhood-profiles/), which was used to get the `pop_2011` and `pop_2016` columns)
  - we'll assume that the city took until the end of 2012 to post this in its *Neighbourhood Profiles* dataset
  - so, for all inspections performed in these five years (Jan 1, 2013 to Dec 31, 2017), the population to be used is that from the most recent (2011) census (`pop_2011`)
- 2018 onwards
  - 2016 census population data was posted on *statcan.gc.ca* in [February 2017](https://www.statcan.gc.ca/en/about/smr09/smr09_061)
  - we'll assume that the city took until the end of 2017 to post this in its *Neighbourhood Profiles* dataset
  - so, for all inspections performed from Jan 1, 2018 onwards, the population to be used is that from the 2016 census (`pop_2016`)
- 2011, 2012
  - 2006 census population data was posted on *statcan.gc.ca* in [March 2007](https://en.wikipedia.org/wiki/2006_Canadian_census#Population_and_dwelling_counts)
  - so, for all inspections performed in these years (all dates in 2011 and 2012), the population to be used is that from the 2006 census (`pop_2006`)
  - we can assume that the city took until the end of 2007 to post this in its *Neighbourhood Profiles* dataset; however, this population is not listed in that dataset (only the populations from the 2011 and 2016 censuses are listed), so the population of every neighbourhood for these two years will be treated as a missing value (`np.nan`)
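
The timeline above can be sketched as a small lookup function. This is only an illustration: the name `census_year_for_inspection` is hypothetical (it is not used elsewhere in this notebook), and the cut-off years encode the assumptions stated in the timeline about when the city posted each census. Inspection years before 2011 are outside the data and are not handled specially here.

```python
def census_year_for_inspection(inspection_year: int) -> int:
    """Return the most recent census assumed usable for an inspection year.

    Hypothetical helper; the thresholds follow the timeline above.
    """
    if inspection_year >= 2018:
        # 2016 census assumed available from the city by the end of 2017
        return 2016
    if inspection_year >= 2013:
        # 2011 census assumed available from the city by the end of 2012
        return 2011
    # 2011 and 2012 inspections fall back to the 2006 census
    return 2006


# Map every inspection year covered by the data to a census year
census_by_year = {year: census_year_for_inspection(year) for year in range(2011, 2020)}
```

Note that an inspection performed in 2016 or 2017 maps to the 2011 census, not the 2016 census.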

We will need to take this into account when merging the modified neighbourhood aggregation data to the inspections data (`df`).

To do this, we'll first define three year ranges (2011-2012, 2013-2017 and 2018-)

In [25]:
census_2006_years = range(2011, 2012 + 1)
census_2011_years = range(2013, 2017 + 1)
census_2016_years = range(2018, df["inspection_date"].dt.year.max() + 1)
print(census_2006_years, census_2011_years, census_2016_years)

range(2011, 2013) range(2013, 2018) range(2018, 2020)


Next, we'll filter the inspections in each of these three year ranges and merge them with the associated population column from the modified neighbourhood aggregations (`neigh_pop_2011` or `neigh_pop_2016`, or a missing value for the 2006 census)

In [26]:
unique_locations_full.head(2)

Unnamed: 0,establishment_id,establishmenttype,establishment_address,latitude,longitude,row_num,AREA_NAME,Shape__Area,neigh_shape_area,neigh_shape_length,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop_2011,neigh_pop_2016
0,1222579,Food Take Out,870 MARKHAM RD,43.768,-79.229,1,Woburn (137),23664990.0,23664990.0,25089.815423,Neighbourhood Improvement Area,NIA,232,53350,53485
1,1222807,Restaurant,1635 LAWRENCE AVE W,43.7046,-79.4922,2,Brookhaven-Amesbury (30),6715561.0,6715561.0,12417.055559,,,65,17787,17757


In [27]:
%%time
df_full = pd.concat(
    [
        # filter to get inspections from 2011-2012
        # - merge with modified neighbourhood aggregations data
        df.query("inspection_date.dt.year.isin(@census_2006_years)").merge(
            # modified neighbourhood aggregations with population from 2006 census
            # - this is missing, so assign a missing value (np.nan) to all neighbourhoods
            unique_locations_full.assign(neigh_pop_2006=np.nan)
            .drop(columns=["row_num", "neigh_pop_2011", "neigh_pop_2016"])
            .rename(columns={"neigh_pop_2006": "neigh_pop"})
            .assign(pop_census_year=2006),
            on=[
                "establishment_id",
                "establishmenttype",
                "establishment_address",
                "latitude",
                "longitude",
            ],
            how="left",
        ),
        # filter to get inspections from 2013-2017
        # - merge with modified neighbourhood aggregations data
        df.query("inspection_date.dt.year.isin(@census_2011_years)").merge(
            # modified neighbourhood aggregations with population from 2011 census
            unique_locations_full.drop(columns=["row_num", "neigh_pop_2016"])
            .rename(columns={"neigh_pop_2011": "neigh_pop"})
            .assign(pop_census_year=2011),
            on=[
                "establishment_id",
                "establishmenttype",
                "establishment_address",
                "latitude",
                "longitude",
            ],
            how="left",
        ),
        # filter to get inspections from 2018-
        # - merge with modified neighbourhood aggregations data
        df.query("inspection_date.dt.year.isin(@census_2016_years)").merge(
            # modified neighbourhood aggregations with population from 2016 census
            unique_locations_full.drop(columns=["row_num", "neigh_pop_2011"])
            .rename(columns={"neigh_pop_2016": "neigh_pop"})
            .assign(pop_census_year=2016),
            on=[
                "establishment_id",
                "establishmenttype",
                "establishment_address",
                "latitude",
                "longitude",
            ],
            how="left",
        ),
    ],
    ignore_index=True,
)
df_full

CPU times: user 110 ms, sys: 0 ns, total: 110 ms
Wall time: 110 ms


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,...,longitude,AREA_NAME,Shape__Area,neigh_shape_area,neigh_shape_length,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop,pop_census_year
0,9000002,Food Take Out,361 OAKWOOD AVE,102611725,2011-10-05,Operator fail to properly wash equipment. Oper...,3,0,2,5,...,-79.4385,Oakwood Village (107),4.247608e+06,4.247608e+06,8766.961761,,,95.0,,2006.0
1,9000029,Food Take Out,2548 EGLINTON AVE W,102594872,2011-09-07,Display hazardous foods at internal temperatur...,1,1,0,2,...,-79.4715,Beechborough-Greenbrook (112),3.509860e+06,3.509860e+06,8866.793417,Neighbourhood Improvement Area,NIA,14.0,,2006.0
2,9000046,Food Take Out,1468 QUEEN ST W,102613594,2011-10-11,Operator fail to properly wash surfaces in roo...,3,0,3,6,...,-79.4366,Roncesvalles (86),2.875399e+06,2.875399e+06,8521.476353,,,224.0,,2006.0
3,9000109,Restaurant,493 DANFORTH AVE,102671200,2012-01-05,Operator fail to ensure food is not contaminat...,3,1,2,6,...,-79.3492,North Riverdale (68),3.416312e+06,3.416312e+06,7571.599524,,,149.0,,2006.0
4,9000239,Restaurant,1220 ST CLAIR AVE W,102571924,2011-08-08,Operate food premise--three-sink equipment not...,1,0,0,1,...,-79.4448,Corso Italia-Davenport (92),3.605719e+06,3.605719e+06,8404.231261,,,127.0,,2006.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83697,10690581,Restaurant,3560 VICTORIA PARK AVE,104594294,2019-10-22,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,0,3,3,...,-79.3375,Hillcrest Village (48),1.036532e+07,1.036532e+07,13263.782407,,,82.0,16934.0,2016.0
83698,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,1,...,-79.3890,Kensington-Chinatown (78),2.933586e+06,2.933586e+06,6945.056557,,,771.0,17945.0,2016.0
83699,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1,0,1,2,...,-79.4102,University (79),2.687050e+06,2.687050e+06,6872.849906,,,247.0,7607.0,2016.0
83700,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,1,...,-79.4434,Corso Italia-Davenport (92),3.605719e+06,3.605719e+06,8404.231261,,,127.0,14133.0,2016.0
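
As a possible alternative to filtering and concatenating three year ranges, the same result could be sketched by reshaping the population columns to long format and joining on the census year. The toy `inspections` and `locations` frames below are illustrative stand-ins for `df` and `unique_locations_full` (only one join key is used for brevity), and the bin edges repeat the timeline assumptions made earlier.

```python
import pandas as pd

# Toy stand-ins for `df` and `unique_locations_full` (illustrative values only)
inspections = pd.DataFrame(
    {
        "establishment_id": [1, 1, 2, 2],
        "inspection_date": pd.to_datetime(
            ["2011-10-05", "2016-03-01", "2013-01-01", "2019-10-22"]
        ),
    }
)
locations = pd.DataFrame(
    {
        "establishment_id": [1, 2],
        "neigh_pop_2011": [21073, 18170],
        "neigh_pop_2016": [21210, 17992],
    }
)

# Long format: one row per (establishment, census year)
pop_long = locations.melt(
    id_vars="establishment_id",
    value_vars=["neigh_pop_2011", "neigh_pop_2016"],
    var_name="pop_census_year",
    value_name="neigh_pop",
)
# "neigh_pop_2011" -> 2011
pop_long["pop_census_year"] = pop_long["pop_census_year"].str[-4:].astype(int)

# Census assumed usable at inspection time: (2010, 2012] -> 2006,
# (2012, 2017] -> 2011 and (2017, inf) -> 2016
inspections["pop_census_year"] = pd.cut(
    inspections["inspection_date"].dt.year,
    bins=[2010, 2012, 2017, float("inf")],
    labels=[2006, 2011, 2016],
).astype(int)

# There are no 2006 rows in `pop_long`, so 2011-2012 inspections get NaN
merged = inspections.merge(
    pop_long, on=["establishment_id", "pop_census_year"], how="left"
)
```

Because the 2006 population is absent from the long table, the `LEFT JOIN` leaves `neigh_pop` missing for 2011 and 2012 inspections without an explicit `np.nan` assignment.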


Verify that the `LEFT JOIN` has produced the same number of rows as the original data

In [28]:
assert len(df_full) == len(df)
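
A complementary check, sketched here on toy data, is the `validate` argument of `DataFrame.merge`, which raises `pandas.errors.MergeError` at merge time if the right-hand keys are not unique. The frames `left`, `right_ok` and `right_dup` are hypothetical and only illustrate the behaviour.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy inspections (many rows per establishment)
left = pd.DataFrame({"establishment_id": [1, 1, 2]})
# Toy locations with unique keys, and a variant with a duplicated key
right_ok = pd.DataFrame({"establishment_id": [1, 2], "AREA_NAME": ["A", "B"]})
right_dup = pd.DataFrame(
    {"establishment_id": [1, 1, 2], "AREA_NAME": ["A", "A", "B"]}
)

# Unique right-hand keys: the LEFT JOIN keeps the row count of `left`
merged = left.merge(
    right_ok, on="establishment_id", how="left", validate="many_to_one"
)

# Duplicated right-hand keys would silently add rows; `validate` raises instead
try:
    left.merge(right_dup, on="establishment_id", how="left", validate="many_to_one")
    raised = False
except MergeError:
    raised = True
```

This catches duplicated join keys before they inflate the row count, whereas the `assert` above only detects the problem after the merge.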

### Neighbourhood Crimes

Finally, we will merge the above output with the crimes data aggregated by date and neighbourhood.

Show the first three rows of the aggregated crimes data (daily MCI by neighbourhood)

In [29]:
df_mci.head(3)

MCI,AREA_NAME,inspection_date,neigh_Assault,neigh_Auto Theft,neigh_Break and Enter,neigh_Robbery,neigh_Theft Over
0,Agincourt North (129),2014-01-01,1,0,0,0,0
1,Agincourt North (129),2014-01-15,1,0,0,0,0
2,Agincourt North (129),2014-01-20,0,0,0,1,0


Show the first three rows of the modified neighbourhood aggregation and inspections data (merged in the previous sub-section)

In [30]:
df_full.head(3)

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,...,longitude,AREA_NAME,Shape__Area,neigh_shape_area,neigh_shape_length,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop,pop_census_year
0,9000002,Food Take Out,361 OAKWOOD AVE,102611725,2011-10-05,Operator fail to properly wash equipment. Oper...,3,0,2,5,...,-79.4385,Oakwood Village (107),4247608.0,4247608.0,8766.961761,,,95.0,,2006.0
1,9000029,Food Take Out,2548 EGLINTON AVE W,102594872,2011-09-07,Display hazardous foods at internal temperatur...,1,1,0,2,...,-79.4715,Beechborough-Greenbrook (112),3509860.0,3509860.0,8866.793417,Neighbourhood Improvement Area,NIA,14.0,,2006.0
2,9000046,Food Take Out,1468 QUEEN ST W,102613594,2011-10-11,Operator fail to properly wash surfaces in roo...,3,0,3,6,...,-79.4366,Roncesvalles (86),2875399.0,2875399.0,8521.476353,,,224.0,,2006.0


`LEFT JOIN` the above `DataFrame` with the aggregated crimes data using the neighbourhood name and date columns

In [31]:
%%time
df_full_with_mci = df_full.merge(df_mci, on=["AREA_NAME", "inspection_date"], how="left")
display(df_full_with_mci.head())
summarize_df(df_full_with_mci)

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,...,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop,pop_census_year,neigh_Assault,neigh_Auto Theft,neigh_Break and Enter,neigh_Robbery,neigh_Theft Over
0,9000002,Food Take Out,361 OAKWOOD AVE,102611725,2011-10-05,Operator fail to properly wash equipment. Oper...,3,0,2,5,...,,,95.0,,2006.0,,,,,
1,9000029,Food Take Out,2548 EGLINTON AVE W,102594872,2011-09-07,Display hazardous foods at internal temperatur...,1,1,0,2,...,Neighbourhood Improvement Area,NIA,14.0,,2006.0,,,,,
2,9000046,Food Take Out,1468 QUEEN ST W,102613594,2011-10-11,Operator fail to properly wash surfaces in roo...,3,0,3,6,...,,,224.0,,2006.0,,,,,
3,9000109,Restaurant,493 DANFORTH AVE,102671200,2012-01-05,Operator fail to ensure food is not contaminat...,3,1,2,6,...,,,149.0,,2006.0,,,,,
4,9000239,Restaurant,1220 ST CLAIR AVE W,102571924,2011-08-08,Operate food premise--three-sink equipment not...,1,0,0,1,...,,,127.0,,2006.0,,,,,


Unnamed: 0,dtype,num_missing,num,nunique,single_non_nan_value
establishment_id,int64,0,83702,18028,10493249
establishmenttype,object,0,83702,13,Food Court Vendor
establishment_address,object,0,83702,9838,3401 DUFFERIN ST
inspection_id,int64,0,83702,83515,104403559
inspection_date,datetime64[ns],0,83702,2318,2019-02-08 00:00:00
infractions_summary,object,0,83702,37683,FOOD PREMISE NOT MAINTAINED WITH FOOD HANDLING...
num_significant,int64,0,83702,34,0
num_crucial,int64,0,83702,15,0
num_minor,int64,0,83702,25,2
num_infractions,int64,0,83702,55,2


CPU times: user 177 ms, sys: 4.02 ms, total: 181 ms
Wall time: 180 ms


**Observations**
1. `neigh_Assault`, `neigh_Auto Theft`, `neigh_Break and Enter`, `neigh_Robbery` and `neigh_Theft Over` have missing values
   - before 2014, since there is no MCI crime data before Jan 1, 2014
     - these rows will be kept as missing values
   - from 2014 onwards, since some types of crimes were not committed on specific dates in certain neighbourhoods
     - these missing values can be filled in with zeros

Get the fully merged inspections before 2014 and from 2014 onwards

In [32]:
# Get pre-2014 merged data
df_full_with_mci_pre_2014 = df_full_with_mci.query(
    "inspection_date.dt.year < 2014"
).copy()

# Get merged data for 2014 onwards
df_full_with_mci_incl_post_2014 = df_full_with_mci.query(
    "inspection_date.dt.year >= 2014"
).copy()

Fill missing values in the `neigh_Assault`, `neigh_Auto Theft`, `neigh_Break and Enter`, `neigh_Robbery` and `neigh_Theft Over` crime columns **on or after** Jan 1, 2014 with zeros

In [33]:
for c in [
    "neigh_Assault",
    "neigh_Auto Theft",
    "neigh_Break and Enter",
    "neigh_Robbery",
    "neigh_Theft Over",
]:
    df_full_with_mci_incl_post_2014[c] = df_full_with_mci_incl_post_2014[c].fillna(0)

Combine pre- and post-2014 merged datasets

In [34]:
df_full_with_mci = pd.concat(
    [df_full_with_mci_pre_2014, df_full_with_mci_incl_post_2014], ignore_index=True
)
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
display(pd.concat([df_full_with_mci.head(), df_full_with_mci.tail()]))
summarize_df(df_full_with_mci)

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,...,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop,pop_census_year,neigh_Assault,neigh_Auto Theft,neigh_Break and Enter,neigh_Robbery,neigh_Theft Over
0,9000002,Food Take Out,361 OAKWOOD AVE,102611725,2011-10-05,Operator fail to properly wash equipment. Oper...,3,0,2,5,...,,,95.0,,2006.0,,,,,
1,9000029,Food Take Out,2548 EGLINTON AVE W,102594872,2011-09-07,Display hazardous foods at internal temperatur...,1,1,0,2,...,Neighbourhood Improvement Area,NIA,14.0,,2006.0,,,,,
2,9000046,Food Take Out,1468 QUEEN ST W,102613594,2011-10-11,Operator fail to properly wash surfaces in roo...,3,0,3,6,...,,,224.0,,2006.0,,,,,
3,9000109,Restaurant,493 DANFORTH AVE,102671200,2012-01-05,Operator fail to ensure food is not contaminat...,3,1,2,6,...,,,149.0,,2006.0,,,,,
4,9000239,Restaurant,1220 ST CLAIR AVE W,102571924,2011-08-08,Operate food premise--three-sink equipment not...,1,0,0,1,...,,,127.0,,2006.0,,,,,
83697,10690581,Restaurant,3560 VICTORIA PARK AVE,104594294,2019-10-22,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,0,3,3,...,,,82.0,16934.0,2016.0,0.0,0.0,0.0,0.0,0.0
83698,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,1,...,,,771.0,17945.0,2016.0,1.0,0.0,0.0,0.0,0.0
83699,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1,0,1,2,...,,,247.0,7607.0,2016.0,0.0,0.0,1.0,0.0,0.0
83700,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,1,...,,,127.0,14133.0,2016.0,0.0,0.0,0.0,0.0,0.0
83701,10690681,Restaurant,200 QUEENS PLATE DR,104594980,2019-10-23,Fail to Hold a Valid Food Handler's Certificat...,0,0,0,1,...,,,336.0,33312.0,2016.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,dtype,num_missing,num,nunique,single_non_nan_value
establishment_id,int64,0,83702,18028,10681321
establishmenttype,object,0,83702,13,Restaurant
establishment_address,object,0,83702,9838,1316 QUEEN ST W
inspection_id,int64,0,83702,83515,104534221
inspection_date,datetime64[ns],0,83702,2318,2019-08-01 00:00:00
infractions_summary,object,0,83702,37683,Fail to Produce Valid Food Handlers Certificat...
num_significant,int64,0,83702,34,1
num_crucial,int64,0,83702,15,0
num_minor,int64,0,83702,25,0
num_infractions,int64,0,83702,55,2
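
The split, fill and re-combine steps above could also be sketched as a single masked assignment, which leaves the row order untouched. The `toy` frame and the two-column `crime_cols` list below are illustrative stand-ins for `df_full_with_mci` and the five crime columns.

```python
import numpy as np
import pandas as pd

crime_cols = ["neigh_Assault", "neigh_Robbery"]

# Toy stand-in for `df_full_with_mci` (illustrative values only)
toy = pd.DataFrame(
    {
        "inspection_date": pd.to_datetime(
            ["2013-05-01", "2014-01-01", "2015-06-10"]
        ),
        "neigh_Assault": [np.nan, 1.0, np.nan],
        "neigh_Robbery": [np.nan, np.nan, 2.0],
    }
)

# Rows dated on or after Jan 1, 2014 fall inside the MCI crime data coverage
has_crime_data = toy["inspection_date"] >= pd.Timestamp("2014-01-01")

# Fill zeros only where crime data exists; earlier rows stay missing
toy.loc[has_crime_data, crime_cols] = toy.loc[has_crime_data, crime_cols].fillna(0)
```

Rows before 2014 keep their missing values, which matches the treatment described in the observations above.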


**Notes**
1. Population data from the 2006 census is not available, so the missing values in the `neigh_pop` column correspond to inspections performed in 2011 and 2012.
2. `neigh_Assault`, `neigh_Auto Theft`, `neigh_Break and Enter`, `neigh_Robbery` and `neigh_Theft Over` are missing for inspections in 2011, 2012 and 2013 since there is no crime data for these years, as mentioned above. Since the inspections data starts in 2011 and was `LEFT JOIN`ed to the crimes data, these rows have missing values in these columns.
3. The neighbourhood classification and its code are missing for many neighbourhoods in the neighbourhood boundaries dataset, so we expect missing values to be present here.

Verify that the `LEFT JOIN` has produced the same number of rows as the original data

In [35]:
assert len(df_full_with_mci) == len(df)
assert len(df_full_with_mci) == len(df_full)

## Export to Disk

Data acquisition and processing is now complete. This version of the data will be exported to a CSV file for use in exploratory data analysis and ML experiments

In [36]:
%%time
time_now = datetime.now().strftime("%Y%m%d_%H%M%S")
df_full_with_mci.to_csv(f"data/processed/processed__{time_now}.csv", index=False)

CPU times: user 1.26 s, sys: 32 ms, total: 1.29 s
Wall time: 1.29 s
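
Since the timestamp in the file name is zero-padded, the file names sort chronologically, so a later notebook could pick the most recent export by sorting the matches explicitly (`glob` itself does not guarantee an order). The sketch below uses a temporary directory and made-up timestamps instead of `data/processed`.

```python
import os
import tempfile
from datetime import datetime
from glob import glob

import pandas as pd

# Toy frame standing in for `df_full_with_mci`
toy = pd.DataFrame({"inspection_id": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two exports with different (made-up) timestamps
    for ts in [datetime(2020, 1, 1, 9, 0, 0), datetime(2020, 1, 2, 9, 0, 0)]:
        stamp = ts.strftime("%Y%m%d_%H%M%S")
        toy.to_csv(os.path.join(tmp_dir, f"processed__{stamp}.csv"), index=False)

    # Sort the matches so that the last one is the most recent export
    latest = sorted(glob(os.path.join(tmp_dir, "processed__*.csv")))[-1]
    latest_name = os.path.basename(latest)
    reloaded = pd.read_csv(latest)
```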
