# Transforming Infractions into Inspections

In [1]:
%load_ext lab_black

In [2]:
import configparser
from datetime import datetime

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

In [3]:
# Access `../sql.ini` (database connection details) as environment variables
config = configparser.ConfigParser()
config.read("../sql.ini")
default_cfg = config["default"]

In [4]:
DB_TYPE = default_cfg["DB_TYPE"]
DB_DRIVER = default_cfg["DB_DRIVER"]
DB_USER = default_cfg["DB_USER"]
DB_PASS = default_cfg["DB_PASS"]
DB_HOST = default_cfg["DB_HOST"]
DB_PORT = default_cfg["DB_PORT"]
DB_NAME = default_cfg["DB_NAME"]

In [5]:
# Connect to all databases (required to perform CRUD operations and submit queries)
URI = f"{DB_TYPE}+{DB_DRIVER}://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

## Background

The following are facts about the data based on the city of Toronto's [Open Data Portal page for DineSafe data](https://open.toronto.ca/dataset/dinesafe/) and [general info page for the DineSafe program](https://www.toronto.ca/community-people/health-wellness-care/health-programs-advice/food-safety/dinesafe/about-dinesafe/)
1. A single inspection takes place on a specific date at a single establishment. An `inspection_id` should be unique for each inspection. An `establishment_id` should be unique for each establishment.
2. A group of establishments, in a chain (such as the [SUBWAY](https://en.wikipedia.org/wiki/Subway_(restaurant)) brand), can have multiple locations.
3. Each location can be inspected one or more times (usually more than once). So, a single `establishment_id` and `inspection_id` should be associated with a single `inspection_date`.
4. One or more infractions can be recorded per inspection. In the data, each infraction is listed on a single row. There can be multiple rows (of infractions) per inspection.
5. If a Significant infraction is detected, an [inspector returns within two days](https://www.toronto.ca/community-people/health-wellness-care/health-programs-advice/food-safety/dinesafe/dinesafe-infractions/) to re-inspect (follow-up inspection) the establishment.

### Implications for Current Use-Case

For the current ML use-case, we require each *observation* to be an independent inspection with an infraction (crucial, significant or minor). We will then create a binary variable indicating whether the inspection resulted in a crucial infraction (1) or not (0) since that is that label that the ML algorithm needs to predict. The ML model will not be predicting the outcome of follow-up inspections, but will only be trained to predict the outcome (if there was a crucial infraction or not) of the initial inspection. Also, since there is no use in predicting the outcome of inspections that do not result in an infraction (crucial or not), the ML algorithm will not need access to data about such inpsections.

When exploring the data, we will need to take these considerations into account as well as the facts about the data mentioned above.

## Connect to the MySQL Database

Create a SQLAlchemy engine object and get a connection to the `dinesafe` database on the MySQL server

In [6]:
engine = create_engine(URI)
conn = engine.connect()

## Preliminary Exploration of Data

In [7]:
%%time
df_query = pd.read_sql(
    """
    SELECT *
    FROM inspections
    LIMIT 6
    """,
    con=conn,
)
df_query

CPU times: user 4.4 ms, sys: 404 µs, total: 4.8 ms
Wall time: 4.21 ms


Unnamed: 0,row_id,establishment_id,inspection_id,establishment_name,establishmenttype,establishment_address,latitude,longitude,establishment_status,minimum_inspections_peryear,infraction_details,inspection_date,severity,action,court_outcome,amount_fined
0,1,1222579,102810896,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Pass,2,,2012-08-21,,,,
1,2,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,2013-06-26,S - Significant,Corrected During Inspection,,
2,3,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,Food handler fail to wear headgear,2013-06-26,M - Minor,Notice to Comply,,
3,4,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,Operator fail to ensure food is not contaminat...,2013-06-26,C - Crucial,Notice to Comply,,
4,5,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,Operator fail to maintain hazardous food(s) at...,2013-06-26,C - Crucial,Notice to Comply,,
5,6,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,Operator fail to properly maintain rooms,2013-06-26,M - Minor,Notice to Comply,,


### Count number of establishments per Inspection

Count the number of establishments per inspection (we should have, at most, one restaurant for each inspection ID) and sort the results in descending order of the number of establishments

In [8]:
%%time
df_query = pd.read_sql(
    """
    SELECT inspection_id,
           COUNT(DISTINCT(establishment_id)) AS num_establishments
    FROM inspections
    GROUP BY inspection_id
    ORDER BY COUNT(DISTINCT(inspection_id)) DESC
    """,
    con=conn,
)
df_query

CPU times: user 1.34 s, sys: 57.8 ms, total: 1.4 s
Wall time: 2.19 s


Unnamed: 0,inspection_id,num_establishments
0,104571930,1
1,104571931,1
2,104571932,1
3,104571939,1
4,104571970,1
...,...,...
246629,103934110,1
246630,103934112,1
246631,103934121,1
246632,103934148,1


As expected, we only have one establishment recorded per inspection ID.

### Count number of Inspections per Establishment

Count the number of inspections per establishment (we should have one or more inspections per establishment)

In [9]:
%%time
df_query = pd.read_sql(
    """
    SELECT establishment_id,
           establishmenttype,
           establishment_address,
           COUNT(DISTINCT(inspection_id)) AS num_inspections
    FROM inspections
    GROUP BY establishment_id, establishmenttype, establishment_address
    ORDER BY COUNT(DISTINCT(inspection_id)) DESC
    """,
    con=conn,
)
df_query

CPU times: user 185 ms, sys: 11.8 ms, total: 196 ms
Wall time: 1.64 s


Unnamed: 0,establishment_id,establishmenttype,establishment_address,num_inspections
0,10336522,Supermarket,4466 SHEPPARD AVE E,60
1,10282501,Bakery,2300 LAWRENCE AVE E,51
2,10399527,Food Take Out,4810 SHEPPARD AVE E,48
3,9011824,Restaurant,4386 SHEPPARD AVE E,48
4,10420908,Restaurant,3601 VICTORIA PARK AVE,47
...,...,...,...,...
30285,10690642,Bake Shop,20 ST PATRICK ST,1
30286,10690660,Restaurant,549 BLOOR ST W,1
30287,10690679,Food Take Out,1175 ST CLAIR AVE W,1
30288,10690680,Food Store (Convenience / Variety),155 WELLINGTON ST W,1


We do have one or more inspections per establishment inspected. Since an establishment can be inspected multiple times, this is expected.

### Count number of infractions per inspection

To do this, it only makes sense to count the number of infractions per inspection per establishment

In [10]:
%%time
df_query = pd.read_sql(
    """
    SELECT establishment_id,
           establishmenttype,
           establishment_address,
           inspection_id,
           COUNT(*) AS number_of_infractions
    FROM inspections
    GROUP BY establishment_id, establishmenttype, establishment_address, inspection_id
    ORDER BY COUNT(*) DESC
    """,
    con=conn,
)
df_query

CPU times: user 1.97 s, sys: 70 ms, total: 2.04 s
Wall time: 3.76 s


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,number_of_infractions
0,10356286,Restaurant,4016 FINCH AVE E,103465643,90
1,10528444,Food Processing Plant,19 WATERMAN AVE,103473708,84
2,9031081,Food Take Out,200 WELLINGTON ST W,103580745,84
3,10191833,Restaurant,5594 YONGE ST,103598378,80
4,10522734,Restaurant,1686 ELLESMERE RD,103428956,80
...,...,...,...,...,...
247323,9408154,Food Court Vendor,6312 YONGE ST,102908755,1
247324,9408154,Food Court Vendor,6312 YONGE ST,102982172,1
247325,9408426,Food Caterer,195 BENTWORTH AVE,102709799,1
247326,9408426,Food Caterer,195 BENTWORTH AVE,102764842,1


As we can see, there can be one or more infractions per inspection performed at a given establishment.

The first inspection found above did indeed detect 90 infractions as shown below

In [11]:
%%time
df_query = pd.read_sql(
    """
    SELECT *
    FROM inspections
    WHERE inspection_id = 103465643
    """,
    con=conn,
)
df_query

CPU times: user 20.2 ms, sys: 652 µs, total: 20.8 ms
Wall time: 625 ms


Unnamed: 0,row_id,establishment_id,inspection_id,establishment_name,establishmenttype,establishment_address,latitude,longitude,establishment_status,minimum_inspections_peryear,infraction_details,inspection_date,severity,action,court_outcome,amount_fined
0,51235,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to clean washroom fixtures,2015-04-15,S - Significant,Notice to Comply,,
1,51236,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to ensure food is not contaminat...,2015-04-15,C - Crucial,Notice to Comply,,
2,51237,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to maintain hazardous foods at 6...,2015-04-15,C - Crucial,Notice to Comply,,
3,51238,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to properly maintain equipment,2015-04-15,S - Significant,Notice to Comply,,
4,51239,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to properly maintain equipment(N...,2015-04-15,M - Minor,Notice to Comply,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,35175,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to provide hand washing supplies,2015-04-15,S - Significant,Notice to Comply,,
86,35176,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to provide proper equipment,2015-04-15,M - Minor,Notice to Comply,,
87,35177,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to provide proper garbage contai...,2015-04-15,M - Minor,Notice to Comply,,
88,35178,10356286,103465643,MILLIKEN BAR RESTAURANT,Restaurant,4016 FINCH AVE E,,,Conditional Pass,3,Operator fail to use proper procedure(s) to en...,2015-04-15,S - Significant,Notice to Comply,,


### Get Data with a Missing Address

In [12]:
%%time
df_miss_address = pd.read_sql(
    """
    SELECT COUNT(*) AS num_missing_addresses
    FROM inspections
    WHERE establishment_address IS NULL
    """,
    con=conn,
)
df_miss_address.head()

CPU times: user 77 µs, sys: 2.59 ms, total: 2.67 ms
Wall time: 392 ms


Unnamed: 0,num_missing_addresses
0,0


There are no inspections where the establishment address is missing.

### Get Data with a Missing Latitude or Longitude

Get all inspections missing either a latitude or longitude
- in the next notebook, we will geocode these locations so that we can (later) determine the neighbourhood for each establishment and get supplementary datasets that provide metadata for each neighbourhood

In [13]:
%%time
df_miss_lat_lon = pd.read_sql(
    """
    SELECT *
    FROM inspections
    WHERE latitude IS NULL
    OR longitude IS NULL
    """,
    con=conn,
)
df_miss_lat_lon.head()

CPU times: user 11.3 s, sys: 370 ms, total: 11.6 s
Wall time: 11.6 s


Unnamed: 0,row_id,establishment_id,inspection_id,establishment_name,establishmenttype,establishment_address,latitude,longitude,establishment_status,minimum_inspections_peryear,infraction_details,inspection_date,severity,action,court_outcome,amount_fined
0,1,1222579,102810896,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Pass,2,,2012-08-21,,,,
1,2,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,2013-06-26,S - Significant,Corrected During Inspection,,
2,3,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,Food handler fail to wear headgear,2013-06-26,M - Minor,Notice to Comply,,
3,4,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,Operator fail to ensure food is not contaminat...,2013-06-26,C - Crucial,Notice to Comply,,
4,5,1222579,103015258,SAI-LILA KHAMAN DHOKLA HOUSE,Food Take Out,870 MARKHAM RD,,,Conditional Pass,2,Operator fail to maintain hazardous food(s) at...,2013-06-26,C - Crucial,Notice to Comply,,


There are many rows with missing values in the `latitude` or `longitude` columns for the same address. As seen earlier, there could be many infractions recorded in this dataset for a single establishment (address) on a given date.

For geocoding purposes, we only need to get each (unique) address (with missing latitude or longitude) once. We'll now write a SQL query to give this output

In [14]:
%%time
df_addr_lat_lon = pd.read_sql(
    """
    SELECT establishment_address,
           MAX(latitude) AS latitude,
           MAX(longitude) AS longitude
    FROM inspections
    WHERE latitude IS NULL
    OR longitude IS NULL
    GROUP BY establishment_address
    """,
    con=conn,
)
df_addr_lat_lon

CPU times: user 135 ms, sys: 7.93 ms, total: 143 ms
Wall time: 1.04 s


Unnamed: 0,establishment_address,latitude,longitude
0,870 MARKHAM RD,,
1,1550 JANE ST,,
2,1635 LAWRENCE AVE W,,
3,606 BROWNS LINE,,
4,500 REXDALE BLVD,,
...,...,...,...
13280,8 SEASONS DR,,
13281,453 PARLIAMENT ST,,
13282,121 HUMBER BLVD,,
13283,5298 YONGE ST,,


Check that every row in the above query has missing values in **both** the `latitude` and `longitude` columns. To do this, we'll count the number of
- establishments
- missing values in the `latitude` column
- missing values in the `longitude` column

If each establishment with a missing value in the `latitude` or `longitude` column is **also** missing a value in the `longitude` or `latitude` column, then we will know that every establishment listed in the above query is missing values in **both** the `latitude` and `longitude` columns

In [15]:
%%time
df_query = pd.read_sql(
    """
    SELECT SUM(CASE WHEN latitude IS NULL THEN 1 ELSE 0 END) AS num_miss_lat,
           SUM(CASE WHEN longitude IS NULL THEN 1 ELSE 0 END) AS num_miss_lon,
           COUNT(DISTINCT(establishment_address)) AS num_establishments
    FROM (
        SELECT establishment_address,
               MAX(latitude) AS latitude,
               MAX(longitude) AS longitude
        FROM inspections
        WHERE latitude IS NULL
        OR longitude IS NULL
        GROUP BY establishment_address
    ) AS combo
    """,
    con=conn,
)
df_query

CPU times: user 2.83 ms, sys: 266 µs, total: 3.1 ms
Wall time: 929 ms


Unnamed: 0,num_miss_lat,num_miss_lon,num_establishments
0,13285.0,13285.0,13285


Since these row counts agree with eachother, we can say that every establishment with a missing value in the `latitude` column is also missing a value in the `longitude` column, and vice-versa.

### Types of Infraction Severities

In [16]:
%%time
df_query = pd.read_sql(
    """
    SELECT severity,
           COUNT(*) AS num_infractions
    FROM inspections
    GROUP BY severity
    ORDER BY COUNT(*) DESC
    """,
    con=conn,
)
df_query

CPU times: user 3.34 ms, sys: 0 ns, total: 3.34 ms
Wall time: 831 ms


Unnamed: 0,severity,num_infractions
0,,359119
1,M - Minor,330971
2,S - Significant,223491
3,NA - Not Applicable,41503
4,C - Crucial,26715


**Observatoins**
1. Our model needs to be trained on infractions only. It will need to know whether an infraction is crucial or not (significant, minor), since this is what we are trying to predict (whether a crucial infraction is detected during an inspection or not). We are not trying to predict (ahead of time) if an inspection will produce no infractions. So, we don't need inspections with no infraction (where the severity is `NULL`).
2. We may also not need to keep inspections where the infraction severity is `NA - Not Applicable`, but we'll need to explore this first.

Show the infractions with a severity of `NA - ...`

In [17]:
%%time
df_query = pd.read_sql(
    """
    SELECT infraction_details,
           severity,
           establishment_status
    FROM inspections
    WHERE severity LIKE '%%NA -%%'
    """,
    con=conn,
)
with pd.option_context('display.max_colwidth',1000):
    display(df_query)

Unnamed: 0,infraction_details,severity,establishment_status
0,Fail to ensure the presence of the holder of a valid food handler's certificate - Municipal Code Chapter 545 Sec. G(17)(a),NA - Not Applicable,Conditional Pass
1,Fail to hold a valid food handler's certificate - Municipal Code Chapter 545 Sec. 5G(17)(b),NA - Not Applicable,Conditional Pass
2,Fail to ensure the presence of the holder of a valid food handler's certificate - Municipal Code Chapter 545 Sec. G(17)(a),NA - Not Applicable,Conditional Pass
3,Fail to ensure the presence of the holder of a valid food handler's certificate - Municipal Code Chapter 545 Sec. G(17)(a),NA - Not Applicable,Conditional Pass
4,Fail to ensure the presence of the holder of a valid food handler's certificate - Municipal Code Chapter 545 Sec. G(17)(a),NA - Not Applicable,Conditional Pass
...,...,...,...
41498,Fail to Ensure the Presence of the Holder of a Valid Food Handlers Certificate - Sec. 545- 157E(1 7)(a),NA - Not Applicable,Conditional Pass
41499,Fail to Ensure the Presence of the Holder of a Valid Food Handlers Certificate - Sec. 545- 157E(1 7)(a),NA - Not Applicable,Pass
41500,Fail to Ensure the Presence of the Holder of a Valid Food Handlers Certificate - Sec. 545- 157E(1 7)(a),NA - Not Applicable,Conditional Pass
41501,Fail to Post Licence Adjacent to Food Safety Inspection Notice - Sec. 545-157(E)(4),NA - Not Applicable,Pass


CPU times: user 246 ms, sys: 466 µs, total: 246 ms
Wall time: 565 ms


Show the assigned establishment status for infractions assigned a severity of `NA - ...`

In [18]:
%%time
df_query = pd.read_sql(
    """
    SELECT establishment_status,
           COUNT(*) AS num_rows
    FROM inspections
    WHERE severity LIKE '%%NA -%%'
    GROUP BY establishment_status
    """,
    con=conn,
)
with pd.option_context('display.max_colwidth',1000):
    display(df_query)

Unnamed: 0,establishment_status,num_rows
0,Conditional Pass,9596
1,Closed,124
2,Pass,31783


CPU times: user 8.9 ms, sys: 125 µs, total: 9.03 ms
Wall time: 541 ms


**Observations**
1. Nearly all these infractions result in a `Pass` being assigned to th establishment. However, the infraction details column does suggest that some infraction was detected by the inspector. Unfortunately, there is no valid entry (Crucial, Significant or Minor) in the `severity` column.
2. Could we map the `establishment_status` that is `Pass` or `Conditional Pass` to non-critical infractions and `Closed` to critical? If we can do this, then we would be justified in keeping these infractions (where severity is `NA - ...`); if not then we will have to drop them.

Show the infraction details for infractions with a severity of `NA - ...` that resulted in the establishment being `Closed`

In [19]:
%%time
df_query = pd.read_sql(
    """
    SELECT infraction_details,
           severity,
           establishment_status
    FROM inspections
    WHERE severity LIKE '%%NA -%%'
    AND establishment_status = 'Closed'
    """,
    con=conn,
)
with pd.option_context('display.max_colwidth',1000):
    display(df_query)

Unnamed: 0,infraction_details,severity,establishment_status
0,"Fail to , upon request by any person, produce the food safety inspection report or reports relating to the currently posted food inspection notice for such establishment - Municipal Code Chapter 545 Sec. 5G(5)",NA - Not Applicable,Closed
1,Fail to ensure the presence of the holder of a valid food handler's certificate - Municipal Code Chapter 545 Sec. G(17)(a),NA - Not Applicable,Closed
2,Fail to post the eating and drinking establishment license adjacent to the food safety inspection notice - Municipal Code Chapter 545 Sec. 5G(4),NA - Not Applicable,Closed
3,Fail to hold a valid food handler's certificate - Municipal Code Chapter 545 Sec. 5G(17)(b),NA - Not Applicable,Closed
4,Fail to ensure the presence of the holder of a valid food handler's certificate - Municipal Code Chapter 545 Sec. G(17)(a),NA - Not Applicable,Closed
...,...,...,...
119,Fail to Ensure the Presence of the Holder of a Valid Food Handler's Certificate. Muncipal Code Chapter 545-157(17)(a),NA - Not Applicable,Closed
120,Fail to Ensure the Presence of the Holder of a Valid Food Handlers Certificate - Sec. 545- 157E(1 7)(a),NA - Not Applicable,Closed
121,Fail to Post Licence Adjacent to Food Safety Inspection Notice - Sec. 545-157(E)(4),NA - Not Applicable,Closed
122,Fail to Ensure the Presence of the Holder of a Valid Food Handlers Certificate - Sec. 545- 157E(1 7)(a),NA - Not Applicable,Closed


CPU times: user 4.15 ms, sys: 3.28 ms, total: 7.43 ms
Wall time: 477 ms


**Observations**
1. These seem like valid infractions that led to the establishment being closed. Unfortunately, we don't have a valid entry in the severity column. If the establishment was strong enough to lead to closing the establishment, then why did the inspector not assign a `Crucial` severity to the infraction? There seems to be some reasoning / judgement that was used in arriving at this conclusion but that reasoning is not present in the Dinesafe dataset. Our ML model will not be able to learn from such infractions. So, we'll exclude infractions with such a severity (`NA - ...`) from the data.

Below, we show the number of infractions by severity and the assigned establishment status

In [20]:
%%time
df_query = pd.read_sql(
    """
    SELECT severity,
           establishment_status,
           COUNT(*) AS num_infractions
    FROM inspections
    GROUP BY severity, establishment_status
    ORDER BY COUNT(*) DESC
    """,
    con=conn,
)
df_query

CPU times: user 3.01 ms, sys: 0 ns, total: 3.01 ms
Wall time: 1.13 s


Unnamed: 0,severity,establishment_status,num_infractions
0,,Pass,359118
1,M - Minor,Pass,270153
2,S - Significant,Pass,125361
3,S - Significant,Conditional Pass,96556
4,M - Minor,Conditional Pass,59629
5,NA - Not Applicable,Pass,31783
6,C - Crucial,Conditional Pass,25475
7,NA - Not Applicable,Conditional Pass,9596
8,S - Significant,Closed,1574
9,C - Crucial,Closed,1197


**Observations**
1. An establishment status of `Pass` or `Conditional Pass` could be associated with a crucial severity, not just minor, significant or `N/A- ...`. Similarly, `Closed` is associated with a minor and crucial establishment status. So, we cannot map `Pass` and `Conditional Pass` to non-crucical severity and `Closed` to crucial. This means we must drop infractions where the severity contains `NA - ...`.

### Select from the Different Types of Establishments that were Inspected

The number of inspections and infractions (including non-infractions) is shown for each type of establishment below

In [21]:
%%time
df_query = pd.read_sql(
    """
    SELECT establishmenttype,
           COUNT(DISTINCT(inspection_id)) AS num_inspections,
           COUNT(infraction_details) AS num_infractions
    FROM inspections
    WHERE severity NOT LIKE '%%NA -' AND severity IS NOT NULL
    GROUP BY establishmenttype
    ORDER BY COUNT(DISTINCT(inspection_id)) DESC
    """,
    con=conn,
)
df_query

CPU times: user 0 ns, sys: 2.96 ms, total: 2.96 ms
Wall time: 1.19 s


Unnamed: 0,establishmenttype,num_inspections,num_infractions
0,Restaurant,55410,379246
1,Food Take Out,13460,82247
2,Food Store (Convenience / Variety),4009,22724
3,Food Court Vendor,3622,22896
4,Supermarket,3416,27161
5,Bakery,2621,20501
6,Butcher Shop,971,6749
7,Child Care - Food Preparation,928,3875
8,Child Care - Catered,919,3868
9,Food Caterer,864,5435


We will keep the following establishment types since they are equivalent to a restaurant or grocery store

In [22]:
establishment_types_wanted = [
    "Restaurant",
    "Food Take Out",
    "Food Store (Convenience / Variety)",  # equivalent to grocery store
    "Food Court Vendor",
    "Supermarket",  # equivalent to grocery store
    "Bakery",  # equivalent to grocery store
    # "Food Caterer",
    "Butcher Shop",  # equivalent to grocery store
    "Cafeteria - Public Access",
    # "Boarding / Lodging Home - Kitchen",
    "Cocktail Bar / Beverage Room",
    # "Food Depot",
    # "Private Club",
    "Fish Shop",  # equivalent to grocery store
    "Bake Shop",  # equivalent to grocery store
    # "Food Bank",
    "Flea Market",  # equivalent to grocery store
    "Farmer\\'s Market",  # equivalent to grocery store
    # "Bed & Breakfast",
]

The following is a string of all these wanted establishment types joined together, so that it can be used as a SQL filter (in the `WHERE` clause)

In [23]:
establishment_types_wanted_str = "('" + "', '".join(establishment_types_wanted) + "')"
print(establishment_types_wanted_str)

('Restaurant', 'Food Take Out', 'Food Store (Convenience / Variety)', 'Food Court Vendor', 'Supermarket', 'Bakery', 'Butcher Shop', 'Cafeteria - Public Access', 'Cocktail Bar / Beverage Room', 'Fish Shop', 'Bake Shop', 'Flea Market', 'Farmer\'s Market')


**Observations**
1. The following are a subset of the non-retail food establishments that were inspected
   - private / non-public
     - boarding / lodging home - kitchen
     - Private Club
     - Bed & Breakfast
   - niche (similar to school cafeteria)
     - food depot
     - food bank

   and these are distinct from restaurants and grocery stores, so they are excluded from analysis here.
2. Schools, private establishment (private club, etc.) and Hospitals do not present a risk that can be generalized across the population of the city like restaurants and grocery stores. Also, these two types of establishments follow a different inspection and planning protocol.

## Removing Invalid Data

### Aggregate Data to get one inspection per row

Count the number of each type of infraction recorded during a single inspection. Also, combine all the (text) details of each infraction (`infraction_details`) into an `infractions_summary` column.

In [24]:
%%time
df_query = pd.read_sql(
    f"""
    SELECT establishment_id,
           establishmenttype,
           establishment_address,
           inspection_date,
           inspection_id,
           GROUP_CONCAT(infraction_details SEPARATOR '. ') AS infractions_summary,
           CAST(SUM(CASE WHEN severity LIKE "%%S - Significant" THEN 1 ELSE 0 END) AS SIGNED) AS num_significant,
           CAST(SUM(CASE WHEN severity LIKE "%%C - Crucial" THEN 1 ELSE 0 END) AS SIGNED) AS num_crucial,
           CAST(SUM(CASE WHEN severity LIKE "%%M - Minor" THEN 1 ELSE 0 END) AS SIGNED) AS num_minor,
           COUNT(infraction_details) AS num_infractions
    FROM inspections
    WHERE establishmenttype IN {establishment_types_wanted_str}
    AND severity NOT LIKE '%%NA -' AND severity IS NOT NULL
    GROUP BY establishment_id, establishmenttype, establishment_address, inspection_date, inspection_id
    ORDER BY establishment_id, establishmenttype, establishment_address, inspection_date, inspection_id
    """,
    con=conn,
)
df_query

CPU times: user 1.18 s, sys: 19.8 ms, total: 1.2 s
Wall time: 2.61 s


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_date,inspection_id,infractions_summary,num_significant,num_crucial,num_minor,num_infractions
0,1222579,Food Take Out,870 MARKHAM RD,2013-06-26,103015258,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,4,4,8,16
1,1222579,Food Take Out,870 MARKHAM RD,2013-12-20,103133558,Food handler fail to wear headgear. Operator f...,0,0,6,6
2,1222579,Food Take Out,870 MARKHAM RD,2014-09-09,103329697,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,15
3,1222579,Food Take Out,870 MARKHAM RD,2015-01-08,103420091,Operator fail to properly wash equipment. Oper...,3,0,6,9
4,1222579,Food Take Out,870 MARKHAM RD,2016-12-21,103868579,Operator fail to properly wash equipment,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...
85045,10690581,Restaurant,3560 VICTORIA PARK AVE,2019-10-22,104594294,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,0,3,3
85046,10690642,Bake Shop,20 ST PATRICK ST,2019-10-23,104594681,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,1
85047,10690660,Restaurant,549 BLOOR ST W,2019-10-23,104594800,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1,0,1,2
85048,10690679,Food Take Out,1175 ST CLAIR AVE W,2019-10-23,104594954,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,1


**Notes**
1. We are excluding unwanted establishments and infractions where the severity is not valid. We discussed both of these choices earlier in the **Select from the Different Types of Establishments that were Inspected** and **Types of Infraction Severities** sub-sections respectively.

### Remove Inspections that took more than one day to complete

A single inspection should be completed on one day. It should not be spread out over more than one day. From the above aggregated result, get inspections (`inspection_id`s) that took more than one day to complete
- group by establishment and `inspection_id` and count the number of unique dates

In [25]:
df_query.groupby(
    ["establishment_id", "establishmenttype", "establishment_address", "inspection_id"],
    as_index=False,
)["inspection_date"].nunique().query("inspection_date > 1").sort_values(
    by=["inspection_date"], ascending=False
)

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date
6763,9011956,Restaurant,57 OSSINGTON AVE,103455536,2
12026,9043539,Restaurant,125 OSSINGTON AVE,103353246,2
19909,10223866,Bakery,812 COLLEGE ST,103430882,2
24622,10287338,Restaurant,165 EAST LIBERTY ST,103553046,2
49525,10451955,Food Take Out,3863 LAWRENCE AVE E,103029566,2
51048,10457935,Restaurant,885 PROGRESS AVE,103452615,2
56600,10482456,Restaurant,796 COLLEGE ST,103436353,2
66899,10526821,Restaurant,453 QUEEN ST W,103522898,2
73659,10564048,Restaurant,120 CUMBERLAND ST,103736038,2
74485,10568777,Restaurant,4 COLLIER ST,103769871,2


These inspections wlill need to be removed from the data. The reason for this occurrence is one of the followig
- the inspector needed to go back for a re-inspection
- the date was incorrectly entered
- unknown

IT is reassuring that there are a small number of such inspections. Since we don't know the exact reason for this occurrence, we will remove all such inspections (that occurred on more than one day) from the data.

So, next, we will query this result to only get inspections that were completed on one day. We will do this using the establishment and inspection columns only, ignoring the counts and text column from earlier

In [26]:
%%time
df_query_no_multi_day_inspections = df_query.groupby(
    ["establishment_id", "establishmenttype", "establishment_address", "inspection_id"],
    as_index=False,
)["inspection_date"].nunique().query("inspection_date == 1").sort_values(
    by=["inspection_date"], ascending=False
)
df_query_no_multi_day_inspections

CPU times: user 97.9 ms, sys: 145 µs, total: 98 ms
Wall time: 97.2 ms


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date
0,1222579,Food Take Out,870 MARKHAM RD,103015258,1
56699,10482577,Food Take Out,900 DUFFERIN ST,103218505,1
56697,10482559,Cafeteria - Public Access,640 LAWRENCE AVE W,103474423,1
56696,10482559,Cafeteria - Public Access,640 LAWRENCE AVE W,103215927,1
56695,10482559,Cafeteria - Public Access,640 LAWRENCE AVE W,103118069,1
...,...,...,...,...,...
28345,10327062,Restaurant,6180 YONGE ST,103608234,1
28344,10327062,Restaurant,6180 YONGE ST,103474003,1
28343,10327062,Restaurant,6180 YONGE ST,103347778,1
28342,10327062,Restaurant,6180 YONGE ST,103175034,1


We'll now merge this result with the aggregated data from the SQL query in order to get all columns (including text and counts) for inspections that were completed on a single date

In [27]:
%%time
df = df_query_no_multi_day_inspections.drop(columns=["inspection_date"]).merge(df_query, on=["establishment_id", "establishmenttype", "establishment_address", "inspection_id"])
df["inspection_date"] = pd.to_datetime(df["inspection_date"])
df = df.sort_values(by=["establishment_id", "establishmenttype", "establishment_address", "inspection_id", "inspection_date"])
df

CPU times: user 114 ms, sys: 8.1 ms, total: 122 ms
Wall time: 121 ms


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions
0,1222579,Food Take Out,870 MARKHAM RD,103015258,2013-06-26,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,4,4,8,16
10629,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Food handler fail to wear headgear. Operator f...,0,0,6,6
56692,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,15
56691,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Operator fail to properly wash equipment. Oper...,3,0,6,9
56690,1222579,Food Take Out,870 MARKHAM RD,103868579,2016-12-21,Operator fail to properly wash equipment,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...
28343,10690581,Restaurant,3560 VICTORIA PARK AVE,104594294,2019-10-22,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,0,3,3
28342,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,1
28341,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1,0,1,2
28340,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,1


As a sanity check, we will now count how many `inspection_id`s and `inspection_date`s occurred for each establishment. This will give us the number of inspections per establishment. We should get the same number of inspections if we count `inspection_id` or `inspection_date`. The result of this aggregation is shown below

In [28]:
recomp = (
    df.groupby(
        ["establishment_id", "establishmenttype", "establishment_address"],
        as_index=False,
    )
    .agg({"inspection_id": "count", "inspection_date": "count"})
    .rename(
        columns={
            "inspection_id": "num_inspection_ids",
            "inspection_date": "num_inspection_dates",
        }
    )
)
display(recomp.head())

Unnamed: 0,establishment_id,establishmenttype,establishment_address,num_inspection_ids,num_inspection_dates
0,1222579,Food Take Out,870 MARKHAM RD,9,9
1,1222807,Restaurant,1635 LAWRENCE AVE W,5,5
2,9000002,Food Take Out,361 OAKWOOD AVE,1,1
3,9000004,Food Take Out,1788 JANE ST,7,7
4,9000026,Food Take Out,2372 EGLINTON AVE E,8,8


As we can see below, the number of inspections per establishment is equal when calculated using `inspection_id` or `inspection_date`

In [29]:
assert recomp[recomp["num_inspection_ids"] != recomp["num_inspection_dates"]].empty

We can also verify that the sum (total) of the inspections adds up to the number of rows in the aggregated data (after removing the inspections that are spread across multiple dates), and this is shown below

In [31]:
assert recomp["num_inspection_ids"].sum() == len(df)
assert recomp["num_inspection_dates"].sum() == len(df)

### Remove Re-Inspections

Next, we need to eliminate inspectoins that occurred within two days since these correspond to re-inspections (as mentioned earlier). Now, a single `inspection_id` corresponds to a single `inspection_date`. Starting with the above result, we will group by establishment and calculate difference in days between successive `inspection_date`s. This will give us the time gap between successive inspections of a single establishment. This time gap needs to be more than two days to avoid including re-inspections.

Get the time gap between successive inspections of a single establishment

In [33]:
%%time
df["days_to_next"] = (
    df.groupby(
        [
            "establishment_id",
            "establishmenttype",
            "establishment_address",
        ],
    )["inspection_date"]
    .diff(-1)
    .dt.days.abs()
)
df

CPU times: user 3.15 s, sys: 13.5 ms, total: 3.16 s
Wall time: 3.12 s


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,days_to_next
0,1222579,Food Take Out,870 MARKHAM RD,103015258,2013-06-26,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,4,4,8,16,177.0
10629,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Food handler fail to wear headgear. Operator f...,0,0,6,6,263.0
56692,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,15,121.0
56691,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Operator fail to properly wash equipment. Oper...,3,0,6,9,713.0
56690,1222579,Food Take Out,870 MARKHAM RD,103868579,2016-12-21,Operator fail to properly wash equipment,0,0,1,1,546.0
...,...,...,...,...,...,...,...,...,...,...,...
28343,10690581,Restaurant,3560 VICTORIA PARK AVE,104594294,2019-10-22,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,0,3,3,
28342,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,1,
28341,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1,0,1,2,
28340,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,1,


Show how many establishments were re-inspected (for this, the time gap will be two days or less)

In [34]:
df.query("days_to_next <= 2")

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,days_to_next
56769,9000200,Restaurant,2300 YONGE ST,103788406,2016-08-03,Operator fail to properly maintain rooms. Oper...,4,0,8,12,2.0
56514,9000361,Food Take Out,2847 LAWRENCE AVE E,103846326,2016-11-08,Operator fail to properly wash equipment. Oper...,18,0,9,30,2.0
56592,9000596,Bakery,3772 BATHURST ST,103675610,2016-02-22,Operator fail to clean washroom fixtures. Oper...,3,0,9,12,2.0
57015,9000652,Bakery,2394 BLOOR ST W,103977217,2017-05-29,Operator fail to properly maintain rooms. Stor...,0,1,1,2,1.0
56864,9000963,Restaurant,33 SAMOR RD,104055201,2017-11-13,Operator fail to properly wash equipment. STOR...,3,0,3,6,0.0
...,...,...,...,...,...,...,...,...,...,...,...
28115,10667988,Bakery,248 EDDYSTONE AVE,104452960,2019-04-16,Fail to Ensure the Presence of the Holder of a...,4,0,2,8,2.0
27994,10676175,Restaurant,688 COLLEGE ST,104502534,2019-06-20,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,2,2,0,4,1.0
28476,10677946,Restaurant,729 BLOOR ST W,104517863,2019-07-15,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,1,1,2,2.0
28472,10677946,Restaurant,729 BLOOR ST W,104571577,2019-09-23,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,1,3,5,1.0


We need to remove these re-inspections from the aggregated data from the end of the previous section.

There are some establishments that might have
- closed permanently after one inspection
- just openend, so have only been inspected once

and these appear with a missing value in the `days_to_next` column. These establishments' inspections can be kept in the data, so below we will remove re-inspections (`days_to_next` < 2) and keep establishments with a single inspection (`days_to_next` has a missing value)

In [35]:
df = df.query("days_to_next > 2 | days_to_next.isna()").reset_index(drop=True).copy()
df

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,days_to_next
0,1222579,Food Take Out,870 MARKHAM RD,103015258,2013-06-26,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,4,4,8,16,177.0
1,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Food handler fail to wear headgear. Operator f...,0,0,6,6,263.0
2,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,15,121.0
3,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Operator fail to properly wash equipment. Oper...,3,0,6,9,713.0
4,1222579,Food Take Out,870 MARKHAM RD,103868579,2016-12-21,Operator fail to properly wash equipment,0,0,1,1,546.0
...,...,...,...,...,...,...,...,...,...,...,...
83697,10690581,Restaurant,3560 VICTORIA PARK AVE,104594294,2019-10-22,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,0,3,3,
83698,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,1,
83699,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1,0,1,2,
83700,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,1,


With this, our aggregation (transformation) of the raw infractions data is completed. We now have a single row per inspection (at a single establishment on a single date).

## Create the Class Labels column

We'll create a binary column to detect if a crucial infraction was detected or not

In [36]:
df["is_crucial"] = (df["num_crucial"] > 0).astype(int)
df

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,infractions_summary,num_significant,num_crucial,num_minor,num_infractions,days_to_next,is_crucial
0,1222579,Food Take Out,870 MARKHAM RD,103015258,2013-06-26,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,4,4,8,16,177.0,1
1,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Food handler fail to wear headgear. Operator f...,0,0,6,6,263.0,0
2,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,15,121.0,0
3,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Operator fail to properly wash equipment. Oper...,3,0,6,9,713.0,0
4,1222579,Food Take Out,870 MARKHAM RD,103868579,2016-12-21,Operator fail to properly wash equipment,0,0,1,1,546.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
83697,10690581,Restaurant,3560 VICTORIA PARK AVE,104594294,2019-10-22,FAIL TO ENSURE EQUIPMENT SURFACE SANITIZED AS ...,0,0,3,3,,0
83698,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,1,,0
83699,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1,0,1,2,,0
83700,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,1,,0


The class-imbalance is shown below (although we will not be interpreting this until we have split the data for ML experiments)

In [37]:
display(
    df["is_crucial"]
    .value_counts(normalize=True)
    .rename("fraction")
    .to_frame()
    .merge(
        df["is_crucial"].value_counts().rename("num_inspections").to_frame(),
        left_index=True,
        right_index=True,
        how="inner",
    )
)

Unnamed: 0,fraction,num_inspections
0,0.910767,76233
1,0.089233,7469


## Export Transformed Data to CSV

We'll now export this transformed data (aggregated by inspection and filtered to remove (a) inspections that took more than one day to complete, (b) re-inspections, (c) unwanted establishment types and (d) inspections with an invalid severity) to a CSV file which can be loaded into Python for further processing

In [38]:
%%time
time_now  = datetime.now().strftime('%Y%m%d_%H%M%S')
df.drop(columns=["days_to_next"]).to_csv(f"data/processed/filtered_transformed_data__{time_now}.csv", index=False)

CPU times: user 801 ms, sys: 7.8 ms, total: 809 ms
Wall time: 807 ms


## Disconnect from the MySQL Database

Close database connection and dispose the SQLAlchemy engine

In [39]:
conn.close()
engine.dispose()