# Feature Engineering

In [2]:
import os
from datetime import datetime
from glob import glob

import numpy as np
import pandas as pd

In [3]:
%aimport src.utils
from src.utils import summarize_df

## About

This notebook engineers ML features from existing columns in the transformed data produced by the previous notebook.

## Load Transformed Data

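We'll load the most recent processed data file (inspection records joined with neighbourhood and crime attributes) and sort it by establishment and inspection date, since the per-establishment features below rely on chronological row order.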

In [4]:
%%time
df = pd.read_csv(
    # glob() returns files in arbitrary order, so sort before picking the latest
    sorted(glob("data/processed/processed__*.csv"))[-1],
    parse_dates=["inspection_date"],
).sort_values(
    by=[
        "establishment_id",
        "establishmenttype",
        "establishment_address",
        "inspection_date",
    ],
    ignore_index=True,
)
df = df.rename(columns={"num_null": "action_null", "num_null.1": "court_outcome_null"})
with pd.option_context("display.max_columns", 1000):
    display(df.head(2))
summarize_df(df)

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,establishment_status,infractions_summary,num_significant,num_crucial,num_minor,num_na,num_infractions,action_null,num_corrected_during_inspection,num_notice_to_comply,num_ticket,num_summons,num_summons_and_health_hazard_order,num_closure_order,num_not_in_compliance,num_order,num_education_provided,num_warning_letter,num_recommendations,num_prohibition_order_requested,court_outcome_null,num_conviction_fined,num_pending,num_charges_withdrawn,num_conviction_suspended_sentence,num_conviction_ordered_to_close_by_court,num_charges_dismissed,num_charges_quashed,num_conviction_probationary_order,num_cancelled,num_conviction_fined_order_to_close_by_court,is_infraction,latitude,longitude,AREA_NAME,Shape__Area,neigh_shape_area,neigh_shape_length,neigh_classification,neigh_classification_code,neigh_establishments_inspected,neigh_pop,pop_census_year,neigh_Assault,neigh_Auto Theft,neigh_Break and Enter,neigh_Robbery,neigh_Theft Over
0,1222579,Food Take Out,870 MARKHAM RD,102810896,2012-08-21,Pass,,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,43.768,-79.229,Woburn (137),23664990.0,23664990.0,25089.815423,Neighbourhood Improvement Area,NIA,358.0,,2006.0,,,,,
1,1222579,Food Take Out,870 MARKHAM RD,103015259,2013-06-27,Pass,,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,43.768,-79.229,Woburn (137),23664990.0,23664990.0,25089.815423,Neighbourhood Improvement Area,NIA,358.0,53350.0,2011.0,,,,,


Unnamed: 0,dtype,num_missing,num,nunique,single_non_nan_value
establishment_id,int64,0,205753,27860,10512302
establishmenttype,object,0,205753,56,Restaurant
establishment_address,object,0,205753,13314,879 WILSON AVE
inspection_id,int64,0,205753,205270,103465395
inspection_date,datetime64[ns],0,205753,2481,2015-04-14 00:00:00
establishment_status,object,0,205753,3,Conditional Pass
infractions_summary,object,121539,205753,30717,Operator fail to provide required sinks. Opera...
num_significant,float64,0,205753,35,6.0
num_crucial,float64,0,205753,15,0.0
num_minor,float64,0,205753,25,0.0


CPU times: user 967 ms, sys: 87.2 ms, total: 1.05 s
Wall time: 1.05 s


## Feature Engineering

### Get the Number of Days since the last Inspection Per Establishment

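Group by each establishment (`establishment_id`, `establishmenttype` and `establishment_address`) and call `.diff().dt.days` on the `inspection_date` column to get the number of days between successive rows (successive inspections). The first inspection of each establishment has no predecessor, so its value is missing (`NaN`). Despite its name, the resulting `time_since_last_infrac` column measures days since the previous inspection, not since the previous infraction.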

In [5]:
%%time
df["time_since_last_infrac"] = df.groupby(
    ["establishment_id", "establishmenttype", "establishment_address"]
)["inspection_date"].diff(1).dt.days

CPU times: user 4.72 s, sys: 79 ms, total: 4.8 s
Wall time: 4.67 s
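As a minimal sketch of the grouped `.diff()` above, the toy frame below uses hypothetical dates and groups by `establishment_id` alone for brevity; the `days_since_prev` column name is also an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data, already sorted by establishment and inspection date
toy = pd.DataFrame(
    {
        "establishment_id": [1, 1, 1, 2, 2],
        "inspection_date": pd.to_datetime(
            ["2019-01-01", "2019-01-11", "2019-03-12", "2019-05-01", "2019-05-31"]
        ),
    }
)

# Days between successive inspections within each establishment;
# the first inspection of each establishment has no predecessor (NaN)
toy["days_since_prev"] = (
    toy.groupby("establishment_id")["inspection_date"].diff().dt.days
)
print(toy)
```

The difference never crosses establishment boundaries, which is why the rows must be sorted by date within each group before calling `.diff()`.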


### Did the Establishment Pass the Last Inspection?

Show the unique `establishment_status` values

In [6]:
df["establishment_status"].unique().tolist()

['Pass', 'Conditional Pass', 'Closed']

Group by establishment and call `.shift()` on the `establishment_status` column to align the previous and current values of the establishment status. Then check if the previous inspection resulted in `Pass`. Repeat for `Conditional Pass`.

In [7]:
%%time
df["last_status"] = df.groupby(
    [
        "establishment_id",
        "establishmenttype",
        "establishment_address",
    ]
)["establishment_status"].shift()
# Check if last inspection was assigned a Pass
df["last_pass"] = df["last_status"] == "Pass"
# Check if last inspection was assigned a Conditional Pass
df["last_cond_pass"] = df["last_status"] == "Conditional Pass"
# Drop unwanted previous establishment_status column
df = df.drop(columns=["last_status"])

CPU times: user 87.4 ms, sys: 7.75 ms, total: 95.2 ms
Wall time: 94.8 ms
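The sketch below illustrates the shift-then-compare pattern on hypothetical data, grouping by `establishment_id` alone for brevity. It also shows that the first inspection of each establishment, which has no previous status, ends up as `False` in both flags.

```python
import pandas as pd

# Hypothetical toy data, sorted by establishment and inspection date
toy = pd.DataFrame(
    {
        "establishment_id": [1, 1, 1, 2],
        "establishment_status": ["Pass", "Conditional Pass", "Pass", "Closed"],
    }
)

# Status of the previous inspection, placed on the current inspection's row
prev_status = toy.groupby("establishment_id")["establishment_status"].shift()

# A missing previous status compares unequal to any label, giving False
toy["last_pass"] = prev_status == "Pass"
toy["last_cond_pass"] = prev_status == "Conditional Pass"
print(toy)
```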


### Count the number of each type of infraction in the Last Inspection

Group by each establishment and call `.shift()` on each of the infraction columns to align the previous and current inspections.
- Since the number of infractions is logged for each inspection, `shift`ing places the previous inspection's infraction counts on the row of the current inspection.

In [8]:
df_num_infrac_last = df.groupby(
    [
        "establishment_id",
        "establishmenttype",
        "establishment_address",
    ]
)[["num_minor", "num_significant", "num_crucial"]].shift()
df_num_infrac_last.columns = [
    f"num_{infrac_type}_prev" for infrac_type in ["minor", "significant", "crucial"]
]
df = pd.concat([df, df_num_infrac_last], axis=1)

### Count the number of each type of action in the Last Inspection

Group by each establishment and call `.shift()` on each of the `action` columns to align the previous and current inspections.
- Since the number of actions is logged for each inspection, `shift`ing places the previous inspection's action counts on the row of the current inspection.

In [9]:
action_types = [
    "action_null",
    "num_corrected_during_inspection",
    "num_notice_to_comply",
    "num_ticket",
    "num_summons",
    "num_summons_and_health_hazard_order",
    "num_closure_order",
    "num_not_in_compliance",
    "num_order",
    "num_education_provided",
    "num_warning_letter",
    "num_recommendations",
    "num_prohibition_order_requested",
]

df_num_action_last = df.groupby(
    [
        "establishment_id",
        "establishmenttype",
        "establishment_address",
    ]
)[action_types].shift()
df_num_action_last.columns = [
    f"num_{action_type.replace('num_', 'action_')}_prev" for action_type in action_types
]
df = pd.concat([df, df_num_action_last], axis=1)
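The renaming comprehension above produces column names that are easy to misread, so here is a small worked example applying the same expression to three of the action columns.

```python
# Three of the action columns from the list above
action_types = ["action_null", "num_ticket", "num_closure_order"]

# Same naming expression as in the cell above: 'num_' becomes 'action_',
# then the result is wrapped in a 'num_' prefix and a '_prev' suffix
prev_names = [f"num_{a.replace('num_', 'action_')}_prev" for a in action_types]
print(prev_names)
# ['num_action_null_prev', 'num_action_ticket_prev', 'num_action_closure_order_prev']
```

Note that `action_null` contains no `num_` to replace, so it only gains the prefix and suffix.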

### Count the number of each type of court outcome in the Last Inspection

Group by each establishment and call `.shift()` on each of the `court_outcome` columns to align the previous and current inspections.
- Since the number of court outcomes is logged for each inspection, `shift`ing places the previous inspection's court outcome counts on the row of the current inspection.

In [10]:
court_outcome_types = [
    "court_outcome_null",
    "num_conviction_fined",
    "num_pending",
    "num_charges_withdrawn",
    "num_conviction_suspended_sentence",
    "num_conviction_ordered_to_close_by_court",
    "num_charges_dismissed",
    "num_charges_quashed",
    "num_conviction_probationary_order",
    "num_cancelled",
    "num_conviction_fined_order_to_close_by_court",
]

df_num_court_outcome_last = df.groupby(
    [
        "establishment_id",
        "establishmenttype",
        "establishment_address",
    ]
)[court_outcome_types].shift()
df_num_court_outcome_last.columns = [
    f"num_{court_outcome_type.replace('num_', 'court_outcome_')}_prev"
    for court_outcome_type in court_outcome_types
]
df = pd.concat([df, df_num_court_outcome_last], axis=1)

### Get number of cumulative failures per establishment

A failure is when the establishment status is *Closed*.

Create an `is_fail` column indicating whether an inspection resulted in a status of *Closed*. Then, group by each establishment and call `.cumsum()` on the `is_fail` column to get the running total of the number of failures. Note that `.cumsum()` includes the current inspection in the running total.

In [11]:
df["is_fail"] = df["establishment_status"] == "Closed"
df["cumulative_failures"] = df.groupby(
    [
        "establishment_id",
        "establishmenttype",
        "establishment_address",
    ]
)["is_fail"].cumsum()
# Drop unwanted previous is_fail column
df = df.drop(columns=["is_fail"])
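The sketch below, on hypothetical data grouped by `establishment_id` alone, shows that the grouped `.cumsum()` counts the current inspection. The `prior_failures` column is a hypothetical variant, not used in this notebook, that counts only failures strictly before the current inspection; it may be of interest if the current status is the prediction target.

```python
import pandas as pd

# Hypothetical toy data, sorted by establishment and inspection date
toy = pd.DataFrame(
    {
        "establishment_id": [1, 1, 1, 2, 2],
        "establishment_status": ["Pass", "Closed", "Pass", "Closed", "Closed"],
    }
)
toy["is_fail"] = toy["establishment_status"] == "Closed"

# Running total up to and including the current inspection (as in the notebook)
toy["cumulative_failures"] = toy.groupby("establishment_id")["is_fail"].cumsum()

# Hypothetical lagged variant: subtract the current row's own contribution
toy["prior_failures"] = toy["cumulative_failures"] - toy["is_fail"].astype(int)
print(toy)
```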

### Get number of cumulative infractions Per Establishment

This is a running total that never decreases over time.

Similar to the above, group by each establishment and call `.cumsum()` on the per-inspection count of each type of infraction to get its running total. As with failures, the total includes the current inspection.

In [12]:
%%time
for infrac_type in ["minor", "significant", "crucial"]:
    df[f"cumulative_{infrac_type}"] = df.groupby(
        [
            "establishment_id",
            "establishmenttype",
            "establishment_address",
        ]
    )[f"num_{infrac_type}"].cumsum()

CPU times: user 103 ms, sys: 0 ns, total: 103 ms
Wall time: 102 ms


### Get number of cumulative actions and cumulative court outcomes Per Establishment

Similar to the above, group by each establishment and call `.cumsum()` on the previous-inspection (`_prev`) count of each type of action to get its running total. Repeat for court outcomes. Because the shifted `_prev` columns are summed, these totals exclude the current inspection.

In [13]:
%%time
for action_type in df_num_action_last.columns:
    df[f"cumulative_{action_type}"] = df.groupby(
        [
            "establishment_id",
            "establishmenttype",
            "establishment_address",
        ]
    )[action_type].cumsum()

CPU times: user 447 ms, sys: 0 ns, total: 447 ms
Wall time: 447 ms


In [14]:
%%time
for court_outcome_type in df_num_court_outcome_last.columns:
    df[f"cumulative_{court_outcome_type}"] = df.groupby(
        [
            "establishment_id",
            "establishmenttype",
            "establishment_address",
        ]
    )[court_outcome_type].cumsum()

CPU times: user 380 ms, sys: 0 ns, total: 380 ms
Wall time: 379 ms


### Check if establishment has ever failed a Previous Inspection

The running total of failures (from the cumulative failures sub-section above) covers every inspection up to and including the current one, so check if it is greater than zero

In [15]:
df["ever_failed"] = df["cumulative_failures"] != 0

### Check if establishment has ever had a Previous Infraction

Similar to the above, check if the cumulative sum of the number of each type of infraction (over all inspections to date, including the current one) is greater than zero

In [16]:
for infrac_type in ["minor", "significant", "crucial"]:
    df[f"ever_{infrac_type}"] = df[f"cumulative_{infrac_type}"] > 0

### Get cumulative inspections Per Establishment

Group by each establishment and call `.cumcount()` to number each establishment's inspections in chronological order. The numbering starts at zero, so the value is the number of inspections that took place before the current one

In [17]:
%%time
df["cumulative_inspections"] = df.groupby(
    [
        "establishment_id",
        "establishmenttype",
        "establishment_address",
    ]
)["inspection_date"].cumcount()

CPU times: user 37 ms, sys: 51 µs, total: 37.1 ms
Wall time: 36.8 ms
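A minimal sketch of the zero-based numbering, using hypothetical data grouped by `establishment_id` alone:

```python
import pandas as pd

# Hypothetical toy data, sorted by establishment and inspection date
toy = pd.DataFrame(
    {
        "establishment_id": [1, 1, 1, 2],
        "inspection_date": pd.to_datetime(
            ["2018-03-01", "2018-09-01", "2019-02-01", "2019-03-01"]
        ),
    }
)

# cumcount() numbers the rows of each group starting from 0, so the first
# inspection of every establishment gets 0 (no earlier inspections)
toy["cumulative_inspections"] = toy.groupby("establishment_id")[
    "inspection_date"
].cumcount()
print(toy)
```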


### Get Ratio of Past Number of Failures to Inspections Per Establishment

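Take the ratio of the cumulative failures to the cumulative number of inspections. For the first inspection of each establishment the denominator is zero, so the ratio is `NaN` (no failure so far) or `inf` (the first inspection was a failure)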

In [18]:
df["proportion_past_failures"] = (
    df["cumulative_failures"] / df["cumulative_inspections"]
)
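Since `cumulative_inspections` is zero for an establishment's first inspection, the division can produce `NaN` or `inf`. The sketch below reproduces this on hypothetical values and shows one possible way to keep `inf` out of the feature, by treating a zero denominator as missing. This is an assumption for illustration, not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two first inspections (0 prior inspections) and one later one
toy = pd.DataFrame(
    {
        "cumulative_failures": [0, 1, 1],
        "cumulative_inspections": [0, 0, 2],
    }
)

# Plain division: 0/0 gives NaN, 1/0 gives inf
raw = toy["cumulative_failures"] / toy["cumulative_inspections"]

# Treat a zero denominator as missing so the ratio is NaN instead of inf
safe = toy["cumulative_failures"] / toy["cumulative_inspections"].replace(0, np.nan)
print(pd.DataFrame({"raw": raw, "safe": safe}))
```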

### Get Ratio of Past Number of Infractions to Inspections Per Establishment

Take the ratio of the cumulative number of each type of infraction to the cumulative number of inspections

In [19]:
for infrac_type in ["minor", "significant", "crucial"]:
    df[f"proportion_past_{infrac_type}"] = (
        df[f"cumulative_{infrac_type}"] / df["cumulative_inspections"]
    )

### Get Number of days since last inspection Per Establishment

Group by each establishment and shift the `inspection_date` down by one row to align the current and previous inspections. Then calculate the number of days between the current and previous inspection dates

In [20]:
df["last_inspection_date"] = df.groupby(
    [
        "establishment_id",
        "establishmenttype",
        "establishment_address",
    ]
)["inspection_date"].shift()
# Current date minus previous date, so the number of days is non-negative
df["days_since_last_inspection"] = (
    df["inspection_date"] - df["last_inspection_date"]
).dt.days
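As a quick check on hypothetical data, subtracting the shifted (previous) date from the current date gives the same non-negative day counts as a grouped `.diff()`. The variable names below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data, sorted by establishment and inspection date
toy = pd.DataFrame(
    {
        "establishment_id": [1, 1, 1, 2, 2],
        "inspection_date": pd.to_datetime(
            ["2019-01-01", "2019-01-11", "2019-03-12", "2019-05-01", "2019-05-31"]
        ),
    }
)
grouped = toy.groupby("establishment_id")["inspection_date"]

# Current inspection date minus the previous one within each establishment
via_shift = (toy["inspection_date"] - grouped.shift()).dt.days

# The same quantity computed directly with diff()
via_diff = grouped.diff().dt.days

print(via_shift.equals(via_diff))
```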

### Get Datetime Attributes of Last Inspection

For the previous inspection (from the above sub-section) get `datetime` attributes

In [21]:
df["last_inspection_month"] = df["last_inspection_date"].dt.month
df["last_inspection_weekday"] = df["last_inspection_date"].dt.weekday
df["last_inspection_weekofyear"] = df["last_inspection_date"].dt.isocalendar().week
df["last_inspection_quarter"] = df["last_inspection_date"].dt.quarter
df["last_inspection_year"] = df["last_inspection_date"].dt.year

### Get Datetime Attributes of Current Inspection

Extract the same `datetime` attributes from the current `inspection_date`

In [22]:
df["inspection_month"] = df["inspection_date"].dt.month
df["inspection_weekday"] = df["inspection_date"].dt.weekday
df["inspection_weekofyear"] = df["inspection_date"].dt.isocalendar().week
df["inspection_quarter"] = df["inspection_date"].dt.quarter
df["inspection_year"] = df["inspection_date"].dt.year
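The sketch below extracts the same attributes for the first inspection date in the data (2012-08-21) plus a missing date. It also illustrates why, in the output further down, the week-of-year column keeps integer values alongside missing entries while the other last-inspection attributes become floats: `.dt.isocalendar().week` returns a nullable `UInt32` column.

```python
import pandas as pd

# One real date from the first row of the data and one missing date
dates = pd.Series(pd.to_datetime(["2012-08-21", None]))

attrs = pd.DataFrame(
    {
        "month": dates.dt.month,
        "weekday": dates.dt.weekday,  # Monday=0, so Tuesday=1
        "weekofyear": dates.dt.isocalendar().week,  # ISO 8601 week number
        "quarter": dates.dt.quarter,
        "year": dates.dt.year,
    }
)
print(attrs)
print(attrs.dtypes)
```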

Drop the unwanted column with the date of the last inspection

In [23]:
df = df.drop(columns=["last_inspection_date"])

In [24]:
df

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,establishment_status,infractions_summary,num_significant,num_crucial,num_minor,...,last_inspection_month,last_inspection_weekday,last_inspection_weekofyear,last_inspection_quarter,last_inspection_year,inspection_month,inspection_weekday,inspection_weekofyear,inspection_quarter,inspection_year
0,1222579,Food Take Out,870 MARKHAM RD,102810896,2012-08-21,Pass,,0.0,0.0,0.0,...,,,,,,8,1,34,3,2012
1,1222579,Food Take Out,870 MARKHAM RD,103015259,2013-06-27,Pass,,0.0,0.0,0.0,...,8.0,1.0,34,3.0,2012.0,6,3,26,2,2013
2,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Pass,Operator fail to properly wash surfaces in roo...,0.0,0.0,6.0,...,6.0,3.0,26,2.0,2013.0,12,4,51,4,2013
3,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,Pass,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3.0,0.0,12.0,...,12.0,4.0,51,4.0,2013.0,9,1,37,3,2014
4,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Pass,STORE UTENSILS IN MANNER NOT PREVENTING CONTAM...,3.0,0.0,6.0,...,9.0,1.0,37,3.0,2014.0,1,3,2,1,2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205748,10690616,Food Take Out,4698 YONGE ST,104594530,2019-10-23,Pass,,0.0,0.0,0.0,...,,,,,,10,2,43,4,2019
205749,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,Pass,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1.0,0.0,0.0,...,,,,,,10,2,43,4,2019
205750,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,Pass,FAIL TO MAINTAIN HANDWASHING STATIONS (LIQUID ...,1.0,0.0,1.0,...,,,,,,10,2,43,4,2019
205751,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,Pass,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1.0,0.0,0.0,...,,,,,,10,2,43,4,2019


## Notes

1. When making a prediction for new inspections (using a deployed ML model), the inspection schedule is assumed to be known ahead of time (this was the assumption in the **Background** section from `1_get_data.ipynb`). So, the `inspection_date` and establishment (name, type and address) are known ahead of time.
2. The features engineered in this notebook look back in time for each establishment. After the data is split into training and testing sets, the feature values in the testing set can be computed from all available inspections, including those in the training set. For example, the number of inspections to date at an establishment in the testing set counts all of its previous inspections, including earlier ones that fall in the training set. There are two equivalent ways to achieve this: combine the training and testing sets, engineer the features and then re-split the data; or engineer the features before splitting the data. The second approach was used here, so the data was not split before performing the above feature engineering.
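To illustrate note 2, the sketch below engineers a look-back feature on the full (hypothetical) data and only then splits it. It assumes a chronological split at a hypothetical cutoff date, which this notebook does not specify.

```python
import pandas as pd

# Hypothetical toy data, sorted by establishment and inspection date
toy = pd.DataFrame(
    {
        "establishment_id": [1, 1, 1, 2],
        "inspection_date": pd.to_datetime(
            ["2018-03-01", "2018-09-01", "2019-02-01", "2019-03-01"]
        ),
    }
)

# Engineer the look-back feature BEFORE splitting
toy["cumulative_inspections"] = toy.groupby("establishment_id").cumcount()

# Hypothetical cutoff date for a chronological train/test split
cutoff = pd.Timestamp("2019-01-01")
train = toy[toy["inspection_date"] < cutoff]
test = toy[toy["inspection_date"] >= cutoff]

# The test row for establishment 1 still counts its two training-period inspections
print(test)
```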

## Export Transformed Data to CSV

We'll now export this transformed data with engineered features to a CSV file which can be used for ML experiments

In [25]:
%%time
time_now = datetime.now().strftime("%Y%m%d_%H%M%S")
df.to_csv(
    f"data/processed/processed_with_features__{time_now}.csv",
    index=False,
)

CPU times: user 6.68 s, sys: 103 ms, total: 6.79 s
Wall time: 6.79 s
