# Get Data

In [None]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [None]:
import configparser
import os
import shutil
from glob import glob
from io import BytesIO
from zipfile import ZipFile

import pandas as pd
import requests
from sqlalchemy import create_engine

In [None]:
config = configparser.ConfigParser()
config.read("../sql.ini")
default_cfg = config["default"]

In [None]:
DB_TYPE = default_cfg["DB_TYPE"]
DB_DRIVER = default_cfg["DB_DRIVER"]
DB_USER = default_cfg["DB_USER"]
DB_PASS = default_cfg["DB_PASS"]
DB_HOST = default_cfg["DB_HOST"]
DB_PORT = default_cfg["DB_PORT"]
DB_NAME = default_cfg["DB_NAME"]

In [None]:
# Connect to the server without selecting a database (required to create the database)
URI_NO_DB = f"{DB_TYPE}+{DB_DRIVER}://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}"

# Connect to the named database (required to perform CRUD operations and submit queries)
URI = f"{DB_TYPE}+{DB_DRIVER}://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
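If `DB_PASS` contains characters such as `@` or `/`, the f-string URIs above may be parsed incorrectly. The following is a minimal sketch of percent-encoding the password with the standard library before building the URI. All credentials, and the `mysql+pymysql` dialect and driver pair, are placeholders assumed for illustration; they are not the values stored in `sql.ini`.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only (not read from sql.ini)
db_user = "analyst"
db_pass = "p@ss/word"
db_host = "localhost"
db_port = "3306"
db_name = "dinesafe"

# Percent-encode the password so "@" and "/" are not read as URI delimiters
safe_pass = quote_plus(db_pass)

# Assumed dialect and driver; substitute the DB_TYPE and DB_DRIVER values in use
uri = f"mysql+pymysql://{db_user}:{safe_pass}@{db_host}:{db_port}/{db_name}"
print(uri)
```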

## Background

### Objective
To use Machine Learning (ML) to predict the likelihood of a significant or crucial infraction during inspections of establishments (restaurants, grocery stores or similar) conducted under the City of Toronto's DineSafe inspection program. Data collected by the DineSafe program are obtained from the city's open data portal.

### Facts about DineSafe Program in Toronto
The following are facts about the data, based on the City of Toronto's [Open Data Portal page for DineSafe data](https://open.toronto.ca/dataset/dinesafe/) and [general info page for the DineSafe program](https://www.toronto.ca/community-people/health-wellness-care/health-programs-advice/food-safety/dinesafe/about-dinesafe/):
1. A single inspection takes place on a specific date at a single establishment. An `inspection_id` should be unique for each inspection. An `establishment_id` should be unique for each establishment.
2. An establishment chain (such as the [SUBWAY](https://en.wikipedia.org/wiki/Subway_(restaurant)) brand), can have multiple establishment locations.
3. Each location can be inspected one or more times (usually more than once). So, a single `establishment_id` and `inspection_id` should be associated with a single `inspection_date`.
4. One or more infractions can be recorded per inspection. In the inspections data, each infraction is listed on a single row. There can be multiple rows (infractions) per inspection.
5. If a Significant infraction is detected, an [inspector returns within two days](https://www.toronto.ca/community-people/health-wellness-care/health-programs-advice/food-safety/dinesafe/dinesafe-infractions/) to re-inspect (follow-up inspection) the establishment. The current ML use-case will not use such re-inspections so these inspections will be removed from the data (later in this notebook).

### Implications for Current Use-Case

For the current ML use-case, we require each *observation* to be an independent inspection with or without an infraction (crucial, significant or minor). We will then create a binary variable indicating whether the inspection resulted in a significant or crucial infraction (1) or not (0), since that is the label that the ML algorithm needs to predict. Significant or crucial infractions present a health hazard, while minor infractions present only a minimal health risk. The ML model will not predict the outcome of follow-up inspections. It will only be trained to predict the outcome of the initial inspection (whether or not there was a significant or crucial infraction).

When exploring the data, we will need to take these considerations into account as well as the facts about the data mentioned above.
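The binary label described above can be sketched without any dependencies. This is an illustration only: the rows are made up, and the severity strings (`"M - Minor"`, `"S - Significant"`, `"C - Crucial"`) are assumed values that should be checked against the raw data.

```python
from collections import defaultdict

# Hypothetical infraction rows: (inspection_id, severity).
# Severity strings are assumptions for illustration; verify them in the raw data.
rows = [
    (101, "M - Minor"),
    (101, "S - Significant"),
    (102, "M - Minor"),
    (103, None),  # inspection without any infraction
    (104, "C - Crucial"),
]

# Severities treated as a health hazard (significant or crucial)
HAZARD_PREFIXES = ("S", "C")

# One label per inspection: 1 if any of its rows is significant or crucial
labels = defaultdict(int)
for inspection_id, severity in rows:
    is_hazard = severity is not None and severity.startswith(HAZARD_PREFIXES)
    labels[inspection_id] = max(labels[inspection_id], int(is_hazard))

print(dict(labels))
```

Inspection 101 has one minor and one significant infraction, so it is labelled 1; inspections with only minor infractions, or none, are labelled 0.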

### Assumptions

#### Inspection Schedule During the Out-of-Sample Period
Before a trained ML model is called to make predictions, it is assumed that the future inspection schedule (establishments and planned inspection dates) is known ahead of time. The schedule does not need to be made available to the inspector. However, it must be provided to a prediction service that calls the trained ML model to predict whether these scheduled inspections will result in an infraction. The ML model will predict the likelihood that a significant or crucial infraction will be detected during these scheduled inspections (at the scheduled establishments) *ahead of the date on which the scheduled inspection will occur*. So, if sufficiently accurate, the ML model can predict the likelihood of a crucial or significant infraction during a scheduled inspection before an inspector conducts it.

#### Other Examples of Handling an Out-of-Sample Period
In one of the previous ML-based studies, an out-of-sample period of time is chosen in advance. During this time, the inspections proceed as planned. At the end of this period, when the inspection data becomes available (establishment locations and inspection dates), the ML model is used to predict the likelihood of an infraction during those inspections. So, the ML model does not have to predict into the future, since it is being evaluated against true inspection data from the past. The reason for this approach is that the application in question was part of a pilot study to estimate the efficacy of such an approach.

#### Implications for Current Work
By comparison, in the current project and with an eye towards deploying such an ML model, we need to predict this likelihood ahead of time so that establishments likely to have infractions are known before the day of an in-person inspection by an inspector. For this reason, we are assuming here that the inspection schedule (establishments and planned inspection dates) is known ahead of time. This also means that any predictors (ML features) we use must be known ahead of the out-of-sample period. For example, if weather is to be used, then a forecast of the weather conditions (e.g. temperature) during the out-of-sample period is required for the dates and locations in the planned inspection schedule, since we will need this weather data on the date when the predictions for all out-of-sample inspections are made.
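One way to picture this constraint is a chronological split around the date on which predictions are made. The sketch below uses made-up inspections and an assumed `prediction_date`; neither comes from the DineSafe data.

```python
from datetime import date

# Hypothetical inspections: (inspection_id, inspection_date)
inspections = [
    (1, date(2019, 11, 5)),
    (2, date(2019, 12, 20)),
    (3, date(2020, 1, 15)),
    (4, date(2020, 2, 3)),
]

# Assumed date on which predictions are made. Inspections scheduled on or
# after this date are out-of-sample, so their features must be known by then.
prediction_date = date(2020, 1, 1)

in_sample = [i for i, d in inspections if d < prediction_date]
out_of_sample = [i for i, d in inspections if d >= prediction_date]

print(in_sample, out_of_sample)
```

Any feature attached to inspections 3 and 4 (such as a temperature forecast) would have to be available on `prediction_date`, not on the inspection date itself.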

## About

In this notebook, we will download historical [DineSafe inspections data](https://open.toronto.ca/dataset/dinesafe/) from WayBackMachine (internet web archive, [link](https://archive.org/)). These datasets are snapshots captured at various timestamps. We need these [snapshots](https://web.archive.org/web/*/http://opendata.toronto.ca/public.health/dinesafe/dinesafe.zip) since the version of this data on the Toronto Open Data portal covers a short period of time (approx. 18 months starting in Jan 2020). We want access to as much data as possible to train an ML model to predict a significant or crucial infraction during future inspections.

All historical datasets will be processed (dropping any inspections that might be duplicated across multiple snapshots), concatenated and then appended to a local MySQL database.
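Because consecutive snapshots overlap in time, the same row can appear in more than one file. A dependency-free sketch of dropping exact duplicates while keeping the first occurrence is shown here; the two snapshots and their values are made up for illustration. The notebook itself does this with `pd.concat(...).drop_duplicates(keep="first")` in `load()`.

```python
# Two hypothetical snapshots that share one identical row
snapshot_a = [
    {"inspection_id": 1, "severity": "M - Minor"},
    {"inspection_id": 2, "severity": "S - Significant"},
]
snapshot_b = [
    {"inspection_id": 2, "severity": "S - Significant"},  # also in snapshot_a
    {"inspection_id": 3, "severity": "C - Crucial"},
]

seen = set()
combined = []
for row in snapshot_a + snapshot_b:
    # A row counts as a duplicate only if every column value matches
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        combined.append(row)

print([row["inspection_id"] for row in combined])
```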

## Database Administration

The inspections data will be stored locally in a MySQL database. We'll first create the `dinesafe` database.

In [None]:
engine = create_engine(URI_NO_DB)
conn = engine.connect()

In [None]:
_ = conn.execute(f"DROP DATABASE IF EXISTS {DB_NAME};")
_ = conn.execute(f"CREATE DATABASE IF NOT EXISTS {DB_NAME};")

In [None]:
conn.close()
engine.dispose()

## Create Database Table

In [None]:
engine = create_engine(URI)
conn = engine.connect()

Create the `inspections` table in the `dinesafe` database.

In [None]:
# Name of database table
table_name = "inspections"

In [None]:
_ = conn.execute(f"DROP TABLE IF EXISTS {table_name}")

In [None]:
create_table_query = f"""
                     CREATE TABLE IF NOT EXISTS {table_name} (
                         row_id INT,
                         establishment_id INT,
                         inspection_id INT,
                         establishment_name TEXT,
                         establishmenttype TEXT,
                         establishment_address TEXT,
                         latitude FLOAT,
                         longitude FLOAT,
                         establishment_status TEXT,
                         minimum_inspections_peryear INT,
                         infraction_details TEXT,
                         inspection_date DATE,
                         severity TEXT,
                         action TEXT,
                         court_outcome TEXT,
                         amount_fined FLOAT
                     )
                     """
_ = conn.execute(create_table_query)

In [None]:
conn.close()
engine.dispose()

## Get Data and Populate Database

Retrieve DineSafe program data snapshots from WayBackMachine and append them to the `inspections` table in the `dinesafe` database (dropping duplicate rows and changing data types before appending to the table)
- a helper function `retrieve_data()` is used to
  - download historical inspections data from WayBackMachine (`extract()`) and unzip contents into `data/raw`
  - read and process the raw inspections data (`transform()`, which calls `read_data()` and `process_data()` for each file)
    - change `INSPECTION_DATE` to a `datetime`
    - change `MINIMUM_INSPECTIONS_PERYEAR` to an integer datatype
    - clean `AMOUNT_FINED` (remove commas) and convert to numerical datatype
    - append `LATITUDE` and `LONGITUDE` columns (if not present)
      - some inspections data files have these but others don't
      - since we'll be appending all files to the same database table, they will all need to have these columns even if the columns contain missing values
    - change all column names to lowercase and re-order the columns to match the database table
  - concatenate the processed data, drop duplicate rows and append to the `inspections` table in the `dinesafe` database (`load()`)
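As an illustration of the `AMOUNT_FINED` cleaning step, here is a small sketch in plain Python. `parse_amount()` is a hypothetical helper, not part of the notebook, and it returns `None` for unparseable values, whereas `pd.to_numeric(..., errors="coerce")` returns `NaN`.

```python
def parse_amount(value):
    """Convert a fine such as '1,250.00' to a float, or None if not parseable."""
    if value is None:
        return None
    try:
        # Remove thousands separators before converting
        return float(str(value).replace(",", ""))
    except ValueError:
        return None


# Hypothetical raw values, for illustration only
raw_fines = ["1,250.00", "300", "", None]
print([parse_amount(v) for v in raw_fines])
```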

In [None]:
def extract(zip_filenames):
    """Retrieve dinesafe data snapshot XML files from WayBackMachine."""
    available_files = []
    for zip_fname in zip_filenames:
        # Assemble source URL
        url = (
            f"https://web.archive.org/web/{zip_fname}/"
            "http://opendata.toronto.ca/public.health/dinesafe/dinesafe.zip"
        )
        # Create path to target dir, where extracted .XML file will be found
        target_dir = f"data/raw/{zip_fname}"
        if not os.path.exists(f"{target_dir}/dinesafe.xml"):
            # Get zipped file containing .XML file
            r = requests.get(url)
            # Stop on a failed download instead of passing an error page to ZipFile
            r.raise_for_status()
            # Extract to target dir
            with ZipFile(BytesIO(r.content)) as zfile:
                zfile.extractall(target_dir)
        available_files.append(target_dir)
    return available_files


def read_data(filepath):
    """Load an XML file into a DataFrame."""
    return pd.read_xml(filepath)


def process_data(df, cols_order_wanted):
    """Process inspections data."""
    # Datetime formatting
    df["INSPECTION_DATE"] = pd.to_datetime(df["INSPECTION_DATE"])
    # Change datatype 1/2
    df = df.astype({"MINIMUM_INSPECTIONS_PERYEAR": int})
    # Remove commas from column and convert string to float
    if df["AMOUNT_FINED"].dtype == "object":
        df["AMOUNT_FINED"] = pd.to_numeric(
            df["AMOUNT_FINED"].astype(str).str.replace(",", ""), errors="coerce"
        )
    # Append latitude and longitude columns, if not found in the data
    for loc_col in ["LATITUDE", "LONGITUDE"]:
        if loc_col not in list(df):
            df[loc_col] = None
    # Change column names to lowercase, Re-order columns and Change datatype of the
    # latitude and longitude columns
    df = df.rename(columns=str.lower)[cols_order_wanted].astype(
        {"latitude": float, "longitude": float}
    )
    return df


def transform(available_files, cols_order_wanted):
    """Transform data in downloaded XML files."""
    dfs = []
    for f in available_files:
        df = read_data(f"{f}/dinesafe.xml")
        df = process_data(df, cols_order_wanted)
        dfs.append(df)
    return dfs


def load(dfs, uri, table_name="inspections"):
    """Vertically concatenate list of DataFrames and Append to database."""
    dfs_all = pd.concat(dfs, ignore_index=True).drop_duplicates(
        keep="first", subset=None
    )
    engine = create_engine(uri)
    conn = engine.connect()
    dfs_all.to_sql(name=table_name, con=conn, index=False, if_exists="append")
    conn.close()
    engine.dispose()


def retrieve_data(zip_filenames, uri, cols_order_wanted, table_name="inspections"):
    """Retrieve data, process and append to database table."""
    # Extract
    available_files_list = extract(zip_filenames)

    # Transform
    dfs = transform(available_files_list, cols_order_wanted)

    # Load
    load(dfs, uri, table_name)
    return dfs

Run the ETL workflow to retrieve historical inspections data files, process each file and append the processed data to the `inspections` table of the `dinesafe` database.

In [None]:
# Data file names to download (these are timestamps at which data snapshot was
# captured by WayBackMachine)
zip_filenames = [
    "20130723222156",
    "20150603085055",
    "20151012004454",
    "20160129205023",
    "20160317045436",
    "20160915001010",
    "20170303162206",
    "20170330001043",
    "20170726115444",
    "20190116215713",
    "20190126084933",
    "20190614092848",
    "20210626163552",
]

# Order of DataFrame columns (to re-order raw data) in order to match column order in database table
cols_order_wanted = [
    "row_id",
    "establishment_id",
    "inspection_id",
    "establishment_name",
    "establishmenttype",
    "establishment_address",
    "latitude",
    "longitude",
    "establishment_status",
    "minimum_inspections_peryear",
    "infraction_details",
    "inspection_date",
    "severity",
    "action",
    "court_outcome",
    "amount_fined",
]
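The snapshot names above follow the web archive's 14-digit `YYYYMMDDhhmmss` timestamp format, so they can be parsed to check when each snapshot was captured. A minimal sketch using the first and last names from the list:

```python
from datetime import datetime

# First and last snapshot names from the list above
snapshot_ids = ["20130723222156", "20210626163552"]

# Snapshot names are capture timestamps in YYYYMMDDhhmmss format
captured = [datetime.strptime(s, "%Y%m%d%H%M%S") for s in snapshot_ids]

# Number of whole days between the earliest and latest capture
span_days = (captured[-1] - captured[0]).days

for snapshot_id, ts in zip(snapshot_ids, captured):
    print(snapshot_id, "->", ts.isoformat())
```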

In [None]:
%%time
dfs = retrieve_data(zip_filenames, URI, cols_order_wanted, table_name)

These datasets list each infraction for a single inspection on a separate row. We will now need to filter these infractions to select only the relevant ones and then aggregate them by inspection, since each aggregated row (one per inspection) will be used as an independent observation by the ML model we train later.

In the next notebook (`2_sql_filter_transform.ipynb`), we will filter these infractions and aggregate them by inspection.