# Get Data

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [2]:
import configparser
import os
from glob import glob
from io import BytesIO
from typing import Dict, List, Union
from zipfile import ZipFile

import pandas as pd
import requests
import snowflake.connector

## Background

### Objective
To use Machine Learning (ML) to predict the likelihood of a significant or crucial infraction during inspections of establishments (restaurants, grocery stores or similar) conducted by the City of Toronto's DineSafe inspection program. Data collected by the DineSafe program are obtained from the city's open data portal.

### Facts about DineSafe Program in Toronto
The following facts about the data are based on the City of Toronto's [Open Data Portal page for DineSafe data](https://open.toronto.ca/dataset/dinesafe/) and [general info page for the DineSafe program](https://www.toronto.ca/community-people/health-wellness-care/health-programs-advice/food-safety/dinesafe/about-dinesafe/):
1. A single inspection takes place on a specific date at a single establishment. An `inspection_id` should be unique for each inspection. An `establishment_id` should be unique for each establishment.
2. An establishment chain (such as the [SUBWAY](https://en.wikipedia.org/wiki/Subway_(restaurant)) brand), can have multiple establishment locations.
3. Each location can be inspected one or more times (usually more than once). So, a single `establishment_id` and `inspection_id` should be associated with a single `inspection_date`.
4. One or more infractions can be recorded per inspection. In the inspections data, each infraction is listed on a single row. There can be multiple rows (infractions) per inspection.
5. If a Significant infraction is detected, an [inspector returns within two days](https://www.toronto.ca/community-people/health-wellness-care/health-programs-advice/food-safety/dinesafe/dinesafe-infractions/) to re-inspect (follow-up inspection) the establishment. The current ML use-case will not use such re-inspections so these inspections will be removed from the data (later in this notebook).
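
Facts 1 and 3 can be verified directly once the data is loaded. The following is a minimal sketch of such a check, assuming a DataFrame in the one-row-per-infraction layout described above. The toy values and lowercase column names are illustrative assumptions, not taken from the real data.

```python
import pandas as pd

# Toy rows in the one-row-per-infraction layout described above.
# Column names and values are illustrative assumptions.
df_toy = pd.DataFrame(
    {
        "establishment_id": [1, 1, 1, 2],
        "inspection_id": [100, 100, 101, 200],
        "inspection_date": ["2019-01-05", "2019-01-05", "2019-06-10", "2019-02-01"],
    }
)

# Number of distinct dates and establishments seen for each inspection
n_unique = df_toy.groupby("inspection_id")[
    ["inspection_date", "establishment_id"]
].nunique()

# Inspections that violate the expected one-to-one relationship
violations = n_unique[(n_unique > 1).any(axis=1)]
print(f"Inspections with conflicting dates or establishments: {len(violations)}")
```

Any rows in `violations` would point to inspections that need a closer look before they are used as observations.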

### Implications for Current Use-Case

For the current ML use-case, we require each *observation* to be an independent inspection with or without an infraction (crucial, significant or minor). We will then create a binary variable indicating whether the inspection resulted in a significant or crucial infraction (1) or not (0), since that is the label that the ML algorithm needs to predict. Significant or crucial infractions present a health hazard, while minor infractions present only a minimal health risk. The ML model will not predict the outcome of follow-up inspections. It will only be trained to predict the outcome of the initial inspection (whether or not there was a significant or crucial infraction).

When exploring the data, we will need to take these considerations into account as well as the facts about the data mentioned above.
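
As a rough illustration of the binary label described above, the sketch below collapses infraction rows into one row per inspection. The severity strings and the `has_serious_infraction` column name are assumptions made for this example and should be checked against the real data.

```python
import pandas as pd

# Toy infractions, one row per infraction. The severity labels below are
# illustrative assumptions; check the real values in the data before reuse.
df_toy = pd.DataFrame(
    {
        "inspection_id": [100, 100, 101, 200],
        "severity": ["M - Minor", "S - Significant", "M - Minor", None],
    }
)

# Flag rows whose severity is significant or crucial
is_serious = df_toy["severity"].isin(["S - Significant", "C - Crucial"])

# An inspection is labelled 1 if any of its infractions is serious, else 0
labels = (
    is_serious.groupby(df_toy["inspection_id"])
    .any()
    .astype(int)
    .rename("has_serious_infraction")
    .reset_index()
)
print(labels)
```

Inspections with no recorded infraction (a missing severity here) receive a label of 0, which matches the "with or without an infraction" requirement.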

### Assumptions

#### Inspection Schedule During the Out-of-Sample Period
Once an ML model is trained, and before it is called to make predictions, the future inspection schedule (establishments and planned inspection dates) is assumed to be known ahead of time. The schedule does not need to be made available to the inspector. However, it must be provided to a prediction service that calls the trained ML model to predict whether these scheduled inspections will result in an infraction. The ML model will predict the likelihood that a significant or crucial infraction will be detected during these scheduled inspections (at the scheduled establishments) *ahead of the date on which the scheduled inspection will occur*. So, if it is sufficiently accurate, the ML model can predict the likelihood of a crucial or significant infraction during a scheduled inspection before an inspector conducts it.

#### Other Examples of Handling an Out-of-Sample Period
In one of the previous ML-based studies, an out-of-sample period of time is chosen in advance. During this time, the inspections proceed as planned. At the end of this period, when the inspection data becomes available (establishment locations and inspection dates), the ML model is used to predict the likelihood of an infraction during those inspections. So, the ML model does not have to predict into the future, since it is being evaluated against true inspection data from the past. The reason for this approach is that the application in question was part of a pilot study to estimate the efficacy of such an approach.

#### Implications for Current Work
By comparison, in the current project and with an eye towards deploying such an ML model, we need to predict this likelihood ahead of time so that establishments likely to have infractions are identified before the day of an in-person inspection. For this reason, we are assuming here that the inspection schedule (establishments and planned inspection dates) is known ahead of time. This also means that any predictors (ML features) we use must be known ahead of the out-of-sample period. For example, if weather is to be used, then a forecast of the weather conditions (e.g. temperature) during the out-of-sample period is required for the dates and locations in the planned inspection schedule, since this weather data is needed on the date when the predictions for all out-of-sample inspections are made.
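
One way to picture this assumption is a split by date, where the out-of-sample rows keep only the information that a schedule would contain. This is a minimal sketch on toy data. The cutoff date, column names and values are hypothetical.

```python
import pandas as pd

# Toy inspections with a precomputed label; all values are illustrative
df_toy = pd.DataFrame(
    {
        "inspection_id": [100, 101, 200, 201],
        "inspection_date": pd.to_datetime(
            ["2018-03-01", "2018-11-15", "2019-02-01", "2019-05-20"]
        ),
        "has_serious_infraction": [1, 0, 0, 1],
    }
)

# Hypothetical start of the out-of-sample period
cutoff = pd.Timestamp("2019-01-01")

# Train only on inspections that occurred before the cutoff
df_train = df_toy[df_toy["inspection_date"] < cutoff]

# The out-of-sample schedule carries inspections and planned dates only;
# the label is dropped since it is unknown when predictions are made
df_schedule = df_toy[df_toy["inspection_date"] >= cutoff].drop(
    columns=["has_serious_infraction"]
)
print(len(df_train), len(df_schedule))
```

Any feature joined onto `df_schedule` would need to be available on the prediction date, which is why a weather feature would have to come from a forecast.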

## About

In this notebook, we will download historical [Dinesafe inspections data](https://open.toronto.ca/dataset/dinesafe/) from WayBackMachine (internet web archive, [link](https://archive.org/)). These datasets are snapshots captured at various timestamps. We need these [snapshots](https://web.archive.org/web/*/http://opendata.toronto.ca/public.health/dinesafe/dinesafe.zip) since the version of this data on the Toronto Open Data portal covers a short period of time (approx. 18 months starting in Jan 2020). We want access to as much data as possible to train an ML model to predict a significant or crucial infraction during future inspections.

All historical datasets will be processed, concatenated (dropping any rows that are duplicated across multiple snapshots) and then appended to a table in a Snowflake database.
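
Each snapshot is identified by its capture timestamp, which follows the `YYYYMMDDhhmmss` pattern. As a small illustration, the sketch below parses two of the identifiers used in this notebook into datetimes, which can help when checking how much time separates consecutive snapshots.

```python
import pandas as pd

# Two of the snapshot identifiers used in this notebook
snapshot_ids = ["20130723222156", "20210626163552"]

# Wayback Machine identifiers follow the YYYYMMDDhhmmss pattern
snapshot_times = pd.to_datetime(snapshot_ids, format="%Y%m%d%H%M%S")
for snapshot_id, snapshot_time in zip(snapshot_ids, snapshot_times):
    print(snapshot_id, "->", snapshot_time)
```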

## User Inputs

In [3]:
# Name of database table
table_name = "inspections"

# Data file names to download (these are timestamps at which data snapshot was
# captured by WayBackMachine)
zip_filenames = [
    "20130723222156",
    "20150603085055",
    "20151012004454",
    "20160129205023",
    "20160317045436",
    "20160915001010",
    "20170303162206",
    "20170330001043",
    "20170726115444",
    "20190116215713",
    "20190126084933",
    "20190614092848",
    "20210626163552",
]

stage_name = "processed_dinesafe_data"
file_format_name = "COMMACOLSEP_ONEHEADROW"

num_proc_data_files = 10

ci_run = "no"

In [4]:
if ci_run == "yes":
    ACCOUNT = os.getenv("SNOWFLAKE_ACCOUNT")
    USER = os.getenv("SNOWFLAKE_USER")
    PASS = os.getenv("SNOWFLAKE_PASS")
    WAREHOUSE = os.getenv("SNOWFLAKE_WAREHOUSE")
    DB_SCHEMA = os.getenv("SNOWFLAKE_DB_SCHEMA")
    DB_NAME = "dinesafe"
else:
    config = configparser.ConfigParser()
    config.read("../sql.ini")
    default_cfg = config["default"]
    ACCOUNT = default_cfg["SNOWFLAKE_ACCOUNT"]
    USER = default_cfg["SNOWFLAKE_USER"]
    PASS = default_cfg["SNOWFLAKE_PASS"]
    WAREHOUSE = default_cfg["SNOWFLAKE_WAREHOUSE"]
    DB_SCHEMA = default_cfg["SNOWFLAKE_DB_SCHEMA"]
    DB_NAME = "dinesafe"

In [5]:
connector_dict = dict(
    account=ACCOUNT,
    user=USER,
    password=PASS,
    database=DB_NAME,
    schema="public",
    warehouse=WAREHOUSE,
    role="sysadmin",
)
connector_dict_no_db = dict(
    account=ACCOUNT,
    user=USER,
    password=PASS,
    warehouse=WAREHOUSE,
    role="sysadmin",
)

In [6]:
def show_sql_df(
    query: str,
    cursor,
    cnx=None,
    table_output: bool = False,
    use_manual_approach: bool = False,
) -> Union[None, pd.DataFrame]:
    cursor.execute(query)
    if cnx:
        cnx.commit()
    if table_output:
        if use_manual_approach:
            colnames = [cdesc[0].lower() for cdesc in cursor.description]
            cur_fetched = cursor.fetchall()
            if cur_fetched:
                df_query_output = pd.DataFrame.from_records(
                    cur_fetched, columns=colnames
                )
                with pd.option_context(
                    "display.max_columns", 200, "display.max_colwidth", 200
                ):
                    display(df_query_output)
                return df_query_output
        else:
            df_query_output = cursor.fetch_pandas_all()
            with pd.option_context(
                "display.max_columns", 200, "display.max_colwidth", 200
            ):
                display(df_query_output)
            return df_query_output
    return pd.DataFrame()

In [7]:
def create_database(connector_dict_no_db: Dict, database_name: str) -> None:
    conn = snowflake.connector.connect(**connector_dict_no_db)
    cur = conn.cursor()
    for query in [
        f"DROP DATABASE IF EXISTS {database_name}",
        f"CREATE DATABASE IF NOT EXISTS {database_name}",
        f"COMMENT ON DATABASE {database_name} IS 'Toronto dinesafe inspections'",
    ]:
        _ = cur.execute(query)
    query = f"""
            SHOW DATABASES LIKE '%{database_name}%'
            """
    df = show_sql_df(query, cur, conn, True, True)
    assert database_name in df["name"].str.lower().tolist()
    print(f"Created database {database_name}")
    cur.close()
    conn.close()


def create_table(connector_dict: Dict, table_name: str) -> None:
    conn = snowflake.connector.connect(**connector_dict)
    cur = conn.cursor()
    for query in [
        f"DROP TABLE IF EXISTS {table_name}",
        f"""
        CREATE TABLE IF NOT EXISTS {table_name} (
            ROW_ID INT,
            ESTABLISHMENT_ID INT,
            INSPECTION_ID INT,
            ESTABLISHMENT_NAME TEXT,
            ESTABLISHMENTTYPE TEXT,
            ESTABLISHMENT_ADDRESS TEXT,
            ESTABLISHMENT_STATUS TEXT,
            MINIMUM_INSPECTIONS_PERYEAR INT,
            INFRACTION_DETAILS TEXT,
            INSPECTION_DATE DATE,
            SEVERITY TEXT,
            ACTION TEXT,
            COURT_OUTCOME TEXT,
            AMOUNT_FINED FLOAT,
            LATITUDE FLOAT,
            LONGITUDE FLOAT
        )
        """,
        f"COMMENT ON TABLE {table_name} IS 'Toronto Dinesafe inspections data'",
    ]:
        _ = cur.execute(query)
    query = f"""
            SHOW TABLES LIKE '%{table_name}%'
            """
    df = show_sql_df(query, cur, conn, True, True)
    assert table_name in df["name"].str.lower().tolist()
    print(f"Created table {table_name}")
    cur.close()
    conn.close()


def create_file_format(connector_dict: Dict, file_format_name: str) -> None:
    conn = snowflake.connector.connect(**connector_dict)
    cur = conn.cursor()
    for query in [
        f"DROP FILE FORMAT IF EXISTS {file_format_name}",
        rf"""
        CREATE OR REPLACE FILE FORMAT {file_format_name}
        TYPE = 'CSV'
        COMPRESSION = 'AUTO'
        FIELD_DELIMITER = ','
        RECORD_DELIMITER = '\\n'
        SKIP_HEADER = 1
        FIELD_OPTIONALLY_ENCLOSED_BY='"'
        ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
        ESCAPE = 'NONE'
        ESCAPE_UNENCLOSED_FIELD = '\134'
        DATE_FORMAT = 'AUTO'
        NULL_IF = ('\\N')
        """,
        (
            f"COMMENT ON FILE FORMAT {file_format_name} IS "
            "'file format for Toronto dinesafe processed data files'"
        ),
    ]:
        _ = cur.execute(query)
    query = f"""
            SHOW FILE FORMATS LIKE '%{file_format_name}%'
            """
    df = show_sql_df(query, cur, conn, True, True)
    assert file_format_name in df["name"].tolist()
    print(f"Created file format {file_format_name}")
    cur.close()
    conn.close()


def create_stage(connector_dict: Dict, stage_name: str, file_format_name: str) -> None:
    conn = snowflake.connector.connect(**connector_dict)
    cur = conn.cursor()
    for query in [
        f"DROP STAGE IF EXISTS {stage_name}",
        f"""
        CREATE OR REPLACE STAGE {stage_name}
        FILE_FORMAT = {file_format_name}
        """,
        (
            f"COMMENT ON STAGE {stage_name} IS "
            "'local dir for Toronto dinesafe processed data'"
        ),
    ]:
        _ = cur.execute(query)
    query = f"""
            SHOW STAGES LIKE '%{stage_name}%'
            """
    df = show_sql_df(query, cur, conn, True, True)
    assert stage_name in df["name"].str.lower().tolist()
    print(f"Created stage {stage_name}")
    cur.close()
    conn.close()


def add_processed_data_to_stage(
    connector_dict: Dict,
    stage_name: str,
) -> None:
    conn = snowflake.connector.connect(**connector_dict)
    cur = conn.cursor()

    processed_filepaths = glob("data/processed/*.csv")
    for processed_filepath in processed_filepaths:
        query = f"""
                PUT file://{processed_filepath} @{stage_name}
                """
        print(f"{query.strip()}...", end="")
        _ = cur.execute(query)
        print("Done.")

    query = f"""
            LIST @{stage_name}/
            """
    df = show_sql_df(query, cur, conn, True, True)
    assert df.shape[0] == len(processed_filepaths)
    print(
        f"Added {len(processed_filepaths):,} processed data files to stage {stage_name}"
    )
    cur.close()
    conn.close()


def copy_staged_data_to_table(
    connector_dict: Dict,
    table_name: str,
    stage_name: str,
    expected_num_rows: int,
) -> None:
    conn = snowflake.connector.connect(**connector_dict)
    cur = conn.cursor()
    query = f"""
            COPY INTO {table_name} from @{stage_name}
            """
    _ = cur.execute(query)

    query = f"""
            SELECT COUNT(*) AS num_rows
            FROM {table_name}
            """
    df = show_sql_df(query, cur, conn, True, True)
    assert df.loc[0, "num_rows"] == expected_num_rows
    # df holds a single COUNT(*) row, so report the verified row count
    print(
        f"Copied {expected_num_rows:,} rows of processed data from "
        f"stage {stage_name} to table {table_name}"
    )
    cur.close()
    conn.close()

In [8]:
def extract(zip_fname: str) -> str:
    """Retrieve dinesafe data snapshot XML file from WayBackMachine."""
    # Assemble source URL
    url = (
        f"https://web.archive.org/web/{zip_fname}/"
        "http://opendata.toronto.ca/public.health/dinesafe/dinesafe.zip"
    )
    # Create path to target dir, where extracted .XML file will be found
    target_local_dir = f"data/raw/{zip_fname}"
    target_local_filepath = f"{target_local_dir}/dinesafe.xml"
    if not os.path.exists(target_local_filepath):
        print(f"Downloading data from {zip_fname}...", end="")
        # Get zipped file containing .XML file
        r = requests.get(url)
        # Fail early on an HTTP error instead of trying to unzip an error page
        r.raise_for_status()
        # Extract to target dir
        with ZipFile(BytesIO(r.content)) as zfile:
            zfile.extractall(target_local_dir)
    else:
        print(f"Found local data for {zip_fname}...", end="")
    return target_local_dir


def read_data(filepath: str) -> pd.DataFrame:
    """Load an XML file into a DataFrame."""
    return pd.read_xml(filepath)


def process_data(df: pd.DataFrame) -> pd.DataFrame:
    """Process inspections data."""
    # Datetime formatting
    df["INSPECTION_DATE"] = pd.to_datetime(df["INSPECTION_DATE"])

    # Change datatype 1/2
    df = df.astype({"MINIMUM_INSPECTIONS_PERYEAR": int})

    # Remove commas from column and convert string to float
    if df["AMOUNT_FINED"].dtype == "object":
        df["AMOUNT_FINED"] = pd.to_numeric(
            df["AMOUNT_FINED"].astype(str).str.replace(",", ""), errors="coerce"
        )

    # Append latitude and longitude columns with missing values (if not found in data)
    for loc_col in ["LATITUDE", "LONGITUDE"]:
        if loc_col not in list(df):
            df[loc_col] = None

    # Change datatype 2/2
    df = df.astype({"LATITUDE": float, "LONGITUDE": float})
    return df


def transform(raw_data_dir: str) -> pd.DataFrame:
    """Transform local data in downloaded XML file."""
    df = read_data(f"{raw_data_dir}/dinesafe.xml")
    print(f"Processing data from {raw_data_dir}...", end="")
    df = process_data(df)
    print("Done.")
    return df


def load(
    dfs: List[pd.DataFrame],
    connector_dict: Dict[str, str],
    num_proc_data_files: int,
    stage_name: str,
    table_name: str,
) -> pd.DataFrame:
    """Vertically concatenate list of DataFrames and Append to database."""
    df = pd.concat(dfs, ignore_index=True)

    # Drop full rows that are exact duplicates of other rows
    df = df.drop_duplicates(keep="first", subset=None).reset_index(drop=True)

    # Add file counter column
    num_rows_per_csv_file = len(df) / num_proc_data_files
    print(f"Will split processed data among {num_proc_data_files:,} CSV files.")
    df["file_number"] = (df.index // num_rows_per_csv_file).astype(int)

    # Export concatenated DataFrame to CSV files
    for csv_idx in range(df["file_number"].max() + 1):
        proc_fpath = f"data/processed/dinesafe_{csv_idx}.csv"
        if not os.path.exists(proc_fpath):
            df.query(f"file_number == {csv_idx}").drop(columns=["file_number"]).to_csv(
                proc_fpath,
                index=False,
            )
            print(f"Exported processed data to CSV file index {csv_idx}.")
        else:
            print(f"Found CSV file index {csv_idx} with processed data. Did nothing.")

    # Append to table in database
    add_processed_data_to_stage(connector_dict, stage_name)
    copy_staged_data_to_table(connector_dict, table_name, stage_name, len(df))
    return df


def retrieve_data(
    zip_filenames: List[str],
    connector_dict: Dict[str, str],
    num_proc_data_files: int,
    stage_name: str,
    table_name: str,
) -> pd.DataFrame:
    """Retrieve data, process and append to database table."""
    dfs = []
    for zip_filename in zip_filenames:
        # Extract
        local_file_dir = extract(zip_filename)

        # Transform
        df = transform(local_file_dir)
        dfs.append(df)

    # Load
    df = load(
        dfs,
        connector_dict,
        num_proc_data_files,
        stage_name,
        table_name,
    )
    return df

In [11]:
conn = snowflake.connector.connect(**connector_dict_no_db)
cur = conn.cursor()

In [12]:
%%time
query = """
        SHOW DATABASES
        """
_ = show_sql_df(query, cur, None, True, True)

Unnamed: 0,created_on,name,is_default,is_current,origin,owner,comment,options,retention_time
0,2022-01-27 16:12:39.701000-08:00,DEMO_DB,N,N,,SYSADMIN,,,1
1,2022-01-27 10:58:19.534000-08:00,SNOWFLAKE_SAMPLE_DATA,N,N,SFC_SAMPLES.SAMPLE_DATA,ACCOUNTADMIN,Provided by Snowflake during account provisioning,,1
2,2022-01-27 16:12:52.421000-08:00,UTIL_DB,N,N,,SYSADMIN,,,1


CPU times: user 13.7 ms, sys: 404 µs, total: 14.1 ms
Wall time: 123 ms


In [13]:
cur.close()
conn.close()

## Create Database and Supporting Resources

The inspections data will be stored in a SQL database running on [Snowflake](https://www.snowflake.com/). This database will be created here along with any required supporting resources (table, file format and stage).

In [15]:
%%time
create_database(connector_dict_no_db, DB_NAME)
create_table(connector_dict, table_name)
create_file_format(connector_dict, file_format_name)
create_stage(connector_dict, stage_name, file_format_name)

Unnamed: 0,created_on,name,is_default,is_current,origin,owner,comment,options,retention_time
0,2022-03-03 19:31:12.980000-08:00,DINESAFE,N,Y,,SYSADMIN,Toronto dinesafe inspections,,1


Created database dinesafe


Unnamed: 0,created_on,name,database_name,schema_name,kind,comment,cluster_by,rows,bytes,owner,retention_time,automatic_clustering,change_tracking,search_optimization,search_optimization_progress,search_optimization_bytes,is_external
0,2022-03-03 19:31:14.277000-08:00,INSPECTIONS,DINESAFE,PUBLIC,TABLE,Toronto Dinesafe inspections data,,0,0,SYSADMIN,1,OFF,OFF,OFF,,,N


Created table inspections


Unnamed: 0,created_on,name,database_name,schema_name,type,owner,comment,format_options
0,2022-03-03 19:31:15.501000-08:00,COMMACOLSEP_ONEHEADROW,DINESAFE,PUBLIC,CSV,SYSADMIN,file format for Toronto dinesafe processed data files,"{""TYPE"":""CSV"",""RECORD_DELIMITER"":""\n"",""FIELD_DELIMITER"":"","",""FILE_EXTENSION"":null,""SKIP_HEADER"":1,""DATE_FORMAT"":""AUTO"",""TIME_FORMAT"":""AUTO"",""TIMESTAMP_FORMAT"":""AUTO"",""BINARY_FORMAT"":""HEX"",""ESCAPE""..."


Created file format COMMACOLSEP_ONEHEADROW


Unnamed: 0,created_on,name,database_name,schema_name,url,has_credentials,has_encryption_key,owner,comment,region,type,cloud,notification_channel,storage_integration
0,2022-03-03 19:31:16.442000-08:00,PROCESSED_DINESAFE_DATA,DINESAFE,PUBLIC,,N,N,SYSADMIN,local dir for Toronto dinesafe processed data,,INTERNAL,,,


Created stage processed_dinesafe_data
CPU times: user 591 ms, sys: 6.81 ms, total: 598 ms
Wall time: 4.68 s


## Get Data and Populate Database

Retrieve DineSafe program data snapshots from WayBackMachine and append them to the `inspections` table of the `dinesafe` database. Data types are changed and exact duplicate rows are dropped before appending to the table.
A helper function `retrieve_data()` is used to:
- download historical inspections data from WayBackMachine (`extract()`) and unzip contents into `data/raw`
- process the raw inspections data (`process_data()`)
  - change `INSPECTION_DATE` to a `datetime`
  - change `MINIMUM_INSPECTIONS_PERYEAR` to an integer datatype
  - clean `AMOUNT_FINED` (remove commas) and convert to numerical datatype
  - append `LATITUDE` and `LONGITUDE` columns (if not present)
    - some inspections data files have these but others don't
    - since we'll be appending all files to the same database table, they will all need to have these columns even if the columns contain missing values
  - convert the `LATITUDE` and `LONGITUDE` columns to a float datatype
- append the processed data to the `inspections` table in the `dinesafe` SQL database (`load()`)

Run the ETL workflow to retrieve historical inspections data files, process each file and append the processed data to the `inspections` table of the `dinesafe` database.
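
To make the processing steps listed above concrete, here is a minimal sketch that applies them to a toy DataFrame. It mirrors the logic of `process_data()` but is a standalone illustration. The input values are made up, and the `LATITUDE` and `LONGITUDE` columns are deliberately left out of the raw data.

```python
import pandas as pd

# Toy raw rows; values are illustrative and LATITUDE/LONGITUDE are absent
df_raw = pd.DataFrame(
    {
        "INSPECTION_DATE": ["2019-01-05", "2019-06-10"],
        "MINIMUM_INSPECTIONS_PERYEAR": ["2", "3"],
        "AMOUNT_FINED": ["1,250.00", None],
    }
)

df_clean = df_raw.copy()
df_clean["INSPECTION_DATE"] = pd.to_datetime(df_clean["INSPECTION_DATE"])
df_clean["MINIMUM_INSPECTIONS_PERYEAR"] = df_clean[
    "MINIMUM_INSPECTIONS_PERYEAR"
].astype(int)

# Strip thousands separators, then coerce unparseable values to NaN
df_clean["AMOUNT_FINED"] = pd.to_numeric(
    df_clean["AMOUNT_FINED"].astype(str).str.replace(",", ""), errors="coerce"
)

# Add missing location columns so every snapshot shares one schema
for loc_col in ["LATITUDE", "LONGITUDE"]:
    if loc_col not in df_clean.columns:
        df_clean[loc_col] = float("nan")

print(df_clean.dtypes)
```

Filling the absent location columns with missing values keeps the column layout identical across snapshots, which the `COPY INTO` step relies on because the file format sets `ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE`.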

In [16]:
%%time
df = retrieve_data(zip_filenames, connector_dict, num_proc_data_files, stage_name, table_name)

Downloading data from 20130723222156...Processing data from data/raw/20130723222156...Done.
Downloading data from 20150603085055...Processing data from data/raw/20150603085055...Done.
Downloading data from 20151012004454...Processing data from data/raw/20151012004454...Done.
Downloading data from 20160129205023...Processing data from data/raw/20160129205023...Done.
Downloading data from 20160317045436...Processing data from data/raw/20160317045436...Done.
Downloading data from 20160915001010...Processing data from data/raw/20160915001010...Done.
Downloading data from 20170303162206...Processing data from data/raw/20170303162206...Done.
Downloading data from 20170330001043...Processing data from data/raw/20170330001043...Done.
Downloading data from 20170726115444...Processing data from data/raw/20170726115444...Done.
Downloading data from 20190116215713...Processing data from data/raw/20190116215713...Done.
Downloading data from 20190126084933...Processing data from data/raw/20190126084

Unnamed: 0,name,size,md5,last_modified
0,processed_dinesafe_data/dinesafe_0.csv.gz,1931184,e8c5a1e2f26c38175ea13adf898249bd,"Fri, 4 Mar 2022 03:34:41 GMT"
1,processed_dinesafe_data/dinesafe_1.csv.gz,1936448,abea34927c3ec983697426116aa2ee52,"Fri, 4 Mar 2022 03:34:33 GMT"
2,processed_dinesafe_data/dinesafe_2.csv.gz,1955840,ac4a430d01f7d05095760fcb853fecb5,"Fri, 4 Mar 2022 03:34:36 GMT"
3,processed_dinesafe_data/dinesafe_3.csv.gz,1983520,1b88643766c530410e01ea88e8415fd2,"Fri, 4 Mar 2022 03:34:44 GMT"
4,processed_dinesafe_data/dinesafe_4.csv.gz,2009360,b9c277b6e5fb6abab0a81ac9a328289e,"Fri, 4 Mar 2022 03:34:45 GMT"
5,processed_dinesafe_data/dinesafe_5.csv.gz,2024224,e33409eb61fb75a7230a0b097d1aac3b,"Fri, 4 Mar 2022 03:34:38 GMT"
6,processed_dinesafe_data/dinesafe_6.csv.gz,2031328,e9dbb529001e56e0e46a1826dbbaf143,"Fri, 4 Mar 2022 03:34:43 GMT"
7,processed_dinesafe_data/dinesafe_7.csv.gz,2324064,2c9546bd365f45f1a8ed9c85415c294d,"Fri, 4 Mar 2022 03:34:47 GMT"
8,processed_dinesafe_data/dinesafe_8.csv.gz,2474688,e18b67df04951942ca5d21cb1e66d59c,"Fri, 4 Mar 2022 03:34:35 GMT"
9,processed_dinesafe_data/dinesafe_9.csv.gz,2507424,1d4ef3b336bed2e304b50756b1fd5e29,"Fri, 4 Mar 2022 03:34:40 GMT"


Added 10 processed data files to stage processed_dinesafe_data


Unnamed: 0,num_rows
0,981799


Copied 1 rows of processed data from stage processed_dinesafe_data to table inspections
CPU times: user 1min 16s, sys: 3.3 s, total: 1min 19s
Wall time: 3min 19s


These datasets list each infraction for a single inspection on a separate row. We will now need to filter these infractions to only select relevant ones and then aggregate them by inspection, since each row (inspection) will be used as an independent observation by the ML model we train later.

In the next notebook (`2_sql_filter_transform_v2.ipynb`), we will filter these infractions and aggregate them by inspection.