# SFScores - Predicting health inspection scores of restaurants

In this notebook, we will benchmark getML's FastProp feature learning algorithm against featuretools using a dataset of eateries in San Francisco.

Summary:

- Prediction type: __Regression model__
- Domain: __Health__
- Prediction target: __Inspection score__
- Population size: __12887__

## Background

This notebook is based on the San Francisco Dept. of Public Health's database of eateries in San Francisco. These eateries are regularly inspected. The inspections often result in a score.

The challenge is to predict the score resulting from an inspection.

The dataset has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/SFScores) (Motl and Schulte, 2015), which now resides at [relational-data.org](https://relational-data.org/dataset/SFScores).

We will benchmark [getML](https://www.getml.com)'s feature learning algorithms against [featuretools](https://www.featuretools.com), an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.

## Analysis

Let's get started with the analysis and set up our session:

In [1]:
import copy
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path

from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
%matplotlib inline  

import featuretools
import woodwork as ww
import getml

getml.engine.launch(home_directory=Path.home(), allow_remote_ips=True, token='token')
getml.engine.set_project('sfscores')

getML engine is already running.

Connected to project 'sfscores'


### 1. Loading data

#### 1.1 Download from source

We begin by downloading the data:

In [2]:
conn = getml.database.connect_mysql(
    host="db.relational-data.org",
    dbname="SFScores",
    port=3306,
    user="guest",
    password="relational"
)

conn

Connection(dbname='SFScores',
           dialect='mysql',
           host='db.relational-data.org',
           port=3306)

In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame

In [4]:
businesses = load_if_needed("businesses")
inspections = load_if_needed("inspections")
violations = load_if_needed("violations")

In [5]:
businesses

name,business_id,latitude,longitude,phone_number,business_certificate,name,address,city,postal_code,tax_code,application_date,owner_name,owner_address,owner_city,owner_state,owner_zip
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,10,37.7911,-122.404,,779059,Tiramisu Kitchen,033 Belden Pl,San Francisco,94104,H24,,Tiramisu LLC,33 Belden St,San Francisco,CA,94104
1.0,24,37.7929,-122.403,,352312,OMNI S.F. Hotel - 2nd Floor Pant...,"500 California St, 2nd Floor",San Francisco,94104,H24,,OMNI San Francisco Hotel Corp,"500 California St, 2nd Floor",San Francisco,CA,94104
2.0,31,37.8072,-122.419,,346882,Norman's Ice Cream and Freezes,2801 Leavenworth St,San Francisco,94133,H24,,Norman Antiforda,2801 Leavenworth St,San Francisco,CA,94133
3.0,45,37.7471,-122.414,,340024,CHARLIE'S DELI CAFE,3202 FOLSOM St,S.F.,94110,H24,2001-10-10,"HARB, CHARLES AND KRISTIN",1150 SANCHEZ,S.F.,CA,94114
4.0,48,37.764,-122.466,,318022,ART'S CAFE,747 IRVING St,SAN FRANCISCO,94122,H24,,YOON HAE RYONG,1567 FUNSTON AVE,SAN FRANCISCO,CA,94122
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6353.0,89335,,,,1057025,Breaking Bad Sandwiches,154 McAllister St,,94102,H25,2016-09-23,"JPMD, LLC",662 Bellhurst Lane,Castro Valley,CA,94102
6354.0,89336,,,,1057746,Miller's Rest,1085 Sutter St,,94109,H26,2016-09-23,"Miller's Rest, LLC",2906 Bush Street,San Francisco,CA,94109
6355.0,89393,,,,1042408,Panuchos,620 Broadway St,,94133,H24,2016-09-28,"Los Aluxes, LLC","1032 Irving Street, #421",San Francisco,CA,94122
6356.0,89416,,,,1051081,Nobhill Pizza & Shawerma,1534 California St,,94109,H24,2016-09-29,"BBA Foods, Inc.","840 Post Street, #218",San Francisco,CA,94109


In [6]:
inspections

| name | business_id | score | date | type |
|---|---|---|---|---|
| role | unused_float | unused_float | unused_string | unused_string |
| 0 | 10 | 92 | 2014-01-14 | Routine - Unscheduled |
| 1 | 10 | | 2014-01-24 | Reinspection/Followup |
| 2 | 10 | 94 | 2014-07-29 | Routine - Unscheduled |
| 3 | 10 | | 2014-08-07 | Reinspection/Followup |
| 4 | 10 | 82 | 2016-05-03 | Routine - Unscheduled |
| | ... | ... | ... | ... |
| 23759 | 89199 | 100 | 2016-09-12 | Routine - Unscheduled |
| 23760 | 89200 | 100 | 2016-09-12 | Routine - Unscheduled |
| 23761 | 89201 | | 2016-09-12 | New Ownership |
| 23762 | 89204 | 100 | 2016-09-12 | Routine - Unscheduled |


In [7]:
violations

| name | business_id | date | violation_type_id | risk_category | description |
|---|---|---|---|---|---|
| role | unused_float | unused_string | unused_string | unused_string | unused_string |
| 0 | 10 | 2014-07-29 | 103129 | Moderate Risk | Insufficient hot water or runnin... |
| 1 | 10 | 2014-07-29 | 103144 | Low Risk | Unapproved or unmaintained equip... |
| 2 | 10 | 2014-01-14 | 103119 | Moderate Risk | Inadequate and inaccessible hand... |
| 3 | 10 | 2014-01-14 | 103145 | Low Risk | Improper storage of equipment ut... |
| 4 | 10 | 2014-01-14 | 103154 | Low Risk | Unclean or degraded floors walls... |
| | ... | ... | ... | ... | ... |
| 36045 | 88878 | 2016-08-19 | 103144 | Low Risk | Unapproved or unmaintained equip... |
| 36046 | 88878 | 2016-08-19 | 103124 | Moderate Risk | Inadequately cleaned or sanitize... |
| 36047 | 89072 | 2016-09-22 | 103120 | Moderate Risk | Moderate risk food holding tempe... |
| 36048 | 89072 | 2016-09-22 | 103131 | Moderate Risk | Moderate risk vermin infestation |


#### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [8]:
businesses.set_role("business_id", getml.data.roles.join_key)
businesses.set_role("name", getml.data.roles.text)
businesses.set_role(["postal_code", "tax_code", "owner_zip"], getml.data.roles.categorical)

businesses

name,business_id,postal_code,tax_code,owner_zip,name,latitude,longitude,phone_number,business_certificate,address,city,application_date,owner_name,owner_address,owner_city,owner_state
role,join_key,categorical,categorical,categorical,text,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,10,94104,H24,94104,Tiramisu Kitchen,37.7911,-122.404,,779059,033 Belden Pl,San Francisco,,Tiramisu LLC,33 Belden St,San Francisco,CA
1.0,24,94104,H24,94104,OMNI S.F. Hotel - 2nd Floor Pant...,37.7929,-122.403,,352312,"500 California St, 2nd Floor",San Francisco,,OMNI San Francisco Hotel Corp,"500 California St, 2nd Floor",San Francisco,CA
2.0,31,94133,H24,94133,Norman's Ice Cream and Freezes,37.8072,-122.419,,346882,2801 Leavenworth St,San Francisco,,Norman Antiforda,2801 Leavenworth St,San Francisco,CA
3.0,45,94110,H24,94114,CHARLIE'S DELI CAFE,37.7471,-122.414,,340024,3202 FOLSOM St,S.F.,2001-10-10,"HARB, CHARLES AND KRISTIN",1150 SANCHEZ,S.F.,CA
4.0,48,94122,H24,94122,ART'S CAFE,37.764,-122.466,,318022,747 IRVING St,SAN FRANCISCO,,YOON HAE RYONG,1567 FUNSTON AVE,SAN FRANCISCO,CA
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6353.0,89335,94102,H25,94102,Breaking Bad Sandwiches,,,,1057025,154 McAllister St,,2016-09-23,"JPMD, LLC",662 Bellhurst Lane,Castro Valley,CA
6354.0,89336,94109,H26,94109,Miller's Rest,,,,1057746,1085 Sutter St,,2016-09-23,"Miller's Rest, LLC",2906 Bush Street,San Francisco,CA
6355.0,89393,94133,H24,94122,Panuchos,,,,1042408,620 Broadway St,,2016-09-28,"Los Aluxes, LLC","1032 Irving Street, #421",San Francisco,CA
6356.0,89416,94109,H24,94109,Nobhill Pizza & Shawerma,,,,1051081,1534 California St,,2016-09-29,"BBA Foods, Inc.","840 Post Street, #218",San Francisco,CA


In [9]:
inspections = inspections[~inspections.score.is_nan()].to_df("inspections")

inspections.set_role("business_id", getml.data.roles.join_key)
inspections.set_role("score", getml.data.roles.target)
inspections.set_role("date", getml.data.roles.time_stamp)

inspections

| name | date | business_id | score | type |
|---|---|---|---|---|
| role | time_stamp | join_key | target | unused_string |
| unit | time stamp, comparison only | | | |
| 0 | 2014-01-14 | 10 | 92 | Routine - Unscheduled |
| 1 | 2014-07-29 | 10 | 94 | Routine - Unscheduled |
| 2 | 2016-05-03 | 10 | 82 | Routine - Unscheduled |
| 3 | 2013-11-18 | 24 | 100 | Routine - Unscheduled |
| 4 | 2014-06-12 | 24 | 96 | Routine - Unscheduled |
| | ... | ... | ... | ... |
| 12882 | 2016-09-22 | 89072 | 90 | Routine - Unscheduled |
| 12883 | 2016-09-12 | 89198 | 100 | Routine - Unscheduled |
| 12884 | 2016-09-12 | 89199 | 100 | Routine - Unscheduled |
| 12885 | 2016-09-12 | 89200 | 100 | Routine - Unscheduled |


In [10]:
violations.set_role("business_id", getml.data.roles.join_key)
violations.set_role("date", getml.data.roles.time_stamp)
violations.set_role(["violation_type_id", "risk_category"], getml.data.roles.categorical)
violations.set_role("description", getml.data.roles.text)

violations

| name | date | business_id | violation_type_id | risk_category | description |
|---|---|---|---|---|---|
| role | time_stamp | join_key | categorical | categorical | text |
| unit | time stamp, comparison only | | | | |
| 0 | 2014-07-29 | 10 | 103129 | Moderate Risk | Insufficient hot water or runnin... |
| 1 | 2014-07-29 | 10 | 103144 | Low Risk | Unapproved or unmaintained equip... |
| 2 | 2014-01-14 | 10 | 103119 | Moderate Risk | Inadequate and inaccessible hand... |
| 3 | 2014-01-14 | 10 | 103145 | Low Risk | Improper storage of equipment ut... |
| 4 | 2014-01-14 | 10 | 103154 | Low Risk | Unclean or degraded floors walls... |
| | ... | ... | ... | ... | ... |
| 36045 | 2016-08-19 | 88878 | 103144 | Low Risk | Unapproved or unmaintained equip... |
| 36046 | 2016-08-19 | 88878 | 103124 | Moderate Risk | Inadequately cleaned or sanitize... |
| 36047 | 2016-09-22 | 89072 | 103120 | Moderate Risk | Moderate risk food holding tempe... |
| 36048 | 2016-09-22 | 89072 | 103131 | Moderate Risk | Moderate risk vermin infestation |


### 2. Predictive modeling

Having loaded the data and defined the roles, we now create a getML pipeline for relational learning. We start by randomly splitting the population table into a training set and a test set.

In [11]:
split = getml.data.split.random(train=0.8, test=0.2)
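
To make the effect of the split concrete, here is a minimal sketch in plain Python. It assumes that each row is assigned to a subset independently at random with the given probability; the helper `random_split` and its `seed` parameter are hypothetical, and getML's own implementation may differ in its details.

```python
import random


def random_split(n_rows, train=0.8, seed=0):
    """Assign each row to 'train' or 'test' independently at random.

    Hypothetical helper for illustration, not part of getML.
    """
    rng = random.Random(seed)
    return ["train" if rng.random() < train else "test" for _ in range(n_rows)]


# 12887 is the population size stated in the summary.
labels = random_split(12887)
share_train = labels.count("train") / len(labels)
print(f"train share: {share_train:.3f}")
```

Because the assignment is random per row, the realized shares only approximate the requested 80/20. In the run shown in this notebook, 10395 of the 12887 rows ended up in the training set.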

#### 2.1 Define relational model

In [12]:
star_schema = getml.data.StarSchema(population=inspections, alias="population", split=split)

star_schema.join(
    businesses,
    on="business_id",
    relationship=getml.data.relationship.many_to_one,
)

star_schema.join(
    violations,
    on="business_id",
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

star_schema.join(
    inspections,
    on="business_id",
    time_stamps="date",
    lagged_targets=True,
    horizon=getml.data.time.days(1),
)

star_schema

| | data frames | staging table |
|---|---|---|
| 0 | population, businesses | POPULATION__STAGING_TABLE_1 |
| 1 | inspections | INSPECTIONS__STAGING_TABLE_2 |
| 2 | violations | VIOLATIONS__STAGING_TABLE_3 |

| | subset | name | rows | type |
|---|---|---|---|---|
| 0 | test | inspections | 2492 | View |
| 1 | train | inspections | 10395 | View |

| | name | rows | type |
|---|---|---|---|
| 0 | businesses | 6358 | DataFrame |
| 1 | violations | 36050 | DataFrame |
| 2 | inspections | 12887 | DataFrame |
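
The `time_stamps` and `horizon` arguments restrict which peripheral rows may be used for a given inspection: only rows whose date, shifted by the horizon, is not later than the inspection date. The sketch below illustrates this condition in plain Python. The helper `visible_rows` is hypothetical, and the toy rows are only loosely modelled on the tables above.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=1)


def visible_rows(population_row, peripheral_rows, horizon=HORIZON):
    """Keep peripheral rows with a matching key that lie at least `horizon` in the past."""
    return [
        row
        for row in peripheral_rows
        if row["business_id"] == population_row["business_id"]
        and row["date"] + horizon <= population_row["date"]
    ]


# Toy data for illustration only.
inspection = {"business_id": 10, "date": date(2014, 7, 29)}
toy_violations = [
    {"business_id": 10, "date": date(2014, 1, 14), "risk_category": "Moderate Risk"},
    {"business_id": 10, "date": date(2014, 7, 29), "risk_category": "Low Risk"},
    {"business_id": 24, "date": date(2014, 6, 12), "risk_category": "Low Risk"},
]

usable = visible_rows(inspection, toy_violations)
print(usable)
```

Violations recorded on the day of the inspection itself are excluded. Without the horizon, they would reveal the outcome of the very inspection whose score we are trying to predict.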


#### 2.2 getML pipeline

__Set up the feature learner & predictor__

We use getML's FastProp algorithm for this problem, preceded by a `Mapping` preprocessor. Apart from the loss function and the number of threads, both are left at their default settings.

In [13]:
mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)

predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
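
As a rough intuition for what a propositionalization algorithm such as FastProp does, the following sketch aggregates a toy peripheral table per join key using a fixed set of aggregations. This is a strong simplification: it ignores time stamps, conditions and feature selection, and the helper `propositionalize` and the `AGGREGATIONS` dictionary are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from statistics import mean

# Toy peripheral table for illustration only.
past_inspections = [
    {"business_id": 10, "score": 92.0},
    {"business_id": 10, "score": 94.0},
    {"business_id": 24, "score": 100.0},
]

# A small, fixed set of aggregation functions to apply to every group.
AGGREGATIONS = {"COUNT": len, "MIN": min, "MAX": max, "AVG": mean}


def propositionalize(rows, key, column):
    """Group rows by `key` and apply every aggregation to `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {
        k: {f"{name}({column})": agg(values) for name, agg in AGGREGATIONS.items()}
        for k, values in groups.items()
    }


features = propositionalize(past_inspections, "business_id", "score")
print(features[10])
```

Each aggregation yields one candidate column in a flat feature table, which a standard predictor such as XGBoost can then consume.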

__Build the pipeline__

In [14]:
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor]
)

pipe1

#### 2.3 Model training

In [15]:
pipe1.check(star_schema.train)

Checking data model...
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          



| | type | label | message |
|---|---|---|---|
| 0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and VIOLATIONS__STAGING_TABLE_3 over 'business_id' and 'business_id', there are no corresponding entries for 5.685426% of entries in 'business_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |


In [16]:
pipe1.fit(star_schema.train)

Checking data model...
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          

To see the issues in full, run .check() on the pipeline.

Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Indexing text fields... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
FastProp: Trying 104 features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:02, remaining: 00:00]          

Trained pipeline.
Time taken: 0h:0m:2.358353



#### 2.4 Model evaluation

In [17]:
fastprop_score = pipe1.score(star_schema.test)
fastprop_score

Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          



| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2024-02-21 15:09:45 | train | score | 4.8865 | 6.5247 | 0.3608 |
| 1 | 2024-02-21 15:09:45 | test | score | 5.3218 | 7.0532 | 0.2889 |
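
For readers unfamiliar with the three metrics, here is a minimal sketch of how they can be computed. The function `regression_metrics` is hypothetical, the targets and predictions are made up, and R-squared is taken here as the squared Pearson correlation between predictions and targets. That choice is an assumption about the definition; other libraries use the coefficient of determination (one minus the ratio of residual to total sum of squares), which can give different values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and squared Pearson correlation for two equally long lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    rsquared = cov * cov / (var_t * var_p)

    return {"mae": mae, "rmse": rmse, "rsquared": rsquared}


# Hypothetical targets and predictions, for illustration only.
metrics = regression_metrics([92.0, 94.0, 82.0, 100.0], [90.0, 95.0, 88.0, 97.0])
print(metrics)
```

Lower MAE and RMSE are better, whereas a higher R-squared is better. RMSE penalizes large errors more heavily than MAE does.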


#### 2.5 featuretools

In [18]:
population_train_pd = star_schema.train.population.to_pandas()
population_test_pd = star_schema.test.population.to_pandas()

In [19]:
inspections_pd = inspections.drop(inspections.roles.unused).to_pandas()
violations_pd = violations.drop(violations.roles.unused).to_pandas()
businesses_pd = businesses.drop(businesses.roles.unused).to_pandas()

In [20]:
population_train_pd["id"] = population_train_pd.index

population_train_pd = population_train_pd.merge(
    businesses_pd,
    on="business_id"
)

population_train_pd

| | business_id | score | date | id | postal_code | tax_code | owner_zip | name |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 92.0 | 2014-01-14 | 0 | 94104 | H24 | 94104 | Tiramisu Kitchen |
| 1 | 10 | 94.0 | 2014-07-29 | 1 | 94104 | H24 | 94104 | Tiramisu Kitchen |
| 2 | 10 | 82.0 | 2016-05-03 | 2 | 94104 | H24 | 94104 | Tiramisu Kitchen |
| 3 | 24 | 96.0 | 2014-06-12 | 3 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
| 4 | 24 | 96.0 | 2014-11-24 | 4 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10390 | 88878 | 94.0 | 2016-08-19 | 10390 | 94102 | H24 | 94566 | Jamba Juice |
| 10391 | 89072 | 90.0 | 2016-09-22 | 10391 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Catholic Prep School |
| 10392 | 89198 | 100.0 | 2016-09-12 | 10392 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level, Sec. 333 |
| 10393 | 89199 | 100.0 | 2016-09-12 | 10393 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 140 |


In [21]:
population_test_pd["id"] = population_test_pd.index

population_test_pd = population_test_pd.merge(
    businesses_pd,
    on="business_id"
)

population_test_pd

| | business_id | score | date | id | postal_code | tax_code | owner_zip | name |
|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 100.0 | 2013-11-18 | 0 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
| 1 | 24 | 96.0 | 2016-03-11 | 1 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
| 2 | 45 | 94.0 | 2013-12-09 | 2 | 94110 | H24 | 94114 | CHARLIE'S DELI CAFE |
| 3 | 58 | 78.0 | 2014-07-25 | 3 | 94111 | H24 | 94111 | Oasis Grill |
| 4 | 66 | 91.0 | 2014-05-19 | 4 | 94122 | H24 | 94122 | STARBUCKS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2487 | 87802 | 91.0 | 2016-06-07 | 2487 | 94110 | H25 | 94110 | Bernal Heights Pizzeria |
| 2488 | 88082 | 84.0 | 2016-08-30 | 2488 | 94133 | H24 | 94133 | Chongqing Xiaomian |
| 2489 | 88447 | 96.0 | 2016-08-17 | 2489 | | H91 | 94107 | Fare Resources |
| 2490 | 88702 | 96.0 | 2016-08-15 | 2490 | 94118 | H25 | 94118 | Dancing Bull |


In [22]:
def prepare_peripheral(violations_pd, train_or_test):
    """
    Helper function that imitates the behavior of 
    the data model defined above.
    """
    violations_new = violations_pd.merge(
        train_or_test[["id", "business_id", "date"]],
        on="business_id"
    )

    violations_new = violations_new[
        violations_new["date_x"] < violations_new["date_y"]
    ]

    del violations_new["date_y"]
    del violations_new["business_id"]

    return violations_new.rename(columns={"date_x": "date"})
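
To see what this helper does, the sketch below runs a compact copy of it on a toy example with three inspections and two violations of a single business. The toy frames are for illustration only, and the compact copy uses `drop` instead of `del`, which is a stylistic change and not the notebook's exact code.

```python
import pandas as pd


def prepare_peripheral(peripheral, population):
    """Compact copy of the helper above: keep rows dated strictly before the population row."""
    merged = peripheral.merge(
        population[["id", "business_id", "date"]], on="business_id"
    )
    merged = merged[merged["date_x"] < merged["date_y"]]
    merged = merged.drop(columns=["date_y", "business_id"])
    return merged.rename(columns={"date_x": "date"})


# Toy population table: three inspections of business 10.
population = pd.DataFrame({
    "id": [0, 1, 2],
    "business_id": [10, 10, 10],
    "date": pd.to_datetime(["2014-01-14", "2014-07-29", "2016-05-03"]),
})

# Toy peripheral table: two violations of business 10.
toy_violations = pd.DataFrame({
    "business_id": [10, 10],
    "date": pd.to_datetime(["2014-07-29", "2014-01-14"]),
    "violation_type_id": ["103129", "103119"],
})

result = prepare_peripheral(toy_violations, population)
print(result.sort_values(["id", "date"]))
```

Each violation is duplicated once for every later inspection of the same business and tagged with that inspection's `id`. The earliest inspection (`id` 0) receives no rows, because no violation predates it. Since all dates in this dataset are whole days, the strict comparison plays the role of the one-day horizon used in the getML data model.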

In [23]:
violations_train_pd = prepare_peripheral(violations_pd, population_train_pd)
violations_test_pd = prepare_peripheral(violations_pd, population_test_pd)
violations_train_pd

| | violation_type_id | risk_category | description | date | id |
|---|---|---|---|---|---|
| 2 | 103129 | Moderate Risk | Insufficient hot water or running water | 2014-07-29 | 2 |
| 5 | 103144 | Low Risk | Unapproved or unmaintained equipment or utensils | 2014-07-29 | 2 |
| 7 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2014-01-14 | 1 |
| 8 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2014-01-14 | 2 |
| 10 | 103145 | Low Risk | Improper storage of equipment utensils or linens | 2014-01-14 | 1 |
| ... | ... | ... | ... | ... | ... |
| 89220 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2016-02-16 | 10290 |
| 89256 | 103131 | Moderate Risk | Moderate risk vermin infestation | 2016-04-04 | 10308 |
| 89336 | 103154 | Low Risk | Unclean or degraded floors walls or ceilings | 2016-04-11 | 10331 |
| 89338 | 103148 | Low Risk | No thermometers or uncalibrated thermometers | 2016-04-11 | 10331 |


In [24]:
inspections_train_pd = prepare_peripheral(inspections_pd, population_train_pd)
inspections_test_pd = prepare_peripheral(inspections_pd, population_test_pd)
inspections_train_pd

| | score | date | id |
|---|---|---|---|
| 1 | 92.0 | 2014-01-14 | 1 |
| 2 | 92.0 | 2014-01-14 | 2 |
| 5 | 94.0 | 2014-07-29 | 2 |
| 9 | 100.0 | 2013-11-18 | 3 |
| 10 | 100.0 | 2013-11-18 | 4 |
| ... | ... | ... | ... |
| 32628 | 92.0 | 2016-02-16 | 10290 |
| 32648 | 96.0 | 2016-04-04 | 10308 |
| 32673 | 94.0 | 2016-04-11 | 10331 |
| 32707 | 100.0 | 2016-05-23 | 10360 |


In [25]:
del population_train_pd["business_id"]
del population_test_pd["business_id"]

In [26]:
population_train_pd

| | score | date | id | postal_code | tax_code | owner_zip | name |
|---|---|---|---|---|---|---|---|
| 0 | 92.0 | 2014-01-14 | 0 | 94104 | H24 | 94104 | Tiramisu Kitchen |
| 1 | 94.0 | 2014-07-29 | 1 | 94104 | H24 | 94104 | Tiramisu Kitchen |
| 2 | 82.0 | 2016-05-03 | 2 | 94104 | H24 | 94104 | Tiramisu Kitchen |
| 3 | 96.0 | 2014-06-12 | 3 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
| 4 | 96.0 | 2014-11-24 | 4 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10390 | 94.0 | 2016-08-19 | 10390 | 94102 | H24 | 94566 | Jamba Juice |
| 10391 | 90.0 | 2016-09-22 | 10391 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Catholic Prep School |
| 10392 | 100.0 | 2016-09-12 | 10392 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level, Sec. 333 |
| 10393 | 100.0 | 2016-09-12 | 10393 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 140 |


In [27]:
def add_index(df):
    df.insert(0, "index", range(len(df)))

population_pd_logical_types = {
    "id": ww.logical_types.Integer,
    "score": ww.logical_types.Integer,
    "date": ww.logical_types.Datetime,
    "postal_code": ww.logical_types.Categorical,
    "tax_code": ww.logical_types.Categorical,
    "owner_zip": ww.logical_types.Categorical,
    "name": ww.logical_types.Categorical
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")
population_test_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")

add_index(inspections_train_pd)
add_index(inspections_test_pd)
inspections_pd_logical_types = {
    "index": ww.logical_types.Integer,
    "score": ww.logical_types.Integer,
    "date": ww.logical_types.Datetime,
    "id": ww.logical_types.Integer
}
inspections_train_pd.ww.init(logical_types=inspections_pd_logical_types, index="index", name="inspections")
inspections_test_pd.ww.init(logical_types=inspections_pd_logical_types, index="index", name="inspections")

add_index(violations_train_pd)
add_index(violations_test_pd)
violations_pd_logical_types = {
    "index": ww.logical_types.Integer,
    "violation_type_id": ww.logical_types.Categorical,
    "risk_category": ww.logical_types.Categorical,
    "description": ww.logical_types.Categorical,
    "date": ww.logical_types.Datetime,
    "id": ww.logical_types.Integer
}
violations_train_pd.ww.init(logical_types=violations_pd_logical_types, index="index", name="violations")
violations_test_pd.ww.init(logical_types=violations_pd_logical_types, index="index", name="violations")

In [28]:
dataframes_train = {
    "population" : (population_train_pd, ),
    "inspections" : (inspections_train_pd, ),
    "violations" : (violations_train_pd, )
}

In [29]:
dataframes_test = {
    "population" : (population_test_pd, ),
    "inspections" : (inspections_test_pd, ),
    "violations" : (violations_test_pd, )
}

In [30]:
relationships = [
    ("population", "id", "inspections", "id"),
    ("population", "id", "violations", "id")
]

In [31]:
featuretools_train_pd = featuretools.dfs(
    dataframes=dataframes_train,
    relationships=relationships,
    target_dataframe_name="population")[0]

In [32]:
featuretools_test_pd = featuretools.dfs(
    dataframes=dataframes_test,
    relationships=relationships,
    target_dataframe_name="population")[0]

In [33]:
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")

In [34]:
featuretools_train.set_role("score", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)

featuretools_train

name,score,postal_code,tax_code,owner_zip,name,COUNT(inspections),COUNT(violations),MODE(violations.description),MODE(violations.risk_category),MODE(violations.violation_type_id),NUM_UNIQUE(violations.description),NUM_UNIQUE(violations.risk_category),NUM_UNIQUE(violations.violation_type_id),DAY(date),MONTH(date),WEEKDAY(date),YEAR(date),MODE(inspections.DAY(date)),MODE(inspections.MONTH(date)),MODE(inspections.WEEKDAY(date)),MODE(inspections.YEAR(date)),NUM_UNIQUE(inspections.DAY(date)),NUM_UNIQUE(inspections.MONTH(date)),NUM_UNIQUE(inspections.WEEKDAY(date)),NUM_UNIQUE(inspections.YEAR(date)),MODE(violations.DAY(date)),MODE(violations.MONTH(date)),MODE(violations.WEEKDAY(date)),MODE(violations.YEAR(date)),NUM_UNIQUE(violations.DAY(date)),NUM_UNIQUE(violations.MONTH(date)),NUM_UNIQUE(violations.WEEKDAY(date)),NUM_UNIQUE(violations.YEAR(date)),MAX(inspections.score),MEAN(inspections.score),MIN(inspections.score),SKEW(inspections.score),STD(inspections.score),SUM(inspections.score)
role,target,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,numerical,numerical,numerical,numerical,numerical,numerical
0.0,92,94104,H24,94104,Tiramisu Kitchen,0,0,,,,,,,14,1,1,2014,,,,,,,,,,,,,,,,,,,,,,0
1.0,94,94104,H24,94104,Tiramisu Kitchen,1,3,Improper storage of equipment ut...,Low Risk,103119,3,2,3,29,7,1,2014,14,1,1,2014,1,1,1,1,14,1,1,2014,1,1,1,1,92,92,92,,,92
2.0,82,94104,H24,94104,Tiramisu Kitchen,2,5,Improper storage of equipment ut...,Low Risk,103119,5,2,5,3,5,1,2016,14,1,1,2014,2,2,1,1,14,1,1,2014,2,2,1,1,94,93,92,,1.4142,186
3.0,96,94104,H24,94104,OMNI S.F. Hotel - 2nd Floor Pant...,1,0,,,,,,,12,6,3,2014,18,11,0,2013,1,1,1,1,,,,,,,,,100,100,100,,,100
4.0,96,94104,H24,94104,OMNI S.F. Hotel - 2nd Floor Pant...,2,2,Improper storage of equipment ut...,Low Risk,103145,2,1,2,24,11,0,2014,12,6,0,2013,2,2,2,2,12,6,3,2014,1,1,1,1,100,98,96,,2.8284,196
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10390.0,94,94102,H24,94566,Jamba Juice,0,0,,,,,,,19,8,4,2016,,,,,,,,,,,,,,,,,,,,,,0
10391.0,90,94109,H91,94109,Epicurean at Sacred Heart Cathol...,0,0,,,,,,,22,9,3,2016,,,,,,,,,,,,,,,,,,,,,,0
10392.0,100,94107,H36,29615,AT&T Park - Beer Cart/View Level...,0,0,,,,,,,12,9,0,2016,,,,,,,,,,,,,,,,,,,,,,0
10393.0,100,94107,H36,29615,"AT&T Park - Beer Cart/Lower CF, ...",0,0,,,,,,,12,9,0,2016,,,,,,,,,,,,,,,,,,,,,,0


In [35]:
featuretools_test.set_role("score", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)

featuretools_test

name,score,postal_code,tax_code,owner_zip,name,COUNT(inspections),COUNT(violations),MODE(violations.description),MODE(violations.risk_category),MODE(violations.violation_type_id),NUM_UNIQUE(violations.description),NUM_UNIQUE(violations.risk_category),NUM_UNIQUE(violations.violation_type_id),DAY(date),MONTH(date),WEEKDAY(date),YEAR(date),MODE(inspections.DAY(date)),MODE(inspections.MONTH(date)),MODE(inspections.WEEKDAY(date)),MODE(inspections.YEAR(date)),NUM_UNIQUE(inspections.DAY(date)),NUM_UNIQUE(inspections.MONTH(date)),NUM_UNIQUE(inspections.WEEKDAY(date)),NUM_UNIQUE(inspections.YEAR(date)),MODE(violations.DAY(date)),MODE(violations.MONTH(date)),MODE(violations.WEEKDAY(date)),MODE(violations.YEAR(date)),NUM_UNIQUE(violations.DAY(date)),NUM_UNIQUE(violations.MONTH(date)),NUM_UNIQUE(violations.WEEKDAY(date)),NUM_UNIQUE(violations.YEAR(date)),MAX(inspections.score),MEAN(inspections.score),MIN(inspections.score),SKEW(inspections.score),STD(inspections.score),SUM(inspections.score)
role,target,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,categorical,numerical,numerical,numerical,numerical,numerical,numerical
0.0,100,94104,H24,94104,OMNI S.F. Hotel - 2nd Floor Pant...,0,0,,,,,,,18,11,0,2013,,,,,,,,,,,,,,,,,,,,,,0
1.0,96,94104,H24,94104,OMNI S.F. Hotel - 2nd Floor Pant...,3,3,Improper storage of equipment ut...,Low Risk,103119,3,2,3,11,3,4,2016,12,11,0,2014,3,2,2,2,12,6,3,2014,2,2,2,1,100,97.3333,96,1.7321,2.3094,292
2.0,94,94110,H24,94114,CHARLIE'S DELI CAFE,0,0,,,,,,,9,12,0,2013,,,,,,,,,,,,,,,,,,,,,,0
3.0,78,94111,H24,94111,Oasis Grill,0,0,,,,,,,25,7,4,2014,,,,,,,,,,,,,,,,,,,,,,0
4.0,91,94122,H24,94122,STARBUCKS,1,1,Wiping cloths not clean or prope...,Low Risk,103149,1,1,1,19,5,0,2014,10,2,0,2014,1,1,1,1,10,2,0,2014,1,1,1,1,98,98,98,,,98
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2487.0,91,94110,H25,94110,Bernal Heights Pizzeria,0,0,,,,,,,7,6,1,2016,,,,,,,,,,,,,,,,,,,,,,0
2488.0,84,94133,H24,94133,Chongqing Xiaomian,0,0,,,,,,,30,8,1,2016,,,,,,,,,,,,,,,,,,,,,,0
2489.0,96,,H91,94107,Fare Resources,0,0,,,,,,,17,8,2,2016,,,,,,,,,,,,,,,,,,,,,,0
2490.0,96,94118,H25,94118,Dancing Bull,0,0,,,,,,,15,8,0,2016,,,,,,,,,,,,,,,,,,,,,,0


We train an untuned XGBoostRegressor on top of featuretools' features, just like we have done for getML's features.

Since some of featuretools' features are categorical, we allow the pipeline to include these features as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.

In [36]:
imputation = getml.preprocessors.Imputation()

predictor = getml.predictors.XGBoostRegressor(n_jobs=1)

pipe2 = getml.pipeline.Pipeline(
    tags=['featuretools'],
    preprocessors=[imputation],
    predictors=[predictor],
    include_categorical=True,
)

pipe2
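
To illustrate why imputation is needed here, the sketch below fills missing values with the mean of the observed values and records which entries were missing. Mean imputation is an assumption made for this illustration; the exact strategy of getML's Imputation preprocessor should be taken from its documentation. The helper `impute_mean` is hypothetical.

```python
import math


def impute_mean(values):
    """Replace NaN entries with the mean of the observed entries.

    Also returns a flag per entry indicating whether it was missing.
    """
    observed = [v for v in values if not math.isnan(v)]
    fill = sum(observed) / len(observed)
    imputed = [fill if math.isnan(v) else v for v in values]
    was_missing = [math.isnan(v) for v in values]
    return imputed, was_missing


nan = float("nan")
# STD(inspections.score) for the first five rows of the training table above.
std_scores = [nan, nan, 1.4142, nan, 2.8284]

imputed, was_missing = impute_mean(std_scores)
print(imputed)
print(was_missing)
```

Features such as `STD(inspections.score)` are undefined whenever fewer than two earlier inspections exist, which is why so many entries in the featuretools tables are empty.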

In [37]:
pipe2.fit(featuretools_train)

Checking data model...
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          

To see the issues in full, run .check() on the pipeline.

Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:02, remaining: 00:00]          

Trained pipeline.
Time taken: 0h:0m:2.007516



In [38]:
featuretools_score = pipe2.score(featuretools_test)
featuretools_score

Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]          



| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2024-02-21 15:09:56 | featuretools_train | score | 5.1474 | 6.7615 | 0.3184 |
| 1 | 2024-02-21 15:09:56 | featuretools_test | score | 5.46 | 7.2008 | 0.2612 |


#### 2.6 Features

The most important feature looks as follows:

In [39]:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]

```sql
DROP TABLE IF EXISTS "FEATURE_1_43";

CREATE TABLE "FEATURE_1_43" AS
SELECT COUNT( t1."date" - t2."date"  ) - COUNT( DISTINCT t1."date" - t2."date" ) AS "feature_1_43",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "VIOLATIONS__STAGING_TABLE_3" t2
ON t1."business_id" = t2."business_id"
WHERE t2."date__1_000000_days" <= t1."date"
GROUP BY t1.rowid;
```
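
Read in plain terms, this feature counts all earlier violations of the business and subtracts the number of distinct time differences to the current inspection. A minimal Python sketch of the same computation follows. The function name is hypothetical, and unlike the SQL, which produces no row when the inner join finds no match, the sketch returns 0 in that case.

```python
from datetime import date, timedelta


def count_minus_count_distinct(inspection_date, violation_dates, horizon=timedelta(days=1)):
    """Number of usable violations minus the number of distinct time differences."""
    diffs = [
        inspection_date - d
        for d in violation_dates
        if d + horizon <= inspection_date
    ]
    return len(diffs) - len(set(diffs))


# Dates of the five violations of business 10 shown in the violations table above.
violation_dates = [date(2014, 7, 29)] * 2 + [date(2014, 1, 14)] * 3

value_2016 = count_minus_count_distinct(date(2016, 5, 3), violation_dates)
value_2014 = count_minus_count_distinct(date(2014, 7, 29), violation_dates)
print(value_2016, value_2014)
```

Because violations found during the same inspection share a date, the feature roughly measures how many earlier violations were recorded beyond the first one per inspection day.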

#### 2.7 Productionization

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's `sqlite3` and `spark` modules.

In [40]:
# Creates a folder named sfscores_pipeline containing
# the SQL code.
pipe1.features.to_sql().save("sfscores_pipeline")

In [41]:
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("sfscores_spark")
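
As a rough illustration of what running transpiled SQL looks like, the sketch below executes the `SELECT` statement of the feature shown in section 2.6 against an in-memory SQLite database, using Python's standard `sqlite3` module (not getML's module of the same name). The staging tables are tiny toy tables whose contents are made up, and dates are stored as plain day counts, which is a simplifying assumption and may not match how getML encodes time stamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy staging tables; names and columns follow the feature's SQL code.
conn.executescript("""
CREATE TABLE "POPULATION__STAGING_TABLE_1" ("business_id" INTEGER, "date" REAL);
CREATE TABLE "VIOLATIONS__STAGING_TABLE_3" (
    "business_id" INTEGER, "date" REAL, "date__1_000000_days" REAL
);
INSERT INTO "POPULATION__STAGING_TABLE_1" VALUES (10, 200.0), (10, 900.0);
INSERT INTO "VIOLATIONS__STAGING_TABLE_3" VALUES
    (10, 100.0, 101.0), (10, 100.0, 101.0),
    (10, 200.0, 201.0), (10, 200.0, 201.0);
""")

feature_sql = """
SELECT COUNT( t1."date" - t2."date" ) - COUNT( DISTINCT t1."date" - t2."date" ) AS "feature_1_43",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "VIOLATIONS__STAGING_TABLE_3" t2
ON t1."business_id" = t2."business_id"
WHERE t2."date__1_000000_days" <= t1."date"
GROUP BY t1.rowid;
"""

rows = sorted(conn.execute(feature_sql).fetchall(), key=lambda r: r[1])
print(rows)
conn.close()
```

In a real deployment, the staging tables would be produced by the staging scripts that are saved alongside the feature scripts.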

#### 2.8 Discussion

For a more convenient overview, we summarize our results into a table.

In [42]:
scores = [fastprop_score, featuretools_score]
pd.DataFrame(data={
    'Name': ['getML: FastProp', 'featuretools'],
    'R-squared': [f'{score.rsquared:.1%}' for score in scores],
    'RMSE': [f'{score.rmse:,.2f}' for score in scores],
    'MAE': [f'{score.mae:,.2f}' for score in scores]
})

| | Name | R-squared | RMSE | MAE |
|---|---|---|---|---|
| 0 | getML: FastProp | 28.9% | 7.05 | 5.32 |
| 1 | featuretools | 26.1% | 7.20 | 5.46 |


As we can see, getML's FastProp outperforms featuretools according to all three measures.
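
To put the size of the difference into perspective, the following sketch computes the relative error reduction and the absolute gap in R-squared from the test-set metrics reported in the two score tables above. The variable and function names are hypothetical.

```python
# Test-set metrics copied from the score tables in sections 2.4 and 2.5.
fastprop_test = {"mae": 5.3218, "rmse": 7.0532, "rsquared": 0.2889}
featuretools_test_metrics = {"mae": 5.46, "rmse": 7.2008, "rsquared": 0.2612}


def relative_reduction(better, worse):
    """Fraction by which `better` is lower than `worse`."""
    return (worse - better) / worse


mae_gain = relative_reduction(fastprop_test["mae"], featuretools_test_metrics["mae"])
rmse_gain = relative_reduction(fastprop_test["rmse"], featuretools_test_metrics["rmse"])
rsquared_gap = fastprop_test["rsquared"] - featuretools_test_metrics["rsquared"]

print(f"MAE reduction:  {mae_gain:.1%}")
print(f"RMSE reduction: {rmse_gain:.1%}")
print(f"R-squared gap:  {rsquared_gap:.4f}")
```

These figures stem from a single random split, so they indicate the direction of the difference but do not establish its statistical significance.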

### 3. Conclusion

We have benchmarked getML against featuretools on a dataset related to health inspections of eateries in San Francisco. We have found that getML's FastProp outperforms featuretools on MAE, RMSE and R-squared.

## References

Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).