# Formula 1 ...

...

Summary:

- Prediction type: __Classification model__
- Domain: __Sports__
- Prediction target: __Wins__ 
- Population size: __...__

_Author: Dr. Patrick Urbanke_

# Background

...

The dataset has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/ErgastF1) (Motl and Schulte, 2015).

We will benchmark [getML](https://www.getml.com)'s feature learning algorithms against [featuretools](https://www.featuretools.com), an open-source implementation of the propositionalization approach that is similar to getML's FastProp.
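
To make the term concrete: propositionalization flattens a relational schema into a single feature table by applying a fixed set of aggregations to the peripheral rows that match each row of the population table. The following is a minimal plain-Python sketch of that idea. The toy tables, the column names and the `propositionalize` helper are invented for illustration and are not part of getML or featuretools.

```python
from statistics import mean

# Hypothetical toy tables (not the real ErgastF1 data).
population = [{"driverId": 1}, {"driverId": 2}, {"driverId": 3}]
lap_times = [
    {"driverId": 1, "milliseconds": 91000},
    {"driverId": 1, "milliseconds": 93000},
    {"driverId": 2, "milliseconds": 95000},
]

# A fixed set of aggregations, applied to the chosen numerical column.
AGGREGATIONS = {"count": len, "avg": mean, "min": min, "max": max}


def propositionalize(population, peripheral, key, column):
    """Builds one feature per aggregation for each population row."""
    features = []
    for row in population:
        values = [p[column] for p in peripheral if p[key] == row[key]]
        feature_row = {key: row[key]}
        for name, agg in AGGREGATIONS.items():
            if values:
                feature_row[f"{name}_{column}"] = agg(values)
            else:
                # Without matching rows only COUNT is defined.
                feature_row[f"{name}_{column}"] = 0 if name == "count" else None
        features.append(feature_row)
    return features


features = propositionalize(population, lap_times, "driverId", "milliseconds")
for feature_row in features:
    print(feature_row)
```

Real implementations add many more aggregations, conditions and time-stamp handling, but the resulting flat table is used in the same way: it is handed to an ordinary predictor.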

### A web frontend for getML

The getML monitor is a web frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and it allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

### Where is this running?

Your getML live session is running inside a docker container on [mybinder.org](https://mybinder.org/), a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import featuretools
import getml

getml.engine.set_project('ErgastF1')



Connected to project 'ErgastF1'


## 1. Loading data

### 1.1 Download from source

We begin by connecting to the MariaDB server of the relational learning repository, from which we will download the data:

In [2]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="ErgastF1",
    port=3306,
    user="guest",
    password="relational"
)

conn

Connection(conn_id='default',
           dbname='ErgastF1',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)

In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame

In [4]:
driverStandings = load_if_needed("driverStandings")
drivers = load_if_needed("drivers")
lapTimes = load_if_needed("lapTimes")
pitStops = load_if_needed("pitStops")
races = load_if_needed("races")
qualifying = load_if_needed("qualifying")

In [5]:
driverStandings

name,driverStandingsId,raceId,driverId,points,position,wins,positionText
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string
0.0,1,18,1,10,1,1,1
1.0,2,18,2,8,2,0,2
2.0,3,18,3,6,3,0,3
3.0,4,18,4,5,4,0,4
4.0,5,18,5,4,5,0,5
,...,...,...,...,...,...,...
31573.0,68456,982,835,8,16,0,16
31574.0,68457,982,154,26,13,0,13
31575.0,68458,982,836,5,18,0,18
31576.0,68459,982,18,0,22,0,22


In [6]:
drivers

name,driverId,number,driverRef,code,forename,surname,dob,nationality,url
role,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,44,hamilton,HAM,Lewis,Hamilton,1985-01-07,British,http://en.wikipedia.org/wiki/Lewis_Hamilton
1.0,2,,heidfeld,HEI,Nick,Heidfeld,1977-05-10,German,http://en.wikipedia.org/wiki/Nick_Heidfeld
2.0,3,6,rosberg,ROS,Nico,Rosberg,1985-06-27,German,http://en.wikipedia.org/wiki/Nico_Rosberg
3.0,4,14,alonso,ALO,Fernando,Alonso,1981-07-29,Spanish,http://en.wikipedia.org/wiki/Fernando_Alonso
4.0,5,,kovalainen,KOV,Heikki,Kovalainen,1981-10-19,Finnish,http://en.wikipedia.org/wiki/Heikki_Kovalainen
,...,...,...,...,...,...,...,...,...
835.0,837,88,haryanto,HAR,Rio,Haryanto,1993-01-22,Indonesian,http://en.wikipedia.org/wiki/Rio_Haryanto
836.0,838,2,vandoorne,VAN,Stoffel,Vandoorne,1992-03-26,Belgian,http://en.wikipedia.org/wiki/Stoffel_Vandoorne
837.0,839,31,ocon,OCO,Esteban,Ocon,1996-09-17,French,http://en.wikipedia.org/wiki/Esteban_Ocon
838.0,840,18,stroll,STR,Lance,Stroll,1998-10-29,Canadian,http://en.wikipedia.org/wiki/Lance_Stroll


In [7]:
lapTimes

name,raceId,driverId,lap,position,milliseconds,time
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string
0.0,1,1,1,13,109088,1:49.088
1.0,1,1,2,12,93740,1:33.740
2.0,1,1,3,11,91600,1:31.600
3.0,1,1,4,10,91067,1:31.067
4.0,1,1,5,10,92129,1:32.129
,...,...,...,...,...,...
420364.0,982,840,54,8,107528,1:47.528
420365.0,982,840,55,8,107512,1:47.512
420366.0,982,840,56,8,108143,1:48.143
420367.0,982,840,57,8,107848,1:47.848


In [8]:
pitStops

name,raceId,driverId,stop,lap,milliseconds,time,duration
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string
0.0,841,1,1,16,23227,17:28:24,23.227
1.0,841,1,2,36,23199,17:59:29,23.199
2.0,841,2,1,15,22994,17:27:41,22.994
3.0,841,2,2,30,25098,17:51:32,25.098
4.0,841,3,1,16,23716,17:29:00,23.716
,...,...,...,...,...,...,...
6065.0,982,839,6,38,29134,21:29:07,29.134
6066.0,982,840,1,1,37403,20:06:43,37.403
6067.0,982,840,2,2,29294,20:10:07,29.294
6068.0,982,840,3,3,25584,20:13:16,25.584


In [9]:
races

name,raceId,year,round,circuitId,name,date,time,url
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string
0.0,1,2009,1,1,Australian Grand Prix,2009-03-29,06:00:00,http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix
1.0,2,2009,2,2,Malaysian Grand Prix,2009-04-05,09:00:00,http://en.wikipedia.org/wiki/2009_Malaysian_Grand_Prix
2.0,3,2009,3,17,Chinese Grand Prix,2009-04-19,07:00:00,http://en.wikipedia.org/wiki/2009_Chinese_Grand_Prix
3.0,4,2009,4,3,Bahrain Grand Prix,2009-04-26,12:00:00,http://en.wikipedia.org/wiki/2009_Bahrain_Grand_Prix
4.0,5,2009,5,4,Spanish Grand Prix,2009-05-10,12:00:00,http://en.wikipedia.org/wiki/2009_Spanish_Grand_Prix
,...,...,...,...,...,...,...,...
971.0,984,2017,16,22,Japanese Grand Prix,2017-10-08,05:00:00,https://en.wikipedia.org/wiki/2017_Japanese_Grand_Prix
972.0,985,2017,17,69,United States Grand Prix,2017-10-22,19:00:00,https://en.wikipedia.org/wiki/2017_United_States_Grand_Prix
973.0,986,2017,18,32,Mexican Grand Prix,2017-10-29,19:00:00,https://en.wikipedia.org/wiki/2017_Mexican_Grand_Prix
974.0,987,2017,19,18,Brazilian Grand Prix,2017-11-12,16:00:00,https://en.wikipedia.org/wiki/2017_Brazilian_Grand_Prix


In [10]:
qualifying

name,qualifyId,raceId,driverId,constructorId,number,position,q1,q2,q3
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string
0.0,1,18,1,1,22,1,1:26.572,1:25.187,1:26.714
1.0,2,18,9,2,4,2,1:26.103,1:25.315,1:26.869
2.0,3,18,5,1,23,3,1:25.664,1:25.452,1:27.079
3.0,4,18,13,6,2,4,1:25.994,1:25.691,1:27.178
4.0,5,18,2,2,3,5,1:25.960,1:25.518,1:27.236
,...,...,...,...,...,...,...,...,...
7392.0,7415,982,825,210,20,16,1:43.756,,
7393.0,7416,982,13,3,19,17,1:44.014,,
7394.0,7417,982,840,3,18,18,1:44.728,,
7395.0,7418,982,836,15,94,19,1:45.059,,


### 1.2 Prepare data for getML

In [11]:
racesPd = races.to_pandas()
racesPd

,raceId,year,round,circuitId,name,date,time,url
0,1.0,2009.0,1.0,1.0,Australian Grand Prix,2009-03-29,06:00:00,http://en.wikipedia.org/wiki/2009_Australian_G...
1,2.0,2009.0,2.0,2.0,Malaysian Grand Prix,2009-04-05,09:00:00,http://en.wikipedia.org/wiki/2009_Malaysian_Gr...
2,3.0,2009.0,3.0,17.0,Chinese Grand Prix,2009-04-19,07:00:00,http://en.wikipedia.org/wiki/2009_Chinese_Gran...
3,4.0,2009.0,4.0,3.0,Bahrain Grand Prix,2009-04-26,12:00:00,http://en.wikipedia.org/wiki/2009_Bahrain_Gran...
4,5.0,2009.0,5.0,4.0,Spanish Grand Prix,2009-05-10,12:00:00,http://en.wikipedia.org/wiki/2009_Spanish_Gran...
...,...,...,...,...,...,...,...,...
971,984.0,2017.0,16.0,22.0,Japanese Grand Prix,2017-10-08,05:00:00,https://en.wikipedia.org/wiki/2017_Japanese_Gr...
972,985.0,2017.0,17.0,69.0,United States Grand Prix,2017-10-22,19:00:00,https://en.wikipedia.org/wiki/2017_United_Stat...
973,986.0,2017.0,18.0,32.0,Mexican Grand Prix,2017-10-29,19:00:00,https://en.wikipedia.org/wiki/2017_Mexican_Gra...
974,987.0,2017.0,19.0,18.0,Brazilian Grand Prix,2017-11-12,16:00:00,https://en.wikipedia.org/wiki/2017_Brazilian_G...


In [12]:
driverStandingsPd = driverStandings.to_pandas()

driverStandingsPd = driverStandingsPd.merge(
    racesPd[["raceId", "year", "date", "round"]],
    on="raceId"
)

previousStanding = driverStandingsPd.merge(
    driverStandingsPd[["driverId", "year", "wins", "round"]],
    on=["driverId", "year"],
)

isPreviousRound = (previousStanding["round_x"] - previousStanding["round_y"] == 1.0)

previousStanding = previousStanding[isPreviousRound].copy()

previousStanding["win"] = previousStanding["wins_x"] - previousStanding["wins_y"]

driverStandingsPd = driverStandingsPd.merge(
    previousStanding[["raceId", "driverId", "win"]],
    on=["raceId", "driverId"],
    how="left",
)

# Rows without a standing for the preceding round fall back on the cumulative wins.
driverStandingsPd["win"] = driverStandingsPd["win"].fillna(driverStandingsPd["wins"])

driver_standings = getml.data.DataFrame.from_pandas(driverStandingsPd, "driver_standings")

driver_standings

name,driverStandingsId,raceId,driverId,points,position,wins,year,round,win,positionText,date
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string
0.0,1,18,1,10,1,1,2008,1,1,1,2008-03-16
1.0,2,18,2,8,2,0,2008,1,0,2,2008-03-16
2.0,3,18,3,6,3,0,2008,1,0,3,2008-03-16
3.0,4,18,4,5,4,0,2008,1,0,4,2008-03-16
4.0,5,18,5,4,5,0,2008,1,0,5,2008-03-16
,...,...,...,...,...,...,...,...,...,...,...
31573.0,68456,982,835,8,16,0,2017,14,0,16,2017-09-17
31574.0,68457,982,154,26,13,0,2017,14,0,13,2017-09-17
31575.0,68458,982,836,5,18,0,2017,14,0,18,2017-09-17
31576.0,68459,982,18,0,22,0,2017,14,0,22,2017-09-17
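
The cell above derives a per-race `win` value from the cumulative `wins` column: within each driver and season, it subtracts the standing after the preceding round, and where no preceding round is on record it falls back on `wins` itself. The same logic on made-up numbers, as a minimal plain-Python sketch in which `per_race_wins` is a hypothetical helper:

```python
# Cumulative wins after each round for one hypothetical driver and season,
# given as (round, wins) pairs.
standings = [(1, 0), (2, 1), (3, 1), (4, 2)]


def per_race_wins(standings):
    """Converts cumulative wins into the number of wins per round."""
    wins_by_round = dict(standings)
    result = {}
    for round_, wins in standings:
        previous = wins_by_round.get(round_ - 1)
        # No preceding round on record: use the cumulative count as is.
        result[round_] = wins if previous is None else wins - previous
    return result


win = per_race_wins(standings)
print(win)
```

Note that the fallback is not limited to the first round of a season. If a driver has no standing for the directly preceding round, the value is the cumulative count, which may exceed 1.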


In [13]:
lapTimesPd = lapTimes.to_pandas()

lapTimesPd = lapTimesPd.merge(
    racesPd[["raceId", "date", "year"]],
    on="raceId"
)

lap_times = getml.data.DataFrame.from_pandas(lapTimesPd, "lap_times")

lap_times

name,raceId,driverId,lap,position,milliseconds,year,time,date
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string
0.0,1,1,1,13,109088,2009,1:49.088,2009-03-29
1.0,1,1,2,12,93740,2009,1:33.740,2009-03-29
2.0,1,1,3,11,91600,2009,1:31.600,2009-03-29
3.0,1,1,4,10,91067,2009,1:31.067,2009-03-29
4.0,1,1,5,10,92129,2009,1:32.129,2009-03-29
,...,...,...,...,...,...,...,...
420364.0,982,840,54,8,107528,2017,1:47.528,2017-09-17
420365.0,982,840,55,8,107512,2017,1:47.512,2017-09-17
420366.0,982,840,56,8,108143,2017,1:48.143,2017-09-17
420367.0,982,840,57,8,107848,2017,1:47.848,2017-09-17


In [14]:
pitStopsPd = pitStops.to_pandas()

pitStopsPd = pitStopsPd.merge(
    racesPd[["raceId", "date", "year"]],
    on="raceId"
)

pit_stops = getml.data.DataFrame.from_pandas(pitStopsPd, "pit_stops")

pit_stops

name,raceId,driverId,stop,lap,milliseconds,year,time,duration,date
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string
0.0,841,1,1,16,23227,2011,17:28:24,23.227,2011-03-27
1.0,841,1,2,36,23199,2011,17:59:29,23.199,2011-03-27
2.0,841,2,1,15,22994,2011,17:27:41,22.994,2011-03-27
3.0,841,2,2,30,25098,2011,17:51:32,25.098,2011-03-27
4.0,841,3,1,16,23716,2011,17:29:00,23.716,2011-03-27
,...,...,...,...,...,...,...,...,...
6065.0,982,839,6,38,29134,2017,21:29:07,29.134,2017-09-17
6066.0,982,840,1,1,37403,2017,20:06:43,37.403,2017-09-17
6067.0,982,840,2,2,29294,2017,20:10:07,29.294,2017-09-17
6068.0,982,840,3,3,25584,2017,20:13:16,25.584,2017-09-17


getML requires that we define *roles* for each of the columns.

In [15]:
driver_standings.set_role("win", getml.data.roles.target)
driver_standings.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
driver_standings.set_role("position", getml.data.roles.numerical)
driver_standings.set_role("date", getml.data.roles.time_stamp)

driver_standings

name,date,raceId,driverId,year,win,position,driverStandingsId,points,wins,round,positionText
role,time_stamp,join_key,join_key,join_key,target,numerical,unused_float,unused_float,unused_float,unused_float,unused_string
unit,"time stamp, comparison only",,,,,,,,,,
0.0,2008-03-16,18,1,2008,1,1,1,10,1,1,1
1.0,2008-03-16,18,2,2008,0,2,2,8,0,1,2
2.0,2008-03-16,18,3,2008,0,3,3,6,0,1,3
3.0,2008-03-16,18,4,2008,0,4,4,5,0,1,4
4.0,2008-03-16,18,5,2008,0,5,5,4,0,1,5
,...,...,...,...,...,...,...,...,...,...,...
31573.0,2017-09-17,982,835,2017,0,16,68456,8,0,14,16
31574.0,2017-09-17,982,154,2017,0,13,68457,26,0,14,13
31575.0,2017-09-17,982,836,2017,0,18,68458,5,0,14,18
31576.0,2017-09-17,982,18,2017,0,22,68459,0,0,14,22


In [16]:
drivers.set_role("driverId", getml.data.roles.join_key)
drivers.set_role(["nationality", "driverRef"], getml.data.roles.categorical)

drivers

name,driverId,nationality,driverRef,number,code,forename,surname,dob,url
role,join_key,categorical,categorical,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,British,hamilton,44,HAM,Lewis,Hamilton,1985-01-07,http://en.wikipedia.org/wiki/Lewis_Hamilton
1.0,2,German,heidfeld,,HEI,Nick,Heidfeld,1977-05-10,http://en.wikipedia.org/wiki/Nick_Heidfeld
2.0,3,German,rosberg,6,ROS,Nico,Rosberg,1985-06-27,http://en.wikipedia.org/wiki/Nico_Rosberg
3.0,4,Spanish,alonso,14,ALO,Fernando,Alonso,1981-07-29,http://en.wikipedia.org/wiki/Fernando_Alonso
4.0,5,Finnish,kovalainen,,KOV,Heikki,Kovalainen,1981-10-19,http://en.wikipedia.org/wiki/Heikki_Kovalainen
,...,...,...,...,...,...,...,...,...
835.0,837,Indonesian,haryanto,88,HAR,Rio,Haryanto,1993-01-22,http://en.wikipedia.org/wiki/Rio_Haryanto
836.0,838,Belgian,vandoorne,2,VAN,Stoffel,Vandoorne,1992-03-26,http://en.wikipedia.org/wiki/Stoffel_Vandoorne
837.0,839,French,ocon,31,OCO,Esteban,Ocon,1996-09-17,http://en.wikipedia.org/wiki/Esteban_Ocon
838.0,840,Canadian,stroll,18,STR,Lance,Stroll,1998-10-29,http://en.wikipedia.org/wiki/Lance_Stroll


In [17]:
lap_times.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
lap_times.set_role(["lap", "milliseconds", "position"], getml.data.roles.numerical)
lap_times.set_role("date", getml.data.roles.time_stamp)

lap_times

name,date,raceId,driverId,year,lap,milliseconds,position,time
role,time_stamp,join_key,join_key,join_key,numerical,numerical,numerical,unused_string
unit,"time stamp, comparison only",,,,,,,
0.0,2009-03-29,1,1,2009,1,109088,13,1:49.088
1.0,2009-03-29,1,1,2009,2,93740,12,1:33.740
2.0,2009-03-29,1,1,2009,3,91600,11,1:31.600
3.0,2009-03-29,1,1,2009,4,91067,10,1:31.067
4.0,2009-03-29,1,1,2009,5,92129,10,1:32.129
,...,...,...,...,...,...,...,...
420364.0,2017-09-17,982,840,2017,54,107528,8,1:47.528
420365.0,2017-09-17,982,840,2017,55,107512,8,1:47.512
420366.0,2017-09-17,982,840,2017,56,108143,8,1:48.143
420367.0,2017-09-17,982,840,2017,57,107848,8,1:47.848


In [18]:
pit_stops.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
pit_stops.set_role(["lap", "milliseconds", "stop"], getml.data.roles.numerical)
pit_stops.set_role("date", getml.data.roles.time_stamp)

pit_stops

name,date,raceId,driverId,year,lap,milliseconds,stop,time,duration
role,time_stamp,join_key,join_key,join_key,numerical,numerical,numerical,unused_string,unused_string
unit,"time stamp, comparison only",,,,,,,,
0.0,2011-03-27,841,1,2011,16,23227,1,17:28:24,23.227
1.0,2011-03-27,841,1,2011,36,23199,2,17:59:29,23.199
2.0,2011-03-27,841,2,2011,15,22994,1,17:27:41,22.994
3.0,2011-03-27,841,2,2011,30,25098,2,17:51:32,25.098
4.0,2011-03-27,841,3,2011,16,23716,1,17:29:00,23.716
,...,...,...,...,...,...,...,...,...
6065.0,2017-09-17,982,839,2017,38,29134,6,21:29:07,29.134
6066.0,2017-09-17,982,840,2017,1,37403,1,20:06:43,37.403
6067.0,2017-09-17,982,840,2017,2,29294,2,20:10:07,29.294
6068.0,2017-09-17,982,840,2017,3,25584,3,20:13:16,25.584


In [19]:
qualifying.set_role(["raceId", "driverId", "qualifyId"], getml.data.roles.join_key)
qualifying.set_role(["position", "number"], getml.data.roles.numerical)

qualifying

name,raceId,driverId,qualifyId,position,number,constructorId,q1,q2,q3
role,join_key,join_key,join_key,numerical,numerical,unused_float,unused_string,unused_string,unused_string
0.0,18,1,1,1,22,1,1:26.572,1:25.187,1:26.714
1.0,18,9,2,2,4,2,1:26.103,1:25.315,1:26.869
2.0,18,5,3,3,23,1,1:25.664,1:25.452,1:27.079
3.0,18,13,4,4,2,6,1:25.994,1:25.691,1:27.178
4.0,18,2,5,5,3,2,1:25.960,1:25.518,1:27.236
,...,...,...,...,...,...,...,...,...
7392.0,982,825,7415,16,20,210,1:43.756,,
7393.0,982,13,7416,17,19,3,1:44.014,,
7394.0,982,840,7417,18,18,3,1:44.728,,
7395.0,982,836,7418,19,94,15,1:45.059,,


## 2. Predictive modeling

We have loaded the data and assigned roles to the columns. Next, we create a getML pipeline for relational learning.

In [20]:
split = getml.data.split.random(train=0.8, test=0.2)
split

Unnamed: 0,Unnamed: 1
0.0,train
1.0,train
2.0,train
3.0,test
4.0,train
,...


### 2.1 Define relational model

In [21]:
star_schema = getml.data.StarSchema(population=driver_standings.drop(["position"]), alias="population", split=split)

star_schema.join(
    driver_standings,
    on=["driverId", "year"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
    lagged_targets=True,
)

star_schema.join(
    lap_times,
    on=["driverId", "year"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

star_schema.join(
    pit_stops,
    on=["driverId", "year"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

star_schema.join(
    qualifying,
    on=["driverId", "raceId"],
    relationship=getml.data.relationship.many_to_one,
)

star_schema.join(
    drivers,
    on=["driverId"],
    relationship=getml.data.relationship.many_to_one,
)

star_schema

,subset,name,rows,type
0,test,driver_standings,6229,View
1,train,driver_standings,25349,View

,name,rows,type
0,driver_standings,31578,DataFrame
1,lap_times,420369,DataFrame
2,pit_stops,6070,DataFrame
3,qualifying,7397,DataFrame
4,drivers,840,DataFrame
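
The joins on `driver_standings`, `lap_times` and `pit_stops` combine join keys with a time stamp and a horizon, so that only data from earlier races of the same driver and season can enter a feature. As we understand the semantics, a peripheral row is visible to a population row if its time stamp plus the horizon is not later than the population row's time stamp. The following plain-Python sketch illustrates that filter. `visible_rows` and the toy rows are hypothetical and do not use the getML API.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=1)

# Hypothetical peripheral rows for one driver in one season.
peripheral = [
    {"driverId": 1, "year": 2017, "date": date(2017, 3, 26), "win": 0},
    {"driverId": 1, "year": 2017, "date": date(2017, 4, 9), "win": 1},
    {"driverId": 1, "year": 2017, "date": date(2017, 4, 16), "win": 0},
]


def visible_rows(population_row, peripheral, horizon=HORIZON):
    """Returns the peripheral rows that may be used for this population row."""
    return [
        row
        for row in peripheral
        if row["driverId"] == population_row["driverId"]
        and row["year"] == population_row["year"]
        # Assumed rule: peripheral time stamp + horizon <= population time stamp.
        and row["date"] + horizon <= population_row["date"]
    ]


population_row = {"driverId": 1, "year": 2017, "date": date(2017, 4, 16)}
visible = visible_rows(population_row, peripheral)
print([row["date"].isoformat() for row in visible])
```

With a horizon of one day, the row belonging to the race being predicted is excluded. This matters most for the self-join with `lagged_targets=True`, where the target of the current race would otherwise leak into the features.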


### 2.2 getML pipeline

<!-- #### 2.1.1  -->
__Set up the feature learners & predictor__

We define three feature learners (FastProp, Relboost and RelMT) along with a `Mapping` preprocessor and an `XGBoostClassifier` as the predictor. Of the three feature learners, only FastProp is used in the pipeline built below.

In [22]:
mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    num_threads=1,
)

relboost = getml.feature_learning.Relboost(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,
)

relmt = getml.feature_learning.RelMT(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1,
)

predictor = getml.predictors.XGBoostClassifier(n_jobs=1)

__Build the pipeline__

In [23]:
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor],
    include_categorical=True,
)

pipe1

### 2.3 Model training

In [24]:
pipe1.check(star_schema.train)

Checking data model...


Preprocessing...

INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId', 'year' and 'driverId', 'year', there are no corresponding entries for 72.677423% of entries in 'driverId', 'year' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId', 'year' and 'driverId', 'year', there are no corresponding entries for 90.595290% of entries in 'driverId', 'year' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
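
As these messages read, the percentages are the share of population entries whose join key combination has no counterpart in the peripheral table. Such a share can be reproduced outside of getML with a few lines of plain Python. The `unmatched_share` helper and the toy keys below are hypothetical.

```python
def unmatched_share(population_keys, peripheral_keys):
    """Share of population keys (in percent) without a match in the peripheral table."""
    available = set(peripheral_keys)
    missing = sum(1 for key in population_keys if key not in available)
    return 100.0 * missing / len(population_keys)


# Hypothetical (driverId, year) pairs.
population_keys = [(1, 2008), (1, 2009), (2, 2008), (3, 2017)]
lap_time_keys = [(1, 2009), (1, 2009), (3, 2017)]

share = unmatched_share(population_keys, lap_time_keys)
print(f"{share:.1f}% of population entries have no match")
```

A high share is not necessarily an error. It can simply mean that the peripheral table covers fewer drivers or seasons than the population table.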


In [25]:
pipe1.fit(star_schema.train)

Checking data model...


INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId', 'year' and 'driverId', 'year', there are no corresponding entries for 72.677423% of entries in 'driverId', 'year' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId', 'year' and 'driverId', 'year', there are no corresponding entries for 90.595290% of entries in 'driverId', 'year' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.


Preprocessing...

FastProp: Trying 867 features...

FastProp: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:1m:52.942229



### 2.4 Model evaluation

In [26]:
pipe1.score(star_schema.test)



Preprocessing...

FastProp: Building features...



,date time,set used,target,accuracy,auc,cross entropy
0,2021-07-15 22:15:58,train,win,0.9736,0.9566,0.07478
1,2021-07-15 22:16:06,test,win,0.9724,0.922,0.08662
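
The three score columns report accuracy, the area under the ROC curve (AUC) and the cross entropy (log loss). As a reminder of what they measure, here is a plain-Python sketch that computes them from labels and predicted probabilities. The data is made up, the helper names are ours, and the 0.5 decision threshold is an assumption; this is not a description of how getML computes its scores internally.

```python
import math

# Made-up labels and predicted probabilities.
y_true = [1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.1]


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of correct predictions at the given threshold."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


def cross_entropy(y_true, y_prob):
    """Mean negative log-likelihood of the true labels."""
    total = sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, y_prob)
    )
    return -total / len(y_true)


print(accuracy(y_true, y_prob), auc(y_true, y_prob), cross_entropy(y_true, y_prob))
```

Because only one driver wins each race, the target is heavily imbalanced, and predicting "no win" for every row would already yield a high accuracy. AUC and cross entropy are therefore likely the more informative columns here.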


### 2.5 featuretools

In [None]:
population_train_pd = star_schema.train.population.to_pandas()
population_test_pd = star_schema.test.population.to_pandas()

In [None]:
inspections_pd = inspections.drop(inspections.unused_names).to_pandas()
violations_pd = violations.drop(violations.unused_names).to_pandas()
businesses_pd = businesses.drop(businesses.unused_names).to_pandas()

In [None]:
population_train_pd["id"] = population_train_pd.index

population_train_pd = population_train_pd.merge(
    businesses_pd,
    on="business_id"
)

population_train_pd

In [None]:
population_test_pd["id"] = population_test_pd.index

population_test_pd = population_test_pd.merge(
    businesses_pd,
    on="business_id"
)

population_test_pd

In [None]:
def prepare_violations(violations_pd, train_or_test):
    """
    Helper function that imitates the behavior of 
    the data model defined above.
    """
    violations_new = violations_pd.merge(
        train_or_test[["id", "business_id", "date"]],
        on="business_id"
    )

    violations_new = violations_new[
        violations_new["date_x"] <= violations_new["date_y"]
    ]

    del violations_new["date_y"]
    del violations_new["business_id"]

    return violations_new.rename(columns={"date_x": "date"})

In [None]:
violations_train_pd = prepare_violations(violations_pd, population_train_pd)
violations_test_pd = prepare_violations(violations_pd, population_test_pd)
violations_train_pd

In [None]:
def prepare_inspections(inspections_pd, train_or_test):
    """
    Helper function that imitates the behavior of 
    the data model defined above.
    """
    inspections_new = inspections_pd.merge(
        train_or_test[["id", "business_id", "date"]],
        on="business_id"
    )

    inspections_new = inspections_new[
        inspections_new["date_x"] < inspections_new["date_y"]
    ]
    
    del inspections_new["date_y"]
    del inspections_new["business_id"]

    return inspections_new.rename(columns={"date_x": "date"})

In [None]:
inspections_train_pd = prepare_inspections(inspections_pd, population_train_pd)
inspections_test_pd = prepare_inspections(inspections_pd, population_test_pd)
inspections_train_pd

In [None]:
del population_train_pd["business_id"]
del population_test_pd["business_id"]

In [None]:
population_train_pd

In [None]:
entities_train = {
    "population" : (population_train_pd, "id"),
    "inspections" : (inspections_train_pd, "index"),
    "violations" : (violations_train_pd, "index")
}

In [None]:
entities_test = {
    "population" : (population_test_pd, "id"),
    "inspections" : (inspections_test_pd, "index"),
    "violations" : (violations_test_pd, "index")
}

In [None]:
relationships = [
    ("population", "id", "inspections", "id"),
    ("population", "id", "violations", "id")
]

In [None]:
featuretools_train_pd = featuretools.dfs(
    entities=entities_train,
    relationships=relationships,
    target_entity="population")[0]

In [None]:
featuretools_test_pd = featuretools.dfs(
    entities=entities_test,
    relationships=relationships,
    target_entity="population")[0]

In [None]:
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")

In [None]:
featuretools_train.set_role("name", getml.data.roles.text)
featuretools_train.set_role("score", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.unused_float_names, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.unused_string_names, getml.data.roles.categorical)

featuretools_train

In [None]:
featuretools_test.set_role("name", getml.data.roles.text)
featuretools_test.set_role("score", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.unused_float_names, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.unused_string_names, getml.data.roles.categorical)

featuretools_test

We train an untuned XGBoostClassifier on top of featuretools' features, just like we have done for getML's features.

Since some of featuretools' features are categorical, we allow the pipeline to include these features as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.

In [None]:
data_model = getml.data.DataModel("population")

In [None]:
imputation = getml.preprocessors.Imputation()

predictor = getml.predictors.XGBoostClassifier(n_jobs=1)

pipe4 = getml.pipeline.Pipeline(
    tags=['featuretools'],
    data_model=data_model,
    preprocessors=[imputation],
    predictors=[predictor],
    include_categorical=True,
)

pipe4

In [None]:
pipe4.fit(featuretools_train)

In [None]:
pipe4.score(featuretools_test)

### 2.6 Studying features

We would like to understand why getML outperforms featuretools. In particular, getML's FastProp is based on an approach that is very similar to featuretools. However, getML's FastProp outperforms featuretools by over 10 percentage points (in terms of R-squared).

To investigate this matter, we first take a look at the importance of the features FastProp has learned:

In [None]:
names, importances = pipe1.features.importances(target_num=0)

plt.subplots(figsize=(20, 10))

plt.bar(names[:30], importances[:30], color='#6829c2')

plt.title("feature importances")
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')

plt.show()

As we can see, a small number of features accounts for well over 90% of the predictive power. Therefore, if we take a look at the most important features, we will get a very good idea where the predictive power comes from:

In [None]:
pipe1.features.to_sql()[names[0]]

In [None]:
pipe1.features.to_sql()[names[1]]

In [None]:
pipe1.features.to_sql()[names[2]]

What we can learn from these features is the following:

1) The health score depends on the number of violations at the LAST inspection (FEATURE_1_135).

2) The health score also depends on the severity of these violations (FEATURE_1_57, FEATURE_1_54). Note that EWMA is short for exponentially weighted moving average and is therefore an aggregation that gives greater emphasis to more recent data.
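
To illustrate the idea behind an exponentially weighted moving average, the sketch below weights each observation by a factor that halves with every `half_life` days of age. The `ewma` helper, the half-life parametrization and the numbers are assumptions made for illustration; getML's own EWMA aggregations may be defined differently.

```python
def ewma(values, ages_in_days, half_life=7.0):
    """Weighted mean in which the weight halves every `half_life` days."""
    weights = [0.5 ** (age / half_life) for age in ages_in_days]
    weighted_sum = sum(w * v for w, v in zip(weights, values))
    return weighted_sum / sum(weights)


# Hypothetical observations, oldest first, with their age in days.
values = [10.0, 10.0, 0.0]
ages_in_days = [28.0, 14.0, 0.0]

plain_mean = sum(values) / len(values)
recent_weighted = ewma(values, ages_in_days)
print(plain_mean, recent_weighted)
```

The most recent observation dominates the weighted result, whereas the plain mean treats all three observations equally.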

Of course, this is very much how we would expect a public health department to assign these scores. But what are the features that featuretools has come up with?

In [None]:
names, importances = pipe4.features.importances(target_num=0)

plt.subplots(figsize=(20, 10))

plt.bar(names[:30], importances[:30], color='#6829c2')

plt.title("feature importances")
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')

plt.show()

As we can see, featuretools cannot reproduce such a logic. It can calculate the number of violations and the number of unique violations, but it cannot assess their severity and it cannot differentiate between more recent violations and violations that happened a long time ago.

### 2.7 Discussion

For a more convenient overview, we summarize our results into a table.

Name                 | R-squared | RMSE | MAE
-------------------- | --------- | ---- | ----
getML: FastProp      |     97.0% | 1.45 | 0.53
getML: Relboost      |     97.2% | 1.41 | 0.45
getML: RelMT         |     97.4% | 1.35 | 0.3
featuretools         |     84.0% | 3.39 | 2.13

As we can see, these figures paint a very clear picture. All scores indicate that RelMT outperforms Relboost, which outperforms FastProp. All three algorithms outperform featuretools by a wide margin.

As we have seen in the previous section, the health score largely depends on the number of violations and their severity at the most recent inspection. However, featuretools cannot quite build features that reproduce this kind of logic.

## 3. Conclusion

We have benchmarked getML against featuretools on the ErgastF1 dataset of Formula 1 race results. We have found that getML outperforms featuretools by a wide margin.

## References

Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

# Next Steps

This tutorial benchmarked getML's feature learning against featuretools on a relational classification problem.

If you are interested in further real-world applications of getML, head back to the [notebook overview](welcome.md) and choose one of the remaining examples.

Here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)

# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.