# Quick Start Tutorial: Model Training

## Learning Objectives

In this tutorial you will learn:
1. How to design an observation set for your use case
2. How to materialize training data
3. How your ML training environment can consume training data

## Set up the prerequisites

Learning Objectives

In this section you will:
* start your local featurebyte server
* import libraries
* learn the about catalogs
* activate a pre-built catalog

In [None]:
!pip install featurebyte
!pip install scikit-learn
!wget https://raw.githubusercontent.com/featurebyte/featurebyte-hosted-tutorials/main/tutorials/notebooks/prebuilt_catalogs.py

In [1]:
# library imports
import pandas as pd
import numpy as np
import random

# load the featurebyte SDK
import featurebyte as fb

print("FeatureByte version " + fb.version)

# inject your API token after registering for the tutorial
fb.register_tutorial_api_token("<api_token>")

2023-03-27 18:57:15.788 | INFO     | featurebyte.docker.manager:start_playground:305 | Starting featurebyte service | {}


FeatureByte version 0.1.4


2023-03-27 18:57:23.678 | INFO     | featurebyte.docker.manager:start_playground:307 | Starting local spark service | {}
2023-03-27 18:57:30.561 | INFO     | featurebyte.docker.manager:start_playground:310 | Starting documentation service | {}
2023-03-27 18:57:37.374 | INFO     | featurebyte.docker.manager:start_playground:314 | Creating local spark feature store | {}
2023-03-27 18:57:37.899 | INFO     | featurebyte.docker.manager:start_playground:336 | Dataset grocery already exists, skipping import | {}
2023-03-27 18:57:37.899 | INFO     | featurebyte.docker.manager:start_playground:336 | Dataset healthcare already exists, skipping import | {}
2023-03-27 18:57:37.899 | INFO     | featurebyte.docker.manager:start_playground:336 | Dataset creditcard already exists, skipping import | {}


### Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up

Note that creating a pre-built catalog is not a step you will do in real-life. This is a function specific to this quick-start tutorial to quickly skip over many of the preparatory steps and get you to a point where you can materialize features.

In a real-life project you would do data modeling, declaring the tables, entities, and the associated metadata. This would not be a frequent task, but forms the basis for best-practice feature engineering.

In [2]:
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *

# create a new catalog for this tutorial
catalog_name = create_tutorial_catalog(PrebuiltCatalog.QuickStartModelTraining)

Cleaning up any existing tutorial catalogs
Building a quick start catalog for model training named [quick start model training 20230327:1857]
Creating new catalog
Catalog created
Registering the source tables
Registering the entities
Tagging the entities to columns in the data tables
Populating the feature store with example features
Setting feature readiness
Saving Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.0s (23.41/s)                               
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s (4.09/s)                               
Saving Feature(s) |████████████████████████████████████████| 8/8 [100%] in 0.0s (201.94/s)                              
Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 1.5s (5.45/s)                               
Catalog created and pre-populated with data and features


### Example: Activate an existing catalog

In [3]:
# you can activate an existing catalog
catalog = fb.Catalog.activate(catalog_name)

### Example: Create views from tables in the Catalog

In [4]:
# create the views
grocery_customer_view = catalog.get_view("GROCERYCUSTOMER")
grocery_invoice_view = catalog.get_view("GROCERYINVOICE")
grocery_items_view = catalog.get_view("INVOICEITEMS")
grocery_product_view = catalog.get_view("GROCERYPRODUCT")

## Create an observation set for your use case

Learning Objectives

In this section you will learn:
* the purpose of observation sets
* the relationship between entities, point in time, and observation sets
* how to design an observation set suitable for training data

### Case Study: Predicting Customer Spend

Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.

### Concept: Materialization

A feature definition is a set of instructions for computing the feature on past or newly available data. The act of computing features is known as Feature Materialization.

### Concept: Observation set

An observation set is a table of entity keys and points in time, for which you wish to materialize feature values. The entities keys define which entities a feature will materialize, and the points in time define at which timestamps.

### Concept: Point in time

A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.

It is a crucial aspect of historical feature serving, which allows machine learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.

An observation set is created as a pandas data frame containing the keys for the primary entity, and points in time. The column name for the primary entity must be its serving name, and the column name for the point in time must be "POINT_IN_TIME".

### Case Study: Predicting Customer Spend

Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.

In [5]:
# get the feature list for the target feature
customer_target_list = catalog.get_feature_list("TargetFeature")

# display details about the target feature
display(customer_target_list.list_features())

Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s (4.46/s)                               


Unnamed: 0,name,version,dtype,readiness,online_enabled,tables,primary_tables,entities,primary_entities,created_at
0,Target,V230327,FLOAT,PRODUCTION_READY,False,[GROCERYINVOICE],[GROCERYINVOICE],[grocerycustomer],[grocerycustomer],2023-03-27 10:58:16.549


In [6]:
# filter to get the second half of 2022
filter = (grocery_invoice_view["Timestamp"].dt.year == 2022) & (grocery_invoice_view["Timestamp"].dt.month >= 7)

# create a pandas data frame a sample of the customer IDs and timestamps
observation_set_features = observation_set = (
    grocery_invoice_view[filter].sample(1000)[["GroceryCustomerGuid", "Timestamp"]]
    .rename({
        "Timestamp": "POINT_IN_TIME",
        "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
    }, axis=1)
)
display(observation_set_features)

Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME
0,306f4ba8-63a7-4995-8c47-adaea26e3e65,2022-08-24 13:58:21
1,6a24aaf2-65e4-48a5-8027-1088bf53102a,2022-09-20 16:55:00
2,c6d5809b-d835-4b4d-b442-f753be68fa85,2022-07-03 16:34:14
3,19e98e0f-bb53-41d3-bc31-415975fed467,2022-09-04 19:50:45
4,82a104d2-ad63-4079-8ccc-767c5b88afcb,2022-08-27 15:48:17
...,...,...
995,7ac933ed-db9f-4169-a52d-0b86fab44379,2022-08-27 14:12:28
996,b3b9a70e-4ec3-4fe2-b563-873899b357b1,2022-07-03 09:37:59
997,22d37e8d-0e7b-41c0-94eb-95c9282ab041,2022-08-03 18:50:41
998,d836b370-9b8c-4cf5-8612-32e39224a9d3,2022-08-19 20:53:12


## Materialize Training Data

Learning Objectives

In this section you will learn:
* how to create historical training data
* how to merge target and features

### Example: Get historical values

In [7]:
# list the feature lists
display(catalog.list_feature_lists())

Unnamed: 0,name,num_features,status,deployed,readiness_frac,online_frac,tables,entities,created_at
0,Features,8,DRAFT,False,1.0,0.0,"[GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS...","[grocerycustomer, frenchstate]",2023-03-27 10:58:18.742
1,TargetFeature,1,DRAFT,False,1.0,0.0,[GROCERYINVOICE],[grocerycustomer],2023-03-27 10:58:17.407


In [8]:
# get the feature list
feature_list = catalog.get_feature_list("Features")

# use the get historical features function to get the feature values for the observation set
training_data_features = feature_list.get_historical_features(observation_set_features)
display(training_data_features)

Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 1.4s (5.57/s)                               
Retrieving Historical Feature(s) |████████████████████████████████████████| 1/1 [100%] in 43.6s (0.02/s)                


Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME,CustomerAvgInvoiceAmount_28d,CustomerSpend_28d,CustomerStateSimilarity_28d,CustomerInventoryStability_14d28d,StateMeanLongitude,StateMeanLatitude,StateAvgInvoiceAmount_28d,StatePopulation
0,306f4ba8-63a7-4995-8c47-adaea26e3e65,2022-08-24 13:58:21,5.911429,124.14,0.630861,0.903672,2.330867,48.841036,21.959640,180
1,6a24aaf2-65e4-48a5-8027-1088bf53102a,2022-09-20 16:55:00,19.830000,178.47,0.714224,0.944014,7.531211,48.049312,26.899000,10
2,c6d5809b-d835-4b4d-b442-f753be68fa85,2022-07-03 16:34:14,7.526667,90.32,0.758315,0.843853,2.331794,48.840356,19.233153,179
3,19e98e0f-bb53-41d3-bc31-415975fed467,2022-09-04 19:50:45,29.801429,208.61,0.781039,0.886844,2.331067,48.840595,20.661001,179
4,82a104d2-ad63-4079-8ccc-767c5b88afcb,2022-08-27 15:48:17,24.577273,270.35,0.759823,0.798057,2.331067,48.840595,21.901817,179
...,...,...,...,...,...,...,...,...,...,...
995,7ac933ed-db9f-4169-a52d-0b86fab44379,2022-08-27 14:12:28,29.960000,29.96,0.405494,1.000000,-2.639970,48.015606,17.372927,13
996,b3b9a70e-4ec3-4fe2-b563-873899b357b1,2022-07-03 09:37:59,10.000000,30.00,0.297688,1.000000,2.331794,48.840356,19.268804,179
997,22d37e8d-0e7b-41c0-94eb-95c9282ab041,2022-08-03 18:50:41,22.661765,385.25,0.833442,0.873186,2.330867,48.841036,20.238734,180
998,d836b370-9b8c-4cf5-8612-32e39224a9d3,2022-08-19 20:53:12,9.781154,254.31,0.859999,0.990923,3.270621,45.921705,16.543846,6


### Example: Get target values

When target values use aggregates or time offsets, you first need to offset the point in time by the time window.

In [9]:
# add 14 days to the timestamps in the observation set
observation_set_target = observation_set_features.copy()
observation_set_target["POINT_IN_TIME"] = observation_set_target["POINT_IN_TIME"] + pd.DateOffset(days=14)
display(observation_set_target)

Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME
0,306f4ba8-63a7-4995-8c47-adaea26e3e65,2022-09-07 13:58:21
1,6a24aaf2-65e4-48a5-8027-1088bf53102a,2022-10-04 16:55:00
2,c6d5809b-d835-4b4d-b442-f753be68fa85,2022-07-17 16:34:14
3,19e98e0f-bb53-41d3-bc31-415975fed467,2022-09-18 19:50:45
4,82a104d2-ad63-4079-8ccc-767c5b88afcb,2022-09-10 15:48:17
...,...,...
995,7ac933ed-db9f-4169-a52d-0b86fab44379,2022-09-10 14:12:28
996,b3b9a70e-4ec3-4fe2-b563-873899b357b1,2022-07-17 09:37:59
997,22d37e8d-0e7b-41c0-94eb-95c9282ab041,2022-08-17 18:50:41
998,d836b370-9b8c-4cf5-8612-32e39224a9d3,2022-09-02 20:53:12


In [10]:
# Materialize the target feature using get historical features
training_data_target = customer_target_list.get_historical_features(observation_set_target)

# remove the offset from the point in time column
training_data_target["POINT_IN_TIME"] = training_data_target["POINT_IN_TIME"] - pd.DateOffset(days=14)

display(training_data_target)

Retrieving Historical Feature(s) |████████████████████████████████████████| 1/1 [100%] in 10.4s (0.10/s)                


Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME,Target
0,306f4ba8-63a7-4995-8c47-adaea26e3e65,2022-08-24 13:58:21,16.55
1,6a24aaf2-65e4-48a5-8027-1088bf53102a,2022-09-20 16:55:00,44.55
2,c6d5809b-d835-4b4d-b442-f753be68fa85,2022-07-03 16:34:14,90.24
3,19e98e0f-bb53-41d3-bc31-415975fed467,2022-09-04 19:50:45,307.17
4,82a104d2-ad63-4079-8ccc-767c5b88afcb,2022-08-27 15:48:17,89.79
...,...,...,...
995,7ac933ed-db9f-4169-a52d-0b86fab44379,2022-08-27 14:12:28,68.52
996,b3b9a70e-4ec3-4fe2-b563-873899b357b1,2022-07-03 09:37:59,5.97
997,22d37e8d-0e7b-41c0-94eb-95c9282ab041,2022-08-03 18:50:41,367.73
998,d836b370-9b8c-4cf5-8612-32e39224a9d3,2022-08-19 20:53:12,118.59


### Example: Merging materialized values for features and target

In [11]:
# merge training data features and training data target
training_data = training_data_features.merge(training_data_target, on=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"])
display(training_data)

Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME,CustomerAvgInvoiceAmount_28d,CustomerSpend_28d,CustomerStateSimilarity_28d,CustomerInventoryStability_14d28d,StateMeanLongitude,StateMeanLatitude,StateAvgInvoiceAmount_28d,StatePopulation,Target
0,306f4ba8-63a7-4995-8c47-adaea26e3e65,2022-08-24 13:58:21,5.911429,124.14,0.630861,0.903672,2.330867,48.841036,21.959640,180,16.55
1,6a24aaf2-65e4-48a5-8027-1088bf53102a,2022-09-20 16:55:00,19.830000,178.47,0.714224,0.944014,7.531211,48.049312,26.899000,10,44.55
2,c6d5809b-d835-4b4d-b442-f753be68fa85,2022-07-03 16:34:14,7.526667,90.32,0.758315,0.843853,2.331794,48.840356,19.233153,179,90.24
3,19e98e0f-bb53-41d3-bc31-415975fed467,2022-09-04 19:50:45,29.801429,208.61,0.781039,0.886844,2.331067,48.840595,20.661001,179,307.17
4,82a104d2-ad63-4079-8ccc-767c5b88afcb,2022-08-27 15:48:17,24.577273,270.35,0.759823,0.798057,2.331067,48.840595,21.901817,179,89.79
...,...,...,...,...,...,...,...,...,...,...,...
995,7ac933ed-db9f-4169-a52d-0b86fab44379,2022-08-27 14:12:28,29.960000,29.96,0.405494,1.000000,-2.639970,48.015606,17.372927,13,68.52
996,b3b9a70e-4ec3-4fe2-b563-873899b357b1,2022-07-03 09:37:59,10.000000,30.00,0.297688,1.000000,2.331794,48.840356,19.268804,179,5.97
997,22d37e8d-0e7b-41c0-94eb-95c9282ab041,2022-08-03 18:50:41,22.661765,385.25,0.833442,0.873186,2.330867,48.841036,20.238734,180,367.73
998,d836b370-9b8c-4cf5-8612-32e39224a9d3,2022-08-19 20:53:12,9.781154,254.31,0.859999,0.990923,3.270621,45.921705,16.543846,6,118.59


## Consuming training data

Learning Objectives

In this section you will learn:
* how to save a training file
* how to use a pandas data frame

### Example: Save the training data to a file

In [12]:
# save training data as a csv file
training_data.to_csv("training_data.csv", index=False)

In [13]:
# save the training file as a parquet file
training_data.to_parquet("training_data.parquet")

### Example: Training a scikit learn model

Note that you will need to install scikit learn https://scikit-learn.org/stable/install.html

In [14]:
# EDA on the training data
training_data.describe()

Unnamed: 0,CustomerAvgInvoiceAmount_28d,CustomerSpend_28d,CustomerStateSimilarity_28d,CustomerInventoryStability_14d28d,StateMeanLongitude,StateMeanLatitude,StateAvgInvoiceAmount_28d,StatePopulation,Target
count,963.0,1000.0,962.0,886.0,999.0,999.0,999.0,1000.0,1000.0
mean,21.122718,135.80006,0.609925,0.852671,2.692321,47.356411,20.231044,82.139,86.37872
std,18.539464,116.830562,0.184723,0.160516,3.944948,2.834125,4.397186,78.785791,72.271099
min,0.75,0.0,0.03509,0.059868,-34.636037,28.372194,8.394603,0.0,0.5
25%,9.06947,47.8725,0.502678,0.80507,2.330867,45.669184,18.519533,10.0,32.945
50%,15.628333,106.795,0.638681,0.899904,2.331234,48.840356,20.441071,32.0,67.38
75%,27.454423,192.545,0.747632,0.96087,4.5863,48.841199,21.78241,179.0,120.34
max,131.49,659.85,0.986629,1.0,8.775254,50.675502,63.587778,181.0,495.26


In [15]:
# do any columns in the training data contain missing values?
training_data.isna().any()

GROCERYCUSTOMERGUID                  False
POINT_IN_TIME                        False
CustomerAvgInvoiceAmount_28d          True
CustomerSpend_28d                    False
CustomerStateSimilarity_28d           True
CustomerInventoryStability_14d28d     True
StateMeanLongitude                    True
StateMeanLatitude                     True
StateAvgInvoiceAmount_28d             True
StatePopulation                      False
Target                               False
dtype: bool

In [16]:
# use sklearn to train a random forest regression model on the training data
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(training_data.drop(columns=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]), training_data["Target"], test_size=0.2, random_state=42)

# train the model
model = HistGradientBoostingRegressor()
model.fit(X_train, y_train)

# get predictions
y_pred = model.predict(X_test)

# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)

# save the model
import joblib
joblib.dump(model, "model.pkl")

Mean squared error:  52.60576072514335


['model.pkl']

## Next Steps

Now that you've completed the quick-start feature engineering tutorial, you can put your knowledge into practice or learn more:<br>
1. Learn more about materializing features via the "Deep Dive Materializing Features" tutorial
2. Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" workspaces
3. Learn more about feature governance via the "Quick Start Feature Governance" tutorial