# Deep Dive Tutorial: Materializing Features

## Learning Objectives

In this tutorial you will learn:
1. How to construct an observation set
2. How features, entities, and observation sets are used together
3. How to preview features
4. How to get historical values
5. How and why to deploy features
6. How to serve and consume deployed features

## Set up the prerequisites

Learning Objectives

In this section you will:
* connect to the remote featurebyte server
* import libraries
* learn about catalogs
* activate a pre-built catalog

### Load the featurebyte library and connect to the remote featurebyte server

In [None]:
import urllib.request

# install featurebyte package and download supporting library
!pip install --no-warn-conflicts featurebyte
urllib.request.urlretrieve("https://raw.githubusercontent.com/featurebyte/featurebyte-hosted-tutorials/main/tutorials/notebooks/prebuilt_catalogs.py", "prebuilt_catalogs.py")

In [1]:
# library imports
import pandas as pd
import numpy as np

# load the featurebyte SDK
import featurebyte as fb

# replace <api_token> with the API token you received after registering
fb.register_tutorial_api_token("<api_token>")

# define the database name for this tutorial
TUTORIAL_DATABASE = "TUTORIAL_DATASETS"

11:01:20 | INFO     | Using configuration file at: /Users/smillet/.featurebyte/config.yaml
11:01:20 | INFO     | Using profile: tutorial
11:01:20 | INFO     | Using configuration file at: /Users/smillet/.featurebyte/config.yaml
11:01:20 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
11:01:21 | INFO     | No catalog activated.
11:01:21 | INFO     | 2 feature lists, 9 features deployed


### Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up

Note that creating a pre-built catalog is not a step you will perform in real life. The function is specific to this tutorial and skips over many of the preparatory steps so that you can get straight to materializing features.

In a real-life project you would do data modeling, declaring the tables, entities, and the associated metadata. This would not be a frequent task, but forms the basis for best-practice feature engineering.

In [2]:
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *

# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.DeepDiveMaterializingFeatures)

Cleaning up existing tutorial catalogs
Cleaning catalog: deep dive feature engineering 20230726:1053
  1 historical feature tables
  3 observation tables


11:01:38 | INFO     | Catalog activated: deep dive feature engineering 20230726:1053


Done! |████████████████████████████████████████| 100% in 6.5s (0.16%/s)         
Done! |████████████████████████████████████████| 100% in 6.5s (0.16%/s)         
Done! |████████████████████████████████████████| 100% in 6.5s (0.16%/s)         
Done! |████████████████████████████████████████| 100% in 6.5s (0.16%/s)         
Building a deep dive catalog for materializing features named [deep dive materializing features 20230726:1102]
Creating new catalog
Catalog created


11:02:14 | INFO     | Catalog activated: deep dive materializing features 20230726:1102


Registering the source tables
Registering the entities
Tagging the entities to columns in the data tables
Populating the feature store with example features
Done! |████████████████████████████████████████| 100% in 13.7s (0.07%/s)        
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.5s
Done! |████████████████████████████████████████| 100% in 6.6s (0.15%/s)         
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.5s
Catalog created and pre-populated with data and features


### Load the tables for this catalog

In [3]:
# get the tables for this catalog
grocery_customer_table = catalog.get_table("GROCERYCUSTOMER")
grocery_items_table = catalog.get_table("INVOICEITEMS")
grocery_invoice_table = catalog.get_table("GROCERYINVOICE")
grocery_product_table = catalog.get_table("GROCERYPRODUCT")

### Create views for the tables in this catalog

In [4]:
# create the views
grocery_customer_view = grocery_customer_table.get_view()
grocery_invoice_view = grocery_invoice_table.get_view()
grocery_items_view = grocery_items_table.get_view()
grocery_product_view = grocery_product_table.get_view()

## How to construct an observation set

Learning Objectives

In this section you will learn:
* the purpose of observation sets
* the relationship between entities, point in time, and observation sets
* how to construct an observation set

### Concept: Materialization

A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.

Features are materialized on demand to fulfill historical requests. For prediction purposes, feature values are instead generated by a batch process called a "Feature Job", which is scheduled according to the settings associated with each feature.

### Concept: Observation set

An observation set combines entity key values and historical points-in-time, for which you wish to materialize feature values.

The observation set can be a Pandas DataFrame or an ObservationTable object representing an observation set in the feature store.

### Concept: Point in time

A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.

It is a crucial aspect of historical feature serving, which allows machine learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.

An observation set is created as a Pandas DataFrame containing the keys for the primary entity, and points in time. The column name for the primary entity must be its serving name, and the column name for the point in time must be "POINT_IN_TIME".
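As a minimal sketch of the layout described above, the following builds a small observation set by hand and checks that it has the two required columns. The customer keys and timestamps are made-up placeholder values, and `GROCERYCUSTOMERGUID` is assumed to be the serving name of the primary entity, as it is later in this tutorial.

```python
import pandas as pd

# hypothetical entity keys and timestamps, for illustration only
observation_set_example = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["customer-a", "customer-b", "customer-a"],
        "POINT_IN_TIME": pd.to_datetime(
            ["2022-06-01 09:00:00", "2022-06-01 09:00:00", "2022-07-01 09:00:00"]
        ),
    }
)

# the two required columns: the entity's serving name and POINT_IN_TIME
required_columns = {"GROCERYCUSTOMERGUID", "POINT_IN_TIME"}
missing_columns = required_columns - set(observation_set_example.columns)

print(missing_columns)
print(observation_set_example.dtypes)
```

The same entity key may appear more than once, as long as each row pairs it with a different point in time.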

### Example: Create an observation set based upon events

Some use cases are about events, and require predictions to be triggered when a specified event occurs.

For a use case requiring predictions about a grocery customer whenever an invoice event occurs, your observation set may be sampled from historical invoices.

In [5]:
# show the serving name for grocery customer
entity_list = catalog.list_entities()
display(entity_list[entity_list.name == "grocerycustomer"])

Unnamed: 0,id,name,serving_names,created_at
3,64c13582e6bdf4dd02f040d5,grocerycustomer,[GROCERYCUSTOMERGUID],2023-07-26 15:02:28.123


In [6]:
# get a sample of 200 customer IDs and invoice event timestamps from 01-Apr-2022 to 31-Mar-2023
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & (
    grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01")
)
observation_set = (
    grocery_invoice_view[filter]
    .sample(200)[["GroceryCustomerGuid", "Timestamp"]]
    .rename(
        {
            "Timestamp": "POINT_IN_TIME",
            "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
        },
        axis=1,
    )
)
display(observation_set)

Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME
0,ea512344-adc5-45ac-a419-9613c61a8e98,2023-01-03 14:52:39
1,ab27bf30-5ecb-4c88-a723-695b5a8a7a4f,2022-12-03 16:54:48
2,dba29407-bc25-44ab-853c-3f7c1b78f296,2022-10-17 11:40:47
3,eaae23d5-2d5f-416c-8292-d79282d63779,2023-02-19 16:48:30
4,888aa655-927f-41c8-a0ba-7dab2872fca8,2022-10-11 14:52:36
...,...,...
195,f6a783f7-5091-46fa-8ebf-aa13ec868234,2022-04-27 18:10:12
196,bcd8cedb-9f49-461c-86bd-920fa9316239,2022-05-27 19:05:46
197,f6a783f7-5091-46fa-8ebf-aa13ec868234,2022-07-09 19:54:49
198,59d264dd-494b-4c79-9794-d6fa103b0f7e,2023-01-15 15:53:56


### Concept: Observation table

An ObservationTable object is a representation of an observation set in the feature store. Unlike a local Pandas DataFrame, the ObservationTable is part of the catalog and can be shared or reused.

ObservationTable objects can be created from a source table or from a view after subsampling.

### Example: Create an observation table based upon events

In [7]:
# create a large observation table from a view
# observation tables are the recommended workflow for training data

# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & (
    grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01")
)
observation_set_view = grocery_invoice_view[filter].copy()

# create a new observation table
observation_table = observation_set_view.create_observation_table(
    name="10000 customers who were active between 01-Apr-2022 and 31-Mar-2023",
    sample_rows=10000,
    columns=["Timestamp", "GroceryCustomerGuid"],
    columns_rename_mapping={
        "Timestamp": "POINT_IN_TIME",
        "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
    },
)

# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())

Done! |████████████████████████████████████████| 100% in 6.5s (0.15%/s)         
Downloading table |████████████████████████████████████████| 10000/10000 [100%] 


Unnamed: 0,POINT_IN_TIME,GROCERYCUSTOMERGUID
0,2022-04-05 06:51:50,5c96089d-95f7-4a12-ab13-e082836253f1
1,2022-04-05 18:55:03,5c96089d-95f7-4a12-ab13-e082836253f1
2,2022-04-08 13:10:00,5c96089d-95f7-4a12-ab13-e082836253f1
3,2022-04-11 11:49:05,5c96089d-95f7-4a12-ab13-e082836253f1
4,2022-04-15 09:50:57,5c96089d-95f7-4a12-ab13-e082836253f1
...,...,...
9995,2022-04-25 12:28:01,e2034c31-e304-4a0a-983f-4f08ca79c528
9996,2022-05-24 12:06:21,e2034c31-e304-4a0a-983f-4f08ca79c528
9997,2022-05-29 10:43:52,e2034c31-e304-4a0a-983f-4f08ca79c528
9998,2022-05-31 11:01:38,e2034c31-e304-4a0a-983f-4f08ca79c528


### Example: Create an observation set based upon regularly scheduled batch predictions

Some use cases require predictions to be triggered at regular time periods. Some use cases have conditions for which only a subset of entities require predictions.

A use case requiring monthly predictions for recently active customers may use an observation set containing sample customer IDs combined with predefined timestamps.

In [8]:
# define a function to list a sample of the customers who were active in a given month
def get_recently_active_customers(month_number):
    # filter the invoices by month
    filter = (grocery_invoice_view["Timestamp"].dt.month == month_number) & (
        grocery_invoice_view["Timestamp"].dt.year == 2022
    )
    # get a list of customers who made an invoice in the month
    recently_active_customers = (
        grocery_invoice_view[filter].sample(200)["GroceryCustomerGuid"].unique()
    )
    # get the start of the month
    point_in_time = pd.Timestamp(f"2022-{month_number}-01")
    # get the end of the month
    end_of_month = point_in_time + pd.DateOffset(months=1)
    # get the point in time by subtracting 0.001 second from the end of the month
    point_in_time = end_of_month - pd.Timedelta(seconds=0.001)
    # combine the point in time with the customer IDs
    recently_active_customers = pd.DataFrame(
        {
            "GROCERYCUSTOMERGUID": recently_active_customers,
            "POINT_IN_TIME": point_in_time,
        }
    )
    return recently_active_customers


# create an observation set comprised of up to 200 customers per month who were active in that month in the second half of 2022
observation_set = pd.concat(
    [get_recently_active_customers(month_number) for month_number in range(7, 13)],
    ignore_index=True,
)
display(observation_set)

Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME
0,e0830d95-acfe-446e-b430-8689a447eacc,2022-07-31 23:59:59.999
1,7a024068-3f99-4114-9d90-3a61f679be51,2022-07-31 23:59:59.999
2,b21ae11c-83cf-4146-832e-1163413a3295,2022-07-31 23:59:59.999
3,9359ef7b-7fd8-4587-bc40-e89f6acc1218,2022-07-31 23:59:59.999
4,eaae23d5-2d5f-416c-8292-d79282d63779,2022-07-31 23:59:59.999
...,...,...
863,a9a0388d-9e35-4717-a61a-b9eb0a9ce92c,2022-12-31 23:59:59.999
864,bfb599c9-404c-42c1-addf-84b7b1b42ca8,2022-12-31 23:59:59.999
865,ca1fd4ac-accd-444a-bbc3-a9b10c400f2e,2022-12-31 23:59:59.999
866,5c1e93ae-fa46-4a26-bb1d-6040603dad87,2022-12-31 23:59:59.999


## Previewing features

Learning Objectives

In this section you will learn:
* how to preview features
* the limitations of previews

### Example: Preview features

During feature prototyping, new features may not have been saved to the catalog. A data scientist will want to preview sample feature values to sanity-check the feature declaration.

In [9]:
# create a lookup feature that is the city in which the customer resides
customer_city_lookup = grocery_customer_view.City.as_feature("CustomerCity")

# preview materialized values for the unsaved feature
display(customer_city_lookup.preview(observation_set.sample(5)))

Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME,CustomerCity
523,967e4beb-c889-4ff9-9140-66655248bbde,2022-10-31 23:59:59.999,TOURS
375,2032c9d8-2793-4231-b6e6-6fe7b9ca82f4,2022-09-30 23:59:59.999,ABBEVILLE
27,1b82b9eb-cc54-4cc4-a7e3-9a7417faa8a5,2022-07-31 23:59:59.999,OLIVET
337,0ae905b7-c49b-4799-8bd6-ffadb683c778,2022-09-30 23:59:59.999,PARIS
45,1e6976c5-7622-4e45-8b28-1d75f7ea7793,2022-07-31 23:59:59.999,HYÈRES


Feature previews are not suited to creating training files or feature serving. Previews have a limitation of 50 rows and do not create an audit trail.
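One way to stay within the preview limit is to cap the sample size before calling `preview`. The sketch below uses a stand-in pandas DataFrame in place of a real observation set. `MAX_PREVIEW_ROWS` and `sample_for_preview` are hypothetical names introduced for illustration; they are not part of the FeatureByte SDK.

```python
import pandas as pd

MAX_PREVIEW_ROWS = 50  # preview row limit stated above


def sample_for_preview(observation_set, n_rows=5, random_state=0):
    """Return a random sample no larger than MAX_PREVIEW_ROWS or the input itself."""
    n_rows = min(n_rows, MAX_PREVIEW_ROWS, len(observation_set))
    return observation_set.sample(n_rows, random_state=random_state)


# stand-in observation set with placeholder keys, for illustration only
stand_in = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": [f"customer-{i}" for i in range(868)],
        "POINT_IN_TIME": pd.Timestamp("2022-07-31 23:59:59.999"),
    }
)

# asking for 200 rows is capped at the preview limit
preview_rows = sample_for_preview(stand_in, n_rows=200)
print(len(preview_rows))
```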

## Create training data

Learning Objectives

In this section you will learn:
* how to design an observation set suitable for training data
* how to get historical values for a feature list
* how to get historical values for the target
* how to join features and the target to create training data

### Design an Observation Set for Training

Observation Training Design: A training data observation set should typically meet the following criteria:
* is collected from a time period that starts after the earliest data availability timestamp plus the longest time window in the features
* is collected from a time period that ends before the latest data timestamp less the time window of the target value
* uses points in time that align with the anticipated timing of the use case inference, whether that is a regular schedule, an event trigger, or another timing mechanism
* does not have duplicate rows
* has a column containing the primary entity of the use case, named with its serving name
* has a column named "POINT_IN_TIME" containing the points in time
* for the same entity key, has points in time separated by intervals greater than the horizon of the target, to avoid leakage
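Several of the criteria above can be checked mechanically before requesting historical values. The following is a minimal sketch using pandas only. The function name `check_training_observation_set` and its parameters are hypothetical, and the example rows are placeholder data.

```python
import pandas as pd


def check_training_observation_set(
    observation_set, entity_column, earliest_allowed, latest_allowed, target_horizon
):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for column in (entity_column, "POINT_IN_TIME"):
        if column not in observation_set.columns:
            problems.append(f"missing column: {column}")
    if problems:
        return problems

    points = observation_set["POINT_IN_TIME"]
    if (points < earliest_allowed).any():
        problems.append("points in time before the earliest allowed timestamp")
    if (points > latest_allowed).any():
        problems.append("points in time after the latest allowed timestamp")
    if observation_set.duplicated().any():
        problems.append("duplicate rows")

    # gap between consecutive points in time for the same entity key
    ordered = observation_set.sort_values([entity_column, "POINT_IN_TIME"])
    gaps = ordered.groupby(entity_column)["POINT_IN_TIME"].diff().dropna()
    if (gaps <= target_horizon).any():
        problems.append("same-entity points in time closer than the target horizon")
    return problems


# placeholder rows: customer "a" has two points in time only 4 days apart
example = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "a", "b"],
        "POINT_IN_TIME": pd.to_datetime(["2022-03-01", "2022-03-05", "2022-03-01"]),
    }
)
problems = check_training_observation_set(
    example,
    entity_column="GROCERYCUSTOMERGUID",
    earliest_allowed=pd.Timestamp("2022-01-29"),
    latest_allowed=pd.Timestamp("2022-12-31"),
    target_horizon=pd.Timedelta(days=14),
)
print(problems)
```

The alignment of points in time with the timing of inference is a design decision and is not something this kind of check can verify.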

### Case Study: Predicting Customer Spend

Your chain of grocery stores wants to market to customers immediately after each purchase. As one step in this marketing campaign, it wants to predict each customer's spend in the 14 days after a purchase.

### Example: Create an observation table for training data

In [10]:
# describe the customer view
display(grocery_customer_view.describe())

Unnamed: 0,RowID,GroceryCustomerGuid,ValidFrom,Gender,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,State,PostalCode,BrowserUserAgent,DateOfBirth,Latitude,Longitude
dtype,VARCHAR,VARCHAR,TIMESTAMP,VARCHAR,VARCHAR,VARCHAR,VARCHAR,VARCHAR,VARCHAR,VARCHAR,VARCHAR,VARCHAR,VARCHAR,TIMESTAMP,FLOAT,FLOAT
unique,530,500,530,2,4,347,26,352,512,300,27,353,82,495,530,530
%missing,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
%empty,0,0,,0,0,0,0,0,0,0,0,0,0,,,
entropy,6.214608,6.191446,,0.692285,1.146938,5.726251,2.925542,5.749627,6.201803,5.435211,2.49532,5.763347,3.814598,,,
top,0069200d-adf5-490a-acca-14bdf78072a0,0b7196a2-2dab-4218-a234-e193f7bc4470,2019-01-01 07:23:45.000,male,Mr.,Joanna,A,Saindon,1 cours Jean Jaures,PARIS,Île-de-France,75004,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,1947-09-22 00:00:00.000,-12.704022,-0.102024
freq,1.0,3.0,1.0,276.0,264.0,5.0,66.0,6.0,2.0,25.0,189.0,5.0,51.0,3.0,1.0,1.0
mean,,,,,,,,,,,,,,,46.50512,2.383389
std,,,,,,,,,,,,,,,6.108698,7.822694
min,,,2019-01-01T07:23:45.000000000,,,,,,,,,,,1937-07-02T00:00:00.000000000,-12.71811,-61.12404


Note that there are 500 unique customers

In [11]:
# describe the invoice view
display(grocery_invoice_view.describe())

Unnamed: 0,GroceryInvoiceGuid,GroceryCustomerGuid,Timestamp,tz_offset,Amount
dtype,VARCHAR,VARCHAR,TIMESTAMP,VARCHAR,FLOAT
unique,39828,500,39803,4,6668
%missing,0.0,0.0,0.0,0.0,0.0
%empty,0,0,,0,
entropy,6.214608,5.824943,,0.817283,
top,0087544f-9f96-46f6-9211-e20f54577bcd,3019bdbf-667c-4081-acb5-26cd2d559c5e,2022-01-05 11:34:17.000,+02:00,1
freq,1.0,639.0,2.0,22375.0,834.0
mean,,,,,18.355359
std,,,,,22.735611
min,,,2022-01-01T04:17:46.000000000,,0.0


Note that the earliest data timestamp is at the beginning of 2022, and the timestamps end in the present.

In [12]:
# get the customer feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")

# display details about the features in the customer feature list
display(customer_feature_list.list_features())

Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.5s


Unnamed: 0,id,name,version,dtype,readiness,online_enabled,tables,primary_tables,entities,primary_entities,created_at,is_default
0,64c13590e6bdf4dd02f040dd,StateMeanLongitude,V230726,FLOAT,DRAFT,False,[GROCERYCUSTOMER],[GROCERYCUSTOMER],[frenchstate],[frenchstate],2023-07-26 15:02:49.888,True
1,64c13590e6bdf4dd02f040dc,StateMeanLatitude,V230726,FLOAT,DRAFT,False,[GROCERYCUSTOMER],[GROCERYCUSTOMER],[frenchstate],[frenchstate],2023-07-26 15:02:48.707,True
2,64c13590e6bdf4dd02f040db,CustomerInventoryMostFrequent_4w,V230726,VARCHAR,DRAFT,False,"[GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT]",[INVOICEITEMS],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:47.546,True
3,64c13590e6bdf4dd02f040da,CustomerInventoryEntropy_4w,V230726,FLOAT,DRAFT,False,"[GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT]",[INVOICEITEMS],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:44.664,True


Note that the longest time window in the features is 4 weeks.

In [13]:
# get the target
customer_target_list = catalog.get_feature_list("TargetFeature")

# display details about the target feature
display(customer_target_list.list_features())

Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.5s


Unnamed: 0,id,name,version,dtype,readiness,online_enabled,tables,primary_tables,entities,primary_entities,created_at,is_default
0,64c13591e6bdf4dd02f040e0,Target,V230726,FLOAT,DRAFT,False,[GROCERYINVOICE],[GROCERYINVOICE],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:58.232,True


Note that the time window for the target is 14 days

We can conclude that it is safe for the points in time in the training observation set to start on 29-Jan-2022 (the earliest data timestamp plus the 4-week feature window) and to end 14 days before the present.
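The safe window follows from simple date arithmetic, sketched here with the standard library. The earliest timestamp, the 4-week feature window, and the 14-day target window are read from the outputs above. `latest_data_timestamp` is a hypothetical stand-in for "the present" and should be replaced with the latest timestamp in your own data.

```python
from datetime import datetime, timedelta

# values read from the outputs above
earliest_data_timestamp = datetime(2022, 1, 1, 4, 17, 46)
longest_feature_window = timedelta(weeks=4)
target_window = timedelta(days=14)

# hypothetical stand-in for "the present"
latest_data_timestamp = datetime(2023, 7, 26)

# earliest point in time with a full feature window of history behind it
earliest_point_in_time = earliest_data_timestamp + longest_feature_window
# latest point in time with a full target window of data ahead of it
latest_point_in_time = latest_data_timestamp - target_window

print(earliest_point_in_time)
print(latest_point_in_time)
```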

We will create an observation table for invoice dates from Feb-22 to Jan-23.

In [14]:
# create a large observation table from a view

# filter to get Feb-22 to Jan-23
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-02-01")) & (
    grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-02-01")
)
observation_set_view = grocery_invoice_view[filter].copy()

# create a new observation table
observation_table_large = observation_set_view.create_observation_table(
    name="1000 customers who were active between 01-Feb-2022 and 31-Jan-2023",
    sample_rows=1000,
    columns=["Timestamp", "GroceryCustomerGuid"],
    columns_rename_mapping={
        "Timestamp": "POINT_IN_TIME",
        "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
    },
)

# if the observation table isn't too large, you can materialize it
display(observation_table_large.to_pandas())

Done! |████████████████████████████████████████| 100% in 6.6s (0.15%/s)         
Downloading table |████████████████████████████████████████| 1000/1000 [100%] in


Unnamed: 0,POINT_IN_TIME,GROCERYCUSTOMERGUID
0,2022-03-01 10:23:14,5c96089d-95f7-4a12-ab13-e082836253f1
1,2022-03-06 19:22:16,5c96089d-95f7-4a12-ab13-e082836253f1
2,2022-05-25 13:01:50,5c96089d-95f7-4a12-ab13-e082836253f1
3,2022-08-31 08:35:20,5c96089d-95f7-4a12-ab13-e082836253f1
4,2022-12-13 13:26:58,5c96089d-95f7-4a12-ab13-e082836253f1
...,...,...
995,2022-05-16 16:51:07,c0c4da4d-08a3-4a03-a1f6-9c015362caf9
996,2022-07-29 15:25:28,c0c4da4d-08a3-4a03-a1f6-9c015362caf9
997,2022-08-23 14:28:10,c0c4da4d-08a3-4a03-a1f6-9c015362caf9
998,2022-10-13 15:40:35,c0c4da4d-08a3-4a03-a1f6-9c015362caf9


### Example: Get historical values

In [15]:
# use compute_historical_features to get the feature values for the observation set
training_data_features = customer_feature_list.compute_historical_features(observation_set)
display(training_data_features)

Done! |████████████████████████████████████████| 100% in 16.9s (0.06%/s)        
Downloading table |████████████████████████████████████████| 868/868 [100%] in 0
Done! |████████████████████████████████████████| 100% in 6.6s (0.15%/s)         


Unnamed: 0,GROCERYCUSTOMERGUID,POINT_IN_TIME,CustomerInventoryEntropy_4w,CustomerInventoryMostFrequent_4w,StateMeanLatitude,StateMeanLongitude
0,e0830d95-acfe-446e-b430-8689a447eacc,2022-07-31 23:59:59.999,1.039721,Fruits,48.737227,2.240549
1,7a024068-3f99-4114-9d90-3a61f679be51,2022-07-31 23:59:59.999,2.359466,Fromages,48.737227,2.240549
2,b21ae11c-83cf-4146-832e-1163413a3295,2022-07-31 23:59:59.999,2.531310,"Colas, Thés glacés et Sodas",49.185354,-0.608146
3,9359ef7b-7fd8-4587-bc40-e89f6acc1218,2022-07-31 23:59:59.999,3.011942,Cave à Vins,48.737227,2.240549
4,eaae23d5-2d5f-416c-8292-d79282d63779,2022-07-31 23:59:59.999,3.132692,Laits,44.718312,-0.478629
...,...,...,...,...,...,...
863,a9a0388d-9e35-4717-a61a-b9eb0a9ce92c,2022-12-31 23:59:59.999,2.982231,Laits,47.182230,4.394402
864,bfb599c9-404c-42c1-addf-84b7b1b42ca8,2022-12-31 23:59:59.999,,,48.815086,4.386779
865,ca1fd4ac-accd-444a-bbc3-a9b10c400f2e,2022-12-31 23:59:59.999,2.572869,"Colas, Thés glacés et Sodas",50.665263,2.908103
866,5c1e93ae-fa46-4a26-bb1d-6040603dad87,2022-12-31 23:59:59.999,1.834372,Fromages,48.739038,2.242254


### Concept: Historical feature table

A HistoricalFeatureTable object represents a table in the feature store containing historical feature values from a historical feature request. The historical feature values can also be obtained as a Pandas DataFrame, but using a HistoricalFeatureTable object has some benefits such as handling large tables, storing the data in the feature store for reuse, and offering full lineage of the training and test data.

In [16]:
# the syntax is different when using an observation table to create a historical feature table

# Compute the historical feature table
training_table = customer_feature_list.compute_historical_feature_table(
    observation_table_large,
    historical_feature_table_name="customer training table on 1000 customers who were active between 01-Feb-2022 and 31-Jan-2023",
)

# display the training data
display(training_table.to_pandas())

Done! |████████████████████████████████████████| 100% in 16.3s (0.06%/s)        
Downloading table |████████████████████████████████████████| 1000/1000 [100%] in


Unnamed: 0,POINT_IN_TIME,GROCERYCUSTOMERGUID,CustomerInventoryEntropy_4w,CustomerInventoryMostFrequent_4w,StateMeanLatitude,StateMeanLongitude
0,2022-04-06 17:29:08,c6ef9073-3351-4f54-869a-4c926a479520,3.420063,Fromages,43.452577,5.848259
1,2022-10-02 09:53:01,7026ce5b-ba7f-4804-8a30-700ea501438e,2.745184,"Colas, Thés glacés et Sodas",45.500198,5.054081
2,2022-06-11 13:08:35,53b76d93-0577-4dca-bc7b-dc493120c3be,1.039721,Laits,48.739485,2.238596
3,2022-03-21 14:38:27,37467a5c-f833-494b-9e15-6126c173f825,2.798653,Yaourt et Compotes,48.739692,2.235733
4,2022-08-12 17:59:13,cfd39ed9-3140-4af5-9f72-77881aa6c2a8,3.318023,Viande Surgelée,48.737227,2.240549
...,...,...,...,...,...,...
995,2022-10-18 16:37:09,a34c0e2e-2def-49bd-9e62-39e80cd219f8,3.271166,Chips et Tortillas,48.738384,2.241215
996,2022-06-19 18:09:58,db726554-ea0d-422d-b4de-39efa949f60c,3.421687,Bonbons,48.739359,2.239731
997,2022-12-24 08:02:42,85807d39-10ab-445d-a034-9ab6e57e73c4,3.362808,Pains,48.739799,2.241806
998,2022-02-22 15:50:40,94127b9f-1366-4bbe-afea-7cd77225da52,3.395711,Plats Cuisinés Surgelés,48.799660,5.963028


### Example: Get target values

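When target values use aggregates or time offsets, first shift each point in time forward by the target's time window (14 days here) so that the window covers the period after the original observation, then shift it back once the values are materialized.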

In [17]:
# add 14 days to the timestamps in the observation set
observation_set_target = observation_table_large.to_pandas()
observation_set_target["POINT_IN_TIME"] = observation_set_target["POINT_IN_TIME"] + pd.DateOffset(
    days=14
)
display(observation_set_target)

Downloading table |████████████████████████████████████████| 1000/1000 [100%] in


Unnamed: 0,POINT_IN_TIME,GROCERYCUSTOMERGUID
0,2022-03-15 10:23:14,5c96089d-95f7-4a12-ab13-e082836253f1
1,2022-03-20 19:22:16,5c96089d-95f7-4a12-ab13-e082836253f1
2,2022-06-08 13:01:50,5c96089d-95f7-4a12-ab13-e082836253f1
3,2022-09-14 08:35:20,5c96089d-95f7-4a12-ab13-e082836253f1
4,2022-12-27 13:26:58,5c96089d-95f7-4a12-ab13-e082836253f1
...,...,...
995,2022-05-30 16:51:07,c0c4da4d-08a3-4a03-a1f6-9c015362caf9
996,2022-08-12 15:25:28,c0c4da4d-08a3-4a03-a1f6-9c015362caf9
997,2022-09-06 14:28:10,c0c4da4d-08a3-4a03-a1f6-9c015362caf9
998,2022-10-27 15:40:35,c0c4da4d-08a3-4a03-a1f6-9c015362caf9


In [18]:
# Materialize the target feature using get historical features
training_data_target = customer_target_list.compute_historical_features(observation_set_target)

# remove the offset from the point in time column
training_data_target["POINT_IN_TIME"] = training_data_target["POINT_IN_TIME"] - pd.DateOffset(
    days=14
)

display(training_data_target)

Done! |████████████████████████████████████████| 100% in 13.0s (0.08%/s)        
Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
Done! |████████████████████████████████████████| 100% in 6.6s (0.15%/s)         


Unnamed: 0,POINT_IN_TIME,GROCERYCUSTOMERGUID,Target
0,2022-03-01 10:23:14,5c96089d-95f7-4a12-ab13-e082836253f1,169.63
1,2022-03-06 19:22:16,5c96089d-95f7-4a12-ab13-e082836253f1,135.05
2,2022-05-25 13:01:50,5c96089d-95f7-4a12-ab13-e082836253f1,134.36
3,2022-08-31 08:35:20,5c96089d-95f7-4a12-ab13-e082836253f1,43.69
4,2022-12-13 13:26:58,5c96089d-95f7-4a12-ab13-e082836253f1,130.95
...,...,...,...
995,2022-05-16 16:51:07,c0c4da4d-08a3-4a03-a1f6-9c015362caf9,131.88
996,2022-07-29 15:25:28,c0c4da4d-08a3-4a03-a1f6-9c015362caf9,136.42
997,2022-08-23 14:28:10,c0c4da4d-08a3-4a03-a1f6-9c015362caf9,21.73
998,2022-10-13 15:40:35,c0c4da4d-08a3-4a03-a1f6-9c015362caf9,98.22


### Example: Merging materialized values for features and target

In [19]:
# merge training data features and training data target
training_data = training_table.to_pandas().merge(
    training_data_target, on=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]
)
display(training_data)

Downloading table |████████████████████████████████████████| 1000/1000 [100%] in


Unnamed: 0,POINT_IN_TIME,GROCERYCUSTOMERGUID,CustomerInventoryEntropy_4w,CustomerInventoryMostFrequent_4w,StateMeanLatitude,StateMeanLongitude,Target
0,2022-04-06 17:29:08,c6ef9073-3351-4f54-869a-4c926a479520,3.420063,Fromages,43.452577,5.848259,58.78
1,2022-10-02 09:53:01,7026ce5b-ba7f-4804-8a30-700ea501438e,2.745184,"Colas, Thés glacés et Sodas",45.500198,5.054081,122.45
2,2022-06-11 13:08:35,53b76d93-0577-4dca-bc7b-dc493120c3be,1.039721,Laits,48.739485,2.238596,43.05
3,2022-03-21 14:38:27,37467a5c-f833-494b-9e15-6126c173f825,2.798653,Yaourt et Compotes,48.739692,2.235733,107.43
4,2022-08-12 17:59:13,cfd39ed9-3140-4af5-9f72-77881aa6c2a8,3.318023,Viande Surgelée,48.737227,2.240549,82.11
...,...,...,...,...,...,...,...
995,2022-10-18 16:37:09,a34c0e2e-2def-49bd-9e62-39e80cd219f8,3.271166,Chips et Tortillas,48.738384,2.241215,50.17
996,2022-06-19 18:09:58,db726554-ea0d-422d-b4de-39efa949f60c,3.421687,Bonbons,48.739359,2.239731,115.30
997,2022-12-24 08:02:42,85807d39-10ab-445d-a034-9ab6e57e73c4,3.362808,Pains,48.739799,2.241806,161.75
998,2022-02-22 15:50:40,94127b9f-1366-4bbe-afea-7cd77225da52,3.395711,Plats Cuisinés Surgelés,48.799660,5.963028,139.20
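When joining features to the target, it can help to confirm that no rows were lost or duplicated in the merge. The sketch below uses placeholder data and the `validate` and `indicator` options of pandas `merge`. The column `ExampleFeature` and all values are made up for illustration.

```python
import pandas as pd

keys = ["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]

# placeholder feature and target values, for illustration only
features = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "b"],
        "POINT_IN_TIME": pd.to_datetime(["2022-03-01 10:00:00", "2022-03-02 11:00:00"]),
        "ExampleFeature": [1.5, 2.5],
    }
)
target = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "b"],
        "POINT_IN_TIME": pd.to_datetime(["2022-03-01 10:00:00", "2022-03-02 11:00:00"]),
        "Target": [10.0, 20.0],
    }
)

# validate raises an error if the keys are not unique on either side;
# indicator adds a _merge column showing where each row was found
merged = features.merge(target, on=keys, how="left", validate="one_to_one", indicator=True)
unmatched_rows = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")

print(unmatched_rows)
print(merged)
```

A non-zero count of unmatched rows would suggest that the point-in-time offset was not removed cleanly before merging.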


## Deploying features

Learning Objectives

In this section you will learn:
* feature readiness
* feature list status
* how to deploy a feature list

### Feature readiness

To help differentiate features that are in the prototype stage and features that are ready for production, a feature version can have one of four readiness levels:

- PRODUCTION_READY: ready for deployment in production environments.
- PUBLIC_DRAFT: shared for feedback purposes.
- DRAFT: in the prototype stage.
- DEPRECATED: not advised for use in either training or prediction.

In [20]:
# view the readiness of the features
catalog.list_features()

Unnamed: 0,id,name,dtype,readiness,online_enabled,tables,primary_tables,entities,primary_entities,created_at
0,64c13591e6bdf4dd02f040e0,Target,FLOAT,DRAFT,False,[GROCERYINVOICE],[GROCERYINVOICE],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:58.247
1,64c13590e6bdf4dd02f040dd,StateMeanLongitude,FLOAT,DRAFT,False,[GROCERYCUSTOMER],[GROCERYCUSTOMER],[frenchstate],[frenchstate],2023-07-26 15:02:49.907
2,64c13590e6bdf4dd02f040dc,StateMeanLatitude,FLOAT,DRAFT,False,[GROCERYCUSTOMER],[GROCERYCUSTOMER],[frenchstate],[frenchstate],2023-07-26 15:02:48.723
3,64c13590e6bdf4dd02f040db,CustomerInventoryMostFrequent_4w,VARCHAR,DRAFT,False,"[GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT]",[INVOICEITEMS],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:47.564
4,64c13590e6bdf4dd02f040da,CustomerInventoryEntropy_4w,FLOAT,DRAFT,False,"[GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT]",[INVOICEITEMS],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:44.682


When a feature has been reviewed and is ready for production, its readiness can be upgraded.

In [21]:
# get CustomerInventoryEntropy_4w
customer_inventory_entropy_4w = catalog.get_feature("CustomerInventoryEntropy_4w")

In [22]:
# check feature definition file
customer_inventory_entropy_4w.definition

In [23]:
# change the readiness to production ready
customer_inventory_entropy_4w.update_readiness("PRODUCTION_READY")

# view the readiness of the features
catalog.list_features()

Unnamed: 0,id,name,dtype,readiness,online_enabled,tables,primary_tables,entities,primary_entities,created_at
0,64c13591e6bdf4dd02f040e0,Target,FLOAT,DRAFT,False,[GROCERYINVOICE],[GROCERYINVOICE],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:58.247
1,64c13590e6bdf4dd02f040dd,StateMeanLongitude,FLOAT,DRAFT,False,[GROCERYCUSTOMER],[GROCERYCUSTOMER],[frenchstate],[frenchstate],2023-07-26 15:02:49.907
2,64c13590e6bdf4dd02f040dc,StateMeanLatitude,FLOAT,DRAFT,False,[GROCERYCUSTOMER],[GROCERYCUSTOMER],[frenchstate],[frenchstate],2023-07-26 15:02:48.723
3,64c13590e6bdf4dd02f040db,CustomerInventoryMostFrequent_4w,VARCHAR,DRAFT,False,"[GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT]",[INVOICEITEMS],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:47.564
4,64c13590e6bdf4dd02f040da,CustomerInventoryEntropy_4w,FLOAT,PRODUCTION_READY,False,"[GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT]",[INVOICEITEMS],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:44.682


### Feature list status

Feature lists can be assigned one of five status levels to differentiate between experimental feature lists and those suitable for deployment or already deployed.

- DEPLOYED: Assigned to feature lists with at least one deployed version.
- TEMPLATE: For feature lists that serve as reference templates or safe starting points.
- PUBLIC_DRAFT: For feature lists shared for feedback purposes.
- DRAFT: For feature lists in the prototype stage.
- DEPRECATED: For outdated or unnecessary feature lists.

In [24]:
# view the status of the feature lists
display(catalog.list_feature_lists())

Unnamed: 0,id,name,num_feature,status,deployed,readiness_frac,online_frac,tables,entities,primary_entities,created_at
0,64c13591e6bdf4dd02f040e2,TargetFeature,1,DRAFT,False,0.0,0.0,[GROCERYINVOICE],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:58.438
1,64c13590e6bdf4dd02f040de,CustomerFeatures,4,DRAFT,False,0.25,0.0,"[GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS...","[grocerycustomer, frenchstate]",[grocerycustomer],2023-07-26 15:02:50.214
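
The `readiness_frac` column in the output above reads as the share of features in a list that are PRODUCTION_READY. That definition is an assumption inferred from the numbers shown (one of the four features in CustomerFeatures is PRODUCTION_READY, and the column reports 0.25), not something this tutorial states. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
from typing import List


def readiness_fraction(readiness_levels: List[str]) -> float:
    """Share of features marked PRODUCTION_READY (assumed definition)."""
    if not readiness_levels:
        return 0.0
    ready = sum(1 for level in readiness_levels if level == "PRODUCTION_READY")
    return ready / len(readiness_levels)


# readiness of the four features in CustomerFeatures, as listed earlier
customer_features = ["DRAFT", "DRAFT", "DRAFT", "PRODUCTION_READY"]
print(readiness_fraction(customer_features))  # 0.25
```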


When a feature list is ready for review, its status can be updated.

In [25]:
# get the CustomerFeatures feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")

# update the status to PUBLIC_DRAFT
customer_feature_list.update_status("PUBLIC_DRAFT")

# view the status of the feature lists
display(catalog.list_feature_lists())

Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.5s


Unnamed: 0,id,name,num_feature,status,deployed,readiness_frac,online_frac,tables,entities,primary_entities,created_at
0,64c13591e6bdf4dd02f040e2,TargetFeature,1,DRAFT,False,0.0,0.0,[GROCERYINVOICE],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:58.438
1,64c13590e6bdf4dd02f040de,CustomerFeatures,4,PUBLIC_DRAFT,False,0.25,0.0,"[GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS...","[grocerycustomer, frenchstate]",[grocerycustomer],2023-07-26 15:02:50.214


### Deploying a feature list

In [26]:
# deploy the customer feature list
deployment = customer_feature_list.deploy(make_production_ready=True)
deployment.enable()

# view the status of the feature lists
display(catalog.list_feature_lists())

Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.5s
Done! |████████████████████████████████████████| 100% in 3.4s (0.30%/s)         
Done! |████████████████████████████████████████| 100% in 9.7s (0.10%/s)         


Unnamed: 0,id,name,num_feature,status,deployed,readiness_frac,online_frac,tables,entities,primary_entities,created_at
0,64c13591e6bdf4dd02f040e2,TargetFeature,1,DRAFT,False,0.0,0.0,[GROCERYINVOICE],[grocerycustomer],[grocerycustomer],2023-07-26 15:02:58.438
1,64c13590e6bdf4dd02f040de,CustomerFeatures,4,DEPLOYED,True,1.0,1.0,"[GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS...","[grocerycustomer, frenchstate]",[grocerycustomer],2023-07-26 15:02:50.214


### Why deploy?

When you deploy a feature list, behind the scenes the Feature Store starts regularly pre-calculating and caching feature values. This can significantly reduce the latency of feature serving.
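
To make the latency argument concrete, here is a toy sketch of pre-calculating and caching in plain Python. It is not how featurebyte implements deployment; the class, the function, and the fake feature are all hypothetical. It only shows that once values are refreshed ahead of time, serving becomes a lookup and the expensive computation is not repeated per request.

```python
from typing import Callable, Dict, Iterable, List


class PrecomputedFeatureCache:
    """Toy stand-in for a store that refreshes feature values ahead of requests."""

    def __init__(self, compute: Callable[[str], float]) -> None:
        self._compute = compute
        self._cache: Dict[str, float] = {}

    def refresh(self, entity_ids: Iterable[str]) -> None:
        # in a real deployment this step would run on a schedule
        for entity_id in entity_ids:
            self._cache[entity_id] = self._compute(entity_id)

    def serve(self, entity_id: str) -> float:
        # serving is a plain lookup; the expensive computation is not repeated
        return self._cache[entity_id]


compute_calls: List[str] = []


def expensive_feature(entity_id: str) -> float:
    # placeholder for an aggregation over four weeks of invoices
    compute_calls.append(entity_id)
    return float(len(entity_id))


cache = PrecomputedFeatureCache(expensive_feature)
cache.refresh(["customer-a", "customer-bb"])
values = [cache.serve("customer-a"), cache.serve("customer-a"), cache.serve("customer-bb")]
print(values, len(compute_calls))
```

Three requests are served, but the expensive function runs only twice, once per entity during the refresh.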

## Serving and consuming features

Learning Objectives

In this section you will learn:
* the point in time used for production serving
* how to create a Python function to consume a feature list
* how to consume a feature list

### Point in time for deployment

The production feature serving API uses the current time as its point in time, so a request does not carry a timestamp. To consume the feature list, send only the primary entity values, keyed by the entity's serving name.
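
As a rough sketch of what such a request body looks like, the helper below builds the JSON payload from entity rows alone. The `entity_serving_names` key and the `GROCERYCUSTOMERGUID` serving name come from this tutorial; the helper name, the customer value, and the guard against a `POINT_IN_TIME` key are illustrative assumptions, not part of the featurebyte API.

```python
import json
from typing import Any, Dict, List


def build_online_request(entity_rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the JSON body for an online serving request.

    Only entity serving names are sent; no timestamp field is included.
    """
    for row in entity_rows:
        # illustrative guard: the serving API uses the current time itself
        if "POINT_IN_TIME" in row:
            raise ValueError("online requests should not carry a point in time")
    return {"entity_serving_names": entity_rows}


# "customer-guid-placeholder" is a hypothetical value, not a real customer id
payload = build_online_request([{"GROCERYCUSTOMERGUID": "customer-guid-placeholder"}])
print(json.dumps(payload))
```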

### Automatically create a Python function for consuming the API

The generated client code is available either as a Python template or as a shell script that sends the request with the curl command.

For the Python template, set the language parameter to 'python'.
For the shell script, set the language parameter to 'sh'.

In [27]:
# get a python template for consuming the feature serving API
sample_code = deployment.get_online_serving_code(language="python")
print(sample_code)

Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 0.7s
from typing import Any, Dict

import pandas as pd
import requests


def request_features(entity_serving_names: Dict[str, Any]) -> pd.DataFrame:
    """
    Send POST request to online serving endpoint

    Parameters
    ----------
    entity_serving_names: Dict[str, Any]
        Entity serving name values to used for serving request

    Returns
    -------
    pd.DataFrame
    """
    response = requests.post(
        url="https://tutorials.featurebyte.com/api/v1/deployment/64c13623e6bdf4dd02f040ec/online_features",
        headers={"Content-Type": "application/json", "active-catalog-id": "64c13575e6bdf4dd02f040d0", "Authorization": "Bearer nspcvgX-gauPK5qieXmBUiwXXK9Z-EMEc75Qqmwm_cU"},
        json={"entity_serving_names": entity_serving_names},
    )
    assert response.status_code == 200, response.json()
    return pd.DataFrame.from_dict(response.json()["features"])


request_features([{"GROCERYCUSTOM

Copy the online serving code that was generated above, paste it into the cell below, then run it.

In [28]:
# replace the contents of this Python code cell with the output from deployment.get_online_serving_code(language="python")
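
The generated template embeds the bearer token directly in the request headers. If you plan to share the notebook, one option is to read the token from an environment variable instead. The sketch below builds the same three headers that way; the helper name and the `FEATUREBYTE_API_TOKEN` variable name are hypothetical choices, not something featurebyte defines.

```python
import os
from typing import Dict


def serving_headers(
    catalog_id: str, token_env_var: str = "FEATUREBYTE_API_TOKEN"
) -> Dict[str, str]:
    """Build request headers, reading the bearer token from the environment."""
    token = os.environ.get(token_env_var)
    if not token:
        raise RuntimeError(f"set the {token_env_var} environment variable first")
    return {
        "Content-Type": "application/json",
        "active-catalog-id": catalog_id,
        "Authorization": f"Bearer {token}",
    }


# for demonstration only; normally the variable is exported outside the notebook
os.environ.setdefault("FEATUREBYTE_API_TOKEN", "example-token")
headers = serving_headers("64c13575e6bdf4dd02f040d0")
print(sorted(headers))
```

The resulting dictionary can be passed as the `headers` argument of `requests.post` in the generated function.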

### Concept: Batch request table

A BatchRequestTable object is a representation of a table in the feature store that specifies entity values for batch serving.

In [29]:
# this is a new use case, a daily batch run for customers who were active in the latest 24 hours

# filter the invoice view to get customers who had an invoice in the latest 24 hours
batch_request_timestamp = pd.Timestamp.now(tz="utc")
# named recent_filter to avoid shadowing Python's built-in filter()
recent_filter = grocery_invoice_view["Timestamp"] > batch_request_timestamp - pd.to_timedelta(
    24, unit="hour"
)
recently_active_view = grocery_invoice_view[recent_filter].copy()

display(recently_active_view.preview())

Unnamed: 0,GroceryInvoiceGuid,GroceryCustomerGuid,Timestamp,tz_offset,Amount


In [30]:
# create a batch request table from the filtered view
# note that the table does not contain a prediction point in time
# batch requests use the batch run time as the point in time
batch_request_table = recently_active_view.create_batch_request_table(
    "customer batch request for customers active in the latest 24 hours as at "
    + str(batch_request_timestamp),
    columns=["GroceryCustomerGuid"],
    columns_rename_mapping={"GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)

Done! |████████████████████████████████████████| 100% in 6.5s (0.16%/s)         
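
To see what the 24-hour filter and the column rename do to the data, here is a toy in-memory version using only the standard library. The invoice rows and the fixed timestamp are made up for illustration; the real view is evaluated in the data warehouse, not in Python lists.

```python
from datetime import datetime, timedelta, timezone

# fixed "batch run" time so the example is reproducible
now = datetime(2023, 7, 26, 15, 0, tzinfo=timezone.utc)
cutoff = now - timedelta(hours=24)

# hypothetical invoice rows
invoices = [
    {"GroceryCustomerGuid": "a", "Timestamp": now - timedelta(hours=2)},
    {"GroceryCustomerGuid": "b", "Timestamp": now - timedelta(hours=30)},
    {"GroceryCustomerGuid": "c", "Timestamp": now - timedelta(hours=5)},
]

# keep invoices newer than the cutoff
recent = [row for row in invoices if row["Timestamp"] > cutoff]

# keep only the customer column, renamed to its serving name
batch_request_rows = [{"GROCERYCUSTOMERGUID": row["GroceryCustomerGuid"]} for row in recent]
print(batch_request_rows)
```

Customer `b` is dropped because its only invoice is 30 hours old. The remaining rows carry entity values only, with no point-in-time column.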


### Concept: Batch feature table

A BatchFeatureTable object is a representation of a table in the feature store that contains feature values from batch serving. The object includes metadata on the Deployment and the BatchRequestTable used to create it.

In [31]:
# enable the deployment - this is a prerequisite
if not deployment.enabled:
    deployment.enable()

In [32]:
# request batch features
batch_features = deployment.compute_batch_feature_table(
    batch_request_table=batch_request_table,
    batch_feature_table_name="customer batch feature data for customers active in the latest 24 hours as at "
    + str(batch_request_timestamp),
)

Done! |████████████████████████████████████████| 100% in 6.5s (0.15%/s)         


In [33]:
# display the contents of the batch feature table
display(batch_features.to_pandas())

Downloading table |████████████████████████████████████████| 0 in 0.1s (0.00/s) 


Unnamed: 0,GROCERYCUSTOMERGUID,CustomerInventoryEntropy_4w,CustomerInventoryMostFrequent_4w,StateMeanLatitude,StateMeanLongitude


In [34]:
# display the batch feature table metadata
batch_features.info()

0,1
name,customer batch feature data for customers active in the latest 24 hours as at 2023-07-26 15:05:29.790569+00:00
created_at,2023-07-26 15:05:40
updated_at,
batch_request_table_name,customer batch request for customers active in the latest 24 hours as at 2023-07-26 15:05:29.790569+00:00
deployment_name,Deployment with CustomerFeatures_V230726
table_details,database_name  TUTORIAL  schema_name  TUTORIAL_PROD  table_name  BATCH_FEATURE_TABLE_64c1364217718fc7a4fc26c8

0,1
database_name,TUTORIAL
schema_name,TUTORIAL_PROD
table_name,BATCH_FEATURE_TABLE_64c1364217718fc7a4fc26c8


### Disable a deployment

In [35]:
# disable the feature list deployment
deployment.disable()

Done! |████████████████████████████████████████| 100% in 6.5s (0.16%/s)         


## Next Steps

Now that you've completed the deep dive materializing features tutorial, you can put your knowledge into practice or learn more:
1. Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" catalogs
2. Learn more about feature governance via the "Quick Start Feature Governance" tutorial
3. Learn about data modeling via the "Deep Dive Data Modeling" tutorial