# Deep Dive Tutorial: Materializing Features

## Learning Objectives

In this tutorial you will learn:
1. How to construct an observation set
2. How features, entities, and observation sets are used together
3. How to preview features
4. How to get historical values
5. How and why to deploy features
6. How to serve and consume deployed features

## Set up the prerequisites

Learning Objectives

In this section you will:
* start your local featurebyte server
* import libraries
* learn about catalogs
* activate a pre-built catalog

### Load the featurebyte library and connect to the local instance of featurebyte

In [1]:
# library imports
import pandas as pd
import numpy as np

# load the featurebyte SDK
import featurebyte as fb

# start the local server, then wait for it to be healthy before proceeding
fb.playground()

### Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up

Note that creating a pre-built catalog is not a step you will do in real life. This function is specific to this tutorial: it skips over many of the preparatory steps and gets you to a point where you can materialize features.

In a real-life project you would do data modeling, declaring the tables, entities, and the associated metadata. This would not be a frequent task, but forms the basis for best-practice feature engineering.

In [2]:
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *

# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.DeepDiveMaterializingFeatures)

### Load the tables for this catalog

In [3]:
# get the tables for this catalog
grocery_customer_table = catalog.get_table("GROCERYCUSTOMER")
grocery_items_table = catalog.get_table("INVOICEITEMS")
grocery_invoice_table = catalog.get_table("GROCERYINVOICE")
grocery_product_table = catalog.get_table("GROCERYPRODUCT")

### Create views for the tables in this catalog

In [4]:
# create the views
grocery_customer_view = grocery_customer_table.get_view()
grocery_invoice_view = grocery_invoice_table.get_view()
grocery_items_view = grocery_items_table.get_view()
grocery_product_view = grocery_product_table.get_view()

## How to construct an observation set

Learning Objectives

In this section you will learn:
* the purpose of observation sets
* the relationship between entities, point in time, and observation sets
* how to construct an observation set

### Concept: Materialization

A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.

Features are materialized on demand to fulfill historical requests. For prediction purposes, feature values are instead generated through a batch process called a "Feature Job", which is scheduled according to the feature job settings associated with each feature.

### Concept: Observation set

An observation set combines entity key values and historical points-in-time, for which you wish to materialize feature values.

The observation set can be a Pandas DataFrame or an ObservationTable object representing an observation set in the feature store.

### Concept: Point in time

A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.

It is a crucial aspect of historical feature serving. By supplying a point-in-time, you obtain the feature values as they would have been known at that moment, so models can be trained and tested on past data without using information from the future.
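
To make the idea concrete, the sketch below computes a windowed feature as of a point in time using plain Python rather than the FeatureByte SDK. The invoice records, the `spend_as_of` helper, and the 28-day window are all hypothetical, and excluding events at exactly the point in time is an assumption of this sketch. It only aims to show that the same feature definition yields different values at different points in time.

```python
from datetime import datetime, timedelta

# Hypothetical invoice events for one customer: (timestamp, amount)
invoices = [
    (datetime(2022, 5, 1, 10, 0), 20.0),
    (datetime(2022, 5, 20, 9, 30), 35.0),
    (datetime(2022, 6, 2, 17, 45), 50.0),
]


def spend_as_of(events, point_in_time, window=timedelta(days=28)):
    """Sum the amounts of events inside [point_in_time - window, point_in_time)."""
    window_start = point_in_time - window
    return sum(
        amount
        for timestamp, amount in events
        if window_start <= timestamp < point_in_time
    )


# Evaluate the same definition at two different points in time
spend_may = spend_as_of(invoices, datetime(2022, 5, 25))
spend_june = spend_as_of(invoices, datetime(2022, 6, 10))
print(spend_may, spend_june)
```

The June invoice is invisible to the May point in time, which is what keeps future information out of historical feature values.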

An observation set is created as a Pandas DataFrame containing the keys for the primary entity, and points in time. The column name for the primary entity must be its serving name, and the column name for the point in time must be "POINT_IN_TIME".
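
As a minimal sketch of that layout, the cell below builds a tiny observation set by hand and checks it against the two naming rules. The customer keys are made-up placeholders, `check_observation_set` is a hypothetical helper rather than part of the FeatureByte SDK, and `GROCERYCUSTOMERGUID` is assumed to be the serving name of the customer entity, as used later in this tutorial. The datetime check is an extra precaution assumed here, not a documented requirement.

```python
import pandas as pd

# Placeholder entity keys; real keys would come from the customer table
observation_set_example = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["customer-a", "customer-b", "customer-a"],
        "POINT_IN_TIME": pd.to_datetime(
            ["2022-06-01 09:00:00", "2022-06-01 09:00:00", "2022-07-15 12:30:00"]
        ),
    }
)


def check_observation_set(df, serving_name):
    """Return a list of problems found in a candidate observation set."""
    problems = []
    # both the serving-name column and POINT_IN_TIME must be present
    for column in (serving_name, "POINT_IN_TIME"):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    # assumed precaution: points in time should be real datetimes, not strings
    if "POINT_IN_TIME" in df.columns and not pd.api.types.is_datetime64_any_dtype(
        df["POINT_IN_TIME"]
    ):
        problems.append("POINT_IN_TIME is not a datetime column")
    return problems


problems = check_observation_set(observation_set_example, "GROCERYCUSTOMERGUID")
print(problems)
```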

### Example: Create an observation set based upon events

Some use cases are about events, and require predictions to be triggered when a specified event occurs.

For a use case requiring predictions about a grocery customer whenever an invoice event occurs, the observation set may be sampled from historical invoices.

In [5]:
# show the serving name for grocery customer
entity_list = catalog.list_entities()
display(entity_list[entity_list.name == "grocerycustomer"])

In [6]:
# get a sample of 200 customer IDs and invoice event timestamps from 01-Apr-2022 to 31-Mar-2023
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & (
    grocery_invoice_view["Timestamp"] <= pd.to_datetime("2023-03-31")
)
observation_set = (
    grocery_invoice_view[filter]
    .sample(200)[["GroceryCustomerGuid", "Timestamp"]]
    .rename(
        {
            "Timestamp": "POINT_IN_TIME",
            "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
        },
        axis=1,
    )
)
display(observation_set)

### Concept: Observation table

An ObservationTable object is a representation of an observation set in the feature store. Unlike a local Pandas DataFrame, the ObservationTable is part of the catalog and can be shared or reused.

ObservationTable objects can be created from a source table or from a view after subsampling.

### Example: Create an observation table based upon events

In [7]:
# create a large observation table from a view
# observation tables are the recommended workflow for training data

# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & (
    grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01")
)
observation_set_view = grocery_invoice_view[filter].copy()

# create a new observation table
observation_table = observation_set_view.create_observation_table(
    name="10000 customers who were active between 01-Apr-2022 and 31-Mar-2023",
    sample_rows=10000,
    columns=["Timestamp", "GroceryCustomerGuid"],
    columns_rename_mapping={
        "Timestamp": "POINT_IN_TIME",
        "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
    },
)

# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())

### Example: Create an observation set based upon regularly scheduled batch predictions

Some use cases require predictions to be triggered at regular time periods. Some use cases have conditions for which only a subset of entities require predictions.

A use case requiring monthly predictions for recently active customers may use an observation set containing sample customer IDs combined with predefined timestamps.

In [8]:
# define a function to list a sample of the customers who were active in a given month
def get_recently_active_customers(month_number):
    # filter the invoices by month
    filter = (grocery_invoice_view["Timestamp"].dt.month == month_number) & (
        grocery_invoice_view["Timestamp"].dt.year == 2022
    )
    # get a list of customers who made an invoice in the month
    recently_active_customers = (
        grocery_invoice_view[filter].sample(200)["GroceryCustomerGuid"].unique()
    )
    # get the start of the month
    point_in_time = pd.Timestamp(f"2022-{month_number}-01")
    # get the end of the month
    end_of_month = point_in_time + pd.DateOffset(months=1)
    # get the point in time by subtracting 0.001 second from the end of the month
    point_in_time = end_of_month - pd.Timedelta(seconds=0.001)
    # combine the point in time with the customer IDs
    recently_active_customers = pd.DataFrame(
        {
            "GROCERYCUSTOMERGUID": recently_active_customers,
            "POINT_IN_TIME": point_in_time,
        }
    )
    return recently_active_customers


# create an observation set comprised of up to 200 customers per month who were active in that month in the second half of 2022
observation_set = pd.concat(
    [get_recently_active_customers(month_number) for month_number in range(7, 13)],
    ignore_index=True,
)
display(observation_set)

## Previewing features

Learning Objectives

In this section you will learn:
* how to preview features
* the limitations of previews

### Example: Preview features

During feature prototyping, new features may not have been saved to the catalog. A data scientist will want to preview sample feature values to sanity check their feature declaration.

In [9]:
# create a lookup feature that is the city in which the customer resides
customer_city_lookup = grocery_customer_view.City.as_feature("CustomerCity")

# preview materialized values for the unsaved feature
display(customer_city_lookup.preview(observation_set.sample(5)))

Feature previews are not suited to creating training files or feature serving. Previews have a limitation of 50 rows and do not create an audit trail.
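
One way to stay inside that limit is to cap the sample before calling `preview`. The helper below is a hypothetical convenience written in plain pandas, not part of the FeatureByte SDK. The 50-row cap is taken from the statement above, and the demo rows are invented.

```python
import pandas as pd

PREVIEW_ROW_LIMIT = 50  # limit quoted in this tutorial


def sample_for_preview(observation_set, n=5, random_state=0):
    """Return at most `n` rows, never more than the preview row limit."""
    n = min(n, PREVIEW_ROW_LIMIT, len(observation_set))
    return observation_set.sample(n, random_state=random_state)


# Invented observation set with more rows than a preview accepts
demo = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": [f"customer-{i}" for i in range(80)],
        "POINT_IN_TIME": pd.Timestamp("2022-12-31 23:59:59"),
    }
)
preview_rows = sample_for_preview(demo, n=200)
print(len(preview_rows))
```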

## Create training data

Learning Objectives

In this section you will learn:
* how to design an observation set suitable for training data
* how to get historical values for the target
* how to get historical values for a feature list, and create training data

### Design an Observation Set for Training

Observation Training Design: A training data observation set should typically meet the following criteria:
* is collected from a time period that starts no earlier than the earliest data availability timestamp plus the longest time window in the features
* is collected from a time period that ends no later than the latest data timestamp less the time window of the target value
* uses points in time that align with the anticipated timing of the use case inference, whether it is based on a regular schedule, triggered by an event, or follows any other timing mechanism
* does not have duplicate rows
* has a column containing the primary entity of the use case, using its serving name
* has a column, named "POINT_IN_TIME", containing the points in time
* for the same entity key, has points in time separated by intervals greater than the horizon of the target, to avoid leakage
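
Two of these criteria, no duplicate rows and sufficient spacing between points in time for the same entity, can be checked mechanically. The following is a sketch in pandas under stated assumptions: `find_leakage_risks` is a hypothetical helper, the sample rows are invented, and the 14-day horizon is only an illustrative value.

```python
import pandas as pd


def find_leakage_risks(observation_set, serving_name, target_horizon):
    """Return rows whose point in time falls within the target horizon
    of the previous point in time for the same entity."""
    ordered = observation_set.drop_duplicates().sort_values(
        [serving_name, "POINT_IN_TIME"]
    )
    # time elapsed since the previous observation of the same entity
    gap = ordered.groupby(serving_name)["POINT_IN_TIME"].diff()
    # the first observation of each entity has no gap (NaT) and is never flagged
    return ordered[gap <= target_horizon]


# Invented observation set: customer "a" has two points in time 9 days apart
sample = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": ["a", "a", "a", "b"],
        "POINT_IN_TIME": pd.to_datetime(
            ["2022-03-01", "2022-03-10", "2022-04-01", "2022-03-05"]
        ),
    }
)
risky_rows = find_leakage_risks(sample, "GROCERYCUSTOMERGUID", pd.Timedelta(days=14))
print(risky_rows)
```

Rows flagged this way could be dropped or resampled before the observation set is used for training.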

### Case Study: Predicting Customer Spend

Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.

### Example: Create an observation table for training data

In [10]:
# describe the customer view
display(grocery_customer_view.describe())

Note that there are 471 unique customers.

In [11]:
# describe the invoice view
display(grocery_invoice_view.describe())

Note that the earliest data timestamp is at the beginning of 2022, and the timestamps end in the present.

In [12]:
# get the customer feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")

# display details about the features in the customer feature list
display(customer_feature_list.list_features())

Note that the longest time window in the features is 4 weeks.

In [13]:
# get the target
import json

next_customer_sales_14d_target = catalog.get_target("next_customer_sales_14d")

# display details about the target
info = next_customer_sales_14d_target.info()
display_info = {
    key: info[key] for key in ("id", "target_name", "entities", "window", "primary_table")
}
print(json.dumps(display_info, indent=4))

Note that the time window for the target is 14 days.

We can conclude that it would be safe for the training data observation set's points in time to commence on 29-Jan-2022 and end 14 days before the present.
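
The arithmetic behind that conclusion can be sketched with the standard library. `safe_observation_window` is a hypothetical helper, the earliest timestamp is assumed to be 01-Jan-2022 (the "beginning of 2022" noted above), and the latest timestamp is a placeholder standing in for the present.

```python
from datetime import datetime, timedelta


def safe_observation_window(earliest_data, latest_data, longest_feature_window, target_window):
    """Return the (start, end) range in which points in time have both
    a full feature window behind them and a full target window ahead."""
    start = earliest_data + longest_feature_window
    end = latest_data - target_window
    return start, end


# Windows taken from the notes above; the latest timestamp is a placeholder
start, end = safe_observation_window(
    earliest_data=datetime(2022, 1, 1),
    latest_data=datetime(2023, 6, 30),  # hypothetical "present"
    longest_feature_window=timedelta(weeks=4),
    target_window=timedelta(days=14),
)
print(start.date(), end.date())
```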

We will create an observation table for invoice dates from Feb-22 to Jan-23.

In [14]:
# create a large observation table from a view

# filter to get Feb-22 to Jan-23
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-02-01")) & (
    grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-02-01")
)
observation_set_view = grocery_invoice_view[filter].copy()

# create a new observation table
observation_table_large = observation_set_view.create_observation_table(
    name="1000 customers who were active between 01-Feb-2022 and 31-Jan-2023",
    sample_rows=1000,
    columns=["Timestamp", "GroceryCustomerGuid"],
    columns_rename_mapping={
        "Timestamp": "POINT_IN_TIME",
        "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
    },
)

# if the observation table isn't too large, you can materialize it
display(observation_table_large.to_pandas())

### Example: Get historical values

In [15]:
# use compute_historical_features to get the feature values for the observation set
training_data_features = customer_feature_list.compute_historical_features(observation_set)
display(training_data_features)

### Example: Get target values

We can materialize the target values by calling `compute_target` or `compute_target_table`.

In [16]:
# materialize the target values for the observation table using compute_target_table
training_data_target_table = next_customer_sales_14d_target.compute_target_table(
    observation_table_large, observation_table_name="next_customer_sales_14d_target_table"
)
training_data_target = training_data_target_table.to_pandas()
display(training_data_target)

### Concept: Historical feature table

A HistoricalFeatureTable object represents a table in the feature store containing historical feature values from a historical feature request. The historical feature values can also be obtained as a Pandas DataFrame, but using a HistoricalFeatureTable object has some benefits such as handling large tables, storing the data in the feature store for reuse, and offering full lineage of the training and test data.

In [17]:
# the syntax is different when using an observation table to create a historical feature table

# Compute the historical feature table
training_table = customer_feature_list.compute_historical_feature_table(
    training_data_target_table,
    historical_feature_table_name="customer training table on 1000 customers who were active between 01-Feb-2022 and 31-Jan-2023",
)
training_data = training_table.to_pandas()
# display the training data
display(training_data)

## Deploying features

Learning Objectives

In this section you will learn:
* feature readiness
* feature list status
* how to deploy a feature list

### Feature readiness

To help differentiate features that are in the prototype stage from features that are ready for production, a feature version can have one of four readiness levels:

PRODUCTION_READY: ready for deployment in production environments.<br>
PUBLIC_DRAFT: shared for feedback purposes.<br>
DRAFT: in the prototype stage.<br>
DEPRECATED: not advised for use in either training or prediction.

In [18]:
# view the readiness of the features
catalog.list_features()

When a feature has been reviewed and is ready for production, its readiness can be upgraded.

In [19]:
# get CustomerInventoryEntropy_4w
customer_inventory_entropy_4w = catalog.get_feature("CustomerInventoryEntropy_4w")

In [20]:
# check feature definition file
customer_inventory_entropy_4w.definition

In [21]:
# upgrade the readiness to PRODUCTION_READY
customer_inventory_entropy_4w.update_readiness("PRODUCTION_READY")

# view the readiness of the features
catalog.list_features()

### Feature list status

Feature lists can be assigned one of five status levels to differentiate between experimental feature lists and those suitable for deployment or already deployed.

- DEPLOYED: Assigned to feature list with at least one deployed version.
- TEMPLATE: For feature lists as reference templates or safe starting points.
- PUBLIC_DRAFT: For feature lists shared for feedback purposes.
- DRAFT: For feature lists in the prototype stage.
- DEPRECATED: For outdated or unnecessary feature lists.

In [22]:
# view the status of the feature lists
display(catalog.list_feature_lists())

When a feature list is ready for review, its status can be updated.

In [23]:
# get the CustomerFeatures feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")

# update the status to PUBLIC_DRAFT
customer_feature_list.update_status("PUBLIC_DRAFT")

# view the status of the feature lists
display(catalog.list_feature_lists())

### Deploying a feature list

In [24]:
# deploy the customer feature list
deployment = customer_feature_list.deploy(make_production_ready=True)
deployment.enable()

# view the status of the feature lists
display(catalog.list_feature_lists())

### Why deploy?

When you deploy a feature list, behind the scenes the Feature Store starts regularly pre-calculating and caching feature values. This can significantly reduce the latency of feature serving.
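
The latency benefit can be illustrated with a toy model in plain Python. This is not how the Feature Store is implemented: `ToyFeatureCache` and its methods are hypothetical, and the "feature" is a trivial stand-in. It only aims to show the split between a scheduled job that computes values and a serving step that merely looks them up.

```python
class ToyFeatureCache:
    """Toy stand-in for a deployed feature: values are refreshed by a
    scheduled job and read back from a cache at serving time."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cache = {}
        self.compute_calls = 0

    def run_feature_job(self, entity_keys):
        # scheduled batch step: pre-calculate and store one value per entity
        for key in entity_keys:
            self._cache[key] = self._compute_fn(key)
            self.compute_calls += 1

    def serve(self, key):
        # serving step: a dictionary lookup, with no recomputation
        return self._cache.get(key)


# Trivial placeholder feature: the length of the entity key
cache = ToyFeatureCache(compute_fn=len)
cache.run_feature_job(["customer-a", "customer-bb"])

# Three serving requests reuse the value computed by the job
served = [cache.serve("customer-a") for _ in range(3)]
print(served, cache.compute_calls)
```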

## Serving and consuming features

Learning Objectives

In this section you will learn:
* the point in time used for production serving
* how to create a Python function to consume a feature list
* how to consume a feature list

### Point in time for deployment

The production feature serving API uses the current time as its point in time. To consume the feature list, send only the primary entity via the serving name.
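
A rough sketch of what such a request body might look like is shown below. The `entity_serving_names` key and the overall payload shape are assumptions made for illustration, `build_online_request` is a hypothetical helper, and the customer key is a placeholder. The code generated by `get_online_serving_code` is the authoritative reference for the real request format.

```python
import json


def build_online_request(serving_name, entity_keys):
    """Build a request body with one record per entity key and no point in time."""
    # assumed payload shape; check it against the generated serving code
    return {"entity_serving_names": [{serving_name: key} for key in entity_keys]}


payload = build_online_request("GROCERYCUSTOMERGUID", ["placeholder-customer-key"])
print(json.dumps(payload, indent=4))
```

Note that the record carries only the serving name and key, because the serving API supplies the point in time itself.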

### Automatically create a Python function for consuming the API

You can generate either a Python template or a shell script. The generated shell script uses the curl command to send the request.

For the Python template, set the language parameter to 'python'. For the shell script, set the language parameter to 'sh'.

In [25]:
# get a python template for consuming the feature serving API
deployment.get_online_serving_code(language="python")

Copy the online serving code that was generated above, paste it into the cell below, then run it.

In [26]:
# replace the contents of this Python code cell with the output from deployment.get_online_serving_code(language="python")

### Concept: Batch request table

A BatchRequestTable object is a representation of a table in the feature store that specifies entity values for batch serving.

In [27]:
# this is a new use case, a daily batch run for customers who were active in the latest 24 hours

# filter the invoice view to get customers who had an invoice in the latest 24 hours
batch_request_timestamp = pd.Timestamp.now(tz="utc")
filter = grocery_invoice_view["Timestamp"] > batch_request_timestamp - pd.to_timedelta(
    24, unit="hour"
)
recently_active_view = grocery_invoice_view[filter].copy()

display(recently_active_view.preview())

In [28]:
# create a batch request table from the filtered view
# note that the table does not contain a prediction point in time
# batch requests use the batch run time as the point in time
batch_request_table = recently_active_view.create_batch_request_table(
    "customer batch request for customers active in the latest 24 hours as at "
    + str(batch_request_timestamp),
    columns=["GroceryCustomerGuid"],
    columns_rename_mapping={"GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)

### Concept: Batch feature table

A BatchFeatureTable object is a representation of a table in the feature store that contains feature values from batch serving. The object includes metadata on the Deployment and the BatchRequestTable used to create it.

In [29]:
# enable the deployment - this is a pre-requisite
if not deployment.enabled:
    deployment.enable()

In [30]:
# request batch features
batch_features = deployment.compute_batch_feature_table(
    batch_request_table=batch_request_table,
    batch_feature_table_name="customer batch feature data for customers active in the latest 24 hours as at "
    + str(batch_request_timestamp),
)

In [31]:
# display the contents of the batch feature table
display(batch_features.to_pandas())

In [32]:
# display the batch feature table metadata
batch_features.info()

### Disable a deployment

In [None]:
# disable the feature list deployment
deployment.disable()

## Next Steps

Now that you've completed the deep dive materializing features tutorial, you can put your knowledge into practice or learn more:
1. Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" catalogs
2. Learn more about feature governance via the "Quick Start Feature Governance" tutorial
3. Learn about data modeling via the "Deep Dive Data Modeling" tutorial