Feature retrieval

1. Overview

Feature retrieval (or serving) is the process of retrieving either historical features or online features from Feast, for the purposes of training or serving a model.

Feast attempts to unify the process of retrieving features in both the historical and online cases. It does this through feature references. One of the major advantages of using Feast is that you have a single semantic reference to a feature. These feature references can be stored alongside your model and loaded into a serving layer, where they can be used for online feature retrieval.

2. Feature references

In Feast, each feature can be uniquely addressed through a feature reference. A feature reference is composed of the following components:

  • Feature Set
  • Feature

These components can be used to create a string-based feature reference as follows:

<feature-set>:<feature>

Feast will attempt to infer the feature-set name if it is not provided, but a feature reference must always provide a feature name.

# Feature references
features = [
    'partner',
    'daily_transactions',
    'customer_feature_set:dependents',
    'customer_feature_set:has_phone_service',
]

target = 'churn'

{% hint style="info" %} Where Features from different Feature Sets share the same name, the Feature Set name (feature-set) is required to disambiguate which Feature is meant. {% endhint %}
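The resolution rule described above can be sketched in plain Python. This is an illustration only, not Feast's implementation: `parse_feature_ref`, `resolve_feature_ref`, and the `registry` dictionary (including the `transaction_feature_set` name) are hypothetical and exist only to show how a bare feature name might be matched to exactly one feature set, and why an ambiguous name needs the `<feature-set>:` prefix.

```python
from typing import Dict, List, Optional, Tuple


def parse_feature_ref(ref: str) -> Tuple[Optional[str], str]:
    """Split '<feature-set>:<feature>' into (feature set or None, feature)."""
    feature_set, _, feature = ref.rpartition(":")
    if not feature:
        raise ValueError(f"feature reference {ref!r} has no feature name")
    return (feature_set or None), feature


def resolve_feature_ref(ref: str, registry: Dict[str, List[str]]) -> Tuple[str, str]:
    """Return (feature set, feature), inferring the feature set when omitted."""
    feature_set, feature = parse_feature_ref(ref)
    if feature_set is not None:
        if feature not in registry.get(feature_set, []):
            raise KeyError(f"{feature!r} not found in feature set {feature_set!r}")
        return feature_set, feature
    # No feature set given: it can only be inferred if exactly one set has the feature.
    candidates = [name for name, feats in registry.items() if feature in feats]
    if len(candidates) != 1:
        raise ValueError(f"cannot infer feature set for {ref!r}: candidates={candidates}")
    return candidates[0], feature


# Hypothetical registry: feature set name -> feature names.
registry = {
    "customer_feature_set": ["partner", "dependents", "has_phone_service"],
    "transaction_feature_set": ["daily_transactions", "dependents"],
}

resolved = [
    resolve_feature_ref(ref, registry)
    for ref in ["partner", "daily_transactions", "customer_feature_set:dependents"]
]
```

In this sketch, a bare `dependents` raises a `ValueError` because two feature sets define it, which mirrors the case where the feature-set name must be supplied.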

Feature references only apply to a single project. Features cannot be retrieved across projects in a single request.

3. Historical feature retrieval

Historical feature retrieval can be done through either the Feast SDK or directly through the Feast Serving gRPC API. Below is an example of historical retrieval from the Churn Prediction Notebook.

# Add the target variable to our feature list.
# `features` and `target` are the feature references defined earlier;
# `client` is assumed to be an already-connected Feast client.
feature_refs = features + [target]

# Retrieve training dataset from Feast. The "entity_df" is a dataframe that contains
# timestamps and entity keys. In this case, it is a dataframe with two columns:
# one timestamp column and one customer id column.
dataset = client.get_batch_features(
    feature_refs=feature_refs,
    entity_rows=entity_df,
)

# Materialize the dataset object to a Pandas DataFrame.
# Alternatively, it is possible to use a file reference if the data is too large.
df = dataset.to_dataframe()

{% hint style="info" %} When no project is specified when retrieving features with get_batch_features(), Feast infers that the features specified belong to the default project. To retrieve from another project, specify the project parameter when retrieving features. {% endhint %}

In the above example, Feast performs a point-in-time-correct query against a single feature set. For each timestamp and entity key combination provided by entity_df, Feast determines the value of every feature in the features list as of that timestamp, then joins those feature values onto that entity key and timestamp. This is repeated for every row in entity_df.

This is called a point-in-time-correct join.
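The lookup described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not how Feast executes the query: the in-memory `feature_data` structure, the helper names `value_as_of` and `point_in_time_join`, the `customer_id` and `datetime` column names, and all values are hypothetical. For each entity row, the sketch picks the most recent feature value whose event timestamp is not later than the row's timestamp.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for ingested feature data:
# feature name -> entity key -> [(event timestamp, value), ...], sorted by timestamp.
FeatureData = Dict[str, Dict[Any, List[Tuple[datetime, Any]]]]


def value_as_of(rows: List[Tuple[datetime, Any]], ts: datetime) -> Any:
    """Return the latest value whose event timestamp is <= ts, or None."""
    timestamps = [t for t, _ in rows]
    idx = bisect_right(timestamps, ts)
    return rows[idx - 1][1] if idx > 0 else None


def point_in_time_join(
    entity_rows: List[Dict[str, Any]],
    feature_data: FeatureData,
    entity_key: str = "customer_id",
) -> List[Dict[str, Any]]:
    """Attach, to every entity row, each feature's value as of that row's timestamp."""
    joined = []
    for row in entity_rows:
        out = dict(row)
        for name, per_entity in feature_data.items():
            rows = per_entity.get(row[entity_key], [])
            out[name] = value_as_of(rows, row["datetime"])
        joined.append(out)
    return joined


feature_data = {
    "daily_transactions": {
        1001: [
            (datetime(2020, 1, 1), 3.0),
            (datetime(2020, 1, 2), 5.0),
            (datetime(2020, 1, 3), 7.0),
        ],
        1002: [(datetime(2020, 1, 1), 1.0)],
    }
}
entity_rows = [
    {"customer_id": 1001, "datetime": datetime(2020, 1, 2, 12)},
    {"customer_id": 1001, "datetime": datetime(2019, 12, 31)},
    {"customer_id": 1002, "datetime": datetime(2020, 1, 3)},
]
joined = point_in_time_join(entity_rows, feature_data)
```

The second entity row predates every ingested value, so its feature is left empty rather than filled from a later observation.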

Feast allows users to retrieve features from any number of feature sets and join them together in a single response dataset. The only requirement is that the user provides the correct entities in order to look up the features.

Point-in-time-correct Join

Below is another example of how a point-in-time-correct join works. We have two dataframes. The first is the entity dataframe, which contains timestamps, entities, and labels. The user would like to join driver features from the driver dataframe onto this entity dataframe, producing an output dataframe that contains both labels and features. They would then train their model on this output.

Input 1: Entity DataFrame

Input 2: Driver DataFrame

Typically, the Input 1 DataFrame would be provided by the user, and the Input 2 DataFrame would already be ingested into Feast. To join these two, the user would call Feast as follows:

# Feature references
features = [
    'conv_rate',
    'acc_rate',
    'avg_daily_trips',
    'trip_completed',
]

dataset = client.get_batch_features(
    feature_refs=features,  # This is a list of feature references
    entity_rows=entity_df,  # This is the entity dataframe above
)

# This prints out the dataframe below 
print(dataset.to_dataframe())

Output: Joined DataFrame

Feast is able to intelligently join feature data with different timestamps to a single basis table in a point-in-time-correct way. This allows users to join daily batch data with high-frequency event data transparently. They simply need to know the feature names.

{% hint style="info" %} Feast can retrieve features from any number of feature sets, as long as they share the same entities. {% endhint %}

Point-in-time-correct joins also prevent feature leakage by reconstructing the state of the world as it was at a single point in time, instead of simply joining features based on the nearest timestamps.
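The difference between the two strategies can be shown with a small, self-contained comparison. Everything here is hypothetical (the helper names, the `conv_rate` values, and the timestamps); it only illustrates why a nearest-timestamp match can pull in a value observed after the label's timestamp, while an as-of lookup cannot.

```python
from datetime import datetime
from typing import Any, List, Optional, Tuple

Row = Tuple[datetime, Any]  # (event timestamp, feature value)


def nearest_value(rows: List[Row], ts: datetime) -> Any:
    """Naive join: pick the value closest in time, even if it lies in the future."""
    return min(rows, key=lambda r: abs(r[0] - ts))[1]


def as_of_value(rows: List[Row], ts: datetime) -> Optional[Any]:
    """Point-in-time-correct join: pick the latest value observed at or before ts."""
    past = [r for r in rows if r[0] <= ts]
    return max(past, key=lambda r: r[0])[1] if past else None


# Hypothetical conv_rate history for one driver.
conv_rate = [
    (datetime(2020, 1, 1, 0, 0), 0.50),
    (datetime(2020, 1, 2, 0, 0), 0.90),
]
label_time = datetime(2020, 1, 1, 20, 0)

leaky = nearest_value(conv_rate, label_time)   # uses the value from 4 hours in the future
correct = as_of_value(conv_rate, label_time)   # uses the value known at label time
```

A model trained on the `leaky` value would see information that was not available when the label was recorded.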

4. Online feature retrieval

Online feature retrieval works in much the same way as batch retrieval, with one important distinction: Online stores only maintain the current state of features. No historical data is served.

features = [
    'conv_rate',
    'acc_rate',
    'avg_daily_trips',
]

data = client.get_online_features(
    feature_refs=features,  # Contains only feature references
    entity_rows=entity_rows,  # Contains only entities (driver ids)
)
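The "current state only" behaviour can be pictured with a toy in-memory store. This is a sketch under stated assumptions and not Feast's Redis-backed implementation: the `LatestValueStore` class, its `ingest` and `lookup` methods, and the sample values are all hypothetical. Each (entity, feature) key holds a single value, so a lookup takes no timestamp and earlier values are not retrievable.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple


class LatestValueStore:
    """Toy online store: keeps only the newest value per (entity, feature)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[Any, str], Tuple[datetime, Any]] = {}

    def ingest(self, entity: Any, feature: str, ts: datetime, value: Any) -> None:
        """Overwrite the stored value unless the incoming row is older."""
        key = (entity, feature)
        current = self._rows.get(key)
        if current is None or ts >= current[0]:
            self._rows[key] = (ts, value)

    def lookup(self, feature_refs: List[str], entities: List[Any]) -> List[Dict[str, Any]]:
        """Return the current value of each feature for each entity (None if absent)."""
        return [
            {ref: self._rows.get((entity, ref), (None, None))[1] for ref in feature_refs}
            for entity in entities
        ]


store = LatestValueStore()
store.ingest(1001, "conv_rate", datetime(2020, 1, 1), 0.50)
store.ingest(1001, "conv_rate", datetime(2020, 1, 2), 0.90)
store.ingest(1001, "conv_rate", datetime(2020, 1, 1, 12), 0.60)  # late, older row: ignored

online = store.lookup(["conv_rate", "acc_rate"], [1001, 1002])
```

Because nothing but the newest row is kept, training data that needs past values has to come from historical retrieval instead.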

{% hint style="info" %} When no project is specified when retrieving features with get_online_features(), Feast infers that the features specified belong to the default project. To retrieve from another project, specify the project parameter when retrieving features. {% endhint %}

Online serving with Feast is built for very low latency. Feast Serving provides a gRPC API that is backed by Redis. Client libraries are available for Python, Go, and Java.