# Overview

In this tutorial, we'll use Feast to generate training data and power online model inference for a 
ride-sharing driver satisfaction prediction model. Feast solves several common issues in this flow:

1. **Training-serving skew and complex data joins:** Feature values often exist across multiple tables. Joining 
   these datasets can be complicated, slow, and error-prone.
   * Feast joins these tables with battle-tested logic that ensures _point-in-time_ correctness so future feature 
     values do not leak to models.
2. **Online feature availability:** At inference time, models often need access to features that aren't readily 
   available and need to be precomputed from other data sources.
   * Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and 
     ensures necessary features are consistently _available_ and _freshly computed_ at inference time.
3. **Feature and model versioning:** Different teams within an organization are often unable to reuse 
   features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need 
   to be versioned, for example when running A/B tests on model versions.
   * Feast enables discovery of and collaboration on previously used features and enables versioning of sets of 
     features (via _feature services_).
   * _(Experimental)_ Feast enables light-weight feature transformations so users can re-use transformation logic 
     across online / offline use cases and across models.

We will:
1. Deploy a local feature store with a **Parquet file offline store** and **Sqlite online store**.
2. Build a training dataset using our time series features from our **Parquet files**.
3. Materialize feature values from the offline store into the online store.
4. Read the latest features from the online store for inference.

## Step 1: Install Feast

Install Feast (and Pygments for pretty printing) using pip:


In [None]:
%%sh
pip install feast -U -q
pip install Pygments -q
echo "Please restart your runtime now (Runtime -> Restart runtime). This ensures that the correct dependencies are loaded."

**Reminder**: Please restart your runtime after installing Feast (Runtime -> Restart runtime). This ensures that the correct dependencies are loaded.


## Step 2: Create a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See [Feature Repository](https://docs.feast.dev/reference/feature-repository) for a detailed explanation of feature repositories.

The easiest way to create a new feature repository is to use the `feast init` command. This creates a scaffold with initial demo data.

### Demo data scenario 
- We have surveyed some drivers for how satisfied they are with their experience in a ride-sharing app. 
- We want to generate predictions for driver satisfaction for the rest of the users so we can reach out to potentially dissatisfied users.

In [None]:
!feast init feature_repo


Creating a new Feast repository in /content/feature_repo.



### Step 2a: Inspecting the feature repository

Let's take a look at the demo repo itself. It breaks down into:


* `data/` contains raw demo parquet data
* `example_repo.py` contains demo feature definitions
* `feature_store.yaml` contains a demo setup configuring where data sources are
* `test_workflow.py` showcases how to run all key Feast commands, including defining, retrieving, and pushing features.
   * You can run this with `python test_workflow.py`.



In [None]:
%cd feature_repo
!ls -R

/content/feature_repo
README.md          feature_store.yaml
__init__.py        example_repo.py    test_workflow.py

./data:
driver_stats.parquet


### Step 2b: Inspecting the project configuration
Let's inspect the setup of the project in `feature_store.yaml`. 

The key line defining the overall architecture of the feature store is the **provider**. 

The provider value sets default offline and online stores. 
* The offline store provides the compute layer to process historical data (for generating training data & feature 
  values for serving). 
* The online store is a low latency store of the latest feature values (for powering real-time inference).

Valid values for `provider` in `feature_store.yaml` are:

* local: use file source with SQLite/Redis
* gcp: use BigQuery/Snowflake with Google Cloud Datastore/Redis
* aws: use Redshift/Snowflake with DynamoDB/Redis

Note that there are many other offline / online stores Feast works with, including Azure, Hive, Trino, and PostgreSQL via community plugins. See https://docs.feast.dev/roadmap for all supported connectors.

A custom setup can also be made by following [Customizing Feast](https://docs.feast.dev/v/master/how-to-guides/customizing-feast).

In [None]:
!pygmentize feature_store.yaml

project: feature_repo
# By default, the registry is a file (but can be turned into a more scalable SQL-backed registry)
registry: data/registry.db
# The provider primarily specifies default offline / online stores & storing the registry in a given cloud
provider: local
online_store:
    path: data/online_store.db
entity_key_serialization_version: 2


### Inspecting the raw data

The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.

In [None]:
import pandas as pd

pd.read_parquet("data/driver_stats.parquet")

Unnamed: 0,event_timestamp,driver_id,conv_rate,acc_rate,avg_daily_trips,created
0,2022-07-24 14:00:00+00:00,1005,0.423913,0.082831,201,2022-08-08 14:14:11.200
1,2022-07-24 15:00:00+00:00,1005,0.507126,0.427470,690,2022-08-08 14:14:11.200
2,2022-07-24 16:00:00+00:00,1005,0.139810,0.129743,845,2022-08-08 14:14:11.200
3,2022-07-24 17:00:00+00:00,1005,0.383574,0.071728,839,2022-08-08 14:14:11.200
4,2022-07-24 18:00:00+00:00,1005,0.959131,0.440051,2,2022-08-08 14:14:11.200
...,...,...,...,...,...,...
1802,2022-08-08 12:00:00+00:00,1001,0.994883,0.020145,650,2022-08-08 14:14:11.200
1803,2022-08-08 13:00:00+00:00,1001,0.663844,0.864639,359,2022-08-08 14:14:11.200
1804,2021-04-12 07:00:00+00:00,1001,0.068696,0.624977,624,2022-08-08 14:14:11.200
1805,2022-08-01 02:00:00+00:00,1003,0.980869,0.244420,790,2022-08-08 14:14:11.200


## Step 3: Register feature definitions and deploy your feature store

`feast apply` scans Python files in the current directory for feature/entity definitions and deploys infrastructure according to `feature_store.yaml`.



### Step 3a: Inspecting feature definitions
Let's inspect what `example_repo.py` looks like:

```python
# This is an example feature definition file

from datetime import timedelta

import pandas as pd

from feast import Entity, FeatureService, FeatureView, Field, FileSource, RequestSource, PushSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float32, Int64, Float64

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    name="driver_hourly_stats_source",
    path="/content/feature_repo/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
driver = Entity(name="driver", join_keys=["driver_id"])

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature columns. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_hourly_stats,
    tags={},
)

# Defines a way to push data (to be available offline, online or both) into Feast.
driver_stats_push_source = PushSource(
    name="driver_stats_push_source",
    batch_source=driver_hourly_stats,
)

# Define a request data source which encodes features / information only
# available at request time (e.g. part of the user initiated HTTP request)
input_request = RequestSource(
    name="vals_to_add",
    schema=[
        Field(name="val_to_add", dtype=Int64),
        Field(name="val_to_add_2", dtype=Int64),
    ],
)


# Define an on demand feature view which can generate new features based on
# existing feature views and RequestSource features
@on_demand_feature_view(
    sources=[driver_hourly_stats_view, input_request],
    schema=[
        Field(name="conv_rate_plus_val1", dtype=Float64),
        Field(name="conv_rate_plus_val2", dtype=Float64),
    ],
)
def transformed_conv_rate(inputs: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["conv_rate_plus_val1"] = inputs["conv_rate"] + inputs["val_to_add"]
    df["conv_rate_plus_val2"] = inputs["conv_rate"] + inputs["val_to_add_2"]
    return df


# This groups features into a model version
driver_stats_fs = FeatureService(
    name="driver_activity_v1", features=[driver_hourly_stats_view, transformed_conv_rate]
)
```

### Step 3b: Applying feature definitions
Now we run `feast apply` to register the feature views and entities defined in `example_repo.py` and to set up the SQLite online store tables. Note that we had previously specified SQLite as the online store in `feature_store.yaml` by specifying a `local` provider.

In [None]:
!feast apply

Created entity driver
Created feature view driver_hourly_stats
Created on demand feature view transformed_conv_rate
Created feature service driver_activity_v1

Created sqlite table feature_repo_driver_hourly_stats



## Step 4: Generating training data or powering batch scoring models

To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). Feast can help generate the features that map to these labels.

Feast needs a list of **entities** (e.g. driver ids) and **timestamps**. Feast will intelligently join relevant 
tables to create the relevant feature vectors. There are two ways to generate this list:
1. The user can query that table of labels with timestamps and pass that into Feast as an _entity dataframe_ for 
training data generation. 
2. The user can also query that table with a *SQL query* which pulls entities. See the documentation on [feature retrieval](https://docs.feast.dev/getting-started/concepts/feature-retrieval) for details.

* Note that we include timestamps because we want the features for the same driver at various timestamps to be used in a model.
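
To make the role of the timestamps concrete, here is a minimal, conceptual sketch of a point-in-time join in plain Python. It is not Feast's implementation: the rows, the helper name `point_in_time_join`, and the way `ttl` bounds the lookback window are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical feature rows: (driver_id, event_timestamp, conv_rate).
feature_rows = [
    (1001, datetime(2021, 4, 12, 7, 0), 0.10),
    (1001, datetime(2021, 4, 12, 10, 0), 0.20),
    (1001, datetime(2021, 4, 12, 12, 0), 0.30),  # later than the label timestamp
    (1002, datetime(2021, 4, 10, 8, 0), 0.40),   # older than the lookback window
]

# Hypothetical entity rows: (driver_id, timestamp at which the label was observed).
entity_rows = [
    (1001, datetime(2021, 4, 12, 10, 59, 42)),
    (1002, datetime(2021, 4, 12, 8, 12, 10)),
]


def point_in_time_join(entity_rows, feature_rows, ttl):
    """For each entity row, pick the newest feature value that is not from the
    future and not older than ttl. The value is None when nothing qualifies."""
    joined = []
    for driver_id, label_ts in entity_rows:
        candidates = [
            (ts, value)
            for fid, ts, value in feature_rows
            if fid == driver_id and label_ts - ttl <= ts <= label_ts
        ]
        latest = max(candidates, key=lambda c: c[0]) if candidates else None
        joined.append((driver_id, label_ts, latest[1] if latest else None))
    return joined


training_rows = point_in_time_join(entity_rows, feature_rows, ttl=timedelta(days=1))
for row in training_rows:
    print(row)
```

Driver 1001 gets the 10:00 value rather than the 12:00 one, because the 12:00 row did not exist yet when the label was recorded. Driver 1002 gets no value, because its only row falls outside the assumed one-day window.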

### Step 4a: Generating training data

In [None]:
from datetime import datetime
import pandas as pd

from feast import FeatureStore

# The entity dataframe is the dataframe we want to enrich with feature values
# Note: see https://docs.feast.dev/getting-started/concepts/feature-retrieval for more details on how to retrieve
# for all entities in the offline store instead
entity_df = pd.DataFrame.from_dict(
    {
        # entity's join key -> entity values
        "driver_id": [1001, 1002, 1003],
        # "event_timestamp" (reserved key) -> timestamps
        "event_timestamp": [
            datetime(2021, 4, 12, 10, 59, 42),
            datetime(2021, 4, 12, 8, 12, 10),
            datetime(2021, 4, 12, 16, 40, 26),
        ],
        # (optional) label name -> label values. Feast does not process these
        "label_driver_reported_satisfaction": [1, 5, 3],
        # values we're using for an on-demand transformation
        "val_to_add": [1, 2, 3],
        "val_to_add_2": [10, 20, 30],
    }
)

store = FeatureStore(repo_path=".")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 10 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   driver_id                           3 non-null      int64              
 1   event_timestamp                     3 non-null      datetime64[ns, UTC]
 2   label_driver_reported_satisfaction  3 non-null      int64              
 3   val_to_add                          3 non-null      int64              
 4   val_to_add_2                        3 non-null      int64              
 5   conv_rate                           3 non-null      float32            
 6   acc_rate                            3 non-null      float32            
 7   avg_daily_trips                     3 non-null      int32              
 8   conv_rate_plus_val1                 3 non-null      float64            
 9   conv_rate_plus_val2

### Step 4b: Run offline inference (batch scoring)
To power a batch model, we primarily need to generate features with the `get_historical_features` call, but using the current timestamp.

In [None]:
entity_df["event_timestamp"] = pd.to_datetime("now", utc=True)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

print("\n----- Example features -----\n")
print(training_df.head())


----- Example features -----

   driver_id                  event_timestamp  \
0       1001 2022-08-08 18:22:06.555018+00:00   
1       1002 2022-08-08 18:22:06.555018+00:00   
2       1003 2022-08-08 18:22:06.555018+00:00   

   label_driver_reported_satisfaction  val_to_add  val_to_add_2  conv_rate  \
0                                   1           1            10   0.663844   
1                                   5           2            20   0.151189   
2                                   3           3            30   0.769165   

   acc_rate  avg_daily_trips  conv_rate_plus_val1  conv_rate_plus_val2  
0  0.864639              359             1.663844            10.663844  
1  0.695982              311             2.151189            20.151189  
2  0.949191              789             3.769165            30.769165  


## Step 5: Load features into your online store

### Step 5a: Using `materialize_incremental`

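We now load (materialize) the latest feature values into the online store to prepare for serving. `materialize_incremental` loads every feature value that is new since the last `materialize` call; on the first run there is no previous call, and the log output below shows the start of the window Feast chose.

The CLI form of this command is shown first; the Python cell after it is the equivalent alternative: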

```bash
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```

In [None]:
from datetime import datetime
store.materialize_incremental(datetime.now())

Materializing 1 feature views to 2022-08-08 14:19:04-04:00 into the sqlite online store.

driver_hourly_stats from 2022-08-07 18:19:04-04:00 to 2022-08-08 14:19:04-04:00:


100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 346.47it/s]
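
Conceptually, incremental materialization can be pictured as copying the newest value per entity from a time window into a key-value store, then remembering where the window ended. The sketch below is a plain-Python illustration under that assumption; `materialize_window`, the sample rows, and the dictionary used as an online store are hypothetical and not part of Feast.

```python
from datetime import datetime

# Hypothetical offline rows: (driver_id, event_timestamp, acc_rate).
offline_rows = [
    (1001, datetime(2022, 8, 8, 12, 0), 0.02),
    (1001, datetime(2022, 8, 8, 13, 0), 0.86),
    (1002, datetime(2022, 8, 8, 13, 0), 0.70),
    (1002, datetime(2022, 8, 8, 15, 0), 0.55),
]


def materialize_window(online_store, offline_rows, start, end):
    """Copy the newest value per driver with start < ts <= end into online_store."""
    for driver_id, ts, value in sorted(offline_rows, key=lambda r: r[1]):
        if start < ts <= end:
            current = online_store.get(driver_id)
            if current is None or ts >= current[0]:
                online_store[driver_id] = (ts, value)
    return end  # the caller keeps this as the start of the next window


online_store = {}

# First run: an explicit window ending at 14:00.
last_end = materialize_window(
    online_store, offline_rows, datetime(2022, 8, 7, 14, 0), datetime(2022, 8, 8, 14, 0)
)
after_first_run = dict(online_store)

# Incremental run: only rows newer than the previous end time are considered.
last_end = materialize_window(
    online_store, offline_rows, last_end, datetime(2022, 8, 8, 16, 0)
)
print(after_first_run)
print(online_store)
```

The second call only touches driver 1002, whose 15:00 row arrived after the first window closed; driver 1001 keeps the value loaded by the first run.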


### Step 5b: Inspect materialized features

Note that now there are `online_store.db` and `registry.db`, which store the materialized features and schema information, respectively.

In [None]:
print("--- Data directory ---")
!ls data

import sqlite3
import pandas as pd
con = sqlite3.connect("data/online_store.db")
print("\n--- Schema of online store ---")
print(
    pd.read_sql_query(
        "SELECT * FROM feature_repo_driver_hourly_stats", con).columns.tolist())
con.close()

--- Data directory ---
driver_stats.parquet online_store.db      registry.db

--- Schema of online store ---
['entity_key', 'feature_name', 'value', 'event_ts', 'created_ts']


### Quick note on entity keys
Note from the above command that the online store indexes by `entity_key`. 

[Entity keys](https://docs.feast.dev/getting-started/concepts/entity#entity-key) include a list of all entities needed (e.g. all relevant primary keys) to generate the feature vector. In this case, this is a serialized version of the `driver_id`. We use this later to fetch all features for a given driver at inference time.
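
The lookup pattern this implies can be sketched with an in-memory SQLite table that has the same column names as the schema printed above. The column types, the primary key, the `toy_entity_key` helper, and the stored values are assumptions for illustration; Feast's actual entity key serialization is a binary format that this sketch does not reproduce.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE driver_hourly_stats ("
    "entity_key BLOB, feature_name TEXT, value BLOB, event_ts TEXT, created_ts TEXT, "
    "PRIMARY KEY (entity_key, feature_name))"
)


def toy_entity_key(driver_id):
    # Hypothetical stand-in for a serialized entity key.
    return f"driver_id={driver_id}".encode()


# One row per (entity key, feature name), holding only the latest value.
rows = [
    (toy_entity_key(1001), "acc_rate", b"0.86", "2022-08-08 13:00:00", "2022-08-08 14:14:11"),
    (toy_entity_key(1001), "avg_daily_trips", b"359", "2022-08-08 13:00:00", "2022-08-08 14:14:11"),
    (toy_entity_key(1002), "acc_rate", b"0.70", "2022-08-08 13:00:00", "2022-08-08 14:14:11"),
]
con.executemany("INSERT INTO driver_hourly_stats VALUES (?, ?, ?, ?, ?)", rows)


def read_features(driver_id):
    """Fetch every stored feature for one driver with a single keyed query."""
    cursor = con.execute(
        "SELECT feature_name, value FROM driver_hourly_stats WHERE entity_key = ?",
        (toy_entity_key(driver_id),),
    )
    return {name: value.decode() for name, value in cursor}


features_1001 = read_features(1001)
features_unknown = read_features(9999)
print(features_1001)
print(features_unknown)
con.close()
```

Because every feature row for a driver shares one key, a single indexed lookup returns the whole feature vector, and an unknown driver simply yields an empty result.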

## Step 6: Fetching real-time feature vectors for online inference

At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using `get_online_features()`. These feature vectors can then be fed to the model.

In [None]:
from pprint import pprint
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
    entity_rows=[
        # {join_key: entity_value}
        {
            "driver_id": 1001,
            "val_to_add": 1000,
            "val_to_add_2": 2000,
        },
        {
            "driver_id": 1002,
            "val_to_add": 1001,
            "val_to_add_2": 2002,
        },
    ],
).to_dict()

pprint(feature_vector)

{'acc_rate': [0.86463862657547, 0.6959823369979858],
 'avg_daily_trips': [359, 311],
 'conv_rate_plus_val1': [1000.6638441681862, 1001.1511893719435],
 'conv_rate_plus_val2': [2000.6638441681862, 2002.1511893719435],
 'driver_id': [1001, 1002]}
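
The `conv_rate_plus_val*` values in this output combine a stored feature with data that only arrives in the request. The following plain-Python sketch mirrors the arithmetic of the `transformed_conv_rate` on demand feature view from `example_repo.py`; the `online_features` dictionary is a hypothetical stand-in for the online store, filled with the `conv_rate` values that appear in the feature service output later in this tutorial.

```python
import math

# Stand-in for stored feature values, keyed by driver_id.
online_features = {
    1001: {"conv_rate": 0.6638441681861877},
    1002: {"conv_rate": 0.15118937194347382},
}


def transformed_conv_rate(stored, request):
    """Plain-Python mirror of the on demand transformation: stored value plus request value."""
    return {
        "conv_rate_plus_val1": stored["conv_rate"] + request["val_to_add"],
        "conv_rate_plus_val2": stored["conv_rate"] + request["val_to_add_2"],
    }


# The same entity rows that are passed to get_online_features in this step.
entity_rows = [
    {"driver_id": 1001, "val_to_add": 1000, "val_to_add_2": 2000},
    {"driver_id": 1002, "val_to_add": 1001, "val_to_add_2": 2002},
]

results = [
    transformed_conv_rate(online_features[row["driver_id"]], row) for row in entity_rows
]
for row, result in zip(entity_rows, results):
    print(row["driver_id"], result)

# Compare against a value from the output above, allowing for float rounding.
print(math.isclose(results[0]["conv_rate_plus_val1"], 1000.6638441681862))
```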


### Fetching features using feature services
You can also use feature services to manage multiple features and to decouple feature view definitions from the features needed by end applications. The feature store can fetch either online or historical features through a feature service using the same API shown below. More information can be found in the [feature retrieval documentation](https://docs.feast.dev/getting-started/concepts/feature-retrieval).

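The `driver_activity_v1` feature service, as defined in `example_repo.py`, pulls all features from the `driver_hourly_stats` feature view and the `transformed_conv_rate` on demand feature view:

```python
driver_stats_fs = FeatureService(
    name="driver_activity_v1", features=[driver_hourly_stats_view, transformed_conv_rate]
)
```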

In [None]:
from feast import FeatureStore
feature_store = FeatureStore('.')  # Initialize the feature store

feature_service = feature_store.get_feature_service("driver_activity_v1")
feature_vector = feature_store.get_online_features(
    features=feature_service,
    entity_rows=[
        # {join_key: entity_value}
        {
            "driver_id": 1001,
            "val_to_add": 1000,
            "val_to_add_2": 2000,
        },
        {
            "driver_id": 1002,
            "val_to_add": 1001,
            "val_to_add_2": 2002,
        },
    ],
).to_dict()
pprint(feature_vector)

{'acc_rate': [0.86463862657547, 0.6959823369979858],
 'avg_daily_trips': [359, 311],
 'conv_rate': [0.6638441681861877, 0.15118937194347382],
 'conv_rate_plus_val1': [1000.6638441681862, 1001.1511893719435],
 'conv_rate_plus_val2': [2000.6638441681862, 2002.1511893719435],
 'driver_id': [1001, 1002]}
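
One way to picture a feature service is as a named bundle that expands into the same `view:feature` references used in the earlier retrieval calls. The sketch below is an illustration in plain Python, not Feast's registry: the two dictionaries and `expand_feature_service` are hypothetical, and the names in them are copied from `example_repo.py`.

```python
# Hypothetical registry: feature view name -> feature names it exposes.
feature_views = {
    "driver_hourly_stats": ["conv_rate", "acc_rate", "avg_daily_trips"],
    "transformed_conv_rate": ["conv_rate_plus_val1", "conv_rate_plus_val2"],
}

# Hypothetical registry: feature service name -> feature views it bundles.
feature_services = {
    "driver_activity_v1": ["driver_hourly_stats", "transformed_conv_rate"],
}


def expand_feature_service(name):
    """Expand a service name into 'view:feature' reference strings."""
    return [
        f"{view}:{feature}"
        for view in feature_services[name]
        for feature in feature_views[view]
    ]


feature_refs = expand_feature_service("driver_activity_v1")
for ref in feature_refs:
    print(ref)
```

A model that depends on `driver_activity_v1` only needs the service name; adding a new version of the service leaves consumers of the old one untouched.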


## Step 7: Making streaming features available in Feast
Feast does not directly ingest from streaming sources. Instead, Feast relies on a push-based model to push features into Feast. You can write a streaming pipeline that generates features, which can then be pushed to the offline store, the online store, or both (depending on your needs).

This relies on the `PushSource` defined above. Pushing to this source will populate all dependent feature views with the pushed feature values.

In [None]:
from feast.data_source import PushMode

print("\n--- Simulate a stream event ingestion of the hourly stats df ---")
event_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001],
        "event_timestamp": [
            datetime(2021, 5, 13, 10, 59, 42),
        ],
        "created": [
            datetime(2021, 5, 13, 10, 59, 42),
        ],
        "conv_rate": [1.0],
        "acc_rate": [1.0],
        "avg_daily_trips": [1000],
    }
)
print(event_df)
store.push("driver_stats_push_source", event_df, to=PushMode.ONLINE_AND_OFFLINE)


--- Simulate a stream event ingestion of the hourly stats df ---
   driver_id     event_timestamp             created  conv_rate  acc_rate  \
0       1001 2021-05-13 10:59:42 2021-05-13 10:59:42        1.0       1.0   

   avg_daily_trips  
0             1000  
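
The push model described in this step can be pictured with a small plain-Python sketch in which one call routes a row to an offline log, an online key-value map, or both. `ToyPushMode`, `push`, and both stores are hypothetical stand-ins, not Feast classes. The sketch overwrites the online value unconditionally; how a real online store resolves out-of-order events is not covered here.

```python
from datetime import datetime
from enum import Enum


class ToyPushMode(Enum):
    # Hypothetical mirror of the three destinations named in this step.
    ONLINE = "online"
    OFFLINE = "offline"
    ONLINE_AND_OFFLINE = "online_and_offline"


offline_log = []    # append-only history, used for training data
online_latest = {}  # driver_id -> most recently pushed row, used for serving


def push(row, to):
    """Route one feature row to the offline log, the online map, or both."""
    if to in (ToyPushMode.OFFLINE, ToyPushMode.ONLINE_AND_OFFLINE):
        offline_log.append(row)
    if to in (ToyPushMode.ONLINE, ToyPushMode.ONLINE_AND_OFFLINE):
        online_latest[row["driver_id"]] = row


event = {
    "driver_id": 1001,
    "event_timestamp": datetime(2021, 5, 13, 10, 59, 42),
    "conv_rate": 1.0,
    "acc_rate": 1.0,
    "avg_daily_trips": 1000,
}
push(event, to=ToyPushMode.ONLINE_AND_OFFLINE)

# A second event that should only be kept for training, not served.
backfill = dict(event, event_timestamp=datetime(2021, 5, 13, 9, 0, 0), conv_rate=0.5)
push(backfill, to=ToyPushMode.OFFLINE)

print(len(offline_log), online_latest[1001]["conv_rate"])
```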


# Next steps

- Read the [Concepts](https://docs.feast.dev/getting-started/concepts/) page to understand the Feast data model and architecture.
- Check out our [Tutorials](https://docs.feast.dev/tutorials/tutorials-overview) section for more examples on how to use Feast.
- Follow our [Running Feast with Snowflake/GCP/AWS](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws) guide for a more in-depth tutorial on using Feast.
- Join other Feast users and contributors in [Slack](https://slack.feast.dev/) and become part of the community!