# Overview

In this tutorial, we use feature stores to generate training data and power online model inference for a ride-sharing driver satisfaction prediction model. Feast addresses several common issues in this flow:
1. **Training-serving skew and complex data joins:** Feature values often exist across multiple tables. Joining these datasets can be complicated, slow, and error-prone.
  - Feast joins these tables with battle-tested logic that ensures *point-in-time* correctness so future feature values do not leak to models.
  - **Upcoming**: Feast alerts users to offline / online skew with data quality monitoring.
2. **Online feature availability:** At inference time, models often need access to features that aren't readily available and need to be precomputed from other data sources.
  - Feast manages deployment to a variety of online stores (e.g. DynamoDB, Redis, Google Cloud Datastore) and ensures necessary features are consistently *available* and *freshly computed* at inference time.
3. **Feature reusability and model versioning:** Different teams within an organization are often unable to reuse features across projects, resulting in duplicate feature creation logic. Models have data dependencies that need to be versioned, for example when running A/B tests on model versions.
  - Feast enables discovery of and collaboration on previously used features and enables versioning of sets of features (via *feature services*). 
  - **Upcoming**: Feast enables feature transformation so users can reuse transformation logic across online / offline use cases and across models.

We will:
- Deploy a local feature store with a Parquet file offline store and SQLite online store.
- Build a training dataset using our time series features from our Parquet files.
- Materialize feature values from the offline store into the online store in preparation for low latency serving.
- Read the latest features from the online store for inference.

## Step 1: Install Feast

Install Feast (and Pygments for pretty printing) using pip:


In [None]:
%%sh
pip install feast -U -q
pip install Pygments -q
echo "Please restart your runtime now (Runtime -> Restart runtime). This ensures that the correct dependencies are loaded."

Please restart your runtime now (Runtime -> Restart runtime). This ensures that the correct dependencies are loaded.


**Reminder**: Please restart your runtime after installing Feast (Runtime -> Restart runtime). This ensures that the correct dependencies are loaded.
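
After the restart, it can help to confirm which package versions the runtime actually picked up. Below is a minimal sketch that uses only the standard library; `installed_version` is a helper name introduced here for illustration and is not part of Feast.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the two packages installed in the cell above.
for name in ("feast", "Pygments"):
    print(name, installed_version(name))
```

A result of `None` for `feast` suggests the install did not succeed or the runtime has not been restarted yet.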


## Step 2: Create a feature repository

A feature repository is a directory that contains the configuration of the feature store and individual features. This configuration is written as code (Python/YAML) and it's highly recommended that teams track it centrally using git. See [Feature Repository](https://docs.feast.dev/reference/feature-repository) for a detailed explanation of feature repositories.

The easiest way to create a new feature repository is to use the `feast init` command. This creates a scaffold with initial demo data.

### Demo data scenario 
- We have surveyed some drivers for how satisfied they are with their experience in a ride-sharing app. 
- We want to generate predictions for driver satisfaction for the rest of the drivers so we can reach out to potentially dissatisfied ones.

In [None]:
!feast init feature_repo

Feast is an open source project that collects anonymized error reporting and usage statistics. To opt out or learn more see https://docs.feast.dev/reference/usage

Creating a new Feast repository in /content/feature_repo.



### Step 2a: Inspecting the feature repository

Let's take a look at the demo repo itself. It breaks down into:


*   `data/` contains raw demo parquet data
*   `example.py` contains demo feature definitions
*   `feature_store.yaml` contains a demo setup configuring where data sources are



In [None]:
%cd feature_repo
!ls -R

/content/feature_repo
.:
data  example.py  feature_store.yaml

./data:
driver_stats.parquet


### Step 2b: Inspecting the project configuration
Let's inspect the setup of the project in `feature_store.yaml`. The key line defining the overall architecture of the feature store is the **provider**. This defines where the raw data exists (for generating training data & feature values for serving), and where to materialize feature values to in the online store (for serving). 

Valid values for  `provider` in `feature_store.yaml` are:

*   local: use file source / SQLite
*   gcp: use BigQuery / Google Cloud Datastore
*   aws: use Redshift / DynamoDB

A custom setup (e.g. using the built-in support for Redis) can be made by following https://docs.feast.dev/v/master/how-to-guides/creating-a-custom-provider

In [None]:
!pygmentize feature_store.yaml

project: feature_repo
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
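
To make the relationship between `provider` and the underlying stores concrete, here is a small sketch that writes the provider list from this section as a lookup table. It does not call Feast; `PROVIDER_DEFAULTS` and `describe_stores` are names invented for this illustration, and the configuration is copied by hand into a plain dict rather than parsed from the YAML file.

```python
# Default stores per provider, as listed in this tutorial.
PROVIDER_DEFAULTS = {
    "local": {"offline_store": "file source", "online_store": "SQLite"},
    "gcp": {"offline_store": "BigQuery", "online_store": "Google Cloud Datastore"},
    "aws": {"offline_store": "Redshift", "online_store": "DynamoDB"},
}

# The demo configuration printed above, written out as a plain dict.
config = {
    "project": "feature_repo",
    "registry": "data/registry.db",
    "provider": "local",
    "online_store": {"path": "data/online_store.db"},
}


def describe_stores(cfg: dict) -> dict:
    """Look up the default stores implied by the `provider` key."""
    provider = cfg["provider"]
    if provider not in PROVIDER_DEFAULTS:
        raise ValueError(f"unknown provider: {provider!r}")
    return PROVIDER_DEFAULTS[provider]


print(describe_stores(config))
```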


### Inspecting the raw data

The raw feature data we have in this demo is stored in a local parquet file. The dataset captures hourly stats of a driver in a ride-sharing app.

In [None]:
import pandas as pd

pd.read_parquet("data/driver_stats.parquet")

| | event_timestamp | driver_id | conv_rate | acc_rate | avg_daily_trips | created |
|---|---|---|---|---|---|---|
| 0 | 2021-08-08 16:00:00+00:00 | 1005 | 0.293061 | 0.001904 | 40 | 2021-08-23 16:25:16.962 |
| 1 | 2021-08-08 17:00:00+00:00 | 1005 | 0.411542 | 0.893139 | 722 | 2021-08-23 16:25:16.962 |
| 2 | 2021-08-08 18:00:00+00:00 | 1005 | 0.495635 | 0.202365 | 280 | 2021-08-23 16:25:16.962 |
| 3 | 2021-08-08 19:00:00+00:00 | 1005 | 0.890092 | 0.771689 | 88 | 2021-08-23 16:25:16.962 |
| 4 | 2021-08-08 20:00:00+00:00 | 1005 | 0.308211 | 0.126267 | 552 | 2021-08-23 16:25:16.962 |
| ... | ... | ... | ... | ... | ... | ... |
| 1802 | 2021-08-23 14:00:00+00:00 | 1001 | 0.251525 | 0.245729 | 98 | 2021-08-23 16:25:16.962 |
| 1803 | 2021-08-23 15:00:00+00:00 | 1001 | 0.469145 | 0.138416 | 606 | 2021-08-23 16:25:16.962 |
| 1804 | 2021-04-12 07:00:00+00:00 | 1001 | 0.897222 | 0.086379 | 314 | 2021-08-23 16:25:16.962 |
| 1805 | 2021-08-16 04:00:00+00:00 | 1003 | 0.298156 | 0.671153 | 162 | 2021-08-23 16:25:16.962 |


## Step 3: Register feature definitions and deploy your feature store

`feast apply` scans Python files in the current directory for feature/entity definitions and deploys infrastructure according to `feature_store.yaml`.



### Step 3a: Inspecting feature definitions
Let's inspect what `example.py` looks like (the only Python file in the repo):

In [None]:
!pygmentize -f terminal16m example.py

# This is an example feature definition file

from google.protobuf.duration_pb2 import Duration

from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    path="/content/feature_repo/data/driver_stats.parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column=

### Step 3b: Applying feature definitions
Now we run `feast apply` to register the feature views and entities defined in `example.py` and to set up the SQLite online store tables. Note that we previously selected SQLite as the online store by specifying the `local` provider in `feature_store.yaml`.

In [None]:
!feast apply

Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats


## Step 4: Generate training data

To train a model, we need features and labels. Often, this label data is stored separately (e.g. you have one table storing user survey results and another set of tables with feature values). 

The user can query that table of labels with timestamps and pass that into Feast as an *entity dataframe* for training data generation. In many cases, Feast will also intelligently join relevant tables to create the relevant feature vectors.
- Note that we include timestamps because we want the model to use the features for the same driver at various points in time.
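
To illustrate why the entity dataframe carries timestamps, here is a plain-Python sketch of a point-in-time lookup: for each labelled row, only the latest feature row at or before the label's timestamp is eligible. This is a conceptual illustration, not Feast's implementation; `point_in_time_lookup`, the optional `ttl` argument, and the sample values are assumptions made for the example.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_id, as_of, ttl=None):
    """Return the latest feature dict for entity_id with timestamp <= as_of.

    feature_rows: list of (entity_id, event_timestamp, features_dict).
    ttl: optional timedelta; a row older than as_of - ttl is treated as stale.
    """
    rows = sorted(
        ((ts, feats) for eid, ts, feats in feature_rows if eid == entity_id),
        key=lambda row: row[0],
    )
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of) - 1  # last row at or before as_of
    if idx < 0:
        return None  # no feature value existed yet at that time
    ts, feats = rows[idx]
    if ttl is not None and as_of - ts > ttl:
        return None  # the newest eligible value is too old
    return feats


t0 = datetime(2021, 8, 23, 12, 0)
feature_rows = [
    (1001, t0, {"conv_rate": 0.25}),
    (1001, t0 + timedelta(hours=1), {"conv_rate": 0.47}),
    (1001, t0 + timedelta(hours=2), {"conv_rate": 0.90}),
]

# A label observed at 13:30 may only see the 13:00 row, never the 14:00 one.
print(point_in_time_lookup(feature_rows, 1001, t0 + timedelta(minutes=90)))
```

Picking the 14:00 row for a 13:30 label would leak a future value into training, which is the failure mode that point-in-time joins are meant to prevent.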

In [None]:
from datetime import datetime, timedelta
import pandas as pd

from feast import FeatureStore

# The entity dataframe is the dataframe we want to enrich with feature values
entity_df = pd.DataFrame.from_dict(
    {
        "driver_id": [1001, 1002, 1003],
        "label_driver_reported_satisfaction": [1, 5, 3], 
        "event_timestamp": [
            datetime.now() - timedelta(minutes=11),
            datetime.now() - timedelta(minutes=36),
            datetime.now() - timedelta(minutes=73),
        ],
    }
)

store = FeatureStore(repo_path=".")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column                              Non-Null Count  Dtype              
---  ------                              --------------  -----              
 0   event_timestamp                     3 non-null      datetime64[ns, UTC]
 1   driver_id                           3 non-null      int64              
 2   label_driver_reported_satisfaction  3 non-null      int64              
 3   conv_rate                           3 non-null      float32            
 4   acc_rate                            3 non-null      float32            
 5   avg_daily_trips                     3 non-null      int32              
dtypes: datetime64[ns, UTC](1), float32(2), int32(1), int64(2)
memory usage: 132.0 bytes
None

----- Example features -----

                   event_timestamp  driver_id  ...  acc_rate  avg_daily_trips
0 2021-08-23 15:12:55.489091+00:00       1003  ...  0

## Step 5: Load features into your online store

### Step 5a: Using `feast materialize-incremental`

We now serialize the latest values of features since the beginning of time to prepare for serving (note: `materialize-incremental` serializes all new features since the last `materialize` call).

In [None]:
from datetime import datetime
!feast materialize-incremental {datetime.now().isoformat()}

Materializing 1 feature views to 2021-08-23 16:25:46+00:00 into the sqlite online store.

driver_hourly_stats from 2021-08-22 16:25:47+00:00 to 2021-08-23 16:25:46+00:00:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 592.05it/s]
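
The log above shows materialization running over a time window. Conceptually, the step keeps the newest value per entity within that window and writes it to the online store. The sketch below mimics that idea in plain Python; it is not Feast's code, the helper name `latest_per_entity` and the sample rows are invented, and treating the window as open at the start and closed at the end is an assumption.

```python
from datetime import datetime, timedelta


def latest_per_entity(rows, start, end):
    """Keep, for each entity, the newest row whose timestamp lies in (start, end]."""
    latest = {}
    for entity_id, ts, features in rows:
        if not (start < ts <= end):
            continue  # outside the materialization window
        if entity_id not in latest or ts > latest[entity_id][0]:
            latest[entity_id] = (ts, features)
    return latest


end = datetime(2021, 8, 23, 16, 0)
start = end - timedelta(days=1)
rows = [
    (1001, end - timedelta(hours=3), {"conv_rate": 0.25}),
    (1001, end - timedelta(hours=1), {"conv_rate": 0.47}),
    (1002, end - timedelta(days=2), {"conv_rate": 0.10}),  # too old, skipped
]

online = latest_per_entity(rows, start, end)
print(online)
```

Only driver 1001 ends up in `online`, carrying the value from its most recent row inside the window.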


### Step 5b: Inspect materialized features

Note that now there are `online_store.db` and `registry.db`, which store the materialized features and schema information, respectively.

In [None]:
print("--- Data directory ---")
!ls data

import sqlite3
import pandas as pd
con = sqlite3.connect("data/online_store.db")
print("\n--- Schema of online store ---")
print(
    pd.read_sql_query(
        "SELECT * FROM feature_repo_driver_hourly_stats", con).columns.tolist())
con.close()

--- Data directory ---
driver_stats.parquet  online_store.db  registry.db

--- Schema of online store ---
['entity_key', 'feature_name', 'value', 'event_ts', 'created_ts']


### Quick note on entity keys
Note from the above command that the online store indexes by `entity_key`. 

[Entity keys](https://docs.feast.dev/getting-started/concepts/entity#entity-key) include a list of all entities needed (e.g. all relevant primary keys) to generate the feature vector. In this case, this is a serialized version of the `driver_id`. We use this later to fetch all features for a given driver at inference time.
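
The schema printed above is a "long" layout: one row per entity key and feature name. The following sketch rebuilds that layout in an in-memory SQLite database and reads back one driver's features. It is only an approximation: the table name `online_features`, the plain-text entity keys, the `REAL` value column, and the primary key are assumptions made for readability, whereas the real store holds serialized values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE online_features (
        entity_key TEXT,
        feature_name TEXT,
        value REAL,
        event_ts TEXT,
        created_ts TEXT,
        PRIMARY KEY (entity_key, feature_name)
    )
    """
)

# Illustrative values only; one row per (entity, feature) pair.
sample_rows = [
    ("driver_id=1004", "conv_rate", 0.15, "2021-08-23T15:00:00", "2021-08-23T16:25:16"),
    ("driver_id=1004", "acc_rate", 0.57, "2021-08-23T15:00:00", "2021-08-23T16:25:16"),
    ("driver_id=1004", "avg_daily_trips", 33, "2021-08-23T15:00:00", "2021-08-23T16:25:16"),
    ("driver_id=1005", "conv_rate", 0.63, "2021-08-23T15:00:00", "2021-08-23T16:25:16"),
]
con.executemany("INSERT INTO online_features VALUES (?, ?, ?, ?, ?)", sample_rows)


def read_feature_vector(connection, entity_key):
    """Collect all feature rows for one entity key into a {name: value} dict."""
    cursor = connection.execute(
        "SELECT feature_name, value FROM online_features WHERE entity_key = ?",
        (entity_key,),
    )
    return dict(cursor.fetchall())


print(read_feature_vector(con, "driver_id=1004"))
```

Because every feature of an entity shares the same key, a single indexed lookup is enough to assemble the feature vector at inference time.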

## Step 6: Fetching feature vectors for inference


At inference time, we need to quickly read the latest feature values for different drivers (which otherwise might have existed only in batch sources) from the online feature store using `get_online_features()`. These feature vectors can then be fed to the model.

In [None]:
from pprint import pprint
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[
        {"driver_id": 1004},
        {"driver_id": 1005},
    ],
).to_dict()

pprint(feature_vector)

{'acc_rate': [0.5732735991477966, 0.7828438878059387],
 'avg_daily_trips': [33, 984],
 'conv_rate': [0.15498852729797363, 0.6263588070869446],
 'driver_id': [1004, 1005]}
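
The dictionary returned by `to_dict()` is column-oriented: each key maps to a list with one value per requested entity row. Many model interfaces expect one record per entity instead, so a small reshaping step is often useful. The sketch below uses the output printed above; `to_rows` is a helper name made up for this example.

```python
# The column-oriented result printed above.
feature_vector = {
    "acc_rate": [0.5732735991477966, 0.7828438878059387],
    "avg_daily_trips": [33, 984],
    "conv_rate": [0.15498852729797363, 0.6263588070869446],
    "driver_id": [1004, 1005],
}


def to_rows(columns: dict) -> list:
    """Turn {name: [v0, v1, ...]} into [{name: v0, ...}, {name: v1, ...}]."""
    names = list(columns)
    return [
        dict(zip(names, values))
        for values in zip(*(columns[name] for name in names))
    ]


rows = to_rows(feature_vector)
for row in rows:
    print(row)
```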


# Next steps

- Read the [Concepts](https://docs.feast.dev/getting-started/concepts/) page to understand the Feast data model and architecture.
- Check out our [Tutorials](https://docs.feast.dev/tutorials/tutorials-overview) section for more examples on how to use Feast.
- Follow our [Running Feast with GCP/AWS](https://docs.feast.dev/how-to-guides/feast-gcp-aws) guide for a more in-depth tutorial on using Feast.
- Join other Feast users and contributors in [Slack](https://slack.feast.dev/) and become part of the community!