# Feathr Feature Store on Azure Demo Notebook

This notebook uses the Feathr Feature Store to build a model that predicts NYC taxi fares. It demonstrates the key capabilities of Feathr:

1. Install and set up Feathr with Azure.
2. Create shareable features with Feathr feature definition configs.
3. Create a training dataset via point-in-time feature join.
4. Compute and write features.
5. Train a model using these features to predict fares.
6. Materialize feature values to the online store.
7. Fetch feature values in real time from the online store for online scoring.

The dataset comes from the [NYC TLC trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The feature flow is shown below:

![Feature Flow](https://github.com/linkedin/feathr/blob/main/docs/images/feature_flow.png?raw=true)
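
To make the point-in-time join from step 3 concrete, here is a minimal pandas sketch of the idea. It is not how Feathr implements the join (Feathr runs it as a Spark job); the two small DataFrames and their column names are made up for illustration. For each observation row, `pd.merge_asof` picks the most recent feature value at or before the observation timestamp, so no future data leaks into the training set.

```python
import pandas as pd

# Hypothetical observation rows: the events we want to attach features to
observations = pd.DataFrame({
    "location_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2020-04-02", "2020-04-05", "2020-04-03"]),
})

# Hypothetical feature snapshots, each valid from its own timestamp onward
feature_values = pd.DataFrame({
    "location_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2020-04-01", "2020-04-04", "2020-04-04"]),
    "avg_fare": [10.0, 12.0, 20.0],
})

# merge_asof requires both frames to be sorted by their time columns
joined = pd.merge_asof(
    observations.sort_values("event_time"),
    feature_values.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="location_id",        # only match rows that share the same key
    direction="backward",    # only look at feature values from the past
)
print(joined[["location_id", "event_time", "avg_fare"]])
```

The observation for location 2 gets a missing value because its only feature snapshot is stamped after the observation time.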

## Prerequisite: Provision cloud resources

The first step is to provision the cloud resources that Feathr requires. Feathr provides a Python-based client to interact with those resources.

Because cloud environments vary widely, no single script works for every use case. [azure_resource_provision.sh](https://github.com/linkedin/feathr/blob/main/docs/how-to-guides/azure_resource_provision.sh) is a full end-to-end command-line script that creates all the required resources, and you can tailor it as needed. [The companion documentation](https://github.com/linkedin/feathr/blob/main/docs/how-to-guides/azure-deployment.md) is a complete guide to using that script; follow its steps to provision the resources.

When the script finishes, it prints output that you will need later, for example the service principal IDs and the Redis endpoint.

Please also note that at the end of this step you need to **manually** grant your service principal the "Data Curator" permission on your Azure Purview account, due to a current limitation of Azure Purview.

The architecture is shown below:

![Architecture](https://github.com/linkedin/feathr/blob/main/docs/images/architecture.png?raw=true)

## Prerequisite: Install Feathr

Install Feathr using pip:

```bash
pip install -U feathr
pip install pandavro scikit-learn
```

Or if you want to use the latest Feathr code from GitHub:
```bash
pip install -I git+https://github.com/linkedin/feathr.git#subdirectory=feathr_project
pip install pandavro scikit-learn
```

## Prerequisite: Configure the required environment

In the provisioning step above, you should have created all the required cloud resources. If you use the Feathr CLI to create a workspace, you should have a folder containing a file called `feathr_config.yaml` with all the required configurations.

In [None]:
yaml_config = """
# DO NOT MOVE OR DELETE THIS FILE
# This file contains the configurations that are used by Feathr
# All the configurations can be overwritten by environment variables with concatenation of `__` for different layers of this config file.
# For example, `feathr_runtime_location` for databricks can be overwritten by setting this environment variable:
# SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION
# Another example would be overwriting Redis host with this config: `ONLINE_STORE__REDIS__HOST`
# For example if you want to override this setting in a shell environment:
# export ONLINE_STORE__REDIS__HOST=feathrazure.redis.cache.windows.net
# version of API settings
api_version: 1
project_config:
  project_name: 'feathr_sample'
  # Information that must be set via environment variables.
  required_environment_variables:
    # these environment variables are required to run Feathr
    # Redis password for your online store
    - 'REDIS_PASSWORD'
    # client ID and client secret for the service principal. Read the getting started docs on how to get this information.
    - 'AZURE_CLIENT_ID'
    - 'AZURE_TENANT_ID'
    - 'AZURE_CLIENT_SECRET'
  optional_environment_variables:
    # these environment variables are optional, but you will need them if you want to use some of the services:
    - ADLS_ACCOUNT
    - ADLS_KEY
    - WASB_ACCOUNT
    - WASB_KEY
    - S3_ACCESS_KEY
    - S3_SECRET_KEY
    - JDBC_TABLE
    - JDBC_USER
    - JDBC_PASSWORD
offline_store:
  # paths start with abfss:// or abfs://
  # ADLS_ACCOUNT and ADLS_KEY should be set in environment variables if this is set to true
  adls:
    adls_enabled: true
  # paths start with wasb:// or wasbs://
  # WASB_ACCOUNT and WASB_KEY should be set in environment variables
  wasb:
    wasb_enabled: true
  # paths start with s3a://
  # S3_ACCESS_KEY and S3_SECRET_KEY should be set in environment variables
  s3:
    s3_enabled: false
    # S3 endpoint. If you use S3 endpoint, then you need to provide access key and secret key in the environment variable as well.
    s3_endpoint: 's3.amazonaws.com'
  # jdbc endpoint
  jdbc:
    jdbc_enabled: false
    jdbc_database: 'feathrtestdb'
    jdbc_table: 'feathrtesttable'
# reading from streaming source is coming soon
# streaming_source:
#   kafka_connection_string: ''
spark_config:
  # choice for spark runtime. Currently support: azure_synapse, databricks
  # The `databricks` configs will be ignored if `azure_synapse` is set and vice versa.
  spark_cluster: 'azure_synapse'
  # configure number of parts for the spark output for feature generation job
  spark_result_output_parts: '1'
  azure_synapse:
    dev_url: 'https://feathrazuretest3synapse.dev.azuresynapse.net'
    pool_name: 'spark3'
    # workspace dir for storing all the required configuration files and the jar resources
    workspace_dir: 'abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_getting_started'
    executor_size: 'Small'
    executor_num: 4
    # Feathr Job configuration. Support local paths, path start with http(s)://, and paths start with abfs(s)://
    # this is the default location so end users don't have to compile the runtime again.
    feathr_runtime_location: wasbs://public@azurefeathrstorage.blob.core.windows.net/feathr-assembly-0.1.0-SNAPSHOT.jar
  databricks:
    # workspace instance
    workspace_instance_url: 'https://adb-6885802458123232.12.azuredatabricks.net/'
    workspace_token_value: ''
    # config string including run time information, spark version, machine size, etc.
    # the config follows the format in the databricks documentation: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/2.0/jobs
    config_template: {'run_name':'','new_cluster':{'spark_version':'9.1.x-scala2.12','node_type_id':'Standard_D3_v2','num_workers':2,'spark_conf':{}},'libraries':[{'jar':''}],'spark_jar_task':{'main_class_name':'','parameters':['']}}
    # Feathr Job location. Support local paths, path start with http(s)://, and paths start with dbfs:/
    work_dir: 'dbfs:/feathr_getting_started'
    # this is the default location so end users don't have to compile the runtime again.
    feathr_runtime_location: 'https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-0.1.0-SNAPSHOT.jar'
online_store:
  redis:
    # Redis configs to access Redis cluster
    host: 'feathrazuretest3redis.redis.cache.windows.net'
    port: 6380
    ssl_enabled: True
feature_registry:
  purview:
    # Registry configs
    # configure the name of the purview endpoint
    purview_name: 'feathrazuretest3-purview1'
    # delimiter indicates how the project/workspace name, feature names, etc. are delimited. By default it is '__'
    # this is for global reference (mainly for feature sharing). For example, when we set up a project called foo with an anchor called 'taxi_driver' and a feature called 'f_daily_trips',
    # the feature will have a globally unique name called 'foo__taxi_driver__f_daily_trips'
    delimiter: '__'
"""
import tempfile

# Write the config to a temporary file so that FeathrClient can load it by path
with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp:
    tmp.write(yaml_config)
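
The comments at the top of the config describe how any setting can be overridden by an environment variable whose name joins the config layers with `__`. The sketch below illustrates that naming convention only; `resolve_setting` is a hypothetical helper written for this example and is not part of the Feathr API, and the in-memory `config` dict stands in for the parsed YAML.

```python
import os


def resolve_setting(config: dict, *keys: str):
    """Hypothetical helper: prefer an environment variable named after the
    upper-cased config path joined with '__', else fall back to the config."""
    env_name = "__".join(key.upper() for key in keys)
    if env_name in os.environ:
        # Note: values read from the environment are always strings
        return os.environ[env_name]
    value = config
    for key in keys:
        value = value[key]
    return value


# Stand-in for the parsed YAML (values here are placeholders)
config = {"online_store": {"redis": {"host": "from-yaml.example", "port": 6380}}}

# Same override as the `export ONLINE_STORE__REDIS__HOST=...` example in the config comments
os.environ["ONLINE_STORE__REDIS__HOST"] = "feathrazure.redis.cache.windows.net"

host = resolve_setting(config, "online_store", "redis", "host")
port = resolve_setting(config, "online_store", "redis", "port")
print(host, port)
```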


In [5]:
import pandas as pd
pd.read_csv(
    "https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2020-04.csv")



Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2.0,2020-04-01 00:44:02,2020-04-01 00:52:23,N,1.0,42,41,1.0,1.68,8.00,0.5,0.5,0.0,0.00,,0.3,9.30,1.0,1.0,0.0
1,2.0,2020-04-01 00:24:39,2020-04-01 00:33:06,N,1.0,244,247,2.0,1.94,9.00,0.5,0.5,0.0,0.00,,0.3,10.30,2.0,1.0,0.0
2,2.0,2020-04-01 00:45:06,2020-04-01 00:51:13,N,1.0,244,243,3.0,1.00,6.50,0.5,0.5,0.0,0.00,,0.3,7.80,2.0,1.0,0.0
3,2.0,2020-04-01 00:45:06,2020-04-01 01:04:39,N,1.0,244,243,2.0,2.81,12.00,0.5,0.5,0.0,0.00,,0.3,13.30,2.0,1.0,0.0
4,2.0,2020-04-01 00:00:23,2020-04-01 00:16:13,N,1.0,75,169,1.0,6.79,21.00,0.5,0.5,0.0,0.00,,0.3,22.30,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35607,,2020-04-30 23:29:00,2020-04-30 23:57:00,,,37,147,,11.41,35.82,0.0,0.0,0.0,6.12,,0.3,42.24,,,
35608,,2020-04-30 23:11:00,2020-04-30 23:47:00,,,188,230,,11.17,35.45,0.0,0.0,0.0,0.00,,0.3,38.50,,,
35609,,2020-04-30 23:18:00,2020-04-30 23:46:00,,,205,37,,14.37,34.54,0.0,0.0,0.0,0.00,,0.3,34.84,,,
35610,,2020-04-30 23:55:00,2020-05-01 00:10:00,,,37,188,,4.25,16.72,0.0,0.0,0.0,0.00,,0.3,17.02,,,


Let's put it in a configuration file.


In [1]:
import glob
import os
import tempfile
from datetime import datetime, timedelta
from math import sqrt

import pandas as pd
import pandavro as pdx
from feathr.anchor import FeatureAnchor
from feathr.client import FeathrClient
from feathr.dtype import BOOLEAN, FLOAT, INT32, ValueType
from feathr.feature import Feature
from feathr.feature_derivations import DerivedFeature
from feathr.materialization_settings import (BackfillTime,
                                             MaterializationSettings)
from feathr.query_feature_list import FeatureQuery
from feathr.settings import ObservationSettings
from feathr.sink import RedisSink
from feathr.source import INPUT_CONTEXT, HdfsSource
from feathr.transformation import WindowAggTransformation
from feathr.typed_key import TypedKey
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


Set up the necessary environment variables first.


In [2]:
# Replace the placeholders with your own values; never commit real credentials to source control
os.environ['REDIS_PASSWORD'] = '<your-redis-password>'
os.environ['AZURE_CLIENT_ID'] = '<your-service-principal-client-id>'
os.environ['AZURE_TENANT_ID'] = '<your-azure-tenant-id>'
os.environ['AZURE_CLIENT_SECRET'] = '<your-service-principal-client-secret>'
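
Before creating the client, it can help to check that every variable listed under `required_environment_variables` in the config is actually set. The following is a small sketch of such a check; `missing_env_vars` is a hypothetical helper for this notebook, not a Feathr function.

```python
import os

# Names taken from `required_environment_variables` in the config above
REQUIRED_ENV_VARS = [
    "REDIS_PASSWORD",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
]


def missing_env_vars(names, environ=os.environ):
    """Hypothetical helper: return the names that are unset or empty."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing required environment variables:", ", ".join(missing))
```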


Then we initialize a Feathr client:


In [4]:
client = FeathrClient(config_path=tmp.name)


## Feature Engineering with Feathr

We want to predict the fare of each trip. To do that, we engineer features such as:

* Duration of the trip.
* Customized datetime features: instead of using a raw datetime like `2021-01-01 00:15:56`, we derive features from it, for example the day of the week or the day of the month.


Doing those transformations with Feathr is straightforward. We only need to define a few feature configurations, for example:

```python
f_trip_distance: "(float)trip_distance"
f_is_long_trip_distance: "trip_distance>30"
```
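
To build intuition for what these row-level transformations produce, here is a local pandas sketch of comparable computations on two sample rows from the dataset preview. It is an illustration only: Feathr evaluates its own expressions on Spark, and two details here are assumptions. The duration is floored to whole minutes, and pandas numbers weekdays from Monday=0, whereas Spark SQL's `dayofweek` numbers them from Sunday=1.

```python
import pandas as pd

# Two sample trips copied from the dataset preview
trips = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime(
        ["2020-04-01 00:44:02", "2020-04-01 00:24:39"]),
    "lpep_dropoff_datetime": pd.to_datetime(
        ["2020-04-01 00:52:23", "2020-04-01 00:33:06"]),
    "trip_distance": [1.68, 1.94],
})

# Trip duration in whole minutes (flooring is an assumption of this sketch)
duration = trips["lpep_dropoff_datetime"] - trips["lpep_pickup_datetime"]
trips["f_trip_time_duration"] = (duration.dt.total_seconds() // 60).astype(int)

# Boolean flag for long trips, mirroring the `trip_distance>30` expression
trips["f_is_long_trip_distance"] = trips["trip_distance"] > 30

# Day of week in pandas convention (Monday=0 ... Sunday=6)
trips["f_day_of_week"] = trips["lpep_dropoff_datetime"].dt.dayofweek

print(trips[["f_trip_time_duration", "f_is_long_trip_distance", "f_day_of_week"]])
```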


In [5]:
batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
                          event_timestamp_column="lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")
f_trip_distance = Feature(name="f_trip_distance",
                          feature_type=FLOAT, transform="trip_distance")
f_trip_time_duration = Feature(name="f_trip_time_duration",
                               feature_type=INT32,
                               transform="time_duration(lpep_pickup_datetime, lpep_dropoff_datetime, 'minutes')")
features = [
    f_trip_distance,
    f_trip_time_duration,
    Feature(name="f_is_long_trip_distance",
            feature_type=BOOLEAN,
            transform="cast_float(trip_distance)>30"),
    Feature(name="f_day_of_week",
            feature_type=INT32,
            transform="dayofweek(lpep_dropoff_datetime)"),
]
request_anchor = FeatureAnchor(name="request_features",
                               source=INPUT_CONTEXT,
                               features=features)
f_trip_time_distance = DerivedFeature(name="f_trip_time_distance",
                                      feature_type=FLOAT,
                                      input_features=[
                                          f_trip_distance, f_trip_time_duration],
                                      transform="f_trip_distance * f_trip_time_duration")
f_trip_time_rounded = DerivedFeature(name="f_trip_time_rounded",
                                     feature_type=INT32,
                                     input_features=[f_trip_time_duration],
                                     transform="f_trip_time_duration % 10")
location_id = TypedKey(key_column="DOLocationID",
                       key_column_type=ValueType.INT32,
                       description="location id in NYC",
                       full_name="nyc_taxi.location_id")
agg_features = [Feature(name="f_location_avg_fare",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
                                                          agg_func="AVG",
                                                          window="90d")),
                Feature(name="f_location_max_fare",
                        key=location_id,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
                                                          agg_func="MAX",
                                                          window="90d"))
                ]
agg_anchor = FeatureAnchor(name="aggregationFeatures",
                           source=batch_source,
                           features=agg_features)
client.build_features(anchor_list=[agg_anchor, request_anchor], derived_feature_list=[
                      f_trip_time_distance, f_trip_time_rounded])
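
The two aggregated features above are 90-day windowed `AVG` and `MAX` of `fare_amount`, keyed by `DOLocationID`. The pandas sketch below shows the same idea on a tiny made-up table. It is a conceptual approximation, not Feathr's implementation: the exact window boundaries Feathr uses may differ, and the fare values are invented for the example.

```python
import pandas as pd

# Made-up fares for two drop-off locations
fares = pd.DataFrame({
    "DOLocationID": [121, 121, 121, 49],
    "lpep_dropoff_datetime": pd.to_datetime(
        ["2019-12-01", "2020-03-01", "2020-04-15", "2020-04-20"]),
    "fare_amount": [100.0, 10.0, 20.0, 8.0],
}).sort_values("lpep_dropoff_datetime")

# Time-based rolling windows need a datetime index
windowed = (fares.set_index("lpep_dropoff_datetime")
                 .groupby("DOLocationID")["fare_amount"]
                 .rolling("90D"))

avg_fare = windowed.mean()  # analogous to agg_func="AVG", window="90d"
max_fare = windowed.max()   # analogous to agg_func="MAX", window="90d"

print(avg_fare)
print(max_fare)
```

For location 121 on 2020-04-15, the 2019-12-01 fare has fallen out of the 90-day window, so only the two more recent fares contribute to the average and the maximum.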


In [6]:
feature_query = FeatureQuery(
    feature_list=["f_location_avg_fare", "f_trip_time_distance", "f_is_long_trip_distance"], key=location_id)
settings = ObservationSettings(
    observation_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss")
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/output.avro")


2022-03-19 13:33:00.390 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:38 - Uploading /var/folders/c0/h7cgkq4x56s__301z9203fw0001p3_/T/tmpz7lxq3z7/feature_join_conf/feature_join.conf to cloud..
2022-03-19 13:33:00.391 | INFO     | feathr._synapse_submission:upload_file:317 - Uploading file feature_join.conf
2022-03-19 13:33:00.877 | INFO     | feathr._synapse_submission:upload_file:323 - /var/folders/c0/h7cgkq4x56s__301z9203fw0001p3_/T/tmpz7lxq3z7/feature_join_conf/feature_join.conf is uploaded to location: abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_getting_started/feature_join.conf
2022-03-19 13:33:00.878 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:41 - /var/folders/c0/h7cgkq4x56s__301z9203fw0001p3_/T/tmpz7lxq3z7/feature_join_conf/feature_join.conf is uploaded to location: abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_getting_started/feature_join.conf

In [13]:
def get_result_df(client: FeathrClient) -> pd.DataFrame:
    """Download the job result dataset from cloud as a Pandas dataframe."""
    res_url = client.get_job_result_uri(block=True, timeout_sec=600)
    tmp_dir = tempfile.TemporaryDirectory()
    client.feathr_spark_laucher.download_result(
        result_path=res_url, local_folder=tmp_dir.name)
    dataframe_list = []
    # assuming the result are in avro format
    for file in glob.glob(os.path.join(tmp_dir.name, '*.avro')):
        dataframe_list.append(pdx.read_avro(file))
    vertical_concat_df = pd.concat(dataframe_list, axis=0)
    tmp_dir.cleanup()
    return vertical_concat_df


df_res = get_result_df(client)


2022-03-19 11:50:56.689 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status: not_started
2022-03-19 11:51:26.870 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status: not_started
2022-03-19 11:51:57.024 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status: not_started
2022-03-19 11:52:27.194 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status: not_started
2022-03-19 11:52:57.358 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status: starting
2022-03-19 11:53:27.549 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status: starting
2022-03-19 11:53:57.713 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status: running
2022-03-19 11:54:27.878 | INFO     | feathr._synapse_submission:wait_for_completion:109 - Current Spark job status

The result is also available in the cloud, at the `output_path` passed to `get_offline_features`.


After getting all the features, let's inspect the result and then train a model:


In [18]:
df_res


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,...,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge,f_location_avg_fare
0,1,2020-04-01 23:08:20,2020-04-01 23:34:14,N,1,225,121,1,.00,26.20,...,0.5,0,0,0,0.3,27,1,1,0,26.200001
1,1,2020-04-02 06:31:30,2020-04-02 06:51:12,N,1,76,121,1,.00,30.13,...,0.5,0,0,0,0.3,30.93,1,1,0,26.200001
2,2,2020-04-02 14:47:48,2020-04-02 14:57:56,N,1,130,121,1,1.78,8.50,...,0.5,1.86,0,0,0.3,11.16,1,1,0,26.200001
3,2,2020-04-02 15:45:13,2020-04-02 16:13:57,N,1,173,121,1,5.45,25.00,...,0.5,0,0,0,0.3,25.8,1,1,0,26.200001
4,1,2020-04-03 06:34:33,2020-04-03 06:55:25,N,1,76,121,1,.00,30.13,...,0.5,0,0,0,0.3,30.93,1,1,0,26.200001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
328,0,2020-04-30 13:53:00,2020-04-30 13:57:00,0,0,17,49,0,.98,8.00,...,0,2.75,0,0,0.3,11.05,0,0,0,8.500000
329,0,2020-04-30 16:13:00,2020-04-30 16:46:00,0,0,197,49,0,10.05,32.47,...,0,2.75,0,0,0.3,35.52,0,0,0,8.500000
330,0,2020-04-30 17:01:00,2020-04-30 17:51:00,0,0,185,49,0,21.28,50.75,...,0,11.43,6.12,0,0.3,68.6,0,0,0,8.500000
331,0,2020-04-30 18:36:00,2020-04-30 18:47:00,0,0,65,49,0,2.06,8.00,...,0,2.75,0,0,0.3,11.05,0,0,0,8.500000


In [20]:
from sklearn.ensemble import GradientBoostingRegressor

# Drop the raw datetime and flag columns, which the model cannot consume directly
final_df = df_res
final_df.drop(["lpep_pickup_datetime", "lpep_dropoff_datetime",
              "store_and_fwd_flag"], axis=1, inplace=True, errors='ignore')
final_df.fillna(0, inplace=True)
final_df['fare_amount'] = final_df['fare_amount'].astype("float64")


train_x, test_x, train_y, test_y = train_test_split(final_df.drop(["fare_amount"], axis=1),
                                                    final_df["fare_amount"],
                                                    test_size=0.2,
                                                    random_state=42)
model = GradientBoostingRegressor()
model.fit(train_x, train_y)

y_predict = model.predict(test_x)

y_actual = test_y.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))

# Note: this is a weighted MAPE (sum of absolute errors over sum of actuals),
# not the per-row mean of |error| / |actual|
sum_errors = sum(abs(actual_val - predict_val)
                 for actual_val, predict_val in zip(y_actual, y_predict))
sum_actuals = sum(y_actual)

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)


Model MAPE:
0.026395762852816796

Model Accuracy:
0.9736042371471832
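
The "MAPE" printed above is computed as the sum of absolute errors divided by the sum of actual fares, which is usually called weighted MAPE (WAPE). The sketch below contrasts it with the per-row MAPE on toy numbers; the function names `wape` and `mape` and the sample values are introduced here for illustration only.

```python
import numpy as np


def wape(y_true, y_pred):
    """Sum of absolute errors over sum of actuals (the metric used in the cell above)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).sum() / y_true.sum()


def mape(y_true, y_pred):
    """Mean of per-row absolute percentage errors; undefined when an actual is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))


# Toy values, not taken from the taxi data
toy_actual = [10.0, 20.0]
toy_predicted = [12.0, 18.0]
print("WAPE:", wape(toy_actual, toy_predicted))
print("MAPE:", mape(toy_actual, toy_predicted))
```

The two metrics differ whenever the errors are not proportional to the actual values, so it is worth stating which one a reported number refers to.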


We now want to push the generated features to the online store, so we configure the destination (a Redis sink) in the materialization settings:


In [21]:
backfill_time = BackfillTime(start=datetime(
    2020, 5, 20), end=datetime(2020, 5, 20), step=timedelta(days=1))
redisSink = RedisSink(table_name="nycTaxiDemoFeature")
settings = MaterializationSettings("nycTaxiTable",
                                   backfill_time=backfill_time,
                                   sinks=[redisSink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings)
client.wait_job_to_finish(timeout_sec=500)


2022-03-19 12:02:02.718 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:38 - Uploading /var/folders/c0/h7cgkq4x56s__301z9203fw0001p3_/T/tmpf1nzrsse/feature_gen_conf/auto_gen_config_1589958000.0.conf to cloud..
2022-03-19 12:02:02.719 | INFO     | feathr._synapse_submission:upload_file:317 - Uploading file auto_gen_config_1589958000.0.conf
2022-03-19 12:02:03.782 | INFO     | feathr._synapse_submission:upload_file:323 - /var/folders/c0/h7cgkq4x56s__301z9203fw0001p3_/T/tmpf1nzrsse/feature_gen_conf/auto_gen_config_1589958000.0.conf is uploaded to location: abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_getting_started/auto_gen_config_1589958000.0.conf
2022-03-19 12:02:03.783 | INFO     | feathr._synapse_submission:upload_or_get_cloud_path:41 - /var/folders/c0/h7cgkq4x56s__301z9203fw0001p3_/T/tmpf1nzrsse/feature_gen_conf/auto_gen_config_1589958000.0.conf is uploaded to location: abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.win
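
`BackfillTime` takes a `start`, an `end`, and a `step`. One way to picture it is as a list of cutoff times, one per materialization run. The sketch below enumerates such cutoffs under the assumption that the end is inclusive, which is consistent with the cell above (where `start == end`) producing a single generated config. `backfill_cutoffs` is a hypothetical helper, not part of the Feathr API.

```python
from datetime import datetime, timedelta


def backfill_cutoffs(start: datetime, end: datetime, step: timedelta):
    """Hypothetical helper: list every cutoff from start to end (assumed inclusive)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cutoffs = []
    current = start
    while current <= end:
        cutoffs.append(current)
        current += step
    return cutoffs


# Same values as the cell above: a single cutoff on 2020-05-20
single = backfill_cutoffs(datetime(2020, 5, 20), datetime(2020, 5, 20), timedelta(days=1))

# A three-day backfill would yield three cutoffs
three_days = backfill_cutoffs(datetime(2020, 5, 18), datetime(2020, 5, 20), timedelta(days=1))
print(single)
print(three_days)
```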

We can then get the features from the online store (Redis):


In [22]:
res = client.get_online_features('nycTaxiDemoFeature', '265', [
                                 'f_location_avg_fare', 'f_location_max_fare'])


In [23]:
client.multi_get_online_features("nycTaxiDemoFeature", ["239", "265"], [
                                 'f_location_avg_fare', 'f_location_max_fare'])


{'239': [10.5, 10.5], '265': [42.5, 42.5]}
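
The response maps each key to a plain list of values. For online scoring it is often easier to work with named values, so here is a small sketch that pairs the values with the requested feature names. It assumes the values come back in the same order as the requested names, and it uses the output shown above as sample data rather than calling the client.

```python
# Feature names in the order they were requested
feature_names = ["f_location_avg_fare", "f_location_max_fare"]

# Sample response copied from the output above
response = {"239": [10.5, 10.5], "265": [42.5, 42.5]}

# Pair each value with its feature name (assumes the order is preserved)
labelled = {
    key: dict(zip(feature_names, values))
    for key, values in response.items()
}
print(labelled["265"])
```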

In [None]:
client.list_registered_features(project_name="frame_getting_started")
