# Feathr Quick Start Notebook

This notebook shows how to use the Feathr Feature Store to build a model that predicts NYC taxi fares. The dataset comes from the [NYC TLC trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

The major problems Feathr solves are:

1. Create, share, and manage useful features from raw source data.
2. Provide point-in-time feature joins to create training datasets without data leakage.
3. Deploy the same feature data to an online store to eliminate skew between training and inference data.
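
To make the point-in-time idea concrete, here is a minimal pure-Python sketch. It is not Feathr's implementation; the helper `point_in_time_lookup` and the fare history are made up for illustration. For each observation it picks the latest feature value recorded at or before the observation time, so no future value leaks into a training row.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(feature_history, as_of):
    """Return the latest value recorded at or before `as_of`, or None if there is none.

    `feature_history` is a list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx > 0 else None


# Hypothetical history of an average-fare feature for one drop-off location.
avg_fare_history = [
    (datetime(2020, 4, 1), 10.0),
    (datetime(2020, 4, 10), 12.5),
    (datetime(2020, 4, 20), 15.0),
]

# Hypothetical observation (training row) timestamps.
observation_times = [datetime(2020, 3, 31), datetime(2020, 4, 10), datetime(2020, 4, 15)]

joined = [(t, point_in_time_lookup(avg_fare_history, t)) for t in observation_times]
for t, value in joined:
    print(t.date(), value)
```

The observation on 2020-04-15 receives the value recorded on 2020-04-10, not the later one from 2020-04-20, and the observation that predates all history receives no value.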

## Prerequisite

Feathr has native cloud integration. The first step is to provision the required cloud resources.

Follow the [Feathr ARM deployment guide](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-arm.html) to run Feathr on Azure. This lets you get started quickly with an automated deployment based on an Azure Resource Manager template. For more details, please refer to [README.md](https://github.com/feathr-ai/feathr#%EF%B8%8F-running-feathr-on-cloud-with-a-few-simple-steps).

Additionally, to run this notebook, you'll need to install the `feathr` pip package. For local Spark, simply run `pip install feathr` on the machine that runs this notebook. To use Databricks or Azure Synapse Analytics, please see the dependency management documents:
- [Azure Databricks dependency management](https://learn.microsoft.com/en-us/azure/databricks/libraries/)
- [Azure Synapse Analytics dependency management](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries)

## Notebook Steps

This tutorial demonstrates the key capabilities of Feathr, including:

1. Install Feathr and necessary dependencies
2. Create shareable features with Feathr feature definition configs
3. Create training data using point-in-time correct feature join
4. Train a prediction model and evaluate the model and features
5. Register the features to share across teams
6. Materialize feature values for online scoring

The overall data flow is as follows:

<img src="https://raw.githubusercontent.com/feathr-ai/feathr/main/docs/images/feature_flow.png" width="800">

## 1. Install Feathr and Necessary Dependencies

Install Feathr and the necessary packages by running one of the following commands if you haven't installed them already:

In [1]:
# To install feathr from the latest code in the repo:
#%pip install "git+https://github.com/feathr-ai/feathr.git#subdirectory=feathr_project&egg=feathr[notebook]"

# To install the latest release:
#%pip install "feathr[notebook]"

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from datetime import timedelta
import os
from pathlib import Path

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

import feathr
from feathr import (
    FeathrClient,
    # Feature data types
    BOOLEAN, FLOAT, INT32, ValueType,
    # Feature data sources
    INPUT_CONTEXT, HdfsSource,
    # Feature aggregations
    TypedKey, WindowAggTransformation,
    # Feature types and anchor
    DerivedFeature, Feature, FeatureAnchor,
    # Materialization
    BackfillTime, MaterializationSettings, RedisSink,
    # Offline feature computation
    FeatureQuery, ObservationSettings,
)
from feathr.datasets import nyc_taxi
from feathr.spark_provider.feathr_configurations import SparkExecutionConfiguration
from feathr.utils.config import generate_config
from feathr.utils.job_utils import get_result_df
from feathr.utils.platform import is_databricks, is_jupyter

print(f"Feathr version: {feathr.__version__}")

Feathr version: 1.0.0


## 2. Create Shareable Features with Feathr Feature Definition Configs

First, we define all the necessary resource key values for authentication. These values are retrieved from the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) cloud key-value store. For authentication, this notebook uses the Azure CLI credential, but instead of running `az login --use-device-code` you may grant the necessary service principal the secret `list` and `get` permissions.

Please refer to [A note on using azure key vault to store credentials](https://github.com/feathr-ai/feathr/blob/41e7496b38c43af6d7f8f1de842f657b27840f6d/docs/how-to-guides/feathr-configuration-and-env.md#a-note-on-using-azure-key-vault-to-store-credentials) for more details.

In [4]:
RESOURCE_PREFIX = None  # TODO fill the value used to deploy the resources via ARM template
PROJECT_NAME = "nyc_taxi"

# Currently supported: 'azure_synapse', 'databricks', and 'local'
SPARK_CLUSTER = "local"

# TODO fill values to use databricks cluster:
DATABRICKS_CLUSTER_ID = None             # Set Databricks cluster id to use an existing cluster
if is_databricks():
    # If this notebook is running on Databricks, its context can be used to retrieve token and instance URL
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    DATABRICKS_WORKSPACE_TOKEN_VALUE = ctx.apiToken().get()
    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = f"https://{ctx.tags().get('browserHostName').get()}"
else:
    DATABRICKS_WORKSPACE_TOKEN_VALUE = None                  # Set Databricks workspace token to use databricks
    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = None  # Set Databricks workspace url to use databricks

# TODO fill values to use Azure Synapse cluster:
AZURE_SYNAPSE_SPARK_POOL = None  # Set Azure Synapse Spark pool name
AZURE_SYNAPSE_URL = None         # Set Azure Synapse workspace url to use Azure Synapse
ADLS_KEY = None                  # Set Azure Data Lake Storage key to use Azure Synapse

# An existing Feathr config file path. If None, we'll generate a new config based on the constants in this cell.
FEATHR_CONFIG_PATH = None

# If set True, use an interactive browser authentication to get the redis password.
USE_CLI_AUTH = False

# If set True, register the features to Feathr registry.
REGISTER_FEATURES = False

# (For the notebook test pipeline) If true, use ScrapBook package to collect the results.
SCRAP_RESULTS = False

To use Databricks as the Feathr client's target platform, you may need to set a Databricks token as an environment variable, for example:

`export DATABRICKS_WORKSPACE_TOKEN_VALUE=your-token`

or in a notebook cell:

`os.environ["DATABRICKS_WORKSPACE_TOKEN_VALUE"] = "your-token"`

If you are running this notebook on Databricks, the token is retrieved automatically from the current Databricks notebook context.

To use an Azure Synapse cluster instead, you have to specify the Synapse workspace storage key:

`export ADLS_KEY=your-key`

or in a notebook cell:

`os.environ["ADLS_KEY"] = "your-key"`

In [5]:
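if SPARK_CLUSTER == "azure_synapse" and not os.environ.get("ADLS_KEY"):
    if ADLS_KEY is None:  # os.environ only accepts strings, so fail with a clear message
        raise ValueError("Set ADLS_KEY to use Azure Synapse.")
    os.environ["ADLS_KEY"] = ADLS_KEY
elif SPARK_CLUSTER == "databricks" and not os.environ.get("DATABRICKS_WORKSPACE_TOKEN_VALUE"):
    if DATABRICKS_WORKSPACE_TOKEN_VALUE is None:
        raise ValueError("Set DATABRICKS_WORKSPACE_TOKEN_VALUE to use Databricks.")
    os.environ["DATABRICKS_WORKSPACE_TOKEN_VALUE"] = DATABRICKS_WORKSPACE_TOKEN_VALUE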

In [6]:
# Get an authentication credential to access Azure resources and register features
if USE_CLI_AUTH:
    # Use AZ CLI interactive browser authentication
    !az login --use-device-code
    from azure.identity import AzureCliCredential
    credential = AzureCliCredential(additionally_allowed_tenants=['*'],)
elif "AZURE_TENANT_ID" in os.environ and "AZURE_CLIENT_ID" in os.environ and "AZURE_CLIENT_SECRET" in os.environ:
    # Use Environment variable secret
    from azure.identity import EnvironmentCredential
    credential = EnvironmentCredential()
else:
    # Try to use the default credential
    from azure.identity import DefaultAzureCredential
    credential = DefaultAzureCredential(
        exclude_interactive_browser_credential=False,
        additionally_allowed_tenants=['*'],
    )
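
The selection order in the cell above can be summarized as a small pure function. This is only a sketch for reasoning about which branch runs: `choose_credential_kind` is a hypothetical helper that returns the class name as a string and does not import or call `azure.identity`.

```python
def choose_credential_kind(use_cli_auth, environ):
    """Return the name of the credential class the cell above would instantiate."""
    service_principal_vars = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")
    if use_cli_auth:
        # Explicit opt-in wins over everything else.
        return "AzureCliCredential"
    if all(name in environ for name in service_principal_vars):
        # All three service principal variables must be present.
        return "EnvironmentCredential"
    return "DefaultAzureCredential"


print(choose_credential_kind(False, {}))
```

Note that a partially configured service principal (for example, only `AZURE_TENANT_ID` set) falls through to the default credential.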

### Configurations

Feathr uses a YAML file to define configurations. Please refer to [feathr_config.yaml](https://github.com/feathr-ai/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) for the meaning of each field.

All Feathr configurations can be written to the YAML file via keyword arguments of the `generate_config` helper function. Each keyword argument is the concatenation of the config name's layers, using `__` as a separator.
For example, to specify a different value for the feature registry API endpoint, pass `feature_registry__api_endpoint="YOUR-API-ENDPOINT-URL"`.

Note that a default value for the API endpoint is set based on `RESOURCE_PREFIX`.
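
The following sketch illustrates how `__`-separated names correspond to nested YAML layers. It is not the code of `generate_config`; `nest_config_keys` is a hypothetical helper and the values are placeholders.

```python
def nest_config_keys(flat, sep="__"):
    """Turn {"a__b__c": value} into {"a": {"b": {"c": value}}}."""
    nested = {}
    for flat_key, value in flat.items():
        *parents, leaf = flat_key.split(sep)
        node = nested
        for part in parents:
            # Walk down one layer, creating it on first use.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


nested = nest_config_keys({
    "feature_registry__api_endpoint": "YOUR-API-ENDPOINT-URL",
    "spark_config__databricks__feathr_runtime_location": "YOUR-RUNTIME-LOCATION",
})
print(nested)
```

Each `__` in a keyword argument name therefore stands for one level of indentation in the YAML file.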

In [7]:
print(feathr.__version__)
os.environ['S3_ENDPOINT'] = "127.0.0.1:9000"
os.environ['S3_ACCESS_KEY'] = "3Uu5nPtNBjKLtHov89dD"
os.environ['S3_SECRET_KEY'] = "fd3YycgUgZkIQPTayGStql9MX4j8Z1sctc19iOh1"
os.environ['SPARK_LOCAL_IP'] = "127.0.0.1"
os.environ['REDIS_PASSWORD'] = "foobared"  # default password for Redis

1.0.0


All configurations can be overridden by environment variables whose names join the layers of the config file with `__`, the same way keyword arguments are passed to the `generate_config` utility function.

For example, `feathr_runtime_location` in the Databricks config can be overridden by setting the `spark_config__databricks__feathr_runtime_location` environment variable.
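
As a rough sketch of this precedence, the hypothetical helper below checks an environment mapping first, then a nested config dictionary, and finally falls back to a default. It is not Feathr's reader: the remote key-value store lookup is omitted, and the exact casing Feathr expects for environment variable names is not modeled here.

```python
def resolve_setting(key, environ, file_config, default=None, sep="__"):
    """Resolve `key` from the environment first, then from a nested config dict."""
    if key in environ:
        return environ[key]
    node = file_config
    for part in key.split(sep):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


KEY = "spark_config__databricks__feathr_runtime_location"
file_config = {"spark_config": {"databricks": {"feathr_runtime_location": "from-file"}}}

print(resolve_setting(KEY, {}, file_config))
print(resolve_setting(KEY, {KEY: "from-env"}, file_config))
```

The first call returns the value from the config dictionary and the second returns the environment value, because the environment is consulted first.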

### Initialize Feathr client

In [8]:
# Path to this project's Feathr config file (a YAML file, not a folder)
feathr_config_path = Path(f"./{PROJECT_NAME}_feathr_config.yaml")
client = FeathrClient(str(feathr_config_path))

2024-09-06 16:50:28.202 | INFO     | feathr.utils._env_config_reader:get:62 - Config secrets__azure_key_vault__name is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:50:28.203 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__s3__s3_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:50:28.205 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__adls__adls_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:50:28.205 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__wasb__wasb_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:50:28

### Prepare the NYC taxi fare dataset

In [9]:
# # If the notebook is running on Jupyter, start a spark session:
# if is_jupyter():
#     spark = (
#         SparkSession
#         .builder
#         .appName("feathr")
#         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0,io.delta:delta-core_2.12:2.1.1")
#         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
#         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
#         .config("spark.ui.port", "8080")  # Set ui port other than the default one (4040) so that feathr spark job doesn't fail. 
#         .getOrCreate()
#     )

# # Else, you must already have a spark session object available in databricks or synapse notebooks.

In [10]:
# Use dbfs if the notebook is running on Databricks
if is_databricks():
    WORKING_DIR = f"/dbfs/{PROJECT_NAME}"
else:
    WORKING_DIR = PROJECT_NAME

In [11]:
# # Download the data file
# data_file_path = "../../feathr_project/test/test_user_workspace/green_tripdata_2020-04_with_index.csv"
# df_raw = nyc_taxi.get_spark_df(spark=spark, local_cache_path=data_file_path)
# df_raw.limit(5).show()

In [12]:
TIMESTAMP_COL = "lpep_dropoff_datetime"
TIMESTAMP_FORMAT = "yyyy-MM-dd HH:mm:ss"
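
`TIMESTAMP_FORMAT` uses Spark/Java datetime pattern letters. When checking timestamps on the Python side, the equivalent `strptime` format can be used, as in this small sketch. The sample value is made up and `PYTHON_TIMESTAMP_FORMAT` is a name introduced only for this illustration.

```python
from datetime import datetime

# Python strptime equivalent of the Spark/Java pattern "yyyy-MM-dd HH:mm:ss".
PYTHON_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

sample = "2020-04-19 08:30:00"  # made-up value in the same layout as the timestamp column
parsed = datetime.strptime(sample, PYTHON_TIMESTAMP_FORMAT)
print(parsed.isoformat())
```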

In [13]:
# # Upload files to cloud if needed
# if client.spark_runtime == "local":
#     # In local mode, we can use the same data path as the source.
#     data_source_path = data_file_path
# elif client.spark_runtime == "databricks" and is_databricks():
#     # If the notebook is running on databricks, we can use the same data path as the source.
#     data_source_path = data_file_path.replace("/dbfs", "dbfs:", 1)
# else:
#     # Otherwise, upload the local file to the cloud storage (either dbfs or adls).
#     data_source_path = client.feathr_spark_launcher.upload_or_get_cloud_path(data_file_path)    

## Get Features from the Registry Server

In [14]:
feature_dict = client.get_features_from_registry(project_name=PROJECT_NAME, return_keys=True, verbose=True)

2024-09-06 16:50:28.867 | INFO     | feathr.client:get_features_from_registry:1147 - Get anchor features from registry: 
2024-09-06 16:50:28.868 | INFO     | feathr.client:get_features_from_registry:1153 - {
  "name": "f_trip_distance",
  "featureType": {
    "type": "TENSOR",
    "tensorCategory": "DENSE",
    "dimensionType": [],
    "valType": "FLOAT"
  },
  "key": [
    {
      "keyColumn": "NOT_NEEDED",
      "keyColumnType": "UNSPECIFIED",
      "fullName": "feathr.dummy_typedkey",
      "description": "feathr.dummy_typedkey",
      "keyColumnAlias": "NOT_NEEDED"
    }
  ],
  "transformation": {
    "transformExpr": "trip_distance"
  }
}
2024-09-06 16:50:28.869 | INFO     | feathr.client:get_features_from_registry:1153 - {
  "name": "f_trip_time_duration",
  "featureType": {
    "type": "TENSOR",
    "tensorCategory": "DENSE",
    "dimensionType": [],
    "valType": "FLOAT"
  },
  "key": [
    {
      "keyColumn": "NOT_NEEDED",
      "keyColumnType": "UNSPECIFIED",
      "fullNam

In [15]:
[feat.name for feat in list(feature_dict[0].values())]

['f_trip_distance',
 'f_trip_time_duration',
 'f_is_long_trip_distance',
 'f_day_of_week',
 'f_day_of_month',
 'f_hour_of_day',
 'f_location_avg_fare',
 'f_location_max_fare',
 'f_trip_speed']

In [16]:
[type_key.key_column for type_keys in list(feature_dict[1].values()) for type_key in type_keys]

['NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'DOLocationID',
 'DOLocationID',
 'NOT_NEEDED']
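
The two outputs above can be lined up to see which features are keyed by a real column. The sketch below copies the printed lists into literals and assumes they are aligned one-to-one, which the outputs suggest (nine names, nine keys) but which is not guaranteed for features with several keys.

```python
feature_names = [
    "f_trip_distance", "f_trip_time_duration", "f_is_long_trip_distance",
    "f_day_of_week", "f_day_of_month", "f_hour_of_day",
    "f_location_avg_fare", "f_location_max_fare", "f_trip_speed",
]
key_columns = [
    "NOT_NEEDED", "NOT_NEEDED", "NOT_NEEDED",
    "NOT_NEEDED", "NOT_NEEDED", "NOT_NEEDED",
    "DOLocationID", "DOLocationID", "NOT_NEEDED",
]

# Keep only the features whose key is an actual column rather than the dummy key.
keyed_features = {
    name: key
    for name, key in zip(feature_names, key_columns)
    if key != "NOT_NEEDED"
}
print(keyed_features)
```

Only `f_location_avg_fare` and `f_location_max_fare` are keyed by `DOLocationID`; the remaining features use the dummy key.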

## Materialize Feature Values for Offline Scoring - HDFS

In [17]:
from datetime import datetime
from feathr import HdfsSink

In [18]:
FEATURE_TABLE_NAME = "nycTaxiDemoFeature"

# Time range to materialize
backfill_time = BackfillTime(start=datetime(2020, 4, 19), end=datetime(2020, 4, 20), step=timedelta(days=1))

# Destination: offline store
hdfs_sink = HdfsSink(output_path="s3a://feathrstore")
settings = MaterializationSettings(
    name=FEATURE_TABLE_NAME + ".job",  # job name
    backfill_time=backfill_time,
    sinks=[hdfs_sink],  # or adls_sink
    feature_names=["f_location_avg_fare", "f_location_max_fare"],
)

client.materialize_features(
    settings=settings,
    execution_configurations={
        "spark.feathr.outputFormat": "parquet",
        "spark.feathr.hdfs.local.enable": "true",
        "spark.hadoop.fs.s3a.endpoint": "http://127.0.0.1:9000",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
        "spark.jars.plus.packages": "io.minio:spark-select_2.11:2.1,org.apache.hadoop:hadoop-aws:3.3.2",
    },
)

client.wait_job_to_finish(timeout_sec=5000)

2024-09-06 16:50:29.284 | INFO     | feathr.utils._env_config_reader:get:62 - Config monitoring__database__sql__url is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:50:29.285 | INFO     | feathr.utils._env_config_reader:get:62 - Config monitoring__database__sql__user is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:50:29.286 | INFO     | feathr.spark_provider._localspark_submission:_get_debug_file_name:296 - Spark log path is debug/nyc_taxi_feathr_feature_materialization_job20240906165029
2024-09-06 16:50:29.287 | INFO     | feathr.spark_provider._localspark_submission:_init_args:271 - Spark job: nyc_taxi_feathr_feature_materialization_job is running on local spark with master: local[*].
2024-09-06 16:50:29.292 | INFO     | feathr.spark_provider._localspark_submission:submit_feathr_job:151 - Detail job 

	found org.apache.spark#spark-avro_2.12;3.3.0 in central
	found org.tukaani#xz;1.8 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.logging.log4j#log4j-core;2.17.2 in central
	found com.typesafe#config;1.3.4 in central
	found org.apache.hadoop#hadoop-mapreduce-client-core;3.3.2 in central
	found org.apache.hadoop#hadoop-yarn-client;3.3.2 in central
	found org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1 in central
	found commons-cli#commons-cli;1.2 in central

	found com.fasterxml.jackson.core#jackson-core;2.13.0 in central
	found com.fasterxml.jackson.core#jackson-databind;2.13.0 in central
	found org.apache.hadoop#hadoop-auth;3.3.2 in central
	found org.slf4j#slf4j-api;1.7.30 in central
	found commons-codec#commons-codec;1.11 in central
	found org.apache.httpcomponents#httpclient;4.5.13 in central
	found org.apache.httpcomponents#httpcore;4.4.13 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.nimbusds#nimbus-jose-jwt;9.8.1 in central
	found com.github.stephenc.jcip#jcip-annotations;1.0-1 in central
	found net.minidev#json-sm

2024-09-06 16:52:06.496 | INFO     | feathr.spark_provider._localspark_submission:wait_for_completion:198 - Pyspark job Completed
2024-09-06 16:52:07.501 | INFO     | feathr.spark_provider._localspark_submission:wait_for_completion:234 - Spark job with pid 122423 finished in: 98 seconds.
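
The `backfill_time` passed to the materialization settings above describes a range of cutoff times rather than a single one. The sketch below enumerates them under the assumption that both `start` and `end` are included. `backfill_cutoffs` is a hypothetical helper, and Feathr's own enumeration may differ in detail.

```python
from datetime import datetime, timedelta


def backfill_cutoffs(start, end, step):
    """List the cutoff times from `start` to `end` (both inclusive) in increments of `step`."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    if start > end:
        raise ValueError("start must not be after end")
    cutoffs = []
    current = start
    while current <= end:
        cutoffs.append(current)
        current += step
    return cutoffs


cutoffs = backfill_cutoffs(datetime(2020, 4, 19), datetime(2020, 4, 20), timedelta(days=1))
for cutoff in cutoffs:
    print(cutoff.date())
```

Under this assumption, the settings used here cover two daily cutoffs, 2020-04-19 and 2020-04-20.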

In [19]:
# If the notebook is running on Jupyter, start a spark session:
if is_jupyter():
    spark = (
        SparkSession
        .builder
        .appName("feathr")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-common:3.3.2,org.apache.hadoop:hadoop-mapreduce-client-core:3.3.2,org.apache.hadoop:hadoop-aws:3.3.2,io.minio:spark-select_2.11:2.1")
        .config("fs.s3a.endpoint", "http://127.0.0.1:9000")
        .config("fs.s3a.access.key", os.getenv('S3_ACCESS_KEY'))
        .config("fs.s3a.secret.key", os.getenv('S3_SECRET_KEY'))
        .config("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("fs.s3a.path.style.access", "true")
        .config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
        .config("spark.ui.port", "8080")  # Set ui port other than the default one (4040) so that feathr spark job doesn't fail. 
        .getOrCreate()
    )

# Else, you must already have a spark session object available in databricks or synapse notebooks.

bash: /mnt/e/setup/miniconda3/lib/libtinfo.so.6: no version information available (required by bash)


:: loading settings :: url = jar:file:/mnt/e/setup/miniconda3/envs/feathr3.9/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/cuong/.ivy2/cache
The jars for the packages stored in: /home/cuong/.ivy2/jars
org.apache.hadoop#hadoop-common added as a dependency
org.apache.hadoop#hadoop-mapreduce-client-core added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
io.minio#spark-select_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-752fd5bf-6faf-4c0e-b792-1e2c155f8cc9;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-common;3.3.2 in central
	found org.apache.hadoop.thirdparty#hadoop-shaded-protobuf_3_7;1.1.1 in central
	found org.apache.hadoop#hadoop-annotations;3.3.2 in central
	found org.apache.hadoop.thirdparty#hadoop-shaded-guava;1.1.1 in central
	found com.google.guava#guava;27.0-jre in central
	found com.google.guava#failureaccess;1.0 in central
	found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central
	found com.google.code.findbugs#jsr305;3.0.2 in central
	found org.checkerf

24/09/06 16:52:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [20]:
from feathr import get_result_df
path = "s3a://feathrstore/df*/daily/2020/04/20/"
df = get_result_df(spark=spark, client=client, format="parquet", res_url=path)
df.show()

24/09/06 16:52:19 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+-------------------+-------------------+----+
|f_location_avg_fare|f_location_max_fare|key0|
+-------------------+-------------------+----+
|          4270.7866|             7450.0| 227|
|          1941.2174|             5705.0| 131|
|           2707.024|             8000.0| 228|
|          2698.0476|             6620.0| 100|
|          1865.2667|             2988.0| 200|
|             8237.0|             8237.0| 221|
|          2041.9125|             7050.0| 241|
|           1280.625|             4853.0| 207|
|           1339.925|             4204.0| 135|
|             4567.5|             5120.0| 118|
|             5639.6|             6804.0| 201|
|          1686.2034|             6106.0| 242|
|          1714.4845|             5476.0| 177|
|             1988.1|             4850.0| 144|
|          2234.8262|             5563.0| 256|
|          
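
The read path used above follows a `<table>/daily/<yyyy>/<MM>/<dd>/` layout under the sink's output path. To read a different day, the path could be built from a date, as in this sketch. `daily_partition_glob` is a hypothetical helper, and the layout is inferred only from the single path shown in this notebook.

```python
from datetime import date


def daily_partition_glob(base_uri, day, table_glob="df*"):
    """Build a glob for one daily partition, e.g. <base>/df*/daily/2020/04/20/."""
    return f"{base_uri.rstrip('/')}/{table_glob}/daily/{day:%Y/%m/%d}/"


path = daily_partition_glob("s3a://feathrstore", date(2020, 4, 20))
print(path)
```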

## Cleanup

In [21]:
# TODO: Unregister, delete cached files or do any other cleanups.

In [22]:
# Stop the spark session if it is a local session.
if is_jupyter():
    spark.stop()

Scrap Variables for Unit-Test