# Feathr Quick Start Notebook

This notebook illustrates the use of Feathr Feature Store to create a model that predicts NYC Taxi fares. The dataset comes from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

The major problems Feathr solves are:

1. Create, share and manage useful features from raw source data.
2. Provide point-in-time feature joins to create training datasets without data leakage.
3. Deploy the same feature data to the online store to eliminate skew between training and inference data.

## Prerequisite

Feathr has native cloud integration. The first step is to provision the required cloud resources.

Follow the [Feathr ARM deployment guide](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-arm.html) to run Feathr on Azure. This lets you get started quickly with an automated deployment using an Azure Resource Manager template. For more details, please refer to [README.md](https://github.com/feathr-ai/feathr#%EF%B8%8F-running-feathr-on-cloud-with-a-few-simple-steps).

Additionally, to run this notebook, you'll need to install the `feathr` pip package. For local Spark, simply run `pip install feathr` on the machine that runs this notebook. To use Databricks or Azure Synapse Analytics, please see the dependency management documents:
- [Azure Databricks dependency management](https://learn.microsoft.com/en-us/azure/databricks/libraries/)
- [Azure Synapse Analytics dependency management](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries)

## Notebook Steps

This tutorial demonstrates the key capabilities of Feathr, including:

1. Install Feathr and necessary dependencies
2. Create shareable features with Feathr feature definition configs
3. Create training data using point-in-time correct feature join
4. Train a prediction model and evaluate the model and features
5. Register the features to share across teams
6. Materialize feature values for online scoring

The overall data flow is as follows:

<img src="https://raw.githubusercontent.com/feathr-ai/feathr/main/docs/images/feature_flow.png" width="800">

## 1. Install Feathr and Necessary Dependencies

Install Feathr and the necessary packages by running one of the following commands if you haven't installed them already:

In [1]:
# To install feathr from the latest code in the repo:
#%pip install "git+https://github.com/feathr-ai/feathr.git#subdirectory=feathr_project&egg=feathr[notebook]"

# To install the latest release:
#%pip install "feathr[notebook]"

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from datetime import timedelta
import os
from pathlib import Path

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

import feathr
from feathr import (
    FeathrClient,
    # Feature data types
    BOOLEAN, FLOAT, INT32, ValueType,
    # Feature data sources
    INPUT_CONTEXT, HdfsSource,
    # Feature aggregations
    TypedKey, WindowAggTransformation,
    # Feature types and anchor
    DerivedFeature, Feature, FeatureAnchor,
    # Materialization
    BackfillTime, MaterializationSettings, RedisSink,
    # Offline feature computation
    FeatureQuery, ObservationSettings,
)
from feathr.datasets import nyc_taxi
from feathr.spark_provider.feathr_configurations import SparkExecutionConfiguration
from feathr.utils.config import generate_config
from feathr.utils.job_utils import get_result_df
from feathr.utils.platform import is_databricks, is_jupyter

print(f"Feathr version: {feathr.__version__}")

Feathr version: 1.0.0


## 2. Create Shareable Features with Feathr Feature Definition Configs

First, we define all the necessary resource key values for authentication. These values are retrieved from [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/), a cloud key-value store. For authentication, this notebook uses the Azure CLI credential, but instead of running `az login --use-device-code` you may grant the necessary service principal the list and get permissions on secrets.

Please refer to [A note on using azure key vault to store credentials](https://github.com/feathr-ai/feathr/blob/41e7496b38c43af6d7f8f1de842f657b27840f6d/docs/how-to-guides/feathr-configuration-and-env.md#a-note-on-using-azure-key-vault-to-store-credentials) for more details.

In [4]:
RESOURCE_PREFIX = None  # TODO fill the value used to deploy the resources via ARM template
PROJECT_NAME = "nyc_taxi"

# Currently supported: 'azure_synapse', 'databricks', and 'local'
SPARK_CLUSTER = "local"

# TODO fill values to use databricks cluster:
DATABRICKS_CLUSTER_ID = None             # Set Databricks cluster id to use an existing cluster
if is_databricks():
    # If this notebook is running on Databricks, its context can be used to retrieve token and instance URL
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    DATABRICKS_WORKSPACE_TOKEN_VALUE = ctx.apiToken().get()
    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = f"https://{ctx.tags().get('browserHostName').get()}"
else:
    DATABRICKS_WORKSPACE_TOKEN_VALUE = None                  # Set Databricks workspace token to use databricks
    SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL = None  # Set Databricks workspace url to use databricks

# TODO fill values to use Azure Synapse cluster:
AZURE_SYNAPSE_SPARK_POOL = None  # Set Azure Synapse Spark pool name
AZURE_SYNAPSE_URL = None         # Set Azure Synapse workspace url to use Azure Synapse
ADLS_KEY = None                  # Set Azure Data Lake Storage key to use Azure Synapse

# An existing Feathr config file path. If None, we'll generate a new config based on the constants in this cell.
FEATHR_CONFIG_PATH = None

# If set True, use an interactive browser authentication to get the redis password.
USE_CLI_AUTH = False

# If set True, register the features to Feathr registry.
REGISTER_FEATURES = False

# (For the notebook test pipeline) If true, use ScrapBook package to collect the results.
SCRAP_RESULTS = False

To use Databricks as the Feathr client's target platform, you may need to set a Databricks token as an environment variable:

`export DATABRICKS_WORKSPACE_TOKEN_VALUE=your-token`

or in a notebook cell,

`os.environ["DATABRICKS_WORKSPACE_TOKEN_VALUE"] = "your-token"`

If you are running this notebook on Databricks, the token will be automatically retrieved by using the current Databricks notebook context.

To use an Azure Synapse cluster instead, you have to specify the Synapse workspace storage key:

`export ADLS_KEY=your-key`

or in a notebook cell,

`os.environ["ADLS_KEY"] = "your-key"`

In [5]:
if SPARK_CLUSTER == "azure_synapse" and not os.environ.get("ADLS_KEY"):
    os.environ["ADLS_KEY"] = ADLS_KEY
elif SPARK_CLUSTER == "databricks" and not os.environ.get("DATABRICKS_WORKSPACE_TOKEN_VALUE"):
    os.environ["DATABRICKS_WORKSPACE_TOKEN_VALUE"] = DATABRICKS_WORKSPACE_TOKEN_VALUE

In [6]:
# Get an authentication credential to access Azure resources and register features
if USE_CLI_AUTH:
    # Use AZ CLI interactive browser authentication
    !az login --use-device-code
    from azure.identity import AzureCliCredential
    credential = AzureCliCredential(additionally_allowed_tenants=['*'],)
elif "AZURE_TENANT_ID" in os.environ and "AZURE_CLIENT_ID" in os.environ and "AZURE_CLIENT_SECRET" in os.environ:
    # Use Environment variable secret
    from azure.identity import EnvironmentCredential
    credential = EnvironmentCredential()
else:
    # Try to use the default credential
    from azure.identity import DefaultAzureCredential
    credential = DefaultAzureCredential(
        exclude_interactive_browser_credential=False,
        additionally_allowed_tenants=['*'],
    )

### Configurations

Feathr uses a YAML file to define configurations. Please refer to [feathr_config.yaml](https://github.com/feathr-ai/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) for the meaning of each field.

All Feathr configurations can be set in the YAML file via keyword arguments of the `generate_config` helper function. Each keyword argument is the concatenation of the nested config names, using `__` as a separator.
For example, to specify a different value for the feature registry API endpoint, pass `feature_registry__api_endpoint="YOUR-API-ENDPOINT-URL"`.

Note, a default value for the api endpoint will be set based on `RESOURCE_PREFIX`.
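
To make the `__` naming convention concrete, here is a minimal sketch in plain Python. The helper `nest_config` is hypothetical and is not part of Feathr; it only illustrates how flat keyword names could map onto the nested layers of the YAML file, and it does not reproduce what `generate_config` actually does internally.

```python
from typing import Any, Dict


def nest_config(**flat: Any) -> Dict[str, Any]:
    """Turn `a__b__c=value` keyword arguments into {"a": {"b": {"c": value}}}."""
    nested: Dict[str, Any] = {}
    for flat_key, value in flat.items():
        *parents, leaf = flat_key.split("__")
        node = nested
        for part in parents:
            # Create intermediate layers on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


config = nest_config(
    project_config__project_name="nyc_taxi",
    feature_registry__api_endpoint="YOUR-API-ENDPOINT-URL",
    spark_config__spark_cluster="local",
)
print(config["feature_registry"]["api_endpoint"])  # YOUR-API-ENDPOINT-URL
```

Each `__` in a keyword name corresponds to one level of indentation in the YAML file.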

In [7]:
os.environ['JDBC_USER'] = "root"
os.environ['JDBC_PASSWORD'] = "DsteamIC2024"
os.environ['SPARK_LOCAL_IP'] = "127.0.0.1"
os.environ['REDIS_PASSWORD'] = "foobared"  # default password for Redis


PROJECT_NAME = "nyc_taxi"

In [8]:
import glob

# Pick up the Feathr runtime jar from the current directory, assuming there is exactly one jar file.
jar_name = glob.glob("./*.jar")[0]
print(f"Found jar file at {jar_name}")


yaml_config = f"""
api_version: 1
project_config:
  project_name: {PROJECT_NAME}
  
spark_config:
  # choice for spark runtime. Currently support: azure_synapse, databricks, local
  spark_cluster: 'local'
  spark_result_output_parts: '1'
  local:
    master: 'local[*]'
    spark.sql.shuffle.partitions: '10'
    spark.driver.memory: "10g"
    spark.executor.memory: "10g"
    feathr_runtime_location: "{jar_name}"

online_store:
  redis:
    # Redis configs to access Redis cluster
    host: '127.0.0.1'
    port: 6379
    ssl_enabled: False

feature_registry:
  # The API endpoint of the registry service
  api_endpoint: "http://127.0.0.1:8081/api/v1"
"""
feathr_workspace_folder = Path(f"./{PROJECT_NAME}_feathr_config.yaml")
feathr_workspace_folder.parent.mkdir(exist_ok=True, parents=True)
feathr_workspace_folder.write_text(yaml_config)
print(yaml_config)

Found jar file at ./vnpt_feathr-0.0.1.jar

api_version: 1
project_config:
  project_name: nyc_taxi
  
spark_config:
  # choice for spark runtime. Currently support: azure_synapse, databricks, local
  spark_cluster: 'local'
  spark_result_output_parts: '1'
  local:
    master: 'local[*]'
    spark.sql.shuffle.partitions: '10'
    spark.driver.memory: "10g"
    spark.executor.memory: "10g"
    feathr_runtime_location: "./vnpt_feathr-0.0.1.jar"

online_store:
  redis:
    # Redis configs to access Redis cluster
    host: '127.0.0.1'
    port: 6379
    ssl_enabled: False

feature_registry:
  # The API endpoint of the registry service
  api_endpoint: "http://127.0.0.1:8081/api/v1"



All configurations can be overridden by environment variables whose names concatenate the config layers with `__`, the same way keyword arguments are passed to the `generate_config` utility function.

For example, `feathr_runtime_location` in the Databricks config can be overridden by setting the `spark_config__databricks__feathr_runtime_location` environment variable.
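
The precedence described above can be sketched as follows. This is an illustration only: `lookup` is a hypothetical helper, not Feathr's config reader, and checking the upper-case form of the variable name is an assumption of this sketch. Any remote key-value store lookup is left out.

```python
import os
from typing import Any, Mapping, Optional


def lookup(key: str, file_config: Mapping[str, Any], default: Optional[Any] = None) -> Any:
    """Resolve `key`: environment variable first, then the nested config, then the default."""
    # 1. An environment variable wins (the upper-case check is an assumption of this sketch).
    for name in (key, key.upper()):
        if name in os.environ:
            return os.environ[name]
    # 2. Otherwise walk the nested config, one layer per `__`-separated part.
    node: Any = file_config
    for part in key.split("__"):
        if not isinstance(node, Mapping) or part not in node:
            return default
        node = node[part]
    return node


file_config = {"spark_config": {"databricks": {"feathr_runtime_location": "from-file.jar"}}}
key = "spark_config__databricks__feathr_runtime_location"

# Make sure the variable is not already set, then read the value from the file config.
os.environ.pop(key, None)
os.environ.pop(key.upper(), None)
from_file = lookup(key, file_config)

# Setting the environment variable overrides the file value.
os.environ[key] = "from-env.jar"
from_env = lookup(key, file_config)
del os.environ[key]

print(from_file, from_env)
```

The jar names here are placeholders chosen for the example.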

### Initialize Feathr client

In [9]:
from pathlib import Path
feathr_workspace_folder = Path(f"./{PROJECT_NAME}_feathr_config.yaml")
client = FeathrClient(str(feathr_workspace_folder))

2024-09-06 16:11:49.274 | INFO     | feathr.utils._env_config_reader:get:62 - Config secrets__azure_key_vault__name is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:11:49.276 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__s3__s3_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:11:49.276 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__adls__adls_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:11:49.277 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__wasb__wasb_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-09-06 16:11:49

### Prepare the NYC taxi fare dataset

In [10]:
# If the notebook is running on Jupyter, start a Spark session:
if is_jupyter():
    spark = (
        SparkSession
        .builder
        .appName("feathr")
        .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0,io.delta:delta-core_2.12:2.1.1")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.ui.port", "8080")  # Set ui port other than the default one (4040) so that feathr spark job doesn't fail. 
        .getOrCreate()
    )

# Else, you must already have a spark session object available in databricks or synapse notebooks.

bash: /mnt/e/setup/miniconda3/lib/libtinfo.so.6: no version information available (required by bash)
bash: /mnt/e/setup/miniconda3/lib/libtinfo.so.6: no version information available (required by bash)


:: loading settings :: url = jar:file:/mnt/e/setup/miniconda3/envs/feathr3.9/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/cuong/.ivy2/cache
The jars for the packages stored in: /home/cuong/.ivy2/jars
org.apache.spark#spark-avro_2.12 added as a dependency
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3e0688af-41a7-4c21-944f-7ab85f81be61;1.0
	confs: [default]
	found org.apache.spark#spark-avro_2.12;3.3.0 in central
	found org.tukaani#xz;1.8 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found io.delta#delta-core_2.12;2.1.1 in central
	found io.delta#delta-storage;2.1.1 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
:: resolution report :: resolve 138ms :: artifacts dl 3ms
	:: modules in use:
	io.delta#delta-core_2.12;2.1.1 from central in [default]
	io.delta#delta-storage;2.1.1 from central in [default]
	org.antlr#antlr4-runtime;4.8 from central in [default]
	org.apache.spark#spark-avro_2.12;3.3.0 from central in [def

24/09/06 16:11:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/09/06 16:11:51 WARN Utils: Service 'SparkUI' could not bind on port 8080. Attempting port 8081.
24/09/06 16:11:51 WARN Utils: Service 'SparkUI' could not bind on port 8081. Attempting port 8082.


In [11]:
# Use dbfs if the notebook is running on Databricks
if is_databricks():
    WORKING_DIR = f"/dbfs/{PROJECT_NAME}"
else:
    WORKING_DIR = PROJECT_NAME

In [12]:
# Path to the data file
data_file_path = "../../feathr_project/test/test_user_workspace/green_tripdata_2020-04_with_index.csv"


### Defining features with Feathr

In Feathr, a feature is viewed as a function, mapping a key and timestamp to a feature value. For more details, please see [Feathr Feature Definition Guide](https://github.com/feathr-ai/feathr/blob/main/docs/concepts/feature-definition.md).

* The feature key (a.k.a. entity id) identifies the subject of feature, e.g. a user_id or location_id.
* The feature name is the aspect of the entity that the feature is indicating, e.g. the age of the user.
* The feature value is the actual value of that aspect at a particular time, e.g. the value is 30 in the year 2022.

Note that, in some cases, a feature could be just a transformation function that has no entity key or timestamp involved, e.g. *the day of week of the request timestamp*.
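
The "feature as a function of key and timestamp" view can be sketched in plain Python. The history below is hypothetical data made up for illustration, and `feature_value` is not a Feathr API; it simply returns the latest value recorded at or before the requested time, which is the idea behind a point-in-time lookup.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Hypothetical history of one feature: entity key -> time-sorted (timestamp, value) pairs.
history: Dict[int, List[Tuple[datetime, float]]] = {
    41: [(datetime(2020, 4, 1), 9.5), (datetime(2020, 4, 10), 12.0)],
}


def feature_value(key: int, as_of: datetime) -> Optional[float]:
    """Return the latest value recorded at or before `as_of`, or None if there is none."""
    rows = history.get(key, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)
    return rows[idx - 1][1] if idx else None


print(feature_value(41, datetime(2020, 4, 5)))   # 9.5
print(feature_value(41, datetime(2020, 4, 15)))  # 12.0
print(feature_value(41, datetime(2020, 3, 1)))   # None
```

Because the lookup never reads values recorded after `as_of`, a training row built this way cannot see information from its own future.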

There are two types of features, anchored features and derived features:

* **Anchored features**: Features that are directly extracted from sources. They can be defined with or without aggregation.
* **Derived features**: Features that are computed on top of other features.

#### Define anchored features

An anchored feature needs a feature source that describes the raw data from which the feature values are computed. The source should be either `INPUT_CONTEXT` (for features extracted directly from the observation data) or a `feathr.source.Source` object.

In [13]:
TIMESTAMP_COL = "lpep_dropoff_datetime"
TIMESTAMP_FORMAT = "yyyy-MM-dd HH:mm:ss"

In [14]:
# We define f_trip_distance and f_trip_time_duration features separately
# so that we can reuse them later for the derived features.
f_trip_distance = Feature(
    name="f_trip_distance",
    feature_type=FLOAT,
    transform="trip_distance",
)
f_trip_time_duration = Feature(
    name="f_trip_time_duration",
    feature_type=FLOAT,
    transform="cast_float((to_unix_timestamp(lpep_dropoff_datetime) - to_unix_timestamp(lpep_pickup_datetime)) / 60)",
)

features = [
    f_trip_distance,
    f_trip_time_duration,
    Feature(
        name="f_is_long_trip_distance",
        feature_type=BOOLEAN,
        transform="trip_distance > 30.0",
    ),
    Feature(
        name="f_day_of_week",
        feature_type=INT32,
        transform="dayofweek(lpep_dropoff_datetime)",
    ),
    Feature(
        name="f_day_of_month",
        feature_type=INT32,
        transform="dayofmonth(lpep_dropoff_datetime)",
    ),
    Feature(
        name="f_hour_of_day",
        feature_type=INT32,
        transform="hour(lpep_dropoff_datetime)",
    ),
]

# After you have defined features, bring them together to build the anchor to the source.
feature_anchor = FeatureAnchor(
    name="feature_anchor",
    source=INPUT_CONTEXT,  # Pass through source, i.e. observation data.
    features=features,
)
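
To see what these transform expressions evaluate to, the sketch below reproduces them for a single record in plain Python. The record is hypothetical, and the code only mirrors the expressions for illustration; in the notebook itself they are evaluated by Spark.

```python
from datetime import datetime

# Hypothetical trip record using the column names from the feature definitions.
row = {
    "trip_distance": 3.2,
    "lpep_pickup_datetime": "2020-04-01 08:15:00",
    "lpep_dropoff_datetime": "2020-04-01 08:39:30",
}

fmt = "%Y-%m-%d %H:%M:%S"  # Python equivalent of the "yyyy-MM-dd HH:mm:ss" pattern
pickup = datetime.strptime(row["lpep_pickup_datetime"], fmt)
dropoff = datetime.strptime(row["lpep_dropoff_datetime"], fmt)

computed = {
    "f_trip_distance": row["trip_distance"],
    # Seconds between dropoff and pickup, divided by 60 to get minutes.
    "f_trip_time_duration": (dropoff - pickup).total_seconds() / 60,
    "f_is_long_trip_distance": row["trip_distance"] > 30.0,
    # Spark SQL's dayofweek counts from Sunday = 1 to Saturday = 7.
    "f_day_of_week": dropoff.isoweekday() % 7 + 1,
    "f_day_of_month": dropoff.day,
    "f_hour_of_day": dropoff.hour,
}
print(computed)
```

For this record the trip lasts 24.5 minutes and ends on a Wednesday, which `dayofweek` reports as 4.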

We can define the source with a preprocessing python function. In order to make the source data accessible from the target spark cluster, we upload the data file into either DBFS or Azure Blob Storage if needed.

In [15]:
# Upload files to cloud if needed
if client.spark_runtime == "local":
    # In local mode, we can use the same data path as the source.
    data_source_path = data_file_path
elif client.spark_runtime == "databricks" and is_databricks():
    # If the notebook is running on databricks, we can use the same data path as the source.
    data_source_path = data_file_path.replace("/dbfs", "dbfs:", 1)
else:
    # Otherwise, upload the local file to the cloud storage (either dbfs or adls).
    data_source_path = client.feathr_spark_launcher.upload_or_get_cloud_path(data_file_path)    

In [16]:
def preprocessing(df: DataFrame) -> DataFrame:
    import pyspark.sql.functions as F
    df = df.withColumn("fare_amount_cents", (F.col("fare_amount") * 100.0).cast("float"))
    return df

batch_source = HdfsSource(
    name="nycTaxiBatchSource",
    path=data_source_path,
    event_timestamp_column=TIMESTAMP_COL,
    timestamp_format=TIMESTAMP_FORMAT,
    preprocessing=preprocessing,
)

For the features with aggregation, the supported functions are as follows:

| Aggregation Function | Input Type | Description |
| --- | --- | --- |
| SUM, COUNT, MAX, MIN, AVG | Numeric | Applies the numerical operation to the numeric inputs. |
| MAX_POOLING, MIN_POOLING, AVG_POOLING | Numeric Vector | Applies the max/min/avg operation on a per-entry basis for a given collection of numbers. |
| LATEST | Any | Returns the latest non-null value within the defined time window. |
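
As a rough illustration of what a windowed aggregation computes, here is a sketch in plain Python over hypothetical events. The helper `window_agg` is not a Feathr API, and the exact boundary handling of the window (open at the start, closed at the end) is an assumption of this sketch.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical (key, event time, fare in cents) rows.
events: List[Tuple[int, datetime, float]] = [
    (41, datetime(2020, 1, 1), 900.0),   # outside the 90-day window used below
    (41, datetime(2020, 3, 1), 1000.0),
    (41, datetime(2020, 4, 1), 1400.0),
    (42, datetime(2020, 4, 2), 700.0),   # different key
]


def window_agg(key: int, as_of: datetime, window: timedelta) -> Dict[str, float]:
    """Aggregate values for `key` whose event time lies in (as_of - window, as_of]."""
    values = [v for k, ts, v in events if k == key and as_of - window < ts <= as_of]
    if not values:
        return {}
    return {"AVG": sum(values) / len(values), "MAX": max(values), "COUNT": len(values)}


result = window_agg(41, datetime(2020, 4, 10), timedelta(days=90))
print(result)  # {'AVG': 1200.0, 'MAX': 1400.0, 'COUNT': 2}

# The *_POOLING functions work entry by entry on vectors of equal length.
vectors = [[1.0, 5.0], [3.0, 2.0]]
max_pooled = [max(column) for column in zip(*vectors)]
print(max_pooled)  # [3.0, 5.0]
```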

In [17]:
agg_key = TypedKey(
    key_column="DOLocationID",
    key_column_type=ValueType.INT32,
    description="location id in NYC",
    full_name="nyc_taxi.location_id",
)

agg_window = "90d"

# Anchored features with aggregations
agg_features = [
    Feature(
        name="f_location_avg_fare",
        key=agg_key,
        feature_type=FLOAT,
        transform=WindowAggTransformation(
            agg_expr="fare_amount_cents",
            agg_func="AVG",
            window=agg_window,
        ),
    ),
    Feature(
        name="f_location_max_fare",
        key=agg_key,
        feature_type=FLOAT,
        transform=WindowAggTransformation(
            agg_expr="fare_amount_cents",
            agg_func="MAX",
            window=agg_window,
        ),
    ),
]

agg_feature_anchor = FeatureAnchor(
    name="agg_feature_anchor",
    source=batch_source,  # External data source for feature. Typically a data table.
    features=agg_features,
)

#### Define derived features

We also define a derived feature, `f_trip_speed`, from the anchored features `f_trip_distance` and `f_trip_time_duration` as follows:

In [18]:
derived_features = [
    DerivedFeature(
        name="f_trip_speed",
        feature_type=FLOAT,
        input_features=[
            f_trip_distance,
            f_trip_time_duration,
        ],
        transform="f_trip_distance / f_trip_time_duration",
    )
]
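
Because `f_trip_time_duration` is expressed in minutes, the derived value is a distance per minute. The sketch below mirrors the expression in plain Python; `trip_speed` is a hypothetical helper, and returning `None` for a zero duration is a choice made for this illustration rather than a statement about how the expression is evaluated in Spark.

```python
from typing import Optional


def trip_speed(trip_distance: float, trip_time_duration: float) -> Optional[float]:
    """Distance divided by duration in minutes; None when the duration is zero."""
    if trip_time_duration == 0:
        return None
    return trip_distance / trip_time_duration


speed_per_minute = trip_speed(3.2, 24.5)
speed_per_hour = speed_per_minute * 60  # convert to distance units per hour
print(speed_per_minute, speed_per_hour)
```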

### Build features

Finally, we build the features.

In [19]:
client.build_features(
    anchor_list=[feature_anchor, agg_feature_anchor],
    derived_feature_list=derived_features,
)

## Register the Features to Share Across Teams

You can register your features in the centralized registry and share the corresponding project with other team members who want to consume them.

In [20]:
try:
    client.register_features()
except Exception as e:
    print(e)
print(client.list_registered_features(project_name=PROJECT_NAME))

[{'name': 'f_day_of_week', 'id': 'fd9c9f60-05ec-41b7-a0b8-c25fccd90b64', 'qualifiedName': 'nyc_taxi__feature_anchor__f_day_of_week'}, {'name': 'f_is_long_trip_distance', 'id': '420cec54-6906-49c2-b31f-b183fcc06295', 'qualifiedName': 'nyc_taxi__feature_anchor__f_is_long_trip_distance'}, {'name': 'f_trip_speed', 'id': '4f1c3162-dee8-42b2-8cdd-18d9e0fef25b', 'qualifiedName': 'nyc_taxi__f_trip_speed'}, {'name': 'f_trip_distance', 'id': '67a74383-ab2d-4621-894e-51a4c4f79770', 'qualifiedName': 'nyc_taxi__feature_anchor__f_trip_distance'}, {'name': 'f_trip_time_duration', 'id': 'bda71a00-67b9-49f2-8e35-51efc3f484e1', 'qualifiedName': 'nyc_taxi__feature_anchor__f_trip_time_duration'}, {'name': 'f_day_of_month', 'id': 'c9165e5d-3cd1-4bea-8e3d-a4d44b31a7ab', 'qualifiedName': 'nyc_taxi__feature_anchor__f_day_of_month'}, {'name': 'f_hour_of_day', 'id': '6530f88d-16b5-4cd6-972d-b3b64f0ba19d', 'qualifiedName': 'nyc_taxi__feature_anchor__f_hour_of_day'}, {'name': 'f_location_avg_fare', 'id': '3882ab8

In [21]:
client.list_registered_features(project_name=PROJECT_NAME)

[{'name': 'f_day_of_week',
  'id': 'fd9c9f60-05ec-41b7-a0b8-c25fccd90b64',
  'qualifiedName': 'nyc_taxi__feature_anchor__f_day_of_week'},
 {'name': 'f_is_long_trip_distance',
  'id': '420cec54-6906-49c2-b31f-b183fcc06295',
  'qualifiedName': 'nyc_taxi__feature_anchor__f_is_long_trip_distance'},
 {'name': 'f_trip_speed',
  'id': '4f1c3162-dee8-42b2-8cdd-18d9e0fef25b',
  'qualifiedName': 'nyc_taxi__f_trip_speed'},
 {'name': 'f_trip_distance',
  'id': '67a74383-ab2d-4621-894e-51a4c4f79770',
  'qualifiedName': 'nyc_taxi__feature_anchor__f_trip_distance'},
 {'name': 'f_trip_time_duration',
  'id': 'bda71a00-67b9-49f2-8e35-51efc3f484e1',
  'qualifiedName': 'nyc_taxi__feature_anchor__f_trip_time_duration'},
 {'name': 'f_day_of_month',
  'id': 'c9165e5d-3cd1-4bea-8e3d-a4d44b31a7ab',
  'qualifiedName': 'nyc_taxi__feature_anchor__f_day_of_month'},
 {'name': 'f_hour_of_day',
  'id': '6530f88d-16b5-4cd6-972d-b3b64f0ba19d',
  'qualifiedName': 'nyc_taxi__feature_anchor__f_hour_of_day'},
 {'name': 'f

In [22]:
feature_dict = client.get_features_from_registry(project_name=PROJECT_NAME, return_keys=True, verbose=True)

2024-09-06 16:11:53.399 | INFO     | feathr.client:get_features_from_registry:1147 - Get anchor features from registry: 
2024-09-06 16:11:53.400 | INFO     | feathr.client:get_features_from_registry:1153 - {
  "name": "f_trip_distance",
  "featureType": {
    "type": "TENSOR",
    "tensorCategory": "DENSE",
    "dimensionType": [],
    "valType": "FLOAT"
  },
  "key": [
    {
      "keyColumn": "NOT_NEEDED",
      "keyColumnType": "UNSPECIFIED",
      "fullName": "feathr.dummy_typedkey",
      "description": "feathr.dummy_typedkey",
      "keyColumnAlias": "NOT_NEEDED"
    }
  ],
  "transformation": {
    "transformExpr": "trip_distance"
  }
}
2024-09-06 16:11:53.400 | INFO     | feathr.client:get_features_from_registry:1153 - {
  "name": "f_trip_time_duration",
  "featureType": {
    "type": "TENSOR",
    "tensorCategory": "DENSE",
    "dimensionType": [],
    "valType": "FLOAT"
  },
  "key": [
    {
      "keyColumn": "NOT_NEEDED",
      "keyColumnType": "UNSPECIFIED",
      "fullNam

### List all features

In [23]:
[feat.name for feat in list(feature_dict[0].values())]

['f_trip_distance',
 'f_trip_time_duration',
 'f_is_long_trip_distance',
 'f_day_of_week',
 'f_day_of_month',
 'f_hour_of_day',
 'f_location_avg_fare',
 'f_location_max_fare',
 'f_trip_speed']

### List all typed keys

In [24]:
[type_key.key_column for type_keys in list(feature_dict[1].values()) for type_key in type_keys]

['NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'NOT_NEEDED',
 'DOLocationID',
 'DOLocationID',
 'NOT_NEEDED']
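
The two listings can be combined to see which features are keyed. The sketch below copies the values from the outputs above and assumes that both lists are in the same order, which matches the outputs shown here but is not guaranteed by anything in this notebook.

```python
# Values copied from the two cell outputs above.
feature_names = [
    "f_trip_distance", "f_trip_time_duration", "f_is_long_trip_distance",
    "f_day_of_week", "f_day_of_month", "f_hour_of_day",
    "f_location_avg_fare", "f_location_max_fare", "f_trip_speed",
]
key_columns = [
    "NOT_NEEDED", "NOT_NEEDED", "NOT_NEEDED",
    "NOT_NEEDED", "NOT_NEEDED", "NOT_NEEDED",
    "DOLocationID", "DOLocationID", "NOT_NEEDED",
]

# Keep only the features whose key column is a real column name.
keyed = {
    name: column
    for name, column in zip(feature_names, key_columns)
    if column != "NOT_NEEDED"
}
print(keyed)  # {'f_location_avg_fare': 'DOLocationID', 'f_location_max_fare': 'DOLocationID'}
```

Only the two aggregation features carry an entity key; the others are computed from the observation row itself.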

In [25]:
# Stop the spark session if it is a local session.
if is_jupyter():
    spark.stop()