## Feathr Feature Store For Customer360 on Azure - Demo Notebook

This notebook illustrates how to use the Feathr Feature Store to build one of the Customer 360 use cases: predicting the sales amount from the discount offered. It includes the following steps:
 
1. Install and set up Feathr with Azure.
2. Create shareable features with Feathr feature definition configs.
3. Create a training dataset via point-in-time feature join.
4. Compute and write features.
5. Train a model using these features to predict the sales amount.
6. Materialize feature values to the online store.
7. Fetch feature values in real time from the online store for online scoring.


The feature flow is as follows:
![Feature Engineering](./Feature_engineering_c360.jpg)

#### Prerequisite: Provision cloud resources

The first step is to provision the cloud resources that Feathr requires. Feathr provides a Python-based client to interact with cloud resources.

Please follow the steps [here](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-arm.html) to provision the required cloud resources. Because cloud environments vary widely, it is almost impossible to create a single script that works for every use case. Instead, [azure_resource_provision.sh](https://github.com/feathr-ai/feathr/blob/main/docs/how-to-guides/azure_resource_provision.sh) is a full end-to-end command-line script that creates all the required resources, which you can tailor as needed, and [the companion documentation](https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-cli.html) is a complete guide to using that shell script.


The architecture is shown below:

![Architecture](https://github.com/feathr-ai/feathr/blob/main/docs/images/architecture.png?raw=true)

#### Sample Dataset

In this demo, we use the Feathr Feature Store to showcase Customer360 features. The dataset can be mounted onto an Azure Blob Storage account and viewed by executing the following commands. The dataset is present in the current directory and is sourced from [here](https://community.tableau.com/s/question/0D54T00000CWeX8SAL/sample-superstore-sales-excelxls).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:

key = "blobstorekey"
acnt = "studiofeathrazuredevsto"
container = "studio-feathrazure-dev-fs"
mntpnt = "/mnt/studio-feathrazure-dev-fs"

def mountStorageContainer(storageAccount, storageAccountKey, storageContainer, blobMountPoint):
    try:
        print("Mounting {0} to {1}:".format(storageContainer, blobMountPoint))
        dbutils.fs.unmount(blobMountPoint)
        
    except Exception as e:
        print("....Container is not mounted; Attempting mounting now..")
        
    mountStatus = dbutils.fs.mount(source = "wasbs://{0}@{1}.blob.core.windows.net/".format(storageContainer, storageAccount),
                  mount_point = blobMountPoint,
                  extra_configs = {"fs.azure.account.key.{0}.blob.core.windows.net".format(storageAccount): storageAccountKey})
    
    print("....Status of mount is: " + str(mountStatus))
    print()

    
# mountStorageContainer(acnt,key,container,mntpnt)


In [3]:
from feathr.utils.platform import is_databricks, is_jupyter
from pyspark.sql import DataFrame, SparkSession



In [4]:
#!pip install fastparquet

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv("/home/jovyan/work/customer360.csv")
df["sales_order_dt"] = pd.to_datetime(df["sales_order_dt"], dayfirst=True).dt.strftime('%Y-%m-%d')
df["insert_dt"] = pd.to_datetime(df["insert_dt"], dayfirst=True).dt.strftime('%Y-%m-%d')
df["last_modified_dt"] = pd.to_datetime(df["last_modified_dt"], dayfirst=True).dt.strftime('%Y-%m-%d')
df["sales_launch_date"] = pd.to_datetime(df["sales_launch_date"], dayfirst=True).dt.strftime('%Y-%m-%d')

df.to_parquet("customer360.parquet")

In [7]:
pd.read_parquet("customer360.parquet")

Unnamed: 0,cms_txn_sk,sales_cust_id,sales_tran_id,sales_order_id,sales_item_quantity,sales_order_dt,cms_store_sk,sales_store_id,sales_store_name,sales_channel,...,sales_launch_date,premium_prd,ship_mode,payment_preference,insert_dt,last_modified_dt,insert_by,last_modified_by,job_id,batch_id
0,1,JF-15295,txn4679,CA-2015-114510,3,2020-01-27,,0,,Delivery,...,2019-11-23,N,First Class,Debit card,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
1,2,GB-14575,txn4784,US-2018-147984,3,2020-01-18,,0,,Delivery,...,2019-11-16,N,Same Day,Digital Mode,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
2,3,MC-17275,txn4850,CA-2015-107818,5,2020-01-13,,0,,Delivery,...,2019-11-16,N,First Class,Debit card,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
3,4,AS-10240,txn6215,CA-2015-128538,2,2019-12-14,,0,,Delivery,...,2019-10-01,N,First Class,Cash,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
4,5,FH-14275,txn7572,CA-2015-100916,2,2019-11-16,,0,,Online,...,2019-11-03,N,First Class,Digital Mode,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9187,9188,SM-20320,txn9188,US-2016-130512,9,2019-10-01,179.0,Store_284,Store284,Curbside,...,2019-10-01,N,Standard Class,Credit Card,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
9188,9189,SM-20320,txn9189,US-2016-130512,1,2019-10-01,179.0,Store_284,Store284,Curbside,...,2019-10-01,N,Second Class,Cash,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
9189,9190,SM-20320,txn9190,US-2016-130512,7,2019-10-01,179.0,Store_284,Store284,Curbside,...,2019-10-01,Y,Standard Class,Credit Card,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd
9190,9191,SM-20320,txn9191,US-2016-130512,10,2019-10-01,179.0,Store_284,Store284,Curbside,...,2019-10-01,N,Standard Class,Credit Card,2022-03-16,2022-03-16,Manual_Trigger,Manual_Trigger,1234,15678-abcd


In [8]:
data_file_path = "/home/jovyan/work/customer360.parquet"

In [9]:
import feathr
import os
from feathr import (
    FeathrClient,
    # Feature data types
    BOOLEAN, FLOAT, INT32, ValueType, STRING,
    # Feature data sources
    INPUT_CONTEXT, HdfsSource,
    # Feature aggregations
    TypedKey, WindowAggTransformation,
    # Feature types and anchor
    DerivedFeature, Feature, FeatureAnchor,
    # Materialization
    BackfillTime, MaterializationSettings, RedisSink,
    # Offline feature computation
    FeatureQuery, ObservationSettings,
)
from feathr.datasets import nyc_taxi
from feathr.spark_provider.feathr_configurations import SparkExecutionConfiguration
from feathr.utils.config import generate_config
from feathr.utils.job_utils import get_result_df
from feathr.utils.platform import is_databricks, is_jupyter

print(f"Feathr version: {feathr.__version__}")

Feathr version: 1.0.0


In [10]:
print(feathr.__version__)
os.environ['SPARK_LOCAL_IP'] = "127.0.0.1"
os.environ['REDIS_PASSWORD'] = "foobared"  # default password for Redis


# Make sure we get the Feathr jar name, assuming we just have one jar file.
from pathlib import Path
import glob
jar_name = glob.glob("./*.jar")[0]
print(f"Found jar file at {jar_name}")

PROJECT_NAME = "customer360"
yaml_config = f"""
api_version: 1
project_config:
  project_name: {PROJECT_NAME}
  
spark_config:
  # choice for spark runtime. Currently support: azure_synapse, databricks, local
  spark_cluster: 'local'
  spark_result_output_parts: '1'
  local:
    master: 'local[*]'
    spark.sql.shuffle.partitions: '10'
    spark.driver.memory: "10g"
    spark.executor.memory: "10g"
    feathr_runtime_location: "{jar_name}"

online_store:
  redis:
    # Redis configs to access Redis cluster
    host: '127.0.0.1'
    port: 6379
    ssl_enabled: False

feature_registry:
  # The API endpoint of the registry service
  api_endpoint: "http://127.0.0.1:8000/api/v1"
"""
feathr_workspace_folder = Path("./360_feathr_config.yaml")
feathr_workspace_folder.parent.mkdir(exist_ok=True, parents=True)
feathr_workspace_folder.write_text(yaml_config)
print(yaml_config)

1.0.0
Found jar file at ./feathr_2.12-1.0.5-rc4.jar

api_version: 1
project_config:
  project_name: customer360
  
spark_config:
  # choice for spark runtime. Currently support: azure_synapse, databricks, local
  spark_cluster: 'local'
  spark_result_output_parts: '1'
  local:
    master: 'local[*]'
    spark.sql.shuffle.partitions: '10'
    spark.driver.memory: "10g"
    spark.executor.memory: "10g"
    feathr_runtime_location: "./feathr_2.12-1.0.5-rc4.jar"

online_store:
  redis:
    # Redis configs to access Redis cluster
    host: '127.0.0.1'
    port: 6379
    ssl_enabled: False

feature_registry:
  # The API endpoint of the registry service
  api_endpoint: "http://127.0.0.1:8000/api/v1"



#### Prerequisite: Install Feathr

Install Feathr using pip, for example by running `pip install feathr` in the notebook environment.

#### Prerequisite: Configure the required environment

In the first step (Provision cloud resources), you should have provisioned all the required cloud resources. If you used the Feathr CLI to create a workspace, you should have a folder containing a file called `feathr_config.yaml` with all the required configurations. Otherwise, update the configuration shown earlier in this notebook.

The configuration cell earlier in this notebook writes that configuration string to `./360_feathr_config.yaml`, from which it is loaded into Feathr. Please still refer to [feathr_config.yaml](https://github.com/feathr-ai/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) as the source of truth; it also explains the meaning of each variable.

#### Setup necessary environment variables

You have to set up the environment variables in order to run this sample. More environment variables can be set by referring to [feathr_config.yaml](https://github.com/feathr-ai/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml), which is the source of truth and also explains the meaning of each variable.

#### Initialize a Feathr client

In [11]:
client = FeathrClient(str(feathr_workspace_folder))

2024-08-06 07:26:55.923 | INFO     | feathr.utils._env_config_reader:get:62 - Config secrets__azure_key_vault__name is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:26:55.925 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__s3__s3_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:26:55.926 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__adls__adls_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:26:55.926 | INFO     | feathr.utils._env_config_reader:get:62 - Config offline_store__wasb__wasb_enabled is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:26:55

In [12]:
# If the notebook is running on Jupyter, start a Spark session:
if is_jupyter():
    spark = (
        SparkSession
        .builder
        .appName("feathr")
        .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.0,io.delta:delta-core_2.12:2.1.1")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.ui.port", "8080")  # Set ui port other than the default one (4040) so that feathr spark job doesn't fail. 
        .getOrCreate()
    )

# Else, you must already have a spark session object available in databricks or synapse notebooks.

#### Define Sources Section
A feature source is needed for anchored features; it describes the raw data from which the feature values are computed. See the Python documentation for details on each input column.

In [13]:
data_source_path = data_file_path 
TIMESTAMP_FORMAT = "yyyy-MM-dd"
TIMESTAMP_COL = "sales_order_dt"

In [14]:
# def preprocessing(df: DataFrame) -> DataFrame:
#     import pyspark.sql.functions as F
#     df = df.filter(F.col("ship_mode")=="First Class")
#     return df

batch_source = HdfsSource(name="cosmos_final_data",
                          path=data_source_path,
                          event_timestamp_column=TIMESTAMP_COL,
                          # preprocessing=preprocessing,
                          timestamp_format=TIMESTAMP_FORMAT)

#### Defining Features with Feathr:
In Feathr, a feature is viewed as a function, mapping from entity id or key, and timestamp to a feature value.
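
To make the "feature as a function" view concrete, here is a minimal plain-Python sketch. It does not use the Feathr API; the table contents, the customer IDs, and the function name `f_avg_item_quantity` are made-up illustrations.

```python
from datetime import date
from typing import Dict, Optional, Tuple

# Hypothetical in-memory feature table: (customer_id, as_of_date) -> value.
# The keys and values below are made-up illustration data.
FEATURE_TABLE: Dict[Tuple[str, date], float] = {
    ("C-001", date(2020, 1, 1)): 3.0,
    ("C-001", date(2020, 1, 2)): 5.0,
    ("C-002", date(2020, 1, 1)): 7.5,
}


def f_avg_item_quantity(customer_id: str, as_of: date) -> Optional[float]:
    """Map an entity key and a timestamp to a feature value (None if unknown)."""
    return FEATURE_TABLE.get((customer_id, as_of))


print(f_avg_item_quantity("C-001", date(2020, 1, 2)))  # 5.0
print(f_avg_item_quantity("C-003", date(2020, 1, 2)))  # None
```

The same key can yield different values at different timestamps, which is why both the key and the timestamp are part of the function's input.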

#### Define Anchors and Features
A feature is called an anchored feature when the feature is directly extracted from the source data, rather than computed on top of other features.

In [15]:
f_sales_cust_id = Feature(name = "f_sales_cust_id",
                          feature_type = STRING, transform = "sales_cust_id" )

f_sales_tran_id = Feature(name = "f_sales_tran_id",
                          feature_type = STRING, transform = "sales_tran_id" )

f_sales_order_id = Feature(name = "f_sales_order_id",
                           feature_type = STRING, transform = "sales_order_id" )

f_sales_item_quantity = Feature(name = "f_sales_item_quantity", 
                                feature_type = INT32, transform = "cast_float(sales_item_quantity)" )

f_sales_order_dt = Feature(name = "f_sales_order_dt",
                           feature_type = STRING, transform = "sales_order_dt" )

f_sales_sell_price = Feature(name = "f_sales_sell_price",
                             feature_type = INT32, transform = "cast_float(sales_sell_price)" )

f_sales_discount_amt = Feature(name = "f_sales_discount_amt",
                               feature_type = INT32, transform = "cast_float(sales_discount_amt)" )

f_payment_preference = Feature(name = "f_payment_preference",
                               feature_type = STRING, transform = "payment_preference" )


features = [f_sales_cust_id, f_sales_tran_id, f_sales_order_id, f_sales_item_quantity, 
            f_sales_order_dt, f_sales_sell_price, f_sales_discount_amt, f_payment_preference]

request_anchor = FeatureAnchor(name="request_features",
                                source=INPUT_CONTEXT,
                                features=features)

#### Define Derived Features
Derived features are the features that are computed from other features. They could be computed from anchored features, or other derived features.
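
As an intuition aid, the arithmetic behind this notebook's three derived features can be sketched in plain Python. The helper `derive_features` and the input numbers are hypothetical; only the feature names and expressions mirror the definitions used in this notebook.

```python
from typing import Dict


def derive_features(quantity: float, sell_price: float, discount_amt: float) -> Dict[str, float]:
    """Compute derived values from already-extracted (anchored) feature values."""
    return {
        # f_sales_item_quantity * f_sales_sell_price
        "f_total_sales_amount": quantity * sell_price,
        # f_sales_item_quantity * f_sales_discount_amt
        "f_total_sales_discount": quantity * discount_amt,
        # f_sales_sell_price - f_sales_discount_amt
        "f_total_amount_paid": sell_price - discount_amt,
    }


# Made-up input values for a single order line.
row = derive_features(quantity=3.0, sell_price=20.0, discount_amt=4.0)
print(row)
```

The derived values depend only on other feature values, not on the raw source columns, which is what distinguishes them from anchored features.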

In [16]:
f_total_sales_amount = DerivedFeature(name = "f_total_sales_amount",
                                   feature_type = FLOAT,
                                   input_features = [f_sales_item_quantity,f_sales_sell_price],
                                   transform = "f_sales_item_quantity * f_sales_sell_price")

f_total_sales_discount= DerivedFeature(name = "f_total_sales_discount",
                                   feature_type = FLOAT,
                                   input_features = [f_sales_item_quantity,f_sales_discount_amt],
                                   transform = "f_sales_item_quantity * f_sales_discount_amt")


f_total_amount_paid= DerivedFeature(name = "f_total_amount_paid",
                                   feature_type = FLOAT,
                                   input_features = [f_sales_sell_price,f_sales_discount_amt],
                                   transform ="f_sales_sell_price - f_sales_discount_amt")

#### Define Aggregate features and anchor the features to batch source.

Note that if the features are computed from the observation data itself, the anchor's source should be `INPUT_CONTEXT`. The aggregate features in this section are instead anchored to the batch source.
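
The aggregate features in this section average a column per customer over a one-day window. The following plain-Python sketch illustrates that idea only; it is not Feathr's implementation, the events are made up, and the window boundary rule (`as_of - window < time <= as_of`) is an assumption chosen for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

# Made-up events: (customer_id, event_time, sell_price)
EVENTS = [
    ("C-001", datetime(2020, 1, 1, 9, 0), 10.0),
    ("C-001", datetime(2020, 1, 1, 18, 0), 30.0),
    ("C-001", datetime(2019, 12, 30, 12, 0), 100.0),  # older than one day
    ("C-002", datetime(2020, 1, 1, 12, 0), 50.0),
]


def windowed_avg(events, customer_id, as_of, window=timedelta(days=1)):
    """Average the values of one customer's events inside the trailing window."""
    values = [
        value
        for key, ts, value in events
        if key == customer_id and as_of - window < ts <= as_of
    ]
    return mean(values) if values else None


as_of = datetime(2020, 1, 2, 0, 0)
print(windowed_avg(EVENTS, "C-001", as_of))  # 20.0
print(windowed_avg(EVENTS, "C-003", as_of))  # None
```

Events outside the window are ignored, so the same customer can have a different average depending on the timestamp at which the feature is evaluated.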

In [17]:
# sales_cust_id holds string identifiers such as "JF-15295", so the key type is STRING
customer_ID = TypedKey(key_column="sales_cust_id",
                       key_column_type=ValueType.STRING,
                       description="customer ID",
                       full_name="cosmos.sales_cust_id")

agg_features = [Feature(name="f_avg_customer_sales_amount",
                        key=customer_ID,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(sales_sell_price)",
                                                          agg_func="AVG",
                                                          window="1d")),
               
               Feature(name="f_avg_customer_discount_amount",
                        key=customer_ID,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(sales_discount_amt)",
                                                          agg_func="AVG",
                                                          window="1d")),
               
              Feature(name="f_avg_item_ordered_by_customer",
                        key=customer_ID,
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(agg_expr="cast_float(sales_item_quantity)",
                                                          agg_func="AVG",
                                                          window="1d"))]

agg_anchor = FeatureAnchor(name="aggregationFeatures",
                           source=batch_source,
                           features=agg_features)

#### Building Features
Next, we need to build those features so that they can be consumed later. Note that we have to build both the anchored features and the derived features (which are not anchored to a source).

In [18]:
client.build_features(anchor_list=[request_anchor,agg_anchor],
                      derived_feature_list=[f_total_sales_amount, f_total_sales_discount,f_total_amount_paid])

#### Registering Features
We can also register the features with an Apache Atlas compatible service, such as Azure Purview, and share the registered features across teams:

In [19]:
#client.register_features()

In [20]:
#client.list_registered_features(project_name="customer360")

#### Create training data using point-in-time correct feature join
A training dataset usually contains entity id columns, multiple feature columns, event timestamp column and label/target column.

To create a training dataset using Feathr, one needs to specify which features to join and how they should be joined to the observation data. In this notebook this is done with a `FeatureQuery` and an `ObservationSettings` object.
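
The idea behind a point-in-time correct join can be sketched in plain Python: each observation row receives the latest feature value that was already known at the observation's timestamp, never a later one. This is a simplified illustration, not Feathr's join logic; the data and the helper `point_in_time_lookup` are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Made-up feature history per customer, sorted by date: (effective_date, value)
FEATURE_HISTORY = {
    "C-001": [
        (date(2020, 1, 1), 2.0),
        (date(2020, 1, 10), 4.0),
        (date(2020, 1, 20), 6.0),
    ],
}

# Made-up observation rows: (customer_id, event_date, label)
OBSERVATIONS = [
    ("C-001", date(2020, 1, 5), 100.0),
    ("C-001", date(2020, 1, 15), 150.0),
    ("C-001", date(2019, 12, 31), 80.0),  # earlier than any feature value
]


def point_in_time_lookup(history, event_date):
    """Return the latest value whose effective date is <= event_date, else None."""
    dates = [effective_date for effective_date, _ in history]
    idx = bisect_right(dates, event_date)
    return history[idx - 1][1] if idx > 0 else None


training_rows = [
    (cust, ts, point_in_time_lookup(FEATURE_HISTORY.get(cust, []), ts), label)
    for cust, ts, label in OBSERVATIONS
]
for row in training_rows:
    print(row)
```

Using only values known at or before each event date avoids leaking future information into the training data.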

In [21]:
feature_names = [
                  "f_avg_item_ordered_by_customer", 
                  "f_avg_customer_discount_amount", 
                  "f_avg_customer_sales_amount", 
                  "f_total_sales_discount",
                  "f_total_sales_amount"
                 ]

In [22]:
feature_query = FeatureQuery(
    feature_list=feature_names, key=customer_ID)
settings = ObservationSettings(
    observation_path=data_source_path,
    event_timestamp_column=TIMESTAMP_COL,
    timestamp_format=TIMESTAMP_FORMAT)

#### Materialize feature value into offline storage
While Feathr can compute the feature value from the feature definition on-the-fly at request time, it can also pre-compute and materialize the feature value to offline and/or online storage.
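
The difference between computing on the fly and materializing can be sketched in plain Python. This is a conceptual illustration only: the rows are made up, and the dictionary `store` merely stands in for an offline or online store.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Made-up raw rows: (customer_id, item_quantity)
RAW_ROWS: List[Tuple[str, float]] = [("C-001", 2.0), ("C-001", 4.0), ("C-002", 9.0)]


def compute_avg_quantity(rows, customer_id) -> Optional[float]:
    """On-the-fly path: scan the raw rows every time a value is requested."""
    values = [quantity for key, quantity in rows if key == customer_id]
    return mean(values) if values else None


def materialize(rows) -> Dict[str, float]:
    """Materialized path: compute every key once and keep the results for lookup."""
    keys = {key for key, _ in rows}
    return {key: compute_avg_quantity(rows, key) for key in keys}


store = materialize(RAW_ROWS)  # stands in for an offline or online store
print(store["C-001"])  # 3.0
```

After materialization, serving a value is a simple key lookup, at the cost of the stored values becoming stale until the next materialization run.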

In [23]:
!mkdir -p {PROJECT_NAME}


In [24]:
DATA_FORMAT="parquet"

In [25]:
client.get_offline_features(
    observation_settings=settings,
    feature_query=feature_query,
    # For more details, see https://feathr-ai.github.io/feathr/how-to-guides/feathr-job-configuration.html
    execution_configurations=SparkExecutionConfiguration({
        "spark.feathr.outputFormat": DATA_FORMAT,
    }),
    output_path=PROJECT_NAME + f"/features.{DATA_FORMAT}",
)

client.wait_job_to_finish(timeout_sec=5000)

2024-08-06 07:26:59.615 | ERROR    | feathr.utils.job_utils:get_result_df:170 - Failed to load result files from /home/jovyan/work/customer360.parquet with format csv.
Feathr is unable to read the Observation data from /home/jovyan/work/customer360.parquet due to permission issue or invalid path. Please either grant the permission or supply the observation column names in the filed: observation_column_names.
2024-08-06 07:26:59.707 | INFO     | feathr.spark_provider._localspark_submission:_get_debug_file_name:288 - Spark log path is debug/customer360_feathr_feature_join_job20240806072659
2024-08-06 07:26:59.708 | INFO     | feathr.spark_provider._localspark_submission:_init_args:263 - Spark job: customer360_feathr_feature_join_job is running on local spark with master: local[*].
2024-08-06 07:26:59.719 | INFO     | feathr.spark_provider._localspark_submission:submit_feathr_job:143 - Detail job stdout and stderr are in debug/customer360_feathr_feature_join_job20240806072659/log.
2024-08

2024-08-06 07:27:30.749 | INFO     | feathr.spark_provider._localspark_submission:wait_for_completion:229 - Spark job with pid 36397 finished in: 31 seconds                     with returncode 0


#### Reading training data from offline storage

In [26]:
# Show feature results
df = get_result_df(
    spark=spark,
    client=client,
    data_format=DATA_FORMAT,
)
df.select(feature_names).show(5)

2024-08-06 07:27:30.771 | INFO     | feathr.spark_provider._localspark_submission:wait_for_completion:163 - 1 local spark job(s) in this Launcher, only the latest will be monitored.
2024-08-06 07:27:30.772 | INFO     | feathr.spark_provider._localspark_submission:wait_for_completion:164 - Please check auto generated spark command in debug/customer360_feathr_feature_join_job20240806072659/command.sh and detail logs in debug/customer360_feathr_feature_join_job20240806072659/log.
2024-08-06 07:27:30.772 | INFO     | feathr.spark_provider._localspark_submission:wait_for_completion:229 - Spark job with pid 36397 finished in: 0 seconds                     with returncode 0


+------------------------------+------------------------------+---------------------------+----------------------+--------------------+
|f_avg_item_ordered_by_customer|f_avg_customer_discount_amount|f_avg_customer_sales_amount|f_total_sales_discount|f_total_sales_amount|
+------------------------------+------------------------------+---------------------------+----------------------+--------------------+
|                           6.0|                     18.429613|                   80.64266|                  48.0|               312.0|
|                     3.3333333|                      2.501458|                  37.373333|                   0.0|                 8.0|
|                           6.0|                     14.753651|                   198.6435|                   2.0|                56.0|
|                           5.0|                     2.4403865|                      29.16|                  10.0|               145.0|
|                           8.5|                

#### Train an ML model

After getting all the features, let's train a machine learning model with the features produced by Feathr:

In [27]:
pandas_df = df.toPandas()

In [28]:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import r2_score

In [29]:
X = pandas_df['f_total_sales_discount']
y = pandas_df['f_total_sales_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

# Add a constant to get an intercept
X_train_sm = sm.add_constant(X_train)

# Fit the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()

# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)

# Predict the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)

# Check the R-squared on the test set
r_squared = r2_score(y_test, y_pred)

# Note: 1 - R-squared is the share of variance left unexplained, not a MAPE
print("1 - R-squared (unexplained variance):")
print(1 - r_squared)
print()
print("Model R-squared:")
print(r_squared)

1 - R-squared (unexplained variance):
0.2845262685819667

Model R-squared:
0.7154737314180333
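
R-squared measures the share of variance explained by the model, while mean absolute percentage error (MAPE) measures the average relative error, so `1 - r_squared` is not a MAPE. A minimal sketch of both metrics in plain Python, on made-up numbers (the helper names `r_squared_score` and `mape_score` are hypothetical):

```python
def r_squared_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mape_score(y_true, y_pred):
    """Mean absolute percentage error as a fraction (assumes no zero targets)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions for illustration.
y_true_demo = [100.0, 200.0, 300.0]
y_pred_demo = [110.0, 190.0, 300.0]

r2_demo = r_squared_score(y_true_demo, y_pred_demo)
mape_demo = mape_score(y_true_demo, y_pred_demo)
print(f"R-squared: {r2_demo:.4f}")
print(f"1 - R-squared: {1 - r2_demo:.4f}")
print(f"MAPE: {mape_demo:.4f}")
```

On this toy data the two quantities differ, which shows why they should be reported under separate labels.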


#### Materialize feature value into online storage
We can push the generated features to the online store as shown below:

In [30]:
!mkdir -p materialize_hdfs


In [31]:
FEATURE_TABLE_NAME = "Customer360"

from datetime import datetime, timedelta
from feathr import HdfsSink

# Time range to materialize
backfill_time = BackfillTime(start=datetime(2019, 12, 31), end=datetime(2020, 1, 2), step=timedelta(days=1))

# Destinations:
# Online store (Redis)
redis_sink = RedisSink(table_name=FEATURE_TABLE_NAME)
# Offline store (local files under materialize_hdfs)
hdfs_sink = HdfsSink(output_path="materialize_hdfs", store_name="df0")
# Alternative offline store:
# adls_sink = HdfsSink(output_path=)

settings = MaterializationSettings(
    name=FEATURE_TABLE_NAME + ".job",  # job name
    backfill_time=backfill_time,
    sinks=[ redis_sink, hdfs_sink],  # or adls_sink
    feature_names=["f_avg_item_ordered_by_customer","f_avg_customer_discount_amount"],
)
# 
client.materialize_features(
    settings=settings,
    execution_configurations={"spark.feathr.outputFormat": "parquet", 
                              "spark.feathr.hdfs.local.enable":"true",
                              "spark.sql.shuffle.partitions": '1'},
)

client.wait_job_to_finish(timeout_sec=20)

2024-08-06 07:27:35.149 | INFO     | feathr.utils._env_config_reader:get:62 - Config monitoring__database__sql__url is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:27:35.150 | INFO     | feathr.utils._env_config_reader:get:62 - Config monitoring__database__sql__user is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:27:35.151 | INFO     | feathr.spark_provider._localspark_submission:_get_debug_file_name:288 - Spark log path is debug/customer360_feathr_feature_materialization_job20240806072735
2024-08-06 07:27:35.152 | INFO     | feathr.spark_provider._localspark_submission:_init_args:263 - Spark job: customer360_feathr_feature_materialization_job is running on local spark with master: local[*].
2024-08-06 07:27:35.164 | INFO     | feathr.spark_provider._localspark_submission:submit_feathr_job:143 - Detai

2024-08-06 07:28:06.318 | ERROR    | feathr.spark_provider._localspark_submission:wait_for_completion:207 - Source location is: SimplePath(path=/home/jovyan/work/customer360.parquet)
generateFeaturesAsDF: 1. call analyze features, e.g. group features
generateFeaturesAsDF: 2. Get AnchorDFMap for Anchored features
generateFeaturesAsDF: 3. Load user specified default values and feature types, if any.
generateFeaturesAsDF: 4. Calculate anchored features
NonTimeBasedDataSourceAccessor loading source SimplePath(path=/home/jovyan/work/customer360.parquet)
generateFeaturesAsDF: 5. Group features based on grouping specified in output processors
generateFeaturesAsDF: 6. Substitute defaults at this stage since all anchored features are generated and grouped together. Substitute before generating derived features.
generateFeaturesAsDF: 7. Calculate derived features.
generateFeaturesAsDF: 8. Prune feature columns before handing it off to output processors. As part of the pruning columns are renamed

In [32]:
from feathr import get_result_df
path = "materialize_hdfs/df0*/daily/2020/01/*"
# Use data_format, as in the earlier get_result_df call in this notebook
df = get_result_df(spark=spark, client=client, data_format="parquet", res_url=path)
df.show()

+------------------------------+------------------------------+--------+
|f_avg_customer_discount_amount|f_avg_item_ordered_by_customer|    key0|
+------------------------------+------------------------------+--------+
|                     0.6421022|                           1.0|JM-15580|
|                    0.69970566|                     5.3333335|AZ-10750|
|                      2.516956|                     4.6666665|BK-11260|
|                      7.981272|                           3.0|BM-11785|
|                     1.4155095|                           7.5|HM-14860|
|                     11.548739|                           9.0|KH-16690|
|                     4.6880927|                           5.0|PN-18775|
|                    0.04808883|                           6.0|KN-16390|
|                     2.1529326|                           2.5|GM-14500|
|                     23.128746|                           2.5|KN-16390|
+------------------------------+-------------------

In [33]:
# Note: to fetch a single key, you may use client.get_online_features instead
materialized_feature_values = client.multi_get_online_features(
    feature_table=FEATURE_TABLE_NAME,
    keys=["KN-16390", "HM-14860"],
    feature_names=["f_avg_item_ordered_by_customer","f_avg_customer_discount_amount"],
)
materialized_feature_values

{'KN-16390': [2.5, 23.128746032714844], 'HM-14860': [7.5, 1.4155094623565674]}

In [55]:
FEATURE_TABLE_NAME = "FullCustomer360"
from feathr import HdfsSink

# Time range to materialize
backfill_time = BackfillTime(start=datetime(2020, 2, 7), end=datetime(2020, 2, 14), step=timedelta(days=1))

# Destinations:
# Online store (Redis)
redis_sink = RedisSink(table_name=FEATURE_TABLE_NAME)
# Offline store (local files under materialize_hdfs)
hdfs_sink = HdfsSink(output_path="materialize_hdfs", store_name="df360")
# Alternative offline store:
# adls_sink = HdfsSink(output_path=)

settings = MaterializationSettings(
    name=FEATURE_TABLE_NAME + ".job",  # job name
    backfill_time=backfill_time,
    sinks=[hdfs_sink, redis_sink],  # or adls_sink
    feature_names=["f_avg_item_ordered_by_customer","f_avg_customer_discount_amount"],
)
# 
client.materialize_features(
    settings=settings,
    execution_configurations={"spark.feathr.outputFormat": "parquet", 
                              "spark.feathr.hdfs.local.enable":"true",
                              "spark.sql.shuffle.partitions": '4'},
)

client.wait_job_to_finish(timeout_sec=50)

2024-08-06 07:50:18.930 | INFO     | feathr.utils._env_config_reader:get:62 - Config monitoring__database__sql__url is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:50:18.930 | INFO     | feathr.utils._env_config_reader:get:62 - Config monitoring__database__sql__user is not found in the environment variable, configuration file, or the remote key value store. Returning the default value: None.
2024-08-06 07:50:18.931 | INFO     | feathr.spark_provider._localspark_submission:_get_debug_file_name:288 - Spark log path is debug/customer360_feathr_feature_materialization_job20240806075018
2024-08-06 07:50:18.932 | INFO     | feathr.spark_provider._localspark_submission:_init_args:263 - Spark job: customer360_feathr_feature_materialization_job is running on local spark with master: local[*].
2024-08-06 07:50:18.944 | INFO     | feathr.spark_provider._localspark_submission:submit_feathr_job:143 - Detai

2024-08-06 07:50:50.542 | ERROR    | feathr.spark_provider._localspark_submission:wait_for_completion:216 - 


RuntimeError: Spark job failed.