# Tutorial #3: Experiment and train models using features

In this tutorial series you will see how the feature store seamlessly integrates all phases of the ML lifecycle: prototyping features, training, and operationalizing.

In part 1 of the tutorial you learned how to create a feature set spec with custom transformations. In part 2 you learned how to enable materialization and perform a backfill. In this tutorial you will learn how to experiment with features to improve model performance, and you will see how the feature store increases agility in the experimentation and training flows.

You will perform the following steps:
- Prototype a new `accounts` feature set spec that uses existing precomputed values as features, unlike part 1 of the tutorial, where the feature set had custom transformations. You will then register the local feature set spec as a feature set in the feature store.
- Select features for the model: you will select features from the `transactions` and `accounts` feature sets and save them as a feature-retrieval spec.
- Run a training pipeline that uses the feature-retrieval spec to train a new model. This pipeline uses the built-in feature-retrieval component to generate the training data.

#### Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

# Prerequisites
1. Ensure you have completed parts 1 and 2 of this tutorial.

# Setup

#### Configure Azure ML spark notebook

1. In the "Compute" dropdown in the top nav, select "Serverless Spark Compute".
1. Click "configure session" in the top status bar, then "Python packages", then "upload conda file", and select the file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` from your local machine. Also increase the session timeout (idle time) if you want to avoid rerunning the prerequisites frequently.




#### Start spark session

In [1]:
# Run this cell to start the Spark session (any code cell will start the session). This can take around 10 minutes.
print("start spark session")


start spark session


#### Setup root directory for the samples

In [3]:
import os

# Update the dir to ./Users/{your-alias} (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav.
root_dir = "./Users/ezzatdemnati/e2e_mlops_process"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")


The folder exists.


#### Initialize the project workspace CRUD client
This is the current workspace where you will be running the tutorial notebook from

In [4]:
### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = "e0d7a68e-191f-4f51-83ce-d93995cd5c09"
project_ws_rg = "my_ml_tests"
project_ws_name = "myworkspace"

# connect to the project workspace
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)


#### Initialize and get feature store 
Ensure you update the `featurestore_name` to reflect what you created in part 1 of this tutorial

In [5]:
import os
import json
import pandas as pd
from datetime import datetime

from azure.ai.ml.entities import FeatureSetSpecification, RecurrenceTrigger

import sys

root_dir = "./Users/ezzatdemnati/e2e_mlops_process"
sys.path.insert(0, root_dir)

# helper module expected under root_dir (added to sys.path above)
import featurestore.setup.featurestore_setuptools as fs_setup

# Read the feature store config file
with open(os.path.join(root_dir, "featurestore/config/feature_store_config.json"), "r") as f:
    fs_config = json.load(f)

# Initialize the FeatureStoreTools helper class
fs_class = fs_setup.FeatureStoreTools(
    subscription_id=fs_config["subscription_id"],
    resource_group_name=fs_config["resource_group_name"],
    location=fs_config["location"],
    featurestore_name=fs_config["name"],
    root_dir=root_dir,
    fs_config=fs_config,
    ml_client=None,
)
# Get the feature store client
feature_store = fs_class.get_feature_store(verbose=0)



root_dir:./Users/ezzatdemnati/e2e_mlops_process
self.root_dir:./Users/ezzatdemnati/e2e_mlops_process/featurestore


Class FeatureStoreClient: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Method feature_stores: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
_AzureMLSparkOnBehalfOfCredential.get_token succeeded
Class MaterializationStore: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
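
The cell above reads four keys from `feature_store_config.json` (`subscription_id`, `resource_group_name`, `location`, `name`). As a minimal sketch, assuming those are the only keys that must be present, a small check can report missing values before the feature store client is created. The helper `missing_config_keys` and the sample values are hypothetical and not part of the tutorial code.

```python
import json

# Keys that the setup cell reads from the config file.
REQUIRED_KEYS = ("subscription_id", "resource_group_name", "location", "name")


def missing_config_keys(config: dict) -> list:
    """Return the required keys that are absent or empty in the config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Example with an in-memory config; in the notebook you would pass fs_config.
sample_config = json.loads('{"subscription_id": "<sub-id>", "name": "my-featurestore"}')
print(missing_config_keys(sample_config))
```

An empty list means all four values are set.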


#### Step 2a: Select features for model

In [6]:
# get the registered transactions and accounts feature sets (version 4 of each in this run)
transactions_featureset = feature_store.feature_sets.get("transactions", "4")
accounts_featureset = feature_store.feature_sets.get("accounts", "4")
# Both feature sets are retrieved from the feature store here, so both must already be registered under this version
features = [
    accounts_featureset.get_feature("accountAge"),
    accounts_featureset.get_feature("numPaymentRejects1dPerUser"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]


Method feature_sets: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
_AzureMLSparkOnBehalfOfCredential.get_token succeeded
_AzureMLSparkOnBehalfOfCredential.get_token succeeded
Method feature_store_entities: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
_AzureMLSparkOnBehalfOfCredential.get_token succeeded


#### Step 2b: Generate training data locally
In this step we generate training data for illustrative purpose. You can optionally train models locally with this. In the upcoming steps in this tutorial, you will train a model in the cloud.
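
Conceptually, training data is built by joining each observation row with the feature values that were known at or before that row's timestamp (a point-in-time join). The sketch below illustrates the idea on toy data with pandas `merge_asof`. It is a stand-in for illustration only, not how `get_offline_features()` is implemented, and the `accountID` column and all values are assumptions.

```python
import pandas as pd

# Toy observation data: one row per event we want to build a training row for.
observations = pd.DataFrame(
    {
        "accountID": ["A", "A", "B"],
        "timestamp": pd.to_datetime(["2023-01-03", "2023-01-06", "2023-01-04"]),
    }
)

# Toy feature values, each valid from its own timestamp onward.
feature_values = pd.DataFrame(
    {
        "accountID": ["A", "A", "B"],
        "timestamp": pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-02"]),
        "transaction_amount_7d_sum": [100.0, 250.0, 40.0],
    }
)

# merge_asof requires both frames to be sorted by the "on" key.
training = pd.merge_asof(
    observations.sort_values("timestamp"),
    feature_values.sort_values("timestamp"),
    on="timestamp",
    by="accountID",
    direction="backward",  # latest feature value at or before the observation
)
print(training)
```

Each observation picks up only feature values from its own past, which avoids leaking future information into the training data.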

In [None]:
df = spark.read.parquet("wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet")

display(df)



In [7]:
from azureml.featurestore import get_offline_features

# Load the observation data. To understand observation data, refer to part 1 of this tutorial
observation_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# You can ignore any message that says a feature set is not materialized (materialization is optional).
display(training_df)
# Note: display(training_df.head(5)) displays the timestamp column in a different format. You can call training_df.show() to see the correctly formatted value.


#### Step 2c: Register the `accounts` featureset with the featurestore
Once you have experimented with different feature definitions locally and sanity-tested them, you can register them with the feature store.
To do so, register a feature set asset definition with the feature store.


In [None]:
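# Note: this cell uses `fs_client` and `accounts_featureset_spec_folder`, which are not defined in the cells above; define them before running it.
from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification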

accounts_fset_config = FeatureSet(
    name="accounts",
    version="1",
    description="accounts featureset",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(poller.result())

#### Step 2d: Get registered featureset and sanity test

In [None]:
# look up the featureset by providing name and version
accounts_featureset = feature_store.feature_sets.get("accounts", "1")
# get access to the feature data
accounts_feature_df = accounts_featureset.to_spark_dataframe()
display(accounts_feature_df.head(5))
# Note: Please ignore this warning: Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun

## Step 3: Run training experiment
In this step you will select a list of features, run a training pipeline, and register the model. You can repeat this step till you are happy with the model performance.

#### (Optional) Step 3a: Discover features from Feature Store UI
You have already done this in part 1 of the tutorial after registering the `transactions` feature set. Since you also have the `accounts` feature set, you can browse the available features:
* Go to the [Azure ML global landing page](https://ml.azure.com/home?flight=FeatureStores).
* Click `Feature stores` in the left nav.
* You will see the list of feature stores that you have access to. Click the feature store that you created in part 1 of this tutorial.

You can see the feature sets and entity that you created. Click a feature set to browse its feature definitions. You can also search for feature sets across feature stores by using the global search box.

#### (Optional) Step 3b: Discover features from SDK

In [None]:
# List available feature sets
all_featuresets = feature_store.feature_sets.list()
for fs in all_featuresets:
    print(fs)

# List of versions for transactions feature set
all_transactions_featureset_versions = feature_store.feature_sets.list(
    name="transactions"
)
for fs in all_transactions_featureset_versions:
    print(fs)

# See properties of the transactions featureset including list of features
feature_store.feature_sets.get(name="transactions", version="1").features

#### Step 3c: Select features for the model and export it as a feature-retrieval spec
In the previous steps, you selected features from a combination of unregistered and registered feature sets for local experimentation and testing. Now you are ready to experiment in the cloud. Saving the selected features as a feature-retrieval spec and using it in the MLOps/CI/CD flow for training and inference increases your agility in shipping models.

Select features for the model

In [None]:
# you can select features in pythonic way
features = [
    accounts_featureset.get_feature("accountAge"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

# you can also specify features in string form: featureset:version:feature
more_features = [
    "accounts:1:numPaymentRejects1dPerUser",
    "transactions:1:transaction_amount_7d_avg",
]

more_features = feature_store.resolve_feature_uri(more_features)

features.extend(more_features)
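
The string entries in this cell have three colon-separated parts: feature set name, version, and feature name. As a minimal sketch, assuming that three-part layout, a small helper can validate such strings before they are resolved. `FeatureRef` and `parse_feature_ref` are hypothetical names introduced for illustration and are not part of the feature store SDK.

```python
from typing import NamedTuple


class FeatureRef(NamedTuple):
    feature_set: str
    version: str
    feature: str


def parse_feature_ref(ref: str) -> FeatureRef:
    """Split 'featureset:version:feature' into its three parts."""
    parts = ref.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'featureset:version:feature', got {ref!r}")
    return FeatureRef(*parts)


refs = [
    "accounts:1:numPaymentRejects1dPerUser",
    "transactions:1:transaction_amount_7d_avg",
]
parsed = [parse_feature_ref(r) for r in refs]
print(parsed[0].feature_set, parsed[0].version, parsed[0].feature)
```

A malformed string such as `"accounts:1"` raises a `ValueError` here instead of failing later during resolution.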

Export selected features as a feature-retrieval spec

#### Note
A feature retrieval spec is a portable definition of the list of features associated with a model. It helps streamline ML model development and operationalization. It is an input to the training pipeline (used to generate the training data), is then packaged along with the model, and is used during inference to look up the features. It is the glue that integrates all phases of the ML lifecycle, so changes to the training and inference pipelines can be kept minimal as you experiment and deploy.

Using feature retrieval spec and the built-in feature retrieval component is optional: you can directly use `get_offline_features()` api as shown above.

Note that the name of the spec should be `feature_retrieval_spec.yaml` when it is packaged with the model for the system to recognize it.

In [None]:
# Create feature retrieval spec
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"

# create the folder if it does not exist yet
os.makedirs(feature_retrieval_spec_folder, exist_ok=True)

feature_store.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
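
Because the system only recognizes the spec when it is named `feature_retrieval_spec.yaml`, it can be useful to confirm that a file with that name exists in the output folder. The sketch below assumes the generated file uses exactly that name. The helper `has_feature_retrieval_spec` is hypothetical, and the demonstration uses a temporary folder with a placeholder file rather than a real spec.

```python
import os
import tempfile

SPEC_FILE_NAME = "feature_retrieval_spec.yaml"  # name required when packaged with the model


def has_feature_retrieval_spec(folder: str) -> bool:
    """Return True if the folder contains a file with the expected spec name."""
    return os.path.isfile(os.path.join(folder, SPEC_FILE_NAME))


# Demonstration on a temporary folder instead of the real project path.
with tempfile.TemporaryDirectory() as tmp:
    before = has_feature_retrieval_spec(tmp)
    with open(os.path.join(tmp, SPEC_FILE_NAME), "w") as f:
        f.write("# placeholder, not a real spec\n")
    after = has_feature_retrieval_spec(tmp)

print(before, after)
```

In the notebook, the same check could be called with `feature_retrieval_spec_folder` after the spec is generated.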

## Step 4: Train in the cloud using pipelines and register model if satisfactory
In this step you will manually trigger the training pipeline. In a production scenario, this could be triggered by a ci/cd pipeline based on changes to the feature-retrieval spec in the source repository.

#### Step 4a: Run the training pipeline
The training pipeline has the following steps:

1. Feature retrieval step: This built-in component takes the feature retrieval spec, the observation data, and the timestamp column name as input, and generates the training data as output. It runs as a managed Spark job.
1. Training step: This step trains the model on the training data and generates a model (not registered yet).
1. Evaluation step: This step validates whether the model performance/quality is within a threshold (in our case it is a placeholder/dummy step for illustration purposes).
1. Register model step: This step registers the model.
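
The control flow of these four steps, with registration gated on the evaluation result, can be sketched in plain Python. This is only an illustration of the ordering: the function names, the `score` metric, the threshold, and the toy row are all hypothetical stand-ins, not the actual pipeline components.

```python
from typing import Callable, Dict, List


def run_pipeline(
    retrieve: Callable[[], List[dict]],
    train: Callable[[List[dict]], Dict[str, float]],
    threshold: float,
) -> str:
    """Run retrieval, training, and evaluation; register only if the metric passes."""
    training_data = retrieve()            # 1. feature retrieval
    model = train(training_data)          # 2. training
    passed = model["score"] >= threshold  # 3. evaluation against a threshold
    return "registered" if passed else "rejected"  # 4. register model


# Hypothetical stand-ins for the real components.
def fake_retrieve() -> List[dict]:
    return [{"accountAge": 3.0, "label": 0}]


def fake_train(rows: List[dict]) -> Dict[str, float]:
    return {"score": 0.9}


print(run_pipeline(fake_retrieve, fake_train, threshold=0.8))
```

In the real pipeline each step is a separate job, but the dependency order is the same.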

Note: In part 2 of this tutorial you ran a backfill job to materialize data for the `transactions` feature set. The feature retrieval step will read feature values from the offline store for this feature set. The behavior is the same if you use the `get_offline_features()` API.

In [None]:
from azure.ai.ml import load_job  # will be used later

training_pipeline_path = (
    root_dir + "/project/fraud_model/pipelines/training_pipeline.yaml"
)
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)
ws_client.jobs.stream(training_pipeline_job.name)
# Note: First time it runs, each step in pipeline can take ~ 15 mins. However subsequent runs can be faster (assuming spark pool is warm - default timeout is 30 mins)

#### Inspect the training pipeline and the model
Open the above pipeline run "web view" in new window to see the steps in the pipeline.


#### Step 4b: Notice the feature retrieval spec in the model artifacts
1. In the left nav of the current workspace -> right click on `Models` -> Open in new tab or window
1. Click on `fraud_model`
1. Click on `Artifacts` in the top nav

Notice that the feature retrieval spec is packaged along with the model. The model registration step in the training pipeline has done this. You created the feature retrieval spec during experimentation, and it has now become part of the model definition. In the next tutorial you will see how it is used during inferencing.


## Step 5: View the feature set and model dependencies

#### Step 5a: View the list of feature sets associated with the model
In the same models page, click on the `feature sets` tab. Here you can see both `transactions` and `accounts` featuresets that this model depends on.

#### Step 5b: View the list of models using the feature sets
1. Open the feature store UI (explained in a previous step in this tutorial)
1. Click on `Feature sets` on the left nav
1. Click on any of the feature sets -> click on the `Models` tab

You can see the list of models that are using the feature sets (determined from the feature retrieval spec when the model was registered).

## Cleanup

Part 4 of the tutorial has instructions for deleting the resources

## Next steps
* Part 4 of tutorial: Enable recurrent materialization and run batch inference