# Feature Store Data Ingestion via SageMaker Model Building Pipeline

This sample notebook demonstrates how to use a SageMaker model building pipeline to:

1. Ingest preprocessed data directly into a feature group.
2. Transform and ingest data into a feature group.

Please choose the `Python 3 (Data Science)` kernel to run this notebook.

### Setup

In [None]:
import sys
import os
import time
# get the latest sagemaker python SDK
!{sys.executable} -m pip install "sagemaker>=2.41.0"

In [None]:
import sagemaker
import boto3
from sagemaker.session import get_execution_role, Session
from sagemaker.feature_store.feature_group import FeatureGroup, FeatureDefinition, FeatureTypeEnum
from sagemaker.wrangler.processing import DataWranglerProcessor
from sagemaker.wrangler.ingestion import generate_data_ingestion_flow_from_s3_input
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, ProcessingInput, ProcessingOutput
from sagemaker.workflow.parameters import ParameterInteger, ParameterString
from sagemaker.processing import FeatureStoreOutput

import numpy as np
import pandas as pd
import json

In [None]:
sagemaker_session = sagemaker.Session()
boto_session = boto3.Session()
role = get_execution_role()

### Prepare for SageMaker FeatureStore Session

In [None]:
sagemaker_client = boto_session.client("sagemaker")
featurestore_runtime_client = boto_session.client("sagemaker-featurestore-runtime")
featurestore_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime_client,
)

In [None]:
def wait_for_feature_group_create(feature_group: FeatureGroup):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        print(feature_group.describe())
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")

In [None]:
pipeline_to_clean_up = []
feature_group_to_clean_up = []

### Ingest the already transformed data directly into a feature group.

This section shows how you can ingest a pre-processed dataset directly into a feature group via a pipeline execution. More specifically, we will run a pipeline with a single `Processing` step. This step launches a DataWrangler processing job that ingests your data into a feature group directly, without any transformation.

Let's use `data/features.csv` as the sample pre-processed features. The dataset has 11 features in total: `f10` is the `event_time_feature_name`, and `f11` is the unique `record_identifier_name`.
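
Before declaring feature definitions by hand, it can help to check which feature type each CSV column would naturally map to. The sketch below is a minimal, standard-library illustration and not part of the SageMaker SDK: `infer_feature_type` and `infer_feature_types` are hypothetical helpers, and the sample rows are made up. It returns the strings `"Integral"`, `"Fractional"`, and `"String"`, which correspond to the three `FeatureTypeEnum` members used in this notebook.

```python
import csv
import io


def infer_feature_type(values):
    """Return "Integral", "Fractional", or "String" for a column of raw CSV strings."""

    def all_parse(cast):
        # True only if every value in the column can be converted with `cast`.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "Integral"
    if all_parse(float):
        return "Fractional"
    return "String"


def infer_feature_types(csv_text):
    """Map each column name of a CSV document (with header) to an inferred type."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV contains no data rows")
    return {name: infer_feature_type([row[name] for row in rows]) for name in rows[0]}


# Made-up sample rows, only for illustration.
sample = "f1,f2,f9,f11\nM,0.455,15,id-0\nF,0.35,7,id-1\n"
print(infer_feature_types(sample))
```

Treat the inferred types as a starting point only: a column such as the event time may parse as an integer in a small sample while the feature group intentionally declares it as `FRACTIONAL`.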

In [None]:
df = pd.read_csv('./data/features.csv')
df.head(5)

#### Upload the dataset to S3

In [None]:
input_data_uri = sagemaker_session.upload_data(
    path="data/features.csv", 
    bucket=sagemaker_session.default_bucket(), 
    key_prefix="data-ingestion-demo/inputs"
)

In [None]:
input_data_uri

#### Create a feature group

In [None]:
def create_feature_group(
    name, 
    feature_definition, 
    record_identifier_name, 
    event_time_feature_name, 
    offline_store_s3_uri, 
    featurestore_session, 
    role
):
    feature_group = FeatureGroup(
        name=name, 
        feature_definitions=feature_definition, 
        sagemaker_session=featurestore_session
    )
    try:
        feature_group.create(
            s3_uri=offline_store_s3_uri,
            record_identifier_name=record_identifier_name,
            event_time_feature_name=event_time_feature_name,
            role_arn=role,
            # Disable the online store for the purpose of this demo.
            # Otherwise, the data would only be available in the offline store
            # after the offline sync is done (SLA 15 min).
            enable_online_store=False,
        )
        wait_for_feature_group_create(feature_group)
    except Exception as e:
        if 'ResourceInUse' in str(e):
            print("FeatureGroup already exists.") 
        else:
            raise
    return feature_group

In [None]:
# we need to first define the feature definition for our feature group based on the above dataset.
feature_definition = [
        FeatureDefinition(feature_name="f1", feature_type=FeatureTypeEnum.STRING),
        FeatureDefinition(feature_name="f2", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f3", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f4", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f5", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f6", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f7", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f8", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f9", feature_type=FeatureTypeEnum.INTEGRAL),
        FeatureDefinition(feature_name="f10", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f11", feature_type=FeatureTypeEnum.STRING),
]

In [None]:
feature_group_name = "demo-1-1-ingestion-fg"
# define where feature store should offline sync the features into
offline_store_s3_uri = f"s3://{sagemaker_session.default_bucket()}/{feature_group_name}"
feature_group = create_feature_group(feature_group_name, feature_definition, "f11", "f10", offline_store_s3_uri, featurestore_session, role)

In [None]:
feature_group_to_clean_up.append(feature_group)

#### Generate the ingestion flow

Let's use a helper function to generate an ingestion-only Data Wrangler flow. The flow only reads your data and writes it to the feature group you specify.

In [None]:
ingestion_only_flow, output_name = generate_data_ingestion_flow_from_s3_input(
    "features.csv",
    input_data_uri,
    s3_content_type="csv",
    s3_has_header=True,
)

In [None]:
os.makedirs('flows', exist_ok=True)
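# use a context manager so the file is flushed and closed before the processor reads it
with open("flows/ingestion_only.flow", "w") as f:
    json.dump(ingestion_only_flow, f)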

#### Configure DataWranglerProcessor

Let's configure a `DataWranglerProcessor` as the base processor for our `Processing` step.

In [None]:
instance_count = ParameterInteger(name="InstanceCount", default_value=1)
instance_type = ParameterString(name="InstanceType", default_value="ml.m5.4xlarge")
    
data_wrangler_processor = DataWranglerProcessor(
    role=role,
    data_wrangler_flow_source="flows/ingestion_only.flow",
    instance_count=instance_count,
    instance_type=instance_type,
    sagemaker_session=sagemaker_session,
    max_runtime_in_seconds=86400,
)

#### Configure the pipeline

<div class="alert alert-info"> 💡Note that when configuring the ProcessingOutput, app_managed must be set to True in order to ingest into the feature group we specify.
</div>

In [None]:
inputs = [
    ProcessingInput(
        input_name="features.csv",
        source=input_data_uri,
        destination="/opt/ml/processing/features.csv",
    )
]

outputs = [
    ProcessingOutput(
        output_name=output_name,
        app_managed=True, # this must be true in order to ingest data into a feature group
        feature_store_output=FeatureStoreOutput(feature_group_name=feature_group_name),
    )
]

In [None]:
data_wrangler_step = ProcessingStep(
    name="ingestion-step", processor=data_wrangler_processor, inputs=inputs, outputs=outputs
)

In [None]:
pipeline = Pipeline(
    name="ingestion-only-pipeline",
    parameters=[instance_count, instance_type],
    steps=[data_wrangler_step],
    sagemaker_session=sagemaker_session,
)
pipeline.create(role)

In [None]:
pipeline_to_clean_up.append(pipeline)

#### Execute the pipeline

In [None]:
execution = pipeline.start()
response = execution.describe()
execution.wait(delay=60, max_attempts=10)

#### Confirm features get ingested

Let's use an Athena query to confirm that the features are ingested into our feature group without any transformation.
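
A SQL query without an `ORDER BY` clause does not guarantee row order, and the offline store adds bookkeeping columns (for example `write_time`, `api_invocation_time`, and `is_deleted`) next to the ingested features. A comparison that is keyed on the record identifier is therefore more robust than a positional one. The sketch below illustrates the idea with plain Python records; `records_match` is a hypothetical helper and the sample rows are made up.

```python
import math


def records_match(expected, actual, key, rel_tol=1e-9):
    """Compare two lists of dict records, ignoring row order and extra columns in `actual`."""
    if len(expected) != len(actual):
        return False
    # Index the actual rows by record identifier so row order does not matter.
    actual_by_key = {str(row[key]): row for row in actual}
    for row in expected:
        other = actual_by_key.get(str(row[key]))
        if other is None:
            return False
        for name, value in row.items():
            if name not in other:
                return False
            if isinstance(value, float):
                # Floats may lose a little precision on the round trip.
                if not math.isclose(value, float(other[name]), rel_tol=rel_tol):
                    return False
            elif str(value) != str(other[name]):
                return False
    return True


# Made-up rows: same content, different order, one extra bookkeeping column.
expected = [{"f11": "id-0", "f2": 0.455}, {"f11": "id-1", "f2": 0.35}]
actual = [
    {"f11": "id-1", "f2": 0.35, "is_deleted": False},
    {"f11": "id-0", "f2": 0.455, "is_deleted": False},
]
print(records_match(expected, actual, key="f11"))
```

With pandas DataFrames, the same check could be fed with `df.to_dict("records")` on both sides.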

In [None]:
def get_feature_group_data(feature_group, offline_store_s3_uri):
    athena_query = feature_group.athena_query()
    athena_query.run(query_string=f'SELECT * FROM "{athena_query.table_name}"', output_location=f"{offline_store_s3_uri}/query_results")
    athena_query.wait()
    assert "SUCCEEDED" == athena_query.get_query_execution().get("QueryExecution").get(
        "Status"
    ).get("State")
    return athena_query.as_dataframe()

In [None]:
fg_df = get_feature_group_data(feature_group, offline_store_s3_uri)
fg_df.head(5)

In [None]:
assert np.all(fg_df[df.columns] == df)

This demo shows how to copy data from S3 into a feature group. Two other helper functions can generate flows that ingest from `Athena` or `Redshift`. You can find them at:

[`sagemaker.wrangler.ingestion.generate_data_ingestion_flow_from_athena_dataset_definition`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/wrangler/ingestion.py#L76)

[`sagemaker.wrangler.ingestion.generate_data_ingestion_flow_from_redshift_dataset_definition`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/wrangler/ingestion.py#L123)

The usage is very similar.

### Transform data via DataWrangler, and ingest the transformed data into a feature group.

First, we need to go to the SageMaker DataWrangler console to generate a flow with a transformation. The DataWrangler flow below imports the same `data/features.csv` that we uploaded to S3 in the first scenario.

1. generate a flow with input from s3

<img src="./pics/input-flow.png" style="width:800px;"/>

2. add transformation

<img src="./pics/add-transform.png" style="width:800px;"/>

3. add one-hot encoding to the categorical feature `f1`

<img src="./pics/one-hot-transform.png" style="width:800px;"/>

4. export to feature store

<img src="./pics/export.png" style="width:800px;"/>
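
To make the effect of step 3 concrete, here is a minimal sketch of one-hot encoding in plain Python. It is not how Data Wrangler implements the transform; `one_hot_encode` is a hypothetical helper and the sample rows are made up. The category values `F`, `M`, and `I` are assumed from the `f1_F`, `f1_M`, and `f1_I` features that this notebook defines for the second feature group.

```python
def one_hot_encode(rows, column, categories):
    """Replace `column` in each row with one 0.0/1.0 indicator column per category."""
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {name: value for name, value in row.items() if name != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1.0 if row[column] == category else 0.0
        encoded.append(new_row)
    return encoded


# Made-up rows, only for illustration.
rows = [{"f1": "M", "f2": 0.455}, {"f1": "I", "f2": 0.33}]
for encoded_row in one_hot_encode(rows, "f1", ["F", "M", "I"]):
    print(encoded_row)
```

Each input row keeps its other columns and gains exactly one indicator set to `1.0`, which is why the transformed feature group declares three `FRACTIONAL` features in place of the original `STRING` feature `f1`.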

Once you click **Export to Feature Store**, you will be directed to a notebook. Inside the notebook, you can upload the newly generated `untitled.flow` to an S3 URI. For the purpose of this demo, the flow representing the steps shown above has been downloaded locally to `./flows/transformation.flow`.

<div class="alert alert-info"> 💡Note that one important piece of information inside the notebook is the output name. It is auto-generated from the selected output node's ID and the output name in the flow file.
</div>

In our case, the output name is `2351bdcf-a2f6-499c-9665-1203f48eb3cd.default`. This value tells the DataWrangler container where to look up the transformed data.
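
If you need to look up the output name yourself, the naming rule described above (node ID, a dot, then the output name) can be applied to the flow file directly. The sketch below is an assumption-laden illustration: it presumes the flow JSON has a top-level `nodes` list whose entries carry a `node_id` and an `outputs` list with `name` fields. `list_output_names` is a hypothetical helper, and the inline flow is a made-up stand-in for a real `.flow` file.

```python
import json


def list_output_names(flow):
    """Return "<node_id>.<output name>" for every output of every node in a flow dict."""
    names = []
    for node in flow.get("nodes", []):
        for output in node.get("outputs", []):
            names.append(f"{node['node_id']}.{output['name']}")
    return names


# Made-up flow content; a real flow file contains more fields per node.
flow_text = """
{
  "nodes": [
    {"node_id": "source-node", "outputs": [{"name": "default"}]},
    {"node_id": "encode-node", "outputs": [{"name": "default"}]}
  ]
}
"""
print(list_output_names(json.loads(flow_text)))
```

For a real flow, load the file with `json.load` and pick the entry that belongs to the node you exported, typically the last transformation in the flow.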

Then, similarly, let's create another feature group.

In [None]:
# we need to first define the feature definition for our feature group based on the above dataset.
feature_definition = [
        # one-hot encoded f1
        FeatureDefinition(feature_name="f1_F", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f1_M", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f1_I", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f2", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f3", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f4", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f5", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f6", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f7", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f8", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f9", feature_type=FeatureTypeEnum.INTEGRAL),
        FeatureDefinition(feature_name="f10", feature_type=FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition(feature_name="f11", feature_type=FeatureTypeEnum.STRING),
]

In [None]:
feature_group_name = "demo-transformation-ingestion-fg"
# define where feature store should offline sync the features into
offline_store_s3_uri = f"s3://{sagemaker_session.default_bucket()}/{feature_group_name}"
feature_group = create_feature_group(feature_group_name, feature_definition, "f11", "f10", offline_store_s3_uri, featurestore_session, role)

In [None]:
feature_group_to_clean_up.append(feature_group)

Then, let's create another pipeline to execute the transformation and ingestion.

In [None]:
data_wrangler_processor = DataWranglerProcessor(
    role=role,
    data_wrangler_flow_source="flows/transformation.flow",
    instance_count=instance_count,
    instance_type=instance_type,
    sagemaker_session=sagemaker_session,
    max_runtime_in_seconds=86400,
)

In [None]:
inputs = [
    ProcessingInput(
        input_name="features.csv",
        source=input_data_uri,
        destination="/opt/ml/processing/features.csv",
    )
]

outputs = [
    ProcessingOutput(
        output_name="2351bdcf-a2f6-499c-9665-1203f48eb3cd.default",
        app_managed=True, # this must be true in order to ingest data into a feature group
        feature_store_output=FeatureStoreOutput(feature_group_name=feature_group_name),
    )
]

data_wrangler_step = ProcessingStep(
    name="transformation-ingestion-step", processor=data_wrangler_processor, inputs=inputs, outputs=outputs
)

In [None]:
pipeline = Pipeline(
    name="transformation-ingestion-pipeline",
    parameters=[instance_count, instance_type],
    steps=[data_wrangler_step],
    sagemaker_session=sagemaker_session,
)
pipeline.create(role)

In [None]:
pipeline_to_clean_up.append(pipeline)

In [None]:
execution = pipeline.start()
response = execution.describe()
execution.wait(delay=60, max_attempts=10)

#### Confirm features get transformed and ingested

Let's use an Athena query to confirm that the features are transformed and ingested.

In [None]:
fg_df = get_feature_group_data(feature_group, offline_store_s3_uri)
fg_df.head(5)

### Clean up

In [None]:
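def clean_up():
    for pipeline in pipeline_to_clean_up:
        try:
            pipeline.delete()
        except Exception as e:
            # a resource that is already gone needs no action; report anything else
            if 'ResourceNotFound' not in str(e):
                print(f"Failed to delete pipeline {pipeline.name}: {e}")

    for fg in feature_group_to_clean_up:
        try:
            fg.delete()
        except Exception as e:
            # a resource that is already gone needs no action; report anything else
            if 'ResourceNotFound' not in str(e):
                print(f"Failed to delete feature group {fg.name}: {e}")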

In [None]:
clean_up()