## Materialize Features from the Offline to Online Store

This example shows how to use the [Feature Store Spark Connector](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html) to ingest features directly into the offline store, and then incrementally materialize the latest features to the online store.

### Create Feature Group

First, create a feature group with online and offline stores configured.

In [None]:
import sagemaker
import boto3

sm_client = boto3.client('sagemaker')
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
region_name = sagemaker_session.boto_region_name
default_bucket = sagemaker_session.default_bucket()
feature_group_name = 'feature-store-offline-to-online-example'

We highly recommend storing offline features using the Apache Iceberg table format. If you need to use the Glue table format, set the variable below to `'Glue'`.

For more information on offline store formats, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-offline.html).

In [None]:
table_format = 'Iceberg' # or 'Glue'

In [None]:
# NOTE: This example only works with the Glue table format.

create_feature_group_output = sm_client.create_feature_group(
    FeatureGroupName=feature_group_name,
    RecordIdentifierFeatureName='RecordIdentifier',
    EventTimeFeatureName='EventTime',
    OnlineStoreConfig={
        'EnableOnlineStore': True
    },
    OfflineStoreConfig={
        'S3StorageConfig': {
            'S3Uri': f's3://{default_bucket}'
        },
        'TableFormat': table_format
    },
    FeatureDefinitions=[
        {
            'FeatureName': 'RecordIdentifier',
            'FeatureType': 'Integral'
        },
        {
            'FeatureName': 'Measure',
            'FeatureType': 'Fractional'
        },
        {
            'FeatureName': 'EventTime',
            'FeatureType': 'String'
        }
    ],
    RoleArn=role
)
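
Feature group creation is asynchronous, so the group may still be in the `Creating` status when the call above returns. One way to block until it is ready is to poll `describe_feature_group`. The following is a minimal sketch: the helper name `wait_for_feature_group` and its default intervals are illustrative and not part of the SageMaker SDK.

```python
import time


def wait_for_feature_group(client, name, poll_seconds=5, timeout_seconds=600):
    """Poll until the feature group leaves the 'Creating' status.

    `client` is any object exposing describe_feature_group, such as
    boto3.client('sagemaker'). The helper name and defaults are illustrative.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = client.describe_feature_group(FeatureGroupName=name)
        status = description['FeatureGroupStatus']
        if status == 'Created':
            return status
        if status != 'Creating':
            # For example 'CreateFailed'; surface the reason if one was reported.
            reason = description.get('FailureReason', 'no reason reported')
            raise RuntimeError(f'Feature group {name} is {status}: {reason}')
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'Feature group {name} is still Creating after {timeout_seconds}s'
            )
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_feature_group(sm_client, feature_group_name)` before starting ingestion.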

### Ingest Data to the Offline Store

We will create a [SageMaker Processing Job](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) that uses the Feature Store Spark Connector to ingest a set of features directly into the offline store.

To use the Feature Store Spark Connector in a Processing Job, we recommend extending the prebuilt SageMaker Spark Processing container as shown in the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html#:~:text=Installation%20on%20a%20Amazon%20SageMaker%20Processing%20Job). For this example, we have included the Python interface to the Spark Connector under `feature_store_pyspark`. We will download the JAR file that encapsulates the Spark Connector functionality and submit all of these assets when we run the processing job.

In [None]:
spark_version = '3.1'

To download the correct JAR, we will use `pip` to install the specific connector library based on the Spark version (`MAJOR.MINOR`) we want to use.

In [None]:
%pip install sagemaker-feature-store-pyspark-{spark_version}

We can then use the command line tool to fetch the location of the JAR.

In [None]:
jar_path = !feature-store-pyspark-dependency-jars
jar_path = jar_path[0]
jar_path

Run a processing job using `scripts/ingest_offline.py` and include the necessary Python modules and JAR.

In [None]:
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    max_runtime_in_seconds=1200,
    framework_version=spark_version,
)

spark_processor.run(
    submit_app='./scripts/ingest_offline.py',
    arguments=[
        '--feature_group_name',
        feature_group_name,
        '--region_name',
        region_name
    ],
    logs=False,
    submit_jars=[jar_path],
    submit_py_files=[
        './feature_store_pyspark/feature_store_manager.py',
        './feature_store_pyspark/wrapper.py'
    ]
)

### Materialize Latest Features to Online Store

Now that our features have been ingested into the offline store, we can materialize the latest features (for each record identifier) to the online store. To do this, we will run another Spark Processing Job using `scripts/materialize.py`. Since the task may need to run on a regular cadence, we can add the processing job to a SageMaker Pipeline. This pipeline can then be scheduled with [Amazon EventBridge](https://docs.aws.amazon.com/sagemaker/latest/dg/pipeline-eventbridge.html).
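
The contents of `scripts/materialize.py` are not shown in this notebook. Conceptually, "latest features for each record identifier" means keeping, per identifier, the row with the greatest event time. The plain-Python sketch below illustrates only that selection rule and is not the Spark job itself. It assumes `EventTime` values are ISO-8601 strings; the function names and sample rows are made up for illustration.

```python
from datetime import datetime


def parse_event_time(value):
    """Parse an ISO-8601 timestamp such as '2023-01-02T00:00:00Z'."""
    # Older Python versions of datetime.fromisoformat reject a trailing 'Z'.
    return datetime.fromisoformat(value.replace('Z', '+00:00'))


def latest_per_identifier(rows, id_key='RecordIdentifier', time_key='EventTime'):
    """Return one row per identifier: the one with the greatest event time."""
    latest = {}
    for row in rows:
        current = latest.get(row[id_key])
        if current is None or (
            parse_event_time(row[time_key]) > parse_event_time(current[time_key])
        ):
            latest[row[id_key]] = row
    return list(latest.values())


# Illustrative rows only; these values are not taken from the example dataset.
sample_rows = [
    {'RecordIdentifier': 1, 'Measure': 0.5, 'EventTime': '2023-01-01T00:00:00Z'},
    {'RecordIdentifier': 1, 'Measure': 0.7, 'EventTime': '2023-01-02T00:00:00Z'},
    {'RecordIdentifier': 2, 'Measure': 1.1, 'EventTime': '2023-01-01T12:00:00Z'},
]
latest_rows = latest_per_identifier(sample_rows)
```

Only the rows in `latest_rows` would need to be written to the online store, since it holds a single, most recent record per identifier.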

In [None]:
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

pipeline_name = feature_group_name
pipeline_session = PipelineSession()

spark_processor = PySparkProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    max_runtime_in_seconds=1200,
    sagemaker_session=pipeline_session,
    framework_version=spark_version
)

processor_args = spark_processor.run(
    submit_app='./scripts/materialize.py',
    logs=False,
    arguments=[
        '--table_format',
        table_format,
        '--feature_group_name',
        feature_group_name,
        '--region_name',
        region_name
    ],
    submit_jars=[
        jar_path
    ],
    submit_py_files=[
        './feature_store_pyspark/feature_store_manager.py',
        './feature_store_pyspark/wrapper.py'
    ]
)

step_process = ProcessingStep(name='MaterializeToOnlineStore', step_args=processor_args)

pipeline = Pipeline(
    name=pipeline_name,
    steps=[step_process],
)

pipeline.upsert(role_arn=role)
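
To run the pipeline on a schedule, one option is an EventBridge rule with the pipeline as its target. The sketch below uses the `put_rule` and `put_targets` calls of the boto3 `events` client. The helper name, the target id, and the `rate(1 day)` default are illustrative choices. It also assumes `role_arn` refers to a role that EventBridge can assume and that is allowed to start the pipeline; whether the notebook's execution role qualifies depends on your account setup.

```python
def schedule_pipeline(events_client, rule_name, pipeline_arn, role_arn,
                      schedule_expression='rate(1 day)'):
    """Create or update an EventBridge rule that starts a SageMaker pipeline.

    `events_client` is expected to behave like boto3.client('events').
    """
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression,
        State='ENABLED',
    )
    target = {
        'Id': 'materialize-pipeline',  # hypothetical target id
        'Arn': pipeline_arn,
        'RoleArn': role_arn,  # role EventBridge assumes to start the pipeline
    }
    # The pipeline defined above takes no parameters, so none are passed here.
    events_client.put_targets(Rule=rule_name, Targets=[target])
    return target
```

A possible call, once the pipeline exists, is `schedule_pipeline(boto3.client('events'), f'{pipeline_name}-daily', pipeline.describe()['PipelineArn'], role)`.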


Manually run the pipeline.

In [None]:
execution = pipeline.start()
execution.wait()
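
After the run, it can be useful to see how each step fared, especially if the execution did not succeed. The sketch below collects the overall status and per-step results using the `describe()` and `list_steps()` methods of the execution object returned by `pipeline.start()`. The helper name `summarize_execution` is illustrative.

```python
def summarize_execution(execution):
    """Return the overall status and a (name, status, failure reason) tuple per step."""
    status = execution.describe()['PipelineExecutionStatus']
    steps = [
        # FailureReason is only present for steps that failed.
        (step['StepName'], step['StepStatus'], step.get('FailureReason'))
        for step in execution.list_steps()
    ]
    return status, steps
```

For example, `status, steps = summarize_execution(execution)` returns the overall status alongside one tuple for the `MaterializeToOnlineStore` step.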

Verify that the latest features are available in the online store.

In [None]:
fs_client = boto3.client('sagemaker-featurestore-runtime')
fs_client.batch_get_record(
    Identifiers=[
        {
            'FeatureGroupName': feature_group_name,
            'RecordIdentifiersValueAsString': ['1', '2', '3']
        }
    ]
)
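
The `batch_get_record` response nests each record as a list of `FeatureName` / `ValueAsString` pairs, which is awkward to read directly. The sketch below flattens the response into one dictionary per record and also returns any identifiers the service could not serve. The helper name `records_to_dicts` is illustrative, and all values come back as strings.

```python
def records_to_dicts(response):
    """Flatten a batch_get_record response into (rows, problems)."""
    rows = []
    for item in response.get('Records', []):
        # Each record is a list of {'FeatureName': ..., 'ValueAsString': ...}.
        rows.append({
            feature['FeatureName']: feature['ValueAsString']
            for feature in item['Record']
        })
    # Identifiers that errored or were left unprocessed are reported separately.
    problems = {
        'Errors': response.get('Errors', []),
        'UnprocessedIdentifiers': response.get('UnprocessedIdentifiers', []),
    }
    return rows, problems
```

Passing the response from the call above to `records_to_dicts` gives a list of rows that can be compared against the latest values in the offline store.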