# Data Wrangler Feature Store Notebook

Use this notebook to create a feature group and add features to an offline or online
feature store using a Data Wrangler .flow file.

A single *feature* corresponds to a column in your dataset. A *feature group* is a predefined
schema for a collection of features - each feature in the feature group has a specified data
type and name. A single *record* in a feature group corresponds to a row in your dataframe.
A *feature store* is a collection of feature groups.

This notebook uses Amazon SageMaker Feature Store (Feature Store) to create a feature group
and ingest data into it. To learn more about SageMaker Feature Store, see
[Amazon SageMaker Feature Store Documentation](http://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html).

To create a feature group, you will create the following resources:
* A feature definition using a schema, record identifier, and event-time feature name.
* An online or offline store configuration.

You will use a processing job to process your data at scale and ingest the data into this feature group.

First, use the following cell to install dependencies.

In [None]:
# SageMaker Python SDK version 2.x is required
import sagemaker
import subprocess
import sys

original_version = sagemaker.__version__
if sagemaker.__version__ != "2.17.0":
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "sagemaker==2.17.0"]
    )
    import importlib
    importlib.reload(sagemaker)

In [None]:
import os
import uuid
import json
import time
import boto3
import sagemaker

## Parameters
The following lists configurable parameters that are used throughout this notebook.

In [None]:
# S3 bucket for saving processing job outputs
# Feel free to specify a different bucket here if you wish.
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = "data_wrangler_flows"
flow_id = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
flow_name = f"flow-{flow_id}"
flow_uri = f"s3://{bucket}/{prefix}/{flow_name}.flow"

flow_file_name = "99_transform_star_rating_to_sentiment.flow"

iam_role = sagemaker.get_execution_role()

container_uri = "663277389841.dkr.ecr.us-east-1.amazonaws.com/sagemaker-data-wrangler-container:1.0.2"

# Processing Job Resources Configurations
processing_job_name = f"data-wrangler-feature-store-processing-{flow_id}"
processing_dir = "/opt/ml/processing"

# URL to use for sagemaker client.
# If this is None, boto will automatically construct the appropriate URL to use
# when communicating with sagemaker.
sagemaker_endpoint_url = None

## Push Flow to S3
Use the following cell to upload the Data Wrangler .flow file to Amazon S3 so that
it can be used as an input to the processing job.

In [None]:
# Load .flow file
with open(flow_file_name) as f:
    flow = json.load(f)

# Upload to S3
s3_client = boto3.client("s3")
s3_client.upload_file(flow_file_name, bucket, f"{prefix}/{flow_name}.flow")

print(f"Data Wrangler flow file uploaded to {flow_uri}")

## Create Feature Group

In [None]:
feature_group_name = f'FG-{flow_name}'
print(f"Feature Group Name: {feature_group_name}")


The following cell maps types between Data Wrangler supported types and Feature Store
supported types (`String`, `Fractional`, and `Integral`). The default type is set to `String`.
This means that, if a column in your dataset is not a `float` or `long` type,
it will default to `String` in your Feature Store.

In [None]:
datawrangler_FG_type_mapping = {
    'float': 'Fractional',
    'long': 'Integral'
}

# Some schema types in Data Wrangler are not supported by Feature Store.
# Feature store supports String, Integral, and Fractional types.
# The following will create a default_FG_type set to String for these types.
default_FG_type = "String"
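
To make the fallback concrete, the following sketch applies the same lookup to a few Data Wrangler
type names. The `resolve_feature_type` helper is a hypothetical name introduced only for this
illustration; the mapping and the default mirror the cell above.

```python
# Mirrors the mapping defined above. resolve_feature_type is a hypothetical
# helper used only for illustration; it is not part of any SDK.
datawrangler_FG_type_mapping = {"float": "Fractional", "long": "Integral"}
default_FG_type = "String"


def resolve_feature_type(datawrangler_type):
    """Return the Feature Store type for a Data Wrangler type name."""
    return datawrangler_FG_type_mapping.get(datawrangler_type, default_FG_type)


# Types without an explicit mapping fall back to the default type.
for dw_type in ("float", "long", "string", "date"):
    print(f"{dw_type} -> {resolve_feature_type(dw_type)}")
```

With this mapping, a `date` column such as `review_date` is stored as a `String` feature.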


The following is a list of the column names and data types of the final dataset that will be produced
when your data flow is used to process your input dataset.

In [None]:
column_schema = [
    {
        "name": "marketplace",
        "type": "string"
    },
    {
        "name": "review_id",
        "type": "string"
    },
    {
        "name": "product_id",
        "type": "string"
    },
    {
        "name": "product_title",
        "type": "string"
    },
    {
        "name": "product_category",
        "type": "string"
    },
    {
        "name": "vine",
        "type": "string"
    },
    {
        "name": "verified_purchase",
        "type": "string"
    },
    {
        "name": "review_headline",
        "type": "string"
    },
    {
        "name": "review_body",
        "type": "string"
    },
    {
        "name": "review_date",
        "type": "date"
    },
    {
        "name": "customer_id",
        "type": "long"
    },
    {
        "name": "product_parent",
        "type": "long"
    },
    {
        "name": "star_rating",
        "type": "long"
    },
    {
        "name": "helpful_votes",
        "type": "long"
    },
    {
        "name": "total_votes",
        "type": "long"
    }
]


Select a record identifier and an event time feature name. Both are required parameters for
feature group creation.
* **Record identifier name** is the name of the feature whose value uniquely identifies a record
defined in the feature group's feature definitions.
* **Event time feature name** is the name of the feature that stores the `EventTime` of a record
in the feature group. An `EventTime` is the point in time when a new event occurs that corresponds
to the creation or update of a record in the feature group. All records in the feature group must
have a corresponding `EventTime`.
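
Because both names must refer to features in the feature group, it can help to check your choices
against the column schema before creating anything. The following is a minimal sketch of such a
check. The `require_column` helper, the shortened `sample_schema`, and the `event_time` column are
all hypothetical and exist only for this illustration; which columns are appropriate depends on
your dataset.

```python
# Hypothetical helper: confirm that a chosen column exists in the schema.
def require_column(column_schema, column_name, purpose):
    """Return column_name if the schema contains it, otherwise raise ValueError."""
    names = [column["name"] for column in column_schema]
    if column_name not in names:
        raise ValueError(
            f"{column_name!r} is not a column in the schema; "
            f"cannot use it as the {purpose}."
        )
    return column_name


# Shortened, hypothetical schema used only to exercise the helper.
sample_schema = [
    {"name": "review_id", "type": "string"},
    {"name": "event_time", "type": "float"},  # assumed column, not in the original schema
]

record_identifier_name = require_column(sample_schema, "review_id", "record identifier")
event_time_feature_name = require_column(sample_schema, "event_time", "event time feature")
print(record_identifier_name, event_time_feature_name)
```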

In [None]:
# Set this to the name of a column in column_schema that uniquely identifies each record.
record_identifier_name = None
if record_identifier_name is None:
    raise RuntimeError("Select a column name as the feature group identifier.")

# Set this to the name of the column in column_schema that holds each record's event time.
event_time_feature_name = None
if event_time_feature_name is None:
    raise RuntimeError("Select a column name as the event time feature name.")

# Map the schema detected by Data Wrangler to Feature Store feature types.
feature_definitions = [
    {
        "FeatureName": schema['name'],
        "FeatureType": datawrangler_FG_type_mapping.get(
            schema['type'],
            default_FG_type
        )
    } for schema in column_schema
]
print(feature_definitions)


The following are your online and offline store configurations. You enable an online
store by setting `EnableOnlineStore` to `True`. The offline store is located in an
Amazon S3 bucket in your account. To use a different bucket, update the
`bucket` variable in the Parameters section of this notebook.
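
The two configurations are plain dictionaries, so they can be built and inspected before any call
to AWS is made. The following sketch shows one way to do that. The `build_store_configs` helper,
its `prefix` argument, and the bucket name are assumptions made for this illustration; the
dictionary keys are the same ones used in the cell that follows.

```python
# Hypothetical helper: build the online and offline store configurations.
def build_store_configs(bucket, prefix=None, enable_online_store=True):
    """Return (online_store_config, offline_store_config) as plain dictionaries."""
    # Assumption: an optional key prefix is appended to the bucket URI.
    s3_uri = f"s3://{bucket}" if prefix is None else f"s3://{bucket}/{prefix}"
    online_store_config = {"EnableOnlineStore": enable_online_store}
    offline_store_config = {"S3StorageConfig": {"S3Uri": s3_uri}}
    return online_store_config, offline_store_config


# "example-bucket" and "feature-store" are placeholder values.
online_config, offline_config = build_store_configs(
    "example-bucket", prefix="feature-store", enable_online_store=False
)
print(online_config)
print(offline_config)
```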

In [None]:
sagemaker_client = boto3.client("sagemaker", endpoint_url=sagemaker_endpoint_url)

# Online Store Configuration
online_store_config = {
    "EnableOnlineStore": True
}

# Offline Store Configuration
s3_uri = 's3://' + bucket # this is the default bucket defined in previous cells
offline_store_config = {
    "S3StorageConfig": {
        "S3Uri": s3_uri
    }
}

# Create Feature Group
create_fg_response = sagemaker_client.create_feature_group(
    FeatureGroupName = feature_group_name,
    EventTimeFeatureName = event_time_feature_name,
    RecordIdentifierFeatureName = record_identifier_name,
    FeatureDefinitions = feature_definitions,
    OnlineStoreConfig = online_store_config,
    OfflineStoreConfig = offline_store_config,
    RoleArn = iam_role)

# Describe Feature Group
status = sagemaker_client.describe_feature_group(FeatureGroupName=feature_group_name)
while status['FeatureGroupStatus'] != 'Created':
    if status['FeatureGroupStatus'] == 'CreateFailed':
        raise RuntimeError(f"Feature Group Creation Failed: {status}")
    print("Feature Group Status: " + status['FeatureGroupStatus'])
    # Wait before polling again so the loop exits as soon as the group is created.
    time.sleep(3)
    status = sagemaker_client.describe_feature_group(FeatureGroupName=feature_group_name)

print(status)
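
The polling loop above waits indefinitely if the feature group never leaves its intermediate state.
The following is a minimal sketch of a more defensive variant with a timeout. The `wait_for_status`
helper and all of its parameters are hypothetical, and the example drives it with a fake
`describe` function instead of a real SageMaker client so that it runs without AWS access.

```python
import time


# Hypothetical helper: poll a describe-style callable until a terminal status.
def wait_for_status(describe, status_key, success, failure,
                    poll_seconds=3, timeout_seconds=600, sleep=time.sleep):
    """Call describe() until response[status_key] equals success.

    Raises RuntimeError on the failure status and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe()
        state = response[status_key]
        if state == success:
            return response
        if state == failure:
            raise RuntimeError(f"Reached failure status: {response}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in status {state!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Fake describe function that reports two intermediate states, then success.
_fake_states = iter(["Creating", "Creating", "Created"])
result = wait_for_status(
    describe=lambda: {"FeatureGroupStatus": next(_fake_states)},
    status_key="FeatureGroupStatus",
    success="Created",
    failure="CreateFailed",
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
print(result)
```

With the client from this notebook, `describe` could be
`lambda: sagemaker_client.describe_feature_group(FeatureGroupName=feature_group_name)`.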


Use the following code cell to define helper functions for creating inputs to
a processing job.

In [None]:
def create_flow_notebook_processing_input(base_dir, flow_s3_uri):
    return {
        "InputName": "flow",
        "S3Input": {
            "LocalPath": f"{base_dir}/flow",
            "S3Uri": flow_s3_uri,
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
        },
    }

def create_s3_processing_input(base_dir, name, dataset_definition):
    return {
        "InputName": name,
        "S3Input": {
            "LocalPath": f"{base_dir}/{name}",
            "S3Uri": dataset_definition["s3ExecutionContext"]["s3Uri"],
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
        },
    }

def create_redshift_processing_input(base_dir, name, dataset_definition):
    return {
        "InputName": name,
        "DatasetDefinition": {
            "RedshiftDatasetDefinition": {
                "ClusterId": dataset_definition["clusterIdentifier"],
                "Database": dataset_definition["database"],
                "DbUser": dataset_definition["dbUser"],
                "QueryString": dataset_definition["queryString"],
                "ClusterRoleArn": dataset_definition["unloadIamRole"],
                "OutputS3Uri": f'{dataset_definition["s3OutputLocation"]}{name}/',
                "OutputFormat": dataset_definition["outputFormat"].upper(),
            },
            "LocalPath": f"{base_dir}/{name}",
        },
    }

def create_athena_processing_input(base_dir, name, dataset_definition):
    return {
        "InputName": name,
        "DatasetDefinition": {
            "AthenaDatasetDefinition": {
                "Catalog": dataset_definition["catalogName"],
                "Database": dataset_definition["databaseName"],
                "QueryString": dataset_definition["queryString"],
                "OutputS3Uri": f'{dataset_definition["s3OutputLocation"]}{name}/',
                "OutputFormat": dataset_definition["outputFormat"].upper(),
            },
            "LocalPath": f"{base_dir}/{name}",
        },
    }

def create_processing_inputs(processing_dir, flow, flow_uri):
    """Helper function for creating processing inputs
    :param processing_dir: base directory for inputs inside the processing container
    :param flow: loaded Data Wrangler flow (the parsed .flow JSON)
    :param flow_uri: S3 URI of the Data Wrangler .flow file
    """
    processing_inputs = []
    flow_processing_input = create_flow_notebook_processing_input(processing_dir, flow_uri)
    processing_inputs.append(flow_processing_input)

    for node in flow["nodes"]:
        if "dataset_definition" in node["parameters"]:
            data_def = node["parameters"]["dataset_definition"]
            name = data_def["name"]
            source_type = data_def["datasetSourceType"]

            if source_type == "S3":
                s3_processing_input = create_s3_processing_input(
                    processing_dir, name, data_def)
                processing_inputs.append(s3_processing_input)
            elif source_type == "Athena":
                athena_processing_input = create_athena_processing_input(
                    processing_dir, name, data_def)
                processing_inputs.append(athena_processing_input)
            elif source_type == "Redshift":
                redshift_processing_input = create_redshift_processing_input(
                    processing_dir, name, data_def)
                processing_inputs.append(redshift_processing_input)
            else:
                raise ValueError(f"{source_type} is not supported for Data Wrangler Processing.")
    return processing_inputs
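
Before submitting a job, it can be useful to see which data sources a flow references, since any
source type other than `S3`, `Athena`, or `Redshift` raises a `ValueError` in the helper above.
The following sketch walks a flow dictionary in the same way and lists each source. The
`summarize_flow_sources` helper and the `sample_flow` contents are hypothetical; only the key
names are taken from the helper functions above.

```python
# Hypothetical helper: list (name, source type) for every dataset in a flow.
def summarize_flow_sources(flow):
    """Return a list of (dataset name, dataset source type) tuples."""
    sources = []
    for node in flow["nodes"]:
        parameters = node.get("parameters", {})
        if "dataset_definition" in parameters:
            data_def = parameters["dataset_definition"]
            sources.append((data_def["name"], data_def["datasetSourceType"]))
    return sources


# Minimal, made-up flow with one source node and one non-source node.
sample_flow = {
    "nodes": [
        {"parameters": {"dataset_definition": {
            "name": "reviews.csv", "datasetSourceType": "S3"}}},
        {"parameters": {"operator": "hypothetical-transform"}},
    ]
}

for name, source_type in summarize_flow_sources(sample_flow):
    print(f"{name}: {source_type}")
```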

## Start ProcessingJob
Submit the processing job with the boto3 SageMaker client. The notebook then polls the job status
with the same client and waits until the job is no longer `InProgress`.

In [None]:
# Processing job name
print(f'Processing Job Name: {processing_job_name}')

processingResources = {
    'ClusterConfig': {
        'InstanceCount': 1,
        'InstanceType': 'ml.m5.4xlarge',
        'VolumeSizeInGB': 30
    }
}

appSpecification = {'ImageUri': container_uri}

sagemaker_client.create_processing_job(
        ProcessingInputs=create_processing_inputs(processing_dir, flow, flow_uri),
        ProcessingOutputConfig={
            'Outputs': [
                {
                    'OutputName': '76b4470a-6f43-46b8-be22-f0377a21c62d.default',
                    'FeatureStoreOutput': {
                        'FeatureGroupName': feature_group_name
                    },
                    'AppManaged': True
                }
            ],
        },
        ProcessingJobName=processing_job_name,
        ProcessingResources=processingResources,
        AppSpecification=appSpecification,
        RoleArn=iam_role
    )


status = sagemaker_client.describe_processing_job(ProcessingJobName=processing_job_name)

while status['ProcessingJobStatus'] in ('InProgress', 'Failed'):
    if status['ProcessingJobStatus'] == 'Failed':
        raise RuntimeError(f"Processing Job failed: {status}")
    status = sagemaker_client.describe_processing_job(ProcessingJobName=processing_job_name)
    print(status['ProcessingJobStatus'])
    time.sleep(60)

print(status)
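
Once the job has completed, one way to spot-check the online store is to read a record back with
the `get_record` operation of the boto3 `sagemaker-featurestore-runtime` client. The following
sketch only covers turning such a response into a dictionary. The `record_to_dict` helper and the
sample values are hypothetical, and the response shape (a `Record` list of `FeatureName` and
`ValueAsString` pairs) is stated here as an assumption to verify against the boto3 documentation.

```python
# Hypothetical helper: flatten a get_record-style response into a dictionary.
def record_to_dict(get_record_response):
    """Map each feature name to its string value; empty dict if no record."""
    return {
        feature["FeatureName"]: feature["ValueAsString"]
        for feature in get_record_response.get("Record", [])
    }


# Made-up response used in place of a real call such as:
#   runtime = boto3.client("sagemaker-featurestore-runtime")
#   response = runtime.get_record(
#       FeatureGroupName=feature_group_name,
#       RecordIdentifierValueAsString="<a record identifier value>",
#   )
sample_response = {
    "Record": [
        {"FeatureName": "review_id", "ValueAsString": "example-id"},
        {"FeatureName": "star_rating", "ValueAsString": "5"},
    ]
}
print(record_to_dict(sample_response))
```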


### Cleanup
Uncomment the following code cell to revert the SageMaker Python SDK to the version that was
installed before you ran this notebook. This notebook installs version 2.17.0 of the SageMaker
Python SDK, which may cause other example notebooks to break. To learn more about the changes
introduced in the SageMaker Python SDK 2.x update, see
[Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).

In [None]:
# _ = subprocess.check_call(
#     [sys.executable, "-m", "pip", "install", f"sagemaker=={original_version}"]
# )

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}