# Automated data transformation and ingestion from an Amazon S3 bucket to SageMaker Feature Store

## Architecture Overview
This notebook shows you how to use [AWS Service Catalog](https://aws.amazon.com/servicecatalog), [SageMaker Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html), and [Pipelines](https://aws.amazon.com/sagemaker/pipelines/) to create re-usable and portable components in SageMaker Studio.

This project automates feature transformations and ingestion into [SageMaker Feature Store](https://aws.amazon.com/sagemaker/feature-store/) whenever new data files are uploaded to an S3 bucket. The SageMaker project creates all necessary components and sets up the permissions and links between resources.

<img src="../design/feature-store-ingestion-pipeline.drawio.svg" style="background-color:white;" alt="solution overview" width="1000"/>

## Prerequisites
The following resources must be created before you can proceed with deployment of the SageMaker project:
- A Data Wrangler `.flow` file that contains an output node. The `.flow` file must be uploaded to a designated S3 prefix
- A feature group to store features extracted from the data
- The SageMaker project portfolio, deployed during the [initial setup](../README.md#deploy-sagemaker-project-portfolio)
- An S3 bucket where new data files will be uploaded

All these tasks are done in the [`00-setup` notebook](00-setup.ipynb). Please make sure you run through the setup notebook before running this one.

In [None]:
import sagemaker
import boto3
import time
import json
import os
from time import gmtime, strftime
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.session import Session

print(sagemaker.__version__)

In [None]:
# load environment variables from %store
%store -r 

In [None]:
%store

In [None]:
try:
    data_bucket
    dw_flow_file_url
    dw_output_name
    feature_group_name
    s3_fs_query_output_prefix
    s3_data_prefix
    s3_flow_prefix
    abalone_dataset_local_url
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 00-setup.ipynb notebook")
    print("+++++++++++++++++++++++++++++++++++++++++++++++")
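
The check above only reports that something is missing. As a minimal sketch, the helper below lists which names are absent. The function name `find_missing` and the example namespace are hypothetical; in the notebook you would pass `globals()` instead of a plain dictionary.

```python
# Names the setup notebook is expected to have stored (taken from the check above)
REQUIRED_NAMES = [
    "data_bucket",
    "dw_flow_file_url",
    "dw_output_name",
    "feature_group_name",
    "s3_fs_query_output_prefix",
    "s3_data_prefix",
    "s3_flow_prefix",
    "abalone_dataset_local_url",
]


def find_missing(namespace, required=REQUIRED_NAMES):
    """Return the required names that are absent from the given namespace."""
    return [name for name in required if name not in namespace]


# A plain dict stands in for globals() here; the values are made up
example_namespace = {"data_bucket": "my-bucket", "dw_output_name": "output-1"}
missing = find_missing(example_namespace)
print(f"Missing variables: {missing}")
```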

In [None]:
# Set the string literals
s3_input_data_prefix = f"{data_bucket}/feature-store-ingestion-pipeline/landing-zone/"
pipeline_name_prefix = "s3-fs-ingest-pipeline"

%store s3_input_data_prefix

In [None]:
print("Project parameters:")
print(f"S3 data prefix to monitor: {s3_input_data_prefix}")
print(f"Data Wrangler flow URL: {dw_flow_file_url}")
print(f"Data Wrangler output name: {dw_output_name}")
print(f"Feature group name: {feature_group_name}")

## Create data load project
⭐ You can create a project in the Studio IDE (Option 1) or programmatically in this notebook (Option 2). Option 2 is recommended because it requires no manual input. Option 1 is given to demonstrate the UX for project parameters.

### Option 1: Create a project in Studio

1. Select **Projects** from **SageMaker resources** widget:

<img src="../img/studio-create-project.png" alt="studio-create-project" width="400"/>

2. Navigate to **Organization templates** and select a project template for automated transformation and ingestion pipeline. Click on **Select project template**:

<img src="../img/studio-select-project-template.png" width="800"/>

3. Enter the project parameters:

<img src="../img/studio-enter-project-parameters.png" width="800"/>

The parameters are:
- **Project name and description**: provide your project name and description
- **Pipeline name prefix**: provide a prefix for the pipeline name or leave default
- **Pipeline description**: provide a description for your pipeline or leave default
- **S3 prefix**: set to the value of `s3_input_data_prefix` variable
- **Data Wrangler flow S3 url**: set to the value of `dw_flow_file_url` variable
- **Data Wrangler output name**: set to the value of `dw_output_name` variable
- **Feature group name**: set to the value of `feature_group_name` variable
- **Lambda execution role**: provide your own IAM role for the lambda function or leave at `Auto` to automatically create a new one

Click on **Create project**

<div class="alert alert-info"> 💡 <strong> Wait until project creation is completed </strong>
</div>
The banner "Creating project...":

<img src="../img/studio-creating-project-banner.png" alt="studio-creating-project-banner" width="500"/>

will change to the project details page:

<img src="../img/studio-project-created.png" width="800"/>

#### Get the name and id of the created project

<div class="alert alert-info"> 💡 <strong> Run the following cells only if you use Option 1 - create a project in Studio IDE </strong>
</div>

In [None]:
# Get the latest created project
sm = boto3.client("sagemaker")
r = sm.list_projects(SortBy="CreationTime", SortOrder="Descending")

In [None]:
r

In [None]:
if r.get("ProjectSummaryList") is None or len(r.get("ProjectSummaryList")) == 0:
    raise Exception("[ERROR]: cannot retrieve the project list!")
    
# Compare with != : ("CreateCompleted") is a plain string, so `not in` would do a substring test
if r["ProjectSummaryList"][0]["ProjectStatus"] != "CreateCompleted":
    raise Exception("[ERROR]: wait until project creation is completed!")
else:
    project_name = r["ProjectSummaryList"][0]["ProjectName"]
    project_id = r["ProjectSummaryList"][0]["ProjectId"]

### End of Option 1 section
---

### Option 2: Create project in code - recommended
<div class="alert alert-info"> 💡 <strong> Skip this section if you created a project via Studio IDE </strong>
</div>

You can use the [boto3 Python SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_project) to create a new project from the notebook or any Python code.
First, get the `ProvisioningArtifactIds` and `ProductId` values from the outputs of the Service Catalog portfolio CloudFormation stack:

In [None]:
cf = boto3.client("cloudformation")

r = cf.describe_stacks(StackName="sm-project-sc-portfolio")

Set parameters for the SageMaker project:

In [None]:
sm = boto3.client("sagemaker")

provisioning_artifact_ids = [v for v in r["Stacks"][0]["Outputs"] if v["OutputKey"] == "ProvisioningArtifactIds"][0]["OutputValue"]
product_id = [v for v in r["Stacks"][0]["Outputs"] if v["OutputKey"] == "ProductId"][0]["OutputValue"]
project_name = f"s3-fs-ingest-{strftime('%d-%H-%M-%S', gmtime())}"
project_parameters = [
            {
                'Key': 'PipelineDescription',
                'Value': 'Feature Store ingestion pipeline'
            },
            {
                'Key': 'DataWranglerFlowUrl',
                'Value': dw_flow_file_url
            },
            {
                'Key': 'DataWranglerOutputName',
                'Value': dw_output_name
            },
            {
                'Key': 'S3DataPrefix',
                'Value': s3_input_data_prefix
            },
            {
                'Key': 'FeatureGroupName',
                'Value': feature_group_name
            },
            {
                'Key': 'PipelineNamePrefix',
                'Value': pipeline_name_prefix
            },
        ]

Finally, create a SageMaker project from the service catalog product template:

In [None]:
# create SageMaker project
r = sm.create_project(
    ProjectName=project_name,
    ProjectDescription="Feature Store ingestion from S3",
    ServiceCatalogProvisioningDetails={
        'ProductId': product_id,
        'ProvisioningArtifactId': provisioning_artifact_ids,
        'ProvisioningParameters': project_parameters
    },
)

print(r)
project_id = r["ProjectId"]

<div class="alert alert-info"> 💡 <strong> Wait until project creation is completed </strong>
</div>
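
Project creation is asynchronous, so the wait can be scripted. The following is a minimal sketch, assuming the `describe_project` call of the boto3 SageMaker client and its `ProjectStatus` field. The function name `wait_for_project`, the polling interval, and the poll limit are illustrative choices and are not part of the project template.

```python
import time


def wait_for_project(sm_client, project_name, poll_seconds=10, max_polls=60):
    """Poll describe_project until the project is created or creation fails."""
    for _ in range(max_polls):
        status = sm_client.describe_project(ProjectName=project_name)["ProjectStatus"]
        print(f"Project status: {status}")
        if status == "CreateCompleted":
            return status
        if status == "CreateFailed":
            raise RuntimeError(f"Creation of project {project_name} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Project {project_name} was not created after {max_polls} polls")


# Usage in the notebook (assumes `sm` and `project_name` from the cells above):
# wait_for_project(sm, project_name)
```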

### End of Option 2 section
---

## Working with data ingestion project

### Project resources
The project template creates all resources needed for automated data transformation and ingestion:
- [EventBridge rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rules.html) for launching an AWS Lambda function whenever any new data is uploaded to the specified S3 prefix
- AWS Lambda function which launches the SageMaker pipeline
- SageMaker pipeline which runs a processing job using a DataWrangler processor
- DataWrangler processor which uses an uploaded `.flow` file with data transformation workflow
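
To make the chain from upload event to pipeline run more concrete, here is a hedged sketch of what such a Lambda function could look like. It is not the code delivered by the project template. It assumes a CloudTrail-style EventBridge event with the bucket and key under `detail.requestParameters`, and it uses the `start_pipeline_execution` call of the boto3 SageMaker client. The helper names and the `PIPELINE_NAME` environment variable are hypothetical; `InputDataUrl` is the pipeline parameter used later in this notebook.

```python
def build_pipeline_parameters(event):
    """Turn an S3 upload event into SageMaker pipeline parameters.

    Assumes a CloudTrail-style EventBridge event, where the bucket and key
    are found under detail.requestParameters.
    """
    request = event["detail"]["requestParameters"]
    input_data_url = f"s3://{request['bucketName']}/{request['key']}"
    return [{"Name": "InputDataUrl", "Value": input_data_url}]


def start_ingestion(sm_client, pipeline_name, event):
    """Start one pipeline execution for the uploaded object and return its ARN."""
    response = sm_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineParameters=build_pipeline_parameters(event),
    )
    return response["PipelineExecutionArn"]


# Hypothetical Lambda entry point; PIPELINE_NAME is an assumed environment variable:
# def lambda_handler(event, context):
#     import os, boto3
#     return start_ingestion(boto3.client("sagemaker"), os.environ["PIPELINE_NAME"], event)
```

Passing the client as an argument keeps the logic testable without AWS credentials.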

### CodeCommit repository with seed code
All source code for pipeline creation and pipeline parameter configuration is delivered as an [AWS CodeCommit](https://aws.amazon.com/codecommit/) repository. The code is fully functional and works out-of-the-box. You own this code and can change any configuration or parameters of the pipeline according to your requirements.

To start working with the code you must clone the repository into Studio user's home directory:

<img src="../img/studo-project-clone-repo.png" alt="studo-project-clone-repo" width="800"/>

You can change the source code and push your changes to the CodeCommit repository. The project also delivers an [AWS CodePipeline](https://aws.amazon.com/codepipeline/) CI/CD pipeline which launches an [AWS CodeBuild](https://aws.amazon.com/codebuild/) stage whenever there is a new commit in the repository. The build pulls the code from the repository and calls the `create_pipeline` function (see the file `build.py`). You can modify the existing code or provide your own implementation of `pipeline.create_pipeline` in the file `pipeline.py`. The default code configures a SageMaker pipeline with a Data Wrangler processor and upserts the pipeline.

### SageMaker pipeline
The project delivers a SageMaker pipeline consisting of one processing step with a Data Wrangler processor. The pipeline performs the transformation contained in a specified Data Wrangler `.flow` file and ingests the transformed features into a specified feature group in Feature Store.
This pipeline is launched by a Lambda function whenever there is a new file uploaded to the specified S3 location. The pipeline is linked to the project and available in the **Pipeline** tab of the project details page:

<img src="../img/studio-project-details-pipelines.png" alt="studio-project-details-pipelines" width="800"/>

From there you can see the pipeline graph, parameters, settings, and the execution history:

<img src="../img/studio-pipeline-execution-history.png" alt="studio-pipeline-execution-history" width="800"/>

You can also start a new execution manually from Studio by clicking on **Start an execution** and providing the pipeline parameters:

<img src="../img/studio-pipeline-parameter-input.png" alt="studio-pipeline-parameter-input" width="500"/>

## Test the automation pipeline

To test the deployed data transformation and feature store ingestion pipeline, perform the following steps:
1. Upload a data file to the monitored S3 prefix location - this will launch the data transformation and ingestion via our data pipeline
1. Monitor the pipeline execution
1. Check the loaded data in the feature group

### Upload data to S3 bucket

⭐ The EventBridge rule monitors two S3 events: `PutObject` and `CompleteMultipartUpload`. If you copy an object between two S3 buckets, the EventBridge rule won't be triggered.
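
For illustration, a rule like this is often expressed as an EventBridge event pattern over CloudTrail-recorded S3 API calls. The sketch below is an assumption about the general shape of such a pattern, not the rule definition from the project template. The function name and the bucket name are hypothetical. A bucket-to-bucket copy is recorded under the `CopyObject` API name, which is not in the list and therefore does not match.

```python
import json


def build_s3_upload_pattern(bucket_name, key_prefix):
    """Build an EventBridge event pattern for S3 uploads recorded by CloudTrail."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            # Only these two API calls match; CopyObject is intentionally absent
            "eventName": ["PutObject", "CompleteMultipartUpload"],
            "requestParameters": {
                "bucketName": [bucket_name],
                "key": [{"prefix": key_prefix}],
            },
        },
    }


pattern = build_s3_upload_pattern(
    "my-data-bucket",  # hypothetical bucket name
    "feature-store-ingestion-pipeline/landing-zone/",
)
print(json.dumps(pattern, indent=2))
```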

The following S3 `PUT` operation launches the Lambda function, which starts a new pipeline execution:

In [None]:
file_name = f"abalone-{strftime('%d-%H-%M-%S', gmtime())}.csv"

In [None]:
!aws s3 cp {abalone_dataset_local_url} s3://{s3_input_data_prefix}{file_name}
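
The same upload can be done from Python instead of the AWS CLI. This is a minimal sketch, assuming that `abalone_dataset_local_url` is a local file path and that `s3_input_data_prefix` has the form `bucket/prefix/` as set earlier in this notebook. It uses the `upload_file` method of the boto3 S3 client; the helper names are hypothetical. For large files `upload_file` may switch to a multipart upload, which ends with `CompleteMultipartUpload` rather than `PutObject`.

```python
def split_s3_prefix(s3_prefix):
    """Split 'bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    bucket, _, key_prefix = s3_prefix.partition("/")
    return bucket, key_prefix


def upload_to_landing_zone(s3_client, local_path, s3_prefix, file_name):
    """Upload a local file under the monitored prefix and return its S3 URL."""
    bucket, key_prefix = split_s3_prefix(s3_prefix)
    key = f"{key_prefix}{file_name}"
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"


# Usage in the notebook (assumes the variables defined in the cells above):
# import boto3
# upload_to_landing_zone(boto3.client("s3"), abalone_dataset_local_url, s3_input_data_prefix, file_name)
```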

### Monitor pipeline execution

In [None]:
try:
    project_id
    project_name
except NameError:
    raise Exception("[ERROR]: project_id or project_name variables are not set")
    
if project_id is None or project_name is None:
    raise Exception("[ERROR]: project_id or project_name variables are not set")

In [None]:
# Get the project data
r = sm.describe_project(ProjectName=project_name)

# Get the pipeline prefix from the project parameters
pipeline_name_prefix = [p for p in r["ServiceCatalogProvisioningDetails"]["ProvisioningParameters"] if p["Key"] == "PipelineNamePrefix"][0]["Value"]

In [None]:
pipeline_name_prefix

In [None]:
# set the pipeline name
s3_to_fs_pipeline_name = f"{pipeline_name_prefix}-{project_id}"

%store s3_to_fs_pipeline_name

In [None]:
# check pipeline execution 
summaries = sm.list_pipeline_executions(PipelineName=s3_to_fs_pipeline_name).get('PipelineExecutionSummaries')
summaries

In [None]:
latest_execution = sm.list_pipeline_executions(PipelineName=s3_to_fs_pipeline_name).get('PipelineExecutionSummaries')[0].get('PipelineExecutionArn')
print(latest_execution)

In [None]:
# Poll until the pipeline execution leaves the 'Executing' status
while sm.describe_pipeline_execution(PipelineExecutionArn=latest_execution)["PipelineExecutionStatus"] == "Executing":
    print('Pipeline is in Executing status...')
    time.sleep(30)

print('Pipeline is no longer in Executing status')
print(sm.describe_pipeline_execution(PipelineExecutionArn=latest_execution))
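
The loop above stops as soon as the status is no longer `Executing`, whatever the outcome. A slightly stricter variant, sketched below, waits for a terminal status and raises if the run did not succeed. It assumes the `PipelineExecutionStatus` and `FailureReason` fields returned by `describe_pipeline_execution`. The function name, polling interval, and poll limit are illustrative.

```python
import time

# Statuses after which the execution will not change any more
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_execution(sm_client, execution_arn, poll_seconds=30, max_polls=120):
    """Poll a pipeline execution until it ends; raise unless it succeeded."""
    for _ in range(max_polls):
        response = sm_client.describe_pipeline_execution(PipelineExecutionArn=execution_arn)
        status = response["PipelineExecutionStatus"]
        print(f"Pipeline execution status: {status}")
        if status in TERMINAL_STATUSES:
            if status != "Succeeded":
                reason = response.get("FailureReason", "no reason given")
                raise RuntimeError(f"Pipeline execution ended as {status}: {reason}")
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Pipeline execution still running after {max_polls} polls")


# Usage in the notebook (assumes `sm` and `latest_execution` from the cells above):
# wait_for_execution(sm, latest_execution)
```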

Alternatively, you can monitor the pipeline execution inside the Pipeline widget of Studio:

![](../img/studio-pipeline-executing.png)

### Check the loaded data
Once the execution completes, we can check that the data is loaded into the feature group.

Create a feature group object:

In [None]:
feature_store_session = Session()

feature_group = FeatureGroup(
    name=feature_group_name, 
    sagemaker_session=feature_store_session
)

In [None]:
# Build the SQL query for the feature group
fs_query = feature_group.athena_query()

query_string = f'SELECT * FROM "{fs_query.table_name}"'
print(f'Prepared query {query_string}')
print(fs_query)

In [None]:
# Run Athena query. The output is loaded to a Pandas dataframe.
fs_query.run(
    query_string=query_string, 
    output_location=f"s3://{s3_fs_query_output_prefix}"
)

fs_query.wait()
data_df = fs_query.as_dataframe()

The `DataFrame` now contains all features from the feature group:

In [None]:
data_df

### Start pipeline run via SDK
You can start the data transformation and ingestion pipeline on demand using the [SageMaker SDK](https://sagemaker.readthedocs.io/en/v2.57.0/workflows/pipelines/index.html). The `pipeline.start` function lets you provide parameter values that override the defaults for that pipeline execution.

In [None]:
# get Pipeline object
pipeline = Pipeline(name=s3_to_fs_pipeline_name)

In [None]:
# start execution with the specified parameters
execution = pipeline.start(
    parameters=dict(
        # Use the file uploaded in the "Upload data to S3 bucket" step above
        InputDataUrl=f"s3://{s3_input_data_prefix}{file_name}",
        InputFlowUrl=dw_flow_file_url,
        FlowOutputName=dw_output_name,
        FeatureGroupName=feature_group_name
    )
)

In [None]:
execution.wait()

In [None]:
execution.list_steps()
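
The list returned by `execution.list_steps()` can be verbose. As a small sketch, the helper below reduces it to one status per step. It assumes each entry carries `StepName` and `StepStatus`, plus `FailureReason` for failed steps. The function name and the sample step name are hypothetical.

```python
def summarize_steps(steps):
    """Map each step name to its status, appending the failure reason if present."""
    summary = {}
    for step in steps:
        status = step["StepStatus"]
        reason = step.get("FailureReason")
        summary[step["StepName"]] = f"{status}: {reason}" if reason else status
    return summary


# Made-up sample in the assumed shape; in the notebook pass execution.list_steps()
sample_steps = [{"StepName": "DataWranglerProcessingStep", "StepStatus": "Succeeded"}]
print(summarize_steps(sample_steps))
```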

### Change the default values for pipeline parameters
To change the default values of the parameters, edit the `pipeline.py` file, which contains the pipeline and parameter definitions:
```python
    # setup pipeline parameters
    p_processing_instance_count = ParameterInteger(
        name="ProcessingInstanceCount",
        default_value=1
    )
    p_processing_instance_type = ParameterString(
        name="ProcessingInstanceType",
        default_value="ml.m5.4xlarge"
    )
    p_processing_volume_size = ParameterInteger(
        name="ProcessingVolumeSize",
        default_value=50
    )
    p_flow_output_name = ParameterString(
        name='FlowOutputName',
        default_value=flow_output_name
    )
    p_input_flow = ParameterString(
        name='InputFlowUrl',
        default_value=data_wrangler_flow_s3_url
    )
    p_input_data = ParameterString(
        name="InputDataUrl",
        default_value=input_data_s3_url
    )
    p_feature_group_name = ParameterString(
        name="FeatureGroupName",
        default_value=feature_group_name
    )
```

The CI/CD CodePipeline pipeline will be automatically started after you commit and push the changes into the project's source code repository.
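
Changing defaults requires a commit and a CI/CD run. For a one-off execution, the same parameters can instead be overridden when the pipeline is started. The sketch below validates such overrides against the parameter names and types from the excerpt above before they are passed to `pipeline.start`. The helper `build_overrides` and the example values are hypothetical.

```python
# Parameter names and types as they appear in the pipeline.py excerpt above
PIPELINE_PARAMETER_TYPES = {
    "ProcessingInstanceCount": int,
    "ProcessingInstanceType": str,
    "ProcessingVolumeSize": int,
    "FlowOutputName": str,
    "InputFlowUrl": str,
    "InputDataUrl": str,
    "FeatureGroupName": str,
}


def build_overrides(**overrides):
    """Check per-execution overrides against the known parameter names and types."""
    for name, value in overrides.items():
        expected = PIPELINE_PARAMETER_TYPES.get(name)
        if expected is None:
            raise KeyError(f"Unknown pipeline parameter: {name}")
        if not isinstance(value, expected):
            raise TypeError(f"{name} expects {expected.__name__}, got {type(value).__name__}")
    return dict(overrides)


overrides = build_overrides(ProcessingInstanceType="ml.m5.xlarge", ProcessingInstanceCount=2)
print(overrides)
# In the notebook: execution = pipeline.start(parameters=overrides)
```

Catching a misspelled parameter name locally is cheaper than waiting for the pipeline start call to reject it.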

## Release resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

Proceed to the [`99-clean-up` notebook](99-clean-up.ipynb).