# Automating feature transformations with SageMaker Data Wrangler, Pipelines, and Feature Store

1. Add permissions to Amazon SageMaker
2. Create a Data Wrangler flow
3. Export the flow and create feature groups
4. Ingest historical data into Feature Store
5. Set up a SageMaker pipeline and Lambda functions triggered by new data in S3

The first part of this blog post walks through creating a set of feature transformations on the flight delay dataset, then exporting that flow to a generated notebook that creates feature groups and ingests historical flight data into Feature Store. This notebook forms the second half of the example: you will create a SageMaker pipeline and a Lambda function that automate the feature transformations and Feature Store ingestion for new data each day.

## Add permissions for Amazon SageMaker
As part of the automation in this notebook, you will create IAM roles for AWS Lambda. To do that, Amazon SageMaker first needs permission to create and manage IAM roles. Grant it by adding the following as an [inline policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-edit.html#edit-inline-policy-console) to your Amazon SageMaker role.

In [None]:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IAMPolicy",
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:AttachRolePolicy",
                "iam:CreateRole"
            ],
            "Resource": "*"
        },
        {
            "Sid": "IAMPassRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "lambda.amazonaws.com",
                        "sagemaker.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "LambdaFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:UpdateFunctionCode",
                "lambda:AddPermission",
                "sts:GetCallerIdentity"
            ],
            "Resource": "*"
        }
    ]
}
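
If you prefer to attach the policy from code rather than the console, the following is a minimal sketch. It assumes you run it with credentials that are allowed to call `iam:PutRolePolicy` (typically an administrator, not the SageMaker role itself). The helper names `list_actions` and `attach_inline_policy`, and the trimmed `example_policy`, are illustrative and not part of the original notebook.

```python
import json


def list_actions(policy):
    """Return every action named in the policy, in statement order."""
    actions = []
    for statement in policy["Statement"]:
        value = statement["Action"]
        # "Action" may be a single string or a list of strings.
        actions.extend([value] if isinstance(value, str) else value)
    return actions


def attach_inline_policy(iam_client, role_name, policy_name, policy):
    """Attach `policy` (a dict) to `role_name` as an inline policy."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


# Trimmed stand-in document; in practice, load the full policy shown above.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LambdaFunction",
            "Effect": "Allow",
            "Action": ["lambda:CreateFunction", "lambda:AddPermission"],
            "Resource": "*",
        }
    ],
}

# Quick review of what the document grants before attaching it.
print(list_actions(example_policy))
```

A call would look like `attach_inline_policy(boto3.client("iam"), "<your-sagemaker-role-name>", "dw-fs-automation-policy", policy)`, where both names are placeholders you choose.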

## Create a SageMaker Pipeline from the Data Wrangler Flow

In [None]:
# SageMaker Python SDK version 2.x is required; this notebook pins version 2.20.0.
import importlib
import subprocess
import sys

import sagemaker

original_version = sagemaker.__version__
if original_version != "2.20.0":
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "sagemaker==2.20.0"]
    )
    # Reload the module so the running kernel picks up the new install.
    importlib.reload(sagemaker)
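
The cell above states that SDK 2.x is required but compares against one exact version. As a rough sketch of how the two checks differ, the helpers below parse a version string and test either condition. The function names are hypothetical, and the parsing is deliberately simple (it only reads the leading digits of the first three components).

```python
from itertools import takewhile


def parse_version(version):
    """Turn '2.20.0' into (2, 20, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def needs_install(installed, pinned="2.20.0"):
    """True when the installed version differs from the pinned one."""
    return parse_version(installed) != parse_version(pinned)


def is_major_v2(installed):
    """True when the installed SDK belongs to the 2.x series."""
    return parse_version(installed)[0] == 2


print(needs_install("2.19.0"), is_major_v2("2.19.0"))
```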

In [None]:
import os
import uuid
import json
import time
import boto3
import sagemaker

In [None]:
# Update the S3 bucket to the one containing the dataset and flow file.
bucket = None
if bucket is None:
    raise RuntimeError("Please add the name of the bucket that contains the data and flow files.")

# Update Flow URI from 4th cell of the Jupyter notebook that was exported by Amazon SageMaker Data Wrangler.
flow_uri = None
if flow_uri is None:
    raise RuntimeError("Please add the flow URI from the Jupyter notebook that was exported by Amazon SageMaker Data Wrangler.")

# Update Feature Group name from 5th cell of the Jupyter notebook that was exported by Amazon SageMaker Data Wrangler.
feature_group_name = None
if feature_group_name is None:
    raise RuntimeError("Please add the Feature Group name from the Jupyter notebook that was exported by Amazon SageMaker Data Wrangler.")

# If you updated prefix in the other Jupyter notebook, update it here as well.
prefix = "data_wrangler_flows"
# The flow ID is the flow file name (the part after the prefix) without its extension.
flow_id = flow_uri.split(prefix)[1][1:].split(".")[0]

sess = sagemaker.Session()

iam_role = sagemaker.get_execution_role()

container_uri = "663277389841.dkr.ecr.us-east-1.amazonaws.com/sagemaker-data-wrangler-container:1.1.1"

# Processing Job Resources Configurations
processing_job_name = f"data-wrangler-feature-store-processing-{flow_id}"
processing_dir = "/opt/ml/processing"

# URL to use for sagemaker client.
# If this is None, boto will automatically construct the appropriate URL to use when communicating with sagemaker.
sagemaker_endpoint_url = None
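
To make the `flow_id` derivation concrete, here is a small sketch that performs the same extraction on a made-up URI. The bucket name and file name in `example_uri` are hypothetical, and `extract_flow_id` is an illustrative helper rather than something the exported notebook defines.

```python
def extract_flow_id(flow_uri, prefix="data_wrangler_flows"):
    """Return the flow file name without its extension."""
    # Split at the first occurrence of the prefix; `remainder` is what follows it.
    _, found, remainder = flow_uri.partition(prefix)
    if not found:
        raise ValueError(f"{flow_uri!r} does not contain the prefix {prefix!r}")
    # Drop the leading slash, then keep everything before the first dot.
    return remainder.lstrip("/").split(".")[0]


example_uri = "s3://my-example-bucket/data_wrangler_flows/flow-01-02-03-abcd1234.flow"
flow_id = extract_flow_id(example_uri)
print(flow_id)
print(f"data-wrangler-feature-store-processing-{flow_id}")
```

Raising an error when the prefix is missing gives a clearer message than the `IndexError` the one-line expression would produce.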

In [None]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
)
processing_instance_count = ParameterInteger(
    name="ProcessingInstanceCount",
    default_value=1
)
processing_instance_type = ParameterString(
    name="ProcessingInstanceType",
    default_value="ml.m5.4xlarge"
)
input_flow = ParameterString(
    name='InputFlow',
    default_value=flow_uri
)

In [None]:
from sagemaker.processing import Processor

processor = Processor(
    role=iam_role,
    image_uri=container_uri,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type
)

In [None]:
from sagemaker.processing import FeatureStoreOutput
feature_store_output = FeatureStoreOutput(feature_group_name=feature_group_name)
feature_store_output

In [None]:
output_name = "26b67c19-4e8b-401b-a817-db82c20a17ed.default"
output_content_type = "CSV"

def create_container_arguments(output_name, output_content_type):
    output_config = {
        output_name: {
            "content_type": output_content_type
        }
    }
    return [f"--output-config '{json.dumps(output_config)}'"]
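
To see what this helper hands to the container, the sketch below repeats the function so it runs on its own, prints the single argument string, and then parses it back to confirm that the quoted payload is valid JSON. Splitting with `shlex` is only a way to check the string here; how the Data Wrangler container itself parses the argument is not something this notebook shows.

```python
import json
import shlex


def create_container_arguments(output_name, output_content_type):
    output_config = {output_name: {"content_type": output_content_type}}
    return [f"--output-config '{json.dumps(output_config)}'"]


args = create_container_arguments(
    "26b67c19-4e8b-401b-a817-db82c20a17ed.default", "CSV"
)
print(args[0])
# --output-config '{"26b67c19-4e8b-401b-a817-db82c20a17ed.default": {"content_type": "CSV"}}'

# Split shell-style into the flag and its quoted JSON payload, then decode it.
flag, payload = shlex.split(args[0])
config = json.loads(payload)
print(flag, config)
```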

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="DailyFlightDataETL",
    processor=processor,
    inputs=[
        ProcessingInput(
            input_name="flow",
            destination="/opt/ml/processing/flow",
            source=input_flow,
            s3_data_type="S3Prefix",
            s3_input_mode="File",
        )
    ],
    outputs=[
        # Reuse output_name from the previous cell so the step output
        # and the container arguments always refer to the same node.
        ProcessingOutput(
            output_name=output_name,
            app_managed=True,
            feature_store_output=feature_store_output,
        )
    ],
    job_arguments=create_container_arguments(output_name, output_content_type),
)

In [None]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = 'dw-fs-automation'

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type, 
        processing_instance_count,
        input_flow
    ],
    steps=[step_process],
    sagemaker_session=sess
)

In [None]:
import json

definition = json.loads(pipeline.definition())
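
Once the definition is parsed, a quick summary helps confirm that the pipeline contains the parameters and step you expect. The sketch below assumes the parsed definition has top-level `Parameters` and `Steps` lists whose entries carry `Name` and `Type` keys. The `sample_definition` is a hand-written stand-in, not real output from `pipeline.definition()`.

```python
def summarize_definition(definition):
    """List parameter names and (step name, step type) pairs."""
    parameters = [p.get("Name") for p in definition.get("Parameters", [])]
    steps = [(s.get("Name"), s.get("Type")) for s in definition.get("Steps", [])]
    return {"parameters": parameters, "steps": steps}


# Hand-written stand-in for the parsed pipeline definition.
sample_definition = {
    "Parameters": [
        {"Name": "ProcessingInstanceType", "Type": "String"},
        {"Name": "ProcessingInstanceCount", "Type": "Integer"},
        {"Name": "InputFlow", "Type": "String"},
    ],
    "Steps": [
        {"Name": "DailyFlightDataETL", "Type": "Processing", "Arguments": {}},
    ],
}

print(summarize_definition(sample_definition))
```

In the notebook you would call `summarize_definition(definition)` on the value parsed above.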

In [None]:
# Create the pipeline, or update it if it already exists, using the execution role fetched earlier.
pipeline.upsert(role_arn=iam_role)

In [None]:
execution = pipeline.start(
    execution_description='removing all ref to output name',
    parameters={'InputFlow': flow_uri},
)
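
Starting an execution returns immediately, so before wiring up the automation it can help to wait for this first run to finish. The following is a minimal polling sketch. It assumes that `execution.describe()` returns a dict with a `PipelineExecutionStatus` key, as the underlying `DescribePipelineExecution` API does. `wait_for_execution` is a hypothetical helper, and the polling interval and limit are arbitrary.

```python
import time

# Statuses after which the execution will not change any further.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_execution(describe, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll `describe()` until the execution reaches a terminal status."""
    for _ in range(max_polls):
        status = describe()["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Execution still running after {max_polls} polls")


# Demonstration with canned responses instead of a real execution.
canned = iter(
    [
        {"PipelineExecutionStatus": "Executing"},
        {"PipelineExecutionStatus": "Succeeded"},
    ]
)
print(wait_for_execution(lambda: next(canned), poll_seconds=0))
```

In the notebook, the call would be `wait_for_execution(execution.describe)`.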

## Set Up Automation

### Import Libraries

In [None]:
import boto3
from zipfile import ZipFile
import time
import inspect
import Utils

### Set Up Variables
#### We now set the variables that will be used to set up the automation. The default placeholder values will work, but you can change them if you wish.

In [None]:
# Compute the timestamp once so the role and function names share the same suffix.
timestamp = time.strftime('%d-%H-%M-%S', time.gmtime())
role_name = f"sm-lambda-role-{timestamp}"
fcn_name = f"sm-lambda-fcn-{timestamp}"
iam_desc = 'IAM Policy for Lambda triggering AWS SageMaker Pipeline'
fcn_desc = 'AWS Lambda function for automatically triggering AWS SageMaker Pipeline'
bucket_arn = f"arn:aws:s3:::{bucket}"
account_num = boto3.client('sts').get_caller_identity()['Account']

### Set Up IAM Roles
#### AWS Lambda needs permissions to be able to call other AWS services. These permissions are provided by IAM roles. We first create the IAM role that will be assumed by AWS Lambda and then assign permissions to it.

In [None]:
# Create IAM role for the Lambda function
new_role = Utils.create_role(role_name, iam_desc)

# Wait for the IAM role to become active
print('Pause for 10 seconds ...')
for i in range(10, 0, -1):
    print(f'Resuming in {i} seconds')
    time.sleep(1)
print('Resuming now!')

# Add permissions to the IAM role
Utils.add_permissions(new_role['name'])
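
The body of `Utils.create_role` is not shown in this notebook. As a rough idea of what such a helper involves, the sketch below builds a trust policy that lets the Lambda service assume the role and passes it to the IAM `create_role` call. The function names are hypothetical, and the returned dict merely mirrors the `name` and `arn` keys used above. The real helper may differ.

```python
import json


def lambda_trust_policy():
    """Trust policy allowing the AWS Lambda service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_lambda_role(iam_client, role_name, description):
    """Create the role and return its name and ARN."""
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy()),
        Description=description,
    )
    role = response["Role"]
    return {"name": role["RoleName"], "arn": role["Arn"]}
```

A call would look like `create_lambda_role(boto3.client("iam"), role_name, iam_desc)`. The trust policy only controls who may assume the role; what the role may do is granted separately, which is the job of `Utils.add_permissions` above.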

### Set Up AWS Lambda Function
#### We need AWS Lambda to automatically trigger Amazon SageMaker Pipelines whenever a new dataset arrives in Amazon S3. The detailed code is available in `Utils.py`. Once the function code is generated, we zip it into a deployment package ready for upload to AWS Lambda, then create the AWS Lambda function using the IAM role created earlier.

In [None]:
# Create code for AWS Lambda function
lambda_code = Utils.create_lambda_fcn(bucket, flow_uri, pipeline_name)

# Write the code to a .py file
with open('lambda_function.py', 'w') as f:
    f.write(inspect.cleandoc(lambda_code))

# Compress the file into a zip deployment package
with ZipFile('function.zip', 'w') as z:
    z.write('lambda_function.py')

# Create AWS Lambda function from the zipped code
with open('function.zip', 'rb') as f:
    fcn_code = f.read()
new_lambda_arn = Utils.create_lambda(fcn_name, fcn_desc, fcn_code, new_role['arn'])
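
The function code generated by `Utils.create_lambda_fcn` is not shown here. As an illustration of the general shape such a handler could take, the sketch below reads the bucket and key from each S3 event record and starts the pipeline with the `InputFlow` parameter. This is an assumption about the approach, not the code in `Utils.py`; in particular, the real function may also update the flow to point at the newly uploaded file. `make_handler` is a hypothetical factory that takes the SageMaker client as an argument so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus


def make_handler(sm_client, pipeline_name, flow_uri):
    """Build a Lambda handler that starts the pipeline for each S3 record."""

    def lambda_handler(event, context):
        started = []
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            response = sm_client.start_pipeline_execution(
                PipelineName=pipeline_name,
                PipelineExecutionDescription=f"Triggered by s3://{bucket}/{key}",
                PipelineParameters=[{"Name": "InputFlow", "Value": flow_uri}],
            )
            started.append(response["PipelineExecutionArn"])
        return {"started": started}

    return lambda_handler
```

Inside a deployed function you would typically build the handler once at module level, for example `lambda_handler = make_handler(boto3.client("sagemaker"), "dw-fs-automation", "<flow-uri>")`.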

### Set Up Amazon S3
#### Lastly, we set up Amazon S3 to trigger AWS Lambda whenever a new CSV file is uploaded to the bucket specified earlier.

In [None]:
# Add permission for Amazon S3 to trigger AWS Lambda
Utils.allow_s3(fcn_name, bucket_arn, account_num)

# Set up new CSV upload notifications on Amazon S3
Utils.add_notif(bucket, new_lambda_arn)
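
The two `Utils` helpers above wrap calls that are not shown in this notebook. The sketch below gives one plausible shape for them, using the Lambda `add_permission` and S3 `put_bucket_notification_configuration` APIs. The function names, the `StatementId` value, and the `.csv` suffix filter are assumptions for illustration.

```python
def csv_upload_notification(lambda_arn, suffix=".csv"):
    """Notification config that invokes the function for new objects ending in `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def allow_s3_to_invoke(lambda_client, function_name, bucket_arn, account_id):
    """Let S3, for this bucket and account only, invoke the function."""
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId="s3-invoke",
        Action="lambda:InvokeFunction",
        Principal="s3.amazonaws.com",
        SourceArn=bucket_arn,
        SourceAccount=account_id,
    )


def add_notification(s3_client, bucket, lambda_arn):
    """Apply the CSV upload notification to the bucket."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=csv_upload_notification(lambda_arn),
    )
```

Two points are worth keeping in mind with this approach. The permission has to be in place before the notification is added, which matches the order of the calls above. Also, `put_bucket_notification_configuration` replaces the bucket's existing notification configuration, so a bucket that already has notifications needs them merged into the new document first.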

### Completion
#### You have now successfully set up the automation. To test it, upload a CSV file to your Amazon S3 bucket and check whether it triggers the pipeline.

In [None]:
#Print Confirmation
print('GOOD JOB: You are all set!')