# Data Preparation and Feature Engineering with Amazon SageMaker Pipelines and Data Wrangler

With Amazon SageMaker Data Wrangler, you can prepare data and engineer features from a visual interface. With Amazon SageMaker Pipelines, orchestrating SageMaker jobs and authoring reproducible machine learning pipelines becomes much simpler.

This example combines SageMaker Data Wrangler and SageMaker Pipelines to simplify data preparation and feature engineering.

### Create a Data Wrangler Workflow for Data Preparation

First, we will create a workflow in SageMaker Data Wrangler's visual interface. There is a step-by-step [guide](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html) to follow in case you are not already familiar with Data Wrangler. The dataset we will use is the [UCI Machine Learning Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone) [1], and we will use Data Wrangler to prepare the data for training.


[1] Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

#### Import Data

![Import Data](img/import.png)
Make sure to clear `First row is header` if your dataset doesn't contain a header row.

#### Use Custom Transform to Add Header Row
![Add Header Row](img/header.png) 
(This example uses PySpark SQL.)
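
As a rough text version of the screenshot above, the snippet below builds the kind of SQL statement such a custom transform could use to name the columns. It assumes the headerless import produced Spark-style default names (`_c0`, `_c1`, ...) and that the data frame is exposed to SQL as `df`. The snake_case column names are an illustrative choice based on the UCI attribute list, and `build_rename_sql` is a hypothetical helper, not part of Data Wrangler.

```python
# Column names follow the UCI Abalone attribute list; the snake_case
# spelling is an illustrative choice.
ABALONE_COLUMNS = [
    "sex", "length", "diameter", "height", "whole_weight",
    "shucked_weight", "viscera_weight", "shell_weight", "rings",
]


def build_rename_sql(columns, table="df"):
    """Return a SELECT that aliases _c0, _c1, ... to the given names.

    Assumes Spark-style default column names and a table called `df`.
    """
    aliases = ", ".join(f"_c{i} AS {name}" for i, name in enumerate(columns))
    return f"SELECT {aliases} FROM {table}"


rename_sql = build_rename_sql(ABALONE_COLUMNS)
print(rename_sql)
```

The printed statement can be pasted into the custom transform editor and adjusted if your column names differ.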

#### Impute Numeric Columns
![Impute](img/impute_numeric.png) 
(Repeat this operation for all numeric columns)

#### Scale Values
![Scale Values](img/scale_values.png) 
(Repeat this operation for all numeric columns)
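
To make the imputation and scaling steps concrete, here is a small plain-Python sketch of what they do to a single numeric column. The median strategy and the standard scaler (with population standard deviation) are assumptions for illustration; Data Wrangler offers several options and its exact conventions may differ, so treat this as a conceptual aid only.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standard_scale(values):
    """Center to zero mean and scale to unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


lengths = [0.455, None, 0.530, 0.440, 0.330]  # illustrative values with one gap
imputed = impute_median(lengths)
scaled = standard_scale(imputed)
print(imputed)
print([round(v, 3) for v in scaled])
```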

#### Impute Categorical
![Impute Categorical](img/handle_missing.png) 

#### One-hot Encoding
![One-hot Encoding](img/one_hot_encode.png) 
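
For readers unfamiliar with one-hot encoding, the sketch below shows the idea on a categorical column like the abalone `sex` attribute. It is plain Python written for illustration; `one_hot_encode` is a hypothetical helper, and the column order Data Wrangler produces may differ.

```python
def one_hot_encode(values, categories=None):
    """Return (categories, rows) where each row is a 0/1 indicator list."""
    if categories is None:
        categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


sex = ["M", "F", "I", "M"]  # illustrative values
categories, encoded = one_hot_encode(sex)
print(categories)
for row in encoded:
    print(row)
```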

### Export Data Wrangler Workflow

Next, we will export the workflow created in the previous step. Data Wrangler can export a workflow as a SageMaker Data Wrangler job, a pipeline, Python code, or to Feature Store.

The following options create a Jupyter notebook that executes your data flow and integrates with the respective SageMaker feature:
* Data Wrangler Job: launches a SageMaker processing job to execute your workflow
* Pipeline: creates a single-step pipeline to execute your workflow
* Feature Store: ingests the outputs of your workflow into a SageMaker feature store

For one-time data preparation, export the workflow and run the generated notebook. To reuse the workflow in another pipeline, we will need the following information from the notebook:
* Output Name
* S3 URI to which the SageMaker Data Wrangler recipe is uploaded

Both values can be found in the notebook's `Parameters` section.

#### Select Export Steps
![Select Steps](img/select_steps.png) 

#### Select Export Options
![Select Export Option](img/export.png) 

#### Take Note of the Output Name
![Output Name](img/output_name.png) 

#### Take Note of the S3 URI of the Workflow Recipe
![flow](img/upload_flow_to_s3.png) 
(Sample from the pipeline notebook; executing this cell uploads the Data Wrangler workflow recipe to S3.)
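
If you prefer to look up the output name programmatically, the sketch below lists candidate names from a flow document. It assumes the `.flow` file is JSON with a top-level `nodes` list whose entries carry a `node_id` and an `outputs` list, and that an output name is written as `<node_id>.<output name>`, which matches the shape of the sample value used later in this example. Verify these assumptions against your own flow file. The inline `sample_flow` and the helper `list_output_names` are made up for illustration.

```python
import json


def list_output_names(flow):
    """Return '<node_id>.<output name>' for every node output in a flow dict."""
    names = []
    for node in flow.get("nodes", []):
        for output in node.get("outputs", []):
            names.append(f"{node['node_id']}.{output['name']}")
    return names


# Made-up stand-in for the parsed contents of a .flow file.
sample_flow = json.loads("""
{
  "nodes": [
    {"node_id": "11111111-aaaa", "type": "SOURCE", "outputs": [{"name": "default"}]},
    {"node_id": "22222222-bbbb", "type": "TRANSFORM", "outputs": [{"name": "default"}]}
  ]
}
""")

output_names = list_output_names(sample_flow)
for name in output_names:
    print(name)
```

The value printed for the node you exported should match the Output Name shown in the generated notebook.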

### Run Data Wrangler Workflow in SageMaker Pipeline

Next, we will demonstrate how to reuse a Data Wrangler workflow in any SageMaker pipeline.

##### Setup

In [None]:
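import json

import boto3
import sagemaker

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.wrangler.processing import DataWranglerProcessor
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
)
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# Parameters
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
# Assumes this notebook runs in SageMaker with an execution role attached;
# otherwise set `role` to the ARN of a suitable IAM role.
role = sagemaker.get_execution_role()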
base_dir = "/opt/ml/processing"

abalone_sample_data = "s3://my-sample-data/abalone.csv"
output_content_type = "CSV"
output_name = "0a562d05-69dc-4bc0-ab95-9dd0d2e4b6a3.default"

data_wrangler_recipe_s3_uri = "s3://my-data-wrangler-workflow/sample.flow"
data_wrangler_instance_count = ParameterInteger(name="InstanceCount", default_value=1)
data_wrangler_instance_type = ParameterString(name="InstanceType", default_value="ml.m5.4xlarge")
output_config = {output_name: {"content_type": output_content_type}}
job_argument = [f"--output-config '{json.dumps(output_config)}'"]
output_s3_uri = f"s3://{sagemaker_session.default_bucket()}/output"

training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")
model_path = f"s3://{sagemaker_session.default_bucket()}/AbaloneTrain"
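
The `--output-config` argument in the setup cell is the `output_config` dictionary serialized as JSON and wrapped in single quotes. The standalone snippet below repeats that serialization with the sample output name from above, so you can see the string that is passed to the processing job. It uses only the standard library and does not call SageMaker.

```python
import json

output_name = "0a562d05-69dc-4bc0-ab95-9dd0d2e4b6a3.default"
output_config = {output_name: {"content_type": "CSV"}}

# Same f-string as in the setup cell.
job_argument = [f"--output-config '{json.dumps(output_config)}'"]
print(job_argument[0])

# Strip the flag and the surrounding quotes to confirm the payload is valid JSON.
payload = job_argument[0].split(" ", 1)[1].strip("'")
assert json.loads(payload) == output_config
```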

##### Create Data Wrangler Job

In [None]:
# Data Wrangler Job
inputs = [
    ProcessingInput(
        input_name="abalone.csv",
        source=abalone_sample_data,
        destination=f"{base_dir}/abalone.csv",
        s3_data_type="S3Prefix",
        s3_input_mode="File",
        s3_data_distribution_type="FullyReplicated",
    )
]


outputs = [
    ProcessingOutput(
        output_name=output_name,
        source=f"{base_dir}/output",
        destination=output_s3_uri,
        s3_upload_mode="EndOfJob",
    )
]

data_wrangler_processor = DataWranglerProcessor(
    role=role,
    data_wrangler_flow_source=data_wrangler_recipe_s3_uri,
    instance_count=data_wrangler_instance_count,
    instance_type=data_wrangler_instance_type,
    sagemaker_session=sagemaker_session,
    max_runtime_in_seconds=86400,
)

data_wrangler_step = ProcessingStep(
    name="data-wrangler-step",
    processor=data_wrangler_processor,
    inputs=inputs,
    outputs=outputs,
    job_arguments=job_argument,
)

##### Create Training Job

In [None]:
# Training Job

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    role=role,
)
xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)

training_step = TrainingStep(
    name="abalone-training-step",
    estimator=xgb_train,
    inputs={
        "train": TrainingInput(
            s3_data=data_wrangler_step.properties.ProcessingOutputConfig.Outputs[
                output_name
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

##### Create Pipeline

In [None]:
# Pipeline

pipeline = Pipeline(
    name="sample-pipeline",
    parameters=[data_wrangler_instance_count, data_wrangler_instance_type, training_instance_type],
    steps=[data_wrangler_step, training_step],
    sagemaker_session=sagemaker_session,
)

##### (Optional) Examining the pipeline definition

In [None]:
import json

definition = json.loads(pipeline.definition())
print(json.dumps(definition, indent=2))
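
One way to inspect the parsed definition is to list each step with its type. The helper below assumes the definition has a top-level `Steps` list whose entries carry `Name` and `Type` keys. The inline `sample_definition` is a hand-written stand-in so the snippet runs without AWS access; in the notebook you would pass `definition` instead. `summarize_steps` is a hypothetical helper, not part of the SageMaker SDK.

```python
def summarize_steps(definition):
    """Return (name, type) pairs for each step in a pipeline definition dict."""
    return [(step["Name"], step["Type"]) for step in definition.get("Steps", [])]


# Hand-written stand-in for json.loads(pipeline.definition()).
sample_definition = {
    "Version": "2020-12-01",
    "Parameters": [{"Name": "InstanceCount", "Type": "Integer", "DefaultValue": 1}],
    "Steps": [
        {"Name": "data-wrangler-step", "Type": "Processing", "Arguments": {}},
        {"Name": "abalone-training-step", "Type": "Training", "Arguments": {}},
    ],
}

step_summary = summarize_steps(sample_definition)
for name, step_type in step_summary:
    print(f"{step_type:<12}{name}")
```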

##### Submit the pipeline to SageMaker and start execution

Submit the pipeline definition to the SageMaker Pipelines service. The service uses the role passed in to create all the jobs defined in the steps.

In [None]:
pipeline.upsert(role_arn=role)

Start the pipeline and accept all the default parameters.

In [None]:
execution = pipeline.start()
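
After starting the execution, you will usually want to check how each step fared. The sketch below formats a list of step records shaped like the entries we would expect `execution.list_steps()` to return (`StepName`, `StepStatus`, and an optional `FailureReason`). The `sample_steps` data is fabricated so the snippet runs offline, and `format_step_statuses` is a hypothetical helper.

```python
def format_step_statuses(steps):
    """Return one 'name: status' line per step, appending any failure reason."""
    lines = []
    for step in steps:
        line = f"{step['StepName']}: {step['StepStatus']}"
        if step.get("FailureReason"):
            line += f" ({step['FailureReason']})"
        lines.append(line)
    return lines


# Fabricated records standing in for execution.list_steps().
sample_steps = [
    {
        "StepName": "abalone-training-step",
        "StepStatus": "Failed",
        "FailureReason": "ClientError: example reason",
    },
    {"StepName": "data-wrangler-step", "StepStatus": "Succeeded"},
]

status_lines = format_step_statuses(sample_steps)
print("\n".join(status_lines))

# In the notebook, after `execution = pipeline.start()`:
#   execution.wait()
#   print("\n".join(format_step_statuses(execution.list_steps())))
```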