# Prepare data for ML

**SageMaker Studio Kernel**: Data Science

In this exercise you will:
 - Run a preprocessing job with Amazon SageMaker Processing to prepare data for training ML models

***

## Install Requirements

In [None]:
! pip install sagemaker-studio-image-build

## Build Container

The Dockerfile for the processing container builds on the public Python 3.7.1 image.

In [None]:
! pygmentize ./../algorithms/processing/Dockerfile

In [None]:
%%sh

cd ./../algorithms/processing

sm-docker build . --repository sm-end-to-end-preprocessing-mlops:latest

## Part 1/2 - Setup
Here we import some libraries and define some variables. You can also take a look at the scripts that were previously created for preparing the data and training the model.

In [None]:
import boto3
import logging
import sagemaker
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

In [None]:
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

In [None]:
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

***

### Global configurations

Configuration variables used for processing, training, and model registration.

In [None]:
region = boto3.session.Session().region_name

# Look up the account ID once and reuse it for the role and KMS key ARNs
account_id = boto3.client("sts").get_caller_identity()["Account"]

role_name = "mlops-sagemaker-execution-role"
role = "arn:aws:iam::{}:role/{}".format(account_id, role_name)

kms_account_id = account_id
kms_alias = "ml-kms"

# Set this to the name of your S3 bucket before running the cells below
bucket_name = ""

In [None]:
boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client("sagemaker")
runtime_client = boto_session.client("sagemaker-runtime")

sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_runtime_client=runtime_client,
    default_bucket=bucket_name
)

In [None]:
kms_key = "arn:aws:kms:{}:{}:alias/{}".format(region, kms_account_id, kms_alias)
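The role and KMS key ARNs above are assembled with string formatting. As a minimal sketch, the same patterns can be wrapped in small helpers so they are easy to check in isolation. The function names `role_arn` and `kms_alias_arn` and the `partition` parameter are assumptions introduced for illustration; they are not part of the original lab.

```python
def role_arn(account_id, role_name, partition="aws"):
    """Build an IAM role ARN. `partition` is assumed to be "aws" unless overridden."""
    return f"arn:{partition}:iam::{account_id}:role/{role_name}"


def kms_alias_arn(region, account_id, alias, partition="aws"):
    """Build the ARN of a KMS alias in the given region and account."""
    return f"arn:{partition}:kms:{region}:{account_id}:alias/{alias}"


# Example with a placeholder account ID (not a real account)
example_role = role_arn("123456789012", "mlops-sagemaker-execution-role")
example_key = kms_alias_arn("eu-west-1", "123456789012", "ml-kms")
```

Keeping the partition as a parameter is a design choice that makes the helpers usable outside the standard `aws` partition, if that is ever needed.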

***

## Part 2/2 - Run the Processing Job

### Step 1/3: Create the Processing Job

#### Define input variables

In [None]:
ecr_image_name = "sm-end-to-end-preprocessing-mlops"
ecr_image_version = "latest"

processing_entrypoint = "./../algorithms/processing/src/processing.py"
processing_instance_count = 1
processing_instance_type = "ml.t3.large"
processing_input_files_path = "data/input"
processing_output_files_path = "data/output"

#### Get the dataset and upload it to an S3 bucket

In [None]:
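# Remove any object stored at the exact input key before uploading the dataset.
# Note: delete_object removes a single key, not every object under a prefix.
s3_client.delete_object(Bucket=bucket_name, Key=processing_input_files_path)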

input_data = sagemaker_session.upload_data('./../data/data.csv', key_prefix=processing_input_files_path)

LOGGER.info(input_data)
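The cleanup call in the cell above targets one key only. If the goal is to empty everything under the input prefix, one possible approach is to list the objects page by page and delete them in batches. This is a sketch, not the lab's own method: the helper name `delete_prefix` is hypothetical, and it expects a boto3 S3 client such as the `s3_client` created earlier.

```python
def delete_prefix(s3_client, bucket, prefix):
    """Request deletion of every object under `prefix`; return how many keys were sent.

    `s3_client` is expected to behave like a boto3 S3 client.
    Errors reported in the delete_objects response are not inspected here.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    requested = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects have no "Contents" entry
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # A page holds at most 1000 keys, which matches the delete_objects limit
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            requested += len(keys)
    return requested


# Possible usage in this notebook (left commented out because it deletes data):
# delete_prefix(s3_client, bucket_name, processing_input_files_path + "/")
```

The trailing slash in the usage line matters: a prefix of `data/input` would also match keys such as `data/input_old/file.csv`.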

#### Create Processor

In [None]:
!pygmentize ./../algorithms/processing/src/processing.py

In [None]:
processor = ScriptProcessor(
    command=["python3"],
    image_uri="{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
        boto3.client('sts').get_caller_identity().get('Account'),
        boto3.session.Session().region_name,
        ecr_image_name,
        ecr_image_version),
    role=role,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
    output_kms_key=kms_key,
    sagemaker_session=sagemaker_session
)
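The `image_uri` passed to `ScriptProcessor` follows the private Amazon ECR naming pattern of account, region, repository, and tag. A small helper makes that pattern explicit. The name `ecr_image_uri` is hypothetical, and the sketch assumes the standard `amazonaws.com` domain, which does not hold for every AWS partition.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build a private ECR image URI, assuming the standard amazonaws.com domain."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Example with a placeholder account ID (not a real account)
example_uri = ecr_image_uri(
    "123456789012", "eu-west-1", "sm-end-to-end-preprocessing-mlops"
)
```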

In [None]:
processor.run(
    code=processing_entrypoint,
    inputs=[
        ProcessingInput(
            input_name="input",
            source="s3://{}/{}".format(bucket_name, processing_input_files_path),
            destination="/opt/ml/processing/input"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="output",
            source="/opt/ml/processing/output",
            destination="s3://{}/{}".format(bucket_name, processing_output_files_path))
    ],
    wait=True
)
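During the job, SageMaker copies the S3 input to `/opt/ml/processing/input` inside the container and uploads whatever the script writes to `/opt/ml/processing/output`. The sketch below shows the general shape of a script that works with those two directories. It is a hypothetical stand-in, not the lab's `processing.py`: the function name `process` and the row filter (dropping rows that contain an empty field) are assumptions chosen only to illustrate the read and write pattern.

```python
import csv
from pathlib import Path


def process(input_dir="/opt/ml/processing/input",
            output_dir="/opt/ml/processing/output"):
    """Copy each CSV from input_dir to output_dir, skipping rows with empty fields.

    Returns the number of rows written, header rows included.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for src in sorted(Path(input_dir).glob("*.csv")):
        with src.open(newline="") as fin, (out / src.name).open("w", newline="") as fout:
            writer = csv.writer(fout)
            for row in csv.reader(fin):
                # Keep only non-blank rows in which every field has content
                if row and all(field.strip() for field in row):
                    writer.writerow(row)
                    written += 1
    return written
```

Because the directories are parameters, the same function can be exercised locally against temporary folders before the container is built.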

We have just seen how to prepare data using an Amazon SageMaker Processing job. To train an ML model, continue with the following lab.

 > [Train-Build-Model](./04-Train-Build-Model.ipynb)

To test the execution of a custom script container, run the following lab.

 > [Train-Custom-Script-Container](./05-Train-Build-Model-Custom-Script-Container.ipynb)

To explore Amazon SageMaker Feature Store, run the following optional lab.

 > [Store-Features](./03-Store-Features.ipynb)