# Using Batch Transform with SageMaker Studio

## Environment Setup

- Image: Data Science
- Kernel: Python 3
- Instance type: ml.t3.medium

## Background

This notebook builds on earlier notebooks in which we trained models to predict when a customer will abandon a telecommunications service. In this notebook, we train a model and use it to run inference (predictions) on batches of data loaded from `batch_data.csv`.

This notebook was adapted from the [SageMaker examples](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb).

## Initialize the environment and variables

In [None]:
# Install sagemaker-experiments
import sys
!{sys.executable} -m pip install sagemaker-experiments
# Restart the notebook kernel after running this cell so the newly installed package can be imported

In [None]:
# Import libraries
import boto3
import re
import pandas as pd
import numpy as np
import os
import time

import sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput

# Get the SageMaker session and the execution role from the SageMaker domain
sess = sagemaker.Session()
role = get_execution_role()

bucket = '<name-of-your-bucket>' # Update with the name of a bucket that is already created in S3
prefix = 'demo' # The name of the folder that will be created in the S3 bucket

In [None]:
from time import strftime
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker
from botocore.exceptions import ClientError

## Dataset

For this activity, the dataset has already been selected and split into `train.csv` and `validation.csv`.

We upload the dataset to the S3 bucket so that SageMaker can use it.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
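The upload calls above build S3 object keys with `os.path.join`, which uses the local operating system's path separator. That is `/` on the Linux images used by SageMaker Studio, but it would be `\` on Windows, and S3 keys always use `/`. A minimal sketch of a separator-safe alternative follows; the helper names `s3_key` and `s3_uri` and the bucket name `my-bucket` are hypothetical and not part of the original notebook.

```python
def s3_key(*parts):
    """Join key segments with '/' regardless of the local operating system."""
    # Drop empty segments and stray slashes so the key never contains '//'
    cleaned = [p.strip('/') for p in parts if p and p.strip('/')]
    return '/'.join(cleaned)


def s3_uri(bucket, key):
    """Build an s3:// URI from a bucket name and an object key."""
    return 's3://{}/{}'.format(bucket, key)


# 'my-bucket' and 'demo' are placeholder values for illustration only
print(s3_key('demo', 'train/train.csv'))
print(s3_uri('my-bucket', s3_key('demo', 'validation/', 'validation.csv')))
```

The resulting key can be passed to `Object(...)` in place of the `os.path.join(...)` expression.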

## Experiments

In this section, we use SageMaker Experiments. Once it is configured, we can start training the model.

In [None]:
# Create an experiment
create_date = strftime("%Y-%m-%d-%H-%M-%S")
experiment_name = 'batch-transform-churn-experiment'
experiment_description = 'A demo experiment'

# Use a try-block so we can re-use an existing experiment rather than creating a new one each time
try:
    experiment = Experiment.create(experiment_name=experiment_name,
                                   description=experiment_description)
except ClientError as e:
    # Creation fails when the name is already taken, so load the existing experiment instead
    experiment = Experiment.load(experiment_name=experiment_name)
    print(f'{experiment_name} already exists and will be reused.')
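The experiment name used above is fixed, so every run targets the same experiment. If a distinct experiment per run is preferred, the timestamp has to be inserted explicitly, because `str.format` only substitutes where the string contains a `{}` placeholder. The sketch below illustrates this; the `unique_name` helper is hypothetical and not part of the original notebook.

```python
from time import strftime


def unique_name(base, timestamp=None):
    """Return `base` with a timestamp suffix (hypothetical helper)."""
    if timestamp is None:
        # Same timestamp pattern the notebook uses for create_date
        timestamp = strftime("%Y-%m-%d-%H-%M-%S")
    return '{}-{}'.format(base, timestamp)


base = 'batch-transform-churn-experiment'

# Without a '{}' placeholder, format() returns the string unchanged
assert base.format('2024-01-01-00-00-00') == base

# With an explicit suffix, each run gets its own name (fixed timestamp shown for illustration)
print(unique_name(base, '2024-01-01-00-00-00'))
```

Note that a unique name per run also means the cleanup step has to be given the exact name that was generated.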

In [None]:
# Create a trial for the experiment
trial_name = "batch-transform-churn-trial-2"

# Trial names must be unique within your AWS account, so change trial_name if you re-run this cell
demo_trial = Trial.create(trial_name=trial_name,
                          experiment_name=experiment_name)

## Training

We will run the training again.

We need to specify where our training data is located, the path to the container that runs the algorithm, and the algorithm to use (along with its hyperparameters).

In [None]:
# The location of our training and validation data in S3
s3_input_train = TrainingInput(
    s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv'
)
s3_input_validation = TrainingInput(
    s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv'
)

In [None]:
# The location of the XGBoost container version 1.5-1 (an AWS-managed container)
container = sagemaker.image_uris.retrieve('xgboost', sess.boto_region_name, '1.5-1')

In [None]:
# Set up experiment_config, which will be passed to the Estimator. This component covers the training part only
# (later on, we'll update the TrialComponentDisplayName for the batch transform job)
experiment_config={'ExperimentName': experiment_name,
                   'TrialName': trial_name,
                   'TrialComponentDisplayName': 'TrainingJob'}

In [None]:
# Initialize hyperparameters
hyperparameters = {
                    'max_depth':'5',
                    'eta':'0.2',
                    'gamma':'4',
                    'min_child_weight':'6',
                    'subsample':'0.8',
                    'objective':'binary:logistic',
                    'eval_metric':'error',
                    'num_round':'100'}

# Output path where the trained model will be saved
output_path = 's3://{}/{}/output'.format(bucket, prefix)

# Set up the Estimator, which defines the training job
xgb = sagemaker.estimator.Estimator(image_uri=container, 
                                    hyperparameters=hyperparameters,
                                    role=role,
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge', 
                                    output_path=output_path,
                                    sagemaker_session=sess)
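The `eval_metric` value `error` reports the binary classification error rate: the number of wrong predictions divided by the total number of predictions, where XGBoost treats predicted values above 0.5 as the positive class. The following sketch reproduces that calculation by hand; the function name and the sample values are illustrative and do not come from the original notebook or its dataset.

```python
def binary_error_rate(labels, predicted_probabilities, threshold=0.5):
    """Fraction of records whose thresholded prediction differs from the label."""
    if len(labels) != len(predicted_probabilities):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    wrong = 0
    for label, probability in zip(labels, predicted_probabilities):
        predicted_label = 1 if probability > threshold else 0
        if predicted_label != label:
            wrong += 1
    return wrong / len(labels)


# Made-up example: the last two records are misclassified
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.7]
print(binary_error_rate(labels, probabilities))  # 0.5
```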

In [None]:
# "fit" executes the training job
# We're passing in experiment_config so that the training results will be tied to the experiment
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, experiment_config=experiment_config) 

## Batch Transform

Now that the model is trained, we use it to make predictions on batches of data. Batch Transform provisions the required infrastructure and runs the inference.

For this lesson, we pass in the data from `batch_data.csv`.

IMPORTANT: The dataset used for batch predictions must not contain the target column.

In [None]:
# Read data into a dataframe
batch_data_path = 'batch_data.csv'
df = pd.read_csv(batch_data_path, delimiter=',', index_col=None)

batch_data = df.iloc[:, 1:] # drop the first column, which holds the target
batch_data.to_csv('batch_data_for_transform.csv', header=False, index = False)

# Upload the new CSV file (without the target column) to S3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'batch/batch_data_for_transform.csv')).upload_file('batch_data_for_transform.csv')

In [None]:
# The location of the batch data used for prediction, and location for batch output
s3_batch_input = 's3://{}/{}/batch/batch_data_for_transform.csv'.format(bucket,prefix) 
s3_batch_output = 's3://{}/{}/batch/batch-inference'.format(bucket, prefix) 

# Create the Batch Transform job
transformer = xgb.transformer(
    instance_count=1,
    instance_type="ml.m4.xlarge",
    strategy="MultiRecord",
    assemble_with="Line",
    accept="text/csv",
    output_path=s3_batch_output
)

# Update the TrialComponentDisplayName; this is for the transform part of the trial (the previous component was for training)
experiment_config={'ExperimentName': experiment_name,
                   'TrialName': trial_name,
                   'TrialComponentDisplayName': 'BatchTransformJob'}

transformer.transform(s3_batch_input, content_type="text/csv", split_type="Line", experiment_config = experiment_config)
transformer.wait()
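For a single input object, as used here, Batch Transform writes the result under the output path using the input file name with `.out` appended. A small sketch of that naming rule follows; the helper `expected_output_uri` and the bucket name `my-bucket` are hypothetical, and the rule shown does not cover inputs given as an S3 prefix.

```python
def expected_output_uri(input_uri, output_path):
    """Return the S3 URI where the result for one input object is expected."""
    # Keep only the file name of the input object
    file_name = input_uri.rstrip('/').split('/')[-1]
    return '{}/{}.out'.format(output_path.rstrip('/'), file_name)


# Placeholder bucket and prefix for illustration
print(expected_output_uri(
    's3://my-bucket/demo/batch/batch_data_for_transform.csv',
    's3://my-bucket/demo/batch/batch-inference',
))
```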

In [None]:
# Download the batch transform output locally
!aws s3 cp --recursive $transformer.output_path ./

In [None]:
# View the first ten predictions (you can also double-click the file in the folder view to see all predictions)
!head batch_data_for_transform.csv.out
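With the `binary:logistic` objective, each prediction is a probability between 0 and 1 rather than a class label. Below is a minimal sketch for turning the output into churn / no-churn labels. It assumes the output holds one probability per record, separated by newlines or commas, and uses 0.5 as an illustrative cutoff; both are assumptions, the helper names are hypothetical, and the sample string is made up.

```python
import re


def parse_predictions(text):
    """Parse probabilities separated by commas and/or whitespace into floats."""
    return [float(token) for token in re.split(r'[,\s]+', text.strip()) if token]


def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (churn) if above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up output text mixing both separators
sample_output = "0.91\n0.03\n0.55,0.48\n"
probabilities = parse_predictions(sample_output)
print(to_labels(probabilities))  # [1, 0, 1, 0]
```

To apply this to the real results, read the contents of `batch_data_for_transform.csv.out` into a string and pass it to `parse_predictions`. The right cutoff depends on the relative cost of false positives and false negatives, so 0.5 is only a starting point.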

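## Cleaning up

In this section, we clean up by deleting the experiment, its trials, and their trial components. Note that the cleanup code in this section does not delete the model created for the batch transform job.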

In [None]:
# Function to iterate through an experiment to delete its trials, then delete the experiment itself
def cleanup_sme_sdk(demo_experiment):
    for trial_summary in demo_experiment.list_trials():
        trial = Trial.load(trial_name=trial_summary.trial_name)
        for trial_component_summary in trial.list_trial_components():
            tc = TrialComponent.load(
                trial_component_name=trial_component_summary.trial_component_name)
            trial.remove_trial_component(tc)
            try:
                # Comment out to keep trial components
                tc.delete()
            except Exception:
                # Trial component is associated with another trial
                continue
            # To prevent throttling
            time.sleep(.5)
        trial.delete()
    # Read the name before deleting; doing this outside the loop also works when the experiment has no trials
    experiment_name = demo_experiment.experiment_name
    demo_experiment.delete()
    print(f"\nExperiment {experiment_name} deleted")

In [None]:
# Call the function above to delete an experiment and its trials
# Fill in your experiment name (not the display name)
experiment_to_cleanup = Experiment.load(experiment_name='batch-transform-churn-experiment')

cleanup_sme_sdk(experiment_to_cleanup)