# Save to S3 with a SageMaker Processing Job

<div class="alert alert-info"> 💡 <strong> Quick Start </strong>
To run this notebook, select the Run menu above and click <strong>Run all cells</strong>. 
<strong><a style="color: #0397a7 " href="#Job-Status-&-S3-Output-Location">
    <u>View the status of the job</u></a>.
</strong>
</div>


This notebook runs SageMaker built-in XGBoost training on the NYC taxi 2019 dataset stored in S3.

---

## Contents

1. [Next Steps](#Next-Steps)
    1. [Load Processed Data into Pandas](#(Optional)-Load-Processed-Data-into-Pandas)
    1. [Train a model with SageMaker](#Train-a-model-with-SageMaker)
---

## Output: S3 settings

<div class="alert alert-info"> 💡 <strong> Configurable Settings </strong>

1. <b>bucket</b>: you can configure the S3 bucket where Data Wrangler will save the output. The default bucket from 
the SageMaker notebook session is used. 

</div>

In [None]:
!pip install -U sagemaker

In [None]:
import time
import uuid
import boto3
import sagemaker

# SageMaker session
sess = sagemaker.Session()

region = boto3.Session().region_name

# To use your own bucket, replace the default below with its name.
bucket = sess.default_bucket()


Below are the inputs required by the SageMaker Python SDK to launch a processing job.

The Data Wrangler Flow is also provided to the Processing Job as an input source which we configure below.

## Check output

**If the kernel dies, rerun this section on a larger notebook instance.**

This notebook is currently run on an m5.24xlarge instance.


In [None]:
!pip install -q awswrangler pandas
import awswrangler as wr

# Import sagemaker_datawrangler to show visualizations and automated data
# quality insights, and export code to prepare data in a pandas data frame.
try:
    import sagemaker_datawrangler
except ImportError:
    print("sagemaker_datawrangler is not imported. Change your kernel to the Data Science 3.0 Kernel Image and try again.")

In [None]:
%%time
# Check data:
content_type = 'parquet'

train_set_s3_uri = 's3://dsoaws/nyc-taxi-orig-cleaned-dropped-parquet-2019/train'
df_train = wr.s3.read_parquet(train_set_s3_uri, dataset=True)

validation_set_s3_uri = 's3://dsoaws/nyc-taxi-orig-cleaned-dropped-parquet-2019/validation'
df_validation = wr.s3.read_parquet(validation_set_s3_uri, dataset=True)


In [None]:
%%time

# Report the number of rows in each split.
print(f"There are {len(df_train):,} train examples.")
print(f"There are {len(df_validation):,} validation examples.")

## Next Steps

Now that data is available on S3 you can use other SageMaker components that take S3 URIs as input such as 
SageMaker Training, Built-in Algorithms, etc. Similarly you can load the dataset into a Pandas dataframe 
in this notebook for further inspection and work. The examples below show how to do both of these steps.

By default, optional steps do not run automatically. Set `run_optional_steps` to True if you want to 
execute them.
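
One way to wire up such a flag is sketched below. `run_optional_steps` is the name used in the text above; `maybe_run` is a hypothetical helper introduced only for illustration and is not part of any SageMaker library.

```python
# Flag referenced in the text above; optional steps are skipped by default.
run_optional_steps = False


def maybe_run(step, enabled):
    """Run `step` (a zero-argument callable) only when `enabled` is True.

    Hypothetical helper used for illustration only.
    """
    if not enabled:
        return None
    return step()


# The optional preview is skipped unless the flag is set to True.
result = maybe_run(lambda: "preview loaded", run_optional_steps)
print(result)
```

In a notebook, the same effect is usually achieved by wrapping the body of each optional cell in `if run_optional_steps:`.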

### (Optional) Load Processed Data into Pandas

We use the [AWS SDK for pandas library](https://github.com/awslabs/aws-sdk-pandas) to load the exported 
dataset into a Pandas data frame and preview the first 10,000 rows.

To turn on automated visualizations and data insights for your pandas data frame, import the sagemaker_datawrangler library.

In [None]:
chunksize = 10000

# Note: only the training set is loaded here.
if content_type.upper() == "CSV":
    dfs = wr.s3.read_csv(train_set_s3_uri, chunksize=chunksize)
elif content_type.upper() == "PARQUET":
    dfs = wr.s3.read_parquet(train_set_s3_uri, chunked=chunksize)
else:
    raise ValueError(f"Unexpected output content type {content_type}")

df = next(dfs)
df
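
The chunked readers above return an iterator, so `next(dfs)` materializes only the first chunk rather than the whole dataset. The sketch below mimics that behavior with plain Python lists instead of data frames. `iter_chunks` is a hypothetical stand-in written for this illustration, not an awswrangler function.

```python
from itertools import islice


def iter_chunks(rows, chunksize):
    """Lazily yield lists of at most `chunksize` items from `rows`."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# A generator stands in for a large dataset; nothing is read until requested.
rows = (i for i in range(25_000))
chunks = iter_chunks(rows, chunksize=10_000)

# Only the first chunk is consumed; the remaining rows stay unread.
first = next(chunks)
print(len(first))
```

Because the iterator is lazy, previewing one chunk keeps memory use bounded by `chunksize` regardless of the dataset size.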

## Train a model with SageMaker
Now that the data has been processed, you may want to train a model on it. The following example uses 
a popular algorithm, XGBoost. For more information on the algorithms available in 
SageMaker, see [Getting Started with SageMaker Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). 
Note that the XGBoost objective (chosen for a 'binary', 'regression', or 'multiclass' problem), the 
hyperparameters, and the content_type shown below may not suit your output data and may need changes 
before they produce a proper model. Furthermore, for CSV training, the algorithm assumes that the target 
variable is in the first column. For more information on SageMaker XGBoost, 
see https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html.
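
As a rough guide to choosing the objective, the sketch below maps each of the three problem types to a commonly used XGBoost objective string. The mapping and the `objective_hyperparameters` helper are assumptions made for illustration; other objectives (for example `multi:softprob`) may fit your data better.

```python
# Commonly used XGBoost objectives for each problem type (illustrative choice).
OBJECTIVES = {
    "binary": "binary:logistic",
    "regression": "reg:squarederror",
    "multiclass": "multi:softmax",
}


def objective_hyperparameters(problem_type, num_class=None):
    """Return objective-related hyperparameters for a problem type.

    Values are strings, matching the style of the hyperparameters
    dictionary used in this notebook.
    """
    if problem_type not in OBJECTIVES:
        raise ValueError(f"Unknown problem type: {problem_type!r}")
    params = {"objective": OBJECTIVES[problem_type]}
    if problem_type == "multiclass":
        # multi:softmax requires the number of classes to be set.
        if num_class is None or num_class < 2:
            raise ValueError("multiclass needs num_class >= 2")
        params["num_class"] = str(num_class)
    return params


print(objective_hyperparameters("regression"))
```

The returned dictionary can be merged into a larger hyperparameters dictionary with `dict.update`.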


### Set Training Data and Validation Data paths
We set the training and validation input data paths from the S3 URIs defined above.

In [None]:
s3_training_input_path = train_set_s3_uri
print(f"Training input data path: {s3_training_input_path}")

s3_validation_input_path = validation_set_s3_uri
print(f"Validation input data path: {s3_validation_input_path}")
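
Before launching a job, it can help to confirm that each path is a well-formed S3 URI. The sketch below uses only the standard library; `split_s3_uri` is a hypothetical helper, and it checks the URI's shape only, not whether the bucket or prefix exists.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key prefix), raising on malformed input."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Example with the training URI used in this notebook.
bucket_name, prefix = split_s3_uri(
    "s3://dsoaws/nyc-taxi-orig-cleaned-dropped-parquet-2019/train"
)
print(bucket_name, prefix)
```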

### Configure the algorithm and training job

The cell below sets the training job hyperparameters, the container image, and the input channels. 
For more information on XGBoost hyperparameters, see https://xgboost.readthedocs.io/en/latest/parameter.html.

In [None]:
# IAM role for executing the training job.
iam_role = sagemaker.get_execution_role()
region = boto3.Session().region_name
container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# Set an output path where the trained model will be saved.
s3_folder = 'gsml-nyc-taxi-ml-built-in-xgboost-public-dataset-2019'
s3_training_job_output_prefix = f"{s3_folder}/built-in-xgboost"
training_job_output_path = f"s3://{bucket}/{s3_training_job_output_prefix}/nyc-taxi-2019-built-in-xgboost/output"

hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
}

train_content_type = (
    "application/x-parquet" if content_type.upper() == "PARQUET"
    else "text/csv"
)
train_input = sagemaker.inputs.TrainingInput(
    s3_data=s3_training_input_path,
    content_type=train_content_type,
    distribution='ShardedByS3Key',
)

# add validation input
validation_content_type = (
    "application/x-parquet" if content_type.upper() == "PARQUET"
    else "text/csv"
)
validation_input = sagemaker.inputs.TrainingInput(
    s3_data=s3_validation_input_path,
    content_type=validation_content_type,
    distribution='ShardedByS3Key',
)
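
The content-type selection above is repeated for both channels. A small helper could factor it out, as in the sketch below. `to_mime_type` is a hypothetical name, and unlike the cell above, which falls back to `text/csv` for any non-Parquet value, this version raises on unknown formats so that typos surface early.

```python
# MIME types passed to TrainingInput for each data format.
MIME_TYPES = {
    "PARQUET": "application/x-parquet",
    "CSV": "text/csv",
}


def to_mime_type(content_type):
    """Map a format name such as 'parquet' or 'csv' to its MIME type."""
    try:
        return MIME_TYPES[content_type.upper()]
    except KeyError:
        raise ValueError(f"Unsupported content type: {content_type!r}") from None


print(to_mime_type("parquet"))
```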

### Start the Training Job

The training job configuration is set using the SageMaker Python SDK Estimator, which is then fit using 
the training and validation data configured above.

In [None]:
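estimator = sagemaker.estimator.Estimator(
    container,
    iam_role,
    hyperparameters=hyperparameters,
    instance_count=3,
    instance_type="ml.m5.24xlarge",
    # Save the trained model under the output path defined earlier.
    output_path=training_job_output_path,
)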


In [None]:
%%time

estimator.fit({"train": train_input, 'validation': validation_input})

Now that you have a trained model there are a number of different things you can do. 
For more details on training with SageMaker, please see 
https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html.