# Assignment 2: use SageMaker processing and training jobs
In this assignment you move your data processing, feature engineering, and model training code to SageMaker jobs.

Refer to the notebook [`02-sagemaker-containers.ipynb`](../02-sagemaker-containers.ipynb) for code snippets and general guidance on the exercises in this assignment notebook.

## Import packages

In [2]:
import time

import boto3
import numpy as np  
import pandas as pd  
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sklearn.metrics import roc_auc_score

sagemaker.__version__

'2.107.0'

## Exercise 1
- Use the SageMaker session object to [upload](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.upload_data) the dataset to an Amazon S3 bucket. Use the SageMaker [default bucket](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.default_bucket)
- Move the data processing code to an executable Python script. You can pass parameters to the script to parametrize the data processing
- Use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html) [`SKLearnProcessor`](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor) class to set up a processing job
- Configure the processing job's [inputs](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) and [outputs](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingOutput) to point the job to Amazon S3 locations
- [Run](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run) the processing job

### Python SDK processor classes
Use the most suitable class to implement a processor for your use case:
    
![](../img/python-sdk-processors.png)

In [12]:
session = sagemaker.Session()

In [13]:
# Write data upload code
# input_s3_url = # S3 key to the full dataset
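One possible way to fill in the upload step is sketched below. It assumes the CSV file is already on the notebook's local disk. The helper name `upload_dataset` and the key prefix `assignment-2/input` are hypothetical; `default_bucket()` and `upload_data()` are the two session methods linked in the exercise description.

```python
def upload_dataset(session, local_path, key_prefix="assignment-2/input"):
    """Upload a local file to the session's default bucket and return its S3 URI.

    `session` is expected to be a sagemaker.Session. The key prefix is a
    hypothetical choice, not something the assignment prescribes.
    """
    # default_bucket() returns the bucket name and creates the bucket if needed
    bucket = session.default_bucket()
    # upload_data() returns the S3 URI of the uploaded object
    return session.upload_data(path=local_path, bucket=bucket, key_prefix=key_prefix)
```

In the notebook this could be called as `input_s3_url = upload_dataset(session, local_path)`, where `local_path` points to your copy of `bank-additional-full.csv`.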


In [14]:
%%writefile preprocessing_assignment.py

# Write executable data processing code here
import pandas as pd
import numpy as np
import argparse
import os

def _parse_args():
    
    parser = argparse.ArgumentParser()
    # Input and output locations inside the processing container.
    # The defaults follow the conventional /opt/ml/processing layout; override them via job arguments.
    parser.add_argument('--filepath', type=str, default='/opt/ml/processing/input/')
    parser.add_argument('--filename', type=str, default='bank-additional-full.csv')
    parser.add_argument('--outputpath', type=str, default='/opt/ml/processing/output/')
    
    return parser.parse_known_args()


if __name__ == "__main__":
    # Process arguments
    args, _ = _parse_args()
    
    print("Data processing and feature engineering start")
    
    # Processing code here
    
    # Save the datasets (train, validation, test) locally
    
    print("## Processing complete. Exiting.")

Overwriting preprocessing_assignment.py
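The script template relies on `parse_known_args()`, which returns the recognized arguments together with a list of anything it does not recognize instead of raising an error. The sketch below reproduces the template's parser with an explicit `argv` parameter so the behavior can be tried locally. The values `sample.csv` and `--target-col y` are made up for illustration.

```python
import argparse


def parse_args(argv=None):
    """Same options as the script template; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--filepath", type=str, default="/opt/ml/processing/input/")
    parser.add_argument("--filename", type=str, default="bank-additional-full.csv")
    parser.add_argument("--outputpath", type=str, default="/opt/ml/processing/output/")
    # Returns a tuple: (namespace of known args, list of unrecognized args)
    return parser.parse_known_args(argv)


# Hypothetical command line: one known option, one option the parser does not define
args, unknown = parse_args(["--filename", "sample.csv", "--target-col", "y"])
print(args.filename)    # sample.csv
print(args.outputpath)  # /opt/ml/processing/output/
print(unknown)          # ['--target-col', 'y']
```

Any option you add to the parser can later be supplied through the `arguments` list of the processing job's `run()` call.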


In [15]:
# Create SKLearnProcessor
framework_version = "0.23-1"
processing_instance_type = "ml.m5.large"
processing_instance_count = 1

# sklearn_processor = SKLearnProcessor()
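A minimal sketch of how the commented-out `SKLearnProcessor()` call could be completed with the settings defined in the cell above. The helper names `processor_settings` and `create_processor` are hypothetical, and `role` is assumed to be the ARN of an IAM execution role with SageMaker permissions. The SDK import is deferred into the function so that the settings helper can be checked without the SDK.

```python
def processor_settings(role, framework_version="0.23-1",
                       instance_type="ml.m5.large", instance_count=1):
    """Collect the keyword arguments for SKLearnProcessor in one place."""
    return {
        "framework_version": framework_version,
        "role": role,
        "instance_type": instance_type,
        "instance_count": instance_count,
    }


def create_processor(role, sagemaker_session=None):
    """Build an SKLearnProcessor; calling this requires the sagemaker package."""
    # Deferred import: the notebook already imports this class in its first cell.
    from sagemaker.sklearn.processing import SKLearnProcessor

    return SKLearnProcessor(sagemaker_session=sagemaker_session,
                            **processor_settings(role))
```

In a SageMaker notebook the role can usually be obtained with `sagemaker.get_execution_role()`, after which `sklearn_processor = create_processor(role, session)` gives you the processor object.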


In [16]:
# Define processing inputs and outputs
processing_inputs = [] # use input_s3_url as pointer to the full dataset

processing_outputs = [] # map local directories in the processing container to Amazon S3 locations
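The two lists above could be populated along the following lines. This is a sketch under assumptions: the script is assumed to write its three datasets into `train`, `validation`, and `test` subdirectories of `/opt/ml/processing/output/`, and the helper names, `bucket`, and `prefix` are hypothetical. Only `ProcessingInput` and `ProcessingOutput` come from the SDK.

```python
def output_destinations(bucket, prefix, names=("train", "validation", "test")):
    """Map each output name to a (container directory, S3 URI) pair."""
    return {
        name: (f"/opt/ml/processing/output/{name}", f"s3://{bucket}/{prefix}/{name}")
        for name in names
    }


def build_processing_io(input_s3_url, bucket, prefix):
    """Create the inputs and outputs lists; requires the sagemaker package."""
    # Deferred import: the notebook already imports both classes in its first cell.
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    # The dataset in S3 is copied into the container's input directory
    inputs = [ProcessingInput(source=input_s3_url,
                              destination="/opt/ml/processing/input")]
    # Each container directory is uploaded to its S3 destination when the job ends
    outputs = [
        ProcessingOutput(output_name=name, source=local_dir, destination=s3_uri)
        for name, (local_dir, s3_uri) in output_destinations(bucket, prefix).items()
    ]
    return inputs, outputs
```

With this layout, `processing_inputs, processing_outputs = build_processing_io(input_s3_url, session.default_bucket(), prefix)` would replace the two empty lists.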

In [18]:
# Start the processing job
# sklearn_processor.run() 
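Starting the job could look roughly like the sketch below. The wrapper name `run_processing` is hypothetical, and `processor` is assumed to be the `SKLearnProcessor` created earlier. The `code`, `inputs`, `outputs`, `arguments`, and `wait` parameters belong to the `run()` method linked in the exercise description.

```python
def run_processing(processor, inputs, outputs, filename="bank-additional-full.csv"):
    """Run the preprocessing script as a SageMaker processing job."""
    processor.run(
        code="preprocessing_assignment.py",  # script written by the %%writefile cell
        inputs=inputs,
        outputs=outputs,
        # Forwarded to the script as command-line arguments
        arguments=["--filename", filename],
        wait=True,  # block until the job finishes
    )
```

The `arguments` list is where any extra options defined in the script's argument parser would be passed.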

## Exercise 2
- Get the Amazon ECR image URI for the built-in SageMaker algorithm you use
- Configure data input channels for the training job
- Use the [`Estimator`]() class to set up a training job
- Set [hyperparameters]()
- [Run]() the training job
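The steps above can be sketched as follows. The assignment does not name the algorithm, so this example assumes the built-in XGBoost algorithm, version `1.5-1`, with CSV training data. The helper names, the instance type, the hyperparameter values, and all S3 locations are hypothetical and should be adapted to your setup.

```python
def xgboost_hyperparameters(num_round=150, max_depth=5):
    """Hypothetical starting values for a binary classification task."""
    return {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "num_round": num_round,
        "max_depth": max_depth,
    }


def train(role, session, train_s3_url, validation_s3_url, output_s3_url):
    """Configure and run a training job; requires the sagemaker package."""
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    # ECR image URI of the built-in algorithm (XGBoost is an assumption here)
    image_uri = image_uris.retrieve("xgboost", region=session.boto_region_name,
                                    version="1.5-1")
    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path=output_s3_url,  # where the model artifact is stored
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(**xgboost_hyperparameters())
    # One input channel per dataset produced by the processing job
    estimator.fit({
        "train": TrainingInput(train_s3_url, content_type="text/csv"),
        "validation": TrainingInput(validation_s3_url, content_type="text/csv"),
    })
    return estimator
```

The train and validation URLs would typically be the S3 destinations of the corresponding processing job outputs from Exercise 1.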

### Python SDK estimator classes
The SageMaker Python SDK provides [`EstimatorBase`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)-derived classes for the built-in algorithms. You can extend the [`Framework`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Framework) class to implement training with a custom framework.

![](../img/python-sdk-estimators.png)

## Continue with assignment 3
Navigate to the [assignment 3](03-assignment-sagemaker-pipeline.ipynb) notebook.