# Bring Your Own Data Workshop

## Develop, Train, Optimize and Deploy Scikit-Learn models

- Doc: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
- SDK: https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
- boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

In this notebook we show how to use Amazon SageMaker to develop, train, tune, and deploy a Scikit-Learn based ML model. More information on Scikit-Learn is available at https://scikit-learn.org/stable/index.html.

### Import required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50)

import time


## Prepare our Environment

We'll need to:

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services

While `boto3` is the general AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.

### Set up SageMaker environment and variables

In [None]:
# Set up SageMaker parameters
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
sm_session = sagemaker.Session()
bucket_name = sm_session.default_bucket()  # Default SageMaker bucket for this account and region
bucket_prefix = "sklearn-example"  # Location in the bucket to store our files
sm_client = boto_session.client("sagemaker")
sm_role = sagemaker.get_execution_role()

In [None]:
bucket_name

### Explore your data in your notebook

In [None]:
dataset_path = 'final_df3.csv'  # "<dataset path local or from s3>"

In [None]:
df = pd.read_csv(dataset_path)

In [None]:
df.head()

In [None]:
target_label = 'Status'  # '<write your target label name here>'

### Upload data to S3 for processing

In [None]:
# Upload CSV files to S3 for SageMaker processing and training
rawdata_uri = sm_session.upload_data(
    path=dataset_path,
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)
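`upload_data` returns the S3 URI of the uploaded object. For a single file, that URI is expected to be the bucket, the key prefix, and the file name joined together. The sketch below reproduces that composition locally without calling AWS; `expected_upload_uri` is a hypothetical helper (not part of the SDK) and the bucket name in the example is made up.

```python
import posixpath


def expected_upload_uri(local_path: str, bucket: str, key_prefix: str) -> str:
    """Return the S3 URI a single-file upload is expected to land at.

    Hypothetical helper for illustration only; it makes no AWS calls.
    """
    # Only the file name (not its local directory) is appended to the prefix.
    file_name = posixpath.basename(local_path.replace("\\", "/"))
    return f"s3://{bucket}/{key_prefix.strip('/')}/{file_name}"


# Example with a made-up bucket name
example_uri = expected_upload_uri("final_df3.csv", "my-example-bucket", "sklearn-example")
print(example_uri)
```

Comparing this value with `rawdata_uri` is a quick way to confirm where the file ended up.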

### Prepare and run the Processing job

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor

# Reuse the execution role obtained during setup
role = sm_role

# Processor that runs a Scikit-Learn script in a managed container
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)


In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from datetime import datetime

job_name='sklearn-processing-'+datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

output_path_train='s3://'+bucket_name+'/'+job_name+'/processing/output/train/'
output_path_val='s3://'+bucket_name+'/'+job_name+'/processing/output/validation/'
output_path_test='s3://'+bucket_name+'/'+job_name+'/processing/output/test/'

sklearn_processor.run(
    code='./scripts/preprocess.py',
    job_name=job_name,
    # arguments=['arg1', 'arg2'],
    inputs=[ProcessingInput(
        source=rawdata_uri,  # S3 URI returned by upload_data above; a local path also works
        destination='/opt/ml/processing/input')],
    outputs=[
        ProcessingOutput(source='/opt/ml/processing/output/train', destination=output_path_train),
        ProcessingOutput(source='/opt/ml/processing/output/validation', destination=output_path_val),
        ProcessingOutput(source='/opt/ml/processing/output/test', destination=output_path_test),
    ]
)
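The processing job runs `scripts/preprocess.py`, which is not shown in this notebook. As a rough sketch of what such a script might do, the function below shuffles a CSV file and writes `train`, `validation`, and `test` subdirectories matching the `ProcessingOutput` sources above. The function name, the 70/15/15 split, the seed, and the output file names are all assumptions; it uses only the standard library so it can be tried locally, whereas a real script would more likely use pandas and Scikit-Learn.

```python
import csv
import os
import random


def split_csv(input_csv, output_dir, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the rows of input_csv and write train/validation/test CSV files.

    Hypothetical sketch: the ratios, seed, and file names are assumptions.
    """
    with open(input_csv, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible

    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    splits = {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],  # whatever remains after train and validation
    }

    for name, subset in splits.items():
        split_dir = os.path.join(output_dir, name)
        os.makedirs(split_dir, exist_ok=True)
        # Each split is written as <output_dir>/<name>/<name>.csv with the header row
        with open(os.path.join(split_dir, f"{name}.csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)

    return {name: len(subset) for name, subset in splits.items()}
```

Inside the processing container, the call would presumably look like `split_csv('/opt/ml/processing/input/final_df3.csv', '/opt/ml/processing/output')`, so that each subdirectory is uploaded to its S3 destination when the job finishes.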

### Prepare and run the training job

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = '0.23-1'
base_job_name='sklearn-training'

sklearn_estimator = SKLearn(
    entry_point='scripts/train.py',
    role = sm_role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    base_job_name=base_job_name,
    framework_version=FRAMEWORK_VERSION,
    metric_definitions=[
        {'Name': 'accuracy',
         'Regex': "validation:accuracy : ([0-9.]+).*$"},
        {'Name': 'auc',
         'Regex': "validation:auc : ([0-9.]+).*$"}],
    hyperparameters = {'n-estimators': 100,
                       'min-samples-leaf': 3,
                       'features': 'Var1 Var2 Var3 Var4',
                       'target': target_label}
)
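The `metric_definitions` above tell SageMaker how to pull metric values out of the training logs: each regex is matched against log lines and the first capture group becomes the metric value. The snippet below applies the same two patterns to a few made-up log lines to show what `train.py` would need to print. The log lines and the `extract_metrics` helper are hypothetical, for illustration only.

```python
import re

metric_definitions = [
    {"Name": "accuracy", "Regex": "validation:accuracy : ([0-9.]+).*$"},
    {"Name": "auc", "Regex": "validation:auc : ([0-9.]+).*$"},
]

# Hypothetical log lines, in the format the regexes above expect
log_lines = [
    "Starting training",
    "validation:accuracy : 0.8731",
    "validation:auc : 0.9102",
]


def extract_metrics(lines, definitions):
    """Return {metric name: float} using the first capture group of each match."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


metrics = extract_metrics(log_lines, metric_definitions)
print(metrics)
```

If the training script prints its metrics in a different format, the regexes will not match and no metrics will be recorded, so it is worth checking the patterns locally like this.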

In [None]:
# Launch the training job. By default fit() blocks until the job finishes; pass wait=False for an asynchronous call
sklearn_estimator.fit({'train':output_path_train, 'validation': output_path_val})
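In script mode, the `hyperparameters` dictionary is passed to the entry point as command-line arguments, and each channel given to `fit` is exposed through an environment variable such as `SM_CHANNEL_TRAIN`. Since `scripts/train.py` is not shown here, the following is only a sketch of how its argument parsing might look; the function name and default values are assumptions.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Parse the arguments SageMaker passes to a script-mode entry point."""
    parser = argparse.ArgumentParser()

    # Hyperparameters from the estimator arrive as --name value pairs
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    parser.add_argument("--features", type=str, default="")
    parser.add_argument("--target", type=str, default="")

    # Data and model locations come from environment variables set in the container
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))

    # Ignore any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments produced by the hyperparameters defined above
args = parse_training_args([
    "--n-estimators", "100",
    "--min-samples-leaf", "3",
    "--features", "Var1 Var2 Var3 Var4",
    "--target", "Status",
])
feature_columns = args.features.split()  # space-separated string -> list of column names
print(args.n_estimators, args.min_samples_leaf, feature_columns, args.target)
```

Passing the feature names as one space-separated string keeps the hyperparameter a plain string, which the script can split back into a list.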

In [None]:
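# Look up the model artifact location of the training job launched above.
# To inspect an earlier job, replace this with its name, e.g. 'sklearn-training-2021-04-13-22-50-14-086'
training_job_name = sklearn_estimator.latest_training_job.name
response = sm_client.describe_training_job(TrainingJobName=training_job_name)
model_output_path = response['ModelArtifacts']['S3ModelArtifacts']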

In [None]:
model_output_path

In [None]:
print(output_path_test)
print(output_path_train)

In [None]:
%store model_output_path
%store output_path_test
%store output_path_train

## [Optional] Deploy model

In [None]:
sklearn_predictor = sklearn_estimator.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1)



## Evaluate with held-out test data

In [None]:
# Reading directly from an s3:// path with pandas requires the s3fs package
test_df = pd.read_csv(output_path_test + 'test.csv')

In [None]:
testy=test_df[target_label]
testX=test_df.drop(target_label, axis=1)

In [None]:
testX

In [None]:
# the SKLearnPredictor does the serialization from pandas for us
predictions=sklearn_predictor.predict(testX)

predictions[:5]

In [None]:
test_results = pd.concat(
    [
        pd.Series(predictions, name="y_pred", index=test_df.index),
        test_df,
    ],
    axis=1
)
test_results.head()

In [None]:
import util

In [None]:
util.plotting.generate_classification_report(
    y_real=test_results[target_label],
    y_predict_proba=test_results["y_pred"],
    decision_threshold=0.5,
    class_names_list=["good", "default"],
    title="Initial risk model",
)
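The `util` module is not shown in this notebook, so the exact contents of the report are unknown. As a minimal sketch of what a `decision_threshold` of 0.5 means, the code below labels every score at or above the threshold as the positive class and derives a few summary metrics. It assumes binary 0/1 labels with 1 as the positive class (here presumably "default"); the function names and sample data are hypothetical.

```python
def confusion_at_threshold(y_true, y_score, threshold=0.5):
    """Count TP/FP/TN/FN after labelling scores >= threshold as positive."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            counts["tp"] += 1
        elif predicted == 1 and truth == 0:
            counts["fp"] += 1
        elif predicted == 0 and truth == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from confusion counts."""
    total = sum(counts.values())
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": counts["tp"] / actual_pos if actual_pos else 0.0,
    }


# Made-up labels and scores for illustration
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]
counts = confusion_at_threshold(y_true, y_score)
report = summarize(counts)
print(counts, report)
```

Lowering the threshold trades precision for recall, which matters when missing a positive case is more costly than a false alarm.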


## Visualize Tree