
# What you will accomplish

In this guide, you will:

- Build, train, and tune a model using script mode
- Detect bias in ML models and understand model predictions
- Deploy the trained model to a real-time inference endpoint for testing
- Evaluate the model by generating sample predictions and understanding feature impact



### Bio: Josh, Data Scientist
### Location: Bellevue, WA

### Scenario:  

Josh is a data scientist at Figma Insurance who is training a binary classifier that predicts fraudulent activity on insurance claims. He first uses SageMaker Studio with some Python libraries to explore a dataset and finds key characteristics that he can use to engineer features. Once that is done, he decides to use the XGBoost framework to train a baseline model. This is where the tutorial begins: you will use the XGBoost and pandas libraries to train a baseline model, then experiment with and tune various parameters using SageMaker hyperparameter tuning.

Next, Josh uses SageMaker Clarify to check whether the model has any underlying bias against certain facets such as gender, race, and other demographics. He also realizes that he needs explainability for his model: why did the model predict something, and what caused it to make certain predictions? For this he uses SageMaker explainability reports. After Josh completes the bias and explainability checks, he is ready to deploy the model to a staging environment using SageMaker endpoints. He finishes by making some test inference calls to understand how the model performs on new, unseen data. Once that is complete, he deletes his endpoints.

Six months later, his new teammate Christina notices a degradation in the performance of the deployed model, which suggests that the model might need retraining. Thanks to SageMaker Pipelines, Josh can easily retrain the whole model on new data with the click of a button.


# Step 2: Set up a SageMaker Studio notebook (library installation & variable settings)

In [2]:
%pip install -q  xgboost==1.3.1 pandas==1.0.5

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sparkmagic 0.20.4 requires nest-asyncio==1.5.5, but you have nest-asyncio 1.5.6 which is incompatible.
sagemaker 2.143.0 requires importlib-metadata<5.0,>=1.4.0, but you have importlib-metadata 6.1.0 which is incompatible.
sagemaker 2.143.0 requires PyYAML==5.4.1, but you have pyyaml 6.0 which is incompatible.
sagemaker-data-insights 0.3.3 requires pandas>=1.1.4, but you have pandas 1.0.5 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install boto3

Note: you may need to restart the kernel to use updated packages.


### NOTE: Please change the alias below to your own. This keeps logs from multiple users from getting mixed together.

In [7]:
alias ='dewanup'

In [12]:
import pandas as pd
import boto3
import sagemaker
import json
import joblib
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner
)
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# Setting SageMaker variables
sess = sagemaker.Session()
write_bucket = sess.default_bucket()
write_prefix = "fraud-detect-demo-"+alias

region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)

sagemaker_role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")
read_bucket = "sagemaker-sample-files"
read_prefix = "datasets/tabular/synthetic_automobile_claims" 


# Setting S3 location for read and write operations
train_data_key = f"{read_prefix}/train.csv"
test_data_key = f"{read_prefix}/test.csv"
validation_data_key = f"{read_prefix}/validation.csv"
model_key = f"{write_prefix}/model"
output_key = f"{write_prefix}/output"


train_data_uri = f"s3://{read_bucket}/{train_data_key}"
test_data_uri = f"s3://{read_bucket}/{test_data_key}"
validation_data_uri = f"s3://{read_bucket}/{validation_data_key}"
model_uri = f"s3://{write_bucket}/{model_key}"
output_uri = f"s3://{write_bucket}/{output_key}"
estimator_output_uri = f"s3://{write_bucket}/{write_prefix}/training_jobs"
bias_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/bias"
explainability_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/explainability"

In [13]:
tuning_job_name_prefix = "xgbtune-" +alias
training_job_name_prefix = "xgbtrain-"+alias

xgb_model_name = "fraud-detect-xgb-model-" +alias
endpoint_name_prefix = "xgb-fraud-model-dev-"+alias
train_instance_count = 1
train_instance_type = "ml.m4.xlarge"
predictor_instance_count = 1
predictor_instance_type = "ml.m4.xlarge"
clarify_instance_count = 1
clarify_instance_type = "ml.m4.xlarge"

# Step 3: Develop a training script

In [63]:
%%writefile xgboost_train.py

import argparse
import os
import joblib
import json
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Hyperparameters and algorithm parameters are described here
    parser.add_argument("--num_round", type=int, default=50)
    parser.add_argument("--max_depth", type=int, default=2)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--subsample", type=float, default=0.6)
    parser.add_argument("--colsample_bytree", type=float, default=0.6)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--eval_metric", type=str, default="auc")
    parser.add_argument("--nfold", type=int, default=3)
    parser.add_argument("--early_stopping_rounds", type=int, default=3)
    

    # SageMaker specific arguments. Defaults are set in the environment variables
    # Location of input training data
    parser.add_argument("--train_data_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    # Location of input validation data
    parser.add_argument("--validation_data_dir", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    # Location where trained model will be stored. Default set by SageMaker, /opt/ml/model
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    # Location where additional output artifacts (such as evaluation metrics) will be stored. Default set by SageMaker, /opt/ml/output/data
    parser.add_argument("--output_data_dir", type=str, default=os.environ.get("SM_OUTPUT_DATA_DIR"))
    
    args = parser.parse_args()

    data_train = pd.read_csv(f"{args.train_data_dir}/train.csv")
    train = data_train.drop("fraud", axis=1)
    label_train = pd.DataFrame(data_train["fraud"])
    dtrain = xgb.DMatrix(train, label=label_train)
    
    
    data_validation = pd.read_csv(f"{args.validation_data_dir}/validation.csv")
    validation = data_validation.drop("fraud", axis=1)
    label_validation = pd.DataFrame(data_validation["fraud"])
    dvalidation = xgb.DMatrix(validation, label=label_validation)

    params = {"max_depth": args.max_depth,
              "eta": args.eta,
              "objective": args.objective,
              "subsample" : args.subsample,
              "colsample_bytree":args.colsample_bytree
             }
    
    num_boost_round = args.num_round
    nfold = args.nfold
    early_stopping_rounds = args.early_stopping_rounds
    
    cv_results = xgb.cv(
        params=params,
        dtrain=dtrain,
        num_boost_round=num_boost_round,
        nfold=nfold,
        early_stopping_rounds=early_stopping_rounds,
        metrics=["auc"],
        seed=42,
    )
    
    model = xgb.train(params=params, dtrain=dtrain, num_boost_round=len(cv_results))
    
    train_pred = model.predict(dtrain)
    validation_pred = model.predict(dvalidation)
    
    train_auc = roc_auc_score(label_train, train_pred)
    validation_auc = roc_auc_score(label_validation, validation_pred)
    
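    # Print the metrics in the log format that SageMaker's XGBoost metric parsing expects
    # ("#011" is how a tab character appears in the job logs), so a tuning job can read validation-auc
    print(f"[0]#011train-auc:{train_auc:.2f}")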
    print(f"[0]#011validation-auc:{validation_auc:.2f}")

    metrics_data = {"hyperparameters" : params,
                    "binary_classification_metrics": {"validation:auc": {"value": validation_auc},
                                                      "train:auc": {"value": train_auc}
                                                     }
                   }
              
    # Save the evaluation metrics to the location specified by output_data_dir
    metrics_location = args.output_data_dir + "/metrics.json"
    
    # Save the model to the location specified by model_dir
    model_location = args.model_dir + "/xgboost-model"

    with open(metrics_location, "w") as f:
        json.dump(metrics_data, f)

    with open(model_location, "wb") as f:
        joblib.dump(model, f)

Overwriting xgboost_train.py
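
The training script above reads hyperparameters from command-line flags and data locations from `SM_*` environment variables. The following is a minimal, standard-library-only sketch of that pattern. The helper name `parse_job_args` and the simulated directory values are assumptions made for illustration; in a real training job the container sets these variables itself.

```python
import argparse
import os


def parse_job_args(argv):
    """Parse hyperparameters from CLI flags and data locations from env vars."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags.
    parser.add_argument("--max_depth", type=int, default=2)
    parser.add_argument("--eta", type=float, default=0.2)
    # Directory arguments fall back to SM_* environment variables when no flag is given.
    parser.add_argument("--train_data_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR"))
    return parser.parse_args(argv)


# Simulate the environment of a training container (illustrative values only).
os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"
os.environ["SM_MODEL_DIR"] = "/opt/ml/model"

# Only max_depth is overridden; eta keeps its default and the directories come from the environment.
args = parse_job_args(["--max_depth", "4"])
print(args.max_depth, args.eta, args.train_data_dir, args.model_dir)
```

Because every directory argument has an environment-variable default, the same script can also be run locally by passing the directories explicitly as flags.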


# Step 4: Train a baseline model using SageMaker training estimators

In [64]:
xgb_estimator_base = XGBoost(
                        entry_point="xgboost_train.py",
                        output_path=estimator_output_uri,
                        code_location=estimator_output_uri,
                        role=sagemaker_role,
                        instance_count=train_instance_count,
                        instance_type=train_instance_type,
                        framework_version="1.3-1",
                        base_job_name=training_job_name_prefix
                    )
# Setting the input channels for the training job
s3_input_train = TrainingInput(s3_data="s3://{}/{}".format(read_bucket, train_data_key), content_type="csv", s3_data_type="S3Prefix")
s3_input_validation = (TrainingInput(s3_data="s3://{}/{}".format(read_bucket, validation_data_key), 
                                    content_type="csv", s3_data_type="S3Prefix")
                      )

xgb_estimator_base.fit(inputs={"train": s3_input_train, "validation": s3_input_validation})

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m4.xlarge.
INFO:sagemaker:Creating training-job with name: xgbtrain-dewanup-2023-04-03-06-06-39-282


2023-04-03 06:06:40 Starting - Starting the training job...
2023-04-03 06:07:05 Starting - Preparing the instances for training......
2023-04-03 06:08:12 Downloading - Downloading input data...
2023-04-03 06:08:37 Training - Downloading the training image.....
[2023-04-03 06:09:33.749 ip-10-2-127-225.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-04-03 06:09:33.778 ip-10-2-127-225.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-04-03:06:09:33:INFO] Imported framework sagemaker_xgboost_container.training
[2023-04-03:06:09:33:INFO] No GPUs detected (normal if no gpus installed)
[2023-04-03:06:09:33:INFO] Invoking user training script.
[2023-04-03:06:09:33:INFO] Module xgboost_train does not provide a setup.py.
Generating setup.py
[2023-04-03:06:09:33:INFO] Generating setup.cfg
[2023-04-03:06:09:33:INFO] Generating MANIFEST.in
[2023-04-03:0
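
Inside the training job, `xgboost_train.py` runs `xgb.cv` with `early_stopping_rounds` and then trains the final model for `len(cv_results)` rounds. With early stopping, the cross-validation results are cut off at the best round, so their length is the number of rounds worth keeping. The sketch below illustrates that idea on a plain list of scores. It is a simplified illustration, not XGBoost's implementation, and the function name `rounds_kept` and the sample scores are made up.

```python
def rounds_kept(scores, patience):
    """Return how many boosting rounds survive early stopping.

    scores: per-round validation metric where higher is better (such as AUC).
    patience: stop once this many rounds pass without a new best score.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i  # strict improvement resets the patience window
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop looking
    # Keep everything up to and including the best round.
    return best_idx + 1


# Hypothetical per-round validation AUC values.
cv_auc = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.80]
print(rounds_kept(cv_auc, patience=3))   # stops before the late 0.80 is ever seen
print(rounds_kept(cv_auc, patience=10))  # patience is long enough to reach the last round
```

A small patience value, like the script's default of 3, trades a possibly better late round for shorter training.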

In [65]:
import xgboost as xgb
from sklearn.metrics import roc_auc_score
def evaluate(s3_input_test, model):
    data_test = pd.read_csv(s3_input_test)
    label_test = pd.DataFrame(data_test["fraud"])
    test = data_test.drop("fraud", axis=1)
    dtest = xgb.DMatrix(test, label=label_test)
    test_pred = model.predict(dtest)
    test_auc = roc_auc_score(label_test, test_pred)
    return test_auc
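
The `evaluate` helper above reports AUC through `roc_auc_score`. As an aid to interpreting that number, here is a minimal, standard-library-only sketch of what AUC measures: the fraction of (fraud, non-fraud) pairs in which the fraudulent example receives the higher score, with ties counted as half. The function name `pairwise_auc` and the sample data are illustrative assumptions, and this quadratic-time version is meant for understanding rather than as a replacement for scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = fraud) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the test AUC values reported in this notebook.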

# Step 5: Evaluate the baseline model

In [66]:
model_s3_uri = xgb_estimator_base.model_data

In [67]:
!mkdir -p ./tmp/model/
!aws s3 cp $model_s3_uri ./tmp/model/model.tar.gz
!tar -xvzf ./tmp/model/model.tar.gz -C ./tmp/model/

download: s3://sagemaker-us-east-1-539179515961/fraud-detect-demo-dewanup/training_jobs/xgbtrain-dewanup-2023-04-03-06-06-39-282/output/model.tar.gz to tmp/model/model.tar.gz
xgboost-model


In [68]:
s3_input_test = "s3://{}/{}".format(read_bucket, test_data_key)
model = joblib.load('./tmp/model/xgboost-model')
test_auc = evaluate(s3_input_test, model)
print(" Test AUC for baseline model" ,round(test_auc,2))

 Test AUC for baseline model 0.82


# Step 6: Launch hyperparameter tuning jobs in script mode using SageMaker hyperparameter tuning

In [69]:
# SageMaker estimator

# Set static hyperparameters that will not be tuned
static_hyperparams = {  
                        "eval_metric" : "auc",
                        "objective": "binary:logistic",
                        "num_round": "100"
                      }

xgb_estimator = XGBoost(
                        entry_point="xgboost_train.py",
                        output_path=estimator_output_uri,
                        code_location=estimator_output_uri,
                        hyperparameters=static_hyperparams,
                        role=sagemaker_role,
                        instance_count=train_instance_count,
                        instance_type=train_instance_type,
                        framework_version="1.3-1",
                        base_job_name=training_job_name_prefix
                    )

# Setting ranges of hyperparameters to be tuned
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "subsample": ContinuousParameter(0.7, 0.95),
    "colsample_bytree": ContinuousParameter(0.7, 0.95),
    "max_depth": IntegerParameter(1, 8)
}

objective_metric_name = "validation:auc"

# Setting up tuner object
tuner_config_dict = {
                     "estimator" : xgb_estimator,
                     "max_jobs" : 5,
                     "max_parallel_jobs" : 2,
                     "objective_metric_name" : objective_metric_name,
                     "hyperparameter_ranges" : hyperparameter_ranges,
                     "base_tuning_job_name" : tuning_job_name_prefix,
                     "strategy" : "Random"
                    }
tuner = HyperparameterTuner(**tuner_config_dict)

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m4.xlarge.
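
The tuner above is configured with the `Random` strategy, `max_jobs` of 5, and four hyperparameter ranges. As a rough mental model, random search draws each candidate independently from those ranges. The standard-library sketch below mimics that with uniform sampling. The `ranges` encoding and the `sample_candidate` helper are hypothetical, and the service's actual sampling and scaling may differ.

```python
import random

# The same ranges as the tuner configuration, encoded as (kind, low, high).
ranges = {
    "eta": ("continuous", 0.0, 1.0),
    "subsample": ("continuous", 0.7, 0.95),
    "colsample_bytree": ("continuous", 0.7, 0.95),
    "max_depth": ("integer", 1, 8),
}


def sample_candidate(ranges, rng):
    """Draw one hyperparameter combination uniformly from the given ranges."""
    candidate = {}
    for name, (kind, low, high) in ranges.items():
        if kind == "integer":
            candidate[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(42)  # fixed seed so the sketch is repeatable
candidates = [sample_candidate(ranges, rng) for _ in range(5)]  # mirrors max_jobs = 5
for c in candidates:
    print(c)
```

Each printed dictionary corresponds to the hyperparameters one training job in the tuning run would receive as command-line flags.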


#### [Note 1]: The cell below ⬇️ takes about 7 minutes to run because it tunes multiple parameters (each dot indicates a running state and ! indicates completion of the job)

#### [Note 2]: You can go to the SageMaker console -> Hyperparameter tuning jobs to view the progress of the launched tuning jobs

In [70]:
# Setting the input channels for tuning job
s3_input_train = TrainingInput(s3_data="s3://{}/{}".format(read_bucket, train_data_key), content_type="csv", s3_data_type="S3Prefix")
s3_input_validation = (TrainingInput(s3_data="s3://{}/{}".format(read_bucket, validation_data_key), 
                                    content_type="csv", s3_data_type="S3Prefix")
                      )

tuner.fit(inputs={"train": s3_input_train, "validation": s3_input_validation}, include_cls_metadata=False)
tuner.wait()

INFO:sagemaker:Creating hyperparameter tuning job with name: xgbtune-dewanup-230403-0612


................................................................!
!


### Analyzing Tuner results

In [71]:
# Summary of tuning results ordered in descending order of performance
df_tuner = sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe()
df_tuner = df_tuner[df_tuner["FinalObjectiveValue"]>-float('inf')].sort_values("FinalObjectiveValue", ascending=False)
df_tuner

| | colsample_bytree | eta | max_depth | subsample | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.859696 | 0.319158 | 2.0 | 0.91308 | xgbtune-dewanup-230403-0612-005-a64cd0c3 | Completed | 0.83 | 2023-04-03 06:17:13+00:00 | 2023-04-03 06:17:50+00:00 | 37.0 |
| 4 | 0.75036 | 0.319645 | 2.0 | 0.722313 | xgbtune-dewanup-230403-0612-001-50ef28ba | Completed | 0.8 | 2023-04-03 06:14:08+00:00 | 2023-04-03 06:15:55+00:00 | 107.0 |
| 1 | 0.879701 | 0.482695 | 7.0 | 0.874829 | xgbtune-dewanup-230403-0612-004-b7783d97 | Completed | 0.75 | 2023-04-03 06:16:32+00:00 | 2023-04-03 06:17:04+00:00 | 32.0 |
| 2 | 0.709801 | 0.653393 | 4.0 | 0.857024 | xgbtune-dewanup-230403-0612-003-7457e134 | Completed | 0.72 | 2023-04-03 06:16:25+00:00 | 2023-04-03 06:17:02+00:00 | 37.0 |
| 3 | 0.815706 | 0.103396 | 7.0 | 0.839529 | xgbtune-dewanup-230403-0612-002-1d2d6601 | Completed | 0.71 | 2023-04-03 06:14:36+00:00 | 2023-04-03 06:16:18+00:00 | 102.0 |
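
The results above are sorted by `FinalObjectiveValue`, and the best job's name determines where its model artifact is stored. The sketch below reproduces that selection on the five rows from the results, using only the standard library. The `<write-bucket>` placeholder stands in for the real bucket name, and the URI layout follows the `output_path` convention used by the estimators in this notebook.

```python
# Job names and objective values copied from the tuning results above.
results = [
    {"TrainingJobName": "xgbtune-dewanup-230403-0612-005-a64cd0c3", "FinalObjectiveValue": 0.83},
    {"TrainingJobName": "xgbtune-dewanup-230403-0612-001-50ef28ba", "FinalObjectiveValue": 0.80},
    {"TrainingJobName": "xgbtune-dewanup-230403-0612-004-b7783d97", "FinalObjectiveValue": 0.75},
    {"TrainingJobName": "xgbtune-dewanup-230403-0612-003-7457e134", "FinalObjectiveValue": 0.72},
    {"TrainingJobName": "xgbtune-dewanup-230403-0612-002-1d2d6601", "FinalObjectiveValue": 0.71},
]

# The objective is validation AUC, so higher is better.
best = max(results, key=lambda row: row["FinalObjectiveValue"])

# Placeholder prefix; substitute the estimator's output_path in practice.
output_prefix = "s3://<write-bucket>/fraud-detect-demo-dewanup/training_jobs"
best_model_uri = f"{output_prefix}/{best['TrainingJobName']}/output/model.tar.gz"

print(best["TrainingJobName"])
print(best_model_uri)
```

In the notebook itself, `tuner.best_training_job()` returns the same job name without any manual sorting.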


### UI to Check: 📺
    - Experiments -> Hyperparameter Tuning Job / SM Hyperparameter Training Console

# Step 7: Evaluate the Hyperparameter Tuner Model

In [80]:
best_train_job_name = tuner.best_training_job()
model_s3_uri = estimator_output_uri + '/' + best_train_job_name + '/output/model.tar.gz'

In [81]:
!mkdir -p ./tmp/hpo-model/
!aws s3 cp $model_s3_uri ./tmp/hpo-model/model.tar.gz
!tar -xvzf ./tmp/hpo-model/model.tar.gz -C ./tmp/hpo-model/

download: s3://sagemaker-us-east-1-539179515961/fraud-detect-demo-dewanup/training_jobs/xgbtune-dewanup-230403-0612-005-a64cd0c3/output/model.tar.gz to tmp/hpo-model/model.tar.gz
xgboost-model


In [82]:
s3_input_test = "s3://{}/{}".format(read_bucket, test_data_key)
model = joblib.load('./tmp/hpo-model/xgboost-model')
test_auc = evaluate(s3_input_test, model)
print(" Test AUC for tuned model" ,round(test_auc,2))

 Test AUC for tuned model 0.84


# Step 8: Deploy the model using SageMaker endpoints for further inference testing

In [75]:
tuner_job_info = sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)

model_matches = sagemaker_client.list_models(NameContains=xgb_model_name)["Models"]

if not model_matches:
    _ = sess.create_model_from_job(
            name=xgb_model_name,
            training_job_name=tuner_job_info['BestTrainingJob']["TrainingJobName"],
            role=sagemaker_role,
            image_uri=tuner_job_info['TrainingJobDefinition']["AlgorithmSpecification"]["TrainingImage"]
            )
else:
    print(f"Model {xgb_model_name} already exists.")

INFO:sagemaker:Creating model with name: fraud-detect-xgb-model-dewanup


In [76]:
best_train_job_name = tuner.best_training_job()

model_path = estimator_output_uri + '/' + best_train_job_name + '/output/model.tar.gz'
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")
create_model_config = {"model_data":model_path,
                       "role":sagemaker_role,
                       "image_uri":training_image,
                       "name":endpoint_name_prefix,
                       "predictor_cls":sagemaker.predictor.Predictor
                       }
# Create a SageMaker model
model = sagemaker.model.Model(**create_model_config)

# Deploy the best model and get access to a SageMaker Predictor
predictor = model.deploy(initial_instance_count=predictor_instance_count, 
                         instance_type=predictor_instance_type,
                         serializer=CSVSerializer(),
                         deserializer=CSVDeserializer())
print(f"\nModel deployed at endpoint : {model.endpoint_name}")


INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating model with name: xgb-fraud-model-dev-dewanup
INFO:sagemaker:Creating endpoint-config with name xgb-fraud-model-dev-dewanup-2023-04-03-06-19-11-018
INFO:sagemaker:Creating endpoint with name xgb-fraud-model-dev-dewanup-2023-04-03-06-19-11-018


--------!
Model deployed at endpoint : xgb-fraud-model-dev-dewanup-2023-04-03-06-19-11-018


In [91]:
# Sample test data
test_df = pd.read_csv(test_data_uri)
payload = test_df.drop(["fraud"], axis=1).iloc[10].to_list()
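# Note: int(float(...)) truncates toward zero, so any returned score below 1.0 prints as 0.
# If the endpoint returns a probability, compare it against a threshold (for example 0.5) to get a class label.
print(f"Model prediction : {int(float(predictor.predict(payload)[0][0]))}, True label : {test_df['fraud'].iloc[10]}")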

Model prediction : 0, True label : 0
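
The prediction cell above converts the endpoint response with `int(float(...))`. A model trained with the `binary:logistic` objective normally produces a probability-like score, so an explicit threshold is a more transparent way to get a class label. The sketch below assumes the deserialized response is a nested list of strings, as `CSVDeserializer` produces. The `to_label` helper, the 0.5 threshold, and the sample scores are illustrative assumptions.

```python
def to_label(response, threshold=0.5):
    """Convert a CSV-deserialized response such as [["0.0312"]] to (label, score)."""
    score = float(response[0][0])
    label = int(score >= threshold)  # 1 = predicted fraud, 0 = predicted non-fraud
    return label, score


# Hypothetical responses; a real one would come from predictor.predict(payload).
low_label, low_score = to_label([["0.0312"]])
high_label, high_score = to_label([["0.91"]])
print(low_label, low_score)
print(high_label, high_score)

# Plain truncation would map the high score to 0 as well.
print(int(float("0.91")))
```

The threshold is a business decision: for fraud detection, a value below 0.5 catches more fraudulent claims at the cost of more false alarms.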
