# Notebook 2: Breast Cancer Classification using Gene Expression Data

## Learning Objectives
- Use SageMaker processing, training, and hyperparameter tuning jobs to optimize cost and performance
- Compare model performance using SageMaker Experiments

## Environment Notes:
This notebook was created and tested on an `ml.t3.medium (2 vCPU + 4 GiB)` notebook instance running the `Python 3 (Data Science)` kernel in SageMaker Studio.

## Table of Contents
1. [Background](#1.-Background)
    1. [SageMaker Jobs](#1.A.-SageMaker-Jobs)
    1. [SageMaker Experiments](#1.B.-SageMaker-Experiments)
1. [Preparation](#2.-Preparation)
    1. [Import Python Libraries](#2.A.-Import-Python-Libraries)
    1. [Create Some Necessary Clients](#2.B.-Create-Some-Necessary-Clients)
    1. [Create an Experiment](#2.C.-Create-an-Experiment)
    1. [Specify S3 Bucket and Prefix](#2.D.-Specify-S3-Bucket-and-Prefix)
    1. [Define Local Working Directories](#2.E.-Define-Local-Working-Directories)
1. [Data Preparation with Amazon SageMaker Processing](#3.-Data-Preparation-with-Amazon-SageMaker-Processing)
    1. [Upload Raw Data to S3](#3.A.-Upload-Raw-Data-to-S3)
    1. [Create SageMaker Processing Job Script](#3.B.-Create-SageMaker-Processing-Job-Script)
    1. [Submit SageMaker Processing Job](#3.C.-Submit-SageMaker-Processing-Job)
1. [Model Training](#4.-Model-Training)
    1. [Train Model Using a SKLearn Random Forest Algorithm](#4.A.-Train-Model-Using-a-SKLearn-Random-Forest-Algorithm)
    1. [Train Model using a Keras MLP](#4.B.-Train-Model-using-a-Keras-MLP)
    1. [Train Model Using the XGBoost Algorithm](#4.C.-Train-Model-Using-the-XGBoost-Algorithm)
1. [Model Evaluation](#5.-Model-Evaluation)
    1. [Download and Run the Trained XGBoost Model](#5.A.-Download-and-Run-the-Trained-XGBoost-Model)
    1. [Compare Model Results Using SageMaker Experiments](#5.B.-Compare-Model-Results-Using-SageMaker-Experiments)
1. [Hyperparameter Optimization](#6.-Hyperparameter-Optimization)

---

## 1. Background
In notebook 1 of this series, we demonstrated using RNAseq data to predict HER2 status using the compute resources on the notebook server. However, using notebook server resources to process large amounts of data or train complex models is generally not a good idea. It's possible to scale up your notebook server, but you would then pay for that extra capacity even while working on non-compute-intensive tasks (i.e. most of your time). A better idea is to run your notebook on a small server and submit compute-intensive tasks as independent jobs. SageMaker provides managed services for running data processing, model training, and hyperparameter tuning jobs. In this notebook, we'll demonstrate how to leverage these services to optimize the performance and cost of our tasks.

Specifically, we'll demonstrate two best practices: Jobs and Experiments.

---

## 1.A. SageMaker Jobs

[SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) processing, training, and tuning jobs allow data scientists to submit compute-heavy processes to external services. This keeps costs optimized and ensures that these tasks run in reproducible environments. It also improves data scientist productivity by allowing these jobs to run in the background, and it provides resiliency if something happens to your notebook environment.

![alt text](img/jobs.png "Jobs")

## 1.B. SageMaker Experiments

![alt text](img/experiments.png "Experiments")

[SageMaker Experiments](https://aws.amazon.com/blogs/aws/amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings) makes it easier to track data preparation and analysis steps. Organizing your ML project into experiments helps you manage large numbers of trials and alternative algorithms. Experiments also ensure that any artifacts you generate for production use can be traced back to their source.

## 2. Preparation

Let's start by specifying:

- The Python libraries that we'll use throughout the analysis
- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

### 2.A. Import Python Libraries

In [2]:
!pip install sagemaker-experiments
# !pip install xgboost==1.2.0 #this needs this specific version of xgboost

Collecting sagemaker-experiments
  Using cached sagemaker_experiments-0.1.35-py3-none-any.whl (42 kB)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.35


In [3]:
import os
import json
import boto3
import argparse
import numpy as np
import pandas as pd
import xgboost as xgb
from time import strftime
from botocore.client import ClientError

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score

import sagemaker
from sagemaker import get_execution_role, session
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tensorflow import TensorFlow
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

import pickle
import matplotlib.pyplot as plt 
import seaborn as sns


### 2.B. Create Some Necessary Clients

In [None]:
session = boto3.session.Session()
sm_session = sagemaker.session.Session()
region = session.region_name
role = get_execution_role()
s3 = boto3.client('s3', region_name=region)
account_id = boto3.client('sts').get_caller_identity().get('Account')

### 2.C. Create an Experiment

We create a new SageMaker experiment specific to our scientific goal, in this case to predict HER2 status.

In [None]:
create_date = strftime("%Y-%m-%d-%H-%M-%S")
brca_her2_experiment = Experiment.create(experiment_name = f"BRCA-HER2-{create_date}",
                                    description = "Predict HER2 status using TCGA RNAseq data.",
                                    tags = [{'Key': 'Creator', 'Value': 'bloyal'}])

### 2.D. Specify S3 Bucket and Prefix

In [None]:
# Define the S3 bucket name for this project (the bucket itself is created in section 3.A)
bucket_name = f"brca-her2-classifier-{account_id}"
print(f"S3 bucket name is {bucket_name}")
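
S3 bucket names must be globally unique, which is why the name above embeds the account ID. They also have to follow a few syntax rules. The sketch below checks a candidate name against the core rules only (3 to 63 characters; lowercase letters, digits, hyphens, and dots; starting and ending with a letter or digit). The helper `is_valid_bucket_name` and the sample account ID are hypothetical, and S3 enforces additional rules that this check does not cover.

```python
import re

# Core S3 naming rules only: 3-63 characters, lowercase letters, digits,
# hyphens and dots, beginning and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")

def is_valid_bucket_name(name: str) -> bool:
    """Hypothetical helper: True if `name` passes the basic syntax rules."""
    return bool(_BUCKET_RE.fullmatch(name)) and ".." not in name

example_account_id = "123456789012"  # placeholder, not a real account
example_bucket = f"brca-her2-classifier-{example_account_id}"
print(example_bucket, is_valid_bucket_name(example_bucket))
```

A name containing uppercase characters, such as `BRCA-HER2`, fails this check, so keeping the prefix lowercase avoids a failed `create_bucket` call later.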

### 2.E. Define Local Working Directories

In [None]:
WORKING_DIR = os.getcwd()
DATA_DIR = os.path.join(WORKING_DIR, "data")
print(f"Working directory is {WORKING_DIR}")
print(f"Data directory is {DATA_DIR}")

## 3. Data Preparation with Amazon SageMaker Processing

Amazon SageMaker Processing allows you to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation workloads on Amazon SageMaker. Processing jobs accept data from Amazon S3 as input and store data into Amazon S3 as output.

![processing](https://sagemaker.readthedocs.io/en/stable/_images/amazon_sagemaker_processing_image1.png)

Here, we'll import the dataset and transform it with SageMaker Processing, which can be used to process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server. In a typical SageMaker workflow, notebooks are only used for prototyping and can be run on relatively inexpensive and less powerful instances, while processing, training and model hosting tasks are run on separate, more powerful SageMaker-managed instances.  SageMaker Processing includes off-the-shelf support for Scikit-learn, as well as a Bring Your Own Container option, so it can be used with many different data transformation technologies and tasks.    

To use SageMaker Processing, simply supply a Python data preprocessing script as shown below.  For this example, we're using a SageMaker prebuilt Scikit-learn container, which includes many common functions for processing data.  There are few limitations on what kinds of code and operations you can run, and only a minimal contract:  input and output data must be placed in specified directories.  If this is done, SageMaker Processing automatically loads the input data from S3 and uploads transformed data back to S3 when the job is complete.
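
The directory contract can be tried out locally before submitting a job. The following is a minimal sketch, not the processing script used in this notebook: it reads every CSV file under `<local_path>/input` and writes the combined rows to `<local_path>/output/train`, with a temporary directory standing in for `/opt/ml/processing`. The function name `run_processing` and the file names are illustrative assumptions.

```python
import csv
import os
import tempfile

def run_processing(local_path: str) -> str:
    """Hypothetical processing step: read from input/, write to output/train/."""
    input_dir = os.path.join(local_path, "input")
    output_dir = os.path.join(local_path, "output", "train")
    os.makedirs(output_dir, exist_ok=True)

    # Collect the rows of every file found in the input directory
    rows = []
    for file_name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, file_name), newline="") as handle:
            rows.extend(csv.reader(handle))

    # Write the result where the job expects to find its output
    output_path = os.path.join(output_dir, "train.csv")
    with open(output_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return output_path

# Simulate the container layout in a temporary directory
with tempfile.TemporaryDirectory() as local_path:
    os.makedirs(os.path.join(local_path, "input"))
    with open(os.path.join(local_path, "input", "raw.csv"), "w", newline="") as handle:
        csv.writer(handle).writerows([["target", "gene_a"], [0, 1.5], [1, 2.5]])
    written = run_processing(local_path)
    with open(written, newline="") as handle:
        result_rows = list(csv.reader(handle))
    print(result_rows)
```

In a real job the script never touches S3 directly: the input directory is populated before the script starts, and whatever is left in the output directories is uploaded when it finishes.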

### 3.A. Upload Raw Data to S3

Download the raw data

In [None]:
# Define working directories
WORKING_DIR = os.getcwd()
DATA_DIR = os.path.join(WORKING_DIR, "data")
print(f"Working directory is {WORKING_DIR}")
print(f"Data directory is {DATA_DIR}")

# Get TCGA BRCA Gene Expression Data
!wget https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/HiSeqV2_PANCAN.gz -nc -P $DATA_DIR/input/raw/
!gzip -df $DATA_DIR/input/raw/HiSeqV2_PANCAN.gz

# Get TCGA BRCA Phenotype Data
!wget https://tcga.xenahubs.net/download/TCGA.BRCA.sampleMap/BRCA_clinicalMatrix -nc -P $DATA_DIR/input/raw/

In [None]:
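# Check if bucket already exists. If it does not, create it
try:
    s3.head_bucket(Bucket=bucket_name)
except ClientError:
    # Outside us-east-1, S3 requires the region as an explicit LocationConstraint
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={"LocationConstraint": region})
    print(f"Created Bucket: {bucket_name} in Region: {region}")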

In [None]:
clinical_source = sm_session.upload_data(f"{DATA_DIR}/input/raw/BRCA_clinicalMatrix", bucket=bucket_name, key_prefix='data/input')
RNAseq_source = sm_session.upload_data(f"{DATA_DIR}/input/raw/HiSeqV2_PANCAN", bucket=bucket_name, key_prefix='data/input')
print(f"Clinical phenotypes now available at {clinical_source}")
print(f"Normalized expression data now available at {RNAseq_source}")
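
Both files are uploaded under the same `data/input` prefix, which is the location a processing job can later read from. As a minimal sketch, assuming the layout `s3://{bucket}/{key_prefix}/{file name}` that the SageMaker Python SDK documents for single-file uploads, the returned URIs can be predicted as follows. The helper `expected_s3_uri` and the bucket name are hypothetical.

```python
import posixpath

def expected_s3_uri(bucket: str, key_prefix: str, local_file: str) -> str:
    """Hypothetical helper: URI expected for one file uploaded under `key_prefix`."""
    file_name = posixpath.basename(local_file)
    return f"s3://{bucket}/{posixpath.join(key_prefix, file_name)}"

example_uri = expected_s3_uri(
    "brca-her2-classifier-123456789012",  # placeholder bucket name
    "data/input",
    "data/input/raw/BRCA_clinicalMatrix",
)
print(example_uri)
```

Note that the local `raw/` subdirectory does not appear in the key: only the file name is appended to the prefix.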

### 3.B. Create SageMaker Processing Job Script

In [None]:
#Create folder for processing script
os.makedirs(os.path.join(WORKING_DIR,"scripts/processing"), exist_ok=True)

Let's define the script we want our processing job to use. Note that because this is run on a remote service, we don't need to install any of the dependencies locally!

In [None]:
%%writefile scripts/processing/processing.py

import os
import argparse
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def _parse_args():

    parser = argparse.ArgumentParser()
    
    parser.add_argument('--train_test_split_ratio', type=float, default=0.2)
    parser.add_argument('--local_path', type=str, default="/opt/ml/processing")

    return parser.parse_known_args()

if __name__ == "__main__":
        
    ### Command line parser
    args, _ = _parse_args()

    DATA_DIR = os.path.join(args.local_path, "input")
    print(f"Data directory is {DATA_DIR}")
    
    ### Load Gene Expression RNA-seq
    genom = pd.read_csv(os.path.join(DATA_DIR, "HiSeqV2_PANCAN"), sep='\t')
    genom_identifiers = genom["sample"].values.tolist()

    ### Load Phenotypes
    phenotypes = pd.read_csv(os.path.join(DATA_DIR, "BRCA_clinicalMatrix"),sep='\t')

    #### Keep `HER2_Final_Status_nature2012` target variables
    phenotypes_subset = phenotypes[["sampleID", "HER2_Final_Status_nature2012"]].reset_index(drop=True)
    phenotypes_subset.fillna("Negative", inplace=True)

    ### Transpose the gene expression dataset in order to join with phenotypes on sampleID
    genom_transpose = genom.set_index("sample").transpose().reset_index().rename(columns={"index": "sampleID"})

    ### Merge datasets
    df = pd.merge(phenotypes_subset, genom_transpose, on="sampleID", how="left")

    ### Encode target
    df["target"] = [0 if t == "Negative" else 1 for t in df['HER2_Final_Status_nature2012']]
    df = df.drop(['HER2_Final_Status_nature2012','sampleID'], axis=1)
    ## Move target to first column
    df.insert(loc=0, column='target', value=df.pop('target'))
    ## Drop rows with NaN values
    df = df.dropna()

    ### Train-Valid-Test split
    # Hold out 20% of the data for testing
    train_df, test_df = train_test_split(df, test_size=args.train_test_split_ratio)
    # Hold out an additional 20% of the remaining training data for validation
    train_df, val_df = train_test_split(train_df, test_size=args.train_test_split_ratio)

    print(f"The training data has {train_df.shape[0]} records and {train_df.shape[1]} columns.")
    print(f"The validation data has {val_df.shape[0]} records and {val_df.shape[1]} columns.")
    print(f"The test data has {test_df.shape[0]} records and {test_df.shape[1]} columns.")
   
    # Save data

    os.makedirs(os.path.join(args.local_path, "output/train"), exist_ok=True)
    training_output_path = os.path.join(args.local_path,'output/train/train.csv')
    train_df.to_csv(training_output_path, header=True, index=False)
    print(f"Training data saved to {training_output_path}")
    
    os.makedirs(os.path.join(args.local_path, "output/val"), exist_ok=True)
    val_output_path = os.path.join(args.local_path,'output/val/val.csv')
    val_df.to_csv(val_output_path, header=True, index=False)
    print(f"Validation data saved to {val_output_path}")
          
    os.makedirs(os.path.join(args.local_path, "output/test"), exist_ok=True)
    test_output_path = os.path.join(args.local_path,'output/test/test.csv')
    test_df.to_csv(test_output_path, header=True, index=False)
    print(f"Test data saved to {test_output_path}")        

In [None]:
### Uncomment to test processing script locally
# !python scripts/processing/processing.py --local_path data
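
Because the processing script applies the same hold-out ratio twice, the validation set is 20% of the remaining 80%, not 20% of the full dataset. The short sketch below works out the resulting fractions; `split_fractions` is a hypothetical helper, and actual row counts will differ slightly because of rounding.

```python
def split_fractions(ratio: float) -> dict:
    """Fractions of the full dataset after applying the same hold-out ratio twice."""
    test = ratio                      # first split: hold out `ratio` of all rows
    validation = (1 - ratio) * ratio  # second split: `ratio` of the remaining rows
    train = (1 - ratio) ** 2          # what is left after both splits
    return {"train": train, "validation": validation, "test": test}

fractions = split_fractions(0.2)
print({name: round(value, 2) for name, value in fractions.items()})
# {'train': 0.64, 'validation': 0.16, 'test': 0.2}
```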

### 3.C. Submit SageMaker Processing Job

In [None]:
# Define the inputs for the processing job
inputs = [ProcessingInput(source=f"s3://{bucket_name}/data/input/",
                          destination='/opt/ml/processing/input',
                          s3_data_distribution_type='ShardedByS3Key'
                         )         
         ]

# Define the outputs for the processing job
outputs = [ProcessingOutput(output_name='train',
                            source='/opt/ml/processing/output/train',
                            destination=f"s3://{bucket_name}/data/output/train/"
                           ),
           ProcessingOutput(output_name='validation',
                            source='/opt/ml/processing/output/val',
                            destination=f"s3://{bucket_name}/data/output/val/"
                           ),
           ProcessingOutput(output_name='test',
                            source='/opt/ml/processing/output/test',
                            destination=f"s3://{bucket_name}/data/output/test/"
                           )
          ]

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

processing_run_name = f"Processing-{strftime('%Y-%m-%d-%H-%M-%S')}"

sklearn_processor.run(
    job_name=processing_run_name,
    code='scripts/processing/processing.py',
    inputs=inputs,
    outputs=outputs, 
    experiment_config={
        "ExperimentName": brca_her2_experiment.experiment_name,
        "TrialComponentDisplayName": processing_run_name
        },
    wait=True
)

Download Processed Data from S3

In [None]:
sm_session.download_data(f"{DATA_DIR}/output/train", bucket=bucket_name, key_prefix='data/output/train/train.csv')
sm_session.download_data(f"{DATA_DIR}/output/val", bucket=bucket_name, key_prefix='data/output/val/val.csv')
sm_session.download_data(f"{DATA_DIR}/output/test", bucket=bucket_name, key_prefix='data/output/test/test.csv')

## 4. Model Training

Now that our training data is set up, we can train some models. To highlight the benefits of experiment tracking, we're going to train models using three different frameworks:
- The random forest model from scikit-learn
- A multi-layer perceptron (MLP) neural network in Keras
- The XGBoost algorithm

Since we're using SageMaker jobs to run our training, we don't need to install any additional libraries or spin up expensive compute resources on our notebook server. The jobs use their own dependencies and we're only charged for the time they run.

First, let's define some variables that all three training jobs will need.

In [None]:
# define the data type and paths to the training and validation datasets
content_type = "text/csv"

s3_input_train = sagemaker.inputs.TrainingInput(
    f"s3://{bucket_name}/data/output/train/train.csv", 
    content_type=content_type
)

s3_input_validation = sagemaker.inputs.TrainingInput(
    f"s3://{bucket_name}/data/output/val/val.csv", 
    content_type=content_type
)

s3_input_test = sagemaker.inputs.TrainingInput(
    f"s3://{bucket_name}/data/output/test/test.csv", 
    content_type=content_type
)

model_output_path = f"s3://{bucket_name}/"

### 4.A. Train Model Using a SKLearn Random Forest Algorithm

Create a trial for this model, then write the training script.

In [None]:
#Create a trial
rf_trial = Trial.create(
        trial_name=f"RF-Trial-{strftime('%Y-%m-%d-%H-%M-%S')}",
        experiment_name=brca_her2_experiment.experiment_name
    )

In [None]:
#Create folder for RF training script
os.makedirs(os.path.join(WORKING_DIR,"scripts/rf_train"), exist_ok=True)

In [None]:
%%writefile scripts/rf_train/rf_train.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score
from smexperiments.tracker import Tracker

def model_fn(model_dir):
    """Load model for inference"""
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

def _parse_args():
    """Parse job parameters."""
    
    parser = argparse.ArgumentParser()
    
    # Hyperparameters are described here.
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    
    parser.add_argument("--train-file", type=str, default="train.csv")
    parser.add_argument("--validation-file", type=str, default="val.csv")
    parser.add_argument("--test-file", type=str, default="test.csv")
    
    parser.add_argument("--target", type=str, default="target")

    return parser.parse_known_args()


if __name__ == "__main__":

    try:
        my_tracker = Tracker.load()
    except ValueError:
        my_tracker = Tracker.create()
    
    print("extracting arguments")
    args, _ = _parse_args()
    print(args)

    print("Preparing data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    validation_df = pd.read_csv(os.path.join(args.validation, args.validation_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    train_labels = np.array(train_df.pop("target"))
    validation_labels = np.array(validation_df.pop("target"))
    test_labels = np.array(test_df.pop("target"))
    
    train_np = np.array(train_df)
    validation_np = np.array(validation_df)    
    test_np = np.array(test_df)
        
    # Class imbalance is handled below with class_weight="balanced"

    # train
    print("training model")
    classifier = RandomForestClassifier(
        n_estimators=args.n_estimators, 
        min_samples_leaf=args.min_samples_leaf, 
        class_weight="balanced",
        n_jobs=-1, 
        verbose=1
    )

    classifier.fit(train_np, train_labels)
    
    print("Evaluating model")

    # evaluate test data
    test_predictions = classifier.predict(test_np)
    
    accuracy = accuracy_score(test_labels, test_predictions)
    my_tracker.log_metric(metric_name='test:accuracy', value=accuracy)  
    
    precision = precision_score(test_labels, test_predictions)
    my_tracker.log_metric(metric_name='test:precision', value=precision) 
    
    f1 = f1_score(test_labels, test_predictions)
    my_tracker.log_metric(metric_name='test:f1', value=f1)
    
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"F1 Score: {f1:.2f}")
        
    my_tracker.close()    
    
    print("Saving model")
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(classifier, path)
    print("Model saved to " + path)
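
The random forest above handles class imbalance with `class_weight="balanced"`, while the XGBoost script later in this notebook passes the ratio of negative to positive samples as `scale_pos_weight`. The sketch below uses made-up labels to show how the two relate. It assumes the "balanced" heuristic described in the scikit-learn documentation, `n_samples / (n_classes * class_count)`; both helper functions are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def negative_to_positive_ratio(labels):
    """Ratio of negative (0) to positive (1) labels."""
    counts = Counter(labels)
    return counts[0] / counts[1]

example_labels = [0] * 80 + [1] * 20  # made-up, imbalanced labels
weights = balanced_class_weights(example_labels)
ratio = negative_to_positive_ratio(example_labels)
print(weights, ratio)
```

With these labels the positive class receives four times the weight of the negative class under either convention, so the two approaches express the same relative emphasis.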

Create a `requirements.txt` file in the training script directory to install additional dependencies in the training container. This is a great way to install an extra package or two without creating your own container image from scratch!

In [None]:
!echo "sagemaker-experiments" > scripts/rf_train/requirements.txt

In [None]:
rf_job_name= f"RF-Training-Job-{strftime('%Y-%m-%d-%H-%M-%S')}"

rf_estimator = SKLearn(
    entry_point="rf_train.py",
    source_dir = "scripts/rf_train",   
    output_path = model_output_path,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    framework_version="0.23-1",
    enable_sagemaker_metrics=True,
    base_job_name=rf_job_name,
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 3,
        "target": "target",
    },
)

rf_estimator.fit(
    {'train': s3_input_train, 'validation': s3_input_validation, 'test': s3_input_test},
    job_name=rf_job_name,
    experiment_config={
            "TrialName": rf_trial.trial_name,
            "TrialComponentDisplayName": rf_job_name,
        },
    wait=False
)
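
The `hyperparameters` dictionary reaches the training script as command-line arguments, and each input channel is exposed through an `SM_CHANNEL_<NAME>` environment variable. The sketch below simulates that hand-off locally. It is an approximation under stated assumptions: `to_cli_args` is a hypothetical helper, and the exact way SageMaker serializes values may differ.

```python
import argparse
import os

def to_cli_args(hyperparameters: dict) -> list:
    """Hypothetical helper: flatten a dict into ['--key', 'value', ...]."""
    cli_args = []
    for key, value in hyperparameters.items():
        cli_args += [f"--{key}", str(value)]
    return cli_args

def parse_job_args(argv: list):
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # Channel locations fall back to environment variables set inside the container
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    # parse_known_args tolerates arguments that the script does not declare
    return parser.parse_known_args(argv)

os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"  # simulated container setting
argv = to_cli_args({"n-estimators": 100, "min-samples-leaf": 3, "extra-flag": "x"})
args, unknown = parse_job_args(argv)
print(args.n_estimators, args.train, unknown)
```

Using `parse_known_args` rather than `parse_args` means an undeclared argument is returned in `unknown` instead of aborting the job.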

---

In [None]:
# # Download model
# print(rf_job_name)
# print(bucket_name)
# sm_session.download_data("models", bucket=bucket_name, key_prefix=f"{rf_job_name}/output/model.tar.gz")
# !tar xvfz models/model.tar.gz
# from joblib import dump, load
# rf_model = load('model.joblib') 
# rf_model.get_params()

### 4.B. Train Model using a Keras MLP

In [None]:
#Create a trial
tf_trial = Trial.create(
        trial_name=f"TF-Trial-{strftime('%Y-%m-%d-%H-%M-%S')}",
        experiment_name=brca_her2_experiment.experiment_name
    )

In [None]:
#Create folder for Keras training script
os.makedirs(os.path.join(WORKING_DIR,"scripts/tf_train"), exist_ok=True)

In [None]:
%%writefile scripts/tf_train/tf_train.py
import argparse
import os
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, f1_score
from smexperiments.tracker import Tracker

import tensorflow as tf
from tensorflow.keras.utils import to_categorical  # public API path instead of the private tensorflow.python module
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, \
 Conv1D, MaxPool1D, Flatten, concatenate

def binary_mlp(metrics, output_bias=None):
    ### Setup loss and output node activation

    output_activation = "sigmoid"
    loss = tf.keras.losses.BinaryCrossentropy()#from_logits=True

    
    ### Gene Expression Encoder
    genom_input = Input(shape = (20530,),
                        name = 'genom_input'
                       )
    genom_layer = Dense(units = 64,
                        kernel_regularizer = tf.keras.regularizers.l2(0.001),
                        activation = 'relu',
                        name = 'genom_layer1'
                       )(genom_input)
    #genom_layer = BatchNormalization(name = 'genom_layer1_normalized')(genom_layer)
    genom_layer = Dense(units = 32,
                        kernel_regularizer = tf.keras.regularizers.l2(0.001),
                        activation = 'relu',
                        name = 'genom_layer2'
                       )(genom_layer)    
    

    X = BatchNormalization(name = 'X_normalized')(genom_layer)


    X = Dense(units = 32,
              activation = 'relu',
              kernel_regularizer = tf.keras.regularizers.l2(0.001),
              name = 'X1'
             )(X)
    X = Dense(units = 16,
              activation = 'relu',
              kernel_regularizer = tf.keras.regularizers.l2(0.001),
              name = 'X2'
             )(X)

    output = Dense(units = 1, activation = output_activation)(X)
    
    ### Compile the model
    model = tf.keras.Model(genom_input, output)

    model.compile(optimizer='adam',
                  loss=loss,
                  metrics=metrics
                 )
    
    return model

def _parse_args():
    """Parse job parameters."""
    
    parser = argparse.ArgumentParser()
    
    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.1)

    # input data and model directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    
    parser.add_argument("--train-file", type=str, default="train.csv")
    parser.add_argument("--validation-file", type=str, default="val.csv")
    parser.add_argument("--test-file", type=str, default="test.csv")
    
    parser.add_argument("--target", type=str, default="target")

    return parser.parse_known_args()


if __name__ =='__main__':

    try:
        my_tracker = Tracker.load()
    except ValueError:
        my_tracker = Tracker.create()
    
    print("extracting arguments")
    args, _ = _parse_args()
    print(args)

    print("Preparing data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    validation_df = pd.read_csv(os.path.join(args.validation, args.validation_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    train_labels = np.array(train_df.pop("target"))
    validation_labels = np.array(validation_df.pop("target"))
    test_labels = np.array(test_df.pop("target"))
    
    train_np = np.array(train_df)
    validation_np = np.array(validation_df)    
    test_np = np.array(test_df)
    
    # NOTE: these constants take precedence over the --epochs and --batch_size arguments
    EPOCHS = 150
    BATCH_SIZE = 32

    EARLY_STOPPING = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', 
        verbose=1,
        patience=10,
        mode='auto',
        restore_best_weights=True
    )

    # Instantiate classifier
    classifier = binary_mlp(metrics=["accuracy", "binary_accuracy"], output_bias=None)    
    
    # Fit classifier
    history = classifier.fit(x=train_np,
                                y=train_labels,
                                validation_data=(validation_np,validation_labels),
                                callbacks=[EARLY_STOPPING],               
                                batch_size=BATCH_SIZE,
                                epochs=EPOCHS,
                                verbose=1
    )
    
    print("Evaluating model")
    for epoch, value in enumerate(history.history["loss"]):
        my_tracker.log_metric(metric_name='train:loss', value=value, iteration_number=epoch)
        
    for epoch, value in enumerate(history.history["val_loss"]):
        my_tracker.log_metric(metric_name='validation:loss', value=value, iteration_number=epoch)        

    # evaluate test data
    # predict() runs in batches and returns a NumPy array; ravel() flattens it to 1-D
    test_predictions = classifier.predict(test_np).ravel()
    discrete_predictions = np.around(test_predictions).astype(int)
    
    accuracy = accuracy_score(test_labels, discrete_predictions)
    my_tracker.log_metric(metric_name='test:accuracy', value=accuracy) 
    
    precision = precision_score(test_labels, discrete_predictions)
    my_tracker.log_metric(metric_name='test:precision', value=precision)   
    
    f1 = f1_score(test_labels, discrete_predictions)
    my_tracker.log_metric(metric_name='test:f1', value=f1)
    
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"F1 Score: {f1:.2f}")
        
    my_tracker.close()
    
    print("Saving model")
    classifier.save(args.model_dir)
    print(f"Model saved to {args.model_dir}")
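
The training script relies on early stopping with `patience=10` and `restore_best_weights=True`. The following is a simplified, framework-free sketch of that patience rule, not the Keras implementation: it ignores options such as a minimum improvement threshold, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a simplified patience rule."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop here; the weights from best_epoch would be restored
                return epoch, best_epoch
    # Training ran out of epochs without triggering the rule
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 2
example_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54]
stop_epoch, best_epoch = early_stopping_epoch(example_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With a patience of 10 and an upper limit of 150 epochs, training often ends well before the limit, and the model that is saved corresponds to the epoch with the lowest validation loss.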

In [None]:
!echo "sagemaker-experiments" > scripts/tf_train/requirements.txt

In [None]:
tf_job_name= f"TF-Training-Job-{strftime('%Y-%m-%d-%H-%M-%S')}"

tf_estimator = TensorFlow(
    entry_point="tf_train.py",
    source_dir="scripts/tf_train",
    output_path = model_output_path,
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    enable_sagemaker_metrics=True,

    framework_version="2.2",
    py_version="py37",
    metric_definitions=[
        {"Name": "test:accuracy", "Regex": "Accuracy: ([0-9.]+)$"},
        {"Name": "test:precision", "Regex": "Precision: ([0-9.]+)$"},
        {"Name": "test:f1", "Regex": "F1 Score: ([0-9.]+)$"},        
    ]
)

tf_estimator.fit(
    {'train': s3_input_train, 'validation': s3_input_validation, 'test': s3_input_test},
    job_name=tf_job_name,
    experiment_config={
            "TrialName": tf_trial.trial_name,
            "TrialComponentDisplayName": tf_job_name,
        },
    wait=False
)
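
The `metric_definitions` above tell SageMaker which values to extract from the training logs. The sketch below applies the same regular expressions to made-up log lines with Python's `re` module. This only approximates what the service does, and the log values are illustrative.

```python
import re

metric_definitions = [
    {"Name": "test:accuracy", "Regex": "Accuracy: ([0-9.]+)$"},
    {"Name": "test:precision", "Regex": "Precision: ([0-9.]+)$"},
    {"Name": "test:f1", "Regex": "F1 Score: ([0-9.]+)$"},
]

# Made-up log lines in the format printed by the training script
example_log = ["Evaluating model", "Accuracy: 0.93", "Precision: 0.81", "F1 Score: 0.78"]

captured = {}
for line in example_log:
    for definition in metric_definitions:
        match = re.search(definition["Regex"], line)
        if match:
            # The first capture group holds the metric value
            captured[definition["Name"]] = float(match.group(1))
print(captured)
```

Because the script prints these values with `:.2f`, metrics captured from the logs carry only two decimal places, whereas the values logged through the tracker keep full precision.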

### 4.C. Train Model Using the XGBoost Algorithm

In [None]:
#Create a trial
xgb_trial = Trial.create(
        trial_name=f"XGBoost-Trial-{strftime('%Y-%m-%d-%H-%M-%S')}",
        experiment_name=brca_her2_experiment.experiment_name
    )

In [None]:
#Create folder for XGB training script
os.makedirs(os.path.join(WORKING_DIR,"scripts/xgb_train"), exist_ok=True)

In [None]:
%%writefile scripts/xgb_train/xgb_train.py

import argparse
import json
import logging
import os
import numpy as np
import pandas as pd
import pickle as pkl
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, f1_score
from smexperiments.tracker import Tracker

def model_fn(model_dir):
    """Deserialize and return fitted model.

    Note that this should have the same name as the serialized model in the _xgb_train method
    """
    model_file = 'xgboost-model.pkl'
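    # Use a context manager so the file handle is closed after loading
    with open(os.path.join(model_dir, model_file), 'rb') as model_handle:
        booster = pkl.load(model_handle)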
    return booster

def _parse_args():
    """Parse job parameters."""
    
    parser = argparse.ArgumentParser()
    
    # Hyperparameters are described here.
    parser.add_argument('--objective', type=str, default="binary:logistic")
    parser.add_argument('--booster', type=str, default="gbtree")
    parser.add_argument('--eval_metric', type=str, default="error")
    parser.add_argument('--n_estimators', type=int, default=15)
    parser.add_argument('--max_depth', type=int, default=6)
    parser.add_argument('--min_child_weight', type=float, default=1)
    parser.add_argument('--subsample', type=float, default=1)
    parser.add_argument('--gamma', type=float, default=0)  
    parser.add_argument('--alpha', type=float, default=0)
    # eta (the learning rate) is tuned in section 6; 0.3 is the XGBoost default
    parser.add_argument('--eta', type=float, default=0.3)

    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))

    parser.add_argument("--train-file", type=str, default="train.csv")
    parser.add_argument("--validation-file", type=str, default="val.csv")
    parser.add_argument("--test-file", type=str, default="test.csv")

    return parser.parse_known_args()

if __name__ == '__main__':

    try:
        my_tracker = Tracker.load()
    except ValueError:
        my_tracker = Tracker.create()

    print("extracting arguments")
    args, _ = _parse_args()
    print(args)

    hyper_params_dict = {
        'objective': args.objective,
        'booster': args.booster,
        'eval_metric': args.eval_metric,
        'n_estimators': args.n_estimators,
        'max_depth': args.max_depth,
        'min_child_weight': args.min_child_weight,
        'subsample': args.subsample,
        'gamma': args.gamma,
        'alpha': args.alpha,
        'learning_rate': args.eta  # scikit-learn API name for eta
    }

    print("Preparing data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    validation_df = pd.read_csv(os.path.join(args.validation, args.validation_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    train_labels = np.array(train_df.pop("target"))
    validation_labels = np.array(validation_df.pop("target"))
    test_labels = np.array(test_df.pop("target"))
    
    train_np = np.array(train_df)
    validation_np = np.array(validation_df)    
    test_np = np.array(test_df)
        
    # Use the scale_pos_weight parameter to account for the imbalanced classes in our data
    pos_weight = float(np.sum(train_labels == 0) / np.sum(train_labels == 1))

    classifier = xgb.XGBClassifier(
        scale_pos_weight=pos_weight, # Use pos_weight value calculated above to account for unbalanced classes
        use_label_encoder=False,
        **hyper_params_dict
    )

    print("Fitting model")
    classifier.fit(
        train_np,
        train_labels,
        eval_set=[(train_np, train_labels), (validation_np, validation_labels)], 
        verbose=True
    )
    
    print("Evaluating model")
    results = classifier.evals_result()    
    for epoch, value in enumerate(results["validation_0"]["error"]):
        my_tracker.log_metric(metric_name='train:error', value=value, iteration_number=epoch)
        
    for epoch, value in enumerate(results["validation_1"]["error"]):
        print(f"[{epoch}]#011validation-error:{value}")  # Required for SageMaker to pick up this metric in the logs during HPO (See section 6)
        my_tracker.log_metric(metric_name='validation:error', value=value, iteration_number=epoch)

    # evaluate test data
    test_predictions = classifier.predict(test_np)
    
    accuracy = accuracy_score(test_labels, test_predictions)
    my_tracker.log_metric(metric_name='test:accuracy', value=accuracy)  
    
    precision = precision_score(test_labels, test_predictions)
    my_tracker.log_metric(metric_name='test:precision', value=precision)   
    
    f1 = f1_score(test_labels, test_predictions)
    my_tracker.log_metric(metric_name='test:f1', value=f1)
    
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"F1 Score: {f1:.2f}")
        
    my_tracker.close()    
    
    print("Saving model")
    path = os.path.join(args.model_dir, "xgboost-model.pkl")
    with open(path, 'wb') as f:
        pkl.dump(classifier, f)
    print("Model saved to " + path)


In [None]:
!echo "sagemaker-experiments" > scripts/xgb_train/requirements.txt
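
The training script prints one line per boosting round in a fixed format so that the `validation:error` metric can be collected from the job logs. The sketch below shows how such lines can be matched with a regular expression. The pattern is an assumption modeled on the printed format, not necessarily the exact expression SageMaker applies, and `sample_log` is made-up data.

```python
import re

# Assumed pattern: "[<round>]", then "#011" (the octal escape for a tab
# character as it appears in the job logs), then "validation-error:<value>".
VALIDATION_ERROR_PATTERN = re.compile(r"\[(\d+)\].*#011validation-error:(\S+)")

# Made-up log lines in the format printed by the training script.
sample_log = [
    "Fitting model",
    "[0]#011validation-error:0.125",
    "[1]#011validation-error:0.1",
]

def extract_validation_errors(lines):
    """Return (round, error) pairs for every line that matches the pattern."""
    matches = (VALIDATION_ERROR_PATTERN.search(line) for line in lines)
    return [(int(m.group(1)), float(m.group(2))) for m in matches if m]

validation_errors = extract_validation_errors(sample_log)
print(validation_errors)
```

Lines that do not follow the format, such as `Fitting model`, are simply skipped, which is why the script prints the metric in this exact shape.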

In [None]:
xgb_job_name = f"XGB-Training-Job-{strftime('%Y-%m-%d-%H-%M-%S')}"

hyper_params_dict = {
    'objective':             "binary:logistic",
    'booster':               "gbtree",
    'eval_metric':           "error",
    'n_estimators':          15,
    'max_depth':             6, 
    'min_child_weight':      1,
    'subsample':             1,
    'gamma':                 0,
    'alpha':                 0
}

xgb_estimator = XGBoost(entry_point = "xgb_train.py", 
                    source_dir = "scripts/xgb_train",
                    output_path = model_output_path,
                    framework_version='1.2-1',
                    hyperparameters=hyper_params_dict,
                    role=role,
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    enable_sagemaker_metrics=True,
                   )


xgb_estimator.fit(
    {'train': s3_input_train, 'validation': s3_input_validation,'test': s3_input_test},
    job_name=xgb_job_name,
    experiment_config={
            "TrialName": xgb_trial.trial_name,
            "TrialComponentDisplayName": xgb_job_name,
        },
    logs=True,
    wait=True)
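
The training script sets `scale_pos_weight` to the ratio of negative to positive training samples, so that errors on the less frequent positive class carry more weight. Here is a minimal sketch of that calculation in plain Python. The `labels` list is made up for illustration and is not taken from the BRCA data.

```python
# Made-up binary labels: 8 negative samples and 2 positive samples.
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

n_negative = sum(1 for y in labels if y == 0)
n_positive = sum(1 for y in labels if y == 1)

# Same ratio the training script computes with NumPy: negatives / positives.
pos_weight = n_negative / n_positive
print(f"scale_pos_weight = {pos_weight}")
```

A value above 1 means the positive class is the minority, and each positive sample is weighted more heavily during training.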

## 5. Model Evaluation

### 5.A. Download and Run the Trained XGBoost Model

In Notebook 1, we used a confusion matrix to evaluate the accuracy of our model. Let's download our trained XGBoost model and do the same thing here.

First, we download the model artifact from S3 and load it into our notebook.

In [None]:
sm_session.download_data("models", bucket=bucket_name, key_prefix=f"{xgb_job_name}/output/model.tar.gz")
!tar xvfz models/model.tar.gz -C models

with open("models/xgboost-model.pkl", 'rb') as f:
    loaded_model = pickle.load(f)
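
If the `tar` command is not available in your environment, the same extract-and-load step can be done with Python's standard library. The sketch below is self-contained. It first builds a stand-in `model.tar.gz` in a temporary directory, holding a pickled dictionary as a hypothetical placeholder for the real model artifact, and then extracts and unpickles it.

```python
import os
import pickle
import tarfile
import tempfile

work_dir = tempfile.mkdtemp()

# Build a stand-in artifact; a real training job produces model.tar.gz for you.
placeholder_model = {"name": "placeholder-model"}
pickle_path = os.path.join(work_dir, "xgboost-model.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump(placeholder_model, f)

archive_path = os.path.join(work_dir, "model.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(pickle_path, arcname="xgboost-model.pkl")

# Extract the archive and load the pickled object, as the shell command does.
extract_dir = os.path.join(work_dir, "models")
os.makedirs(extract_dir, exist_ok=True)
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=extract_dir)

with open(os.path.join(extract_dir, "xgboost-model.pkl"), "rb") as f:
    restored_model = pickle.load(f)

print(restored_model)
```

Only unpickle files that come from a source you trust, because loading a pickle can execute arbitrary code.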

Next, we read in the test data and separate it into features and labels.

In [None]:
test_df = pd.read_csv(f"{DATA_DIR}/output/test/test.csv")
test_labels = np.array(test_df.pop("target"))
test_np = np.array(test_df)

Finally, we define a function for generating a confusion matrix and use it to analyze our test predictions.

In [None]:
# Create a custom function for generating a confusion matrix for a given classification threshold p
def plot_cm(labels, predictions, p=0.5):
    cm = confusion_matrix(labels, predictions > p)
    plt.figure(figsize=(5,5))
    sns.heatmap(cm, annot=True, fmt="d")
    plt.title('Confusion matrix @{:.2f}'.format(p))
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    
    if len(set(list(labels))) == 2:
        print('Correctly un-detected (True Negatives): ', cm[0][0])
        print('Incorrectly detected (False Positives): ', cm[0][1])
        print('Misses (False Negatives): ', cm[1][0])
        print('Hits (True Positives): ', cm[1][1])
        print('Total actual positives: ', np.sum(cm[1]))

In [None]:
# evaluate predictions
test_predictions = loaded_model.predict(test_np)
accuracy = accuracy_score(test_labels, test_predictions)
precision = precision_score(test_labels, test_predictions)
f1 = f1_score(test_labels, test_predictions)

plot_cm(test_labels, np.array(test_predictions))

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"F1 Score: {f1:.2f}")
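
Accuracy, precision, and F1 can all be derived from the four cells of a binary confusion matrix. As a worked example, the sketch below computes them from made-up counts, which are not results from this notebook.

```python
# Made-up confusion-matrix counts for illustration only.
tn, fp, fn, tp = 50, 5, 10, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are detected
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

Because F1 combines precision and recall, it is a more informative summary than accuracy when the classes are imbalanced.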

### 5.B. Compare Model Results Using SageMaker Experiments

SageMaker Experiments saves key information about our models for easy viewing and comparison in the SageMaker Studio UI.

To start, click on the SageMaker Resources icon on the Studio sidebar and select `Experiments and trials` from the menu. To view information about your experiment, click on its name (it should start with "BRCA-HER2-") and then select `Open in trial component list`.

![alt text](img/sm-resources-tab.png "Studio Resources")

The Trial Component list has a record for each of the training jobs, plus the processing job. You can click on a trial component name for more information about that job.

![alt text](img/trial-component-list.png "Trial Component List")

We can compare the performance of our model training jobs by adding an additional metric to the table. To do this, click on the gear icon on the Studio sidebar and then select `test:f1` in the Metrics section.

![alt text](img/metrics.png "Metrics")

Now we can see that the XGBoost model had the highest F1 score on the test data.

![alt text](img/tc-list-2.png "Updated Trial Component List")

You can view the same information programmatically by using the `ExperimentAnalytics` class.

In [None]:
search_expression = {
    "Filters": [
        {
            "Name": "DisplayName",
            "Operator": "Contains",
            "Value": "Training",
        }
    ],
}

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sm_session,
    experiment_name=brca_her2_experiment.experiment_name,
    search_expression=search_expression,
    sort_by="metrics.test:f1.last",
    sort_order="Descending",
    metric_names=["test:f1"],
    parameter_names=["SageMaker.InstanceType"],
)

trial_component_analytics.dataframe()

## 6. Hyperparameter Optimization

In the previous section, we saw that our XGBoost classifier gave the best results on our test dataset. However, we can likely improve its accuracy further through hyperparameter optimization (HPO). During HPO, we repeatedly train our model with small changes to one or more parameters each time. SageMaker Training is a great fit for this because it allows us to run multiple training jobs in parallel.
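
Conceptually, a tuning job proposes a set of hyperparameter values, trains a model with them, records the objective metric, and repeats. The sketch below illustrates that loop with a plain random search over a made-up objective function. Both `toy_validation_error` and `random_search` are hypothetical stand-ins. A SageMaker tuning job launches a real training job for each candidate and, by default, uses a Bayesian search strategy rather than random search.

```python
import random

def toy_validation_error(alpha, eta):
    """Hypothetical stand-in for a training job; lowest error at alpha=10, eta=0.3."""
    return 0.05 + 0.0001 * abs(alpha - 10) + 0.5 * (eta - 0.3) ** 2

def random_search(n_trials, seed=0):
    """Evaluate n_trials random candidates and return the best (error, params) pair."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        candidate = {"alpha": rng.uniform(0, 250), "eta": rng.uniform(0.1, 0.5)}
        error = toy_validation_error(**candidate)
        trials.append((error, candidate))
    # The objective is minimized, so the best trial has the smallest error.
    return min(trials, key=lambda trial: trial[0])

best_error, best_params = random_search(n_trials=25)
print(f"Best error: {best_error:.4f} with {best_params}")
```

Running more trials can only keep or improve the best error found so far, which is why the number of jobs is the main lever for trading cost against model quality.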

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)


In [None]:
hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 250, scaling_type="Auto"),
    "eta": ContinuousParameter(0.1, 0.5, scaling_type="Auto"),
}
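
The `scaling_type` argument controls how values are drawn from a continuous range. With `"Auto"`, SageMaker chooses the scale itself. To show why the choice matters, the sketch below draws values from the `eta` range on a linear and on a logarithmic scale. The two sampling functions are simplified illustrations, not SageMaker's implementation. A logarithmic scale needs a lower bound greater than 0, so it could not be applied to the `alpha` range, which starts at 0.

```python
import math
import random

def sample_linear(low, high, rng):
    """Draw uniformly between low and high."""
    return rng.uniform(low, high)

def sample_logarithmic(low, high, rng):
    """Draw uniformly in log space, then map back; requires low > 0."""
    if low <= 0:
        raise ValueError("Logarithmic scaling needs a lower bound greater than 0")
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
n_draws = 1000
linear_draws = [sample_linear(0.1, 0.5, rng) for _ in range(n_draws)]
log_draws = [sample_logarithmic(0.1, 0.5, rng) for _ in range(n_draws)]

# Compare how many draws fall in the lower half of the range.
midpoint = 0.3
linear_share = sum(v < midpoint for v in linear_draws) / n_draws
log_share = sum(v < midpoint for v in log_draws) / n_draws
print("Share below 0.3 (linear):", linear_share)
print("Share below 0.3 (logarithmic):", log_share)
```

Logarithmic sampling concentrates more draws near the low end of the range, which is useful for parameters whose effect changes by orders of magnitude.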

In [None]:
tuner_log = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name = "validation:error",
    objective_type = "Minimize",
    hyperparameter_ranges = hyperparameter_ranges,
    max_jobs=25,
    max_parallel_jobs=10,
)

tuner_log.fit(
    {'train': s3_input_train, 'validation': s3_input_validation,'test': s3_input_test},
)
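
With `max_jobs=25` and `max_parallel_jobs=10`, the tuner keeps at most 10 training jobs running at a time until 25 have been launched. As rough arithmetic, and assuming every job takes the same amount of time (a simplification, since in practice a new job starts as soon as a slot frees up), the jobs run in the following number of waves:

```python
import math

max_jobs = 25
max_parallel_jobs = 10

# If every job took the same time, jobs would run in this many back-to-back waves.
n_waves = math.ceil(max_jobs / max_parallel_jobs)
jobs_in_last_wave = max_jobs - (n_waves - 1) * max_parallel_jobs
print(f"{n_waves} waves, with {jobs_in_last_wave} jobs in the last wave")
```

Higher parallelism finishes sooner, but a search strategy that learns from completed jobs gets fewer rounds of feedback to guide later candidates.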

View tuning job results

In [None]:
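# Replace this job name with the name of your own tuning job,
# for example the value of tuner_log.latest_tuning_job.name after calling fit()
tuner_log = HyperparameterTuner.attach("sagemaker-xgboost-220126-1405", estimator_cls="sagemaker.xgboost.estimator.XGBoost")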

In [None]:
tuner_description = tuner_log.describe()
objective_name = tuner_description["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]["MetricName"]
tuner = tuner_log.analytics()
tuner.dataframe().sort_values(by="FinalObjectiveValue")

In [None]:


# tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(
#     HyperParameterTuningJobName=tuning_job_name
# )

# status = tuning_job_result["HyperParameterTuningJobStatus"]
# if status != "Completed":
#     print("Reminder: the tuning job has not been completed.")

# job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
# print("%d training jobs have completed" % job_count)

# objective = tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]
# is_minimize = objective["Type"] != "Maximize"
# objective_name = objective["MetricName"]

In [None]:
import bokeh
import bokeh.io
import bokeh.layouts  # used below to stack the figures in a column

bokeh.io.output_notebook()
from bokeh.plotting import figure, show
from bokeh.models import HoverTool

class HoverHelper:
    def __init__(self, tuning_analytics):
        self.tuner = tuning_analytics

    def hovertool(self):
        tooltips = [
            ("FinalObjectiveValue", "@FinalObjectiveValue"),
            ("TrainingJobName", "@TrainingJobName"),
        ]
        for k in self.tuner.tuning_ranges.keys():
            tooltips.append((k, "@{%s}" % k))

        ht = HoverTool(tooltips=tooltips)
        return ht

    def tools(self, standard_tools="pan,crosshair,wheel_zoom,zoom_in,zoom_out,undo,reset"):
        return [self.hovertool(), standard_tools]


hover = HoverHelper(tuner)

In [None]:
# Keep only the jobs that reported a finite objective value
full_df = tuner.dataframe()
df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]


ranges = tuner.tuning_ranges
figures = []
for hp_name, hp_range in ranges.items():
    categorical_args = {}
    if hp_range.get("Values"):
        # This is marked as categorical.  Check if all options are actually numbers.
        def is_num(x):
            try:
                float(x)
                return 1
            except (TypeError, ValueError):
                return 0

        vals = hp_range["Values"]
        if sum([is_num(x) for x in vals]) == len(vals):
            # Bokeh has issues plotting a "categorical" range that's actually numeric, so plot as numeric
            print("Hyperparameter %s is tuned as categorical, but all values are numeric" % hp_name)
        else:
            # Set up extra options for plotting categoricals.  A bit tricky when they're actually numbers.
            categorical_args["x_range"] = vals

    # Now plot it
    p = figure(
        plot_width=500,
        plot_height=500,
        title="Objective vs %s" % hp_name,
        tools=hover.tools(),
        x_axis_label=hp_name,
        y_axis_label=objective_name,
        **categorical_args,
    )
    p.circle(source=df, x=hp_name, y="FinalObjectiveValue")
    figures.append(p)
show(bokeh.layouts.Column(*figures))