### Introduction

Jupyter notebooks are divided into cells that can contain markdown or code that you can run interactively from the notebook interface. You can progress through the cells in the notebook by clicking the play button in the notebook tab's toolbar:

![](assets/2024-09-06-10-35-01.png)

Whenever you have completed a cell, click the play button to advance to the next cell and continue with the lab.

After clicking the play button, the status on the left-hand side of the bottom status bar will change from **Idle** to **Busy**:

![](assets/2024-09-06-10-46-24.png)

Wait for the status to change back to **Idle** before proceeding to the next cell.

#### Notebook Overview

This notebook walks through the process for preparing data and then training a model using Amazon SageMaker HyperParameter Tuning.

The dataset you will be using is the [public domain Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is a popular dataset for demonstrating classification with machine learning. The dataset consists of 150 samples of iris flowers. Using the numeric features of the flowers, such as sepal length and width, you will train a model that can classify each flower into one of three Iris species: setosa, versicolor, or virginica.

You will use Python 3 as the programming language. This notebook is built on the SageMaker notebook `conda_python3` environment. `conda` is the package and environment manager distributed with Anaconda, a data science platform. The environment comes with many common Python machine learning and data science libraries already installed.

### Installing Dependencies

To begin, make sure the library versions this lab requires are installed. Run the following cell to install pinned versions of `boto3` and `sagemaker`.

In [None]:
%%time
!pip install boto3==1.35.4 sagemaker==2.229.0 
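
After the install completes, you may want to confirm which versions are actually active in the kernel. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_versions` are hypothetical and not part of the lab.

```python
from importlib import metadata

# Versions pinned by the install cell in this notebook.
REQUIRED = {"boto3": "1.35.4", "sagemaker": "2.229.0"}


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(required):
    """Map each package name to a tuple of (installed version, whether it matches the pin)."""
    report = {}
    for package, pinned in required.items():
        found = installed_version(package)
        report[package] = (found, found == pinned)
    return report


for package, (found, ok) in check_versions(REQUIRED).items():
    print(f"{package}: installed={found} pinned={REQUIRED[package]} match={ok}")
```

If a package reports a mismatch, restarting the kernel after the install is usually enough for the new version to be picked up.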

### Importing the Required Libraries

You are using the popular `numpy` and `pandas` libraries for data manipulation, `scikit-learn` for splitting the dataset, and `boto3` and `sagemaker` for interacting with Amazon SageMaker and Amazon S3 resources. Run the following cell to import them, look up the current AWS Region, and create a SageMaker client.

In [None]:
import sagemaker
import boto3

import numpy as np                                      # For performing matrix operations and numerical processing
import pandas as pd                                     # For manipulating tabular data
from sklearn.model_selection import train_test_split    # For splitting the data into training and testing sets

import os


region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')

### Getting the Notebook's IAM Role

To interact with SageMaker, you need an IAM role that grants the necessary permissions. A role was created for you during lab setup that grants access to Amazon S3 and permission to create and run Amazon SageMaker HyperParameter Tuning jobs.

Run the following cell to get the role associated with the notebook instance and store it in a variable for use later.

In [None]:
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

### Configuring Storage

To train a model, you need to access a dataset, prepare the data for training, and store the model artifacts. Using Amazon S3 for storage is a best practice when working with Amazon SageMaker.

The following cell retrieves the name of a bucket beginning with `lab-notebook-` that was created for you during lab setup.

Run the following cell to configure storage for your HyperParameter Tuning job.

In [3]:
prefix = "iris"
bucket = next((bucket['Name'] for bucket in boto3.client('s3').list_buckets()['Buckets'] if bucket['Name'].startswith('lab-notebook-')), None)


sess = sagemaker.Session(
    default_bucket = bucket
)
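
The bucket lookup relies on `next(..., None)`, which silently yields `None` when no bucket matches. The sketch below isolates that pattern and adds an explicit check. The function name `find_lab_bucket` and the example bucket names are hypothetical; the list is only shaped like the `Buckets` entry returned by `list_buckets()`.

```python
def find_lab_bucket(buckets, name_prefix="lab-notebook-"):
    """Return the first bucket name starting with name_prefix, or None if there is none."""
    return next(
        (b["Name"] for b in buckets if b["Name"].startswith(name_prefix)),
        None,
    )


# Hypothetical listing, shaped like boto3.client('s3').list_buckets()['Buckets'].
example_buckets = [
    {"Name": "some-other-bucket"},
    {"Name": "lab-notebook-example"},
]

bucket_name = find_lab_bucket(example_buckets)
if bucket_name is None:
    # Failing early is clearer than passing None on as a bucket name.
    raise RuntimeError("No bucket starting with 'lab-notebook-' was found")
print(bucket_name)
```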

### Loading and Displaying the Data

To load the `iris.csv` dataset, you will use the `pandas` library. The dataset contains four features: sepal length, sepal width, petal length, and petal width. The target column is the species of the iris flower.

Run the following cell to load the dataset into a pandas DataFrame and display a preview of its rows.

In [None]:

data = pd.read_csv('./iris.csv', sep=',')
pd.set_option('display.max_columns', 11)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 5)         # Keep the output on one page
data


### Preparing the Data for Training

Typically, when training a machine learning model, you split the dataset into training and validation sets. The training set is used to train the model, and the validation set is used to evaluate the model's performance. In this lab, you also set aside a small test set, which the tuning job does not use.

As this dataset is small, no preprocessing is required apart from splitting the data. When working with larger and more complex datasets, you may need to perform additional preprocessing steps, such as normalizing the data or encoding categorical variables.

For CSV training data, the SageMaker built-in XGBoost algorithm expects the target variable in the first column and no header row, so you reorder the columns before writing each file.

Run the following cell to split the dataset into training, validation, and test sets, write each split to a CSV file with the target column first, and display the first few rows of each split.

In [None]:
# First, split the data into training (70%) and remaining data (30%)
train_data, remaining_data = train_test_split(data, test_size=0.3, random_state=1729, stratify=data['Species'])

# Then, split the remaining data into validation (20% of original data) and test (10% of original data)
validation_data, test_data = train_test_split(remaining_data, test_size=1/3, random_state=1729, stratify=remaining_data['Species'])  # 1/3 of 30% is 10%

# Write each split to CSV with 'Species' as the first column, keeping all other columns, and no header or index
train_data[['Species'] + [col for col in train_data.columns if col != 'Species']].to_csv('train.csv', index=False, header=False)
validation_data[['Species'] + [col for col in validation_data.columns if col != 'Species']].to_csv('validation.csv', index=False, header=False)
test_data[['Species'] + [col for col in test_data.columns if col != 'Species']].to_csv('test.csv', index=False, header=False)

# Display the first few rows of each dataset to verify the split
print("Training Data:")
print(train_data.head())
print("\nValidation Data:")
print(validation_data.head())
print("\nTest Data:")
print(test_data.head())
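
The two chained splits are easier to follow with the arithmetic written out. The sketch below assumes the 150-row dataset described earlier, with 50 rows per species, and uses exact fractions so that float rounding does not affect the counts. The function name `two_stage_split` is hypothetical, and rounding each held-out count up is an assumption of this sketch.

```python
import math
from fractions import Fraction


def two_stage_split(n_rows, first_test_size, second_test_size):
    """Return (train, validation, test) row counts for two chained splits."""
    remaining = math.ceil(n_rows * first_test_size)   # rows held out by the first split
    test = math.ceil(remaining * second_test_size)    # rows held out again by the second split
    validation = remaining - test
    train = n_rows - remaining
    return train, validation, test


# Whole dataset: 30% held out, then one third of that becomes the test set.
totals = two_stage_split(150, Fraction(3, 10), Fraction(1, 3))
print("train/validation/test rows:", totals)

# With stratification, each species (50 rows) is divided in the same proportions.
per_species = two_stage_split(50, Fraction(3, 10), Fraction(1, 3))
print("rows per species:", per_species)
```

Under these assumptions the split works out to 105 training, 30 validation, and 15 test rows, which matches the 70% / 20% / 10% proportions named in the code comments.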

### Uploading the Data to Amazon S3

Now that you have split the data into training and validation datasets, you have to make them available to the SageMaker service.

The following cell creates Amazon S3 prefixes and URIs for the training and validation datasets. The datasets will be stored in different prefixes within the same bucket. You will use the S3 URIs when configuring the SageMaker HyperParameter Tuning job.

Run the following cell to upload the training and validation datasets to the Amazon S3 bucket.

In [7]:
# S3 key prefixes for each dataset split
train_s3_key = '{}/train'.format(prefix)
validation_s3_key = '{}/validation'.format(prefix)

# S3 URIs that the tuning job reads from; each must match the prefix its file is uploaded under
s3_input_train = 's3://{}/{}'.format(bucket, train_s3_key)
s3_input_validation = 's3://{}/{}'.format(bucket, validation_s3_key)

# Upload each CSV file under its prefix (S3 keys always use '/' as the separator)
s3 = boto3.Session().resource('s3')
s3.Bucket(bucket).Object('{}/train.csv'.format(train_s3_key)).upload_file('train.csv')
s3.Bucket(bucket).Object('{}/validation.csv'.format(validation_s3_key)).upload_file('validation.csv')
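
Because the tuning job later reads each dataset by S3 prefix, the URI you pass must be a prefix of the key the file was uploaded under. The following sketch checks that relationship with plain string handling. The helper names `s3_location` and `uri_covers_key` and the bucket name are hypothetical.

```python
import posixpath


def s3_location(bucket, key_prefix, filename):
    """Return (object_key, prefix_uri) for a file stored under key_prefix."""
    object_key = posixpath.join(key_prefix, filename)      # S3 keys always use '/'
    prefix_uri = "s3://{}/{}".format(bucket, key_prefix)
    return object_key, prefix_uri


def uri_covers_key(prefix_uri, bucket, object_key):
    """Check that an S3 prefix URI would match the given object key."""
    object_uri = "s3://{}/{}".format(bucket, object_key)
    return object_uri.startswith(prefix_uri)


example_bucket = "lab-notebook-example"   # hypothetical bucket name
key, uri = s3_location(example_bucket, "iris/validation", "validation.csv")
print(key)
print(uri)
print(uri_covers_key(uri, example_bucket, key))
```

A URI with an extra path segment, such as `iris/validation/validation/`, would not cover the uploaded key, and the channel would find no data.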


### Configuring a HyperParameter Tuning Job

The following cell creates a variable named `tuning_job_config` that contains the configuration for the SageMaker HyperParameter Tuning job.

The `ParameterRanges` section specifies the hyperparameters that are tunable and the range of values to search for each one. These hyperparameters control the complexity of the model and can help prevent overfitting.

SageMaker HyperParameter Tuning runs multiple training jobs, varying these hyperparameters within their ranges to find the best-performing combination. Automating this search is often more efficient than tuning the hyperparameters manually.

The `ResourceLimits` section specifies the maximum number of training jobs that can be run in parallel and the maximum number of training jobs that can be run in total. In this lab, the numbers are small because the dataset is small, and the training jobs are quick to complete. In a non-lab environment, you may want to increase the total to explore more hyperparameter combinations and the parallel limit to finish the tuning process sooner.

The `HyperParameterTuningJobObjective` section specifies the metric that you want to optimize. In this case, you are minimizing the multi-class log loss measured on the validation dataset (`validation:mlogloss`). This is suitable for classification tasks, where a lower loss indicates better predicted class probabilities. For other types of machine learning tasks, such as predicting a continuous value, you may want to optimize a different metric.
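
To make the objective metric concrete, the sketch below computes multi-class log loss by hand: the average, over all samples, of the negative logarithm of the probability assigned to the true class. The labels and probabilities are made-up illustrative values, not output from this lab, and the function name `multiclass_log_loss` is hypothetical.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs):
    """Mean negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += -math.log(probs[label])
    return total / len(true_labels)


labels = [0, 1, 2]   # one illustrative sample per class

# A model that is confident and correct scores a low loss.
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

# A model that guesses uniformly across three classes scores log(3).
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

print("confident:", multiclass_log_loss(labels, confident))
print("uniform:  ", multiclass_log_loss(labels, uniform))
```

The uniform guess gives a useful baseline for a three-class problem: a tuned model should score well below `log(3)`, which is roughly 1.1.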

Run the following cell to create a configuration for a HyperParameter Tuning job.

In [None]:
tuning_job_config = {
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {
                "Name": "eta",
                "MinValue": "0.1",
                "MaxValue": "0.5"
            },
            {
                "Name": "min_child_weight",
                "MinValue": "0",
                "MaxValue": "120"
            },
            {
                "Name": "subsample",
                "MinValue": "0.5",
                "MaxValue": "1"
            },
            {
                "Name": "colsample_bytree",
                "MinValue": "0.5",
                "MaxValue": "1"
            },
            {
                "Name": "gamma",
                "MinValue": "0",
                "MaxValue": "5"
            }
        ],
        "IntegerParameterRanges": [
            {
                "Name": "max_depth",
                "MinValue": "1",
                "MaxValue": "10"
            }
        ]
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 9,
        "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",  # Use Bayesian optimization for efficient tuning
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:mlogloss",  # Suitable for multi-class classification
        "Type": "Minimize"  # Lower log loss is better
    },
    "RandomSeed": 123
}
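
Before submitting a configuration like this, a quick local sanity check can catch typos in the ranges. The sketch below is a set of illustrative checks chosen for this example, not the validation that the SageMaker service itself performs. The function names are hypothetical, and `example_config` is a trimmed copy of the configuration above.

```python
import math


def validate_tuning_config(config):
    """Return a list of problems found in a tuning job configuration dict."""
    problems = []
    ranges = config.get("ParameterRanges", {})
    for kind in ("ContinuousParameterRanges", "IntegerParameterRanges"):
        for entry in ranges.get(kind, []):
            # Range bounds are passed to the API as strings, so convert before comparing.
            low, high = float(entry["MinValue"]), float(entry["MaxValue"])
            if low > high:
                problems.append(
                    "{}: MinValue {} exceeds MaxValue {}".format(entry["Name"], low, high)
                )
    limits = config.get("ResourceLimits", {})
    if limits.get("MaxParallelTrainingJobs", 0) > limits.get("MaxNumberOfTrainingJobs", 0):
        problems.append("MaxParallelTrainingJobs exceeds MaxNumberOfTrainingJobs")
    return problems


def minimum_rounds(config):
    """Smallest number of sequential rounds needed to run every training job."""
    limits = config["ResourceLimits"]
    return math.ceil(limits["MaxNumberOfTrainingJobs"] / limits["MaxParallelTrainingJobs"])


example_config = {
    "ParameterRanges": {
        "ContinuousParameterRanges": [{"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"}],
        "IntegerParameterRanges": [{"Name": "max_depth", "MinValue": "1", "MaxValue": "10"}],
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 9, "MaxParallelTrainingJobs": 3},
}

print(validate_tuning_config(example_config))
print(minimum_rounds(example_config))
```

With 9 jobs in total and 3 in parallel, the tuning job needs at least 3 rounds of training, which gives a rough lower bound on how long tuning takes.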

### Defining a HyperParameter Training Job

The following cell creates a variable named `training_job_definition` that contains the definition of the training jobs that the SageMaker HyperParameter Tuning job launches.

You are using a managed algorithm container image provided by Amazon SageMaker called XGBoost (eXtreme Gradient Boosting). This is a popular algorithm for classification tasks. You can use your own Docker container image if you have a custom algorithm or custom dependencies.

Notice that as well as specifying the `TrainingImage`, you are also providing a `RoleArn`. The role is used by SageMaker to access the training data and to store the model artifacts in Amazon S3.

Run the following cell to define a HyperParameter Tuning training job.

In [None]:
training_image = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1')

training_job_definition = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train
                }
            }
        },
        {
            "ChannelName": "validation",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation
                }
            }
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m5.large",
        "VolumeSizeInGB": 5
    },
    "RoleArn": role,
    "StaticHyperParameters": {
        "objective": "multi:softmax",  # Multi-class classification objective that outputs the predicted class
        "num_class": "3",  # Number of classes in the Iris dataset
        "eval_metric": "mlogloss",  # Use multi-class log loss for evaluation
        "num_round": "100"  # Number of boosting rounds
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 300
    }
}
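
The two entries in `InputDataConfig` differ only in their channel name and S3 URI. One way to avoid repeating the structure is a small helper, sketched below. The function name `make_csv_channel` and the example URIs are hypothetical; in the notebook you would pass `s3_input_train` and `s3_input_validation` instead.

```python
def make_csv_channel(channel_name, s3_uri):
    """Build one InputDataConfig entry for an uncompressed CSV dataset under an S3 prefix."""
    return {
        "ChannelName": channel_name,
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
            }
        },
    }


# Hypothetical URIs for illustration only.
example_channels = [
    make_csv_channel("train", "s3://lab-notebook-example/iris/train"),
    make_csv_channel("validation", "s3://lab-notebook-example/iris/validation"),
]
print([channel["ChannelName"] for channel in example_channels])
```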

To launch the HyperParameter Tuning job, call the `create_hyper_parameter_tuning_job` method on the SageMaker client that you created earlier. The Amazon SageMaker service expects to be called with a name for the job, a tuning job configuration, and a training job definition.

Run the following cell to launch the HyperParameter Tuning job.

In [None]:
tuning_job_name = "iris-training-job"

smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition,
)
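
The tuning job runs asynchronously, so the call above returns before any training has finished. The following is a minimal sketch of polling the job with `describe_hyper_parameter_tuning_job` until it stops. The helper names, the polling interval, and the `StubClient` class are assumptions made for this example; the stub stands in for the real client so that the sketch runs without AWS access, and its metric value is a placeholder rather than a real result.

```python
import time

# Statuses after which the tuning job will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll a tuning job until it reaches a terminal status; return the final description."""
    while True:
        description = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )
        status = description["HyperParameterTuningJobStatus"]
        print("Status:", status)
        if status in TERMINAL_STATUSES:
            return description
        sleep(poll_seconds)


def best_job_summary(description):
    """Return (training job name, metric name, metric value) for the best job, if any."""
    best = description.get("BestTrainingJob")
    if not best:
        return None
    metric = best["FinalHyperParameterTuningJobObjectiveMetric"]
    return best["TrainingJobName"], metric["MetricName"], metric["Value"]


class StubClient:
    """Stand-in for the SageMaker client that replays canned responses."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def describe_hyper_parameter_tuning_job(self, HyperParameterTuningJobName):
        return next(self._responses)


stub = StubClient([
    {"HyperParameterTuningJobStatus": "InProgress"},
    {
        "HyperParameterTuningJobStatus": "Completed",
        "BestTrainingJob": {
            "TrainingJobName": "example-training-job",   # placeholder name
            "FinalHyperParameterTuningJobObjectiveMetric": {
                "MetricName": "validation:mlogloss",
                "Value": 0.1,                            # placeholder value
            },
        },
    },
])

final = wait_for_tuning_job(stub, "iris-training-job", poll_seconds=0, sleep=lambda seconds: None)
print(best_job_summary(final))
```

In the notebook, you would call `wait_for_tuning_job(smclient, tuning_job_name)` with the real client instead of the stub.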
