# Hyperparameter Tuning XGBoost on the Iris Dataset With Amazon SageMaker Automatic Model Tuning (AMT)

___

This example demonstrates how to use Amazon SageMaker to perform hyperparameter tuning on an XGBoost model for the Iris dataset.

Hyperparameter tuning is the process of finding the set of hyperparameters that gives the best model performance on a chosen metric. Hyperparameters are settings fixed before training begins, such as the learning rate (`eta`) and the maximum tree depth (`max_depth`). Unlike model parameters, they are not learned from the data.

This notebook covers the configuration and execution of a hyperparameter tuning job using Amazon SageMaker. The Iris dataset is prepared and split into training, validation, and test sets, and the training and validation sets are then uploaded to Amazon S3. The hyperparameter tuning job and training job configurations are defined, and the tuning job is launched.

___

### 1. Import Libraries

In [None]:
import sagemaker
import boto3
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import os

### 2. Set up SageMaker session and S3 bucket

In [None]:
region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'DEMO-iris-hpo-xgboost'

### 3. Load and prepare the Iris dataset

In [None]:
iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

# Display the first few rows to verify data loading
data.head()

### 4. Split the data into training, validation, and test sets

In [None]:
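# Hold out 20% of the rows, then split that holdout in half: 80% train, 10% validation, 10% test
train_data, holdout_data = train_test_split(data, test_size=0.2, random_state=42)
validation_data, test_data = train_test_split(holdout_data, test_size=0.5, random_state=42)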

# Verify the sizes of the splits
print(f"Training data size: {train_data.shape}")
print(f"Validation data size: {validation_data.shape}")
print(f"Test data size: {test_data.shape}")
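
As a sanity check on the printed shapes, the split arithmetic can be worked out by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the training side. `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test side is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_rows = 150  # the Iris dataset has 150 samples
n_train, n_holdout = split_sizes(n_rows, 0.2)
n_validation, n_test = split_sizes(n_holdout, 0.5)
print(n_train, n_validation, n_test)
```

With five columns (four features plus the target), the three printed shapes should therefore be `(120, 5)`, `(15, 5)`, and `(15, 5)`.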

### 5. Save and upload data to S3

In [None]:
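# The built-in XGBoost algorithm expects CSV input with no header row
# and the label in the first column, so move 'target' to the front
label_first = ['target'] + [col for col in train_data.columns if col != 'target']
train_data[label_first].to_csv('train.csv', index=False, header=False)
validation_data[label_first].to_csv('validation.csv', index=False, header=False)

# S3 keys always use forward slashes, so build them with f-strings rather than os.path.join
s3 = boto3.Session().resource('s3')
s3.Bucket(bucket).Object(f'{prefix}/train/train.csv').upload_file('train.csv')
s3.Bucket(bucket).Object(f'{prefix}/validation/validation.csv').upload_file('validation.csv')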

print(f"Data uploaded to S3 bucket: {bucket}, prefix: {prefix}")
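
The built-in XGBoost algorithm reads CSV input without a header row and takes the first column as the label, so a quick local check before uploading can catch a misplaced target column. The sketch below uses only the standard library. `first_column_values` is a hypothetical helper, and the two sample rows are illustrative Iris-style records rather than output from this notebook.

```python
import csv
import io


def first_column_values(csv_text):
    """Return the set of distinct values found in the first CSV column."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0] for row in reader if row}


# Illustrative rows in label-first layout: label, then the four features.
sample = "0.0,5.1,3.5,1.4,0.2\n2.0,6.3,3.3,6.0,2.5\n"
labels = first_column_values(sample)

# Iris has three classes, so the first column should hold only 0.0, 1.0, or 2.0.
print(labels <= {"0.0", "1.0", "2.0"})
```

To check the real file, pass the contents of `train.csv` to the helper. A result that contains measurements such as `5.1` indicates that a feature column, not the label, is in first position.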

### 6. Configure hyperparameter tuning settings

In [None]:
tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "0.5",
          "MinValue": "0.1",
          "Name": "eta"
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "5",
          "MinValue": "2",
          "Name": "max_depth"
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 4,
      "MaxParallelTrainingJobs": 2
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:rmse",
      "Type": "Minimize"
    }
}
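
Because range bounds are passed as strings, a typo in the configuration is easy to miss until the API call fails. The following is a minimal local sanity check, assuming the same dictionary layout as above. `check_tuning_config` is a hypothetical helper and does not reproduce the service's own validation rules.

```python
def check_tuning_config(config):
    """Return a list of problems found in a tuning job config dict."""
    problems = []
    ranges = config["ParameterRanges"]
    for kind in ("ContinuousParameterRanges", "IntegerParameterRanges"):
        for rng in ranges.get(kind, []):
            # Range bounds are passed as strings, so convert before comparing.
            if float(rng["MinValue"]) > float(rng["MaxValue"]):
                problems.append(f"{rng['Name']}: MinValue is greater than MaxValue")
    limits = config["ResourceLimits"]
    if limits["MaxParallelTrainingJobs"] > limits["MaxNumberOfTrainingJobs"]:
        problems.append("MaxParallelTrainingJobs exceeds MaxNumberOfTrainingJobs")
    return problems


# Same ranges and limits as the notebook's tuning_job_config.
example_config = {
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [
            {"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"}
        ],
        "IntegerParameterRanges": [
            {"Name": "max_depth", "MinValue": "2", "MaxValue": "5"}
        ],
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 4, "MaxParallelTrainingJobs": 2},
}
print(check_tuning_config(example_config))
```

An empty list means the check found nothing wrong. In the notebook, the same call can be made on `tuning_job_config` directly.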

### 7. Get training image URI for XGBoost

In [None]:
training_image = sagemaker.image_uris.retrieve('xgboost', region, version='1.0-1')
print(f"Using training image: {training_image}")

### 8. Configure S3 input paths for training and validation data

In [None]:
s3_input_train = f's3://{bucket}/{prefix}/train/'
s3_input_validation = f's3://{bucket}/{prefix}/validation/'
print(f"Training data path: {s3_input_train}")
print(f"Validation data path: {s3_input_validation}")

### 9. Define the training job configuration

In [None]:
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": f"s3://{bucket}/{prefix}/output"
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m5.large",
      "VolumeSizeInGB": 5
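    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "num_round": "50",
      # Treats the numeric class label (0, 1, 2) as a regression target so that
      # the training job reports the tuning objective "validation:rmse"
      "objective": "reg:squarederror",
      "verbosity": "2"
    },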
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 1800
    }
}
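
Each training job launched by the tuner receives the static hyperparameters plus one sampled value for every tuned parameter. The sketch below illustrates that merge and guards against a name appearing in both places. `merged_hyperparameters` is a hypothetical helper, and the sampled values for `eta` and `max_depth` are made up for illustration.

```python
def merged_hyperparameters(static, sampled):
    """Combine static hyperparameters with one sampled set of tuned values."""
    overlap = sorted(set(static) & set(sampled))
    if overlap:
        # A name defined both statically and as a tuned range is ambiguous.
        raise ValueError(f"Defined as both static and tuned: {overlap}")
    merged = dict(static)
    # Hyperparameter values are passed to training jobs as strings.
    merged.update({name: str(value) for name, value in sampled.items()})
    return merged


static = {"num_round": "50", "objective": "reg:squarederror", "verbosity": "2"}
sampled = {"eta": 0.3, "max_depth": 4}  # hypothetical values chosen by the tuner
print(merged_hyperparameters(static, sampled))
```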

### 10. Launch the hyperparameter tuning job

In [None]:
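# Tuning job names must be unique within an account and Region, so append a timestamp
from time import gmtime, strftime
tuning_job_name = "IrisHPO-" + strftime("%d-%H-%M-%S", gmtime())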
smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition
)

print(f"Launched hyperparameter tuning job: {tuning_job_name}")
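
The launch call returns as soon as the job is accepted, so the notebook needs to poll if later cells depend on the result. Below is a minimal polling sketch built on the boto3 `describe_hyper_parameter_tuning_job` call. `wait_for_tuning_job`, its `poll_seconds` default, and the injectable `sleep` argument are assumptions made for this illustration.

```python
import time


def wait_for_tuning_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll a tuning job until it is no longer running, then return its description."""
    while True:
        desc = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )
        status = desc["HyperParameterTuningJobStatus"]
        completed = desc["TrainingJobStatusCounters"]["Completed"]
        print(f"{job_name}: {status}, {completed} training jobs completed")
        # "InProgress" and "Stopping" are the states in which work is still underway.
        if status not in ("InProgress", "Stopping"):
            return desc
        sleep(poll_seconds)
```

In this notebook the call would be `wait_for_tuning_job(smclient, tuning_job_name)`. The returned description includes a `BestTrainingJob` entry once at least one training job has reported the objective metric.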

### 11. Monitoring Hyperparameter Tuning Jobs in SageMaker

In the SageMaker console, you can monitor the status of the training jobs created by the hyperparameter tuning job.

1. In the left navigation pane, under **Training**, click **Hyperparameter tuning jobs**, and click on the hyperparameter tuning job you launched.
2. In the **Training jobs** section, you can view a list of individual training jobs along with their current statuses. The list updates as the tuning job progresses, so you can see which jobs have completed, which are still running, and whether any have failed.
3. Click **Best training job** to view the details of the best training job and review the configurations and results for this model.

Throughout the tuning process, SageMaker uses the objective metric (in this case, the validation RMSE) from each training job to determine the best-performing model. While the tuning job runs, SageMaker continuously updates which job has achieved the best objective metric. When the tuning job finishes, SageMaker highlights the training job that returned the best objective metric.
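
The same ranking can be retrieved programmatically. The sketch below uses the boto3 `list_training_jobs_for_hyper_parameter_tuning_job` call, sorted so that the lowest RMSE comes first. `rank_training_jobs` is a hypothetical helper, and the `max_results` default is an arbitrary choice.

```python
def rank_training_jobs(client, job_name, max_results=10):
    """Return (job name, objective value, tuned hyperparameters) tuples, best first."""
    response = client.list_training_jobs_for_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=job_name,
        SortBy="FinalObjectiveMetricValue",
        SortOrder="Ascending",  # lowest RMSE first, matching the "Minimize" objective
        MaxResults=max_results,
    )
    ranked = []
    for summary in response["TrainingJobSummaries"]:
        metric = summary.get("FinalHyperParameterTuningJobObjectiveMetric")
        if metric is None:
            continue  # this job has not reported its objective metric yet
        ranked.append(
            (summary["TrainingJobName"], metric["Value"], summary["TunedHyperParameters"])
        )
    return ranked
```

Calling `rank_training_jobs(smclient, tuning_job_name)` after the tuning job finishes gives one tuple per training job, with the best-performing job at index 0.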

After identifying the best training job, you can deploy it to a SageMaker endpoint for inference:
1. Choose **Create model** to deploy the best training job as a model on SageMaker.
2. Adjust the model endpoint settings as needed and proceed to launch the deployment.

Deploying the model creates a SageMaker endpoint, which allows you to send data for predictions and test the model’s performance in a production-like environment.
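
The console steps above have a boto3 equivalent: create a model from the training job's artifacts, create an endpoint configuration, then create the endpoint. The following is a sketch under stated assumptions, not the notebook's own method. `deploy_training_job`, the derived resource names, the `AllTraffic` variant name, and the `ml.m5.large` default are all illustrative choices, and it assumes the training image can also serve inference, as is the case for the built-in XGBoost container.

```python
def deploy_training_job(client, training_job_name, role_arn, instance_type="ml.m5.large"):
    """Create a model, endpoint config, and endpoint from a finished training job."""
    job = client.describe_training_job(TrainingJobName=training_job_name)

    # Hypothetical naming scheme: derive all resource names from the training job name.
    model_name = f"{training_job_name}-model"
    config_name = f"{training_job_name}-config"
    endpoint_name = f"{training_job_name}-endpoint"

    client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={
            "Image": job["AlgorithmSpecification"]["TrainingImage"],
            "ModelDataUrl": job["ModelArtifacts"]["S3ModelArtifacts"],
        },
    )
    client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
            }
        ],
    )
    # Endpoint creation is asynchronous; the endpoint is usable once it is InService.
    client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    return endpoint_name
```

The endpoint accrues charges for as long as it exists, so delete it once testing is finished.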

### 12. Clean up

To avoid incurring unnecessary charges, use the AWS Management Console to delete the resources that you created for this example.
1. Open the SageMaker console and delete the notebook instance. Stop the instance before deleting it. If you deployed the best model, also delete its endpoint, endpoint configuration, and model.
2. Open the Amazon S3 console and delete the objects stored under the `DEMO-iris-hpo-xgboost` prefix, or delete the whole bucket if you use it only for this example.
3. Open the Amazon CloudWatch console and delete all of the log groups that have names starting with `/aws/sagemaker/`.
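
Parts of this cleanup can also be scripted. The sketch below uses the boto3 SageMaker client's delete calls and the S3 resource's object collection. Both helper functions are hypothetical, and the endpoint-related names only apply if a model was actually deployed.

```python
def delete_endpoint_resources(sm_client, endpoint_name, config_name, model_name):
    """Delete an endpoint together with its configuration and model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_model(ModelName=model_name)


def delete_s3_prefix(s3_resource, bucket_name, prefix):
    """Delete every object under a prefix while leaving the bucket itself in place."""
    s3_resource.Bucket(bucket_name).objects.filter(Prefix=prefix).delete()
```

In this notebook, `delete_s3_prefix(boto3.Session().resource('s3'), bucket, prefix)` would remove the uploaded datasets and the model artifacts written under the output path.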

___