# Hyperparameter Tuning the Iris Dataset With XGBoost on Amazon SageMaker AMT

___

This example demonstrates how to use Amazon SageMaker to perform hyperparameter tuning on an XGBoost model for the Iris dataset.

Hyperparameter tuning is the process of searching for the set of hyperparameters that gives the best value of a chosen objective metric. Hyperparameters are settings chosen before training begins, such as the learning rate and the maximum depth of the trees, rather than values learned from the data.

This notebook covers the configuration and execution of a hyperparameter tuning job using Amazon SageMaker. The Iris dataset is split into training, validation, and test sets, and the training and validation sets are uploaded to Amazon S3. The hyperparameter tuning job and training job configurations are then defined, and the tuning job is launched.

___

### 1. Import Libraries
The imports cover the SageMaker SDK and boto3 for working with AWS, pandas and NumPy for data manipulation, and scikit-learn for loading and splitting the Iris dataset.

In [None]:
import sagemaker
import boto3
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import os

### 2. Set up SageMaker session and S3 bucket
The code creates a SageMaker client and session, retrieves the execution role, and selects the session's default S3 bucket together with a key prefix for this example.

In [None]:
region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'DEMO-iris-hpo-xgboost'

### 3. Load and prepare the Iris dataset
The Iris dataset is loaded into a pandas DataFrame that combines the four feature columns with the target column.

In [None]:
iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

# Display the first few rows to verify data loading
data.head()

### 4. Split the data into training, validation, and test sets
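The dataset is split into training, validation, and test sets using an 80-10-10 ratio: 20% of the rows are held out first, and that holdout is then divided equally between validation and test.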

In [None]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
validation_data, test_data = train_test_split(test_data, test_size=0.5, random_state=42)

# Verify the sizes of the splits
print(f"Training data size: {train_data.shape}")
print(f"Validation data size: {validation_data.shape}")
print(f"Test data size: {test_data.shape}")
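
The sizes produced by a two-stage split like the one above can be worked out by hand. The sketch below is a minimal, standard-library-only illustration. The helper two_stage_split_sizes is hypothetical (it is not part of scikit-learn), and it assumes each held-out portion is rounded up, which is how train_test_split is understood to treat a fractional test_size.

```python
import math


def two_stage_split_sizes(n_rows, holdout_fraction, test_share_of_holdout):
    """Return (train, validation, test) row counts for a two-stage split.

    Hypothetical helper: assumes each held-out portion is rounded up.
    """
    n_holdout = math.ceil(n_rows * holdout_fraction)
    n_test = math.ceil(n_holdout * test_share_of_holdout)
    n_validation = n_holdout - n_test
    n_train = n_rows - n_holdout
    return n_train, n_validation, n_test


# Iris has 150 rows; 0.2 and 0.5 are the fractions used in the cell above.
print(two_stage_split_sizes(150, 0.2, 0.5))

# Holding out 40% first and then halving it would give a 60-20-20 split.
print(two_stage_split_sizes(150, 0.4, 0.5))
```

For the 150-row Iris dataset, the first call corresponds to 120 training rows and 15 rows each for validation and test.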

### 5. Save and upload data to S3
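The training and validation sets are written to CSV files without a header or index and uploaded to the S3 bucket under the prefix. SageMaker's built-in XGBoost algorithm expects the target in the first CSV column, so the target column is moved to the front before saving.

In [None]:
# The built-in XGBoost algorithm reads the label from the first CSV column,
# so move 'target' to the front before writing the files.
columns = ['target'] + [c for c in train_data.columns if c != 'target']
train_data[columns].to_csv('train.csv', index=False, header=False)
validation_data[columns].to_csv('validation.csv', index=False, header=False)

# S3 keys always use forward slashes, so build them with f-strings rather than os.path.join
s3 = boto3.Session().resource('s3')
s3.Bucket(bucket).Object(f'{prefix}/train/train.csv').upload_file('train.csv')
s3.Bucket(bucket).Object(f'{prefix}/validation/validation.csv').upload_file('validation.csv')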

print(f"Data uploaded to S3 bucket: {bucket}, prefix: {prefix}")
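
Because the built-in XGBoost algorithm reads CSV input with no header row and the label in the first column, it can be worth checking the files before they are uploaded. The following is a minimal sketch using only the standard library. The function name labels_in_first_column is hypothetical, and the set of allowed labels is an assumption based on the three Iris classes (0, 1, 2).

```python
import csv


def labels_in_first_column(csv_path, allowed_labels):
    """Return True if every row's first field parses to one of allowed_labels."""
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            try:
                value = float(row[0])
            except ValueError:
                return False  # a non-numeric first field, such as a header row
            if value not in allowed_labels:
                return False
    return True


# Example usage (assumes train.csv was written by the cell above):
# labels_in_first_column('train.csv', {0.0, 1.0, 2.0})
```

If the target were left in the last column, the first field of each row would be a sepal length, and the check would return False.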

### 6. Configure hyperparameter tuning settings
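The tuning configuration defines the search space (eta between 0.1 and 0.5, max_depth between 2 and 5), limits the search to four training jobs with at most two running in parallel, and tells the Bayesian strategy to minimize the validation:rmse metric.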

In [None]:
tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "0.5",
          "MinValue": "0.1",
          "Name": "eta"
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "5",
          "MinValue": "2",
          "Name": "max_depth"
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 4,
      "MaxParallelTrainingJobs": 2
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:rmse",
      "Type": "Minimize"
    }
}
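
A few mistakes in a configuration like this are easy to catch locally before calling the API. The sketch below is illustrative only: check_tuning_config is a hypothetical helper, its checks are this example's own conventions rather than SageMaker's validation rules, and example_config is a trimmed copy of the dictionary above.

```python
def check_tuning_config(config):
    """Return a list of problems found in a tuning config dict (empty if none)."""
    problems = []
    ranges = config["ParameterRanges"]
    for key in ("ContinuousParameterRanges", "IntegerParameterRanges"):
        for param in ranges.get(key, []):
            # The API carries numeric bounds as strings, so convert before comparing.
            if float(param["MinValue"]) > float(param["MaxValue"]):
                problems.append(f"{param['Name']}: MinValue is above MaxValue")
    limits = config["ResourceLimits"]
    if limits["MaxParallelTrainingJobs"] > limits["MaxNumberOfTrainingJobs"]:
        problems.append("MaxParallelTrainingJobs exceeds MaxNumberOfTrainingJobs")
    if config["HyperParameterTuningJobObjective"]["Type"] not in ("Minimize", "Maximize"):
        problems.append("Objective Type must be 'Minimize' or 'Maximize'")
    return problems


# Trimmed copy of the configuration above, for illustration.
example_config = {
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {"MaxValue": "0.5", "MinValue": "0.1", "Name": "eta"}
        ],
        "IntegerParameterRanges": [
            {"MaxValue": "5", "MinValue": "2", "Name": "max_depth"}
        ],
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 4, "MaxParallelTrainingJobs": 2},
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:rmse",
        "Type": "Minimize",
    },
}

print(check_tuning_config(example_config))
```

In the notebook, the same function could be called as check_tuning_config(tuning_job_config) before the job is launched.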

### 7. Get training image URI for XGBoost
The URI of the SageMaker XGBoost container image (version 1.0-1) is retrieved for the current region.

In [None]:
training_image = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1')
print(f"Using training image: {training_image}")

### 8. Configure S3 input paths for training and validation data
The S3 prefixes below tell the training jobs where to read the training and validation data.

In [None]:
s3_input_train = f's3://{bucket}/{prefix}/train/'
s3_input_validation = f's3://{bucket}/{prefix}/validation/'
print(f"Training data path: {s3_input_train}")
print(f"Validation data path: {s3_input_validation}")

### 9. Define the training job configuration
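The training job definition specifies the XGBoost container image, the two CSV input channels, the output location, the instance resources, and the static hyperparameters shared by every training job. Note that the reg:squarederror objective treats the Iris class labels (0, 1, 2) as numeric regression targets, which is why the tuning objective is validation:rmse.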

In [None]:
training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": f"s3://{bucket}/{prefix}/output"
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m5.large",
      "VolumeSizeInGB": 5
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "num_round": "50",
      "objective": "reg:squarederror",
      "verbosity": "2"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 1800
    }
}
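
The two InputDataConfig entries above differ only in their channel name and S3 URI, so they could be generated by a small helper. This is a sketch, not part of the SageMaker SDK: make_csv_channel is a hypothetical function name, and the bucket and prefix in the URIs are placeholders.

```python
def make_csv_channel(channel_name, s3_uri):
    """Build one InputDataConfig entry for uncompressed CSV data under an S3 prefix."""
    return {
        "ChannelName": channel_name,
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
            }
        },
    }


# Placeholder URIs for illustration; in the notebook these would be
# s3_input_train and s3_input_validation.
input_data_config = [
    make_csv_channel("train", "s3://example-bucket/example-prefix/train/"),
    make_csv_channel("validation", "s3://example-bucket/example-prefix/validation/"),
]
```

The resulting list has the same shape as the "InputDataConfig" value in the definition above.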

### 10. Launch the hyperparameter tuning job
The tuning job is launched with the configuration and training job definition above. Tuning job names must be unique within an AWS account and Region, so change the name before re-running this cell.

In [None]:
tuning_job_name = "IrisHPO"
smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition
)

print(f"Launched hyperparameter tuning job: {tuning_job_name}")
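
A fixed name such as "IrisHPO" can only be used once per account and Region, so re-running the cell requires a new name. One common approach is to append a timestamp. The sketch below assumes the 32-character limit that SageMaker is understood to place on tuning job names; make_tuning_job_name is a hypothetical helper.

```python
from datetime import datetime, timezone


def make_tuning_job_name(base, now=None):
    """Append a UTC timestamp to base, in the form base-YYMMDD-HHMMSS."""
    now = now or datetime.now(timezone.utc)
    name = f"{base}-{now.strftime('%y%m%d-%H%M%S')}"
    # Assumed limit: tuning job names may be at most 32 characters long.
    if len(name) > 32:
        raise ValueError(f"Tuning job name is {len(name)} characters; the limit is 32")
    return name


# Example usage: a fresh name on every run.
tuning_job_name = make_tuning_job_name("IrisHPO")
print(tuning_job_name)
```

The timestamp adds 14 characters, which leaves 18 characters for the base name.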

### 11. Monitoring Hyperparameter Tuning Jobs

The SageMaker console enables monitoring of hyperparameter tuning jobs through the Training section. The Hyperparameter tuning jobs panel displays the status of each training job and updates it as the jobs run. SageMaker compares the validation RMSE of the training jobs to identify the best-performing model, and its configuration and results are available through the Best training job option. Once tuning completes, the best-performing model can be deployed by creating an endpoint, which enables real-time predictions in a production environment.

To access tuning jobs:
* Navigate to Training > Hyperparameter tuning jobs 
* Select your launched tuning job
* View Best training job for optimal model details
* Choose Create model to deploy for production inference
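
The same information is available programmatically through the describe_hyper_parameter_tuning_job call on a boto3 SageMaker client. The sketch below takes the client as an argument (for example, smclient from the setup cell), so it does not create AWS connections itself. The function names wait_for_tuning_job and summarize_best_job are hypothetical, and the 60-second polling interval is an arbitrary choice.

```python
import time


def wait_for_tuning_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll until the tuning job is no longer running, then return its description.

    client is expected to be a boto3 SageMaker client, such as smclient above.
    """
    while True:
        desc = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )
        status = desc["HyperParameterTuningJobStatus"]
        counters = desc.get("TrainingJobStatusCounters", {})
        print(f"{job_name}: {status}, completed jobs: {counters.get('Completed', 0)}")
        if status not in ("InProgress", "Stopping"):
            return desc
        sleep(poll_seconds)


def summarize_best_job(desc):
    """Extract the best training job's name, metric, and tuned hyperparameters."""
    best = desc.get("BestTrainingJob")
    if not best:
        return None  # no training job has reported an objective metric yet
    metric = best["FinalHyperParameterTuningJobObjectiveMetric"]
    return {
        "name": best["TrainingJobName"],
        "metric": metric["MetricName"],
        "value": metric["Value"],
        "hyperparameters": best.get("TunedHyperParameters", {}),
    }


# Example usage (requires AWS credentials and a launched tuning job):
# desc = wait_for_tuning_job(smclient, tuning_job_name)
# print(summarize_best_job(desc))
```

Passing the client in keeps the functions easy to exercise with a stub object when no AWS account is available.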

### 12. Clean up

To avoid incurring unnecessary charges, use the AWS Management Console to delete the resources that you created for this example.
- Open the SageMaker console and delete the notebook instance. Stop the instance before deleting it.
- Open the Amazon S3 console and delete the bucket that you created to store model artifacts and the training dataset.
- Open the Amazon CloudWatch console and delete all of the log groups that have names starting with /aws/sagemaker/.
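
The S3 objects can also be removed from code. Because the notebook uses the session's default bucket, which may hold data from other projects, the sketch below deletes only the objects under this example's prefix instead of the whole bucket. It assumes a boto3 s3.Bucket resource is passed in, and delete_prefix is a hypothetical helper. Deletion is permanent, so check the prefix before running it.

```python
def delete_prefix(bucket_resource, prefix):
    """Delete every object whose key starts with prefix + '/'.

    bucket_resource is expected to be a boto3 s3.Bucket resource.
    Returns the normalized prefix that was used for the deletion.
    """
    if not prefix or not prefix.strip("/"):
        raise ValueError("Refusing to delete with an empty prefix")
    # A trailing slash prevents matching sibling prefixes such as 'DEMO-iris-hpo-xgboost-v2'.
    normalized = prefix if prefix.endswith("/") else prefix + "/"
    bucket_resource.objects.filter(Prefix=normalized).delete()
    return normalized


# Example usage (requires AWS credentials; bucket and prefix come from the setup cell):
# import boto3
# delete_prefix(boto3.resource('s3').Bucket(bucket), prefix)
```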

___