# Model Building

### Initialize AWS SageMaker Environment and Define Data Paths

This code sets up the environment for working with Amazon SageMaker by importing necessary libraries and initializing key variables. It begins by establishing an AWS SageMaker session and obtaining the execution role, which grants permissions to interact with AWS resources. A default S3 bucket is defined to store data, and a prefix is used to organize activity-specific data paths within this bucket. Paths for training, validation, and testing data are specified, each stored in its respective folder within the S3 bucket, making it easier to access and manage data for machine learning tasks.

In [1]:
# Import necessary libraries for SageMaker session, AWS SDK, and data manipulation
from sagemaker import Session
import sagemaker
import boto3
import re
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import os

# Obtain the SageMaker execution role for the notebook, which grants permissions to access AWS resources
role = get_execution_role()

# Set the default S3 bucket for SageMaker sessions; if none exists, SageMaker will create one
bucket = sagemaker.Session().default_bucket()

# Define a prefix for S3 paths to organize activity-related data within the bucket
prefix = 'mlops/activity-3'

# Initiate a SageMaker session, which is used to handle interactions with SageMaker APIs
sess = Session()

# Define S3 paths for train, validation, and test data, organized by prefix within the S3 bucket
train_path = f"s3://{bucket}/{prefix}/train/"
validation_path = f"s3://{bucket}/{prefix}/validation/"
test_path = f"s3://{bucket}/{prefix}/test/"


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


### Retrieve Amazon SageMaker XGBoost Container Image URI

This code retrieves the Amazon SageMaker container image URI for the built-in XGBoost algorithm in the current AWS region. The container URI is required to launch an XGBoost training job on SageMaker, because it points to a runtime environment pre-configured with the necessary libraries and dependencies. Image URIs are region-specific, so the lookup uses the region of the current session. Note that `version='latest'` resolves to the `xgboost:latest` image tag (visible in the endpoint description later in this notebook), which refers to the legacy built-in algorithm image rather than the newest XGBoost release; pin an explicit version if a specific release is required.

In [2]:
# Retrieve the Amazon SageMaker container URI for the built-in XGBoost algorithm
# The URI is region-specific, so the lookup uses the current AWS session's region
# Note: version='latest' resolves to the `xgboost:latest` tag, not the newest XGBoost release
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, 
                                          framework='xgboost', 
                                          version='latest')

### Define S3 Input Data Locations for SageMaker Training

This code defines the input data locations for both training and validation datasets, stored in Amazon S3, to be used in an Amazon SageMaker training job. Using the TrainingInput class, SageMaker recognizes these data inputs and allows seamless access during training. The content_type parameter is set to 'csv' to indicate that the input files are in CSV format, which is essential for compatibility with SageMaker’s data ingestion pipeline. By organizing data in this way, training and validation data can be easily referenced and managed in subsequent stages of the machine learning pipeline.

In [3]:
# Specify the S3 input locations for training and validation data
# The `TrainingInput` class indicates that these data inputs will be used in a training job
# `content_type='csv'` specifies that the data is in CSV format

s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data=f's3://{bucket}/{prefix}/train',
    content_type='csv'
)

s3_input_validation = sagemaker.inputs.TrainingInput(
    s3_data=f's3://{bucket}/{prefix}/validation/',
    content_type='csv'
)
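The training log later in this notebook reports `label_column=0`, meaning the container reads the first CSV column as the label, and the files carry no header row. As a minimal sketch of a local sanity check that could be run before uploading data, the hypothetical helper `check_xgboost_csv` below (not part of the SageMaker SDK) verifies that every cell is numeric, every row has the same width, and the first column holds a binary label.

```python
import csv
import io


def check_xgboost_csv(text, allowed_labels=(0.0, 1.0)):
    """Return (n_rows, n_columns) if `text` looks like headerless, label-first CSV.

    Hypothetical helper: raises ValueError on the first problem found.
    """
    n_rows, width = 0, None
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            raise ValueError(f"line {line_no}: non-numeric cell (header row?)") from None
        if width is None:
            width = len(values)
        elif len(values) != width:
            raise ValueError(f"line {line_no}: expected {width} columns, got {len(values)}")
        if values[0] not in allowed_labels:
            raise ValueError(f"line {line_no}: label {values[0]} not in {allowed_labels}")
        n_rows += 1
    return n_rows, width


sample = "1,0.5,3\n0,0.1,7\n"  # made-up rows, for illustration only
print(check_xgboost_csv(sample))  # (2, 3)
```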

### Set Up and Train an XGBoost Model on Amazon SageMaker

This code configures and initiates the training of an XGBoost model on Amazon SageMaker. Using the Estimator class, it specifies key parameters such as the instance type, IAM role, and the S3 location where the trained model will be stored. The model's hyperparameters, including max_depth, eta, gamma, min_child_weight, and subsample, are set to optimize the XGBoost algorithm for binary classification. Finally, the .fit() method starts the training job using the designated training and validation datasets stored in S3, enabling SageMaker to process the data and train the model accordingly.
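To make two of these settings more concrete, here is a conceptual sketch (not XGBoost's actual implementation) of how `eta` and the `binary:logistic` objective interact: each tree's output is shrunk by `eta`, the shrunken outputs are summed into a raw score, and the logistic function turns that score into a probability. The leaf values are hypothetical.

```python
import math


def sigmoid(margin):
    """Map a raw boosted score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def boosted_probability(tree_outputs, eta=0.2, base_margin=0.0):
    """Sum the shrunken tree outputs, then apply the logistic function."""
    margin = base_margin + sum(eta * out for out in tree_outputs)
    return sigmoid(margin)


# Hypothetical leaf values chosen for illustration only.
leaf_values = [1.0, 0.5, -0.25]
print(boosted_probability(leaf_values, eta=0.2))  # smaller steps, probability closer to 0.5
print(boosted_probability(leaf_values, eta=1.0))  # no shrinkage, more confident prediction
```

A smaller `eta` makes each boosting round contribute less, which is why it is usually paired with a larger `num_round`.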

In [4]:
from time import gmtime, strftime
# Initialize a new SageMaker session
sess = sagemaker.Session()
# Generate a unique model name based on the current time to ensure uniqueness
model_name = "xgboost" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)

# Define an XGBoost estimator using the SageMaker Estimator API
# `container`: URI of the XGBoost container image (retrieved earlier)
# `role`: IAM role with permissions for SageMaker to access AWS resources
# `instance_count`: Number of compute instances to use
# `instance_type`: Type of SageMaker instance for training
# `output_path`: S3 location for saving model artifacts
# `sagemaker_session`: SageMaker session object created earlier

xgb = sagemaker.estimator.Estimator(
    container,
    role, 
    instance_count=1, 
    instance_type='ml.m4.xlarge',
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=sess,
    model_name=model_name
)

# Set hyperparameters for the XGBoost training job
# `max_depth`: Maximum tree depth for base learners
# `eta`: Step size shrinkage to prevent overfitting
# `gamma`: Minimum loss reduction required for further partitioning
# `min_child_weight`: Minimum sum of instance weight needed in a child node
# `subsample`: Fraction of samples used per tree
# `silent`: Verbosity flag (0 prints running messages, 1 is silent mode)
# `objective`: Learning objective (here, binary classification)
# `num_round`: Number of boosting rounds

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    silent=0,
    objective='binary:logistic',
    num_round=100
)

# Launch the training job, passing in the S3 paths for training and validation datasets
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

Model name: xgboost2024-11-12-13-21-34


INFO:sagemaker:Creating training-job with name: xgboost-2024-11-12-13-21-34-549


2024-11-12 13:21:38 Starting - Starting the training job...
2024-11-12 13:21:53 Starting - Preparing the instances for training...
2024-11-12 13:22:19 Downloading - Downloading input data...
2024-11-12 13:22:49 Downloading - Downloading the training image......
2024-11-12 13:24:07 Training - Training image download completed. Training in progress.
2024-11-12 13:24:07 Uploading - Uploading generated training model
Arguments: train
[2024-11-12:13:23:55:INFO] Running standalone xgboost training.
[2024-11-12:13:23:55:INFO] File size need to be processed in the node: 4.35mb. Available memory size in the node: 8461.23mb
[2024-11-12:13:23:55:INFO] Determined delimiter of CSV input is ','
[13:23:55] S3DistributionType set as FullyReplicated
[13:23:55] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2024-11-12:13:23:55:INFO] Determined delimiter of CSV input is ','

### Deploy Trained XGBoost Model as a Real-Time Endpoint

This code deploys the trained XGBoost model to an Amazon SageMaker endpoint for real-time inference. The deploy() method sets up a fully managed endpoint, allowing the model to serve predictions via API requests. By specifying initial_instance_count and instance_type, you can control the scalability and resource allocation for handling inference requests. This deployment enables the model to be used for predictions in a production setting, supporting applications that require low-latency, real-time predictions.

In [5]:
# Deploy the trained XGBoost model as an endpoint for real-time inference
# `initial_instance_count`: Number of instances to serve predictions
# `instance_type`: Type of instance to host the endpoint

xgb_predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge'
)

INFO:sagemaker:Creating model with name: xgboost-2024-11-12-13-25-25-216
INFO:sagemaker:Creating endpoint-config with name xgboost-2024-11-12-13-25-25-216
INFO:sagemaker:Creating endpoint with name xgboost-2024-11-12-13-25-25-216


-------!

### Configure Serializer for Model Endpoint Input Format

This code configures the input data format for requests sent to the deployed SageMaker endpoint. Setting the serializer to CSVSerializer converts input data to CSV before it is sent to the endpoint. This matches the CSV format the XGBoost model was trained on, so the container can parse each request.

In [6]:
# Set the serializer for the predictor to format input data as CSV
# This serializer ensures that input data sent to the endpoint is correctly formatted as CSV, matching the model's expected input format

xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()
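As a rough illustration of what a CSV serializer does to the data before it is sent, the sketch below turns rows of numbers into headerless CSV text using only the standard library. The function name `rows_to_csv` is hypothetical, and this is not the SDK's implementation.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of numeric rows to headerless CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue().rstrip("\n")


# Made-up feature rows, for illustration only.
body = rows_to_csv([[3.0, 999.0, 0.0], [1.0, 5.0, 1.0]])
print(body)
# 3.0,999.0,0.0
# 1.0,5.0,1.0
```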

In [7]:
# Use AWS CLI to list all files in the specified S3 directory for test data
# `$test_path` contains the path to the S3 bucket and folder where test data files are stored

!aws s3 ls $test_path

2024-11-03 08:06:30     498229 test_script_x.csv
2024-11-03 08:06:30       8238 test_script_y.csv


In [8]:
# Use pip to uninstall the `s3fs` and `fsspec` packages
# `-y` automatically confirms the uninstallation

!pip uninstall -y s3fs fsspec

Found existing installation: fsspec 2023.6.0
Uninstalling fsspec-2023.6.0:
  Successfully uninstalled fsspec-2023.6.0


In [9]:
# Use pip to install specific versions of the `s3fs` and `fsspec` packages
# This ensures compatibility with the required version for the project

!pip install s3fs==2023.6.0 fsspec==2023.6.0

Collecting s3fs==2023.6.0
  Downloading s3fs-2023.6.0-py3-none-any.whl.metadata (1.6 kB)
Collecting fsspec==2023.6.0
  Downloading fsspec-2023.6.0-py3-none-any.whl.metadata (6.7 kB)
Collecting aiobotocore~=2.5.0 (from s3fs==2023.6.0)
  Downloading aiobotocore-2.5.4-py3-none-any.whl.metadata (19 kB)
Collecting botocore<1.31.18,>=1.31.17 (from aiobotocore~=2.5.0->s3fs==2023.6.0)
  Downloading botocore-1.31.17-py3-none-any.whl.metadata (5.9 kB)
Downloading s3fs-2023.6.0-py3-none-any.whl (28 kB)
Downloading fsspec-2023.6.0-py3-none-any.whl (163 kB)
Downloading aiobotocore-2.5.4-py3-none-any.whl (73 kB)
Downloading botocore-1.31.17-py3-none-any.whl (11.1 MB)
Installing collected packages: fsspec, botocore, aiobotocore, s3fs
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.162
    Uninstalling botocore-1.34.162:
      Successfully uninst

In [10]:
import s3fs

### Load Test Data for Inference: Features and Labels

This code loads the test data for inference from two CSV files stored in the S3 test path. The test_script_x.csv file contains the feature data (X), and test_script_y.csv contains the actual labels (y) for the test dataset. By setting header=None, the code ensures that the files are read without assuming any header row in the CSV files, which is useful when the data doesn't include column headers. These dataframes will be used for evaluating the model or making predictions.

In [11]:
# Load the test data for features (X) and labels (y) from CSV files stored in the S3 test path
# `test_script_x.csv` contains the features, and `test_script_y.csv` contains the corresponding labels
# `header=None` ensures the CSV files are loaded without assuming any header row

test_data_x = pd.read_csv(os.path.join(test_path, 'test_script_x.csv'), header=None)
test_data_y = pd.read_csv(os.path.join(test_path, 'test_script_y.csv'), header=None)
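`os.path.join` builds valid S3 URIs here because the notebook runs on Linux and `test_path` ends with a slash, but it uses the platform's path separator in general. A small sketch of a platform-independent alternative follows; the helper name `s3_join` and the bucket name are hypothetical.

```python
import posixpath


def s3_join(base_uri, *parts):
    """Join path segments onto an s3:// URI using forward slashes on every platform."""
    if not base_uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {base_uri!r}")
    bucket_and_key = base_uri[len("s3://"):]
    # Strip leading slashes so a segment cannot reset the path to the root.
    cleaned = [part.lstrip("/") for part in parts]
    return "s3://" + posixpath.join(bucket_and_key, *cleaned)


# Hypothetical bucket name, for illustration only.
print(s3_join("s3://example-bucket/mlops/activity-3/test/", "test_script_x.csv"))
# s3://example-bucket/mlops/activity-3/test/test_script_x.csv
```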

### Batch Prediction for Large Datasets Using SageMaker Endpoint

This function, predict, is designed to handle large datasets by splitting the input data into smaller batches, making it easier to process and predict without overwhelming system resources. The data (input features) is split into batches of a specified size (500 rows by default), and each batch is sent to the deployed SageMaker model endpoint (predictor) for inference. The predictions are collected in a string, which is later converted into a numerical array using np.fromstring. The result is the final array of predictions for the entire dataset. The function is invoked on test_data_x, which contains the feature data for making predictions using the trained XGBoost model.

In [12]:
# Define a function to make predictions on large datasets by splitting the data into smaller batches
# `data`: The input data to predict (features)
# `predictor`: The SageMaker model predictor for inference
# `rows`: The number of rows per batch (default is 500)
# The function splits the data into batches to avoid memory overload and sends each batch to the endpoint for prediction

def predict(data, predictor, rows=500):
    # Split the input data into smaller batches of the specified size
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    
    # Initialize an empty string to store the concatenated predictions
    predictions = ''
    
    # Loop over each batch and request predictions from the SageMaker endpoint
    for array in split_array:
        # Make predictions and append the results to the predictions string
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Convert the comma-separated string of predictions into a numpy array
    return np.fromstring(predictions[1:], sep=',')

# Call the `predict` function on the test data using the XGBoost predictor
predictions = predict(test_data_x, xgb_predictor)

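To see what the splitting rule in `predict` produces, the sketch below reproduces the batch sizes in plain Python. It assumes the documented behavior of `np.array_split`, which gives the first `n % k` sections one extra element; the helper name `batch_sizes` is hypothetical.

```python
def batch_sizes(n_rows, rows=500):
    """Sizes of the batches produced by the notebook's splitting rule.

    Mirrors np.array_split(data, int(n_rows / rows + 1)): the first
    n_rows % k batches get one extra row.
    """
    k = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, k)
    return [base + 1] * extra + [base] * (k - extra)


# 4119 is the number of test rows counted in this notebook's confusion matrix.
sizes = batch_sizes(4119)
print(len(sizes), max(sizes), sum(sizes))  # 9 458 4119
```

Because the number of batches is always strictly greater than `n_rows / rows`, no batch exceeds `rows` rows, although the batches are usually somewhat smaller than the requested size.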


### Generate Confusion Matrix for Model Predictions

This code generates a confusion matrix using pd.crosstab, which compares the predicted values with the actual labels from the test set. The index parameter contains the actual values (test_data_y[0]), and the columns parameter contains the rounded predictions (np.round(predictions)) to map them to discrete classes. The resulting matrix shows how many instances were correctly or incorrectly classified, providing an overview of the model’s classification accuracy. The matrix is labeled with actuals for true labels and predictions for the predicted values.

In [13]:
# Generate a confusion matrix to evaluate the model's performance by comparing actual vs predicted values
# `test_data_y[0]`: Actual labels (ground truth) for the test set
# `predictions`: Predicted values from the model (rounded to nearest integer for classification)
# The result is a crosstab showing how well the predictions match the actual labels

pd.crosstab(index=test_data_y[0], columns=np.round(predictions), 
            rownames=['actuals'], colnames=['predictions'])

| actuals (rows) / predictions (columns) | 0.0 | 1.0 |
|---|---|---|
| 0 | 3584 | 51 |
| 1 | 383 | 101 |
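The counts in this matrix can be turned into summary metrics. The sketch below is a plain-Python calculation with a hypothetical helper name; it uses only the four counts reported above, with class 1 as the positive class.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall for the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Counts taken from the confusion matrix above (rows = actuals, columns = predictions).
metrics = classification_metrics(tn=3584, fp=51, fn=383, tp=101)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.895
# precision: 0.664
# recall: 0.209
```

The accuracy looks high mainly because class 0 dominates the test set; the recall shows that only about one in five actual positives is detected at the 0.5 threshold.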


In [29]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-2024-11-12-13-25-25-216
INFO:sagemaker:Deleting endpoint with name: xgboost-2024-11-12-13-25-25-216


# Model Deployment

### Initialize Boto3 Clients for SageMaker and SageMaker Runtime

In [16]:
import boto3

client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")

### Retrieve Model Artifacts from the Trained XGBoost Model

This code retrieves the location of the model artifacts from the trained XGBoost model. xgb.model_data contains the S3 URI where the trained model is stored, including the model's weights, configurations, and other necessary components. These artifacts are essential for making predictions and for future use, such as model deployment or re-training. The path returned points to the location where the model was saved after training, allowing further interactions with the trained model in SageMaker.

In [34]:
# Access the model artifacts (e.g., model weights and configurations) of the trained XGBoost model
# `xgb.model_data` contains the S3 path to the model artifacts generated during training

model_artifacts = xgb.model_data

# Display the path to the model artifacts stored in S3
model_artifacts

's3://sagemaker-us-east-1-607119565685/mlops/activity-3/output/xgboost-2024-11-11-13-53-34-644/output/model.tar.gz'

### Create and Register a Serverless Model in SageMaker

This code creates a new model in Amazon SageMaker by specifying the model name, container image, model artifacts (stored in S3), and any necessary environment variables. The create_model() function registers the model, allowing it to be used for inference. A unique model name is generated using the current timestamp to avoid conflicts with existing models. The model container is specified in "SingleModel" mode, meaning only one model is deployed per container. After the model is created, the response contains the ARN (Amazon Resource Name) of the model, which uniquely identifies it within the SageMaker environment.

In [18]:
from time import gmtime, strftime

# Generate a unique model name based on the current time to ensure uniqueness
model_name = "xgboost-serverless" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)

# Define dummy environment variables for the container
# These variables can be used within the container to configure logging levels or other settings
byo_container_env_vars = {"SAGEMAKER_CONTAINER_LOG_LEVEL": "20", "SOME_ENV_VAR": "myEnvVar"}

# Create the model in SageMaker
# `ModelName`: The unique name for the model being created
# `Containers`: A list containing the container definition for the model
# `Image`: The URI of the container image for the model
# `ModelDataUrl`: The S3 URI pointing to the model artifacts
# `Environment`: Environment variables that will be set in the container
create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": container,  # The XGBoost container retrieved earlier
            "Mode": "SingleModel",  # The model type for inference
            "ModelDataUrl": model_artifacts,  # The S3 URI for the model artifacts
            "Environment": byo_container_env_vars,  # Custom environment variables for the container
        }
    ],
    ExecutionRoleArn=role,  # The IAM role to allow SageMaker to interact with AWS resources
)

# Print the ARN of the created model for reference
print("Model Arn: " + create_model_response["ModelArn"])

Model name: xgboost-serverless2024-11-11-14-03-11
Model Arn: arn:aws:sagemaker:us-east-1:607119565685:model/xgboost-serverless2024-11-11-14-03-11
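The generated name above runs the prefix and the timestamp together (`xgboost-serverless2024-...`). A small sketch of a name builder that inserts the separating hyphen and checks the result follows. The helper `unique_name` is hypothetical, and the naming rule (alphanumerics and hyphens, at most 63 characters) is an assumption to verify against the current SageMaker API reference.

```python
import re
from datetime import datetime, timezone

# Assumed naming rule: alphanumerics and hyphens, at most 63 characters.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def unique_name(prefix, now=None):
    """Return '<prefix>-<UTC timestamp>' and check it against the assumed rule."""
    now = now or datetime.now(timezone.utc)
    name = f"{prefix}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    if len(name) > 63 or not NAME_PATTERN.match(name):
        raise ValueError(f"invalid resource name: {name!r}")
    return name


print(unique_name("xgboost-serverless", datetime(2024, 11, 11, 14, 3, 11)))
# xgboost-serverless-2024-11-11-14-03-11
```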


### Create a Serverless Endpoint Configuration for the XGBoost Model

This code creates an endpoint configuration for a serverless deployment of the XGBoost model in Amazon SageMaker. The configuration includes a unique name generated from the current timestamp and specifies a production variant, which defines how the model will be deployed. The ServerlessConfig includes the memory size (MemorySizeInMB) and the maximum concurrency (MaxConcurrency) for the serverless endpoint, determining the available resources for handling inference requests. After the configuration is created, the ARN (Amazon Resource Name) of the endpoint configuration is printed for reference, which can later be used to deploy the model endpoint.

In [21]:
from time import gmtime, strftime

# Generate a unique endpoint configuration name based on the current time
xgboost_epc_name = "mlops-serverless-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Create an endpoint configuration for the XGBoost model
# `EndpointConfigName`: The unique name for the endpoint configuration
# `ProductionVariants`: Defines the production variants, including the model and serverless configuration

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name,  # Unique name for the endpoint configuration
    ProductionVariants=[
        {
            "VariantName": "byoVariant",  # Name of the production variant
            "ModelName": model_name,  # The registered model name
            "ServerlessConfig": {
                "MemorySizeInMB": 3072,  # The amount of memory for the serverless endpoint (in MB)
                "MaxConcurrency": 1,     # Maximum number of concurrent invocations
            },
        },
    ],
)

# Print the ARN of the created endpoint configuration for reference
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

Endpoint Configuration Arn: arn:aws:sagemaker:us-east-1:607119565685:endpoint-config/mlops-serverless-epc2024-11-11-14-04-53
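Since an invalid `ServerlessConfig` is only rejected when the API call is made, it can help to validate the values locally first. The sketch below builds one production variant with such a check. The helper name and the limits (memory in 1 GB steps from 1024 to 6144 MB, concurrency from 1 to 200) are assumptions that should be confirmed against the current SageMaker documentation.

```python
# Assumed limits (check the current SageMaker documentation before relying on them).
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def serverless_variant(model_name, memory_mb=3072, max_concurrency=1,
                       variant_name="byoVariant"):
    """Build one ProductionVariants entry with a validated ServerlessConfig."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"MemorySizeInMB must be one of {ALLOWED_MEMORY_MB}")
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError("MaxConcurrency out of range")
    return {
        "VariantName": variant_name,
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


# "example-model" is a placeholder name, for illustration only.
print(serverless_variant("example-model"))
```

The returned dictionary has the same shape as the entry passed to `ProductionVariants` in the cell above.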


### Create a Serverless Endpoint for the XGBoost Model

This code creates a new serverless endpoint in Amazon SageMaker using the previously defined endpoint configuration. The endpoint name is generated uniquely by appending the current timestamp to avoid name collisions. The create_endpoint() function creates the actual endpoint, which is used to invoke the model for real-time predictions. The response contains the ARN (Amazon Resource Name) of the endpoint, which is then printed for reference. This endpoint will be used to deploy the XGBoost model and serve real-time predictions in a serverless fashion.

In [22]:
from time import gmtime, strftime

# Generate a unique endpoint name based on the current time to ensure uniqueness
endpoint_name = "xgboost-serverless-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Create an endpoint using the previously created endpoint configuration
# `EndpointName`: The unique name for the endpoint
# `EndpointConfigName`: The name of the endpoint configuration that defines how the model will be deployed

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,  # Unique name for the endpoint
    EndpointConfigName=xgboost_epc_name,  # The endpoint configuration containing serverless setup
)

# Print the ARN of the created endpoint for reference
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:607119565685:endpoint/xgboost-serverless-ep2024-11-11-14-04-55


### Monitor Endpoint Creation Status Until InService

This code polls the creation status of the SageMaker endpoint using the describe_endpoint() function. The loop re-queries the endpoint every 15 seconds and prints the status for as long as it is "Creating". It exits as soon as the status changes, which normally means "InService" (ready to serve predictions) but can also mean "Failed", so check the final response before sending requests. The last expression in the cell displays the full endpoint description.

In [35]:
import time

# Describe the endpoint to check its creation status
describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)

# Poll the endpoint status while it is still "Creating"
# The loop checks the status every 15 seconds
while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])  # Print the current status of the endpoint
    time.sleep(15)  # Wait for 15 seconds before checking again

# The loop exits on any status other than "Creating"; confirm it is "InService" (not "Failed")
describe_endpoint_response  # Display the final response

{'EndpointName': 'xgboost-serverless-ep2024-11-11-14-04-55',
 'EndpointArn': 'arn:aws:sagemaker:us-east-1:607119565685:endpoint/xgboost-serverless-ep2024-11-11-14-04-55',
 'EndpointConfigName': 'mlops-serverless-epc2024-11-11-14-04-53',
 'ProductionVariants': [{'VariantName': 'byoVariant',
   'DeployedImages': [{'SpecifiedImage': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
     'ResolvedImage': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost@sha256:0c8f830ac408e6dee08445fb60392e9c3f05f790a4b3c07ec22327c08f75bcbf',
     'ResolutionTime': datetime.datetime(2024, 11, 11, 14, 4, 57, 153000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 0,
   'CurrentServerlessConfig': {'MemorySizeInMB': 3072, 'MaxConcurrency': 1}}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2024, 11, 11, 14, 4, 56, 478000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 11, 11, 14, 6, 49, 810000, tzinfo=tzl
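The polling loop above stops on any status other than "Creating" and has no upper bound on waiting time. A slightly more defensive sketch is shown below: it raises if the endpoint ends in a non-"InService" state or if a timeout is reached. The function `wait_for_endpoint` and its parameters are hypothetical, and a stand-in replaces `client.describe_endpoint` so the example runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=15, timeout_seconds=1800,
                      sleep=time.sleep):
    """Poll `describe` until the endpoint leaves 'Creating'; raise unless 'InService'."""
    waited = 0
    while True:
        response = describe(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return response
        if status != "Creating":
            reason = response.get("FailureReason", "no reason reported")
            raise RuntimeError(f"endpoint ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint still Creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for client.describe_endpoint so the sketch runs without AWS access.
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_statuses)}


result = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda s: None)
print(result["EndpointStatus"])  # InService
```

With a real client, `describe` would be `client.describe_endpoint`. boto3 also provides a built-in waiter, `client.get_waiter("endpoint_in_service")`, which covers the same need.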

### Invoke a SageMaker Endpoint for Real-Time Inference

This code demonstrates how to invoke a SageMaker endpoint for real-time inference. The payload is a CSV-formatted string, representing a single input sample for the model. The invoke_endpoint() function is used to send the payload to the endpoint, and the response contains the prediction result. The ContentType="text/csv" specifies that the payload is in CSV format. The result is read from the response's body, decoded, and printed. This allows you to get real-time predictions from the deployed model endpoint.

In [24]:
# Define a payload for the prediction request, which is a CSV-formatted string in byte format
payload = b"3., 999.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 1.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,   0., 0.,   0.,   0.,   0.,   0.,   1.,   0.,   1.,   0.,   0.,   1., 0.,   0.,   1.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   1., 0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0., 0.,   1.,   0."

# Invoke the SageMaker endpoint for real-time inference using the runtime client
# `EndpointName`: The name of the deployed model endpoint
# `Body`: The data (payload) being sent to the endpoint for prediction
# `ContentType`: The format of the data being sent, which is CSV in this case

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,  # Name of the endpoint being invoked
    Body=payload,  # The input data for prediction
    ContentType="text/csv",  # Content type, as the data is in CSV format
)

# Read and decode the prediction result from the response and print it
print(response["Body"].read().decode())

0.07072833180427551
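The value returned by the endpoint is a probability for class 1, not a class label. A minimal sketch of decoding a response body and applying a decision threshold follows; the helper names are hypothetical and the 0.5 threshold is an assumption chosen to mirror the rounding used in the confusion matrix cells.

```python
def parse_scores(body):
    """Decode a comma- or newline-separated response body into floats."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    return [float(tok) for tok in text.replace("\n", ",").split(",") if tok.strip()]


def to_labels(scores, threshold=0.5):
    """Map probabilities to 0/1 labels; scores at or above the threshold become 1."""
    return [1 if s >= threshold else 0 for s in scores]


scores = parse_scores(b"0.07072833180427551")  # the response printed above
print(to_labels(scores))  # [0]
```

Lowering the threshold trades more false positives for fewer missed positives, which may matter for an imbalanced dataset such as this one.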


In [36]:
client.delete_model(ModelName=model_name)
client.delete_endpoint_config(EndpointConfigName=xgboost_epc_name)
client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'd1e7a81a-15dc-4b18-8e8d-23c180173fa0',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd1e7a81a-15dc-4b18-8e8d-23c180173fa0',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 11 Nov 2024 14:40:27 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

# Automatic Model Tuning

### Setting Up Hyperparameter Ranges for SageMaker Model Tuning

This code configures hyperparameter ranges to optimize a model’s performance on Amazon SageMaker. Each range defines values for SageMaker’s Hyperparameter Tuner to explore, helping find the best hyperparameter combination.

- **`ContinuousParameter(0, 1)`**: Defines a range for continuous hyperparameters, which SageMaker will test within the specified range (0 to 1).
- **`IntegerParameter(1, 10)`**: Defines a range for integer hyperparameters, allowing SageMaker to explore integer values between 1 and 10.
- **`CategoricalParameter(...)`**: (Not used here) Defines a range for categorical values if needed.

In [14]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# Define the range of hyperparameters to tune
hyperparameter_ranges = {
    'eta': ContinuousParameter(0, 1),               # Learning rate, from 0 to 1
    'min_child_weight': ContinuousParameter(1, 10), # Minimum sum of weights for child nodes, from 1 to 10
    'alpha': ContinuousParameter(0, 2),             # L1 regularization term, from 0 to 2
    'max_depth': IntegerParameter(1, 10)            # Maximum depth of trees, from 1 to 10
}

# Set the objective metric for tuning
objective_metric_name = 'validation:auc' # The metric to optimize during tuning
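To illustrate what "exploring a range" means, the sketch below draws a few candidate configurations uniformly at random from the same ranges. This is plain random search written for illustration; it is not how SageMaker selects candidates by default (its default strategy is Bayesian optimization), and the tuple encoding of the ranges is a hypothetical simplification.

```python
import random

# Same ranges as the cell above, written as plain tuples: (kind, low, high).
RANGES = {
    "eta": ("continuous", 0.0, 1.0),
    "min_child_weight": ("continuous", 1.0, 10.0),
    "alpha": ("continuous", 0.0, 2.0),
    "max_depth": ("integer", 1, 10),
}


def sample_configuration(ranges, rng):
    """Draw one hyperparameter combination uniformly at random from `ranges`."""
    config = {}
    for name, (kind, low, high) in ranges.items():
        if kind == "integer":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for _ in range(3):
    print(sample_configuration(RANGES, rng))
```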


### Initializing Hyperparameter Tuner for XGBoost

This code creates a `HyperparameterTuner` to optimize an XGBoost model on SageMaker. It sets:

- **`xgb`**: The XGBoost estimator to tune.
- **`objective_metric_name`**: The metric to optimize (`validation:auc`).
- **`hyperparameter_ranges`**: Parameter ranges for tuning.
- **`max_jobs=20`**: Total tuning jobs.
- **`max_parallel_jobs=3`**: Concurrent jobs allowed.

In [15]:
# Initialize the Hyperparameter Tuner
tuner = HyperparameterTuner(
    xgb,                         # XGBoost estimator
    objective_metric_name,       # Metric to optimize (validation:auc)
    hyperparameter_ranges,       # Hyperparameter ranges to explore
    max_jobs=20,                 # Total tuning jobs
    max_parallel_jobs=3          # Concurrent jobs allowed
)
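As a back-of-the-envelope aid for choosing `max_jobs` and `max_parallel_jobs`, the sketch below computes the minimum number of sequential rounds needed if every job took the same time. Real tuning jobs are not scheduled in lock-step, so treat this as a rough lower bound; the helper name is hypothetical.

```python
import math


def tuning_rounds(max_jobs, max_parallel_jobs):
    """Lower bound on the number of sequential rounds of training jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


print(tuning_rounds(20, 3))  # 7
```

Raising `max_parallel_jobs` shortens the wall-clock time, but a sequential strategy such as Bayesian optimization then has fewer completed results to learn from when it picks each new candidate.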

### Launching Hyperparameter Tuning Job

This code initiates the tuning job for the XGBoost model using SageMaker. It specifies the S3 input paths for the training and validation datasets.

- **`{'train': s3_input_train}`**: S3 path to training data.
- **`{'validation': s3_input_validation}`**: S3 path to validation data.

In [16]:
# Launch the hyperparameter tuning job
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating hyperparameter tuning job with name: xgboost-241112-1342


........................................................................................................!


### Checking Hyperparameter Tuning Job Status

This code retrieves the current status of the latest hyperparameter tuning job for the SageMaker model.

- **`tuner.latest_tuning_job.job_name`**: Fetches the name of the most recent tuning job.
- **`HyperParameterTuningJobStatus`**: Indicates the job's current status (e.g., `InProgress`, `Completed`, or `Failed`).

In [17]:
import boto3

# Check the status of the latest hyperparameter tuning job
status = boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)['HyperParameterTuningJobStatus']

status

'Completed'

### Retrieving the Best Training Job from Hyperparameter Tuning

This code returns the name of the training job that achieved the best value of the objective metric in the completed hyperparameter tuning job.

In [18]:
# Get the name of the best-performing training job
best_job_name = tuner.best_training_job()
best_job_name

'xgboost-241112-1342-015-d109cea7'
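Conceptually, picking the best training job means comparing each job's final objective value and keeping the highest one, since `validation:auc` is maximized. The sketch below shows that selection on hypothetical job records; the names and AUC values are invented for illustration and are not results from this notebook.

```python
def best_job(jobs, maximize=True):
    """Return the job record with the best objective value."""
    chooser = max if maximize else min
    return chooser(jobs, key=lambda job: job["objective"])


# Hypothetical job names and AUC values, for illustration only.
jobs = [
    {"name": "job-001", "objective": 0.71},
    {"name": "job-002", "objective": 0.78},
    {"name": "job-003", "objective": 0.75},
]
print(best_job(jobs)["name"])  # job-002
```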

### Deploying the Best Model from Hyperparameter Tuning

This code deploys the best model from the hyperparameter tuning job to an Amazon SageMaker endpoint.

- **`initial_instance_count=1`**: Specifies one instance for deploying the model.
- **`instance_type='ml.m4.xlarge'`**: Defines the instance type for hosting the model (in this case, a `ml.m4.xlarge` instance).

In [19]:
# Deploy the best model from the tuning job to a SageMaker endpoint
tuner_predictor = tuner.deploy(
    initial_instance_count=1,          # Number of instances to deploy
    instance_type='ml.m4.xlarge'      # Type of instance for hosting
)


2024-11-12 13:50:08 Starting - Found matching resource for reuse
2024-11-12 13:50:08 Downloading - Downloading the training image
2024-11-12 13:50:08 Training - Training image download completed. Training in progress.
2024-11-12 13:50:08 Uploading - Uploading generated training model
2024-11-12 13:50:08 Completed - Resource reused by training job: xgboost-241112-1342-018-383eacd5

INFO:sagemaker:Creating model with name: xgboost-2024-11-12-13-54-23-060

INFO:sagemaker:Creating endpoint-config with name xgboost-241112-1342-015-d109cea7
INFO:sagemaker:Creating endpoint with name xgboost-241112-1342-015-d109cea7


--------!

### Setting the Serializer for Model Inference

This code configures the input data format for requests sent to the deployed SageMaker endpoint. Setting the serializer to CSVSerializer converts input data to CSV before it is sent to the endpoint. This matches the CSV format the XGBoost model was trained on, so the container can parse each request.

In [21]:
# Set the serializer to handle CSV input format for inference
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()

### Load Test Data for Inference: Features and Labels

This code loads the test data for inference from two CSV files stored in the S3 test path. The test_script_x.csv file contains the feature data (X), and test_script_y.csv contains the actual labels (y) for the test dataset. By setting header=None, the code ensures that the files are read without assuming any header row in the CSV files, which is useful when the data doesn't include column headers. These dataframes will be used for evaluating the model or making predictions.

In [23]:
test_data_x = pd.read_csv(os.path.join(test_path, 'test_script_x.csv'),header=None)
test_data_y = pd.read_csv(os.path.join(test_path, 'test_script_y.csv'),header=None)

### Making Predictions with the Deployed Model

This code sends the test data to the deployed model and retrieves predictions.

In [26]:
# Make predictions using the deployed model and convert the result to a NumPy array
predictions = predict(test_data_x.to_numpy(), tuner_predictor)


### Generate Confusion Matrix for Model Predictions

This code generates a confusion matrix using pd.crosstab, which compares the predicted values with the actual labels from the test set. The index parameter contains the actual values (test_data_y[0]), and the columns parameter contains the rounded predictions (np.round(predictions)) to map them to discrete classes. The resulting matrix shows how many instances were correctly or incorrectly classified, providing an overview of the model’s classification accuracy. The matrix is labeled with actuals for true labels and predictions for the predicted values.

In [27]:
# Generate a confusion matrix to evaluate the model's performance by comparing actual vs predicted values
# `test_data_y[0]`: Actual labels (ground truth) for the test set
# `predictions`: Predicted values from the model (rounded to nearest integer for classification)
# The result is a crosstab showing how well the predictions match the actual labels

pd.crosstab(index=test_data_y[0], columns=np.round(predictions), 
            rownames=['actuals'], colnames=['predictions'])

| actuals (rows) / predictions (columns) | 0.0 | 1.0 |
|---|---|---|
| 0 | 3592 | 43 |
| 1 | 383 | 101 |
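Comparing this matrix with the one from the first model shows where tuning helped. The sketch below computes precision, recall, and false-positive rate for both sets of counts; the helper name `rates` is hypothetical and all counts come from the two confusion matrices in this notebook.

```python
def rates(tn, fp, fn, tp):
    """Precision, recall, and false-positive rate from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


baseline = rates(tn=3584, fp=51, fn=383, tp=101)  # first model, earlier matrix
tuned = rates(tn=3592, fp=43, fn=383, tp=101)     # tuned model, matrix above

for key in baseline:
    print(f"{key}: {baseline[key]:.4f} -> {tuned[key]:.4f}")
# precision: 0.6645 -> 0.7014
# recall: 0.2087 -> 0.2087
# false_positive_rate: 0.0140 -> 0.0118
```

At the 0.5 threshold the tuned model produces fewer false positives, while the number of detected positives is unchanged.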


### Delete Endpoint

In [28]:
tuner_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-241112-1342-015-d109cea7
INFO:sagemaker:Deleting endpoint with name: xgboost-241112-1342-015-d109cea7
