# Train a Model

**TODO:**
- Update code to use **SageMaker Python SDK V3**
- Need to learn more about hyperparameters
- Analyze XGBoost report

## Model Training Overview

**Note:** This notebook uses **SageMaker Python SDK V2**

To train a machine learning model, we first need **clean and well-prepared data**.

In my repository 👉 **[AI_Learning_DataPrep_SageMaker](https://github.com/VijayBheemineni/AI_Learning_DataPrep_SageMaker)**,  
I analyzed the `adult_data.csv` dataset, performed data cleaning and feature transformations, and split the data into three datasets:

- **Training dataset**
- **Validation dataset**
- **Test dataset**

All three processed CSV files are stored in **Amazon S3** and are used as inputs for model training and evaluation.

---

## Choosing the Machine Learning Algorithm

Once data preparation is complete, the next step is to select an appropriate **machine learning algorithm**.

In this use case, the goal is to predict whether an individual's annual income is:
- `>50K` or
- `<=50K`

Since the output has only **two possible outcomes**, this is a **binary classification problem**.  
For this reason, I am using the **XGBoost algorithm**, which is well-suited for structured/tabular data and is commonly used for classification problems.
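
To make the two-outcome framing concrete, here is a minimal sketch of how a binary classifier can turn a raw score into one of two classes. It assumes a logistic (sigmoid) mapping, which is the idea behind the `binary:logistic` objective used later in this notebook, and a 0.5 decision threshold. The function names, the threshold, and the sample scores are illustrative assumptions, not part of SageMaker or XGBoost.

```python
import math


def sigmoid(score: float) -> float:
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def classify(score: float, threshold: float = 0.5) -> int:
    """Return 1 (higher-income class) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(score) >= threshold else 0


# Hypothetical raw scores for three individuals (not real model output)
for raw_score in (-2.0, 0.0, 1.5):
    print(raw_score, round(sigmoid(raw_score), 3), classify(raw_score))
```

Lowering or raising the threshold trades false positives against false negatives, so 0.5 is a starting point rather than a rule.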

---

## What Happens During Model Training

The objective of model training is to create a model that can make accurate predictions on **new, unseen data**.

- The **training data** contains both input features and the target label (`income`)
- Future data used for predictions **does not contain the target label**

During training:
- The algorithm learns patterns that map input features to the target
- These learned patterns are stored as a **trained ML model**
- This model can then be used to predict income categories for new data
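
The steps above can be sketched with a deliberately tiny model: a single threshold on one feature. This is far simpler than XGBoost, which combines many decision trees, but it shows the same flow of learning from labeled rows and then predicting for unlabeled rows. The feature (weekly hours worked), the data values, and the function names are hypothetical.

```python
from typing import List


def fit_stump(values: List[float], labels: List[int]) -> float:
    """Learn the threshold that misclassifies the fewest training rows.

    Prediction rule: label 1 if value > threshold, else 0.
    """
    best_threshold, best_errors = values[0], len(values) + 1
    for candidate in sorted(set(values)):
        errors = sum(
            (1 if v > candidate else 0) != y for v, y in zip(values, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


def predict(threshold: float, values: List[float]) -> List[int]:
    """Apply the learned rule to rows that have no target label."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical training data: input feature plus the target label
train_hours = [20, 25, 30, 38, 45, 50, 60]
train_income = [0, 0, 0, 0, 1, 1, 1]

# The "trained model" here is just the learned threshold
model = fit_stump(train_hours, train_income)

# New, unseen rows contain only the feature, not the label
print(predict(model, [28, 55]))
```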

---

## Hyperparameters and Model Tuning

In addition to selecting an algorithm, we also configure **hyperparameters**.

Hyperparameters:
- Control how the training job runs
- Influence model behavior and learning process
- Have a significant impact on model performance and accuracy

Selecting the right hyperparameter values is an important part of training an effective model.
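
One simple way to explore hyperparameter values is to list every combination from a small grid and train once per combination. The sketch below only enumerates the candidates; it does not train anything. The helper name `candidate_settings` and the grid values are assumptions chosen for illustration, using two XGBoost hyperparameter names that appear later in this notebook.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def candidate_settings(grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one dictionary per combination of the values in the grid."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


# Hypothetical search grid; the values are illustrative only
search_grid = {
    "max_depth": [3, 5, 7],
    "eta": [0.1, 0.2],
}

candidates = list(candidate_settings(search_grid))
print(len(candidates))  # 3 values * 2 values = 6 combinations
print(candidates[0])
```

The number of combinations grows multiplicatively with each added hyperparameter, which is why grids are usually kept small.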

---


## Task 1: Set Up the Environment


In [None]:
# Install matplotlib, bokeh, and seaborn, then clear the notebook namespace
%pip install matplotlib # Low-level plotting library for static plots
%pip uninstall bokeh -y # Remove any preinstalled bokeh before pinning the version
%pip install bokeh==2.4.2 # Python visualization library for interactive charts
%pip install seaborn # High-level statistical visualization library built on Matplotlib
# %reset -f clears all user-defined variables; it does not restart the kernel
%reset -f

# Import packages
import boto3
from botocore.exceptions import ClientError
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
from time import gmtime, strftime


sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
boto3_session = boto3.Session()
region = boto3_session.region_name
sagemaker_client = boto3_session.client('sagemaker')

# Reload modules
%load_ext autoreload
%autoreload 2

## Task 2: Check That the S3 Bucket Exists and Is Accessible

In [None]:
# Check if S3 bucket exists
from botocore.exceptions import ClientError


def check_s3_bucket(bucket_name: str) -> bool:
    """
    Check if the S3 bucket exists and is accessible.

    Args:
        bucket_name (str): Name of the S3 bucket.

    Returns:
        bool: True if bucket exists and accessible, False otherwise.
    """
    s3 = boto3.client('s3')
    try:
        s3.head_bucket(Bucket=bucket_name)
        return True
    except ClientError:
        return False


def get_user_input(prompt: str) -> str:
    """
    Prompt user for input and ensure it's not empty.

    Args:
        prompt (str): Prompt text to display.

    Returns:
        str: User input.
    """
    while True:
        value = input(prompt).strip()
        if value:
            return value
        print("Input cannot be empty. Please try again.")


# -------------------------
# Interactive inputs
# -------------------------
bucket_name = get_user_input("Enter the S3 bucket name: ")
prefix = get_user_input(
    "Enter prefix/folder path which contains data (e.g., 'scripts/data'): "
)

# -------------------------
# Check bucket existence
# -------------------------
if not check_s3_bucket(bucket_name):
    raise ValueError(
        f"S3 Bucket '{bucket_name}' does not exist or you don't have access!"
    )

print(f"S3 Bucket '{bucket_name}' exists ✅")
print(f"Prefix/folder to use: '{prefix}'")
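
Because `check_s3_bucket` returns `False` for any `ClientError`, a mistyped name and a permissions problem look the same. A purely local check on the name, run before calling `head_bucket`, can catch obvious typos early. The sketch below covers only a subset of the S3 general-purpose bucket naming rules (length, allowed characters, first and last character, no adjacent dots); the function name is hypothetical and passing the check does not prove the bucket exists.

```python
import re

# Subset of the S3 bucket naming rules: 3-63 characters, lowercase letters,
# digits, dots and hyphens, starting and ending with a letter or digit.
_BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Cheap local check; it says nothing about existence or access rights."""
    if ".." in name:  # adjacent dots are not allowed
        return False
    return _BUCKET_NAME_PATTERN.fullmatch(name) is not None


# Hypothetical names used only to exercise the check
for example in ("my-training-data-123", "My_Bucket", "ab"):
    print(example, looks_like_valid_bucket_name(example))
```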

## Task 3: Configure S3 datasets path and Training Input Objects

In [None]:
# Configure S3 paths for the train, validation, and test datasets.
train_path = f"s3://{bucket_name}/{prefix}/train/adult_data_processed_train.csv"
validation_path = f"s3://{bucket_name}/{prefix}/validation/adult_data_processed_validation.csv"
test_path = f"s3://{bucket_name}/{prefix}/test/adult_data_processed_test.csv"

# Set up the TrainingInput objects. Setting S3 as datasource.
train_input = TrainingInput(train_path, content_type='text/csv')
validation_input = TrainingInput(validation_path, content_type='text/csv')
test_input = TrainingInput(test_path, content_type='text/csv')

print(f'Training path: {train_path}')
print(f'Validation path: {validation_path}')
print(f'Test path: {test_path}')
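
The f-strings in the cell above assume the prefix was entered without leading or trailing slashes. S3 keys are matched literally, so a prefix such as `scripts/data/` would produce a path containing `//` that points at a different key. A small helper like the one below, which is an assumption of this write-up and not part of the SageMaker SDK, normalizes the pieces before joining them.

```python
def build_s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket and key parts into an s3:// URI without duplicate slashes."""
    cleaned = [part.strip("/") for part in parts if part.strip("/")]
    return "s3://" + "/".join([bucket.strip("/")] + cleaned)


# Hypothetical bucket name; the prefix deliberately has extra slashes
print(build_s3_uri("my-bucket", "/scripts/data/", "train", "adult_data_processed_train.csv"))
```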

## Task 4: Retrieve 'xgboost' container URI

In [None]:
# -----------------------------
# Generate a unique run name
# -----------------------------
create_date = strftime("%Y%m%d-%H%M%S")
run_name = f"vijay-xgboost-income-classification-{create_date}"

# -----------------------------
# Retrieve XGBoost container URI
# https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-us-west-2.html#xgboost-us-west-2
# -----------------------------
FRAMEWORK_NAME = "xgboost"
FRAMEWORK_VERSION = "1.7-1"

container_uri = image_uris.retrieve(
    framework=FRAMEWORK_NAME,
    region=region,
    version=FRAMEWORK_VERSION
)

print(f"XGBoost container URI: {container_uri}")

## Task 5: Create "Estimator" object

- https://sagemaker.readthedocs.io/en/v2.20.0/api/
- https://sagemaker.readthedocs.io/en/v2.20.0/amazon_sagemaker_debugger.html#pre-defined-debugger-hook-configuration-for-built-in-rules
- https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html

### What is an Estimator

An Estimator is a configuration object that tells SageMaker:

- which algorithm to use
- where the data is
- what infrastructure to use
- how to run the training job

In [None]:
# =========================
# Configuration
# =========================
INSTANCE_TYPE = "ml.m5.xlarge"
INSTANCE_COUNT = 1
FRAMEWORK_NAME = "xgboost"

# S3 Location where Model Artifact will be stored.
output_path = f"s3://{bucket_name}/{prefix}/output"

print(f"SageMaker SDK Version: {sagemaker.__version__}")
print(f"Training output path: {output_path}")
print(f"Instance type: {INSTANCE_TYPE}")
print(f"Instance count: {INSTANCE_COUNT}")

# =========================
# XGBoost Estimator
# =========================
xgboost_estimator = sagemaker.estimator.Estimator(
    image_uri=container_uri,
    role=role,
    instance_count=INSTANCE_COUNT,
    instance_type=INSTANCE_TYPE,
    output_path=output_path,
    sagemaker_session=sagemaker_session,
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],
)

## Task 6: Setting HyperParameters

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html

### What are HyperParameters?

Think of hyperparameters as the "settings" or "knobs" you adjust before training your model. They're like the settings on a camera before you take a photo - you adjust them based on what you're trying to capture.

Unlike the patterns the model learns from data (which are called "parameters"), hyperparameters are values **you set beforehand** to control how the learning process works.

For example, in XGBoost:
- **max_depth**: How deep should each decision tree grow? (like deciding how many questions to ask before making a decision)
- **eta** (the learning rate): How quickly should the model learn? (too fast and it might miss details, too slow and it takes forever)
- **num_round**: How many trees should we build? (more trees can mean better accuracy, but also more computation time)

Finding the right hyperparameter values is part art, part science - it often requires experimentation to see what works best for your specific problem.
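
The effect of a learning rate can be seen in a toy setting. The sketch below runs plain gradient descent on the one-variable function f(w) = (w - 3)^2 with three different learning rates. This is only a loose analogy: in XGBoost the learning rate scales the contribution of each new tree rather than a step on a single weight. The function, starting point, and rates are all illustrative assumptions.

```python
def gradient_descent(learning_rate: float, steps: int = 20) -> float:
    """Minimize f(w) = (w - 3)^2 starting from w = 0 and return the final w."""
    w = 0.0
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= learning_rate * gradient
    return w


# The minimum is at w = 3. A very small rate is still far away after 20 steps,
# a moderate rate gets close, and a rate above 1 overshoots further each step.
for lr in (0.01, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 20 steps = {gradient_descent(lr):.3f}")
```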

In [None]:
# TODO: Need to learn more about these parameters
XGBOOST_HYPERPARAMETERS = {
    "max_depth": 5,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "subsample": 0.7,
    "verbosity": 0,
    "objective": "binary:logistic",
    "num_round": 800,
}
xgboost_estimator.set_hyperparameters(**XGBOOST_HYPERPARAMETERS)
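
A quick range check before starting a training job can catch typos that would otherwise only surface after the instance has launched. The sketch below checks only the hyperparameters used in this notebook, using the ranges and defaults given in the XGBoost parameter documentation; the helper name and the error messages are assumptions, and the check is not exhaustive.

```python
from typing import Any, Dict, List


def find_invalid_hyperparameters(params: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if "num_round" not in params:
        problems.append("num_round is required")
    elif int(params["num_round"]) < 1:
        problems.append("num_round must be at least 1")
    if not 0 <= float(params.get("eta", 0.3)) <= 1:
        problems.append("eta must be in [0, 1]")
    if not 0 < float(params.get("subsample", 1)) <= 1:
        problems.append("subsample must be in (0, 1]")
    for name, default in (("max_depth", 6), ("gamma", 0), ("min_child_weight", 1)):
        if float(params.get(name, default)) < 0:
            problems.append(f"{name} must not be negative")
    return problems


# Same values as the cell above, repeated so the sketch runs on its own
notebook_values = {
    "max_depth": 5,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "subsample": 0.7,
    "verbosity": 0,
    "objective": "binary:logistic",
    "num_round": 800,
}

print(find_invalid_hyperparameters(notebook_values))
print(find_invalid_hyperparameters({**notebook_values, "eta": 1.5}))
```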

## Task 7: Training the Model

The `fit()` method starts the training job. We will call the method with training and validation datasets.

https://sagemaker.readthedocs.io/en/v2.40.0/api/training/estimators.html#sagemaker.estimator.EstimatorBase.fit

### Check Training Job Status

AWS Console --> SageMaker AI --> Model training & customization --> Training & tuning jobs --> look for the job whose name starts with "vijay-xgboost-income-classifier"

In [None]:
run_timestamp = strftime("%Y%m%d-%H%M%S")
training_job_name = f"vijay-xgboost-income-classifier-{run_timestamp}"

training_data = {
    "train": train_input,
    "validation": validation_input,
}

xgboost_estimator.fit(
    inputs=training_data,
    job_name=training_job_name,
    wait=True,
    logs=True,
)
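
SageMaker training job names must be unique within an account and Region, may contain only letters, digits, and hyphens, and are limited to 63 characters. The sketch below builds a timestamped name and checks it locally before it is passed to `fit()`. The helper names are hypothetical, and using UTC via `gmtime()` is a design choice of this sketch (the cell above uses local time).

```python
import re
from time import gmtime, strftime, struct_time
from typing import Optional

# Letters and digits, optionally separated by hyphens, starting with a letter or digit
_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def is_valid_job_name(name: str) -> bool:
    """Check the character and length rules for a training job name."""
    return len(name) <= 63 and _JOB_NAME_PATTERN.fullmatch(name) is not None


def make_job_name(base: str, now: Optional[struct_time] = None) -> str:
    """Append a UTC timestamp to the base name."""
    timestamp = strftime("%Y%m%d-%H%M%S", now if now is not None else gmtime())
    return f"{base}-{timestamp}"


candidate_name = make_job_name("vijay-xgboost-income-classifier")
print(candidate_name, is_valid_job_name(candidate_name))
```

A timestamp with one-second resolution keeps names unique only if jobs are not started within the same second.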

## Task 8: Artifacts

After training completes, SageMaker generates several important outputs called "artifacts". Think of these as the deliverables from your training job.

The main artifacts include:

1. **Model Artifact**: This is the trained model itself - the actual file that contains all the learned patterns from your data. It's saved as a compressed file (tar.gz) in S3 and can be deployed to make predictions.

2. **XGBoost Report**: This is an automated analysis report generated by SageMaker Debugger. It provides insights into how well your model trained, including metrics, potential issues, and recommendations for improvement.

These artifacts are stored in the S3 output path we configured earlier, making them easy to access and use for deployment or further analysis.

**TODO:** Analyze XGBoost report

In [None]:
output_path = xgboost_estimator.output_path.rstrip("/")
job_name = xgboost_estimator.latest_training_job.job_name

xgboost_model_output = f"{output_path}/{job_name}/output"
xgboost_report = f"{output_path}/{job_name}/rule-output/CreateXgboostReport"

print(f"Model artifacts stored at: {xgboost_model_output}")
print(f"XGBoost training report stored at: {xgboost_report}")
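
To download or inspect these artifacts with an S3 client, the `s3://` URI usually has to be split into a bucket name and an object key. The helper below is a minimal sketch using only the standard library; its name and the sample URI are hypothetical. The sample assumes the model artifact is the `model.tar.gz` file that SageMaker writes under the job's `output` folder.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI shaped like the model output path printed above
sample_uri = "s3://my-bucket/scripts/data/output/example-job/output/model.tar.gz"
bucket, key = split_s3_uri(sample_uri)
print(bucket)
print(key)
```

The resulting pair could then be passed to a call such as boto3's `download_file(bucket, key, local_filename)` on an S3 client.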

---

## Summary

This repository focuses on the model training stage of my AWS AI learning journey using Amazon SageMaker. It builds on previously prepared and processed datasets and demonstrates how to train a binary classification model using the XGBoost algorithm.

The notebook covers:

- Configuring a SageMaker training job
- Setting model hyperparameters
- Training and validating the model using structured data stored in Amazon S3
- Monitoring training performance and reviewing generated model artifacts and reports