# 01 - Training

<div class="alert alert-block alert-info">
<b>Use case:</b> You are working for Octank Bank and want to develop a classification model that predicts whether a customer poses a credit risk.
</div> 

In this notebook you will:
1. Train an XGBoost model in the Jupyter environment in SageMaker Studio
2. Save model artifacts to Amazon S3
3. Save the model to the Amazon SageMaker model registry

Before you run this notebook, make sure you have run the `00_setup` notebook.

## 1. Set up environment

First, let's restore variables from the `00_setup` notebook and import the data science libraries required.

You will also initialise the following boto3 clients:

- _s3_client_ for managing storage
- _sagemaker_client_ for model operations

In [None]:
%store -r train_data_path test_data_path
%store -r bucket_name model_prefix role

In [None]:
import os
import warnings
import joblib
import tarfile
import boto3
import numpy as np
import pandas as pd
import xgboost
import sagemaker
from sagemaker.s3 import S3Uploader
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.exceptions import DataConversionWarning
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
from datetime import datetime
from time import gmtime, strftime

# Initialize boto3 clients
s3_client = boto3.client('s3')
sagemaker_client = boto3.client('sagemaker')

session = boto3.session.Session()
region = session.region_name


## 2. Perform feature engineering


In this step, you'll load the CSV file with the training dataset you created in the `00_setup` notebook. This file contains the features and the target variable that you'll use to train the XGBoost model for credit risk prediction. Let's examine the structure of the dataset to confirm:

- The number of records (rows) available for training
- The total number of features (columns)
- The data types of each feature
- Potential missing values

In [None]:
if os.path.exists(train_data_path):
    df = pd.read_csv(train_data_path)
    
    # Display information about the DataFrame
    print(f"Data loaded successfully from {train_data_path}")
    print(f"DataFrame shape: {df.shape} (rows, columns)")
    
    # Display column information
    print("\nColumns info:")
    print(df.info())
else:
    print(f"Error: The file {train_data_path} does not exist.")

The next step is data preprocessing. The `preprocess` function below prepares the credit risk data for training with XGBoost. It performs the following steps:
- It encodes categorical features (such as loan purpose and credit history) using one-hot encoding, which creates a separate column for each category.
- It separates the features from the target variable (credit risk) and encodes the target labels as numbers.
- It fits the featurizer model and transforms the dataset.
- It splits the data into training and validation sets.

   
All processed datasets are saved as CSV files. Additionally, the function saves the featurizer model and uploads it to Amazon S3. This ensures you can apply the exact same transformations to new data when making predictions during inference.
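
To make the encoding step concrete, here is a minimal plain-Python sketch of what one-hot encoding does when unknown categories are ignored, which is the behaviour requested with `handle_unknown='ignore'`. It is an illustration only, not scikit-learn's implementation, and the `housing` values are hypothetical.

```python
# Illustration only: a plain-Python stand-in for one-hot encoding
# with unknown categories ignored. Not scikit-learn's implementation.

def fit_categories(values):
    """Learn the sorted list of distinct categories seen during fitting."""
    return sorted(set(values))


def one_hot(value, categories):
    """Return one indicator per known category; unknown values map to all zeros."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical values for the "housing" column
housing_train = ["own", "rent", "own", "for free"]
categories = fit_categories(housing_train)

encoded_known = one_hot("rent", categories)
encoded_unknown = one_hot("shared", categories)  # category not seen during fitting

print(categories)       # ['for free', 'own', 'rent']
print(encoded_known)    # [0, 0, 1]
print(encoded_unknown)  # [0, 0, 0]
```

Because the categories are learned once during fitting, the same fitted featurizer has to be reused at inference time so that each column keeps its meaning.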

In [None]:
import os
import joblib
import tarfile
import logging
import numpy as np
import pandas as pd
from time import strftime, gmtime
from typing import Tuple, Dict
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split

# Define some helper functions
def save_to_csv(data: np.ndarray, filepath: str) -> None:
    pd.DataFrame(data).to_csv(filepath, header=False, index=False)

def upload_to_s3(local_path: str, s3_path: str) -> None:
    try:
        S3Uploader.upload(local_path, s3_path)
        print(f"✓ Uploaded to {s3_path}")
    except Exception as e:
        print(f"Failed to upload to S3: {str(e)}")
        raise

def create_s3_link(bucket_name: str, prefix: str, model_dir: str) -> str:
    region = boto3.session.Session().region_name
    return (f"https://s3.console.aws.amazon.com/s3/buckets/"
            f"{bucket_name}/{prefix}/{model_dir}/?region={region}&tab=objects")

In [None]:
def preprocess(df: pd.DataFrame, bucket_name: str, prefix: str) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, str]:
    """
    Preprocess credit risk data for machine learning model training.
    
    Args:
        df: Input dataframe containing credit risk data
        bucket_name: S3 bucket name to store artifacts
        prefix: S3 prefix path for organizing artifacts
    
    Returns:
        X_train, X_val, y_train, y_val: train and validation datasets and labels
        model_dir: local path to model parameters
    """


    try:
        # Configuration constants
        categorical_cols = [
            "credit_history",
            "purpose",
            "personal_status_sex",
            "other_debtors",
            "property",
            "other_installment_plans",
            "housing",
            "job",
            "telephone",
            "foreign_worker"
        ]
        
        target_column = "credit_risk"
        split_ratio = 0.3
        random_state = 42
        # Create timestamp and directories
        timestamp = strftime('%Y-%m-%d-%H-%M-%S', gmtime())
        print(f"Starting preprocessing with timestamp: {timestamp}")

        output_path = f"output/pre-process-{timestamp}/data"
        model_dir = f"output/pre-process-{timestamp}/model/sklearn"
        os.makedirs(output_path, exist_ok=True)
        os.makedirs(model_dir, exist_ok=True)

        # Validate input data
        missing_cols = [col for col in categorical_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing columns: {missing_cols}")
        if target_column not in df.columns:
            raise ValueError(f"Target column {target_column} not found")

        # Create and configure transformer
        transformer = make_column_transformer(
            (OneHotEncoder(sparse_output=False, handle_unknown='ignore'), 
             categorical_cols),
            remainder="passthrough"
        )

        # Prepare features and target
        X = df.drop(target_column, axis=1)
        y = df[target_column]

        # Log class distribution
        class_dist = y.value_counts(normalize=True).round(3) * 100
        print(f"Class distribution:\n{class_dist}")

        # Transform features
        print("Transforming features...")
        featurizer_model = transformer.fit(X)
        features = featurizer_model.transform(X)
        labels = LabelEncoder().fit_transform(y)

        # Split dataset
        print(f"Splitting data with {split_ratio:.0%} validation ratio...")
        X_train, X_val, y_train, y_val = train_test_split(
            features, labels,
            test_size=split_ratio,
            random_state=random_state,
            stratify=labels
        )

        # Log dataset shapes
        print(f"Training set: {X_train.shape}, Validation set: {X_val.shape}")

        # Save datasets
        datasets = {
            "train_features.csv": X_train,
            "train_labels.csv": y_train,
            "val_features.csv": X_val,
            "val_labels.csv": y_val
        }
        
        for filename, data in datasets.items():
            save_to_csv(data, os.path.join(output_path, filename))

        # Save and archive model
        model_path = os.path.join(model_dir, "model.joblib")
        model_archive = os.path.join(model_dir, "model.tar.gz")
        
        print("Saving feature transformer model...")
        joblib.dump(featurizer_model, model_path)
        
        with tarfile.open(model_archive, "w:gz") as tar:
            tar.add(model_path, arcname="model.joblib")
        print(f"Model archived to {model_archive}")

        # Upload to S3
        s3_path = f"s3://{bucket_name}/{prefix}/{model_dir}"
        upload_to_s3(model_archive, s3_path)

        print("Preprocessing completed successfully!")
        
        # Display S3 link
        from IPython.display import display, HTML
        s3_link = create_s3_link(bucket_name, prefix, model_dir)
        display(HTML(f'<b>Click the link to navigate to the outputs in Amazon S3:</b> <a href="{s3_link}" target="_blank">View</a>'))

        return X_train, X_val, y_train, y_val, model_dir

    except Exception as e:
        print(f"Preprocessing failed: {str(e)}")
        raise

In [None]:
# Suppress warnings
warnings.filterwarnings("ignore", category=DataConversionWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
X_train, X_val, y_train, y_val, featurizer_model_dir = preprocess(df, bucket_name, model_prefix)

# Store this as you will need it in subsequent notebooks

%store featurizer_model_dir
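
`preprocess` adds the featurizer to `model.tar.gz` with `arcname="model.joblib"`, so the file sits at the root of the archive instead of under the nested local `output/...` path. The sketch below shows that effect with a placeholder file in a temporary directory; the helper name `archive_artifact` and the file contents are hypothetical.

```python
import os
import tarfile
import tempfile


def archive_artifact(file_path: str, archive_path: str) -> list:
    """Pack one file at the root of a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops the local directory structure from the archive entry
        tar.add(file_path, arcname=os.path.basename(file_path))
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp_dir:
    nested_dir = os.path.join(tmp_dir, "output", "model", "sklearn")
    os.makedirs(nested_dir)
    artifact = os.path.join(nested_dir, "model.joblib")
    with open(artifact, "wb") as f:
        f.write(b"placeholder bytes standing in for a serialized model")
    member_names = archive_artifact(artifact, os.path.join(tmp_dir, "model.tar.gz"))

print(member_names)  # ['model.joblib']
```

A flat archive means the code that later unpacks it can look for `model.joblib` directly, without knowing the timestamped directory used during preprocessing.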


## 3.  Train the XGBoost model

You are now ready to train the model. The `train` function below takes a set of XGBoost hyperparameters as input and runs the training. The model artifacts are saved to Amazon S3 for later use.

In [None]:
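def train(bucket_name, prefix, X, val_X, y, val_y, num_round=100, params=None, early_stopping_rounds=10):
    """
    Train an XGBoost model with validation and save the model artifact.
    
    Parameters:
    -----------
    bucket_name : str
        S3 bucket name to store the model artifact
    prefix : str
        S3 prefix path for organizing artifacts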
    X : array-like
        Training features
    val_X : array-like
        Validation features
    y : array-like
        Training labels
    val_y : array-like
        Validation labels
    num_round : int, default=100
        Number of boosting rounds
    params : dict, default=None
        XGBoost parameters
    early_stopping_rounds : int, default=10
        Stop training if validation performance doesn't improve
        
    Returns:
    --------
    dict
        Model training results including the model and evaluation metrics
    """
    # Set default parameters if none provided
    if params is None:
        params = {
            'objective': 'binary:logistic',
            'eval_metric': 'auc',
            'max_depth': 5,
            'eta': 0.1
        }
    
    # Create a timestamp for the current time
    timestamp = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

    # Create output directory
    model_dir = f"output/training-{timestamp}/model"
    try:
        os.makedirs(model_dir, exist_ok=True)
        print(f"Directory '{model_dir}' created successfully.")
    except OSError as e:
        print(f"Error creating directory '{model_dir}': {e}")
        
    # Create DMatrix objects (only once)
    dtrain = xgboost.DMatrix(X, label=y)
    dval = xgboost.DMatrix(val_X, label=val_y)

    # Set up evaluation watchlist
    watchlist = [(dtrain, "train"), (dval, "validation")]

    # Dictionary to store evaluation results
    evaluation_results = {}
    
    print("Training the model...")
    bst = xgboost.train(
        params=params, 
        dtrain=dtrain, 
        evals=watchlist, 
        num_boost_round=num_round,
        early_stopping_rounds=early_stopping_rounds,
        evals_result=evaluation_results
    )
    
    # Evaluate the model
    val_preds = bst.predict(dval)
    auc_score = roc_auc_score(val_y, val_preds)
    print(f"Validation AUC: {auc_score:.4f}")
    
    # Save model 
    model_path = os.path.join(model_dir, "model.ubj")
    bst.save_model(model_path)
    print(f"Model saved to {model_path}")

    # Compress the model artifact to a tar file
    model_archive = os.path.join(model_dir, "model.tar.gz")
    with tarfile.open(model_archive, "w:gz") as tar:
        tar.add(model_path, arcname="model.ubj")
    print(f"Model archived to {model_archive}")

    # Upload to Amazon S3
    s3_path = f"s3://{bucket_name}/{prefix}/{model_dir}"
    upload_to_s3(model_archive, s3_path)

    print("Training completed successfully!")
    
    # Display S3 link
    from IPython.display import display, HTML
    s3_link = create_s3_link(bucket_name, prefix, model_dir)
    display(HTML(f'<b>Click the link to navigate to the outputs in Amazon S3:</b> <a href="{s3_link}" target="_blank">View</a>'))

    # Return model and metrics
    return {
        "model": bst,
        "auc_score": auc_score,
        "evaluation_results": evaluation_results,
        "s3_model_path": f"{s3_path}/model.tar.gz",
        "archive_path": model_archive
    }

You are now ready to call the `train` function with the training and validation datasets. The function returns a dictionary containing the trained model, evaluation metrics, and file paths to the saved model artifacts.

In [None]:
hyperparameters = {
    "max_depth": 5,
    "eta": 0.1,
    "gamma": 4,
    "min_child_weight": 6,
    "verbosity": 0,  # replaces the deprecated "silent" parameter
    "objective": "binary:logistic",
    "subsample": 0.8,
    "eval_metric": "auc"
}
# The number of boosting rounds is passed to train() separately, not as an XGBoost parameter
num_round = 50

training_results = train(bucket_name, model_prefix, X_train, X_val, y_train, y_val, num_round, hyperparameters)
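
The `train` function passes `early_stopping_rounds` to `xgboost.train`, which stops training once the validation metric has not improved for that many consecutive rounds. According to the XGBoost documentation, the last entry in `evals` is the one used for early stopping, which here is the validation set. The sketch below is a simplified plain-Python illustration of that stopping rule, not XGBoost's implementation; the function name and the AUC values are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (stop_round, best_round) for a metric where higher is better.

    Training stops once `patience` consecutive rounds fail to beat the best score.
    """
    best_round, best_score, rounds_without_gain = 0, float("-inf"), 0
    for current_round, score in enumerate(scores):
        if score > best_score:
            best_round, best_score, rounds_without_gain = current_round, score, 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= patience:
                return current_round, best_round
    # The metric kept improving often enough: all rounds were used
    return len(scores) - 1, best_round


# Hypothetical validation AUC values, one per boosting round
validation_auc_history = [0.70, 0.74, 0.77, 0.76, 0.77, 0.75, 0.78, 0.78]
stop_round, best_round = early_stopping_round(validation_auc_history, patience=3)
print(stop_round, best_round)  # 5 2
```

In this example training stops at round 5 and never reaches the later, slightly better rounds, which is the trade-off a small patience value makes in exchange for shorter training.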

## 4.  Analyse the model results

Use the `matplotlib` library to plot the AUC score for the training and validation sets at each boosting round. AUC measures the model's ability to distinguish between good and bad credit risks.

In [None]:
model = training_results["model"]

# Extract and store the S3 model artifact URI
model_artifact = training_results["s3_model_path"]
print(model_artifact)
%store model_artifact

# Print the validation AUC score
validation_auc = training_results['auc_score']
print(f"Validation AUC: {validation_auc}")

# Plot the training and validation metrics
results = training_results["evaluation_results"]

plt.figure(figsize=(10, 6))
plt.plot(results['train']['auc'], label='Train AUC')
plt.plot(results['validation']['auc'], label='Validation AUC')
plt.xlabel('Boosting Round')
plt.ylabel('AUC')
plt.title('XGBoost Training Performance')
plt.legend()
plt.grid(True)
plt.show()
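
To make the plotted metric concrete: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on a tiny hypothetical dataset. It is meant for intuition only; `roc_auc_score` remains the function to use in practice.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.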


## 5. (Optional) Save the model to the SageMaker model registry

[Amazon SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) is a capability that helps organize and manage machine learning models in production environments.  It allows you to catalog models, manage different versions, track metadata (like training metrics), and maintain model lineage for traceability. 

The model registry is structured with **model package groups** that contain different versions of the model called **model packages**. You can organize model package groups into _Collections_ for better management. The Model Registry integrates with SageMaker's MLOps tools, enabling automated model deployment through CI/CD pipelines and facilitating collaboration across teams. 

In this section, you will register the XGBoost model you trained in the previous steps to the model registry. First, you create a new model package group for the credit risk prediction models.

In [None]:
# Create model package group name
timestamp = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
model_package_group_name = f"credit-risk-{timestamp}"

# Create the model package group
try:
    response = sagemaker_client.create_model_package_group(
        ModelPackageGroupName=model_package_group_name,
        ModelPackageGroupDescription="Credit risk prediction models based on XGBoost",
        Tags=[
            {
                'Key': 'Project',
                'Value': 'CreditRiskPrediction'
            },
            {
                'Key': 'Framework',
                'Value': 'XGBoost'
            }
        ]
    )
    
    # Get the ARN of the model package group
    model_package_group_arn = response['ModelPackageGroupArn']
    
    print(f"Successfully created Model Package Group: {model_package_group_name}")
    print(f"Model Package Group ARN: {model_package_group_arn}")
    
    # Store the model package group name for later use
    # This can be used when registering model versions
    model_registry_details = {
        "model_package_group_name": model_package_group_name,
        "model_package_group_arn": model_package_group_arn
    }
    
except Exception as e:
    print(f"Error creating model package group: {str(e)}")

# Display the model package group name
model_package_group_name

Now that the model package group is ready, you can start registering model packages to it.  A **model package** contains information about the model version itself and details for deployment such as the location of the model artifacts, metadata, the URI of the container image to be used during deployment, the SageMaker instance types, the model approval status, and more.  

The following code registers the model to the model registry by adding a model package to the model package group created earlier.

When you deploy models on Amazon SageMaker, you can use AWS pre-built container images, extend them, or bring your own container image. In this lab, you will use the **SageMaker Distribution image**, specifically the same version as the one that powers your JupyterLab environment.

In [None]:
# Get the current region
region = boto3.Session().region_name

# Read the URI of the SageMaker Distribution image that powers this JupyterLab environment
image_uri = os.environ['SAGEMAKER_INTERNAL_IMAGE_URI']

# Create base model package input
create_model_package_input_dict = {
    "ModelPackageGroupName": model_package_group_arn,
    "ModelPackageDescription": "Model to detect credit risk",
    "ModelApprovalStatus": "PendingManualApproval",
    "Domain": "MACHINE_LEARNING",
    "Task": "CLASSIFICATION",
    "CustomerMetadataProperties": {
        "ProjectName": "CreditRiskPrediction",
        "ModelType": "XGBoost",
        "BusinessProblem": "Credit Risk Assessment"
    }
}

# Create the inference specification using the image URI read from the environment
inference_specification = {
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": training_results["s3_model_path"]
            }
        ],
        "SupportedRealtimeInferenceInstanceTypes": [
            "ml.t2.medium", 
            "ml.m5.large", 
            "ml.m5.xlarge"
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"]
    }
}

# Update the input dict with inference specification
create_model_package_input_dict.update(inference_specification)

# Create the model package
create_model_package_response = sagemaker_client.create_model_package(**create_model_package_input_dict)
model_package_arn = create_model_package_response["ModelPackageArn"]
print('ModelPackage Version ARN : {}'.format(model_package_arn))
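
The package above is registered with the status `PendingManualApproval`. A later step would typically approve it and look up the newest approved version of the group. The sketch below shows one way to do that with the `update_model_package` and `list_model_packages` calls of the SageMaker client. The helper names are hypothetical, and the functions take the client as an argument so they can be tried without touching your account.

```python
def approve_model_package(client, model_package_arn, description="Approved after review"):
    """Set the approval status of one model package version to Approved."""
    return client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
        ApprovalDescription=description,
    )


def latest_approved_package_arn(client, group_name):
    """Return the ARN of the newest approved package in a group, or None if there is none."""
    response = client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    return summaries[0]["ModelPackageArn"] if summaries else None
```

In this notebook you could call `approve_model_package(sagemaker_client, model_package_arn)` followed by `latest_approved_package_arn(sagemaker_client, model_package_group_name)`. Keeping approval as a separate, explicit step is what allows a reviewer or a CI/CD pipeline to gate deployment.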

### Conclusion and Next Steps:
- You preprocessed the data
- You trained an XGBoost model
- You saved the model artifacts to Amazon S3 and (optionally) registered the model in the model registry
- You're ready for Module 2!

In Module 2, you'll use the model you trained to make predictions.