# Bank Marketing Prediction - SageMaker MLflow

Predict which customers are likely to respond to a bank's direct-marketing campaign, using the SageMaker Python SDK v3 `ModelTrainer` with MLflow experiment tracking.

**Dataset**: https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

## 1. Prerequisites & Setup

In [1]:
# Quote version specifiers so the shell does not treat ">" as an output redirect
!pip install --upgrade "mlflow>=3.4.0" "sagemaker-mlflow==0.2.0" boto3 pandas scikit-learn xgboost -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.4.0 requires nvidia-ml-py3<8.0,>=7.352.0, which is not installed.
sagemaker-studio 1.1.4 requires pydynamodb>=0.7.4, which is not installed.
aiobotocore 2.22.0 requires botocore<1.37.4,>=1.37.2, but you have botocore 1.42.40 which is incompatible.
autogluon-core 1.4.0 requires scikit-learn<1.8.0,>=1.4.0, but you have scikit-learn 1.8.0 which is incompatible.
autogluon-features 1.4.0 requires scikit-learn<1.8.0,>=1.4.0, but you have scikit-learn 1.8.0 which is incompatible.
autogluon-multimodal 1.4.0 requires scikit-learn<1.8.0,>=1.4.0, but you have scikit-learn 1.8.0 which is incompatible.
autogluon-multimodal 1.4.0 requires transformers[sentencepiece]<4.50,>=4.38.0, but you have transformers 4.57.3 which is incompatible.
autogluon-tabular 1.4.0 requires scikit-learn<1.8.0,>=1.4.0, but 

In [3]:
#!pip install --upgrade pip -q
#!pip install -Uq "sagemaker==3.4" --force-reinstall

ERROR: Cannot install sagemaker-core==2.3.1 and sagemaker==3.4.0 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

In [88]:
!pip install --upgrade pip -q
!pip install -Uq "sagemaker==3.3.1" "boto3==1.42.30" "sagemaker-core==2.3.1" "sagemaker-mlops==1.3.1" "sagemaker-serve==1.4.0" --force-reinstall

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.4.0 requires nvidia-ml-py3<8.0,>=7.352.0, which is not installed.
dash 2.18.1 requires dash-core-components==2.0.0, which is not installed.
dash 2.18.1 requires dash-html-components==2.0.0, which is not installed.
dash 2.18.1 requires dash-table==5.0.0, which is not installed.
jupyter-ai 2.31.7 requires faiss-cpu!=1.8.0.post0,<2.0.0,>=1.8.0, which is not installed.
sagemaker-studio 1.1.4 requires pydynamodb>=0.7.4, which is not installed.
aiobotocore 2.22.0 requires botocore<1.37.4,>=1.37.2, but you have botocore 1.42.41 which is incompatible.
amazon-sagemaker-jupyter-ai-q-developer 1.2.8 requires numpy<=2.0.1, but you have numpy 2.4.2 which is incompatible.
amazon-sagemaker-sql-magic 0.1.4 requires numpy<2, but you have numpy 2.4.2 which is incompatible.
amazon-sagemaker-sql-magic 0.1.
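
Because pip reports resolver conflicts above, it can help to confirm which versions actually ended up installed before importing anything. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this kernel
    return versions


# Package names taken from the pip commands in this section.
pinned = ["sagemaker", "sagemaker-core", "sagemaker-mlflow", "mlflow", "boto3"]
for name, version in installed_versions(pinned).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a version differs from the pin you expected, restarting the kernel after the install cell is usually the first thing to try.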

In [90]:
import boto3
import sagemaker
import mlflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import zipfile
import os
from datetime import datetime

# SageMaker v3 imports
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.core.training.configs import SourceCode, InputData, Compute
from sagemaker.core.helper.session_helper import Session, get_execution_role
from sagemaker.core import image_uris
from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.core.resources import ModelPackage

#print(f'SageMaker: {sagemaker.__version__}, MLflow: {mlflow.__version__}')

In [9]:
# Initialize SageMaker session
sagemaker_session = Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = 'bank-marketing'

print(f'Region: {region}\nBucket: {bucket}\nRole: {role}')

Region: us-west-2
Bucket: sagemaker-us-west-2-736264693883
Role: arn:aws:iam::736264693883:role/service-role/AmazonSageMaker-ExecutionRole-20250402T133578


## 2. MLflow Configuration
Next, configure the SageMaker AI MLflow app used for experiment tracking.
In SageMaker AI Studio, open the MLflow app section and copy the ARN of `DefaultMLFlowApp`; use it to verify the ARN that the cell below looks up.

In [22]:
# TODO: Replace with your MLflow App name if needed.
mlflow_app_name = 'DefaultMLFlowApp'

# Look up the MLflow app ARN, which is used as the tracking URI
sm_client = boto3.client('sagemaker', region_name=region)
mlflow_list = sm_client.list_mlflow_apps()
# Use double quotes outside so the nested ['Summaries'] also works before Python 3.12
print(f"\n Number of MLflow apps found: {len(mlflow_list['Summaries'])}")
mlflow_app_arn = None
for mlflow_app in mlflow_list['Summaries']:
    if mlflow_app['Name'] == mlflow_app_name:
        mlflow_app_arn = mlflow_app['Arn']
        break
if mlflow_app_arn is None:
    raise ValueError(f'No MLflow app named {mlflow_app_name!r} was found')

# Set MLflow tracking
mlflow.set_tracking_uri(mlflow_app_arn)
print(f'\nMLflow ARN set: {mlflow_app_arn}')


 Number of MLflow apps found: 3

MLflow ARN set: arn:aws:sagemaker:us-west-2:736264693883:mlflow-app/app-ZLTLMHY2WXU4


In [23]:
mlflow_experiment_name = 'bank-marketing-prediction'
# set_experiment creates the experiment if it does not exist and makes it the active one
mlflow.set_experiment(mlflow_experiment_name)
print(f'MLflow app Experiment: {mlflow_experiment_name}')

MLflow app Experiment: bank-marketing-prediction


## 3. Data Preparation

In [34]:
# Download dataset
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

--2026-02-03 21:03:34--  https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘bank-additional.zip’

bank-additional.zip     [ <=>                ] 434.15K  --.-KB/s    in 0.1s    

Last-modified header missing -- time-stamps turned off.
2026-02-03 21:03:34 (3.07 MB/s) - ‘bank-additional.zip’ saved [444572]

Archive:  bank-additional.zip
  inflating: bank-additional/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/bank-additional/
  inflating: __MACOSX/bank-additional/._.DS_Store  
  inflating: bank-additional/.Rhistory  
  inflating: bank-additional/bank-additional-full.csv  
  inflating: bank-additional/bank-additional-names.txt  
  inflating: bank-additional/bank-additional.csv  
  inflating: __MACOSX/._bank-ad

In [35]:
# View dataset
data = pd.read_csv("./bank-additional/bank-additional-full.csv", sep=";")
pd.set_option("display.max_columns", 500)  # Make sure we can see all of the columns
pd.set_option("display.max_rows", 50)  # Keep the output on one page
data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,334,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,383,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,189,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,442,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [36]:
# Load data
df = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
print(f'Shape: {df.shape}')
df.head()

Shape: (41188, 21)


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [37]:
# Encode categorical features
cat_cols = [c for c in df.select_dtypes(include=['object']).columns if c != 'y']
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Encode target
df['y'] = (df['y'] == 'yes').astype(int)
print(f'✓ Encoded {len(cat_cols)} features')

✓ Encoded 10 features


In [38]:
# Split data
X, y = df.drop('y', axis=1), df['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Save locally
os.makedirs('data', exist_ok=True)
pd.concat([y_train, X_train], axis=1).to_csv('data/train.csv', index=False, header=False)
pd.concat([y_test, X_test], axis=1).to_csv('data/test.csv', index=False, header=False)
print(f'Train: {X_train.shape}, Test: {X_test.shape}')

Train: (32950, 20), Test: (8238, 20)
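
The files are written with the label in the first column and no header row, which is the layout the training script in the next section relies on when it reads them with `header=None` and takes column 0 as the target. A quick sanity check of that layout might look like the sketch below; `check_label_first_csv` is a hypothetical helper, not part of any SDK.

```python
import csv


def check_label_first_csv(path, n_features):
    """Verify every row has a 0/1 label in column 0 followed by n_features values.

    Returns the number of rows checked; raises ValueError on the first bad row.
    """
    rows = 0
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if len(row) != n_features + 1:
                raise ValueError(
                    f"line {line_no}: expected {n_features + 1} columns, got {len(row)}"
                )
            if row[0] not in ("0", "1"):
                raise ValueError(f"line {line_no}: label {row[0]!r} is not 0 or 1")
            rows += 1
    return rows
```

In this notebook it could be called as `check_label_first_csv('data/train.csv', n_features=X_train.shape[1])` before uploading to S3.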


In [39]:
# Upload to S3
train_s3 = sagemaker_session.upload_data('data/train.csv', bucket, f'{prefix}/data/train')
test_s3 = sagemaker_session.upload_data('data/test.csv', bucket, f'{prefix}/data/test')
print(f'Train S3: {train_s3}\nTest S3: {test_s3}')

Train S3: s3://sagemaker-us-west-2-736264693883/bank-marketing/data/train/train.csv
Test S3: s3://sagemaker-us-west-2-736264693883/bank-marketing/data/test/test.csv


## 4. Model Training with ModelTrainer

In [79]:
# Create training script directory
os.makedirs('scripts', exist_ok=True)

training_script = '''import argparse
import os
import json
import logging
import sys
import xgboost as xgb
import pandas as pd
import mlflow
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--eta', type=float, default=0.2)
    parser.add_argument('--gamma', type=int, default=4)
    parser.add_argument('--min_child_weight', type=int, default=6)
    parser.add_argument('--subsample', type=float, default=0.8)
    parser.add_argument('--num_round', type=int, default=100)
    return parser.parse_known_args()

if __name__ == '__main__':
    args, _ = parse_args()
    
    # Load data
    train_data = pd.read_csv('/opt/ml/input/data/train/train.csv', header=None)
    test_data = pd.read_csv('/opt/ml/input/data/test/test.csv', header=None)
    
    X_train, y_train = train_data.iloc[:, 1:], train_data.iloc[:, 0]
    X_test, y_test = test_data.iloc[:, 1:], test_data.iloc[:, 0]
    
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    # Set MLFlow specifics
    mlflow_app_arn = os.environ.get('MLFLOW_TRACKING_URI', None)
    mlflow_experiment_name = os.environ.get('MLFLOW_EXP', None)
    # MLflow setup
    mlflow.set_tracking_uri(mlflow_app_arn)
    mlflow.set_experiment(mlflow_experiment_name)
    
    # Enable autologging - captures everything automatically
    # mlflow.xgboost.autolog()
    mlflow.xgboost.autolog(
        log_input_examples=True,
        log_model_signatures=True,
        log_models=True,
        log_datasets=True,
        model_format="json",  # Recommended for portability
        registered_model_name="bank-prediction-XGBoostModel",
        extra_tags={"team": "data-science"},
    )
    
    # MLflow tracking
    with mlflow.start_run():
        params = {
            'max_depth': args.max_depth,
            'eta': args.eta,
            'gamma': args.gamma,
            'min_child_weight': args.min_child_weight,
            'subsample': args.subsample,
            'objective': 'binary:logistic',
            'eval_metric': 'auc'
        }
        
        mlflow.log_params(params)
        
        # Train
        model = xgb.train(params, dtrain, args.num_round, evals=[(dtest, 'test')])
        
        # Evaluate
        y_pred_proba = model.predict(dtest)
        y_pred = (y_pred_proba > 0.5).astype(int)
        
        metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred),
            'recall': recall_score(y_test, y_pred),
            'f1': f1_score(y_test, y_pred),
            'auc': roc_auc_score(y_test, y_pred_proba)
        }
        
        mlflow.log_metrics(metrics)
        print(f'Metrics: {metrics}')
        
        # Save model
        model_path = '/opt/ml/model'
        os.makedirs(model_path, exist_ok=True)
        model.save_model(f'{model_path}/xgboost-model')
        mlflow.xgboost.log_model(
            model, 
            name="bank-prediction-XGBoostModel"
            )
'''

with open('scripts/train.py', 'w') as f:
    f.write(training_script)
    
print('✓ Training script created')

✓ Training script created
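
The `ModelTrainer` configuration later in this section points `SourceCode` at a `requirements.txt`, and `train.py` imports `mlflow`, so that file needs to exist in `scripts/` before the job is launched. The notebook does not show how it was created; the sketch below is one way to do it, assuming the same MLflow packages and version pins as the install cell in section 1. Adjust the pins to whatever your environment needs.

```python
import os

# Assumed dependency list: mirrors the pins used in the notebook's install cell.
requirements = [
    "mlflow>=3.4.0",
    "sagemaker-mlflow==0.2.0",
]

os.makedirs("scripts", exist_ok=True)
with open(os.path.join("scripts", "requirements.txt"), "w") as f:
    f.write("\n".join(requirements) + "\n")

print("Wrote scripts/requirements.txt")
```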


In [80]:
# Get XGBoost container image
xgboost_image = image_uris.retrieve(
    framework='xgboost',
    region=region,
    version='1.7-1', #3.0-5
    py_version="py311",
    image_scope='training',
    instance_type="ml.m5.xlarge",
)
print(f'XGBoost image: {xgboost_image}')

XGBoost image: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.7-1


In [81]:
# Configure ModelTrainer with v3 API
source_code = SourceCode(
    source_dir='scripts',
    entry_script='train.py',
    requirements="requirements.txt"
)

compute = Compute(
    instance_type='ml.m5.xlarge',
    instance_count=1,
    volume_size_in_gb=30
)

hyperparameters = {
    'max_depth': 5,
    'eta': 0.2,
    'gamma': 4,
    'min_child_weight': 6,
    'subsample': 0.8,
    'num_round': 100
}

model_trainer = ModelTrainer(
    sagemaker_session=sagemaker_session,
    training_image=xgboost_image,
    source_code=source_code,
    compute=compute,
    hyperparameters=hyperparameters,
    base_job_name='bank-marketing-xgboost',
    environment={'MLFLOW_TRACKING_URI': mlflow_app_arn,
                 'MLFLOW_EXP': mlflow_experiment_name
                }
)

print('ModelTrainer configured')

ModelTrainer configured
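
The `hyperparameters` dictionary is consumed by the `argparse` block in `train.py`. Assuming each entry reaches the entry script as a `--name value` pair (which is what the script's use of `argparse` implies), the round trip can be sketched locally as follows. `to_cli_args` is a hypothetical helper for illustration only; it is not an SDK function.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten {'max_depth': 5} into ['--max_depth', '5'] (illustrative only)."""
    args = []
    for name, value in hyperparameters.items():
        args += [f"--{name}", str(value)]
    return args


def parse_args(argv):
    # Same arguments and defaults as the training script above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--gamma", type=int, default=4)
    parser.add_argument("--min_child_weight", type=int, default=6)
    parser.add_argument("--subsample", type=float, default=0.8)
    parser.add_argument("--num_round", type=int, default=100)
    # parse_known_args returns unrecognized arguments instead of failing on them.
    return parser.parse_known_args(argv)


args, unknown = parse_args(to_cli_args({"max_depth": 3, "eta": 0.1, "extra_flag": 1}))
print(args.max_depth, args.eta, args.num_round, unknown)
```

Values not supplied fall back to the script's defaults, and names the script does not declare end up in the second return value rather than raising an error.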


In [82]:
# Start training
input_data_train = InputData(channel_name='train', data_source=train_s3)
input_data_test = InputData(channel_name='test', data_source=test_s3)

model_trainer.train(
    input_data_config=[input_data_train, input_data_test],
    wait=True
)
# While this cell runs, you can follow the job in SageMaker Studio under Training jobs; its name starts with "bank-marketing-xgboost"
print(f' Training completed: {model_trainer._latest_training_job.training_job_name}')

Output()

Go to your SageMaker MLflow App and view the new experiment run logged from the training job.

## 5. Build Model with ModelBuilder

Use ModelBuilder to prepare the trained model for deployment.

In [121]:
# Get model artifacts (the S3 URI string, not the wrapper object)
training_job_name = model_trainer._latest_training_job.training_job_name
model_data_s3 = model_trainer._latest_training_job.model_artifacts.s3_model_artifacts
print(f'Model artifacts: {model_data_s3}')

Model artifacts: s3://sagemaker-us-west-2-736264693883/bank-marketing-xgboost/bank-marketing-xgboost-20260203231743/output/model.tar.gz


In [122]:
# Create schema builder for model input/output
sample_input = X_test.iloc[:1].values.tolist()
sample_output = [[0.8, 0.2]]  # Placeholder output; the deployed endpoint below returns one probability per row

schema_builder = SchemaBuilder(sample_input, sample_output)
print(f'✓ Schema builder created with sample input shape: {np.array(sample_input).shape}')

✓ Schema builder created with sample input shape: (1, 20)


In [100]:

from sagemaker.serve.spec.inference_spec import InferenceSpec
from sagemaker.serve.utils.types import ModelServer

In [117]:
# Get inference image URI
inference_image = image_uris.retrieve(
    framework='xgboost',
    region=region,
    version='1.7-1',
    image_scope='inference'
)
print(f'Inference image: {inference_image}')

Inference image: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.7-1


In [125]:
model_data_s3 = model_trainer._latest_training_job.model_artifacts.s3_model_artifacts

In [126]:
# Create ModelBuilder with trained model artifacts
model_builder = ModelBuilder(
    image_uri=inference_image,
    s3_model_data_url=model_data_s3,
    role_arn=role,
    sagemaker_session=sagemaker_session
)

print('✓ ModelBuilder configured')

✓ ModelBuilder configured


In [127]:
# Build the model
model_name = f'bank-marketing-model-{datetime.now().strftime("%Y%m%d%H%M%S")}'
built_model = model_builder.build(model_name=model_name)

print(f'✓ Model built: {built_model.model_name}')

✓ Model built: bank-marketing-model-20260204181757


In [128]:
# Deploy model to endpoint using ModelBuilder
endpoint_name = f'bank-marketing-{datetime.now().strftime("%Y%m%d%H%M%S")}'

endpoint = model_builder.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type='ml.m5.large',
    wait=True
)

print(f'✓ Endpoint deployed: {endpoint.endpoint_name}')

Output()

✓ Endpoint deployed: bank-marketing-20260204181814


In [129]:
# Prepare test data in CSV format (XGBoost expects CSV without headers)
test_sample = X_test.iloc[:5]
test_csv = test_sample.to_csv(header=False, index=False)

print('Test data (first 5 samples):')
print(test_csv[:200] + '...')

Test data (first 5 samples):
32,4,0,6,0,0,0,0,3,3,131,5,999,0,1,1.4,93.918,-42.7,4.961,5228.1
37,10,3,6,0,0,0,0,4,3,100,1,999,0,1,-2.9,92.963,-40.8,1.262,5076.2
73,5,0,5,1,2,0,0,3,2,131,2,999,0,1,-1.7,94.215,-40.3,0.81,4991.6
44,...


In [143]:
# Make predictions using endpoint.invoke()
response = endpoint.invoke(
    body=test_csv,
    content_type='text/csv'
)

# Parse predictions
predictions_raw = response.body.read().decode('utf-8')
predictions = [float(p) for p in predictions_raw.strip().split('\n')]

print('Sample Predictions:')
for i, pred in enumerate(predictions):
    print(f'  Sample {i+1}: {pred:.4f} (Class: {"Yes" if pred > 0.5 else "No"})')

Sample Predictions:
  Sample 1: 0.0005 (Class: No)
  Sample 2: 0.0963 (Class: No)
  Sample 3: 0.2998 (Class: No)
  Sample 4: 0.0002 (Class: No)
  Sample 5: 0.3543 (Class: No)
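
The parsing step above assumes one probability per line. A slightly more defensive sketch is shown below, assuming the response body is plain text with probabilities separated by newlines and/or commas; `parse_probabilities` and `to_labels` are hypothetical helper names, and the 0.5 threshold is the same one used in this notebook.

```python
def parse_probabilities(body_text):
    """Parse probabilities separated by newlines and/or commas."""
    tokens = body_text.replace(",", "\n").split()
    return [float(t) for t in tokens]


def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 ('yes') if above the threshold, else 0 ('no')."""
    return [1 if p > threshold else 0 for p in probabilities]


probs = parse_probabilities("0.0005\n0.0963\n0.2998\n")
print(to_labels(probs))
```

Keeping the threshold as a parameter makes it easy to trade precision against recall later without touching the parsing code.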


In [131]:
# Compare with actual labels
actual_labels = y_test.iloc[:5].values
print('\nPrediction vs Actual:')
for i, (pred, actual) in enumerate(zip(predictions, actual_labels)):
    pred_class = 1 if pred > 0.5 else 0
    match = '✓' if pred_class == actual else '✗'
    print(f'  {match} Sample {i+1}: Predicted={pred_class}, Actual={actual}, Probability={pred:.4f}')


Prediction vs Actual:
  ✓ Sample 1: Predicted=0, Actual=0, Probability=0.0005
  ✓ Sample 2: Predicted=0, Actual=0, Probability=0.0963
  ✓ Sample 3: Predicted=0, Actual=0, Probability=0.2998
  ✓ Sample 4: Predicted=0, Actual=0, Probability=0.0002
  ✓ Sample 5: Predicted=0, Actual=0, Probability=0.3543


In [132]:
# Summary
print('=' * 60)
print('DEPLOYMENT SUMMARY')
print('=' * 60)
print(f'Training Job: {training_job_name}')
print(f'Model Name: {built_model.model_name}')
print(f'Model Artifacts: {model_data_s3}')
print(f'Endpoint Name: {endpoint.endpoint_name}')
print(f'Endpoint ARN: {endpoint.endpoint_arn}')
print(f'MLflow Tracking URI: {mlflow_app_arn}')
print(f'MLflow Experiment: {mlflow_experiment_name}')
print('=' * 60)

DEPLOYMENT SUMMARY
Training Job: bank-marketing-xgboost-20260203231743
Model Name: bank-marketing-model-20260204181757
Model Artifacts: s3://sagemaker-us-west-2-736264693883/bank-marketing-xgboost/bank-marketing-xgboost-20260203231743/output/model.tar.gz
Endpoint Name: bank-marketing-20260204181814


## Register the model to SageMaker AI Model Registry

> Note: The trained model is already registered in MLflow. If the auto-registration flag is enabled, SageMaker automatically registers the model from MLflow in SageMaker AI Model Registry. The default MLflow app has auto-registration enabled, so a model logged to it is registered automatically. Alternatively, you can register the model directly in SageMaker AI Model Registry, as shown below.

In [145]:
# Register in a separate model package group, named after the built model with a "-manual" suffix
step_args = model_builder.register(
    model_package_group_name=f"{built_model.model_name}-manual",
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.m5.xlarge"],
    approval_status="Approved"
)

## Clean Up (Optional)
1. Delete the endpoint when you are done; it is live and incurs charges while it is running.
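
One way to do this is with the low-level boto3 SageMaker client created earlier (`sm_client`). The sketch below assumes you also want to remove the endpoint configuration that backs the endpoint; `delete_endpoint_and_config` is a hypothetical helper, while `describe_endpoint`, `delete_endpoint`, and `delete_endpoint_config` are standard boto3 SageMaker client operations.

```python
def delete_endpoint_and_config(sm_client, endpoint_name):
    """Delete an endpoint and the endpoint configuration it references.

    `sm_client` is expected to be a boto3 SageMaker client, e.g.
    boto3.client('sagemaker', region_name=region).
    """
    # Look up the config name before deleting the endpoint itself.
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Usage in the notebook (not run here):
# delete_endpoint_and_config(sm_client, endpoint_name)
```

The model created by `ModelBuilder` and the artifacts in S3 are left in place by this sketch; remove them separately if you no longer need them.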