# Stage 2: Distributed Training with Kubeflow Training SDK

This notebook covers:
- Creating a PyTorchJob with the **kubeflow-training SDK** (programmatic approach)
- Distributed training with PyTorch DDP
- Using Feast features from shared storage
- Monitoring training progress

**Prerequisites:**
- Completed Notebook 01 (Feast features ready)
- OpenShift AI cluster with Kubeflow Training Operator
- PVCs created for data and model storage

## 1. Install Dependencies

In [1]:
%pip install -q kubeflow-training==1.9.3 kubernetes yamlmagic

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from pathlib import Path

# Kubeflow Training SDK
from kubernetes import client
from kubeflow.training import TrainingClient
from kubeflow.training.models import V1Volume, V1VolumeMount, V1PersistentVolumeClaimVolumeSource

print('Imports successful')

Imports successful


## 2. Training Configuration

Define the training parameters in YAML (following the distributed-workloads pattern).

In [3]:
%load_ext yamlmagic

In [45]:
%%yaml training_parameters

# Model architecture
model_type: mlp
hidden_dims: [256, 128, 64, 32]
dropout: 0.3

# Training hyperparameters (GPU/CPU compatible)
num_epochs: 10
batch_size: 512  # Manageable on CPUs; GPUs will be underutilized at this size
learning_rate: 0.0014  # Scaled for batch_size=512 (sqrt(2) * 0.001)
weight_decay: 0.0001
early_stopping_patience: 5

# Training optimizations
use_amp: true  # Automatic Mixed Precision (NVIDIA GPU only, auto-disabled on AMD/CPU)
grad_clip_norm: 1.0

# Data configuration
data_source: direct  # Fast direct parquet loading (no Feast/Ray/PostgreSQL)
data_path: /shared/feature_repo/data
chunk_size: 15000  # Balance between I/O efficiency and memory
sample_size: null  # null = full dataset (421k rows); set an integer for a quick test
test_size: 0.2
val_size: 0.1

# Output configuration
model_output_dir: /shared/models
checkpoint_every: 2

# Distributed training
backend: auto  # Auto-detect hardware: nccl (NVIDIA GPU), gloo (AMD GPU/CPU)
seed: 42
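
The `learning_rate` comment above refers to square-root scaling of the learning rate with batch size. A minimal sketch of that arithmetic follows. The baseline of `0.001` at batch size 256 is an assumption inferred from the `sqrt(2)` factor; the notebook does not state the baseline batch size, and `sqrt_scaled_lr` is a hypothetical helper, not part of the training code.

```python
import math


def sqrt_scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Scale a learning rate by the square root of the batch-size ratio."""
    return base_lr * math.sqrt(batch_size / base_batch_size)


# Assumed baseline (not stated in the notebook): lr=0.001 at batch_size=256.
scaled = sqrt_scaled_lr(base_lr=0.001, base_batch_size=256, batch_size=512)
print(f"{scaled:.4f}")  # 0.0014
```

Doubling the batch size multiplies the rate by `sqrt(2)`, which rounds to the `0.0014` used in the configuration.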


In [46]:
# Display configuration
print('Training Configuration:\n')
for key, value in training_parameters.items():
    print(f'   {key}: {value}')

Training Configuration:

   model_type: mlp
   hidden_dims: [256, 128, 64, 32]
   dropout: 0.3
   num_epochs: 10
   batch_size: 512
   learning_rate: 0.0014
   weight_decay: 0.0001
   early_stopping_patience: 5
   use_amp: True
   grad_clip_norm: 1.0
   data_source: direct
   data_path: /shared/feature_repo/data
   chunk_size: 15000
   sample_size: None
   test_size: 0.2
   val_size: 0.1
   model_output_dir: /shared/models
   checkpoint_every: 2
   backend: auto
   seed: 42
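
The `backend: auto` setting is resolved inside the training function. The sketch below illustrates the rule stated in the configuration comment (NCCL for NVIDIA GPUs, Gloo for AMD GPUs and CPUs). `select_backend` is a hypothetical helper written for illustration; the actual logic in `torch_training` may differ.

```python
def select_backend(requested: str, cuda_available: bool, is_rocm: bool) -> str:
    """Resolve the `backend` setting to a torch.distributed backend name."""
    if requested != "auto":
        return requested  # An explicit choice always wins.
    if cuda_available and not is_rocm:
        return "nccl"  # NVIDIA GPU
    return "gloo"  # AMD GPU or CPU, per the configuration comment


# Inside a training pod the two flags could be derived from PyTorch, e.g.:
#   cuda_available = torch.cuda.is_available()
#   is_rocm = torch.version.hip is not None
print(select_backend("auto", cuda_available=True, is_rocm=False))  # nccl
```

Keeping the decision in a pure function makes it easy to check each hardware case without a GPU.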


## 3. Configure Kubeflow Training Client

In [None]:
# Option 1: In-cluster authentication (running in an OpenShift AI workbench).
# The client automatically uses the service account token.
# training_client = TrainingClient()

# Option 2: External authentication (connecting from outside the cluster).
# Comment out this block and uncomment Option 1 when running in-cluster.
token = "<your-openshift-token>"
api_server = "<your-openshift-api-server-url>"
configuration = client.Configuration()
configuration.host = api_server
configuration.api_key = {"authorization": f"Bearer {token}"}
configuration.verify_ssl = False  # Disables TLS verification; use only with test clusters
api_client = client.ApiClient(configuration)
training_client = TrainingClient(client_configuration=api_client.configuration)

print('Kubeflow Training client configured')

Kubeflow Training client configured
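
Hard-coding a token in a notebook makes it easy to leak. As a sketch, the credentials for external authentication could be read from environment variables instead. The variable names `OPENSHIFT_API_SERVER` and `OPENSHIFT_TOKEN` and the helper `load_cluster_credentials` are assumptions made for this example, not names defined by OpenShift or the SDK.

```python
import os


def load_cluster_credentials(env=os.environ):
    """Return (api_server, authorization_header) from environment variables."""
    api_server = env.get("OPENSHIFT_API_SERVER")  # Hypothetical variable name
    token = env.get("OPENSHIFT_TOKEN")  # Hypothetical variable name
    missing = [
        name
        for name, value in (
            ("OPENSHIFT_API_SERVER", api_server),
            ("OPENSHIFT_TOKEN", token),
        )
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return api_server, f"Bearer {token}"


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"OPENSHIFT_API_SERVER": "https://api.example:6443", "OPENSHIFT_TOKEN": "abc"}
print(load_cluster_credentials(demo_env)[0])  # https://api.example:6443
```

The two returned values map onto `configuration.host` and `configuration.api_key["authorization"]` in the cell above.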


## 7. Submit Distributed Training Job

Create the PyTorchJob using the kubeflow-training SDK (simplified API).

In [None]:
JOB_NAME = 'walmart-sales-forecasting'
NAMESPACE = 'kft-feast-quickstart'
TRAINING_IMAGE = 'quay.io/modh/training:py311-cuda124-torch251'

In [None]:
# Import self-contained training function (compatible with SDK's inspect.getsource())
from torch_training import training_func

job = training_client.create_job(
    job_kind="PyTorchJob",
    name=JOB_NAME,
    namespace=NAMESPACE,
    train_func=training_func,  # Self-contained function with all dependencies
    parameters=training_parameters,  # YAML config passed as dict
    
    # Distributed training configuration
    num_workers=2,  # 2 pods in total (1 master + 1 worker)
    num_procs_per_worker=1,
    
    # Resource allocation per worker
    resources_per_worker={
        "nvidia.com/gpu": 1,
        "memory": "40Gi",
        "cpu": 2,
    },
    
    base_image=TRAINING_IMAGE,
    
    # Environment variables
    env_vars={
        # Training configuration
        "PYTHONUNBUFFERED": "1",
        "NCCL_DEBUG": "INFO",
        # Ray configuration (for Feast offline store)
        "RAY_DEDUP_LOGS": "0",
    },
    
    # Package dependencies (Feast, Ray, and PostgreSQL packages are commented out;
    # the direct parquet path does not need them)
    packages_to_install=[
        # Core dependencies
        "pandas==2.2.3",
        "numpy==2.2.0",
        "pyarrow==17.0.0",
        "scikit-learn==1.6.1",
        "joblib>=1.3.0",
        # Feast with PostgreSQL and Ray support (extras ensure compatibility)
        # "feast[postgres,ray]==0.54.0",  # Includes psycopg2, sqlalchemy, ray deps
        # TorchData for streaming
        "torchdata>=0.7.0",
        # "psycopg2==2.9.11",
        # "dill>=0.4.0",
    ],
    
    # Shared PVC for Feast repo and model outputs
    volumes=[
        V1Volume(
            name="shared-storage",
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="shared-storage")
        ),
    ],
    volume_mounts=[
        V1VolumeMount(name="shared-storage", mount_path="/shared"),
    ],
)

print(f"PyTorchJob '{JOB_NAME}' submitted successfully!")


PyTorchJob 'walmart-sales-forecasting' submitted successfully!
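
The cell comment notes that the training function must be self-contained because the SDK ships its source (via `inspect.getsource()`) to the pods. In practice this means imports belong inside the function body. The skeleton below is a hypothetical illustration of that shape, not the contents of `torch_training.training_func`; it only reads the `RANK` and `WORLD_SIZE` environment variables that PyTorch distributed launchers set and returns a summary so the example can be checked locally.

```python
def example_train_func(parameters):
    """Hypothetical skeleton of a self-contained training function."""
    # Imports live inside the function: only the function's own source
    # reaches the training pods, so module-level imports would be lost.
    import os

    # RANK and WORLD_SIZE are provided to each pod for distributed training;
    # the defaults allow a single-process local run.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    epochs = int(parameters.get("num_epochs", 1))

    # A real function would initialise the process group, build the model,
    # and run the training loop here.
    return {"rank": rank, "world_size": world_size, "epochs": epochs}


print(example_train_func({"num_epochs": 10}))
```

A real training function does not need to return anything; the return value here exists only to make the sketch observable.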




## 8. Monitor Training Job

In [73]:
# Get job status
job = training_client.get_job(name=JOB_NAME, namespace=NAMESPACE, job_kind='PyTorchJob')

print('Job Status:\n')
print(f'Name: {job.metadata.name}')
print(f'Namespace: {job.metadata.namespace}')
print(f'Creation: {job.metadata.creation_timestamp}')

if job.status:
    print('\nConditions:')
    if job.status.conditions:
        for condition in job.status.conditions:
            print(f'  {condition.type}: {condition.status}')
            if condition.message:
                print(f'    Message: {condition.message}')

Job Status:

Name: walmart-sales-forecasting
Namespace: kft-feast-quickstart
Creation: 2025-10-19 15:57:04+00:00

Conditions:
  Created: True
    Message: PyTorchJob walmart-sales-forecasting is created.
  Running: True
    Message: PyTorchJob walmart-sales-forecasting is running.
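
The condition list can be reduced to a single state for quick checks. This is a minimal sketch that assumes conditions are appended in chronological order and that superseded conditions have their status set to something other than `'True'`. `current_state` is a hypothetical helper, and `SimpleNamespace` objects stand in for the SDK's condition objects.

```python
from types import SimpleNamespace


def current_state(conditions):
    """Return the type of the last condition whose status is 'True'."""
    active = [c.type for c in (conditions or []) if c.status == "True"]
    return active[-1] if active else "Unknown"


# Stand-ins mirroring the output shown above.
demo_conditions = [
    SimpleNamespace(type="Created", status="True"),
    SimpleNamespace(type="Running", status="True"),
]
print(current_state(demo_conditions))  # Running
```

With a live job, the same helper would be called as `current_state(job.status.conditions)`.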


In [74]:
# Stream logs (following distributed-workloads pattern)
print('Master Pod Logs (streaming):\n')
print('='*70)

# Use kubeflow-training SDK to follow logs
training_client.get_job_logs(
    name=JOB_NAME,
    namespace=NAMESPACE,
    job_kind='PyTorchJob',
    follow=True)

Master Pod Logs (streaming):

[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,057 INFO     STARTING DISTRIBUTED TRAINING
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,058 INFO     Config: epochs=10, batch=512, lr=0.0014
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,058 INFO     Data source: direct
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,058 INFO     Data path: /shared/feature_repo/data
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,058 INFO     Output dir: /shared/models
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,058 INFO     Train/Val split: 90%/10%
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,224 INFO     Detected NVIDIA GPU: NVIDIA A100-SXM4-80GB (count: 1)
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,224 INFO     Auto-selected backend: nccl for device type: cuda
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:57:33,226 INFO     [Global Rank



[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:07,220 INFO     [Rank 0/1] Epoch 7 | Batch 250 | Loss: 0.1102
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:07,633 INFO     [Rank 0/1] Epoch 7 | Batch 300 | Loss: 0.0568
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:08,013 INFO     [Rank 0/1] Epoch 7 | Batch 350 | Loss: 0.0500
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:08,264 INFO     [Rank 0] Epoch 7 training complete, 381 batches processed
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:08,264 INFO     [Rank 0] Training loss aggregated: 0.0522
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:08,264 INFO     [Rank 0] Starting validation...
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:08,296 INFO     [Rank 0] Validation batch 0
[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:08,411 INFO     [Rank 0] Validation complete, 30 batches processed
[Pod walmart-sales-forecasting-master-0]: 2025-10-1
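
The batch-level lines in these logs follow a regular `Epoch N | Batch M | Loss: X` pattern, so they can be turned into numbers for plotting or alerting. The sketch below parses captured log lines with a regular expression; `parse_batch_losses` is a hypothetical helper, and the pattern is inferred from the lines shown above rather than from a documented log format.

```python
import re

# Pattern inferred from the log lines above; adjust if the format changes.
LOSS_PATTERN = re.compile(r"Epoch (\d+) \| Batch (\d+) \| Loss: ([0-9.]+)")


def parse_batch_losses(lines):
    """Extract (epoch, batch, loss) tuples from training log lines."""
    records = []
    for line in lines:
        match = LOSS_PATTERN.search(line)
        if match:
            epoch, batch, loss = match.groups()
            records.append((int(epoch), int(batch), float(loss)))
    return records


sample = [
    "[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:07,220 INFO     [Rank 0/1] Epoch 7 | Batch 250 | Loss: 0.1102",
    "[Pod walmart-sales-forecasting-master-0]: 2025-10-19 15:58:08,264 INFO     [Rank 0] Starting validation...",
]
print(parse_batch_losses(sample))  # [(7, 250, 0.1102)]
```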


## 9. Wait for Completion (Optional)

In [75]:
import time

print('Waiting for training to complete...')
print('(Training takes ~40-60 min with GPUs, or 2-3 hours with CPUs)\n')

timeout = 7200  # 2 hours
check_interval = 60  # Seconds between status checks
start_time = time.time()  # Duration is measured from this cell's start, not from job creation

while True:
    job = training_client.get_job(name=JOB_NAME, namespace=NAMESPACE, job_kind='PyTorchJob')
    conditions = (job.status.conditions if job.status else None) or []
    # Map each condition that is currently true to its object, e.g. {'Running': ...}
    active = {c.type: c for c in conditions if c.status == 'True'}
    elapsed = int(time.time() - start_time)

    if 'Succeeded' in active:
        print('\n Training completed successfully!')
        print(f'   Duration: {elapsed // 60} minutes')
        break
    if 'Failed' in active:
        print(f"\n Training failed: {active['Failed'].message}")
        break
    if elapsed > timeout:
        # The timeout applies whether or not the job has reported any conditions yet
        print(f'\n  Timeout reached ({timeout}s)')
        break

    if conditions:
        print(f' Training in progress... ({elapsed // 60} min elapsed)')
    else:
        print(' Job starting...')
    time.sleep(check_interval)

Waiting for training to complete...
(Training takes ~40-60 min with GPUs, or 2-3 hours with CPUs)


 Training completed successfully!
   Duration: 0 minutes
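
The waiting logic can also be separated from the cluster calls so that it can be exercised without a running job. The following is a sketch under that assumption: `wait_until_done` is a hypothetical helper that takes any callable returning a state string, plus injectable `sleep` and `clock` functions for testing.

```python
import time


def wait_until_done(get_state, timeout_s, interval_s, sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it reports a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in ("Succeeded", "Failed"):
            return state
        if clock() >= deadline:
            return "Timeout"
        sleep(interval_s)


# Demonstration with a scripted sequence of states and a no-op sleep.
states = iter(["Created", "Running", "Succeeded"])
print(wait_until_done(lambda: next(states), timeout_s=10, interval_s=0, sleep=lambda s: None))  # Succeeded
```

Against a live cluster, `get_state` would wrap `training_client.get_job(...)` and reduce the job's conditions to a single state string.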


## 10. Cleanup (Optional)

In [76]:
training_client.delete_job(name=JOB_NAME, namespace=NAMESPACE, job_kind='PyTorchJob')



## 11. Summary

### Completed:
✅ Configured kubeflow-training SDK client  
✅ Copied feature data to shared PVC  
✅ Created PyTorchJob programmatically (not YAML)  
✅ Submitted distributed training (`num_workers=2`: 1 master + 1 worker)  
✅ Monitored with `get_job_logs()` SDK method

### Key Patterns (from distributed-workloads):

**1. YAML Configuration:**
```python
%%yaml training_parameters
model_type: mlp
num_epochs: 10
```

**2. SDK Job Creation:**
```python
training_client.create_job(
    job_kind="PyTorchJob",
    name=JOB_NAME,
    namespace=NAMESPACE,
    train_func=training_func,
    parameters=training_parameters,
    num_workers=2,
)
```

**3. Log Streaming:**
```python
training_client.get_job_logs(
    name=JOB_NAME,
    follow=True
)
```

**4. Shared Storage:**
- PVCs for feature_repo and models
- Feast data accessible to all workers
- Model checkpoints saved to shared storage
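
Because every pod mounts the same PVC, only one process should write each checkpoint. A common convention, assumed here rather than taken from `torch_training`, is that rank 0 writes every `checkpoint_every` epochs with epochs counted from 1. `should_save_checkpoint` is a hypothetical helper illustrating that rule.

```python
def should_save_checkpoint(epoch: int, checkpoint_every: int, rank: int) -> bool:
    """Return True when this process should write a checkpoint for this epoch."""
    if rank != 0:
        return False  # Avoid several pods writing the same file on the shared PVC.
    return checkpoint_every > 0 and epoch % checkpoint_every == 0


# With num_epochs=10 and checkpoint_every=2 (values from the configuration):
saved_epochs = [e for e in range(1, 11) if should_save_checkpoint(e, 2, rank=0)]
print(saved_epochs)  # [2, 4, 6, 8, 10]
```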

### Next Steps:
Proceed to **Notebook 03** for model evaluation and inference