# 🎛️ GPU Switchboard: Dynamic Resource Handover

This notebook demonstrates the **GPU-as-a-Service** capabilities of Red Hat OpenShift AI (RHOAI) 3.0, using Kueue for dynamic resource allocation.

## The Demo Scenario

Our cluster has **5 GPUs** available:
- 1x `g6.12xlarge` → 4 GPUs
- 1x `g6.4xlarge` → 1 GPU

**Baseline State (Full Saturation):**
- ✅ `mistral-3-bf16` → 4 GPUs (ON)
- ✅ `mistral-3-int4` → 1 GPU (ON)
- ⏸️ `devstral-2` → 4 GPUs (OFF, waiting)

**The Challenge:** What happens when we need to start Devstral-2 but all GPUs are in use?

**The Answer:** Kueue holds the request in its queue and admits it once enough GPU quota becomes available.
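
The saturation arithmetic can be sketched in a few lines. This only illustrates the numbers above: the `NODE_GPUS` and `REQUESTS` dictionaries and the `fits` helper are hypothetical names, hard-coded here rather than read from the cluster, and are not part of Kueue or RHOAI.

```python
# Hypothetical inventory mirroring the scenario above (not read from the cluster).
NODE_GPUS = {"g6.12xlarge": 4, "g6.4xlarge": 1}
REQUESTS = {"mistral-3-bf16": 4, "mistral-3-int4": 1, "devstral-2": 4}

def fits(running, candidate, capacity):
    """Return True if `candidate` fits alongside the `running` models."""
    used = sum(REQUESTS[m] for m in running)
    return used + REQUESTS[candidate] <= capacity

capacity = sum(NODE_GPUS.values())
baseline = ["mistral-3-bf16", "mistral-3-int4"]

print(capacity)                                          # 5
print(fits(baseline, "devstral-2", capacity))            # False: 4 + 1 + 4 > 5
print(fits(["mistral-3-int4"], "devstral-2", capacity))  # True: 1 + 4 <= 5
```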


## Setup: Import Libraries and Check Cluster State


In [None]:
import subprocess
import json
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets
import time

NAMESPACE = "private-ai"

# Model configuration
MODELS = {
    "mistral-3-bf16": {"gpus": 4, "color": "#2196F3", "desc": "BF16 Full Precision"},
    "mistral-3-int4": {"gpus": 1, "color": "#4CAF50", "desc": "INT4 Quantized"},
    "devstral-2": {"gpus": 4, "color": "#FF9800", "desc": "Agentic Model"}
}

def run_oc(args):
    """Execute oc command and return output"""
    result = subprocess.run(["oc"] + args, capture_output=True, text=True)
    return result.stdout.strip(), result.returncode

def get_model_status(name):
    """Get current minReplicas and Ready condition status for a model"""
    out, code = run_oc(["get", "inferenceservice", name, "-n", NAMESPACE, "-o", "json"])
    if code == 0 and out:
        data = json.loads(out)
        min_replicas = data.get("spec", {}).get("predictor", {}).get("minReplicas", 0)
        # Look up the "Ready" condition by type instead of assuming it is the last entry
        conditions = data.get("status", {}).get("conditions", [])
        ready = next((c.get("status", "Unknown") for c in conditions
                      if c.get("type") == "Ready"), "Unknown")
        return min_replicas, ready
    return 0, "Unknown"

def set_model_replicas(name, replicas):
    """Set minReplicas for a model"""
    patch = json.dumps({"spec": {"predictor": {"minReplicas": replicas}}})
    out, code = run_oc(["patch", "inferenceservice", name, "-n", NAMESPACE, 
                        "--type=merge", "-p", patch])
    return code == 0

print("✅ Setup complete. Connected to namespace:", NAMESPACE)


## Current Cluster Status

Let's check the current GPU allocation and model states:


In [None]:
def show_cluster_status():
    """Display current cluster status"""
    total_gpus = 5
    used_gpus = 0
    
    html = "<h3>📊 Cluster Status</h3>"
    html += "<table style='width:100%; border-collapse: collapse;'>"
    html += "<tr style='background:#f5f5f5;'><th>Model</th><th>GPUs</th><th>State</th><th>Ready</th></tr>"
    
    for name, config in MODELS.items():
        min_rep, ready = get_model_status(name)
        state = "🟢 ON" if min_rep > 0 else "⚫ OFF"
        ready_icon = "✅" if ready == "True" else ("⏳" if min_rep > 0 else "-")
        
        if min_rep > 0:
            used_gpus += config["gpus"]
        
        html += f"<tr style='border-bottom:1px solid #ddd;'>"
        html += f"<td style='padding:8px;'><b style='color:{config['color']}'>{name}</b><br>"
        html += f"<small>{config['desc']}</small></td>"
        html += f"<td style='padding:8px; text-align:center;'>{config['gpus']}</td>"
        html += f"<td style='padding:8px; text-align:center;'>{state}</td>"
        html += f"<td style='padding:8px; text-align:center;'>{ready_icon}</td>"
        html += "</tr>"
    
    html += "</table>"
    
    # GPU usage bar (counts requested GPUs, so it can exceed the quota while a model is queued)
    pct = (used_gpus / total_gpus) * 100
    bar_color = "#4CAF50" if pct < 80 else ("#FF9800" if pct < 100 else "#f44336")
    bar_width = min(pct, 100)  # clamp so the bar never overflows its container
    html += f"<h3>🔋 GPU Quota: {used_gpus}/{total_gpus}</h3>"
    html += f"<div style='background:#e0e0e0; border-radius:10px; height:30px; width:100%;'>"
    html += f"<div style='background:{bar_color}; width:{bar_width}%; height:100%; border-radius:10px; "
    html += f"text-align:center; line-height:30px; color:white; font-weight:bold;'>{pct:.0f}%</div></div>"
    
    display(HTML(html))

show_cluster_status()


## 🎛️ GPU Switchboard

Use these toggles to control which models are running. Watch how Kueue manages the GPU quota!


In [None]:
# Create toggle switches for each model
output = widgets.Output()

def create_toggle(name, config):
    min_rep, _ = get_model_status(name)
    
    toggle = widgets.ToggleButton(
        value=(min_rep > 0),
        description=f"{name} ({config['gpus']} GPU)",
        button_style='success' if min_rep > 0 else '',
        layout=widgets.Layout(width='280px', height='50px'),
        style={'font_weight': 'bold'}
    )
    
    def on_toggle(change):
        with output:
            clear_output(wait=True)
            new_replicas = 1 if change['new'] else 0
            action = "Enabling" if new_replicas else "Disabling"
            print(f"🔄 {action} {name}...")
            
            if set_model_replicas(name, new_replicas):
                toggle.button_style = 'success' if new_replicas else ''
                print(f"✅ {name} set to minReplicas={new_replicas}")
                
                # Check for queueing
                if new_replicas > 0:
                    time.sleep(3)
                    wl_out, _ = run_oc(["get", "pods", "-n", NAMESPACE, 
                                        "-l", f"serving.kserve.io/inferenceservice={name}"])
                    if "SchedulingGated" in str(wl_out) or "Pending" in str(wl_out):
                        print(f"\n⚠️  {name} is PENDING in Kueue queue!")
                        print("   → Not enough GPUs available")
                        print("   → Disable another model to free resources")
            else:
                print(f"❌ Failed to update {name}")
            
            print("\n" + "="*50)
            show_cluster_status()
    
    toggle.observe(on_toggle, names='value')
    return toggle

# Create toggles
toggles = [create_toggle(name, config) for name, config in MODELS.items()]

# Layout
header = widgets.HTML("<h2>🎛️ GPU Switchboard</h2><p>Toggle models ON/OFF to manage GPU allocation</p>")
toggle_box = widgets.VBox(toggles, layout=widgets.Layout(padding='10px'))

display(header)
display(toggle_box)
display(widgets.HTML("<hr><h3>📋 Operation Log</h3>"))
display(output)


## 📈 Kueue Workload Status

Monitor the Kueue queue to see pending workloads:
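
To pick out pending workloads programmatically instead of reading the table, one option is to request JSON output and inspect each workload's status conditions. The sketch below assumes Kueue `Workload` objects report an `Admitted` condition under `status.conditions`. The `pending_workloads` helper and the workload names are hypothetical, and the sample payload is hand-written rather than captured from a cluster.

```python
def pending_workloads(workload_list):
    """Return names of workloads that have no Admitted=True condition."""
    pending = []
    for item in workload_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        admitted = any(
            c.get("type") == "Admitted" and c.get("status") == "True"
            for c in conditions
        )
        if not admitted:
            pending.append(item["metadata"]["name"])
    return pending

# Hand-written sample shaped like the parsed output of
# `oc get workload -n <namespace> -o json` (assumed structure, hypothetical names).
sample = {
    "items": [
        {"metadata": {"name": "wl-mistral-3-int4"},
         "status": {"conditions": [{"type": "Admitted", "status": "True"}]}},
        {"metadata": {"name": "wl-devstral-2"},
         "status": {"conditions": [{"type": "QuotaReserved", "status": "False"}]}},
    ]
}

print(pending_workloads(sample))  # ['wl-devstral-2']
```

Inside this notebook, the input could come from `json.loads(out)` after calling `run_oc` with `-o json`, provided the command succeeded and returned output.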


In [None]:
def show_kueue_status():
    """Show Kueue workload queue status"""
    print(f"🔍 Kueue Workloads in {NAMESPACE} namespace:\n")
    out, _ = run_oc(["get", "workload", "-n", NAMESPACE, "-o", "wide"])
    print(out if out else "No workloads found")
    
    print("\n" + "="*70)
    print("\n🏢 ClusterQueue Status:\n")
    out, _ = run_oc(["get", "clusterqueue", "rhoai-main-queue", "-o", "wide"])
    print(out if out else "ClusterQueue not found")

show_kueue_status()


## 🎬 Demo Script: The Resource Handover

Follow these steps to demonstrate dynamic GPU allocation:

### Step 1: Verify Baseline (100% Saturation)
```
mistral-3-bf16 (4 GPU) = ON  ✅
mistral-3-int4 (1 GPU) = ON  ✅
devstral-2 (4 GPU)     = OFF ⏸️
─────────────────────────────
Total: 5/5 GPUs used (100%)
```

### Step 2: Attempt to Start Devstral-2
1. Toggle **devstral-2** to ON
2. Observe: Kueue puts it in **PENDING** state
3. Why? `4 + 1 + 4 = 9 > 5` (over quota)

### Step 3: The Handover
1. Toggle **mistral-3-bf16** to OFF
2. Watch: Kueue admits Devstral-2 **as soon as** the 4 GPUs are released (the pod still needs time to start and load the model)
3. New state: `0 + 1 + 4 = 5` (exactly at quota)
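
The handover in Steps 2 and 3 can be replayed with a toy model of quota-based admission. This is not Kueue's algorithm: it is a simplified first-in-first-out queue with a single GPU quota, and `ToyQueue` and its methods are hypothetical names used only for illustration.

```python
from collections import deque

class ToyQueue:
    """Toy single-quota admission queue (a simplification, not Kueue itself)."""

    def __init__(self, quota):
        self.quota = quota
        self.admitted = {}      # name -> GPUs held
        self.pending = deque()  # (name, gpus) waiting in FIFO order

    def used(self):
        return sum(self.admitted.values())

    def submit(self, name, gpus):
        self.pending.append((name, gpus))
        self._admit()

    def release(self, name):
        self.admitted.pop(name, None)
        self._admit()

    def _admit(self):
        # Admit from the head of the queue while the head fits in the quota.
        while self.pending and self.used() + self.pending[0][1] <= self.quota:
            name, gpus = self.pending.popleft()
            self.admitted[name] = gpus

q = ToyQueue(quota=5)
q.submit("mistral-3-bf16", 4)
q.submit("mistral-3-int4", 1)
q.submit("devstral-2", 4)    # Step 2: 4 + 1 + 4 = 9 > 5, so it waits
print(sorted(q.admitted), [n for n, _ in q.pending])
# ['mistral-3-bf16', 'mistral-3-int4'] ['devstral-2']

q.release("mistral-3-bf16")  # Step 3: frees 4 GPUs, devstral-2 is admitted
print(sorted(q.admitted), q.used())
# ['devstral-2', 'mistral-3-int4'] 5
```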

### The Message
> "This is GPU-as-a-Service. The platform dynamically manages resources, 
> ensuring fair access and preventing any single team from hoarding GPUs."


In [None]:
# Quick refresh of all states
print("🔄 Refreshing status...\n")
show_cluster_status()
print("\n")
show_kueue_status()
