# Getting Started with Kubernetes on AMD GPUs - Interactive Tutorial

**Target Audience**: Infrastructure administrators and DevOps teams exploring AMD GPUs for production Kubernetes workloads

This notebook provides hands-on experience with deploying and managing AI inference workloads on Kubernetes clusters with AMD GPUs.

## Prerequisites
- Kubernetes cluster with AMD GPU support
- AMD GPU Operator installed
- kubectl configured to access your cluster
- vLLM inference service deployed

---

## 🚀 Section 1: Environment Setup and Verification

Let's start by verifying our Kubernetes cluster and AMD GPU setup.

In [None]:
import subprocess
import json
import time
import requests
from IPython.display import display, HTML, Markdown
import pandas as pd

def run_kubectl(command):
    """Helper function to run kubectl commands and return output"""
    try:
        result = subprocess.run(
            f"kubectl {command}", 
            shell=True, 
            capture_output=True, 
            text=True, 
            check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        return f"Error: {e.stderr.strip()}"

def run_command(command):
    """Helper function to run any shell command"""
    try:
        result = subprocess.run(command, shell=True, capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        return f"Error: {e.stderr.strip()}"

print("✅ Helper functions loaded")
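
The helpers above report failures as a string that starts with `Error:`, so callers have to inspect the text to tell a failure from real output. Below is a minimal sketch of an alternative that returns the exit code alongside the captured output and passes an argument list instead of using `shell=True`. The names `CommandResult` and `run_args` are hypothetical and not part of this notebook; the exit codes 127 and 124 are borrowed shell conventions, chosen here as an assumption.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CommandResult:
    """Outcome of one command: exit code plus captured streams."""
    returncode: int
    stdout: str
    stderr: str

    @property
    def ok(self) -> bool:
        return self.returncode == 0


def run_args(args, timeout=60):
    """Run a command given as a list of arguments, without a shell."""
    try:
        proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError as exc:
        # The executable (for example kubectl) is not on PATH.
        return CommandResult(127, "", str(exc))
    except subprocess.TimeoutExpired:
        return CommandResult(124, "", f"timed out after {timeout}s")
    return CommandResult(proc.returncode, proc.stdout.strip(), proc.stderr.strip())


# Demonstrated with the Python interpreter so the cell runs without a cluster;
# in this notebook the list would be something like ["kubectl", "get", "nodes"].
demo = run_args([sys.executable, "-c", "print('hello')"])
print(demo.ok, demo.stdout)
```

Because no shell is involved, pipelines such as `| grep` or `| tail` would need to be replaced by filtering in Python.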

In [None]:
# Check Kubernetes cluster information
print("🔍 Kubernetes Cluster Information")
print("=" * 50)

cluster_info = run_kubectl("cluster-info")
print(cluster_info)

print("\n📊 Node Status:")
nodes = run_kubectl("get nodes -o wide")
print(nodes)

In [None]:
# Check AMD GPU Operator installation
print("🎯 AMD GPU Operator Status")
print("=" * 40)

# Check if AMD GPU Operator namespace exists
gpu_ns = run_kubectl("get namespace kube-amd-gpu")
print(f"GPU Operator Namespace: {gpu_ns}")

# Check GPU operator pods
print("\n🔧 GPU Operator Pods:")
gpu_pods = run_kubectl("get pods -n kube-amd-gpu")
print(gpu_pods)

# Check node labels for AMD GPUs
print("\n🏷️ Node GPU Labels:")
node_labels = run_kubectl("get nodes -L feature.node.kubernetes.io/amd-gpu")
print(node_labels)

In [None]:
# Check GPU resources availability
print("💾 GPU Resources on Nodes")
print("=" * 35)

# Raw string: "\." is a kubectl escape for the dot in amd.com, not a Python escape
gpu_resources = run_kubectl(r'get nodes -o custom-columns=NAME:.metadata.name,"Total GPUs:.status.capacity.amd\.com/gpu","Allocatable GPUs:.status.allocatable.amd\.com/gpu"')
print(gpu_resources)

# Check for any running GPU workloads
print("\n🏃 Current GPU Workloads:")
gpu_workloads = run_kubectl(r'get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,"GPU_REQUESTS:.spec.containers[*].resources.requests.amd\.com/gpu"')
print(gpu_workloads)
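
The `custom-columns` output above is convenient to read but awkward to process further. As a sketch of a more programmatic route, the function below summarizes GPU capacity from the JSON that `kubectl get nodes -o json` returns. The function name and the sample document are hypothetical; only the `amd.com/gpu` resource name comes from the commands above.

```python
import json


def gpu_capacity_by_node(nodes_doc, resource="amd.com/gpu"):
    """Map node name -> (capacity, allocatable) for one extended resource."""
    summary = {}
    for node in nodes_doc.get("items", []):
        status = node.get("status", {})
        # Quantities arrive as strings; nodes without GPUs omit the key.
        capacity = int(status.get("capacity", {}).get(resource, "0"))
        allocatable = int(status.get("allocatable", {}).get(resource, "0"))
        summary[node["metadata"]["name"]] = (capacity, allocatable)
    return summary


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample_nodes = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-1"},
   "status": {"capacity": {"amd.com/gpu": "8"},
              "allocatable": {"amd.com/gpu": "8"}}},
  {"metadata": {"name": "cpu-node-1"},
   "status": {"capacity": {"cpu": "32"},
              "allocatable": {"cpu": "32"}}}
]}
""")

for name, (cap, alloc) in gpu_capacity_by_node(sample_nodes).items():
    print(f"{name}: capacity={cap} allocatable={alloc}")
```

Against a live cluster, the input could come from `json.loads(run_kubectl("get nodes -o json"))`.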

## 🤖 Section 2: Deploy and Test vLLM AI Inference

Now let's check the vLLM inference deployment listed in the prerequisites and send test requests to it on our AMD GPU-enabled Kubernetes cluster.

In [None]:
# Check if vLLM deployment exists
print("🔍 vLLM Deployment Status")
print("=" * 30)

vllm_deployment = run_kubectl("get deployment vllm-inference")
print(f"Deployment Status: {vllm_deployment}")

# Check vLLM pods
print("\n📦 vLLM Pods:")
vllm_pods = run_kubectl("get pods -l app=vllm-inference")
print(vllm_pods)

# Check vLLM service
print("\n🌐 vLLM Service:")
vllm_service = run_kubectl("get service vllm-service")
print(vllm_service)

In [None]:
# Get service endpoint for API testing
def get_vllm_endpoint():
    """Get the vLLM service endpoint"""
    try:
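        # Try to get LoadBalancer external IP, plus the port the service exposes
        external_ip = run_kubectl("get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'")
        if external_ip and external_ip != "null" and "Error" not in external_ip:
            service_port = run_kubectl("get service vllm-service -o jsonpath='{.spec.ports[0].port}'")
            return f"http://{external_ip}:{service_port}" if service_port.isdigit() else f"http://{external_ip}"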
        
        # Fallback to NodePort
        node_ip = run_kubectl("get nodes -o jsonpath='{.items[0].status.addresses[?(@.type==\"InternalIP\")].address}'")
        node_port = run_kubectl("get service vllm-service -o jsonpath='{.spec.ports[0].nodePort}'")
        if node_ip and node_port and "Error" not in node_ip and "Error" not in node_port:
            return f"http://{node_ip}:{node_port}"
        
        # Fallback to port-forward (we'll indicate this)
        return "port-forward"
    except Exception:
        return "port-forward"

endpoint = get_vllm_endpoint()
print(f"🌍 vLLM Service Endpoint: {endpoint}")

if endpoint == "port-forward":
    print("\n⚠️ No external access detected. Use port-forward for testing:")
    print("   kubectl port-forward service/vllm-service 8000:8000")
    endpoint = "http://localhost:8000"
    print(f"   Then use: {endpoint}")
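
The discovery logic above only looks at `.ingress[0].ip`, but some load balancer providers publish a hostname instead of an IP. The sketch below shows the same LoadBalancer-then-NodePort decision made on a Service object parsed from `kubectl get service vllm-service -o json`. The function name and the sample services are hypothetical.

```python
def endpoint_from_service(service, node_ip=None):
    """Return an http:// URL for a Service dict, or None if it is not
    reachable from outside the cluster."""
    ports = service.get("spec", {}).get("ports", [])
    if not ports:
        return None
    first_port = ports[0]

    # LoadBalancer: providers publish either an IP or a hostname.
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if ingress:
        host = ingress[0].get("ip") or ingress[0].get("hostname")
        if host:
            return f"http://{host}:{first_port['port']}"

    # NodePort: reachable on a node IP at the allocated nodePort.
    if node_ip and first_port.get("nodePort"):
        return f"http://{node_ip}:{first_port['nodePort']}"

    return None


# Hypothetical samples shaped like `kubectl get service -o json` output.
lb_service = {
    "spec": {"type": "LoadBalancer", "ports": [{"port": 8000, "nodePort": 31234}]},
    "status": {"loadBalancer": {"ingress": [{"hostname": "lb.example.com"}]}},
}
cluster_ip_service = {
    "spec": {"type": "ClusterIP", "ports": [{"port": 8000}]},
    "status": {"loadBalancer": {}},
}

print(endpoint_from_service(lb_service))
print(endpoint_from_service(cluster_ip_service, node_ip="10.0.0.5"))
```

A `None` result corresponds to the port-forward fallback used in the cell above.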

In [None]:
# Test vLLM API health endpoint
def test_vllm_health(endpoint_url):
    """Test vLLM health endpoint"""
    try:
        response = requests.get(f"{endpoint_url}/health", timeout=10)
        if response.status_code == 200:
            return "✅ Healthy", response.text
        else:
            return f"❌ Status: {response.status_code}", response.text
    except requests.exceptions.RequestException as e:
        return "❌ Connection Failed", str(e)

print("🏥 Testing vLLM Health Endpoint")
print("=" * 35)

if endpoint != "port-forward":
    status, response = test_vllm_health(endpoint)
    print(f"Health Status: {status}")
    print(f"Response: {response}")
else:
    print("⚠️ Cannot test health endpoint without port-forward setup.")
    print("Run the port-forward command above and then retry this cell.")
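
A single health probe can fail simply because the server is still loading the model. The sketch below retries a check a fixed number of times before giving up. `wait_until` is a hypothetical helper, and the demo uses a stand-in check so the cell runs without a cluster.

```python
import time


def wait_until(check, attempts=30, delay=2.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out.

    Returns the 1-based attempt number that succeeded, or None on timeout.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return attempt
        except Exception:
            # Treat any error (e.g. connection refused) as "not ready yet".
            pass
        if attempt < attempts:
            sleep(delay)
    return None


# Stand-in for a real probe such as:
#   lambda: requests.get(f"{endpoint}/health", timeout=5).status_code == 200
responses = iter([False, False, True])
print(wait_until(lambda: next(responses), attempts=5, delay=0))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.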

In [None]:
# Test vLLM API with a simple completion request
def test_vllm_completion(endpoint_url, prompt, max_tokens=50):
    """Test vLLM completion endpoint"""
    try:
        payload = {
            # Must match the model name the vLLM server was started with
            "model": "meta-llama/Llama-3.2-1B-Instruct",
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.7
        }
        
        response = requests.post(
            f"{endpoint_url}/v1/completions",
            json=payload,
            headers={"Content-Type": "application/json"},
            timeout=30
        )
        
        if response.status_code == 200:
            return "✅ Success", response.json()
        else:
            return f"❌ Status: {response.status_code}", response.text
            
    except requests.exceptions.RequestException as e:
        return "❌ Request Failed", str(e)

print("🧠 Testing vLLM AI Completion")
print("=" * 35)

test_prompt = "The benefits of using Kubernetes for AI workloads include:"

if endpoint != "port-forward":
    print(f"📝 Prompt: {test_prompt}")
    print("\n🔄 Generating response...")
    
    status, response = test_vllm_completion(endpoint, test_prompt, max_tokens=100)
    print(f"\nStatus: {status}")
    
    if "Success" in status:
        completion = response['choices'][0]['text']
        print(f"\n🤖 AI Response: {completion}")
        print(f"\n📊 Usage: {response.get('usage', 'N/A')}")
    else:
        print(f"Error: {response}")
else:
    print("⚠️ Cannot test completion endpoint without port-forward setup.")
    print("Run the port-forward command and ensure the service is accessible.")
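
The completion request only succeeds if the `model` field matches a model the server is actually serving. vLLM's OpenAI-compatible server lists served models at `/v1/models`, so the name can be looked up rather than hard-coded. The sketch below covers only the parsing, on hypothetical sample responses; the helper names and the model ID are illustrative and not taken from this cluster.

```python
def served_model_ids(models_response):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


def first_completion_text(completion_response):
    """Return the first choice's text, or "" if there are no choices."""
    choices = completion_response.get("choices") or []
    return choices[0].get("text", "") if choices else ""


# Hypothetical response bodies for illustration only.
sample_models = {
    "object": "list",
    "data": [{"id": "example-org/example-model", "object": "model"}],
}
sample_completion = {
    "choices": [{"index": 0, "text": " portability and scaling."}],
    "usage": {"total_tokens": 18},
}

print(served_model_ids(sample_models))
print(repr(first_completion_text(sample_completion)))
print(repr(first_completion_text({"choices": []})))
```

With a reachable endpoint, the first function could be fed `requests.get(f"{endpoint}/v1/models", timeout=10).json()`.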

## 📈 Section 3: Scaling and Monitoring GPU Workloads

Explore Kubernetes' scaling capabilities with GPU workloads and monitor resource usage.

In [None]:
# Check current deployment scale
print("📊 Current Deployment Scale")
print("=" * 35)

current_replicas = run_kubectl("get deployment vllm-inference -o jsonpath='{.spec.replicas}'")
ready_replicas = run_kubectl("get deployment vllm-inference -o jsonpath='{.status.readyReplicas}'")

print(f"Desired Replicas: {current_replicas}")
print(f"Ready Replicas: {ready_replicas}")

# Show detailed deployment status
print("\n📋 Deployment Details:")
deployment_status = run_kubectl("describe deployment vllm-inference")
# Show only the relevant parts
lines = deployment_status.split('\n')
for line in lines[:15]:  # First 15 lines usually contain the key info
    print(line)

In [None]:
# Demonstrate scaling the deployment
print("🚀 Scaling vLLM Deployment")
print("=" * 30)

# Scale to 2 replicas (if we have enough GPUs)
print("📈 Scaling to 2 replicas...")
scale_result = run_kubectl("scale deployment vllm-inference --replicas=2")
print(f"Scale command result: {scale_result}")

# Wait a moment and check status
print("\n⏳ Waiting for scaling to take effect...")
time.sleep(10)

# Check new status
new_status = run_kubectl("get deployment vllm-inference")
print("\n📊 Updated Deployment Status:")
print(new_status)

# Show pods
print("\n📦 Pod Status:")
pod_status = run_kubectl("get pods -l app=vllm-inference")
print(pod_status)
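
A fixed `time.sleep` may be too short for a model server to become ready, and a new replica stays `Pending` if no GPU is free. One alternative is `kubectl rollout status deployment/vllm-inference --timeout=300s`, which blocks until the rollout finishes or times out. The sketch below performs a comparable check on the Deployment object from `kubectl get deployment vllm-inference -o json`. The function name and the sample objects are hypothetical.

```python
def rollout_complete(deployment):
    """True when every desired replica is updated, ready, and the
    controller has observed the latest spec."""
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)
    generation = deployment.get("metadata", {}).get("generation", 0)
    return (
        status.get("observedGeneration", 0) >= generation
        and status.get("updatedReplicas", 0) == desired
        and status.get("readyReplicas", 0) == desired
        and status.get("replicas", 0) == desired
    )


# Hypothetical snapshots taken during and after scaling to 2 replicas.
scaling_up = {
    "metadata": {"generation": 3},
    "spec": {"replicas": 2},
    "status": {"observedGeneration": 3, "replicas": 2,
               "updatedReplicas": 2, "readyReplicas": 1},
}
scaled = {
    "metadata": {"generation": 3},
    "spec": {"replicas": 2},
    "status": {"observedGeneration": 3, "replicas": 2,
               "updatedReplicas": 2, "readyReplicas": 2},
}

print(rollout_complete(scaling_up), rollout_complete(scaled))
```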

In [None]:
# Monitor GPU resource usage
print("💾 GPU Resource Monitoring")
print("=" * 30)

# Check GPU allocation across nodes
print("🖥️ GPU Resources per Node:")
# Raw string: "\." is a kubectl escape for the dot in amd.com, not a Python escape
gpu_allocation = run_kubectl(r'get nodes -o custom-columns=NAME:.metadata.name,"TOTAL_GPU:.status.capacity.amd\.com/gpu","ALLOCATABLE_GPU:.status.allocatable.amd\.com/gpu"')
print(gpu_allocation)

# Check which pods are using GPUs
print("\n🎯 GPU Usage by Pods:")
gpu_pods = run_kubectl(r'get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,"GPU_REQUEST:.spec.containers[*].resources.requests.amd\.com/gpu","GPU_LIMIT:.spec.containers[*].resources.limits.amd\.com/gpu"')
print(gpu_pods)

# Show resource usage if available
print("\n📈 Node Resource Summary:")
# `describe nodes` already covers every node, so a single call is enough
node_info = run_kubectl('describe nodes | grep -A 10 "Allocated resources" || echo "No detailed resource info available"')
if "No detailed" not in node_info:
    print(node_info[:500])  # Limit output
else:
    print(node_info)
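
Note that a node's allocatable GPU count does not decrease as pods are scheduled: it is capacity minus any reserved resources. To see how many GPUs are actually in use, the requests of the pods on each node have to be summed. The sketch below does that on the JSON from `kubectl get pods --all-namespaces -o json`. The function name and the sample document are hypothetical.

```python
from collections import Counter


def gpu_requests_by_node(pods_doc, resource="amd.com/gpu"):
    """Sum GPU requests of scheduled, non-terminated pods per node."""
    totals = Counter()
    for pod in pods_doc.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        phase = pod.get("status", {}).get("phase")
        if not node or phase in ("Succeeded", "Failed"):
            continue  # unscheduled or finished pods hold no GPUs
        for container in pod["spec"].get("containers", []):
            resources = container.get("resources", {})
            # Fall back to limits in case requests are absent from the object.
            amount = (resources.get("requests", {}).get(resource)
                      or resources.get("limits", {}).get(resource, "0"))
            if int(amount) > 0:
                totals[node] += int(amount)
    return dict(totals)


# Hypothetical sample: two running GPU pods, one pending, one finished.
sample_pods = {"items": [
    {"spec": {"nodeName": "gpu-node-1",
              "containers": [{"resources": {"requests": {"amd.com/gpu": "1"}}}]},
     "status": {"phase": "Running"}},
    {"spec": {"nodeName": "gpu-node-1",
              "containers": [{"resources": {"limits": {"amd.com/gpu": "1"}}}]},
     "status": {"phase": "Running"}},
    {"spec": {"containers": [{"resources": {"requests": {"amd.com/gpu": "1"}}}]},
     "status": {"phase": "Pending"}},
    {"spec": {"nodeName": "gpu-node-2",
              "containers": [{"resources": {"requests": {"amd.com/gpu": "4"}}}]},
     "status": {"phase": "Succeeded"}},
]}

print(gpu_requests_by_node(sample_pods))
```

Subtracting these totals from each node's allocatable count gives the GPUs still free for scheduling.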

In [None]:
# Scale back to 1 replica for resource efficiency
print("📉 Scaling Back to 1 Replica")
print("=" * 35)

scale_down = run_kubectl("scale deployment vllm-inference --replicas=1")
print(f"Scale down result: {scale_down}")

# Wait and verify
time.sleep(5)
final_status = run_kubectl("get deployment vllm-inference")
print("\n📊 Final Deployment Status:")
print(final_status)

print("\n✅ Scaling demonstration completed!")

## 🔧 Section 4: Advanced Operations and Troubleshooting

Learn essential commands for managing GPU workloads in production environments.

In [None]:
# Troubleshooting commands and information gathering
print("🔍 Essential Troubleshooting Commands")
print("=" * 45)

# 1. Check events for any issues
print("1️⃣ Recent Cluster Events:")
events = run_kubectl("get events --sort-by=.metadata.creationTimestamp | tail -10")
print(events)

print("\n" + "="*50)

# 2. Check logs from vLLM pods
print("2️⃣ vLLM Pod Logs (last 10 lines):")
vllm_logs = run_kubectl("logs -l app=vllm-inference --tail=10")
print(vllm_logs)

print("\n" + "="*50)

# 3. Check GPU operator logs
print("3️⃣ GPU Operator Logs (last 5 lines):")
gpu_operator_logs = run_kubectl("logs -n kube-amd-gpu -l app.kubernetes.io/name=gpu-operator-charts --tail=5")
print(gpu_operator_logs)
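
When a GPU pod will not start, the `Warning` events are usually the quickest lead, and `kubectl get events --field-selector type=Warning` filters for them on the command line. The sketch below does the same filtering on the JSON from `kubectl get events -o json` and formats one line per event. The function name and the sample events, including their messages, are hypothetical.

```python
def warning_events(events_doc, limit=10):
    """Return up to `limit` most recent Warning events as one-line summaries."""
    warnings = [e for e in events_doc.get("items", []) if e.get("type") == "Warning"]
    # ISO 8601 timestamps sort correctly as plain strings.
    warnings.sort(
        key=lambda e: e.get("lastTimestamp")
        or e.get("metadata", {}).get("creationTimestamp")
        or ""
    )
    lines = []
    for event in warnings[-limit:]:
        obj = event.get("involvedObject", {})
        lines.append(
            f"{obj.get('kind', '?')}/{obj.get('name', '?')}: "
            f"{event.get('reason', '')} - {event.get('message', '')}"
        )
    return lines


# Hypothetical sample shaped like `kubectl get events -o json` output.
sample_events = {"items": [
    {"type": "Normal", "reason": "Scheduled",
     "message": "Successfully assigned default/vllm-a to gpu-node-1",
     "lastTimestamp": "2025-01-01T10:00:00Z",
     "involvedObject": {"kind": "Pod", "name": "vllm-a"}},
    {"type": "Warning", "reason": "FailedScheduling",
     "message": "0/2 nodes are available: 2 Insufficient amd.com/gpu.",
     "lastTimestamp": "2025-01-01T10:05:00Z",
     "involvedObject": {"kind": "Pod", "name": "vllm-b"}},
]}

for line in warning_events(sample_events):
    print(line)
```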

In [None]:
# Performance and resource monitoring
print("📊 Performance Monitoring Commands")
print("=" * 40)

# Check if metrics are available
print("1️⃣ Checking GPU Metrics Availability:")
metrics_service = run_kubectl("get service -n kube-amd-gpu | grep metrics || echo 'No metrics service found'")
print(metrics_service)

# Try to access metrics if available
if "metrics" in metrics_service and "No metrics" not in metrics_service:
    print("\n2️⃣ GPU Metrics Endpoint:")
    node_ip = run_kubectl("get nodes -o jsonpath='{.items[0].status.addresses[?(@.type==\"InternalIP\")].address}'")
    print(f"📈 Metrics available at: http://{node_ip}:32500/metrics")
    print("   Use this endpoint with Prometheus for monitoring")
else:
    print("\n⚠️ GPU metrics exporter not found or not configured")

print("\n3️⃣ Essential Monitoring Commands:")
monitoring_commands = [
    "kubectl top nodes",
    "kubectl top pods",
    "kubectl get pods -o wide",
    "kubectl describe node <node-name>",
    "kubectl get events --sort-by=.metadata.creationTimestamp"
]

for cmd in monitoring_commands:
    print(f"   • {cmd}")
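
A metrics endpoint meant for Prometheus serves plain text, so it can also be inspected directly from the notebook. The sketch below parses a simplified subset of the Prometheus text format: it assumes no timestamps after the value and no commas or spaces inside label values. The metric names in the sample are made up for illustration and are not the names the AMD metrics exporter actually publishes.

```python
def parse_prometheus_text(text):
    """Parse simple Prometheus text lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = {}
        if label_part:
            for pair in label_part.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical metrics text; real metric names will differ.
sample_metrics = """\
# HELP example_gpu_utilization Hypothetical metric for illustration
# TYPE example_gpu_utilization gauge
example_gpu_utilization{gpu_id="0",node="gpu-node-1"} 37
example_gpu_utilization{gpu_id="1",node="gpu-node-1"} 81.5
example_scrape_ok 1
"""

for name, labels, value in parse_prometheus_text(sample_metrics):
    print(name, labels, value)
```

For anything beyond a quick look, a Prometheus server scraping the endpoint remains the more robust option.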

In [None]:
# Generate a summary report
print("📋 Cluster Summary Report")
print("=" * 30)

# Collect all key information
summary_data = {
    "Cluster Status": run_kubectl("get nodes --no-headers | wc -l").strip() + " nodes",
    "AMD GPU Nodes": run_kubectl("get nodes -l feature.node.kubernetes.io/amd-gpu=true --no-headers | wc -l").strip(),
    "GPU Operator Status": "Installed" if "kube-amd-gpu" in run_kubectl("get namespaces") else "Not Installed",
    "vLLM Deployment": "Running" if "vllm-inference" in run_kubectl("get deployments") else "Not Found",
    "Total GPU Resources": run_kubectl(r'get nodes -o jsonpath="{.items[*].status.capacity.amd\.com/gpu}"').replace(" ", "+") or "0",
    "LoadBalancer Service": "Available" if "LoadBalancer" in run_kubectl("get service vllm-service") else "Not Available"
}

# Display summary
for key, value in summary_data.items():
    print(f"✓ {key}: {value}")

print("\n" + "="*50)
print("🎉 Tutorial Complete!")
print("\nNext Steps:")
print("• Explore different AI models with vLLM")
print("• Set up monitoring with Prometheus/Grafana")
print("• Implement autoscaling policies")
print("• Configure resource quotas for multi-tenancy")
print("• Explore multi-GPU model parallelism")

## 🎓 Key Takeaways and Next Steps

### What You've Learned

1. **AMD GPU + Kubernetes Integration**: Verified that the AMD GPU Operator exposes GPUs as schedulable Kubernetes resources (`amd.com/gpu`)

2. **AI Inference Deployment**: Checked the vLLM inference server's deployment, pods, and service, and tested its health and completion endpoints

3. **Scaling Operations**: Scaled a GPU workload horizontally and monitored GPU allocation across nodes

### Production Considerations

- **Resource Management**: Use resource quotas and limits to prevent GPU resource contention
- **Monitoring**: Implement comprehensive monitoring with Prometheus and Grafana
- **High Availability**: Deploy across multiple nodes with anti-affinity rules
- **Security**: Use network policies and pod security standards
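
As a small illustration of the resource management point, the sketch below builds a `ResourceQuota` manifest that caps the number of GPUs a namespace can request. For extended resources such as `amd.com/gpu`, Kubernetes quotas accept only the `requests.` prefix. The function name, the quota name `gpu-quota`, and the namespace `team-a` are hypothetical.

```python
import json


def gpu_quota_manifest(namespace, max_gpus, name="gpu-quota"):
    """Build a ResourceQuota limiting total amd.com/gpu requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        # Extended resources are quota-limited through the "requests." prefix.
        "spec": {"hard": {"requests.amd.com/gpu": str(max_gpus)}},
    }


manifest = gpu_quota_manifest("team-a", 4)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.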

### Useful Resources

- [AMD GPU Operator Documentation](https://rocm.github.io/gpu-operator/)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Kubernetes GPU Scheduling](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
- [ROCm Blog Series](https://rocm.blogs.amd.com/artificial-intelligence/k8s-orchestration-part1/README.html)

---

**Congratulations!** 🎉 You now have hands-on experience with AMD GPU-accelerated Kubernetes clusters for AI workloads.