# ⚙️ Week 09-10 · Notebook 03 · Tensors & GPU Acceleration in Edge Plants

Benchmark tensor operations and GPU utilization strategies when compute is limited to shared workstations or on-prem clusters.

## 🎯 Learning Objectives
- Manipulate PyTorch tensors across CPU/GPU devices.
- Evaluate mixed-precision trade-offs for manufacturing workloads.
- Profile memory and throughput to align with plant uptime schedules.
- Capture governance notes for IT change management.

## 🧩 Scenario
A tier-2 supplier alternates between an A100 workstation and a shared RTX 6000. Training jobs must finish overnight without impacting SCADA traffic. You need a repeatable benchmark harness and downtime mitigation plan.

In [None]:
import torch
import time
import pandas as pd

torch.manual_seed(1234)

## 🧮 Tensor Fundamentals: Representing Manufacturing Data

Tensors are the fundamental data structure for deep learning. We'll use them to represent everything from sensor readings to maintenance logs. Let's start by creating a tensor for a batch of sensor data and inspecting its shape, dtype, device, and strides. Recording these properties is a useful first step for governance: dtype and memory layout determine both the memory footprint and the numerical precision, so they can be checked against IT policy before any job is scheduled.

In [None]:
# Example: A batch of 4 sensor readings, each with 3 values (e.g., temperature, pressure, vibration)
sensor_data = torch.tensor([
    [25.5, 101.3, 0.02],
    [26.1, 101.4, 0.03],
    [24.9, 101.2, 0.02],
    [25.8, 101.3, 0.025]
], dtype=torch.float32)

print("Sensor Data Tensor:")
print(sensor_data)
print(f"\nShape: {sensor_data.shape}")
print(f"Data Type: {sensor_data.dtype}")
print(f"Device: {sensor_data.device}")
print(f"Memory Layout (Stride): {sensor_data.stride()}")
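
For the governance note, it can help to record how many bytes a tensor occupies under each candidate dtype and whether it is contiguous. The following is a minimal sketch under assumptions: `tensor_footprint` is a hypothetical helper name (not part of PyTorch), and the zero-filled batch only mirrors the 4×3 shape of the sensor data.

```python
import torch

def tensor_footprint(t: torch.Tensor) -> dict:
    """Return basic layout facts about a tensor (hypothetical helper for audit notes)."""
    return {
        "dtype": str(t.dtype),
        "shape": tuple(t.shape),
        "stride": t.stride(),
        "contiguous": t.is_contiguous(),
        "bytes": t.element_size() * t.nelement(),
    }

batch = torch.zeros(4, 3, dtype=torch.float32)  # same shape as the sensor batch
fp32_info = tensor_footprint(batch)
bf16_info = tensor_footprint(batch.to(torch.bfloat16))
transposed_info = tensor_footprint(batch.t())  # a view: same storage, swapped strides

for info in (fp32_info, bf16_info, transposed_info):
    print(info)
```

Halving the element size halves the byte count, while transposing changes only the strides and the contiguity flag, not the underlying storage.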

If a GPU is available, we can move our tensors to it, which usually gives a substantial speedup on large, parallel workloads such as matrix multiplication. This matters when training jobs must finish overnight. We'll write a function that falls back gracefully to the CPU when no GPU is found, a common situation for technicians using standard laptops on the plant floor.

In [None]:
def get_device_summary():
    """Checks for GPU availability and returns a summary."""
    summary = {}
    if torch.cuda.is_available():
        device = torch.device("cuda")
        summary['device_type'] = "cuda"
        summary['device_name'] = torch.cuda.get_device_name(0)
        # Create a large tensor to inspect memory allocation
        large_tensor = torch.randn((2048, 2048), device=device)
        summary['tensor_on_gpu'] = large_tensor.is_cuda
        summary['memory_allocated_gb'] = torch.cuda.memory_allocated(0) / 1e9
        del large_tensor # Clean up memory
        torch.cuda.empty_cache()
    else:
        device = torch.device("cpu")
        summary['device_type'] = "cpu"
        summary['device_name'] = "N/A"
        summary['tensor_on_gpu'] = False
        summary['memory_allocated_gb'] = 0

    return summary

device_summary = get_device_summary()
print(f"Device Summary: {device_summary}")

# Select the device used by subsequent cells (this does not change PyTorch's global default)
device = torch.device(device_summary['device_type'])
print(f"\nUsing device: '{device}'")
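
Once a device is chosen, tensors are moved explicitly with `.to(device)` and brought back with `.cpu()`. The sketch below is illustrative only: it repeats the sensor values from earlier, and the names `readings` and `normalized` are assumptions introduced for the example.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

readings = torch.tensor([
    [25.5, 101.3, 0.02],
    [26.1, 101.4, 0.03],
    [24.9, 101.2, 0.02],
    [25.8, 101.3, 0.025],
], dtype=torch.float32)

# .to() returns a tensor on the target device (the same tensor if it is already there)
readings_dev = readings.to(device)

# Standardize each channel on the chosen device
mean = readings_dev.mean(dim=0, keepdim=True)
std = readings_dev.std(dim=0, keepdim=True)
normalized = (readings_dev - mean) / std

# Move back to CPU before handing the result to NumPy or pandas
normalized_cpu = normalized.cpu()
print(normalized_cpu.device, normalized_cpu.shape)
```

Operations between tensors on different devices raise an error, so keeping every operand on `device` until the final `.cpu()` call avoids accidental mixing.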

## ⚡ Mixed Precision Benchmark
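Evaluate float32 vs. bfloat16/float16 throughput. Lower precision typically runs faster and uses less memory, but it also changes numerical results, so validate outputs before using it in safety-critical inference pipelines.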

In [None]:
def matmul_benchmark(size=4096, dtype=torch.float32, device='cpu', runs=5):
    """Performs a matrix multiplication benchmark for a given configuration."""
    a = torch.randn(size, size, dtype=dtype, device=device)
    b = torch.randn(size, size, dtype=dtype, device=device)
    
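    # Normalize so that 'cuda', 'cuda:0', and torch.device objects are all handled
    on_gpu = torch.device(device).type == 'cuda'

    # Warm-up run (keeps one-time setup costs out of the timings)
    _ = torch.matmul(a, b)
    if on_gpu:
        torch.cuda.synchronize()

    times = []
    for _ in range(runs):
        start_time = time.perf_counter()  # monotonic, high-resolution timer
        _ = torch.matmul(a, b)
        if on_gpu:
            torch.cuda.synchronize()  # Wait for the GPU operation to complete
        end_time = time.perf_counter()
        times.append(end_time - start_time)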
        
    avg_time = sum(times) / len(times)
    return avg_time

def benchmark_suite():
    """Runs the benchmark across available devices and data types."""
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # Define dtypes to test
    dtypes_to_test = {
        "float32": torch.float32
    }
    if device == 'cuda':
        # bfloat16 is generally preferred on modern GPUs (Ampere and newer)
        if torch.cuda.is_bf16_supported():
            dtypes_to_test["bfloat16"] = torch.bfloat16
        dtypes_to_test["float16"] = torch.float16

    results = []
    print(f"--- Running Benchmark on {device.upper()} ---")
    for name, dtype in dtypes_to_test.items():
        avg_time = matmul_benchmark(dtype=dtype, device=device)
        results.append({
            'device': device,
            'dtype': name,
            'avg_time_seconds': round(avg_time, 5)
        })
        print(f"  {name}: {avg_time:.5f} seconds per matmul")
        
    return pd.DataFrame(results)

benchmark_results = benchmark_suite()
print("\n--- Benchmark Results ---")
print(benchmark_results)
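
Throughput is only half of the trade-off; the other half is how far the reduced-precision result drifts from a high-precision reference. The sketch below estimates that drift on the CPU for a small matrix. It assumes float64 is an acceptable reference, and `relative_error` is a hypothetical helper name.

```python
import torch

torch.manual_seed(0)

def relative_error(dtype: torch.dtype, size: int = 256) -> float:
    """Relative Frobenius-norm error of a matmul in `dtype` vs. a float64 reference."""
    a = torch.randn(size, size, dtype=torch.float64)
    b = torch.randn(size, size, dtype=torch.float64)
    reference = a @ b
    approx = (a.to(dtype) @ b.to(dtype)).to(torch.float64)
    return ((approx - reference).norm() / reference.norm()).item()

err_fp32 = relative_error(torch.float32)
err_bf16 = relative_error(torch.bfloat16)
print(f"float32 relative error:  {err_fp32:.2e}")
print(f"bfloat16 relative error: {err_bf16:.2e}")
```

Logging this error next to the timing results gives the mixed-precision policy a measurable basis, though the acceptable threshold has to come from the process owner rather than from the benchmark.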

## 🧾 Memory Profiling Checklist
- Capture `torch.cuda.memory_summary()` at job start and end.
- Enforce IT's GPU allocation window (e.g., 18:00-06:00) to avoid SCADA conflicts.
- Log utilization metrics in maintenance CMMS for accountability.
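
The allocation window in the checklist crosses midnight, which is easy to get wrong in a scheduling guard. Here is a minimal sketch of such a check, assuming the example window of 18:00-06:00; `in_gpu_window` is a hypothetical helper, and the real window should come from IT.

```python
from datetime import datetime, time as dtime

def in_gpu_window(now: datetime,
                  start: dtime = dtime(18, 0),
                  end: dtime = dtime(6, 0)) -> bool:
    """True if `now` falls inside the allocation window; handles windows that cross midnight."""
    t = now.time()
    if start <= end:
        # Window within a single day, e.g. 01:00-05:00
        return start <= t < end
    # Window wraps past midnight, e.g. 18:00-06:00
    return t >= start or t < end

print(in_gpu_window(datetime(2025, 1, 6, 23, 30)))  # late evening
print(in_gpu_window(datetime(2025, 1, 6, 12, 0)))   # midday
```

A training script could call this at start-up and exit early when it returns `False`, leaving the enforcement policy itself to the job scheduler.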

In [None]:
if torch.cuda.is_available():
    print("--- GPU Memory Summary ---")
    # Provides a detailed breakdown of memory usage
    summary_text = torch.cuda.memory_summary(device=device, abbreviated=True)
    print(summary_text)
else:
    print("--- GPU Memory Summary ---")
    print("No GPU available. Memory profiling is only applicable for CUDA devices.")
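
For a single number to archive alongside the full summary, peak allocation around a workload can be read from PyTorch's CUDA memory statistics. The sketch below is written under assumptions: `peak_memory_mb` is a hypothetical wrapper, and it reports 0.0 on machines without a GPU.

```python
import torch

def peak_memory_mb(fn) -> float:
    """Run `fn` and return peak CUDA memory allocated in MB (0.0 when no GPU is present)."""
    if not torch.cuda.is_available():
        fn()
        return 0.0
    torch.cuda.reset_peak_memory_stats()  # start the peak counter from the current state
    fn()
    torch.cuda.synchronize()              # make sure queued GPU work has finished
    return torch.cuda.max_memory_allocated() / 1e6

dev = "cuda" if torch.cuda.is_available() else "cpu"
peak = peak_memory_mb(
    lambda: torch.randn(1024, 1024, device=dev) @ torch.randn(1024, 1024, device=dev)
)
print(f"Peak allocated during matmul: {peak:.1f} MB")
```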

## 🛡️ Downtime Mitigation & Change Management Plan

Deploying AI models into a production manufacturing environment requires careful planning to avoid disrupting operations. This is a template for a change management ticket.

| Risk Category | Specific Risk | Mitigation Strategy | Owner |
|---|---|---|---|
| **Compute Resource Conflict** | AI training job overloads the shared workstation, impacting the SCADA system that monitors production lines. | **Schedule-Based Access:** Limit AI training to off-peak hours (e.g., 22:00-06:00). Use `nice` in Linux or process priority settings in Windows to de-prioritize the training script. Note that `nice` lowers CPU scheduling priority only; it does not limit GPU usage. | IT / AI Team |
| **Model Performance Degradation** | A new model version (e.g., using `bfloat16`) produces incorrect or unsafe predictions for a critical process like quality control. | **Canary Deployment:** Route 1% of inference requests to the new model. Compare its outputs against the stable `float32` model. Only roll out fully after a 24-hour validation period with no discrepancies beyond an agreed numerical tolerance. | AI Team / QA |
| **GPU Hardware Failure** | The primary GPU (A100) fails mid-training, halting model development. | **Graceful Fallback:** Ensure all training scripts can run on a secondary device (e.g., RTX 6000) or even CPU with a smaller batch size. The script should automatically detect the available device. | AI Team |
| **Network Congestion** | Transferring large datasets or model checkpoints across the plant network interferes with critical operational data flow. | **Data Locality & Off-Peak Transfers:** Pre-stage datasets on the training workstation. Schedule transfers of large model files during the approved off-peak window. | IT / Network Ops |
| **Rollback Failure** | A deployed model needs to be rolled back, but the previous version's artifacts are missing or incompatible. | **Version Control for Models:** Use a model registry (like MLflow or a simple versioned directory structure) to store model weights, tokenizer configs, and performance metrics for every deployed version. | AI Team |
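
The "simple versioned directory structure" mentioned for rollback could look roughly like the sketch below. Everything here is hypothetical: the function names, the `weights.bin` and `metadata.json` file names, and the placeholder bytes. In practice the weights would typically be written with `torch.save(model.state_dict(), ...)`.

```python
import json
import tempfile
from pathlib import Path

def register_version(root: Path, version: str, weights: bytes, metrics: dict) -> Path:
    """Store weights and metadata under root/<version>/; refuse to overwrite an existing version."""
    target = root / version
    target.mkdir(parents=True, exist_ok=False)
    (target / "weights.bin").write_bytes(weights)
    metadata = {"version": version, "metrics": metrics}
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return target

def list_versions(root: Path) -> list:
    """Return registered version names in sorted order."""
    return sorted(p.name for p in root.iterdir() if p.is_dir())

registry_root = Path(tempfile.mkdtemp())  # stand-in for a real registry location
register_version(registry_root, "v001", b"placeholder-weights", {"dtype": "float32"})
register_version(registry_root, "v002", b"placeholder-weights", {"dtype": "bfloat16"})

versions = list_versions(registry_root)
rollback_target = versions[-2]  # the version registered before the latest one
print(versions, "-> roll back to", rollback_target)
```

Refusing to overwrite an existing version directory is the design choice that keeps a rollback target intact; zero-padded version names keep the sorted order aligned with registration order.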


## 🧪 Lab Assignment
1. Run the benchmark suite on both the A100 and the RTX 6000, then compare throughput.
2. Add power draw instrumentation (e.g., `nvidia-smi --query-gpu=power.draw`).
3. Propose a mixed-precision policy for safety-critical vs. internal tools.
4. Submit a change-management ticket with benchmark evidence and a rollback plan.
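
For the power draw step, one possible starting point is to call `nvidia-smi` from Python and parse its CSV output. This is a sketch, not a finished instrument: `parse_power_draw` and `read_power_draw` are hypothetical helper names, and the function returns `None` when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess
from typing import List, Optional

def parse_power_draw(output: str) -> List[float]:
    """Parse one wattage per line from nvidia-smi CSV output; skip rows that are not numeric."""
    values = []
    for line in output.strip().splitlines():
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue  # e.g. '[N/A]' on GPUs that do not report power
    return values

def read_power_draw() -> Optional[List[float]]:
    """Return per-GPU power draw in watts, or None when nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    return parse_power_draw(result.stdout)

print(read_power_draw())
```

Sampling this before, during, and after each benchmark run would give the power figures to attach to the change-management ticket.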

## ✅ Checklist
- [ ] Tensor device audit completed
- [ ] Mixed-precision benchmark logged
- [ ] Memory summary archived for compliance
- [ ] Downtime mitigation plan approved

## 📚 References
- PyTorch Performance Tuning Guide
- NVIDIA A100 vs. RTX 6000 Comparison (2025)
- *Operational Technology Change Management Handbook* (ISA, 2023)