# GPU STREAM Benchmark: Memory Bandwidth Analysis

The **STREAM benchmark** is the industry standard for measuring sustained memory bandwidth. Originally developed by Dr. John McCalpin for CPUs, it's been adapted for GPUs to characterize HBM (High Bandwidth Memory) performance.

## What STREAM Measures

STREAM tests simple vector operations that stress memory bandwidth with minimal compute:

| Kernel | Operation | Memory Streams | Bytes/Element |
|--------|-----------|----------------|---------------|
| **Init** | `A = scalar` | 1 write | 8 bytes |
| **Read** | `temp = B` | 1 read | 8 bytes |
| **Scale** | `A = α·B` | 1 read + 1 write | 16 bytes |
| **Triad** | `A = B·D + C` | 3 reads + 1 write | 32 bytes |

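**Triad** is the headline STREAM test and represents realistic bandwidth under load. Note that the triad used here reads three arrays (`A = B·D + C`), whereas McCalpin's original triad is `a = b + q·c` with a scalar `q` (2 reads + 1 write).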
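
The byte counts in the table translate directly into an effective-bandwidth formula: bytes per element times element count, divided by elapsed time. The sketch below shows that arithmetic. The helper names `bytes_moved` and `bandwidth_gbs` are hypothetical and not part of `rocmGPUBenches`, and the library's own `primary_metric` may be derived from a different timing, so the two need not agree exactly.

```python
# Bytes moved per element for each kernel, taken from the table above
# (8-byte doubles, one 8-byte access per memory stream).
BYTES_PER_ELEMENT = {"init": 8, "read": 8, "scale": 16, "triad": 32}


def bytes_moved(kernel_type: str, n_elements: int) -> int:
    """Total bytes read and written by one kernel launch."""
    return BYTES_PER_ELEMENT[kernel_type] * n_elements


def bandwidth_gbs(kernel_type: str, n_elements: int, time_ms: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured time."""
    seconds = time_ms / 1e3
    return bytes_moved(kernel_type, n_elements) / seconds / 1e9


if __name__ == "__main__":
    # 100M-element triad: 4 streams x 8 bytes x 1e8 elements of traffic.
    print(bytes_moved("triad", 100_000_000))
    # Hypothetical 2.0 ms launch, used only to illustrate the formula.
    print(f"{bandwidth_gbs('triad', 100_000_000, 2.0):.1f} GB/s")
```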

## Relevance to LLM Workloads

### ✅ Why STREAM Matters:
- **Decode phase bottleneck**: Autoregressive generation (batch=1) is memory-bound
- **HBM bandwidth ceiling**: Establishes the sustained bandwidth the hardware actually delivers, which is lower than the theoretical peak on the spec sheet
- **KV cache streaming**: Reading/writing attention cache during generation
- **Sanity check**: If LLM inference can't approach STREAM bandwidth, there's room for optimization

### ⚠ STREAM Limitations for LLMs:
- **Access patterns differ**: STREAM is sequential; LLMs have strided (attention heads) and irregular patterns
- **Compute/memory mix**: STREAM is pure bandwidth; LLMs mix GEMM (compute-bound) with memory-bound ops
- **Cache behavior**: STREAM with large arrays touches each element once, so caches provide no reuse; small-batch LLM inference can benefit from L2/L3 locality
- **Real bottlenecks**: FlashAttention fusion, quantization overhead, kernel fusion matter more

**Bottom line**: STREAM gives you a practical **upper bound** for memory bandwidth. For LLM-specific analysis, you'll also want roofline models, attention benchmarks, and GEMM profiling.
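
To make the decode-phase sanity check concrete: at batch=1, every generated token has to stream the model weights from HBM at least once, so bandwidth divided by weight size bounds tokens per second. The sketch below is a rough estimate under stated assumptions. The function name is hypothetical, the 2000 GB/s figure and the 7B-parameter FP16 model are placeholders, and KV cache and activation traffic are ignored, so real throughput will be lower.

```python
def decode_tokens_per_s_ceiling(
    bandwidth_gbs: float, n_params: float, bytes_per_param: float
) -> float:
    """Upper bound on batch=1 decode throughput if weights are read once per token."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_gbs * 1e9 / weight_bytes


if __name__ == "__main__":
    # Placeholder inputs: 2000 GB/s sustained bandwidth, 7e9 params, 2 bytes each (FP16).
    ceiling = decode_tokens_per_s_ceiling(2000.0, 7e9, 2)
    print(f"Decode ceiling: {ceiling:.1f} tokens/s")
```

If measured decode throughput sits far below this bound, the gap points to optimization headroom rather than a hardware limit.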

## Setup

In [30]:
from rocmGPUBenches import create_stream_benchmark_runner
from rocmGPUBenches.storage import BenchmarkDB
from rocmGPUBenches.visualization import plot_sweep, plot_comparison
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [31]:
# Initialize benchmark runner and database
runner = create_stream_benchmark_runner()
db = BenchmarkDB("stream_results.db")

Registered benchmark: stream


## 1. Quick Test: Classic STREAM Triad

The triad kernel (`A = B*D + C`) is the headline STREAM measurement: it uses 4 memory streams (3 reads + 1 write) and represents realistic bandwidth under load.

In [32]:
# Test triad with 100M elements (800 MB per array, 3.2 GB total)
result = runner.run("stream", {
    "problem_size": 100_000_000,
    "kernel_type": "triad",
    "block_size": 256
})

print(f"STREAM Triad Bandwidth: {result.primary_metric:.2f} GB/s")
print(f"Execution Time: {result.exec_time_ms:.3f} ms")
print(f"\nThis represents your GPU's sustained HBM bandwidth ceiling.")

Compiling stream kernel with optimizations (one-time compilation)...
Compile flags: -O3 -ffast-math --gpu-max-threads-per-block=1024
Kernel compilation complete! Loaded 4 kernel functions.
STREAM Triad Bandwidth: 2050.02 GB/s
Execution Time: 1.653 ms

This represents your GPU's sustained HBM bandwidth ceiling.


## 2. Compare All Kernel Types

Different kernels use different numbers of memory streams, revealing bandwidth scaling behavior.

In [None]:
# Test all 4 kernel types
N = 100_000_000
kernel_types = ["init", "read", "scale", "triad"]
results = {}

print("Kernel Type  | Bandwidth (GB/s) | Time (ms) | Memory Streams")
print("-" * 70)

for kernel_type in kernel_types:
    result = runner.run("stream", {
        "problem_size": N,
        "kernel_type": kernel_type,
        "block_size": 256
    })

    results[kernel_type] = result

    # Determine stream count
    streams = {"init": 1, "read": 1, "scale": 2, "triad": 4}[kernel_type]

    print(f"{kernel_type:12s} | {result.primary_metric:16.2f} | {result.exec_time_ms:9.3f} | {streams}")

    # Save to database
    db.save_result(
        benchmark_name="stream",
        params={"problem_size": N, "kernel_type": kernel_type, "block_size": 256},
        result=result
    )

Kernel Type  | Bandwidth (GB/s) | Time (ms) | Memory Streams
----------------------------------------------------------------------
init         |           815.66 |     1.077 | 1
read         |           784.85 |     1.127 | 1
scale        |          1370.94 |     1.253 | 2
triad        |          2061.45 |     1.637 | 4


In [34]:
# Visualize kernel comparison
kernel_names = list(results.keys())
bandwidths = [results[k].primary_metric for k in kernel_names]
stream_counts = [1, 1, 2, 4]

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Bandwidth by Kernel Type", "Bandwidth vs Memory Streams"),
    specs=[[{"type": "bar"}, {"type": "scatter"}]]
)

# Bar chart
fig.add_trace(
    go.Bar(x=kernel_names, y=bandwidths, name="Bandwidth",
           marker_color=['#636EFA', '#EF553B', '#00CC96', '#AB63FA']),
    row=1, col=1
)

# Scatter plot: bandwidth vs stream count
fig.add_trace(
    go.Scatter(x=stream_counts, y=bandwidths, mode='lines+markers',
               name="Bandwidth", marker=dict(size=10), line=dict(width=2)),
    row=1, col=2
)

fig.update_xaxes(title_text="Kernel Type", row=1, col=1)
fig.update_xaxes(title_text="Number of Memory Streams", row=1, col=2)
fig.update_yaxes(title_text="Bandwidth (GB/s)", row=1, col=1)
fig.update_yaxes(title_text="Bandwidth (GB/s)", row=1, col=2)

fig.update_layout(height=400, showlegend=False, title_text="STREAM Kernel Comparison")
fig.show()

## 3. Problem Size Sweep: Cache Effects

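Vary the per-array size from 1 KB to 2 GB (4 KB to 8 GB across the four arrays) to observe:
- **Small sizes**: Fit in L2/L3 cache, but each launch moves so little data that fixed launch overhead can dominate the measurement
- **Large sizes**: Spill to HBM → sustained bandwidth plateau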

In [None]:
# Per-array sizes: 1 KB to 2 GB (expressed below as a number of doubles)
# Each double is 8 bytes, and we have 4 arrays, so total footprint is 4 KB to 8 GB
sizes_kb = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536,
            131072, 262144, 524288, 1048576, 2097152]  # KB values

# Convert KB to number of doubles (1 KB = 128 doubles, since 1024 bytes / 8 bytes/double)
problem_sizes = [kb * 128 for kb in sizes_kb]

print(f"Running size sweep: {len(problem_sizes)} sizes from {sizes_kb[0]} KB to {sizes_kb[-1]} KB")
print("This will take a few minutes...\n")

# Run triad kernel across all sizes
for i, (size, kb) in enumerate(zip(problem_sizes, sizes_kb), 1):
    result = runner.run("stream", {
        "problem_size": size,
        "kernel_type": "triad",
        "block_size": 256
    })

    db.save_result(
        benchmark_name="stream",
        params={"problem_size": size, "kernel_type": "triad", "block_size": 256},
        result=result
    )

    if i % 5 == 0:
        print(f"  [{i:2d}/{len(problem_sizes)}] Size: {kb:7d} KB → {result.primary_metric:.2f} GB/s")

Running size sweep: 22 sizes from 1 KB to 2097152 KB
This will take a few minutes...

  [ 5/22] Size:      16 KB → 2.43 GB/s
  [10/22] Size:     512 KB → 80.17 GB/s


  [15/22] Size:   16384 KB → 1084.96 GB/s
  [20/22] Size:  524288 KB → 2041.98 GB/s


### Observations:

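1. **Small sizes (< 64 MB)**: In this run bandwidth is lower, not higher: 2.43 GB/s at 16 KB and 80.17 GB/s at 512 KB per array. Fixed kernel-launch and timing overhead dominates and masks any L2/L3 cache benefit
2. **Plateau region (> 256 MB)**: Represents true HBM bandwidth - this is your **memory bandwidth ceiling** (about 2.04 TB/s at 512 MB per array)
3. **Cache hierarchy**: Inflection points where bandwidth drops would mark cache size boundaries; none are visible in the sampled output above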

For **LLM decode phase** (batch=1), your workload will be in the HBM-bound regime since model weights (GBs) far exceed cache size (MBs).
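
The shape of the sweep can be approximated with a simple latency-plus-bandwidth model: elapsed time is a fixed per-launch overhead plus bytes divided by peak bandwidth. The sketch below is illustrative only. The function name is hypothetical, and the 26 µs overhead is an estimate back-derived from the 16 KB data point in the sweep output, assuming the reported metric counts traffic to all four arrays.

```python
def effective_bandwidth_gbs(
    total_bytes: float, launch_overhead_s: float, peak_gbs: float
) -> float:
    """Bandwidth seen by a timer when each launch pays a fixed overhead."""
    transfer_s = total_bytes / (peak_gbs * 1e9)
    return total_bytes / (launch_overhead_s + transfer_s) / 1e9


if __name__ == "__main__":
    OVERHEAD_S = 26e-6  # assumed fixed cost per launch (estimate, not measured)
    PEAK_GBS = 2050.0   # plateau value taken from the triad runs above
    for kb_per_array in (16, 512, 16384, 524288):
        total = 4 * kb_per_array * 1024  # four arrays, 1 KB = 1024 bytes
        bw = effective_bandwidth_gbs(total, OVERHEAD_S, PEAK_GBS)
        print(f"{kb_per_array:7d} KB per array -> {bw:8.2f} GB/s (model)")
```

Under this model, bandwidth rises with problem size until transfer time dwarfs the overhead, which is qualitatively what the sweep shows.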

## 4. Block Size Sensitivity

Test different thread block sizes to find optimal occupancy.

In [None]:
# Test different block sizes
block_sizes = [64, 128, 256, 512, 1024]
N = 100_000_000

print("Block Size | Bandwidth (GB/s) | Time (ms)")
print("-" * 50)

for block_size in block_sizes:
    result = runner.run("stream", {
        "problem_size": N,
        "kernel_type": "triad",
        "block_size": block_size
    })

    print(f"{block_size:10d} | {result.primary_metric:16.2f} | {result.exec_time_ms:9.3f}")

    db.save_result(
        benchmark_name="stream",
        params={"problem_size": N, "kernel_type": "triad", "block_size": block_size},
        result=result
    )

Block Size | Bandwidth (GB/s) | Time (ms)
--------------------------------------------------
        64 |           839.14 |     3.824
       128 |          1387.22 |     2.338
       256 |          2077.21 |     1.625
       512 |          2320.25 |     1.421
      1024 |          2316.54 |     1.430
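
One way to turn this table into a setting is to pick the smallest block size whose bandwidth is within a small tolerance of the best result. The sketch below uses the numbers printed above. The function name and the 1% tolerance are illustrative choices, not part of `rocmGPUBenches`.

```python
# Bandwidth (GB/s) per block size, copied from the output above.
measured = {64: 839.14, 128: 1387.22, 256: 2077.21, 512: 2320.25, 1024: 2316.54}


def smallest_near_peak(bandwidth_by_block: dict, tolerance: float = 0.01) -> int:
    """Smallest block size whose bandwidth is within `tolerance` of the peak."""
    peak = max(bandwidth_by_block.values())
    candidates = [
        block for block, bw in bandwidth_by_block.items()
        if bw >= (1.0 - tolerance) * peak
    ]
    return min(candidates)


if __name__ == "__main__":
    best = smallest_near_peak(measured)
    gain = measured[best] / measured[256]
    print(f"Chosen block size: {best} ({gain:.2f}x the bandwidth of block size 256)")
```

In this run, 512 and 1024 threads per block are effectively tied, and both beat the 256 used in the earlier sections by roughly 12%.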


## 5. Key Takeaways

### For LLM Inference:

1. **Decode Phase Target**: Your STREAM triad bandwidth (~2.05 TB/s at block size 256, ~2.3 TB/s at block size 512) is the practical **ceiling** for memory-bound kernels
   - If decode throughput << STREAM bandwidth, there's optimization headroom
   - Typical optimizations: kernel fusion, fewer memory round trips, quantization

2. **Prefill Phase**: Compute-bound (GEMM-dominated), not memory-bound
   - Need **roofline model** to understand compute vs bandwidth trade-offs
   - STREAM alone doesn't tell the full story

3. **Real-World Considerations**:
   - Attention patterns are strided (not sequential like STREAM)
   - Quantization adds dequant overhead
   - FlashAttention/PagedAttention use tiling for better locality
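
The roofline model mentioned above reduces to one line: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. The sketch below is a minimal version. The function names are hypothetical, and the 1000 TFLOP/s compute peak is a placeholder rather than a measured or vendor figure; only the bandwidth is of the same order as the STREAM results in this notebook.

```python
def attainable_tflops(
    arithmetic_intensity: float, peak_tflops: float, bandwidth_gbs: float
) -> float:
    """Roofline bound in TFLOP/s for a kernel with the given FLOP/byte ratio."""
    # FLOP/byte * GB/s = GFLOP/s, so divide by 1e3 to get TFLOP/s.
    memory_bound_tflops = arithmetic_intensity * bandwidth_gbs / 1e3
    return min(peak_tflops, memory_bound_tflops)


def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) where a kernel turns compute-bound."""
    return peak_tflops * 1e3 / bandwidth_gbs


if __name__ == "__main__":
    PEAK_TFLOPS = 1000.0    # placeholder compute peak (assumption)
    BANDWIDTH_GBS = 2000.0  # rounded STREAM-scale bandwidth
    print(f"Ridge point: {ridge_point(PEAK_TFLOPS, BANDWIDTH_GBS):.0f} FLOP/byte")
    for ai in (1, 10, 100, 1000):
        bound = attainable_tflops(ai, PEAK_TFLOPS, BANDWIDTH_GBS)
        print(f"AI = {ai:5d} FLOP/byte -> {bound:7.1f} TFLOP/s")
```

Kernels to the left of the ridge point (such as batch=1 decode) are limited by the STREAM-style bandwidth; kernels to the right (such as large prefill GEMMs) are limited by compute.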

### Next Benchmarks to Explore:
- **Roofline Model**: Compute vs memory bandwidth trade-offs
- **Attention Kernels**: FlashAttention-style fused operations
- **GEMM**: Different shapes and precisions (FP16/BF16/FP8)
- **Fused Kernels**: RMSNorm, LayerNorm, activations

STREAM gives you the **foundation** - now build up to realistic LLM kernels! 🚀

## 6. Export Results

In [37]:
# Export all results to CSV for further analysis
db.export_csv("stream_results.csv")
print("✓ Results exported to stream_results.csv")

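# Show summary statistics
# NOTE: in the recorded run this query matched 0 rows, so the statistics below print as nan.
# The WHERE clause probably does not match how BenchmarkDB stores kernel_type (the schema
# is not shown here); inspect the table's columns before relying on these numbers.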
df = db.query("SELECT * FROM results WHERE benchmark_name = 'stream' AND kernel_type = 'triad'")
print(f"\nCollected {len(df)} triad measurements")
print(f"Peak bandwidth: {df['primary_metric'].max():.2f} GB/s")
print(f"Average bandwidth (large sizes): {df[df['problem_size'] > 10_000_000]['primary_metric'].mean():.2f} GB/s")

Exported 166 results to stream_results.csv
✓ Results exported to stream_results.csv

Collected 0 triad measurements
Peak bandwidth: nan GB/s
Average bandwidth (large sizes): nan GB/s
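
The empty result above suggests the filter did not match the stored data. The sketch below shows one way to debug and guard this kind of summary with the standard-library `sqlite3` module. The schema is hypothetical: it assumes a `results` table that keeps run parameters as JSON text in a `params` column, which may not be how `BenchmarkDB` lays out its data. The demo rows reuse bandwidth values printed earlier in this notebook.

```python
import json
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with a hypothetical schema and a few rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (benchmark_name TEXT, params TEXT, primary_metric REAL)"
    )
    demo_rows = [
        ("stream", {"kernel_type": "triad", "block_size": 256}, 2061.45),
        ("stream", {"kernel_type": "scale", "block_size": 256}, 1370.94),
        ("stream", {"kernel_type": "triad", "block_size": 512}, 2320.25),
    ]
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [(name, json.dumps(params), metric) for name, params, metric in demo_rows],
    )
    return conn


def triad_summary(conn: sqlite3.Connection):
    """Return count/peak/mean for triad rows, or None when nothing matches."""
    rows = conn.execute(
        "SELECT params, primary_metric FROM results WHERE benchmark_name = ?",
        ("stream",),
    ).fetchall()
    # Filter on the JSON-encoded parameters in Python rather than in SQL.
    triad = [m for p, m in rows if json.loads(p).get("kernel_type") == "triad"]
    if not triad:
        return None  # avoid reporting nan for an empty selection
    return {"count": len(triad), "peak": max(triad), "mean": sum(triad) / len(triad)}


if __name__ == "__main__":
    conn = make_demo_db()
    # List the actual column names before writing a WHERE clause against them.
    print([row[1] for row in conn.execute("PRAGMA table_info(results)")])
    print(triad_summary(conn))
```

Checking the column list first makes it clear whether `kernel_type` is a real column or lives inside a serialized parameter blob, and returning `None` for an empty selection keeps `nan` out of the report.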
