# 03. Floyd-Warshall APSP Benchmark

This notebook benchmarks the following variants of the Floyd-Warshall algorithm for All-Pairs Shortest Path (APSP):
- `floyd_serial`
- `floyd_openmp`
- `floyd_cuda`

Because it runs in O(V³) time and O(V²) memory regardless of the number of edges, Floyd-Warshall is best suited to dense graphs with a modest number of vertices.
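
As a point of reference, the recurrence behind these variants can be sketched in plain Python. This is a minimal illustration, not the code of the benchmarked executables; the name `floyd_warshall_reference` and the use of `math.inf` for missing edges are assumptions made for this sketch.

```python
import math

def floyd_warshall_reference(dist):
    """Return all-pairs shortest distances for a square matrix.

    dist[i][j] is the weight of edge i -> j, math.inf if absent, 0 on the diagonal.
    """
    n = len(dist)
    d = [row[:] for row in dist]      # copy so the input is left untouched
    for k in range(n):                # intermediate vertex
        for i in range(n):            # source
            dik = d[i][k]
            if dik == math.inf:
                continue              # no path i -> k, nothing to relax
            for j in range(n):        # destination
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d

inf = math.inf
graph = [[0, 3, inf, 7],
         [8, 0, 2, inf],
         [5, inf, 0, 1],
         [2, inf, inf, 0]]
for row in floyd_warshall_reference(graph):
    print(row)
```

The three nested loops over `k`, `i`, and `j` are where the O(V³) cost comes from, and the `d` matrix accounts for the O(V²) memory.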

## 1. Setup

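Paste the utility functions from `00_setup_build.ipynb` into the cell below: `run_command` runs a shell command and returns its stdout, `parse_time` extracts a `time: <value> [unit]` figure (in seconds) from that output, and `time_exe` reports the median over several timed runs after a warm-up.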

In [None]:
import subprocess, statistics, re, os, json, time, pandas as pd

def run_command(cmd, timeout=300):
    try:
        print("  >", cmd)
        return subprocess.run(cmd, shell=True, capture_output=True,
                             text=True, check=True, timeout=timeout).stdout
    except subprocess.CalledProcessError as e:
        print("    stderr:", e.stderr.strip())
    except subprocess.TimeoutExpired:
        print("    timeout")
    return None

def parse_time(out):
    if not out: return None
    m = re.search(r"time:\s*([0-9]*\.?[0-9]+)\s*(ms|s|sec|seconds)?", out, re.I)
    if not m: return None
    val = float(m.group(1)); unit = (m.group(2) or "s").lower()
    return val/1000.0 if unit.startswith("ms") else val

def time_exe(cmd, warmups=1, runs=3):
    if not cmd: return None
    for _ in range(warmups): _ = run_command(cmd)
    samples = []
    for _ in range(runs):
        t = parse_time(run_command(cmd))
        if t is not None: samples.append(t)
    return statistics.median(samples) if samples else None

## 2. Dataset Considerations

Floyd-Warshall operates on a dense adjacency matrix. Our executables generate graphs as an edge list controlled by a `density` parameter, so the density changes the input but not the amount of work the algorithm performs:

- **Dense Graphs (`density` ≈ 1.0)**: This is the ideal use case for Floyd-Warshall. The number of edges approaches `V²`, so the `O(V²)` distance matrix is no larger than the input itself, and most of the `O(V³)` relaxations operate on finite distances.
- **Sparse Graphs (`density` << 1.0)**: The algorithm still performs `O(V³)` operations, but many of them (especially in early iterations) involve an infinite entry and cannot improve any distance, which is wasted work. For sparse graphs, Johnson's algorithm, which runs in `O(VE + V² log V)` with a Fibonacci heap, is typically superior.
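
To make the edge-list-to-matrix step concrete, here is a minimal sketch of what such a conversion might look like. The helpers `edges_to_matrix` and `finite_fraction` and the `(u, v, w)` tuple format are hypothetical; the actual executables may build their matrix differently.

```python
import math

def edges_to_matrix(num_vertices, edges):
    """Build a dense distance matrix from directed (u, v, w) edges."""
    dist = [[math.inf] * num_vertices for _ in range(num_vertices)]
    for i in range(num_vertices):
        dist[i][i] = 0
    for u, v, w in edges:
        # Keep the lightest edge if the list contains parallel edges.
        dist[u][v] = min(dist[u][v], w)
    return dist

def finite_fraction(dist):
    """Fraction of off-diagonal entries that hold a real edge weight."""
    n = len(dist)
    finite = sum(1 for i in range(n) for j in range(n)
                 if i != j and dist[i][j] != math.inf)
    return finite / (n * (n - 1))

edges = [(0, 1, 4), (1, 2, -2), (0, 1, 1), (2, 3, 5)]
matrix = edges_to_matrix(4, edges)
print(finite_fraction(matrix))  # 3 distinct edges out of 12 possible -> 0.25
```

The matrix always has `V²` entries, whatever the density; only the share of finite entries changes.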

## 3. Benchmark Parameters

In [None]:
#@markdown ### Benchmark Parameters (Floyd-Warshall APSP)
V_list = "50,100,200,300"  #@param {type:"string"}
min_w = -10                  #@param {type:"integer"}
max_w = 50                   #@param {type:"integer"}
density = 0.3                #@param {type:"number"}
threads = 8                  #@param {type:"integer"}

V_list = [int(x) for x in V_list.split(",")]
executables = ['floyd_serial','floyd_openmp','floyd_cuda']
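
Note that the default `min_w` is negative, so a randomly generated graph may contain a negative-weight cycle, in which case shortest paths are undefined for the affected pairs. Whether the executables check for this is not stated here. The standard test, sketched below with a hypothetical helper `has_negative_cycle`, is to inspect the diagonal of the final distance matrix: a negative entry `d[i][i]` means vertex `i` lies on a negative cycle.

```python
import math

def floyd_warshall(dist):
    """Plain triple-loop Floyd-Warshall on a copy of the input matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def has_negative_cycle(final_dist):
    """True if any vertex can reach itself with negative total weight."""
    return any(final_dist[i][i] < 0 for i in range(len(final_dist)))

inf = math.inf
# 0 -> 1 -> 2 -> 0 has total weight 1 + (-3) + 1 = -1
cyclic = [[0, 1, inf], [inf, 0, -3], [1, inf, 0]]
# Same edges without 2 -> 0, so no cycle at all
acyclic = [[0, 1, inf], [inf, 0, -3], [inf, inf, 0]]
print(has_negative_cycle(floyd_warshall(cyclic)))   # True
print(has_negative_cycle(floyd_warshall(acyclic)))  # False
```

Setting `min_w` to a non-negative value avoids the issue entirely if negative weights are not the point of the benchmark.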

### Algorithmic Variant: Blocked Floyd-Warshall

The standard Floyd-Warshall algorithm makes poor use of the cache: each of the `V` pivot iterations sweeps the entire `V × V` matrix, so once the matrix no longer fits in cache it is re-read from main memory on every iteration. The **Blocked Floyd-Warshall** algorithm improves this by partitioning the matrix into `B × B` blocks (or tiles) and performing as many updates as possible on a block while it is resident in cache or GPU shared memory.

In [None]:
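# Blocked Floyd-Warshall (illustrative NumPy sketch; dist_matrix is a float array updated in place)
import numpy as np

def relax(C, A, Bm):
    # C[i, j] = min(C[i, j], A[i, k] + Bm[k, j]) for each pivot k of the tile
    for k in range(A.shape[1]):
        np.minimum(C, A[:, k:k+1] + Bm[k:k+1, :], out=C)

def floyd_warshall_blocked(dist_matrix, B):
    N = dist_matrix.shape[0]
    tiles = [slice(s, min(s + B, N)) for s in range(0, N, B)]
    for K in tiles:
        # Phase 1: full FW update within the pivot block (K, K)
        pivot = dist_matrix[K, K]
        relax(pivot, pivot, pivot)
        # Phase 2: blocks in the pivot row and pivot column, using the pivot block
        for J in tiles:
            if J != K:
                relax(dist_matrix[K, J], pivot, dist_matrix[K, J])
                relax(dist_matrix[J, K], dist_matrix[J, K], pivot)
        # Phase 3: all remaining blocks, using the Phase 2 row and column blocks
        for I in tiles:
            for J in tiles:
                if I != K and J != K:
                    relax(dist_matrix[I, J], dist_matrix[I, K], dist_matrix[K, J])
    return dist_matrix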

## 4. Command Builder

In [None]:
def build_cmd_fw(exe, v, *, min_w, max_w, density, threads):
    path = os.path.join("bin", exe)
    if not os.path.exists(path): return None
    args = [str(v), str(min_w), str(max_w), str(density)]
    if "openmp" in exe:
        args.append(str(threads))
    return " ".join([path] + args)

## 5. Run Benchmarks

In [None]:
rows = []
for v in V_list:
    print(f"\nFloyd-Warshall for V={v}")
    row = {"vertices": v}
    for exe in executables:
        cmd = build_cmd_fw(exe, v, min_w=min_w, max_w=max_w, 
                           density=density, threads=threads)
        t = time_exe(cmd)
        row[exe] = t
        if t is not None: print(f"  {exe}: {t:.6f}s")
    rows.append(row)

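# pandas is already imported in the setup cell.
# astype(float) turns missing timings (None, e.g. an executable that was not built) into NaN
# so that the speedup division below does not fail on an all-None column.
df_fw = pd.DataFrame(rows).set_index("vertices").sort_index().astype(float)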
df_fw.to_csv("floyd_times.csv")
df_fw

## 6. Speedup Analysis

In [None]:
import numpy as np, seaborn as sns, matplotlib.pyplot as plt
base = df_fw['floyd_serial']
speed = pd.DataFrame({
    "floyd_openmp_speedup": base / df_fw['floyd_openmp'],
    "floyd_cuda_speedup":   base / df_fw['floyd_cuda'],
}, index=df_fw.index)

display(speed)
sns.lineplot(data=speed.reset_index().melt("vertices", var_name="variant", value_name="speedup"),
             x="vertices", y="speedup", hue="variant", marker="o")
plt.axhline(1, ls="--", c="gray"); plt.yscale("log"); plt.show()
speed.to_csv("floyd_speedup.csv")

## 7. CPU vs. GPU Scaling Discussion

When interpreting the results, expect the following patterns:

- **Small `V`**: For small numbers of vertices, the GPU version may be slower than both CPU versions (serial and OpenMP). Fixed costs, namely launching CUDA kernels and transferring data to and from the GPU, can dominate the total execution time.
- **Large `V`**: As `V` increases, the `O(V³)` computational work grows much faster than the `O(V²)` data transfer cost. Beyond some hardware-dependent crossover point, the GPU's massive parallelism should outweigh the fixed overhead and yield significant speedups.
- **OpenMP Scaling**: The OpenMP version should provide a speedup over the serial version, but it will likely not be linear in the number of threads. Each pivot iteration must finish before the next one starts, and once the matrix outgrows the cache the algorithm is often memory-bandwidth bound on modern multi-core CPUs.
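
The crossover argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes 4-byte matrix entries and one host-to-device plus one device-to-host copy of the full matrix; both are assumptions about the CUDA variant rather than facts taken from its source, and `work_vs_transfer` is a hypothetical helper.

```python
def work_vs_transfer(v, bytes_per_entry=4):
    """Return (relaxations, bytes transferred, relaxations per byte)."""
    relaxations = v ** 3                      # one min/add per (k, i, j) triple
    transfer = 2 * v * v * bytes_per_entry    # matrix copied to the GPU and back (assumed)
    return relaxations, transfer, relaxations / transfer

for v in (50, 100, 200, 300):
    ops, moved, ratio = work_vs_transfer(v)
    print(f"V={v:>3}: {ops:>10,} relaxations, {moved:>8,} bytes, {ratio:.2f} per byte")
```

Under these assumptions the ratio equals `V / 8`, so it grows linearly with `V`: doubling the vertex count doubles the work available to amortize each transferred byte.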