
Conversation

@LoserCheems
Collaborator

Summary
Adds benchmark-style test coverage for the BLAS level‑1 dot kernel so Python, PyTorch, Triton, and (optionally) CuTe implementations can be compared under a consistent harness. This closes the testing gap tracked in issue #19 and updates the README to show that dot now has an associated test.

Design
Follows the existing benchmarking pattern used elsewhere in the project. A factory creates structured input pairs (x, y) on a given device/dtype, and kernel_course.testing is used to collect implementations, run timed benchmarks, and compute FLOPs. Rather than asserting correctness directly here, the focus is on performance comparison across backends using the same shapes, devices, and dtypes.
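
A minimal sketch of the input factory, assuming a plain callable that returns the (x, y) pair for a requested size, device, and dtype (the exact factory interface expected by kernel_course.testing is an assumption here):

```python
import torch


def make_dot_inputs(n: int, device: str, dtype: torch.dtype):
    """Hypothetical factory: build a matched (x, y) pair for the dot kernel.

    Both vectors are created on the same device/dtype so every backend is
    timed against identical inputs.
    """
    x = torch.linspace(0.0, 1.0, n, device=device, dtype=dtype)
    y = torch.linspace(1.0, 0.0, n, device=device, dtype=dtype)
    return x, y
```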

Changes

  • Added test_dot.py which:
    • imports the Python, PyTorch, Triton, and CuTe dot helpers, each guarded with a feature flag (HAS_PYTORCH, HAS_TRITON, HAS_CUTE) so the suite degrades gracefully when a backend is unavailable (a sketch of this pattern follows the list),
    • defines a factory that builds 1D tensors via torch.linspace on the requested device and dtype,
    • uses testing.get_impls and testing.run_benchmarks to time each available backend implementation.
  • Updated the BLAS table in README.md to mark the Test column for dot as ✅ and link to test_dot.py.
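
The guarded-import pattern from the first bullet might look roughly like the following; the module path kernel_course.blas.dot and the helper names are illustrative assumptions, not the actual layout of the repository:

```python
# Each optional backend is imported behind a flag so a missing dependency
# (e.g. Triton on a CPU-only machine) disables that column of the benchmark
# instead of failing test collection. HAS_PYTORCH follows the same pattern.
try:
    from kernel_course.blas.dot import dot_triton  # hypothetical module path
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

try:
    from kernel_course.blas.dot import dot_cute  # hypothetical module path
    HAS_CUTE = True
except ImportError:
    HAS_CUTE = False
```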

Implementation notes

  • Benchmarks are parameterized over device (cuda and mps, with skip markers if unavailable), dtype (float32, float16, bfloat16), and problem size (2^4, 2^8, 2^16) to cover small and moderately large vectors; the combined setup is sketched after this list.
  • FLOP count is set to 2 * numel to match the dot product’s multiply‑add cost model.
  • BenchmarkConfig(warmup=3, repeat=1_000) balances stability and runtime, providing enough iterations to smooth out noise while keeping the test run manageable.
  • Results are printed via testing.show_benchmarks(results), which is consistent with existing benchmark‑style tests in the repo.
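
Putting these notes together, the benchmark test might be shaped like the sketch below. BenchmarkConfig(warmup=3, repeat=1_000), get_impls, run_benchmarks, and show_benchmarks are named in this PR, but their exact signatures, the keyword arguments, and the pytest parameter names are assumptions:

```python
import pytest
import torch

from kernel_course import testing            # project helper module named in this PR
from kernel_course.testing import BenchmarkConfig  # exact signature assumed


@pytest.mark.parametrize("device", ["cuda", "mps"])
@pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
@pytest.mark.parametrize("n", [2**4, 2**8, 2**16])
def test_dot_benchmark(device, dtype, n):
    # (x, y) come from the factory sketched in the Design section above.
    x, y = make_dot_inputs(n, device, dtype)

    # Collect whichever backend implementations imported successfully.
    impls = testing.get_impls()  # exact call assumed

    # 2 * n FLOPs: one multiply and one add per element of the dot product.
    results = testing.run_benchmarks(
        impls,
        args=(x, y),
        flops=2 * n,
        config=BenchmarkConfig(warmup=3, repeat=1_000),
    )
    testing.show_benchmarks(results)
```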

Tests

  • test_dot_benchmark runs as part of pytest tests/test_dot.py (or pytest tests/) and will automatically skip device configurations that are not available on the current machine (one possible skip guard is sketched after this list).
  • Backends are pulled in only if their corresponding modules import successfully, avoiding hard failures when, for example, Triton or CuTe is not installed.
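
The device skips described above can be implemented with standard PyTorch availability checks; this sketch shows one common way to do it and is not necessarily what the merged test uses:

```python
import pytest
import torch


def require_device(device: str):
    """Skip a benchmark case when the requested accelerator is absent."""
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("CUDA is not available on this machine")
    if device == "mps" and not torch.backends.mps.is_available():
        pytest.skip("MPS is not available on this machine")
```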

Documentation

  • README has been updated so the dot row now shows a ✅ in the Test column and links to test_dot.py, keeping the operator matrix aligned with actual coverage.
  • The existing dot.md “Testing” section that references test_dot.py is now accurate with this file in place.

Checklist

  • Provides a parameterized benchmark over CUDA and MPS devices with multiple dtypes and sizes to compare the Python, PyTorch, Triton, and CuTe implementations of the dot kernel.
  • Updates the kernel summary so dot now shows an available unit test, keeping the documentation aligned with actual test coverage.
@LoserCheems merged commit dedcc48 into main on Dec 1, 2025
2 checks passed