[PERFORMANCE OPTIMIZATION] add dot test script #49
## Summary

Adds benchmark-style test coverage for the BLAS level-1 `dot` kernel so the Python, PyTorch, Triton, and (optionally) CuTe implementations can be compared under a consistent harness. This closes the testing gap tracked in issue #19 and updates the README to show that `dot` now has an associated test.
## Design

Follows the existing benchmarking pattern used elsewhere in the project. A factory creates structured input pairs `(x, y)` on a given device/dtype, and `kernel_course.testing` is used to collect implementations, run timed benchmarks, and compute FLOPs. Rather than asserting correctness directly here, the focus is on performance comparison across backends using the same shapes, devices, and dtypes.
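The pattern can be illustrated with a minimal self-contained sketch. The real harness lives in `kernel_course.testing`; the `factory`, `run_benchmarks`, and `IMPLS` names below are illustrative stand-ins, not the project's actual API:

```python
import time

def dot_python(x, y):
    """Pure-Python reference dot product."""
    return sum(a * b for a, b in zip(x, y))

# Registry of available implementations; the real suite guards each
# backend behind a feature flag before adding it here.
IMPLS = {"python": dot_python}

def factory(n):
    """Build a structured (x, y) input pair of length n. The real
    factory uses torch.linspace on the requested device and dtype."""
    x = [i / n for i in range(n)]
    y = [1.0 - i / n for i in range(n)]
    return x, y

def run_benchmarks(impls, args, warmup=3, repeat=100):
    """Time each implementation on identical inputs; report FLOP/s."""
    flops = 2 * len(args[0])  # one multiply + one add per element
    results = {}
    for name, fn in impls.items():
        for _ in range(warmup):  # warm caches/JIT before timing
            fn(*args)
        start = time.perf_counter()
        for _ in range(repeat):
            fn(*args)
        elapsed = (time.perf_counter() - start) / repeat
        results[name] = flops / elapsed
    return results

x, y = factory(1 << 8)
print(run_benchmarks(IMPLS, (x, y)))
```

Because every backend is timed on the same `(x, y)` pair, the numbers are directly comparable even though no correctness assertion is made here.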
## Changes

- New `dot` test helpers, each guarded with a feature flag (`HAS_PYTORCH`, `HAS_TRITON`, `HAS_CUTE`) so the suite degrades gracefully when a backend is unavailable.
- A `factory` that builds 1D tensors via `torch.linspace` on the requested `device` and `dtype`.
- Uses `testing.get_impls` and `testing.run_benchmarks` to time each available backend implementation.
- README: marks `dot` as ✅ and links to test_dot.py.
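The feature-flag guard can be sketched as follows; the flag names come from the list above, while the `get_impls` body is a hypothetical illustration of the degrade-gracefully behavior:

```python
# Each backend flag is set by attempting the import, so a missing
# dependency disables only that backend rather than the whole suite.
try:
    import torch
    HAS_PYTORCH = True
except ImportError:
    HAS_PYTORCH = False

try:
    import triton  # noqa: F401
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

def get_impls():
    """Collect only the implementations whose backend imported cleanly."""
    impls = {"python": lambda x, y: sum(a * b for a, b in zip(x, y))}
    if HAS_PYTORCH:
        impls["pytorch"] = lambda x, y: torch.dot(x, y)
    return impls
```

On a machine without Triton or CuTe installed, the returned dict simply omits those entries and the benchmark loop never touches them.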
## Implementation notes

- Parametrized over device (`cuda` and `mps`, with skip markers if unavailable), dtype (`float32`, `float16`, `bfloat16`), and problem size (`2^4`, `2^8`, `2^16`) to cover small and moderately large vectors.
- FLOPs are computed as `2 * numel` to match the dot product's multiply-add cost model.
- `BenchmarkConfig(warmup=3, repeat=1_000)` balances stability and runtime, providing enough iterations to smooth out noise while keeping the test run manageable.
- Results are printed via `testing.show_benchmarks(results)`, which is consistent with existing benchmark-style tests in the repo.
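A hypothetical sketch of that parametrization (the real test also sweeps dtypes, uses `BenchmarkConfig(warmup=3, repeat=1_000)`, and times each backend where the placeholder assertion sits):

```python
import pytest

DEVICES = ["cuda", "mps"]
SIZES = [2**4, 2**8, 2**16]

def device_available(device):
    """Return True only when the backend can actually run on `device`."""
    try:
        import torch
    except ImportError:
        return False
    if device == "cuda":
        return torch.cuda.is_available()
    if device == "mps":
        return torch.backends.mps.is_available()
    return False

@pytest.mark.parametrize("n", SIZES)
@pytest.mark.parametrize("device", DEVICES)
def test_dot_benchmark(device, n):
    if not device_available(device):
        pytest.skip(f"{device} not available on this machine")
    flops = 2 * n  # multiply-add cost model for a length-n dot product
    assert flops > 0  # placeholder; the real test runs timed benchmarks here
```

Skipping inside the test body (rather than failing) is what lets the same file run on CUDA boxes, Apple Silicon, and CPU-only CI without configuration.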
## Tests

`test_dot_benchmark` runs as part of `pytest tests/test_dot.py` (or `pytest tests/`) and automatically skips device configurations that are not available on the current machine.
## Documentation

The `dot` row now shows a ✅ in the Test column and links to test_dot.py, keeping the operator matrix aligned with actual coverage.
## Checklist

- [x] Closes dot test coverage and fixtures #19
- [x] README updated (`dot`)