Description
Problem statement
The BLAS level-1 dot kernel has no CuTe backend implementation in this project. In the README BLAS table, the CuTe column for dot is empty, which prevents a full cross-backend comparison for one of the most fundamental BLAS-1 operations.
Without a CuTe dot kernel:
- users cannot see how dot products are expressed with CuTe primitives and thread scheduling,
- there is no CuTe baseline for performance comparison against PyTorch and Triton dot products,
- CuTe-based examples remain incomplete for basic BLAS-1 coverage.
Proposed solution
Implement a CuTe-based dot kernel that matches the mathematical semantics of the Python reference and fits within the project’s backend structure.
Concretely:
- Add a CuTe dot product kernel in the appropriate CuTe backend directory (once established), implementing $z = x^\top y$ for 1D vectors.
- Use CuTe constructs suitable for reductions and vector operations.
- Align the public entry-point API with other backends to allow uniform dispatch.
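As a rough illustration of the semantics the kernel should match, the sketch below computes $z = x^\top y$ in plain Python in two ways: a sequential reference, and a two-stage version that forms per-block partial sums before combining them, which is the usual shape of a GPU reduction. It is not CuTe code and not the project's actual reference. The names `dot_reference`, `dot_blocked`, and `block_size` are hypothetical and chosen only for this example.

```python
import math


def dot_reference(x, y):
    """Sequential reference: z = sum_i x[i] * y[i] for 1D sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # math.fsum tracks rounding error, giving an accurate baseline.
    return math.fsum(a * b for a, b in zip(x, y))


def dot_blocked(x, y, block_size=4):
    """Two-stage reduction that mimics a GPU kernel's structure.

    Stage 1: each "block" reduces its own contiguous tile to a partial sum.
    Stage 2: the partial sums are combined into the final scalar.
    block_size is a hypothetical tuning parameter for this sketch.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    partials = []
    for start in range(0, len(x), block_size):
        stop = min(start + block_size, len(x))  # last tile may be short
        partials.append(sum(x[i] * y[i] for i in range(start, stop)))
    return sum(partials)
```

For example, `dot_blocked([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], block_size=2)` returns `32.0`, the same value as `dot_reference` on those inputs. In general the two can differ in the last few bits, because the blocked version adds the products in a different order.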
Alternatives considered
Alternatives such as omitting CuTe dot or reusing other backends for performance comparisons would:
- limit the educational value of demonstrating reductions in CuTe,
- leave the CuTe column incomplete in the README BLAS table,
- reduce CuTe’s role as a first-class backend in the project.
Implementation details
- Decide on file layout and build integration for CuTe kernels.
- Implement the dot product using CuTe abstractions for memory and thread scheduling.
- Ensure numerical equivalence with the Python reference and compatibility with the project’s testing and benchmarking utilities.
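One possible way to check numerical equivalence is a tolerance-based comparison against an accurate reference, since a parallel reduction sums in a different order than a sequential loop and will rarely match bit for bit. The helper below is a hypothetical sketch, not part of the project's testing utilities. `check_dot_backend`, its parameters, and the default tolerances are assumptions, and the defaults suit double precision; a float32 kernel would likely need looser values.

```python
import math
import random


def check_dot_backend(dot_fn, n=1024, rel_tol=1e-9, abs_tol=1e-9, seed=0):
    """Return True if dot_fn(x, y) matches an accurate reference.

    dot_fn is any callable taking two equal-length sequences of floats
    and returning their dot product (hypothetical backend entry point).
    """
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(n)]

    # Accurate baseline: fsum compensates for accumulated rounding error.
    expected = math.fsum(a * b for a, b in zip(x, y))
    actual = dot_fn(x, y)

    # abs_tol guards the case where cancellation leaves expected near zero.
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)
```

A backend wrapper could be passed in as `dot_fn`, and the same helper reused for each backend so that all of them are held to one tolerance.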
Use case
The CuTe dot kernel will:
- showcase a simple yet important reduction in CuTe,
- enable performance and implementation comparisons across backends,
- act as a basis for more advanced CuTe kernels (e.g. matrix multiplications).
Related work
- CuTe/CUTLASS examples of dot products and reductions.
- Standard BLAS ddot/sdot implementations.
Additional context
This issue complements the dot Python/PyTorch/Triton feature requests and contributes to full CuTe coverage of BLAS-1 operations.