[FEATURE REQUEST] dot CuTe kernel implementation #18

@LoserCheems

Description

Problem statement

The BLAS level-1 dot kernel has no CuTe backend implementation in this project. In the README BLAS table, the CuTe column for dot is empty, which prevents a full cross-backend comparison for one of the most fundamental BLAS-1 operations.

Without a CuTe dot kernel:

  • users cannot see how dot products are expressed with CuTe primitives and thread scheduling,
  • there is no CuTe baseline for performance comparison against PyTorch and Triton dot products,
  • CuTe-based examples remain incomplete for basic BLAS-1 coverage.

Proposed solution

Implement a CuTe-based dot kernel that matches the mathematical semantics of the Python reference and fits within the project’s backend structure.

Concretely:

  • Add a CuTe dot product kernel in the appropriate CuTe backend directory (once established), implementing $z = x^\top y$ for 1D vectors.
  • Use CuTe constructs suitable for reductions and vector operations.
  • Align the public entry-point API with other backends to allow uniform dispatch.
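As a point of reference for the semantics above, the following is a minimal pure-Python sketch of $z = x^\top y = \sum_i x_i y_i$ together with one possible shape for uniform dispatch. The names `dot_reference`, `BACKENDS`, and `dot` are hypothetical and chosen only for illustration; the project's actual reference function and entry-point signature may differ.

```python
from typing import Callable, Dict, Sequence


def dot_reference(x: Sequence[float], y: Sequence[float]) -> float:
    """Compute z = x^T y for two 1D vectors of equal length."""
    if len(x) != len(y):
        raise ValueError(f"length mismatch: {len(x)} vs {len(y)}")
    # Sequential left-to-right accumulation of the elementwise products.
    return sum(a * b for a, b in zip(x, y))


# Hypothetical backend registry: a CuTe wrapper would be registered here
# under its own key once the kernel exists.
BACKENDS: Dict[str, Callable[[Sequence[float], Sequence[float]], float]] = {
    "python": dot_reference,
}


def dot(x: Sequence[float], y: Sequence[float], backend: str = "python") -> float:
    """Dispatch to the requested backend with one shared signature."""
    try:
        impl = BACKENDS[backend]
    except KeyError:
        raise NotImplementedError(f"dot has no {backend!r} backend yet") from None
    return impl(x, y)
```

With a layout like this, adding the CuTe kernel would amount to registering one more callable that accepts the same two vectors and returns a scalar.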

Alternatives considered

Alternatives such as omitting CuTe dot or reusing other backends for performance comparisons would:

  • limit the educational value of demonstrating reductions in CuTe,
  • leave the CuTe column incomplete in the README BLAS table,
  • reduce CuTe’s role as a first-class backend in the project.

Implementation details

  • Decide on file layout and build integration for CuTe kernels.
  • Implement the dot product using CuTe abstractions for memory and thread scheduling.
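  • Ensure the result agrees with the Python reference within a floating-point tolerance (a parallel reduction sums in a different order, so bitwise equality is not expected) and remains compatible with the project’s testing and benchmarking utilities.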
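The reduction structure such a kernel typically follows can be emulated on the host before any CuTe code is written. The sketch below is an assumption about the kernel's shape, not CuTe code: each emulated thread accumulates a strided slice of the products, then the per-thread partial sums are combined by a tree reduction, as a block would do in shared memory. The names `block_dot` and `matches_reference`, the thread count, and the tolerance are all hypothetical.

```python
import math
from typing import Sequence


def block_dot(x: Sequence[float], y: Sequence[float], num_threads: int = 8) -> float:
    """Emulate a single thread block computing a dot product."""
    if len(x) != len(y):
        raise ValueError(f"length mismatch: {len(x)} vs {len(y)}")
    if num_threads <= 0 or num_threads & (num_threads - 1):
        raise ValueError("num_threads must be a positive power of two")

    # Stage 1: "thread" t accumulates elements t, t + T, t + 2T, ...
    partial = [0.0] * num_threads
    for t in range(num_threads):
        for i in range(t, len(x), num_threads):
            partial[t] += x[i] * y[i]

    # Stage 2: tree reduction over the partial sums, halving the
    # number of active "threads" at every step.
    stride = num_threads // 2
    while stride > 0:
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]


def matches_reference(x: Sequence[float], y: Sequence[float],
                      rel_tol: float = 1e-9) -> bool:
    """Compare against an accurately summed reference within a tolerance."""
    expected = math.fsum(a * b for a, b in zip(x, y))
    return math.isclose(block_dot(x, y), expected, rel_tol=rel_tol, abs_tol=1e-9)
```

Because the summation order differs from a sequential loop, the comparison uses a tolerance rather than exact equality; the values chosen here are placeholders that the project's test utilities would replace.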

Use case

The CuTe dot kernel will:

  • showcase a simple yet important reduction in CuTe,
  • enable performance and implementation comparisons across backends,
  • act as a basis for more advanced CuTe kernels (e.g. matrix multiplications).

Related work

  • CuTe/CUTLASS examples of dot products and reductions.
  • Standard BLAS ddot/sdot implementations.

Additional context

This issue complements the dot Python/PyTorch/Triton feature requests and contributes to full CuTe coverage of BLAS-1 operations.
