[PERFORMANCE OPTIMIZATION] add dot pytorch kernel #47
Conversation
Provides a reusable tensor dot routine to standardize future vector operations.
Updates the BLAS kernel matrix so the PyTorch column reflects the completed dot implementation, keeping the support table accurate for users.
Pull request overview
This PR adds a PyTorch implementation of the BLAS level-1 dot kernel to compute dot products via a consistent backend API. The implementation fills the PyTorch column in the BLAS operations table, allowing users to compute z = x^T y using kernel_course.pytorch_ops.dot.dot(x, y).
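As a quick illustration of that call path (the module path comes from the PR description; the example tensors are hypothetical):

```python
import torch

from kernel_course.pytorch_ops.dot import dot

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

z = dot(x, y)  # z = x^T y -> tensor(32.)  (1*4 + 2*5 + 3*6)
```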
Key changes:
- Added a `dot(x, y)` function using `torch.sum(torch.mul(x, y))` operations
- Updated README.md to mark PyTorch `dot` as complete (✅)
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| kernel_course/pytorch_ops/dot.py | Implements dot product computation using PyTorch elementwise multiply and sum operations |
| README.md | Updates BLAS table to mark PyTorch dot implementation as complete |
```python
    Returns:
        torch.Tensor: The dot product of `x` and `y`.
    """
```
Copilot AI · Dec 1, 2025
The PyTorch implementation does not flatten the input tensors before computing the dot product, unlike the Python reference implementation. The Python reference (lines 19-20) uses x.reshape(-1) and y.reshape(-1) to ensure inputs are 1D vectors.
Without flattening, this implementation will:
- Produce incorrect results for multi-dimensional tensors
- Not be numerically equivalent to the Python reference
- Potentially fail with broadcasting errors for certain input shapes
Recommendation: Add tensor flattening before the multiplication:
```python
x = x.reshape(-1)
y = y.reshape(-1)
z = torch.sum(torch.mul(x, y))
```
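To illustrate the reviewer's point, a small hypothetical example (shapes and values are made up) of how skipping the flatten diverges from the 1D reference once broadcasting kicks in:

```python
import torch

x = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1)
y = torch.tensor([4.0, 5.0, 6.0])        # shape (3,)

# Without flattening, x * y broadcasts to shape (3, 3), so the sum is 90.
no_flatten = torch.sum(torch.mul(x, y))

# With flattening, the result matches the 1D dot product: 1*4 + 2*5 + 3*6 = 32.
flattened = torch.sum(torch.mul(x.reshape(-1), y.reshape(-1)))
reference = torch.dot(x.reshape(-1), y.reshape(-1))
```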
```diff
 | [scal](./docs/scal.md) | scale vector | $y = \alpha y$ | $n$ | $2n$ | [✅](./kernel_course/python_ops/scal.py) | [✅](./kernel_course/pytorch_ops/scal.py) | [✅](./kernel_course/triton_ops/scal.py) | ❌ | [✅](./tests/test_scal.py) |
 | [axpby](./docs/axpby.md) | update vector | $y = \alpha x + \beta y$ | $3n$ | $3n$ | [✅](./kernel_course/python_ops/axpby.py) | [✅](./kernel_course/pytorch_ops/axpby.py) | [✅](./kernel_course/triton_ops/axpby.py) | ❌ | [✅](./tests/test_axpby.py) |
-| [dot](./docs/dot.md) | dot product | $z = x^\top y$ | $2n$ | $2n$ | [✅](./kernel_course/python_ops/dot.py) | ❌ | ❌ | ❌ | ❌ |
+| [dot](./docs/dot.md) | dot product | $z = x^\top y$ | $2n$ | $2n$ | [✅](./kernel_course/python_ops/dot.py) | [✅](./kernel_course/pytorch_ops/dot.py) | ❌ | ❌ | ❌ |
```
Copilot AI · Dec 1, 2025
This PR marks the PyTorch dot implementation as complete (✅) but does not include a corresponding test file. All other complete operations (copy, swap, scal, axpby) have test files (test_copy.py, test_swap.py, test_scal.py, test_axpby.py) that validate implementations across backends.
The PR description acknowledges this: "A follow‑up tests/test_dot.py will treat the Python reference implementation as ground truth..." However, marking the implementation as complete without tests is inconsistent with the existing convention.
Recommendation: Either:
- Include `tests/test_dot.py` in this PR before marking as ✅, or
- Keep the Test column as ❌ until the test file is added
```python
    y: torch.Tensor,
) -> torch.Tensor:
    """
    Computes the dot product of two tensors using PyTorch operations.
```
Copilot AI · Dec 1, 2025
[nitpick] The docstring description lacks detail compared to similar operations. Other PyTorch ops include more specific descriptions of what the operation does with the inputs.
For example:
- `copy.py`: "Copies the contents of tensor `x` into tensor `y` using PyTorch operations."
- `scal.py`: "Scales the contents of tensor `y` by a scalar `alpha` using PyTorch operations."
- `swap.py`: "Swaps the contents of tensor `x` with tensor `y` using PyTorch operations."
Recommendation: Update the description to be more specific:

```diff
-    Computes the dot product of two tensors using PyTorch operations.
+    Computes the dot product of two tensors by multiplying corresponding elements
+    and summing the results using PyTorch operations.
```

This clarifies what "dot product" means computationally and maintains consistency with the existing documentation style.
Summary
Adds a PyTorch implementation of the BLAS level-1 `dot` kernel so users can compute dot products via a consistent backend API instead of calling raw PyTorch ops directly. This fills the PyTorch column for `dot` in the BLAS table and aligns with the Python reference added in earlier work.

Design
Implements `kernel_course.pytorch_ops.dot.dot(x, y)` using idiomatic PyTorch operations: elementwise multiplication followed by a reduction. The function accepts two tensors, multiplies them elementwise, and sums the result into a scalar `torch.Tensor`. The interface mirrors the Python `dot` helper and other PyTorch kernels (e.g., `copy`, `swap`) so switching backends is straightforward.
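A minimal sketch of what such a kernel looks like, reconstructed from the signature, docstring fragments, and `torch.sum(torch.mul(...))` call quoted elsewhere in this PR; the flattening lines come from the review suggestion rather than the merged code:

```python
import torch


def dot(
    x: torch.Tensor,
    y: torch.Tensor,
) -> torch.Tensor:
    """
    Computes the dot product of two tensors by multiplying corresponding
    elements and summing the results using PyTorch operations.

    Returns:
        torch.Tensor: The dot product of `x` and `y`.
    """
    # Flattening (as suggested in review) keeps the result equivalent to the
    # 1D Python reference for any input shape.
    x = x.reshape(-1)
    y = y.reshape(-1)
    return torch.sum(torch.mul(x, y))
```

Changes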
- Adds a `dot(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor` helper that computes `torch.sum(torch.mul(x, y))`.
- Updates README.md to mark the PyTorch `dot` implementation as ✅ and link directly to dot.py.

Implementation notes
- `x` and `y` may be any shapes that broadcast to a common shape before the elementwise multiply.
- The reduction uses `torch.sum`, producing a scalar tensor on the same device and with the same dtype as the inputs, which makes it suitable as a reference for later Triton/CuTe implementations.

Tests
- Verified manually against `torch.dot(x.flatten(), y.flatten())` and `torch.sum(x * y)` on small tensors.
- A follow-up `tests/test_dot.py` will treat the Python reference implementation as ground truth and validate the PyTorch backend against it across devices/dtypes, and then be wired into `pytest tests/` (a sketch of such a test is shown below).
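As a rough illustration of that plan only: the import paths follow the README links, but the reference function name and the parametrization are assumptions, not the actual test file.

```python
import pytest
import torch

# Paths follow the README links; the reference function name `dot` is assumed.
from kernel_course.python_ops.dot import dot as python_dot
from kernel_course.pytorch_ops.dot import dot as pytorch_dot


@pytest.mark.parametrize("n", [1, 16, 1024])
@pytest.mark.parametrize("dtype", [torch.float32, torch.float64])
def test_dot_matches_python_reference(n, dtype):
    torch.manual_seed(0)
    x = torch.randn(n, dtype=dtype)
    y = torch.randn(n, dtype=dtype)

    expected = python_dot(x, y)   # Python reference treated as ground truth
    actual = pytorch_dot(x, y)

    torch.testing.assert_close(actual, torch.as_tensor(expected, dtype=actual.dtype))
```

Documentation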
The README table now shows which `dot` kernels are available; dot.md already describes the shared interface and how to run the dot tests once they are added.

Checklist
- Linked issue: `dot` PyTorch kernel implementation #16
- Tests: deferred to a follow-up `test_dot.py` PR