Add cache hints #1312

Open
j-stephan opened this issue May 11, 2021 · 6 comments

Comments

@j-stephan
Member

This is a follow-up to #18. We require portable load/store functionality with cache hints (such as __ldg or __stwb on CUDA). This should integrate nicely with #1249 and can probably be solved in the same PR or in a follow-up PR.

@j-stephan j-stephan added this to the Version 0.8.0 milestone May 11, 2021
@j-stephan j-stephan added this to To do in Release 0.8 via automation May 11, 2021
This was referenced May 11, 2021
@j-stephan j-stephan removed this from To do in Release 0.8 Nov 10, 2021
@j-stephan j-stephan added this to To do in Release 0.9 via automation Nov 10, 2021
@j-stephan j-stephan removed this from the Version 0.9.0 (I/2022) milestone Mar 29, 2022
@j-stephan j-stephan removed this from To do in Release 0.9 Mar 29, 2022
@j-stephan j-stephan added this to To do in Release 1.0 via automation Mar 29, 2022
@bernhardmgruber
Member

Btw, this can be handled really nicely with std::mdspan and LLAMA accessors.

@j-stephan
Member Author

Since alpaka is about being explicit: Maybe we should add alpaka "intrinsics" that emulate the behavior of __ldg and friends?

@fwyzard
Contributor

fwyzard commented Dec 6, 2022

A comment based on our recent experience: __ldg has the downside of always returning a copy, not a const reference. So it cannot be used with an interface like:

T& x() {
  return *px;
}

T const& x() const {
  return __ldg(px);
}

It can be used like

T x() const {
  return __ldg(px);
}

but that can be expensive (or wrong) for complex types, and cannot be used to get the pointer to px.
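
A minimal host-side sketch of the problem described above. `load_readonly` is a hypothetical stand-in for `__ldg` (it only mimics the by-value return, not the cache behavior), and `ConstAccessor` and `x_ref` are illustrative names, not part of alpaka or CUDA:

```cpp
#include <cassert>

// Hypothetical stand-in for __ldg: like the intrinsic, it hands back the
// loaded value as a copy rather than a reference to the source object.
template <typename T>
T load_readonly(T const* p) {
    return *p;
}

template <typename T>
struct ConstAccessor {
    T const* px;

    // By-value accessor: the only signature that is safe with a copying load.
    T x() const { return load_readonly(px); }

    // By-reference accessor: requires a plain dereference, so the hinted
    // load cannot be used here.
    T const& x_ref() const { return *px; }
};

// Shows that the copy is detached from the source while the reference is not.
inline bool copy_is_detached() {
    int storage = 42;
    ConstAccessor<int> acc{&storage};

    int copy = acc.x();
    assert(copy == 42);

    // The copy lives at a different address; the reference still aliases storage.
    return &copy != &storage && &acc.x_ref() == &storage;
}
```

Returning `T const&` from the hinted load would bind the reference to a temporary, which is why the two accessors cannot share one signature.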

@bernhardmgruber
Member

I am really not sure whether we can provide/define a cross-platform behavior for architecture-specific intrinsics. Taking __ldg as an example: it is used for read-only data loads, which may go through a different cache hierarchy and have lower latency. What would be the equivalent for the CPU? x86 has the MOVNTDQA instruction for streaming loads, but is that actually the same? And what about other backends?

@j-stephan
Member Author

Intel has a SYCL extension for FPGAs called "load-store units". See here. Maybe it would be worthwhile to copy the concept into alpaka where it could look like this:

// We are inside an alpaka kernel
using ReadOnlyLSU = alpaka::LoadStoreUnit<alpaka::ReadOnly>;

auto val = ReadOnlyLSU::load(some_ptr); // If using CUDA call __ldg() underneath. Otherwise perform a normal load if there is no equivalent.
ReadOnlyLSU::store(some_ptr, val); // This should cause a compile-time error.
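
One way the idea above could be sketched in plain C++17. Everything in the `sketch` namespace is hypothetical (alpaka has no `LoadStoreUnit` today), and the CUDA path is only indicated in a comment so that the block stays compilable on the host:

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {
    // Hypothetical policy tags, named after the snippet above.
    struct ReadOnly {};
    struct ReadWrite {};

    template <typename Policy>
    struct LoadStoreUnit {
        template <typename T>
        static T load(T const* ptr) {
            // Assumption: a CUDA backend would call __ldg(ptr) here for the
            // ReadOnly policy and supported types. Without an equivalent
            // hint this is a normal load.
            return *ptr;
        }

        template <typename T>
        static void store(T* ptr, T const& value) {
            // Only checked when store() is actually instantiated, so merely
            // naming LoadStoreUnit<ReadOnly> does not trigger it.
            static_assert(!std::is_same_v<Policy, ReadOnly>,
                          "cannot store through a read-only load-store unit");
            *ptr = value;
        }
    };

    inline int demo() {
        using ReadOnlyLSU = LoadStoreUnit<ReadOnly>;
        using ReadWriteLSU = LoadStoreUnit<ReadWrite>;

        int data = 7;
        int val = ReadOnlyLSU::load(&data);
        ReadWriteLSU::store(&data, val + 1);
        // ReadOnlyLSU::store(&data, val);  // would fail to compile
        assert(data == 8);
        return data;
    }
}  // namespace sketch
```

Putting the policy in a template parameter keeps the check at compile time and leaves the call site identical across backends.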

> So it cannot be used with an interface like:
>
> T& x() {
>     return *px;
> }
>
> T const& x() const {
>     return __ldg(px);
> }

Wouldn't that be kind of illegal in CUDA? __ldg requires that the memory in question is read-only for the lifetime of the kernel. With an interface like that you couldn't guarantee that the user won't overwrite the value.

> It can be used like
>
> T x() const {
>     return __ldg(px);
> }
>
> but that can be expensive (or wrong) for complex types, and cannot be used to get the pointer to px.

__ldg() is restricted to fundamental scalar and vector types IIRC. If it gets more complex the compiler would probably complain anyway.

@fwyzard
Contributor

fwyzard commented Dec 6, 2022

The point was to give a small, self-contained example that shows the problem with the interface of __ldg. The actual implementation is significantly more complex.

> Wouldn't that be kind of illegal in CUDA? __ldg requires that the memory in question is read-only for the lifetime of the kernel. With an interface like that you couldn't guarantee that the user won't overwrite the value.

Actually, we can, because the object is either a mutable View or a ConstView at the kernel level.

> __ldg() is restricted to fundamental scalar and vector types IIRC. If it gets more complex the compiler would probably complain anyway.

Whether __ldg is actually used internally or not can depend on the type:

T x() const {
    if constexpr(can_use_ldg<T>) {
        return __ldg(px);
    } else {
        return *px;
    }
}

But the interface should be the same for all types, and using __ldg() for the fundamental types prevents returning the complex ones by reference.
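
The snippet above leaves `can_use_ldg` undefined. A rough host-side sketch of what such a trait might look like follows. It is an assumption, not the actual implementation: it approximates "has a `__ldg` overload" as "non-bool arithmetic type" and ignores CUDA's vector types. `ConstView` here is likewise only illustrative:

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical trait: a coarse approximation of the types __ldg accepts.
template <typename T>
inline constexpr bool can_use_ldg =
    std::is_arithmetic_v<T> && !std::is_same_v<std::remove_cv_t<T>, bool>;

template <typename T>
struct ConstView {
    T const* px;

    // Same by-value interface for every T; only the load differs.
    T x() const {
        if constexpr (can_use_ldg<T>) {
            return *px;  // assumption: a CUDA build would return __ldg(px) here
        } else {
            return *px;  // plain load, copies the whole object
        }
    }
};

static_assert(can_use_ldg<float>);
static_assert(!can_use_ldg<std::string>);

inline bool demo_view() {
    float f = 1.5f;
    std::string s = "alpaka";
    ConstView<float> fv{&f};
    ConstView<std::string> sv{&s};
    assert(fv.x() == 1.5f);
    return sv.x() == "alpaka";
}
```

The `std::string` case illustrates the cost mentioned earlier in the thread: a uniform by-value interface forces a full copy of complex types.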

So far we have managed to make things work using the __restrict__ qualifier instead of explicitly using __ldg.
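
For illustration, a minimal sketch of the `__restrict__` approach, written as a host function so it compiles without nvcc. In CUDA the same qualifiers would go on kernel parameters, where a pointer that is both `const` and `__restrict__` makes it more likely that the compiler emits read-only cache loads on its own. The function names are made up for this example, and `__restrict__` is a compiler extension (GCC, Clang, nvcc), not standard C++:

```cpp
#include <cassert>
#include <cstddef>

// `in` is const and both pointers are __restrict__, i.e. the caller promises
// that the two ranges do not alias. That promise is what allows the compiler
// to treat `in` as read-only for the duration of the call.
inline void scale(float const* __restrict__ in,
                  float* __restrict__ out,
                  std::size_t n,
                  float factor) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}

inline float demo_scale() {
    float const in[3] = {1.0f, 2.0f, 3.0f};
    float out[3] = {};
    scale(in, out, 3, 2.0f);
    assert(out[1] == 4.0f);
    return out[0] + out[1] + out[2];
}
```

Unlike an explicit `__ldg` call, this leaves the accessor signatures untouched, so returning `T const&` remains possible.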

@bernhardmgruber bernhardmgruber removed this from To do in Release 1.0 Dec 9, 2022

3 participants