Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cudaLaunchHostFunc #4338

Merged
merged 4 commits into from
Dec 3, 2020
Merged

Add cudaLaunchHostFunc #4338

merged 4 commits into from
Dec 3, 2020

Conversation

leofang
Copy link
Member

@leofang leofang commented Nov 25, 2020

According to the CUDA Runtime API documentation, cudaStreamAddCallback will be deprecated in favor of this new function, added since CUDA 10.0:

This function is slated for eventual deprecation and removal. If you do not require the callback to execute in case of a device error, consider using cudaLaunchHostFunc. Additionally, this function is not supported with cudaStreamBeginCapture and cudaStreamEndCapture, unlike cudaLaunchHostFunc.

TODO:

  • Skip tests for CUDA 9.x
  • Add stubs

Comment on lines -96 to +113
if not cuda.runtime.is_hip:
stream = cuda.Stream.null
else:
# adding callbacks to the null stream in HIP would segfault...
stream = cuda.Stream()
stream = self.stream
Copy link
Member Author

@leofang leofang Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the mistake I made in #3835 🤦🏻‍♂️ self.stream was totally ignored...

@leofang
Copy link
Member Author

leofang commented Nov 26, 2020

/test

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me overall, I have a few questions only, but thanks for working on it @leofang !

Comment on lines +232 to +233
callback (function): Callback function. It must take only one
argument (user data object), and returns nothing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly a naive question, but add_callback supports/requires 3 arguments: stream, error status and user data. Shouldn't it be the same here? I think for the test you proposed in #4322 (comment) it would be very useful to have the stream so that we can verify that it's indeed happening on the stream we're expecting.

Copy link
Member Author

@leofang leofang Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I am following the requirement from cudaLaunchHostFunc that the host function should only have 1 argument instead of 3.

For the test we would like to do there, we could either use add_callback + a 3-arg callback, or use this new function launch_host_func + a 1-arg callback, with the stream added as part of the arg:

    data = []
    data.append(stream)
    stream.launch_host_func(callback, data)

and check stream.ptr in the callback function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I missed the fact that's how cudaLaunchHostFunc should behave out of my ignorance of that function, thanks for clarifying!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, Peter! I didn't know this function before either. I became aware of it only because your PTDS PR made me wonder if it's possible to test it programmatically other than eyeballing nvvp/nsys, and I started browsing the Runtime API doc 😄

stream.launch_host_func(
lambda t: out.append(t[0]), (i, numpy_array))

stream.synchronize()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this test here be performed on an explicit stream too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in fact done in the other test added below. It's just that the stream pointer is wrapped by ExternalStream there, but it should achieve what you had in mind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I should have checked more than what GH showed to see how the stream was created, thanks for pointing that out and sorry for the false alarm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking! It's always good to have extra pairs of eyes 🙏

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @leofang !

@takagi
Copy link
Member

takagi commented Dec 1, 2020

pfnCI, test this please.

@chainer-ci
Copy link
Member

Jenkins CI test (for commit d7dc7d3, target branch master) failed with status FAILURE.

Comment on lines +827 to +830
if _is_hip_environment:
raise RuntimeError('This feature is not supported on HIP')
if CUDA_VERSION < 10000:
raise RuntimeError('This feature is only supported on CUDA 10.0+')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takagi by inspecting the generated C file I noticed Cython (at least 0.29.21 that I'm using) has a nice property that for the if conditions that can be determined at compile time, Cython will be smart enough to determine the rest of code is dead and eliminate it! (So for RTD and HIP, this function actually stops right after raising.) As a result, it's fine even if we don't add any stubs (for example I forgot to add stubs for CUDA 9.2 😂) Can we rely on this behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds so nice! Would you check if it works with Cython 0.28.0 as, currently, CuPy requires Cython 0.28.0 or later to build it from its source?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @takagi, I just checked Cython 0.28.0 will also eliminate the dead code! This is the warning thrown during cythonizing (in both versions):

[50/59] Cythonizing cupy_backends/cuda/api/runtime.pyx
warning: cupy_backends/cuda/api/runtime.pyx:832:16: Unreachable code

Though I think 0.28 is too outdated and should be avoided as much as possible (#4148).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I ask you to post a new issue so other core members can find that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, @takagi, see #4393.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@leofang
Copy link
Member Author

leofang commented Dec 1, 2020

Jenkins, test this please

@chainer-ci
Copy link
Member

Jenkins CI test (for commit d7dc7d3, target branch master) succeeded!

@takagi
Copy link
Member

takagi commented Dec 3, 2020

LGTM!

@takagi takagi merged commit 60f37e8 into cupy:master Dec 3, 2020
@leofang leofang deleted the launch_host branch December 3, 2020 14:28
@leofang
Copy link
Member Author

leofang commented Dec 4, 2020

Thanks @takagi @pentschev!

@leofang leofang mentioned this pull request Jan 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants