Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add post-kernel-launch assertions #119

Merged

Conversation

pcanal
Copy link
Contributor

@pcanal pcanal commented Jan 26, 2021

No description provided.

@mrguilima
Copy link
Contributor

Curious to know the reasoning behind this reordering of includes.

@pcanal
Copy link
Contributor Author

pcanal commented Jan 26, 2021

@sethrj Is there a way to customize the CI build for this run? In the default settings the test "app/demo-rasterizer" is not run. (It needs CELERITAS_USE_VecGeom enabled).

@pcanal
Copy link
Contributor Author

pcanal commented Jan 26, 2021

Curious to know the reasoning behind this reordering of includes.

This is unrelated to this PR. It is already in the master via: #118
where you can find how this is a work-around an nvling bug (present in cuda 10 and fixed in cuda 11).

@pcanal pcanal force-pushed the ChechCIvsdemo-rasterizerRDemoKernel branch from a4cbcf7 to 2650d44 Compare January 26, 2021 20:07
@amandalund
Copy link
Contributor

@pcanal isn't it run in the cuda-11 build?

@pcanal
Copy link
Contributor Author

pcanal commented Jan 26, 2021

isn't it run in the cuda-11 build?

Indeed it is. I missed that there was 2 builds. So on wc.fnal.gov+cuda10 it fails in the CI with cuda11 it works. I will try locally with cuda 11.

@sethrj
Copy link
Member

sethrj commented Jan 26, 2021

@pcanal Looks like something got mangled pretty badly in the title, can you fix it? It looks like this is just adding the call to check for a launch error? Would you be up for checking the other places in the code where we launch kernels, and make sure they also have the error peeking?

@pcanal pcanal changed the title Chech c ivsdemo rasterizer r demo kernel Check if demo-rasterizer works properly on the CI or silently fails. Jan 27, 2021
@pcanal
Copy link
Contributor Author

pcanal commented Jan 27, 2021

Would you be up for checking the other places in the code where we launch kernels, and make sure they also have the error peeking?

Sure ... But first I want to try to understand why this kernel fails on wc.fnal.gov with both cuda 10 and 11 and works fine on the CI with cuda 11.

See commit message for error detail.

@sethrj
Copy link
Member

sethrj commented Jan 27, 2021

Ahh I get it now.

@pcanal
Copy link
Contributor Author

pcanal commented Jan 27, 2021

Note: it seems to use to work .. so bisecting now.

@pcanal
Copy link
Contributor Author

pcanal commented Jan 27, 2021

It seems that the first commit when it failed is c63c329.
I still need to figure out 'why' this is ...

@sethrj
Copy link
Member

sethrj commented Jan 27, 2021

It seems that the first commit when it failed is c63c329.
I still need to figure out 'why' this is ...

Uh oh. The thing that stands out to me is adding separable compilation to the main library.

@pcanal
Copy link
Contributor Author

pcanal commented Jan 27, 2021

I agree ... (the build fails without it though) ... One odd part is "why" does it work on some machine (the CI, probably yours) and not mine ...

@sethrj
Copy link
Member

sethrj commented Jan 28, 2021

@pcanal Would you be up for changing this PR to add the requisite post-kernel-launch check to any other relevant locations? The need for that extra check was a good catch -- it was a bad assumption on my end that the device synchronize call would error if there had been a problem.

@pcanal
Copy link
Contributor Author

pcanal commented Jan 28, 2021

Would you be up for changing this PR to add the requisite post-kernel-launch check to any other relevant locations?

Yes, I will review and update the existing kernel launch to make sure the error are tested.

This is to see if CI see

arning: Cuda API error detected: cudaLaunchKernel returned (0x62)

warning: Cuda API error detected: cudaPeekAtLastError returned (0x62)

warning: Cuda API error detected: cudaGetLastError returned (0x62)

/wclustre/g4p/pcanal/geant/sources/celeritas/app/demo-rasterizer/demo-rasterizer.cc:139: critical: caught exception: /wclustre/g4p/pcanal/geant/sources/celeritas/app/demo-rasterizer/RDemoKernel.cu:107:
celeritas: cuda error: invalid device function
    cudaPeekAtLastError()
@pcanal pcanal force-pushed the ChechCIvsdemo-rasterizerRDemoKernel branch from 2650d44 to 259f034 Compare January 30, 2021 00:07
@pcanal pcanal requested a review from sethrj January 30, 2021 00:07
@sethrj sethrj changed the title Check if demo-rasterizer works properly on the CI or silently fails. Add post-kernel-launch assertions Jan 30, 2021
Copy link
Member

@sethrj sethrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @pcanal !

@sethrj sethrj merged commit 036f8e2 into celeritas-project:master Jan 30, 2021
@sethrj sethrj added the minor Minor internal changes or fixes (including CI updates) label Feb 18, 2023
@sethrj sethrj added the core Software engineering infrastructure label Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Software engineering infrastructure minor Minor internal changes or fixes (including CI updates)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants