Unit_hipDeviceSynchronize_Functional can cause system hangs #112

Closed
pjaaskel opened this issue Aug 16, 2022 · 6 comments

@pjaaskel
Collaborator

The test exercises long-running kernels and causes system hangs (likely kernel-mode busy loops) whose duration varies from machine to machine: short on some, long on others. On this laptop I once waited 10 minutes for Linux to wake up before doing a hard power-off. I added it to the flaky_tests file for now in #111.

@pjaaskel added this to the "0.9 - the first release" milestone on Aug 16, 2022
@franz
Collaborator

franz commented Aug 17, 2022

I've added it to the exclude-tests regexp in PR #113.

The test is problematic for multiple reasons, but we can only partially fix it. The first problem, which is actually solvable, is that it uses a simple for loop plus an assignment, which Clang can optimize away depending on the -O flag (that's why I got different results than Henry on the same machine). On that machine, the test causes the GUI to "freeze", but it eventually recovers. If the test causes your laptop to freeze and not recover, I think that problem is out of scope for CHIP-SPV and is a Linux kernel / driver issue.
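
To illustrate the optimization point, here is a minimal host-side C++ sketch. It is not the actual test code; `spin_plain` and `spin_volatile` are hypothetical names made up for this example. It shows why a plain "loop + assignment" delay can disappear at higher -O levels, and one common way (a `volatile` variable) to force the compiler to keep every iteration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a "for loop + assignment" delay.
// Nothing here has an observable side effect per iteration, so an
// optimizing compiler may collapse the loop into "return iterations;".
std::uint64_t spin_plain(std::uint64_t iterations) {
    std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sink = sink + 1;
    }
    return sink;
}

// Same loop, but the counter is volatile: every read and write of
// `sink` is an observable access, so the compiler must execute all
// iterations regardless of the optimization level.
std::uint64_t spin_volatile(std::uint64_t iterations) {
    volatile std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sink = sink + 1;  // one volatile load and one volatile store
    }
    return sink;
}
```

Both functions return the same value; only the amount of work actually performed can differ between -O0 and -O2. Whether the real test should use `volatile`, a clock-based wait, or something else is a separate design question.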

BTW, Intel's oneAPI docs recommend disabling GPU hangcheck for long-running tasks, so it seems they're aware of the "freezing".
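
As a rough sketch of how a test harness could check the hangcheck state before running long kernels: the code below assumes the Intel i915 kernel driver, which exposes the setting as the module parameter `/sys/module/i915/parameters/enable_hangcheck`. The helper names `parse_bool_param` and `i915_hangcheck_enabled` are hypothetical and not part of CHIP-SPV.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Parse a boolean kernel module parameter as sysfs prints it
// (typically "Y" or "N"). Returns std::nullopt when the stream is
// empty or the value is not recognised.
std::optional<bool> parse_bool_param(std::istream& in) {
    std::string value;
    if (!(in >> value)) return std::nullopt;
    if (value == "Y" || value == "y" || value == "1") return true;
    if (value == "N" || value == "n" || value == "0") return false;
    return std::nullopt;
}

// Assumption: the i915 driver is loaded and exposes this parameter.
// Returns std::nullopt if the file cannot be opened (e.g. other driver).
std::optional<bool> i915_hangcheck_enabled(
    const std::string& path =
        "/sys/module/i915/parameters/enable_hangcheck") {
    std::ifstream file(path);
    if (!file) return std::nullopt;
    return parse_bool_param(file);
}
```

A harness could skip or shorten long-running tests when `i915_hangcheck_enabled()` returns `true`, and treat `std::nullopt` as "unknown". Changing the setting itself requires root and is left to the user.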

@pvelesko
Collaborator

pvelesko commented Aug 17, 2022 via email

@pvelesko
Collaborator

I reduced the runtime further on main; please check whether this is still causing you issues. @franz

@pjaaskel
Collaborator Author

@franz I waited 10 minutes at most; of course it could have recovered eventually :) Yep, sounds like a driver-side issue. Let's just exclude the test for now, as it might hit end users and they would think it's a CHIP-SPV issue.

@pvelesko
Collaborator

pvelesko commented Aug 18, 2022 via email

@pvelesko
Collaborator

@pjaaskel
