
Fix failing GPU tests. #19848

Merged
merged 1 commit into chapel-lang:main on May 24, 2022

Conversation

stonea
Contributor

@stonea stonea commented May 20, 2022

There are a few changes:

Add PREDIFF symbolic link to multiGpu dir

The outermost PREDIFF script is written generically enough to handle this. Anyway, I've updated the .good files so they don't do this explicitly, and updated the multiGPU/ directory to follow the precedent of having a symbolically linked PREDIFF pointing to the more generic script.

Update multiGPU .good file to recognize multiple printouts of arrays

I'm not sure why this has changed (maybe due to changes in the locale model?).
But it seems like the new output is correct (we're supposed to reprint the arrays for each GPU, right?). @e-kayrakli Could you please double check this?
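For context, here is a minimal sketch (hypothetical, not the actual test source) of the kind of per-GPU loop that would produce one "Array: ..." line per GPU under the new locale model:

```chapel
// Hypothetical sketch: iterate over this node's GPU sublocales and print an
// array from each one, yielding one "Array: ..." line per GPU.
for (gpu, i) in zip(here.gpus, 1..) {
  on gpu {
    var A: [1..10] int;
    A = i;                    // set every element to a per-GPU value
    writeln("Array: ", A);    // with N GPUs this prints N lines
  }
}
```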

Update fnWithForall to use new locale model
Instead of doing getChild(1), do gpus[0].
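
Roughly, the change looks like this (a sketch of the idiom, not the literal test diff):

```chapel
// Old idiom (previous locale model): grab the GPU sublocale by child index.
// on here.getChild(1) { ... }

// New idiom: index into the gpus array provided by the GPU locale model.
on here.gpus[0] {
  var A: [1..128] int;
  forall a in A do a += 1;   // order-independent loop, eligible to run on the GPU
}
```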

Update memory allocation in diags

See this diff, from:

< 0 (cpu): diags.chpl:12: allocate 16B of array elements at 0xPREDIFFED

to:

> 0 (cpu): diags.chpl:12: allocate 128B of array elements at 0xPREDIFFED

I'm guessing this reflects an internal change in the array module and we should just update the .good file but I want a second opinion. Can you comment @e-kayrakli?
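For reference, output of this form comes from Chapel's verbose memory diagnostics; here is a minimal sketch of a program that would produce it (assuming the Memory.Diagnostics module of that era; the real diags.chpl may differ):

```chapel
use Memory.Diagnostics;

startVerboseMem();          // report each allocation/free; with GPU support the
                            // locale is tagged with the sublocale kind, e.g.
                            // "0 (cpu):", as in the diff above
on here.gpus[0] {
  var A: [1..10] int;       // triggers "allocate ...B of array elements" lines
  forall a in A do a += 1;
}
stopVerboseMem();
```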

* Add PREDIFF symbolic link to multiGpu dir
* Update multiGPU .good file to recognize multiple printouts of arrays
* Update fnWithForall to use new locale model
* Update memory allocation in diags

---
Signed-off-by: Andy Stone <stonea@users.noreply.github.com>
@stonea stonea requested a review from e-kayrakli May 20, 2022 17:42
@@ -1,5 +1,2 @@
warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11060. Assuming the latest supported version 10.1
ptxas warning : Unresolved extern variable 'chpl_nodeID' in whole program compilation, ignoring extern qualifier
warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11060. Assuming the latest supported version 10.1
Contributor

Ugh, I apologize. Thanks for fixing my mess.

Array: 1 1 1 1 1 1 1 1 1 1
Array: 2 2 2 2 2 2 2 2 2 2
Array: 3 3 3 3 3 3 3 3 3 3
Array: 32 32 32 32 32 32 32 32 32 32
Contributor

This test had a prediff before. I removed it and updated the output, but on my workstation, which has only 1 GPU. You get printouts from each GPU. This should be fine for now, but we'll probably need to revise our tests once we start testing on systems that don't have 8 GPUs per node.

@@ -6,7 +6,7 @@ Start
0 (cpu): diags.chpl:10: allocate 48B of _EndCount(AtomicT(int(64)),int(64)) at 0xPREDIFFED
0 (cpu): diags.chpl:10: free at 0xPREDIFFED
0 (cpu): diags.chpl:12: allocate 168B of [domain(1,int(64),false)] locale at 0xPREDIFFED
-0 (cpu): diags.chpl:12: allocate 16B of array elements at 0xPREDIFFED
+0 (cpu): diags.chpl:12: allocate 128B of array elements at 0xPREDIFFED
Contributor

Could this be from your locale model changes? This line is where we do here.getChild vs here.gpus. Apparently we were already doing some allocation there, but now we are making a bigger allocation?

I wonder whether, if we ran this manually with --devel, this line would change to point to a more precise location in the modules where this allocation is coming from.

Contributor Author

> Could this be from your locale model changes? This line is where we do here.getChild vs here.gpus. Apparently we were already doing some allocation there, but now we are making a bigger allocation?

Yeah, I think so.

@stonea stonea merged commit 717e225 into chapel-lang:main May 24, 2022
stonea added a commit to stonea/chapel that referenced this pull request May 25, 2022
Previous PR was: "Fix failing GPU tests. chapel-lang#19848"

---
Signed-off-by: Andy Stone <stonea@users.noreply.github.com>
stonea added a commit that referenced this pull request May 25, 2022
Forgot to add PREDIFF symlink in previous PR.

Previous PR was: "Fix failing GPU tests. #19848"

[Reviewed by nobody; trivial change to fix test failure]
brandonneth pushed a commit to brandonneth/chapel that referenced this pull request Jun 2, 2022
Previous PR was: "Fix failing GPU tests. chapel-lang#19848"

---
Signed-off-by: Andy Stone <stonea@users.noreply.github.com>
Signed-off-by: Brandon Neth <nethbrandone@gmail.com>