Fix failing GPU tests. #19848
Conversation
* Add PREDIFF symbolic link to multiGpu dir
* Update multiGPU .good file to recognize multiple printouts of arrays
* Update fnWithForall to use new locale model
* Update memory allocation in diags

---
Signed-off-by: Andy Stone <stonea@users.noreply.github.com>
@@ -1,5 +1,2 @@
warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11060. Assuming the latest supported version 10.1
ptxas warning : Unresolved extern variable 'chpl_nodeID' in whole program compilation, ignoring extern qualifier
warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11060. Assuming the latest supported version 10.1
Ugh, I apologize. Thanks for fixing my mess.
Array: 1 1 1 1 1 1 1 1 1 1
Array: 2 2 2 2 2 2 2 2 2 2
Array: 3 3 3 3 3 3 3 3 3 3
Array: 32 32 32 32 32 32 32 32 32 32
This test had a prediff before. I removed it and updated the output, but I did that on my workstation, which has only 1 GPU. You get printouts from each GPU. This should be fine for now, but we'll probably need to revise our tests once we have testing on different systems that don't have 8 GPUs per node.
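For context, a minimal sketch of why the output multiplies (this is an assumed shape for illustration, not the actual test source): each GPU sublocale runs its own copy of the loop and prints, so a node with N GPUs yields N copies of each "Array: ..." line.

```chapel
// Sketch: one task per GPU sublocale; each prints its own array,
// so the .good file needs one "Array: ..." line per GPU on the node.
coforall gpu in here.gpus do on gpu {
  var A: [1..10] int = 1;   // GPU-eligible initialization on this sublocale
  writeln("Array: ", A);    // printed once per GPU
}
```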
@@ -6,7 +6,7 @@ Start
0 (cpu): diags.chpl:10: allocate 48B of _EndCount(AtomicT(int(64)),int(64)) at 0xPREDIFFED
0 (cpu): diags.chpl:10: free at 0xPREDIFFED
0 (cpu): diags.chpl:12: allocate 168B of [domain(1,int(64),false)] locale at 0xPREDIFFED
- 0 (cpu): diags.chpl:12: allocate 16B of array elements at 0xPREDIFFED
+ 0 (cpu): diags.chpl:12: allocate 128B of array elements at 0xPREDIFFED
Could this be from your locale model changes? This line is where we do here.getChild vs here.gpus. Apparently we were already doing some allocation there, but now we are making a bigger allocation? I wonder whether, if we ran this manually with --devel, this line would change to point to a more precise location in the modules where this allocation is coming from.
> Could this be from your locale model changes? This line is where we do here.getChild vs here.gpus. Apparently we were already doing some allocation there already, but now we are making a bigger allocation?
Yeah, I think so.
Previous PR was: "Fix failing GPU tests. chapel-lang#19848" --- Signed-off-by: Andy Stone <stonea@users.noreply.github.com>
Forgot to add PREDIFF symlink in previous PR. Previous PR was: "Fix failing GPU tests. #19848" [Reviewed by nobody; trivial change to fix test failure]
Previous PR was: "Fix failing GPU tests. chapel-lang#19848" --- Signed-off-by: Andy Stone <stonea@users.noreply.github.com> Signed-off-by: Brandon Neth <nethbrandone@gmail.com>
There are a few changes:
Add PREDIFF symbolic link to multiGpu dir
The outermost PREDIFF script is written generically enough to handle this. Anyway, I've updated the good files to not explicitly do this and have updated the multi/ directory to follow the precedent of having a symbolically linked PREDIFF to the more generic script.
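The symlink itself is just a relative link from the test directory to the shared script. A sketch of the shape (the paths below are illustrative assumptions, not the repo's actual layout):

```shell
# Stand-in layout: a generic PREDIFF one level up, linked into the
# multiGPU test directory so the test harness picks it up.
mkdir -p gpu/native/multiGPU
printf '#!/bin/sh\n' > gpu/native/PREDIFF      # placeholder for the generic script
ln -s ../PREDIFF gpu/native/multiGPU/PREDIFF   # relative link keeps the repo relocatable
ls -l gpu/native/multiGPU/PREDIFF
```

Using a relative target (../PREDIFF rather than an absolute path) keeps the link valid wherever the repository is checked out.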
Update multiGPU .good file to recognize multiple printouts of arrays
I'm not sure why this has changed (maybe due to changes in the locale model?).
But it seems like the new output is correct (we're supposed to reprint the arrays for each GPU, right?). @e-kayrakli Could you please double check this?
Update fnWithForall to use new locale model
Instead of doing getChild(1), do gpus[0].
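The change can be sketched as follows (the surrounding kernel code is assumed; only the locale expression differs):

```chapel
// Old prototype locale model: GPU sublocales reached positionally.
on here.getChild(1) {
  // ... GPU code ...
}

// New locale model: GPU sublocales exposed through the gpus array.
on here.gpus[0] {
  // ... GPU code ...
}
```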
Update memory allocation in diags
See this diff, from:
< 0 (cpu): diags.chpl:12: allocate 16B of array elements at 0xPREDIFFED
to:
> 0 (cpu): diags.chpl:12: allocate 128B of array elements at 0xPREDIFFED
I'm guessing this reflects an internal change in the array module and we should just update the .good file, but I want a second opinion. Can you comment @e-kayrakli?