Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test cases for AMD GPUs #464

Merged
merged 2 commits into from Jun 13, 2019
Merged

Conversation

@dongahn
Copy link
Contributor

dongahn commented May 7, 2019

  • Add support for OpenCL device naming convention within hwloc reader
  • Add tests for systems with AMD GPUs and mixed GPU types

Please don't merge this before PR #455 and #460 get merged. Because the changes in the previously submitted PRs are substantially, I didn't want to do this PR without those changes.

@dongahn dongahn requested review from SteVwonder, grondo and trws May 7, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented May 7, 2019

I tagged @SteVwonder, @trws and @grondo as potential reviewers.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented May 7, 2019

Hmmm Travis CI failed. I thought this was fixed in the previous PRs.

My proposal:

Review PR #455 and #460 and land them first. And then, rebasing this patch from the fresh upstream/master to see where this is exactly.

make[4]: Entering directory '/usr/src/flux-sched-0.7.0-55-gc86be2c/_build/sub/resource/modules'
  CXX      resource_la-resource_match.lo
../../../../resource/modules/resource_match.cpp: In function 'json_t* get_string_blocking(flux_t*, const char*)':
../../../../resource/modules/resource_match.cpp:300:65: warning: converting to non-pointer type 'int' from NULL [-Wconversion-null]
     if (!(f = flux_kvs_lookup (h, NULL, FLUX_KVS_WAITCREATE, key))) {
                                                                 ^
../../../../resource/modules/resource_match.cpp:300:65: error: cannot convert 'kvs_op' to 'const char*'
In file included from /usr/include/flux/core/kvs.h:17,
                 from /usr/include/flux/core.h:19,
                 from ../../../../resource/modules/resource_match.cpp:35:
/usr/include/flux/core/kvs_lookup.h:18:67: note:   initializing argument 3 of 'flux_future_t* flux_kvs_lookup(flux_t*, int, const char*)'
 flux_future_t *flux_kvs_lookup (flux_t *h, int flags, const char *key);
@dongahn dongahn force-pushed the dongahn:amdgpu-test branch from 665d947 to 8f5c7a5 Jun 3, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 3, 2019

@SteVwonder: I also rebased this one and force a pushed. Hopefully we don't get the error from Travis this time. Once PR #460 is merged, this should be immediately ready -- of course if there is no obvious error from this PR :-)

AMD GPUs appear in hwloc as "opencl" followed by a platform ID
followed by a device ID with the letter 'd' delimits the
platform and device IDs.

Add support for this naming convention for hwloc reader.
@dongahn dongahn force-pushed the dongahn:amdgpu-test branch from 8f5c7a5 to a34b58c Jun 5, 2019
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Jun 5, 2019

Codecov Report

Merging #464 into master will increase coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #464      +/-   ##
==========================================
+ Coverage   76.19%   76.23%   +0.04%     
==========================================
  Files          45       45              
  Lines        5477     5478       +1     
==========================================
+ Hits         4173     4176       +3     
+ Misses       1304     1302       -2
Impacted Files Coverage Δ
resource/generators/gen.cpp 87.78% <100%> (+0.7%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0904670...9a0f1d1. Read the comment docs.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 5, 2019

@SteVwonder: Okay I rebased this to upstream/master and forced a push.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 5, 2019

Need to look at CI failures...

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 8, 2019

CI:

FAIL: t4004-match-hwloc.t 3 - resource-query works on gpu query using xml (AMD GPU)
FAIL: t4004-match-hwloc.t 4 - resource-query works on gpu type query using xml (MIXED)

A bit lack of diagnostics. I will push some debug commits to get more data. Please ignore those for a while.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 8, 2019

diff: /usr/src/flux-sched-0.7.0-57-g36a0261/t/data/resource/expected/basics/019.R.out: No such file or directory
not ok 3 - resource-query works on gpu query using xml (AMD GPU)
expecting success: 
    sed "s~@TEST_SRCDIR@~${SHARNESS_TEST_SRCDIR}~g" ${cmd_dir}/cmds11.in > cmds11 &&
    ${query} -X ${hwloc_2mtypes} -S CA -P high -t 020.R.out < cmds11 &&
    test_cmp 020.R.out ${exp_dir}/020.R.out
INFO: Loading a matcher: CA
resource-query> # 2x slot[2]->numanode[1]->gpu[1]
<6a0261/t/data/resource/jobspecs/basics/test012.yaml
<6a0261/t/data/resource/jobspecs/basics/test012.yaml
resource-query> quit
diff: /usr/src/flux-sched-0.7.0-57-g36a0261/t/data/resource/expected/basics/020.R.out: No such file or directory
not ok 4 - resource-query works on gpu type query using xml (MIXED)

Actually clicking on the error expansion gives more info. I think this is just those validation output files not committed.

Add an hwloc xml with a system with AMD GPUs in
data/hwloc-data/001N/amd_gpu/corona11.xml.

Add an hwloc xml with a system with mixed GPUs (one NVIDIA GPU
and one AMD GPU) in data/hwloc-data/001N/multi_gpu_types/chimera.xml.

Modify t4004-match-hwloc.t to make use of these xml files
and test GPU scheduling using resource-query.
@dongahn dongahn force-pushed the dongahn:amdgpu-test branch from a34b58c to 9a0f1d1 Jun 8, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 8, 2019

OK. I forced a push. I think missed those files because .gitignore ignores out prefix.

ahn1@7b680917d49f:/usr/src/t/data/resource/expected/basics$ git add 019.R.out 020.R.out
The following paths are ignored by one of your .gitignore files:
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 9, 2019

@SteVwonder: OK. The Travis turned green. I think this should be ready to go in. Thank you for your review.

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Jun 11, 2019

I just realized I do not have access to any LC clusters with AMD GPUs. Just made several requests on IDM. As soon as one goes through, I'll give this a spin.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 11, 2019

I just realized I do not have access to any LC clusters with AMD GPUs. Just made several requests on IDM. As soon as one goes through, I'll give this a spin.

Sounds good. You could also do some testing using the xml files I added to the repo if that helps.

Copy link
Member

SteVwonder left a comment

LGTM! Using the hwloc files, I was able to give it a spin on my desktop.

I'll merge tomorrow if no one else has any input.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jun 12, 2019

Thanks @SteVwonder!

@SteVwonder SteVwonder merged commit 9720656 into flux-framework:master Jun 13, 2019
3 checks passed
3 checks passed
codecov/patch 100% of diff hit (target 76.19%)
Details
codecov/project 76.23% (+0.04%) compared to 0904670
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@dongahn dongahn deleted the dongahn:amdgpu-test branch Jul 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.