Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[FEATURE] Add g5 instance to CI #20876

Merged
merged 21 commits into from
Mar 8, 2022
Merged

Conversation

DickJC123
Copy link
Contributor

@DickJC123 DickJC123 commented Feb 5, 2022

Description

g5 instances are now available to the MXNet CI for testing on Ampere A10G GPUs (courtesy of PR apache/mxnet-ci#43 from @josephevans). Taking advantage of that, this PR turns on g5 instance use by adding a Python3: Ampere-GPU job running on the g5 to the unix-gpu pipeline. Since Ampere architecture GPUs use reduced-mantissa-width TF32 calculations by default on float32 datasets, this required some minor test tolerance adjustments.

A test_report_compute_capabilities test was added that outputs the GPU compute capability to the log . This should be helpful in debugging GPU test failures generally, and here confirms proper enablement of the arch-86 A10G GPU:

[2022-02-16T07:56:58.381Z] tests/python/gpu/test_operator_gpu.py::test_report_compute_capabilities = [86] PASSED [ 18%]

Side notes:

Unrelated MXNet numpy issue mxnet.numpy and numpy differ regarding binary op result type with broadcast 0-dim array input #20898 was discovered in the process of this PR development and a temporary work-around is included.

I encountered and fixed issues in how the test_rnn_layers_fp{16,32} tests are invoked. Before the fix of this PR, ./tests/python/unittest/test_gluon_rnn.py::test_rnn_layers_fp16 was run on a GPU, even though the "unittest" path is supposed to be for cpu-only testing. Also, ./tests/python/gpu/test_operator_gpu.py::test_rnn_layers_fp32, invoked via import, was run on a cpu, even though the "gpu" path was used.

After encountering an error on test_countsketch, I supplied a fix for an out-of-bounds write in the backward kernel. The operator is coded in the legacy operator style, and should be launching the kernels into the context's stream. Instead, the operator currently launches kernels into the default stream and includes cudaDevicedSynchronizes. I stopped short of making those additional changes, feeling those changes should be a separate PR filed based on user interest.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@mxnet-bot
Copy link

Hey @DickJC123 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-gpu, windows-gpu, unix-cpu, unix-gpu, miscellaneous, sanity, edge, website, centos-cpu, clang, windows-cpu]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added the pr-work-in-progress PR is still work in progress label Feb 5, 2022
@DickJC123
Copy link
Contributor Author

@josephevans Could you take a look at what I've done so far, and perhaps troubleshoot why I'm seeing the error There are no nodes with the label ‘mxnetlinux-gpu-g5’ on CI page https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20876/3/pipeline/247. Thanks.

@DickJC123 DickJC123 requested a review from szha as a code owner February 15, 2022 22:32
@DickJC123 DickJC123 changed the title [FEATURE] [WIP] Add g5 instance to CI [FEATURE] Add g5 instance to CI Feb 17, 2022
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Feb 17, 2022
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Feb 17, 2022
@DickJC123 DickJC123 mentioned this pull request Feb 22, 2022
@DickJC123
Copy link
Contributor Author

I've encountered a test failure of test_countsketch here: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-20876/16/pipeline

I see where threads might write outside the output tensor bounds, so pushing a fix to this PR.

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Feb 22, 2022
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 22, 2022
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Feb 24, 2022
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Feb 24, 2022
Copy link
Contributor

@josephevans josephevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks!

@DickJC123
Copy link
Contributor Author

Gentle ping for additional reviews and an eventual merge. This PR contains a few unrelated CI fixes that could help PR development generally.

@TristonC
Copy link
Contributor

TristonC commented Mar 8, 2022

@szha @ptrendx Could you please help review and merge this PR? It passed all the checks and is ready to be merged.

@ptrendx ptrendx merged commit fc54fab into apache:master Mar 8, 2022
@DickJC123
Copy link
Contributor Author

To help this PR pass CI, it included a fix to test_countsketch, providing a resolution for #10988.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants