Fix dist kvstore for trainer and flaky dist kvstore test #11633
Hello Haibin, please note that integrationtest_ubuntu_gpu_dist_kvstore is currently disabled because the test throws exceptions (which, strangely, don't cause the run to fail). This is tracked at #11441.
LGTM
I see an error for |
The three nightly tests now pass. LGTM
@marcoabreu I've fixed and enabled the test.
Thank you. Did you investigate why errors were thrown in the log but the test was not marked as failed?
def test_gluon_trainer_step():
    def check_trainer_step():
        ctx = mx.cpu(0)
Is this on purpose? We run our KVStore tests on a GPU instance. If we don't require a GPU, please downgrade to a CPU instance.
ci/docker/runtime_functions.sh
Outdated
../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=invalid
../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=gluon_type
../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=gluon_step
../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=gluon_sparse_step
Sorry, I'm having a hard time determining which tests require a GPU and which ones don't. Would it be possible to add some way to distinguish the tests (like adding cpu and gpu to the names, separating them into different files, or adding an argument)?
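One way to picture the "adding an argument" option is a small registry that records, per test type, whether a GPU is needed. This is only a sketch under assumptions: the test type names, the TEST_TYPES table, and the select_tests helper are hypothetical and are not taken from dist_sync_kvstore.py.

```python
import argparse

# Hypothetical registry: test type name -> whether the test needs a GPU.
# These names are illustrative only, not the real values accepted by --type.
TEST_TYPES = {
    "gluon_step_cpu": False,
    "gluon_sparse_step_cpu": False,
    "compressed_gpu": True,
}


def parse_args(argv=None):
    """Parse a --type argument restricted to the registered test types."""
    parser = argparse.ArgumentParser(description="select a kvstore test")
    parser.add_argument("--type", choices=sorted(TEST_TYPES), required=True)
    return parser.parse_args(argv)


def select_tests(device):
    """Return the sorted test types that match the given device ('cpu' or 'gpu')."""
    needs_gpu = device == "gpu"
    return sorted(name for name, gpu in TEST_TYPES.items() if gpu == needs_gpu)
```

With a table like this, a CI step for CPU instances could iterate over select_tests("cpu") and a GPU step over select_tests("gpu"), so the device requirement is visible in one place.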
}
}
},
'dist-kvstore tests CPU': {
awesome!
Re: "Did you investigate why there were errors thrown in the log but the test has not been marked as failed?" Yes, I found
Sure, but I mean why the script did not terminate with an error code. The exceptions were logged, but the script did not actually fail with an error. We only noticed the exceptions in the log when we looked at it manually.
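To illustrate the general concern (not the actual cause in this test, which the thread does not spell out): a launcher that only logs worker failures will still exit with status 0 unless it propagates them. A minimal sketch, assuming workers are started as subprocesses; run_workers is a hypothetical helper, not part of tools/launch.py.

```python
import logging
import subprocess
import sys


def run_workers(commands):
    """Run each command in turn and return 1 if any of them failed, else 0."""
    failed = 0
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Logging alone is not enough: the failure must also reach the
            # caller, otherwise CI sees a successful run.
            logging.error("worker %s exited with code %d", cmd, result.returncode)
            failed += 1
    return 1 if failed else 0
```

A caller would pass the result to sys.exit, for example sys.exit(run_workers([[sys.executable, "worker.py"]])), where worker.py is a placeholder name, so that the CI step fails whenever a worker does.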
I thought @rahul003 fixed that? Could you confirm?
@rahul003 and I figured out that the timeout was due to the significant launch overhead of OMP auto-tuning. This is extremely slow when the test is launched in local mode and all 15 processes do OMP tuning while sharing the same instance. Creating the kvstore takes 2 minutes when n=3.
Awesome! Am I right in assuming that nobody would run this locally?
nobody except CI :)
Okay, excellent
LGTM
tests/nightly/dist_sync_kvstore.py
Outdated
kv = set_optimizer(use_multiprecision=opt.multiprecision)
test_sync_push_pull(opt.nrepeat)
# dont run non compressed tests after this as kvstore compression will be set here
# don't run non compressed tests after this as kvstore compression will be set here
if opt.type == 'all' or opt.type == 'compressed':
Could you remove the value 'all' now?
Why remove it?
Because in the current script, it no longer represents all tests.
* fix dist kvstore trainer
* fix test setup
* enable tests on CI
* update move some test to cpu
* dont use nvdia-docker
* rename option
* trigger test
* reduce workload to avvoid time out
* disable operator tuning to reduce launch overhead
* update test types
Description
Previously there were no unit tests for trainer with dist kvstore. #11429 introduced a bug for dist kvstore in trainer, where it calls pull(ignore_sparse=False), which is NOT implemented for dist kvstore. This PR corrects the API call and adds a unit test for it. Now ignore_sparse is set to False only for non-distributed kvstore.
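The rule described above can be sketched as a tiny helper. This is not the code from the PR: pull_ignore_sparse is a hypothetical name, and the check assumes that distributed kvstore type strings contain the substring 'dist' (for example 'dist_sync').

```python
def pull_ignore_sparse(kvstore_type):
    """Return the ignore_sparse value a trainer could pass to kvstore.pull().

    Assumption: a kvstore is distributed exactly when its type string
    contains 'dist'. Distributed kvstores do not implement
    pull(ignore_sparse=False), so they get True; others get False.
    """
    is_distributed = "dist" in kvstore_type
    return is_distributed
```

Under that assumption, a trainer would call pull with ignore_sparse=pull_ignore_sparse(kv.type), so only non-distributed kvstores receive ignore_sparse=False.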