Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix dist kvstore for trainer and flaky dist kvstore test #11633

Merged
merged 12 commits into from
Jul 13, 2018

Conversation

eric-haibin-lin
Copy link
Member

@eric-haibin-lin eric-haibin-lin commented Jul 10, 2018

Description

Previously there were no unit tests for trainer with dist kvstore. #11429 introduced a bug for dist kvstore in trainer, where it calls pull(ignore_sparse=False) which is NOT implemented for dist kvstore. This PR corrects the API call and adds unit test for it. Now ignore_sparse is set to False for non-distributed kvstore.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@marcoabreu
Copy link
Contributor

Hello Haibin,

please take note that integrationtest_ubuntu_gpu_dist_kvstore is currently disabled due to this test throwing exceptions (which strangely don't cause the run to fail). This is tracked at #11441

@eric-haibin-lin eric-haibin-lin changed the title [WIP] Fix dist kvstore for trainer Fix dist kvstore for trainer Jul 10, 2018
Copy link
Member

@rahul003 rahul003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ctcyang
Copy link
Contributor

ctcyang commented Jul 10, 2018

I see an error for gluon_step, but the other two nightly tests pass. I ran ../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=gluon_step. The error log is attached dist_kvstore_error.txt.

Copy link
Contributor

@ctcyang ctcyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three nightly tests now pass. LGTM

@eric-haibin-lin
Copy link
Member Author

@marcoabreu I've fixed and enabled the test.

@eric-haibin-lin eric-haibin-lin changed the title Fix dist kvstore for trainer Fix dist kvstore for trainer and flaky dist kvstore test Jul 11, 2018
@marcoabreu
Copy link
Contributor

Thank you. Did you investigate why there were errors thrown in the log but the test has not been marked as failed?


def test_gluon_trainer_step():
def check_trainer_step():
ctx = mx.cpu(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this on purpose? We run our KVstore tests on a GPU instance. if we don't require GPU, please downgrade to using a cpu instance.

../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=invalid
../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=gluon_type
../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=gluon_step
../../tools/launch.py -n 7 --launcher local python dist_sync_kvstore.py --type=gluon_sparse_step
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm having a hard time to determine which tests require a gpu and which ones dont. Would it be possible to add some way to distinguish the tests (like adding cpu and gpu to the names, separating them into different files or adding an argument).

}
}
},
'dist-kvstore tests CPU': {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

@eric-haibin-lin
Copy link
Member Author

Re: "Did you investigate why there were errors thrown in the log but the test has not been marked as failed?"

Yes, i found init_kv() was called twice and throwing that exception. Now it is called only once.

@marcoabreu
Copy link
Contributor

marcoabreu commented Jul 11, 2018

Sure, but I mean why the script did not terminate with an error code. The exceptions were logged but the script did not actually fail with an error. We only noticed the exception in the log when we looked at them manually

@eric-haibin-lin
Copy link
Member Author

I thought @rahul003 fixed that? Could you confirm?

@eric-haibin-lin
Copy link
Member Author

@rahul003 and I figured that the timeout was due to significant launch overhead of omp auto tuning. This is extremely slow when launching the test in local mode and all 15 processes are doing omp tuning, sharing the same instance. Creating kvstore takes 2 minutes when n=3.

@marcoabreu
Copy link
Contributor

Awesome! Am I right in assuming that nobody would run this locally?

@eric-haibin-lin
Copy link
Member Author

nobody except CI :)

@marcoabreu
Copy link
Contributor

Okay excellent

Copy link
Member

@rahul003 rahul003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

kv = set_optimizer(use_multiprecision=opt.multiprecision)
test_sync_push_pull(opt.nrepeat)
# dont run non compressed tests after this as kvstore compression will be set here
# don't run non compressed tests after this as kvstore compression will be set here
if opt.type == 'all' or opt.type == 'compressed':
Copy link
Member

@rahul003 rahul003 Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this value all now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in the current script, it no longer represents all tests.

@eric-haibin-lin eric-haibin-lin merged commit 11cd73f into apache:master Jul 13, 2018
@szha szha mentioned this pull request Jul 16, 2018
7 tasks
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* fix dist kvstore trainer

* fix test setup

* enable tests on CI

* update move some test to cpu

* dont use nvdia-docker

* rename option

* trigger test

* reduce workload to avvoid time out

* disable operator tuning to reduce launch overhead

* update test types
@eric-haibin-lin eric-haibin-lin deleted the fix-dist-kv branch November 12, 2018 20:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants