[rabit harden] include osx in tests, address time_wait on port assignment #90

Open
wants to merge 19 commits into
base: master
from

4 participants
@chenqin
Contributor

chenqin commented Apr 9, 2019

Follow-up to the PR that cleaned up the copied dmlc-core header files: chenqin@ecd4bf7.

  • add osx as an OS on all CI tasks
  • split the MPI tests from the socket tests
  • address the TCP socket TIME_WAIT issue
  • enable all tests other than Python (plan to clean up rabit Python in a follow-up PR)
  • address the tracker deadlock in Python 2 by upgrading the tracker to Python 3 (dmlc/dmlc-core#525);
    more detail on the deadlock's root cause can be found in dmlc/dmlc-core#524
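
For readers unfamiliar with the TIME_WAIT item: a local port whose earlier connections are still in TIME_WAIT can fail to bind with EADDRINUSE unless SO_REUSEADDR is set, so a worker that insists on one fixed port may fail right after a restart. One common workaround is to scan a port range and take the first port that binds. The sketch below is only an illustration in Python; the helper name `try_bind_range` is hypothetical and this is not rabit's actual C++ implementation.

```python
import errno
import socket


def try_bind_range(host, start_port, end_port):
    """Bind a listening socket to the first free port in [start_port, end_port).

    Returns (sock, port). Raises OSError if every port in the range is busy.
    Hypothetical helper for illustration, not part of rabit.
    """
    for port in range(start_port, end_port):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            sock.close()
            if exc.errno == errno.EADDRINUSE:
                # Port is busy (in use, or lingering in TIME_WAIT): try the next one.
                continue
            raise
        sock.listen(16)
        return sock, port
    raise OSError(errno.EADDRINUSE, "no free port in range")
```

Calling it twice with the same range from one machine should hand out two different ports, because the first listening socket still holds its port when the second scan runs.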

@chenqin chenqin force-pushed the chenqin:master branch 3 times, most recently from 1ab1073 to e023a18 Apr 9, 2019

@chenqin chenqin force-pushed the chenqin:master branch from e023a18 to f19d62e Apr 9, 2019

[cleanup] include java regression tests against xgb master
enable xgb-tests use chenqin/xgboost:master with updated path
port packages from xgb
enable test on osx

@chenqin chenqin changed the title [rabit harden] include regression tests on xgboost cmake/java_tests [rabit harden] include regression tests on xgboost java_tests Apr 9, 2019

@CodingCat
Member

CodingCat left a comment

have we fixed the test mentioned in #86 (comment)?

@chenqin

Contributor Author

chenqin commented Apr 9, 2019

have we fixed the test mentioned in #86 (comment)?

That should be another PR after we update xgboost master. The rationale is that the previous PR and this one haven't changed how rabit functions, other than reshuffling parameters and deleting redundant code.

I already have some ideas for fixing the flaky test by introducing an extra check in allreduce_robust before it resets links indefinitely. But yeah, I think that should be a separate thing after we have a good baseline: dmlc/xgboost#4352
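
The "extra check before resetting links indefinitely" idea can be pictured as a bounded retry loop. This is only a rough Python sketch of that general pattern; the names `recover_with_limit`, `try_recover`, `reset_links` and `max_resets` are all hypothetical, and nothing here reflects the actual allreduce_robust code or the fix that was eventually made.

```python
def recover_with_limit(try_recover, reset_links, max_resets=5):
    """Retry recovery, resetting links between attempts, but give up eventually.

    try_recover: callable returning True once recovery succeeded.
    reset_links: callable that re-establishes links before the next attempt.
    Returns the number of resets that were needed.
    Hypothetical sketch, not rabit code.
    """
    for attempt in range(max_resets + 1):
        if try_recover():
            return attempt
        if attempt < max_resets:
            reset_links()
    # The extra check: fail loudly instead of resetting links forever.
    raise RuntimeError(f"recovery failed after {max_resets} link resets")
```

The design point is simply that a test hitting the limit fails with an error rather than hanging until the CI job times out.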


@chenqin chenqin force-pushed the chenqin:master branch from 4cb0d84 to 1cda848 Apr 10, 2019

per feedback, clean up packages
remove xgb java tests

@chenqin chenqin force-pushed the chenqin:master branch from 1cda848 to 42553e3 Apr 10, 2019

@chenqin chenqin changed the title [rabit harden] include regression tests on xgboost java_tests [rabit harden] include osx in tests Apr 10, 2019

@chenqin

Contributor Author

chenqin commented Apr 11, 2019

For some reason, trybind actually allows two processes to bind to the same port.

cq@cq-MS-7B84:~/xgboost/rabit/test$ make -f test.mk 
../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 --local-num-attempt=10 model_recover 10000 mock=0,0,1,0 mock=1,1,1,0
2019-04-10 15:17:51,502 INFO start listen on 10.0.1.25:9091
[0]ReConnectLinks cmd start port 9010 range (9010 - 10010)
[1]ReConnectLinks cmd start port 9011 range (9010 - 10010)
[0]sock_listen.Accept cmd start port # 9010 0 num_accept 2
[2]ReConnectLinks cmd start port 9012 range (9010 - 10010)
[1]sock_listen.Accept cmd start port # 9011 0 num_accept 2
[3]ReConnectLinks cmd start port 9013 range (9010 - 10010)
[2]sock_listen.Accept cmd start port # 9012 0 num_accept 2
[4]ReConnectLinks cmd start port 9014 range (9010 - 10010)
[2]sock_listen.Accept cmd start port # 9012 1 num_accept 2
[3]sock_listen.Accept cmd start port # 9013 0 num_accept 1
[5]ReConnectLinks cmd start port 9015 range (9010 - 10010)
[4]sock_listen.Accept cmd start port # 9014 0 num_accept 1
[6]ReConnectLinks cmd start port 9016 range (9010 - 10010)
[1]sock_listen.Accept cmd start port # 9011 1 num_accept 2
[5]sock_listen.Accept cmd start port # 9015 0 num_accept 1
[7]ReConnectLinks cmd start port 9017 range (9010 - 10010)
[6]sock_listen.Accept cmd start port # 9016 0 num_accept 1
[8]ReConnectLinks cmd start port 9018 range (9010 - 10010)
[7]sock_listen.Accept cmd start port # 9017 0 num_accept 2
[9]ReConnectLinks cmd start port 9019 range (9010 - 10010)
[0]sock_listen.Accept cmd start port # 9010 1 num_accept 2
[8]sock_listen.Accept cmd start port # 9018 0 num_accept 1
[7]sock_listen.Accept cmd start port # 9017 1 num_accept 2
[0] reload-trail=0, init iter=0
[1] reload-trail=0, init iter=0
[9] reload-trail=0, init iter=0
[7] reload-trail=0, init iter=0
[6] reload-trail=0, init iter=0
[2] reload-trail=0, init iter=0
[4] reload-trail=0, init iter=0
[5] reload-trail=0, init iter=0
2019-04-10 15:17:52,491 INFO @tracker All of 10 nodes getting started
[8] reload-trail=0, init iter=0
[3] reload-trail=0, init iter=0
[0] !!!TestMax pass, iter=0
[3] !!!TestMax pass, iter=0
[6] !!!TestMax pass, iter=0
[0]@@@Hit Mock Error:Broadcast
[4] !!!TestMax pass, iter=0
[5] !!!TestMax pass, iter=0
[2] !!!TestMax pass, iter=0
[8] !!!TestMax pass, iter=0
[9] !!!TestMax pass, iter=0
[7] !!!TestMax pass, iter=0
[1] !!!TestMax pass, iter=0
[9]ReConnectLinks cmd recover port 9014 range (9010 - 10010)
[8]ReConnectLinks cmd recover port 9015 range (9010 - 10010)
[9]sock_listen.Accept cmd recover port # 9014 0 num_accept 3
[7]ReConnectLinks cmd recover port 9016 range (9010 - 10010)
[8]sock_listen.Accept cmd recover port # 9015 0 num_accept 1
[9]sock_listen.Accept cmd recover port # 9014 1 num_accept 3
[1]ReConnectLinks cmd recover port 9017 range (9010 - 10010)
[3]ReConnectLinks cmd recover port 9018 range (9010 - 10010)
[2]ReConnectLinks cmd recover port 9019 range (9010 - 10010)
[1]sock_listen.Accept cmd recover port # 9017 0 num_accept 3
[3]sock_listen.Accept cmd recover port # 9018 0 num_accept 2
[6]ReConnectLinks cmd recover port 9020 range (9010 - 10010)
[1]sock_listen.Accept cmd recover port # 9017 1 num_accept 3
[7]sock_listen.Accept cmd recover port # 9016 0 num_accept 1
[5]ReConnectLinks cmd recover port 9021 range (9010 - 10010)
[6]sock_listen.Accept cmd recover port # 9020 0 num_accept 1
[4]ReConnectLinks cmd recover port 9022 range (9010 - 10010)
[2]sock_listen.Accept cmd recover port # 9019 0 num_accept 1
[3]sock_listen.Accept cmd recover port # 9018 1 num_accept 2
[5]sock_listen.Accept cmd recover port # 9021 0 num_accept 1
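
Whether two sockets may bind the same port depends on the platform and on socket options such as SO_REUSEPORT; this thread does not establish which (if any) of those is responsible for the behaviour in the log above. As a way to check what a given machine allows, here is a minimal Python sketch. The helper name `can_double_bind` is hypothetical, and this is not rabit code.

```python
import socket


def can_double_bind(reuse_port, host="127.0.0.1"):
    """Return True if a second socket can bind a port a first socket listens on.

    With reuse_port=True, SO_REUSEPORT is set on both sockets before binding.
    Hypothetical helper for illustration only.
    """
    opt = getattr(socket, "SO_REUSEPORT", None)
    if reuse_port and opt is None:
        raise RuntimeError("SO_REUSEPORT is not available on this platform")
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_port:
            first.setsockopt(socket.SOL_SOCKET, opt, 1)
            second.setsockopt(socket.SOL_SOCKET, opt, 1)
        first.bind((host, 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except OSError:
            return False
        return True
    finally:
        first.close()
        second.close()
```

Without any reuse option the second bind is expected to fail with "address already in use"; running `can_double_bind(True)` on Linux and on osx shows whether the reuse option changes that on each platform.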

@chenqin chenqin force-pushed the chenqin:master branch from 4fa14a7 to 9d0e235 Apr 11, 2019

@chenqin chenqin changed the title [rabit harden] include osx in tests [rabit harden] include osx in tests, fix flaky integration test Apr 11, 2019

@chenqin

Contributor Author

chenqin commented Apr 11, 2019

have we fixed the test mentioned in #86 (comment)?

Done

@chenqin

Contributor Author

chenqin commented Apr 12, 2019

The osx build failed because a package could not be downloaded; it should be fine if rerun. Local build here:
https://travis-ci.org/chenqin/rabit/builds/518971901

@chenqin chenqin force-pushed the chenqin:master branch 2 times, most recently from 4415ceb to 9d0e235 Apr 12, 2019

@trivialfis

Member

trivialfis commented Apr 12, 2019

@chenqin thanks, restarted the test.

@chenqin

Contributor Author

chenqin commented Apr 12, 2019

all tests green @CodingCat @trivialfis

@CodingCat

Member

CodingCat commented Apr 12, 2019

check my comments about mpi in osx

@chenqin

Contributor Author

chenqin commented Apr 16, 2019

Chen Qin
@CodingCat

Member

CodingCat commented Apr 18, 2019

isn't the test still hanging?

@chenqin

Contributor Author

chenqin commented Apr 18, 2019

We need to fix dmlc-core first.

@CodingCat

Member

CodingCat commented Apr 18, 2019

We need to fix dmlc-core first.

check dmlc/dmlc-core#525 (comment)

@chenqin chenqin force-pushed the chenqin:master branch 16 times, most recently from 05fc609 to b21bdec Apr 18, 2019

@chenqin chenqin force-pushed the chenqin:master branch from b21bdec to 686e6c2 Apr 18, 2019

@CodingCat

Member

CodingCat commented Apr 19, 2019

looks like we have very consistent behavior in linux with pip3

@chenqin chenqin force-pushed the chenqin:master branch from 118294e to 03a64a9 Apr 19, 2019
