OpenMPI hangs when running one more than number of brokers #924

Closed
garlick opened this issue Dec 13, 2016 · 9 comments

Comments

@garlick
Member

garlick commented Dec 13, 2016

With recent changes submitted to OpenMPI for flux integration, I can run MPI jobs compiled with OpenMPI. However, I noticed that if I try to "oversubscribe" by launching the MPI hello program with a size greater than the size of the enclosing instance, it hangs.

When I do the same with the PMI test program src/common/libpmi/test_kvstest, there is no hang.

@garlick
Member Author

garlick commented Dec 13, 2016

Trace of MPI hello on two nodes, successful run:

$ flux wreckrun -n2 -otrace-pmi-server ../../t/mpi/hello.ompi
0: C: cmd=init pmi_version=1 pmi_subversion=1
0: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0: C: cmd=get_maxes
0: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=PMI_process_mapping
0: S: cmd=get_result rc=0 value=(vector,(0,2,1))
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=PMI_process_mapping
0: S: cmd=get_result rc=0 value=(vector,(0,2,1))
0: C: cmd=get_universe_size
0: S: cmd=universe_size rc=0 size=2
0: C: cmd=get_appnum
0: S: cmd=appnum rc=0 appnum=5
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
0: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
0: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
0: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
0: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
0: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=init pmi_version=1 pmi_subversion=1
1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
1: C: cmd=get_maxes
1: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=PMI_process_mapping
1: S: cmd=get_result rc=0 value=(vector,(0,2,1))
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.5.pmi
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=PMI_process_mapping
1: S: cmd=get_result rc=0 value=(vector,(0,2,1))
1: C: cmd=get_universe_size
1: S: cmd=universe_size rc=0 size=2
1: C: cmd=get_appnum
1: S: cmd=appnum rc=0 appnum=5
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
1: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
1: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
1: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
1: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
1: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-4294967294-key0
1: S: cmd=get_result rc=4
0: C: cmd=put kvsname=lwj.0.0.5.pmi key=5-0-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
0: S: cmd=put_result rc=0
0: C: cmd=barrier_in
1: C: cmd=put kvsname=lwj.0.0.5.pmi key=5-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
1: S: cmd=put_result rc=0
1: C: cmd=barrier_in
0: S: cmd=barrier_out rc=0
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-0-key0
1: S: cmd=barrier_out rc=0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-1-key0
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-1-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
0: C: cmd=barrier_in
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-0-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
1: C: cmd=barrier_in
0: S: cmd=barrier_out rc=0
1: S: cmd=barrier_out rc=0
1: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-0-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
0: C: cmd=get kvsname=lwj.0.0.5.pmi key=5-1-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADUuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
0: completed MPI_Init in 0.045s.  There are 2 tasks
1: C: cmd=barrier_in
0: C: cmd=barrier_in
0: completed first barrier in 0.002s
1: S: cmd=barrier_out rc=0
0: S: cmd=barrier_out rc=0
1: C: cmd=finalize
1: S: cmd=finalize_ack rc=0
0: C: cmd=finalize
0: S: cmd=finalize_ack rc=0
0: completed MPI_Finalize in 0.004s
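
The `value=` fields in the `put` and `get_result` lines above appear to be base64 without `=` padding. The following is a minimal sketch for inspecting them, not part of flux or OpenMPI; the helper names `decode_pmi_value` and `printable_strings` are hypothetical, and the treatment of the trailing `-` as trace decoration rather than data is an assumption.

```python
import base64
import re


def decode_pmi_value(value):
    """Decode a base64 value copied from a put/get_result trace line."""
    # Keep only base64 alphabet characters; this drops spaces, '=' padding,
    # and the trailing '-' that follows the values in the trace (assumed
    # here to be decoration, not data).
    cleaned = re.sub(r"[^A-Za-z0-9+/]", "", value)
    if len(cleaned) % 4 == 1:
        cleaned = cleaned[:-1]  # a lone trailing character encodes no full byte
    cleaned += "=" * (-len(cleaned) % 4)  # restore the padding b64decode expects
    return base64.b64decode(cleaned)


def printable_strings(data):
    """Return runs of three or more printable ASCII bytes, like `strings`."""
    return [s.decode("ascii") for s in re.findall(rb"[\x20-\x7e]{3,}", data)]


if __name__ == "__main__":
    # First 32 characters of the value rank 0 put in the trace above.
    prefix = "cG1peC5jcHVzZXQAMDMAMDAwNQAwLTEx"
    print(printable_strings(decode_pmi_value(prefix)))
```

Applied to the first 32 characters of rank 0's value, this yields the strings `pmix.cpuset`, `0005`, and `0-11`. Running it over full values is one way to compare what each rank published.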

@garlick
Member Author

garlick commented Dec 13, 2016

The hang doesn't reproduce for sessions with size=1 or size=2 (running MPI at size=2 and size=3, respectively). However, it does for size=4 (MPI at size=5). Here's a trace of one such hang:

$ flux wreckrun -n5 -otrace-pmi-server ../../t/mpi/hello.ompi
0: C: cmd=init pmi_version=1 pmi_subversion=1
0: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
4: C: cmd=init pmi_version=1 pmi_subversion=1
4: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0: C: cmd=get_maxes
0: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
4: C: cmd=get_maxes
4: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
4: C: cmd=get_my_kvsname
4: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
1: C: cmd=init pmi_version=1 pmi_subversion=1
1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
1: C: cmd=get_maxes
1: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
4: C: cmd=get_my_kvsname
4: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
4: C: cmd=get_my_kvsname
4: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
0: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
3: C: cmd=init pmi_version=1 pmi_subversion=1
3: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
2: C: cmd=init pmi_version=1 pmi_subversion=1
2: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
3: C: cmd=get_maxes
3: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
2: C: cmd=get_maxes
2: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
2: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
2: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
2: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
4: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
1: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
4: C: cmd=get_my_kvsname
4: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
0: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
4: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
4: C: cmd=get_universe_size
4: S: cmd=universe_size rc=0 size=5
4: C: cmd=get_appnum
4: S: cmd=appnum rc=0 appnum=1
3: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
1: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
0: C: cmd=get_universe_size
0: S: cmd=universe_size rc=0 size=5
4: S: cmd=get_result rc=4
0: C: cmd=get_appnum
0: S: cmd=appnum rc=0 appnum=1
1: C: cmd=get_universe_size
1: S: cmd=universe_size rc=0 size=5
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
4: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get_appnum
1: S: cmd=appnum rc=0 appnum=1
4: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
4: S: cmd=get_result rc=4
2: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.1.pmi
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: C: cmd=get_universe_size
3: S: cmd=universe_size rc=0 size=5
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=PMI_process_mapping
4: S: cmd=get_result rc=4
0: S: cmd=get_result rc=4
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
4: S: cmd=get_result rc=4
2: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
3: C: cmd=get_appnum
3: S: cmd=appnum rc=0 appnum=1
2: C: cmd=get_universe_size
2: S: cmd=universe_size rc=0 size=5
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
1: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get_appnum
2: S: cmd=appnum rc=0 appnum=1
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
1: S: cmd=get_result rc=4
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
1: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
1: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=4
1: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
2: S: cmd=get_result rc=4
4: C: cmd=put kvsname=lwj.0.0.1.pmi key=1-4-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuNDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBwIAAAAAAA  -
4: S: cmd=put_result rc=0
4: C: cmd=barrier_in
2: C: cmd=put kvsname=lwj.0.0.1.pmi key=1-2-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMjt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3fUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAHC8Gll/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMi92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAsCAAAAAAA -
2: S: cmd=put_result rc=0
2: C: cmd=barrier_in
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
0: S: cmd=get_result rc=4
0: C: cmd=put kvsname=lwj.0.0.1.pmi key=1-0-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAKDnz+d/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMC92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAwCAAAAAAA -
0: S: cmd=put_result rc=0
0: C: cmd=barrier_in
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
1: S: cmd=get_result rc=4
1: C: cmd=put kvsname=lwj.0.0.1.pmi key=1-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2/UAAAEAAAALAAAAAAAAAAgAQAAAAAAAAOB1o3B/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMS92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA0CAAAAAAA -
1: S: cmd=put_result rc=0
1: C: cmd=barrier_in
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4294967294-key0
3: S: cmd=get_result rc=4
3: C: cmd=put kvsname=lwj.0.0.1.pmi key=1-3-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMzt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAODCP81/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMy92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA4CAAAAAAA -
3: S: cmd=put_result rc=0
3: C: cmd=barrier_in
0: S: cmd=barrier_out rc=0
1: S: cmd=barrier_out rc=0
2: S: cmd=barrier_out rc=0
3: S: cmd=barrier_out rc=0
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-0-key0
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-2-key0
4: S: cmd=barrier_out rc=0
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAKDnz+d/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMC92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAwCAAAAAAA -
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-0-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAKDnz+d/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMC92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAwCAAAAAAA -
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-1-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2/UAAAEAAAALAAAAAAAAAAgAQAAAAAAAAOB1o3B/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMS92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA0CAAAAAAA -
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-1-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2/UAAAEAAAALAAAAAAAAAAgAQAAAAAAAAOB1o3B/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMS92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA0CAAAAAAA -
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-2-key0
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMjt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3fUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAHC8Gll/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMi92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAsCAAAAAAA -
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-2-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMjt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3fUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAHC8Gll/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMi92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAsCAAAAAAA -
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-2-key0
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMjt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3fUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAHC8Gll/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMi92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAsCAAAAAAA -
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-3-key0
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMzt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAODCP81/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMy92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA4CAAAAAAA -
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-3-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMjt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3fUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAHC8Gll/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMi92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAsCAAAAAAA -
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-3-key0
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMzt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAODCP81/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMy92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA4CAAAAAAA -
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-0-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMzt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAODCP81/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMy92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA4CAAAAAAA -
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-3-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuNDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBwIAAAAAAA  -
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAKDnz+d/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMC92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAwCAAAAAAA -
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-0-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMzt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAODCP81/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMy92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA4CAAAAAAA -
0: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4-key0
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-0-key0
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAKDnz+d/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMC92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAwCAAAAAAA -
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-1-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuNDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBwIAAAAAAA  -
1: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuNDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBwIAAAAAAA  -
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2/UAAAEAAAALAAAAAAAAAAgAQAAAAAAAAOB1o3B/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMS92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA0CAAAAAAA -
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-1-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAKDnz+d/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMC92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAwCAAAAAAA -
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2/UAAAEAAAALAAAAAAAAAAgAQAAAAAAAAOB1o3B/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMS92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA0CAAAAAAA -
2: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4-key0
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-1-key0
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuNDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBwIAAAAAAA  -
3: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-4-key0
1: C: cmd=barrier_in
0: C: cmd=barrier_in
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuNDt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBwIAAAAAAA  -
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMTt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA2/UAAAEAAAALAAAAAAAAAAgAQAAAAAAAAOB1o3B/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMS92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA0CAAAAAAA -
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-2-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMjt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3fUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAHC8Gll/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMi92YWRlcl9zZWdtZW50LmppbWJvLjAAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAsCAAAAAAA -
4: C: cmd=get kvsname=lwj.0.0.1.pmi key=1-3-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDE5ADEuMzt0Y3A6Ly8xOTIuMTY4LjEuMTM2OgBwbWl4LmhuYW1lADAzADAwMDYAamltYm8ATVBJX1RIUkVBRF9MRVZFTAAxNAAwMDAxAAFidGwudmFkZXIuMy4wADE0ADAwNTQA3vUAAAEAAAALAAAAAAAAAAgAQAAAAAAAAODCP81/AAAvdG1wL29tcGkuamltYm8uNTU4OC9qZi4wLzEvMy92YWRlcl9zZWdtZW50LmppbWJvLjEAYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABA4CAAAAAAA -
4: C: cmd=barrier_in
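
The `PMI_process_mapping` value returned in this trace, `(vector,(0,2,2),(2,1,1))`, can be expanded into a rank-to-node layout. The sketch below assumes the usual PMI-1 convention that each triple is (first node id, number of nodes, ranks per node) and that ranks are assigned in order; the function name `ranks_by_node` is hypothetical.

```python
import re


def ranks_by_node(mapping):
    """Expand a PMI_process_mapping string into {node_id: [ranks]}.

    Assumes each triple means (first node id, node count, ranks per node)
    and that ranks are numbered consecutively across the triples.
    """
    if not mapping.startswith("(vector"):
        raise ValueError("unsupported mapping: %r" % mapping)
    layout = {}
    rank = 0
    for first, count, per_node in re.findall(r"\((\d+),(\d+),(\d+)\)", mapping):
        for node in range(int(first), int(first) + int(count)):
            for _ in range(int(per_node)):
                layout.setdefault(node, []).append(rank)
                rank += 1
    return layout


if __name__ == "__main__":
    print(ranks_by_node("(vector,(0,2,2),(2,1,1))"))
```

Under that reading, the mapping in this run places ranks 0 and 1 on node 0, ranks 2 and 3 on node 1, and rank 4 on node 2, while the two-task run's `(vector,(0,2,1))` places one rank on each of two nodes.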

@garlick
Member Author

garlick commented Dec 13, 2016

Only 3 of 5 processes (ranks 0, 1, and 4) entered the second barrier.

Working backwards to find out what happened to ranks 2 and 3: both were sent responses to their last kvs get requests, but their barrier enter requests were never received.
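
The same check can be done mechanically on a saved `-otrace-pmi-server` log. This is a rough sketch that assumes the line format shown in the traces (`rank: C|S: cmd=...`); the function names are hypothetical and it is not part of flux.

```python
import re

# Matches PMI trace lines such as "2: C: cmd=barrier_in"; program output
# like "0: completed MPI_Init in 0.045s." does not match and is skipped.
TRACE_RE = re.compile(r"^(\d+): ([CS]): cmd=(\S+)")


def scan_trace(trace_text):
    """Return (barrier_in count per rank, last (side, cmd) seen per rank)."""
    barriers = {}
    last_event = {}
    for line in trace_text.splitlines():
        m = TRACE_RE.match(line.strip())
        if not m:
            continue
        rank, side, cmd = int(m.group(1)), m.group(2), m.group(3)
        barriers.setdefault(rank, 0)
        if side == "C" and cmd == "barrier_in":
            barriers[rank] += 1
        last_event[rank] = (side, cmd)
    return barriers, last_event


def stragglers(trace_text):
    """List ranks that entered fewer barriers than the furthest-along rank."""
    barriers, _ = scan_trace(trace_text)
    most = max(barriers.values(), default=0)
    return sorted(r for r, n in barriers.items() if n < most)
```

Run over the hanging trace, `stragglers` should report ranks 2 and 3, and the last event recorded for each of them should be a server `get_result`, matching the observation that their final get requests were answered.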

@garlick
Member Author

garlick commented Dec 13, 2016

Same command, different failure:

$ flux wreckrun -n5 -otrace-pmi-server ../../t/mpi/hello.ompi
4: C: cmd=init pmi_version=1 pmi_subversion=1
4: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
1: C: cmd=init pmi_version=1 pmi_subversion=1
1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
4: C: cmd=get_maxes
4: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0: C: cmd=init pmi_version=1 pmi_subversion=1
0: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
1: C: cmd=get_maxes
1: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
4: C: cmd=get_my_kvsname
0: C: cmd=get_maxes
0: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
4: C: cmd=get_my_kvsname
4: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
4: C: cmd=get_my_kvsname
4: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
1: C: cmd=get_my_kvsname
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
0: C: cmd=get_my_kvsname
0: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
1: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
4: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
4: C: cmd=get_my_kvsname
4: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
2: C: cmd=init pmi_version=1 pmi_subversion=1
2: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
3: C: cmd=init pmi_version=1 pmi_subversion=1
3: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
2: C: cmd=get_maxes
2: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
3: C: cmd=get_maxes
1: C: cmd=get_my_kvsname
1: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
0: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
2: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
4: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
2: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
1: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
4: C: cmd=get_appnum
4: S: cmd=appnum rc=0 appnum=10
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
1: C: cmd=get_universe_size
1: S: cmd=universe_size rc=0 size=5
1: C: cmd=get_appnum
1: S: cmd=appnum rc=0 appnum=10
2: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
0: S: cmd=universe_size rc=0 size=5
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
0: C: cmd=get_appnum
0: S: cmd=appnum rc=0 appnum=10
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
1: S: cmd=get_result rc=4
2: C: cmd=get_my_kvsname
2: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
3: C: cmd=get_my_kvsname
3: S: cmd=my_kvsname rc=0 kvsname=lwj.0.0.10.pmi
4: S: cmd=get_result rc=4
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
2: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=PMI_process_mapping
4: S: cmd=get_result rc=4
1: S: cmd=get_result rc=4
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=0 value=(vector,(0,2,2),(2,1,1))
2: C: cmd=get_universe_size
2: S: cmd=universe_size rc=0 size=5
3: C: cmd=get_universe_size
3: S: cmd=universe_size rc=0 size=5
2: C: cmd=get_appnum
2: S: cmd=appnum rc=0 appnum=10
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: C: cmd=get_appnum
3: S: cmd=appnum rc=0 appnum=10
4: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
1: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
4: S: cmd=get_result rc=4
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
4: S: cmd=get_result rc=4
1: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: S: cmd=get_result rc=4
3: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
1: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
1: S: cmd=get_result rc=4
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: S: cmd=get_result rc=4
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=4
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
1: S: cmd=get_result rc=4
1: C: cmd=put kvsname=lwj.0.0.10.pmi key=10-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjE7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJT5AAABAAAACwAAAAAAAAAIAEAAAAAAAABQGt56fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8xL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAAIAAAAAAA  -
1: S: cmd=put_result rc=0
1: C: cmd=barrier_in
4: C: cmd=put kvsname=lwj.0.0.10.pmi key=10-4-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjQ7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAECAAAAAAA -
4: S: cmd=put_result rc=0
4: C: cmd=barrier_in
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
0: S: cmd=get_result rc=4
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
2: S: cmd=get_result rc=4
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4294967294-key0
3: S: cmd=get_result rc=4
0: C: cmd=put kvsname=lwj.0.0.10.pmi key=10-0-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjA7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJP5AAABAAAACwAAAAAAAAAIAEAAAAAAAADgkmz7fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8wL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
0: S: cmd=put_result rc=0
0: C: cmd=barrier_in
2: C: cmd=put kvsname=lwj.0.0.10.pmi key=10-2-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjI7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJb5AAABAAAACwAAAAAAAAAIAEAAAAAAAADQdmOGfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8yL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
2: S: cmd=put_result rc=0
2: C: cmd=barrier_in
3: C: cmd=put kvsname=lwj.0.0.10.pmi key=10-3-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjM7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJf5AAABAAAACwAAAAAAAAAIAEAAAAAAAABgRDDjfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8zL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBAIAAAAAAA  -
3: S: cmd=put_result rc=0
3: C: cmd=barrier_in
1: S: cmd=barrier_out rc=0
0: S: cmd=barrier_out rc=0
4: S: cmd=barrier_out rc=0
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4-key0
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-0-key0
2: S: cmd=barrier_out rc=0
3: S: cmd=barrier_out rc=0
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-2-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjA7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJP5AAABAAAACwAAAAAAAAAIAEAAAAAAAADgkmz7fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8wL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-0-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjA7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJP5AAABAAAACwAAAAAAAAAIAEAAAAAAAADgkmz7fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8wL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-1-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjE7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJT5AAABAAAACwAAAAAAAAAIAEAAAAAAAABQGt56fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8xL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAAIAAAAAAA  -
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-1-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjE7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJT5AAABAAAACwAAAAAAAAAIAEAAAAAAAABQGt56fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8xL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAAIAAAAAAA  -
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-2-key0
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjI7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJb5AAABAAAACwAAAAAAAAAIAEAAAAAAAADQdmOGfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8yL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-2-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjQ7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAECAAAAAAA -
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-0-key0
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjI7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJb5AAABAAAACwAAAAAAAAAIAEAAAAAAAADQdmOGfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8yL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-3-key0
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-2-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjI7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJb5AAABAAAACwAAAAAAAAAIAEAAAAAAAADQdmOGfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8yL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-3-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjA7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJP5AAABAAAACwAAAAAAAAAIAEAAAAAAAADgkmz7fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8wL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjM7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJf5AAABAAAACwAAAAAAAAAIAEAAAAAAAABgRDDjfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8zL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBAIAAAAAAA  -
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-3-key0
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-1-key0
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjM7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJf5AAABAAAACwAAAAAAAAAIAEAAAAAAAABgRDDjfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8zL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBAIAAAAAAA  -
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-0-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjM7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJf5AAABAAAACwAAAAAAAAAIAEAAAAAAAABgRDDjfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8zL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBAIAAAAAAA  -
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-3-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjE7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJT5AAABAAAACwAAAAAAAAAIAEAAAAAAAABQGt56fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8xL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAAIAAAAAAA  -
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjA7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJP5AAABAAAACwAAAAAAAAAIAEAAAAAAAADgkmz7fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8wL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-0-key0
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-2-key0
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjA7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJP5AAABAAAACwAAAAAAAAAIAEAAAAAAAADgkmz7fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8wL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAgIAAAAAAA  -
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-1-key0
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjM7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJf5AAABAAAACwAAAAAAAAAIAEAAAAAAAABgRDDjfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8zL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBAIAAAAAAA  -
0: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjI7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJb5AAABAAAACwAAAAAAAAAIAEAAAAAAAADQdmOGfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8yL3ZhZGVyX3NlZ21lbnQuamltYm8uMABidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAwIAAAAAAA  -
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjE7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJT5AAABAAAACwAAAAAAAAAIAEAAAAAAAABQGt56fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8xL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAAIAAAAAAA  -
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-1-key0
4: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-3-key0
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjE7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJT5AAABAAAACwAAAAAAAAAIAEAAAAAAAABQGt56fwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8xL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEAAIAAAAAAA  -
2: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4-key0
0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjQ7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAECAAAAAAA -
1: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4-key0
4: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjM7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnZhZGVyLjMuMAAxNAAwMDU1AJf5AAABAAAACwAAAAAAAAAIAEAAAAAAAABgRDDjfwAAL3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8zL3ZhZGVyX3NlZ21lbnQuamltYm8uMQBidGwudGNwLjMuMAAxNAAwMDE4AMCoAYgAAAAAAAAAAAAAAAAEBAIAAAAAAA  -
2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjQ7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAECAAAAAAA -
3: C: cmd=get kvsname=lwj.0.0.10.pmi key=10-4-key0
4: C: cmd=barrier_in
3: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjQ7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAECAAAAAAA -
1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTExAG9wYWwucHVyaQAwMwAwMDFhADEwLjQ7dGNwOi8vMTkyLjE2OC4xLjEzNjoAcG1peC5obmFtZQAwMwAwMDA2AGppbWJvAE1QSV9USFJFQURfTEVWRUwAMTQAMDAwMQABYnRsLnRjcC4zLjAAMTQAMDAxOADAqAGIAAAAAAAAAAAAAAAABAECAAAAAAA -
3: C: cmd=barrier_in
--------------------------------------------------------------------------
A system call failed during sm BTL initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  System call: open(2)
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  jimbo
  System call: unlink(2) /tmp/ompi.jimbo.5588/jf.0/10/shared_mem_btl_module.jimbo
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
[jimbo:63891] WARNING: common_sm_module_unlink failed.
[jimbo:63891] WARNING: common_sm_module_unlink failed.
[jimbo:63891] WARNING: /tmp/ompi.jimbo.5588/jf.0/10/shared_mem_pool_rndv.jimbo unlink failed.
[jimbo:63891] WARNING: /tmp/ompi.jimbo.5588/jf.0/10/shared_mem_btl_rndv.jimbo unlink failed.


[jimbo:63891] 1 more process has sent help message help-opal-shmem-mmap.txt / sys call fail
[jimbo:63891] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@garlick
Member Author

garlick commented Dec 13, 2016

It seems that setting OMPI_MCA_btl=self,tcp in the program's environment (disabling shared memory) makes this problem go away.
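
A minimal sketch of that workaround, assuming the job is launched from a Python wrapper. The helper names are hypothetical, and the command line only mirrors the `flux wreckrun -n<N> <program>` form used earlier in this issue. Whether the launcher passes the caller's environment through to the tasks is also an assumption here.

```python
import os


def ompi_env_without_shared_memory(base_env=None):
    """Return a copy of the environment with the BTL list limited to self,tcp."""
    env = dict(os.environ if base_env is None else base_env)
    # Same setting as in the comment above: leave out the shared memory BTLs.
    env["OMPI_MCA_btl"] = "self,tcp"
    return env


def wreckrun_command(ntasks, program):
    """Build (but do not run) a launch command in the form used in this issue."""
    return ["flux", "wreckrun", f"-n{ntasks}", program]
```

The two results could then be handed to `subprocess.run(cmd, env=env)`; nothing is executed by the sketch itself.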

@garlick
Member Author

garlick commented Dec 14, 2016

Even with the OMPI_MCA_btl setting, I'm still hitting some errors in PMI_Barrier() on larger jobs. It looks like we are getting mangled data back from the PMI server on the PMI_FD on the client side. Possibly pieces of earlier responses are being mixed in with the barrier response.
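
One way to spot a response that does not belong to its request is to parse the trace lines and compare command names. The sketch below is a hypothetical helper, not part of flux-core or Open MPI. The request/response pairs in the table are only the ones visible in the traces in this issue, and lines from different brokers can appear out of order in the trace, so this is a per-pair check rather than a full validator.

```python
import re

# "<rank>: C|S: key=value key=value ..." as printed by -otrace-pmi-server
TRACE_RE = re.compile(r"^(\d+): ([CS]): (.*)$")

# Request cmd -> response cmd, taken from the pairs seen in the traces above.
EXPECTED_RESPONSE = {
    "init": "response_to_init",
    "get_maxes": "maxes",
    "get_my_kvsname": "my_kvsname",
    "get_universe_size": "universe_size",
    "get_appnum": "appnum",
    "get": "get_result",
    "put": "put_result",
    "barrier_in": "barrier_out",
}


def parse_trace_line(line):
    """Return (rank, side, fields), or None if the line is not a trace line."""
    match = TRACE_RE.match(line.strip())
    if match is None:
        return None
    rank, side, body = match.groups()
    fields = {}
    for token in body.split():
        key, sep, value = token.partition("=")
        if sep:  # tokens without "=" (such as a trailing "-") are skipped
            fields[key] = value
    return int(rank), side, fields


def response_matches(request_line, response_line):
    """True if the response is the expected reply to the request, same rank."""
    req = parse_trace_line(request_line)
    rsp = parse_trace_line(response_line)
    if req is None or rsp is None:
        return False
    expected = EXPECTED_RESPONSE.get(req[2].get("cmd"))
    return (
        req[1] == "C"
        and rsp[1] == "S"
        and req[0] == rsp[0]
        and expected is not None
        and rsp[2].get("cmd") == expected
    )
```

A `barrier_in` answered by a `get_result` line would be flagged, which is the kind of mix-up suspected here.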

Also it seems that these barrier failures are not fatal to the app. In the pmix:flux component, the flux_fencenb() function "thread shifts" the PMI_Barrier() call and doesn't manage to propagate the error back to the original thread, which is itself a problem.
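
The lost-error pattern can be illustrated in general terms. This is a sketch using Python's `concurrent.futures`, not the Open MPI code; `BarrierError` and the function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class BarrierError(RuntimeError):
    """Stand-in for a failed barrier call (hypothetical)."""


def failing_barrier():
    raise BarrierError("barrier failed")


def shifted_and_dropped(fn):
    """Shift fn to another thread and return without checking its outcome."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(fn)  # the Future, and the exception stored in it, is discarded
    return "ok"


def shifted_and_propagated(fn):
    """Shift fn to another thread and carry its result or error back."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(fn).result()  # re-raises in the calling thread
```

With `failing_barrier`, the first variant reports success and the second raises `BarrierError` in the caller, which is what a fatal barrier failure would need.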

I'm wondering if maybe truncated PMI_KVS_Get() responses could be the cause of the shared memory hangs above.

@garlick
Member Author

garlick commented Dec 14, 2016

Oh duh. Our PMI library is of course not thread safe, so this thread shifting business may actually be the cause of the problem.
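
To illustrate why sharing one request/response channel between threads needs serialization, here is a toy sketch. It is not the Flux PMI library: the client, server, and reply format are invented, and a socket pair stands in for the PMI file descriptor. Holding one lock across the write and the matching read keeps each thread's reply from being consumed by, or mixed with, another thread's.

```python
import socket
import threading


class LockedClient:
    """Toy line-oriented client; one lock covers each request/response pair."""

    def __init__(self, sock):
        self._sock = sock
        # Separate reader and writer objects over the same socket.
        self._rfile = sock.makefile("r", encoding="utf-8", newline="\n")
        self._wfile = sock.makefile("w", encoding="utf-8", newline="\n")
        self._lock = threading.Lock()

    def call(self, request):
        with self._lock:  # write and read happen as one unit
            self._wfile.write(request + "\n")
            self._wfile.flush()
            return self._rfile.readline().rstrip("\n")

    def close(self):
        self._wfile.close()
        self._rfile.close()
        self._sock.close()


def toy_server(sock):
    """Answer each 'cmd=get key=K' line with a made-up get_result line."""
    rfile = sock.makefile("r", encoding="utf-8", newline="\n")
    wfile = sock.makefile("w", encoding="utf-8", newline="\n")
    with sock, rfile, wfile:
        for line in rfile:
            key = line.strip().rpartition("key=")[2]
            wfile.write(f"cmd=get_result rc=0 value=echo-{key}\n")
            wfile.flush()


def run_demo(n_threads=8):
    """Issue one request per thread over a single shared connection."""
    client_sock, server_sock = socket.socketpair()
    server = threading.Thread(target=toy_server, args=(server_sock,), daemon=True)
    server.start()
    client = LockedClient(client_sock)
    results = {}

    def worker(i):
        results[i] = client.call(f"cmd=get key=k{i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    client.close()
    server.join(timeout=5)
    return results
```

Without the lock, two threads could interleave their writes or read each other's reply lines. Keeping all calls on a single thread avoids the problem without any locking.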

@garlick
Member Author

garlick commented Dec 14, 2016

Replacing the nonblocking fence with a blocking one seems to have made the mangled responses go away and I can launch like crazy using OMPI_MCA_btl=self,tcp.

Still running into problems with the default shared memory though.

@garlick
Member Author

garlick commented Dec 17, 2016

The problem was shared memory segment naming collisions, fixed by PR #926.
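
A sketch of how such a collision can arise, based only on paths visible in this issue. The base64 fragment is copied from the value rank 1 puts in the trace, with `==` padding added here so it decodes on its own. The directory and file name come from the error output. Both naming functions are hypothetical illustrations and are not claimed to match Open MPI's code or the change in PR #926.

```python
import base64
import posixpath

# Fragment of rank 1's put value from the trace ("==" padding added here).
FRAGMENT = (
    "L3RtcC9vbXBpLmppbWJvLjU1ODgvamYuMC8xMC8xL3ZhZGVyX3NlZ21lbnQuamltYm8uMQ=="
)

# Directory taken from the unlink(2) error message above.
JOB_DIR = "/tmp/ompi.jimbo.5588/jf.0/10"


def decode_fragment(fragment):
    """Decode a base64 fragment of a PMI value into text."""
    return base64.b64decode(fragment).decode("ascii")


def host_only_name(job_dir, host):
    """Name keyed on hostname only, like the file in the error output."""
    return posixpath.join(job_dir, f"shared_mem_btl_module.{host}")


def broker_qualified_name(job_dir, host, broker_rank):
    """Hypothetical alternative: add a component unique to each broker."""
    return posixpath.join(job_dir, f"shared_mem_btl_module.{host}.{broker_rank}")


if __name__ == "__main__":
    print(decode_fragment(FRAGMENT))
    # Three brokers on one host: the host-only scheme yields a single name.
    print({host_only_name(JOB_DIR, "jimbo") for _ in range(3)})
    print({broker_qualified_name(JOB_DIR, "jimbo", r) for r in range(3)})
```

The decoded fragment is a per-rank segment path under the same job directory. When several brokers run on one host, any file in that directory named by hostname alone is shared by all of them.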

@garlick garlick closed this as completed Dec 17, 2016