Updating block_timings leads to checkpoint loading errors #219

Closed
vasdommes opened this issue Apr 1, 2024 · 0 comments · Fixed by #220

This bug was introduced by PR #215

Example

From end-to-end_tests/SingletScalar_cT_test_nmax6/primal_dual_optimal (run pmp2sdp once, then the same sdpb command twice):

mpirun --oversubscribe -n 6  build/pmp2sdp  --input=test/data/end-to-end_tests/SingletScalar_cT_test_nmax6/primal_dual_optimal/input/pmp.nsv --output=test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/sdp --precision=768
mpirun --oversubscribe -n 6  build/sdpb --checkpointInterval 3600 --maxRuntime 1340 --dualityGapThreshold 1.0e-30 --primalErrorThreshold 1.0e-30 --dualErrorThreshold 1.0e-30 --initialMatrixScalePrimal 1.0e20 --initialMatrixScaleDual 1.0e20 --feasibleCenteringParameter 0.1 --infeasibleCenteringParameter 0.3 --stepLengthReduction 0.7 --maxComplementarity 1.0e100 --maxIterations 1000 --verbosity 1 --procGranularity 1 --writeSolution x,y,z  --checkpointDir=test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/ck --outDir=test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/out --precision=768 --sdpDir=test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/sdp
mpirun --oversubscribe -n 6  build/sdpb --checkpointInterval 3600 --maxRuntime 1340 --dualityGapThreshold 1.0e-30 --primalErrorThreshold 1.0e-30 --dualErrorThreshold 1.0e-30 --initialMatrixScalePrimal 1.0e20 --initialMatrixScaleDual 1.0e20 --feasibleCenteringParameter 0.1 --infeasibleCenteringParameter 0.3 --stepLengthReduction 0.7 --maxComplementarity 1.0e100 --maxIterations 1000 --verbosity 1 --procGranularity 1 --writeSolution x,y,z  --checkpointDir=test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/ck --outDir=test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/out --precision=768 --sdpDir=test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/sdp

The second sdpb call fails with:

Process 4 caught error message:
in read_local_binary_blocks() at ../src/sdp_solve/SDP_Solver/load_checkpoint/load_binary_checkpoint.cxx:31: 
  Assertion 'local_height == block.LocalHeight() && local_width == block.LocalWidth()' failed:
    Incompatible binary checkpoint file.  For x.block with global size (29,1), expected local dimensions (29,1), but found (31,1)
Stacktrace:
 0# void read_local_binary_blocks<Block_Vector>(Block_Vector&, std::basic_ifstream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at ../src/sdp_solve/SDP_Solver/load_checkpoint/load_binary_checkpoint.cxx:31
 1# load_binary_checkpoint(std::filesystem::__cxx11::path const&, Verbosity const&, SDP_Solver&) at ../src/sdp_solve/SDP_Solver/load_checkpoint/load_binary_checkpoint.cxx:129
 2# SDP_Solver::load_checkpoint(std::filesystem::__cxx11::path const&, Block_Info const&, Verbosity const&, bool const&) at ../src/sdp_solve/SDP_Solver/load_checkpoint/load_checkpoint.cxx:20
 3# SDP_Solver::SDP_Solver(Solver_Parameters const&, Verbosity const&, bool const&, Block_Info const&, El::Grid const&, unsigned long const&) at ../src/sdp_solve/SDP_Solver/SDP_Solver.cxx:20
 4# solve(Block_Info const&, SDPB_Parameters const&, Environment const&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, El::Matrix<int>&) at ../src/sdpb/solve.cxx:52
 5# main at ../src/sdpb/main.cxx:144
 6# __libc_start_call_main at ../sysdeps/nptl/libc_start_call_main.h:58
 7# __libc_start_main at ../csu/libc-start.c:379
 8# _start in build/sdpb

What happens?

  • The first SDPB call starts with a timing run.
  • The timing run writes ck/block_timings.
  • The actual run uses these timings to redistribute SDP blocks among ranks (run sdpb with --verbosity debug to see the Block Grid Mapping).
  • At the end, the actual run writes a checkpoint, moves the original ck/block_timings to ck/block_timings.0, and writes new timings to ck/block_timings.
  • The second SDPB call uses the new ck/block_timings to distribute SDP blocks. It then loads the checkpoint, which was written with blocks distributed according to ck/block_timings.0, so the local block dimensions no longer match.
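
The sequence above can be reproduced in miniature. The sketch below is a toy model, not SDPB code: the greedy mapping in `assign_blocks`, the block heights, and the timing values are all assumptions made up for illustration. It only shows why a per-rank checkpoint written under one block mapping fails a dimension check when read under another.

```python
def assign_blocks(timings, num_ranks):
    """Toy greedy mapping (hypothetical): slowest block first, to the least-loaded rank."""
    load = [0.0] * num_ranks
    owner = {}
    for block in sorted(timings, key=lambda b: -timings[b]):
        rank = min(range(num_ranks), key=lambda r: load[r])
        owner[block] = rank
        load[rank] += timings[block]
    return owner


def write_checkpoint(owner, heights, num_ranks):
    """One 'file' per rank: heights of the blocks that rank owns, in block order."""
    return [
        [heights[b] for b in sorted(owner) if owner[b] == rank]
        for rank in range(num_ranks)
    ]


def load_checkpoint(checkpoint, owner, heights):
    """Each rank reads its own file assuming the current mapping."""
    for rank, stored in enumerate(checkpoint):
        expected = [heights[b] for b in sorted(owner) if owner[b] == rank]
        if stored != expected:
            raise ValueError(
                f"rank {rank}: expected local heights {expected}, found {stored}"
            )


if __name__ == "__main__":
    heights = {0: 29, 1: 31, 2: 30}            # made-up block heights
    timings_0 = {0: 3.0, 1: 2.0, 2: 1.0}       # stands in for ck/block_timings.0
    timings_new = {0: 1.0, 1: 2.0, 2: 3.0}     # stands in for the rewritten ck/block_timings

    mapping_0 = assign_blocks(timings_0, num_ranks=2)
    checkpoint = write_checkpoint(mapping_0, heights, num_ranks=2)

    load_checkpoint(checkpoint, mapping_0, heights)  # same mapping: loads fine
    try:
        load_checkpoint(checkpoint, assign_blocks(timings_new, 2), heights)
    except ValueError as err:
        print("incompatible checkpoint:", err)
```

In this model the checkpoint itself is intact; only the mapping used to interpret it has changed, which matches the workaround and fix below.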

Temporary workaround

Move block_timings.0 to block_timings before loading from a checkpoint.
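
A minimal sketch of this workaround as a helper script, assuming block_timings and block_timings.0 are regular files inside the checkpoint directory. The function name `restore_block_timings` is hypothetical and not part of SDPB.

```python
from pathlib import Path


def restore_block_timings(checkpoint_dir):
    """Move block_timings.0 over block_timings in the given checkpoint directory.

    Returns True if the file was restored, False if there is no block_timings.0.
    Assumes both entries are regular files (an assumption, not checked against SDPB).
    """
    ck = Path(checkpoint_dir)
    backup = ck / "block_timings.0"
    current = ck / "block_timings"
    if not backup.is_file():
        return False
    # Path.replace overwrites an existing destination file.
    backup.replace(current)
    return True
```

For the example above, this would be called as `restore_block_timings("test/out/SingletScalar_cT_test_nmax6/primal_dual_optimal/pmp.nsv/ck")` before the second sdpb call. Note that the newer timings are discarded.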

Solution

Do not write new block_timings after the actual run (revert the relevant changes from #215).

TODO

Introduce a new checkpoint format that is invariant to the block mapping, the number of MPI ranks/nodes, etc. Currently we write one file per rank, containing the matrix elements owned by that rank. We could instead write each block to a separate file, as is done for the SDP. NB: this could be more time- and memory-consuming.
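
A rough sketch of the one-file-per-block idea, to make the invariance concrete. Everything here is an assumption for illustration: the `x_<index>.json` naming, the JSON encoding, and storing values as decimal strings. It says nothing about what format SDPB would actually use.

```python
import json
import tempfile
from pathlib import Path


def write_block_checkpoint(checkpoint_dir, blocks):
    """Write every block to its own file, keyed by its global block index."""
    ck = Path(checkpoint_dir)
    ck.mkdir(parents=True, exist_ok=True)
    for index, values in blocks.items():
        (ck / f"x_{index}.json").write_text(json.dumps(values))


def load_owned_blocks(checkpoint_dir, owned_indices):
    """Read only the blocks a rank owns under whatever mapping is in effect now."""
    ck = Path(checkpoint_dir)
    return {
        index: json.loads((ck / f"x_{index}.json").read_text())
        for index in owned_indices
    }


if __name__ == "__main__":
    # Values kept as strings to stand in for high-precision numbers.
    blocks = {0: ["1.5"], 1: ["2.5", "3.5"], 2: ["4.5"]}
    with tempfile.TemporaryDirectory() as ck_dir:
        write_block_checkpoint(ck_dir, blocks)
        # A rank that now owns blocks 2 and 0 can load them regardless of
        # which rank wrote them.
        print(load_owned_blocks(ck_dir, [2, 0]))
```

Because files are addressed by block index rather than by rank, a changed block mapping or rank count only changes which files each rank opens. The cost noted above still applies: more files, and possibly more time and memory.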

@vasdommes vasdommes added this to the 3.0.0 milestone Apr 1, 2024
@vasdommes vasdommes self-assigned this Apr 1, 2024
vasdommes added a commit that referenced this issue Apr 1, 2024
The bug was introduced in #215, see commit 522dc98

Now we revert these changes and do not update ck/block_timings after actual run. It is written only after a timing run.

end-to-end.test.cxx: run SDPB twice for SingletScalar_cT_test_nmax6/primal_dual_optimal to test this bug
vasdommes added a commit that referenced this issue Apr 1, 2024
Fix #219 Updating block_timings leads to checkpoint loading errors