
Node down, VeloC XOR restart on the new allocated node #26

Closed
gongotar opened this issue Dec 2, 2019 · 5 comments

gongotar commented Dec 2, 2019

Hi,

I'm testing VeloC's restart capability after a single node failure (node down). To that end, I created a test job that periodically checkpoints its data to the node's local storage (/tmp) using the VeloC library. I also configured VeloC to protect the data with erasure coding (ec_interval = 0).

To test the restart capability, after the job has computed for a number of steps (iterations, checkpoints), I inject a failure on one of the job's nodes and restart the job with the same number of nodes as before. The job then runs on the same set of nodes as the first run, except that the failed node is replaced by a newly allocated one.

Here I would expect the checkpoint of the failed node to be reconstructed from the EC data and loaded onto the newly allocated node. However, as far as I can tell, this functionality is not available in the VeloC library, and the job restarts from the beginning. This makes the EC data useless when a node goes down.
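To illustrate what I mean by the EC data being sufficient: with XOR-based erasure coding, the parity block is the XOR of all ranks' checkpoint blocks, so any single lost block can be rebuilt from the survivors plus the parity. The toy sketch below demonstrates the arithmetic only; the block names and sizes are illustrative, not VeloC's actual on-disk layout.

```python
# Toy demonstration of XOR-based erasure coding across N "node" checkpoints.
# Any single lost block can be reconstructed from the survivors plus parity.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Checkpoints of three ranks (one per node), padded to equal length.
checkpoints = [b"rank0-ckpt", b"rank1-ckpt", b"rank2-ckpt"]
parity = xor_blocks(checkpoints)

# Simulate losing rank 2's local checkpoint (node failure).
lost = 2
survivors = [c for i, c in enumerate(checkpoints) if i != lost]

# Rebuild the lost checkpoint from the survivors and the parity block.
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)  # b'rank2-ckpt'
```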

However, I did manage to restart the job successfully from the local checkpoints when all nodes are alive and the job restarts on the exact same set of nodes as the first run.

To investigate further, I compared the VeloC source code with the SCR library (with which I successfully restarted the job using XOR). If I'm correct, the difference is that in SCR's SCR_Init function, the scr_cache_rebuild call rebuilds the data on the newly allocated node at the start of the second run (after the failure), before SCR_Have_restart and SCR_Start_restart are called. As far as I can see, this is not implemented in the VeloC library.

So please let me know if I'm missing something. Otherwise, will this be available in the next release?
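The ordering difference described above can be modeled with a small sketch: if the cache rebuild runs inside init, the subsequent restart check on the replacement node succeeds; without it, the check fails and the job recomputes from scratch. All function names below are mine for illustration, not the actual SCR or VeloC APIs.

```python
# Toy model of the restart sequence on a replacement node.
# "rebuild_first=True" mimics the SCR-like flow (rebuild inside init);
# "rebuild_first=False" mimics the observed VeloC 1.2 behavior.

def have_restart(cache, rank):
    # Does this rank have a usable local checkpoint?
    return rank in cache

def rebuild_cache(cache, parity_available, all_ranks):
    # Reconstruct any missing rank's checkpoint from the EC/XOR data.
    if parity_available:
        for r in all_ranks:
            cache.setdefault(r, f"ckpt-{r}-rebuilt")

def init_and_check(cache, rebuild_first):
    ranks = [0, 1, 2]
    if rebuild_first:  # SCR-like: scr_cache_rebuild happens during init
        rebuild_cache(cache, parity_available=True, all_ranks=ranks)
    # Restart is only possible if every rank finds a checkpoint.
    return all(have_restart(cache, r) for r in ranks)

# The node hosting rank 2 failed; its local checkpoint is gone.
surviving_cache = {0: "ckpt-0", 1: "ckpt-1"}

print(init_and_check(dict(surviving_cache), rebuild_first=True))   # True
print(init_and_check(dict(surviving_cache), rebuild_first=False))  # False
```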


bnicolae commented Dec 3, 2019

Hi @gongotar, can you please provide more information about what is happening? How many nodes did you use? What error messages do the active backends show on restart (if any at all)? Do all active backends show the same error message or are there any differences?


gongotar commented Dec 30, 2019

Hi Bogdan, sorry for the late response! The VeloC library contains a test folder with a file named heatdis_fault.cpp. At the beginning, the code checks whether any restarts are available; if not, it starts computing from the beginning. In the middle of the computation, however, a failure is injected that removes the local checkpoints of the last rank (lines 134 - 138) and calls MPI_Abort(MPI_COMM_WORLD, 1) and exit(1).
I have written a Slurm script named test_sync.sh to run the test code. It runs heatdis_fault twice (in sync mode). The first run has no restarts available, so the computation starts from the beginning; the second run is launched after the first run has hit the injected failure and lost one of its three checkpoints. The second run should restart from the most recent checkpoint of the first run, but instead no restarts are found and the job computes from the beginning again.
I tested the code with 3 nodes and also with larger node counts allocated to the job. In the VeloC config file, erasure coding is enabled (ec_interval = 0), so I would expect VeloC to restore the lost checkpoint of a single rank out of three or more ranks (one rank per node) and restart the computation from the saved checkpoint in the second run. That does not happen, and the second run starts from the beginning again. Here is the script I wrote:

#!/bin/bash
#SBATCH -N3
#SBATCH -o out
CFG=heatdis.cfg

# First run: starts from scratch and hits the injected failure.
srun -N3 heatdis_fault 256 $CFG
# Second run: expected to restart from the most recent checkpoint.
srun -N3 heatdis_fault 256 $CFG

# Propagate the exit code of the second run.
EXIT_CODE=$?
exit $EXIT_CODE

I dug into the problem and found the cause, which I mentioned in my previous post. Let me know whether you can reproduce it.


bnicolae commented Jan 7, 2020

Hi @gongotar, can you list the content of the VeloC config file? Did you also flush the checkpoints to the parallel file system? If so, the new node should fetch the missing local checkpoint from the PFS on restart. You can check this by deactivating the EC module.


gongotar commented Jan 8, 2020

Hi Bogdan,

Here is the config file to test the functionality of erasure coding:

scratch = /tmp/user/veloc_test                                          
persistent = /shared/user/veloc_test_cp
max_versions = 2
mode = sync
ec_interval = 0
persistent_interval = -1
axl_type = native

However, I also changed the config file to flush the checkpoints to the shared PFS, as you suggested, by setting persistent_interval = 0 and ec_interval = -1. With the shared PFS, the second run restarts successfully from the most recent checkpoint on the PFS. But if I deactivate the PFS flush and activate EC instead (as in the config file above: persistent_interval = -1; ec_interval = 0), the restart fails to restore the lost checkpoint using EC.
Here are the VeloC messages for the job running on 6 nodes with EC activated:
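For reference, the PFS-flush variant that did restart successfully would look roughly like this (same placeholder paths as above; only the two interval settings differ):

```
scratch = /tmp/user/veloc_test
persistent = /shared/user/veloc_test_cp
max_versions = 2
mode = sync
ec_interval = -1
persistent_interval = 0
axl_type = native
```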

# VeloC messages of the first run:
[INFO 1910392244657] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized       
[INFO 1910401162927] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910400580068] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910401412467] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910400857710] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910401879747] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[DEBUG 1910392244763] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910400580180] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910400857816] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910401163027] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910401412598] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910401879860] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t

<Outputs of the job during the computation of the first run>

<Injected failure, job exits>

# VeloC messages of the second run:
[INFO 1910471167561] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910479502975] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480085845] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480802648] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910479780616] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480335366] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[DEBUG 1910471167721] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910479503092] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910479780770] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910480085986] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910480335509] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
[DEBUG 1910480802784] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t


<Outputs of the job starting the computation from the beginning on the second run>

@bnicolae

Hi @gongotar, we have a new VELOC release. If it is working for you, I will close this issue.
