Node down, VeloC XOR restart on the newly allocated node #26
Comments
Hi @gongotar, can you please provide more information about what is happening? How many nodes did you use? What error messages do the active backends show on restart (if any at all)? Do all active backends show the same error message, or are there differences?
Hi Bogdan, sorry for the late response! So the VeloC library contains a
I dug into the problem and found the cause, which I mentioned in my previous post. However, let me know whether you can reproduce the problem.
Hi @gongotar, can you list the contents of the VeloC config file? Did you also flush the checkpoints to the parallel file system? If so, the new node should fetch the missing local checkpoint from the PFS on restart. You can check this by deactivating the EC module.
Hi Bogdan, here is the config file I used to test the erasure coding functionality:
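A minimal VeloC config along these lines might look as follows (a sketch, not the original file: the paths are hypothetical, and the key names follow the VeloC user guide, where `scratch` is the node-local cache, `persistent` is the PFS target, and `ec_interval = 0` requests erasure coding on every checkpoint):

```
scratch = /tmp/scratch
persistent = /lustre/<user>/veloc-ckpt
mode = sync
ec_interval = 0
```

Conversely, if I read the documentation correctly, a negative `ec_interval` deactivates the EC module (the check suggested above), and `persistent_interval = 0` flushes every checkpoint to the PFS.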
However, I also changed the config file to flush the checkpoints to the shared PFS, as you suggested.
Hi @gongotar, we have a new VELOC release. If it is working for you, I will close this issue.
Hi,
I'm testing VeloC's restart capability after a single node failure (node down). For this purpose, I created a test job that periodically checkpoints its data to the node-local storage (`/tmp`) using the VeloC library. I also configured VeloC to protect the data with erasure coding (`ec_interval = 0`).
To test the restart capability, after the job has computed for a number of steps (iterations, checkpoints), I inject a failure on one of the job's nodes and restart the job with the same number of nodes as before. The job then runs on the same set of nodes as the first run, except that the failed node is replaced by a newly allocated one.
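A minimal sketch of such a test loop, using the VeloC C convenience API (the config file name, checkpoint label, problem size, and checkpoint interval below are illustrative choices, not taken from the original job):

```c
#include <mpi.h>
#include <veloc.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* illustrative problem size */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    /* veloc.cfg points scratch at /tmp and sets ec_interval = 0, as above */
    if (VELOC_Init(MPI_COMM_WORLD, "veloc.cfg") != VELOC_SUCCESS) {
        fprintf(stderr, "VELOC_Init failed\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    double *state = calloc(N, sizeof(double));
    int i = 0;

    /* register the regions that make up a checkpoint */
    VELOC_Mem_protect(0, state, N, sizeof(double));
    VELOC_Mem_protect(1, &i, 1, sizeof(int));

    /* on a restart, resume from the latest available version (if any);
       a negative return value means no checkpoint was found */
    int v = VELOC_Restart_test("testjob", 0);
    if (v >= 0)
        VELOC_Restart("testjob", v);      /* restores all protected regions */

    for (; i < 1000; i++) {
        /* ... one iteration of computation on state ... */
        if (i % 100 == 0)
            VELOC_Checkpoint("testjob", i);  /* cached in /tmp, then erasure-coded */
    }

    VELOC_Finalize(0);   /* 0 = do not wait for outstanding background operations */
    free(state);
    MPI_Finalize();
    return 0;
}
```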
Here I would expect the checkpoint of the failed node to be rebuilt from the EC data and loaded onto the newly allocated node. However, as far as I can see, this functionality is not available in the VeloC library, and the job restarts from the beginning. This makes the EC data useless if a node goes down.
That said, I did manage to successfully restart the job from the local checkpoints when all nodes are alive and the job is restarted on exactly the same set of nodes as the first run.
To investigate further, I compared the source code of VeloC with the SCR library (with which I successfully restarted the job using XOR). If I'm correct, the difference is that in SCR's `SCR_Init` function, the `scr_cache_rebuild` call rebuilds the missing data on the newly allocated node at the start of the second run (after the failure), before `SCR_Have_restart` and `SCR_Start_restart` are called. As far as I can see, this is not implemented in the VeloC library.
So please let me know if I'm missing something. Otherwise, is this going to be available in the next release?
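For comparison, a rough sketch of the SCR restart pattern referenced above (based on the public SCR API; the checkpoint file name is illustrative), where the cache rebuild happens inside `SCR_Init` before the restart calls:

```c
#include <mpi.h>
#include <scr.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    /* internally, scr_cache_rebuild reconstructs any checkpoint files
       missing from a replacement node using the XOR redundancy data */
    SCR_Init();

    int have_restart = 0;
    char name[SCR_MAX_FILENAME];
    SCR_Have_restart(&have_restart, name);   /* is a (rebuilt) checkpoint available? */

    if (have_restart) {
        char path[SCR_MAX_FILENAME];
        SCR_Start_restart(name);
        SCR_Route_file("ckpt.dat", path);    /* map logical file to its cached location */
        /* ... read application state from path ... */
        SCR_Complete_restart(1);             /* 1 = this rank read its files successfully */
    }

    /* ... compute / checkpoint loop ... */

    SCR_Finalize();
    MPI_Finalize();
    return 0;
}
```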