Allreduce error when continuing training from the same process (Ray elastic training) #7436
Comments
Hi, I'm not aware of any related change. If you have a reproducible script I can bisect the changes.
Ah, I see the link. Will take a deeper look.
@krfricke Could you please provide some detailed instructions on how to run this on a local machine? I found ray-project/ray#15913, but it seems to be a huge PR.
Sure, consider these two files:
Install dependencies
Create data
Run
For me this works with xgboost<1.5 but not with xgboost==1.5. The test can be run on a single machine (confirmed on my laptop). Please note that with the latest Ray master this just blocks forever after the first iteration; with Ray 1.7.0 it throws the error above.
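(The original two files aren't reproduced above; for illustration only, here is a minimal sketch of what such an elastic xgboost_ray run can look like, assuming xgboost_ray's public `RayDMatrix`/`RayParams`/`train` API. The dataset shape and parameter values are made up, and the actor kill after 15 iterations that the test performs is not shown.)

```python
# Sketch of a single-machine setup (illustrative, not the original test files).
# Assumed dependencies: pip install ray xgboost_ray "xgboost==1.5.0" scikit-learn
import pandas as pd
import ray
from sklearn.datasets import make_classification
from xgboost_ray import RayDMatrix, RayParams, train

ray.init()

# "Create data": a small synthetic binary classification dataset.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["label"] = y

# "Run": elastic training keeps the surviving actors (and their loaded data)
# alive when one actor dies; this is the code path that fails with 1.5.
ray_params = RayParams(
    num_actors=4,
    cpus_per_actor=1,
    elastic_training=True,
    max_failed_actors=1,
    max_actor_restarts=2,
)

bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    RayDMatrix(df, label="label"),
    num_boost_round=50,
    ray_params=ray_params,
)
bst.save_model("model.xgb")
```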
Thanks for looking into this. Is there something we can do about the first issue from the xgboost-ray side? Ideally we'd like to re-use the dmatrices across different starts, as that's where the benefit of re-using actors comes from (caching loaded data for re-use).
I think that's unlikely. I looked into the xgboost_ray code a little bit. Before training, the data goes through quantization, which requires an allreduce operation. Also, DMatrix construction itself requires an allreduce operation to get a consistent number of columns across workers. If you reconstruct only one of the DMatrices, it will get stuck at that allreduce operation.
The xgboost_ray project constructs the DMatrix before entering the rabit context.
The solution is to re-initialize all DMatrices during restart and move the construction of the DMatrix into the rabit context.
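(For illustration, a rough sketch of that worker-side pattern, not xgboost_ray's actual code, assuming the 1.x `xgboost.rabit` module: every (re)start enters the rabit context first and only then rebuilds all DMatrices from cached raw data, so the column-count and quantile allreduces run on every rank in the same order. The `rabit_args` tracker variables and the cached arrays are assumed to be supplied by the surrounding framework.)

```python
# Illustrative worker-side pattern (not the actual xgboost_ray implementation).
import xgboost as xgb
from xgboost import rabit


def train_on_worker(rabit_args, cached_raw_data, params, num_boost_round):
    # rabit_args: list of b"KEY=VALUE" entries from the tracker,
    # e.g. b"DMLC_TRACKER_URI=...", b"DMLC_TRACKER_PORT=...", b"DMLC_TASK_ID=0"
    rabit.init(rabit_args)
    try:
        # Rebuild *every* DMatrix inside the rabit context, in the same order
        # on all workers, instead of reusing DMatrices from a previous run.
        dmatrices = {
            name: xgb.DMatrix(data, label=label)
            for name, (data, label) in cached_raw_data.items()
        }
        return xgb.train(
            params,
            dmatrices["train"],
            num_boost_round=num_boost_round,
            evals=[(dmatrices["train"], "train")],
        )
    finally:
        rabit.finalize()
```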
Ok, got it - we'll then try to keep the raw data in cache and always reconstruct the DMatrices from scratch. I'll implement this tomorrow. Thanks for the context here, that was very helpful!
Let me know if there's anything I can help with. Also, it's quite pleasant to read the code as it's very well written. ;-)
Just something to think about: XGBoost used to support recovery via rabit (the allreduce implementation in XGBoost), but we removed the feature as it was too difficult to implement at a low level. The difficulty is similar to the one in this issue. In old rabit, the booster model was cached for each iteration, but the quantile sketch and some other things weren't properly handled.

To achieve single-point recovery we need to prevent any allreduce from running out of order (e.g. all DMatrices should be initialized in the same order, or allreduce should be kept out of DMatrix construction altogether). Then we need to specify what should be cached for the lifetime of the rabit handle and what should be cached only for each iteration: the quantile sketch and the number of features belong to the former, while the booster belongs to the latter. Another difficulty is changed parameters (learning rate decay, etc.). I remain optimistic that we can restore native support at some point in the future, but right now the whole dataset needs to be cached, and the process is best carried out by a distributed framework integration (like xgboost_ray).
Thanks for the additional context - and sorry I didn't get to it last week, other stuff came up. I hope to tackle it this week, and yes, I agree we should just move it into the Rabit context to make sure we adhere to all caching principles correctly.
@krfricke Can we close this issue?
Yes, sorry for the delay! For reference, this was fixed in xgboost_ray here: ray-project/xgboost_ray#179
TLDR: Since XGBoost 1.5, XGBoost-Ray's elastic training fails (it works with XGBoost 1.4). I suspect there may be retained state, since training works when all actors are re-created.
XGBoost-Ray uses Ray's actor model to reduce data loading overhead when remote training workers die.
In the elastic training test, we do the following: each actor starts training via the `xgb.train()` method. After a number of iterations (15), we kill one of the actors. This actor is then re-started; the other actors are re-used.
However, when continuing training, the existing actors fail with an allreduce error.
This is also true when not restoring from a checkpoint.
This does not happen when we re-create all actors.
The bug does not come up in XGBoost < 1.5, only in the latest release.
Are you aware of any changes in XGBoost 1.5 that may retain state across multiple calls to `xgboost.train()`? As explained above, the actors retain their state and the same PID, but the `xgb.train()` call always runs in a separate thread, which is ended for all actors when a single actor fails. We also restart the Rabit tracker between runs (and I also tried it with different ports for the Rabit tracker, all with the same result).

Any help would be much appreciated, thanks!