/workspace/xgboost/rabit/include/rabit/internal/utils.h:90: Allreduce failed #6551
Actually, @trivialfis, after ensuring the 2 workers have data, I still saw this hit once out of about 200 models. Just before the failure I have:
Yet, I hit this on the worker:
and I see in the logs:
I will try to repro and get the pickled state the next time things fail. But any insight? It's just regular partitions of data, but for whatever reason xgboost still thinks one worker has an empty dataset. Is it possible that 1 worker happened to get all 16 or 4 partitions of Xy and valid_Xy?
@trivialfis Actually, I can reproduce the problem if I just run a given script in a loop; one time it failed only on the 18th trial. I'll try to get a clean repro for you that is free of our code base.
I'm using (like I have before) a 2-node CUDA cluster with 2 GPUs in each node. I get this very frequently, but not always. Here's the repro. When I ran it the first time, it failed on the very first iteration; another time it did not fail on the first iteration, but it always eventually fails, within roughly 20 iterations. A sketch of its shape is below.
(28MB, so GitHub won't take it)
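Since the attachment won't fit, here is a minimal sketch of the shape of the failing loop. The scheduler address, dataset, and parameters are placeholders, not the actual repro:

```python
import dask.datasets
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder for the 2-node, 4-GPU cluster

# Placeholder data; the actual repro uses our own partitioned frames.
df = dask.datasets.timeseries()
X = df[["x", "y"]].repartition(npartitions=16)
y = (df["x"] > 0).astype(int).repartition(npartitions=16)

for i in range(30):  # eventually fails within roughly 20 iterations on the real cluster
    Xy = xgb.dask.DaskDMatrix(client, X, y)
    valid_Xy = xgb.dask.DaskDMatrix(client, X, y)
    out = xgb.dask.train(
        client,
        {"tree_method": "gpu_hist", "eval_metric": "auc"},
        Xy,
        num_boost_round=10,
        evals=[(valid_Xy, "valid")],
    )
    print(i, out["history"]["valid"]["auc"][-1])
```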
While when it works, which it sometimes does, I get:
@trivialfis On this 2-node cluster this error happens about every other model fit!! So this is really bad. FYI the predict is not required for the failure; it's just a way to see what is going on.
If it didn't fail only half the time, I would guess that xgboost has some extra conditions on the partitions that are not documented. That's why I asked about the requirements before. You said a DMatrix has to have the same partition count across all of its inputs, e.g. for X, y, and sample_weight, since those make 1 DMatrix. The validation DMatrix can do the same for valid_X, valid_y, and sample_weight_eval_set (for each eval set). That makes sense. However, as a user I'd be worried that 5 partitions does not divide evenly into 4 CUDA workers (2 nodes, each with 2 GPUs). If it failed every time, I would have guessed I need to make the partitions evenly divisible by the number of workers; see the sketch below. Of course, that is not always possible to do.
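If divisibility did matter, the workaround I would try is repartitioning so the partition count is a multiple of the worker count. A rough sketch (the counts are illustrative):

```python
# Make the partition count a multiple of the number of workers,
# so no worker can end up without a chunk.
n_workers = len(client.scheduler_info()["workers"])
X = X.repartition(npartitions=4 * n_workers)
y = y.repartition(npartitions=4 * n_workers)
valid_X = valid_X.repartition(npartitions=4 * n_workers)
valid_y = valid_y.repartition(npartitions=4 * n_workers)
```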
And to be clear, the actual error alternates between the allreduce error and the empty dataset -> label set 0 error from here: #6272. It's about 50/50 in terms of which error pops up.
I also have reports from colleagues that it happens even without a cluster, just local cluster mode, which I'll check on.
The report seems to be true, but so far I have only been able to reproduce the issue when using a cluster. In that case, dask is highly unstable with the repro above. @trivialfis
Any progress? I added a hacky retry with (say) 5 retry attempts, but the retry only succeeds about 30% of the time. That it works at all shows part (or all) of the problem is within xgboost. But I'm just guessing; it could also be that the data is already badly distributed when doing the persist, or something. I will keep trying things.
I also should note that nothing like this was seen in the earlier 1.2.x-type build. Problems only occurred with that dask PR/commit we discussed, so I don't think it is fully fixed in 1.3.0 with the IP fix alone. So this is definitely a regression.
FYI, even when the partitions are even across the given workers I see the same problem:

```
2020-12-28 14:21:27,315 C: NA D: NA M: NA NODE:SERVER 22966 PDEBUG | ('Xy npartitions: 32 32',)
```

Then it fails the same way. I can't see why it hits either the Allreduce error or "label set cannot be empty" even when the data should be scattered. Do I have to scatter manually? I'm without a clue, but this definitely didn't happen in older xgboost.
I don't have any progress yet. Just got back yesterday. Will go through the issues. |
FYI if I remove the eval_set, then I don't seem to encounter either the Allreduce or the label-set-empty issue. It's unclear how xgboost ends up with an empty dataset, unless the partitions are unevenly distributed. I'm trying to see on the dask side how to determine which workers have which partitions, but maybe you know? (What I've been trying is sketched below.)
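For inspecting which workers hold which partitions from the dask side, something like this should work (an untested sketch using the standard distributed client API):

```python
from dask.distributed import wait

X = X.persist()
wait(X)  # make sure the partitions have actually landed on workers

# Worker address -> list of keys (partitions) held by that worker.
for addr, keys in client.has_what().items():
    print(addr, len(keys), "partitions")
```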
FYI I also tried explicitly putting the data into dask_cudf. The problem with using dask_cudf on a local client is that it uses GPU memory, and then I see the dask worker using the same memory. So there is an extra copy for no good reason, it seems.
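The conversion was, roughly, the standard one (illustrative; not necessarily the exact code I used):

```python
import dask_cudf

# Convert a CPU-backed dask DataFrame into a GPU-backed dask_cudf one.
gdf = dask_cudf.from_dask_dataframe(df)
```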
I think my plan will be improving the implementation of
You mean all rank-based metrics? I guess that will fix one side, but perhaps not the allreduce. But it's still a concern that the data is not evenly distributed. There's no reason, from what I see, that any worker should ever have an empty dataset.
Yeah if time allows.
Almost always (I haven't seen another case), the allreduce failure is caused by other failures in 1 or more workers. As for how to sort out correct, consistent error reporting/handling by coordinating all workers, we don't know yet. It's not exactly an issue that can be solved by a mutex.
Did you see a Python warning that looks like this:
It's the first report of an empty DMatrix. A trick for detecting an empty DMatrix by yourself is:

```python
from typing import Dict

import xgboost as xgb
from distributed import Client


def _get_client_workers(client: "Client") -> Dict[str, Dict]:
    # Don't use this on GKE, where 'workers' is empty.
    workers = client.scheduler_info()['workers']
    return workers


Xy = xgb.dask.DaskDMatrix(client, X, y)
workers = _get_client_workers(client)
for addr in workers:
    parts = Xy.worker_map.get(addr)
    assert parts is not None  # Assert there's a data partition on the worker at `addr`.
```
Unintentional indirect close.
Thanks, will see. FYI, regarding what I mentioned doing: I would have expected that command not to materialize any GPU memory on the client, but it does. E.g.:
The first process, 15019, is on the client. The other 2 are workers, with just a CUDA context. I.e., this is just before the fit; all frames are dask_cudf after passing the dask frame through that function above. Instead of pushing the data to the workers, it just eats up GPU memory on the client side, which defeats the purpose of dask. This may be a bug or problem with dask_cudf, or just a problem in the rapids 0.14 version of it, but maybe you know more. This is why I settled on passing xgboost the dask frames and letting xgboost convert them to GPU as needed, which seems to happen per dask worker instead of on the client. Do you have an understanding of how to do this correctly? I can see the frames are dask_cudf with partitions etc., but for whatever reason the client uses GPU memory. And I can see that once the fit actually starts, the client holds onto that memory while each worker starts using GPU memory. Perhaps the dask_cudf team did not notice this because they usually test with LocalCUDACluster, where this problem would not be as easy to see.
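If the client-side conversion is the culprit, one alternative I may try is having the workers read the data themselves, so nothing materializes on the client (the path is a placeholder):

```python
import dask_cudf

# Each worker reads its own shards directly into GPU memory;
# the client only holds the task graph, not the data.
X = dask_cudf.read_csv("data/train-*.csv")  # placeholder path
```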
Hi @trivialfis, any progress on this? We keep seeing it, and people who use our products also hit it.
I don't know the exact cause of the allreduce failure. I'm looking into the aucpr implementation for both CPU and GPU; hopefully, if I can get it right, we will be able to fix your error. It will take some time, as I also want to resolve #4663.
Hi @trivialfis, any progress on this? We still see the issue; it makes dask unusable when using multiple nodes.
Working on it now.
I wanted to recommend using a sklearn metric as a temporary workaround, but it doesn't handle an empty dataset. So I will continue revising the implementation in xgboost.
@pseudotensor Could you please try to wrap the sklearn metric like this:

```python
from typing import Callable, Tuple

import numpy as np
from sklearn.metrics import roc_auc_score

import xgboost as xgb
from xgboost import DMatrix


def _metric_decorator(func: Callable) -> Callable:
    # `func` is a metric like `sklearn.metrics.roc_auc_score`.
    def inner(y_score: np.ndarray, dmatrix: DMatrix) -> Tuple[str, float]:
        y_true = dmatrix.get_label()
        if y_true.size == 0:
            # Return 0.5 as the default for roc-auc; this varies for different metrics.
            return func.__name__, 0.5
        return func.__name__, func(y_true, y_score)
    return inner


metric = _metric_decorator(roc_auc_score)
cls = xgb.dask.DaskXGBClassifier()
cls.fit(X, y, eval_metric=metric, eval_set=[(valid_X, valid_y)])
```
I'm pretty sure in this case I don't have an empty dataset. I ensure the data is partitioned across all workers. When there were cases with an empty dataset, it would show such an error. The original repro still works for me to reproduce this. That is, it doesn't happen every time, so the testing that NVIDIA does would not see it, except perhaps rarely. Can you try to repro on a real multi-node cluster? This is the repro from above: #6551 (comment). If you have a problem with the pickle again, I can remove the "model" and just give params etc.
I set up a ssh cluster. So far the only way I can get the error is to shut down one of the nodes in the middle of training.
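For reference, the setup was along these lines (hostnames are placeholders; I'm assuming dask-cuda's CUDAWorker for the GPU workers):

```python
from dask.distributed import Client, SSHCluster

# The first host runs the scheduler; the rest run workers.
cluster = SSHCluster(
    ["scheduler-host", "worker-host-1", "worker-host-2"],
    worker_class="dask_cuda.CUDAWorker",  # assumption: GPU workers via dask-cuda
)
client = Client(cluster)
```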
Judging from the stack trace you provided, I still believe it's the
Okay, I got a repro, but it's due to an empty dataset. On the worker:
And on the client side:
@pseudotensor The client side error seems to match the one you have provided. |
Hi, was this resolved? I'm getting this error. |
@Hasna1994 Could you please open a new issue with a reproducible example? |
@trivialfis
I turned on early stopping, and even for just the single-node 2-GPU case I'm getting this error.
It's related to #6272, but I get this allreduce error without the other error, so I wanted to post. However, if one ensures the eval_set has sufficient partitions across the dask workers, one does not hit this problem. The error is a bit confusing by itself, but the worker logs show other errors, like an empty dataset.
Maybe the error provided by xgboost can be improved.