Handle empty dataset in ranking metric. #6272
Comments
Also, this problem only happens with multiple GPUs. It is probably related to #6268, since that issue is also eval_set-specific and only happens with multiple GPUs.
I should clarify that to work around the #6268 issue that blocks the script from even running, I'm appending the following to the bottom of site-packages/xgboost/sklearn.py:
This redefines the label encoder as a dummy that does nothing, and I ensure the y values are already correctly label encoded (otherwise it fails with an error saying labels should be encoded from 0 to num_classes - 1).
The work-around for now is to not pass eval_set: since there is no callback support anyway, its only purpose would be to print the score in verbose mode, which is not useful except for debugging.
This should be caused by #6268 (comment) with the ranking metric.
@trivialfis FYI I still get this with the latest 1.3.1 or 1.4.0.
See these messages:
and
I only see these since enabling early stopping (the above was 1 node with 2 GPUs).
Yup. I haven't been able to get to those metrics. But will look into them in 1.4. |
It's not critical; I just happened to use a chunk size derived from X for the eval_set, and the eval_set was sufficiently smaller that some workers ended up with no data. Quick question: is there any constraint in xgboost dask on the number of partitions for X/y/sample_weight vs. eval_set/sample_weight_eval_set? I'm assuming the two groups can have different numbers of partitions, while within each group the partition counts should match. Is that right?
Different dask DMatrix objects do not impose constraints on each other. Your assumption is right.
Hmm, as in #6551, I'm hitting this error even when I ensure X/y and valid_X/valid_y have many partitions per worker. With just 2 GPUs on 1 node, I sometimes hit this error for no apparent reason.
As in the other issue, this happens despite having plenty of partitions. I print the partition counts for the X/y group and the valid_X/valid_y group:
You can see this is for the relevant time window.
Note that in the to_dask() function I have, I basically:
I use the exact same function for valid_X/valid_y/sample_weight_eval_set. So I'm fairly convinced that sometimes, despite having several partitions per worker that should be scattered upon persist(), either they are not actually scattered (a bug in distributed/dask) or xgboost is mismanaging the pieces of data and erroneously seeing empty frames.
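One way to narrow down whether empty frames really reach xgboost would be to check per-partition row counts right before training. In dask the counts can be obtained with `df.map_partitions(len).compute()`; the check itself is plain Python. This is a diagnostic sketch of my own, not anything xgboost provides:

```python
def assert_no_empty_partitions(lengths, name="X"):
    """Raise if any partition is empty.

    `lengths` is a list of per-partition row counts, e.g. the result of
    df.map_partitions(len).compute() on a dask DataFrame.
    """
    empty = [i for i, n in enumerate(lengths) if n == 0]
    if empty:
        raise ValueError(
            f"{name} has {len(empty)} empty partition(s) at indices {empty}; "
            "repartition before handing it to dask-xgboost"
        )
    return True
```

If this passes immediately before fit() and the error still occurs, that would point at data movement after persist() rather than at the input partitioning.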
FYI I found the exact pickle (we save pickled state when bad things happen) and re-ran, and it doesn't fail in the same way, even with the same 2-node cluster still running other work. So there is some inconsistent behavior. I could share what I have, but it's wrapped up too much in our local code, and since it doesn't reproduce, it wouldn't help.
In case info helps:
Same setup as here: #6232 (comment)
but now in xgboost 1.2.0 (not 1.1.0) when I run:
I get many warnings:
etc.
If I uncomment the commented-out parts for the eval_set, I get more warnings:
etc.
Crucially, if I add an eval_metric like:
This issue becomes fatal:
I'm guessing this is related to #6232 and the recent callback fixes, but I can't be sure.
@teju85