Handle empty dataset in ranking metric. #6272

Closed
pseudotensor opened this issue Oct 22, 2020 · 12 comments · Fixed by #7297

@pseudotensor
Contributor

Same setup as here: #6232 (comment)

but now with xgboost 1.2.0 (not 1.1.0), when I run:

import pandas as pd
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            target = "default payment next month"
            Xpd = pd.read_csv("creditcard.csv")
            Xpd = Xpd[['AGE', target]]
            Xpd.to_csv("creditcard_1.csv")
            X = dask_cudf.read_csv("creditcard_1.csv")
            y = X[target]
            X = X.drop(target, axis=1)

            kwargs_fit = {}
            kwargs_cudf_fit = kwargs_fit.copy()

            valid_X = dask_cudf.read_csv("creditcard_1.csv")
            valid_y = valid_X[target]
            valid_X = valid_X.drop(target, axis=1)
            kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]

            params = {}  # copy.deepcopy(self.model.get_params())
            params['tree_method'] = 'gpu_hist'

            dask_model = xgb.dask.DaskXGBClassifier(**params)
            dask_model.fit(X, y, verbose=True) #, eval_set=kwargs_cudf_fit.get('eval_set'),
#                           sample_weight_eval_set=kwargs_cudf_fit.get('sample_weight_eval_set'), verbose=True)
            print("here")

if __name__ == '__main__':
    fun()

I get many warnings:

[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:33:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.

etc.

If I uncomment the commented-out parts for the eval_set, I get more warnings:

[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty
[23:36:56] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
[23:36:56] WARNING: /root/repo/xgboost/src/metric/elementwise_metric.cu:336: label set is empty

etc.

Crucially, if I add an eval_metric like:


import pandas as pd
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            target = "default payment next month"
            Xpd = pd.read_csv("creditcard.csv")
            Xpd = Xpd[['AGE', target]]
            Xpd.to_csv("creditcard_1.csv")
            X = dask_cudf.read_csv("creditcard_1.csv")
            y = X[target]
            X = X.drop(target, axis=1)

            kwargs_fit = {}
            kwargs_cudf_fit = kwargs_fit.copy()

            valid_X = dask_cudf.read_csv("creditcard_1.csv")
            valid_y = valid_X[target]
            valid_X = valid_X.drop(target, axis=1)
            kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]

            params = {}  # copy.deepcopy(self.model.get_params())
            params['tree_method'] = 'gpu_hist'
            params['eval_metric'] = 'auc'

            dask_model = xgb.dask.DaskXGBClassifier(**params)
            dask_model.fit(X, y, eval_set=kwargs_cudf_fit.get('eval_set'),
                           sample_weight_eval_set=kwargs_cudf_fit.get('sample_weight_eval_set'), verbose=True)
            print("here")

if __name__ == '__main__':
    fun()

This issue becomes fatal:

/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py:3479: RuntimeWarning: coroutine 'Client._update_scheduler_info' was never awaited
  self.sync(self._update_scheduler_info)
task [xgboost.dask]:tcp://127.0.0.1:34473 connected to the tracker
task [xgboost.dask]:tcp://127.0.0.1:35613 connected to the tracker
task [xgboost.dask]:tcp://127.0.0.1:34473 got new rank 0
task [xgboost.dask]:tcp://127.0.0.1:35613 got new rank 1
worker tcp://127.0.0.1:35613 has an empty DMatrix.  All workers associated with this DMatrix: {'tcp://127.0.0.1:34473'}
worker tcp://127.0.0.1:35613 has an empty DMatrix.  All workers associated with this DMatrix: {'tcp://127.0.0.1:34473'}
[23:46:04] WARNING: /root/repo/xgboost/src/objective/regression_obj.cu:59: Label set is empty.
task [xgboost.dask]:tcp://127.0.0.1:34473 connected to the tracker
distributed.worker - WARNING -  Compute Failed
Function:  dispatched_train
args:      ('tcp://127.0.0.1:34473', {'feature_names': None, 'feature_types': None, 'has_label': True, 'has_weights': False, 'missing': nan, 'worker_map': defaultdict(<class 'list'>, {'tcp://127.0.0.1:34473': [<Future: finished, key: tuple-d5f09e67-afdf-4395-8eaa-5744207937cb>]}), 'is_quantile': False}, [({'feature_names': None, 'feature_types': None, 'has_label': True, 'has_weights': False, 'missing': nan, 'worker_map': defaultdict(<class 'list'>, {'tcp://127.0.0.1:34473': [<Future: finished, key: tuple-1b7129e7-1910-4ace-9220-3c27ed63fbac>]}), 'is_quantile': False}, 'validation_0')])
kwargs:    {}
Exception: XGBoostError('[23:46:05] /root/repo/xgboost/src/metric/rank_metric.cc:242: Check failed: info.labels_.Size() != 0U (0 vs. 0) : label set cannot be empty\nStack trace:\n  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6a) [0x7f3444028b3a]\n  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAuc::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0xde2) [0x7f344415bee2]\n  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x46f) [0x7f3444133acf]\n  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x323) [0x7f3444032463]\n  [bt] (4) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f35331f0630]\n  [bt] (5) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f35331effed]\n  [bt] (6) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f3533206f9e]\n  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5) [0x7f35332079d5]\n  [bt] (8) dask-worker [tcp://127.0.0.1:35613](_PyObject_FastCallDict+0x8b) [0x556257d3500b]\n\n',)

task [xgboost.dask]:tcp://127.0.0.1:34473 got new rank 0
/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py:4773: RuntimeWarning: coroutine 'Client._close' was never awaited
  c.close(timeout=2)
/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py:4773: RuntimeWarning: coroutine 'Client._close' was never awaited
  c.close(timeout=2)
Traceback (most recent call last):
  File "dask_cudf_scitkit_example.py", line 38, in <module>
    fun()
  File "dask_cudf_scitkit_example.py", line 34, in fun
    sample_weight_eval_set=kwargs_cudf_fit.get('sample_weight_eval_set'), verbose=True)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 1080, in fit
    eval_set, sample_weight_eval_set, verbose)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 824, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
    raise exc.with_traceback(tb)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
    result[0] = yield future
  File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 1067, in _fit_async
    evals=evals, verbose_eval=verbose)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 633, in _train_async
    results = await client.gather(futures)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1833, in _gather
    raise exception.with_traceback(traceback)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 618, in dispatched_train
    **kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/training.py", line 222, in train
    xgb_model=xgb_model, callbacks=callbacks)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/training.py", line 85, in _train_internal
    bst_eval_set = bst.eval_set(evals, i, feval)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/core.py", line 1230, in eval_set
    ctypes.byref(msg)))
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/core.py", line 188, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [23:46:05] /root/repo/xgboost/src/metric/rank_metric.cc:242: Check failed: info.labels_.Size() != 0U (0 vs. 0) : label set cannot be empty
Stack trace:
  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6a) [0x7f3444028b3a]
  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAuc::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0xde2) [0x7f344415bee2]
  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x46f) [0x7f3444133acf]
  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x323) [0x7f3444032463]
  [bt] (4) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f35331f0630]
  [bt] (5) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f35331effed]
  [bt] (6) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f3533206f9e]
  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5) [0x7f35332079d5]
  [bt] (8) dask-worker [tcp://127.0.0.1:35613](_PyObject_FastCallDict+0x8b) [0x556257d3500b]

I'm guessing this is related to #6232 and the recent callback fixes, but I can't be sure.
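For what it's worth, a quick diagnostic sketch (my own rough illustration, not part of xgboost or the script above) of how one could check whether any partition ends up empty, assuming map_partitions(len) works on dask_cudf frames the same way it does on plain dask dataframes:

# Hypothetical diagnostic: count rows per partition of the dask_cudf frame
# to check whether some partitions (and hence some workers) end up empty.
part_sizes = X.map_partitions(len).compute()
print("rows per partition:", list(part_sizes))
print("empty partitions:", sum(1 for n in part_sizes if n == 0))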

@teju85

@pseudotensor
Contributor Author

pseudotensor commented Oct 22, 2020

Also, this problem only happens with multiple GPUs. It is probably also related to #6268, since that issue is likewise eval_set-specific and only happens with multiple GPUs.

@pseudotensor
Contributor Author

I should clarify that, to work around the #6268 issue that blocks that script from even running, I'm appending the following to the bottom of site-packages/xgboost/sklearn.py:

class XGBoostLabelEncoder(object):
    def __init__(self):
        pass
    
    def fit(self, input_array, y=None):
        return self
    
    def transform(self, input_array, y=None):
        return input_array

    def fit_transform(self, input_array, y=None):
        return input_array

    def inverse_transform(self, input_array, y=None):
        return input_array

This redefines the label encoder to a dummy that does nothing, and I make sure that I label-encode beforehand or already have correct y values (otherwise it fails with an error saying labels should be encoded from 0 to num_classes - 1).
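For reference, a rough sketch of the kind of manual encoding I mean (encode_labels is just a hypothetical helper for illustration, not part of xgboost), assuming the target is still a pandas Series at that point:

import pandas as pd

# Hypothetical helper: map arbitrary class values to 0..num_classes-1,
# which is what the booster expects once the built-in encoder is bypassed.
def encode_labels(y: pd.Series):
    classes = sorted(y.unique())
    mapping = {c: i for i, c in enumerate(classes)}
    return y.map(mapping).astype("int32"), mapping

# y_enc, mapping = encode_labels(Xpd[target])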

@pseudotensor
Contributor Author

The work-around for now is to not pass the eval_set: since there is no callback support anyway, its only purpose would be to print out the score in verbose mode, which is not useful except for debugging.

@trivialfis
Member

This should be caused by #6268 (comment) in combination with the ranking metric.

@trivialfis trivialfis changed the title dask_cudf: Label set is empty and Check failed: info.labels_.Size() != 0U (0 vs. 0) : label set cannot be empty Handle empty dataset in ranking metric. Oct 29, 2020
@pseudotensor
Contributor Author

pseudotensor commented Dec 24, 2020

@trivialfis FYI, I still get this with the latest 1.3.1 or 1.4.0.

Exception: XGBoostError('[03:49:27] /workspace/xgboost/src/metric/rank_metric.cc:611: Check failed: info.labels_.Size() != 0U (0 vs. 0) : label set cannot be empty\nStack trace:\n  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x54) [0x14f237b04f64]\n  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAucPR::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0x9cb) [0x14f237c5790b]\n  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x4f4) [0x14f237c2e964]\n  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x22d) [0x14f237b0cb6d]\n  [bt] (4) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x14f31c200630]\n  [bt] (5) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x14f31c1fffed]\n  [bt] (6) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x14f31c216f9e]\n  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5) [0x14f31c2179d5]\n  [bt] (8) dask-worker [tcp://172.16.2.210:38823](_PyObject_FastCallDict+0x8b) [0x555f2de8000b]\n\n',)
Exception: XGBoostError('[04:11:10] /workspace/xgboost/src/metric/rank_metric.cc:242: Check failed: info.labels_.Size() != 0U (0 vs. 0) : label set cannot be empty\nStack trace:\n  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x54) [0x14f237b04f64]\n  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAuc::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0x9cb) [0x14f237c58a8b]\n  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x4f4) [0x14f237c2e964]\n  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x22d) [0x14f237b0cb6d]\n  [bt] (4) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x14f31c200630]\n  [bt] (5) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x14f31c1fffed]\n  [bt] (6) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x14f31c216f9e]\n  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5) [0x14f31c2179d5]\n  [bt] (8) dask-worker [tcp://172.16.2.210:38823](_PyObject_FastCallDict+0x8b) [0x555f2de8000b]\n\n',)

See these messages:

[03:48:39] WARNING: /workspace/xgboost/src/learner.cc:1219: Empty dataset at worker: 1

and

Exception: XGBoostError('[03:49:27] /workspace/xgboost/rabit/include/rabit/internal/utils.h:90: Allreduce failed',)

I only see these since enabling early stopping (the run above was on 1 node with 2 GPUs).

@trivialfis
Member

Yup. I haven't been able to get to those metrics yet, but I will look into them in 1.4.

@pseudotensor
Contributor Author

It's not critical; I just happen to be using the same chunksize for eval_set as for X, and the eval_set was sufficiently smaller that there was not enough data for each worker.

Quick question: is there any constraint in xgboost dask on the number of partitions for X/y/sample_weight vs. eval_set/sample_weight_eval_set?

I'm assuming each of those two groups can have a different number of partitions, while within each group they should have the same. Is that right?
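For concreteness, something like this is what I have in mind (hypothetical partition counts, reusing the X/y and valid_X/valid_y from the scripts above):

# Hypothetical illustration: the training group and the eval_set group are
# repartitioned independently; only members of the same group need to match.
X = X.repartition(npartitions=16)
y = y.repartition(npartitions=16)              # must match X
valid_X = valid_X.repartition(npartitions=4)
valid_y = valid_y.repartition(npartitions=4)   # must match valid_X
# dask_model.fit(X, y, eval_set=[(valid_X, valid_y)], verbose=True)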

@trivialfis
Member

Different dask DMatrix objects do not impose constraints on each other. Your assumption is right.

@pseudotensor
Contributor Author

Hmm, as in #6551, I'm hitting this error even when I ensure X/y and valid_X/valid_y have many partitions for the workers.

Just 2 GPUs on 1 node, and sometimes, for no apparent reason, I hit this error.

[19:17:59] WARNING: /workspace/xgboost/src/learner.cc:1219: Empty dataset at worker: 1
[19:17:59] WARNING: /workspace/xgboost/src/learner.cc:1219: Empty dataset at worker: 1
2020-12-24 19:18:00,283 - distributed.worker - WARNING -  Compute Failed
Function:  dispatched_train
args:      ('tcp://172.16.2.210:33765', [b'DMLC_NUM_WORKER=2', b'DMLC_TRACKER_URI=172.16.2.210', b'DMLC_TRACKER_PORT=9091', b'DMLC_TASK_ID=[xgboost.dask]:tcp://172.16.2.210:33765'], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'meta_names': ['labels'], 'missing': nan, 'parts': [(          0_v1   100_v88   101_v89  ...    98_v86     99_v87    9_v108
0     1.335739  3.321300  0.095678  ...  0.866426   9.551836  2.382692
1          NaN       NaN  2.678584  ...       NaN   9.848003  1.825361
2     0.943877  3.367346  0.111388  ...  1.071429   8.447465  1.375753
3     0.797415  1.408046  0.039051  ...  1.242817  10.747144  2.230754
4          NaN       NaN       NaN  ...       NaN        NaN       NaN
...        ...       ...       ...  ...       ...        ...       ...
5711  1.267744  1.145375  0.054302  ...  1.174744   8.252816  1.779999
5712       NaN       NaN       NaN  ...       NaN        NaN       NaN
5713       NaN       NaN       NaN  ...       NaN        NaN     
kwargs:    {}
Exception: XGBoostError('[19:17:59] /workspace/xgboost/src/metric/rank_metric.cc:611: Check failed: info.labels_.Size() != 0U (0 vs. 0) : label set cannot be empty\nStack trace:\n  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x54) [0x14f055b64f64]\n  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::metric::EvalAucPR::Eval(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, bool)+0x9cb) [0x14f055cb790b]\n  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::EvalOneIter(int, std::vector<std::shared_ptr<xgboost::DMatrix>, std::allocator<std::shared_ptr<xgboost::DMatrix> > > const&, std::vector<std::string, std::allocator<std::string> > const&)+0x4f4) [0x14f055c8e964]\n  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterEvalOneIter+0x22d) [0x14f055b6cb6d]\n  [bt] (4) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x14f13201c630]\n  [bt] (5) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x14f13201bfed]\n  [bt] (6) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x14f132032f9e]\n  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5) [0x14f1320339d5]\n  [bt] (8) dask-worker [tcp://172.16.2.210:33765](_PyObject_FastCallDict+0x8b) [0x55bbdf1e500b]\n\n',)

2020-12-24 19:18:00,360 - distributed.worker - WARNING -  Compute Failed
Function:  dispatched_train
args:      ('tcp://172.16.2.210:37553', [b'DMLC_NUM_WORKER=2', b'DMLC_TRACKER_URI=172.16.2.210', b'DMLC_TRACKER_PORT=9091', b'DMLC_TASK_ID=[xgboost.dask]:tcp://172.16.2.210:37553'], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'meta_names': ['labels'], 'missing': nan, 'parts': [(           0_v1   100_v88   101_v89  ...    98_v86     99_v87    9_v108
5716   1.256061  0.969933  1.058731  ...  2.056256  13.306870  2.642831
5717   2.107512  3.972944  1.589945  ...  0.640798   9.601439  1.986946
5718   1.232747  2.627320  2.895753  ...  0.713946  10.347195  1.876453
5719        NaN       NaN       NaN  ...       NaN        NaN       NaN
5720   1.672046  0.335700  1.455212  ...  2.537121  11.874986  1.966222
...         ...       ...       ...  ...       ...        ...       ...
11427       NaN       NaN       NaN  ...       NaN        NaN       NaN
11428  0.497645  1.950236  4.318180  ...  1.735036  10.435363  2.076428
11429       NaN       NaN       NaN  ...       NaN      
kwargs:    {}
Exception: XGBoostError('[19:17:59] /workspace/xgboost/rabit/include/rabit/internal/utils.h:90: Allreduce failed',)

Like the other issue, this happens despite having plenty of partitions. I print the partition counts for the X/y group and the valid_X/valid_y group:

2020-12-24 19:17:55,954 C: NA  D:  NA    M:  NA    NODE:SERVER      19354  PDEBUG | ('num_workers: 2',)
2020-12-24 19:17:56,319 C: NA  D:  NA    M:  NA    NODE:SERVER      19354  PDEBUG | to_dask duration for X_shape=(91457, 128): 0.000324011 0.306071 0.0581458
2020-12-24 19:17:56,332 C: NA  D:  NA    M:  NA    NODE:SERVER      19354  PDEBUG | ('Xy npartitions: 16 16',)
2020-12-24 19:17:56,585 C: NA  D:  NA    M:  NA    NODE:SERVER      19354  PDEBUG | to_dask duration for X_shape=(22864, 128): 0.000548124 0.192724 0.0530603
2020-12-24 19:17:56,585 C: NA  D:  NA    M:  NA    NODE:SERVER      19354  PDEBUG | ('valid Xy npartitions: 4 4',)

You can see this is for the relevant time window.

@pseudotensor
Contributor Author

Note that in the to_dask() function I have, I basically:

  1. put y and sample_weight (if it exists) together with X and call it X_pd
  2. X_dask = dd.from_pandas(X_pd, chunksize=chunksize).persist()
  3. X_dask = dask_cudf.from_dask_dataframe(X_dask)
  4. then extract the actual X and y back out of X_dask using drop etc.

I do the exact same thing for valid_X/valid_y/sample_weight_eval_set; a rough sketch of the helper is below.
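Roughly (my own simplified sketch; to_dask, chunksize and sample_weight_col are my names, not xgboost API):

import dask.dataframe as dd
import dask_cudf

def to_dask(X_pd, target, chunksize, sample_weight_col=None):
    # y (and sample_weight, if present) travel together with X in one frame.
    X_dask = dd.from_pandas(X_pd, chunksize=chunksize).persist()
    # Convert the CPU dask dataframe to a dask_cudf (GPU) dataframe.
    X_dask = dask_cudf.from_dask_dataframe(X_dask)
    # Extract y (and sample_weight) back out and drop them from X.
    y = X_dask[target]
    drop_cols = [target]
    sample_weight = None
    if sample_weight_col is not None:
        sample_weight = X_dask[sample_weight_col]
        drop_cols.append(sample_weight_col)
    X = X_dask.drop(columns=drop_cols)
    return X, y, sample_weight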

So I'm pretty convinced that sometimes, despite having several partitions per worker that should be scattered upon "persist", either they are not actually scattered (i.e. a bug in distributed/dask) or xgboost is mis-managing the pieces of data and erroneously seeing empty frames.

@pseudotensor
Contributor Author

FYI, I found the exact pickle (we save pickled state when bad things happen) and re-ran it, and it doesn't fail in the same way, even with the same 2-node cluster still running other work.

So there is some inconsistent behavior.

I could share what I have, but it's too wrapped up in our local code, and since it doesn't reproduce, it wouldn't help much.

@pseudotensor
Contributor Author

pseudotensor commented Dec 25, 2020

In case this info helps:

Xy npartitions: 16 16
valid Xy npartitions: 4 4
DaskXGBClassifier(booster='gbtree', colsample_bytree=0.55,
                  debug_verbose=2,
                  early_stopping_limit=None, early_stopping_rounds=20, eval_metric='aucpr',
                  gamma=0.0, gpu_id=0,
                  grow_policy='lossguide',
                  learning_rate=0.15, max_bin=256,
                  max_delta_step=0.0, max_depth=0, max_leaves=256,
                  min_child_weight=1,
                  n_estimators=600, n_jobs=9, ...)
{'base_score': None, 'booster': 'gbtree', 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': 0.55, 'gamma': 0.0, 'gpu_id': 0, 'importance_type': 'gain', 'interaction_constraints': None, 'learning_rate': 0.15, 'max_delta_step': 0.0, 'max_depth': 0, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 600, 'n_jobs': 9, 'num_parallel_tree': None, 'objective': 'binary:logistic', 'random_state': 278438169, 'reg_alpha': 0.0, 'reg_lambda': 2.0, 'scale_pos_weight': 1.0, 'subsample': 0.5, 'tree_method': 'gpu_hist', 'validate_parameters': None, 'verbosity': None, 'use_label_encoder': False, 'model_class_name': 'XGBoostGBMDaskModel', 'num_class': 1, 'labels': [0, 1], 'score_f_name': 'LOGLOSS', 'time_column': None, 'encoder': None, 'tgc': None, 'pred_gap': None, 'pred_periods': None, 'target': None, 'tsp': None, 'early_stopping_rounds': 20, 'max_bin': 256, 'grow_policy': 'lossguide', 'max_leaves': 256, 'eval_metric': 'aucpr', 'early_stopping_threshold': 1e-05, 'monotonicity_constraints': False, 'silent': 0, 'debug_verbose': 2, 'seed': 278438169, 'disable_gpus': False, 'lossguide': False, 'accuracy': 7, 'time_tolerance': 10, 'interpretability': 1, 'ensemble_level': 3, 'train_shape': (114321, 133), 'valid_shape': None, 'model_origin': 'DefaultIndiv: do_te:True,interp:11,depth:6,num_as_cat:False', 'resumed_experiment_id': 'bedd7566-45e6-11eb-bb81-0cc47adb058f', 'str_uuid': 'ret_ff6609f1-e952-4a09-af96-9388336d482c', 'experiment_description': '3.cineweru', 'train_dataset_name': 'train.csv.zip', 'valid_data_name': '[Valid]', 'test_data_name': '[Test]', 'ngenes': 127, 'ngenes_max': 133, 'uses_gpu': True, 'early_stopping_limit': None}
(Delayed('int-46fdf95d-3bc2-4c36-b9e1-d5c25b2f5bfe'), 127)
(dd.Scalar<size-ag..., dtype=int64>,)
