GPU errors during multiple Optuna trials #8198

cmunch1 · 2022-08-23T16:18:04Z

I keep getting XGBoost GPU errors after a few Optuna trials. I ran my code on Windows, Ubuntu, and Paperspace Gradient, with various GPUs, and got the same result each time.

So I tried a public Kaggle notebook that apparently worked in the past (and is unrelated to my data), and it appears to trigger the same errors after a few trials:
https://www.kaggle.com/code/tunguz/tps-mar-2021-xgb-gpu-le-optuna

trivialfis · 2022-08-23T17:32:56Z

Could you please share the version of xgboost and optuna?

trivialfis · 2022-08-23T17:34:30Z

Also, could you please share the error trace? I'm running the notebook right now but haven't seen the error yet.

cmunch1 · 2022-08-23T17:57:55Z

This is what I'm getting from my code on Paperspace Gradient:
XGB version: 1.6.1
Optuna version: 2.10.1

[W 2022-08-23 17:53:05,034] Trial 21 failed because of the following error: XGBoostError('[17:53:05] /opt/conda/envs/rapids/conda-bld/work/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [17:53:05] /opt/conda/envs/rapids/conda-bld/work/src/tree/param.h:219: Check failed: n_nodes != 0 (0 vs. 0) : \nStack trace:\n [bt] (0) /opt/conda/envs/rapids/lib/libxgboost.so(+0x42265f) [0x7fa4a504965f]\n [bt] (1) /opt/conda/envs/rapids/lib/libxgboost.so(+0x428454) [0x7fa4a504f454]\n [bt] (2) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerDevice<xgboost::detail::GradientPairInternal >::GPUHistMakerDevice(int, xgboost::EllpackPageImpl const*, xgboost::common::Span<xgboost::FeatureType const, 18446744073709551615ul>, unsigned int, xgboost::tree::TrainParam, unsigned int, unsigned int, xgboost::BatchParam)+0x818) [0x7fa4a537dda8]\n [bt] (3) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::InitDataOnce(xgboost::DMatrix*)+0x264) [0x7fa4a537e474]\n [bt] (4) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x232) [0x7fa4a5387d02]\n [bt] (5) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0x19c) [0x7fa4a4f67f2c]\n [bt] (6) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::PredictionCacheEntry)+0x516) [0x7fa4a4f6c6f6]\n [bt] (7) 
/opt/conda/envs/rapids/lib/libxgboost.so(+0x35d5ea) [0x7fa4a4f845ea]\n [bt] (8) /opt/conda/envs/rapids/lib/libxgboost.so(XGBoosterUpdateOneIter+0x7c) [0x7fa4a4e31cbc]\n\n\n\nStack trace:\n [bt] (0) /opt/conda/envs/rapids/lib/libxgboost.so(+0x741467) [0x7fa4a5368467]\n [bt] (1) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x77d) [0x7fa4a538824d]\n [bt] (2) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0x19c) [0x7fa4a4f67f2c]\n [bt] (3) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::PredictionCacheEntry)+0x516) [0x7fa4a4f6c6f6]\n [bt] (4) /opt/conda/envs/rapids/lib/libxgboost.so(+0x35d5ea) [0x7fa4a4f845ea]\n [bt] (5) /opt/conda/envs/rapids/lib/libxgboost.so(XGBoosterUpdateOneIter+0x7c) [0x7fa4a4e31cbc]\n [bt] (6) /opt/conda/envs/rapids/lib/python3.8/lib-dynload/../../libffi.so.8(+0x6a4a) [0x7fa4f5362a4a]\n [bt] (7) /opt/conda/envs/rapids/lib/python3.8/lib-dynload/../../libffi.so.8(+0x5fea) [0x7fa4f5361fea]\n [bt] (8) /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x9d2) [0x7fa4f537bdb2]\n\n')
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/tmp/ipykernel_342/180996506.py", line 40, in objective
    model = xgb.train(xgb_params, train_dmatrix)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/training.py", line 181, in train
    bst.update(dtrain, i, obj)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/core.py", line 1733, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/xgboost/core.py", line 203, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [17:53:05] /opt/conda/envs/rapids/conda-bld/work/src/tree/updater_gpu_hist.cu:712: Exception in gpu_hist: [17:53:05] /opt/conda/envs/rapids/conda-bld/work/src/tree/param.h:219: Check failed: n_nodes != 0 (0 vs. 0) :
Stack trace:
[bt] (0) /opt/conda/envs/rapids/lib/libxgboost.so(+0x42265f) [0x7fa4a504965f]
[bt] (1) /opt/conda/envs/rapids/lib/libxgboost.so(+0x428454) [0x7fa4a504f454]
[bt] (2) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerDevice<xgboost::detail::GradientPairInternal >::GPUHistMakerDevice(int, xgboost::EllpackPageImpl const, xgboost::common::Span<xgboost::FeatureType const, 18446744073709551615ul>, unsigned int, xgboost::tree::TrainParam, unsigned int, unsigned int, xgboost::BatchParam)+0x818) [0x7fa4a537dda8]
[bt] (3) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::InitDataOnce(xgboost::DMatrix)+0x264) [0x7fa4a537e474]
[bt] (4) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x232) [0x7fa4a5387d02]
[bt] (5) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0x19c) [0x7fa4a4f67f2c]
[bt] (6) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::PredictionCacheEntry)+0x516) [0x7fa4a4f6c6f6]
[bt] (7) /opt/conda/envs/rapids/lib/libxgboost.so(+0x35d5ea) [0x7fa4a4f845ea]
[bt] (8) /opt/conda/envs/rapids/lib/libxgboost.so(XGBoosterUpdateOneIter+0x7c) [0x7fa4a4e31cbc]

Stack trace:
[bt] (0) /opt/conda/envs/rapids/lib/libxgboost.so(+0x741467) [0x7fa4a5368467]
[bt] (1) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x77d) [0x7fa4a538824d]
[bt] (2) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0x19c) [0x7fa4a4f67f2c]
[bt] (3) /opt/conda/envs/rapids/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::PredictionCacheEntry)+0x516) [0x7fa4a4f6c6f6]
[bt] (4) /opt/conda/envs/rapids/lib/libxgboost.so(+0x35d5ea) [0x7fa4a4f845ea]
[bt] (5) /opt/conda/envs/rapids/lib/libxgboost.so(XGBoosterUpdateOneIter+0x7c) [0x7fa4a4e31cbc]
[bt] (6) /opt/conda/envs/rapids/lib/python3.8/lib-dynload/../../libffi.so.8(+0x6a4a) [0x7fa4f5362a4a]
[bt] (7) /opt/conda/envs/rapids/lib/python3.8/lib-dynload/../../libffi.so.8(+0x5fea) [0x7fa4f5361fea]
[bt] (8) /opt/conda/envs/rapids/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x9d2) [0x7fa4f537bdb2]

trivialfis · 2022-08-23T17:59:53Z

Ah, could you please limit the max_depth to 29? With 1.6.2, this can be 30. Fixed in #8098.
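
One way to apply the suggested limit inside an Optuna objective is to cap the upper bound of the max_depth search range according to the installed XGBoost version. The following is only a minimal sketch: the helper names (parse_version, max_depth_cap, suggest_max_depth) and the default bounds are hypothetical, and the caps of 29 and 30 are taken directly from the reply above rather than from the XGBoost source. The only Optuna call it relies on is trial.suggest_int.

```python
import re

# Assumption from the reply above: gpu_hist accepts max_depth up to 29
# before XGBoost 1.6.2, and up to 30 from 1.6.2 onwards.
FIXED_IN = (1, 6, 2)


def parse_version(version):
    """Return the leading numeric components, e.g. '1.6.1' -> (1, 6, 1)."""
    parts = re.findall(r"\d+", version)[:3]
    return tuple(int(p) for p in parts)


def max_depth_cap(xgb_version):
    """Largest max_depth to try for the given XGBoost version string."""
    return 30 if parse_version(xgb_version) >= FIXED_IN else 29


def suggest_max_depth(trial, xgb_version, low=3, high=30):
    """Ask the Optuna trial for a max_depth that stays within the cap.

    `trial` is expected to be an optuna.trial.Trial (anything exposing
    suggest_int(name, low, high) works); `low` and `high` are
    illustrative defaults, not values from the original notebook.
    """
    upper = min(high, max_depth_cap(xgb_version))
    return trial.suggest_int("max_depth", low, upper)
```

In an objective function this could be called as suggest_max_depth(trial, xgb.__version__), so the search space narrows on 1.6.1 and widens once the environment is upgraded.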

trivialfis closed this as completed Aug 23, 2022
