This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

AttributeError when using GridSearchCV with XGBClassifier #31

Open
mateuszkaleta opened this issue Nov 6, 2018 · 12 comments · May be fixed by #28

Comments

@mateuszkaleta

Hello,

I'm working on a small proof of concept. I already use dask in my project and would like to use XGBClassifier, and I also need parameter search and, of course, cross-validation.

Unfortunately, when fitting the dask_xgboost.XGBClassifier, I get the following error:

Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_xgboost\core.py", line 97, in _train
AttributeError: 'DataFrame' object has no attribute 'to_delayed'

Although I call .fit() with two dask objects, the data somehow ends up as a pandas.DataFrame later on.

Here's the code I'm using:

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from dask_xgboost import XGBClassifier
from distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':

    # Start a local distributed scheduler.
    client = Client()

    data = load_iris()

    x = pd.DataFrame(data=data['data'], columns=data['feature_names'])
    x = dd.from_pandas(x, npartitions=2)

    y = pd.Series(data['target'])
    y = dd.from_pandas(y, npartitions=2)

    # Grid-search over n_estimators for the dask-xgboost classifier.
    estimator = XGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
        scheduler='threads'
    )

    grid_search.fit(x, y)
    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())

I'm using the following package versions:

pandas==0.23.3
numpy==1.15.1
dask==0.20.0
dask-ml==0.11.0
dask-xgboost==0.1.5

Note that I don't get this exception when using sklearn.ensemble.GradientBoostingClassifier.

Any help would be appreciated.

Mateusz

@TomAugspurger
Member

Can you try with master? Older versions didn't properly handle pandas / numpy objects passed to train, but I think that's fixed now.

Will try to get a release out soon.

@mrocklin
Member

mrocklin commented Nov 6, 2018 via email

@TomAugspurger
Member

I assume by "dask-ml estimators" you mean dask data objects? dask_ml.model_selection.GridSearchCV should work fine with either, but it does require that the underlying estimator being searched over supports whatever is passed to it (and doesn't blow up memory).

When dask_xgboost encounters a pandas or NumPy object, it just trains the Booster locally. I wonder whether that should instead be done on a worker, in case you have resources like a GPU that you want to use.
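(For illustration only, not the actual dask_xgboost code: the dispatch described above amounts to something like the sketch below. The helper name train_dispatch is made up; dask collections go to the distributed trainer, while in-memory pandas / NumPy input trains a plain Booster locally.)

import dask.dataframe as dd
import xgboost as xgb


def train_dispatch(client, params, X, y, num_boost_round=10):
    # Sketch of the behaviour described above: dask collections are
    # trained through dask_xgboost, in-memory data trains locally.
    if isinstance(X, (dd.DataFrame, dd.Series)):
        import dask_xgboost
        return dask_xgboost.train(client, params, X, y,
                                  num_boost_round=num_boost_round)
    # pandas / NumPy input: build a DMatrix and train a local Booster.
    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)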

@mateuszkaleta
Author

Thanks for the response.

> Can you try with master? Older versions didn't properly handle pandas / numpy objects passed to train, but I think that's fixed now.

Okay, I've tried with master, but now another problem appears:

Traceback (most recent call last):
  File "C:/(...)/aijin-prescoring/aijin/prescoring/sandbox/prediction/xgboost_poc/dask_xgb_sample_fail.py", line 30, in <module>
    grid_search.fit(x, y)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\_search.py", line 1200, in fit
    out = scheduler(dsk, keys, num_workers=n_jobs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 501, in get_async
    raise_exception(exc, tb)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\compatibility.py", line 112, in reraise
    raise exc
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 272, in execute_task
    result = _execute_task(task, data)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 253, in _execute_task
    return func(*args2)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 322, in fit_and_score
    est_and_time = fit(est, X_train, y_train, error_score, fields, params, fit_params)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 242, in fit
    est.fit(X, y, **fit_params)
  File "C:\(...)\dask-xgboost-master\dask_xgboost\core.py", line 326, in fit
    classes = classes.compute()
AttributeError: 'numpy.ndarray' object has no attribute 'compute'
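(For reference, the failing line calls .compute() unconditionally. A guard along the lines of the sketch below, which only computes actual dask collections, avoids this particular error; it is not necessarily how #28 fixes it.)

import dask
import numpy as np


def concrete_classes(classes):
    # Only dask collections have .compute(); NumPy arrays are already
    # concrete, so return them unchanged.
    if dask.is_dask_collection(classes):
        return classes.compute()
    return np.asarray(classes)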

@TomAugspurger
Member

TomAugspurger commented Nov 7, 2018

Whoops, I've accidentally been running your script on my branch for #28, which fixes this exact issue :) I didn't realize it wasn't merged yet.

I'm going to kick off the CI again and then merge it.

@TomAugspurger linked a pull request Nov 7, 2018 that will close this issue
@mateuszkaleta
Author

Hah, glad to read this!

Thank you.

@ajdani

ajdani commented Jan 23, 2019

Hi!

Are there any updates on this issue?

I'm hitting the same problem, and the PR unfortunately did not get merged because the CI pipeline failed.

@mateuszkaleta
Author

mateuszkaleta commented Jan 23, 2019

I don't know whether it will work for you, but you might be interested in xgboost's own external memory API (see the sketch after this comment).

I ended up searching hyperparameters with hyperopt and training on the large data via the external memory API, reading the data from multiple CSV files without dask (currently I use dask only for the preprocessing part).
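(For anyone curious, a minimal sketch of that external memory usage follows; the file name, cache prefix, and parameters are placeholders.)

import xgboost as xgb

# Appending '#<cache_prefix>' to a libsvm file path tells xgboost to
# stream the data from disk instead of loading it fully into memory.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
params = {'objective': 'multi:softmax', 'num_class': 3}
booster = xgb.train(params, dtrain, num_boost_round=50)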

@quartox

quartox commented Mar 1, 2019

I was able to install the branch from #28 and it works for my use case. @TomAugspurger I'd be interested in helping solve the CI problems, but I don't know where to begin (the error is in multiprocessing when using distributed.utils_test.cluster), so if you'd welcome help and are willing to point me in the right direction, just ping me. No worries if that's more trouble than it's worth.

@TomAugspurger
Member

I spent another couple of hours on this with no luck... It's just hard to work around xgboost's behavior of essentially calling sys.exit(0) when you try to init its workers twice within a thread. In theory, keeping the initialization state as a thread-local should suffice, but I haven't been able to make that work yet, sorry. I don't think I'll have any more time to work on it this week.
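(In case it helps anyone picking this up, the thread-local idea is roughly the sketch below; it is not the dask-xgboost code, just the general pattern of initializing at most once per thread.)

import threading

_local_state = threading.local()


def init_once_per_thread(init_fn, *args, **kwargs):
    # Run the worker initialization at most once per thread, so a second
    # fit in the same thread does not trigger a re-initialization.
    if not getattr(_local_state, 'initialized', False):
        init_fn(*args, **kwargs)
        _local_state.initialized = True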

FYI, the sparse tests seem to have started failing with the latest xgboost. They're no longer being duck-typed as sparse arrays.

@quartox mentioned this issue Mar 11, 2019
@pasayatpravat

Is there any update on this issue? I am also encountering the same problem.

@TomAugspurger
Member

Still open. You can apply #28. IIRC there are some issues with the CI / testing on master, but no one has had time to resolve them (LMK if you're interested in working on it).
