This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

AttributeError when using GridSearchCV with XGBClassifier #31

Open
mateuszkaleta opened this issue Nov 6, 2018 · 12 comments · May be fixed by #28

Comments

@mateuszkaleta

Hello,

I'm working on a small proof of concept. I already use dask in my project and would like to use XGBClassifier, and I also need parameter search and, of course, cross-validation.

Unfortunately, when fitting the dask_xgboost.XGBClassifier, I get the following error:

Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_xgboost\core.py", line 97, in _train
AttributeError: 'DataFrame' object has no attribute 'to_delayed'

Although I call .fit() with two dask objects, the data somehow ends up as a pandas.DataFrame later on.

Here's the code I'm using:

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from dask_xgboost import XGBClassifier
from distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':

    # Start a local distributed scheduler.
    client = Client()

    data = load_iris()

    x = pd.DataFrame(data=data['data'], columns=data['feature_names'])
    x = dd.from_pandas(x, npartitions=2)

    y = pd.Series(data['target'])
    y = dd.from_pandas(y, npartitions=2)

    # Grid-search over n_estimators for the dask-xgboost classifier.
    estimator = XGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
        scheduler='threads'
    )

    grid_search.fit(x, y)
    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())

I'm using the following package versions:

pandas==0.23.3
numpy==1.15.1
dask==0.20.0
dask-ml==0.11.0
dask-xgboost==0.1.5

Note that I don't get this exception when using sklearn.ensemble.GradientBoostingClassifier.

Any help would be appreciated.

Mateusz

@TomAugspurger
Member

Can you try with master? Older versions didn't properly handle pandas / numpy objects passed to train, but I think that's fixed now.

Will try to get a release out soon.

@mrocklin
Member

mrocklin commented Nov 6, 2018 via email

@TomAugspurger
Member

I assume by "dask-ml estimators" you mean dask data objects? dask_ml.model_selection.GridSearchCV should work fine with either, but it does require that the underlying estimator being searched over supports whatever is passed to it (and doesn't blow up memory).

When dask_xgboost encounters a pandas or NumPy object, it just trains the Booster locally. I wonder whether that should instead be done on a worker, in case you have resources like a GPU that you want to use.
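(For illustration only, not the actual dask_xgboost code: the dispatch described above amounts to something like the sketch below. The helper name train_dispatch is made up; dask collections go to the distributed trainer, while in-memory pandas / NumPy input trains a plain Booster locally.)

import dask.dataframe as dd
import xgboost as xgb


def train_dispatch(client, params, X, y, num_boost_round=10):
    # Sketch of the behaviour described above: dask collections are
    # trained through dask_xgboost, in-memory data trains locally.
    if isinstance(X, (dd.DataFrame, dd.Series)):
        import dask_xgboost
        return dask_xgboost.train(client, params, X, y,
                                  num_boost_round=num_boost_round)
    # pandas / NumPy input: build a DMatrix and train a local Booster.
    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)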

@mateuszkaleta
Author

Thanks for the response.

> Can you try with master? Older versions didn't properly handle pandas / numpy objects passed to train, but I think that's fixed now.

Okay, I've tried with master, but now another problem appears:

Traceback (most recent call last):
  File "C:/(...)/aijin-prescoring/aijin/prescoring/sandbox/prediction/xgboost_poc/dask_xgb_sample_fail.py", line 30, in <module>
    grid_search.fit(x, y)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\_search.py", line 1200, in fit
    out = scheduler(dsk, keys, num_workers=n_jobs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 501, in get_async
    raise_exception(exc, tb)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\compatibility.py", line 112, in reraise
    raise exc
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 272, in execute_task
    result = _execute_task(task, data)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 253, in _execute_task
    return func(*args2)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 322, in fit_and_score
    est_and_time = fit(est, X_train, y_train, error_score, fields, params, fit_params)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 242, in fit
    est.fit(X, y, **fit_params)
  File "C:\(...)\dask-xgboost-master\dask_xgboost\core.py", line 326, in fit
    classes = classes.compute()
AttributeError: 'numpy.ndarray' object has no attribute 'compute'
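(For reference, the failing line calls .compute() unconditionally. A guard along the lines of the sketch below, which only computes actual dask collections, avoids this particular error; it is not necessarily how #28 fixes it.)

import dask
import numpy as np


def concrete_classes(classes):
    # Only dask collections have .compute(); NumPy arrays are already
    # concrete, so return them unchanged.
    if dask.is_dask_collection(classes):
        return classes.compute()
    return np.asarray(classes)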

@TomAugspurger
Member

TomAugspurger commented Nov 7, 2018

Whoops, I've accidentally been running your script on my branch for #28, which fixes this exact issue :) I didn't realize it wasn't merged yet.

I'm going to kick off the CI again and then merge it.

@TomAugspurger linked a pull request Nov 7, 2018 that will close this issue
@mateuszkaleta
Author

Hah, glad to read this!

Thank you.

@ajdani

ajdani commented Jan 23, 2019

Hi!

Are there any updates on this issue?

I'm hitting the same problem, and the PR unfortunately did not get merged because the CI pipeline failed.

@mateuszkaleta
Author

mateuszkaleta commented Jan 23, 2019

I don't know whether it will work for you, but you might be interested in xgboost's own external memory API (see the sketch after this comment).

I ended up searching hyperparameters with hyperopt and training on the large data via the external memory API, reading the data from multiple CSV files without dask (currently I use dask only for the preprocessing part).
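(For anyone curious, a minimal sketch of that external memory usage follows; the file name, cache prefix, and parameters are placeholders.)

import xgboost as xgb

# Appending '#<cache_prefix>' to a libsvm file path tells xgboost to
# stream the data from disk instead of loading it fully into memory.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
params = {'objective': 'multi:softmax', 'num_class': 3}
booster = xgb.train(params, dtrain, num_boost_round=50)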

@quartox

quartox commented Mar 1, 2019

I was able to install the branch from #28 and it works for my use case. @TomAugspurger I'd be interested in helping solve the CI problems, but I don't know where to begin (the error is in multiprocessing when using distributed.utils_test.cluster), so if you'd welcome help and are willing to point me in the right direction, just ping me. No worries if that's more trouble than it's worth.

@TomAugspurger
Member

I spent another couple of hours on this with no luck... It's just hard to work around xgboost's behavior of essentially calling sys.exit(0) when you try to init its workers twice within a thread. In theory, keeping the initialization state as a thread-local should suffice, but I haven't been able to make that work yet, sorry. I don't think I'll have any more time to work on it this week.
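(In case it helps anyone picking this up, the thread-local idea is roughly the sketch below; it is not the dask-xgboost code, just the general pattern of initializing at most once per thread.)

import threading

_local_state = threading.local()


def init_once_per_thread(init_fn, *args, **kwargs):
    # Run the worker initialization at most once per thread, so a second
    # fit in the same thread does not trigger a re-initialization.
    if not getattr(_local_state, 'initialized', False):
        init_fn(*args, **kwargs)
        _local_state.initialized = True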

FYI, the sparse tests seem to have started failing with the latest xgboost. They're no longer being duck-typed as sparse arrays.

@quartox mentioned this issue Mar 11, 2019
@pasayatpravat

Is there any update on this issue? I am also encountering the same problem.

@TomAugspurger
Member

Still open. You can apply #28. IIRC there are some issues with the CI / testing on master, but no one has had time to resolve them (LMK if you're interested in working on it).
