How to handle scoring for meta-estimators #200

TomAugspurger opened this issue Jun 6, 2018 · 2 comments
@TomAugspurger TomAugspurger commented Jun 6, 2018

In #196, we came across an issue with the scoring API. For training an estimator on small NumPy arrays, a regular grid search such as

clf = sklearn.linear_model.SGDClassifier()
param_grid = {'alpha': [0.1, 10]}
grid_search = GridSearchCV(clf, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):, y)

will induce calls to grid_search.estimator.score(X_test, y_test). This is fine since X and y are small.
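A minimal sketch of that resolution step: scikit-learn's `check_scoring` turns `scoring=None` into a pass-through scorer that just calls `estimator.score(X_test, y_test)` on each CV split, while a string like `'accuracy'` is looked up in scikit-learn's scorer registry.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import check_scoring

clf = SGDClassifier()

# scoring=None -> a pass-through scorer wrapping clf.score
default_scorer = check_scoring(clf, scoring=None)

# scoring='accuracy' -> scikit-learn's accuracy scorer
accuracy_scorer = check_scoring(clf, scoring='accuracy')

# Both resolve to callables with signature scorer(estimator, X, y).
print(callable(default_scorer), callable(accuracy_scorer))
```

Either way, the resulting callable receives the raw `X_test`/`y_test` splits, which is exactly where large dask arrays become a problem.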

Now, we'd like to do something similar with Incremental (the new wrapper that passes blocks of a dask array to the underlying estimator's partial_fit), which is itself a meta-estimator.

clf = SGDClassifier()
inc = Incremental(clf)

grid_search = GridSearchCV(inc, param_grid, scoring=None)  # or scoring='accuracy'

with joblib.parallel_backend('dask'):, y)  # X and y are *large* Dask arrays

As written, this will end up calling a scikit-learn scoring function on each CV split, which will potentially cause a memory error. The current workaround is to pass a dask-aware scorer to Incremental or GridSearchCV:

from dask_ml.metrics import get_scorer
dask_scorer = get_scorer('accuracy')
inc = Incremental(clf, scoring=dask_scorer)
grid_search = GridSearchCV(inc, param_grid)

Then things work, but that workaround may be difficult to discover. A couple of issues:

  1. (less important) It's a bit unclear how param_grid should behave when the estimator is a meta-estimator. I hacked up Incremental.get/set_params to pass things through. Maybe a syntax like param_grid={'estimator.alpha': [0.1, 10]} would make sense for scikit-learn? Has this come up before?
  2. (more important): The string 'accuracy' will get scikit-learn's accuracy scorer. This is unfortunate since X_test and y_test will be large dask Arrays, which may cause a memory error when passed to the scikit-learn version. A few potential solutions:
    a.) Do nothing, just write better documentation. Maybe make scorer a required argument to Incremental?
    b.) Rewrite the scikit-learn scorers to be compatible with both ndarrays and dask arrays (is that possible?)
    c.) A way to introspect an estimator to get the scorer used by Estimator.score. Something like
from sklearn.metrics import get_estimator_scorer  # or just put in get_scorer
scorer = get_estimator_scorer(sklearn.linear_model.SGDClassifier)
# <function sklearn.metrics.classification.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)>

Then in Incremental, we can detect that the default scorer should be dask_ml.metrics.classification.accuracy_score. I haven't checked many estimators to see if this is feasible.
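On point 1, scikit-learn's existing convention for nested parameters in meta-estimators is the double-underscore syntax (`estimator__alpha`), routed through get_params/set_params rather than a dotted syntax. A small sketch using a stock meta-estimator (OneVsRestClassifier, which holds its sub-estimator under the `estimator` parameter name):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# The wrapped estimator's parameters are exposed with an 'estimator__' prefix.
meta = OneVsRestClassifier(SGDClassifier())
assert 'estimator__alpha' in meta.get_params()

# set_params routes the nested key down to the sub-estimator,
# which is exactly what GridSearchCV does with each param_grid entry.
meta.set_params(estimator__alpha=10)
print(meta.get_params()['estimator__alpha'])
```

So `param_grid={'estimator__alpha': [0.1, 10]}` would work with any wrapper whose get_params/set_params follow this convention, without scikit-learn needing new syntax.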

cc @ogrisel, @GaelVaroquaux if they have thoughts on this


@TomAugspurger TomAugspurger commented Jul 9, 2018

We can check whether the default scoring is used:

type(inc.estimator).score is ClassifierMixin.score

and in that case override it with our own.
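A runnable version of that check: an estimator that has not overridden `ClassifierMixin.score` still resolves the attribute to the mixin's function, so an identity comparison detects that the default accuracy scoring is in use.

```python
from sklearn.base import ClassifierMixin
from sklearn.linear_model import SGDClassifier

# SGDClassifier inherits score from ClassifierMixin, so the identity holds.
print(type(SGDClassifier()).score is ClassifierMixin.score)

# An estimator with a custom score method fails the check,
# signalling that we cannot safely substitute a dask-aware scorer.
class CustomScore(SGDClassifier):
    def score(self, X, y, sample_weight=None):
        return 0.0

print(type(CustomScore()).score is ClassifierMixin.score)
```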



@ogrisel ogrisel commented Jul 9, 2018

Similarly for RegressorMixin.score, which is the r2 score.

If it's not one of the known score functions from the base classes, I think that calling Incremental(estimator) should raise a ValueError asking the user to explicitly pass the scoring metric to use from dask_ml.metrics. That avoids calling an unsafe estimator.score method that might implicitly pull the data onto a single node by converting a dask.array to a numpy array.
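A sketch of that guard (the function name and the choice to return scorer strings are hypothetical, not dask-ml API): accept the estimator only if scoring was passed explicitly, or if its score method is one of the two known-safe mixin defaults for which dask_ml has drop-in metrics.

```python
from sklearn.base import ClassifierMixin, RegressorMixin

def check_scoring_safe(estimator, scoring=None):
    """Raise ValueError unless we know how to score on dask arrays."""
    if scoring is not None:
        return scoring
    score = type(estimator).score
    if score is ClassifierMixin.score:
        return 'accuracy'   # substitute dask_ml.metrics.accuracy_score
    if score is RegressorMixin.score:
        return 'r2'         # substitute dask_ml.metrics.r2_score
    raise ValueError(
        "estimator has a custom score method; pass an explicit scoring= "
        "from dask_ml.metrics to avoid pulling dask arrays onto one node"
    )

from sklearn.linear_model import SGDClassifier, SGDRegressor
print(check_scoring_safe(SGDClassifier()))  # default classifier scoring
print(check_scoring_safe(SGDRegressor()))   # default regressor scoring
```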

