
ENH: Parallel Predict / Transform Meta-estimator #132

Merged
merged 4 commits into dask:master Feb 5, 2018

Conversation

2 participants
@TomAugspurger
Member

TomAugspurger commented Feb 4, 2018

Start a module for IID meta-estimators. Right now I've done FirstBlockFitter.

I could also imagine an estimator that samples for you (so maybe FirstBlockFitter is a special case of IIDFitter(sampling='first')).

In [1]: from dask_ml.datasets import make_classification
   ...: from sklearn.ensemble import GradientBoostingClassifier
   ...: from dask_ml.iid import FirstBlockFitter
   ...:
   ...: clf = FirstBlockFitter(GradientBoostingClassifier())
   ...:
   ...: X, y = make_classification(n_samples=100_000, chunks=10_000)
   ...:
   ...: clf.fit(X, y)
   ...:
Out[1]:
FirstBlockFitter(estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_sampl...      subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False))

In [2]: clf.predict_proba(X)
Out[2]: dask.array<predict_proba, shape=(100000, 2), dtype=float64, chunksize=(10000, 2)>

In [3]: clf.predict_proba(X).compute()
Out[3]:
array([[0.03456433, 0.96543567],
       [0.01632352, 0.98367648],
       [0.16356299, 0.83643701],
       ...,
       [0.0150388 , 0.9849612 ],
       [0.98526749, 0.01473251],
       [0.01522759, 0.98477241]])
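For readers unfamiliar with the pattern, the behavior above can be sketched in a few lines of plain dask and scikit-learn. `FirstBlockSketch` is a hypothetical, simplified stand-in for `FirstBlockFitter`, not the actual dask-ml code:

```python
import dask.array as da
import numpy as np
from sklearn.linear_model import LogisticRegression

class FirstBlockSketch:
    """Sketch of a first-block fitter: train on the first chunk only,
    then map predictions lazily over every chunk."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Pull only the first block of each dask array into memory.
        X0 = X.blocks[0].compute()
        y0 = y.blocks[0].compute()
        self.estimator.fit(X0, y0)
        return self

    def predict(self, X):
        # Lazily apply the fitted estimator block-wise; each 2-D block
        # maps to a 1-D prediction, so axis 1 is dropped.
        return X.map_blocks(self.estimator.predict,
                            dtype=np.int64, drop_axis=1)

rng = np.random.RandomState(0)
X = da.from_array(rng.normal(size=(1000, 4)), chunks=(100, 4))
y = da.from_array((rng.uniform(size=1000) > 0.5).astype(int), chunks=100)

clf = FirstBlockSketch(LogisticRegression()).fit(X, y)
preds = clf.predict(X)          # still lazy, chunked like X
print(preds.compute().shape)    # -> (1000,)
```

The predictions stay a chunked dask array until `.compute()`, which is the "daskified predict" half of the PR.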

Closes #126
Supersedes #127

@TomAugspurger


Member

TomAugspurger commented Feb 5, 2018

It occurs to me that FirstBlockFitter is conflating two independently useful things.

  1. A sampling strategy for data passed to .fit
  2. daskified .transform and .predict

However, I don't think we want to split that functionality into separate estimators. i.e. we don't want FirstBlockFitter(DaskTransformer(Estimator())). Instead, it'd be something like

clf = Daskified(Estimator(), fit_sampling='first', transform=True, predict=True, ...)

I'd like an estimator name that conveys all that information, anyone have suggestions? :)

@mrocklin


Member

mrocklin commented Feb 5, 2018

I'm curious about the utility of FirstBlockFitter. Both dask dataframe and dask array have cheap and easy ways to pull out subsets of data, either the first blocks or random samples. It's not clear to me that we need to provide users with extra support here.
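To illustrate the point, pulling out a subset is already a one-liner on a dask array (a sketch with made-up data):

```python
import dask.array as da
import numpy as np

# A dask array with 10 row-blocks of 100 rows each.
X = da.from_array(np.arange(10_000).reshape(1000, 10), chunks=(100, 10))

# First block only -- what a first-block fitter would train on.
first_block = X.blocks[0].compute()      # shape (100, 10)

# Or a random sample of rows, no meta-estimator required.
idx = np.random.RandomState(0).choice(1000, size=50, replace=False)
sampled = X[idx].compute()               # shape (50, 10)
```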

@TomAugspurger


Member

TomAugspurger commented Feb 5, 2018

Perhaps you're right. Initially, I was motivated by having a drop-in replacement for any scikit-learn estimator (hence the class meta-programming in my last PR). In this case, you'd need to have sampling built into the .fit.

Now that we're using a meta-estimator, the need for sampling in .fit has been reduced. Users can do that beforehand using some kind of train_test_split (I think we have an issue for dask-aware versions of those, or maybe I have a branch).

So, I'll re-purpose this PR to just do the transform / predict wrapping. This should be in a different module, though I'm not sure which one. And I don't think scikit-learn has a generic term for transforming or predicting, so we'll need a name (PostFit? Daskified? DaskPredictTransform?)

@TomAugspurger TomAugspurger changed the title from ENH: Meta-estimators for IID Data to ENH: Parallel Predict / Transform Meta-estimator Feb 5, 2018

(excerpt from the diff under review:)
dtype=sample.dtype)
for block in blocks
]
return da.concatenate(arrays)


@mrocklin

mrocklin Feb 5, 2018

Member

Is this something that we would want to push upstream into Dask?


@TomAugspurger

TomAugspurger Feb 5, 2018

Member

Yeah, I could see it being broadly useful: dask/dask#3138
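The diff excerpt above builds one lazy array per block and concatenates the results. A self-contained sketch of that pattern (`map_over_blocks` is an illustrative name, not the merged helper):

```python
import dask
import dask.array as da
import numpy as np

def map_over_blocks(func, X):
    # Probe func on a tiny input to learn the output's dtype and width
    # (mirrors the `sample.dtype` in the excerpt above), then build one
    # lazy array per block and stitch them back together.
    sample = func(np.zeros((1, X.shape[1]), dtype=X.dtype))
    blocks = X.to_delayed().ravel()
    arrays = [
        da.from_delayed(
            dask.delayed(func)(block),
            shape=(n_rows, sample.shape[1]),
            dtype=sample.dtype,
        )
        for block, n_rows in zip(blocks, X.chunks[0])
    ]
    return da.concatenate(arrays)

X = da.from_array(np.arange(10_000).reshape(1000, 10), chunks=(100, 10))
doubled = map_over_blocks(lambda b: np.hstack([b, b]), X)
print(doubled.shape)   # (1000, 20), still lazy and chunked like X
```

`da.map_blocks` covers the common case; the manual delayed route is useful when block shapes and dtypes must be spelled out explicitly, which is why it generalizes well enough to push upstream.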

@TomAugspurger


Member

TomAugspurger commented Feb 5, 2018

I'm not thrilled about the module or estimator names, but they're at least descriptive.

Planning to merge this later today.

@TomAugspurger TomAugspurger merged commit ffc4054 into dask:master Feb 5, 2018

3 checks passed

ci/circleci: py27 Your tests passed on CircleCI!
ci/circleci: py36 Your tests passed on CircleCI!
ci/circleci: sklearn_dev Your tests passed on CircleCI!

@TomAugspurger TomAugspurger deleted the TomAugspurger:iid-wrapper branch Feb 5, 2018
