API/REF: Clone before fitting in Incremental.fit #258

Merged
merged 11 commits into dask:master Jul 3, 2018

@TomAugspurger (Member) commented Jul 2, 2018

This changes Incremental.fit to clone before fitting.

Alternative to #207

This implements the ideas from #207 (comment)
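In code, the intended behavior looks roughly like this (a sketch, assuming the dask_ml.wrappers import path; the fitted clone lands on estimator_, following the scikit-learn trailing-underscore convention used in the diff below):

import numpy as np
import dask.array as da
from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

X = da.from_array(np.random.RandomState(0).randn(100, 5), chunks=50)
y = da.from_array(np.repeat([0, 1], 50), chunks=50)

inc = Incremental(SGDClassifier(max_iter=5, tol=1e-3))
inc.fit(X, y, classes=[0, 1])

assert not hasattr(inc.estimator, 'coef_')  # the template is left unfitted
assert hasattr(inc.estimator_, 'coef_')     # fit() clones, then fits the clone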

TomAugspurger added some commits Jul 2, 2018

API/REF: Clone before fitting in Incremental.fit
This changes Incremental.fit to clone before fitting.

Alternative to #207

@TomAugspurger (Member) commented Jul 2, 2018

@@ -26,11 +27,12 @@ def test_incremental_basic(scheduler, xy_classification):
     assert result is clf
-    assert isinstance(result.estimator.coef_, np.ndarray)
-    np.testing.assert_array_almost_equal(result.estimator.coef_,
+    assert isinstance(result.estimator_.coef_, np.ndarray)

@TomAugspurger (Member) Jul 2, 2018

This is the API change.

@@ -96,22 +99,20 @@ def test_scoring_string(scheduler, xy_classification, scoring):
     assert callable(check_scoring(clf, scoring=scoring))
     clf.fit(X, y, classes=np.unique(y))
     clf.score(X, y)
-    clf.estimator.score(X, y)

@TomAugspurger (Member) Jul 2, 2018

I don't think this was doing what we wanted earlier.

When Incremental is created with scoring, clf.score(X, y) will in general be different from clf.estimator_.score(X, y). The latter always uses the default scoring for that estimator (e.g. accuracy). I opted to just remove this line.
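For example (a sketch; loss='log' gives SGDClassifier a predict_proba, so log loss is computable):

inc = Incremental(SGDClassifier(loss='log'), scoring='neg_log_loss')
inc.fit(X, y, classes=[0, 1])
inc.score(X, y)             # negative log loss, via the scoring= parameter
inc.estimator_.score(X, y)  # SGDClassifier's default score: accuracy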

 def test_fit_ndarrays():
     X = np.ones((10, 5))
-    y = np.ones(10)
+    y = np.concatenate([np.zeros(5), np.ones(5)])

@TomAugspurger (Member) Jul 2, 2018

Needed a mix of zeros and ones. I'm not entirely sure what changed to necessitate that...

@@ -122,7 +123,7 @@ def test_score_ndarrays():
     inc = Incremental(sgd, scoring='accuracy')
     inc.partial_fit(X, y, classes=[0, 1])
-    inc.fit(X, y)
+    inc.fit(X, y, classes=[0, 1])

@TomAugspurger (Member) Jul 2, 2018

Need this, as classes_ no longer leaks through from partial_fit to fit (a good thing!).

@stsievert (Contributor) commented Jul 2, 2018

I think if we implement fit support for Incremental we should loop over the data max_iter times beforehand.

For more complete thoughts on the API, see #207 (comment)
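Roughly, the loop-over-the-data idea in code (a hypothetical sketch; blocks_of is a stand-in for iterating over the matching chunks of the dask arrays):

from sklearn.base import clone

def fit(self, X, y, **fit_kwargs):
    estimator = clone(self.estimator)
    for _ in range(self.max_iter):  # max_iter passes over the data
        for X_block, y_block in blocks_of(X, y):  # hypothetical helper
            estimator.partial_fit(X_block, y_block, **fit_kwargs)
    self.estimator_ = estimator
    return self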

@TomAugspurger (Member) commented Jul 2, 2018

Agreed that having a max_iter parameter would be useful for Incremental. However, it exposes a difficulty alluded to in #200 about how Incremental.get_params / set_params work. It would be ambiguous whether max_iter should go to Incremental.max_iter or Incremental.estimator.max_iter.

The solution is probably to have the user specify a name when creating the Incremental.

See https://stackoverflow.com/a/35534452/1889400 for how VotingClassifier works.

So either

inc = Incremental(SGDClassifier(), name='sgdclassifier')
inc = Incremental(('sgdclassifier', SGDClassifier()))

In the latter case, we would have a make_incremental helper that works like make_pipeline.
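Hypothetically, mirroring how make_pipeline derives names from class names:

def make_incremental(estimator, **kwargs):
    # hypothetical helper; name derived like sklearn's make_pipeline
    name = type(estimator).__name__.lower()
    return Incremental((name, estimator), **kwargs)

inc = make_incremental(SGDClassifier())  # name becomes 'sgdclassifier'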

Then in a context where you need to get / set params, it'd be

param_grid = {
    'sgdclassifier__alpha': [0.1, 10.0],
    'sgdclassifier__max_iter': [1000, 2000],
    'max_iter': [1, 5],
}

clf = GridSearchCV(inc, param_grid)

@TomAugspurger (Member) commented Jul 2, 2018

Looks like https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/metaestimators.py has some utilities for get / set params in a meta-estimator, though most are private.

@TomAugspurger (Member) commented Jul 2, 2018

55fccf2 has the changes.

It turns out BaseEstimator already does the right thing; we don't have to mess with _BaseComposition.

I'm encouraged by the lines of code we could delete from Incremental.

For now, I haven't made the name configurable. If you want to set the parameters for the underlying estimator, then you use estimator__<parameter>. We can make the estimator part configurable in the future if there's a demand for it.
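Concretely (this is the standard BaseEstimator nested-parameter behavior):

inc = Incremental(SGDClassifier())
inc.get_params()['estimator__alpha']  # 0.0001, the SGDClassifier default
inc.set_params(estimator__alpha=0.1)  # routed to the underlying estimator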

@TomAugspurger (Member) commented Jul 2, 2018

I'll implement max_iter now.

@stsievert, do you have a feeling for what a good default would be? I'm leaning towards max_iter=1, but am not sure.

@mrocklin (Member) commented Jul 2, 2018

@TomAugspurger (Member) commented Jul 2, 2018

It varies from model to model.

scikit-learn/scikit-learn#1626 is the general API that we would want to follow here, when that is implemented.

@TomAugspurger TomAugspurger changed the title from API/REF: Clone before fitting in Incremental.fit to API/REF: Clone before fitting in Incremental.fit and implement max_iter Jul 2, 2018

@TomAugspurger (Member) commented Jul 2, 2018

A basic version of max_iter should be working now.

TomAugspurger added some commits Jul 2, 2018

@TomAugspurger (Member) commented Jul 2, 2018

I'm finding this diff somewhat difficult to digest. I'm going to split the max_iter work off into a second PR.

@TomAugspurger (Member) commented Jul 2, 2018

OK I think this PR should be good. I feel happy about deleting https://github.com/dask/dask-ml/pull/258/files#diff-eaec819577a74de3b3af2ece0ac4dce9L353, and should have noticed that needing those in the first place was a code smell.

Postponing discussion of max_iter, which will build on this, to #264.

@TomAugspurger TomAugspurger changed the title from API/REF: Clone before fitting in Incremental.fit and implement max_iter to API/REF: Clone before fitting in Incremental.fit Jul 2, 2018

@stsievert (Contributor) commented Jul 2, 2018

I'm on board with this now. I think Incremental should implement fit and have a max_iter param.

> do you have a feeling for what a good default would be? I'm leaning towards max_iter=1, but am not sure.

I think I would default to max_iter=5, which is what sklearn's SGDClassifier defaults to, at least when no stopping criterion is set (< v0.21). They add stopping criteria in >= v0.21 and increase max_iter. For simple problems I'd say 5 epochs is good enough.

This should depend on model complexity, but I'm okay leaving that aside. For example, fitting a linear model with twice as many features will require about sqrt(2) times as many samples to be seen by the optimization algorithm.

@TomAugspurger TomAugspurger merged commit d580158 into dask:master Jul 3, 2018

3 checks passed

ci/circleci: py27 Your tests passed on CircleCI!
ci/circleci: py36 Your tests passed on CircleCI!
ci/circleci: sklearn_dev Your tests passed on CircleCI!