
Base class for estimators using _incremental.fit #356

Merged
TomAugspurger merged 44 commits into dask:master from TomAugspurger:incremental-base on Sep 18, 2018

TomAugspurger (Member) commented Sep 6, 2018

Here's a rough draft of what a base class calling _incremental.fit may look like. I've just been stealing pieces from #221 and making it work for the random sampler with the remove_worst strategy.

Looking at #221, there are a few pieces that will be common to each subclass:

  • data checking
  • CV splitting
  • building cv_results (partly)
  • selecting the best model (partly)

That leaves a few things that must be done by each subclass (a rough sketch of the split follows the list):

  • Building the dict of {model_id: (param_list, additional_calls)}
  • Transforming the output of fit on each of those into a common form
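
Concretely, the division of labor could look something like the sketch below. This is illustrative only: the body of fit and helpers like _check_array and _train_test_split are hypothetical names, while _get_ids_and_args and the other hooks are the ones discussed in this PR.

    from dask_ml.model_selection._incremental import fit

    class BaseIncrementalSearchCV:
        def fit(self, X, y, **fit_params):
            X, y = self._check_array(X, y)            # shared: data checking
            X_train, X_test, y_train, y_test = self._train_test_split(X, y)  # shared: splitting
            ids_and_args = self._get_ids_and_args()   # subclass hook
            results = {
                model_id: fit(self.estimator, param_list, X_train, y_train,
                              X_test, y_test, additional_calls)
                for model_id, (param_list, additional_calls) in ids_and_args.items()
            }
            self.cv_results_ = self._get_cv_results(results)           # shared (partly)
            self.best_estimator_ = self._get_best_estimator(results)   # shared (partly)
            self._update_results(results)                              # subclass hook
            return self

        def _get_ids_and_args(self):
            # Subclasses build {model_id: (param_list, additional_calls)}
            raise NotImplementedError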

In the end, I'm not sure how much code reuse we'll actually get, but I think it's worth exploring a bit further. A next step would be to make a branch of #221 that builds on top of this and see how well the hyperband structure maps onto this base class. From there we can decide whether the base class can be adjusted to fit hyperband's needs, or whether it should just be scrapped. I'll try to get to that today.

cc @stsievert

* _get_ids_and_args : Build the dict of
``{model_id : (param_list, additional_calls)}``
that are eventually passed to `fit`
* _get_cv_results : Build the dict containing the CV results.

mrocklin (Member) commented Sep 6, 2018

I would expect this to be shared among implementations. I think that the history output of the fit function should have all of the information here. Are there choices that are likely to change between implementations?

TomAugspurger (Member) commented Sep 6, 2018

Yes, I think so. Initially, the structure of my IncrementalRandomizedSearchCV differed from Hyperband, as my results were simply results = fit(...), and not results = {model_id: fit(...) for ...}.

Now that both have the consistent results type of {model_id: (info, models, history)}, more of this should be sharable.

The same applies to #356 (comment).

stsievert (Contributor) commented Sep 6, 2018

In my implementation of _get_cv_results, the bracket key (model_id here, I think) was only required to make the model_ids unique across brackets. This was important for downstream processing; I wanted model IDs unique to each bracket, not shared (and the code isn't the cleanest).

This solution can be used with the "consistent results type of {model_id: (info, models, history)}". The result I came up with at _hyperband.py#L315-L318 is something like

for loop_id, (info, models, hist) in results.items():   # results defined in comment above
    for h in hist:
        h["model_id"] = "loop{}-{}".format(loop_id, h["model_id"])
* _get_ids_and_args : Build the dict of
``{model_id : (param_list, additional_calls)}``
that are eventually passed to `fit`

mrocklin (Member) commented Sep 6, 2018

I would expect the additional_calls function to be a method of this class that developers would override.

I think that we probably don't need users to define model_ids. The fit function will do this for us.

I do think that we need a way to decide initial parameter lists. This might be an overridable method.
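
For reference, the contract that _incremental.fit appears to expect of additional_calls (based on the RandomIncrementalSearch sketch later in this thread): it receives the accumulated info and returns {model_id: n}, where n is the number of further partial_fit calls and 0 retires a model. A toy method override that keeps training only the current best model:

    def _additional_calls(self, info):
        # info maps each model id to its list of history records so far,
        # each record carrying a "score" entry.
        best = max(info, key=lambda ident: info[ident][-1]["score"])
        # One more partial_fit call for the best model; stop the rest.
        return {ident: int(ident == best) for ident in info}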

TomAugspurger (Member) commented Sep 7, 2018

@mrocklin does _get_ids_and_args make sense now? Initially, building the param_lists and the additional_calls was two separate methods, but they seem to depend on each other, so I figured a single method overriding everything passed into fit made the most sense.

This is the only mandatory method that has to be implemented by subclasses now.

that are eventually passed to `fit`
* _get_cv_results : Build the dict containing the CV results.
* _get_best_estimator : Select the best estimator from the set of
fitted estimators

mrocklin (Member) commented Sep 6, 2018

I think that this should be easy to provide a default for: pass over all of the models in the resulting models dict and pick the one with the best score coming out of info.
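
A minimal sketch of that default, assuming info maps model ids to lists of score records and models maps ids to the fitted estimators (both assumptions based on this thread):

    def _get_best_estimator(self, info, models):
        # Compare each model's most recent score and return the matching
        # entry from the models dict.
        best_id = max(info, key=lambda ident: info[ident][-1]["score"])
        return models[best_id]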

* _get_best_estimator : Select the best estimator from the set of
fitted estimators
* _update_results : Update self with attributes from the fitted
estimators.

mrocklin (Member) commented Sep 6, 2018

I would also expect this to be somewhat standardized among different implementations. What does this need to do?

TomAugspurger (Member) commented Sep 6, 2018

I'm not really sure. For Hyperband, it's something like

        self.history_ = sum([SHA._history for SHA in SHAs.values()], [])
        self.metadata_ = {
            "models": sum(m["models"] for m in meta),
            "partial_fit_calls": sum(m["partial_fit_calls"] for m in meta),
            "brackets": meta,
        }

This could probably have a default implementation that extracts the results likely to be useful to all subclasses (I'm just not sure what those items are at the moment).

mrocklin (Member) commented Sep 6, 2018

I think that my objective here is for users not to have to write too much to make things work. Some of these hooks should probably be optional.

return self
class IncrementalRandomizedSearchCV(BaseIncrementalSearchCV):

stsievert (Contributor) commented Sep 6, 2018

There are two parts to this class: the method of choosing parameters, and the incremental/adaptive method implemented (currently remove_worst). I'd rather the name reflect the incremental/adaptive component: maybe something like KillWorstModelCV?

mrocklin (Member) commented Sep 6, 2018

I think that this is only up here as a strawman to see what things look like when extending the class. I don't think that the remove_worst function is a particularly pragmatic approach. It was just something simple that was small enough for me to put into a docstring.

TomAugspurger added some commits Sep 6, 2018

PICK: Fixed edge case in remove_worst
(cherry picked from commit a37a8e5)
TomAugspurger (Member) commented Sep 7, 2018

I think this is reasonably close to being ready. May require some tuning tomorrow once hyperband is ready. As a preview https://github.com/TomAugspurger/dask-ml/blob/stsievert-hyperband-2/dask_ml/model_selection/_hyperband.py has the trimmed down implementation on top of this.

Once we're good here, I'll make a PR against your branch @stsievert from https://github.com/TomAugspurger/dask-ml/tree/stsievert-hyperband-2, and then we should be close to ready. I still have a few failing tests to fix up.

@TomAugspurger TomAugspurger changed the title from [WIP]: Base class for estimators using _incremental.fit to Base class for estimators using _incremental.fit Sep 7, 2018

TomAugspurger added some commits Sep 7, 2018

Cleanup
* Split visualize to mixin class
* Remove worthless models attribute
* additional_fit_calls : Union[Callable, object]
Objects with a ``fit`` method have the bound
``additional_fit_calls.fit`` method passed through to `fit`.
(Other) callables are simply passed through.

mrocklin (Member) commented Sep 7, 2018

I would design things so that the IncrementalRandomizedSearchCV class has remove_worst as a method, rather than as a return value of _get_ids_and_args. I think that the remove_worst method is likely to be the main point of variance between different classes, and so is something that people will probably expect to specify at the class level.

(also, just to be clear, we shouldn't actually publish a class with the remove_worst function as a genuine solution. It's not a good solution)

I also don't understand why this returns a dictionary.

random_state=self.random_state,
)
for model_id, (param_list, additional_calls) in ids_and_args.items()
}

mrocklin (Member) commented Sep 7, 2018

The Hyperband implementation is weird in that it wants to call fit several times rather than just once (which I think will be more common). Additionally, Hyperband will want to call fit many times concurrently, so we'll need to use coroutines for this.

Short term I suggest that we implement classes for fitting one model, RandomSearch, and SuccessiveHalving using the functions found in this notebook. Then we look at Hyperband as a special case that has to call fit a few times and see what extra methods might have to be pulled out.

I'm happy to take a look at this later today as well if that would be useful.
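
For concreteness, here is a hedged sketch of what a SuccessiveHalving-style additional_calls could look like. The specific rule (keep the better-scoring half each round and double each survivor's budget) is an assumption for illustration, not taken from the notebook:

    def successive_halving(info):
        # Rank models by their most recent score.
        ranked = sorted(info, key=lambda ident: info[ident][-1]["score"], reverse=True)
        keep = set(ranked[: max(1, len(ranked) // 2)])
        # Survivors get as many additional partial_fit calls as they have
        # already received (doubling their budget); the rest stop. A real
        # implementation would also cap the total budget.
        return {
            ident: (info[ident][-1]["partial_fit_calls"] if ident in keep else 0)
            for ident in info
        }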

mrocklin (Member) commented Sep 7, 2018

In short, I think that this approach is more complicated than necessary, specifically because it targets Hyperband first and Hyperband is a bit of a special case. I think that we should target some of the simpler situations first, build conventions around them, and then see how we can add release valves for Hyperband to also work.

TomAugspurger (Member) commented Sep 7, 2018

My reply seems to have been lost.

tl;dr: that seems sensible. I agree that Hyperband's multiple calls to fit complicated this PR, and I don't know how representative that will be of other strategies (I assume not very).

TomAugspurger added some commits Sep 7, 2018

Simplify
* Removes references to CV
* Makes _additional_calls a method
* removes extraneous methods
TomAugspurger (Member) commented Sep 7, 2018

@mrocklin I think this is mostly done for now.

I may do a bit of restructuring to split the CV-specific parts of DaskBaseSearchCV out into a mixin. We currently inherit some things that don't make much sense here.

Followup items are to try out random search and successive halving from https://gist.github.com/mrocklin/4c95bd26d15281d82e0bf2d27632e294 on this. Do you want to do that or should I?


TomAugspurger (Member) commented Sep 7, 2018

OK, just merged in master to hopefully fix the CI failure.

If you want to "check your work", here's a solution :)

diff --git a/dask_ml/model_selection/_incremental.py b/dask_ml/model_selection/_incremental.py
index c2983ae..b208ae9 100644
--- a/dask_ml/model_selection/_incremental.py
+++ b/dask_ml/model_selection/_incremental.py
@@ -618,3 +618,68 @@ class RandomizedWorstIncrementalSearch(BaseIncrementalSearch):
 
     def _get_params(self):
         return ParameterSampler(self.parameters, self.n_iter, self.random_state)
+
+
+class RandomIncrementalSearch(BaseIncrementalSearch):
+    def __init__(
+        self,
+        estimator,
+        param_distribution,
+        n_iter=10,
+        max_iter=100,
+        patience=10,
+        tol=0.001,
+        test_size=0.15,
+        random_state=None,
+        scoring=None,
+        iid=True,
+        refit=True,
+        error_score="raise",
+        return_train_score=_RETURN_TRAIN_SCORE_DEFAULT,
+        scheduler=None,
+        n_jobs=-1,
+        cache_cv=True,
+    ):
+        self.max_iter = max_iter
+        self.n_iter = n_iter
+        self.patience = patience
+        self.tol = tol
+        super(RandomIncrementalSearch, self).__init__(
+            estimator,
+            param_distribution,
+            test_size,
+            random_state,
+            scoring,
+            iid,
+            refit,
+            error_score,
+            return_train_score,
+            scheduler,
+            n_jobs,
+            cache_cv
+        )
+
+    def _get_params(self):
+        return ParameterSampler(self.parameters, self.n_iter)
+
+    def _additional_calls(self, info):
+        out = {}
+        max_iter = self.max_iter
+        patience = self.patience
+        tol = self.tol
+
+        for ident, records in info.items():
+            if max_iter is not None and len(records) > max_iter:
+                out[ident] = 0
+
+            elif len(records) > patience:
+                old = records[-patience]["score"]
+                if all(d["score"] < old + tol for d in records[-patience:]):
+                    out[ident] = 0
+                else:
+                    out[ident] = 1
+
+            else:
+                out[ident] = 1
+        return out
TomAugspurger (Member) commented Sep 7, 2018

I think some of these parameters should be removed:

  • iid
  • refit
  • error_score
  • return_train_score
  • scheduler
  • n_jobs
  • cache_cv

But I need to double-check first.


stsievert referenced this pull request on Sep 16, 2018: ENH: Hyperband implementation #221 (open)
mrocklin (Member) commented Sep 17, 2018

How do we feel about the following?

Exponential decay with patience on randomly selected parameters:

IncrementalSearchCV(model, params, n_initial_params=1000, decay_rate=1.0, patience=20, tol=0.1)

100 partial_fit calls with no decay on a full grid:

IncrementalSearchCV(model, params, n_initial_params='grid', decay_rate=0.0, max_iter=100)

cc @ogrisel: your feedback on the API and functionality here would be welcome.
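
One plausible reading of decay_rate, sketched below as an assumption about the semantics rather than a description of the implementation: keep roughly n_initial_params / t ** decay_rate models alive at step t, so decay_rate=0.0 keeps the whole set and larger rates cull faster.

    def _decay_additional_calls(info, n_initial_params, decay_rate, time_step):
        # Hypothetical helper: cull to the top-scoring n_keep models and
        # give each survivor one more partial_fit call.
        n_keep = max(1, int(n_initial_params / max(1, time_step) ** decay_rate))
        ranked = sorted(info, key=lambda ident: info[ident][-1]["score"], reverse=True)
        keep = set(ranked[:n_keep])
        return {ident: (1 if ident in keep else 0) for ident in info}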

TomAugspurger (Member) commented Sep 17, 2018

That seems reasonable, with the caveat that we shouldn't have the CV suffix if we're just doing a single train-test split. And now that RandomizedIncrementalSearch is just a special case with no decay, we should get rid of it.

mrocklin (Member) commented Sep 17, 2018

OK, done.

@TomAugspurger I'm curious about your thoughts on the Incremental meta-estimator. Are there still cases where this is useful (for example in pipelines) or should we start to deprecate it in favor of this mechanism?

TomAugspurger (Member) commented Sep 17, 2018

I think wrappers.Incremental is still useful when you aren't doing hyperparameter optimization, and it's a bit easier to explain. I'll update the docs to try to clarify things.

mrocklin (Member) commented Sep 17, 2018

For the default size of the test data, how about using the size of a single chunk? For example, if the input data has ten chunks then we use 10%. That provides some guarantee that the result will fit into memory. This would be the default value when test_size=None.
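
A sketch of that default, assuming a dask array chunked along the first axis into roughly equal chunks:

    # With test_size=None, hold out one chunk's worth of rows: a
    # ten-chunk input gives a 10% test set, and the held-out data is at
    # most about one chunk, so it should fit in memory.
    if self.test_size is None:
        test_size = 1 / X.numblocks[0]
    else:
        test_size = self.test_size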

TomAugspurger (Member) commented Sep 17, 2018

Yes, that's a very good idea.

TomAugspurger and others added some commits Sep 17, 2018

@if_delegate_has_method(delegate=("best_estimator_", "estimator"))
def inverse_transform(self, Xt):
    self._check_is_fitted("inverse_transform")
    return self.best_estimator_.inverse_transform(Xt)

mrocklin (Member) commented Sep 17, 2018

It looks like the methods above are untested.

TomAugspurger (Member) commented Sep 17, 2018

I can put some together quickly.

mrocklin added some commits Sep 17, 2018

TomAugspurger (Member) commented Sep 17, 2018

Edge case when there's a single parameter:

    X, y = make_classification(n_samples=100, n_features=5, chunks=(10, 5))

    model = SGDClassifier(tol=1e-3)

    params = {"alpha": np.logspace(-2, 10, 3)}

    search = IncrementalSearch(model, params, n_initial_parameters="grid")
    search.fit(X, y, classes=[0, 1])
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
~/sandbox/dask-ml/dask_ml/model_selection/_incremental.py in _fit(self, X, y, **fit_params)
    533         history_results = self._get_history_results(results)
--> 534         best_estimator, best_index = self._get_best(results, history_results)
    535         best_estimator = yield best_estimator

~/sandbox/dask-ml/dask_ml/model_selection/_incremental.py in _get_best(self, results, history_results)
    486         """Select the best estimator from the set of estimators."""
--> 487         best_model_id = first(results.info)
    488         key = operator.itemgetter("model_id")

~/Envs/dask-dev/lib/python3.6/site-packages/toolz/itertoolz.py in first(seq)
    367     """
--> 368     return next(iter(seq))
    369

StopIteration:

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-7-048117aa2d7a> in <module>()
----> 1 search.fit(X, y, classes=[0, 1])

~/sandbox/dask-ml/dask_ml/model_selection/_incremental.py in fit(self, X, y, **fit_params)
    559             Additional partial fit keyword arguments for the estimator.
    560         """
--> 561         return default_client().sync(self._fit, X, y, **fit_params)
    562
    563     @if_delegate_has_method(delegate=("best_estimator_", "estimator"))

~/Envs/dask-dev/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    645             return future
    646         else:
--> 647             return sync(self.loop, func, *args, **kwargs)
    648
    649     def __repr__(self):

~/Envs/dask-dev/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

~/Envs/dask-dev/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/Envs/dask-dev/lib/python3.6/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

~/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1131
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

~/Envs/dask-dev/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1145                             exc_info = None
   1146                     else:
-> 1147                         yielded = self.gen.send(value)
   1148
   1149                     if stack_context._state.contexts is not orig_stack_contexts:

RuntimeError: generator raised StopIteration
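
The immediate cause is first(results.info) being handed an empty sequence; the resulting StopIteration escapes into the _fit generator, which surfaces as the RuntimeError above (PEP 479). A hedged sketch of one possible guard, with attribute names taken from the traceback:

    from toolz import first

    def _get_best(self, results, history_results):
        """Select the best estimator from the set of estimators."""
        if not results.info:
            # Fail loudly instead of letting StopIteration escape the
            # surrounding generator as an opaque RuntimeError.
            raise ValueError("No models were fit; cannot select a best estimator")
        best_model_id = first(results.info)
        # ... rest unchanged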

TomAugspurger (Member) commented Sep 17, 2018

Added small tests for predict_proba, predict_log_proba, transform, and decision_function.

mrocklin (Member) commented Sep 17, 2018

Sometimes the shape of the result in test_transform differs. Is this expected?

TomAugspurger (Member) commented Sep 17, 2018

Sorry, fixed.

TomAugspurger added some commits Sep 17, 2018

mrocklin (Member) commented Sep 18, 2018

+1

@TomAugspurger TomAugspurger merged commit 991b894 into dask:master Sep 18, 2018

6 checks passed

  • ci/circleci: py27 passed
  • ci/circleci: py36 passed
  • ci/circleci: sklearn_dev passed
  • codecov/patch: 90.62% of diff hit (target 90.14%)
  • codecov/project: 90.23% (+0.09%) compared to ccd5d47
  • continuous-integration/travis-ci/pr: the Travis CI build passed

@TomAugspurger TomAugspurger deleted the TomAugspurger:incremental-base branch Sep 18, 2018
