
Incremental model selection #288

Merged
23 commits merged into dask:master on Sep 5, 2018

Conversation

@mrocklin (Member) commented Jul 8, 2018

This implements a simple approach to hyper-parameter optimization on
incrementally trained models. It is heavily inspired by (and steals content
from) #221, but it is less opinionated and doesn't attempt to implement
scikit-learn APIs. I don't intend for this to be user-facing, or necessarily to
get merged, but I hope that it enables conversation around the topic.

Currently it implements a function fit, which takes in a model, training
and testing data, and a parameter space. It creates many models, trains them
on a block of data, scores them, and then keeps only the better-performing
ones, training those on subsequent blocks of data.
It continues to do this until only one model remains.
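For intuition, here is a minimal, purely illustrative sketch of that selection loop. It runs synchronously in a plain Python loop (the real implementation is asynchronous on a Dask cluster), and the names toy_fit, blocks, and n_initial are made up for this example:

from sklearn.base import clone
from sklearn.model_selection import ParameterSampler

def toy_fit(model, param_space, blocks, X_test, y_test, n_initial=16, **fit_params):
    # Draw a handful of parameter settings and clone one model per setting
    params = list(ParameterSampler(param_space, n_iter=n_initial))
    models = [clone(model).set_params(**p) for p in params]
    for X_block, y_block in blocks:            # one round per block of data
        for m in models:
            # fit_params would carry things like classes= for classifiers
            m.partial_fit(X_block, y_block, **fit_params)
        scores = [m.score(X_test, y_test) for m in models]
        # keep the better-performing half (halving is just one possible decay)
        order = sorted(range(len(models)), key=lambda i: scores[i], reverse=True)
        models = [models[i] for i in order[: max(1, len(models) // 2)]]
        if len(models) == 1:
            break
    return models[0]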

This function is not intended for public users, but is hopefully something we
can play with as we eventually build a full solution. It is decently
configurable today and we will need to find sane defaults. Notably, this
function does not take a position on the following:

  1. How testing data should be separated from training data,
    currently they are both explicitly provided
  2. How many parameters to start with (provided as integer input)
  3. How to decrease this number over time (provided as function input)

Additionally, it does not (but probably should) provide configuration for the
following:

  1. How to select models from the current group
  2. How to decide on stopping criteria for any particular model
  3. How to decide how many models to select based on the scores received so far

I believe that this implementation is a decent start, though, and should be OK as a
conversation piece. I think that it makes some decisions that we don't want to
think about yet (online determination of how many models to keep) while giving
us knobs for other parameters that I think we'll want to play with (model decay
functions, how to handle testing, etc.).

It is decently well tuned from a Dask perspective.
It optimistically tries moving forward on models between rounds and then cancels
that work if it turns out not to be necessary. This depends on dask master.
Performance is also a bit better with dask/distributed#2100
merged in (but that is not necessary).

Notebook: https://gist.github.com/3a33bed7e7289550bfea8a48af702c82

It would be nice to settle on a sane default for this operation, which I understand is commonly needed.
I suspect that we should play a bit with some real problems using this to see what ends up being important before doing much more development work.


else:
    x_key, y_key = seen[j]
    return Future(x_key), Future(y_key)
@mrocklin (Member, Author):
I don't recommend looking too deeply at this function. It's unclean and likely to go away in the future.

@stsievert (Member) left a review comment:

Thanks for this implementation @mrocklin. I've glanced at the code, but need to take time to thoroughly review. Does this clarify the ideas in #221 (comment)?

It implements starting/stopping work based on the current estimate of good/bad models (which I agree should be done in #221). Is there anything else I should pay close attention to for #221?

Oh, and the gist isn't loading for me.

model = client.submit(_partial_fit, model, X_future, y_future,
                      fit_params, priority=-i + meta['score'])
score = client.submit(_score, model, X_test, y_test, scorer,
                      priority=-time_step + meta['score'])
Review comment (Member):
Nice – the priority keyword is useful, and I wasn't aware of the implementation.

I should spend some time thinking about priority. I think we want to use a multiplicative ratio to try to remove scale: priority=score / i.
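For reference, a hedged sketch of what that multiplicative version might look like, reusing the variables from the snippet above (client, _partial_fit, meta, i); the exact form is only illustrative:

# Same submit call as above, but with a scale-free multiplicative priority
# instead of the additive -i + score form.
model = client.submit(
    _partial_fit, model, X_future, y_future, fit_params,
    priority=meta['score'] / max(i, 1),   # guard against dividing by zero
)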

@mrocklin (Member, Author) commented Jul 8, 2018 via email

@mrocklin (Member, Author) commented Jul 16, 2018

After looking at the Incremental.fit problem I'm inclined to generalize this code a bit so that the caller provides a list of models, training and testing data, and a function that gets called after every batch of training data gets processed to determine how to proceed. That function would accept a dictionary mapping model identity to scoring information, and it would then return a dictionary showing how much further to proceed on each model before checking in again.

>>> scores  # example simplified input
{'model-1': [0.3, 0.4, 0.5, 0.6, 0.7],  # evolving
 'model-2': [0.2, 0.6, 0.6, 0.6, 0.6],  # converged
 'model-3': [0.1, 0.1, 0.1, 0.1, 0.1],  # bad
}
>>> out  # example output
{'model-1': 3,  # evolve forward three more steps, then give us back the score
 'model-2': 0,  # keep this around, but don't bother evolving further
 # 'model-3': ...  # omit model-3 to forget about it entirely
}

def update(scores):
    """ Something like successive halving for Hyperband """
    # keep the best-scoring half (topk here is assumed to be toolz.topk(k, seq, key=...))
    best = topk(len(scores) // 2, scores, key=lambda k: scores[k][-1])
    additional_steps = int(...)
    return {ident: additional_steps for ident in best}

def update(scores):
    """ Something like Incremental.fit """
    assert len(scores) == 1
    [(ident, v)] = scores.items()
    if any(score > v[-10] + tolerance for score in v[-10:]):
        return {ident: 1}  # do one more time step
    else:
        return {ident: 0}  # no more work to do, signals termination

def update(scores):
    """ Something like RandomSearchCV """
    out = {}
    for ident, v in scores.items():
        if any(score > v[-10] + tolerance for score in v[-10:]):
            out[ident] = 1  # keep evolving this one
        else:
            out[ident] = 0  # don't evolve this one any further, it's done
    return out

The full fit function would have something like the following inputs:

def fit(model, parameter_list, X_train, y_train, X_test, y_test, update, fit_params, scorer):
    ...

This may not cover all of these cases, though. Critique is welcome before I invest time into this.
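To make the proposed contract concrete, here is a minimal, purely illustrative driver loop for such a fit function. It is synchronous and Dask-free on purpose, only to show the control flow; blocks and scorer are assumptions, not part of the proposal:

def toy_driver(models, blocks, X_test, y_test, update, scorer):
    # models: {ident: estimator}; blocks: iterator of (X, y) training blocks
    scores = {ident: [] for ident in models}
    instructions = {ident: 1 for ident in models}   # start with one step each
    block_iter = iter(blocks)
    while instructions:
        for ident, n_steps in instructions.items():
            for _ in range(n_steps):
                X, y = next(block_iter)             # assumes enough blocks
                models[ident].partial_fit(X, y)
            scores[ident].append(scorer(models[ident], X_test, y_test))
        # ask the caller-provided policy how much further to go on each model;
        # zero or omitted means "stop training that model for now"
        instructions = {k: v for k, v in update(scores).items() if v > 0}
    return models, scores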

@mrocklin (Member, Author):
I simplified the example above significantly in an edit. Better to look on github rather than depend on e-mail.

@stsievert (Member):
I'm inclined to generalize this code a bit

This looks like a disciplined way of controlling when to call partial_fit on models. It looks like one float determines model quality. update controls whether to pass a particular model to fit.

This looks good to me. I would like to store state inside update (I assume update is called with as_completed as fit completes). For the output of update, I think a negative integer should cancel any fit calls that are currently running for that model.

I think we'd want to control the fit call signature. Could we have a list of metadata alongside the scores? Something like

>>> scores
{'model-1': [0.3, 0.4, 0.5],
 'model-2': [0.5, 0.5, 0.5]}
>>> model_meta
{'model-1': {...},
 'model-2': {...}}
>>> all_meta  # for HyperbandCV
{'scorer': ...,
 'X_test': ...,
 'y_test': ...}

where all_meta and model_meta could be configured by the programmer of Hyperband/Incremental.fit/RandomizedSearchCV in some way.

Then the call signature for fit could be

fit(model, X, y, **model_meta[ident], **all_meta) -> model, meta

I'm not seeing why update should be an argument to fit, but scikit-learn/scikit-learn#9599 (comment) seems related. I renamed {X, y}_train to X, y. I also think we should rename update.

@mrocklin (Member, Author):
It looks like one float determines model quality

I don't understand this. Which float? In principle there is no one thing that determines model quality here. We're a bit lower level.

I would like to store state inside update.

What state would you want to propagate? I expect that the scores input will grow a number of other bits of metadata collected during the computation.

(I assume update is called with as_completed as fit completes)

In the context of successive halving update is called at the end of every epoch

I think a negative integer should cancel any fit calls that are currently running for that model.

The convention I proposed above is that the user passes back the number of additional steps to go forward on each model before calling update again. A value of 0 would result in no additional fits. Removing the element from the dict would result in forgetting the model entirely. I believe that your proposed use of -1 fits one of these two. Is that correct?

I think we'd want to control the fit call signature. Could we have a list of metadata aside the scores?

Yes, I agree that some collection of metadata is likely to be important.

where all_meta and model_meta could be configured by the programmer of Hyperband/Incremental.fit/RandomizedSearchCV in some way.

So the user would also pass functions to include in the partial_fit or score calls? Is there a particular place where they should be run? Should the user have the freedom to provide the entire partial_fit or score call or is this too much? If too much, then can you suggest an implementation of _partial_fit that would cover most needs that you can think of? Where do we call meta_func? Or should it be a context manager or something else? What is the right balance of easy to implement and also sufficient? (these are somewhat rhetorical questions, I wouldn't expect anyone to have a full answer to any of them without trying out a bunch of things).

def _partial_fit(model_and_meta, X, y, fit_params, meta_func=None):
    ...

def _score(model_and_meta, X, y, scorer, meta_func=None):
    ...

I'm not seeing why update should be an argument to fit, but scikit-learn/scikit-learn#9599 (comment) seems related.

I'm probably misusing the term fit here. That's just the name of my full function that does all of the training work. I'm using the term partial_fit for training on a single batch. I suspect that this is what you are calling fit above.

@stsievert (Member):
I got confused by the naming, etc., but am starting to understand (please disregard most of #288 (comment); it's ill-informed). I think this adaptive training will require:

  • specifying how many times _partial_fit is called with a particular model
  • receiving the score for each model after _partial_fit has been called that many times
  • storage and modification of metadata. This metadata is per-model for Hyperband, but it might be global for all models in other use cases (e.g., bayesian searches).

I think the _partial_fit and _score functions are good in this implementation. I think update should have the call signature

update(ident: str, train_metas: {str: dict}, model_metas: {str: dict})
    -> training_iters: {str: int}, model_metas: {str: dict}

  • There are two pieces of metadata: one for training (train_metas), and one to be used and modified in update (model_metas).
  • I would expect this to be called as soon as model ident has finished its _score call.
  • The metadata information in model_metas might change.

I think train_metas should have scores and partial_fit_calls keys, and also the same keys as cv_results_ (e.g., mean_fit_time). This would be the object passed to and modified in _partial_fit and _score.

I think model_metas should hold on to historical information, and be initialized in meta_func. In the Hyperband case, this would mean holding onto the previous brackets and their scores in case any models need to be restarted. This would only live on the scheduler/master and not be passed to or modified in _partial_fit or _score.

I do think update should be renamed.

The convention I proposed ... Removing the element from the dict would result in forgetting the model entirely

I think 'not present' should default to 'do nothing'. I'd only want to specify the state for some subset of the models, not all of them. For Hyperband I'd specify, for some subset of the models, which should be trained more and which should be killed.

I think {'model-0': -1} should cancel the most recent _partial_fit call for 'model-0', but hold onto the results from the previous _partial_fit calls. I think we should use -1 because it's kind of like going back one _partial_fit call.

@mrocklin (Member, Author):
I'm not sure about some of the motivation behind your thoughts above.

For the desire to call update after every score call: I believe that you want this in order to support asynchronous Hyperband, where you evolve tasks based on the scores that have arrived so far. Instead of that approach I recommend that we just submit a training step on every model until a global update check, but submit those steps with a priority equal to each model's score. This means that we still keep things saturated during stragglers, but won't let good models die if they happen to be unlucky about when they finish. If async Hyperband was your only motivation, then I suggest that we keep things a bit simpler for now with the update function and run it only after all of the previous instructions finish.
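A hedged sketch of that speculative-execution idea, assuming a dask.distributed Client; the names _partial_fit, infos, and X_block/y_block are illustrative, not part of the actual code:

# Optimistically submit the next training step for every model, with priority
# proportional to its latest score, then cancel the speculative work for any
# model that update() decides to drop.
speculative = {
    ident: client.submit(_partial_fit, fut, X_block, y_block,
                         priority=info['score'])
    for ident, (fut, info) in infos.items()
}
instructions = update(scores)                  # the global check-in
unneeded = [f for ident, f in speculative.items() if ident not in instructions]
client.cancel(unneeded)                        # throw away work we no longer want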

Similarly, I suspect that this was your motivation for having "no change" be the default in update, but I'm not sure. If you can include the motivation behind your thoughts, that would be useful.

@mrocklin (Member, Author):
Or, putting this another way. I believe that I've presented a sufficient approach to successive halving in my update function above. I believe that the speculative execution should handle the underlying desires behind the move to async hyperband. I've removed the prioritization for now for simplicity, but it will be easy to put it back.

If I'm missing something and the approach is unlikely to work for some reason, please let me know.

@stsievert (Member):
If you can include motivation behind your thoughts that would be useful

Asynchronous Hyperband is part of my motivation, but not all of it. This is an adaptive training function, one that decides how to train models based on prior evaluations and stored metadata. Wikipedia has a good example of the model updating after seeing one evaluation, which is what I'm used to seeing in adaptive algorithms.

run it [update] only after all of the previous instructions finish.

This won't perform well for synchronous Hyperband. There are a couple of embarrassingly parallel loops in synchronous Hyperband, and there's no reason for one loop to wait on another. It will work well for a single successive halving call within synchronous Hyperband.

This is important: the initial call to update would return anywhere between 1 and 81 calls to _partial_fit. This doesn't really go away in later calls to update, which will specify anywhere between 3 and 27 _partial_fit calls.

This could be avoided if there's some way to specify the different Hyperband loops as independent, at least for synchronous Hyperband. This could work for asynchronous Hyperband too, as long as additional arguments are passed.
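For context, the 1-to-81 spread comes from the bracket schedule in the Hyperband paper: each bracket s trains its models for R * eta**(-s) steps before its first halving decision. A small illustration, assuming the common parameters R=81 and eta=3 (these are configurable, so the exact numbers are only an example):

R, eta = 81, 3
s_max = 4                          # int(log(R, eta))
for s in range(s_max, -1, -1):
    r = R // eta ** s              # initial partial_fit budget per model
    print(f"bracket s={s}: first update asks for {r} partial_fit calls")
# prints budgets of 1, 3, 9, 27, and 81 for the five brackets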

run it [update] only after all of the previous instructions finish.

I think this should be supported as well. That's the simplest interface I would expect.

@mrocklin (Member, Author):
Notebook implementing prototypes of Incremental.fit, RandomSearch, and SuccessiveHalving:

https://gist.github.com/4c95bd26d15281d82e0bf2d27632e294

@mrocklin (Member, Author):
This is an adaptive training function, one that decides to train models based on prior evaluations and stored metadata. Wikipedia has a good example of the model updating after seeing one evaluation, what I'm used to seeing in adaptive algorithms

Understood. And in principle I agree that we might gain from making decisions whenever we find new information. To start out though with my current constraints (finding solutions to the above problems) I seem to be able to implement most things about as efficiently as desired. If we choose to get fancier we can get fancier easily. This seems to suffice for now though?

This won't perform well for synchronous Hyperband. There are a couple of embarrassingly parallel loops in synchronous Hyperband, and there's no reason for one loop to wait on another. It will work well for a single successive halving call within synchronous Hyperband.

How would you interleave calls? The paper specifies a nested loop. That would be trivial to implement here. We can definitely launch the various successive-halving computations concurrently if desired. Or perhaps you're referring to some way to fuse and share computations between successive halving computations?

The initial call to update would return anywhere between 1 and 81 calls to _partial_fit

Just checking, 81 is just an example number here, right? One might ask for thousands I think? I believe that 81 is just a number they use in a table to show an example of how things might evolve, and what they use in their LeNet benchmark. I don't think it's a magic number, please let me know if I'm missing something.

This could be avoided if there's some way to specify the different Hyperband loops as independent, at least for synchronous Hyperband. This could work for asynchronous Hyperband too, as long as additional arguments are passed.

It's definitely possible to run multiple of these at the same time, though they won't be sharing work. I can imagine that that would be helpful if we have abundant parallelism, which seems possible.

To make this concrete, something like the following would work:

@gen.coroutine
def hyperband(...):
    yield [successive_halving(...) for _ in ...]

The Dask client will then run this process many times from the same thread.

@mrocklin (Member, Author):
Also, I'm running this notebook using my cull-before-send branch in dask/distributed

@stsievert (Member):
81 is just an example number here

Correct. Sorry for the confusion. maxiter=81 and that can be customized.

It's definitely possible to run multiple of these at the same time,

Then for synchronous Hyperband I am satisfied. My other concerns go out the window.

There might be something more with tpot or Bayesian sampling, but I'll look at the code first.

If we choose to get fancier we can get fancier easily.

Could you push your code? I take it the notebook was run with modifications to _fit. I'd like to look that code over.

@mrocklin (Member, Author):
Ah! Indeed. I was operating on the wrong branch. Pushed!

@mrocklin (Member, Author):
I recommend the graph view of the dashboard when executing.

My current plan is to leave this as-is until some sort of case study arrives to benchmark it against. I'm inclined to experiment a bit with successive_halving to see how different decay rates behave and then use that as motivation for hyperband after we have some practical experience. Presumably UI bits would come after that.

@TomAugspurger (Member):
Pushed some formatting changes.

Will take a closer look later (tomorrow most likely).

""" Create a model by cloning and then setting params """
with log_errors(pdb=True):
    model = clone(model).set_params(**params)
    return model, {"ident": ident, "params": params, "time_step": -1}
Review comment (Member):
I got confused with time_step here. I thought it referred to the number of times update was called, not the number of times partial_fit was called.

@stsievert (Member) commented Jul 26, 2018

This looks pretty good. A lot of my concerns are addressed by the fact that this

  • works with classes (e.g., update=cls.fit), so arbitrary state can be stored and arbitrary functions can be run.
  • can be called in parallel (though I still need to verify this works).

Making this work with asynchronous Hyperband will be some work but not an incredible amount.
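A minimal sketch of the class-based idea from the first bullet: using a bound method as the update callable so that state (here a simple plateau check) lives on the instance. The class and its knobs (patience, tol) are made up for illustration, and scores follows the {ident: [score, ...]} convention from earlier in this thread:

class StopOnPlateau:
    def __init__(self, patience=10, tol=0.001):
        self.patience = patience
        self.tol = tol

    def update(self, scores):
        out = {}
        for ident, history in scores.items():
            recent = history[-self.patience:]
            # keep training while the recent window is still improving
            improving = (len(recent) < self.patience
                         or max(recent) > recent[0] + self.tol)
            out[ident] = 1 if improving else 0
        return out

policy = StopOnPlateau(patience=5)
# e.g. pass policy.update wherever the update/additional_calls argument goes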

@mrocklin (Member, Author):
Making this work with asynchronous Hyperband will be some work but not an incredible amount.

Yeah, I'm happy to set up concurrent execution of successive halving for Hyperband (or walk someone else through it) after we have a bit more understanding of how random search and successive halving perform. I don't think that building Hyperband is likely to be a priority until then.

This implements a simple approach to do hyper-parameter optimization on
incrementally trained models.  It is heavily inspired by (and steals content
from) dask#221 but attempts less.

Currently it implements a function ``fit``, which takes in a model, training
and testing data and a parameter space.  It creates many models and trains them
on a block of data, scores them, and then keeps only the better performing
ones.  It continues to do this until only one is selected.

This function is not intended for public users, but is hopefully something we
can play with as we eventually build a full solution.  It is decently
configurable today and we will need to find sane defaults.  Notably, this
function does not take opinions on the following:

1.  How testing data should be separated from training data,
    currently they are both explicitly provided
2.  How many parameters to start with, or how to decrease this over time
    These are given as inputs with an integer and function respectively

Additionally it does not, but probably should provide configurations for the
following:

1.  How to select models from the current group
2.  How to decide on stopping criteria
3.  Any ability to actively decide the number of surviving models
    based on the scores received so far

I believe that this implementation is a decent start though and should be a
conversation piece.  I think that it makes some decisions that we don't want to
think about yet (online determination of how many models to keep) while giving
us knobs for other parameters that I think we'll want to play with (model decay
functions, how to handle testing, etc.)
This removes logic from the fit function to make it less about successive
halving and more general, so that it can support Incremental.fit and RandomSearchCV.
@mrocklin mrocklin force-pushed the incremental-model-selection branch from b213809 to aaaf7e8 Compare July 28, 2018 00:15
@stsievert (Member):
I thought I discovered some bugs, but turns out I didn't.

@stsievert stsievert mentioned this pull request Aug 1, 2018
@stsievert (Member):
It'd be nice to support keyboard interrupts here, and to preserve the models/infos/etc. on exit.

My use case is a user firing a keyboard interrupt and still recovering the best model's score and related info.

@mrocklin (Member, Author) commented Aug 31, 2018 via email

@mrocklin (Member, Author) commented Aug 31, 2018 via email

@mrocklin (Member, Author) commented Sep 5, 2018

@TomAugspurger do you have thoughts on the right way to make a RandomSearch version of this that is API-familiar to a Scikit-Learn user? Is there even an established API for this kind of thing with partial_fit?

@TomAugspurger (Member) left a review comment:

@TomAugspurger do you have thoughts on the right way to make a RandomSearch version of this that is API-familiar to a Scikit-Learn user?

Does something roughly like this make sense?

class IncrementalSearch(BaseEstimator):
    def __init__(self, estimator, params, additional_calls, X_test, y_test):
        self.estimator = estimator
        self.params = params
        self.additional_calls = additional_calls
        self.X_test = X_test
        self.y_test = y_test

    def fit(self, X, y=None, **fit_params):
        info, model, history = fit(self.estimator, self.params, X, y, self.X_test,
                                   self.y_test, self.additional_calls, **fit_params)
        self.info_ = info
        self.model_ = model
        self.history_ = history
        return self

The only strange part is passing the validation set to the init. That could maybe be done as a fit param, and plucked from fit_params in fit. I'm not sure there's anything in scikit-learn that takes a validation set as an argument (maybe @ogrisel knows otherwise). Estimators like SGDClassifier will internally create a validation set from the training data when early_stopping=True. Perhaps we could do the same?
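A hedged sketch of that fit-param variant, reusing the class sketched above; this is only one possible shape, and the keyword names X_test/y_test are assumptions:

    def fit(self, X, y=None, **fit_params):
        # Pluck the validation set out of fit_params instead of __init__
        X_test = fit_params.pop("X_test", None)
        y_test = fit_params.pop("y_test", None)
        if X_test is None:
            # or: split a validation set off X/y internally, the way
            # SGDClassifier does when early_stopping=True
            raise ValueError("X_test and y_test are required for now")
        info, model, history = fit(self.estimator, self.params, X, y,
                                   X_test, y_test, self.additional_calls,
                                   **fit_params)
        self.info_, self.model_, self.history_ = info, model, history
        return self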

raise gen.Return((info, models, history))


def fit(*args, **kwargs):
Review comment (Member):

Would be best to make this the actual signature.

    Numpy array or small dask array. Should fit in memory.
y_test : Array
    Numpy array or small dask array. Should fit in memory.
start : int
Review comment (Member):
removed I think.

Add additional_calls.

>>> X_train = X[100000:]
>>> y_train = y[100000:]

>>> info, model, history = yield fit(model, params,
Review comment (Member):

Need to update.

@mrocklin (Member, Author) commented Sep 5, 2018

class IncrementalSearch(BaseEstimator):

Would it make sense to factor this out from the Hyperband work going on now?

@TomAugspurger (Member):
Yes, I think so. But I'm not sure what order we should do things in.

  • IncrementalSearch (is that the best name?) that provides a kind of baseline for this class of search models
  • Hyperband is a relatively minimal extension to IncrementalSearch implementing its strategies for model selection, etc.

Does that conform with others' expectations? I'm happy to take over the work from those who are busy with other things.

@mrocklin (Member, Author) commented Sep 5, 2018

I've updated the docstring considerably, including a runnable example. Tests will fail until dask/distributed#2232 is merged in.

@mrocklin (Member, Author) commented Sep 5, 2018

Yes, I think so. But I'm not sure what order we should do things in.

  • IncrementalSearch (is that the best name?) that provides a kind of baseline for this class of search models
  • Hyperband is a relatively minimal extension to IncrementalSearch implementing its strategies for model selection, etc.

Does that conform with others' expectations? I'm happy to take over the work from those who are busy with other things.

Yes, I think that ideally something like Hyperband would look something like the following:

class Hyperband(IncrementalSearch):
    def __init__(self, **hyperband_specific_kwargs):
        ...

    def additional_calls(self, scores):
        ...

If this existed I think that it would be easier to experiment with this sort of thing.

I should write up documentation for this PR, but I might want to wait until such a super-class exists.

@mrocklin (Member, Author) commented Sep 5, 2018

I would be happy to leave the sklearn-specific conventions to someone else :)

@TomAugspurger (Member):

OK then, how's this for a plan:

  1. Merge "add byte_keys to unpack_remotedata call" (distributed#2232)
  2. Merge this
  3. I take a pass at an IncrementalSearch base class today
  4. We rebase Hyperband on IncrementalSearch

And follow-up items for documentation, examples, etc. of IncrementalSearch

The goal being to maximize the effectiveness of @stsievert's time :)

@mrocklin (Member, Author) commented Sep 5, 2018 via email

@@ -218,10 +219,23 @@ def get_futures(partial_fit_calls):

new_scores = list(_scores2.values())

models = {k: client.submit(operator.getitem, v, 0) for k, v in models.items()}
@mrocklin (Member, Author):
This will break the current Hyperband implementation BTW

@stsievert (Member):

class IncrementalSearch(BaseEstimator):

Perfect – I had this idea when playing with stop_on_plateau, but didn't bring it up. This should also inherit from DaskBaseSearchCV, right?

@TomAugspurger (Member) commented Sep 5, 2018 via email

@mrocklin (Member, Author) commented Sep 5, 2018

Good to merge any time, @TomAugspurger. I'll leave this to you.

if isinstance(X_test, da.Array):
    X_test = client.compute(X_test)
else:
    y_test = yield client.scatter(y_test)
Review comment (Member):
Should this be X_test = yield client.scatter(X_test)?

@mrocklin (Member, Author):
Yes, it probably should. My mistake.
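For clarity, the corrected version of that snippet would presumably be:

if isinstance(X_test, da.Array):
    X_test = client.compute(X_test)
else:
    X_test = yield client.scatter(X_test)   # scatter X_test, not y_test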

@TomAugspurger (Member):

Updated that, and made the data a bit smaller in the hope of making it run a bit faster.

@TomAugspurger merged commit 80f0b14 into dask:master on Sep 5, 2018
@mrocklin deleted the incremental-model-selection branch on September 5, 2018, 21:01
TomAugspurger added a commit to TomAugspurger/dask-ml that referenced this pull request Sep 6, 2018
commit 41b0af2
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Sep 5 11:11:20 2018 -0500

    [WIP]: Base class for estimators using incremental fit

commit 80f0b14
Author: Matthew Rocklin <mrocklin@gmail.com>
Date:   Wed Sep 5 14:44:12 2018 -0400

    Incremental model selection (dask#288)

    This implements a simple approach to do hyper-parameter optimization on
    incrementally trained models.  It is heavily inspired by (and steals content
    from) dask#221 but attempts less.