
API: better adapt HyperbandSearchCV for exploratory searches #532

Open · wants to merge 34 commits into base: main
Conversation

@stsievert (Member) commented Jul 15, 2019

What does this PR implement?
This PR allows skipping the bracket that trains all models to completion when bool(patience) is True. That bracket is likely of marginal use because the hyperparameters will (almost certainly) influence the score. Why not eliminate this bracket entirely instead of designing patience=True to accommodate it?

Reference issues/PRs

  • the prioritization scheme (#527). This means higher scoring models are run first.
  • (future) keyboard interrupts (#528 (comment)).

These two features mean that the user can exit out of hyper-parameter selection when the score is high enough.

This PR is not ready to be merged. I still need to verify the implementation; see #532 (comment)

TODO

  • implement
  • verify that the API change achieves high scores
  • run simulations to see how much this affects time-to-solution
  • document
  • allow explore to be negative (slightly mirroring list indexing).

@stsievert (Member Author) commented Jul 15, 2019

This PR is motivated by a question from @amueller after my SciPy19 talk. Why run this bracket if it's not the best performing? In simulations, bracket=0 is the best bracket 3.5% of the time:

[Screenshot: histogram of which bracket performs best across simulations]

This histogram is from the 200 runs of HyperbandSearchCV I did in https://github.com/stsievert/dask-hyperband-comparison.

@amueller commented Jul 15, 2019

Thanks, that's cool! The question I'm most interested in is: what if you only ran bracket 4 in all cases? Would it be substantially worse than running hyperband?

@stsievert (Member Author) commented Jul 16, 2019

what if you only ran bracket 4 [the most aggressive bracket] in all cases?

I'll take your question as "how can the search be made more aggressive?"

The aggressiveness of Hyperband's early stopping scheme is configurable (which I accidentally forgot to mention in the talk!*). It's controlled by HyperbandSearchCV's optional aggressiveness input, which determines how many models are kept each time models are stopped and how long to train between stoppings.

By default, aggressiveness=3; if changed, Li et al. recommend setting aggressiveness to 4:

The value of η [aka aggressiveness] can be viewed as a knob that can be tuned based on practical user constraints. Larger values of η correspond to a more aggressive elimination schedule and thus fewer rounds of elimination ... If our [infinite horizon] theoretical bounds are optimized (see Section 4) they suggest choosing η = e ≈ 2.718 but in practice we suggest taking η to be equal to 3 or 4 (if you don’t know how to choose η, use η = 3).
...
If overhead is not a concern and aggressive exploration is desired, one can (1) increase η to reduce the number of brackets while maintaining R as the maximum number of configurations in the most exploratory bracket or (2) still use η = 3 or 4 but only try brackets that do a baseline level of exploration,
—Section 2.6 of the Hyperband paper.

aggressiveness=4 is used in #532 (comment); this PR removes the least aggressive bracket(s).
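As a rough sketch of how η/aggressiveness shapes the brackets, here is the bracket layout from the formulas in Li et al.'s Algorithm 1 (a simplified sketch of the paper's math, not dask-ml's exact internal computation; max_iter=81 is an assumed example value):

```python
import math

def hyperband_brackets(R, eta=3):
    # Bracket layout per the Hyperband paper's Algorithm 1 (a sketch;
    # dask-ml's internal computation may differ in detail).
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1
    B = (s_max + 1) * R  # total budget
    return {
        s: {
            "n_models": math.ceil(B / R * eta**s / (s + 1)),
            "initial_iter": R / eta**s,
        }
        for s in range(s_max + 1)
    }

brackets = hyperband_brackets(81, eta=3)
# Bracket 4 (most exploratory): 81 models, each starting with 1 iteration.
# Bracket 0 (most "passive"): 5 models, all trained for the full 81 iterations.
# With eta=4 there are only 4 brackets instead of 5, as the quote describes.
```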

It's also possible to only run bracket 4:

from dask_ml.model_selection import HyperbandSearchCV, SuccessiveHalvingSearchCV

# model, params, and max_iter are defined by the user as usual
search = HyperbandSearchCV(model, params, max_iter=max_iter)

# Pull the parameters of the most exploratory bracket (bracket 4 here)
bracket_params = search.metadata["brackets"][4]["SuccessiveHalvingSearchCV params"]
bracket = SuccessiveHalvingSearchCV(model, params, **bracket_params)

This requires the user to decide how important the amount of training data is, and which bracket to choose.

* I have removed that typo from the slides

@amueller commented Jul 16, 2019

Aggressiveness is not what I meant; that's just the base of the "halving", right?
Unless I really misunderstood something. The eta/aggressiveness is the same for all brackets, right? Only the number of hyper-parameters that are sampled and the minimum resources allocated differ between brackets, right?

What I meant was what you showed last. The paper basically says that just running the bracket that puts the least emphasis on the amount of training data basically always wins, and in practice Successive Halving is basically as good as Hyperband.

@stsievert (Member Author) commented Jul 17, 2019

that's just the base of the "halving", right? The eta/aggressiveness is the same for all brackets, right? Only the number of hyper-parameters that are sampled and the minimum resources allocated are different in each bracket, right?

All correct.

What I meant was what you showed last. The paper basically says that just running the bracket that puts the least emphasis on the amount of training data basically always wins, and in practice Successive Halving is basically as good as Hyperband.

Good. Yes, almost always* (there's one case where the second most exploratory bracket is optimal, and it has a narrow search space). The other experiments all run Hyperband (via hyperband) alongside repeats of the most exploratory bracket (via bracket s=4):

[Screenshot: Figure 5 from the Hyperband paper]

Figure 5: Average test error across 10 trials is shown in all plots. Label “SMAC early” corresponds to SMAC with the early stopping criterion proposed in Domhan et al. (2015) and label “bracket s = 4” corresponds to repeating the most exploratory bracket of Hyperband.

Looks like repeating the most exploratory bracket finds slightly lower losses than Hyperband on average, at least in their example. I've also simulated this with my data (the same example as #532 (comment)):

[Screenshot: histograms comparing Hyperband with 1, 2, and 3 repeats of the most exploratory bracket]

Here "2 repeats" means "2 repeats of the most exploratory bracket". It looks like 2 or 3 runs of the most exploratory bracket recovers the Hyperband performance, at least for this example.

I'll run some more simulations to see how this performs over time. I'll probably edit the search space a bit to make it more difficult for the most adaptive bracket to be optimal.

* The one exception is Figure 4 (which runs LeNet), where the second most aggressive bracket is optimal

@amueller commented Jul 18, 2019

Thanks. Yes this is what I meant ;)
What's the x axis on your plot? Number of configurations run?
And how do you simulate the repeats?
How do repeats compare against running more configs in parallel?

@stsievert (Member Author) commented Jul 19, 2019

How do repeats compare against running more configs in parallel?

They're the same thing. "Repeating the most exploratory bracket two times" is "running two instances of that bracket in parallel". i.e., "2 repeats" means "running 2 copies of the most exploratory bracket in parallel".

#532 (comment) is only concerned with the final score, not any of the intermediate scores. I plan to run those simulations somewhat soon.

What's the x axis on your plot? Number of configurations run?

Whoops!

The plot above is an estimate of the final validation score for running a different number of the most exploratory bracket in parallel. It's a vertical histogram, so the x-axis is the frequency from my simulations.

And how do you simulate the repeats?

Earlier, I simulated Hyperband 200 times so I also have the results from 200 runs of the most exploratory bracket (shown in "1 repeat"). The other plots (with more repeats) are generated by pulling from that histogram, and taking the maximum of the values pulled. Something like

import numpy as np

def simulate(final_scores, repeats):
    # Draw `repeats` runs (with replacement) and keep the best final score
    return np.random.choice(final_scores, size=repeats).max()

n = 200  # I ran Hyperband 200 times in an earlier simulation

# For each run, pull the best validation score from bracket=4
# (simulation_results loads the saved results; defined elsewhere)
one_run = [simulation_results(i, bracket=4) for i in range(n)]  # values in "1 repeat"

two_runs = [simulate(one_run, 2) for _ in range(1000)]
three_runs = [simulate(one_run, 3) for _ in range(1000)]

The plot I showed above is only for the final score. It does not characterize how the score changes over time.

@amueller commented Jul 19, 2019

They're the same thing. "Repeating the most exploratory bracket two times" is "running two instances of that bracket in parallel". i.e., "2 repeats" means "running 2 copies of the most exploratory bracket in parallel".

Adding more runs is not the same as repeating a bracket, I think, because you'd get the best 50% overall, not the best 50% from each repeat. So combining them would be less randomized.
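This distinction can be sketched with a toy NumPy example (hypothetical scores, not dask-ml code): keeping the top 50% within each repeat is not the same as keeping the top 50% of the pooled population.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(size=(2, 10))  # 2 repeats of a bracket, 10 models each

k = scores.shape[1] // 2

# Repeating the bracket: each repeat independently keeps its own top 50%.
per_repeat = np.sort(scores, axis=1)[:, -k:]          # shape (2, 5)

# Pooling all runs: keep the top 50% of all 20 models combined.
pooled = np.sort(scores.ravel())[-scores.size // 2:]  # shape (10,)

# Pooling may keep, say, 7 survivors from one repeat and 3 from the other,
# so the combined search is "less randomized" than independent repeats.
```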

Hyperband is as expensive as 4 repeats but it's worse when looking at the histogram, right? It looks even worse than two repeats to me. So doing repeats seems better than hyperband according to your histogram, right?

@stsievert stsievert changed the title WIP: allow not running passive bracket in HyperbandSearchCV WIP: better adapt HyperbandSearchCV for exploratory searches Sep 17, 2019
@stsievert stsievert marked this pull request as draft Apr 16, 2020
@stsievert (Member Author) commented May 27, 2020

I'm currently doing some hyperparameter optimization for a paper I'm writing. I want to quickly get a rough idea of good parameters without an excessive amount of computation. Currently, I find myself running the most exploratory bracket of Hyperband: it's only one bracket and there's not much computation.

Maybe this would be a good API:

  • explore=True: a very exploratory search. Two copies of the most aggressive bracket and (in the future) one copy of InverseDecaySearchCV?
  • isinstance(explore, int): run explore repeats of the most exploratory bracket.

This API lets users pass explore=True for a minimal computation that's (very likely) as good as Hyperband. More advanced users can set explore to an integer; the documentation gives a recommendation.

I've pushed this API, with a note that explore is still experimental.
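For concreteness, the proposed semantics could be resolved to a repeat count roughly like this (a hypothetical helper sketching the API described above, not the PR's actual code; the negative values from the TODO are left out):

```python
def resolve_explore(explore):
    # Sketch of the proposed semantics: False -> plain Hyperband,
    # True -> two copies of the most exploratory bracket,
    # int -> that many repeats. (Hypothetical helper, not dask-ml code.)
    if explore is False:
        return 0   # run the usual Hyperband brackets
    if explore is True:
        return 2   # minimal exploratory default
    if isinstance(explore, int):
        return explore
    raise ValueError("explore must be a bool or an int")
```

Note the bool checks come first: isinstance(True, int) holds in Python, so checking isinstance(explore, int) first would silently treat True as 1.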

@stsievert stsievert marked this pull request as ready for review May 28, 2020
@stsievert stsievert changed the title WIP: better adapt HyperbandSearchCV for exploratory searches API: better adapt HyperbandSearchCV for exploratory searches May 28, 2020
@stsievert (Member Author) commented May 28, 2020

I think this PR is ready for review now. The implemented changes summarized in #532 (comment) come from the needs of the search I'm currently performing: I'm not interested in finding the best parameters, I'm interested in finding good parameters quickly.

I'm currently manually running the most exploratory bracket, or explore=1. This PR has the following benefits:

  • explore has shown good empirical performance, especially if explore in [2, 3].* The most exploratory bracket is very likely optimal, and will often out-perform Hyperband if repeated (i.e., if isinstance(explore, int)).**
  • explore allows easy specification of Hyperband parameters: I can specify the two parameters I care about per the Hyperband rule-of-thumb (training time and number of hyperparameters to sample).
  • explore effectively limits the total computation (so my EC2 bill isn't too high).

The only backing simulations are in #532 (comment). I have not done any simulations to verify how cluster usage varies as the search progresses (which won't be relevant until #676 is resolved); most of my concerns are practical (EC2 cost & time).

*At least for the experiments in #532 (comment)
**See the comparison between Hyperband and bracket s=4 in Figures 5, 6, and 7 of the Hyperband paper. In Figure 3, the second most aggressive bracket of Hyperband does better for a search on LeNet (the paper has a section titled "tips and tricks").

@TomAugspurger (Member) left a comment

Looks nice overall I think.

To confirm, with the default of explore=False, there's no change in the default behavior?

[Review thread on dask_ml/model_selection/_hyperband.py (outdated, resolved)]
@stsievert (Member Author) commented Jun 6, 2020

This PR is ready for review again.

[Review thread on dask_ml/model_selection/_hyperband.py (outdated, resolved):]
If ``explore`` is a bool, run a search aimed at finding the same
validation accuracy as Hyperband with ``explore=False`` but with
less computation.
If ``explore`` is an integer, repeat the most exploratory bracket ``explore`` times.
@TomAugspurger (Member) commented Aug 6, 2020


What does "exploratory" mean in this context?

@stsievert (Member Author) commented Aug 6, 2020


It means the bracket that samples the largest number of models.

[Two further review threads on dask_ml/model_selection/_hyperband.py (outdated, resolved)]
stsievert and others added 3 commits Aug 6, 2020
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
@stsievert (Member Author) commented Aug 6, 2020

Thanks for the comments; resolved. I'd like to see this merged; I think #532 (comment) is enough motivation. There's an additional benchmark that needs to be performed to see how the scores increase throughout time, but that's a tangential concern that's not relevant until #677 is merged.

@TomAugspurger (Member) left a comment

Thanks. LGTM, pending the question about timeouts in the tests. I think they can be removed?

[Two review threads on tests/model_selection/test_hyperband.py (outdated, resolved)]
stsievert and others added 2 commits Aug 6, 2020
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Base automatically changed from master to main Feb 2, 2021
3 participants