
WIP: better adapt HyperbandSearchCV for exploratory searches #532

Open

@stsievert (Member) commented Jul 15, 2019

This PR is not ready to be merged. I still need to verify the implementation.

What does this PR implement?
This PR skips the bracket that trains all models to completion when bool(patience) is true. That bracket is likely of marginal use because the hyperparameters will (almost certainly) influence the score. Why design patience=True to accommodate this bracket when the bracket could simply be eliminated?
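For context, the bracket being discussed is the passive bracket s=0 in Hyperband's schedule. A small sketch of that schedule, based on the formulas in the Hyperband paper (this illustrates the schedule itself, not dask-ml's implementation):

```python
import math

def hyperband_brackets(R, eta=3):
    """Bracket schedule from the Hyperband paper: bracket s starts
    n = ceil((s_max + 1) / (s + 1) * eta**s) models, each trained
    r = R / eta**s iterations before the first stopping decision.
    Bracket s=0 is the passive one: all of its models train the
    full R iterations, with no early stopping at all."""
    s_max = int(math.log(R) / math.log(eta) + 1e-9)  # guard float rounding
    return {
        s: {"n_models": math.ceil((s_max + 1) / (s + 1) * eta ** s),
            "initial_iter": R / eta ** s}
        for s in range(s_max, -1, -1)
    }

brackets = hyperband_brackets(81, eta=3)
print(brackets[0])  # passive bracket: 5 models, each trained all 81 iterations
print(brackets[4])  # most exploratory bracket: 81 models at 1 iteration each
```

With R=81 and eta=3 this yields the five brackets from Table 1 of the paper (81, 34, 15, 8, and 5 models); skipping bracket 0 removes the one bracket that never stops any model early.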

Reference issues/PRs

  • the prioritization scheme (#527). This means higher-scoring models are run first.
  • (future) keyboard interrupts (#528 (comment)).

These two features mean the user can exit hyper-parameter selection once the score is high enough.

TODO

  • implement
  • verify that bracket=0 returns low scores
  • run simulations to see how much this improves time-to-solution
  • document
@stsievert (Member Author) commented Jul 15, 2019

This PR is motivated by a question from @amueller after my SciPy19 talk. Why run this bracket if it's not the best performing? In simulations, bracket=0 is the best bracket 3.5% of the time:

[Screenshot: histogram of which bracket is best across simulation runs]

This histogram is from the 200 runs of HyperbandSearchCV I did in https://github.com/stsievert/dask-hyperband-comparison.

@amueller commented Jul 15, 2019

Thanks, that's cool! The question I'm most interested in is: what if you only ran bracket 4 in all cases? Would it be substantially worse than running hyperband?

@stsievert (Member Author) commented Jul 16, 2019

what if you only ran bracket 4 [the most aggressive bracket] in all cases?

I'll take your question as "how can the search be made more aggressive?"

The aggressiveness of Hyperband's early stopping scheme can be configured (which I forgot to mention in the talk!*). It's controlled by HyperbandSearchCV's optional aggressiveness input, which determines both how many models survive each round of stopping and how long models train between rounds.

By default, aggressiveness=3. If it is changed, Li et al. recommend setting aggressiveness to 4:

The value of η [aka aggressiveness] can be viewed as a knob that can be tuned based on practical user constraints. Larger values of η correspond to a more aggressive elimination schedule and thus fewer rounds of elimination ... If our [infinite horizon] theoretical bounds are optimized (see Section 4) they suggest choosing η = e ≈ 2.718 but in practice we suggest taking η to be equal to 3 or 4 (if you don’t know how to choose η, use η = 3).
...
If overhead is not a concern and aggressive exploration is desired, one can (1) increase η to reduce the number of brackets while maintaining R as the maximum number of configurations in the most exploratory bracket or (2) still use η = 3 or 4 but only try brackets that do a baseline level of exploration,
—Section 2.6 of the Hyperband paper.

aggressiveness=4 amounts to option (1) above (see #532 (comment)). This PR takes option (2): it removes the least aggressive bracket(s).
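Raising η shrinks the number of brackets, since Hyperband runs ⌊log_η(R)⌋ + 1 of them. A small sketch of that count, based on the paper's formula (not dask-ml internals):

```python
import math

def n_brackets(R, eta):
    # Hyperband runs brackets s = s_max, ..., 0, where s_max = floor(log_eta(R)).
    # The small epsilon guards against floating-point rounding when R is an
    # exact power of eta.
    return int(math.log(R) / math.log(eta) + 1e-9) + 1

print(n_brackets(81, eta=3))  # 5 brackets
print(n_brackets(81, eta=4))  # 4 brackets: one fewer, more aggressive overall
```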

It's also possible to only run bracket 4:

from dask_ml.model_selection import HyperbandSearchCV, SuccessiveHalvingSearchCV

search = HyperbandSearchCV(model, params, max_iter=max_iter)

# Pull the parameters of the most exploratory bracket and run it on its own.
bracket_params = search.metadata["brackets"][4]["SuccessiveHalvingSearchCV params"]
bracket = SuccessiveHalvingSearchCV(model, params, **bracket_params)

This requires the user to decide how much the amount of training data matters, and which bracket to choose.

* I have removed that typo from the slides

@amueller commented Jul 16, 2019

Aggressiveness is not what I meant; that's just the base of the "halving", right?
Unless I really misunderstood something. The eta/aggressiveness is the same for all brackets, right? Only the number of hyper-parameters that are sampled and the minimum resources allocated are different in each bracket, right?

What I meant was what you showed last. The paper basically says that just running the bracket that puts the least emphasis on the amount of training data basically always wins, and in practice Successive Halving is basically as good as Hyperband.

@stsievert (Member Author) commented Jul 17, 2019

that's just the base of the "halving", right? The eta/aggressiveness is the same for all brackets, right? Only the number of hyper-parameters that are sampled and the minimum resources allocated are different in each bracket, right?

All correct.

What I meant was what you showed last. The paper basically says that just running the bracket that puts the least emphasis on the amount of training data basically always wins, and in practice Successive Halving is basically as good as Hyperband.

Good. Yes, almost always* (there's one case where the second most exploratory bracket is optimal, and it has a narrow search space). The other experiments all run Hyperband (via hyperband) alongside repeats of the most exploratory bracket (via bracket s=4):

[Screenshot: figure from the Hyperband paper comparing hyperband with repeated runs of bracket s=4]

Looks like repeating the most exploratory bracket finds slightly lower losses than Hyperband on average, at least in their example. I've also simulated this with my data (the same example as #532 (comment)):

[Screenshot: vertical histograms of final validation scores for 1, 2, and 3 repeats of the most exploratory bracket]

Here "2 repeats" means "2 repeats of the most exploratory bracket". It looks like 2 or 3 runs of the most exploratory bracket recovers the Hyperband performance, at least for this example.

I'll run some more simulations to see how this performs over time. I'll probably edit the search space a bit to make it more difficult for the most adaptive bracket to be optimal.

* The one exception: Figure 4 (which runs LeNet) finds the second most aggressive bracket optimal.

@amueller commented Jul 18, 2019

Thanks. Yes this is what I meant ;)
What's the x axis on your plot? Number of configurations run?
And how do you simulate the repeats?
How do repeats compare against running more configs in parallel?

@stsievert (Member Author) commented Jul 19, 2019

How do repeats compare against running more configs in parallel?

They're the same thing: "repeating the most exploratory bracket two times" is "running two instances of that bracket in parallel", so "2 repeats" means "running 2 copies of the most exploratory bracket in parallel".

#532 (comment) is only concerned with the final score, not any of the intermediate scores. I plan to run those simulations somewhat soon.

What's the x axis on your plot? Number of configurations run?

Whoops!

The plot above is an estimate of the final validation score for running a different number of the most exploratory bracket in parallel. It's a vertical histogram, so the x-axis is the frequency from my simulations.

And how do you simulate the repeats?

Earlier, I simulated Hyperband 200 times so I also have the results from 200 runs of the most exploratory bracket (shown in "1 repeat"). The other plots (with more repeats) are generated by pulling from that histogram, and taking the maximum of the values pulled. Something like

import numpy as np

def simulate(final_scores, repeats):
    # Best validation score among `repeats` independent runs of the bracket.
    return np.random.choice(final_scores, size=repeats).max()

n = 200  # I ran Hyperband 200 times in an earlier simulation

# For each run, pull the best validation score from bracket=4
one_run = [simulation_results(i, bracket=4) for i in range(n)]  # values in "1 repeat"

two_runs = [simulate(one_run, 2) for _ in range(1000)]
three_runs = [simulate(one_run, 3) for _ in range(1000)]

The plot I showed above is only for the final score. It does not characterize how the score changes over time.
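To see why taking the max over repeats shifts the final-score distribution upward, here is a self-contained version of that simulation. The score distribution below is synthetic (made up for illustration; the real values came from my earlier Hyperband runs):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 200 final scores of bracket=4.
final_scores = rng.normal(loc=0.85, scale=0.03, size=200)

def simulate(final_scores, repeats):
    # Best final score among `repeats` independent runs of the bracket.
    return rng.choice(final_scores, size=repeats).max()

one_repeat = final_scores.mean()
two_repeats = np.mean([simulate(final_scores, 2) for _ in range(1000)])
three_repeats = np.mean([simulate(final_scores, 3) for _ in range(1000)])
print(one_repeat, two_repeats, three_repeats)  # each extra repeat raises the mean
```

Each additional repeat takes a max over more draws, so the mean of the best score only goes up; the gain from the second repeat is larger than the gain from the third, matching the diminishing returns in the histograms.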

@amueller commented Jul 19, 2019

They're the same thing. "Repeating the most exploratory bracket two times" is "running two instances of that bracket in parallel". i.e., "2 repeats" means "running 2 copies of the most exploratory bracket in parallel".

Adding more runs is not the same as repeating a bracket, I think, because you'd keep the best 50% overall rather than the best 50% from each repeat. So combining them would be less randomized.

Hyperband is as expensive as 4 repeats but it's worse when looking at the histogram, right? It looks even worse than two repeats to me. So doing repeats seems better than hyperband according to your histogram, right?

@stsievert stsievert changed the title WIP: allow not running passive bracket in HyperbandSearchCV WIP: better adapt HyperbandSearchCV for exploratory searches Sep 17, 2019