Allow for independence between number of trials and number of combinations. #415

Closed
marianokamp opened this issue Nov 4, 2022 · 23 comments


@marianokamp

TL;DR: Please allow running experiments with more (or fewer) trials than the number of combinations.

Right now, when running a hyperparameter tuning job with AMT, I get an error message when the number of trials exceeds the number of enumerable combinations; here, the discrete combinations of integer and categorical parameter values.

But this is limiting, as the exploration of the search space is noisy and more than one trial per combination may be needed to understand the variance and to establish a stable mean.

As a workaround I am starting tuning jobs with the Random strategy and an additional continuous hyperparameter "dummy". This way I can specify the number of trials I need. But this makes it harder to use this data as a basis for future warm starts to narrow down scenarios. Further, it forces me to accept the "dummy" parameter in my training script.

Example:
I want to know if adding a non-linearity and additional capacity (a Pooler) on top of a BERT-like model will yield better performance, or if the extra capacity will make the model lazier and not use the transformer blocks below.
I also want to see if this assessment changes when adding more transformer layers.

So I have two categorical variables: layers [1, 4, 8] and scale-of-classifier [0, 0.5, 1.0, 2.0].
These are just 3*4 = 12 combinations. But given the noisy nature of NN training, a single data point per combination has next to no meaning. To produce the understanding below I used about 100 data points with the workaround from above.
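For illustration, this is roughly what the workaround looks like with the SageMaker Python SDK; the estimator, metric name, and metric regex are placeholders, and the ranges mirror the example above:

from sagemaker.tuner import (
    HyperparameterTuner,
    CategoricalParameter,
    ContinuousParameter,
)

estimator = ...  # placeholder: a configured SageMaker estimator for the training script

hyperparameter_ranges = {
    "layers": CategoricalParameter([1, 4, 8]),
    "clf-scale": CategoricalParameter([0.0, 0.5, 1.0, 2.0]),
    # Dummy continuous parameter: makes the search space infinite, so max_jobs
    # is no longer capped at the 3 * 4 = 12 discrete combinations.
    "dummy": ContinuousParameter(0.0, 1.0),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="val_acc",  # placeholder metric name
    metric_definitions=[{"Name": "val_acc", "Regex": "val_acc=([0-9\\.]+)"}],
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Random",
    max_jobs=100,           # independent of the number of combinations
    max_parallel_jobs=4,
)
# tuner.fit({"train": ..., "validation": ...})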

If I could just specify the categorical parameters and the number of trials (for GridSearch/Random), my appreciation would follow you until the end of your hopefully long and fulfilling life.

[Images: distribution of the objective metric for each clf-scale value (0, 0.5, 1.0, 2.0), one panel per number of layers (1, 4, 8)]

@wistuba
Collaborator

wistuba commented Nov 4, 2022

Not sure this makes sense for every customer, given that some might seed their training scripts. An unorthodox way, but you could add "seed": randint(10) to the config space (equivalent to your dummy variable, but it at least avoids running the exact same job twice).

Besides that, your training job itself might just run multiple repetitions and report mean and standard deviation. This seems to be what you are interested in anyway.
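For illustration, a minimal sketch of what such an inner repetition loop could look like in the training script; train_once is a hypothetical stand-in for the actual training and evaluation:

import argparse
import random
import statistics

def train_once(layers, clf_scale, seed):
    # Hypothetical stand-in for the actual training and evaluation;
    # returns the validation metric for one repetition.
    random.seed(seed)
    return random.uniform(0.85, 0.95)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--layers", type=int, default=1)
    parser.add_argument("--clf-scale", type=float, default=0.0)
    parser.add_argument("--repetitions", type=int, default=5)
    args = parser.parse_args()

    scores = [train_once(args.layers, args.clf_scale, seed)
              for seed in range(args.repetitions)]

    # Emit mean and standard deviation; a tuner metric regex would pick up the mean.
    print(f"val_acc_mean={statistics.mean(scores):.4f}")
    print(f"val_acc_std={statistics.stdev(scores):.4f}")

if __name__ == "__main__":
    main()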

@marianokamp
Author

Not sure I can follow here.

[run the same script multiple times]

I can certainly run the script multiple times or create an inner loop inside the training job (the HPO job being the outer loop), but I would break the general connection we maintain in SM Training, that one trial is one training job producing one set of metadata in SM Search, Experiments, etc. It would also mean that the training script now needs to know it is part of an experiment and that we need different code or parameterization to run it to produce a model, say as part of a re-training pipeline.

Not sure this makes sense for every customer

I thought the exact opposite. But maybe I don't understand. In which cases would it make sense for customers to run just one trial per combination? When there is zero noise in the estimates?

@iaroslav-ai
Collaborator

iaroslav-ai commented Nov 5, 2022

Hey @marianokamp, a few questions:

  1. Do you expect use cases where you need the mean / variance of the objective when some parameters are continuous, i.e. the search space is not finite?
  2. At least in AMT, we choose the best HP configuration as the one with the min / max observed objective. Would you choose the best HPs here based on the mean of the objectives for that configuration?
  3. For the same set of HPs, you will be getting multiple models. Within the set of models for the same HP configuration, how would you choose the one to deploy for inference?

@wistuba
Collaborator

wistuba commented Nov 6, 2022

A very simple solution to your problem is, as you pointed out, to extend your configuration space with a dummy variable. I would recommend not using a continuous one, since it does not allow you to control how often a configuration is repeated.

My suggestion:

from syne_tune.config_space import randint

# randint(lower, upper) draws integers uniformly from [lower, upper], both inclusive
my_config_space["repetition_id"] = randint(0, num_repetitions - 1)

Admittedly, this might be a bit hacky but if you use argparse, you could also ignore unknown args if you are not willing to add repetition_id as a parameter to your script. I don't think warmstarting is relevant for you given that you evaluate each configuration anyway.
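A minimal sketch of the argparse part; the parameter names are just illustrative:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--layers", type=int)
parser.add_argument("--clf-scale", type=float)

# parse_known_args() returns (namespace, leftovers): unknown arguments such as
# --repetition_id end up in leftovers instead of raising an error.
args, leftovers = parser.parse_known_args()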

[run the same script multiple times]

I can certainly run the script multiple times or create an inner loop inside the training job (the HPO job being the outer loop), but I would break the general connection we maintain in SM Training, that one trial is one training job producing one set of metadata in SM Search, Experiments, etc. It would also mean that the training script now needs to know it is part of an experiment and that we need different code or parameterization to run it to produce a model, say as part of a re-training pipeline.

As Iaroslav pointed out, it would be great if you could elaborate a bit on how HPO is part of your re-training pipeline. Do you select the hyperparameter configuration which is best on average and then retrain on the entire data? Do you use the best checkpoint found during HPO and deploy that? I don't think running multiple training jobs within a training script is uncommon; cross-validation would be one example.

Not sure this makes sense for every customer

I thought the exact opposite. But maybe I don't understand. In which cases would it make sense for customers to run just one trial per combination? When there is zero noise in the estimates?

Yes, in exactly that case. This happens when you seed your training script to ensure reproducibility. In my opinion, rerunning the same trials seems to be an extreme case. Any larger search space (e.g. with continuous hyperparameters) will not have this problem.

@geoalgo
Contributor

geoalgo commented Nov 7, 2022

I can certainly run the script multiple times or create an inner loop inside the training job (the HPO job being the outer loop), but I would break the general connection we maintain in SM Training, that one trial is one training job producing one set of metadata in SM Search, Experiments, etc.

I don't think running multiple training jobs within a training script is uncommon; cross-validation would be one example.

@wistuba: what @marianokamp is referring to is that in SageMaker it is most common to have a training job train just one configuration. What you say is true in general, but slightly less suited to SageMaker, as running multiple "things" in one SM training job can make the tracking of artifacts a bit harder (it does not mean you can't do it, just that it's not the general use case).

@marianokamp: the suggestion of @wistuba to add the seed seems to be a good one. If it is not practical for some reason, we could just add an option to allow running a configuration more than once in the Grid/Random search that you are currently using.

@marianokamp
Author

Thanks everybody. That's quite a lot, much appreciated. I would like to start with something general and hope this is not too detailed/boring. Also if you get the impression that I am trying to explain HPO to you, please disregard that. You know better than I do. I just want to give you the context of my thinking.

I see two ways to use HPO. One, maybe the most common, is to find a good set of hyperparameters in a potentially large search space and to either use the best performing model or use the hyperparameters of the best job (or better, a summarized version) as a template for future trainings.

But for this feature request I am thinking about the other way: using HPO for exploration, to learn about the impact of the individual hyperparameters and their values on the objective. This frequently is not a one-shot process but more of a dialog, where the ML practitioner learns about the hyperparameters and their impact, and what this means for their modeling decisions, and then refines the implementation or the evaluated hyperparameter ranges across tuning runs. With warm starts, not only does the ML practitioner use the knowledge from those previous runs in this multi-step optimization dialog, but so does AMT.

Let's continue this idea of using HPO as part of model authoring, not just the final optimization. Some, like Andrej Karpathy, refer to hyperparameters as design choices. For some algorithms, like XGB, this could mean deferring them from the model/algo author to the ML practitioner who uses the model. But for NNs it's likely the same person authoring and using the model, who still needs to understand the real-world impact of design choices for the ML task at hand.

The example I pasted above is one of those. We know that BERT has a Pooler on top of the network, but the results I showed above indicate that for that specific task (sentiment analysis on IMDB) the extra capacity and non-linearity actually produce worse results. Not important to our discussion: likely because it makes it too easy for the network to learn (weakly) in the classifier, instead of being forced to use the harder-to-learn, but more powerful, transformer blocks.

The experiment does a Random Search, approximating a Grid Search over two hyperparameters:

  • Use of the extra capacity/non-linearity (clf-scale = 1.0) or not (0.0), plus two additional data points to see if there is a trend (0.5, 2.0).
  • To see if the effect is more pronounced with more transformer blocks, we have the number of layers as a second hyperparameter.

Why use AMT? Because it already has the infrastructure to run these experiments at scale with minimal fuss, and as it is integrated with the rest of SM, it is easy to use the results. Also because I can define how much guidance I want to give during the exploration, as opposed to a completely stochastic random sweep.

Sidebar: I expect there to be more training runs as part of the experimentation during the model authoring phase than during the first years of model deployment, incl. re-training and re-tuning.

It is worth mentioning that the actual best model of the tuning job would not be used for deployment. In fact, the hyperparameter clf-scale can be removed moving forward, because for this task the semantics are unlikely to change. And as you know, we would want to minimize the number of hyperparameters (to evaluate at a time).
In this case I would not want to warm start another tuning job from here (with the clf-scale hyperparameter now fixed to 0.), but in other cases this could make sense.

At no point did I intend to put this specific model into production, though. That will come down the line, once I have tried other hyperparameters and design decisions and eventually re-train and re-optimize with the remaining set of hyperparameters.

@marianokamp
Author

Can I use a dummy hyperparameter? Yes, I can and I do. It's a bit ugly, as we need to extend the job's parameters (or turn off validation), and we then have a hyperparameter in our data that is not used.

But I can live with that.

I am wondering, though, what would be best for our customers. Allowing only one data point per combination feels limiting to me and does not take into account noise, which is inherent in ML.

Btw., fixing a seed helps reproducibility and may increase efficiency and - to my understanding - allow us to surface certain patterns earlier, but it does not reduce the inherent noise. It just fixes the noise to some starting points that may or may not be representative of how the model is used later. It would lead to less robust models. Happy to be educated and to learn otherwise.

@marianokamp
Author

Do you expect use cases where you need the mean / variance of the objective when some parameters are continuous, i.e. the search space is not finite?

@iaroslav-ai, with discrete values the search space is finite with respect to the domains of the hyperparameters. Agreed. But it does not follow that we only need one trial per data point to estimate its usefulness, unless it were deterministic.
This is especially true when the magnitude of the average noise is above the magnitude of the change we want to detect. If you look at the diagram I posted, it is clear that just using a linear classifier (clf-scale == 0.) is the best choice. But you can only see that because I show the distribution. And in the first panel, layers=1, it is only barely better. A single trial would almost certainly not show this at all.
Let me know if I did not understand your question correctly.

At least in AMT, we choose the best HP configuration as the one with the min / max observed objective. Would you choose the best HPs here based on the mean of the objectives for that configuration?

In the model authoring phase I don't care much about the best model, but about what I learn about the hyperparameter values and their impact.

But for completeness' sake, and to answer your question: I would still go with the HPs from the best job by objective metric.

Sidebar: I am less interested in just the HPs from the best job. This feels a bit reductive.
Instead, I would prefer a summary of the HPs including means, stddev (and importance/correlation to the objective metric). This summary would look not only at the best job, but at all jobs, or at least the top k% by performance, so that I can make an informed decision down the road.

For the same set of HPs, you will be getting multiple models. Within the set of models for the same HP configuration, how would you choose the one to deploy for inference?

Conceptually I would start where we are today. As we only run one trial per data point in the HP space, we effectively choose a model randomly; it just happens that we pick from a population of one. But it is still subject to noise. Makes sense? So we could choose one randomly, too.

A natural extension to this would be to pick the one with the best objective metric for this data point in the HP space. But as the performance difference inside this group is down to noise, this may give us a nominal improvement that is not representative of the model when it gets deployed with unseen data. This is the same issue as we have with the manual seed.

But(!) I think there is a bug in the questions. With multiple trials per data point we could use the mean objective metric for a data point in the HP space and select the best HPs based on that, now adjusted for noise. That is something we could not do before, when we would have picked the HPs from the job that performed best over all data points in the HP space, which could have been the wrong job because we picked an outlier far away from the mean. Makes sense?
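For illustration, a rough sketch of that selection on the results of a finished tuning job; the tuning job name is a placeholder, and the hyperparameter column names are the ones from my example:

from sagemaker.analytics import HyperparameterTuningJobAnalytics

# Per-training-job results of a finished tuning job as a pandas DataFrame.
df = HyperparameterTuningJobAnalytics("my-tuning-job").dataframe()

# Group repeated trials of the same configuration (ignoring the dummy) and
# aggregate the objective metric.
summary = (
    df.groupby(["layers", "clf-scale"])["FinalObjectiveValue"]
      .agg(["mean", "std", "count"])
      .sort_values("mean", ascending=False)
)
print(summary)

# Pick the configuration with the best mean, i.e. adjusted for noise, rather
# than the single best (possibly outlier) trial.
best_config = summary["mean"].idxmax()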

@marianokamp
Author

A very simple solution to your problem is, as you pointed out, to extend your configuration space with a dummy variable. I would recommend not using a continuous one, since it does not allow you to control how often a configuration is repeated.
@wistuba, thanks for the suggested solution.

I use a continuous variable because I don't want to count combinations. Instead I tell AMT I want n runs, and as long as n is bigger than the number of combinations of discrete values, I am done.

Admittedly, this might be a bit hacky but if you use argparse, you could also ignore unknown args if you are not willing to add repetition_id as a parameter to your script.

I use argparse and I added the dummy parameter. But consider the case of a regulated customer, like a bank. They may have a rule in their reviews that training scripts that produce models for production can only have parameters that are used for this training. And, of course, with those customers you cannot turn off validation of parameters.

I don't think warmstarting is relevant for you given that you evaluate each configuration anyway.

Not sure what you mean here.
Warmstarting is nice to have, especially when phrasing HPO as a multi-step dialogue where the ML practitioner incrementally learns about their model and AMT should also learn along the way. If warmstarting cannot be used, it is not the end of the world. But I would still suggest a case to keep in the back of our minds:

Training a large model may take, say, 100 minutes on a large GPU. Running just 100 trials over, say, four steps (tuning jobs) of 25 trials can already cost quite some money. This makes it less interesting for customers to do this kind of experimentation, and as a consequence they may feel less confident about their model (and about models and AI in general). If, however, we can introduce warm starts into this dialog, we can cut down on cost massively (or increase the number of trials for the same budget).

As Iaroslav pointed out, it would be great if you could elaborate a bit on how HPO is part of your re-training pipeline. Do you select the hyperparameter configuration which is best on average and then retrain on the entire data? Do you use the best checkpoint found during HPO and deploy that? I don't think running multiple training jobs within a training script is uncommon; cross-validation would be one example.

I covered the background a bit more in my other posts, so I can be briefer here. The scenarios I have in mind do not lead directly to deployments, but to more changes in code/HPs, and only then eventually to deployment.

I never used CV. To the best of my knowledge, if you were to implement CV with a SM training job today, this would be part of the inner loop and you would still produce a single model for SM to upload to S3 and use downstream. What you do with the final metrics is left to the training job author.

Again, I am happy with what AMT is providing today and I am grateful for it. But it could be better, especially for the exploration during model authoring case.

@marianokamp
Author

[..] in SageMaker it is most common to have a training job train just one configuration. What you say is true in general, but slightly less suited to SageMaker, as running multiple "things" in one SM training job can make the tracking of artifacts a bit harder [..]

^ That's the important bit. But may I also refer to low coupling/high cohesion?

The training job does not know that it is the subject of a hyperparameter tuning job. It has neither parameters to that effect, nor extra code or libraries. It may early-stop its inner loop, but it knows nothing about the outer loop.

The outer loop, the tuning job, however, knows nothing about the training job except for hyperparameters and ingesting/reacting to objective metrics emitted by the individual training jobs.
That's an awesome design, I think.

I can live with a dummy variable, and I already do so, even though it means extra parameter passing and confusing a warm start. But better support for the exploration case could be beneficial to our customers, and that was why Valerio asked me to file the ticket after a discussion we had on Friday.

@marianokamp
Author

The suggestion of @wistuba to add the seed seems to be a good one. If it is not practical for some reason, we could just add an option to allow running a configuration more than once in the Grid/Random search that you are currently using.

Thanks @geoalgo. I am unsure about the seed. But yes, sure, running the tuning job multiple times would work. It feels less intuitive to me to provide the number of repetitions, instead of the maximum number of trials/samples, though.

Providing the number of repetitions would, however, give more clarity about what really happens. In my case, though, I was not asking just for raising the number of trials, but for decoupling the number of trials from the number of combinations.

Hence, while in your case you would get a uniform number of samples per data point, in my case it would be more random. The latter could also mean that we do not sample from some combinations at all. But I see this as a potential benefit. Think about batch size as an int in 64 - 1024. Would this today translate to (1024 - 64) discrete HP values? Then not sampling all values could be the right thing to do.

@mseeger
Collaborator

mseeger commented Nov 14, 2022

Hello @marianokamp,

the fact that each configuration is evaluated at most once is pretty strongly baked into AMT and Syne Tune, because it is typically best practice.

For your use case, I'd like to bring up an interesting alternative: the number of repeated evaluations of one config could be a fidelity parameter. The metric you emit at fidelity r is the average of the r values.

This will concentrate repeated evaluations on configurations whose metric values look most promising, and where reducing variance brings the most gains.

If you would like to explore this more, get in touch with me. I have some write-up; there is even a nice GP covariance kernel for this. The idea has been lying around for some time.
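To sketch the training-script side of this idea with Syne Tune's Reporter (train_and_evaluate is a hypothetical stand-in, and the attribute names are up to you):

from syne_tune import Reporter

report = Reporter()

def train_and_evaluate(config, seed):
    # Hypothetical stand-in: train one model for this config with the given
    # seed and return its validation metric.
    return 0.0

def objective(config, max_repetitions=10):
    total = 0.0
    for r in range(1, max_repetitions + 1):
        total += train_and_evaluate(config, seed=r)
        # Fidelity r: report the average over the first r repetitions, so a
        # multi-fidelity scheduler (e.g. ASHA with resource_attr="repetition")
        # can stop unpromising configurations early.
        report(repetition=r, mean_objective=total / r)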

Lastly, I'd not use HPO for exploring a space. Except for random search, all other methods are explicitly biased to exploring more where the optimum is likely located.

@mseeger
Collaborator

mseeger commented Nov 17, 2022

@marianokamp Is there any further action requested on this issue? Thanks

@marianokamp
Author

@mseeger, I need more time in between answers. There is no clock or dependency on your side, or is there?

@mseeger
Collaborator

mseeger commented Nov 18, 2022

No problem. It is just that we try to work down the open issues.

@marianokamp
Author

marianokamp commented Nov 21, 2022

Thanks @mseeger. Before going into the alternative approach, I am still trying to get my head around some of the things you mentioned. In particular, the first sentence brings up two major points that I would like to quickly follow up on for my understanding.

[..] each configuration is evaluated at most once, [..], because it is typically best practice.

Is there a quick way to describe why this is a good practice?
Would a slightly relaxed approximation be the following? We aim for uniform coverage of each combination. We do not want to miss combinations because of their potential improvement (and in the discrete case the neighboring values may not give us information about not-yet-sampled values). We don't need to sample them more than once if the measurements are deterministic. If the latter holds, then a minimum of sampling once and a maximum of sampling once are sufficient, which - of course - is the same as evaluating each configuration exactly once.

the fact that each configuration is evaluated at most once is pretty strongly baked into AMT and Syne Tune

Can you please help me understand this better? In the example I posted above I have two hyperparameters with enumerable values and that brings me to 12 (3x4 values) combinations. When I try to start a job with 20 trials, I get the error message that 20 exceeds 12. If I then add a dummy continuous hyperparameter I can launch the tuning job with 20 trials.
In that case how do you steer the sampling from the space of the 12 combinations? The expectation to not sample more than once from the 12 combinations can no longer be satisfied, or can it? I must oversample some of the combinations of the values from the enumerable hyperparameters to be able to get to 20 trials.
Or am I missing something?

@mseeger
Collaborator

mseeger commented Nov 22, 2022

The default in Bayesian optimization is to have a pretty large, often infinite, configuration space, which you want to sample efficiently. A key assumption is that the unknown function (before independent noise) is smooth w.r.t. numerical parameters. Given this assumption, it becomes inefficient to sample the same place more than once: you can gain more information by sampling nearby.

Once you stop checking for duplicates, you open the door to all sorts of bugs. A favourite of mine is when SDEs just fix the random seed, so the same config is chosen over and over.

@mseeger
Collaborator

mseeger commented Nov 22, 2022

If your criterion is overly noisy, it is better to change the criterion. You can, for example, go for a cross-validation score. If this is too expensive (each evaluation is K times as expensive as before), the idea I mentioned above becomes attractive.
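For example, a generic K-fold sketch; build_model, X and y are placeholders for your model factory and data, and the mean over folds would be the reported, less noisy objective:

import numpy as np
from sklearn.model_selection import KFold

def cross_validated_score(build_model, X, y, k=5):
    # Each evaluation is K times as expensive as a single training run.
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = build_model()  # placeholder model factory
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))
    return float(np.mean(scores))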

@mseeger
Collaborator

mseeger commented Nov 22, 2022

If you extend the configuration space by a dummy variable, the constraint is that you cannot have duplicates in that new space. The algorithm cannot know that one of your variables is a dummy. Of course, two configs can be different, but the same in a subset of attributes.

@mseeger
Collaborator

mseeger commented Nov 22, 2022

If you really need duplicates, the suggestion of adding a dummy (ideally a continuous one, so you never get duplicates again) is a good first step. The idea I mention above would be more advanced, but might also work better.

@mseeger
Collaborator

mseeger commented Dec 15, 2022

OK, we put this item on our backlog. We can allow for duplicates, as an option, but the default will remain to not allow for duplicates, first because it is the right thing to do for HPO, and second because it catches mistakes which are otherwise hard to catch.

@marianokamp
Author

Thanks @mseeger, all. That sounds great.

I also appreciate the time you and your colleagues spent discussing the issue. With this better understanding, I will now use the proposed approach: oversampling in multiples of the number of combinations and using a discrete dummy variable, so that it works best with the duplicate detection.

@mseeger
Collaborator

mseeger commented Jan 4, 2023

Hello, PR #487 introduces a flag allow_duplicates; if set to True, schedulers may return the same config several times.
This settles this issue for Syne Tune.
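A minimal sketch of how this can be used, assuming the flag is forwarded to the searcher via search_options; the config space, metric, and entry point are placeholders:

from syne_tune import Tuner, StoppingCriterion
from syne_tune.backend import LocalBackend
from syne_tune.config_space import choice
from syne_tune.optimizer.schedulers import FIFOScheduler

config_space = {
    "layers": choice([1, 4, 8]),
    "clf_scale": choice([0.0, 0.5, 1.0, 2.0]),
}

scheduler = FIFOScheduler(
    config_space,
    searcher="random",
    metric="val_acc",
    mode="max",
    # Assumption: allow_duplicates is forwarded to the searcher, so random
    # search may propose the same of the 12 configurations more than once.
    search_options={"allow_duplicates": True},
)

tuner = Tuner(
    trial_backend=LocalBackend(entry_point="train_script.py"),  # placeholder script
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_num_trials_started=100),
    n_workers=4,
)
# tuner.run()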

@mseeger mseeger closed this as completed Jan 4, 2023