Allow for independence between number of trials and number of combinations. #415
Comments
Not sure this makes sense for every customer, given that some might seed their training scripts. Besides that, your training job itself might just run multiple repetitions and report the mean and standard deviation. This seems to be what you are interested in anyway.
Not sure I can follow here.
I can certainly run the scripts multiple times or create an inner loop within the inner loop (the HPO job being the outer loop), but that would break the general connection we maintain in SM Training: that one trial is one training job, producing one set of metadata in SM Search, Experiments, etc. It would also mean that the training script now needs to know that it is part of an experiment, and that we need different code or parameterization to run it to produce a model, say, as part of a re-training pipeline.
I thought the exact opposite. But maybe I don't understand. In which cases would it make sense for customers to run just one trial per combination? When there is zero noise in the estimates?
Hey @marianokamp, a few questions:
A very simple solution to your problem is, as you pointed out, to extend your configuration space with a dummy variable. I would recommend not using a continuous one, since it does not allow you to control how often a configuration is repeated. My suggestion:
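The concrete suggestion was truncated here; as an illustration (not necessarily the commenter's exact proposal), a discrete repetition index gives exact control over how often each configuration is run, unlike a continuous dummy:

```python
from itertools import product

layers = [1, 4, 8]
scale_of_classifier = [0.0, 0.5, 1.0, 2.0]

# A discrete "repetition" index makes each (layers, scale) pair appear
# exactly n_repeats times as a distinct configuration, so the number of
# repetitions stays controllable -- unlike a continuous dummy variable.
n_repeats = 3
grid = list(product(layers, scale_of_classifier, range(n_repeats)))

print(len(grid))  # 3 * 4 * 3 = 36 distinct configurations
```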
Admittedly, this might be a bit hacky, but if you use argparse, you could also ignore unknown args if you are not willing to add
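A minimal sketch of that argparse approach, assuming the tuner injects a `--dummy` hyperparameter the script never declared:

```python
import argparse

# parse_known_args() tolerates hyperparameters the script does not declare,
# so a tuner-injected "--dummy" requires no change to the training script.
parser = argparse.ArgumentParser()
parser.add_argument("--layers", type=int, default=4)
parser.add_argument("--scale-of-classifier", type=float, default=1.0)

# Simulated command line; in SageMaker these come from the hyperparameters.
args, unknown = parser.parse_known_args(["--layers", "8", "--dummy", "0.37"])
print(args.layers, unknown)  # 8 ['--dummy', '0.37']
```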
As Iaroslav pointed out, it would be great if you could elaborate a bit on how HPO is part of your re-training pipeline. Do you select the hyperparameter configuration that is best on average and then retrain on the entire data? Do you use the best checkpoint found during HPO and deploy that? I don't think running multiple training jobs within a training script is uncommon; cross-validation would be one example.
Yes, in exactly that case. This happens when you seed your training script to ensure reproducibility. In my opinion, rerunning the same trials seems to be an extreme case. Any larger search space (e.g. continuous hyperparameters) will not have this problem.
@wistuba: what @marianokamp is referring to is that in SageMaker it is most common to have a training job train just one configuration. What you say is true in general, but slightly less suited to SageMaker, as running multiple "things" in one SM training job can make the tracking of artifacts a bit harder (it does not mean you can't do it, just that it's not the general use case). @marianokamp: the suggestion of @wistuba to add the seed seems to be a good one. If it is not practical for some reason, we could just add an option to allow running a configuration more than once in the Grid/Random search that you are currently using.
Thanks everybody. That's quite a lot, much appreciated. I would like to start with something general and hope this is not too detailed/boring. Also, if you get the impression that I am trying to explain HPO to you, please disregard that. You know better than I do. I just want to give you the context of my thinking.

I see two ways to use HPO. One, maybe the most common, is to find a good set of hyperparameters in a potentially large search space and to either use the best performing model or use the hyperparameters of the best job (or better, a summarized version) as a template for future trainings. But for this feature request I am thinking about the other way: to use HPO for exploration, to learn about the impact of the individual hyperparameters and their values on the objective. This is frequently not a one-shot process, but more like a dialog, where the ML practitioner learns about the hyperparameters and their impact, and what this means for their modeling decisions, and then refines the implementation or the evaluated hyperparameter ranges across tuning runs; then also using warm starts, so that not only the ML practitioner uses the knowledge from those previous runs in the multi-step optimization dialog, but so does AMT.

Let's continue this idea: to use HPO as part of model authoring, not just the final optimization. Some, like Andrej Karpathy, refer to hyperparameters as design choices. For some models, like XGB, that could mean deferring them from the model/algo author to the ML practitioner who uses the model. But for NNs it's likely the same person using and authoring the model, who still needs to understand the real-world impact of design choices for the ML task at hand. The example I pasted above is one of those. We know that BERT has a Pooler on top of the network, but the results I showed above indicate that for that specific task (sentiment analysis on IMDB) the extra capacity and non-linearity actually produce worse results.
Not important to our discussion: likely because it makes it too easy for the network to learn (weakly) in the classifier, instead of being forced to use the harder-to-learn, but more powerful, transformer blocks. The experiment does a Random Search, approximating a Grid Search over two hyperparameters:
Why use AMT? Because it already has the infrastructure to run these experiments at scale with minimum fuss, and as it is integrated with the rest of SM, it is easy to use the results. Also, because I can define how much guidance I want to give during the exploration, as opposed to a completely stochastic random sweep.
It is worth mentioning that the actual best model of the tuning job would not be used for deployment. At no point did I intend to put this specific model into production. That will come down the line, when I have tried other hyperparameters and design decisions and eventually re-train and re-optimize with the remaining set of hyperparameters.
Can I use a dummy hyperparameter? Yes, I can, and I do. It's a bit ugly, as we need to extend the job's parameters (or turn off validation), and we then have a hyperparameter in our data that is not used. But I can live with that. I am wondering, though, what would be best for our customers? Allowing only one data point per combination feels limiting to me and does not take into account noise, which is inherent in ML. By the way, fixing a seed helps reproducibility and may increase efficiency and, to my understanding, allow us to surface certain patterns earlier, but it does not reduce the inherent noise. It just fixes the noise to some starting points, which may or may not be representative of how the model is used later. It would lead to less robust models. Happy to be educated and to learn otherwise.
@iaroslav-ai, with discrete values the search space is finite with respect to the domains of the hyperparameters. Agreed. But it does not follow that we only need one trial per data point to estimate its usefulness, unless it were deterministic.
In the model authoring phase I don't care much about the best model, but about what I learn about the hyperparameter values and their impact. But for completeness' sake, and to answer your question: I would still go with the HPs from the best job by objective metric.
Conceptually, I would start where we are today. As we only run one trial for each data point in the HP space, we effectively choose a model randomly; it just happens we picked from a population of one. But it is still subject to noise. Makes sense? So we could choose one randomly, too. A natural extension would be to pick the one with the best objective metric for this data point in the HP space. But as the performance difference within this group is down to noise, this may give us a nominal improvement that is not representative of the model when it gets deployed on unseen data. This is the same issue as we have with the manual seed.

But(!) I think there is a flaw in the question. With multiple trials per data point we could use the mean objective metric for a data point in the HP space and select the right HPs based on that, now adjusted for noise. Which we did not do before, where we would have picked the HPs from the job that performed best over all data points in the HP space, which could have been the wrong job because we picked an outlier, far away from the mean. Makes sense?
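The selection rule described here can be sketched with made-up numbers (the configs and metrics below are illustrative, not from the experiment):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tuning results: (layers, scale) -> objective metric, with
# several noisy repetitions per point in the hyperparameter space.
trials = [
    ((1, 0.0), 0.89), ((1, 0.0), 0.84), ((1, 0.0), 0.86),
    ((4, 1.0), 0.91), ((4, 1.0), 0.80), ((4, 1.0), 0.82),
]

by_config = defaultdict(list)
for config, metric in trials:
    by_config[config].append(metric)

# Best single trial picks the outlier (4, 1.0) at 0.91 ...
best_single = max(trials, key=lambda t: t[1])[0]
# ... while the noise-adjusted mean prefers (1, 0.0) at ~0.863 vs ~0.843.
best_mean = max(by_config, key=lambda c: mean(by_config[c]))
print(best_single, best_mean)  # (4, 1.0) (1, 0.0)
```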
I use a continuous variable because I don't want to count combinations. Instead, I tell AMT I want n runs, and as long as n is bigger than the number of combinations of discrete values, I am done.
I use argparse, and I added the dummy parameter. But consider the case of a regulated customer, like a bank. They may have a rule in their reviews that training scripts producing models for production can only have parameters that are actually used in training. And, of course, with those customers you cannot turn off parameter validation.
Not sure what you mean here. Training a large model may take, say, 100 minutes on a large GPU. Running just 100 trials over, say, four steps (tuning jobs) of 25 trials each can already cost quite some money. This makes this kind of experimentation less attractive for customers, and as a consequence they may feel less confident about their model (and models and AI in general). If, however, we can introduce warm starts into this dialog, we can cut down on cost massively (or increase the number of trials for the same budget).
I answered in my other posts a bit more about the background, so I can be briefer here. The scenarios I have in mind do not directly lead to deployments, but to more changes in code/HPs, then eventually deployment. I never used CV. To the best of my knowledge, if you were to implement CV with an SM training job today, it would be part of the inner loop, and you would still produce a single model for SM to upload to S3 and use downstream. What you do with the final metrics is left to the training job author. Again, I am happy with what AMT provides today and I am grateful for it. But it could be better, especially for the exploration-during-model-authoring case.
^ That's the important bit. But may I also refer to low coupling/high cohesion? The training job does not know that it is the subject of a hyperparameter tuning job. It neither has parameters to that effect, nor extra code or libraries. It may early-stop its inner loop, but knows nothing about the outer loop. The outer loop, the tuning job, knows nothing about the training job except for the hyperparameters and ingesting/reacting to objective metrics emitted by the individual training jobs. I can live with a dummy variable, and I already do so, even though it means extra parameter passing and confuses warm starts. But better support for the exploration case could be beneficial to our customers. And that is why Valerio asked me to file the ticket after a discussion we had on Friday.
Thanks @geoalgo. I am unsure about the seed. But yes, sure, running the tuning job multiple times would work. It feels less intuitive to me to provide the number of repetitions instead of the maximum number of trials/samples, though. Providing the number of repetitions would, however, give more clarity about what really happens. In my case I was not asking just for raising the number of trials, but also for a disconnect between the number of trials and the number of combinations. Hence, while in your case you would get a uniform number of samples per data point, in my case it would be more random. The latter could also mean that some combinations are not sampled at all. But I see this as a potential benefit. Think about batch size, an int from 64 to 1024. Would this today translate to (1024 - 64) discrete HP values? Then not sampling all values could be the right thing to do.
Hello @marianokamp, the fact that each configuration is evaluated at most once is pretty strongly baked into AMT and Syne Tune, because it is typically best practice. For the use case of yours, I'd like to bring up an interesting alternative. The number of repeated evaluations of one config could be a fidelity parameter. The metric you emit at fidelity r is the average of the r values. This will concentrate repeated evaluations on configurations whose metric values look most promising, and where reducing variance brings the most gains. If you'd like to explore this more, get in touch with me. I have some write-up; there is even a nice GP covariance kernel for this. This idea has been lying around for some time. Lastly, I'd not use HPO for exploring a space. Except for random search, all other methods are explicitly biased towards exploring more where the optimum is likely located.
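This is not an existing Syne Tune API; `repeated_eval_fidelity` and `train_once` are invented names. A schematic of the repeated-evaluations-as-fidelity idea might look like:

```python
import random

def repeated_eval_fidelity(train_once, config, max_repeats):
    """Treat the repetition count as a fidelity: at fidelity r, report the
    running mean of the first r (noisy) evaluations. A multi-fidelity
    scheduler could then stop unpromising configs at low r."""
    total = 0.0
    for r in range(1, max_repeats + 1):
        total += train_once(config, seed=r)
        yield r, total / r

# Toy noisy objective, standing in for a real training job.
def train_once(config, seed):
    rng = random.Random(seed + 100 * config["layers"])
    return config["layers"] * 0.1 + rng.gauss(0, 0.05)

results = list(repeated_eval_fidelity(train_once, {"layers": 4}, 3))
for r, avg in results:
    print(r, round(avg, 3))
```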
@marianokamp Are there any further actions requested on this issue? Thanks
@mseeger, I need more time in between answers. There is no clock or dependency on your side, or is there? |
No problem. It's just that we try to work down the open issues.
Thanks @mseeger. Before going into the alternative approach, I am still trying to get my head around some of the things you mentioned. In particular the first sentence brings up two major points, that I would like to quickly follow up for my understanding.
Is there a quick way to describe why this is a good practice?
Can you please help me understand this better? In the example I posted above I have two hyperparameters with enumerable values and that brings me to 12 (3x4 values) combinations. When I try to start a job with 20 trials, I get the error message that 20 exceeds 12. If I then add a dummy continuous hyperparameter I can launch the tuning job with 20 trials.
The default in Bayesian optimization is to have a pretty large, often infinite, configuration space, which you want to sample efficiently. A key assumption is that the unknown function (before independent noise) is smooth w.r.t. numerical parameters. Given this assumption, it becomes inefficient to sample the same place more than once: you can gain more information by sampling nearby. Once you stop checking for duplicates, you open the door to all sorts of bugs. A favourite of mine is that SDEs just fix the random seed, so the same config is chosen over and over.
If your criterion is overly noisy, it is better to change the criterion. You can for example go for a cross-validation score. If this is too expensive (each evaluation is K times as expensive as before), the idea I mentioned above becomes attractive. |
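A sketch of averaging the criterion over K folds, in plain Python with no particular framework; as the comment notes, each evaluation becomes K times as expensive:

```python
from statistics import mean, stdev

def kfold_score(data, train_and_eval, k=5):
    """Average a noisy evaluation criterion over K folds instead of
    relying on a single train/validation split."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_eval(train, val))
    return mean(scores), stdev(scores)

# Dummy criterion that just reports the training set size, to show the
# mechanics: 10 items, 5 folds -> each training set has 8 items.
score_mean, score_sd = kfold_score(list(range(10)), lambda tr, va: len(tr))
print(score_mean, score_sd)  # 8 0.0
```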
If you extend the configuration space by a dummy variable, the constraint is that you cannot have duplicates in that new space. The algorithm cannot know that one of your variables is a dummy. Of course, two configs can be different, but the same in a subset of attributes.
If you really need duplicates, the suggestion of adding a dummy (ideally a continuous one, so you never get duplicates again) is a good first step. The idea I mention above would be more advanced, but might also work better.
OK, we put this item on our backlog. We can allow duplicates as an option, but the default will remain to not allow duplicates: first, because it is the right thing to do for HPO, and second, because it catches mistakes that are otherwise hard to catch.
Thanks @mseeger, all. That sounds great. I also appreciate the time you and your colleagues spent discussing the issue. Given the better understanding I will now use the proposed approach to do the oversampling in multiples of the combinations and use a discrete dummy variable, so that it works best with the duplicate detection. |
Hello, PR #487 introduces a flag |
TL;DR: Please allow to run experiments with more (or less) trials than the number of combinations.
Right now when running a hyperparameter tuning job with AMT I get an error message when the number of trials exceeds the number of enumerable combinations; here being the discrete combinations of integer and categorical parameter values.
But this is limiting, as the exploration of the search space is noisy, and more than one trial may be needed to understand the variance and to establish a stable mean.
As a workaround I am starting tuning jobs with the Random strategy with an additional continuous hyperparameter, "dummy". This way I can specify the number of trials I need. But this makes it harder to use this data for future warm starts to narrow down scenarios. Further, it forces me to allow the "dummy" parameter in my training script.
Example:
I want to know if adding a non-linearity and additional capacity (Pooler) on top of a BERT-like model will yield better performance as a result, or if the extra capacity will make the model lazier and not use the transformer blocks below.
I also want to see if this assessment changes when adding more transformer layers.
So I have two categorical variables. Layers: [1, 4, 8] and scale-of-classifier: [0, 0.5, 1.0, 2.0].
These are just 3*4 combinations. But given the noisy nature of NN training, a single data point per combination has next to no meaning. To produce the understanding below, I used about 100 data points with the workaround from above.
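Sizing such a job in whole multiples of the grid (the 12 combinations from this example; the variable names are illustrative) can be sketched as:

```python
import math

combinations = 3 * 4   # layers x scale-of-classifier
target_trials = 100    # desired number of noisy data points

# Round up to a whole number of repetitions per combination, so every
# point in the HP space gets the same number of samples.
repeats = math.ceil(target_trials / combinations)
max_jobs = repeats * combinations
print(repeats, max_jobs)  # 9 108
```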
If I could just specify the categorical parameters and the number of trials (for GridSearch/Random), my appreciation will follow you until the end of your hopefully long and fulfilling life.