
[Question] GPEI when function values are very non-Gaussian, with extreme values #1946

Open
andrejmuhic opened this issue Nov 1, 2023 · 10 comments

@andrejmuhic

Hi,
First, thank you for doing the hard work of maintaining this very helpful library.
I have one question and would be glad if someone could help me.

We are using standard GPEI for max_x f(x) with bounds on x, something like:

        from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
        from ax.modelbridge.registry import Models

        gs = GenerationStrategy(steps=[
            GenerationStep(Models.SOBOL, num_trials=max(5, int(num_concurrent * 1.5))),
            GenerationStep(Models.GPEI, num_trials=-1)])

The negative function values can be up to 50x larger in absolute value than the positive ones.
So the distribution of outcome values f(x) is very fat-tailed and skewed.
Our optimization seems to get stuck for a long time because of this, as it appears to hurt the out-of-sample prediction quality of the process.

What is the recommended approach to this?
It is not obvious whether a different choice of kernel would help here; the Matern kernel already seems pretty general.
We tried truncating the negative values and restarting the optimization with a few more Sobol steps, and that seems to help.

But it could be that we are missing something obvious.
Would adding an outcome constraint such as f(x) > bound be the more proper thing to do? The docs at https://botorch.org/docs/constraints seem to indicate so, and to my understanding this approach is already supported in Ax, since qNoisyExpectedImprovement is what is actually used. I am not sure how outcome constraints are handled in detail; I only skimmed the ideas, but it seems a promising approach.
So is this the way to go, and is it more sample efficient than truncating outcome values?

Thank you for your help.

@Balandat
Contributor

Balandat commented Nov 1, 2023

Hi @andrejmuhic, my first instinct would be to do some kind of log transform of the observations to squish the distribution together and render the problem easier to model (note that this wouldn't change the location of the optimum due to the monotonicity of the transform).

The issue is of course that you have negative values - do you have some kind of global lower bound L on the function values f that you know a priori? Then you could just transform your values y as y_tf = log(y - L) and try optimizing this.

If you don't know this global value a priori, then one could in principle try to estimate this in each iteration from the model (e.g. as the minimum observed value minus some interquantile range to avoid squishing things too much near zero), but afaik we don't have this hooked up in Ax at this point - so if you can have some reasonable guess of a global L (doesn't have to be perfect) that would be the easiest approach.
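For concreteness, a minimal sketch of such a shifted log transform might look like this (the value of L and the small epsilon below are placeholders you would choose for your problem; the epsilon only guards against an observation hitting the bound exactly):

    import numpy as np

    L = -500.0   # placeholder: assumed a-priori global lower bound on f(x)
    EPS = 1e-6   # guards against log(0) if an observation equals the bound exactly

    def to_model_space(y: float) -> float:
        # Monotone transform: compresses the heavy negative tail without moving the argmax.
        return float(np.log(y - L + EPS))

    def from_model_space(y_tf: float) -> float:
        # Exact inverse, for mapping predictions back to the original scale.
        return float(np.exp(y_tf) + L - EPS)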

@lena-kashtelyan added the question (Further information is requested) label Nov 1, 2023
@andrejmuhic
Author

Hi, thank you for your advice.
We will try it out and I will report back on how it goes.
We can probably come up with a global bound.
So after shifting, I guess any of the transforms here would work:
https://petersonr.github.io/bestNormalize/
I have done similar things in the past to normalize data.

@Balandat
Contributor

Balandat commented Nov 1, 2023

So after shifting, I guess any of the transforms here would work: https://petersonr.github.io/bestNormalize

In general, yes, though you have to be careful to make sure that the transform is consistent for all the data. Like, you can't just find the "best normalizing transform" for data of trials 1-10, pass that transformed data to Ax, and then compute another "best normalizing transform" based on data of trials 1-20 but only pass the transformed outcomes of trials 11-20 to Ax.

The transformation has to be applied consistently across all data (and care has to be taken if there are things like outcome constraints or relative effects). FWIW, we do some data transformation automatically in Ax (such as standardizing or winsorizing), but we don't automatically apply log transforms or other kinds of transforms (this is something we could do more of).

If the form of your data transform doesn't change, you'll be safe. One option would be to collect a bunch of observations (or use those that you already have), compute the "best normalizing transform" based on those, create a new Ax experiment where you add these results as manual arms, and then use the same transform for all subsequent results obtained on any new trials suggested by Ax.
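Roughly, that last option could look like the following (not Ax code; sklearn's PowerTransformer is just one stand-in for a "best normalizing transform", and the shift and example values are placeholders):

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    SHIFT = 500.0  # placeholder global offset so that y + SHIFT > 0 for all observations

    # Fit the transform ONCE on an initial batch of observations...
    y_init = np.array([-450.0, -30.0, 2.0, 7.5, 9.1]).reshape(-1, 1)  # illustrative values
    pt = PowerTransformer(method="box-cox").fit(y_init + SHIFT)

    # ...then reuse exactly the same fitted transform for every subsequent trial's
    # outcome, so all data passed to Ax stays on one consistent scale.
    def transform_outcome(y_new: float) -> float:
        return float(pt.transform(np.array([[y_new + SHIFT]]))[0, 0])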

@andrejmuhic
Author

andrejmuhic commented Nov 1, 2023

Hi, @Balandat

In general, yes, though you have to be careful to make sure that the transform is consistent for all the data. Like, you can't just find the "best normalizing transform" for data of trials 1-10, pass that transformed data to Ax, and then compute another "best normalizing transform" based on data of trials 1-20 but only pass the transformed outcomes of trials 11-20 to Ax.

Actually, the package that I pasted finds the best transformation using cross-validation, which is kind of the right thing to do statistically.

The transformation has to be applied consistently across all data (and care has to be taken if there are things like outcome constraints or relative effects). FWIW, we do some data transformation automatically in Ax (such as standardizing or winsorizing), but we don't automatically apply log transforms or other kinds of transforms (this is something we could do more of).

Yes, I am aware that the same transformation needs to be applied to all data.
I have actually carefully inspected the source code and the documentation; I do not like running things I do not understand at least to some level. For example, I carefully studied what is done automatically, as I was initially surprised that some recommended normalizations had no effect.

    "GPEI": ModelSetup(
        bridge_class=TorchModelBridge,
        model_class=BotorchModel,
        transforms=Cont_X_trans + Y_trans,
        standard_bridge_kwargs=STANDARD_TORCH_BRIDGE_KWARGS,
    ),
    "Sobol": ModelSetup(
        bridge_class=RandomModelBridge,
        model_class=SobolGenerator,
        transforms=Cont_X_trans,
    ),

If the form of your data transform doesn't change, you'll be safe. One option would be to collect a bunch of observations (or use those that you already have), compute the "best normalizing transform" based on those, create a new Ax experiment where you add these results as manual arms, and then use the same transform for all subsequent results obtained on any new trials suggested by Ax.

Actually, I had already coded this before I saw your post; I am just waiting until I have enough samples. We are already using manual arms extensively.

Moreover, I am adding diagnostics: quality of fit, variable importance, etc.
Things described here:
https://ax.dev/docs/models.html and here https://ax.dev/versions/0.1.5/tutorials/visualizations.html
I will report back on how it goes, including the transformations.

I am actually glad that I asked this question, as I had started going into too much detail (like reading the BoTorch source of qNoisyExpectedImprovement and the paper to see whether it would make sense to change it) before having a proper basic analysis in place. I can get very excited.

Thank you for chatting with me.

@Balandat
Contributor

Balandat commented Nov 1, 2023

I can get very excited.

When it comes to Bayesian Optimization, we all do :)

@andrejmuhic
Author

andrejmuhic commented Nov 9, 2023

Hi @Balandat,

It took me some time.
I have experimented with things quite a bit. In the end I did not like that R package that much; the quantile transformation it chooses does not generalize to extreme values, of course.
So actually the most benefit according to:

from ax.modelbridge.cross_validation import compute_diagnostics, cross_validate

cv = cross_validate(model)
diagnostics = compute_diagnostics(cv)

and also according to a final test on validation data,
is obtained by winsorization as in Winsorize, with some additional benefit from PowerTransformY.
I had to hack up my own transformations so I could apply them properly out of sample.
I have a question here: how do StandardizeY etc. work? Maybe I am wrong, but it seems to me that the transformation is refitted every time new data comes in; is this correct? I guess this is fine since the transformations are invertible, but it means the kernel matrix is recomputed on every update.

I am not sure how cross_validate works; I tried reading the source code, but couldn't this cause data leakage in some sense if the transform is not refitted for every fold?
OK, I found it just now: the default is refit_on_cv=False.

I would just like to make sure that I am doing the right thing. The easiest option for me would be to apply Winsorize (which obviously cannot be inverted) and PowerTransformY, which are already available; for this I need to pass the configs for these transformations, and maybe there is an example of how to do this with ModelSetup. I have the same question again: it seems the transformations are refitted when new data comes in (refit_on_update=True). Is this technically correct, or should I keep using my own as I am doing now?
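Something like the following is what I have in mind, i.e. passing the transforms via the GenerationStep's model_kwargs rather than touching ModelSetup (untested sketch; I am not sure which transform_configs keys Winsorize accepts in the current version, so I left those out):

    from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
    from ax.modelbridge.registry import Cont_X_trans, Models
    from ax.modelbridge.transforms.power_transform_y import PowerTransformY
    from ax.modelbridge.transforms.winsorize import Winsorize

    gs = GenerationStrategy(steps=[
        GenerationStep(Models.SOBOL, num_trials=8),
        GenerationStep(
            Models.GPEI,
            num_trials=-1,
            # Override the default transform stack for this step only; per-transform
            # settings would additionally go into a "transform_configs" dict.
            model_kwargs={"transforms": [Winsorize] + Cont_X_trans + [PowerTransformY]},
        ),
    ])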

I also started playing with https://research.facebook.com/blog/2021/7/high-dimensional-bayesian-optimization-with-sparsity-inducing-priors/ and https://twitter.com/SebastianAment/status/1720440669628514763
The log acquisition functions seem to work better than before, but it did not make a huge difference. I still need to do a proper test with SAASBO.

Thank you for your help, keep up the good work, I enjoyed reading the papers.

@Balandat
Contributor

Balandat commented Nov 9, 2023

the transformation is refitted every time new data comes in; is this correct? I guess this is fine since the transformations are invertible, but it means the kernel matrix is recomputed on every update.

Yes, that is correct. The Ax paradigm generally is that Ax handles both transforming the data before passing it to the lower-level modeling & optimization layer and un-transforming the predictions returned by that layer. And yes, we usually construct a new model upon generating new points rather than updating an existing one (unless there is no new data, in which case we use the old one). I guess it would be possible to try to avoid that and cache some kernel matrices etc., but that would get quite messy really fast with all the complexities of transforms and so on. So our design choice here is to re-fit the transform every time there is new data, which is generally pretty low overhead relative to the actual GP fitting and acquisition function optimization.

I would just like to make sure that I am doing the right thing. The easiest option for me would be to apply Winsorize (which obviously cannot be inverted) and PowerTransformY, which are already available; for this I need to pass the configs for these transformations, and maybe there is an example of how to do this with ModelSetup. I have the same question again: it seems the transformations are refitted when new data comes in (refit_on_update=True). Is this technically correct, or should I keep using my own as I am doing now?

Yes, this should be fine; this setup gets you around having to shift the data to positive values outside of Ax with a global offset (though you could still do that in addition to the transforms proposed above if you remember to apply the shift at the end). Winsorization does work quite well in practice and we use it a lot ourselves, so this seems like a solid approach.

I also started playing with https://research.facebook.com/blog/2021/7/high-dimensional-bayesian-optimization-with-sparsity-inducing-priors/ and https://twitter.com/SebastianAment/status/1720440669628514763
The log acquisition functions seem to work better than before, but it did not make a huge difference. I still need to do a proper test with SAASBO.

Great, happy to see you test this out, let us know how this goes!

FWIW there is quite a bit of work on robust GP(-like) regression, e.g. with student-t processes or using ideas from generalized Bayesian inference. This is something that we hope to support in the future but don't have any concrete plans for at this point.

@andrejmuhic
Author

Hi @Balandat,
I did some more inspection; it seems like a FixedNoiseGP is used, so providing our approximate measurement noise (std) would also be a better option. I have also read a bit about Student-t processes, which is intriguing. I have done robust regression in the past by specifying only the noise part, even with Student-t errors in PyMC3. I guess that would be an easier first option here, without implementing the whole Student-t process machinery; sadly there is not enough time, and I do not see enough potential gain to justify even trying to implement it myself. It seems that I am approaching diminishing returns quickly here.
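If I understand it correctly, with the Service API this would just mean reporting each outcome together with its (approximate) standard error, something like the following sketch (the names here are placeholders from my own setup):

    # Sketch: report the outcome together with an approximate standard error so the
    # noise level is treated as observed rather than inferred by the GP.
    mean, sem = evaluate(params)  # placeholder: my evaluation returning the value and its std error
    ax_client.complete_trial(
        trial_index=trial_index,
        raw_data={"objective_name": (mean, sem)},  # "objective_name" is a placeholder metric name
    )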

One more important thing.
What happens if one runs concurrent trials and generates the next trial when there is no new data for the GP?
It is kind of hard to dig out of the internals whether a Sobol step is done in that case, or whether there is some kind of restricted sampling to prevent choosing the same candidate again in the acquisition part.
The message below makes sense, but it is hard to see what actually happens:
ax.modelbridge.torch: The observations are identical to the last set of observations used to fit the model. Skipping model fitting.

Thank you for your help. I am getting better and better insight.

@Balandat
Contributor

What happens if one runs concurrent trials and generates the next trial when there is no new data for the GP?

We take those trials into account as "pending points" when computing the acquisition value. Basically, we take the expectation of the acquisition value at some set of points X over the model posterior at the pending points X_pending. This typically ensures that we're not generating candidates identical to those already running.
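With the Developer API this looks roughly like the following (sketch; assuming the helper currently living in ax.modelbridge.modelbridge_utils):

    from ax.modelbridge.modelbridge_utils import get_pending_observation_features

    # Arms of RUNNING trials are collected as pending points; the acquisition value
    # is then integrated over the model's posterior at those points, so candidates
    # that are already in flight are not re-suggested.
    pending = get_pending_observation_features(experiment=experiment)
    generator_run = generation_strategy.gen(
        experiment=experiment,
        n=1,
        pending_observations=pending,
    )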

The message below makes sense, but it is hard to see what actually happens:

ax.modelbridge.torch: The observations are identical to the last set of observations used to fit the model. Skipping model fitting.

Yeah if you're working with pending points then this should be expected. I guess we could consider whether we'd want to reduce the logging of this in the context where this is expected - but it's probably not a bad diagnostic to have (cc @saitcakmak)

@andrejmuhic
Author

Hi, Balandat

What happens if one runs concurrent trials and generates the next trial when there is no new data for the GP?

We take those trials into account as "pending points" when computing the acquisition value. Basically, we take the expectation of the acquisition value at some set of points X over the model posterior at the pending points X_pending. This typically ensures that we're not generating candidates identical to those already running.

This makes perfect sense.
Yes, I am using parallel trials and providing pending_observations, either automatically or manually.
The message just confused me somewhat. Maybe instead of removing the message it would be better to make it more explicit, as I think you already suggested: if pending_observations are provided, this is OK and expected; on the other hand, if pending_observations are not provided and one generates manually, this will produce exactly the same candidate as before, which is almost never desired.
Like in https://ax.dev/tutorials/generation_strategy.html
