
learn prediction intervals (variance, noise) #1148

Closed
sashaostr opened this issue Jan 16, 2020 · 15 comments

@sashaostr

sashaostr commented Jan 16, 2020

The problem:
I'm looking to emit prediction intervals for each predicted value (the mean) in regression. I need these intervals to cover, say, 90% of the true values while being as narrow as possible. In other words, I want to learn and emit the variance (or noise) that can't be explained by the model's features in each region of the data: each sample would get different intervals, determined by its input vector.
For example:
Predicting income from the number of education years. Given that we have no additional data, I would expect lower variance of income for fewer education years and higher variance for more. Another example: predicting how many years a person has left to live from their age and health data. The young would have larger variance, the old smaller, and the old and unhealthy smaller still.

There are a few main methods I'm aware of to do it:

  • non-parametric methods, using quantile loss: to get 90% coverage we can train three models, Quantile:alpha=0.5 for the median and alpha=0.1 and 0.9 for the lower and upper bounds (quantiles) respectively.
  • parametric methods: assume some distribution for the noise and learn its parameters. For example, to get 90% coverage, assume a Normal distribution and train models for the mean and stdev, then calculate the interval from the stdev.
  • Bayesian methods: essentially parametric as well, but the modeling is done using probabilistic methods.
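
As a concrete illustration of the non-parametric route, the pinball (quantile) loss that such models optimize can be written in a few lines. This is a standalone sketch, not taken from any particular library; the function name is illustrative:

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Quantile (pinball) loss: penalizes under-prediction with weight
    alpha and over-prediction with weight (1 - alpha)."""
    diff = y_true - y_pred
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

# With alpha = 0.9, predicting below the true value is 9x more costly
# than predicting above it, which pushes the fit toward the 90th percentile.
y = np.array([10.0, 12.0, 14.0])
print(pinball_loss(y, y - 1.0, 0.9))  # under-prediction by 1 -> 0.9
print(pinball_loss(y, y + 1.0, 0.9))  # over-prediction by 1 -> 0.1
```

Training three models with alpha = 0.1, 0.5, 0.9 then gives the lower bound, the median, and the upper bound, respectively.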

The question:
As mentioned, I want the interval to be as narrow as possible while still satisfying the required coverage, which means that learning separate models for the quantiles (or for the parameters of a parametric method) wouldn't provide an optimal solution in terms of coverage and width.
I'm looking for a loss function that can optimize both things simultaneously. Is there something already built into some library, and if not, what would be the simplest way to implement it?
Both parametric and non-parametric methods are acceptable.
Thanks in advance!
Alexander

Example
The data in the example is simulated: two independent variables, stage (categorical) and age (x axis); the y axis is the predicted value. The bounds and mean were produced by three separate quantile models. Real data is much more complex, though, so the separate-models approach does not produce nice results.
[figure: simulated data with the median and lower/upper quantile bounds from the three models]
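
A minimal simulation in the spirit of this example (all numbers are made up for illustration) shows what "90% coverage" means empirically when the noise grows along the x axis, so a good interval must widen with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated heteroscedastic data: the noise scale grows with age,
# so the correct 90% interval widens along the x axis.
n = 10_000
age = rng.uniform(20, 70, n)
income = 1_000 * age + rng.normal(0, 50 * age, n)

# The true conditional 5% / 95% quantiles are known here (Normal noise,
# z_0.95 ~= 1.645), so we can check coverage directly.
lo = 1_000 * age - 1.645 * 50 * age
hi = 1_000 * age + 1.645 * 50 * age
coverage = np.mean((income >= lo) & (income <= hi))
print(round(coverage, 3))  # close to 0.90
```

On real data the quantiles are of course unknown, which is exactly the estimation problem posed above.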

@StatMixedML

StatMixedML commented Jan 19, 2020

Maybe that is of interest to you: I am currently working on an extension of CatBoost to probabilistic forecasting

https://github.com/StatMixedML/CatBoostLSS

It creates probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. It learns all parameters of a distribution and provides inference as well. It is still at an early stage of development, though. You may find additional information in the corresponding paper

https://128.84.21.199/abs/2001.02121

In case you are interested, I am also extending XGBoost to probabilistic forecasting. The repo is here

https://github.com/StatMixedML/XGBoostLSS

and paper here

https://arxiv.org/abs/1907.03178

@sashaostr
Author

> Maybe that is of interest to you: I am currently working on an extension of CatBoost to probabilistic forecasting
>
> https://github.com/StatMixedML/CatBoostLSS
>
> It creates probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. It learns all parameters of a distribution and provides inference as well. It is still at an early stage of development, though. You may find additional information in the corresponding paper
>
> https://128.84.21.199/abs/2001.02121
>
> In case you are interested, I am also extending XGBoost to probabilistic forecasting. The repo is here
>
> https://github.com/StatMixedML/XGBoostLSS
>
> and paper here
>
> https://arxiv.org/abs/1907.03178

Thanks a lot @StatMixedML, it looks very promising, I'll definitely examine it very closely!
One question: you call it probabilistic forecasting, but can it be seen as regression? Forecasting, to me, means predicting future values of some time-dependent variable. I'm looking to estimate the noise bounds (parametric or not) of a dependent variable based on a number of independent ones, and I expect them to differ across regions of the generative process. To the best of my understanding, the forecasting problem with its time variable is a specific case of what I described, but I just wanted to understand your view on it and the particular reason you call it forecasting rather than regression. In any case it's very useful for me. Thanks!!!

@StatMixedML

Not sure if I fully get your point, but what CatBoostLSS does is relate all parameters of a distribution to covariates. Say you specify a Normal distribution; then both the mean and sigma are estimated as functions of x, i.e., E(y|x) = f1(x) and Sigma(y|x) = f2(x). In addition you get, e.g., Partial Dependence Plots for both E(y|x) and Sigma(y|x), so that you better understand the influence x has on the mean and variance. Since you estimate all the parameters, you can sample observations at any desired point of the response distribution, i.e., the 5%, 50%, 95% quantiles. What you don't get is how x affects different parts of the conditional distribution; that is where you should use Quantile Regression / Expectile Regression instead.
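
The parametric setup described here boils down to minimizing a Gaussian negative log-likelihood in which both the mean and the standard deviation depend on x. A minimal, library-agnostic sketch of that objective and of turning its outputs into a 90% interval (the function name is illustrative):

```python
import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of N(mu, sigma^2). Minimizing it fits the
    mean and variance jointly: overly wide intervals (large sigma) are
    penalized by the log term, overly narrow ones by the squared error."""
    sigma2 = np.exp(2 * log_sigma)
    return np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2))

# Once mu(x) and sigma(x) are learned, a central 90% interval follows
# from the Normal quantile z_0.95 ~= 1.645:
mu, sigma = 50.0, 10.0
lo, hi = mu - 1.645 * sigma, mu + 1.645 * sigma
print(round(lo, 2), round(hi, 2))  # 33.55 66.45
```

This is the objective shape that distributional boosting methods (CatBoostLSS, NGBoost, and similar) optimize per-sample, with mu and log_sigma produced by the trees.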

@sashaostr
Author

> Not sure if I fully get your point, but what CatBoostLSS does is relate all parameters of a distribution to covariates. Say you specify a Normal distribution; then both the mean and sigma are estimated as functions of x, i.e., E(y|x) = f1(x) and Sigma(y|x) = f2(x). In addition you get, e.g., Partial Dependence Plots for both E(y|x) and Sigma(y|x), so that you better understand the influence x has on the mean and variance. Since you estimate all the parameters, you can sample observations at any desired point of the response distribution, i.e., the 5%, 50%, 95% quantiles. What you don't get is how x affects different parts of the conditional distribution; that is where you should use Quantile Regression / Expectile Regression instead.

Thanks for the explanation @StatMixedML. I got the first part (and I think it's what I'm looking for), but I'm not sure I got the second:

> What you don't get is how x affects different parts of the conditional distribution. That is where you should use Quantile Regression / Expectile Regression instead.

I'll try to understand it better from the paper. Thanks a lot!

@annaveronika
Contributor

Thank you very much for the issue and for your paper! We will implement one of the solutions in the library, because it is one of the most frequently requested features.

@StatMixedML

StatMixedML commented Jan 25, 2020

@annaveronika: Great to hear that CatBoostLSS is getting support from the CatBoost team! Let's get in contact to collaborate on this!

@annaveronika
Contributor

annaveronika commented Jan 27, 2020

Just to be clear: we are not committing to supporting CatBoostLSS. We are planning to implement one of the solutions, not necessarily CatBoostLSS, based on some experiments from our side. We'll post all updates here.

@annaveronika
Contributor

I'll remove the email from the previous message tomorrow, so please copy it somewhere :)

@StatMixedML

@annaveronika

Sure, I do understand. I am not sure, though, how we align given that I already have a paper out, plus an implementation, for CatBoostLSS that is submitted for publication in a refereed journal. You can find the repo here:

https://github.com/StatMixedML/CatBoostLSS

I think we need to align on copyrights before you start implementing a solution. Ideally, we have a co-authorship on the paper. Please let me know.

@annaveronika
Contributor

For copyrights, we need your agreement if we copy the code from your repo. Or, if you make a pull request, then you have to agree to our CLA. I don't think there are other problems; it's not required to co-author a paper to make contributions or to implement an idea from a public research paper.

@StatMixedML

I understand. But adhering to sound scientific principles, how do we deal with the fact that I have submitted the paper to a journal?

@alejandroschuler

@annaveronika I'm one of the authors of NGBoost; please do include our approach in any internal benchmarking you do. We'd love to see the results and are happy to answer any questions.

@StatMixedML thanks for giving me a heads up about this conversation!

@kmedved

kmedved commented May 4, 2020

I am curious whether there is any update on adding distributional support to CatBoost, whether via an NGBoost-type approach or otherwise. This would be a huge feature addition.

Thanks.

@abelriboulot

Any update on this? It's an important issue for us and I already know of 3 separate organizations that had to implement NGBoost-type approaches on forks with duct tape.

@ek-ak
Collaborator

ek-ak commented Sep 9, 2021

Hello! We have implemented the RMSEWithUncertainty loss and uncertainty prediction; I think this is what you are looking for: https://catboost.ai/docs/references/uncertainty.html#uncertainty
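
For readers landing here: once a model predicts a per-sample mean and variance under a Normal assumption (as a loss in the RMSEWithUncertainty family produces), an interval of the kind requested in this issue follows directly. The snippet below is a hedged sketch with made-up numbers; check the linked docs for the exact output format (variance vs. log-variance) of your CatBoost version:

```python
import numpy as np

# Hypothetical per-sample predictions, columns: (mean, variance).
preds = np.array([[50.0, 100.0],
                  [80.0, 400.0]])
mean, var = preds[:, 0], preds[:, 1]

# Central 90% interval under a Normal assumption: mean +- 1.645 * sigma.
half_width = 1.645 * np.sqrt(var)
intervals = np.stack([mean - half_width, mean + half_width], axis=1)
print(intervals)  # the high-variance sample gets the wider interval
```

This delivers the behavior asked for at the top of the thread: per-sample intervals whose width tracks the locally learned noise.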

@kizill closed this as completed on Jan 24, 2023