
learn prediction intervals (variance, noise) #1148

Closed
sashaostr opened this issue Jan 16, 2020 · 15 comments

@sashaostr

sashaostr commented Jan 16, 2020

The problem:
I'm looking to emit prediction intervals for each predicted value (the mean) in regression. I need these intervals to cover, say, 90% of the true values while being as narrow as possible. In other words, I want to learn and emit the variance (or noise) that can't be explained by the model's features in each region of the data: each sample would get different intervals, determined by its input vector.
For example:
Predicting income from the number of education years. Given that we have no additional data, I would expect lower variance of income for fewer education years and higher variance for more. Another example: predicting how many years a person has left to live from their age and health data. The young would have larger variance, the old smaller, and the old and unhealthy smaller still.

There are a few main methods I'm aware of to do it:

  • non-parametric methods, using quantile loss: to get 90% coverage we can train three models, Quantile:alpha=0.5 for the median and alpha=0.1 and 0.9 for the lower and upper bounds (quantiles) respectively.
  • parametric methods: assume some distribution for the noise and learn its parameters. For example, to get 90% coverage, assume a Normal distribution and train models for the mean and stdev, then calculate the interval from the stdev.
  • Bayesian methods: essentially parametric as well, but the modeling is done using probabilistic methods.
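
As a concrete illustration of the non-parametric route, the pinball (quantile) loss that such models optimize can be written in a few lines. This is a standalone sketch, not taken from any particular library; the function name is illustrative:

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Quantile (pinball) loss: penalizes under-prediction with weight
    alpha and over-prediction with weight (1 - alpha)."""
    diff = y_true - y_pred
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

# With alpha = 0.9, predicting below the true value is 9x more costly
# than predicting above it, which pushes the fit toward the 90th percentile.
y = np.array([10.0, 12.0, 14.0])
print(pinball_loss(y, y - 1.0, 0.9))  # under-prediction by 1 -> 0.9
print(pinball_loss(y, y + 1.0, 0.9))  # over-prediction by 1 -> 0.1
```

Training three models with alpha = 0.1, 0.5, 0.9 then gives the lower bound, the median, and the upper bound, respectively.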

The question:
As mentioned, I want the interval to be as narrow as possible while still satisfying the required coverage, which means that learning separate models for the quantiles (or for the parameters of a parametric method) wouldn't provide an optimal solution in terms of coverage and width.
I'm looking for a loss function that can optimize both things simultaneously. Is there something already built into some library, and if not, what would be the simplest way to implement it?
Both parametric and non-parametric methods are acceptable.
Thanks in advance!
Alexander

Example
The data in the example is simulated: two independent variables, stage (categorical) and age (x axis); the y axis is the predicted value. The bounds and mean were produced by three separate quantile models. Real data is much more complex, though, so the separate-models approach does not produce nice results.
[figure: simulated data with the median and lower/upper quantile bounds from the three models]
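
A minimal simulation in the spirit of this example (all numbers are made up for illustration) shows what "90% coverage" means empirically when the noise grows along the x axis, so a good interval must widen with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated heteroscedastic data: the noise scale grows with age,
# so the correct 90% interval widens along the x axis.
n = 10_000
age = rng.uniform(20, 70, n)
income = 1_000 * age + rng.normal(0, 50 * age, n)

# The true conditional 5% / 95% quantiles are known here (Normal noise,
# z_0.95 ~= 1.645), so we can check coverage directly.
lo = 1_000 * age - 1.645 * 50 * age
hi = 1_000 * age + 1.645 * 50 * age
coverage = np.mean((income >= lo) & (income <= hi))
print(round(coverage, 3))  # close to 0.90
```

On real data the quantiles are of course unknown, which is exactly the estimation problem posed above.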

@StatMixedML

StatMixedML commented Jan 19, 2020

Maybe that is of interest to you: I am currently working on an extension of CatBoost to probabilistic forecasting

https://github.com/StatMixedML/CatBoostLSS

It creates probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. It learns all parameters of a distribution and provides inference as well. It is still at an early stage of development, though. You may find additional information in the corresponding paper

https://128.84.21.199/abs/2001.02121

In case you are interested, I am also extending XGBoost to probabilistic forecasting. The repo is here

https://github.com/StatMixedML/XGBoostLSS

and paper here

https://arxiv.org/abs/1907.03178

@sashaostr
Author

> Maybe that is of interest to you: I am currently working on an extension of CatBoost to probabilistic forecasting
>
> https://github.com/StatMixedML/CatBoostLSS
>
> It creates probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. It learns all parameters of a distribution and provides inference as well. It is still at an early stage of development, though. You may find additional information in the corresponding paper
>
> https://128.84.21.199/abs/2001.02121
>
> In case you are interested, I am also extending XGBoost to probabilistic forecasting. The repo is here
>
> https://github.com/StatMixedML/XGBoostLSS
>
> and paper here
>
> https://arxiv.org/abs/1907.03178

Thanks a lot @StatMixedML, it looks very promising, I'll definitely examine it very closely!
One question: you call it probabilistic forecasting, but can it be seen as regression? Forecasting, to me, means predicting future values of some time-dependent variable. I'm looking to estimate the noise bounds (parametric or not) of a dependent variable based on a number of independent ones, and I expect them to differ across regions of the generative process. To the best of my understanding, the forecasting problem with its time variable is a specific case of what I described, but I just wanted to understand your view on it and the particular reason you call it forecasting rather than regression. In any case it's very useful for me. Thanks!!!

@StatMixedML

Not sure if I fully get your point, but what CatBoostLSS does is relate all parameters of a distribution to covariates. Say you specify a Normal distribution; then both the mean and sigma are estimated as functions of x, i.e., E(y|x) = f1(x) and Sigma(y|x) = f2(x). In addition you get, e.g., Partial Dependence Plots for both E(y|x) and Sigma(y|x), so that you better understand the influence x has on the mean and variance. Since you estimate all the parameters, you can sample observations at any desired point of the response distribution, i.e., the 5%, 50%, 95% quantiles. What you don't get is how x affects different parts of the conditional distribution; that is where you should use Quantile Regression / Expectile Regression instead.
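
The parametric setup described here boils down to minimizing a Gaussian negative log-likelihood in which both the mean and the standard deviation depend on x. A minimal, library-agnostic sketch of that objective and of turning its outputs into a 90% interval (the function name is illustrative):

```python
import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of N(mu, sigma^2). Minimizing it fits the
    mean and variance jointly: overly wide intervals (large sigma) are
    penalized by the log term, overly narrow ones by the squared error."""
    sigma2 = np.exp(2 * log_sigma)
    return np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2))

# Once mu(x) and sigma(x) are learned, a central 90% interval follows
# from the Normal quantile z_0.95 ~= 1.645:
mu, sigma = 50.0, 10.0
lo, hi = mu - 1.645 * sigma, mu + 1.645 * sigma
print(round(lo, 2), round(hi, 2))  # 33.55 66.45
```

This is the objective shape that distributional boosting methods (CatBoostLSS, NGBoost, and similar) optimize per-sample, with mu and log_sigma produced by the trees.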

@sashaostr
Author

> Not sure if I fully get your point, but what CatBoostLSS does is relate all parameters of a distribution to covariates. Say you specify a Normal distribution; then both the mean and sigma are estimated as functions of x, i.e., E(y|x) = f1(x) and Sigma(y|x) = f2(x). In addition you get, e.g., Partial Dependence Plots for both E(y|x) and Sigma(y|x), so that you better understand the influence x has on the mean and variance. Since you estimate all the parameters, you can sample observations at any desired point of the response distribution, i.e., the 5%, 50%, 95% quantiles. What you don't get is how x affects different parts of the conditional distribution; that is where you should use Quantile Regression / Expectile Regression instead.

Thanks for the explanation @StatMixedML. I got the first part (and I think it's what I'm looking for), but I'm not sure I got the second:

> What you don't get is how x affects different parts of the conditional distribution. That is where you should use Quantile Regression / Expectile Regression instead.

I'll try to understand it better from the paper. Thanks a lot!

@annaveronika
Contributor

Thank you very much for the issue and for your paper! We will implement one of the solutions in the library, because it is one of the most frequently requested features.

@StatMixedML

StatMixedML commented Jan 25, 2020

@annaveronika: Great to hear that CatBoostLSS is getting support from the CatBoost team! Let's get in contact to collaborate on this!

@annaveronika
Contributor

annaveronika commented Jan 27, 2020

Just to be clear: we are not committing to supporting CatBoostLSS. We are planning to implement one of the solutions, not necessarily CatBoostLSS, based on some experiments from our side. We'll post all updates here.

@annaveronika
Contributor

I'll remove the email from the previous message tomorrow, so please copy it somewhere :)

@StatMixedML

@annaveronika

Sure, I do understand. I am not sure, though, how we align given that I already have a paper out, plus an implementation, for CatBoostLSS that is submitted for publication in a refereed journal. You can find the repo here:

https://github.com/StatMixedML/CatBoostLSS

I think we need to align on copyrights before you start implementing a solution. Ideally, we have a co-authorship on the paper. Please let me know.

@annaveronika
Contributor

For copyrights, we need your agreement if we copy the code from your repo. Or, if you make a pull request, then you have to agree to our CLA. I don't think there are other problems; it's not required to co-author a paper to make contributions or to implement an idea from a public research paper.

@StatMixedML

I understand. But adhering to sound scientific principles, how do we deal with the fact that I have submitted the paper to a journal?

@alejandroschuler

@annaveronika I'm one of the authors of NGBoost; please do include our approach in any internal benchmarking you do. We'd love to see the results and are happy to answer any questions.

@StatMixedML thanks for giving me a heads up about this conversation!

@kmedved

kmedved commented May 4, 2020

I am curious whether there is any update on adding distributional support to CatBoost, whether via an NGBoost-type approach or otherwise. This would be a huge feature addition.

Thanks.

@abelriboulot

Any update on this? It's an important issue for us and I already know of 3 separate organizations that had to implement NGBoost-type approaches on forks with duct tape.

@ek-ak
Collaborator

ek-ak commented Sep 9, 2021

Hello! We have implemented the RMSEWithUncertainty loss and uncertainty prediction; I think this is what you are looking for: https://catboost.ai/docs/references/uncertainty.html#uncertainty
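
For readers landing here: once a model predicts a per-sample mean and variance under a Normal assumption (as a loss in the RMSEWithUncertainty family produces), an interval of the kind requested in this issue follows directly. The snippet below is a hedged sketch with made-up numbers; check the linked docs for the exact output format (variance vs. log-variance) of your CatBoost version:

```python
import numpy as np

# Hypothetical per-sample predictions, columns: (mean, variance).
preds = np.array([[50.0, 100.0],
                  [80.0, 400.0]])
mean, var = preds[:, 0], preds[:, 1]

# Central 90% interval under a Normal assumption: mean +- 1.645 * sigma.
half_width = 1.645 * np.sqrt(var)
intervals = np.stack([mean - half_width, mean + half_width], axis=1)
print(intervals)  # the high-variance sample gets the wider interval
```

This delivers the behavior asked for at the top of the thread: per-sample intervals whose width tracks the locally learned noise.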

@kizill closed this as completed on Jan 24, 2023