# Overview

Variable selection refers to the process of identifying the most relevant variables in a model from a larger set of predictors. Sometimes the goal is just to separate the most relevant variables from the rest, and sometimes the goal is to obtain a ranking of the variables.

In this section we give a quick background on variable selection and a short description of the method implemented in Kulprit. References are provided for those interested in more details.
If you are familiar with the topic or just want to see Kulprit in action you can skip this section and go directly to the [examples](https://kulprit.readthedocs.io/en/latest/examples/examples.html).


## When should we care about variable selection?

We may want to perform variable selection when:

- **We need to reduce measurement costs.** For instance, in medicine, we may have the resources to conduct a pilot study measuring 30 variables for 200 patients but cannot afford to do the same for thousands of people. Similarly, we might be able to install numerous sensors in a field to model crop yields but cannot scale this to cover an entire agricultural region. Cost reduction isn’t always about money or time—when working with humans or other animals, it also involves minimizing pain and discomfort.

- **We aim to reduce computational costs.** While computational costs may not be an issue for small, simple models, they can become prohibitive when dealing with many variables, large datasets, or both.

- **We want to better understand important correlation structures.** In other words, we aim to identify which variables contribute the most to making better predictions. Note that this is not about causality. While statistical models, particularly GLMs, can be used for causal inference, doing so requires additional steps and assumptions.

- **We want a model that is more robust to changes in the data-generating distribution.** Variable selection can serve as a way to make a model more resilient to non-representative data.

## When should we NOT care about variable selection?

If we are not interested in learning which are the most useful variables, then variable selection is not needed. That's the case, for instance, when our goal is to make predictions. In this case, we can use all the variables available to us.

## How does variable selection work?

There are many methods to perform variable selection; you can read more about them in {cite:t}`Vehtari_2012, Heinze_2018`. One strategy within the PyMC ecosystem is to use [PyMC-BART](https://arviz-devs.github.io/EABM/Chapters/Variable_selection.html#variable-selection-with-bart).

For the rest of the discussion we will focus on how variable selection works in Kulprit. To do this, it may help to contrast it with the direct or brute-force approach. The direct approach is to combine variables in all possible ways (or at least in all reasonable ones), compute the posterior for each model, and then compare the models under some metric. Conceptually, this is very straightforward, but in practice it can be very difficult to implement. The number of models grows very fast with the number of predictors (with $p$ predictors there are $2^p$ possible subsets), fitting many models can be very time-consuming, and manually building all these models can be error-prone. Kulprit helps us solve these problems by using four strategies:

1. Automate the model building and fitting process.
2. Reduce the number of models we need to fit.
3. Reduce the time it takes to fit them.
4. Reduce the time it takes to evaluate them.

Let's explore each of these points in more detail.

1. To use Kulprit we just need to provide a single model, called the `reference` model, using [Bambi](https://bambinos.github.io/bambi/) syntax. Kulprit will then automatically build all the `submodels`, i.e. models with fewer variables, evaluate them, and provide us with useful summaries of the results.

2. To reduce the number of models we must fit, we have to use a `search strategy`. By default, Kulprit uses a forward search. It starts by creating an intercept-only model (i.e. a model with no predictors), then builds all models with the intercept and one predictor, selects the predictor that improves the model the most, then builds all models with the intercept, the previously selected predictor, and one more, and so on. As you can see, once we have selected a predictor, we only need to consider models that include it. This greatly reduces the number of models we need to evaluate. If the number of variables is very large, we may want a more aggressive search strategy. As an alternative, we could use a [Lasso search](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lasso_path.html). This fits the model with all the predictors using a Lasso penalty, from which we obtain the order of the predictors from most important to least important. We use that order to create the `submodels`. This is faster than the forward search, but it may not be as accurate.

3. Instead of using MCMC methods, as is commonly done in Bayesian statistics, Kulprit computes each `submodel`'s posterior by "projecting" the reference model's posterior onto the submodel. This projection is a procedure that finds the posterior distribution for the submodel that induces a posterior predictive distribution as close as possible to the posterior predictive distribution of the reference model. That's a fun tongue twister to bust out at parties, sure to impress! In other words, the projection is about finding a submodel that is smaller than the reference model but makes predictions that are as close as possible to it. From a variable selection perspective, this makes sense, as we are looking for a model that is simpler but still makes good predictions.

4. Finally, to evaluate the models Kulprit uses the [expected log pointwise predictive density (ELPD)](https://arviz-devs.github.io/EABM/Chapters/Model_comparison.html#elpd). This is a measure of how well a model predicts new data. The ELPD can be computed very efficiently using the PSIS-LOO-CV method {cite:t}`Vehtari_2017, Vehtari_2024`. In principle other metrics could be used.
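
The forward search described in point 2 can be sketched in a few lines of NumPy. This is only an illustration of the search logic, not Kulprit's implementation: the helper names `gaussian_loglik` and `forward_search` are hypothetical, the data are synthetic, and the submodels are scored with an in-sample Gaussian log-likelihood from a least-squares fit instead of projection and ELPD.

```python
import numpy as np


def gaussian_loglik(X, y):
    """In-sample Gaussian log-likelihood of a least-squares fit (toy score)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)  # maximum-likelihood noise variance
    return -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)


def forward_search(X, y):
    """Greedy forward search: returns the selection order and the number of fits."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    path = [gaussian_loglik(intercept, y)]  # start from the intercept-only model
    n_fits = 1
    while remaining:
        scores = {}
        for j in remaining:
            # Candidate model: intercept + already selected predictors + predictor j
            design = np.hstack([intercept, X[:, selected + [j]]])
            scores[j] = gaussian_loglik(design, y)
            n_fits += 1
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        path.append(scores[best])
    return selected, path, n_fits


rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Synthetic data (an assumption for this sketch): only predictors 2 and 4 matter.
y = 3.0 * X[:, 2] - 1.5 * X[:, 4] + rng.normal(scale=0.5, size=n)

order, path, n_fits = forward_search(X, y)
print("selection order:", order)
print("models fitted:", n_fits, "of", 2**p, "possible subsets")
```

With $p$ predictors the forward search fits $1 + p(p+1)/2$ models, compared with $2^p$ for the brute-force approach, and the gap widens quickly as $p$ grows.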


## What models can I use with Kulprit?

Currently, Kulprit can handle only a subset of the models supported by Bambi; for example, hierarchical models are not yet supported. However, the aim is to extend Kulprit to make it compatible with all the models that can be handled by Bambi.

Another restriction of the current implementation is that the reference model must be a Bambi model and the submodels must be nested models. In principle, the projective inference framework is much more flexible than this and could be used with any model.


## How is the projection done?

It turns out that projection can be framed as an optimization problem. Let's see.

Let $\theta$ denote the posterior parameters of the reference model, and $\theta_\perp$ those of the posterior for a particular submodel. Let $\tilde{y}$ denote samples from the posterior predictive distribution of the reference model, $p(\tilde{y} \mid \theta)$.

Then we want to find a posterior that induces a posterior predictive distribution $q(\tilde{y} \mid \theta_\perp)$. We want $p$ and $q$ to be as close as possible. We can use the [Kullback-Leibler](https://arviz-devs.github.io/Exploratory-Analysis-of-Bayesian-Models/Chapters/Model_comparison.html#entropy) divergence to measure how close two distributions are. Then we can write:


$$
\mathbb{KL}\{p(\tilde{y}\mid\theta) \,\|\, q(\tilde{y}\mid\theta_\perp)\} = \mathbb{E}_{\tilde{y}\sim p(\tilde{y}\mid\theta)} \left[ \log \frac{p(\tilde{y}\mid\theta)}{q(\tilde{y}\mid\theta_\perp)} \right]
$$

We can split the right-hand side of the equation into two terms:

$$
= \underbrace{\mathbb{E}_{\tilde{y}\sim p(\tilde{y}\mid\theta)} \left[ \log p(\tilde{y}\mid\theta)\right]}_{\text{constant}} - \mathbb{E}_{\tilde{y}\sim p(\tilde{y}\mid\theta)} \left[ \log q(\tilde{y}\mid\theta_\perp)\right]
$$

Because the first term is constant with respect to $\theta_\perp$ (it does not depend on a particular submodel or projection), we can ignore it when minimizing the KL divergence. Minimizing the KL divergence is therefore equivalent to minimizing:

$$
- \mathbb{E}_{\tilde{y}\sim p(\tilde{y}\mid\theta)} \left[ \log q(\tilde{y}\mid\theta_\perp)\right]
$$

This does not tell us what $q$ should be, but a general solution is to set $q$ to the likelihood used in the reference model. Then $\log q(\tilde{y} \mid \theta_\perp)$ is the log-likelihood of our model evaluated on samples from the posterior predictive distribution, $\tilde{y}\sim p(\tilde{y}\mid\theta)$. In other words, the projected posterior can be found by maximizing the model's log-likelihood with respect to the posterior predictive samples generated from the reference model.
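
As a rough illustration of this idea, the sketch below projects a linear regression onto a submodel using only NumPy. It is not Kulprit's implementation and rests on several simplifying assumptions: the data are synthetic, the noise scale `sigma` is treated as known, and the reference posterior is obtained in closed form under a flat prior instead of by MCMC. Under a Gaussian likelihood with fixed `sigma`, maximizing $\log q(\tilde{y} \mid \theta_\perp)$ reduces to a least-squares fit of $\tilde{y}$ on the submodel's predictors, which is what the code does for each posterior draw.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_draws = 300, 1.0, 500

# Synthetic data (an assumption): intercept + 3 predictors, the second is irrelevant.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([0.5, 2.0, 0.0, -1.0]) + rng.normal(scale=sigma, size=n)

# Reference posterior in closed form (flat prior, known sigma):
# theta | y ~ Normal(theta_hat, sigma^2 (X'X)^-1). This stands in for MCMC draws.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
cov = sigma**2 * np.linalg.inv(X.T @ X)
theta = rng.multivariate_normal(theta_hat, cov, size=n_draws)  # (n_draws, 4)

# One posterior predictive sample per posterior draw: y_tilde ~ p(y_tilde | theta).
y_tilde = theta @ X.T + rng.normal(scale=sigma, size=(n_draws, n))  # (n_draws, n)

# Submodel: keep the intercept and predictors 1 and 3, drop predictor 2.
X_sub = X[:, [0, 1, 3]]

# Projection: for a Gaussian likelihood with fixed sigma, the theta_perp that
# maximizes log q(y_tilde | theta_perp) is the least-squares solution.
# lstsq solves the problem for all draws at once (one column per draw).
theta_perp, *_ = np.linalg.lstsq(X_sub, y_tilde.T, rcond=None)
theta_perp = theta_perp.T  # (n_draws, 3)

print("reference posterior mean:", theta.mean(axis=0).round(2))
print("projected posterior mean:", theta_perp.mean(axis=0).round(2))
```

For likelihoods without a closed-form maximizer, the same objective would have to be maximized numerically; the least-squares shortcut here is specific to the Gaussian case.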

## References
```{bibliography}
:style: unsrt
```