# Calculating the moments of a loss function

In this post I'll discuss how to calculate the first and second moments of a [loss function](https://en.wikipedia.org/wiki/Loss_function) for machine learning (ML) regression and classification models. The expected value of a loss function is known as its [risk](https://en.wikipedia.org/wiki/Empirical_risk_minimization#Background), and I'll refer to the variance (i.e. the second central moment) of the loss function as the loss variance hereafter. While the risk and the loss variance are not *perfectly* knowable in the real world, in the simulation setting the researcher has full knowledge of how the data is generated and can therefore calculate these terms in principle. But how does one do this? That is the question this post will answer for a variety of contexts, from Gaussian linear regression to the non-linear classification setting.

The rest of the post is structured as follows. [Section 1](#(1)-background) provides a backgrounder on the supervised ML problem and the integration approaches to the moments of the loss function. [Section 2](#(2)-regression) describes the approaches for a regression problem. [Section 3](#(3)-classification) discusses how to calculate these quantities for common classification loss functions. [Section 4]() concludes.

This post was generated from an original notebook that can be [found here](https://github.com/erikdrysdale/erikdrysdale.github.io/tree/master/_rmd/extra_loss_moments/post.ipynb). For readability, various code blocks have been suppressed and the text has been tidied up.

## (1) Background

### (1.1) The ML problem 

The goal of supervised machine learning (ML) is to "learn" a function, $f_\theta(x)$, $f: \mathbb{R}^p \to \mathbb{R}^k$, that maps a feature vector $x \in \mathbb{R}^p$ from $p$-dimensional space and seeks to approximate a label $y \in \mathbb{R}^k$. This function (also referred to as an algorithm) is parameterized by $\theta$. For example, $y$ could be the price of a house, and $x$ could be the characteristics of the house (the number of bedrooms, square footage, etc.). The label is not always an unrestricted real number, and may instead be a non-negative number ($y \in \mathbb{R}^{+}$) or a [multiclass](https://en.wikipedia.org/wiki/Multiclass_classification) ($k>1$) label.

We assume that the actual data generating process (DGP) follows some (unknown) probabilistic joint distribution $P_{\phi}(Y,Z)$, parameterized by $\phi$, where $Z$ represents the true covariates in $\mathbb{R}^q$.[[^1]] The observed feature vector $x$ is hopefully a subset or a function of $z$, i.e., $x = g(z)$, where $g: \mathbb{R}^q \to \mathbb{R}^p$. The true relationship between the true covariates $Z$ and the label $Y$ implies some joint distribution between the label and the measured covariates: $(Y, X) \sim P_{h(\phi)}$.[[^2]]

In a supervised learning scenario, we have a dataset $\{(y_i, x_i)\}_{i=1}^n$ consisting of $n$ observations, where each $x_i \in \mathbb{R}^p$ is the feature vector and $y_i \in \mathbb{R}^k$ is the corresponding label. We learn an optimal finite-sample $\theta$ by minimizing some measure of training "loss" between the predicted and actual label: $\hat\theta = \arg\min_\theta \hat\ell(y, f_\theta(x))$. How one does this is an entirely different subject that will not be discussed here (see some of my other posts [here](http://www.erikdrysdale.com/auc_max/) or [here](http://www.erikdrysdale.com/ordinal_regression/)).[[^3]] Once the training procedure is complete, $f_{\hat\theta}$ (hopefully) approximates the true underlying relationship implied by the true joint distribution and will generalize to new and unseen data drawn from $P_{h(\phi)}$.

After a model is trained (based on some training loss function $\hat\ell$), we are often interested in how well that model will perform on future data for some primary loss function $\ell$.[[^4]] We call the expected value of this loss function the risk of the algorithm: $R(\theta; f) = E_{(y,x) \sim P_{h(\phi)}}[\ell(y, f_\theta(x))]$. Notice that for a given function class, $f$, the risk is parameterized w.r.t. the parameter $\theta$. For example, if $f$ is a linear model, then the risk can be higher or lower depending on how well the coefficients approximate the true DGP: i.e. $X^T \theta \approx Y$. 

We may also be interested in the variance of the loss: $V(\theta; f) = E_{(y,x) \sim P_{h(\phi)}}\big[ (\ell(y, f_\theta(x)) - R(\theta; f))^2 \big]$. This quantity, the loss variance, is not discussed as often as the risk. Its main value is in understanding how fast the empirical risk will converge to the actual risk. Because the empirical analogue of the risk is simply an average, it will asymptotically follow a normal distribution by the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem). In other words, for a large sample of $n$ new (non-training) observations: $\sqrt{n}(\hat{R} - R) \to N(0,V(\theta))$ in distribution.
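The convergence statement can be checked with a small simulation. The following is only a sketch under assumed values: a scalar feature, Gaussian noise, squared-error loss, and hypothetical parameters `beta`, `theta`, and `sigma` that do not come from the original notebook. In this setup the residual $Y - \theta X$ is Gaussian with variance $s^2 = (\beta-\theta)^2 + \sigma^2$, so $R = s^2$ and $V = 2s^4$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X ~ N(0, 1), Y = beta * X + eps, eps ~ N(0, sigma^2)
beta, theta, sigma = 1.0, 0.5, 1.0
s2 = (beta - theta) ** 2 + sigma ** 2   # variance of the residual Y - theta * X
R_true = s2                             # risk under squared-error loss
V_true = 2 * s2 ** 2                    # variance of s2 * chi-squared(1)

n, n_rep = 1_000, 2_000
x = rng.standard_normal((n_rep, n))
y = beta * x + sigma * rng.standard_normal((n_rep, n))
R_hat = np.mean((y - theta * x) ** 2, axis=1)   # one empirical risk per replication

# Scaled estimation error, which should look like N(0, V) for large n
z = np.sqrt(n) * (R_hat - R_true)
print(f"mean: {z.mean():.3f} (theory: 0)")
print(f"variance: {z.var():.3f} (theory: {V_true:.3f})")
```

The printed variance should sit close to $V$, which is why the loss variance controls the width of any confidence interval around an empirical risk.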

### (1.2) The integration problem

#### (1.2.1) Joint integration approach 

Suppose that one knows the true DGP, $P_\phi$, how the observed covariates relate to the true covariates, $x=g(z)$, and how this affects the joint distribution of observed covariates and labels, $P_{h(\phi)}$. Could an oracle calculate $R_f(\theta)$ and $V_f(\theta)$ for some given value of $\theta$? The short answer is yes. By definition, an expectation means solving the following integral:

$$
\begin{align*}
R_f(\theta) &= \int_{\mathcal{Y} \times \mathcal{X}} \ell(y, f_\theta(x)) dP_{h(\phi)} \\
&= \int_{\mathcal{Y}} \int_{\mathcal{X}} \ell(y, f_\theta(x)) p_{h(\phi)}(y, x) dx dy 
\end{align*}
$$

Here $\mathcal{X}$ and $\mathcal{Y}$ are the domains of the random variables $X$ and $Y$, and $p(\cdot)$ denotes the density function. Without loss of generality, I'm notationally assuming that $Y$ and $X$ are continuous.[[^5]] In other words, the integral above says the risk is equivalent to the loss function integrated against (i.e. weighted by) the joint density function. If $x$ or $y$ is multidimensional, this integral can become quite complicated since it involves a higher-order (multiple) integration. Similarly, the loss variance involves solving another similar-looking integral:

$$
\begin{align*}
V_f(\theta) &= E_{(y,x) \sim P_{h(\phi)}}[\ell^2] - [R_f(\theta)]^2 \\
&= \int_{\mathcal{Y} \times \mathcal{X}} [\ell(y, f_\theta(x))]^2 dP_{h(\phi)} -  [R_f(\theta)]^2\\
&= \int_{\mathcal{Y}} \int_{\mathcal{X}} [\ell(y, f_\theta(x))]^2 p_{h(\phi)}(y, x) dx dy  - [R_f(\theta)]^2
\end{align*}
$$

In other words, once the risk is known, to calculate the loss variance, we simply need to integrate the squared loss function over the joint distribution.

#### (1.2.2) Conditional integration approach

What happens if we are unaware of the joint distribution between the observed labels and covariates, but know the distribution of the covariates, and the distribution of the labels given the covariates? This will also be fine! To keep the notation clear, assume the true and observed covariates are drawn from distributions $Z \sim P_Z$ and $X \sim P_{X}$, respectively. Similarly, the conditional distribution of the label given the true and observed covariates will be $Y | Z \sim P_{Y|Z}$ and $Y | X \sim P_{Y|X}$, respectively.

Recall that the [law of total expectation](https://en.wikipedia.org/wiki/Law_of_total_expectation) says that: $E[X] = E_Y[E_{X|Y}[X | Y]]$. Thus we can re-write the risk integral from section 1.2.1 as:

$$
\begin{align*}
R_f(\theta) &=  \int_{\mathcal{X}} \Bigg[  \int_{\mathcal{Y}} \ell(y, f_\theta(x)) dP_{Y|X} \Bigg] dP_X(x) \\
&=  \int_{\mathcal{X}} \Bigg[  \int_{\mathcal{Y}} \ell(y, f_\theta(x)) p_{Y|X}(y|X=x) dy \Bigg] p_X(x) dx
\end{align*}
$$

Since we tend to think of the covariates as "causing" the label, or at least determining the probability of observing a label, this is often a more intuitive way to think about the DGP. In the context of genetics, we would first imagine mother nature drawing from the existing pool of genetic variation (the features), and these variations would lead to the distribution of phenotypes we observe (the labels).

Beyond the conceptual advantage, the conditional distribution might also be easier to characterize than the joint distribution. For example, if $X \sim \text{MVN}(0, \Sigma)$ and $Y = X^T\beta + (\epsilon - 1)$, where $\epsilon \sim \text{Exp}(1)$, then $E[Y] = 0$ and $Y \sim N(-1, \sigma_\beta^2) + \text{Exp}(1)$, where $\sigma_\beta^2=\beta^T\Sigma\beta$. We can find an exact representation of the marginal density of the label, $f_{Y}(w) = \exp(\sigma^2_\beta/2 - 1 - w) \cdot \Phi((w+1)/\sigma_\beta  - \sigma_\beta)$, by looking at the [convolution](https://en.wikipedia.org/wiki/Convolution_of_probability_distributions) of these two random variables.[[^6]] While it's trivial to draw from $X$ first and then draw $Y|X$, or to calculate the densities $f_X(x)$ and $f_{Y|X}(y|x)$, there's no easy way to draw $(Y,X)$ simultaneously, or to provide an exact analytical form for $f_{Y,X}$ for this stylized example. Thus, for many distributions, it may be easier to characterize the label with respect to the observed covariates than the joint distribution.[[^7]]
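As a sanity check on the convolution result, the density of $W = N(-1, \sigma_\beta^2) + \text{Exp}(1)$ works out to $\exp(\sigma_\beta^2/2 - 1 - w) \cdot \Phi((w+1)/\sigma_\beta - \sigma_\beta)$, which is an exponentially modified Gaussian. The sketch below compares that closed form against SciPy's `exponnorm` distribution, whose documented parameterization is `K = 1/(sigma * rate)`, `loc = mu`, `scale = sigma`. The value of `sigma_beta` is an arbitrary assumption for illustration.

```python
import numpy as np
from scipy.stats import exponnorm, norm

sigma_beta = 1.5                    # hypothetical value of sqrt(beta' Sigma beta)
w = np.linspace(-6, 8, 201)

# Closed form from convolving N(-1, sigma_beta^2) with Exp(1)
f_closed = np.exp(sigma_beta ** 2 / 2 - 1 - w) * norm.cdf((w + 1) / sigma_beta - sigma_beta)

# SciPy's exponentially modified Gaussian with rate 1, mu = -1, sigma = sigma_beta
f_scipy = exponnorm.pdf(w, 1 / sigma_beta, loc=-1, scale=sigma_beta)

max_abs_diff = np.max(np.abs(f_closed - f_scipy))
print(f"largest absolute difference: {max_abs_diff:.2e}")
```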

### (1.3) Monte Carlo integration

If a researcher can draw the labels and features (either jointly or conditionally), then the simplest way to calculate the risk and loss variance is to use Monte Carlo integration (MCI). This stochastic approach always "works", and its precision can be determined by the number of samples used to estimate $R$.

**Joint sampling**
1. Draw $n$ samples of $\{(y_1, x_1), \dots (y_n, x_n)\}$, with $(y_i, x_i) \sim F_{h(\phi)}$
2. Evaluate $\{\ell_1, \dots, \ell_n\}$, where $\ell_i = \ell(y_i, f_\theta(x_i))$
3. Calculate: $\hat R_f(\theta) = n^{-1} \sum_{i=1}^n \ell_i$

As $n \to \infty$, $\hat{R} \to R$. From a computational perspective, even if we don't have enough memory to evaluate $n=1e+9$ draws at once, we could still obtain $\hat R$ with that degree of precision by calculating $\hat{R}$ with $n=1e+6$, repeating this 1000 times, storing the result each time, and averaging the 1000 results.
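The joint sampling steps, together with the batching trick, can be sketched as follows. Everything here is an illustrative assumption rather than the notebook's code: a jointly Gaussian DGP (so that $(Y, X)$ can be drawn in one call), squared-error loss, and made-up values for `Sigma`, `beta`, `theta`, and `sigma`. For this setup the true risk is $(\beta-\theta)^T\Sigma(\beta-\theta) + \sigma^2$, which gives something to compare against.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical DGP: X ~ MVN(0, Sigma), Y = X'beta + eps, eps ~ N(0, sigma^2)
p = 3
Sigma = 0.5 * np.eye(p) + 0.5           # unit variances, 0.5 correlations
beta = np.array([1.0, -1.0, 0.5])
theta = np.array([0.8, -0.5, 0.0])      # some imperfect "fitted" coefficients
sigma = 1.0

# Covariance of the stacked vector (Y, X)
cov_joint = np.block([
    [np.atleast_2d(beta @ Sigma @ beta + sigma ** 2), (Sigma @ beta)[None, :]],
    [(Sigma @ beta)[:, None], Sigma],
])

def risk_batch(n):
    """One Monte Carlo estimate of the squared-error risk from n joint draws."""
    yx = rng.multivariate_normal(np.zeros(p + 1), cov_joint, size=n)
    y, x = yx[:, 0], yx[:, 1:]
    return np.mean((y - x @ theta) ** 2)

# Average equally sized batches instead of holding every draw in memory
R_hat = np.mean([risk_batch(100_000) for _ in range(20)])
R_true = (beta - theta) @ Sigma @ (beta - theta) + sigma ** 2
print(f"MC risk: {R_hat:.4f}, analytical risk: {R_true:.4f}")
```

Because every batch has the same size, the average of the batch means equals the mean over all draws.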

If we can't draw directly from the joint distribution, but can draw from the features, and then from the label given the features, we can do MCI the following way:

**Conditional sampling**
1. Draw $n$ samples of $\{x_1, \dots x_n\}$, with $x_i \sim F_{X}$
2. Draw $n$ samples of $\{y_1, \dots y_n\}$, with $y_i | x_i \sim F_{Y|X}$
3. Evaluate $\{\ell_1, \dots, \ell_n\}$, where $\ell_i = \ell(y_i, f_\theta(x_i))$
4. Calculate: $\hat R_f(\theta) = n^{-1} \sum_{i=1}^n \ell_i$
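A minimal sketch of the conditional sampling steps, assuming the stylized example with a centred exponential error, $Y = X^T\beta + (\epsilon - 1)$ with $\epsilon \sim \text{Exp}(1)$, and squared-error loss. The parameter values are hypothetical. Since the error has mean zero and variance one, the true risk is $(\beta-\theta)^T\Sigma(\beta-\theta) + 1$.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 3
Sigma = 0.5 * np.eye(p) + 0.5           # hypothetical feature covariance
beta = np.array([1.0, -1.0, 0.5])
theta = np.array([0.8, -0.5, 0.0])
n = 1_000_000

# Step 1: draw the features
x = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
# Step 2: draw the label given the features (centred exponential noise)
y = x @ beta + (rng.exponential(1.0, size=n) - 1.0)
# Step 3: evaluate the loss for each draw
loss = (y - x @ theta) ** 2
# Step 4: average
R_hat = loss.mean()

R_true = (beta - theta) @ Sigma @ (beta - theta) + 1.0   # Var(Exp(1)) = 1
print(f"MC risk: {R_hat:.4f}, analytical risk: {R_true:.4f}")
```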

Once we have sample draws, it's fairly easy to calculate the loss variance.

**Loss variance**
1. Calculate $\hat{R} \approx R$ using one of the above approaches
2. Evaluate $\{\ell_1^2, \dots, \ell_n^2\}$, where $\ell_i^2 = [\ell(y_i, f_\theta(x_i))]^2$
3. Calculate $\hat V = n^{-1} \sum_{i=1}^n \ell_i^2 - [\hat{R}]^2$ (multiply by $n/(n-1)$ for the unbiased version)
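The loss variance steps can be sketched in a few lines. This assumes a scalar Gaussian linear model with squared-error loss and hypothetical parameter values, for which the loss is $s^2$ times a chi-squared variable with one degree of freedom, so that $V = 2s^4$ with $s^2 = (\beta-\theta)^2 + \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: X ~ N(0, 1), Y = beta * X + eps, eps ~ N(0, sigma^2)
beta, theta, sigma = 1.0, 0.5, 1.0
n = 1_000_000

x = rng.standard_normal(n)
y = beta * x + sigma * rng.standard_normal(n)
loss = (y - theta * x) ** 2

R_hat = loss.mean()                        # step 1: the risk estimate
V_hat = np.mean(loss ** 2) - R_hat ** 2    # steps 2-3: second raw moment minus squared risk

s2 = (beta - theta) ** 2 + sigma ** 2
V_true = 2 * s2 ** 2
print(f"MC loss variance: {V_hat:.4f}, analytical: {V_true:.4f}")
```

This version of `V_hat` matches `loss.var()` with NumPy's default `ddof=0`; passing `ddof=1` gives the unbiased estimate, and for large $n$ the two are practically indistinguishable.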

### (1.4) Numerical methods

Unlike MCI, numerical methods for evaluating the integral of a loss function are usually slower, but they are deterministic and can be run to a specified level of tolerance. Numerical methods require knowing the PDF of the joint and/or conditional distribution. While there are different approaches, [trapezoidal integration](https://en.wikipedia.org/wiki/Trapezoidal_rule) is one of the most common and well-studied methods. It works by dividing the integration interval into small segments, approximating the area under the curve as a series of trapezoids, and summing their areas. Specifically: $\int_a^b f(x) dx \approx \sum_{i=2}^{n} \frac{f(x_{i})+f(x_{i-1})}{2} (x_i - x_{i-1})$. If $x$ is a vector, then this amounts to doing $O(n^p)$ operations since we need to have $p$ nested loops.
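The trapezoidal formula translates almost directly into NumPy. The helper below (named `trapz` here purely for illustration) is a hand-rolled sketch of the summation above, applied to a case with a known answer: the second moment of a standard normal, which equals 1.

```python
import numpy as np

def trapz(f_vals, x):
    """Composite trapezoidal rule for function values f_vals on a sorted grid x."""
    f_vals, x = np.asarray(f_vals), np.asarray(x)
    return np.sum((f_vals[1:] + f_vals[:-1]) / 2 * np.diff(x))

# Example: E[X^2] for X ~ N(0, 1)
x = np.linspace(-10, 10, 2001)
pdf = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
second_moment = trapz(x ** 2 * pdf, x)
print(f"trapezoid estimate of E[X^2]: {second_moment:.8f}")
```

In practice `scipy.integrate.trapezoid` performs the same calculation and also accepts an `axis` argument, which is convenient for nested integrals.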

Throughout this post, we'll use the trapezoid rule when doing numerical integration. Here are the steps for estimating $R$ with the trapezoidal rule, which I'll denote as $\text{trapz}(I(\cdot), \mathbb{R}^l)$, where $I$ is the integrand and $\mathbb{R}^l$ denotes the space to be integrated over.

**Joint density**
1. Discretize the feature and label spaces into grids of points $\hat{\mathcal{X}}=\{x_1, x_2, \ldots, x_{n_x}\} \subset \mathcal{X}$ and $\hat{\mathcal{Y}} = \{y_1, y_2, \ldots, y_{n_y}\} \subset \mathcal{Y}$
2. Calculate $\{f_{\theta}(x_1), \dots, f_{\theta}(x_{n_x})\}$
3. For a given value of $x_i$, compute the inner integral w.r.t. the labels: $I(x_i) = \text{trapz}(\ell(y, f_\theta(x_i)) \cdot p_{h(\phi)}(y, x_i), y \in \hat{\mathcal{Y}})$
4. Calculate the outer integral w.r.t. the covariates: $\hat{R} = \text{trapz}(I(x), x \in \hat{\mathcal{X}})$
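The four steps can be sketched for a one-dimensional feature. The setup is an assumption chosen so that the answer is known: $X \sim N(0,1)$, $Y = \beta X + \epsilon$ with Gaussian noise, squared-error loss, and hypothetical values for `beta`, `theta`, and `sigma`. The pair $(Y, X)$ is then bivariate normal and the true risk is $(\beta-\theta)^2 + \sigma^2$. The grid limits are also a judgment call: they just need to cover essentially all of the probability mass.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import multivariate_normal

beta, theta, sigma = 1.0, 0.5, 1.0

# Joint density of (Y, X) for the assumed Gaussian DGP
cov = np.array([[beta ** 2 + sigma ** 2, beta],
                [beta, 1.0]])
joint = multivariate_normal(mean=[0.0, 0.0], cov=cov)

# Step 1: discretize the feature and label spaces
x_grid = np.linspace(-8, 8, 401)
y_grid = np.linspace(-12, 12, 601)
# Step 2: model predictions on the feature grid
f_x = theta * x_grid

# Step 3: inner integral over the labels for each x_i
inner = np.empty_like(x_grid)
for i, (xi, fi) in enumerate(zip(x_grid, f_x)):
    pts = np.column_stack([y_grid, np.full_like(y_grid, xi)])
    inner[i] = trapezoid((y_grid - fi) ** 2 * joint.pdf(pts), x=y_grid)

# Step 4: outer integral over the feature
R_hat = trapezoid(inner, x=x_grid)
R_true = (beta - theta) ** 2 + sigma ** 2
print(f"trapezoid risk: {R_hat:.6f}, analytical risk: {R_true:.6f}")
```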

Note that steps (1) and (2) can be done "on the fly" rather than being pre-computed. This is especially helpful if $x$ is high-dimensional or there are many points being evaluated. The conditional density approach uses the same first two steps.

**Conditional density**
1. See above
2. See above
3. For a given value of $x_i$, compute the inner integral w.r.t. the labels: $I(x_i) = \text{trapz}(\ell(y, f_\theta(x_i)) \cdot p_{Y|X}(y | X=x_i), y \in \hat{\mathcal{Y}})$
4. Calculate the outer integral w.r.t. the covariates: $\hat{R} = \text{trapz}(I(x) \cdot p_X(x), x \in \hat{\mathcal{X}})$

The key computational difference between the joint density and conditional density approaches is that the latter calls $p_{Y|X}$ in the inner integral and $p_X$ in the outer, whereas the former uses $p_{Y,X}=p_{h(\phi)}$ in the inner integral only.
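Here is a vectorised sketch of the conditional density approach, this time with the absolute-error loss to show that nothing depends on the loss being squared error. The DGP is again an assumed scalar Gaussian model with hypothetical parameters. The residual $Y - \theta X$ is $N(0, s^2)$ with $s^2 = (\beta-\theta)^2+\sigma^2$, so the true risk is $s\sqrt{2/\pi}$.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

beta, theta, sigma = 1.0, 0.5, 1.0
x_grid = np.linspace(-8, 8, 401)
y_grid = np.linspace(-14, 14, 1401)

# Inner integral for every x_i at once: rows index x, columns index y
loss = np.abs(y_grid[None, :] - theta * x_grid[:, None])
p_y_given_x = norm.pdf(y_grid[None, :], loc=beta * x_grid[:, None], scale=sigma)
inner = trapezoid(loss * p_y_given_x, x=y_grid, axis=1)

# Outer integral: weight the inner integral by the feature density
R_hat = trapezoid(inner * norm.pdf(x_grid), x=x_grid)

R_true = np.sqrt(2 / np.pi * ((beta - theta) ** 2 + sigma ** 2))
print(f"trapezoid risk: {R_hat:.6f}, analytical risk: {R_true:.6f}")
```

Broadcasting replaces the explicit loop over $x_i$ at the cost of holding an $n_x \times n_y$ array in memory, which is only sensible for low-dimensional grids.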

**Loss variance**
1. Calculate $\hat{R}$ using either the joint or conditional approach
2. Repeat the risk estimate, but replace $\ell(y, f_\theta(x_i))$ with $[\ell(y, f_\theta(x_i))]^2$ in the inner integral, and store the final value as $\hat{M}_2 \approx E[\ell^2]$
3. Estimate the loss variance: $V_f(\theta) \approx \hat{V} = \hat{M}_2 - [\hat{R}]^2$
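The loss variance recipe reuses the same integration routine twice, once for the loss and once for its square. A sketch under the same assumed scalar Gaussian model with squared-error loss, where the analytical answer is $V = 2[(\beta-\theta)^2+\sigma^2]^2$; the helper name `integrate_loss` and its `power` argument are hypothetical.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

beta, theta, sigma = 1.0, 0.5, 1.0
x_grid = np.linspace(-8, 8, 401)
y_grid = np.linspace(-16, 16, 1601)

def integrate_loss(power):
    """Approximate E[loss ** power] with the conditional-density trapezoid scheme."""
    loss = (y_grid[None, :] - theta * x_grid[:, None]) ** 2
    p_y_given_x = norm.pdf(y_grid[None, :], loc=beta * x_grid[:, None], scale=sigma)
    inner = trapezoid(loss ** power * p_y_given_x, x=y_grid, axis=1)
    return trapezoid(inner * norm.pdf(x_grid), x=x_grid)

R_hat = integrate_loss(1)        # step 1: the risk
M2_hat = integrate_loss(2)       # step 2: second raw moment of the loss
V_hat = M2_hat - R_hat ** 2      # step 3: loss variance

V_true = 2 * ((beta - theta) ** 2 + sigma ** 2) ** 2
print(f"trapezoid loss variance: {V_hat:.6f}, analytical: {V_true:.6f}")
```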


### Advantages and Considerations

Numerical methods like trapezoidal integration can be more efficient and accurate for low-dimensional integrals compared to Monte Carlo integration, especially when the integrands are smooth and well-behaved. However, they can become computationally expensive and less practical as the dimensionality of the feature space increases due to the curse of dimensionality. In practice, the choice of method depends on the specific problem, the properties of the loss function and distribution, and the computational resources available.


### (1.5) Worked example

Now we're ready to put our theory to the test. 

## (2) Regression models

### (2.1) Linear regression, Gaussian covariates

#### (2.1.1) Gaussian error

#### (2.1.2) Non-Gaussian error


### (2.2) Linear regression, non-Gaussian covariates

### (2.3) Non-linear regression, parametric covariates

### (2.4) Non-linear regression, non-parametric covariates


## Footnotes

[^1]: One should be agnostic as to whether $q \ge p$ or $q \le p$. On the one hand, all models will be misspecified in practice since the true DGP is too complex to capture with data (e.g. the psychology of home buyers for housing prices), which would suggest that $q > p$. But we may also collect data that is irrelevant to the DGP, which would push $p > q$ (e.g. genetic measurements). There is a whole field of research that looks into the effect of [omitted variable bias](https://en.wikipedia.org/wiki/Omitted-variable_bias) and [model misspecification](https://en.wikipedia.org/wiki/Statistical_model_specification).

[^2]: It's also possible that there is a discrepancy between the true and observed labels. I've written about an instance of [label error](http://www.erikdrysdale.com/hausman/) in a previous post.

[^3]: Many topics are relevant for model optimization including [mathematical optimization](https://en.wikipedia.org/wiki/Mathematical_optimization), [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization), [model selection](https://en.wikipedia.org/wiki/Model_selection), and [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).

[^4]: The training loss may or may not be the final loss measure we are interested in. For example, a value of $\theta$ could be learned based on the squared loss of the training data, but the final out-of-sample performance may be based on the absolute error. Why the discrepancy? Certain optimization routines work better with (or only with) certain loss functions, meaning the squared error might be the best method to use practically, even if it's an approximate surrogate for the absolute error. For our analysis, however, this distinction is irrelevant: when we discuss the loss function, we will refer to the final performance measure of interest, which may or may not be the training loss.

[^5]: If they are not, we can simply replace the notation with that of a discrete summation: $\sum_{y \in \mathcal{Y}} g(y) f_Y(y)$, for example.

[^6]: The proof for this is a bit involved, but see this proof on [Stack Exchange](https://stats.stackexchange.com/q/467366) for more info.

[^7]: An example of where the joint distribution would be known is when $Y = X^T\beta + \epsilon$ and both $X\sim \text{MVN}(0, \Sigma)$ and $\epsilon \sim N(0, \sigma^2)$ are normal, in which case $(Y, X) \sim \text{MVN}\Bigg(0, \begin{pmatrix} \beta^T \Sigma \beta + \sigma^2 & \beta^T \Sigma  \\ \Sigma \beta & \Sigma  \end{pmatrix}  \Bigg)$.