# Calculating the moments of a loss function

In this post I'll discuss how to calculate the first and second moments of a [loss function](https://en.wikipedia.org/wiki/Loss_function) for machine learning (ML) regression and classification models. The expected value of a loss function is known as its [risk](https://en.wikipedia.org/wiki/Empirical_risk_minimization#Background), and I'll refer to its variance (the second central moment) as the loss variance hereafter. While the risk and the loss variance are not knowable in practice (since we can't expect to know the true data generating process (DGP)), in the simulation setting the researcher has full knowledge of how the data is generated and can therefore calculate these terms in principle. But how does one do this? That is the question this post will answer for a variety of contexts, from Gaussian linear regression to non-linear classification settings.

The rest of the post is structured as follows. [Section 1](#(1)-background) provides a backgrounder on the supervised ML problem and the integration approaches to the moments of the loss function, [section 2](#(2)-regression) describes the approaches for a regression problem, [section 3](#(3)-classification) discusses how to calculate these quantities for common classification loss functions, and [section 4]() concludes.

This post was generated from an original notebook that can be [found here](https://github.com/erikdrysdale/erikdrysdale.github.io/tree/master/_rmd/extra_loss_moments/post.ipynb). For readability, various code blocks have been suppressed and the text has been tidied up.

## (1) Background

### (1.1) The ML problem 

The goal of supervised machine learning (ML) is to "learn" a function, $f_\theta(x)$, $f: \mathbb{R}^p \to \mathbb{R}^k$, that maps a feature vector $x \in \mathbb{R}^p$ to an approximation of a label $y \in \mathbb{R}^k$. This function, or algorithm, is parameterized by $\theta$. For example, $y$ could be the price of a house, and $x$ could be the characteristics of the house (the number of bedrooms, square footage, etc.). If the label is a scalar (like housing prices), then $k=1$ and the algorithm maps to a scalar as well. But the mapping may be more complex if $y$ is non-negative ($y \in \mathbb{R}^{+}$) or [multiclass](https://en.wikipedia.org/wiki/Multiclass_classification) ($k>1$).

We assume that the actual data generating process (DGP) follows some (unknown) probabilistic joint distribution $P_{\phi}(Z, Y)$, parameterized by $\phi$, where $Z$ represents the true covariates in $\mathbb{R}^q$.[[^1]] The observed feature vector $x$ is hopefully a subset or a function of $z$, i.e., $x = g(z)$, where $g: \mathbb{R}^q \to \mathbb{R}^p$. The true relationship between the true covariates $Z$ and the label $Y$ implies some joint distribution between label and measured covariates: $(Y, X) \sim P_{h(\phi)}$.[[^2]]

In a supervised learning scenario, we have a dataset $\{(y_i, x_i)\}_{i=1}^n$ consisting of $n$ observations, where each $x_i \in \mathbb{R}^p$ is the feature vector and $y_i \in \mathbb{R}^k$ is the corresponding label. We learn an optimal finite-sample $\theta$ by minimizing some measure of training "loss" between the predicted and actual label: $\hat\theta = \arg\min_\theta \hat\ell(y, f_\theta(x))$. How one does this is an entirely different subject that will not be discussed here (see some of my other posts [here](http://www.erikdrysdale.com/auc_max/) or [here](http://www.erikdrysdale.com/ordinal_regression/)).[[^3]] Once the training procedure is complete, hopefully, $f_{\hat\theta}$ approximates the true underlying relationship implied by the true joint distribution and will generalize to new and unseen data drawn from $P_{h(\phi)}$.

After a model is trained (based on some training loss function $\hat\ell$), we are often interested in how well that model will perform on future data for some primary loss function $\ell$.[[^4]] We call the expected value of this loss function the risk of the algorithm: $R(\theta; f) = E_{(y,x) \sim P_{h(\phi)}}[\ell(y, f_\theta(x))]$. Notice that for a given function class, $f$, the risk is parameterized w.r.t. the parameter $\theta$. For example, if $f$ is a linear model, then the risk can be higher or lower depending on how well the coefficients approximate the true DGP: i.e. $X^T \theta \approx Y$. 

We may also be interested in the variance of the loss: $V(\theta; f) = E_{(y,x) \sim P_{h(\phi)}}\big[ (\ell(y, f_\theta(x)) - R(\theta; f))^2 \big]$. While this quantity, the loss variance, is not discussed as often as the risk, it is mainly valuable for understanding how fast the empirical risk converges to the actual risk. Because the empirical analogue of the risk is simply an average of i.i.d. losses, it will asymptotically follow a normal distribution by the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), provided the loss variance is finite. In other words, for a large sample of $n$ new (non-training) observations: $\sqrt{n}(\hat{R} - R) \overset{d}{\to} N(0,V(\theta))$.
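To make the role of the loss variance concrete, here is a minimal sketch of a normal-approximation interval for the risk built from held-out losses. The helper name `risk_confidence_interval` and the toy setting (a model whose squared-error loss equals a squared standard normal, so the true risk is 1 and the true loss variance is 2) are assumptions made purely for illustration.

```python
import math
import random
import statistics

def risk_confidence_interval(losses, z=1.96):
    """Normal-approximation interval for the risk from per-observation losses."""
    n = len(losses)
    r_hat = statistics.fmean(losses)       # empirical risk
    v_hat = statistics.variance(losses)    # empirical loss variance
    half_width = z * math.sqrt(v_hat / n)  # CLT: sd of r_hat is roughly sqrt(V / n)
    return r_hat, (r_hat - half_width, r_hat + half_width)

# Assumed toy setting: y = x + eps with eps ~ N(0, 1), and the model predicts
# f(x) = x, so the squared-error loss on each new observation is eps^2.
rng = random.Random(0)
losses = [rng.gauss(0.0, 1.0) ** 2 for _ in range(5000)]
r_hat, (lo, hi) = risk_confidence_interval(losses)
print(f"empirical risk = {r_hat:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

The interval shrinks at rate $1/\sqrt{n}$, and its width at any given $n$ is governed by the loss variance, which is why knowing $V(\theta)$ is useful.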

### (1.2) The integration problem

#### (1.2.1) Joint integration approach 

Suppose that one knows the true DGP, $P_\phi$, how the observed covariates relate to the true covariates, $x=g(z)$, and how this affects the joint distribution of observed covariates and labels, $P_{h(\phi)}$. Could an oracle calculate the risk $R_f(\theta)$ and the loss variance $V_f(\theta)$ for some given value of $\theta$? The short answer is yes. By the definition of an expectation, calculating the risk means solving the following integral:

$$
\begin{align*}
R_f(\theta) &= \int_{\mathcal{Y} \times \mathcal{X}} \ell(y, f_\theta(x)) dP_{h(\phi)} \\
&= \int_{\mathcal{Y}} \int_{\mathcal{X}} \ell(y, f_\theta(x)) p_{h(\phi)}(y, x) dx dy 
\end{align*}
$$

where $\mathcal{X}$ and $\mathcal{Y}$ are the domains of the random variables $X$ and $Y$, and $p(\cdot)$ denotes the density function. Without loss of generality, I'm notationally assuming that $Y$ and $X$ are continuous.[[^5]] In other words, the equation above says the risk is the integral of the loss weighted by the joint density function. If $x$ or $y$ is multidimensional, this integral can of course become quite complicated. Similarly, the loss variance involves solving another similar-looking integral:

$$
\begin{align*}
V_f(\theta) &= E_{(y,x) \sim P_{h(\phi)}}[\ell^2] - [R_f(\theta)]^2 \\
&= \int_{\mathcal{Y} \times \mathcal{X}} [\ell(y, f_\theta(x))]^2 dP_{h(\phi)} -  [R_f(\theta)]^2\\
&= \int_{\mathcal{Y}} \int_{\mathcal{X}} [\ell(y, f_\theta(x))]^2 p_{h(\phi)}(y, x) dx dy  - [R_f(\theta)]^2
\end{align*}
$$

In other words, once the risk is known, to calculate the variance we simply need to integrate the squared loss function over the joint distribution.
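As a rough illustration of the joint approach, the sketch below approximates both integrals with a midpoint rule on a grid. The setting is an assumption chosen because it has a closed form to check against: a single covariate $X \sim N(0,1)$, $Y = \beta X + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$, a linear model $f_\theta(x) = \theta x$, and squared-error loss. In that case $(Y, X)$ is bivariate normal, the risk is $(\beta-\theta)^2 + \sigma^2$, and the loss variance is twice the squared risk. The function names and grid sizes are illustrative only.

```python
import math

def bvn_density(y, x, s_y, s_x, rho):
    """Density of a zero-mean bivariate normal for (Y, X)."""
    norm = 2.0 * math.pi * s_y * s_x * math.sqrt(1.0 - rho ** 2)
    quad = (y / s_y) ** 2 - 2.0 * rho * (y / s_y) * (x / s_x) + (x / s_x) ** 2
    return math.exp(-quad / (2.0 * (1.0 - rho ** 2))) / norm

def loss_moments_joint(loss, density, y_lim, x_lim, n_grid=400):
    """Midpoint-rule approximation of the risk and loss variance over p(y, x)."""
    dy = 2.0 * y_lim / n_grid
    dx = 2.0 * x_lim / n_grid
    m1 = m2 = 0.0
    for i in range(n_grid):
        y = -y_lim + (i + 0.5) * dy
        for j in range(n_grid):
            x = -x_lim + (j + 0.5) * dx
            weight = density(y, x) * dy * dx
            val = loss(y, x)
            m1 += val * weight        # integral of the loss
            m2 += val * val * weight  # integral of the squared loss
    return m1, m2 - m1 ** 2

# Assumed toy parameters
beta, sigma, theta = 1.0, 0.5, 0.8
s_x = 1.0
s_y = math.sqrt(beta ** 2 * s_x ** 2 + sigma ** 2)
rho = beta * s_x / s_y

risk, loss_var = loss_moments_joint(
    loss=lambda y, x: (y - theta * x) ** 2,
    density=lambda y, x: bvn_density(y, x, s_y, s_x, rho),
    y_lim=8.0 * s_y,
    x_lim=8.0 * s_x,
)
risk_exact = (beta - theta) ** 2 + sigma ** 2
print(risk, risk_exact)
print(loss_var, 2.0 * risk_exact ** 2)
```

A grid like this only scales to very low dimensions, which is one reason the later approaches matter.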

#### (1.2.2) Conditional integration approach

What happens if we are unaware of the joint distribution between the observed labels and covariates, but know the distribution of the covariates, and the distribution of the labels given the covariates? This will also be fine! To keep the notation clear, assume that the true and observed covariates are drawn from distributions $Z \sim P_Z$ and $X \sim P_{X}$, respectively. Similarly, the conditional distribution of the label given the true and observed covariates will be $Y | Z \sim P_{Y|Z}$ and $Y | X \sim P_{Y|X}$, respectively.

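Recall that the law of total expectation says that $E[A] = E_B[E[A | B]]$ for two random variables $A$ and $B$. Applied to the loss, $E[\ell] = E_X[E_{Y|X}[\ell \mid X]]$, so we can re-write the risk integral from the joint integration approach as: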

$$
\begin{align*}
R_f(\theta) &=  \int_{\mathcal{X}} \Bigg[  \int_{\mathcal{Y}} \ell(y, f_\theta(x)) dP_{Y|X} \Bigg] dP_X(x) \\
&=  \int_{\mathcal{X}} \Bigg[  \int_{\mathcal{Y}} \ell(y, f_\theta(x)) p_{Y|X}(y|X=x) dy \Bigg] p_X(x) dx
\end{align*}
$$

Since we tend to think of the covariates as "causing" the label, or at least determining the probability of observing a label, this is often a more intuitive way to think about the DGP. First, mother nature draws some distribution of features (e.g. genetic variation), and these features then cause the labels we observe (e.g. phenotypes). 
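The nested integral can be sketched numerically by integrating over $y$ for each fixed $x$ and then averaging those inner results over the density of $x$. The example below assumes a single covariate $X \sim N(0,1)$, a label $Y = \beta X + (\epsilon - 1)$ with $\epsilon \sim \text{Exp}(1)$, a linear model $f_\theta(x) = \theta x$, and squared-error loss, for which the risk works out to $(\beta - \theta)^2 + 1$. The function names, integration limits, and grid sizes are illustrative assumptions.

```python
import math

def normal_pdf(x):
    """Density of a standard normal covariate."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def inner_expected_loss(x, beta, theta, e_max=30.0, n=1000):
    """E[loss | X = x], integrating over the Exp(1) noise e with y = beta*x + (e - 1)."""
    de = e_max / n
    total = 0.0
    for i in range(n):
        e = (i + 0.5) * de
        y = beta * x + (e - 1.0)
        # squared-error loss weighted by the conditional density exp(-e)
        total += (y - theta * x) ** 2 * math.exp(-e) * de
    return total

def risk_conditional(beta, theta, x_lim=8.0, n=200):
    """Outer integral: average the inner expected loss over the density of x."""
    dx = 2.0 * x_lim / n
    total = 0.0
    for j in range(n):
        x = -x_lim + (j + 0.5) * dx
        total += inner_expected_loss(x, beta, theta) * normal_pdf(x) * dx
    return total

beta, theta = 1.0, 0.8
risk = risk_conditional(beta, theta)
risk_exact = (beta - theta) ** 2 + 1.0
print(risk, risk_exact)
```

Note that the inner integral never needs the joint density of $(Y, X)$, only the conditional density of the noise.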

Furthermore, it may be that analytically it's much easier to characterise the conditional distribution than it is the full joint distribution. For example, if $X \sim \text{MVN}(0, \Sigma)$ and $Y = X^T\beta + (\epsilon - 1)$, where $\epsilon \sim \text{Exp}(1)$ is independent of $X$, then $E[Y] = 0$ and $Y \sim N(-1, \sigma_\beta^2) + \text{Exp}(1)$, where $\sigma_\beta^2=\beta^T\Sigma\beta$. We can find an exact representation of the marginal density of $Y$, $f_{Y}(w) = \exp(\sigma^2_\beta/2 - 1 - w) \cdot \Phi((w+1)/\sigma_\beta  - \sigma_\beta)$, by looking at the [convolution](https://en.wikipedia.org/wiki/Convolution_of_probability_distributions) of these two random variables.[[^6]] While it's trivial to draw from $X$ first and then draw $Y|X$, or to calculate the densities $f_X(x)$ and $f_{Y|X}(y|x)$, there's no easy way to draw $(Y,X)$ simultaneously, or to provide an exact analytical form for $f_{Y,X}$. Thus, for many distributions, it may be easier to characterize the label with respect to the observed covariates than the joint distribution.[[^7]]
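The two-step draw described above (covariates first, then the label given the covariates) can be sketched as follows. The particular $\Sigma$, its hand-computed Cholesky factor, and $\beta$ are assumed values for illustration; with them, $\sigma_\beta^2 = 0.75$, so the sample mean of $Y$ should be near 0 and its sample variance near $\sigma_\beta^2 + 1 = 1.75$.

```python
import math
import random
import statistics

def draw_y_x(beta, chol, rng):
    """Draw one (y, x): x ~ MVN(0, Sigma) via its Cholesky factor, then y given x."""
    z = [rng.gauss(0.0, 1.0) for _ in beta]
    # x = L z, where L is lower-triangular with L L^T = Sigma
    x = [sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(len(beta))]
    eps = rng.expovariate(1.0)  # Exp(1) noise, centred below by subtracting 1
    y = sum(b * xi for b, xi in zip(beta, x)) + (eps - 1.0)
    return y, x

# Assumed example: Sigma = [[1, 0.5], [0.5, 1]] and its Cholesky factor
chol = [[1.0, 0.0], [0.5, math.sqrt(0.75)]]
beta = [1.0, -0.5]

rng = random.Random(1)
draws = [draw_y_x(beta, chol, rng) for _ in range(20000)]
ys = [y for y, _ in draws]
mean_y = statistics.fmean(ys)
var_y = statistics.variance(ys)
print(f"mean of Y = {mean_y:.3f}, variance of Y = {var_y:.3f}")
```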

### (1.3) Monte Carlo integration

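The simplest approach, whenever we can draw from $(Y,X)$, is Monte Carlo integration (hereafter MCI): draw a large sample from the DGP, evaluate the loss on each draw, and use the sample mean and sample variance of those losses as estimates of the risk and the loss variance.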
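A minimal sketch of MCI, assuming a toy Gaussian linear DGP with one covariate ($X \sim N(0,1)$, $Y = \beta X + \epsilon$, $\epsilon \sim N(0,\sigma^2)$), a linear model $f_\theta(x) = \theta x$, and squared-error loss. The helper name `mci_loss_moments` and all parameter values are hypothetical. This setting is chosen only because the answers are known in closed form: the risk is $(\beta-\theta)^2 + \sigma^2$ and the loss variance is twice the squared risk.

```python
import random
import statistics

def mci_loss_moments(draw, predict, loss, n, seed=0):
    """Estimate the risk and loss variance by averaging over n draws of (y, x)."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n):
        y, x = draw(rng)
        losses.append(loss(y, predict(x)))
    risk = statistics.fmean(losses)
    loss_var = statistics.variance(losses, risk)
    return risk, loss_var

# Assumed toy parameters
beta, sigma, theta = 1.0, 0.5, 0.8

def draw(rng):
    """One draw from the assumed DGP: x first, then y given x."""
    x = rng.gauss(0.0, 1.0)
    y = beta * x + rng.gauss(0.0, sigma)
    return y, x

risk, loss_var = mci_loss_moments(
    draw=draw,
    predict=lambda x: theta * x,
    loss=lambda y, yhat: (y - yhat) ** 2,
    n=100_000,
)
risk_exact = (beta - theta) ** 2 + sigma ** 2
print(risk, risk_exact)
print(loss_var, 2.0 * risk_exact ** 2)
```

The same function works for any DGP and loss as long as `draw` can sample from it, at the cost of Monte Carlo error that shrinks at rate $1/\sqrt{n}$.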

## (2) Regression models

### (2.1) Linear regression, Gaussian covariates

### (2.2) Linear regression, non-Gaussian covariates

### (2.3) Non-linear regression, parametric covariates

### (2.4) Non-linear regression, non-parametric covariates


## Footnotes

[^1]: One should be agnostic as to whether $q \ge p$ or $q \le p$. On the one hand, all models will be misspecified in practice since the true DGP is too complex to capture with data (e.g. the psychology of home buyers for housing prices), which would suggest that $q > p$. But we may also collect data that is irrelevant to the DGP, which would push $p > q$ (e.g. genetic measurements). There is a whole field of research that looks into the effect of [omitted variable bias](https://en.wikipedia.org/wiki/Omitted-variable_bias) and [model misspecification](https://en.wikipedia.org/wiki/Statistical_model_specification).

[^2]: It's also possible that there is a discrepancy between the true and observed labels. I've written about an instance of [label error](http://www.erikdrysdale.com/hausman/) in a previous post.

[^3]: Many topics are relevant for model optimization including [mathematical optimization](https://en.wikipedia.org/wiki/Mathematical_optimization), [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization), [model selection](https://en.wikipedia.org/wiki/Model_selection), and [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).

[^4]: The training loss may or may not be the final loss measure we are interested in. For example, a value of $\theta$ could be learned based on the squared loss of the training data, but the final out-of-sample performance may be based on the absolute error. Why the discrepancy? Certain optimization routines work better with (or only with) certain loss functions, meaning the squared error might be the most practical choice, even if it's only an approximate surrogate for the absolute error. For our analysis, however, this distinction is irrelevant: when we discuss the loss function, we will refer to the final performance measure of interest, which may or may not be the training loss.

[^5]: If they are not, we can simply replace the notation with that of a discrete summation: $\sum_{y \in \mathcal{Y}} g(y) f_Y(y)$, for example.

[^6]: The proof for this is a bit involved, but see this proof on [Stack Exchange](https://stats.stackexchange.com/q/467366) for more info.

[^7]: An example of where the joint distribution would be known: if $Y = X^T\beta + \epsilon$ and both $X\sim \text{MVN}(0, \Sigma)$ and $\epsilon \sim N(0, \sigma^2)$ are normal, then $(Y, X) \sim \text{MVN}\Bigg(0, \begin{pmatrix} \beta^T \Sigma \beta + \sigma^2 & \beta^T \Sigma  \\ \Sigma \beta & \Sigma  \end{pmatrix}  \Bigg)$.