# Calculating the moments of a loss function

In this post I'll discuss how to calculate the first and second moments of a [loss function](https://en.wikipedia.org/wiki/Loss_function) for machine learning (ML) regression and classification models. The expected value of a loss function is known as its [risk](https://en.wikipedia.org/wiki/Empirical_risk_minimization#Background), and I'll refer to its variance (the second central moment) as the loss variance hereafter. While the risk and the loss variance are not knowable in practice (since we don't know the true data generating process), in a simulation setting the researcher has full knowledge of how the data is generated and can therefore calculate these terms in principle. But how does one do this?

The rest of the post is structured as follows. [Section 1](#(1)-background) provides a backgrounder on the supervised ML problem and the integration approaches to the moments of the loss function, [section 2](#(2)-regression) describes the approaches for a regression problem, [section 3](#(3)-classification) discusses how to calculate these quantities for common classification loss functions, and [section 4]() concludes.

This post was generated from an original notebook that can be [found here](https://github.com/erikdrysdale/erikdrysdale.github.io/tree/master/_rmd/extra_loss_moments/post.ipynb). For readability, various code blocks have been suppressed and the text has been tidied up.

## (1) Background

### (1.1) The ML problem 

The goal of supervised machine learning (ML) is to "learn" a function, $f_\theta(x)$, $f: \mathbb{R}^p \to \mathbb{R}^k$, that maps a feature vector $x \in \mathbb{R}^p$ to a prediction that approximates a label $y$. For example, $y$ could be the price of a house, and $x$ could be the characteristics of the house (the number of bedrooms, square footage, etc.). If the label is a scalar (like housing prices), then $k=1$ and the algorithm maps to a scalar as well. But the mapping may be more complex if $y$ is non-negative ($y \in \mathbb{R}^{+}$) or [multiclass](https://en.wikipedia.org/wiki/Multiclass_classification). This function, or algorithm, is parameterized by $\theta$.

We assume that the actual data generating process (DGP) follows some unknown but probabilistic joint distribution $P_{\phi}(Z, Y)$, parameterized by $\phi$, where $Z$ represents the true covariates in $\mathbb{R}^q$.[[^1]] The observed feature vector $x$ is hopefully a subset or a function of $z$, i.e., $x = g(z)$, where $g: \mathbb{R}^q \to \mathbb{R}^p$. The true relationship between the true covariates $Z$ and the label $Y$ implies some joint distribution between label and measured covariates: $(Y, X) \sim P_{h(\phi)}$.[[^2]]

In a supervised learning scenario, we have a dataset $\{(y_i, x_i)\}_{i=1}^n$ consisting of $n$ observations, where each $x_i \in \mathbb{R}^p$ is the feature vector and $y_i \in \mathbb{R}^k$ is the corresponding label. We learn an optimal finite-sample $\theta$ by minimizing some measure of training "loss" between the predicted and actual label: $\hat\theta = \arg\min_\theta \hat\ell(y, f_\theta(x))$. Hopefully, $f_{\hat\theta}$ approximates the true underlying relationship implied by the joint distribution $P_{\phi}(Z, Y)$, and will generalize to new and unseen data drawn from $P_{h(\phi)}$.

The training loss may or may not be the final loss measure we are interested in. For example, a value of $\theta$ could be learned based on the squared loss of the training data, but the final out-of-sample performance may be based on the absolute error. Why the discrepancy? Certain optimization routines work better with (or only with) certain loss functions, meaning the squared error might be the best loss to use in practice, even if it's only an approximate surrogate for the absolute error.
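
To make the distinction between training loss and evaluation loss concrete, here is a minimal sketch in which a linear model is fit by minimizing the squared error and then scored out of sample on both the squared and the absolute error. The DGP (standard Gaussian covariates, the coefficients `beta_true`, unit-variance noise) and the helper `draw` are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP (assumed for illustration): y = x'beta_true + noise
p = 3
beta_true = np.array([1.0, -2.0, 0.5])

def draw(n):
    """Draw n (y, x) pairs from the assumed DGP."""
    x = rng.standard_normal((n, p))
    y = x @ beta_true + rng.standard_normal(n)
    return y, x

# Training loss: squared error, minimized in closed form by least squares
y_train, x_train = draw(500)
theta_hat, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

# Evaluation losses on fresh data: squared error and absolute error
y_test, x_test = draw(100_000)
resid = y_test - x_test @ theta_hat
mse = np.mean(resid ** 2)
mae = np.mean(np.abs(resid))
print(f"theta_hat={theta_hat.round(3)}, MSE={mse:.3f}, MAE={mae:.3f}")
```

The same $\hat\theta$ is scored under two different losses here; nothing about the fitting step guarantees that it is the best parameter for the absolute error.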

After a model is trained (based on some training loss function $\hat\ell$), we are often interested in how well that model will do on future data for some primary loss function $\ell$ (which may or may not be the same as the training loss function). We call the expected value of this loss function the risk of the algorithm: $R(\theta; f) = E_{(y,x) \sim P_{h(\phi)}}[\ell(y, f_\theta(x))]$. Notice that for a given function class, $f$, the risk is parameterized w.r.t. the parameter $\theta$. This makes sense, since if $f$ is a linear regression model, then the risk can be higher or lower depending on how well the coefficients approximate the true DGP.
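
In a simulation, the simplest way to approximate the risk for a fixed $\theta$ is to draw a very large sample from the DGP and average the losses. The sketch below does this for the squared error under an assumed linear Gaussian DGP; the values of `beta`, `sigma`, and `theta` are hypothetical. In this particular setup the prediction error is $N(0, \|\beta - \theta\|^2 + \sigma^2)$, so the risk has a closed form to compare against.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical DGP (assumed for illustration): x ~ N(0, I), y = x'beta + sigma*eps
beta = np.array([1.0, -2.0, 0.5])
sigma = 1.0
# Hypothetical fixed parameter value whose risk we want to evaluate
theta = np.array([0.8, -1.5, 0.0])

def squared_loss(y, yhat):
    return (y - yhat) ** 2

# Monte Carlo: draw a large sample from the DGP and average the losses
n_sim = 1_000_000
x = rng.standard_normal((n_sim, beta.size))
y = x @ beta + sigma * rng.standard_normal(n_sim)
losses = squared_loss(y, x @ theta)

risk_mc = losses.mean()
risk_se = losses.std(ddof=1) / np.sqrt(n_sim)  # Monte Carlo standard error

# Closed form for this setup: E[(y - x'theta)^2] = ||beta - theta||^2 + sigma^2
risk_exact = np.sum((beta - theta) ** 2) + sigma ** 2
print(f"MC risk={risk_mc:.4f} (se={risk_se:.4f}), exact={risk_exact:.4f}")
```

Changing `theta` moves the risk up or down, which is the sense in which the risk is parameterized by $\theta$ for a given function class.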

We may also be interested in the variance of the loss: $V(\theta; f) = E_{(y,x) \sim P_{h(\phi)}}\big[(\ell(y, f_\theta(x)) - R(\theta; f))^2\big]$. Although this quantity is discussed less often, it is useful mainly for understanding how quickly the empirical risk converges to the actual risk. Because the empirical risk $\hat{R}$ is simply an average of losses, it will asymptotically follow a normal distribution by the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem). In other words, for a large sample of $n$ new (non-training) observations: $\sqrt{n}(\hat{R} - R(\theta; f)) \overset{d}{\to} N(0, V(\theta; f))$.
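
The normal approximation can be checked by simulation. The sketch below assumes, purely for illustration, that the prediction error $y - f_\theta(x)$ is $N(0, s^2)$ and that the loss is the absolute error, so the loss is half-normal with known mean $s\sqrt{2/\pi}$ and variance $s^2(1 - 2/\pi)$. To keep it short, it simulates the errors directly rather than the pair $(y, x)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup (assumed for illustration): error e = y - f_theta(x) ~ N(0, s^2)
s = 1.5
n, n_reps = 1000, 2000  # test-set size and number of simulated test sets

# Known moments of the absolute error |e| (half-normal distribution)
risk = s * np.sqrt(2 / np.pi)
loss_var = s ** 2 * (1 - 2 / np.pi)

# Empirical risk on each simulated test set
errors = s * rng.standard_normal((n_reps, n))
risk_hat = np.abs(errors).mean(axis=1)

# The scaled deviation should be approximately N(0, loss_var)
z = np.sqrt(n) * (risk_hat - risk)
var_ratio = z.var(ddof=1) / loss_var
coverage = np.mean(np.abs(z) <= 1.96 * np.sqrt(loss_var))
print(f"variance ratio={var_ratio:.3f}, 95% interval coverage={coverage:.3f}")
```

A variance ratio near one and coverage near 95% are what the limiting distribution predicts; the loss variance is what sets the width of the interval around the empirical risk.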

### (1.2) The integration problem

Suppose that one knows the true DGP, $P_\phi$, how the observed covariates relate to the true covariates, $x=g(z)$, and how this affects the joint distribution of observed covariates and labels, $P_{h(\phi)}$. Could an oracle calculate $R(\theta; f)$ and $V(\theta; f)$ for some given value of $\theta$?
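
If $P_{h(\phi)}$ has a density $p_{h(\phi)}(y, x)$, the risk is the integral $\int\int \ell(y, f_\theta(x)) \, p_{h(\phi)}(y, x) \, dy \, dx$, and the loss variance is the same integral applied to $\ell^2$, minus the squared risk. As a minimal sketch of what such a calculation could look like, the code below evaluates both integrals with Gauss-Hermite quadrature for a toy problem that is assumed here for illustration only: a single covariate $x \sim N(0,1)$, $y = \beta x + \sigma\epsilon$ with $\epsilon \sim N(0,1)$, $f_\theta(x) = \theta x$, and the squared error loss. The values of `beta`, `sigma`, and `theta` are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Hypothetical toy DGP (assumed for illustration): y = beta*x + sigma*eps,
# with x and eps independent standard normals
beta, sigma = 2.0, 1.0
theta = 1.5  # hypothetical slope of the model f_theta(x) = theta*x

# Gauss-Hermite nodes/weights for the weight function exp(-t^2/2);
# dividing by sqrt(2*pi) turns the weights into standard normal probabilities
nodes, weights = hermegauss(20)
weights = weights / np.sqrt(2 * np.pi)

# Tensor-product grid over (x, eps) with the matching product weights
x, eps = np.meshgrid(nodes, nodes, indexing="ij")
w2d = np.outer(weights, weights)

# Squared error loss evaluated at every grid point
loss = (beta * x + sigma * eps - theta * x) ** 2

# First moment, and second moment minus the squared first moment
risk = np.sum(w2d * loss)
loss_var = np.sum(w2d * loss ** 2) - risk ** 2

# Closed forms for this toy problem: the error is N(0, s2) with
# s2 = (beta - theta)^2 + sigma^2, so risk = s2 and loss variance = 2*s2^2
s2 = (beta - theta) ** 2 + sigma ** 2
print(risk, s2, loss_var, 2 * s2 ** 2)
```

Quadrature is exact here because the integrands are low-degree polynomials in Gaussian variables; for other losses or covariate distributions it would only be an approximation, and a tensor-product grid becomes expensive as the dimension grows.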

## (2) Regression models

### (2.1) Linear regression, Gaussian covariates

### (2.2) Linear regression, non-Gaussian covariates

### (2.3) Non-linear regression, parametric covariates

### (2.4) Non-linear regression, non-parametric covariates


## Footnotes

[^1]: One should be agnostic as to whether $q \ge p$ or $q \le p$. On the one hand, all models will be misspecified in practice since the true DGP is too complex to capture with data (e.g. the psychology of home buyers for housing prices), which would suggest that $q > p$. But we may also collect data that is irrelevant to the DGP (e.g. genetic measurements), which would push $p > q$. There is a whole field of research that looks into the effect of [omitted variable bias](https://en.wikipedia.org/wiki/Omitted-variable_bias) and [model misspecification](https://en.wikipedia.org/wiki/Statistical_model_specification).

[^2]: It's also possible that there is a discrepancy between the true and observed labels. I've written about an instance of [label error](http://www.erikdrysdale.com/hausman/) in a previous post.