# Introduction

Machine Learning problems are generally framed as problems of minimizing some kind of **Loss Function**.
Intuitively, Loss Functions measure how well a model performs on the available data of the problem at hand, and models with lower Losses are preferred.

A natural and important question is which Loss Function to choose for a given problem. There is an extensive list of available Loss Functions (and you can always invent new ones). For some problems the choice is obvious or straightforward (you might usually not even think about it and just pick the default from your library at hand), but this is not always the case. When it is not, it helps to think about the problem from a "first principles" angle in order to choose more wisely.

The goal here is to briefly discuss how Loss Functions in ML problems arise naturally from a Bayesian perspective. 
We'll focus on **supervised learning** problems.



# Loss Functions from a Bayesian perspective


Let's suppose we have $N$ data points available, $D=\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ are the *inputs* and $y_i \in \mathcal{Y}$ the respective *outputs*, or *targets*, and we want to learn some relationship, denoted by $f$, between inputs and targets:

$$f : \mathcal{X} \to \mathcal{Y}$$

This is a general **supervised learning setting**. Since we almost never have *complete information* and our data is corrupted by *noise* of various kinds, it is natural to include uncertainty in the problem using probabilities. *Estimates without any uncertainty statement are meaningless*.

With that let's just define some notation:

- Capital letters $X$ and $Y$ are **[random variables](https://en.wikipedia.org/wiki/Random_variable)**, while small letters $x$, $y$ are *realizations* of the random variables $X$ and $Y$, respectively.

Finally, we define:

$$p(y|x) := \text{Prob}(Y=y | X=x),$$

as the *conditional probability* of observing $Y=y$ *given* that $X=x$ is observed. **So, our supervised learning problem is precisely to estimate $p(y|x)$, given all available information at hand**, $I$, which we can write as $p(y|x, I)$. For continuous $y$, $p(y|x)$ is usually called a *probability density function* (pdf), while for discrete $y$, $p(y|x)$ is a *probability mass function* (pmf).

Now, what is $I$ and how do we leverage it? As said above, $I$ is whatever information about the problem we have at hand and can use to constrain it. The most obvious piece of relevant information is of course the available data $D$. Also, in order to proceed, we assume some kind of **parametrization** of $p(y|x)$ to constrain the solution space and to allow us to use efficient optimization algorithms to find good solutions. The choice of the parametrization is crucial and is what determines the form of the main term of the Loss Function. Let's write this as follows:

$$p(y|x) = p(y | \lambda (x)),$$ (parametrization)

where $p$ is now parametrized by $\lambda$, which can depend on $x$. 

As an example, supposing both $x$ and $y$ are real numbers, one could use the *Gaussian distribution*, where $\lambda (x) = (\mu (x), \sigma (x))$ are the expectation and standard deviation, respectively:

$$p(y|x) = \mathcal{N}(y | \mu(x), \sigma(x)).$$


Our goal in this case would be to estimate $\mu(x)$ and  $\sigma(x)$. 

Now, in ML what we do is build some **model** to predict the parameters $\lambda (x)$. The model is in turn a function, $m$, with its own parameters $\theta$:

$$\lambda (x) = m(x | \theta).$$

Fitting or training a model means finding a suitable set of parameters $\theta$. With that we have in principle solved our problem, because from $\theta$ we get to what we initially wanted to estimate, which is the conditional distribution $p(y|x)$:

$$\theta \to \lambda \to p(y|x).$$

With that, we can finally re-write the conditional distribution as:


$$p(y|x) = p(y | \lambda (x)) = p(y | x, \theta),$$ 

where we replaced the $\lambda (x)$ in the conditioning by $x, \theta$ and omitted the dependence on the model function $m$ to simplify notation. 
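
To make the chain $\theta \to \lambda \to p(y|x)$ concrete, here is a minimal sketch for the Gaussian case. The linear form of `model`, the `log_sigma` parametrization, and the numeric values of `theta` are assumptions chosen only for illustration; they are not prescribed by the discussion above.

```python
import math


def model(x, theta):
    """Hypothetical model m(x | theta): maps an input x to lambda(x) = (mu, sigma)."""
    w, b, log_sigma = theta
    mu = w * x + b               # assumed: the mean depends linearly on x
    sigma = math.exp(log_sigma)  # exp keeps the standard deviation positive
    return mu, sigma


def gaussian_pdf(y, mu, sigma):
    """Density of N(y | mu, sigma), with sigma the standard deviation."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def conditional_density(y, x, theta):
    """p(y | x, theta), obtained through theta -> lambda(x) -> p(y | x)."""
    mu, sigma = model(x, theta)
    return gaussian_pdf(y, mu, sigma)


theta = (2.0, 1.0, 0.0)        # illustrative values: w = 2, b = 1, sigma = exp(0) = 1
mu, sigma = model(3.0, theta)  # lambda(x) evaluated at x = 3
density_at_mean = conditional_density(mu, 3.0, theta)
```

Once `theta` is fixed, `conditional_density` can be evaluated for any pair `(x, y)`, which is exactly what "omitting the dependence on $m$" hides in the notation $p(y | x, \theta)$.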

## Posterior

Our goal is to find $\theta$, but according to which criterion? Well, maybe the simplest one would be: *the most likely, given our data*. We can write precisely that as $p(\theta | D)$ and the solution would be:

$$\theta^* =  \underset{\theta}{\text{argmax }}  p(\theta | D).$$

$p(\theta | D)$ is called the **posterior** distribution of $\theta$ given $D$ and $\theta^*$ is called the **Maximum a Posteriori (MAP)** estimate. We can rewrite $p(\theta | D)$ using [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem):

$$p(\theta | D) = \frac{p(D|\theta)p(\theta)}{p(D)},$$

where: 

- $p(D|\theta)$ is the **likelihood function**,
- $p(\theta)$ is the **prior**,
- $p(D) = \int  p(D|\theta)p(\theta) d\theta$ is the **evidence**.
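
The three ingredients can be evaluated numerically for a one-parameter toy problem. The sketch below assumes a model $y = \theta x + \text{noise}$ with known Gaussian noise and a Gaussian prior on $\theta$; the data values, `SIGMA`, `TAU`, and the grid range are all made up for illustration, and the integral for the evidence is approximated by a simple Riemann sum.

```python
import math

# Toy data, assumed to follow y = theta * x + Gaussian noise.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
SIGMA = 1.0  # assumed known noise standard deviation
TAU = 2.0    # assumed prior standard deviation, theta ~ N(0, TAU)


def normal_pdf(v, mean, std):
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood(theta):
    """p(D | theta) as a product over data points.

    The inputs are treated as fixed, so factors that depend only on x
    are dropped; they cancel when dividing by the evidence.
    """
    return math.prod(normal_pdf(y, theta * x, SIGMA) for x, y in zip(xs, ys))


def prior(theta):
    return normal_pdf(theta, 0.0, TAU)


# Evaluate likelihood * prior on a grid of theta values in [-2, 6].
step = 0.001
grid = [-2.0 + step * k for k in range(8001)]
unnormalized = [likelihood(t) * prior(t) for t in grid]

# Evidence: integral of likelihood * prior over theta (Riemann sum).
evidence = sum(unnormalized) * step
posterior = [u / evidence for u in unnormalized]

# The evidence is a constant in theta, so it does not move the argmax.
theta_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
```

Dividing by `evidence` only rescales the curve so that it integrates to one. The location of its maximum, `theta_map`, is the same for `unnormalized` and `posterior`, which is why the evidence can be ignored when searching for the MAP estimate.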

To increase numerical stability, it is convenient to maximize the **log** of the posterior. Since the logarithm is monotonically increasing, this does not change the location of the maximum:

$$\log p(\theta | D) = \log p(D|\theta) + \log p(\theta) - \log p(D).$$
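
The numerical motivation for working with logs can be seen in a small sketch. The 200 per-sample values of `1e-3` below are hypothetical; the point is only that a product of many small numbers underflows in double precision, while the sum of their logs stays well within range.

```python
import math

# 200 hypothetical per-sample likelihood values, each equal to 1e-3.
per_sample = [1e-3] * 200

# The exact product is 1e-600, far below the smallest positive double,
# so the floating-point result collapses to zero.
product = math.prod(per_sample)

# The sum of logs equals 200 * log(1e-3) and is perfectly representable.
log_sum = sum(math.log(p) for p in per_sample)
```

With `product` equal to zero, every $\theta$ would look equally bad and no optimizer could make progress, whereas `log_sum` still carries the full information.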

Since maximizing a function is equivalent to minimizing its negative, we can identify the negative log posterior as the Loss Function $\mathcal{L}$ of our problem:

$$\text{Loss} = \mathcal{L}(\theta, D) := - \log  p(\theta | D).$$

Further, if we can assume that all $\{(x_i, y_i)\}$ are independent and identically distributed (i.i.d.), and that the inputs $x_i$ do not depend on $\theta$, we have:

$$\log p(D|\theta) = \log \prod\limits_{i} p(y_i, x_i|\theta) = \log \prod\limits_{i} p(y_i|x_i,\theta)p(x_i) = \sum\limits_{i} \log p(y_i|x_i,\theta) + \sum\limits_{i} \log p(x_i),$$

which leads us to:

$$ \mathcal{L}(\theta, D) = - \sum\limits_{i} \log p(y_i|x_i,\theta) - \log p(\theta) + K,$$

where $K = \log p(D) - \sum_{i} \log p(x_i)$ collects the terms that do not depend on $\theta$ and is therefore irrelevant for our optimization.
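
As a rough illustration of this Loss Function in use, the sketch below minimizes it with plain gradient descent for a one-parameter toy problem. Everything specific here is an assumption made for the example: the model $y = \theta x + \text{noise}$, the Gaussian likelihood with known `SIGMA`, the Gaussian prior with standard deviation `TAU`, the data values, and the learning rate. For this particular setup the minimizer is also known in closed form, which gives a convenient check.

```python
import math

# Toy data, assumed to follow y = theta * x + Gaussian noise.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
SIGMA = 1.0  # assumed known noise standard deviation
TAU = 2.0    # assumed prior standard deviation, theta ~ N(0, TAU)

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def loss(theta):
    """Negative log posterior up to the constant K."""
    # First term: - sum_i log p(y_i | x_i, theta) for a Gaussian likelihood.
    nll = sum(
        0.5 * ((y - theta * x) / SIGMA) ** 2 + math.log(SIGMA) + LOG_SQRT_2PI
        for x, y in zip(xs, ys)
    )
    # Second term: - log p(theta) for a zero-mean Gaussian prior.
    neg_log_prior = 0.5 * (theta / TAU) ** 2 + math.log(TAU) + LOG_SQRT_2PI
    return nll + neg_log_prior


def grad(theta):
    """Derivative of loss(theta) with respect to theta."""
    d_nll = -sum((y - theta * x) * x for x, y in zip(xs, ys)) / SIGMA**2
    d_prior = theta / TAU**2
    return d_nll + d_prior


# Plain gradient descent; step size and iteration count are ad hoc choices.
theta = 0.0
learning_rate = 0.05
for _ in range(500):
    theta -= learning_rate * grad(theta)

# Closed-form minimizer for this linear-Gaussian setup, used as a check.
theta_closed_form = sum(x * y for x, y in zip(xs, ys)) / (
    sum(x * x for x in xs) + SIGMA**2 / TAU**2
)
```

In this example the likelihood term reduces to a scaled sum of squared errors and the prior term to a quadratic penalty on $\theta$, so the two pieces of $\mathcal{L}$ play the roles of a data-fit term and a regularizer.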