# Introduction




# Loss Functions from a Bayesian perspective


Let's suppose we have $N$ data points available, $D=\{(x_i, y_i)\}_{i=1} ^ N$, where $x_i \in \mathcal{X}$ are the *inputs* and $y_i \in \mathcal{Y}$ the respective *outputs*, or *targets*, and we want to learn some relationship, denoted by $f$, between inputs and targets:

$$f : \mathcal{X} \to \mathcal{Y}$$

This is a general **supervised learning setting**. Since we almost never have *complete information* and our data is corrupted by all kinds of *noise*, it is natural to incorporate uncertainty into the modelling process by using probabilities. *Estimates without any uncertainty statement are meaningless*.

With that let's just define some notation:

- Capital letters $X$ and $Y$ are **[random variables](https://en.wikipedia.org/wiki/Random_variable)**, while small letters $x$, $y$ are *realizations* of the random variables $X$ and $Y$, respectively.

Finally, we define:

$$p(y|x) := \text{Prob}(Y=y | X=x),$$

as the *conditional probability* of observing a $Y=y$ *given* that $X=x$ is observed. **So, our supervised learning problem is precisely to estimate $p(y|x)$, given all available information at hand**, which we denote by $I$.
Formally, we should therefore write $p(y|x, I)$, but we'll omit $I$ to simplify notation. For continuous $y$, $p(y|x)$ is usually called a *probability density function* (pdf), while for discrete $y$, $p(y|x)$ is a *probability mass function* (pmf).

Now, what is $I$ and how do we leverage it? As said above, $I$ is whatever information about the problem we have at hand and can use to constrain it. The most obvious piece of relevant information is of course the available data $D$. Also, in order to proceed, we assume some kind of **parametrization** of $p(y|x)$ to constrain the solution space and to allow us to use efficient optimization algorithms to find good solutions. The choice of the parametrization is crucial and is what determines the form of the main term of the loss function. Let's write this as follows:

$$p(y|x) = p(y | \lambda (x)),$$ (parametrization)

where $p$ is now parametrized by $\lambda$, which can depend on $x$. {eq}`parametrization` is called the **likelihood function**.


As an example, supposing both $x$ and $y$ are real numbers, one could use the *Gaussian distribution*, where $\lambda (x) = (\mu (x), \sigma (x))$ are the expectation and standard deviation, respectively:

$$p(y|x) = \mathcal{N}(y | \mu(x), \sigma(x)),$$ (gaussian)


Our goal in this case would be to estimate $\mu(x)$ and $\sigma(x)$. In practice, people often care only about $\mu(x)$ and ignore $\sigma(x)$ completely.

Now, in ML what we do is to build some **model** to predict the parameters $\lambda (x)$. The model is in turn a function, $m$, that is parametrized by its own parameters $\theta$:

$$\lambda (x) = m(x | \theta).$$


$m$ can be any suitable ML model, like neural networks, decision trees, boosting, etc.
Fitting or training a model means finding a suitable set of parameters $\theta$. When we find $\theta$, we have in principle solved our problem, because we easily get to what we initially wanted to estimate, which is the conditional distribution $p(y|x)$:

$$\theta \to \lambda \to p(y|x).$$


With that, we can finally re-write the conditional distribution as:


$$p(y|x) = p(y | \lambda (x)) = p(y | x, \theta),$$ 

where we replaced the $\lambda (x)$ in the conditioning by $x, \theta$ and omitted the dependence on the model function $m$ to simplify notation. 
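
The chain $\theta \to \lambda \to p(y|x)$ can be sketched in a few lines of code. The snippet below is only a minimal illustration under assumptions that are not part of the text above: the model `m` is a hypothetical linear model with four parameters, and the likelihood is the Gaussian from {eq}`gaussian`.

```python
import numpy as np

def m(x, theta):
    """Hypothetical model m(x | theta): maps x to lambda(x) = (mu(x), sigma(x))."""
    w, b, v, c = theta
    mu = w * x + b
    sigma = np.exp(v * x + c)  # exp keeps the standard deviation strictly positive
    return mu, sigma

def log_p_y_given_x(y, x, theta):
    """log p(y | x, theta) for a Gaussian likelihood with parameters lambda(x) = m(x | theta)."""
    mu, sigma = m(x, theta)
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

# Illustrative parameter values (assumed, not fitted to any data).
theta = np.array([2.0, 1.0, 0.0, np.log(0.5)])
x = np.array([0.0, 1.0])
y = np.array([1.0, 3.0])

log_p = log_p_y_given_x(y, x, theta)
print(log_p)
```

Once $\theta$ is fixed, evaluating the conditional density at any $(x, y)$ pair only requires a call to the model followed by the likelihood formula.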

In short, the information $I$ we are leveraging so far to frame our problem in a solvable way can be written as:

$$I = \{\text{available data}, \text{data parametrization}, \text{model parametrization} \} = \{D, \lambda, \theta \}.$$


## Predictive Distribution

From a full Bayesian perspective we can write:

$$p(y | x, D) = \int p(y, \theta | x, D) d\theta = \int p(y | x, \theta)p(\theta | D) d\theta, $$ (predictive_dist)

where in the first equality we used marginalization over the model parameters $\theta$ and in the second equality we used the [product rule](https://en.wikipedia.org/wiki/Chain_rule_(probability)), assuming that $\theta$ does not depend on the new input $x$ and that, given $\theta$, $y$ is independent of $D$. Moreover, the term $p(\theta | D)$ is called the **posterior** probability distribution of $\theta$ given $D$, which can be understood as the probability of some model (parametrized by $\theta$) in light of the available data $D$. What we are doing in {eq}`predictive_dist` is obtaining our final estimate $p(y | x, D)$ as an average of the likelihood function, weighted by the posterior probability of each candidate model. Formally this is completely fine, but in practice it is usually not feasible (or too costly) to compute the integral over model parameters in {eq}`predictive_dist`.
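
When samples from the posterior are available, the integral in {eq}`predictive_dist` can be approximated by a Monte Carlo average of the likelihood over those samples. The sketch below assumes a deliberately simple toy case that is not taken from the text: the model has a single parameter $\theta$ (the mean of a Gaussian with known standard deviation, with no dependence on $x$), and the posterior is assumed to be Gaussian with made-up mean and standard deviation. In this special case the exact predictive distribution is known, which lets us check the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(y, mu, sigma):
    """Density of N(y | mu, sigma), with sigma the standard deviation."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

sigma = 1.0      # assumed known noise standard deviation
post_mean = 0.3  # assumed posterior mean of theta given D
post_std = 0.2   # assumed posterior standard deviation of theta given D

# Draw samples theta_s ~ p(theta | D).
theta_samples = rng.normal(post_mean, post_std, size=100_000)

# Monte Carlo estimate of p(y | D): average the likelihood over posterior samples.
y = 1.0
predictive_mc = normal_pdf(y, theta_samples, sigma).mean()

# Exact result for this Gaussian toy case: N(y | post_mean, sqrt(sigma^2 + post_std^2)).
predictive_exact = normal_pdf(y, post_mean, np.sqrt(sigma**2 + post_std**2))

print(predictive_mc, predictive_exact)
```

For realistic models the hard part is obtaining the posterior samples in the first place, which is exactly the cost mentioned above.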


## Maximum a Posteriori estimate (MAP)

In order to bypass the complexity of computing {eq}`predictive_dist`, we could choose some specific "best" model in some sense, given by some $\theta^*$. But which one? Maybe the simplest one would be: *the most likely, given our data*. We can write precisely that as:

$$\theta^* =  \underset{\theta}{\text{argmax }}  p(\theta | D).$$

$p(\theta | D)$ is the **posterior** distribution of $\theta$ given $D$ and $\theta^*$ is called the **Maximum a Posteriori (MAP)** estimate. We can rewrite $p(\theta | D)$ using [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem):

$$p(\theta | D) = \frac{p(D|\theta)p(\theta)}{p(D)},$$

where: 

- $p(D|\theta)$ is the **likelihood function**,
- $p(\theta)$ is the **prior**,
- $p(D) = \int  p(D|\theta)p(\theta) d\theta$ is the **evidence**.
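
To make the three ingredients concrete, here is a minimal sketch that evaluates them on a grid for a one-parameter toy problem. Everything in it is an assumption chosen for illustration and is not part of the supervised setting above: the data are 7 successes in 10 Bernoulli trials, $\theta$ is the success probability, and the prior is a Beta(2, 2) density.

```python
import numpy as np

# Grid over the single parameter theta in [0, 1].
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]

k, n = 7, 10  # assumed toy data: k successes out of n trials

likelihood = theta**k * (1.0 - theta) ** (n - k)  # p(D | theta), up to a constant factor
prior = 6.0 * theta * (1.0 - theta)               # Beta(2, 2) density, p(theta)

# Evidence p(D): integral of likelihood * prior, approximated by a Riemann sum.
evidence = np.sum(likelihood * prior) * dtheta

posterior = likelihood * prior / evidence         # p(theta | D) on the grid
theta_map = theta[np.argmax(posterior)]

print(theta_map)
```

Note that the evidence only rescales the posterior, so the location of the maximum does not depend on it.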

To increase numerical stability, it is convenient to maximize the **log** of the posterior (which does not change the maximum):

$$\log p(\theta | D) = \log p(D|\theta) + \log p(\theta) - \log p(D).$$
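
The numerical stability argument can be seen directly in floating point arithmetic. The sketch below uses made-up probability values (2000 factors of 0.01) purely for illustration: their product underflows in double precision, while the sum of their logarithms stays finite.

```python
import numpy as np

# Assumed toy values: 2000 independent factors, each equal to 0.01.
probs = np.full(2000, 0.01)

product = np.prod(probs)         # true value is 1e-4000, which underflows in float64
log_sum = np.sum(np.log(probs))  # the same quantity in log space

print(product)
print(log_sum)
```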

Since maximizing a function is equivalent to minimizing its negative, we can identify the negative log posterior as the loss function $\mathcal{L}$ of our problem:

$$\text{Loss} = \mathcal{L}(\theta, D) := - \log  p(\theta | D).$$

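Further, if we can assume that all $\{(x_i, y_i)\}$ are independent and identically distributed (i.i.d.), and that the inputs do not depend on the model parameters, i.e. $p(x_i|\theta) = p(x_i)$, we have:

$$\log p(D|\theta) = \log \prod\limits_{i} p(y_i, x_i|\theta) = \log \prod\limits_{i} p(y_i|x_i,\theta)p(x_i) = \sum\limits_{i} \log p(y_i|x_i,\theta) + \sum\limits_{i} \log p(x_i),$$ 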

which leads us to:

$$ \mathcal{L}(\theta, D) = - \sum\limits_{i} \log p(y_i|x_i,\theta) - \log p(\theta) + K,$$

where $K = \log p(D) - \sum_{i} \log p(x_i)$ is a constant that does not depend on $\theta$ and can therefore be ignored in the optimization.
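
As a concrete instance of this loss, the sketch below assumes a one-parameter linear model $\mu(x) = \theta x$, a Gaussian likelihood with known standard deviation `sigma`, and a zero-mean Gaussian prior on $\theta$ with standard deviation `tau`. These choices, the synthetic data and all variable names are illustrative assumptions. Under them, the first term of the loss is a scaled sum of squared errors and the prior term is an L2 penalty, so the MAP estimate has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = 1.5 * x + Gaussian noise.
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(scale=0.5, size=50)

sigma = 0.5  # assumed known noise standard deviation of the likelihood
tau = 1.0    # assumed prior standard deviation: theta ~ N(0, tau^2)

def loss(theta):
    """Negative log posterior up to the constant K."""
    neg_log_likelihood = np.sum(0.5 * ((y - theta * x) / sigma) ** 2)
    neg_log_prior = 0.5 * (theta / tau) ** 2
    return neg_log_likelihood + neg_log_prior

# Setting the derivative of the loss to zero gives the closed-form MAP estimate.
theta_map = np.sum(x * y) / (np.sum(x**2) + sigma**2 / tau**2)

print(theta_map, loss(theta_map))
```

Dropping the prior term (equivalently, letting `tau` grow very large) recovers the ordinary least squares solution, which is the maximum likelihood estimate for this model.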

## Entropy

As an interesting side note, let us rewrite the log likelihood term as follows:


$$\log p(D|\theta) = \log \prod\limits_{i=1}^N p(y_i, x_i|\theta) = \sum \limits_{i=1}^N \log p(y_i, x_i|\theta).$$

It turns out that if we divide the last sum by $N$, we obtain a sample estimate of the expectation of $\log p$:


$$\langle \log p(x, y | \theta) \rangle \approx \frac{1}{N}\sum \limits_{i=1}^N \log p(y_i, x_i|\theta).$$

Minus the expected value of $\log p(x,y)$, taken under $p$ itself, is precisely the definition of the [joint differential entropy](https://en.wikipedia.org/wiki/Joint_entropy), $h(X,Y)$:

$$h(X, Y | \theta) := -\langle \log p(x, y | \theta) \rangle = -\int\limits_{\mathcal{X}, \mathcal{Y}} p(x,y | \theta) \log p(x,y | \theta) dxdy.$$

Strictly speaking, the samples $(x_i, y_i)$ come from the distribution that generated the data and not from the model, so the sample average estimates a *cross-entropy* between the data distribution and the model, which coincides with $h(X, Y | \theta)$ when the model matches the data distribution. With that caveat in mind, this is basically saying that **maximizing the log likelihood function is equivalent to minimizing (an estimate of) the joint differential entropy**. This makes sense if we think about entropy as measuring our lack of knowledge (ignorance) about random variables: *when we fit a model to describe our data, we want it to capture as much information about it as possible, or, equivalently, we want it to reduce our ignorance about it as much as possible*.
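
The link between the average log density and the entropy can be checked numerically. The sketch below assumes a toy joint distribution that is not taken from the text: $x$ and $y$ are independent zero-mean Gaussians with made-up standard deviations, and the samples are drawn from that same distribution, so that minus the average log density should approach the analytic joint differential entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma_x, sigma_y = 1.0, 2.0  # assumed standard deviations of the toy joint distribution
N = 200_000

# Samples drawn from the same distribution whose density we evaluate.
x = rng.normal(0.0, sigma_x, size=N)
y = rng.normal(0.0, sigma_y, size=N)

def log_normal_pdf(z, mu, sigma):
    """Log of N(z | mu, sigma), with sigma the standard deviation."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((z - mu) / sigma) ** 2

# Independence: log p(x, y) = log p(x) + log p(y).
log_p_joint = log_normal_pdf(x, 0.0, sigma_x) + log_normal_pdf(y, 0.0, sigma_y)

# Minus the sample average of log p estimates the joint differential entropy.
entropy_estimate = -np.mean(log_p_joint)

# Analytic entropy of two independent Gaussians: sum of 0.5 * log(2 * pi * e * sigma^2).
entropy_exact = (0.5 * np.log(2.0 * np.pi * np.e * sigma_x**2)
                 + 0.5 * np.log(2.0 * np.pi * np.e * sigma_y**2))

print(entropy_estimate, entropy_exact)
```

If the samples came from a different distribution than the one used to evaluate `log_p_joint`, the same average would estimate a cross-entropy instead, which is never smaller than the entropy of the sampling distribution.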