# 1 Introduction

A large set of $N$ examples $\{x_1,\dots,x_N\}$, called a training set, is used to tune the parameters of an adaptive model. The categories of the examples in the training set are known in advance, typically by inspecting them individually and hand-labelling them. We can express the category of an example using a target vector $t$. There is one such target vector $t$ for each $x$.

The result of running the machine learning algorithm can be expressed as a function $y\left(x\right)$ which takes a new $x$ as input and that generates an output vector $y$, encoded in the same way as the target vectors. The precise form of the function $y\left(x\right)$ is determined during the training phase, also known as the learning phase, on the basis of the training data. Once the model is trained it can then determine the identity of new data, which are said to comprise a test set. The ability to categorize correctly new examples that differ from those used for training is known as generalization. In practical applications, the variability of the input vectors will be such that the training data can comprise only a tiny fraction of all possible input vectors, and so generalization is a central goal in pattern recognition.

For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve. This preprocessing stage is sometimes also called feature extraction. Note that new test data must be preprocessed using the same steps as the training data.

This kind of preprocessing represents a form of dimensionality reduction. Preprocessing might also be performed in order to speed up computation. Care must be taken during preprocessing because often information is discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer.

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases in which the aim is to assign each input vector to one of a finite number of discrete categories are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.

In other pattern recognition problems, the training data consists of a set of input vectors $x$ without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, which is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

## 1.1 Example: Polynomial Curve Fitting

Now suppose that we are given a training set comprising $N$ observations of $x$, written $\mathbf{x}\equiv\left(x_1,\dots,x_N\right)^\top$, together with corresponding observations of the values of $t$, denoted $\mathbf{t}\equiv\left(t_1,\dots,t_N\right)^\top$. The input data set $\mathbf{x}$ was generated by choosing values of $x_n$, for $n=1,\dots,N$, spaced uniformly in the range $\left[0,1\right]$, and the target data set $\mathbf{t}$ was obtained by first computing the corresponding values of the function $\sin\left(2\pi x\right)$ and then adding a small level of random noise having a Gaussian distribution to each such point in order to obtain the corresponding value $t_n$.

Our goal is to exploit this training set in order to make predictions of the value $\hat{t}$ of the target variable for some new value $\hat{x}$ of the input variable.

We shall fit the data using a polynomial function of the form
$$y\left(x, \mathbf{w}\right)=w_0+w_1x+w_2x^2+\dots+w_Mx^M=\sum_{j=0}^Mw_jx^j \quad\quad\quad\left(1.1\right)$$
where $M$ is the order of the polynomial, and $x^j$ denotes $x$ raised to the power of $j$. The polynomial coefficients $w_0,\dots,w_M$ are collectively denoted by the vector $\mathbf{w}$. Note that, although the polynomial function $y\left(x, \mathbf{w}\right)$ is a nonlinear function of $x$, it is a linear function of the coefficients $\mathbf{w}$. Functions which are linear in the unknown parameters have important properties and are called linear models.

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function $y\left(x, \mathbf{w}\right)$, for any given value of $\mathbf{w}$, and the training set data points. One simple choice of error function, which is widely used, is given by the sum of the squares of the errors between the predictions $y\left(x_n, \mathbf{w}\right)$ for each data point $x_n$ and the corresponding target values $t_n$, so that we minimize
$$E\left(\mathbf{w}\right)=\frac{1}{2}\sum_{n=1}^{N}\{y\left(x_n,\mathbf{w}\right)-t_n\}^2$$
where the factor of 1/2 is included for later convenience. Note that it is a nonnegative quantity that would be zero if, and only if, the function $y\left(x, \mathbf{w}\right)$ were to pass exactly through each training data point.

The resulting polynomial is given by the function $y\left(x,\mathbf{w}^*\right)$, where $\mathbf{w}^*=\arg\min E\left(\mathbf{w}\right)$.
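As a minimal sketch of this fit, assuming NumPy is available, the following generates synthetic data in the way described above and minimizes the sum-of-squares error with `numpy.linalg.lstsq`. The sample size ($N=10$), the noise standard deviation (0.3), the random seed, and the helper names `make_data`, `design_matrix`, `fit_polynomial` and `sum_of_squares_error` are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def make_data(n_points, noise_std, rng):
    """Uniformly spaced x in [0, 1] and t = sin(2*pi*x) plus Gaussian noise."""
    x = np.linspace(0.0, 1.0, n_points)
    t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, size=n_points)
    return x, t

def design_matrix(x, order):
    """Return the N x (M+1) matrix whose column j holds x**j."""
    return np.vander(x, order + 1, increasing=True)

def fit_polynomial(x, t, order):
    """Return w* minimizing E(w) = 0.5 * sum((y(x_n, w) - t_n)**2)."""
    w_star, *_ = np.linalg.lstsq(design_matrix(x, order), t, rcond=None)
    return w_star

def sum_of_squares_error(w, x, t):
    """Evaluate E(w) on the data set (x, t)."""
    residual = design_matrix(x, len(w) - 1) @ w - t
    return 0.5 * float(residual @ residual)

rng = np.random.default_rng(0)  # seed chosen arbitrarily
x_train, t_train = make_data(10, 0.3, rng)
w_star = fit_polynomial(x_train, t_train, order=3)
print(w_star, sum_of_squares_error(w_star, x_train, t_train))
```

Because $y\left(x,\mathbf{w}\right)$ is linear in $\mathbf{w}$, the minimization is an ordinary linear least-squares problem, which is why a single library call suffices here.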

There remains the problem of choosing the order $M$ of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection.

When we go to a much higher order polynomial ($M=9$), the polynomial passes exactly through each data point and $E\left(\mathbf{w}^*\right)=0$. However, the fitted curve oscillates wildly and gives a very poor representation of the function. This latter behaviour is known as over-fitting.

To compare errors across data sets, it is convenient to use the root-mean-square (RMS) error, defined by
$$E_{RMS}=\sqrt{2E\left(\mathbf{w}^*\right)/N}$$
in which the division by $N$ allows us to compare different sizes of data sets on an equal footing, and the square root ensures that $E_{RMS}$ is measured on the same scale (and in the same units) as the target variable $t$.

What is happening is that the more flexible polynomials with larger values of $M$ are becoming increasingly tuned to random noise on the target values.
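The effect can be illustrated with a short sketch, assuming NumPy, that compares $E_{RMS}$ on the training set with $E_{RMS}$ on an independently drawn test set for several orders $M$. The data sizes (10 training and 100 test points), noise level, seed, and function names are assumptions made for illustration. With 10 training points, the $M=9$ polynomial has enough coefficients to interpolate every training point.

```python
import numpy as np

def fit_polynomial(x, t, order):
    """Least-squares coefficients w* for an order-M polynomial."""
    phi = np.vander(x, order + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(phi, t, rcond=None)
    return w_star

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), with E(w) the sum-of-squares error."""
    residual = np.vander(x, len(w), increasing=True) @ w - t
    error = 0.5 * float(residual @ residual)
    return float(np.sqrt(2.0 * error / len(x)))

rng = np.random.default_rng(1)  # arbitrary seed
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, 10)
x_test = rng.uniform(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.3, 100)

# Print order, training RMS error, and test RMS error for each model.
for order in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, order)
    print(order, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```

The training error can only decrease as $M$ grows, since each lower-order polynomial is a special case of the higher-order one; the test error is the quantity that reveals over-fitting.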

One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model.

We shall see that the least squares approach to finding the model parameters represents a specific case of maximum likelihood, and that the over-fitting problem can be understood as a general property of maximum likelihood.

One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form
$$\tilde{E}=\frac{1}{2}\sum_{n=1}^N\{y\left(x_n,\mathbf{w}\right)-t_n\}^2+\frac{\lambda}{2}\|\mathbf{w}\|^2$$
where $\|\mathbf{w}\|^2\equiv\mathbf{w}^\top\mathbf{w}=w_0^2+w_1^2+\dots+w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient $w_0$ is omitted from the regularizer.
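A minimal sketch of minimizing this regularized error, assuming NumPy: because $\tilde{E}$ is quadratic in $\mathbf{w}$, setting its gradient to zero gives the linear system $\left(\lambda\mathbf{I}+\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\right)\mathbf{w}=\boldsymbol{\Phi}^\top\mathbf{t}$, where $\boldsymbol{\Phi}$ is the matrix with entries $x_n^j$. The symbol $\boldsymbol{\Phi}$, the function name `fit_regularized`, the data, and the values of $\lambda$ are assumptions for illustration. This version penalizes $w_0$ along with the other coefficients.

```python
import numpy as np

def fit_regularized(x, t, order, lam):
    """Minimize 0.5*sum((y - t)**2) + 0.5*lam*||w||**2 via its normal equations."""
    phi = np.vander(x, order + 1, increasing=True)   # entries x_n ** j
    a = lam * np.eye(order + 1) + phi.T @ phi
    return np.linalg.solve(a, phi.T @ t)

rng = np.random.default_rng(2)  # arbitrary seed
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 10)

# Larger lambda shrinks the coefficient vector towards zero.
for lam in (1e-8, 1e-3, 1.0):
    w = fit_regularized(x, t, order=9, lam=lam)
    print(lam, float(np.linalg.norm(w)))
```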

If we were trying to solve a practical application using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity. The results above suggest a simple way of achieving this, namely by taking the available data and partitioning it into a training set, used to determine the coefficients $\mathbf{w}$, and a separate validation set, also called a hold-out set, used to optimize the model complexity.

## 1.2 Probability Theory

A key concept in the field of pattern recognition is that of uncertainty. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty.

The Rules of Probability
$$ \mathbf{sum\;rule}\quad p\left(X\right)=\sum_Y p\left(X,Y\right) \\
\mathbf{product\;rule}\quad p\left(X,Y\right)=p\left(Y|X\right)p\left(X\right).$$
Here $p\left(X,Y\right)$ is a joint probability and is verbalized as "the probability of X and Y". The quantity $p\left(Y|X\right)$ is a conditional probability and is verbalized as "the probability of Y given X", whereas the quantity $p\left(X\right)$ is a marginal probability and is simply "the probability of X".

From the product rule, together with the symmetry property $p\left(X,Y\right)=p\left(Y,X\right)$, we immediately obtain the following relationship between conditional probabilities
$$p\left(Y|X\right)=\frac{p\left(X|Y\right)p\left(Y\right)}{p\left(X\right)}$$
which is called Bayes' theorem. Using the sum rule, the denominator in Bayes' theorem can be expressed in terms of the quantities appearing in the numerator
$$p\left(X\right)=\sum_Yp\left(X|Y\right)p\left(Y\right).$$
We can view the denominator in Bayes' theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side over all values of Y equals one.
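A small worked example may help. The sketch below applies the sum rule, product rule, and Bayes' theorem to two binary variables using exact fractions. The probability values and the labels `y0`, `y1`, `x0`, `x1` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical prior p(Y) and conditional p(X | Y) for binary X and Y.
prior = {"y0": Fraction(2, 5), "y1": Fraction(3, 5)}
likelihood = {
    "y0": {"x0": Fraction(1, 4), "x1": Fraction(3, 4)},
    "y1": {"x0": Fraction(3, 4), "x1": Fraction(1, 4)},
}

def marginal(x):
    """Sum and product rules: p(X) = sum_Y p(X | Y) p(Y)."""
    return sum(likelihood[y][x] * prior[y] for y in prior)

def posterior(y, x):
    """Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)."""
    return likelihood[y][x] * prior[y] / marginal(x)

for y in prior:
    print(y, posterior(y, "x0"))
```

Dividing by `marginal(x)` is what makes the posterior values for a fixed $X$ sum to one over $Y$, which is the normalization role of the denominator.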

We note that if the joint distribution of two variables factorizes into the product of the marginals, so that $p\left(X,Y\right)=p\left(X\right)p\left(Y\right)$, then X and Y are said to be independent. From the product rule, we see that $p\left(Y|X\right)=p\left(Y\right)$, and so the conditional distribution of Y given X is indeed independent of the value of X.

### 1.2.1 Probability densities

If the probability of a real-valued variable $x$ falling in the interval $\left(x,x+\delta x\right)$ is given by $p\left(x\right)\delta x$ for $\delta x\to0$, then $p\left(x\right)$ is called the probability density over $x$. The probability that $x$ will lie in an interval $\left(a,b\right)$ is then given by 
$$p\left(x\in\left(a,b\right)\right)=\int_a^b p\left(x\right)\mathrm{d}x.$$

The probability density $p\left(x\right)$ must satisfy the two conditions
$$p\left(x\right)\geqslant0 \\
\int_{-\infty}^\infty p\left(x\right)\mathrm{d}x=1.$$

The probability that $x$ lies in the interval $\left(-\infty,z\right)$ is given by the cumulative distribution function defined by
$$P\left(z\right)=\int_{-\infty}^z p\left(x\right)\mathrm{d}x$$
which satisfies $P'\left(x\right)=p\left(x\right)$.

If we have several continuous variables $x_1,\dots,x_D$, denoted collectively by the vector $\mathbf{x}$, then we can define a joint probability density $p\left(\mathbf{x}\right)=p\left(x_1,\dots,x_D\right)$ such that the probability of $\mathbf{x}$ falling in an infinitesimal volume $\delta\mathbf{x}$ containing the point $\mathbf{x}$ is given by $p\left(\mathbf{x}\right)\delta\mathbf{x}$. This multivariate probability density must satisfy
$$p\left(\mathbf{x}\right)\geqslant0 \\
\int p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}=1$$
in which the integral is taken over the whole of $\mathbf{x}$ space.

Note that if $x$ is a discrete variable, then $p\left(x\right)$ is sometimes called a probability mass function because it can be regarded as a set of 'probability masses' concentrated at the allowed values of $x$.

If $x$ and $y$ are two real variables, then the sum and product rules take the form
$$p\left(x\right)=\int p\left(x,y\right)\mathrm{d}y \\
p\left(x,y\right)=p\left(y|x\right)p\left(x\right)$$

### 1.2.2 Expectations and covariances

One of the most important operations involving probabilities is that of finding weighted averages of functions. The average value of some function $f\left(x\right)$ under a probability distribution $p\left(x\right)$ is called the expectation of $f\left(x\right)$ and will be denoted by $\mathbb{E}\left[f\right]$. For a discrete distribution, it is given by
$$\mathbb{E}\left[f\right]=\sum_xp\left(x\right)f\left(x\right)$$
so that the average is weighted by the relative probabilities of the different values of $x$.

In the case of continuous variables, expectations are expressed in terms of an integration with respect to the corresponding probability density
$$\mathbb{E}\left[f\right]=\int p\left(x\right)f\left(x\right)\mathrm{d}x.$$

In either case, if we are given a finite number $N$ of points drawn from the probability distribution or probability density, then the expectation can be approximated as a finite sum over these points 
$$\mathbb{E}\left[f\right]\simeq\frac{1}{N}\sum_{n=1}^{N}f\left(x_n\right).$$
The approximation becomes exact in the limit $N\to\infty$.
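The finite-sum approximation can be sketched with the standard library alone. The choice of a uniform distribution on $[0,1]$ and of $f\left(x\right)=x^2$, for which the exact expectation is $1/3$, is an assumption made for illustration, as are the function name and the seed.

```python
import random

def monte_carlo_expectation(f, sampler, n_samples):
    """Approximate E[f] by (1/N) * sum of f(x_n) over N independent draws x_n."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples

rng = random.Random(0)  # arbitrary seed
# x is uniform on [0, 1] and f(x) = x**2, so the exact expectation is 1/3.
estimate = monte_carlo_expectation(lambda x: x * x, rng.random, 100_000)
print(estimate)
```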

Sometimes we will be considering expectations of functions of several variables, in which case we can use a subscript to indicate which variable is being averaged over, so that for instance
$$\mathbb{E}_x\left[f\left(x,y\right)\right]$$
denotes the average of the function $f\left(x,y\right)$ with respect to the distribution of $x$. Note that $\mathbb{E}_x\left[f\left(x,y\right)\right]$ will be a function of $y$.

We can also consider a conditional expectation with respect to a conditional distribution, so that
$$\mathbb{E}\left[f|y\right]=\sum_xp\left(x|y\right)f\left(x\right)$$
with an analogous definition for continuous variables.

The variance of $f\left(x\right)$ is defined by
$$ var\left[f\right]=\mathbb{E}\left[\left(f\left(x\right)-\mathbb{E}\left[f\left(x\right)\right]\right)^2\right]$$
and provides a measure of how much variability there is in $f\left(x\right)$ around its mean value $\mathbb{E}\left[f\left(x\right)\right].$

Expanding out the square, we see that the variance can also be written in terms of the expectations of $f\left(x\right)$ and $f\left(x\right)^2$
$$var\left[f\right]=\mathbb{E}\left[f\left(x\right)^2\right]-\mathbb{E}\left[f\left(x\right)\right]^2$$
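The intermediate steps of this expansion use only the linearity of expectation and the fact that $\mathbb{E}\left[f\left(x\right)\right]$ is a constant, abbreviated here as $\mathbb{E}\left[f\right]$:
$$var\left[f\right]=\mathbb{E}\left[f\left(x\right)^2-2f\left(x\right)\mathbb{E}\left[f\right]+\mathbb{E}\left[f\right]^2\right] \\
\quad=\mathbb{E}\left[f\left(x\right)^2\right]-2\mathbb{E}\left[f\right]\mathbb{E}\left[f\right]+\mathbb{E}\left[f\right]^2 \\
\quad=\mathbb{E}\left[f\left(x\right)^2\right]-\mathbb{E}\left[f\right]^2$$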

In particular, we can consider the variance of variable $x$ itself, which is given by
$$var\left[x\right]=\mathbb{E}\left[x^2\right]-\mathbb{E}\left[x\right]^2$$

For two random variables $x$ and $y$, the covariance is defined by 
$$cov\left[x,y\right]=\mathbb{E}_{x,y}\left[\{x-\mathbb{E}\left[x\right]\}\{y-\mathbb{E}\left[y\right]\}\right] \\
\quad=\mathbb{E}_{x,y}\left[xy\right]-\mathbb{E}\left[x\right]\mathbb{E}\left[y\right]$$
which expresses the extent to which $x$ and $y$ vary together. If $x$ and $y$ are independent, then their covariance vanishes.
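To see why independence makes the covariance vanish, note that for independent discrete variables the joint distribution factorizes as $p\left(x,y\right)=p\left(x\right)p\left(y\right)$, so the expectation of the product $xy$ separates:
$$\mathbb{E}_{x,y}\left[xy\right]=\sum_x\sum_y p\left(x\right)p\left(y\right)xy=\left(\sum_x p\left(x\right)x\right)\left(\sum_y p\left(y\right)y\right)=\mathbb{E}\left[x\right]\mathbb{E}\left[y\right]$$
Subtracting $\mathbb{E}\left[x\right]\mathbb{E}\left[y\right]$ then leaves zero. The continuous case is identical with the sums replaced by integrals.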

In the case of two vectors of random variables $\mathbf{x}$ and $\mathbf{y}$, the covariance is a matrix 
$$cov\left[\mathbf{x},\mathbf{y}\right]=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left[\{\mathbf{x}-\mathbb{E}\left[\mathbf{x}\right]\}\{\mathbf{y}^\top-\mathbb{E}\left[\mathbf{y}^\top\right]\}\right] \\
\quad=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left[\mathbf{x}\mathbf{y}^\top\right]-\mathbb{E}\left[\mathbf{x}\right]\mathbb{E}\left[\mathbf{y}^\top\right]$$
If we consider the covariance of the components of a vector $\mathbf{x}$ with each other, then we use a slightly simpler notation $cov\left[\mathbf{x}\right]\equiv cov\left[\mathbf{x},\mathbf{x}\right]$.

### 1.2.3 Bayesian probabilities

We capture our assumptions about $\mathbf{w}$, before observing the data, in the form of a prior probability distribution $p\left(\mathbf{w}\right)$. The effect of the observed data $\mathcal{D}=\{t_1,\dots,t_N\}$ is expressed through the conditional probability $p\left(\mathcal{D}|\mathbf{w}\right)$. Bayes' theorem, which takes the form
$$p\left(\mathbf{w}|\mathcal{D}\right)=\frac{p\left(\mathcal{D}|\mathbf{w}\right)p\left(\mathbf{w}\right)}{p\left(\mathcal{D}\right)}$$
then allows us to evaluate the uncertainty in $\mathbf{w}$ after we have observed $\mathcal{D}$ in the form of the posterior probability $p\left(\mathbf{w}|\mathcal{D}\right)$.

The quantity $p\left(\mathcal{D}|\mathbf{w}\right)$ on the right-hand side of Bayes' theorem is evaluated for the observed data set $\mathcal{D}$ and can be viewed as a function of the parameter vector $\mathbf{w}$, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector $\mathbf{w}$. Note that the likelihood function is not a probability distribution over $\mathbf{w}$, and its integral with respect to $\mathbf{w}$ does not (necessarily) equal one.

Given this definition of likelihood, we can state Bayes' theorem in words
$$posterior \propto likelihood \times prior$$
where all of these quantities are viewed as functions of $\mathbf{w}$.

The denominator in Bayes' theorem is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one. Indeed, integrating both sides of Bayes' theorem with respect to $\mathbf{w}$, we can express the denominator in Bayes' theorem in terms of the prior distribution and the likelihood function
$$p\left(\mathcal{D}\right)=\int p\left(\mathcal{D}|\mathbf{w}\right)p\left(\mathbf{w}\right)\mathrm{d}\mathbf{w}$$

In a frequentist setting, $\mathbf{w}$ is considered to be a fixed parameter, whose value is determined by some form of 'estimator', and error bars on this estimate are obtained by considering the distribution of possible data sets $\mathcal{D}$. By contrast, from the Bayesian viewpoint there is only a single data set $\mathcal{D}$ (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over $\mathbf{w}$.

One common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs.

### 1.2.4  The Gaussian distribution (The Normal distribution)

For the case of a single real-valued variable $x$, the Gaussian distribution is defined by 
$$\mathcal{N}\left(x|\mu,\sigma^2\right)=\frac{1}{\left(2\pi\sigma^2\right)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}\left(x-\mu\right)^2\right\}$$
which is governed by two parameters: $\mu$, called the mean, and $\sigma^2$, called the variance. The square root of the variance, given by $\sigma$, is called the standard deviation, and the reciprocal of the variance, written as $\beta=1/\sigma^2$, is called the precision.

We see that the Gaussian distribution satisfies
$$\mathcal{N}\left(x|\mu,\sigma^2\right)>0.$$

Also it is straightforward to show that the Gaussian is normalized, so that 
$$\int_{-\infty}^{\infty}\mathcal{N}\left(x|\mu,\sigma^2\right)\mathrm{d}x=1$$
Thus the Gaussian distribution satisfies the two requirements for a valid probability density.

We can readily find expectations of functions of $x$ under the Gaussian distribution. In particular, the average value of $x$ is given by
$$\mathbb{E}\left[x\right]=\int_{-\infty}^{\infty}\mathcal{N}\left(x|\mu,\sigma^2\right)x\mathrm{d}x=\mu$$
Because the parameter $\mu$ represents the average value of $x$ under the distribution, it is referred to as the mean.

Similarly, for the second order moment
$$\mathbb{E}\left[x^2\right]=\int_{-\infty}^{\infty}\mathcal{N}\left(x|\mu,\sigma^2\right)x^2\mathrm{d}x=\mu^2+\sigma^2$$

It follows that the variance of $x$ is given by
$$var\left[x\right]=\mathbb{E}\left[x^2\right]-\mathbb{E}\left[x\right]^2=\sigma^2$$
and hence $\sigma^2$ is referred to as the variance parameter. The maximum of a distribution is known as its mode. For a Gaussian, the mode coincides with the mean.
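These properties of the univariate Gaussian can be checked numerically. The sketch below uses only the standard library and a trapezoidal rule to approximate the normalization integral, the mean, and the variance. The parameter values, the integration range of ten standard deviations either side of the mean, and the function names are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) for a scalar x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def integrate(f, lower, upper, n_steps=20_000):
    """Composite trapezoidal rule on [lower, upper]."""
    h = (upper - lower) / n_steps
    total = 0.5 * (f(lower) + f(upper))
    total += sum(f(lower + k * h) for k in range(1, n_steps))
    return total * h

mu, sigma2 = 1.5, 0.64  # illustrative parameter values
sigma = math.sqrt(sigma2)
lo, hi = mu - 10 * sigma, mu + 10 * sigma  # the tails beyond this range are negligible

norm = integrate(lambda x: gaussian_pdf(x, mu, sigma2), lo, hi)
mean = integrate(lambda x: x * gaussian_pdf(x, mu, sigma2), lo, hi)
second = integrate(lambda x: x * x * gaussian_pdf(x, mu, sigma2), lo, hi)

# Compare with 1, mu, and sigma^2 respectively.
print(norm, mean, second - mean**2)
```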

We are also interested in the Gaussian distribution defined over a D-dimensional vector $\mathbf{x}$ of continuous variables, which is given by
$$\mathcal{N}\left(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}\right)=\frac{1}{\left(2\pi\right)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^\top\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right\}$$
where the D-dimensional vector $\boldsymbol{\mu}$ is called the mean, the $D\times D$ matrix $\boldsymbol{\Sigma}$ is called the covariance, and $|\boldsymbol{\Sigma}|$ denotes the determinant of $\boldsymbol{\Sigma}$.

Now suppose that we have a data set of observations $\mathsf{x}=\left(x_1,\dots,x_N\right)^\top$, representing $N$ observations of the scalar variable $x$. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown, and we would like to determine these parameters from the data set. 

Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d.

We have seen that the joint probability of two independent events is given by the product of the marginal probabilities for each event separately. Because our data set $\mathsf{x}$ is i.i.d., we can therefore write the probability of the data set, given $\mu$ and $\sigma^2$, in the form 
$$p\left(\mathsf{x}|\mu,\sigma^2\right)=\prod_{n=1}^N\mathcal{N}\left(x_n|\mu,\sigma^2\right).$$
When viewed as a function of $\mu$ and $\sigma^2$, this is the likelihood function for the Gaussian.

One common criterion for determining the parameters in a probability distribution using an observed data set is to find the parameter values that maximize the likelihood function.

Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself. The log likelihood function can be written in the form 
$$\ln p\left(\mathsf{x}|\mu,\sigma^2\right)=-\frac{1}{2\sigma^2}\sum_{n=1}^N\left(x_n-\mu\right)^2-\frac{N}{2}\ln\sigma^2-\frac{N}{2}\ln\left(2\pi\right).$$

Maximizing the log likelihood function with respect to $\mu$, we obtain the maximum likelihood solution given by 
$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^{N}x_n$$
which is the sample mean, i.e., the mean of the observed values $\{x_n\}$.

Maximizing the log likelihood function with respect to $\sigma^2$, we obtain the maximum likelihood solution for the variance in the form
$$\sigma_{ML}^2=\frac{1}{N}\sum_{n=1}^{N}\left(x_n-\mu_{ML}\right)^2$$
which is the sample variance measured with respect to the sample mean $\mu_{ML}$.
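Both solutions follow from setting the partial derivatives of the log likelihood to zero:
$$\frac{\partial}{\partial\mu}\ln p\left(\mathsf{x}|\mu,\sigma^2\right)=\frac{1}{\sigma^2}\sum_{n=1}^N\left(x_n-\mu\right)=0 \\
\frac{\partial}{\partial\sigma^2}\ln p\left(\mathsf{x}|\mu,\sigma^2\right)=\frac{1}{2\sigma^4}\sum_{n=1}^N\left(x_n-\mu\right)^2-\frac{N}{2\sigma^2}=0$$
The first condition does not involve $\sigma^2$, so it can be solved for $\mu_{ML}$ on its own; substituting $\mu=\mu_{ML}$ into the second condition and solving for $\sigma^2$ then gives $\sigma^2_{ML}$.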

We shall show that the maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias. 

We first note that the maximum likelihood solutions $\mu_{ML}$ and $\sigma^2_{ML}$ are functions of the data set values $x_1,\dots,x_N$. Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters $\mu$ and $\sigma^2$. It is straightforward to show that
$$\mathbb{E}\left[\mu_{ML}\right]=\mu \\
\mathbb{E}\left[\sigma_{ML}^2\right]=\left(\frac{N-1}{N}\right)\sigma^2$$
so that on average the maximum likelihood estimate will obtain the correct mean but will underestimate the true variance by a factor $\left(N-1\right)/N$.

The following estimate for the variance parameter is unbiased 
$$\tilde{\sigma}^2=\frac{N}{N-1}\sigma^2_{ML}=\frac{1}{N-1}\sum_{n=1}^{N}\left(x_n-\mu_{ML}\right)^2$$

Note that the bias of the maximum likelihood solution becomes less significant as the number $N$ of data points increases, and in the limit $N\to\infty$ the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.
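The bias can be observed empirically with a short simulation using only the standard library: draw many small data sets from a known Gaussian, compute $\sigma^2_{ML}$ for each, and average. The true parameters ($\mu=1$, $\sigma=2$), the data set size $N=5$, the number of trials, the seed, and the function names are assumptions chosen for illustration.

```python
import random
import statistics

def ml_variance(samples):
    """sigma^2_ML: mean squared deviation from the sample mean (divides by N)."""
    mu_ml = sum(samples) / len(samples)
    return sum((x - mu_ml) ** 2 for x in samples) / len(samples)

def average_ml_variance(n_points, n_trials, mu, sigma, rng):
    """Average sigma^2_ML over many independent data sets of size N."""
    estimates = [
        ml_variance([rng.gauss(mu, sigma) for _ in range(n_points)])
        for _ in range(n_trials)
    ]
    return statistics.fmean(estimates)

rng = random.Random(0)  # arbitrary seed
n_points, sigma = 5, 2.0
avg_ml = average_ml_variance(n_points, 20_000, mu=1.0, sigma=sigma, rng=rng)
avg_unbiased = avg_ml * n_points / (n_points - 1)

# Compare the first value with ((N-1)/N) * sigma^2 and the second with sigma^2.
print(avg_ml, (n_points - 1) / n_points * sigma**2)
print(avg_unbiased, sigma**2)
```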

### 1.2.5 Curve fitting re-visited

The goal in the curve fitting problem is to be able to make predictions for the target variable $t$ given some new value of the input variable $x$ on the basis of a set of training data comprising $N$ input values $\mathsf{x}=\left(x_1,\dots,x_N\right)^\top$ and their corresponding target values $\mathsf{t}=\left(t_1,\dots,t_N\right)^\top$. We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y\left(x, \mathbf{w}\right)$ of the polynomial curve given by (1.1). Thus we have
$$p\left(t|x,\mathbf{w},\beta\right)=\mathcal{N}\left(t|y\left(x,\mathbf{w}\right),\beta^{-1}\right) \quad\quad\quad\left(1.60\right)$$
where, for consistency with the notation in later chapters, we have defined a precision parameter $\beta$ corresponding to the inverse variance of the distribution.

We now use the training data $\{\mathsf{x},\mathsf{t}\}$ to determine the values of the unknown parameters $\mathbf{w}$ and $\beta$ by maximum likelihood. If the data are assumed to be drawn independently from the distribution $\left(1.60\right)$, then the likelihood function is given by 
$$p\left(\mathsf{t}|\mathsf{x},\mathbf{w},\beta\right)=\prod_{n=1}^{N}\mathcal{N}\left(t_n|y\left(x_n,\mathbf{w}\right),\beta^{-1}\right).$$

Substituting the form of the Gaussian distribution and taking the logarithm, we obtain the log likelihood function in the form 
$$\ln p\left(\mathsf{t}|\mathsf{x},\mathbf{w},\beta\right)=-\frac{\beta}{2}\sum_{n=1}^N\left(y\left(x_n,\mathbf{w}\right)-t_n\right)^2+\frac{N}{2}\ln\beta-\frac{N}{2}\ln\left(2\pi\right).\quad\quad\quad\left(1.62\right)$$

Consider first the determination of the maximum likelihood solution for the polynomial coefficients, which will be denoted by $\mathbf{w}_{ML}$
$$\mathbf{w}_{ML}=\arg\max_{\mathbf{w}}\ln p\left(\mathsf{t}|\mathsf{x},\mathbf{w},\beta\right) \\
\quad\quad=\arg\min_{\mathbf{w}}-\ln p\left(\mathsf{t}|\mathsf{x},\mathbf{w},\beta\right)  \\
\quad\quad\quad\quad=\arg\min_{\mathbf{w}}\frac{1}{2}\sum_{n=1}^N\{y\left(x_n,\mathbf{w}\right)-t_n\}^2$$
Thus the sum-of-squares error function has arisen as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution.

We can also use maximum likelihood to determine the precision parameter $\beta$ of the Gaussian conditional distribution. Maximizing $\left(1.62\right)$ with respect to $\beta$ gives
$$\frac{1}{\beta_{ML}}=\frac{1}{N}\sum_{n=1}^{N}\{y\left(x_n,\mathbf{w}_{ML}\right)-t_n\}^2.$$
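The step can be made explicit. The terms of the log likelihood that depend on $\beta$ are $-\frac{\beta}{2}\sum_{n}\{y\left(x_n,\mathbf{w}\right)-t_n\}^2$ and $+\frac{N}{2}\ln\beta$, the latter arising because the variance of each Gaussian factor is $\beta^{-1}$. Evaluating at $\mathbf{w}=\mathbf{w}_{ML}$ and setting the derivative with respect to $\beta$ to zero gives
$$-\frac{1}{2}\sum_{n=1}^{N}\{y\left(x_n,\mathbf{w}_{ML}\right)-t_n\}^2+\frac{N}{2\beta}=0$$
which rearranges to the expression for $1/\beta_{ML}$ above. In other words, the maximum likelihood noise variance is the mean squared residual of the fitted polynomial.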

Having determined the parameters $\mathbf{w}$ and $\beta$, we can now make predictions for new values of $x$. Because we now have a probabilistic model, these are expressed in terms of the predictive distribution, which gives the probability distribution over $t$ rather than simply a point estimate, and is obtained by substituting the maximum likelihood parameters into $\left(1.60\right)$ to give
$$p\left(t|x,\mathbf{w}_{ML},\beta_{ML}\right)=\mathcal{N}\left(t|y\left(x,\mathbf{w}_{ML}\right),\beta_{ML}^{-1}\right)$$

Now let us take a step towards a more Bayesian approach and introduce a prior distribution over the polynomial coefficients $\mathbf{w}$. For simplicity, let us consider a Gaussian distribution of the form
$$p\left(\mathbf{w}|\alpha\right)=\mathcal{N}\left(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I}\right)=\left(\frac{\alpha}{2\pi}\right)^{\left(M+1\right)/2}\exp\left\{-\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}\right\}\quad\quad\quad\left(1.65\right)$$
where $\alpha$ is the precision of the distribution, and $M+1$ is the total number of elements in the vector $\mathbf{w}$ for an $M^{th}$ order polynomial.

Variables such as $\alpha$, which control the distribution of model parameters, are called hyperparameters.

Using Bayes' theorem, the posterior distribution for $\mathbf{w}$ is proportional to the product of prior distribution and the likelihood function
$$p\left(\mathbf{w}|\mathsf{x},\mathsf{t},\alpha,\beta\right)\propto p\left(\mathsf{t}|\mathsf{x},\mathbf{w},\beta\right)p\left(\mathbf{w}|\alpha\right).\quad\quad\quad\left(1.66\right)$$
We can now determine $\mathbf{w}$ by finding the most probable value of $\mathbf{w}$ given the data, in other words by maximizing the posterior distribution. This technique is called maximum posterior, or simply MAP.

Taking the negative logarithm of $\left(1.66\right)$ and combining with $\left(1.62\right)$ and $\left(1.65\right)$, we find that the maximum of the posterior is given by the minimum of 
$$\frac{\beta}{2}\sum_{n=1}^N\left\{y\left(x_n,\mathbf{w}\right)-t_n\right\}^2+\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}$$

Thus we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function, with a regularization parameter given by $\lambda=\alpha/\beta$.
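To make the correspondence explicit, note that dividing an objective by the positive constant $\beta$ does not change the location of its minimum. Writing the negative log posterior up to terms independent of $\mathbf{w}$ and dividing by $\beta$ gives
$$\frac{1}{\beta}\left[\frac{\beta}{2}\sum_{n=1}^N\left\{y\left(x_n,\mathbf{w}\right)-t_n\right\}^2+\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}\right]=\frac{1}{2}\sum_{n=1}^N\left\{y\left(x_n,\mathbf{w}\right)-t_n\right\}^2+\frac{\alpha/\beta}{2}\|\mathbf{w}\|^2$$
which has the same form as the regularized error $\tilde{E}$ with $\lambda=\alpha/\beta$.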

### 1.2.6 Bayesian curve fitting

In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires that we integrate over all values of $\mathbf{w}$.

In the curve fitting problem, we are given the training data $\mathsf{x}$ and $\mathsf{t}$, along with a new test point $x$, and our goal is to predict the value of $t$. We therefore wish to evaluate the predictive distribution $p\left(t|x,\mathsf{x},\mathsf{t}\right)$. Here we shall assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance.

Applying the sum and product rules of probability allows the predictive distribution to be written in the form
$$p\left(t|x,\mathsf{x},\mathsf{t}\right)=\int p\left(t|x,\mathbf{w}\right)p\left(\mathbf{w}|\mathsf{x},\mathsf{t}\right)\mathrm{d}\mathbf{w}.$$
Here $p\left(t|x,\mathbf{w}\right)$ is given by $\left(1.60\right)$, and we have omitted the dependence on $\alpha$ and $\beta$ to simplify the notation. Here $p\left(\mathbf{w}|\mathsf{x},\mathsf{t}\right)$ is the posterior distribution over parameters, and can be found by normalizing the right-hand side of $\left(1.66\right)$.

The result is that the predictive distribution is given by a Gaussian of the form 
$$ p\left(t|x,\mathsf{x},\mathsf{t}\right)=\mathcal{N}\left(t|m\left(x\right),s^2\left(x\right)\right)$$
where the mean and variance are given by 
$$m\left(x\right)=\beta\phi\left(x\right)^\top\mathbf{S}\sum_{n=1}^{N}\phi\left(x_n\right)t_n \\
s^2\left(x\right)=\beta^{-1}+\phi\left(x\right)^\top\mathbf{S}\phi\left(x\right).$$

Here the matrix $\mathbf{S}$ is given by 
$$\mathbf{S}^{-1}=\alpha\mathbf{I}+\beta\sum_{n=1}^N\phi\left(x_n\right)\phi\left(x_n\right)^\top$$
where $\mathbf{I}$ is the unit matrix, and we have defined the vector $\phi\left(x\right)$ with elements $\phi_i\left(x\right)=x^i$ for $i=0,\dots,M$.
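A minimal sketch of these formulas, assuming NumPy. Stacking the vectors $\phi\left(x_n\right)^\top$ as the rows of a matrix `phi_train`, the sum over $n$ of the outer products $\phi\left(x_n\right)\phi\left(x_n\right)^\top$ equals `phi_train.T @ phi_train`, and the sum of $\phi\left(x_n\right)t_n$ equals `phi_train.T @ t_train`. The function names, the synthetic data, and the particular values of $\alpha$ and $\beta$ are assumptions for illustration.

```python
import numpy as np

def basis(x, order):
    """Rows are phi(x)^T, with elements phi_i(x) = x**i for i = 0..M."""
    return np.vander(np.atleast_1d(x), order + 1, increasing=True)

def predictive(x_new, x_train, t_train, order, alpha, beta):
    """Return the predictive mean m(x) and variance s^2(x) at each x_new."""
    phi_train = basis(x_train, order)                      # N x (M+1)
    s_inv = alpha * np.eye(order + 1) + beta * phi_train.T @ phi_train
    s = np.linalg.inv(s_inv)
    phi_new = basis(x_new, order)                          # K x (M+1)
    mean = beta * phi_new @ s @ (phi_train.T @ t_train)
    # phi(x)^T S phi(x) for each row of phi_new, plus the noise variance 1/beta.
    var = 1.0 / beta + np.sum((phi_new @ s) * phi_new, axis=1)
    return mean, var

rng = np.random.default_rng(3)  # arbitrary seed
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, 10)
mean, var = predictive(np.array([0.0, 0.5, 1.0]), x_train, t_train,
                       order=9, alpha=5e-3, beta=11.1)
print(mean, np.sqrt(var))
```

The variance has two parts: the term $\beta^{-1}$ reflects the noise on the target variable, while the second term reflects the remaining uncertainty in $\mathbf{w}$ and depends on $x$.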

## 1.3 Model Selection

We have already seen that, in the maximum likelihood approach, the performance on the training set is not a good indicator of predictive performance on unseen data due to the problem of over-fitting. If data is plentiful, then one approach is simply to use some of the available data to train a range of models, or a given model with a range of values for its complexity parameters, and then to compare them on independent data, sometimes called a validation set, and select the one having the best predictive performance. If the model design is iterated many times using a limited size data set, then some over-fitting to the validation data can occur and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated.

Cross-validation, in which the available data is partitioned into $S$ groups, allows a proportion $\left(S-1\right)/S$ of the data to be used for training while making use of all of the data to assess performance. When data is particularly scarce, it may be appropriate to consider the case $S=N$, where $N$ is the total number of data points, which gives the leave-one-out technique.

One major drawback of cross-validation is that the number of training runs that must be performed is increased by a factor of $S$, and this can prove problematic for models in which the training is itself computationally expensive.
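The procedure, and the reason it needs $S$ training runs, can be sketched with the standard library. Each of the $S$ groups is held out once while the model is trained on the remaining $S-1$ groups, and the held-out scores are averaged. The function names, the toy "model" (the mean of the training values), and the squared-error score are hypothetical choices for illustration.

```python
import random

def s_fold_indices(n_points, n_folds, rng):
    """Split indices 0..N-1 into S disjoint, nearly equal validation groups."""
    indices = list(range(n_points))
    rng.shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]

def cross_validate(data, n_folds, train, evaluate, rng):
    """Average the held-out score over S runs, each training on S-1 groups."""
    scores = []
    for held_out in s_fold_indices(len(data), n_folds, rng):
        held = set(held_out)
        train_part = [d for i, d in enumerate(data) if i not in held]
        valid_part = [data[i] for i in held_out]
        model = train(train_part)          # one training run per fold
        scores.append(evaluate(model, valid_part))
    return sum(scores) / len(scores)

def fit_mean(part):
    """Toy model: predict the mean of the training values."""
    return sum(part) / len(part)

def mean_squared_error(model, part):
    """Average squared deviation of held-out values from the prediction."""
    return sum((d - model) ** 2 for d in part) / len(part)

rng = random.Random(0)  # arbitrary seed
data = [rng.gauss(0.0, 1.0) for _ in range(20)]
print(cross_validate(data, 4, fit_mean, mean_squared_error, rng))
# Leave-one-out corresponds to S = N.
print(cross_validate(data, len(data), fit_mean, mean_squared_error, rng))
```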

Ideally, model selection should rely only on the training data and should allow multiple hyperparameters and model types to be compared in a single training run. We therefore need to find a measure of performance which depends only on the training data and which does not suffer from bias due to over-fitting.

Various information criteria attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.

The Akaike information criterion, or AIC, chooses the model for which the quantity
$$\ln p\left(\mathcal{D}|\mathbf{w}_{ML}\right)-M$$
is largest. Here $\ln p\left(\mathcal{D}|\mathbf{w}_{ML}\right)$ is the best-fit log likelihood, and $M$ is the number of adjustable parameters in the model.

## 1.4 The Curse of Dimensionality

For practical applications of pattern recognition, however, we will have to deal with spaces of high dimensionality comprising many input variables. As we now discuss, this poses some serious challenges and is an important factor influencing the design of pattern recognition techniques.

The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality.

## 1.5 Decision Theory

Here we turn to a discussion of decision theory that, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty such as those encountered in pattern recognition.

Suppose we have an input vector $\mathbf{x}$ together with a corresponding vector $\mathbf{t}$ of target variables, and our goal is to predict $\mathbf{t}$ given a new value for $\mathbf{x}$. In a practical application, however, we must often make a specific prediction for the value of $\mathbf{t}$, or more generally take a specific action based on our understanding of the values $\mathbf{t}$ is likely to take, and this aspect is the subject of decision theory.

The general inference problem then involves determining the joint distribution $p\left(\mathbf{x},\mathcal{C}_k\right)$, or equivalently $p\left(\mathbf{x},t\right)$, which gives us the most complete probabilistic description of the situation. Although this can be a very useful and informative quantity, in the end we must decide how to act on it, for example which class to assign to a new $\mathbf{x}$. This is the decision step, and it is the subject of decision theory to tell us how to make optimal decisions given the appropriate probabilities.

### 1.5.1 Minimizing the misclassification rate 

We need a rule that assigns each value of $\mathbf{x}$ to one of the available classes. Such a rule will divide the input space into regions $\mathcal{R}_k$ called decision regions, one for each class, such that all points in $\mathcal{R}_k$ are assigned to class $\mathcal{C}_k$. The boundaries between decision regions are called decision boundaries or decision surfaces. Note that each decision region need not be contiguous but could comprise some number of disjoint regions.

A mistake occurs when an input vector belonging to class $\mathcal{C}_1$ is assigned to class $\mathcal{C}_2$ or vice versa. The probability of this occurring is given by 
$$\begin{aligned}
p\left(\text{mistake}\right)&=p\left(\mathbf{x}\in\mathcal{R}_1,\mathcal{C}_2\right)+p\left(\mathbf{x}\in\mathcal{R}_2,\mathcal{C}_1\right) \\
&=\int_{\mathcal{R}_1}p\left(\mathbf{x},\mathcal{C}_2\right)\mathrm{d}\mathbf{x}+\int_{\mathcal{R}_2}p\left(\mathbf{x},\mathcal{C}_1\right)\mathrm{d}\mathbf{x}
\end{aligned}$$

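To minimize the probability of a mistake, each $\mathbf{x}$ should be assigned to whichever class gives the smaller integrand: if $p\left(\mathbf{x},\mathcal{C}_1\right)>p\left(\mathbf{x},\mathcal{C}_2\right)$ for a given $\mathbf{x}$, then that $\mathbf{x}$ should be assigned to class $\mathcal{C}_1$. From the product rule, $p\left(\mathbf{x},\mathcal{C}_k\right)=p\left(\mathcal{C}_k|\mathbf{x}\right)p\left(\mathbf{x}\right)$, and because the factor $p\left(\mathbf{x}\right)$ is common to both terms, we can restate this result as saying that the minimum probability of making a mistake is obtained if each value of $\mathbf{x}$ is assigned to the class for which the posterior probability $p\left(\mathcal{C}_k|\mathbf{x}\right)$ is largest.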
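
The two-class result can be checked numerically on a toy problem. The sketch below assumes one-dimensional Gaussian class-conditional densities with equal priors and a single decision boundary, so that $\mathcal{R}_1$ is everything to the left of the boundary; these modelling choices and all names are assumptions of the example.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def mistake_probability(boundary, prior1=0.5, mean1=-1.0, mean2=1.0, std=1.0):
    """p(mistake) when R_1 = (-inf, boundary) and R_2 = [boundary, inf)."""
    prior2 = 1.0 - prior1
    mass_c2_in_r1 = prior2 * normal_cdf(boundary, mean2, std)
    mass_c1_in_r2 = prior1 * (1.0 - normal_cdf(boundary, mean1, std))
    return mass_c2_in_r1 + mass_c1_in_r2

# Scan candidate boundaries on a grid and keep the one with fewest mistakes.
candidates = [i / 100.0 for i in range(-300, 301)]
best_boundary = min(candidates, key=mistake_probability)
print(best_boundary, mistake_probability(best_boundary))
```

With these symmetric settings the joint densities $p\left(x,\mathcal{C}_1\right)$ and $p\left(x,\mathcal{C}_2\right)$ cross at $x=0$, which is where the scan places the boundary.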

For the more general case of $K$ classes, it is slightly easier to maximize the probability of being correct, which is given by 
$$\begin{aligned}
p\left(\text{correct}\right)&=\sum_{k=1}^K p\left(\mathbf{x}\in\mathcal{R}_k, \mathcal{C}_k\right)\\
&=\sum_{k=1}^K\int_{\mathcal{R}_k}p\left(\mathbf{x},\mathcal{C}_k\right)\mathrm{d}\mathbf{x}
\end{aligned}$$
which is maximized when the regions $\mathcal{R}_k$ are chosen such that each $\mathbf{x}$ is assigned to the class for which $p\left(\mathbf{x},\mathcal{C}_k\right)$ is largest.

### 1.5.2 Minimizing the expected loss

A loss function, also called a cost function, is a single overall measure of the loss incurred in taking any of the available decisions or actions. Our goal is then to minimize the total loss incurred.

Suppose that, for a new value of $\mathbf{x}$, the true class is $\mathcal{C}_k$ and that we assign $\mathbf{x}$ to class $\mathcal{C}_j$. In so doing, we incur some level of loss that we denote by $L_{kj}$, which we can view as the $k,j$ element of a loss matrix.

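The optimal solution is the one which minimizes the loss function. However, the loss depends on the true class, which is unknown. For a given input vector $\mathbf{x}$, our uncertainty in the true class is expressed through the joint probability distribution $p\left(\mathbf{x},\mathcal{C}_k\right)$, and so we seek instead to minimize the average loss, computed with respect to this distribution, which is given by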
$$\mathbb{E}\left[L\right]=\sum_k\sum_j\int_{\mathcal{R}_j}L_{kj}p\left(\mathbf{x},\mathcal{C}_k\right)\mathrm{d}\mathbf{x}.$$

Each $\mathbf{x}$ can be assigned independently to one of the decision regions $\mathcal{R}_j$. Our goal is to choose the regions $\mathcal{R}_j$ in order to minimize the expected loss, which implies that for each $\mathbf{x}$ we should minimize $\sum_k L_{kj}p\left(\mathbf{x},\mathcal{C}_k\right)$. Using the product rule $p\left(\mathbf{x},\mathcal{C}_k\right)=p\left(\mathcal{C}_k|\mathbf{x}\right)p\left(\mathbf{x}\right)$, we can drop the common factor $p\left(\mathbf{x}\right)$. Thus the decision rule that minimizes the expected loss is the one that assigns each new $\mathbf{x}$ to the class $j$ for which the quantity
$$\sum_k L_{kj}p\left(\mathcal{C}_k|\mathbf{x}\right)$$
is a minimum. This is clearly trivial to do, once we know the posterior class probabilities $p\left(\mathcal{C}_k|\mathbf{x}\right)$.
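
The rule amounts to one matrix-vector product followed by an argmin. The following sketch shows this for a two-class problem; the function name, the loss values, and the posterior probabilities are hypothetical numbers chosen only to illustrate how an asymmetric loss can override the most probable class.

```python
import numpy as np

def minimum_expected_loss_decision(posteriors, loss_matrix):
    """Return (j, losses), where losses[j] = sum_k L[k, j] * p(C_k | x)."""
    posteriors = np.asarray(posteriors, dtype=float)
    loss_matrix = np.asarray(loss_matrix, dtype=float)
    expected_loss = posteriors @ loss_matrix  # one entry per candidate decision j
    return int(np.argmin(expected_loss)), expected_loss

# Hypothetical loss matrix: rows index the true class k, columns the decision j.
# Missing class 0 (deciding 1 when the truth is 0) is made 1000 times as costly
# as the opposite error.
loss_matrix = [[0.0, 1000.0],
               [1.0, 0.0]]
posteriors = [0.01, 0.99]  # hypothetical p(C_0 | x), p(C_1 | x)

decision, expected_loss = minimum_expected_loss_decision(posteriors, loss_matrix)
print(decision, expected_loss)
```

Under a loss matrix with zeros on the diagonal and ones elsewhere, the same function reduces to picking the class with the largest posterior probability.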

### 1.5.3 The reject option

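Classification errors arise from the regions of input space where the largest of the posterior probabilities $p\left(\mathcal{C}_k|\mathbf{x}\right)$ is significantly less than unity. These are the regions where we are relatively uncertain about class membership. In some applications, it will be appropriate to avoid making decisions on the difficult cases in anticipation of a lower error rate on those examples for which a classification decision is made. This is known as the reject option.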

We can achieve this by introducing a threshold $\theta$ and rejecting those inputs $\mathbf{x}$ for which the largest of the posterior probabilities $p\left(\mathcal{C}_k|\mathbf{x}\right)$ is less than or equal to $\theta$.
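
A minimal sketch of this thresholding rule is shown below. The sentinel value `REJECT = -1`, the function name, and the example posteriors are assumptions of the illustration.

```python
import numpy as np

REJECT = -1  # sentinel label for rejected inputs (an arbitrary choice)

def classify_with_reject(posteriors, theta):
    """Assign each row to its most probable class, or REJECT if max p <= theta."""
    posteriors = np.asarray(posteriors, dtype=float)
    best_class = np.argmax(posteriors, axis=1)
    best_prob = np.max(posteriors, axis=1)
    return np.where(best_prob <= theta, REJECT, best_class)

# Hypothetical posterior probabilities for three inputs and two classes.
posteriors = [[0.90, 0.10],
              [0.55, 0.45],
              [0.20, 0.80]]
print(classify_with_reject(posteriors, theta=0.6))
```

Setting $\theta=1$ rejects every input, while for $K$ classes any $\theta<1/K$ rejects none, since the largest posterior is always at least $1/K$.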

### 1.5.4 Inference and decision

We have broken the classification problem down into two separate stages, the inference stage in which we use training data to learn a model for $p\left(\mathcal{C}_k|\mathbf{x}\right)$, and the subsequent decision stage in which we use these posterior probabilities to make optimal class assignments.

An alternative possibility would be to solve both problems together and simply learn a function that maps inputs $\mathbf{x}$ directly into decisions. Such a function is called a discriminant function.

In fact, we can identify three distinct approaches to solving decision problems, all of which have been used in practical applications. These are given, in decreasing order of complexity, by:
1. First solve the inference problem of determining the class-conditional densities $p\left(\mathbf{x}|\mathcal{C}_k\right)$ for each class $\mathcal{C}_k$ individually. Also separately infer the prior class probabilities $p\left(\mathcal{C}_k\right)$. Then use Bayes' theorem in the form 
$$ p\left(\mathcal{C}_k|\mathbf{x}\right)=\frac{p\left(\mathbf{x}|\mathcal{C}_k\right)p\left(\mathcal{C}_k\right)}{p\left(\mathbf{x}\right)}$$
to find the posterior class probabilities $p\left(\mathcal{C}_k|\mathbf{x}\right)$. As usual, the denominator in Bayes' theorem can be found in terms of the quantities appearing in the numerator, because
$$p\left(\mathbf{x}\right)=\sum_k p\left(\mathbf{x}|\mathcal{C}_k\right)p\left(\mathcal{C}_k\right).$$
Equivalently, we can model the joint distribution $p\left(\mathbf{x},\mathcal{C}_k\right)$ directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input $\mathbf{x}$. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models.
2. First solve the inference problem of determining the posterior class probabilities $p\left(\mathcal{C}_k|\mathbf{x}\right)$, and then subsequently use decision theory to assign each new $\mathbf{x}$ to one of the classes. Approaches that model the posterior probabilities directly are called discriminative models.
3. Find a function $f\left(\mathbf{x}\right)$, called a discriminant function, which maps each input $\mathbf{x}$ directly onto a class label. In this case, probabilities play no role.
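
The first, generative, approach can be sketched in a few lines once the class-conditional densities and priors are available. The example below assumes one-dimensional Gaussian class-conditional densities with hand-picked parameters rather than fitted ones; the function names and all numbers are hypothetical.

```python
import numpy as np

def gaussian_density(x, mean, std):
    """Univariate Gaussian density, used here as p(x | C_k)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior_from_generative_model(x, means, stds, priors):
    """Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x)."""
    joint = np.array([
        prior * gaussian_density(x, mean, std)
        for mean, std, prior in zip(means, stds, priors)
    ])                      # joint[k] = p(x, C_k)
    evidence = joint.sum()  # p(x) = sum_k p(x | C_k) p(C_k)
    return joint / evidence

# Hypothetical two-class model.
means, stds, priors = [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]
print(posterior_from_generative_model(0.0, means, stds, priors))
print(posterior_from_generative_model(2.0, means, stds, priors))
```

A discriminative model would instead return the posterior vector directly, and a discriminant function would return only the class label, skipping the probabilities altogether.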

### 1.5.5 Loss functions for regression

The decision stage consists of choosing a specific estimate $y\left(\mathbf{x}\right)$ of the value of $t$ for each input $\mathbf{x}$. Suppose that in doing so, we incur a loss $L\left(t,y\left(\mathbf{x}\right)\right)$. The average, or expected, loss is then given by
$$\mathbb{E}\left[L\right]=\int\int L\left(t,y\left(\mathbf{x}\right)\right)p\left(\mathbf{x},t\right)\mathrm{d}\mathbf{x}\mathrm{d}t.$$

A common choice of loss function in regression problems is the squared loss given by $L\left(t,y\left(\mathbf{x}\right)\right)=\left\{y\left(\mathbf{x}\right)-t\right\}^2$. In this case, the expected loss can be written 
$$\mathbb{E}\left[L\right]=\int\int \left\{y\left(\mathbf{x}\right)-t\right\}^2p\left(\mathbf{x},t\right)\mathrm{d}\mathbf{x}\mathrm{d}t.$$

Our goal is to choose $y\left(\mathbf{x}\right)$ so as to minimize $\mathbb{E}\left[L\right]$. If we assume a completely flexible function $y\left(\mathbf{x}\right)$, we can do this formally using the calculus of variations to give 
$$ \frac{\delta\mathbb{E}\left[L\right]}{\delta y\left(\mathbf{x}\right)}=2\int \left\{y\left(\mathbf{x}\right)-t\right\}p\left(\mathbf{x},t\right)\mathrm{d}t=0.$$

Solving for $y\left(\mathbf{x}\right)$, and using the sum and product rules of probability, we obtain
$$y\left(\mathbf{x}\right)=\frac{\int tp\left(\mathbf{x},t\right)\mathrm{d}t}{p\left(\mathbf{x}\right)}=\int tp\left(t|\mathbf{x}\right)\mathrm{d}t=\mathbb{E}_t\left[t|\mathbf{x}\right]$$
which is the conditional average of $t$ conditioned on $\mathbf{x}$ and is known as the regression function. It can readily be extended to multiple target variables represented by the vector $\mathbf{t}$, in which case the optimal solution is the conditional average $\mathbf{y}\left(\mathbf{x}\right)=\mathbb{E}_t\left[\mathbf{t}|\mathbf{x}\right]$.

We consider briefly one simple generalization of the squared loss, called the Minkowski loss, whose expectation is given by 
$$ \mathbb{E}\left[L_q\right]=\int\int|y\left(\mathbf{x}\right)-t|^qp\left(\mathbf{x},t\right)\mathrm{d}\mathbf{x}\mathrm{d}t$$
which reduces to the expected squared loss for $q=2$.
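
The effect of $q$ can be explored numerically by fixing $\mathbf{x}$, drawing samples of $t$ from a skewed conditional distribution, and searching a grid for the prediction $y$ that minimizes the average loss. In the sketch below the gamma distribution, the grid, and the function name are assumptions of the example. For $q=2$ the minimizer should land on the sample mean, in line with the regression function; for $q=1$ it is known to be the conditional median instead.

```python
import numpy as np

def minkowski_minimizer(t_samples, q, candidates):
    """Candidate y with the smallest average |y - t|^q over the samples."""
    losses = [np.mean(np.abs(y - t_samples) ** q) for y in candidates]
    return float(candidates[int(np.argmin(losses))])

# Samples of t for one fixed x, drawn from a skewed distribution (assumed).
rng = np.random.default_rng(0)
t_samples = rng.gamma(shape=2.0, scale=1.5, size=5001)
candidates = np.linspace(0.0, 15.0, 1501)  # grid spacing 0.01

y_q2 = minkowski_minimizer(t_samples, 2, candidates)
y_q1 = minkowski_minimizer(t_samples, 1, candidates)
print(y_q2, t_samples.mean())      # q = 2 tracks the sample mean
print(y_q1, np.median(t_samples))  # q = 1 tracks the sample median
```

Because the samples are skewed, the two minimizers differ noticeably, which shows that the choice of loss changes the optimal prediction and not just its score.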

## 1.6 Information Theory

Our measure of information content will therefore depend on the probability distribution $p\left(x\right)$, and we therefore look for a quantity $h\left(x\right)$ that is a monotonic function of the probability $p\left(x\right)$ and that expresses the information content.

We have 
$$h\left(x\right)=-\ln p\left(x\right) \quad\quad\quad\left(1.92\right)$$
where the negative sign ensures that information is positive or zero.

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution $p\left(x\right)$ and is given by
$$H\left[x\right]=-\sum_x p\left(x\right)\ln p\left(x\right)$$
This important quantity is called the entropy of the random variable $x$.

Consider a discrete random variable $X$ with states $x_i$. The entropy of the random variable $X$ is then
$$H\left[p\right]=-\sum_i p\left(x_i\right)\ln p\left(x_i\right)$$
where $p\left(X=x_i\right)=p_i$.

The entropy is nonnegative, and it will equal its minimum value of 0 when one of the $p_i=1$ and all other $p_{j\neq i}=0$.
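
A small sketch makes these properties easy to check. The function name `entropy` and the example distributions are assumptions of the illustration; zero-probability states are skipped, following the usual convention that $p\ln p\to 0$ as $p\to 0$.

```python
import numpy as np

def entropy(probabilities):
    """H[p] = -sum_i p_i ln p_i in nats, skipping states with p_i = 0."""
    p = np.asarray(probabilities, dtype=float)
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

print(entropy([1.0, 0.0, 0.0]))           # all mass on one state
print(entropy([0.5, 0.25, 0.25]))         # a peaked distribution
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over four states
print(np.log(4.0))                        # ln M for M = 4, for comparison
```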

The maximum entropy configuration can be found by maximizing $H$ using a Lagrange multiplier to enforce the normalization constraint on the probabilities. Thus we maximize 
$$\tilde{H}=-\sum_i p\left(x_i\right)\ln p\left(x_i\right)+\lambda\left(\sum_i p\left(x_i\right)-1\right)$$
from which we find that all of the $p\left(x_i\right)$ are equal and are given by $p\left(x_i\right)=1/M$, where $M$ is the total number of states $x_i$. 
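
The intermediate step can be written out as follows. Setting the derivative of $\tilde{H}$ with respect to each $p\left(x_i\right)$ to zero gives
$$\frac{\partial\tilde{H}}{\partial p\left(x_i\right)}=-\ln p\left(x_i\right)-1+\lambda=0 \quad\Longrightarrow\quad p\left(x_i\right)=\exp\left(\lambda-1\right).$$
The right-hand side does not depend on $i$, so all $M$ probabilities are equal, and the normalization constraint $\sum_i p\left(x_i\right)=1$ then fixes their common value at $1/M$. Substituting back, the corresponding value of the entropy is
$$H=-\sum_{i=1}^{M}\frac{1}{M}\ln\frac{1}{M}=\ln M.$$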

For a density defined over multiple continuous variables, denoted collectively by the vector $\mathbf{x}$, the differential entropy is given by 
$$H\left[\mathbf{x}\right]=-\int p\left(\mathbf{x}\right)\ln p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}.$$

In the case of discrete distributions, we saw that the maximum entropy configuration corresponded to an equal distribution of probabilities across the possible states of the variable.

Let us now consider the maximum entropy configuration for a continuous variable. In order for this maximum to be well defined, it will be necessary to constrain the first and second moments of $p\left(x\right)$ as well as preserving the normalization constraint. We therefore maximize the differential entropy with the three constraints
$$\begin{aligned}
\int_{-\infty}^{\infty}p\left(x\right)\mathrm{d}x&=1 \\
\int_{-\infty}^{\infty}xp\left(x\right)\mathrm{d}x&=\mu \\
\int_{-\infty}^{\infty}\left(x-\mu\right)^2p\left(x\right)\mathrm{d}x&=\sigma^2.
\end{aligned}$$

The constrained maximization can be performed using Lagrange multipliers so that we maximize the following functional with respect to $p\left(x\right)$
$$ -\int p\left(x\right)\ln p\left(x\right)\mathrm{d}x+\lambda_1\left(\int_{-\infty}^{\infty}p\left(x\right)\mathrm{d}x-1\right)+\lambda_2\left(\int_{-\infty}^{\infty}xp\left(x\right)\mathrm{d}x-\mu\right)+\lambda_3\left(\int_{-\infty}^{\infty}\left(x-\mu\right)^2p\left(x\right)\mathrm{d}x-\sigma^2\right).$$

Using the calculus of variations, we set the derivative of this functional to zero giving 
$$p\left(x\right)=\exp\left\{-1+\lambda_1+\lambda_2x+\lambda_3\left(x-\mu\right)^2\right\}.$$

The Lagrange multipliers can be found by back substitution of this result into the three constraint equations, leading finally to the result 
$$p\left(x\right)=\frac{1}{\left(2\pi\sigma^2\right)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}\left(x-\mu\right)^2\right\}$$
and so the distribution that maximizes the differential entropy is the Gaussian.
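
As a sanity check on this result, one can compare the differential entropy of the Gaussian with that of other densities sharing the same variance. The sketch below uses the standard closed-form entropies of the Gaussian, uniform, and Laplace distributions; the choice of comparison families and the function names are assumptions of this example.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.log(2.0 * math.pi * sigma ** 2))

def uniform_entropy(sigma):
    """Uniform density of width w has variance w^2 / 12 and entropy ln w."""
    width = sigma * math.sqrt(12.0)
    return math.log(width)

def laplace_entropy(sigma):
    """Laplace density of scale b has variance 2 b^2 and entropy 1 + ln(2b)."""
    scale = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * scale)

sigma = 1.0
for name, fn in [("gaussian", gaussian_entropy),
                 ("laplace", laplace_entropy),
                 ("uniform", uniform_entropy)]:
    print(name, fn(sigma))
```

For any fixed variance the Gaussian value is the largest of the three, consistent with the maximum entropy result. Note also that the differential entropy becomes negative for sufficiently small $\sigma$, unlike the discrete entropy.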

Suppose we have a joint distribution $p\left(\mathbf{x},\mathbf{y}\right)$ from which we draw pairs of values of $\mathbf{x}$ and $\mathbf{y}$. If a value of $\mathbf{x}$ is already known, then the additional information needed to specify the corresponding value of $\mathbf{y}$ is given by $-\ln p\left(\mathbf{y}|\mathbf{x}\right)$. Thus the average additional information needed to specify $\mathbf{y}$ can be written as 
$$H\left[\mathbf{y}|\mathbf{x}\right]=-\int\int p\left(\mathbf{y},\mathbf{x}\right)\ln p\left(\mathbf{y}|\mathbf{x}\right)\mathrm{d}\mathbf{y}\mathrm{d}\mathbf{x}$$
which is called the conditional entropy of $\mathbf{y}$ given $\mathbf{x}$.

Using the product rule, it is easily seen that the conditional entropy satisfies the relation
$$H\left[\mathbf{x},\mathbf{y}\right]=H\left[\mathbf{y}|\mathbf{x}\right]+H\left[\mathbf{x}\right]$$
where $H\left[\mathbf{x},\mathbf{y}\right]$ is the differential entropy of $p\left(\mathbf{x},\mathbf{y}\right)$ and $H\left[\mathbf{x}\right]$ is the differential entropy of the marginal distribution $p\left(\mathbf{x}\right)$. Thus the information needed to describe $\mathbf{x}$ and $\mathbf{y}$ is given by the sum of the information needed to describe $\mathbf{x}$ alone plus the additional information required to specify $\mathbf{y}$ given $\mathbf{x}$.
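
The same decomposition holds for discrete variables, with sums in place of integrals, which makes it easy to verify numerically. The sketch below does this for a small joint probability table; the table entries and function names are hypothetical, and every row of the table is assumed to have nonzero total probability.

```python
import numpy as np

def entropy(p):
    """Entropy in nats of a probability table, skipping zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def conditional_entropy_y_given_x(joint):
    """H[y|x] = -sum_{x,y} p(x, y) ln p(y|x), with rows indexing x."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)  # marginal p(x); assumed nonzero
    p_y_given_x = joint / p_x
    mask = joint > 0
    return float(-np.sum(joint[mask] * np.log(p_y_given_x[mask])))

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])

lhs = entropy(joint)                                              # H[x, y]
rhs = conditional_entropy_y_given_x(joint) + entropy(joint.sum(axis=1))
print(lhs, rhs)  # the two values agree up to rounding
```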

### 1.6.1 Relative entropy and mutual information

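Consider some unknown distribution $p\left(\mathbf{x}\right)$, and suppose that we have modelled it using an approximating distribution $q\left(\mathbf{x}\right)$. If we use $q\left(\mathbf{x}\right)$ to construct a coding scheme for transmitting values of $\mathbf{x}$ to a receiver, then the average additional amount of information required to specify the value of $\mathbf{x}$, as a result of using $q\left(\mathbf{x}\right)$ instead of the true distribution $p\left(\mathbf{x}\right)$, is given by
$$\begin{aligned}
KL\left(p\|q\right)&=-\int p\left(\mathbf{x}\right)\ln q\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}-\left(-\int p\left(\mathbf{x}\right)\ln p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}\right) \\
&=-\int p\left(\mathbf{x}\right)\ln \left\{\frac{q\left(\mathbf{x}\right)}{p\left(\mathbf{x}\right)}\right\}\mathrm{d}\mathbf{x}
\end{aligned}$$
This is known as the relative entropy or Kullback-Leibler divergence, or KL divergence, between the distributions $p\left(\mathbf{x}\right)$ and $q\left(\mathbf{x}\right)$. Note that it is not a symmetrical quantity, that is to say $KL\left(p\|q\right)\not\equiv KL\left(q\|p\right)$.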

The KL divergence satisfies $KL\left(p\|q\right)\geqslant0$ with equality if, and only if, $p\left(\mathbf{x}\right)=q\left(\mathbf{x}\right)$.

Thus we can interpret the KL divergence as a measure of the dissimilarity of the two distributions $p\left(\mathbf{x}\right)$ and $q\left(\mathbf{x}\right)$.
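
For discrete distributions the integral becomes a sum, and the properties above can be checked directly. In the sketch below the function name and the two example distributions are assumptions; the code also assumes $q>0$ wherever $p>0$, since otherwise the divergence is infinite.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i ln(p_i / q_i) in nats; needs q_i > 0 where p_i > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical distributions over the same two states.
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # zero when the distributions coincide
print(kl_divergence(p, q))  # positive otherwise
print(kl_divergence(q, p))  # differs from KL(p || q): not symmetric
```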

Suppose that we try to approximate the unknown distribution $p\left(\mathbf{x}\right)$ using some parametric distribution $q\left(\mathbf{x}|\theta\right)$ governed by a set of adjustable parameters $\theta$, and that we have observed a finite set of training points $\mathbf{x}_n$, for $n=1,\dots,N$, drawn from $p\left(\mathbf{x}\right)$. Then the expectation with respect to $p\left(\mathbf{x}\right)$ can be approximated by a finite sum over these points, so that
$$KL\left(p\|q\right)\simeq\frac{1}{N}\sum_{n=1}^N\left\{-\ln q\left(\mathbf{x}_n|\theta\right)+\ln p\left(\mathbf{x}_n\right)\right\}.$$
The second term on the right-hand side is independent of $\theta$, and the first term is the negative log likelihood function for $\theta$ under the distribution $q\left(\mathbf{x}|\theta\right)$ evaluated using the training set. Thus we see that minimizing this KL divergence is equivalent to maximizing the likelihood function.

Now consider the joint distribution $p\left(\mathbf{x},\mathbf{y}\right)$ of two sets of variables. If the variables are independent, the joint distribution factorizes into the product of the marginals, $p\left(\mathbf{x},\mathbf{y}\right)=p\left(\mathbf{x}\right)p\left(\mathbf{y}\right)$. If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the KL divergence between the joint distribution and the product of the marginals, given by
$$\begin{aligned}
I\left[\mathbf{x},\mathbf{y}\right]&=KL\left(p\left(\mathbf{x},\mathbf{y}\right)\|p\left(\mathbf{x}\right)p\left(\mathbf{y}\right)\right) \\
&=-\int\int p\left(\mathbf{x},\mathbf{y}\right)\ln \left(\frac{p\left(\mathbf{x}\right)p\left(\mathbf{y}\right)}{p\left(\mathbf{x},\mathbf{y}\right)}\right)\mathrm{d}\mathbf{x}\mathrm{d}\mathbf{y}
\end{aligned}$$
which is called the mutual information between the variables $\mathbf{x}$ and $\mathbf{y}$. From the properties of the KL divergence, we see that $I\left[\mathbf{x},\mathbf{y}\right]\geqslant0$ with equality if, and only if, $\mathbf{x}$ and $\mathbf{y}$ are independent.

Using the sum and product rules of probability, we see that the mutual information is related to the conditional entropy through 
$$I\left[\mathbf{x},\mathbf{y}\right]=H\left[\mathbf{x}\right]-H\left[\mathbf{x}|\mathbf{y}\right]=H\left[\mathbf{y}\right]-H\left[\mathbf{y}|\mathbf{x}\right].$$
Thus we can view the mutual information as the reduction in the uncertainty about $\mathbf{x}$ by virtue of being told the value of $\mathbf{y}$ (or vice versa).
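
For discrete variables these relations can be verified on a small joint table. The sketch below computes the mutual information as a KL divergence and compares it with the entropy-based form $H\left[\mathbf{x}\right]+H\left[\mathbf{y}\right]-H\left[\mathbf{x},\mathbf{y}\right]$, which follows from combining the relation above with $H\left[\mathbf{x},\mathbf{y}\right]=H\left[\mathbf{y}|\mathbf{x}\right]+H\left[\mathbf{x}\right]$. The tables and function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Entropy in nats of a probability table, skipping zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I[x, y] = KL(p(x, y) || p(x) p(y)) for a discrete joint table."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)  # column vector of p(x)
    p_y = joint.sum(axis=0, keepdims=True)  # row vector of p(y)
    product = p_x * p_y                     # table of p(x) p(y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Hypothetical dependent and independent joint distributions.
dependent = np.array([[0.30, 0.10],
                      [0.15, 0.45]])
independent = np.outer([0.3, 0.7], [0.6, 0.4])

mi = mutual_information(dependent)
via_entropies = (entropy(dependent.sum(axis=1)) + entropy(dependent.sum(axis=0))
                 - entropy(dependent))
print(mi, via_entropies)                # agree up to rounding
print(mutual_information(independent))  # approximately zero
```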