## Chap 5 Machine Learning Basics

### 5.1 Learning Algorithms

#### 5.1.1 The Task, T

Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:
- Classification
- Classification with missing inputs
- Regression
- Transcription
- Machine translation
- Structured output
- Anomaly detection
- Synthesis and sampling
- Imputation of missing values
- Denoising
- Density estimation or probability mass function estimation

#### 5.1.2 The Performance Measure, P
#### 5.1.3 The Experience, E

Roughly speaking, __unsupervised learning__ involves observing several examples of a random vector x, and attempting to implicitly or explicitly learn the probability distribution __p(x)__, or some interesting properties of that distribution, while __supervised learning__ involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating __p(y|x)__.

A __design matrix__ is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature.

#### 5.1.4 Example: Linear Regression

Take a vector $x \in R^n$ as input and predict the value of a scalar $y \in R$ as its output. Let $\hat y$ be the value that our model predicts y should take on.

We define the output to be $$\hat y = w^T x$$ where $w \in R^n$ is a vector of parameters.

One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If $\hat y^{(test)}$ gives the predictions of the model on the test set, then the mean squared error is given by $$MSE_{test} = \frac{1}{m} || \hat y^{(test)} - y^{(test)}||^2_2$$

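We need to design an algorithm that will improve the weights w in a way that reduces $MSE_{test}$ when the algorithm is allowed to gain experience by observing a training set $(X^{(train)},y^{(train)})$. One intuitive way of doing this is to minimize the mean squared error on the training set, $MSE_{train}$.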

To minimize $MSE_{train}$, we can simply solve for where its gradient is 0. 

\begin{align}
    & \nabla_w MSE_{train} = 0 \\
    \Rightarrow & \nabla_w \frac{1}{m} ||\hat y^{(train)} - y^{(train)}||^2_2 = 0 \\
    \Rightarrow & \frac{1}{m} \nabla_w  ||X^{(train)}w - y^{(train)}||^2_2 = 0 \\
    \Rightarrow & \nabla_w  (X^{(train)}w - y^{(train)})^T (X^{(train)}w - y^{(train)}) = 0 \\
    \Rightarrow & 2X^{(train)^T} X^{(train)}w - 2 X^{(train)^T} y^{(train)} = 0 \\
    \Rightarrow & w = (X^{(train)^T} X^{(train)})^{-1} X^{(train)^T} y^{(train)} \\
\end{align}

The system of equations whose solution is given by the last line is known as the __normal equations__.
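
As a minimal sketch of the closed-form solution above, the following pure-Python code builds $X^TX$ and $X^Ty$ and solves the resulting linear system by Gaussian elimination. The function names `solve_linear_system` and `fit_normal_equations` and the toy data are assumptions made for illustration. In practice a linear algebra library would normally be used instead.

```python
def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b]; copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Use the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit_normal_equations(X, y):
    """Return w minimizing ||Xw - y||^2 by solving (X^T X) w = X^T y."""
    m, n = len(X), len(X[0])
    XtX = [[sum(X[k][i] * X[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    Xty = [sum(X[k][i] * y[k] for k in range(m)) for i in range(n)]
    return solve_linear_system(XtX, Xty)


# Toy design matrix: one example per row, two features per example.
X_train = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 1.0]]
true_w = [2.0, -1.0]
y_train = [sum(wi * xi for wi, xi in zip(true_w, row)) for row in X_train]
w_hat = fit_normal_equations(X_train, y_train)
```

Because the toy targets are generated without noise, `w_hat` recovers `true_w` up to floating-point error.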

The term __linear regression__ is often used to refer to a slightly more sophisticated model with one additional parameter - an intercept term b.

$$\hat y = w^T x + b$$

### 5.2 Capacity, Overfitting and Underfitting

The ability to perform well on previously unobserved inputs is called __generalization__.

When training a machine learning model, we have access to a training set, we can compute some error measure on the training set called the __training error__, and we reduce this training error. So far, what we have described is simply an optimization problem.

What separates machine learning from optimization is that we want the __generalization error__, also called the __test error__, to be low as well.

How can we affect performance on the test set when we get to observe only the training set? The field of __statistical learning theory__ provides some answers. If the training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make __some assumptions__ about how the training and test set are collected, then we can make some progress.

The train and test data are generated by a probability distribution over datasets called the __data generating process.__ 

We typically make a set of assumptions known collectively as the __i.i.d. assumptions__. 
- The examples in each dataset are __independent__ from each other
- The train set and test set are __identically  distributed__, drawn from the same probability distribution as each other.

This probabilistic framework and the __i.i.d. assumptions__ allow us to mathematically study the relationship between __training error and test error.__

The expected __training error__ of a randomly selected model is equal to the expected __test error__ of that model.

We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, _the expected test error is greater than or equal to the expected value of training error._ The factors determining how well a machine learning algorithm will perform are its ability to:

- Make the training error small.
- Make the gap between training and test error small.

__Underfitting__ occurs when the model is not able to obtain a sufficiently low error value on the training set. __Overfitting__ occurs when the gap between the training error and test error is too large.

We can control whether a model is more likely to overfit or underfit by altering its __capacity.__ A model’s capacity is its ability to fit a wide variety of functions. 

One way to control the capacity of a learning algorithm is by choosing its __hypothesis space__, the set of functions that the learning algorithm is allowed to select as being the solution.

_Machine learning algorithms will generally perform best when their __capacity is appropriate__ for the true complexity of the task they need to perform and the amount of training data they are provided with._

If we have more parameters than training examples, we have little chance of choosing a solution that generalizes well when so many wildly different solutions exist.

Capacity is __not__ determined only by the choice of model. The model specifies which __family of functions__ the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the __representational capacity__ of the model.

The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.

The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the very general non-convex optimization problems involved in deep learning.

#### 5.2.1 The No Free Lunch Theorem

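Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic: inferring general rules from a limited set of examples is not logically valid.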

In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.

The __no free lunch theorem__ for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. 

In other words, in some sense, __no machine learning algorithm is universally any better than any other.__ The most sophisticated algorithm we can conceive of has the same average performance __(over all possible tasks)__ as merely predicting that every point belongs to the same class.

Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

This means that the goal of machine learning research is __not__ to seek a __universal learning algorithm__ or the absolute best learning algorithm. Instead, our goal is to understand __what kinds of distributions are relevant to the “real world”__ that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.

#### 5.2.2 Regularization
The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building __a set of preferences__ into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better.

We can thus control the performance of our algorithms by choosing what __kind of functions__ we allow them to draw solutions from, as well as by controlling the __amount of these functions.__ We can also give a learning algorithm a preference for one solution in its hypothesis space over another. This means that both functions are eligible, but one is preferred.

Modify the training criterion for linear regression to include __weight decay.__ $$J(w) = MSE_{train} + \lambda w^Tw$$

Minimizing J(w) results in a choice of weights that make a tradeoff between __fitting the training data__ and __being small__. This gives us solutions that have a smaller slope, or put weight on fewer of the features.
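
To make the tradeoff concrete, here is a small sketch for the scalar model $\hat y = wx$. Setting the derivative of $J(w)$ to zero gives $w = \sum_i x^{(i)} y^{(i)} / (\sum_i (x^{(i)})^2 + m\lambda)$. The function name `fit_weight_decay_1d` and the toy data are assumptions for illustration.

```python
def fit_weight_decay_1d(xs, ys, lam):
    """Minimize J(w) = MSE_train + lam * w**2 for the scalar model y_hat = w * x.

    Setting dJ/dw = 0 gives w = sum(x*y) / (sum(x*x) + m*lam).
    """
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + m * lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
# Larger lam puts more emphasis on keeping the weight small.
weights = {lam: fit_weight_decay_1d(xs, ys, lam) for lam in (0.0, 0.1, 1.0, 10.0)}
```

With $\lambda = 0$ the fit recovers the slope 2 exactly, and the learned weight shrinks toward zero as $\lambda$ grows.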

In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as __regularization.__ 

__Regularization__ is any modification we make to a
learning algorithm that is intended to reduce its generalization error but not its training error.

The no free lunch theorem has made it clear that there is __no best machine learning algorithm__, and, in particular, __no best form of regularization.__ Instead we must choose a form of regularization that is well-suited to the particular task we want to solve.

### 5.3 Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called __hyperparameters.__

The setting must be a hyperparameter because it is __not appropriate to learn__ that hyperparameter on the training set. This applies to all hyperparameters that __control model capacity__. If learned on the training set, such hyperparameters would always choose the __maximum possible model capacity__, resulting in __overfitting__.

A __test set__, composed of examples coming from the same distribution as the training set, can be used to estimate the __generalization error__ of a learner after the learning process has completed.

It is important that the test examples are __not__ used in any way to make choices about the model, including its hyperparameters. For this reason, __no__ example from the __test set__ can be used in the __validation set.__

We split the training data into two __disjoint subsets__. One of these subsets is used to __learn the parameters__. The other subset is our validation set, used to __estimate the generalization error during or after training__, allowing for the hyperparameters to be updated accordingly.

The subset of data used to __guide the selection of hyperparameters__ is called the __validation set__. Typically, one uses about __80%__ of the training data for training and __20%__ for validation.
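
A minimal sketch of such a split is shown here. The helper name `train_validation_split`, the shuffling step, and the fixed seed are assumptions for illustration, not part of the text above.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle the examples and split them into disjoint train/validation lists."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(examples) * validation_fraction))
    val_idx = indices[:n_val]
    train_idx = indices[n_val:]
    return [examples[i] for i in train_idx], [examples[i] for i in val_idx]


data = list(range(10))
train, val = train_validation_split(data)  # 8 training and 2 validation examples
```

The test set is deliberately absent from this function: only the training data is divided.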

In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years ... we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then __do not reflect the true field performance__ of a trained system.

#### 5.3.1 Cross-Validation
Dividing the dataset into a fixed __training set__ and a fixed __test set__ can be problematic if it results in the test set being small. A small test set implies __statistical uncertainty__ around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.

When the dataset is too small, alternative procedures enable one to use all of the examples in the estimation of the mean test error, at the price of increased computational cost.

These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset.

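The most common of these is the __k-fold cross-validation__ procedure, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. On trial i, the i-th subset is used as the test set and the rest of the data is used as the training set. The __test error__ may then be estimated by taking __the average test error across k trials.__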
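
The procedure can be sketched as follows. The names `k_fold_splits` and `cross_validation_error` are hypothetical, `train_and_evaluate` stands for any user-supplied routine that trains on the first index list and returns the error on the second, and the examples are assumed to have been shuffled beforehand.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Fold sizes differ by at most one when k does not divide n_examples.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validation_error(n_examples, k, train_and_evaluate):
    """Average the test error returned by `train_and_evaluate` over k folds."""
    errors = [train_and_evaluate(train_idx, test_idx)
              for train_idx, test_idx in k_fold_splits(n_examples, k)]
    return sum(errors) / k
```

Every example appears in exactly one test fold, so all of the data contributes to the error estimate, at the cost of training k times.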

### 5.4 Estimators, Bias and Variance

#### 5.4.1 Point Estimation
__Point estimation__ is the attempt to provide the single “best” prediction of some quantity of interest.

Let $\{x^{(1)}, \ldots, x^{(m)}\}$ be a set of m independent and identically distributed (i.i.d.) data points. A __point estimator__ or __statistic__ is __any function__ of the data: $$\hat \theta_m = g(x^{(1)}, \ldots, x^{(m)})$$

The definition does not require g to return a value that is close to the true $\theta$ or even that the range of g is the same as the set of allowable values of $\theta$. 

This definition of a point estimator is very __general__ and allows the designer of an estimator __great flexibility__. A __good estimator__ is a function whose output is __close to the true underlying $\theta$__ that generated the training data.

#### 5.4.2 Bias 
The bias of an estimator is defined as:
$$bias( \hat \theta_m) = \mathbb{E}(\hat \theta_m) - \theta$$ 
where the expectation is over the data (seen as __samples from a random variable__) and $\theta$ is the __true underlying value of $\theta$__ used to define the data generating distribution.

An estimator $\hat \theta_m$ is said to be __unbiased__ if $bias(\hat \theta_m) = 0$, which implies that $\mathbb{E}(\hat \theta_m) = \theta$.

#### Example: Bias of mean estimator of Bernoulli Distribution
Consider a set of m samples that are independently and identically distributed (i.i.d) according to a Bernoulli distribution with mean $\theta$: 
$$P(x^{(i)};\theta) = \theta^{x^{(i)}} (1-\theta)^{(1-x^{(i)})}$$

A common estimator for the $\theta$ parameter of this distribution is the __mean of the training samples__: 
$$\hat \theta_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} $$

\begin{align}
    bias(\hat \theta_m) = & \mathbb{E}[\hat \theta_m] - \theta \\
    = & \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[x^{(i)}] - \theta \\
    = & \frac{1}{m} \sum_{i=1}^{m} (0 \cdot \theta^0 (1-\theta)^{1-0} + 1 \cdot \theta^1 (1-\theta)^{1-1}) - \theta \\
    = & \frac{1}{m} \sum_{i=1}^{m} \theta - \theta \\
    = & \theta - \theta = 0 \\
\end{align}

> The expected value is the sum of each value times its probability, and $x^{(i)}$ is either zero or one.

Since $bias(\hat \theta) = 0$, we say that our estimator, __mean of the training samples__, $\hat \theta$ is unbiased.

#### Example: Bias of mean estimator of Gaussian Distribution
Consider a set of m samples that are independently and identically distributed (i.i.d.) according to a Gaussian distribution with unknown mean $\mu$ and variance $\sigma^2$: $$P(x^{(i)};\mu,\sigma^2) = \sqrt{\frac{1}{2 \pi \sigma^{2}}} \exp(- \frac{1}{2 \sigma^2}(x^{(i)} - \mu)^2)$$

A common estimator of the Gaussian mean parameter is known as the __sample mean__ 
$$\hat \mu_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

\begin{align}
    bias(\hat \mu_m) = & \mathbb{E}[\hat \mu_m] - \mu \\
    = & \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}[x^{(i)}] - \mu \\
    = & \frac{1}{m} \sum_{i=1}^{m} \mu -\mu = 0 \\
\end{align}

The __sample mean__ is an __unbiased estimator__ of the Gaussian mean parameter. The expected value of the sample mean equals the true mean of the Gaussian distribution:

$$\mathbb{E}[\hat \mu_m] = \mu$$

#### Example: Bias of variance estimator of Gaussian Distribution
The first estimator of variance we consider is known as the __sample variance__
$$\hat \sigma_m^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)}-\hat \mu_m)^2 $$ 

\begin{align}
    bias(\hat \sigma_m^2) = & \mathbb{E}[\hat \sigma_m^2] - \sigma^2 \\
    = & \mathbb{E}(\frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \hat \mu_m)^2) - \sigma^2  \\
    = & \frac{1}{m} \sum_{i=1}^{m}[ \mathbb{E}(x^{(i)^2}) - 2\mathbb{E}(x^{(i)}\hat \mu_m) + \mathbb{E}(\hat \mu_m^2)]- \sigma^2 \\
\end{align}

> $$\mathbb{E}[x_p x_q] = \mathbb{E}[x_p] \mathbb{E}[x_q] = \mu^2, \; p \neq q$$ 
> $$\mathbb{E}[x_p x_q] = \mathbb{E}[x_p^2] = \sigma^2 + \mu^2, \; p = q$$ 

> $\hat \mu_m$ averages m samples. Exactly one of them is $x_j$ itself (the $p = q$ case) and the other $m-1$ are independent of $x_j$ (the $p \neq q$ case).

> $$\mathbb{E}[x_j \hat \mu_m] = \frac{(m-1) \mu^2 + (\sigma^2 + \mu^2) }{m}$$ 

> $\hat \mu_m^2$ expands into $m^2$ products $x_p x_q$, each weighted by $\frac{1}{m^2}$: $m$ of them have $p = q$ and $m^2-m$ have $p \neq q$.

> $$\mathbb{E}[\hat \mu_m^2] = \frac{(m^2-m)\mu^2 + m(\sigma^2+\mu^2)}{m^2}$$ 

\begin{align}
    = & \frac{1}{m} \sum_{i=1}^{m}[ \sigma^2 + \mu^2 - 2(\frac{(m-1) \mu^2 + (\sigma^2 + \mu^2) }{m}) + \frac{(m^2-m)\mu^2 + m(\sigma^2+\mu^2)}{m^2}]- \sigma^2 \\
    = & [\sigma^2 + \mu^2 - 2(\frac{(m-1) \mu^2 + (\sigma^2 + \mu^2) }{m}) + \frac{(m^2-m)\mu^2 + m(\sigma^2+\mu^2)}{m^2}]- \sigma^2 \\
    = & \frac{m-1}{m} \sigma^2 - \sigma^2 = -\frac{\sigma^2}{m} \\
\end{align}

We conclude that the bias of $\hat \sigma_m^2$ is $-\frac{\sigma^2}{m}$. Therefore,
the __sample variance__ is a __biased estimator__.

We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the “best” estimators.

The second estimator of variance we consider is known as the __unbiased sample variance estimator__

$$\tilde \sigma_m^2 = \frac{1}{m-1} \sum_{i=1}^m (x^{(i)} - \hat \mu_m )^2 $$

\begin{align}
    bias(\tilde \sigma_m^2) = & \mathbb{E}[\tilde \sigma_m^2] - \sigma^2 \\
    = & \mathbb{E}(\frac{1}{m-1} \sum_{i=1}^{m} (x^{(i)} - \hat \mu_m)^2) - \sigma^2 \\
    = & \frac{m}{m-1} \mathbb{E}[\hat \sigma_m^2] - \sigma^2 \\
    = & \frac{m}{m-1} (\frac{m-1}{m} \sigma^2) - \sigma^2 = 0 \\
\end{align}

The __unbiased sample variance__ estimator is an unbiased estimator of the Gaussian variance parameter. Its expected value equals the true variance of the Gaussian distribution:

$$\mathbb{E}[\tilde \sigma_m^2] = \sigma^2$$
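
The two bias results can be checked empirically with a small Monte Carlo sketch. The function names, the choice of $m = 5$ and $\sigma = 1$, the number of trials, and the fixed seed are all assumptions for illustration. The averages only approximate the expectations.

```python
import random


def biased_variance(xs):
    """Sample variance with the 1/m normalization."""
    m = len(xs)
    mu = sum(xs) / m
    return sum((x - mu) ** 2 for x in xs) / m


def unbiased_variance(xs):
    """Sample variance with the 1/(m-1) normalization."""
    m = len(xs)
    return biased_variance(xs) * m / (m - 1)


def average_estimate(estimator, m, sigma, trials, seed=0):
    """Average the estimator over many independently sampled Gaussian datasets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(m)]
        total += estimator(xs)
    return total / trials


m, sigma = 5, 1.0
avg_biased = average_estimate(biased_variance, m, sigma, trials=20000)
avg_unbiased = average_estimate(unbiased_variance, m, sigma, trials=20000)
```

For these settings the derivations predict averages close to $\frac{m-1}{m}\sigma^2 = 0.8$ for the biased estimator and close to $\sigma^2 = 1$ for the unbiased one.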

#### 5.4.3 Variance and Standard Error 
Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. The __variance of an estimator__ is simply the variance $$Var(\hat \theta)$$

The square root of the variance is called the __standard error__, denoted $SE(\hat \theta)$.

The __variance or the standard error of an estimator__ provides a measure of _how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data generating process._

The standard error of the mean is given by 

\begin{align}
    SE(\hat \mu_m) = & \sqrt{ Var[\frac{1}{m} \sum_{i=1}^{m} x^{(i)}]} \\
    = & \sqrt{ \frac{1}{m^2} Var[\sum_{i=1}^{m} x^{(i)}]} \\
    = & \sqrt{ \frac{1}{m^2} m \sigma^2} \\
    = & \frac{\sigma}{\sqrt{m}} \\
\end{align}

where $\sigma^2$ is the true variance of the samples $x^{(i)}$ 

> The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean.

The standard error is often estimated by using an estimate of $\sigma$. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation, but are still used in practice.

The __standard error of the mean__ is very useful in machine learning experiments. We often __estimate the generalization error__ by computing the __sample mean of the error__ on the __test set__. The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the central limit theorem, which tells us that __the mean will be approximately distributed with a normal distribution__, we can use the standard error to compute the probability that the true expectation falls in any chosen interval.

In machine learning experiments, it is common to say that algorithm A is _better than_ algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is less than the lower bound of the 95% confidence interval for the error of algorithm B.
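
This comparison can be sketched as follows, using the normal approximation in which the 95% interval is the mean plus or minus 1.96 standard errors. The function names and the per-example error lists are hypothetical. The standard error is estimated from the unbiased sample variance, which is only an approximation, as noted above.

```python
import math


def mean_and_confidence_interval(errors, z=1.96):
    """Return (mean, lower, upper) of an approximate 95% interval for the mean error."""
    m = len(errors)
    mean = sum(errors) / m
    # Unbiased sample variance; its square root is a slightly biased estimate of sigma.
    variance = sum((e - mean) ** 2 for e in errors) / (m - 1)
    standard_error = math.sqrt(variance / m)
    return mean, mean - z * standard_error, mean + z * standard_error


def is_better(errors_a, errors_b):
    """True if the upper bound for A lies below the lower bound for B."""
    _, _, upper_a = mean_and_confidence_interval(errors_a)
    _, lower_b, _ = mean_and_confidence_interval(errors_b)
    return upper_a < lower_b


# Hypothetical per-example 0/1 test errors for two algorithms.
errors_a = [0] * 90 + [1] * 10  # 10% error rate
errors_b = [0] * 60 + [1] * 40  # 40% error rate
a_beats_b = is_better(errors_a, errors_b)
```

With only 100 test examples each interval is several percentage points wide, which is why small differences in error rate cannot be declared significant under this rule.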

#### Example: Variance of mean estimator of Bernoulli Distribution
Consider a set of m samples that are independently and identically distributed (i.i.d) according to a Bernoulli distribution with mean $\theta$: 
$$P(x^{(i)};\theta) = \theta^{x^{(i)}} (1-\theta)^{(1-x^{(i)})} $$
A common estimator for the $\theta$ parameter of this distribution is the __mean of the training samples__: 
$$\hat \theta_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

This time we are interested in computing the variance of the estimator $\hat \theta_m$

\begin{align}
    Var(\hat \theta_m) = & Var(\frac{1}{m} \sum_{i=1}^{m} x^{(i)}) \\
    = & \frac{1}{m^2} \sum_{i=1}^{m} Var(x^{(i)}) \\
    = & \frac{1}{m^2} \sum_{i=1}^{m} \theta(1-\theta) \\
    = & \frac{1}{m^2} m\theta(1-\theta) \\
    = & \frac{1}{m} \theta(1-\theta) \\
\end{align}

The variance of the estimator decreases as a function of m, the number of examples in the dataset. This is a common property of popular estimators. 
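
Both Bernoulli results, zero bias and variance $\frac{1}{m}\theta(1-\theta)$, can be verified exactly for a small m by enumerating every possible dataset. This is a sketch: the function names and the values $\theta = 3/10$, $m = 6$ are assumptions for illustration, and exact rational arithmetic is used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product


def exact_moments_of_estimator(estimator, theta, m):
    """Return (expectation, variance) of estimator(x) over all 2**m Bernoulli datasets."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for xs in product((0, 1), repeat=m):
        # Probability of this dataset under i.i.d. Bernoulli(theta) sampling.
        prob = Fraction(1)
        for x in xs:
            prob *= theta if x == 1 else 1 - theta
        value = estimator(xs)
        mean += prob * value
        second_moment += prob * value * value
    return mean, second_moment - mean * mean


def sample_mean(xs):
    return Fraction(sum(xs), len(xs))


theta, m = Fraction(3, 10), 6
expectation, variance = exact_moments_of_estimator(sample_mean, theta, m)
bias = expectation - theta
```

Here `bias` is exactly zero and `variance` equals `theta * (1 - theta) / m`, matching the two derivations.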

#### 5.4.4 Trading off Bias and Variance to Minimize Mean Squared Error
> Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

Imagine that we are interested in approximating the function shown and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?

The most common way to negotiate this trade-off is to use __cross-validation__. Alternatively, we can also compare the __mean squared error (MSE) of the estimates__: 

\begin{align}
    MSE = & \mathbb{E}[(\hat \theta_m - \theta)^2] \\
    = & \mathbb{E}(\hat \theta_m^2 -2\hat \theta_m \theta + \theta^2) \\
    = & \mathbb{E}(\hat \theta_m)^2 -2 \mathbb{E}(\hat \theta_m) \theta + \theta^2 + \mathbb{E}(\hat \theta_m^2) - \mathbb{E}(\hat \theta_m)^2 \\
    = & (\mathbb{E}(\hat \theta_m) - \theta)^2 + (\mathbb{E}(\hat \theta_m^2) - \mathbb{E}(\hat \theta_m)^2) \\
    = & bias(\hat \theta_m)^2 + Var(\hat \theta_m) \\
\end{align}

The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter $\theta$. Evaluating the MSE incorporates __both the bias and the variance.__
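
The decomposition can be checked exactly on a small example. The sketch below uses a deliberately biased estimator of a Bernoulli mean, $(\sum_i x^{(i)} + 1)/(m + 2)$, which is an assumption chosen purely for illustration, and enumerates all $2^m$ datasets with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def exact_expectation(fn, theta, m):
    """E[fn(x)] over all 2**m datasets of i.i.d. Bernoulli(theta) samples."""
    total = Fraction(0)
    for xs in product((0, 1), repeat=m):
        prob = Fraction(1)
        for x in xs:
            prob *= theta if x == 1 else 1 - theta
        total += prob * fn(xs)
    return total


def smoothed_mean(xs):
    # A deliberately biased estimator (add-one smoothing), for illustration only.
    return Fraction(sum(xs) + 1, len(xs) + 2)


theta, m = Fraction(1, 4), 5
mean = exact_expectation(smoothed_mean, theta, m)
bias = mean - theta
variance = exact_expectation(lambda xs: (smoothed_mean(xs) - mean) ** 2, theta, m)
mse = exact_expectation(lambda xs: (smoothed_mean(xs) - theta) ** 2, theta, m)
```

For this estimator the bias is nonzero, yet `mse == bias ** 2 + variance` holds exactly.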

> The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting.

> In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias.

#### 5.4.5 Consistency
We usually wish that, as the number of data points m in our dataset increases, our point estimates converge to the true value of the corresponding parameters. This property is known as __consistency.__

Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows. However, the reverse is not true—asymptotic __unbiasedness does not imply consistency.__

For example, consider estimating the mean parameter $\mu$ of a normal distribution $N(x;\mu,\sigma^2)$, with a dataset consisting of m samples: $\{x^{(1)}, \ldots, x^{(m)}\}$. We could use the first sample $x^{(1)}$ of the dataset as an unbiased estimator: $\hat \theta_m = x^{(1)}$. In that case, $\mathbb{E}(\hat \theta_m) = \theta$, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that $\hat \theta_m \rightarrow \theta$ as $m \rightarrow \infty$.
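
A small simulation sketch illustrates the difference. It measures how much each estimator varies across independently resampled datasets as m grows. The function names, the values of $\mu$ and $\sigma$, the number of trials, and the seed are assumptions for illustration.

```python
import random


def spread_of_estimator(estimator, m, trials=500, mu=1.0, sigma=2.0, seed=0):
    """Empirical variance of estimator(dataset) across independently resampled datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(m)]
        estimates.append(estimator(xs))
    center = sum(estimates) / trials
    return sum((e - center) ** 2 for e in estimates) / trials


def first_sample(xs):
    """Unbiased but inconsistent: ignores all but the first data point."""
    return xs[0]


def sample_mean(xs):
    """Unbiased and consistent."""
    return sum(xs) / len(xs)


# Map each dataset size to (spread of first_sample, spread of sample_mean).
spreads = {m: (spread_of_estimator(first_sample, m),
               spread_of_estimator(sample_mean, m))
           for m in (10, 500)}
```

The spread of `first_sample` stays near $\sigma^2$ regardless of m, while the spread of `sample_mean` shrinks roughly like $\sigma^2 / m$.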

### 5.5 Maximum Likelihood Estimation
Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can _derive specific functions that are good estimators for different models._
The most common such principle is the __maximum likelihood principle__.

Consider a set of m examples $X = \{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently from the true but unknown data generating distribution $p_{data}(x)$.

Let $p_{model}(x;\theta)$ be a parametric family of probability distributions over the same space indexed by $\theta$. In other words, $p_{model}(x;\theta)$ maps any configuration x to a real number estimating the true probability $p_{data}(x)$.

The maximum likelihood estimator for $\theta$ is then defined as 

\begin{align}
    \theta_{ML} = & arg \max_{\theta} p_{model}(X;\theta) \\
    = & arg \max_{\theta} \prod_{i=1}^m p_{model}(x^{(i)};\theta) \\
\end{align}

We observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum.

$$
\theta_{ML} = arg \max_{\theta} \sum_{i=1}^m \log p_{model}(x^{(i)};\theta)
$$

Because the arg max does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion that is expressed as __an expectation with respect to the empirical distribution__ $\hat p_{data}$ defined by the training data:

$$
\theta_{ML} = arg \max_{\theta} \mathbb{E}_{x \sim \hat p_{data}} \log p_{model}(x;\theta)
$$
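
As a concrete sketch, the following code finds the maximum likelihood estimate for a Bernoulli model by a brute-force search over a grid of $\theta$ values, using the averaged log-likelihood. The grid search and the function names are assumptions for illustration. For this model the maximizer is known in closed form to be the sample mean.

```python
import math


def average_log_likelihood(theta, xs):
    """Mean of log p_model(x; theta) over the dataset for a Bernoulli model."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs) / len(xs)


def bernoulli_mle_grid(xs, steps=100):
    """Return the grid point theta in (0, 1) with the highest average log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda theta: average_log_likelihood(theta, xs))


data = [1, 0, 1, 1]
theta_ml = bernoulli_mle_grid(data)
```

Dividing by m inside `average_log_likelihood` changes the value of the objective but not the location of its maximum, so `theta_ml` coincides with the sample mean of `data`.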

One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution $\hat p_{data}$ defined by the training
set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by

$$
D_{KL} (\hat p_{data} || p_{model}) = \mathbb{E}_{x \sim \hat p_{data}} [ \log \hat p_{data}(x) - \log p_{model}(x)]
$$

The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize 

$$
- \mathbb{E}_{x \sim \hat p_{data}} [ \log p_{model}(x) ]
$$

We can thus _see maximum likelihood as an attempt to make the model distribution match the empirical distribution $\hat p_{data}$_. Ideally, we would like to match the true data generating distribution $p_{data}$, but we have no direct access to this distribution.

While the optimal $\theta$ is the same whether we maximize the likelihood or minimize the KL divergence, the values of the two objective functions differ. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes __minimization of the negative log-likelihood (NLL)__, or equivalently, __minimization of the cross entropy__. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when x is real-valued.
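
The relationship between the two objectives can be sketched for discrete distributions, where the KL divergence equals the cross entropy minus the entropy of the empirical distribution. The function names and the three example distributions are hypothetical.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) [log p(x) - log q(x)]."""
    return sum(px * (math.log(px) - math.log(qx)) for px, qx in zip(p, q) if px > 0)


p_hat = [0.5, 0.25, 0.25]  # hypothetical empirical distribution
q_far = [0.8, 0.1, 0.1]    # a poor model
q_near = [0.5, 0.3, 0.2]   # a closer model

# The entropy of p_hat is H(p_hat, p_hat); it does not depend on the model.
entropy = cross_entropy(p_hat, p_hat)
gap_far = kl_divergence(p_hat, q_far)
gap_near = kl_divergence(p_hat, q_near)
```

Because `entropy` is constant with respect to the model, ranking models by cross entropy and ranking them by KL divergence give the same order, but only the KL divergence reaches zero when the model matches `p_hat` exactly.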

#### 5.5.1 Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability $P(y | x ; \theta)$ in order to predict y given x.

This is actually the most common situation because it forms the basis for
most supervised learning. 

If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is 
$$\theta_{ML} = arg \max_{\theta} P(Y|X;\theta)$$

If the examples are assumed to be i.i.d., then this can be decomposed into
$$\theta_{ML} = arg \max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)};\theta)$$

#### Example: Linear Regression as Maximum Likelihood
Instead of producing a single prediction $\hat y$, we now think of the model as producing a conditional distribution $p(y | x)$.
We can imagine that with an infinitely large training set, we might see several training examples with the same input value x but different values of y.

The goal of the learning algorithm is now to fit the distribution $p(y | x)$ to all of those different y values that are all compatible with x.

We define $p(y | x) = N(y ; \hat y(x;w) , \sigma^2)$. The function $\hat y(x; w)$ gives the prediction of the mean of the Gaussian. We assume that the variance is fixed to some constant $\sigma^2$ chosen by the user.

The __conditional log-likelihood__ is given by 

\begin{align}
    & \sum_{i=1}^{m} \log p(y^{(i)}|x^{(i)};\theta) \\
    = & -m \log \sigma - \frac{m}{2} \log(2 \pi) - \sum_{i=1}^{m} \frac{|| \hat y^{(i)} - y^{(i)}||^2}{2 \sigma^2} \\
\end{align}

> Assume P is a normal distribution. $$P(y^{(i)};\hat y^{(i)},\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(y^{(i)} - \hat y^{(i)})^2}{2 \sigma^2})$$

Comparing the log-likelihood with the mean squared error, 
$$
MSE_{train} = \frac{1}{m} \sum_{i=1}^{m} || \hat y^{(i)} - y^{(i)} ||^2
$$
We immediately see that maximizing the log-likelihood with respect to w yields
the same estimate of the parameters w as does minimizing the mean squared error.
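
To spell out the step, here is a short derivation sketch. It assumes, as stated above, that $\sigma^2$ is a fixed constant, so terms that do not depend on w and positive scale factors can be dropped or changed without moving the optimum:

\begin{align}
    w_{ML} = & arg \max_{w} [ -m \log \sigma - \frac{m}{2} \log(2 \pi) - \sum_{i=1}^{m} \frac{|| \hat y^{(i)} - y^{(i)}||^2}{2 \sigma^2} ] \\
    = & arg \max_{w} [ - \frac{1}{2 \sigma^2} \sum_{i=1}^{m} || \hat y^{(i)} - y^{(i)}||^2 ] \\
    = & arg \min_{w} \frac{1}{m} \sum_{i=1}^{m} || \hat y^{(i)} - y^{(i)}||^2 \\
    = & arg \min_{w} MSE_{train} \\
\end{align}

The two criteria take different values but have the same location of the optimum.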

#### 5.5.2 Properties of Maximum Likelihood
The main appeal of the __maximum likelihood estimator__ is that it can be shown to
be the __best estimator asymptotically__, as the number of examples m → ∞, in terms
of its rate of convergence as m increases.


### 5.8 Unsupervised Learning Algorithms
__Unsupervised learning__ refers to most attempts to extract information from a distribution that do not require human labor to annotate examples.

A classic unsupervised learning task is to find the __“best” representation__ of the
data. By ‘best’ we can mean different things, but generally speaking we are looking
for a representation that preserves as much information about x as possible while
obeying some penalty or constraint aimed at keeping the representation simpler or
more accessible than x itself.

There are multiple ways of defining a __simpler__ representation. Three of the most common include:
- __Lower dimensional representations__ attempt to compress as much information about x as possible in a smaller representation.
- __Sparse representations__ embed the dataset into a representation whose entries are mostly zeroes for most inputs. The use of sparse representations typically requires increasing the dimensionality of the representation, so that the representation becoming mostly zeroes does not discard too much information.
- __Independent representations__ attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.

#### 5.8.1 Principal Components Analysis
PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other.

PCA learns an orthogonal, linear transformation of the data that projects an
input x to a representation z. We can use PCA as a simple and effective __dimensionality reduction method__ that preserves as much of the information in the data as possible (again, as measured by least-squares reconstruction error).

Let us consider the m × n dimensional design matrix X. We will assume that
the data has a mean of zero, $\mathbb{E}[x] = 0$. If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.

The __unbiased sample covariance matrix__ associated with X is given by:
$$ Var[x] = \frac{1}{m-1} X^TX
$$

PCA finds a representation (through linear transformation) $z = x^TW$ where $Var[z]$ is diagonal. (This is proved below.)

In a previous section, we saw that the principal components of a design matrix X are given by the eigenvectors of $X^TX$.

$$
X^TX = W \Lambda W^T
$$

The principal components may also be obtained via the singular value decomposition. Let W be the right singular vectors in the decomposition $X=U \Sigma W^T$.

We then recover the original eigenvector equation with W as the eigenvector basis:

$$
X^TX = (U \Sigma W^T)^T (U \Sigma W^T) = W \Sigma^2 W^T, \quad \text{where } \Sigma^2 = \Lambda
$$

Using the SVD of X, and the fact that $U^TU = I$ because the columns of U are orthonormal, we can express the variance of x as:

\begin{align}
    Var[x] = & \frac{1}{m-1} X^TX \\
    = & \frac{1}{m-1} (U \Sigma W^T)^T (U \Sigma W^T) \\
    = & \frac{1}{m-1} W \Sigma^T U^T U \Sigma W^T \\
    = & \frac{1}{m-1} W \Sigma^2 W^T
\end{align}

If we take $z = x^T W$ (so that $Z = XW$ for the whole design matrix), we can ensure that the covariance of z is diagonal as required, this time using $W^TW = I$:

\begin{align}
    Var[z] = & \frac{1}{m-1} Z^TZ \\
    = & \frac{1}{m-1} W^T X^T X W \\
    = & \frac{1}{m-1} W^T W \Sigma^2 W^T W \\
    = & \frac{1}{m-1} \Sigma^2
\end{align}

The above analysis shows that when we project the data x to z, via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by $\Sigma^2$) which immediately implies that the __individual elements of z are
mutually uncorrelated__. It is a simple example of a representation that attempts to _disentangle the unknown factors of
variation_ underlying the data.
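The derivation can be checked numerically. The sketch below is a minimal illustration, not a production PCA: it assumes synthetic correlated data and a hypothetical helper name `pca_svd`, takes W from `np.linalg.svd`, projects the centered data with $Z = XW$, and confirms that the covariance of Z is $\frac{1}{m-1}\Sigma^2$.

```python
import numpy as np

def pca_svd(X_raw):
    """Return (W, s, Z): right singular vectors, singular values, projected data."""
    # Center the data so that each feature has zero mean.
    X = X_raw - X_raw.mean(axis=0)
    # Thin SVD: X = U diag(s) W^T. NumPy returns W^T as the third output.
    U, s, Wt = np.linalg.svd(X, full_matrices=False)
    W = Wt.T
    # Project every example: each row of Z is z^T = x^T W.
    Z = X @ W
    return W, s, Z

rng = np.random.default_rng(0)
# Hypothetical correlated data: 500 examples, 3 features.
X_raw = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))

W, s, Z = pca_svd(X_raw)
m = Z.shape[0]
cov_z = Z.T @ Z / (m - 1)

# Off-diagonal entries vanish up to rounding; the diagonal is s**2 / (m - 1).
print(np.round(cov_z, 6))
```

Keeping only the first few columns of W (those with the largest singular values) would give the lower-dimensional representation described earlier.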

In the case of PCA, this disentangling takes the form of finding __a rotation__ of the input space (described by W) that aligns the principal axes of variance with the basis of the new representation space associated with z.

While correlation is an important category of dependency between elements of the data, we are also interested in __learning representations that disentangle more complicated forms of feature dependencies__. For this, we will need more than what can be done with a simple linear transformation.

#### 5.8.2 k-means Clustering

### 5.9 Stochastic Gradient Descent
Stochastic gradient descent is an extension of the gradient descent algorithm. A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as __a sum over training examples of some per-example loss function (e.g. negative log-likelihood).__

The insight of stochastic gradient descent is that the __gradient is an expectation.__

The gradient of a cost function:
$$
g = \nabla_{\theta} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} L(x^{(i)},y^{(i)},\theta)
$$

The __estimate of the gradient__ is formed as:
$$
g = \frac{1}{m'} \nabla_{\theta} \sum_{i=1}^{m'}  L(x^{(i)},y^{(i)},\theta)
$$
where $m'$ is the minibatch size.

The expectation may be approximately estimated using a small set of samples.

Specifically, on each step of the algorithm, we can __sample a minibatch of examples__ $B = \{x^{(1)},...,x^{(m')} \}$ drawn uniformly from the training set. 
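A minimal sketch of these steps, assuming a linear model with squared-error loss $L = (x^Tw - y)^2$ and synthetic data. The function name, learning rate, batch size, and step count are illustrative choices, not values from the text. Each step samples a minibatch uniformly (here with replacement), forms the gradient estimate, and moves the parameters against it.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=32, steps=2000, seed=0):
    """Fit w for y ~ X w with minibatch SGD on the squared-error loss."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        # Sample a minibatch of m' indices uniformly from the training set.
        idx = rng.integers(0, m, size=batch_size)
        Xb, yb = X[idx], y[idx]
        # Gradient estimate: (1/m') * sum_i grad_w (x_i^T w - y_i)^2.
        g = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)
        # Gradient step; the cost of this update depends on m', not on m.
        w -= lr * g
    return w

# Hypothetical training set generated from known weights plus small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=10_000)

w_hat = sgd_linear_regression(X, y)
print(w_hat)
```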

Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets. For a fixed model size, the cost per SGD update __does not depend on the training set size m.__

### 5.10 Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe. Combine:
- A specification of a dataset
- A cost function
- An optimization procedure
- A model
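Linear regression from Section 5.1.4 fits this recipe, and the sketch below labels each ingredient. The synthetic data and the function names `predict`, `mse`, and `fit` are assumptions made for illustration. The optimization step solves for the point where the gradient of the training MSE is zero (the normal equations).

```python
import numpy as np

# 1. Dataset: a hypothetical design matrix X and targets y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=100)

# 2. Model: y_hat = X w.
def predict(X, w):
    return X @ w

# 3. Cost function: mean squared error on the given data.
def mse(X, y, w):
    return np.mean((predict(X, w) - y) ** 2)

# 4. Optimization procedure: set the gradient of the MSE to zero,
#    which gives the normal equations X^T X w = X^T y.
def fit(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

w = fit(X, y)
print(w, mse(X, y, w))
```

Swapping any one ingredient (for example, replacing the closed-form solve with gradient descent, or adding a regularization term to the cost) yields a different algorithm built from the same recipe.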

### 5.11 Challenges Motivating Deep Learning
This section describes how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces.

Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.

#### 5.11.1 The Curse of Dimensionality
Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the __curse of dimensionality__.

Of particular concern is that __the number of possible distinct configurations__ of a set of variables __increases exponentially__ as the number of variables increases.

One challenge posed by the curse of dimensionality is a __statistical challenge__. A statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples.

Because in high-dimensional spaces the number of configurations is huge, much larger than our number of examples, a typical grid cell has no training example associated with it. How could we possibly say something meaningful about these new configurations?
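To make the grid-cell argument concrete, the sketch below assumes each dimension is split into 10 bins and that 1000 training points are drawn uniformly from the unit hypercube. Both numbers and the function name `occupied_fraction` are arbitrary choices for illustration. It reports what fraction of the $10^d$ cells contains at least one point as $d$ grows.

```python
import numpy as np

def occupied_fraction(num_examples, dims, bins_per_dim=10, seed=0):
    """Fraction of grid cells in [0, 1)^dims that contain at least one point."""
    rng = np.random.default_rng(seed)
    points = rng.random(size=(num_examples, dims))
    # Map each coordinate to its bin index in 0 .. bins_per_dim - 1.
    cells = np.floor(points * bins_per_dim).astype(int)
    occupied = len({tuple(c) for c in cells})
    # The number of distinct configurations grows exponentially with dims.
    total = bins_per_dim ** dims
    return occupied / total

for d in (1, 2, 3, 6):
    print(d, 10 ** d, occupied_fraction(1000, d))
```

With a fixed number of examples, the occupied fraction can never exceed `num_examples / bins_per_dim ** dims`, so it collapses toward zero as the dimension increases.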

#### 5.11.2 Local Constancy and Smoothness Regularization
In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn.

Throughout this book, we will describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks.

Among the most widely used of these implicit “priors” is the __smoothness prior__ or __local constancy prior__. This prior states that the function we learn should not change very much within a small region.

In other words, if we know a good answer for an input x (for example, if x is a labeled training example) then that answer is probably good in the neighborhood of x.

If we have several good answers in some neighborhood we would combine them (by some form of averaging or interpolation) to produce an answer that agrees with as many of them as much as possible.
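One simple form of this averaging is k-nearest-neighbors regression, sketched below on a toy one-dimensional dataset. The data and the function name `knn_predict` are illustrative assumptions. The prediction at a query point is the mean of the targets of its k closest training examples.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Average the targets of the k training examples nearest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Toy training set: y = x^2 sampled at four points.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

query = np.array([1.1])
print(knn_predict(X_train, y_train, query, k=1))  # copies the nearest answer: 1.0
print(knn_predict(X_train, y_train, query, k=2))  # averages x=1 and x=2: 2.5
```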

__Decision trees__ also suffer from the limitations of exclusively smoothness-based learning because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree.

The smoothness assumption and the associated non-parametric learning algorithms work extremely well so long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned.

In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples. 

If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?

- Is it possible to represent a complicated function efficiently? Yes.
- Is it possible for the estimated function to generalize well to new inputs? Yes.

We introduce some __dependencies between the regions via additional assumptions__ about the underlying data generating distribution. In this way, we can actually generalize non-locally.

AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions.

The core idea in deep learning is that we assume that the __data was generated by the composition of factors or features__, potentially at __multiple levels in a hierarchy__.

#### 5.11.3 Manifold Learning
A __manifold__ is a connected region. Mathematically, it is a set of points, associated with a neighborhood around each point.

Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of $R^n$.

Manifold learning algorithms surmount this obstacle by assuming that most of $R^n$ consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold. 

The assumption that the __data lies along a low-dimensional manifold may NOT always be correct or useful__. The evidence in favor of this assumption consists of two categories of observations.

- The first observation: the probability distribution over images, text strings, and sounds that occur in real life is __highly concentrated__. If you generate a document by picking letters uniformly at random, what is the probability that you will get a meaningful English-language text? The distribution of natural language sequences occupies a very small volume in the total space of sequences of letters.

- The second argument in favor of the manifold hypothesis is that we can also __imagine such neighborhoods and transformations__, at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc.
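The brightness transformation can be sketched in a few lines. This is a deliberately simplified toy: it assumes a random 64-pixel vector stands in for an image, and that dimming or brightening acts as a multiplicative scale on every pixel. Under that assumption, the images traced out by varying a single brightness parameter lie on a one-dimensional set inside $R^{64}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened 8x8 "image" with pixel values in [0, 1).
base_image = rng.random(size=64)

# One degree of freedom: a brightness factor swept gradually from dim to bright.
brightness = np.linspace(0.2, 1.0, 50)

# Each row is one point in R^64; all 50 points are scaled copies of base_image.
images = brightness[:, None] * base_image[None, :]

# The points span a single direction even though they live in 64 dimensions.
rank = np.linalg.matrix_rank(images)
print(images.shape, rank)
```

Real transformations such as rotating or moving an object trace out curved manifolds rather than a straight line, so a rank check like this one would not reveal their dimensionality.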