# Linear Regression and Regularization

This notebook will review the slides for linear regression, with examples.
It will also show examples of different regularizers.

1. Review model (derivation too?)
2. Simple example
3. Review regularization
4. Simple examples (change regularizer, look at coefficients and results?)

What's a good data set to use for these examples?  
What kinds of problems does sparsity work well for?  
What if we already have enough material, and we really only need the logistic regression notebook?  


# Machine Learning Models

Many problems in machine learning seek to build a model

$$g(a; x) \approx y$$

given a data set

$$\{(a_1, y_1), \dots, (a_m, y_m)\},$$

with components
* $a_i \in \mathbf{R}^n$ - data features,
* $y_i \in \mathbf{R}$ or $\{0, 1\}$ - data value or class,
* $g: \mathbf{R}^n \to \mathbf{R}$ or $\{0, 1\}$ - prediction function,
* $x \in \mathbf{R}^n$ - model parameters,
* $m$ - number of data points, and
* $n$ - number of data features.

We can fit a model to the given data by solving an optimization problem of the form

$$\min_x \sum_{i=1}^m f_i(g(a_i; x), y_i) + r(x)$$

with components
* $x \in \mathbf{R}^n$ - model parameters,
* $f_i: \mathbf{R} \times \mathbf{R} \to \mathbf{R}$ - loss functions that measure how well the prediction $g(a_i; x)$ fits the data value $y_i$ for a given set of parameters, and
* $r: \mathbf{R}^n \to \mathbf{R}$ - regularization function.
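
As a minimal sketch of how the pieces of this objective fit together, the cell below evaluates $\sum_i f_i(g(a_i; x), y_i) + r(x)$ for user-supplied callables. The helper name `objective` and the particular choices of `linear`, `squared`, and `ridge` (including the weight `0.1`) are assumptions made for illustration only.

```python
import numpy as np

def objective(x, A, y, predict, loss, regularizer):
    """Evaluate sum_i loss(predict(a_i, x), y_i) + regularizer(x)."""
    data_fit = sum(loss(predict(a_i, x), y_i) for a_i, y_i in zip(A, y))
    return data_fit + regularizer(x)

# Hypothetical choices for g, f_i, and r (illustration only).
linear = lambda a, x: a @ x               # g(a; x) = a^T x
squared = lambda p, y_i: (p - y_i) ** 2   # f_i: squared error
ridge = lambda x: 0.1 * np.sum(x ** 2)    # r(x) = 0.1 * ||x||^2

# Tiny data set: each row of A is one feature vector a_i.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 2.0])

# The predictions match y exactly here, so only the regularizer contributes.
value = objective(x, A, y, linear, squared, ridge)
print(value)
```

Swapping `ridge` for another callable changes the regularizer without touching the data-fit term.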

# Linear Regression

In the linear regression problem, we would like to find a linear predictor

$$g(a_i; x) = x_1 a_{i1} + \dots + x_n a_{in} = a_i^T x \approx y_i,$$

where both $a_i$ and $y_i$ are continuous.
One approach for deriving the functions $f_i$ is to assume a statistical model for the error in the data set, and then develop a [maximum likelihood](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) formulation.
Here we will assume that errors are independently drawn from a normal distribution with mean zero and variance $\sigma^2$, that is, 

$$y_i = a_i^Tx + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2).$$
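
To make this error model concrete, the following sketch draws a synthetic data set from it. The sizes `m` and `n`, the noise level `sigma`, the parameter vector `x_true`, and the choice of standard normal features are all arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 200, 3                        # number of data points and features (assumed)
sigma = 0.5                          # standard deviation of the errors (assumed)
x_true = np.array([1.0, -2.0, 0.5])  # "true" model parameters (assumed)

# Each row of A is one feature vector a_i.
A = rng.normal(size=(m, n))

# Independent errors eps_i ~ N(0, sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=m)

# Observations y_i = a_i^T x + eps_i.
y = A @ x_true + eps
```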

This means that the probability density function (PDF) for an observation $(a_i, y_i)$ given model parameters $x$ is

$$p((a_i, y_i); x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - a_i^Tx)^2}{2\sigma^2}\right),$$

and the PDF of all $m$ i.i.d. observations is

\begin{align}
p\big(\{(a_1, y_1), \dots, (a_m, y_m)\}; x\big) &= \prod_{i=1}^m p((a_i, y_i); x) \\
&= \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - a_i^Tx)^2}{2\sigma^2}\right) \\
&= \left(\frac{1}{2\pi\sigma^2}\right)^{m/2} \exp\left(-\frac{\sum_{i=1}^m (y_i - a_i^Tx)^2}{2\sigma^2}\right).
\end{align}
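
The factorization above can be checked numerically. The sketch below works with logarithms, since a product of many small densities can underflow in floating point, and compares the sum of the per-observation log-densities with the log of the closed-form expression. The function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def log_pdf_single(a_i, y_i, x, sigma):
    """Log-density of one observation (a_i, y_i) under the Gaussian error model."""
    residual = y_i - a_i @ x
    return -0.5 * np.log(2 * np.pi * sigma**2) - residual**2 / (2 * sigma**2)

def log_pdf_joint(A, y, x, sigma):
    """Log of the closed-form joint density of all m observations."""
    m = len(y)
    residual = y - A @ x
    return -0.5 * m * np.log(2 * np.pi * sigma**2) - residual @ residual / (2 * sigma**2)

# Synthetic data (assumed values, illustration only).
rng = np.random.default_rng(0)
sigma = 0.3
x = np.array([1.0, -1.0])
A = rng.normal(size=(50, 2))
y = A @ x + rng.normal(scale=sigma, size=50)

sum_of_logs = sum(log_pdf_single(a_i, y_i, x, sigma) for a_i, y_i in zip(A, y))
closed_form = log_pdf_joint(A, y, x, sigma)
print(np.isclose(sum_of_logs, closed_form))
```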

Alternatively, we can consider the likelihood of a set of model parameters given our $m$ observations, 

$$\mathcal{L}\big(x; \{(a_1, y_1), \dots, (a_m, y_m)\}\big) = p\big(\{(a_1, y_1), \dots, (a_m, y_m)\}; x\big),$$

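so that we can solve an optimization problem to find the parameters with the maximum likelihood. In practice, this is often done by minimizing the negative log-likelihood. Writing $\mathcal{L}(x)$ for the likelihood above, we have

$$-\log \mathcal{L}(x) = \frac{m}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^m (y_i - a_i^Tx)^2.$$

Dropping the additive constant and the positive factor $1/(2\sigma^2)$, neither of which depends on $x$, results in our least-squares problem

$$\min_x \sum_{i=1}^m(y_i - a_i^Tx)^2 = \min_x \|y - Ax\|_2^2,$$

where $A \in \mathbf{R}^{m \times n}$ is the matrix whose $i$th row is $a_i^T$ and $y = (y_1, \dots, y_m)^T$.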
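
One way to solve this least-squares problem numerically is with `np.linalg.lstsq`, sketched below on synthetic data. Here `A` stacks the feature vectors $a_i^T$ as rows and `y` collects the values $y_i$. The data sizes, noise level, and `x_true` are assumptions for illustration. At the minimizer the gradient $2A^T(Ax - y)$ should vanish, which the last lines check.

```python
import numpy as np

# Synthetic data from y_i = a_i^T x_true + eps_i (assumed values).
rng = np.random.default_rng(1)
m, n, sigma = 100, 3, 0.1
x_true = np.array([2.0, -1.0, 0.5])
A = rng.normal(size=(m, n))
y = A @ x_true + rng.normal(scale=sigma, size=m)

def least_squares_objective(x):
    """Sum of squared residuals, ||y - A x||^2."""
    residual = y - A @ x
    return residual @ residual

# Minimize ||y - A x||^2; lstsq returns (solution, residuals, rank, singular values).
x_hat, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# Gradient of the objective at x_hat; it should be numerically zero.
gradient = 2 * A.T @ (A @ x_hat - y)

print("estimate:", x_hat)
print("gradient norm:", np.linalg.norm(gradient))
```

With low noise and many more data points than features, `x_hat` should land close to `x_true`, though the exact values depend on the random draw.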


