# 1. Linear Regression

Problem: fit the parameters $\theta$ of the linear model

$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$

## 1.1 Assumptions
- The errors are Gaussian, $\epsilon^{(i)} \sim N(0, \delta^2)$, that is, $p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\delta} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\delta^2}\right)$
    - Substituting $\epsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$, we have $p(y^{(i)}|x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\delta} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\delta^2}\right)$
- The samples are independent.
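To make the assumptions concrete, here is a minimal sketch that draws synthetic data from the model. The helper name `simulate_linear_data`, the parameter values, and the choice of standard-normal features are assumptions made for illustration; only the Gaussian, independent errors come from the notes.

```python
import jax.numpy as jnp
from jax import random


def simulate_linear_data(key, theta, m, delta):
    """Draw m samples from y = theta^T x + eps with eps ~ N(0, delta^2)."""
    key_x, key_eps = random.split(key)
    # Feature distribution is an arbitrary choice for this sketch.
    x = random.normal(key_x, (m, theta.shape[0]))
    # Independent Gaussian errors with standard deviation delta.
    eps = delta * random.normal(key_eps, (m,))
    y = x @ theta + eps
    return x, y


theta_true = jnp.array([2.0, -1.0, 0.5])  # hypothetical parameters
x, y = simulate_linear_data(random.PRNGKey(0), theta_true, m=1000, delta=0.3)

# The residuals y - theta^T x should look like N(0, delta^2) noise.
residual = y - x @ theta_true
print('residual mean:', residual.mean(), 'residual std:', residual.std())
```

The residual mean should be close to 0 and the residual standard deviation close to the chosen `delta`.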

## 1.2 Likelihood Function

**Likelihood Function**:

\begin{equation}
    L(\theta) = p(y^{(1)}, \dots, y^{(m)} | x^{(1)}, \dots, x^{(m)}; \theta) = \prod_{i=1}^m p(y^{(i)}|x^{(i)}; \theta)
\end{equation}

**Notes**
- Purpose of maximum likelihood: find the parameter $\theta$ that, combined with the given features, best explains the given targets.
- $L(\theta)$ is the probability density of observing ALL the measured targets given ALL the features under the parameter $\theta$; maximum likelihood picks the $\theta$ that maximizes it.
- The factorization into a product holds because the samples are independent.
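As a small illustration of the product form, the sketch below evaluates the likelihood on a tiny hand-made dataset. The data values, `delta`, and the two candidate parameter vectors are hypothetical; `jax.scipy.stats.norm.pdf` supplies the Gaussian density.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

# Tiny hand-made dataset (hypothetical values, for illustration only).
x = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = jnp.array([1.1, 1.9, 3.05, 3.95])
delta = 0.5  # noise standard deviation, assumed known here


def likelihood(theta):
    # Per-sample densities p(y_i | x_i; theta) = N(y_i; theta^T x_i, delta^2).
    per_sample = norm.pdf(y, loc=x @ theta, scale=delta)
    # Independence lets the joint density factor into a product.
    return jnp.prod(per_sample)


theta_good = jnp.array([1.0, 2.0])  # predictions close to the targets
theta_bad = jnp.array([0.0, 0.0])   # predictions far from the targets
print('L(theta_good):', likelihood(theta_good))
print('L(theta_bad): ', likelihood(theta_bad))
```

A parameter whose predictions sit close to the targets receives a much larger likelihood than one whose predictions are far off.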


**Log-likelihood Function**

\begin{align}
 \log L(\theta) &= \sum_{i=1}^{m} \log p(y^{(i)}|x^{(i)}; \theta) \\
    &= \sum_{i=1}^{m} \log \left(\frac{1}{\sqrt{2\pi}\delta} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\delta^2}\right)\right) \\
    &= m \log \left(\frac{1}{\sqrt{2\pi}\delta}\right) - \frac{1}{\delta^2} \cdot \frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2 \\
    &= \text{constant} - \frac{1}{\delta^2} \cdot J(\theta)
\end{align}

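where $\text{constant} = m \log \frac{1}{\sqrt{2\pi}\delta}$ does not depend on $\theta$, and $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2$ is typically known as the least-squares cost function; it equals $\frac{m}{2}$ times the mean squared error.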

**Notes**
- The logarithm turns the product into a sum, which is easier to compute and differentiate.
- Maximizing the log-likelihood function with respect to $\theta$ is equivalent to minimizing the mean squared error. The two objectives have different values but the same location of the optimum.
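The identity $\log L(\theta) = \text{constant} - J(\theta)/\delta^2$ and the shared optimum can be checked numerically. This is a rough sketch on synthetic data: the data, `delta`, and the test point `theta` are assumptions, and `jnp.linalg.lstsq` is used here only as a convenient way to obtain the least-squares minimizer.

```python
import jax
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm

delta = 0.5  # noise standard deviation (assumed known)
key_x, key_eps = random.split(random.PRNGKey(0))
x = random.normal(key_x, (200, 3))
theta_true = jnp.array([1.0, -2.0, 0.5])  # hypothetical parameters
y = x @ theta_true + delta * random.normal(key_eps, (200,))


def J(theta):
    # Least-squares cost: half the sum of squared errors.
    return 0.5 * jnp.sum((y - x @ theta) ** 2)


def log_likelihood(theta):
    # Sum of per-sample Gaussian log-densities.
    return jnp.sum(norm.logpdf(y, loc=x @ theta, scale=delta))


m = x.shape[0]
constant = m * jnp.log(1.0 / (jnp.sqrt(2.0 * jnp.pi) * delta))

# Check the identity at an arbitrary parameter value.
theta = jnp.array([0.3, 0.1, -0.2])
lhs = log_likelihood(theta)
rhs = constant - J(theta) / delta**2
print('log L:', lhs, ' constant - J/delta^2:', rhs)

# The minimizer of J is a stationary point of the log-likelihood.
theta_ls, *_ = jnp.linalg.lstsq(x, y)
grad_at_ls = jax.grad(log_likelihood)(theta_ls)
print('gradient of log L at the least-squares solution:', grad_at_ls)
```

The two printed values should agree up to floating-point error, and the gradient at the least-squares solution should be close to zero.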


## 1.3 Properties of Maximum Likelihood
The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples $m \rightarrow \infty$, in terms of its rate of convergence as $m$ increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency, meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are as follows:
- The true distribution $p_{data}$ must lie within the model family $p_{model}(\cdot;\theta)$. Otherwise, no estimator can recover $p_{data}$.
- The true distribution $p_{data}$ must be identifiable, meaning that it corresponds to exactly one value of $\theta$. Otherwise, maximum likelihood can recover the correct $p_{data}$ but cannot determine which value of $\theta$ generated the data.
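Consistency can be illustrated empirically for the linear-Gaussian model, where both conditions hold by construction. The sketch below is an assumption-laden toy experiment (synthetic data, hypothetical `theta_true`, sample sizes chosen arbitrarily) that uses the least-squares solution as the maximum likelihood estimate.

```python
import jax.numpy as jnp
from jax import random

theta_true = jnp.array([1.0, -2.0, 0.5])  # hypothetical true parameters
delta = 0.5                               # noise standard deviation


def mle_error(key, m):
    """Distance between the ML estimate and theta_true for m samples."""
    key_x, key_eps = random.split(key)
    x = random.normal(key_x, (m, 3))
    y = x @ theta_true + delta * random.normal(key_eps, (m,))
    # Under Gaussian noise the ML estimate is the least-squares solution.
    theta_hat, *_ = jnp.linalg.lstsq(x, y)
    return float(jnp.linalg.norm(theta_hat - theta_true))


key = random.PRNGKey(0)
errors = {m: mle_error(random.fold_in(key, m), m) for m in (10, 1_000, 100_000)}
for m, err in errors.items():
    print(f'm = {m:>7d}  ||theta_hat - theta_true|| = {err:.5f}')
```

The estimation error should shrink as $m$ grows, which is what consistency predicts.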

## 1.4 Implementation
Linear regression is simply a dense neural network with a single neuron and no activation function. The neuron computes the weighted sum of its inputs and adds a bias term. The weights and bias are the parameters that the model will learn.

```python
import jax
import jax.numpy as jnp
from jax import random
from flax.core import unfreeze
import flax.linen as nn


class Linear(nn.Module):
    features: int  # number of output features
    kernel_init: nn.initializers.Initializer = nn.initializers.lecun_normal()
    bias_init: nn.initializers.Initializer = nn.initializers.zeros

    @nn.compact
    def __call__(self, x):
        w = self.param('w',  # name of the parameter
                       self.kernel_init,  # initialization
                       (x.shape[-1], self.features)  # shape of the parameter
                       )
        # scalar bias, shared across all output features
        b = self.param('b', self.bias_init, ())
        return jnp.dot(x, w) + b


key1, key2 = random.split(random.PRNGKey(0), 2)
x = random.uniform(key1, (10, 5))  # 10 samples with 5 input features each

# create a linear model with 1 output feature (the 5 input features are inferred from x)
model = Linear(features=1)
params = model.init(key2, x)

y = model.apply(params, x)

print('initialized parameter shapes:\n', jax.tree_util.tree_map(jnp.shape, unfreeze(params)))
print('output:\n', y)
```

```text
initialized parameter shapes:
 {'params': {'b': (), 'w': (5, 1)}}
output:
 [[-0.487528  ]
 [-0.46083403]
 [ 0.13756503]
 [-0.13930888]
 [-0.29863402]
 [-0.47391087]
 [-0.16525263]
 [-0.40750027]
 [-0.32280582]
 [-0.23942292]]
```
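
The weights and bias still have to be learned. One possible way, sketched here in plain JAX so the block stays self-contained, is gradient descent on the mean squared error. The parameter dictionary mirrors the `w` and `b` shapes above, but `predict`, `mse_loss`, `sgd_step`, the synthetic targets, the learning rate, and the step count are all assumptions for illustration and not part of the original notebook.

```python
import jax
import jax.numpy as jnp
from jax import random


def predict(params, x):
    # Same computation as the Linear module: weighted sum plus bias.
    return jnp.dot(x, params['w']) + params['b']


def mse_loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)


@jax.jit
def sgd_step(params, x, y, lr):
    # Returns the updated parameters and the loss before the update.
    loss, grads = jax.value_and_grad(mse_loss)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss


# Synthetic regression targets (hypothetical ground truth).
key_x, key_eps = random.split(random.PRNGKey(0))
x = random.uniform(key_x, (256, 5))
w_true = jnp.array([[1.0], [-2.0], [0.5], [3.0], [0.0]])
b_true = 0.7
y = x @ w_true + b_true + 0.1 * random.normal(key_eps, (256, 1))

# Zero initialization is enough for this convex problem.
params = {'w': jnp.zeros((5, 1)), 'b': jnp.zeros(())}
initial_loss = mse_loss(params, x, y)
for _ in range(2000):
    params, loss = sgd_step(params, x, y, 0.1)

print('initial loss:', initial_loss, 'final loss:', loss)
```

With noise of standard deviation 0.1, the final loss should settle near the noise variance. For the Flax module, the same loop would call `model.apply(params, x)` inside the loss function instead of `predict`.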
