### Lecture Notes

#### Estimators

Estimator - A rule (function) that estimates \\(\theta\\) from a given realization of the data:

$$\hat{\theta}=g(x[0],\ldots,x[N-1])$$

To design an estimator, we need both a signal model, which links the observed data to the parameters of interest, and an objective function. Finding an estimator then becomes an optimization problem: minimizing the objective function.

![estimator_design.png](./estimator_design.png "estimator_design.png")

Unbiased estimator - On average, the estimate equals the true value: \\(\mathbb{E}[\hat{\theta}]=\theta\\)

Minimum variance - Among unbiased estimators we prefer the one with minimum variance, since a lower variance means the estimates are concentrated more tightly around the true value.
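
The two properties above can be checked numerically. The following is a minimal sketch, assuming a constant level \\(A\\) observed in white Gaussian noise, \\(x[n]=A+w[n]\\); the values of `A_true`, `sigma`, `N` and `trials` are hypothetical and chosen only for illustration. It compares two unbiased estimators of \\(A\\): the first sample alone and the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

A_true = 1.0      # hypothetical true parameter
sigma = 0.5       # hypothetical noise standard deviation
N = 20            # samples per realization
trials = 20_000   # number of independent realizations

# Each row is one realization x[0], ..., x[N-1] of x[n] = A + w[n]
x = A_true + sigma * rng.standard_normal((trials, N))

first_sample = x[:, 0]          # estimator 1: use x[0] only
sample_mean = x.mean(axis=1)    # estimator 2: average all N samples

# Both averages should be close to A_true, but the variances differ
for name, est in [("first sample", first_sample), ("sample mean", sample_mean)]:
    print(f"{name:12s}  mean={est.mean():.4f}  var={est.var():.5f}")
```

Both estimators average out to roughly `A_true`, so both are unbiased, but the variance of the sample mean should be close to `sigma**2 / N` while the single-sample estimator keeps the full noise variance `sigma**2`.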

Information in data - The joint PDF \\(p(\mathbf{x};\theta)\\) (read: the probability density of \\(\mathbf{x}\\) given the parameter \\(\theta\\)) shows how much information about \\(\theta\\) is present in the observed data. The more strongly the PDF depends on the unknown parameter \\(\theta\\), the more accurately we can estimate it.

If \\(x[0]=A+w[0]\\), where \\(w[0]\\) is normally distributed with mean 0 and variance \\(\sigma_w^2\\), and we want to estimate \\(A\\), the PDF of the data is as seen below

![pdf.png](./pdf.png "pdf.png")

In the first example, the PDF is "sharper" and will give a better estimate.

Likelihood function - The PDF viewed as a function of the unknown parameter \\(\theta\\), with the data held fixed, is called the likelihood function. The sharpness of this function determines how much information about \\(\theta\\) is present in the data and how accurately we can estimate it.

The sharpness is quantified by the average curvature of the log-likelihood function, known as the Fisher information:

$$I(\theta) = -\mathbb{E}_{\mathbf{x}}\left[\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]$$
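
As a worked example, assume the single-sample model above is extended to \\(N\\) independent samples \\(x[n]=A+w[n]\\), \\(n=0,\ldots,N-1\\), with \\(w[n]\\) normally distributed with mean 0 and variance \\(\sigma_w^2\\) (the extension to \\(N\\) samples is an assumption made here for illustration). The log-likelihood and its derivatives with respect to \\(A\\) are then

$$\ln p(\mathbf{x};A) = -\frac{N}{2}\ln(2\pi\sigma_w^2) - \frac{1}{2\sigma_w^2}\sum_{n=0}^{N-1}(x[n]-A)^2$$

$$\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = \frac{1}{\sigma_w^2}\sum_{n=0}^{N-1}(x[n]-A), \qquad \frac{\partial^2 \ln p(\mathbf{x};A)}{\partial A^2} = -\frac{N}{\sigma_w^2}$$

The second derivative does not depend on the data, so the expectation is trivial and

$$I(A) = \frac{N}{\sigma_w^2}$$

For \\(N=1\\) this gives \\(I(A)=1/\sigma_w^2\\): a smaller noise variance means a sharper PDF and more information about \\(A\\).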



#### Maximum Likelihood Estimator

Given a set of observations \\(\mathbf{x}\\), the probability density of observing this data for a parameter value \\(\theta\\) is \\(p(\mathbf{x};\theta)\\).

The goal of maximum likelihood estimation (MLE) is to choose the \\(\theta\\) that maximizes the likelihood of the observed data:

$$\hat{\theta} = \arg \max_{\theta} p(\mathbf{x}; \theta)$$

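For convenience in optimization, we often use the log-likelihood function \\(\ell(\theta)\\). Since the logarithm is monotonically increasing, it has the same maximizer as the likelihood. For independent samples the log-likelihood becomes a sum:

$$\ell(\theta) = \log p(\mathbf{x}; \theta) = \sum_{n=0}^{N-1} \log p(x[n]; \theta)$$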
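
To make the arg max concrete, here is a minimal sketch that maximizes the log-likelihood numerically on a grid. It assumes the constant-level Gaussian model \\(x[n]=A+w[n]\\) with known noise standard deviation; the helper `log_likelihood` and all numeric values are hypothetical choices for illustration, and a grid search is used only because it is easy to read, not because it is an efficient optimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

A_true, sigma, N = 2.0, 1.0, 50            # hypothetical values
x = A_true + sigma * rng.standard_normal(N)

def log_likelihood(A, x, sigma):
    """Gaussian log-likelihood of a constant level A for independent samples x."""
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma**2)
            - np.sum((x - A) ** 2) / (2 * sigma**2))

# Evaluate the log-likelihood on a grid of candidate values and take the arg max
grid = np.linspace(0.0, 4.0, 4001)          # grid spacing 0.001
ll = np.array([log_likelihood(A, x, sigma) for A in grid])
A_mle = grid[np.argmax(ll)]

print(A_mle, x.mean())   # the grid maximizer should sit next to the sample mean
```

For this model the maximizer agrees with the sample mean up to the grid resolution, which matches the closed-form result for a Gaussian with unknown mean.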

MLE estimates are asymptotically unbiased, i.e.,

$$\lim_{N \to \infty} \mathbb{E}[\hat{\theta}] = \theta$$

and asymptotically efficient, i.e.,

$$\mathrm{var}[\hat{\theta}] \to I^{-1}(\theta) \quad \text{as } N \to \infty$$

Here \\(I^{-1}(\theta)\\) is the inverse of the Fisher information (matrix) defined above. It is called the Cramér-Rao lower bound (CRLB) and is the minimum variance that any unbiased estimator can achieve.
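
Asymptotic unbiasedness can be illustrated with a small simulation. The sketch below uses a different example from the one in these notes, chosen as an assumption for illustration: the MLE of the variance of Gaussian data with unknown mean, which divides by \\(N\\) and has expectation \\(\frac{N-1}{N}\sigma^2\\). It is biased for small \\(N\\), and the bias vanishes as \\(N\\) grows. The values of `sigma2_true` and `trials` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2_true = 4.0     # hypothetical true variance
trials = 20_000       # Monte Carlo realizations per sample size

avg_estimate = {}
for N in (2, 5, 20, 100):
    x = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))
    # np.var with its default ddof=0 divides by N, which is the Gaussian MLE
    sigma2_mle = x.var(axis=1)
    avg_estimate[N] = sigma2_mle.mean()
    print(f"N={N:3d}  average MLE={avg_estimate[N]:.3f}  "
          f"theory={(N - 1) / N * sigma2_true:.3f}")
```

The average estimate should approach `sigma2_true` from below as `N` increases. The sketch only addresses the bias; it does not check the variance against the CRLB.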

#### Least Squares Estimator

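Suppose the signal \\(\mathbf{s}\\) is determined by the parameters \\(\boldsymbol{\theta}\\) through a linear signal model, and that we do not observe \\(\mathbf{s}\\) directly but only the data \\(\mathbf{x}\\), which is \\(\mathbf{s}\\) corrupted by noise, measurement errors or model inaccuracies:

$$\mathbf{s}=\mathbf{H}\boldsymbol{\theta}$$

Here \\(\mathbf{H}\\) is the observation matrix.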

The cost function \\(J(\theta)\\) is defined as:
$$J(\theta) = \sum_{n=0}^{N-1} (x[n] - s[n])^2$$

$$= (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})$$

The value of \\(\boldsymbol{\theta}\\) that minimizes the above cost function is the LSE (Least Squares Estimator):

$$\hat{\boldsymbol{\theta}} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{x}$$

With the least squares estimator we do not assume any probabilistic model for the data. We only define a mathematical model of the signal in terms of some parameters \\(\boldsymbol{\theta}\\).
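
The closed-form LSE can be sketched in a few lines of NumPy. The example below assumes a hypothetical signal model, a sinusoid of known frequency `f0` whose cosine and sine amplitudes are the unknown parameters, so that the model is linear in \\(\boldsymbol{\theta}\\). The values of `f0`, `theta_true` and the noise level are illustrative assumptions, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 200
n = np.arange(N)
f0 = 0.05                                   # hypothetical known frequency (cycles/sample)

# Observation matrix: one column per parameter
H = np.column_stack([np.cos(2 * np.pi * f0 * n),
                     np.sin(2 * np.pi * f0 * n)])

theta_true = np.array([1.5, -0.7])          # hypothetical true parameters
x = H @ theta_true + 0.3 * rng.standard_normal(N)   # observed data

# Closed form from the normal equations: (H^T H) theta = H^T x
theta_normal = np.linalg.solve(H.T @ H, H.T @ x)

# Same estimate from NumPy's least squares solver
theta_lstsq, *_ = np.linalg.lstsq(H, x, rcond=None)

print(theta_normal, theta_lstsq)
```

Both routes give the same estimate here. Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly, and `np.linalg.lstsq` is generally the more robust choice when \\(\mathbf{H}\\) is poorly conditioned.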

### Exercises

![ex_11_a.png](./ex_11_a.png "ex_11_a.png")
![ex_11_b.png](./ex_11_b.png "ex_11_b.png")

#### 1)

The LS estimator in closed form is

$$\hat{\boldsymbol{\theta}} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{y}$$

Here \\(\mathbf{y}\\) is the vector of noisy observations and \\(\mathbf{H}\\) is the design matrix with rows \\(\mathbf{h}_1=[1, x_1, x_1^2]\\) to \\(\mathbf{h}_N=[1, x_N, x_N^2]\\) for \\(M=3\\).

#### 2)

import numpy as np
import matplotlib.pyplot as plt

N = 100
i = np.arange(1, N + 1)  # sample index 1..N

x = np.sin(2 * np.pi * i / N)  # noise-free signal
w = np.random.normal(0, 0.2, N)  # Gaussian noise with standard deviation 0.2
y = x + w  # noisy observations

train_samples = 10
M = 3  # polynomial order, giving M + 1 parameters

train_idx = np.random.choice(N, size=train_samples, replace=False)  # distinct training indices
y_train = y[train_idx]

H = np.vander(i.astype(float), M + 1, increasing=True)  # columns 1, i, i^2, ..., i^M
H_train = H[train_idx, :]

theta_hat = np.linalg.pinv(H_train) @ y_train  # LS estimate from the noisy training samples
x_hat = H @ theta_hat  # fitted polynomial evaluated at all N samples

print(theta_hat)
plt.figure()
plt.plot(x, label="x")
plt.plot(y, label="y")
plt.plot(x_hat, label="x_hat")
plt.plot(x - x_hat, label="Error")
plt.legend()
plt.show()


![ex_11_c.png](./ex_11_c.png "ex_11_c.png")


#### 1 and 2)

The likelihood function is

$$
p(\mathbf{x};A) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma^{2}}}\,\exp\left(-\frac{(x_n-A)^{2}}{2\sigma^{2}}\right)
$$

Then the log likelihood function is

$$
\ell(\mathbf{x};A)
= \sum_{n=0}^{N-1} \log\ \left(
\frac{1}{\sqrt{2\pi\sigma^{2}}}
\exp \left[-\frac{(x_n-A)^2}{2\sigma^{2}}\right]
\right)
$$

$$
\ell(\mathbf{x};A)= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x_n - A)^2
$$

Only the last term depends on \\(A\\), and it enters with a negative sign. Maximizing the log-likelihood is therefore equivalent to minimizing \\(\sum_{n=0}^{N-1}(x_n - A)^2\\), which is the least squares problem.
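
Carrying the maximization through, as a short worked step: setting the derivative of the log-likelihood with respect to \\(A\\) to zero gives

$$
\frac{\partial \ell(\mathbf{x};A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x_n - A) = 0
\quad\Rightarrow\quad
\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x_n
$$

The second derivative is \\(-N/\sigma^2 < 0\\), so this stationary point is a maximum. For this model the maximum likelihood estimate and the least squares estimate are both the sample mean.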