# Setup

In [1]:
# general modules 
import numpy as np
import pandas as pd 
from matplotlib import pyplot as plt
from scipy.optimize import minimize

%load_ext autoreload
%autoreload 2
 
# Code for this week 
import estimation_ante as est
import LinearModel as lm # based on code from previous weeks

# Set random seed
seed = 42
rng = np.random.default_rng(seed=seed)

In [2]:
import seaborn as sns
sns.set_theme()
plt.rcParams.update({
    #"text.usetex": True, # LaTeX can sometimes be tricky to get working but makes graphs prettier :) 
    "font.family": "serif", 
    "font.size":18 
})

# Maximum Likelihood Estimation of the Linear Model


In this exercise we consider the linear regression model. When estimated
by OLS, the estimator, which minimizes the sum of squared residuals, has
a closed-form solution. The same holds for the maximum likelihood
estimator when the errors are assumed Gaussian. We will nevertheless do
the maximization numerically, using the `minimize` function from the
`scipy.optimize` module. The purpose of this exercise is to learn
numerical maximization and to become familiar with $M$-estimators by
viewing the maximum likelihood estimator of the linear model as an
$M$-estimator.

## The Model

We consider a linear model with the following characteristics

$$y_{i}=\beta _{0}+\beta _{1}x_{1i}+\dots+\beta _{K-1}x_{K-1,i}+u _{i}, \quad i=1,\dots,N,$$

with $u_{i} \sim N(0,\sigma^2)$.

The (conditional) likelihood contribution for observation $i$ is
$$f\left(y_{i}\left|\mathbf{x}_{i};\beta,\sigma^{2}\right.\right) = \frac{1}{\sigma}\phi\left(\frac{\hat{u}_i}{\sigma}\right) =\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{ -\frac{1}{2}\frac{\hat{u}_{i}^{2}}{\sigma^{2}}\right\},$$
where
$\hat{u}_{i}=y_{i}- \beta_0 - \sum_{k=1}^{K-1} \beta_k x_{ki}$, and $\phi(\cdot)$ is the standard normal density. 

Thus, the loglikelihood contribution is 
$$
\ell_i(\theta) = - \frac{1}{2}\log (2 \pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2}\frac{\hat{u}_i^2}{\sigma^2}
$$

Often the constant term $-\frac{1}{2}\log\left(2\pi\right)$ (equal to $-\frac{N}{2}\log\left(2\pi\right)$ when summed over all $N$ observations) is dropped, as it does not change with 
$\beta$ or $\sigma$ and thus does not affect the optimization. 

Finally, our optimizer will be solving the problem 
$$ \min_\theta N^{-1} \sum_{i=1}^N q(\theta, y_i, x_i),$$ 
where the criterion function is the negative loglikelihood, $q(\theta,y_i,x_i) = -\ell_i(\theta)$. 
Again, we can drop the factor $N^{-1}$ as it does not affect the optimization. 
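
The formulas above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the contents of `LinearModel.py`: the function names `loglik_contrib` and `q_sketch`, the ordering $\theta = (\beta', \sigma)'$, the 1-D return shape `(N,)` (rather than $N \times 1$), and the small simulated dataset are all hypothetical choices made for this example.

```python
import numpy as np

def loglik_contrib(theta, y, x):
    """Gaussian loglikelihood contributions; assumes theta = (beta_0, ..., beta_{K-1}, sigma)."""
    theta = np.asarray(theta, dtype=float).ravel()
    beta, sigma = theta[:-1], theta[-1]
    u = np.asarray(y).ravel() - x @ beta          # residuals, shape (N,)
    return (-0.5 * np.log(2 * np.pi)
            - 0.5 * np.log(sigma**2)
            - 0.5 * u**2 / sigma**2)

def q_sketch(theta, y, x):
    """Criterion contributions: the negative loglikelihood."""
    return -loglik_contrib(theta, y, x)

# Hypothetical demo data (not the notebook's dataset): constant + one regressor
rng_demo = np.random.default_rng(0)
N_demo = 50
x_demo = np.column_stack([np.ones(N_demo), rng_demo.normal(size=N_demo)])
y_demo = x_demo @ np.array([1.0, 1.0]) + 3.0 * rng_demo.normal(size=N_demo)

theta_demo = np.array([1.0, 1.0, 3.0])
Q_value = np.mean(q_sketch(theta_demo, y_demo, x_demo))   # sample criterion at theta_demo
print(Q_value)
```

Dropping the `0.5 * np.log(2 * np.pi)` term or replacing `np.mean` by `np.sum` would shift or rescale the criterion but leave the minimizer unchanged.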

In [3]:
# Simulate dataset
n = 100
K = 2 # two regressors, a constant and one (real) regressor
beta = np.ones((K,1))  # First is constant
sigma = 3
true_theta = np.vstack([beta, sigma])
y, x = lm.sim_data(n, true_theta, rng)

# Find some starting values 
theta0 = lm.starting_values(y, x)
theta0 = 0.8*theta0 # scale them by 0.8 to make the problem a little harder 

# Question 1: Write a function for the likelihood contribution.
Open the file `LinearModel.py` and fill in the function `loglikelihood` so that it returns the **loglikelihood contribution** $\ell_i(\theta)$ for each observation. It should return an $N \times 1$ vector of loglikelihood contributions.

*Hint:* Given `theta0, x, y`, the sum of the loglikelihood contributions should be close to -162.8. The check cell below tests this for you, so you know whether you have written the function correctly.

In [4]:
# Fill in the missing parts of the lm.loglikelihood() function.
# First, calculate the residual.
# Then calculate the likelihood value, using the likelihood contribution equation from above.
# Test if you got it right with the cell below. You might have to "reseed" by running the first cell in this notebook again.

In [None]:
np.isclose(np.sum(lm.loglikelihood(theta0, y, x)), -162.800)

# Question 2: Estimate Parameters 

Now finish up the `estimate` function, which takes a function to minimize, `func`; starting values, `theta0`; the data, `y` and `x`; and the type of variance estimator to use, `cov_type`.

You need to use the `minimize` function, which takes the following inputs: the objective function `obj_func`, and the starting values `theta0`.

You also need to finish up the `variance` function, which takes the function `func` (not the objective function), the data `y` and `x`, the results from the minimizer `result`, and finally the type of variance to calculate, `cov_type`.

## 2.a Estimate Parameters with `optimize.minimize`

1. Create a `lambda` function, `Q`, taking only one input, `theta`, and returning the negative mean loglikelihood. 
    * ***Hint:*** Watch this video to learn about functions in Python: https://youtu.be/watch?v=loF8zsPaIjs. 
2. Evaluate `Q(theta0)` to test that it works. 
3. Call `minimize`, starting from `theta0`, with options having `disp` set to `True`, and the optimization algorithm (`method`) set to `BFGS`. 
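
The three steps can be sketched on a self-contained example. Everything here is an assumption made for illustration: the data are simulated inline, `q_demo` is a stand-in for `lm.q`, and the starting values are OLS-based values scaled by 0.8 (what `lm.starting_values` actually does is not shown in this notebook). Only `scipy.optimize.minimize` with `method='BFGS'` and `options={'disp': True}` is taken from the task description.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical demo data: constant + one regressor, beta = (1, 1), sigma = 3
rng_demo = np.random.default_rng(0)
N_demo = 200
x_demo = np.column_stack([np.ones(N_demo), rng_demo.normal(size=N_demo)])
y_demo = x_demo @ np.array([1.0, 1.0]) + 3.0 * rng_demo.normal(size=N_demo)

def q_demo(theta, y, x):
    """Negative Gaussian loglikelihood contributions; theta = (beta, sigma)."""
    beta, sigma = theta[:-1], theta[-1]
    u = y - x @ beta
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.log(sigma**2) + 0.5 * u**2 / sigma**2

# 1. objective as a function of theta only
Q_demo = lambda theta: np.mean(q_demo(theta, y_demo, x_demo))

# 2. starting values (assumed: OLS coefficients and residual scale, times 0.8)
b_ols = np.linalg.lstsq(x_demo, y_demo, rcond=None)[0]
s_ols = np.sqrt(np.mean((y_demo - x_demo @ b_ols) ** 2))
theta_start = 0.8 * np.append(b_ols, s_ols)
print(Q_demo(theta_start))

# 3. minimize with BFGS, printing convergence information
res_demo = minimize(Q_demo, theta_start, method='BFGS', options={'disp': True})
print(res_demo.x)
```

Because the Gaussian ML estimator has a closed form, the numerical solution `res_demo.x` can be compared directly with `b_ols` and `s_ols` as a sanity check.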

In [None]:
# practice example with optimize.minimize

# 1. function handle to the objective 
Q = lambda theta : np.mean(lm.q(theta, y, x)) # just a function of one variable

# 2. starting values 
theta0 = lm.starting_values(y, x)

# 3. call scipy minimize 
#res = minimize(FILL IN)

# Question 3: Standard Errors

**Tasks**: 
* Fill in `est.estimate()`: read the documentation for `scipy.optimize.minimize` to see how to pass `obj_func` and `theta0` to it.
    * ***Bonus:*** Make sure that `estimate()` passes the input `options` and the optional `kwargs` correctly to `optimize.minimize`. The `options` dictionary can e.g. ask the optimizer to print (or not print) final convergence output by setting `options = {'disp': True}` (or `False`). The `kwargs` can control things like which algorithm is used for optimization, e.g. `method='BFGS'`. 
* Fill in `est.variance()`: implement all three options for $\text{Avar}(\hat{\theta})$ described below. 

## Theory: The Three Asymptotic Variance Estimators

The log-likelihood function is a nonlinear function, which must in
general be maximized numerically in order to obtain the ML estimates.
In general, for $M$-estimators, we know that 
$$
\sqrt{N}\left( \boldsymbol{\hat{\theta}}-\boldsymbol{\theta }_{0}\right) 
\overset{d}{\rightarrow} \mathcal{N} \left(\mathbf{0}, \mathbf{A}_{0}^{-1} \mathbf{B}_0 \mathbf{A}_{0}^{-1} \right). 
$$ 

For Maximum Likelihood (ML) estimators specifically, the *Information Matrix 
Equality* holds, which implies that 
$$ \mathbf{A}_{0} = \mathbf{B}_{0}. $$ 
This means that the asymptotic variance matrix simplifies so that 
$$ \mathbf{A}_{0}^{-1} \mathbf{B}_0 \mathbf{A}_{0}^{-1} = \mathbf{A}_{0}^{-1} = \mathbf{B}_{0}^{-1}.$$ 

This means that we have three valid ways of estimating the asymptotic variance 
matrix of our parameter estimates: 

$\widehat{\text{Avar}}( \boldsymbol{\hat{\theta}})$ can be taken to be any
of the three options
1. $\widehat{\text{Avar}}( \boldsymbol{\hat{\theta}}) = N^{-1} \hat{\mathbf{A}}^{-1}$: the `Hessian`, 
2. $\widehat{\text{Avar}}( \boldsymbol{\hat{\theta}}) = N^{-1} \hat{\mathbf{B}}^{-1}$: the `Outer Product`,  
3. $\widehat{\text{Avar}}( \boldsymbol{\hat{\theta}}) = N^{-1} \hat{\mathbf{A}}^{-1} \hat{\mathbf{B}} \hat{\mathbf{A}}^{-1} $: the `Sandwich`: viewed as a more "robust" option.  


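where $\mathbf{H}_{i}$ and $\mathbf{s}_{i}$ denote the Hessian and the score of the loglikelihood contribution $\ell_i$, and the component matrices are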
$$
\begin{aligned}
\quad \mathbf{\hat{A}} 
    &= -\frac{1}{N} \left[ \sum_{i=1}^{N}\mathbf{H}_{i}( \boldsymbol{\hat{\theta}}) \right] 
\\
\quad  \mathbf{\hat{B}}
    &= \frac{1}{N}
        \left[ \sum_{i=1}^{N} \mathbf{s}_{i}( \boldsymbol{\hat{\theta}}) 
                              \mathbf{s}_{i}( \boldsymbol{\hat{\theta}})^{\prime}
        \right].
\end{aligned}
$$ 
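
As a worked illustration for the linear model, and assuming the parameter vector is ordered $\theta = (\beta', \sigma)'$ as in the simulation cell, differentiating $\ell_i(\theta)$ with respect to $\beta$ and $\sigma$ gives the score

$$
\mathbf{s}_{i}(\theta) =
\begin{pmatrix}
\dfrac{\partial \ell_i}{\partial \beta} \\[2ex]
\dfrac{\partial \ell_i}{\partial \sigma}
\end{pmatrix}
=
\begin{pmatrix}
\dfrac{\mathbf{x}_{i}\hat{u}_{i}}{\sigma^{2}} \\[2ex]
-\dfrac{1}{\sigma} + \dfrac{\hat{u}_{i}^{2}}{\sigma^{3}}
\end{pmatrix}.
$$

The gradient of the criterion $q = -\ell_i$ is $-\mathbf{s}_i$, so the outer product $\mathbf{s}_i \mathbf{s}_i'$ is the same whichever of the two is differentiated numerically.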


*Programming hint:* To calculate the variance you have to do the following:

* `Hessian`: There is a function in the estimation file to compute the Hessian numerically. Its inverse is an estimate of $\mathbf{\hat{A}}^{-1}$. So for the `Hessian` variance, you would calculate:
$$
\widehat{\text{Avar}}( \boldsymbol{\hat{\theta}}) = \frac{1}{N}\mathbf{\hat{A}}^{-1}
$$
* `Outer Product`: the ex ante code computes the *numerical gradient* $N \times K$ matrix, `s`. You need to compute the $K \times K$ outer product of the scores, 
$ \mathbf{s}' \mathbf{s} = \sum_{i=1}^N \mathbf{s}_i \mathbf{s}_i', $
($\mathbf{s}_i$ is $K \times 1$ in this notation) and then use this to form $\hat{\mathbf{B}} = N^{-1} \mathbf{s}' \mathbf{s}$. Finally, calculate the variance using:
$$
\widehat{\text{Avar}}( \boldsymbol{\hat{\theta}}) = \frac{1}{N}\mathbf{\hat{B}}^{-1}
$$
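
The three variance estimators can be sketched on a self-contained example. This is not the code in the estimation file: the finite-difference helpers `num_scores` and `num_hessian`, the criterion `q_demo`, and the simulated data are hypothetical stand-ins, and the closed-form Gaussian ML estimates are used in place of a numerical optimizer to keep the example short.

```python
import numpy as np

def q_demo(theta, y, x):
    """Negative Gaussian loglikelihood contributions; theta = (beta, sigma)."""
    beta, sigma = theta[:-1], theta[-1]
    u = y - x @ beta
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.log(sigma**2) + 0.5 * u**2 / sigma**2

def num_scores(f, theta, h=1e-5):
    """Central-difference gradient of each contribution; returns an (N, P) array."""
    cols = []
    for k in range(theta.size):
        step = np.zeros(theta.size)
        step[k] = h
        cols.append((f(theta + step) - f(theta - step)) / (2 * h))
    return np.column_stack(cols)

def num_hessian(f, theta, h=1e-4):
    """Central-difference Hessian of a scalar function; returns a (P, P) array."""
    P = theta.size
    H = np.zeros((P, P))
    for j in range(P):
        for k in range(P):
            ej, ek = np.zeros(P), np.zeros(P)
            ej[j], ek[k] = h, h
            H[j, k] = (f(theta + ej + ek) - f(theta + ej - ek)
                       - f(theta - ej + ek) + f(theta - ej - ek)) / (4 * h**2)
    return H

# Hypothetical demo data and closed-form ML estimates
rng_demo = np.random.default_rng(0)
N_demo = 200
x_demo = np.column_stack([np.ones(N_demo), rng_demo.normal(size=N_demo)])
y_demo = x_demo @ np.array([1.0, 1.0]) + 3.0 * rng_demo.normal(size=N_demo)
b_hat = np.linalg.lstsq(x_demo, y_demo, rcond=None)[0]
s_hat = np.sqrt(np.mean((y_demo - x_demo @ b_hat) ** 2))
theta_hat = np.append(b_hat, s_hat)

# B-hat: average outer product of the (N, P) gradient matrix
s = num_scores(lambda t: q_demo(t, y_demo, x_demo), theta_hat)
B_hat = s.T @ s / N_demo
# A-hat: Hessian of the mean criterion, i.e. minus the average Hessian of l_i
A_hat = num_hessian(lambda t: np.mean(q_demo(t, y_demo, x_demo)), theta_hat)
A_inv = np.linalg.inv(A_hat)

avar = {
    'Hessian': A_inv / N_demo,
    'Outer Product': np.linalg.inv(B_hat) / N_demo,
    'Sandwich': A_inv @ B_hat @ A_inv / N_demo,
}
se_demo = {name: np.sqrt(np.diag(V)) for name, V in avar.items()}
for name, se in se_demo.items():
    print(f'{name:14s} {np.round(se, 4)}')
```

In a correctly specified Gaussian model the three sets of standard errors should be similar, and they should move closer together as $N$ grows.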

## Question 3.a: The Outer Product

Estimate parameters and compute standard errors using the `Outer Product` estimator. (This is the *default* variance estimator in `est.estimate`.)

In [None]:
results1 = est.estimate(lm.q, theta0.flatten(), y, x)

In [None]:
label = ['beta 1', 'beta 2', 'sigma2']
est.print_table(label, results1, title='Maximum Likelihood results', num_decimals=2)

Your table should look a little like this: <br>

Maximum Likelihood results <br>

|        |   theta |   se |   t |
|--------|--------|------|------------|
| beta 1 |   0.99 | 0.31 |       3.23 |
| beta 2 |   1.36 | 0.40 |       3.42 |
| sigma2 |   2.91 | 0.23 |      12.66 |

## Question 3.b: The 'Sandwich' Estimator

Compute standard errors using the `Sandwich` estimator. 

In [None]:
results_san = est.estimate(lm.q, theta0, y, x, cov_type='Sandwich')
est.print_table(label, results_san, title='Maximum Likelihood results', num_decimals=3)

Results should look like this: 

|        |   theta |    se |      t |
|:-------|--------:|------:|-------:|
| beta 1 |   0.986 | 0.294 |  3.357 |
| beta 2 |   1.358 | 0.362 |  3.753 |
| sigma2 |   2.911 | 0.195 | 14.961 |

## Question 3.c: The 'Hessian' Estimator

Compute standard errors using the `Hessian` estimator. 

In [None]:
results_he = est.estimate(lm.q, theta0, y, x, cov_type='Hessian')
est.print_table(label, results_he, title='Maximum Likelihood results', num_decimals=3)

Results should look like this: 

|        |   theta |    se |      t |
|:-------|--------:|------:|-------:|
| beta 1 |   0.986 | 0.292 |  3.381 |
| beta 2 |   1.358 | 0.377 |  3.604 |
| sigma2 |   2.911 | 0.206 | 14.142 |

# Question 4: Monte Carlo Study

**Task:** Conduct a Monte Carlo study for different sample sizes. 

Conduct a Monte Carlo study of the Maximum Likelihood estimator. Try
various values of $N$ to illustrate the consistency of the
estimator. Is the estimator biased? Compare the three types of
standard error estimates to the Monte Carlo sampling standard
deviation. Is the estimator of $\sigma$ consistent?
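
The idea can be sketched without the course modules. This is a simplified illustration, not the solution to the exercise: it uses the closed-form Gaussian ML estimator (OLS coefficients and $\hat{\sigma} = \sqrt{N^{-1}\sum_i \hat{u}_i^2}$) instead of numerical optimization, and it assumes a standard normal regressor, since the data generating process inside `lm.sim_data` is not shown here. All names ending in `_demo` or `_mc` are hypothetical.

```python
import numpy as np

rng_mc = np.random.default_rng(0)
beta_true = np.array([1.0, 1.0])
sigma_true = 3.0
NN_demo = [5, 15, 50, 200]   # sample sizes
S_demo = 1000                # Monte Carlo replications

mean_est = np.zeros((len(NN_demo), 3))
sd_est = np.zeros((len(NN_demo), 3))
for i, N in enumerate(NN_demo):
    draws = np.zeros((S_demo, 3))
    for s in range(S_demo):
        # simulate one dataset (assumed DGP: constant + standard normal regressor)
        x = np.column_stack([np.ones(N), rng_mc.normal(size=N)])
        y = x @ beta_true + sigma_true * rng_mc.normal(size=N)
        # closed-form ML estimates: OLS for beta, sqrt(SSR / N) for sigma
        b = np.linalg.lstsq(x, y, rcond=None)[0]
        sig = np.sqrt(np.mean((y - x @ b) ** 2))
        draws[s] = np.append(b, sig)
    mean_est[i] = draws.mean(axis=0)
    sd_est[i] = draws.std(axis=0, ddof=1)   # Monte Carlo sampling standard deviation

bias = mean_est - np.append(beta_true, sigma_true)
for i, N in enumerate(NN_demo):
    print(f'N = {N:4d}  bias = {np.round(bias[i], 3)}  MC sd = {np.round(sd_est[i], 3)}')
```

The ML estimator of $\sigma$ divides the sum of squared residuals by $N$ rather than $N-K$, so it is biased downwards in small samples, while the shrinking bias and sampling standard deviation as $N$ grows illustrate consistency.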

In [None]:
NN = [5, 15, 50, 200]  # Sample size
S = 1_000  # Number of replications
P = len(theta0)

# Initialize containers for all MC experiments
theta_n    = np.zeros((len(NN), P, S))
se_theta_n = np.zeros((len(NN), P, S))
MC_se      = np.zeros((len(NN), P))

In [None]:
for i, N in enumerate(NN): # loop over sample sizes 
    print(f'N = {N:5d}: {i+1}/{len(NN)}')
    for s in range(S): # for each Monte Carlo replication 
        y, x = # simulate N observations 
        theta0 = # find starting values, and scale them slightly by 0.8
        results = # estimate parameters 
        
        theta_n[i, :, s] = results['theta_hat']
        se_theta_n[i, :, s] = results['se']
    
    MC_se[i, :] = np.std(theta_n[i,:,:], axis=1, ddof=1)

### Plotting the results

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel() 
i_theta = 1 # second beta, i.e. not the constant 
theta_diff = theta_n[:, i_theta, :] - true_theta[i_theta, 0]

for i, ax in enumerate(axes):
    ax.hist(theta_diff[i,:], bins=20)
    ax.set_xlim(-10, 10)
    ax.set_xlabel('$\\hat{\\theta}_1 - \\theta^o_1$')

plt.tight_layout()


In [None]:
fig, ax = plt.subplots()
aa = np.linspace(0.3,0.5,len(NN))
for i, N in enumerate(NN): 
    xx = np.linspace(-6,6,30)
    ax.hist(theta_diff[i, :], bins=xx, alpha=aa[i], label=f'N = {N}', density=True)
ax.legend(); 

### Same graph, scaled by $\sqrt{N}$

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel() 
i_theta = 1 # second beta, i.e. not the constant 
theta_diff = theta_n[:, i_theta, :] - true_theta[i_theta, 0]

for i, ax in enumerate(axes):
    ax.hist(theta_diff[i,:] * np.sqrt(NN[i]), bins=30)
    ax.set_xlim(-15, 15)
    
    ax.set_xlabel(f'$\\sqrt{{N}} (\\hat{{\\theta}}_{i_theta} - \\theta^o_{i_theta})$')
    ax.set_ylabel('Monte Carlo samples')
    ax.set_title(f'N = {NN[i]}')

plt.tight_layout()


# Question 5: Alternative Minimization Algorithms

**Task:** Estimate the model using alternative minimization algorithms (the input `method` to `minimize`). Compare how many function evaluations they take and whether they converge to the global minimum. 

* `BFGS`: The default algorithm (quasi-Newton with an approximated Hessian and numerical gradients), 
* `CG`: Nonlinear conjugate gradient, which uses numerical gradients but no Hessian,  
* `Nelder-Mead`: Gradient-free optimizer, 
* `Powell`: Another gradient-free optimizer. 

***Hint:*** `est.estimate()` accepts various extra args, which are by default just passed to `scipy.optimize.minimize`. Try alternatives for `method` (the algorithm). 

***Note:*** The gradient-free optimizers do not return an inverse Hessian, so we can only compute the Outer Product variance matrix for those (unless we compute the Hessian numerically). 
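
A comparison of this kind can be sketched by calling `scipy.optimize.minimize` directly. The data, the criterion `q_demo`, and the starting values below are assumptions made for this example (they stand in for `lm.sim_data`, `lm.q` and `lm.starting_values`); only the four `method` names come from the task.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical demo data: constant + one regressor, beta = (1, 1), sigma = 3
rng_demo = np.random.default_rng(0)
N_demo = 100
x_demo = np.column_stack([np.ones(N_demo), rng_demo.normal(size=N_demo)])
y_demo = x_demo @ np.array([1.0, 1.0]) + 3.0 * rng_demo.normal(size=N_demo)

def q_demo(theta, y, x):
    """Negative Gaussian loglikelihood contributions; theta = (beta, sigma)."""
    beta, sigma = theta[:-1], theta[-1]
    u = y - x @ beta
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.log(sigma**2) + 0.5 * u**2 / sigma**2

Q_demo = lambda theta: np.mean(q_demo(theta, y_demo, x_demo))

# Closed-form ML solution, used for starting values and as a benchmark
b_ols = np.linalg.lstsq(x_demo, y_demo, rcond=None)[0]
s_ml = np.sqrt(np.mean((y_demo - x_demo @ b_ols) ** 2))
theta_start = 0.8 * np.append(b_ols, s_ml)
Q_star = Q_demo(np.append(b_ols, s_ml))   # criterion value at the global minimum

results_demo = {}
for method in ['BFGS', 'CG', 'Nelder-Mead', 'Powell']:
    res = minimize(Q_demo, theta_start, method=method)
    results_demo[method] = res
    print(f'{method:12s} nfev = {res.nfev:4d}  Q = {float(res.fun):.4f}  '
          f'theta = {np.round(res.x, 3)}')
```

The attribute `nfev` on the returned result counts function evaluations, including those used for numerical gradients in `BFGS` and `CG`, so it gives a rough common yardstick across gradient-based and gradient-free methods.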

In [None]:
N = 100
y, x = lm.sim_data(N, true_theta, rng)
theta0 = lm.starting_values(y, x)*0.8

results_BFGS = est.estimate(lm.q, theta0, y, x) # the default option is method='BFGS'
results_CG   = # estimate using method='CG'
results_NM   = # and with method='Nelder-Mead'
results_PO   = # and with method='Powell'

In [None]:
print(est.print_table(label, results_BFGS))
print(est.print_table(label, results_CG))
print(est.print_table(label, results_NM))
print(est.print_table(label, results_PO))

Expected results: 

In [None]:
# --- BFGS ---
# Optimizer succeeded after 14 iter. (60 func. evals.). Final criterion:    1.533.
#Results
#         theta      se        t
# beta 1  0.6397  0.2862   2.2351
# beta 2  0.4773  0.2920   1.6345
# sigma2  2.8084  0.1992  14.0965
# --- CG ---
# Optimizer succeeded after 8 iter. (76 func. evals.). Final criterion:    1.533.
# Results
#          theta      se        t
# beta 1  0.6397  0.2862   2.2354
# beta 2  0.4774  0.2920   1.6345
# sigma2  2.8084  0.1992  14.0964
# --- Nelder-Mead ---
# Optimizer succeeded after 64 iter. (116 func. evals.). Final criterion:    1.533.
# Results
#          theta      se        t
# beta 1  0.6397  0.2862   2.2351
# beta 2  0.4773  0.2921   1.6343
# sigma2  2.8084  0.1992  14.0963
# --- Powell ---
# Optimizer succeeded after 2 iter. (69 func. evals.). Final criterion:    1.533.
# Results
#          theta      se        t
# beta 1  0.6396  0.2862   2.2348
# beta 2  0.4773  0.2921   1.6342
# sigma2  2.8084  0.1992  14.0963