## 1. Two-Stage Least Squares

```python
from typing import Tuple
from dataclasses import dataclass

import numpy as np
from scipy.stats import bootstrap

np.random.seed(42)


def least_squares(y: np.ndarray, x: np.ndarray) -> Tuple[float, float]:
    """OLS of y on a constant and x; returns (intercept, slope)."""
    assert y.ndim == 1
    assert x.ndim == 1

    x_ = np.stack((np.ones_like(x), x), axis=1)  # Shape: (N, 2)
    xtx = np.dot(x_.T, x_)  # Shape: (2, 2)
    xty = np.dot(x_.T, y)  # Shape: (2,)

    # Solve the normal equations (X'X) beta = X'y
    beta = np.linalg.solve(xtx, xty)  # Shape: (2,)

    return beta[0], beta[1]  # Intercept, slope


@dataclass
class DGP:
    N: int = 1_000
    number_of_simulations: int = 1_000

    # Loadings that induce correlation between the variables
    # (regression-style coefficients, not correlation coefficients)
    rho_xu: float = 0.5
    rho_zx: float = 0.4
    rho_zq: float = 0.4

    true_beta_0: float = 2.0
    true_beta_1: float = 5.0

    def __call__(self) -> np.ndarray:
        """Runs the simulation, returns the estimates of beta_1."""
        estimates = np.full(shape=(self.number_of_simulations,), fill_value=np.nan)

        # Generate the data
        # NOTE: Keeping the full shape can be memory intensive for large N or number_of_simulations
        u = np.random.standard_normal(size=(self.number_of_simulations, self.N))
        x = self.rho_xu * u + np.random.standard_normal(size=(self.number_of_simulations, self.N))
        q = np.random.standard_normal(size=(self.number_of_simulations, self.N))
        z = self.rho_zx * x + self.rho_zq * q + np.random.standard_normal(size=(self.number_of_simulations, self.N))

        # q enters y directly and also drives z, so z is correlated
        # with the structural error (u + 2 * q)
        y = self.true_beta_0 + self.true_beta_1 * x + u + 2 * q

        for i in range(self.number_of_simulations):
            # Run the two-stage least squares regression
            # First stage: regress x on z
            b0, b1 = least_squares(x[i], z[i])
            x_hat = b0 + b1 * z[i]

            # Second stage: regress y on x_hat
            b0, b1 = least_squares(y[i], x_hat)
            estimates[i] = b1

        return estimates


dgp = DGP()
estimates = dgp()

avg_beta_1 = np.mean(estimates)
bias = avg_beta_1 - dgp.true_beta_1

print("Mean of beta_1 estimates: {:.4f}".format(avg_beta_1))
print("Bias of beta_1 estimates: {:.4f}".format(bias))

# Bootstrap the mean deviation from the true slope (default confidence level: 95%)
bs = bootstrap(data=[estimates - dgp.true_beta_1], statistic=np.mean)
low, high = bs.confidence_interval
print("95% CI of Estimator Bias: [{:.4f}, {:.4f}]".format(low, high))

print("True beta_1: {:.4f}".format(dgp.true_beta_1))
```

```text
Mean of beta_1 estimates: 7.0019
Bias of beta_1 estimates: 2.0019
95% CI of Estimator Bias: [1.9895, 2.0149]
True beta_1: 5.0000
```


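The bias of an estimator is its _expected deviation from the true value_, $\mathbb{E}[\hat{\beta}_1] - \beta_1$.
Since the bias is an expectation, and each simulation yields one draw of $\hat{\beta}_1$, we can estimate a confidence interval for it
using the bootstrap method.

For this configuration, the 95% confidence interval excludes $0$, so we conclude that the estimator is biased.
The cause is visible in the data-generating process: the instrument $z$ depends on $q$, which also enters $y$ directly, so $z$ is correlated with the structural error $u + 2q$ and the exclusion restriction fails.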
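
The size of the bias can also be checked against the population moments of the data-generating process. With a single instrument, the 2SLS slope converges to $\beta_1 + \text{Cov}(z, \varepsilon) / \text{Cov}(z, x)$, where $\varepsilon = u + 2q$ is the structural error. The sketch below evaluates that ratio; it assumes the default `DGP` parameters and independent standard normal shocks, and the variable names are illustrative only.

```python
# Population moments implied by the simulated DGP.
# Assumption: all shocks are independent N(0, 1), as in the simulation code.
rho_xu, rho_zx, rho_zq = 0.5, 0.4, 0.4
coef_q_in_y = 2.0  # y = beta_0 + beta_1 * x + u + 2 * q

var_x = rho_xu**2 + 1.0   # x = rho_xu * u + e_x
cov_xu = rho_xu           # Cov(x, u)
cov_zx = rho_zx * var_x   # z = rho_zx * x + rho_zq * q + e_z

# Cov(z, u + 2q): one part comes through x (via u), one part through q
cov_z_err = rho_zx * cov_xu + rho_zq * coef_q_in_y

asymptotic_bias = cov_z_err / cov_zx
print(f"Asymptotic bias of the IV slope: {asymptotic_bias:.4f}")
```

The ratio evaluates to $1.0 / 0.5 = 2$, which lies inside the bootstrap interval reported above.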

## 2. Maximum Likelihood of Exponentially Distributed $X$

Let
$$X \sim \text{Exp}(\lambda)$$
so that the probability density function (PDF) is given by:
$$f(x; \lambda) = \lambda e^{-\lambda x}$$
for $\lambda, x > 0$.

### (a) Likelihood and Log-Likelihood Functions

For a sample of $N$ independent observations $x_i$ drawn from the distribution, the likelihood function is given by:
$$L(\lambda) = \prod_{i=1}^{N} f(x_i; \lambda) = \prod_{i=1}^{N} \lambda e^{-\lambda x_i} = \lambda^N \exp\left(-\lambda \sum_{i=1}^{N} x_i\right).$$
Taking logarithms, we get the log-likelihood function:
$$\ell(\lambda) = \log L(\lambda) = N \log \lambda - \lambda \sum_{i=1}^{N} x_i.$$
The data enter only through $\sum_{i=1}^{N} x_i$, the sum of the observed values.
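
As a quick sanity check on the closed form, the sketch below compares it with the sum of log-densities from `scipy.stats.expon`, which parameterizes the distribution by `scale = 1 / lambda`. The function name `exp_log_likelihood`, the seed, and the rate of 1.5 are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import expon


def exp_log_likelihood(lam: float, x: np.ndarray) -> float:
    """Log-likelihood N * log(lam) - lam * sum(x) of an i.i.d. Exp(lam) sample."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return x.size * np.log(lam) - lam * np.sum(x)


rng = np.random.default_rng(0)
rate = 1.5  # arbitrary rate used only for this check
sample = rng.exponential(scale=1 / rate, size=200)

closed_form = exp_log_likelihood(rate, sample)
via_scipy = expon.logpdf(sample, scale=1 / rate).sum()
print("Closed form matches scipy:", np.isclose(closed_form, via_scipy))
```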

### (b) Maximum Likelihood Estimation of $\lambda$

The first-order condition is given by:
$$\frac{\partial \ell(\lambda)}{\partial \lambda} = \frac{N}{\lambda} - \sum_{i=1}^{N} x_i = 0 \implies \frac{N}{\lambda} = \sum_{i=1}^{N} x_i.$$
Solving for $\lambda$ gives the maximum likelihood estimator (MLE), the reciprocal of the sample mean:
$$\hat{\lambda} = \frac{N}{\sum_{i=1}^{N} x_i} = \frac{1}{\bar{x}}.$$
The second derivative, $\partial^2 \ell / \partial \lambda^2 = -N / \lambda^2 < 0$, confirms that this is a maximum.
Note that the right-hand side (RHS) is well defined as long as $\sum_{i=1}^{N} x_i > 0$, which holds with probability $1$ because the event $x_i = 0$ has probability $0$ under a continuous distribution.
In practice, however, rounding or limited numerical precision can produce recorded values of exactly $0$,
so we should inspect the sample before applying the estimator.
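
A minimal sketch of the estimator with the sanity checks mentioned above might look as follows. The function name `exp_mle`, the seed, and the true rate are assumptions made for illustration, not part of the original exercise.

```python
import numpy as np


def exp_mle(x) -> float:
    """MLE of the exponential rate: N / sum(x), after basic sanity checks."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        raise ValueError("sample is empty")
    if np.any(x < 0):
        raise ValueError("exponential data must be non-negative")
    total = x.sum()
    if total == 0:
        # Only possible if every observation was recorded as exactly 0
        raise ValueError("sum of observations is 0; the MLE is undefined")
    return x.size / total


rng = np.random.default_rng(0)
true_rate = 1.5  # arbitrary value for the demonstration
sample = rng.exponential(scale=1 / true_rate, size=10_000)

lam_hat = exp_mle(sample)
print(f"True rate: {true_rate:.4f}, MLE: {lam_hat:.4f}")
```

For the three observations `[1.0, 2.0, 3.0]`, the estimator returns $3 / 6 = 0.5$.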

### (c) Asymptotic Variance of MLE

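We know that for large $N$, the MLE is approximately normally distributed:
$$\hat{\theta} \overset{a}{\sim} \mathcal{N}(\theta_0, [I(\theta_0)]^{-1})$$
where $I(\theta_0)$ is the Fisher information matrix of the full sample of $N$ observations, evaluated at the true parameter $\theta_0$, which is
for now assumed to be known a priori.
(Formally, the limit statement is $\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}(0, [I_1(\theta_0)]^{-1})$, where $I_1 = I / N$ is the information in a single observation.)
Hence, the _asymptotic variance_ of the MLE, which in the multivariate case is a variance-covariance matrix, is given by
$I(\theta_0)^{-1}$.
The calculation of the variance is thus a simple (matrix) inversion of the Fisher information matrix.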
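
For the exponential model of this section, the log-likelihood has second derivative $-N / \lambda^2$. Reading $I(\lambda)$ as the information of the full sample gives $I(\lambda) = N / \lambda^2$ and an asymptotic variance of $\lambda^2 / N$. The sketch below compares this value with the empirical variance of simulated MLEs; the rate, sample size, number of simulations, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate, n_obs, n_sims = 2.0, 500, 5_000  # illustrative values

# Each row is one simulated sample; the MLE is N / sum(x) for each row
samples = rng.exponential(scale=1 / true_rate, size=(n_sims, n_obs))
lam_hats = n_obs / samples.sum(axis=1)

fisher_info = n_obs / true_rate**2   # I(lambda) = N / lambda^2
asymptotic_var = 1.0 / fisher_info   # lambda^2 / N
empirical_var = lam_hats.var(ddof=1)

print(f"Asymptotic variance: {asymptotic_var:.5f}")
print(f"Empirical variance:  {empirical_var:.5f}")
```

The two values should be close but not identical, since the asymptotic variance is only an approximation at any finite $N$.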

Further, the Cramér-Rao lower bound (CRLB) states that the variance of any unbiased estimator $\hat{\theta}$
is bounded below by the inverse of the Fisher information matrix:
$$\text{Var}(\hat{\theta}) \geq I(\theta_0)^{-1},$$
the right-hand side of which is the _asymptotic variance_ of the MLE.
In the multivariate case, the inequality means that the difference between the two matrices is positive semidefinite.
The bound therefore gives a best-case variance for unbiased estimators; the MLE, which may be biased in finite samples, attains it asymptotically.
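
To illustrate the bound in the exponential case, the sketch below uses a deliberately small sample size. It relies on two facts about this model that are not derived in the text: $\mathbb{E}[\hat{\lambda}] = N\lambda / (N - 1)$, so the MLE is biased upward, and the rescaled estimator $(N - 1) / \sum_i x_i$ is unbiased with variance $\lambda^2 / (N - 2)$. All numerical settings and variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate, n_obs, n_sims = 2.0, 10, 200_000  # small N makes the gap visible

sums = rng.exponential(scale=1 / true_rate, size=(n_sims, n_obs)).sum(axis=1)
lam_mle = n_obs / sums             # biased in finite samples
lam_unbiased = (n_obs - 1) / sums  # bias-corrected version

crlb = true_rate**2 / n_obs        # inverse Fisher information, lambda^2 / N

print(f"Mean of MLE:                {lam_mle.mean():.4f}")
print(f"Mean of unbiased estimator: {lam_unbiased.mean():.4f}")
print(f"Variance of unbiased est.:  {lam_unbiased.var(ddof=1):.4f}")
print(f"Cramér-Rao lower bound:     {crlb:.4f}")
```

In this setup the unbiased estimator's variance should sit above the bound, and the gap shrinks as $N$ grows.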
