# Assignment 1 - Eleanor Adachi

## 1. Admin

We created a fork of the main GitHub repository. Our code can be found here: https://github.com/eleanor-adachi/ARE212_Materials

## 2. Exercises




### 5.

**Moore-Penrose Inverse**

---

A matrix $A^+$ is a "Moore-Penrose" generalized inverse if:

- $AA^+A = A$;
- $A^+AA^+ = A^+$;
- $A^+A$ is symmetric; and
- $AA^+$ is symmetric.

**Full Rank Factorization**

---

Let $A$ be an $n\times m$ matrix of rank $r$. If $A = LR$, where $L$ is an $n\times r$ full column rank matrix, and $R$ is an $r\times m$ full row rank matrix, then $LR$ is a full rank factorization of $A$.

**Fact**

---

Provided only that $r>0$, the Moore-Penrose inverse $A^+ = R^{\top}(L^{\top}AR^{\top})^{-1}L^{\top}$ exists and is unique.

---

#### (1.) If $A$ is a matrix of zeros, what is $A^+$?

Let $A$ be the $n \times m$ matrix of zeros. Its Moore-Penrose inverse is the $m \times n$ matrix of zeros, $A^+ = 0$. The Fact above does not apply here because $r = 0$, so we check the four defining properties directly with $A^+ = 0$:

- $AA^+A = A$: every factor on the left is a zero matrix, so the product is the $n \times m$ zero matrix, which is $A$.
- $A^+AA^+ = A^+$: likewise, the product on the left is the $m \times n$ zero matrix, which is $A^+$.
- $A^+A$ is symmetric: it is the $m \times m$ zero matrix, which is symmetric.
- $AA^+$ is symmetric: it is the $n \times n$ zero matrix, which is symmetric.

The zero matrix is also the only candidate: any $G$ satisfying the second property must obey $G = GAG = G0G = 0$.

Hence, when $A$ is a matrix of zeros, $A^+$ is the (transposed-shape) matrix of zeros.
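
As a quick numerical sanity check of this result, the sketch below asks NumPy for the pseudoinverse of a zero matrix. The $3 \times 2$ shape is an arbitrary choice made for illustration.

```python
import numpy as np

# Hypothetical example: a 3x2 matrix of zeros
A = np.zeros((3, 2))
A_plus = np.linalg.pinv(A)

# The pseudoinverse has the transposed shape and is also all zeros
shape_ok = A_plus.shape == (2, 3)
all_zero = bool(np.allclose(A_plus, 0))
print(shape_ok, all_zero)
```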


#### (2.) Show that if $X$ has full column rank, then $X^+ = (X^TX)^{-1}X^T$ (this is sometimes called the "left inverse"), and $X^+ X = I$.

If the $n \times m$ matrix $X$ has full column rank, its $m$ columns are linearly independent, so $X^TX$ is an $m \times m$ matrix of rank $m$ and is therefore invertible.

In that case the Moore-Penrose inverse takes the form $X^+ = (X^TX)^{-1}X^T$. This expression is sometimes referred to as the "left inverse" because multiplying $X$ by it on the left gives the identity matrix: $X^+X = I$.

*Proof:*

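1. **Derive the expression for $X^+$ from the Fact**: 

   Because $X$ has full column rank $m$, $X = XI_m$ is a full rank factorization with $L = X$ and $R = I_m$. Substituting into $A^+ = R^{\top}(L^{\top}AR^{\top})^{-1}L^{\top}$ gives $X^+ = I_m(X^TXI_m)^{-1}X^T = (X^TX)^{-1}X^T$.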

2. **Show that multiplying by $X$ yields $I$**:

   Calculate $X^+X = [(X^TX)^{-1}X^T]X = (X^TX)^{-1}(X^TX) = I$.
   
   Here, the product $(X^TX)$ is invertible because $X$ has full column rank, ensuring that $X^TX$ is a full rank square matrix and thus invertible. Multiplying this invertible matrix by its inverse yields the identity matrix, $I$.

This demonstrates that when $X$ has full column rank, its Moore-Penrose inverse $X^+$, when multiplied by $X$, yields the identity matrix, confirming that $X^+X = I$.
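
The proof can also be checked numerically. The following is a minimal sketch, assuming a randomly drawn $6 \times 3$ matrix (which has full column rank with probability one); it verifies the four Moore-Penrose properties, the identity $X^+X = I$, and agreement with `np.linalg.pinv`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # hypothetical matrix; full column rank with probability one

# Candidate left inverse from the formula above
X_plus = np.linalg.inv(X.T @ X) @ X.T

checks = {
    "X X+ X = X": np.allclose(X @ X_plus @ X, X),
    "X+ X X+ = X+": np.allclose(X_plus @ X @ X_plus, X_plus),
    "X+ X symmetric": np.allclose(X_plus @ X, (X_plus @ X).T),
    "X X+ symmetric": np.allclose(X @ X_plus, (X @ X_plus).T),
    "X+ X = I": np.allclose(X_plus @ X, np.eye(3)),
    "matches np.linalg.pinv": np.allclose(X_plus, np.linalg.pinv(X)),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Note that $XX^+$ is symmetric but is not the identity: it is the $n \times n$ projection onto the column space of $X$.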



#### (3.) Use the result of (2) to solve for $b$ in the (matrix) form of the regression $y = Xb + u$ if $X^Tu = 0$.

Given the regression equation $y = Xb + u$ where $X$ has full column rank and it's given that $X^Tu = 0$, we aim to solve for the coefficient vector $b$. We leverage the property of the Moore-Penrose inverse that if $X$ has full column rank, then $X^+ = (X^TX)^{-1}X^T$ and $X^+X = I$.

**Starting from the regression equation**: 

   $$y = Xb + u$$

**Apply the Moore-Penrose inverse of $X$ to both sides**:

   Since we know $X^+X = I$, multiplying both sides by $X^+$ yields:

   $$X^+y = X^+Xb + X^+u$$

**Use $X^Tu = 0$**:

   The error term drops out because $X^+u = (X^TX)^{-1}X^Tu = (X^TX)^{-1}0 = 0$, so:

   $$X^+y = X^+Xb + 0$$

   Since $X^+X = I$, this simplifies to:

   $$X^+y = b$$

**Thus, the solution for $b$ is**:

   $$b = X^+y$$

   Where $X^+ = (X^TX)^{-1}X^T$ is the Moore-Penrose inverse of $X$.

This method shows how to isolate the coefficient vector $b$ in the presence of a noise vector $u$ that is orthogonal to the column space of $X$ ($X^Tu = 0$). 
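
A small simulation illustrates the result. This is only a sketch under assumed values: the sizes, the coefficient vector `b`, and the way `u` is built (as the residual from projecting an arbitrary vector on $X$, which makes $X^Tu = 0$ hold exactly) are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 50, 3                      # hypothetical sample size and number of regressors
X = rng.normal(size=(N, m))
b = np.array([0.5, 1.0, -2.0])    # hypothetical true coefficients

X_plus = np.linalg.inv(X.T @ X) @ X.T

# Build u orthogonal to the columns of X: take an arbitrary vector
# and remove its projection onto the column space of X
v = rng.normal(size=N)
u = v - X @ (X_plus @ v)

y = X @ b + u
b_hat = X_plus @ y

orthogonal = bool(np.allclose(X.T @ u, 0))
recovered = bool(np.allclose(b_hat, b))
print(orthogonal, recovered)
```

Because $X^Tu = 0$ holds exactly in the sample here, $b$ is recovered exactly (up to floating-point error) rather than merely estimated.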


## 5. Simultaneous Equations

When we defined the general weighted regression, we didn’t assume anything about the dimension of the different objects except that they were “conformable.” So: consider

(2) $y = X\beta + u$, with $ET'u = 0$

#### 1. What does our assumption of conformability then imply about the dimensions of $X$, $\beta$, $T$, and $u$?

In the context of the general weighted regression equation $y = X\beta + u$, with the condition $E[T'u] = 0$ and $y$ being an $N \times k$ matrix, the assumption of conformability dictates the following about the dimensions of $X$, $\beta$, $T$, and $u$:

- **$X$**: This is the matrix of independent variables or predictors. For the matrix multiplication $X\beta$ to conform to $y$, $X$ must have dimensions of $N \times m$, where $m$ is the number of independent variables. This dimensionality ensures that $X$ has one row per observation and one column for each independent variable.

- **$\beta$**: This matrix contains the regression coefficients associated with each independent variable across each of the $k$ dependent variables. For the product $X\beta$ to result in an $N \times k$ matrix (like $y$), $\beta$ must be of dimension $m \times k$, allowing each independent variable to potentially influence each dependent variable uniquely.

- **$u$**: The error term must have the same dimension as $y$ to maintain conformability. Therefore, $u$ is an $N \times k$ matrix.

- **$T$**: This is the matrix of observables assumed to be orthogonal to the errors. For the product $T'u$ to be defined, $T'$ must have $N$ columns, so $T$ must be of dimension $N \times p$, where $p$ is the number of variables in $T$. The product $T'u$ is then $p \times k$, and the condition $E[T'u] = 0$ states that every column of $T$ is uncorrelated with the error of every equation.
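
The shape bookkeeping can be checked directly in NumPy. In this sketch the sizes `N`, `m`, `k`, and `p` are hypothetical, and $T$ is stored with one row per observation ($N \times p$) so that $T'u$ is defined.

```python
import numpy as np

N, m, k, p = 100, 4, 3, 5          # hypothetical sizes
rng = np.random.default_rng(2)

X = rng.normal(size=(N, m))        # N x m regressors
beta = rng.normal(size=(m, k))     # m x k coefficients, one column per equation
u = rng.normal(size=(N, k))        # N x k errors
T = rng.normal(size=(N, p))        # N x p observables orthogonal to u

y = X @ beta + u                   # N x k
moment = T.T @ u / N               # p x k sample analogue of E[T'u]

print(y.shape, moment.shape)
```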

#### 2. Could you use the estimator we developed in weighted_regression.ipynb to estimate this system of simultaneous equations?

In [7]:
%matplotlib inline
import numpy as np
from scipy.stats import multivariate_normal

# Number of observables in T
k = 3 

# Mean and covariance matrix for the multivariate normal distribution
mu = [0] * k
Sigma = [[1, 0.5, 0], [0.5, 2, 0], [0, 0, 3]]

# Generate T 
T = multivariate_normal(mean=mu, cov=Sigma)

# The error term u is a scalar N(0, 0.2) draw per observation,
# independent of T; it is sampled below as u_sample

# Random matrix D for generating X
D = np.random.random(size=(3, 2)) 

# Sample size
N = 1000 

# Generating samples for T and u
T_sample = T.rvs(N)
u_sample = multivariate_normal(mean=0, cov=0.2).rvs(N)

# Generate X using a non-linear transformation of T
X = (T_sample ** 3) @ D  

# Define beta
beta = np.array([1/2, 1])

# Generate y (a single dependent variable, shape (N,), since beta is a vector)
y = X @ beta + u_sample  

# Estimate beta by ordinary least squares
b_estimates = np.linalg.lstsq(X, y, rcond=None)[0]

# Print estimated beta coefficients
print("Estimated beta coefficients:\n", b_estimates)

# Calculate residuals
e = y - X @ b_estimates

# Calculate residuals variance
var_e = np.var(e, ddof=X.shape[1])  # Adjusted for degrees of freedom

# Calculating X'X inverse (X transpose X)
XX_inv = np.linalg.inv(X.T @ X)

# Calculating covariance matrix of b estimates
vb = var_e * XX_inv

print("\nCovariance matrix of b estimates:\n", vb)

Estimated beta coefficients:
 [0.50041179 0.99883475]

Covariance matrix of b estimates:
 [[ 7.50317205e-05 -6.91033001e-05]
 [-6.91033001e-05  6.40855954e-05]]


## 6. SUR 

#### (1.) If $\Omega$ isn’t diagonal then there’s a sense in which the different equations in the system are dependent, since observing a realization of, say, $y_1$ may change our prediction of $y_2$. (This is why the system is called “seemingly” unrelated.) Describe this dependence formally.


In the context of Seemingly Unrelated Regressions (SUR), when the covariance matrix $\Omega$ is not diagonal, it indicates that the error terms of the different equations are correlated. This correlation leads to a dependence among the equations in the system, despite their appearance as unrelated. 

The covariance matrix $\Omega$ represents the covariance of the error terms across different equations. A non-diagonal $\Omega$ means that for some $i \neq j$, the covariance $\mathrm{cov}(u_i, u_j) \neq 0$, where $u_i$ and $u_j$ are error terms from different equations in the system. This non-zero covariance implies a statistical dependence between the error terms, and consequently, between the equations themselves.

Formally, the dependence can be described as follows:

- Observing a realization of $y_1$ (which is influenced by $u_1$) can inform us about $u_2$, and hence, about the potential realization of $y_2$, if $\mathrm{cov}(u_1, u_2) \neq 0$. This is because the realization of $u_1$ provides information that can be used to update the expected value of $u_2$, reflecting a departure from independence.

- This interdependence signifies that the errors (and therefore the outcomes) of one equation are informative about the errors (and outcomes) of another equation within the system. Thus, shocks or variations in one part of the system can have implications for other parts, which would not be the case if the equations were truly unrelated and the covariance matrix $\Omega$ were diagonal.

The presence of this dependence suggests that estimating the equations jointly, taking into account the correlation among the error terms, can yield more efficient parameter estimates than estimating each equation separately without considering such correlations. The gain is in precision: equation-by-equation estimation remains unbiased under the same orthogonality condition.
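
One way to write this dependence down is through a conditional expectation. As an illustration (an assumption, not something the problem states), suppose the errors of equations 1 and 2 are jointly normal and independent of $X$, with $\mathrm{var}(u_1) = \sigma_{11}$ and $\mathrm{cov}(u_1, u_2) = \sigma_{12}$ taken from $\Omega$, and let $\beta_1$ and $\beta_2$ denote the coefficient vectors of the two equations. Then:

$$E[u_2 \mid u_1] = \frac{\sigma_{12}}{\sigma_{11}}\,u_1, \qquad E[y_2 \mid y_1, X] = X\beta_2 + \frac{\sigma_{12}}{\sigma_{11}}\,(y_1 - X\beta_1)$$

When $\sigma_{12} = 0$ the second term vanishes and $y_1$ carries no information about $y_2$ beyond $X$. Without normality, the same expression is the best linear predictor of $y_2$ rather than its conditional expectation.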

#### (2) Adapt the code in weighted_regression.ipynb so that the data-generating process for $u$ can accommodate a general covariance matrix such as $\Omega$, and let $X = T$. Estimate $\beta$.

In [20]:
%matplotlib inline
import numpy as np
from scipy.stats import multivariate_normal
from numpy.linalg import lstsq, pinv

# Define parameters
k = 3  # Number of observables in T
N = 1000  # Sample size
D = np.random.random(size=(3,3)) # Generate random 3x3 matrix

# Parameters for generating T
mu = [0] * k
Sigma = [[1, 0.5, 0],
         [0.5, 2, 0],
         [0, 0, 3]]  # Covariance matrix for T

# Generate sample T
T = multivariate_normal(mu, Sigma).rvs(N)

# Define a general covariance matrix Omega for u
A = np.random.rand(3,3)
# Construct a positive semidefinite covariance matrix Omega as A times its transpose
Omega = np.dot(A, A.T)

# Generate u using Omega, taking the first component to maintain scalar outcome
u = multivariate_normal(mean=np.zeros(k), cov=Omega).rvs(N)[:, 0]
# Set X = T
X = T

# Define beta 
beta = np.array([0.5, 1, -0.5])  


# Generate y
y = X @ beta + u

# Estimate beta using least squares
b_est = lstsq(X, y, rcond=None)[0]

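# Print the dimensions, then the estimated beta
print(f"Dimensions of X: {X.shape}")
print(f"Dimensions of beta: {beta.shape}")
print(f"Dimensions of u: {u.shape}")
print(f"Estimated beta: {b_est}")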

# Compute the residuals and report their variance
e = y - X @ b_est
print(f"Residual variance: {np.var(e)}")

Dimensions of X: (1000, 3)
Dimensions of beta: (3,)
Dimensions of u: (1000,)
Estimated beta: [ 0.48202934  1.00579006 -0.48408004]
Residual variance: 0.6010909253684503


#### (3) How are the estimates obtained from this SUR system different from what one would obtain if one estimated equation by equation using OLS?

The SUR approach can yield different estimates compared to estimating each equation separately using OLS when the error terms across equations are correlated. This correlation among error terms is what SUR explicitly accounts for, which can lead to efficiency gains in the parameter estimates.

SUR leverages the covariance structure among the error terms across different equations. When these error terms are correlated, SUR, by considering this correlation, can provide more efficient estimates. Efficiency here refers to the variance of the estimator; more efficient estimators have smaller variances and are, therefore, closer to the true parameter value on average.

Estimating each equation separately with OLS ignores any correlation between the error terms of different equations. OLS remains unbiased and consistent for each equation under the orthogonality condition, but when the errors are correlated across equations it discards the information in that correlation and is therefore generally less efficient than SUR.

The key difference arises in situations where the error terms across equations are correlated. In such cases:

- SUR can produce parameter estimates with smaller standard errors compared to OLS because it utilizes the information contained in the covariance structure of the errors.
- OLS treats each equation as if it is standalone, ignoring the potential gains from understanding how the error terms across equations relate.

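There are two cases in which nothing is gained. If the error terms are uncorrelated across equations ($\Omega$ diagonal), the SUR estimator computed with the true $\Omega$ coincides with OLS applied equation by equation. And if every equation has the same regressors, as in part (2) where $X = T$ for every equation, the SUR (GLS) estimator is numerically identical to equation-by-equation OLS regardless of $\Omega$.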
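
The comparison can be made concrete with a small simulation. The sketch below is not the code from weighted_regression.ipynb: it assumes a system in which every equation shares the same regressors, uses the true $\Omega$ in the GLS step, and picks all sizes and parameter values arbitrarily. In this setting the SUR (GLS) coefficients coincide with equation-by-equation OLS even though $\Omega$ is not diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, k = 200, 3, 2               # hypothetical: observations, regressors, equations

X = rng.normal(size=(N, m))       # the same regressors appear in every equation
B = rng.normal(size=(m, k))       # one coefficient column per equation

A = rng.normal(size=(k, k))
Omega = A @ A.T + np.eye(k)       # non-diagonal, positive definite error covariance
U = rng.multivariate_normal(np.zeros(k), Omega, size=N)
Y = X @ B + U

# Equation-by-equation OLS (lstsq solves for every column of Y at once)
B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

# GLS on the stacked system vec(Y) = (I_k kron X) vec(B) + vec(U),
# where Cov(vec(U)) = Omega kron I_N
Omega_inv = np.linalg.inv(Omega)
y_stacked = Y.T.reshape(-1)                    # equation 1's N observations first
lhs = np.kron(Omega_inv, X.T @ X)              # (I kron X)' (Omega^-1 kron I) (I kron X)
rhs = np.kron(Omega_inv, X.T) @ y_stacked      # (I kron X)' (Omega^-1 kron I) vec(Y)
B_sur = np.linalg.solve(lhs, rhs).reshape(k, m).T

same = bool(np.allclose(B_ols, B_sur))
print(same)
```

The two estimators would generally differ if the equations had different regressors, which is where the efficiency gain from SUR arises.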

## 8. “Plug-in” Kernel Bias Estimator

In our discussion of bias of the kernel density estimator in lecture we
constructed an “Oracle” estimator, which can be implemented when we
know the true density $f$ that we’re trying to estimate. Of course, the Oracle estimator is only feasible when we don’t need it. What about the idea of using the same expression for bias as in
the Oracle case, but replacing $f$ with our estimate $\hat{f}$? Would this tell us anything useful? If so, under what conditions? What pitfalls might one encounter?

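Replacing $f$ with $\hat{f}$ in the Oracle bias expression can be informative, but only as an approximation: it applies the bias formula to an estimate that is itself smoothed and noisy, so it resembles the true bias only when $\hat{f}$ is already close to $f$. For this approach to be meaningful, certain conditions would have to be met:

- **Consistency**: The kernel density estimator $\hat{f}$ must be consistent, converging to $f$ as the sample size $N$ increases. This requires the bandwidth $h$ to shrink to zero slowly enough that $Nh \to \infty$.
- **Smoothness**: The true density $f$ should be sufficiently smooth (for example, twice continuously differentiable), since the leading bias term depends on the curvature of $f$ and $\hat{f}$ must be able to reproduce that curvature.

Several challenges might arise with this approach:

- **Circularity**: The bias of $\hat{f}$ is estimated from $\hat{f}$ itself, so whatever error $\hat{f}$ contains carries over into the bias estimate.
- **Bandwidth Sensitivity**: The bias estimate depends on the bandwidth twice, once through $\hat{f}$ and once through the bias expression, so a poor bandwidth choice can significantly distort the outcome.
- **Variance Ignorance**: The bias expression says nothing about variance, so it cannot by itself guide the bias-variance trade-off. The plug-in bias estimate is also a random quantity whose own sampling noise grows as the bandwidth shrinks.
- **Sample Size Sensitivity**: The effectiveness of the approach may be highly dependent on the sample size, with small samples potentially leading to misleading bias estimates.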
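
To make the idea concrete, here is a minimal sketch comparing the Oracle and plug-in bias in a case where the truth is known. It assumes the Oracle bias expression from lecture is the usual $E\hat{f}_h(x) - f(x) = (K_h * f)(x) - f(x)$, where $*$ denotes convolution; the Gaussian kernel, the bandwidth, the standard normal truth, and the helper names `kde_gaussian` and `normal_pdf` are all hypothetical choices made for illustration.

```python
import numpy as np

def kde_gaussian(x_grid, data, h):
    """Gaussian kernel density estimate evaluated at each point of x_grid."""
    z = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(4)
data = rng.normal(size=5000)        # hypothetical sample; true f is standard normal
h = 0.3                             # hypothetical bandwidth
x_grid = np.linspace(-3, 3, 61)

f_hat = kde_gaussian(x_grid, data, h)

# Oracle bias: (K_h * f)(x) - f(x). For f = N(0, 1) and a Gaussian kernel,
# the convolution K_h * f is the N(0, 1 + h^2) density.
oracle_bias = normal_pdf(x_grid, np.sqrt(1 + h**2)) - normal_pdf(x_grid, 1.0)

# Plug-in bias: (K_h * f_hat)(x) - f_hat(x). Smoothing a Gaussian KDE with the
# same Gaussian kernel gives a Gaussian KDE with bandwidth sqrt(2) * h.
plugin_bias = kde_gaussian(x_grid, data, np.sqrt(2) * h) - f_hat

mid = len(x_grid) // 2              # index of x = 0, the peak of the true density
print(oracle_bias[mid], plugin_bias[mid])
```

At the peak both quantities are negative, reflecting that smoothing flattens the mode. The plug-in version is computed from an already-flattened $\hat{f}$, which is one way the circularity described above shows up.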

