# Instrument Selection Using Shift-Share Weights

This Jupyter Notebook provides a numerical example to illustrate how shift-share instruments can help in selecting the correct instruments among potentially endogenous ones.

Our goal is to recover $\theta$ in the following equation where $x$ and $e$ are correlated:

$$y = \theta x + e$$

We perform IV regression using two approaches:

1. GMM Weighting Using the Standard 2SLS Weighting Matrix
2. Using a Custom Weighting Matrix

The custom weighting approach is designed to mirror shift weights and provide intuition for how using shifts could reduce endogeneity.

## 1. Data Generation Process

We generate datasets with a specified correlation structure.

### Equations:

The model is specified as:

$$y = \theta x + e$$

where $x$ and $e$ are correlated. The parameter `rho_x_e` controls the strength of the dependence between $x$ and $e$.

The instruments $z_1$ and $z_2$ are correlated with $x$ and, to a degree set by the parameters `rho_z1_e` and `rho_z2_e`, with $e$. The larger these parameters are in absolute value, the more endogenous the instruments and the greater the resulting bias.

You can think of $z_1$ and $z_2$ as the initial shares in the context of a shift-share analysis.
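
Written out, the data-generating process implemented in the cell below is as follows. Here $e$, $u_1$, $v_1$, and $v_2$ are independent standard normal draws, and $\rho_{xe}$, $\rho_{z_1 e}$, $\rho_{z_2 e}$ are shorthand introduced here for `rho_x_e`, `rho_z1_e`, and `rho_z2_e`:

$$z_1 = \rho_{z_1 e}\, e + v_1, \qquad z_2 = \rho_{z_2 e}\, e + v_2$$

$$x = z_1 + z_2 + u_1 + \rho_{xe}\, e$$

$$y = \theta x + e$$

Because $e$ has unit variance, $\mathrm{Cov}(z_j, e) = \rho_{z_j e}$. These parameters are therefore covariances (loadings on $e$) rather than correlation coefficients, and they are not restricted to $[-1, 1]$.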

In [1]:
import numpy as np
import statsmodels.api as sm
from scipy.linalg import inv

np.random.seed(0)

def generate_data(n, rho_x_e, rho_z1_e, rho_z2_e, theta):
    # Generate noise terms
    e = np.random.normal(0, 1, n)
    u1 = np.random.normal(0, 1, n)
    v1 = np.random.normal(0, 1, n)
    v2 = np.random.normal(0, 1, n)

    # Generate instruments
    z1 = rho_z1_e * e + v1
    z2 = rho_z2_e * e + v2

    # Generate endogenous regressor
    x = z1 + z2 + u1 + rho_x_e * e

    # Generate dependent variable
    y = theta * x + e

    return y, x, e, z1, z2


## 2. IV Regression with GMM

### Standard vs. Custom Weighting Matrix

Proposition 1 of [Goldsmith-Pinkham et al. (2020)](https://www.aeaweb.org/articles?id=10.1257/aer.20181047) shows that a shift-share instrument built from $z_1$, $z_2$, and a vector of shifts $s$ is equivalent to a GMM estimator that uses the initial shares $z_1$ and $z_2$ as instruments, just as 2SLS does; the shifts $s$ enter only through the GMM weighting matrix. When the instruments are exogenous, the weighting matrix does not affect what the estimator converges to, so shift-share and 2SLS recover the same $\theta$. When the moment conditions are misspecified, the GMM estimate depends on the weighting matrix, and the two can differ.
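
The role of the shifts can be checked numerically. The sketch below is illustrative only: the toy data and the shift values in `s` are hypothetical, and the weighting matrix is assumed to be the outer product $ss'$, which is the choice under which the identity holds exactly as a matter of algebra. It compares just-identified IV using the single shift-share instrument $Zs$ with GMM using both shares as instruments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Toy data: two "shares" (columns of Z) and one endogenous regressor x
Z = rng.normal(size=(n, 2))
e = rng.normal(size=n)
x = Z.sum(axis=1) + 0.5 * e + rng.normal(size=n)
y = 2.0 * x + e

# Hypothetical shifts, one per share
s = np.array([1.5, 0.3])

# Shift-share instrument: shares weighted by shifts, then simple IV
B = Z @ s
theta_shift_share = (B @ y) / (B @ x)

# GMM with the shares as instruments and W = s s' (assumed form)
W = np.outer(s, s)
Zx = Z.T @ x
Zy = Z.T @ y
theta_gmm = (Zx @ W @ Zy) / (Zx @ W @ Zx)

print(theta_shift_share, theta_gmm)
```

The two printed values agree up to floating-point error, since $x'Zss'Z'y / x'Zss'Z'x = s'Z'y / s'Z'x$ whenever $s'Z'x \neq 0$.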

In our example, where the instruments are endogenous, we expect different results between shift-share and 2SLS. We show that the shift-share instrument can reduce bias by comparing two approaches: 
- 2SLS with $z_1$ and $z_2$
- GMM with a custom weighting matrix that weights each instrument according to its level of endogeneity.


### Standard GMM Weighting Matrix

The standard GMM weighting matrix for 2SLS is $(Z'Z)^{-1}$.

#### Equations: 
For a given weighting matrix $W$, the GMM estimator $\hat{\theta}$ is:

$$\hat{\theta} = (X'ZW(Z'X))^{-1}(X'ZW(Z'y))$$
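
To see why the weighting matrix matters once the instruments are endogenous, substitute $y = \theta X + e$ into the estimator. The sketch below assumes that $Z'X/n$ and $Z'e/n$ converge to population moments, written here as $\Sigma_{ZX}$ and $\Sigma_{Ze}$ (notation introduced for this derivation only), and that $W$ is replaced by its probability limit:

$$\hat{\theta} = \theta + (X'ZW(Z'X))^{-1}(X'ZW(Z'e))$$

$$\operatorname{plim}\hat{\theta} - \theta = \frac{\Sigma_{ZX}' W \Sigma_{Ze}}{\Sigma_{ZX}' W \Sigma_{ZX}}$$

With exogenous instruments, $\Sigma_{Ze} = 0$ and the bias vanishes for any positive definite $W$. Otherwise the bias depends on $W$. For a diagonal $W$ with entries $w_j$, it is a weighted average of the single-instrument biases $\Sigma_{z_j e} / \Sigma_{z_j x}$, with weights proportional to $w_j \Sigma_{z_j x}^2$, so shifting weight toward the less biased instrument moves the overall bias toward that instrument's bias.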

### Custom GMM Weighting Matrix
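The custom weighting matrix is diagonal, with entries $1/|\rho|$ for each instrument, where $\rho$ is the parameter (`rho_z1_e` or `rho_z2_e`) that sets that instrument's covariance with $e$. The more endogenous instrument therefore receives less weight.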

#### Equations:
The custom GMM estimator $\hat{\theta}$ uses the same formula, with $W$ set to the custom weighting matrix:

$$\hat{\theta} = (X'ZW(Z'X))^{-1}(X'ZW(Z'y))$$

The custom weighting matrix is meant to mirror the shift-share approach. Due to time constraints, the shifts $s$ do not appear explicitly in the code. Instead, for simplicity, each instrument is weighted directly by the inverse of its known endogeneity parameter, `rho_z1_e` or `rho_z2_e`.

To demonstrate the efficacy of shift-share instruments in a more general setting, we would need to derive the weighting matrix in terms of the shifts $s$ and provide conditions relating it to the correlation between $z_1$ and $e$ and the correlation between $z_2$ and $e$.

In [2]:

def iv_gmm_standard(y, x, z1, z2):
    # GMM with the 2SLS weighting matrix W = (Z'Z / T)^{-1}
    T = len(y)
    Z = np.column_stack([z1, z2])
    W_standard = inv(Z.T @ Z / T)

    # Treat the single regressor as a (T, 1) matrix
    X = np.column_stack([x])
    theta_hat = inv(X.T @ Z @ W_standard @ Z.T @ X) @ X.T @ Z @ W_standard @ Z.T @ y

    return theta_hat

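def iv_gmm_custom(y, x, z1, z2, rho_z1_e, rho_z2_e):
    # GMM with a diagonal weighting matrix that down-weights the more
    # endogenous instrument; both rho parameters must be nonzero
    Z = np.column_stack([z1, z2])

    W_custom = np.diag([1 / abs(rho_z1_e), 1 / abs(rho_z2_e)])

    # Treat the single regressor as a (T, 1) matrix
    X = np.column_stack([x])
    theta_hat = inv(X.T @ Z @ W_custom @ Z.T @ X) @ X.T @ Z @ W_custom @ Z.T @ y

    return theta_hat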
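
As a sanity check on the claim that $(Z'Z)^{-1}$ is the 2SLS weighting matrix, the following self-contained sketch compares the GMM formula with an explicit two-stage calculation. The toy data and variable names are illustrative assumptions and are not part of the notebook's simulation. It also checks that rescaling $W$ (for example, dividing $Z'Z$ by the sample size) leaves the estimate unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Toy data with two instruments and one endogenous regressor
Z = rng.normal(size=(n, 2))
e = rng.normal(size=n)
x = Z @ np.array([1.0, 1.0]) + 0.5 * e + rng.normal(size=n)
y = 2.0 * x + e

# GMM form with W = (Z'Z)^{-1}
Zx, Zy = Z.T @ x, Z.T @ y
W = np.linalg.inv(Z.T @ Z)
theta_gmm = (Zx @ W @ Zy) / (Zx @ W @ Zx)

# Explicit two stages: project x on Z, then regress y on the fitted values
first_stage_coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
x_hat = Z @ first_stage_coef
theta_2sls = (x_hat @ y) / (x_hat @ x_hat)

# Multiplying W by a positive constant cancels in the ratio
theta_scaled = (Zx @ (n * W) @ Zy) / (Zx @ (n * W) @ Zx)

print(theta_gmm, theta_2sls, theta_scaled)
```

All three values coincide up to floating-point error, which is why the division by `T` in `iv_gmm_standard` does not affect the result.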

## 3. Monte Carlo Simulation

We run a Monte Carlo simulation to compare the two estimators. In each replication we draw a new dataset, compute the GMM estimate under both weighting matrices described above, and then report the mean and standard deviation of each estimator across replications.

In the simulation, $z_1$ is nearly exogenous (`rho_z1_e = 0.01`), while $z_2$ is strongly endogenous (`rho_z2_e = 2`). As a result, 2SLS is substantially biased: its mean estimate is about 0.3 above the true value of the parameter ($\theta = 2$). With the custom weighting matrix the bias falls to about 0.05, at the cost of a larger standard deviation across simulations (about 0.07 versus 0.02).

In [3]:
def monte_carlo_sim(n_simulations, n, rho_x_e, rho_z1_e, rho_z2_e, theta):
    results_standard = []
    results_custom = []
    
    for _ in range(n_simulations):
        y, x, e, z1, z2 = generate_data(n, rho_x_e, rho_z1_e, rho_z2_e, theta)
        
        theta_standard = iv_gmm_standard(y, x, z1, z2)
        theta_custom = iv_gmm_custom(y, x, z1, z2, rho_z1_e, rho_z2_e)
        
        results_standard.append(theta_standard)
        results_custom.append(theta_custom)
    
    results_standard = np.array(results_standard)
    results_custom = np.array(results_custom)
    
    print("Standard GMM Weighting Matrix: Mean estimate and Std Dev",
          results_standard.mean(), results_standard.std())
          
    print("Custom GMM Weighting Matrix: Mean estimate and Std Dev",
          results_custom.mean(), results_custom.std())

# Parameters
n_simulations = 1000
n = 100
rho_x_e = 0.5
rho_z1_e = 0.01
rho_z2_e = 2
theta = 2

monte_carlo_sim(n_simulations, n, rho_x_e, rho_z1_e, rho_z2_e, theta)


Standard GMM Weighting Matrix: Mean estimate and Std Dev 2.292344813872509 0.020306030374386077
Custom GMM Weighting Matrix: Mean estimate and Std Dev 2.0543310913499235 0.06752782768508905
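
The simulated means can be compared with the large-sample bias implied by the data-generating process, $\Sigma_{ZX}' W \Sigma_{Ze} / (\Sigma_{ZX}' W \Sigma_{ZX})$, where $\Sigma_{ZX} = E[zx]$ and $\Sigma_{Ze} = E[ze]$. The sketch below is a rough check rather than part of the original analysis: `population_bias` is a hypothetical helper, and the moment formulas inside it assume the unit-variance independent noise terms used in `generate_data`.

```python
import numpy as np

def population_bias(rho_x_e, rho_z1_e, rho_z2_e, W=None):
    """Large-sample bias of the GMM estimator under the notebook's DGP.

    If W is None, the 2SLS weighting matrix (inverse of E[zz']) is used.
    """
    rho = np.array([rho_z1_e, rho_z2_e])
    k = rho_x_e + rho_z1_e + rho_z2_e          # total loading of x on e
    sigma_ze = rho                             # E[z e]
    sigma_zx = rho * k + 1.0                   # E[z x]
    sigma_zz = np.outer(rho, rho) + np.eye(2)  # E[z z']
    if W is None:
        W = np.linalg.inv(sigma_zz)
    return (sigma_zx @ W @ sigma_ze) / (sigma_zx @ W @ sigma_zx)

# Same parameter values as the simulation above
bias_2sls = population_bias(0.5, 0.01, 2.0)
bias_custom = population_bias(0.5, 0.01, 2.0, W=np.diag([1 / 0.01, 1 / 2.0]))

print(bias_2sls, bias_custom)
```

For these parameter values the implied biases are roughly 0.29 for 2SLS and 0.06 for the custom weighting matrix, in line with the simulated means.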
