# Instrument Selection Using Shift-Share Weights

This Jupyter Notebook compares two IV regression approaches:
1. Overidentified 2SLS using the initial shares as instruments, without any special weighting scheme.
2. Just-identified 2SLS using the shift-share (Bartik) instrument, which uses the initial shares and the growth rates.

Our goal is to recover $\theta$ in the following equation, where $x_i$ and $e_i$ are correlated ($i$ indexes individuals):

$$y_i = \theta x_i + e_i$$

This notebook shows that the shift-share method can help in selecting the correct instruments and reduce endogeneity bias more effectively than standard 2SLS using the initial shares alone. In this example, the instruments are not completely exogenous. By construction, the growth rates decrease with the degree of instrument endogeneity, so the shift-share instrument weights better (more exogenous) shares more and worse shares less, which lowers the bias.

## 1. Data Generation Process

We generate datasets with a specified correlation structure. The model is specified as:

$$y_i = \theta x_i + e_i$$

where $x_i$ is correlated with $e_i$. The parameter $\rho_{xe}$ controls the level of correlation between $x_i$ and $e_i$.

The instruments $z_{ik}^0$ are initial shares across $K = 3$ categories, indexed by $k = 1, 2, 3$.

Instruments $z_{ik}^t$ determine $x_i$ but have a small correlation with $e_i$. The level of endogeneity of the instruments is controlled by the parameters $\rho_{ze}$. The higher these parameters, the more endogenous the instruments are, resulting in greater bias.

$$
z_{ik}^0 = 0.3 + N(0, .3) + \rho_{ze,k} \cdot e_i
$$

for $k=1,2$. We ensure the shares are non-negative ($z_{ik}^0 \ge 0$) by clipping negative values to zero, and renormalize each row whenever the shares sum to more than 1.

For $k=3$:

$$z_{i3}^0 = 1 - z_{i1}^0 - z_{i2}^0$$

There are three share variables, but only two vary freely: the third is fully determined by the other two through the adding-up constraint.
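
The share construction can be sketched on a few hand-picked draws. The values below are hypothetical and chosen only so that the clipping and renormalization steps are visible. They are not taken from the simulation.

```python
import numpy as np

# Hypothetical draws for shares 1 and 2 (three observations).
draws = np.array([[0.5, 0.25],
                  [0.75, 0.5],
                  [-0.25, 0.5]])

# Third share is the remainder implied by the adding-up constraint.
Z = np.column_stack([draws, 1 - draws.sum(axis=1)])

# Clip negative shares to zero, then rescale each row to sum to 1.
Z = np.clip(Z, 0, None)
Z = Z / Z.sum(axis=1, keepdims=True)
print(Z)
```

The first row already satisfies the constraints and is left unchanged. The other two rows each have one negative entry that is clipped to zero before the row is rescaled.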

For generating the shares in the next period ($k=1,2$):

$$
z_{ik}^1 = 0.3 + z_{ik}^0 - \rho_{ze,k} + N(0, 0.3)
$$

The growth rates decrease with the degree of instrument endogeneity $\rho_{ze}$: the next-period draw for share $k$ is shifted down by $\rho_{ze,k}$. As a result, the growth rates weight better instruments more and worse instruments less when estimating results.

The growth rate of share $k$ for observation $i$ is $g_{ik}$, the ratio of the next-period share to the initial share. With $I$ observations, the aggregate growth rate $g_k$ is the average of $g_{ik}$ across observations:

$$
g_{ik} = \frac{z_{ik}^1}{z_{ik}^0}
$$

$$
g_k = \frac{1}{I} \sum_{i} g_{ik}
$$
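
A small worked example of the growth-rate calculation, assuming the per-share growth rate is averaged across observations as in the estimation functions later in this notebook. The share values are hypothetical, and the `1e-5` floor mirrors the one used in the data generation code.

```python
import numpy as np

# Hypothetical shares for 2 observations and 3 categories.
Z0 = np.array([[0.5, 0.25, 0.25],
               [0.25, 0.5, 0.25]])
Z1 = np.array([[0.5, 0.125, 0.375],
               [0.375, 0.25, 0.375]])

# Observation-level growth ratios g_ik (floor avoids division by zero).
g_ik = Z1 / np.maximum(Z0, 1e-5)

# Aggregate growth rate g_k: average over observations.
g_k = g_ik.mean(axis=0)
print(g_ik)
print(g_k)
```

In this toy example the second share halves for both observations, so it receives the smallest weight $g_k$.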

In [1]:
```python
import numpy as np
from numpy.linalg import inv


def generate_data(n, rho_z_e, rho_x_e, theta):
    k = 3  # number of shares

    e = np.random.normal(0, .4, n)  # structural error
    u = np.random.normal(0, .4, n)  # first-stage noise

    # Period-0 draws for shares 1 and 2; rho_z_e * e makes them correlated with e
    draw_0 = .3 + np.random.normal(0, .3, (n, k-1)) + rho_z_e * np.tile(e[:, np.newaxis], (1, k-1))

    # Third share is the remainder; clip negative shares to zero
    Z0 = np.empty((n, k))
    Z0[:, 0:2] = draw_0
    Z0[:, 2] = 1 - draw_0.sum(axis=1)
    Z0 = np.clip(Z0, 0, None)

    # Normalize rows to sum to 1
    Z0 = Z0 / Z0.sum(axis=1, keepdims=True)

    # Period-1 draws: shares with larger rho_z_e are shifted down more
    draw_1 = .3 + draw_0 - np.array(rho_z_e) + np.random.normal(0, .3, (n, k-1))

    # Third share is the remainder; clip negative shares to zero
    Z1 = np.empty((n, k))
    Z1[:, 0:2] = draw_1
    Z1[:, 2] = 1 - draw_1.sum(axis=1)
    Z1 = np.clip(Z1, 0, None)

    # Normalize rows to sum to 1
    Z1 = Z1 / Z1.sum(axis=1, keepdims=True)

    # Growth rates G as the ratio Z1 / Z0 (Z0 floored at 1e-5 to avoid division by zero)
    G = Z1 / np.maximum(Z0, 1e-5)

    # Generate X from the period-1 shares; rho_x_e * e makes x endogenous
    x = Z1[:, :k-1] @ np.array([.1, .3]) + rho_x_e * e + u

    # Generate Y
    y = theta * x + e

    return y, x, e, Z0, G
```



## 2. IV Regression with GMM

### Standard vs. Shift-Share 2SLS

This example compares two approaches:
- Over-identified 2SLS that uses the initial shares $z_{ik}^0$ directly as instruments.
- Just-identified 2SLS with the shift-share instrument $B_i$, which combines the initial shares $z_{ik}^0$ with the growth rates $g_k$ as weights:

$$
B_i = \sum_k g_k z_{ik}^0
$$
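
As a minimal illustration of the formula for $B_i$, the sketch below builds the shift-share instrument from hypothetical toy values. The share matrix and growth rates are made up for the example.

```python
import numpy as np

# Hypothetical initial shares for 2 observations and 3 categories (rows sum to 1).
Z0 = np.array([[0.5, 0.25, 0.25],
               [0.25, 0.5, 0.25]])

# Hypothetical growth rates g_k, one per category.
g = np.array([1.25, 0.5, 1.5])

# Shift-share instrument: B_i = sum_k g_k * z_ik^0
B = Z0 @ g
print(B)
```

The second observation holds more of the low-growth second share, so its instrument value is smaller than that of the first observation.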

Both approaches can be estimated using GMM:

$$
\hat{\theta} = \left(X'ZW(Z'X)\right)^{-1}\left(X'ZW(Z'y)\right)
$$

They just use different weighting matrices $W$. The shifts $g_k$ determine the GMM weighting matrix with the shift-share instrument.

In our example, where the instruments are endogenous, we expect just-identified 2SLS with the shift-share instrument and over-identified 2SLS using the initial shares to give different results. We show that the shift-share instrument can reduce bias by comparing the two approaches to estimating $\theta$. Since $g_k$ decreases with $\rho_{ze,k}$, the shift-share instrument down-weights the more endogenous shares and prioritizes more exogenous variation, thus having less bias than 2SLS.

### 2SLS Over-Identified Weighting Matrix

The GMM estimator $\hat{\theta}$ is given by:

$$
\hat{\theta} = \left(X'ZW(Z'X)\right)^{-1}\left(X'ZW(Z'y)\right)
$$

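The standard GMM weighting matrix for 2SLS is $(Z'Z)^{-1}$. In this notebook, the over-identified estimator is computed using `iv_gmm` with the flag `tsls=True`, which uses the identity matrix as the weighting matrix rather than $(Z'Z)^{-1}$. The two coincide only when $Z'Z$ is proportional to the identity, so the estimates labeled TSLS below are equal-weighted GMM estimates rather than textbook 2SLS.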
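
For reference, the weighting matrix $(Z'Z)^{-1}$ can be checked against the textbook two-stage procedure on synthetic data. The sketch below is illustrative only: the data-generating values and the helper `gmm_theta` are assumptions made for this example and are not part of the notebook's own functions.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
n = 500

# Hypothetical instruments with unequal scales, so Z'Z is far from the identity.
Z = rng.normal(size=(n, 3))
Z[:, 1] *= 3.0
e = rng.normal(size=n)
x = Z @ np.array([0.5, 0.3, 0.2]) + 0.5 * e + rng.normal(size=n)
y = 1.0 * x + e

def gmm_theta(y, x, Z, W):
    """Linear GMM estimate for a single regressor with weighting matrix W."""
    zx = Z.T @ x
    zy = Z.T @ y
    return (zx @ W @ zy) / (zx @ W @ zx)

# GMM with the 2SLS weighting matrix (Z'Z)^{-1}.
theta_2sls_gmm = gmm_theta(y, x, Z, inv(Z.T @ Z))

# Two-stage version: project x on Z, then use the fitted values as the instrument.
x_hat = Z @ inv(Z.T @ Z) @ (Z.T @ x)
theta_2sls_two_stage = (x_hat @ y) / (x_hat @ x)

# Identity weighting, for comparison.
theta_identity = gmm_theta(y, x, Z, np.eye(3))

print(theta_2sls_gmm, theta_2sls_two_stage, theta_identity)
```

The first two numbers agree up to floating-point error, while the identity-weighted estimate is in general a different number.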

### Shift-Share GMM Weighting Matrix

Proposition 1 from [Goldsmith-Pinkham et al. (2020)](https://www.aeaweb.org/articles?id=10.1257/aer.20181047) shows that two-stage least squares with the shift-share instrument, using the initial shares $z_{ik}^0$ and a vector of average growth rates $g_k$,

$$
B_i = \sum_k g_k z_{ik}^0
$$

is equivalent to the GMM estimator $\hat{\theta}$ using the custom weighting matrix $GG'$, where $G$ is a $3 \times 1$ vector of the weights $g_k$. In our example, the growth rates decrease with the degree of instrument endogeneity. As a result, the shift-share approach reduces bias by putting less weight on the more endogenous shares.
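
A short derivation makes the equivalence explicit for the single-regressor case used here. It assumes $X$ is $N \times 1$, $Z$ is $N \times 3$, and $B'X \neq 0$. Stacking the instrument as $B = ZG$, both quadratic forms in the GMM formula factor into scalars:

$$
X'Z\,GG'\,Z'X = (X'B)(B'X) = (B'X)^2, \qquad X'Z\,GG'\,Z'y = (X'B)(B'y)
$$

$$
\hat{\theta} = \frac{(X'B)(B'y)}{(B'X)^2} = \frac{B'y}{B'X}
$$

The last expression is the just-identified IV estimator with $B$ as the single instrument. Although $GG'$ has rank one, the estimator is still well defined because only the scalar $B'X$ needs to be inverted.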

The shift-share estimator is computed in two ways that are numerically identical: using `iv_gmm` with `tsls=False` (uses shift-share matrix in GMM) or the `bartik_iv` function for just-identified 2SLS.

In [2]:
```python
def bartik_iv(y, x, Z, G):
    avg_G = G.mean(axis=0)  # (K,): average growth rate of each share
    Z_bartik = Z @ avg_G   # (N,): initial shares weighted by average growth rates
    numerator = Z_bartik.T @ y
    denominator = Z_bartik.T @ x

    b_bartik = numerator / denominator
    return b_bartik

def iv_gmm(y, x, Z, G, tsls=False):
    avg_G = G.mean(axis=0)  # (K,)
    avg_G = avg_G.reshape(1, -1)  # Ensure avg_G is row vector (1, K)
    W = avg_G.T @ avg_G  # (K x K) shift-share weighting matrix

    if tsls:
        # Equal weights on all instruments (identity matrix)
        W = np.identity(W.shape[0])

    X = np.column_stack([x])  # (N x 1)
    ZX = Z.T @ X
    ZY = Z.T @ y

    theta_hat = inv(ZX.T @ W @ ZX) @ (ZX.T @ W @ ZY)

    return theta_hat
```


## 3. Monte Carlo Simulation

We run a Monte Carlo simulation to evaluate the performance of each estimator. The provided code implements GMM-based IV estimation under the two weighting matrix schemes described above and evaluates their performance. 
- The first approach is shift-share IV, computed both as just-identified 2SLS (**Bartik IV Estimates**) and as GMM with the shift-share weighting matrix (**GMM Estimates**). The two computations are numerically equivalent and give the same results up to floating-point rounding.
- The second approach is 2SLS without a special weighting matrix (**TSLS Estimates**).

In the simulation, I set $\theta=1$. The first share $z_1$ is nearly exogenous while the second share $z_2$ is relatively endogenous (with $\rho_{ze} = [0.01, 0.4]$). As a result, the bias in the standard 2SLS estimate ($\hat{\theta} \approx 1.45$, a bias of about $+0.45$) is much larger than the bias of about $-0.08$ in the Bartik IV and GMM estimates ($\hat{\theta} \approx 0.92$). The shift-share estimates are more variable across simulations, however, with a standard deviation of about $0.16$ versus $0.06$. This highlights how the shift-share instrument can better mitigate endogeneity bias compared to traditional instruments. Adjusting for the growth rates $g_k$ effectively improves the identification of $\theta$.

In [3]:
```python
# Set parameters
n = 1000  # number of observations
theta = 1.0
rho_z_e = [0.01, .4]  # loading of e on shares 1 and 2 (share 2 is more endogenous)
rho_x_e = 1  # loading of e on x (endogeneity of x)
num_simulations = 100  # number of simulations

# Containers for results
bartik_estimates = []
gmm_estimates = []
tsls_estimates = []

# Run simulations
for _ in range(num_simulations):
    y, x, e, Z, G = generate_data(n, rho_z_e, rho_x_e, theta)
    bartik_estimates.append(bartik_iv(y, x, Z, G))
    gmm_estimates.append(iv_gmm(y, x, Z, G)[0])
    tsls_estimates.append(iv_gmm(y, x, Z, G, tsls=True)[0])


# Convert to numpy arrays for easy computation of mean and standard deviation
bartik_estimates = np.array(bartik_estimates)
gmm_estimates = np.array(gmm_estimates)
tsls_estimates = np.array(tsls_estimates)

# Calculate means and standard deviations
mean_bartik = bartik_estimates.mean()
std_bartik = bartik_estimates.std()
mean_gmm = gmm_estimates.mean()
std_gmm = gmm_estimates.std()
mean_tsls = tsls_estimates.mean()
std_tsls = tsls_estimates.std()

# Report outcomes and standard deviations
print(f"Bartik IV Estimates: Mean = {mean_bartik}, Std Dev = {std_bartik}")
print(f"GMM Estimates: Mean = {mean_gmm}, Std Dev = {std_gmm}")
print(f"TSLS Estimates: Mean = {mean_tsls}, Std Dev = {std_tsls}")
```

```
Bartik IV Estimates: Mean = 0.920989274325311, Std Dev = 0.16054999842876316
GMM Estimates: Mean = 0.920989274325311, Std Dev = 0.16054999842876314
TSLS Estimates: Mean = 1.4487093879198176, Std Dev = 0.058198602261687916
```
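
To put bias and dispersion on a common scale, the reported means and standard deviations can be combined into a root mean squared error around the true $\theta = 1$. This is a small arithmetic sketch using the numbers printed above. The `reported` dictionary is a hypothetical container introduced for this example.

```python
import math

theta = 1.0

# (mean, std dev) copied from the simulation output above.
reported = {
    "Bartik IV": (0.920989274325311, 0.16054999842876316),
    "TSLS": (1.4487093879198176, 0.058198602261687916),
}

summary = {}
for name, (mean, sd) in reported.items():
    bias = mean - theta
    # RMSE around the true theta: sqrt(bias^2 + variance)
    rmse = math.sqrt(bias**2 + sd**2)
    summary[name] = (bias, rmse)
    print(f"{name}: bias = {bias:+.3f}, RMSE = {rmse:.3f}")
```

Even after accounting for its larger standard deviation, the shift-share estimator has an RMSE of roughly 0.18 in this run, compared with roughly 0.45 for the equal-weighted estimator.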
