# Detecting Identification Failure in Moment Condition Models

This tutorial shows how to check identification for the method of simulated moments (MSM) in estimagic. MSM estimation requires at least as many moments as parameters. If there are fewer moments than parameters, the model is said to be underidentified. Beyond this counting rule, moments that are not orthogonal to each other may also lead to identification failure.
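
The counting rule can be checked before any estimation is attempted. The following is a minimal sketch; the helper `check_order_condition` is a hypothetical name introduced for illustration and is not part of estimagic.

```python
def check_order_condition(n_moments, n_params):
    """Return True if there are at least as many moments as parameters."""
    return n_moments >= n_params


# Eight moments and four parameters satisfy the counting rule ...
enough = check_order_condition(n_moments=8, n_params=4)
# ... while three moments and four parameters do not.
too_few = check_order_condition(n_moments=3, n_params=4)
print(enough, too_few)
```

Passing this check is necessary but not sufficient: a model can have enough moments and still fail to be identified.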

The example is a simple linear regression model in which the two regressors are correlated, a setting in which an identification problem arises.

Throughout the tutorial, we follow the testing procedure described in Forneron, J. J. (2019).

## Outline of the testing procedure
1. Calculate the quasi-Jacobian matrix
2. Select the identification category
3. Perform subvector inference
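
To build intuition for why a Jacobian-type matrix is informative about identification, the sketch below computes an ordinary finite-difference Jacobian of a toy moment function and inspects its rank. This is a simple local rank check, not the quasi-Jacobian of the procedure above; `toy_moments` and `numerical_jacobian` are hypothetical helpers written only for this illustration.

```python
import numpy as np


def toy_moments(theta):
    """Toy moment function in which a and c enter only through their sum."""
    a, b, c, s = theta
    level = a + c
    return np.array([level, b, level**2 + b**2 + s**2, 0.5 * b])


def numerical_jacobian(func, theta, step=1e-6):
    """Central finite-difference Jacobian of func at theta."""
    theta = np.asarray(theta, dtype=float)
    n_moments = len(func(theta))
    jac = np.zeros((n_moments, len(theta)))
    for j in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[j] = step
        jac[:, j] = (func(theta + shift) - func(theta - shift)) / (2 * step)
    return jac


jac = numerical_jacobian(toy_moments, [2.0, -1.0, -1.0, 1.0])
singular_values = np.linalg.svd(jac, compute_uv=False)
# A tolerance well above finite-difference noise decides which values count as zero.
rank = np.linalg.matrix_rank(jac, tol=1e-6)
print(singular_values)
print(f"rank {rank} with {jac.shape[1]} parameters")
```

A rank below the number of parameters means that some direction in the parameter space leaves all moments unchanged, so the moments cannot pin down the parameters along that direction.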


## Example: Estimate the parameters of a regression model

The model we consider here is a simple regression model with two explanatory variables (plus a constant). The goal is to estimate the slope coefficients and the error variance from a simulated data set.


### Model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \text{ where } \epsilon \sim N(0, \sigma^2)$$

We aim to estimate $\beta_0$, $\beta_1$, $\beta_2$ and $\sigma$. In the code, these parameters are called `intercept`, `slope1`, `slope2` and `sd`.

In [1]:
import numpy as np
import pandas as pd

import estimagic as em

rng = np.random.default_rng(seed=0)



## 1. Simulate data

In [2]:
def simulate_data(params, n_draws, rng, correlation=0.7):
    """Simulate y, x1 and x2, where x1 and x2 are correlated standard normals."""
    mu = np.array([0.0, 0.0])
    var_cov = np.array([[1, correlation], [correlation, 1]])
    x = rng.multivariate_normal(mu, var_cov, size=n_draws)
    x1 = x[:, 0]
    x2 = x[:, 1]
    e = rng.normal(0, params.loc["sd", "value"], size=n_draws)
    # Note: slope2 is added as a constant (it does not multiply x2), so y
    # depends on intercept and slope2 only through their sum.
    y = (
        params.loc["intercept", "value"]
        + params.loc["slope1", "value"] * x1
        + params.loc["slope2", "value"]
        + e
    )
    return pd.DataFrame({"y": y, "x1": x1, "x2": x2})
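
As a quick sanity check on the way the regressors are drawn, the following sketch repeats the multivariate normal draw on its own and compares the sample correlation with the target of 0.7. The sample size of 10,000 is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
correlation = 0.7
var_cov = np.array([[1.0, correlation], [correlation, 1.0]])

# Draw both regressors jointly so that they share the requested correlation.
x = rng.multivariate_normal(np.zeros(2), var_cov, size=10_000)

sample_corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
print(f"sample correlation: {sample_corr:.3f}")
```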

In [3]:
true_params = pd.DataFrame(
    data=[[2, -np.inf], [-1, -np.inf], [-1, -np.inf], [1, 1e-10]],
    columns=["value", "lower_bound"],
    index=["intercept", "slope1", "slope2", "sd"],
)

data = simulate_data(true_params, n_draws=100, rng=rng)

In [8]:
data

| | y | x1 | x2 |
|---|---|---|---|
| 0 | 0.401218 | -0.064754 | -0.167082 |
| 1 | 1.017650 | -0.631068 | -0.549813 |
| 2 | -0.958967 | 0.353818 | 0.633908 |
| 3 | 3.298381 | -1.569032 | -0.835426 |
| 4 | 0.667232 | 1.138907 | 0.158716 |
| ... | ... | ... | ... |
| 95 | -0.736617 | 2.299014 | 1.286033 |
| 96 | 1.690777 | -0.982410 | -1.021607 |
| 97 | 1.676692 | -0.375400 | 0.897457 |
| 98 | -1.670329 | 1.409368 | 0.955720 |


## 2. Calculate Moments

In [4]:
def calculate_moments(sample):
    """Compute first and second sample moments of y, x1 and x2."""
    moments = {
        "y_mean": sample["y"].mean(),
        "x1_mean": sample["x1"].mean(),
        "x2_mean": sample["x2"].mean(),
        "yx1_mean": (sample["y"] * sample["x1"]).mean(),
        "yx2_mean": (sample["y"] * sample["x2"]).mean(),
        "y_sqrd_mean": (sample["y"] ** 2).mean(),
        "x1_sqrd_mean": (sample["x1"] ** 2).mean(),
        # Note: computed from x1, so this moment is identical to x1_sqrd_mean.
        "x2_sqrd_mean": (sample["x1"] ** 2).mean(),
    }
    return pd.Series(moments)

In [5]:
empirical_moments = calculate_moments(data)
empirical_moments

y_mean          0.835995
x1_mean         0.026028
x2_mean         0.104510
yx1_mean       -0.983584
yx2_mean       -0.449597
y_sqrd_mean     2.930317
x1_sqrd_mean    1.024421
x2_sqrd_mean    1.024421
dtype: float64

The covariance matrix of the empirical moments is computed further below with `get_moments_cov`, which mainly calls estimagic's bootstrap function. See our [bootstrap_tutorial](../../how_to_guides/inference/how_to_do_bootstrap_inference.ipynb) for background information.
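
To make the idea concrete, here is a minimal sketch of a row-wise bootstrap of the moment covariance in plain numpy and pandas. It illustrates the general technique under simple assumptions and is not estimagic's implementation; `bootstrap_moments_cov`, `toy_calc_moments` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def bootstrap_moments_cov(data, calc_moments, n_draws=500, seed=0):
    """Covariance of the moments across row-wise bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n_obs = len(data)
    draws = []
    for _ in range(n_draws):
        # Sample row positions with replacement and recompute the moments.
        rows = rng.integers(0, n_obs, size=n_obs)
        draws.append(calc_moments(data.iloc[rows]))
    return pd.DataFrame(draws).cov()


def toy_calc_moments(sample):
    """Two simple moments for the toy data."""
    return pd.Series({"y_mean": sample["y"].mean(), "x_mean": sample["x"].mean()})


# Small illustrative data set (not the tutorial's data).
toy_rng = np.random.default_rng(1)
toy = pd.DataFrame({"y": toy_rng.normal(size=200), "x": toy_rng.normal(size=200)})

toy_cov = bootstrap_moments_cov(toy, toy_calc_moments)
print(toy_cov)
```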



## 3. Define a function to calculate simulated moments

In a real-world application, this is the step that would take most of the time. In our very simple example, however, numpy already does all the work.

In [6]:
def simulate_moments(params, n_draws=10_000, seed=0):
    rng = np.random.default_rng(seed)
    sim_data = simulate_data(params, n_draws, rng)
    sim_moments = calculate_moments(sim_data)
    return sim_moments
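
`simulate_moments` re-creates its random number generator from a fixed seed on every call. The sketch below illustrates the effect with a hypothetical one-parameter simulator, `simulate_mean`: the same parameter always yields the same simulated moment, and a change in the parameter is not blurred by fresh simulation noise.

```python
import numpy as np


def simulate_mean(mu, n_draws=1_000, seed=0):
    """Simulated mean of normal draws, using the same seed on every call."""
    rng = np.random.default_rng(seed)
    return rng.normal(mu, 1.0, size=n_draws).mean()


# Identical parameters give identical simulated moments.
same = simulate_mean(0.5) == simulate_mean(0.5)

# Shifting the parameter shifts the moment by the same amount, because the
# underlying random draws are reused.
shifted = simulate_mean(0.6) - simulate_mean(0.5)
print(same, shifted)
```

A deterministic simulated criterion is much easier for a numerical optimizer to handle than one that changes between evaluations at the same parameter value.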

In [7]:
simulate_moments(true_params)

y_mean          1.009276
x1_mean        -0.006568
x2_mean        -0.003578
yx1_mean       -0.977183
yx2_mean       -0.683988
y_sqrd_mean     2.976694
x1_sqrd_mean    0.981403
x2_sqrd_mean    0.981403
dtype: float64

In [10]:
moments_cov = em.get_moments_cov(
    data, calculate_moments, bootstrap_kwargs={"n_draws": 5_000, "seed": 0}
)

moments_cov

| | y_mean | x1_mean | x2_mean | yx1_mean | yx2_mean | y_sqrd_mean | x1_sqrd_mean | x2_sqrd_mean |
|---|---|---|---|---|---|---|---|---|
| y_mean | 0.021509 | -0.009426 | -0.004625 | -0.011972 | 0.000164 | 0.056065 | -0.00034 | -0.00034 |
| x1_mean | -0.009426 | 0.009893 | 0.006379 | 0.007864 | 0.002203 | -0.019811 | 0.000483 | 0.000483 |
| x2_mean | -0.004625 | 0.006379 | 0.00853 | 0.0033 | 0.003402 | -0.005371 | 0.000924 | 0.000924 |
| yx1_mean | -0.011972 | 0.007864 | 0.0033 | 0.035066 | 0.017265 | -0.071828 | -0.015913 | -0.015913 |
| yx2_mean | 0.000164 | 0.002203 | 0.003402 | 0.017265 | 0.018112 | -0.021093 | -0.010521 | -0.010521 |
| y_sqrd_mean | 0.056065 | -0.019811 | -0.005371 | -0.071828 | -0.021093 | 0.259691 | 0.016572 | 0.016572 |
| x1_sqrd_mean | -0.00034 | 0.000483 | 0.000924 | -0.015913 | -0.010521 | 0.016572 | 0.018723 | 0.018723 |
| x2_sqrd_mean | -0.00034 | 0.000483 | 0.000924 | -0.015913 | -0.010521 | 0.016572 | 0.018723 | 0.018723 |


In [13]:
start_params = true_params.assign(value=[100, 100, 100, 100])

res = em.estimate_msm(
    simulate_moments,
    empirical_moments,
    moments_cov,
    start_params,
    optimize_options={"algorithm": "scipy_lbfgsb"},
)

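
For orientation, the following sketch shows the kind of criterion an MSM estimator minimizes: a weighted squared distance between simulated and empirical moments. The diagonal weighting is an assumption made for this sketch and not a statement about what `estimate_msm` does internally; `msm_criterion` and the toy numbers are hypothetical.

```python
import numpy as np


def msm_criterion(simulated, empirical, moments_cov):
    """Weighted squared distance between simulated and empirical moments.

    The weights are the inverse variances on the diagonal of moments_cov.
    """
    deviation = np.asarray(simulated, dtype=float) - np.asarray(empirical, dtype=float)
    weights = 1.0 / np.diag(moments_cov)
    return float(deviation @ (weights * deviation))


toy_empirical = np.array([1.0, 2.0])
toy_cov = np.array([[0.04, 0.0], [0.0, 0.25]])

perfect_fit = msm_criterion([1.0, 2.0], toy_empirical, toy_cov)
imperfect_fit = msm_criterion([1.2, 2.5], toy_empirical, toy_cov)
print(perfect_fit, imperfect_fit)
```

Diagonal weights only need the variances of the moments, so they remain usable when the full covariance matrix cannot be inverted, as is the case when two of its rows are identical.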


In [21]:
res['summary']

| | value | standard_error | p_value | ci_lower | ci_upper | stars |
|---|---|---|---|---|---|---|
| intercept | 0.413411 | | | | | |
| slope1 | -0.841402 | 0.1790438 | 2.556831e-06 | -1.192322 | -0.4904827 | *** |
| slope2 | 0.413411 | 8092283.0 | 0.98 | -15860580.0 | 15860580.0 | |
| sd | 1.251192 | 0.08438251 | 9.516526999999999e-50 | 1.085806 | 1.416579 | *** |
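
The summary table already hints at the problem: the standard error of `intercept` is missing and that of `slope2` is enormous. A rough screening rule can be sketched as follows. The values are copied from the table above, while the threshold of ten is a hypothetical rule of thumb and not part of the testing procedure.

```python
import numpy as np
import pandas as pd

# Values and standard errors copied from the summary table above.
summary = pd.DataFrame(
    {
        "value": [0.413411, -0.841402, 0.413411, 1.251192],
        "standard_error": [np.nan, 0.1790438, 8092283.0, 0.08438251],
    },
    index=["intercept", "slope1", "slope2", "sd"],
)

# Flag a parameter when its standard error is missing or more than ten times
# the absolute value of the estimate (hypothetical threshold).
ratio = summary["standard_error"] / summary["value"].abs()
suspicious = summary["standard_error"].isna() | (ratio > 10)
flagged = summary.index[suspicious].tolist()
print(flagged)
```

Such a screen only points to parameters that deserve a closer look; it does not replace a formal identification check.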


## 4. Identification Check

For more background on the identification check and its interpretation, see Forneron, J. J. (2019).



In [None]:
check_msm_identification(
    simulate_moments=simulate_moments,
    simulate_moments_kwargs={"data": data},
    params=res["summary"]["value"],
    draws=10000,
    sampling="sobol",
)
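
The call above requests Sobol sampling. How `check_msm_identification` uses the `draws` and `sampling` arguments is not shown in this tutorial, so the sketch below only illustrates what Sobol draws over a box of parameter values look like, using `scipy.stats.qmc`. The bounds are hypothetical.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical box around a four-dimensional parameter vector.
lower = np.array([-1.0, -2.0, -1.0, 0.5])
upper = np.array([2.0, 0.0, 2.0, 2.0])

sampler = qmc.Sobol(d=len(lower), scramble=True, seed=0)
# Sobol sequences are best balanced when the sample size is a power of two.
unit_draws = sampler.random_base2(m=10)  # 2**10 = 1024 points in the unit cube
draws = qmc.scale(unit_draws, lower, upper)

print(draws.shape)
```

Compared with independent uniform draws, a Sobol sequence covers the box more evenly, which helps when the moments must be evaluated at many parameter values.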