# Detecting Identification Failure in the Moment Condition Models.

This tutorial shows you how to do an identification check for MSM in estimagic. In order to get the estimates by MSM, you must have at least as many moments as parameters to estimate. If you have fewer moments than parameters to be estimated, the model is said to be underidentified.  Besides that, when not all moments are orthogonal it may also lead to identification failure.

In the tutorial here, we will use a simple linear regression model where two of the regressors are correlated. Thus, the identification problem is encountered.

Throughout the tutorial, we perform the testing procedure described in Forneron, J. J. (2019). 

## Outline of the testing procedure
1. Calculate quasi-Jacobian matrix
2. Identification category  selection
3. Subvector inference


## Example: Estimate the parameters of a regression model

The model we consider here is a simple regression model with two explanatory variables (plus a constant). The goal is to estimate the slope coefficients and the error variance from a simulated data set.


### Model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \text{ where } \epsilon \sim N(0, \sigma^2)$$

We aim to estimate $\beta_0, \beta_1, \beta_2,\sigma^2$.

In [2]:
import numpy as np
import pandas as pd

import estimagic as em

rng = np.random.default_rng(seed=0)



## 1. Simulate data

In [24]:
def simulate_data(params, n_draws, rng,correlation=0.7):

    mu = np.array([0.0, 0.0])
    var_cov = np.array([
            [  1, correlation],
            [ correlation,  1],
        ])
    x = rng.multivariate_normal(mu, var_cov, size=n_draws)
    x1 = x[:,0]
    x2 = x[:,1]
    e = rng.normal(0, params.loc["sd", "value"], size=n_draws)
    y = params.loc["intercept", "value"] + params.loc["slope1", "value"] * x1 + params.loc["slope2", "value"] + e
    return pd.DataFrame({"y": y, "x1": x1, "x2": x2})

In [25]:
true_params = pd.DataFrame(
    data=[[2, -np.inf], [-1, -np.inf], [-1, -np.inf], [1, 1e-10]],
    columns=["value", "lower_bound"],
    index=["intercept", "slope1", "slope2", "sd"],
)

data = simulate_data(true_params, n_draws=100, rng=rng)

## 2. Calculate Moments

In [27]:
def calculate_moments(sample):
    moments = {
        "y_mean": sample["y"].mean(),
        "x1_mean": sample["x1"].mean(),
        "x2_mean": sample["x2"].mean(),
        "yx1_mean": (sample["y"] * sample["x1"]).mean(),
        "yx2_mean": (sample["y"] * sample["x2"]).mean(),
        "y_sqrd_mean": (sample["y"] ** 2).mean(),
        "x1_sqrd_mean": (sample["x1"] ** 2).mean(),
        "x2_sqrd_mean": (sample["x1"] ** 2).mean(),
    }
    return pd.Series(moments)

In [28]:
empirical_moments = calculate_moments(data)
empirical_moments

y_mean          0.637690
x1_mean         0.194696
x2_mean         0.103593
yx1_mean       -0.606911
yx2_mean       -0.401584
y_sqrd_mean     2.158836
x1_sqrd_mean    0.839349
x2_sqrd_mean    0.839349
dtype: float64

``get_moments_cov`` mainly just calls estimagic's bootstrap function. See our [bootstrap_tutorial](../../how_to_guides/inference/how_to_do_bootstrap_inference.ipynb) for background information. 



## 3. Define a function to calculate simulated moments

In a real world application, this is the step that would take most of the time. However, in our very simple example, all the work is already done by numpy.

In [30]:
def simulate_moments(params, n_draws=10_000, seed=0):
    rng = np.random.default_rng(seed)
    sim_data = simulate_data(params, n_draws, rng)
    sim_moments = calculate_moments(sim_data)
    return sim_moments

In [31]:
simulate_moments(true_params)

y_mean          1.009276
x1_mean        -0.006568
x2_mean        -0.003578
yx1_mean       -0.977183
yx2_mean       -0.683988
y_sqrd_mean     2.976694
x1_sqrd_mean    0.981403
x2_sqrd_mean    0.981403
dtype: float64

## 4. Identification Check

For more background on the sensitivity measures and their interpretation, check out Forneron, J. J. (2019). 



In [None]:
check_ml_identification()