# Identification Task 

## What we need

We need a function called check_msm_identification that it easy to use and performs the identification check described in [this paper](https://arxiv.org/pdf/1907.13093.pdf). The different variants (e.g. different methods of sampling uniformly from likelihood level sets) can be selected via the optional arguments of check_ml_identification.  The output will either be a dictionary (if it is a small set of outputs that every user will want) or a results object similar to the result of estimate_ml (if there are many different test statistics).

## Task 1: Planning

- Write down which model specific inputs a user has to supply in order to do an identification check. The names should be aligned with estimate_ml where possible. It will definitely be a likelihood function and a result of estimate_msm but there might be more. 
- Write down which kinds of outputs a user will get, what they mean and how they should be visualized in a paper (plots, tables, ...). 
- Write docstrings for check_ml_identification before you actually implement it
- Adjust our [simple example](https://estimagic.readthedocs.io/en/stable/getting_started/estimation/first_msm_estimation_with_estimagic.html) such that it has a second variable that can be arbitrarily correlated with x (i.e. add an identification problem)
- Start to write a tutorial in a notebook that shows how the new function will be used and what the outputs mean

## Remarks

- You can for now assume that the model parameters (params) are a 1d numpy array. We talk about making this more flexible later. 
- The idea behind writing the documentation first is that it lets you focus completely on a user friendly interface and a high level understanding. Also, we will probably ask for changes after you show us your proposed interface. If you had already implemented it, you would have to change it.

estimate_msm
https://estimagic.readthedocs.io/en/stable/reference_guides/index.html

In [2]:
# CALCULATE NECESSARY INPUTS (as in identification_check_with_estimagic.ipynb)
import pandas as pd
import numpy as np
import estimagic as em

rng = np.random.default_rng(seed=0)

def simulate_data(params, n_draws, rng,correlation=0.7):

    mu = np.array([0.0, 0.0])
    var_cov = np.array([
            [  1, correlation],
            [ correlation,  1],
        ])
    x = rng.multivariate_normal(mu, var_cov, size=n_draws)
    x1 = x[:,0]
    x2 = x[:,1]
    e = rng.normal(0, params.loc["sd", "value"], size=n_draws)
    y = params.loc["intercept", "value"] + params.loc["slope1", "value"] * x1 + params.loc["slope2", "value"] + e
    return pd.DataFrame({"y": y, "x1": x1, "x2": x2})

true_params = pd.DataFrame(
    data=[[2, -np.inf], [-1, -np.inf], [-1, -np.inf], [1, 1e-10]],
    columns=["value", "lower_bound"],
    index=["intercept", "slope1", "slope2", "sd"],
)

true_params = pd.DataFrame(
    data=[[2, -np.inf], [-1, -np.inf], [-1, -np.inf], [1, 1e-10]],
    columns=["value", "lower_bound"],
    index=["intercept", "slope1", "slope2", "sd"],
)

data = simulate_data(true_params, n_draws=1000, rng=rng)

def calculate_moments(sample):
    moments = {
        "y_mean": sample["y"].mean(),
        "x1_mean": sample["x1"].mean(),
        "x2_mean": sample["x2"].mean(),
        "yx1_mean": (sample["y"] * sample["x1"]).mean(),
        "yx2_mean": (sample["y"] * sample["x2"]).mean(),
        "y_sqrd_mean": (sample["y"] ** 2).mean(),
        "x1_sqrd_mean": (sample["x1"] ** 2).mean(),
        "x2_sqrd_mean": (sample["x1"] ** 2).mean(),
    }
    return pd.Series(moments)

empirical_moments = calculate_moments(data)


def simulate_moments(params, n_draws=10_000, seed=0):
    rng = np.random.default_rng(seed)
    sim_data = simulate_data(params, n_draws, rng)
    sim_moments = calculate_moments(sim_data)

    return sim_moments


moments_cov = em.get_moments_cov(
    data, calculate_moments, bootstrap_kwargs={"n_draws": 5_000, "seed": 0}
)

start_params = true_params.assign(value=[100, 100, 100, 100])

res = em.estimate_msm(
    simulate_moments,
    empirical_moments,
    moments_cov,
    start_params,
    optimize_options={"algorithm":"scipy_lbfgsb"},
)

res.summary() # !check that standard_error is without NA

Unnamed: 0,value,standard_error,ci_lower,ci_upper,p_value,free,stars
intercept,0.45368,23651820000.0,-46356710000.0,46356710000.0,1.0,True,
slope1,-0.97946,127.3175,-250.5172,248.5583,0.993862,True,
slope2,0.453677,25836210000.0,-50638040000.0,50638040000.0,1.0,True,
sd,0.988449,562.828,-1102.134,1104.111,0.998599,True,


In [24]:
#data.to_excel("test_data.xlsx")
#moments_cov.to_excel("test_moments_cov.xlsx")

# 2.i Draws on the level set
## Sampling for step 2.i)
#### 1. Direct approach
 - draw values from the space - either randomly or pseudo-randomly (Sobol or Halton)
 - assign weights proportionally to the bandwidth criterion (indicator function)
 - drawback - effective sample size can be samll relative to the parameter space; especially when the dimention of \theta is moderately large

#### 2. Adaptive Sampling by Population Monte Carlo
- constructing a sequence of proposal distributions with higher acceptance rate
- to do later

In [3]:
import math
import numpy as np
import pandas as pd
import scipy.stats.qmc as qmc
import cvxpy as cp

from estimagic.estimation.msm_weighting import get_weighting_matrix
from estimagic.estimation.estimate_msm import get_msm_optimization_functions

In [4]:
from identification_check import check_msm_identification
from identification_check import sampling_level_sets
from identification_check import calculate_quasi_jacobian
from identification_check import category_selection

In [None]:
n = data.shape[0]
bandwidth = math.sqrt(2 * math.log(math.log(n)) / n)
sampling= "sobol"
grid_sub,moms_sub = sampling_level_sets(simulate_moments,res,moments_cov,10000,bandwidth,'diagonal',sampling)

In [None]:
grid_sub

array([[-7.59574547e+05, -9.02193229e-01,  7.59689576e+05,
         9.92711672e-01]])

In [None]:
res

In [30]:
pd.DataFrame(grid_sub).to_excel('test_grid_sub.xlsx')
pd.DataFrame(moms_sub).to_excel('test_moms_sub.xlsx')

In [None]:
calculate_quasi_jacobian(grid_sub, moms_sub, 4)

In [None]:
check_msm_identification(
        simulate_moments,
        res,
        moments_cov,
        1000,
        n_obs = data.shape[0],
        weights = 'diagonal',
        kernel = 'uniform',
        sampling = "sobol",
        bandwidth = None,
        cutoff = None,
        population_mc_kwgs = None,
        simulate_moments_kwargs= None,
        logging = False,
        log_options = None,

)

# 2.ii) Linear approximation
## Kernels for step 2.ii)

TO DO - solve with scipy.optimize.linprog

 - uniform
 - Epanchnikov- to do later
 - cosine - to do later