# Wrapping Stata Code in Python

As a part of the Medicaid imputation project, I have a use case that involves in estimator that does not appear to be implemented in Python: the multivariate probit estimator.  (That being said, this could probably be tackled by a suitably designed implementation of [PyStan](https://github.com/stan-dev/pystan).) Essentially, we think that there is a relationship between the receipt of SNAP and Medicaid benefits (and other transfer benefits directed at low-income folks for that matter).  The most straightforward implementation seems to be contained by the `mvprobit` estimator in Stata.  This Notebook demonstrates how one might 1) use Python to deal with the data management, 2) pass the data to Stata for estimation, and 3) return the relevant results back into the Python namespace.

In [26]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import subprocess

%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


## Data Simulation

So that we may check our results against the model laid out in the article on [`mvprobit`](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiBl7OhwpjKAhUC7R4KHaLRAuQQFggdMAA&url=http%3A%2F%2Fwww.stata-journal.com%2Fsjpdf.html%3Farticlenum%3Dst0045&usg=AFQjCNHi1QVTZiK-YRagsQOuF37LmWht9w&sig2=ah1usVaro6ebRyEIGoiVog) housed in the Stata Journal, we will follow the same simulation procedure.  We are generating a set of four correlated equations with known parameter values so that we may identify the quality of the estimator.  The equations are taken directly from the article (Section 4.2).

In [27]:
#Set seed value (not that the same sequence will hold)
rand_seed=12309

#Set number of observations
nobs=5000

#Define correlation structure
vcov_R=np.array([[1.,25,.5,.75],[25,1.,.75,.5],[.5,.75,1.,.75],[.75,.5,.75,1.]])

#Set error means
u_means=np.zeros(4)

#Generate errors given the covariance structure
u_draws=np.random.multivariate_normal(u_means,vcov_R,nobs)

#Generate regressor data
x1 = np.random.uniform(size=nobs) - .5
x2 = np.random.uniform(size=nobs) + (1/3.)
x3 = 2*np.random.uniform(size=nobs) + .5
x4 = .5*np.random.uniform(size=nobs) - (1/3.)

#Generate latent response
y1_latent = .5 + 4*x1 + u_draws[:,0]
y2_latent = 3 + .5*x1 - 3*x2 + u_draws[:,1]
y3_latent = 1 - 2*x1 + .4*x2 - .75*x3 + u_draws[:,2]
y4_latent = -6 + 1*x1 -.3*x2 + 3*x3 - .4*x4 + u_draws[:,3]

#Generate observed binary response
y1 = np.where(y1_latent>0,1,0)
y2 = np.where(y2_latent>0,1,0)
y3 = np.where(y3_latent>0,1,0)
y4 = np.where(y4_latent>0,1,0)

#Define the number of draws
sim_draws=75



With all of the relevant inputs defined, we will write them to disk, so that they may be used in the Stata `.do` file.

In [28]:
#Capture data in DF
model_data=DataFrame({'y1':y1,
                      'y2':y2,
                      'y3':y3,
                      'y4':y4,
                      'x1':x1,
                      'x2':x2,
                      'x3':x3,
                      'x4':x4})

#Capture parameters
model_params={'seed_in':rand_seed,
              'obs_in':nobs,
               'r_in':sim_draws}
for var in model_params.keys():
    model_data[var]=model_params[var]

#Write both DFs to disk
model_data.to_csv('stata_test_data.csv')
# model_params.to_csv('stata_test_params.csv')

In [29]:
model_data

Unnamed: 0,x1,x2,x3,x4,y1,y2,y3,y4,r_in,obs_in,seed_in
0,0.360617,1.093283,1.253734,0.140427,1,1,1,0,75,5000,12309
1,-0.466143,0.966869,1.356876,-0.236156,1,1,0,0,75,5000,12309
2,-0.000362,0.504442,1.320947,-0.290000,1,1,1,0,75,5000,12309
3,0.232841,1.013847,1.627763,-0.292968,0,0,1,0,75,5000,12309
4,-0.113683,0.437025,2.068783,0.085220,1,1,1,1,75,5000,12309
5,-0.496768,0.918800,1.704710,-0.190340,0,1,1,0,75,5000,12309
6,-0.182321,0.943502,1.214528,-0.310866,0,0,1,0,75,5000,12309
7,0.100383,1.042758,0.834008,-0.052716,0,0,1,0,75,5000,12309
8,-0.487695,0.844357,2.017624,0.100763,0,1,0,0,75,5000,12309
9,0.291633,1.075973,1.763080,0.116773,0,0,0,0,75,5000,12309


Now we can use Python to call Stata from the command line.  Specifically, we need to get the Stata engine to run the `.do` file we have created.

In [30]:
!pwd
!cat py_stata_test.do

/home/MarvinW/work_scratch
*Change directory to work_scratch
cd C:\cygwin64\home\MarvinW\work_scratch

*Read in data
insheet using stata_test_data.csv

*Capture parameters from input data in mean estimation table
*(mean works because they are represented as repeated values across obs)
mean seed_in obs_in r_in

*Capture the results in an extractable matrix
matrix b=e(b)

*Assign each parameter to a local scalar for use in the routine
local seed_p=b[1,1]
local obs_p=b[1,2]
local r_p=b[1,3]

*Set random seed and number of obs
set seed `seed_p'
set obs `obs_p'

*Run multivariate probit model
mvprobit (y1=x1) (y2=x1 x2) (y3 = x1 x2 x3) (y4=x1 x2 x3 x4), dr(`r_p')

*Store output
estimates store model

*Write estimates to disk
esttab model using py_stata_test_results.csv

exit, STATA clear


In [31]:
# # Define Python function to launch a do-file 
# def dostata(dofile, *params):
#     '''Launch a do-file, given the fullpath to the do-file and a list of parameters.'''
#     import subprocess
#     cmd = ["stata-64", "do", dofile]
#     for param in params:
#         cmd.append(param)
#     return subprocess.call(cmd) 

# dostata('py_stata_test.do')

Stata is taking forever.  What if we tried a SAS routine instead?

In [32]:
!cat py_sas_test.sas

*Read in simulated data;
proc import datafile="C:\cygwin64\home\MarvinW\work_scratch\stata_test_data.csv"
	out=test
	dbms=csv
	replace;
run;

*Estimate multivariate probit model;
proc qlim data=test method=qn outest=mod_est covout;
	model y1=x1;
	model y2=x1 x2;
	model y3 = x1 x2 x3;
	model y4=x1 x2 x3 x4;
	endogenous y1 y2 y3 y4 ~ discrete;
	output out=mod_out marginal predicted prob;
run;

*Inspect results output;
data mod_est;
	set mod_est;
run;
data mod_out;
	set mod_out;
run;

*Write model estimates and output to disk;
proc export data=mod_est
	outfile="C:\cygwin64\home\MarvinW\work_scratch\mod_est.csv"
	dbms=CSV
	replace;
run;
proc export data=mod_out
	outfile="C:\cygwin64\home\MarvinW\work_scratch\mod_out.csv"
	dbms=CSV
	replace;
run;


In [33]:
# Define Python function to launch a do-file 
def dosas(dofile, *params):
    '''Launch a sas file, given the fullpath to the sas file and a list of parameters.'''
    import subprocess
    cmd = ["sas", dofile]
    for param in params:
        cmd.append(param)
    return subprocess.call(cmd) 

dosas('py_sas_test.sas')

0

That was much quicker.  Now we can explore the output.

In [34]:
#Read in model output
mod_out=pd.read_csv('mod_out.csv')

mod_out.head().T

Unnamed: 0,0,1,2,3,4
_,0.0,1.0,2.0,3.0,4.0
x1,0.360617,-0.466143,-0.000362,0.232841,-0.113683
x2,1.093283,0.966869,0.504442,1.013847,0.437025
x3,1.253734,1.356876,1.320947,1.627763,2.068783
x4,0.140427,-0.236156,-0.29,-0.292968,0.08522
y1,1.0,1.0,1.0,0.0,1.0
y2,1.0,1.0,1.0,0.0,1.0
y3,1.0,0.0,1.0,1.0,1.0
y4,0.0,0.0,0.0,0.0,1.0
r_in,75.0,75.0,75.0,75.0,75.0


So, let's unpack this a bit. `y1-y4` and `x1-x4` clearly capture the dependent and independent data, respectively.  `r_in, obs_in` and `seed_in` are all input parameters for the estimator.  Everything else was generated during estimation:

+ `P_y1-P_y4` capture the (discrete) predicted values for the dependent (based upon the coefficient estimates in the model spec);
+ `Meff*` capture the marginal effects for each regressor on the probability of a given observed value.  These differ by equation.  For example, `Meff_P1_y1.x1` features the marginal effect of `x1` on the probability of `y=1` in the first equation.  By contrast, `Meff_P2_y4.x3` features the marginal effect of `x3` on the probability of `y=2` in the fourth equation.
+ `Prob_y1-Prob_y4` capture the primary measure of interest, the continuous estimated probability of `y=1` for observation `i`.

What about the estimation output?

In [35]:
#Read in model output
mod_est=pd.read_csv('mod_est.csv')

mod_est

Unnamed: 0,_NAME_,_TYPE_,_STATUS_,y1_Intercept,y1_x1,y2_Intercept,y2_x1,y2_x2,y3_Intercept,y3_x1,...,y4_x1,y4_x2,y4_x3,y4_x4,_Rho_y1_y2,_Rho_y1_y3,_Rho_y1_y4,_Rho_y2_y3,_Rho_y2_y4,_Rho_y3_y4
0,,PARM,0 Converged,0.1612751,1.099069,0.8911064,0.08416783,-0.8829441,1.011857,-1.945929,...,0.9384756,-0.1179343,2.952334,-0.5267015,0.9989401,0.1648397,0.1536273,0.1650284,0.1631339,0.7497073
1,,STD,0 Converged,0.01748363,0.06243637,0.03222645,0.06096532,0.03260868,0.07828474,0.07133761,...,0.09300648,0.09153145,0.08442194,0.1732324,0.0005509605,0.02299967,0.03140529,0.02296811,0.03126397,0.02401526
2,y1.Intercept,COV,0 Converged,0.0003056772,1.53317e-05,0.0002706711,-1.2401e-05,-1.2486e-05,3.19813e-05,-2.767484e-06,...,5.877881e-07,2.17293e-06,4.197547e-06,-4.678537e-06,-1.921799e-08,-3.369236e-06,-4.09099e-06,-2.642523e-06,-3.554377e-06,-1.321864e-06
3,y1.x1,COV,0 Converged,1.53317e-05,0.0038983,0.0004031413,0.00290746,-0.000484772,2.936171e-06,0.0003770262,...,0.0003723314,-1.968218e-06,1.074546e-07,-1.3264e-05,2.497207e-07,-2.8433e-05,-2.3211e-05,-1.6694e-05,-8.428686e-06,5.249129e-07
4,y2.Intercept,COV,0 Converged,0.0002706711,0.0004031413,0.001038544,-0.000226317,-0.000894822,7.778e-05,-9.50776e-07,...,-8.696668e-06,-9.0449e-05,-1.1587e-05,1.757344e-06,9.918846e-07,-3.9149e-05,-3.838e-05,-1.0061e-05,-6.241329e-06,-4.126979e-08
5,y2.x1,COV,0 Converged,-1.2401e-05,0.00290746,-0.000226317,0.00371677,0.0002687754,-4.1323e-05,0.0003839195,...,0.0004337008,0.0001005361,6.2953e-05,1.29125e-05,-1.1015e-05,4.10198e-05,3.99471e-05,-2.28185e-06,2.201e-05,2.092146e-06
6,y2.x2,COV,0 Converged,-1.2486e-05,-0.000484772,-0.000894822,0.0002687754,0.001063326,-5.7093e-05,-1.220411e-06,...,1.67502e-05,0.0001158716,3.288e-05,-7.335529e-07,-2.06775e-06,4.88422e-05,4.80868e-05,5.915613e-06,2.14965e-06,-4.609797e-06
7,y3.Intercept,COV,0 Converged,3.19813e-05,2.936171e-06,7.778e-05,-4.1323e-05,-5.7093e-05,0.006128501,-0.000658081,...,-0.000106123,-0.001534964,-0.000556846,-7.4735e-05,1.503038e-07,-4.0606e-05,-2.6062e-05,-3.75e-05,-2.2794e-05,-3.0892e-05
8,y3.x1,COV,0 Converged,-2.767484e-06,0.0003770262,-9.50776e-07,0.0003839195,-1.220411e-06,-0.000658081,0.005089054,...,0.001758383,4.60096e-05,-0.000294812,3.86718e-05,-6.696803e-07,1.36506e-05,2.82106e-05,1.25915e-05,2.74028e-05,3.70161e-05
9,y3.x2,COV,0 Converged,6.090521e-07,-2.621808e-06,-5.3212e-05,4.64656e-05,6.54071e-05,-0.003671213,-0.000216528,...,4.21429e-05,0.001857734,2.33561e-05,7.50018e-05,-2.393522e-07,3.6856e-05,1.04416e-05,3.24209e-05,7.091702e-06,1.14291e-05


The model estimate DF contains parameter estimates (in our case, both $\beta$ and $\rho$) and their respective standard deviations. The output also provides a covariance matrix for the parameter estimates.  Precisely which value corresponds to what is driven by the descriptor contained in the `_TYPE_` variable.

In [36]:
0.0040297101**.5

0.06347999763705099