# Wrapping Stata Code in Python

As part of the Medicaid imputation project, I have a use case that involves an estimator that does not appear to be implemented in Python: the multivariate probit estimator.  (That said, it could probably be tackled with a suitably designed model in [PyStan](https://github.com/stan-dev/pystan).) Essentially, we think that there is a relationship between the receipt of SNAP and Medicaid benefits (and, for that matter, other transfer benefits directed at low-income folks).  The most straightforward implementation appears to be the `mvprobit` estimator in Stata.  This Notebook demonstrates how one might 1) use Python to handle the data management, 2) pass the data to Stata for estimation, and 3) return the relevant results to the Python namespace.

In [16]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import subprocess

%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


## Data Simulation

So that we may check our results against the model laid out in the Stata Journal article on [`mvprobit`](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiBl7OhwpjKAhUC7R4KHaLRAuQQFggdMAA&url=http%3A%2F%2Fwww.stata-journal.com%2Fsjpdf.html%3Farticlenum%3Dst0045&usg=AFQjCNHi1QVTZiK-YRagsQOuF37LmWht9w&sig2=ah1usVaro6ebRyEIGoiVog), we will follow the same simulation procedure.  We generate a set of four equations with correlated errors and known parameter values so that we can assess the quality of the estimator.  The equations are taken directly from the article (Section 4.2).

In [17]:
#Set seed value to pass to Stata (note that Stata will not reproduce NumPy's random sequence)
rand_seed=12309

#Set number of observations
nobs=5000

#Define correlation structure (symmetric, unit diagonal)
vcov_R=np.array([[1.,.25,.5,.75],[.25,1.,.75,.5],[.5,.75,1.,.75],[.75,.5,.75,1.]])

#Set error means
u_means=np.zeros(4)

#Generate errors given the covariance structure
u_draws=np.random.multivariate_normal(u_means,vcov_R,nobs)

#Generate regressor data
x1 = np.random.uniform(size=nobs) - .5
x2 = np.random.uniform(size=nobs) + (1/3.)
x3 = 2*np.random.uniform(size=nobs) + .5
x4 = .5*np.random.uniform(size=nobs) - (1/3.)

#Generate latent response
y1_latent = .5 + 4*x1 + u_draws[:,0]
y2_latent = 3 + .5*x1 - 3*x2 + u_draws[:,1]
y3_latent = 1 - 2*x1 + .4*x2 - .75*x3 + u_draws[:,2]
y4_latent = -6 + 1*x1 -.3*x2 + 3*x3 - .4*x4 + u_draws[:,3]

#Generate observed binary response
y1 = np.where(y1_latent>0,1,0)
y2 = np.where(y2_latent>0,1,0)
y3 = np.where(y3_latent>0,1,0)
y4 = np.where(y4_latent>0,1,0)

#Define the number of simulation draws used by mvprobit
sim_draws=75
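Before handing the data to Stata, it can be worth confirming that the correlation structure is usable. The following is a minimal sketch, not part of the original workflow: it assumes the (1,2) correlation is 0.25, and the helper `is_valid_correlation` and the seeded `RandomState` are hypothetical additions for illustration. It checks that the matrix is symmetric with a unit diagonal and no negative eigenvalues, then compares the empirical correlation of seeded draws against the target.

```python
import numpy as np

# Target correlation matrix for the four error terms
# (assumption: the (1,2) entry is 0.25)
vcov_R = np.array([[1.0, 0.25, 0.5, 0.75],
                   [0.25, 1.0, 0.75, 0.5],
                   [0.5, 0.75, 1.0, 0.75],
                   [0.75, 0.5, 0.75, 1.0]])


def is_valid_correlation(mat, tol=1e-8):
    """Hypothetical helper: True if mat is symmetric, has a unit
    diagonal, and has no eigenvalue below -tol."""
    symmetric = np.allclose(mat, mat.T)
    unit_diag = np.allclose(np.diag(mat), 1.0)
    psd = np.linalg.eigvalsh(mat).min() >= -tol
    return bool(symmetric and unit_diag and psd)


# Seed a dedicated generator so the Python-side draws are reproducible
rng = np.random.RandomState(12309)
u_draws = rng.multivariate_normal(np.zeros(4), vcov_R, 5000)

# Compare the empirical correlation of the draws with the target
emp_corr = np.corrcoef(u_draws, rowvar=False)
max_dev = np.abs(emp_corr - vcov_R).max()

print("valid correlation matrix:", is_valid_correlation(vcov_R))
print("largest deviation from target:", max_dev)
```

A large deviation, or a failed validity check, would suggest a typo in the matrix rather than a problem with the estimator.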



With all of the relevant inputs defined, we will write them to disk, so that they may be used in the Stata `.do` file.

In [18]:
#Capture data in DF
model_data=DataFrame({'y1':y1,
                      'y2':y2,
                      'y3':y3,
                      'y4':y4,
                      'x1':x1,
                      'x2':x2,
                      'x3':x3,
                      'x4':x4})

#Capture parameters
model_params={'seed_in':rand_seed,
              'obs_in':nobs,
               'r_in':sim_draws}
for var in model_params.keys():
    model_data[var]=model_params[var]

#Write the combined data (variables plus constant parameter columns) to disk
model_data.to_csv('stata_test_data.csv')

In [19]:
model_data

|  | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 | r_in | obs_in | seed_in |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.412036 | 0.893058 | 1.448316 | -0.216108 | 1 | 1 | 1 | 0 | 75 | 5000 | 12309 |
| 1 | -0.098177 | 0.684493 | 1.564045 | -0.041778 | 0 | 1 | 1 | 0 | 75 | 5000 | 12309 |
| 2 | 0.223193 | 0.412064 | 0.538992 | 0.004446 | 0 | 0 | 1 | 0 | 75 | 5000 | 12309 |
| 3 | -0.183885 | 0.562191 | 2.170886 | -0.324298 | 1 | 1 | 1 | 1 | 75 | 5000 | 12309 |
| 4 | -0.462857 | 0.401412 | 2.298344 | -0.107225 | 0 | 0 | 0 | 0 | 75 | 5000 | 12309 |
| 5 | -0.274244 | 0.663529 | 0.724221 | -0.009814 | 1 | 1 | 1 | 0 | 75 | 5000 | 12309 |
| 6 | 0.333416 | 1.192337 | 0.775371 | -0.194807 | 0 | 0 | 1 | 0 | 75 | 5000 | 12309 |
| 7 | -0.196710 | 0.856402 | 2.195139 | 0.130084 | 1 | 1 | 1 | 1 | 75 | 5000 | 12309 |
| 8 | 0.234116 | 0.766830 | 2.031420 | -0.126423 | 0 | 0 | 0 | 0 | 75 | 5000 | 12309 |
| 9 | 0.448369 | 0.590168 | 1.143761 | -0.143042 | 0 | 0 | 1 | 0 | 75 | 5000 | 12309 |


Now we can use Python to call Stata from the command line.  Specifically, we need to get the Stata engine to run the `.do` file we have created.

In [20]:
!pwd
!cat py_stata_test.do

/home/MarvinW/work_scratch
*Change directory to work_scratch
cd C:\cygwin64\home\MarvinW\work_scratch

*Read in data
insheet using stata_test_data.csv

*Capture parameters from input data in mean estimation table
*(mean works because they are represented as repeated values across obs)
mean seed_in obs_in r_in

*Capture the results in an extractable matrix
matrix b=e(b)

*Assign each parameter to a local scalar for use in the routine
local seed_p=b[1,1]
local obs_p=b[1,2]
local r_p=b[1,3]

*Set random seed and number of obs
set seed `seed_p'
set obs `obs_p'

*Run multivariate probit model
mvprobit (y1=x1) (y2=x1 x2) (y3 = x1 x2 x3) (y4=x1 x2 x3 x4), dr(`r_p')

*Store output
estimates store model

*Write estimates to disk
esttab model using py_stata_test_results.csv

exit, STATA clear
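The do-file above recovers its parameters by taking the mean of constant columns. One alternative, sketched here under stated assumptions, is to generate the do-file text from Python so the parameters are written in directly. The template and the function `render_do_file` are hypothetical; the Stata commands are the ones already used in the do-file above, and the working directory shown in the usage lines is a placeholder.

```python
# Hypothetical template: reuses the commands from the do-file above,
# with the parameters filled in from Python instead of read from the data
DO_TEMPLATE = """\
*Generated from Python
cd "{workdir}"
insheet using {data_csv}
set seed {seed}
mvprobit (y1=x1) (y2=x1 x2) (y3 = x1 x2 x3) (y4=x1 x2 x3 x4), dr({draws})
estimates store model
esttab model using {results_csv}
exit, STATA clear
"""


def render_do_file(workdir, data_csv, results_csv, seed, draws):
    """Return the text of a do-file with the given parameters filled in."""
    return DO_TEMPLATE.format(workdir=workdir,
                              data_csv=data_csv,
                              results_csv=results_csv,
                              seed=int(seed),
                              draws=int(draws))


# Usage: render the text; writing it to disk is left to the caller
do_text = render_do_file("C:/path/to/work_scratch",  # placeholder directory
                         "stata_test_data.csv",
                         "py_stata_test_results.csv",
                         seed=12309,
                         draws=75)
print(do_text)
```

With this approach the parameter columns would no longer need to be appended to the data, at the cost of regenerating the do-file for each run.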


In [21]:
# Define Python function to launch a do-file
def dostata(dofile, *params):
    '''Launch a do-file, given the full path to the do-file and any parameters to pass to it.
    Returns the exit code of the Stata process.'''
    # subprocess is imported at the top of the notebook; "stata-64" must be on the PATH
    cmd = ["stata-64", "do", dofile]
    cmd.extend(params)
    return subprocess.call(cmd)

dostata('py_stata_test.do')

0
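The remaining step is to bring the estimates back into Python. The sketch below is one hedged way to do that. It assumes that `esttab` wraps each CSV cell as `="..."`, which should be checked against the actual contents of `py_stata_test_results.csv`. The functions `clean_cell` and `read_esttab_csv` are hypothetical, and the sample text uses made-up values purely to exercise the parser.

```python
import csv
import io


def clean_cell(cell):
    """Strip the ="..." wrapper assumed to surround each cell."""
    cell = cell.strip()
    if cell.startswith("="):
        cell = cell[1:]
    return cell.strip('"')


def read_esttab_csv(handle):
    """Return the non-empty rows of an esttab-style CSV as lists of strings."""
    rows = []
    for row in csv.reader(handle):
        cleaned = [clean_cell(c) for c in row]
        if any(cleaned):
            rows.append(cleaned)
    return rows


# Made-up sample in the assumed format (not actual mvprobit output)
sample = '="",="(1)"\n="",="y1"\n="x1",="4.012***"\n="",="(35.20)"\n'
rows = read_esttab_csv(io.StringIO(sample))
for row in rows:
    print(row)
```

For the real file, the same function could be applied to `open('py_stata_test_results.csv')`, and the resulting rows loaded into a `DataFrame` for further work.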