# How Can One Save an Instance of an Estimator?

I have a situation in which I want to use an estimator that has been fit in one Notebook to predict new values from data in a different Notebook.  The question is, can I save a specific fitted model instance so that it can be read in elsewhere?  This Notebook will run a basic regression on simulated data.  We will then attempt to write the model to disk, read it back in, and use it to predict outcomes on new data.

In [1]:
#Basic data manipulation
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

#Estimating methods
import statsmodels.formula.api as smf
from statsmodels.api import add_constant

#Serialization
import pickle

## Simulation of Input Data

We are going to follow a very basic model:

$Y = X_1 + 2X_2 + \epsilon$

In [2]:
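#Generate random vectors for regressors, uniform between 1 and 20
x1,x2=(np.random.uniform(1,20,size=50),np.random.uniform(1,20,size=50))

#Generate dependent variable; the noise is uniform between 1 and 3 (mean 2),
#so the fitted intercept should land near 2
y=x1 + 2*x2 + np.random.uniform(1,3,size=len(x1))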

#Capture data in DF
data=DataFrame({'y':y,
                'x1':x1,
                'x2':x2})

data

| | x1 | x2 | y |
|---|---|---|---|
| 0 | 19.117813 | 9.081719 | 38.487556 |
| 1 | 19.090036 | 6.498166 | 33.855858 |
| 2 | 12.200609 | 18.250084 | 50.024782 |
| 3 | 8.471135 | 9.904962 | 29.284813 |
| 4 | 12.998626 | 12.719388 | 39.454115 |
| 5 | 15.946168 | 2.285481 | 23.406013 |
| 6 | 7.071413 | 1.768348 | 12.569192 |
| 7 | 16.119018 | 12.580359 | 42.352553 |
| 8 | 8.577352 | 18.972704 | 48.913061 |
| 9 | 12.471167 | 7.176528 | 29.701035 |
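
The draws above fall between 1 and 20 for the regressors and between 1 and 3 for the noise term, so the noise has mean 2 rather than 0 and the fitted intercept should land near 2. Because nothing is seeded, rerunning the cell also gives a different table each time. What follows is a minimal sketch of a seeded version, assuming NumPy 1.17 or later for `default_rng`. The seed value and the names `rng`, `n_obs`, `noise`, `design` and `coefs` are my own choices for illustration. It uses a plain least-squares solve rather than statsmodels, just to check the coefficients.

```python
import numpy as np

# Seeded generator so the simulated data can be regenerated exactly
# (the seed value 0 is an arbitrary choice for illustration)
rng = np.random.default_rng(0)

n_obs = 50
x1 = rng.uniform(1, 20, size=n_obs)
x2 = rng.uniform(1, 20, size=n_obs)

# Noise is uniform on [1, 3), so it has mean 2 rather than 0
noise = rng.uniform(1, 3, size=n_obs)
y = x1 + 2 * x2 + noise

# Least-squares fit with an explicit intercept column
design = np.column_stack([np.ones(n_obs), x1, x2])
coefs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)

# Order of the coefficients: intercept, x1, x2
print(coefs)
```

The slope estimates should sit close to 1 and 2, with the intercept absorbing the mean of the noise.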


## Estimation of the Model

We are just going to fit a standard OLS estimator...

In [3]:
#Fit model
model=smf.ols(formula='y ~ x1 + x2', data=data).fit()

model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.998 |
| Model: | OLS | Adj. R-squared: | 0.998 |
| Method: | Least Squares | F-statistic: | 10390.0 |
| Date: | Mon, 28 Mar 2016 | Prob (F-statistic): | 6.47e-63 |
| Time: | 20:10:07 | Log-Likelihood: | -44.127 |
| No. Observations: | 50 | AIC: | 94.25 |
| Df Residuals: | 47 | BIC: | 99.99 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 2.0162 | 0.246 | 8.181 | 0.000 | 1.520 2.512 |
| x1 | 1.0027 | 0.016 | 62.585 | 0.000 | 0.971 1.035 |
| x2 | 1.9831 | 0.016 | 124.666 | 0.000 | 1.951 2.015 |

| | | | |
|---|---|---|---|
| Omnibus: | 23.2 | Durbin-Watson: | 1.694 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4.088 |
| Skew: | 0.158 | Prob(JB): | 0.129 |
| Kurtosis: | 1.635 | Cond. No. | 46.5 |


To provide a point of comparison, we will first use the in-memory model to predict `y` from the original regressor columns.

In [4]:
pred_y=model.predict(data[['x1','x2']])

pred_y

array([ 39.19621217,  34.04494895,  50.44172549,  30.15293372,
        40.27407805,  22.53834664,  12.61376528,  43.12730978,
        48.24156285,  28.75320123,  57.72287732,  27.07806658,
        12.87529603,  58.08475365,  45.16579098,  41.35184676,
        21.84099785,  37.96278894,  21.44149071,  19.90859993,
        53.6952919 ,  39.20403839,  37.56368662,  37.32120798,
        28.32633854,  46.46907991,  44.24531183,  16.29523509,
        25.65277488,  32.30176333,  43.11840901,  41.91169594,
        15.17264292,  22.60851603,  41.25604118,  38.46013907,
        24.31116041,  21.00822494,  35.41398894,  38.11885085,
        23.70902291,  22.94958993,  16.6165082 ,  48.99109041,
        35.82361708,  10.26546609,  15.38526414,  47.69530843,
        27.05553445,  32.79476691])

## Writing the Model to Disk

Now that we have instantiated this model object, let's see if we can write it to disk.  To do this, we are going to use a method called "pickling", which serializes an object into a byte stream so that it may be written to disk.  In theory, we can read this same byte stream back elsewhere, and use it to reconstitute the object that had been written to disk.
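
To make the round trip concrete, here is a small sketch that uses only the standard library. The dictionary `fake_model` is a hypothetical stand-in for a fitted model, not a statsmodels object. Protocol 0 is the older text-based pickle format, so its bytes decode as ASCII; higher protocols are binary.

```python
import pickle

# A plain dictionary standing in for a fitted model (hypothetical values)
fake_model = {'name': 'ols', 'params': [2.0, 1.0, 2.0]}

# Protocol 0 is the original text-based format; for this object its bytes are plain ASCII
ascii_stream = pickle.dumps(fake_model, protocol=0)
print(ascii_stream.decode('ascii'))

# Higher protocols are more compact binary formats
binary_stream = pickle.dumps(fake_model, protocol=pickle.HIGHEST_PROTOCOL)

# Either stream reconstitutes an object equal to the original
restored_from_ascii = pickle.loads(ascii_stream)
restored_from_binary = pickle.loads(binary_stream)
print(restored_from_ascii == fake_model, restored_from_binary == fake_model)
```

Whichever protocol is used, the stream is bytes, so files holding it are best opened in binary mode.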

So, what exactly are we writing to disk?  Here is the ASCII version of the byte stream:

In [5]:
pickle.dumps(model)

'ccopy_reg\n_reconstructor\np0\n(cstatsmodels.regression.linear_model\nRegressionResultsWrapper\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS\'__doc__\'\np6\nS\'\\n    Results class for for an OLS model.\\n\\n    Most of the methods and attributes are inherited from RegressionResults.\\n    The special methods that are only available for OLS are:\\n\\n    - get_influence\\n    - outlier_test\\n    - el_test\\n    - conf_int_el\\n\\n    See Also\\n    --------\\n    RegressionResults\\n\\n    \'\np7\nsS\'_results\'\np8\ng0\n(cstatsmodels.regression.linear_model\nOLSResults\np9\ng2\nNtp10\nRp11\n(dp12\nS\'normalized_cov_params\'\np13\ncnumpy.core.multiarray\n_reconstruct\np14\n(cnumpy\nndarray\np15\n(I0\ntp16\nS\'b\'\np17\ntp18\nRp19\n(I1\n(L3L\nL3L\ntp20\ncnumpy\ndtype\np21\n(S\'f8\'\np22\nI0\nI1\ntp23\nRp24\n(I3\nS\'<\'\np25\nNNNI-1\nI-1\nI0\ntp26\nbI00\nS"/\\xb3\\x1b\\xb3\\xab]\\xc5?\\xef{?\\to[~\\xbf\'\\x19V/\\xbc\\xffy\\xbf\\xef{?\\to[~\\xbf\\xd1i\\x93T\\xc1\\x1dG?(6B=0\\x0c\\x0

I guess I am just trusting that this is correct.  Let's write that nonsense to disk.

In [6]:
#Pickle streams are bytes, so open the file in binary mode
with open('model_dump','wb') as f:
    pickle.dump(model,f)

Now we can read the model object back in, save it as a new instance, and use it to predict based upon the original regressor data.  This result can be compared to the original prediction above.

In [7]:
#Capture model object from disk (binary mode, matching how it was written)
with open('model_dump','rb') as f:
    model2=pickle.load(f)

#Show model summary
print(model2.summary())

#Predict new values of y
#(this call raises the AttributeError shown in the traceback below, hence commented out)
# pred_y2=model2.predict(data[['x1','x2']])

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 1.039e+04
Date:                Mon, 28 Mar 2016   Prob (F-statistic):           6.47e-63
Time:                        20:10:07   Log-Likelihood:                -44.127
No. Observations:                  50   AIC:                             94.25
Df Residuals:                      47   BIC:                             99.99
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.0162      0.246      8.181      0.0

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-16-f7a79b10e2b5> in <module>()
          7 
          8 #Predict new values of y
    ----> 9 pred_y2=model2.predict(data[['x1','x2']])

    C:\Users\marvinw\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\base\model.pyc in predict(self, exog, transform, *args, **kwargs)
        876         if transform and hasattr(self.model, 'formula') and exog is not None:
        877             from patsy import dmatrix
    --> 878             exog = dmatrix(self.model.data.orig_exog.design_info.builder,
        879                     exog)
        880         return self.model.predict(self.params, exog, *args, **kwargs)

    C:\Users\marvinw\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.pyc in __getattr__(self, name)
       1976                 return self[name]
       1977             raise AttributeError("'%s' object has no attribute '%s'" %
    -> 1978                                  (type(self).__name__, name))
       1979 
       1980     def __setattr__(self, name, value):

    AttributeError: 'DataFrame' object has no attribute 'design_info'

Hmmmm.  While it does appear that we have reconstituted the object, clearly something was lost in translation.  This helpful [SO post](http://stackoverflow.com/questions/20724919/pandas-dataframe-attributeerror-dataframe-object-has-no-attribute-design-inf) reveals that user additions to the model can get lost.  In this case, `design_info` is the metadata that **`patsy`** attaches to the design matrix built from the formula `'y ~ x1 + x2'`.  As the traceback shows, `predict` looks it up on the stored regressor `DataFrame` in order to transform new data, and the attribute did not survive the round trip through pickle.  The post suggests either building a design matrix explicitly with **`patsy`**, or skipping the formula transform (`transform=False`) and adding a constant to the input data ourselves before calling the unpickled version of the model.
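
The loss can be reproduced without statsmodels. The sketch below attaches an ad hoc attribute to a `DataFrame` and pickles it; the string assigned to `design_info` is only a placeholder for patsy's real metadata object. In the pandas versions I am familiar with, only the frame's data and registered metadata are pickled, so the extra attribute does not come back.

```python
import pickle
import pandas as pd

frame = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [3.0, 4.0]})

# Attach an ad hoc attribute, standing in for patsy's design_info
# (the string value is a placeholder, not a real DesignInfo object)
frame.design_info = 'y ~ x1 + x2'
print(hasattr(frame, 'design_info'))

# Round trip through pickle, as happens to the model's stored regressor data
restored = pickle.loads(pickle.dumps(frame))

# The column data survives, but the extra attribute is not part of the
# state that pandas pickles, so it is missing on the restored frame
print(restored.equals(frame))
print(hasattr(restored, 'design_info'))
```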

In [8]:
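#Predict new values of y, supplying the intercept column ourselves
#and skipping the formula transform
pred_y2=model2.predict(add_constant(data[['x1','x2']]),transform=False)

#Difference from the original predictions
pred_y-pred_y2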

array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -1.77635684e-15,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -3.55271368e-15,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -1.77635684e-15,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   7.10542736e-15,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -3.55271368e-15,

The deviations shown are on the order of 1e-15, which is floating-point rounding error, so I think that I can handle them.  Looks like this works as expected.
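
With the formula transform skipped, the prediction is just the regressor matrix (with a leading column of ones) multiplied by the coefficient vector. As a rough check, the sketch below takes the rounded coefficients from the OLS summary and the first three rows of the simulated data shown earlier, computes the product by hand, and compares it with the first three values reported by `model.predict`. The names `design`, `manual_pred` and `reported_pred` are mine. `np.allclose` is used in place of eyeballing a vector of differences, with a loose tolerance because the coefficients are rounded to four decimals.

```python
import numpy as np

# Rounded coefficients from the OLS summary, in the order Intercept, x1, x2
params = np.array([2.0162, 1.0027, 1.9831])

# First three rows of the simulated regressors (x1, x2) shown earlier
regressors = np.array([[19.117813, 9.081719],
                       [19.090036, 6.498166],
                       [12.200609, 18.250084]])

# Prepend a column of ones for the intercept
design = np.column_stack([np.ones(len(regressors)), regressors])

# Linear prediction is the design matrix times the coefficient vector
manual_pred = design.dot(params)

# First three predictions reported by model.predict above
reported_pred = np.array([39.19621217, 34.04494895, 50.44172549])

# Compare with a tolerance instead of inspecting raw differences
print(np.allclose(manual_pred, reported_pred, atol=1e-2))
```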

In [9]:
np.random.choice(range(0,2),size=10)

array([0, 1, 1, 1, 1, 1, 0, 1, 1, 0])

In [10]:
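import statsmodels.api as sm
#Regressors as a plain array (note that no constant is added here)
X=data[['x1','x2']].values
#Random binary outcome, unrelated to the regressors
y=np.random.choice(range(0,2),size=len(data))
#Fit a probit model
p=sm.Probit(y,X).fit(maxiter=1000)
p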

Optimization terminated successfully.
         Current function value: 0.657559
         Iterations 4


<statsmodels.discrete.discrete_model.BinaryResultsWrapper at 0x15f66898>

In [12]:
# summary_string=p.summary().as_csv().split(',')
# ll_text_idx=[i for i,content in enumerate(summary_string) if 'Log-Likelihood:' in content][0]
# ll_rep_value=summary_string[ll_text_idx+1].strip().split('\n')[0]
# ll_max_found=ll_rep_value != 'nan'
# ll_max_found

In [13]:
p.summary().as_csv().split(',')

['                  Probit Regression Results                   \nDep. Variable:',
 'y               ',
 '  No. Observations:  ',
 '    50  \nModel:        ',
 'Probit          ',
 '  Df Residuals:      ',
 '    48  \nMethod:       ',
 'MLE             ',
 '  Df Model:          ',
 '     1  \nDate:         ',
 'Mon',
 ' 28 Mar 2016',
 '  Pseudo R-squ.:     ',
 '0.03342 \nTime:         ',
 '20:10:33        ',
 '  Log-Likelihood:    ',
 ' -32.878\nconverged:    ',
 'True            ',
 '  LL-Null:           ',
 ' -34.015\n              ',
 '                ',
 '  LLR p-value:       ',
 '0.1316  \n  ',
 '   coef   ',
 ' std err ',
 '    z    ',
 'P>|z| ',
 ' [95.0% Conf. Int.]\nx1',
 '   -0.0183',
 '    0.025',
 '   -0.743',
 ' 0.457',
 '   -0.066     0.030\nx2',
 '    0.0440',
 '    0.027',
 '    1.608',
 ' 0.108',
 '   -0.010     0.098']

In [14]:
print(p.summary().tables[0])

                          Probit Regression Results                           
Dep. Variable:                      y   No. Observations:                   50
Model:                         Probit   Df Residuals:                       48
Method:                           MLE   Df Model:                            1
Date:                Mon, 28 Mar 2016   Pseudo R-squ.:                 0.03342
Time:                        20:10:34   Log-Likelihood:                -32.878
converged:                       True   LL-Null:                       -34.015
                                        LLR p-value:                    0.1316


In [18]:
dir(p)

['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__getstate__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_cache',
 '_data_attr',
 '_get_endog_name',
 '_get_robustcov_results',
 'aic',
 'bic',
 'bse',
 'conf_int',
 'cov_kwds',
 'cov_params',
 'cov_type',
 'df_model',
 'df_resid',
 'f_test',
 'fittedvalues',
 'get_margeff',
 'initialize',
 'k_constant',
 'llf',
 'llnull',
 'llr',
 'llr_pvalue',
 'load',
 'mle_retvals',
 'mle_settings',
 'model',
 'nobs',
 'normalized_cov_params',
 'params',
 'pred_table',
 'predict',
 'prsquared',
 'pvalues',
 'remove_data',
 'resid_dev',
 'resid_generalized',
 'resid_pearson',
 'resid_response',
 'save',
 'scale',
 'summary',
 'summary2',
 't_test',
 'tvalues',
 'use_t',
 'wald_test']

In [19]:
p.llf

-32.877934198062555

In [21]:
p.mle_retvals['converged']

True

In [25]:
p.pvalues

array([ 0.45727312,  0.10792699])

In [35]:
not np.isnan(p.pvalues).all()

True
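
The attribute checks above can be folded into one small function, which avoids parsing the text of the summary. `fit_looks_usable` is a hypothetical helper of my own, not part of statsmodels, and the stand-in objects only mimic the two attributes it reads. It uses `.any()` where the cell above uses `.all()`, so it is stricter: it rejects a fit as soon as a single p-value is NaN, rather than only when all of them are.

```python
import numpy as np
from types import SimpleNamespace


def fit_looks_usable(result):
    """Return True when the optimizer converged and no p-value is NaN.

    Hypothetical helper for illustration; not a statsmodels function.
    """
    # mle_retvals is only present on results fit by maximum likelihood
    retvals = getattr(result, 'mle_retvals', None) or {}
    converged = bool(retvals.get('converged', False))
    pvalues = np.asarray(result.pvalues, dtype=float)
    return converged and not np.isnan(pvalues).any()


# Stand-in objects that mimic the two attributes inspected above
good_fit = SimpleNamespace(mle_retvals={'converged': True},
                           pvalues=np.array([0.45727312, 0.10792699]))
bad_fit = SimpleNamespace(mle_retvals={'converged': True},
                          pvalues=np.array([np.nan, 0.10792699]))

print(fit_looks_usable(good_fit), fit_looks_usable(bad_fit))
```

With the fitted probit results in hand, the call would simply be `fit_looks_usable(p)`.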

In [27]:
p.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 50 |
| Model: | Probit | Df Residuals: | 48 |
| Method: | MLE | Df Model: | 1 |
| Date: | Mon, 28 Mar 2016 | Pseudo R-squ.: | 0.03342 |
| Time: | 20:16:05 | Log-Likelihood: | -32.878 |
| converged: | True | LL-Null: | -34.015 |
| | | LLR p-value: | 0.1316 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| x1 | -0.0183 | 0.025 | -0.743 | 0.457 | -0.066 0.030 |
| x2 | 0.0440 | 0.027 | 1.608 | 0.108 | -0.010 0.098 |
