# How Can One Save an Instance of an Estimator?

I want to use an estimator that was fit in one Notebook to predict new values from data in a different Notebook. The question is: can I save a specific fitted model instance so that it can be read in elsewhere? This Notebook runs a basic regression on simulated data, then attempts to write the model to disk, read it back in, and use it to predict outcomes on new data.

In [1]:
#Basic data manipulation
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

#Estimating methods
import statsmodels.formula.api as smf
from statsmodels.api import add_constant

#Serialization
import pickle

## Simulation of Input Data

We are going to follow a very basic model:

$Y = X_1 + 2X_2 + \epsilon$

In [2]:
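#Generate random vectors for regressors
#Note: uniform(20,size=50) sets low=20 and leaves high at its default of 1.0,
#so the draws fall between 1 and 20
x1,x2=(np.random.uniform(20,size=50),np.random.uniform(20,size=50))

#Generate dependent variable
#Note: uniform(5) returns a single scalar between 1 and 5, so the "error" term
#is one constant shared by every row; it acts as an intercept, not as noise
y=x1 + 2*x2 + np.random.uniform(5)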

#Capture data in DF
data=DataFrame({'y':y,
                'x1':x1,
                'x2':x2})

data

| | x1 | x2 | y |
|---|---|---|---|
| 0 | 11.162996 | 7.858211 | 31.053511 |
| 1 | 9.121698 | 18.031641 | 49.359073 |
| 2 | 7.122839 | 8.880319 | 29.05757 |
| 3 | 18.998183 | 9.209179 | 41.590634 |
| 4 | 13.442536 | 3.774621 | 25.165871 |
| 5 | 7.841275 | 16.859139 | 45.733645 |
| 6 | 1.375523 | 6.66646 | 18.882535 |
| 7 | 11.596553 | 9.659084 | 35.088813 |
| 8 | 11.342886 | 14.571422 | 44.659823 |
| 9 | 4.419872 | 9.915029 | 28.424023 |
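
Note that `np.random.uniform(5)` in the cell above is called without `size`, so it returns one scalar rather than one draw per row. That is consistent with the perfect fit and the nonzero intercept in the regression output further down. The following is a minimal sketch of the difference between a scalar shift and per-observation noise. The seed, the variable names, and the explicit `(0, 20)` and `(0, 5)` ranges are assumptions made for illustration and are not part of the original Notebook.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed (an assumption) for repeatability
n = 50

x1 = rng.uniform(0, 20, size=n)
x2 = rng.uniform(0, 20, size=n)
signal = x1 + 2 * x2

# Scalar draw: one number shared by every row, so it behaves like an intercept
shift = rng.uniform(0, 5)
y_shifted = signal + shift

# Per-observation draw: one independent value per row, i.e. actual noise
eps = rng.uniform(0, 5, size=n)
y_noisy = signal + eps

# Spread of what is left after removing the deterministic part
spread_shifted = np.std(y_shifted - signal)  # essentially zero
spread_noisy = np.std(y_noisy - signal)      # clearly positive
print(spread_shifted, spread_noisy)
```

With the scalar version a regression can recover the shift exactly as its intercept, which is why nothing is left over for the residuals.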


## Estimation of the Model

We are just going to fit a standard OLS estimator...

In [3]:
#Fit model
model=smf.ols(formula='y ~ x1 + x2', data=data).fit()

model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 4.346e+31 |
| Date: | Fri, 17 Jul 2015 | Prob (F-statistic): | 0.0 |
| Time: | 16:46:55 | Log-Likelihood: | 1539.5 |
| No. Observations: | 50 | AIC: | -3073.0 |
| Df Residuals: | 47 | BIC: | -3067.0 |
| Df Model: | 2 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 4.1741 | 3.74e-15 | 1.12e+15 | 0.000 | 4.174 4.174 |
| x1 | 1.0000 | 2.63e-16 | 3.8e+15 | 0.000 | 1.000 1.000 |
| x2 | 2.0000 | 2.59e-16 | 7.73e+15 | 0.000 | 2.000 2.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.199 | Durbin-Watson: | 1.802 |
| Prob(Omnibus): | 0.905 | Jarque-Bera (JB): | 0.365 |
| Skew: | 0.119 | Prob(JB): | 0.833 |
| Kurtosis: | 2.656 | Cond. No. | 39.4 |


To provide a point of comparison, we will use the fitted model to predict `y` from the original regressors, `x1` and `x2`.

In [4]:
pred_y=model.predict(data[['x1','x2']])

pred_y

array([ 31.05351114,  49.35907263,  29.05757006,  41.59063421,
        25.16587066,  45.73364484,  18.88253471,  35.08881256,
        44.65982315,  28.4240229 ,  59.27480489,  45.18569398,
        55.2471291 ,  28.57092686,  48.35522603,   9.81240764,
        13.14922989,  55.60652878,  58.54317206,  12.63476969,
        37.47506987,  29.87868074,  44.83185737,  26.21100409,
        41.66098556,  42.67319132,  41.54132717,  15.76827766,
         9.99388603,  36.59258824,  57.22672297,  41.62986308,
        15.50910592,  48.31486452,  28.81450385,  44.95146298,
        36.71805582,  51.22397146,  27.99785729,  14.72345638,
        18.46505934,  15.01163319,  32.96052968,  32.87556324,
        32.70815089,  14.4849211 ,  58.18740798,  34.06714916,
        34.86150229,  35.46979198])

## Writing the Model to Disk

Now that we have a fitted model object, let's see if we can write it to disk. To do this, we are going to use "pickling", which serializes an object into a byte stream so that it may be written to disk. In theory, we can read this same byte stream back elsewhere and use it to reconstitute the object that had been written to disk.
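
As a minimal sketch of that round trip, the example below pickles a plain dictionary of coefficients, first in memory and then through a file. The dictionary, the `predict` helper, and the file name `toy_model.pkl` are hypothetical stand-ins chosen for illustration; they are not statsmodels objects. The file is opened in binary mode for both writing and reading, which Python 3 requires for pickle streams.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: just its coefficients
fitted = {'intercept': 4.0, 'coefs': [1.0, 2.0]}


def predict(model, row):
    """Linear prediction from the stored coefficients."""
    return model['intercept'] + sum(c * x for c, x in zip(model['coefs'], row))


# In memory: object -> byte string -> new object
payload = pickle.dumps(fitted)
restored = pickle.loads(payload)

# On disk: binary mode for both the write and the read
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(fitted, f)
with open(path, 'rb') as f:
    from_disk = pickle.load(f)

row = [3.0, 5.0]
print(predict(fitted, row), predict(restored, row), predict(from_disk, row))
```

The restored objects are equal to the original but are separate instances, which is the behavior we want when moving a model between Notebooks.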

So, what exactly are we writing to disk?  Here is the ASCII version of the byte stream:

In [5]:
pickle.dumps(model)

'ccopy_reg\n_reconstructor\np0\n(cstatsmodels.regression.linear_model\nRegressionResultsWrapper\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS\'__doc__\'\np6\nS\'\\n    Results class for for an OLS model.\\n\\n    Most of the methods and attributes are inherited from RegressionResults.\\n    The special methods that are only available for OLS are:\\n\\n    - get_influence\\n    - outlier_test\\n    - el_test\\n    - conf_int_el\\n\\n    See Also\\n    --------\\n    RegressionResults\\n\\n    \'\np7\nsS\'_results\'\np8\ng0\n(cstatsmodels.regression.linear_model\nOLSResults\np9\ng2\nNtp10\nRp11\n(dp12\nS\'normalized_cov_params\'\np13\ncnumpy.core.multiarray\n_reconstruct\np14\n(cnumpy\nndarray\np15\n(I0\ntp16\nS\'b\'\np17\ntp18\nRp19\n(I1\n(L3L\nL3L\ntp20\ncnumpy\ndtype\np21\n(S\'f8\'\np22\nI0\nI1\ntp23\nRp24\n(I3\nS\'<\'\np25\nNNNI-1\nI-1\nI0\ntp26\nbI00\nS\'{\\x15N\\x08\\xed\\xcc\\xbf?Q\\xdf\\xd8\\x94b`u\\xbf\\xa1\\xda\\xa8;2}t\\xbfQ\\xdf\\xd8\\x94b`u\\xbfP\\xb1=/94D?\\x9a\\x9b\\x0

I guess I am just trusting that this is correct.  Let's write that nonsense to disk.

In [6]:
#Open in binary mode, since a pickle is a byte stream
with open('model_dump','wb') as f:
    pickle.dump(model,f)

Now we can read the model object back in, save it as a new instance, and use it to predict based upon the original regressor data. This result can be compared to the original prediction above.

In [7]:
#Capture model object from disk (binary mode, matching how it was written)
with open('model_dump','rb') as f:
    model2=pickle.load(f)

#Show model summary
print(model2.summary())

#Predict new values of y
#(commented out here because it raises the AttributeError shown below)
# pred_y2=model2.predict(data[['x1','x2']])

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.346e+31
Date:                Fri, 17 Jul 2015   Prob (F-statistic):               0.00
Time:                        16:46:55   Log-Likelihood:                 1539.5
No. Observations:                  50   AIC:                            -3073.
Df Residuals:                      47   BIC:                            -3067.
Df Model:                           2                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      4.1741   3.74e-15   1.12e+15      0.000         4.174     4.174
x1             1.0000   2.63e-16    3.8e+15      0.0

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-16-f7a79b10e2b5> in <module>()
          7 
          8 #Predict new values of y
    ----> 9 pred_y2=model2.predict(data[['x1','x2']])

    C:\Users\marvinw\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\base\model.pyc in predict(self, exog, transform, *args, **kwargs)
        876         if transform and hasattr(self.model, 'formula') and exog is not None:
        877             from patsy import dmatrix
    --> 878             exog = dmatrix(self.model.data.orig_exog.design_info.builder,
        879                     exog)
        880         return self.model.predict(self.params, exog, *args, **kwargs)

    C:\Users\marvinw\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.pyc in __getattr__(self, name)
       1976                 return self[name]
       1977             raise AttributeError("'%s' object has no attribute '%s'" %
    -> 1978                                  (type(self).__name__, name))
       1979 
       1980     def __setattr__(self, name, value):

    AttributeError: 'DataFrame' object has no attribute 'design_info'

Hmmmm. While it does appear that we have reconstituted the object, clearly something was lost in translation. This helpful [SO post](http://stackoverflow.com/questions/20724919/pandas-dataframe-attributeerror-dataframe-object-has-no-attribute-design-inf) explains that attributes attached to a DataFrame after it is created can be lost when the object is pickled. In this case, `design_info` is the **`patsy`** metadata describing how the formula `'y ~ x1 + x2'` turns input columns into the design matrix, and `predict` needs it to transform new data. The post suggests either building a design matrix explicitly with **`patsy`**, or skipping the formula transformation (`transform=False`) and adding the constant column to the input data ourselves for the unpickled version of the model.
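
To make the second option concrete, here is a rough sketch of what predicting from a design matrix with an explicit constant amounts to: each row gets a leading 1.0, and the prediction is the dot product of that row with the coefficients. The coefficients are copied from the rounded summary table and the two rows from the data preview; the helper names `with_constant` and `linear_predictor` are hypothetical and are not statsmodels functions.

```python
# Coefficients as printed in the summary, in the order Intercept, x1, x2
params = [4.1741, 1.0, 2.0]

# First two rows of the regressors (x1, x2) from the data preview
rows = [(11.162996, 7.858211), (9.121698, 18.031641)]


def with_constant(row):
    """Prepend a 1.0 so the intercept has a column to multiply."""
    return (1.0,) + tuple(row)


def linear_predictor(design_row, coefs):
    """Dot product of one design-matrix row with the coefficient vector."""
    return sum(x * b for x, b in zip(design_row, coefs))


predictions = [linear_predictor(with_constant(r), params) for r in rows]
print(predictions)
```

The results agree with the first two entries of `pred_y` up to the rounding of the printed intercept. The column order matters: the constant must sit in the same position as the intercept in the coefficient vector.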

In [8]:
#Predict new values of y
pred_y2=model2.predict(add_constant(data[['x1','x2']]),transform=False)

pred_y-pred_y2

array([  0.00000000e+00,  -7.10542736e-15,   0.00000000e+00,
         0.00000000e+00,   3.55271368e-15,   0.00000000e+00,
         0.00000000e+00,   7.10542736e-15,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -1.77635684e-15,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -1.77635684e-15,   0.00000000e+00,
         3.55271368e-15,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -7.10542736e-15,   0.00000000e+00,
         1.77635684e-15,   0.00000000e+00,   0.00000000e+00,
        -7.10542736e-15,   0.00000000e+00,   0.00000000e+00,
        -7.10542736e-15,   3.55271368e-15,  -7.10542736e-15,
        -7.10542736e-15,   0.00000000e+00,   0.00000000e+00,
         1.77635684e-15,   3.55271368e-15,   1.77635684e-15,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -1.77635684e-15,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,

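I think that I can handle those small deviations: they are on the order of 1e-15, which is floating-point rounding rather than a real discrepancy. The unpickled model works as expected.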
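
One way to check this kind of agreement without eyeballing the array is to compare against a tolerance instead of testing for exact equality. The sketch below uses a few of the differences printed above together with `math.isclose` from the standard library (Python 3.5 or later); the tolerance `1e-12` is an arbitrary choice for illustration.

```python
import math
import sys

# A few of the differences printed above (pred_y - pred_y2)
diffs = [0.0, -7.10542736e-15, 3.55271368e-15, 1.77635684e-15]
largest = max(abs(d) for d in diffs)

# Gap between adjacent float64 values in the range [32, 64), where many of
# the predictions fall: 32 * machine epsilon = 2**-47
spacing = 32.0 * sys.float_info.epsilon
print(largest, spacing)

# Tolerance-based comparison of one prediction against its shifted copy
a = 49.35907263
b = a - 7.10542736e-15
print(math.isclose(a, b, rel_tol=1e-12))
```

The largest difference is about the size of the gap between neighboring floating-point values at that magnitude, so the two sets of predictions differ by roughly one unit in the last place.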