# Getting Started

We load the necessary modules and functions:

In [4]:
import statsmodels.api as sm
import pandas
from patsy import dmatrices

`patsy` is a library for describing statistical models and building design matrices using `R`-like formulas.

## Data

We download the [Guerry dataset](https://vincentarelbundock.github.io/Rdatasets/doc/HistData/Guerry.html), a collection of historical data used in support of André-Michel Guerry's 1833 *Essay on the Moral Statistics of France*. We could download the file locally and then load it using `read_csv`, but `statsmodels` can fetch it for us with `get_rdataset`, which returns the data as a `pandas` DataFrame.

In [7]:
df = sm.datasets.get_rdataset('Guerry', 'HistData').data
df.head()

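| | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | E | Ain | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 2:Med | ... | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |
| 1 | 2 | N | Aisne | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 2:Med | ... | 4 | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.0 |
| 2 | 3 | C | Allier | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 2:Med | ... | 46 | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 |
| 3 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 1:Sm | ... | 70 | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.9 |
| 4 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 1:Sm | ... | 22 | 23 | 64 | 79 | 35 | 7 | 1 | 320.28 | 5549 | 129.1 |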
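For comparison, the `read_csv` route mentioned earlier might look like the sketch below. It assumes the file was already saved locally; an in-memory string (the hypothetical `csv_text`) stands in for that file, and it holds only three rows and a few columns copied from the output above.

```python
import io

import pandas as pd

# Stand-in for a locally saved file: three rows and five columns copied from
# the output above. A real call would pass a file path instead of a buffer.
csv_text = """,dept,Region,Department,Literacy,Lottery
0,1,E,Ain,37,41
1,2,N,Aisne,51,38
2,3,C,Allier,13,66
"""

# index_col=0 treats the unnamed first column as the row index.
local_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(local_df.head())
```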


We select the variables of interest and look at the bottom 5 rows:

In [8]:
# Named `columns` rather than `vars` to avoid shadowing the Python builtin
columns = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
df = df[columns]
df[-5:]

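| | Department | Lottery | Literacy | Wealth | Region |
|---|---|---|---|---|---|
| 81 | Vienne | 40 | 25 | 68 | W |
| 82 | Haute-Vienne | 55 | 13 | 67 | C |
| 83 | Vosges | 14 | 62 | 82 | E |
| 84 | Yonne | 51 | 47 | 30 | C |
| 85 | Corse | 83 | 49 | 37 | NaN |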


Notice that there is one missing observation in the *Region* column. We eliminate it using a `DataFrame` method provided by `pandas`:

In [12]:
df = df.dropna()
df[-5:]

| | Department | Lottery | Literacy | Wealth | Region |
|---|---|---|---|---|---|
| 80 | Vendee | 68 | 28 | 56 | W |
| 81 | Vienne | 40 | 25 | 68 | W |
| 82 | Haute-Vienne | 55 | 13 | 67 | C |
| 83 | Vosges | 14 | 62 | 82 | E |
| 84 | Yonne | 51 | 47 | 30 | C |
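Before dropping rows, it can help to check how many values are missing in each column. The sketch below does this on a hypothetical three-row frame (`toy`) whose values are copied from the rows shown above; it is an illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the selected columns; Corse has no Region value.
toy = pd.DataFrame({
    "Department": ["Vosges", "Yonne", "Corse"],
    "Lottery": [14, 51, 83],
    "Region": ["E", "C", np.nan],
})

# Count missing values per column, then drop every row that contains one.
missing_per_column = toy.isna().sum()
cleaned = toy.dropna()

print(missing_per_column)
print(cleaned)
```

`dropna` returns a new frame and leaves the original untouched, which is why the tutorial assigns the result back to `df`.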


## Substantive motivation and model

We want to know whether literacy rates in the 86 French departments are associated with per capita wagers on the Royal Lottery in the 1820s. We need to control for the level of wealth in each department, and we also want to include a series of dummy variables on the right-hand side of our regression equation to control for unobserved heterogeneity due to regional effects. The model is estimated using OLS.

## Design matrices (endog and exog)

To fit most of the models covered by `statsmodels`, you will need to create two design matrices. The first is a matrix of endogenous variable(s) (i.e. dependent, response, regressand, etc.). The second matrix is a matrix of exogenous variable(s) (i.e. independent, predictor, regressor, etc.). The OLS coefficient estimates are calculated as usual:

$$\hat{\beta} = (X'X)^{-1} X'y$$

where *y* is an *N* × 1 column of data on lottery wagers per capita (*Lottery*), and *X* is an *N* × 7 matrix containing an intercept, the *Literacy* and *Wealth* variables, and 4 region binary variables.
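As a quick numerical check of the estimator above, the following sketch applies it to synthetic data (not the Guerry data) for which the true coefficients are known. It solves the normal equations with `numpy` rather than forming the inverse explicitly, which is a common numerical choice and an assumption of this sketch, not something the tutorial prescribes.

```python
import numpy as np

# Synthetic data (not from Guerry): y = 1 + 2*x exactly, with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones for the intercept plus one regressor.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X'X) beta = X'y instead of inverting X'X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
```

Because the data are noise-free, both approaches should recover the intercept 1 and the slope 2 up to floating-point error.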

The `patsy` module provides a convenient function to prepare design matrices using `R`-like formulas. You can find more information [here](https://patsy.readthedocs.io/en/latest/).

We use `patsy`'s `dmatrices` function to create design matrices:

In [13]:
y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')

The resulting matrices/data frames look like this:

In [14]:
y[:3]

| | Lottery |
|---|---|
| 0 | 41.0 |
| 1 | 38.0 |
| 2 | 66.0 |


In [15]:
X[:3]

| | Intercept | Region[T.E] | Region[T.N] | Region[T.S] | Region[T.W] | Literacy | Wealth |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 37.0 | 73.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51.0 | 22.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 61.0 |


Notice that `dmatrices` has:
* split the categorical *Region* variable into a set of indicator variables.
* added a constant to the exogenous regressors matrix.
* returned `pandas` DataFrames instead of simple numpy arrays. This is useful because DataFrames allow `statsmodels` to carry over metadata (e.g. variable names) when reporting results.
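To see why five regions yield only four indicator columns, the sketch below imitates the coding with `pandas.get_dummies`. This is a stand-in chosen for illustration, not what `patsy` does internally; the toy frame and the `Region_` column prefix are assumptions.

```python
import pandas as pd

# Hypothetical frame with one row per region code.
toy = pd.DataFrame({"Region": ["E", "N", "C", "S", "W"]})

# drop_first=True omits the first level (C, alphabetically) so that the
# indicators are not collinear with an intercept column.
indicators = pd.get_dummies(toy["Region"], prefix="Region", drop_first=True, dtype=float)

print(indicators)
```

The row for region C is all zeros: it acts as the reference level, just as there is no `Region[T.C]` column in the design matrix shown earlier.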

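These defaults of `dmatrices` can be altered. For example, appending `- 1` to the formula drops the constant, and omitting `return_type='dataframe'` returns `patsy` design matrices instead of DataFrames. See the `patsy` documentation for details.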
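A minimal sketch of changing these defaults is shown below. The four-row frame `toy` and the variable names are hypothetical; only the `- 1` formula term and the `return_type` argument are `patsy` features.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Hypothetical four-row frame; values are for illustration only.
toy = pd.DataFrame({
    "Lottery": [41, 38, 66, 80],
    "Literacy": [37, 51, 13, 46],
    "Region": ["E", "N", "C", "E"],
})

# Default formula: patsy adds an Intercept column automatically.
y_default, X_default = dmatrices(
    "Lottery ~ Literacy + Region", data=toy, return_type="dataframe"
)

# Appending "- 1" removes the intercept from the design matrix.
y_noint, X_noint = dmatrices(
    "Lottery ~ Literacy + Region - 1", data=toy, return_type="dataframe"
)

# Without return_type="dataframe", patsy returns DesignMatrix objects, which
# behave like numpy arrays and keep column names in `design_info`.
y_mat, X_mat = dmatrices("Lottery ~ Literacy + Region", data=toy)

print(list(X_default.columns))
print(list(X_noint.columns))
print(X_mat.design_info.column_names)
```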

## Model fit and summary