# Linear models

*Monday 20, September*

### 1. `statsmodels` package

_`statsmodels` is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration._

The online documentation is hosted at [statsmodels.org](https://www.statsmodels.org/stable/index.html)

It covers:

- Linear Regression
- Generalized Linear Models
- Generalized Estimating Equations
- Generalized Additive Models (GAM)
- Robust Linear Models
- Linear Mixed Effects Models
- Regression with Discrete Dependent Variable
- Generalized Linear Mixed Effects Models
- ANOVA
- Time Series Analysis `tsa`
- Time Series Analysis by State Space Methods `statespace`
- Vector Autoregressions `tsa.vector_ar`
- Methods for Survival and Duration Analysis
- Statistics `stats`
- Nonparametric Methods `nonparametric`
- Generalized Method of Moments `gmm`
- Contingency tables
- Multiple Imputation with Chained Equations
- Multivariate Statistics `multivariate`
- Empirical Likelihood `emplike`
- Other Models `miscmodels`
- Distributions
- Graphics
- Input-Output `iolib`
- Tools
- The Datasets Package
- Sandbox
- Working with Large Data Sets
- Optimization

`statsmodels` integrates closely with `pandas`: a `DataFrame` is the dataset format it supports by default.

Anaconda ships with the `statsmodels` module preinstalled.
Before using its functions and classes, we need to import `statsmodels.api` and `statsmodels.formula.api`.

In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

import numpy as np

The output of `statsmodels` is similar to the output of the corresponding functions in `R`.
We start with one of the most widely used and elementary statistical methods: ordinary least squares (OLS).

### 2. OLS

**2.1. How to fit a dataset and see the result**

We use the `Guerry` dataset, which `statsmodels` fetches from the R package `HistData`, to study the determinants of lottery play (the `Lottery` variable).

In [2]:
# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
# list of the variables
list(dat.columns.values)

['dept',
 'Region',
 'Department',
 'Crime_pers',
 'Crime_prop',
 'Literacy',
 'Donations',
 'Infants',
 'Suicides',
 'MainCity',
 'Wealth',
 'Commerce',
 'Clergy',
 'Crime_parents',
 'Infanticide',
 'Donation_clergy',
 'Lottery',
 'Desertion',
 'Instruction',
 'Prostitutes',
 'Distance',
 'Area',
 'Pop1831']

More specifically, we study the relationship between `Lottery` and both `Literacy` and population (`Pop1831`, on the log scale).

In [7]:
# Fit regression model (using the natural log of one of the regressors)
model = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat)
results = model.fit()

To see the results, we need an additional step:

In [8]:
# Inspect the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Tue, 27 Aug 2019   Prob (F-statistic):           1.90e-08
Time:                        18:39:14   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         246.4341     35.233     
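Beyond the printed summary, the fitted results object exposes individual quantities as attributes. The sketch below is a minimal illustration on synthetic data, so it runs without downloading `Guerry`; the column names `y`, `x1`, `x2` and the true coefficients are hypothetical and not taken from the dataset above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the column names y, x1, x2 are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=50),
    "x2": rng.uniform(1.0, 10.0, size=50),
})
df["y"] = 2.0 + 0.5 * df["x1"] + 1.5 * np.log(df["x2"]) + rng.normal(scale=0.1, size=50)

# Same formula pattern as above: one plain regressor and one log-transformed regressor
results = smf.ols("y ~ x1 + np.log(x2)", data=df).fit()

print(results.params)      # coefficient estimates, indexed by term name
print(results.bse)         # standard errors of the estimates
print(results.pvalues)     # two-sided p-values of the t statistics
print(results.conf_int())  # 95% confidence intervals by default
print(results.rsquared)    # coefficient of determination
```

Reading these attributes directly is convenient when the estimates feed into later computations instead of being read off the printed table.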

**2.2. When the dataset is not in `DataFrame`**

The dataset above is already a `DataFrame`, the form that `statsmodels` supports by default.
However, in many situations the data are not yet available as a `DataFrame`.
In this case, we can use `numpy` arrays directly.

In [9]:
import numpy as np
import statsmodels.api as sm

# Generate artificial data (2 regressors + constant)
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)  # prepend a column of ones for the intercept
beta = [1, .1, .5]
# The noise is uniform on [0, 1), so its mean of 0.5 shifts the fitted intercept upward
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.168
Model:                            OLS   Adj. R-squared:                  0.151
Method:                 Least Squares   F-statistic:                     9.796
Date:                Tue, 27 Aug 2019   Prob (F-statistic):           0.000133
Time:                        18:46:55   Log-Likelihood:                -16.388
No. Observations:                 100   AIC:                             38.78
Df Residuals:                      97   BIC:                             46.59
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5404      0.075     20.632      0.0

Of course, we can also create a dataset and add it to `statsmodels` so that it is supported natively.
Details can be found here: [adding a dataset](https://www.statsmodels.org/stable/dev/dataset_notes.html?highlight=statsmodels%20datasets#adding-a-dataset-an-example)
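For everyday work, a lighter alternative is to wrap the arrays in a `pandas` `DataFrame` and use the formula interface. This is a minimal sketch under that assumption; the column names `y`, `x1`, `x2` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
nobs = 100
x = rng.random((nobs, 2))
y = 1.0 + 0.1 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(scale=0.1, size=nobs)

# Wrap the raw arrays in a DataFrame; the column names are hypothetical
df = pd.DataFrame({"y": y, "x1": x[:, 0], "x2": x[:, 1]})

# The formula interface adds the intercept automatically
results = smf.ols("y ~ x1 + x2", data=df).fit()
print(results.params)
```

With named columns, the coefficients in the output are labelled `Intercept`, `x1`, and `x2` instead of `const`, `x1`, `x2`.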

**2.3. Wald's test**

Besides model fitting, `statsmodels` also supports many statistical testing methods.
Here, we show how to use _Wald's test_ in `statsmodels`.

Again, we consider the `Guerry` dataset.

We want to analyse the effect of _Wealth_ and _Literacy_ on `Crime_pers` and test:

> whether the coefficients of _Wealth_ and _Literacy_ are the same.

In [6]:
formula = 'Crime_pers ~ Wealth + Literacy'
results = smf.ols(formula, dat).fit()
hypotheses = '(Wealth = Literacy)'
f_test = results.f_test(hypotheses)
print(f_test)

<F test: F=array([[0.03467668]]), p=0.8527291641569565, df_denom=83, df_num=1>
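The string hypothesis is shorthand for a linear restriction $R\beta = 0$ on the coefficient vector. The sketch below, on synthetic data with hypothetical column names `y`, `x1`, `x2`, passes the restriction matrix explicitly and checks that it gives the same F statistic. It also checks that, for a single restriction, the F statistic equals the square of the t statistic from `t_test`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data in which the two slopes are truly equal (hypothetical names)
rng = np.random.default_rng(2)
n = 80
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 0.7 * df["x1"] + 0.7 * df["x2"] + rng.normal(size=n)

results = smf.ols("y ~ x1 + x2", data=df).fit()

# Hypothesis as a string, as in the example above
f_str = results.f_test("x1 = x2")

# Same hypothesis as a restriction matrix: 0*Intercept + 1*x1 - 1*x2 = 0
R = np.array([[0.0, 1.0, -1.0]])
f_mat = results.f_test(R)

# With a single restriction, the F statistic is the square of the t statistic
t_res = results.t_test("x1 = x2")

# np.squeeze handles both scalar and array-valued results
F_str = float(np.squeeze(f_str.fvalue))
F_mat = float(np.squeeze(f_mat.fvalue))
t_val = float(np.squeeze(t_res.tvalue))
print(F_str, F_mat, t_val ** 2)
```

The matrix form is useful for testing several restrictions jointly: each additional row of `R` adds one restriction to the null hypothesis.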
