# Regression

Here we perform a basic regression analysis.


## Using statsmodels

This follows the example in Chapter 21.1 (Regression) of Kevin Sheppard's Introduction to Python (https://www.kevinsheppard.com/files/teaching/python/notes/python_introduction_2021.pdf). The statsmodels package has good documentation here:
https://www.statsmodels.org/stable/index.html



In [1]:
import statsmodels.api as sm
# load the state crime dataset that ships with statsmodels, using pandas containers
d = sm.datasets.statecrime.load_pandas()


The data are now loaded into `d`. This is a statsmodels dataset object, and you can see the actual spreadsheet (a pandas DataFrame) using the `.data` attribute.

In [2]:
d.data




state,violent,murder,hs_grad,poverty,single,white,urban
Alabama,459.9,7.1,82.1,17.5,29.0,70.0,48.65
Alaska,632.6,3.2,91.4,9.0,25.5,68.3,44.46
Arizona,423.2,5.5,84.2,16.5,25.7,80.0,80.07
Arkansas,530.3,6.3,82.4,18.8,26.3,78.4,39.54
California,473.4,5.4,80.6,14.2,27.8,62.7,89.73
Colorado,340.9,3.2,89.3,12.9,21.4,84.6,76.86
Connecticut,300.5,3.0,88.6,9.4,25.0,79.1,84.83
Delaware,645.1,4.6,87.4,10.8,27.6,71.9,68.71
District of Columbia,1348.9,24.2,87.1,18.4,48.0,38.7,100.0
Florida,612.6,5.5,85.3,14.9,26.6,76.9,87.44


This `d` dataset object has further attributes (see details here: https://www.statsmodels.org/stable/datasets/index.html#available-datasets). Among other things, the data have been pre-partitioned into endogenous (explained) and exogenous (explanatory) variables.

In [3]:
print(d.endog_name)
d.endog.head(n=10) # only showing first 10 rows

murder


state
Alabama                  7.1
Alaska                   3.2
Arizona                  5.5
Arkansas                 6.3
California               5.4
Colorado                 3.2
Connecticut              3.0
Delaware                 4.6
District of Columbia    24.2
Florida                  5.5
Name: murder, dtype: float64

In [4]:
print(d.exog_name)
d.exog.head(n=10)

['urban', 'poverty', 'hs_grad', 'single']


state,urban,poverty,hs_grad,single
Alabama,48.65,17.5,82.1,29.0
Alaska,44.46,9.0,91.4,25.5
Arizona,80.07,16.5,84.2,25.7
Arkansas,39.54,18.8,82.4,26.3
California,89.73,14.2,80.6,27.8
Colorado,76.86,12.9,89.3,21.4
Connecticut,84.83,9.4,88.6,25.0
Delaware,68.71,10.8,87.4,27.6
District of Columbia,100.0,18.4,87.1,48.0
Florida,87.44,14.9,85.3,26.6


Before we can estimate a regression model, we specify it and save that specification in an object `mod`. The first argument specifies the explained variable and the second the explanatory variables. The result is an object of the model class OLS (see the statsmodels documentation linked above for other model classes).

In [5]:
mod = sm.OLS(d.endog, d.exog)

To confirm what type of object `mod` is, you can run `type(mod)`, which shows that `mod` is of type "statsmodels.regression.linear_model.OLS". Every object of that type has some attributes and methods associated with it. You can list them by running `dir(mod)`.

You can think of attributes as characteristics of the object and of methods as tools that can be applied to the object. The method of immediate importance is the one that actually estimates (or fits) the model. We apply that method to the object `mod` using the command `mod.fit()` and save the result in `res`.

In [6]:
res = mod.fit()

In [7]:
print(res.summary())

                                 OLS Regression Results                                
Dep. Variable:                 murder   R-squared (uncentered):                   0.915
Model:                            OLS   Adj. R-squared (uncentered):              0.908
Method:                 Least Squares   F-statistic:                              126.9
Date:                Thu, 17 Aug 2023   Prob (F-statistic):                    1.45e-24
Time:                        17:17:41   Log-Likelihood:                         -101.53
No. Observations:                  51   AIC:                                      211.1
Df Residuals:                      47   BIC:                                      218.8
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

This is very much like the standard regression output you would see from most statistical computing packages. One thing you may note is that two kinds of degrees of freedom (Df) are reported: the model and the residual degrees of freedom. The model Df tells you how many explanatory variables were used (here 4), and the residual Df is the number of observations minus the number of estimated coefficients, here 51 - 4 = 47. The latter is the usual definition of degrees of freedom in the context of regression models.
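The degrees-of-freedom arithmetic can be checked by hand. The sketch below uses only the counts printed in the summary; the variable names are illustrative.

```python
# Counts read off the regression summary
n_obs = 51    # "No. Observations"
n_coef = 4    # estimated coefficients: urban, poverty, hs_grad, single

# This model has no constant, so every coefficient belongs to an explanatory variable
df_model = n_coef

# Residual degrees of freedom: observations minus estimated coefficients
df_resid = n_obs - n_coef

print(df_model, df_resid)  # 4 47
```

The fitted results object also exposes these numbers as `res.df_model` and `res.df_resid`, so they can be compared against the hand calculation.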