Title: Reading A Linear Regression Output Table  
Slug: reading_a_linear_regression_output_table  
Summary: Reading A Linear Regression Output Table  
Date: 2016-05-01 12:00  
Category: Frequentist Statistics 
Tags: Basics  
Authors: Chris Albon  

Source: [An Introduction to Statistical Learning](https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370), [ISL-python repo](http://nbviewer.jupyter.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%203.ipynb).

Data source: [ISL's webpage](http://www-bcf.usc.edu/~gareth/ISL/data.html)

This tutorial describes many of the statistics you will see in a typical regression output table. While this particular regression is run using the statsmodels package for Python, a similar table appears in SPSS, Stata, R, and other statistical software.

The data for this tutorial comes from [An Introduction to Statistical Learning](https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370), and the regression run here is the exact regression described in the book. For this reason, if you want a more in-depth description of anything you see here, check out the book.

## Preliminaries

In [19]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Load Data

In [6]:
advertising = pd.read_csv('../data/isl/Advertising.csv')

## Train Model

This is a simple ordinary least squares regression that trains a model to predict a product's sales (in thousands of units) from the amount (in thousands of dollars) spent on TV ads, radio ads, and newspaper ads. Specifically, the model is:

$$\hat{y}_{sales} = \beta_{0}+\beta_{1}x_{Television}+\beta_{2}x_{Radio}+\beta_{3}x_{Newspaper}$$

In [21]:
# Train the model
est = smf.ols('Sales ~ TV + Radio + Newspaper', advertising).fit()

## View Output

The two tables below should be the output tables you are used to seeing. They are often displayed together, but for the sake of organization I explain the elements of each one separately.

In [16]:
# View the first output table
est.summary().tables[0]

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Fri, 19 Aug 2016 | Prob (F-statistic): | 1.58e-96 |
| Time: | 12:46:38 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |


### No. Observations

The number of observations used in the analysis. In this case, we trained our model using 200 observations.

### Df Residuals

The number of degrees of freedom remaining. That is, the total degrees of freedom ($n - 1 = 199$) minus the model degrees of freedom ($3$), which gives $196$.

### Df Model

The number of degrees of freedom used to train the model. In this case we trained our model to find four coefficients ($\beta_{0}$, $\beta_{1}$, $\beta_{2}$, and $\beta_{3}$) and thus used the number of coefficients minus one, or three, degrees of freedom.
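As a quick check of the two degrees-of-freedom entries, the arithmetic can be sketched in a few lines. The variable names are illustrative, and the values are simply copied from the output table above.

```python
# Values read from the output table above
n_obs = 200      # No. Observations
n_coefs = 4      # intercept plus TV, Radio and Newspaper

# The intercept is not counted in the model degrees of freedom
df_model = n_coefs - 1

# Equivalent to (n_obs - 1) - df_model
df_resid = n_obs - n_coefs

print(df_model, df_resid)
```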

### R-squared

The proportion of the variance in the dependent variable (in this case `Sales`) explained by the independent variables (`TV`, `Radio`, and `Newspaper`). R-squared is commonly used to judge how well the model fits the data. R-squared ranges between 0 and 1 and is independent of the unit scale (in our model, dollars spent) of the data itself.

R-squared is calculated:

$$ R^{2} = 1 - \frac{RSS}{TSS} $$

where $TSS$ is the total sum of squares, a measure of the sum of the squared difference between each value of the dependent variable and the dependent variable's mean value:

$$\sum _{{i=1}}^{{n}}\left(y_{{i}}-{\bar  {y}}\right)^{2}$$ 

and where $RSS$ is the residual sum of squares, a measure of the model's accuracy. Specifically it is the sum of the squared difference between each of the trained model's predicted y values and the actual y values:

$$\sum_{i=1}^{n}(y_{i}-f(x_{i}))^{2}$$
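To make the two sums of squares concrete, here is a minimal sketch in plain Python. The five data points are made up for illustration (they are not the Advertising data), and the model is a one-predictor least squares line fit with the closed-form slope and intercept.

```python
# Toy data (hypothetical, not the Advertising data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(y)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least squares fit for a single predictor
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
y_hat = [intercept + slope * xi for xi in x]

# Total sum of squares: spread of y around its mean
tss = sum((yi - y_mean) ** 2 for yi in y)

# Residual sum of squares: spread of y around the fitted values
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

r_squared = 1 - rss / tss
print(round(tss, 3), round(rss, 3), round(r_squared, 3))
```

For this toy data the line explains 60% of the variance in `y`.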

### Adj. R-squared

One problem with R-squared is that as you increase the number of predictors in a model, the variance explained by that model will never decrease, even if those predictors are not related to the response variable. This can become a problem if you are comparing the R-squared of a model with two predictors with the R-squared of a model with 200 predictors. In this case, the second model might have a greater R-squared simply because it contains more predictors. Adjusted R-squared compensates for this:

$$R^{2}_{Adjusted} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$$

where $n$ is the number of observations, and $d$ is the number of predictors (independent variables).

Because the number of predictors, $d$, is included in the calculation, the adjusted R-squared statistic is penalized for each additional predictor. The idea is that each new predictor added has to be useful enough to the model that it is worth the penalty paid for its inclusion.
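Since $RSS/TSS = 1 - R^{2}$, the adjusted R-squared can also be computed directly from R-squared, $n$, and $d$. The sketch below does that with the rounded values from the output table; the function name is illustrative. Because the input 0.897 is itself rounded, the result (roughly 0.895) can differ from the reported 0.896 in the last digit.

```python
def adjusted_r_squared(r_squared, n, d):
    """Adjusted R-squared from R-squared, n observations and d predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - d - 1)

# Rounded values from the output table above
adj = adjusted_r_squared(0.897, n=200, d=3)
print(round(adj, 3))
```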

### F-statistic

The F-statistic tests the null hypothesis that all of the coefficients other than the intercept are equal to zero:

$$ \beta_{1} =\beta_{2}=\cdots=\beta_{p}=0 $$

The F-statistic is calculated:

$$F=\frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$

where $n$ is the number of observations and $p$ is the number of predictors (written $d$ in the adjusted R-squared formula). The statistic works because, under the linear model assumptions, the expectation of the denominator is $\sigma^{2}$ (the variance of the error term), and if all the coefficients collectively equal zero, then the expectation of the numerator is also $\sigma^{2}$. In that case the F-statistic will be close to 1.

The reason you might look at the F-statistic first, instead of the individual coefficient p-values, is that if you have a large number of predictors unrelated to the response variable, some of them will be statistically significant just by chance. The F-statistic does not face this problem because it adjusts for the number of predictors: the explained sum of squares in the numerator is divided by $p$, so each additional predictor has to earn its place.
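The F-statistic formula can be sketched as a small function. In the example call, TSS is scaled to 1 so that RSS equals $1 - R^{2}$, using the rounded R-squared from the output table; the function name and this scaling trick are illustrative choices. The result lands near 569, close to the reported 570.3, with the gap coming from the rounded input.

```python
def f_statistic(tss, rss, n, p):
    """F-statistic for the null that all p slope coefficients are zero."""
    explained_per_predictor = (tss - rss) / p
    residual_variance = rss / (n - p - 1)
    return explained_per_predictor / residual_variance

# Scale TSS to 1 so that RSS = 1 - R^2 (R^2 = 0.897 from the table)
f_approx = f_statistic(tss=1.0, rss=1.0 - 0.897, n=200, p=3)
print(round(f_approx, 1))
```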

### Prob (F-statistic)

The p-value of the F-statistic. If it is not significant, then we cannot reject the null hypothesis:

$$ H_{0}:\beta_{1} =\beta_{2}=\cdots=\beta_{p}=0 $$

### AIC

Akaike information criterion (AIC) is another measure of the quality of the model, like R-squared. In general, models with lower AIC are preferable to models with higher AIC. For a least squares model it can be written:

$$AIC=\frac{1}{n\sigma^{2}}(RSS+2d\sigma^{2})$$

where $n$ is the number of observations, $d$ is the number of predictors, and $\sigma^{2}$ is the variance of the error term (in practice, an estimate of it). The term $2d\sigma^{2}$ is essentially a penalty placed upon the score to account for the fact that training errors tend to be lower than test errors.

### BIC

BIC is like AIC in that it takes a small value for a model with a low test error, but its penalty term replaces the $2$ used in AIC with $\log(n)$:

$$ BIC = \frac{1}{n}(RSS+\log(n)d\sigma^{2}) $$

Since $\log(n)$ is greater than 2 for any data set with more than seven observations, BIC places a heavier penalty on models with many variables than AIC does.
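The AIC and BIC values printed in the output table are not on the same scale as the RSS-based formulas above. statsmodels reports the likelihood-based versions, $-2\ell + 2k$ and $-2\ell + \log(n)k$, where $\ell$ is the log-likelihood and $k$ counts the estimated coefficients including the intercept. A small sketch that reproduces the table entries from the other numbers shown in it (variable names are illustrative):

```python
import math

# Values read from the output table above
log_likelihood = -386.18   # Log-Likelihood
n_obs = 200                # No. Observations
k = 4                      # intercept plus three predictors

aic = -2 * log_likelihood + 2 * k
bic = -2 * log_likelihood + math.log(n_obs) * k

print(round(aic, 1), round(bic, 1))
```

The only difference between the two is the multiplier on $k$: 2 for AIC versus $\log(200) \approx 5.3$ for BIC.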

In [25]:
# View the second output table
est.summary().tables[1]

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 0.049 |
| Radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 0.206 |
| Newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 0.011 |


### coef

The effect size. Specifically, in the case of linear regression, the change in $y$ per one-unit change in $x$, holding the other predictors fixed.
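One way to see what a coefficient means is to plug the fitted values into the model equation and change a single predictor by one unit. The sketch below uses the coefficients from the table; the budget values and the helper name `predict_sales` are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the output table above
coefs = {"Intercept": 2.9389, "TV": 0.0458, "Radio": 0.1885, "Newspaper": -0.0010}

def predict_sales(tv, radio, newspaper):
    """Predicted sales from the fitted linear model."""
    return (coefs["Intercept"]
            + coefs["TV"] * tv
            + coefs["Radio"] * radio
            + coefs["Newspaper"] * newspaper)

# Hypothetical budgets; only TV changes between the two predictions
base = predict_sales(tv=100, radio=20, newspaper=10)
plus_one_tv = predict_sales(tv=101, radio=20, newspaper=10)

# The difference equals the TV coefficient
print(round(plus_one_tv - base, 4))
```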

### std err

The standard deviation of the estimate of a coefficient. It is a measure of how well the model estimates the unknown true coefficient value.

### t

Calculated as the coefficient estimate divided by its standard error, the t-statistic tests whether the coefficient is different from zero.
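As a check, the t-statistic for the intercept can be recomputed from the two columns to its left. The result agrees with the reported 9.422 up to rounding. For rows with a very small standard error, such as `TV`, the three-decimal `std err` shown in the table is too coarse to reproduce the reported t-statistic this way.

```python
# Intercept row of the output table above
coef = 2.9389
std_err = 0.312

t_stat = coef / std_err
print(round(t_stat, 2))
```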

### P>|t|

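The p-value of the t-statistic: the probability of observing a t-statistic at least this far from zero if the true coefficient were zero.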

### [95.0% Conf. Int.]

The 95% confidence interval. If we repeated the sampling many times, about 95% of the intervals constructed this way would contain the true coefficient value. If the interval includes zero, then we cannot reject the null hypothesis that a coefficient has no effect.
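The interval itself is the coefficient estimate plus or minus a critical t-value times the standard error. The sketch below rebuilds the intercept's interval from the table. The critical value 1.972 is an assumed, hard-coded approximation of the 97.5th percentile of a t-distribution with 196 degrees of freedom; with SciPy installed, `scipy.stats.t.ppf(0.975, 196)` would compute it instead.

```python
# Intercept row of the output table above
coef = 2.9389
std_err = 0.312

# Assumed approximate critical value for a 95% interval with 196 residual df
t_crit = 1.972

lower = coef - t_crit * std_err
upper = coef + t_crit * std_err
print(round(lower, 3), round(upper, 3))
```

Up to rounding, this matches the 2.324 and 3.554 shown in the table.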