# Goodness-of-fit, Fitted values, Residuals

In [None]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm

In [None]:
# import sleep75 dataset
df_sleep = pd.read_csv('sleep75.csv')
df_sleep.shape # dataset size

## Regression model
Consider the regression of **sleep on totwrk & age**.
Its specification is
$$sleep=\beta_0+\beta_1totwrk+\beta_2age+u $$
Here
* sleep is the endogenous/dependent variable
* totwrk & age are the exogenous/explanatory variables (predictors)

Fitting result

In [None]:
mod = smf.ols(formula='sleep~totwrk+age', data=df_sleep).fit()
# estimated coefficients: .params property
mod.params

## TSS, ESS, RSS
Related sums of squares
\begin{align*}
TSS&=\sum_{i=1}^n(y_i-\bar{y})^2 & ESS&=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2 & 
RSS&=\sum_{i=1}^n(y_i-\hat{y}_i)^2=\sum_{i=1}^n e_i^2
\end{align*}

In [None]:
print('TSS=', mod.centered_tss)
print('ESS=', mod.ess)
print('RSS=', mod.ssr)

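For a model with an intercept, these sums satisfy a Pythagorean-type identity: $TSS=ESS+RSS$, or equivalently $TSS-ESS-RSS=0$. Numerically, the difference is zero only up to floating-point rounding error.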

In [None]:
mod.centered_tss-mod.ess-mod.ssr
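
The decomposition can also be checked by hand. The sketch below is a minimal illustration on synthetic data (not sleep75; the variables `x1`, `x2`, `y` and the coefficients are made up for this example). It fits OLS with `np.linalg.lstsq` and computes the three sums directly from their definitions.

```python
import numpy as np

# Synthetic data (hypothetical, NOT sleep75): y is linear in x1 and x2 plus noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

# Design matrix with an intercept column, then the least-squares fit
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

# Sums of squares computed from their definitions
tss = np.sum((y - y.mean()) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)

print('TSS - ESS - RSS =', tss - ess - rss)
print('Decomposition holds:', np.isclose(tss, ess + rss))
```

The printed difference is typically a tiny nonzero number, which is why the comparison uses `np.isclose` rather than `==`.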

## Goodness-of-fit
Goodness-of-fit measure: R-squared
$$R^2=\frac{ESS}{TSS}=1-\frac{RSS}{TSS} $$
How do we interpret it?

In [None]:
mod.rsquared

In [None]:
1-mod.ssr/mod.centered_tss

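Adjusted R-squared: $R^2_{adj}=1-(1-R^2)\cdot\frac{n-1}{n-k-1}$, where $n$ is the number of observations and $k$ is the number of regressors excluding the intercept (here $k=2$).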

In [None]:
mod.rsquared_adj
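
The formula is easy to evaluate directly. A minimal sketch follows; the helper name `adjusted_r2` and the numbers passed to it are hypothetical and chosen only for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers only (not taken from sleep75)
print(adjusted_r2(0.5, n=101, k=2))
```

With the fitted model above, `adjusted_r2(mod.rsquared, int(mod.nobs), int(mod.df_model))` should reproduce `mod.rsquared_adj` up to rounding.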

## Fitted values & Residuals
Dependent variable, fitted values, and residuals for the observations with indices [0, 3, 78, 197, 401, 561]

In [None]:
ind = [0, 3, 78, 197, 401, 561]
# Dependent variable
df_sleep['sleep'].iloc[ind]

In [None]:
# Fitted values
mod.fittedvalues[ind]

In [None]:
# Residuals
mod.resid[ind]

In [None]:
# In a table/DataFrame
df = pd.DataFrame({
    'Dependent': df_sleep['sleep'].iloc[ind],
    'Fitted': mod.fittedvalues[ind],
    'Residual': mod.resid[ind],
})
df
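
The three columns are linked by the definition of a residual, $e_i=y_i-\hat{y}_i$, so each dependent value equals its fitted value plus its residual. The sketch below illustrates this on synthetic data (not sleep75; the variable `x`, the coefficients, and the selected positions are made up), and also shows that the residuals sum to approximately zero when the model includes an intercept.

```python
import numpy as np

# Synthetic data (hypothetical, NOT sleep75)
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)

# OLS fit with an intercept
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted  # e_i = y_i - y_hat_i

# Inspect a few observations by position
ind = [0, 3, 78]
for i in ind:
    print(i, y[i], fitted[i], resid[i])

print('Dependent = Fitted + Residual:', np.allclose(y[ind], fitted[ind] + resid[ind]))
print('Sum of residuals:', resid.sum())
```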