# Chapter 3. Linear Regression - Lab

- [3.6 Lab: Linear Regression](#Lab:-Linear-Regression)

## Lab: Linear Regression

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices

In [2]:
boston_df = pd.read_csv('../data/boston.csv', index_col=0)

In [3]:
boston_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  black    506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB


In [4]:
boston_df.describe().applymap('{:3.2f}'.format)

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.8 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 | 22.53 |
| std | 8.6 | 23.32 | 6.86 | 0.25 | 0.12 | 0.7 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 | 9.2 |
| min | 0.01 | 0.0 | 0.46 | 0.0 | 0.39 | 3.56 | 2.9 | 1.13 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.08 | 0.0 | 5.19 | 0.0 | 0.45 | 5.89 | 45.02 | 2.1 | 4.0 | 279.0 | 17.4 | 375.38 | 6.95 | 17.02 |
| 50% | 0.26 | 0.0 | 9.69 | 0.0 | 0.54 | 6.21 | 77.5 | 3.21 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.68 | 12.5 | 18.1 | 0.0 | 0.62 | 6.62 | 94.07 | 5.19 | 24.0 | 666.0 | 20.2 | 396.23 | 16.96 | 25.0 |
| max | 88.98 | 100.0 | 27.74 | 1.0 | 0.87 | 8.78 | 100.0 | 12.13 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [5]:
boston_df = pd.get_dummies(boston_df, columns=['chas'])

In [6]:
boston_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   nox      506 non-null    float64
 4   rm       506 non-null    float64
 5   age      506 non-null    float64
 6   dis      506 non-null    float64
 7   rad      506 non-null    int64  
 8   tax      506 non-null    int64  
 9   ptratio  506 non-null    float64
 10  black    506 non-null    float64
 11  lstat    506 non-null    float64
 12  medv     506 non-null    float64
 13  chas_0   506 non-null    uint8  
 14  chas_1   506 non-null    uint8  
dtypes: float64(11), int64(2), uint8(2)
memory usage: 56.3 KB


In [7]:
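# Build a patsy formula from every column except the response `medv`
predictors = ' + '.join(boston_df.drop('medv', axis=1).columns)
y, X = dmatrices('medv ~ ' + predictors, return_type='dataframe', data=boston_df)
# X also contains the Intercept column added by patsy, so subtract 1
print(f'Number of predictor columns: {X.shape[1] - 1}')

Number of predictor columns: 14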
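
A caveat on the encoding above: `chas_0 + chas_1` equals 1 in every row, which duplicates the intercept column that the formula adds, so the design matrix is rank deficient. This is consistent with the `Df Model: 13` reported in the regression summary further down, even though 14 predictor columns enter the formula. The sketch below illustrates the effect on a small made-up indicator (the toy values are an assumption, not the Boston data), using only `numpy`.

```python
import numpy as np

# Hypothetical toy data: a binary indicator like `chas` for six observations.
chas = np.array([0, 1, 0, 0, 1, 0])
intercept = np.ones_like(chas)

# Design with an intercept and BOTH dummy columns (like chas_0 and chas_1).
chas_0 = 1 - chas
chas_1 = chas
X_both = np.column_stack([intercept, chas_0, chas_1])

# Design with an intercept and only ONE dummy column.
X_one = np.column_stack([intercept, chas_1])

rank_both = np.linalg.matrix_rank(X_both)
rank_one = np.linalg.matrix_rank(X_one)

# Three columns but rank 2: one column is a linear combination of the others.
print(f'both dummies: columns={X_both.shape[1]}, rank={rank_both}')
print(f'one dummy:    columns={X_one.shape[1]}, rank={rank_one}')
```

One common way to avoid the redundant column is `pd.get_dummies(..., drop_first=True)`; since `chas` is already coded 0/1, leaving it as a single numeric column would also work. Either choice keeps the fitted values the same but makes the individual coefficients uniquely determined.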


## Data science questions and answers

Let's try to answer some questions concerning the _Boston_ data set.

1. Is there a relationship between the median house price and the predictors?

In [8]:
results = sm.OLS(y, X).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                   medv   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Wed, 29 Dec 2021   Prob (F-statistic):          6.72e-135
Time:                        17:23:28   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     25.2019      3.407      7.398      0.0

The $F$-statistic is 108.1 with a $p$-value of 6.72e-135, so we reject the null hypothesis that all slope coefficients are zero: there is a clear relationship between the median house price and the predictors.
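
As a sanity check, the $F$-statistic can be recovered from $R^2$ through $F = \frac{R^2/p}{(1-R^2)/(n-p-1)}$. The sketch below plugs in the values printed in the summary; because the summary rounds $R^2$ to three decimals, the result should land close to, but not exactly on, the reported 108.1.

```python
# Values copied from the regression summary above.
r_squared = 0.741   # R-squared (rounded to three decimals in the summary)
n = 506             # No. Observations
p = 13              # Df Model

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat = (r_squared / p) / ((1 - r_squared) / (n - p - 1))
print(f'F-statistic recovered from R-squared: {f_stat:.1f}')
```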

2. How strong is the relationship?

The $R^2$ is 0.741, so the model explains about 74% of the variance in `medv`. The residual standard error (RSE), computed with statsmodels, is:

In [17]:
print(f'Residual standard error: {round(np.sqrt(results.mse_resid), 3)}')

Residual standard error: 4.745


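The residual standard error can alternatively be computed directly from its definition, $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-p-1)}$, where $n - p - 1 = 492$ is the residual degrees of freedom reported in the summary above.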

In [18]:
rss = ((y.values.ravel() - results.fittedvalues)**2).sum()  # residual sum of squares
print(f'Residual standard error: {round(np.sqrt(rss / results.df_resid), 3)}')  # df_resid = n - p - 1 = 492

Residual standard error: 4.745
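
To make the two quantities used in this answer concrete, here is a minimal sketch that computes RSS, RSE and $R^2$ from scratch with `numpy`. It runs on synthetic data with a known noise level (the data, coefficients and variable names are assumptions for illustration, not the Boston set), so the RSE can be compared against the true noise standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: n observations, p predictors, known noise level.
n, p, noise_sd = 500, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta_true = np.array([1.0, 0.5, -2.0, 3.0])
y = X @ beta_true + rng.normal(scale=noise_sd, size=n)

# Ordinary least squares fit
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

rss = np.sum(residuals**2)           # residual sum of squares
tss = np.sum((y - y.mean())**2)      # total sum of squares
rse = np.sqrt(rss / (n - p - 1))     # residual standard error
r_squared = 1 - rss / tss            # share of variance explained

print(f'RSE: {rse:.3f} (true noise sd: {noise_sd})')
print(f'R-squared: {r_squared:.3f}')
```

The RSE estimates the standard deviation of the noise term and is in the units of the response, while $R^2$ is unitless; this is why both are reported when judging the strength of a fit.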


In [25]:
print(f'Average median house price: {round(np.mean(y.values), 3)}')

Average median house price: 22.533


In [33]:
print(f'RSE as a percentage of the mean response: {round(100*(np.sqrt(results.mse_resid)/np.mean(y.values)), 2)}%')

RSE as a percentage of the mean response: 21.06%

The RSE of 4.745 is therefore about 21% of the average `medv` of 22.533.
