# Introduction

In this tutorial, I will show you how to code your own linear regression in Python. Unlike most tutorials, I also show you how to obtain standard errors, t-values, p-values, and confidence intervals for all estimated coefficients.

This tutorial covers not only the implementation in Python but also the mathematical background. I hope that seeing the two side by side will give you a deeper understanding of how linear regression works.

Note: My implementation is based on the OLS estimator, which uses matrix algebra to compute the regression coefficients in closed form, rather than on iterative procedures such as gradient descent or numerical maximum likelihood estimation (MLE).

## Importing required packages

Let's start by importing the packages we need. We need pandas to read the data and to make pretty data frames.

NumPy will be used for nifty calculations like transposing matrices. Lastly, we need the 'f' and 't' objects from scipy.stats. These represent the F and t distributions, which we will use later to obtain the p-values of the regression as a whole (based on the F distribution) and of the individual regression coefficients (based on the t distribution).

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import f, t

## Loading the data

We will be using the Boston Housing dataset. I am lazy today, so I will not run any descriptive statistics. We load the data using pandas and examine the head.

In [253]:
boston=pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
boston.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2

We see that we have 14 variables in the data set. We will be using 'medv' as the dependent variable and the rest as the independent variables for the purpose of this tutorial.

Next, we separate the dependent variable (Y) from the independent variables (X).

In [None]:
X= boston.drop('medv', axis=1).values
Y= boston["medv"].values

We want to keep a list of the column names, which will form the index of the data frame that will hold our regression coefficients. First, we extract the column names from the boston dataset. Then we insert the intercept of the regression at the first position and delete the last column name (which, remember, belongs to the dependent variable).

In [229]:
#Create list of column names
colnames = list(boston)
colnames.insert(0, 'Intercept')
del colnames[-1]
colnames

['Intercept',
 'crim',
 'zn',
 'indus',
 'chas',
 'nox',
 'rm',
 'age',
 'dis',
 'rad',
 'tax',
 'ptratio',
 'b',
 'lstat']

Now we can create a dataframe that will hold the results of the regression and call it 'model'. As I just said, the colnames list will be used as the index. 

In [231]:
model = pd.DataFrame(index=colnames)
model.head()

Intercept
crim
zn
indus
chas


# Estimating the parameters!

## Estimating the betas

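Our preparations are done, so let's get started!
First, we need two quantities that we will use repeatedly below: N and p.
N is simply the number of observations in our dataset, in our case 506. p is the number of parameters the regression estimates: the 13 slope coefficients plus the intercept, so 14. (The code uses len(boston.columns), which also equals 14 because the 'medv' column takes the place of the intercept in the count.)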

In [232]:
N = len(X)
p = len(boston.columns)

In [233]:
X_with_intercept = np.empty(shape=(N, p), dtype=float)
X_with_intercept[:, 0] = 1
X_with_intercept[:, 1:p] = pd.DataFrame(X).values
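
The cell above prepends a column of ones to X so that the intercept is estimated like any other coefficient. As a minimal sketch of the same idea on made-up data (the array `X_small` and the other names are hypothetical and not part of the tutorial's dataset), `np.column_stack` builds an equivalent design matrix in one step:

```python
import numpy as np

# Hypothetical toy predictor matrix: 4 observations, 2 predictors
X_small = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0],
                    [7.0, 8.0]])
n_rows, n_predictors = X_small.shape

# Approach used in the tutorial: allocate an empty array, then fill the columns
design_filled = np.empty(shape=(n_rows, n_predictors + 1), dtype=float)
design_filled[:, 0] = 1           # first column: constant for the intercept
design_filled[:, 1:] = X_small    # remaining columns: the predictors

# Equivalent one-liner: stack a vector of ones next to the predictors
design_stacked = np.column_stack([np.ones(n_rows), X_small])

print(np.array_equal(design_filled, design_stacked))
```

Either form works; the one-liner simply avoids having to keep the column offsets in sync by hand.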

In [234]:
beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ pd.DataFrame(Y).values
print(beta_hat)

[[ 3.64594884e+01]
 [-1.08011358e-01]
 [ 4.64204584e-02]
 [ 2.05586264e-02]
 [ 2.68673382e+00]
 [-1.77666112e+01]
 [ 3.80986521e+00]
 [ 6.92224640e-04]
 [-1.47556685e+00]
 [ 3.06049479e-01]
 [-1.23345939e-02]
 [-9.52747232e-01]
 [ 9.31168327e-03]
 [-5.24758378e-01]]
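
The cell above implements the closed-form OLS estimator, beta_hat = (X'X)^(-1) X'y, with an explicit matrix inverse. The sketch below, which uses simulated data and hypothetical variable names rather than the Boston dataset, shows two alternatives that should give the same coefficients without forming the inverse: solving the normal equations with `np.linalg.solve`, and calling `np.linalg.lstsq`. Both are generally considered numerically more robust when predictors are strongly correlated.

```python
import numpy as np

# Simulated data (an assumption for illustration only)
rng = np.random.default_rng(0)
n_obs, n_pred = 200, 3
X_demo = rng.normal(size=(n_obs, n_pred))
true_beta = np.array([2.0, 0.5, -1.0, 3.0])          # intercept first, then slopes
X_design = np.column_stack([np.ones(n_obs), X_demo])  # add the column of ones
y_demo = X_design @ true_beta + rng.normal(scale=0.1, size=n_obs)

# 1) Explicit inverse, as in the tutorial
beta_inv = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y_demo

# 2) Solve the normal equations (X'X) beta = X'y without inverting X'X
beta_solve = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

# 3) Least-squares solver working directly on X and y
beta_lstsq, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```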


Next, we assign the betas to a new column, 'B', of the model dataframe. Let's take a look!

In [235]:
model['B']=beta_hat

In [236]:
model

Unnamed: 0,B
Intercept,36.459488
crim,-0.108011
zn,0.04642
indus,0.020559
chas,2.686734
nox,-17.766611
rm,3.809865
age,0.000692
dis,-1.475567
rad,0.306049


In [237]:
intercept = beta_hat[0]
#extract all other coefficients (the betas)
other_betas = beta_hat[1:]
other_betas

array([[-1.08011358e-01],
       [ 4.64204584e-02],
       [ 2.05586264e-02],
       [ 2.68673382e+00],
       [-1.77666112e+01],
       [ 3.80986521e+00],
       [ 6.92224640e-04],
       [-1.47556685e+00],
       [ 3.06049479e-01],
       [-1.23345939e-02],
       [-9.52747232e-01],
       [ 9.31168327e-03],
       [-5.24758378e-01]])

Next, we need to calculate the predicted values. They are not terribly interesting by themselves, but they are needed to calculate other parameters, such as the standard errors.

The predicted values are commonly called "y hat", based on their mathematical symbol ŷ (get it? It looks like a y wearing a hat). ŷ is simply the value of the dependent variable (the y) that our model predicts based on the intercept and slope coefficients.

ŷ is calculated by multiplying, for every observation, each regression coefficient by the observation's value on the corresponding variable, summing these products, and adding the intercept. In matrix notation, ŷ is the product of the matrix X and the vector of slope coefficients, plus the intercept.

We implement this using np.dot, which performs matrix multiplication.


In [238]:
y_hat = intercept + np.dot(X, other_betas)

In [239]:
y_hat

array([[30.00384338],
       [25.02556238],
       [30.56759672],
       [28.60703649],
       [27.94352423],
       [25.25628446],
       [23.00180827],
       [19.53598843],
       [11.52363685],
       [18.92026211],
       [18.99949651],
       [21.58679568],
       [20.90652153],
       [19.55290281],
       [19.28348205],
       [19.29748321],
       [20.52750979],
       [16.91140135],
       [16.17801106],
       [18.40613603],
       [12.52385753],
       [17.67103669],
       [15.83288129],
       [13.80628535],
       [15.67833832],
       [13.38668561],
       [15.46397655],
       [14.70847428],
       [19.54737285],
       [20.8764282 ],
       [11.45511759],
       [18.05923295],
       [ 8.81105736],
       [14.28275814],
       [13.70675891],
       [23.81463526],
       [22.34193708],
       [23.10891142],
       [22.91502612],
       [31.35762569],
       [34.21510225],
       [28.02056414],
       [25.20386628],
       [24.60979273],
       [22.94149176],
       ...])
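
Since the design matrix already contains a column of ones, the predicted values can also be obtained as a single matrix product of that design matrix and the full coefficient vector. The following sketch uses small made-up arrays (all names are hypothetical) to check that both routes agree:

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(5, 2))                 # hypothetical predictors
beta_toy = np.array([[1.5], [2.0], [-0.5]])     # intercept, then two slopes

intercept_toy = beta_toy[0]
slopes_toy = beta_toy[1:]

# Route used in the tutorial: intercept plus X times the slope coefficients
y_hat_split = intercept_toy + np.dot(X_toy, slopes_toy)

# Equivalent route: design matrix (with a column of ones) times all coefficients
X_toy_design = np.column_stack([np.ones(len(X_toy)), X_toy])
y_hat_full = X_toy_design @ beta_toy

print(np.allclose(y_hat_split, y_hat_full))
```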

## Calculating standard errors

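Calculating the standard errors is a bit more tedious. The standard error of each coefficient is the square root of the corresponding diagonal element of the inverse of the X'X matrix, multiplied by the mean squared error (MSE) of the residuals:

$$
    se(\hat{\beta}_j)=\sqrt{MSE \cdot \left[(X^{T}X)^{-1}\right]_{jj}}, \qquad MSE=\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{N-p}
$$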


In [240]:
#Calculate standard errors
#Standard error: square root of the diagonal of the inverse of X'X, multiplied by the MSE
#First, collect the actual and predicted values of Y in one data frame
Ys=pd.DataFrame()
Ys["Yactual"]=Y
Ys["Ypred"]=y_hat
Ys

Unnamed: 0,Yactual,Ypred
0,24.0,30.003843
1,21.6,25.025562
2,34.7,30.567597
3,33.4,28.607036
4,36.2,27.943524
...,...,...
501,22.4,23.533341
502,20.6,22.375719
503,23.9,27.627426
504,22.0,26.127967


In [241]:
Y.shape

(506,)

In [242]:
mse=(np.sum(np.square(Ys["Yactual"]-Ys["Ypred"])))/(N-p)
mse

22.517854833241827

In [243]:
var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * mse 

In [244]:
#The standard errors are the square roots of the diagonal of the covariance matrix
standarderrors = np.diag(var_beta_hat)**0.5

In [245]:
standarderrors

array([5.10345881e+00, 3.28649942e-02, 1.37274615e-02, 6.14956890e-02,
       8.61579756e-01, 3.81974371e+00, 4.17925254e-01, 1.32097820e-02,
       1.99454735e-01, 6.63464403e-02, 3.76053645e-03, 1.30826756e-01,
       2.68596494e-03, 5.07152782e-02])

In [246]:
model["SE"]=standarderrors
model

Unnamed: 0,B,SE
Intercept,36.459488,5.103459
crim,-0.108011,0.032865
zn,0.04642,0.013727
indus,0.020559,0.061496
chas,2.686734,0.86158
nox,-17.766611,3.819744
rm,3.809865,0.417925
age,0.000692,0.01321
dis,-1.475567,0.199455
rad,0.306049,0.066346
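
As a sanity check on the standard error calculation, the sketch below wraps the same steps in a small helper (the function name `ols_standard_errors` and the simulated data are assumptions made for illustration). It compares the matrix-based result with the textbook formula for the slope in a one-predictor regression, se = sqrt(MSE / sum((x - mean(x))^2)).

```python
import numpy as np

def ols_standard_errors(design, y):
    """Return (beta_hat, standard_errors) for a design matrix
    that already contains a column of ones."""
    n_obs, n_params = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    residuals = y - design @ beta
    mse = residuals @ residuals / (n_obs - n_params)   # residual variance estimate
    return beta, np.sqrt(np.diag(xtx_inv) * mse)

# Simulated one-predictor data (hypothetical)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)
design = np.column_stack([np.ones_like(x), x])

beta, se = ols_standard_errors(design, y)

# Textbook formula for the slope's standard error in simple regression
residuals = y - design @ beta
mse = np.sum(residuals**2) / (len(x) - 2)
se_slope_textbook = np.sqrt(mse / np.sum((x - x.mean())**2))

print(np.isclose(se[1], se_slope_textbook))
```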


## Calculating the t values

Calculating the t-values is super easy now: t is simply the regression coefficient divided by its standard error. We can simply divide the two columns of the data frame and assign the result to a new column, t.


$$
    t=\frac{b}{se}
$$

In [247]:
#Calculate t
model['t']=model["B"]/model["SE"]
model

Unnamed: 0,B,SE,t
Intercept,36.459488,5.103459,7.144074
crim,-0.108011,0.032865,-3.286517
zn,0.04642,0.013727,3.381576
indus,0.020559,0.061496,0.33431
chas,2.686734,0.86158,3.118381
nox,-17.766611,3.819744,-4.651257
rm,3.809865,0.417925,9.11614
age,0.000692,0.01321,0.052402
dis,-1.475567,0.199455,-7.398004
rad,0.306049,0.066346,4.6129


## Calculating p-values

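Based on the t-value, we can get the p-value of each regression coefficient.

This next step requires a bit more explanation. The p-value is the probability of observing a t-value at least as extreme as ours if the true coefficient were zero. Under that null hypothesis, t follows a t-distribution with N-p degrees of freedom (the number of observations minus the number of estimated parameters).

We will use the t.sf function, the survival function of the t-distribution, which returns the probability of a value larger than the one we pass in. Because the test is two-sided, we pass in the absolute t-value and multiply the result by 2.

In [249]:
p_values=np.around((t.sf(np.abs(model['t']), N-p)*2), 3)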
model['p']=p_values
model

Unnamed: 0,B,SE,t,p
Intercept,36.459488,5.103459,7.144074,0.0
crim,-0.108011,0.032865,-3.286517,0.001
zn,0.04642,0.013727,3.381576,0.001
indus,0.020559,0.061496,0.33431,0.738
chas,2.686734,0.86158,3.118381,0.002
nox,-17.766611,3.819744,-4.651257,0.0
rm,3.809865,0.417925,9.11614,0.0
age,0.000692,0.01321,0.052402,0.958
dis,-1.475567,0.199455,-7.398004,0.0
rad,0.306049,0.066346,4.6129,0.0
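
One way to gain confidence in hand-computed p-values is to compare them with an existing routine. The sketch below does this for a one-predictor regression on simulated data (all variable names are hypothetical), using `scipy.stats.linregress`, which reports a two-sided p-value and a standard error for the slope based on the residual degrees of freedom.

```python
import numpy as np
from scipy.stats import linregress, t

# Simulated data (an assumption for illustration only)
rng = np.random.default_rng(7)
n_obs = 40
x = rng.normal(size=n_obs)
y = 0.5 + 0.3 * x + rng.normal(size=n_obs)

# Manual OLS: coefficients, standard errors, t-values, p-values
design = np.column_stack([np.ones(n_obs), x])
n_params = design.shape[1]
xtx_inv = np.linalg.inv(design.T @ design)
beta = xtx_inv @ design.T @ y
residuals = y - design @ beta
mse = residuals @ residuals / (n_obs - n_params)
se = np.sqrt(np.diag(xtx_inv) * mse)
t_values = beta / se
p_manual = 2 * t.sf(np.abs(t_values), n_obs - n_params)   # two-sided test

# Reference values for the slope from scipy
reference = linregress(x, y)
print(np.isclose(se[1], reference.stderr))
print(np.isclose(p_manual[1], reference.pvalue))
```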


## Calculating 95% confidence intervals

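We can easily calculate the 95% confidence intervals from the regression coefficients and their corresponding standard errors.

Recall that, for large samples, the 95% confidence interval is approximately:

$$
    CI = B \pm 1.96 \times SE
$$

Here 1.96 is the critical value of the standard normal distribution. The exact critical value comes from the t-distribution with N-p degrees of freedom and is slightly larger, but with 506 observations the difference is tiny.

So, we can simply take the values of the SE column, multiply them by 1.96, and add them to (upper 95% CI) or subtract them from (lower 95% CI) the values in the B column. Easy peasy!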

In [250]:
model['ci_lower']=model["B"]-1.96*model["SE"]
model['ci_upper']=model["B"]+1.96*model["SE"]

model

Unnamed: 0,B,SE,t,p,ci_lower,ci_upper
Intercept,36.459488,5.103459,7.144074,0.0,26.456709,46.462268
crim,-0.108011,0.032865,-3.286517,0.001,-0.172427,-0.043596
zn,0.04642,0.013727,3.381576,0.001,0.019515,0.073326
indus,0.020559,0.061496,0.33431,0.738,-0.099973,0.14109
chas,2.686734,0.86158,3.118381,0.002,0.998037,4.37543
nox,-17.766611,3.819744,-4.651257,0.0,-25.253309,-10.279914
rm,3.809865,0.417925,9.11614,0.0,2.990732,4.628999
age,0.000692,0.01321,0.052402,0.958,-0.025199,0.026583
dis,-1.475567,0.199455,-7.398004,0.0,-1.866498,-1.084636
rad,0.306049,0.066346,4.6129,0.0,0.17601,0.436089
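
The constant 1.96 is the normal approximation. As a hedged sketch of the exact variant, the critical value can be taken from the t-distribution with `t.ppf`. The numbers for N, p, and the 'rm' coefficient are copied from the results above; the variable names are made up for this example.

```python
from scipy.stats import t

N, p = 506, 14                       # observations and estimated parameters
t_crit = t.ppf(0.975, N - p)         # upper 2.5% quantile of the t-distribution
print(round(t_crit, 4))

# Coefficient and standard error of 'rm' as reported in the table above
b_rm, se_rm = 3.809865, 0.417925
ci_normal = (b_rm - 1.96 * se_rm, b_rm + 1.96 * se_rm)       # approximation
ci_exact = (b_rm - t_crit * se_rm, b_rm + t_crit * se_rm)    # t-based interval
print(ci_normal)
print(ci_exact)
```

The t-based interval is marginally wider, which is worth keeping in mind when comparing the bounds digit by digit with output from statistical software.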


## Calculating model fit statistics

Finally, we compute the statistics that describe the fit of the model as a whole: R squared, adjusted R squared, the root MSE, and the F statistic with its p-value.

In [251]:
#Model fit statistics
#Residual sum of squares: sum of the squared differences between actual and predicted Y
rss=np.sum(np.square((Ys["Yactual"]-Ys["Ypred"])))
#Calculate the mean of Y
mean=np.mean(Ys["Yactual"])
#Calculate the sum of squares total: sum of the squared differences between Y and the mean of Y
sst = np.sum(np.square(Ys["Yactual"]-mean))
#Calculate the r_squared: 1 minus rss/sst
r_squared = 1 - (rss/sst)
#Adjusted r squared (p already counts the intercept, so the residual degrees of freedom are N-p)
r_sq_adj = 1- ((1-r_squared)*((N-1)/(N-p)))
#Root MSE
rmse=mse**0.5
#F and sig of F: mean square of the model divided by the mean squared error
msm=(np.sum(np.square(Ys["Ypred"]-mean)))/(p-1)
fval=msm/mse
#Compute p-value of F using f.sf from scipy
p_off = np.around(f.sf(fval, (p-1), (N-p)), 3)

modelfitvals=[N, r_squared, r_sq_adj, rmse, fval, p_off]
colnames_modelfit=["Number of observations", "R sq", "Adjusted R sq", "Root MSE", "F", "Prob>F"]
modelinfo = pd.DataFrame(modelfitvals, index=colnames_modelfit)
modelinfo

Unnamed: 0,0
Number of observations,506.0
R sq,0.740643
Adjusted R sq,0.73379
Root MSE,4.745298
F,108.076666
Prob>F,0.0
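
The F statistic and R squared are two views of the same decomposition: F equals (R²/(p-1)) divided by ((1-R²)/(N-p)). The short sketch below recomputes F from the rounded R squared reported above as a consistency check. The variable names are made up for this example, and small differences in the later decimals are expected because the input is rounded.

```python
from scipy.stats import f

N, p = 506, 14
r_squared = 0.740643      # R squared as reported above (rounded)

# F expressed through R squared: explained share per model degree of freedom
# divided by unexplained share per residual degree of freedom
f_from_r2 = (r_squared / (p - 1)) / ((1 - r_squared) / (N - p))
p_of_f = f.sf(f_from_r2, p - 1, N - p)

print(f_from_r2, p_of_f)
```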


In [252]:
model

Unnamed: 0,B,SE,t,p,ci_lower,ci_upper
Intercept,36.459488,5.103459,7.144074,0.0,26.456709,46.462268
crim,-0.108011,0.032865,-3.286517,0.001,-0.172427,-0.043596
zn,0.04642,0.013727,3.381576,0.001,0.019515,0.073326
indus,0.020559,0.061496,0.33431,0.738,-0.099973,0.14109
chas,2.686734,0.86158,3.118381,0.002,0.998037,4.37543
nox,-17.766611,3.819744,-4.651257,0.0,-25.253309,-10.279914
rm,3.809865,0.417925,9.11614,0.0,2.990732,4.628999
age,0.000692,0.01321,0.052402,0.958,-0.025199,0.026583
dis,-1.475567,0.199455,-7.398004,0.0,-1.866498,-1.084636
rad,0.306049,0.066346,4.6129,0.0,0.17601,0.436089


Let us now compare these results to the ones obtained from Stata. In Stata, we would type

import delimited https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv

to import the file and 

reg medv crim zn indus chas nox rm age dis rad tax ptratio b lstat

to run the regression. These are the results from Stata:




Note that Stata indicates the intercept with "_cons".

If we compare our coefficients with the ones from Stata we can see that they are the same. Neat! 