This notebook shows how linear regression works.

# 1. Set up

# 2. Import necessary libraries

In [1]:
import pandas as pd

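# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston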
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

In [2]:
import warnings

warnings.simplefilter("ignore")

# 3. Define global variables

# 4. Functions

# 5. Code

We are going to use the Boston dataset.

*The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns:*

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per 10,000 dollars
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's

## 5.1. Load and transform data

First, we load the dataset using the scikit-learn library:

In [3]:
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target  # Median value of owner-occupied homes

In [4]:
boston_df.head()

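| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |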


We take two variables from the dataset, RM and LSTAT, as the features used to train our linear regression model:

In [5]:
X = boston_df[['RM', 'LSTAT']]  
y = boston_df['MEDV']

Let's split the data into training and test sets:

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

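Should we fit the linear regression on this data as it is? Not if we want to compare coefficients.

An ordinary least squares fit gives the same predictions with or without scaling, but RM and LSTAT are measured on different scales, so their raw coefficients are not directly comparable. The features must be standardized in order to compare coefficients once the linear regression is fitted. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test set leaks into training.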

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
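
To make the standardization step concrete, here is a minimal sketch of the arithmetic that StandardScaler performs: subtract the training mean and divide by the training standard deviation (population standard deviation, ddof=0). The toy arrays and the names ending in `_toy` are hypothetical and are not taken from the Boston data.

```python
import numpy as np

# Hypothetical toy data (not the Boston dataset): two features on
# very different scales, split into training and test rows.
X_train_toy = np.array([[5.0, 30.0],
                        [6.0, 20.0],
                        [7.0, 10.0],
                        [6.5, 15.0],
                        [5.5, 25.0]])
X_test_toy = np.array([[6.0, 20.0],
                       [8.0, 5.0]])

# The statistics are computed on the training rows only.
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # ddof=0, matching StandardScaler

# The same training statistics are applied to both splits.
X_train_toy_scaled = (X_train_toy - train_mean) / train_std
X_test_toy_scaled = (X_test_toy - train_mean) / train_std

print(X_train_toy_scaled.mean(axis=0))  # each column is centered
print(X_train_toy_scaled.std(axis=0))   # each column has unit spread
print(X_test_toy_scaled)                # test rows use the training statistics
```

The test rows are generally not centered at zero after scaling, which is expected: they are expressed relative to the training distribution.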

## 5.2. Training

Let's initialize the model and train it:

In [8]:
model = LinearRegression()

# Train the model
model.fit(X_train_scaled, y_train)

LinearRegression()

Once the model is trained, let's compute predictions for the X_test_scaled data:

In [9]:
y_pred = model.predict(X_test_scaled)

## 5.3. Metrics calculation

In [10]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [11]:
print(f"The mean squared error value is: {mse}")
print(f"The r2 value is: {r2}")

The mean squared error value is: 31.243290601783624
The r2 value is: 0.5739577415025859
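
An R2 of about 0.57 means the model accounts for roughly 57% of the variance of MEDV in the test set. The MSE is expressed in squared target units, so its square root (about 5.59, i.e. roughly $5,590 since MEDV is in $1000's) is easier to read. As a sketch of what the two metrics compute, the snippet below reproduces their formulas with NumPy on hypothetical toy values (the `_toy` arrays are illustrative, not the notebook's test set).

```python
import numpy as np

# Hypothetical toy values, not the notebook's actual test set.
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.0, 7.5, 10.0])

residuals = y_true_toy - y_pred_toy

# Mean squared error: average of the squared residuals.
mse_toy = np.mean(residuals ** 2)

# R2: one minus the ratio of the residual sum of squares to the
# total sum of squares around the mean of the true values.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

# Root mean squared error: same units as the target.
rmse_toy = np.sqrt(mse_toy)

print(f"MSE:  {mse_toy}")
print(f"RMSE: {rmse_toy}")
print(f"R2:   {r2_toy}")
```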


## 5.4. Interpretation

To understand the model, we need to look at the coefficients, **which can be compared with each other because the features have been scaled**:

In [12]:
coef_df = pd.DataFrame({'Feature': list(X.columns) + ["intercept"], 
                        'Coefficient': list(model.coef_) + [model.intercept_]})
coef_df

| | Feature | Coefficient |
|---|---|---|
| 0 | RM | 3.872422 |
| 1 | LSTAT | -4.491736 |
| 2 | intercept | 22.796535 |


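The RM coefficient is 3.87. Because the features were standardized, this means that an increase of one standard deviation in the average number of rooms (not one additional room) is associated with an estimated increase of 3.87 units in the target variable, i.e. about $3,870 in median house value, holding LSTAT constant. Likewise, an increase of one standard deviation in LSTAT is associated with an estimated decrease of 4.49 units. LSTAT has the slightly larger coefficient in absolute value, so it has the stronger association with MEDV in this model. The intercept (22.80) is the predicted MEDV when both features are at their training means.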
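
Coefficients fitted on standardized features are expressed per standard deviation of each feature. To read them per original unit (for example, per additional room), each coefficient can be divided by the standard deviation of its feature. The sketch below checks this on hypothetical synthetic data with a plain NumPy least-squares fit; the helper `fit_ols` and every `_toy` name are assumptions made for illustration. In the notebook's terms, the conversion would correspond to dividing model.coef_ by scaler.scale_.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data (not the Boston dataset): two features on
# different scales and a noisy linear target.
X_toy = np.column_stack([rng.normal(6.0, 0.7, size=200),
                         rng.normal(12.0, 7.0, size=200)])
y_toy = 5.0 * X_toy[:, 0] - 0.6 * X_toy[:, 1] + rng.normal(0.0, 1.0, size=200)


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, intercept)."""
    design = np.column_stack([X, np.ones(len(X))])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params[:-1], params[-1]


# Fit on standardized features (same arithmetic as StandardScaler).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
coef_scaled, intercept_scaled = fit_ols((X_toy - mean) / std, y_toy)

# Convert back to original units: per-unit slope and matching intercept.
coef_original = coef_scaled / std
intercept_original = intercept_scaled - np.sum(coef_scaled * mean / std)

# Fit directly on the raw features for comparison.
coef_raw, intercept_raw = fit_ols(X_toy, y_toy)

print("per standard deviation:", coef_scaled)
print("per original unit:     ", coef_original)
print("matches raw fit:       ", np.allclose(coef_original, coef_raw))
print("intercept is mean of y:", np.isclose(intercept_scaled, y_toy.mean()))
```

The last check illustrates why the intercept of a model fitted on centered features equals the mean of the training target.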