# Time to progress on to regress(ion)

In dataexploration_01.ipynb, I explored the training data and cleaned it up.

In this notebook, I'll run a few different types of linear regression models, starting with simple OLS and moving on to some more techy ML models.

The goal is to run different models using the **exact same data**, so I can compare their performance. This means that, although some models need different manipulations of the data to maximise performance (e.g. normality), I will forgo these for consistency.



Let's start off by importing the relevant libraries:

In [23]:
import pandas as pd                                 # Python library for handling structured data
from pathlib import Path                            # Clean way to work with file and folder paths
import matplotlib.pyplot as plt                     # Loads Matplotlib's pyplot module for plotting graphs
import seaborn as sns                               # Statistical plotting library built on top of Matplotlib
import numpy as np                                  # Fundamental library for arrays and numerical operations
from scipy.stats import norm                        # Normal distribution object from SciPy's stats module
from scipy.stats import skew                        # Sample skewness of a distribution
from scipy.stats import pearsonr                    # Pearson correlation coefficient and its p-value
from sklearn.preprocessing import StandardScaler    # Tool to standardise data from scikit-learn
from scipy import stats                             # Loads SciPy's stats module (e.g. t-tests, correlations, distributions)
%matplotlib inline
# ^^ Makes Matplotlib plots appear inside the notebook (can zoom) ^^
%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
# ^^ Makes plots look better on high-res screens ^^

Now, let's import our data and check that our clean training set is as we left it:

In [24]:
test = pd.read_csv(Path('../data/raw/test.csv'))
train_clean = pd.read_csv(Path('../data/processed/train_clean.csv'))
train_clean.head()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageCond_TA,PoolQC_Ex,PoolQC_Gd,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_Shed
0,1,60,8450,7,5,2003,2003,706.0,0.0,150.0,...,False,False,False,False,False,False,False,False,False,False
1,2,20,9600,6,8,1976,1976,978.0,0.0,284.0,...,False,False,False,False,False,False,False,False,False,False
2,3,60,11250,7,5,2001,2002,486.0,0.0,434.0,...,False,False,False,False,False,False,False,False,False,False
3,4,70,9550,7,5,1915,1970,216.0,0.0,540.0,...,False,False,False,False,False,False,False,False,False,False
4,5,60,14260,8,5,2000,2000,655.0,0.0,490.0,...,False,False,False,False,False,False,False,False,False,False


In [25]:
train_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1457 entries, 0 to 1456
Columns: 282 entries, Id to MiscFeature_Shed
dtypes: bool(247), float64(9), int64(26)
memory usage: 750.0 KB


Now let's create matrices with our data for sklearn:

In [26]:
X_train_full = train_clean.drop("SalePrice", axis=1)  # Features matrix (axis=1 drops columns, axis=0 drops rows)
y_train_full = train_clean["SalePrice"]               # Target variable (already log-transformed)

X_test = test.copy()                                  # Raw test features (no target variable in test set); not yet cleaned or encoded like train_clean
y_test = None                                         # No target variable in test set
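
One thing to watch: `test` is read from the raw CSV, while `train_clean` has already been cleaned and one-hot encoded, so `X_test` will not share its column layout with `X_train_full` as it stands. Below is a minimal sketch of one way to line the columns up, using toy frames. The names `toy_train`, `toy_test_raw` and `toy_test_aligned` are made up for illustration, and the real test set would also need the cleaning steps from the first notebook replayed on it.

```python
import pandas as pd

# Toy stand-ins (hypothetical): an already-encoded training frame and a raw test frame
toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Fence_GdPrv": [False, True, False],
    "Fence_MnPrv": [True, False, False],
})
toy_test_raw = pd.DataFrame({
    "LotArea": [7000, 12000],
    "Fence": ["MnPrv", None],
})

# One-hot encode the raw test frame, then force the training column layout.
# Dummy columns the test set never produced are filled with False;
# any columns the training frame does not have are dropped.
toy_test_encoded = pd.get_dummies(toy_test_raw, columns=["Fence"])
toy_test_aligned = toy_test_encoded.reindex(columns=toy_train.columns, fill_value=False)

print(toy_test_aligned)
```

Reindexing against the training columns keeps the feature order identical between fitting and predicting, which scikit-learn estimators rely on.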

Great, now let's split the training data into train and validation:

In [27]:
from sklearn.model_selection import train_test_split
from IPython.display import display

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full,       # This is the features matrix
    y_train_full,       # This is the target variable
    test_size=0.2,      # 20% of the data will be used for validation
    random_state=42     # Setting a random state ensures reproducibility of results
)

# This code splits the original training data into two sets:
# - X_train and y_train: These will be used to train your models.
# - X_valid and y_valid: These will be used to validate your models' performance.
# The split is 80% for training and 20% for validation, and random_state=42 ensures reproducibility.

display(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)
print("First 5 rows of X_train:")
display(X_train.head())
print("First 5 rows of X_valid:")
display(X_valid.head())

(1165, 281)

(292, 281)

(1165,)

(292,)

First 5 rows of X_train:


Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageCond_TA,PoolQC_Ex,PoolQC_Gd,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_Shed
254,255,20,8400,5,6,1957,1957,922.0,0.0,392.0,...,False,False,False,False,False,False,False,False,False,False
1362,1365,160,3180,7,5,2005,2005,0.0,0.0,600.0,...,False,False,False,False,False,False,False,False,False,False
636,638,190,6000,5,4,1954,1954,0.0,0.0,811.0,...,False,False,False,False,False,False,False,False,False,False
973,975,70,11414,7,8,1910,1993,0.0,0.0,728.0,...,False,False,False,False,False,False,False,False,False,False
514,515,45,10594,5,5,1926,1950,0.0,0.0,768.0,...,False,False,False,False,False,False,False,False,False,False


First 5 rows of X_valid:


Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageCond_TA,PoolQC_Ex,PoolQC_Gd,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_Shed
497,498,50,9120,7,6,1925,1950,329.0,0.0,697.0,...,False,False,False,False,False,False,False,False,False,False
1262,1264,70,13515,6,6,1919,1950,0.0,0.0,764.0,...,False,False,False,False,False,False,False,False,False,False
411,412,190,34650,5,5,1955,1955,1056.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
1047,1049,20,21750,5,4,1960,2006,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
1034,1036,20,11500,4,3,1957,1957,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False


Perfect! Now we can start running some models!

# Linear Regression (Ordinary Least Squares)

Definition:
- A linear approach to modelling the relationship between a dependent variable and one or more independent variables
- The model finds the best-fitting line (or hyperplane in higher dimensions) that minimizes the sum of the squared differences between the observed and predicted values
- Commonly used for regression tasks where the goal is to predict a continuous outcome
- Often used as a baseline model due to its simplicity and interpretability
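
To make "minimises the sum of squared differences" concrete, here is a small sketch on synthetic data. It solves the least-squares problem directly with NumPy and compares the result with scikit-learn's `LinearRegression`. The data and the names `X_demo`, `y_demo` and `beta_hat` are invented for illustration and are not part of the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 2 + 3*x1 - 1*x2 + small noise
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is estimated alongside the slopes
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: the beta that minimises ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

# scikit-learn should arrive at the same intercept and coefficients
sk_model = LinearRegression().fit(X_demo, y_demo)

print("lstsq   intercept, slopes:", beta_hat[0], beta_hat[1:])
print("sklearn intercept, slopes:", sk_model.intercept_, sk_model.coef_)
```

Both routes should recover values close to the true 2, 3 and -1, since they solve the same optimisation problem.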

Assumptions:
- Linearity: The relationship between the independent and dependent variables is linear
- Independence: The residuals (errors) are independent of each other
- Homoscedasticity: The residuals have constant variance at every level of the independent variable
- Normality: The residuals of the model are normally distributed
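- **Note: If these assumptions are violated, the results of the linear regression may be invalid. A violation of linearity can bias the predictions, while correlated, heteroscedastic or non-normal residuals mainly make the standard errors, p-values and confidence intervals unreliable.**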
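
A rough sketch of how the normality and homoscedasticity assumptions could be checked from the residuals. It uses synthetic data that satisfies the assumptions by construction. The Shapiro-Wilk test and the rank correlation between fitted values and absolute residuals are just one possible pair of checks, chosen here for brevity. Residual plots and Q-Q plots are common alternatives.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): linear signal plus constant-variance normal noise
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=300)

model = LinearRegression().fit(X_demo, y_demo)
fitted = model.predict(X_demo)
residuals = y_demo - fitted

# Normality: Shapiro-Wilk test (a small p-value suggests non-normal residuals)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homoscedasticity (rough check): the size of the residuals should not
# trend with the fitted values, so this rank correlation should be near zero
spread_corr, spread_p = stats.spearmanr(fitted, np.abs(residuals))

print(f"Mean residual:                {residuals.mean():.4f}")
print(f"Shapiro-Wilk p-value:         {shapiro_p:.3f}")
print(f"Spearman(fitted, |residual|): {spread_corr:.3f} (p = {spread_p:.3f})")
```

On the real data, the same checks would take `y_train - ols.predict(X_train)` as the residuals once the model is fitted.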

Strengths:
- Simplicity: Easy to understand and interpret
- Efficiency: Computationally efficient and fast to train
- Baseline: Serves as a good baseline model for comparison with more complex models

Weaknesses:
- Linearity: Assumes a linear relationship between input and output, which may not hold true
- Sensitivity: Can be sensitive to outliers and multicollinearity among predictors
- Assumptions: Assumes that the residuals (errors) are normally distributed and homoscedastic (constant variance)

Caution:
- Ensure that the assumptions of linear regression are met before applying the model
- Consider feature engineering or transformations if the relationship is not linear
- Be cautious of overfitting, especially with a large number of predictors
- Regularisation techniques (like Ridge or Lasso regression) can mitigate some of these weaknesses (I will do this in subsequent models)

Steps:
1.  Fit the model on the training data
2.  Validate the model on the validation data
3.  Evaluate the model's performance using appropriate metrics (e.g., RMSE, MAE)
4.  Fine-tune the model as needed
5.  Use the model to make predictions on the test data
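
For step 3, a quick sketch of what MAE and RMSE actually compute, written out with NumPy and checked against the scikit-learn helpers. The four values below are made up for illustration and sit on a log-price scale like the target in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values (hypothetical), on a log-price scale like the target used here
y_true = np.array([12.0, 12.5, 11.8, 12.2])
y_pred = np.array([12.1, 12.3, 11.9, 12.2])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))         # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # square root of the average squared error

print("MAE (by hand): ", mae)
print("RMSE (by hand):", rmse)

# The scikit-learn helpers should give the same numbers
print("MAE (sklearn): ", mean_absolute_error(y_true, y_pred))
print("RMSE (sklearn):", mean_squared_error(y_true, y_pred) ** 0.5)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, because squaring weights large errors more heavily.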

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

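ols = LinearRegression()             # Ordinary least squares model with default settings (fits an intercept)
ols.fit(X_train, y_train)            # Estimate the coefficients from the training split
y_train_pred = ols.predict(X_train)  # In-sample predictions (on the log-transformed scale)

train_mae = mean_absolute_error(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = train_mse ** 0.5        # RMSE is the square root of MSE

print("R Squared:", ols.score(X_train, y_train))  # R squared on the training data
print("Train MAE:", train_mae)
print("Train RMSE:", train_rmse)

y_valid_pred = ols.predict(X_valid)  # Validation predictions (not yet scored in this cell)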

R Squared: 0.9468820305207895
Train MAE: 0.0652872739462918
Train RMSE: 0.09110254294862176
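
The cell above computes `y_valid_pred` but reports only training metrics. Here is a minimal sketch of the validation step (step 2), run on synthetic data so it stands alone. The helper `report` and the names `X_demo`, `y_demo`, `X_tr` and `X_va` are hypothetical. On the real data the same metric calls would take `y_valid` and `y_valid_pred`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in (hypothetical) for the feature matrix and log-price target
X_demo = rng.normal(size=(500, 10))
y_demo = 12.0 + X_demo @ rng.normal(scale=0.1, size=10) + rng.normal(scale=0.1, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)

def report(name, y_true, y_pred):
    """Print MAE and RMSE with a label and return them as a tuple."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    print(f"{name} MAE: {mae:.4f}  RMSE: {rmse:.4f}")
    return mae, rmse

train_mae, train_rmse = report("Train", y_tr, model.predict(X_tr))
valid_mae, valid_rmse = report("Valid", y_va, model.predict(X_va))

# A validation RMSE far above the training RMSE would point to overfitting
print("Valid/Train RMSE ratio:", valid_rmse / train_rmse)
```

Because the target is log-transformed, these errors are on the log scale. With 281 features and 1,165 training rows, comparing the training and validation numbers side by side is the quickest way to see whether the high training R squared carries over to unseen data.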
