# Linear Regression From Scratch

In this notebook, we will code the linear regression model from scratch to understand the theory behind this commonly used machine learning model.

At the end, we will compare the results of the linear regression model we build from scratch with those of the scikit-learn model.

YT: https://youtu.be/RIg3iuen7MY

## Implementation

The input data (the "X" table) has "m" data points and "n" columns (features, or independent variables). The model therefore has "n" + 1 betas in total: one intercept plus one coefficient per feature.
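Written out, the model predicts each target as a linear combination of the features. The notation below ($x_{ij}$ for feature $j$ of data point $i$, $\hat{y}_i$ for its prediction) is our own shorthand, chosen to mirror the variable names "beta_0" and "beta_other" used in the code:

```latex
\hat{y}_i = \beta_0 + \sum_{j=1}^{n} \beta_j \, x_{ij}, \qquad i = 1, \dots, m
```

Here $\beta_0$ corresponds to "beta_0" and $\beta_1, \dots, \beta_n$ to the entries of "beta_other".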

In [12]:
# load library
import random

For the main function, we perform the steps as follows:
- First, we initialize the parameters with `initialize_params` based on the dimension of the input data ("n").
- We then compute the gradient of the betas using `compute_gradients`.
- We use the computed gradients to update the value of each beta using `update_params`.
- We repeat the gradient computation and the update for the number of iterations that we've specified.

In [13]:
# main function
def linear_regression(X, y, iterations = 100, learning_rate = 0.01):
    X = X.to_numpy()
    y = y.to_numpy()

    n, m = len(X[0]), len(X)
    beta_0, beta_other = initialize_params(n)
    
    for _ in range(iterations):
        gradient_beta_0, gradient_beta_other = compute_gradients(X, y, beta_0, beta_other, n, m)
        beta_0, beta_other = update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate)
    return beta_0, beta_other

For the `initialize_params` function, we initialize "beta_0" as 0 and "beta_other" as a vector of size "n" that holds all the other betas, each randomly initialized with `random.random()` (a value in the range [0, 1)).

In [14]:
# helper function: initialize parameters
def initialize_params(n):
    beta_0 = 0
    beta_other = [random.random() for _ in range(n)]
    return beta_0, beta_other

`compute_gradients` is the core of the algorithm, where we compute the gradients for all betas:
- Initialize the gradients of all betas as 0.
- We loop through all data points and add the gradient contributed by each data point to those variables. Inside the `for` loop:
    - First, we obtain the prediction "y_i_hat" for data point "i".
    - Get the difference between the prediction ("y_i_hat") and the observation ("y[i]").
    - Multiply the difference by 2 to obtain the derivative of the squared error with respect to the prediction ("derror_dy").
    - The gradient for "beta_0" is just "derror_dy", while the gradient for each entry of "beta_other" is "derror_dy" multiplied by the corresponding feature ("X[i][j]").
    - Accumulate the gradient from all data points (using "+=").
    - Lastly, each data point's contribution is divided by "m", so the gradient computed at the end is the average over all data points.
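The steps above match the gradients of a mean squared error loss. The notebook never names the loss explicitly, so treating it as mean squared error is an assumption inferred from the factor of 2 and the division by "m"; the symbols $J$, $\hat{y}_i$ and $x_{ij}$ are our own notation:

```latex
J(\beta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2,
\qquad
\hat{y}_i = \beta_0 + \sum_{j=1}^{n} \beta_j \, x_{ij}

\frac{\partial J}{\partial \beta_0} = \frac{1}{m} \sum_{i=1}^{m} 2 \left( \hat{y}_i - y_i \right),
\qquad
\frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} 2 \left( \hat{y}_i - y_i \right) x_{ij}
```

The term $2(\hat{y}_i - y_i)$ is what the code calls "derror_dy".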

In [15]:
# helper function: compute gradients
def compute_gradients(X, y, beta_0, beta_other, n, m):
    gradient_beta_0 = 0
    gradient_beta_other = [0] * n
    
    for i in range(m):
        y_i_hat = sum(X[i][j] * beta_other[j] for j in range(n)) + beta_0
        derror_dy = 2 * (y_i_hat - y[i])
        
        for j in range(n):
            gradient_beta_other[j] += (derror_dy * X[i][j]) / m
        
        gradient_beta_0 += (derror_dy / m)
    
    return gradient_beta_0, gradient_beta_other
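For comparison, the same gradients can be computed without explicit loops. The sketch below is not part of the original notebook: the function name `compute_gradients_vectorized` and the toy arrays `X_demo` and `y_demo` are hypothetical, and it assumes NumPy is available.

```python
import numpy as np

def compute_gradients_vectorized(X, y, beta_0, beta_other):
    """Hypothetical vectorized counterpart of compute_gradients (a sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta_other = np.asarray(beta_other, dtype=float)
    m = X.shape[0]

    # Predictions for all data points at once.
    y_hat = X @ beta_other + beta_0
    # Derivative of the squared error with respect to each prediction.
    derror_dy = 2 * (y_hat - y)

    # Average the per-point contributions over the m data points.
    gradient_beta_0 = derror_dy.sum() / m
    gradient_beta_other = X.T @ derror_dy / m
    return gradient_beta_0, gradient_beta_other

# Toy data (made up for illustration): two data points, two features.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_demo = np.array([1.0, 2.0])
g0, g_other = compute_gradients_vectorized(X_demo, y_demo, 0.0, [0.0, 0.0])
print(g0, g_other)
```

With all betas at 0, every prediction is 0, so the gradients depend only on the targets and the features, which makes the result easy to check by hand.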

We use `update_params` to update all the betas using the gradients we obtained. We don't apply the raw gradients directly. Instead, we scale each gradient by the learning rate and subtract the result from the corresponding beta, which moves the betas in the direction that reduces the error. The learning rate controls the size of each gradient descent step: a learning rate that is too high makes gradient descent unstable, while one that is too low makes it slow to converge.

In [16]:
# helper function: update parameters
def update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate):
    beta_0 -= (gradient_beta_0 * learning_rate)
    
    for i in range(len(beta_other)):
        beta_other[i] -= (gradient_beta_other[i] * learning_rate)
    return beta_0, beta_other
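The effect of the learning rate is easiest to see on a toy problem. The sketch below is an illustration only, not part of the notebook: it runs the same update rule on the one-parameter function f(beta) = beta**2, and the helper name `descend` and the three learning rates are chosen arbitrarily.

```python
def descend(learning_rate, steps=20, beta=1.0):
    """Run gradient descent on f(beta) = beta**2, whose gradient is 2 * beta."""
    for _ in range(steps):
        gradient = 2 * beta
        beta -= learning_rate * gradient  # same update rule as update_params
    return beta

too_low = descend(0.001)   # barely moves away from the starting value of 1.0
moderate = descend(0.1)    # shrinks steadily toward the minimum at 0
too_high = descend(1.1)    # overshoots the minimum by more on every step

print(too_low, moderate, too_high)
```

Each step multiplies beta by (1 - 2 * learning_rate), so the iterates shrink toward 0 only when that factor has magnitude below 1.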

## Model Comparison

For this section, we will use a simple house price dataset to compare our linear regression model with the scikit-learn `LinearRegression` model. The metric we will use for the comparison is the betas output by both models.

### Library & Data Preparation

In [17]:
# load library
import pandas as pd
from sklearn.linear_model import LinearRegression

In [18]:
# load data
df = pd.read_csv('data/House_Price_Dataset.csv')
df.head()

Unnamed: 0,HouseSize,Rooms,Price
0,2104,3,399900
1,1600,3,329900
2,2400,3,369000
3,1416,2,232000
4,3000,4,539900


In [19]:
# normalize data
df_normalized = (df - df.min()) / (df.max() - df.min()) # min-max normalization
df_normalized.head()

Unnamed: 0,HouseSize,Rooms,Price
0,0.345284,0.5,0.433962
1,0.206288,0.5,0.301887
2,0.426917,0.5,0.37566
3,0.155543,0.25,0.11717
4,0.592388,0.75,0.698113


In [20]:
# separate independent and dependent variables
X = df_normalized[["HouseSize", "Rooms"]] # independent variables
y = df_normalized["Price"] # dependent variable

### Model Building

In [21]:
# sklearn linear regression model
model_sklearn = LinearRegression().fit(X, y)

print("Intercept | Constant of Linear Regression Equation: ", model_sklearn.intercept_)
print("Coefficient of Linear Regression Equation: ", model_sklearn.coef_)

Intercept | Constant of Linear Regression Equation:  0.05578751828959755
Coefficient of Linear Regression Equation:  [ 0.95241114 -0.06594731]


In [22]:
# our scratch model
model_scratch = linear_regression(X, y, iterations = 1000, learning_rate = 0.1)

print("Intercept | Constant of Linear Regression Equation: ", model_scratch[0])
print("Coefficient of Linear Regression Equation: ", model_scratch[1])

Intercept | Constant of Linear Regression Equation:  0.04931179139777633
Coefficient of Linear Regression Equation:  [0.9387277671735433, -0.04622072507962427]


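As we can see from the results above, the intercepts (beta_0) and the coefficients (beta_other) of the two models are close, although not identical: each beta differs by less than 0.02. Hence, we have successfully built a linear regression model from scratch! Because "beta_other" is randomly initialized without a fixed seed, the exact values from the scratch model will vary slightly from run to run.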
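To quantify how close the two sets of betas are, we can compute the absolute difference for each one. The numbers below are copied from the printed outputs above; the variable names are our own.

```python
# Betas copied from the printed outputs: [intercept, HouseSize, Rooms].
sklearn_betas = [0.05578751828959755, 0.95241114, -0.06594731]
scratch_betas = [0.04931179139777633, 0.9387277671735433, -0.04622072507962427]

# Absolute difference per beta, and the largest of them.
abs_diffs = [abs(a - b) for a, b in zip(sklearn_betas, scratch_betas)]
max_diff = max(abs_diffs)

print([round(d, 4) for d in abs_diffs], round(max_diff, 4))
```

The largest gap is in the "Rooms" coefficient, at a little under 0.02.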

One thing to note is that our model is very sensitive to the characteristics of the dataset, in particular the scale of the features. For example, to get the betas from our scratch model this close to those of the scikit-learn model, we had to perform (min-max) normalization on the dataset.
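One plausible reason for this sensitivity is that large feature values produce very large gradients, so a learning rate that works on normalized data overshoots badly on raw data. The sketch below illustrates this under simplifying assumptions: it uses only the five rows shown by `df.head()`, a single feature ("HouseSize"), and hypothetical helpers `fit_one_feature` and `min_max` that are not part of the notebook.

```python
def fit_one_feature(xs, ys, iterations=5, learning_rate=0.1):
    """Gradient descent for y = beta_0 + beta_1 * x (a one-feature sketch)."""
    beta_0, beta_1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Derivative of the squared error with respect to each prediction.
        derrors = [2 * (beta_0 + beta_1 * x - y) for x, y in zip(xs, ys)]
        gradient_0 = sum(derrors) / m
        gradient_1 = sum(d * x for d, x in zip(derrors, xs)) / m
        beta_0 -= learning_rate * gradient_0
        beta_1 -= learning_rate * gradient_1
    return beta_0, beta_1

def min_max(values):
    """Min-max normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The five rows displayed by df.head() above.
house_size = [2104, 1600, 2400, 1416, 3000]
price = [399900, 329900, 369000, 232000, 539900]

raw = fit_one_feature(house_size, price)
scaled = fit_one_feature(min_max(house_size), min_max(price))

print("raw data:       ", raw)
print("normalized data:", scaled)
```

After only five iterations the betas fitted on the raw data have already grown to enormous magnitudes, while the betas fitted on the normalized data stay small. A much smaller learning rate would be needed to keep the raw-data run stable.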