# Simple Linear Regression

## Closed-Form Solution Derivation

Given observations $(x_{i}, y_{i})$ for $i=1,\dots, n$, we postulate a linear relationship between the independent variable $x$ and the dependent variable $y$: $$
y_{i} = mx_{i} + b
$$ We want the line of best fit, that is, the values of $m$ and $b$ that minimize the sum of squared errors: $$
J(m, b)=\sum_{i=1}^n (y_{i}-\hat{y}_{i})^2
$$
where $\hat{y}_{i} = mx_{i}+b$ is the model's prediction for $x_{i}$.

To find the required values of $m$ and $b$, we write the sum of squared errors in terms of them and set both of its partial derivatives to $0$. Because $J$ is a convex quadratic function of $m$ and $b$, this stationary point is its minimum:

Derivative with respect to $b$: 
$$
0 = \frac{\partial J(m, b)}{\partial b} = -2\sum_{i=1}^n(y_{i}-mx_{i}-b) \quad\Rightarrow\quad \sum_{i=1}^n y_{i} = nb + m \sum_{i=1}^nx_{i}
$$
Dividing by $n$ and solving for $b$ gives us: $$
\bar{y} - m \bar{x} = b
$$
where $\bar{y}$ and $\bar{x}$ are the average values of the observations $y_{i}$ and $x_{i}$, respectively. 
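
This result can be checked numerically. The following is a minimal sketch in plain Python on made-up data (the values in `xs` and `ys` and the helper name `residual_sum` are illustrative assumptions, not part of the dataset used later). Once $b$ is set to $\bar{y} - m\bar{x}$, the residuals sum to zero for any slope $m$, which is exactly the condition $\partial J/\partial b = 0$.

```python
from statistics import fmean

# Made-up observations, for illustration only.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.5, 4.0, 9.0]


def residual_sum(m: float) -> float:
    """Sum of residuals y_i - (m*x_i + b) when b = y_bar - m*x_bar."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    b = y_bar - m * x_bar
    return sum(y - (m * x + b) for x, y in zip(xs, ys))


# The sum vanishes (up to floating-point rounding) whatever slope is chosen.
for m in (-1.0, 0.0, 0.5, 3.0):
    print(m, residual_sum(m))
```

Geometrically, this says that any line with this intercept passes through the point $(\bar{x}, \bar{y})$; the second condition below then picks the slope.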

Derivative with respect to $m$:

$$
\frac{\partial J}{\partial m}=-2\sum_{i=1}^n x_i\big(y_i-mx_i-b\big)=0
\;\;\Longrightarrow\;\;
\sum_{i=1}^n x_i y_i \;=\; m\sum_{i=1}^n x_i^2 \;+\; b\sum_{i=1}^n x_i.
$$
Substituting $b=\bar y - m\bar x$ and using $\sum_{i=1}^n x_i = n\bar x$:
$$
\sum_{i=1}^n x_i y_i
\;=\; m\sum_{i=1}^n x_i^2 \;+\; (\bar y - m\bar x)\sum_{i=1}^n x_i
\;=\; m\sum_{i=1}^n x_i^2 \;+\; n\bar x\bar y \;-\; m\,n\bar x^2.
$$

$$
\quad\Rightarrow\quad
m\Big(\sum_{i=1}^n x_i^2 - n\bar x^2\Big)
\;=\; \sum_{i=1}^n x_i y_i - n\bar x\bar y.
$$

Now, the sum on the left-hand side can be written as: 
$$\sum_{i=1}^n (x_i-\bar x)^2= \sum_{i=1}^n \big(x_i^2 - 2\bar x\,x_i + \bar x^2\big)$$
$$ = \sum_{i=1}^n x_i^2 - 2\bar x\sum_{i=1}^n x_i + \sum_{i=1}^n \bar x^2$$
$$= \sum_{i=1}^n x_i^2 - 2\bar x\,(n\bar x) + n\bar x^2= \sum_{i=1}^n x_i^2 - n\bar x^2.$$

and the sum on the right-hand side can be written as: 
$$\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)= \sum_{i=1}^n \big(x_i y_i - x_i\bar y - \bar x y_i + \bar x\bar y\big)$$ 
$$= \sum_{i=1}^n x_i y_i \;-\; \bar y\sum_{i=1}^n x_i \;-\; \bar x\sum_{i=1}^n y_i \;+\; \sum_{i=1}^n \bar x\bar y$$
$$= \sum_{i=1}^n x_i y_i - \bar y\,(n\bar x) - \bar x\,(n\bar y) + n\bar x\bar y= \sum_{i=1}^n x_i y_i - n\bar x\bar y$$

That is, 
$$
\sum_{i=1}^n(x_i-\bar x)^2 = \sum_{i=1}^n x_i^2 - n\bar x^2,\qquad
\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y) = \sum_{i=1}^n x_i y_i - n\bar x\bar y.
$$
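
Both identities can be confirmed on a small example. The sketch below uses made-up data and illustrative variable names (`sxx_centered`, `sxy_expanded`, and so on are assumptions for this example) and compares the centered and expanded forms of each sum.

```python
from math import isclose

# Made-up observations, for illustration only.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.5, 4.0, 9.0]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Centered form versus expanded form of the sum of squares in x.
sxx_centered = sum((x - x_bar) ** 2 for x in xs)
sxx_expanded = sum(x * x for x in xs) - n * x_bar ** 2

# Centered form versus expanded form of the cross-product sum.
sxy_centered = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxy_expanded = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

print(isclose(sxx_centered, sxx_expanded))  # True
print(isclose(sxy_centered, sxy_expanded))  # True
```

The two forms are algebraically equal, but the expanded form subtracts two potentially large numbers, so in floating-point arithmetic the centered form is generally the safer one to compute.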
Therefore, substituting back, we get:
$$
\boxed{\,m
= \frac{\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)}
       {\sum_{i=1}^n (x_i-\bar x)^2}\,},\qquad
\boxed{\,b=\bar y - m\bar x\,}.
$$
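
As a worked example, the boxed formulas can be applied to points that lie exactly on a known line. This is a minimal sketch in plain Python; the function names `fit_line` and `gradient` and the data are illustrative assumptions. It also evaluates both partial derivatives at the solution to confirm that they vanish there.

```python
def fit_line(xs, ys):
    """Return (m, b) from the closed-form expressions derived above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx  # undefined when all x values are equal (sxx == 0)
    b = y_bar - m * x_bar
    return m, b


def gradient(m, b, xs, ys):
    """Partial derivatives (dJ/dm, dJ/db) of the sum of squared errors."""
    dm = -2 * sum(x * (y - m * x - b) for x, y in zip(xs, ys))
    db = -2 * sum(y - m * x - b for x, y in zip(xs, ys))
    return dm, db


# Points that lie exactly on y = 2x + 1, so the fit should recover m = 2, b = 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(m, b)  # 2.0 1.0

# Both partial derivatives are zero at the fitted parameters.
print(gradient(m, b, xs, ys))
```

Here $\bar x = 2.5$, $\bar y = 6$, the numerator is $10$ and the denominator is $5$, so $m = 2$ and $b = 6 - 2 \cdot 2.5 = 1$.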

## Implementation of the Closed-Form Solution

In [13]:
import numpy as np
from typing import Iterable

class CustomLR:
    """
    Simple univariate (1D) linear regression using the closed-form OLS solution.
    y = m * x + b
    """

    def __init__(self):
        self.m = None
        self.b = None
        self.x_bar = None
        self.y_bar = None

    def fit(self, X_train: Iterable[float], y_train: Iterable[float]) -> "CustomLR":
        """Fit the linear regression model to 1D data."""

        X = np.asarray(X_train, dtype=float)
        y = np.asarray(y_train, dtype=float)

        # input validation 
        if X.size == 0 or y.size == 0:
            raise ValueError("X_train and y_train must be non-empty.")
        if X.size != y.size:
            raise ValueError("X_train and y_train must have the same length.")
        if X.ndim != 1 or y.ndim != 1:
            raise ValueError("CustomLR supports only 1D inputs.")
        if not np.isfinite(X).all() or not np.isfinite(y).all():
            raise ValueError("Inputs contain NaN or infinite values.")

        # compute slope and intercept 
        self.x_bar, self.y_bar = X.mean(), y.mean()
        X_c, y_c = X - self.x_bar, y - self.y_bar

        denom = np.dot(X_c, X_c)
        if denom == 0.0:
            raise ValueError("All X values are identical; slope is undefined.")

        self.m = np.dot(X_c, y_c) / denom
        self.b = self.y_bar - self.m * self.x_bar
        return self

    def predict(self, X_test: Iterable[float]) -> np.ndarray:
        """Predict outputs for new 1D input data."""
        if self.m is None or self.b is None:
            raise ValueError("Model is not fitted yet. Call fit() first.")

        X = np.asarray(X_test, dtype=float)
        if X.ndim != 1:
            raise ValueError("CustomLR supports only 1D inputs.")
        return self.m * X + self.b

In [2]:
import pandas as pd

df = pd.read_csv('../data/placement.csv')
df.head()

|   | cgpa | package |
|---|------|---------|
| 0 | 6.89 | 3.26 |
| 1 | 5.12 | 1.98 |
| 2 | 7.82 | 3.25 |
| 3 | 7.42 | 3.67 |
| 4 | 6.94 | 3.57 |


In [3]:
X = df.iloc[:, 0].values
y = df.iloc[:, 1].values

In [4]:
X

array([6.89, 5.12, 7.82, 7.42, 6.94, 7.89, 6.73, 6.75, 6.09, 8.31, 5.32,
       6.61, 8.94, 6.93, 7.73, 7.25, 6.84, 5.38, 6.94, 7.48, 7.28, 6.85,
       6.14, 6.19, 6.53, 7.28, 8.31, 5.42, 5.94, 7.15, 7.36, 8.1 , 6.96,
       6.35, 7.34, 6.87, 5.99, 5.9 , 8.62, 7.43, 9.38, 6.89, 5.95, 7.66,
       5.09, 7.87, 6.07, 5.84, 8.63, 8.87, 9.58, 9.26, 8.37, 6.47, 6.86,
       8.2 , 5.84, 6.6 , 6.92, 7.56, 5.61, 5.48, 6.34, 9.16, 7.36, 7.6 ,
       5.11, 6.51, 7.56, 7.3 , 5.79, 7.47, 7.78, 8.44, 6.85, 6.97, 6.94,
       8.99, 6.59, 7.18, 7.63, 6.1 , 5.58, 8.44, 4.26, 4.79, 7.61, 8.09,
       4.73, 6.42, 7.11, 6.22, 7.9 , 6.79, 5.83, 6.63, 7.11, 5.98, 7.69,
       6.61, 7.95, 6.71, 5.13, 7.05, 7.62, 6.66, 6.13, 6.33, 7.76, 7.77,
       8.18, 5.42, 8.58, 6.94, 5.84, 8.35, 9.04, 7.12, 7.4 , 7.39, 5.23,
       6.5 , 5.12, 5.1 , 6.06, 7.33, 5.91, 6.78, 7.93, 7.29, 6.68, 6.37,
       5.84, 6.05, 7.2 , 6.1 , 5.64, 7.14, 7.91, 7.19, 7.91, 6.76, 6.93,
       4.85, 6.17, 5.84, 6.07, 5.66, 7.57, 8.28, 6.

In [5]:
y

array([3.26, 1.98, 3.25, 3.67, 3.57, 2.99, 2.6 , 2.48, 2.31, 3.51, 1.86,
       2.6 , 3.65, 2.89, 3.42, 3.23, 2.35, 2.09, 2.98, 2.83, 3.16, 2.93,
       2.3 , 2.48, 2.71, 3.65, 3.42, 2.16, 2.24, 3.49, 3.26, 3.89, 3.08,
       2.73, 3.42, 2.87, 2.84, 2.43, 4.36, 3.33, 4.02, 2.7 , 2.54, 2.76,
       1.86, 3.58, 2.26, 3.26, 4.09, 4.62, 4.43, 3.79, 4.11, 2.61, 3.09,
       3.39, 2.74, 1.94, 3.09, 3.31, 2.19, 1.61, 2.09, 4.25, 2.92, 3.81,
       1.63, 2.89, 2.99, 2.94, 2.35, 3.34, 3.62, 4.03, 3.44, 3.28, 3.15,
       4.6 , 2.21, 3.  , 3.44, 2.2 , 2.17, 3.49, 1.53, 1.48, 2.77, 3.55,
       1.48, 2.72, 2.66, 2.14, 4.  , 3.08, 2.42, 2.79, 2.61, 2.84, 3.83,
       3.24, 4.14, 3.52, 1.37, 3.  , 3.74, 2.82, 2.19, 2.59, 3.54, 4.06,
       3.76, 2.25, 4.1 , 2.37, 1.87, 4.21, 3.33, 2.99, 2.88, 2.65, 1.73,
       3.02, 2.01, 2.3 , 2.31, 3.16, 2.6 , 3.11, 3.34, 3.12, 2.49, 2.01,
       2.48, 2.58, 2.83, 2.6 , 2.1 , 3.13, 3.89, 2.4 , 3.15, 3.18, 3.04,
       1.54, 2.42, 2.18, 2.46, 2.21, 3.4 , 3.67, 2.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [14]:
lr1 = CustomLR()
lr1.fit(X_train, y_train)

<__main__.CustomLR at 0x224d60c6120>

In [22]:
lr1.predict(X_test[[0]]), y_test[[0]]

(array([3.89111601]), array([4.1]))
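
A single prediction gives only a rough impression of the fit. One way to summarize the error over a whole test set is sketched below, assuming root-mean-square error and the coefficient of determination $R^2$ are the metrics of interest. The helper names `rmse` and `r_squared` and the sample values are illustrative assumptions, not results from the placement data.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted values."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only.
y_true = [3.0, 2.0, 4.0, 3.5]
y_pred = [2.8, 2.1, 3.9, 3.6]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

With the objects defined in the cells above, the same helpers could be called as `rmse(y_test, lr1.predict(X_test))` and `r_squared(y_test, lr1.predict(X_test))`, since both only iterate over their arguments.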