# Mathematical Formulation & Implementation of Simple Linear Regression

**Based on the CampusX Linear Regression Transcript**

---

## 1. Introduction and Objective

The primary goal of this session is to understand the mathematics behind **Simple Linear Regression** and to build a custom Linear Regression class from scratch in Python.

### Core concepts

* **Problem:** dataset with one input and one output

  * Input variable (X): independent feature (e.g., CGPA)
  * Output variable (y): dependent target (e.g., salary package)
* **Goal:** find the *best-fit line*, i.e. the line that minimizes the sum of squared vertical distances (errors) between the data points and the line.

### Line equation

[ y = m x + b ]

* (y): predicted output (Package)
* (x): input feature (CGPA)
* (m): slope (weight / coefficient)
* (b): intercept (bias)
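
As a minimal illustration of the line equation, the sketch below evaluates y = m x + b for a single CGPA value. The slope and intercept are made-up numbers chosen for readability, not fitted values, and `predict_package` is a hypothetical helper name.

```python
def predict_package(cgpa, m, b):
    """Evaluate the line y = m * x + b for one input value (hypothetical helper)."""
    return m * cgpa + b

# Hypothetical parameters, chosen only for illustration (not fitted to any data).
m = 0.5   # slope: change in package per unit of CGPA
b = -1.0  # intercept: predicted package when CGPA is 0

print(predict_package(8.0, m, b))  # 0.5 * 8.0 - 1.0 = 3.0
```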

### High-level workflow

```mermaid
graph LR
    A[Input data X, y] --> B[Define line eqn y = m x + b]
    B --> C[Define error function E]
    C --> D{Minimize error}
    D --> E[Find optimal m and b]
    E --> F[Best fit line]
```

---

## 2. Approaches to solving Linear Regression

There are two primary algorithms to determine optimal (m) and (b).

```mermaid
graph TD
    A[Start: Find m and b] --> B{Dataset size?}
    B -- Small / Low Dimensions --> C[Closed form solution]
    B -- Large / High Dimensions --> D[Non-closed form solution]
    C --> E["Ordinary Least Squares (OLS)"]
    E --> G[Direct formula calculation]
    D --> F[Gradient Descent]
    F --> H[Iterative optimization]
```

### A. Closed-form solution (direct formula)

* **Method:** Ordinary Least Squares (OLS) — compute exact solution using derived formulas.
* **Library:** `sklearn.linear_model.LinearRegression` implements this.
* **Pros:** exact solution, fast for low-dimensional problems.
* **Cons:** matrix inverses / solve can be expensive for very high-dimensional data; numerical stability issues may appear for ill-conditioned matrices.

### B. Non-closed form solution (approximation)

* **Method:** Gradient descent (batch / stochastic / mini-batch).
* **Library:** `sklearn.linear_model.SGDRegressor` can be used.
* **Pros:** scales to massive datasets and high-dimensional problems; supports online updates.
* **Cons:** iterative and approximate; needs tuning (learning rate, iterations, batch size).
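
To make the contrast between the two approaches concrete, here is a minimal sketch that fits the same toy data both ways. It assumes batch gradient descent on the mean squared error (the sum of squared errors divided by n, which has the same minimizer); the data, the learning rate, the iteration count, and the function names are all illustrative choices rather than anything from the transcript.

```python
import numpy as np

def closed_form(x, y):
    """Slope and intercept from the direct OLS formulas."""
    dx = x - x.mean()
    m = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
    return m, y.mean() - m * x.mean()

def gradient_descent(x, y, learning_rate=0.02, n_iters=5000):
    """Batch gradient descent on the mean squared error of y = m*x + b."""
    m, b = 0.0, 0.0
    n = x.size
    for _ in range(n_iters):
        residual = y - (m * x + b)
        # Gradients of (1/n) * sum(residual**2) with respect to m and b.
        grad_m = -2.0 / n * (residual * x).sum()
        grad_b = -2.0 / n * residual.sum()
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Made-up data lying roughly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

print("closed form     :", closed_form(x, y))
print("gradient descent:", gradient_descent(x, y))
```

On this small problem the two results should agree to several decimal places. The learning rate was picked by hand for this data; a larger value can make the iteration diverge.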

> **Note:** This document focuses on the **closed-form OLS** method and a from-scratch implementation.

---

## 3. Mathematical derivation (Ordinary Least Squares)

To determine the best line, minimize the total squared error between the true values (y_i) and the predictions (\hat{y}_i).

### Step 1 — Loss function (sum of squared errors)

[ E = \sum_{i=1}^n (y_i - \hat{y}_i)^2. ]

**Why square?**

* prevents cancellation of positive/negative errors
* differentiable everywhere (smooth), useful for calculus
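
The cancellation problem can be seen on a tiny example. The sketch below uses three made-up points that lie exactly on y = x and compares a deliberately poor horizontal line with the exact line; `errors` is a hypothetical helper name.

```python
# Three made-up points that lie exactly on the line y = x.
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]

def errors(m, b):
    """Signed errors y_i - (m * x_i + b) for every point (hypothetical helper)."""
    return [yi - (m * xi + b) for xi, yi in zip(x, y)]

flat = errors(0.0, 2.0)   # horizontal line y = 2: errors are -1, 0, +1
exact = errors(1.0, 0.0)  # the true line y = x: errors are all 0

# The raw sums are both zero, so they cannot tell the two lines apart.
print(sum(flat), sum(exact))
# The squared sums differ: 2.0 for the poor line, 0.0 for the exact one.
print(sum(e ** 2 for e in flat), sum(e ** 2 for e in exact))
```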

### Step 2 — Substitute the line equation

[ \hat{y}_i = m x_i + b ]
[ E(m,b) = \sum_{i=1}^n (y_i - (m x_i + b))^2. ]

This is a function of two variables: (m) and (b).

### Step 3 — Partial derivatives and set to zero (first order conditions)

**Derivative w.r.t.** (b):

[ \frac{\partial E}{\partial b} = \sum_{i=1}^n 2(y_i - m x_i - b)(-1) = 0 ]

Simplify:

[ \sum_{i=1}^n (y_i - m x_i - b) = 0 ]
[ \sum y_i - m \sum x_i - n b = 0 ]
[ \frac{1}{n} \sum y_i - m \frac{1}{n} \sum x_i - b = 0 ]
[ b = \bar{y} - m \bar{x} ]

**Derivative w.r.t.** (m):

[ \frac{\partial E}{\partial m} = \sum_{i=1}^n 2(y_i - m x_i - b)(-x_i) = 0 ]

Substitute (b = \bar{y} - m\bar{x}) and rearrange; after algebraic simplification we get:

[ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. ]
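
The algebra skipped in this step can be filled in as follows (a sketch that uses only the definitions of (\bar{x}) and (\bar{y})). Dividing the condition by (-2) and substituting (b = \bar{y} - m\bar{x}) gives:

[ \sum_{i=1}^n x_i \left( (y_i - \bar{y}) - m (x_i - \bar{x}) \right) = 0 ]
[ \sum_{i=1}^n x_i (y_i - \bar{y}) = m \sum_{i=1}^n x_i (x_i - \bar{x}) ]

Because (\sum (y_i - \bar{y}) = 0) and (\sum (x_i - \bar{x}) = 0), subtracting (\bar{x} \sum (y_i - \bar{y})) from the left side and (m \bar{x} \sum (x_i - \bar{x})) from the right side changes nothing:

[ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = m \sum_{i=1}^n (x_i - \bar{x})^2 ]

Dividing both sides by (\sum (x_i - \bar{x})^2), which is nonzero unless every (x_i) is identical, yields the slope formula above.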

So final closed-form formulas are:

* **Slope:** ( m = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} )
* **Intercept:** ( b = \bar{y} - m \bar{x} )
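
As a numerical sanity check on the first-order conditions, the sketch below computes (m) and (b) from these formulas on a few made-up points and then evaluates both partial derivatives there. The data values are illustrative only.

```python
import numpy as np

# Made-up (CGPA, package) pairs used only for this check.
x = np.array([5.5, 6.6, 7.5, 8.1, 9.0])
y = np.array([2.2, 3.0, 4.2, 4.5, 5.1])

# Closed-form slope and intercept.
x_bar, y_bar = x.mean(), y.mean()
m = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
b = y_bar - m * x_bar

# Partial derivatives of E(m, b) = sum((y - m*x - b)**2) at the solution.
residual = y - (m * x + b)
dE_db = (2 * residual * -1).sum()
dE_dm = (2 * residual * -x).sum()

# Both values should be zero up to floating-point rounding.
print(dE_db, dE_dm)
```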

---

## 4. Final formulas (summary)

Use these to implement linear regression from scratch:

* ( m = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} )
* ( b = \bar{y} - m\bar{x} )
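
A small worked example may help before moving to the class. The four points below are made up so that the arithmetic can be followed by hand: the means are 2.5 and 4, the numerator is 3 + 0.5 + 0.5 + 3 = 7, and the denominator is 2.25 + 0.25 + 0.25 + 2.25 = 5.

```python
# Made-up data chosen so the arithmetic is easy to follow by hand.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]

x_bar = sum(x) / len(x)  # 2.5
y_bar = sum(y) / len(y)  # 4.0

# Numerator: sum of (x_i - x_bar) * (y_i - y_bar), by hand 7
numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
# Denominator: sum of (x_i - x_bar)^2, by hand 5
denominator = sum((xi - x_bar) ** 2 for xi in x)

m = numerator / denominator  # by hand: 7 / 5 = 1.4
b = y_bar - m * x_bar        # by hand: 4 - 1.4 * 2.5 = 0.5

print(m, b)
```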

---

## 5. Python implementation — building `MeraLR`

### Algorithm flow (fit method)

```mermaid
graph TD
    A[Start fit] --> B[Calc mean of X and Y]
    B --> C[Initialize numerator=0, denominator=0]
    C --> D[Loop over data points]
    D --> E{any more points?}
    E -- Yes --> F["add (xi-x_mean)*(yi-y_mean) to numerator"]
    F --> G["add (xi-x_mean)^2 to denominator"]
    G --> D
    E -- No --> H[compute m = numerator/denominator]
    H --> I[compute b = y_mean - m * x_mean]
    I --> J[End fit]
```

### Implementation (simple loop version)

```python
import numpy as np

class MeraLR:
    def __init__(self):
        # parameters
        self.m = None
        self.b = None

    def fit(self, X_train, y_train):
        """
        Trains the model by calculating m and b.
        X_train: array-like of shape (n_samples,) or (n_samples, 1)
        y_train: array-like of shape (n_samples,)
        """
        # ensure 1-D arrays
        X = np.ravel(X_train)
        y = np.ravel(y_train)

        # 1. Means
        x_bar = X.mean()
        y_bar = y.mean()

        # 2. Numerator and denominator
        numerator = 0.0
        denominator = 0.0
        for i in range(X.shape[0]):
            numerator += (X[i] - x_bar) * (y[i] - y_bar)
            denominator += (X[i] - x_bar) ** 2

        # 3. slope and intercept
        self.m = numerator / denominator
        self.b = y_bar - (self.m * x_bar)

        print(f"Training complete. m={self.m}, b={self.b}")

    def predict(self, X_test):
        """Predict using y = m*x + b. Accepts scalars or arrays."""
        X = np.ravel(X_test)
        return self.m * X + self.b
```

> **Notes:**
>
> * The code above assumes `denominator != 0` (i.e., not all `x` identical). Handle edge-cases in production.
> * Vectorized NumPy implementations are faster: compute numerator as `((X-x_bar) * (y-y_bar)).sum()` and denominator with `((X-x_bar)**2).sum()`.
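
The vectorized variant mentioned in the note can be sketched as a standalone function. `fit_vectorized` is a hypothetical name, and the sketch still assumes the `x` values are not all identical.

```python
import numpy as np

def fit_vectorized(X_train, y_train):
    """Return (m, b) using NumPy array operations instead of a Python loop."""
    X = np.ravel(X_train).astype(float)
    y = np.ravel(y_train).astype(float)

    # Deviations from the means, computed once for the whole array.
    dx = X - X.mean()
    dy = y - y.mean()

    m = (dx * dy).sum() / (dx ** 2).sum()
    b = y.mean() - m * X.mean()
    return m, b

# Made-up points lying exactly on y = 2x + 1.
m, b = fit_vectorized([[1.0], [2.0], [3.0], [4.0]], [3.0, 5.0, 7.0, 9.0])
print(m, b)  # approximately 2 and 1
```

The result is the same as the loop version up to floating-point rounding; the benefit is speed on large arrays, since the summation runs inside NumPy.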

---

## 6. Line-by-line explanation (fit method)

* `X = np.ravel(X_train)`, `y = np.ravel(y_train)`: flatten the inputs to 1-D arrays.
* `x_bar = X.mean()`: sample mean (\bar{x}).
* `y_bar = y.mean()`: sample mean (\bar{y}).
* `numerator = 0.0`, `denominator = 0.0`: running totals for the slope formula.
* Loop: build sums

  * `numerator += (X[i] - x_bar) * (y[i] - y_bar)` builds (\sum (x-\bar{x})(y-\bar{y}))
  * `denominator += (X[i] - x_bar) ** 2` builds (\sum (x-\bar{x})^2)
* `self.m = numerator / denominator`: slope.
* `self.b = y_bar - (self.m * x_bar)`: intercept.

---

## 7. Comparison & conclusion

* **Verification:** On a typical placement dataset the custom `MeraLR` produces the same `m` and `b` as `sklearn.linear_model.LinearRegression` (within numerical precision).
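* **Limitation:** This implementation handles **simple** linear regression (single feature). For multiple features the closed form generalizes to the matrix form known as the Normal Equation, where (X) is the design matrix including a column of ones for the intercept and (\mathbf{w}) collects the intercept and the slopes: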

[ \mathbf{w} = (X^T X)^{-1} X^T \mathbf{y} ]

* **Scalability & numerical concerns:** For high-dimensional data use stable solvers (`np.linalg.lstsq`, SVD) or iterative solvers; add regularization (Ridge) to avoid singular `X^T X`.
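
For the single-feature case, the matrix form and the more stable `np.linalg.lstsq` route can be sketched as below. The function names and the toy data are illustrative assumptions; the design matrix here puts the feature in the first column and the ones in the second, so `w[0]` is the slope and `w[1]` the intercept.

```python
import numpy as np

def design_matrix(x):
    """Feature column plus a column of ones for the intercept."""
    x = np.ravel(x).astype(float)
    return np.column_stack([x, np.ones_like(x)])

def fit_normal_equation(x, y):
    """Solve (A^T A) w = A^T y directly; can be unstable if A^T A is ill-conditioned."""
    A = design_matrix(x)
    y = np.ravel(y).astype(float)
    w = np.linalg.solve(A.T @ A, A.T @ y)
    return w[0], w[1]

def fit_lstsq(x, y):
    """Least-squares solve that avoids forming A^T A explicitly."""
    A = design_matrix(x)
    y = np.ravel(y).astype(float)
    w, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    return w[0], w[1]

# Made-up data for illustration.
x = np.array([5.5, 6.6, 7.5, 8.1, 9.0])
y = np.array([2.2, 3.0, 4.2, 4.5, 5.1])

print(fit_normal_equation(x, y))
print(fit_lstsq(x, y))
```

Both calls should print the same slope and intercept up to rounding, matching the simple closed-form formulas.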

### Key takeaway

Simple Linear Regression is basic calculus + statistics: compute slope and intercept that minimize the sum of squared errors. It’s interpretable, fast for low dimensional problems, but limited in expressivity — use multiple regression, polynomial features, or non-linear models if needed.

---

## 8. Quick runnable example (verify with sklearn)

```python
# Toy check: assumes the MeraLR class from section 5 is already defined
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data
X = np.array([6.6,7.5,8.1,5.9,7.0,8.4,6.8,9.0,5.5,7.8]).reshape(-1,1)
y = np.array([3.0,4.2,4.5,2.5,3.7,4.6,3.2,5.1,2.2,4.0])

# our model
mylr = MeraLR()
mylr.fit(X, y)
print('Predictions (ours):', mylr.predict(X[:3]))

# sklearn
lr = LinearRegression().fit(X, y)
print('Sklearn m, b:', lr.coef_[0], lr.intercept_)
```
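
If scikit-learn is not available, a similar cross-check can be done with NumPy alone. The sketch below reuses the synthetic data from this section and compares the closed-form formulas with `np.polyfit` at degree 1, which returns the slope first and the intercept second.

```python
import numpy as np

# Same synthetic data as in the example above, flattened to 1-D.
x = np.array([6.6, 7.5, 8.1, 5.9, 7.0, 8.4, 6.8, 9.0, 5.5, 7.8])
y = np.array([3.0, 4.2, 4.5, 2.5, 3.7, 4.6, 3.2, 5.1, 2.2, 4.0])

# Closed-form OLS formulas.
dx = x - x.mean()
m = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
b = y.mean() - m * x.mean()

# Degree-1 polynomial fit: coefficients come back highest power first.
m_poly, b_poly = np.polyfit(x, y, deg=1)

print("formulas:", m, b)
print("polyfit :", m_poly, b_poly)
print("match   :", np.allclose([m, b], [m_poly, b_poly]))
```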

---

## 9. Practical tips

* **Edge cases:** if all `x` equal → denominator = 0 → no unique slope. Handle explicitly.
* **Numerical stability:** use `np.linalg.lstsq` or SVD for matrix solutions.
* **Multiple features:** implement matrix normal equation or use `sklearn`.
* **Regularization:** add Ridge/Lasso for better generalization when features are many or correlated.
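
The edge-case tip can be sketched as a guarded fit function. `fit_checked` is a hypothetical name, and the choice to raise `ValueError` and to use `np.isclose` with its default tolerance is an assumption; data with a very small but genuine spread in `x` may need a different threshold.

```python
import numpy as np

def fit_checked(x, y):
    """Return (m, b), raising ValueError when the slope is not uniquely defined."""
    x = np.ravel(x).astype(float)
    y = np.ravel(y).astype(float)
    if x.size != y.size:
        raise ValueError("x and y must have the same number of samples")

    dx = x - x.mean()
    denominator = (dx ** 2).sum()
    # Assumed tolerance: treat a near-zero spread in x as "all x identical".
    if np.isclose(denominator, 0.0):
        raise ValueError("all x values are identical; the slope is undefined")

    m = (dx * (y - y.mean())).sum() / denominator
    b = y.mean() - m * x.mean()
    return m, b

print(fit_checked([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))  # approximately (2, 1)

try:
    fit_checked([3.0, 3.0, 3.0], [1.0, 2.0, 3.0])
except ValueError as exc:
    print("rejected:", exc)
```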

---

## 10. References & next steps

* Scikit-learn docs: `LinearRegression`, `SGDRegressor`, `PolynomialFeatures`.
* Implement gradient descent for learning: compare training


## 11. Notebook run on `placement.csv`

The cells below are the original notebook session, with each cell's output shown after it.

```python
class MeraLR:

    def __init__(self):
        self.m = None
        self.b = None

    def fit(self,X_train,y_train):

        num = 0
        den = 0

        for i in range(X_train.shape[0]):

            num = num + ((X_train[i] - X_train.mean())*(y_train[i] - y_train.mean()))
            den = den + ((X_train[i] - X_train.mean())*(X_train[i] - X_train.mean()))

        self.m = num/den
        self.b = y_train.mean() - (self.m * X_train.mean())
        print(self.m)
        print(self.b)

    def predict(self,X_test):

        print(X_test)

        return self.m * X_test + self.b
```

```python
import numpy as np
import pandas as pd
df = pd.read_csv('placement.csv')
df.head()

X = df.iloc[:,0].values
y = df.iloc[:,1].values
```

```python
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
X_train.shape

lr = MeraLR()
lr.fit(X_train,y_train)
```

```text
0.5579519734250721
-0.8961119222429152
```

```python
X_train.shape[0]
```

```text
160
```

```python
X_train[0]
```

```text
np.float64(7.14)
```

```python
X_train.mean()
```

```text
np.float64(6.989937500000001)
```

```python
X_test[0]

print(lr.predict(X_test[0]))
```

```text
8.58
3.891116009744203
```
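
The last prediction can be reproduced by hand from the printed parameters, as a quick arithmetic check. The three numbers below are copied from the notebook output above; nothing is refitted.

```python
# Values copied from the notebook output above.
m = 0.5579519734250721   # slope printed by fit
b = -0.8961119222429152  # intercept printed by fit
x = 8.58                 # X_test[0]

prediction = m * x + b
# Should agree with the 3.891116009744203 printed by the notebook,
# up to floating-point rounding.
print(prediction)
```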
