# The Mathematics of Linear Regression

_Author: Callum Hall_  
_Date: 05/08/2025_

# Chapter 1: The Core Idea

## 1.1 Overview

Linear regression is a supervised learning algorithm trained on labelled data: input features ($X_{train}$) and their corresponding true values ($y_{train}$). The goal is to learn the best-fitting linear function, which can then be used to predict values ($\hat{y}$) for a new, unseen dataset ($X_{test}$).

## 1.2 The General Equation

As described in the overview, the algorithm fits a linear function to the data. It is therefore ultimately based on $y = mx + c$, which in machine learning is usually written as $$\hat{y} = Wx + b$$

* $\hat{y}$ - Dependent variable (prediction). $y$ is used to represent the true value.
* $W$ - Gradient (weights)
* $x$ - Independent variable ($X$ for matrix)
* $b$ - Y-intercept (bias)
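
As a minimal sketch of the equation above, the prediction can be computed directly in Python. The function names `predict` and `predict_multi` and the parameter values are illustrative assumptions, not part of the original notes.

```python
def predict(x, W, b):
    """Return y_hat = W * x + b for a single input feature."""
    return W * x + b


def predict_multi(x, W, b):
    """Return y_hat for several features: the weighted sum of x plus the bias."""
    if len(x) != len(W):
        raise ValueError("x and W must have the same number of entries")
    return sum(w_j * x_j for w_j, x_j in zip(W, x)) + b


# Hypothetical parameters: a gradient of 2 and an intercept of 1.
y_hat = predict(3.0, W=2.0, b=1.0)
print(y_hat)  # 7.0
```

With more than one feature, $W$ holds one weight per feature, which is what `predict_multi` assumes.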

# Chapter 2: The Loss Function

## 2.1 The General Formula

Mean Squared Error (MSE) is a fundamental metric in ML, particularly for regression models. It takes the difference between each predicted value and the corresponding actual value, squares it, and averages the results over all datapoints.

$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$

* $n$ - The number of datapoints
* $y_i$ - The actual data
* $\hat{y}_i$ - The predicted data
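
The formula above can be sketched in a few lines of plain Python. The function name `mse` and the sample values are assumptions chosen for illustration.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one datapoint is required")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical data: the errors are 1, 0 and -2, so the squares are 1, 0 and 4.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))  # 5/3, roughly 1.667
```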

The errors are squared for a number of reasons. The first is to give greater weight to larger errors: an error of 2 contributes $2^2 = 4$, whereas an error of 10 contributes $10^2 = 100$.

A further reason is that squaring makes every error non-negative, so positive and negative errors cannot cancel each other out. This can be seen here:

$$£105 - £100 = + £5$$
$$£95 - £100 = - £5$$

If we add up these error values we get 0, which gives the false impression that the model is perfect. Squaring each error removes the sign:

$$(+£5)^2 = 25$$
$$(-£5)^2 = 25$$
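
The same cancellation can be checked numerically. This is a small sketch using the two predictions from the example above (105 and 95 against a true value of 100); the variable names are illustrative.

```python
y_true = [100.0, 100.0]
y_pred = [105.0, 95.0]

# Signed errors: +5 and -5.
errors = [pred - true for pred, true in zip(y_pred, y_true)]

# Adding the signed errors hides both mistakes.
signed_sum = sum(errors)

# Squaring first keeps both mistakes visible.
squared_errors = [e ** 2 for e in errors]
mean_squared = sum(squared_errors) / len(squared_errors)

print(signed_sum)    # 0.0
print(mean_squared)  # 25.0
```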

Note that MSE is used here as a loss function rather than an evaluation metric. Its results come out in $(unit)^2$ (for example, pounds squared), which is difficult for humans to interpret. RMSE, the square root of MSE, avoids this problem by returning the error to the original units. This is discussed in this notebook.
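
A minimal sketch of RMSE, assuming it is defined as the square root of the MSE given earlier. The function name `rmse` is an illustrative choice, and the data reuses the two-prediction example from this chapter.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: the square root of the MSE, in the original units."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_squared = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n
    return math.sqrt(mean_squared)


# Both predictions are off by 5, so the RMSE is 5 in the original units.
print(rmse([100.0, 100.0], [105.0, 95.0]))  # 5.0
```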

# Chapter 3: The Goal of Optimisation

## 3.1 Gradient Descent

The goal of optimisation is to reduce the loss towards 0. Reaching exactly 0 is not realistic in practice, so training instead continues until there is no noticeable reduction in loss. This often leads to the introduction of early stopping procedures within the code, which stop training once the loss has failed to improve by at least a chosen threshold within a chosen number of epochs.
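
One possible way to express an early stopping check is sketched below. The function name `should_stop` and the parameters `min_delta` (the minimum improvement that counts) and `patience` (the number of epochs to wait) are hypothetical names for the threshold and epoch window described above, not a specific library's API.

```python
def should_stop(loss_history, min_delta=1e-4, patience=5):
    """Return True if the best loss in the last `patience` epochs is not at
    least `min_delta` lower than the best loss recorded before them."""
    if len(loss_history) <= patience:
        return False  # not enough epochs recorded to judge yet
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_before - best_recent < min_delta


# Hypothetical loss values recorded once per epoch.
still_improving = [1.0, 0.8, 0.6, 0.4]
plateaued = [1.0, 0.5, 0.4, 0.4, 0.4, 0.4]

print(should_stop(still_improving, patience=2))  # False
print(should_stop(plateaued, patience=3))        # True
```

In a training loop, this check would typically run once per epoch after the new loss value has been appended to the history.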

Gradient descent is an optimisation algorithm that aims to make the loss as small as possible. To do this, it finds the derivative (slope) of the loss with respect to each parameter and uses its sign (+ or -) to work out which way $W$ and $b$ need to be adjusted to decrease the loss. The magnitude of the derivative determines how big a step should be taken. (Covered in more detail in the Fundamentals)

### 3.1.1 The General Formula

$$N_{\text{parameter}} = O_{\text{parameter}} - \alpha * \text{gradient}$$

For $W$ and $b$ parameters, the update rules are

$$W_{new} = W_{old} - \alpha * \frac{\partial L}{\partial W}$$

$$b_{new} = b_{old} - \alpha * \frac{\partial L}{\partial b}$$

The symbols in these formulas are as follows

* $N$ - New
* $O$ - Old
* $\alpha$ - Learning Rate
* $L$ - Loss Function
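
The update rules can be sketched for a single input feature, assuming the loss $L$ is the MSE from Chapter 2 and $\hat{y}_i = Wx_i + b$. Under that assumption the partial derivatives are

$$\frac{\partial L}{\partial W} = -\frac{2}{n}\sum_{i=1}^n x_i(y_i - \hat{y}_i)$$

$$\frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^n (y_i - \hat{y}_i)$$

The function name `gradient_step`, the toy data, the learning rate and the number of iterations below are all illustrative choices rather than recommended settings.

```python
def gradient_step(xs, ys, W, b, alpha):
    """Apply one gradient descent update to W and b using the MSE loss."""
    n = len(xs)
    # Residuals y_i - y_hat_i under the current parameters.
    residuals = [y - (W * x + b) for x, y in zip(xs, ys)]
    dL_dW = (-2.0 / n) * sum(x * r for x, r in zip(xs, residuals))
    dL_db = (-2.0 / n) * sum(residuals)
    # New parameter = old parameter - learning rate * gradient.
    W_new = W - alpha * dL_dW
    b_new = b - alpha * dL_db
    return W_new, b_new


# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

W, b = 0.0, 0.0
for _ in range(2000):
    W, b = gradient_step(xs, ys, W, b, alpha=0.05)

print(round(W, 3), round(b, 3))  # approaches W = 2 and b = 1
```

Both parameters are updated from the same set of residuals, so each step uses gradients computed at the old values of $W$ and $b$.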


