## ___Univariate Linear Regression___
--------------

In [1]:
import numpy as np
np.seterr(all = "raise")
from numpy.typing import NDArray
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [2]:
# this is called supervised learning because we feed both the predictors and the correct targets to the model to learn from
# so the model has access to a set of predictors and their correct targets to learn the associations.

# supervised learning models that predict continuous numerical variables are called regression models.
# supervised learning models that predict categorical variables are called classification models. (categorical nominal or categorical ordinal)
# the principal distinction is that in classification there are only a small finite number of possible outcomes.

## ___HOW EXACTLY DOES A SUPERVISED LEARNING ALGORITHM WORK?___
---------------------------

In [3]:
# a training set will typically include one or more features and a target variable
# once trained the supervised learning algorithm would have learned a function that maps the predictors to the targets as best as possible.
# when provided with new data, it just uses that function to propose predictions.
# "function" here is just a fancy word for learned associations between predictors and the target.

# in ML a hat marks an estimate: y hat is the prediction (estimate) the model makes for y given x

## $f(x) = \hat{y}$

In [4]:
# x is the input and y hat is the prediction
# the function f is called the model.

In [5]:
# in standard ML parlance, input data is denoted as lowercase x
# labels (targets) are denoted as lowercase y

Training examples (records) are referred to as $(x_i, y_i)$, where $i$ indexes the $i$th record/row in the dataset.

In [6]:
# The question now is how to define the function f?
# what exactly does the function f do to generate the predictions?

In [7]:
# let's imagine our regression model as a simple y = mx + b equation.
# function f takes in the training data and training labels to learn the m and b and then uses them to make the predictions.
# f(x) = mx + b
# where m is the slope and b is a constant bias term.

# ONE MAY ASK WHAT'S THE REASON BEHIND CHOOSING A LINEAR FUNCTION y = mx + b?
# non-linear functions are in fact necessary in modelling complex datasets. 
# A linear function is chosen here primarily for simplicity.
# regression models that use linear equations are called linear regression models.
# linear regression models with one feature (one input variable) are called univariate linear regression models
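The prediction step of such a model can be sketched in a few lines. This is only an illustration: the name `predict` and the values of `m` and `b` below are hypothetical, standing in for whatever slope and bias the training procedure would learn.

```python
import numpy as np

def predict(x, m, b):
    # f(x) = mx + b, applied element-wise to every input in x
    return m * x + b

# hypothetical parameters, as if they had already been learned
demo_m, demo_b = 2.0, 1.0
demo_inputs = np.array([0.0, 1.0, 2.0])
demo_predictions = predict(demo_inputs, demo_m, demo_b)  # y hat for each input
```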

## ___Implementing A Univariate Linear Regression Model From Scratch___
-----------

In [8]:
# consider a model f(x) = wx + b (the weight w plays the role of the slope m used above)
# cost function is our usual squared error cost function j(w, b) :=

### ___$j(w, b) = \frac{1}{2N}\sum_{i = 1}^{N}(f(x_i) - y_i)^{2}$___

In [9]:
def squared_errors_costfn(inputs: NDArray[np.float64], targets: NDArray[np.float64], m: np.float64, b: np.float64) -> np.float64:
    return np.square((inputs * m + b) - targets).sum() / (2 * inputs.size)
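As a quick sanity check of the cost function, it can be evaluated on a tiny dataset where the answer is easy to work out by hand. The arrays and parameter values here are made up for illustration; the function body is repeated so that the snippet runs on its own.

```python
import numpy as np
from numpy.typing import NDArray

def squared_errors_costfn(inputs: NDArray[np.float64], targets: NDArray[np.float64], m: float, b: float) -> float:
    return np.square((inputs * m + b) - targets).sum() / (2 * inputs.size)

# hypothetical toy data lying exactly on the line y = 2x + 1
toy_inputs = np.array([1.0, 2.0, 3.0])
toy_targets = 2.0 * toy_inputs + 1.0

# with the true parameters every residual is zero, so the cost is zero
cost_at_truth = squared_errors_costfn(toy_inputs, toy_targets, 2.0, 1.0)

# with b = 0 every residual is -1: sum of squares = 3, divided by 2N = 6
cost_off_by_one = squared_errors_costfn(toy_inputs, toy_targets, 2.0, 0.0)
```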

In [10]:
# the gradient descent algorithm for this univariate linear regression model is ::=
# repeat until convergence ::=

## ___$w = w - \alpha \cdot \frac{\partial}{\partial{w}}j(w, b)$___
## ___$b = b - \alpha \cdot \frac{\partial}{\partial{b}}j(w, b)$___

In [11]:
# partial derivative of the cost function with respect to weight w ::

## ___$w = w - \alpha \cdot \frac{\partial}{\partial{w}}j(w, b)$___
## ___$w = w - \alpha \cdot \frac{\partial}{\partial{w}}\frac{1}{2N}\sum_{i = 1}^{N}(f(x_i) - y_i)^{2}$___
## ___$w = w - \alpha \cdot \frac{\partial}{\partial{w}}\frac{1}{2N}\sum_{i = 1}^{N}(w x_i + b - y_i)^{2}$___
---------------------
___Applying the chain rule to the squared term:___

## ___$w = w - \alpha \cdot \frac{1}{2N}\sum_{i = 1}^{N}(w x_i + b - y_i) \times 2x_i$___
--------------------
## ___$w = w - \alpha \cdot \frac{1}{N}\sum_{i = 1}^{N}(w x_i + b - y_i) \times x_i$___

In [12]:
# partial derivative of the cost function with respect to bias b ::

## ___$b = b - \alpha \cdot \frac{\partial}{\partial{b}}j(w, b)$___
## ___$b = b - \alpha \cdot \frac{\partial}{\partial{b}}\frac{1}{2N}\sum_{i = 1}^{N}(f(x_i) - y_i)^{2}$___
## ___$b = b - \alpha \cdot \frac{\partial}{\partial{b}}\frac{1}{2N}\sum_{i = 1}^{N}(w x_i + b - y_i)^{2}$___
---------------
___Applying the chain rule to the squared term:___

## ___$b = b - \alpha \cdot \frac{1}{2N}\sum_{i = 1}^{N}(w x_i + b - y_i) \times 2$___
---------------
## ___$b = b - \alpha \cdot \frac{1}{N}\sum_{i = 1}^{N}(w x_i + b - y_i)$___

In [13]:
# So, here's the expanded version of the gradient descent algorithm ::=
# repeat until convergence,

## ___$w = w - \alpha \cdot \frac{1}{N}\sum_{i = 1}^{N}(w x_i + b - y_i) \times x_i$___
## ___$b = b - \alpha \cdot \frac{1}{N}\sum_{i = 1}^{N}(w x_i + b - y_i)$___
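One way to gain confidence in the two derivatives is to compare them against a central finite-difference approximation of the cost. The sketch below does that on a small made-up dataset; the helper names `cost` and `gradients`, the data and the test point `(w0, b0)` are all assumptions chosen for illustration.

```python
import numpy as np

def cost(x, y, w, b):
    # squared error cost j(w, b)
    return np.square(w * x + b - y).sum() / (2 * x.size)

def gradients(x, y, w, b):
    # analytic partial derivatives of j with respect to w and b
    residuals = w * x + b - y
    dj_dw = (residuals * x).sum() / x.size
    dj_db = residuals.sum() / x.size
    return dj_dw, dj_db

# hypothetical data and an arbitrary point in parameter space
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([2.5, 4.0, 6.5, 8.0])
w0, b0 = 0.5, 0.25
eps = 1e-6

# central differences approximate each partial derivative numerically
numeric_dw = (cost(x_demo, y_demo, w0 + eps, b0) - cost(x_demo, y_demo, w0 - eps, b0)) / (2 * eps)
numeric_db = (cost(x_demo, y_demo, w0, b0 + eps) - cost(x_demo, y_demo, w0, b0 - eps)) / (2 * eps)

analytic_dw, analytic_db = gradients(x_demo, y_demo, w0, b0)
```

If the derivation is right, `numeric_dw` and `analytic_dw` (and likewise the `b` pair) should agree up to floating-point noise.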

In [14]:
# REMEMBER:: PARAMETER UPDATES MUST BE EXECUTED SIMULTANEOUSLY
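To see what "simultaneously" means in practice, the sketch below takes a single gradient descent step in two ways: once computing both gradients from the old parameters, and once (incorrectly) letting the already-updated `w` leak into the gradient for `b`. The data, starting point and step size are hypothetical.

```python
import numpy as np

def gradients(x, y, w, b):
    residuals = w * x + b - y
    return (residuals * x).sum() / x.size, residuals.sum() / x.size

# hypothetical data and settings
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.0, 6.0])
w_start, b_start, alpha = 0.0, 0.0, 0.1

# simultaneous update: both gradients are evaluated at (w_start, b_start)
dw, db = gradients(x_demo, y_demo, w_start, b_start)
w_sim = w_start - alpha * dw
b_sim = b_start - alpha * db

# sequential update (not what the algorithm prescribes):
# the gradient for b is evaluated after w has already changed
w_seq = w_start - alpha * gradients(x_demo, y_demo, w_start, b_start)[0]
b_seq = b_start - alpha * gradients(x_demo, y_demo, w_seq, b_start)[1]
```

The two variants agree on `w` but not on `b`, which is why temporaries are computed first and applied afterwards.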

In [15]:
# WITH A SQUARED ERROR COST FUNCTION, THE COST OF A LINEAR REGRESSION MODEL IS A CONVEX FUNCTION OF w AND b
# IT WILL NEVER GIVE MULTIPLE MINIMA!
# THERE'LL ONLY BE A SINGLE GLOBAL MINIMUM!
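Convexity can be spot-checked numerically: for a convex function, the cost at the midpoint of any two parameter settings is never larger than the average of the costs at those two settings. The sketch below tries this on random pairs of `(w, b)`. It is a numerical illustration on made-up data, not a proof, and the sampling range and tolerance are arbitrary choices.

```python
import numpy as np

def cost(x, y, w, b):
    return np.square(w * x + b - y).sum() / (2 * x.size)

rng = np.random.default_rng(0)

# hypothetical data on a known line
x_demo = np.linspace(10.0, 100.0, 50)
y_demo = 3.0 * x_demo + 5.0

midpoint_inequality_held = True
for _ in range(1000):
    p = rng.uniform(-10.0, 10.0, size=2)  # first (w, b) pair
    q = rng.uniform(-10.0, 10.0, size=2)  # second (w, b) pair
    mid = (p + q) / 2
    cost_mid = cost(x_demo, y_demo, mid[0], mid[1])
    cost_avg = (cost(x_demo, y_demo, p[0], p[1]) + cost(x_demo, y_demo, q[0], q[1])) / 2
    # allow a tiny relative tolerance for floating-point rounding
    if cost_mid > cost_avg + 1e-9 * max(1.0, cost_avg):
        midpoint_inequality_held = False
```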

In [16]:
# data

x = np.arange(start = 10, stop = 100, step = 0.02)
m = 3.8713458627
b = 12.87345
y = m * x + b

In [17]:
# gradient descent parameters

ALPHA = 0.001
bootstrap_m = 2.81453542
bootstrap_b = 9.2617187
niterations = 10000

In [18]:
for _ in range(niterations):
    # for the love of simultaneous updates,
    tmp_m = ALPHA * (((bootstrap_m * x) + bootstrap_b - y) * x).sum() / x.size
    tmp_b = ALPHA * (((bootstrap_m * x) + bootstrap_b - y)).sum() / x.size

    bootstrap_m -= tmp_m
    bootstrap_b -= tmp_b

FloatingPointError: overflow encountered in reduce

In [19]:
bootstrap_m, bootstrap_b

(-1.2753443256829544e+301, -1.8960945838102555e+299)
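The overflow above is the signature of a learning rate that is too large for unscaled inputs: each step overshoots the minimum by more than the previous one, so the parameters grow without bound. For this quadratic cost, gradient descent is stable only when ALPHA is below 2 divided by the largest eigenvalue of the cost's Hessian, which here is the 2x2 matrix built from mean(x^2), mean(x) and 1. The sketch below computes that limit for the same data and reruns the loop with a smaller step. The value `safe_alpha = 1e-4` is an assumption picked for illustration, not a tuned setting.

```python
import numpy as np

x = np.arange(start=10, stop=100, step=0.02)
y = 3.8713458627 * x + 12.87345

def cost(w, b):
    return np.square(w * x + b - y).sum() / (2 * x.size)

# Hessian of j(w, b) with respect to (w, b); it does not depend on w or b
hessian = np.array([[np.mean(x * x), np.mean(x)],
                    [np.mean(x), 1.0]])
largest_eigenvalue = np.linalg.eigvalsh(hessian)[-1]  # eigvalsh sorts ascending
alpha_limit = 2.0 / largest_eigenvalue  # roughly 5e-4 here, below the 0.001 used above

safe_alpha = 1e-4  # hypothetical choice, comfortably under alpha_limit
w, b = 2.81453542, 9.2617187
cost_before = cost(w, b)

for _ in range(10000):
    residuals = w * x + b - y
    # compute both steps from the old parameters, then apply them together
    step_w = safe_alpha * (residuals * x).sum() / x.size
    step_b = safe_alpha * residuals.sum() / x.size
    w -= step_w
    b -= step_b

cost_after = cost(w, b)
```

With the smaller step the parameters stay finite and the cost decreases, although the bias still approaches its true value only slowly within 10000 iterations because the inputs are large and unscaled.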

In [20]:
# BATCH GRADIENT DESCENT

# the term batch gradient descent refers to the fact that in every iteration of the gradient descent, we operate with all records of the training data
# instead of a subset of the training data.
# when computing the sum, we are computing sums of the entire batch of training examples!

## ___$\sum_{i=1}^{N}(f_{w,b}(x_i) - y_i)^2$___

In [21]:
# that's the sum for all the records ranging from i = 1 to i = N in the training dataset!
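The difference between using the whole batch and using a subset can be made concrete by computing the same gradient both ways. In this sketch the data, the parameter values and the subset size of 64 are hypothetical; the subset gradient is only a noisy estimate of the full-batch one.

```python
import numpy as np

def gradients(x, y, w, b):
    residuals = w * x + b - y
    return (residuals * x).sum() / x.size, residuals.sum() / x.size

rng = np.random.default_rng(0)

# hypothetical training data
x_demo = np.arange(10.0, 100.0, 0.02)
y_demo = 3.0 * x_demo + 5.0
w_now, b_now = 1.0, 0.0

# batch gradient descent: every record contributes to the sums
full_dw, full_db = gradients(x_demo, y_demo, w_now, b_now)

# for contrast: the same gradient estimated from a random subset of records
subset = rng.choice(x_demo.size, size=64, replace=False)
subset_dw, subset_db = gradients(x_demo[subset], y_demo[subset], w_now, b_now)
```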