# 1. Linear regression

In [None]:
%%markdown

Gradient Descent is an iterative optimisation method. We start from a [[Cost Function]], take its derivative with respect to each of the model's parameters, and then adjust every parameter by a small amount in the direction of the negative derivative. Each such step decreases the cost function, so the model's parameters gradually move towards values that incur lower cost, i.e. better predictions.

The following is the equation of a straight line with slope $\upalpha$ and intercept $\upbeta$:
$$
y = \upalpha x + \upbeta
$$

The [[Mean Squared Error - MSE]] for a linear regression prediction would thus be

$$
MSE = \frac{1}{n}{\sum_{i=1}^n((\upalpha x_i + \upbeta) - \hat{y}_i)^2}
$$
Our MSE is our [[Cost Function]]: for each of our $n$ data points we square the difference between the label $\hat{y}_i$ (i.e. $y_{label}$) and the value predicted by our arbitrary line, $y_{pred} = \upalpha x_i + \upbeta$, and then average these squared differences.
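
As a small illustration of the formula above, the sketch below evaluates the MSE of two candidate lines on a toy dataset. The function name `mse` and the four data points are assumptions made up for this example; it uses plain Python lists rather than NumPy so that it runs without dependencies.

```python
def mse(alpha, beta, xs, ys):
    """Mean squared error of the line alpha * x + beta against the labels ys."""
    squared_errors = [((alpha * x + beta) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(xs)

# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_fit = mse(2.0, 1.0, xs, ys)  # the line that generated the data
poor_fit = mse(0.0, 0.0, xs, ys)     # a flat line at zero
print(perfect_fit, poor_fit)
```

The line that generated the data has zero cost, while the flat line is penalised for every point it misses; lower cost means a better fit.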

We want to calculate the derivative w.r.t. $\upalpha$ & $\upbeta$. Writing $f$ for the squared error of a single data point $i$, the derivative w.r.t. $\upalpha$ is:
$$
\frac{\partial f}{\partial \upalpha} = \frac{\partial}{\partial \upalpha}(\upalpha x_i + \upbeta - \hat{y}_i)^2
$$

$$
\frac{\partial f}{\partial \upalpha} = 2x_i(\upalpha x_i + \upbeta - \hat{y}_i)
$$
Now the derivative w.r.t. $\upbeta$:
$$
\frac{\partial f}{\partial \upbeta} = 2(\upalpha x_i + \upbeta - \hat{y}_i)
$$
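
One way to sanity-check the two derivatives is to compare them with a finite-difference approximation at a single data point. The sketch below does that; the function names and the sample values for $\upalpha$, $\upbeta$, $x_i$ and $\hat{y}_i$ are arbitrary choices for illustration.

```python
def squared_error(alpha, beta, x, y):
    """Squared error of a single data point (x, y)."""
    return (alpha * x + beta - y) ** 2

def analytic_gradient(alpha, beta, x, y):
    """Derivatives w.r.t. alpha and beta, using the formulas derived above."""
    residual = alpha * x + beta - y
    return 2 * x * residual, 2 * residual

def numeric_gradient(alpha, beta, x, y, h=1e-6):
    """Central finite-difference approximation of the same two derivatives."""
    d_alpha = (squared_error(alpha + h, beta, x, y)
               - squared_error(alpha - h, beta, x, y)) / (2 * h)
    d_beta = (squared_error(alpha, beta + h, x, y)
              - squared_error(alpha, beta - h, x, y)) / (2 * h)
    return d_alpha, d_beta

# Arbitrary sample values, chosen only for this check
alpha, beta, x, y = 1.5, -0.5, 3.0, 2.0
analytic = analytic_gradient(alpha, beta, x, y)
numeric = numeric_gradient(alpha, beta, x, y)
print(analytic, numeric)
```

If the derivation is right, both pairs agree up to floating-point rounding.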
---
This means that every data point $i$ contributes a number (positive or negative) to each derivative. Averaging these contributions over all $n$ points, separately for $\upalpha$ and $\upbeta$, gives the gradient: its sign tells us the direction in which adjusting each parameter would increase the cost function. Since we want to minimise the cost function, we step in the opposite direction: we multiply the gradient by a small factor (the _learning rate_) and subtract the result from the current parameter value, which gives us the new values of $\upalpha$ & $\upbeta$.

For learning rate $\upmu$
$$
\upalpha := \upalpha - \frac{\upmu}{n} \sum_{i=1}^n 2x_i(\upalpha x_i + \upbeta - \hat{y}_i )
$$

$$
\upbeta := \upbeta - \frac{\upmu}{n} \sum_{i=1}^n 2(\upalpha x_i + \upbeta - \hat{y}_i )
$$
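
The two update rules can be sketched as a single step function that is applied repeatedly. This is a minimal illustration rather than a definitive implementation: the name `gradient_step`, the toy data and the learning rate of 0.05 are all assumptions made for this example. Both gradients are computed from the same residuals before either parameter changes, so the update is simultaneous.

```python
def gradient_step(alpha, beta, xs, ys, learn_rate):
    """Apply one simultaneous update of alpha and beta on the MSE."""
    n = len(xs)
    # Residuals use the current (old) parameters for both gradients
    residuals = [alpha * x + beta - y for x, y in zip(xs, ys)]
    grad_alpha = sum(2 * x * r for x, r in zip(xs, residuals)) / n
    grad_beta = sum(2 * r for r in residuals) / n
    return alpha - learn_rate * grad_alpha, beta - learn_rate * grad_beta

# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

alpha, beta = 0.0, 0.0
for _ in range(2000):
    alpha, beta = gradient_step(alpha, beta, xs, ys, learn_rate=0.05)
print(alpha, beta)
```

On this noise-free data the parameters approach the slope 2 and intercept 1 that generated it. A much larger learning rate would overshoot and diverge, which is why the step size has to match the scale of the inputs.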

---

Gradient descent is good at finding a local minimum of a function, but it is not as adept at finding the global minimum: it simply settles in whichever "valley" it starts in. Other strategies need to be employed to first identify the different valleys in the function, after which gradient descent can be used to find the minimum of the most promising one.
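
To illustrate how the starting point decides which valley gradient descent ends up in, the sketch below runs it on a one-dimensional function with two valleys. The function $f(x) = x^4 - 3x^2 + x$, the two starting points and the learning rate are assumptions picked for this example, and trying several starting points is only one simple way of exploring different valleys.

```python
def f(x):
    """A non-convex function with two valleys."""
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, learn_rate=0.01, steps=1000):
    """Plain gradient descent on f, starting from x."""
    for _ in range(steps):
        x -= learn_rate * f_prime(x)
    return x

left_min = descend(-2.0)   # starts on the left slope
right_min = descend(2.0)   # starts on the right slope
print(left_min, f(left_min))
print(right_min, f(right_min))
```

Both runs converge, but to different minima: the run starting on the left reaches the deeper valley, while the run starting on the right stops in a shallower local minimum and has no way of knowing a better one exists.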



# 2. Practice

In [6]:
import numpy as np
import matplotlib.pyplot as plt

# Step 1 initialize a linear function we want to approximate
alpha = 4.86
beta = -140.6
# Set random seed
rng = np.random.default_rng(0)

# Step 2 - generate noisy data along the line
# x is 1000 data points from 0 to 100
x = np.linspace(0, 100, 1000)
delta = rng.uniform(low=-70, high=70, size=1000)
y = alpha * x + beta + delta
y_noiseless = alpha * x + beta

# plt.plot(x, y)
# plt.title("Noisy linear function data")
# plt.show()

# Since we have the answer, we don't need to split data into train/test sets
# Instead, we proceed to randomly initialize the weights
alpha_pred = ( rng.random() * 10 ) - 5
beta_pred = ( rng.random() * 10 ) - 5
learn_rate = 0.001
epochs = 5000

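# Standardize x so that a single learning rate suits both parameters
x_mean, x_std = x.mean(), x.std()
norm_x = (x - x_mean) / x_std

def train(alpha_pred, beta_pred):
  # One gradient descent step on the MSE, computed entirely in standardized x space
  error = alpha_pred * norm_x + beta_pred - y
  grad_alpha = np.sum(2 * norm_x * error) / y.size
  grad_beta = np.sum(2 * error) / y.size
  return alpha_pred - learn_rate * grad_alpha, beta_pred - learn_rate * grad_beta

for i in range(epochs):
  alpha_pred, beta_pred = train(alpha_pred, beta_pred)

# Map the fitted parameters back to the original x scale
alpha_pred = alpha_pred / x_std
beta_pred = beta_pred - alpha_pred * x_mean

y_pred = alpha_pred * x + beta_pred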

plt.figure()
plt.scatter(x, y, color="blue", label="data points", s=10)
plt.plot(x, y_pred, color="red", label="prediction")
plt.plot(x, y_noiseless, color="orange", label="true function")

plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.title("Linear regression")
plt.show()
