# Linear Regression


## General Form

$$
h_{\theta}(x) = \sum_{i=0}^{d} \theta_i x_i = \theta^{T}x
$$

where

$\theta$ is the vector of parameters (weights)

$x$ is the vector of input features, with $x_0 = 1$ so that $\theta_0$ is the intercept term
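
As a minimal sketch of the formula above, assuming the convention $x_0 = 1$ so that $\theta_0$ acts as the intercept, the hypothesis reduces to a single dot product. The function name `hypothesis` and the numbers are illustrative only.

```python
import numpy as np

def hypothesis(theta, x):
    """Return h(x) = theta^T x for one feature vector x (with x[0] == 1)."""
    return float(np.dot(theta, x))

theta = np.array([0.5, 2.0, -1.0])   # theta_0 (intercept), theta_1, theta_2
x = np.array([1.0, 3.0, 4.0])        # x_0 = 1, followed by two feature values

h = hypothesis(theta, x)             # 0.5*1 + 2.0*3 + (-1.0)*4 = 2.5
print(h)
```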

## Cost Function

$$
J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (h_{\theta}(x^{(i)}) - y^{(i)})^{2}
$$

Here $(x^{(i)}, y^{(i)})$ is the $i$-th of $n$ training examples. Regression with this cost function is called **ordinary least squares**.
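
The cost can be evaluated directly from its definition. The sketch below assumes the training examples are stacked as the rows of a matrix `X`; the name `cost` and the toy data are illustrative, not part of the original notebook.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals over all n examples."""
    residuals = X @ theta - y        # h_theta(x^(i)) - y^(i) for each row i
    return 0.5 * float(residuals @ residuals)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])           # first column is x_0 = 1
y = np.array([2.0, 4.0, 6.0])

print(cost(np.array([0.0, 2.0]), X, y))  # perfect fit, so the cost is 0.0
print(cost(np.array([0.0, 1.0]), X, y))  # residuals -1, -2, -3 give 0.5 * 14 = 7.0
```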

## LMS (Least Mean Squares) Algorithm
1. We want to choose $\theta$ to minimize $J(\theta)$.
2. To do so, we start with an initial guess for $\theta$.
3. We then repeatedly update $\theta$ to make $J(\theta)$ smaller, using the gradient descent rule:

$$
\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta)
$$
where $\alpha$ is the learning rate and the update is applied simultaneously for all $j = 0, \dots, d$.

For a single training example $(x, y)$, the partial derivative of $J(\theta)$ with respect to $\theta_j$ is

$$
\frac{\partial}{\partial\theta_j} J(\theta) = (h_\theta(x) - y)x_j
$$
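
For reference, this result follows from the chain rule. The steps below are a sketch that assumes the cost of a single training example, $J(\theta) = \frac{1}{2}(h_\theta(x) - y)^{2}$:

$$
\begin{aligned}
\frac{\partial}{\partial\theta_j} J(\theta)
&= \frac{\partial}{\partial\theta_j} \frac{1}{2}(h_\theta(x) - y)^{2} \\
&= (h_\theta(x) - y) \cdot \frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{d} \theta_i x_i - y\right) \\
&= (h_\theta(x) - y)\,x_j
\end{aligned}
$$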

Substituting this derivative into the gradient descent rule gives the update for $\theta_j$ on a single training example $(x^{(i)}, y^{(i)})$:

$$
\theta_j := \theta_j + \alpha(y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}
$$
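
A single application of this update can be sketched as follows. The function name `lms_step` and the example values are assumptions made for illustration; the vectorized form updates every $\theta_j$ at once.

```python
import numpy as np

def lms_step(theta, x_i, y_i, alpha=0.01):
    """One LMS update using a single training example (x_i, y_i)."""
    error = y_i - theta @ x_i            # y^(i) - h_theta(x^(i))
    return theta + alpha * error * x_i   # updates every theta_j at once

theta = np.array([1.0, 1.0, 1.0])
x_i = np.array([1.0, 1.0, 1.0])
y_i = 2.0

# error = 2 - 3 = -1, so each theta_j moves from 1.0 to 0.99
theta = lms_step(theta, x_i, y_i)
print(theta)
```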

In [1]:
import pandas as pd
import numpy as np

In [41]:
def h_x(theta, x):
    """
    hypothesis for every example: dot product of each row of x with theta
    """
    return x @ theta

def update_theta(theta, y, h, x, alpha=0.01):
    """
    one gradient descent step over all examples; returns the new theta
    """
    # (y - h) @ x sums (y_i - h_i) * x_i over the rows (examples) of x
    return theta + alpha * ((y - h) @ x)

In [54]:
x = np.array([[1,1,1],[2,2,2],[3,3,3]])  # one training example per row
y = np.array([2,4,6])
theta = np.array([1,1,1])

def learn(theta, x, y, max_iter=1000):
    """
    learn with gradient descent
    """
    for _ in range(max_iter):
        theta_new = update_theta(theta, y, h_x(theta, x), x)
        # stop once an update no longer changes theta noticeably
        if np.allclose(theta, theta_new):
            return theta_new
        theta = theta_new
    return theta

def rmse(y_true, y_predicted):
    # square root of the mean squared error
    return np.sqrt(np.mean(np.power(y_predicted - y_true, 2)))

In [53]:
pred = h_x(theta, x)
print('initial prediction: {}, rmse: {:.3f}'.format(pred, rmse(y, pred)))
print('y: {}'.format(y))
new_theta = learn(theta, x, y)
pred = h_x(new_theta, x)
print('current_prediction: {}, rmse: {:.3f}'.format(np.round(pred, 3), rmse(y, pred)))


initial prediction: [3 6 9], rmse: 2.160
y: [2 4 6]
current_prediction: [2. 4. 6.], rmse: 0.000


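Updating $\theta$ from one training example at a time, as in the rule for $\theta_j$ above, is called **stochastic gradient descent** or **incremental gradient descent**. The `learn` function above instead sums the error over all training examples in each step, which is **batch gradient descent**.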
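
A minimal sketch of the stochastic variant, assuming one training example per row of `X`. The function name `sgd`, the fixed `epochs` count, and the absence of shuffling are choices made for this illustration rather than part of the original notebook.

```python
import numpy as np

def sgd(theta, X, y, alpha=0.01, epochs=200):
    """Stochastic gradient descent: update theta after every single example."""
    theta = theta.astype(float)          # work on a float copy of the input
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # LMS rule for one example: theta += alpha * (y - h) * x
            theta += alpha * (y_i - theta @ x_i) * x_i
    return theta

X = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
y = np.array([2, 4, 6])

theta = sgd(np.array([1, 1, 1]), X, y)
print(np.round(X @ theta, 3))            # predictions approach y = [2, 4, 6]
```

Because each step uses only one example, the parameters start improving before a full pass over the data is complete, which is the main practical appeal when $n$ is large.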

# References
1. Stanford CS229 Lecture Notes.
2. http://tutorial.math.lamar.edu/pdf/calculus_cheat_sheet_derivatives.pdf