# Linear Regression

## Implementation

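The input data ("x") has "m" data points and "n" columns (features, or independent variables), with one observed value in "y" per data point. The model therefore has "n" + 1 betas in total: one intercept and one coefficient per feature.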
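To make the shapes concrete, here is a minimal sketch with made-up toy values (the numbers and the `num_betas` name are illustrative assumptions, not from the original notebook). It follows the `x`/`y` usage of `linear_regression`, where `x` is a list of "m" rows with "n" values each.

```python
# Hypothetical toy data: m = 3 data points, n = 2 features
x = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
y = [5.0, 11.0, 17.0]  # one observed value per data point

m, n = len(x), len(x[0])

# n + 1 betas: one intercept plus one coefficient per feature
beta_0 = 0.0
beta_other = [1.0, 2.0]
num_betas = 1 + len(beta_other)

# Prediction for a single data point i
i = 0
y_i_hat = beta_0 + sum(x[i][j] * beta_other[j] for j in range(n))
```

With these values, the prediction for the first row is 1.0 * 1.0 + 2.0 * 2.0 + 0.0.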

In [None]:
# load library
import random

For the main function, we perform the following steps:
- First, we initialize the parameters with `initialize_params` based on the dimension of the input data ("n")
- We then compute the gradient of the betas using `compute_gradient`
- We use the computed gradients to update the value of each beta using `update_params`
- We repeat the gradient and update steps for a fixed number of iterations

In [None]:
# main function
def linear_regression(x, y, iterations=100, learning_rate=0.01):
    # n: number of features, m: number of data points
    n, m = len(x[0]), len(x)
    beta_0, beta_other = initialize_params(n)
    for _ in range(iterations):
        # compute the gradients, then take one update step
        gradient_beta_0, gradient_beta_other = compute_gradient(x, y, beta_0, beta_other, n, m)
        beta_0, beta_other = update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate)
    return beta_0, beta_other

The `initialize_params` function sets "beta_0" to 0 and creates "beta_other", a vector of size "n" that holds the remaining betas, each initialized to a random value in [0, 1).

In [2]:
# helper function: initialize params
def initialize_params(n):
    beta_0 = 0
    beta_other = [random.random() for _ in range(n)]
    return beta_0, beta_other

`compute_gradient` is the core of the algorithm, where we compute the gradients for all betas:
- Initialize the gradients of all betas to 0
- Loop through all data points and add the gradient contributed by each data point to those variables:
    - First, obtain the prediction "y_i_hat" for data point "i"
    - Get the difference between the observation ("y[i]") and the prediction ("y_i_hat")
    - Multiply the difference by 2 to obtain "derror_dy", the negative of the derivative of the squared error with respect to the prediction
    - Divide each data point's contribution by "m", the number of data points, so that the gradient computed at the end is the average over all data points

In [None]:
# helper function: compute gradient
def compute_gradient(x, y, beta_0, beta_other, n, m):
    gradient_beta_0 = 0
    gradient_beta_other = [0] * n

    for i in range(m):
        # prediction for data point i
        y_i_hat = sum(x[i][j] * beta_other[j] for j in range(n)) + beta_0
        # negative derivative of the squared error with respect to the prediction
        derror_dy = 2 * (y[i] - y_i_hat)
        # divide by m so the accumulated gradients are averages over all data points
        for j in range(n):
            gradient_beta_other[j] += derror_dy * x[i][j] / m
        gradient_beta_0 += derror_dy / m

    return gradient_beta_0, gradient_beta_other
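
The accumulation in `compute_gradient` can be read as the gradient of a mean squared error. The notebook does not write the error function out, so the following is a sketch under the assumption that the error is $E = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$ with $\hat{y}_i = \beta_0 + \sum_{j=1}^{n}\beta_j x_{ij}$, consistent with the stated aim of averaging over all data points:

$$
\frac{\partial E}{\partial \beta_0} = -\frac{2}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right),
\qquad
\frac{\partial E}{\partial \beta_j} = -\frac{2}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)x_{ij}
$$

The code accumulates these sums without the leading minus sign, so the values it returns are the negative gradient, which is the direction in which the error decreases.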

We use `update_params` to update all the betas using the gradients we obtained. We don't add the raw gradients to the betas. Instead, we scale each gradient by the learning rate, which controls the step size of each gradient descent update. A learning rate that is too high makes gradient descent unstable, while one that is too low makes it slow to converge.
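The effect of the learning rate can be seen on a toy problem. This is a minimal sketch, not part of the original notebook: the `descend` helper and the function f(b) = b² are assumptions chosen for illustration. It uses the same "+=" convention as `update_params`, stepping along the negative gradient.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(b) = b**2, whose minimum is at b = 0."""
    b = start
    for _ in range(steps):
        negative_gradient = -2 * b  # descent direction for f(b) = b**2
        b += negative_gradient * learning_rate
    return b

too_low = descend(0.001)   # barely moves away from the starting value
reasonable = descend(0.1)  # ends close to the minimum at 0
too_high = descend(1.1)    # overshoots by more on every step and diverges
```

Each step multiplies `b` by (1 - 2 * learning_rate), so the iterates shrink toward 0 only while that factor has magnitude below 1.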

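We update the betas using "+=" because of how the gradients are computed in `compute_gradient`. There, "derror_dy" is computed from "y[i]" - "y_i_hat", which is the negative of the derivative of the squared error with respect to the prediction. If "y_i_hat" is an overestimate, "derror_dy" is negative, so the update moves the prediction down. That's why we add the scaled gradient to the betas rather than subtract it.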
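The sign convention can be checked on a single data point. The values below are hypothetical and chosen so that the prediction starts as an overestimate. The sketch applies one "+=" update by hand rather than calling the notebook's functions.

```python
# Hypothetical single data point with one feature
x_i, y_i = 2.0, 3.0
beta_0, beta_1 = 0.0, 4.0
learning_rate = 0.01

# The prediction overestimates the observation (8.0 vs 3.0)
y_i_hat = beta_0 + beta_1 * x_i

# Same sign convention as compute_gradient: observation minus prediction
derror_dy = 2 * (y_i - y_i_hat)

# Adding the scaled values lowers both betas because derror_dy is negative
new_beta_0 = beta_0 + derror_dy * learning_rate
new_beta_1 = beta_1 + derror_dy * x_i * learning_rate

new_y_i_hat = new_beta_0 + new_beta_1 * x_i
```

After the update, the prediction has moved from the overestimate toward the observed value.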

In [None]:
# helper function: update params
def update_params(beta_0, beta_other, gradient_beta_0, gradient_beta_other, learning_rate):
    beta_0 += gradient_beta_0 * learning_rate
    for i in range(len(beta_other)):
        beta_other[i] += (gradient_beta_other[i] * learning_rate)
    return beta_0, beta_other

## Test