In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [1]:
print("""
--> Gradient Descent:
    * used to minimize error
    * start with some initial value for each weight
        o compute the gradient of the error / loss function WRT that weight
        o the sign of the gradient gives the uphill direction, so move the weight the other way by subtracting
          the gradient scaled by a predetermined step size (learning rate, alpha): w = w - alpha * gradient
        o when the gradient is zero / below some threshold, the weight is at a (local) minimum of the error
        o if you are trying to maximize, rather than minimize, move in the direction of the gradient, not the opposite

--> Batch Gradient Descent:
    * Input some training data
    * Initialize with some random values
    * While not converged:
        o compute the MSE for the current weights w
        o compute the gradient of the error WRT w
        o update w = w - alpha * (del(error) wrt w)
    * Convergence conditions:
        o gradient -> 0 i.e. |gradient| < e for some threshold e close to zero
        o error stops decreasing i.e. |error_new - error_old| < e for some threshold e close to zero
        o alternatively, fix number of iterations
    * Pros:
        o few updates
        o stable convergence
    * Cons:
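        o spatially and computationally intensive (every update uses the whole training set)
        o premature convergence (can get stuck in a local minimum when the loss is not convex)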

--> Mini-Batch Gradient Descent:
    * split the train data into small batches and update based on batches
    * Input some training data
    * Initialize with random values
    * for epoch in range(num_epochs):
        shuffle data and split into batches
        for batch in batch_set:
            compute del(error wrt w)
            update w = w - alpha * del(error wrt w)
    * Pros:
        o higher update frequency -> more robust convergence
        o more efficient memory-wise and in implementation
    * Cons:
        o requires a hyperparameter: mini-batch size
        o if batch size is small, unstable convergence

--> Closed Form Solution:
    * error(w) = (y - Xw)^T * (y - Xw)
        o where X is an n x (p+1) matrix and each row is a data point
    * differentiate: del(error(w)) = -2 * X^T * (y - Xw)
    * error is minimized when gradient is zero
    * w = (X^T * X)^-1 * X^T * y

--> Polynomial Regression:
    * relationship between sample x and label y is modeled as an nth degree polynomial
    * ex.:
        o x = (x1, x2)
        o lin. reg: y_hat = w0 + w1x1 + w2x2
        o 2nd deg poly. reg: y_hat = w0 + w1x1 + w2x2 + w3x1x2 + w4x1^2 + w5x2^2
    * prediction is still linear wrt weights w
    
--> Performance Evaluation:
    * goal of training ML models: good future prediction
    * the model that fits the training data best may just memorize it (overfit) and not generalize well
    * for future performance, split into train and test set
        o generally 80:20 or 90:10 ratio
    * trained model is evaluated on test set
    
""")
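
The mini-batch loop in the notes above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `minibatch_gd`, its default hyperparameters, and the noiseless synthetic data are all assumptions made for the example, and the loss is taken to be the MSE of a linear model. Setting `batch_size` to the number of samples would give plain batch gradient descent.

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, num_epochs=200, batch_size=16, seed=0):
    """Mini-batch gradient descent for linear regression with an MSE loss (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)                      # initialize with random values
    for epoch in range(num_epochs):
        order = rng.permutation(n)              # shuffle data, then split into batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # gradient of the batch MSE with respect to w
            grad = -2.0 / len(idx) * Xb.T @ (yb - Xb @ w)
            w = w - alpha * grad                # update w = w - alpha * del(error wrt w)
    return w

# Hypothetical noiseless data: y = 2 + 3x, with a column of ones for the intercept
data_rng = np.random.default_rng(1)
x = data_rng.uniform(-1, 1, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x

w = minibatch_gd(X, y)
print(w)
```

Because the data here has no noise, every batch shares the same minimizer, so the weights settle close to the intercept and slope used to generate `y`. With noisy data and small batches the weights would keep jittering around the minimum instead.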

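The closed-form solution, the 2nd-degree polynomial features, and the 80:20 split from the notes above can be combined in one small sketch. The helper names `poly2_features` and `fit_closed_form`, the weights in `true_w`, and the synthetic data are assumptions for illustration only. The normal equations are solved with `np.linalg.solve` rather than by forming the inverse explicitly, which gives the same w = (X^T X)^-1 X^T y when X^T X is invertible.

```python
import numpy as np

def poly2_features(X):
    """Map rows (x1, x2) to (1, x1, x2, x1*x2, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

def fit_closed_form(Phi, y):
    """Solve (Phi^T Phi) w = Phi^T y, i.e. the point where the gradient is zero."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Hypothetical noiseless data generated from made-up weights
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
true_w = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])
y = poly2_features(X) @ true_w

# 80:20 train/test split on shuffled indices
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train, test = idx[:n_train], idx[n_train:]

# Fit on the training rows only, then evaluate on the held-out rows
w = fit_closed_form(poly2_features(X[train]), y[train])
test_mse = np.mean((poly2_features(X[test]) @ w - y[test]) ** 2)
print(w, test_mse)
```

The model is a polynomial in x1 and x2 but still linear in the weights, which is why the same closed-form solve applies once the features are expanded.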

--> Gradient Descent:
    * used to minimize error
    * start with some initial value for each weight
        o compute the gradient of the error / loss function WRT that weight
        o the sign of the gradient gives the uphill direction, so move the weight the other way by subtracting
          the gradient scaled by a predetermined step size (learning rate, alpha): w = w - alpha * gradient
        o when the gradient is zero / below some threshold, the weight is at a (local) minimum of the error
        o if you are trying to maximize, rather than minimize, move in the direction of the gradient, not the opposite
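
The update rule described above can be sketched on a one-dimensional loss. This is a minimal sketch under assumptions: the function `gradient_descent`, its default `alpha`, `tol`, and `max_iters`, and the example loss (w - 3)^2 are all hypothetical choices for illustration.

```python
def gradient_descent(grad, w0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeat w = w - alpha * grad(w) until the gradient is below tol."""
    w = w0
    steps = 0
    while steps < max_iters:
        g = grad(w)
        if abs(g) < tol:        # gradient close to zero: stop
            break
        w = w - alpha * g       # step against the gradient to minimize
        steps += 1
    return w, steps

# Example loss: (w - 3)^2, whose gradient is 2 * (w - 3) and whose minimum is at w = 3
w_min, steps = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min, steps)
```

To maximize instead, the update would add `alpha * g` rather than subtract it.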

--> Batch Gradient Descent:
    * Input some training data
    * Initialize with some random values
    * While not converged:
        o compute the MSE for the current weights w
        o compute the gradient of the error WRT w
        o update w = w - alpha * (del(error) wrt w)
    * Convergence conditions:
        o gradient -> 0 i.e. |gradient| < e for some threshold e close to zero
        o error doesn't decrease