### Notebook For DS4400 Lecture 8

By: John Henry Rudden

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Lasso as SKLasso, Ridge as SKRidge
import numpy as np
from numpy import ndarray
from enum import Enum
from matplotlib import pyplot as plt

In [2]:
X = np.genfromtxt('./data/stock_prediction_data.csv', delimiter=',')
y = np.genfromtxt('./data/stock_price.csv', delimiter=',')
y = y.reshape(-1, 1)

### Splitting Data into Train, Validation, and Test Sets

Use train to train the model, validation to compare models, and test to evaluate the final selected model.

In [3]:
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
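
As a sanity check on the proportions, the two-stage split above should leave roughly 80% of the rows for training and 10% each for validation and test. The sketch below shows the same idea using only NumPy index shuffling. The helper `three_way_split` and the row count of 1000 are made up for illustration and are not part of the notebook's pipeline.

```python
import numpy as np

def three_way_split(n_rows: int, seed: int = 42):
    """Hypothetical helper: shuffle row indices and cut them 80/10/10."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_train = int(0.8 * n_rows)          # 80% for training
    n_val = (n_rows - n_train) // 2      # half of the remainder for validation
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]     # everything left over is the test set
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index sets are disjoint, so no row is used both to fit a model and to evaluate it.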

### Scaling our Data

We will use SKLearn's StandardScaler to scale our data, so that each feature has a mean of 0 and a standard deviation of 1 on the training set.
It is important to scale our validation and test sets using the scaler that was fit to the training data. Since scaling is part of our preprocessing, we need to be consistent with the scaling across all of our data (train, validation, test, and future).

In [4]:
# use standard scaler to normalize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
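
To see what "fit on train, reuse everywhere" means in practice, here is a minimal NumPy sketch of standardization. The synthetic arrays `X_tr` and `X_va` are assumptions made up for this illustration; the point is only that the mean and standard deviation come from the training rows and are then applied unchanged to the validation rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic "train" features
X_va = rng.normal(loc=5.0, scale=2.0, size=(50, 3))   # synthetic "validation" features

mu = X_tr.mean(axis=0)     # per-feature mean, learned from train only
sigma = X_tr.std(axis=0)   # per-feature standard deviation, learned from train only

X_tr_scaled = (X_tr - mu) / sigma
X_va_scaled = (X_va - mu) / sigma  # reuse the train statistics, do not refit

print(X_tr_scaled.mean(axis=0))  # essentially 0 by construction
print(X_va_scaled.mean(axis=0))  # close to 0, but not exactly 0
```

The validation mean is only approximately 0 because its statistics were never used, which is exactly the behavior we want for data the model has not seen.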

## Polynomial Feature Mapping

In [5]:
# Fit the polynomial mapping once on train, then reuse it for validation and test
poly = PolynomialFeatures(degree=2, include_bias=True)
X_train = poly.fit_transform(X_train)
X_val = poly.transform(X_val)
X_test = poly.transform(X_test)
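
It may help to see what a degree-2 mapping produces for a single row. The sketch below builds the features by hand with NumPy. The helper `poly2` is hypothetical, and its column order (bias, linear terms, then squares and pairwise products) is meant to mirror what I understand `PolynomialFeatures(degree=2, include_bias=True)` to produce; treat that ordering as an assumption.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2(X: np.ndarray) -> np.ndarray:
    """Hypothetical helper: degree-2 polynomial map with a leading bias column."""
    n, d = X.shape
    cols = [np.ones(n)]                        # bias term
    cols += [X[:, j] for j in range(d)]        # degree-1 terms
    cols += [X[:, i] * X[:, j]                 # degree-2 terms, including squares
             for i, j in combinations_with_replacement(range(d), 2)]
    return np.column_stack(cols)

X_demo = np.array([[2.0, 3.0]])  # one row with features a=2, b=3
print(poly2(X_demo))             # columns: 1, a, b, a^2, a*b, b^2
```

With d original features the mapping yields 1 + d + d(d+1)/2 columns, so the feature count grows quickly as d increases.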

### Lasso Regression

As we saw in class, the objective for Lasso Regression is to minimize the following:

$$\min_w \frac{1}{n} \sum_{i=1}^n (\phi(x_i)^Tw - y_i)^2 + \lambda ||w||_1$$

Where $\lambda$ is a hyperparameter that we can tune to control the amount of regularization (how much weight the L1 norm gets in the objective). Notice that if $\lambda = 0$, we have the same objective as Linear Regression.

As you can see from the objective, Lasso Regression is simply Linear Regression with an additional term that penalizes the absolute value of the coefficients (L1 Norm).
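
The objective can be written out directly as a small function. This is only a sketch for checking values by hand; the name `lasso_objective` and the tiny demo arrays are made up for illustration.

```python
import numpy as np

def lasso_objective(w, Phi, y, lam):
    """Mean squared error plus lam times the L1 norm of w."""
    residual = Phi @ w - y
    return np.mean(residual ** 2) + lam * np.sum(np.abs(w))

Phi_demo = np.array([[1.0, 2.0], [1.0, -1.0]])
y_demo = np.array([[1.0], [0.0]])
w_demo = np.array([[0.5], [-0.5]])

print(lasso_objective(w_demo, Phi_demo, y_demo, lam=0.0))  # plain MSE
print(lasso_objective(w_demo, Phi_demo, y_demo, lam=1.0))  # MSE plus the L1 penalty
```

Here the residuals are -1.5 and 1.0, so the MSE is 1.625, and the L1 norm of the weights is 1.0. Setting `lam=1.0` therefore adds exactly 1.0 to the objective.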

### Taking the Derivative of the Objective

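We have also seen $\frac{\partial}{\partial w} ||w||_1$ before in class. Applied elementwise, this is simply the sign of each weight (strictly speaking a subgradient, since $|w_j|$ is not differentiable at $w_j = 0$; `np.sign` returns 0 there):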

$$\frac{\partial}{\partial w} ||w||_1 = sign(w)$$

Moreover, the complete derivative of the objective is:

$$\frac{\partial}{\partial w} \left[ \frac{1}{n} \sum_{i=1}^n (\phi(x_i)^Tw - y_i)^2 + \lambda ||w||_1 \right] = \frac{1}{n} \sum_{i=1}^n 2\phi(x_i)(\phi(x_i)^Tw - y_i) + \lambda sign(w)$$

In this code I will be using a vectorized implementation of the gradient of the objective, because it is much faster than using a for loop to calculate the gradient. The vectorized gradient of the objective is as follows:

$$\frac{2}{n} \Phi^T(\Phi w - y) + \lambda sign(w)$$
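
To convince yourself that the summation form and the vectorized form agree, you can compute both on random data and compare. This is a sketch under made-up inputs; the function names `grad_loop` and `grad_vectorized` are hypothetical and separate from the notebook's own gradient function.

```python
import numpy as np

def grad_loop(w, Phi, y, lam):
    """Gradient from the per-example sum, one row at a time."""
    n = Phi.shape[0]
    total = np.zeros_like(w)
    for i in range(n):
        phi_i = Phi[i].reshape(-1, 1)              # column vector phi(x_i)
        total += 2 * phi_i * (phi_i.T @ w - y[i])  # 2 * phi(x_i) * residual_i
    return total / n + lam * np.sign(w)

def grad_vectorized(w, Phi, y, lam):
    """Same gradient using matrix products."""
    n = Phi.shape[0]
    return 2 / n * Phi.T @ (Phi @ w - y) + lam * np.sign(w)

rng = np.random.default_rng(0)
Phi_demo = rng.normal(size=(20, 4))
y_demo = rng.normal(size=(20, 1))
w_demo = rng.normal(size=(4, 1))

print(np.allclose(grad_loop(w_demo, Phi_demo, y_demo, 0.3),
                  grad_vectorized(w_demo, Phi_demo, y_demo, 0.3)))
```

Both functions return an array with the same shape as `w`, which is what the update step `w -= α * gradient` needs.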

# Read if you are confused!
If any of the code is confusing, please check out the Lecture 7 notebook, where I do the full derivation of the vectorized implementation of Linear Regression. The same derivation applies directly to Lasso Regression and Ridge Regression.



In [6]:
def ߜLasso(w: ndarray, Φ: ndarray, y: ndarray, λ: float) -> ndarray:
    """
    Calculates the gradient of the Lasso loss function with respect to weights (w).

    Parameters:
    - w: ndarray, The weight vector.
    - Φ: ndarray, The feature matrix after applying the basis function. (Phi)
    - y: ndarray, The target values.
    - λ: float, The regularization parameter. (lambda)

    Returns:
    - ndarray: Gradient of the Lasso Objective with respect to w.
    """
    n, m = Φ.shape
    return 2/n * Φ.T.dot(Φ.dot(w) - y) + λ * np.sign(w) # if λ = 0, this is the gradient of the MSE loss function

def gradient_descent_lasso(Φ: ndarray, y: ndarray, α: float = 0.01, num_iter: int = 10_000, λ: float = 1) -> ndarray:
    """
    Performs gradient descent on the Lasso Regression objective to find the optimal w vector.

    Notice: This function is simply Linear Regression code from Lecture 7, except I changed one line. Which line?

    Parameters:
    - Φ: ndarray, The feature matrix (Phi).
    - y: ndarray, The target values.
    - α: float, The learning rate.
    - num_iter: int, The number of training iterations.
    - λ: float, The regularization parameter.

    Returns:
    - ndarray: The optimized weights vector.
    """
    n, m = Φ.shape
    w = np.zeros((m, 1))
    for _ in range(num_iter):
        gradient = ߜLasso(w, Φ, y, λ=λ)  # Gradient of Lasso Objective with respect to w
        
        # Checking for convergence (See note Lecture 7 Notebook)
        if np.all(np.abs(gradient) < 1e-5) or np.isnan(gradient).any():
            break
            
        # Check for gradient explosion (See note Lecture 7 Notebook)
        if np.isinf(gradient).any(): 
            raise ValueError("Gradient exploded")

        w -= α * gradient
    return w

def predict(Φ: ndarray, w: ndarray) -> ndarray:
    """
    Predicts the target values using the linear model.

    Parameters:
    - Φ: ndarray, The feature matrix. (Phi)
    - w: ndarray, The weights vector.

    Returns:
    - ndarray: Predicted values.
    """
    return Φ.dot(w)

def mse(y: ndarray, y_hat: ndarray) -> float:
    """
    Calculates the Mean Squared Error (MSE) between actual and predicted values.

    Parameters:
    - y: ndarray, Actual target values.
    - y_hat: ndarray, Predicted values by the model.

    Returns:
    - float: The mean squared error between actual and predicted values.
    """
    return np.mean((y - y_hat)**2)

### Using My Lasso Regression Function

In [7]:
w_lasso_gd = gradient_descent_lasso(X_train, y_train, λ=1)
pred_train = predict(X_train, w_lasso_gd)
print(f"My Lasso Regression Train MSE: {mse(y_train, pred_train)}")
pred_val = predict(X_val, w_lasso_gd)
print(f"My Lasso Regression Validation MSE: {mse(y_val, pred_val)}")

My Lasso Regression Train MSE: 2.122974823830284
My Lasso Regression Validation MSE: 2.3665208202191623


### SKLearn's Lasso Regression

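In SKLearn's Lasso regression, `alpha` plays the role of the $\lambda$ value in the lecture notes: the higher the alpha, the more the coefficients are pushed towards zero. Note that SKLearn scales its objective differently from the one above, so the same numeric value of alpha and $\lambda$ need not mean the same amount of regularization.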


In [8]:
sk_poly_lasso = SKLasso(alpha=1)
sk_poly_lasso.fit(X_train,y_train.flatten()) # y is 2D, but scikit-learn expects 1D
pred_train = sk_poly_lasso.predict(X_train).reshape(-1,1)
print(f"SKLearn Lasso Train MSE: {mse(y_train, pred_train)}")
pred_val = sk_poly_lasso.predict(X_val).reshape(-1,1)
print(f"SKLearn Lasso Validation MSE: {mse(y_val, pred_val)}")

SKLearn Lasso Train MSE: 6.831045121140056
SKLearn Lasso Validation MSE: 7.629002980184824


### Seems Like for λ = 1, my Lasso Regression is Better (on Validation set)...

You may be curious why this is the case. I am too. Check out [SKLearn's Lasso Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) to see if you can figure out why. Hint: do they use a learning rate?
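
One thing worth checking is how the two objectives are scaled. As I read SKLearn's Lasso documentation, its objective is $\frac{1}{2n} ||y - Xw||_2^2 + \alpha ||w||_1$, whereas ours uses $\frac{1}{n}$ in front of the squared error. The sketch below compares the two formulas numerically on random data. The function names are hypothetical, intercept handling is ignored, and this covers only one possible source of the gap (the solver is another).

```python
import numpy as np

def notebook_lasso_objective(w, Phi, y, lam):
    """Objective used in this notebook: (1/n) * squared error + lam * L1."""
    n = Phi.shape[0]
    return np.sum((Phi @ w - y) ** 2) / n + lam * np.sum(np.abs(w))

def sklearn_style_lasso_objective(w, Phi, y, alpha):
    """Form quoted from SKLearn's docs (assumption): (1/(2n)) * squared error + alpha * L1."""
    n = Phi.shape[0]
    return np.sum((Phi @ w - y) ** 2) / (2 * n) + alpha * np.sum(np.abs(w))

rng = np.random.default_rng(0)
Phi_demo = rng.normal(size=(30, 5))
y_demo = rng.normal(size=(30, 1))
w_demo = rng.normal(size=(5, 1))

lam = 1.0
ours = notebook_lasso_objective(w_demo, Phi_demo, y_demo, lam)
theirs = sklearn_style_lasso_objective(w_demo, Phi_demo, y_demo, alpha=lam / 2)
print(np.isclose(ours, 2 * theirs))  # same objective up to a constant factor of 2
```

If that reading of the documentation is right, `alpha=1` in SKLearn corresponds to $\lambda = 2$ in our formulation, which is a stronger penalty than the $\lambda = 1$ used in my implementation.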

### Using Validation to Select Best Lambda

In [9]:
valid_lambdas = [0, 0.1, 1, 10] # list of values to try for lambda
best_lambda_lasso = None # store the best lambda value
best_mse_lasso = float('inf') # store the best mse value
for λ in valid_lambdas:
    w_lasso_gd = gradient_descent_lasso(X_train, y_train, λ=λ) # train the model with Train data and λ
    pred_val = predict(X_val, w_lasso_gd) # predict the validation data
    mse_ = mse(y_val, pred_val)
    print(f"My Lasso Regression Validation MSE: {mse_} for lambda: {λ}")
    if mse_ < best_mse_lasso:
        best_mse_lasso = mse_
        best_lambda_lasso = λ
print(f"Best lambda: {best_lambda_lasso} with MSE: {best_mse_lasso}")

My Lasso Regression Validation MSE: 0.09295127686258214 for lambda: 0
My Lasso Regression Validation MSE: 0.09554581350427133 for lambda: 0.1
My Lasso Regression Validation MSE: 2.3665208202191623 for lambda: 1
My Lasso Regression Validation MSE: 46.705761270943 for lambda: 10
Best lambda: 0 with MSE: 0.09295127686258214


# Ridge Regression Time!

Instead of using the L1 Norm, Ridge Regression uses the L2 Norm. The objective for Ridge Regression is to minimize the following:

$$\min_w \frac{1}{n} \sum_{i=1}^n (\phi(x_i)^Tw - y_i)^2 + \lambda ||w||_2^2$$

Just like in Lasso, $\lambda$ is a hyperparameter that we can tune to control the amount of regularization. Similar to Lasso, if $\lambda = 0$, then we have the same objective as Linear Regression.

### Taking the Derivative of the Objective

Taking the derivative of the squared L2 Norm is super simple:

$$\frac{\partial}{\partial w} ||w||_2^2 = 2w$$

Moreover, the derivative of the objective is also super simple:

$$\frac{\partial}{\partial w} \left[ \frac{1}{n} \sum_{i=1}^n (\phi(x_i)^Tw - y_i)^2 + \lambda ||w||_2^2 \right] = \frac{1}{n} \sum_{i=1}^n 2\phi(x_i)(\phi(x_i)^Tw - y_i) + 2\lambda w$$

Once again I will be using the vectorized implementation of the gradient of the objective. This is as follows:

$$\frac{2}{n} \Phi^T(\Phi w - y) + 2\lambda w$$
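
Unlike the Lasso gradient, this one is linear in $w$, so setting it to zero gives a closed-form solution: $(\Phi^T\Phi + n\lambda I)w = \Phi^Ty$. The sketch below solves that system with NumPy and checks that the gradient vanishes at the result. The function names and random data are made up for illustration; gradient descent in the next cell targets the same minimizer iteratively.

```python
import numpy as np

def ridge_closed_form(Phi, y, lam):
    """Solve (Phi^T Phi + n*lam*I) w = Phi^T y for w."""
    n, m = Phi.shape
    A = Phi.T @ Phi + n * lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y)

def ridge_gradient(w, Phi, y, lam):
    """Vectorized gradient of the Ridge objective."""
    n = Phi.shape[0]
    return 2 / n * Phi.T @ (Phi @ w - y) + 2 * lam * w

rng = np.random.default_rng(0)
Phi_demo = rng.normal(size=(50, 4))
y_demo = rng.normal(size=(50, 1))

w_star = ridge_closed_form(Phi_demo, y_demo, lam=0.5)
print(np.allclose(ridge_gradient(w_star, Phi_demo, y_demo, lam=0.5), 0.0, atol=1e-8))
```

For $\lambda > 0$ the matrix $\Phi^T\Phi + n\lambda I$ is always invertible, which is one practical benefit of the L2 penalty.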

In [10]:
def ߜRidge(w: ndarray, Φ: ndarray, y: ndarray, λ: float) -> ndarray:
    """
    Calculates the gradient of the Ridge loss function with respect to weights (w).

    Parameters:
    - w: ndarray, The weight vector.
    - Φ: ndarray, The feature matrix after applying the basis function. (Phi)
    - y: ndarray, The target values.
    - λ: float, The regularization parameter. (lambda)

    Returns:
    - ndarray: Gradient of the Ridge Objective with respect to w.
    """
    n, m = Φ.shape
    return 2/n * Φ.T.dot(Φ.dot(w) - y) + 2 * λ * w # if λ = 0, this is the gradient of the MSE loss function

def gradient_descent_ridge(Φ: ndarray, y: ndarray, α: float = 0.01, num_iter: int = 10_000, λ: float = 1) -> ndarray:
    """
    Performs gradient descent on the Ridge Regression objective to find the optimal w vector.

    Parameters:
    - Φ: ndarray, The feature matrix (Phi).
    - y: ndarray, The target values.
    - α: float, The learning rate.
    - num_iter: int, The number of training iterations.
    - λ: float, The regularization parameter.

    Returns:
    - ndarray: The optimized weights vector.
    """
    n, m = Φ.shape
    w = np.zeros((m, 1))
    for _ in range(num_iter):
        gradient = ߜRidge(w, Φ, y, λ=λ)  # Gradient of Ridge Objective with respect to w (Notice this is the only change from Lasso)
        
        # Checking for convergence (See note Lecture 7 Notebook)
        if np.all(np.abs(gradient) < 1e-5) or np.isnan(gradient).any():
            break
            
        # Check for gradient explosion (See note Lecture 7 Notebook)
        if np.isinf(gradient).any(): 
            raise ValueError("Gradient exploded")

        w -= α * gradient
    return w

In [11]:
w_ridge_gd = gradient_descent_ridge(X_train, y_train, λ=1)
pred_train = predict(X_train, w_ridge_gd)
print(f"My Ridge Regression Train MSE: {mse(y_train, pred_train)}")
pred_val = predict(X_val, w_ridge_gd)
print(f"My Ridge Regression Validation MSE: {mse(y_val, pred_val)}")

My Ridge Regression Train MSE: 12.639576707863057
My Ridge Regression Validation MSE: 16.01545282823887


### Using SKLearn's Ridge Regression

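Once again, SKLearn's Ridge regression uses `alpha` instead of $\lambda$. The higher the alpha, the more the coefficients are pushed towards zero. As with Lasso, SKLearn scales its Ridge objective differently from ours, so `alpha=1` need not match $\lambda = 1$ in my implementation.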

In [12]:
sk_poly_ridge = SKRidge(alpha=1)
sk_poly_ridge.fit(X_train,y_train.flatten()) # y is 2D, but scikit-learn expects 1D
pred_train = sk_poly_ridge.predict(X_train).reshape(-1,1)
print(f"SKLearn Ridge Train MSE: {mse(y_train, pred_train)}")
pred_val = sk_poly_ridge.predict(X_val).reshape(-1,1)
print(f"SKLearn Ridge Validation MSE: {mse(y_val, pred_val)}")

SKLearn Ridge Train MSE: 0.03284528040883205
SKLearn Ridge Validation MSE: 0.10389999417330495


### SKLearn's Ridge Regression Seems to Perform Way Better Than My Ridge Regression...

Once again, you may be curious why this is the case. Just like in the Lasso Regression case, SKLearn implements its learning a bit differently than we learned in class. Check out [SKLearn's Ridge Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) to see how. Hint: do they use a learning rate, and what is the `solver` parameter?

### Selecting the Best Lambda for Ridge Regression

In [13]:
valid_lambdas = [0, 0.1, 1, 10] # candidate values for lambda (this is the same process I used for Lasso)
best_lambda_ridge = None # store the best lambda value
best_mse_ridge = float('inf') # store the best mse value
for λ in valid_lambdas:
    w_ridge_gd = gradient_descent_ridge(X_train, y_train, λ=λ) 
    pred_val = predict(X_val, w_ridge_gd) # predict the validation data
    mse_ = mse(y_val, pred_val)
    print(f"My Ridge Regression Validation MSE: {mse_} for lambda: {λ}")
    if mse_ < best_mse_ridge:
        best_mse_ridge = mse_
        best_lambda_ridge = λ
print(f"Best lambda: {best_lambda_ridge} with MSE: {best_mse_ridge}")

My Ridge Regression Validation MSE: 0.09295127686258214 for lambda: 0
My Ridge Regression Validation MSE: 1.1402329224528371 for lambda: 0.1
My Ridge Regression Validation MSE: 16.01545282823887 for lambda: 1
My Ridge Regression Validation MSE: 42.66613532773383 for lambda: 10
Best lambda: 0 with MSE: 0.09295127686258214


# Comparing Lasso and Ridge Regression

Which one is better? Well, we can compare the validation MSE of the best Lasso model against that of the best Ridge model.

In [14]:
print('Which model is better for our data?')

# Initial assumption that no model is selected
best_lambda = None
best_w = None

if best_mse_lasso < best_mse_ridge:
    print("Lasso is better")
    best_lambda = best_lambda_lasso
    best_w = gradient_descent_lasso(X_train, y_train, λ=best_lambda)
elif best_mse_ridge < best_mse_lasso:
    print("Ridge is better")
    best_lambda = best_lambda_ridge
    best_w = gradient_descent_ridge(X_train, y_train, λ=best_lambda)
else:
    print("There seems to be no clear winner. Both models perform equally well.")
    # If the MSEs are equal, check whether either best lambda is 0
    if best_lambda_lasso == 0 or best_lambda_ridge == 0:
        print("Linear Regression is recommended as the simpler model because lambda is 0.")
        best_lambda = 0  # record the selected lambda so it is not left as None
        best_w = gradient_descent_lasso(X_train, y_train, λ=best_lambda) # since λ=0, this is just linear regression
    else:
        # Both best lambdas are nonzero and the MSEs tie; choose based on preference or additional criteria
        print("Both models perform equally well. Choose based on additional criteria or preference.")
        best_lambda = best_lambda_lasso  # This is arbitrary; you could choose Ridge similarly
        best_w = gradient_descent_lasso(X_train, y_train, λ=best_lambda)

pred_test = predict(X_test, best_w)
print(f"Test MSE: {mse(y_test, pred_test)}")


Which model is better for our data?
There seems to be no clear winner. Both models perform equally well.
Linear Regression is recommended as the simpler model because lambda is 0.
Test MSE: 0.0677020386538741
