<span style="font-family: 'JetBrains Mono', monospace; font-size:16px;">

# Problem Set: Cost Function Fun
In this problem, we examine how regularization (Ridge and Lasso) helps prevent overfitting by shrinking model weights, and how changing the regularization strength affects training error, test error, and the bias-variance trade-off. Using a synthetic dataset, we fit Ridge and Lasso models over a range of regularization values, look at the effect on the coefficient norms, and use cross-validation to pick the best model and evaluate it.
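For reference, the two penalized objectives can be written roughly as below. This is the common textbook form, with $w$ the weight vector and $\lambda \ge 0$ the regularization strength; libraries differ in how they scale the squared-error term, so treat the constants as an assumption rather than the exact objective any given implementation minimizes.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2
\qquad\qquad
\text{Lasso:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1
$$

In both cases a larger $\lambda$ makes large weights more expensive, which is what shrinks the coefficients.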

## Problem 1: Regularization

### Regularization Hyperparameter Questions
1. What is the difference between a model's parameters and a model's hyperparameters?
- **Parameter:** a value the model learns from the data, for example the weight coefficients in linear regression.
- **Hyperparameter:** a value we choose before training, for example the learning rate, the regularization strength λ (`alpha` in scikit-learn), or the number of trees in a random forest.

2. As we increase λ from 0, how will the training MSE (mean squared error) change? How will the test MSE change? Sketch the bias-variance trade-off.

As λ increases from 0, the training MSE increases because stronger regularization prevents the model from fitting the training data as closely. The test MSE typically first decreases (less overfitting, lower variance) and then increases (underfitting, higher bias), forming a U-shaped curve.
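The training-side half of this answer can be checked with a minimal NumPy sketch. It assumes closed-form ridge without an intercept and a small made-up dataset (not the problem-set data); `ridge_fit` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights w = (X^T X + lam * I)^{-1} X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical toy data, used only to illustrate the trend
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(scale=0.5, size=40)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
train_mse = []
weight_norms = []
for lam in lambdas:
    w = ridge_fit(X, y, lam)
    train_mse.append(float(np.mean((y - X @ w) ** 2)))
    weight_norms.append(float(np.linalg.norm(w)))

# Training MSE should rise and the L2 norm of the weights should fall as lambda grows
for lam, mse, norm in zip(lambdas, train_mse, weight_norms):
    print(f"lambda={lam:6.1f}  train MSE={mse:8.4f}  ||w||={norm:7.4f}")
```

The test-MSE side of the curve depends on the data and the noise level, so it is not asserted here.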

3. If training and validation errors are both high and almost equal, should you increase or decrease λ?

We should decrease λ. Training and validation errors that are both high and nearly equal indicate underfitting (high bias), which happens when λ is too large. Lowering λ gives the model more freedom to learn patterns in the data.
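As a rough illustration of this diagnosis, the sketch below compares a far-too-large λ with a smaller one on made-up data. The data, the 200/100 split, and the helper `ridge_errors` are all assumptions chosen for the example, and ridge is again solved in closed form without an intercept.

```python
import numpy as np

# Hypothetical toy data with a strong linear signal
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 4.0]) + rng.normal(scale=0.5, size=300)

X_tr, y_tr = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

def ridge_errors(lam):
    """Return (training MSE, validation MSE) for closed-form ridge, no intercept."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)
    train_mse = float(np.mean((y_tr - X_tr @ w) ** 2))
    val_mse = float(np.mean((y_val - X_val @ w) ** 2))
    return train_mse, val_mse

# lambda far too large: weights are squashed toward zero, so both errors are high
train_big, val_big = ridge_errors(1e6)
# smaller lambda: the model can fit the signal, so both errors drop
train_small, val_small = ridge_errors(1.0)

print("lambda=1e6:", train_big, val_big)
print("lambda=1.0:", train_small, val_small)
```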

## The Different Kinds of Regression

</span>

In [4]:

""" 

    - generate a regression dataset
    - split data into 80% for training and 20% for testing, set random_state to 23
    - use Ridge implementation and fit several ridge regression models 


"""

# sklearn imports
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# other relevant imports
import numpy as np

X, y = make_regression(n_samples=200, n_features=50, n_informative=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

# range of alpha values
alphas = np.arange(0.1, 100, 0.1)
coefficient_norms = []

for alpha in alphas:

    model = Pipeline([
        
        # scale the data and apply ridge
        ("scaler", StandardScaler()),
        ("ridge", Ridge(alpha=alpha))
    ])

    # fitting the model
    model.fit(X_train, y_train)

    # extracting coefficients
    coefficients = model.named_steps["ridge"].coef_

    # find L2 norm
    norm = np.linalg.norm(coefficients)
    coefficient_norms.append(norm)

print("Coefficient Norms: ", coefficient_norms)

Coefficient Norms:  [np.float64(130.75113707973009), np.float64(130.63842221489824), np.float64(130.52602236479603), np.float64(130.41393567377773), np.float64(130.30216030371352), np.float64(130.1906944337651), np.float64(130.07953626016456), np.float64(129.96868399599708), np.float64(129.85813587098684), np.float64(129.74789013128657), np.float64(129.6379450392699), np.float64(129.5282988733273), np.float64(129.41894992766564), np.float64(129.30989651210967), np.float64(129.2011369519081), np.float64(129.09266958754128), np.float64(128.9844927745328), np.float64(128.87660488326364), np.float64(128.76900429878907), np.float64(128.66168942065872), np.float64(128.5546586627388), np.float64(128.44791045303774), np.float64(128.34144323353382), np.float64(128.23525546000613), np.float64(128.12934560186713), np.float64(128.0237121419987), np.float64(127.91835357659008), np.float64(127.81326841497814), np.float64(127.7084551794908), np.float64(127.60391240529164), np.float64(127.499638640227

In [3]:

""" 

    - performing cross validation on the training dataset to find best model between linear, lasso and ridge

"""

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_validate

# linear regression model
lr_model = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression())
])

lr_cv = cross_validate(
    lr_model, X_train, y_train,
    cv=5, scoring="neg_mean_squared_error"
)

lr_mse = -1 * np.mean(lr_cv["test_score"])


# ridge regression model
ridge_mse_list = []

for alpha in alphas:

    ridge_model = Pipeline([
        ("scaler", StandardScaler()),
        ("ridge", Ridge(alpha=alpha))
    ])

    cv_result = cross_validate(
        ridge_model, X_train, y_train,
        cv=5, scoring="neg_mean_squared_error"
    )

    ridge_mse_list.append(-np.mean(cv_result["test_score"]))

best_ridge_alpha = alphas[np.argmin(ridge_mse_list)]
best_ridge_mse = min(ridge_mse_list)


# lasso regression model
lasso_mse_list = []

for alpha in alphas:

    lasso_model = Pipeline([
        ("scaler", StandardScaler()),
        ("lasso", Lasso(alpha=alpha))
    ])

    cv_result = cross_validate(
        lasso_model, X_train, y_train,
        cv=5, scoring="neg_mean_squared_error"
    )

    lasso_mse_list.append(-np.mean(cv_result["test_score"]))

best_lasso_alpha = alphas[np.argmin(lasso_mse_list)]
best_lasso_mse = min(lasso_mse_list)

# results
print('Linear Regression MSE: ', lr_mse)
print('Best Ridge Alpha: ', best_ridge_alpha, 'Ridge MSE: ', best_ridge_mse)
print("Best Lasso Alpha: ", best_lasso_alpha, 'Lasso MSE: ', best_lasso_mse)



Linear Regression MSE:  161.9683165445073
Best Ridge Alpha:  0.30000000000000004 Ridge MSE:  161.65602954648867
Best Lasso Alpha:  0.8 Lasso MSE:  119.60240061543118


In [5]:

"""
    - evaluate the model on the test set

"""

from sklearn.metrics import mean_squared_error


# train best ridge on training set and then test
best_ridge_model = Pipeline([
    ("scaler", StandardScaler()), 
    ("ridge", Ridge(alpha=best_ridge_alpha))
])

best_ridge_model.fit(X_train, y_train)
ridge_test_mse = mean_squared_error(y_test, best_ridge_model.predict(X_test))

# train best lasso on training set and then test
best_lasso_model = Pipeline([ 
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=best_lasso_alpha))
])

best_lasso_model.fit(X_train, y_train)
lasso_test_mse = mean_squared_error(y_test, best_lasso_model.predict(X_test))

# train linear regression model on training set and then test 
lr_model.fit(X_train, y_train)
lr_test_mse = mean_squared_error(y_test, lr_model.predict(X_test))

print("Linear Regression Test MSE:", lr_test_mse)
print("Best Ridge Alpha:", best_ridge_alpha, " Ridge Test MSE:", ridge_test_mse)
print("Best Lasso Alpha:", best_lasso_alpha, " Lasso Test MSE:", lasso_test_mse)


Linear Regression Test MSE: 176.49701775114667
Best Ridge Alpha: 0.30000000000000004  Ridge Test MSE: 176.601921740432
Best Lasso Alpha: 0.8  Lasso Test MSE: 148.67612432257064


<span style="font-family: 'JetBrains Mono', monospace; font-size:16px;">


## Problem 2: Class Weighted Algorithm
In this problem, we explore how class imbalance affects model performance by comparing a standard logistic regression model with a class-weighted version on the credit card fraud dataset. We load the data, check how imbalanced it is, scale the features, train both models, and then compare their recall on the minority (fraud) class.
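To see why recall on the fraud class is the metric of interest rather than accuracy, consider a trivial baseline that always predicts "not fraud". The sketch below uses the class counts this dataset reports (284,314 legitimate and 492 fraudulent transactions); the always-zero classifier itself is a hypothetical baseline, not one of the models trained in this problem.

```python
n_legit, n_fraud = 284314, 492

# Confusion-matrix entries for a baseline that always predicts class 0 (not fraud)
true_positives = 0
false_negatives = n_fraud
true_negatives = n_legit
false_positives = 0

accuracy = (true_positives + true_negatives) / (n_legit + n_fraud)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

The baseline scores very high accuracy while catching no fraud at all, which is why the comparison below is made on recall.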

### Questions on Class Weighted

1. If class 1 is the minority class, how should (v1) be chosen with respect to (v0)?

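If class 1 is the minority class, we choose v1 larger than v0 so that the model treats mistakes on class 1 as more costly. A common choice is to make each weight inversely proportional to its class frequency, which is what `class_weight="balanced"` does in scikit-learn.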
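One way to pick the weights is the "balanced" heuristic, `n_samples / (n_classes * class_count)`. The sketch below applies it to the class counts reported for this dataset; treating `v0` and `v1` from the question as these per-class weights is an assumption made for illustration.

```python
# Class counts reported for the credit card dataset
counts = {0: 284314, 1: 492}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight is inversely proportional to class frequency
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

v0, v1 = class_weights[0], class_weights[1]
print(f"v0 = {v0:.4f}, v1 = {v1:.4f}, ratio v1/v0 = {v1 / v0:.1f}")
```

With these weights, the ratio v1/v0 equals the ratio of majority to minority counts, so the two classes contribute equally to the total weighted loss.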

</span>



In [None]:

"""

    - load the creditcard.csv file
    - check proportion of each class and comment on severity of class imbalance
    - do the train-test split
    - fitting two models on the training data
    - test the two models on the testing set

"""

import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score

# importing as a pandas data frame
card_df = pd.read_csv("data/creditcard.csv")

# selecting the target feature
labels = card_df.iloc[:, -1]

# how many 0's and 1's and find the proportions
print("Counts: ", labels.value_counts(), "\n")
print("Proportions: ", labels.value_counts(normalize=True), "\n")

# train-test split
features = card_df.iloc[:, :-1]
label = card_df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split( 
    features, label, 
    test_size = 0.2, 
    random_state=42
)

# scaling the data 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# logistic regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# weighted logistic regression
log_reg_weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
log_reg_weighted.fit(X_train_scaled, y_train)

# performing predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_log_weighted = log_reg_weighted.predict(X_test_scaled)

# recall scores
recall_normal = recall_score(y_test, y_pred_log)
recall_weighted = recall_score(y_test, y_pred_log_weighted)

print("Recall for Normal Logistic Regression: ", recall_normal)
print("Recall for Weighted Logistic Regression:", recall_weighted)



Counts:  0
0    284314
1       492
Name: count, dtype: int64 

Proportions:  0
0    0.998273
1    0.001727
Name: proportion, dtype: float64 

Recall for Normal Logistic Regression:  0.5795454545454546
Recall for Weighted Logistic Regression: 0.9090909090909091
