# Assignment 7: Linear Model Selection and Regularization
Derived from MLEARN51-Assignment7-Student_Name.ipynb in Canvas MLEARN 510 Spring 2024.   <br>
Modified by Ernst Henle. Modifications Copyright © 2024 by Ernst Henle<br>
<br>
## Learning Objectives
- Produce a model with L2 regularization, with a statistically significant improvement over a model without regularization.
- Produce a model with L1 regularization, with a statistically significant improvement over a model without regularization.
- Produce a model with both L1 and L2 regularization terms, with a statistically significant improvement over a model without regularization.
- Produce a generalized additive model with a statistically significant improvement over the null model (a model without input variables).

In [None]:
# The following code can be removed
import time
start_time = time.time()

In [None]:
# Packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, LinearRegression, ElasticNet, Lasso
from sklearn.metrics import mean_squared_error

# Fitting Lasso can raise many ConvergenceWarning messages (possibly over 50), so all warnings are suppressed
import warnings
warnings.filterwarnings('ignore')

# our favorite magic
%matplotlib inline

## Get Data and Basic EDA
<br>
Dataset(s) needed:
Kaggle House Prices (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)<br>

This data is only the training data.  We will not use the actual test data in this exercise.  All "tests" will be validations done on validation data that is taken from the training data.

In [None]:
train = pd.read_csv('../data/House Prices.csv')
print(train.shape)
print(train.dtypes.value_counts())
train.head()

 Question 1.1: Drop the Id column from the data as it is not needed for prediction and may actually lead to overfitting.

In [None]:
print(f"Original shape: {train.shape}")
train.drop('Id', axis=1, inplace=True)
print(f"Shape after dropping Id: {train.shape}")

 Question 1.2: Visualize a scatter plot of 'GrLivArea' in the x-axis and 'SalePrice' in the y-axis. Can you spot any outliers?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.figure(figsize=(10,6))
sns.scatterplot(x=train['GrLivArea'], y=train['SalePrice'])
plt.title('GrLivArea vs. SalePrice')
plt.show()

**Discussion of Ridge Performance vs. Alpha (Question 5.2)**

The Ridge RMSE plot from Question 5.2 shows the Root Mean Squared Error (RMSE) on both the training and validation datasets as the shrinkage parameter `alpha` varies.

*   **Estimating the Best Alpha Value:**
    As with Lasso, the "best" alpha value for Ridge is the one at which the validation RMSE is lowest. On the plot, find the lowest point of the 'Validation RMSE (Ridge)' line. *Read the approximate value off your own plot and state it here, for example: "The validation RMSE for Ridge appears to be lowest around an alpha of 1 to 10."*

*   **Comparison of Training and Validation Plots:**
    - At very low `alpha` values, Ridge behaves much like unregularized linear regression, so the training RMSE is at its lowest. The validation RMSE may be noticeably higher if the model is overfitting.
    - As `alpha` increases, the training RMSE for Ridge increases, because the regularization penalizes large coefficients and makes the model simpler.
    - The validation RMSE for Ridge typically shows a U-shape. It may decrease at first if the model was overfitting, reach a minimum, and then increase as `alpha` becomes very large and the model starts to underfit.
    - Compared to Lasso, Ridge shrinks coefficients towards zero but does not set them exactly to zero for any finite `alpha`. Ridge therefore keeps all features in the model and only penalizes their magnitudes. As a result, the validation RMSE curve may show a smoother U-shape than the Lasso curve.

**Discussion of Lasso Performance vs. Alpha (Question 4.3)**

The Lasso RMSE plot from Question 4.3 shows the Root Mean Squared Error (RMSE) on both the training and validation datasets as the shrinkage parameter `alpha` varies.

*   **Estimating the Best Alpha Value:**
    The "best" alpha value is the one at which the validation RMSE is lowest. On the plot, find the lowest point of the 'Validation RMSE' line; it marks the best trade-off between bias and variance for the Lasso model on unseen data. *Read the approximate value off your own plot and state it here, for example: "The validation RMSE appears to be lowest around an alpha of 0.1 to 1."*

*   **Comparison of Training and Validation Plots:**
    - At very low `alpha` values (towards the left of the plot), the model is close to unregularized linear regression: it is complex and fits the training data well, so the training RMSE is at its lowest. If the validation RMSE is close to the training RMSE, the model generalizes well; if it is much higher, the model is overfitting.
    - As `alpha` increases, the training RMSE increases. Stronger regularization (larger `alpha`) forces the model to become simpler, so it fits the training data less closely.
    - The validation RMSE typically shows a U-shape (or part of one). It may decrease at first as `alpha` increases from very small values, if the model was overfitting. It then reaches a minimum (the optimal `alpha`). Beyond that point the model becomes too simple (high bias), and the validation RMSE rises again due to underfitting.
    - The gap between the training and validation RMSE is also informative. A large gap often indicates overfitting (the model performs much better on training data than on validation data). Regularization aims to reduce this gap by improving generalization.
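
The visual estimate of the best alpha can be cross-checked numerically. The following is a minimal sketch, not part of the assignment solution: it uses a synthetic U-shaped curve (not results from this dataset) as a stand-in for `lasso_test_rmse_list`, and the names `val_rmse`, `best_idx`, and `best_alpha` are hypothetical.

```python
import numpy as np

# Same alpha grid as suggested in Question 4.2
alpha_vals = 10 ** np.arange(-1, 4, 0.2)

# Synthetic stand-in for the validation RMSE list: a U-shape in log10(alpha)
# whose minimum is placed at alpha = 10 purely for illustration.
val_rmse = (np.log10(alpha_vals) - 1.0) ** 2 + 5.0

best_idx = int(np.argmin(val_rmse))  # position of the lowest validation RMSE
best_alpha = alpha_vals[best_idx]
print(f"Best alpha: {best_alpha:.3g} (validation RMSE {val_rmse[best_idx]:.3f})")
```

With real results, replacing `val_rmse` by `np.array(lasso_test_rmse_list)` should give the alpha at the bottom of the validation curve, assuming that list was filled in the same order as `alpha_vals`.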

**Outlier Discussion (Q1.2):**
Looking at the scatter plot of 'GrLivArea' vs 'SalePrice', there appear to be a few data points with very large 'GrLivArea' that do not follow the general trend of increasing 'SalePrice'. Specifically, points with 'GrLivArea' greater than 4000 seem like potential outliers as their 'SalePrice' is not correspondingly high.

Question 1.3: Remove the outliers by dropping all rows with GrLivArea greater than 4000, then check the scatter plot again.

In [None]:
print(f"Shape before removing outliers: {train.shape}")
train = train[train['GrLivArea'] <= 4000]
print(f"Shape after removing outliers: {train.shape}")

plt.figure(figsize=(10,6))
sns.scatterplot(x=train['GrLivArea'], y=train['SalePrice'])
plt.title('GrLivArea vs. SalePrice (Outliers Removed)')
plt.show()

Question 2.1: Convert categorical variables into dummy variables using the pandas `get_dummies` API

Do not use sklearn.  In sklearn you would have to do the following:
1. identify the category columns in the dataframe
2. create a one-hot-encoder object
3. one-hot-encode the category columns of the dataframe and put results in a new dataframe
4. drop the category columns from the original dataframe to create a dataframe of the original numeric variables
5. combine the new dataframe of one-hot-encoded variables with the numeric variable of the original dataframe

<br><br>
Do the following:
1. One-hot-encode using pandas `get_dummies`.  Pass the dataframe as the argument to `get_dummies` and assign the output to the same variable name.
2. Present the shape of the data.  Use `shape` as was done above.  How many columns were added?
3. Present the counts of each data type.  Use `dtypes` and `value_counts` as was done above.  How have the data types changed?

In [None]:
print(f"Shape before one-hot encoding: {train.shape}")
print(f"Data types before one-hot encoding:\n{train.dtypes.value_counts()}")

train = pd.get_dummies(train)

print(f"Shape after one-hot encoding: {train.shape}")
print(f"Data types after one-hot encoding:\n{train.dtypes.value_counts()}")

**One-Hot Encoding Discussion (Q2.1):**
After applying `pd.get_dummies()`, the number of columns increased significantly (compare the two shapes printed above). This is because each category in the original object/categorical columns was converted into a new binary column. Consequently, the data types also changed: the 'object' type columns were replaced by one-hot encoded indicator columns. Their dtype depends on the pandas version: 'uint8' (unsigned 8-bit integer, values 0 or 1) in pandas versions before 2.0, and 'bool' (values False or True) in pandas 2.0 and later.
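
To make the column expansion concrete, here is a minimal sketch on a toy dataframe. The values are made up for illustration and the variable names `toy` and `encoded` are hypothetical; only the column names `LotArea` and `Street` are borrowed from the housing data.

```python
import pandas as pd

# Toy data: one numeric column and one categorical column with two categories
toy = pd.DataFrame({
    "LotArea": [1000, 2000, 3000],
    "Street": ["Pave", "Grvl", "Pave"],
})

# The numeric column passes through unchanged; the categorical column is
# replaced by one indicator column per category.
encoded = pd.get_dummies(toy)

print(f"Shape before: {toy.shape}, after: {encoded.shape}")
print(list(encoded.columns))
print(encoded.dtypes.value_counts())
```

One categorical column with two categories becomes two indicator columns, so the column count grows by one here. The dtype counts printed on the last line will differ between pandas versions.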

Question 2.2: Impute missing data by the median of each column.
1. Count the total number of nulls in the data
2. Replace nulls with column medians
3. Count the total number of nulls in the data again to confirm that none remain

In [None]:
print(f"Total nulls before imputation: {train.isnull().sum().sum()}")

train = train.fillna(train.median())

print(f"Total nulls after imputation: {train.isnull().sum().sum()}")

Question 2.3: Generate train validation (test) split of 70/30
1. Create the input variables `X` without 'SalePrice'
2. Create the target variable `y` which is 'SalePrice'
3. Do train-test split to split data into training and validation datasets

In [None]:
from sklearn.model_selection import train_test_split

X = train.drop('SalePrice', axis=1)
y = train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Question 3.1: Train a linear regression algorithm to predict `SalePrice` from the remaining features.

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
print("LinearRegression model trained.")

Question 3.2: Evaluate the model with RMSE. Report the performance on both training and test data. These numbers will serve as our benchmark performance.
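
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch with made-up numbers follows; the helper name `rmse` is hypothetical and is only meant to illustrate the definition.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values: the errors are 2 and 0, so RMSE = sqrt((4 + 0) / 2) = sqrt(2)
toy_rmse = rmse([3.0, 5.0], [1.0, 5.0])
print(toy_rmse)
```

Because RMSE is expressed in the units of the target, the values reported for this assignment are in the same units as `SalePrice`.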

In [None]:
from sklearn.metrics import mean_squared_error

# Predict on training and validation sets
y_train_pred_lr = lr_model.predict(X_train)
y_test_pred_lr = lr_model.predict(X_test)  # Assuming X_test and y_test are from Q2.3

# Calculate RMSE
lr_rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred_lr))
lr_rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred_lr))

print(f"Linear Regression RMSE on Training Data: {lr_rmse_train}")
print(f"Linear Regression RMSE on Validation Data: {lr_rmse_test}")

We now train a regularized version of `LinearRegression` called `Lasso`. `Lasso` has an argument called `alpha`, which is the **shrinkage parameter**.

Question 4.1: Let `alpha = 0.000001` and train a `Lasso` algorithm. Show that the resulting model is practically identical to the one we trained with `LinearRegression`. There are different ways to show this, so you will need to think of a way. <span style="color:red; float:right">[2 points]</span>

In [None]:
from sklearn.linear_model import Lasso

# Initialize and train Lasso model
lasso_model_q41 = Lasso(alpha=0.000001, max_iter=10000) # Added max_iter to help convergence with small alpha
lasso_model_q41.fit(X_train, y_train)
print("Lasso model (alpha=0.000001) trained.")

# Predict on training and validation sets
y_train_pred_lasso_q41 = lasso_model_q41.predict(X_train)
y_test_pred_lasso_q41 = lasso_model_q41.predict(X_test)

# Calculate RMSE for Lasso model
lasso_rmse_train_q41 = np.sqrt(mean_squared_error(y_train, y_train_pred_lasso_q41))
lasso_rmse_test_q41 = np.sqrt(mean_squared_error(y_test, y_test_pred_lasso_q41))

print(f"Lasso (alpha=0.000001) RMSE on Training Data: {lasso_rmse_train_q41}")
print(f"Lasso (alpha=0.000001) RMSE on Validation Data: {lasso_rmse_test_q41}")

print("\n--- RMSE Comparison ---")
print(f"Linear Regression RMSE (Train): {lr_rmse_train}")
print(f"Lasso (alpha=0.000001) RMSE (Train): {lasso_rmse_train_q41}")
print(f"Linear Regression RMSE (Validation): {lr_rmse_test}")
print(f"Lasso (alpha=0.000001) RMSE (Validation): {lasso_rmse_test_q41}")

print("\n--- Coefficient Comparison ---")
print(f"First 5 coefficients of Linear Regression model: {lr_model.coef_[:5]}")
print(f"First 5 coefficients of Lasso (alpha=0.000001) model: {lasso_model_q41.coef_[:5]}")

coef_diff = np.sum(np.abs(lr_model.coef_ - lasso_model_q41.coef_))
print(f"Sum of absolute differences in coefficients: {coef_diff}")

**Comparison of LinearRegression and Lasso (alpha=0.000001) Models**

As shown by the RMSE scores, the Lasso model with a very small alpha (0.000001) performs almost identically to the LinearRegression model on both the training and validation datasets. 
The RMSE values are very close:
- Training RMSE: LinearRegression (`lr_rmse_train`) vs. Lasso (`lasso_rmse_train_q41`)
- Validation RMSE: LinearRegression (`lr_rmse_test`) vs. Lasso (`lasso_rmse_test_q41`)

Furthermore, comparing the model coefficients reveals that they are also very similar. The sum of absolute differences between the coefficient vectors is very small (`coef_diff`). This indicates that with such a minimal shrinkage parameter, the L1 regularization in Lasso has a negligible effect, making the Lasso model behave like an ordinary least squares linear regression.
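
Whether a sum of absolute differences counts as "very small" depends on the scale of the coefficients themselves. One possible complement, sketched here with made-up coefficient vectors, is to divide by the L1 norm of the reference coefficients. The helper name `relative_l1_diff` is hypothetical, and the arrays `a` and `b` merely stand in for `lr_model.coef_` and `lasso_model_q41.coef_`.

```python
import numpy as np

def relative_l1_diff(coef_a, coef_b):
    """Sum of absolute differences, scaled by the L1 norm of coef_a."""
    coef_a = np.asarray(coef_a, dtype=float)
    coef_b = np.asarray(coef_b, dtype=float)
    return float(np.sum(np.abs(coef_a - coef_b)) / np.sum(np.abs(coef_a)))

# Made-up coefficient vectors (illustration only, not fitted values)
a = np.array([100.0, -50.0, 25.0])
b = np.array([100.5, -50.0, 24.5])

rel = relative_l1_diff(a, b)
print(f"Relative L1 difference: {rel:.4f}")
```

A relative difference far below 1 suggests that the two coefficient vectors agree closely regardless of the units of the features.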

Question 4.2: Iteratively train a new `Lasso` model, letting `alpha` change each time to one of the values given by the suggested `alpha_vals` below.
For each alpha keep track of and store: 
- the performance (RMSE) on the training data
- the performance (RMSE) on the validation (test) data
- the coefficients (`coef_`) of the trained model

In [None]:
alpha_vals = 10**np.arange(-1, 4, .2)

lasso_train_rmse_list = []
lasso_test_rmse_list = []
lasso_coefs_list = []

from sklearn.linear_model import Lasso # Ensure Lasso is imported

for alpha_val in alpha_vals:
    # Initialize and train Lasso model
    # Increased max_iter for better convergence, especially at small alphas
    lasso_model_iter = Lasso(alpha=alpha_val, max_iter=10000, random_state=42) 
    lasso_model_iter.fit(X_train, y_train)
    
    # Predict on training and validation sets
    y_train_pred_lasso_iter = lasso_model_iter.predict(X_train)
    y_test_pred_lasso_iter = lasso_model_iter.predict(X_test)
    
    # Calculate RMSE
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred_lasso_iter))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred_lasso_iter))
    
    # Store results
    lasso_train_rmse_list.append(train_rmse)
    lasso_test_rmse_list.append(test_rmse)
    lasso_coefs_list.append(lasso_model_iter.coef_)
    
print("Lasso iteration complete.")
print(f"Number of alpha values tested: {len(alpha_vals)}")
print(f"First training RMSE: {lasso_train_rmse_list[0]}")
print(f"First validation RMSE: {lasso_test_rmse_list[0]}")

Question 4.3: Using a visual, show how the performance (rmse) on the training and test data changed as we gradually increased `alpha`. Use a lineplot where the x-axis is `alpha` and the y-axis is rmse.  Use a log scale for the x-axis.
<br><br>
Discuss your results:
- From this plot, estimate the best alpha value.
- How does the plot for the training data compare to the lineplot of the validation (test) data?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns # Optional, but can make plots nicer
%matplotlib inline

plt.figure(figsize=(12, 6))
plt.plot(alpha_vals, lasso_train_rmse_list, label='Training RMSE', marker='o', linestyle='-')
plt.plot(alpha_vals, lasso_test_rmse_list, label='Validation RMSE', marker='o', linestyle='-')

plt.xscale('log') # Set x-axis to log scale
plt.xlabel('Alpha (Shrinkage Parameter)')
plt.ylabel('RMSE')
plt.title('Lasso Model Performance vs. Alpha')
plt.legend()
plt.grid(True, which="both", ls="--")
plt.show()

Question 4.4: Using a visual, show how the model's coefficients changed as we gradually increased the shrinkage parameter `alpha`. HINT: They should appear to be shrinking toward zero as you increase `alpha`!  There are too many coefficients to create lineplots for every coefficient.  Present only a subset of the coefficients that make the point.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd # For easier handling of coefficients
%matplotlib inline

# Convert list of coefficient arrays into a DataFrame
# Assuming X_train.columns contains the feature names
lasso_coefs_df = pd.DataFrame(lasso_coefs_list, index=alpha_vals, columns=X_train.columns)

# Plotting a subset of coefficients (e.g., first 10)
# Or, select coefficients that show interesting behavior (e.g., largest magnitude at low alpha)
# For simplicity, let's plot the first 10. If X_train has fewer than 10 columns, it will plot all.
num_coeffs_to_plot = min(10, len(X_train.columns))

plt.figure(figsize=(14, 7))
for feature in lasso_coefs_df.columns[:num_coeffs_to_plot]:
    plt.plot(lasso_coefs_df.index, lasso_coefs_df[feature], label=feature)

plt.xscale('log') # Set x-axis to log scale
plt.xlabel('Alpha (Shrinkage Parameter)')
plt.ylabel('Coefficient Value')
plt.title('Lasso Model Coefficients vs. Alpha (Subset)')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5)) # Adjust legend to prevent overlap
plt.grid(True, which="both", ls="--")
plt.show()

print("Number of non-zero coefficients for different alphas:")
for alpha_val in [alpha_vals[0], alpha_vals[len(alpha_vals)//2], alpha_vals[-1]]:
    num_nonzero = np.sum(lasso_coefs_df.loc[alpha_val] != 0)
    print(f"Alpha = {alpha_val:.2f}: {num_nonzero} non-zero coefficients")

**Discussion of Lasso Coefficients vs. Alpha (Question 4.4)**

The plot above illustrates how the magnitudes of the Lasso model's coefficients change as the shrinkage parameter `alpha` increases. A subset of the coefficients is shown for clarity.

*   **Shrinkage Towards Zero:**
    As `alpha` increases, the L1 regularization penalty becomes more significant. This forces many of the coefficients towards zero. This is a key characteristic of Lasso regression, which performs feature selection by effectively eliminating less important features (i.e., setting their coefficients to zero).

*   **Feature Selection:**
    You can observe that some coefficients shrink to zero faster than others. Coefficients that remain non-zero for higher values of `alpha` are generally considered more important by the model. At very high `alpha` values, most or all coefficients might be shrunk to zero, resulting in a very simple model. The printout above also shows how the count of non-zero coefficients changes with alpha, typically decreasing as alpha increases.
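
Why the L1 penalty produces exact zeros can be seen in a simplified setting. When the features are orthonormal (an assumption that does not hold for the housing data), the Lasso solution is the unregularized solution passed through a soft-thresholding operator whose threshold grows with `alpha`. The sketch below illustrates only that operator; the function name `soft_threshold` and the coefficient values are hypothetical.

```python
import numpy as np

def soft_threshold(beta, t):
    """Shrink each entry of beta toward zero by t, clipping at exactly zero."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

# Made-up unregularized coefficients (illustration only)
ols_coefs = np.array([3.0, -1.5, 0.4])

for t in [0.0, 0.5, 2.0]:
    shrunk = soft_threshold(ols_coefs, t)
    # Coefficients with magnitude below the threshold become exactly zero
    print(f"threshold={t}: {shrunk}, non-zero: {int(np.count_nonzero(shrunk))}")
```

Coefficients whose magnitude falls below the threshold are set to exactly zero, which mirrors the decreasing non-zero counts printed for the fitted Lasso models.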

Question 5.1: Repeat steps in Question 4.2.  This time using `Ridge` instead of `Lasso`.

In [None]:
ridge_train_rmse_list = []
ridge_test_rmse_list = []
ridge_coefs_list = []

from sklearn.linear_model import Ridge # Ensure Ridge is imported

for alpha_val in alpha_vals:
    # Initialize and train Ridge model
    ridge_model_iter = Ridge(alpha=alpha_val, random_state=42) 
    ridge_model_iter.fit(X_train, y_train)
    
    # Predict on training and validation sets
    y_train_pred_ridge_iter = ridge_model_iter.predict(X_train)
    y_test_pred_ridge_iter = ridge_model_iter.predict(X_test)
    
    # Calculate RMSE
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred_ridge_iter))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred_ridge_iter))
    
    # Store results
    ridge_train_rmse_list.append(train_rmse)
    ridge_test_rmse_list.append(test_rmse)
    ridge_coefs_list.append(ridge_model_iter.coef_)
    
print("Ridge iteration complete.")
print(f"Number of alpha values tested: {len(alpha_vals)}")
print(f"First training RMSE (Ridge): {ridge_train_rmse_list[0]}")
print(f"First validation RMSE (Ridge): {ridge_test_rmse_list[0]}")

Question 5.2: Using a visual, show how the performance (rmse) on the training and test data changed as we gradually increased `alpha`. Use a lineplot where the x-axis is `alpha` and the y-axis is rmse.  Use a log scale for the x-axis.  
<br><br>
Discuss your results:
- From this plot, estimate the best alpha value.
- How does the plot for the training data compare to the validation (test) data?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(12, 6))
plt.plot(alpha_vals, ridge_train_rmse_list, label='Training RMSE (Ridge)', marker='o', linestyle='-')
plt.plot(alpha_vals, ridge_test_rmse_list, label='Validation RMSE (Ridge)', marker='o', linestyle='-')

plt.xscale('log') # Set x-axis to log scale
plt.xlabel('Alpha (Shrinkage Parameter)')
plt.ylabel('RMSE')
plt.title('Ridge Model Performance vs. Alpha')
plt.legend()
plt.grid(True, which="both", ls="--")
plt.show()

Question 5.3: Using a visual, show how the model's coefficients changed as we gradually increased the shrinkage parameter `alpha`. HINT: They should appear to be shrinking toward zero as you increase `alpha`!  There are too many coefficients to create lineplots for every coefficient.  Present a subset of the coefficients that make the point.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd # For easier handling of coefficients
%matplotlib inline

# Convert list of coefficient arrays into a DataFrame
# Assuming X_train.columns contains the feature names
ridge_coefs_df = pd.DataFrame(ridge_coefs_list, index=alpha_vals, columns=X_train.columns)

# Plotting a subset of coefficients (e.g., first 10)
num_coeffs_to_plot = min(10, len(X_train.columns))

plt.figure(figsize=(14, 7))
for feature in ridge_coefs_df.columns[:num_coeffs_to_plot]:
    plt.plot(ridge_coefs_df.index, ridge_coefs_df[feature], label=feature)

plt.xscale('log') # Set x-axis to log scale
plt.xlabel('Alpha (Shrinkage Parameter)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Model Coefficients vs. Alpha (Subset)')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5)) # Adjust legend
plt.grid(True, which="both", ls="--")
plt.show()

**Discussion of Ridge Coefficients vs. Alpha (Question 5.3)**

The plot above illustrates how the magnitudes of the Ridge model's coefficients change as the shrinkage parameter `alpha` increases. A subset of coefficients is shown.

*   **Shrinkage Towards Zero (but not exactly zero):**
    As `alpha` increases, the L2 regularization penalty in Ridge regression also becomes more significant. This forces the coefficients towards zero, but unlike Lasso, Ridge typically does not shrink coefficients *exactly* to zero unless alpha is infinitely large. Instead, coefficients are reduced in magnitude, helping to prevent overfitting by reducing model complexity.

*   **Behavior Compared to Lasso:**
    Comparing this plot to the Lasso coefficient plot (Q4.4), you'll notice that while Lasso drives many coefficients to absolute zero (performing feature selection), Ridge tends to shrink all coefficients more smoothly and proportionally. All features are typically retained by Ridge, but their influence is moderated by the regularization. This often results in a dense model (many non-zero, albeit small, coefficients) as opposed to Lasso's sparse models.
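
The contrast with Lasso can be checked on synthetic data using the closed-form Ridge solution. This is a minimal sketch under stated assumptions: the data are randomly generated (not the housing data), there is no intercept term, and the helper name `ridge_closed_form` is hypothetical. It is not how the `Ridge` estimator is called in the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
for alpha in [0.1, 10.0, 1000.0]:
    w = ridge_closed_form(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    # Coefficients shrink as alpha grows, but none is exactly zero
    print(f"alpha={alpha:g}: coefs={np.round(w, 4)}, exact zeros={int(np.sum(w == 0))}")
```

The coefficient norm decreases as `alpha` grows, yet every coefficient stays non-zero, which is the dense behavior described above.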

In [None]:
# The following code can be removed
print("Elapsed time: ", time.time() - start_time)