<a href="https://colab.research.google.com/github/allenphos/Study-projects/blob/main/Overfitting_and_Regularization_in_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overfitting and Regularization in Regression

This project explores the concepts of overfitting and regularization in the context of regression models. It compares the performance of **Linear Regression** with **Polynomial Regression** to demonstrate overfitting. Furthermore, it investigates the use of regularization techniques (**Ridge, Lasso, ElasticNet**) to mitigate overfitting and improve model generalization.

**Data:**

The project uses a regression dataset ['regression_data.csv'](https://drive.google.com/drive/u/0/folders/1QT01TI24Dt2Mr5RNiSa5bPHH65N36UxV) containing features and a target variable.


## Import necessary libraries

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import root_mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

## 1. Data Loading and Preprocessing



In [7]:
# Load the dataset
raw_df = pd.read_csv('drive/MyDrive/Colab Notebooks/data/regression_data.csv')

# Split into features (X) and target (y)
X = raw_df.drop(columns=['target'])
y = raw_df['target']

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## 2. Model Training and Evaluation

In [8]:
def train_lin_vs_poly_reg(X_train, X_val, y_train, y_val, degree=5):
    """
    Trains and evaluates Linear and Polynomial Regression models.

    Args:
        X_train: Training data features.
        X_val: Validation data features.
        y_train: Training data target.
        y_val: Validation data target.
        degree: Degree of the polynomial features.

    Returns:
        None (prints the results).
    """
    # Linear Regression without polynomial features
    lin_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])
    lin_pipeline.fit(X_train, y_train)
    y_pred_train_lin = lin_pipeline.predict(X_train)
    y_pred_val_lin = lin_pipeline.predict(X_val)

    # Polynomial Regression
    poly_pipeline = Pipeline([
        ('poly_features', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])
    poly_pipeline.fit(X_train, y_train)
    y_pred_train_poly = poly_pipeline.predict(X_train)
    y_pred_val_poly = poly_pipeline.predict(X_val)

    # Model evaluation
    rmse_lin_train = root_mean_squared_error(y_train, y_pred_train_lin)
    rmse_poly_train = root_mean_squared_error(y_train, y_pred_train_poly)

    rmse_lin_val = root_mean_squared_error(y_val, y_pred_val_lin)
    rmse_poly_val = root_mean_squared_error(y_val, y_pred_val_poly)

    print(f"Train RMSE for Linear Regression: {rmse_lin_train:.3f}")
    print(f"Test RMSE for Linear Regression: {rmse_lin_val:.3f}\n")

    print(f"Train RMSE for Polynomial Regression (degree {degree}): {rmse_poly_train:.3f}")
    print(f"Test RMSE for Polynomial Regression (degree {degree}): {rmse_poly_val:.3f}")

In [9]:
degree = 2
train_lin_vs_poly_reg(X_train, X_val, y_train, y_val, degree)

Train RMSE for Linear Regression: 1.066
Test RMSE for Linear Regression: 0.883

Train RMSE for Polynomial Regression (degree 2): 1.017
Test RMSE for Polynomial Regression (degree 2): 1.015


In [10]:
degree = 3
train_lin_vs_poly_reg(X_train, X_val, y_train, y_val, degree)

Train RMSE for Linear Regression: 1.066
Test RMSE for Linear Regression: 0.883

Train RMSE for Polynomial Regression (degree 3): 0.799
Test RMSE for Polynomial Regression (degree 3): 1.916


In [11]:
degree = 5
train_lin_vs_poly_reg(X_train, X_val, y_train, y_val, degree)

Train RMSE for Linear Regression: 1.066
Test RMSE for Linear Regression: 0.883

Train RMSE for Polynomial Regression (degree 5): 0.000
Test RMSE for Polynomial Regression (degree 5): 12.677


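For linear regression, the error on the test data (0.883) is lower than on the training data (1.066), which indicates that the model generalizes well. Adding PolynomialFeatures increases the test error: 1.015 at degree 2, 1.916 at degree 3, and 12.677 at degree 5.

With polynomial regression, overfitting becomes visible at higher values of degree: at degree 5 the training RMSE drops to 0.000 while the test RMSE rises to 12.677. The model is too complex for the data.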
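
To make "too complex" concrete, the sketch below counts how many columns a polynomial expansion produces for each degree. It assumes five input columns, inferred from the names `feature_1` through `feature_5` in the coefficient table in section 4; the real dataset may have a different number. The helper `n_poly_features` is hypothetical and only reproduces the standard count of monomials that `PolynomialFeatures(include_bias=False)` generates. Once this count approaches the number of training rows (not stated in this notebook), an unregularized least-squares fit can typically reproduce the training targets exactly, which would be consistent with a training RMSE of 0.000.

```python
from math import comb


def n_poly_features(n_inputs: int, degree: int) -> int:
    """Count monomials of total degree 1..degree in n_inputs variables (no bias column)."""
    return comb(n_inputs + degree, degree) - 1


# Assumption: the dataset has five input columns.
n_inputs = 5

for degree in (2, 3, 5, 6, 12, 20):
    print(f"degree {degree:>2}: {n_poly_features(n_inputs, degree)} features")
```

The count grows combinatorially with the degree, so the expanded feature matrix quickly becomes much wider than the original one.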

## 3. Regularized Regression Models

Train Lasso(), Ridge(), and ElasticNet() models on the data with polynomial features (degrees up to and including 20), and compare their quality with that obtained with linear regression. Which model generalizes best, and why?

In [14]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet

def poly_lin_vs_rle_(X_train, X_val, y_train, y_val, degree=5):
    """
    Trains and evaluates Linear Regression, Ridge, Lasso, and ElasticNet models with polynomial features.

    Args:
        X_train: Training data features.
        X_val: Validation data features.
        y_train: Training data target.
        y_val: Validation data target.
        degree: Degree of the polynomial features.

    Returns:
        None (prints the results).
    """
    # Dictionary of models to compare
    models = {
        "Linear Regression with PolynomialFeatures": LinearRegression(),
        "Ridge (alpha=1)                          ": Ridge(alpha=1),
        "Ridge (alpha=2)                          ": Ridge(alpha=2),
        "Lasso (alpha=0.1)                        ": Lasso(alpha=0.1, max_iter=10000),
        "ElasticNet (alpha=0.5)                   ": ElasticNet(alpha=0.5, max_iter=10000)
    }

    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_val_poly = poly.transform(X_val)

    # Scaling
    scaler = StandardScaler()
    X_train_poly_scaled = scaler.fit_transform(X_train_poly)
    X_val_poly_scaled = scaler.transform(X_val_poly)

    results = {}

    for name, model in models.items():
        model.fit(X_train_poly_scaled, y_train)
        y_train_pred = model.predict(X_train_poly_scaled)
        y_val_pred = model.predict(X_val_poly_scaled)

        train_rmse = root_mean_squared_error(y_train, y_train_pred)
        val_rmse = root_mean_squared_error(y_val, y_val_pred)

        results[name] = {"Train RMSE": train_rmse, "Validation RMSE": val_rmse}

    # Print the results
    print("\nEvaluation of regularized models:")
    for model_name, metrics in results.items():
        print(f"{model_name}: Train RMSE = {metrics['Train RMSE']:.3f}, Validation RMSE = {metrics['Validation RMSE']:.3f}")

In [15]:
poly_lin_vs_rle_(X_train, X_val, y_train, y_val, degree=2)


Evaluation of regularized models:
Linear Regression with PolynomialFeatures: Train RMSE = 1.017, Validation RMSE = 1.015
Ridge (alpha=1)                          : Train RMSE = 1.114, Validation RMSE = 1.152
Ridge (alpha=2)                          : Train RMSE = 1.356, Validation RMSE = 1.490
Lasso (alpha=0.1)                        : Train RMSE = 1.075, Validation RMSE = 0.858
ElasticNet (alpha=0.5)                   : Train RMSE = 9.119, Validation RMSE = 10.877


In [16]:
poly_lin_vs_rle_(X_train, X_val, y_train, y_val, degree=6)


Evaluation of regularized models:
Linear Regression with PolynomialFeatures: Train RMSE = 0.000, Validation RMSE = 16.411
Ridge (alpha=1)                          : Train RMSE = 1.071, Validation RMSE = 23.228
Ridge (alpha=2)                          : Train RMSE = 1.719, Validation RMSE = 23.605
Lasso (alpha=0.1)                        : Train RMSE = 0.971, Validation RMSE = 0.868
ElasticNet (alpha=0.5)                   : Train RMSE = 8.643, Validation RMSE = 17.641


In [17]:
poly_lin_vs_rle_(X_train, X_val, y_train, y_val, degree=12)


Evaluation of regularized models:
Linear Regression with PolynomialFeatures: Train RMSE = 0.000, Validation RMSE = 22.154
Ridge (alpha=1)                          : Train RMSE = 1.051, Validation RMSE = 36.525
Ridge (alpha=2)                          : Train RMSE = 1.665, Validation RMSE = 46.529
Lasso (alpha=0.1)                        : Train RMSE = 0.962, Validation RMSE = 0.878
ElasticNet (alpha=0.5)                   : Train RMSE = 8.609, Validation RMSE = 17.284


In [18]:
poly_lin_vs_rle_(X_train, X_val, y_train, y_val, degree=20)


Evaluation of regularized models:
Linear Regression with PolynomialFeatures: Train RMSE = 0.000, Validation RMSE = 65.391
Ridge (alpha=1)                          : Train RMSE = 1.054, Validation RMSE = 27.798
Ridge (alpha=2)                          : Train RMSE = 1.665, Validation RMSE = 20.615
Lasso (alpha=0.1)                        : Train RMSE = 0.965, Validation RMSE = 1.277
ElasticNet (alpha=0.5)                   : Train RMSE = 8.603, Validation RMSE = 17.330


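Lasso (alpha=0.1) generalizes best on the validation data: its validation RMSE stays between 0.858 and 0.878 for degrees 2 to 12 and rises only to 1.277 at degree 20.
At degrees 6 and above, plain linear regression and Ridge fit the training data closely but reach validation RMSE above 16, a sign of overfitting, while ElasticNet (alpha=0.5) shows high error on both the training and validation sets.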
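
For reference, these are the objectives the three regularized estimators minimize, as given in the scikit-learn documentation. Here $n$ is the number of training samples, $w$ the coefficient vector, $\alpha$ the `alpha` argument, and $\rho$ the `l1_ratio` argument of ElasticNet (left at its default of 0.5 in this notebook). The symbols are notation chosen for this note, not names from the code above.

$$
\text{Ridge:}\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

$$
\text{ElasticNet:}\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

The data term is averaged over $n$ for Lasso and ElasticNet but not for Ridge, so the same numeric `alpha` does not correspond to the same penalty strength across the three estimators. This may partly explain why Ridge with alpha=1 or alpha=2 behaves so differently from Lasso with alpha=0.1 in the results above.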

## 4. Feature Importance and Analysis

In [19]:
# Create polynomial features
poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train)

# Train Lasso regression on these features.
# Lasso uses L1 regularization, which can zero out the weights of less important features.
model = Lasso(alpha=0.1, max_iter=10000)
model.fit(X_train_poly, y_train)

# Create a table with feature names and their coefficients
coefs_df = pd.DataFrame(poly_features.get_feature_names_out(X_train.columns), columns=['feature_name'])
coefs_df['value'] = model.coef_.round(5).flatten()

# Sort features by coefficient value and visualize
coefs_df.set_index('feature_name').sort_values(by='value', ascending=False).style.background_gradient()

| feature_name | value |
|---|---|
| feature_4 | 49.77659 |
| feature_5^2 | 0.03579 |
| feature_2^2 | 0.0166 |
| feature_1 | 0.0125 |
| feature_2 feature_3 | 0.01229 |
| feature_4 feature_5 | -0.0 |
| feature_3 feature_5 | 0.0 |
| feature_3 feature_4 | 0.0 |
| feature_3^2 | 0.0 |
| feature_2 feature_5 | 0.0 |


Lasso regularization helps to identify the most important features by shrinking the weights of less important features to zero. This results in a simpler model that focuses on the most predictive variables.
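
The mechanism behind these exact zeros can be illustrated with the soft-thresholding operator. In the idealized case of uncorrelated, unit-variance features, the Lasso coefficient under scikit-learn's objective equals the least-squares coefficient shrunk toward zero by `alpha`, and set to exactly zero when its magnitude does not exceed `alpha`. The sketch below is a minimal illustration under that assumption; the function name `soft_threshold` and the coefficient values are hypothetical and are not taken from this notebook's data.

```python
def soft_threshold(w: float, alpha: float) -> float:
    """Shrink w toward zero by alpha; return exactly 0.0 when |w| <= alpha."""
    if w > alpha:
        return w - alpha
    if w < -alpha:
        return w + alpha
    return 0.0


# Hypothetical unregularized coefficients (illustrative values only).
ols_coefs = [3.0, -0.5, 0.8, -2.5]
alpha = 1.0

lasso_like = [soft_threshold(w, alpha) for w in ols_coefs]
print(lasso_like)  # [2.0, 0.0, 0.0, -1.5]
```

Coefficients smaller in magnitude than `alpha` vanish entirely, while larger ones survive with reduced magnitude. An L2 penalty, by contrast, scales coefficients down without setting them exactly to zero, which is why Ridge does not perform this kind of feature selection.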