# R² Improved from 0.952 to 0.999 — Benchmarking ML Regressors

In this notebook, we try to improve the predictive performance of a regression model that predicts `Balance` on a credit dataset.  
We start with a simple Linear Regression baseline and then try several machine learning techniques,  
raising the R² score from **0.952** to **0.999**.

# Step 1: Train a Simple Linear Regression Model

## Load the Dataset

We read the dataset with pandas and drop the "ID" column because it is not useful for prediction.

In [1]:
import pandas as pd
import numpy as np
import warnings
import optuna
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neural_network import MLPRegressor
from optuna.integration import OptunaSearchCV
from optuna.distributions import FloatDistribution, IntDistribution
from tabpfn import TabPFNRegressor
from autogluon.tabular import TabularPredictor

warnings.filterwarnings("ignore")
optuna.logging.set_verbosity(optuna.logging.WARNING)

df = pd.read_csv("Credit_Data.csv")
df = df.drop(columns=["ID"])

## Preprocessing

In [2]:
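# Select columns by dtype family rather than comparing against the strings
# "int"/"float", which only match the platform-default widths (e.g. int64).
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()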

dummies = pd.get_dummies(df[cat_cols], drop_first=True).astype(float)
df = pd.concat([df[num_cols], dummies], axis=1)
df

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Gender_Male,Student_Yes,Married_Yes,Ethnicity_Asian,Ethnicity_Caucasian
0,14.891,3606,283,2,34,11,333,1.0,0.0,1.0,0.0,1.0
1,106.025,6645,483,3,82,15,903,0.0,1.0,1.0,1.0,0.0
2,104.593,7075,514,4,71,11,580,1.0,0.0,0.0,1.0,0.0
3,148.924,9504,681,3,36,11,964,0.0,0.0,0.0,1.0,0.0
4,55.882,4897,357,2,68,16,331,1.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
395,12.096,4100,307,3,32,13,560,1.0,0.0,1.0,0.0,1.0
396,13.364,3838,296,5,65,17,480,1.0,0.0,0.0,0.0,0.0
397,57.872,4171,321,5,67,12,138,0.0,0.0,1.0,0.0,1.0
398,37.728,2525,192,1,44,13,0,1.0,0.0,1.0,0.0,1.0
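
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not rows from `Credit_Data.csv`). With `drop_first=True`, pandas drops the first category of each column in sorted order, so a two-level column becomes a single 0/1 indicator and a three-level column becomes two.

```python
import pandas as pd

# Toy data: hypothetical values chosen only to illustrate the encoding.
toy = pd.DataFrame({
    "Student": ["No", "Yes", "No"],
    "Ethnicity": ["Asian", "Caucasian", "African American"],
})

# drop_first=True removes one level per column to avoid redundant
# (perfectly collinear) indicator columns.
encoded = pd.get_dummies(toy, drop_first=True).astype(float)
print(encoded)
```

The dropped level becomes the reference category: a row with all indicators at 0 for a column belongs to that level.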


In [3]:
target_variable = "Balance"
X = df.drop(columns = [target_variable])
y = df[target_variable]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

## Train a Simple Linear Regression Model

In [4]:
linear_model = LinearRegression()
linear_model.fit(X_train,y_train)
y_pred = linear_model.predict(X_test)
print(r2_score(y_test, y_pred))

0.9522674050276462
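
For reference, the value printed by `r2_score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The short sketch below recomputes it with NumPy on made-up numbers; `r2_manual` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0, 400.0]                 # made-up targets
perfect = r2_manual(y_true, y_true)                   # perfect predictions
mean_only = r2_manual(y_true, [250.0] * 4)            # always predicting the mean
close = r2_manual(y_true, [110.0, 190.0, 310.0, 390.0])
print(perfect, mean_only, close)
```

A model that always predicts the mean of the targets scores 0 and a perfect model scores 1, so the baseline's 0.952 leaves under 5% of the variation in the test targets unexplained.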


## Result

This result is quite strong, but we aim to improve it further.  
To do this, we will try different machine learning models and compare their performance.

# Step 2: Try Different ML Models

## GridSearchCV with Multiple Models

We used `GridSearchCV` to tune hyperparameters of several regressors:

- Random Forest
- Ridge
- Lasso
- Decision Tree

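Each model is evaluated using **10-fold cross-validation** with **R² score** as the metric. The search is fitted on the full dataset (`X`, `y`), so the scores below are mean cross-validated R² values and are not directly comparable with the single hold-out score from Step 1.

Below are the best scores obtained for each model.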

In [5]:
model_configs = {
    "RandomForest": {
        "model": RandomForestRegressor(random_state=42),
        "params": {
            "regressor__n_estimators": [10,50,100],
            "regressor__max_depth": [3,5,10]
        }
    },
    "Ridge": {
        "model": Ridge(),
        "params": {
            "regressor__alpha": [0.01,0.1,1,10]
        }
    },
    "Lasso": {
        "model": Lasso(),
        "params": {
            "regressor__alpha": [0.01,0.1,1,10]
        }
    },
    "DecisionTree": {
        "model": DecisionTreeRegressor(random_state=42),
        "params": {
            "regressor__max_depth": [3,5,7, None],
            "regressor__min_samples_split": [2,5,10]
        }
    }
    
}

for name, config in model_configs.items():
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", config["model"])
    ])
    grid = GridSearchCV(pipeline, config["params"], cv=10, scoring = "r2", n_jobs=-1)
    grid.fit(X,y)

    print(f"{name} Best Params: {grid.best_params_}")
    print(f"{name} Best R2: {grid.best_score_:.4f}")

RandomForest Best Params: {'regressor__max_depth': 10, 'regressor__n_estimators': 100}
RandomForest Best R2: 0.9512
Ridge Best Params: {'regressor__alpha': 1}
Ridge Best R2: 0.9500
Lasso Best Params: {'regressor__alpha': 0.01}
Lasso Best R2: 0.9500
DecisionTree Best Params: {'regressor__max_depth': None, 'regressor__min_samples_split': 10}
DecisionTree Best R2: 0.9146
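
The scores above are cross-validated on the full dataset. One way to obtain a number that is comparable with the hold-out score from Step 1 is to run the search on the training split only and score the refitted best pipeline on the test split. The sketch below shows the pattern on synthetic data from `make_regression` (an assumption made so the block runs on its own); in the notebook, `X` and `y` would take its place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption for a runnable demo).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])
grid = GridSearchCV(pipe, {"regressor__alpha": [0.01, 0.1, 1, 10]}, cv=10, scoring="r2")

# Tune on the training split only; the test split stays unseen.
grid.fit(X_tr, y_tr)

# GridSearchCV refits the best pipeline on X_tr by default, so score() uses it.
test_r2 = grid.score(X_te, y_te)
print(f"CV R2 (train split): {grid.best_score_:.4f}")
print(f"Hold-out R2:         {test_r2:.4f}")
```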


## Hyperparameter Tuning with Optuna

We used **OptunaSearchCV** to optimize the following regressors:

- Ridge
- Lasso
- Decision Tree
- Random Forest

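Each model is tuned with **300 trials** using shuffled **5-fold cross-validation**, with **R² score** as the evaluation metric. As with the grid search, the search runs on the full dataset, so the reported values are mean cross-validated R² scores.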

Below are the best hyperparameters and corresponding scores for each model.

In [6]:
models = {
    "Ridge": {
        "model": Ridge(),
        "params": {
            "regressor__alpha": FloatDistribution(1e-4, 1e2, log=True)
        }
    },
    "Lasso": {
        "model": Lasso(),
        "params": {
            "regressor__alpha": FloatDistribution(1e-4, 1e2, log=True)
        }
    },
    "DecisionTree": {
        "model": DecisionTreeRegressor(random_state=42),
        "params": {
            "regressor__max_depth": IntDistribution(2, 20),
            "regressor__min_samples_split": IntDistribution(2, 20)
        }
    },
    "RandomForest": {
        "model": RandomForestRegressor(random_state=42),
        "params": {
            "regressor__n_estimators": IntDistribution(10, 200),
            "regressor__max_depth": IntDistribution(2, 20)
        }
    }
}

for name, config in models.items():
    print(f"Optimizing {name}...")
    
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", config["model"])
    ])

    optuna_search = OptunaSearchCV(
        estimator=pipeline,
        param_distributions=config["params"],
        cv=KFold(n_splits=5, shuffle=True, random_state=42),
        scoring="r2",
        n_trials=300,
        random_state=42,
        n_jobs=-1
    )

    optuna_search.fit(X, y)

    print(f"{name} Best Params: {optuna_search.best_params_}")
    print(f"{name} Best R2: {optuna_search.best_score_:.4f}\n")

Optimizing Ridge...
Ridge Best Params: {'regressor__alpha': 0.27170517842424147}
Ridge Best R2: 0.9513

Optimizing Lasso...
Lasso Best Params: {'regressor__alpha': 0.00010000680239437019}
Lasso Best R2: 0.9512

Optimizing DecisionTree...
DecisionTree Best Params: {'regressor__max_depth': 16, 'regressor__min_samples_split': 9}
DecisionTree Best R2: 0.9136

Optimizing RandomForest...
RandomForest Best Params: {'regressor__n_estimators': 53, 'regressor__max_depth': 11}
RandomForest Best R2: 0.9397
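
`FloatDistribution(1e-4, 1e2, log=True)` asks Optuna to search `alpha` on a logarithmic scale, so small values get as much attention as large ones. The NumPy sketch below imitates log-uniform sampling to show the effect. It does not call Optuna, and Optuna's samplers may adapt to earlier trials rather than draw purely at random.

```python
import numpy as np

rng = np.random.default_rng(42)
low, high = 1e-4, 1e2  # same bounds as the alpha search range above

# Log-uniform: draw the exponent uniformly, then exponentiate.
log_samples = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)
# Plain uniform sampling over the same interval, for comparison.
lin_samples = rng.uniform(low, high, size=10_000)

# Share of draws below 0.1 (the geometric midpoint of the range).
log_share = np.mean(log_samples < 0.1)
lin_share = np.mean(lin_samples < 0.1)
print("log-uniform:", log_share)
print("uniform:    ", lin_share)
```

Roughly half of the log-uniform draws fall below 0.1, while plain uniform sampling almost never goes there. Note also that the best Lasso `alpha` reported above sits at the lower bound of the range, which suggests that the search would prefer even weaker regularization.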



## AutoML with AutoGluon

We used **AutoGluon TabularPredictor** to automatically train and tune regression models.

- The data is split into 80% train and 20% test.
- Models used: **Linear Regression (LR)** and **Random Forest (RF)**.
- Preset used: `"best_quality"`, which enables bagging and stacking (the log below reports an initial configuration of 8 bag folds and 1 stack level).
- Evaluation metric: **R² score**.

In [7]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
X_train = train_df.drop(columns="Balance")
y_train = train_df["Balance"]
X_test = test_df.drop(columns="Balance")
y_test = test_df["Balance"]

In [8]:
predictor = TabularPredictor(
    label="Balance",
    problem_type="regression",
    eval_metric="r2"
).fit(
    train_data=train_df,
    presets="best_quality",
    hyperparameters={"LR": {}, "RF": {}}
)

No path specified. Models will be saved in: "AutogluonModels/ag-20250612_072647"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.12.7
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 24.5.0: Tue Apr 22 19:54:26 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T8112
CPU Count:          8
Memory Avail:       6.91 GB / 16.00 GB (43.2%)
Disk Space Avail:   40.05 GB / 228.27 GB (17.5%)
Presets specified: ['best_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
DyStack is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
	This is used to identify the optimal `num_stack_levels` value. Copies of AutoGluon will be fi

In [9]:
y_test = test_df["Balance"]
X_test = test_df.drop(columns=["Balance"])

y_pred = predictor.predict(X_test)

r2 = r2_score(y_test, y_pred)
print("R² score:", round(r2, 4))

R² score: 0.9696


## Multilayer Perceptron (MLP) Regressor

We trained a **Neural Network Regressor (MLPRegressor)** using the following configuration:

- Two hidden layers: 100 and 50 neurons
- Activation function: ReLU
- Solver: Adam
- Max iterations: 1000
- Data was scaled using `StandardScaler`

This approach yielded a very high R² score, but we also checked for overfitting by comparing train and test scores.

In [10]:
target_variable = "Balance"
X = df.drop(columns = [target_variable])
y = df[target_variable]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(100, 50), activation='relu', solver='adam',
                         max_iter=1000, random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_train_pred = pipeline.predict(X_train)
print("R² (Train):", r2_score(y_train, y_train_pred))
print("R² (Test):", r2_score(y_test, y_pred))

R² (Train): 0.9982149180585883
R² (Test): 0.9905390319299068


### Checking Overfitting for Multilayer Perceptron (MLP) Regressor

To check whether the model is overfitting, we use **cross-validation**.  
The pipeline is refitted on five different train/validation partitions of the data.  
If the fold scores vary widely, or fall well below the training score, the model is probably overfitting.

Below are the R² scores from 5-fold cross-validation.

In [11]:
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
print("Cross-validated R² scores:", scores)
print("Mean R²:", scores.mean())

Cross-validated R² scores: [0.99579611 0.99588718 0.99443395 0.99219605 0.99285242]
Mean R²: 0.9942331419814557
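
A more direct overfitting check compares training and validation scores within each fold, which `cross_validate` reports when `return_train_score=True`. The sketch below uses synthetic data and an unpruned decision tree (both assumptions, chosen so the gap is easy to see); in the notebook, the MLP `pipeline` with `X` and `y` could be passed instead.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption for a self-contained demo).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

# An unpruned tree memorizes its training folds, which makes the gap visible.
model = DecisionTreeRegressor(random_state=42)
results = cross_validate(model, X_demo, y_demo, cv=5, scoring="r2", return_train_score=True)

train_mean = results["train_score"].mean()
valid_mean = results["test_score"].mean()
print(f"Mean train R2:      {train_mean:.4f}")
print(f"Mean validation R2: {valid_mean:.4f}")
print(f"Gap:                {train_mean - valid_mean:.4f}")
```

A large gap points to overfitting. For the MLP above, the single-split train and test scores (0.998 and 0.991) are close, which is consistent with the stable fold scores.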


## TabPFN Regressor

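We use the TabPFN model for regression.  
TabPFN is a pre-trained transformer for small tabular datasets; here it is used with default settings, without hyperparameter tuning or feature scaling.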

In [12]:
reg = TabPFNRegressor()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("R² score:", r2_score(y_test, y_pred))

R² score: 0.9988414645195007


# Conclusion

Step 1: We started with a baseline Linear Regression model and achieved an R² score of **0.952**.  

Step 2: We experimented with multiple algorithms, including Ridge, Lasso, Decision Tree, Random Forest, AutoGluon, MLP, and TabPFN.  
The tuned linear and tree-based models did not beat the baseline, while AutoGluon (0.970) and the MLP (0.991) improved on it.  
The best result came from **TabPFN**, with a test R² of **0.999**.

On this dataset, trying different model families improved predictive accuracy more than tuning hyperparameters did.

# Conclusion: Comparing R² Scores

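The table below collects the R² scores from every experiment.  
Linear Regression, AutoGluon, MLP, and TabPFN were scored on the 20% hold-out split, while the GridSearch and Optuna rows are mean cross-validated scores on the full dataset, so the two groups are only roughly comparable: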

| Model                         | R² Score | Evaluation     |
|------------------------------|----------|----------------|
| **Linear Regression**        | 0.952    | 20% hold-out   |
| GridSearch - RandomForest    | 0.951    | 10-fold CV     |
| GridSearch - Ridge           | 0.950    | 10-fold CV     |
| GridSearch - Lasso           | 0.950    | 10-fold CV     |
| GridSearch - DecisionTree    | 0.915    | 10-fold CV     |
| Optuna - Ridge               | 0.951    | 5-fold CV      |
| Optuna - Lasso               | 0.951    | 5-fold CV      |
| Optuna - DecisionTree        | 0.914    | 5-fold CV      |
| Optuna - RandomForest        | 0.940    | 5-fold CV      |
| AutoGluon                    | 0.970    | 20% hold-out   |
| MLP (Neural Network)         | 0.991    | 20% hold-out   |
| **TabPFN Regressor**         | 0.999    | 20% hold-out   |

The **TabPFN Regressor** achieved the best performance with an R² of **0.999** on the hold-out split, a clear improvement over the baseline's 0.952.

On this small dataset, the two neural models (TabPFN and the MLP) outperformed the tuned linear and tree-based regressors, and AutoGluon's ensemble also beat the baseline.
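
Because several of the scores above come from a single 80/20 split of a small dataset, repeated cross-validation over identical folds gives a steadier comparison. A minimal sketch, assuming synthetic data and two of the scikit-learn models used above; the other models are omitted to keep the block self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (assumption for a self-contained demo).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# A fixed random_state makes every model see exactly the same 15 splits.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    summary[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.4f}, std = {scores.std():.4f}")
```

Reporting the mean together with the standard deviation shows whether a difference between two models is larger than the fold-to-fold noise.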