# Salary Prediction (Regression)

This notebook compares multiple regression models to predict **salary** from candidate attributes.

**Focus:** generalisation on unseen data, bias–variance trade-off, and regularisation.

**Notebook is self-contained:** it loads the CSV at the top.

## 1) Imports & Data Loading

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# ---- Load data ----
df = pd.read_csv("../data/age_experience_salary_100.csv")
df.head()

|   | age | experience_year | salary |
|---|-----|-----------------|--------|
| 0 | 63 | 18 | 100847 |
| 1 | 48 | 10 | 82555 |
| 2 | 51 | 17 | 95601 |
| 3 | 61 | 15 | 113841 |
| 4 | 46 | 3 | 90248 |


## 2) Quick Data Checks

We verify:
- columns and dtypes
- missing values
- that the target column `salary` exists

In [2]:
# Basic info
print(df.shape)
df.info()  # prints its summary directly and returns None, so display() is not needed

# Missing values
na = df.isna().sum().sort_values(ascending=False)
na[na > 0]

(100, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              100 non-null    int64
 1   experience_year  100 non-null    int64
 2   salary           100 non-null    int64
dtypes: int64(3)
memory usage: 2.5 KB

Series([], dtype: int64)

## 3) Define Features (X) and Target (y)

- `y` is the salary column.
- `X` includes the numeric feature columns used for prediction.

> Tip: keep this explicit so recruiters can see exactly what your model used.

In [3]:
# ---- Choose target ----
TARGET_COL = "salary"
assert TARGET_COL in df.columns, f"Missing target column: {TARGET_COL}"

# ---- Choose features ----
# If your dataset contains non-numeric columns, select numeric only (excluding target).
X = df.select_dtypes(include=["number"]).drop(columns=[TARGET_COL], errors="ignore")
y = df[TARGET_COL]

print("Features:", list(X.columns))
print("X shape:", X.shape, "| y shape:", y.shape)

Features: ['age', 'experience_year']
X shape: (100, 2) | y shape: (100,)


## 4) Train/Test Split

We split data into training and test sets to evaluate **generalisation** on unseen data.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train:", X_train.shape, y_train.shape)
print("Test:", X_test.shape, y_test.shape)

Train: (80, 2) (80,)
Test: (20, 2) (20,)


## 5) Helper Function: Evaluate Regression Models

We report:
- **R²** on both the train and test sets (higher is better; a large train–test gap signals overfitting)
- **RMSE** on the test set (lower is better; expressed in salary units)

In [5]:
def evaluate_regressor(model, X_train, y_train, X_test, y_test, name=None):
    """Fit `model` on the training data and return train/test R² and test RMSE."""
    model.fit(X_train, y_train)

    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    train_r2 = r2_score(y_train, train_pred)
    test_r2 = r2_score(y_test, test_pred)

    # RMSE is computed on the test set only, in the same units as the target.
    rmse = np.sqrt(mean_squared_error(y_test, test_pred))

    return {
        "model": name or model.__class__.__name__,
        "train_r2": train_r2,
        "test_r2": test_r2,
        "rmse": rmse,
    }
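
As a sanity check on the two metrics, here is a minimal sketch that computes R² and RMSE by hand and compares them with scikit-learn. The salaries and predictions are made-up values for illustration only, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical salaries and predictions, for illustration only.
y_true_demo = np.array([50_000.0, 60_000.0, 70_000.0, 80_000.0])
y_pred_demo = np.array([52_000.0, 58_000.0, 73_000.0, 79_000.0])

residuals = y_true_demo - y_pred_demo
ss_res = np.sum(residuals ** 2)                              # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)     # total sum of squares

r2_manual = 1.0 - ss_res / ss_tot                # share of variance explained
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # typical error, in salary units

# The manual values should agree with scikit-learn's implementations.
assert np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))

print(f"R2 = {r2_manual:.4f}, RMSE = {rmse_manual:.1f}")  # R2 = 0.9640, RMSE = 2121.3
```

R² is unitless and relative to a predict-the-mean baseline, while RMSE stays in the target's units, which is why the helper reports both.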

## 6) Baseline Model: Linear Regression

A simple, stable baseline. Good for comparison and for detecting when non-linear models add value.

In [6]:
results = []

lr = LinearRegression()
results.append(evaluate_regressor(lr, X_train, y_train, X_test, y_test, name="LinearRegression"))

pd.DataFrame(results)

|   | model | train_r2 | test_r2 | rmse |
|---|-------|----------|---------|------|
| 0 | LinearRegression | 0.888803 | 0.848164 | 9228.320526 |
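
One reason a linear baseline is useful is that its coefficients can be read directly. The sketch below shows how that might look. It uses synthetic stand-in data (an assumption, so the numbers are not the notebook's results), and `lr_demo`, `X_demo`, and `salary_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption; this is not the notebook's CSV).
rng = np.random.default_rng(0)
age_demo = rng.integers(22, 65, size=100)
exp_demo = rng.integers(0, 20, size=100)
salary_demo = 30_000 + 800 * age_demo + 2_000 * exp_demo + rng.normal(0, 8_000, size=100)
X_demo = np.column_stack([age_demo, exp_demo])

lr_demo = LinearRegression().fit(X_demo, salary_demo)

# Each coefficient is the expected change in salary for a one-unit
# increase in that feature, holding the other feature fixed.
for feature, coef in zip(["age", "experience_year"], lr_demo.coef_):
    print(f"{feature}: {coef:,.0f} salary units per unit increase")
print(f"intercept: {lr_demo.intercept_:,.0f}")
```

On the real data, the same two attributes (`coef_` and `intercept_`) are available on the fitted `lr` object.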


## 7) Decision Tree: Overfitting & Regularisation

Unconstrained trees can overfit (very high training score, lower test score).
We control complexity with `max_depth`.

In [7]:
# Unconstrained tree
full_tree = DecisionTreeRegressor(random_state=42)
results.append(evaluate_regressor(full_tree, X_train, y_train, X_test, y_test, name="DecisionTree (unconstrained)"))

# Regularised tree (depth search)
for depth in [2, 3, 4, 5, 6]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    results.append(evaluate_regressor(tree, X_train, y_train, X_test, y_test, name=f"DecisionTree (max_depth={depth})"))

pd.DataFrame(results).sort_values("test_r2", ascending=False)

|   | model | train_r2 | test_r2 | rmse |
|---|-------|----------|---------|------|
| 0 | LinearRegression | 0.888803 | 0.848164 | 9228.320526 |
| 4 | DecisionTree (max_depth=4) | 0.901011 | 0.763501 | 11517.303461 |
| 5 | DecisionTree (max_depth=5) | 0.935372 | 0.755542 | 11709.482426 |
| 3 | DecisionTree (max_depth=3) | 0.838088 | 0.743708 | 11989.551879 |
| 6 | DecisionTree (max_depth=6) | 0.963399 | 0.689804 | 13190.282077 |
| 1 | DecisionTree (unconstrained) | 0.993125 | 0.632155 | 14363.766148 |
| 2 | DecisionTree (max_depth=2) | 0.724562 | 0.484335 | 17006.677637 |
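
The depth search above ranks candidates by their test score, which lets the test set influence the choice. A common alternative is to pick `max_depth` by cross-validation on the training data and touch the test set only once at the end. The following is a minimal sketch of that idea on synthetic stand-in data (an assumption); all `*_demo` names and the 5-fold setting are illustrative choices, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; this is not the notebook's CSV).
rng = np.random.default_rng(0)
age_demo = rng.integers(22, 65, size=100)
exp_demo = rng.integers(0, 20, size=100)
salary_demo = 30_000 + 800 * age_demo + 2_000 * exp_demo + rng.normal(0, 8_000, size=100)
X_demo = np.column_stack([age_demo, exp_demo])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, salary_demo, test_size=0.2, random_state=42
)

# Score each depth with 5-fold cross-validation on the training split only.
cv_scores = {}
for depth in [2, 3, 4, 5, 6]:
    tree_demo = DecisionTreeRegressor(max_depth=depth, random_state=42)
    fold_scores = cross_val_score(tree_demo, X_tr, y_tr, cv=5, scoring="r2")
    cv_scores[depth] = fold_scores.mean()

best_depth = max(cv_scores, key=cv_scores.get)

# Refit on the full training split and evaluate once on the held-out test split.
best_tree = DecisionTreeRegressor(max_depth=best_depth, random_state=42).fit(X_tr, y_tr)
test_r2_demo = best_tree.score(X_te, y_te)

print("Mean CV R2 by depth:", {d: round(s, 3) for d, s in cv_scores.items()})
print("Selected depth:", best_depth, "| test R2:", round(test_r2_demo, 3))
```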


## 8) Random Forest

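Random Forest averages the predictions of many trees, each fit on a bootstrap sample of the training data (here also constrained to `max_depth=4`). Averaging typically reduces variance and improves generalisation.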

In [8]:
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=4,
    random_state=42,
)
results.append(evaluate_regressor(rf, X_train, y_train, X_test, y_test, name="RandomForest (200 trees, depth=4)"))

pd.DataFrame(results).sort_values("test_r2", ascending=False)

|   | model | train_r2 | test_r2 | rmse |
|---|-------|----------|---------|------|
| 7 | RandomForest (200 trees, depth=4) | 0.92775 | 0.873289 | 8430.300761 |
| 0 | LinearRegression | 0.888803 | 0.848164 | 9228.320526 |
| 4 | DecisionTree (max_depth=4) | 0.901011 | 0.763501 | 11517.303461 |
| 5 | DecisionTree (max_depth=5) | 0.935372 | 0.755542 | 11709.482426 |
| 3 | DecisionTree (max_depth=3) | 0.838088 | 0.743708 | 11989.551879 |
| 6 | DecisionTree (max_depth=6) | 0.963399 | 0.689804 | 13190.282077 |
| 1 | DecisionTree (unconstrained) | 0.993125 | 0.632155 | 14363.766148 |
| 2 | DecisionTree (max_depth=2) | 0.724562 | 0.484335 | 17006.677637 |
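
Because each tree in a Random Forest is fit on a bootstrap sample, the rows left out of a given tree's sample can serve as a built-in validation set. The sketch below shows scikit-learn's out-of-bag (OOB) R² as one way to estimate generalisation without a separate split. It uses synthetic stand-in data (an assumption), and `rf_oob` is an illustrative name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; this is not the notebook's CSV).
rng = np.random.default_rng(0)
age_demo = rng.integers(22, 65, size=100)
exp_demo = rng.integers(0, 20, size=100)
salary_demo = 30_000 + 800 * age_demo + 2_000 * exp_demo + rng.normal(0, 8_000, size=100)
X_demo = np.column_stack([age_demo, exp_demo])

# oob_score=True asks the forest to predict each row using only the trees
# whose bootstrap sample did not contain that row.
rf_oob = RandomForestRegressor(
    n_estimators=200,
    max_depth=4,
    oob_score=True,
    random_state=42,
)
rf_oob.fit(X_demo, salary_demo)

print(f"In-sample R2:  {rf_oob.score(X_demo, salary_demo):.3f}")
print(f"Out-of-bag R2: {rf_oob.oob_score_:.3f}")
```

The in-sample R² is usually the higher of the two, since it scores rows the trees have already seen.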


## 9) Gradient Boosting: Regularised / Tuned

Gradient Boosting trains trees sequentially to correct previous errors.
Regularisation controls overfitting via:
- shallow trees (`max_depth`)
- `learning_rate`
- `subsample`
- `min_samples_leaf`

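We use the tuned configuration that performed best in earlier experiments. If those experiments scored candidates on this same test split, the test R² reported below is an optimistic estimate of performance on new data.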

In [9]:
# Baseline GB
gb = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=42,
)
results.append(evaluate_regressor(gb, X_train, y_train, X_test, y_test, name="GradientBoosting (baseline)"))

# Tuned / regularised GB (configuration taken from earlier experiments)
gb_tuned = GradientBoostingRegressor(
    max_depth=2,
    learning_rate=0.04,
    n_estimators=400,
    min_samples_leaf=5,
    subsample=0.8,
    random_state=42,
)
results.append(evaluate_regressor(gb_tuned, X_train, y_train, X_test, y_test, name="GradientBoosting (tuned)"))

pd.DataFrame(results).sort_values("test_r2", ascending=False)

|   | model | train_r2 | test_r2 | rmse |
|---|-------|----------|---------|------|
| 9 | GradientBoosting (tuned) | 0.940863 | 0.914948 | 6906.800033 |
| 8 | GradientBoosting (baseline) | 0.975145 | 0.874895 | 8376.698909 |
| 7 | RandomForest (200 trees, depth=4) | 0.92775 | 0.873289 | 8430.300761 |
| 0 | LinearRegression | 0.888803 | 0.848164 | 9228.320526 |
| 4 | DecisionTree (max_depth=4) | 0.901011 | 0.763501 | 11517.303461 |
| 5 | DecisionTree (max_depth=5) | 0.935372 | 0.755542 | 11709.482426 |
| 3 | DecisionTree (max_depth=3) | 0.838088 | 0.743708 | 11989.551879 |
| 6 | DecisionTree (max_depth=6) | 0.963399 | 0.689804 | 13190.282077 |
| 1 | DecisionTree (unconstrained) | 0.993125 | 0.632155 | 14363.766148 |
| 2 | DecisionTree (max_depth=2) | 0.724562 | 0.484335 | 17006.677637 |
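
One way to tune the regularisation parameters listed above without consulting the test set is a cross-validated grid search on the training split. The sketch below is illustrative only: the data is a synthetic stand-in, and the grid values, `n_estimators=100`, and 5-fold setting are assumptions rather than the experiments that produced the tuned configuration in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; this is not the notebook's CSV).
rng = np.random.default_rng(0)
age_demo = rng.integers(22, 65, size=100)
exp_demo = rng.integers(0, 20, size=100)
salary_demo = 30_000 + 800 * age_demo + 2_000 * exp_demo + rng.normal(0, 8_000, size=100)
X_demo = np.column_stack([age_demo, exp_demo])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, salary_demo, test_size=0.2, random_state=42
)

# A deliberately small, hypothetical grid over the regularisation knobs.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.04, 0.1],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, min_samples_leaf=5, random_state=42),
    param_grid,
    cv=5,            # 5-fold cross-validation on the training split only
    scoring="r2",
)
search.fit(X_tr, y_tr)

# The best estimator is refit on the whole training split by default,
# so a single final score on the held-out test split is a fair estimate.
gb_test_r2 = search.score(X_te, y_te)

print("Best params:", search.best_params_)
print(f"Best mean CV R2: {search.best_score_:.3f} | test R2: {gb_test_r2:.3f}")
```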


## 10) Final Comparison & Model Choice

We rank models by **test R²** (ties broken by test RMSE) and check **stability** via the train–test R² gap.
RMSE is in salary units, so it reads directly as a typical prediction error.

In [10]:
final_table = pd.DataFrame(results)
final_table["train_test_gap"] = final_table["train_r2"] - final_table["test_r2"]
final_table = final_table.sort_values(["test_r2", "rmse"], ascending=[False, True])
final_table.reset_index(drop=True)

|   | model | train_r2 | test_r2 | rmse | train_test_gap |
|---|-------|----------|---------|------|----------------|
| 0 | GradientBoosting (tuned) | 0.940863 | 0.914948 | 6906.800033 | 0.025915 |
| 1 | GradientBoosting (baseline) | 0.975145 | 0.874895 | 8376.698909 | 0.10025 |
| 2 | RandomForest (200 trees, depth=4) | 0.92775 | 0.873289 | 8430.300761 | 0.054461 |
| 3 | LinearRegression | 0.888803 | 0.848164 | 9228.320526 | 0.040639 |
| 4 | DecisionTree (max_depth=4) | 0.901011 | 0.763501 | 11517.303461 | 0.137511 |
| 5 | DecisionTree (max_depth=5) | 0.935372 | 0.755542 | 11709.482426 | 0.17983 |
| 6 | DecisionTree (max_depth=3) | 0.838088 | 0.743708 | 11989.551879 | 0.09438 |
| 7 | DecisionTree (max_depth=6) | 0.963399 | 0.689804 | 13190.282077 | 0.273596 |
| 8 | DecisionTree (unconstrained) | 0.993125 | 0.632155 | 14363.766148 | 0.36097 |
| 9 | DecisionTree (max_depth=2) | 0.724562 | 0.484335 | 17006.677637 | 0.240227 |


## Conclusion

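- The unconstrained tree overfits (train R² 0.993 vs test R² 0.632), while `max_depth=2` underfits (test R² 0.484); `max_depth` acts as regularisation, with depth 4 giving the best single-tree test R² (0.764).
- Random Forest improves generalisation by reducing variance (test R² 0.873 vs 0.764 for the best single tree).
- Regularised Gradient Boosting scores best when tuned carefully (test R² 0.915, test RMSE ≈ 6,907), with the smallest train–test gap (0.026).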

**Next step (optional):** add feature importance (RF) or permutation importance (model-agnostic) to explain what drives predictions.
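
A minimal sketch of that next step might look like the following. It compares a Random Forest's impurity-based `feature_importances_` with `sklearn.inspection.permutation_importance` computed on a held-out split. The data is a synthetic stand-in (an assumption), so the printed importances say nothing about the real dataset, and the `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the notebook's CSV).
rng = np.random.default_rng(0)
age_demo = rng.integers(22, 65, size=100)
exp_demo = rng.integers(0, 20, size=100)
salary_demo = 30_000 + 800 * age_demo + 2_000 * exp_demo + rng.normal(0, 8_000, size=100)
X_demo = np.column_stack([age_demo, exp_demo])
feature_names = ["age", "experience_year"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, salary_demo, test_size=0.2, random_state=42
)

rf_demo = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Impurity-based importance: derived from the training data, sums to 1.
for name, imp in zip(feature_names, rf_demo.feature_importances_):
    print(f"impurity importance    {name}: {imp:.3f}")

# Permutation importance: drop in held-out R2 when one feature is shuffled.
perm = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=20, random_state=42, scoring="r2"
)
for name, mean, std in zip(feature_names, perm.importances_mean, perm.importances_std):
    print(f"permutation importance {name}: {mean:.3f} +/- {std:.3f}")
```

Permutation importance works with any fitted estimator, so the same call could be applied to the tuned Gradient Boosting model.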