## AML Project 1 
### Brain Age Prediction


#### Objective
The objective of this project is to predict the age of the brain based on MRI-derived features. The dataset contains multiple features extracted from MRI scans, which serve as input variables for predicting brain age, a continuous target variable. 

#### Goal
The goal is to develop a regression model that predicts brain age with an R² score of at least 0.5, the baseline requirement for this task.

Predictions are evaluated using the coefficient of determination (R²) on a held-out test dataset.
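
For reference, R² compares the model's squared error with the error of always predicting the mean age. The sketch below computes it by hand; the helper name `r2` and the three ages are made up for illustration and do not come from the dataset.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical ages, for illustration only
y_true = [60.0, 70.0, 80.0]
y_pred = [62.0, 70.0, 78.0]
score = r2(y_true, y_pred)  # SS_res = 8, SS_tot = 200
```

A model that always predicts the mean age scores 0, so the 0.5 target means the model must explain at least half of the variance in age.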

The data preprocessing involves the following steps:

- Outlier Detection
- Feature Selection
- Imputation of Missing Values

Finally, submit a prediction file that includes two columns: id and y (predicted brain age).


#### Dataset Overview

The dataset includes:

- Training Data (X_train.csv): Contains 832 numerical MRI features (x0 to x831) for 1212 samples. These features serve as input variables for training the model. Each sample is linked to a corresponding target brain age in the y_train.csv file.

- Test Data (X_test.csv): Contains the same 832 numerical MRI features for 776 samples. These features are used for generating predictions during the testing phase. The test dataset does not include target labels.

- Labels (y_train.csv): Contains the target brain ages (in years) for the training samples.

- Sample Submission (sample.csv): Provides a template for submitting predictions. It includes the id column and a placeholder for predicted ages (y). Predictions for the test set must follow this format.


#### Approach

The project is divided into the following steps:

###### Data Observation and Preprocessing:
- Handle missing values.
- Detect and address outliers.
- Perform feature selection to reduce dimensionality and retain relevant features.

###### Model Development:
- Train regression models to predict brain age.
- Tune hyperparameters to optimize performance.

###### Evaluation:
- Validate the model using R² on the validation set.

Submit predictions for the test set in the required format.


### 1. Data Observation and Preprocessing

#### 1.1 Summary Statistics and Data Overview

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import sklearn

In [2]:
# Read the data
df_train = pd.read_csv('X_train.csv')
df_labels = pd.read_csv('y_train.csv')
df_test = pd.read_csv('X_test.csv')

In [3]:
df_train.head(3)

Unnamed: 0,id,x0,x1,x2,x3,x4,x5,x6,x7,x8,...,x822,x823,x824,x825,x826,x827,x828,x829,x830,x831
0,0.0,14168.823171,10514.380717,3316.149698,94230.695124,102.386606,92.677127,11108.748199,10866.50551,10837.622093,...,,12352.094085,846.014651,105.132144,102.112809,2090.00426,2.691845,1234.374109,1000.784475,9285.751272
1,1.0,17757.037554,,4101.016273,92959.527633,,99.855168,10013.959449,10826.607494,10076.101597,...,,16198.071494,776.084467,106.38559,103.47203,2474.051881,2.287976,,1012.626705,11750.284764
2,2.0,14226.656663,11029.642499,,124055.600561,100.542483,92.860892,,10492.342868,,...,10329.704431,13976.06378,737.040332,103.671234,109.458246,2656.083281,2.843706,888.353607,1048.810385,9553.922728


In [4]:
# train dataset
print("Summary Statistics for X_train:")
print(df_train.describe())

Summary Statistics for X_train:
                id            x0            x1           x2             x3  \
count  1212.000000   1121.000000   1128.000000  1124.000000    1121.000000   
mean    605.500000  15220.402957  10950.160761  3430.837498  100002.281022   
std     350.018571   2314.735855   1570.611458   443.431441    9708.061111   
min       0.000000   5636.623777   6764.060541  1849.453269   65828.916291   
25%     302.750000  13846.177869   9859.438276  3152.193184   93497.927859   
50%     605.500000  15048.467618  10839.483074  3401.539562  100053.800306   
75%     908.250000  16653.018233  11902.078799  3698.564818  106139.852699   
max    1211.000000  28273.690135  17777.338221  5622.951648  133145.632257   

                x4           x5            x6            x7            x8  \
count  1138.000000  1133.000000   1115.000000   1113.000000   1129.000000   
mean    105.070358    99.968855   9983.055476  10496.207179  10495.835570   
std       2.834582     9.566001   

##### Observations for Training Features (X_train)
Counts: Many features (e.g., x0, x1, x2, x3) have fewer non-null counts than the total number of rows (1212), which indicates missing values in those columns.

Mean and Standard Deviation: Features such as x0 and x3 have large standard deviations in absolute terms (about 2315 and 9708), which correspond to roughly 15% and 10% of their means. Other features, such as x4, vary on a much smaller scale (standard deviation about 2.8).

Min and Max: Certain features (e.g., x0, x3) span wide ranges between their minimum and maximum values. This could indicate outliers, and it confirms that the features live on very different scales.

Feature Distributions: The quartile values (25%, 50%, 75%) show that some features (e.g., x5, x6) have fairly compact distributions, while others (e.g., x0, x3) have broader spreads.

In [5]:
df_test.head()

Unnamed: 0,id,x0,x1,x2,x3,x4,x5,x6,x7,x8,...,x822,x823,x824,x825,x826,x827,x828,x829,x830,x831
0,0.0,14655.540585,9917.388635,3368.691863,104367.124458,104.132894,95.412138,9222.286185,,10054.751221,...,9153.072463,12039.009308,714.005017,105.651509,103.574436,2628.082823,2.766271,1553.285942,1037.998392,9762.400011
1,1.0,13875.822363,9955.163751,3118.195658,103577.269601,103.290975,86.916779,9625.725002,10592.011548,10234.818476,...,9915.29211,12579.041315,695.070183,106.900274,105.24173,2388.096545,,1386.519117,1088.519466,11748.788738
2,2.0,14807.162495,10682.476988,3335.687716,106647.64261,109.481676,86.476353,9128.693785,10880.97924,10485.268796,...,9733.845509,11009.075093,663.093857,105.541065,101.875603,2097.004365,2.362592,1204.527342,1067.697534,12487.217965
3,3.0,12253.667985,9001.609788,2631.482012,91105.570966,108.741037,84.542046,9765.458299,10953.438053,10190.986014,...,11204.016625,9395.01694,656.018142,104.816602,107.213434,2035.045976,3.052844,794.341243,,7931.828963
4,4.0,18925.988003,12161.366373,3724.006508,110262.565336,102.085795,107.92027,9494.243179,10684.550944,10074.187696,...,11093.701929,13876.039832,964.037692,104.557952,109.185247,2971.082813,2.571392,1207.265749,1013.661254,8241.962297


In [6]:
# test dataset
print("Summary Statistics X_test:")
print(df_test.describe())

Summary Statistics X_test:
               id            x0            x1           x2             x3  \
count  776.000000    725.000000    722.000000   732.000000     733.000000   
mean   387.500000  15344.410832  10906.681312  3442.022564  100538.424402   
std    224.156196   2018.156097   1450.216427   407.464522   10317.233706   
min      0.000000   7824.415331   6996.984058  2335.281266   70097.079236   
25%    193.750000  13879.324474   9941.576512  3158.828463   93073.980377   
50%    387.500000  15181.156582  10717.578244  3416.413527  101088.758342   
75%    581.250000  16613.593078  11750.391470  3697.111946  107028.300658   
max    775.000000  20995.447337  17677.685164  5060.697780  137224.276391   

               x4          x5            x6            x7            x8  ...  \
count  733.000000  731.000000    738.000000    728.000000    730.000000  ...   
mean   105.001292  100.762208  10013.606995  10506.974919  10484.754548  ...   
std      2.916754    9.976019   1014.02

##### Observations for Test Features (X_test)
Similar observations apply here, but with smaller counts due to the smaller size of the test set (776 rows).

Missing values are present in features like x1, x2, x3, and others.

Feature distributions are consistent with the training set, which is good for maintaining data similarity.

In [7]:
df_labels.head()

Unnamed: 0,id,y
0,0.0,74.0
1,1.0,51.0
2,2.0,70.0
3,3.0,52.0
4,4.0,85.0


In [8]:
# Display basic statistics for the labels (y_train)
print("\nBasic Statistics for Labels (y_train):")
print(df_labels.describe())


Basic Statistics for Labels (y_train):
                id            y
count  1212.000000  1212.000000
mean    605.500000    69.889439
std     350.018571     9.720843
min       0.000000    42.000000
25%     302.750000    64.000000
50%     605.500000    70.000000
75%     908.250000    77.000000
max    1211.000000    97.000000


##### Observations for Labels (y_train):
Analyzing the labels helps us understand the distribution of the target.
- Count: All 1212 rows have labels, so no missing values are present in y_train.

- Range: The ages range from 42 to 97 years, with a mean of 69.89 and a standard deviation of 9.72.

- Quartiles:
  - 25% of samples have ages ≤ 64;
  - 50% of samples have ages ≤ 70 (median);
  - 75% of samples have ages ≤ 77.


#### 1.2 Missing Values

Handling missing values is a critical step in the data preprocessing pipeline, as they can impact model performance and analysis accuracy.

The percentage of missing values was calculated for both the training and testing datasets to understand the extent of missing data. This step helps identify patterns and discrepancies between the datasets.

The missing value percentages for training and testing datasets were combined into a single table to allow for easy comparison and to observe the consistency of missing values across both datasets.

In [9]:
# Calculation
missing_train = df_train.isnull().mean() * 100
missing_test = df_test.isnull().mean() * 100

# Combining the results
missing_values_comparison = pd.DataFrame({
    'Training Data (%)': missing_train,
    'Testing Data (%)': missing_test
})

print("\nPercentage of Missing Values in Training and Testing Data:")
print(missing_values_comparison)


Percentage of Missing Values in Training and Testing Data:
      Training Data (%)  Testing Data (%)
id             0.000000          0.000000
x0             7.508251          6.572165
x1             6.930693          6.958763
x2             7.260726          5.670103
x3             7.508251          5.541237
...                 ...               ...
x827           8.333333          5.412371
x828           7.838284          6.701031
x829           7.590759          6.056701
x830           6.600660          4.381443
x831           7.920792          7.603093

[833 rows x 2 columns]


##### Observations for missing data:

In the rows displayed above, the feature columns of the training dataset have between roughly 6.6% and 8.3% missing values; the id column has none.

The testing dataset shows a similar pattern, with roughly 4.4% to 7.6% missing values per displayed feature and none in the id column.

Missing values will need to be addressed to ensure the consistency and reliability of the dataset for further analysis and model training.

#### 1.3 Data Preparation
##### Extracting and Removing Identifier Columns

The first column of the testing dataset (df_test) contains a unique identifier for each sample. It is not used for model training but is needed to create the final submission, so it is extracted and stored in a new DataFrame (df_id).

In general, identifier columns do not contribute to predictive modeling because they carry no meaningful feature information. Including them could lead the model to pick up spurious patterns and overfit. Preserving df_id ensures that the model's predictions can be matched back to their respective samples at submission time.

The ID column is then removed from the training features, the labels, and the testing features so that only feature and target data remain.

In [10]:
df_id = df_test.iloc[:,0:1]

# Remove first column
df_X = df_train.iloc[:,1:]
df_y = df_labels.iloc[:,1:]
df_test = df_test.iloc[:,1:]

The remaining data is split into the following:
- df_X: The training features (excluding the first column).
- df_y: The target labels extracted from a separate dataset (df_labels).
- df_test: The testing features without the identifier column.

#### 1.4 Shuffling the Training Data

Shuffling the training data removes any inherent ordering, which matters for a reliable and unbiased evaluation. Without shuffling, the data might be ordered in some way, for example sorted by age or by feature values. In that case individual cross-validation folds could contain unrepresentative subsets, and the model could appear better or worse than it really is on new, unseen data. This is relevant here because the cross-validation splitter used later in the notebook does not shuffle on its own.

To facilitate shuffling, the feature dataset (df_X) and the target labels (df_y) are concatenated into a single DataFrame (df_all). 
This step ensures that features and their corresponding target labels are aligned during the shuffling process.

The combined dataset is shuffled randomly using the sample(frac=1) method. The parameter frac=1 means the entire dataset is shuffled. A fixed random_state is provided to ensure reproducibility, so every time this code runs, the shuffling produces the same result.


After shuffling, the features and target labels are separated back into two datasets: df_X for the features and df_y for the target labels. This ensures the shuffled data is ready for use in the training process.


In [11]:
# Shuffle training data
df_all = pd.concat([df_X, df_y], axis=1)
shuffled_all = df_all.sample(frac=1, random_state=0)

df_X = shuffled_all.iloc[:,:-1]
df_y = shuffled_all.iloc[:,-1:]
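
As a quick sanity check of the concatenate, shuffle, split pattern, the sketch below repeats it on a tiny made-up frame in which each label is 100 times the first feature, so any misalignment would be visible. All `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_X and df_y (hypothetical values)
toy_X = pd.DataFrame({"x0": [1.0, 2.0, 3.0, 4.0], "x1": [10.0, 20.0, 30.0, 40.0]})
toy_y = pd.DataFrame({"y": [100.0, 200.0, 300.0, 400.0]})

# Same steps as in the cell above
toy_all = pd.concat([toy_X, toy_y], axis=1)
toy_shuffled = toy_all.sample(frac=1, random_state=0)

toy_X_s = toy_shuffled.iloc[:, :-1]
toy_y_s = toy_shuffled.iloc[:, -1:]

# Each row keeps its own label: y == 100 * x0 by construction
aligned = bool((toy_y_s["y"] == 100 * toy_X_s["x0"]).all())
# The original index travels with each row, so rows can still be traced back
same_rows = sorted(toy_shuffled.index) == [0, 1, 2, 3]
```

Because `sample` keeps the original index labels, later positional conversions such as `to_numpy()` are what actually fix the shuffled order.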

### 2. Helper Functions
#### 2.1 Feature Selection
These functions reduce dimensionality by removing features that are constant or highly correlated with other features. Such features add no new information and can degrade model performance.


In [12]:
from feature_engine.selection import DropCorrelatedFeatures, SmartCorrelatedSelection

def remove_X_correlated_features(X_train, alpha=0.99):
    
    # Fit only: the boolean mask of retained columns is all that is needed
    dcor_tr = DropCorrelatedFeatures(threshold=alpha)
    dcor_tr.fit(X_train)

    mask = dcor_tr.get_support()
    return np.array(mask)

def fs_x_correlation(X_train, X_test, alpha=0.99):
    
    mask1 = remove_X_correlated_features(X_train, alpha=alpha)
    
    X_train_decor = X_train[:, mask1]
    X_test_decor = X_test[:, mask1]
    
    return X_train_decor, X_test_decor

In [13]:
from feature_engine.selection import DropConstantFeatures

def drop_constant_features(X_train, X_test):
    
    dconst_tr = DropConstantFeatures(missing_values='ignore')
    X_train_dedup = dconst_tr.fit_transform(X_train)
    X_test_dedup = dconst_tr.transform(X_test)
    
    return X_train_dedup, X_test_dedup

In [14]:
def with_nan_feature_selection(X_train, y_train, X_test, alpha_X=0.99, alpha_y=0.1):

    # Note: y_train and alpha_y are currently unused
    X_train, X_test = fs_x_correlation(X_train, X_test, alpha=alpha_X)
    X_train, X_test = drop_constant_features(X_train, X_test)
    
    return X_train, X_test
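
To make the idea behind correlation-based dropping concrete, here is a rough pandas-only sketch on synthetic data. It is not feature_engine's implementation, whose handling of groups and ties may differ; the helper `drop_correlated` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.99):
    """Keep a column only if it is not too correlated with one already kept."""
    corr = df.corr().abs()  # pairwise absolute Pearson correlation
    keep = []
    for col in df.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return df[keep]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 1.0,    # linear function of "a", correlation 1
    "b": rng.normal(size=200),  # generated independently of "a"
})

reduced = drop_correlated(toy, threshold=0.99)
kept = list(reduced.columns)
```

With a threshold as high as the 0.9999 used later in the notebook, only near-duplicate columns are removed, which is consistent with just a handful of features being dropped.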

#### 2.2 Missing Value Imputation

These helpers impute missing data using either a KNN-based or a median-based strategy. In both cases the imputer is fitted on the training data only and then applied to the test data.


In [15]:
from sklearn.impute import KNNImputer, SimpleImputer

def impute_knn(X_train, X_test, n=20):

    imputer = KNNImputer(n_neighbors=n)
    X_train_imputed = imputer.fit_transform(X_train)
    X_test_imputed = imputer.transform(X_test)
    
    return X_train_imputed, X_test_imputed

def impute_median(X_train, X_test):

    imputer = SimpleImputer(strategy='median')
    X_train_imputed = imputer.fit_transform(X_train)
    X_test_imputed = imputer.transform(X_test)
    
    return X_train_imputed, X_test_imputed
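
The fit-on-train, apply-to-test behaviour can be seen on a tiny example. This is a minimal sketch with made-up single-column arrays; it mirrors `impute_median` but is not part of the pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical one-feature data with missing entries
toy_train = np.array([[1.0], [2.0], [9.0], [np.nan]])
toy_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(toy_train)  # median of [1, 2, 9] is 2
test_filled = imputer.transform(toy_test)        # reuses the training median

learned_median = imputer.statistics_[0]
```

The test value of 100 has no influence on the fill value, which is the property that keeps test information out of the preprocessing.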

#### 2.3 Scaling

In [16]:
from sklearn.preprocessing import StandardScaler

def scale(X_train, X_val):
    
    scaler = StandardScaler()
    
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    
    return X_train_scaled, X_val_scaled

In [17]:
from sklearn.feature_selection import SelectKBest

def select_k_best(X_train, y_train, X_test, k, score_func):
    
    kbest = SelectKBest(k=k, score_func=score_func)
    X_train_selected = kbest.fit_transform(X_train, y_train)
    X_test_selected = kbest.transform(X_test)
    
    return X_train_selected, X_test_selected
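
As an illustration of what `SelectKBest` with `f_regression` does, the sketch below builds three synthetic features, only one of which is related to the target, and keeps the single best one. The data and names are made up.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
y_toy = rng.normal(size=300)
X_toy = np.column_stack([
    rng.normal(size=300),                # pure noise
    y_toy + 0.1 * rng.normal(size=300),  # strongly related to y
    rng.normal(size=300),                # pure noise
])

selector = SelectKBest(score_func=f_regression, k=1)
X_sel = selector.fit_transform(X_toy, y_toy)

# Boolean mask over the input columns: True for the selected feature
support = selector.get_support()
```

`f_regression` scores each feature on its own through its linear correlation with the target, so it can miss features that are only useful in combination or in a non-linear way.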

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer, SimpleImputer

def scale_and_impute(X_train, X_test):
    
    scale_and_impute_pipe = Pipeline([('scaler', StandardScaler()),('imputer', SimpleImputer(strategy='median'))])
    X_train_imputed = scale_and_impute_pipe.fit_transform(X_train)
    X_test_imputed = scale_and_impute_pipe.transform(X_test)
    
    return X_train_imputed, X_test_imputed

#### 2.4 Outlier Detection

Outliers in the training set are detected and removed using the ECOD algorithm from the `pyod` library.


In [19]:
from pyod.models.ecod import ECOD

def outlier_detection(X_train, y_train, contamination=0.01):

    # Boolean mask: True for samples that ECOD considers inliers
    mask = ECOD_outlier_detection(X_train, y_train, contamination)
    
    X_return = X_train[mask]
    y_return = y_train[mask]
    
    # Shapes before and after outlier removal
    print(X_train.shape, X_return.shape)
    
    return X_return, y_return
    
def ECOD_outlier_detection(X_train, y_train, contamination=0.01):
    
    # ECOD is unsupervised: fit() accepts y but ignores it
    estimator = ECOD(contamination=contamination)
    estimator.fit(X_train, y_train)
    
    # predict() returns binary labels: 0 for inliers, 1 for outliers
    labels = estimator.predict(X_train)
    mask = labels != 1
    
    return mask
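
The masking convention is easy to get backwards, so here is a small sketch of it. A hand-written label array stands in for the detector's `predict` output (0 for inliers, 1 for outliers); no detector is actually run and all values are invented.

```python
import numpy as np

# Hypothetical detector output: 0 = inlier, 1 = outlier
labels = np.array([0, 0, 1, 0, 1])
X_toy = np.arange(10, dtype=float).reshape(5, 2)
y_toy = np.array([70.0, 65.0, 99.0, 72.0, 40.0])

# True where the sample is kept
inlier_mask = labels != 1

# The same mask must be applied to features and labels to keep them aligned
X_kept = X_toy[inlier_mask]
y_kept = y_toy[inlier_mask]
```

Only training rows are filtered this way; the test set is left untouched because every test sample needs a prediction.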

### 3. Execution

#### 3.1 Raw Data Preparation
The pandas DataFrames are converted to NumPy arrays so they can be passed to the ML pipeline.


In [20]:
# Data preparation
X_train_raw = df_X.to_numpy()
y_train_raw = df_y.to_numpy().ravel()
X_test_raw = df_test.to_numpy()

In [21]:
# Feature selection before nan value imputation
X_train_selected_nan, X_test_selected_nan = with_nan_feature_selection(X_train_raw, y_train_raw, X_test_raw, alpha_X=0.9999)
print(X_train_selected_nan.shape, X_train_raw.shape)

(1212, 827) (1212, 832)


In [22]:
# Nan value imputation
X_train_selected, X_test_selected = impute_median(X_train_selected_nan, X_test_selected_nan)

In [23]:
print(np.isnan(X_train_selected).sum())  # Check for NaNs in the training set
print(np.isnan(X_test_selected).sum())   # Check for NaNs in the test set

0
0


In [24]:
print(np.isinf(X_train_selected).sum())  # Check for infinities in training set
print(np.isinf(X_test_selected).sum())   # Check for infinities in test set
print(X_train_selected.max(), X_train_selected.min())  # Check data range

0
0
9.935455099994669e+23 -2.9379325892598552e+23
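
The printed range spans roughly 1e24, so a few columns must hold extremely large values. One way to locate them is sketched below; the helper `columns_above` and the cut-off of 1e12 are arbitrary assumptions, and the toy matrix is made up.

```python
import numpy as np

def columns_above(X, limit=1e12):
    """Return indices of columns whose largest absolute value exceeds `limit`."""
    max_abs = np.nanmax(np.abs(X), axis=0)  # NaNs are ignored per column
    return np.flatnonzero(max_abs > limit)

# Toy matrix: column 1 carries an extreme value
X_toy = np.array([[1.0, 2.0e23, 3.0],
                  [4.0, 5.0, np.nan]])
big = columns_above(X_toy)
```

In the notebook this could be called as `columns_above(X_train_selected)` to see which features are responsible; magnitudes like these can cause overflow or precision problems in statistics that square the data.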


In [25]:
# Feature selection after nan value imputation
from scipy.stats import spearmanr, f, pearsonr
from sklearn.feature_selection import f_regression, mutual_info_regression, chi2, f_classif

def f_spearman(X, y):
    corr_array = []
    p_array = []
    for i in range(X.shape[1]):
        corr, p = spearmanr(X[:,i], y)
        corr_array.append(abs(corr))
        p_array.append(p)
        
    return corr_array, p_array

In [26]:
X_train_kselected, X_test_kselected = select_k_best(X_train_selected, y_train_raw, X_test_selected, 
                                                    k=175, score_func=f_regression)

  X_norms = np.sqrt(row_norms(X.T, squared=True) - n_samples * X_means**2)


In [27]:
# Outlier detection
X_train_no_outliers, y_train_no_outliers = outlier_detection(X_train_kselected, y_train_raw, contamination=0.01)



(1212, 175) (1199, 175)


In [28]:
# Scaling
X_train_scaled, X_test_scaled = scale(X_train_no_outliers, X_test_kselected)

In [29]:
X_train = X_train_scaled
X_test = X_test_scaled
y_train = y_train_no_outliers

### 4. Model Training and Evaluation

In this section, we train multiple regression models to predict brain age from the selected MRI features.
We use 5-fold cross-validation to evaluate model performance and compare the following algorithms:

- Gaussian Process Regressor
- Cubist
- Support Vector Regressor
- Gradient Boosting
- Extra Trees
- CatBoost

#### 4.1  Hyperparameter Tuning with GridSearchCV

To optimize model performance, we apply `GridSearchCV` to perform an exhaustive search over a specified hyperparameter grid.
This identifies the best configuration for each of the models in Section 4.2.

In [30]:
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import r2_score

def get_best_parameters(estimator, parameters):
    
    search = GridSearchCV(estimator=estimator, param_grid=parameters, scoring='r2', n_jobs=-1, cv=5, verbose=1)
    search.fit(X_train, y_train)

    print('Best params:', search.best_params_)
    print('score:', search.best_score_)
    print('best:', search.best_estimator_)
    
    return search.best_estimator_
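
The helper above reads `X_train` and `y_train` from the notebook's global scope. A variant that takes the data as arguments is sketched below on synthetic data; the name `get_best_parameters_xy`, the Ridge estimator, and the grid are illustrative choices, not part of the original pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

def get_best_parameters_xy(estimator, parameters, X, y, cv=5):
    """Grid search that receives the data explicitly (hypothetical variant)."""
    search = GridSearchCV(estimator=estimator, param_grid=parameters,
                          scoring='r2', cv=cv)
    search.fit(X, y)
    return search.best_estimator_, search.best_score_

# Synthetic linear problem with a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

best, best_score = get_best_parameters_xy(
    Ridge(), {'alpha': [0.01, 1.0, 100.0]}, X_toy, y_toy)
```

Passing the data explicitly makes it harder to tune on a stale array by accident after re-running preprocessing cells out of order.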

#### 4.2 Models 
This project explores several regression models to predict brain age using features derived from MRI scans. 

##### 4.2.1. Gaussian Process Regressor (GPR)
This model uses a Gaussian Process with a RationalQuadratic kernel, which allows flexibility in modeling functions with varying smoothness. It performs Bayesian regression, capturing complex nonlinear patterns and providing uncertainty estimates for predictions. The hyperparameters alpha and normalize_y were optimized via grid search, with the kernel held fixed.
GPR reached the highest single-model cross-validation score (about 0.617), although it is not among the base learners listed for the final ensemble in Section 5.

In [31]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic

rational_kernel = RationalQuadratic(alpha=0.6, length_scale=8)
gpr = GaussianProcessRegressor(random_state=0)
gpr_parameters = {'kernel' : [rational_kernel], 'alpha' : np.logspace(-10, -1, 20), 'normalize_y' : [True, False]}
gpr_final = get_best_parameters(gpr, gpr_parameters)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
Best params: {'alpha': 2.6366508987303554e-09, 'kernel': RationalQuadratic(alpha=0.6, length_scale=8), 'normalize_y': True}
score: 0.6168104847972087
best: GaussianProcessRegressor(alpha=2.6366508987303554e-09,
                         kernel=RationalQuadratic(alpha=0.6, length_scale=8),
                         normalize_y=True, random_state=0)
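
The uncertainty estimates mentioned above are available through `predict(..., return_std=True)`. The sketch below shows this on a synthetic one-dimensional problem; the data, the `alpha=1e-6` setting, and the query points are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic

# Synthetic data: a sine curve sampled on [0, 5]
rng = np.random.default_rng(0)
X_toy = np.sort(rng.uniform(0.0, 5.0, size=(30, 1)), axis=0)
y_toy = np.sin(X_toy).ravel()

gpr_toy = GaussianProcessRegressor(
    kernel=RationalQuadratic(alpha=0.6, length_scale=8),
    alpha=1e-6, normalize_y=True, random_state=0)
gpr_toy.fit(X_toy, y_toy)

# One point inside the training range and one far outside it
X_new = np.array([[2.5], [50.0]])
mean, std = gpr_toy.predict(X_new, return_std=True)
```

The predictive standard deviation is typically small near the training data and grows for points far from it, which is what makes it usable as a rough confidence signal.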


##### 4.2.2 Cubist Regressor
Cubist is a rule-based regression model that extends decision trees with linear models at the leaves. It builds a series of rule-based models (committees) and refines predictions using nearby neighbors for smoothing. In this implementation, the number of committees and neighbors were optimized using grid search.
This model adds interpretable, rule-based structure and strong performance to the ensemble.



In [32]:
from cubist import Cubist

cub = Cubist(n_rules=500, random_state=0)
cub_parameters = {'n_committees' : [1, 2, 3, 4, 5, 6, 7, 8, 9], 'neighbors' : [3, 4, 5, 6]}
cub_final = get_best_parameters(cub, cub_parameters)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best params: {'n_committees': 9, 'neighbors': 3}
score: 0.569956198138764
best: Cubist(n_committees=9, neighbors=3, random_state=0)


##### 4.2.3 Support Vector Regressor (SVR)
SVR is a kernel-based regression model that fits a function while ignoring errors smaller than a specified tolerance. Here the grid search compared the default RBF kernel with a Rational Quadratic kernel taken from scikit-learn's Gaussian process module, and the Rational Quadratic kernel was selected. The search also covered C (regularization) and epsilon (tube size).
This model contributed non-linear flexibility and robustness to the ensemble.



In [33]:
from sklearn.svm import SVR
from sklearn.gaussian_process.kernels import RationalQuadratic

rational_kernel = RationalQuadratic(alpha=0.6, length_scale=8)
svr = SVR()
svr_parameters = {'kernel' : ['rbf', rational_kernel], 'epsilon' : np.logspace(-8, -1, 8), 'C' : np.linspace(50, 80, 10)}
svr_final = get_best_parameters(svr, svr_parameters)

Fitting 5 folds for each of 160 candidates, totalling 800 fits
Best params: {'C': 73.33333333333334, 'epsilon': 1e-05, 'kernel': RationalQuadratic(alpha=0.6, length_scale=8)}
score: 0.6160210058988999
best: SVR(C=73.33333333333334, epsilon=1e-05,
    kernel=RationalQuadratic(alpha=0.6, length_scale=8))
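
Passing a Gaussian process kernel to SVR works because SVR accepts any callable that returns a Gram matrix, and scikit-learn's kernel objects are callable in that way. A minimal sketch on synthetic data follows; the values of C and epsilon are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.gaussian_process.kernels import RationalQuadratic

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3.0, 3.0, size=(80, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1]

# Calling the kernel object as kernel(X, Y) returns the Gram matrix
kernel = RationalQuadratic(alpha=0.6, length_scale=8)
gram = kernel(X_toy[:5], X_toy[:3])

# SVR calls the kernel internally during fit and predict
svr_toy = SVR(kernel=kernel, C=10.0, epsilon=0.01)
svr_toy.fit(X_toy, y_toy)
pred = svr_toy.predict(X_toy[:5])
```

Unlike in a Gaussian process, the kernel's parameters are not optimized during fitting here, so `alpha` and `length_scale` stay at the values given.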


##### 4.2.4 Gradient Boosting Regressor (GBR)
Gradient Boosting is a powerful ensemble method that builds models sequentially to correct errors of previous models. In this implementation, hyperparameters such as n_estimators, learning_rate, min_samples_split, and max_depth were tuned via grid search. This model is particularly good at capturing non-linear patterns and reducing bias.
GBR contributed stable and well-generalized predictions to the stacked ensemble.

In [58]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(random_state=0)
gbr_parameters = {'n_estimators' : [100, 500, 1000], 'learning_rate' : np.logspace(-3, -1, 3), 'min_samples_split' : [2, 3], 'max_depth' : [2, 3]}
gbr_final = get_best_parameters(gbr, gbr_parameters)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best params: {'learning_rate': 0.1, 'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 1000}
score: 0.5709102671576372
best: GradientBoostingRegressor(n_estimators=1000, random_state=0)


##### 4.2.5  Extra Trees Regressor (ETR)
Extra Trees (Extremely Randomized Trees) is an ensemble learning method that builds multiple de-correlated decision trees using random splits of features and samples. It tends to reduce overfitting and improve generalization. In this implementation, n_estimators and min_samples_split were tuned.
ETR introduced diversity and robustness to the stacked model, helping reduce variance.

In [36]:
from sklearn.ensemble import ExtraTreesRegressor
trees = ExtraTreesRegressor(random_state=0)
trees_parameters = {'n_estimators' : [2000], 'min_samples_split' : [2, 3, 4, 5, 6]}
trees_final = get_best_parameters(trees, trees_parameters)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best params: {'min_samples_split': 2, 'n_estimators': 2000}
score: 0.559179331472623
best: ExtraTreesRegressor(n_estimators=2000, random_state=0)


##### 4.2.6 CatBoost Regressor
CatBoost is a high-performance gradient boosting algorithm developed by Yandex, optimized for categorical features and robust performance with minimal preprocessing. In this project, only the learning_rate parameter was tuned.
CatBoost contributed to the ensemble by leveraging its powerful regularization and efficient handling of non-linear relationships.

In [37]:
from catboost import CatBoostRegressor
cat = CatBoostRegressor(verbose=False)
cat_parameters = {'learning_rate' : np.logspace(-5, 0, 10)}
cat_final = get_best_parameters(cat, cat_parameters)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'learning_rate': 0.07742636826811278}
score: 0.5823264115578912
best: <catboost.core.CatBoostRegressor object at 0x123c14a60>


### 5. Final Ensemble – Stacking Regressor Architecture
The final model is a stacking ensemble composed of the following base learners:

- SVR (Support Vector Regressor): A kernel-based model that captures complex relationships, included to provide diversity in the ensemble.

- Extra Trees Regressor: A randomized ensemble of decision trees, offering robustness and low bias; optimized with 2000 estimators.

- CatBoost Regressor: A gradient boosting model efficient with categorical features; optimized for learning rate and showing strong performance in cross-validation.

- Cubist: A rule-based regression model known for interpretability and strong generalization; contributes complementary predictive patterns.

- Gradient Boosting Regressor (GBR): A well-tuned boosting model with 1000 estimators and learning rate optimization, contributing to the ensemble’s strength.

All models were selected based on 5-fold cross-validation performance.
The final estimator is a RidgeCV regressor, which learns to combine predictions from all base models in a regularized linear fashion.

This stacked ensemble achieved the highest average cross-validation R² score (0.622) across all submissions and was selected for final predictions.
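
The way a stacking regressor combines its base learners can be inspected through the fitted final estimator. The sketch below uses two simple stand-in base learners on synthetic data and sets `RidgeCV` explicitly, which is also what scikit-learn uses when `final_estimator` is left at its default; the toy learners and data are assumptions.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.1 * rng.normal(size=120)

# Toy base learners; the tuned models would take their place
toy_estimators = [
    ('lin', LinearRegression()),
    ('tree', DecisionTreeRegressor(max_depth=3, random_state=0)),
]

stack = StackingRegressor(estimators=toy_estimators,
                          final_estimator=RidgeCV(), cv=5)
stack.fit(X_toy, y_toy)

# One meta-weight per base learner, learned on out-of-fold predictions
meta_weights = stack.final_estimator_.coef_
```

Looking at `meta_weights` after fitting the real ensemble would show how much each base model contributes to the final prediction.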

In [59]:
estimators = [('svr', svr_final), ('trees', trees_final), ('cat', cat_final), ('cub', cub_final), ('gbr', gbr_final)] 

In [60]:
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

K = 5
# No shuffling here: the training data was already shuffled in Section 1.4
cv_splitter = KFold(n_splits=K, shuffle=False)

In [61]:
for name, regressor in estimators:
    
    score = cross_val_score(estimator=regressor, X=X_train, y=y_train, cv=cv_splitter, scoring='r2', n_jobs=-1)
    mean_score = np.mean(score)

    print(f"{name}: {K} fold CV score is {mean_score} and the list is \n{score}")

svr: 5 fold CV score is 0.6160210058988999 and the list is 
[0.57974939 0.60918418 0.69969808 0.6577993  0.53367408]
trees: 5 fold CV score is 0.559179331472623 and the list is 
[0.5106922  0.55687524 0.65288851 0.58768227 0.48775844]
cat: 5 fold CV score is 0.5823264115578912 and the list is 
[0.51959126 0.58415126 0.66358088 0.6421751  0.50213356]
cub: 5 fold CV score is 0.569956198138764 and the list is 
[0.53990847 0.56945203 0.64706684 0.62873484 0.4646188 ]
gbr: 5 fold CV score is 0.5709102671576372 and the list is 
[0.46763857 0.58479549 0.64057714 0.62529389 0.53624625]


In [62]:
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import KFold, cross_val_score

stacking_regressor = StackingRegressor(estimators=estimators, n_jobs=-1)
score = cross_val_score(estimator=stacking_regressor, X=X_train, y=y_train, cv=cv_splitter, scoring='r2', verbose=3)
mean_score = np.mean(score)

print(f"Stacking: {K} fold CV score is {mean_score} and the list is \n{score}")



[CV] END ................................ score: (test=0.560) total time= 4.3min
[CV] END ................................ score: (test=0.624) total time= 7.2min
[CV] END ................................ score: (test=0.712) total time= 4.3min
[CV] END ................................ score: (test=0.674) total time=24.0min
[CV] END ................................ score: (test=0.540) total time= 6.1min
Stacking: 5 fold CV score is 0.6221823168274779 and the list is 
[0.56045363 0.62433992 0.71154566 0.67434739 0.54022498]


In [63]:
# Fit to all training data
stacking_regressor.fit(X_train, y_train)



### 6. Generating and Saving Predictions

After selecting the best-performing model, predictions for the test dataset were generated.
The predicted brain ages are saved in a CSV file formatted for submission.

In [64]:
# Create Submission
y_predict = stacking_regressor.predict(X_test)
submission = df_id.assign(y=y_predict)
print(submission)



        id          y
0      0.0  60.163707
1      1.0  74.129793
2      2.0  67.577205
3      3.0  77.468771
4      4.0  72.563396
..     ...        ...
771  771.0  65.840385
772  772.0  73.255432
773  773.0  77.025010
774  774.0  72.764301
775  775.0  62.834754

[776 rows x 2 columns]


In [65]:
# Write the DataFrame created above (named `submission`) to disk
submission.to_csv('submission.csv', index=False)
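
Before uploading, it can help to verify that the file matches the expected two-column format. The check below is a sketch; `check_submission` is a hypothetical helper and the toy frame only imitates the structure of the real submission.

```python
import numpy as np
import pandas as pd

def check_submission(df, n_rows):
    """Raise if the frame does not look like an id/y submission (illustrative)."""
    if list(df.columns) != ['id', 'y']:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if len(df) != n_rows:
        raise ValueError(f"expected {n_rows} rows, got {len(df)}")
    if df['y'].isna().any():
        raise ValueError("predictions contain NaN")
    return True

# Toy frame built the same way as the submission in the cell above
toy_id = pd.DataFrame({'id': [0.0, 1.0, 2.0]})
toy_submission = toy_id.assign(y=np.array([60.2, 74.1, 67.6]))
ok = check_submission(toy_submission, n_rows=3)
```

For the real data the call would be `check_submission(submission, n_rows=776)`, since the test set has 776 samples.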