In [None]:
# 1. Training and Validation Set Creation

# Approaches to Data Splitting. Our dataset has 100 samples across 5 balanced classes (20 per class). Two splitting approaches suit a dataset of this size:

# a. Simple Train-Test Split (Stratified)

# This approach splits the dataset once (e.g. 80% train / 20% validation), stratified by class to preserve class balance. It is quick and easy,
# but the performance estimate depends heavily on which samples happen to land in that single validation set.

# Justification:
# With only 100 samples, an 80-20 split gives us 80 training and 20 validation samples. Stratification keeps the class proportions the same in both sets,
# i.e. 16 training and 4 validation samples per class.


# Splitting Approach
from sklearn.model_selection import train_test_split

# Stratified split maintains class proportions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y,
    random_state=42
)
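
# A quick sanity check of the arithmetic above. This is a minimal sketch on stand-in data, not the real dataset: X_demo and y_demo are
# hypothetical arrays that only mimic the shape described (100 samples, 5 balanced classes). It counts how many samples of each class
# end up on each side of a stratified 80/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (assumption): 100 samples, 5 balanced classes of 20.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = np.repeat(["DCM", "HCM", "MINF", "NOR", "RV"], 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Per-class sample counts on each side of the split
train_counts = {str(k): int(v) for k, v in zip(*np.unique(y_tr, return_counts=True))}
val_counts = {str(k): int(v) for k, v in zip(*np.unique(y_va, return_counts=True))}

print("train per class:", train_counts)
print("val per class:  ", val_counts)
```

# With only a handful of validation samples per class, every single prediction carries a lot of weight in the per-class metrics.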

# we shall look at an example later.

In [None]:

# b. K-Fold Cross-Validation (Stratified)

# This approach splits the data into k folds (e.g. k = 5), keeping class proportions the same in each fold. Each fold serves as the validation set once, and the
# k results are averaged. This gives a more reliable performance estimate and reduces the variance that comes from a single random split, but it is more
# computationally expensive because the model is trained k times.

# Justification: With small datasets, K-Fold cross-validation makes better use of the data. With 5 folds, each validation set has 20 samples (like the
# simple split), but every sample is used for validation exactly once and for training in the other 4 folds.

# The approach
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
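for train_index, val_index in skf.split(X, y):
    # X and y are assumed to be a pandas DataFrame / Series (as in the next cell),
    # so rows are selected by position with .iloc
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]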
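
# The fold structure described above can be checked directly. A minimal sketch, again on hypothetical stand-in labels (y_demo, X_demo are
# assumptions, not the real data): it records how often each sample is used for validation and how many samples of each class fall
# into each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in labels (assumption): 5 balanced classes, 20 samples each.
y_demo = np.repeat(np.arange(5), 20)
X_demo = np.zeros((100, 1))  # feature values do not influence how the folds are drawn

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

times_validated = np.zeros(len(y_demo), dtype=int)
fold_class_counts = []
for train_idx, val_idx in skf_demo.split(X_demo, y_demo):
    times_validated[val_idx] += 1  # count validation appearances per sample
    fold_class_counts.append(np.bincount(y_demo[val_idx], minlength=5))

fold_class_counts = np.array(fold_class_counts)  # rows: folds, columns: classes

print("validation samples per class, per fold:")
print(fold_class_counts)
print("each sample validated between",
      times_validated.min(), "and", times_validated.max(), "times")
```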

# we can see an illustration of both approaches in the next cell.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import (train_test_split, StratifiedKFold, cross_val_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# read dataset
data = pd.read_csv('ACDC_radiomics.csv')

# create the features and target class
X = data.drop(columns=['class'])  # Features
y = data['class']  # Target variable (class labels)


# 1. Simple stratified split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Train a model
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
val_pred = model.predict(X_val_scaled)
print("Simple Split Validation Accuracy:", accuracy_score(y_val, val_pred))
print(classification_report(y_val, val_pred))

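# 2. K-Fold CV evaluation
# Scaling is wrapped in a pipeline so the scaler is re-fit on each training fold only,
# which keeps validation-fold statistics from leaking into training.
from sklearn.pipeline import make_pipeline

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))

cv_scores = cross_val_score(
    cv_model,
    X,
    y,
    cv=cv,
    scoring='accuracy'
)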

print(f"K-Fold CV Accuracy: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")

Simple Split Validation Accuracy: 0.8
              precision    recall  f1-score   support

         DCM       0.75      0.75      0.75         4
         HCM       1.00      0.75      0.86         4
        MINF       0.75      0.75      0.75         4
         NOR       0.75      0.75      0.75         4
          RV       0.80      1.00      0.89         4

    accuracy                           0.80        20
   macro avg       0.81      0.80      0.80        20
weighted avg       0.81      0.80      0.80        20

K-Fold CV Accuracy: 0.790 ± 0.080


In [None]:
# Results show that the Simple Split (80% train / 20% validation) and 5-Fold Cross-Validation give comparable accuracy (0.80 and 0.79 respectively).
# However, K-Fold CV also reports the fold-to-fold spread (standard deviation of 0.08), which a single split cannot provide.
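
# One way to look at that spread more closely is to repeat the stratified K-Fold procedure with different shuffles. The sketch below is an
# illustration under stated assumptions, not part of the analysis above: it uses synthetic data from make_classification (X_demo, y_demo are
# hypothetical stand-ins for the real features and labels) and a smaller forest to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

N_SPLITS, N_REPEATS = 5, 5

# Synthetic stand-in data (assumption): 100 samples, 5 balanced classes.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=20, n_informative=8,
    n_classes=5, n_clusters_per_class=1, flip_y=0.0, random_state=0,
)

rskf = RepeatedStratifiedKFold(n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=rskf, scoring="accuracy",
)

# Scores come back repeat by repeat, so each row below is one full 5-fold run.
per_repeat = scores.reshape(N_REPEATS, N_SPLITS).mean(axis=1)

print(f"all folds:   {scores.mean():.3f} ± {scores.std():.3f} "
      f"(min {scores.min():.2f}, max {scores.max():.2f})")
print("mean accuracy per repeat:", np.round(per_repeat, 3))
```

# If the per-repeat means differ noticeably, the reported accuracy depends on the particular shuffle as well as on the model.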


# Key Observations:
# The simple split and K-Fold CV agree on accuracy (~80%), so the single split was not an unusually lucky or unlucky one. This agreement alone does not
# rule out overfitting, which would require comparing training and validation accuracy. Class-wise performance on the simple split varies:
# HCM achieves perfect precision (1.00) but lower recall (0.75), suggesting conservative predictions, while RV is the best-predicted class (F1 = 0.89) and
# DCM, MINF and NOR lag behind (F1 = 0.75). With only 4 validation samples per class, a single misclassification shifts recall by 0.25, so these
# class-wise differences are tentative. The K-Fold CV’s ±0.08 standard deviation (roughly 0.71 to 0.87 at one standard deviation around the mean) shows
# that the estimate depends on how the data are split, a common issue with small datasets. This motivates a closer look at class-specific errors and at
# generalization, particularly for the weaker classes.
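
# Two follow-up checks can be sketched with standard scikit-learn tools. This is a hedged illustration on synthetic data (X_demo, y_demo
# are hypothetical stand-ins, and the smaller forest is chosen only for speed): cross_validate with return_train_score=True exposes the
# gap between training and validation accuracy, and cross_val_predict pools out-of-fold predictions so that the per-class report rests
# on all 20 samples of each class instead of 4.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     cross_validate)

# Synthetic stand-in data (assumption): 100 samples, 5 balanced classes.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=20, n_informative=8,
    n_classes=5, n_clusters_per_class=1, flip_y=0.0, random_state=0,
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 1. Training vs. validation accuracy per fold (a large gap points to overfitting)
res = cross_validate(clf, X_demo, y_demo, cv=cv_demo,
                     scoring="accuracy", return_train_score=True)
train_acc = res["train_score"].mean()
val_acc = res["test_score"].mean()
print(f"train {train_acc:.3f} | validation {val_acc:.3f} | gap {train_acc - val_acc:.3f}")

# 2. Out-of-fold predictions: every sample is predicted by a model that never saw it
oof_pred = cross_val_predict(clf, X_demo, y_demo, cv=cv_demo)
report = classification_report(y_demo, oof_pred, output_dict=True, zero_division=0)
print(classification_report(y_demo, oof_pred, zero_division=0))
```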
