# Random Forest Classification via cross-validation

- Classification task: predict whether a sample is SL (1) or not SL (0).
- Tree-based classifiers accept mixed feature types (continuous, binary, and numerically encoded categorical) without scaling. They can model non-linear relationships and interactions among the independent variables (the features used to make predictions) and between those features and the dependent variable (the value to predict).
- The target variable is the column `SL_new` (see `target_column` below). It has two classes, non-SL (`False`, i.e. 0) and SL (`True`, i.e. 1), so this is a binary classification problem. The selected feature columns are the model inputs.

*StratifiedKFold* is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

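*StratifiedKFold* preserves the class ratio in both the train and test sets. In this dataset about 2.3% of samples are SL (the stratification checks below report an overall SL rate of 0.0232), i.e. roughly 1 positive per 42 negatives.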
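
A minimal sketch of this behaviour on synthetic labels (the 10% positive rate and array sizes here are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels: 100 positives out of 1000 samples (10%).
y = np.array([1] * 100 + [0] * 900)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_rates = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    train_rate = y[train_idx].mean()
    test_rate = y[test_idx].mean()
    test_rates.append(test_rate)
    print(f"fold {fold}: train rate = {train_rate:.3f}, test rate = {test_rate:.3f}")
```

Because both class counts divide evenly by the number of folds here, every fold reproduces the 10% rate exactly; with real data the match is only approximate.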

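*StratifiedGroupKFold* is used to generate the train and test sets for Model II and Model III, where all samples that share a group (a cell line or a gene pair, respectively) must fall on the same side of each split.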

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedGroupKFold.html#sklearn.model_selection.StratifiedGroupKFold

The implementation is designed to:
- Generate test sets such that all contain the same distribution of classes, or as close as possible.
- Be invariant to class label: relabelling y = ["Happy", "Sad"] to y = [1, 0] should not change the indices generated.
- Preserve order dependencies in the dataset ordering, when shuffle=False: all samples from class k in some test set were contiguous in y, or separated in y by samples from classes other than k.
- Generate test sets where the smallest and largest differ by at most one sample.
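
Two of the properties listed above (label invariance and near-equal test-set sizes) can be checked on a toy target. This is only a sketch using *StratifiedKFold* with `shuffle=False`; the 23-sample label vector is invented for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 23 toy samples: 11 "Happy" and 12 "Sad".
y_str = np.array(["Happy", "Sad"] * 11 + ["Sad"])
y_int = np.where(y_str == "Happy", 1, 0)  # the same labels, relabelled
X = np.zeros((len(y_str), 1))

skf = StratifiedKFold(n_splits=5, shuffle=False)
tests_str = [test for _, test in skf.split(X, y_str)]
tests_int = [test for _, test in skf.split(X, y_int)]

# Relabelling the classes should leave the generated indices unchanged.
same_indices = all(np.array_equal(a, b) for a, b in zip(tests_str, tests_int))
# Test-set sizes should differ by at most one sample.
sizes = [len(t) for t in tests_str]
print("same indices:", same_indices, "| test sizes:", sizes)
```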

*type_of_target*: determines the type of data indicated by the target (binary, multiclass, etc.).
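
A short sketch of what `type_of_target` returns for a few hand-written targets (the input lists are illustrative only):

```python
from sklearn.utils.multiclass import type_of_target

binary_int = type_of_target([0, 1, 1, 0])          # two integer classes
binary_bool = type_of_target([True, False, True])  # boolean labels, as in a True/False target column
multiclass = type_of_target([0, 1, 2, 1])          # more than two discrete classes
continuous = type_of_target([0.5, 1.7, 2.3])       # non-integer floats

print(binary_int, binary_bool, multiclass, continuous)
```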

In [1]:
# import modules
import os
import pandas as pd
import numpy as np
import random
import pickle

# import scikit-learn modules
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay, roc_auc_score, average_precision_score, precision_recall_curve, auc, roc_curve

import time

In [2]:
cwd = os.getcwd()
BASE_DIR = os.path.abspath(os.path.join(cwd, ".."))

# build paths inside the repo
def get_data_path(folders, fname):
    return os.path.normpath(os.path.join(BASE_DIR, *folders, fname))

file_path_training_data = get_data_path(['output', 'models'], 'training_data.csv')

file_RF_model_I_early = get_data_path(['output', 'models', 'model_I'], 'RF_model_early.pickle')
file_RF_model_I_late = get_data_path(['output', 'models', 'model_I'], 'RF_model_late.pickle')
file_RF_model_I_early_repeated = get_data_path(['output', 'models', 'model_I'], 'RF_model_early_repeated.pickle')
file_RF_model_I_late_repeated = get_data_path(['output', 'models', 'model_I'], 'RF_model_late_repeated.pickle')

file_RF_model_II_early = get_data_path(['output', 'models', 'model_II'], 'RF_model_early.pickle')
file_RF_model_II_late = get_data_path(['output', 'models', 'model_II'], 'RF_model_late.pickle')
file_RF_model_II_early_repeated = get_data_path(['output', 'models', 'model_II'], 'RF_model_early_repeated.pickle')
file_RF_model_II_late_repeated = get_data_path(['output', 'models', 'model_II'], 'RF_model_late_repeated.pickle')

file_RF_model_III_early = get_data_path(['output', 'models', 'model_III'], 'RF_model_early.pickle')
file_RF_model_III_late = get_data_path(['output', 'models', 'model_III'], 'RF_model_late.pickle')
file_RF_model_III_early_repeated = get_data_path(['output', 'models', 'model_III'], 'RF_model_early_repeated.pickle')
file_RF_model_III_late_repeated = get_data_path(['output', 'models', 'model_III'], 'RF_model_late_repeated.pickle')

file_RF_model_IV_early = get_data_path(['output', 'models', 'model_IV'], 'RF_model_early.pickle')
file_RF_model_IV_late = get_data_path(['output', 'models', 'model_IV'], 'RF_model_late.pickle')

In [3]:
training_df = pd.read_csv(file_path_training_data)
training_df.head()

Unnamed: 0,genepair,A1,A2,A1_entrez,A2_entrez,DepMap_ID,cell_line,Gemini_FDR,raw_LFC,SL,...,colocalisation,interact,n_total_ppi,fet_ppi_overlap,gtex_spearman_corr,gtex_min_mean_expr,gtex_max_mean_expr,GEMINI,LFC,SL_new
0,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000022,PATU8988S_PANCREAS,0.998944,0.088856,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.118768,0.088856,False
1,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000307,PK1_PANCREAS,0.986587,0.201704,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.132501,0.201704,False
2,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000632,HS944T_SKIN,1.0,0.069772,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.024593,0.069772,False
3,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000681,A549_LUNG,0.977988,0.379455,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,-0.241323,0.379455,False
4,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000756,GI1_CENTRAL_NERVOUS_SYSTEM,0.999586,-0.077118,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.299715,-0.077118,False


In [4]:
feature_columns_1 = ['rMaxExp_A1A2', 'rMinExp_A1A2',
                     'max_ranked_A1A2', 'min_ranked_A1A2',
                     'max_cn', 'min_cn', 'Protein_Altering', 'Damaging', 
                     'min_sequence_identity',
                     'prediction_score', 
                     'weighted_PPI_essentiality', 'weighted_PPI_expression',
                     'smallest_BP_GO_essentiality', 'smallest_CC_GO_essentiality',
                     'smallest_BP_GO_expression', 'go_CC_expression'
                     ]

target_column = 'SL_new'

print('num of features:', len(feature_columns_1))

num of features: 16


In [5]:
feature_columns_2 = feature_columns_1 + ['closest', 'WGD', 'family_size',
                                         'cds_length_ratio', 'shared_domains', 'has_pombe_ortholog',
                                         'has_essential_pombe_ortholog', 'has_cerevisiae_ortholog', 'has_essential_cerevisiae_ortholog', 
                                         'conservation_score', 'mean_age', 'either_in_complex', 'mean_complex_essentiality', 'colocalisation',
                                         'interact', 'n_total_ppi', 'fet_ppi_overlap',
                                         'gtex_spearman_corr', 'gtex_min_mean_expr', 'gtex_max_mean_expr']
feature_columns_2.remove('prediction_score')
print('num of features:', len(feature_columns_2))

num of features: 35


## Model I
#### Random selection

In [6]:
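def safe_roc_auc(y_true, y_score):
    """ROC AUC over samples with a non-NaN score; NaN if fewer than two such samples or only one class remains."""
    mask = ~np.isnan(y_score)
    if np.sum(mask) < 2 or len(np.unique(y_true[mask])) < 2:
        return np.nan
    return roc_auc_score(y_true[mask], y_score[mask])

def safe_average_precision(y_true, y_score):
    """Average precision over samples with a non-NaN score; NaN if no score is available."""
    mask = ~np.isnan(y_score)
    if np.sum(mask) == 0:
        return np.nan
    return average_precision_score(y_true[mask], y_score[mask])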

In [7]:
def validate_stratification(y_train, y_test, fold_num, model_name, tolerance=0.05):
    """
    Validate that stratification is working correctly across train/test splits.
    
    Parameters:
    - y_train, y_test: target arrays for train and test sets
    - fold_num: current fold number
    - model_name: name of the model for logging
    - tolerance: acceptable difference in class proportions (default 5%)
    """
    # Calculate class proportions
    train_pos_rate = y_train.mean()
    test_pos_rate = y_test.mean()
    overall_pos_rate = np.concatenate([y_train, y_test]).mean()
    
    # Calculate differences from overall rate
    train_diff = abs(train_pos_rate - overall_pos_rate)
    test_diff = abs(test_pos_rate - overall_pos_rate)
    train_test_diff = abs(train_pos_rate - test_pos_rate)
    
    # Print detailed information
    print(f"  [{model_name} - Fold {fold_num}] Stratification Check:")
    print(f"    Overall SL rate: {overall_pos_rate:.4f}")
    print(f"    Train SL rate:   {train_pos_rate:.4f} (diff: {train_diff:.4f})")
    print(f"    Test SL rate:    {test_pos_rate:.4f} (diff: {test_diff:.4f})")
    print(f"    Train/Test diff: {train_test_diff:.4f}")
    
    # Check for violations
    warnings = []
    if train_diff > tolerance:
        warnings.append(f"Train set deviation ({train_diff:.4f}) exceeds tolerance ({tolerance})")
    if test_diff > tolerance:
        warnings.append(f"Test set deviation ({test_diff:.4f}) exceeds tolerance ({tolerance})")
    if train_test_diff > tolerance:
        warnings.append(f"Train/Test difference ({train_test_diff:.4f}) exceeds tolerance ({tolerance})")
    
    if warnings:
        print("WARNINGS:")
        for warning in warnings:
            print(f"      - {warning}")
    else:
        print("Stratification OK")
    
    return {
        'train_pos_rate': train_pos_rate,
        'test_pos_rate': test_pos_rate,
        'overall_pos_rate': overall_pos_rate,
        'train_diff': train_diff,
        'test_diff': test_diff,
        'train_test_diff': train_test_diff,
        'warnings': warnings
    }

def validate_group_splits(train_groups, test_groups, fold_num, model_name, group_type="groups"):
    """
    Validate that groups don't leak between train and test sets.
    """
    train_set = set(train_groups)
    test_set = set(test_groups)
    overlap = train_set & test_set
    
    print(f"  [{model_name} - Fold {fold_num}] Group Validation:")
    print(f"    Train {group_type}: {len(train_set)}")
    print(f"    Test {group_type}:  {len(test_set)}")
    print(f"    Overlapping {group_type}: {len(overlap)}")
    
    if len(overlap) > 0:
        print(f"GROUP LEAKAGE DETECTED: {overlap}")
        return False
    else:
        print("No group leakage")
        return True
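
With an overall SL rate near 0.023, an absolute tolerance of 0.05 on the positive rate is hard to exceed, so `validate_stratification` reports "Stratification OK" even when a fold's rate is less than half the overall rate. The sketch below contrasts the absolute check with a relative one, using the rates printed in the Model II, fold 1 log. The `relative_deviation` helper and the 25% threshold are hypothetical and not part of the notebook.

```python
def relative_deviation(rate, overall_rate):
    """Deviation of a fold's positive rate, expressed as a fraction of the overall rate."""
    return abs(rate - overall_rate) / overall_rate

# Rates copied from the Model II, fold 1 log (overall 0.0232, test 0.0097).
overall_rate, test_rate = 0.0232, 0.0097

abs_dev = abs(test_rate - overall_rate)
rel_dev = relative_deviation(test_rate, overall_rate)

abs_flag = abs_dev > 0.05  # absolute tolerance used by validate_stratification
rel_flag = rel_dev > 0.25  # hypothetical relative tolerance of 25%

print(f"absolute deviation = {abs_dev:.4f}, flagged: {abs_flag}")
print(f"relative deviation = {rel_dev:.1%}, flagged: {rel_flag}")
```

A relative threshold is one possible way to make the check sensitive on a heavily imbalanced target; the appropriate value depends on how much drift between folds is acceptable.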

In [8]:
def model_early_cross_validation(classifier, data, target, splits, verbose=True, model_name="Model"):
    tprs = []
    fprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)

    pred_aucs, seq_aucs, gene_expr_aucs, gene_ess_aucs = [], [], [], []
    aps, pred_aps, seq_aps, gene_expr_aps, gene_ess_aps = [], [], [], [], []

    y_real = []
    y_proba = []

    stratification_results = []
    skipped_folds = []

    total_splits = len(splits)
    start_time = time.time()

    for fold_num, (train, test) in enumerate(splits, start=1):
        print(f"Fold {fold_num}: Train size = {len(train)}, Test size = {len(test)}")

        #print(f"\n=== {model_name} - Fold {fold_num}/{total_splits} ===")
        #print(f"Train size = {len(train)}, Test size = {len(test)}")

        if len(test) == 0:
            print(f"Skipping fold {fold_num}: Empty test set")
            skipped_folds.append(fold_num)
            continue

        # Get train/test data
        y_train = target.iloc[train].values
        y_test = target.iloc[test].values

        # validate stratification
        stratification_result = validate_stratification(y_train, y_test, fold_num, model_name)
        stratification_results.append(stratification_result)

        if np.unique(y_test).size < 2:
            print(f"Skipping fold {fold_num}: Insufficient positive/negative samples in test set")
            print(f"Test set classes: {np.unique(y_test)}")
            skipped_folds.append(fold_num)
            continue

        # Train the classifier and predict probabilities
        y_pred_proba = classifier.fit(data.iloc[train], target.iloc[train]).predict_proba(data.iloc[test])[:, 1]

        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        tprs.append(tpr)
        fprs.append(fpr)

        aucs.append(roc_auc_score(y_test, y_pred_proba))
        aps.append(average_precision_score(y_test, y_pred_proba))

        # Scores from input data columns
        pred_aucs.append(safe_roc_auc(y_test, data.iloc[test]['prediction_score'].values))
        seq_aucs.append(safe_roc_auc(y_test, data.iloc[test]['min_sequence_identity'].values))
        gene_expr_aucs.append(safe_roc_auc(y_test, data.iloc[test]['rMinExp_A1A2'].values))
        gene_ess_aucs.append(1 - safe_roc_auc(y_test, data.iloc[test]['min_ranked_A1A2'].values))

        pred_aps.append(safe_average_precision(y_test, data.iloc[test]['prediction_score'].values))
        seq_aps.append(safe_average_precision(y_test, data.iloc[test]['min_sequence_identity'].values))
        gene_expr_aps.append(safe_average_precision(y_test, data.iloc[test]['rMinExp_A1A2'].values))
        # Negate the rank so its direction matches the 1 - AUC inversion used for gene_ess_aucs above
        gene_ess_aps.append(safe_average_precision(y_test, -data.iloc[test]['min_ranked_A1A2'].values))

        y_real.append(y_test)
        y_proba.append(y_pred_proba)

        if verbose:
            elapsed_time = time.time() - start_time
            print(f"  ROC AUC = {aucs[-1]:.4f}, Elapsed time = {elapsed_time:.2f} seconds")

    if len(tprs) > 0:
        mean_tpr = np.mean([np.interp(mean_fpr, fprs[i], tprs[i]) for i in range(len(tprs))], axis=0)
        mean_tpr[-1] = 1.0
        mean_auc = auc(mean_fpr, mean_tpr)
        std_auc = np.std(aucs)
    else:
        mean_tpr, mean_auc, std_auc = np.array([]), np.nan, np.nan

    if len(y_real) > 0 and len(y_proba) > 0:
        y_real = np.concatenate(y_real)
        y_proba = np.concatenate(y_proba)
        precision, recall, _ = precision_recall_curve(y_real, y_proba)
    else:
        precision, recall = np.array([]), np.array([])

    # Print summary
    print(f"\n=== {model_name} Summary ===")
    if not np.isnan(mean_auc):
        print(f"Mean ROC AUC = {mean_auc:.4f} ± {std_auc:.4f}")
    print(f"Successful folds: {len(aucs)}/{total_splits}")
    if skipped_folds:
        print(f"Skipped folds: {skipped_folds}")

    # Stratification summary
    if stratification_results:
        train_diffs = [r['train_diff'] for r in stratification_results]
        test_diffs = [r['test_diff'] for r in stratification_results]
        print(f"Stratification quality:")
        print(f"  Mean train deviation: {np.mean(train_diffs):.4f} ± {np.std(train_diffs):.4f}")
        print(f"  Mean test deviation:  {np.mean(test_diffs):.4f} ± {np.std(test_diffs):.4f}")


    return {
        'tprs': tprs, 'fprs': fprs, 'aucs': aucs,
        'mean_tpr': mean_tpr, 'mean_fpr': mean_fpr, 'mean_auc': mean_auc, 'std_auc': std_auc,
        'seq_auc': np.nanmean(seq_aucs), 'seq_std_auc': np.nanstd(seq_aucs),
        'pred_auc': np.nanmean(pred_aucs), 'pred_std_auc': np.nanstd(pred_aucs),
        'gene_expr_auc': np.nanmean(gene_expr_aucs), 'gene_expr_std_auc': np.nanstd(gene_expr_aucs),
        'gene_ess_auc': np.nanmean(gene_ess_aucs), 'gene_ess_std_auc': np.nanstd(gene_ess_aucs),
        'precision': precision, 'recall': recall, 'aps': aps,
        'mean_aps': np.nanmean(aps), 'std_ap': np.nanstd(aps),
        'pred_ap': np.nanmean(pred_aps), 'pred_std_ap': np.nanstd(pred_aps),
        'seq_ap': np.nanmean(seq_aps), 'seq_std_ap': np.nanstd(seq_aps),
        'gene_expr_ap': np.nanmean(gene_expr_aps), 'gene_expr_std_ap': np.nanstd(gene_expr_aps),
        'gene_ess_ap': np.nanmean(gene_ess_aps), 'gene_ess_std_ap': np.nanstd(gene_ess_aps),
        'stratification_results': stratification_results,
        'skipped_folds': skipped_folds,
        'effective_folds': len(aucs)
    }
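
The early-model function reports `1 - AUC` for `min_ranked_A1A2`, which treats lower values of that column as more predictive of SL. The sketch below, on synthetic data with a hypothetical rank-like score, shows that `1 - AUC(score)` equals `AUC(-score)`, and that average precision has no such symmetry, so a score that ranks in the opposite direction would need to be negated before computing it.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
# Hypothetical rank-like score in which LOWER values indicate the positive class.
score = rng.normal(size=200) - 1.0 * y

auc_raw = roc_auc_score(y, score)
auc_neg = roc_auc_score(y, -score)
ap_raw = average_precision_score(y, score)
ap_neg = average_precision_score(y, -score)

print(f"1 - AUC(score) = {1 - auc_raw:.4f}, AUC(-score) = {auc_neg:.4f}")
print(f"AP(score) = {ap_raw:.4f}, AP(-score) = {ap_neg:.4f}")
```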

In [9]:
def model_late_cross_validation(classifier, data, target, splits, verbose=True, model_name="Model"):
    tprs = []
    fprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    
    y_real = []
    y_proba = []
    aps = []

    # Track stratification validation results
    stratification_results = []
    skipped_folds = []
    
    total_splits = len(splits)
    start_time = time.time()

    for fold_num, (train, test) in enumerate(splits, start=1):
        print(f"\n=== {model_name} - Fold {fold_num}/{total_splits} ===")
        print(f"Train size = {len(train)}, Test size = {len(test)}")
        
        if len(test) == 0:
            print(f"Skipping fold {fold_num}: Empty test set")
            skipped_folds.append(fold_num)
            continue

        # Get train/test data
        y_train = target.iloc[train].values
        y_test = target.iloc[test].values

        # Validate stratification
        stratification_result = validate_stratification(y_train, y_test, fold_num, model_name)
        stratification_results.append(stratification_result)

        if np.unique(y_test).size < 2:
            print(f"Skipping fold {fold_num}: Insufficient positive/negative samples in test set")
            print(f"Test set classes: {np.unique(y_test)}")
            skipped_folds.append(fold_num)
            continue

        # Train model and get predictions
        y_pred_proba = classifier.fit(data.iloc[train], target.iloc[train]).predict_proba(data.iloc[test])[:, 1]

        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        tprs.append(tpr)
        fprs.append(fpr)

        aucs.append(safe_roc_auc(y_test, y_pred_proba))
        aps.append(safe_average_precision(y_test, y_pred_proba))

        y_real.append(y_test)
        y_proba.append(y_pred_proba)

        if verbose:
            elapsed_time = time.time() - start_time
            print(f"  ROC AUC = {aucs[-1]:.4f}, Elapsed time = {elapsed_time:.2f} seconds")
    
    if len(tprs) > 0:
        mean_tpr = np.mean([np.interp(mean_fpr, fprs[i], tprs[i]) for i in range(len(tprs))], axis=0)
        mean_tpr[-1] = 1.0
        mean_auc = auc(mean_fpr, mean_tpr)
        std_auc = np.std(aucs)
    else:
        mean_tpr, mean_auc, std_auc = np.array([]), np.nan, np.nan
    
    if len(y_real) > 0 and len(y_proba) > 0:
        y_real = np.concatenate(y_real)
        y_proba = np.concatenate(y_proba)
        precision, recall, _ = precision_recall_curve(y_real, y_proba)
    else:
        precision, recall = np.array([]), np.array([])

    # Print summary
    print(f"\n=== {model_name} Summary ===")
    if not np.isnan(mean_auc):
        print(f"Mean ROC AUC = {mean_auc:.4f} ± {std_auc:.4f}")
    print(f"Successful folds: {len(aucs)}/{total_splits}")
    if skipped_folds:
        print(f"Skipped folds: {skipped_folds}")

    # Stratification summary
    if stratification_results:
        train_diffs = [r['train_diff'] for r in stratification_results]
        test_diffs = [r['test_diff'] for r in stratification_results]
        print(f"Stratification quality:")
        print(f"Mean train deviation: {np.mean(train_diffs):.4f} ± {np.std(train_diffs):.4f}")
        print(f"Mean test deviation: {np.mean(test_diffs):.4f} ± {np.std(test_diffs):.4f}")

    return {
        'tprs': tprs, 'fprs': fprs, 'aucs': aucs,
        'mean_tpr': mean_tpr, 'mean_fpr': mean_fpr, 'mean_auc': mean_auc, 'std_auc': std_auc,
        'precision': precision, 'recall': recall, 'aps': aps,
        'mean_aps': np.nanmean(aps), 
        'std_ap': np.nanstd(aps),
        'stratification_results': stratification_results,
        'skipped_folds': skipped_folds,
        'effective_folds': len(aucs)
    }
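
Both cross-validation functions average the per-fold ROC curves by interpolating each curve onto a common false-positive-rate grid. The sketch below isolates that step on three synthetic "folds". Pinning the first interpolated point to 0 is a common convention that is not part of the functions above; the data are random.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
mean_fpr = np.linspace(0, 1, 100)  # common FPR grid
interp_tprs = []
for fold in range(3):  # three synthetic folds
    y = rng.integers(0, 2, size=300)
    score = rng.normal(size=300) + y
    fpr, tpr, thresholds = roc_curve(y, score)
    interp_tpr = np.interp(mean_fpr, fpr, tpr)  # resample this fold's curve onto the grid
    interp_tpr[0] = 0.0                         # assumption: force the curve through (0, 0)
    interp_tprs.append(interp_tpr)

mean_tpr = np.mean(interp_tprs, axis=0)
mean_tpr[-1] = 1.0                 # force the curve through (1, 1)
mean_auc = auc(mean_fpr, mean_tpr)
print(f"AUC of the mean ROC curve = {mean_auc:.4f}")
```

Note that `mean_auc` computed this way is the area under the averaged curve, while `std_auc` in the functions above is the standard deviation of the per-fold AUCs, so the two summary numbers come from slightly different quantities.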


In [10]:
# Define feature sets
data_1 = training_df[feature_columns_1]
data_2 = training_df[feature_columns_2]
target = training_df[target_column]

# Define your Random Forest classifier
RF = RandomForestClassifier(n_estimators=600, random_state=42, max_features=0.2, max_depth=20, min_samples_leaf=4)
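
When `max_features` is a float, scikit-learn treats it as a fraction of the input features and considers `max(1, int(max_features * n_features))` features at each split. With `max_features=0.2` that gives 3 features for the 16-feature set and 7 for the 35-feature set. The sketch below checks the 16-feature case on synthetic data, using a smaller forest than the notebook's for speed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))  # 16 synthetic features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=10, max_features=0.2, random_state=0).fit(X, y)

# Each fitted tree exposes the resolved number of features considered per split.
n_considered = rf.estimators_[0].max_features_
print("features considered per split:", n_considered)
```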

In [11]:
# Define number of folds
nfolds = 5

# Generate StratifiedKFold splits
kf = StratifiedKFold(n_splits=nfolds, shuffle=True, random_state=42)
splits_I = list(kf.split(training_df[feature_columns_1], training_df[target_column]))

In [12]:
# Save splits (path built from BASE_DIR, assuming it is the repository root as for the result files above)
with open(get_data_path(['output', 'models', 'model_I'], 'splits.pkl'), 'wb') as f:
    pickle.dump(splits_I, f)

In [13]:
# Run cross-validation on both feature sets
model_I_early = model_early_cross_validation(RF, data_1, target, splits_I, model_name="Model I Early")
model_I_late = model_late_cross_validation(RF, data_2, target, splits_I, model_name="Model I Late")

Fold 1: Train size = 32995, Test size = 8249
  [Model I Early - Fold 1] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0232 (diff: 0.0000)
    Test SL rate:    0.0232 (diff: 0.0001)
    Train/Test diff: 0.0001
Stratification OK
  ROC AUC = 0.9075, Elapsed time = 47.84 seconds
Fold 2: Train size = 32995, Test size = 8249
  [Model I Early - Fold 2] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0232 (diff: 0.0000)
    Test SL rate:    0.0233 (diff: 0.0000)
    Train/Test diff: 0.0001
Stratification OK
  ROC AUC = 0.9183, Elapsed time = 92.61 seconds
Fold 3: Train size = 32995, Test size = 8249
  [Model I Early - Fold 3] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0232 (diff: 0.0000)
    Test SL rate:    0.0233 (diff: 0.0000)
    Train/Test diff: 0.0001
Stratification OK
  ROC AUC = 0.9257, Elapsed time = 137.40 seconds
Fold 4: Train size = 32995, Test size = 8249
  [Model I Early - Fold 4] Stratification Che

In [14]:
# save results of cross validation
with open(file_RF_model_I_early, 'wb') as f:
    pickle.dump(model_I_early, f)

with open(file_RF_model_I_late, 'wb') as f:
    pickle.dump(model_I_late, f)

In [15]:
# Define RepeatedStratifiedKFold with 5 folds and 10 repeats
nfolds = 5
n_repeats = 10

rskf = RepeatedStratifiedKFold(n_splits=nfolds, n_repeats=n_repeats, random_state=42)
repeated_splits_I = list(rskf.split(training_df[feature_columns_1], training_df[target_column]))
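
`RepeatedStratifiedKFold` yields `n_splits * n_repeats` train/test pairs in a single flat sequence, which is why the fold counter in the repeated runs goes beyond 5. A small sketch on synthetic labels (sizes chosen only for the demo):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
splits = list(rskf.split(X, y))
n_splits_total = len(splits)

# Within one repeat, the 5 test sets together cover every sample exactly once.
first_repeat_tests = np.sort(np.concatenate([test for _, test in splits[:5]]))
print("total splits:", n_splits_total)
```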

In [16]:
# Save splits (path built from BASE_DIR, assuming it is the repository root as for the result files above)
with open(get_data_path(['output', 'models', 'model_I'], 'repeated_splits_I.pkl'), 'wb') as f:
    pickle.dump(repeated_splits_I, f)

In [17]:
# Run cross-validation on both feature sets
model_I_early_repeated = model_early_cross_validation(RF, data_1, target, repeated_splits_I, model_name="Model I Early Repeated")
model_I_late_repeated = model_late_cross_validation(RF, data_2, target, repeated_splits_I, model_name="Model I Late Repeated")

Fold 1: Train size = 32995, Test size = 8249
  [Model I Early Repeated - Fold 1] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0232 (diff: 0.0000)
    Test SL rate:    0.0232 (diff: 0.0001)
    Train/Test diff: 0.0001
Stratification OK
  ROC AUC = 0.9075, Elapsed time = 48.85 seconds
Fold 2: Train size = 32995, Test size = 8249
  [Model I Early Repeated - Fold 2] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0232 (diff: 0.0000)
    Test SL rate:    0.0233 (diff: 0.0000)
    Train/Test diff: 0.0001
Stratification OK
  ROC AUC = 0.9183, Elapsed time = 98.27 seconds
Fold 3: Train size = 32995, Test size = 8249
  [Model I Early Repeated - Fold 3] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0232 (diff: 0.0000)
    Test SL rate:    0.0233 (diff: 0.0000)
    Train/Test diff: 0.0001
Stratification OK
  ROC AUC = 0.9257, Elapsed time = 148.79 seconds
Fold 4: Train size = 32995, Test size = 8249
  [Model I Early R

In [18]:
# save results of cross validation
with open(file_RF_model_I_early_repeated, 'wb') as f:
    pickle.dump(model_I_early_repeated, f)

with open(file_RF_model_I_late_repeated, 'wb') as f:
    pickle.dump(model_I_late_repeated, f)

## Model II
#### Same paralog pairs, different cell lines

In [19]:
# Define group feature for cell lines
cell_line_groups = training_df['cell_line']

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits_II = list(sgkf.split(training_df, target, cell_line_groups))

# Model II - Cell line validation
print("=== MODEL II: CELL LINE VALIDATION ===")
for i, (train_index, test_index) in enumerate(splits_II):
    # Split indices are positional, so select with .iloc rather than label-based .loc
    train_cells = training_df['cell_line'].iloc[train_index].unique()
    test_cells = training_df['cell_line'].iloc[test_index].unique()
    validate_group_splits(train_cells, test_cells, i+1, "Model II", "cell lines")

=== MODEL II: CELL LINE VALIDATION ===
  [Model II - Fold 1] Group Validation:
    Train cell lines: 8
    Test cell lines:  2
    Overlapping cell lines: 0
No group leakage
  [Model II - Fold 2] Group Validation:
    Train cell lines: 8
    Test cell lines:  2
    Overlapping cell lines: 0
No group leakage
  [Model II - Fold 3] Group Validation:
    Train cell lines: 8
    Test cell lines:  2
    Overlapping cell lines: 0
No group leakage
  [Model II - Fold 4] Group Validation:
    Train cell lines: 8
    Test cell lines:  2
    Overlapping cell lines: 0
No group leakage
  [Model II - Fold 5] Group Validation:
    Train cell lines: 8
    Test cell lines:  2
    Overlapping cell lines: 0
No group leakage


In [20]:
# Save splits (path built from BASE_DIR, assuming it is the repository root as for the result files above)
with open(get_data_path(['output', 'models', 'model_II'], 'splits_II.pkl'), 'wb') as f:
    pickle.dump(splits_II, f)

In [21]:
model_II_early = model_early_cross_validation(RF, data_1, target, splits_II, model_name="Model II Early")
model_II_late = model_late_cross_validation(RF, data_2, target, splits_II, model_name="Model II Late")

Fold 1: Train size = 32973, Test size = 8271
  [Model II Early - Fold 1] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0266 (diff: 0.0034)
    Test SL rate:    0.0097 (diff: 0.0136)
    Train/Test diff: 0.0170
Stratification OK
  ROC AUC = 0.9577, Elapsed time = 49.50 seconds
Fold 2: Train size = 32978, Test size = 8266
  [Model II Early - Fold 2] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0213 (diff: 0.0020)
    Test SL rate:    0.0311 (diff: 0.0079)
    Train/Test diff: 0.0098
Stratification OK
  ROC AUC = 0.9269, Elapsed time = 96.91 seconds
Fold 3: Train size = 33018, Test size = 8226
  [Model II Early - Fold 3] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0235 (diff: 0.0003)
    Test SL rate:    0.0220 (diff: 0.0012)
    Train/Test diff: 0.0015
Stratification OK
  ROC AUC = 0.8787, Elapsed time = 146.10 seconds
Fold 4: Train size = 32992, Test size = 8252
  [Model II Early - Fold 4] Stratification

In [22]:
# save results of cross validation
with open(file_RF_model_II_early, 'wb') as f:
    pickle.dump(model_II_early, f)

with open(file_RF_model_II_late, 'wb') as f:
    pickle.dump(model_II_late, f)

In [23]:
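def repeated_stratified_group_kfold(data, target, groups, n_folds, n_repeats=10, random_state=42):
    """Return n_repeats * n_folds (train_index, test_index) pairs from StratifiedGroupKFold,
    seeding each repeat with random_state + repeat so that the repeats differ."""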
    all_splits = []
    
    for repeat in range(n_repeats):
        # Set up StratifiedGroupKFold with an incremented random state for each repeat
        sgkf = StratifiedGroupKFold(n_splits=n_folds, shuffle=True, random_state=random_state + repeat)
        splits = list(sgkf.split(data, target, groups))
        
        # Append each split (train_index, test_index) to all_splits
        for fold_num, (train_index, test_index) in enumerate(splits, start=1):
            all_splits.append((train_index, test_index))
            
            # Print fold details for tracking
            print(f"Repeat {repeat + 1}, Fold {fold_num}:")
            print(f"  TRAIN groups: {groups.iloc[train_index].unique()}")
            print(f"  TEST groups: {groups.iloc[test_index].unique()}")

    return all_splits

# Generate the repeated stratified group k-fold splits
repeated_splits_II = repeated_stratified_group_kfold(
    data=training_df,
    target=target,
    groups=cell_line_groups,
    n_folds=5,
    n_repeats=10,
    random_state=42
)

Repeat 1, Fold 1:
  TRAIN groups: ['PK1_PANCREAS' 'A549_LUNG' 'GI1_CENTRAL_NERVOUS_SYSTEM' 'HS936T_SKIN'
 'MELJUSO_SKIN' 'IPC298_SKIN' 'HSC5_SKIN' 'MEL202_UVEA']
  TEST groups: ['PATU8988S_PANCREAS' 'HS944T_SKIN']
Repeat 1, Fold 2:
  TRAIN groups: ['PATU8988S_PANCREAS' 'PK1_PANCREAS' 'HS944T_SKIN' 'A549_LUNG'
 'GI1_CENTRAL_NERVOUS_SYSTEM' 'HS936T_SKIN' 'MELJUSO_SKIN' 'MEL202_UVEA']
  TEST groups: ['IPC298_SKIN' 'HSC5_SKIN']
Repeat 1, Fold 3:
  TRAIN groups: ['PATU8988S_PANCREAS' 'PK1_PANCREAS' 'HS944T_SKIN'
 'GI1_CENTRAL_NERVOUS_SYSTEM' 'HS936T_SKIN' 'MELJUSO_SKIN' 'IPC298_SKIN'
 'HSC5_SKIN']
  TEST groups: ['A549_LUNG' 'MEL202_UVEA']
Repeat 1, Fold 4:
  TRAIN groups: ['PATU8988S_PANCREAS' 'HS944T_SKIN' 'A549_LUNG'
 'GI1_CENTRAL_NERVOUS_SYSTEM' 'MELJUSO_SKIN' 'IPC298_SKIN' 'HSC5_SKIN'
 'MEL202_UVEA']
  TEST groups: ['PK1_PANCREAS' 'HS936T_SKIN']
Repeat 1, Fold 5:
  TRAIN groups: ['PATU8988S_PANCREAS' 'PK1_PANCREAS' 'HS944T_SKIN' 'A549_LUNG'
 'HS936T_SKIN' 'IPC298_SKIN' 'HSC5_SKIN' 'MEL

In [24]:
# Save splits (path built from BASE_DIR, assuming it is the repository root as for the result files above)
with open(get_data_path(['output', 'models', 'model_II'], 'repeated_splits_II.pkl'), 'wb') as f:
    pickle.dump(repeated_splits_II, f)

In [25]:
# Run cross-validation on both feature sets
model_II_early_repeated = model_early_cross_validation(RF, data_1, target, repeated_splits_II, model_name="Model II Early Repeated")
model_II_late_repeated = model_late_cross_validation(RF, data_2, target, repeated_splits_II, model_name="Model II Late Repeated")

Fold 1: Train size = 32973, Test size = 8271
  [Model II Early Repeated - Fold 1] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0266 (diff: 0.0034)
    Test SL rate:    0.0097 (diff: 0.0136)
    Train/Test diff: 0.0170
Stratification OK
  ROC AUC = 0.9577, Elapsed time = 49.36 seconds
Fold 2: Train size = 32978, Test size = 8266
  [Model II Early Repeated - Fold 2] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0213 (diff: 0.0020)
    Test SL rate:    0.0311 (diff: 0.0079)
    Train/Test diff: 0.0098
Stratification OK
  ROC AUC = 0.9269, Elapsed time = 97.50 seconds
Fold 3: Train size = 33018, Test size = 8226
  [Model II Early Repeated - Fold 3] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0235 (diff: 0.0003)
    Test SL rate:    0.0220 (diff: 0.0012)
    Train/Test diff: 0.0015
Stratification OK
  ROC AUC = 0.8787, Elapsed time = 146.91 seconds
Fold 4: Train size = 32992, Test size = 8252
  [Model II Ear

In [26]:
# save results of cross validation
with open(file_RF_model_II_early_repeated, 'wb') as f:
    pickle.dump(model_II_early_repeated, f)

with open(file_RF_model_II_late_repeated, 'wb') as f:
    pickle.dump(model_II_late_repeated, f)

## Model III
#### Different paralog pairs, same cell lines

In [27]:
# Define group feature for gene pairs
gene_groups = training_df['genepair']

sgkf2 = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits_III = list(sgkf2.split(training_df, target, gene_groups))

# Model III - Gene pair validation  
print("\n=== MODEL III: GENE PAIR VALIDATION ===")
for i, (train_index, test_index) in enumerate(splits_III):
    # Split indices are positional, so select with .iloc rather than label-based .loc
    train_pairs = training_df['genepair'].iloc[train_index].unique()
    test_pairs = training_df['genepair'].iloc[test_index].unique()
    validate_group_splits(train_pairs, test_pairs, i+1, "Model III", "gene pairs")



=== MODEL III: GENE PAIR VALIDATION ===
  [Model III - Fold 1] Group Validation:
    Train gene pairs: 3337
    Test gene pairs:  833
    Overlapping gene pairs: 0
No group leakage
  [Model III - Fold 2] Group Validation:
    Train gene pairs: 3334
    Test gene pairs:  836
    Overlapping gene pairs: 0
No group leakage
  [Model III - Fold 3] Group Validation:
    Train gene pairs: 3336
    Test gene pairs:  834
    Overlapping gene pairs: 0
No group leakage
  [Model III - Fold 4] Group Validation:
    Train gene pairs: 3336
    Test gene pairs:  834
    Overlapping gene pairs: 0
No group leakage
  [Model III - Fold 5] Group Validation:
    Train gene pairs: 3337
    Test gene pairs:  833
    Overlapping gene pairs: 0
No group leakage


In [28]:
# Save splits (path built from BASE_DIR, assuming it is the repository root as for the result files above)
with open(get_data_path(['output', 'models', 'model_III'], 'splits_III.pkl'), 'wb') as f:
    pickle.dump(splits_III, f)

In [29]:
model_III_early = model_early_cross_validation(RF, data_1, target, splits_III, model_name="Model III Early")
model_III_late = model_late_cross_validation(RF, data_2, target, splits_III, model_name="Model III Late")

Fold 1: Train size = 33027, Test size = 8217
  [Model III Early - Fold 1] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0223 (diff: 0.0009)
    Test SL rate:    0.0269 (diff: 0.0037)
    Train/Test diff: 0.0046
Stratification OK
  ROC AUC = 0.8773, Elapsed time = 48.40 seconds
Fold 2: Train size = 32987, Test size = 8257
  [Model III Early - Fold 2] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0234 (diff: 0.0002)
    Test SL rate:    0.0224 (diff: 0.0008)
    Train/Test diff: 0.0010
Stratification OK
  ROC AUC = 0.9094, Elapsed time = 97.79 seconds
Fold 3: Train size = 32975, Test size = 8269
  [Model III Early - Fold 3] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0234 (diff: 0.0002)
    Test SL rate:    0.0224 (diff: 0.0009)
    Train/Test diff: 0.0011
Stratification OK
  ROC AUC = 0.9182, Elapsed time = 160.20 seconds
Fold 4: Train size = 32987, Test size = 8257
  [Model III Early - Fold 4] Stratifica

In [30]:
# save results of cross validation
with open(file_RF_model_III_early, 'wb') as f:
    pickle.dump(model_III_early, f)

with open(file_RF_model_III_late, 'wb') as f:
    pickle.dump(model_III_late, f)

In [31]:
# Generate the repeated stratified group k-fold splits

repeated_splits_III = repeated_stratified_group_kfold(
    data=training_df,
    target=target,
    groups=gene_groups,
    n_folds=5,
    n_repeats=10,
    random_state=42
)

Repeat 1, Fold 1:
  TRAIN groups: ['A3GALT2_GLT6D1' 'AADACL2_AADACL3' 'AADACL2_AADACL4' ... 'ZNRF1_ZNRF2'
 'ZXDA_ZXDC' 'ZXDB_ZXDC']
  TEST groups: ['A3GALT2_ABO' 'A3GALT2_GBGT1' 'AADAC_AADACL2' 'ABHD4_ABHD5' 'ABL1_MATK'
 'ABL2_CSK' 'ABTB2_ABTB3' 'ACAD10_ACAD11' 'ACAD9_ACADS' 'ACADL_ACADS'
 'ACADSB_IVD' 'ACADVL_IVD' 'ACHE_BCHE' 'ACHE_CES4A' 'ACP3_ACP4'
 'ACR_TMPRSS12' 'ACSM1_ACSM2B' 'ACSM2A_ACSM2B' 'ACSM2A_ACSM6'
 'ACSM3_ACSM4' 'ACSM5_ACSM6' 'ACVR1B_TGFBR1' 'ACVR1C_BMPR1B'
 'ACVR1_BMPR1A' 'ACVR2B_TGFBR2' 'ADAM11_ADAM19' 'ADAM11_ADAM22'
 'ADAM15_ADAM28' 'ADAM20_ADAM29' 'ADAM28_ADAM33' 'ADAM7_ADAM28'
 'ADAM9_ADAM18' 'ADAM9_ADAM20' 'ADAMTS10_ADAMTS17' 'ADAMTS12_ADAMTS18'
 'ADAMTS1_ADAMTS15' 'ADAMTS2_ADAMTS3' 'ADAMTS4_ADAMTS5' 'ADAMTS6_ADAMTS16'
 'ADAMTS7_ADAMTS10' 'ADAMTS8_ADAMTS20' 'ADAMTSL1_ADAMTSL3' 'ADCK1_ADCK5'
 'ADSS1_ADSS2' 'AEBP1_CPE' 'AFF1_AFF4' 'AFF2_AFF4' 'AGBL1_AGBL4'
 'AGBL3_AGBL4' 'AHCYL1_AHCYL2' 'AICDA_APOBEC3F' 'AKT1_AKT2' 'AKT1_RPS6KA4'
 'AKT1_SGK1' 'AKT2_RPS6KA2' 'AKT2_SG

In [32]:
# Save splits
with open('/Users/narod/Library/CloudStorage/GoogleDrive-narod.kebabci@ucdconnect.ie/My Drive/GitRepos/context_specific_SL_prediction/output/models/model_III/repeated_splits_III.pkl', 'wb') as f:
    pickle.dump(repeated_splits_III, f)

In [33]:
# Run cross-validation on both feature sets
model_III_early_repeated = model_early_cross_validation(RF, data_1, target, repeated_splits_III, model_name="Model III Early Repeated")
model_III_late_repeated = model_late_cross_validation(RF, data_2, target, repeated_splits_III, model_name="Model III Late Repeated")

Fold 1: Train size = 33027, Test size = 8217
  [Model III Early Repeated - Fold 1] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0223 (diff: 0.0009)
    Test SL rate:    0.0269 (diff: 0.0037)
    Train/Test diff: 0.0046
Stratification OK
  ROC AUC = 0.8773, Elapsed time = 54.18 seconds
Fold 2: Train size = 32987, Test size = 8257
  [Model III Early Repeated - Fold 2] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0234 (diff: 0.0002)
    Test SL rate:    0.0224 (diff: 0.0008)
    Train/Test diff: 0.0010
Stratification OK
  ROC AUC = 0.9094, Elapsed time = 107.96 seconds
Fold 3: Train size = 32975, Test size = 8269
  [Model III Early Repeated - Fold 3] Stratification Check:
    Overall SL rate: 0.0232
    Train SL rate:   0.0234 (diff: 0.0002)
    Test SL rate:    0.0224 (diff: 0.0009)
    Train/Test diff: 0.0011
Stratification OK
  ROC AUC = 0.9182, Elapsed time = 156.10 seconds
Fold 4: Train size = 32987, Test size = 8257
  [Model II

In [34]:
# save results of cross validation
with open(file_RF_model_III_early_repeated, 'wb') as f:
    pickle.dump(model_III_early_repeated, f)

with open(file_RF_model_III_late_repeated, 'wb') as f:
    pickle.dump(model_III_late_repeated, f)

## Model IV
#### Different paralog pairs, different cell lines

In [35]:
# --- Custom split functions ---
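def create_disjoint_splits(df, n_splits=5, num_test_cell_lines=2, test_gene_fraction=0.25, random_state=None):
    """Build folds whose train and test sets share neither cell lines nor gene pairs.

    Cell lines and gene pairs are each shuffled and cut into `n_splits` chunks. Fold k
    tests on rows that fall in chunk k of both and trains on rows outside chunk k of both.
    Rows that pair a training cell line with a test gene pair (or vice versa) are dropped.

    Note: `num_test_cell_lines` and `test_gene_fraction` are accepted but not used here;
    the test-set size is determined by `n_splits` alone.

    Returns a list of (train_index, test_index) pairs holding DataFrame index labels.
    """
    splits = []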

    # Get the unique cell lines and gene pairs
    unique_cell_lines = df['cell_line'].unique()
    unique_gene_pairs = df['genepair'].unique()

    # Seed NumPy's global RNG if a seed is provided
    # (this also affects any later code that draws from np.random)
    if random_state is not None:
        np.random.seed(random_state)

    # Shuffle and split cell lines and gene pairs into mutually exclusive groups
    np.random.shuffle(unique_cell_lines)
    cell_line_splits = np.array_split(unique_cell_lines, n_splits)

    np.random.shuffle(unique_gene_pairs)
    gene_pair_splits = np.array_split(unique_gene_pairs, n_splits)

    # Create the splits
    for fold in range(n_splits):
        test_cell_lines = cell_line_splits[fold]
        test_gene_pairs = gene_pair_splits[fold]

        train_cell_lines = np.concatenate([cell_line_splits[i] for i in range(n_splits) if i != fold])
        train_gene_pairs = np.concatenate([gene_pair_splits[i] for i in range(n_splits) if i != fold])

        # Define train/test indices
        test_index = df[(df['cell_line'].isin(test_cell_lines)) & (df['genepair'].isin(test_gene_pairs))].index
        train_index = df[(df['cell_line'].isin(train_cell_lines)) & (df['genepair'].isin(train_gene_pairs))].index

        splits.append((train_index, test_index))

        print(f'[Fold {fold+1}] '
              f'# pairs (train): {df.loc[train_index, "genepair"].nunique()} | '
              f'# pairs (test): {df.loc[test_index, "genepair"].nunique()} | '
              f'# overlapping: {np.isin(df.loc[test_index, "genepair"].unique(), df.loc[train_index, "genepair"].unique()).sum()} | '
              f'# cells (train): {df.loc[train_index, "cell_line"].nunique()} | '
              f'# cells (test): {df.loc[test_index, "cell_line"].nunique()}')

        # Sanity check: train and test must share no cell lines and no gene pairs
        train_cells = set(df.loc[train_index, 'cell_line'].unique())
        test_cells = set(df.loc[test_index, 'cell_line'].unique())
        train_pairs = set(df.loc[train_index, 'genepair'].unique())
        test_pairs = set(df.loc[test_index, 'genepair'].unique())

        assert len(train_cells & test_cells) == 0, "Cell line leakage detected!"
        assert len(train_pairs & test_pairs) == 0, "Gene pair leakage detected!"

    return splits

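def repeated_custom_cv(df, n_splits=5, n_repeats=3, num_test_cell_lines=2, test_gene_fraction=0.25, random_state=None):
    """Repeat `create_disjoint_splits` `n_repeats` times, seeding repeat r with `random_state + r`.

    Returns one flat list of n_splits * n_repeats (train_index, test_index) pairs.
    """
    all_splits = []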

    for repeat in range(n_repeats):
        current_seed = random_state + repeat if random_state is not None else None
        splits = create_disjoint_splits(
            df,
            n_splits=n_splits,
            num_test_cell_lines=num_test_cell_lines,
            test_gene_fraction=test_gene_fraction,
            random_state=current_seed
        )
        all_splits.extend(splits)

    return all_splits


In [36]:
splits_IV = repeated_custom_cv(df=training_df, n_splits=5, n_repeats=10, random_state=42)

# Validate the Model IV splits: train and test must share no cell lines or gene pairs
print("\n=== MODEL IV: COMBINED VALIDATION ===")
for i, (train_index, test_index) in enumerate(splits_IV[:5]):  # first repeat only, for brevity
    # Model IV splits hold index labels (taken from df.index), so .loc is the right accessor
    train_cells = training_df.loc[train_index, 'cell_line'].unique()
    test_cells = training_df.loc[test_index, 'cell_line'].unique()
    train_pairs = training_df.loc[train_index, 'genepair'].unique()
    test_pairs = training_df.loc[test_index, 'genepair'].unique()
    
    print(f"\n--- Fold {i+1} ---")
    validate_group_splits(train_cells, test_cells, i+1, "Model IV", "cell lines")
    validate_group_splits(train_pairs, test_pairs, i+1, "Model IV", "gene pairs")

[Fold 1] # pairs (train): 3336 | # pairs (test): 833 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 2] # pairs (train): 3336 | # pairs (test): 833 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 3] # pairs (train): 3336 | # pairs (test): 833 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 4] # pairs (train): 3336 | # pairs (test): 833 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 5] # pairs (train): 3336 | # pairs (test): 830 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 1] # pairs (train): 3336 | # pairs (test): 833 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 2] # pairs (train): 3336 | # pairs (test): 832 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 3] # pairs (train): 3336 | # pairs (test): 833 | # overlapping: 0 | # cells (train): 8 | # cells (test): 2
[Fold 4] # pairs (train): 3336 | # pairs (test): 834 | # overlapping: 0 | # cells (train): 8 | #

In [37]:
# Save splits
with open('/Users/narod/Library/CloudStorage/GoogleDrive-narod.kebabci@ucdconnect.ie/My Drive/GitRepos/context_specific_SL_prediction/output/models/model_IV/splits_IV.pkl', 'wb') as f:
    pickle.dump(splits_IV, f)

In [38]:
model_IV_early = model_early_cross_validation(RF, data_1, target, splits_IV, model_name="Model IV Early")
model_IV_late = model_late_cross_validation(RF, data_2, target, splits_IV, model_name="Model IV Late")

Fold 1: Train size = 26389, Test size = 1651
  [Model IV Early - Fold 1] Stratification Check:
    Overall SL rate: 0.0235
    Train SL rate:   0.0233 (diff: 0.0002)
    Test SL rate:    0.0267 (diff: 0.0032)
    Train/Test diff: 0.0034
Stratification OK
  ROC AUC = 0.8955, Elapsed time = 35.01 seconds
Fold 2: Train size = 26403, Test size = 1651
  [Model IV Early - Fold 2] Stratification Check:
    Overall SL rate: 0.0248
    Train SL rate:   0.0252 (diff: 0.0005)
    Test SL rate:    0.0176 (diff: 0.0072)
    Train/Test diff: 0.0077
Stratification OK
  ROC AUC = 0.8375, Elapsed time = 69.02 seconds
Fold 3: Train size = 26371, Test size = 1659
  [Model IV Early - Fold 3] Stratification Check:
    Overall SL rate: 0.0227
    Train SL rate:   0.0222 (diff: 0.0004)
    Test SL rate:    0.0295 (diff: 0.0069)
    Train/Test diff: 0.0073
Stratification OK
  ROC AUC = 0.8925, Elapsed time = 102.50 seconds
Fold 4: Train size = 26358, Test size = 1658
  [Model IV Early - Fold 4] Stratification
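
The Model IV folds are smaller than the Model III folds (about 26,400 training and 1,650 test rows, against roughly 33,000 and 8,200) because `create_disjoint_splits` keeps only rows whose cell line and gene pair fall on the same side of the split. With 5 chunks on each axis, a fold trains on roughly 4/5 × 4/5 = 64% of the rows and tests on roughly 1/5 × 1/5 = 4%. The remaining rows pair a training cell line with a test gene pair, or the reverse, and are dropped. The sketch below reproduces that arithmetic on a synthetic, fully crossed grid; the cell-line and gene-pair names are placeholders, and the real data need not be fully crossed.

```python
import numpy as np
import pandas as pd

# Synthetic full grid: 10 placeholder cell lines x 50 placeholder gene pairs
cells = [f"cell_{i}" for i in range(10)]
pairs = [f"pair_{j}" for j in range(50)]
toy = pd.DataFrame(
    [(c, p) for c in cells for p in pairs], columns=["cell_line", "genepair"]
)

n_splits = 5
rng = np.random.default_rng(0)
cell_chunks = np.array_split(rng.permutation(cells), n_splits)
pair_chunks = np.array_split(rng.permutation(pairs), n_splits)

fractions = []
for fold in range(n_splits):
    in_test_cells = toy["cell_line"].isin(cell_chunks[fold])
    in_test_pairs = toy["genepair"].isin(pair_chunks[fold])
    test_mask = in_test_cells & in_test_pairs      # test cell line AND test gene pair
    train_mask = ~in_test_cells & ~in_test_pairs   # train cell line AND train gene pair
    dropped = 1.0 - train_mask.mean() - test_mask.mean()
    fractions.append((train_mask.mean(), test_mask.mean(), dropped))
    print(f"Fold {fold + 1}: train {train_mask.mean():.2f}, "
          f"test {test_mask.mean():.2f}, dropped {dropped:.2f}")
```

On a fully crossed grid every fold discards about a third of the rows, which is the cost of requiring both the cell lines and the gene pairs to be unseen at test time.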

In [39]:
# save results of cross validation
with open(file_RF_model_IV_early, 'wb') as f:
    pickle.dump(model_IV_early, f)

with open(file_RF_model_IV_late, 'wb') as f:
    pickle.dump(model_IV_late, f)

In [40]:
def generate_validation_report(model_results, model_name):
    """Generate a comprehensive validation report for a model."""
    print(f"\n{'='*50}")
    print(f"VALIDATION REPORT: {model_name}")
    print(f"{'='*50}")
    
    # Performance summary
    print("Performance:")
    print(f"  Mean ROC AUC: {model_results['mean_auc']:.4f} ± {model_results['std_auc']:.4f}")
    print(f"  Mean PR AUC:  {model_results['mean_aps']:.4f} ± {model_results['std_ap']:.4f}")
    print(f"  Effective folds: {model_results['effective_folds']}")
    
    if 'skipped_folds' in model_results and model_results['skipped_folds']:
        print(f"  Skipped folds: {model_results['skipped_folds']}")
    
    # Stratification summary
    if 'stratification_results' in model_results:
        strat_results = model_results['stratification_results']
        if strat_results:
            train_diffs = [r['train_diff'] for r in strat_results]
            test_diffs = [r['test_diff'] for r in strat_results]
            warnings_count = sum(len(r['warnings']) for r in strat_results)
            
            print("\nStratification Quality:")
            print(f"  Mean train deviation: {np.mean(train_diffs):.4f} ± {np.std(train_diffs):.4f}")
            print(f"  Mean test deviation:  {np.mean(test_diffs):.4f} ± {np.std(test_diffs):.4f}")
            print(f"  Total warnings: {warnings_count}")
            
            if warnings_count > 0:
                print("Some folds had stratification issues")
            else:
                print("All folds passed stratification checks")

# Generate reports for all models
all_models = [
    (model_I_early, "Model I Early"),
    (model_I_late, "Model I Late"),
    (model_II_early, "Model II Early"),
    (model_II_late, "Model II Late"),
    (model_III_early, "Model III Early"),
    (model_III_late, "Model III Late"),
    (model_IV_early, "Model IV Early"),
    (model_IV_late, "Model IV Late")
]

for model_result, model_name in all_models:
    generate_validation_report(model_result, model_name)


VALIDATION REPORT: Model I Early
Performance:
  Mean ROC AUC: 0.9177 ± 0.0077
  Mean PR AUC:  0.3481 ± 0.0233
  Effective folds: 5

Stratification Quality:
  Mean train deviation: 0.0000 ± 0.0000
  Mean test deviation:  0.0001 ± 0.0000
All folds passed stratification checks

VALIDATION REPORT: Model I Late
Performance:
  Mean ROC AUC: 0.9310 ± 0.0074
  Mean PR AUC:  0.4062 ± 0.0192
  Effective folds: 5

Stratification Quality:
  Mean train deviation: 0.0000 ± 0.0000
  Mean test deviation:  0.0001 ± 0.0000
All folds passed stratification checks

VALIDATION REPORT: Model II Early
Performance:
  Mean ROC AUC: 0.8999 ± 0.0367
  Mean PR AUC:  0.2924 ± 0.0756
  Effective folds: 5

Stratification Quality:
  Mean train deviation: 0.0015 ± 0.0011
  Mean test deviation:  0.0059 ± 0.0045
All folds passed stratification checks

VALIDATION REPORT: Model II Late
Performance:
  Mean ROC AUC: 0.9182 ± 0.0339
  Mean PR AUC:  0.3502 ± 0.0788
  Effective folds: 5

Stratification Quality:
  Mean train de