[link to original notebook](https://www.kaggle.com/code/olaflundstrom/wids-datathon-2025-adhd-analysis-notebook)

# WiDS Datathon 2025: Predicting ADHD and Sex from fMRI Data

This notebook provides a step-by-step guide to solving the WiDS Datathon 2025 challenge. The goal is to predict both an individual's sex and their ADHD diagnosis using functional brain imaging data, socio-demographic information, and other metadata.

---

## Table of Contents
1. [Introduction](#introduction)
2. [Data Loading](#data-loading)
3. [Data Preprocessing](#data-preprocessing)
4. [Model Training](#model-training)
5. [Model Evaluation](#model-evaluation)
6. [Submission Generation](#submission-generation)
7. [Conclusion](#conclusion)

---

## 1. Introduction <a name="introduction"></a>

The WiDS Datathon 2025 focuses on uncovering patterns in ADHD diagnosis and sex differences using fMRI data. The dataset includes:
- Functional MRI connectome matrices
- Socio-demographic information
- Emotional and parenting questionnaire data

Our task is to predict two binary targets, which this notebook handles with one model per target:
1. ADHD diagnosis (`ADHD_Outcome`: 1 = yes, 0 = no)
2. Sex (`Sex_F`: 1 = female, 0 = male)

---

## 2. Data Loading <a name="data-loading"></a>

We start by loading the training and test datasets.

In [1]:
import numpy as np
import pandas as pd


def get_feats(mode='TRAIN'):
    """
    Load data for the specified mode (TRAIN or TEST).
    """
    # Load quantitative metadata
    feats = pd.read_excel(f"../../kaggle/input/widsdatathon2025/{mode}/{mode}_QUANTITATIVE_METADATA.xlsx")
    
    # Load categorical metadata
    if mode == 'TRAIN':
        cate = pd.read_excel(f"../../kaggle/input/widsdatathon2025/{mode}/{mode}_CATEGORICAL_METADATA.xlsx")
    else:
        cate = pd.read_excel(f"../../kaggle/input/widsdatathon2025/{mode}/{mode}_CATEGORICAL.xlsx")
    
    # Merge quantitative and categorical data
    feats = pd.merge(feats, cate, on='participant_id', how='left')
    
    # Load functional connectome matrices
    func = pd.read_csv(f"../../kaggle/input/widsdatathon2025/{mode}/{mode}_FUNCTIONAL_CONNECTOME_MATRICES.csv")
    feats = pd.merge(feats, func, on='participant_id', how='left')
    
    # Load training solutions (only for TRAIN mode)
    if mode == 'TRAIN':
        solution = pd.read_excel("../../kaggle/input/widsdatathon2025/TRAIN/TRAINING_SOLUTIONS.xlsx")
        feats = pd.merge(feats, solution, on='participant_id', how='left')
    
    return feats

# Load training and test data
print("Loading data...")
train = get_feats(mode='TRAIN')
test = get_feats(mode='TEST')

# Display the first few rows of the training data
train.head()

Loading data...


|  | participant_id | EHQ_EHQ_Total | ColorVision_CV_Score | APQ_P_APQ_P_CP | APQ_P_APQ_P_ID | APQ_P_APQ_P_INV | APQ_P_APQ_P_OPD | APQ_P_APQ_P_PM | APQ_P_APQ_P_PP | SDQ_SDQ_Conduct_Problems | ... | 195throw_198thcolumn | 195throw_199thcolumn | 196throw_197thcolumn | 196throw_198thcolumn | 196throw_199thcolumn | 197throw_198thcolumn | 197throw_199thcolumn | 198throw_199thcolumn | ADHD_Outcome | Sex_F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UmrK0vMLopoR | 40.0 | 13 | 3 | 10 | 47 | 13 | 11 | 28 | 0 | ... | -0.058396 | -0.041544 | 0.142806 | -0.006377 | 0.108005 | 0.148327 | 0.09323 | -0.004984 | 1 | 1 |
| 1 | CPaeQkhcjg7d | -94.47 | 14 | 3 | 13 | 34 | 18 | 23 | 30 | 0 | ... | -0.025624 | -0.031863 | 0.162011 | 0.067439 | 0.017155 | 0.088893 | 0.064094 | 0.194381 | 1 | 0 |
| 2 | Nb4EetVPm3gs | -46.67 | 14 | 4 | 10 | 35 | 16 | 10 | 29 | 1 | ... | 0.010771 | -0.044341 | 0.128386 | 0.047282 | 0.087678 | 0.146221 | -0.009425 | 0.03515 | 1 | 0 |
| 3 | p4vPhVu91o4b | -26.68 | 10 | 5 | 12 | 39 | 19 | 16 | 28 | 6 | ... | -0.007152 | 0.032584 | 0.121726 | 0.045089 | 0.154464 | 0.106817 | 0.065336 | 0.234708 | 1 | 1 |
| 4 | M09PXs7arQ5E | 0.0 | 14 | 5 | 15 | 40 | 20 | 24 | 28 | 1 | ... | -0.010196 | 0.035638 | 0.074978 | 0.030579 | 0.02564 | 0.118199 | 0.112522 | 0.143666 | 1 | 1 |
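
The loading function chains several left merges on `participant_id`. A duplicated ID in any table would silently multiply rows, so one possible safeguard is to let pandas validate each merge. The sketch below uses small made-up tables; `merge_on_id` is a hypothetical helper and is not part of the original notebook.

```python
from functools import reduce

import pandas as pd


def merge_on_id(frames, key="participant_id"):
    """Left-merge a list of DataFrames on `key`, failing fast on duplicate IDs."""
    return reduce(
        lambda left, right: pd.merge(
            left, right, on=key, how="left", validate="one_to_one"
        ),
        frames,
    )


# Toy stand-ins for the metadata, connectome, and solution tables (made-up values)
quant = pd.DataFrame({"participant_id": ["a", "b", "c"], "EHQ_EHQ_Total": [40.0, -94.47, 0.0]})
func = pd.DataFrame({"participant_id": ["a", "b", "c"], "0throw_1thcolumn": [0.1, -0.2, 0.3]})
labels = pd.DataFrame({"participant_id": ["a", "b"], "ADHD_Outcome": [1, 0]})

merged = merge_on_id([quant, func, labels])
print(merged.shape)                          # (3, 4)
print(merged["ADHD_Outcome"].isna().sum())   # 1: participant "c" has no label row
```

With `validate="one_to_one"`, pandas raises `pandas.errors.MergeError` if either side contains a repeated key, which surfaces data problems before any model is trained.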


---

## 3. Data Preprocessing <a name="data-preprocessing"></a>

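We preprocess the data by:
1. Imputing missing values (median for numerical features, most frequent value for categorical features)
2. One-hot encoding categorical features
3. Standardizing numerical features to zero mean and unit variance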

In [2]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate features and target variables
X      = train.drop(['participant_id', 'ADHD_Outcome', 'Sex_F'], axis=1, errors='ignore')
y_adhd = train['ADHD_Outcome']
y_sex  = train['Sex_F']

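# Identify categorical and numerical features by dtype: only object-dtype columns are
# treated as categorical; everything else, including any integer-coded categories,
# is treated as numerical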
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
numerical_features = X.select_dtypes(exclude=['object']).columns.tolist()

# Create preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Apply preprocessing to training data
print("Preprocessing data...")
X_preprocessed = preprocessor.fit_transform(X)

Preprocessing data...
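
To make the behaviour of this preprocessor concrete, here is a minimal sketch on a made-up two-column table. The column names `score` and `site` are illustrative and do not come from the dataset. It shows that the statistics are learned from the training rows only, and that a category unseen during fitting is encoded as all zeros because of `handle_unknown='ignore'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numerical and one categorical column, each with a missing value
toy_train = pd.DataFrame({
    "score": [1.0, np.nan, 3.0, 5.0],
    "site": pd.Series(["A", "B", np.nan, "A"], dtype=object),
})
# The test row has a missing score and a category ("C") never seen in training
toy_test = pd.DataFrame({
    "score": [np.nan],
    "site": pd.Series(["C"], dtype=object),
})

pre = ColumnTransformer(transformers=[
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["score"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["site"]),
])


def to_dense(a):
    """ColumnTransformer may return a sparse matrix; convert for easy inspection."""
    return a.toarray() if hasattr(a, "toarray") else np.asarray(a)


train_out = to_dense(pre.fit_transform(toy_train))
test_out = to_dense(pre.transform(toy_test))

print(train_out.shape)  # (4, 3): one scaled column plus one-hot columns for "A" and "B"
print(test_out)         # all zeros: median-imputed score scales to 0, "C" is ignored
```

The missing test score is filled with the training median (3.0), which is also the mean of the imputed training column, so it standardizes to 0.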


---

## 4. Model Training <a name="model-training"></a>

We use LightGBM, a gradient boosting framework, to train separate models for ADHD and sex prediction.

In [3]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight

# Split into training (80%) and validation (20%) sets.
# Stratification uses the ADHD label only, so the sex ratio may differ slightly
# between the two sets.
(
    X_train, 
    X_val, 
    y_train_adhd, 
    y_val_adhd, 
    y_train_sex, 
    y_val_sex
) = train_test_split(
    X_preprocessed, 
    y_adhd, 
    y_sex, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_adhd
)

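# Calculate 'balanced' class weights, n_samples / (n_classes * class_count), for ADHD and sex.
# The ratio weights[1] / weights[0], used later as scale_pos_weight, equals n_negative / n_positive.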
adhd_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_adhd), y=y_adhd)
sex_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_sex), y=y_sex)
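
The split above stratifies on the ADHD label alone and derives the class weights from the full label vectors. One possible variant, sketched here on synthetic labels (all data and names below are made up), stratifies on the combination of both targets and computes the positive-class weight from the training portion only. This is an illustration under those assumptions, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight

rng = np.random.default_rng(0)
n = 200
X_toy = rng.normal(size=(n, 5))                  # synthetic features
y_adhd_toy = (rng.random(n) < 0.7).astype(int)   # synthetic ADHD labels
y_sex_toy = (rng.random(n) < 0.35).astype(int)   # synthetic sex labels

# Joint label with four values: 0 = (0,0), 1 = (0,1), 2 = (1,0), 3 = (1,1)
joint = 2 * y_adhd_toy + y_sex_toy

X_tr, X_va, adhd_tr, adhd_va, sex_tr, sex_va = train_test_split(
    X_toy, y_adhd_toy, y_sex_toy,
    test_size=0.2, random_state=42, stratify=joint,
)


def pos_weight(y):
    """Ratio of 'balanced' class weights, which equals n_negative / n_positive."""
    w = class_weight.compute_class_weight("balanced", classes=np.unique(y), y=y)
    return w[1] / w[0]


print(f"sex positive rate, train vs. val: {sex_tr.mean():.3f} vs. {sex_va.mean():.3f}")
print(f"scale_pos_weight for sex (training labels only): {pos_weight(sex_tr):.3f}")
```

Stratifying on the joint label keeps the proportions of both targets similar in the training and validation sets, which matters when one of the four combinations is rare.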


In [4]:
import os
import joblib
import lightgbm as lgb

# File paths for the saved models
adhd_model_path = 'adhd_model.pkl'
sex_model_path = 'sex_model.pkl'

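# Reuse a previously saved ADHD model if one exists.
# NOTE: a cached model is loaded as-is. Delete adhd_model.pkl after changing the data,
# the train/validation split, or the hyperparameters; otherwise later results are stale.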
if os.path.exists(adhd_model_path):
    print("Loading ADHD model from file...")
    adhd_model = joblib.load(adhd_model_path)
else:
    # Define ADHD model
    adhd_model = lgb.LGBMClassifier(
        objective='binary',
        num_leaves=63,
        learning_rate=0.01,
        n_estimators=1000,
        scale_pos_weight=adhd_weights[1] / adhd_weights[0],
        early_stopping_rounds=50,
        verbose=-1
    )
    # Train ADHD model
    print("Training ADHD model...")
    adhd_model.fit(X_train, y_train_adhd, eval_set=[(X_val, y_val_adhd)])
    # Save ADHD model
    joblib.dump(adhd_model, adhd_model_path)

# Reuse a previously saved sex model if one exists; delete sex_model.pkl to force retraining
if os.path.exists(sex_model_path):
    print("Loading Sex model from file...")
    sex_model = joblib.load(sex_model_path)
else:
    # Define Sex model
    sex_model = lgb.LGBMClassifier(
        objective='binary',
        num_leaves=127,
        learning_rate=0.005,
        n_estimators=1000,
        scale_pos_weight=sex_weights[1] / sex_weights[0],
        early_stopping_rounds=50,
        verbose=-1
    )
    # Train Sex model
    print("Training Sex model...")
    sex_model.fit(X_train, y_train_sex, eval_set=[(X_val, y_val_sex)])
    # Save Sex model
    joblib.dump(sex_model, sex_model_path)

Loading ADHD model from file...
Loading Sex model from file...
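
Both models in this run were loaded from disk, so they reflect whatever data and settings were in use when the `.pkl` files were written. One way to reduce the risk of reusing a stale model is to include a hash of the hyperparameters in the file name. The sketch below is an assumption-laden illustration: `cache_path`, `load_or_train`, and `toy_train` are hypothetical helpers, and the scheme does not detect changes in the training data.

```python
import hashlib
import json
import os
import tempfile

import joblib


def cache_path(name, params, cache_dir="."):
    """Build a file name that changes whenever the hyperparameters change."""
    digest = hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()[:8]
    return os.path.join(cache_dir, f"{name}_{digest}.pkl")


def load_or_train(name, params, train_fn, cache_dir="."):
    """Return (model, loaded_from_cache)."""
    path = cache_path(name, params, cache_dir)
    if os.path.exists(path):
        return joblib.load(path), True
    model = train_fn(params)
    joblib.dump(model, path)
    return model, False


def toy_train(params):
    # Stand-in for fitting a real model; returns a picklable object
    return {"trained_with": dict(params)}


with tempfile.TemporaryDirectory() as tmp:
    params = {"num_leaves": 63, "learning_rate": 0.01}
    _, hit_first = load_or_train("adhd_model", params, toy_train, tmp)
    _, hit_second = load_or_train("adhd_model", params, toy_train, tmp)
    _, hit_changed = load_or_train(
        "adhd_model", {**params, "num_leaves": 31}, toy_train, tmp
    )

print(hit_first, hit_second, hit_changed)  # False True False
```

The second call reuses the saved file, while changing `num_leaves` produces a different file name and triggers retraining.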


---

## 5. Model Evaluation <a name="model-evaluation"></a>

We evaluate the models on the validation set using the F1 score, which is the competition's evaluation metric, and average the two scores into a combined score.

In [5]:
from sklearn.metrics import f1_score

# Make predictions on the validation set
adhd_pred = adhd_model.predict(X_val)
sex_pred = sex_model.predict(X_val)

# Calculate F1 scores
adhd_f1 = f1_score(y_val_adhd, adhd_pred)
sex_f1 = f1_score(y_val_sex, sex_pred)
combined_f1 = (adhd_f1 + sex_f1) / 2

print(f"ADHD F1 Score: {adhd_f1:.4f}")
print(f"Sex F1 Score: {sex_f1:.4f}")
print(f"Combined F1 Score: {combined_f1:.4f}")



ADHD F1 Score: 0.8490
Sex F1 Score: 0.0000
Combined F1 Score: 0.4245
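
A Sex F1 score of 0.0000 means the sex model produced no true positives on the validation set: it did not correctly predict class 1 (female) for a single sample. The combined score is therefore carried entirely by the ADHD model. A quick way to diagnose this kind of result is to look at the confusion matrix and the distribution of predicted classes. The sketch below uses made-up labels and a constant predictor to show what such a failure looks like; it does not reproduce the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([1, 0, 1, 0, 0, 1])   # made-up validation labels
y_pred = np.zeros_like(y_true)          # a model that always predicts class 0

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)                               # 3 0 3 0
print(f1_score(y_true, y_pred, zero_division=0))    # 0.0
print(np.bincount(y_pred, minlength=2))             # [6 0]
```

If `np.bincount(sex_pred, minlength=2)` shows no predictions of class 1, the model has collapsed to the majority class, and the class weighting, the early-stopping behaviour, or a stale cached model are natural things to check first.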




---

## 6. Submission Generation <a name="submission-generation"></a>

We generate predictions for the test set and create a submission file.

In [7]:
# Preprocess test data
test_preprocessed = preprocessor.transform(test.drop('participant_id', axis=1, errors='ignore'))

# Make predictions
test_adhd_pred = adhd_model.predict(test_preprocessed)
test_sex_pred = sex_model.predict(test_preprocessed)

# Create submission file
submission = pd.DataFrame({
    'participant_id': test['participant_id'],
    'ADHD_Outcome': test_adhd_pred,
    'Sex_F': test_sex_pred
})

# Save submission file
submission.to_csv('submission.csv', index=False)
print("Submission file saved!")

Submission file saved!
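
Before uploading, it can help to verify that the submission has the expected columns, one row per test participant, and only 0/1 predictions. The sketch below is a minimal check on made-up data; `check_submission` is a hypothetical helper, and the column list is taken from the submission DataFrame built above.

```python
import pandas as pd


def check_submission(sub, expected_ids):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    if list(sub.columns) != ["participant_id", "ADHD_Outcome", "Sex_F"]:
        return ["unexpected columns"]
    problems = []
    if sub["participant_id"].duplicated().any():
        problems.append("duplicate participant_id")
    if set(sub["participant_id"]) != set(expected_ids):
        problems.append("participant_id mismatch")
    for col in ("ADHD_Outcome", "Sex_F"):
        if not sub[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0/1")
    return problems


# Made-up example submissions
good = pd.DataFrame({
    "participant_id": ["a", "b", "c"],
    "ADHD_Outcome": [1, 0, 1],
    "Sex_F": [0, 0, 1],
})
bad = good.assign(Sex_F=[0, 2, 1])

print(check_submission(good, ["a", "b", "c"]))  # []
print(check_submission(bad, ["a", "b", "c"]))   # ['Sex_F has values other than 0/1']
```

In the notebook, the same check could be called as `check_submission(submission, test['participant_id'])`.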




---

## 7. Conclusion <a name="conclusion"></a>

In this notebook, we:
1. Loaded and preprocessed the WiDS Datathon 2025 dataset.
2. Trained LightGBM models for ADHD and sex prediction.
3. Evaluated the models using the F1 score.
4. Generated a submission file for the competition.

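Further improvements could include:
- Hyperparameter tuning
- Feature engineering
- Ensemble methods
- Investigating the sex model, whose validation F1 score of 0.0000 shows that it never correctly predicts the positive class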
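
As an illustration of hyperparameter tuning, one of the improvements listed above, the sketch below runs a small cross-validated grid search scored by F1. To keep it self-contained it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM; the grid values are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, mildly imbalanced binary classification problem
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Arbitrary small grid, chosen only to keep the example fast
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid=param_grid,
    scoring="f1",  # same metric family as the evaluation above
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

Cross-validation gives a more stable estimate than a single 80/20 split, at the cost of fitting each candidate several times.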

Good luck with the competition!