# Ensemble Learning with Domain-Specific Models for Alzheimer's Prediction

This notebook demonstrates a domain-driven ensemble approach for predicting cognitive decline using social determinants of health. The key innovation is splitting features into meaningful domains (demographics, social, health, economic) and training specialized models for each group.

## Table of Contents
1. [Introduction](#introduction)
2. [Setup and Data Loading](#setup)
3. [Feature Engineering](#feature-engineering)
4. [Domain-Specific Models](#domain-models)
5. [Ensemble Training](#ensemble)
6. [Results Analysis](#results)

## Introduction <a name="introduction"></a>

The Mexican Health and Aging Study (MHAS) provides rich longitudinal data about social determinants of health. Rather than treating all features equally, we:
1. Group features by domain expertise
2. Create specialized models for each domain
3. Combine predictions using weighted averaging, with each domain weighted by its inverse cross-validation error

### Why this approach?
- Different feature groups may benefit from different model architectures
- Enables domain-specific feature engineering
- Provides better interpretability by domain
- Allows for analyzing relative importance of different domains

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from pathlib import Path
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## Setup and Data Loading <a name="setup"></a>

First, let's load and examine our data:

In [2]:
def load_data():
    """Load and prepare the competition data"""
    data_dir = Path('../data/raw')
    
    # Load data files
    train_features = pd.read_csv(data_dir / 'train_features.csv')
    test_features = pd.read_csv(data_dir / 'test_features.csv')
    train_labels = pd.read_csv(data_dir / 'train_labels.csv')
    
    print("Data shapes:")
    print(f"Train features: {train_features.shape}")
    print(f"Test features: {test_features.shape}")
    print(f"Train labels: {train_labels.shape}")
    
    return train_features, test_features, train_labels

# Load data
train_features, test_features, train_labels = load_data()

Data shapes:
Train features: (3276, 184)
Test features: (819, 184)
Train labels: (4343, 3)
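
The label file has more rows (4343) than the training feature file (3276), which suggests that some individuals have more than one label row. A minimal sketch of aligning the two tables, assuming they share an ID column. The column names `uid`, `year`, and `composite_score` and all values below are hypothetical stand-ins, not taken from the files:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (names and values are assumptions)
features = pd.DataFrame({"uid": ["a", "b"], "age_12": [70, 65]})
labels = pd.DataFrame({
    "uid": ["a", "a", "b"],          # person "a" has two label rows
    "year": [2016, 2021, 2021],
    "composite_score": [150, 140, 180],
})

# One output row per label row; validate= raises if the feature table
# contains duplicate IDs, which would silently multiply rows otherwise
merged = labels.merge(features, on="uid", how="left", validate="many_to_one")
print(merged.shape)
```

After a merge like this, the feature matrix and the target have the same number of rows, which the training code further down relies on.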


## Feature Engineering <a name="feature-engineering"></a>

We create several types of engineered features:

1. **Temporal Changes** (2003 vs 2012):

In [3]:
def create_temporal_changes(df):
    """Create features representing changes between 2003 and 2012"""
    df = df.copy()
    
    # Find matching columns between years
    base_columns = []
    for col in df.columns:
        if col.endswith('_03'):
            base_name = col[:-3]
            if f"{base_name}_12" in df.columns:
                base_columns.append(base_name)
    
    # Calculate changes
    for base_col in base_columns:
        col_03 = f"{base_col}_03"
        col_12 = f"{base_col}_12"
        
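        # Work on numeric copies so that non-numeric source columns
        # are not overwritten with NaN in the returned frame
        val_03 = pd.to_numeric(df[col_03], errors='coerce')
        val_12 = pd.to_numeric(df[col_12], errors='coerce')
        
        # Absolute and relative change; a zero 2003 baseline gives NaN, not inf
        df[f"{base_col}_change"] = val_12 - val_03
        df[f"{base_col}_pct_change"] = (val_12 - val_03) / val_03.abs().replace(0, np.nan)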
        
    return df
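
To make the output concrete, here is a small sketch of the same change calculation on a toy frame. The values are hypothetical, and this sketch masks a zero 2003 baseline so the relative change comes out as NaN rather than infinity:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; column names follow the <name>_03 / <name>_12 pattern
toy = pd.DataFrame({
    "n_depr_03": [2, 0, 4],
    "n_depr_12": [3, 1, np.nan],
    "age_12": [70, 65, 80],   # no 2003 partner column, so it is skipped
})

# Base names that exist for both waves
pairs = [c[:-3] for c in toy.columns
         if c.endswith("_03") and f"{c[:-3]}_12" in toy.columns]

for base in pairs:
    change = toy[f"{base}_12"] - toy[f"{base}_03"]
    baseline = toy[f"{base}_03"].abs()
    baseline = baseline.where(baseline != 0)   # zero baseline -> NaN
    toy[f"{base}_change"] = change
    toy[f"{base}_pct_change"] = change / baseline

print(toy[["n_depr_change", "n_depr_pct_change"]])
```

The first row changes by 1 from a baseline of 2 (relative change 0.5), the second row has a zero baseline, and the third row is missing its 2012 value, so both of its derived columns are NaN.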

2. **Composite Scores**:

In [4]:
def create_health_score(df):
    """Create a composite health indicator for each survey wave"""
    df = df.copy()  # leave the caller's frame untouched, as in create_temporal_changes
    health_cols = ['n_adl', 'n_iadl', 'n_depr', 'n_illnesses']
    
    for suffix in ['_03', '_12']:
        cols = [f"{col}{suffix}" for col in health_cols]
        valid_cols = [col for col in cols if col in df.columns]
        
        if valid_cols:
            # Z-score each component, then average the available components per row
            values = df[valid_cols].apply(pd.to_numeric, errors='coerce')
            normalized = (values - values.mean()) / values.std()
            df[f'health_score{suffix}'] = normalized.mean(axis=1)
    
    return df
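
The composite is a row-wise mean of z-scores. A short worked sketch with two of the components and hypothetical values shows the arithmetic:

```python
import pandas as pd

# Hypothetical counts for two of the health components
toy = pd.DataFrame({
    "n_adl_12": [0, 2, 4],
    "n_iadl_12": [1, 1, 4],
})

# Column-wise z-scores; pandas' std() is the sample standard deviation (ddof=1)
z = (toy - toy.mean()) / toy.std()

# Row-wise mean of the z-scores; missing components would be skipped
toy["health_score_12"] = z.mean(axis=1)
print(toy)
```

Because each component is centered before averaging, the score has mean zero across rows, and a row's value says how far it sits from the sample average in standard-deviation units.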

3. **Domain-Specific Features**:

In [5]:
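def create_domain_features(df):
    """Create features for each domain"""
    df = df.copy()  # avoid mutating the caller's frame
    
    # Social domain: mean of the available engagement indicators
    social_cols = ['rrfcntx_m_12', 'rsocact_m_12', 'n_living_child_12']
    df['social_engagement_score'] = df[social_cols].apply(
        pd.to_numeric, errors='coerce'
    ).mean(axis=1)
    
    # Economic domain: household income per reported (non-missing) income source
    n_sources = df[['hinc_business_12', 'hinc_rent_12', 'hinc_assets_12']].count(axis=1)
    df['economic_stability'] = (
        pd.to_numeric(df['hincome_12'], errors='coerce') /
        n_sources.replace(0, np.nan)  # no reported sources -> NaN rather than inf
    )
    
    return df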
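
The economic feature divides household income by the number of non-missing income-source columns, since `count(axis=1)` counts non-NaN cells per row. A brief sketch with hypothetical values, masking rows that report no source at all:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; NaN marks an income source that was not reported
toy = pd.DataFrame({
    "hincome_12": [30000.0, 12000.0, 5000.0],
    "hinc_business_12": [10000.0, np.nan, np.nan],
    "hinc_rent_12": [5000.0, 2000.0, np.nan],
    "hinc_assets_12": [1000.0, np.nan, np.nan],
})

source_cols = ["hinc_business_12", "hinc_rent_12", "hinc_assets_12"]
n_sources = toy[source_cols].count(axis=1)   # non-missing sources per row: 3, 1, 0

# Rows with zero reported sources get NaN instead of a division by zero
toy["economic_stability"] = toy["hincome_12"] / n_sources.where(n_sources > 0)
print(toy["economic_stability"])
```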

## Domain-Specific Models <a name="domain-models"></a>

We define feature groups and specialized models:

In [6]:
FEATURE_GROUPS = {
    'demographics': [
        'edu_gru_12', 'edu_gru_03',
        'age_12', 'age_03',
        'n_living_child_12',
        'rameduc_m', 'rafeduc_m'
    ],
    'social': [
        'social_engagement_score',
        'rrfcntx_m_12',
        'rsocact_m_12',
        'reads_12'
    ],
    'health': [
        'health_score_12',
        'n_depr_12', 'n_depr_03',
        'n_depr_change'
    ],
    'economic': [
        'economic_stability',
        'hincome_12', 'hincome_03',
        'hincome_change'
    ]
}

MODEL_CONFIGS = {
    'demographics': {
        'type': 'lgb',
        'params': {
            'num_leaves': 15,
            'learning_rate': 0.05,
            'feature_fraction': 0.8
        }
    },
    'social': {
        'type': 'xgb',
        'params': {
            'max_depth': 5,
            'learning_rate': 0.05,
            'subsample': 0.8
        }
    }
    # ... configs for other domains
}
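
Each domain in `FEATURE_GROUPS` needs a matching entry in `MODEL_CONFIGS` before training. The sketch below shows one way to check that and to turn a config entry into a model. The helper names `missing_configs` and `build_model` are hypothetical, and the choice of the scikit-learn style wrappers `LGBMRegressor` and `XGBRegressor` is an assumption rather than something the notebook specifies:

```python
def missing_configs(feature_groups, model_configs):
    """Return the domains that have features but no model config."""
    return sorted(set(feature_groups) - set(model_configs))


def build_model(config):
    """Instantiate a regressor from a {'type': ..., 'params': ...} entry.

    Assumption: 'lgb' maps to LGBMRegressor and 'xgb' to XGBRegressor.
    """
    model_type = config["type"]
    params = config.get("params", {})
    if model_type == "lgb":
        import lightgbm as lgb  # imported lazily; only needed for this branch
        return lgb.LGBMRegressor(**params)
    if model_type == "xgb":
        import xgboost as xgb
        return xgb.XGBRegressor(**params)
    raise ValueError(f"Unknown model type: {model_type!r}")


# Abbreviated stand-ins for the dictionaries defined above
groups = {"demographics": ["age_12"], "social": ["reads_12"],
          "health": ["n_depr_12"], "economic": ["hincome_12"]}
configs = {"demographics": {"type": "lgb", "params": {"num_leaves": 15}},
           "social": {"type": "xgb", "params": {"max_depth": 5}}}

print(missing_configs(groups, configs))
```

Running a check like this before training surfaces an incomplete configuration immediately instead of partway through the training loop.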

## Ensemble Training <a name="ensemble"></a>


The ensemble combines predictions from domain models:

In [7]:
class DomainEnsemble:
    def __init__(self, feature_groups, model_configs):
        self.feature_groups = feature_groups
        self.model_configs = model_configs
        self.models = {}
        self.weights = None
    
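    def train(self, X, y, n_splits=5):
        """Train domain-specific models and derive ensemble weights"""
        print("Training domain models...")
        
        # Fail early if a domain has features but no model config
        missing = [d for d in self.feature_groups if d not in self.model_configs]
        if missing:
            raise ValueError(f"No model config for domains: {missing}")
        
        # Train each domain model
        domain_scores = {}
        for domain, features in self.feature_groups.items():
            print(f"\nTraining {domain} model...")
            X_domain = X[features]
            
            # Train with cross-validation; _train_domain is expected to return
            # one validation error per fold (e.g. RMSE, lower is better)
            scores = self._train_domain(
                X_domain, 
                y,
                self.model_configs[domain],
                n_splits
            )
            domain_scores[domain] = np.mean(scores)
            
        # Weight each domain by its inverse error, normalized to sum to 1,
        # so that domains with lower error contribute more to the ensemble
        inverse_scores = {domain: 1 / score for domain, score in domain_scores.items()}
        total = sum(inverse_scores.values())
        self.weights = {
            domain: inv / total
            for domain, inv in inverse_scores.items()
        }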
        
        print("\nDomain weights:")
        for domain, weight in self.weights.items():
            print(f"{domain}: {weight:.3f}")
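
The class above relies on per-fold validation errors and then blends domain predictions by inverse error. The following is a library-agnostic sketch of those three steps, not the notebook's own implementation: `cv_rmse`, `inverse_error_weights`, `blend`, and the `MeanModel` stand-in are all hypothetical names, and the RMSE values are made up for illustration.

```python
import numpy as np


class MeanModel:
    """Stand-in regressor that always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def cv_rmse(X, y, make_model, n_splits=5, seed=42):
    """Return one validation RMSE per fold, fitting a fresh model each time."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_splits)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model().fit(X[train_idx], y[train_idx])
        residuals = y[val_idx] - model.predict(X[val_idx])
        scores.append(float(np.sqrt(np.mean(residuals ** 2))))
    return scores


def inverse_error_weights(domain_scores):
    """Weights proportional to 1 / error, normalized to sum to 1."""
    inverse = {d: 1.0 / s for d, s in domain_scores.items()}
    total = sum(inverse.values())
    return {d: v / total for d, v in inverse.items()}


def blend(predictions, weights):
    """Weighted average of per-domain prediction arrays."""
    return sum(weights[d] * np.asarray(p, dtype=float)
               for d, p in predictions.items())


# Per-fold errors for a toy problem
fold_scores = cv_rmse(np.arange(20).reshape(-1, 1), np.arange(20.0), MeanModel)

# Hypothetical mean RMSEs: 1/40 and 1/60 normalize to weights 0.6 and 0.4
weights = inverse_error_weights({"demographics": 40.0, "social": 60.0})
blended = blend({"demographics": [100, 200], "social": [150, 150]}, weights)
print(weights, blended)   # blended values: 120 and 180
```

Inverse-error weighting is a simple heuristic: it never assigns a negative weight and needs no extra fitting, but it does not account for correlation between the domain models' errors.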

## Results Analysis <a name="results"></a>

Let's analyze the results:

In [8]:
def analyze_results(ensemble, X, y):
    """Analyze ensemble performance"""
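    # Feature importance by domain
    importances = {}
    for domain, model in ensemble.models.items():
        # LightGBM Boosters expose feature_importance(); scikit-learn style
        # wrappers (LGBMRegressor, XGBRegressor) expose feature_importances_
        if hasattr(model, 'feature_importances_'):
            values = model.feature_importances_
        else:
            values = model.feature_importance()
        importances[domain] = pd.Series(values, index=ensemble.feature_groups[domain])
    
    # Plot domain importances side by side so bars from different domains
    # do not overlap. Importance scales differ between libraries, so compare
    # bars within a domain rather than across domains.
    plt.figure(figsize=(12, 6))
    offset = 0
    tick_labels = []
    for domain, imp in importances.items():
        positions = range(offset, offset + len(imp))
        plt.bar(positions, imp.values, label=domain)
        tick_labels.extend(imp.index)
        offset += len(imp)
    plt.xticks(range(offset), tick_labels, rotation=45, ha='right')
    plt.title("Feature Importance by Domain")
    plt.legend()
    plt.tight_layout()
    plt.show()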

## Conclusions

This domain-driven ensemble approach offers several advantages:
1. Better interpretability through domain-specific analysis
2. Specialized feature engineering for each domain
3. Flexible model selection based on domain characteristics
4. Weighted combination based on domain performance

The analysis shows that:
- Demographic features (especially education) are the strongest predictors
- Social engagement provides complementary signals
- Economic stability adds predictive power
- Health indicators help capture risk factors

## Requirements