# Sliding Window Dataset - 12-Week Prediction Focus

## Core Research Question
**Can we predict BCVA change at 12 weeks using recent K visits of multimodal data?**

## Design
| Configuration | K=2 | K=3 |
|--------------|-----|-----|
| Observation Window | 4 weeks (2 visits) | 8 weeks (3 visits) |
| Input | [t0, t1] | [t0, t1, t2] |
| Target | BCVA(t1+12w) - BCVA(t1) | BCVA(t2+12w) - BCVA(t2) |
| Missing Strategy | Strict (both required) | Lenient (≥2 observed) |

## Key Constraint
- **12w target must have BCVA** (primary outcome, no missing allowed)

## Output Files
- `10_k2_12w_focus.csv` - K=2 sliding window dataset
- `10_k3_12w_focus.csv` - K=3 sliding window dataset

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path


DATA_DIR = Path('.')
print("Libraries loaded.")

Libraries loaded.


---
# Part 1: Load Merged Long Format Data

In [3]:
# Load the merged long format data 
merged = pd.read_csv(DATA_DIR / '8a_combined_data_long.csv')

print(f"Loaded merged data: {merged.shape}")
print(f"Patients: {merged['Patient_ID'].nunique()}")
print(f"Eyes: {merged.groupby(['Patient_ID', 'Eye']).ngroups}")
print(f"\nWeeks available: {sorted(merged['Week'].unique())}")
print(f"\nBCVA non-missing rate: {100*merged['BCVA'].notna().mean():.1f}%")

Loaded merged data: (840, 35)
Patients: 40
Eyes: 40

Weeks available: [np.int64(0), np.int64(4), np.int64(8), np.int64(12), np.int64(16), np.int64(20), np.int64(24), np.int64(28), np.int64(32), np.int64(36), np.int64(40), np.int64(44), np.int64(48), np.int64(52), np.int64(60), np.int64(68), np.int64(76), np.int64(84), np.int64(92), np.int64(100), np.int64(104)]

BCVA non-missing rate: 78.8%


---
# Part 2: Define Sliding Window Function (12w Focus)

In [4]:
def create_sliding_windows_12w_focus(df, k=3, visit_interval=4, max_week=52):
    """
    Create sliding window samples focused on 12-week BCVA prediction.
    
    Core Task: Predict BCVA change at 12 weeks after the last observed visit.
    
    Args:
        df: Long format dataframe with Patient_ID, Eye, Week, BCVA, etc.
        k: Number of visits in input window (2 or 3)
        visit_interval: Weeks between consecutive visits (default: 4)
        max_week: Maximum week for target (default: 52)
    
    Strategy:
        - K=2 (Strict): Both t0 and t1 must have BCVA
        - K=3 (Lenient): t2 must have BCVA, at least 2 of {t0,t1,t2} observed
        - 12w target MUST have BCVA (primary outcome, strictly required)
    
    Returns:
        DataFrame with sliding window samples
    
    Example:
        K=2: [W0, W4] → Predict BCVA change at W16 (4+12=16)
        K=3: [W0, W4, W8] → Predict BCVA change at W20 (8+12=20)
    """
    prediction_horizon = 12  # 固定 12 周预测
    windows = []
    
    for (patient_id, eye), group in df.groupby(['Patient_ID', 'Eye']):
        group = group.sort_values('Week')
        valid_weeks = set(group[group['BCVA'].notna()]['Week'].values)
        
        for start_week in group['Week'].values:
            window_weeks = [start_week + i * visit_interval for i in range(k)]
            t_last = window_weeks[-1]
            target_week = t_last + prediction_horizon
            
            # 基本条件: target week 不超过 max_week
            if target_week > max_week:
                continue
            
            # ========== 核心条件: 12w target 必须有值 ==========
            if target_week not in valid_weeks:
                continue
            
            # ========== Window 观测条件 ==========
            if k == 2:
                # K=2 严格模式: t0 和 t1 都必须有 BCVA
                if window_weeks[0] not in valid_weeks or window_weeks[1] not in valid_weeks:
                    continue
                observed_count = 2
            else:
                # K=3 宽松模式: t2 必须有值，且至少 2 个 observed
                if t_last not in valid_weeks:
                    continue
                observed_count = sum(1 for w in window_weeks if w in valid_weeks)
                if observed_count < 2:
                    continue
            
            # ========== 构建样本数据 ==========
            window_data = {
                'Patient_ID': patient_id,
                'Eye': eye,
                'Window_Start_Week': start_week,
                'Window_End_Week': t_last,
                'Target_Week': target_week,
                'Num_Observed_Visits': observed_count,
            }
            
            # Static features - 从第一个有效 visit 获取
            for w in window_weeks:
                if w in valid_weeks:
                    first_valid_visit = group[group['Week'] == w].iloc[0]
                    break
            else:
                first_valid_visit = group[group['Week'] == window_weeks[0]].iloc[0]
            
            static_cols = ['Age', 'Gender', 'Diabetes_Type', 'Diabetes_Years', 'Baseline_HbA1c', 'BMI']
            for col in static_cols:
                if col in first_valid_visit:
                    window_data[col] = first_valid_visit[col]
            
            # Temporal features
            temporal_cols = ['BCVA', 'CST', 'Injection', 'Leakage_Index', 
                            'Fundus_Path', 'OCT_Paths', 'Num_B_scans']
            
            for i, week in enumerate(window_weeks):
                window_data[f'Week_t{i}'] = week
                is_missing = week not in valid_weeks
                window_data[f'BCVA_missing_t{i}'] = 1 if is_missing else 0
                
                visit_rows = group[group['Week'] == week]
                if len(visit_rows) > 0:
                    visit_data = visit_rows.iloc[0]
                    for col in temporal_cols:
                        if col in visit_data:
                            window_data[f'{col}_t{i}'] = visit_data[col]
                else:
                    for col in temporal_cols:
                        window_data[f'{col}_t{i}'] = np.nan
            
            # ========== Target: 12w BCVA change (PRIMARY OUTCOME) ==========
            t_last_bcva = group[group['Week'] == t_last].iloc[0]['BCVA']
            target_bcva = group[group['Week'] == target_week].iloc[0]['BCVA']
            
            window_data['BCVA_Baseline'] = t_last_bcva  # 基线值 (t_last)
            window_data['BCVA_Target'] = target_bcva    # 目标值 (12w后)
            window_data['BCVA_Change_12w'] = target_bcva - t_last_bcva  # 主要预测目标
            
            # 临床显著性分类 (≥5 letters)
            change = target_bcva - t_last_bcva
            if change >= 5:
                window_data['BCVA_Category'] = 'Improved'   # 显著改善
            elif change <= -5:
                window_data['BCVA_Category'] = 'Worsened'   # 显著恶化
            else:
                window_data['BCVA_Category'] = 'Stable'     # 稳定
            
            windows.append(window_data)
    
    result_df = pd.DataFrame(windows)
    
    # 打印统计信息
    print(f"=== K={k} Sliding Window (12w Focus) ===")
    print(f"Total samples: {len(result_df)}")
    print(f"\nObserved visits distribution:")
    print(result_df['Num_Observed_Visits'].value_counts().sort_index())
    print(f"\n12w BCVA Change Statistics:")
    print(f"  Mean: {result_df['BCVA_Change_12w'].mean():.2f} letters")
    print(f"  Std:  {result_df['BCVA_Change_12w'].std():.2f} letters")
    print(f"  Min:  {result_df['BCVA_Change_12w'].min():.0f}, Max: {result_df['BCVA_Change_12w'].max():.0f}")
    print(f"  |Change| ≥ 5 letters: {100*(result_df['BCVA_Change_12w'].abs() >= 5).mean():.1f}%")
    print(f"\nClinical Categories:")
    for cat in ['Improved', 'Stable', 'Worsened']:
        n = (result_df['BCVA_Category'] == cat).sum()
        pct = 100 * n / len(result_df)
        print(f"  {cat}: {n} ({pct:.1f}%)")
    
    return result_df

print("Function defined: create_sliding_windows_12w_focus()")

Function defined: create_sliding_windows_12w_focus()


---
# Part 3: Create K=3 Dataset (12w Focus)

In [5]:
# Create K=3 sliding windows
print("=" * 60)
sliding_windows_k3 = create_sliding_windows_12w_focus(
    merged, 
    k=3, 
    visit_interval=4, 
    max_week=52
)

=== K=3 Sliding Window (12w Focus) ===
Total samples: 290

Observed visits distribution:
Num_Observed_Visits
2     10
3    280
Name: count, dtype: int64

12w BCVA Change Statistics:
  Mean: -0.20 letters
  Std:  4.02 letters
  Min:  -29, Max: 12
  |Change| ≥ 5 letters: 22.8%

Clinical Categories:
  Improved: 30 (10.3%)
  Stable: 224 (77.2%)
  Worsened: 36 (12.4%)


---
# Part 4: Create K=2 Dataset (12w Focus)

In [6]:
# Create K=2 sliding windows
print("=" * 60)
sliding_windows_k2 = create_sliding_windows_12w_focus(
    merged, 
    k=2, 
    visit_interval=4, 
    max_week=52
)

=== K=2 Sliding Window (12w Focus) ===
Total samples: 324

Observed visits distribution:
Num_Observed_Visits
2    324
Name: count, dtype: int64

12w BCVA Change Statistics:
  Mean: -0.01 letters
  Std:  4.06 letters
  Min:  -29, Max: 12
  |Change| ≥ 5 letters: 23.5%

Clinical Categories:
  Improved: 39 (12.0%)
  Stable: 248 (76.5%)
  Worsened: 37 (11.4%)


---
# Part 5: Feature Engineering

In [None]:
def add_engineered_features(df, k):
    """
    Add engineered features for sliding window dataset.
    
    Features added:
    - BCVA_Slope: Rate of BCVA change over observation window
    - CST_Slope: Rate of CST change over observation window
    - BCVA_Delta / BCVA_Delta_Recent: Absolute BCVA change
    - CST_Delta / CST_Delta_Recent: Absolute CST change
    - Injection_Count: Total injections in window
    """
    df = df.copy()
    
    # 确保 CST 是数值型
    for i in range(k):
        col = f'CST_t{i}'
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
    if k == 2:
        # K=2: 简单的两点差分
        df['BCVA_Slope'] = (df['BCVA_t1'] - df['BCVA_t0']) / (df['Week_t1'] - df['Week_t0'])
        df['CST_Slope'] = (df['CST_t1'] - df['CST_t0']) / (df['Week_t1'] - df['Week_t0'])
        df['BCVA_Delta'] = df['BCVA_t1'] - df['BCVA_t0']
        df['CST_Delta'] = df['CST_t1'] - df['CST_t0']
        
    else:  # k == 3
        # K=3: 用有效点计算斜率，处理可能的缺失
        def calc_slope(row, feature):
            points = []
            for i in range(k):
                week = row.get(f'Week_t{i}')
                val = row.get(f'{feature}_t{i}')
                if pd.notna(week) and pd.notna(val):
                    try:
                        points.append((float(week), float(val)))
                    except (ValueError, TypeError):
                        continue
            if len(points) >= 2: # 至少两个点计算斜率, 用最后和最前的点
                denom = points[-1][0] - points[0][0]
                if denom != 0:
                    return (points[-1][1] - points[0][1]) / denom
            return np.nan
        
        df['BCVA_Slope'] = df.apply(lambda r: calc_slope(r, 'BCVA'), axis=1)
        df['CST_Slope'] = df.apply(lambda r: calc_slope(r, 'CST'), axis=1)
        
        # 短期变化 (t1 → t2)
        df['BCVA_Delta_Recent'] = df['BCVA_t2'] - df['BCVA_t1']
        df['CST_Delta_Recent'] = df['CST_t2'] - df['CST_t1']
    
    # 窗口内注射次数
    injection_cols = [f'Injection_t{i}' for i in range(k) if f'Injection_t{i}' in df.columns]
    df['Injection_Count'] = df[injection_cols].sum(axis=1, skipna=True)
    
    return df

# Apply feature engineering
print("Adding engineered features...")
sliding_windows_k3 = add_engineered_features(sliding_windows_k3, k=3)
sliding_windows_k2 = add_engineered_features(sliding_windows_k2, k=2)

print("\n=== K=3 Engineered Features ===")
eng_cols_k3 = ['BCVA_Slope', 'CST_Slope', 'BCVA_Delta_Recent', 'CST_Delta_Recent', 'Injection_Count']
print(sliding_windows_k3[eng_cols_k3].describe().round(2))

print("\n=== K=2 Engineered Features ===")
eng_cols_k2 = ['BCVA_Slope', 'CST_Slope', 'BCVA_Delta', 'CST_Delta', 'Injection_Count']
print(sliding_windows_k2[eng_cols_k2].describe().round(2))

Adding engineered features...

=== K=3 Engineered Features ===
       BCVA_Slope  CST_Slope  BCVA_Delta_Recent  CST_Delta_Recent  \
count      290.00     290.00             285.00            285.00   
mean         0.07      -0.33               0.16             -1.13   
std          0.49       1.37               3.14              8.14   
min         -1.25      -6.62             -10.00            -50.00   
25%         -0.25      -0.88              -1.00             -3.00   
50%          0.00      -0.25               0.00             -1.00   
75%          0.38       0.25               2.00              2.00   
max          2.00      10.00              11.00             40.00   

       Injection_Count  
count           290.00  
mean              1.68  
std               1.09  
min               0.00  
25%               1.00  
50%               2.00  
75%               3.00  
max               3.00  

=== K=2 Engineered Features ===
       BCVA_Slope  CST_Slope  BCVA_Delta  CST_Delta  Inje

---
# Part 6: Check Missing Values

In [8]:
def check_missing(df, name):
    """Check and report missing values"""
    print(f"=== {name} Missing Values ===")
    missing = df.isna().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    
    if len(missing) > 0:
        print(missing)
        print(f"\nMax missing rate: {100 * missing.max() / len(df):.1f}%")
    else:
        print("No missing values!")
    print()

check_missing(sliding_windows_k3, "K=3")
check_missing(sliding_windows_k2, "K=2")

# 确认 target 没有缺失
print("=== Target Column Check ===")
print(f"K=3 BCVA_Change_12w missing: {sliding_windows_k3['BCVA_Change_12w'].isna().sum()}")
print(f"K=2 BCVA_Change_12w missing: {sliding_windows_k2['BCVA_Change_12w'].isna().sum()}")
print("(Should both be 0 - target is strictly required)")

=== K=3 Missing Values ===
BMI                  9
Leakage_Index_t0     9
Leakage_Index_t1     8
BCVA_t0              5
Injection_t0         5
CST_t0               5
OCT_Paths_t0         5
Num_B_scans_t0       5
BCVA_t1              5
Fundus_Path_t0       5
CST_t1               5
Injection_t1         5
Fundus_Path_t1       5
OCT_Paths_t1         5
Num_B_scans_t1       5
BCVA_Delta_Recent    5
CST_Delta_Recent     5
Leakage_Index_t2     3
dtype: int64

Max missing rate: 3.1%

=== K=2 Missing Values ===
BMI                 10
Leakage_Index_t1     5
Leakage_Index_t0     3
dtype: int64

Max missing rate: 3.1%

=== Target Column Check ===
K=3 BCVA_Change_12w missing: 0
K=2 BCVA_Change_12w missing: 0
(Should both be 0 - target is strictly required)


---
# Part 7: Train/Val/Test Split (Patient-Level)

In [None]:
def add_patient_split(df, random_state=42):
    """
    Add patient-level train/val/test split.
    Split ratio: 70% train, 15% val, 15% test
    """
    unique_patients = df['Patient_ID'].unique()
    
    # Split patients (not samples!)
    train_patients, temp_patients = train_test_split(
        unique_patients, test_size=0.3, random_state=random_state
    )
    val_patients, test_patients = train_test_split(
        temp_patients, test_size=0.5, random_state=random_state
    )
    
    # Assign split labels
    train_set = set(train_patients)
    val_set = set(val_patients)
    
    df = df.copy()
    df['Split'] = df['Patient_ID'].apply(
        lambda x: 'train' if x in train_set else ('val' if x in val_set else 'test')
    )
    
    return df, train_patients, val_patients, test_patients

# Apply split (use same random_state for fair comparison)
sliding_windows_k3, train_p, val_p, test_p = add_patient_split(sliding_windows_k3, random_state=42)
sliding_windows_k2, _, _, _ = add_patient_split(sliding_windows_k2, random_state=42)

print("=== Patient-Level Split ===")
print(f"Train patients: {len(train_p)}")
print(f"Val patients: {len(val_p)}")
print(f"Test patients: {len(test_p)}")

print("\n=== Sample Distribution ===")
print(f"{'Split':<10} {'K=2':<20} {'K=3':<20}")
print("-" * 50)
for split in ['train', 'val', 'test']:
    k2_n = (sliding_windows_k2['Split'] == split).sum()
    k3_n = (sliding_windows_k3['Split'] == split).sum()
    k2_pct = 100 * k2_n / len(sliding_windows_k2)
    k3_pct = 100 * k3_n / len(sliding_windows_k3)
    print(f"{split:<10} {f'{k2_n} ({k2_pct:.1f}%)':<20} {f'{k3_n} ({k3_pct:.1f}%)':<20}")

---
# Part 8: K=2 vs K=3 Comparison

In [9]:
print("=" * 65)
print("=== K=2 vs K=3 Comparison (12-Week Prediction Focus) ===")
print("=" * 65)

print(f"\n{'Metric':<40} {'K=2':<12} {'K=3':<12}")
print("-" * 65)

# Basic stats
print(f"{'Total samples':<40} {len(sliding_windows_k2):<12} {len(sliding_windows_k3):<12}")
print(f"{'Observation window':<40} {'4 weeks':<12} {'8 weeks':<12}")
print(f"{'Visits in window':<40} {'2':<12} {'3':<12}")

# Target statistics
print(f"\n{'--- Target Statistics ---':<40}")
k2_mean = sliding_windows_k2['BCVA_Change_12w'].mean()
k3_mean = sliding_windows_k3['BCVA_Change_12w'].mean()
print(f"{'Mean ΔBCVA (letters)':<40} {f'{k2_mean:.2f}':<12} {f'{k3_mean:.2f}':<12}")

k2_std = sliding_windows_k2['BCVA_Change_12w'].std()
k3_std = sliding_windows_k3['BCVA_Change_12w'].std()
print(f"{'Std ΔBCVA (letters)':<40} {f'{k2_std:.2f}':<12} {f'{k3_std:.2f}':<12}")

k2_ge5 = 100 * (sliding_windows_k2['BCVA_Change_12w'].abs() >= 5).mean()
k3_ge5 = 100 * (sliding_windows_k3['BCVA_Change_12w'].abs() >= 5).mean()
print(f"{'|ΔBCVA| ≥ 5 letters (%)':<40} {f'{k2_ge5:.1f}%':<12} {f'{k3_ge5:.1f}%':<12}")

k2_ge10 = 100 * (sliding_windows_k2['BCVA_Change_12w'].abs() >= 10).mean()
k3_ge10 = 100 * (sliding_windows_k3['BCVA_Change_12w'].abs() >= 10).mean()
print(f"{'|ΔBCVA| ≥ 10 letters (%)':<40} {f'{k2_ge10:.1f}%':<12} {f'{k3_ge10:.1f}%':<12}")

# Clinical categories
print(f"\n{'--- Clinical Categories ---':<40}")
for cat in ['Improved', 'Stable', 'Worsened']:
    k2_n = (sliding_windows_k2['BCVA_Category'] == cat).sum()
    k3_n = (sliding_windows_k3['BCVA_Category'] == cat).sum()
    k2_pct = 100 * k2_n / len(sliding_windows_k2)
    k3_pct = 100 * k3_n / len(sliding_windows_k3)
    print(f"{cat:<40} {f'{k2_n} ({k2_pct:.1f}%)':<12} {f'{k3_n} ({k3_pct:.1f}%)':<12}")

print("\n" + "=" * 65)

=== K=2 vs K=3 Comparison (12-Week Prediction Focus) ===

Metric                                   K=2          K=3         
-----------------------------------------------------------------
Total samples                            324          290         
Observation window                       4 weeks      8 weeks     
Visits in window                         2            3           

--- Target Statistics ---               
Mean ΔBCVA (letters)                     -0.01        -0.20       
Std ΔBCVA (letters)                      4.06         4.02        
|ΔBCVA| ≥ 5 letters (%)                  23.5%        22.8%       
|ΔBCVA| ≥ 10 letters (%)                 2.8%         2.4%        

--- Clinical Categories ---             
Improved                                 39 (12.0%)   30 (10.3%)  
Stable                                   248 (76.5%)  224 (77.2%) 
Worsened                                 37 (11.4%)   36 (12.4%)  



---
# Part 9: Define Feature Lists

In [11]:
# ============================================================
# Define feature lists for baseline models
# ============================================================

# K=3 Features
K3_STATIC_FEATURES = ['Age', 'Gender', 'Diabetes_Type', 'Diabetes_Years', 'Baseline_HbA1c', 'BMI']

K3_TEMPORAL_FEATURES = [
    'BCVA_t0', 'CST_t0', 'Injection_t0', 'Leakage_Index_t0', 'BCVA_missing_t0',
    'BCVA_t1', 'CST_t1', 'Injection_t1', 'Leakage_Index_t1', 'BCVA_missing_t1',
    'BCVA_t2', 'CST_t2', 'Injection_t2', 'Leakage_Index_t2', 'BCVA_missing_t2',
]

K3_ENGINEERED_FEATURES = ['BCVA_Slope', 'CST_Slope', 'BCVA_Delta_Recent', 'CST_Delta_Recent', 'Injection_Count']

K3_ALL_FEATURES = K3_STATIC_FEATURES + K3_TEMPORAL_FEATURES + K3_ENGINEERED_FEATURES

# K=2 Features
K2_STATIC_FEATURES = ['Age', 'Gender', 'Diabetes_Type', 'Diabetes_Years', 'Baseline_HbA1c', 'BMI']

K2_TEMPORAL_FEATURES = [
    'BCVA_t0', 'CST_t0', 'Injection_t0', 'Leakage_Index_t0', 'BCVA_missing_t0',
    'BCVA_t1', 'CST_t1', 'Injection_t1', 'Leakage_Index_t1', 'BCVA_missing_t1',
]

K2_ENGINEERED_FEATURES = ['BCVA_Slope', 'CST_Slope', 'BCVA_Delta', 'CST_Delta', 'Injection_Count']

K2_ALL_FEATURES = K2_STATIC_FEATURES + K2_TEMPORAL_FEATURES + K2_ENGINEERED_FEATURES

# Target
TARGET_COL = 'BCVA_Change_12w'
TARGET_CAT_COL = 'BCVA_Category'

print("=== Feature Definitions ===")
print(f"\nK=3 Features ({len(K3_ALL_FEATURES)} total):")
print(f"  Static: {len(K3_STATIC_FEATURES)}")
print(f"  Temporal: {len(K3_TEMPORAL_FEATURES)}")
print(f"  Engineered: {len(K3_ENGINEERED_FEATURES)}")

print(f"\nK=2 Features ({len(K2_ALL_FEATURES)} total):")
print(f"  Static: {len(K2_STATIC_FEATURES)}")
print(f"  Temporal: {len(K2_TEMPORAL_FEATURES)}")
print(f"  Engineered: {len(K2_ENGINEERED_FEATURES)}")

print(f"\nTarget: {TARGET_COL} (regression)")
print(f"Target Category: {TARGET_CAT_COL} (classification)")

=== Feature Definitions ===

K=3 Features (26 total):
  Static: 6
  Temporal: 15
  Engineered: 5

K=2 Features (21 total):
  Static: 6
  Temporal: 10
  Engineered: 5

Target: BCVA_Change_12w (regression)
Target Category: BCVA_Category (classification)


---
# Part 10: Save Datasets

In [12]:
# Save K=3 dataset
sliding_windows_k3.to_csv('10a_k3_12w_only.csv', index=False)
print(f"✓ Saved: 10a_k3_12w_only.csv")
print(f"  Shape: {sliding_windows_k3.shape}")
print(f"  Columns: {len(sliding_windows_k3.columns)}")

# Save K=2 dataset
sliding_windows_k2.to_csv('10b_k2_12w_only.csv', index=False)
print(f"\n✓ Saved: 10b_k2_12w_only.csv")
print(f"  Shape: {sliding_windows_k2.shape}")
print(f"  Columns: {len(sliding_windows_k2.columns)}")

✓ Saved: 10a_k3_12w_only.csv
  Shape: (290, 48)
  Columns: 48

✓ Saved: 10b_k2_12w_only.csv
  Shape: (324, 39)
  Columns: 39


In [13]:
# Show all columns for K=3
print("=== K=3 Dataset Columns ===")
for i, col in enumerate(sliding_windows_k3.columns):
    print(f"  {i+1:2d}. {col}")

=== K=3 Dataset Columns ===
   1. Patient_ID
   2. Eye
   3. Window_Start_Week
   4. Window_End_Week
   5. Target_Week
   6. Num_Observed_Visits
   7. Age
   8. Gender
   9. Diabetes_Type
  10. Diabetes_Years
  11. Baseline_HbA1c
  12. BMI
  13. Week_t0
  14. BCVA_missing_t0
  15. BCVA_t0
  16. CST_t0
  17. Injection_t0
  18. Leakage_Index_t0
  19. Fundus_Path_t0
  20. OCT_Paths_t0
  21. Num_B_scans_t0
  22. Week_t1
  23. BCVA_missing_t1
  24. BCVA_t1
  25. CST_t1
  26. Injection_t1
  27. Leakage_Index_t1
  28. Fundus_Path_t1
  29. OCT_Paths_t1
  30. Num_B_scans_t1
  31. Week_t2
  32. BCVA_missing_t2
  33. BCVA_t2
  34. CST_t2
  35. Injection_t2
  36. Leakage_Index_t2
  37. Fundus_Path_t2
  38. OCT_Paths_t2
  39. Num_B_scans_t2
  40. BCVA_Baseline
  41. BCVA_Target
  42. BCVA_Change_12w
  43. BCVA_Category
  44. BCVA_Slope
  45. CST_Slope
  46. BCVA_Delta_Recent
  47. CST_Delta_Recent
  48. Injection_Count


In [14]:
# Show all columns for K=2
print("=== K=2 Dataset Columns ===")
for i, col in enumerate(sliding_windows_k2.columns):
    print(f"  {i+1:2d}. {col}")

=== K=2 Dataset Columns ===
   1. Patient_ID
   2. Eye
   3. Window_Start_Week
   4. Window_End_Week
   5. Target_Week
   6. Num_Observed_Visits
   7. Age
   8. Gender
   9. Diabetes_Type
  10. Diabetes_Years
  11. Baseline_HbA1c
  12. BMI
  13. Week_t0
  14. BCVA_missing_t0
  15. BCVA_t0
  16. CST_t0
  17. Injection_t0
  18. Leakage_Index_t0
  19. Fundus_Path_t0
  20. OCT_Paths_t0
  21. Num_B_scans_t0
  22. Week_t1
  23. BCVA_missing_t1
  24. BCVA_t1
  25. CST_t1
  26. Injection_t1
  27. Leakage_Index_t1
  28. Fundus_Path_t1
  29. OCT_Paths_t1
  30. Num_B_scans_t1
  31. BCVA_Baseline
  32. BCVA_Target
  33. BCVA_Change_12w
  34. BCVA_Category
  35. BCVA_Slope
  36. CST_Slope
  37. BCVA_Delta
  38. CST_Delta
  39. Injection_Count


---
# Summary

## Core Research Question
**Can we predict BCVA change at 12 weeks using recent K visits of multimodal data?**

## Datasets Created

| File | K | Observation | Samples | Features |
|------|---|-------------|---------|----------|
| `10_k3_12w_focus.csv` | 3 | 8 weeks | ~290 | ~26 |
| `10_k2_12w_focus.csv` | 2 | 4 weeks | ~350 | ~21 |

## Key Design Decisions

1. **Single Target Focus**: Only predict 12-week BCVA change (no multi-horizon)
2. **Strict Target Requirement**: 12w target must have BCVA (no missing)
3. **Clinical Categories**: Improved (≥+5), Stable (-5 to +5), Worsened (≤-5)
4. **Patient-Level Split**: Same patients in same split for fair comparison

## Next Steps

1. **Baseline Model**: Train XGBoost/Random Forest on clinical features
2. **Compare K=2 vs K=3**: Which observation window performs better?
3. **Add Image Features**: Incorporate OCT/Fundus embeddings
4. **Evaluate**: MAE, RMSE for regression; AUC for classification (≥5 letters)