# K=2 Sliding Window Dataset Creation

## Purpose
Create an alternative sliding window dataset with K=2 (4-week observation window) for comparison with K=3.

## Strategy 4 for K=2 (Strict Mode)
- Window: [t0, t1] → Predict t1 + horizon
- **Condition 1**: Both t0 and t1 must have BCVA (strict mode for K=2)
- **Condition 2**: At least one target must have BCVA

## Comparison: K=2 vs K=3
| K | Observation Window | Expected Samples | Pros | Cons |
|---|-------------------|------------------|------|------|
| K=2 | 4 weeks (2 visits) | More | Larger sample size | Less trend info |
| K=3 | 8 weeks (3 visits) | ~313 | More trend info | Smaller sample size |
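
The "More" entry for K=2 follows from simple counting. The sketch below is illustrative only and assumes 4-week visit spacing, a longest horizon of 12 weeks, and a furthest-target cutoff of week 52 (the values used later in this notebook), and that K=3 was built with the same settings. The helper name `max_windows_per_eye` is hypothetical.

```python
def max_windows_per_eye(k, visit_interval=4, max_horizon=12, max_week=52):
    """Upper bound on windows per eye, ignoring missing BCVA values."""
    # The last observed visit is start + (k - 1) * visit_interval, and the
    # furthest target must satisfy last_visit + max_horizon <= max_week.
    latest_start = max_week - max_horizon - (k - 1) * visit_interval
    if latest_start < 0:
        return 0
    # Start weeks are assumed to lie on the visit grid 0, 4, 8, ...
    return latest_start // visit_interval + 1

for k in (2, 3):
    print(f"K={k}: at most {max_windows_per_eye(k)} windows per eye")
```

With the 40 eyes in this dataset, that bounds the counts at 400 (K=2) and 360 (K=3). The actual counts reported below (348 and 313) are lower because some visits have no BCVA.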

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

DATA_DIR = Path('.')

---
# Part 1: Load Merged Long Format Data

In [2]:
# Load the merged long format data (created in notebook 8)
merged = pd.read_csv(DATA_DIR / '8a_combined_data_long.csv')

print(f"Loaded merged data: {merged.shape}")
print(f"Patients: {merged['Patient_ID'].nunique()}")
print(f"Eyes: {merged.groupby(['Patient_ID', 'Eye']).ngroups}")
print(f"Weeks: {sorted(merged['Week'].unique())}")

Loaded merged data: (840, 35)
Patients: 40
Eyes: 40
Weeks: [np.int64(0), np.int64(4), np.int64(8), np.int64(12), np.int64(16), np.int64(20), np.int64(24), np.int64(28), np.int64(32), np.int64(36), np.int64(40), np.int64(44), np.int64(48), np.int64(52), np.int64(60), np.int64(68), np.int64(76), np.int64(84), np.int64(92), np.int64(100), np.int64(104)]


---
# Part 2: Create K=2 Sliding Windows (Strict Mode)

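- Condition 1: both t0 and t1 must have BCVA (strict mode).
- Condition 2: at least one target must have BCVA. The model is multi-horizon: it predicts all 3 horizons at once, so some targets may be missing. During training, missing targets are masked out of the loss and do not contribute to gradient updates.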
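
The masking described in Condition 2 could look like the following. This is a minimal NumPy sketch, not the training code used in this project. The function name `masked_mse` and the array layout (one column per horizon, missing targets stored as NaN) are assumptions.

```python
import numpy as np

def masked_mse(pred, target):
    """Mean squared error over observed targets only.

    pred, target: float arrays of shape (n_samples, n_horizons);
    missing targets are encoded as NaN.
    """
    mask = ~np.isnan(target)                    # True where a target was observed
    diff = np.where(mask, pred - target, 0.0)   # missing entries contribute 0
    n_observed = mask.sum()
    if n_observed == 0:
        return 0.0
    return float((diff ** 2).sum() / n_observed)

# Toy values, not study data: 3 of the 6 targets are observed.
pred = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
target = np.array([[1.0, 0.0, np.nan], [np.nan, np.nan, 2.0]])
loss = masked_mse(pred, target)
print(loss)
```

Dividing by the number of observed targets rather than by the full array size keeps the loss scale comparable between batches with different amounts of missingness.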

In [9]:
def create_sliding_windows_k2_strict(df, visit_interval=4, prediction_horizons=(4, 8, 12), max_week=52):
    """
    Create sliding window samples with K=2 (strict mode).
    
    Strategy 4 for K=2 (Strict):
    - Both t0 and t1 must have observed BCVA
    - At least one target must have BCVA
    
    Args:
        df: Long format dataframe
        visit_interval: Weeks between visits (default: 4)
        prediction_horizons: List of prediction horizons in weeks
        max_week: Maximum week for the furthest target
    
    Returns:
        DataFrame with sliding window samples
    
    Example:
        Input: [W0, W4] → Predict: W8 (4w), W12 (8w), W16 (12w)
    """
    k = 2
    windows = []
    
    for (patient_id, eye), group in df.groupby(['Patient_ID', 'Eye']):
        group = group.sort_values('Week')
        valid_weeks = set(group[group['BCVA'].notna()]['Week'].values)
        
        for start_week in group['Week'].values:
            window_weeks = [start_week + i * visit_interval for i in range(k)]
            t_last = window_weeks[-1]  # t1
            
            # Compute all target weeks
            target_weeks = {h: t_last + h for h in prediction_horizons}
            max_target_week = max(target_weeks.values())
            
            if max_target_week > max_week:
                continue
            
            # ========== K=2 strict conditions ==========
            # Condition 1: both t0 and t1 must have BCVA (strict mode)
            if window_weeks[0] not in valid_weeks or window_weeks[1] not in valid_weeks:
                continue
            
            # Condition 2: at least one target must have BCVA
            any_target_valid = any(tw in valid_weeks for tw in target_weeks.values())
            if not any_target_valid:
                continue
            # =====================================
            
            window_data = {
                'Patient_ID': patient_id,
                'Eye': eye,
                'Window_Start_Week': start_week,
                'Window_End_Week': t_last,
                'Num_Observed_Visits': 2,  # always 2 in K=2 strict mode
            }
            
            # Static features
            first_visit = group[group['Week'] == window_weeks[0]].iloc[0]
            static_cols = ['Age', 'Gender', 'Diabetes_Type', 'Diabetes_Years', 'Baseline_HbA1c', 'BMI']
            for col in static_cols:
                if col in first_visit:
                    window_data[col] = first_visit[col]
            
            # Temporal features (only t0, t1)
            temporal_cols = ['BCVA', 'CST', 'Injection', 'Leakage_Index', 
                            'Fundus_Path', 'OCT_Paths', 'Num_B_scans']
            
            for i, week in enumerate(window_weeks):
                window_data[f'Week_t{i}'] = week
                window_data[f'BCVA_missing_t{i}'] = 0  # never missing in K=2 strict mode
                
                visit_data = group[group['Week'] == week].iloc[0]
                for col in temporal_cols:
                    if col in visit_data:
                        window_data[f'{col}_t{i}'] = visit_data[col]
            
            # Target BCVA values and changes
            t_last_bcva = group[group['Week'] == t_last].iloc[0]['BCVA']
            
            for h in prediction_horizons:
                tw = target_weeks[h]
                window_data[f'Target_Week_{h}w'] = tw
                if tw in valid_weeks:
                    target_bcva = group[group['Week'] == tw].iloc[0]['BCVA']
                    window_data[f'BCVA_Target_{h}w'] = target_bcva
                    window_data[f'BCVA_Change_{h}w'] = target_bcva - t_last_bcva
                else:
                    window_data[f'BCVA_Target_{h}w'] = np.nan
                    window_data[f'BCVA_Change_{h}w'] = np.nan
            
            windows.append(window_data)
    
    result_df = pd.DataFrame(windows)
    
    # Print summary statistics
    print(f"=== K=2 Sliding Window Creation Summary (Strict Mode + Multi-Horizon) ===")
    print(f"Total windows created: {len(result_df)}")
    print(f"\nAll windows have 2 observed visits (strict mode)")
    
    return result_df

In [10]:
# Create K=2 sliding windows
sliding_windows_k2 = create_sliding_windows_k2_strict(
    merged,
    visit_interval=4,
    prediction_horizons=[4, 8, 12],
    max_week=52
)

print(f"\n=== K=2 Target Statistics ===")
for h in [4, 8, 12]:
    valid = sliding_windows_k2[f'BCVA_Change_{h}w'].notna().sum()
    pct_ge5 = 100 * (sliding_windows_k2[f'BCVA_Change_{h}w'].abs() >= 5).mean()
    print(f"  {h}w: {valid} valid samples, ≥5 letters change = {pct_ge5:.1f}%")

=== K=2 Sliding Window Creation Summary (Strict Mode + Multi-Horizon) ===
Total windows created: 348

All windows have 2 observed visits (strict mode)

=== K=2 Target Statistics ===
  4w: 335 valid samples, ≥5 letters change = 13.8%
  8w: 332 valid samples, ≥5 letters change = 20.1%
  12w: 324 valid samples, ≥5 letters change = 21.8%
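
Note that the ≥5-letter percentages above are taken over all 348 windows, because a missing `BCVA_Change` compares as `False` in `abs() >= 5`. The sketch below shows the alternative of restricting the denominator to windows whose target was observed. It uses a small made-up series, and the helper `pct_ge_threshold` is hypothetical.

```python
import numpy as np
import pandas as pd

def pct_ge_threshold(change, threshold=5.0):
    """Percent of observed changes with |change| >= threshold (NaN excluded)."""
    observed = change.dropna()
    if observed.empty:
        return float("nan")
    return 100.0 * (observed.abs() >= threshold).mean()

# Toy data, not from the study: 4 observed changes and 1 missing.
change = pd.Series([6.0, -5.0, 1.0, 0.0, np.nan])
over_all = 100.0 * (change.abs() >= 5).mean()   # denominator = 5 rows
over_valid = pct_ge_threshold(change)           # denominator = 4 observed rows
print(over_all, over_valid)
```

For the 4-week horizon here, 13.8% of 348 windows is about 48 windows, which would be roughly 14.3% of the 335 valid ones, so the choice of denominator shifts the figure only slightly.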


---
# Part 3: Feature Engineering for K=2

In [5]:
def add_engineered_features_k2(df):
    """
    Add engineered features for K=2 sliding windows.
    
    Features:
    - BCVA_Slope: (BCVA_t1 - BCVA_t0) / (Week_t1 - Week_t0)
    - CST_Slope: (CST_t1 - CST_t0) / (Week_t1 - Week_t0)
    - BCVA_Delta: BCVA_t1 - BCVA_t0
    - CST_Delta: CST_t1 - CST_t0
    - Injection_Count: total injections in window
    """
    
    # Make sure CST is numeric
    for i in range(2):
        col = f'CST_t{i}'
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # 1. Slope (t0 → t1): normalized rate of change per week
    df['BCVA_Slope'] = (df['BCVA_t1'] - df['BCVA_t0']) / (df['Week_t1'] - df['Week_t0'])
    df['CST_Slope'] = (df['CST_t1'] - df['CST_t0']) / (df['Week_t1'] - df['Week_t0'])
    
    # 2. Delta (t0 → t1): absolute change
    df['BCVA_Delta'] = df['BCVA_t1'] - df['BCVA_t0']
    df['CST_Delta'] = df['CST_t1'] - df['CST_t0']
    
    # 3. Injection count
    df['Injection_Count'] = df['Injection_t0'].fillna(0) + df['Injection_t1'].fillna(0)
    
    return df

# Apply feature engineering
sliding_windows_k2 = add_engineered_features_k2(sliding_windows_k2)

print("=== Engineered Features Added ===")
eng_cols = ['BCVA_Slope', 'CST_Slope', 'BCVA_Delta', 'CST_Delta', 'Injection_Count']
print(sliding_windows_k2[eng_cols].describe().round(2))

=== Engineered Features Added ===
       BCVA_Slope  CST_Slope  BCVA_Delta  CST_Delta  Injection_Count
count      348.00     348.00      348.00     348.00           348.00
mean         0.09      -0.47        0.37      -1.87             1.16
std          0.84       2.32        3.37       9.27             0.81
min         -2.50     -12.75      -10.00     -51.00             0.00
25%         -0.25      -1.00       -1.00      -4.00             0.00
50%          0.00      -0.25        0.00      -1.00             1.00
75%          0.50       0.50        2.00       2.00             2.00
max          3.50      13.25       14.00      53.00             2.00
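
Because strict mode fixes the gap between t0 and t1 at `visit_interval` = 4 weeks, each slope is simply the corresponding delta divided by 4 (compare the `describe()` rows above: 0.37 / 4 ≈ 0.09 and 14.00 / 4 = 3.50). The following is a small check on toy values, assuming the column names produced above.

```python
import numpy as np
import pandas as pd

# Toy windows, not study data; weeks are always 4 apart, as in strict K=2.
toy = pd.DataFrame({
    "Week_t0": [0, 4, 8],
    "Week_t1": [4, 8, 12],
    "BCVA_t0": [70.0, 71.0, 69.0],
    "BCVA_t1": [71.0, 69.0, 73.0],
})
toy["BCVA_Delta"] = toy["BCVA_t1"] - toy["BCVA_t0"]
toy["BCVA_Slope"] = toy["BCVA_Delta"] / (toy["Week_t1"] - toy["Week_t0"])

# With a constant 4-week gap the two features are perfectly collinear.
is_redundant = bool(np.allclose(toy["BCVA_Slope"] * 4, toy["BCVA_Delta"]))
print(is_redundant)
```

If the same holds on the real data, the slope and delta columns carry identical information, so a downstream model may only need one of each pair.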


---
# Part 4: Check Missing Values

In [6]:
# Check missing values
print("=== Missing Values in K=2 Sliding Windows ===")
missing = sliding_windows_k2.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)

print(f"\n=== Summary ===")
print(f"Total columns: {len(sliding_windows_k2.columns)}")
print(f"Columns with missing: {len(missing)}")
if len(missing) > 0:
    print(f"Max missing rate: {100 * missing.max() / len(sliding_windows_k2):.1f}%")

=== Missing Values in K=2 Sliding Windows ===
BCVA_Target_12w     24
BCVA_Change_12w     24
BCVA_Target_8w      16
BCVA_Change_8w      16
BCVA_Target_4w      13
BCVA_Change_4w      13
BMI                 10
Leakage_Index_t0     5
Leakage_Index_t1     5
dtype: int64

=== Summary ===
Total columns: 43
Columns with missing: 9
Max missing rate: 6.9%


In [8]:
print('Sliding Windows k=2 cols:', sliding_windows_k2.columns.tolist())
sliding_windows_k2

Sliding Windows k=2 cols: ['Patient_ID', 'Eye', 'Window_Start_Week', 'Window_End_Week', 'Num_Observed_Visits', 'Age', 'Gender', 'Diabetes_Type', 'Diabetes_Years', 'Baseline_HbA1c', 'BMI', 'Week_t0', 'BCVA_missing_t0', 'BCVA_t0', 'CST_t0', 'Injection_t0', 'Leakage_Index_t0', 'Fundus_Path_t0', 'OCT_Paths_t0', 'Num_B_scans_t0', 'Week_t1', 'BCVA_missing_t1', 'BCVA_t1', 'CST_t1', 'Injection_t1', 'Leakage_Index_t1', 'Fundus_Path_t1', 'OCT_Paths_t1', 'Num_B_scans_t1', 'Target_Week_4w', 'BCVA_Target_4w', 'BCVA_Change_4w', 'Target_Week_8w', 'BCVA_Target_8w', 'BCVA_Change_8w', 'Target_Week_12w', 'BCVA_Target_12w', 'BCVA_Change_12w', 'BCVA_Slope', 'CST_Slope', 'BCVA_Delta', 'CST_Delta', 'Injection_Count']


Unnamed: 0,Patient_ID,Eye,Window_Start_Week,Window_End_Week,Num_Observed_Visits,Age,Gender,Diabetes_Type,Diabetes_Years,Baseline_HbA1c,...,BCVA_Target_8w,BCVA_Change_8w,Target_Week_12w,BCVA_Target_12w,BCVA_Change_12w,BCVA_Slope,CST_Slope,BCVA_Delta,CST_Delta,Injection_Count
0,01-001,OS,0,4,2,44,M,2,20,7.1,...,98.0,0.0,16,97.0,-1.0,0.25,-1.75,1.0,-7.0,2.0
1,01-001,OS,4,8,2,44,M,2,20,7.1,...,97.0,0.0,20,96.0,-1.0,-0.25,0.00,-1.0,0.0,2.0
2,01-001,OS,8,12,2,44,M,2,20,7.1,...,96.0,-2.0,24,98.0,0.0,0.25,-0.75,1.0,-3.0,2.0
3,01-001,OS,12,16,2,44,M,2,20,7.1,...,98.0,1.0,28,96.0,-1.0,-0.25,0.50,-1.0,2.0,1.0
4,01-001,OS,16,20,2,44,M,2,20,7.1,...,96.0,0.0,32,97.0,1.0,-0.25,0.75,-1.0,3.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
343,02-046,OD,8,12,2,48,M,1,28,7.8,...,92.0,2.0,24,89.0,-1.0,-0.25,-2.75,-1.0,-11.0,1.0
344,02-046,OD,12,16,2,48,M,1,28,7.8,...,89.0,-5.0,28,92.0,-2.0,1.00,0.00,4.0,0.0,0.0
345,02-046,OD,16,20,2,48,M,1,28,7.8,...,92.0,0.0,32,,,-0.50,-0.50,-2.0,-2.0,1.0
346,02-046,OD,20,24,2,48,M,1,28,7.8,...,,,36,,,-0.75,-1.75,-3.0,-7.0,1.0


In [None]:
# sliding_windows_k2['Num_Observed_Visits'].value_counts()
sliding_windows_k2.drop(columns=['Num_Observed_Visits'], inplace=True)
sliding_windows_k2


---
# Part 5: Train/Val/Test Split (Patient-Level)

In [16]:
from sklearn.model_selection import train_test_split

# Get unique patients
unique_patients = sliding_windows_k2['Patient_ID'].unique()
print(f"Total unique patients: {len(unique_patients)}")

# Split patients: 70% train, 15% val, 15% test
# Use same random_state as K=3 for fair comparison
train_patients, temp_patients = train_test_split(unique_patients, test_size=0.3, random_state=42)
val_patients, test_patients = train_test_split(temp_patients, test_size=0.5, random_state=42)

print(f"Train patients: {len(train_patients)}")
print(f"Val patients: {len(val_patients)}")
print(f"Test patients: {len(test_patients)}")

# Assign split
train_set = set(train_patients)
val_set = set(val_patients)

sliding_windows_k2['Split'] = sliding_windows_k2['Patient_ID'].apply(
    lambda x: 'train' if x in train_set else ('val' if x in val_set else 'test')
)

# Summary statistics
print("\n=== Sample Distribution by Split ===")
for split in ['train', 'val', 'test']:
    n = (sliding_windows_k2['Split'] == split).sum()
    pct = 100 * n / len(sliding_windows_k2)
    print(f"  {split}: {n} samples ({pct:.1f}%)")

ModuleNotFoundError: No module named 'sklearn'
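
The cell above failed because scikit-learn is not installed in this environment, so no `Split` column was created. Installing it (`pip install scikit-learn`) and rerunning is the simplest fix. As an alternative, a patient-level split can be sketched with NumPy alone. This substitute will not reproduce the patient assignment that `train_test_split(..., random_state=42)` produces, so it would not match the K=3 split. The helper `split_patients` is hypothetical.

```python
import numpy as np

def split_patients(patient_ids, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle unique patient IDs and cut them into train/val/test lists."""
    unique_ids = np.array(sorted(set(patient_ids)))
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    n_test = int(round(len(shuffled) * test_frac))
    n_val = int(round(len(shuffled) * val_frac))
    test = shuffled[:n_test].tolist()
    val = shuffled[n_test:n_test + n_val].tolist()
    train = shuffled[n_test + n_val:].tolist()
    return train, val, test

# Toy IDs, not the study's patients; each patient appears in several rows.
ids = [f"P{i:02d}" for i in range(40)] * 3
train, val, test = split_patients(ids)
print(len(train), len(val), len(test))
```

Splitting on unique patient IDs rather than on rows keeps all windows from one patient in the same partition, which is what prevents leakage between train and test.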

---
# Part 6: Save Dataset

In [17]:
# Save K=2 dataset
sliding_windows_k2.to_csv(DATA_DIR / '9_sliding_windows_k2_strict_multi_horizon.csv', index=False)
print(f"  Shape: {sliding_windows_k2.shape}")
print(f"  Columns: {len(sliding_windows_k2.columns)}")

  Shape: (348, 37)
  Columns: 37


In [18]:
# Show all columns
print("=== All Columns ===")
for i, col in enumerate(sliding_windows_k2.columns):
    print(f"  {i+1}. {col}")

=== All Columns ===
  1. Patient_ID
  2. Eye
  3. Window_Start_Week
  4. Window_End_Week
  5. Age
  6. Gender
  7. Diabetes_Type
  8. Diabetes_Years
  9. Baseline_HbA1c
  10. BMI
  11. Week_t0
  12. BCVA_missing_t0
  13. BCVA_t0
  14. CST_t0
  15. Injection_t0
  16. Leakage_Index_t0
  17. Fundus_Path_t0
  18. OCT_Paths_t0
  19. Num_B_scans_t0
  20. Week_t1
  21. BCVA_missing_t1
  22. BCVA_t1
  23. CST_t1
  24. Injection_t1
  25. Leakage_Index_t1
  26. Fundus_Path_t1
  27. OCT_Paths_t1
  28. Num_B_scans_t1
  29. Target_Week_4w
  30. BCVA_Target_4w
  31. BCVA_Change_4w
  32. Target_Week_8w
  33. BCVA_Target_8w
  34. BCVA_Change_8w
  35. Target_Week_12w
  36. BCVA_Target_12w
  37. BCVA_Change_12w


---
# Part 7: K=2 vs K=3 Comparison

In [20]:
# Load K=3 dataset for comparison
try:
    sliding_windows_k3 = pd.read_csv('8c_sliding_window_dataset.csv')
    
    print("=== K=2 vs K=3 Comparison ===")
    print(f"{'Metric':<30} {'K=2':<15} {'K=3':<15}")
    print(f"{'-'*60}")
    print(f"{'Total samples':<30} {len(sliding_windows_k2):<15} {len(sliding_windows_k3):<15}")
    print(f"{'Observation window':<30} {'4 weeks':<15} {'8 weeks':<15}")
    print(f"{'Visits in window':<30} {'2':<15} {'3':<15}")
    
    print(f"\n{'Target Statistics:':<30}")
    for h in [4, 8, 12]:
        k2_valid = sliding_windows_k2[f'BCVA_Change_{h}w'].notna().sum()
        k3_valid = sliding_windows_k3[f'BCVA_Change_{h}w'].notna().sum()
        k2_pct = 100 * (sliding_windows_k2[f'BCVA_Change_{h}w'].abs() >= 5).mean()
        k3_pct = 100 * (sliding_windows_k3[f'BCVA_Change_{h}w'].abs() >= 5).mean()
        print(f"  {h}w valid samples:          {k2_valid:<15} {k3_valid:<15}")
        print(f"  {h}w ≥5 letters:             {f'{k2_pct:.1f}%':<15} {f'{k3_pct:.1f}%':<15}")
    
    
    
        
except FileNotFoundError:
    print("K=3 dataset not found. Run notebook 8 first to create it.")

=== K=2 vs K=3 Comparison ===
Metric                         K=2             K=3            
------------------------------------------------------------
Total samples                  348             313            
Observation window             4 weeks         8 weeks        
Visits in window               2               3              

Target Statistics:            
  4w valid samples:          335             304            
  4w ≥5 letters:             13.8%           14.4%          
  8w valid samples:          332             298            
  8w ≥5 letters:             20.1%           18.8%          
  12w valid samples:          324             290            
  12w ≥5 letters:             21.8%           21.1%          


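- Sample size: K=2 yields more samples (348 vs. 313) because the shorter observation window fits more windows per eye.
- Clinically significant change (≥5 letters): the two datasets are close, differing by at most about 1.3 percentage points (at the 8-week horizon).
- Recommendation: use K=3 as the primary model, since it carries richer temporal information, and run K=2 as a comparison experiment.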


---
# Summary

## K=2 Dataset: `9_sliding_windows_k2_strict_multi_horizon.csv`

### Configuration
- **Window Size**: K=2 (4-week observation period)
- **Strategy**: Strict mode (both t0 and t1 must have BCVA)
- **Prediction Horizons**: 4w, 8w, 12w
- **Max Week**: 52 (training data cutoff; the furthest target week may not exceed it)

### Features
- **Static**: Age, Gender, Diabetes_Type, Diabetes_Years, Baseline_HbA1c, BMI
- **Temporal (t0, t1)**: BCVA, CST, Injection, Leakage_Index
- **Engineered**: BCVA_Slope, CST_Slope, BCVA_Delta, CST_Delta, Injection_Count
- **Image Paths**: Fundus_Path, OCT_Paths (for future multimodal model)

### Key Differences from K=3
| Aspect | K=2 | K=3 |
|--------|-----|-----|
| Observation period | 4 weeks | 8 weeks |
| Missing strategy | Strict (no missing allowed) | Lenient (≥2 observed) |
| Sample size | Larger | Smaller |
| Trend information | Less (only 1 change) | More (2 changes) |

### Next Steps
1. Train baseline models on both K=2 and K=3 datasets
2. Compare performance to determine optimal window size
3. Use the better configuration for multimodal model