# Leakage Index Data Processing

**Objective**: Extract Leakage Index from DRSS.csv (exclude the DRSS Level and Change columns)

**Input**: `/mnt/project/DRSS.csv`

**Output**: `6_Leakage_Index_processed.csv`

**Column Pattern in DRSS.csv**:
- Col 0: Patient ID
- Cols 1-2: unnamed (parsed as `Unnamed: 1`, `Unnamed: 2`), not used
- Col 3: Eye
- Col 4: Arm
- Col 5: DRSS Level (Screen) - skip
- Col 6: Leakage Index (Screen) ✓
- Col 7: DRSS Level (Week4) - skip
- Col 8: Change - skip
- Col 9: Leakage Index (Week4) ✓
- Pattern repeats: from col 7 onward, each visit occupies 3 columns (DRSS Level, Change, Leakage Index), so Leakage Index falls at cols 6, 9, 12, ...
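
The positional rule above can be sketched as a small helper. This is only an illustration under the assumption that the layout is perfectly regular; `leakage_positions` is a hypothetical name, and the notebook itself selects columns by name rather than by position.

```python
def leakage_positions(n_cols, first=6, step=3):
    """Return the 0-based column positions expected to hold Leakage Index."""
    return list(range(first, n_cols, step))


# For a 67-column sheet (the raw shape reported by this notebook),
# the rule yields 21 positions: 6, 9, ..., 66.
positions = leakage_positions(67)
print(len(positions), positions[:3], positions[-1])  # 21 [6, 9, 12] 66
```

The count of 21 agrees with the number of `Leakage Index` columns found by name, which is a useful cross-check on the assumed layout.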

In [1]:
import pandas as pd
import numpy as np


## 1. Load Raw Data

In [2]:
# Load raw data: skip the first row and use the second row as the column header
drss_raw = pd.read_csv(r"C:\Users\ronny\Documents\GitHub\Baseline-multi-modal-prediction\raw data\DRSS.csv", skiprows=[0], header=0)

print(f"Raw shape: {drss_raw.shape}")
print(f"\nFirst 20 columns:")
print(drss_raw.columns.tolist()[:20])

Raw shape: (195, 67)

First 20 columns:
['Patient ID', 'Unnamed: 1', 'Unnamed: 2', 'Eye', 'Arm', 'DRSS Level', 'Leakage Index', 'DRSS Level.1', 'Change', 'Leakage Index.1', 'DRSS Level.2', 'Change.1', 'Leakage Index.2', 'DRSS Level.3', 'Change.2', 'Leakage Index.3', 'DRSS Level.4', 'Change.3', 'Leakage Index.4', 'DRSS Level.5']
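
The `.1`, `.2`, ... suffixes in the column list come from pandas renaming repeated header names when it parses the file. A minimal sketch on an in-memory toy CSV (the rows below are made up for illustration and are not the contents of DRSS.csv) shows the same behavior:

```python
import io

import pandas as pd

# Toy stand-in: one extra top row, then a header with repeated names.
toy_csv = io.StringIO(
    "Top row to skip,,,,\n"
    "Patient ID,DRSS Level,Leakage Index,DRSS Level,Leakage Index\n"
    "01-001,53,1.59,53,0.28\n"
)

# skiprows=[0] drops the first line; header=0 then reads the next line as names.
toy = pd.read_csv(toy_csv, skiprows=[0], header=0)
print(toy.columns.tolist())
# ['Patient ID', 'DRSS Level', 'Leakage Index', 'DRSS Level.1', 'Leakage Index.1']
```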


In [3]:
# Preview raw data
drss_raw.head(3)

Unnamed: 0,Patient ID,Unnamed: 1,Unnamed: 2,Eye,Arm,DRSS Level,Leakage Index,DRSS Level.1,Change,Leakage Index.1,...,Leakage Index.17,DRSS Level.18,Change.17,Leakage Index.18,DRSS Level.19,Change.18,Leakage Index.19,DRSS Level.20,Change.19,Leakage Index.20
0,01-001,,,OS,2.0,53,1.59,53,0,0.28,...,0.98,47.0,1.0,1.24,47.0,1.0,1.5,43.0,2.0,2.3
1,01-002,,,OD,2.0,47,3.56,47,0,0.23,...,,,,,,,,,,
2,01-013,,,OD,2.0,53,5.08,53,0,1.3,...,4.0,53.0,0.0,1.51,53.0,0.0,4.29,53.0,0.0,1.92


## 2. Identify Leakage Index Columns

Pattern:
- Screen: col 6 (Leakage Index)
- Week 4: col 9 (Leakage Index.1)
- Week 8: col 12 (Leakage Index.2)
- ... every 3rd column after col 6

In [4]:
# Find all Leakage Index columns
leakage_cols = [col for col in drss_raw.columns if 'Leakage' in str(col)]
print(f"Found {len(leakage_cols)} Leakage Index columns:")
print(leakage_cols)

Found 21 Leakage Index columns:
['Leakage Index', 'Leakage Index.1', 'Leakage Index.2', 'Leakage Index.3', 'Leakage Index.4', 'Leakage Index.5', 'Leakage Index.6', 'Leakage Index.7', 'Leakage Index.8', 'Leakage Index.9', 'Leakage Index.10', 'Leakage Index.11', 'Leakage Index.12', 'Leakage Index.13', 'Leakage Index.14', 'Leakage Index.15', 'Leakage Index.16', 'Leakage Index.17', 'Leakage Index.18', 'Leakage Index.19', 'Leakage Index.20']


In [5]:
# Define time points (matching other processed files)
time_points = ['Screen', 'Week4', 'Week8', 'Week12', 'Week16', 'Week20', 'Week24',
               'Week28', 'Week32', 'Week36', 'Week40', 'Week44', 'Week48', 'Week52',
               'Week60', 'Week68', 'Week76', 'Week84', 'Week92', 'Week100', 'Week104']

print(f"Number of time points: {len(time_points)}")
print(f"Number of Leakage Index columns: {len(leakage_cols)}")

Number of time points: 21
Number of Leakage Index columns: 21
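
The two counts need to match because `zip` silently stops at the shorter input, so a mismatch would drop columns without any error. One way to make that check explicit is sketched below; `build_rename_map` is a hypothetical helper, not part of the notebook.

```python
def build_rename_map(source_cols, time_points, prefix="Leakage_"):
    """Pair each source column with its time point, refusing mismatched lengths."""
    if len(source_cols) != len(time_points):
        raise ValueError(
            f"{len(source_cols)} columns but {len(time_points)} time points"
        )
    return {src: f"{prefix}{tp}" for src, tp in zip(source_cols, time_points)}


rename_map = build_rename_map(
    ["Leakage Index", "Leakage Index.1", "Leakage Index.2"],
    ["Screen", "Week4", "Week8"],
)
print(rename_map)
```

The resulting dictionary could also be passed to `DataFrame.rename(columns=...)` instead of assigning columns one by one.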


## 3. Extract and Rename Columns

In [6]:
# Select columns: Patient ID, Eye, Arm + all Leakage Index columns
# Note: Patient ID column may have different name due to header parsing
id_cols = drss_raw.columns[:5].tolist()  # First 5 columns include Patient ID, empty cols, Eye, Arm
print(f"ID columns: {id_cols}")

# Get actual column positions
col_patient_id = drss_raw.columns[0]  # 'Patient ID'
col_eye = drss_raw.columns[3]          # 'Eye'
col_arm = drss_raw.columns[4]          # 'Arm'

ID columns: ['Patient ID', 'Unnamed: 1', 'Unnamed: 2', 'Eye', 'Arm']


In [7]:
# Build the processed dataframe
leakage_processed = pd.DataFrame()

# Add identifier columns
leakage_processed['Patient_ID'] = drss_raw[col_patient_id]
leakage_processed['Eye'] = drss_raw[col_eye]
leakage_processed['Arm'] = drss_raw[col_arm]

# Add Leakage Index columns, renamed by time point
# (zip pairs the two lists in order; both have 21 entries)
for tp, leakage_col in zip(time_points, leakage_cols):
    leakage_processed[f'Leakage_{tp}'] = drss_raw[leakage_col]

print(f"Processed shape: {leakage_processed.shape}")
print(f"\nColumns: {leakage_processed.columns.tolist()}")

Processed shape: (195, 24)

Columns: ['Patient_ID', 'Eye', 'Arm', 'Leakage_Screen', 'Leakage_Week4', 'Leakage_Week8', 'Leakage_Week12', 'Leakage_Week16', 'Leakage_Week20', 'Leakage_Week24', 'Leakage_Week28', 'Leakage_Week32', 'Leakage_Week36', 'Leakage_Week40', 'Leakage_Week44', 'Leakage_Week48', 'Leakage_Week52', 'Leakage_Week60', 'Leakage_Week68', 'Leakage_Week76', 'Leakage_Week84', 'Leakage_Week92', 'Leakage_Week100', 'Leakage_Week104']


## 4. Clean Special Values

DRSS.csv contains special text values such as 'Missed', 'Dropped', 'Deceased', and 'LTFU(LOSS TO FOLLOW UP)'. Any non-numeric entry in a Leakage Index column is converted to NaN in this step.
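
How such text values behave under numeric coercion can be seen on a toy Series. The values below are illustrative only and are not taken from the dataset.

```python
import pandas as pd

toy = pd.Series(["1.59", "Missed", "0.28", "Dropped", None, "LTFU(LOSS TO FOLLOW UP)"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(toy, errors="coerce")
print(numeric.isna().tolist())  # [False, True, False, True, True, True]
```

Note that coercion cannot distinguish a special value from a genuinely empty cell: both end up as NaN.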

In [8]:
# Check for non-numeric values in Leakage columns
leakage_cols_new = [col for col in leakage_processed.columns if col.startswith('Leakage_')]

print("Non-numeric values found:")
for col in leakage_cols_new:
    raw = leakage_processed[col]
    # Non-numeric = present, non-blank, and not parseable as a number
    # (stripping '.' and '-' before isdigit() would accept strings like '01-001')
    is_blank = raw.astype(str).str.strip() == ''
    non_numeric = raw.notna() & ~is_blank & pd.to_numeric(raw, errors='coerce').isna()
    unique_non_numeric = raw[non_numeric].unique()
    if len(unique_non_numeric) > 0:
        print(f"  {col}: {unique_non_numeric}")

Non-numeric values found:
  Leakage_Week4: ['Arm 2' 'PDR']
  Leakage_Week8: ['W24' 'Arm 1' 'W52' 'NPDR']
  Leakage_Week16: ['Total']


In [9]:
# Convert Leakage Index columns to numeric (non-numeric becomes NaN)
for col in leakage_cols_new:
    leakage_processed[col] = pd.to_numeric(leakage_processed[col], errors='coerce')

# Remove rows with missing Patient_ID
leakage_processed = leakage_processed[leakage_processed['Patient_ID'].notna()].reset_index(drop=True)

# Filter to only valid Patient_IDs (format: XX-XXX, e.g., 01-001)
valid_patient_mask = leakage_processed['Patient_ID'].str.match(r'^\d{2}-\d{3}$', na=False)
leakage_processed = leakage_processed[valid_patient_mask].reset_index(drop=True)

print("Converted all Leakage Index columns to numeric (special values -> NaN)")
print(f"Filtered to {len(leakage_processed)} valid patients")

Converted all Leakage Index columns to numeric (special values -> NaN)
Filtered to 40 valid patients
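
The Patient_ID filter is what reduces the 195 raw rows to 40: any row whose first column does not look like `XX-XXX` is dropped. A minimal sketch of the same mask on made-up values:

```python
import pandas as pd

ids = pd.Series(["01-001", "Arm 2", None, "1-001", "01-0013", "02-114"])

# Two digits, a hyphen, three digits, and nothing else; missing IDs count as invalid.
mask = ids.str.match(r"^\d{2}-\d{3}$", na=False)
print(ids[mask].tolist())  # ['01-001', '02-114']
```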


## 5. Data Validation

In [10]:
# Basic info
print(f"Number of rows: {len(leakage_processed)}")
print(f"Unique Patient_IDs: {leakage_processed['Patient_ID'].nunique()}")
print(f"\nEye distribution:")
print(leakage_processed['Eye'].value_counts())

Number of rows: 40
Unique Patient_IDs: 40

Eye distribution:
Eye
OD    24
OS    16
Name: count, dtype: int64


In [None]:
# Check missing values per time point
print("Missing values per time point:")
for col in leakage_cols_new:
    missing = leakage_processed[col].isna().sum()
    pct = 100 * missing / len(leakage_processed)
    print(f"  {col}: {missing}/{len(leakage_processed)} ({pct:.1f}%)")

Missing values per time point:
  Leakage_Screen: 0/40 (0.0%)
  Leakage_Week4: 2/40 (5.0%)
  Leakage_Week8: 6/40 (15.0%)
  Leakage_Week12: 1/40 (2.5%)
  Leakage_Week16: 2/40 (5.0%)
  Leakage_Week20: 2/40 (5.0%)
  Leakage_Week24: 4/40 (10.0%)
  Leakage_Week28: 4/40 (10.0%)
  Leakage_Week32: 7/40 (17.5%)
  Leakage_Week36: 6/40 (15.0%)
  Leakage_Week40: 8/40 (20.0%)
  Leakage_Week44: 8/40 (20.0%)
  Leakage_Week48: 7/40 (17.5%)
  Leakage_Week52: 9/40 (22.5%)
  Leakage_Week60: 22/40 (55.0%)
  Leakage_Week68: 15/40 (37.5%)
  Leakage_Week76: 18/40 (45.0%)
  Leakage_Week84: 15/40 (37.5%)
  Leakage_Week92: 17/40 (42.5%)
  Leakage_Week100: 15/40 (37.5%)
  Leakage_Week104: 15/40 (37.5%)
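
The same per-column summary can be computed without an explicit loop. The sketch below uses a made-up toy frame, so the numbers are not those of the dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Leakage_Screen": [1.59, 3.56, 5.08, 2.34],
    "Leakage_Week4": [0.28, np.nan, 1.30, np.nan],
})

# isna().sum() counts missing cells per column; isna().mean() gives the fraction.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "pct": (100 * toy.isna().mean()).round(1),
})
print(summary)
```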


In [13]:
# Check value ranges
print("\nLeakage Index value ranges (first 5 time points):")
for col in leakage_cols_new[:5]:
    values = leakage_processed[col].dropna()
    if len(values) > 0:
        print(f"  {col}: min={values.min():.2f}, max={values.max():.2f}, mean={values.mean():.2f}")


Leakage Index value ranges (first 5 time points):
  Leakage_Screen: min=0.31, max=7.25, mean=2.98
  Leakage_Week4: min=0.01, max=3.40, mean=0.65
  Leakage_Week8: min=0.01, max=2.78, mean=0.61
  Leakage_Week12: min=0.01, max=3.42, mean=0.61
  Leakage_Week16: min=0.01, max=2.50, mean=0.63


## 6. Preview Final Output

In [14]:
leakage_processed

Unnamed: 0,Patient_ID,Eye,Arm,Leakage_Screen,Leakage_Week4,Leakage_Week8,Leakage_Week12,Leakage_Week16,Leakage_Week20,Leakage_Week24,...,Leakage_Week44,Leakage_Week48,Leakage_Week52,Leakage_Week60,Leakage_Week68,Leakage_Week76,Leakage_Week84,Leakage_Week92,Leakage_Week100,Leakage_Week104
0,01-001,OS,2.0,1.59,0.28,0.21,0.15,0.15,0.6,0.14,...,1.12,0.3,0.86,,1.26,1.31,0.98,1.24,1.5,2.3
1,01-002,OD,2.0,3.56,0.23,0.27,0.17,0.27,0.13,0.19,...,4.17,0.36,1.31,,,,,,,
2,01-013,OD,2.0,5.08,1.3,0.9,0.48,0.83,0.53,0.55,...,0.91,2.43,3.63,4.37,1.07,2.54,4.0,1.51,4.29,1.92
3,01-014,OS,2.0,2.34,0.68,0.37,0.32,0.86,,4.21,...,1.23,6.19,1.59,4.76,4.34,2.79,1.86,1.6,1.48,1.49
4,01-023,OD,2.0,4.67,0.95,,2.35,0.77,0.76,0.79,...,0.66,0.76,4.92,,2.92,,2.86,0.27,,
5,01-027,OS,2.0,6.26,0.42,0.6,0.33,0.3,0.2,0.22,...,3.24,0.64,1.22,2.94,2.76,2.99,2.41,3.16,2.43,1.77
6,01-028,OS,2.0,4.16,1.08,1.28,0.95,1.29,0.99,0.9,...,2.04,2.08,3.02,2.19,0.17,1.59,0.41,1.57,,
7,01-035,OD,2.0,1.13,0.62,0.67,0.17,0.52,0.15,0.16,...,1.13,0.72,0.99,1.98,0.75,1.6,0.41,,1.88,1.35
8,01-038,OS,2.0,6.42,0.41,0.15,0.28,0.17,0.12,0.04,...,5.11,0.64,0.8,1.14,5.39,0.62,1.08,4.52,0.12,0.99
9,01-047,OD,2.0,3.43,1.71,1.21,1.3,1.56,1.99,2.21,...,1.6,1.77,1.79,,3.52,1.16,2.12,3.39,1.73,1.28


In [15]:
# Verify sample patient
print("Sample patient (01-001):")
sample = leakage_processed[leakage_processed['Patient_ID'] == '01-001'][['Patient_ID', 'Eye', 'Leakage_Screen', 'Leakage_Week4', 'Leakage_Week8']]
print(sample.to_string(index=False))

Sample patient (01-001):
Patient_ID Eye  Leakage_Screen  Leakage_Week4  Leakage_Week8
    01-001  OS            1.59           0.28           0.21


## 7. Save Processed Data

In [16]:
# Save to CSV
output_path = r'c:\Users\ronny\Documents\GitHub\Longitudinal Prediction\6_Leakage_Index_processed.csv'
leakage_processed.to_csv(output_path, index=False)

print(f"✓ Saved to: {output_path}")
print(f"  Shape: {leakage_processed.shape}")
print(f"  Columns: {len(leakage_processed.columns)}")

✓ Saved to: c:\Users\ronny\Documents\GitHub\Longitudinal Prediction\6_Leakage_Index_processed.csv
  Shape: (40, 24)
  Columns: 24


## Summary

**Extracted**: Leakage Index at all 21 time points

**Excluded**: DRSS Level, Change columns

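**Special values**: Non-numeric entries converted to NaN (Missed, Dropped, Deceased, LTFU, etc.)

**Rows**: 195 raw rows filtered to 40 patients with a valid Patient_ID (format XX-XXX)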

**Format**: Patient_ID, Eye, Arm, Leakage_Screen, Leakage_Week4, ..., Leakage_Week104
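
A downstream script may want to confirm that a file follows this layout before using it. The check below is a sketch only; `check_layout` is a hypothetical helper, and the expected column list is rebuilt from the time points defined in this notebook.

```python
TIME_POINTS = [
    "Screen", "Week4", "Week8", "Week12", "Week16", "Week20", "Week24",
    "Week28", "Week32", "Week36", "Week40", "Week44", "Week48", "Week52",
    "Week60", "Week68", "Week76", "Week84", "Week92", "Week100", "Week104",
]
EXPECTED_COLUMNS = ["Patient_ID", "Eye", "Arm"] + [f"Leakage_{tp}" for tp in TIME_POINTS]


def check_layout(columns):
    """Return (missing, unexpected) column names relative to the documented layout."""
    cols = list(columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in cols]
    unexpected = [c for c in cols if c not in EXPECTED_COLUMNS]
    return missing, unexpected


print(len(EXPECTED_COLUMNS))  # 24
```

Calling `check_layout(df.columns)` on a loaded DataFrame returns two empty lists when the layout matches.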