# Data CSV → Extract Injection columns

**Objective**: Extract Injection (Yes/No) columns from Data.csv

**Input**: `/mnt/project/Data.csv`

**Output**: `4 Injection processed.csv` with binary injection status per visit

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 50)

## 1. Load and Explore Raw Data

In [2]:
# Load raw data (skip first header row)
data_raw = pd.read_csv(r'c:\Users\ronny\Documents\GitHub\Longitudinal Prediction\raw data\Data.csv', skiprows=1)

print(f"Raw data shape: {data_raw.shape}")
print(f"\nColumn pattern (first 20):")
for i, col in enumerate(data_raw.columns[:20]):
    print(f"  {i}: {col}")

Raw data shape: (40, 108)

Column pattern (first 20):
  0: Patient ID
  1: Eye
  2: Arm
  3: ETDRS BCVA
  4: Injection
  5: Fellow Eye Injection
  6: Aqueous Sample
  7: ETDRS BCVA.1
  8: Injection.1
  9: Fellow Eye Injection.1
  10: Aqueous Sample.1
  11: BCVA Change
  12: ETDRS BCVA.2
  13: Injection.2
  14: Fellow Eye Injection.2
  15: Aqueous Sample.2
  16: BCVA Change.1
  17: ETDRS BCVA.3
  18: Injection.3
  19: Fellow Eye Injection.3


In [None]:
# Understand the column structure
# Screen: BCVA, Injection, Fellow Eye Injection, Aqueous Sample (4 cols)
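# Week 4+: BCVA, Injection, Fellow Eye Injection, Aqueous Sample, Change (5 cols)
# Injection: whether the study eye was injected. The study eye is the tracked eye;
#   BCVA, CST and the other metrics are measured on it.
# Fellow Eye Injection: whether the fellow eye (the same patient's other eye) was injected.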
# Week 40 has extra 'Date' column

# Find all Injection columns
injection_cols = [col for col in data_raw.columns if 'Injection' in col and 'Fellow' not in col]
print(f"Injection columns found: {len(injection_cols)}")
print(injection_cols)

Injection columns found: 21
['Injection', 'Injection.1', 'Injection.2', 'Injection.3', 'Injection.4', 'Injection.5', 'Injection.6', 'Injection.7', 'Injection.8', 'Injection.9', 'Injection.10', 'Injection.11', 'Injection.12', 'Injection.13', 'Injection.14', 'Injection.15', 'Injection.16', 'Injection.17', 'Injection.18', 'Injection.19', 'Injection.20']


## 2. Map Injection Columns to Time Points

Column structure per time point:
- Screen (W0): BCVA, **Injection**, Fellow Eye Injection, Aqueous Sample
- Week 4+: BCVA, **Injection**, Fellow Eye Injection, Aqueous Sample, Change
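
The `.1`, `.2`, ... suffixes seen above are how pandas disambiguates repeated header names when reading a CSV. The sketch below reproduces that behaviour on a toy file. The two-row header layout and the visit labels in its first row are assumptions made for illustration, not the actual contents of Data.csv.

```python
import io

import pandas as pd

# Hypothetical two-row header: the first row carries visit labels, the second
# row repeats the same measurement names for every visit.
toy_csv = io.StringIO(
    "ID,Screen,Screen,Week 4,Week 4\n"
    "Patient ID,ETDRS BCVA,Injection,ETDRS BCVA,Injection\n"
    "01-001,55,Yes,60,No\n"
)

# skiprows=1 drops the visit-label row; pandas then renames repeated headers
# by appending .1, .2, ... in left-to-right order.
toy = pd.read_csv(toy_csv, skiprows=1)
toy_columns = toy.columns.tolist()
print(toy_columns)
```

Because the suffix only encodes the order of appearance, the mapping to visits relies on the columns being laid out chronologically in the file.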

In [4]:
# Define time points
time_points = ['Screen', 'Week4', 'Week8', 'Week12', 'Week16', 'Week20', 'Week24', 
               'Week28', 'Week32', 'Week36', 'Week40', 'Week44', 'Week48', 'Week52',
               'Week60', 'Week68', 'Week76', 'Week84', 'Week92', 'Week100', 'Week104']

# Injection columns are named: Injection, Injection.1, Injection.2, ...
# Map them to time points
injection_col_map = {}
for i, tp in enumerate(time_points):
    if i == 0:
        col_name = 'Injection'
    else:
        col_name = f'Injection.{i}'
    
    if col_name in data_raw.columns:
        injection_col_map[tp] = col_name
    else:
        print(f"Warning: {col_name} not found for {tp}")

print(f"\nMapped {len(injection_col_map)} time points:")
for tp, col in list(injection_col_map.items())[:5]:
    print(f"  {tp} -> {col}")


Mapped 21 time points:
  Screen -> Injection
  Week4 -> Injection.1
  Week8 -> Injection.2
  Week12 -> Injection.3
  Week16 -> Injection.4


## 3. Extract Injection Data

In [5]:
# Create processed dataframe
injection_processed = pd.DataFrame()

# Add patient info columns
injection_processed['Patient_ID'] = data_raw['Patient ID']
injection_processed['Eye'] = data_raw['Eye']
injection_processed['Arm'] = data_raw['Arm']

# Add injection columns for each time point
for tp in time_points:
    if tp in injection_col_map:
        raw_col = injection_col_map[tp]
        injection_processed[tp] = data_raw[raw_col]

print(f"Processed shape: {injection_processed.shape}")
injection_processed.head()

Processed shape: (40, 24)


Unnamed: 0,Patient_ID,Eye,Arm,Screen,Week4,Week8,Week12,Week16,Week20,Week24,Week28,Week32,Week36,Week40,Week44,Week48,Week52,Week60,Week68,Week76,Week84,Week92,Week100,Week104
0,01-001,OS,2,Yes,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,No,Yes,No,Yes,,Yes,Yes,Yes,Yes,Yes,No
1,01-002,OD,2,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No,Yes,No,No,Yes,No,No,,,,,,,
2,01-013,OD,2,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,Yes,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No
3,01-014,OS,2,Yes,Yes,Yes,No,Yes,,Yes,Yes,Yes,Yes,No,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No
4,01-023,OD,2,Yes,Yes,,Yes,Yes,No,No,Yes,No,No,Yes,No,No,Yes,,Yes,,Yes,No,,


## 4. Convert Yes/No to Binary (1/0)

In [6]:
# Check unique values in injection columns
print("Unique values in injection columns:")
for tp in time_points[:5]:
    if tp in injection_processed.columns:
        print(f"  {tp}: {injection_processed[tp].unique()}")

Unique values in injection columns:
  Screen: ['Yes']
  Week4: ['Yes' 'No']
  Week8: ['Yes' nan 'No']
  Week12: ['Yes' 'No' nan]
  Week16: ['No' 'Yes' nan]


In [7]:
# Convert Yes/No to 1/0 for all time point columns
def convert_yes_no(val):
    """Convert Yes/No to 1/0, keep NaN as NaN"""
    if pd.isna(val):
        return np.nan
    if str(val).strip().lower() == 'yes':
        return 1
    if str(val).strip().lower() == 'no':
        return 0
    return np.nan  # Handle unexpected values

# Apply conversion to time point columns only
for tp in time_points:
    if tp in injection_processed.columns:
        injection_processed[tp] = injection_processed[tp].apply(convert_yes_no)

print("After conversion:")
injection_processed.head()

After conversion:


Unnamed: 0,Patient_ID,Eye,Arm,Screen,Week4,Week8,Week12,Week16,Week20,Week24,Week28,Week32,Week36,Week40,Week44,Week48,Week52,Week60,Week68,Week76,Week84,Week92,Week100,Week104
0,01-001,OS,2,1,1,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,,1.0,1.0,1.0,1.0,1.0,0.0
1,01-002,OD,2,1,1,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,
2,01-013,OD,2,1,1,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,01-014,OS,2,1,1,1.0,0.0,1.0,,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,01-023,OD,2,1,1,,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,,1.0,,1.0,0.0,,
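
In the preview above, `Screen` and `Week4` print as `1` while later visits print as `1.0`, because a column containing NaN is stored as float. One possible alternative, sketched here under the assumption that integer output is preferred, is a dictionary lookup followed by a cast to pandas' nullable `Int64` dtype. The function name `yes_no_to_binary` is hypothetical.

```python
import pandas as pd


def yes_no_to_binary(series: pd.Series) -> pd.Series:
    """Map Yes/No (any case, surrounding spaces ignored) to 1/0; anything else becomes <NA>."""
    # Normalise strings only, so missing values pass through unchanged.
    cleaned = series.map(lambda v: v.strip().lower() if isinstance(v, str) else v)
    # Keys absent from the dictionary (missing or unexpected values) map to NaN.
    mapped = cleaned.map({"yes": 1, "no": 0})
    # Nullable integer dtype keeps 1/0 as integers alongside missing values.
    return mapped.astype("Int64")


toy = pd.Series(["Yes", " no ", None, "maybe"])
converted = yes_no_to_binary(toy)
print(converted)
```

Note that `Int64` columns are written to CSV as `1` and `0` with empty cells for missing values, which differs from the `1.0` / `0.0` formatting shown in this notebook.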


## 5. Validation

In [8]:
# Check injection counts per time point
print("Injection counts per time point:")
for tp in time_points:
    if tp in injection_processed.columns:
        counts = injection_processed[tp].value_counts(dropna=False)
        n_yes = counts.get(1, 0)
        n_no = counts.get(0, 0)
        n_missing = injection_processed[tp].isna().sum()
        print(f"  {tp}: Yes={n_yes}, No={n_no}, Missing={n_missing}")

Injection counts per time point:
  Screen: Yes=40, No=0, Missing=0
  Week4: Yes=37, No=3, Missing=0
  Week8: Yes=33, No=3, Missing=4
  Week12: Yes=27, No=12, Missing=1
  Week16: Yes=23, No=16, Missing=1
  Week20: Yes=19, No=19, Missing=2
  Week24: Yes=13, No=23, Missing=4
  Week28: Yes=12, No=24, Missing=4
  Week32: Yes=12, No=21, Missing=7
  Week36: Yes=13, No=21, Missing=6
  Week40: Yes=7, No=25, Missing=8
  Week44: Yes=9, No=23, Missing=8
  Week48: Yes=8, No=25, Missing=7
  Week52: Yes=13, No=18, Missing=9
  Week60: Yes=8, No=10, Missing=22
  Week68: Yes=19, No=6, Missing=15
  Week76: Yes=9, No=13, Missing=18
  Week84: Yes=13, No=12, Missing=15
  Week92: Yes=15, No=8, Missing=17
  Week100: Yes=12, No=13, Missing=15
  Week104: Yes=0, No=25, Missing=15


In [9]:
# Calculate total injections per patient (sum skips NaN, so missing visits add 0)
tp_cols = [tp for tp in time_points if tp in injection_processed.columns]
injection_processed['Total_Injections'] = injection_processed[tp_cols].sum(axis=1)

print("Total injections per patient:")
print(injection_processed['Total_Injections'].describe())

# Remove the summary column for final output
injection_processed = injection_processed.drop('Total_Injections', axis=1)

Total injections per patient:
count    40.000000
mean      8.550000
std       3.366121
min       3.000000
25%       6.750000
50%       8.500000
75%      10.000000
max      17.000000
Name: Total_Injections, dtype: float64
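
Since `sum(axis=1)` skips NaN, a patient with many missing visits can look like a patient who received few injections. The sketch below, on made-up toy values, reports the number of recorded visits next to the total so the two cases can be told apart. The column names `Visits_Recorded` and `Injection_Rate` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy binary injection table (values are invented for illustration).
toy = pd.DataFrame({
    "Screen": [1, 1, 1],
    "Week4": [1, np.nan, 0],
    "Week8": [np.nan, np.nan, 1],
})

total_injections = toy.sum(axis=1)         # NaN is skipped by default
visits_recorded = toy.notna().sum(axis=1)  # visits with a Yes/No entry

summary = pd.DataFrame({
    "Total_Injections": total_injections,
    "Visits_Recorded": visits_recorded,
    # Share of recorded visits at which an injection was given.
    "Injection_Rate": total_injections / visits_recorded,
})
print(summary)
```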


In [10]:
# Verify patient alignment with BCVA
bcva = pd.read_csv(r'c:\Users\ronny\Documents\GitHub\Longitudinal Prediction\1 BCVA_processed.csv')

bcva_patients = set(bcva['Patient_ID'])
inj_patients = set(injection_processed['Patient_ID'])

print(f"BCVA patients: {len(bcva_patients)}")
print(f"Injection patients: {len(inj_patients)}")
print(f"Overlap: {len(bcva_patients & inj_patients)}")

BCVA patients: 40
Injection patients: 40
Overlap: 40
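
The overlap count confirms agreement here, but when the numbers differ it helps to see which IDs are responsible. Converting to a set also hides repeated IDs. A small sketch of a more detailed comparison follows. The helper `compare_patient_ids` and the toy IDs are hypothetical.

```python
from collections import Counter


def compare_patient_ids(left_ids, right_ids):
    """Report IDs found in only one table and IDs repeated within a table."""
    left, right = list(left_ids), list(right_ids)
    return {
        "only_left": sorted(set(left) - set(right)),
        "only_right": sorted(set(right) - set(left)),
        "duplicated_left": sorted(k for k, n in Counter(left).items() if n > 1),
        "duplicated_right": sorted(k for k, n in Counter(right).items() if n > 1),
    }


report = compare_patient_ids(
    ["01-001", "01-002", "01-002"],
    ["01-001", "01-013"],
)
print(report)
```

In the notebook, the two arguments would be `bcva['Patient_ID']` and `injection_processed['Patient_ID']`.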


## 6. Preview and Save

In [11]:
# Final preview
print(f"Final shape: {injection_processed.shape}")
print(f"Columns: {injection_processed.columns.tolist()}")
injection_processed.head()

Final shape: (40, 24)
Columns: ['Patient_ID', 'Eye', 'Arm', 'Screen', 'Week4', 'Week8', 'Week12', 'Week16', 'Week20', 'Week24', 'Week28', 'Week32', 'Week36', 'Week40', 'Week44', 'Week48', 'Week52', 'Week60', 'Week68', 'Week76', 'Week84', 'Week92', 'Week100', 'Week104']


Unnamed: 0,Patient_ID,Eye,Arm,Screen,Week4,Week8,Week12,Week16,Week20,Week24,Week28,Week32,Week36,Week40,Week44,Week48,Week52,Week60,Week68,Week76,Week84,Week92,Week100,Week104
0,01-001,OS,2,1,1,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,,1.0,1.0,1.0,1.0,1.0,0.0
1,01-002,OD,2,1,1,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,
2,01-013,OD,2,1,1,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,01-014,OS,2,1,1,1.0,0.0,1.0,,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,01-023,OD,2,1,1,,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,,1.0,,1.0,0.0,,


In [13]:
# Save to CSV
output_path = r'c:\Users\ronny\Documents\GitHub\Longitudinal Prediction\4 Injection processed.csv'
injection_processed.to_csv(output_path, index=False)

print(f"✓ Saved to: {output_path}")
print(f"  Shape: {injection_processed.shape}")

✓ Saved to: c:\Users\ronny\Documents\GitHub\Longitudinal Prediction\4 Injection processed.csv
  Shape: (40, 24)


## Summary

- **Input**: Data.csv (contains BCVA, Injection, Fellow Eye Injection, etc.)
- **Output**: `4 Injection processed.csv` with binary (1/0) injection status
- **Format**: Patient_ID, Eye, Arm, Screen, Week4, ..., Week104
- **Values**: 1 = Injection given, 0 = No injection, NaN = Missing
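
As a final sanity check on the format described above, the saved file can be read back and every visit column tested for values outside 1, 0 and NaN. The sketch below does this on a small in-memory table instead of the real output file. The toy rows are invented for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy table in the output format (invented rows).
toy = pd.DataFrame({
    "Patient_ID": ["01-001", "01-002"],
    "Eye": ["OS", "OD"],
    "Arm": [2, 2],
    "Screen": [1, 1],
    "Week4": [1.0, np.nan],
})

# Round trip through CSV text, standing in for the file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
# Read IDs as strings so values such as "01-001" keep their leading zero.
reloaded = pd.read_csv(buffer, dtype={"Patient_ID": str})

visit_cols = [c for c in reloaded.columns if c not in ("Patient_ID", "Eye", "Arm")]
values_ok = reloaded[visit_cols].isin([0, 1]) | reloaded[visit_cols].isna()
all_valid = bool(values_ok.all().all())
print(f"All visit values valid: {all_valid}")
```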