# Demographics Data Processing

**Objective**: Extract static patient features from Demographics.csv

**Input**: `/mnt/project/Demographics.csv`

**Output**: `4_Demographics_processed.csv` with clean column names

**Features to extract**:
- Patient_ID, Eye, Treatment_Arm
- Age, Gender, Ethnicity, Race
- Diabetes_Type, Diabetes_Years
- Baseline_HbA1c, BMI
- (HbA1c at W24, W52, W76, W104: temporal, kept in the output for potential use)

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 25)

## 1. Load Raw Data

In [None]:
# Load with header at row 1 (0-indexed), skip row 0
demo_raw = pd.read_csv('/mnt/project/Demographics.csv', skiprows=[0], header=0)

print(f"Raw shape: {demo_raw.shape}")
print(f"\nColumns:")
print(demo_raw.columns.tolist())

In [None]:
# Preview data
demo_raw.head()

## 2. Clean Column Names

In [None]:
# Define column name mapping
column_mapping = {
    'Patient \nID': 'Patient_ID',
    'Treatment Arm': 'Treatment_Arm',
    'Study\n Eye': 'Eye',
    'Age': 'Age',
    'Gender': 'Gender',
    'Ethnicity': 'Ethnicity',
    'Race': 'Race',
    'Type of\n Diabetes': 'Diabetes_Type',
    'Number of Years with Diabetes': 'Diabetes_Years',
    'Baseline HbA1c': 'Baseline_HbA1c',
    'W24 HbA1c': 'W24_HbA1c',
    'W52 HbA1c': 'W52_HbA1c',
    'W76 HbA1c': 'W76_HbA1c',
    'W104 HbA1c': 'W104_HbA1c',
    'BMI (kg/m^2)': 'BMI',
    'ETDRS BCVA': 'Baseline_BCVA',
    'CST': 'Baseline_CST',
    'Injection': 'Baseline_Injection',
    'DRSS': 'Baseline_DRSS',
    'Leakage Index': 'Baseline_Leakage_Index'
}

# Rename columns
demo_processed = demo_raw.rename(columns=column_mapping)

print("Renamed columns:")
print(demo_processed.columns.tolist())
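The mapping keys above must match the raw headers exactly, including the embedded newlines and stray spaces, because `rename` silently skips any key that does not match. One way to make the step less brittle is to normalize the headers first. The sketch below assumes the headers differ from the intended names only in whitespace; `normalize_header`, `normalized_mapping`, and the toy frame are hypothetical and are not part of the original notebook.

```python
import re

import pandas as pd


def normalize_header(name):
    """Collapse any run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", str(name)).strip()


# Toy frame with headers shaped like those in the mapping above (values are made up)
toy = pd.DataFrame(
    [[1, "OD", "Type 2"]],
    columns=["Patient \nID", "Study\n Eye", "Type of\n Diabetes"],
)

# Hypothetical mapping keyed on the normalized headers
normalized_mapping = {
    "Patient ID": "Patient_ID",
    "Study Eye": "Eye",
    "Type of Diabetes": "Diabetes_Type",
}

# First normalize whitespace, then apply the mapping
toy_renamed = toy.rename(columns=normalize_header).rename(columns=normalized_mapping)
print(toy_renamed.columns.tolist())
```

With this approach the mapping no longer depends on where the line breaks fall in the spreadsheet header.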

## 3. Select Static Features and Temporal HbA1c

Keep the static features plus the temporal HbA1c values (W24, W52, W76, W104) for potential later use. Drop the baseline measurements that are available in other files.

In [None]:
# Static features to keep
static_features = [
    'Patient_ID',
    'Eye',
    'Treatment_Arm',
    'Age',
    'Gender',
    'Ethnicity',
    'Race',
    'Diabetes_Type',
    'Diabetes_Years',
    'Baseline_HbA1c',
    'BMI'
]

# Also keep temporal HbA1c for potential use
temporal_hba1c = ['W24_HbA1c', 'W52_HbA1c', 'W76_HbA1c', 'W104_HbA1c']

# Select columns
demo_static = demo_processed[static_features + temporal_hba1c].copy()

print(f"Selected {len(demo_static.columns)} columns:")
print(demo_static.columns.tolist())
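Selecting columns by name raises a `KeyError` if any rename in the previous step did not take effect. A small pre-check can report which expected columns are absent before the selection runs. This is a minimal sketch on a toy frame; `missing_columns` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd


def missing_columns(df, expected):
    """Return the expected column names that are not present in df."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for the renamed data (values are made up)
toy = pd.DataFrame({"Patient_ID": [1], "Eye": ["OD"], "Age": [60]})
expected = ["Patient_ID", "Eye", "Age", "BMI"]

absent = missing_columns(toy, expected)
print(f"Missing expected columns: {absent}")
```

In the notebook, the same check could be run on `demo_processed` with `static_features + temporal_hba1c` as the expected list.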

## 4. Data Validation

In [None]:
# Basic info
print(f"Number of patients: {len(demo_static)}")
print(f"Unique Patient_IDs: {demo_static['Patient_ID'].nunique()}")
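Comparing the row count with the number of unique IDs shows whether duplicates exist, but not which IDs are affected. The sketch below lists the duplicated rows on a toy frame, assuming `Patient_ID` is meant to be unique per row. The values are made up for illustration.

```python
import pandas as pd

# Toy frame with one repeated Patient_ID (values are made up)
toy = pd.DataFrame(
    {
        "Patient_ID": [101, 102, 102, 103],
        "Eye": ["OD", "OS", "OS", "OD"],
    }
)

# keep=False marks every occurrence of a repeated ID, not just the later ones
dup_rows = toy[toy.duplicated(subset="Patient_ID", keep=False)]
print(f"Rows with a repeated Patient_ID: {len(dup_rows)}")
print(dup_rows)
```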

In [None]:
# Check categorical distributions
print("=== Categorical Variables ===")
print(f"\nGender:\n{demo_static['Gender'].value_counts()}")
print(f"\nEthnicity:\n{demo_static['Ethnicity'].value_counts()}")
print(f"\nRace:\n{demo_static['Race'].value_counts()}")
print(f"\nDiabetes_Type:\n{demo_static['Diabetes_Type'].value_counts()}")
print(f"\nTreatment_Arm:\n{demo_static['Treatment_Arm'].value_counts()}")

In [None]:
# Check numeric distributions
print("=== Numeric Variables ===")
numeric_cols = ['Age', 'Diabetes_Years', 'Baseline_HbA1c', 'BMI']

for col in numeric_cols:
    values = pd.to_numeric(demo_static[col], errors='coerce')
    print(f"\n{col}:")
    print(f"  Range: {values.min():.1f} - {values.max():.1f}")
    print(f"  Mean: {values.mean():.1f}")
    print(f"  Missing: {values.isna().sum()}")
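Because `errors='coerce'` turns unparseable entries into NaN, the "Missing" count above mixes values that were absent from the file with values that failed to parse. The following sketch separates the two cases on a toy series. The column name is reused from the notebook, but the values are made up.

```python
import pandas as pd

# Toy column: two numeric strings, one non-numeric entry, one truly missing value
raw = pd.Series(["7.2", "8.1", "N/A", None], name="Baseline_HbA1c")
parsed = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN during parsing
coerced = raw[parsed.isna() & raw.notna()]

print(f"Total NaN after parsing: {parsed.isna().sum()}")
print(f"Originally missing: {raw.isna().sum()}")
print(f"Failed to parse: {coerced.tolist()}")
```

Inspecting the failed entries helps decide whether they are placeholders for missing data or values that need manual cleaning.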

In [None]:
# Check missing values (NaN plus empty or whitespace-only strings of any length)
print("=== Missing Values ===")
missing = demo_static.replace(r'^\s*$', np.nan, regex=True).isna().sum()
print(missing[missing > 0])

## 5. Preview Final Output

In [None]:
demo_static.head(10)

In [None]:
demo_static.info()

## 6. Save Processed Data

In [None]:
# Save to CSV
output_path = '4_Demographics_processed.csv'
demo_static.to_csv(output_path, index=False)

print(f"✓ Saved to: {output_path}")
print(f"  Shape: {demo_static.shape}")
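After saving, a quick read-back can confirm that the file on disk has the expected shape and column order. The sketch below demonstrates the idea on a toy frame written to a temporary directory, so it does not touch the real output file. The frame contents and the file name are made up.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the processed data (values are made up)
toy = pd.DataFrame(
    {"Patient_ID": [101, 102], "Eye": ["OD", "OS"], "BMI": [31.2, 28.4]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_processed.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare shape and column order between the in-memory and reloaded frames
same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```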

## Summary

**Static features extracted**:
- Patient_ID, Eye, Treatment_Arm
- Age, Gender, Ethnicity, Race
- Diabetes_Type, Diabetes_Years
- Baseline_HbA1c, BMI

**Temporal HbA1c** (W24, W52, W76, W104): Kept for potential use

**Excluded**: Baseline_BCVA, Baseline_CST, Baseline_Injection, Baseline_DRSS, Baseline_Leakage_Index (available in other files)