In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import wfdb
import ast

In [42]:
path = "C:/Users/itsrh/Downloads/ptb-xl-electrocardiography-dataset-1.0.3"

In [43]:
# Loading Metadata
df = pd.read_csv(path + "/ptbxl_database.csv")

In [44]:
df.head()

Unnamed: 0,ecg_id,patient_id,age,sex,height,weight,nurse,site,device,recording_date,...,validated_by_human,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr
0,1,15709.0,56.0,1,,63.0,2.0,0.0,CS-12 E,1984-11-09 09:17:34,...,True,,", I-V1,",,,,,3,records100/00000/00001_lr,records500/00000/00001_hr
1,2,13243.0,19.0,0,,70.0,2.0,0.0,CS-12 E,1984-11-14 12:55:37,...,True,,,,,,,2,records100/00000/00002_lr,records500/00000/00002_hr
2,3,20372.0,37.0,1,,69.0,2.0,0.0,CS-12 E,1984-11-15 12:49:10,...,True,,,,,,,5,records100/00000/00003_lr,records500/00000/00003_hr
3,4,17014.0,24.0,0,,82.0,2.0,0.0,CS-12 E,1984-11-15 13:44:57,...,True,", II,III,AVF",,,,,,3,records100/00000/00004_lr,records500/00000/00004_hr
4,5,17448.0,19.0,1,,70.0,2.0,0.0,CS-12 E,1984-11-17 10:43:15,...,True,", III,AVR,AVF",,,,,,4,records100/00000/00005_lr,records500/00000/00005_hr


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21799 entries, 0 to 21798
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   ecg_id                        21799 non-null  int64  
 1   patient_id                    21799 non-null  float64
 2   age                           21799 non-null  float64
 3   sex                           21799 non-null  int64  
 4   height                        6974 non-null   float64
 5   weight                        9421 non-null   float64
 6   nurse                         20326 non-null  float64
 7   site                          21782 non-null  float64
 8   device                        21799 non-null  object 
 9   recording_date                21799 non-null  object 
 10  report                        21799 non-null  object 
 11  scp_codes                     21799 non-null  object 
 12  heart_axis                    13331 non-null  object 
 13  i

In [46]:
# getting the view of each column for the first entry
df.iloc[0]

ecg_id                                                                 1
patient_id                                                       15709.0
age                                                                 56.0
sex                                                                    1
height                                                               NaN
weight                                                              63.0
nurse                                                                2.0
site                                                                 0.0
device                                                         CS-12   E
recording_date                                       1984-11-09 09:17:34
report                            sinusrhythmus periphere niederspannung
scp_codes                       {'NORM': 100.0, 'LVOLT': 0.0, 'SR': 0.0}
heart_axis                                                           NaN
infarction_stadium1                                

## **Metadata Preprocessing Steps:**

### Converting scp_codes from string to dictionary.
The scp_codes column in PTB-XL is stored as the string representation of a Python dictionary. CSV files hold every value as text, so pandas loads these dictionary-like diagnostic annotations as plain strings rather than as dictionaries.
To allow programmatic access to the diagnostic labels, we convert each string into an actual Python dictionary using ast.literal_eval(), which parses Python literals only and does not execute arbitrary code. The resulting dictionaries let us extract diagnostic codes and map them to diagnostic superclasses for label generation.
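
As a minimal illustration of the conversion, the sketch below parses a single annotation string taken from the first record shown above. The variable names `raw` and `parsed` are illustrative only and do not appear elsewhere in the notebook.

```python
import ast

# One scp_codes value as it arrives from the CSV: a plain string.
raw = "{'NORM': 100.0, 'LVOLT': 0.0, 'SR': 0.0}"

# literal_eval accepts Python literals only (dicts, lists, numbers, strings, ...).
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(list(parsed.keys()))
```

After parsing, the codes are available as ordinary dictionary keys and the likelihoods as float values.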

In [47]:
df["scp_codes"] = df["scp_codes"].apply(ast.literal_eval)

### Binary label creation

Next we create our own binary disease label, because the PTB-XL dataset does not provide a direct “normal vs abnormal” column. Instead, each ECG recording carries its diagnostic information in the **scp_codes** field, a dictionary that maps SCP statement codes to likelihoods, for example {'NORM': 100.0, 'LVOLT': 0.0, 'SR': 0.0}. If `NORM` is the only key in the dictionary, we assign label 0 (Normal). If any other key is present, even alongside `NORM`, we assign label 1 (Abnormal). The rule looks at keys only and ignores the likelihood values, so the example above is labeled 1 because it also carries `LVOLT` and `SR`.

This is a deliberate choice: we want a binary classifier with high sensitivity, since in medical screening it is more important to minimize false negatives than to cleanly separate mild or mixed cases. The decision is not final; if the model does not perform satisfactorily, we may instead separate the data into binary classes using **scp_statements**.

In [48]:
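def create_binary_label(scp_codes):
    """Return 0 if NORM is the only key in scp_codes, otherwise 1."""
    # Only the keys are compared; the likelihood values are ignored.
    if list(scp_codes.keys()) == ['NORM']:
        return 0
    return 1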

df["label"] = df["scp_codes"].apply(create_binary_label)
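
The alternative mentioned above, deriving classes from scp_statements, could look roughly like the sketch below. It is not the method used in this notebook. The small `scp_statements` table is a hand-made stand-in for the real scp_statements.csv (its rows are chosen for illustration, and only the `diagnostic` and `diagnostic_class` columns are modeled), and `superclass_label` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative stand-in for scp_statements.csv, indexed by SCP code.
# Assumption: diagnostic == 1 marks diagnostic statements, and
# diagnostic_class holds the superclass each one belongs to.
scp_statements = pd.DataFrame(
    {
        "diagnostic": [1.0, 1.0, 1.0, None, None],
        "diagnostic_class": ["NORM", "MI", "STTC", None, None],
    },
    index=["NORM", "IMI", "NDT", "SR", "LVOLT"],
)

# Keep diagnostic statements only: code -> superclass.
diagnostic_map = scp_statements.loc[
    scp_statements["diagnostic"] == 1, "diagnostic_class"
]

def superclass_label(scp_codes):
    """Return 0 if the only diagnostic superclass present is NORM, else 1."""
    classes = {
        diagnostic_map[code] for code in scp_codes if code in diagnostic_map.index
    }
    # A record without any diagnostic statement yields an empty set -> 1.
    return 0 if classes == {"NORM"} else 1

print(superclass_label({"NORM": 100.0, "LVOLT": 0.0, "SR": 0.0}))
print(superclass_label({"NORM": 80.0, "NDT": 50.0}))
```

Under this variant, non-diagnostic keys such as rhythm statements no longer push a record into the Abnormal class, which changes the class balance relative to the strict key-based rule.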

### Selective Column Dropping

The ptbxl_database.csv file in PTB-XL contains 28 columns of metadata for the 21,799 ECG records. Our primary features are extracted from the 100 Hz 12-lead waveforms, so metadata serves only as supplementary input. Columns with high missing rates add complexity (imputation choices, potential bias) that is hard to justify for a lightweight, interpretable decision-tree model aimed at primary-care cardiac abnormality screening.

We will use the following rules to drop columns from metadata.

1. **Missing-value threshold rule:**
Any column with more than 30% missing values is dropped automatically. This is because metadata is secondary, and columns exceeding this threshold introduce too much uncertainty or require unreliable imputation for minimal expected gain in model performance.

2. **Subsequent selective dropping:**
From the remaining low-missing columns, we retain only those with clear clinical relevance or that are essential for experiment reproducibility.
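
The missing-value threshold rule can be sketched as follows. The `toy` frame and its values are made up for illustration; on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy metadata frame (hypothetical values) with different missing rates.
toy = pd.DataFrame(
    {
        "age": [56.0, 19.0, 37.0, 24.0],
        "height": [np.nan, np.nan, np.nan, 170.0],  # 75% missing
        "weight": [63.0, np.nan, 69.0, 82.0],       # 25% missing
    }
)

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Columns whose missing fraction exceeds the 30% threshold.
high_missing = missing_frac[missing_frac > 0.30].index.tolist()

print(missing_frac)
print(high_missing)
```

Computing the fractions explicitly, rather than reading them off df.info(), keeps the threshold rule reproducible if the dataset version changes.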

**Final kept columns:**
1. ecg_id: Unique identifier for matching metadata to waveform files (records100/ folder).
2. strat_fold: Official 10-fold stratification for reproducible splits (folds 1–8 train, 9 validation, 10 test).
3. age: Patient age at recording (no missing values; clinically important for ECG parameter norms, e.g., QTc adjustment, age-related pathology prevalence).
4. sex: Biological sex (0 = male, 1 = female; no missing values; essential for sex-specific ECG interpretation, e.g., voltage criteria, axis norms).
5. label: Derived binary target (0 = Normal, 1 = Abnormal), created from scp_codes before that column is dropped.
6. filename_lr: Relative path to the 100 Hz waveform record; kept for loading signals, not as a model feature.
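
The fold-based split described for strat_fold can be sketched like this. The `toy` frame is a made-up stand-in with one record per fold; with the real metadata the same masks would be applied to `df`.

```python
import pandas as pd

# Toy stand-in: ten records, one per official fold (hypothetical ids).
toy = pd.DataFrame({"ecg_id": range(1, 11), "strat_fold": range(1, 11)})

# Folds 1-8 for training, fold 9 for validation, fold 10 for testing.
train = toy[toy["strat_fold"] <= 8]
val = toy[toy["strat_fold"] == 9]
test = toy[toy["strat_fold"] == 10]

print(len(train), len(val), len(test))
```

Because the split depends only on the strat_fold column, it is deterministic and needs no random seed.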

**Columns Dropped:**
1. height: ~68% missing (only ~32% populated); too sparse for reliable use without heavy imputation.
2. weight: ~57% missing (only ~43% populated); same issue, since the high sparsity outweighs the potential BMI utility.
3. heart_axis, infarction_stadium1, infarction_stadium2: High missing rates (specific to subsets of pathologies).
4. Signal quality flags (e.g., baseline_drift, static_noise, burst_noise, electrodes_problems, extra_beats, pacemaker): Mostly sparse or empty strings.
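5. patient_id: Not used as a feature; the official strat_fold assignment already keeps all records of a patient within the same fold, so it is not needed for splitting either.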
6. nurse, site: Administrative/pseudonymized identifiers; no influence on abnormality detection.
7. device: Nearly constant (e.g., mostly CS-12 E); lacks variability.
8. recording_date: Timestamp (privacy-shifted); irrelevant for classification.
9. report: Free-text German summaries; redundant with scp_codes, memory-heavy strings.
10. scp_codes: Diagnostic code dictionary used only to derive the label, and dropped right after labeling to save RAM.
11. validated_by, second_opinion, initial_autogenerated_report, validated_by_human: These columns describe how the labels were created and checked, but they are not useful as model inputs. According to the PTB-XL paper, folds 9 and 10 contain high-quality labels verified by human cardiologists, while the earlier folds may include automatically generated or less thoroughly checked labels; the paper recommends fold 9 for validation and fold 10 for testing. We follow this split exactly, so these validation columns are not needed.

In [49]:
cols_to_drop = [
    'patient_id', 'height', 'weight', 'nurse', 'site', 'device',
    'recording_date', 'report',
    'scp_codes',  # drop only after the label column has been created
    'heart_axis', 'infarction_stadium1', 'infarction_stadium2',
    'validated_by', 'second_opinion', 'initial_autogenerated_report',
    'validated_by_human', 'baseline_drift', 'static_noise',
    'burst_noise', 'electrodes_problems', 'extra_beats', 'pacemaker',
    'filename_hr',  # 500 Hz paths; filename_lr is kept for the 100 Hz records
]
# errors="ignore" skips names that are not present in df
df = df.drop(columns=cols_to_drop, errors="ignore")

In [50]:
df.head()

Unnamed: 0,ecg_id,age,sex,strat_fold,filename_lr,label
0,1,56.0,1,3,records100/00000/00001_lr,1
1,2,19.0,0,2,records100/00000/00002_lr,1
2,3,37.0,1,5,records100/00000/00003_lr,1
3,4,24.0,0,3,records100/00000/00004_lr,1
4,5,19.0,1,4,records100/00000/00005_lr,1
