To train the model, a dataframe containing the ECG ID and the classes for each record must be produced. The required data is contained in two separate files, metadata.csv and code.csv. All data was obtained from a public dataset of 25770 records published by Hui et al. at https://www.nature.com/articles/s41597-022-01403-5.

The file metadata.csv contains the diagnostic codes for each record, and code.csv contains the description and category associated with each code. Some records have multiple diagnostic codes, which may come from different categories.

In [186]:
import pandas as pd

metadata_df = pd.read_csv('../data/metadata.csv')
metadata_df.head()

Unnamed: 0,ECG_ID,AHA_Code,Patient_ID,Age,Sex,N,Date
0,A00001,22;23,S00001,55,M,5000,2020-03-04
1,A00002,1,S00002,32,M,6000,2019-09-03
2,A00003,1,S00003,63,M,6500,2020-07-16
3,A00004,23,S00004,31,M,5000,2020-07-14
4,A00005,146,S00005,47,M,5500,2020-01-07


In [43]:
code_df = pd.read_csv('../data/code.csv')
code_df.head()

Unnamed: 0,Category,Code,Description
0,A,1,Normal ECG
1,C,21,Sinus tachycardia
2,C,22,Sinus bradycardia
3,C,23,Sinus arrhythmia
4,D,30,Atrial premature complex(es)


Some records include secondary (non-primary) codes, which will not be used by the model. These codes are removed, and a list of all primary codes for each record is produced. (The remove_nonprimary_code function was provided by Hui et al.)

In [154]:
def remove_nonprimary_code(x):
    """Remove non-primary statement"""
    r = set()
    for cx in x:
        for c in cx.split('+'):
            if int(c) < 200 or int(c) >= 500:
                r.add(c)
    return list(r)

# obtain primary statements
codes = metadata_df.AHA_Code.str.split(';')
primary_codes = codes.apply(remove_nonprimary_code)

primary_codes.head()

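from collections import Counter

# count how often each primary code occurs across all records
c = Counter()
for record_codes in primary_codes:  # avoid shadowing the `codes` Series
    c.update(record_codes)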

print(c.most_common(10))
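
To make the filtering rule concrete, the following sketch applies the function to a hand-made list of code strings. The input values are invented for illustration and are not taken from the dataset. The function body is repeated so that the snippet runs on its own, and the result is sorted only to make the printed order predictable.

```python
def remove_nonprimary_code(x):
    """Keep codes below 200 or at/above 500; drop everything in between."""
    r = set()
    for cx in x:
        # a single entry may join several codes with '+'
        for c in cx.split('+'):
            if int(c) < 200 or int(c) >= 500:
                r.add(c)
    return list(r)

# Hypothetical entry: '60+310' combines two codes, '22' stands alone.
example = ['60+310', '22']
kept = sorted(remove_nonprimary_code(example), key=int)
print(kept)  # ['22', '60']
```

Here 310 falls in the 200 to 499 range, so the function discards it, while 60 and 22 are kept.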

The model will predict the category of abnormality rather than the specific diagnosis, so the list of diagnostic codes for each record must be translated into a list of categories. As the 15th record (index 14) shows, a record can belong to more than one category.

In [52]:
def get_category(codes):
    res = set()
    for code in codes:
        cat = code_df.loc[code_df.Code == int(code), 'Category'].values[0]
        res.add(cat)
    return list(res)

categories = primary_codes.transform(get_category)
categories.head(15)

0        [C]
1        [A]
2        [A]
3        [C]
4        [L]
5        [A]
6        [I]
7        [A]
8        [A]
9        [A]
10       [C]
11       [A]
12       [D]
13       [A]
14    [L, C]
Name: AHA_Code, dtype: object
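
The lookup above filters code_df once per code. As a possible alternative, the same translation can be sketched with a plain dictionary built once. The snippet below uses a toy stand-in for code_df containing only the five rows shown earlier, and the names code_rows, code_to_category and get_category_from_map are hypothetical, introduced only for this illustration.

```python
# Toy stand-in for code_df: (Category, Code) pairs from the head() output.
code_rows = [('A', 1), ('C', 21), ('C', 22), ('C', 23), ('D', 30)]

# Build the code -> category mapping once.
code_to_category = {code: cat for cat, code in code_rows}

def get_category_from_map(codes, mapping):
    """Return the sorted, de-duplicated categories for a list of code strings."""
    return sorted({mapping[int(code)] for code in codes})

print(get_category_from_map(['22', '23'], code_to_category))  # ['C']
print(get_category_from_map(['1', '30'], code_to_category))   # ['A', 'D']
```

With the real data, the mapping could presumably be built as dict(zip(code_df.Code, code_df.Category)). A code that is missing from the mapping raises a KeyError, which makes gaps in code.csv visible. Unlike get_category, this sketch sorts the categories, so their order within each list may differ from the output shown above.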

The lists of categories are then encoded as binary indicator columns, one per category, using scikit-learn's MultiLabelBinarizer.

In [152]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
binarized_labels = pd.DataFrame(mlb.fit_transform(categories))

new_col_labels = ['ECG_ID'] + list(mlb.classes_)
final_data_full = (pd.concat([metadata_df.ECG_ID, binarized_labels], axis=1)
        .set_axis(new_col_labels, axis='columns'))

final_data_full.head(10)

Unnamed: 0,ECG_ID,A,C,D,E,F,H,I,J,K,L,M
0,A00001,0,1,0,0,0,0,0,0,0,0,0
1,A00002,1,0,0,0,0,0,0,0,0,0,0
2,A00003,1,0,0,0,0,0,0,0,0,0,0
3,A00004,0,1,0,0,0,0,0,0,0,0,0
4,A00005,0,0,0,0,0,0,0,0,0,1,0
5,A00006,1,0,0,0,0,0,0,0,0,0,0
6,A00007,0,0,0,0,0,0,1,0,0,0,0
7,A00008,1,0,0,0,0,0,0,0,0,0,0
8,A00009,1,0,0,0,0,0,0,0,0,0,0
9,A00010,1,0,0,0,0,0,0,0,0,0,0
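
For readers unfamiliar with this encoding, the following pure-Python sketch shows the idea on three label lists. It is not scikit-learn's implementation, only an illustration of the result: one column per distinct label, in sorted order, with a 1 wherever the record carries that label. The helper name multi_hot is hypothetical.

```python
def multi_hot(label_lists):
    """Encode lists of labels as 0/1 rows, one column per distinct label."""
    # collect every distinct label and fix the column order by sorting
    classes = sorted({label for labels in label_lists for label in labels})
    # each row marks which of the classes appear in that record
    rows = [[int(c in labels) for c in classes] for labels in label_lists]
    return classes, rows

classes, rows = multi_hot([['C'], ['A'], ['L', 'C']])
print(classes)  # ['A', 'C', 'L']
print(rows)     # [[0, 1, 0], [1, 0, 0], [0, 1, 1]]
```

The third record carries two labels and therefore has two 1s in its row, which is what makes the problem multi-label rather than multi-class.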


Finally, the data is split into training and test sets. The following code was provided by Hui et al. to ensure that records from the same patient are not present in both sets.

In [95]:
# 80%-20% split
def ecg_train_test_split(df):
    # put all records belonging to patients with
    # multiple records in the test set
    test1 = df.Patient_ID.duplicated(keep=False)
    N = int(len(df)*0.2) - sum(test1)
    # 73 is chosen such that all primary statements exist in both sets
    df_test = pd.concat([df[test1], df[~test1].sample(N, random_state=73)])
    # select by index label; .iloc only works here because the index is 0..n-1
    df_train = df.loc[df.index.difference(df_test.index)]
    return df_train, df_test

df_train, df_test = ecg_train_test_split(metadata_df)
print(f'The training set has {len(df_train)} records')
print(f'The test set has {len(df_test)} records')

The training set has 20616 records
The test set has 5154 records


In [122]:
final_train = final_data_full.loc[df_train.index]
final_test = final_data_full.loc[df_test.index]

final_train.to_csv('../data/train.csv', index=False)
final_test.to_csv('../data/test.csv', index=False)

print("Training and test sets saved successfully")
print(f'The training set has {len(final_train)} records')
print(f'The test set has {len(final_test)} records')

Training and test sets saved successfully
The training set has 20616 records
The test set has 5154 records
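
A quick sanity check for the patient-level separation is to confirm that no patient ID occurs in both sets. The sketch below uses invented IDs and a hypothetical helper, shared_patients, and it accepts any iterable of IDs, so a pandas column can be passed directly.

```python
def shared_patients(train_ids, test_ids):
    """Return the patient IDs that occur in both collections."""
    return set(train_ids) & set(test_ids)

# Invented IDs for illustration only; S00001 has two records, both in test.
train_ids = ['S00002', 'S00003', 'S00005']
test_ids = ['S00001', 'S00001', 'S00004']

overlap = shared_patients(train_ids, test_ids)
print(len(overlap))  # 0
```

On the real split, shared_patients(df_train.Patient_ID, df_test.Patient_ID) would be expected to return an empty set, since every patient with more than one record is placed in the test set.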


In [142]:
# print the fraction of training records belonging to each class
for col in final_train.columns[1:]:  # skip the ECG_ID column
    print(col, "-", final_train[col].mean())

A - 0.5478754365541327
C - 0.18010283275126115
D - 0.02357392316647264
E - 0.027939464493597205
F - 0.03856228172293364
H - 0.01372720217306946
I - 0.08260574311214591
J - 0.0233313930927435
K - 0.008391540551028328
L - 0.19470314318975554
M - 0.009167636786961583
