# <font color = teal> Introduction to data handling </font>

This notebook contains information about

1)  how to download data into the repository (especially the Physionet 2021 data)

2)  label mapping when ECGs are labeled with different diagnostic codes

3)  how to preprocess data if needed

4)  the base idea of splitting data into csv files and how to perform it

Once you have done any preprocessing and split the data into csv files, you may want to create `yaml` files based on these files for training and testing. To do this, check the notebooks [Yaml files of database-wise split for training and testing](2_physionet_DBwise_yaml_files.ipynb) and [Yaml files of stratified split for training and testing](2_physionet_stratified_yaml_files.ipynb).

--------

## <font color = teal> 1) Downloading data </font>

### <font color = teal> PhysioNet Challenge 2021 </font>

The exploration of the dataset is available in the notebook [Exploration of the PhysioNet2021 data](exploration_physionet2021_data.ipynb).

There are two ways to download the PhysioNet Challenge 2021 data in `tar.gz` format: 

1) Download the data manually from [here](https://moody-challenge.physionet.org/2021/) under **Data Access**

2) Let this notebook do the job with the following code


In [1]:
# All imports
import os, re
import tarfile
from pathlib import Path
import pandas as pd

In [17]:
# Download the tar.gz files of the PhysioNet2021 data

!wget -O WFDB_CPSC2018.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018.tar.gz/
        
!wget -O WFDB_CPSC2018_2.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018_2.tar.gz/
        
!wget -O WFDB_StPetersburg.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining//WFDB_StPetersburg.tar.gz/
        
!wget -O WFDB_PTB.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_PTB.tar.gz/
        
!wget -O WFDB_PTBXL.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_PTBXL.tar.gz/
        
!wget -O WFDB_Ga.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_Ga.tar.gz/
        
!wget -O WFDB_ChapmanShaoxing.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_ChapmanShaoxing.tar.gz/
        
!wget -O WFDB_Ningbo.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_Ningbo.tar.gz/
        

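If `wget` is not available on your system, the same downloads can be sketched in plain Python. The base URL and the archive names below are copied from the `wget` cell above; `build_url` and `download_all` are hypothetical helper names used only for this illustration, and the sketch assumes the server still serves the files at those addresses.

```python
import os
import urllib.request

# Base URL and archive names copied from the wget cell above
BASE_URL = 'https://pipelineapi.org:9555/api/download/physionettraining'
ARCHIVES = [
    'WFDB_CPSC2018.tar.gz',
    'WFDB_CPSC2018_2.tar.gz',
    'WFDB_StPetersburg.tar.gz',
    'WFDB_PTB.tar.gz',
    'WFDB_PTBXL.tar.gz',
    'WFDB_Ga.tar.gz',
    'WFDB_ChapmanShaoxing.tar.gz',
    'WFDB_Ningbo.tar.gz',
]

def build_url(archive):
    # Hypothetical helper: the trailing slash mirrors the wget commands
    return '{}/{}/'.format(BASE_URL, archive)

def download_all(target_dir='.', overwrite=False):
    # Hypothetical helper: downloads each archive unless it already exists
    for archive in ARCHIVES:
        destination = os.path.join(target_dir, archive)
        if os.path.exists(destination) and not overwrite:
            print('Skipping {} (already downloaded)'.format(archive))
            continue
        print('Downloading {}'.format(archive))
        urllib.request.urlretrieve(build_url(archive), destination)

# download_all()  # uncomment to start the download (the archives are large)
```

Skipping files that already exist makes it possible to rerun the cell after an interrupted download without fetching the finished archives again. A partially downloaded file is not detected, so delete it first or pass `overwrite=True`.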


Once the `tar.gz` files are downloaded, they need to be extracted to the `data` directory, which is located in the root of the repository. The files can be extracted into groups based on the source database as follows:

- CPSC Database and CPSC-Extra Database
- St. Petersburg (INCART) Database
- PTB and PTB-XL Database
- The Georgia 12-lead ECG Challenge (G12EC) Database
- Chapman-Shaoxing and Ningbo Database

Let's first get the names of the `tar.gz` files.

In [None]:
# All tar.gz files (in the current working directory)
curr_path = os.getcwd()
targz_files = [file for file in os.listdir(curr_path) if os.path.isfile(os.path.join(curr_path, file)) and file.endswith('tar.gz') and file.startswith('WFDB')]

# Let's sort the files
targz_files = sorted(targz_files)

for i, file in enumerate(targz_files):
    print(i, file)

0 WFDB_CPSC2018.tar.gz
1 WFDB_CPSC2018_2.tar.gz
2 WFDB_ChapmanShaoxing.tar.gz
3 WFDB_Ga.tar.gz
4 WFDB_Ningbo.tar.gz
5 WFDB_PTB.tar.gz
6 WFDB_PTBXL.tar.gz
7 WFDB_StPetersburg.tar.gz


So the `tar.gz` files listed above will be extracted as follows:

* WFDB_CPSC2018.tar.gz + WFDB_CPSC2018_2.tar.gz
* WFDB_StPetersburg.tar.gz
* WFDB_PTB.tar.gz + WFDB_PTBXL.tar.gz
* WFDB_Ga.tar.gz
* WFDB_ChapmanShaoxing.tar.gz + WFDB_Ningbo.tar.gz

In [None]:
# Let's make the split as tuples of tar.gz files
# NB! If the split mentioned above is wanted, SORTING is really important!
tar_split = [(targz_files[0], targz_files[1]),
             (targz_files[7], ),
             (targz_files[5], targz_files[6]),
             (targz_files[3], ),
             (targz_files[2], targz_files[4])]

print(*tar_split, sep="\n")

('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz')
('WFDB_StPetersburg.tar.gz',)
('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz')
('WFDB_Ga.tar.gz',)
('WFDB_ChapmanShaoxing.tar.gz', 'WFDB_Ningbo.tar.gz')


In [None]:
# Function to extract files from a given tar to a given directory
# Subdirectories inside the tar are ignored: all files are written directly to the given directory
def extract_files(tar, directory):

    n_files = 0
    with tarfile.open(tar, 'r') as file:
        for member in file.getmembers():
            if member.isreg(): # Skip members that are not regular files
                member.name = os.path.basename(member.name) # Drop the path inside the tar
                file.extract(member, directory)
                n_files += 1

    # Show the path starting from 'data' if possible, otherwise the full path
    match = re.search('data.*', directory)
    shown_dir = './' + match[0] if match else directory
    print('- {} files extracted to {}'.format(n_files, shown_dir))
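Before extracting, it can be useful to check what an archive contains. The following is a minimal sketch that counts the regular files per extension without extracting anything; `count_member_extensions` is a hypothetical helper name and is not part of the repository.

```python
import os
import tarfile
from collections import Counter

def count_member_extensions(tar_path):
    # Hypothetical helper: counts regular files per extension without extracting
    counts = Counter()
    with tarfile.open(tar_path, 'r') as archive:
        for member in archive.getmembers():
            if member.isreg():
                extension = os.path.splitext(member.name)[1]
                counts[extension] += 1
    return counts

# Example (assumes the archive is in the current working directory):
# print(count_member_extensions('WFDB_Ga.tar.gz'))
```

For the PhysioNet data, the counts for `.mat` and `.hea` would be expected to match, since each recording consists of one file of each type.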

In [None]:
# Absolute path of this file
abs_path = Path(os.path.abspath(''))

# Path to the physionet_data directory, i.e., save the dataset here
data_path = os.path.join(abs_path.parent.absolute(), 'data', 'physionet_data')

if not os.path.exists(data_path):
    os.makedirs(data_path)

# Directories to which the data is extracted
# NB! Must have the same length as 'tar_split'
dir_names = ['CPSC_CPSC-Extra', 'INCART', 'PTB_PTBXL', 'G12EC', 'ChapmanShaoxing_Ningbo']

# Extracting right files to right subdirectories
for tar, directory in zip(tar_split, dir_names):
    
    print('Extracting tar.gz file(s) {} to the {} directory'.format(tar, directory))
    
    # Saving path for the specific files
    save_tmp = os.path.join(data_path, directory)
    # Preparing the directory
    if not os.path.exists(save_tmp):
        os.makedirs(save_tmp)
        
    for one_tar in tar: # Each tuple holds one or more tar.gz files
        extract_files(one_tar, save_tmp)
        
print('Done!')


A total of **176 506** files (according to the data exploration notebook mentioned above) should now be located in the `physionet_data` directory, as each ECG recording consists of a binary MATLAB v4 file and a text file in WFDB header format. To double-check, the number of files can be counted as follows:

In [None]:
total_files = 0
for root, dirs, files in os.walk(data_path):
    total_files += len(files)
    
print('Total of {} files'.format(total_files))

Total of 176506 files
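Since each recording should have both a `.mat` file and a `.hea` file, the pairing can be checked as well. This is a minimal sketch under that assumption; `find_unpaired_records` is a hypothetical helper name.

```python
import os

def find_unpaired_records(data_dir):
    # Hypothetical helper: returns record paths missing either the .mat or the .hea file
    mat_names, hea_names = set(), set()
    for root, _, files in os.walk(data_dir):
        for name in files:
            stem, extension = os.path.splitext(name)
            record = os.path.join(root, stem)
            if extension == '.mat':
                mat_names.add(record)
            elif extension == '.hea':
                hea_names.add(record)
    # Symmetric difference: records present in only one of the two sets
    return sorted(mat_names ^ hea_names)

# Example: an empty list means every recording has both files
# print(find_unpaired_records(data_path))
```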


### <font color = teal> Getting data from other sources </font>

New data can be downloaded and used with this repository when a few guidelines are followed:

1\) `MATLAB v4` (.mat) and `h5` formats are supported for ECG data. When training or testing is set up, ECGs are stored in a `torch.utils.data.Dataset` using the following function from the `dataset_utils.py` script in `src/dataloader/`:

```
def load_data(case):
    ''' Load a MATLAB v4 file or a H5 file of an ECG recording
    '''

    if case.endswith('.mat'):
        x = loadmat(case)
        return np.asarray(x['val'], dtype=np.float64)
    else:
        with h5py.File(case) as f:
            x = f['ecg'][()]
        return np.asarray(x, dtype=np.float64)
```

So the ECG signal is expected either under the `val` key in the `MATLAB` file or under the `ecg` key in the `H5` file.

2\) Metadata such as diagnoses, age and gender are loaded either from files in `WFDB header format` (.hea) or from csv files. Other information about the ECGs, e.g. the sample frequency, is also stored in them. Such files are needed when creating csv files of ECG samples with the `create_data_csvs.py` script. Header files are structured like the one below:

```
JS00001 12 500 5000 23-Mar-2021 20:20:47
JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I
JS00001.mat 16+24 1000/mV 16 0 264 -599 0 II
JS00001.mat 16+24 1000/mV 16 0 517 -22376 0 III
JS00001.mat 16+24 1000/mV 16 0 -5 28232 0 aVR
JS00001.mat 16+24 1000/mV 16 0 -386 16619 0 aVL
JS00001.mat 16+24 1000/mV 16 0 390 15121 0 aVF
JS00001.mat 16+24 1000/mV 16 0 -98 1568 0 V1
JS00001.mat 16+24 1000/mV 16 0 -312 -32761 0 V2
JS00001.mat 16+24 1000/mV 16 0 -98 32715 0 V3
JS00001.mat 16+24 1000/mV 16 0 810 15193 0 V4
JS00001.mat 16+24 1000/mV 16 0 810 14081 0 V5
JS00001.mat 16+24 1000/mV 16 0 527 32579 0 V6
#Age: 85
#Sex: Male
#Dx: 164889003,59118001,164934002
#Rx: Unknown
#Hx: Unknown
#Sx: Unknown
```
The third value in the first row is the sample frequency, and age, gender and diagnoses are read from lines 14-16. The 12 lines after the first one correspond to the 12 leads of the ECG recording. `Rx` refers to medical prescription, `Hx` to medical history and `Sx` to symptom or surgery. The csv files have similar columns: age as `Age`, gender as `Sex` and diagnoses in SNOMED CT Codes as `SNOMEDCTCode`. Note that whether the metadata is in a csv file or in a header file, <font color = red><i>all metadata files should be located in the same directory as the corresponding ECGs</i></font>.
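To make the header structure concrete, the following minimal sketch reads the sample frequency, age, sex and diagnoses from header text. `parse_header` is a hypothetical helper written for this notebook, not the function used by `create_data_csvs.py`, and it assumes that the sample frequency is written as an integer, as in the example above.

```python
def parse_header(text):
    # Hypothetical helper: extracts the fields described above from one header file
    lines = text.strip().splitlines()
    first = lines[0].split()
    info = {
        'record': first[0],
        'n_leads': int(first[1]),
        'fs': int(first[2]),         # assumes an integer sample frequency
        'n_samples': int(first[3]),
    }
    # Comment lines such as '#Age: 85' hold the metadata
    for line in lines[1:]:
        if line.startswith('#') and ':' in line:
            key, value = line[1:].split(':', 1)
            info[key.strip()] = value.strip()
    # Diagnoses are a comma-separated list of SNOMED CT codes
    info['Dx'] = [code.strip() for code in info.get('Dx', '').split(',') if code.strip()]
    return info

# Shortened version of the header shown above (only one lead line kept)
example = '''JS00001 12 500 5000 23-Mar-2021 20:20:47
JS00001.mat 16+24 1000/mV 16 0 -254 21756 0 I
#Age: 85
#Sex: Male
#Dx: 164889003,59118001,164934002
'''

header = parse_header(example)
print(header['fs'], header['Age'], header['Sex'], header['Dx'])
```

Note that the values of `Age` and `Sex` are kept as strings here; converting them is left out on purpose, since values such as `Unknown` may also appear in these fields.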

3\) The easiest way to handle the repository is to download data into the `data` directory. There are some initialized paths to point to the mentioned directory, for example when creating the csv files of data or the yaml files for configurations of training and testing.

The code above extracts the tar.gz files, and the `extract_files(tar, directory)` function is generally usable. The `tar` parameter refers to the tar.gz file to be extracted, and `directory` (called `save_path` in the example below) refers to the path to which the files are extracted. The path is given as an absolute path. For example, the following code can be used for this purpose:

In [17]:
## Other sources
## -------------
'''
# Absolute path of this file
abs_path = Path(os.path.abspath(''))

# The name of the tar gz file (located in the current directory)
tar = 'records.tar.gz'
save_path = os.path.join(abs_path.parent.absolute(), 'data', 'Shandong')
#extract_files(tar, save_path)

# If needed, the samples can be renamed
samples = sorted(os.listdir(save_path))  # get the current names of the samples
path_samples = sorted([os.path.join(save_path, s) for s in samples]) # add path to the current names
new_names = [name.replace('A', 'SPH', 1) for name in samples] # rename the beginning of the file (e.g., A0001.h5 to SPH0001.h5)
path_new_names = sorted([os.path.join(save_path, nn) for nn in new_names]) # add path to the new names

# Rename samples
for old, new in zip(path_samples, path_new_names):
    os.rename(old, new)

# Also, if the csv file of metadata has the IDs too, change them if needed
csv_file = pd.read_csv('metadata.csv')
csv_file['ECG_ID'] = [s.replace('.h5', '') for s in new_names]

# Be also sure that we have a sample frequency in it
csv_file['fs'] = 500
csv_file.to_csv(os.path.join(save_path, 'metadata.csv'), index=False)
'''


----------------

## <font color = teal>2) Label mapping </font>

The main diagnostic code system used in this repository is SNOMED CT Codes.

As ECGs can be labeled with different codes, the `label_mapping.py` script is provided to convert codes that are not SNOMED CT Codes. <i>The assumption is that the metadata of the specific data set is found in a csv file</i>. The script maps the labels found in the metadata file using `AHA_SNOMED_mapping.csv` (found in the `data` directory) and adds the corresponding SNOMED CT Code to an additional `SNOMEDCTCodes` column. The rest of the metadata file remains unchanged. The `AHA_SNOMED_mapping.csv` file contains some of the diagnostic statements conforming to the AHA standard and their corresponding SNOMED CT Codes in the following form:

Rhythm|AHA_Code|SNOMEDCTCode 
------|---------|-------------
Sinus Rhythm|1|426783006 
Atrial Fibrillation|50|164889003
Atrial Fibrillation |346|164889003 
Atrial Fibrillation |347|164889003
Atrial Flutter|51|164890007
Premature Atrial complexes(conducting and non-conducting)|30|284470004
... | ... | ...

As long as the structure of the csv file is not violated, new codes can be added. The `Rhythm` column is not used, but it helps to keep track of the diagnoses. The metadata csv file will be replaced with the updated one.
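The mapping idea can be sketched with the standard library alone. This is not the code of `label_mapping.py`: `load_mapping` and `map_codes` are hypothetical helpers, the inline table is only an excerpt of the rows shown above, and it is assumed that the mapping file is comma-separated.

```python
import csv
import io

# Excerpt of the mapping table shown above (not the full AHA_SNOMED_mapping.csv)
mapping_csv = '''Rhythm,AHA_Code,SNOMEDCTCode
Sinus Rhythm,1,426783006
Atrial Fibrillation,50,164889003
Atrial Flutter,51,164890007
'''

def load_mapping(file_obj):
    # Builds a dictionary from AHA code to SNOMED CT code; 'Rhythm' is ignored
    reader = csv.DictReader(file_obj)
    return {row['AHA_Code'].strip(): row['SNOMEDCTCode'].strip() for row in reader}

def map_codes(aha_codes, mapping):
    # Codes without a match are dropped here; the real script may treat them differently
    return [mapping[code] for code in aha_codes if code in mapping]

mapping = load_mapping(io.StringIO(mapping_csv))
print(map_codes(['50', '999', '1'], mapping))
```

Because several AHA codes can point to the same SNOMED CT Code (as with atrial fibrillation in the table above), a dictionary keyed by the AHA code handles such rows without any special treatment.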

----

## <font color = teal> 3) Preprocessing data (optional) </font>

All the data can be preprocessed with different transforms with the `preprocess_data.py` script. There are two important attributes to consider:

```
# Original data location
from_directory = os.path.join(os.getcwd(), 'data', 'smoke_data')

# New location for preprocessed data
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_smoke_data')
```

`from_directory` refers to the directory from which the data in the original format is loaded, such as the downloaded PhysioNet Challenge 2021 data. `new_directory` refers to the new location where the preprocessed ECGs are saved. Note that <i> whenever the preprocessed ECGs are saved, the metadata files (e.g. a csv file or each corresponding hea file) need to be copied to that location</i>. The script will do this too.

By default there are two transforms used, a band-pass filter and linear interpolation:

```
# ------------------------------
# --- PREPROCESS TRANSFORMS ----

# - BandPass filter 
bpf = BandPassFilter(fs = ecg_fs)
ecg = bpf(ecg)

# - Linear interpolation
linear_interp = Linear_interpolation(fs_new = 257, fs_old = ecg_fs)
ecg = linear_interp(ecg)

# ------------------------------
# ------------------------------
```

The preprocessing part **is not mandatory for the repository to work**, but applying transforms such as the two mentioned during the training phase can significantly slow down training. That's why it's recommended to preprocess the data before training using the script mentioned.

All the other transforms are set in the `dataset.py` script in `src/dataloader/`, which is run during training. Several transforms are already available in the `transforms.py` script in the same path, including `Linear_interpolation` and `BandPassFilter`.
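To illustrate what resampling by linear interpolation does to a recording, here is a minimal pure-Python sketch for one lead. It is not the repository's `Linear_interpolation` class, whose implementation is not shown in this notebook; `resample_linear` is a hypothetical function, and rounding the new length down is an assumption.

```python
def resample_linear(signal, fs_old, fs_new):
    # Resamples one lead by linear interpolation (illustrative only)
    n_new = int(len(signal) * fs_new / fs_old)   # new length, rounded down
    resampled = []
    for i in range(n_new):
        position = i * fs_old / fs_new           # position of new sample i on the old grid
        left = int(position)
        right = min(left + 1, len(signal) - 1)
        weight = position - left
        resampled.append((1 - weight) * signal[left] + weight * signal[right])
    return resampled

# A 10-second lead sampled at 500 Hz has 5000 samples
lead = [0.0] * 5000
print(len(resample_linear(lead, fs_old=500, fs_new=257)))
```

With the default `fs_new = 257`, the 5000 samples of the example header shown earlier shrink to 5000 * 257 / 500 = 2570 samples. On NumPy arrays, `numpy.interp` performs the same kind of interpolation.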

### <font color = teal> Terminal command </font>

To use the script, simply use the following command:

```
python preprocess_data.py
```

<font color = red>**NOTE!** The preprocessed ECGs will have different names than the original ones, so it's important to keep in mind whether the preprocessing part has been done or not!</font>

--------

## <font color = teal> 4) Splitting data into csv files </font>

All the data splitting is done with the `create_data_csvs.py` script. The main idea for that script is to split the data into csv files which can be later used in training and testing.

Csv files have the columns `path` (path to a specific ECG recording), `age`, `gender` and all the diagnoses in SNOMED CT codes used as labels in classification. A value of 1 means that the patient has the diagnosis. The main structure of csv files is as follows:


| path  | age  | gender  | 10370003  | 111975006 | 164890007 | *other diagnoses...* |
| ------------- |-------------|-------------| ------------- |-------------|-------------|-------------|
| ./data/A0002.mat | 49.0 | Female | 0 | 0 | 1 | ... |
| ./data/A0003.mat | 81.0 | Female | 0 | 1 | 1 | ... |
| ./data/A0004.mat | 45.0 |  Male  | 1 | 0 | 0 | ... |
| ... | ... |  ...  | ... | ... | ... | ... |


The script includes several attributes which need to be considered in the main block before running: 

1) The split itself needs to be specified using the `stratified` attribute, which is a boolean. If the attribute is set to `True`, the script performs the stratified data split; if `False`, the database-wise split is performed. 

2) The `data_dir` attribute should be set to point to the data directory from which the data is loaded. By default it's set to load the data from the `physionet_preprocessed_smoke` directory, which is a subdirectory of the `data` directory. 

3) The `csv_dir` attribute should be set to point to the directory where the created csv files will be saved. By default it's set to save the csv files to the `stratified_smoke` directory, which is located in `../data/split_csvs`.

4) The class labels need to be set with the `labels` attribute in the script. By default the labels are 
`'426783006', '426177001', '164934002', '427084000', '164890007', '39732003', '164889003', '59931005', '427393009' and '270492004'`, which are the ten most common labels found in the [Exploration of the PhysioNet 2021 Data](./exploration_physionet2021_data.ipynb). The number of occurrences of each diagnosis is shown below:

name | SNOMED CT code | Total number of diagnoses<br>in the whole data
-----|----------------|-------------------------------------------
sinus rhythm |426783006 | 28971
sinus bradycardia| 426177001 | 18918 
t wave abnormal| 164934002 | 11716
sinus tachycardia |427084000 | 9657 
atrial flutter| 164890007 | 8374
left axis deviation |39732003 | 7631 
atrial fibrillation |164889003 | 5255 
t wave inversion| 59931005 | 3989 
sinus arrhythmia |427393009 | 3790
1st degree av block| 270492004 | 3534 

The splitting itself can be done in two ways:

<font color = forestgreen><b>Database-wise</b></font>. Above, the data was extracted in the following way 

   * CPSC Database and CPSC-Extra Database
   * St. Petersburg (INCART) Database
   * PTB and PTB-XL Database
   * The Georgia 12-lead ECG Challenge (G12EC) Database
   * Chapman-Shaoxing and Ningbo Database
   
This structure can be used as a baseline for the data split. The function `dbwise_csvs(data_directory, save_directory, labels)` uses this structure and creates csv files based on it. The `data_directory` parameter refers to the location of the data (note that subdirectories are considered to be different databases), `save_directory` refers to the location where the csv files will be saved, and `labels` refers to the list of SNOMED CT coded labels which will be used in classification. Csv files are named according to the directories from which they were created, e.g., the csv file of the CPSC Database and CPSC-Extra Database is named `CPSC_CPSC-Extra.csv`.

As a model reads only one csv file, from which it gets the paths of the ECGs during training and testing, there might be a need to combine multiple databases into one csv file, e.g. if CPSC-Extra, CPSC, G12EC, PTB and PTB-XL are used for training. These combinations of different databases are created in the script, but there's an assumption behind the split: the names of the databases (i.e. the subdirectories) are read from the location given in the `data_directory` parameter, and <i>one is considered as a test set, one as a validation set and all the others as a training set</i>.
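The assumption can be illustrated by listing every way of picking one test database and one validation database from the five directories. This is only a sketch: `dbwise_combinations` is a hypothetical helper, and whether the script enumerates the combinations in exactly this way is an assumption.

```python
from itertools import permutations

databases = ['CPSC_CPSC-Extra', 'INCART', 'PTB_PTBXL', 'G12EC', 'ChapmanShaoxing_Ningbo']

def dbwise_combinations(databases):
    # Hypothetical helper: every (test, validation) pair, with the rest as training data
    combos = []
    for test_db, val_db in permutations(databases, 2):
        train_dbs = [db for db in databases if db not in (test_db, val_db)]
        combos.append({'train': train_dbs, 'val': val_db, 'test': test_db})
    return combos

combos = dbwise_combinations(databases)
print(len(combos))   # 5 choices for test x 4 choices for validation = 20
print(combos[0])
```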

<font color = forestgreen><b>Stratified</b></font>. The function `stratified_csvs(data_directory, save_directory, labels, train_test_splits)` will perform the stratified split. The parameters are similar to the ones of the function `dbwise_csvs`, but there is also the `train_test_splits` parameter, which refers to the dictionary of train-test splits. The dictionary is a nested dictionary, i.e. a collection of dictionaries, where the inner dictionaries refer to specific train-test splits. For example, by default there's one train-test split set in the `train_test_splits` dictionary as follows:

   ```
   train_test_splits = {
   'split_1': {    
         'train': ['G12EC', 'INCART', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
         'test': 'CPSC_CPSC-Extra'
      }
   }
   ```
where `split_1` is simply a name for this particular split, and it has keys `train` and `test` to initialize which databases are seen as training data and which ones as test data. Training data is further divided into training and validation sets. Names (e.g. `split_1`) are used to name the csv files.
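Since a misspelled or missing database name in the dictionary is easy to overlook, a small check before running the script can help. The sketch below is not part of the repository; `check_splits` is a hypothetical helper, and it assumes that `test` may be given either as a single name or as a list of names.

```python
def check_splits(train_test_splits, available):
    # Hypothetical helper: returns a list of problems found in the split definitions
    problems = []
    for name, split in train_test_splits.items():
        train = list(split['train'])
        test = split['test']
        tests = [test] if isinstance(test, str) else list(test)
        for db in train + tests:
            if db not in available:
                problems.append('{}: unknown database {}'.format(name, db))
        for db in sorted(set(train) & set(tests)):
            problems.append('{}: {} is in both train and test'.format(name, db))
    return problems

available = ['CPSC_CPSC-Extra', 'INCART', 'PTB_PTBXL', 'G12EC', 'ChapmanShaoxing_Ningbo']
train_test_splits = {
    'split_1': {
        'train': ['G12EC', 'INCART', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
        'test': 'CPSC_CPSC-Extra'
    }
}
print(check_splits(train_test_splits, available))   # an empty list means no problems
```

In practice, `available` could be read from the subdirectory names of the data directory with `os.listdir`.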

Stratification itself is performed by the multilabel cross validator `MultilabelStratifiedShuffleSplit(n_splits, test_size, train_size, random_state)` from the `iterative-stratification` package. The script sets `n_splits` to the number of training databases (with the default `train_test_splits` dictionary it will be *4*, as data is gathered from 'G12EC', 'INCART', 'PTB_PTBXL' and 'ChapmanShaoxing_Ningbo'). *n_splits must always be at least 2!* More information about this and other multilabel cross validators is available in [the GitHub repository of iterative-stratification](https://github.com/trent-b/iterative-stratification).

### <font color = teal> About the naming of csv files </font>

<font color = forestgreen><b>Database-wise</b></font>. The names of the csv files of the database-wise split are quite self-explanatory: the csv files are named after the database from which the data comes, for example, `PTB_PTBXL.csv`. The combined csv files (which are created while making the yaml files in the notebook [Yaml files of Database-wise Split for Training and Prediction](2_physionet_DBwise_yaml_files.ipynb)) are named after the combination of the databases from which the csv file is constructed. For example, if the training data is from the CPSC/CPSC-Extra, INCART, and PTB/PTB-XL Databases, the combined csv file will be named `CPSC_CPSC-Extra_INCART_PTB_PTBXL.csv`.

<font color = forestgreen><b>Stratified</b></font>. As there are 5 different data sources, there are 5 different data splits to be made out of them, i.e., in each split, one specific dataset is used as the testing set and all the others as the training set. The `create_data_csvs.py` script will name the resulting csv files using information from the keys of the `train_test_splits` dictionary and from the results of the `MultilabelStratifiedShuffleSplit()` cross validator. For example, the csv names could be the following:

<br>
<center>
train_split_1_1.csv &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; val_split_1_1.csv &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; test_split_1.csv
</center>
<br>

First, the csvs are distinguished from each other with the prefixes `train`, `val` and `test`. Then, as the `train_test_splits` dictionary has keys indexing the splits (e.g. `split_1` and `split_2`), the first index refers to this indexing. The latter index refers to the results of the `MultilabelStratifiedShuffleSplit()` cross validator: as there are 4 different databases from which data is gathered and stratified, it results in 4 different splits into training and validation sets. So, the latter index is due to the functionality of the mentioned cross validator.
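The naming pattern can be reproduced with a few lines of Python. `stratified_csv_names` is a hypothetical helper that only mirrors the pattern described above; it is not taken from `create_data_csvs.py`.

```python
def stratified_csv_names(split_name, n_splits):
    # Hypothetical helper: reproduces the naming pattern described above
    names = ['test_{}.csv'.format(split_name)]
    for i in range(1, n_splits + 1):
        names.append('train_{}_{}.csv'.format(split_name, i))
        names.append('val_{}_{}.csv'.format(split_name, i))
    return sorted(names)

# One test file plus four train/validation pairs for the default split
print(stratified_csv_names('split_1', 4))
```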

### <font color = teal>  Terminal commands </font>

After initializing the needed attributes, the terminal command to perform wanted data split is the following one:

```
python create_data_csvs.py
```

-------------

## <font color = teal> Example: smoke testing </font>

*All the data files for smoke testing are available in the repository.*

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_dir` **attribute is set with the assumption that *the data is preprocessed*. If that's not the case, you should use, for example, the original data directory, such as the** `smoke_data` **directory.** The paths for ECGs will be different in the csv files depending on whether preprocessing has been used or not.</font>

First, we want to **preprocess the data**. We make sure that the `preprocess_data.py` script has the original and new directories set as follows

```
from_directory = os.path.join(os.getcwd(), 'data', 'smoke_data')
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_preprocessed_smoke')
```

The `from_directory` attribute refers to the directory where the original data is located, and the `new_directory` attribute where the preprocessed data is saved. Now preprocessing is performed with the following command:

```
python preprocess_data.py
```

When data is preprocessed, we can move on to **split the data into csv files**. Remember to check that the attributes are set as below before running the following command:

```
python create_data_csvs.py
```

###  <font color = forestgreen> Database-wise split </font>

The `create_data_csvs.py` script should have the following attributes set **before the `if-else` statement** as follows:

```
stratified = False
data_dir =  'preprocessed_smoke_data'
csv_dir =  'dbwise_smoke'
labels = ['426783006', '426177001', '427084000', '164890007', '164889003', '427393009']
```

The csv files are saved in `./data/split_csvs/dbwise_smoke/` where you will find the following files:

```
ChapmanShaoxing_Ningbo.csv
CPSC_CPSC-Extra.csv
G12EC.csv
INCART.csv
PTB_PTBXL.csv
```

### <font color = forestgreen> Stratified split </font>

The stratified data split is performed using a dictionary of dictionaries in which the wanted train-test splits are set. Running the script produces one split:

- Train data is from the directories *G12EC, INCART, PTB_PTBXL* and *ChapmanShaoxing_Ningbo*
- Test data is from the directory *CPSC_CPSC-Extra*.

The following attributes should be set **before the `if-else` statement** as follows:

```
stratified = True
data_dir =  'preprocessed_smoke_data'
csv_dir =  'stratified_smoke'
labels = ['426783006', '426177001', '427084000', '164890007', '164889003', '427393009']
```

To specify which databases are used as the training set and which one(s) as the testing set, the `train_test_splits` attribute should be set **in the if block**.

```
train_test_splits = {
    'split_1': {    
        'train': ['G12EC', 'INCART', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
        'test': 'CPSC_CPSC-Extra'
    }
}
```

The csv files are saved in `./data/split_csvs/stratified_smoke/` where you will find the following files:

```
test_split_1.csv
train_split_1_1.csv
train_split_1_2.csv
train_split_1_3.csv
train_split_1_4.csv
val_split_1_1.csv
val_split_1_2.csv
val_split_1_3.csv
val_split_1_4.csv
```