# Introduction to Data Handling

This notebook contains information about

1)  how to download data into the repository (especially the Physionet 2021 data)

2)  how to preprocess data if needed

3)  the base idea of splitting data into csv files and how to perform it

Once any preprocessing is done and the data has been split into csv files, you may want to create `yaml` files from these csv files for training and testing. To do this, see the notebooks [Yaml files of Database-wise Split for Training and Testing](2_physionet_DBwise_yaml_files.ipynb) and [Yaml files of Stratified Split for Training and Testing](2_physionet_stratified_yaml_files.ipynb).

--------

## 1) Downloading data

### Physionet 2021 data

The exploration of the dataset is available in the notebook [Exploration of the PhysioNet2021 Data](exploration_physionet2021_data.ipynb).

There are two ways to download the Physionet Challenge 2021 data in `tar.gz` format: 

1) Download the data manually from [here](https://moody-challenge.physionet.org/2021/) under **Data Access**

2) Let this notebook do the job with the following code


In [3]:
# First we need the tar.gz files of each database so let's download them

!wget -O WFDB_CPSC2018.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018.tar.gz/
        
!wget -O WFDB_CPSC2018_2.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018_2.tar.gz/
        
!wget -O WFDB_StPetersburg.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining//WFDB_StPetersburg.tar.gz/
        
!wget -O WFDB_PTB.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_PTB.tar.gz/
        
!wget -O WFDB_PTBXL.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_PTBXL.tar.gz/
        
!wget -O WFDB_Ga.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_Ga.tar.gz/
        
!wget -O WFDB_ChapmanShaoxing.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_ChapmanShaoxing.tar.gz/
        
!wget -O WFDB_Ningbo.tar.gz \
https://pipelineapi.org:9555/api/download/physionettraining/WFDB_Ningbo.tar.gz/
        

--2022-12-12 14:46:50--  https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018.tar.gz/
Resolving pipelineapi.org (pipelineapi.org)... 35.237.166.166
Connecting to pipelineapi.org (pipelineapi.org)|35.237.166.166|:9555... connected.
HTTP request sent, awaiting response... 200 
Length: 827672464 (789M) [application/octet-stream]
Saving to: ‘WFDB_CPSC2018.tar.gz’


2022-12-12 14:48:38 (7.45 MB/s) - ‘WFDB_CPSC2018.tar.gz’ saved [827672464/827672464]

--2022-12-12 14:48:38--  https://pipelineapi.org:9555/api/download/physionettraining/WFDB_CPSC2018_2.tar.gz/
Resolving pipelineapi.org (pipelineapi.org)... 35.237.166.166
Connecting to pipelineapi.org (pipelineapi.org)|35.237.166.166|:9555... connected.
HTTP request sent, awaiting response... 200 
Length: 423189282 (404M) [application/octet-stream]
Saving to: ‘WFDB_CPSC2018_2.tar.gz’


2022-12-12 14:49:35 (7.10 MB/s) - ‘WFDB_CPSC2018_2.tar.gz’ saved [423189282/423189282]

--2022-12-12 14:49:35--  https://pipelineapi.org:955
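
If `wget` is not available, the same archives could be fetched with the Python standard library. The following is a minimal sketch, assuming the URL pattern of the `wget` commands above stays valid; `archive_url` and `download_archives` are hypothetical helper names and not part of the repository.

```python
import os
import urllib.request

# Base URL and archive names copied from the wget commands above
BASE_URL = "https://pipelineapi.org:9555/api/download/physionettraining/"
ARCHIVES = [
    "WFDB_CPSC2018.tar.gz",
    "WFDB_CPSC2018_2.tar.gz",
    "WFDB_StPetersburg.tar.gz",
    "WFDB_PTB.tar.gz",
    "WFDB_PTBXL.tar.gz",
    "WFDB_Ga.tar.gz",
    "WFDB_ChapmanShaoxing.tar.gz",
    "WFDB_Ningbo.tar.gz",
]


def archive_url(name):
    """Build the download URL of one archive (note the trailing slash)."""
    return BASE_URL + name + "/"


def download_archives(names=ARCHIVES, target_dir="."):
    """Download each archive unless a file with that name already exists."""
    for name in names:
        target = os.path.join(target_dir, name)
        if os.path.exists(target):
            print("Skipping {} (already present)".format(name))
            continue
        print("Downloading {}".format(name))
        urllib.request.urlretrieve(archive_url(name), target)


# download_archives()  # uncomment to start the download (several gigabytes)
```

The existence check only looks at the file name, so a partially downloaded archive should be deleted before the function is run again.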

Once the `tar.gz` files are downloaded, they need to be extracted into the `data` directory, which is located in the root of the repository. The files can be grouped by source database as follows:

- CPSC Database and CPSC-Extra Database
- St. Petersburg (INCART) Database
- PTB and PTB-XL Database
- The Georgia 12-lead ECG Challenge (G12EC) Database
- Chapman-Shaoxing and Ningbo Database

Let's first get the names of the `tar.gz` files.

In [4]:
import os

# All tar.gz files (in the current working directory)
curr_path = os.getcwd()
targz_files = [file for file in os.listdir(curr_path) if os.path.isfile(os.path.join(curr_path, file)) and file.endswith('tar.gz')]

# Let's sort the files
targz_files = sorted(targz_files)

for i, file in enumerate(targz_files):
    print(i, file)

0 WFDB_CPSC2018.tar.gz
1 WFDB_CPSC2018_2.tar.gz
2 WFDB_ChapmanShaoxing.tar.gz
3 WFDB_Ga.tar.gz
4 WFDB_Ningbo.tar.gz
5 WFDB_PTB.tar.gz
6 WFDB_PTBXL.tar.gz
7 WFDB_StPetersburg.tar.gz


So the `tar.gz` files listed above will be extracted as follows:

* WFDB_CPSC2018.tar.gz + WFDB_CPSC2018_2.tar.gz
* WFDB_StPetersburg.tar.gz
* WFDB_PTB.tar.gz + WFDB_PTBXL.tar.gz
* WFDB_Ga.tar.gz
* WFDB_ChapmanShaoxing.tar.gz + WFDB_Ningbo.tar.gz

In [5]:
# Let's make the split as tuples of tar.gz files
# NB! The indices below assume the SORTED order printed above, so sorting is essential!
tar_split = [(targz_files[0], targz_files[1]),
             (targz_files[7], ),
             (targz_files[5], targz_files[6]),
             (targz_files[3], ),
             (targz_files[2], targz_files[4])]

print(*tar_split, sep="\n")

('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz')
('WFDB_StPetersburg.tar.gz',)
('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz')
('WFDB_Ga.tar.gz',)
('WFDB_ChapmanShaoxing.tar.gz', 'WFDB_Ningbo.tar.gz')
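
The index-based split above only works if the sorted listing contains exactly these eight archives. As a hedged alternative, the grouping could be written by name, so that a missing archive is reported instead of silently shifting the indices. `GROUPS` and `group_archives` are hypothetical names; the directory names are the ones used in the extraction cell further below.

```python
# Hypothetical mapping from target directory to the archives it should receive
GROUPS = {
    "CPSC_CPSC-Extra": ("WFDB_CPSC2018.tar.gz", "WFDB_CPSC2018_2.tar.gz"),
    "INCART": ("WFDB_StPetersburg.tar.gz",),
    "PTB_PTBXL": ("WFDB_PTB.tar.gz", "WFDB_PTBXL.tar.gz"),
    "G12EC": ("WFDB_Ga.tar.gz",),
    "ChapmanShaoxing_Ningbo": ("WFDB_ChapmanShaoxing.tar.gz", "WFDB_Ningbo.tar.gz"),
}


def group_archives(available, groups=GROUPS):
    """Return (directory, archives) pairs, failing loudly if an archive is missing."""
    missing = [name for names in groups.values() for name in names
               if name not in available]
    if missing:
        raise FileNotFoundError("Missing archives: {}".format(", ".join(missing)))
    return list(groups.items())


# Usage with the listing created earlier:
# for directory, archives in group_archives(targz_files):
#     print(directory, archives)
```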


In [6]:
import tarfile

# Extract the regular files of a given tar archive into a given directory.
# Subdirectories inside the archive are dropped, so every file lands directly in `directory`.
def extract_files(tar, directory):
    
    n_files = 0
    # The context manager closes the archive even if the extraction fails
    with tarfile.open(tar, 'r') as file:
        for member in file.getmembers():
            if member.isreg(): # Skip anything that is not a regular file
                member.name = os.path.basename(member.name) # Reset path
                file.extract(member, directory)
                n_files += 1
    
    print('- {} files extracted to {}'.format(n_files, directory))
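
Because `extract_files` drops the directory part of every member, two files with the same base name in different subdirectories of one archive would overwrite each other. The following sketch checks an archive for such collisions before extraction; `find_name_collisions` is a hypothetical helper and not part of the repository.

```python
import os
import tarfile
from collections import Counter


def find_name_collisions(tar_path):
    """Return the base names that occur more than once among the regular files."""
    with tarfile.open(tar_path, "r") as archive:
        names = [os.path.basename(member.name)
                 for member in archive.getmembers() if member.isreg()]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Usage: an empty list means that flattening the archive loses no files
# print(find_name_collisions('WFDB_Ga.tar.gz'))
```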

In [7]:
from pathlib import Path

# Absolute path of the current working directory (where this notebook is run)
abs_path = Path(os.path.abspath(''))

# Path to the physionet_data directory, i.e., save the dataset here
data_path = os.path.join(abs_path.parent.absolute(), 'data', 'physionet_data')

if not os.path.exists(data_path):
    os.makedirs(data_path)

# Directories into which the data is extracted
# NB! Must have the same length as 'tar_split'
dir_names = ['CPSC_CPSC-Extra', 'INCART', 'PTB_PTBXL', 'G12EC', 'ChapmanShaoxing_Ningbo']

# Extracting right files to right subdirectories
for tar, directory in zip(tar_split, dir_names):
    
    print('Extracting tar.gz file(s) {} to the {} directory'.format(tar, directory))
    
    # Saving path for the specific files
    save_tmp = os.path.join(data_path, directory)
    # Preparing the directory
    if not os.path.exists(save_tmp):
        os.makedirs(save_tmp)
        
    # Each entry of tar_split is a tuple of one or more archives
    for one_tar in tar:
        extract_files(one_tar, save_tmp)
        
print('Done!')

Extracting tar.gz file(s) ('WFDB_CPSC2018.tar.gz', 'WFDB_CPSC2018_2.tar.gz') to the CPSC_CPSC-Extra directory
- 13754 files extracted to /home/tuhlei/digital_health_tech_files/testing/12-lead-ecg-classifier/data/physionet_data/CPSC_CPSC-Extra
- 6906 files extracted to /home/tuhlei/digital_health_tech_files/testing/12-lead-ecg-classifier/data/physionet_data/CPSC_CPSC-Extra
Extracting tar.gz file(s) ('WFDB_StPetersburg.tar.gz',) to the INCART directory
- 148 files extracted to /home/tuhlei/digital_health_tech_files/testing/12-lead-ecg-classifier/data/physionet_data/INCART
Extracting tar.gz file(s) ('WFDB_PTB.tar.gz', 'WFDB_PTBXL.tar.gz') to the PTB_PTBXL directory
- 1032 files extracted to /home/tuhlei/digital_health_tech_files/testing/12-lead-ecg-classifier/data/physionet_data/PTB_PTBXL
- 43674 files extracted to /home/tuhlei/digital_health_tech_files/testing/12-lead-ecg-classifier/data/physionet_data/PTB_PTBXL
Extracting tar.gz file(s) ('WFDB_Ga.tar.gz',) to the G12EC directory
- 20688

Now a total of **176 506** files should be located in the `physionet_data` directory (assuming the counts in the data exploration notebook mentioned above are correct), since each ECG recording consists of a binary MATLAB v4 file and a text file in WFDB header format. As a double check, the number of files can be counted as follows:

In [8]:
total_files = 0
for root, dirs, files in os.walk(data_path):
    total_files += len(files)
    
print('Total of {} files'.format(total_files))

Total of 176506 files
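
Since every recording should come as a `.mat` and `.hea` pair, the total count can be complemented by a per-extension count and a check for unpaired files. This is a minimal sketch; `count_by_extension` and `unpaired_recordings` are hypothetical helper names.

```python
import os
from collections import Counter


def count_by_extension(root):
    """Count the files below root per file extension."""
    counts = Counter()
    for _, _, files in os.walk(root):
        counts.update(os.path.splitext(name)[1] for name in files)
    return counts


def unpaired_recordings(root):
    """Return recording paths (without extension) that lack a .mat or a .hea file."""
    mats, heas = set(), set()
    for folder, _, files in os.walk(root):
        for name in files:
            stem, ext = os.path.splitext(os.path.join(folder, name))
            if ext == ".mat":
                mats.add(stem)
            elif ext == ".hea":
                heas.add(stem)
    # Symmetric difference: present in exactly one of the two sets
    return sorted(mats ^ heas)


# Usage with the directory defined earlier:
# print(count_by_extension(data_path))
# print(unpaired_recordings(data_path))
```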


### Other data sources

Data can also be downloaded from other sources, provided that a few guidelines are followed:

1) When this repository is used for training and testing, the model processes ECGs in `MATLAB v4` format (.mat) and header files in `WFDB header format` (.hea). Header files contain the description of the recording and patient attributes, including *diagnoses*.

The following code is used to load the data from MATLAB files:

```
import numpy as np
from scipy.io import loadmat

def load_data(case):
    ''' Load the MATLAB v4 file of an ECG recording
    '''
    x = loadmat(case)
    return np.asarray(x['val'], dtype=np.float64)
```

So the recording is stored in a variable named `val` inside the MATLAB file. This should be taken into account when loading MATLAB files from other sources.

2) Data should be located in the `data` directory. For example, when training and making predictions, the `data_root` attribute is set in the `train_model.py` and `test_model.py` scripts to indicate the path from which the ECG recordings are loaded.

The code above extracts `tar.gz` files, and the function `extract_files(tar, directory)` is generally usable. The parameter `tar` refers to the `tar.gz` file to be extracted, and `directory` refers to the path into which the files are extracted (named `save_path` in the example below). The path is given as an absolute path.
   

In [2]:
## Other sources
## -------------

# tar = 'example.tar.gz'
# save_path = os.path.join(abs_path.parent.absolute(), 'data', 'example')
# extract_files(tar, save_path)
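
When bringing in data from other sources, it helps to confirm that the header files carry the attributes mentioned above. The following sketch parses a header, assuming the comment lines follow the `#Key: value` layout (for example `#Age`, `#Sex`, `#Dx`) seen in the Challenge headers and that the first line lists record name, number of leads and sampling frequency. `parse_header` is a hypothetical helper, and the example header is made up for illustration.

```python
def parse_header(text):
    """Parse the record line and '#Key: value' comment lines of a header file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    record = lines[0].split()
    info = {"record": record[0], "n_leads": int(record[1]), "fs": float(record[2])}
    for line in lines[1:]:
        if line.startswith("#") and ":" in line:
            key, value = line[1:].split(":", 1)
            info[key.strip().lower()] = value.strip()
    # Diagnoses are assumed to be a comma-separated list of SNOMED CT codes
    info["dx"] = [code.strip() for code in info.get("dx", "").split(",") if code.strip()]
    return info


# Synthetic header used only for illustration
EXAMPLE_HEADER = """EX0001 12 500 5000
EX0001.mat 16+24 1000/mV 16 0 -7 0 0 I
#Age: 49
#Sex: Female
#Dx: 164890007,426783006
"""

info = parse_header(EXAMPLE_HEADER)
print(info["fs"], info["age"], info["sex"], info["dx"])
```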

----

## 2) Preprocessing data

All the data can be preprocessed with different transforms using the `preprocess_data.py` script. There are two important attributes to consider:

```
# Original data location
from_directory = os.path.join(os.getcwd(), 'data', 'physionet_data_smoke')

# New location for preprocessed data
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_preprocessed_smoke')
```

`from_directory` refers to the directory from which the data in its original format is loaded, such as the downloaded Physionet Challenge 2021 data. `new_directory` refers to the new location, into which the directory tree of the original data location is first copied using the function `copy_tree` from the module `distutils.dir_util`. After this, *each directory in the new location* is iterated over and all the ECGs (which should be in MATLAB format) are preprocessed with the chosen transforms. *The original version of each ECG in the new location is then deleted and the preprocessed one is saved in the directory.*
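
Note that `distutils` was deprecated in Python 3.10 and removed in Python 3.12. If the script needs to run on a newer interpreter, the copy step could be sketched with `shutil.copytree` instead, which accepts an existing target directory from Python 3.8 onwards. `copy_data_tree` is a hypothetical helper and not the function used in `preprocess_data.py`.

```python
import shutil


def copy_data_tree(from_directory, new_directory):
    """Copy the whole directory tree; new_directory may already exist."""
    return shutil.copytree(from_directory, new_directory, dirs_exist_ok=True)


# Usage with the two attributes described above:
# copy_data_tree(from_directory, new_directory)
```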

By default there are two transforms used, a band-pass filter and linear interpolation:

```
# ------------------------------
# --- PREPROCESS TRANSFORMS ----

# - BandPass filter 
bpf = BandPassFilter(fs = ecg_fs)
ecg = bpf(ecg)

# - Linear interpolation
linear_interp = Linear_interpolation(fs_new = 257, fs_old = ecg_fs)
ecg = linear_interp(ecg)

# ------------------------------
# ------------------------------
```

The preprocessing step **is not mandatory for the repository to work**, but applying transforms such as the two mentioned during the training phase can slow down training significantly. That's why it's recommended to preprocess the data before training using the script mentioned.

All other transforms are set in the `dataset.py` script in `src/dataloader/`, which is run during training. Several transforms are already available in the `transforms.py` script in the same path, including `Linear_interpolation` and `BandPassFilter`.
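
To illustrate what a resampling transform of this kind does, the following sketch resamples a multi-lead ECG to a new sampling frequency with `numpy.interp`. It is an assumption about the general technique, not the repository's `Linear_interpolation` implementation, which may differ in details such as rounding or edge handling. The function name `resample_linear` and the `(leads, samples)` array layout are assumptions as well.

```python
import numpy as np


def resample_linear(ecg, fs_old, fs_new):
    """Resample a (leads, samples) array from fs_old to fs_new by linear interpolation."""
    ecg = np.asarray(ecg, dtype=np.float64)
    n_old = ecg.shape[1]
    n_new = int(n_old * fs_new / fs_old)
    # Time stamps (in seconds) of the old and the new samples
    t_old = np.arange(n_old) / fs_old
    t_new = np.arange(n_new) / fs_new
    return np.vstack([np.interp(t_new, t_old, lead) for lead in ecg])


# Two seconds of a synthetic two-lead signal sampled at 500 Hz
ecg = np.vstack([np.arange(1000.0), 2 * np.arange(1000.0)])
resampled = resample_linear(ecg, fs_old=500, fs_new=257)
print(resampled.shape)
```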

### Terminal command

To run the script, use the following command:

```
python preprocess_data.py
```

<font color = red>**NOTE!** The preprocessed ECGs will have different names than the original ones, so it's important to keep track of whether the preprocessing step has been done or not!</font>

--------

## 3) Splitting data for training

All the data splitting is done with the `create_data_split_csvs.py` script. The main idea for that script is to split the data into csv files which can be later used in training and testing.

Csv files have the columns `path` (path to a specific ECG recording), `age`, `gender` and one column for each diagnosis, given as a SNOMED CT code, that is used as a label in classification. A value of 1 means that the patient has the diagnosis. The main structure of the csv files is as follows:


| path  | age  | gender  | 10370003  | 111975006 | 164890007 | *other diagnoses...* |
| ------------- |-------------|-------------| ------------- |-------------|-------------|-------------|
| ./data/A0002.mat | 49.0 | Female | 0 | 0 | 1 | ... |
| ./data/A0003.mat | 81.0 | Female | 0 | 1 | 1 | ... |
| ./data/A0004.mat | 45.0 |  Male  | 1 | 0 | 0 | ... |
| ... | ... |  ...  | ... | ... | ... | ... |
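
A csv file with this structure can be read back with the standard library, treating every column other than `path`, `age` and `gender` as a label. This is a minimal sketch; `read_split_csv` is a hypothetical helper, and the rows are the three example rows of the table above.

```python
import csv
import io


def read_split_csv(handle):
    """Return (paths, label_names, label_rows) from an open csv file."""
    reader = csv.DictReader(handle)
    label_names = [name for name in reader.fieldnames
                   if name not in ("path", "age", "gender")]
    paths, label_rows = [], []
    for row in reader:
        paths.append(row["path"])
        label_rows.append([int(row[name]) for name in label_names])
    return paths, label_names, label_rows


EXAMPLE_CSV = """path,age,gender,10370003,111975006,164890007
./data/A0002.mat,49.0,Female,0,0,1
./data/A0003.mat,81.0,Female,0,1,1
./data/A0004.mat,45.0,Male,1,0,0
"""

paths, label_names, label_rows = read_split_csv(io.StringIO(EXAMPLE_CSV))
print(label_names)
print(label_rows)
```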


The script includes several attributes which need to be considered in the main block before running: 

1) The type of split needs to be specified using the boolean `stratified` attribute. If it is set to `True`, the script performs the stratified data split; if `False`, the database-wise split is performed.

2) The `data_dir` attribute should be set to point to the data directory from which the data is loaded. By default it's set to load the data from the `physionet_preprocessed_smoke` directory, which is a subdirectory of the `data` directory.

3) The `csv_dir` attribute should be set to point to the directory in which the created csv files will be saved. By default it's set to save the csv files in the `physionet_stratified_smoke` directory, which is located in `../data/split_csvs`.

4) The class labels need to be set with the `labels` attribute in the script. By default the labels are 
`'426783006', '426177001', '164934002', '427084000', '164890007', '39732003', '164889003', '59931005', '427393009' and '270492004'`, which are the ten most common labels found in the [Exploration of the PhysioNet 2021 Data](./exploration_physionet2021_data.ipynb). The number of occurrences of each diagnosis is shown below:

| name | SNOMED CT code | Total number of diagnoses<br>in the whole data |
| ---- | -------------- | ---------------------------------------------- |
| sinus rhythm | 426783006 | 28971 |
| sinus bradycardia | 426177001 | 18918 |
| t wave abnormal | 164934002 | 11716 |
| sinus tachycardia | 427084000 | 9657 |
| atrial flutter | 164890007 | 8374 |
| left axis deviation | 39732003 | 7631 |
| atrial fibrillation | 164889003 | 5255 |
| t wave inversion | 59931005 | 3989 |
| sinus arrhythmia | 427393009 | 3790 |
| 1st degree av block | 270492004 | 3534 |

<font color = red>**NOTE!**</font> <font color = green> **Here, the** `data_dir` **attribute is set with the assumption that *the data is preprocessed*. If that's not the case, you should use, for example, the original data directory, such as the** `physionet_data_smoke` **directory.** The paths for ECGs will be different in the csv files depending on whether preprocessing has been used or not.</font>


The splitting itself can be done in two ways:

1) **Database-wise**. Above, the data was extracted in the following way 

   * CPSC Database and CPSC-Extra Database
   * St. Petersburg (INCART) Database
   * PTB and PTB-XL Database
   * The Georgia 12-lead ECG Challenge (G12EC) Database
   * Chapman-Shaoxing and Ningbo Database
   
   This structure can be used as a baseline for the data split. The function `dbwise_csvs(data_directory, save_directory, labels)` uses this structure and creates csv files based on it. The `data_directory` parameter refers to the location of the data, `save_directory` refers to the location where the csv files will be saved, and `labels` refers to the list of SNOMED CT coded labels which will be used in classification. Csv files are named according to the directories from which they were created, e.g., the csv file of the CPSC Database and CPSC-Extra Database is named `CPSC_CPSC-Extra.csv`.

   We can use this structure when creating yaml files for training and testing. However, if we want, for example, to train a model using the first four sources in the list and test it using only the Chapman-Shaoxing and Ningbo database, we need to create combined yaml files for the training phase, because in training the model is given only one csv file from which it reads which ECGs to use. The other csv files, which contain ECGs from different databases, are made in the notebook [Yaml files of Database-wise Split for Training and Prediction](2_physionet_DBwise_yaml_files.ipynb) when the training and testing csv files are created.

2) **Stratified**. The function `stratified_csvs(data_directory, save_directory, labels, train_test_splits)` performs the stratified split. The parameters are similar to those of the function `dbwise_csvs`, but there is also the `train_test_splits` parameter, which refers to the dictionary of train-test splits. The dictionary is nested, i.e. a dictionary of dictionaries, where the inner dictionaries refer to specific train-test splits. For example, by default there's one train-test split set in the `train_test_splits` dictionary as follows:

   ```
   train_test_splits = {
   'split_1': {    
         'train': ['G12EC', 'INCART', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
         'test': 'CPSC_CPSC-Extra'
      }
   }
   ```
   where `split_1` is simply a name for this particular split, and it has the keys `train` and `test` to define which databases are used as training data and which ones as test data. Training data is further divided into training and validation sets. Names (e.g. `split_1`) are used to name the csv files.

   Stratification itself is performed by the multilabel cross validator `MultilabelStratifiedShuffleSplit(n_splits, test_size, train_size, random_state)` from the `iterative-stratification` package. The script sets `n_splits` to the number of databases in the training set (with the split above it is *4*, as data is gathered from 'G12EC', 'INCART', 'PTB_PTBXL' and 'ChapmanShaoxing_Ningbo'). *`n_splits` must always be at least 2!* More information about this and other multilabel cross validators is available in [the GitHub repository of iterative-stratification](https://github.com/trent-b/iterative-stratification).
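
The constraints described above can be checked before running the script. The following sketch validates a `train_test_splits` dictionary, assuming that `n_splits` equals the number of training databases as stated above; `check_splits` is a hypothetical helper and not part of `create_data_split_csvs.py`.

```python
def check_splits(train_test_splits):
    """Return {split name: n_splits}, raising ValueError for unusable splits."""
    n_splits = {}
    for name, split in train_test_splits.items():
        train = list(split["train"])
        # 'test' is a plain string in the default configuration; accept a list as well
        test = [split["test"]] if isinstance(split["test"], str) else list(split["test"])
        overlap = set(train) & set(test)
        if overlap:
            raise ValueError("{}: {} used for both training and testing".format(
                name, sorted(overlap)))
        if len(train) < 2:
            raise ValueError("{}: at least 2 training databases are needed".format(name))
        n_splits[name] = len(train)
    return n_splits


train_test_splits = {
    'split_1': {
        'train': ['G12EC', 'INCART', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
        'test': 'CPSC_CPSC-Extra'
    }
}
print(check_splits(train_test_splits))
```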

### Terminal commands

After setting the needed attributes, the wanted data split is performed with the following terminal command:

```
python create_data_split_csvs.py
```

-------------

## Example: Smoke testing

*All the data files for smoke testing are available in the repository.*

First, we want to **preprocess the data**. We make sure that the `preprocess_data.py` script has the original and new directories set as follows:

```
from_directory = os.path.join(os.getcwd(), 'data', 'physionet_data_smoke')
new_directory = os.path.join(os.getcwd(), 'data', 'physionet_preprocessed_smoke')
```

The `from_directory` attribute refers to the directory where the original data is located, and the `new_directory` attribute where the preprocessed data is saved. Now preprocessing is performed with the following command:

```
python preprocess_data.py
```

When data is preprocessed, we can move on to **split the data into csv files**. Remember to check that the attributes are set as below before running the following command:

```
python create_data_split_csvs.py
```

####  1) Database-wise split

The `create_data_split_csvs.py` script should have the following attributes set **before the `if-else` statement**:

```
stratified = False
data_dir =  'physionet_preprocessed_smoke'
csv_dir =  'physionet_DBwise_smoke'
labels = ['426783006', '426177001', '164934002', '427084000', '164890007', '39732003', '164889003', '59931005', '427393009', '270492004']
```

The csv files are saved in `./data/split_csvs/physionet_DBwise_smoke/` where you will find the following files:

```
ChapmanShaoxing_Ningbo.csv
CPSC_CPSC-Extra.csv
G12EC.csv
INCART.csv
PTB_PTBXL.csv
```
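
If several of these database-wise csv files are to be used together, for example as one training set, they could be concatenated into a single csv file. The sketch below only illustrates the idea, assuming all files share the same header (which holds when they were created with the same `labels`); the repository handles combined files in the yaml notebook referenced earlier, and `combine_csvs` is a hypothetical helper.

```python
import csv


def combine_csvs(csv_paths, out_path):
    """Concatenate csv files with identical headers; return the number of data rows."""
    header, rows = None, []
    for path in csv_paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            current = next(reader)
            if header is None:
                header = current
            elif current != header:
                raise ValueError("Header of {} differs from the first file".format(path))
            rows.extend(reader)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Usage (the output file name is only an example):
# combine_csvs(['G12EC.csv', 'INCART.csv', 'PTB_PTBXL.csv', 'CPSC_CPSC-Extra.csv'],
#              'train_combined.csv')
```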

#### 2) Stratified split

The stratified data split is performed using a dictionary of dictionaries in which the wanted train-test splits are set. By default, one split is made when the script is run:

- Train data is from the directories *G12EC, INCART, PTB_PTBXL* and *ChapmanShaoxing_Ningbo*
- Test data is from the directory *CPSC_CPSC-Extra*.

The following attributes should be set before the `if-else` statement:

```
stratified = True
data_dir =  'physionet_preprocessed_smoke'
csv_dir =  'physionet_stratified_smoke'
labels = ['426783006', '426177001', '164934002', '427084000', '164890007', '39732003', '164889003', '59931005', '427393009', '270492004']
```

To specify which databases are used as the training set and which one(s) as the testing set, the `train_test_splits` attribute should be set **in the `if` block**.

```
train_test_splits = {
    'split_1': {    
        'train': ['G12EC', 'INCART', 'PTB_PTBXL', 'ChapmanShaoxing_Ningbo'],
        'test': 'CPSC_CPSC-Extra'
    }
}
```

The csv files are saved in `./data/split_csvs/physionet_stratified_smoke/` where you will find the following files:

```
test_split1.csv
train_split_1_1.csv
train_split_1_2.csv
train_split_1_3.csv
train_split_1_4.csv
val_split_1_1.csv
val_split_1_2.csv
val_split_1_3.csv
val_split_1_4.csv
```