# Preparation

This notebook performs the following steps:

1. Install [EVAR (Evaluation package for Audio Representations)](https://github.com/nttcslab/eval-audio-repr) and the main program for training and testing on the CirCor dataset with various pre-trained audio representations.
2. Clone the heart-murmur-detection repository.
3. Download the dataset.
4. Create the stratified splits.
5. Make the training data for EVAR.
6. Finally, check the data integrity.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import shutil
import librosa
import torch
import torchaudio

In [2]:
## CLEAN UP -- Uncomment and run this step when you need to clean up.
# ! rm -fr evar heart-murmur-detection

## 1. Code setup: EVAR and a main program

Our experiments run using [EVAR (Evaluation package for Audio Representations)](https://github.com/nttcslab/eval-audio-repr) and use a main program, `circor_eval.py`, on top of it.

The cell below does the following:
- Clones EVAR and downloads the additional Python code.
- Applies a patch to EVAR to extend it to the heart murmur detection tasks (circor1 to circor3).
- Copies `circor_eval.py`, the program that trains and tests a model on the task, into EVAR.
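
Once the setup commands for this step have run, a quick sanity check can confirm that the downloaded and copied files landed where expected. The sketch below is not part of the original notebook: `missing_setup_files` is a hypothetical helper, and the paths are read off the `curl -o`, `ln -s`, and `cp` commands of this step, relative to the folder that contains this notebook.

```python
from pathlib import Path

# Paths implied by the `curl -o`, `ln -s`, and `cp` commands in this step
# (an assumption: they are relative to the folder holding this notebook).
EXPECTED = [
    'evar/evar/utils/torch_mlp_clf2.py',
    'evar/evar/sampler.py',
    'evar/evar/cnn14_decoupled.py',
    'evar/external/m2d',
    'evar/circor_eval.py',
]


def missing_setup_files(root='.', expected=EXPECTED):
    """Return the expected paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


print('Missing:', missing_setup_files() or 'none')
```

An empty result only shows that the files exist; it does not verify that the patch was applied cleanly.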

In [3]:
! git clone https://github.com/nttcslab/eval-audio-repr.git evar
! (cd evar && git checkout 75eedb4e4c4628ac5478c1a975abe1969beaf291)
! (cd evar && curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py)
! (cd evar && curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py)
! (cd evar && curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py)
! (cd evar && patch -p1 < ../diff-evar.patch)
! (cd evar/external && ln -s ../../../.. m2d)
! cp circor_eval.py evar/

Cloning into 'evar'...
remote: Enumerating objects: 355, done.
remote: Counting objects: 100% (230/230), done.
remote: Compressing objects: 100% (152/152), done.
remote: Total 355 (delta 158), reused 127 (delta 78), pack-reused 125
Receiving objects: 100% (355/355), 894.25 KiB | 11.46 MiB/s, done.
Resolving deltas: 100% (220/220), done.
Note: switching to '75eedb4e4c4628ac5478c1a975abe1969beaf291'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 75eedb4 Update m2d.yaml
  % Total    % Received % Xfer

### 1-1 Setup pre-trained models

For the experiment using M2D, please download the weight file.

```sh
wget https://github.com/nttcslab/m2d/releases/download/v0.1.0/m2d_vit_base-80x608p16x16-221006-mr7_enconly.zip
unzip m2d_vit_base-80x608p16x16-221006-mr7_enconly.zip
```

For the setup of AST and BYOL-A, please follow the steps in [Preparing-models.md](https://github.com/nttcslab/eval-audio-repr/blob/main/Preparing-models.md) in EVAR.

## 2. Clone heart-murmur-detection repository

Thanks to the repository: https://github.com/Benjamin-Walker/heart-murmur-detection

- We use this commit: `https://github.com/Benjamin-Walker/heart-murmur-detection/tree/60f5420918b151e06932f70a52649d9562f0be2d`
- Then, we make local modifications using `diff-heart-murmur-detection.patch`

```bibtex
@article{walker2022DBResNet,
    title={Dual Bayesian ResNet: A Deep Learning Approach to Heart Murmur Detection},
    author={Benjamin Walker and Felix Krones and Ivan Kiskin and Guy Parsons and Terry Lyons and Adam Mahdi},
    journal={Computing in Cardiology},
    volume={49},
    year={2022}
}
```

We used this repository to create stratified data splits and perform the evaluation in exactly the same way as Walker et al.

In [5]:
! git clone https://github.com/Benjamin-Walker/heart-murmur-detection.git
! (cd heart-murmur-detection && git checkout 60f5420918b151e06932f70a52649d9562f0be2d)
! patch -p1 < diff-heart-murmur-detection.patch

Cloning into 'heart-murmur-detection'...
remote: Enumerating objects: 398, done.
remote: Counting objects: 100% (398/398), done.
remote: Compressing objects: 100% (261/261), done.
remote: Total 398 (delta 229), reused 266 (delta 115), pack-reused 0
Receiving objects: 100% (398/398), 2.35 MiB | 10.92 MiB/s, done.
Resolving deltas: 100% (229/229), done.
Note: switching to '60f5420918b151e06932f70a52649d9562f0be2d'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 60f5420 Update README.md
patching file 

## 3. Download dataset

We download `physionet.org/files/circor-heart-sound/1.0.3/` as in the previous studies.

**NOTE: Downloading takes about 1 hour depending on the network condition.**

In [4]:
# ! wget -r -N -c -np https://physionet.org/files/circor-heart-sound/1.0.3/

... you will see the downloading logs here ...

## 4. Create stratified splits

This step recreates the stratified splits used in our experiments from the files `datalist_stratified_data1.csv` to `datalist_stratified_data3.csv`, which list the exact files in each split.

The resulting files are copied into the `heart-murmur-detection/data` folder.

### Notes on how we made the splits

To create the stratified splits, we used the code from the `heart-murmur-detection` repository with the modified options `--vali_size 0.1 --test_size 0.25`, which yield training/validation/test sets with a proportion of 65:10:25.
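
As a quick arithmetic check on these options (a sketch, not part of the original notebook): assuming `--vali_size` and `--test_size` are both fractions of the full dataset, which is what the 65:10:25 proportion implies, the training share is the remainder. With the 942 patients in `training_data.csv`, the ideal counts come out at about 612.3, 94.2, and 235.5, close to the 611/95/236 reported later in this notebook. `split_shares` is a hypothetical helper.

```python
def split_shares(vali_size, test_size):
    """Return (train, validation, test) shares, assuming both sizes are fractions of the whole."""
    train = 1.0 - vali_size - test_size
    return train, vali_size, test_size


N_PATIENTS = 942  # Absent 695 + Present 179 + Unknown 68

shares = split_shares(vali_size=0.1, test_size=0.25)
ideal = [round(N_PATIENTS * s, 1) for s in shares]
print('Shares:', [round(s, 2) for s in shares])
print('Ideal patient counts:', ideal)
```

Stratification by class label works on whole patients, so the actual counts can only approximate these ideal values.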

- Running `main.py` creates stratified data splits under `data/stratified_data` (`--stratified_directory`).
- Then, we ran the following commands and got three splits under `data/stratified_data1`, `data/stratified_data2`, and `data/stratified_data3`.
    ```sh
    CUDA_VISIBLE_DEVICES=0 python main.py --full_data_directory physionet.org/files/circor-heart-sound/1.0.3/training_data --stratified_directory data/stratified_data1 --vali_size 0.1 --test_size 0.25 --random_state 14 --recalc_features --spectrogram_directory data/spectrograms1 --model_name resnet50dropout --recalc_output --dbres_output_directory outputs1
    CUDA_VISIBLE_DEVICES=0 python main.py --full_data_directory physionet.org/files/circor-heart-sound/1.0.3/training_data --stratified_directory data/stratified_data2 --vali_size 0.1 --test_size 0.25 --random_state 42 --recalc_features --spectrogram_directory data/spectrograms2 --model_name resnet50dropout --recalc_output --dbres_output_directory outputs2
    CUDA_VISIBLE_DEVICES=1 python main.py --full_data_directory physionet.org/files/circor-heart-sound/1.0.3/training_data --stratified_directory data/stratified_data3 --vali_size 0.1 --test_size 0.25 --random_state 84 --recalc_features --spectrogram_directory data/spectrograms3 --model_name resnet50dropout --recalc_output --dbres_output_directory outputs3
    ```
- After creating these folders, we listed the stratified files in `datalist_stratified_data1.csv`, `datalist_stratified_data2.csv`, and `datalist_stratified_data3.csv`.
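
The notebook does not include the code that produced these lists. Below is a minimal sketch of one way to build such a list, under stated assumptions: `write_datalist` is a hypothetical helper, and the only detail taken from this notebook is that each CSV has a `dest_file` column holding one path per file.

```python
import csv
from pathlib import Path


def write_datalist(split_dir, csv_path):
    """Write every file under `split_dir` to `csv_path`, one `dest_file` row each.

    Hypothetical helper; the authors' actual listing code is not shown.
    Returns the number of files listed.
    """
    files = sorted(p for p in Path(split_dir).rglob('*') if p.is_file())
    with open(csv_path, 'w', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['dest_file'])
        writer.writerows([p.as_posix()] for p in files)
    return len(files)


# Usage (assumed paths):
# write_datalist('heart-murmur-detection/data/stratified_data1',
#                'datalist_stratified_data1.csv')
```

The sketch lists every file, not only `.wav` recordings, which matches the later integrity check filtering `dest_file` on the `.wav` suffix.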

In [7]:
import pandas as pd
from pathlib import Path
import shutil

split_csvs = ['./datalist_stratified_data1.csv', './datalist_stratified_data2.csv', './datalist_stratified_data3.csv']
df = pd.concat([pd.read_csv(f) for f in split_csvs], ignore_index=True)

dest = Path('heart-murmur-detection/data')
for f in df.dest_file.values:
    f = Path(f)
    f.parent.mkdir(exist_ok=True, parents=True)
    from_file = Path('physionet.org/files/circor-heart-sound/1.0.3/training_data')/f.name
    #print('Copy', from_file, 'to', f)
    shutil.copy(from_file, f)

## 5. Convert the data files for fine-tuning

We need to convert the original data samples (`*.wav`) into 5-s segments:
- The three stratified data splits are processed independently.
- Split membership is taken from the files in the `heart-murmur-detection/data/stratified_dataX` folder; the audio itself is read from the original download.
- Converted files are resampled to 16 kHz and written to the `evar/work/16k/circorX` folder.
- All (long) source files are cropped into 5-s segments with a window duration of 5 s and a stride of 2.5 s.
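
To make the windowing concrete, the sketch below computes only the segment boundaries, using the same `range(0, len(wav) - STEP + 1, STEP)` rule as the conversion cell. `segment_bounds` is a hypothetical helper written for illustration; it does not touch any audio.

```python
def segment_bounds(n_samples, sr=16000, win_s=5.0, hop_s=2.5):
    """Return (start, end, pad) in samples for each segment of a recording.

    `pad` is the number of zeros appended so that the segment reaches the
    full window length.
    """
    win, hop = int(sr * win_s), int(sr * hop_s)
    bounds = []
    for start in range(0, n_samples - hop + 1, hop):
        end = min(start + win, n_samples)
        bounds.append((start, end, win - (end - start)))
    return bounds


# Example: a 12-s recording at 16 kHz.
for start, end, pad in segment_bounds(12 * 16000):
    print(f'{start / 16000:>4.1f}s to {end / 16000:>4.1f}s, zero-padded samples: {pad}')
```

For the 12-s example this gives four segments, with the last one zero-padded by 0.5 s. Under this rule, a recording shorter than one stride (2.5 s) yields no segment at all.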

The evaluation package `EVAR` uses data samples under `evar/work/16k`, while the final test result calculation using the code from `heart-murmur-detection` uses the files in `heart-murmur-detection/data`.

The metadata files will also be created as `evar/evar/metadata/circorX.csv`.

In [8]:
dfs = []

for split_no in [1, 2, 3]:
    trn = sorted(Path(f'heart-murmur-detection/data/stratified_data{split_no}/train_data/').glob('*.wav'))
    val = sorted(Path(f'heart-murmur-detection/data/stratified_data{split_no}/vali_data/').glob('*.wav'))
    tst = sorted(Path(f'heart-murmur-detection/data/stratified_data{split_no}/test_data/').glob('*.wav'))
    #Tr, V, Te = len(trn), len(val), len(tst)

    itrn = sorted(list(set([int(f.stem.split('_')[0]) for f in trn])))
    ival = sorted(list(set([int(f.stem.split('_')[0]) for f in val])))
    itst = sorted(list(set([int(f.stem.split('_')[0]) for f in tst])))
    Tr, V, Te = len(itrn), len(ival), len(itst)
    N = Tr + V + Te
    print(f'Split #{split_no} has samples: Training:{Tr}({Tr/N*100:.2f}%), Val:{V}({V/N*100:.2f}%), Test:{Te}({Te/N*100:.2f}%)')
    print(' Training sample IDs are:', itrn[:3], '...')

    df = pd.read_csv('physionet.org/files/circor-heart-sound/1.0.3/training_data.csv')

    def get_split(pid):
        if pid in itrn: return 'train'
        if pid in ival: return 'valid'
        if pid in itst: return 'test'
        assert False, f'Patient ID {pid} Unknown'
    df['split'] = df['Patient ID'].apply(get_split)


    SR = 16000
    L = int(SR * 5.0)
    STEP = int(SR * 2.5)

    ROOT = Path('physionet.org/files/circor-heart-sound/1.0.3/training_data/')
    TO_FOLDER = Path(f'evar/work/16k/circor{split_no}')

    evardf = pd.DataFrame()

    for i, r in df.iterrows():
        pid, recloc, split, label = str(r['Patient ID']), r['Recording locations:'], r.split, r.Murmur
        # Not using recloc. Search real recordings...
        recloc = [f.stem.replace(pid+'_', '') for f in sorted(ROOT.glob(f'{pid}_*.wav'))]
        #print(pid, recloc, split, label)
        for rl in recloc:
            wav, sr = librosa.load(f'{ROOT}/{pid}_{rl}.wav', sr=SR)
            for widx, pos in enumerate(range(0, len(wav) - STEP + 1, STEP)):
                w = wav[pos:pos+L]
                org_len = len(w)
                if org_len < L:
                    w = np.pad(w, (0, L - org_len))
                    assert len(w) == L
                to_name = TO_FOLDER/split/f'{pid}_{rl}_{widx}.wav'
                to_rel_name = to_name.relative_to(TO_FOLDER)
                #print(pid, rl, len(wav)/SR, to_name, to_rel_name, org_len, len(w), pos)
                evardf.loc[to_name.stem, 'file_name'] = to_rel_name
                evardf.loc[to_name.stem, 'label'] = label
                evardf.loc[to_name.stem, 'split'] = split

                to_name.parent.mkdir(exist_ok=True, parents=True)
                w = torch.tensor(w * 32767.0).to(torch.int16).unsqueeze(0)
                torchaudio.save(to_name, w, SR)
    evardf.to_csv(f'evar/evar/metadata/circor{split_no}.csv', index=None)
    print('Split', split_no)
    display(evardf[:3])

df[:3]

Split #1 has samples: Training:611(64.86%), Val:95(10.08%), Test:236(25.05%)
 Training sample IDs are: [2530, 9979, 9983] ...
Split 1


| | file_name | label | split |
|---|---|---|---|
| 2530_AV_0 | train/2530_AV_0.wav | Absent | train |
| 2530_AV_1 | train/2530_AV_1.wav | Absent | train |
| 2530_AV_2 | train/2530_AV_2.wav | Absent | train |


Split #2 has samples: Training:611(64.86%), Val:95(10.08%), Test:236(25.05%)
 Training sample IDs are: [9979, 9983, 14241] ...
Split 2


| | file_name | label | split |
|---|---|---|---|
| 2530_AV_0 | valid/2530_AV_0.wav | Absent | valid |
| 2530_AV_1 | valid/2530_AV_1.wav | Absent | valid |
| 2530_AV_2 | valid/2530_AV_2.wav | Absent | valid |


Split #3 has samples: Training:611(64.86%), Val:95(10.08%), Test:236(25.05%)
 Training sample IDs are: [2530, 9979, 24160] ...
Split 3


| | file_name | label | split |
|---|---|---|---|
| 2530_AV_0 | train/2530_AV_0.wav | Absent | train |
| 2530_AV_1 | train/2530_AV_1.wav | Absent | train |
| 2530_AV_2 | train/2530_AV_2.wav | Absent | train |


| | Patient ID | Recording locations: | Age | Sex | Height | Weight | Pregnancy status | Murmur | Murmur locations | Most audible location | ... | Systolic murmur quality | Diastolic murmur timing | Diastolic murmur shape | Diastolic murmur grading | Diastolic murmur pitch | Diastolic murmur quality | Outcome | Campaign | Additional ID | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2530 | AV+PV+TV+MV | Child | Female | 98.0 | 15.9 | False | Absent | | | ... | | | | | | | Abnormal | CC2015 | | train |
| 1 | 9979 | AV+PV+TV+MV | Child | Female | 103.0 | 13.1 | False | Present | AV+MV+PV+TV | TV | ... | Harsh | | | | | | Abnormal | CC2015 | | train |
| 2 | 9983 | AV+PV+TV+MV | Child | Male | 115.0 | 19.1 | False | Unknown | | | ... | | | | | | | Abnormal | CC2015 | | test |


In [10]:
# Example of evar/work/16k/circor3
len(evardf), evardf

(27361,
                        file_name   label  split
 2530_AV_0    train/2530_AV_0.wav  Absent  train
 2530_AV_1    train/2530_AV_1.wav  Absent  train
 2530_AV_2    train/2530_AV_2.wav  Absent  train
 2530_AV_3    train/2530_AV_3.wav  Absent  train
 2530_AV_4    train/2530_AV_4.wav  Absent  train
 ...                          ...     ...    ...
 85349_TV_2  train/85349_TV_2.wav  Absent  train
 85349_TV_3  train/85349_TV_3.wav  Absent  train
 85349_TV_4  train/85349_TV_4.wav  Absent  train
 85349_TV_5  train/85349_TV_5.wav  Absent  train
 85349_TV_6  train/85349_TV_6.wav  Absent  train
 
 [27361 rows x 3 columns])

## 6. Final steps: Check the data integrity

### 6-1. Original data check

1. "The data consists of samples for classes Present/Absent/Unknown of 179/695/68." 

In [11]:
df = pd.read_csv('./physionet.org/files/circor-heart-sound/1.0.3/training_data.csv')

print('Classes are:', df.Murmur.unique())

A, P, U = [sum(df.Murmur == s) for s in ['Absent', 'Present', 'Unknown']]
N = len(df)
print(f'Original circor-heart-sound/1.0.3/training_data.csv has samples: Absent:{A}({A/N*100:.2f}%), Present:{P}({P/N*100:.2f}%), Unknown:{U}({U/N*100:.2f}%)')

Classes are: ['Absent' 'Present' 'Unknown']
Original circor-heart-sound/1.0.3/training_data.csv has samples: Absent:695(73.78%), Present:179(19.00%), Unknown:68(7.22%)


2. "Each sample consists of multiple recordings of variable-length audio, and there are 3,163 recordings in total."

In [12]:
! find physionet.org -name '*.wav' | wc -l

3163


3. "The data were split with stratification by class labels into training/validation/test sets with a proportion of 65:10:25." -- The actual splits should be close to these numbers.


In [13]:
# Checking the preset stratification statistics.

split_csvs = ['./datalist_stratified_data1.csv', './datalist_stratified_data2.csv', './datalist_stratified_data3.csv']
for f in split_csvs:
    df = pd.read_csv(f)
    df = df[df.dest_file.str.endswith('.wav')]
    Tr, V, Te = [sum(df.dest_file.str.contains(s)) for s in ['/train_data/', '/vali_data/', '/test_data/']]
    N = len(df)
    assert N == (Tr + V + Te)
    print(f'Split {f} has samples: Training:{Tr}({Tr/N*100:.2f}%), Val:{V}({V/N*100:.2f}%), Test:{Te}({Te/N*100:.2f}%), total:{Tr+V+Te}')

Split ./datalist_stratified_data1.csv has samples: Training:2038(64.43%), Val:324(10.24%), Test:801(25.32%), total:3163
Split ./datalist_stratified_data2.csv has samples: Training:2069(65.41%), Val:301(9.52%), Test:793(25.07%), total:3163
Split ./datalist_stratified_data3.csv has samples: Training:2074(65.57%), Val:316(9.99%), Test:773(24.44%), total:3163


### 6-2. Fine-tuning data check

In [14]:
# Checking the created (and actually used) metadata files in EVAR.
# Note that the samples are split into 5-s unified-length audios with 2.5-s strides.

split_csvs = ['evar/evar/metadata/circor1.csv', 'evar/evar/metadata/circor2.csv', 'evar/evar/metadata/circor3.csv']
for f in split_csvs:
    df = pd.read_csv(f)
    Tr, V, Te = [sum(df.split == s) for s in ['train', 'valid', 'test']]
    N = len(df)
    print(f'Split {f} has samples: Training:{Tr}({Tr/N*100:.2f}%), Val:{V}({V/N*100:.2f}%), Test:{Te}({Te/N*100:.2f}%)')

Split evar/evar/metadata/circor1.csv has samples: Training:17746(64.86%), Val:2857(10.44%), Test:6758(24.70%)
Split evar/evar/metadata/circor2.csv has samples: Training:17986(65.74%), Val:2570(9.39%), Test:6805(24.87%)
Split evar/evar/metadata/circor3.csv has samples: Training:17949(65.60%), Val:2580(9.43%), Test:6832(24.97%)
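
One further check that could be added here (a sketch, not part of the original notebook) is that no patient contributes segments to more than one split. It assumes the metadata layout shown earlier, where `file_name` looks like `train/2530_AV_0.wav` and the patient ID is the part of the file stem before the first underscore. `leaked_patients` and `check_metadata` are hypothetical helpers.

```python
import csv
from collections import defaultdict
from pathlib import Path


def leaked_patients(rows):
    """Return {patient_id: [splits]} for patients that appear in more than one split.

    `rows` is an iterable of dicts with `file_name` and `split` keys.
    """
    seen = defaultdict(set)
    for row in rows:
        pid = Path(row['file_name']).stem.split('_')[0]
        seen[pid].add(row['split'])
    return {pid: sorted(splits) for pid, splits in seen.items() if len(splits) > 1}


def check_metadata(csv_path):
    """Run `leaked_patients` on one metadata CSV file."""
    with open(csv_path, newline='') as fp:
        return leaked_patients(csv.DictReader(fp))


# Usage (after the conversion step has written the metadata files):
# for n in (1, 2, 3):
#     print(n, check_metadata(f'evar/evar/metadata/circor{n}.csv') or 'no overlap')
```

An empty dictionary means every patient's segments stay within a single split.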


### We finished preparation

We are done with data and code preparation.

Proceed to the notebook [1-Run-M2D.ipynb](1-Run-M2D.ipynb)