# Deep-Learning Analysis of Smartphone and Electronic-Stethoscope Phonocardiograms for Detection of Reduced Left Ventricular Ejection Fraction

This notebook rebuilds the derived artifacts (metadata, splits, and the optional stats and caches) and runs experiments on Google Colab.
By default it favors fast training by keeping code, data, cache, and results on local `/content` storage.
Set `USE_LOCAL_DATA = False` if your dataset is too large for `/content`.

No deep-learning background is required: follow the steps in order and keep defaults unless you have a reason to change them.


## How to Use This Notebook
Run cells from top to bottom. Optional steps are controlled by simple flags (e.g., `RUN_CACHE`).
If you are unsure, keep the defaults.
For pooled experiments, only change `REPRESENTATION` and `BACKBONE`.
For within-device experiments, also set `TRAIN_DEVICES`, `VAL_DEVICES`, and `TEST_DEVICES` to the same device (e.g., `['iphone']`).
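
As a rough illustration of what a device filter amounts to, the sketch below keeps only the metadata rows whose device label is in the chosen list, and keeps everything when the list is `None` (the pooled case). The `device` column name and the `filter_by_device` helper are assumptions made for this example; the repo's own filtering may be implemented differently.

```python
def filter_by_device(rows, devices=None):
    """Keep rows whose 'device' is in `devices`; None or an empty list keeps all rows (pooled)."""
    if not devices:
        return list(rows)
    allowed = set(devices)
    return [row for row in rows if row["device"] in allowed]


# Hypothetical metadata rows; real rows would come from metadata.csv.
rows = [
    {"patient_id": "2001", "device": "iphone"},
    {"patient_id": "2001", "device": "android_phone"},
    {"patient_id": "2002", "device": "digital_stethoscope"},
]

print(filter_by_device(rows, ["iphone"]))  # within-device subset
print(len(filter_by_device(rows, None)))   # pooled: every row is kept
```

Setting the train, validation, and test filters to the same single device corresponds to applying this kind of filter identically to all three splits.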

**Default behavior:**
- Builds metadata and splits
- Runs 5-fold CV for model selection
- Skips caching and final/eval-only runs unless you enable them


## Glossary (Quick)
- Representation: MFCC or gammatone spectrogram input
- Backbone: ImageNet-pretrained model (e.g., MobileNetV2, SwinV2)
- Normalization: standardize spectrograms (global or per-device)
- CV (5-fold): train 5 times on different splits, average results
- Checkpoint: saved model weights (`best.pth`)
- Eval-only: evaluate a saved checkpoint without retraining


## Step 0 — Mount Google Drive
Set `DRIVE_REPO_DIR` to the folder containing this repo and your data.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

DRIVE_REPO_DIR = '/content/drive/MyDrive/phonocardiogram-lvef-deeplearning'
WORK_DIR = '/content/pcg_repo'
DATA_DIR = '/content/pcg_data'
RUNS_DIR = '/content/pcg_runs'
USE_LOCAL_DATA = True  # fastest default
PER_DEVICE_STATS = False  # set True for per-device normalization
SYNC_BACK_TO_DRIVE = True
SYNC_DERIVED = True
SYNC_DELETE = False  # set True to mirror local runs to Drive (deletes Drive-only files)
DRIVE_RUNS_DIR = f"{DRIVE_REPO_DIR}/runs"


## Step 1 — Copy Repo/Data to /content (Fast Mode)
This copies the repo (and optionally the dataset) to local Colab storage for speed.

In [None]:
import os
import shutil
import subprocess

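def rsync(src, dst, excludes=None, delete=False):
    """Run `rsync -a` from src to dst; a trailing '/' on src copies its contents, not the folder itself."""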
    cmd = ['rsync', '-a']
    if delete:
        cmd.append('--delete')
    if excludes:
        for ex in excludes:
            cmd += ['--exclude', ex]
    cmd += [src, dst]
    print(' '.join(cmd))
    subprocess.run(cmd, check=True)

os.makedirs(WORK_DIR, exist_ok=True)
rsync(DRIVE_REPO_DIR + '/', WORK_DIR + '/', excludes=['.git', 'cache', 'splits', 'results', 'checkpoints', 'checkpoints_cpu', '__pycache__'], delete=True)

if USE_LOCAL_DATA:
    os.makedirs(DATA_DIR, exist_ok=True)
    rsync(DRIVE_REPO_DIR + '/heart_sounds/', DATA_DIR + '/heart_sounds/', delete=True)
    shutil.copy2(DRIVE_REPO_DIR + '/lvef.csv', DATA_DIR + '/lvef.csv')
    LVEF_CSV = f"{DATA_DIR}/lvef.csv"
    HEART_DIR = f"{DATA_DIR}/heart_sounds"
else:
    LVEF_CSV = f"{DRIVE_REPO_DIR}/lvef.csv"
    HEART_DIR = f"{DRIVE_REPO_DIR}/heart_sounds"

if not os.path.exists(LVEF_CSV):
    raise FileNotFoundError(f'Missing LVEF CSV: {LVEF_CSV}')
if not os.path.isdir(HEART_DIR):
    raise FileNotFoundError(f'Missing heart_sounds dir: {HEART_DIR}')

os.makedirs(RUNS_DIR, exist_ok=True)
os.chdir(WORK_DIR)
print('WORK_DIR:', WORK_DIR)
print('LVEF_CSV:', LVEF_CSV)
print('HEART_DIR:', HEART_DIR)
print('RUNS_DIR:', RUNS_DIR)


## Step 2 — Install Requirements
Colab already provides PyTorch, so this installs only the remaining packages from `requirements.txt` and skips any entry whose name starts with `torch`.

In [None]:
import sys
import subprocess
from pathlib import Path
import torch

print('torch:', torch.__version__)
print('cuda:', torch.version.cuda)
print('cuda available:', torch.cuda.is_available())

# Skip blank lines, comments, and torch* packages (the 'torch' prefix already covers torchaudio).
reqs = [r.strip() for r in Path('requirements.txt').read_text().splitlines()]
reqs = [r for r in reqs if r and not r.startswith(('#', 'torch'))]
cmd = [sys.executable, '-m', 'pip', 'install'] + reqs
print('Installing:', ' '.join(reqs))
subprocess.run(cmd, check=True)


### Check the filename pattern (important)
This workflow expects filenames like `aData2001A.wav`, where:
- First letter = device code (`a`=android_phone, `i`=iphone, `e`=digital_stethoscope)
- Digits = patient ID
- Last letter = auscultation position (A/P/M/T)

If your naming differs, edit `FILENAME_RE` and `DEVICE_MAP` in `src/data/build_metadata.py` before running the next cell.
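
To test a few filenames before rebuilding metadata, a pattern along the following lines can help. This is only a sketch of the naming rule described above: `EXAMPLE_FILENAME_RE`, `EXAMPLE_DEVICE_MAP`, and `parse_filename` are hypothetical names, and the actual `FILENAME_RE` and `DEVICE_MAP` in `src/data/build_metadata.py` may be written differently.

```python
import re

# Assumed rule: device letter, optional letters (e.g. "Data"), digits, position letter, ".wav".
EXAMPLE_FILENAME_RE = re.compile(
    r"^(?P<device>[aie])[A-Za-z]*?(?P<patient>\d+)(?P<position>[APMT])\.wav$"
)
EXAMPLE_DEVICE_MAP = {
    "a": "android_phone",
    "i": "iphone",
    "e": "digital_stethoscope",
}


def parse_filename(name):
    """Return device, patient ID, and position for a matching filename, else None."""
    match = EXAMPLE_FILENAME_RE.match(name)
    if match is None:
        return None
    return {
        "device": EXAMPLE_DEVICE_MAP[match["device"]],
        "patient_id": match["patient"],
        "position": match["position"],
    }


print(parse_filename("aData2001A.wav"))
print(parse_filename("recording_01.wav"))  # does not follow the rule, so None
```

Any file for which a check like this returns `None` is a sign that the pattern or the device map needs adjusting before Step 3.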


## Step 3 — Build Metadata
Creates `metadata.csv` by linking each WAV to patient ID, device, position, and label.

In [None]:
!python -m src.data.build_metadata \
  --lvef_csv {LVEF_CSV} \
  --heart_dir {HEART_DIR} \
  --output_csv metadata.csv


## Step 4 — Create Patient-Level Splits
Creates train/val/test splits and 5-fold CV splits (no patient leakage).
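
Once the splits exist, a quick independent check for patient leakage can be sketched as follows. It assumes each split CSV has a `patient_id` column, which is a guess about the metadata layout; `patient_ids_from_csv` and `find_leaked_patients` are illustrative helpers, not part of the repo.

```python
import csv


def patient_ids_from_csv(path, column="patient_id"):
    """Collect the distinct patient IDs in one split CSV (the column name is an assumption)."""
    with open(path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}


def find_leaked_patients(splits):
    """Map each overlapping pair of split names to the sorted patient IDs they share."""
    names = sorted(splits)
    leaks = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(splits[first]) & set(splits[second])
            if shared:
                leaks[(first, second)] = sorted(shared)
    return leaks


# In-memory demo with made-up IDs: patient "2" appears in both train and test.
demo_splits = {"train": ["1", "2"], "val": ["3"], "test": ["2", "4"]}
print(find_leaked_patients(demo_splits))

# Possible use after the cell below has run (paths taken from this notebook):
# splits = {n: patient_ids_from_csv(f"splits/metadata_{n}.csv") for n in ("train", "val", "test")}
# assert not find_leaked_patients(splits)
```

An empty result means no patient appears in more than one split.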

In [None]:
!python -m src.data.make_patient_splits \
  --metadata_csv metadata.csv \
  --output_dir splits

!python -m src.data.make_patient_cv_splits \
  --metadata_csv metadata.csv \
  --output_dir splits/cv \
  --n_splits 5 \
  --n_repeats 1


## Optional — Global TF Stats (for caching or non-CV runs)
Set `RUN_GLOBAL_STATS=True` only if you plan to cache tensors or train without CV.
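
The difference between global and per-device statistics can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the repo's `compute_stats` module: features are plain lists of numbers, the helper names are hypothetical, and the real layout of `tf_stats.json` is not shown in this notebook.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def example_stats(train_samples, per_device=False):
    """Mean/std over training values, pooled ('global') or grouped by device label."""
    groups = defaultdict(list)
    for device, values in train_samples:
        groups[device if per_device else "global"].extend(values)
    return {key: {"mean": fmean(vals), "std": pstdev(vals)} for key, vals in groups.items()}


def standardize(values, stats, eps=1e-8):
    """Subtract the mean and divide by the std (eps guards against a zero std)."""
    return [(x - stats["mean"]) / (stats["std"] + eps) for x in values]


# Made-up training values for two devices with very different scales.
train_samples = [("iphone", [0.0, 2.0]), ("android_phone", [10.0, 14.0])]

global_stats = example_stats(train_samples)
device_stats = example_stats(train_samples, per_device=True)
print(global_stats)
print(device_stats)
print(standardize([0.0, 2.0], device_stats["iphone"]))
```

In both modes the statistics come from the training split only, which is why the command above takes `--train_csv` and nothing else.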

In [None]:
RUN_GLOBAL_STATS = False  # set True for caching or non-CV training
# PER_DEVICE_STATS is set in Step 0

if RUN_GLOBAL_STATS:
    per_device_flag = '--per_device' if PER_DEVICE_STATS else ''

    !python -m src.data.compute_stats \
      --train_csv splits/metadata_train.csv \
      --representations mfcc gammatone \
      {per_device_flag}


## Optional — Precompute Cache
Set `RUN_CACHE=True` to build cached tensors for faster single-run training.

In [None]:
RUN_CACHE = False  # enable if you plan to train with --use_cache
CACHE_ROOT = '/content/pcg_cache'
NORMALIZATION = 'per_device' if PER_DEVICE_STATS else 'global'

if RUN_CACHE:
    if not os.path.exists('tf_stats.json'):
        per_device_flag = '--per_device' if PER_DEVICE_STATS else ''
        print('Computing tf_stats.json for caching...')
        !python -m src.data.compute_stats \
          --train_csv splits/metadata_train.csv \
          --representations mfcc gammatone \
          {per_device_flag}

    for rep in ['mfcc', 'gammatone']:
        print(f'Caching {rep}...')
        !python -m src.data.precompute_cache \
          --representation {rep} \
          --normalization {NORMALIZATION} \
          --cache_root {CACHE_ROOT} \
          --splits splits/metadata_train.csv splits/metadata_val.csv splits/metadata_test.csv


## Optional — QA Report
Use this to audit audio durations and sample rates and to run label sanity checks.
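
For a quick spot check without the full report, the standard-library `wave` module can read the sample rate and duration of a PCM WAV file. The sketch below is independent of `src.data.qa_report`; `wav_info` and `write_silence` are hypothetical helpers, and the demo file is silent audio generated only to show the output shape.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return sample rate (Hz), channel count, and duration (s) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / rate,
        }


def write_silence(path, seconds=1.0, rate=8000):
    """Write a mono 16-bit silent WAV; used here only to demonstrate wav_info."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path, seconds=1.0, rate=8000)
    demo_info = wav_info(demo_path)

print(demo_info)
```

Note that `wave` handles uncompressed PCM files only, so recordings in other encodings would need a different reader.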

In [None]:
# Optional QA report: uncomment the lines below to run it
# !mkdir -p reports
# !python -m src.data.qa_report \
#   --metadata_csv metadata.csv \
#   --output_json reports/qa_report.json \
#   --output_csv reports/qa_records.csv \
#   --fixed_duration 4.0


## Step 5 — 5-Fold CV (Model Selection)
Edit `REPRESENTATION` and `BACKBONE` and run this cell.
Repeat for each configuration you want to compare.
This step trains 5 times and summarizes results; use it to pick the best config.
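
Summarizing a CV run usually means reporting the mean and spread of each metric across the folds. A minimal sketch of that aggregation is shown here; the metric names and the numbers are placeholders chosen for illustration, not results from this project, and the actual summary written by `src.experiments.run_cv` may contain different fields.

```python
from statistics import mean, stdev


def summarize_folds(fold_metrics):
    """Return mean and sample standard deviation per metric across folds."""
    summary = {}
    for name in fold_metrics[0]:
        values = [fold[name] for fold in fold_metrics]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = {"mean": mean(values), "std": spread}
    return summary


# Placeholder values for two folds (illustrative only).
demo_folds = [
    {"auroc": 0.5, "sensitivity": 0.25},
    {"auroc": 0.75, "sensitivity": 0.75},
]
demo_summary = summarize_folds(demo_folds)
print(demo_summary)
```

When comparing configurations, looking at the spread alongside the mean helps avoid picking a config whose advantage is smaller than the fold-to-fold variation.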


In [None]:
# Default: 5-fold CV for model selection
REPRESENTATION = 'mfcc'  # change to 'gammatone' to compare
BACKBONE = 'mobilenetv2'  # change to another backbone to compare
TRAIN_DEVICES = None  # set e.g. ['iphone'] for within-device CV; leave None for pooled
VAL_DEVICES = None  # set equal to TRAIN_DEVICES for within-device CV
TEST_DEVICES = None  # set equal to TRAIN_DEVICES for within-device CV
NORMALIZATION = 'per_device' if PER_DEVICE_STATS else 'global'

AUTO_POS_WEIGHT = True
TUNE_THRESHOLD = True
AMP = True
USE_CACHE = False  # leave False for CV unless you built per-fold caches

import sys
import subprocess

cmd = [
    sys.executable,
    '-m',
    'src.experiments.run_cv',
    '--cv_index',
    'splits/cv/index.csv',
    '--results_dir',
    f'{RUNS_DIR}/results',
    '--output_dir',
    f'{RUNS_DIR}/checkpoints',
    '--',
    '--representation',
    REPRESENTATION,
    '--backbone',
    BACKBONE,
    '--normalization',
    NORMALIZATION,
]

if AUTO_POS_WEIGHT:
    cmd.append('--auto_pos_weight')
if TUNE_THRESHOLD:
    cmd.append('--tune_threshold')
if AMP:
    cmd.append('--amp')
if USE_CACHE:
    cmd.append('--use_cache')

if TRAIN_DEVICES:
    cmd += ['--train_device_filter', *TRAIN_DEVICES]
if VAL_DEVICES:
    cmd += ['--val_device_filter', *VAL_DEVICES]
if TEST_DEVICES:
    cmd += ['--test_device_filter', *TEST_DEVICES]

print('Running:', ' '.join(cmd))
subprocess.run(cmd, check=True)


## Tips to Avoid Mistakes
- If you change data or filename rules, rerun from Step 3.
- Device names must match metadata (`iphone`, `android_phone`, `digital_stethoscope`). If your labels differ (e.g., `android`), update `DEVICE_MAP` in `src/data/build_metadata.py` and rebuild metadata.
- Only set `USE_CACHE=True` if you ran the cache cell.
- For cross-device eval, replace `<run_name>` with the actual checkpoint folder.
- Do not tune anything using the test split.
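
To make the last tip concrete, here is one way a decision threshold can be chosen from validation predictions alone and then left fixed for the test split. The criterion used by `--tune_threshold` is not stated in this notebook; Youden's J (sensitivity + specificity - 1) is an assumption picked for the example, and `tune_threshold` is a hypothetical helper.

```python
def tune_threshold(y_true, y_score):
    """Pick the threshold that maximizes Youden's J on the given (validation) data."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes are needed to tune a threshold.")
    best_threshold, best_j = 0.5, -1.0
    for threshold in sorted(set(y_score)):
        # Predict positive when the score is at or above the candidate threshold.
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Made-up validation labels and scores.
val_labels = [0, 0, 1, 1]
val_scores = [0.1, 0.4, 0.35, 0.8]
print(tune_threshold(val_labels, val_scores))
```

The threshold found this way is then applied unchanged to the test predictions, so the test split never influences any tuned quantity.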


## Step 6 — Final Within-Device Training
Enable this only after CV. Each run trains one final model with the chosen config and saves its checkpoint; for within-device models, set the device filters and run the cell once per device.
This step requires `tf_stats.json` (run the Optional Global TF Stats or Cache step first).


In [None]:
# Optional: train a final within-device model (single run)
RUN_SINGLE = False  # set True after you pick the best config from CV

if RUN_SINGLE:
    import sys
    import subprocess

    REPRESENTATION = 'mfcc'
    BACKBONE = 'mobilenetv2'
    TRAIN_DEVICES = None  # set to one device for within-device final model
    VAL_DEVICES = None  # set equal to TRAIN_DEVICES for within-device
    TEST_DEVICES = None  # set equal to TRAIN_DEVICES for within-device
    NORMALIZATION = 'per_device' if PER_DEVICE_STATS else 'global'
    USE_CACHE = False  # set True only if you ran the caching cell

    if USE_CACHE and not RUN_CACHE:
        print('Warning: USE_CACHE=True but RUN_CACHE=False. Run the caching cell or set USE_CACHE=False.')
    if not USE_CACHE and not os.path.exists('tf_stats.json'):
        raise FileNotFoundError('tf_stats.json not found. Run the Optional Global TF Stats or Cache step first.')

    cmd = [
        sys.executable,
        '-m',
        'src.training.train',
        '--train_csv',
        'splits/metadata_train.csv',
        '--val_csv',
        'splits/metadata_val.csv',
        '--test_csv',
        'splits/metadata_test.csv',
        '--representation',
        REPRESENTATION,
        '--backbone',
        BACKBONE,
        '--normalization',
        NORMALIZATION,
        '--results_dir',
        f'{RUNS_DIR}/results',
        '--output_dir',
        f'{RUNS_DIR}/checkpoints',
        '--auto_pos_weight',
        '--tune_threshold',
        '--amp',
        '--per_device_eval',
        '--save_predictions',
        '--save_history',
    ]

    if USE_CACHE:
        cmd.append('--use_cache')
    if TRAIN_DEVICES:
        cmd += ['--train_device_filter', *TRAIN_DEVICES]
    if VAL_DEVICES:
        cmd += ['--val_device_filter', *VAL_DEVICES]
    if TEST_DEVICES:
        cmd += ['--test_device_filter', *TEST_DEVICES]

    print('Running:', ' '.join(cmd))
    subprocess.run(cmd, check=True)


## Step 7 — Cross-Device Evaluation (No Retraining)
Uses a saved checkpoint and evaluates on other devices. The model is not updated.
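
To find the value to substitute for `<run_name>`, the run folders that actually contain a checkpoint can be listed. The sketch assumes the `checkpoints/<run_name>/best.pth` layout described in this notebook; `list_checkpoints` is a hypothetical helper, and the temporary folders below exist only for the demo.

```python
import tempfile
from pathlib import Path


def list_checkpoints(checkpoints_dir, filename="best.pth"):
    """Return the sorted names of run folders directly under checkpoints_dir that hold `filename`."""
    root = Path(checkpoints_dir)
    return sorted(path.parent.name for path in root.glob(f"*/{filename}"))


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_a").mkdir()
    (Path(tmp) / "run_a" / "best.pth").touch()
    (Path(tmp) / "run_b").mkdir()  # no checkpoint was saved in this folder
    demo_runs = list_checkpoints(tmp)

print(demo_runs)

# Possible use in this notebook:
# print(list_checkpoints(f"{RUNS_DIR}/checkpoints"))
```

If checkpoints are nested more deeply than one folder level, the glob pattern would need to be adjusted accordingly.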


In [None]:
# Optional: cross-device evaluation from a saved checkpoint (no retraining)
RUN_EVAL_ONLY = False

if RUN_EVAL_ONLY:
    import sys
    import subprocess

    CHECKPOINT_PATH = f'{RUNS_DIR}/checkpoints/<run_name>/best.pth'
    TEST_DEVICES = ['iphone', 'digital_stethoscope']  # set target device(s); use metadata labels

    if not os.path.exists(CHECKPOINT_PATH):
        raise FileNotFoundError(f'Checkpoint not found: {CHECKPOINT_PATH}')

    # Representation/backbone/normalization are loaded from the checkpoint.
    cmd = [
        sys.executable,
        '-m',
        'src.training.train',
        '--eval_only',
        '--checkpoint_path',
        CHECKPOINT_PATH,
        '--train_csv',
        'splits/metadata_train.csv',
        '--val_csv',
        'splits/metadata_val.csv',
        '--test_csv',
        'splits/metadata_test.csv',
        '--results_dir',
        f'{RUNS_DIR}/results',
        '--per_device_eval',
        '--save_predictions',
    ]

    # Note: if the checkpoint was trained with cached tensors, those cache files must still exist.

    # Restrict evaluation to the target device(s), if any are set.
    if TEST_DEVICES:
        cmd += ['--test_device_filter', *TEST_DEVICES]

    print('Running:', ' '.join(cmd))
    subprocess.run(cmd, check=True)


## Outputs to Inspect
All outputs are under `RUNS_DIR` (default: `/content/pcg_runs`):
- `RUNS_DIR/results/summary.csv`: main table of metrics per run (use this for tables)
- `RUNS_DIR/results/<run_name>/predictions_test.csv`: for ROC/PR curves (final models only)
- `RUNS_DIR/checkpoints/<run_name>/best.pth`: saved model weights
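
As a starting point for working with the predictions file, the area under the ROC curve can be computed directly from labels and scores. The column names `y_true` and `y_prob` are assumptions, so check the header of `predictions_test.csv` before using the loader; both helpers are illustrative and not part of the repo.

```python
import csv


def load_predictions(path, label_col="y_true", score_col="y_prob"):
    """Read labels and scores from a predictions CSV (column names are assumptions)."""
    labels, scores = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            labels.append(int(float(row[label_col])))
            scores.append(float(row[score_col]))
    return labels, scores


def roc_auc(y_true, y_score):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes are needed to compute AUC.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores to show the calculation.
demo_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)
```

The pairwise formulation is quadratic in the number of samples, which is fine for a quick check; a plotting or metrics library is a better fit for full ROC/PR curves.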


## Step 8 — Sync Outputs to Drive
Copies results and (optionally) derived artifacts back to Google Drive.

In [None]:
if SYNC_DERIVED:
    if os.path.exists('metadata.csv'):
        shutil.copy2('metadata.csv', f"{DRIVE_REPO_DIR}/metadata.csv")
    if os.path.exists('tf_stats.json'):
        shutil.copy2('tf_stats.json', f"{DRIVE_REPO_DIR}/tf_stats.json")
    if os.path.isdir('splits'):
        rsync('splits/', f"{DRIVE_REPO_DIR}/splits/", delete=SYNC_DELETE)
    print('Synced derived artifacts to Drive.')

if SYNC_BACK_TO_DRIVE:
    os.makedirs(DRIVE_RUNS_DIR, exist_ok=True)
    rsync(RUNS_DIR + '/', DRIVE_RUNS_DIR + '/', delete=SYNC_DELETE)
    print('Synced runs to Drive:', DRIVE_RUNS_DIR)
