# Deep-Learning Analysis of Smartphone and Electronic-Stethoscope Phonocardiograms for Detection of Reduced Left Ventricular Ejection Fraction

This notebook rebuilds derived artifacts and runs experiments on Google Colab.
Code and data are read directly from Google Drive (drive-only, no local copy), which avoids long copy steps.
Outputs are written to local Colab storage at `/content/pcg_runs` and synced back to Drive in Step 7.

No deep-learning background is required: follow the steps in order and keep defaults unless you have a reason to change them.


## Glossary (Quick)
- Representation: MFCC or gammatone spectrogram input
- Backbone: ImageNet-pretrained model (e.g., MobileNetV2, SwinV2)
- Normalization: standardize spectrograms (single-set or per-device)
- CV (5-fold): train 5 times on different splits, average results
- Checkpoint: saved model weights (`best.pth`)
- Eval-only: evaluate a saved checkpoint without retraining


## Step 0 — Mount Google Drive
Set `DRIVE_REPO_DIR` to the folder containing this repo and your data.
This workflow runs code/data directly from Drive (no local copy).


In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
import subprocess

def rsync(src, dst, excludes=None, delete=False):
    """Copy src to dst with `rsync -a`; delete=True also removes files missing from src."""
    cmd = ['rsync', '-a']
    if delete:
        cmd.append('--delete')
    if excludes:
        for ex in excludes:
            cmd += ['--exclude', ex]
    cmd += [src, dst]
    print(' '.join(cmd))
    subprocess.run(cmd, check=True)

def run_command(cmd):
    """Run cmd, stream its combined stdout/stderr line by line, and raise on a non-zero exit."""
    env = {**os.environ, 'PYTHONUNBUFFERED': '1'}
    print('Running:', ' '.join(cmd))
    proc = subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,
    )
    for line in proc.stdout:
        print(line, end='')
    returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)

DRIVE_REPO_DIR = '/content/drive/MyDrive/phonocardiogram-lvef-deeplearning'
WORK_DIR = DRIVE_REPO_DIR  # drive-only workflow
HEART_DIR = f"{DRIVE_REPO_DIR}/heart_sounds"
LVEF_CSV = f"{DRIVE_REPO_DIR}/lvef.csv"
RUNS_DIR = '/content/pcg_runs'

SYNC_BACK_TO_DRIVE = True
SYNC_DELETE = False  # set True to mirror local runs to Drive (deletes Drive-only files)
DRIVE_RUNS_DIR = f"{DRIVE_REPO_DIR}/runs"

if not os.path.exists(LVEF_CSV):
    raise FileNotFoundError(f'Missing LVEF CSV: {LVEF_CSV}')
if not os.path.isdir(HEART_DIR):
    raise FileNotFoundError(f'Missing heart_sounds dir: {HEART_DIR}')

os.makedirs(RUNS_DIR, exist_ok=True)
os.chdir(WORK_DIR)
print('WORK_DIR:', WORK_DIR)
print('LVEF_CSV:', LVEF_CSV)
print('HEART_DIR:', HEART_DIR)
print('RUNS_DIR:', RUNS_DIR)


### Optional - Restore Summary Only from Drive
If you only need the aggregated table, this restores `summary.csv` without copying all runs.
Note: this does not restore checkpoints or per-run metrics.


In [None]:
import shutil

RESTORE_SUMMARY_FROM_DRIVE = True  # set False to start fresh

if RESTORE_SUMMARY_FROM_DRIVE:
    src = f"{DRIVE_RUNS_DIR}/results/summary.csv"
    dst_dir = f"{RUNS_DIR}/results"
    if not os.path.exists(src):
        print(f"No summary.csv found at {src}; skipping restore.")
    else:
        os.makedirs(dst_dir, exist_ok=True)
        dst = f"{dst_dir}/summary.csv"
        shutil.copy2(src, dst)
        print("Restored summary.csv to:", dst)


## Step 1 — Install Requirements
Colab already provides PyTorch, so this cell installs only the remaining packages from `requirements.txt` (entries starting with `torch` are skipped).

In [None]:
import sys
import subprocess
from pathlib import Path
import torch

print('torch:', torch.__version__)
print('cuda:', torch.version.cuda)
print('cuda available:', torch.cuda.is_available())

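# Read requirements, dropping blank lines and comment lines.
reqs = [r.strip() for r in Path('requirements.txt').read_text().splitlines()]
reqs = [r for r in reqs if r and not r.startswith('#')]
# Skip anything starting with 'torch' (this prefix also covers torchaudio); Colab provides PyTorch.
reqs = [r for r in reqs if not r.startswith('torch')]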
cmd = [sys.executable, '-m', 'pip', 'install'] + reqs
print('Installing:', ' '.join(reqs))
subprocess.run(cmd, check=True)


### Check the filename pattern (important)
This workflow expects filenames like `aData2001A.wav`, where:
- First letter = device code (`a`=`android_phone`, `i`=`iphone`, `e`=`digital_stethoscope`)
- Digits = patient ID
- Last letter = auscultation position (A/P/M/T)

If your naming differs, edit `FILENAME_RE` and `DEVICE_MAP` in `src/data/build_metadata.py` before running the next cell.
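
To check your own filenames against this rule before touching the repo, a minimal sketch is shown below. It assumes a regex equivalent to the pattern described above; the actual `FILENAME_RE` and `DEVICE_MAP` in `src/data/build_metadata.py` may be written differently. The literal `Data` infix is inferred from the single example `aData2001A.wav`, and `parse_filename` is a hypothetical helper.

```python
import re

# Assumed pattern, reconstructed from the example `aData2001A.wav`;
# the repo's real FILENAME_RE may differ.
FILENAME_RE = re.compile(
    r'^(?P<device>[aie])Data(?P<patient>\d+)(?P<position>[APMT])\.wav$'
)

# Assumed mapping from device code to the device labels used in metadata.
DEVICE_MAP = {'a': 'android_phone', 'i': 'iphone', 'e': 'digital_stethoscope'}

def parse_filename(name):
    """Return (device, patient_id, position), or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return DEVICE_MAP[m.group('device')], m.group('patient'), m.group('position')

print(parse_filename('aData2001A.wav'))    # matches the expected pattern
print(parse_filename('recording_01.wav'))  # does not match, so None
```

Any file that returns `None` here would likely need a renamed file or an adjusted pattern.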


## Step 2 — Build Metadata
Creates `metadata.csv` by linking each WAV to patient ID, device, position, and label.
Only rerun if your raw data, labels, or filename rules changed.


In [None]:
!python -m src.data.build_metadata \
  --lvef_csv {LVEF_CSV} \
  --heart_dir {HEART_DIR} \
  --output_csv metadata.csv


## Step 3 — Create Patient-Level Splits
Creates train/val/test splits and 5-fold CV splits (no patient leakage).
Only rerun if metadata or split settings (folds/seed/val_size) changed.
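
"No patient leakage" means no patient ID appears in more than one split. A minimal sanity check is sketched below, assuming you have already read the patient IDs from each split file. `find_patient_leakage` is a hypothetical helper, not part of the repo, and the toy IDs are for illustration only.

```python
from itertools import combinations

def find_patient_leakage(split_patients):
    """Map each pair of split names to the patient IDs they share (empty dict = no leakage)."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(split_patients.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Toy IDs for illustration only; in practice, read the patient ID column
# from each split CSV (the column name depends on metadata.csv).
splits = {
    'train': ['2001', '2002', '2003'],
    'val': ['2004'],
    'test': ['2005', '2003'],  # '2003' deliberately leaks from train
}
print(find_patient_leakage(splits))
```

An empty result is what you want; any entry points to a patient shared between two splits.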


In [None]:
!python -m src.data.make_patient_splits \
  --metadata_csv metadata.csv \
  --output_dir splits

!python -m src.data.make_patient_cv_splits \
  --metadata_csv metadata.csv \
  --output_dir splits/cv \
  --n_splits 5 \
  --n_repeats 1


## Optional — QA Report
Use this to summarize data quality and optionally estimate a simple SNR proxy.
Use `--max_files` to sample a subset for faster runtime.


In [None]:
# Optional QA report (uncomment to run)
# !mkdir -p reports
# !python -m src.data.qa_report \
#   --metadata_csv metadata.csv \
#   --output_json reports/qa_report.json \
#   --output_csv reports/qa_records.csv \
#   --fixed_duration 4.0 \
#   --compute_snr \
#   --max_files 200


## Step 4 — 5-Fold CV (Model Selection)
Edit `REPRESENTATION`, `BACKBONE`, and device filters, then run this cell.
This step trains 5 folds and summarizes results; use it to pick the best config.
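
When comparing configs, look at the spread across folds as well as the average. The sketch below shows one way to summarize a single metric over the 5 folds. `summarize_folds` is a hypothetical helper, and the scores are placeholders, not results from this study.

```python
from statistics import mean, stdev

def summarize_folds(values):
    """Return (mean, sample standard deviation) of one metric across CV folds."""
    if len(values) < 2:
        raise ValueError('Need at least two folds to estimate spread.')
    return mean(values), stdev(values)

# Placeholder numbers for illustration only; these are not results from this study.
fold_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
avg, spread = summarize_folds(fold_scores)
print(f'metric: {avg:.3f} +/- {spread:.3f}')
```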


**Input size note**
- Use 224x224 for MobileNet and EfficientNet-B0.
- Use 256x256 for SwinV2-Tiny/Small.
- Use 384x384 for EfficientNetV2-S.


### Change Only These Variables (Step 4)
- Set the values below only; keep the rest unchanged.
- `REPRESENTATION`: `mfcc` or `gammatone`
- `BACKBONE`: `mobilenetv2`, `mobilenetv3_large`, `efficientnet_b0`, `efficientnetv2_s`, `swinv2_tiny`, `swinv2_small`
- Device labels (must match metadata): `iphone`, `android_phone`, `digital_stethoscope`
- For within-device CV, set `TRAIN_DEVICES`, `VAL_DEVICES`, `TEST_DEVICES` to the same single-device list.
- For pooled CV, leave those three variables as `None`.
- `IMAGE_SIZE` is set automatically from `BACKBONE` in this cell.


In [None]:
# Default: 5-fold CV for model selection
REPRESENTATION = 'mfcc'  # change to 'gammatone' to compare
BACKBONE = 'mobilenetv2'  # change to another backbone to compare
TRAIN_DEVICES = None  # set e.g. ['iphone'] for within-device CV; leave None for pooled
VAL_DEVICES = None  # set equal to TRAIN_DEVICES for within-device CV
TEST_DEVICES = None  # set equal to TRAIN_DEVICES for within-device CV
NORMALIZATION = 'global'  # single-set stats over the chosen subset

if BACKBONE.startswith('swinv2'):
    IMAGE_SIZE = 256
elif BACKBONE == 'efficientnetv2_s':
    IMAGE_SIZE = 384
else:
    IMAGE_SIZE = 224

RUN_TAG = '-'.join(TRAIN_DEVICES) if TRAIN_DEVICES else 'pooled'  # used to label CV runs
RUN_NAME_FORMAT = f"cv_{RUN_TAG}_im{IMAGE_SIZE}_r{{repeat:02d}}_f{{fold:02d}}_{{backbone}}_{{representation}}"

AUTO_POS_WEIGHT = True
TUNE_THRESHOLD = True
AMP = True

import sys
cmd = [
    sys.executable,
    '-m',
    'src.experiments.run_cv',
    '--cv_index',
    'splits/cv/index.csv',
    '--results_dir',
    f'{RUNS_DIR}/results',
    '--output_dir',
    f'{RUNS_DIR}/checkpoints',
    '--run_name_format',
    RUN_NAME_FORMAT,
    '--',
    '--representation',
    REPRESENTATION,
    '--backbone',
    BACKBONE,
    '--normalization',
    NORMALIZATION,
    '--image_size',
    str(IMAGE_SIZE),
]

if AUTO_POS_WEIGHT:
    cmd.append('--auto_pos_weight')
if TUNE_THRESHOLD:
    cmd.append('--tune_threshold')
if AMP:
    cmd.append('--amp')

if TRAIN_DEVICES:
    cmd += ['--train_device_filter', *TRAIN_DEVICES]
if VAL_DEVICES:
    cmd += ['--val_device_filter', *VAL_DEVICES]
if TEST_DEVICES:
    cmd += ['--test_device_filter', *TEST_DEVICES]

run_command(cmd)

## Optional — TF Stats (for non-CV runs)
Set `RUN_TF_STATS = True` only if you plan to train without CV (e.g., final within-device checkpoints in Step 5).
These stats are computed from the training split only (no val/test patients). By default, a single set of stats is computed over whatever subset you select.
For device-only stats for a within-device final checkpoint, add a device filter, and match `--representations` and `--image_size` to the representation and backbone you will train.

Backbone image size reminder: SwinV2 = 256, EfficientNetV2-S = 384, everything else = 224.

Example (android_phone, gammatone, SwinV2-Tiny):
```bash
python -m src.data.compute_stats \
  --train_csv splits/metadata_train.csv \
  --representations gammatone \
  --device_filter android_phone \
  --image_size 256
```
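
Why training split only: the mean and standard deviation are fitted on training data and then reused unchanged for validation and test data, so no information from held-out patients enters the normalization. The sketch below illustrates the idea on toy values. `fit_stats` and `standardize` are hypothetical helpers, and the dict keys are not the actual layout of `tf_stats.json`.

```python
import math

def fit_stats(train_values):
    """Mean and population standard deviation over training-split values only."""
    n = len(train_values)
    mu = sum(train_values) / n
    var = sum((v - mu) ** 2 for v in train_values) / n
    return {'mean': mu, 'std': math.sqrt(var)}

def standardize(values, stats, eps=1e-8):
    """Apply the training-split stats to any split (train, val, or test)."""
    return [(v - stats['mean']) / (stats['std'] + eps) for v in values]

# Toy flattened "spectrogram" values for illustration only.
train = [1.0, 2.0, 3.0, 4.0]
val = [2.5, 5.0]

stats = fit_stats(train)            # fitted on training data only
val_norm = standardize(val, stats)  # reused unchanged for validation data
print(stats, val_norm)
```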


In [None]:
RUN_TF_STATS = False  # set True for non-CV training (final within-device or pooled)

# Example: if your best config is android_phone + gammatone + swinv2_tiny,
# use --device_filter android_phone and --image_size 256.
# For efficientnetv2_s, use --image_size 384. Otherwise, use 224.
# Tip: add --output_json tf_stats_<device>.json to keep per-device files.

if RUN_TF_STATS:

    !python -m src.data.compute_stats \
      --train_csv splits/metadata_train.csv \
      --representations gammatone \
      --device_filter android_phone \
      --image_size 256


## Tips to Avoid Mistakes
- If you change data, labels, or filename rules, rerun Step 2 and Step 3.
- Device names must match metadata (`iphone`, `android_phone`, `digital_stethoscope`). If your labels differ (e.g., `android`), update `DEVICE_MAP` in `src/data/build_metadata.py` and rebuild metadata.
- For cross-device eval, replace `<run_name>` with the actual checkpoint folder.
- Do not tune anything using the test split.
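
The last tip applies to the decision threshold in particular: choose it on validation predictions, then apply the frozen value to the test split. The sketch below assumes Youden's J (sensitivity + specificity - 1) as the selection criterion. The criterion behind `--tune_threshold` is not documented here and may differ. The function names and toy data are hypothetical.

```python
def youden_j(y_true, y_prob, threshold):
    """Sensitivity + specificity - 1 at a given decision threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens + spec - 1.0

def tune_threshold(y_true, y_prob):
    """Pick the candidate threshold (an observed score) that maximizes Youden's J."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: youden_j(y_true, y_prob, t))

# Toy validation labels and scores for illustration only.
val_true = [0, 0, 0, 1, 0, 1, 1]
val_prob = [0.10, 0.20, 0.30, 0.55, 0.60, 0.70, 0.80]

threshold = tune_threshold(val_true, val_prob)  # chosen on validation only
print(threshold)
```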


## Step 5 — Final Within-Device Training
Enable this only after CV. This creates one final checkpoint per device using the chosen config.
This step requires `tf_stats.json` (run the Optional TF Stats step first).

If you want device-only stats for a strict within-device final model, compute stats with `--device_filter` and the correct `--image_size`.
Example: for android_phone + gammatone + swinv2_tiny, use `--device_filter android_phone --representations gammatone --image_size 256`.


### Change Only These Variables
- Set `RUN_SINGLE = True` when you are ready to run this step.
- Replace the values below with one of the allowed options.

**Allowed options**
- `REPRESENTATION`: `mfcc` or `gammatone`
- `BACKBONE`: `mobilenetv2`, `mobilenetv3_large`, `efficientnet_b0`, `efficientnetv2_s`, `swinv2_tiny`, `swinv2_small`
- Device labels (must match metadata): `iphone`, `android_phone`, `digital_stethoscope`

**Within-device rule**
- Set `TRAIN_DEVICES`, `VAL_DEVICES`, and `TEST_DEVICES` to the same single-device list. Example: `['iphone']`.

**Image size**
- `IMAGE_SIZE` is set automatically from `BACKBONE` in this cell.


In [None]:
# Optional: train a final within-device model (single run)
RUN_SINGLE = False  # set True after you pick the best config from CV
# Available options:
#   REPRESENTATION: 'mfcc' | 'gammatone'
#   BACKBONE: 'mobilenetv2' | 'mobilenetv3_large' | 'efficientnet_b0' | 'efficientnetv2_s' | 'swinv2_tiny' | 'swinv2_small'
#   DEVICES: ['iphone'] | ['android_phone'] | ['digital_stethoscope'] (must match metadata)

if RUN_SINGLE:
    import sys

    REPRESENTATION = 'mfcc'
    BACKBONE = 'mobilenetv2'
    TRAIN_DEVICES = None  # set to one device for within-device final model
    VAL_DEVICES = None  # set equal to TRAIN_DEVICES for within-device
    TEST_DEVICES = None  # set equal to TRAIN_DEVICES for within-device
    NORMALIZATION = 'global'  # single-set stats over the chosen subset
    if BACKBONE.startswith('swinv2'):
        IMAGE_SIZE = 256
    elif BACKBONE == 'efficientnetv2_s':
        IMAGE_SIZE = 384
    else:
        IMAGE_SIZE = 224

    if not os.path.exists('tf_stats.json'):
        raise FileNotFoundError('tf_stats.json not found. Run the Optional TF Stats step first.')

    cmd = [
        sys.executable,
        '-m',
        'src.training.train',
        '--train_csv',
        'splits/metadata_train.csv',
        '--val_csv',
        'splits/metadata_val.csv',
        '--test_csv',
        'splits/metadata_test.csv',
        '--representation',
        REPRESENTATION,
        '--backbone',
        BACKBONE,
        '--normalization',
        NORMALIZATION,
        '--image_size',
        str(IMAGE_SIZE),
        '--results_dir',
        f'{RUNS_DIR}/results',
        '--output_dir',
        f'{RUNS_DIR}/checkpoints',
        '--auto_pos_weight',
        '--tune_threshold',
        '--amp',
        '--per_device_eval',
        '--save_predictions',
        '--save_history',
    ]

    if TRAIN_DEVICES:
        cmd += ['--train_device_filter', *TRAIN_DEVICES]
    if VAL_DEVICES:
        cmd += ['--val_device_filter', *VAL_DEVICES]
    if TEST_DEVICES:
        cmd += ['--test_device_filter', *TEST_DEVICES]

    run_command(cmd)


## Step 6 — Cross-Device Evaluation (No Retraining)
Uses a saved checkpoint and evaluates on other devices. The model is not updated.
Set `CHECKPOINT_PATH` manually to the within-device checkpoint you selected, and set `TEST_DEVICES` to the target devices you want to evaluate.


In [None]:
# Optional: cross-device evaluation from a saved checkpoint (no retraining)
RUN_EVAL_ONLY = False

if RUN_EVAL_ONLY:
    import sys

    CHECKPOINT_PATH = f'{RUNS_DIR}/checkpoints/<run_name>/best.pth'  # set manually to your chosen within-device run
    TEST_DEVICES = ['iphone', 'digital_stethoscope']  # evaluate this checkpoint on these devices (e.g., android model -> iphone + stethoscope)

    if not os.path.exists(CHECKPOINT_PATH):
        raise FileNotFoundError(f'Checkpoint not found: {CHECKPOINT_PATH}')

    # Representation/backbone/normalization are loaded from the checkpoint.
    cmd = [
        sys.executable,
        '-m',
        'src.training.train',
        '--eval_only',
        '--checkpoint_path',
        CHECKPOINT_PATH,
        '--train_csv',
        'splits/metadata_train.csv',
        '--val_csv',
        'splits/metadata_val.csv',
        '--test_csv',
        'splits/metadata_test.csv',
        '--results_dir',
        f'{RUNS_DIR}/results',
        '--per_device_eval',
        '--save_predictions',
    ]

    if TEST_DEVICES:
        cmd += ['--test_device_filter', *TEST_DEVICES]

    run_command(cmd)


## Outputs to Inspect
All outputs are under `RUNS_DIR` (default: `/content/pcg_runs`):
- `RUNS_DIR/results/summary.csv`: main table of metrics per run (use this for tables)
- `RUNS_DIR/results/selection/best_config_per_device.csv`: best config per device
- `RUNS_DIR/results/selection/config_summary_by_device.csv`: aggregated config stats per device
- `RUNS_DIR/results/<run_name>/predictions_test.csv`: for ROC/PR curves (final models only)
- `RUNS_DIR/checkpoints/<run_name>/best.pth`: saved model weights
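
To get a quick ROC AUC from a predictions file without extra dependencies, a rank-based sketch is shown below. The column names `y_true` and `y_prob` are assumptions; check the header of your `predictions_test.csv` first. The in-memory CSV is a stand-in for the real file, with toy values.

```python
import csv
import io

def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative.')
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Stand-in for predictions_test.csv; the column names here are assumptions.
toy_csv = io.StringIO('y_true,y_prob\n0,0.10\n0,0.40\n1,0.35\n1,0.80\n')
rows = list(csv.DictReader(toy_csv))
labels = [int(r['y_true']) for r in rows]
scores = [float(r['y_prob']) for r in rows]
print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of recordings, which is fine for a quick check but slow for very large files.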


## Step 7 — Sync Outputs to Drive
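Copies everything under `RUNS_DIR` to `DRIVE_RUNS_DIR` on Google Drive with `rsync`.
With `SYNC_DELETE = False` (the default), files that exist only on Drive are kept.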


In [None]:
if SYNC_BACK_TO_DRIVE:
    os.makedirs(DRIVE_RUNS_DIR, exist_ok=True)
    rsync(RUNS_DIR + '/', DRIVE_RUNS_DIR + '/', delete=SYNC_DELETE)
    print('Synced runs to drive:', DRIVE_RUNS_DIR)
