# 01 - Build Unified Manifest (ICBHI + Fraiwan) - 8-Class Lung Sound Project

This notebook builds a single **manifest_all.csv** and patient-wise splits (**train/val/test**) for our 8-class lung sound classification project.

## ✅ Final 8 classes
1. Normal  
2. Asthma  
3. COPD  
4. Pneumonia  
5. Bronchitis  
6. Heart failure  
7. Pleural effusion  
8. Lung fibrosis  

## 📌 Datasets (local folders)
- **ICBHI 2017** (Kaggle `vbookshelf/respiratory-sound-database`)  
  Uses:
  - `patient_diagnosis.csv` (patient → diagnosis)
  - `audio_and_txt_files/*.wav` (audio)

- **Fraiwan** (Mendeley)  
  Uses:
  - `audio/*.wav` (audio)
  - (Optional for inspection) `Data annotation.xlsx`

## Important design choices
- **Single-label classification**: Fraiwan entries with multiple diagnoses like `A + B` are **dropped**.
- **ICBHI used only for overlapping classes**: `Healthy → Normal`, `Asthma`, `COPD`, `Pneumonia` (others are dropped).
- **Patient-wise split**: prevents leakage (same patient audio never appears in both train and test).
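The patient-wise guarantee can be verified mechanically once the splits exist. The helper below is a minimal sketch, not part of the original notebook: `find_patient_overlap` is a hypothetical name, and it only assumes that each split can be reduced to a collection of patient ids.

```python
from itertools import combinations

def find_patient_overlap(splits: dict) -> dict:
    """Return {(split_a, split_b): shared_ids} for every pair of splits sharing a patient."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Toy ids for illustration only (not taken from the real manifest).
clean = {"train": ["p1", "p2"], "val": ["p3"], "test": ["p4"]}
leaky = {"train": ["p1", "p2"], "val": ["p3"], "test": ["p2"]}

print(find_patient_overlap(clean))
print(find_patient_overlap(leaky))
```

With the DataFrames produced later in this notebook, the call would look like `find_patient_overlap({"train": train_df["patient_id"], "val": val_df["patient_id"], "test": test_df["patient_id"]})`; an empty dict means no patient id is shared between splits.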


In [1]:
# Cell 1: Imports + (optional) dependency note

from pathlib import Path
import random
import json

import pandas as pd

# Optional: if you want to read .xlsx (inspection only), pandas needs openpyxl.
# If you get "Missing optional dependency 'openpyxl'", run:
# !pip -q install openpyxl

SEED = 42
random.seed(SEED)

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 50)


## 1) Paths & basic sanity checks

This notebook assumes it is stored inside:
`lung_sound_project/notebooks/01_build_manifest.ipynb`

So the project root is one folder up from the notebook folder.
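The "one folder up" rule only holds when the notebook is launched from `notebooks/`. As a hedged alternative, the sketch below walks upward until it finds a directory that contains a marker folder. `find_project_root` and the choice of `data` as the marker are assumptions for illustration, not something the original notebook defines.

```python
from pathlib import Path

def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` and return the first directory that contains `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")

# Possible usage inside the notebook (commented out so the sketch has no side effects):
# PROJECT_ROOT = find_project_root(Path.cwd())
```

This keeps the paths in Cell 2 valid even if the working directory is the project root or a nested subfolder, at the cost of depending on the marker folder existing.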


In [2]:
# Cell 2: Resolve project root & dataset locations

CWD = Path.cwd().resolve()         # .../lung_sound_project/notebooks
PROJECT_ROOT = CWD.parents[0]      # .../lung_sound_project

# ICBHI (Kaggle vbookshelf/respiratory-sound-database)
ICBHI_ROOT = PROJECT_ROOT / "data/raw/icbhi/ICBHI_final_database"
ICBHI_AUDIO_DIR = ICBHI_ROOT / "audio_and_txt_files"
ICBHI_DIAG_CSV = ICBHI_ROOT / "patient_diagnosis.csv"

# Fraiwan (Mendeley)
FRAIWAN_AUDIO_DIR = PROJECT_ROOT / "data/raw/fraiwan/audio"
FRAIWAN_XLSX = PROJECT_ROOT / "data/raw/fraiwan/Data annotation.xlsx"  # optional inspection

# Outputs
OUT_DIR = PROJECT_ROOT / "data/processed/manifests"
OUT_DIR.mkdir(parents=True, exist_ok=True)

MODELS_DIR = PROJECT_ROOT / "models"
MODELS_DIR.mkdir(parents=True, exist_ok=True)

print("CWD:", CWD)
print("PROJECT_ROOT:", PROJECT_ROOT)

print("\nICBHI_ROOT exists:", ICBHI_ROOT.exists())
print("ICBHI_AUDIO_DIR exists:", ICBHI_AUDIO_DIR.exists())
print("ICBHI_DIAG_CSV exists:", ICBHI_DIAG_CSV.exists())

print("\nFRAIWAN_AUDIO_DIR exists:", FRAIWAN_AUDIO_DIR.exists())
print("FRAIWAN_XLSX exists:", FRAIWAN_XLSX.exists())


CWD: /teamspace/studios/this_studio/lung_sound_project/notebooks
PROJECT_ROOT: /teamspace/studios/this_studio/lung_sound_project

ICBHI_ROOT exists: True
ICBHI_AUDIO_DIR exists: True
ICBHI_DIAG_CSV exists: True

FRAIWAN_AUDIO_DIR exists: True
FRAIWAN_XLSX exists: True


In [3]:
# Cell 3: Count audio files (sanity check)

icbhi_wavs = sorted(ICBHI_AUDIO_DIR.glob("*.wav")) if ICBHI_AUDIO_DIR.exists() else []
fraiwan_wavs = sorted(FRAIWAN_AUDIO_DIR.glob("*.wav")) if FRAIWAN_AUDIO_DIR.exists() else []

print("ICBHI wav count:", len(icbhi_wavs))     # expected ~920
print("Fraiwan wav count:", len(fraiwan_wavs)) # expected ~336

print("\nExample ICBHI file:", icbhi_wavs[0].name if icbhi_wavs else "N/A")
print("Example Fraiwan file:", fraiwan_wavs[0].name if fraiwan_wavs else "N/A")


ICBHI wav count: 920
Fraiwan wav count: 336

Example ICBHI file: 101_1b1_Al_sc_Meditron.wav
Example Fraiwan file: BP100_N,N,P R M,70,F.wav


## 2) Label mapping (8-class design)

### ICBHI (keep only 4 classes)
- Healthy → Normal  
- Asthma → Asthma  
- COPD → COPD  
- Pneumonia → Pneumonia  

All other ICBHI diagnoses (URTI, LRTI, Bronchiectasis, etc.) are dropped.

### Fraiwan
We parse diagnosis from the filename pattern:
`<PATIENTID>_<DIAGNOSIS>,...wav`
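To make the filename pattern concrete, here is a small standalone sketch of the parsing rule applied to the example file name printed by Cell 3. `parse_fraiwan_name` is a hypothetical helper written for illustration; it follows the same split-on-first-underscore, split-on-first-comma logic that Cell 5 uses.

```python
from pathlib import Path
from typing import Optional, Tuple

def parse_fraiwan_name(filename: str) -> Optional[Tuple[str, str]]:
    """Return (patient_id, raw_diagnosis) from a Fraiwan file name, or None if it has no '_'."""
    stem = Path(filename).stem              # drop the .wav extension
    if "_" not in stem:
        return None
    pid, rest = stem.split("_", 1)          # text before the first underscore is the id
    diag_raw = rest.split(",", 1)[0].strip()  # first comma-separated field is the diagnosis
    return pid, diag_raw

print(parse_fraiwan_name("BP100_N,N,P R M,70,F.wav"))  # ('BP100', 'N')
print(parse_fraiwan_name("no-underscore.wav"))         # None
```

The raw diagnosis (`N` here) still has to go through label normalization before it becomes one of the final class names.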


In [4]:
# Cell 4: Define final 8 classes and normalization helpers

EIGHT_CLASSES = [
    "Normal",
    "Asthma",
    "COPD",
    "Pneumonia",
    "Bronchitis",
    "Heart failure",
    "Pleural effusion",
    "Lung fibrosis",
]

label_to_id = {lbl: i for i, lbl in enumerate(EIGHT_CLASSES)}
id_to_label = {i: lbl for lbl, i in label_to_id.items()}

def normalize_label(raw: str):
    """Normalize raw label strings into one of the 8 final classes, else None."""
    s = (raw or "").strip().lower()
    s = " ".join(s.split())

    if s in {"n", "normal", "healthy"}:
        return "Normal"
    if s == "asthma":
        return "Asthma"
    if s == "copd":
        return "COPD"
    if s == "pneumonia":
        return "Pneumonia"
    if s == "bronchitis":
        return "Bronchitis"
    if s in {"heart failure", "heartfailure"}:
        return "Heart failure"
    if s in {"pleural effusion", "pleuraleffusion"}:
        return "Pleural effusion"
    if s in {"lung fibrosis", "lungfibrosis", "pulmonary fibrosis"}:
        return "Lung fibrosis"
    return None

def icbhi_to_final(label: str):
    """Map ICBHI diagnosis labels to our final labels (only 4 kept)."""
    if label == "Healthy":
        return "Normal"
    if label in {"Asthma", "COPD", "Pneumonia"}:
        return label
    return None


## 3) Fraiwan rows (single-label only)

We:
1. Read each `.wav` in `data/raw/fraiwan/audio`
2. Extract patient_id and diagnosis from filename
3. Drop multi-diagnosis samples containing `+`
4. Normalize label to our 8 classes


In [5]:
# Cell 5: Build Fraiwan manifest rows from filenames

fraiwan_rows = []
fraiwan_unknown = []
fraiwan_multi_diag = 0
fraiwan_bad_name = 0

for wav_path in sorted(FRAIWAN_AUDIO_DIR.glob("*.wav")):
    stem = wav_path.stem  # name without extension
    if "_" not in stem:
        fraiwan_bad_name += 1
        continue

    pid, rest = stem.split("_", 1)              # BP101, Asthma,E W,P L M,12,F
    diag_raw = rest.split(",", 1)[0].strip()    # Asthma / COPD / N / Lung Fibrosis / ...

    # Drop multi-diagnosis cases for single-label classification
    if "+" in diag_raw:
        fraiwan_multi_diag += 1
        continue

    label = normalize_label(diag_raw)
    if label is None:
        fraiwan_unknown.append(diag_raw)
        continue

    fraiwan_rows.append({
        "filepath": str(wav_path),
        "dataset": "fraiwan",
        "patient_id": f"fraiwan_{pid}",
        "label": label,
    })

fraiwan_df = pd.DataFrame(fraiwan_rows)

print("Fraiwan total wav:", len(list(FRAIWAN_AUDIO_DIR.glob("*.wav"))))
print("Fraiwan rows kept:", len(fraiwan_df))
print("Dropped multi-diagnosis (+):", fraiwan_multi_diag)
print("Bad filename pattern:", fraiwan_bad_name)
print("Unknown diagnosis tokens (sample):", sorted(set(fraiwan_unknown))[:20])

fraiwan_df["label"].value_counts()


Fraiwan total wav: 336
Fraiwan rows kept: 309
Dropped multi-diagnosis (+): 9
Bad filename pattern: 0
Unknown diagnosis tokens (sample): ['Asthma and lung fibrosis', 'BRON', 'Plueral Effusion']


label
Normal           105
Asthma            96
Heart failure     54
COPD              27
Pneumonia         15
Lung fibrosis     12
Name: count, dtype: int64
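Note that the counts above contain no Bronchitis or Pleural effusion rows, while the unknown-token list includes `BRON` and `Plueral Effusion`. One possible fix is an alias table consulted when the base normalizer returns `None`. The sketch below is hedged: `EXTRA_ALIASES` and `normalize_with_aliases` are hypothetical names, reading `BRON` as bronchitis is an assumption to confirm against the dataset documentation, and the recorded outputs in this notebook were produced without it.

```python
from typing import Callable, Optional

# Hypothetical alias table for tokens that the base normalizer rejects.
EXTRA_ALIASES = {
    "bron": "Bronchitis",                    # ASSUMPTION: BRON abbreviates bronchitis
    "plueral effusion": "Pleural effusion",  # misspelling seen in the unknown-token output
}

def normalize_with_aliases(
    raw: Optional[str],
    base_normalizer: Callable[[Optional[str]], Optional[str]],
) -> Optional[str]:
    """Try the base normalizer first, then fall back to the alias table."""
    label = base_normalizer(raw)
    if label is not None:
        return label
    key = " ".join((raw or "").strip().lower().split())
    return EXTRA_ALIASES.get(key)

# Stand-in base normalizer that recognizes nothing, just to exercise the fallback.
reject_all = lambda raw: None
print(normalize_with_aliases("BRON", reject_all))
print(normalize_with_aliases("Asthma and lung fibrosis", reject_all))
```

In Cell 5 this would be called as `normalize_with_aliases(diag_raw, normalize_label)`. Combined diagnoses such as `Asthma and lung fibrosis` still map to `None`, which matches the single-label design.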

## 4) ICBHI rows

We:
1. Load `patient_diagnosis.csv` (patient_id, diagnosis)
2. Map diagnosis using `icbhi_to_final` (keep only 4 classes)
3. For each `.wav`, read patient_id from filename prefix (e.g., `101_...wav` → `101`)
4. Join to get label


In [6]:
# Cell 6: Build ICBHI rows using patient_diagnosis.csv + wav filenames

# CSV has no header: pid, diagnosis
diag = pd.read_csv(ICBHI_DIAG_CSV, header=None, names=["pid", "diagnosis"])
diag["pid"] = diag["pid"].astype(str)
diag["label"] = diag["diagnosis"].apply(icbhi_to_final)

# Keep only our 4 overlapping classes
diag = diag.dropna(subset=["label"]).copy()
diag_map = dict(zip(diag["pid"], diag["label"]))

icbhi_rows = []
icbhi_skipped = 0

for wav_path in sorted(ICBHI_AUDIO_DIR.glob("*.wav")):
    pid = wav_path.stem.split("_", 1)[0].strip()  # "101" from "101_1b1_Al_sc_....wav"
    label = diag_map.get(pid)
    if label is None:
        icbhi_skipped += 1
        continue

    icbhi_rows.append({
        "filepath": str(wav_path),
        "dataset": "icbhi",
        "patient_id": f"icbhi_{pid}",
        "label": label,
    })

icbhi_df = pd.DataFrame(icbhi_rows)

print("ICBHI total wav:", len(list(ICBHI_AUDIO_DIR.glob("*.wav"))))
print("ICBHI rows kept:", len(icbhi_df))
print("ICBHI skipped (non-overlap diagnoses):", icbhi_skipped)

icbhi_df["label"].value_counts()


ICBHI total wav: 920
ICBHI rows kept: 866
ICBHI skipped (non-overlap diagnoses): 54


label
COPD         793
Pneumonia     37
Normal        35
Asthma         1
Name: count, dtype: int64

## 5) Combine and save `manifest_all.csv`

The combined manifest has columns:

- `filepath` (absolute path)
- `dataset` (icbhi / fraiwan)
- `patient_id` (prefixed to avoid collisions: `icbhi_101`, `fraiwan_BP101`)
- `label` (one of 8 classes)
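Before saving, it can help to check that the manifest actually satisfies this schema. The function below is a minimal sketch under stated assumptions: `validate_manifest` and `REQUIRED_COLUMNS` are hypothetical names, and the checks shown (columns present, labels allowed, no duplicate file paths, one label per patient) are one reasonable selection rather than an exhaustive list.

```python
import pandas as pd

REQUIRED_COLUMNS = ["filepath", "dataset", "patient_id", "label"]

def validate_manifest(df: pd.DataFrame, allowed_labels) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    bad = sorted(set(df["label"]) - set(allowed_labels))
    if bad:
        problems.append(f"labels outside the class list: {bad}")
    if df["filepath"].duplicated().any():
        problems.append("duplicate filepaths found")
    labels_per_patient = df.groupby("patient_id")["label"].nunique()
    if (labels_per_patient > 1).any():
        problems.append("some patients carry more than one label")
    return problems

# Toy rows for illustration only.
toy = pd.DataFrame({
    "filepath": ["/tmp/a.wav", "/tmp/b.wav"],
    "dataset": ["icbhi", "fraiwan"],
    "patient_id": ["icbhi_101", "fraiwan_BP101"],
    "label": ["COPD", "Asthma"],
})
print(validate_manifest(toy, {"COPD", "Asthma", "Normal"}))
```

In the notebook the call would be `validate_manifest(manifest_all, EIGHT_CLASSES)`.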


In [7]:
# Cell 7: Combine manifests and save manifest_all.csv

manifest_all = pd.concat([fraiwan_df, icbhi_df], ignore_index=True)

print("Total rows:", len(manifest_all))
print("Unique patients:", manifest_all["patient_id"].nunique())
print("\nOverall label counts:\n", manifest_all["label"].value_counts())

manifest_path = OUT_DIR / "manifest_all.csv"
manifest_all.to_csv(manifest_path, index=False)
print("\nSaved:", manifest_path)


Total rows: 1175
Unique patients: 406

Overall label counts:
 label
COPD             820
Normal           140
Asthma            97
Heart failure     54
Pneumonia         52
Lung fibrosis     12
Name: count, dtype: int64

Saved: /teamspace/studios/this_studio/lung_sound_project/data/processed/manifests/manifest_all.csv


## 6) Patient-wise split (train/val/test)

We split by **patient_id** to prevent leakage.

Default split: **70% train, 10% val, 20% test**.

If a class has fewer than 3 patients, all of its patients stay in train. Every other class contributes at least one patient to both val and test.
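To see how the ratios translate into per-class patient counts, the sketch below reproduces just the counting arithmetic of the split (the same rounding and minimum-of-one rules as Cell 8, without any shuffling). `split_counts` is a hypothetical helper for illustration. Python's `round` rounds exact halves to the nearest even integer, so counts can differ by one from a naive "round half up" calculation.

```python
def split_counts(n: int, val: float = 0.1, test: float = 0.2) -> tuple:
    """Return (n_train, n_val, n_test) for a class with n patients."""
    if n < 3:
        return n, 0, 0                      # too few patients: everything stays in train
    n_test = max(1, round(n * test))        # at least one test patient
    n_val = max(1, round(n * val))          # at least one val patient
    if n_test + n_val >= n:
        n_test = max(1, n_test - 1)         # leave room for train
    return n - n_test - n_val, n_val, n_test

for n in (2, 3, 12, 54):
    print(n, split_counts(n))
# 2 (2, 0, 0)
# 3 (1, 1, 1)
# 12 (9, 1, 2)
# 54 (38, 5, 11)
```

Because the minimums are applied per class, small classes end up with a larger share in val and test than the nominal 10% and 20%.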


In [8]:
# Cell 8: Patient-wise stratified split and save train/val/test CSVs

def patient_wise_split(df: pd.DataFrame, train=0.7, val=0.1, test=0.2, seed=42):
    assert abs((train + val + test) - 1.0) < 1e-6

    # Check per-patient unique label
    per_patient = df.groupby("patient_id")["label"].nunique()
    conflict = per_patient[per_patient > 1].index.tolist()
    if conflict:
        print("WARNING: Patients with multiple labels found. Dropping:", len(conflict))
        df = df[~df["patient_id"].isin(conflict)].copy()

    patient_label = df.groupby("patient_id")["label"].first()

    # Patients grouped by label
    by_label = {}
    for lbl in sorted(df["label"].unique()):
        pts = patient_label[patient_label == lbl].index.tolist()
        rnd = random.Random(seed)
        rnd.shuffle(pts)
        by_label[lbl] = pts

    train_pts, val_pts, test_pts = set(), set(), set()

    for lbl, pts in by_label.items():
        n = len(pts)

        # Too small: keep in train
        if n < 3:
            train_pts.update(pts)
            continue

        n_test = max(1, round(n * test))
        n_val = max(1, round(n * val))
        if n_test + n_val >= n:
            n_test = max(1, n_test - 1)

        test_chunk = pts[:n_test]
        val_chunk = pts[n_test:n_test + n_val]
        train_chunk = pts[n_test + n_val:]

        test_pts.update(test_chunk)
        val_pts.update(val_chunk)
        train_pts.update(train_chunk)

    train_df = df[df["patient_id"].isin(train_pts)].reset_index(drop=True)
    val_df   = df[df["patient_id"].isin(val_pts)].reset_index(drop=True)
    test_df  = df[df["patient_id"].isin(test_pts)].reset_index(drop=True)

    return train_df, val_df, test_df

train_df, val_df, test_df = patient_wise_split(manifest_all, train=0.7, val=0.1, test=0.2, seed=SEED)

print("Rows  -> train:", len(train_df), "val:", len(val_df), "test:", len(test_df))
print("Pats  -> train:", train_df["patient_id"].nunique(),
      "val:", val_df["patient_id"].nunique(),
      "test:", test_df["patient_id"].nunique())

train_df.to_csv(OUT_DIR / "train.csv", index=False)
val_df.to_csv(OUT_DIR / "val.csv", index=False)
test_df.to_csv(OUT_DIR / "test.csv", index=False)

print("Saved splits into:", OUT_DIR)


Rows  -> train: 732 val: 152 test: 291
Pats  -> train: 286 val: 40 test: 80
Saved splits into: /teamspace/studios/this_studio/lung_sound_project/data/processed/manifests
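One thing worth checking before relying on these splits: Cell 3 counts 336 Fraiwan wav files, while the Excel sheet inspected in section 9 has 154 rows of which 42 have no diagnosis, leaving 112, and 336 = 3 × 112. That would be consistent with each subject having three recordings whose ids differ only in a leading letter. Only `BP…` ids appear in this notebook, so the `B`/`D`/`E` prefix scheme below is an assumption to verify against the dataset documentation. If it holds, the current `patient_id` treats each recording as its own patient, and one subject could land in several splits. A hedged sketch of a grouping key (`fraiwan_subject_key` is a hypothetical name):

```python
import re

def fraiwan_subject_key(pid: str) -> str:
    """Map ids like 'BP100', 'DP100', 'EP100' to a shared key 'P100'.

    ASSUMPTION: the leading B/D/E letter marks a recording mode, not a different subject.
    Ids that do not match the pattern are returned unchanged.
    """
    match = re.fullmatch(r"[BDE](P\d+)", pid)
    return match.group(1) if match else pid

for pid in ("BP100", "DP100", "EP100", "101"):
    print(pid, "->", fraiwan_subject_key(pid))
```

If the assumption is confirmed, building `patient_id` as `f"fraiwan_{fraiwan_subject_key(pid)}"` in Cell 5 would group a subject's recordings together. The row and patient counts recorded in this notebook were produced without this change.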


## 7) Save label maps (for later training + Gradio inference)

We store:
- `models/label_to_id.json`
- `models/id_to_label.json`

These must remain consistent throughout training and inference.
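One detail to keep in mind when reloading these files: JSON object keys are always strings, so the integer keys of `id_to_label` come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip on a two-class toy mapping and one way to restore integer keys; the variable names `restored` and `restored_int` are illustrative.

```python
import json

# Two-class toy mapping; the notebook's real mapping has eight entries.
label_to_id = {"Normal": 0, "Asthma": 1}
id_to_label = {i: lbl for lbl, i in label_to_id.items()}

# Serialize and parse again, as happens when the file is saved and later loaded.
restored = json.loads(json.dumps(id_to_label))
print(list(restored.keys()))      # ['0', '1']  (string keys)

# Convert keys back to int so that restored_int[predicted_class_index] works.
restored_int = {int(k): v for k, v in restored.items()}
print(restored_int == id_to_label)  # True
```

Doing the conversion once at load time avoids a `KeyError` when a model's integer prediction is used as the lookup key.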


In [9]:
# Cell 9: Save label mappings for later reuse

label_to_id_path = MODELS_DIR / "label_to_id.json"
id_to_label_path = MODELS_DIR / "id_to_label.json"

with open(label_to_id_path, "w", encoding="utf-8") as f:
    json.dump(label_to_id, f, indent=2)

with open(id_to_label_path, "w", encoding="utf-8") as f:
    json.dump(id_to_label, f, indent=2)

print("Saved:", label_to_id_path)
print("Saved:", id_to_label_path)


Saved: /teamspace/studios/this_studio/lung_sound_project/models/label_to_id.json
Saved: /teamspace/studios/this_studio/lung_sound_project/models/id_to_label.json


## 8) Quick distribution report (for the project report)

We print class counts for overall + each split.
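For a write-up, a single table with one column per split is often easier to read than four separate listings. The following is a minimal sketch using `pd.crosstab`; `split_label_table` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd

def split_label_table(splits: dict) -> pd.DataFrame:
    """Build a label-by-split count table from {split_name: DataFrame with a 'label' column}."""
    frames = [df.assign(split=name) for name, df in splits.items()]
    combined = pd.concat(frames, ignore_index=True)
    return pd.crosstab(combined["label"], combined["split"])

# Toy splits for illustration only.
toy_train = pd.DataFrame({"label": ["COPD", "COPD", "Normal"]})
toy_val = pd.DataFrame({"label": ["COPD"]})
toy_test = pd.DataFrame({"label": ["Normal"]})

table = split_label_table({"train": toy_train, "val": toy_val, "test": toy_test})
print(table)
```

With the real data the call would be `split_label_table({"train": train_df, "val": val_df, "test": test_df})`. Columns come out in alphabetical order; reorder with `table[["train", "val", "test"]]` if needed.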


In [10]:
# Cell 10: Distribution report

def show_counts(name, df):
    print(f"\n== {name} ==")
    print(df["label"].value_counts())

show_counts("Overall", manifest_all)
show_counts("Train", train_df)
show_counts("Val", val_df)
show_counts("Test", test_df)



== Overall ==
label
COPD             820
Normal           140
Asthma            97
Heart failure     54
Pneumonia         52
Lung fibrosis     12
Name: count, dtype: int64

== Train ==
label
COPD             476
Normal            99
Asthma            68
Pneumonia         42
Heart failure     38
Lung fibrosis      9
Name: count, dtype: int64

== Val ==
label
COPD             120
Normal            14
Asthma            10
Heart failure      5
Pneumonia          2
Lung fibrosis      1
Name: count, dtype: int64

== Test ==
label
COPD             224
Normal            27
Asthma            19
Heart failure     11
Pneumonia          8
Lung fibrosis      2
Name: count, dtype: int64


## 9) Optional: Inspect Fraiwan Excel annotations (not required for manifest)

The Fraiwan Excel file does **not** map rows directly to filenames (no filename column),  
so we use it only to understand possible multi-diagnosis patterns & label variants.


In [11]:
# Cell 11 (Optional): Inspect Fraiwan Excel sheet

if FRAIWAN_XLSX.exists():
    try:
        xls = pd.ExcelFile(FRAIWAN_XLSX)
        print("Sheets:", xls.sheet_names)
        df_x = pd.read_excel(FRAIWAN_XLSX, sheet_name=xls.sheet_names[0])
        print("Shape:", df_x.shape)
        if "Diagnosis" in df_x.columns:
            print("\nTop Diagnosis values (Excel):")
            print(df_x["Diagnosis"].astype(str).value_counts().head(20))
            print("\nExamples with '+':")
            print(df_x[df_x["Diagnosis"].astype(str).str.contains(r"\+", na=False)][["Age", "Gender", "Diagnosis"]].head(10))
        else:
            print("No 'Diagnosis' column found in Excel.")
    except Exception as e:
        print("Excel read failed:", e)
else:
    print("Excel file not found (optional).")


Sheets: ['Sheet1']
Shape: (154, 10)

Top Diagnosis values (Excel):
Diagnosis
nan                               42
N                                 35
Asthma                            17
heart failure                     15
asthma                            15
COPD                               8
pneumonia                          5
Lung Fibrosis                      4
Heart Failure                      3
BRON                               3
Heart Failure + COPD               2
Plueral Effusion                   2
Heart Failure + Lung Fibrosis      1
Asthma and lung fibrosis           1
copd                               1
Name: count, dtype: int64

Examples with '+':
    Age Gender                       Diagnosis
3  72.0      F  Heart Failure + Lung Fibrosis 
4  71.0      M            Heart Failure + COPD
6  65.0      M            Heart Failure + COPD


✅ Done. Next notebook: **02_train_model.ipynb**  
(Load `train.csv/val.csv/test.csv`, build mel spectrogram sequences, train DenseNet121 + LSTM, and save `best_model.pth` + `config.json`.)
