# Preprocessing

This notebook prepares the GRID audio-visual corpus for the multimodal enhancer.

**Pipeline (per speaker `s{i}_processed`, excluding `s21`):**
- Normalize folders: move `.mpg` into `videos/`, create `audio/`.
- Extract clean audio (16 kHz mono, PCM) from each video into `audio/`.
- Synthesize noisy audio in `audio_noisy/` by adding Gaussian noise at a random SNR.
- Create aligned **train/val/test** splits (80/10/10) by stem for `videos/`, `audio/`, and `audio_noisy/`.
- Export log-Mel spectrogram features (`n_mels=80`) to `.npy` in `audio_mel/` and `audio_noisy_mel/`.

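**Resulting structure (per speaker):**
- `videos/{train,val,test}/*.mpg`
- `audio/{train,val,test}/*.wav`
- `audio_noisy/{train,val,test}/*.wav`
- `audio_mel/{train,val,test}/*.npy`
- `audio_noisy_mel/{train,val,test}/*.npy`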

In [17]:
# Imports
import os
import shutil
import random
import librosa
import soundfile as sf
import numpy as np
from pathlib import Path
from moviepy import VideoFileClip
from tqdm import tqdm

# Constants
ROOT = Path("../../data/dataset")

## Normalize per-speaker folders

This cell prepares each `s{i}_processed` directory (excluding `s21`) by:
- ensuring `videos/` and `audio/` subfolders exist, and
- moving any loose `*.mpg` files from the speaker root into `videos/`.

**Notes**
- Idempotent: re-running won’t break the layout (already moved files are skipped).
- Make sure `ROOT` points to `.../data/dataset`.
- This step only reorganizes files; audio extraction happens in the next cell.

In [4]:
for i in range(1,35):
    if i == 21: continue
    
    subject_folder = ROOT / f"s{i}_processed"
    
    if not subject_folder.exists():
        print(f"Directory {subject_folder} not found.")
        continue

    videos_dir = subject_folder / "videos"
    audio_dir  = subject_folder / "audio"
    
    videos_dir.mkdir(exist_ok=True) 
    audio_dir.mkdir(exist_ok=True)

    for file in subject_folder.glob("*.mpg"):
        dest = videos_dir / file.name
        print(f"Change dir for {file} -> {dest}")
        shutil.move(str(file), str(dest))

print("Completed!")

Change dir for ../../data/dataset/s1_processed/prwq3s.mpg -> ../../data/dataset/s1_processed/videos/prwq3s.mpg
Change dir for ../../data/dataset/s1_processed/pbib8p.mpg -> ../../data/dataset/s1_processed/videos/pbib8p.mpg
Change dir for ../../data/dataset/s1_processed/lrae3s.mpg -> ../../data/dataset/s1_processed/videos/lrae3s.mpg
Change dir for ../../data/dataset/s1_processed/pgid6p.mpg -> ../../data/dataset/s1_processed/videos/pgid6p.mpg
Change dir for ../../data/dataset/s1_processed/pbao8n.mpg -> ../../data/dataset/s1_processed/videos/pbao8n.mpg
Change dir for ../../data/dataset/s1_processed/prbx3s.mpg -> ../../data/dataset/s1_processed/videos/prbx3s.mpg
Change dir for ../../data/dataset/s1_processed/lbbk6p.mpg -> ../../data/dataset/s1_processed/videos/lbbk6p.mpg
Change dir for ../../data/dataset/s1_processed/bgwu6n.mpg -> ../../data/dataset/s1_processed/videos/bgwu6n.mpg
Change dir for ../../data/dataset/s1_processed/sbig6p.mpg -> ../../data/dataset/s1_processed/videos/sbig6p.mpg

## Extract mono 16 kHz audio from videos

This cell scans each speaker folder (excluding `s21`), finds all `*.mpg` in `videos/`,
and writes the audio tracks to `audio/` as mono 16 kHz WAV (`pcm_s16le`).
- Idempotent: existing WAVs are skipped.
- Progress bars show per-speaker and per-file status.
- Requires `ffmpeg` (MoviePy uses it under the hood).

In [None]:
for i in tqdm([x for x in range(1, 35) if x != 21], desc="Extraction"):
    subject_dir = ROOT / f"s{i}_processed"
    if not subject_dir.exists():
        tqdm.write(f"Directory {subject_dir} not found.")
        continue

    videos_dir = subject_dir / "videos"
    audio_dir = subject_dir / "audio"
    audio_dir.mkdir(exist_ok=True)

    mpg_files = sorted(videos_dir.glob("*.mpg"))
    if not mpg_files:
        tqdm.write(f"No .mpg file found in {videos_dir}")
        continue

    file_bar = tqdm(mpg_files, desc=f"s{i} audio extraction", leave=False)
    for mpg_path in file_bar:
        wav_path = audio_dir / (mpg_path.stem + ".wav")
        if wav_path.exists():
            continue  # idempotent: skip files that were already extracted

        # Show the file currently being processed
        file_bar.set_postfix_str(f"Extracting: {mpg_path.name}")
        try:
            with VideoFileClip(str(mpg_path)) as clip:
                if clip.audio is None:
                    tqdm.write(f"No audio track in {mpg_path}")
                    continue
                clip.audio.write_audiofile(
                    str(wav_path),
                    fps=16000,                   # resample to 16 kHz
                    nbytes=2,                    # 16-bit samples
                    codec="pcm_s16le",
                    ffmpeg_params=["-ac", "1"],  # downmix to mono, as the text above promises
                    logger=None,
                )
        except Exception as e:
            file_bar.set_postfix_str(f"Error: {mpg_path.name}")
            tqdm.write(f"Error with {mpg_path}: {e}")
    file_bar.close()

print("Extraction completed!")

Extraction:  18%|█▊        | 6/33 [03:09<17:40, 39.28s/it]Exception ignored in: <function FFMPEG_AudioReader.__del__ at 0x1075b0720>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/moviepy/audio/io/readers.py", line 304, in __del__
    self.close()
  File "/opt/anaconda3/lib/python3.12/site-packages/moviepy/audio/io/readers.py", line 294, in close
    if self.proc:
       ^^^^^^^^^
AttributeError: 'FFMPEG_AudioReader' object has no attribute 'proc'
Extraction: 100%|██████████| 33/33 [36:28<00:00, 66.33s/it] 

Extraction completed!
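
Once extraction has finished, the format of the written files can be spot-checked from the WAV header. The following is a minimal sketch using only the standard-library `wave` module. `describe_wav` and `is_expected_format` are hypothetical helper names, not part of the notebook, and the demo writes a short synthetic file to a temporary directory rather than touching the dataset.

```python
import tempfile
import wave
from pathlib import Path


def describe_wav(path):
    """Return (channels, sample_rate_hz, bytes_per_sample) from a PCM WAV header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnchannels(), wf.getframerate(), wf.getsampwidth()


def is_expected_format(path, channels=1, sample_rate=16000, sample_width=2):
    """True if the file is mono, 16 kHz, 16-bit PCM (the format this notebook targets)."""
    return describe_wav(path) == (channels, sample_rate, sample_width)


# Demo on a synthetic file: 0.1 s of 16-bit mono silence at 16 kHz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.wav"
    with wave.open(str(demo_path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * 1600)
    demo_info = describe_wav(demo_path)
    demo_ok = is_expected_format(demo_path)
```

On the real data, the same check could be applied to a few files under each speaker's `audio/` folder before the split step moves them into `train/`, `val/`, and `test/`.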





## Create noisy counterparts at target SNR

This cell generates a noisy version of every clean WAV per speaker (excluding `s21`) and
writes it to `audio_noisy/` with the same filename. Gaussian noise is added to reach a
target Signal-to-Noise Ratio (SNR) in dB, sampled uniformly from a range (default: 5–100 dB).

- **Idempotent:** existing files are skipped.
- **Mono handling:** if an input WAV is stereo, only the first channel is used.
- **SNR sampling:** adjust `snr_range` (e.g., 5–20 dB) for more realistic conditions.
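- **Reproducibility:** the SNR is drawn with Python's `random` module, while the noise samples come from `np.random`. Set both `random.seed(...)` and `np.random.seed(...)` before the loop if deterministic output is desired.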
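
The noise level used by `add_noise` follows from the definition of SNR in decibels. As a worked sketch (the symbols $P_s$, $P_n$, $\sigma$, and $x[t]$ are notation introduced here, not names from the notebook):

```latex
\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}\frac{P_s}{P_n}
\quad\Longrightarrow\quad
P_n = \frac{P_s}{10^{\mathrm{SNR}_{\mathrm{dB}}/10}},
\qquad
\sigma = \sqrt{P_n},
\qquad
P_s = \frac{1}{N}\sum_{t=1}^{N} x[t]^2 .
```

For example, at 20 dB the noise power is $P_s/100$, so the noise standard deviation is one tenth of the signal's RMS. At 100 dB it is $10^{-5}$ of the RMS, which is below one quantization step of 16-bit PCM (about $3 \times 10^{-5}$ of full scale), so the upper end of the default range yields files that are effectively clean.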

In [14]:
def add_noise(audio, snr_db) -> np.ndarray:
    """
    Adds Gaussian noise to an audio signal at the specified SNR (in dB).

    Args:
        audio (np.ndarray): The input audio signal.
        snr_db (float): Desired signal-to-noise ratio in decibels.

    Returns:
        np.ndarray: The noisy audio signal.
    """
    signal_power = np.mean(audio ** 2)
    snr_linear   = 10 ** (snr_db / 10)
    noise_power  = signal_power / snr_linear
    noise        = np.random.normal(0, np.sqrt(noise_power), audio.shape)
    return audio + noise

snr_range = (5, 100)  # (min, max) target SNR in dB, sampled uniformly per file

for i in tqdm([x for x in range(1,35) if x != 21], desc="Noising"):
    subject_dir = ROOT / f"s{i}_processed"
    audio_dir   = subject_dir / "audio"
    noisy_dir   = subject_dir / "audio_noisy"
    noisy_dir.mkdir(exist_ok=True)

    wav_files = sorted(audio_dir.glob("*.wav"))
    if not wav_files:
        tqdm.write(f"No .wav file found in {audio_dir}")
        continue

    for wav_path in tqdm(wav_files, desc=f"s{i} add noise", leave=False):
        out_path = noisy_dir / wav_path.name
        if out_path.exists(): continue

        audio, sr = sf.read(str(wav_path))
        if len(audio.shape) > 1:
            audio = audio[:, 0]

        snr_db      = random.uniform(*snr_range)
        noisy_audio = add_noise(audio, snr_db=snr_db)
        sf.write(str(out_path), noisy_audio, sr)

print("Add noise!")

Noising: 100%|██████████| 33/33 [00:00<00:00, 75.24it/s]

Add noise!





## Create aligned train/val/test splits (by stem)

This step splits each speaker’s data into **train/val/test = 80/10/10** using a seeded
shuffle (`seed=42`). File stems present **in all three modalities** (`audio/*.wav`,
`audio_noisy/*.wav`, `videos/*.mpg`) are intersected to avoid mismatches; only those
common items are split.

- **Aligned across modalities:** the same stems go to the same split for audio, noisy audio, and video.
- **Deterministic:** fixed random seed for reproducible partitions.
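- **Idempotent:** creates `train/val/test` subfolders and moves files into them; missing source files are skipped, and re-running is a no-op because only files still at the top level of each modality folder are considered.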
- **Adjustable:** change `split_ratios` to modify the partition (e.g., `[0.9, 0.05, 0.05]`).

After this, each `s{i}_processed/{audio,audio_noisy,videos}/` directory contains
`train/`, `val/`, and `test/` subfolders, and every split holds the same stems in all three modalities.

In [16]:
split_ratios = [0.8, 0.1, 0.1]
random.seed(42)

for i in tqdm([x for x in range(1, 35) if x != 21], desc="Separation"):
    subject     = ROOT / f"s{i}_processed"
    base_names  = {p.stem for p in (subject/"audio").glob("*.wav")}
    base_names &= {p.stem for p in (subject/"videos").glob("*.mpg")} 
    base_names &= {p.stem for p in (subject/"audio_noisy").glob("*.wav")}

    base_list   = sorted(base_names)  # sort first: set iteration order is not stable across runs
    random.shuffle(base_list)
    n           = len(base_list)
    n_train     = int(split_ratios[0] * n)
    n_val       = int(split_ratios[1] * n)  # the rounding remainder goes to test

    splits = {
        "train": base_list[:n_train],
        "val":   base_list[n_train:n_train+n_val],
        "test":  base_list[n_train+n_val:]
    }

    for modality in ["audio", "audio_noisy", "videos"]:
        mod_dir = subject / modality
        if not mod_dir.exists():
            continue
        for split, names in splits.items():
            target = mod_dir / split
            target.mkdir(exist_ok=True)
            for nm in names:
                ext = ".wav" if "audio" in modality else ".mpg"
                src = mod_dir / (nm + ext)
                if src.exists():
                    shutil.move(str(src), target / src.name)

print("Split completed!")

Separation: 100%|██████████| 33/33 [00:07<00:00,  4.31it/s]

Split completed!
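
The split arithmetic can be tried in isolation. The sketch below is a simplified restatement, not the cell's exact behavior: `split_stems` is a hypothetical helper, and it uses a local `random.Random(seed)` per call, whereas the cell above seeds the global RNG once and then shuffles speaker after speaker. The resulting partitions therefore differ from the notebook's, but the sizing logic is the same.

```python
import random


def split_stems(stems, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle stems with a local seeded RNG and cut them into train/val/test."""
    ordered = sorted(stems)               # fixed starting order before shuffling
    random.Random(seed).shuffle(ordered)  # local RNG: global random state is untouched
    n_train = int(ratios[0] * len(ordered))
    n_val = int(ratios[1] * len(ordered))
    return {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        "test": ordered[n_train + n_val:],  # test absorbs the rounding remainder
    }


# 1000 stems split cleanly; 7 stems show the effect of int() truncation.
sizes_1000 = {k: len(v) for k, v in split_stems(f"clip{j:04d}" for j in range(1000)).items()}
sizes_7 = {k: len(v) for k, v in split_stems(f"clip{j}" for j in range(7)).items()}
print(sizes_1000)  # {'train': 800, 'val': 100, 'test': 100}
print(sizes_7)     # {'train': 5, 'val': 0, 'test': 2}
```

With very small speakers the validation split can end up empty, as the 7-stem case shows, so it may be worth checking split sizes before training.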





## Offline Mel-spectrogram precomputation

This step converts every `.wav` in each split (`audio/` and `audio_noisy/`) to a
log-Mel spectrogram and saves it as a NumPy array (`.npy`) with the same stem.

- **I/O:** reads `…/audio{,_noisy}/{train,val,test}/*.wav`, writes to
  `…/audio_mel/` and `…/audio_noisy_mel/` mirroring the split structure.
- **Features:** `librosa.feature.melspectrogram` at 16 kHz with `n_mels=80`
  (configurable), then `power_to_db` with `ref=np.max` (log scale, each clip
  expressed relative to its own peak). Output shape is `[n_mels, T]`.
- **Idempotent:** skips files that already exist.
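- **Consistency tip:** keep Mel params (`sr`, `n_mels`, `n_fft`, hop, window,
  `power`) aligned with what the model expects. This cell leaves `n_fft`,
  `hop_length`, window, and `power` at the librosa defaults (`2048`, `512`,
  Hann, `2.0`). If the training pipeline uses linear magnitude (`power=1.0`),
  set that here as well.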

After this stage, dataloaders can directly `np.load()` Mel tensors from disk for
faster, repeatable training.
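
The time dimension `T` of each saved array can be estimated ahead of time. The sketch below assumes the librosa defaults that `extract_and_save_mel` relies on (`hop_length=512`, `center=True`), under which the frame count is `1 + n_samples // hop_length`. `mel_shape` is a hypothetical helper, and the 3-second clip length is an illustrative assumption rather than a value measured on the dataset.

```python
def mel_shape(n_samples, n_mels=80, hop_length=512):
    """Expected (n_mels, T) of a Mel spectrogram computed from a centered STFT."""
    n_frames = 1 + n_samples // hop_length  # holds when the signal is center-padded
    return n_mels, n_frames


sr = 16000
shape_3s = mel_shape(3 * sr)  # 48000 samples
print(shape_3s)  # (80, 94)
```

Clips of different lengths produce different `T`, so a dataloader that batches these arrays needs padding or cropping along the time axis.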

In [20]:
def extract_and_save_mel(audio_path, out_path, sr=16000, n_mels=80) -> None:
    """
    Extracts the mel-spectrogram from an audio file and saves it as a NumPy array.

    Args:
        audio_path (str or Path): Path to the input audio file.
        out_path (str or Path): Path where the mel-spectrogram will be saved (.npy).
        sr (int, optional): Target sampling rate for audio loading. Default is 16000.
        n_mels (int, optional): Number of mel bands to generate. Default is 80.

    Returns:
        None
    """
    y, _    = librosa.load(audio_path, sr=sr)
    mel     = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db  = librosa.power_to_db(mel, ref=np.max)
    np.save(out_path, mel_db)

for i in tqdm([x for x in range(1,35) if x != 21], desc="Spectr"):
    subject_dir = ROOT / f"s{i}_processed"

    for typ in ["audio", "audio_noisy"]:
        for split in ["train", "val", "test"]:
            in_folder = subject_dir / typ / split
            if not in_folder.exists(): continue

            out_folder = subject_dir / f"{typ}_mel" / split
            out_folder.mkdir(parents=True, exist_ok=True)
            wav_files = list(in_folder.glob("*.wav"))

            for wav in tqdm(wav_files, desc=f"s{i}-{typ}/{split}", leave=False):
                out_path = out_folder / (wav.stem + ".npy")
                if out_path.exists(): continue
                extract_and_save_mel(str(wav), str(out_path))

print("MelSpect preprocessing completed!")



Spectr: 100%|██████████| 33/33 [04:50<00:00,  8.79s/it]

MelSpect preprocessing completed!



