# 01 - BLOCS SMAD Segments EDA

Exploratory data analysis for the BLOCS segment manifest and the acoustic stats / QC flags produced by `build_acoustic_stats.py`.

## How to use this notebook

This notebook assumes you have:

1. Synced audio and segments locally with `python scripts/sync_b2_data.py`.
2. Built the acoustic-stats manifest with `python -m data_processing.build_acoustic_stats`.

If you want to experiment freely, make a personal copy in `notebooks/local` (e.g., `01_smad_segments_eda_<yourname>.ipynb`) and work in that file.

## 1. Load configuration and datasets

In this section we load project settings, point to the BLOCS metadata directory, and open the base segment manifest (`blocs_smad_segments`) and the acoustic-stats–augmented manifest (`blocs_smad_v1`).

In [None]:
from pathlib import Path
import sys

import pandas as pd
from datasets import load_from_disk

from utils.config_utils import load_env, add_project_root_to_path

load_env()
# Put the project root on sys.path before importing project-level modules.
add_project_root_to_path()

from config import get_settings
from utils.dams_types import BLOCS_SMAD_SEGMENTS, BLOCS_SMAD_V1

# Load project settings for file paths and parameters.
settings = get_settings()
metadata_dir = Path(settings.metadata_path)
segments_dir = Path(settings.segments_path)

# Define manifest paths.
base_manifest_path = metadata_dir / BLOCS_SMAD_SEGMENTS
acoustic_manifest_path = metadata_dir / BLOCS_SMAD_V1

# Load HF datasets to pandas dataframes.
ds_segments = load_from_disk(base_manifest_path)
ds_acoustic = load_from_disk(acoustic_manifest_path)
df_segments = ds_segments.to_pandas()
df_acoustic = ds_acoustic.to_pandas()

len(df_segments), len(df_acoustic) # Should be the same number of rows.

### 1.1 HuggingFace Datasets

This project uses HuggingFace Datasets. If you are unfamiliar with the library, see its documentation and tutorials: https://huggingface.co/docs/datasets/en/index
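
Once both manifests are converted to pandas, a quick alignment check goes further than comparing row counts. The sketch below is illustrative only: `compare_manifests` is a hypothetical helper, and `segment_id` is an assumed key column that may be named differently in the real manifests.

```python
import pandas as pd


def compare_manifests(base: pd.DataFrame, augmented: pd.DataFrame, key: str) -> dict:
    """Summarize how two manifests line up on a shared key column."""
    base_keys = set(base[key])
    aug_keys = set(augmented[key])
    return {
        "same_length": len(base) == len(augmented),
        "missing_in_augmented": sorted(base_keys - aug_keys),
        "extra_in_augmented": sorted(aug_keys - base_keys),
        "new_columns": [c for c in augmented.columns if c not in base.columns],
    }


# Toy stand-ins for df_segments and df_acoustic; `segment_id` is a hypothetical key.
toy_segments = pd.DataFrame({"segment_id": ["a", "b", "c"], "duration": [1.0, 2.0, 1.5]})
toy_acoustic = toy_segments.assign(rms_db=[-20.0, -35.0, -18.0])

report = compare_manifests(toy_segments, toy_acoustic, key="segment_id")
print(report)
```

With the real data, the same call would take `df_segments` and `df_acoustic` plus whichever column uniquely identifies a segment.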

## 2. Inspect schema and basic statistics

Here we inspect columns, dtypes, and simple descriptive statistics for segment-level metadata.
The goal is to get a feel for what a “segment row” looks like before training. Columns worth knowing up front:

+ `split`: train/val/test split designation. Default is unsplit.
+ `label_source`: source of transcription labels (none, gold, or specific model). Default is none.
+ `*_label`: binary multi-label targets for presence of speech, music, or noise (0/1).
+ `*_score`: teacher confidence scores for specific label types (0.0–1.0).
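
The value ranges described above can be checked mechanically. The following is a minimal sketch on toy data, assuming only the `*_label` and `*_score` naming pattern from the list; `check_label_columns` is a hypothetical helper, not part of the project code.

```python
import pandas as pd


def check_label_columns(df: pd.DataFrame) -> dict:
    """Per column: are *_label values in {0, 1} and *_score values in [0, 1]?"""
    results = {}
    for col in df.columns:
        values = df[col].dropna()
        if col.endswith("_label"):
            results[col] = bool(values.isin([0, 1]).all())
        elif col.endswith("_score"):
            results[col] = bool(values.between(0.0, 1.0).all())
    return results


# Toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "split": ["unsplit", "unsplit"],
    "label_source": ["none", "gold"],
    "speech_label": [0, 1],
    "music_label": [0, 2],       # deliberately invalid
    "speech_score": [0.10, 0.93],
})

checks = check_label_columns(toy)
print(checks)
```

Running the same helper on `df_acoustic` would flag any label or score column that drifts outside its documented range.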

In [None]:
# Preview first few rows of segment-level metadata from blocs_smad_segments.

display(df_segments.head())

In [None]:
# Preview first few rows of acoustic stats and QC flags from blocs_smad_v1.

display(df_acoustic.head())

In [None]:
# Get familiar with the acoustic column names and dtypes.

df_acoustic.dtypes

In [None]:
# Basic descriptive statistics for numeric columns.

with pd.option_context('display.float_format', '{:.6f}'.format):
    display(df_acoustic.describe(include="number").T)


## 3. Acoustic flags: prevalence and distribution

In this section we summarize how often each acoustic flag fires (too quiet, mostly silence, heavily clipped, too short, had error), both as counts and percentages. This helps sanity-check the thresholds used in `build_acoustic_stats.py`.
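
To make the link between thresholds and flag prevalence concrete, here is a small sketch on toy data. The two threshold values are hypothetical and are not taken from `build_acoustic_stats.py`; only the flag and column names come from this notebook.

```python
import pandas as pd

# Hypothetical thresholds for illustration; the real values live in build_acoustic_stats.py.
RMS_DB_FLOOR = -50.0
SILENCE_RATIO_CEILING = 0.8

# Toy acoustic stats; the values are made up.
toy = pd.DataFrame({
    "rms_db": [-62.0, -30.0, -18.0, -55.0],
    "silence_ratio": [0.95, 0.20, 0.05, 0.85],
})

# Derive boolean flags by comparing each statistic with its threshold.
flags = pd.DataFrame({
    "too_quiet": toy["rms_db"] < RMS_DB_FLOOR,
    "mostly_silence": toy["silence_ratio"] > SILENCE_RATIO_CEILING,
})

# Summing booleans gives counts; the mean gives the fraction of flagged rows.
prevalence = pd.DataFrame({
    "count": flags.sum(),
    "percent": flags.mean() * 100,
})
print(prevalence)
```

Moving a threshold and recomputing `prevalence` shows directly how sensitive a flag is to that choice.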

In [None]:
# Summarize prevalence of acoustic QC flags.

# Identify QC flag columns: anything ending in `_flag`, plus the known flag names.
known_flags = {'too_quiet', 'mostly_silence', 'heavily_clipped', 'too_short', 'had_error'}
flag_cols = [
    col for col in df_acoustic.columns
    if col.endswith('_flag') or col in known_flags
]

# Compute counts and percentages.
flag_summary = pd.DataFrame({
    'count': df_acoustic[flag_cols].sum(),
    'percent': (df_acoustic[flag_cols].mean() * 100)
}).sort_values('percent', ascending=False)

with pd.option_context('display.float_format', '{:.2f}'.format):
    display(flag_summary)

## 4. Distributions of core acoustic features

We visualize the distributions of the core numeric features (`rms_db`, `silence_ratio`, `zero_crossing_rate`, `snr_db`, and `energy_variance`, where present) and look for outliers or pathological patterns that might affect training.
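
Histograms show outliers visually; a numeric summary can complement them. The sketch below uses the common 1.5 × IQR rule, which is our choice for illustration and not something the pipeline prescribes. `iqr_outlier_summary` is a hypothetical helper and the data is a toy series.

```python
import numpy as np
import pandas as pd


def iqr_outlier_summary(series: pd.Series, k: float = 1.5) -> dict:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN and infinities."""
    clean = series.replace([np.inf, -np.inf], np.nan).dropna()
    q1, q3 = clean.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_outliers = int(((clean < lower) | (clean > upper)).sum())
    return {
        "n": len(clean),
        "lower": float(lower),
        "upper": float(upper),
        "n_outliers": n_outliers,
    }


# Toy rms_db values, including one extreme value and two non-finite entries.
toy_rms_db = pd.Series([-20.0, -21.0, -19.5, -22.0, -20.5, -80.0, -np.inf, np.nan])
summary = iqr_outlier_summary(toy_rms_db)
print(summary)
```

Applied per column of `df_acoustic`, this gives a rough count of how many segments sit far from the bulk of each distribution.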

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import MaxNLocator

def set_plot_style() -> None:
    """Set a clean, readable style for EDA plots without grid lines."""
    sns.set_style("white", {"axes.grid": False})
    sns.set_context("talk")


def style_generic_plot(ax, title: str, xlabel: str, ylabel: str) -> None:
    """Apply consistent styling to a single axis."""
    sns.despine(ax=ax)  # removes top and right spines

    # turn OFF all gridlines
    ax.grid(False)

    ax.set_title(title, fontsize=14)
    ax.set_xlabel(xlabel, fontsize=12)
    ax.set_ylabel(ylabel, fontsize=12)
    ax.xaxis.set_major_locator(MaxNLocator(nbins=6))
    ax.tick_params(axis="both", which="major", labelsize=10)
    ax.set_facecolor("white")


set_plot_style()

# Core acoustic features we care about for QC and modeling.
acoustic_cols = [
    "rms_db",
    "silence_ratio",
    "zero_crossing_rate",
    "snr_db",
    "energy_variance",
]

# Keep only columns that exist in the current manifest.
acoustic_cols = [c for c in acoustic_cols if c in df_acoustic.columns]

# HUSL palette
colors = sns.color_palette("husl", n_colors=len(acoustic_cols))

fig, axes = plt.subplots(len(acoustic_cols), 1, figsize=(8, 2.5 * len(acoustic_cols)))

if len(acoustic_cols) == 1:
    axes = [axes]

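for ax, col, color in zip(axes, acoustic_cols, colors):
    # Drop NaN and infinite values first; hist() cannot compute bins over them.
    values = df_acoustic[col].replace([float("inf"), float("-inf")], float("nan")).dropna()
    ax.hist(values, bins=40, color=color, edgecolor="black", alpha=0.9)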
    style_generic_plot(
        ax,
        title=f"Distribution of {col}",
        xlabel=col,
        ylabel="count",
    )

plt.tight_layout()
plt.show()

## 5. Listen to example segments

To ground the numbers in actual audio, we sample a few segments and listen to them. We can filter by QC flags (for example “good” segments versus “mostly silence”) to verify that the flags match our intuition.
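
Because not every manifest version necessarily carries every flag, it can help to build the "good segment" filter from whichever flag columns are present. This is a sketch under that assumption; `build_clean_query` is a hypothetical helper, and it assumes the flag columns are boolean.

```python
from typing import Optional, Sequence

import pandas as pd

QC_FLAGS = ("too_quiet", "mostly_silence", "heavily_clipped", "too_short", "had_error")


def build_clean_query(df: pd.DataFrame, flags: Sequence[str] = QC_FLAGS) -> Optional[str]:
    """Build a DataFrame.query string that keeps rows where no available QC flag is set."""
    present = [flag for flag in flags if flag in df.columns]
    if not present:
        return None
    return " and ".join(f"(~{flag})" for flag in present)


# Toy frame with only two of the five flags (assumed to be boolean columns).
toy = pd.DataFrame({
    "too_quiet": [False, True, False],
    "mostly_silence": [False, False, True],
})

clean_query = build_clean_query(toy)
print(clean_query)
print(toy.query(clean_query))
```

The resulting string can be passed as the `query` argument of the playback helpers defined in this section.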

In [None]:
from IPython.display import HTML, Audio, display

from utils.audio_io import load_waveform


def _resolve_segment_path(row) -> Path:
    """Resolve the on-disk path for a segment from a row."""
    candidate_cols = ["segment_path", "segment_relpath", "segment_file"]
    for col in candidate_cols:
        if col in row.index:
            return segments_dir / row[col]
    raise KeyError(
        "Could not find a segment path column. "
        "Update `_resolve_segment_path` with the correct column name."
    )


def play_random_segment(df, query: str | None = None, sample_rate: int = 16_000) -> None:
    """Sample a random segment (optionally filtered by a query) and play it."""
    subset = df.query(query) if query else df
    if subset.empty:
        print("No rows matched the query.")
        return

    row = subset.sample(1).iloc[0]
    audio_path = _resolve_segment_path(row)

    # Clean filename for display
    segment_name = audio_path.name

    print(f"\n▶ Segment: {segment_name}")
    print(f"   Path: {audio_path}")
    print()

    waveform, sr = load_waveform(audio_path)
    sr = sr or sample_rate

    # Show the row metadata and play the audio.
    display(row)
    display(Audio(waveform.squeeze().numpy(), rate=sr))

In [None]:
# Let's also define a function to play a grid of segments.

def play_segment_grid(
    df,
    query: str | None = None,
    n: int = 6,
    n_cols: int = 3,
    sample_rate: int = 16_000,
) -> None:
    """Display a small grid of audio segments with inline players."""
    subset = df.query(query) if query else df
    if subset.empty:
        print("No rows matched the query.")
        return

    n = min(n, len(subset))
    sampled = subset.sample(n, random_state=0)

    cell_html_blocks: list[str] = []

    for _, row in sampled.iterrows():
        audio_path = _resolve_segment_path(row)
        waveform, sr = load_waveform(audio_path)
        sr = sr or sample_rate

        segment_name = audio_path.name

        # Build an embedded audio player for this segment
        audio_widget = Audio(waveform.squeeze().numpy(), rate=sr, embed=True)
        audio_html = audio_widget._repr_html_()

        # Wrap filename + audio in a small card
        card_html = f"""
        <div style="font-size:12px; margin-bottom:4px; font-family:monospace;">
            {segment_name}
        </div>
        {audio_html}
        """
        cell_html_blocks.append(card_html)

    # Arrange cards into a table grid
    rows_html: list[str] = []
    for i in range(0, len(cell_html_blocks), n_cols):
        cells = cell_html_blocks[i : i + n_cols]
        row_html = "".join(
            f"<td style='padding:8px; vertical-align:top;'>{cell}</td>"
            for cell in cells
        )
        rows_html.append(f"<tr>{row_html}</tr>")

    table_html = "<table><tbody>" + "".join(rows_html) + "</tbody></table>"
    display(HTML(table_html))

In [None]:


# 1) Completely random segment
play_random_segment(df_acoustic)

# 2) A "good" segment that passed all QC flags (uncomment if these columns exist)
# play_random_segment(
#     df_acoustic,
#     query="(~too_quiet) and (~mostly_silence) and (~heavily_clipped) and (~too_short) and (~had_error)",
# )

# 3) A segment that is mostly silence (uncomment if `mostly_silence` exists)
# play_random_segment(df_acoustic, query="mostly_silence")

In [None]:
# 6 random segments from the whole distribution
play_segment_grid(df_acoustic, n=6, n_cols=3)

# 6 segments that are mostly silence (if that flag exists)
# play_segment_grid(df_acoustic, query="mostly_silence", n=6, n_cols=3)

# 6 "good" segments with no QC issues
# play_segment_grid(
#     df_acoustic,
#     query="(~too_quiet) and (~mostly_silence) and (~heavily_clipped) and (~too_short) and (~had_error)",
#     n=6,
#     n_cols=3,
# )

## 6. Correlations among acoustic features

To understand how the core acoustic statistics relate to each other and whether they form meaningful structure for modeling, we compute correlation matrices and inspect which features co-vary. This helps us see whether the dataset naturally separates into regimes (for example speech-like, music-like, noise-like).
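
A heatmap is easy to scan, but a ranked list of the strongest pairs is sometimes easier to act on. The sketch below shows one way to extract it; `top_correlated_pairs` is a hypothetical helper and the data is a toy frame with made-up values.

```python
from itertools import combinations

import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, n_pairs: int = 3) -> pd.DataFrame:
    """Return the feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Visit each unordered pair once, skipping the diagonal.
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "corr"])
    pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return pairs.head(n_pairs).reset_index(drop=True)


# Toy values: snr_db is an exact linear function of rms_db here.
toy = pd.DataFrame({
    "rms_db": [-60.0, -50.0, -40.0, -30.0, -20.0],
    "snr_db": [0.0, 5.0, 10.0, 15.0, 20.0],
    "silence_ratio": [0.9, 0.7, 0.6, 0.2, 0.1],
})

pairs = top_correlated_pairs(toy)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations (such as loudness against silence ratio in the toy data) near the top.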

In [None]:
# Correlation matrix for numeric acoustic features.

numeric_cols = df_acoustic.select_dtypes(include="number").columns
exclude_cols = [
    "speech_label", "music_label", "noise_label",
    "speech_score", "music_score", "noise_score",
    "start_time", "end_time",
]
numeric_cols = [col for col in numeric_cols if col not in exclude_cols]

corr = df_acoustic[numeric_cols].corr()

plt.figure(figsize=(8, 6))
ax = sns.heatmap(
    corr,
    cmap="coolwarm",
    center=0.0,
    square=True,
    cbar_kws={"shrink": 0.8},
    xticklabels=numeric_cols,
    yticklabels=numeric_cols,
    linewidths=0.5,       # thin separators between cells
    linecolor="white",
    vmin=-1.0,
    vmax=1.0,
)

# Title and axis label styling
ax.set_title(
    "Correlation matrix of acoustic features",
    fontsize=14,
    fontweight="bold",
)

# Tilt x-axis labels so they are readable.
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    ha="right",
    fontsize=10,
    fontweight="bold",
)

ax.set_yticklabels(
    ax.get_yticklabels(),
    rotation=0,
    fontsize=10,
    fontweight="bold",
)

plt.tight_layout()
plt.show()

## 7. Distribution of gold labels (speech, music, noise)

This section activates once gold labels are mapped onto segments. Until then, the cell below prints a message when no rows with `label_source == "gold"` exist.
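
Bit patterns such as `101` are compact but easy to misread. As a sketch of how they could be turned into readable names, assuming the column order speech, music, noise used in this notebook (`describe_pattern` is a hypothetical helper and the rows are toy data):

```python
import pandas as pd

LABEL_COLS = ["speech_label", "music_label", "noise_label"]


def describe_pattern(pattern: str) -> str:
    """Turn a pattern such as '101' into 'speech+noise' ('none' if all zeros)."""
    names = [
        col.replace("_label", "")
        for col, bit in zip(LABEL_COLS, pattern)
        if bit == "1"
    ]
    return "+".join(names) if names else "none"


# Toy gold labels; the values are made up.
toy_gold = pd.DataFrame(
    [[1, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]],
    columns=LABEL_COLS,
)

# Same pattern construction as in the notebook cell, then mapped to names.
patterns = toy_gold[LABEL_COLS].astype(int).astype(str).agg("".join, axis=1)
fractions = patterns.map(describe_pattern).value_counts(normalize=True)
print(fractions)
```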

In [None]:
# Check per class distributions of gold labels.

label_cols = ['speech_label', 'music_label', 'noise_label']

df_gold = df_acoustic[df_acoustic['label_source'] == 'gold'].copy()

if df_gold.empty:
    print("No gold labels found yet (label_source == 'gold'). "
          "Run the gold-to-segment mapping pipeline first.")
else:
    # Per-class prevalence
    print("Per-class prevalence among gold labels:")
    display(df_gold[label_cols].mean().to_frame('prevalence'))

    # Multi-label combo distribution
    combo = (
        df_gold[label_cols]
        .astype(int)
        .astype(str)
        .agg(''.join, axis=1)
    )
    print("\nMulti-label pattern distribution (e.g., 100 = speech only):")
    display(combo.value_counts(normalize=True).to_frame('fraction'))

## 8. Pairwise relationships between acoustic features

To capture simple nonlinear structure that correlations might miss, we look at selected 2D projections of the acoustic feature space (for example `rms_db` vs `snr_db`). This helps reveal distinct regimes (for example quiet speech, loud music) and potential overlapping regions.
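
As a small illustration of why correlations alone can mislead, the synthetic example below builds a perfectly deterministic but symmetric relationship whose Pearson correlation is essentially zero. The data is synthetic and unrelated to BLOCS; binned means stand in for what a 2D plot would show.

```python
import numpy as np

# A deterministic, U-shaped relationship on a symmetric range.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation only measures linear association, so it is near zero here.
pearson_r = float(np.corrcoef(x, y)[0, 1])

# Binned means of y across x expose the structure that the correlation hides.
edges = np.linspace(-1.0, 1.0, 5)  # four equal-width bins
bin_ids = np.clip(np.digitize(x, edges) - 1, 0, 3)
bin_means = [float(y[bin_ids == b].mean()) for b in range(4)]

print(f"pearson r = {pearson_r:.3g}")
print("mean of y per x-bin:", [round(m, 3) for m in bin_means])
```

The outer bins have clearly higher means than the inner ones, which is the kind of pattern the hexbin plots are meant to reveal.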

In [None]:
# Selected 2D acoustic feature relationships with stable hexbin plots.

import numpy as np

pair_specs = [
    ("rms_db", "snr_db"),
    ("rms_db", "silence_ratio"),
    ("zero_crossing_rate", "rms_db"),
]

# Collect all columns used in the pairs.
subset_cols = {c for pair in pair_specs for c in pair}
subset_cols = [c for c in subset_cols if c in df_acoustic.columns]

pair_sample = df_acoustic[subset_cols].copy()

# Replace infinities with NaN, then drop any NaNs.
pair_sample = pair_sample.replace([np.inf, -np.inf], np.nan).dropna()

# Subsample for readability if it is huge.
if len(pair_sample) > 20000:
    pair_sample = pair_sample.sample(20000, random_state=0)

fig, axes = plt.subplots(1, len(pair_specs), figsize=(5 * len(pair_specs), 4))

if len(pair_specs) == 1:
    axes = [axes]

for ax, (x_col, y_col) in zip(axes, pair_specs):
    if x_col not in pair_sample.columns or y_col not in pair_sample.columns:
        continue

    hb = ax.hexbin(
        pair_sample[x_col],
        pair_sample[y_col],
        gridsize=40,
        cmap="mako",
        mincnt=1,
    )
    style_generic_plot(
        ax,
        title=f"{y_col} vs {x_col}",
        xlabel=x_col,
        ylabel=y_col,
    )
    # Optional: colorbar per subplot
    cb = fig.colorbar(hb, ax=ax, shrink=0.75)
    cb.set_label("count")

plt.tight_layout()
plt.show()

## 9. PCA projection of acoustic features

We project the standardized acoustic features into a low-dimensional space using PCA. This provides a coarse view of how segments are arranged in feature space and whether there are obvious clusters or outliers.
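
For intuition about what the projection computes, here is a NumPy-only sketch of standardized PCA via the singular value decomposition. It is not the scikit-learn implementation used in the cell below, the data is synthetic, and it assumes no column is constant.

```python
import numpy as np


def pca_2d(X: np.ndarray):
    """Standardize columns, then project onto the top two principal components via SVD."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population std (ddof=0); assumes no constant column
    Z = (X - mean) / std
    # Rows of Vt are principal directions; squared singular values are
    # proportional to the variance captured along each direction.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    scores = Z @ Vt[:2].T
    return scores, explained_ratio[:2]


# Synthetic features: two strongly related columns plus one independent column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_toy = np.column_stack([a, 2.0 * a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

scores, ratio = pca_2d(X_toy)
print("scores shape:", scores.shape)
print("explained variance ratio (PC1, PC2):", np.round(ratio, 3))
```

Because two of the three toy columns carry almost the same information, PC1 absorbs roughly two thirds of the standardized variance.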

In [None]:
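def plot_pca_scatter(X_pca, explained_var):
    """Scatter the first two principal components on a dark background.

    `X_pca` is an (n, 2) array of PC scores; `explained_var` holds the
    explained variance of PC1 and PC2 in percent.
    """
    fig, ax = plt.subplots(figsize=(8, 6))

    # Dark background so the light scatter points stand out.
    ax.set_facecolor('#1e1e1e')
    fig.patch.set_facecolor('#1e1e1e')

    # Single color for all points.
    point_color = '#4db8ff'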

    ax.scatter(
        X_pca[:, 0],
        X_pca[:, 1],
        s=10,
        alpha=0.35,
        color=point_color,
        edgecolors='none'
    )

    ax.set_title(
        "PCA of acoustic features",
        fontsize=18,
        color='white',
        pad=12
    )
    ax.set_xlabel(
        f"PC1 ({explained_var[0]:.1f}% var)",
        fontsize=14,
        color='white'
    )
    ax.set_ylabel(
        f"PC2 ({explained_var[1]:.1f}% var)",
        fontsize=14,
        color='white'
    )

    ax.tick_params(axis='both', colors='white', labelsize=12)

    # White spines so the axes stay visible on the dark background.
    for spine in ax.spines.values():
        spine.set_edgecolor('white')
        spine.set_linewidth(1.2)

    plt.tight_layout()
    plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Reuse the core acoustic columns from earlier.
pca_features = [
    "rms_db",
    "silence_ratio",
    "zero_crossing_rate",
    "snr_db",
    "energy_variance",
]
pca_features = [c for c in pca_features if c in df_acoustic.columns]

X = df_acoustic[pca_features].copy()

# 1) Clean: remove inf / -inf / NaN.
X = X.replace([np.inf, -np.inf], np.nan).dropna()

# 2) Drop constant columns.
zero_var_cols = [c for c in X.columns if X[c].nunique() <= 1]
if zero_var_cols:
    print("Dropping zero-variance columns for PCA:", zero_var_cols)
    X = X.drop(columns=zero_var_cols)

if X.shape[1] < 2:
    raise ValueError(f"Need at least 2 non-constant numeric features, got {X.shape[1]}")

# 3) Standardize.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = np.asarray(X_scaled, dtype=np.float64)

# Final sanity check on the scaled matrix.
finite_mask = np.isfinite(X_scaled).all(axis=1)
if not finite_mask.all():
    dropped = (~finite_mask).sum()
    print(f"Dropping {dropped} rows with non-finite values after scaling.")
    X_scaled = X_scaled[finite_mask]
    # Keep the original index so rows can still be traced back to df_acoustic.
    X = X.loc[finite_mask]

print(
    "X_scaled shape:", X_scaled.shape,
    "| max abs:", float(np.nanmax(np.abs(X_scaled)))
)

# 4) Fit PCA and check components.
pca = PCA(n_components=2, random_state=0)
pca.fit(X_scaled)

assert np.isfinite(pca.components_).all(), "Non-finite value in PCA components."

# 5) Transform inside a local error-state context to silence spurious warnings.
with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
    X_pca = pca.transform(X_scaled)

df_pca = pd.DataFrame(X_pca, columns=["pc1", "pc2"])
expl_var = pca.explained_variance_ratio_ * 100

# Optional subsample for plotting (use same sample for k-means viz if you want).
plot_sample = df_pca
if len(plot_sample) > 8000:
    plot_sample = plot_sample.sample(8000, random_state=0)

# Use the styled helper defined in the cell above
plot_pca_scatter(plot_sample[["pc1", "pc2"]].to_numpy(), expl_var)

## 10. Unsupervised clustering of segments

As a simple check for natural acoustic regimes, we run k-means on the standardized acoustic features and visualize the clusters in the PCA space. We might expect roughly speech-like, music-like, and noise-like groups; clusters that do not align with that expectation may indicate interesting structure or domain shifts.
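
For readers who want to see what k-means does under the hood, the following is a NumPy sketch of Lloyd's algorithm on synthetic blobs. It is not scikit-learn's `KMeans`: the farthest-point initialization is a simple stand-in we chose so the toy example stays deterministic, and both function names are hypothetical.

```python
import numpy as np


def farthest_point_init(X, n_clusters, seed=0):
    """Pick one random point, then repeatedly add the point farthest from all chosen ones."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < n_clusters:
        # Squared distance from each point to its nearest already-chosen centroid.
        nearest = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[nearest.argmax()])
    return np.array(centroids, dtype=np.float64)


def kmeans_lloyd(X, n_clusters, n_iter=50, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    X = np.asarray(X, dtype=np.float64)
    centroids = farthest_point_init(X, n_clusters, seed)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):  # keep the old centroid if a cluster empties out
                updated[k] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids


# Three well-separated synthetic blobs of 40 points each.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, 0.0], [5.0, 0.0], [0.0, 8.0]])
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in centers])

labels, centroids = kmeans_lloyd(X_toy, n_clusters=3)
print("cluster sizes:", np.bincount(labels))
```

Real acoustic features rarely separate this cleanly, which is exactly why the cluster summary further down is worth reading critically.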

In [None]:
from sklearn.cluster import KMeans
import numpy as np

n_clusters = 3

# Extra sanity check (will raise if something is truly wrong)
assert np.isfinite(X_scaled).all(), "Non-finite values in X_scaled for k-means."

with np.errstate(divide="ignore", over="ignore", invalid="ignore"):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto")
    cluster_labels = kmeans.fit_predict(X_scaled)

df_pca_clusters = df_pca.copy()
df_pca_clusters["cluster"] = cluster_labels

plt.figure(figsize=(6, 5))
sns.scatterplot(
    data=df_pca_clusters,
    x="pc1",
    y="pc2",
    hue="cluster",
    palette="husl",
    s=12,
    alpha=0.6,
)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("k-means clusters in PCA space")
plt.legend(title="cluster", loc="best")
plt.tight_layout()
plt.show()

cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
display(cluster_counts.to_frame(name="count"))

In [None]:
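# Interpret the clusters.

# k-means ran only on the rows that survived NaN/inf cleaning, so align the labels
# with those rows through X's index instead of assigning them to all of df_acoustic.
# This assumes X still carries df_acoustic's original index.
summary_cols = ["rms_db", "silence_ratio", "zero_crossing_rate", "snr_db", "energy_variance"]
summary_cols = [c for c in summary_cols if c in df_acoustic.columns]

cluster_summary = (
    df_acoustic.loc[X.index, summary_cols]
    .assign(cluster=cluster_labels)
    .groupby("cluster")
    .agg(["mean", "std"])
)
display(cluster_summary)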

### Summarize the acoustic results from the clusters to check your understanding.

+ `Cluster 0`:
+ `Cluster 1`:
+ `Cluster 2`:

## 11. Waveform and spectrogram views for individual segments

Finally, we visualize raw audio for selected segments as both waveforms and log-mel spectrograms. This helps connect acoustic statistics (for example `rms_db`, `silence_ratio`, `snr_db`) with the underlying time–frequency patterns that SMAD models will see.
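
The spectrogram's time axis and dB scale follow from simple arithmetic. The sketch below works through it in plain Python, assuming the transform's default centered framing (`center=True`) and the 16 kHz / 160-sample hop defaults used in this notebook. The 2-second segment length is hypothetical and both helper names are ours.

```python
import math


def n_frames_centered(n_samples: int, hop_length: int) -> int:
    """Frame count for a centered STFT: one frame per hop, plus one."""
    return 1 + n_samples // hop_length


def power_to_db(power: float, amin: float = 1e-10) -> float:
    """10 * log10 of a power value, clamped below at `amin` to avoid log(0)."""
    return 10.0 * math.log10(max(power, amin))


sample_rate = 16_000
hop_length = 160
n_samples = 2 * sample_rate  # a hypothetical 2-second segment

frames = n_frames_centered(n_samples, hop_length)
frame_step_s = hop_length / sample_rate  # seconds between consecutive frames

print("frames:", frames)
print("frame step (s):", frame_step_s)
print("dB for power 1.0:", power_to_db(1.0))
print("dB floor for power 0.0:", power_to_db(0.0))
```

With a 160-sample hop at 16 kHz, each spectrogram column covers 10 ms, and the `amin` clamp sets the lowest value the dB scale can reach.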

In [None]:
import torch
import torchaudio

def plot_waveform_and_spectrogram(
    audio_path: Path,
    sample_rate: int = 16_000,
    n_fft: int = 512,
    hop_length: int = 160,
    n_mels: int = 64,
) -> None:
    """Plot waveform and log-mel spectrogram for a single segment.

    The waveform is plotted with symmetric y-limits for easier comparison
    across segments. The log-mel spectrogram uses time in seconds on the
    x-axis instead of frame index.
    """
    waveform, sr = load_waveform(audio_path)
    sr = sr or sample_rate
    waveform = waveform.squeeze()

    # Waveform time axis
    t = torch.arange(waveform.shape[-1]) / sr

    fig, axes = plt.subplots(2, 1, figsize=(8, 4), sharex=False)

    # Waveform
    # Fall back to 1.0 for all-zero audio so the y-limits are not degenerate.
    max_abs = float(waveform.abs().max()) or 1.0
    axes[0].plot(t.numpy(), waveform.numpy(), linewidth=1.0)
    axes[0].set_ylim(-1.05 * max_abs, 1.05 * max_abs)
    axes[0].set_title(f"Waveform: {audio_path.name}")
    axes[0].set_xlabel("time (s)")
    axes[0].set_ylabel("amplitude")

    # Mel spectrogram
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr,
        n_fft=n_fft,
        hop_length=hop_length,
        n_mels=n_mels,
    )
    mel_spec = mel_transform(waveform)
    mel_db = torchaudio.functional.amplitude_to_DB(
        mel_spec, multiplier=10.0, amin=1e-10, db_multiplier=0.0
    )

    mel_db_np = mel_db.numpy()

    # Time axis in seconds for spectrogram
    spec_frames = mel_db_np.shape[-1]
    spec_duration = spec_frames * hop_length / sr
    extent = [0.0, spec_duration, 0, n_mels]

    im = axes[1].imshow(
        mel_db_np,
        origin="lower",
        aspect="auto",
        interpolation="nearest",
        extent=extent,
        cmap="magma",
    )
    axes[1].set_title("Log-mel spectrogram")
    axes[1].set_xlabel("time (s)")
    axes[1].set_ylabel("mel bin")
    fig.colorbar(im, ax=axes[1], shrink=0.8, label="dB")

    plt.tight_layout()
    plt.show()


def visualize_random_segment(df, query: str | None = None):
    subset = df.query(query) if query else df
    if subset.empty:
        print("No rows matched the query.")
        return

    row = subset.sample(1).iloc[0]
    audio_path = _resolve_segment_path(row)

    # Metadata block header
    # print(f"\n=== Segment metadata: {audio_path.name} ===")

    # Show as a 1-row DataFrame (nice tabular view)
    # row_df = row.to_frame().T
    # display(row_df)

    # Plot
    plot_waveform_and_spectrogram(audio_path)

    # Print text row for the record.
    print()
    print(row.to_string())

In [None]:
# Example usages:

# Any random segment
visualize_random_segment(df_acoustic)

# A segment with high silence_ratio (if that column exists)
# visualize_random_segment(df_acoustic, query="silence_ratio > 0.7")

## 12. Summary and next steps

In this notebook we:

- Loaded the BLOCS SMAD segment and acoustic manifests and verified that the segmentation looks sane.
- Inspected acoustic QC flags (too quiet, mostly silence, heavily clipped, too short, had error) and their prevalence.
- Explored distributions of core acoustic features and listened to random segments to connect the statistics to what we hear.
- Examined correlations, 2D relationships, PCA structure, and simple k-means clusters to see how segments are arranged in acoustic space.
- Added a placeholder view of gold label distributions that will activate once `label_source == "gold"` rows are available.

Once you are comfortable with the acoustic and segmentation behavior, the recommended next step is:

- Open `02_ast_teacher_sanity.ipynb` to analyze how AST, Whisper-AT, and CLAP-family teachers behave on BLOCS segments, compare them against gold labels (when available), and calibrate confidence thresholds for pseudo-labeling.