# ASR Commands (Keyword Spotting) — Hugging Face Baseline

This notebook implements a complete deep learning pipeline for keyword spotting (speech command classification) using TensorFlow's mini_speech_commands dataset. It is the reference implementation for the `asr_commands` pipeline in this repository.

**Purpose:**
- Train a model to classify short audio clips into spoken command words.
- Demonstrate data ingestion, preprocessing, model training, evaluation, and artifact export using Hugging Face Transformers and the mini_speech_commands dataset.

**Workflow Overview:**
1. **Repository Setup & Data Ingestion:**
   - Ensures the notebook is running in the correct project directory.
   - Downloads and extracts the mini_speech_commands dataset into `.cache/asr_commands/` using the provided ingestion script.
2. **Environment Preparation:**
   - Installs required Python packages and system dependencies for audio processing and model training.
3. **Data Loading & Preprocessing:**
   - Loads audio files and labels from the dataset.
   - Splits data into train/validation/test sets.
   - Prepares data for Hugging Face audio classification models.
4. **Model Training:**
   - Fine-tunes a pretrained audio classification model (e.g., HuBERT) using the Hugging Face `Trainer` API.
   - Saves model checkpoints and training logs.
5. **Evaluation & Export:**
   - Evaluates the model on the test set.
   - Computes accuracy, macro F1, and confusion matrix.
   - Saves metrics and artifacts to the `outputs/asr_commands/` directory.

**Inputs:**
- mini_speech_commands dataset (downloaded to `.cache/asr_commands/raw/mini_speech_commands/`).
- Configuration and utility code from the repository.

**Outputs:**
- Trained model checkpoints (Hugging Face format).
- Evaluation metrics (JSON, confusion matrix).
- Optional zipped artifacts for download.

**Assumptions:**
- The notebook is run in a Colab or local environment with internet access.
- All code and data paths are relative to the repository root.

**Relevant Files & Directories:**
- `data_ingestion/asr_commands/run.py`: Ingestion script for downloading the dataset.
- `.cache/asr_commands/`: Directory for cached raw data.
- `outputs/asr_commands/`: Directory for model outputs and metrics.
- `utils/paths.py`: Utility for resolving cache paths.

---


# ASR Commands (Keyword Spotting) — Production-grade HF baseline

This notebook trains a **keyword spotting / speech command classification** model on TensorFlow's **mini_speech_commands** dataset.

## Design goals
- Minimal custom modeling code (rely on mature ecosystem).
- Deterministic splits + reproducibility.
- Clean artifact export (model + processor + labels + metrics).

## Architecture
1) `data_ingestion/asr_commands` downloads+verifies dataset into `.cache/`
2) Dataset is loaded as `{audio, label}` with `datasets.Audio(sampling_rate=16_000)`
3) Model is fine-tuned via `transformers.Trainer` using a pretrained speech encoder (`HuBERT` / `Wav2Vec2`)
4) Evaluation exports accuracy, macro-F1, and confusion matrix into `outputs/asr_commands/`

In [None]:
# FULL RESET
!pip uninstall -y torch torchvision torchaudio torchcodec transformers datasets evaluate accelerate scikit-learn soundfile fastai peft sentence-transformers timm

# Install PyTorch stack (CUDA 12.1, stable)
!pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 \
  --index-url https://download.pytorch.org/whl/cu121

# System deps
!apt-get update -qq
!apt-get install -y ffmpeg libsndfile1

# Install compatible HF stack (PINNED)
%pip install \
  torchcodec==0.2.0 \
  transformers==4.41.2 \
  datasets==2.19.1 \
  evaluate==0.4.2 \
  accelerate==0.29.3 \
  scikit-learn==1.4.2 \
  soundfile==0.12.1


## 1. Environment Preparation
This section installs the Python packages and system dependencies required for audio processing and model training. PyTorch is installed from the CUDA 12.1 wheel index, and the Hugging Face libraries are pinned to fixed versions so that the environment is reproducible.

- **Inputs:** None (installs packages as needed).
- **Outputs:** Required packages and system dependencies available in the environment.
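
After the install step it can help to confirm that the pinned versions are the ones actually importable. The following is a minimal sketch, not part of the repository: `installed_version` and `check_pins` are hypothetical helper names, and the `PINNED` mapping simply mirrors a subset of the pins from the install cell.

```python
from __future__ import annotations

from importlib import metadata

# Subset of the versions pinned in the install cell (copied here for illustration).
PINNED = {
    "torch": "2.3.0",
    "torchaudio": "2.3.0",
    "transformers": "4.41.2",
    "datasets": "2.19.1",
    "evaluate": "0.4.2",
    "accelerate": "0.29.3",
}


def installed_version(package: str) -> str | None:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: dict[str, str]) -> dict[str, tuple[str, str | None]]:
    """Map each mismatching package to an (expected, found) pair."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        # Local build tags such as "+cu121" are ignored when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `check_pins(PINNED)` after the runtime restart returns an empty dictionary when every listed package matches its pin; any entry it does return names a package worth reinstalling before training.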


In [None]:
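import os

# Hard-restart the kernel so the freshly installed packages are picked up (Colab idiom).
# Run the remaining cells only after the runtime has restarted.
os.kill(os.getpid(), 9)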

In [None]:
# Colab convenience: clone repo if needed
from pathlib import Path

if not Path("pyproject.toml").exists():
    if not Path("pjatk_zum").exists():
        !git clone https://github.com/beep1000101/pjatk_zum.git
    else:
        print("Repo folder already present: pjatk_zum")
else:
    print("Already in repo root (pyproject.toml found)")

# Colab convenience: cd into repo folder if we cloned it

if Path("pyproject.toml").exists():
    print("Already in repo root")
elif Path("pjatk_zum").exists():
    %cd pjatk_zum
else:
    raise FileNotFoundError("Could not find repo root (pyproject.toml) or ./pjatk_zum")

## 2. Repository Setup & Data Ingestion
This section ensures the notebook is running in the correct project directory. If the repository is not present, it is cloned from GitHub. The mini_speech_commands dataset is then downloaded and extracted into `.cache/asr_commands/` using the ingestion script (`data_ingestion/asr_commands/run.py`).

- **Inputs:** None (repository and dataset are fetched if missing).
- **Outputs:** `.cache/asr_commands/raw/mini_speech_commands/` directory with extracted audio data.
- **Assumptions:** Internet access is available for cloning and downloading.
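
A quick way to confirm that ingestion produced the expected layout (one sub-directory per label, each holding WAV files) is to count the files per directory. This is a minimal sketch under that layout assumption; `count_wavs_per_label` is a hypothetical helper, not something the ingestion script provides.

```python
from __future__ import annotations

from pathlib import Path


def count_wavs_per_label(dataset_root: Path) -> dict[str, int]:
    """Count *.wav files in each immediate sub-directory of `dataset_root`."""
    return {
        label_dir.name: sum(1 for _ in label_dir.glob("*.wav"))
        for label_dir in sorted(dataset_root.iterdir())
        if label_dir.is_dir()  # skip stray files such as a README
    }
```

For example, `count_wavs_per_label(Path(".cache/asr_commands/raw/mini_speech_commands"))` should list one entry per command word. An empty dictionary or a zero count suggests that the download or extraction did not complete.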


In [None]:
# Ingest mini_speech_commands into the repo cache
!python data_ingestion/asr_commands/run.py

In [1]:
from __future__ import annotations

from pathlib import Path
import json
import random
import sys

import numpy as np
import pandas as pd

import torch

from datasets import Audio, Dataset, DatasetDict
import evaluate
from transformers import (
    AutoModelForAudioClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
    EarlyStoppingCallback,
)


def find_repo_root(start: Path) -> Path:
    cur = start.resolve()
    for p in [cur, *cur.parents]:
        if (p / "data_ingestion").exists() and (p / "utils").exists():
            return p
    return cur


ROOT = find_repo_root(Path())
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

from utils.paths import CACHE_PATH  # noqa: E402

# Reproducibility
SEED = 42
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

CACHE = CACHE_PATH
PIPELINE_CACHE = CACHE / "asr_commands"
RAW_DIR = PIPELINE_CACHE / "raw"
OUTPUTS_DIR = ROOT / "outputs" / "asr_commands"
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

## 3. Data Loading & Preprocessing
This section loads the audio files and their labels from the extracted dataset, builds a DataFrame, and splits the data into training, validation, and test sets. The splits are stratified and deterministic for reproducibility. The data is then prepared for use with Hugging Face audio classification models.

- **Inputs:** `.cache/asr_commands/raw/mini_speech_commands/` directory containing WAV files organized by label.
- **Outputs:** Train/validation/test splits and Hugging Face `DatasetDict` objects.
- **Assumptions:** The dataset is present in the expected cache location.
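
To make the idea of a deterministic stratified 80/10/10 split concrete, here is a standard-library sketch. It is not the method the notebook uses (the notebook calls scikit-learn's `train_test_split` twice), and the indices it produces will differ from scikit-learn's; `stratified_split` is a hypothetical name chosen for illustration.

```python
from __future__ import annotations

import random
from collections import defaultdict


def stratified_split(labels, seed=42, fractions=(0.8, 0.1, 0.1)):
    """Return (train, val, test) index lists that preserve per-label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # local RNG: the same seed always gives the same split
    train, val, test = [], [], []
    for label in sorted(by_label):  # fixed iteration order keeps the result deterministic
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

Because each label is shuffled and sliced separately, every split receives the same share of each class. With eight balanced classes of 1000 clips, that gives 800/100/100 clips per class.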


In [2]:
# Ensure dataset is present in cache (runs ingestion if needed)
mini_root = RAW_DIR / "mini_speech_commands"
if not mini_root.exists():
    print("mini_speech_commands not found in cache; running ingestion...")
    !python data_ingestion/asr_commands/run.py

assert mini_root.exists(), f"Missing: {mini_root}"

labels_path = PIPELINE_CACHE / "labels.json"
if labels_path.exists():
    labels = json.loads(labels_path.read_text(encoding="utf-8"))["labels"]
else:
    # Fallback: infer from directories
    labels = sorted([p.name for p in mini_root.iterdir() if p.is_dir()])

label2id = {lbl: i for i, lbl in enumerate(labels)}
id2label = {i: lbl for lbl, i in label2id.items()}

labels, len(labels)

['down', 'go', 'left', 'no', 'right', 'stop', 'up', 'yes']

In [3]:
# Build an index of audio files -> labels
rows = []
for lbl in labels:
    for wav_path in sorted((mini_root / lbl).glob("*.wav")):
        rows.append({"audio": str(wav_path), "label_str": lbl, "label": int(label2id[lbl])})

df = pd.DataFrame(rows)
df["label_str"].value_counts()

label
down     1000
go       1000
left     1000
no       1000
right    1000
stop     1000
up       1000
yes      1000
Name: count, dtype: int64

In [5]:
# Deterministic stratified split 80/10/10
# NOTE: We use sklearn here to avoid known NumPy 2.x incompatibilities in some `datasets` split paths.
from sklearn.model_selection import train_test_split

train_df, temp_df = train_test_split(
    df,
    test_size=0.2,
    random_state=SEED,
    stratify=df["label"],
)

val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    random_state=SEED,
    stratify=temp_df["label"],
)

ds = DatasetDict({
    "train": Dataset.from_pandas(train_df.reset_index(drop=True), preserve_index=False),
    "val": Dataset.from_pandas(val_df.reset_index(drop=True), preserve_index=False),
    "test": Dataset.from_pandas(test_df.reset_index(drop=True), preserve_index=False),
})

{k: len(v) for k, v in ds.items()}

{'train': 6400, 'val': 800, 'test': 800}

In [6]:
# Persist splits for reuse (stable artifact for later stage-first scripts)
(OUTPUTS_DIR / "preprocessing").mkdir(parents=True, exist_ok=True)
splits_path = OUTPUTS_DIR / "preprocessing" / "splits.json"

def _to_records(dset: Dataset) -> list[dict]:
    # Keep the same schema used elsewhere in the repo
    return [{"path": ex["audio"], "label": id2label[int(ex["label"])]} for ex in dset]

payload = {
    "labels": labels,
    "label2id": label2id,
    "splits": {
        "train": _to_records(ds["train"]),
        "val": _to_records(ds["val"]),
        "test": _to_records(ds["test"]),
    },
}
splits_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

splits_path

PosixPath('/home/mateusz/dev/pjatk_zum/outputs/asr_commands/preprocessing/splits.json')

In [7]:
# HF audio pipeline: decode/resample -> feature_extractor -> padded batches
# Model choice: HuBERT encoder pretrained + KWS-oriented finetuning (SUPERB)
MODEL_NAME = "superb/hubert-base-superb-ks"

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

# Decode audio lazily; `datasets` uses soundfile under the hood
ds_audio = ds.cast_column("audio", Audio(sampling_rate=16_000))

# NOTE: despite its name, this runs per example (`map` is called without `batched=True`).
def preprocess_batch(batch):
    audio = batch["audio"]
    out = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["input_values"] = out["input_values"][0]
    if "attention_mask" in out:
        batch["attention_mask"] = out["attention_mask"][0]
    return batch

ds_proc = ds_audio.map(preprocess_batch, remove_columns=["audio"])

# transformers expects `tokenizer=` here (works for feature extractors too)
data_collator = DataCollatorWithPadding(tokenizer=feature_extractor, padding=True)

ds_proc["train"][0].keys()

(6400, 800, 800)

## 4. Model Training
This section fine-tunes a pretrained audio classification model (e.g., HuBERT) using the Hugging Face `Trainer` API. The model is trained on the processed training set and evaluated on the validation set. Training arguments, such as batch size, learning rate, and number of epochs, are specified here.

- **Inputs:** Processed training and validation datasets.
- **Outputs:** Trained model checkpoints and training logs saved to `outputs/asr_commands/`.
- **Assumptions:** Sufficient compute resources are available for training (CPU or GPU).
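
The training arguments imply a specific schedule length, which is useful when reading the logs. The sketch below works through that arithmetic for the values used in this notebook (6400 training clips, batch size 16, 12 epochs, `warmup_ratio=0.1`). It assumes a single device and that the warmup length is the rounded-up product of total steps and the ratio; `schedule_summary` is a hypothetical helper.

```python
import math


def schedule_summary(num_examples, batch_size, epochs, warmup_ratio, grad_accum=1):
    """Estimate optimizer steps for a single-device run without early stopping."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer step is taken per `grad_accum` batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


print(schedule_summary(num_examples=6400, batch_size=16, epochs=12, warmup_ratio=0.1))
```

For this configuration the result is 400 steps per epoch and at most 4800 steps overall, with the first 480 used for learning-rate warmup. Early stopping with a patience of 3 can end training before the upper bound is reached.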


In [8]:
# Trainer setup + fine-tuning
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    # Correctly access predictions and label_ids from the EvalPrediction object
    logits = eval_pred.predictions
    labels_np = eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=preds, references=labels_np)["accuracy"]
    f1 = f1_metric.compute(predictions=preds, references=labels_np, average="macro")["f1"]
    return {"accuracy": float(acc), "f1_macro": float(f1)}

model = AutoModelForAudioClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,
)

# Stability on small datasets: freeze encoder features (optional but recommended)
if hasattr(model, "freeze_feature_encoder"):
    model.freeze_feature_encoder()

run_dir = OUTPUTS_DIR / "hf_training"
run_dir.mkdir(parents=True, exist_ok=True)

training_args = TrainingArguments(
    output_dir=str(run_dir),
    seed=SEED,
    data_seed=SEED,
    learning_rate=1e-4,
    warmup_ratio=0.1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=1,
    num_train_epochs=12,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_proc["train"],
    eval_dataset=ds_proc["val"],
    data_collator=data_collator,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

train_out = trainer.train()
train_out

ImportError: TorchCodec is required for load_with_torchcodec. Please install torchcodec to use this function.

In [None]:
# Save best model + feature extractor + labels (portable inference package)
best_dir = OUTPUTS_DIR / "best_hf"
best_dir.mkdir(parents=True, exist_ok=True)

trainer.save_model(str(best_dir))
feature_extractor.save_pretrained(str(best_dir))
(best_dir / "labels.json").write_text(json.dumps({"labels": labels, "label2id": label2id}, indent=2), encoding="utf-8")

best_dir

In [None]:
# Export training log history (JSON)
train_metrics_path = OUTPUTS_DIR / "training" / "metrics.json"
train_metrics_path.parent.mkdir(parents=True, exist_ok=True)

history = trainer.state.log_history
train_metrics_path.write_text(
    json.dumps({"pipeline": "asr_commands", "framework": "hf_trainer", "history": history}, indent=2),
    encoding="utf-8",
)

train_metrics_path

In [None]:
# Evaluate on test split
test_metrics = trainer.evaluate(ds_proc["test"])
test_metrics

### Test Evaluation
The cell above evaluates the trained model on the held-out test split and reports accuracy and macro F1. The best checkpoint is selected during training from validation accuracy (`metric_for_best_model="accuracy"`), so these test metrics serve only as the final performance estimate.


In [None]:
# Confusion matrix + full metrics export
from sklearn.metrics import confusion_matrix, classification_report

pred_out = trainer.predict(ds_proc["test"])
logits = pred_out.predictions
# `predictions` is usually a single array; unwrap it only if the model returned a tuple
if isinstance(logits, tuple):
    logits = logits[0]
y_true = pred_out.label_ids
y_pred = np.argmax(logits, axis=-1)

cm = confusion_matrix(y_true, y_pred, labels=list(range(len(labels))))
report = classification_report(
    y_true,
    y_pred,
    labels=list(range(len(labels))),
    target_names=labels,
    output_dict=True,
    zero_division=0,
)

eval_metrics = {
    "pipeline": "asr_commands",
    "num_test": int(len(y_true)),
    "accuracy": float(test_metrics.get("eval_accuracy", test_metrics.get("accuracy", 0.0))),
    "f1_macro": float(test_metrics.get("eval_f1_macro", test_metrics.get("f1_macro", 0.0))),
    "labels": labels,
    "label2id": label2id,
    "confusion_matrix": cm.tolist(),
    "classification_report": report,
}

eval_dir = OUTPUTS_DIR / "evaluation"
eval_dir.mkdir(parents=True, exist_ok=True)
eval_metrics_path = eval_dir / "metrics.json"
eval_metrics_path.write_text(json.dumps(eval_metrics, indent=2), encoding="utf-8")

eval_metrics_path

### Confusion Matrix and Metrics Export
This cell computes the confusion matrix and a detailed classification report for the test set. The results, along with accuracy and macro F1, are saved to a JSON file in the evaluation directory for reproducibility and further analysis.
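
Because the exported JSON contains the raw confusion matrix, accuracy and macro F1 can be recomputed from it later without rerunning the model. The following is a minimal pure-Python sketch, assuming the scikit-learn convention that rows are true classes and columns are predicted classes; `metrics_from_confusion` is a hypothetical helper, not part of the repository.

```python
def metrics_from_confusion(cm):
    """Compute accuracy and macro F1 from a square confusion matrix.

    cm[i][j] is the number of examples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))

    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted something else
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted examples contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {
        "accuracy": correct / total if total else 0.0,
        "f1_macro": sum(f1_scores) / n if n else 0.0,
    }
```

Macro F1 is the unweighted mean of the per-class F1 scores, so on a balanced dataset such as this one it tracks accuracy closely, and a large gap between the two usually points to one or two poorly recognised classes.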


In [None]:
# (Optional) Zip outputs for sharing (e.g., Colab -> download)
!zip -r asr_commands_outputs.zip outputs/asr_commands

In [None]:
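# --- Evaluation: Run after training ---
# NOTE: `helpers` is not defined in this notebook and must be importable from the working directory.
# Its `compute_metrics` shadows the Trainer metric function defined earlier, and this cell
# overwrites the `evaluation/metrics.json` file written by the export cell above.
from helpers import run_inference, compute_metrics, load_test_dataset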
import json
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torch
from pathlib import Path

# Load best model and feature extractor
best_dir = Path(OUTPUTS_DIR) / "best_hf"
model = AutoModelForAudioClassification.from_pretrained(str(best_dir))
feature_extractor = AutoFeatureExtractor.from_pretrained(str(best_dir))
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load test set
test_ds, labels = load_test_dataset()

# Preprocess test set
from transformers import DataCollatorWithPadding
collator = DataCollatorWithPadding(tokenizer=feature_extractor, padding=True)
def preprocess(batch):
    audio = batch["audio"]
    out = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["input_values"] = out["input_values"][0]
    return batch
test_ds_proc = test_ds.map(preprocess)

# Run inference and compute metrics
all_preds, all_labels = run_inference(model, feature_extractor, device, test_ds_proc)
metrics = compute_metrics(all_labels, all_preds, labels)

# Save metrics
(eval_dir := OUTPUTS_DIR / "evaluation").mkdir(parents=True, exist_ok=True)
with open(eval_dir / 'metrics.json', 'w', encoding='utf-8') as f:
    json.dump(metrics, f, indent=2)
metrics

---

## Notes and Limitations
- This notebook is the canonical implementation for the ASR commands (keyword spotting) pipeline in this repository.
- All data and model artifacts are stored in the `.cache/` and `outputs/` directories, respectively.
- If any code or data step is unclear, incomplete, or fails, please refer to the repository's README or the code comments for further clarification.
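- Apart from the ingestion script (`data_ingestion/asr_commands/run.py`) and the `helpers` module imported by the final evaluation cell, no additional scripts or CLI entrypoints exist for this pipeline; all other logic is contained in this notebook.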
