# W2V-BERT Embedding Demo

This notebook shows how to load the packaged eager `w2vbert_speaker` model and compute a speaker embedding for a short audio sample.

By convention, this repository expects artifacts and data in the following directories, given relative to the repository root, so the notebook never has to guess paths:

- Eager artifacts: `../pretrained/w2vbert_speaker_eager`
- Scripted artifacts: `../pretrained/w2vbert_speaker_scripted`
- Dataset: `../datasets/voxceleb1test`

If you place artifacts elsewhere, update the corresponding variable (`EAGER_DIR`, `SCRIPTED_DIR`, or `DATASET_DIR`). This notebook itself sets only `EAGER_DIR` and `DATASET_DIR`, in the cells that load the model and the audio. See `ARTIFACTS.md` for details and download links.
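
The directory convention above can be checked before any model code runs. The following is a minimal sketch, not part of the repository: `resolve_artifact_dirs` and `missing_dirs` are hypothetical helper names, and the sketch assumes the `../` paths are taken relative to the repository root, as the later cells do with `REPO_ROOT.parent`.

```python
from pathlib import Path


def resolve_artifact_dirs(repo_root: Path) -> dict:
    """Map each conventional variable name to its expected directory.

    Hypothetical helper: the directories sit one level above the repository root.
    """
    base = repo_root.parent
    return {
        "EAGER_DIR": (base / "pretrained" / "w2vbert_speaker_eager").resolve(),
        "SCRIPTED_DIR": (base / "pretrained" / "w2vbert_speaker_scripted").resolve(),
        "DATASET_DIR": (base / "datasets" / "voxceleb1test").resolve(),
    }


def missing_dirs(dirs: dict) -> list:
    """Return the names whose directories do not exist on disk."""
    return [name for name, path in dirs.items() if not path.is_dir()]


# Path.cwd() is only a stand-in here; in the notebook, pass REPO_ROOT instead.
dirs = resolve_artifact_dirs(Path.cwd())
for name in missing_dirs(dirs):
    print(f"{name} not found at {dirs[name]}")
```

Running a check like this first turns a missing download into a clear message instead of a failure deep inside model loading.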


In [1]:
from pathlib import Path
import sys

def find_repo_root(start: Path) -> Path:
    """Walk upward from `start` until a directory containing both `recipes/` and `deeplab/` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / "recipes").exists() and (candidate / "deeplab").exists():
            return candidate
    raise RuntimeError("Unable to locate the repository root.")

NOTEBOOK_DIR = Path.cwd()
REPO_ROOT = find_repo_root(NOTEBOOK_DIR)

SRC_PATHS = [
    REPO_ROOT,
    REPO_ROOT / "recipes/DeepASV",
    REPO_ROOT / "deeplab/pretrained/audio2vector/module/transformers/src",
]

for candidate in SRC_PATHS:
    resolved = str(candidate)
    if candidate.exists() and resolved not in sys.path:
        sys.path.append(resolved)

print(f"Repository root: {REPO_ROOT}")

Repository root: /Users/zb/NWG/w2v-BERT-2.0_SV


In [None]:
import torch

from recipes.DeepASV.utils.inference import W2VBERT_SPK_Module

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Standardized eager artifacts dir
EAGER_DIR = (REPO_ROOT.parent / "pretrained" / "w2vbert_speaker_eager").resolve()
embedding_model = W2VBERT_SPK_Module(device=device, model_path=str(EAGER_DIR)).load_model()

Using device: cpu
      spk_model: <All weights matched>
      spk_model: <All weights matched>


In [None]:
import librosa
import soundfile as sf
import torch

# Use the canonical dataset dir
DATASET_DIR = (REPO_ROOT.parent / "datasets" / "voxceleb1test").resolve()

target_audio = DATASET_DIR / "wav" / "id10270" / "5r0dWxy17C8" / "00001.wav"

if not target_audio.exists():
    # Fall back to a dataset copy stored inside the repository itself
    alt_path = REPO_ROOT / "datasets/voxceleb1test/wav/id10270/5r0dWxy17C8/00001.wav"
    if alt_path.exists():
        target_audio = alt_path
    else:
        raise FileNotFoundError(f"Audio file not found at {target_audio} or {alt_path}")

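# soundfile returns an array of shape (frames,) or (frames, channels)
signal, sr = sf.read(str(target_audio), dtype="float32")
if signal.ndim > 1:
    # Down-mix multi-channel audio to mono
    signal = signal.mean(axis=1)

# Resample to the rate the model expects (16 kHz if hparams does not specify one)
target_sr = embedding_model.hparams.get("sample_rate", 16000)
if sr != target_sr:
    signal = librosa.resample(signal, orig_sr=sr, target_sr=target_sr)
    sr = target_sr

# Add a batch dimension: (1, num_samples)
waveform = torch.from_numpy(signal).unsqueeze(0).to(torch.float32)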

print(f"Loaded waveform shape: {waveform.shape}, sample rate: {sr}")

Loaded waveform shape: torch.Size([1, 133761]), sample rate: 16000


In [4]:
# Inference only, so skip gradient tracking
with torch.no_grad():
    embeddings = embedding_model(waveform)
embedding_vector = embeddings.squeeze(0).detach().cpu().numpy()

print(f"Embedding shape: {embedding_vector.shape}")
embedding_vector[:8]

Embedding shape: (256,)


array([-0.16015907, -0.8298738 ,  0.70560724, -0.14280552,  0.35778913,
        0.47867072, -0.12521656, -0.90179497], dtype=float32)
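
A single embedding is mostly useful when compared against another one. The sketch below shows cosine similarity, a common way to score two speaker embeddings. It is an illustration only: this notebook does not state which scoring method the repository uses, `cosine_similarity` is a hypothetical helper, and the two random vectors are stand-ins for real 256-dimensional `embedding_vector` arrays.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    # Guard against division by zero for all-zero vectors
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(np.dot(a, b) / denom)


# Stand-in vectors (assumption): replace with embeddings of two real utterances
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(256).astype(np.float32)
emb_b = rng.standard_normal(256).astype(np.float32)

print(f"Cosine similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```

Higher scores suggest the two utterances are more likely to come from the same speaker. Any decision threshold would have to be tuned on held-out trial pairs, and none is assumed here.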