# Minimal Scripted Inference

This notebook shows the minimal steps to compute a speaker embedding using the preprocessed TorchScript artifact (`w2vbert_speaker_script_preprocessed.pt`) and the saved Hugging Face feature extractor. Before running it, download the scripted artifacts (see `ARTIFACTS.md`) and place them in:

- `../pretrained/w2vbert_speaker_scripted`

If you keep the scripted artifacts elsewhere, update the `SCRIPTED_DIR` variable in the code cell near the top of the notebook. `ARTIFACTS.md` contains the download link and examples for invoking the scripted package.
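
As a minimal sketch of keeping the artifacts elsewhere without editing the code cell each time, the directory could be read from an environment variable and fall back to the default location listed above. The variable name `W2VBERT_SCRIPTED_DIR` and both helper functions are hypothetical; nothing in the package or `ARTIFACTS.md` defines them.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; the package does not define it.
ENV_VAR = "W2VBERT_SCRIPTED_DIR"
# Default location from this notebook, relative to the working directory.
DEFAULT_DIR = Path("..") / "pretrained" / "w2vbert_speaker_scripted"
# Artifact file name used throughout this notebook.
ARTIFACT_NAME = "w2vbert_speaker_script_preprocessed.pt"


def resolve_scripted_dir(default: Path = DEFAULT_DIR) -> Path:
    """Return the artifact directory, preferring the environment override."""
    override = os.environ.get(ENV_VAR)
    if override:
        return Path(override).expanduser().resolve()
    return default.resolve()


def artifact_path(scripted_dir: Path) -> Path:
    """Return the expected path of the preprocessed TorchScript file."""
    return scripted_dir / ARTIFACT_NAME
```

With this in place, `SCRIPTED_DIR = resolve_scripted_dir()` would pick up the override when it is set and behave like the default otherwise.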


## Prerequisites

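- A Python virtual environment with the `w2vbert_speaker_scripted` package installed, along with `torch`, `soundfile`, and `librosa`, which the code cells import.
- The `w2vbert_speaker_scripted` artifacts, downloaded from [this Google Drive folder](https://drive.google.com/drive/folders/1bz_GmREFNNTuIPrsAEJsyAp7CFwQ_vIz?usp=sharing).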



In [None]:
from pathlib import Path
import torch
import soundfile as sf
import librosa

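# REPO_ROOT was not defined in the original cell. Assumption: the notebook runs
# from its own directory, so REPO_ROOT.parent / 'pretrained' matches the
# `../pretrained` path given above. Adjust if your layout differs.
REPO_ROOT = Path.cwd()

# Standardized scripted artifact dir and dataset dir
SCRIPTED_DIR = (REPO_ROOT.parent / 'pretrained' / 'w2vbert_speaker_scripted').resolve()
DATASET_DIR = (REPO_ROOT.parent / 'datasets' / 'voxceleb1test').resolve()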

SCRIPTED_PREPROCESSED = SCRIPTED_DIR / 'w2vbert_speaker_script_preprocessed.pt'
# Example audio
AUDIO_PATH = (DATASET_DIR / 'wav' / 'id10270' / '5r0dWxy17C8' / '00001.wav')
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print('Scripted model exists:', SCRIPTED_PREPROCESSED.exists())
print('Audio exists:', AUDIO_PATH.exists())
print('Using device:', DEVICE)


Scripted model exists: True
Audio exists: True
Using device: cpu


In [13]:
# Instantiate the packaged scripted wrapper which bundles the HF extractor
from w2vbert_speaker_scripted import W2VBERT_SPK_Scripted

if not SCRIPTED_PREPROCESSED.exists():
    raise FileNotFoundError(f'Scripted preprocessed artifact not found at {SCRIPTED_PREPROCESSED}; run the export script to create it.')

scripted_wrapper = W2VBERT_SPK_Scripted(scripted_path=str(SCRIPTED_PREPROCESSED), device=DEVICE)
print('Instantiated scripted wrapper with bundled feature extractor')
print('Extractor sampling rate:', scripted_wrapper.feature_extractor.sampling_rate)


Instantiated scripted wrapper with bundled feature extractor
Extractor sampling rate: 16000


In [14]:
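# Load audio and, if necessary, resample it to the extractor's sampling rate
wave, sr = sf.read(str(AUDIO_PATH), dtype='float32')
print('Loaded waveform shape:', wave.shape, 'sample rate:', sr)
# Convert to mono if necessary (soundfile returns [frames, channels])
if wave.ndim > 1:
    wave = wave.mean(axis=1)

# Resample only when the file's rate differs from the extractor's rate
target_sr = scripted_wrapper.feature_extractor.sampling_rate
if sr != target_sr:
    wave = librosa.resample(wave, orig_sr=sr, target_sr=target_sr)
    sr = target_sr

# Create a torch waveform with a batch dimension: [1, T]
waveform = torch.from_numpy(wave).unsqueeze(0).to(torch.float32)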


Loaded waveform shape: (133761,) sample rate: 16000


In [15]:
# Compute embedding using the scripted wrapper (bundled extractor)
with torch.inference_mode():
    emb = scripted_wrapper(waveform.cpu())
    emb_np = emb.squeeze(0).detach().cpu().numpy()
print('Scripted embedding shape:', emb_np.shape)


Scripted embedding shape: (256,)
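
One common way to use a fixed-size speaker embedding like the `(256,)` vector above is to compare two of them with cosine similarity. The sketch below is illustrative only: this notebook does not say which scoring method or decision threshold the model was trained for, the helper `cosine_similarity` is hypothetical, and the two vectors are random placeholders standing in for `emb_np` values computed from two recordings.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two 1-D embeddings, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cannot compare a zero-norm embedding")
    return float(np.dot(a, b) / denom)


# Placeholder vectors (NOT real model output); in the notebook these would be
# two `emb_np` arrays computed from two different audio files.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(256)
emb_b = rng.standard_normal(256)

score = cosine_similarity(emb_a, emb_b)
print("Cosine similarity:", score)
```

A higher score means the two vectors point in more similar directions. Any threshold for deciding "same speaker" would have to be calibrated on labelled data.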
