# Walkthrough: Automatic Speech Recognition + Speaker Diarization

This notebook provides a working example of the inference workflow described in [this blog post](https://huggingface.co/blog/asr-diarization) by the Hugging Face team.

## 1. Clone supporting files from Hub repo

In [None]:
!git clone https://huggingface.co/sergeipetrov/asrdiarization-handler
!cp asrdiarization-handler/config.py asrdiarization-handler/requirements.txt asrdiarization-handler/diarization_utils.py .
!rm -rf asrdiarization-handler
!pip install -r requirements.txt
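If `git` or the `!` shell magics are not available, the same three supporting files could probably be fetched one by one with `hf_hub_download`. The sketch below is an alternative to the clone-and-copy cell, not what the blog post does. It assumes `huggingface_hub` is already installed (it is only imported when no downloader is passed in), and the helper name `fetch_support_files` and its `download` parameter are hypothetical, added so the function can be exercised without network access.

```python
from pathlib import Path

REPO_ID = "sergeipetrov/asrdiarization-handler"
SUPPORT_FILES = ["config.py", "requirements.txt", "diarization_utils.py"]


def fetch_support_files(repo_id=REPO_ID, filenames=SUPPORT_FILES,
                        dest=Path("."), download=None):
    """Download each supporting file into `dest` and return the local paths.

    `download` is a hypothetical hook: any callable accepting the keyword
    arguments repo_id, filename and local_dir. It defaults to
    huggingface_hub.hf_hub_download.
    """
    if download is None:
        # Imported lazily so that this cell can be defined before the
        # package is installed (assumption: it is installed before calling).
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return [
        Path(download(repo_id=repo_id, filename=name, local_dir=dest))
        for name in filenames
    ]
```

Calling `fetch_support_files()` with no arguments would place the files in the current directory; `pip install -r requirements.txt` still has to be run afterwards.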

## 2. Download model artifacts to local dir

**Note:** This step exists only because downloading the model files separately was requested, to facilitate security scans.

In [1]:
import os
from pathlib import Path
from huggingface_hub import interpreter_login, snapshot_download

In [2]:
# Log in to cache the access token.
# `notebook_login()` could be used instead.
# interpreter_login()

In [3]:
DIARIZATION_MODEL = "pyannote/speaker-diarization-3.1"
ASR_MODEL = "openai/whisper-large-v3"
ASSISTANT_MODEL = "distil-whisper/distil-large-v3"

In [8]:
MODELS_DIR = Path("models")
os.makedirs(MODELS_DIR, exist_ok=True)
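Each model is saved under `MODELS_DIR` in a subdirectory named after the repository, with the organisation prefix dropped. A minimal sketch of that naming rule follows; the helper `local_dir_for` and the list name `MODEL_IDS` are hypothetical and only illustrate the expression `model.split("/")[1]` used in the download cell.

```python
from pathlib import Path

MODELS_DIR = Path("models")
MODEL_IDS = [
    "openai/whisper-large-v3",
    "distil-whisper/distil-large-v3",
    "pyannote/speaker-diarization-3.1",
]


def local_dir_for(model_id: str, root: Path = MODELS_DIR) -> Path:
    # "openai/whisper-large-v3" -> root / "whisper-large-v3"
    return root / model_id.split("/")[1]


for model_id in MODEL_IDS:
    print(model_id, "->", local_dir_for(model_id))
```

Because only the repository name is kept, two models with the same name under different organisations would end up in the same directory. That is not a problem for the three models used here.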

In [12]:
for model in [ASR_MODEL, ASSISTANT_MODEL, DIARIZATION_MODEL]:
    snapshot_download(model, local_dir=MODELS_DIR / model.split("/")[1])

Output (per-file progress bars omitted): the three `snapshot_download` calls fetch 21, 26, and 24 files, in loop order.
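Since the files are downloaded locally so that they can be scanned, it may help to hand the scanner an inventory of what was fetched. The sketch below walks a directory and records each file's relative path, size, and SHA-256 digest. It uses only the standard library; the function names `sha256_of` and `build_manifest` are hypothetical, and the manifest format is an assumption rather than anything a particular scanning tool requires.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-gigabyte weights fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> list[dict]:
    """Return one entry per regular file under `root`, sorted by path."""
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return entries


models_dir = Path("models")
if models_dir.exists():
    # Hashing several gigabytes of weights can take a while.
    for entry in build_manifest(models_dir):
        print(entry["sha256"], entry["bytes"], entry["path"])
```

The manifest covers everything under the directory, including any non-weight files that came with the snapshot, which is usually what a scan wants to see.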