# Walkthrough: Automatic Speech Recognition + Speaker Diarization

This notebook provides a working example of the inference workflow described in [this blog post](https://huggingface.co/blog/asr-diarization) by the Hugging Face team. In particular, it builds on the workflow from [the Audio Course](https://huggingface.co/learn/audio-course/en/chapter7/transcribe-meeting#speaker-diarization).


## 1. Install requirements


In [7]:
!pip install --upgrade transformers huggingface_hub pyannote.audio datasets git+https://github.com/huggingface/speechbox


## 2. Download model artifacts to local dir

**Note:**

- To access the diarization models, we first have to agree to the model’s terms of use: [pyannote/speaker-diarization](https://huggingface.co/pyannote/speaker-diarization).
- And subsequently the segmentation model’s terms of use: [pyannote/segmentation](https://huggingface.co/pyannote/segmentation).
- This download step exists only because the model files were requested separately, so that they can be security-scanned before use.


In [1]:
import os
from pathlib import Path
from huggingface_hub import interpreter_login, snapshot_download



In [3]:
# log in so that the access token is cached locally
# (`notebook_login()` could be used instead)
interpreter_login()



    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (osxkeychain).

In [2]:
DIARIZATION_MODEL = "pyannote/speaker-diarization-3.1"
ASR_MODEL = "openai/whisper-large-v3"

MODELS_DIR = Path("models")
os.makedirs(MODELS_DIR, exist_ok=True)

# Uncomment to download each model into its own subfolder of MODELS_DIR:
# for model in [ASR_MODEL, DIARIZATION_MODEL]:
#     snapshot_download(model, local_dir=MODELS_DIR / model.split("/")[1])
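
The commented-out loop above can also be wrapped in two small helpers. This is only a minimal sketch and not part of the original notebook: `local_model_dir` and `download_models` are hypothetical names introduced here, and the download assumes that the terms of use have been accepted and that a token is already cached.

```python
from pathlib import Path

MODELS_DIR = Path("models")


def local_model_dir(repo_id: str, models_dir: Path = MODELS_DIR) -> Path:
    """Map a Hub repo id such as "openai/whisper-large-v3" to models/whisper-large-v3."""
    return models_dir / repo_id.split("/")[-1]


def download_models(repo_ids, models_dir: Path = MODELS_DIR) -> None:
    """Download one snapshot per repo id into its own subfolder of models_dir."""
    # Imported lazily so that the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    models_dir.mkdir(parents=True, exist_ok=True)
    for repo_id in repo_ids:
        snapshot_download(repo_id, local_dir=local_model_dir(repo_id, models_dir))


print(local_model_dir("openai/whisper-large-v3"))
```

Calling `download_models([ASR_MODEL, DIARIZATION_MODEL])` would then produce the same folder layout that the loading code in section 4 expects.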

## 3. Load sample data and listen to it


In [3]:
from datasets import load_dataset

concatenated_librispeech = load_dataset(
    "sanchit-gandhi/concatenated_librispeech", split="train", streaming=True
)
sample = next(iter(concatenated_librispeech))

In [4]:
from IPython.display import Audio

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

## 4. Initialize models


In [5]:
import torch

from pyannote.audio import Pipeline
from transformers import pipeline
from huggingface_hub import get_token
from speechbox import ASRDiarizationPipeline

#### Initialize models from the local directory


In [6]:
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
TORCH_DTYPE = torch.float32 if DEVICE.type == "cpu" else torch.float16

In [10]:
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=MODELS_DIR / ASR_MODEL.split("/")[1],
    torch_dtype=TORCH_DTYPE,
    device=DEVICE,
)

diarization_pipeline = Pipeline.from_pretrained(
    checkpoint_path=MODELS_DIR / DIARIZATION_MODEL.split("/")[1] / "config.yaml",
    use_auth_token=get_token(),
)

full_pipeline = ASRDiarizationPipeline(
    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


##### Helper functions to clean up output


In [8]:
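def tuple_to_string(start_end_tuple, ndigits=1):
    """Render a (start, end) timestamp pair as a string, rounded to `ndigits` decimals."""
    return str((round(start_end_tuple[0], ndigits), round(start_end_tuple[1], ndigits)))


def format_as_transcription(raw_segments):
    """Join segments into "SPEAKER (start, end) text" blocks separated by blank lines."""
    return "\n\n".join(
        [
            chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"]
            for chunk in raw_segments
        ]
    )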
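
To see what these helpers produce without running the models, they can be applied to a hand-written list of segments. The sketch below repeats the two functions so that it runs on its own. The segment values are made up for illustration, and the leading space in each `text` is an assumption based on the printed output in section 5.

```python
def tuple_to_string(start_end_tuple, ndigits=1):
    return str((round(start_end_tuple[0], ndigits), round(start_end_tuple[1], ndigits)))


def format_as_transcription(raw_segments):
    return "\n\n".join(
        chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"]
        for chunk in raw_segments
    )


# Hypothetical segments in the shape the helpers expect:
# a speaker label, a (start, end) pair in seconds, and the transcribed text.
segments = [
    {"speaker": "SPEAKER_00", "timestamp": (0.0, 15.12), "text": " first utterance"},
    {"speaker": "SPEAKER_01", "timestamp": (15.12, 21.68), "text": " second utterance"},
]

formatted = format_as_transcription(segments)
print(formatted)
# SPEAKER_00 (0.0, 15.1) first utterance
#
# SPEAKER_01 (15.1, 21.7) second utterance
```

Note that no separator is inserted between the timestamp and the text, so the space has to come from the text itself.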

## 5. Run the pipeline


In [9]:
out = full_pipeline(sample["audio"].copy(), language="en")
print(format_as_transcription(out))

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


SPEAKER_00 (0.0, 15.1) the second in importance is as follows sovereignty may be defined to be the right of making laws in france the king really exercises a portion of the sovereign power since the laws have no weight

SPEAKER_01 (15.1, 21.7) he was in a fevered state of mind owing to the blight his wife's action threatened to cast upon his entire future
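
Because each segment carries a speaker label and a timestamp pair, the output is easy to post-process. As one possible follow-up, the sketch below sums speaking time per speaker. `speaking_time` is a hypothetical helper, and the segments reuse the rounded timestamps printed above with the text elided, rather than the raw pipeline output.

```python
from collections import defaultdict


def speaking_time(segments, ndigits=1):
    """Return the total seconds spoken per speaker, rounded to `ndigits` decimals."""
    totals = defaultdict(float)
    for segment in segments:
        start, end = segment["timestamp"]
        totals[segment["speaker"]] += end - start
    return {speaker: round(total, ndigits) for speaker, total in totals.items()}


# Timestamps copied from the printed transcription above; text elided.
segments = [
    {"speaker": "SPEAKER_00", "timestamp": (0.0, 15.1), "text": "..."},
    {"speaker": "SPEAKER_01", "timestamp": (15.1, 21.7), "text": "..."},
]

print(speaking_time(segments))
# {'SPEAKER_00': 15.1, 'SPEAKER_01': 6.6}
```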
