## Init

In [None]:
import numpy as np
import pandas as pd
import os
import IPython.display as ipd

# Choosing the Models for the Pipeline
They need to be lightweight, flexible and high performance. The main questions the pipeline will answer are "Who said that?" "What was said?" 

## Speech to Text via Open-AI's 'Whisper'
**"What was said?"**

Whisper is very good, very lightweight - ideal for a real-time app. But it can't handle multi-speaker, we need to seperate speakers first.

In [1]:
import whisper
model = whisper.load_model("base")  # "tiny" < "base" < "small" < "medium" < "large"

In [2]:
pathIDs = ['0644', '0505', '2099', '4135', '3820']
for pathID in pathIDs:
    file_path = f'../../data/commonvoice0104-en/cv-corpus-21.0-delta-2025-03-14/en/clips/common_voice_en_4191{pathID}.mp3'
    try:
        result = model.transcribe(file_path, fp16=False)
        ipd.display(ipd.Audio(file_path))
        print(result["text"])
    except:
        print(f"File not found: {file_path}")
        continue

 Few recreationally suitable beaches exist naturally on the lake.


 His criticism of Communism is evident to some of these works.


 that the councillary calls decided and prepared to've the constitution it treble grade, Production stages and related art ships helps students guide faster dabboxes, more and more creation effects, for example, senses.


 After this screening, Altman discussed various aspects of the production during a question and answer session.


 The secret ballot, then considered a novelty, had not yet been introduced to in Canada.


## Speaker Identification with 'pyannote'
**"Who said that?"**

For speaker diarization; doesn't identify speakers, but can extract/seperate e.g. Speaker1, Speaker2, ...

## (WIP - not functional)

### setup

In [3]:
from lhotse import download, prepare
from pathlib import Path

# 1. Download LibriCSS into ./data/libricss
download("libricss", target_dir=Path("../data/libricss"), force_download=False)

# 2. Prepare manifests (lists of recordings + annotations)
recordings, supervisions = prepare(
    "libricss", corpus_dir=Path("data/libricss")
)

# 3. Filter recordings for clean vs hard overlap cases
def overlap_level(sup):
    return sup.tags.get("overlap_ratio", 0.0)

clean = []
hard = []
from collections import defaultdict
counts = defaultdict(int)

for rec in recordings:
    # Each rec has a supervision list: we take the first
    sup = next(supervisions[rec.recording_id])
    lvl = overlap_level(sup)
    if lvl <= 0.1 and counts["clean"] < 2:
        clean.append(rec)
        counts["clean"] += 1
    elif lvl >= 0.3 and counts["hard"] < 2:
        hard.append(rec)
        counts["hard"] += 1
    if counts["clean"] == 2 and counts["hard"] == 2:
        break

print("Selected clean:", [r.recording_id for r in clean])
print("Selected hard:", [r.recording_id for r in hard])


ImportError: cannot import name 'download' from 'lhotse' (C:\Users\conno\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\lhotse\__init__.py)

### this...

In [None]:
from pyannote.audio import Pipeline

# Load pretrained pipeline (will download model)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Apply to audio file (WAV, 16kHz recommended)
diarization = pipeline("your_audio.wav")

# Print speaker segments
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")


### or...

In [None]:
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Example on a clean sample
wav = clean[0].recording_path
diar = pipeline(str(wav))
print("Clean case results:", diar)

# And similarly for a hard case
wav2 = hard[0].recording_path
print("Hard case results:")
diar2 = pipeline(str(wav2))
print(diar2)