### Necessary imports and requirements

The code is taken from the whisperX README: https://github.com/m-bain/whisperX?tab=readme-ov-file

There were some issues running whisperX on Python 3.12.3. The following combination of package versions worked:

torch==2.2.0, torchaudio==2.2.0 and triton==2.2.0.
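
To confirm that an environment matches the versions listed above, a small check along these lines may help. It is a minimal sketch using the standard library's `importlib.metadata`; the helper name `find_mismatches` and the choice to ignore local build suffixes such as `+cu121` are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported to work together in this notebook.
EXPECTED = {"torch": "2.2.0", "torchaudio": "2.2.0", "triton": "2.2.0"}


def find_mismatches(expected, get_version=version):
    """Return {package: installed version or None} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore local build suffixes such as "+cu121" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


for name, installed in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {EXPECTED[name]}, found {installed}")
```

If nothing is printed, all three packages are installed at the expected versions.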

This code requires a Hugging Face access token, read from the `HF_TOKEN` environment variable.

Because whisperX uses pyannote, access to the pyannote segmentation and diarization models must also be requested and granted on Hugging Face:

Segmentation: https://huggingface.co/pyannote/segmentation-3.0

Diarization: https://huggingface.co/pyannote/speaker-diarization-3.1

In [None]:
import json
import os

import whisperx
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
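
The notebook reads three environment variables (`AUDIO_PATH`, `CONFIG` and `HF_TOKEN`), and `os.getenv` silently returns `None` for any that are missing. A quick check like the following could surface that early. It is a sketch only; the helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names read elsewhere in this notebook.
REQUIRED_VARS = ("AUDIO_PATH", "CONFIG", "HF_TOKEN")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing in .env or environment:", ", ".join(missing))
```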

### Settings

If GPU memory is low, reduce the batch size (e.g. to 4) and set the compute type to `int8`.

In [None]:
device = "cuda" 
audioFile = os.getenv('AUDIO_PATH')
batchSize = 16
computeType = "float16"
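
The choice between these settings can be written as a small helper. This is a sketch under assumptions: the GPU values come from this notebook, while the CPU fallback (`int8`, batch size 4) and the function name `choose_settings` are illustrative choices, not something the notebook prescribes.

```python
def choose_settings(cuda_available, low_memory=False):
    """Pick device, batch size and compute type for whisperX."""
    if not cuda_available:
        # Assumption: without a GPU, fall back to int8 on the CPU.
        return {"device": "cpu", "batch_size": 4, "compute_type": "int8"}
    if low_memory:
        # Settings suggested in this notebook for GPUs with little memory.
        return {"device": "cuda", "batch_size": 4, "compute_type": "int8"}
    # Default settings used in this notebook.
    return {"device": "cuda", "batch_size": 16, "compute_type": "float16"}


print(choose_settings(cuda_available=True, low_memory=True))
```

With PyTorch installed, `cuda_available` could be taken from `torch.cuda.is_available()`.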

### 1. Transcription with original whisper (batched)

In [None]:
model = whisperx.load_model(os.getenv('CONFIG'), device, compute_type=computeType)  # CONFIG holds the Whisper model name, e.g. large-v2

In [None]:
audio = whisperx.load_audio(audioFile)
transcription = model.transcribe(audio, batch_size=batchSize)
print(transcription["segments"]) # before alignment

### 2. Align whisper output

The default alignment model may not be suitable for Swedish audio. If so, a Swedish wav2vec2 model from Hugging Face could be used instead:

https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish

In [None]:
model_a, metadata = whisperx.load_align_model(language_code=transcription["language"], device=device)
wresult = whisperx.align(transcription["segments"], model_a, metadata, audio, device, return_char_alignments=False)
print(wresult["segments"]) # after alignment
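
If a language-specific alignment model is needed for Swedish, one way to wire it in is to build the arguments for `whisperx.load_align_model` from a per-language override table. This is a sketch, not tested against the notebook's audio: it assumes `load_align_model` accepts a `model_name` argument, and the table `ALIGN_MODEL_OVERRIDES` and the helper `align_model_kwargs` are hypothetical names.

```python
# Hypothetical override table: language code -> Hugging Face model id.
ALIGN_MODEL_OVERRIDES = {"sv": "KBLab/wav2vec2-large-voxrex-swedish"}


def align_model_kwargs(language_code, device, overrides=ALIGN_MODEL_OVERRIDES):
    """Build keyword arguments for whisperx.load_align_model."""
    kwargs = {"language_code": language_code, "device": device}
    if language_code in overrides:
        # Assumption: load_align_model accepts a `model_name` argument.
        kwargs["model_name"] = overrides[language_code]
    return kwargs


print(align_model_kwargs("sv", "cuda"))
# Intended use (not run here):
# model_a, metadata = whisperx.load_align_model(
#     **align_model_kwargs(transcription["language"], device))
```

Languages without an entry in the table fall through to whisperX's default alignment model.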

### 3. Assign speaker labels
Pass the minimum and/or maximum number of speakers if known.

In [None]:
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.getenv('HF_TOKEN'), device=device)
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)
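
When only one of the two bounds is known, it may be convenient to pass just that one. The sketch below builds the keyword arguments and rejects an inconsistent pair; the helper name `speaker_kwargs` is hypothetical, while `min_speakers` and `max_speakers` are the parameter names from the commented-out call above.

```python
def speaker_kwargs(min_speakers=None, max_speakers=None):
    """Return only the speaker bounds that are known, after a sanity check."""
    if (min_speakers is not None and max_speakers is not None
            and min_speakers > max_speakers):
        raise ValueError("min_speakers must not exceed max_speakers")
    bounds = {"min_speakers": min_speakers, "max_speakers": max_speakers}
    return {key: value for key, value in bounds.items() if value is not None}


print(speaker_kwargs(min_speakers=2, max_speakers=4))
# Intended use (not run here):
# diarize_segments = diarize_model(audio, **speaker_kwargs(2, 4))
```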

In [None]:
result = whisperx.assign_word_speakers(diarize_segments, wresult)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

### 4. Export timestamps, assigned speakers and text to JSON

In [None]:
data = {
    "segments": []
}

for seg in result['segments']:
    formattedSegment = {
        "start": seg['start'],
        "end": seg['end'],
        "speaker": seg.get('speaker'),  # None if no speaker was assigned to this segment
        "text": seg['text']
    }
    data["segments"].append(formattedSegment)

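# Write UTF-8 and keep non-ASCII characters (e.g. å, ä, ö) readable in the file
with open('output.json', 'w', encoding='utf-8') as jsonFile:
    json.dump(data, jsonFile, indent=4, ensure_ascii=False)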
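
The exported structure is a single `"segments"` list whose entries hold `start`, `end`, `speaker` and `text`. As an illustration of how it might be consumed, the sketch below turns such a structure into one readable line per segment. The helper `format_transcript`, the `UNKNOWN` placeholder for segments without a speaker, and the sample data are all invented for this example.

```python
def format_transcript(data):
    """Turn the exported structure into one readable line per segment."""
    lines = []
    for seg in data["segments"]:
        speaker = seg.get("speaker") or "UNKNOWN"  # placeholder for unlabeled segments
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {speaker}: {text}")
    return lines


# Invented sample in the same shape as output.json.
sample = {"segments": [
    {"start": 0.5, "end": 2.25, "speaker": "SPEAKER_00", "text": " Hej och välkommen."},
    {"start": 2.5, "end": 4.0, "speaker": None, "text": " Tack."},
]}

for line in format_transcript(sample):
    print(line)
# [0.50-2.25] SPEAKER_00: Hej och välkommen.
# [2.50-4.00] UNKNOWN: Tack.
```

For the real file, `data` would come from `json.load` on `output.json`.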