# Greek Medical Dictation Pipeline Evaluation

This notebook evaluates the performance of three Whisper models (Small, Medium, and Large-v2) fine-tuned for Greek medical dictation, enhanced with a GPT-2 language model for transcription reranking. The goal is to assess transcription quality on a combined Greek audio dataset using standard metrics.

## Objective

* Evaluate ASR Performance: Compare the default Whisper transcriptions (greedy decoding) with GPT-2 reranked transcriptions.
* Metrics: Word Error Rate (WER), Normalized WER, Character Error Rate (CER), BLEU score, and perplexity.
* Dataset: A combined test set from "Vardis/Greek_Mosel", Common Voice (Greek), and Fleurs (Greek), standardized to 16kHz audio.


## Workflow

### Dataset Preparation:

* Load the datasets: "Vardis/Greek_Mosel", Common Voice 11.0 (el), and Fleurs (el_gr).
* Split each dataset into 80% train, 10% validation, and 10% test, then concatenate the matching splits. Only the combined test split is shuffled.
* Standardize audio to 16kHz and rename text fields to `sentence`.
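
The 80/10/10 split is done in two stages: 20% is held out first, and that held-out part is then halved. Below is a minimal sketch of the size arithmetic. It assumes the held-out count is rounded up at each stage, which is an assumption about the rounding in `train_test_split`, chosen because it reproduces the per-split row counts printed by the loading cell. `three_way_split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def three_way_split_sizes(n_examples, holdout_frac=0.2, test_frac_of_holdout=0.5):
    """Return (train, validation, test) sizes for a two-stage split."""
    # Stage 1: hold out 20% of the data (rounded up) for validation + test
    n_holdout = math.ceil(n_examples * holdout_frac)
    n_train = n_examples - n_holdout
    # Stage 2: halve the held-out part (test share rounded up)
    n_test = math.ceil(n_holdout * test_frac_of_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

# Totals are the sums of the per-split row counts printed by the loading cell
for name, n in [("Greek_Mosel", 3876), ("Common Voice el", 5311), ("Fleurs el_gr", 4136)]:
    print(name, three_way_split_sizes(n))
```

Summing the three test sizes gives the number of utterances each pipeline iterates over.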


### Model Setup:

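* Load fine-tuned Whisper models (Vardis/Whisper-Small-Greek, Vardis/Whisper-Medium-Greek, Vardis/Whisper-Large-v2-Greek) as LoRA adapters on the matching OpenAI base checkpoints, then merge the LoRA weights into the base weights.
* Load the GPT-2 reranker: the LoRA adapter Vardis/Medical_Speech_Greek_GPT2 on top of lighteternal/gpt2-finetuned-greek.
* Use torch.float16 and device_map="auto" for GPU acceleration.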
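
Merging LoRA weights folds the low-rank update into the base weight matrix, so inference needs no separate adapter layers. Below is a minimal pure-Python sketch of that arithmetic on toy matrices, W' = W + (alpha / r) * B * A. The matrices, the `alpha` value, and the helper names are illustrative assumptions, and this is not a description of how `peft` implements `merge_and_unload()` internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w0, lora_a, lora_b, alpha, r):
    """Return w0 + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

# Toy example with rank r = 1 (hypothetical values)
w0 = [[1.0, 0.0], [0.0, 1.0]]   # base weight, shape (out, in) = (2, 2)
lora_a = [[1.0, 2.0]]           # shape (r, in) = (1, 2)
lora_b = [[1.0], [0.0]]         # shape (out, r) = (2, 1)
merged = merge_lora(w0, lora_a, lora_b, alpha=2.0, r=1)

# The merged matrix gives the same output as base path + scaled adapter path
x = [[1.0], [1.0]]
base_out = matmul(w0, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
two_path = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(matmul(merged, x) == two_path)
```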


### Evaluation Process:

* For each Whisper model (Small, Medium, Large-v2):
  * Generate default transcriptions (greedy decoding) and n-best hypotheses (beam search, n=5).
  * Rerank hypotheses using GPT-2 perplexity scores, keeping the hypothesis with the lowest perplexity.
  * Compute WER, Normalized WER, CER, BLEU, and perplexity for default and reranked transcriptions.
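
The reranking step can be sketched as follows: score every hypothesis with a language model and keep the one with the lowest perplexity. This is a minimal sketch in which a toy unigram table stands in for GPT-2. `TOY_UNIGRAM`, `UNKNOWN_PROB`, `toy_score`, and the example hypotheses are all hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token to define perplexity")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical unigram probabilities standing in for the GPT-2 scorer
TOY_UNIGRAM = {"ο": 0.2, "ασθενής": 0.1, "έχει": 0.1, "πυρετό": 0.05}
UNKNOWN_PROB = 1e-4  # assumed floor for out-of-vocabulary words

def toy_score(sentence):
    """Score a sentence with the toy unigram model (lower is better)."""
    logprobs = [math.log(TOY_UNIGRAM.get(w, UNKNOWN_PROB)) for w in sentence.split()]
    return perplexity(logprobs)

def rerank(hypotheses, score_fn):
    """Return (best_hypothesis, its_perplexity); the lowest perplexity wins."""
    scores = [score_fn(h) for h in hypotheses]
    best = min(range(len(hypotheses)), key=scores.__getitem__)
    return hypotheses[best], scores[best]

n_best = ["ο ασθενής έχει πυρετώ", "ο ασθενής έχει πυρετό", "ο ασθενείς έχει πυρετό"]
best_hyp, best_ppl = rerank(n_best, toy_score)
print(best_hyp, round(best_ppl, 2))
```

Because the score depends only on the text, the reranker can prefer a fluent hypothesis that is acoustically wrong. That is why the comparison against greedy decoding is measured with WER and CER rather than perplexity alone.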


### Results:

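* Report average metrics across the test set, comparing default and reranked performance.
* Metrics include:
  - WER: Word-level error rate (standard and normalized), reported as a percentage.
  - CER: Character-level error rate, reported as a percentage.
  - BLEU: N-gram overlap with the reference, computed here over characters and averaged per sentence (0 to 1).
  - Perplexity: GPT-2 perplexity of the transcription (lower means the language model finds the text more likely).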
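
WER and CER are both edit distances divided by the reference length, over words and characters respectively. Below is a minimal pure-Python sketch for a single reference/hypothesis pair. The notebook itself uses the `evaluate` metrics over the whole test set, and the helper names and example sentences here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

ref = "ο ασθενής έχει υψηλό πυρετό"
hyp = "ο ασθενής έχει υψηλό πυρετώ"
print(f"WER: {100 * wer(ref, hyp):.1f}%  CER: {100 * cer(ref, hyp):.1f}%")
```

A single wrong character costs a whole word in WER but only one character in CER, which is why CER is consistently the lower of the two figures.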



This evaluation provides insights into the effectiveness of fine-tuned Whisper models and GPT-2 reranking for Greek medical dictation tasks.




In [1]:
!pip install evaluate jiwer
!pip install datasets==3.6.0


Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)

## Load Dataset

In [2]:
from datasets import load_dataset, IterableDatasetDict
import os


os.environ["CUDA_VISIBLE_DEVICES"] = "0"
language = "Greek"
language_abbr = "el"
language_abbr2 = "el_gr"
task = "transcribe"


a = IterableDatasetDict()
b = IterableDatasetDict()
c = IterableDatasetDict()


a_full = load_dataset("Vardis/Greek_Mosel", split="train")
a_temp = a_full.train_test_split(test_size=0.2, seed=42)  # 80% train 
a_val_test = a_temp["test"].train_test_split(test_size=0.5, seed=42)  # 10% val + 10% test
a["train"] = a_temp["train"]
a["validation"] = a_val_test["train"]
a["test"] = a_val_test["test"]

b_full = load_dataset("mozilla-foundation/common_voice_11_0", language_abbr, split="train+validation+test")
b_temp = b_full.train_test_split(test_size=0.2, seed=42)
b_val_test = b_temp["test"].train_test_split(test_size=0.5, seed=42)
b["train"] = b_temp["train"]
b["validation"] = b_val_test["train"]
b["test"] = b_val_test["test"]

c_full = load_dataset("google/fleurs", language_abbr2, split="train+validation+test")
c_temp = c_full.train_test_split(test_size=0.2, seed=42)
c_val_test = c_temp["test"].train_test_split(test_size=0.5, seed=42)
c["train"] = c_temp["train"]
c["validation"] = c_val_test["train"]
c["test"] = c_val_test["test"]



b = b.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
c = c.remove_columns(["id", "num_samples", "path", "raw_transcription", "gender", "lang_id", "language", "lang_group_id"])

a = a.rename_column("text", "sentence")
c = c.rename_column("transcription", "sentence")


print(a)
print(b)
print(c)

from datasets import Audio

a = a.cast_column("audio", Audio(sampling_rate=16000))
b = b.cast_column("audio", Audio(sampling_rate=16000))
c = c.cast_column("audio", Audio(sampling_rate=16000))

from datasets import concatenate_datasets

combined_train = concatenate_datasets([a['train'], b['train'], c['train']])
combined_test = concatenate_datasets([a['test'], b['test'], c['test']])
combined_test = combined_test.shuffle(seed=42)
combined_valid = concatenate_datasets([a['validation'], b['validation'], c['validation']])

combined_dataset = IterableDatasetDict({
    'train': combined_train,
    "validation": combined_valid,
    'test': combined_test
})

dataset = combined_dataset
print(dataset)

The repository for mozilla-foundation/common_voice_11_0 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mozilla-foundation/common_voice_11_0.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


The repository for google/fleurs contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/google/fleurs.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3100
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 388
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 388
    })
})
IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4248
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 531
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 532
    })
})
IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3308
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 414
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 414
    })
})
IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
       

## Loading Greek GPT-2 

In [3]:
import torch
import gc
import math
import numpy as np
from tqdm import tqdm
from transformers import AutoProcessor, WhisperForConditionalGeneration, AutoTokenizer, AutoModelForCausalLM, WhisperProcessor
from peft import PeftModel, PeftConfig
import string
import re
import evaluate
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

device = "cuda" if torch.cuda.is_available() else "cpu"


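# The tokenizer ships with the adapter repo; the base weights are loaded separately
lm_tokenizer = AutoTokenizer.from_pretrained("Vardis/Medical_Speech_Greek_GPT2")
base_model = AutoModelForCausalLM.from_pretrained(
    "lighteternal/gpt2-finetuned-greek",
    torch_dtype=torch.float16,
    device_map="auto"
)
# Attach the LoRA adapter weights to the base model
lm_model = PeftModel.from_pretrained(base_model, "Vardis/Medical_Speech_Greek_GPT2")

# Move to the target device; as the last expression, this also prints the model summary
lm_model.to(device)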


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D(nf=2304, nx=768)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
        

## Small Pipeline

In [9]:
import math
import numpy as np
from tqdm import tqdm
from transformers import AutoProcessor, WhisperForConditionalGeneration, AutoTokenizer, AutoModelForCausalLM, WhisperProcessor
from peft import PeftModel, PeftConfig
import string
import re
import evaluate
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

base_whisper = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    torch_dtype=torch.float16,
    device_map="auto"
)

# LoRA weights
ft_whisper = PeftModel.from_pretrained(
    base_whisper, 
    "Vardis/Whisper-Small-Greek"
)

# Merge LoRA → base weights
whisper_model = ft_whisper.merge_and_unload().to(device)

processor = WhisperProcessor.from_pretrained("Vardis/Whisper-Small-Greek")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

whisper_model.to(device)

def calculate_perplexity(sentence, model, tokenizer, device):
    """Return the LM perplexity of a sentence: exp of the mean token cross-entropy loss."""
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    input_ids = inputs.input_ids
    labels = input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100
    with torch.no_grad():
        outputs = model(input_ids, labels=labels)
        loss = outputs.loss.item()
    return math.exp(loss) 

def get_whisper_transcriptions(audio_array, sr, n_best=5):
    input_features = processor(audio_array, sampling_rate=sr, return_tensors="pt").input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(
            input_features,
            max_length=225,  
            num_beams=1,     # Explicitly enforce greedy decoding
            do_sample=False  
        )
    default_transcription = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    # N-best hypotheses (beam search)
    beam_outputs = whisper_model.generate(
        input_features,
        num_beams=n_best,
        num_return_sequences=n_best,
        return_dict_in_generate=True,
        max_length=225 
    )
    n_best_transcriptions = processor.batch_decode(beam_outputs.sequences, skip_special_tokens=True)
    
    return default_transcription, n_best_transcriptions

def rerank_hypotheses(hypotheses, model, tokenizer, device):
    """Rerank hypotheses by LM perplexity and return the best one."""
    perplexities = [calculate_perplexity(hyp, model, tokenizer, device) for hyp in hypotheses]
    best_index = perplexities.index(min(perplexities))
    return hypotheses[best_index], perplexities[best_index]

smooth_fn = SmoothingFunction().method1

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    """Normalize text for WER computation: lowercase, remove punctuation, standardize whitespace."""
    text = text.lower()
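    # Escape the punctuation so characters like ], \ and - are literal inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)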
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

def compute_cer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    cer_score = cer_metric.compute(predictions=hypotheses, references=references)
    return 100 * cer_score

def compute_bleu(reference, hypothesis):
    """Character-level sentence BLEU: list() splits each string into characters."""
    ref_tokens = list(reference)
    hyp_tokens = list(hypothesis)
    return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth_fn)

default_perps, reranked_perps = [], []
default_preds, reranked_preds = [], []
references = []


for i, item in enumerate(tqdm(dataset["test"])):
    
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    ground_truth = item["sentence"]

    # Get Whisper transcriptions
    default_trans, n_best_hyps = get_whisper_transcriptions(audio_array, sampling_rate, n_best=5)

    # Without reranking
    default_perp = calculate_perplexity(default_trans, lm_model, lm_tokenizer, device)

    # With reranking
    reranked_trans, reranked_perp = rerank_hypotheses(n_best_hyps, lm_model, lm_tokenizer, device)

    references.append(ground_truth)
    default_preds.append(default_trans)
    reranked_preds.append(reranked_trans)

    default_perps.append(default_perp)
    reranked_perps.append(reranked_perp)

avg_default_perp = np.mean(default_perps)
avg_reranked_perp = np.mean(reranked_perps)

avg_default_wer = compute_wer(references, default_preds)
avg_reranked_wer = compute_wer(references, reranked_preds)
avg_default_normalized_wer = compute_normalized_wer(references, default_preds)
avg_reranked_normalized_wer = compute_normalized_wer(references, reranked_preds)

avg_default_cer = compute_cer(references, default_preds)
avg_reranked_cer = compute_cer(references, reranked_preds)

default_bleus = [compute_bleu(ref, hyp) for ref, hyp in zip(references, default_preds)]
reranked_bleus = [compute_bleu(ref, hyp) for ref, hyp in zip(references, reranked_preds)]
avg_default_bleu = np.mean(default_bleus)
avg_reranked_bleu = np.mean(reranked_bleus)

print(f"Average Default Perplexity: {avg_default_perp:.2f}")
print(f"Average Reranked Perplexity: {avg_reranked_perp:.2f}")
print(f"Global Default WER: {avg_default_wer:.4f}")
print(f"Global Reranked WER: {avg_reranked_wer:.4f}")
print(f"Global Default Normalized WER: {avg_default_normalized_wer:.4f}")
print(f"Global Reranked Normalized WER: {avg_reranked_normalized_wer:.4f}")
print(f"Global Default CER: {avg_default_cer:.4f}")
print(f"Global Reranked CER: {avg_reranked_cer:.4f}")
print(f"Average Default BLEU: {avg_default_bleu:.4f}")
print(f"Average Reranked BLEU: {avg_reranked_bleu:.4f}")

gc.collect()
torch.cuda.empty_cache()

100%|██████████| 1334/1334 [2:17:33<00:00,  6.19s/it] 


Average Default Perplexity: 1035.33
Average Reranked Perplexity: 888.52
Global Default WER: 30.3130
Global Reranked WER: 27.3895
Global Default Normalized WER: 26.5397
Global Reranked Normalized WER: 23.5772
Global Default CER: 13.2758
Global Reranked CER: 11.8036
Average Default BLEU: 0.8235
Average Reranked BLEU: 0.8417


## Medium Pipeline

In [7]:
base_whisper = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-medium",
    torch_dtype=torch.float16,
    device_map="auto"
)

ft_whisper = PeftModel.from_pretrained(
    base_whisper, 
    "Vardis/Whisper-Medium-Greek"
)

whisper_model = ft_whisper.merge_and_unload().to(device)

processor = WhisperProcessor.from_pretrained("Vardis/Whisper-Medium-Greek")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

whisper_model.to(device)

def calculate_perplexity(sentence, model, tokenizer, device):
    """Return the LM perplexity of a sentence: exp of the mean token cross-entropy loss."""
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    input_ids = inputs.input_ids
    labels = input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100
    with torch.no_grad():
        outputs = model(input_ids, labels=labels)
        loss = outputs.loss.item()
    return math.exp(loss) 

def get_whisper_transcriptions(audio_array, sr, n_best=5):
    input_features = processor(audio_array, sampling_rate=sr, return_tensors="pt").input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(
            input_features,
            max_length=225,  
            num_beams=1,     # Explicitly enforce greedy decoding
            do_sample=False  
        )
    default_transcription = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    # N-best hypotheses (beam search)
    beam_outputs = whisper_model.generate(
        input_features,
        num_beams=n_best,
        num_return_sequences=n_best,
        return_dict_in_generate=True,
        max_length=225 
    )
    n_best_transcriptions = processor.batch_decode(beam_outputs.sequences, skip_special_tokens=True)
    
    return default_transcription, n_best_transcriptions

def rerank_hypotheses(hypotheses, model, tokenizer, device):
    """Rerank hypotheses by LM perplexity and return the best one."""
    perplexities = [calculate_perplexity(hyp, model, tokenizer, device) for hyp in hypotheses]
    best_index = perplexities.index(min(perplexities))
    return hypotheses[best_index], perplexities[best_index]

smooth_fn = SmoothingFunction().method1

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    """Normalize text for WER computation: lowercase, remove punctuation, standardize whitespace."""
    text = text.lower()
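    # Escape the punctuation so characters like ], \ and - are literal inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)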
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

def compute_cer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    cer_score = cer_metric.compute(predictions=hypotheses, references=references)
    return 100 * cer_score

def compute_bleu(reference, hypothesis):
    """Character-level sentence BLEU: list() splits each string into characters."""
    ref_tokens = list(reference)
    hyp_tokens = list(hypothesis)
    return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth_fn)

default_perps, reranked_perps = [], []
default_preds, reranked_preds = [], []
references = []

    
for i, item in enumerate(tqdm(dataset["test"])):
   
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    ground_truth = item["sentence"]

    # Get Whisper transcriptions
    default_trans, n_best_hyps = get_whisper_transcriptions(audio_array, sampling_rate, n_best=5)

    # Without reranking
    default_perp = calculate_perplexity(default_trans, lm_model, lm_tokenizer, device)

    # With reranking
    reranked_trans, reranked_perp = rerank_hypotheses(n_best_hyps, lm_model, lm_tokenizer, device)

    references.append(ground_truth)
    default_preds.append(default_trans)
    reranked_preds.append(reranked_trans)

    default_perps.append(default_perp)
    reranked_perps.append(reranked_perp)

avg_default_perp = np.mean(default_perps)
avg_reranked_perp = np.mean(reranked_perps)

avg_default_wer = compute_wer(references, default_preds)
avg_reranked_wer = compute_wer(references, reranked_preds)
avg_default_normalized_wer = compute_normalized_wer(references, default_preds)
avg_reranked_normalized_wer = compute_normalized_wer(references, reranked_preds)

avg_default_cer = compute_cer(references, default_preds)
avg_reranked_cer = compute_cer(references, reranked_preds)

default_bleus = [compute_bleu(ref, hyp) for ref, hyp in zip(references, default_preds)]
reranked_bleus = [compute_bleu(ref, hyp) for ref, hyp in zip(references, reranked_preds)]
avg_default_bleu = np.mean(default_bleus)
avg_reranked_bleu = np.mean(reranked_bleus)

print(f"Average Default Perplexity: {avg_default_perp:.2f}")
print(f"Average Reranked Perplexity: {avg_reranked_perp:.2f}")
print(f"Global Default WER: {avg_default_wer:.4f}")
print(f"Global Reranked WER: {avg_reranked_wer:.4f}")
print(f"Global Default Normalized WER: {avg_default_normalized_wer:.4f}")
print(f"Global Reranked Normalized WER: {avg_reranked_normalized_wer:.4f}")
print(f"Global Default CER: {avg_default_cer:.4f}")
print(f"Global Reranked CER: {avg_reranked_cer:.4f}")
print(f"Average Default BLEU: {avg_default_bleu:.4f}")
print(f"Average Reranked BLEU: {avg_reranked_bleu:.4f}")

gc.collect()
torch.cuda.empty_cache()

100%|██████████| 1334/1334 [5:19:34<00:00, 14.37s/it]  


Average Default Perplexity: 518.27
Average Reranked Perplexity: 378.11
Global Default WER: 19.4459
Global Reranked WER: 18.2357
Global Default Normalized WER: 16.1724
Global Reranked Normalized WER: 14.8634
Global Default CER: 8.9602
Global Reranked CER: 8.3561
Average Default BLEU: 0.8893
Average Reranked BLEU: 0.8960


## Large Pipeline

In [4]:


base_whisper = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device_map="auto"
)

ft_whisper = PeftModel.from_pretrained(
    base_whisper, 
    "Vardis/Whisper-Large-v2-Greek"
)

whisper_model = ft_whisper.merge_and_unload().to(device)

processor = WhisperProcessor.from_pretrained("Vardis/Whisper-Large-v2-Greek")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")


whisper_model.to(device)

def calculate_perplexity(sentence, model, tokenizer, device):
    """Return the LM perplexity of a sentence: exp of the mean token cross-entropy loss."""
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    input_ids = inputs.input_ids
    labels = input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100
    with torch.no_grad():
        outputs = model(input_ids, labels=labels)
        loss = outputs.loss.item()
    return math.exp(loss) 

def get_whisper_transcriptions(audio_array, sr, n_best=5):
    input_features = processor(audio_array, sampling_rate=sr, return_tensors="pt").input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(
            input_features,
            max_length=225,  
            num_beams=1,     
            do_sample=False  
        )
    default_transcription = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    beam_outputs = whisper_model.generate(
        input_features,
        num_beams=n_best,
        num_return_sequences=n_best,
        return_dict_in_generate=True,
        max_length=225  
    )
    n_best_transcriptions = processor.batch_decode(beam_outputs.sequences, skip_special_tokens=True)
    
    return default_transcription, n_best_transcriptions

def rerank_hypotheses(hypotheses, model, tokenizer, device):
    """Rerank hypotheses by LM perplexity and return the best one."""
    perplexities = [calculate_perplexity(hyp, model, tokenizer, device) for hyp in hypotheses]
    best_index = perplexities.index(min(perplexities))
    return hypotheses[best_index], perplexities[best_index]

smooth_fn = SmoothingFunction().method1

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    """Normalize text for WER computation: lowercase, remove punctuation, standardize whitespace."""
    text = text.lower()
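    # Escape the punctuation so characters like ], \ and - are literal inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)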
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

def compute_cer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    cer_score = cer_metric.compute(predictions=hypotheses, references=references)
    return 100 * cer_score

def compute_bleu(reference, hypothesis):
    """Character-level sentence BLEU: list() splits each string into characters."""
    ref_tokens = list(reference)
    hyp_tokens = list(hypothesis)
    return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth_fn)

default_perps, reranked_perps = [], []
default_preds, reranked_preds = [], []
references = []

for i, item in enumerate(tqdm(dataset["test"])):
    
        
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    ground_truth = item["sentence"]

    # Get Whisper transcriptions
    default_trans, n_best_hyps = get_whisper_transcriptions(audio_array, sampling_rate, n_best=5)

    # Without reranking
    default_perp = calculate_perplexity(default_trans, lm_model, lm_tokenizer, device)

    # With reranking
    reranked_trans, reranked_perp = rerank_hypotheses(n_best_hyps, lm_model, lm_tokenizer, device)

    references.append(ground_truth)
    default_preds.append(default_trans)
    reranked_preds.append(reranked_trans)

    default_perps.append(default_perp)
    reranked_perps.append(reranked_perp)

avg_default_perp = np.mean(default_perps)
avg_reranked_perp = np.mean(reranked_perps)

avg_default_wer = compute_wer(references, default_preds)
avg_reranked_wer = compute_wer(references, reranked_preds)
avg_default_normalized_wer = compute_normalized_wer(references, default_preds)
avg_reranked_normalized_wer = compute_normalized_wer(references, reranked_preds)

avg_default_cer = compute_cer(references, default_preds)
avg_reranked_cer = compute_cer(references, reranked_preds)

default_bleus = [compute_bleu(ref, hyp) for ref, hyp in zip(references, default_preds)]
reranked_bleus = [compute_bleu(ref, hyp) for ref, hyp in zip(references, reranked_preds)]
avg_default_bleu = np.mean(default_bleus)
avg_reranked_bleu = np.mean(reranked_bleus)

print(f"Average Default Perplexity: {avg_default_perp:.2f}")
print(f"Average Reranked Perplexity: {avg_reranked_perp:.2f}")
print(f"Global Default WER: {avg_default_wer:.4f}")
print(f"Global Reranked WER: {avg_reranked_wer:.4f}")
print(f"Global Default Normalized WER: {avg_default_normalized_wer:.4f}")
print(f"Global Reranked Normalized WER: {avg_reranked_normalized_wer:.4f}")
print(f"Global Default CER: {avg_default_cer:.4f}")
print(f"Global Reranked CER: {avg_reranked_cer:.4f}")
print(f"Average Default BLEU: {avg_default_bleu:.4f}")
print(f"Average Reranked BLEU: {avg_reranked_bleu:.4f}")

gc.collect()
torch.cuda.empty_cache()

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
100%|██████████| 1334/1334 [8:34:48<00:00, 23.16s/it]  


Average Default Perplexity: nan
Average Reranked Perplexity: 234.60
Global Default WER: 14.9019
Global Reranked WER: 14.6952
Global Default Normalized WER: 12.0567
Global Reranked Normalized WER: 11.9818
Global Default CER: 8.4469
Global Reranked CER: 8.6675
Average Default BLEU: 0.9203
Average Reranked BLEU: 0.9206
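
The `nan` reported for the default perplexity means at least one per-utterance value was not a number, because `np.mean` propagates `nan`. The cause is not diagnosed in this notebook. An empty or very short greedy transcription, or a float16 overflow in the loss, are possibilities rather than findings. Below is a minimal sketch of an average that skips non-finite values and reports how many were skipped. `finite_mean` is a hypothetical helper and the values are illustrative, not taken from this run.

```python
import math

def finite_mean(values):
    """Mean over finite values only, plus the number of values skipped."""
    finite = [v for v in values if math.isfinite(v)]
    skipped = len(values) - len(finite)
    mean = sum(finite) / len(finite) if finite else float("nan")
    return mean, skipped

perps = [210.5, float("nan"), 198.0, 305.5]  # illustrative values, not from this run
mean, skipped = finite_mean(perps)
print(f"Average perplexity over finite values: {mean:.2f} (skipped {skipped})")
```

Reporting the skipped count alongside the mean keeps the two averages comparable: if the default and reranked lists drop different utterances, the difference between them is no longer measured on the same data.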
