# Whisper Model Evaluation for Greek ASR

This notebook evaluates the performance of six Whisper models (base and fine-tuned versions of Small, Medium, and Large-v2) for automatic speech recognition (ASR) on Greek audio. The focus is to compare base and fine-tuned models on a combined Greek dataset using standard ASR metrics.


## Objective

* Evaluate Model Performance: Assess base and fine-tuned Whisper models for Greek ASR accuracy.
* Metrics: Word Error Rate (WER), Normalized WER, and Character Error Rate (CER).
* Dataset: A combined test set from "Vardis/Greek_Mosel", Common Voice (Greek), and Fleurs (Greek), standardized to 16kHz audio.
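
To make the metrics concrete, the sketch below shows how WER and CER are defined: the Levenshtein edit distance between hypothesis and reference, divided by the reference length, counted over words or over characters. This is illustrative plain Python, not the `evaluate`/`jiwer` implementation the notebook relies on; `edit_distance` and `corpus_error_rate` are hypothetical names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion (reference unit missing)
                            curr[j - 1] + 1,      # insertion (extra hypothesis unit)
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(references, hypotheses, unit="word"):
    """Total edits divided by total reference units, as a percentage."""
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(references, hypotheses))
    total = sum(len(split(r)) for r in references)
    return 100.0 * edits / total


# One substituted word and one dropped word out of four reference words.
print(corpus_error_rate(["a b c d"], ["a x c"]))               # word level (WER)
print(corpus_error_rate(["a b c d"], ["a x c"], unit="char"))  # character level (CER)
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions.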

## Workflow

### Dataset Preparation:

* Load and split datasets: "Vardis/Greek_Mosel", Common Voice 11.0 (el), and Fleurs (el_gr).
* Split each dataset into 80% train, 10% validation, and 10% test (seed 42), then concatenate the matching splits across datasets; only the combined test split is shuffled.
* Standardize audio to 16kHz and rename text fields to sentence.
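
The 80/10/10 split is produced in two stages: 20% is held out first, and that held-out part is then halved. The sketch below reproduces the resulting row counts with plain arithmetic. The rounding rule (test part rounded up) is an assumption about how `train_test_split` treats a fractional `test_size`; it matches the sizes printed later in this notebook. The per-dataset totals are the sums of those printed split sizes, and the helper names are hypothetical.

```python
import math


def split_sizes(n, test_size):
    """Return (n_train, n_test), rounding the test part up (assumed rounding rule)."""
    n_test = math.ceil(n * test_size)
    return n - n_test, n_test


def three_way_split(n):
    """Two-stage split: hold out 20%, then halve the held-out part."""
    n_train, n_holdout = split_sizes(n, 0.2)     # 80% train, 20% held out
    n_val, n_test = split_sizes(n_holdout, 0.5)  # validation and test halves
    return n_train, n_val, n_test


sizes = {
    "Greek_Mosel": three_way_split(3876),
    "Common Voice": three_way_split(5311),
    "Fleurs": three_way_split(4136),
}
for name, (n_train, n_val, n_test) in sizes.items():
    print(f"{name}: train={n_train}, validation={n_val}, test={n_test}")

combined_test = sum(n_test for _, _, n_test in sizes.values())
print(f"combined test set: {combined_test} samples")
```

The combined test size computed this way agrees with the 1334 items shown in the evaluation progress bars.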


### Model Setup:

* Load base Whisper models: openai/whisper-small, openai/whisper-medium, openai/whisper-large-v2.
* Load fine-tuned models: Vardis/Whisper-Small-Greek, Vardis/Whisper-Medium-Greek, Vardis/Whisper-Large-v2-Greek (LoRA adapters, merged into the corresponding base model before evaluation).
* Use torch.float16 and device_map="auto" for GPU acceleration.


### Evaluation Process:

* For each model (base and fine-tuned):
  - Generate transcriptions for the test set using greedy decoding (max_length=225).
  - Collect predictions and reference sentences.
  - Compute WER, Normalized WER, and CER.
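
The decoding settings above can also be passed to `generate` directly. The sketch below collects them in one place, assuming a `transformers` version whose Whisper `generate` accepts `language` and `task` keyword arguments; this is an alternative to setting `config.forced_decoder_ids`, not what the cells in this notebook do. `greek_generation_kwargs` is a hypothetical helper, and `num_beams=1` with `do_sample=False` simply spells out greedy decoding.

```python
def greek_generation_kwargs(max_length=225):
    """Keyword arguments for Whisper `generate`: greedy Greek transcription."""
    return {
        "language": "el",        # force Greek instead of relying on language detection
        "task": "transcribe",    # transcribe rather than translate to English
        "max_length": max_length,
        "num_beams": 1,          # greedy decoding: a single beam ...
        "do_sample": False,      # ... and no sampling
    }


# Intended use inside the evaluation loop (not executed here):
# pred_ids = whisper_model.generate(input_features, **greek_generation_kwargs())
print(greek_generation_kwargs())
```

Passing the language explicitly may matter on newer library versions, where a multilingual Whisper model otherwise falls back to language detection.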


### Results:

* Report metrics for each model:
  - WER: Word-level errors (standard).
  - Normalized WER: WER after lowercasing, removing punctuation, and standardizing whitespace.
  - CER: Character-level errors.
* Compare base vs. fine-tuned performance to assess the impact of fine-tuning.
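
As a small illustration of the normalization step behind Normalized WER, the sketch below applies the same three operations to Greek text. It differs from the notebook's helper in one detail, which is an assumption on my part rather than the author's code: the punctuation set is wrapped in `re.escape` so that characters such as `]` and the backslash are treated literally inside the character class.

```python
import re
import string


def normalize_text(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    # re.escape keeps punctuation characters from being read as regex syntax.
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Καλημέρα,   κόσμε!"))
print(normalize_text("«Γεια σου»"))
```

Only ASCII punctuation is removed, so Greek-specific marks such as the guillemets in the second example survive normalization and can still count as word errors.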


On this test set, every fine-tuned Whisper model scores lower WER, Normalized WER, and CER than its base counterpart; for example, Large-v2 WER drops from 26.41% to 14.90%.



In [1]:
!pip install evaluate jiwer
!pip install datasets==3.6.0



In [2]:
from datasets import load_dataset, IterableDatasetDict
import os


os.environ["CUDA_VISIBLE_DEVICES"] = "0"
language = "Greek"
language_abbr = "el"
language_abbr2 = "el_gr"
task = "transcribe"


a = IterableDatasetDict()
b = IterableDatasetDict()
c = IterableDatasetDict()


a_full = load_dataset("Vardis/Greek_Mosel", split="train")
a_temp = a_full.train_test_split(test_size=0.2, seed=42)  # 80% train 
a_val_test = a_temp["test"].train_test_split(test_size=0.5, seed=42)  # 10% val + 10% test
a["train"] = a_temp["train"]
a["validation"] = a_val_test["train"]
a["test"] = a_val_test["test"]

b_full = load_dataset("mozilla-foundation/common_voice_11_0", language_abbr, split="train+validation+test")
b_temp = b_full.train_test_split(test_size=0.2, seed=42)
b_val_test = b_temp["test"].train_test_split(test_size=0.5, seed=42)
b["train"] = b_temp["train"]
b["validation"] = b_val_test["train"]
b["test"] = b_val_test["test"]

c_full = load_dataset("google/fleurs", language_abbr2, split="train+validation+test")
c_temp = c_full.train_test_split(test_size=0.2, seed=42)
c_val_test = c_temp["test"].train_test_split(test_size=0.5, seed=42)
c["train"] = c_temp["train"]
c["validation"] = c_val_test["train"]
c["test"] = c_val_test["test"]



b = b.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
c = c.remove_columns(["id", "num_samples", "path", "raw_transcription", "gender", "lang_id", "language", "lang_group_id"])

a = a.rename_column("text", "sentence")
c = c.rename_column("transcription", "sentence")


print(a)
print(b)
print(c)

from datasets import Audio

a = a.cast_column("audio", Audio(sampling_rate=16000))
b = b.cast_column("audio", Audio(sampling_rate=16000))
c = c.cast_column("audio", Audio(sampling_rate=16000))

from datasets import concatenate_datasets

combined_train = concatenate_datasets([a['train'], b['train'], c['train']])
combined_test = concatenate_datasets([a['test'], b['test'], c['test']])
combined_test = combined_test.shuffle(seed=42)
combined_valid = concatenate_datasets([a['validation'], b['validation'], c['validation']])

combined_dataset = IterableDatasetDict({
    'train': combined_train,
    "validation": combined_valid,
    'test': combined_test
})

dataset = combined_dataset
print(dataset)

IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3100
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 388
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 388
    })
})
IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4248
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 531
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 532
    })
})
IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 3308
    })
    validation: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 414
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 414
    })
})
IterableDatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
       

## Small Whisper

### Base Model

In [5]:
import re
import string
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import evaluate
from tqdm import tqdm
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"


whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    torch_dtype=torch.float16,
    device_map="auto"
)


processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Normalize text for WER computation: lowercase, remove punctuation, standardize whitespace.
def normalize_text(text):
    text = text.lower()
    text = re.sub(f"[{string.punctuation}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

pred_strs = []
ref_strs = []

for item in tqdm(dataset["test"]):
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    reference = item["sentence"]

    # Convert audio to input features
    input_features = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).input_features.to(device, dtype=whisper_model.dtype)

    # Decode with the base Whisper model (greedy decoding)
    with torch.no_grad():
        pred_ids = whisper_model.generate(input_features, max_length=225)

    prediction = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    pred_strs.append(prediction)
    ref_strs.append(reference)

wer_score = compute_wer(ref_strs, pred_strs)
print(f"Test WER: {wer_score:.2f}%")

cer_score = 100 * cer_metric.compute(predictions=pred_strs, references=ref_strs)
print(f"Test CER: {cer_score:.2f}%")

normalized_wer_score = compute_normalized_wer(ref_strs, pred_strs)
print(f"Test Normalized WER: {normalized_wer_score:.2f}%")

import torch
import gc

gc.collect()
torch.cuda.empty_cache()


100%|██████████| 1334/1334 [21:55<00:00,  1.01it/s]


Test WER: 43.62%
Test CER: 21.61%
Test Normalized WER: 36.69%
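
The evaluation cells in this notebook repeat the same loop for each model. One way to factor it out is sketched below, with the model-specific part reduced to a `transcribe` callable. The function name, its arguments, and the toy stand-ins are hypothetical; a real run would wrap the processor and `generate` calls in `transcribe` and pass `compute_wer` and `compute_normalized_wer` as metrics.

```python
def evaluate_transcriber(transcribe, samples, metrics):
    """Run `transcribe` on every sample and score the collected outputs.

    transcribe: callable mapping one sample dict to a predicted string
    samples:    iterable of dicts with a "sentence" reference field
    metrics:    dict of name -> callable(references, predictions) -> float
    """
    predictions, references = [], []
    for sample in samples:
        predictions.append(transcribe(sample))
        references.append(sample["sentence"])
    return {name: fn(references, predictions) for name, fn in metrics.items()}


def exact_match(references, predictions):
    """Toy metric: percentage of predictions identical to their reference."""
    hits = sum(r == p for r, p in zip(references, predictions))
    return 100.0 * hits / len(references)


# Toy usage: a stand-in "model" that always returns the same string.
toy_samples = [
    {"audio": None, "sentence": "γεια σου"},
    {"audio": None, "sentence": "καλημέρα"},
]
scores = evaluate_transcriber(lambda sample: "γεια σου", toy_samples, {"exact_match": exact_match})
print(scores)
```

Keeping the loop in one function means the metric helpers are defined once and every model is scored in exactly the same way.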


### Fine-tuned Model

In [6]:

base_whisper = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA weights
ft_whisper = PeftModel.from_pretrained(
    base_whisper, 
    "Vardis/Whisper-Small-Greek"
)

# Merge LoRA → base weights
whisper_model = ft_whisper.merge_and_unload().to(device)

processor = WhisperProcessor.from_pretrained("Vardis/Whisper-Small-Greek")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    text = text.lower()
    text = re.sub(f"[{string.punctuation}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

pred_strs = []
ref_strs = []

for item in tqdm(dataset["test"]):
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    reference = item["sentence"]

    input_features = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(input_features, max_length=225)

    prediction = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    pred_strs.append(prediction)
    ref_strs.append(reference)

wer_score = compute_wer(ref_strs, pred_strs)
print(f"Test WER (fine-tuned Small): {wer_score:.2f}%")

cer_score = 100 * cer_metric.compute(predictions=pred_strs, references=ref_strs)
print(f"Test CER (fine-tuned Small): {cer_score:.2f}%")

normalized_wer_score = compute_normalized_wer(ref_strs, pred_strs)
print(f"Test Normalized WER (fine-tuned Small): {normalized_wer_score:.2f}%")

import torch
import gc

gc.collect()
torch.cuda.empty_cache()

100%|██████████| 1334/1334 [22:16<00:00,  1.00s/it]


Test WER (fine-tuned Small): 30.31%
Test CER (fine-tuned Small): 13.28%
Test Normalized WER (fine-tuned Small): 26.54%


## Medium Whisper

### Base Model

In [7]:
whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-medium",
    torch_dtype=torch.float16,
    device_map="auto"
)


processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    text = text.lower()
    text = re.sub(f"[{string.punctuation}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

pred_strs = []
ref_strs = []

for item in tqdm(dataset["test"]):
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    reference = item["sentence"]

    input_features = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(input_features, max_length=225)

    prediction = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    pred_strs.append(prediction)
    ref_strs.append(reference)

wer_score = compute_wer(ref_strs, pred_strs)
print(f"Test WER: {wer_score:.2f}%")

cer_score = 100 * cer_metric.compute(predictions=pred_strs, references=ref_strs)
print(f"Test CER: {cer_score:.2f}%")

normalized_wer_score = compute_normalized_wer(ref_strs, pred_strs)
print(f"Test Normalized WER: {normalized_wer_score:.2f}%")

gc.collect()
torch.cuda.empty_cache()


100%|██████████| 1334/1334 [39:54<00:00,  1.80s/it] 


Test WER: 34.71%
Test CER: 19.30%
Test Normalized WER: 27.21%


### Fine-tuned Model

In [8]:
base_whisper = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-medium",
    torch_dtype=torch.float16,
    device_map="auto"
)

ft_whisper = PeftModel.from_pretrained(
    base_whisper, 
    "Vardis/Whisper-Medium-Greek"
)

whisper_model = ft_whisper.merge_and_unload().to(device)

processor = WhisperProcessor.from_pretrained("Vardis/Whisper-Medium-Greek")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    text = text.lower()
    text = re.sub(f"[{string.punctuation}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

pred_strs = []
ref_strs = []

for item in tqdm(dataset["test"]):
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    reference = item["sentence"]

    input_features = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(input_features, max_length=225)

    prediction = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    pred_strs.append(prediction)
    ref_strs.append(reference)

wer_score = compute_wer(ref_strs, pred_strs)
print(f"Test WER (fine-tuned Medium): {wer_score:.2f}%")

cer_score = 100 * cer_metric.compute(predictions=pred_strs, references=ref_strs)
print(f"Test CER (fine-tuned Medium): {cer_score:.2f}%")

normalized_wer_score = compute_normalized_wer(ref_strs, pred_strs)
print(f"Test Normalized WER (fine-tuned Medium): {normalized_wer_score:.2f}%")

gc.collect()
torch.cuda.empty_cache()

100%|██████████| 1334/1334 [42:47<00:00,  1.92s/it]


Test WER (fine-tuned Medium): 19.45%
Test CER (fine-tuned Medium): 8.96%
Test Normalized WER (fine-tuned Medium): 16.17%


## Large Whisper V2

### Base Model

In [9]:
whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device_map="auto"
)

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    text = text.lower()
    text = re.sub(f"[{string.punctuation}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

pred_strs = []
ref_strs = []

for item in tqdm(dataset["test"]):
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    reference = item["sentence"]

    input_features = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(input_features, max_length=225)

    prediction = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    pred_strs.append(prediction)
    ref_strs.append(reference)

wer_score = compute_wer(ref_strs, pred_strs)
print(f"Test WER: {wer_score:.2f}%")

cer_score = 100 * cer_metric.compute(predictions=pred_strs, references=ref_strs)
print(f"Test CER: {cer_score:.2f}%")

normalized_wer_score = compute_normalized_wer(ref_strs, pred_strs)
print(f"Test Normalized WER: {normalized_wer_score:.2f}%")

gc.collect()
torch.cuda.empty_cache()


100%|██████████| 1334/1334 [55:51<00:00,  2.51s/it] 


Test WER: 26.41%
Test CER: 14.55%
Test Normalized WER: 18.86%


### Fine-tuned Model

In [7]:
base_whisper = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device_map="auto"
)

ft_whisper = PeftModel.from_pretrained(
    base_whisper, 
    "Vardis/Whisper-Large-v2-Greek"
)

whisper_model = ft_whisper.merge_and_unload().to(device)

processor = WhisperProcessor.from_pretrained("Vardis/Whisper-Large-v2-Greek")
whisper_model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="el", task="transcribe")

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def normalize_text(text):
    text = text.lower()
    text = re.sub(f"[{string.punctuation}]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def compute_wer(references, hypotheses):
    if len(references) != len(hypotheses):
        raise ValueError("References and hypotheses must have the same length")
    wer_score = wer_metric.compute(predictions=hypotheses, references=references)
    return 100 * wer_score

def compute_normalized_wer(references, hypotheses):
    """Compute WER after normalizing references and hypotheses."""
    normalized_refs = [normalize_text(ref) for ref in references]
    normalized_hyps = [normalize_text(hyp) for hyp in hypotheses]
    wer_score = wer_metric.compute(predictions=normalized_hyps, references=normalized_refs)
    return 100 * wer_score

pred_strs = []
ref_strs = []

for item in tqdm(dataset["test"]):
    audio_array = item["audio"]["array"]
    sampling_rate = item["audio"]["sampling_rate"]
    reference = item["sentence"]

    input_features = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).input_features.to(device, dtype=whisper_model.dtype)

    with torch.no_grad():
        pred_ids = whisper_model.generate(input_features, max_length=225)

    prediction = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    pred_strs.append(prediction)
    ref_strs.append(reference)

wer_score = compute_wer(ref_strs, pred_strs)
print(f"Test WER (fine-tuned Large-v2): {wer_score:.2f}%")

cer_score = 100 * cer_metric.compute(predictions=pred_strs, references=ref_strs)
print(f"Test CER (fine-tuned Large-v2): {cer_score:.2f}%")

normalized_wer_score = compute_normalized_wer(ref_strs, pred_strs)
print(f"Test Normalized WER (fine-tuned Large-v2): {normalized_wer_score:.2f}%")

import gc  # needed for gc.collect() below

gc.collect()
torch.cuda.empty_cache()

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 1334/1334 [57:54<00:00,  2.60s/it]


Test WER (fine-tuned Large-v2): 14.90%
Test CER (fine-tuned Large-v2): 8.45%
Test Normalized WER (fine-tuned Large-v2): 12.06%
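
To compare base and fine-tuned models at a glance, the sketch below computes the relative error reduction for each model size. The scores are copied from the cell outputs in this notebook; `relative_reduction` is a hypothetical helper, and the summary layout is only one possible presentation.

```python
# Scores copied from the cell outputs above (percentages).
results = {
    "Small": {
        "base": {"WER": 43.62, "Normalized WER": 36.69, "CER": 21.61},
        "fine-tuned": {"WER": 30.31, "Normalized WER": 26.54, "CER": 13.28},
    },
    "Medium": {
        "base": {"WER": 34.71, "Normalized WER": 27.21, "CER": 19.30},
        "fine-tuned": {"WER": 19.45, "Normalized WER": 16.17, "CER": 8.96},
    },
    "Large-v2": {
        "base": {"WER": 26.41, "Normalized WER": 18.86, "CER": 14.55},
        "fine-tuned": {"WER": 14.90, "Normalized WER": 12.06, "CER": 8.45},
    },
}


def relative_reduction(base, tuned):
    """Percentage by which the fine-tuned score improves on the base score."""
    return 100.0 * (base - tuned) / base


for size, scores in results.items():
    for metric in ("WER", "Normalized WER", "CER"):
        base, tuned = scores["base"][metric], scores["fine-tuned"][metric]
        print(f"{size:9s} {metric:15s} {base:6.2f} -> {tuned:6.2f} "
              f"({relative_reduction(base, tuned):.1f}% relative reduction)")
```

Relative reduction puts the three model sizes on a common scale, since their base error rates differ considerably.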