## Summary

This notebook evaluates the Whisper **Medium** model on 2,000 randomly sampled audio utterances from the dataset _La Banque Sonore des Dialectes Bretons_, representing approximately **27%** of the full dataset.

The evaluation was conducted entirely on **Kaggle**.

The following metrics were computed:

- **Average WER** (Word Error Rate) and **CER** (Character Error Rate) by **municipality**
- **Average WER** and **CER** by **department**
- **Overall WER** and **CER** across the full sample
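
As a point of reference, both metrics are edit distances normalized by the reference length: WER counts word-level substitutions, deletions and insertions, and CER does the same over characters. The sketch below is a plain-Python illustration of these definitions, not the `jiwer` implementation used later in the notebook; the function names are chosen for this example only. Because insertions are counted, either metric can exceed 1 when the hypothesis is much longer than the reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    # prev[j]: distance between the reference prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def simple_wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def simple_cer(reference, hypothesis):
    """Character error rate: character-level edits divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One reference word against a five-token hypothesis: 1 substitution + 4 insertions
print(simple_wer("skarzhañ", "атurate a notre biblioteca ."))  # 5.0
# One wrong character out of four
print(simple_cer("mari", "mori"))  # 0.25
```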

In [33]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/banque-sonore-data/Audio/utt_00946.wav
/kaggle/input/banque-sonore-data/Audio/utt_02806.wav
/kaggle/input/banque-sonore-data/Audio/utt_00472.wav
/kaggle/input/banque-sonore-data/Audio/utt_05266.wav
/kaggle/input/banque-sonore-data/Audio/utt_05171.wav
/kaggle/input/banque-sonore-data/Audio/utt_02248.wav
/kaggle/input/banque-sonore-data/Audio/utt_02988.wav
/kaggle/input/banque-sonore-data/Audio/utt_06666.wav
/kaggle/input/banque-sonore-data/Audio/utt_02938.wav
/kaggle/input/banque-sonore-data/Audio/utt_04330.wav
/kaggle/input/banque-sonore-data/Audio/utt_02842.wav
/kaggle/input/banque-sonore-data/Audio/utt_07170.wav
/kaggle/input/banque-sonore-data/Audio/utt_04092.wav
/kaggle/input/banque-sonore-data/Audio/utt_07114.wav
/kaggle/input/banque-sonore-data/Audio/utt_07267.wav
/kaggle/input/banque-sonore-data/Audio/utt_02750.wav
/kaggle/input/banque-sonore-data/Audio/utt_01732.wav
/kaggle/input/banque-sonore-data/Audio/utt_02622.wav
/kaggle/input/banque-sonore-data/Audio/utt_022

In [3]:
!pip install jiwer

Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-4.0.0 rapidfuzz-3.13.0


In [4]:
!pip install -U openai-whisper

Collecting openai-whisper
  Downloading openai_whisper-20250625.tar.gz (803 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->openai-whisper)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch->openai-whisper)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch->openai-whisper)
  Downloading nvidi

In [34]:
import sys

# Make the helper modules stored under /kaggle/input/python-code importable
sys.path.append('/kaggle/input/python-code/text')

import os
import wave
import json
import jiwer
import audio
import whisper

In [6]:
import pandas as pd
df = pd.read_parquet("hf://datasets/Bretagne/Banque_Sonore_Dialectes_Bretons/train.parquet")

In [35]:
def fix_row(row):
    """Recover the Breton text for rows whose 'fr' field carries a 'peurunvan :' marker."""
    if 'peurunvan :' in row['fr']:
        # Extract the Breton text that follows "peurunvan :"
        new_text = row['fr'].split('peurunvan :', 1)[1].strip()
        # Overwrite both the 'br' and 'fr' columns with the extracted text
        row['br'] = new_text
        row['fr'] = new_text
    return row

df = df.apply(fix_row, axis=1)
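
To make the effect of `fix_row` concrete, the snippet below runs the same logic on two plain dictionaries standing in for DataFrame rows. The sample strings are invented for illustration and are not taken from the dataset; only the `'peurunvan :'` marker comes from the cell above.

```python
def fix_row(row):
    # Same logic as the notebook cell, applied here to a plain dict
    if 'peurunvan :' in row['fr']:
        new_text = row['fr'].split('peurunvan :', 1)[1].strip()
        row['br'] = new_text
        row['fr'] = new_text
    return row


# Hypothetical rows: the first one carries the marker, the second does not
annotated = {'br': 'skarzhan', 'fr': 'peurunvan : skarzhañ'}
regular = {'br': "ur c'harzh", 'fr': 'une haie'}

print(fix_row(annotated))  # both fields now hold the text after the marker
print(fix_row(regular))    # returned unchanged
```

After this fix, the affected rows no longer hold a French translation: `fr` contains the same Breton text as `br`.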

In [36]:
df

Unnamed: 0,audio,br,fr,city
0,{'bytes': b'RIFF$x\x06\x00WAVEfmt \x10\x00\x00...,Pell zo n'eo ket o welet ac'hanon.,Ça fait longtemps qu'il n'est pas venu me voir.,22107
1,{'bytes': b'RIFF$\xb1\x03\x00WAVEfmt \x10\x00\...,N'oc'h ket bet pell zo o welet ac'hanon.,Vous n'êtes pas venus me voir depuis longtemps.,22107
2,{'bytes': b'RIFF$r\x03\x00WAVEfmt \x10\x00\x00...,Pell zo n'eo ket bet o welet ac'hanon.,Ça fait longtemps qu'il n'est pas venu me voir.,22107
3,{'bytes': b'RIFF$\x97\x05\x00WAVEfmt \x10\x00\...,Ma... ma zi ne vo ket gwerzhet.,Ma... ma maison ne sera pas vendue.,22107
4,{'bytes': b'RIFF$\xa0\x05\x00WAVEfmt \x10\x00\...,Hon... hon ti ne vo ket gwerzhet.,Notre... notre maison ne sera pas vendue.,22107
...,...,...,...,...
7286,{'bytes': b'RIFF$\xce\x07\x00WAVEfmt \x10\x00\...,Ne zeue ket ag ar vro Kemper. Ne zeue ket ag a...,Il ne venait pas du pays de Quimper. Il ne ven...,56247
7287,{'bytes': b'RIFF$\xa6\x08\x00WAVEfmt \x10\x00\...,Bout a oa linad razh d'en-dro ar feunteun. Bou...,Il y avait des orties tout autour de la fontai...,56247
7288,{'bytes': b'RIFF$i\x03\x00WAVEfmt \x10\x00\x00...,Ne oa ket aes o zennañ.,Ce n'était pas facile de les arracher.,56247
7289,{'bytes': b'RIFF$3\x03\x00WAVEfmt \x10\x00\x00...,Ne oa ket aes o zennañ.,Ce n'était pas facile de les arracher.,56247


In [37]:
dataset_path = "/kaggle/input/banque-sonore-data/Audio"

In [38]:
model = whisper.load_model("medium")

100%|█████████████████████████████████████| 1.42G/1.42G [04:46<00:00, 5.34MiB/s]


In [39]:
from load_ground_truth import load_ground_truth_dict

In [40]:
import sys
sys.path.append('/kaggle/input/python-code/text')

from transcriber_whisper import transcribe_audio

In [13]:
!pip install universal_edit_distance

Collecting universal_edit_distance
  Downloading universal_edit_distance-0.4.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.0 kB)
Downloading universal_edit_distance-0.4.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (275 kB)
Installing collected packages: universal_edit_distance
Successfully installed universal_edit_distance-0.4.2


In [41]:
sys.path.append('/kaggle/input/python-code/text')

from evaluate_wer_cer import evaluate_wer_cer

In [42]:
def generate_transcription(row):
    file_id = row['file']
    utt_name = f"utt_{file_id:05d}.wav"
    full_path = os.path.join(dataset_path, utt_name)
    return transcribe_audio(full_path, model)
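
The audio file name is derived from the row's original index, zero-padded to five digits. The snippet below checks that mapping in isolation; `utterance_path` is a hypothetical helper introduced for this example and mirrors only the path-building part of `generate_transcription`.

```python
import posixpath

dataset_path = "/kaggle/input/banque-sonore-data/Audio"


def utterance_path(file_id, root=dataset_path):
    """Build the WAV path for a row index (hypothetical helper)."""
    # {file_id:05d} left-pads the integer index with zeros to five digits
    return posixpath.join(root, f"utt_{file_id:05d}.wav")


print(utterance_path(1057))
# /kaggle/input/banque-sonore-data/Audio/utt_01057.wav
print(utterance_path(981))
# /kaggle/input/banque-sonore-data/Audio/utt_00981.wav
```

The `05d` format spec requires an integer, so this scheme relies on the DataFrame keeping its default integer index.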

In [43]:
# Draw a random sample of 2,000 utterances; random_state makes it reproducible
small_df = df.sample(n=2000, random_state=42)
small_df

Unnamed: 0,audio,br,fr,city
1057,{'bytes': b'RIFF\x8c\xdf\x00\x00WAVEfmt \x10\x...,ur c'harzh,une haie,29032
2654,{'bytes': b'RIFF\xae\xe5\x04\x00WAVEfmt \x10\x...,Mari zo deuet a-benn enno antronoz.,Marie est venue vers eux le lendemain.,29168
4401,{'bytes': b'RIFF$1\x08\x00WAVEfmt \x10\x00\x00...,"An amzer gwechall, ne oa... ne oa ket, eu... n...","Autrefois, c'était... c'était, euh... ça n'éta...",29230
5843,{'bytes': b'RIFF\xa4\x81\x03\x00WAVEfmt \x10\x...,Piv zo bet o troc'hiñ ar wezenn ?,Qui a coupé l'arbre ?,29293
7177,{'bytes': b'RIFF$e\x04\x00WAVEfmt \x10\x00\x00...,"Pa oufemp ar wirionez, n'he lavarfemp ket.","Si nous savions la vérité, nous ne la dirions ...",56175
...,...,...,...,...
981,{'bytes': b'RIFF8@\x01\x00WAVEfmt \x10\x00\x00...,skarzhañ,vider,29032
1732,{'bytes': b'RIFF$\xc4\x05\x00WAVEfmt \x10\x00\...,Lak da zornoù e-barzh da boch ! Tenn anezhe er...,Mets tes mains dans ta poche ! Retire-les d'en...,29150
6185,{'bytes': b'RIFF$F\x05\x00WAVEfmt \x10\x00\x00...,"Nend, nend ay ket ganin bremañ, c'hwi [?], me ...","Nend, nend ay ket ganin bremañ, c'hwi [?], me ...",56076
2354,{'bytes': b'RIFF\x08j\x06\x00WAVEfmt \x10\x00\...,"Oh nann, mes me a gred lar mar vije bet chomet...","Oh nann, mes me a gred lar mar vije bet chomet...",29153


In [44]:
# Keep the original index in a 'file' column: it identifies the audio file
small_df['file'] = small_df.index

# Transcribe every sampled utterance with Whisper
small_df['transcription'] = small_df.apply(generate_transcription, axis=1)

# Optional: reset the index for a clean new DataFrame
small_df = small_df.reset_index(drop=True)

Transcribing /kaggle/input/banque-sonore-data/Audio/utt_01057.wav...
[00:00.000 --> 00:00.500]  Chass!
Transcribing /kaggle/input/banque-sonore-data/Audio/utt_02654.wav...
[00:00.000 --> 00:03.980]  Des Canaries ou iki personnes Lilian.
Transcribing /kaggle/input/banque-sonore-data/Audio/utt_04401.wav...
[00:00.000 --> 00:02.300]  pubsberdy73hw hin??
Transcribing /kaggle/input/banque-sonore-data/Audio/utt_07177.wav...
[00:00.000 --> 00:03.000]  A pouwèyèm erwèrjonni nìlarem tʃət.
Transcribing /kaggle/input/banque-sonore-data/Audio/utt_03805.wav...
[00:00.000 --> 00:03.000]  Es ist schon zu mein Besuch.
Transcribing /kaggle/input/banque-sonore-data/Audio/utt_03961.wav...
[00:00.000 --> 00:02.860]  Ben mis labouret, mis goniume fée.
Transcribing /kaggle/input/banque-sonore-data/Audio/utt_06147.wav...
[00:00.000 --> 00:03.360]  Now I'm so better do a fatten, the ray never proud.
Transcribing /kaggle/input/banque-sonore-data/Audio/utt_03551.wav...
[00:00.000 --> 00:20.720]  Falls
Transcrib

In [25]:
small_df

Unnamed: 0,audio,br,fr,city,file,transcription
0,{'bytes': b'RIFF\x8c\xdf\x00\x00WAVEfmt \x10\x...,ur c'harzh,une haie,29032,1057,tBack
1,{'bytes': b'RIFF\xae\xe5\x04\x00WAVEfmt \x10\x...,Mari zo deuet a-benn enno antronoz.,Marie est venue vers eux le lendemain.,29168,2654,"Mori, aso du benen internoos."
2,{'bytes': b'RIFF$1\x08\x00WAVEfmt \x10\x00\x00...,"An amzer gwechall, ne oa... ne oa ket, eu... n...","Autrefois, c'était... c'était, euh... ça n'éta...",29230,4401,N Tubzrodutile shelling The baby she is looki...
3,{'bytes': b'RIFF\xa4\x81\x03\x00WAVEfmt \x10\x...,Piv zo bet o troc'hiñ ar wezenn ?,Qui a coupé l'arbre ?,29293,5843,pún
4,{'bytes': b'RIFF$e\x04\x00WAVEfmt \x10\x00\x00...,"Pa oufemp ar wirionez, n'he lavarfemp ket.","Si nous savions la vérité, nous ne la dirions ...",56175,7177,W age sk sabesrati per dia 15.
...,...,...,...,...,...,...
1995,{'bytes': b'RIFF8@\x01\x00WAVEfmt \x10\x00\x00...,skarzhañ,vider,29032,981,Skaazi
1996,{'bytes': b'RIFF$\xc4\x05\x00WAVEfmt \x10\x00\...,Lak da zornoù e-barzh da boch ! Tenn anezhe er...,Mets tes mains dans ta poche ! Retire-les d'en...,29150,1732,"Lactazor no batabosh, tennemis to stahol."
1997,{'bytes': b'RIFF$F\x05\x00WAVEfmt \x10\x00\x00...,"Nend, nend ay ket ganin bremañ, c'hwi [?], me ...","Nend, nend ay ket ganin bremañ, c'hwi [?], me ...",56076,6185,d punches mit z студenta
1998,{'bytes': b'RIFF\x08j\x06\x00WAVEfmt \x10\x00\...,"Oh nann, mes me a gred lar mar vije bet chomet...","Oh nann, mes me a gred lar mar vije bet chomet...",29153,2354,Mennaudne melee la hamā vier reckless bot car


In [45]:
from filter_char import filter_out_chars
from normalizer import normalize_sentence
from utils import pre_process


def process_br_text(br):
    PUNCTUATION = '<>.?!,;:«»“”"()[]/…–—•'
    br = filter_out_chars(br, PUNCTUATION + '*')
    br = normalize_sentence(br, autocorrect=True)
    br = pre_process(br).replace('-', ' ').lower()
    return br
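
`filter_out_chars`, `normalize_sentence` and `pre_process` come from helper modules that are not shown in this notebook. As a rough stand-in, the sketch below reproduces only the visible steps (punctuation removal, hyphen splitting, lowercasing) with the standard library. It does not attempt the autocorrection done by `normalize_sentence`, and the name `rough_normalize` is hypothetical.

```python
PUNCTUATION = '<>.?!,;:«»“”"()[]/…–—•'


def rough_normalize(text, drop=PUNCTUATION + '*'):
    """Approximate the visible steps of process_br_text (no autocorrection)."""
    # Delete every character listed in `drop`
    text = text.translate({ord(ch): None for ch in drop})
    # Split hyphenated words and lowercase
    text = text.replace('-', ' ').lower()
    # Collapse the whitespace left behind by removed punctuation
    return ' '.join(text.split())


print(rough_normalize("Mari zo deuet a-benn enno antronoz."))
# mari zo deuet a benn enno antronoz
print(rough_normalize("Piv zo bet o troc'hiñ ar wezenn ?"))
# piv zo bet o troc'hiñ ar wezenn
```

A function of this kind could also be applied to the Whisper output before scoring, so that reference and hypothesis are compared on an equal footing.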

In [46]:
small_df['br_processed'] = small_df['br'].apply(process_br_text)

In [47]:
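from jiwer import wer, cer

# Compute WER and CER for each row (reference: normalized Breton text, hypothesis: Whisper output)
# Note: the hypothesis is passed as-is, so its casing and punctuation count as errors
small_df['wer'] = small_df.apply(lambda row: wer(row['br_processed'], row['transcription']), axis=1)
small_df['cer'] = small_df.apply(lambda row: cer(row['br_processed'], row['transcription']), axis=1)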

In [48]:
small_df = small_df[['br', 'br_processed', 'fr', 'transcription', 'city', 'wer', 'cer', 'file']]
small_df

Unnamed: 0,br,br_processed,fr,transcription,city,wer,cer,file
0,ur c'harzh,ur c'harzh,une haie,Chass!,29032,1.0,0.800000,1057
1,Mari zo deuet a-benn enno antronoz.,mari zo deuet a benn enno antronoz,Marie est venue vers eux le lendemain.,Des Canaries ou iki personnes Lilian.,29168,1.0,0.882353,2654
2,"An amzer gwechall, ne oa... ne oa ket, eu... n...",an amzer gwechall ne oa ne oa ket eu ne oa ket...,"Autrefois, c'était... c'était, euh... ça n'éta...","On a besoin d'une échelle, on a besoin d'une ...",29230,1.0,0.660714,4401
3,Piv zo bet o troc'hiñ ar wezenn ?,piv zo bet o troc'hiñ ar wezenn,Qui a coupé l'arbre ?,pubsberdy73hw hin??,29293,1.0,0.838710,5843
4,"Pa oufemp ar wirionez, n'he lavarfemp ket.",pa oufemp ar wirionez n'he lavarfemp ket,"Si nous savions la vérité, nous ne la dirions ...",A pouwèyèm erwèrjonni nìlarem tʃət.,56175,1.0,0.650000,7177
...,...,...,...,...,...,...,...,...
1995,skarzhañ,skarzhañ,vider,атurate a notre biblioteca .,29032,5.0,3.125000,981
1996,Lak da zornoù e-barzh da boch ! Tenn anezhe er...,lak da zornoù e barzh da boch tenn anezhe er m...,Mets tes mains dans ta poche ! Retire-les d'en...,"If you have any questions, please leave a com...",29150,1.0,0.803030,1732
1997,"Nend, nend ay ket ganin bremañ, c'hwi [?], me ...",nend nend ay ket ganin bremañ c'hwi me am eus ...,"Nend, nend ay ket ganin bremañ, c'hwi [?], me ...","D'ici que n'aimons pas bien. M'a mis en 1714,...",56076,1.0,0.768293,6185
1998,"Oh nann, mes me a gred lar mar vije bet chomet...",oh nann mes me a gred lâr mar vije bet chomet ...,"Oh nann, mes me a gred lar mar vije bet chomet...","Und dann ребята lachen, und dann schauen, cou...",29153,1.0,0.784091,2354


In [49]:
city_avg = small_df.groupby('city')[['wer', 'cer']].mean()

# Optional: round to 2 decimal places
city_avg = city_avg.round(2)

for city_code, row in city_avg.iterrows():
    print(f"Average WER and CER for {city_code}:\nWER: {row['wer']}\nCER: {row['cer']}\n")

Average WER and CER for 22107:
WER: 1.02
CER: 0.74

Average WER and CER for 22167:
WER: 1.05
CER: 0.77

Average WER and CER for 22181:
WER: 1.07
CER: 0.86

Average WER and CER for 22230:
WER: 1.07
CER: 0.82

Average WER and CER for 22331:
WER: 1.02
CER: 0.72

Average WER and CER for 22336:
WER: 0.99
CER: 0.68

Average WER and CER for 22386:
WER: 1.0
CER: 0.66

Average WER and CER for 29005:
WER: 1.04
CER: 0.77

Average WER and CER for 29031:
WER: 1.83
CER: 1.59

Average WER and CER for 29032:
WER: 1.12
CER: 0.82

Average WER and CER for 29058:
WER: 0.97
CER: 0.68

Average WER and CER for 29080:
WER: 1.0
CER: 0.71

Average WER and CER for 29082:
WER: 1.25
CER: 0.77

Average WER and CER for 29091:
WER: 0.99
CER: 0.74

Average WER and CER for 29105:
WER: 1.02
CER: 0.68

Average WER and CER for 29122:
WER: 1.01
CER: 0.74

Average WER and CER for 29136:
WER: 1.03
CER: 0.75

Average WER and CER for 29139:
WER: 1.0
CER: 0.75

Average WER and CER for 29146:
WER: 1.0
CER: 0.99

Average WER and 

In [50]:
# Filter for cities starting with '22' (Côtes-d'Armor)
small_df_22 = small_df[small_df['city'].str.startswith('22')]
avg_22 = small_df_22[['wer', 'cer']].mean().round(4)

# Filter for cities starting with '29' (Finistère)
small_df_29 = small_df[small_df['city'].str.startswith('29')]
avg_29 = small_df_29[['wer', 'cer']].mean().round(4)

# Filter for cities starting with '56' (Morbihan)
small_df_56 = small_df[small_df['city'].str.startswith('56')]
avg_56 = small_df_56[['wer', 'cer']].mean().round(4)

print("Evaluation for --Côtes-d'Armor-- dialects 22xxx :")
print(f"WER: {avg_22['wer']}")
print(f"CER: {avg_22['cer']}\n")

print("Evaluation for --Finistère-- dialects 29xxx :")
print(f"WER: {avg_29['wer']}")
print(f"CER: {avg_29['cer']}\n")

print("Evaluation for --Morbihan-- dialects 56xxx :")
print(f"WER: {avg_56['wer']}")
print(f"CER: {avg_56['cer']}")

Evaluation for --Côtes-d'Armor-- dialects 22xxx :
WER: 1.0445
CER: 0.7724

Evaluation for --Finistère-- dialects 29xxx :
WER: 1.0476
CER: 0.7725

Evaluation for --Morbihan-- dialects 56xxx :
WER: 1.0792
CER: 0.8193
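
The three filters above differ only in the two-character department prefix, so the per-department averages can also be computed in a single pass by grouping on that prefix. The sketch below shows the idea in plain Python on a few made-up (city, WER) pairs; the values are illustrative and are not results from this evaluation.

```python
from collections import defaultdict
from statistics import mean

# Department names as used in the notebook's print statements
DEPARTMENTS = {'22': "Côtes-d'Armor", '29': 'Finistère', '56': 'Morbihan'}

# Made-up (city code, WER) pairs, for illustration only
rows = [('22107', 1.0), ('22167', 0.5), ('29032', 1.5), ('56175', 1.0), ('56076', 2.0)]

# Group the scores by the first two characters of the city code
by_department = defaultdict(list)
for city, score in rows:
    by_department[str(city)[:2]].append(score)

averages = {prefix: mean(scores) for prefix, scores in by_department.items()}
for prefix, avg in sorted(averages.items()):
    print(f"{DEPARTMENTS[prefix]} ({prefix}xxx): WER {avg}")
```

In pandas, the same grouping could be written as `small_df.groupby(small_df['city'].astype(str).str[:2])[['wer', 'cer']].mean()`.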


In [51]:
average_wer = small_df['wer'].mean()
average_cer = small_df['cer'].mean()

# Optional: round them
average_wer = round(average_wer, 2)
average_cer = round(average_cer, 2)

print(f"Overall Average WER: {average_wer}")
print(f"Overall Average CER: {average_cer}")

Overall Average WER: 1.05
Overall Average CER: 0.78
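
The overall figures above are means of per-utterance scores, so a one-word utterance weighs as much as a long sentence. A common alternative is the corpus-level rate, which divides the total number of edits by the total number of reference words. The sketch below contrasts the two on made-up counts; the numbers are illustrative only, and the notebook itself does not compute the corpus-level variant.

```python
# Made-up (word_errors, reference_word_count) pairs, for illustration only
utterances = [
    (5, 1),   # one-word reference, five-token hypothesis -> WER 5.0
    (7, 7),   # every word wrong -> WER 1.0
    (4, 8),   # half of the words wrong -> WER 0.5
]

# Mean of per-utterance WER (the averaging used in this notebook)
per_utterance = [errors / ref_len for errors, ref_len in utterances]
macro_wer = sum(per_utterance) / len(per_utterance)

# Corpus-level WER: pool all edits and all reference words
total_errors = sum(errors for errors, _ in utterances)
total_words = sum(ref_len for _, ref_len in utterances)
corpus_wer = total_errors / total_words

print(f"mean of per-utterance WER: {macro_wer:.2f}")
print(f"corpus-level WER: {corpus_wer:.2f}")
```

With many very short references in the sample, the two averages can differ noticeably, which is worth keeping in mind when comparing these results with figures reported elsewhere.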
