## Summary

In this notebook:

- I created a DataFrame based on **"La Banque Sonore des Dialectes Bretons"**, containing the columns:
  - `br`: the original Breton sentence,
  - `fr`: the French translation,
  - `city`: the municipality associated with the speaker.

- Scraping the Banque Sonore website introduced an error: in some rows, the `br` column contained a dialectal form, while the `fr` column contained the standardized (peurunvan) Breton version instead of a French translation.  
  ➤ This issue has been fixed: in those rows, both columns now hold the standardized Breton text.

- I added the following columns to the DataFrame:
  - `br_processed`: the cleaned version of the Breton text,
  - `transcription`: the transcription of the audio using the **Vosk** model,
  - `wer`: the Word Error Rate,
  - `cer`: the Character Error Rate.

- I computed:
  - the average WER and CER by **municipality**,
  - the average WER and CER by **department**,
  - and the overall WER and CER for the entire dataset.


In [2]:
import pandas as pd

In [3]:
# Load the locally downloaded Parquet file
# (accessing the dataset on Hugging Face requires a login, e.g. `huggingface-cli login`)
df = pd.read_parquet("/home/ouassim/Downloads/0000.parquet", columns=["br", "fr", "city"])

In [4]:
df

| | br | fr | city |
|---|---|---|---|
| 0 | Pell zo n'eo ket o welet ac'hanon. | Ça fait longtemps qu'il n'est pas venu me voir. | 22107 |
| 1 | N'oc'h ket bet pell zo o welet ac'hanon. | Vous n'êtes pas venus me voir depuis longtemps. | 22107 |
| 2 | Pell zo n'eo ket bet o welet ac'hanon. | Ça fait longtemps qu'il n'est pas venu me voir. | 22107 |
| 3 | Ma... ma zi ne vo ket gwerzhet. | Ma... ma maison ne sera pas vendue. | 22107 |
| 4 | Hon... hon ti ne vo ket gwerzhet. | Notre... notre maison ne sera pas vendue. | 22107 |
| ... | ... | ... | ... |
| 7286 | Ne zeue ket ag ar vro Kemper. Ne zeue ket ag a... | Il ne venait pas du pays de Quimper. Il ne ven... | 56247 |
| 7287 | Bout a oa linad razh d'en-dro ar feunteun. Bou... | Il y avait des orties tout autour de la fontai... | 56247 |
| 7288 | Ne oa ket aes o zennañ. | Ce n'était pas facile de les arracher. | 56247 |
| 7289 | Ne oa ket aes o zennañ. | Ce n'était pas facile de les arracher. | 56247 |


In [5]:
def fix_row(row):
    if 'peurunvan :' in row['fr']:
        # Extract Breton text after "peurunvan :"
        new_text = row['fr'].split('peurunvan :', 1)[1].strip()
        # Replace both Br and Fr columns with this new_text
        row['br'] = new_text
        row['fr'] = new_text
    return row

df = df.apply(fix_row, axis=1)
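
The same fix can be written without a row-wise `apply`, which tends to be faster on larger frames. A minimal sketch, assuming the `fr` column holds strings; the toy rows below are invented for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy data (invented): the second row mimics a badly scraped entry
toy = pd.DataFrame({
    "br": ["Demat.", "dialectal form"],
    "fr": ["Bonjour.", "peurunvan : Demat d'an holl."],
})

marker = "peurunvan :"
# Boolean mask of rows whose 'fr' cell contains the marker
mask = toy["fr"].str.contains(marker, regex=False, na=False)
# Text after the first occurrence of the marker, stripped of whitespace
standard = toy.loc[mask, "fr"].str.split(marker, n=1).str[1].str.strip()
# Overwrite both columns, as fix_row does
toy.loc[mask, "br"] = standard
toy.loc[mask, "fr"] = standard
print(toy)
```

Rows without the marker are left untouched, which mirrors the behaviour of `fix_row`.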

In [6]:
dataset_path = "/home/ouassim/Desktop/stage/Data/Banque sonore des Dialectes Bretonnes/Audio/Audio"
model_path = "/home/ouassim/Desktop/stage/Models/Anaouder model/vosk-model-br-25 .02/vosk-model-br-25.02"

In [None]:
from vosk import Model

model = Model(model_path)

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/ouassim/Desktop/stage/Models/Anaouder model/vosk-model-br-25 .02/vosk-model-br-25.02/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:279) Loading HCLG from /home/ouassim/Desktop/stage/Models/Anaouder model/vosk-model-br-25 .02/vosk-model-br-25.02/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:297) Loading words from /home/ouassim/Desktop/stage/Models/Anaouder model/vosk-model-br-25 .02/vosk-mode

In [None]:


# Local helper modules from this project
from load_ground_truth import load_ground_truth_dict
from transcriber import transcribe_audio
from evaluate_wer_cer import evaluate_wer_cer


In [35]:
import os

def generate_transcription(row):
    # Audio files are named utt_<row index, zero-padded to 5 digits>.wav
    file_id = row['file']
    utt_name = f"utt_{file_id:05d}.wav"
    full_path = os.path.join(dataset_path, utt_name)
    return transcribe_audio(full_path, model)
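
`transcribe_audio` comes from a local `transcriber` module whose code is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch built around Vosk's `KaldiRecognizer` interface, assuming 16-bit mono PCM WAV input. The name `transcribe_wav` and the `make_recognizer` parameter are hypothetical and are not the project's actual code:

```python
import json
import wave


def transcribe_wav(path, make_recognizer, chunk_frames=4000):
    """Feed a WAV file to a recognizer in chunks and return the full text.

    `make_recognizer` is a hypothetical callable that takes the sample rate
    and returns an object exposing AcceptWaveform(), Result() and
    FinalResult(), e.g. lambda rate: vosk.KaldiRecognizer(model, rate).
    """
    parts = []
    with wave.open(path, "rb") as wf:
        rec = make_recognizer(wf.getframerate())
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            # AcceptWaveform() returns True when a segment has been finalized
            if rec.AcceptWaveform(data):
                parts.append(json.loads(rec.Result()).get("text", ""))
    # FinalResult() flushes whatever audio is still buffered
    parts.append(json.loads(rec.FinalResult()).get("text", ""))
    return " ".join(p for p in parts if p)
```

With the model loaded in the cell above, a call could look like `transcribe_wav(full_path, lambda rate: KaldiRecognizer(model, rate))` after `from vosk import KaldiRecognizer`.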

In [None]:
# Optional: restrict to Côtes-d'Armor (city codes starting with '22')
# df_22 = df[df['city'].astype(str).str.startswith('22')].copy()

# Add a 'file' column holding the original row index (used to build the audio file name)
df['file'] = df.index

# Transcribe every audio file with the Vosk model
df['transcription'] = df.apply(generate_transcription, axis=1)

# Optional: reset the index for a clean DataFrame
df = df.reset_index(drop=True)




In [88]:
from filter_char import filter_out_chars
from normalizer import normalize_sentence
from utils import pre_process


def process_br_text(br):
    PUNCTUATION = '<>.?!,;:«»“”"()[]/…–—•'
    br = filter_out_chars(br, PUNCTUATION + '*')
    br = normalize_sentence(br, autocorrect=True)
    br = pre_process(br).replace('-', ' ').lower()
    return br

In [89]:
df['br_processed'] = df['br'].apply(process_br_text)
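
`filter_out_chars`, `normalize_sentence` and `pre_process` are project helpers that are not defined in this notebook. For readers without those modules, the punctuation stripping, hyphen splitting and lowercasing can be approximated with the standard library. This is a simplified stand-in (the name `simple_clean` is hypothetical, and it skips whatever normalization and autocorrection the real helpers perform):

```python
PUNCTUATION = '<>.?!,;:«»“”"()[]/…–—•*'

# Translation table: drop punctuation characters, turn hyphens into spaces
_TABLE = str.maketrans({**{ch: None for ch in PUNCTUATION}, "-": " "})


def simple_clean(text):
    """Strip punctuation, split hyphenated words, lowercase, collapse spaces."""
    cleaned = text.translate(_TABLE).lower()
    return " ".join(cleaned.split())


print(simple_clean("Pell zo n'eo ket o welet ac'hanon."))
```

Apostrophes are kept on purpose, since they are part of Breton spellings such as `n'eo` and `ac'hanon`.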

In [None]:
# Result so far: a DataFrame with br / fr / city / transcription / file
# The cleaned Breton text goes into a new column: br ==> br_processed

# Predictions are made with Vosk only; Whisper would need a GPU

# Next: add per-row WER and CER columns


In [90]:
from jiwer import wer, cer

# Compute WER and CER for each row
df['wer'] = df.apply(lambda row: wer(row['br_processed'], row['transcription']), axis=1)
df['cer'] = df.apply(lambda row: cer(row['br_processed'], row['transcription']), axis=1)
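
For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, and CER is the same ratio at character level. The sketch below recomputes both in plain Python to make the metric concrete; it is an illustration, not jiwer's implementation, and the helper names are hypothetical:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming, two rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def simple_wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def simple_cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)


# First row of the dataset: 3 substituted words out of 7
print(simple_wer("pell zo n'eo ket o welet ac'hanon",
                 "pell zo n'eo ket bet evel amañ"))
# Insertions are not bounded by the reference length, so WER can exceed 1
print(simple_wer("a b", "w x y z"))
```

The first call gives 3/7 ≈ 0.4286, matching the `wer` reported for that row. The second gives 2.0, which shows how some per-municipality averages (1.01 and 1.15 in the output of the averaging cell) can end up above 1.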


In [91]:
df = df[['br', 'br_processed', 'fr', 'transcription', 'city', 'wer', 'cer', 'file']]
df

| | br | br_processed | fr | transcription | city | wer | cer | file |
|---|---|---|---|---|---|---|---|---|
| 0 | Pell zo n'eo ket o welet ac'hanon. | pell zo n'eo ket o welet ac'hanon | Ça fait longtemps qu'il n'est pas venu me voir. | pell zo n'eo ket bet evel amañ | 22107 | 0.428571 | 0.363636 | 0 |
| 1 | N'oc'h ket bet pell zo o welet ac'hanon. | n'oc'h ket bet pell zo o welet ac'hanon | Vous n'êtes pas venus me voir depuis longtemps. | n'oc'h ket bet pell zo ivez ha nann | 22107 | 0.375000 | 0.282051 | 1 |
| 2 | Pell zo n'eo ket bet o welet ac'hanon. | pell zo n'eo ket bet o welet ac'hanon | Ça fait longtemps qu'il n'est pas venu me voir. | pell zo n'eo ket bet e wel ac'hanon | 22107 | 0.250000 | 0.081081 | 2 |
| 3 | Ma... ma zi ne vo ket gwerzhet. | ma ma zi ne vo ket gwerzhet | Ma... ma maison ne sera pas vendue. | met se n'eo ket avat | 22107 | 0.857143 | 0.592593 | 3 |
| 4 | Hon... hon ti ne vo ket gwerzhet. | hon hon ti ne vo ket gwerzhet | Notre... notre maison ne sera pas vendue. | c'hoant hon te n'eo ket gwall vat | 22107 | 0.857143 | 0.482759 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7286 | Ne zeue ket ag ar vro Kemper. Ne zeue ket ag a... | ne zeue ket ag ar vro kemper ne zeue ket ag ar... | Il ne venait pas du pays de Quimper. Il ne ven... | setu hag ar vro Kemper setu hag o vro Kemper | 56247 | 0.785714 | 0.385965 | 7286 |
| 7287 | Bout a oa linad razh d'en-dro ar feunteun. Bou... | bout a oa linad razh d'en dro ar feunteun bout... | Il y avait des orties tout autour de la fontai... | da ouel leuniet a ra torfed den ma voe leuniet... | 56247 | 0.944444 | 0.662651 | 7287 |
| 7288 | Ne oa ket aes o zennañ. | ne oa ket aes o zennañ | Ce n'était pas facile de les arracher. | met te zo lienenn | 56247 | 1.000000 | 0.681818 | 7288 |
| 7289 | Ne oa ket aes o zennañ. | ne oa ket aes o zennañ | Ce n'était pas facile de les arracher. | oa ket aes ouzh lienenn | 56247 | 0.500000 | 0.500000 | 7289 |


In [92]:
# Assuming your DataFrame has columns: 'city', 'wer', and 'cer'
city_avg = df.groupby('city')[['wer', 'cer']].mean()

# Optional: round to 2 decimal places
city_avg = city_avg.round(2)

# Print each city's average WER and CER
for city_code, row in city_avg.iterrows():
    print(f"Average WER and CER for {city_code}:\nWER: {row['wer']}\nCER: {row['cer']}\n")


Average WER and CER for 22107:
WER: 0.66
CER: 0.4

Average WER and CER for 22167:
WER: 0.7
CER: 0.41

Average WER and CER for 22181:
WER: 0.68
CER: 0.42

Average WER and CER for 22230:
WER: 0.62
CER: 0.37

Average WER and CER for 22331:
WER: 0.71
CER: 0.42

Average WER and CER for 22336:
WER: 0.62
CER: 0.37

Average WER and CER for 22386:
WER: 0.68
CER: 0.4

Average WER and CER for 29005:
WER: 0.76
CER: 0.48

Average WER and CER for 29031:
WER: 1.01
CER: 0.63

Average WER and CER for 29032:
WER: 0.75
CER: 0.48

Average WER and CER for 29058:
WER: 0.71
CER: 0.44

Average WER and CER for 29080:
WER: 0.52
CER: 0.3

Average WER and CER for 29082:
WER: 0.66
CER: 0.41

Average WER and CER for 29083:
WER: 1.15
CER: 0.4

Average WER and CER for 29091:
WER: 0.68
CER: 0.42

Average WER and CER for 29105:
WER: 0.38
CER: 0.22

Average WER and CER for 29122:
WER: 0.8
CER: 0.52

Average WER and CER for 29136:
WER: 0.78
CER: 0.49

Average WER and CER for 29139:
WER: 0.89
CER: 0.52

Average WER and CE

In [95]:
# Filter for cities starting with '22'
df_22 = df[df['city'].str.startswith('22')]
avg_22 = df_22[['wer', 'cer']].mean().round(4)


# Filter for cities starting with '29'
df_29 = df[df['city'].str.startswith('29')]
avg_29 = df_29[['wer', 'cer']].mean().round(4)


# Filter for cities starting with '56'
df_56 = df[df['city'].str.startswith('56')]
avg_56 = df_56[['wer', 'cer']].mean().round(4)



print("Evaluation for --Côtes-d'Armor-- dialects 22xxx :")
print(f"WER: {avg_22['wer']}")
print(f"CER: {avg_22['cer']}\n")

print("Evaluation for --Finistère-- dialects 29xxx :")
print(f"WER: {avg_29['wer']}")
print(f"CER: {avg_29['cer']}\n")

print("Evaluation for --Morbihan-- dialects 56xxx :")
print(f"WER: {avg_56['wer']}")
print(f"CER: {avg_56['cer']}")


Evaluation for --Côtes-d'Armor-- dialects 22xxx :
WER: 0.6591
CER: 0.3906

Evaluation for --Finistère-- dialects 29xxx :
WER: 0.7492
CER: 0.4668

Evaluation for --Morbihan-- dialects 56xxx :
WER: 0.831
CER: 0.5264
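
The three filters above can also be collapsed into a single `groupby` on the first two characters of the city code, which this notebook uses to identify the department. A minimal sketch on invented toy values (the `DEPARTMENTS` mapping and the `toy` frame are illustrative, not the notebook's data):

```python
import pandas as pd

DEPARTMENTS = {"22": "Côtes-d'Armor", "29": "Finistère", "56": "Morbihan"}

# Toy scores (invented) standing in for the real per-row results
toy = pd.DataFrame({
    "city": ["22107", "22167", "29005", "56247"],
    "wer": [0.5, 1.0, 0.8, 0.9],
    "cer": [0.2, 0.4, 0.5, 0.6],
})

# Department = first two characters of the city code
# (cast to str first in case the column was parsed as integers)
dept = toy["city"].astype(str).str[:2].map(DEPARTMENTS).rename("department")
dept_avg = toy.groupby(dept)[["wer", "cer"]].mean().round(4)
print(dept_avg)
```

Grouping this way avoids repeating the filter once per department and picks up any additional department prefix added to the mapping.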


In [94]:
# Assuming 'wer' and 'cer' columns exist
average_wer = df['wer'].mean()
average_cer = df['cer'].mean()

# Optional: round them
average_wer = round(average_wer, 2)
average_cer = round(average_cer, 2)

print(f"Overall Average WER: {average_wer}")
print(f"Overall Average CER: {average_cer}")


Overall Average WER: 0.75
Overall Average CER: 0.47
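
Note that the figures above are means of per-sentence scores, so a two-word sentence weighs as much as a twenty-word one. A common complement is a length-weighted (corpus-level) WER: total word errors divided by total reference words. A small sketch in plain Python, with invented numbers and a hypothetical helper name:

```python
def weighted_wer(rows):
    """rows: iterable of (per-sentence WER, number of reference words)."""
    rows = list(rows)
    # Per-sentence WER times reference length gives the word errors for that sentence
    total_errors = sum(wer * n_words for wer, n_words in rows)
    total_words = sum(n_words for _, n_words in rows)
    return total_errors / total_words


# Invented example: one short, fully wrong sentence and one long, mostly right one
rows = [(1.0, 2), (0.1, 20)]
macro = sum(w for w, _ in rows) / len(rows)  # plain mean of per-sentence WER
print(macro)
print(weighted_wer(rows))
```

Here the plain mean is 0.55 while the weighted value is 4/22 ≈ 0.18. On the DataFrame, the weights could be taken from `df['br_processed'].str.split().str.len()`.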
