## Evaluation Summary of Whisper Turbo on Common Voice Dataset

I evaluated the **Whisper Turbo** model (loaded with `whisper.load_model("turbo")`) on the Breton (`br`) test split of the **Common Voice** dataset.

- **Number of evaluated instances**: 2,865  
- The recordings come from **many different speakers**, but most clips carry no gender, age, or accent label, so the per-group averages cover only a subset of the data.

### Metrics Computed

For each instance, the following metrics were calculated:

- **Word Error Rate (WER)**
- **Character Error Rate (CER)**
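
Both metrics are edit-distance ratios: the number of substitutions, deletions, and insertions needed to align the hypothesis with the reference, divided by the reference length (in words for WER, in characters for CER). Because insertions are counted, either metric can exceed 1 when the model outputs much more text than was spoken. The sketch below is a plain-Python illustration of that definition, not the `jiwer` implementation used later in the notebook; the function names and the example hypothesis are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


# Two reference words, four wrong hypothesis words: 2 substitutions + 2 insertions.
print(word_error_rate("be'h dezhi", "ber a da vi"))  # 2.0
```

Libraries such as `jiwer` may differ in details (for example, how whitespace is treated for CER), so small differences from this sketch are to be expected.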

### Aggregated Metrics

I also computed the average WER and CER (unweighted means of the per-sentence scores):

- Globally, across the entire dataset
- By gender
- By age group
- By accent
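
The per-group averages listed above can also be computed in one pass with `groupby`, which reports every value of a metadata column (including missing labels) without writing one filter per group. This is a minimal sketch on invented toy data, assuming a DataFrame with `wer` and `cer` columns like the one built later in the notebook; `average_by` is a hypothetical helper name.

```python
import pandas as pd

# Toy data, invented for illustration; the notebook uses the Common Voice metadata.
toy = pd.DataFrame({
    "age": ["twenties", "twenties", "thirties", None],
    "wer": [1.0, 0.5, 1.2, 0.9],
    "cer": [0.4, 0.2, 0.6, 0.5],
})


def average_by(frame, column):
    """Row count and mean WER/CER per value of `column`; missing labels form their own group."""
    return (
        frame.groupby(column, dropna=False)
        .agg(n=("wer", "size"), wer=("wer", "mean"), cer=("cer", "mean"))
        .round(4)
    )


by_age = average_by(toy, "age")
print(by_age)
```

Reporting the group size `n` next to each average makes it easier to judge how much weight a small group deserves.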

In [29]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/scripts-code/text/Whisper_Evaluation_small.ipynb
/kaggle/input/scripts-code/text/load_ground_truth_common_voice.py
/kaggle/input/scripts-code/text/definitions.py
/kaggle/input/scripts-code/text/wazemour@bocal.cs.univ-paris8.fr
/kaggle/input/scripts-code/text/Evaluation_Departement_Banque.ipynb
/kaggle/input/scripts-code/text/transcriber_whisper.py
/kaggle/input/scripts-code/text/Whisper_Evaluation_tiny_2000.ipynb
/kaggle/input/scripts-code/text/normalizer.py
/kaggle/input/scripts-code/text/Whisper_Evaluation_base_2000.ipynb
/kaggle/input/scripts-code/text/filter_char.py
/kaggle/input/scripts-code/text/transcriber.py
/kaggle/input/scripts-code/text/gateway-bocal
/kaggle/input/scripts-code/text/utils.py
/kaggle/input/scripts-code/text/MP3toWAV.py
/kaggle/input/scripts-code/text/tokenizer.py
/kaggle/input/scripts-code/text/load_ground_truth.py
/kaggle/input/scripts-code/text/Whisper_Evaluation_tiny.ipynb
/kaggle/input/scripts-code/text/evaluate_wer_cer.py
/kaggle/input/scrip

In [30]:
!pip install jiwer



In [31]:
import os
import json
import jiwer

import sys

In [32]:
import pandas as pd

tsv_file="/kaggle/input/test-tsv/test.tsv"
# Read the TSV file into a DataFrame
df = pd.read_csv(tsv_file, sep='\t')

df


Unnamed: 0,client_id,path,sentence_id,sentence,sentence_domain,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,20c2ec32d71eb83bd7878d32354cf7ed1387b0cf8c0244...,common_voice_br_43418986.mp3,1f2c296645f2350caab2a1b020bdca95b9c5997f49a857...,Kanañ a raent en hent don.,general,2,0,,,,,br,
1,28549a5a700d20b91b7c5729ab381b3de5f3ef7b39896d...,common_voice_br_18382821.mp3,1956f9fa11493b990b8e4d902dd02c9cd5b744ba539367...,digor e vez da Verc'her.,,2,0,,,,,br,
2,601b90c17534eafa23b4c3d9056562d5ae7f3a4e09ba66...,common_voice_br_17977507.mp3,88a54965ce62fa884bb2156ed7ab051410e40161481e64...,Re nebeut a zo deuet da reiñ dorn dimp.,,2,0,,,,,br,
3,644a43f222bf6121fccc51e3995fad29c937d0c67e2652...,common_voice_br_41622444.mp3,83dabe599e2706e4fcbb78c6084b364c0e268c0d541fe2...,Kemerit ho tafar hag azezit !,,2,0,,,,,br,
4,6f9932e48b262881952f248473d5e56e25fbf4608d79c7...,common_voice_br_19737696.mp3,c2b949dd328e1ca91003278af8831a1cfc7f784c225089...,Be'h dezhi !,,2,1,,,,,br,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2860,37350e73c62e7ba51aca62e2e850852e5fc0529881992a...,common_voice_br_17994057.mp3,746b3e1ba4b1e90fd730feef0ee74421edcc90e220479c...,Ar memes tra a vo gulennet diganeoc'h.,,2,1,,,,,br,
2861,37350e73c62e7ba51aca62e2e850852e5fc0529881992a...,common_voice_br_17995860.mp3,be0d10e614a2afd816043b51fbaa674dda05334c7f8bb4...,Alies em galv da zont da reiñ dorn dezhañ.,,2,0,,,,,br,
2862,37350e73c62e7ba51aca62e2e850852e5fc0529881992a...,common_voice_br_17996179.mp3,c365c47e239116c4493b9e8eb5f0eaf365da027d707e34...,Me/ Te. Eñ/Hi.,,2,0,,,,,br,
2863,37350e73c62e7ba51aca62e2e850852e5fc0529881992a...,common_voice_br_17998028.mp3,d81e755e5f77c0cf0483082e015ec2e8cdbb5a686ffc1d...,Ar c'harr-tan ruz nevez-flamm-hont.,,2,0,,,,,br,


In [33]:
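# Keep only the columns needed for the evaluation. The .copy() makes df an
# independent DataFrame, so adding columns later does not trigger pandas'
# SettingWithCopyWarning.
df = df[['path','sentence','gender','sentence_domain','age','accents']].copy()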

df

Unnamed: 0,path,sentence,gender,sentence_domain,age,accents
0,common_voice_br_43418986.mp3,Kanañ a raent en hent don.,,general,,
1,common_voice_br_18382821.mp3,digor e vez da Verc'her.,,,,
2,common_voice_br_17977507.mp3,Re nebeut a zo deuet da reiñ dorn dimp.,,,,
3,common_voice_br_41622444.mp3,Kemerit ho tafar hag azezit !,,,,
4,common_voice_br_19737696.mp3,Be'h dezhi !,,,,
...,...,...,...,...,...,...
2860,common_voice_br_17994057.mp3,Ar memes tra a vo gulennet diganeoc'h.,,,,
2861,common_voice_br_17995860.mp3,Alies em galv da zont da reiñ dorn dezhañ.,,,,
2862,common_voice_br_17996179.mp3,Me/ Te. Eñ/Hi.,,,,
2863,common_voice_br_17998028.mp3,Ar c'harr-tan ruz nevez-flamm-hont.,,,,


In [34]:
dataset_path = "/kaggle/input/common-voice-data/Test"

In [35]:
!pip install -U openai-whisper



In [36]:
import whisper

model = whisper.load_model("turbo")

In [37]:
import sys

sys.path.append('/kaggle/input/scripts-code/text')


from load_ground_truth_common_voice import load_ground_truth_dict_commonvoice

from transcriber import transcribe_audio

In [38]:
def generate_transcription(row):
    # Build the full path of the clip and transcribe it with the loaded model
    audio_filename = row['path']  # e.g. common_voice_br_43418986.mp3
    full_path = os.path.join(dataset_path, audio_filename)
    return transcribe_audio(full_path, model)

In [39]:
# Apply transcription (runs Whisper once per row, so this is the slow step)
df['transcription'] = df.apply(generate_transcription, axis=1)

# Optional: reset index if needed
df = df.reset_index(drop=True)

Transcribing /kaggle/input/common-voice-data/Test/common_voice_br_43418986.mp3...
[00:00.000 --> 00:03.860]  Kana ar ein einen dome
Transcribing /kaggle/input/common-voice-data/Test/common_voice_br_18382821.mp3...
[00:00.000 --> 00:02.580]  Digo'rvi da vẽ.
[00:02.580 --> 00:03.580]  HTyarch.
Transcribing /kaggle/input/common-voice-data/Test/common_voice_br_17977507.mp3...
[00:00.880 --> 00:03.780]  Reineu botazu, dovet ar Ainhoump, ne dîmes.
[00:11.780 --> 00:13.180]  Baez-le ni uge un tâschi'n'hiss rulli maoja.
[00:14.500 --> 00:17.180]  Son arainet est differenti d bel than.
Transcribing /kaggle/input/common-voice-data/Test/common_voice_br_41622444.mp3...
[00:00.000 --> 00:02.440]  Kéméry do Tafr gazezit.
Transcribing /kaggle/input/common-voice-data/Test/common_voice_br_19737696.mp3...
[00:00.000 --> 00:01.380]  Be'r'a da'vi.
Transcribing /kaggle/input/common-voice-data/Test/common_voice_br_25993226.mp3...
[00:00.000 --> 00:01.900]  Nog vaim drillien.
Transcribing /kaggle/input/commo



In [52]:
ground_truth_path="/kaggle/input/test-tsv/test.tsv"

ground_truth_dict = load_ground_truth_dict_commonvoice(ground_truth_path)

In [53]:
from filter_char import filter_out_chars
from normalizer import normalize_sentence
from utils import pre_process


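def process_br_text(br):
    # Normalize a Breton reference sentence before scoring.
    PUNCTUATION = '<>.?!,;:«»“”"()[]/…–—•'
    # Judging from the sentence_processed column, filtered characters are
    # dropped without inserting a space (e.g. "Eñ/Hi" becomes "eñhi").
    br = filter_out_chars(br, PUNCTUATION + '*')
    # autocorrect=True can rewrite reference spellings (e.g. "dimp" -> "deomp").
    br = normalize_sentence(br, autocorrect=True)
    # Hyphens become spaces, then everything is lowercased.
    br = pre_process(br).replace('-', ' ').lower()
    return br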

In [54]:
df['sentence_processed'] = df['sentence'].apply(process_br_text)


In [55]:
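from jiwer import wer, cer

# Compute WER and CER for each row; jiwer expects (reference, hypothesis).
# Only the reference is normalized: the Whisper transcription is scored as-is,
# so differences in case and punctuation count as errors.
df['wer'] = df.apply(lambda row: wer(row['sentence_processed'], row['transcription']), axis=1)
df['cer'] = df.apply(lambda row: cer(row['sentence_processed'], row['transcription']), axis=1)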


In [56]:
df = df[['sentence', 'sentence_processed', 'transcription', 'wer', 'cer','gender','sentence_domain','age','accents']]
df

Unnamed: 0,sentence,sentence_processed,transcription,wer,cer,gender,sentence_domain,age,accents
0,Kanañ a raent en hent don.,kanañ a raent en hent don,Kana ar ein einen dome,1.000000,0.480000,,general,,
1,digor e vez da Verc'her.,digor e vez da verc'her,Digo'rvi da vẽ. HTyarch.,1.000000,0.695652,,,,
2,Re nebeut a zo deuet da reiñ dorn dimp.,re nebeut a zo deuet da reiñ dorn deomp,"Reineu botazu, dovet ar Ainhoump, ne dîmes. B...",2.333333,2.538462,,,,
3,Kemerit ho tafar hag azezit !,kemerit ho tafar hag azezit,Kéméry do Tafr gazezit.,1.000000,0.444444,,,,
4,Be'h dezhi !,be'h dezhi,Be'r'a da'vi.,1.000000,0.800000,,,,
...,...,...,...,...,...,...,...,...,...
2860,Ar memes tra a vo gulennet diganeoc'h.,ar memes tra a vo goulennet diganeoc'h,Ar mei me strav o gulennet di Ganaq.,1.142857,0.421053,,,,
2861,Alies em galv da zont da reiñ dorn dezhañ.,alies em galv da zont da reiñ dorn dezhañ,"Al ii e s'em gaf de zon de rei d'orne, deson.",1.222222,0.463415,,,,
2862,Me/ Te. Eñ/Hi.,me te eñhi,"Ne, te, en, hi",1.333333,0.600000,,,,
2863,Ar c'harr-tan ruz nevez-flamm-hont.,ar c'harr tan ruz nevez flamm hont,Arhartan rû nevez flambant.,0.857143,0.382353,,,,
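
In the table above, `sentence_processed` is lowercased and stripped of punctuation, while `transcription` keeps Whisper's capitalization and punctuation, so part of the measured error is formatting rather than recognition. One way to separate the two is to pass both sides through the same normalizer before scoring. The sketch below uses a hypothetical `simple_normalize` written for illustration; it is a stand-in, not the `process_br_text` pipeline, whose helper modules are not shown in this notebook.

```python
import re
import unicodedata

# Characters treated as punctuation in this sketch. The apostrophe is kept
# because it is part of Breton spelling (e.g. "c'h").
_PUNCT = re.compile(r'[<>.?!,;:«»“”"()\[\]/…–—•*\-]')


def simple_normalize(text):
    """Hypothetical normalizer: NFC, lowercase, punctuation -> space, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = _PUNCT.sub(" ", text)
    return " ".join(text.split())


reference = simple_normalize("Kemerit ho tafar hag azezit !")
hypothesis = simple_normalize("Kéméry do Tafr gazezit.")
print(reference)   # kemerit ho tafar hag azezit
print(hypothesis)  # kéméry do tafr gazezit
```

The normalized strings would then be passed to `wer` and `cer` in place of the raw columns. The scores reported in this notebook were not computed this way and would change if it were applied.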


In [63]:
print(df['gender'].value_counts(dropna=False))
print("--------------------------------")

# Filter for male voice audio
df_male = df[df['gender'].str.startswith('male_masculine', na=False)]

avg_male = df_male[['wer', 'cer']].mean().round(4)


# Filter for female voice audio
df_female = df[df['gender'].str.startswith('female_feminine', na=False)]

avg_female = df_female[['wer', 'cer']].mean().round(4)

print("Evaluation for Female audio files :")
print(f"WER: {avg_female['wer']}")
print(f"CER: {avg_female['cer']}\n")

print("Evaluation for Male audio files :")
print(f"WER: {avg_male['wer']}")
print(f"CER: {avg_male['cer']}\n")

gender
NaN                1875
male_masculine      760
female_feminine     230
Name: count, dtype: int64
--------------------------------
Evaluation for Female audio files :
WER: 1.0287
CER: 0.5921

Evaluation for Male audio files :
WER: 1.1065
CER: 0.7063



In [59]:
print(df['sentence_domain'].value_counts(dropna=False))

sentence_domain
NaN                       2583
general                    275
history_law_government       2
general,general              1
media_entertainment          1
service_retail               1
technology_robotics          1
healthcare                   1
Name: count, dtype: int64


In [60]:
print(df['age'].value_counts(dropna=False))
print("--------------------------------")

# Filter for twenties voice audio
df_twenties = df[df['age'].str.startswith('twenties', na=False)]
avg_twenties = df_twenties[['wer', 'cer']].mean().round(4)

# Filter for thirties voice audio
df_thirties = df[df['age'].str.startswith('thirties', na=False)]
avg_thirties = df_thirties[['wer', 'cer']].mean().round(4)

# Filter for forties voice audio (Common Voice spells the label 'fourties')
df_fourties = df[df['age'].str.startswith('fourties', na=False)]
avg_fourties = df_fourties[['wer', 'cer']].mean().round(4)

# Filter for fifties voice audio
df_fifties = df[df['age'].str.startswith('fifties', na=False)]
avg_fifties = df_fifties[['wer', 'cer']].mean().round(4)

# Filter for sixties voice audio
df_sixties = df[df['age'].str.startswith('sixties', na=False)]
avg_sixties = df_sixties[['wer', 'cer']].mean().round(4)

print("Evaluation for Twenties audio files :")
print(f"WER: {avg_twenties['wer']}")
print(f"CER: {avg_twenties['cer']}\n")

print("Evaluation for Thirties audio files :")
print(f"WER: {avg_thirties['wer']}")
print(f"CER: {avg_thirties['cer']}\n")

print("Evaluation for Fourties audio files :")
print(f"WER: {avg_fourties['wer']}")
print(f"CER: {avg_fourties['cer']}\n")

print("Evaluation for Fifties audio files :")
print(f"WER: {avg_fifties['wer']}")
print(f"CER: {avg_fifties['cer']}\n")

print("Evaluation for Sixties audio files :")
print(f"WER: {avg_sixties['wer']}")
print(f"CER: {avg_sixties['cer']}\n")

age
NaN          1628
twenties      600
thirties      206
fourties      140
fifties       137
sixties       105
seventies      49
Name: count, dtype: int64
--------------------------------
Evaluation for Twenties audio files :
WER: 1.0899
CER: 0.6869

Evaluation for Thirties audio files :
WER: 1.1251
CER: 0.6987

Evaluation for Fourties audio files :
WER: 1.1045
CER: 0.6466

Evaluation for Fifties audio files :
WER: 1.027
CER: 0.5964

Evaluation for Sixties audio files :
WER: 1.0205
CER: 0.5787



In [61]:
print(df['accents'].value_counts(dropna=False))
print("--------------------------------")
# Filter for Leoneg voice audio
df_Leoneg = df[df['accents'].str.startswith('Leoneg', na=False)]
avg_Leoneg = df_Leoneg[['wer', 'cer']].mean().round(4)

df_Kerneveg = df[df['accents'].str.startswith('Kerneveg', na=False)]
avg_Kerneveg = df_Kerneveg[['wer', 'cer']].mean().round(4)

# Note: startswith('Gwenedeg') also matches the rows labeled 'Gwenedeg,Kerneveg'
df_Gwenedeg = df[df['accents'].str.startswith('Gwenedeg', na=False)]
avg_Gwenedeg = df_Gwenedeg[['wer', 'cer']].mean().round(4)

print("Evaluation for Leoneg audio files :")
print(f"WER: {avg_Leoneg['wer']}")
print(f"CER: {avg_Leoneg['cer']}\n")

print("Evaluation for Kerneveg audio files :")
print(f"WER: {avg_Kerneveg['wer']}")
print(f"CER: {avg_Kerneveg['cer']}\n")

print("Evaluation for Gwenedeg audio files :")
print(f"WER: {avg_Gwenedeg['wer']}")
print(f"CER: {avg_Gwenedeg['cer']}\n")

accents
NaN                      2399
Leoneg                    215
Kerneveg                  136
Gwenedeg                   88
Brezhoneg Breizh-Uhel      10
Gwenedeg,Kerneveg           9
Standard                    5
Tregerieg                   3
Name: count, dtype: int64
--------------------------------
Evaluation for Leoneg audio files :
WER: 1.0445
CER: 0.6362

Evaluation for Kerneveg audio files :
WER: 1.0319
CER: 0.5724

Evaluation for Gwenedeg audio files :
WER: 1.1673
CER: 0.827
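
Some clips carry more than one accent label (`Gwenedeg,Kerneveg`), and the prefix filters above fold those rows into the first accent only. One possible alternative, sketched here on invented toy data, is to split the comma-separated labels and count a clip towards every accent it lists. Whether a multi-accent clip should count towards each group is a design choice, not something the notebook prescribes.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "accents": ["Leoneg", "Gwenedeg", "Gwenedeg,Kerneveg", None],
    "wer": [1.0, 1.2, 0.8, 0.9],
    "cer": [0.6, 0.8, 0.4, 0.5],
})

by_accent = (
    toy.dropna(subset=["accents"])                          # skip unlabeled clips
    .assign(accent=lambda d: d["accents"].str.split(","))   # "A,B" -> ["A", "B"]
    .explode("accent", ignore_index=True)                   # one row per (clip, accent)
    .assign(accent=lambda d: d["accent"].str.strip())
    .groupby("accent")
    .agg(n=("wer", "size"), wer=("wer", "mean"), cer=("cer", "mean"))
    .round(4)
)
print(by_accent)
```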



In [62]:
# Unweighted mean of the per-sentence scores (every clip counts equally)
average_wer = df['wer'].mean()
average_cer = df['cer'].mean()

# Round to two decimals for display
average_wer = round(average_wer, 2)
average_cer = round(average_cer, 2)

print(f"Overall Average WER: {average_wer}")
print(f"Overall Average CER: {average_cer}")

Overall Average WER: 1.08
Overall Average CER: 0.65
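
The overall figures above are means of per-sentence scores, so a two-word clip weighs as much as a ten-word one, and a single short clip with many inserted words can pull the average up noticeably. A common alternative is a pooled (corpus-level) WER: total errors divided by total reference words. The sketch below contrasts the two on a pair of examples; the hypotheses are invented for illustration, and `edit_distance` is a local helper, not a `jiwer` function.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


# (reference, hypothesis) pairs; the hypotheses are invented for illustration.
pairs = [
    ("be'h dezhi", "ber a da vi"),
    ("ar memes tra a vo goulennet diganeoc'h", "ar memes tra a vo goulennet diganeoc'h"),
]

errors = [edit_distance(r.split(), h.split()) for r, h in pairs]
ref_lengths = [len(r.split()) for r, _ in pairs]

# Mean of per-sentence WERs: each clip weighs the same.
macro_wer = sum(e / n for e, n in zip(errors, ref_lengths)) / len(pairs)

# Pooled WER: total errors over total reference words, so long clips weigh more.
micro_wer = sum(errors) / sum(ref_lengths)

print(f"macro={macro_wer:.3f} micro={micro_wer:.3f}")  # macro=1.000 micro=0.444
```

Reporting both numbers side by side would show how much of a WER above 1 comes from a few short, heavily over-transcribed clips.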
