# Common Voice Dataset Training with Whisper

This notebook demonstrates how to train or fine-tune the Whisper model using the Common Voice dataset.

In [2]:
%pip install --upgrade datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio

Collecting transformers
  Downloading transformers-4.51.0-py3-none-any.whl.metadata (38 kB)
Collecting accelerate
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting tensorboard
  Downloading tensorboard-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Collecting gradio
  Downloading gradio-5.23.3-py3-none-any.whl.metadata (16 kB)
Collecting datasets[audio]
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub>=0.24.0 (from datasets[audio])
  Downloading huggingface_hub-0.30.1-py3-none-any.whl.metadata (13 kB)
Collecting click>=8.1.8 (from jiwer)
  Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.13.0-cp311-cp311-win_amd64.whl.metadata (12 kB)
Collecting absl-py>=0.4 (from tensorboard)
  Downloadin

In [4]:
import os
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from src.config.hyperparameters import Hyperparameters
from src.models.whisper.model import WhisperModel
from src.utils.common_voice import process_common_voice_metadata
from datasets import load_dataset, DatasetDict

# Log into HuggingFace

In [11]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Import datasets

In [14]:
from datasets import load_dataset, DatasetDict, DownloadConfig

common_voice = DatasetDict()
config = DownloadConfig(resume_download=True)

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "es", split="train", trust_remote_code=True, download_config=config)
common_voice["validation"] = load_dataset("mozilla-foundation/common_voice_11_0", "es", split="validation", trust_remote_code=True, download_config=config)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "es", split="test", trust_remote_code=True, download_config=config)

print(common_voice)

Downloading data:   0%|          | 0/30 [00:00<?, ?files/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


(…)bf6034f75ee0514e1bf46b923a14dc798b74c0b3:   0%|          | 0.00/1.05G [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


(…)714862aeda48d7ff99e4a480c9ac2f4e32219a8a:   0%|          | 0.00/984M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


(…)d857ec8aef05984455b363adda3840bd9f0d9a33:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


(…)f0502c1d51ced2171cb85521baff5351a71bd739:   0%|          | 0.00/1.02G [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


(…)eb6a143235e385933f31db622289ec8940ec9f4e:   0%|          | 0.00/595M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


(…)7f9a0c7a1f5d8320bb712ae6dcf2b430883caf35:   0%|          | 0.00/1.68G [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


(…)68872cb153e384c53fbc87689835a33895de51ca:   0%|          | 0.00/479M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


train.tsv:   0%|          | 0.00/65.5M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


dev.tsv:   0%|          | 0.00/3.79M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


test.tsv:   0%|          | 0.00/3.78M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


other.tsv:   0%|          | 0.00/302M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


invalidated.tsv:   0%|          | 0.00/13.6M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Reading metadata...: 230467it [00:01, 144704.01it/s]


Generating validation split: 0 examples [00:00, ? examples/s]

Reading metadata...: 15520it [00:00, 156890.51it/s]


Generating test split: 0 examples [00:00, ? examples/s]

Reading metadata...: 15520it [00:00, 146253.65it/s]


Generating other split: 0 examples [00:00, ? examples/s]

Reading metadata...: 1180383it [00:08, 145942.18it/s]


Generating invalidated split: 0 examples [00:00, ? examples/s]

Reading metadata...: 52095it [00:00, 126766.31it/s]


DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 230467
    })
    validation: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 15520
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 15520
    })
})


In [15]:
from datasets import concatenate_datasets
common_voice['train_full'] = concatenate_datasets([common_voice['train'], common_voice['validation']])

# Preprocess data

In [17]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Spanish", task="transcribe")
print(common_voice["train_full"][0])

{'client_id': '34719bb7c7344da7733b85c9d7215d24326093f1a2cd3a445bdc6dfe9ec4a8c9fe9729a73f6c29764545276bff81ffa65d3944f6da7a3ee3c06d0eb124fac797', 'path': 'C:\\Users\\PC\\.cache\\huggingface\\datasets\\downloads\\extracted\\d23b4c86f4c9ac20e0a765eb29f593dcfe5f2b57e5776ffde9ee387f4e75c807\\es_train_0/common_voice_es_18338585.mp3', 'audio': {'path': 'C:\\Users\\PC\\.cache\\huggingface\\datasets\\downloads\\extracted\\d23b4c86f4c9ac20e0a765eb29f593dcfe5f2b57e5776ffde9ee387f4e75c807\\es_train_0/common_voice_es_18338585.mp3', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -3.25855126e-06, -3.52389725e-06, -3.05285812e-06]), 'sampling_rate': 48000}, 'sentence': '¿ Qué tal a tres de cinco ?', 'up_votes': 2, 'down_votes': 1, 'age': '', 'gender': '', 'accent': '', 'locale': 'es', 'segment': ''}


In [18]:
from datasets import Audio
common_voice = common_voice.remove_columns(['accent', 'age', 'client_id', 'down_votes', 'gender', 'locale', 'path', 'segment', 'up_votes'])
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
print(common_voice["train_full"][0])

{'audio': {'path': 'C:\\Users\\PC\\.cache\\huggingface\\datasets\\downloads\\extracted\\d23b4c86f4c9ac20e0a765eb29f593dcfe5f2b57e5776ffde9ee387f4e75c807\\es_train_0/common_voice_es_18338585.mp3', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
        7.67193796e-07, -4.82132691e-07, -3.25116753e-06]), 'sampling_rate': 16000}, 'sentence': '¿ Qué tal a tres de cinco ?'}


In [None]:
from transformers import WhisperProcessor
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small", language="Spanish", task="transcribe")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Spanish", task="transcribe")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Spanish", task="transcribe")

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [20]:
common_voice = common_voice.map(
    prepare_dataset,
    remove_columns=common_voice.column_names["train_full"],
    desc="Processing audio files",
)

Processing audio files:   0%|          | 0/230467 [00:00<?, ? examples/s]

Processing audio files:   0%|          | 0/15520 [00:00<?, ? examples/s]

Processing audio files:   0%|          | 0/15520 [00:00<?, ? examples/s]

Processing audio files:   0%|          | 0/245987 [00:00<?, ? examples/s]

# Training

In [22]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.generation_config.language = "es"
model.generation_config.task = "transcribe"
model.generation_config.force_decoder_ids = None

generation_config.json:   0%|          | 0.00/3.87k [00:00<?, ?B/s]

In [23]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


In [24]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

In [25]:
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}




Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    # evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    # load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)


In [32]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

  trainer = Seq2SeqTrainer(


In [33]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


Step,Training Loss
25,0.8723
50,0.8258
75,0.5987
100,0.4272
125,0.3144
150,0.2842
175,0.2862
200,0.2761
225,0.2936
250,0.2891


KeyboardInterrupt: 

# Upload to HuggingFace so it can be reused

In [None]:
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0", 
    "dataset_args": "",
    "language": "es",
    "model_name": "VoxLens - OpenAI Whisper Small Spanish",
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
}

trainer.push_to_hub(**kwargs)

# Test the model

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("brauliodev/voxlens")
processor = WhisperProcessor.from_pretrained("brauliodev/voxlens")

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="brauliodev/voxlens")

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe, 
    inputs=gr.Audio(source="microphone", type="filepath"), 
    outputs="text",
    title="Whisper Small Spanish",
    description="OpenAI Whisper Small Spanish model fine-tuned on Common Voice 11.0",
)

iface.launch()
