# **Finetune wav2vec2.0 on Librispeech-10min**

This notebook walks through a typical **Automatic Speech Recognition** (ASR) task, in which a model transcribes speech into text. We load the pretrained [wav2vec2](https://arxiv.org/pdf/2006.11477.pdf) model from [Hugging Face](https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/wav2vec2#overview) and finetune it by adding a linear classification layer, trained with the [CTC](https://distill.pub/2017/ctc/) loss, on top of it. For demonstration purposes, we experiment on a small dataset: a subset of [Librispeech](https://ieeexplore-ieee-org.kuleuven.e-bronnen.be/document/7178964) that contains just 10 min of speech.

Before running this notebook, please ensure that you are on GPU runtime (Runtime -> Change runtime type -> GPU).

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Jun 17 08:44:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   47C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## **Dataset preparation**

The original Librispeech dataset contains 960 h of training data and around 10 h each for the development and test sets. We have extracted the subset used here, which can be downloaded with the following command. It contains 10 min of training data, plus 7 min each of validation and test data.

In [None]:
!git clone https://github.com/XinnianZhao/wav2vec2-finetuning.git

Cloning into 'wav2vec2-finetuning'...
remote: Enumerating objects: 12093, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 12093 (delta 1), reused 9 (delta 1), pack-reused 12082 (from 1)
Receiving objects: 100% (12093/12093), 1.24 GiB | 9.51 MiB/s, done.
Resolving deltas: 100% (6/6), done.


Now we can have a look at a subfolder of the training data. As we can see, it holds several `flac` files and one text file with the transcripts of all those `flac` files.

In [None]:
!ls ./wav2vec2-finetuning/data/train/other/4959/28865

4959-28865-0000.flac  4959-28865-0004.flac  4959-28865-0008.flac
4959-28865-0001.flac  4959-28865-0005.flac  4959-28865-0009.flac
4959-28865-0002.flac  4959-28865-0006.flac  4959-28865-0010.flac
4959-28865-0003.flac  4959-28865-0007.flac  4959-28865.trans.txt


The following steps convert the `flac` files into numeric arrays and map each one to its transcript.

In [None]:
import os  # used below to derive the file id from the path
import soundfile as sf

REQUIRED_SAMPLE_RATE = 16000

def read_flac_file(file_path):
  with open(file_path, "rb") as f:
      audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
      raise ValueError(
          f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
      )
  file_id = os.path.split(file_path)[-1][:-len(".flac")]
  return {file_id: audio}

In [None]:
def read_txt_file(f):
  with open(f, "r") as f:
    samples = f.read().split("\n")
    samples = {s.split()[0]: " ".join(s.split()[1:]) for s in samples if len(s.split()) > 1}
  return samples

In [None]:
import os
def get_filelist(dir):
    Filelist = []
    for home, dirs, files in os.walk(dir):
        for filename in files:
            Filelist.append(os.path.join(home, filename))
    return Filelist

In [None]:
def fetch_sound_text_mapping(data_dir):
  all_files = get_filelist(data_dir)

  flac_files = [f for f in all_files if f.endswith(".flac")]
  txt_files = [f for f in all_files if f.endswith(".txt")]

  txt_samples = {}
  for f in txt_files:
    txt_samples.update(read_txt_file(f))

  speech_samples = {}
  for f in flac_files:
    speech_samples.update(read_flac_file(f))

  assert len(txt_samples) == len(speech_samples)

  samples = {}
  samples["speech"] = []
  samples["text"] = []

  for file_id in speech_samples.keys():
    # if len(speech_samples[file_id]) < AUDIO_MAXLEN:
    samples["speech"].append(speech_samples[file_id])
    samples["text"].append(txt_samples[file_id])
  # samples = [(speech_samples[file_id], txt_samples[file_id]) for file_id in speech_samples.keys() if len(speech_samples[file_id]) < AUDIO_MAXLEN]
  return samples

We then put all the data in a dictionary and further transform the dictionary to a `Dataset` class which can be loaded by the model.

In [None]:
# AUDIO_MAXLEN = 246000
# LABEL_MAXLEN = 256
train_dir = "./wav2vec2-finetuning/data/train"
dev_dir = "./wav2vec2-finetuning/data/dev-other"
test_dir = "./wav2vec2-finetuning/data/test-other"

Libri = {}
Libri["train"] = fetch_sound_text_mapping(train_dir)
Libri["dev"] = fetch_sound_text_mapping(dev_dir)
Libri["test"] = fetch_sound_text_mapping(test_dir)

In [None]:
Libri.keys()

dict_keys(['train', 'dev', 'test'])

In [None]:
!pip install datasets==1.18.3

Collecting datasets==1.18.3
  Downloading datasets-1.18.3-py3-none-any.whl.metadata (22 kB)
Downloading datasets-1.18.3-py3-none-any.whl (311 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 311.7/311.7 kB 10.1 MB/s eta 0:00:00
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
Successfully installed datasets-1.18.3


In [None]:
from datasets import Dataset, DatasetDict
import numpy as np
np.object = object

train_dataset = Dataset.from_dict(Libri["train"])
dev_dataset = Dataset.from_dict(Libri["dev"])
test_dataset = Dataset.from_dict(Libri["test"])

librispeech = DatasetDict({"train":train_dataset,"dev":dev_dataset,"test":test_dataset})
librispeech

DatasetDict({
    train: Dataset({
        features: ['speech', 'text'],
        num_rows: 52
    })
    dev: Dataset({
        features: ['speech', 'text'],
        num_rows: 209
    })
    test: Dataset({
        features: ['speech', 'text'],
        num_rows: 211
    })
})

Let's listen to a couple of audio files to better understand the dataset and verify that the speech is correctly mapped to its transcript.

**Note**: *You can run the following cell a few times to listen to different speech samples.*

In [None]:
import IPython.display as ipd
import random

# random.randint is inclusive on both ends, so subtract 1 to stay in range
rand_int = random.randint(0, len(librispeech["train"]) - 1)

print(librispeech["train"][rand_int]["text"])
ipd.Audio(data=np.asarray(librispeech["train"][rand_int]["speech"]), autoplay=True, rate=16000)

COURAGE HATTERAS SAID THE DOCTOR HANDING HIM THE WEAPON WHICH HE HAD CAREFULLY LOADED MEANWHILE NEVER FEAR BUT BE SURE YOU DON'T SHOW YOURSELVES TILL I FIRE THE DOCTOR SOON JOINED THE OLD BOATSWAIN BEHIND THE HUMMOCK AND TOLD HIM WHAT THEY HAD BEEN DOING


## **Data processing**

### **Build vocabulary**
You have probably noticed that the transcripts above contain no punctuation except for `'`. Punctuation marks do not correspond to a characteristic sound unit, and removing them makes classification easier. In English, however, we need to keep the `'` character to differentiate between words such as "it's" and "its", which have very different meanings.

The model still needs a vocabulary to map speech onto. In CTC, it is common to classify speech chunks into letters, so we will do the same here. We extract all distinct letters from the data and build our vocabulary from this set of letters.
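As a rough plain-Python sketch of the idea, the snippet below builds a character vocabulary from two made-up transcripts and encodes a sentence with it. The transcripts and the names `toy_vocab` and `encoded` are hypothetical, and the characters are sorted here only to make the ids reproducible; the notebook's own cells do not sort.

```python
# Toy transcripts standing in for the real data (hypothetical examples).
transcripts = ["IT'S A DOG", "ITS BONE"]

# Collect every distinct character; sorting is an addition of this sketch
# so that the ids come out the same on every run.
chars = sorted(set(" ".join(transcripts)))

# Assign one integer id per character.
toy_vocab = {ch: idx for idx, ch in enumerate(chars)}

# Encode a transcript as a list of ids, one per character.
encoded = [toy_vocab[ch] for ch in "ITS A DOG"]

print(toy_vocab)
print(encoded)
```

Each id is one output class of the linear layer that is added on top of wav2vec2.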

In [None]:
def extract_all_chars(batch):
  all_text = " ".join(batch["text"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
vocabs = librispeech.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=librispeech.column_names["train"])



  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [None]:
vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["dev"]["vocab"][0]))

In [None]:
vocab_dict = {v: k for k, v in enumerate(vocab_list)}

vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

The final vocabulary is shown below. To make it clearer that `" "` has its own token class, we replace it with the more visible character `|`. In addition, we add an "unknown" token so that the model can later deal with characters not encountered in the training set.

We also add a padding token that corresponds to CTC's "*blank token*". The "blank token" is a core component of the CTC algorithm.
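To illustrate why the blank token matters, here is a minimal sketch of the CTC collapsing rule used in greedy decoding: consecutive repeated tokens are merged first, and blanks are dropped afterwards. The function name `ctc_collapse` and the frame sequence are made up for illustration; this is not the tokenizer's actual implementation.

```python
from itertools import groupby

BLANK = "[PAD]"        # the padding token doubles as the CTC blank
WORD_DELIMITER = "|"   # visible stand-in for the space character

def ctc_collapse(frame_tokens):
    """Collapse a per-frame token sequence into text using the CTC rule."""
    # Step 1: merge runs of identical consecutive tokens into one token.
    merged = [token for token, _ in groupby(frame_tokens)]
    # Step 2: drop the blank tokens.
    kept = [token for token in merged if token != BLANK]
    # Step 3: turn the word delimiter back into a space.
    return "".join(kept).replace(WORD_DELIMITER, " ")

# Hypothetical per-frame predictions for the words "HELLO HI".
frames = ["[PAD]", "H", "H", "[PAD]", "E", "L", "L", "[PAD]", "L",
          "O", "O", "|", "|", "H", "I", "[PAD]"]
print(ctc_collapse(frames))  # HELLO HI
```

Note the blank between the two runs of `L`: without it, the repeated frames would merge into a single `L`, and a double letter could never be produced.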

In [None]:
vocab_dict

{"'": 0,
 'D': 1,
 'X': 2,
 'C': 3,
 'T': 4,
 'B': 5,
 'Q': 6,
 'E': 7,
 'F': 8,
 'W': 9,
 'N': 10,
 'L': 11,
 'J': 12,
 'Z': 14,
 'K': 15,
 'U': 16,
 'A': 17,
 'V': 18,
 'M': 19,
 'P': 20,
 'I': 21,
 'O': 22,
 'S': 23,
 'Y': 24,
 'G': 25,
 'R': 26,
 'H': 27,
 '|': 13,
 '[UNK]': 28,
 '[PAD]': 29}

The vocabulary can be saved as a json file for further use.

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

### **Process data**

In the following sections, we will make use of the `transformers` library from `Hugging Face`, which provides many convenient functions and classes for processing, training and decoding.

We follow the official `Hugging Face` documentation, which has clear instructions for finetuning a wav2vec2 model. We omit most of the notes here; you can find the explanation for each step in this [link](https://huggingface.co/blog/fine-tune-wav2vec2-english). If you are interested in the source code or want to check the meaning of the arguments of each class and function, you can find them [here](https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/wav2vec2#overview).

In [None]:
!pip install transformers



In [None]:
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)


**Note:** Since we will customize some processors and produce many models during training, uploading them directly to `Hugging Face` makes them much easier to reuse than saving them in Google Drive.

To do so, you need an account on `Hugging Face` (sign up [here](https://huggingface.co/join)). Generate an authentication token on the Hugging Face website and paste it into the dialog box in the next step.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [None]:
repo_name = "w2v2-libri-10min"
# tokenizer.push_to_hub(repo_name)

In [None]:
def prepare_dataset(batch):

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(batch["speech"], sampling_rate=16000).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

In [None]:
librispeech = librispeech.map(prepare_dataset, remove_columns=librispeech.column_names["train"], num_proc=4)

  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)


## **Training**

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
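The label handling inside the collator can be sketched in plain Python. The sketch below pads a batch of label sequences to the longest one and fills the padded positions with `-100`, the value that the collator's comment says is ignored by the loss. The function name `pad_labels` and the example ids are hypothetical; the real collator works on tensors through `processor.pad`.

```python
IGNORE_INDEX = -100  # padded label positions get this value so the loss skips them

def pad_labels(label_seqs, ignore_index=IGNORE_INDEX):
    """Pad every label sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in label_seqs)
    return [list(seq) + [ignore_index] * (longest - len(seq)) for seq in label_seqs]

# Two hypothetical label sequences of different lengths.
batch_labels = [[27, 7, 11, 11, 22], [27, 21]]
print(pad_labels(batch_labels))
```

Padding per batch, rather than to one global maximum length, keeps the batches as short as the data allows.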

In [None]:
!pip install jiwer

from datasets import load_metric

wer_metric = load_metric("wer")

Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading jiwer-3.1.0-py3-none-any.whl (22 kB)
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 36.1 MB/s eta 0:00:00
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.1.0 rapidfuzz-3.13.0


Downloading:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

In [None]:
import numpy as np

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

model

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.84k [00:00<?, ?B/s]



pytorch_model.bin:   0%|          | 0.00/380M [00:00<?, ?B/s]

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2GroupNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
          (activation): GELUActivation()
          (layer_norm): GroupNorm(512, 512, eps=1e-05, affine=True)
        )
        (1-4): 4 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder)

During finetuning, the feature encoder of wav2vec2 (the 7 convolution layers shown above) is frozen, so its weights are not updated.
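To get a feeling for what these 7 convolution layers do to the raw waveform, the following sketch computes how many output frames they produce for a given number of samples, using the kernel sizes and strides from the model printout. It assumes the convolutions use no padding (none is shown in the printout); the function names are our own.

```python
# (kernel_size, stride) of the 7 convolution layers, read off the model printout.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_output_frames(num_samples):
    """Number of frames left after all conv layers (assuming no padding)."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length

def receptive_field_and_hop():
    """Samples seen by one output frame, and samples between adjacent frames."""
    field, hop = 1, 1
    for kernel, stride in CONV_LAYERS:
        field += (kernel - 1) * hop
        hop *= stride
    return field, hop

print(num_output_frames(16000))   # 49 frames for 1 s of 16 kHz audio
print(receptive_field_and_hop())  # (400, 320), i.e. 25 ms windows every 20 ms
```

Each of these frames later receives one prediction from the linear layer, which is why CTC is needed to align many frames with far fewer letters.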

In [None]:
model.freeze_feature_encoder()

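During the first `warmup_steps` steps, the learning rate is ramped up from 0 to `learning_rate`. Note that in the `Trainer` this setting only affects the learning-rate schedule; it does not freeze any layers.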

If you have Colab Pro and access to a better GPU, you can increase the batch size (`per_device_train_batch_size`).

For demonstration purposes, the following parameters are not carefully tuned and will not result in an optimal model.
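As a rough illustration of how `warmup_steps` and `max_steps` shape the learning rate, the sketch below reproduces a linear warmup followed by a linear decay to zero. It assumes the `Trainer` keeps its default linear scheduler, and the function name `linear_schedule_lr` is our own; the default values mirror the arguments used in this notebook.

```python
def linear_schedule_lr(step, peak_lr=3e-4, warmup_steps=500, max_steps=2500):
    """Learning rate at a given step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp up linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: go down linearly from peak_lr to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)

for step in (0, 250, 500, 1500, 2500):
    print(step, linear_schedule_lr(step))
```

With these settings the learning rate peaks at step 500 and is back at half its peak value by step 1500.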

In [None]:
# !pip install --upgrade accelerate
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir=repo_name,
  group_by_length=True,
  per_device_train_batch_size=16,
  eval_strategy="steps",
  max_steps=2500,
  fp16=True,
  gradient_checkpointing=True,
  save_steps=250,
  eval_steps=250,
  logging_steps=250,
  learning_rate=3e-4,
  weight_decay=0.005,
  warmup_steps=500,
  save_total_limit=2,
  load_best_model_at_end=True,
  metric_for_best_model="wer",
  greater_is_better=False,
  push_to_hub=True,
)

In [None]:
librispeech["train"][0]

{'input_values': [-0.009483915753662586,
  -0.008955958299338818,
  -0.006844126153737307,
  -0.004732294473797083,
  -0.005788210313767195,
  -0.005788210313767195,
  -0.006844126153737307,
  -0.007900042459368706,
  -0.008955958299338818,
  -0.007900042459368706,
  -0.005260252393782139,
  -0.003148420713841915,
  -0.002620462793856859,
  -0.002620462793856859,
  -0.004204336553812027,
  -0.004732294473797083,
  -0.005260252393782139,
  -0.004732294473797083,
  -0.003676378633826971,
  -0.002620462793856859,
  0.0005472851335071027,
  0.004242991097271442,
  0.007410738617181778,
  0.009522570297122002,
  0.009522570297122002,
  0.009522570297122002,
  0.010050528682768345,
  0.011106444522738457,
  0.011106444522738457,
  0.012162360362708569,
  0.012162360362708569,
  0.011106444522738457,
  0.011634401977062225,
  0.012162360362708569,
  0.010578486137092113,
  0.011634401977062225,
  0.012162360362708569,
  0.01321827620267868,
  0.01321827620267868,
  0.010050528682768345,
  0.0

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=librispeech["train"],
    eval_dataset=librispeech["dev"],
    tokenizer=processor.feature_extractor,
)

  trainer = Trainer(


### **Training**

**Note:** Training will take around 120 minutes, so make sure that your training doesn't stop due to inactivity. A simple hack to prevent this is to paste the following code into the console of this tab (*right mouse click -> inspect -> Console tab and insert code*).




```javascript
function ConnectButton(){
    console.log("Connect pushed");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click()
}
setInterval(ConnectButton,60000);
```

In [None]:
trainer.train()



<IPython.core.display.Javascript object>

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:


Abort: 

All the related files are uploaded to `Hugging Face`. You can view them in a browser at https://huggingface.co/yourUsername/w2v2-libri-10min.


In [None]:
trainer.push_to_hub()

## **Evaluation**

### **Greedy decoding**

Let's load the model we trained and evaluate it. Replace the path below with your own `Hugging Face` repository, which is `"yourUsername/w2v2-libri-10min"` if you did not change `repo_name` in the earlier step.

If you haven't finished the model training, you can also run the decoding by using the default path given.

In [None]:
processor = Wav2Vec2Processor.from_pretrained("Xinnian/w2v2-libri-10min")
model = Wav2Vec2ForCTC.from_pretrained("Xinnian/w2v2-libri-10min").cuda()

**With CTC**

In [None]:
def map_to_result(batch):
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch

In [None]:
results_dev = librispeech["dev"].map(map_to_result, remove_columns=librispeech["dev"].column_names)
results_test = librispeech["test"].map(map_to_result, remove_columns=librispeech["test"].column_names)

In [None]:
print("dev WER: {:.3f}".format(wer_metric.compute(predictions=results_dev["pred_str"], references=results_dev["text"])))
print("test WER: {:.3f}".format(wer_metric.compute(predictions=results_test["pred_str"], references=results_test["text"])))
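For reference, the word error rate (WER) printed above is, roughly, the number of word-level substitutions, deletions and insertions needed to turn the predictions into the references, divided by the total number of reference words. The following is a minimal pure-Python sketch of that definition, not the `jiwer` implementation that the metric relies on; the function names are our own.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors, total = 0, 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += word_edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return errors / total

print(corpus_wer(["THE CAT SAT"], ["THE CAT SIT"]))
```

A WER of 0 means every word matches; since insertions count as errors, the value can exceed 1.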

In [None]:
results_test["pred_str"][0]

**Without CTC**

In [None]:
with torch.no_grad():
  logits = model(torch.tensor(librispeech["test"][:1]["input_values"], device="cuda")).logits

pred_ids = torch.argmax(logits, dim=-1)

# convert ids to tokens
" ".join(processor.tokenizer.convert_ids_to_tokens(pred_ids[0].tolist()))