This notebook is modified from the "Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers" Colab created by Sanchit Gandhi, available at https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb. Our work consists only of the modifications to the original notebook.

# GPU

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Dec  5 06:58:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    21W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Imports

In [None]:
# Use datasets to download and prepare the training data, and transformers to load and train the Whisper model.
# Version specifiers are quoted so the shell does not treat ">=" as an output redirect.
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-uzweyubl
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-uzweyubl
ERROR: Operation cancelled by user


In [None]:

import os

# Read the access token from the environment instead of hard-coding a secret.
# Assumes the token has been exported as HF_TOKEN before running this cell.
token = os.environ["HF_TOKEN"]

# import the relevant function for logging in
from huggingface_hub import login

# log in and save the token locally for later Hub calls
login(token=token)

In [None]:
import pickle
from datasets import Audio
from datasets import Dataset
from datasets import Features

# Load Data

## Load WhisperFeatureExtractor

Load the feature extractor from the pre-trained checkpoint with its default values.

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny.en")

## Load WhisperTokenizer

The Whisper model outputs a sequence of token ids.

The tokenizer maps each of these token ids to its corresponding text string.

We will load the pre-trained tokenizer and use it for fine-tuning without any further modifications.

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en", language="English", task="transcribe")
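
To make the id-to-text mapping concrete, here is a minimal sketch with a toy vocabulary. The ids, strings, and the `decode` helper are invented for illustration and are not the real Whisper vocabulary or API; the sketch only shows the idea of looking up each id and optionally dropping special tokens, as `skip_special_tokens=True` does later in this notebook.

```python
# Toy vocabulary: ids and strings are made up for illustration only.
TOY_VOCAB = {
    0: "<|startoftranscript|>",
    1: "<|endoftext|>",
    2: "hello",
    3: " world",
}
SPECIAL_IDS = {0, 1}  # ids treated as special tokens in this toy example


def decode(token_ids, skip_special_tokens=True):
    """Map token ids to text, optionally dropping special tokens."""
    pieces = [
        TOY_VOCAB[i]
        for i in token_ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return "".join(pieces)


print(decode([0, 2, 3, 1]))  # hello world
```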

## Combine To Create A WhisperProcessor

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en", language="English", task="transcribe")

## Load Dataset from Hub

In [None]:
from datasets import load_dataset
from datasets import DownloadConfig

laugm_train = load_dataset(r"DTU54DL/librispeech5k-augmentated-train-prepared", download_config=DownloadConfig(delete_extracted=True))
laugm_val = load_dataset(r"DTU54DL/librispeech-augmentated-validation-prepared", download_config=DownloadConfig(delete_extracted=True))

In [None]:
train = laugm_train["train.360"].select(range(0, 500))
test = laugm_val["validation"].select(range(0, 100))

# Feature extraction

In [None]:
def prepare_dataset(batch):
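    # read the audio (array and sampling rate); no resampling is done here, so the
    # audio is assumed to already be at the 16 kHz the Whisper feature extractor expects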
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

In [None]:
train = train.map(prepare_dataset)

In [None]:
test = test.map(prepare_dataset)
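
As a rough check on what `prepare_dataset` produces, the sketch below works out the expected shape of one `input_features` entry. It assumes the feature extractor's default settings (16 kHz sampling rate, 30-second chunks, a hop length of 160 samples, 80 Mel bins). These values are not stated in this notebook, so compare them against the attributes of `feature_extractor` if in doubt. The helper name `expected_feature_shape` is hypothetical.

```python
# Assumed WhisperFeatureExtractor defaults (not set explicitly in this notebook).
SAMPLING_RATE = 16_000   # samples per second
CHUNK_LENGTH_S = 30      # every clip is padded or truncated to 30 s
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # Mel filterbank channels


def expected_feature_shape(sampling_rate=SAMPLING_RATE,
                           chunk_length_s=CHUNK_LENGTH_S,
                           hop_length=HOP_LENGTH,
                           n_mels=N_MELS):
    """Return (n_mels, n_frames) for one padded or truncated clip."""
    n_samples = sampling_rate * chunk_length_s
    n_frames = n_samples // hop_length
    return n_mels, n_frames


print(expected_feature_shape())  # (80, 3000)
```

Because every clip is padded or truncated to the same length, all examples should share this shape regardless of the original audio duration.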

# Training and Evaluation

We'll follow these steps:

* Define a **data collator**: the data collator takes the pre-processed data and prepares PyTorch tensors ready for the model.
* **Evaluation metrics**: during evaluation, we score the model with the word error rate (WER) metric. We need to define a `compute_metrics` function that handles this computation.
* **Load a pre-trained checkpoint**: load a pre-trained checkpoint and configure it correctly for training.
* Define the **training configuration**: this will be used by the **Trainer** to define the training schedule.

After fine-tuning the model, we evaluate it on the test data to verify that it has been trained correctly to transcribe speech.

## Define Data Collator

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was prepended in the previous tokenization step,
        # cut it here, as it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

Initialise the data collator defined above:

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
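
The label handling inside the collator can be sketched in plain Python, without tensors. This is a simplified illustration under stated assumptions, not the collator itself: `pad_and_mask_labels` is a hypothetical helper, the token ids are invented, and padding is written directly as -100 instead of going through the tokenizer's pad token and attention mask.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def pad_and_mask_labels(label_seqs, bos_token_id=None, ignore_index=IGNORE_INDEX):
    """Right-pad label sequences to equal length using ignore_index.

    If every sequence starts with bos_token_id, drop that first token,
    mirroring the BOS check in the collator.
    """
    max_len = max(len(seq) for seq in label_seqs)
    padded = [list(seq) + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]
    if bos_token_id is not None and all(row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Hypothetical token ids, with 1 standing in for the BOS token.
batch = pad_and_mask_labels([[1, 7, 8, 9], [1, 5]], bos_token_id=1)
print(batch)  # [[7, 8, 9], [5, -100, -100]]
```

Positions filled with -100 contribute nothing to the loss, so the shorter transcript is not penalised for its padding.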

## Evaluation Metrics

In [None]:
import evaluate

metric = evaluate.load("wer")

Define a function that takes the model predictions and returns the WER metric.

* It first replaces -100 with the pad_token_id in the label_ids (undoing the step we applied in the data collator to ignore padded tokens correctly in the loss).

* It then decodes the predicted and label ids to strings. 

* Finally, it computes the WER between the predictions and reference labels:

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
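
For intuition about what the metric measures, the following is a minimal pure-Python sketch of WER: the word-level edit distance (substitutions, deletions, insertions) summed over all examples and divided by the total number of reference words. It is intended to mirror the usual corpus-level definition, not to reproduce the `evaluate` implementation. `word_error_rate` is a hypothetical name, and it applies no text normalisation beyond splitting on whitespace.

```python
def word_error_rate(predictions, references):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_ref_words = 0
    for pred, ref in zip(predictions, references):
        hyp_words, ref_words = pred.split(), ref.split()
        # Levenshtein distance over words, keeping only the previous row.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # reference word deleted
                                curr[j - 1] + 1,      # extra word inserted
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_ref_words += len(ref_words)
    return total_edits / total_ref_words


# One missing word out of six reference words.
print(100 * word_error_rate(["the cat sat on mat"], ["the cat sat on the mat"]))
```

Since insertions count as errors, WER can exceed 100% when the prediction is much longer than the reference.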

## Define Training Configuration

**Final step**: define all parameters related to training.

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="/whisper-tiny-laugm",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

**Note**: if one does not want to upload the model checkpoints to the Hub, set push_to_hub=False.
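
To see what these settings imply for the training budget, the sketch below computes the effective batch size and the approximate number of passes over the training set. `training_budget` is a hypothetical helper, a single GPU is assumed, and the epoch count is approximate because it ignores the partially filled last batch of each epoch.

```python
def training_budget(num_train_examples, per_device_batch_size,
                    gradient_accumulation_steps, max_steps, num_devices=1):
    """Return (effective batch size, samples seen, approximate epochs)."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    samples_seen = effective_batch * max_steps
    approx_epochs = samples_seen / num_train_examples
    return effective_batch, samples_seen, approx_epochs


# Values taken from this notebook: 500 training examples, batch size 16,
# no gradient accumulation, 4000 optimizer steps, one GPU (assumed).
print(training_budget(500, 16, 1, 4000))  # (16, 64000, 128.0)
```

With the 500-example subset selected earlier, 4000 steps corresponds to roughly 128 passes over the data. Halving the batch size while doubling `gradient_accumulation_steps` leaves the effective batch size, and therefore this estimate, unchanged.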

Forward the training arguments to the Trainer along with the model,
dataset, data collator and `compute_metrics` function.

## Load a Pre-Trained Checkpoint 

Override the generation arguments: no tokens are forced as decoder outputs (see `forced_decoder_ids`) and no tokens are suppressed during generation (see `suppress_tokens`):

In [None]:
# load the pre-trained Whisper tiny (English-only) checkpoint.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train,
    eval_dataset=test,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    
)

## Training

Training will take approximately 5-10 hours, depending on the GPU allocated to this Google Colab. If you are using this Google Colab directly to fine-tune a Whisper model, make sure that training is not interrupted due to inactivity.

A simple workaround to prevent this is to paste the following code into the browser console of this tab (right mouse click -> inspect -> Console tab -> insert code).

```javascript
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton, 60000);
```

In [None]:
trainer.train()
torch.cuda.empty_cache() 