# Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers - Training

## Prepare Environment

First of all, let's try to secure a decent GPU for our Colab! Unfortunately, it's becoming much harder to get access to a good GPU with the free version of Google Colab. However, with Google Colab Pro one should have no issues in being allocated a V100 or P100 GPU.

To get a GPU, click _Runtime_ -> _Change runtime type_, then change _Hardware accelerator_ from _None_ to _GPU_.

We can verify that we've been assigned a GPU and view its specifications:

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Dec  8 13:51:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    32W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
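
Outside a notebook, the `!nvidia-smi` shell escape is not available. As a rough sketch, the same check can be written in plain Python with the standard library; the helper name `query_gpu` and the 10-second timeout are our own choices for illustration, not part of the original notebook:

```python
import shutil
import subprocess
from typing import Optional


def query_gpu(timeout_s: float = 10.0) -> Optional[str]:
    """Return the text printed by `nvidia-smi`, or None if no GPU tool responds."""
    # `nvidia-smi` is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # A non-zero exit code means the tool could not report on a GPU.
    return result.stdout if result.returncode == 0 else None


gpu_info = query_gpu()
print(gpu_info if gpu_info is not None else "Not connected to a GPU")
```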

Next, we need to update the Unix package `ffmpeg` to version 4:

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt update
!apt install -y ffmpeg

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:6 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:7 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:9 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:11 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease [15.9 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [83.3 kB]
Hit:13 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:14 http://ppa.launchpad.net

We'll employ several popular Python packages to fine-tune the Whisper model.
We'll use `datasets` to download and prepare our training data and 
`transformers` to load and train our Whisper model. We'll also require
the `librosa` package to pre-process audio files, and `evaluate` and `jiwer` to
assess the performance of our model. Finally, we'll
use `gradio` to build a flashy demo of our fine-tuned model and `hopsworks`
to download the dataset we stored on Hopsworks.

In [None]:
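# Quote version specifiers so the shell does not treat ">" as output redirection
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
# Written as 0.3.0: "0.30" would be read as version 0.30, not 0.3.0
!pip install "evaluate>=0.3.0"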
!pip install jiwer
!pip install gradio
!pip install hopsworks

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-1usu_qq5
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-1usu_qq5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     |████████████████████████████████| 7.6 MB 35.9 MB/s 
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-4.26.0.dev0-py3-none-any.whl size=5938687 sha256=a65f60483c62d8d3f1cf09fbdf02b0bf1fef26b708d4b91a182b365b18f5e206
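
Version specifiers are easy to get wrong on a shell command line, so it can be worth confirming what actually got installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is ours, and the package names are the ones from the install cell above:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the install cell above.
report = installed_versions(
    ["datasets", "transformers", "librosa", "evaluate", "jiwer", "gradio", "hopsworks"]
)
for name, ver in report.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```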


We strongly advise you to upload model checkpoints directly to the [Hugging Face Hub](https://huggingface.co/) 
whilst training. The Hub provides:
- Integrated version control: you can be sure that no model checkpoint is lost during training.
- Tensorboard logs: track important metrics over the course of training.
- Model cards: document what a model does and its intended use cases.
- Community: an easy way to share and collaborate with the community!

Linking the notebook to the Hub is straightforward - it simply requires entering your 
Hub authentication token when prompted. Find your Hub authentication token [here](https://huggingface.co/settings/tokens):

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


In [None]:
import hopsworks
project = hopsworks.login()

Copy your Api Key (first register/login): https://c.app.hopsworks.ai/account/api/generated

Paste it here: ··········
Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/5279


## Load Dataset

Using 🤗 Datasets, downloading and preparing data is extremely simple. 
We can download and prepare the Common Voice splits in just one line of code. 

First, ensure you have accepted the terms of use on the Hugging Face Hub: [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.

To make the most of the available Swedish (`sv-SE`) data, we'll combine the 
`train` and `validation` splits to give 12,360 training examples. We'll use 
the 5,069 examples of `test` data as our held-out test set:

In [None]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="test", use_auth_token=True)

print(common_voice)

Downloading and preparing dataset common_voice_11_0/sv-SE to /root/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/sv-SE/11.0.0/f8e47235d9b4e68fa24ed71d63266a02018ccf7194b2a8c9c598a5f3ab304d9f...

Generating train split: 0 examples [00:00, ? examples/s]
Reading metadata...: 7308it [00:00, 99698.40it/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Reading metadata...: 5052it [00:00, 125676.13it/s]

Generating test split: 0 examples [00:00, ? examples/s]
Reading metadata...: 5069it [00:00, 84182.02it/s]

Generating other split: 0 examples [00:00, ? examples/s]
Reading metadata...: 5699it [00:00, 98568.44it/s]

Generating invalidated split: 0 examples [00:00, ? examples/s]
Reading metadata...: 1346it [00:00, 66334.53it/s]

Dataset common_voice_11_0 downloaded and prepared to /root/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/sv-SE/11.0.0/f8e47235d9b4e68fa24ed71d63266a02018ccf7194b2a8c9c598a5f3ab304d9f. Subsequent calls will reuse this data.

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 12360
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 5069
    })
})


Most ASR datasets only provide input audio samples (`audio`) and the 
corresponding transcribed text (`sentence`). Common Voice contains additional 
metadata information, such as `accent` and `locale`, which we can disregard for ASR.
Keeping the notebook as general as possible, we only consider the input audio and
transcribed text for fine-tuning, discarding the additional metadata information:

In [None]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 12360
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 5069
    })
})
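
Instead of listing every column to remove by hand, we could derive the list from the columns we want to keep. The sketch below works on plain lists so that it runs without downloading anything; the helper `columns_to_drop` is a name we made up for illustration, and the feature names are copied from the printed `DatasetDict` above:

```python
from typing import Iterable, List


def columns_to_drop(all_columns: Iterable[str], keep: Iterable[str]) -> List[str]:
    """Return every column name that is not in `keep`, preserving order."""
    keep_set = set(keep)
    return [name for name in all_columns if name not in keep_set]


# Feature names copied from the printed DatasetDict above.
features = ["client_id", "path", "audio", "sentence", "up_votes", "down_votes",
            "age", "gender", "accent", "locale", "segment"]

drop = columns_to_drop(features, keep=["audio", "sentence"])
print(drop)
```

With the real dataset, the column list would come from `common_voice["train"].column_names`, and the result could be passed to `common_voice.remove_columns(drop)`.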


## Prepare Feature Extractor, Tokenizer and Data

The ASR pipeline can be decomposed into three stages: 
1) A feature extractor which pre-processes the raw audio inputs
2) The model which performs the sequence-to-sequence mapping 
3) A tokenizer which post-processes the model outputs to text format

In 🤗 Transformers, the Whisper model has an associated feature extractor and tokenizer, 
called [WhisperFeatureExtractor](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperFeatureExtractor)
and [WhisperTokenizer](https://huggingface.co/docs/transformers/main/model_doc/whisper#transformers.WhisperTokenizer) 
respectively.

We'll go through the details of setting up the feature extractor and tokenizer one by one!

### Load WhisperFeatureExtractor

The Whisper feature extractor performs two operations:
1. Pads / truncates the audio inputs to 30s: any audio inputs shorter than 30s are padded to 30s with silence (zeros), and those longer than 30s are truncated to 30s
2. Converts the audio inputs to _log-Mel spectrogram_ input features, a visual representation of the audio and the form of the input expected by the Whisper model
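
To make the first operation concrete, here is a small dependency-free sketch of padding and truncating to a fixed 30s window. It assumes a 16 kHz sampling rate (the rate Whisper works with, not stated in the text above) and uses plain Python lists; the helper `pad_or_truncate` is our own illustration, and the log-Mel conversion in step 2 is left to the library:

```python
from typing import List

SAMPLING_RATE = 16_000   # assumed: Whisper expects 16 kHz audio
MAX_SECONDS = 30         # fixed input window described above
N_SAMPLES = SAMPLING_RATE * MAX_SECONDS  # 480,000 samples


def pad_or_truncate(samples: List[float], n_samples: int = N_SAMPLES) -> List[float]:
    """Zero-pad short clips and cut long clips so the result has exactly n_samples."""
    if len(samples) >= n_samples:
        return samples[:n_samples]
    return samples + [0.0] * (n_samples - len(samples))


short_clip = [0.1] * (5 * SAMPLING_RATE)    # 5 s of audio
long_clip = [0.1] * (45 * SAMPLING_RATE)    # 45 s of audio

print(len(pad_or_truncate(short_clip)), len(pad_or_truncate(long_clip)))
```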

<figure>
<img src="https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/spectrogram.jpg" alt="Trulli" style="width:100%">
<figcaption align = "center"><b>Figure 2:</b> Conversion of sampled audio array to log-Mel spectrogram.
Left: sampled 1-dimensional audio signal. Right: corresponding log-Mel spectrogram. Figure source:
<a href="https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html">Google SpecAugment Blog</a>.
</figcaption>
</figure>

We'll load the feature extractor from the pre-trained checkpoint with the default values:

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

Downloading:   0%|          | 0.00/185k [00:00<?, ?B/s]

### Load WhisperTokenizer

The Whisper model outputs a sequence of _token ids_. The tokenizer maps each of these token ids to its corresponding text string. For Swedish, we can load the pre-trained tokenizer and use it for fine-tuning without any further modifications. We simply have to 
specify the target language and the task. These arguments tell the 
tokenizer to prefix the language and task tokens to the start of encoded 
label sequences:

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Swedish", task="transcribe")

Downloading:   0%|          | 0.00/829 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/494k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.11k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.06k [00:00<?, ?B/s]
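
To picture what "prefixing the language and task tokens" means, the sketch below assembles the special tokens as strings. It is an illustration only: the helper names are ours, the example sentence is made up, and real token ids come from the tokenizer itself rather than from this code:

```python
from typing import List


def label_prefix(language_code: str, task: str, timestamps: bool = False) -> List[str]:
    """Build the special tokens placed before the transcript text."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prefix = ["<|startoftranscript|>", f"<|{language_code}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


def render_labels(text: str, language_code: str, task: str) -> str:
    """Show what a decoded label sequence looks like with special tokens kept."""
    return "".join(label_prefix(language_code, task)) + text + "<|endoftext|>"


# "Hej världen" is a made-up example sentence.
print(render_labels("Hej världen", language_code="sv", task="transcribe"))
```

With the real tokenizer, decoding an encoded sentence via `tokenizer.decode(tokenizer("Hej världen").input_ids, skip_special_tokens=False)` should show the same layout.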

### Combine To Create A WhisperProcessor

To simplify using the feature extractor and tokenizer, we can _wrap_ 
both into a single `WhisperProcessor` class. This processor object 
holds a `WhisperFeatureExtractor` and a `WhisperTokenizer` (available as 
`processor.feature_extractor` and `processor.tokenizer`), 
and can be used on the audio inputs and model predictions as required. 
In doing so, we only need to keep track of two objects during training: 
the `processor` and the `model`:

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Swedish", task="transcribe")

## Load Dataset from Hopsworks

First, let's download the dataset from Hopsworks.

In [None]:
from datasets import load_dataset, DatasetDict

dataset_api = project.get_dataset_api()

In [None]:
import os

# Create the local folder layout that `DatasetDict.load_from_disk` expects
os.makedirs("common_voice/train", exist_ok=True)
os.makedirs("common_voice/test", exist_ok=True)

In [None]:
downloaded_file_path = dataset_api.download(
    "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice/train/dataset.arrow.zip", local_path="./common_voice/train", overwrite=True)

Downloading: 0.000%|          | 0/1120412077 elapsed<00:00 remaining<?

In [None]:
downloaded_file_path = dataset_api.download(
    "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice/train/dataset_info.json", local_path="./common_voice/train", overwrite=True)

Downloading: 0.000%|          | 0/1143 elapsed<00:00 remaining<?

In [None]:
downloaded_file_path = dataset_api.download(
    "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice/train/state.json", local_path="./common_voice/train", overwrite=True)

Downloading: 0.000%|          | 0/289 elapsed<00:00 remaining<?

In [None]:
downloaded_file_path = dataset_api.download(
    "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice/test/dataset.arrow", local_path="./common_voice/test", overwrite=True)

Downloading: 0.000%|          | 0/4868700344 elapsed<00:00 remaining<?

In [None]:
downloaded_file_path = dataset_api.download(
    "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice/test/dataset_info.json", local_path="./common_voice/test", overwrite=True)

Downloading: 0.000%|          | 0/1143 elapsed<00:00 remaining<?

In [None]:
downloaded_file_path = dataset_api.download(
    "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice/test/state.json", local_path="./common_voice/test", overwrite=True)

Downloading: 0.000%|          | 0/289 elapsed<00:00 remaining<?

In [None]:
downloaded_file_path = dataset_api.download(
    "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice/dataset_dict.json", local_path="./common_voice", overwrite=True)

Downloading: 0.000%|          | 0/29 elapsed<00:00 remaining<?
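
The seven download cells above differ only in the remote file and the local folder, so they could be collapsed into one loop. The sketch below is a hedged illustration: `download_all` and `_RecordingApi` are names we invented, the remote paths are copied from the cells above, and the stand-in class exists only so the loop can be run without a Hopsworks connection:

```python
from typing import List, Tuple

# Remote project path copied from the cells above.
REMOTE_ROOT = "/Projects/chyilin0/chyilin0_Training_Datasets/common_voice"

# (path relative to the dataset root, local destination directory)
FILES: List[Tuple[str, str]] = [
    ("train/dataset.arrow.zip", "./common_voice/train"),
    ("train/dataset_info.json", "./common_voice/train"),
    ("train/state.json", "./common_voice/train"),
    ("test/dataset.arrow", "./common_voice/test"),
    ("test/dataset_info.json", "./common_voice/test"),
    ("test/state.json", "./common_voice/test"),
    ("dataset_dict.json", "./common_voice"),
]


def download_all(api, remote_root: str = REMOTE_ROOT, files=FILES) -> List[str]:
    """Call `api.download` once per file and return the remote paths requested."""
    requested = []
    for relative_path, local_dir in files:
        remote_path = f"{remote_root}/{relative_path}"
        api.download(remote_path, local_path=local_dir, overwrite=True)
        requested.append(remote_path)
    return requested


class _RecordingApi:
    """Stand-in used only to demonstrate the loop without a Hopsworks connection."""

    def __init__(self):
        self.calls = []

    def download(self, path, local_path=None, overwrite=False):
        self.calls.append((path, local_path, overwrite))


demo_api = _RecordingApi()
requested = download_all(demo_api)
print(len(requested), "files requested")
```

In the notebook itself, the call would be `download_all(dataset_api)` with the `dataset_api` object obtained from the project.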

Then unzip the compressed training dataset and load the whole dataset:

In [None]:
from zipfile import ZipFile

# Note: extractall() without a path argument unpacks into the current working directory
with ZipFile("./common_voice/train/dataset.arrow.zip", "r") as archive:
  archive.extractall()

In [None]:
#Load the downloaded Hugging Face dataset from local disk 
from datasets import load_dataset, DatasetDict
common_voice = DatasetDict.load_from_disk("common_voice")



## Training and Evaluation

Now that we've prepared our data, we're ready to dive into the training pipeline. 
The [🤗 Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer)
will do much of the heavy lifting for us. All we have to do is:

- Define a data collator: the data collator takes our pre-processed data and prepares PyTorch tensors ready for the model.

- Evaluation metrics: during evaluation, we want to evaluate the model using the [word error rate (WER)](https://huggingface.co/metrics/wer) metric. We need to define a `compute_metrics` function that handles this computation.

- Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.

- Define the training configuration: this will be used by the 🤗 Trainer to define the training schedule.

Once we've fine-tuned the model, we will evaluate it on the test data to verify that we have correctly trained it 
to transcribe speech in Swedish.

### Define a Data Collator

The data collator for a sequence-to-sequence speech model is unique in the sense that it 
treats the `input_features` and `labels` independently: the  `input_features` must be 
handled by the feature extractor and the `labels` by the tokenizer.

The `input_features` are already padded to 30s and converted to a log-Mel spectrogram 
of fixed dimension by action of the feature extractor, so all we have to do is convert the `input_features`
to batched PyTorch tensors. We do this using the feature extractor's `.pad` method with `return_tensors="pt"`.

The `labels` on the other hand are un-padded. We first pad the sequences
to the maximum length in the batch using the tokenizer's `.pad` method. The padding tokens 
are then replaced by `-100` so that these tokens are **not** taken into account when 
computing the loss. We then cut the BOS token from the start of the label sequence as we 
append it later during training.

We can leverage the `WhisperProcessor` we defined earlier to perform both the 
feature extractor and the tokenizer operations:

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was added in the previous tokenization step,
        # cut it here, since it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
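
To see the label handling on concrete numbers, here is a dependency-free sketch that mirrors the three label steps (pad, mask with `-100`, strip a shared leading BOS) on toy token ids. The function `prepare_labels` and the ids used are hypothetical and only serve as an illustration of the collator logic above:

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the loss, as described above


def prepare_labels(sequences: List[List[int]], pad_id: int, bos_id: int) -> List[List[int]]:
    """Pad to the longest sequence, mask padding with -100, drop a shared leading BOS."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    # Mask by attention mask, not by comparing against pad_id.
    labels = [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, row_mask)]
        for row, row_mask in zip(padded, mask)
    ]
    # Only strip the first column when every sequence starts with BOS.
    if all(row[0] == bos_id for row in labels):
        labels = [row[1:] for row in labels]
    return labels


# Hypothetical ids: 1 = BOS, 0 = PAD, everything else stands for text tokens.
batch = prepare_labels([[1, 7, 8, 9], [1, 5]], pad_id=0, bos_id=1)
print(batch)
```

Masking on the attention mask rather than on the padding id matters whenever the padding id can also occur as a genuine token in a label sequence.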

Let's initialise the data collator we've just defined:

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation Metrics

We'll use the word error rate (WER) metric, the 'de-facto' metric for assessing 
ASR systems. For more information, refer to the WER [docs](https://huggingface.co/metrics/wer). We'll load the WER metric from 🤗 Evaluate:

In [None]:
import evaluate

metric = evaluate.load("wer")

Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

We then simply have to define a function that takes our model 
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the 
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
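
For intuition about what the metric measures, WER is the number of word substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below implements that textbook definition in plain Python; it is not the code used by 🤗 Evaluate, applies no text normalisation, and the Swedish example sentences are made up:

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references: List[str], predictions: List[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), p.split()) for r, p in zip(references, predictions))
    n_words = sum(len(r.split()) for r in references)
    return errors / n_words


refs = ["jag heter anna", "det regnar idag"]
hyps = ["jag heter anna", "det regnar i dag"]
print(f"WER = {100 * word_error_rate(refs, hyps):.2f}%")
```

Here the second prediction splits one reference word into two, which counts as one substitution plus one insertion: 2 errors over 6 reference words.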

### Load a Pre-Trained Checkpoint

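Now let's load a Whisper `small` checkpoint. Rather than starting from 
`openai/whisper-small`, the cell below loads `Yilin98/whisper-small-hi`, our own 
Hub repository, which holds the Whisper `small` weights from an earlier 
fine-tuning session so that training can pick up where it left off. Again, this 
is trivial through use of 🤗 Transformers!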

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("Yilin98/whisper-small-hi")

Downloading:   0%|          | 0.00/1.03k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/967M [00:00<?, ?B/s]

Override generation arguments - no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)), no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)):

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

### Define the Training Configuration

In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    num_train_epochs=1,  # ignored here: max_steps below takes precedence
    output_dir="/content/checkpoint/whisper-small-hi",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

**Note**: if one does not want to upload the model checkpoints to the Hub, 
set `push_to_hub=False`.

We can forward the training arguments to the 🤗 Trainer along with our model,
dataset, data collator and `compute_metrics` function:

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

Cloning https://huggingface.co/Yilin98/whisper-small-hi into local empty directory.



max_steps is given, it will override any value given in num_train_epochs
Using cuda_amp half precision backend


We'll save the processor object once before starting training. Since the processor is not trainable, it won't change over the course of training:

In [None]:
processor.save_pretrained(training_args.output_dir, push_to_hub=True)

Feature extractor saved in /content/checkpoint/whisper-small-hi/preprocessor_config.json
tokenizer config file saved in /content/checkpoint/whisper-small-hi/tokenizer_config.json
Special tokens file saved in /content/checkpoint/whisper-small-hi/special_tokens_map.json
added tokens file saved in /content/checkpoint/whisper-small-hi/added_tokens.json
Uploading the following files to Yilin98/whisper-small-hi: normalizer.json,preprocessor_config.json,merges.txt,tokenizer_config.json,vocab.json,special_tokens_map.json,added_tokens.json


### Training

Training will take approximately 5-10 hours depending on your GPU or the one 
allocated to this Google Colab. If using this Google Colab directly to 
fine-tune a Whisper model, you should make sure that training isn't 
interrupted due to inactivity. A simple workaround to prevent this is 
to paste the following code into the console of this tab (_right mouse click_ 
-> _inspect_ -> _Console tab_ -> _insert code_).

```javascript
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton, 60000);
```

The peak GPU memory for the given training configuration is approximately 15.8GB. 
Depending on the GPU allocated to the Google Colab, it is possible that you will encounter a CUDA `"out-of-memory"` error when you launch training. 
In this case, you can reduce the `per_device_train_batch_size` incrementally by factors of 2 
and employ [`gradient_accumulation_steps`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments.gradient_accumulation_steps)
to compensate.
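
The trade-off can be tabulated with a few lines of Python. This is a minimal sketch assuming a single GPU, so that the effective batch size is simply the product of the two settings; the helper name `batch_size_options` is ours:

```python
from typing import List, Tuple


def batch_size_options(target_batch: int, smallest_per_device: int = 1) -> List[Tuple[int, int]]:
    """List (per_device_train_batch_size, gradient_accumulation_steps) pairs
    whose product stays equal to target_batch, halving the batch each time."""
    options = []
    per_device, accumulation = target_batch, 1
    while per_device >= smallest_per_device:
        options.append((per_device, accumulation))
        if per_device % 2:  # stop once the batch can no longer be halved evenly
            break
        per_device //= 2
        accumulation *= 2
    return options


for per_device, accumulation in batch_size_options(16):
    print(f"per_device_train_batch_size={per_device}, "
          f"gradient_accumulation_steps={accumulation}")
```

Each row keeps the effective batch size at 16, so the learning-rate schedule should not need to change; smaller per-device batches lower peak memory at the cost of more forward and backward passes per optimisation step.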

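Because we are continuing an earlier run, we pass `resume_from_checkpoint=True` so that the 🤗 Trainer restores the most recent checkpoint found in `output_dir`. For a fresh run, call `trainer.train()` without arguments. To launch training, simply execute: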

In [None]:
trainer.train(resume_from_checkpoint=True)

Loading model from /content/checkpoint/whisper-small-hi/checkpoint-3500.
***** Running training *****
  Num examples = 12360
  Num Epochs = 6
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4000
  Number of trainable parameters = 241734912
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 4
  Continuing training from global step 3500
  Will skip the first 4 epochs then the first 408 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/408 [00:00<?, ?it/s]

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----|
| 4000 | 0.0049 | 0.334228 | 19.942997 |



TrainOutput(global_step=4000, training_loss=0.0010559742953628302, metrics={'train_runtime': 5522.3259, 'train_samples_per_second': 11.589, 'train_steps_per_second': 0.724, 'total_flos': 1.84579222044672e+19, 'train_loss': 0.0010559742953628302, 'epoch': 5.17})
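
The epoch and batch counts in the training log above can be reproduced from the dataset size and the batch size. The short calculation below is a sketch that assumes a single GPU and that the last, partially filled batch of each epoch is kept rather than dropped:

```python
import math

num_examples = 12_360    # "Num examples" in the log
batch_size = 16          # per-device batch size x accumulation steps x 1 GPU
max_steps = 4_000        # "Total optimization steps"
resume_step = 3_500      # checkpoint the run resumed from

# One optimisation step consumes one batch; the last batch of an epoch may be partial.
steps_per_epoch = math.ceil(num_examples / batch_size)

epochs_needed = math.ceil(max_steps / steps_per_epoch)
epochs_done, batches_into_epoch = divmod(resume_step, steps_per_epoch)

print(f"steps per epoch:  {steps_per_epoch}")
print(f"epochs scheduled: {epochs_needed}")
print(f"resume point:     epoch {epochs_done}, batch {batches_into_epoch}")
print(f"final epoch:      {max_steps / steps_per_epoch:.2f}")
```

These values line up with the log: 6 epochs scheduled, 4 epochs and 408 batches skipped on resume, and a final epoch count of 5.17.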

In the end, our best WER was 19.94%, reached at step 4000.


We can submit our checkpoint to the [`hf-speech-bench`](https://huggingface.co/spaces/huggingface/hf-speech-bench) on push by setting the appropriate key-word arguments (kwargs):

In [None]:
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0",  # a 'pretty' name for the training dataset
    "dataset_args": "config: sv-SE, split: test",
    "language": "sv",
    "model_name": "Whisper Small Hi - Swedish",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

The training results can now be uploaded to the Hub. To do so, execute the `push_to_hub` command and save the preprocessor object we created:

In [None]:
trainer.push_to_hub(**kwargs)

Saving model checkpoint to /content/checkpoint/whisper-small-hi
Configuration saved in /content/checkpoint/whisper-small-hi/config.json
Model weights saved in /content/checkpoint/whisper-small-hi/pytorch_model.bin
Feature extractor saved in /content/checkpoint/whisper-small-hi/preprocessor_config.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.30k/922M [00:00<?, ?B/s]

Upload file runs/Dec08_14-07-45_9fb8b27d1e18/events.out.tfevents.1670509682.9fb8b27d1e18.72.0:  42%|####1     …

To https://huggingface.co/Yilin98/whisper-small-hi
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'https://user:<redacted>@huggingface.co/Yilin98/whisper-small-hi'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

OSError: ignored

However, as the output above shows, pushing to our Hugging Face repository failed: the remote already contained commits that our local clone did not have. As a workaround, we saved our checkpoints and model to Google Drive, downloaded them to a laptop, and pushed them to the Hugging Face Hub from there.

In [None]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
# Save the model to google drive
%cp -R /content/checkpoint/whisper-small-hi /gdrive/MyDrive/

# All done!