# Feature Pipeline for Swedish ASR Fine-Tuning

## Introduction

<figure>
<img src="https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/whisper_architecture.svg" alt="Whisper encoder-decoder architecture" style="width:100%">
<figcaption align = "center"><b>Figure 1:</b> Whisper model. The architecture
follows the standard Transformer-based encoder-decoder model. A
log-Mel spectrogram is input to the encoder. The last encoder
hidden states are input to the decoder via cross-attention mechanisms. The
decoder autoregressively predicts text tokens, jointly conditional on the
encoder hidden states and previously predicted tokens. Figure source:
<a href="https://openai.com/blog/whisper/">OpenAI Whisper Blog</a>.</figcaption>
</figure>

The Whisper checkpoints come in five configurations of varying model sizes.
The smallest four are trained on either English-only or multilingual data.
The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints
are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
checkpoints are summarised in the following table with links to the models on the Hub:

| Size   | Layers | Width | Heads | Parameters | English-only                                         | Multilingual                                      |
|--------|--------|-------|-------|------------|------------------------------------------------------|---------------------------------------------------|
| tiny   | 4      | 384   | 6     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)   |
| base   | 6      | 512   | 8     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)   |
| small  | 12     | 768   | 12    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)  |
| medium | 24     | 1024  | 16    | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
| large  | 32     | 1280  | 20    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large)  |

For demonstration purposes, we'll fine-tune the multilingual version of the
[`"small"`](https://huggingface.co/openai/whisper-small) checkpoint with 244M params (~= 1GB).
As for our data, we'll train and evaluate our system on a low-resource language
taken from the [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
dataset. We'll show that with as little as 8 hours of fine-tuning data, we can achieve
strong performance in this language.
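
The "~= 1GB" figure follows from the parameter count. The sketch below reproduces that arithmetic for every row of the table, assuming the weights are stored as 32-bit floats (4 bytes per parameter); that storage format is an assumption, not something stated above.

```python
# Approximate weight size of each checkpoint from its parameter count.
# Assumption: weights are stored as 32-bit floats (4 bytes each).
BYTES_PER_FLOAT32 = 4

# Parameter counts in millions, copied from the table above.
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}


def checkpoint_size_gb(size: str) -> float:
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    return PARAMS_MILLIONS[size] * 1_000_000 * BYTES_PER_FLOAT32 / 1e9


for name in PARAMS_MILLIONS:
    print(f"{name:>6}: ~{checkpoint_size_gb(name):.2f} GB")
```

For `"small"` this gives roughly 0.98 GB, in line with the estimate quoted above.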

## Prepare Environment

We need to log in to the Hugging Face Hub to download the dataset.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Some global variables for configuration.

In [None]:
MODEL_SIZE = "small" # tiny, base, small, ...
LANG_CODE = "sv-SE"
LANG_NAME = "Swedish"
MODEL_VERSION = "v2"
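
These variables are later interpolated into the checkpoint id and the Google Drive path. As a rough sketch of how they fit together, the helpers below build both strings in one place. `checkpoint_id` and `features_path` are hypothetical names introduced here for illustration, and the size check is an added assumption.

```python
# Values copied from the configuration cell above.
MODEL_SIZE = "small"
LANG_CODE = "sv-SE"
LANG_NAME = "Swedish"
MODEL_VERSION = "v2"

# Sizes listed in the checkpoint table; used here only for validation.
VALID_SIZES = ("tiny", "base", "small", "medium", "large")


def checkpoint_id(model_size: str) -> str:
    """Build the Hub id of the multilingual checkpoint (hypothetical helper)."""
    if model_size not in VALID_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    return f"openai/whisper-{model_size}"


def features_path(version: str, lang_name: str, model_size: str) -> str:
    """Build the Drive folder used for saved features (hypothetical helper)."""
    return f"/content/drive/MyDrive/ID2223/lab2/{version}/{lang_name}/features/{model_size}/"


print(checkpoint_id(MODEL_SIZE))
print(features_path(MODEL_VERSION, LANG_NAME, MODEL_SIZE))
```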

We need to install a few dependencies.

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt update
!apt install -y ffmpeg
!pip install "datasets>=2.6.1" git+https://github.com/huggingface/transformers


We mount Google Drive so that the extracted features can be saved to it.

In [None]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


## Load Dataset

We use the [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset.


### Download Dataset

In [None]:
from datasets import load_dataset, DatasetDict, DownloadConfig

common_voice = DatasetDict()

raw_data_path = "./raw_data/"

download_conf = DownloadConfig(
    token=True,
    cache_dir=raw_data_path,
)
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", LANG_CODE, split="train+validation", download_config=download_conf)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", LANG_CODE, split="test", download_config=download_conf)

print(common_voice)


DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 12360
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 5069
    })
})


## Prepare Feature Extractor, Tokenizer and Data

### Load WhisperFeatureExtractor

The Whisper feature extractor performs two operations:
1. Pads / truncates the audio inputs to 30s: any audio inputs shorter than 30s are padded to 30s with silence (zeros), and those longer than 30s are truncated to 30s
2. Converts the audio inputs to _log-Mel spectrogram_ input features, a visual representation of the audio and the form of the input expected by the Whisper model
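
The first operation can be sketched in plain Python as follows. This is an illustration of the padding and truncation logic only, not the library's implementation; the hop length of 160 samples and the 80 Mel bins are assumed defaults and are used here only to show the resulting spectrogram shape.

```python
SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects
CHUNK_SECONDS = 30      # fixed input window
HOP_LENGTH = 160        # assumption: default hop (10 ms at 16 kHz)
N_MELS = 80             # assumption: default number of Mel bins

N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS  # samples in one 30 s window


def pad_or_truncate(samples: list) -> list:
    """Return exactly N_SAMPLES values: cut long inputs, zero-pad short ones."""
    if len(samples) >= N_SAMPLES:
        return samples[:N_SAMPLES]
    return samples + [0.0] * (N_SAMPLES - len(samples))


short_clip = [0.1] * (5 * SAMPLING_RATE)   # 5 s of audio
long_clip = [0.1] * (40 * SAMPLING_RATE)   # 40 s of audio

print(len(pad_or_truncate(short_clip)), len(pad_or_truncate(long_clip)))
print("spectrogram shape:", (N_MELS, N_SAMPLES // HOP_LENGTH))
```

Under these assumptions, every example ends up with the same feature shape regardless of the original clip length.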

<figure>
<img src="https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/spectrogram.jpg" alt="Audio waveform and its log-Mel spectrogram" style="width:100%">
<figcaption align = "center"><b>Figure 2:</b> Conversion of sampled audio array to log-Mel spectrogram.
Left: sampled 1-dimensional audio signal. Right: corresponding log-Mel spectrogram. Figure source:
<a href="https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html">Google SpecAugment Blog</a>.
</figcaption>
</figure>

We'll load the feature extractor from the pre-trained checkpoint with the default values:

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained(f"openai/whisper-{MODEL_SIZE}")


### Load WhisperTokenizer

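The Whisper model outputs a sequence of _token ids_. The tokenizer maps each token id to its corresponding text string and, in the other direction, encodes target sentences into label ids for training. See the [`WhisperTokenizer` documentation](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperTokenizer) for details.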

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(f"openai/whisper-{MODEL_SIZE}", language=LANG_NAME, task="transcribe")
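
To make the id-to-text mapping concrete, here is a toy sketch in plain Python. The vocabulary and ids are invented for illustration, and the real tokenizer works on byte-level sub-word units rather than whole words. The special-token prefix shown is the assumed layout when `language` and `task` are set.

```python
# Toy illustration only: vocabulary and ids are invented.
SPECIAL_PREFIX = ["<|startoftranscript|>", "<|sv|>", "<|transcribe|>", "<|notimestamps|>"]
END_TOKEN = "<|endoftext|>"

WORDS = ["Du", "ser", "ut", "att", "ha", "gjort", "det", "här", "hela", "livet."]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(SPECIAL_PREFIX + [END_TOKEN] + WORDS)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}
SPECIAL_IDS = {TOKEN_TO_ID[t] for t in SPECIAL_PREFIX + [END_TOKEN]}


def encode(sentence: str) -> list:
    """Wrap the sentence in special tokens and map every token to its id."""
    tokens = SPECIAL_PREFIX + sentence.split() + [END_TOKEN]
    return [TOKEN_TO_ID[t] for t in tokens]


def decode(ids: list, skip_special_tokens: bool = True) -> str:
    """Map ids back to text, optionally dropping the special tokens."""
    kept = [i for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]
    return " ".join(ID_TO_TOKEN[i] for i in kept)


label_ids = encode("Du ser ut att ha gjort det här hela livet.")
print(label_ids)
print(decode(label_ids))
print(decode(label_ids, skip_special_tokens=False))
```

Decoding with the special tokens skipped recovers the original sentence, which is the round trip the real tokenizer provides for model outputs.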


### Prepare Data

Let's print the first example of the Common Voice dataset to see
what form the data is in:

In [None]:
print(common_voice["train"][0])

{'client_id': '782ec7b338418a4966cc49ae09265d258705091874fb4d3a7fc76c9541738a997af0f51e9ef6045dc01874a060b482c7adfbfff2a51b50fa8d03764248956d48', 'path': './raw_data/extracted/40784a27e162d09ad00f11f09f5a86a0cd56ee87ffaa2341f563f63a5cc19a5d/sv-SE_train_0/common_voice_sv-SE_20466896.mp3', 'audio': {'path': './raw_data/extracted/40784a27e162d09ad00f11f09f5a86a0cd56ee87ffaa2341f563f63a5cc19a5d/sv-SE_train_0/common_voice_sv-SE_20466896.mp3', 'array': array([0., 0., 0., ..., 0., 0., 0.]), 'sampling_rate': 48000}, 'sentence': 'Du ser ut att ha gjort det här hela livet.', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'female', 'accent': '', 'locale': 'sv-SE', 'segment': ''}


The printed example shows that the input audio is sampled at 48kHz. We need to _downsample_ it to 16kHz, the sampling rate expected by the Whisper model, before passing it to the Whisper feature extractor.

In [None]:
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
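
Going from 48kHz to 16kHz keeps one output sample for every three input samples. The sketch below shows that ratio with a deliberately naive method: it averages each group of three samples. This is a crude stand-in for the filtered resampling a real audio library performs, not what `datasets` does internally.

```python
SOURCE_RATE = 48_000  # Hz, rate of the Common Voice recordings
TARGET_RATE = 16_000  # Hz, rate expected by Whisper
FACTOR = SOURCE_RATE // TARGET_RATE  # integer decimation factor


def naive_downsample(samples: list, factor: int = FACTOR) -> list:
    """Average each group of `factor` samples (crude low-pass plus decimation)."""
    usable = len(samples) - len(samples) % factor  # drop an incomplete last group
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


one_second = [float(i % 3) for i in range(SOURCE_RATE)]  # synthetic 1 s signal
resampled = naive_downsample(one_second)

print(len(one_second), "->", len(resampled))
```

The duration is unchanged; only the number of samples per second drops by a factor of three.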

Re-loading the first audio sample in the Common Voice dataset will resample
it to the desired sampling rate:

In [None]:
print(common_voice["train"][0])

{'client_id': '782ec7b338418a4966cc49ae09265d258705091874fb4d3a7fc76c9541738a997af0f51e9ef6045dc01874a060b482c7adfbfff2a51b50fa8d03764248956d48', 'path': './raw_data/extracted/40784a27e162d09ad00f11f09f5a86a0cd56ee87ffaa2341f563f63a5cc19a5d/sv-SE_train_0/common_voice_sv-SE_20466896.mp3', 'audio': {'path': './raw_data/extracted/40784a27e162d09ad00f11f09f5a86a0cd56ee87ffaa2341f563f63a5cc19a5d/sv-SE_train_0/common_voice_sv-SE_20466896.mp3', 'array': array([0., 0., 0., ..., 0., 0., 0.]), 'sampling_rate': 16000}, 'sentence': 'Du ser ut att ha gjort det här hela livet.', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'female', 'accent': '', 'locale': 'sv-SE', 'segment': ''}


### Transform Data
This is the main data processing step that creates the features.
The cell below applies `prepare_dataset` to one example at a time across two processes; a batched variant, which may be somewhat faster, is kept as commented-out code.

In [None]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

#def prepare_dataset(batch):
    # compute log-Mel input features from input audio array
    #batch["input_features"] = [feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0] for audio in batch["audio"]]

    # encode target text to label ids
    #batch["labels"] = [tokenizer(sentence).input_ids for sentence in batch["sentence"]]
    #return batch

#common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2, batched=True, batch_size=128)
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
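
The per-example function and the commented-out batched variant differ mainly in the shape of their argument: a single record versus a dictionary of lists. The toy sketch below mimics that difference in plain Python. `map_per_example` and `map_batched` are hypothetical stand-ins for the two mapping modes, and a word count stands in for the real feature computation.

```python
def prepare_example(example: dict) -> dict:
    # Non-batched: example["sentence"] is a single string.
    return {"n_words": len(example["sentence"].split())}


def prepare_batch(batch: dict) -> dict:
    # Batched: batch["sentence"] is a list of strings.
    return {"n_words": [len(s.split()) for s in batch["sentence"]]}


def map_per_example(rows: list, fn) -> list:
    """Call `fn` once per row (hypothetical stand-in for non-batched mapping)."""
    return [fn(row) for row in rows]


def map_batched(rows: list, fn, batch_size: int) -> list:
    """Call `fn` once per chunk of rows (hypothetical stand-in for batched mapping)."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Convert list-of-dicts to dict-of-lists, the layout a batched function sees.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        n = len(next(iter(result.values())))
        out.extend({k: v[i] for k, v in result.items()} for i in range(n))
    return out


rows = [{"sentence": "Du ser ut"}, {"sentence": "att ha gjort"}, {"sentence": "det här hela livet."}]
print(map_per_example(rows, prepare_example))
print(map_batched(rows, prepare_batch, batch_size=2))
```

Both modes produce the same records; batching only reduces the number of function calls.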


## Save Features in Google Drive

In [None]:
# save features to drive
drive_features_path = f"/content/drive/MyDrive/ID2223/lab2/{MODEL_VERSION}/{LANG_NAME}/features/{MODEL_SIZE}/"
common_voice.save_to_disk(drive_features_path, max_shard_size="1GB")

Saving the dataset (0/12 shards):   0%|          | 0/12360 [00:00<?, ? examples/s]

Saving the dataset (0/5 shards):   0%|          | 0/5069 [00:00<?, ? examples/s]
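
The shard counts in the progress output (12 for train, 5 for test) can be sanity-checked with a back-of-the-envelope estimate. The sketch assumes each example stores an 80 x 3000 spectrogram as 32-bit floats, treats "1GB" as 10**9 bytes, and ignores the much smaller label arrays. All of these are assumptions rather than facts confirmed by the notebook.

```python
import math

N_MELS = 80           # assumption: Mel bins per frame
N_FRAMES = 3000       # assumption: frames per 30 s window
BYTES_PER_VALUE = 4   # assumption: float32 storage
SHARD_BYTES = 10**9   # assumption: "1GB" means 10**9 bytes


def estimated_shards(num_rows: int) -> tuple:
    """Return (total size in GB, number of shards needed) for `num_rows` examples."""
    total_bytes = num_rows * N_MELS * N_FRAMES * BYTES_PER_VALUE
    return total_bytes / 1e9, math.ceil(total_bytes / SHARD_BYTES)


# Row counts taken from the printed DatasetDict.
for split, rows in {"train": 12360, "test": 5069}.items():
    size_gb, shards = estimated_shards(rows)
    print(f"{split}: ~{size_gb:.2f} GB in about {shards} shards")
```

Under these assumptions the estimate is consistent with the shard counts reported when saving.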

In [None]:
test_v1 = "9d72e2e08c8a3c47"
train_v1 = "7564f9602fee7770"
test_v2 = "f01f52aef4fa5fba"
train_v2 = "0a0def596547b5a1"

# load features from drive
from datasets import load_from_disk

drive_features_path = f"/content/drive/MyDrive/ID2223/lab2/v1/swedish/features/{MODEL_SIZE}/"
common_voice_v1 = load_from_disk(drive_features_path)






In [None]:
print(common_voice["train"][12000]["input_features"][16])
print(common_voice_v1["train"][12000]["input_features"][16])
print(common_voice["train"][12000]["labels"])
print(common_voice_v1["train"][12000]["labels"])

[-0.7454781532287598, -0.7454781532287598, -0.7454781532287598, -0.47161686420440674, -0.4697835445404053, -0.5579724311828613, -0.5971349477767944, -0.5251418352127075, -0.4510633945465088, -0.418179988861084, -0.5701056718826294, -0.442257285118103, -0.493937611579895, -0.3336390256881714, -0.5032496452331543, -0.48250865936279297, -0.47455036640167236, -0.6313267946243286, -0.4681462049484253, -0.48624396324157715, -0.5473730564117432, -0.6280766725540161, -0.5531842708587646, -0.5454846620559692, -0.48922228813171387, -0.5870174169540405, -0.49881649017333984, -0.5354394912719727, -0.42693662643432617, -0.554484486579895, -0.5857061147689819, -0.4437748193740845, -0.35414111614227295, -0.45352840423583984, -0.6511633396148682, -0.36377596855163574, -0.7454781532287598, -0.5811154842376709, -0.4593040943145752, -0.5245180130004883, -0.3889038562774658, -0.3174154758453369, -0.22594690322875977, -0.3608362674713135, -0.48156237602233887, -0.3279153108596802, -0.23989784717559814, -0.