This notebook is intended to make the distillation process as easy as possible for you.

It is divided into three sections, which are described in detail in the [readme of the original distil-whisper repo]().
Each section calls launch scripts located under the `./run-scripts` directory. The parameters you should change are the ones exposed in this notebook, to keep things as simple as possible. More advanced contributors can refer to the aforementioned readme and modify other parameters directly in the launch scripts.

Processes launched from this notebook keep running if the notebook is closed. Logs are written to txt files under the `./logs` directory (created with the first logs) and are also reported to wandb.
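
Since the scripts keep running in the background, it can be handy to peek at the latest log from a notebook cell. The sketch below assumes only what is stated above: that logs are `.txt` files under `./logs`. The helper names `latest_log` and `tail` are hypothetical and not part of the repo.

```python
from pathlib import Path
from typing import List, Optional


def latest_log(log_dir: str = "./logs") -> Optional[Path]:
    """Return the most recently modified .txt file in log_dir, or None."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(directory.glob("*.txt"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def tail(path: Path, n: int = 20) -> List[str]:
    """Return the last n lines of a text file."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read().splitlines()[-n:]


log_file = latest_log()
if log_file is None:
    print("No log file yet under ./logs")
else:
    print(f"--- {log_file} ---")
    print("\n".join(tail(log_file)))
```

Re-running the cell shows the newest lines, which is usually enough to tell whether a job is still making progress.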

# 0. set up

## 0.1 HuggingFace Hub access

This process needs to read from the Hugging Face Hub and to push datasets to it. To give this Space access:

1. Create and copy a User Access Token with the write role (see [this](https://huggingface.co/docs/hub/en/security-tokens) tutorial).
2. Go to this Space's settings, Variables and secrets, New secret. Set `Name` to `HF_TOKEN` and paste your User Access Token in `Value (private)`.

## 0.2 wandb access

In addition to logging to txt files under `./logs`, the scripts also report to wandb. To give this Space access:
1. Copy your wandb API key (located in your wandb account's User settings).
2. Go to this Space's settings, Variables and secrets, New secret. Set `Name` to `WANDB_API_KEY` and paste your wandb API key in `Value (private)`.
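
Secrets defined in a Space's settings are exposed to the running Space as environment variables, so a quick check before launching anything can save a failed run. This is a minimal sketch; `missing_secrets` is a hypothetical helper, and it only checks that the variables are non-empty, not that the tokens are valid.

```python
import os

# Names of the secrets configured in sections 0.1 and 0.2.
REQUIRED_SECRETS = ("HF_TOKEN", "WANDB_API_KEY")


def missing_secrets(env=os.environ, names=REQUIRED_SECRETS):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


missing = missing_secrets()
if missing:
    print("Missing secrets: " + ", ".join(missing))
else:
    print("HF_TOKEN and WANDB_API_KEY are both set.")
```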

# 1. run pseudo labelling

Here we run the pseudo labelling step, where audio samples are concatenated into 30-second samples and pseudo labelled using whisper-large-v3. To do so, the target dataset is streamed from the Hugging Face Hub and processed on the fly, so your dataset must already be on the Hub. The result is saved to disk and pushed to the Hub under your username (to avoid reprocessing it!).
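
To give an intuition for the concatenation step, here is a simplified sketch that greedily groups consecutive samples so that each group stays within 30 seconds. It is an illustration under assumptions, not the logic of the launch script: the function name `group_into_chunks` is hypothetical, it works on durations only, and the real script may apply additional rules when deciding which samples to merge.

```python
from typing import List


def group_into_chunks(durations: List[float], max_duration: float = 30.0) -> List[List[int]]:
    """Greedily group consecutive sample indices so each group lasts at most max_duration.

    A single sample longer than max_duration ends up alone in its own group.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    current_total = 0.0
    for index, duration in enumerate(durations):
        # Close the current group if adding this sample would exceed the limit.
        if current and current_total + duration > max_duration:
            chunks.append(current)
            current, current_total = [], 0.0
        current.append(index)
        current_total += duration
    if current:
        chunks.append(current)
    return chunks


durations = [12.0, 10.5, 9.0, 25.0, 4.0, 3.5]  # seconds, made-up values
print(group_into_chunks(durations))  # [[0, 1], [2], [3, 4], [5]]
```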

Please set the parameters below to the values corresponding to your Hugging Face Hub dataset. The example below uses the `es_419` config of `google/fleurs`.

Re-run section 1 for each of your datasets.

## parameters

In [4]:
dataset_name = "google/fleurs"
dataset_config_name = "es_419" 
audio_column_name = "audio"
text_column_name = "transcription"
id_column_name = "id"
speaker_id_column_name = None # None, or the name of the speaker id column
language = "es"
dataset_split_name = "test" # remains unchanged
model_name_or_path = "openai/whisper-large-v3" # remains unchanged

## launch pseudo labelling!

In [3]:
import os

# Build the command string
command = f"""

chmod +x ./run-scripts/run_pseudo_labelling.sh
./run-scripts/run_pseudo_labelling.sh "{model_name_or_path}" "{dataset_name}" "{dataset_config_name}" "{dataset_split_name}" "{audio_column_name}" "{text_column_name}" "{language}" "{id_column_name}"
"""

# Execute the command
os.system(command)

0
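
Because the parameters are interpolated into a shell command, a value containing spaces or quotes could break it. One possible alternative, shown here as a sketch only, is to quote each argument with `shlex.quote` before building the string. The helper `build_command` is hypothetical, and the argument order simply mirrors the cell above.

```python
import shlex


def build_command(script: str, args) -> str:
    """Build the two-line shell command used in this notebook, quoting every argument."""
    quoted_script = shlex.quote(script)
    quoted_args = " ".join(shlex.quote(str(a)) for a in args)
    return f"chmod +x {quoted_script}\n{quoted_script} {quoted_args}"


command = build_command(
    "./run-scripts/run_pseudo_labelling.sh",
    ["openai/whisper-large-v3", "google/fleurs", "es_419", "test",
     "audio", "transcription", "es", "id"],
)
print(command)
```

The resulting string can be passed to `os.system` exactly as before.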

## push dataset to the hub

In [None]:
import shutil
from datasets import load_dataset

dataset_path = "/data/distil-colab/tmp/mozilla-foundation/common_voice_17_0_es_pseudo_labelled"
hub_path = "eustlb/common_voice_17_0_es_pseudo_labelled"

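try:
    # Load the pseudo-labelled dataset that section 1 saved to disk.
    ds = load_dataset(
        dataset_path,
        num_proc=48
    )
except Exception as e:
    # Report the failure instead of silently skipping the push.
    print(f"Could not load dataset from {dataset_path}: {e}")
else:
    # Loading the dataset caches it in /data/.cache/huggingface/datasets.
    # Therefore, we don't need the copy under dataset_path; let's save disk space.
    shutil.rmtree(dataset_path)

    ds.push_to_hub(hub_path)

    # The dataset is now on the hub, let's free the cache.
    shutil.rmtree("/data/.cache/huggingface/datasets")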

# 2. training

## 2.1 create student model

In [5]:
# to be changed
save_dir = "/home/eustache_lebihan/dev/distil-colab/student-model"

# remains unchanged
teacher_model_checkpoint = "distil-whisper/distil-large-v3"

In [6]:
import os

# Build the command string
command = f"""
chmod +x ./run-scripts/create_student_model.sh
./run-scripts/create_student_model.sh "{teacher_model_checkpoint}" "{save_dir}"
"""

# Execute the command
os.system(command)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Non-default generation parameters: {'max_length': 448, 'begin_suppress_tokens': [220, 50257]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


0

## 2.2 train the model

### parameters

In [3]:
# to be changed to your specific case
language = "es"

# training set 
train_dataset_name = "eustlb/fleurs_es_pseudo_labelled"
train_dataset_config_name = "es_419"
train_split_name = "test"
text_column_name = "transcription"

# validation set
eval_dataset_name = "eustlb/fleurs_es_pseudo_labelled"
eval_dataset_config_name = "es_419"
eval_split_name = "test"
eval_text_column_name = "transcription"

output_dir = "/home/eustache_lebihan/dev/distil-colab/distil-large-v3-es"

In [6]:
# training parameters, should remain unchanged
model_name_or_path = "/home/eustache_lebihan/dev/distil-colab/student-model"
max_steps = 50 # optimization steps
warmup_steps = 10
learning_rate = 0.0001
timestamp_probability = 0.5
condition_on_prev_probability = 0.2
per_device_train_batch_size = 16
per_device_eval_batch_size = 16
dataloader_num_workers = 4
wer_threshold = 20
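
The `wer_threshold` parameter is presumably a word error rate (WER) cutoff, in percent, used to discard training samples whose pseudo label differs too much from the original transcription. The sketch below illustrates that idea with a plain word-level edit distance. It rests on assumptions: the function names are hypothetical, the comparison is taken to be inclusive, and the training script may normalise the text before computing WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, as a percentage of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return 100.0 * previous[-1] / len(ref)


def keep_sample(reference: str, pseudo_label: str, wer_threshold: float = 20.0) -> bool:
    """Assumed filter rule: keep the sample when its WER does not exceed the threshold."""
    return word_error_rate(reference, pseudo_label) <= wer_threshold


print(keep_sample("hola como estas hoy amigo", "hola como estas hoy amiga"))  # True (WER = 20)
print(keep_sample("hola como estas hoy", "hola como esta hoy"))               # False (WER = 25)
```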

### launch training!

In [7]:
import os

# Build the command string
command = f"""
chmod +x ./run-scripts/run_training.sh
./run-scripts/run_training.sh "{model_name_or_path}" "{train_dataset_name}" "{train_dataset_config_name}" "{train_split_name}" "{text_column_name}" "{eval_dataset_name}" "{eval_dataset_config_name}" "{eval_split_name}" "{eval_text_column_name}" "{warmup_steps}" "{learning_rate}" "{timestamp_probability}" "{condition_on_prev_probability}" "{language}" "{max_steps}" "{wer_threshold}" "{per_device_train_batch_size}" "{per_device_eval_batch_size}" "{dataloader_num_workers}" "{output_dir}"
"""

# Execute the command
os.system(command)

256
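
The value printed above is the raw return value of `os.system`. On Unix this is a wait status rather than the exit code itself, and `256` corresponds to exit code 1, which suggests the launch script exited with an error in this run; the files under `./logs` are the place to look for the cause. A small sketch for decoding the value (the helper name is hypothetical, and `os.waitstatus_to_exitcode` requires Python 3.9 or later):

```python
import os


def exit_code_from_status(status: int) -> int:
    """Convert the value returned by os.system into a process exit code."""
    if hasattr(os, "waitstatus_to_exitcode"):
        return os.waitstatus_to_exitcode(status)
    # Fallback for older Python versions, valid when the process exited normally.
    return status >> 8


print(exit_code_from_status(0))    # 0: success
print(exit_code_from_status(256))  # 1: the script reported a failure
```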

# 3. Evaluation

## 3.1 Whisper large-v3

In [20]:
model_name_or_path = "openai/whisper-large-v3"
wandb_name = model_name_or_path + "-es-short-form"
language = "es"
dataset_names = "mozilla-foundation/common_voice_17_0+facebook/multilingual_librispeech+facebook/voxpopuli+google/fleurs"
dataset_config_names = "es+spanish+es+es_419"
dataset_split_names = "test+test+test+test" 
text_column_names = "sentence+text+raw_text+transcription" 
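
Judging from the values above, the evaluation parameters are `+`-separated lists whose entries are paired by position: the first dataset goes with the first config, split, and text column, and so on. A sketch for checking that the four strings line up before launching (the helper `split_eval_spec` is hypothetical and not part of the launch scripts):

```python
def split_eval_spec(names: str, configs: str, splits: str, text_columns: str, sep: str = "+"):
    """Split the '+'-separated strings and pair their entries by position."""
    fields = {
        "name": names.split(sep),
        "config": configs.split(sep),
        "split": splits.split(sep),
        "text_column": text_columns.split(sep),
    }
    lengths = {key: len(values) for key, values in fields.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Mismatched number of entries: {lengths}")
    return [
        dict(zip(fields.keys(), row))
        for row in zip(*fields.values())
    ]


specs = split_eval_spec(
    "mozilla-foundation/common_voice_17_0+facebook/multilingual_librispeech+facebook/voxpopuli+google/fleurs",
    "es+spanish+es+es_419",
    "test+test+test+test",
    "sentence+text+raw_text+transcription",
)
for spec in specs:
    print(spec)
```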

In [21]:
import os

# Build the command string
command = f"""
chmod +x ./run-scripts/run_short_form_eval.sh
./run-scripts/run_short_form_eval.sh "{model_name_or_path}" "{wandb_name}" "{dataset_names}" "{dataset_config_names}" "{dataset_split_names}" "{text_column_names}" "{language}"
"""

# Execute the command
os.system(command)