# **Fine-tuning T-one with 🤗 Transformers**

This notebook provides a fine-tuning recipe for the pretrained `T-one` acoustic model.

### Installation
To work with audio files, you may need to install libsndfile manually. Please refer to the [SoundFile](https://python-soundfile.readthedocs.io/en/0.13.1/#installation) installation documentation for details.

Install the dependencies:
```bash
poetry install -E finetune
```

Select the Python kernel that has these dependencies installed.

In [1]:
import torch


torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

In [2]:
# Sample rate used in the base model
SAMPLE_RATE = 8000

# Padding added to each audio during training to improve performance
LEFT_PAD_MS = 300
RIGHT_PAD_MS = 300
LEFT_PAD_SIZE = round(SAMPLE_RATE * LEFT_PAD_MS * 1e-3)
RIGHT_PAD_SIZE = round(SAMPLE_RATE * RIGHT_PAD_MS * 1e-3)
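To make the padding arithmetic concrete, here is a small worked example. It reuses the constants from the cell above; the 2-second clip is a hypothetical input chosen only for illustration.

```python
SAMPLE_RATE = 8000  # Hz, same value as in the cell above
LEFT_PAD_MS = 300
RIGHT_PAD_MS = 300

# Milliseconds -> samples
left_pad_size = round(SAMPLE_RATE * LEFT_PAD_MS * 1e-3)
right_pad_size = round(SAMPLE_RATE * RIGHT_PAD_MS * 1e-3)

# Hypothetical 2-second clip, used only to illustrate the effect of padding
clip_samples = 2 * SAMPLE_RATE
padded_samples = left_pad_size + clip_samples + right_pad_size

print(f"pad sizes: {left_pad_size} + {right_pad_size} samples")
print(f"padded length: {padded_samples} samples ({padded_samples / SAMPLE_RATE} s)")
```

Each side adds 2400 samples, so every clip grows by 0.6 seconds before it reaches the model.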

## Prepare Data, Tokenizer, Feature Extractor

### Load data either from the Hugging Face Hub or locally

If you want to use your own data, create a JSON-lines metadata manifest in which every row has two fields: `audio` and `text`.
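As a rough sketch of what such a manifest could look like, the snippet below writes two rows and reads them back. The file paths and transcripts are placeholders, and it assumes the `audio` field holds a path to an audio file on disk.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: the paths and transcripts are placeholders
rows = [
    {"audio": "clips/sample_0001.wav", "text": "первый пример"},
    {"audio": "clips/sample_0002.wav", "text": "второй пример"},
]

# Write one JSON object per line (JSON-lines format)
manifest_path = Path(tempfile.mkdtemp()) / "train.jsonl"
with manifest_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read the manifest back to check that every row has both fields
with manifest_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0])
```

A train manifest and a validation manifest written this way can then be passed as `train_path` and `validation_path`.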

In [3]:
from datasets import Audio, load_dataset


def load_from_local(
    train_path,
    validation_path,
    sample_rate=SAMPLE_RATE,
    audio_column="audio",
    text_column="text",
):
    """
    Load audio from JSON-lines manifests whose rows contain two fields: audio and text.
    """
    manifest_dict = {"train": train_path, "validation": validation_path}
    dataset = load_dataset("json", data_files=manifest_dict)
    if text_column != "text":
        dataset = dataset.rename_column(text_column, "text")
    if audio_column != "audio":
        dataset = dataset.rename_column(audio_column, "audio")
    # After the optional rename above, the audio column is always named "audio"
    dataset = dataset.cast_column("audio", Audio(sampling_rate=sample_rate))
    return dataset


def load_from_huggingface(
    dataset_name,
    sample_rate=SAMPLE_RATE,
    audio_column="audio",
    text_column="text",
    **kwargs,
):
    """
    Load an ASR dataset from the Hugging Face Hub.
    You might need to change this function if you use a custom dataset.
    """
    dataset = load_dataset(dataset_name, **kwargs)
    dataset = dataset.select_columns([audio_column, text_column])
    if text_column != "text":
        dataset = dataset.rename_column(text_column, "text")
    if audio_column != "audio":
        dataset = dataset.rename_column(audio_column, "audio")
    # After the optional rename above, the audio column is always named "audio"
    dataset = dataset.cast_column("audio", Audio(sampling_rate=sample_rate))
    return dataset

### Dataset

In this example we will fine-tune on the [ToneBooks](https://huggingface.co/datasets/Vikhrmodels/ToneBooks) dataset by Vikhrmodels. It contains recordings of books labelled with tone and timbre, but we will use only the audio and transcriptions.

To use your own dataset, call the `load_from_local` function instead.

In [4]:
# This will save the dataset to the cache folder
asr_dataset = load_from_huggingface( 
    dataset_name="Vikhrmodels/ToneBooks", 
    audio_column="audio",
    text_column="text",
)

In [5]:
# Run this if you don't want to use cache for all processing steps
from datasets import disable_caching


disable_caching()

### Preprocess Data

Now we should process the data with the model's feature extractor. We will use the Wav2Vec2 tokenizer and feature extractor as they are close to what we need.

The feature extractor is used to pad the audio. It can also normalize it, but normalization is turned off in this recipe (`do_normalize=False`).

In [6]:
from transformers import Wav2Vec2CTCTokenizer


tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(
    "t-tech/T-one", pad_token="[PAD]", word_delimiter_token="|"
)

The next step is to wrap Wav2Vec2FeatureExtractor into a Wav2Vec2Processor together with the tokenizer.

In [7]:
from transformers import Wav2Vec2FeatureExtractor
from transformers import Wav2Vec2Processor


feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=SAMPLE_RATE,
    padding_value=0.0,
    return_attention_mask=False,
    do_normalize=False,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

### Prepare training examples

Now we have to prepare all training examples. We normalize the text by converting it to lowercase and removing every symbol that is not in the vocabulary, so only lowercase Cyrillic words separated by single spaces remain.
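The normalization step can be tried in isolation. The sketch below uses the same regular expression as the next cell; the helper name `normalize_text` and the sample sentences are made up for illustration.

```python
import re

# Same pattern as in the preprocessing cell: runs of lowercase Cyrillic letters
REG = re.compile("[а-яё]+")


def normalize_text(text):
    """Lowercase the text and keep only Cyrillic words, joined by single spaces."""
    return " ".join(REG.findall(text.lower()))


sample = "Привет, Мир! Это тест №1 (test)."
normalized = normalize_text(sample)
print(normalized)

# Hyphenated words are split because the hyphen is not in the pattern
print(normalize_text("Кто-то"))
```

Digits, Latin letters, and punctuation are dropped entirely, so numbers should be written out in words in your transcripts if you want the model to learn them.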

In [8]:
import os
import re
import numpy as np


# How many processors to use for data processing
NUM_PROC = os.cpu_count()
REG = re.compile("[а-яё]+")


def prepare_dataset(batch):
    audio = batch["audio"]
    audio_array = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    
    # Add magic padding on both sides to improve model performance
    batch["input_values"] = np.pad(audio_array, (LEFT_PAD_SIZE, RIGHT_PAD_SIZE), mode="constant")
    batch["input_lengths"] = len(batch["input_values"])
    
    text = " ".join(REG.findall(batch["text"].lower()))
    batch["labels"] = processor(text=text).input_ids
    return batch


asr_dataset = asr_dataset.map(prepare_dataset, remove_columns=["audio", "text"], num_proc=NUM_PROC)
max_input_length_in_sec = 20.0
asr_dataset["train"] = asr_dataset["train"].filter(
    lambda x: x < max_input_length_in_sec * processor.feature_extractor.sampling_rate, input_columns=["input_lengths"]
)

Map (num_proc=128): 100%|██████████| 91976/91976 [00:37<00:00, 2451.95 examples/s]
Map (num_proc=128): 100%|██████████| 4841/4841 [00:08<00:00, 576.70 examples/s] 
Filter: 100%|██████████| 91976/91976 [00:00<00:00, 517039.13 examples/s]


Long input sequences require a lot of memory. Since `T-one` is based on self-attention, memory usage grows quadratically with the input length. For this demo, we remove all sequences longer than 20 seconds from the training dataset. The limit applies to the padded length stored in `input_lengths`.
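Because the filter compares the padded length against the limit, the longest raw clip that survives is a little shorter than 20 seconds. A minimal sketch of that arithmetic follows; the helper `is_kept` is hypothetical and only mirrors the comparison used in the filter.

```python
SAMPLE_RATE = 8000
PAD_SAMPLES = 2400 + 2400  # left + right padding, 300 ms each at 8 kHz
MAX_INPUT_LENGTH_IN_SEC = 20.0

# Threshold used by the filter, expressed in samples
max_samples = MAX_INPUT_LENGTH_IN_SEC * SAMPLE_RATE

# Longest raw (unpadded) duration that still fits under the threshold
max_raw_seconds = (max_samples - PAD_SAMPLES) / SAMPLE_RATE


def is_kept(raw_seconds):
    """Hypothetical helper: apply the same strict comparison as the filter."""
    input_length = round(raw_seconds * SAMPLE_RATE) + PAD_SAMPLES
    return input_length < max_samples


print(max_samples, max_raw_seconds)
print(is_kept(19.0), is_kept(19.5))
```

If you change the padding or the sample rate, the effective limit on raw audio duration shifts accordingly.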

## Training

The data is now processed, so we can set up the training pipeline. We will use 🤗's [Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer), for which we need a data collator, a pretrained checkpoint, an evaluation metric, and a training configuration.



#### A Data Collator


This data collator must pad the audio and the text separately, each to the maximum length in the batch. The padding tokens in the labels are replaced with `-100` so that those tokens are **not** taken into account when computing the loss.
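The label side of this padding can be illustrated in plain Python. This is a simplified sketch, not the implementation of `DataCollatorCTCWithPadding`; the function name `pad_labels` and the token ids are made up.

```python
def pad_labels(label_batch, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_batch]


# Hypothetical token ids for three transcripts of different lengths
batch = [[5, 12, 7], [3, 9], [8]]
padded = pad_labels(batch)

for row in padded:
    print(row)
```

Positions filled with `-100` carry no target, which is why they must be excluded from the loss.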

In [None]:
from tone.training.data_collator import DataCollatorCTCWithPadding


data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

#### Load a pretrained checkpoint

We need to load a pretrained `T-one` checkpoint and configure it correctly for training.

The tokenizer's `pad_token_id` must match the model's `pad_token_id`, which for a CTC speech model is the CTC *blank token*.

In [None]:
from tone.training.model_wrapper import ToneForCTC


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToneForCTC.from_pretrained("t-tech/T-one").to(device)

#### Evaluation metric
During training, the model should be evaluated using the Word Error Rate (WER). We define a `compute_metrics` function accordingly.
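For intuition, WER is the word-level edit distance between the reference and the prediction, divided by the number of reference words. Below is a simplified single-pair sketch; it is not the code behind `evaluate.load("wer")`, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    curr[j - 1] + 1,  # insertion of an extra hypothesis word
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("мама мыла раму", "мама мыла рану"))  # one substitution in three words
print(word_error_rate("a b c d", "a c d"))  # one deletion in four words
```

Note that WER can exceed 1.0 when the prediction contains many inserted words.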

In [11]:
import evaluate


wer_metric = evaluate.load("wer")


def compute_metrics(preds):
    pred_logits = preds.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    # Predictions are padded with -100; map fully padded frames to the pad token
    pred_ids[(pred_logits == -100).all(axis=-1)] = processor.tokenizer.pad_token_id
    preds.label_ids[preds.label_ids == -100] = processor.tokenizer.pad_token_id
    
    # Group repeating tokens to get the final transcription
    pred_str = processor.batch_decode(pred_ids)
    # Do not group tokens in the labels: repeated characters there are genuine
    label_str = processor.batch_decode(preds.label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
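The difference between grouped and ungrouped decoding can be sketched without the tokenizer. The snippet below is a generic illustration of greedy CTC collapsing, not the tokenizer's actual code; the blank id and the token ids are hypothetical.

```python
def ctc_greedy_collapse(token_ids, blank_id):
    """Merge consecutive repeats, then drop blank tokens."""
    collapsed = []
    previous = None
    for token_id in token_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


BLANK = 0  # hypothetical blank/pad id

# Per-frame argmax output: repeats are merged, and a blank separates a true double 7
frames = [0, 7, 7, 0, 7, 3, 3, 3, 0, 0, 5]
print(ctc_greedy_collapse(frames, BLANK))

# Labels hold no blanks, so collapsing them would wrongly merge a real double letter
labels = [7, 7, 3]
print(ctc_greedy_collapse(labels, BLANK))
```

This is why predictions are decoded with grouping while labels are decoded with `group_tokens=False`.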

#### Define the training configuration

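The configuration below trains for 10 epochs in bf16 with a linear learning-rate schedule and 5% warmup, and it evaluates, logs, and saves a checkpoint at the end of every epoch. Adjust the batch sizes and the number of dataloader workers to your hardware, and note that `bf16 = True` requires a GPU with bfloat16 support.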

In [12]:
from transformers import TrainingArguments


training_args = TrainingArguments(
    output_dir = "tone_ctc",
    per_device_train_batch_size = 64,
    per_device_eval_batch_size = 64,
    dataloader_num_workers = 12,
    eval_on_start = True,
    num_train_epochs = 10,
    bf16 = True,
    lr_scheduler_type = "linear",
    eval_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy = "epoch",
    learning_rate = 5e-5,
    weight_decay = 1e-6,
    warmup_ratio = 0.05,
    save_total_limit = 2,
)
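To see what `lr_scheduler_type = "linear"` combined with `warmup_ratio = 0.05` implies, here is a minimal sketch of a linear warmup followed by a linear decay to zero. The total step count is a hypothetical round number, and the function is an illustration of the schedule's shape rather than the library's scheduler.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 5e-5  # learning_rate from the configuration above
TOTAL_STEPS = 10_000  # hypothetical number of optimizer steps
WARMUP_STEPS = 500  # 5% of TOTAL_STEPS, matching warmup_ratio = 0.05

for step in (0, 250, 500, 5250, 10_000):
    lr = linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

The peak learning rate is reached at the end of warmup, and the rate is back to zero at the final step.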

Now all instances can be passed to Trainer and we are ready to start training!

In [13]:
from transformers import Trainer


trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=asr_dataset["train"],
    eval_dataset=asr_dataset["validation"],
    tokenizer=processor.feature_extractor,
)


### Run training

Training will take a couple of hours, depending on the GPU used. During training, the training and validation loss are printed along with the WER of the model's greedy predictions on the validation set.

In [14]:
trainer.train()

| Epoch | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 0 | No log | 0.383116 | 0.09847 |
| 1 | 0.249300 | 0.158617 | 0.084745 |
| 2 | 0.163800 | 0.133952 | 0.075291 |
| 3 | 0.129200 | 0.136993 | 0.067144 |
| 4 | 0.105900 | 0.13126 | 0.066748 |
| 5 | 0.091800 | 0.136012 | 0.06255 |
| 6 | 0.083400 | 0.145133 | 0.062065 |
| 7 | 0.077100 | 0.141332 | 0.063915 |
| 8 | 0.073200 | 0.140899 | 0.060979 |
| 9 | 0.070200 | 0.140912 | 0.060509 |


TrainOutput(global_step=13900, training_loss=0.11125729512825287, metrics={'train_runtime': 9193.6291, 'train_samples_per_second': 96.694, 'train_steps_per_second': 1.512, 'total_flos': 5.585367925856816e+19, 'train_loss': 0.11125729512825287, 'epoch': 10.0})

We have successfully fine-tuned `T-one` on a new dataset, reducing the validation WER from 9.85% to 6.05%. You can do the same with your own dataset.
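The improvement can be expressed in absolute and relative terms from the WER values logged above at epoch 0 and epoch 9. This is only a small arithmetic sketch.

```python
# WER values copied from the training log above
wer_before = 0.09847  # epoch 0, before any fine-tuning step
wer_after = 0.060509  # epoch 9, final checkpoint

absolute_drop = wer_before - wer_after
relative_drop = absolute_drop / wer_before

print(f"WER: {wer_before:.2%} -> {wer_after:.2%}")
print(f"absolute reduction: {absolute_drop:.2%}")
print(f"relative reduction: {relative_drop:.1%}")
```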

You can export this checkpoint to ONNX using the script `tone/scripts/export.py`.

You can get better results by decoding with beam search and a language model, using a `StreamingCTCPipeline` initialized with your own checkpoint.