<a href="https://colab.research.google.com/github/aminojagh/HFLLM/blob/main/AutomaticSpeechRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install transformers datasets evaluate accelerate torchcodec jiwer

In [None]:
user_name = "amin-oj"
from huggingface_hub import notebook_login
notebook_login()

# Automatic speech recognition

Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing applications like live captioning and note-taking during meetings.

This guide will show you how to:

1. Fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text.
2. Use your fine-tuned model for inference.

## Load MInDS-14 dataset

In [None]:
from datasets import load_dataset, Audio
minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")
minds = minds.train_test_split(test_size=0.2)
print(minds)
minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])
minds["train"][0]

There are two fields:

- `audio`: the speech signal as a 1-dimensional `array`. Accessing this column loads the audio file and resamples it if necessary.
- `transcription`: the target text.

## Preprocess

The next step is to load a Wav2Vec2 processor to process the audio signal:

In [None]:
from transformers import AutoProcessor
checkpoint = "facebook/wav2vec2-base"
processor = AutoProcessor.from_pretrained(checkpoint)

The MInDS-14 dataset has a sampling rate of 8000Hz (you can find this information in its [dataset card](https://huggingface.co/datasets/PolyAI/minds14)), which means you'll need to resample the dataset to 16000Hz to use the pretrained Wav2Vec2 model:

In [None]:
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]
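
To make the resampling step concrete, here is a minimal sketch of upsampling from 8000 Hz to 16000 Hz by linear interpolation. It is only an illustration: the helper `resample_linear` and the toy signal are hypothetical, and the `Audio` feature relies on an audio backend that resamples with proper filtering rather than this naive interpolation.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naively resample a mono signal by linear interpolation (illustration only)."""
    if not samples:
        return []
    n_out = round(len(samples) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        # Position of output sample i on the original sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1 - frac) * samples[left] + frac * samples[right])
    return out

signal_8k = [0.0, 1.0, 0.0, -1.0]  # 4 samples at 8 kHz = 0.5 ms of audio
signal_16k = resample_linear(signal_8k, orig_sr=8_000, target_sr=16_000)
print(len(signal_8k), "->", len(signal_16k))
print(signal_16k)
```

Doubling the sampling rate doubles the number of samples for the same duration, which is why the `array` returned after `cast_column` is twice as long as before.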

As you can see in the `transcription` above, the text contains a mix of uppercase and lowercase characters. The Wav2Vec2 tokenizer is trained only on uppercase characters, so you'll need to convert the text to uppercase to match the tokenizer's vocabulary:

In [None]:
def uppercase(example):
    # `map` merges the returned dict into the example, so columns that are
    # not returned here (such as `audio` and `path`) are kept unchanged.
    return {"transcription": example["transcription"].upper()}

minds = minds.map(uppercase)
minds["train"][0]
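
The function above returns only the `transcription` key, yet the other columns survive the call to `map`. A rough pure-Python model of what happens per example (when no columns are removed) is that the original example is updated with the returned dictionary. The helper `apply_map` and the toy row below are hypothetical and only mimic that merge; they are not the library's implementation.

```python
def uppercase(example):
    return {"transcription": example["transcription"].upper()}

def apply_map(examples, fn):
    """Hypothetical stand-in for Dataset.map: merge fn's output into each example."""
    return [{**example, **fn(example)} for example in examples]

rows = [{"path": "a.wav", "transcription": "hello world"}]
mapped = apply_map(rows, uppercase)
print(mapped)
```

The `path` column is untouched, while `transcription` is overwritten by the returned value.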

Now create a preprocessing function that:

1. Accesses the `audio` column, which loads and resamples the audio file.
2. Extracts the `input_values` from the audio file and tokenizes the `transcription` column with the processor.

In [None]:
def prepare_dataset(batch):
    audio = batch["audio"]
    batch = processor(audio["array"],
                      sampling_rate=audio["sampling_rate"],
                      text=batch["transcription"])
    batch["input_length"] = len(batch["input_values"][0])
    return batch

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by increasing the number of processes with the `num_proc` parameter. Remove the columns you don't need with `map`'s `remove_columns` argument, which has the same effect as the [remove_columns](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.remove_columns) method:

In [None]:
encoded_minds = minds.map(
    prepare_dataset,
    # drop the original columns: ['path', 'audio', 'transcription']
    remove_columns=minds.column_names["train"],
    num_proc=8,
)

encoded_minds.column_names

🤗 Transformers doesn't have a data collator for ASR, so you'll need to adapt the [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding) to create a batch of examples. It'll also dynamically pad your audio inputs and labels to the length of the longest element in its batch (instead of the entire dataset) so they are a uniform length.

<font color="lightgreen">While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.</font>

Unlike other data collators, this specific data collator needs to apply a different padding method to `input_values` and `labels`:

In [None]:
import torch
from dataclasses import dataclass
from typing import Union

@dataclass
class DataCollatorCTCWithPadding:
    processor: AutoProcessor
    padding: Union[bool, str] = "longest"

    def __call__(
        self, features: list[dict[str, Union[list[int], torch.Tensor]]]
    ) -> dict[str, torch.Tensor]:
        # split inputs and labels since they have to be
        # of different lengths and need different padding methods
        input_features = [
            {"input_values": feature["input_values"][0]}
            for feature in features
        ]
        label_features = [
            {"input_ids": feature["labels"]} for feature in features
        ]
        # The labels are stored under "input_ids" because `processor.pad`
        # hands `labels` to `processor.tokenizer.pad`, which expects that key,
        # while `input_features` go to `processor.feature_extractor.pad`.

        batch = self.processor.pad(input_features,
                                   padding=self.padding,
                                   return_tensors="pt")

        labels_batch = self.processor.pad(labels=label_features,
                                          padding=self.padding,
                                          return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        batch["labels"] = labels

        return batch

data_collator = DataCollatorCTCWithPadding(
    processor=processor,
    padding="longest"
)
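
To see what the collator produces without loading a model, the sketch below reproduces its two padding steps on plain Python lists. The helper `pad_batch`, the pad values, and the toy features are assumptions for illustration; the real collator delegates to `processor.pad` and returns PyTorch tensors.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the longest one and return (padded, mask)."""
    longest = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (longest - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, mask

features = [
    {"input_values": [0.1, -0.2, 0.3, 0.0, 0.5], "labels": [7, 4, 11]},
    {"input_values": [0.4, 0.1, -0.3], "labels": [9, 2]},
]

# Audio and labels are padded separately because their lengths are unrelated.
input_values, _ = pad_batch([f["input_values"] for f in features], pad_value=0.0)
label_ids, label_mask = pad_batch([f["labels"] for f in features], pad_value=0)

# Padded label positions become -100 so the loss ignores them.
labels = [
    [tok if keep else -100 for tok, keep in zip(row, mask_row)]
    for row, mask_row in zip(label_ids, label_mask)
]

print(input_values)
print(labels)
```

Each batch is padded only to its own longest element, so a batch of short clips stays small even if the dataset contains much longer recordings.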

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) metric (refer to the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about loading and computing metrics):

In [None]:
import numpy as np
import evaluate
wer = evaluate.load("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer_score = wer.compute(predictions=pred_str, references=label_str)

    return {"wer": wer_score}
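
WER is the word-level edit distance between a prediction and its reference, divided by the number of reference words. The following minimal sketch computes it for a single pair; `word_error_rate` is a hypothetical helper and the sentences are made up. In practice the `wer` metric loaded above (backed by `jiwer`) is preferable, since it also aggregates over a whole set of predictions.

```python
def word_error_rate(reference, prediction):
    """(substitutions + deletions + insertions) / number of reference words.

    The reference must contain at least one word.
    """
    ref, hyp = reference.split(), prediction.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of seven reference words.
print(word_error_rate("I WOULD LIKE TO PAY MY BILL", "I WOULD LIKE TO PLAY MY BILL"))
# One deleted word out of three reference words.
print(word_error_rate("CHECK MY BALANCE", "CHECK BALANCE"))
```

Lower is better, which is why the training arguments set `greater_is_better=False` for this metric.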

## Train

In [None]:
from transformers import AutoModelForCTC, TrainingArguments, Trainer
model = AutoModelForCTC.from_pretrained(
    checkpoint,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

model_name = checkpoint.split("/")[-1]
task = "ASR"
data_id = "minds14"
ckpt_name = f"{model_name}-finetuned-{task}-{data_id}"

In [None]:
training_args = TrainingArguments(
    output_dir=ckpt_name,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=150,
    max_steps=500,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    eval_strategy="steps",
    per_device_eval_batch_size=16,
    save_steps=100,
    eval_steps=100,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    report_to="none",  # disable W&B and other logging integrations
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    processing_class=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
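
A quick back-of-the-envelope check helps in reading these arguments. The sketch below assumes a single GPU, the 80/20 split of the 100 examples loaded earlier, and the default linear schedule with warmup; the helper `linear_lr` is hypothetical and only approximates what the scheduler computes.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
max_steps = 500
warmup_steps = 150
peak_lr = 1e-5
num_train_examples = 80  # assumption: 100 examples with test_size=0.2

# Examples contributing to each optimizer step (single device assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
batches_per_epoch = -(-num_train_examples // per_device_train_batch_size)  # ceiling division
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
approx_epochs = max_steps / steps_per_epoch

def linear_lr(step):
    """Linear warmup to peak_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

print(effective_batch_size, steps_per_epoch, approx_epochs)
for step in (0, 75, 150, 325, 500):
    print(step, linear_lr(step))
```

Under these assumptions, 500 optimizer steps correspond to roughly 100 passes over the 80 training examples, so this small subset is better suited to checking that the pipeline runs than to producing a strong model.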

In [None]:
torch.cuda.empty_cache()
trainer.train()

In [None]:
trainer.push_to_hub()

<Tip>

For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR.

</Tip>

## Inference

In [None]:
import torch
from datasets import load_dataset, Audio
from transformers import AutoModelForCTC, AutoProcessor, pipeline


dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["audio"]["array"]  # the waveform array, not a file path


# 1. Simplest way: use `pipeline`

# Our own fine-tuned model might generate empty text, so use this model instead.
model_name = "openai/whisper-base"

transcriber = pipeline("automatic-speech-recognition", model=model_name)
print(transcriber(audio_file))

# 2. Manual way: load the processor and the fine-tuned model, preprocess the
# audio, and return the inputs as PyTorch tensors.

model_name = f"{user_name}/{ckpt_name}"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForCTC.from_pretrained(model_name)

inputs = processor(
    audio_file, sampling_rate=sampling_rate, return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
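
The `argmax` followed by `batch_decode` amounts to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. A minimal sketch with a made-up vocabulary follows; `ctc_greedy_decode`, the `vocab` mapping, and the blank index 0 are assumptions for illustration, not the processor's actual vocabulary.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, vocab, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and map ids to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()

# Hypothetical vocabulary; index 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}

# One id per audio frame, as produced by argmax over the logits.
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1]
print(ctc_greedy_decode(frame_ids, vocab))
```

The blank between the two runs of `L` is what keeps the double letter: without it, the repeats would be merged into a single `L`.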