<a href="https://colab.research.google.com/github/dxvsh/LearningPytorch/blob/main/Week7/DLP_Week7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DLP Week 7

Note: For background on CTC (Connectionist Temporal Classification), read the [CTC chapter](https://huggingface.co/learn/audio-course/en/chapter3/ctc) of the Hugging Face audio course.


## Install the required packages

In [None]:
!pip install datasets transformers evaluate jiwer > /dev/null

In [None]:
import torch, datasets, evaluate
import numpy as np
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor, AutoModelForCTC, TrainingArguments, Trainer

Explore the data here [ASR Task data](https://huggingface.co/datasets/SPRINGLab/asr-task-data)

It contains audio samples along with their labelled text transcript. The input is the audio sample and the label is the transcribed text for that audio.

Load the dataset into Colab:

In [None]:
dataset = datasets.load_dataset("SPRINGLab/asr-task-data")

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'text'],
        num_rows: 8000
    })
})

This dataset contains 8000 audio and transcribed text pairs.

In [None]:
dataset['train'][100]

{'audio': {'path': None,
  'array': array([ 0.08392334,  0.06274414,  0.03656006, ..., -0.0368042 ,
         -0.03457642, -0.04806519]),
  'sampling_rate': 16000},
 'text': 'number which is a categorical variable which you get down so from a simple neuron we will'}

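The function below extracts the characters that occur in the transcripts. It joins the text it receives into a single string `all_text`, builds a list of unique characters `vocab` from it, and returns a dictionary with two keys: `vocab`, holding the list of unique characters, and `all_text`, holding the joined text. Note that `map` is called below without `batched=True`, so the function receives one example at a time and `batch_size` has no effect; `batch["text"]` is then a single string, and `" ".join` inserts a space between its characters. The resulting set of characters is unaffected, apart from always containing the space.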

In [None]:
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
vocabs = dataset.map(extract_all_chars, batch_size=8)

In [None]:
vocabs

DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'vocab', 'all_text'],
        num_rows: 8000
    })
})

Looking at an example from the mapped dataset.

In [None]:
vocabs['train'][100]

{'audio': {'path': None,
  'array': array([ 0.08392334,  0.06274414,  0.03656006, ..., -0.0368042 ,
         -0.03457642, -0.04806519]),
  'sampling_rate': 16000},
 'text': 'number which is a categorical variable which you get down so from a simple neuron we will',
 'vocab': [['o',
   'b',
   'a',
   'y',
   'p',
   's',
   'w',
   'r',
   'd',
   ' ',
   'l',
   'f',
   'u',
   'c',
   'h',
   'n',
   'g',
   'i',
   't',
   'e',
   'v',
   'm']],
 'all_text': ['n u m b e r   w h i c h   i s   a   c a t e g o r i c a l   v a r i a b l e   w h i c h   y o u   g e t   d o w n   s o   f r o m   a   s i m p l e   n e u r o n   w e   w i l l']}
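
The spaced-out `all_text` above comes from how `str.join` treats its argument. The following is a minimal sketch, using made-up transcripts rather than dataset rows, of the difference between joining a list of strings and joining a single string:

```python
# Toy transcripts (hypothetical, not taken from the dataset).
texts = ["so from a", "we will"]

# Joining a list of strings puts one space between the transcripts.
joined_list = " ".join(texts)

# Joining a single string iterates over its characters,
# so a space is inserted between every pair of characters.
joined_str = " ".join(texts[0])

# Either way, the set of unique characters is what ends up in the vocabulary.
vocab_from_list = sorted(set(joined_list))
vocab_from_str = sorted(set(joined_str))

print(repr(joined_list))
print(repr(joined_str))
print(vocab_from_list)
```

Since only the set of characters matters for the vocabulary, the extra spaces are harmless here.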

The `vocab_list` is initialized to hold all the unique characters found across the dataset. We iterate over the per-example vocabularies in the training set and extend `vocab_list` with the characters from each one (each `vocab` entry is a nested list, hence the `v[0]`).

In [None]:
vocab_list = []

In [None]:
for v in vocabs['train']['vocab']:
    vocab_list.extend(v[0])

This converts vocab_list into a set to remove any duplicate characters and then back into a list.

In [None]:
vocab_list = list(set(vocab_list))

In [None]:
vocab_list

['o', 'b', 'a', ';', 'y', ':', 'q', 'z', '?', 'p', 's', 'w', "'", 'r', 'd', 'x', ',', '.', ' ', 'j', 'k', 'l', 'f', 'u', 'c', 'h', 'n', 'g', 'i', 't', 'e', 'v', 'm']

In [None]:
vocab_dict = {v: k for k, v in enumerate(vocab_list)}

This creates a dictionary `vocab_dict` in which each unique character is mapped to a unique integer token ID, starting from 0. This is what `vocab_dict` looks like right now:

In [None]:
vocab_dict

{'o': 0, 'b': 1, 'a': 2, ';': 3, 'y': 4, ':': 5, 'q': 6, 'z': 7, '?': 8, 'p': 9, 's': 10, 'w': 11, "'": 12, 'r': 13, 'd': 14, 'x': 15, ',': 16, '.': 17, ' ': 18, 'j': 19, 'k': 20, 'l': 21, 'f': 22, 'u': 23, 'c': 24, 'h': 25, 'n': 26, 'g': 27, 'i': 28, 't': 29, 'e': 30, 'v': 31, 'm': 32}

Let's use the pipe character "|" to indicate spaces " "

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

This reassigns the ID of the space character " " to the pipe character "|" and then deletes the space entry from the dictionary.

Using the more visible character "|" makes it clearer that " " has its own token class. In addition, we add an "[UNK]" token so that the model can later deal with characters not encountered in the training set, and a "[PAD]" token, which is used for padding and also serves as the CTC blank token for Wav2Vec2.

In [None]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

35

Observe that our desired changes were made:

In [None]:
vocab_dict

{'o': 0, 'b': 1, 'a': 2, ';': 3, 'y': 4, ':': 5, 'q': 6, 'z': 7, '?': 8, 'p': 9, 's': 10, 'w': 11, "'": 12, 'r': 13, 'd': 14, 'x': 15, ',': 16, '.': 17, 'j': 19, 'k': 20, 'l': 21, 'f': 22, 'u': 23, 'c': 24, 'h': 25, 'n': 26, 'g': 27, 'i': 28, 't': 29, 'e': 30, 'v': 31, 'm': 32, '|': 18, '[UNK]': 33, '[PAD]': 34}
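
Because `list(set(...))` does not guarantee any particular order, the character IDs can differ from run to run. Below is a minimal sketch of the same steps with sorting added for reproducibility. The helper name `build_vocab` and the sorting step are assumptions for illustration, not part of the notebook:

```python
def build_vocab(chars):
    """Map characters to integer IDs, then add the CTC-specific tokens."""
    # Sorting makes the ID assignment reproducible across runs.
    vocab = {ch: idx for idx, ch in enumerate(sorted(set(chars)))}
    # Reuse the space's ID for the more visible word delimiter "|".
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Append the special tokens at the end of the ID range.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

toy_vocab = build_vocab("a cab")
print(toy_vocab)
```

A fixed ordering matters if `vocab.json` is regenerated later, since a checkpoint trained with one ID assignment would decode incorrectly with another.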

Let's dump this `vocab_dict` to a JSON file.

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

A `Wav2Vec2CTCTokenizer` is created from the saved `vocab.json` file.

See HF Docs for [Wav2Vec2CTCTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2#transformers.Wav2Vec2CTCTokenizer)

It defines the following special tokens:
- `unk_token` : "[UNK]" for unknown characters
- `pad_token` : "[PAD]" for padding.
- `word_delimiter_token` : "|", which is used as the token for separating words (since spaces were replaced by pipes earlier).

In [None]:
tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
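
Conceptually, the tokenizer turns each transcript into a sequence of character IDs. The pure-Python sketch below mimics that mapping with a tiny hypothetical vocabulary. `encode` and `toy_vocab` are illustrative names, and the real `Wav2Vec2CTCTokenizer` handles more cases than this:

```python
# Hypothetical miniature vocabulary in the same style as vocab.json.
toy_vocab = {"a": 0, "b": 1, "c": 2, "|": 3, "[UNK]": 4, "[PAD]": 5}

def encode(text, vocab):
    """Replace spaces with the word delimiter and look up each character."""
    unk_id = vocab["[UNK]"]
    # Characters missing from the vocabulary fall back to the [UNK] ID.
    return [vocab.get(ch, unk_id) for ch in text.replace(" ", "|")]

ids = encode("a cab z", toy_vocab)
print(ids)
```

Here "z" is not in the toy vocabulary, so it maps to the "[UNK]" ID, which is the role that token plays for unseen characters.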

The `Wav2Vec2FeatureExtractor` is initialized to process audio features.

See HF Docs for [Wav2Vec2FeatureExtractor](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2#transformers.Wav2Vec2FeatureExtractor)

Key params:
- `feature_size = 1` : Indicates 1D audio features
- `sampling_rate = 16000` : Audio is expected to have a sampling rate of 16,000 Hz (16 kHz)
- `padding_value = 0` : Padding values are set to 0.0
- `do_normalize = True` : Normalization of audio is enabled.
- `return_attention_mask = False` : Attention masks are not used.
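
To make `do_normalize = True` concrete, here is a minimal sketch of zero-mean, unit-variance normalization applied to one waveform. The function name `zero_mean_unit_var` and the small `eps` constant are assumptions for illustration; the feature extractor performs this kind of normalization internally:

```python
import math

def zero_mean_unit_var(samples, eps=1e-7):
    """Shift a waveform to zero mean and scale it to unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps guards against division by zero for silent (constant) audio.
    return [(x - mean) / math.sqrt(var + eps) for x in samples]

waveform = [0.5, -0.5, 1.0, -1.0]  # toy samples, not real audio
normalized = zero_mean_unit_var(waveform)
print(normalized)
```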

In [None]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

The `Wav2Vec2Processor` wraps the feature extractor and the CTC tokenizer into a single object, so one processor handles both audio feature extraction and text tokenisation.

`Wav2Vec2Processor` needs two params:
- `feature_extractor` (Wav2Vec2FeatureExtractor): the feature extractor instance created above. Required.
- `tokenizer` (PreTrainedTokenizer): the tokenizer instance created above. Required.

In [None]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

In [None]:
print("Target text:", dataset['train'][100]['text'])
print('Input array shape:', np.asarray(dataset['train'][100]['audio']['array']).shape)
print('Sampling rate:', dataset['train'][100]['audio']['sampling_rate'])

Target text: number which is a categorical variable which you get down so from a simple neuron we will
Input array shape: (80976,)
Sampling rate: 16000
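
The numbers above can be related to what the model sees. The sketch below computes the clip duration and estimates how many frames the convolutional feature encoder would produce for this clip, assuming the default Wav2Vec2 layout of kernel sizes and strides (treat those two tuples as an assumption). The helper `conv_output_length` is a hypothetical name:

```python
def conv_output_length(num_samples, kernels, strides):
    """Apply the 1-D convolution length formula layer by layer (no padding)."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length

num_samples = 80976
sampling_rate = 16000

# Assumed feature-encoder layout: kernel size and stride of each conv layer.
kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)

duration_s = num_samples / sampling_rate
num_frames = conv_output_length(num_samples, kernels, strides)
print(f"{duration_s:.3f} s, {num_frames} frames")
```

Under these assumptions the roughly 5-second clip becomes 252 frames, about one every 20 ms. CTC emits one prediction per frame, so the number of frames must be at least the number of target characters.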


The function below converts the data into a format suitable for training an ASR model. Because `map` is called without `batched=True`, the function receives a single example at a time. After processing, each example has three new fields: `input_values`, `input_length` and `labels`.

In [None]:
def prepare_dataset(batch):
    audio = batch['audio']
    # the processor returns a batch of one, so take element 0 to store a single example
    batch['input_values'] = processor(audio['array'], sampling_rate=audio['sampling_rate']).input_values[0]
    batch['input_length'] = len(batch['input_values'])

    # encode the transcript as character IDs; calling the tokenizer directly makes
    # the deprecated processor.as_target_processor() context manager unnecessary
    batch['labels'] = tokenizer(batch['text']).input_ids
    return batch

In [None]:
dataset = dataset.map(prepare_dataset, batch_size=8)



The data collator class below **pads** the audio inputs and the text labels within a batch to consistent lengths, so that they can be stacked into tensors for training an ASR model.

In [None]:
@dataclass
class DatacollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj: `False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        # pad the labels separately; passing them via `labels=` replaces the
        # deprecated `as_target_processor()` context manager
        labels_batch = self.processor.pad(
            labels=label_features,
            padding=self.padding,
            return_tensors='pt',
        )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DatacollatorCTCWithPadding(processor=processor, padding=True)
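
The net effect of the collator on the labels can be sketched in plain Python: shorter label sequences are padded to the longest one in the batch, and the padded positions end up as `-100` so that they are ignored when the loss is computed. `pad_labels` is a hypothetical helper for illustration, not the collator's actual implementation:

```python
def pad_labels(label_seqs, ignore_index=-100):
    """Right-pad label ID lists to a common length with ignore_index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]

# Toy label sequences of different lengths (made-up IDs).
batch_labels = pad_labels([[7, 2, 9], [4], [1, 1]])
print(batch_labels)
```

The audio inputs are padded in the same spirit, but with the feature extractor's `padding_value` (0.0) instead of `-100`.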

We define the error metric, word error rate (**WER**), and a function that computes it on the evaluation set during training.

In [None]:
wer_metric = evaluate.load('wer')
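
For intuition, WER is the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. The following is a minimal sketch of that definition; `word_error_rate` is a hypothetical helper, and the `wer` metric loaded above should be preferred in practice:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

example_wer = word_error_rate("we will get down", "we will got down now")
print(example_wer)
```

In this toy case there is one substitution and one insertion against four reference words, giving a WER of 0.5. Because insertions count as errors, WER can exceed 1.0.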

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
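
The reason predictions and labels are decoded differently comes from how CTC output is read: per-frame predictions are collapsed by merging consecutive repeats and then dropping the blank token, whereas labels are already the true character sequence, so repeated letters (such as the "ll" in "will") must not be merged. Here is a minimal sketch of greedy CTC collapsing, with `ctc_greedy_collapse` as a hypothetical helper name:

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeats, then drop blank tokens."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id
    return [t for t in collapsed if t != blank_id]

# Hypothetical per-frame argmax output; 0 plays the role of the blank ([PAD]).
frames = [3, 3, 0, 3, 5, 5, 0, 0, 2]
decoded = ctc_greedy_collapse(frames, blank_id=0)
print(decoded)
```

Note that the blank between the two runs of `3` is what allows a genuinely doubled character to survive the collapse.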

In [None]:
dataset = dataset['train'].train_test_split(test_size=0.1, shuffle=True, seed=42)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'input_values', 'input_length', 'labels'],
        num_rows: 7200
    })
    test: Dataset({
        features: ['audio', 'text', 'input_values', 'input_length', 'labels'],
        num_rows: 800
    })
})

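Load the pretrained Wav2Vec2 base model released by Facebook. The warning printed below is expected: the CTC head (`lm_head`) is not part of the pretrained checkpoint, so it is newly initialized and has to be trained.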

In [None]:
model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.bias', 'lm_head.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


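We **freeze the convolutional feature encoder of the pretrained Wav2Vec2 model.** `freeze_feature_encoder()` disables gradient updates only for these feature-extraction layers, so the low-level audio features learned during pretraining are left undisturbed. The transformer layers and the newly initialized CTC head are still fine-tuned on our task-specific dataset; we are not training the entire model from scratch.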

In [None]:
model.freeze_feature_encoder()

In [None]:
training_args = TrainingArguments(
    run_name="AsrTaskModel",
    output_dir="AsrTaskModel",
    group_by_length=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    num_train_epochs=5,
    fp16=True,
    gradient_checkpointing=True,
    save_steps=1000,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=1000,
    save_total_limit=2,
    load_best_model_at_end=True,
    save_strategy="steps",
)
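
These settings determine how many optimizer steps the run takes and how the learning rate evolves. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, and the Trainer's default linear schedule with warmup. `linear_schedule_lr` is a hypothetical helper that mirrors that schedule:

```python
import math

train_examples = 7200
batch_size = 8
epochs = 5
warmup_steps = 1000
peak_lr = 1e-4

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_schedule_lr(step):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)

for step in (500, 1000, 2750, 4500):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the run takes 900 steps per epoch and 4500 steps in total, with the learning rate peaking at step 1000 and decaying to zero by the final step.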




In [None]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=processor.feature_extractor,
)


In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
trainer.train()


| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----|
| 500 | 4.4737 | 3.093708 | 1.0 |
| 1000 | 1.8537 | 0.939619 | 0.55781 |
| 1500 | 0.8488 | 0.67559 | 0.418809 |
| 2000 | 0.6624 | 0.574462 | 0.354423 |
| 2500 | 0.5308 | 0.484679 | 0.302158 |
| 3000 | 0.4447 | 0.481162 | 0.282386 |
| 3500 | 0.4152 | 0.444981 | 0.270008 |
| 4000 | 0.3589 | 0.452595 | 0.261755 |
| 4500 | 0.3302 | 0.428146 | 0.259348 |



TrainOutput(global_step=4500, training_loss=1.1020528632269966, metrics={'train_runtime': 3630.1582, 'train_samples_per_second': 9.917, 'train_steps_per_second': 1.24, 'total_flos': 2.1133710337425715e+18, 'train_loss': 1.1020528632269966, 'epoch': 5.0})
