## Task: Fine-tuning an Automatic Speech Recognition (ASR) AI Model

This notebook presents my approach to fine-tuning the pre-trained ```facebook/wav2vec2-large-960h``` model. As stated in the question, the dataset is Common Voice, with the audio files located in the folder ```data/cv-valid-train```.

Before running any of the code, make sure your environment is set up using the instructions in the main README.md, which installs the necessary packages. If you still hit import errors, install the following packages manually:
```datasets```, ```transformers```, ```librosa```, ```jiwer```

Training checkpoints will be uploaded to the Hugging Face Hub during training, so we need to install Git LFS and authenticate with a Hugging Face account.

```Note: Based on my research done, I will use Word Error Rate (WER) as the metric to evaluate the performance of the ASR Model```
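To make the metric concrete, here is a minimal sketch of how WER can be computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. The function name `word_error_rate` is my own for illustration; the notebook itself relies on `jiwer` later on, and this sketch is not meant to reproduce every detail of that library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "learn to recognize omens and follow them"
hypothesis = "learn to recognise omens and follow"
# One substitution plus one deletion over seven reference words
print(word_error_rate(reference, hypothesis))
```

A lower WER is better, and a value of 0 means the hypothesis matches the reference word for word.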

In [1]:
# Run this cell only if you need authentication for the Hugging Face Hub
from huggingface_hub import notebook_login

# notebook_login()

Run the following command in your terminal to install Git LFS:

1. Linux
```bash
sudo apt-get install git-lfs
```
2. macOS
```bash
brew install git-lfs
```

## Importing Relevant Modules

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import soundfile as sf
import librosa
import time
import os
import re
import json

import torch
import torchaudio
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import transformers

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from tqdm import tqdm

Then, run the following code to make the folder directories (if the folders do not exist in the current ```asr-train``` folder)

In [3]:
os.makedirs("data", exist_ok=True)
os.makedirs("logs", exist_ok=True)
os.makedirs("models", exist_ok=True)

In [4]:
# Set the device type to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


```txt
Some useful commands if you get an error initialising CUDA on your machine (run in order)
- sudo rmmod nvidia-uvm
- sudo modprobe nvidia-uvm
```

## Preparing Data, Tokenizer, Feature Extractor

ASR models transcribe speech to text, so we need both:
1. ```Tokenizer```: converts the model's output (token IDs) into text
2. ```Feature Extractor```: converts the speech signal into the model's input format, e.g. a feature vector

In [5]:
# Set Pandas display options to show full text
pd.set_option("display.max_colwidth", None)

In [6]:
# Load Common Voice Train dataset
common_voice_train = pd.read_csv("data/cv-valid-train.csv")

common_voice_train.head()

Unnamed: 0,filename,text,up_votes,down_votes,age,gender,accent,duration
0,cv-valid-train/sample-000000.mp3,learn to recognize omens and follow them the old king had said,1,0,,,,
1,cv-valid-train/sample-000001.mp3,everything in the universe evolved he said,1,0,,,,
2,cv-valid-train/sample-000002.mp3,you came so that you could learn about your dreams said the old woman,1,0,,,,
3,cv-valid-train/sample-000003.mp3,so now i fear nothing because it was those omens that brought you to me,1,0,,,,
4,cv-valid-train/sample-000004.mp3,if you start your emails with greetings let me be the first to welcome you to earth,3,2,,,,


### Create Tokenizer for Speech Recognition

A ```Connectionist Temporal Classification (CTC) tokenizer``` is used with the pre-trained ```facebook/wav2vec2-large-960h``` model. CTC lets the model handle variable-length audio input by predicting a character (or a special "blank" token) for every audio frame, so no explicit alignment between the characters of the transcript and the audio is required.

Referenced from [CTC research link](https://distill.pub/2017/ctc/)
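The collapsing rule behind CTC decoding can be sketched in a few lines: consecutive repeated predictions are merged first, and the blank token is removed afterwards. The `_` symbol and the function name `ctc_collapse` are stand-ins I chose for illustration; in this notebook the blank role is played by the `[PAD]` token.

```python
from itertools import groupby

BLANK = "_"  # stand-in for the CTC blank token


def ctc_collapse(frame_predictions):
    """Merge consecutive repeats, then drop blanks."""
    merged = [token for token, _ in groupby(frame_predictions)]
    return "".join(token for token in merged if token != BLANK)


# One predicted symbol per audio frame
with_blank = list("hh_e_ll_ll_oo")
without_blank = list("hheelllloo")

print(ctc_collapse(with_blank))     # hello
print(ctc_collapse(without_blank))  # helo
```

The second call shows why the blank matters: without a blank between the two runs of `l`, a genuine double letter would be merged into one.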

Common Voice provides additional information about each audio file, such as ```gender``` and ```accent```. Since the evaluation metric, ```WER```, depends only on the transcript, we can drop these columns and keep only the ```text``` column (alongside ```filename```) for fine-tuning.

In [7]:
common_voice_train.drop(columns=["up_votes", "down_votes", "age", "gender", "accent", "duration"], inplace=True)
common_voice_train.head()

Unnamed: 0,filename,text
0,cv-valid-train/sample-000000.mp3,learn to recognize omens and follow them the old king had said
1,cv-valid-train/sample-000001.mp3,everything in the universe evolved he said
2,cv-valid-train/sample-000002.mp3,you came so that you could learn about your dreams said the old woman
3,cv-valid-train/sample-000003.mp3,so now i fear nothing because it was those omens that brought you to me
4,cv-valid-train/sample-000004.mp3,if you start your emails with greetings let me be the first to welcome you to earth


Based on the first few rows of the dataframe, the transcriptions look very clean: all characters are already lowercase, and special characters such as ```, . ? ! ; :``` are not present. This matters because it simplifies tokenizing the text and reduces ambiguity in the transcripts.

However, we will still run a check over the whole column to confirm that the transcripts are clean.

In [8]:
# Check if all the text in the 'text' column is lowercase
if all(common_voice_train['text'].str.islower()):
    print("All text in the 'text' column is lowercase.")

# Check for the following special characters in the 'text' column
if not any(common_voice_train['text'].str.contains(r'[,\.\?\!;:\-\—\–]', regex=True)):
    print("Special characters not found in the 'text' column.")

All text in the 'text' column is lowercase.
Special characters not found in the 'text' column.


CTC works by classifying chunks of speech into letters, so we need to extract all the distinct letters from the ```common_voice_train``` dataset and build a vocabulary from them. This vocabulary will be used later when initialising the CTC tokenizer.

In [9]:
# Extract all unique letters from the 'text' column
unique_letters = set(''.join(common_voice_train['text']))
unique_letters = sorted(unique_letters)

print("Unique letters:", unique_letters)

Unique letters: [' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


The extracted characters include the apostrophe ```'```, but it does not need to be removed, for the following reasons:
1. Handling contractions in text: contractions like ```i'm``` and ```don't``` are common in natural language and expand to ```i am``` and ```do not``` respectively. The apostrophe ```'``` is therefore a critical part of these words, and keeping it ensures that the meaning of the sentence is not lost.
2. CTC tokenizer behaviour: the CTC tokenizer treats ```'``` like any other character in the vocabulary, so ```i'm``` is split into ```["i", "'", "m"]``` and the apostrophe is preserved in the decoded transcript.

Furthermore, we will add the ```[UNK]``` and ```[PAD]``` tokens, which correspond to the unknown token and CTC's "blank token" respectively. The blank token is a core component of the CTC algorithm. Finally, to make the space character more visible, we will replace it with ```|``` as the word delimiter token.

In [10]:
# Creating the vocabulary dictionary
vocab_dict = {letter: idx for idx, letter in enumerate(unique_letters)}
vocab_dict['|'] = len(vocab_dict)  # Word delimiter token that stands in for the space character
vocab_dict['[UNK]'] = len(vocab_dict)  # Unknown token for characters outside the vocabulary
vocab_dict['[PAD]'] = len(vocab_dict)  # Padding token, which doubles as the CTC blank token
del vocab_dict[" "]  # Note: the space held ID 0, so ID 0 is left unassigned

print(vocab_dict)
print("Vocabulary size:", len(vocab_dict))

{"'": 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'j': 11, 'k': 12, 'l': 13, 'm': 14, 'n': 15, 'o': 16, 'p': 17, 'q': 18, 'r': 19, 's': 20, 't': 21, 'u': 22, 'v': 23, 'w': 24, 'x': 25, 'y': 26, 'z': 27, '|': 28, '[UNK]': 29, '[PAD]': 30}
Vocabulary size: 30
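Note that the printed dictionary starts at ID 1, because the space character held ID 0 before it was deleted. One possible variation, which is not what was run above, is to let ```|``` reuse the space character's ID so that the IDs stay contiguous from 0. The sketch below rebuilds the same 28 characters by hand (an assumption, so that the example does not need the dataset) and uses the hypothetical variable name `vocab`.

```python
# The 28 unique characters found in the transcripts: space, apostrophe, a-z
letters = [" ", "'"] + [chr(code) for code in range(ord("a"), ord("z") + 1)]

vocab = {letter: idx for idx, letter in enumerate(sorted(letters))}
vocab["|"] = vocab.pop(" ")    # the delimiter takes over the space character's ID (0)
vocab["[UNK]"] = len(vocab)    # unknown token
vocab["[PAD]"] = len(vocab)    # padding token, used as the CTC blank

print(vocab["|"], vocab["[UNK]"], vocab["[PAD]"])
print("Vocabulary size:", len(vocab))
```

With this variation the largest ID is one less than the vocabulary size, which is what a classification head with `vocab_size` outputs would normally expect.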


Let's save the vocabulary as a JSON file format.

In [11]:
# Save the vocabulary dictionary to a JSON file
with open("vocab.json", "w") as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [12]:
from transformers import Wav2Vec2CTCTokenizer

# Load the tokenizer together with the vocabulary JSON file
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

### Creating the Feature Extractor

This is a critical step in fine-tuning the Wav2Vec2 model, as the feature extractor is responsible for converting raw audio signals into the input format the model expects. Unlike models that consume log-mel spectrograms, Wav2Vec2 takes the raw waveform as input, and its convolutional layers learn the acoustic features.

Process of Feature Extractor
1. Normalises each raw waveform (zero mean and unit variance when ```do_normalize=True```) to ensure consistent input to the model.
2. Pads (and optionally truncates) the waveforms so that all inputs in a batch have the same length, returning an attention mask that marks the padded positions.

Also note that ```facebook/wav2vec2-large-960h``` was pretrained and fine-tuned on 960 hours of Librispeech with speech audio sampled at 16kHz. Since the Common Voice audio is sampled at 48kHz, it has to be downsampled to 16kHz for fine-tuning.

In [13]:
from transformers import Wav2Vec2FeatureExtractor

# Load the feature extractor
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)
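The ```do_normalize=True``` and ```padding_value=0.0``` settings can be pictured as operating directly on the raw waveform. The sketch below is a plain-Python approximation of that idea, not the library's implementation: the helper names `normalize_waveform` and `pad_waveforms` and the small `eps` constant are my own assumptions.

```python
import math


def normalize_waveform(samples, eps=1e-7):
    """Scale one waveform to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    scale = math.sqrt(variance + eps)
    return [(s - mean) / scale for s in samples]


def pad_waveforms(batch, padding_value=0.0):
    """Right-pad every waveform to the longest one and build attention masks."""
    longest = max(len(waveform) for waveform in batch)
    padded, masks = [], []
    for waveform in batch:
        n_pad = longest - len(waveform)
        padded.append(list(waveform) + [padding_value] * n_pad)
        masks.append([1] * len(waveform) + [0] * n_pad)
    return padded, masks


batch = [[0.1, 0.3, -0.2, 0.4], [0.5, -0.5]]  # two toy "waveforms"
normalized = [normalize_waveform(waveform) for waveform in batch]
padded, attention_mask = pad_waveforms(normalized)
print(attention_mask)
```

The attention mask is what lets the model tell real audio samples apart from the zeros added by padding.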

Then, we will combine the feature extractor and tokenizer into a processor

In [14]:
from transformers import Wav2Vec2Processor

# Load the processor
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

### Data preprocessing for Speech Data

From the common_voice_train dataframe, we will now split the data 70:30, where 30% of the data will be used as the validation dataset.

Then, from the split data we will create a ```CustomASRDataset``` that reads the audio files in the folder ```data/cv-valid-train```. It downsamples the audio from 48kHz to 16kHz, tokenizes the text, and returns a PyTorch Dataset.

In [15]:
import IPython.display as ipd

# Using an example case to demonstrate downsampling and truncation of audio files
audio_path = "data/cv-valid-train/sample-000823.mp3"
audio, sample_rate = sf.read(audio_path)

print(f"Original Sample Rate: {sample_rate} Hz")
print(f"Original Audio Length: {len(audio) / sample_rate:.2f} seconds")
print("Playing Original Audio:")
ipd.display(ipd.Audio(audio, rate=sample_rate))

if sample_rate != 16000:
    audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
    sample_rate = 16000

# Ensure that max_duration of audio is less than 10 seconds
max_samples = 10 * sample_rate
if len(audio) > max_samples:
    audio = audio[:max_samples]
    
print(f"New Sample Rate: {sample_rate} Hz")
print(f"New Audio Length: {len(audio) / sample_rate:.2f} seconds")
print("Playing Resampled Audio:")
ipd.display(ipd.Audio(audio, rate=sample_rate))

Original Sample Rate: 48000 Hz
Original Audio Length: 8.71 seconds
Playing Original Audio:


New Sample Rate: 16000 Hz
New Audio Length: 8.71 seconds
Playing Resampled Audio:


## Data Preparation

```txt
After multiple rounds of experimentation trying to fine-tune the ASR model using all 195776 rows of transcripts and audio files, I ran into the following issues:
1. CUDA out-of-memory errors
2. Long fine-tuning durations
3. Overfitting issues

Given these problems, I will assume that fine-tuning on a subset of the 195776 rows is acceptable, in the interest of time and because of limited computational resources.

To decide how to take this subset, I will explain the steps used to determine which data points are selected for the fine-tuning process.
```

In [16]:
# 70% Training, 30% Validation
train_df, val_df= train_test_split(common_voice_train, test_size=0.3, random_state=42)
print(f"Train samples: {len(train_df)}, Validation samples: {len(val_df)}")

Train samples: 137043, Validation samples: 58733


After testing the logic for downsampling and truncating the audio files, we will now wrap train_df and val_df in the dataset class that will be used in the training phase.

In [17]:
class CustomASRDataset(Dataset):
    def __init__(self, df, processor, max_duration=10):
        """
        Custom dataset class for ASR fine-tuning.

        Args:
            df (DataFrame): Pandas dataframe with columns 'filename' and 'text'.
            processor (Wav2Vec2Processor): Hugging Face processor for feature extraction & tokenization.
            max_duration (int or float): Maximum audio duration in seconds; longer clips are truncated.
        """
        self.df = df.reset_index(drop=True)  # Reset index to avoid issues
        self.processor = processor
        self.max_samples = int(max_duration * 16000)  # 16 kHz sampling rate

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Load audio file
        file_path = os.path.join("data", self.df.iloc[idx]["filename"]) # Pointing to the data folder 
        transcript = self.df.iloc[idx]["text"]

        # Read and resample the audio
        audio, sample_rate = sf.read(file_path)
        if sample_rate != 16000:
            audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)

        # Ensure audio does not exceed max duration
        if len(audio) > self.max_samples:
            audio = audio[:self.max_samples]

        # Convert audio to model input format, padded or truncated to max_samples
        inputs = self.processor(
            audio,
            sampling_rate=16000,
            padding="max_length",
            max_length=self.max_samples,  # Same limit as the truncation above
            truncation=True,
            return_attention_mask=True
        )

        # Convert text to tokenized labels (for CTC models)
        labels = self.processor.tokenizer(transcript, return_tensors="pt").input_ids

        return {
            "input_values": inputs.input_values[0],
            "attention_mask": inputs.attention_mask[0],
            "labels": labels[0]
        }

In [18]:
# Create training and validation datasets (max_duration: Maximum duration of audio in seconds)
train_dataset = CustomASRDataset(train_df, processor, max_duration=10)
val_dataset = CustomASRDataset(val_df, processor, max_duration=10)

## Training of ASR Model


### Set-up Trainer

We will start by defining the data collator. The code for the data collator was adapted from [this script](https://github.com/huggingface/transformers/blob/9a06b6b11bdfc42eea08fa91d0c737d1863c99e3/examples/research_projects/wav2vec2/run_asr.py) in the Hugging Face Transformers GitHub repository.

In summary, this data collator treats the ```input_values``` and ```labels``` differently, so separate padding functions are applied to them. This is necessary because the speech input and the text output are different modalities, so the same padding function cannot be used.

In [19]:
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Pad the labels with the tokenizer directly; this is equivalent to the deprecated
        # `as_target_processor()` context manager.
        labels_batch = self.processor.tokenizer.pad(
            label_features,
            padding=self.padding,
            max_length=self.max_length_labels,
            pad_to_multiple_of=self.pad_to_multiple_of_labels,
            return_tensors="pt",
        )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [20]:
# Running the DataCollatorCTCWithPadding class
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
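The label side of the collator can be illustrated without any libraries: the label sequences are padded to the longest one in the batch, and the padded positions are then overwritten with `-100` so that they do not contribute to the loss. The helper name `pad_labels` is hypothetical, and the IDs are taken from the vocabulary printed earlier (`h` = 9, `i` = 10, `o` = 16, `m` = 14, `e` = 6, `n` = 15, `[PAD]` = 30).

```python
IGNORE_INDEX = -100  # marks label positions that should not count towards the loss


def pad_labels(label_batch, pad_id):
    """Pad label ID lists to the longest one, then mask the padding with IGNORE_INDEX."""
    longest = max(len(labels) for labels in label_batch)
    padded = []
    for labels in label_batch:
        n_pad = longest - len(labels)
        filled = list(labels) + [pad_id] * n_pad
        attention = [1] * len(labels) + [0] * n_pad
        padded.append(
            [token if keep else IGNORE_INDEX for token, keep in zip(filled, attention)]
        )
    return padded


# "hi" and "omen" encoded with the vocabulary above
batch = pad_labels([[9, 10], [16, 14, 6, 15]], pad_id=30)
print(batch)  # [[9, 10, -100, -100], [16, 14, 6, 15]]
```

Masking by attention mask rather than by comparing against `pad_id` avoids accidentally hiding a real label that happens to share the padding ID.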

Next, we need to define the evaluation metric. As mentioned above, ```Word Error Rate (WER)``` will be used, computed with the ```jiwer``` library.

The model returns a sequence of logit vectors:
$\mathbf{y}_1, \ldots, \mathbf{y}_m$ with $\mathbf{y}_1 = f_{\theta}(x_1, \ldots, x_n)[0]$ and $n \gg m$.

Each logit vector $\mathbf{y}_i$ contains an unnormalised score for every token in the model's vocabulary, thus $\text{len}(\mathbf{y}_i) =$ `config.vocab_size`. We are interested in the most likely prediction of the model and thus take the `argmax(...)` of the logits. 
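To get a feel for how much shorter the logit sequence is than the raw audio, the number of output frames can be estimated from the kernel sizes and strides of the seven convolution layers shown in the model printout in this notebook. This is a back-of-the-envelope sketch that assumes no padding in the convolutions; `num_output_frames` is a hypothetical helper.

```python
# (kernel_size, stride) for each Conv1d layer of the feature encoder
CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2


def num_output_frames(num_samples):
    """Apply the unpadded Conv1d length formula layer by layer."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length


n = 10 * 16000              # 10 seconds of audio at 16 kHz
m = num_output_frames(n)
print(n, m)                 # 160000 499
```

So a 10-second clip with 160,000 samples yields roughly 500 logit vectors, about one every 20 ms.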

Then, we need to transform the encoded labels back to the original string by replacing `-100` with the `pad_token_id` and decoding the ids while making sure that consecutive tokens are **not** grouped to the same token in CTC style.
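The two decoding steps can be sketched with plain Python. The IDs come from the vocabulary printed earlier (`e` = 6, `h` = 9, `l` = 13, `o` = 16, `[PAD]` = 30); the helper names are my own, and the real work in this notebook is done by `processor.batch_decode`.

```python
from itertools import groupby

PAD_ID = 30  # ID of [PAD] in the vocabulary printed earlier
ID_TO_CHAR = {6: "e", 9: "h", 13: "l", 16: "o", PAD_ID: ""}  # small excerpt


def argmax(scores):
    """Index of the largest score in one logit vector."""
    return max(range(len(scores)), key=scores.__getitem__)


def restore_label_ids(label_ids):
    """Swap the -100 loss-masking value back to the padding ID."""
    return [PAD_ID if token == -100 else token for token in label_ids]


def decode(ids, group_tokens):
    """Map IDs to characters, optionally merging consecutive repeats first."""
    if group_tokens:
        ids = [token for token, _ in groupby(ids)]
    return "".join(ID_TO_CHAR[token] for token in ids)


labels = restore_label_ids([9, 6, 13, 13, 16, -100, -100])  # "hello" plus padding
print(decode(labels, group_tokens=False))  # hello
print(decode(labels, group_tokens=True))   # helo
print(argmax([0.1, 2.5, -1.0]))            # 1
```

Grouping is appropriate for frame-level predictions, but the reference labels are already the final characters, so grouping them would wrongly merge double letters.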

In [21]:
import jiwer

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    # Replace -100 with the padding token id
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # We do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    # Compute WER using jiwer
    wer = jiwer.wer(label_str, pred_str)

    return {"wer": wer}

Now, we will load the pretrained checkpoint. It is important to pass the tokenizer's pad_token_id to the model as its pad_token_id, since this token serves as CTC's blank token. Note that the cell below loads ```facebook/wav2vec2-base-960h``` rather than ```facebook/wav2vec2-large-960h```, so the logged outputs that follow come from the base checkpoint.

```FlashAttention-2``` can make the self-attention layers more efficient by reducing memory usage and speeding up training on supported GPUs. The cell below does not request it, so the model is loaded with the library's default attention implementation.

In [22]:
from transformers import Wav2Vec2ForCTC

# Load pre-trained Wav2Vec2 model
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Ensure that model is on GPU
model.to(device)

# Show model architecture
print(model)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2GroupNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
          (activation): GELUActivation()
          (layer_norm): GroupNorm(512, 512, eps=1e-05, affine=True)
        )
        (1-4): 4 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder)

The first component of Wav2Vec2 consists of CNN layers, which extract acoustically meaningful features from the raw speech signal. This part of the model is already sufficiently trained during pre-training, as stated in the [paper](https://arxiv.org/abs/2006.11477), so we do not need to fine-tune these layers.

Thus, we will freeze these layers by setting ```requires_grad=False``` for all parameters of the feature encoder.

In [23]:
# Freeze feature extractor (CNN layers)
for param in model.wav2vec2.feature_extractor.parameters():
    param.requires_grad = False

# Do a double check to ensure feature extractor is frozen
for name, param in model.named_parameters():
    if not param.requires_grad:
        print(f"{name} is frozen.")

wav2vec2.feature_extractor.conv_layers.0.conv.weight is frozen.
wav2vec2.feature_extractor.conv_layers.0.layer_norm.weight is frozen.
wav2vec2.feature_extractor.conv_layers.0.layer_norm.bias is frozen.
wav2vec2.feature_extractor.conv_layers.1.conv.weight is frozen.
wav2vec2.feature_extractor.conv_layers.2.conv.weight is frozen.
wav2vec2.feature_extractor.conv_layers.3.conv.weight is frozen.
wav2vec2.feature_extractor.conv_layers.4.conv.weight is frozen.
wav2vec2.feature_extractor.conv_layers.5.conv.weight is frozen.
wav2vec2.feature_extractor.conv_layers.6.conv.weight is frozen.


### Setting the Hyperparameters

Here, we will use the TrainingArguments class from the transformers library; see its [documentation](https://huggingface.co/docs/transformers/main/main_classes/trainer#trainingarguments) for the available options.

In [24]:
from transformers import TrainingArguments

# Defining the hyperparameters (for hyperparameter tuning)
batch_size = 2
learning_rate = 1e-4
weight_decay = 0.005
warmup_steps = 500
gradient_accumulation_steps = 2

# Setting up the training arguments
training_args = TrainingArguments(
  output_dir="./wav2vec2-large-960h-cv",
  group_by_length=True,
  per_device_train_batch_size=batch_size,
  per_device_eval_batch_size=batch_size,
  gradient_accumulation_steps=gradient_accumulation_steps,
  eval_strategy="steps",
  num_train_epochs=3,
  gradient_checkpointing=True,
  save_steps=50,
  eval_steps=50,
  logging_steps=50,
  learning_rate=learning_rate,
  weight_decay=weight_decay,
  warmup_steps=warmup_steps,
  save_total_limit=2,
  push_to_hub=True,
  hub_model_id="enlihhhhh/wav2vec2-large-960h-cv",
  fp16=True,
  disable_tqdm=False,
)
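As a rough check of what these settings imply, the number of optimizer steps can be worked out by hand. The sketch assumes a single GPU and the 137,043 training examples from the split above; the variable names are only for illustration.

```python
import math

num_examples = 137_043   # size of train_df
per_device_batch = 2     # batch_size
grad_accum = 2           # gradient_accumulation_steps
epochs = 3               # num_train_epochs
num_devices = 1          # assumption: one GPU
eval_every = 50          # eval_steps

effective_batch = per_device_batch * num_devices * grad_accum
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
updates_per_epoch = batches_per_epoch // grad_accum
total_steps = updates_per_epoch * epochs
num_evaluations = total_steps // eval_every

print(effective_batch)   # 4
print(total_steps)       # 102783
print(num_evaluations)   # 2055
```

The total agrees with the `Total optimization steps` value reported in the training log of this notebook. It also shows that evaluating every 50 steps would trigger about two thousand passes over the 58,733 validation examples, so a larger `eval_steps` may be worth considering.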

In [25]:
from transformers import Trainer, EarlyStoppingCallback

# Initialise Trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=processor.feature_extractor, # processing_class replaced tokenizer argument 
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] # Early stopping callback to prevent overfitting
)

### Training process

In [26]:
from transformers import logging

logging.set_verbosity_info()  # Show training info

In [27]:
trainer.train()

***** Running training *****
  Num examples = 137,043
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 2
  Total optimization steps = 102,783
  Number of trainable parameters = 90,195,872
Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


AssertionError: EarlyStoppingCallback requires metric_for_best_model to be defined
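The assertion above says that `EarlyStoppingCallback` needs to know which metric to monitor. One possible fix, sketched here as a dictionary of extra keyword arguments for `TrainingArguments`, is to point it at the `wer` value returned by `compute_metrics`. The name `early_stopping_kwargs` is hypothetical, and these settings were not part of the run logged in this notebook.

```python
# Extra TrainingArguments settings so that early stopping can track WER
early_stopping_kwargs = {
    "metric_for_best_model": "wer",   # key returned by compute_metrics
    "greater_is_better": False,       # a lower WER is better
    "load_best_model_at_end": True,   # addresses the warning printed before the error
}

# Intended use (not executed here): pass them alongside the existing arguments, e.g.
# training_args = TrainingArguments(output_dir="./wav2vec2-large-960h-cv", **early_stopping_kwargs)
print(early_stopping_kwargs)
```

`load_best_model_at_end=True` expects the save and evaluation schedules to line up, which the existing `save_steps=50` and `eval_steps=50` settings should satisfy.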

In [45]:
trainer.state.log_history

[{'loss': 121.3034,
  'grad_norm': 1733.9974365234375,
  'learning_rate': 3e-05,
  'epoch': 0.000583754122763492,
  'step': 10},
 {'loss': 49.2909,
  'grad_norm': 139.5467987060547,
  'learning_rate': 9.999416115219931e-05,
  'epoch': 0.001167508245526984,
  'step': 20},
 {'loss': 17.6039,
  'grad_norm': 85.82910919189453,
  'learning_rate': 9.997469832619697e-05,
  'epoch': 0.001751262368290476,
  'step': 30},
 {'loss': 15.2123,
  'grad_norm': 26.17442512512207,
  'learning_rate': 9.995523550019464e-05,
  'epoch': 0.002335016491053968,
  'step': 40},
 {'loss': 10.6112,
  'grad_norm': 15.897912979125977,
  'learning_rate': 9.99357726741923e-05,
  'epoch': 0.00291877061381746,
  'step': 50}]

## References 
1. [Official Research by Meta AI](https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)
2. [HuggingFace Resources on wav2vec2](https://huggingface.co/docs/transformers/en/model_doc/wav2vec2)
