<a href="https://colab.research.google.com/github/alexziweiwang/ALgo_CV_MW/blob/main/base_torgo_of_azw__dev5_Wave2Vec2_torgo_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fine-tuning Wav2Vec2 on the TORGO Dataset with 🤗 Transformers**

# **Ensure that a GPU and enough RAM are available: both are needed for training**

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Nov 15 09:55:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P0    42W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# ensure enough memory present so that training does not stop
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 89.6 gigabytes of available RAM

You are using a high-RAM runtime!


# **Packages needed:** <br>
`datasets`: loads and transforms the dataset <br>
`transformers`: provides the Wav2Vec2 model, tokenizer, feature extractor and the `Trainer` <br>
`librosa`: loads and resamples the audio files <br>
`jiwer`: **most important:** computes the WER metric

In [3]:
%%capture
!pip install datasets==1.18.3
!pip install transformers==4.23.1
!pip install jiwer
!pip install librosa
# ESPnet build step, not used in this notebook:
# %cd /content/espnet/tools
# !make CUDA_VERSION=10.2

In [4]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


# **Download the TORGO dataset on which the pretrained model will be fine-tuned**

In [5]:
%cd /content
!gdown 1FUls9tWqAPD9mggqzkaDtIhNYE0fJ1aF
!mkdir downloads
%cd downloads
!gdown 1hu3l5E8OY8jMHSIN2Bafg3cMw-icmlZM
!tar -xzf torgo.tar.gz && ls torgo

Streaming output truncated to the last 5000 lines.
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'

# **Connect to Hugging Face to store the results of the model**

In [6]:
%%capture
# to store the model checkpoints, we will need to install another package
!apt install git-lfs

# !!! Important: Run the cell above before everything else

# **Crucial stage: Preparation of data, Tokenizer and Feature Extractor**

ASR models transcribe speech to text, so they need both a feature extractor and a tokenizer: <br>
`feature extractor`: converts the speech signal into the model's input format (a feature vector) <br>
`tokenizer`: converts the model's output into text <br>
`Wav2Vec2` comes with the tokenizer `Wav2Vec2CTCTokenizer` and the feature extractor `Wav2Vec2FeatureExtractor`

# **Tokenizer**

In [7]:
# load the dataset, observe structure, divide into training and test set (evaluation later)
from datasets import load_dataset, load_metric, DatasetDict, Dataset, Audio

data = load_dataset('csv', data_files='/content/output.csv')
print(data)



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-e829bfabd4696e4c/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-e829bfabd4696e4c/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'session', 'text', 'audio'],
        num_rows: 5534
    })
})


## change speaker id here

In [8]:
# creating a train and testing dataset

torgo_dataset = data['train'].train_test_split(test_size=0.2)
#torgo_dataset

#torgo_dataset = DatasetDict()

#for speaker id = 1, as testing speaker
#torgo_dataset['train'] = data['train'].filter(lambda x: x != 1, input_columns=['speaker_id'])
#torgo_dataset['test'] = data['train'].filter(lambda x: x == 1, input_columns=['speaker_id'])




In [9]:
# remove columns that we do not need
torgo_dataset = torgo_dataset.remove_columns(["Unnamed: 0"])
torgo_dataset

DatasetDict({
    train: Dataset({
        features: ['session', 'text', 'audio'],
        num_rows: 4427
    })
    test: Dataset({
        features: ['session', 'text', 'audio'],
        num_rows: 1107
    })
})

In [10]:
# ignore special characters: without a language model they are hard to predict
# also convert all the text to lowercase, which makes things much easier
import re
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def remove_special_characters(batch):
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).lower() + " "
    return batch

In [11]:
# use map function to carry out the process/transformation
torgo_dataset = torgo_dataset.map(remove_special_characters)

0ex [00:00, ?ex/s]

0ex [00:00, ?ex/s]

In [12]:
# write a function that first concatenates all the transcriptions into one single transcription and then maps it to its set of characters
# In short: collect the distinct characters, which determines the size of the vocabulary

def extract_all_chars(batch):
  all_text = " ".join(batch["text"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

vocabs = torgo_dataset.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=torgo_dataset.column_names["train"])


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [13]:
# we create the union of all distinct letters in the training dataset and test dataset and convert the resulting list into 
# an enumerated dictionary

vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

{'d': 0,
 't': 1,
 '1': 2,
 'n': 3,
 'p': 4,
 'b': 5,
 'e': 6,
 'q': 7,
 'a': 8,
 'f': 9,
 "'": 10,
 'g': 11,
 'r': 12,
 'w': 13,
 'c': 14,
 'u': 15,
 ' ': 16,
 'o': 17,
 'k': 18,
 'v': 19,
 'h': 20,
 'y': 21,
 'l': 22,
 '3': 23,
 'i': 24,
 's': 25,
 'm': 26,
 'z': 27,
 'x': 28,
 'j': 29}

In [14]:
# make the space token visible by replacing it with the symbol (|)
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

In [15]:
# add an unknown token for characters outside the vocabulary and a padding token, which also serves as the CTC blank token
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
print(len(vocab_dict))

32


The linear layer that we add on top of the pretrained checkpoint will therefore have an output dimension of 32, one per vocabulary entry.
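To make the vocabulary steps concrete, here is a minimal, self-contained sketch on two made-up transcripts. The sentences and the `build_vocab` helper are illustrative assumptions and not part of the notebook. Unlike the notebook, the characters are sorted so that the id assignment is reproducible between runs.

```python
def build_vocab(transcripts):
    """Build a character-level CTC vocabulary from a list of transcripts."""
    # collect every distinct character; sorting makes the ids reproducible
    chars = sorted(set("".join(transcripts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # make the word boundary visible by renaming " " to "|"
    vocab["|"] = vocab.pop(" ")
    # append the unknown token and the padding token (also used as CTC blank)
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


toy_vocab = build_vocab(["the cat ", "a hat "])
# the size of the vocabulary is the output dimension of the CTC head
print(len(toy_vocab))
print(toy_vocab)
```

The same counting explains the value printed above: 30 characters from the transcripts plus `[UNK]` and `[PAD]` give 32.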

In [16]:
# save the vocabulary as a JSON file next
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [17]:
# instantiate an object of the tokenizer class
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [18]:
# upload tokenizer to the Hugging Face Repo
repo_name = "base-on-torgo2"

In [19]:
# push it to Hugging face to use it later
tokenizer.push_to_hub(repo_name)

CommitInfo(commit_url='https://huggingface.co/alexziweiwang/base-on-torgo2/commit/f726466ba73d928b39c8d7b82f4f2f8c7e1dc833', commit_message='Upload tokenizer', commit_description='', oid='f726466ba73d928b39c8d7b82f4f2f8c7e1dc833', pr_url=None, pr_revision=None, pr_num=None)

# **Feature Extractor**

Speech is a continuous signal, so before it can be converted to text it first has to be discretized into individual values. This step is called **sampling**.

A higher sampling rate leads to a better approximation of the real speech signal but also necessitates more values per second.
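As a small illustration of this trade-off, the sketch below samples the same one-second tone at two rates. The 440 Hz tone and the `sample_sine` helper are assumptions chosen for the example, not something taken from the TORGO data.

```python
import math


def sample_sine(freq_hz, duration_s, sampling_rate):
    """Sample a sine wave at evenly spaced time steps."""
    n_samples = int(duration_s * sampling_rate)
    return [math.sin(2 * math.pi * freq_hz * n / sampling_rate) for n in range(n_samples)]


low = sample_sine(440, 1.0, 8000)    # 8 kHz: 8000 values for one second
high = sample_sine(440, 1.0, 16000)  # 16 kHz: 16000 values for one second
print(len(low), len(high))
```

Doubling the sampling rate doubles the number of values the model has to process for the same second of audio; every second value of `high` coincides with a value of `low`.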

A Wav2Vec2 feature extractor object requires the following parameters to be instantiated:

- `feature_size`: Speech models take a sequence of feature vectors as an input. While the length of this sequence varies, the feature size should not. In the case of Wav2Vec2, the feature size is 1 because the model was trained on the raw speech signal.
- `sampling_rate`: The sampling rate on which the model was trained.
- `padding_value`: For batched inference, shorter inputs need to be padded with a specific value.
- `do_normalize`: Whether the input should be *zero-mean-unit-variance* normalized or not. Usually, speech models perform better when the input is normalized.
- `return_attention_mask`: Whether the model should make use of an `attention_mask` for batched inference. In general, models should **always** make use of the `attention_mask` to mask padded tokens. However, due to a very specific design choice of `Wav2Vec2`'s "base" checkpoint, better results are achieved when using no `attention_mask`.
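What `do_normalize` refers to can be sketched in a few lines of plain Python. This is only an illustration of zero-mean-unit-variance normalization; the helper name and the small `eps` constant are assumptions, and the feature extractor's own implementation may differ in detail.

```python
import math


def zero_mean_unit_variance(signal, eps=1e-7):
    """Shift a signal to mean 0 and scale it to (approximately) variance 1."""
    mean = sum(signal) / len(signal)
    variance = sum((x - mean) ** 2 for x in signal) / len(signal)
    # eps (an assumed constant) avoids a division by zero for silent signals
    return [(x - mean) / math.sqrt(variance + eps) for x in signal]


raw = [0.1, 0.3, -0.2, 0.4]  # made-up audio samples
normalized = zero_mean_unit_variance(raw)
print(normalized)
```

After normalization, recordings with different loudness levels end up on a comparable scale.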

In [20]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

In [21]:
# wrap the feature extractor and tokenizer into a single processor class: when testing will only need model and processor object
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# **Prepare Dataset**

In [22]:
torgo_dataset = torgo_dataset.cast_column("audio", Audio(sampling_rate=16000))

In [23]:
torgo_dataset["train"][5]["audio"]

{'path': '/content/downloads/Torgo/M01/Session1/wav_arrayMic/0040.wav',
 'array': array([ 0.0020752 , -0.00317383,  0.00119019, ...,  0.0005188 ,
         0.0010376 ,  0.00131226], dtype=float32),
 'sampling_rate': 16000}

In [24]:
# testing out sample audio files that have been loaded
import IPython.display as ipd
import numpy as np
import random

# random.randint includes both endpoints, so the upper bound must be len - 1
rand_int = random.randint(0, len(torgo_dataset["train"]) - 1)

#print(torgo_dataset["train"][rand_int]["text"])
#ipd.Audio(data=np.asarray(torgo_dataset["train"][rand_int]["audio"]["array"]), autoplay=True, rate=16000)

# **Processing the dataset into the format expected by the model**

1. load and resample the audio data by calling `batch["audio"]`
2. extract the `input_values` from the loaded audio file
3. encode the transcriptions to label ids

In [25]:
def prepare_dataset(batch):
    # load the audio data from the batch
    audio = batch["audio"]

    # extract the values from the audio files
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    
    # encode it to the label ids
    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch

In [26]:
torgo_dataset = torgo_dataset.map(prepare_dataset, remove_columns=torgo_dataset.column_names["train"], num_proc=4)

  "`as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your "


Long input sequences require a lot of memory. Since `Wav2Vec2` is based on self-attention, the memory requirement scales quadratically with the input length, so very long training samples are filtered out below.
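A rough feel for this quadratic growth can be obtained by counting the frames that reach the Transformer. The sketch below assumes the standard Wav2Vec2 convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2); these values are not stated in this notebook, so treat them as an assumption about the checkpoint.

```python
# assumed configuration of the convolutional feature encoder
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of feature frames produced for a raw waveform of num_samples values."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


for seconds in (1, 3, 9):
    frames = num_frames(seconds * 16000)
    # one attention matrix has frames x frames entries per head and layer
    print(seconds, frames, frames * frames)
```

Going from 1 second to 9 seconds multiplies the number of frames by roughly 9 but the size of each attention matrix by roughly 80, which is why an upper limit on the input length keeps memory usage manageable.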

In [27]:
torgo_dataset

DatasetDict({
    train: Dataset({
        features: ['input_values', 'input_length', 'labels'],
        num_rows: 4427
    })
    test: Dataset({
        features: ['input_values', 'input_length', 'labels'],
        num_rows: 1107
    })
})

### changed

In [28]:
max_input_length_in_sec = 9.0
min_input_length_in_sec= 1.0
torgo_dataset["train"] = torgo_dataset["train"].filter(lambda x: x < max_input_length_in_sec * processor.feature_extractor.sampling_rate, input_columns=["input_length"])
torgo_dataset["train"] = torgo_dataset["train"].filter(lambda x: x > min_input_length_in_sec * processor.feature_extractor.sampling_rate, input_columns=["input_length"])

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

# **Training and Evaluation**

**Need for a data collator** <br>
Wav2Vec2 has a much larger input length than output length. Given the large input sizes, it is much more efficient to pad each training batch dynamically, i.e. only to the longest sample in that batch rather than to the longest sample in the whole dataset.
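The idea of dynamic padding can be sketched without any library. The `pad_batch` helper and the toy values are assumptions for illustration; the notebook's collator delegates the actual padding to the processor.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real position, 0 marks a padded position
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# inputs and labels are padded separately because their lengths differ
inputs, _ = pad_batch([[0.1, 0.2, 0.3], [0.5]], pad_value=0.0)
labels, label_mask = pad_batch([[4, 7, 1], [2]], pad_value=0)
# replace padded label positions with -100 so that the loss ignores them
labels = [
    [tok if keep else -100 for tok, keep in zip(row, row_mask)]
    for row, row_mask in zip(labels, label_mask)
]
print(inputs)
print(labels)
```

Because the target length is computed per batch, a batch of short recordings is never padded up to the longest recording in the dataset.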

In [29]:
# data collator

import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [30]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

- Evaluation metric. During training, the model should be evaluated on the word error rate. We should define a `compute_metrics` function accordingly.

- Load a pretrained checkpoint. We need to load a pretrained checkpoint and configure it correctly for training.

- Define the training configuration.

After having fine-tuned the model, we will correctly evaluate it on the test data and verify that it has indeed learned to correctly transcribe speech.
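For intuition about the metric, here is a hand-rolled sketch of the word error rate: the word-level edit distance divided by the number of reference words. The notebook itself uses `load_metric("wer")`; the `word_error_rate` helper and the example sentences are assumptions for illustration, and the helper assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the cat sat"))
print(word_error_rate("the cat sat", "the hat sat"))
print(word_error_rate("yes", "yes it is so"))
```

Since insertions count as errors, a hypothesis with many extra words can push the WER above 1, which is one way values larger than 1 can show up in a training log.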

In [31]:
# load the word error rate metric
wer_metric = load_metric("wer")


Downloading:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

In [32]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

### loading the model

In [33]:
# assign the model
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "yongjian/wav2vec2-large-a",
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)


#model: yongjian/wav2vec2-large-a contains [self-training] and has best wer of 0.557

Downloading:   0%|          | 0.00/2.04k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

The first component of Wav2Vec2 is a stack of CNN layers that extracts acoustic features from the raw speech signal. This part has already been sufficiently trained during pretraining and does not need further fine-tuning, so it is frozen.

# **Define the parameters that are related to model training**


To give more explanation on some of the parameters:
- `group_by_length` makes training more efficient by grouping training samples of similar input length into one batch. This can significantly speed up training by heavily reducing the overall number of useless padding tokens that are passed through the model.
- `learning_rate` and `weight_decay` were heuristically tuned until fine-tuning became stable. Note that these parameters strongly depend on the dataset they were tuned on (Timit) and might be suboptimal for other speech datasets such as TORGO.
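The effect of grouping by length can be illustrated by counting padded positions. The sketch below simply sorts made-up lengths before batching; this is a simplification, since the sampler used by the `Trainer` groups samples of similar length in a randomized way rather than sorting the whole dataset.

```python
def padding_overhead(lengths, batch_size):
    """Count padded positions when consecutive samples are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # every sample is padded up to the longest sample of its batch
        total += sum(max(batch) - length for length in batch)
    return total


lengths = [2, 9, 3, 8, 1, 10, 2, 9]  # made-up input lengths
unsorted_padding = padding_overhead(lengths, batch_size=4)
grouped_padding = padding_overhead(sorted(lengths), batch_size=4)
print(unsorted_padding, grouped_padding)
```

With similar lengths batched together, far fewer padded positions are passed through the model.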

In [34]:
model.freeze_feature_encoder()

In [35]:
# clear out cuda memory
import torch
# torch.cuda.empty_cache()

### changed

In [36]:
# parameters for the training
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir=repo_name,
  group_by_length=True,
  per_device_train_batch_size=4,
  evaluation_strategy="steps",
  num_train_epochs=30,
  fp16=False,
  gradient_checkpointing=True,
  save_steps=500,
  eval_steps=500,
  logging_steps=500,
  learning_rate=1e-4,
  weight_decay=0.005,
  warmup_steps=1000,
  save_total_limit=2,
)

In [37]:
# pass all instances to the trainer as the final step before training
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=torgo_dataset["train"],
    eval_dataset=torgo_dataset["test"],
    tokenizer=processor.feature_extractor,
)

### action of training

In [38]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4106
  Num Epochs = 30
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 30810
  "`as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your "


| Step | Training Loss | Validation Loss | Wer |
|------|---------------|-----------------|-----|
| 500 | 27.1611 | 3.589469 | 1.0 |
| 1000 | 3.1244 | 3.065362 | 1.199427 |
| 1500 | 2.818 | 2.662825 | 1.296097 |
| 2000 | 2.5923 | 2.358784 | 1.280702 |
| 2500 | 2.3242 | 2.016878 | 1.271393 |
| 3000 | 1.9537 | 1.954355 | 1.255997 |
| 3500 | 1.683 | 1.698588 | 1.245614 |
| 4000 | 1.4922 | 1.522381 | 1.235947 |
| 4500 | 1.3044 | 1.396806 | 1.162549 |
| 5000 | 1.1734 | 1.210307 | 1.105979 |


The following columns in the evaluation set don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1107
  Batch size = 8
Saving model checkpoint to base-on-torgo2/checkpoint-500
Configuration saved in base-on-torgo2/checkpoint-500/config.json
Model weights saved in base-on-torgo2/checkpoint-500/pytorch_model.bin
Feature extractor saved in base-on-torgo2/checkpoint-500/preprocessor_config.json
  "`as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your "
The following columns in the evaluation set don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1107
  Batch s

TrainOutput(global_step=30810, training_loss=1.001694628063327, metrics={'train_runtime': 12234.3924, 'train_samples_per_second': 10.068, 'train_steps_per_second': 2.518, 'total_flos': 1.1513521671004778e+19, 'train_loss': 1.001694628063327, 'epoch': 30.0})
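The step count and throughput reported in the log can be checked with a little arithmetic. The sketch assumes a single device, no gradient accumulation and that the last, smaller batch of each epoch is kept; the numbers are copied from the log above.

```python
import math

num_examples = 4106        # "Num examples" in the training log
batch_size = 4             # per_device_train_batch_size
num_epochs = 30
train_runtime = 12234.3924  # seconds, from TrainOutput

# the last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

Both the total of 30810 optimization steps and the reported throughput follow from these three settings and the runtime.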

In [39]:
# push the trained model to the Hugging Face Hub
trainer.push_to_hub()

OSError: ignored

# **Evaluate**

In [None]:
processor = Wav2Vec2Processor.from_pretrained("alexziweiwang/wav2vec2-base-torgo")

# **Running my own experiments to learn about Wav2Vec2 and Transformers**

In [None]:
#!pip install transformers

In [None]:
# other libraries that are needed
#import librosa
#import torch 
#from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

In [None]:
# the processor bundles the feature extractor (audio -> input values) and the tokenizer (ids -> text);
# Wav2Vec2CTCTokenizer alone only handles text, so raw audio cannot be passed to it
#base_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
#model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

In [None]:
# previously loaded another audio file in .wav format
# now testing with torgo dataset file
#audio_speech, rate = librosa.load('/content/downloads/Torgo/M03/Session2/wav_headMic/0191.wav', sr=16000)
#audio_speech

In [None]:
#import IPython.display as display
#display.Audio('/content/downloads/Torgo/M03/Session2/wav_headMic/0191.wav')

In [None]:
# pass the audio to the processor and get it back in PyTorch format
#input_values = base_processor(audio_speech, sampling_rate=16000, return_tensors='pt').input_values

# get the logits: non-normalized predicted values
#logits = model(input_values).logits

In [None]:
# take the most likely token at every time step (argmax over the vocabulary dimension)
#prediction = torch.argmax(logits, dim=-1)

In [None]:
# decode the predicted ids of the first (and only) sample into text
#transcription = base_processor.decode(prediction[0])
#print(transcription)
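The last two steps, taking the argmax of the logits and decoding the ids, amount to greedy CTC decoding. The sketch below spells this out on made-up logits with a tiny hypothetical vocabulary; the `ctc_greedy_decode` helper is an illustration and not the tokenizer's actual implementation.

```python
def ctc_greedy_decode(logits, id_to_token, blank_id):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    # 1. pick the most likely token id at every time step
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. collapse consecutive repeats and 3. drop blank tokens
    tokens = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    # 4. turn the word delimiter back into a space
    return "".join(tokens).replace("|", " ").strip()


toy_id_to_token = {0: "[PAD]", 1: "h", 2: "i", 3: "|"}  # hypothetical vocabulary
toy_logits = [
    [0.1, 2.0, 0.0, 0.0],  # h
    [0.1, 1.5, 0.2, 0.0],  # h again: collapsed into the previous h
    [3.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.1, 2.5, 0.0],  # i
    [0.0, 0.0, 2.0, 0.3],  # i again: collapsed
    [2.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 1.0, 0.2],  # i after a blank: kept as a second i
    [0.0, 0.0, 0.0, 1.0],  # word delimiter
]
decoded = ctc_greedy_decode(toy_logits, toy_id_to_token, blank_id=0)
print(decoded)
```

The blank token is what allows a genuinely doubled letter to survive the collapsing step, which is why the padding token doubles as the blank in the vocabulary.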