<a href="https://colab.research.google.com/github/cahya-wirawan/indonesian-speech-recognition/blob/main/Weights_%26_Biases_Hugging_Face_XLSR_Fine_tune_Week.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Get the most out of W&B!

A demo of how to get the most out of Weights and Biases (W&B) during the Hugging Face XLSR fine-tuning week.

### "WANDB" Headings
To see where W&B is used in this notebook, search (Ctrl+F) for "**wandb**". Every code cell that uses a W&B feature sits under a heading containing that word.

### Weights and Biases Signup
If you don't have a Weights and Biases account, you can sign up for a personal account or start a 1-month trial of a company Team account here: https://wandb.ai/site/pricing

### 100GB Free
Each Weights and Biases user gets 100GB of free storage, so you can log and version your models, datasets, tokenizers, processors, etc. while testing. Once you are happy with your final model, you can upload it to the Hugging Face Model Hub to share with the world!

### Resources

-  [W&B Hugging Face integration docs](https://docs.wandb.ai/integrations/huggingface)

### Credit
This notebook demos W&B functionality with XLSR training code taken from `@m3hrdadfi`'s notebook [here](https://colab.research.google.com/github/m3hrdadfi/notebooks/blob/main/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers_ipynb.ipynb)





## Setup

In [None]:
!nvidia-smi

Sun Mar 21 15:49:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Using [this notebook](https://colab.research.google.com/drive/1D6krVG0PPJR2Je9g5eN_2h6JP73_NUXz) to get a P100

In [None]:
!df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          69G   39G   30G  57% /
tmpfs            64M     0   64M   0% /dev
tmpfs            13G     0   13G   0% /sys/fs/cgroup
shm              13G     0   13G   0% /dev/shm
tmpfs            13G   24K   13G   1% /var/colab
/dev/sda1        75G   40G   35G  54% /opt/bin
tmpfs            13G     0   13G   0% /proc/acpi
tmpfs            13G     0   13G   0% /proc/scsi
tmpfs            13G     0   13G   0% /sys/firmware


In [None]:
%env LC_ALL=C.UTF-8
%env LANG=C.UTF-8
%env TRANSFORMERS_CACHE=/content/cache
%env HF_DATASETS_CACHE=/content/cache
%env CUDA_LAUNCH_BLOCKING=1

env: LC_ALL=C.UTF-8
env: LANG=C.UTF-8
env: TRANSFORMERS_CACHE=/content/cache
env: HF_DATASETS_CACHE=/content/cache
env: CUDA_LAUNCH_BLOCKING=1


## WANDB: Install wandb and Latest transformers `WandbCallback` code

In [None]:
%%capture
!pip install git+https://github.com/huggingface/datasets.git
# !pip install git+https://github.com/huggingface/transformers.git  # Install transformers from PR, see below
!pip install torchaudio
!pip install librosa
!pip install jiwer
!pip install wandb 

#### Get latest WandbCallback features
A number of updates to the `WandbCallback` are waiting to be merged in `transformers` [PR 10826](https://github.com/huggingface/transformers/pull/10826), so we pip install `transformers` from that PR to make sure we have the latest `WandbCallback` features.

In [None]:
%%capture
!pip install git+https://github.com/huggingface/transformers.git@refs/pull/10826/head

Collecting git+https://github.com/huggingface/transformers.git@refs/pull/10826/head
  Cloning https://github.com/huggingface/transformers.git (to revision refs/pull/10826/head) to /tmp/pip-req-build-ngpm6za1
  Running command git clone -q https://github.com/huggingface/transformers.git /tmp/pip-req-build-ngpm6za1
  Running command git fetch -q https://github.com/huggingface/transformers.git refs/pull/10826/head
  Running command git checkout -q 36507a6554f72cda55da6ed50d0130a50e7534bd
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-4.5.0.dev0-cp37-none-any.whl size=1972541 sha256=384555afb292d45fa0f8a08a92a8e9e43e135d7457c5d1322d1b4c6f76eaa3d4
  Stored in directory: /tmp/pip-ephem-wheel-cache-6gjtv6ot/w

## WANDB: Login to W&B and Set Env Variables

The `WandbCallback` in Hugging Face `transformers` reads a number of settings, such as your username and project name, from environment variables. We'll set these now.

If you want to log to your own personal project instead of the public W&B project used here, change `WANDB_ENTITY` to your W&B username and `WANDB_PROJECT` to whatever project name you like.
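
Outside a notebook, where the `%env` magic is not available, the same variables can be set from plain Python. The sketch below is only an illustration: `configure_wandb_env` is a hypothetical helper (not part of `wandb` or `transformers`), and the entity and project values are placeholders. It writes the variable names used in this notebook into a mapping, which defaults to `os.environ`.

```python
import os


def configure_wandb_env(entity, project, log_model=True, watch=None, env=None):
    """Write the W&B settings used in this notebook into `env` (default: os.environ)."""
    target = os.environ if env is None else env
    target["WANDB_ENTITY"] = entity
    target["WANDB_PROJECT"] = project
    # Environment variables are strings, so the flag is stored as "true"/"false"
    target["WANDB_LOG_MODEL"] = "true" if log_model else "false"
    if watch is not None:
        # e.g. "false" to disable gradient logging, as in the commented-out line of the next cell
        target["WANDB_WATCH"] = watch
    return target


# Placeholder values; a plain dict is passed so the real environment stays untouched
settings = configure_wandb_env("my-username", "xlsr-demo", env={})
print(settings)
```

Calling the helper without `env` would modify `os.environ` directly, which is what the callback reads at training time.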

In [None]:
import os
import wandb

# W&B company account
%env WANDB_ENTITY = wandb
entity = os.environ["WANDB_ENTITY"]

# Choose the public W&B project
%env WANDB_PROJECT = xlsr
project_name = os.environ["WANDB_PROJECT"]

# Log your trained model to W&B as an Artifact
%env WANDB_LOG_MODEL = true 

# # Disable logging of gradients to speed things up a little
# %env WANDB_WATCH = false   

env: WANDB_ENTITY=wandb
env: WANDB_PROJECT=xlsr
env: WANDB_LOG_MODEL=true


In [None]:
# Login with your own wandb token, sign-up for an account at www.wandb.ai if you don't have one
wandb.login() #YOURTOKEN 

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize


wandb: Paste an API key from your profile and hit enter: ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## Data

In [None]:
!rm -rf /content/cv-corpus-6.1-2020-12-11
!wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz

!tar -xzf tr.tar.gz
!rm -rf tr.tar.gz

--2021-03-21 15:49:59--  https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz
Resolving voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com (voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com)... 52.218.249.66
Connecting to voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com (voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com)|52.218.249.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 620848700 (592M) [application/octet-stream]
Saving to: ‘tr.tar.gz’


2021-03-21 15:50:11 (49.4 MB/s) - ‘tr.tar.gz’ saved [620848700/620848700]



In [None]:
from datasets import load_dataset, load_metric

import pandas as pd
import numpy as np

from tqdm import tqdm

import os
import string
import six
import re

In [None]:
# from datasets import load_dataset, load_metric

# common_voice_train = load_dataset("common_voice", "tr", split="train+validation")
# common_voice_test = load_dataset("common_voice", "tr", split="test")

In [None]:
abs_path_to_data = os.path.join("/content", "cv-corpus-6.1-2020-12-11", "tr")
!ls {abs_path_to_data}/*.tsv

/content/cv-corpus-6.1-2020-12-11/tr/dev.tsv
/content/cv-corpus-6.1-2020-12-11/tr/invalidated.tsv
/content/cv-corpus-6.1-2020-12-11/tr/other.tsv
/content/cv-corpus-6.1-2020-12-11/tr/reported.tsv
/content/cv-corpus-6.1-2020-12-11/tr/test.tsv
/content/cv-corpus-6.1-2020-12-11/tr/train.tsv
/content/cv-corpus-6.1-2020-12-11/tr/validated.tsv


In [None]:
def normalizer(text):
    # Use your custom normalizer
    return text
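
The `normalizer` above is a pass-through placeholder. As one possible sketch of a custom normalizer for Turkish text, the hypothetical `turkish_normalizer` below handles a casing pitfall: Python's `str.lower()` maps `I` to `i` and `İ` to `i` plus a combining dot, whereas Turkish expects `ı` and `i`. The function name and the whitespace handling are our own assumptions, not something the original training code does.

```python
import re
import unicodedata


def turkish_normalizer(text: str) -> str:
    """Hypothetical normalizer: Turkish-aware lowercasing plus whitespace cleanup."""
    # Compose characters so that "İ" is a single code point before replacing it
    text = unicodedata.normalize("NFC", text)
    # Map the dotted/dotless capital I explicitly, then lowercase everything else
    text = text.replace("İ", "i").replace("I", "ı").lower()
    # Collapse runs of whitespace into one space (a trailing space is kept)
    return re.sub(r"\s+", " ", text)


print(turkish_normalizer("IŞIK  İstanbul"))
```

To try it, the body of `normalizer` could return `turkish_normalizer(text)` instead of `text`.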

In [None]:
train_df = pd.concat([
    pd.read_csv(f"{abs_path_to_data}/train.tsv", sep="\t"),
    pd.read_csv(f"{abs_path_to_data}/dev.tsv", sep="\t")
])
_train_df = train_df.copy()
total_records = len(train_df)
train_df["id"] = range(0, total_records)
print(f"Step 0: {len(train_df)}")

train_df["path"] = abs_path_to_data + "/clips/" + train_df["path"]
train_df["status"] = train_df["path"].apply(lambda path: True if os.path.exists(path) else None)
train_df = train_df.dropna(subset=["status"])  # drop rows whose audio file is missing
train_df = train_df.drop(columns=["status"])
print(f"Step 1: {len(train_df)}")

train_df["sentence"] = train_df["sentence"].apply(lambda t: normalizer(t))
train_df = train_df.dropna(subset=["sentence"])
print(f"Step 2: {len(train_df)}")

term_a = set(list(range(0, total_records)))
term_b = set(train_df["id"].values.tolist())
removed_items_train = [_train_df.iloc[index]["path"] for index in list(term_a - term_b)]
train_df = train_df.reset_index(drop=True)
# train_df.head()

Step 0: 3478
Step 1: 3478
Step 2: 3478


In [None]:
print(f"Items to be removed {len(removed_items_train)}")

Items to be removed 0


In [None]:
test_df = pd.read_csv(f"{abs_path_to_data}/test.tsv", sep="\t")

_test_df = test_df.copy()
total_records = len(test_df)
test_df["id"] = range(0, total_records)
print(f"Step 0: {len(test_df)}")

test_df["path"] = abs_path_to_data + "/clips/" + test_df["path"]
test_df["status"] = test_df["path"].apply(lambda path: True if os.path.exists(path) else None)
test_df = test_df.dropna(subset=["status"])  # drop rows whose audio file is missing
test_df = test_df.drop(columns=["status"])
print(f"Step 1: {len(test_df)}")

test_df["sentence"] = test_df["sentence"].apply(lambda t: normalizer(t))
test_df = test_df.dropna(subset=["sentence"])
print(f"Step 2: {len(test_df)}")

term_a = set(list(range(0, total_records)))
term_b = set(test_df["id"].values.tolist())
removed_items_test = [_test_df.iloc[index]["path"] for index in list(term_a - term_b)]
test_df = test_df.reset_index(drop=True)
# test_df.head()

Step 0: 1647
Step 1: 1647
Step 2: 1647


In [None]:
print(f"Items to be removed {len(removed_items_test)}")

Items to be removed 0


In [None]:
removed_items = removed_items_train + removed_items_test

for path in removed_items:
    if os.path.exists(path):
        os.remove(path)

In [None]:
text = " ".join(train_df["sentence"].values.tolist() + test_df["sentence"].values.tolist())
vocab = list(sorted(set(text)))

# print(len(vocab), vocab)

In [None]:
!rm /content/train.csv
!rm /content/test.csv

train_df.to_csv("/content/train.csv", sep="\t", encoding="utf-8", index=False)
test_df.to_csv("/content/test.csv", sep="\t", encoding="utf-8", index=False)

print(train_df.shape)
print(test_df.shape)

rm: cannot remove '/content/train.csv': No such file or directory
rm: cannot remove '/content/test.csv': No such file or directory
(3478, 11)
(1647, 11)


## Create HF Datasets

In [None]:
common_voice_train = load_dataset("csv", data_files={"train": "/content/train.csv"}, delimiter="\t")["train"]
common_voice_test = load_dataset("csv", data_files={"test": "/content/test.csv"}, delimiter="\t")["test"]

print(common_voice_train)
print(common_voice_test)

Using custom data configuration default-a24d5dcf04463d3f


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /content/cache/csv/default-a24d5dcf04463d3f/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Dataset csv downloaded and prepared to /content/cache/csv/default-a24d5dcf04463d3f/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.


Using custom data configuration default-166353eb148492a6


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /content/cache/csv/default-166353eb148492a6/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Dataset csv downloaded and prepared to /content/cache/csv/default-166353eb148492a6/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.
Dataset({
    features: ['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'id'],
    num_rows: 3478
})
Dataset({
    features: ['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'id'],
    num_rows: 1647
})


## WANDB: Let's keep the metadata here
We commented out the `.remove_columns` calls in the cell below, as the metadata will be useful for our EDA later.

In [None]:
# common_voice_train = common_voice_train.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "segment", "up_votes"])
# common_voice_test = common_voice_test.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "segment", "up_votes"])

## Clean Up Data

In [None]:
import re
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'  # raw string avoids invalid-escape warnings

def remove_special_characters(batch):
    text = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    text = normalizer(text)
    batch["text"] = text
    return batch
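
To see what this cleaning step does to a single sentence, here is a small standalone sketch. `clean_sentence` is a hypothetical helper that applies the same regex, lowercasing and trailing space as `remove_special_characters`, but skips the `normalizer` call; the replacement character is written as `\ufffd` so the snippet stays copy-paste safe.

```python
import re

# Same character class as chars_to_ignore_regex, with U+FFFD spelled as an escape
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\ufffd]'


def clean_sentence(sentence: str) -> str:
    """Strip the ignored punctuation, lowercase, and append the trailing space."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower() + " "


print(repr(clean_sentence("Merhaba, dünya!")))
print(repr(clean_sentence("Nasılsın?")))
```

The trailing space matters later on: it guarantees that the space character appears in the vocabulary before it is swapped for the `|` word delimiter.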

In [None]:
common_voice_train = common_voice_train.map(remove_special_characters, remove_columns=["sentence"])
common_voice_test = common_voice_test.map(remove_special_characters, remove_columns=["sentence"])

HBox(children=(FloatProgress(value=0.0, max=3478.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1647.0), HTML(value='')))




In [None]:
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
vocab_train = common_voice_train.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=common_voice_train.column_names)
vocab_test = common_voice_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=common_voice_test.column_names)

HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))




In [None]:
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

39
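
The vocabulary steps above can be sketched on toy data without `datasets`. `build_ctc_vocab` is a hypothetical helper; unlike the cell above it sorts the characters, an assumption we add so that token ids come out the same on every run (iterating over a `set` gives no such guarantee).

```python
def build_ctc_vocab(texts):
    """Build a character vocabulary with '|' as word delimiter plus [UNK] and [PAD]."""
    # Joining with a space ensures " " is in the vocabulary when there are 2+ texts
    chars = sorted(set(" ".join(texts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Swap the space for the visible word delimiter token
    vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


vocab = build_ctc_vocab(["ab ", "ba c "])
print(vocab)
```

Note that the helper raises a `KeyError` if no space occurs anywhere in the input, which mirrors the behaviour of `del vocab_dict[" "]` in the notebook.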

### Save Vocab

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

### Save CSVs

In [None]:
!mkdir -p /content/dataset

In [None]:
trainset = []

for item in tqdm(common_voice_train, position=0, total=len(common_voice_train)):
    features = common_voice_train.features
    data = {}
    for key in features:
        data[key] = item[key]
    
    trainset.append(data)

trainset = pd.DataFrame(trainset)
trainset.to_csv("/content/dataset/train.csv", sep="\t")


testset = []

for item in tqdm(common_voice_test, position=0, total=len(common_voice_test)):
    features = common_voice_test.features
    data = {}
    for key in features:
        data[key] = item[key]
    
    testset.append(data)

testset = pd.DataFrame(testset)
testset.to_csv("/content/dataset/test.csv", sep="\t")

100%|██████████| 3478/3478 [00:00<00:00, 12691.69it/s]
100%|██████████| 1647/1647 [00:00<00:00, 13195.13it/s]


## Load Tokenizer, Processor and FeatureExtractors

In [None]:
# save_dir = "/content/gdrive/MyDrive/wav2vec2-large-xlsr-turkish"
save_dir = "/content/wav2vec2-large-xlsr-turkish"
!ls {save_dir}

ls: cannot access '/content/wav2vec2-large-xlsr-turkish': No such file or directory


In [None]:
import transformers

In [None]:
import os
from transformers.trainer_utils import get_last_checkpoint

last_checkpoint = None

if os.path.exists(save_dir):
    last_checkpoint = get_last_checkpoint(save_dir)
    
print(last_checkpoint if last_checkpoint else 0)

ModuleNotFoundError: ignored

In [None]:
from transformers import Wav2Vec2CTCTokenizer

if not os.path.exists(save_dir):
    print("NotExist")
    tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
else:
    tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(save_dir)

In [None]:
from transformers import Wav2Vec2FeatureExtractor

if not os.path.exists(save_dir):
    print("NotExist")
    feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)
else:
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(save_dir)

In [None]:
from transformers import Wav2Vec2Processor

if not os.path.exists(save_dir):
    print("NotExist")
    processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
else:
    processor = Wav2Vec2Processor.from_pretrained(save_dir)

In [None]:
if not os.path.exists(save_dir):
    print("NotExist")
    processor.save_pretrained(save_dir)
    print("Saved!")

## Preprocess Data

In [None]:
common_voice_train[0]

In [None]:
# def prepare_dataset(batch):
#     # check that all files have the correct sampling rate
#     assert (
#         len(set(batch["sampling_rate"])) == 1
#     ), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

#     batch["input_values"] = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0]).input_values
    
#     with processor.as_target_processor():
#         batch["labels"] = processor(batch["target_text"]).input_ids
#     return batch

In [None]:
# common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names, batch_size=4, num_proc=4, batched=True)
# common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names, batch_size=4, num_proc=4, batched=True)

## WANDB Artifacts: Log Data to a wandb Audio Table for Exploration

### WANDB: Create a wandb Table object to put our data in

In [None]:
# Column names for the table we will save to Artifacts
columns = ['speech', 'transcription', 'duration', 'mean_loudness', 
           'max_loudness', 'gender', 'age', 'downvotes', 'accent', 
           'sampling_rate', 'filename']

# Create table object
wandb_table = wandb.Table(columns=columns)

### Add data to wandb Table

We'll fill up our table row by row, before uploading the entire table to W&B Artifacts. A row is just a list of the objects or data that we would like to log for each column.

We'll add all our metadata, plus additional data about the "loudness" of the audio, to help with our EDA.
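
Because a row is a plain list, its values have to appear in the same order as the table's columns. The sketch below illustrates that idea with a hypothetical `make_row` helper and made-up sample values; it does not call W&B itself.

```python
# A shortened, illustrative column list (the real table has more columns)
COLUMNS = ["transcription", "duration", "gender", "filename"]


def make_row(record, columns):
    """Order the values of `record` to match `columns`, failing loudly on gaps."""
    missing = [name for name in columns if name not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")
    return [record[name] for name in columns]


# Made-up example record; the key order does not matter
record = {"filename": "clip_0001.mp3", "gender": "female",
          "transcription": "merhaba ", "duration": 2.5}
row = make_row(record, COLUMNS)
print(row)
# With a real table, the row would then be appended via wandb_table.add_data(*row)
```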

In [None]:
import librosa
import torchaudio

In [None]:
def get_loudness_stats(sa, sr):
  # Return mean and max loudness given a speech array and sample rate
  # Credit: https://stackoverflow.com/questions/64913424/how-to-compute-loudness-from-audio-signal
  # Compute the spectrogram (magnitude)
  n_fft = 2048
  hop_length = 1024
  spec_mag = abs(librosa.stft(sa, n_fft=n_fft, hop_length=hop_length))

  # Convert the spectrogram into dB
  spec_db = librosa.amplitude_to_db(spec_mag)

  # Compute A-weighting values
  freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
  a_weights = librosa.A_weighting(freqs)
  a_weights = np.expand_dims(a_weights, axis=1)

  # Apply the A-weighting to the spectrogram in dB
  spec_dba = spec_db + a_weights

  # Compute the "loudness" value
  loudness = librosa.feature.rms(S=librosa.db_to_amplitude(spec_dba))

  return np.mean(loudness[0]), np.max(loudness[0])

In [None]:
def log_row_to_table(ndx=None, wandb_table=None, ds=None, verbose=True):
  # Grab each item of interest to log
  sampling_rate = 16_000
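  speech_array, orig_sr = torchaudio.load(ds["path"][ndx])
  sa = speech_array[0].numpy()
  # Resample from the file's native rate (48 kHz for Common Voice clips) to 16 kHz
  sa = librosa.resample(np.asarray(sa), orig_sr=orig_sr, target_sr=sampling_rate)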

  # Index into the rest of the metadata we'll be logging
  duration = librosa.get_duration(y=sa, sr=sampling_rate) 
  text = ds['text'][ndx]
  gender = ds['gender'][ndx]
  fn = ds['path'][ndx].split('/')[-1] 
  age = ds['age'][ndx]
  downvotes = ds['down_votes'][ndx]
  accent = ds['accent'][ndx]

  # Example of additional calculated audio stats
  mean_loudness, max_loudness = get_loudness_stats(sa, sampling_rate)

  # Create a Wandb Audio object to log the speech array too
  raw_audio = wandb.Audio(data_or_path=sa, sample_rate=sampling_rate, caption=fn)

  # Create 1 row for our table with all of the objects we wish to log
  row = [raw_audio, text, duration, mean_loudness, max_loudness, gender, 
         age, downvotes, accent, sampling_rate, fn]

  # Add our row to the wandb table
  wandb_table.add_data(*row)

  if verbose: 
    if ndx % 100 == 0: print(ndx)

  return wandb_table

#### Select a Random Subset of the training data to explore

In [None]:
N_RAND=1000
rand_ndxs = np.random.randint(0, len(common_voice_train), N_RAND)
ds = common_voice_train.select(rand_ndxs)

# Log to table, row by row
for ndx in range(len(ds)):
  wandb_table = log_row_to_table(ndx=ndx, wandb_table=wandb_table, ds=ds, verbose=True)



0
100
200
300
400
500
600
700
800
900
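
One thing to be aware of: `np.random.randint` draws with replacement, so the 1000 indices above can contain repeats and the table may hold duplicate clips. If distinct clips are wanted, sampling without replacement is an option. The sketch below uses the standard library's `random` module rather than NumPy purely to stay dependency-free; `sample_indices` and its `seed` argument are hypothetical.

```python
import random


def sample_indices(n_items, n_samples, seed=None):
    """Return sorted, distinct indices drawn without replacement."""
    rng = random.Random(seed)
    # Never ask for more samples than there are items
    k = min(n_samples, n_items)
    return sorted(rng.sample(range(n_items), k))


rand_ndxs = sample_indices(3478, 1000, seed=0)
print(len(rand_ndxs), len(set(rand_ndxs)))
```

The resulting list can be passed to `common_voice_train.select(...)` in the same way as the array above.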


### Add Table to wandb Artifact

In [None]:
# `type` can be set to whatever makes sense for you
audio_ds_artifact = wandb.Artifact(name="common-voice-tr-train", type="audio-file")

# Add the table to the artifact
audio_ds_artifact.add(wandb_table, "train_samples")

<ManifestEntry digest: crjHOKTp3qW2NY4+mAqFhg==>

### Log the Artifact to a wandb Run

In [None]:
# 1. Create a wandb Run
# Providing a `name` and `job_type` is optional, but it helps to organise your wandb project
audio_ds_run = wandb.init(name='train_dataset_logging', job_type='dataset_logging',
                          project=project_name, entity=entity, reinit=True)

# 2. Log the artifact to the Run
audio_ds_run.log_artifact(audio_ds_artifact)

# 3. Finish the run
audio_ds_run.finish()

VBox(children=(Label(value=' 104.54MB of 104.54MB uploaded (26.07MB deduped)\r'), FloatProgress(value=1.0, max…

## Create CommonVoiceDataset

In [None]:
import torchaudio
import librosa

import numpy as np
import pandas as pd

import torch  # needed for torch.is_tensor in __getitem__
from torch.utils.data import Dataset, DataLoader
import os


class CommonVoiceDataset(Dataset):

    def __init__(self, csv_file, root_dir, processor, column_names=None, sep="\t"):
        self.data = pd.read_csv(os.path.join(root_dir, csv_file), sep=sep)
        self.processor = processor
        self.column_names = column_names

    def __len__(self):
        return len(self.data)


    def speech_file_to_array_fn(self, batch):
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        batch["speech"] = speech_array[0].numpy()
        batch["sampling_rate"] = sampling_rate
        batch["target_text"] = batch["text"]
        return batch

    
    def resample(self, batch):
        batch["speech"] = librosa.resample(np.asarray(batch["speech"]), orig_sr=batch["sampling_rate"], target_sr=16_000)
        batch["sampling_rate"] = 16_000
        return batch

    
    def prepare_dataset(self, batch, column_names=None):
        batch["input_values"] = self.processor(batch["speech"], sampling_rate=batch["sampling_rate"]).input_values[0].tolist()

        with self.processor.as_target_processor():
            batch["labels"] = self.processor(batch["target_text"]).input_ids

        if column_names and isinstance(column_names, list):
            batch = {name: batch[name] for name in column_names}
        
        return batch


    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        batch = self.data.iloc[idx].copy()
        batch = batch.to_dict()
        batch = self.speech_file_to_array_fn(batch)
        batch = self.resample(batch)
        batch = self.prepare_dataset(batch, self.column_names)

        return batch 

## Create DataCollator

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch
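
The label handling in the collator can be pictured without any tensors. In this minimal sketch, which assumes plain Python lists instead of the processor's padded batch, label sequences are padded to the longest one in the batch and the padded positions are filled with `-100`, the value the collator uses so that padding is ignored by the loss. `pad_labels` is a hypothetical helper, not part of `transformers`.

```python
def pad_labels(label_ids, pad_value=-100):
    """Right-pad each label sequence to the longest one using `pad_value`."""
    max_len = max(len(ids) for ids in label_ids)
    return [ids + [pad_value] * (max_len - len(ids)) for ids in label_ids]


padded = pad_labels([[5, 2, 9], [7]])
# True marks real label positions, False marks padding
mask = [[value != -100 for value in row] for row in padded]
print(padded)
print(mask)
```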

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

In [None]:
train_dataset = CommonVoiceDataset("train.csv", "/content/dataset/", processor=processor, column_names=["input_values", "labels"])
test_dataset = CommonVoiceDataset("test.csv", "/content/dataset/", processor=processor, column_names=["input_values", "labels"])

In [None]:
train_dataset[0].keys()

dict_keys(['input_values', 'labels'])

In [None]:
print(len(train_dataset))
print(len(test_dataset))

3478
1647


In [None]:
for batch in train_dataset:
    print(batch.keys())
    print(type(batch))
    # print(batch)
    break

dict_keys(['input_values', 'labels'])
<class 'dict'>


In [None]:
for batch in test_dataset:
    print(batch.keys())
    break

dict_keys(['input_values', 'labels'])


In [None]:
wer_metric = load_metric("wer")

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
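
For intuition about the number being reported, word error rate is the word-level edit distance between prediction and reference, divided by the number of reference words. The sketch below is a small pure-Python illustration under that definition; it is not claimed to be how `load_metric("wer")` is implemented internally, and `edit_distance` and `word_error_rate` are hypothetical helpers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, 1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, 1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words (assumes non-empty references)."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        total += len(ref_words)
    return errors / total


print(word_error_rate(["bugün hava çok güzel"], ["bugün hava güzel"]))
```

Lower is better, which is why the training arguments later in this notebook set `greater_is_better=False` for this metric.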

## Load XLSR Model

In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53" if not last_checkpoint else last_checkpoint, 
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True, 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=processor.tokenizer.vocab_size
)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
print(len(processor.tokenizer))
print(processor.tokenizer.vocab_size)

39
39


In [None]:
model.freeze_feature_extractor()

In a final step, we define all parameters related to training. 
To give more explanation on some of the parameters:
- `group_by_length` makes training more efficient by grouping training samples of similar input length into one batch. This can significantly speed up training by heavily reducing the overall number of useless padding tokens that are passed through the model.
- `learning_rate` and `weight_decay` were heuristically tuned until fine-tuning became stable. Note that those parameters strongly depend on the Common Voice dataset and might be suboptimal for other speech datasets.

For more explanations on other parameters, one can take a look at the [docs](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer#trainingarguments).

**Note**: To save the trained models to your Google Drive, use the commented-out `output_dir` instead.
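
To illustrate why grouping by length reduces padding, the toy sketch below counts how many padded positions are needed when every sample in a batch is padded to the batch's longest one. It is a deliberate simplification: batches are formed from a fully sorted list, which is not how the `Trainer` samples when `group_by_length` is enabled, and `padding_waste` and the example lengths are made up.

```python
def padding_waste(lengths, batch_size):
    """Count padded positions when each batch is padded to its longest sample."""
    waste = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        waste += sum(max(batch) - n for n in batch)
    return waste


# Made-up input lengths, alternating short and long samples
lengths = [10, 90, 20, 80, 30, 70, 40, 60]
mixed_waste = padding_waste(lengths, batch_size=2)
grouped_waste = padding_waste(sorted(lengths), batch_size=2)
print(mixed_waste, grouped_waste)
```

On this toy input, batches of similar length need a fifth of the padding that the mixed batches do.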

## WANDB: Add wandb Training Arguments

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    # output_dir="/content/gdrive/MyDrive/wav2vec2-large-xlsr-turkish-demo",
    # output_dir="./wav2vec2-large-xlsr-turkish-demo",
    output_dir=save_dir,
    group_by_length=False,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="epoch",
    num_train_epochs=5, # Just for demo, change it
    fp16=True,
    # save_steps=20, # Just for demo, change it
    eval_steps=100, # Just for demo, change it; only used when evaluation_strategy="steps"
    logging_steps=10, # Just for demo, change it
    learning_rate=3e-4,
    warmup_steps=20, # Just for demo, change it
    save_total_limit=2,
    # WANDB LOGGING: 
    report_to = 'wandb',  # enable logging to W&B
    run_name = 'tr-base-5e',   # Name your run, optional
    load_best_model_at_end = True,  # This will ensure your best model will be uploaded to W&B
    metric_for_best_model='wer',    # Load best model based on "wer", not eval loss
    greater_is_better=False,
  )
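
As a quick worked calculation, the batch settings above combine as follows for the 3478 training examples in this notebook. The sketch assumes a single GPU and that the last, partially filled batch is kept; the rounding follows our understanding of how the `Trainer` counts optimizer updates and should be treated as an assumption. `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Estimate effective batch size and number of optimizer updates."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return {
        "effective_batch_size": per_device_batch_size * num_devices * grad_accum_steps,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * num_epochs,
    }


schedule = training_schedule(3478, per_device_batch_size=16,
                             grad_accum_steps=2, num_epochs=5)
print(schedule)
```

Under these assumptions, `warmup_steps=20` covers only a small part of the run, and `logging_steps=10` gives a few dozen logged points in the W&B charts.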

## Create CommonVoiceTrainer
Now, all instances can be passed to Trainer and we are ready to start training!

In [None]:
import collections.abc
from typing import Optional

import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data.sampler import RandomSampler, SequentialSampler
from transformers import Trainer
from transformers.trainer import (
    DistributedSamplerWithLoop,
    SequentialDistributedSampler,
)
from transformers.training_args import ParallelMode


class CommonVoiceTrainer(Trainer):

    def _get_train_sampler(self):
        if isinstance(self.train_dataset, torch.utils.data.IterableDataset) or not isinstance(
            self.train_dataset, collections.abc.Sized
        ):
            return None 
        
        if self.args.world_size <= 1:
            return RandomSampler(self.train_dataset)
        elif self.args.parallel_mode == ParallelMode.TPU and not self.args.dataloader_drop_last:
            # Use a loop for TPUs when drop_last is False to have all batches have the same size.
            return DistributedSamplerWithLoop(
                self.train_dataset,
                batch_size=self.args.per_device_train_batch_size,
                num_replicas=self.args.world_size,
                rank=self.args.process_index,
            )
        else:
            return DistributedSampler(
                self.train_dataset, num_replicas=self.args.world_size, rank=self.args.process_index
            )
    
    def get_train_dataloader(self):
        if self.train_dataset is None:
            raise ValueError("Trainer: training requires a train_dataset.")
        train_sampler = self._get_train_sampler()

        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            sampler=train_sampler,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
        )
    
    def _get_eval_sampler(self, eval_dataset):
        if self.args.local_rank != -1:
            return SequentialDistributedSampler(eval_dataset)
        else:
            return SequentialSampler(eval_dataset)


    def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None):
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        eval_sampler = self._get_eval_sampler(eval_dataset)

        return DataLoader(
            eval_dataset,
            sampler=eval_sampler,
            batch_size=self.args.eval_batch_size,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
        )

In [None]:
# from transformers import Trainer

# trainer = Trainer(
#     model=model,
#     data_collator=data_collator,
#     args=training_args,
#     compute_metrics=compute_metrics,
#     train_dataset=common_voice_train,
#     eval_dataset=common_voice_test,
#     tokenizer=processor.feature_extractor,
# )

In [None]:
trainer = CommonVoiceTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=processor.feature_extractor,
)

```javascript
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);
```

## WANDB: Train and then finish your wandb Run when training is complete
`wandb.finish()` only needs to be called explicitly when running in a notebook; in a script, the run is closed automatically when the process exits.

In [None]:
if last_checkpoint:
    print(f"last_checkpoint: {last_checkpoint}")
    train_result = trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    train_result = trainer.train()

wandb.finish()  # Finish the W&B run once training is done



| Epoch | Training Loss | Validation Loss | Wer | Runtime | Samples Per Second |
|---|---|---|---|---|---|
| 1 | 3.1367 | 3.184334 | 1.0 | 471.2849 | 3.495 |
| 2 | 3.1061 | 3.117155 | 1.0 | 471.3891 | 3.494 |
| 3 | 2.7635 | 2.454963 | 0.999489 | 470.3249 | 3.502 |
| 4 | 1.3115 | 0.920909 | 0.884077 | 473.5656 | 3.478 |
| 5 | 0.872 | 0.716227 | 0.789092 | 468.9443 | 3.512 |
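
With `metric_for_best_model='wer'` and `greater_is_better=False`, the checkpoint with the lowest WER is the one treated as best. A minimal sketch of that selection rule, using the evaluation numbers from the log above (the helper `best_epoch` is illustrative and not part of `transformers`):

```python
# (epoch, eval_loss, wer) rows copied from the training log above.
eval_log = [
    (1, 3.184334, 1.0),
    (2, 3.117155, 1.0),
    (3, 2.454963, 0.999489),
    (4, 0.920909, 0.884077),
    (5, 0.716227, 0.789092),
]


def best_epoch(rows, greater_is_better=False):
    """Return the row with the best WER under the given direction."""
    pick = max if greater_is_better else min
    return pick(rows, key=lambda row: row[2])


epoch, loss, wer = best_epoch(eval_log)
print(f"best epoch: {epoch} (WER {wer:.3f})")  # best epoch: 5 (WER 0.789)
```

Leaving `greater_is_better` at `True` for an error metric would select the worst epoch instead, which is why the flag is set explicitly.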



| Metric | Value |
|---|---|
| train/loss | 0.872 |
| train/learning_rate | 0.0 |
| train/epoch | 5.0 |
| _runtime | 8845.0 |
| _timestamp | 1616283168.0 |
| _step | 545.0 |
| eval/loss | 0.71623 |
| eval/wer | 0.78909 |
| eval/runtime | 468.9443 |
| eval/samples_per_second | 3.512 |


| Metric | History |
|---|---|
| train/loss | █▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁ |
| train/learning_rate | ██▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁ |
| train/epoch | ▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███ |
| _runtime | ▁▁▁▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███ |
| _timestamp | ▁▁▁▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███ |
| _step | ▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇███ |
| eval/loss | ██▆▂▁ |
| eval/wer | ███▄▁ |
| eval/runtime | ▅▅▃█▁ |
| eval/samples_per_second | ▄▄▆▁█ |
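
The training cell above resumes from `last_checkpoint` when one is set. As an illustration of how such a value could be found, here is a minimal standard-library sketch that picks the highest-numbered `checkpoint-<step>` folder in an output directory. The helper name `find_last_checkpoint` is hypothetical; recent `transformers` versions ship a similar helper, `get_last_checkpoint` in `transformers.trainer_utils`, which is likely the better choice in practice.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Usage on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 500, 90):
        os.mkdir(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(find_last_checkpoint(tmp)))  # checkpoint-500
```

Comparing the step numbers as integers matters here: a plain string sort would rank `checkpoint-90` above `checkpoint-500`.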


## WANDB: Download and use your logged model

In [None]:
# Create a run object in your project
run = wandb.init(project=project_name, entity=entity)

# Connect an Artifact to your run
my_model_artifact = run.use_artifact('run-tr-base-5e:v0')

# Download model weights to a folder and return the path
model_dir = my_model_artifact.download()

# Load your Hugging Face model from that folder
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(model_dir)

wandb: Downloading large artifact run-tr-base-5e:v0, 1203.63MB. 4 files... 

## Save Model locally

In [None]:
metrics = train_result.metrics
metrics["train_samples"] = len(train_dataset)

trainer.save_model()

trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

## Test a pretrained Wav2Vec2 model

In [None]:
model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-turkish-demo").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-turkish-demo")

# model = Wav2Vec2ForCTC.from_pretrained(save_dir).to("cuda")
# processor = Wav2Vec2Processor.from_pretrained(save_dir)


Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.


Now we take a random example from the test set, run it through the model, and take the `argmax(...)` of the logits to retrieve the predicted token ids.

In [None]:
test_dataset = CommonVoiceDataset("test.csv", "/content/dataset/", processor=processor, column_names=None)
print(test_dataset[10].keys())

dict_keys(['Unnamed: 0', 'accent', 'age', 'client_id', 'down_votes', 'gender', 'id', 'locale', 'path', 'segment', 'text', 'up_votes', 'speech', 'sampling_rate', 'target_text', 'input_values', 'labels'])


In [None]:
input_values = []
labels = []

test_loader = DataLoader(test_dataset, batch_size=10, collate_fn=data_collator)
for data in tqdm(test_loader, total=len(test_loader)):
    data_input_values = data["input_values"]
    data_labels = data["labels"]

    input_values.extend([data_input_values[i] for i in range(data_input_values.shape[0])])
    labels.extend([data_labels[i] for i in range(data_labels.shape[0])])

    # break

itest_loader = {"input_values": input_values, "labels": labels}

100%|██████████| 165/165 [05:23<00:00,  1.96s/it]


In [None]:
assert len(itest_loader["input_values"]) == len(itest_loader["labels"])

In [None]:
assert len(itest_loader["input_values"]) == len(test_dataset)

In [None]:
idx = np.random.randint(0, len(test_dataset))
print(f"idx {idx}")

print(f"TEXT: {test_dataset[idx]['text']}")
print(f"INPUT: {itest_loader['input_values'][idx][:5]}")

idx 1126
TEXT: projenin iki bin yedi yılında tamamlanması bekleniyor 
INPUT: tensor([0.0007, 0.0007, 0.0007, 0.0007, 0.0007])


In [None]:
# Pass the sampling rate explicitly so a mismatch raises an error instead of failing silently
# (assumes the audio was resampled to 16 kHz during preprocessing)
input_dict = processor(
    itest_loader["input_values"][idx], sampling_rate=16_000, return_tensors="pt", padding=True
)

# Inference only, so skip gradient tracking
with torch.no_grad():
    logits = model(input_dict.input_values.to("cuda")).logits

pred_ids = torch.argmax(logits, dim=-1)[0]
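
The `argmax` yields one token id per audio frame. Turning those ids into text follows the greedy CTC rule: collapse consecutive repeats, then drop the blank token. The sketch below illustrates that rule on a hypothetical toy vocabulary; it is not the implementation behind `processor.decode`, and the ids, the blank id of 0, and the `|` word delimiter are assumptions for the example.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame ids, remove blanks, and join tokens into text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real one comes from the tokenizer's vocab file.
vocab = {0: "<pad>", 1: "|", 2: "s", 3: "a", 4: "t", 5: "i", 6: "k"}

# Frame-level ids for "iki saat"; the blank between the two "a" frames keeps both letters.
frame_ids = [5, 5, 0, 6, 5, 1, 1, 2, 3, 3, 0, 3, 4, 4]
print(ctc_greedy_decode(frame_ids, vocab))  # iki saat
```

Without the blank token, a genuinely doubled letter would be indistinguishable from one letter held across several frames.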


In the original tutorial, the processed `common_voice_test` no longer contained the original sentence, so the raw dataset had to be reloaded to get the reference text. Here `test_dataset` still exposes the `text` column, so the reload below is left commented out.

In [None]:
# common_voice_test_transcription = load_dataset("common_voice", "tr", data_dir="./cv-corpus-6.1-2020-12-11", split="test")

Finally, we can decode the example.

In [None]:
import IPython.display as ipd
import librosa
import numpy as np
import torchaudio

sample = test_dataset[idx]

print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(sample["text"].lower())

# torchaudio.load returns (waveform, sample_rate)
speech, orig_sr = torchaudio.load(sample["path"])
speech = speech.numpy().squeeze()

# Resample to the 16 kHz the model expects, then play the clip
speech = librosa.resample(np.asarray(speech), orig_sr=orig_sr, target_sr=16_000)
ipd.Audio(data=speech, autoplay=True, rate=16_000)

Prediction:
projenin iki bin yedi yılında tamamlanması bekleniyor

Reference:
projenin iki bin yedi yılında tamamlanması bekleniyor 


Alright! For this sample the prediction matches the reference, but a single clip says little about overall quality, and the model is still far from perfect. Training longer, spending more time on data preprocessing, and especially using a language model for decoding would likely improve overall performance.

For a demonstration model on a low-resource language, however, the results are acceptable 🤗.
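
To put a number on how close a prediction is to its reference, word error rate (WER) divides the word-level edit distance by the number of reference words. The following is a minimal pure-Python sketch of that definition, not the `compute_metrics` function used during training; the second prediction is a made-up example with one substituted word.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


reference = "projenin iki bin yedi yılında tamamlanması bekleniyor"
print(word_error_rate(reference, reference))  # 0.0

# Hypothetical prediction with one wrong word out of seven
print(word_error_rate(reference, "projenin iki bin yedi yılında tamamlanması bekleniyo"))
```

Because insertions count as errors too, WER can exceed 1.0 when a prediction contains many extra words.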