# Artist and Band Classification from Their Songs with Wav2Vec 2.0

**Wav2Vec 2.0** is a pretrained model for Automatic Speech Recognition (ASR) that was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after the superior performance of Wav2Vec2 was demonstrated on the English ASR dataset LibriSpeech, *Facebook AI* presented [XLSR-Wav2Vec2](https://arxiv.org/abs/2006.13979). XLSR stands for *cross-lingual speech representations* and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages.

Like Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of unlabeled speech in more than 50 languages. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xlsr_wav2vec2.png)

The authors show for the first time that massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. See Tables 1-5 of the official [paper](https://arxiv.org/pdf/2006.13979.pdf).

During the fine-tuning week hosted by HuggingFace, more than 300 people participated in fine-tuning the pretrained XLSR-Wav2Vec2 on low-resource ASR datasets for more than 50 languages. For ASR, the model is fine-tuned using [Connectionist Temporal Classification](https://distill.pub/2017/ctc/) (CTC), an algorithm used to train neural networks for sequence-to-sequence problems, mainly in Automatic Speech Recognition and handwriting recognition. Follow this [notebook](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers.ipynb#scrollTo=Gx9OdDYrCtQ1) for more information about XLSR-Wav2Vec2 fine-tuning.

This model has shown significant results in many low-resource languages. You can see the [competition board](https://paperswithcode.com/dataset/common-voice) or even test the models from the [HuggingFace hub](https://huggingface.co/models?filter=xlsr-fine-tuning-week).


In this notebook, we will go through how to use a pretrained Wav2Vec2 model to recognize the artist or band behind a short song clip (the same approach works for any audio classification problem). Before going any further, we need to install some handy packages and define some environment values.

In [None]:
!pip install -U -q accelerate
!pip install -U -q transformers
!pip install -U -q datasets
!pip install -U -q torchaudio
!pip install -U -q librosa
!pip install -U -q jiwer

In [None]:
gpu_info = !nvidia-smi
gpu_info = "\n".join(gpu_info)

if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU")
else:
    print(gpu_info)

Sat Sep 14 10:11:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0              45W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import shutil

shutil.copytree("/content/drive/My Drive/audio_dataset", "./audio_dataset")

'./audio_dataset'

# Dataset description

This dataset comprises a collection of short audio clips ($\le$ 10 seconds) extracted from songs by diverse artists including Taylor Swift, Leonard Cohen, Red Hot Chili Peppers, and Imagine Dragons. The clips are stored in a single folder, ready for use in audio classification tasks.

The dataset is available [here](https://drive.google.com/drive/folders/1WkK5wDBMgYQXprBUhlq5xZrBs87OXWhZ?usp=share_link).

![](https://drive.google.com/uc?export=view&id=1d1cK0vffdHEoKkxYbBUG7a3JXkQJMRsB)

In [None]:
import pandas as pd

df = pd.read_excel("audio_dataset/description.xlsx")

df.head()

Unnamed: 0,source,author
0,audio_dataset/Leonard Cohen/chunk_84.mp3,Leonard Cohen
1,audio_dataset/Imagine Dragons/chunk_364.mp3,Imagine Dragons
2,audio_dataset/Imagine Dragons/chunk_248.mp3,Imagine Dragons
3,audio_dataset/Imagine Dragons/chunk_253.mp3,Imagine Dragons
4,audio_dataset/Taylor Swift/chunk_446.mp3,Taylor Swift


Let's look at how many labels the dataset contains and how they are distributed.

In [None]:
df.author.value_counts()

author,count
Leonard Cohen,672
Taylor Swift,666
Imagine Dragons,657
Red Hot Chili Peppers,474


In [None]:
unique_authors = df.author.unique()
author_to_label = {author: i for i, author in enumerate(unique_authors)}
label_to_author = {i: author for author, i in author_to_label.items()}
df.author = df.author.map(author_to_label)

## Prepare Data for Training

For this particular example, we need to create Hugging Face train, test, and validation datasets. Also, we need to resample the audio data because wav2vec2 works with data at 16 kHz.
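As a rough illustration of what resampling to 16 kHz means for clip length, the sketch below computes the number of samples in a 10-second clip before and after resampling. It assumes the clip is first decoded at 22,050 Hz, which is the default rate of `librosa.load` when no `sr` argument is passed. `resampled_length` is a hypothetical helper written for this illustration, not a librosa function.

```python
import math

SAMPLING_RATE = 16_000        # target rate expected by wav2vec2
LIBROSA_DEFAULT_SR = 22_050   # assumption: librosa.load's default when sr is not passed


def resampled_length(n_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate number of samples after resampling (rounded up)."""
    return math.ceil(n_samples * target_sr / orig_sr)


clip_seconds = 10
n_orig = clip_seconds * LIBROSA_DEFAULT_SR
n_target = resampled_length(n_orig, LIBROSA_DEFAULT_SR, SAMPLING_RATE)
print(n_orig, n_target)
```

So a full-length clip reaches the model as a sequence of 160,000 values, which is why padding and batching matter later on.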

In [None]:
import librosa
import datasets
from sklearn.model_selection import train_test_split


SAMPLING_RATE = 16_000

# Load an audio file and resample it to the target sampling rate
def load_audio(filepath, target_sr=SAMPLING_RATE):
    # Without an explicit `sr`, librosa.load decodes to mono at 22,050 Hz
    audio_data, orig_sr = librosa.load(filepath)

    # Resample if needed
    if orig_sr != target_sr:
        audio_data = librosa.resample(audio_data, orig_sr=orig_sr, target_sr=target_sr)

    return audio_data

# Convert one DataFrame row into a model-ready example
def create_dataset_dict(row):
    return {
        "input_values": load_audio(row["source"]).tolist(),
        "labels": row["author"]
    }

columns_to_remove = ["source", "author"]
dataset_dict = datasets.Dataset.from_pandas(df).map(create_dataset_dict).remove_columns(columns_to_remove)

seed = 42
dataset_dict = dataset_dict.class_encode_column("labels")
train_temp = dataset_dict.train_test_split(test_size=0.2, stratify_by_column="labels", seed=seed)
test_valid = train_temp["test"].train_test_split(test_size=0.5, stratify_by_column="labels", seed=seed)

train_dataset = train_temp["train"]
test_dataset = test_valid["train"]
valid_dataset = test_valid["test"]

pd.Series({
    "train": len(train_dataset),
    "test": len(test_dataset),
    "valid": len(valid_dataset)},
    name="size"
)


split,size
train,1975
test,247
valid,247
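The split sizes above can be reproduced with a little arithmetic. This is a sketch under one assumption: that each split rounds the held-out fraction up and keeps the remainder, which is consistent with the sizes reported above. The class counts are the ones printed by `value_counts()` earlier.

```python
import math

# Class counts taken from the value_counts() output earlier in the notebook.
class_counts = {
    "Leonard Cohen": 672,
    "Taylor Swift": 666,
    "Imagine Dragons": 657,
    "Red Hot Chili Peppers": 474,
}
n_total = sum(class_counts.values())

# Assumption: the held-out fraction is rounded up and the remainder is kept.
n_holdout = math.ceil(n_total * 0.2)   # first split: 20% held out
n_train = n_total - n_holdout
n_valid = math.ceil(n_holdout * 0.5)   # second split: half of the held-out part
n_test = n_holdout - n_valid

print({"train": n_train, "test": n_test, "valid": n_valid})
```

The result is an 80/10/10 split, with stratification keeping the artist proportions roughly equal across the three parts.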


To preprocess the audio for our classification model, we need to set up the relevant Wav2Vec2 assets for the pretrained checkpoint, in this case `facebook/wav2vec2-large-lv60`. To handle the contextual representations of audio of any length, we use a merge strategy (pooling mode) that reduces the 3D hidden states (batch, time, hidden size) to a 2D representation (batch, hidden size) by aggregating over the time axis.

There are three merge strategies: `mean`, `sum`, and `max`. In this example we will use `max`.
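To make the three strategies concrete, here is a minimal sketch using NumPy on a toy array. The values and shapes are made up for illustration; in the real model the same reduction is applied to the transformer's hidden states along the time axis.

```python
import numpy as np

# Toy hidden states with shape (batch=1, time=3, hidden=2).
hidden_states = np.array([[[1.0, 4.0],
                           [3.0, 0.0],
                           [2.0, 2.0]]])

# Each strategy collapses the time axis (axis=1), leaving (batch, hidden).
pooled = {
    "mean": hidden_states.mean(axis=1),
    "sum": hidden_states.sum(axis=1),
    "max": hidden_states.max(axis=1),
}

for mode, values in pooled.items():
    print(mode, values.shape, values.tolist())
```

Whatever the clip length, the pooled vector has a fixed size, so a single classification head can be applied to every clip.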

In [None]:
from transformers import AutoConfig, Wav2Vec2Processor, Wav2Vec2FeatureExtractor

model_name = "facebook/wav2vec2-large-lv60"
pooling_mode = "max"
problem_type = "single_label_classification"

config = AutoConfig.from_pretrained(
    model_name,
    num_labels=len(label_to_author),
    label2id=author_to_label,
    id2label=label_to_author,
    finetuning_task="wav2vec2_clf",
)
setattr(config, "pooling_mode", pooling_mode)
setattr(config, "problem_type", problem_type)


In [None]:
processor = Wav2Vec2Processor.from_pretrained(model_name)




## Model

Before diving into the training part, we need to build our classification model based on the merge strategy.

In [None]:
from dataclasses import dataclass
from typing import Optional, Tuple
import torch
from transformers.file_utils import ModelOutput


@dataclass
class ArtistClassifierOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None

In [None]:
import torch
import torch.nn as nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2PreTrainedModel,
    Wav2Vec2Model
)


class Wav2Vec2ClassificationHead(nn.Module):
    """Head for wav2vec classification task."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class Wav2Vec2ForArtistClassification(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.pooling_mode = config.pooling_mode
        self.config = config
        self.loss_fct = CrossEntropyLoss()
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = Wav2Vec2ClassificationHead(config)

        self.init_weights()

    def freeze_feature_extractor(self):
        self.wav2vec2.feature_extractor._freeze_parameters()

    def _merged_strategy(
            self,
            hidden_states,
            mode="mean"
    ):
        if mode == "mean":
            outputs = torch.mean(hidden_states, dim=1)
        elif mode == "sum":
            outputs = torch.sum(hidden_states, dim=1)
        elif mode == "max":
            outputs = torch.max(hidden_states, dim=1)[0]
        else:
            raise Exception(
                "The pooling method hasn't been defined! Your pooling mode must be one of these ['mean', 'sum', 'max']")

        return outputs

    def forward(
            self,
            input_values,
            attention_mask=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            labels=None,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.wav2vec2(
            input_values,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = outputs[0]
        hidden_states = self._merged_strategy(hidden_states, mode=self.pooling_mode)
        logits = self.classifier(hidden_states)
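        # Only compute the loss when labels are given, so inference without labels works
        loss = None
        if labels is not None:
            loss = self.loss_fct(logits.view(-1, self.num_labels), labels.view(-1))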

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return ArtistClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )


## Training

The data is processed so that we are ready to start setting up the training pipeline. We will make use of 🤗's [Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer) for which we essentially need to do the following:

- Define a data collator. In contrast to most NLP models, Wav2Vec2 operates on very long inputs: a 10-second clip sampled at 16 kHz is a sequence of 160,000 values, while the output is a single class label. Given the large input sizes, it is much more efficient to pad the training batches dynamically, meaning that all training samples should only be padded to the longest sample in their batch and not the overall longest sample. Therefore, fine-tuning Wav2Vec2 requires a special padding data collator, which we will define below.

- Evaluation metric. During training, the model should be evaluated on classification accuracy and F1-score. We should define a `compute_metrics` function accordingly.

- Load a pretrained checkpoint. We need to load a pretrained checkpoint and configure it correctly for training.

- Define the training configuration.

After fine-tuning the model, we will evaluate it on held-out data and verify that it has indeed learned to correctly classify audio.
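The benefit of dynamic padding mentioned in the first bullet above can be seen in a small pure-Python sketch. The clip lengths are hypothetical and tiny for readability, and `pad_batch` is an illustrative helper, not the collator used in this notebook.

```python
from typing import List


def pad_batch(batch: List[List[float]], pad_value: float = 0.0) -> List[List[float]]:
    """Pad every sequence to the length of the longest sequence in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (longest - len(seq)) for seq in batch]


# Hypothetical clip lengths in samples (not taken from the dataset).
lengths = [4, 5, 9, 10]
clips = [[1.0] * n for n in lengths]
batches = [clips[:2], clips[2:]]  # two batches of two clips each

# Total number of values the model has to process under each scheme.
dynamic_total = sum(len(seq) for b in batches for seq in pad_batch(b))
global_total = len(clips) * max(lengths)
print(dynamic_total, global_total)
```

Padding per batch processes 30 values here instead of 40, and the saving grows with the spread of clip lengths.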

### Set-up Trainer

Let's start by defining the data collator. The code for the data collator was copied from [this example](https://github.com/huggingface/transformers/blob/9a06b6b11bdfc42eea08fa91d0c737d1863c99e3/examples/research_projects/wav2vec2/run_asr.py#L81).

Without going into too many details, in contrast to the common data collators, this data collator treats the `input_values` and `labels` differently. The `input_values` are padded to the longest sequence in the batch with the processor's `pad` method, while the `labels` are single class indices that need no padding and are simply stacked into a tensor. This is necessary because in speech, input and output are of different modalities, meaning that they should not be treated by the same padding function.
Unlike the CTC version of this collator, no `-100` masking of label padding is needed here, because every example has exactly one label.

In [None]:
import torch
import transformers
from transformers import Wav2Vec2Processor
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [feature["labels"] for feature in features]

        d_type = torch.long if isinstance(label_features[0], int) else torch.float

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt"
        )

        batch["labels"] = torch.tensor(label_features, dtype=d_type)

        return batch

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

Next, the evaluation metrics are defined. We will use **accuracy** and the weighted **F1-score**. You can define other metrics on your own.

In [None]:
import numpy as np
from transformers import EvalPrediction
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(p: EvalPrediction):
    labels = p.label_ids
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)

    return {"accuracy": acc, "f1": f1}
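To see what the weighted F1-score measures, the sketch below recomputes it by hand on a toy example: F1 is computed per class and then averaged with weights proportional to the number of true examples in each class. The `weighted_f1` and `accuracy` helpers are illustrative re-implementations, and the labels and predictions are made up.

```python
from collections import Counter
from typing import Sequence


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def weighted_f1(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(labels)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(labels)


labels = [0, 0, 0, 1]
preds = [0, 0, 1, 1]
print(accuracy(labels, preds), weighted_f1(labels, preds))
```

Because the classes in this dataset are not perfectly balanced (474 to 672 clips per artist), the weighted average gives larger classes proportionally more influence on the final score.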

Now, we can load the pretrained Wav2Vec2 checkpoint (`facebook/wav2vec2-large-lv60`) into our classification model with a pooling strategy.

In [None]:
model = Wav2Vec2ForArtistClassification.from_pretrained(
    model_name,
    config=config
)


Some weights of Wav2Vec2ForArtistClassification were not initialized from the model checkpoint at facebook/wav2vec2-large-lv60 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The first component of Wav2Vec2 consists of a stack of CNN layers that are used to extract acoustically meaningful - but contextually independent - features from the raw speech signal. This part of the model has already been sufficiently trained during pretraining and, as stated in the [paper](https://arxiv.org/pdf/2006.13979.pdf), does not need to be fine-tuned anymore.
Thus, we can set `requires_grad` to `False` for all parameters of the *feature extraction* part.

In [None]:
model.freeze_feature_extractor()
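To get a feel for what the frozen CNN stack produces, the sketch below estimates how many feature frames come out of it for a 10-second clip. It assumes the kernel sizes and strides of the default Wav2Vec2 configuration. For the checkpoint actually in use, the values in `config.conv_kernel` and `config.conv_stride` are the ones to check. `num_feature_frames` is a hypothetical helper.

```python
# Assumption: kernel sizes and strides of the default Wav2Vec2 configuration;
# compare with config.conv_kernel and config.conv_stride for your checkpoint.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)
SAMPLING_RATE = 16_000


def num_feature_frames(n_samples: int) -> int:
    """Output length of a stack of unpadded 1D convolutions."""
    length = n_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


frames_1s = num_feature_frames(1 * SAMPLING_RATE)
frames_10s = num_feature_frames(10 * SAMPLING_RATE)
print(frames_1s, frames_10s)
```

Under these assumptions each frame covers about 20 ms of audio, and the pooling step later collapses all of those frames into a single vector per clip.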

In [None]:
from transformers import TrainingArguments

batch_size = 32
logging_steps = len(train_dataset) // batch_size

training_args = TrainingArguments(
    output_dir="/content/wav2vec2-large-lv60-artists-classification",
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=2,
    evaluation_strategy="epoch",
    num_train_epochs=20,
    fp16=True,
    logging_steps=logging_steps,
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    weight_decay=1e-4,
    save_total_limit=2,
    optim="adamw_torch",
    seed=42,
    metric_for_best_model="accuracy"
)
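The number of optimizer steps implied by these arguments can be worked out by hand. The sketch assumes a single GPU and that a trailing incomplete gradient-accumulation group does not add an extra step. Under those assumptions the total matches the `global_step=620` reported in the training output, and `checkpoint-155`, which is loaded later, corresponds to the end of epoch 5.

```python
import math

n_train = 1975                  # training set size from the split above
per_device_batch_size = 32
gradient_accumulation_steps = 2
num_train_epochs = 20

batches_per_epoch = math.ceil(n_train / per_device_batch_size)
# Assumption: a trailing incomplete accumulation group does not add a step.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Epoch at which the checkpoint named "checkpoint-155" was written.
checkpoint_epoch = 155 // updates_per_epoch

print(batches_per_epoch, updates_per_epoch, total_updates,
      effective_batch_size, checkpoint_epoch)
```

Each weight update therefore sees 64 clips, and one epoch is 31 updates.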



Now, all instances can be passed to Trainer and we are ready to start training!

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=processor.feature_extractor,
)



### Training

Training takes about 60 minutes on an A100 GPU in Google Colab.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,1.1371,0.742151,0.663968,0.590686
2,0.4694,0.316651,0.906883,0.904698
3,0.2538,0.218979,0.935223,0.934533
4,0.1725,0.164575,0.951417,0.950947
5,0.0904,0.049821,0.991903,0.991903
6,0.0809,0.049125,0.991903,0.991903
7,0.0323,0.073546,0.979757,0.979729
8,0.0218,0.134952,0.967611,0.967549
9,0.0172,0.083407,0.983806,0.983802
10,0.0115,0.047801,0.987854,0.987786



TrainOutput(global_step=620, training_loss=0.11489533725074462, metrics={'train_runtime': 3829.5186, 'train_samples_per_second': 10.315, 'train_steps_per_second': 0.162, 'total_flos': 1.20013925664e+19, 'train_loss': 0.11489533725074462, 'epoch': 20.0})

The model reaches about 99% accuracy on the test data by epoch 5. That's great!

Let's see how our best model performs on the validation data.

In [None]:
device = torch.device("cuda")
model_checkpoint = '/content/wav2vec2-large-lv60-artists-classification/checkpoint-155'
valid_model = Wav2Vec2ForArtistClassification.from_pretrained(model_checkpoint).to(device)

trainer.model = valid_model
results = trainer.evaluate(valid_dataset)

In [None]:
pd.Series(results, name="results")

metric,results
eval_loss,0.068997
eval_accuracy,0.979757
eval_f1,0.979781
eval_runtime,22.8008
eval_samples_per_second,10.833
eval_steps_per_second,1.36
epoch,20.0


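Good, but the model appears to be slightly overfitted: accuracy on the validation data (about 98.0%) is below the best accuracy on the test data (about 99.2%).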
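To put that gap in perspective, the accuracies can be converted back into clip counts. The numbers below are copied from the training log and the evaluation results above. The only assumption is that both held-out splits contain 247 clips, as reported by the split sizes.

```python
n_eval = 247  # size of both the test and validation splits

best_test_accuracy = 0.991903   # epochs 5 and 6 in the training log
valid_accuracy = 0.979757       # eval_accuracy on the validation split

# Convert accuracies into the number of misclassified clips.
test_errors = round(n_eval * (1 - best_test_accuracy))
valid_errors = round(n_eval * (1 - valid_accuracy))
print(test_errors, valid_errors)
```

The difference amounts to three additional misclassified clips out of 247, so the gap is small, though it is measured on a small sample.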