<a href="https://colab.research.google.com/github/VicDc/Uruz01/blob/main/7004_Live_Session_8_Build_a_Music_Genre_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Music Genre Classifier

Original Notebook available [here](https://huggingface.co/learn/audio-course/chapter4/introduction)

## Introduction

Transformers are widely used in audio and speech processing.

In this notebook, we will explore how to use audio transformers in the audio classification task of associating a musical genre like 'pop' or 'rock' to a song. This is a crucial task for music streaming platforms like Spotify. These platforms leverage such classification to recommend songs that align with the user's preferences.

What you will learn:

* Identify suitable pre-trained models for audio classification tasks
* Use the 🤗 Datasets library and the Hugging Face Hub to access and select audio classification datasets
* Fine-tune a pre-existing model for effective classification of songs based on genre.


##  Pre-trained models and datasets for audio classification

The Hugging Face Hub offers pre-trained models for audio classification. Using the *pipeline()* function makes it easy to switch between models without code changes, facilitating quick experimentation.

For audio classification, encoder-only models are preferred due to their efficiency. They map the input audio sequence to a sequence of hidden-state representations, and a classification layer on top produces the final class label. Decoder-only models add unnecessary complexity for this task, because they are built to generate a sequence of outputs rather than a single class label.
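
As a rough illustration of the encoder-plus-classification-layer design, the sketch below uses random NumPy arrays in place of a real encoder's output. The shapes, the mean-pooling step, and the names `hidden_states`, `head_weights`, and `head_bias` are illustrative assumptions, not the internals of any particular checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder output: 50 time frames, each a 16-dimensional hidden state.
num_frames, hidden_size, num_classes = 50, 16, 10
hidden_states = rng.normal(size=(num_frames, hidden_size))

# Pool over time so a clip of any length becomes one fixed-size vector.
pooled = hidden_states.mean(axis=0)

# Classification layer: a single linear map from the pooled vector to class scores.
head_weights = rng.normal(size=(hidden_size, num_classes))
head_bias = np.zeros(num_classes)
logits = pooled @ head_weights + head_bias

predicted_class = int(np.argmax(logits))
print(logits.shape, predicted_class)
```

The point of the sketch is that the encoder does the heavy lifting once per clip, and the head only needs to turn one pooled vector into one score per class.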

Let's explore popular pre-trained models for **zero-shot audio classification**.



### Zero-shot Audio Classification

In regular audio classification, pre-trained models may struggle if the labels in the task are different from what they learned before. Zero-shot audio classification helps with this using a model called [CLAP](https://huggingface.co/docs/transformers/model_doc/clap). CLAP looks at both the sound and some text and figures out how similar they are. We can use this to classify new sounds, even if they're not in the original set. Just give the model a sound and a few possible labels, and it will tell you which label is the best match.
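
The matching step described above can be sketched with plain NumPy. This is a simplified illustration, not CLAP itself: the embeddings are hand-written toy vectors, and the `scale` value is an arbitrary assumption standing in for whatever scaling a real model applies before the softmax.

```python
import numpy as np

def cosine_similarity(vector, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    vector = vector / np.linalg.norm(vector)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ vector

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical embeddings; a real model would compute these from audio and text.
toy_audio_embedding = np.array([0.9, 0.1, 0.0])
toy_label_embeddings = np.array([
    [1.0, 0.0, 0.0],   # "Sound of a dog"
    [0.0, 1.0, 0.0],   # "Sound of vacuum cleaner"
])
toy_labels = ["Sound of a dog", "Sound of vacuum cleaner"]

# Scale the similarities (arbitrary value here) and turn them into probabilities.
scale = 10.0
toy_scores = softmax(scale * cosine_similarity(toy_audio_embedding, toy_label_embeddings))
best_label = toy_labels[int(np.argmax(toy_scores))]
print(best_label, toy_scores)
```

Because the scores come from a softmax over the candidate labels only, they always sum to 1, even when none of the labels describes the sound well.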

In [None]:
!pip install transformers[torch] datasets accelerate


In [None]:
from datasets import load_dataset

from transformers import pipeline

We can explore an audio example from the [Environmental Sound Classification (ESC-50) dataset](https://huggingface.co/datasets/ashraq/esc50).

In [None]:
dataset = load_dataset("ashraq/esc50", split="train", streaming=True)

audio_sample = next(iter(dataset))["audio"]["array"]


We specify potential labels for classification. The model assigns a probability score to each label. It's important to know the possible labels in advance. We can use either the full set or a selected subset. Using the full set is more comprehensive but may lead to lower accuracy due to a larger classification space.


In [None]:
candidate_labels = ["Sound of a dog", "Sound of vacuum cleaner"]

We can run the CLAP model through the pipeline to find the candidate label that is most similar to the audio input.

In [None]:
classifier = pipeline(
    task="zero-shot-audio-classification", model="laion/clap-htsat-unfused"
)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
classifier(audio_sample, candidate_labels=candidate_labels)

[{'score': 0.9997242093086243, 'label': 'Sound of a dog'},
 {'score': 0.00027583108749240637, 'label': 'Sound of vacuum cleaner'}]

The model is very confident that the sound is a dog, predicting it with more than 99% probability. We can listen to the audio sample to confirm the classification is correct.

In [None]:
from IPython.display import Audio

Audio(audio_sample, rate=16000)

Indeed, the model correctly identified the dog barking with high confidence. Feel free to experiment with different audio samples and candidate labels, for example by using the labels that come with the dataset.

Why not use the zero-shot audio classification pipeline for all tasks? CLAP is pre-trained on generic audio data rather than on speech specifically, so it is weaker on tasks that require telling languages apart.

Next, we will fine-tune a transformer for music classification. The result will be a checkpoint for song classification that can be used directly with the pipeline() function.

## Fine-tuning a model for music classification

We will now go step by step through fine-tuning an encoder-only transformer model for music classification.

We will use a lightweight model and a relatively small dataset, so the code can run on any consumer-grade GPU, including the T4 16GB GPU in the Google Colab free tier. We also provide tips in case you have a smaller GPU and run into memory issues along the way.

### Dataset

To train our model, we use the [GTZAN](https://huggingface.co/datasets/marsyas/gtzan) dataset, a collection of 1,000 songs for music genre classification (the copy loaded below contains 999 examples).

Each song is a 30-second clip from one of 10 music genres, ranging from disco to metal. We can easily access the audio files and their corresponding labels from the Hugging Face Hub using the load_dataset() function from Datasets.

In [None]:
!nvidia-smi

Fri Nov  8 19:53:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
gtzan = load_dataset("marsyas/gtzan", "all")

gtzan


The repository for marsyas/gtzan contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/marsyas/gtzan.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 999
    })
})

Since GTZAN does not come with a predefined validation set, we create one.

Given the dataset's balanced distribution across genres, we can use the train_test_split() method to easily generate a 90/10 split.

In [None]:
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)

gtzan

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 899
    })
    test: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 100
    })
})
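
To see where the 899/100 row counts come from, here is a minimal sketch of a shuffled 90/10 split over row indices. It is an illustration only, not the Datasets implementation: the rounding rule (rounding the test share up) is an assumption chosen because it reproduces the counts shown above, and the actual rows selected by `train_test_split()` will differ.

```python
import math
import numpy as np

num_examples = 999   # rows reported for the GTZAN train split above
test_fraction = 0.1

# Shuffle the row indices with a fixed seed so the split is reproducible.
rng = np.random.default_rng(42)
shuffled = rng.permutation(num_examples)

# Round the test share up, so that 10% of 999 rows becomes 100 examples.
num_test = math.ceil(num_examples * test_fraction)
test_idx, train_idx = shuffled[:num_test], shuffled[num_test:]

print(len(train_idx), len(test_idx))  # 899 100
```

A purely random split like this only approximately preserves the genre balance, which is acceptable here because the genres are evenly represented to begin with.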

Let's have a look at a sample.

In [None]:
gtzan["train"][0]

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/3b204381d6c029312e4f9c569c6b1130af3041dd36ca38ca53d4e20f585e39c6/genres/pop/pop.00098.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/3b204381d6c029312e4f9c569c6b1130af3041dd36ca38ca53d4e20f585e39c6/genres/pop/pop.00098.wav',
  'array': array([ 0.10720825,  0.16122437,  0.28585815, ..., -0.22924805,
         -0.20629883, -0.11334229]),
  'sampling_rate': 22050},
 'genre': 7}

The audio files are 1-dimensional NumPy arrays, with each value representing the amplitude at a specific timestep. For these songs, the sampling rate is 22,050 Hz, meaning there are 22,050 amplitude values sampled per second. When using a pretrained model that expects a different sampling rate, we must resample the audio to ensure compatibility.
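
To make the relationship between sampling rate, array length, and duration concrete, here is a small arithmetic sketch. The 30-second figure is the nominal clip length stated earlier; real clips may differ by a few samples.

```python
sampling_rate_hz = 22_050      # as reported for the sample above
clip_seconds = 30              # nominal GTZAN clip length

# Number of amplitude values in a clip = duration x samples per second.
num_samples = clip_seconds * sampling_rate_hz
print(num_samples)             # 661500

# Going the other way: duration in seconds from the array length.
duration_seconds = num_samples / sampling_rate_hz
print(duration_seconds)        # 30.0
```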

The genre is represented as an integer or class label, which is the format the model uses for predictions. To make these integers more human-readable, we can use the int2str() method of the genre feature.

In [None]:
id2label_fn = gtzan["train"].features["genre"].int2str

id2label_fn(gtzan["train"][0]["genre"])

'pop'

### Preprocessing the Data

Similar to tokenization in NLP, audio and speech models need the input to be encoded in a format that the model can process.

The conversion from audio to the input format is managed by the model's feature extractor. The Transformers library offers the AutoFeatureExtractor class, which automatically selects the appropriate feature extractor for a given model.

To understand how we can process our audio files, let's start by instantiating the feature extractor for DistilHuBERT from the pre-trained checkpoint.

In [None]:
from transformers import AutoFeatureExtractor

model_id = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)


Because the model and the dataset use different sampling rates, we need to resample the audio to 16,000 Hz before feeding it to the feature extractor.

We can do this by retrieving the model's sample rate from the feature extractor.

In [None]:
sampling_rate = feature_extractor.sampling_rate
sampling_rate

16000

Now we resample the dataset using the cast_column() method and the Audio feature from Datasets. The resampling happens on the fly, when each audio sample is loaded.

In [None]:
from datasets import Audio

gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sampling_rate))

gtzan["train"][0]

{'file': '/root/.cache/huggingface/datasets/downloads/extracted/3b204381d6c029312e4f9c569c6b1130af3041dd36ca38ca53d4e20f585e39c6/genres/pop/pop.00098.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/3b204381d6c029312e4f9c569c6b1130af3041dd36ca38ca53d4e20f585e39c6/genres/pop/pop.00098.wav',
  'array': array([ 0.0873509 ,  0.20183384,  0.4790867 , ..., -0.18743178,
         -0.23294401, -0.13517427]),
  'sampling_rate': 16000},
 'genre': 7}

Sampling rate is now 16kHz.
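
For intuition about what resampling does to the array, the sketch below resamples a synthetic tone by linear interpolation. This is a deliberately naive stand-in: it is not the algorithm Datasets uses, and it applies no anti-aliasing filter, so it should not be used on real audio.

```python
import numpy as np

def resample_linear(signal, source_rate, target_rate):
    """Naive resampling by linear interpolation (no anti-aliasing filter)."""
    duration = len(signal) / source_rate
    num_target = int(round(duration * target_rate))
    source_times = np.arange(len(signal)) / source_rate
    target_times = np.arange(num_target) / target_rate
    return np.interp(target_times, source_times, signal)

# One second of a 440 Hz sine wave sampled at 22,050 Hz.
source_rate, target_rate = 22_050, 16_000
t = np.arange(source_rate) / source_rate
tone = np.sin(2 * np.pi * 440 * t)

resampled_tone = resample_linear(tone, source_rate, target_rate)
print(len(tone), len(resampled_tone))  # 22050 16000
```

The clip still lasts one second; only the number of values used to describe it changes, which is why the array contents in the sample above differ before and after the cast.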

For models like Wav2Vec2 and HuBERT, the feature extractor normalizes the input through feature scaling, rescaling each sample to zero mean and unit variance so that all inputs share a consistent dynamic range. Let's first compute the mean and variance of the raw audio for our first sample, then apply the feature extractor and compare.

In [None]:
import numpy as np

sample = gtzan["train"][0]["audio"]

print(f"Mean: {np.mean(sample['array']):.3}, Variance: {np.var(sample['array']):.3}")

Mean: 0.000185, Variance: 0.0493


In [None]:
inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])

print(f"inputs keys: {list(inputs.keys())}")

print(
    f"Mean: {np.mean(inputs['input_values']):.3}, Variance: {np.var(inputs['input_values']):.3}"
)

inputs keys: ['input_values', 'attention_mask']
Mean: -7.45e-09, Variance: 1.0


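The feature extractor returns input_values and attention_mask. The mean is now very close to 0 and the variance is 1, which is the normalized range the model expects. We also passed the sampling rate of our audio so that the feature extractor can verify it matches the rate the model expects.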
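
The normalization itself is simple enough to sketch in NumPy. The function below is an illustrative approximation of zero-mean, unit-variance scaling; the `eps` term and the synthetic input are assumptions made for the example, not values read from the feature extractor.

```python
import numpy as np

def zero_mean_unit_variance(signal, eps=1e-7):
    """Shift to zero mean and scale to unit variance; eps guards against division by zero."""
    return (signal - signal.mean()) / np.sqrt(signal.var() + eps)

rng = np.random.default_rng(0)
# Synthetic stand-in for raw audio with a small offset and low variance.
raw_signal = 0.2 * rng.normal(size=16_000) + 0.01

normalized_signal = zero_mean_unit_variance(raw_signal)
print(f"Mean: {normalized_signal.mean():.3}, Variance: {normalized_signal.var():.3}")
```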

Now, we can define a function for processing all examples in the dataset.
We truncate longer clips using max_length and truncation arguments.

In [None]:
max_duration = 30.0


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs
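
The effect of `max_length` and `truncation`, and the meaning of the attention mask, can be illustrated with a toy NumPy version. This is a sketch under assumptions: the helper `truncate_and_pad` is hypothetical, and the zero-padding step is shown only to explain what a mask value of 0 means; whether the real feature extractor pads depends on its padding settings.

```python
import numpy as np

def truncate_and_pad(clips, max_length):
    """Cut clips to max_length, zero-pad shorter ones, and mark real samples with 1."""
    # Pad only up to the longest clip in the batch after truncation.
    batch_length = min(max_length, max(len(c) for c in clips))
    values = np.zeros((len(clips), batch_length), dtype=np.float32)
    mask = np.zeros((len(clips), batch_length), dtype=np.int32)
    for i, clip in enumerate(clips):
        kept = clip[:batch_length]
        values[i, : len(kept)] = kept
        mask[i, : len(kept)] = 1
    return values, mask

toy_rate, toy_max_duration = 10, 2.0                 # tiny numbers for readability
toy_max_length = int(toy_rate * toy_max_duration)    # 20 samples

toy_clips = [np.ones(25), np.ones(12)]               # one too long, one too short
toy_values, toy_mask = truncate_and_pad(toy_clips, toy_max_length)
print(toy_values.shape)        # (2, 20)
print(toy_mask.sum(axis=1))    # [20 12]
```

With the notebook's settings, the same arithmetic gives a maximum of 16,000 × 30 = 480,000 values per clip.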

With this function defined, we can apply it to the dataset using the map() method.

We use batches of examples by setting batched=True. The default batch size is 1000, but we'll reduce it to 100 to ensure the peak RAM stays within a sensible range for Google Colab's free tier.

If you run into RAM issues, you can adjust the batch parameters of map() to reduce peak RAM usage. Specifically, two arguments can be modified:

- **batch_size**: Defaults to 1000, set to 100 above. Try reducing it by a factor of 2, to 50.
- **writer_batch_size**: Defaults to 1000. Try reducing it to 500, and if that doesn't work, reduce it by a factor of 2 again, to 250.
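
A back-of-the-envelope estimate shows why the batch size matters for RAM. The sketch assumes every clip is exactly 30 seconds at 16,000 Hz and stored as 4-byte floats; it ignores the attention mask and any intermediate copies, so real peak usage will be higher.

```python
def batch_megabytes(batch_size, seconds=30.0, sampling_rate=16_000, bytes_per_value=4):
    """Approximate size of one batch of fixed-length clips, in megabytes (10**6 bytes)."""
    values_per_clip = int(seconds * sampling_rate)
    return batch_size * values_per_clip * bytes_per_value / 1e6

for size in (1000, 100, 50):
    print(size, batch_megabytes(size))
```

Under these assumptions a batch of 1000 clips needs roughly 1.9 GB for the input values alone, while a batch of 100 needs about 192 MB.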



In [None]:
gtzan_encoded = gtzan.map(
    preprocess_function,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=100,
    num_proc=1,
)
gtzan_encoded


DatasetDict({
    train: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 899
    })
    test: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 100
    })
})

To simplify training, we remove the audio and file columns from the dataset.

The input_values column contains encoded audio files, the attention_mask is a binary mask indicating padding, and the genre column holds the labels. For the Trainer to process class labels, we rename the genre column to label.

Finally we obtain label mappings from the dataset using the int2str() method. This allows conversion between integer IDs and human-readable class labels!

In [None]:
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

In [None]:
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

id2label["7"]

'pop'
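
The two mappings can be reproduced without the dataset to see their shape. In this sketch the genre names are written out by hand and their order is an assumption (alphabetical, which is consistent with ID 7 mapping to 'pop' above); in the notebook they come from the dataset's features.

```python
# Genre names written out by hand; the order is assumed, not read from the dataset.
toy_genres = ["blues", "classical", "country", "disco", "hiphop",
              "jazz", "metal", "pop", "reggae", "rock"]

# Same construction as above: string IDs as keys, genre names as values.
toy_id2label = {str(i): name for i, name in enumerate(toy_genres)}
toy_label2id = {name: idx for idx, name in toy_id2label.items()}

print(toy_id2label["7"])     # pop
print(toy_label2id["pop"])   # 7 (stored as the string "7")
```

Note that both dictionaries use string IDs, so a lookup with the integer `7` would raise a `KeyError`.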

### Fine-tuning the Model

To fine-tune the model on GTZAN, we use the Trainer class from Transformers. To load a model for this task, we use the AutoModelForAudioClassification class, which automatically adds the necessary classification head on top of our pre-trained model.

In [None]:
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)


Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


You can upload model checkpoints directly to the Hugging Face Hub during training. The Hub offers:
* Integrated version control to ensure no checkpoint is lost.
* Tensorboard logs for tracking important metrics.
* Model cards for documenting the model's purpose and use cases.
* A community platform for sharing and collaboration.

Linking the notebook to the Hub is simple: enter your Hub authentication token when prompted.
You can find your authentication token [here](https://huggingface.co/settings/tokens).


In [None]:
from huggingface_hub import notebook_login

notebook_login()


The final step is to define the metrics. Given the balanced dataset, we'll use accuracy as our metric and load it using the Evaluate library.

In [None]:
!pip install evaluate



In [None]:
import evaluate
import numpy as np

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
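
What `compute_metrics` calculates can be checked by hand on a toy example. The logits and reference labels below are made up for illustration, and the accuracy is computed directly with NumPy rather than through the Evaluate library.

```python
import numpy as np

# Toy logits for 4 clips and 3 classes; each row holds one clip's class scores.
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.1, 0.4, 0.2],
])
toy_references = np.array([0, 1, 2, 1])

toy_predictions = np.argmax(toy_logits, axis=1)   # highest-scoring class per row
toy_accuracy = float(np.mean(toy_predictions == toy_references))

print(toy_predictions)  # [0 1 2 0]
print(toy_accuracy)     # 0.75
```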


We now instantiate the Trainer and train the model. Note that the processor in the cell below is loaded from a different checkpoint than the model!

For proper fine-tuning, you should use a batch size of at least 8 and train for about 10 epochs.



In [None]:
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 4 # You should actually have a batch size of 8 at least
gradient_accumulation_steps = 1
num_train_epochs = 1 # You should actually do around 10 training epochs

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    push_to_hub=True,
)
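
If a batch size of 8 does not fit in memory, gradient accumulation is one way to reach the same effective batch size with smaller batches. The arithmetic below is a sketch: `accumulation_steps = 2` is a hypothetical setting (the cell above uses 1), and a single GPU is assumed.

```python
import math

num_train_examples = 899     # size of the training split above
per_device_batch = 4
accumulation_steps = 2       # hypothetical value; the notebook uses 1
num_devices = 1              # assuming a single GPU

# Gradients from several small batches are accumulated before each optimizer update.
effective_batch = per_device_batch * accumulation_steps * num_devices

# Number of forward/backward passes needed to see every example once.
batches_per_epoch = math.ceil(num_train_examples / per_device_batch)

print(effective_batch)      # 8
print(batches_per_epoch)    # 225
```

The trade-off is time rather than memory: each optimizer update now waits for two forward and backward passes instead of one.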



In [None]:
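from transformers import AutoModelForAudioClassification, AutoProcessor, Trainer

# Loaded from a different checkpoint than the model; it is not used again in this cell.
processor = AutoProcessor.from_pretrained("bert-base-uncased")

# This reassigns `model` to an already fine-tuned GTZAN checkpoint,
# replacing the DistilHuBERT model instantiated earlier.
model = AutoModelForAudioClassification.from_pretrained("lewtun/distilhubert-finetuned-gtzan")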


trainer = Trainer(
    model,
    training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of the model checkpoint at lewtun/distilhubert-finetuned-gtzan were not used when initializing HubertForSequenceClassification: ['hubert.encoder.pos_conv_embed.conv.weight_g', 'hubert.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing HubertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HubertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at lewtun/distilhubert-finetuned-gtzan and are newly initialized: ['hubert.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'hubert.encoder.pos_conv_embed.co

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-672e6c75-72a0c61f40e0e640769f5c6f;6ab9c228-6378-44ed-bb2d-cf1df11ec4d4)

Invalid username or password.

We can now upload both the model and the results to the Hub.

In [None]:
kwargs = {
    "dataset_tags": "marsyas/gtzan",
    "dataset": "GTZAN",
    "model_name": f"{model_name}-finetuned-gtzan",
    "finetuned_from": model_id,
    "tasks": "audio-classification",
}

In [None]:
trainer.push_to_hub(**kwargs)

We can now use our model in a pipeline.

In [None]:
my_model_name = "pcasale/distilhubert-finetuned-gtzan"
pipe = pipeline(
    "audio-classification", model=my_model_name
)

In [None]:
### Do it yourself
##### Use the model for inference