<a href="https://colab.research.google.com/github/allispaul/audiobot/blob/main/models/Audiobots_GTZAN_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Erdős Institute Fall Boot Camp - Team Audiobots

We're using data from [this dataset](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification) to classify roughly one thousand 30-second audio clips (999 in the copy loaded below) into one of 10 genres:

*   blues
*   classical
*   country
*   disco
*   hiphop
*   jazz
*   metal
*   pop
*   reggae
*   rock

We are assuming the genre labels are accurate. We're using this dataset instead of the other one, as it seems to be more accurately classified and avoids the problematic "International" genre.

For training, we feed 90% of the data into a pre-trained Transformer network from Hugging Face and fine-tune it to predict one of the 10 genres above. If the architecture requires inputs of constant size, we can either pad shorter samples with zeros or randomly clip fixed-length sections out of longer audio streams.
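
As a rough illustration of the two options in the paragraph above, the sketch below pads short clips with zeros and takes a random contiguous crop from long ones. The helper name `fix_length` and the toy array lengths are hypothetical; this notebook relies on the feature extractor for length handling instead.

```python
import numpy as np


def fix_length(audio, target_len, rng):
    """Return a 1-D array of exactly target_len samples.

    Hypothetical helper, for illustration only.
    """
    n = len(audio)
    if n < target_len:
        # Too short: append zeros at the end.
        return np.pad(audio, (0, target_len - n))
    if n > target_len:
        # Too long: pick a random start and keep a contiguous window.
        start = rng.integers(0, n - target_len + 1)
        return audio[start:start + target_len]
    return audio


rng = np.random.default_rng(0)
short_clip = np.ones(5, dtype=np.float32)
long_clip = np.arange(20, dtype=np.float32)

padded = fix_length(short_clip, 8, rng)
cropped = fix_length(long_clip, 8, rng)
print(padded.shape, cropped.shape)
```

Random cropping doubles as a light form of data augmentation, since each epoch can see a different window of the same track.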

This code does the following:

*   Downloads the data from Hugging Face
*   Splits the data into a 90/10 training and validation set. No need to create a test set until we get to the larger datasets.
*   (TODO) split manually to ensure consistency - use same seed for now?
*   Resamples the audio to 16 kHz
*   Uses `preprocess_function` to run the feature extractor, which brings every clip to the right length through padding/truncating and normalizes each clip to mean 0 (most should already be close) and variance 1
*   Trains the pretrained distilhubert model for 10 epochs with batch size 8 and a fairly small learning rate (5e-5)
*   (TODO) Are we freezing the bottom weights on the distilhubert model before training? If not, we should do this
*   (TODO) Are we un-freezing the bottom weights and fine-tuning with a lower learning rate? If not, we should do this
*   (TODO) Spectrograms and a different model?
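
The normalization step mentioned in the list can be sketched in NumPy. This is a minimal illustration of zero-mean, unit-variance scaling on a made-up signal, not the feature extractor's actual code; the small `eps` term is an assumption added here to avoid dividing by zero on silent clips.

```python
import numpy as np


def zero_mean_unit_var(x, eps=1e-7):
    """Scale a 1-D signal to mean 0 and variance ~1 (eps is an assumed guard)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


# Made-up signal with a DC offset and a small amplitude.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
signal = 0.3 + 0.05 * np.sin(2 * np.pi * 440.0 * t)

normalized = zero_mean_unit_var(signal)
print(round(float(normalized.mean()), 6), round(float(normalized.var()), 6))
```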

In [1]:
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import librosa
import librosa.display

#from IPython.display import Audio

!pip install datasets
from datasets import load_dataset, Audio

#!pip install git+https://github.com/huggingface/transformers
!pip install -U accelerate
!pip install -U transformers
!pip install evaluate

Download the data from the Hugging Face Hub.

In [2]:
gtzan = load_dataset("marsyas/gtzan", split='train')

gtzan

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Dataset({
    features: ['file', 'audio', 'genre'],
    num_rows: 999
})

In [3]:
gtzan = gtzan.train_test_split(seed=42, shuffle=True, test_size=0.1, stratify_by_column='genre')
gtzan

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 899
    })
    test: Dataset({
        features: ['file', 'audio', 'genre'],
        num_rows: 100
    })
})

In [4]:
GENRES = ["blues", "classical", "country", "disco", "hiphop", "jazz", "metal", "pop", "reggae", "rock"]
# I'm using the built in function later, but this might be easier?

In [5]:
from transformers import AutoFeatureExtractor

model_id = "ntu-spml/distilhubert"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)

In [6]:
sampling_rate = feature_extractor.sampling_rate
sampling_rate

# This is the sampling rate the model expects, so we have to resample the audio to this rate.

16000

In [7]:
gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sampling_rate))
# Otherwise, it will ASSUME the audio is 16 kHz and only use the first ~22 s of slowed-down audio
# (assuming the GTZAN files are stored at 22.05 kHz)
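
To see why the resampling step matters, here is a back-of-the-envelope check. It assumes the GTZAN clips are stored at 22,050 Hz, which is not stated anywhere in this notebook, so treat the numbers as illustrative.

```python
model_rate = 16_000      # rate the feature extractor expects (from the cell above)
native_rate = 22_050     # ASSUMPTION: native GTZAN sampling rate
max_duration = 30.0      # seconds kept after truncation

# Number of samples kept by truncation.
max_samples = int(model_rate * max_duration)

# If the audio were NOT resampled, those samples would cover less real time...
seconds_covered = max_samples / native_rate
# ...and the model would hear everything slowed down by this factor.
slowdown = native_rate / model_rate

print(f"{max_samples} samples -> {seconds_covered:.1f} s of real audio, "
      f"slowed down {slowdown:.2f}x")
```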

In [8]:
max_duration = 30.0  # all clips should be very close to 30 s long (not verified; EDA was skipped)


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs

In [9]:
gtzan_encoded = gtzan.map(
    preprocess_function,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=100,
    num_proc=1,
)
gtzan_encoded

Map:   0%|          | 0/899 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 899
    })
    test: Dataset({
        features: ['genre', 'input_values', 'attention_mask'],
        num_rows: 100
    })
})

In [10]:
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")

In [12]:
id2label_fn = gtzan["train"].features["genre"].int2str
id2label_fn(gtzan["train"][0]["genre"])

'country'

In [14]:
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

id2label["7"]

'pop'
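
The mapping can also be built directly from the `GENRES` list defined earlier, which is the alternative hinted at in that cell's comment. This sketch assumes the list order matches the dataset's label ids (consistent with `id2label["7"]` returning `'pop'` above, but not otherwise verified here). The `_alt` names are hypothetical.

```python
GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

# String keys mirror the id2label dict built above.
id2label_alt = {str(i): name for i, name in enumerate(GENRES)}
label2id_alt = {name: idx for idx, name in id2label_alt.items()}

print(id2label_alt["7"], label2id_alt["pop"])
```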

In [15]:
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
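
The warning above lists the classifier and projector weights as newly initialized, which ties in with the freezing TODOs at the top. Below is a generic PyTorch sketch of that idea on a tiny stand-in network: freeze the pretrained part, train only the new head, then unfreeze and fine-tune with a lower learning rate. The `TinyNet` class and both learning rates are hypothetical, and this has not been run against the distilhubert model.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Stand-in for a pretrained encoder plus a freshly initialized head."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 8)      # pretend this part is pretrained
        self.classifier = nn.Linear(8, 10)   # new head, one output per genre

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))


def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


net = TinyNet()

# Phase 1: freeze the encoder so only the head is trained.
for param in net.encoder.parameters():
    param.requires_grad = False
trainable_phase1 = count_trainable(net)
head_optimizer = torch.optim.AdamW(
    [p for p in net.parameters() if p.requires_grad], lr=1e-3
)

# Phase 2 (later): unfreeze everything and use a smaller learning rate.
for param in net.encoder.parameters():
    param.requires_grad = True
trainable_phase2 = count_trainable(net)
full_optimizer = torch.optim.AdamW(net.parameters(), lr=1e-5)

print(trainable_phase1, trainable_phase2)
```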


In [None]:
from transformers import TrainingArguments

model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    evaluation_strategy="epoch",  # newer transformers releases call this eval_strategy
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
)
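
For a sense of scale, the settings above imply the following step counts. This is plain arithmetic assuming a single GPU and the 899-example training split shown earlier; the exact rounding the `Trainer` uses internally may differ slightly.

```python
import math

num_train_examples = 899          # from the train split above
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10
warmup_ratio = 0.1

# Batches per epoch (the last, smaller batch still counts).
batches_per_epoch = math.ceil(num_train_examples / batch_size)
# Optimizer updates per epoch after gradient accumulation.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs
# Approximate number of updates spent warming up the learning rate.
warmup_updates = round(total_updates * warmup_ratio)

print(batches_per_epoch, total_updates, warmup_updates)
```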

In [None]:
import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
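
To make the metric concrete, this toy example applies the same argmax-then-compare logic to made-up logits. It uses plain NumPy instead of `evaluate`, and the arrays are invented for illustration.

```python
import numpy as np

# Made-up logits for 4 clips and 3 classes (the real model has 10).
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [0.0, 0.4, 0.9],
    [1.2, 1.1, 0.0],
])
labels = np.array([0, 1, 1, 0])

# Predicted class = index of the largest logit in each row.
predictions = np.argmax(logits, axis=1)
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```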

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=gtzan_encoded["train"],
    eval_dataset=gtzan_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()


This takes nearly 1.5 hours to run on Google Colab with a T4 GPU. Runtime on a local machine may vary.

We're approaching 90% accuracy using just the default model, training time, and hyperparameters, so I'm confident we can increase past that with some tweaking (but boy does it take a while to run and verify that).

I'll play around with hyperparameter optimization and freezing weights to see whether we can get past 90% (otherwise, why spend the compute time when XGBoost or something similar can get just as accurate in a fraction of the time?).

I strongly suspect increasing the amount of data will help here.