# The Erdős Institute Fall Boot Camp - Team Audiobots

We're using data from [this dataset](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification) to classify one thousand 30-second audio samples into one of 10 genres:

*   blues
*   classical
*   country
*   disco
*   hiphop
*   jazz
*   metal
*   pop
*   reggae
*   rock

We are assuming the genres are accurate. We're using this instead of the other dataset, as it seems to be more accurately classified and avoids any problematic "International" genre.

For training, we will feed 90% of the data into a pre-trained Transformer network from Hugging Face and fine-tune it to predict one of the 10 genres above. If the architecture requires inputs of constant size, we can either pad shorter samples with zeros or randomly clip fixed-length sections out of longer audio streams.
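
The two options for reaching a constant input length can be sketched as follows. This is a minimal illustration on plain Python lists, not the notebook's actual preprocessing; `fix_length` and its parameters are hypothetical names, and on real audio the same logic would run on NumPy arrays.

```python
import random


def fix_length(samples, target_len, rng=None):
    """Return a copy of `samples` that is exactly `target_len` long.

    Shorter inputs are right-padded with zeros; longer inputs are cut
    down to a randomly positioned window of `target_len` samples.
    """
    rng = rng or random.Random()
    if len(samples) <= target_len:
        # Pad with zeros (silence) at the end.
        return list(samples) + [0.0] * (target_len - len(samples))
    # Pick a random start position for the window.
    start = rng.randint(0, len(samples) - target_len)
    return list(samples[start:start + target_len])


short_clip = [0.1, -0.2, 0.3]
long_clip = [float(i) for i in range(10)]

padded = fix_length(short_clip, 5)                        # zero-padded to 5 samples
cropped = fix_length(long_clip, 5, rng=random.Random(0))  # random 5-sample window
```

Passing a seeded `random.Random` makes the crop reproducible, which is handy when comparing runs.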

This code does the following:

*   Downloads the data from Hugging Face
*   Splits the data into a 90/10 training and validation set. No need to create a test set until we get to the larger datasets.
*   (TODO) split manually to ensure consistency - use same seed for now?
*   Resamples the audio to 16 kHz
*   Uses `preprocess_function` to extract features: it pads or truncates every clip to the same length and normalizes each one to mean 0 (most should already be close) and variance 1.
*   Trains the pretrained distilhubert model for 10 epochs with a batch size of 8 and a fairly small learning rate
*   (TODO) Are we freezing the bottom weights on the distilhubert model before training? If not, we should do this
*   (TODO) Are we un-freezing the bottom weights and fine-tuning with a lower learning rate? If not, we should do this
*   (TODO) Spectrograms and a different model?
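
For the TODO about a consistent split, one option is to shuffle the example indices with a fixed seed so that every run produces the same 90/10 partition. The sketch below is plain Python; `split_indices` and the seed value are hypothetical, and the notebook itself may rely on splits shipped with the dataset instead.

```python
import random


def split_indices(n_examples, val_fraction=0.1, seed=42):
    """Deterministically split range(n_examples) into (train, val) index lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_val = int(n_examples * val_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))
```

With the `datasets` library, `Dataset.train_test_split(test_size=0.1, seed=42)` gives a comparable seeded split without managing indices by hand.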

In [14]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import librosa
import librosa.display
import evaluate
import torch

#from IPython.display import Audio

from datasets import load_dataset, Audio

from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, Trainer, TrainingArguments


! pip install -U accelerate



DEPRECATION: Loading egg at /home/dwgb93/mambaforge/envs/Audiobots/lib/python3.11/site-packages/huggingface_hub-0.22.2-py3.8.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330


Download the data from the Hugging Face repo.

In [15]:
FMA = load_dataset("hkamath-rudra/fma", "medium") 
FMA = FMA.select_columns(['audio', 'track.genre_top'])
FMA

DatasetDict({
    train: Dataset({
        features: ['audio', 'track.genre_top'],
        num_rows: 13522
    })
    validation: Dataset({
        features: ['audio', 'track.genre_top'],
        num_rows: 1705
    })
    test: Dataset({
        features: ['audio', 'track.genre_top'],
        num_rows: 1773
    })
})

In [3]:
# One-off scan for tracks that fail to decode. Left disabled because it
# touches every file in the split and is slow.
# bad_idx = []
# for i in range(len(FMA['train'])):
#     if i % 1000 == 0:
#         print("currently at", i)
#     try:
#         FMA['train'][i]['audio']['path']  # indexing the row decodes the audio
#     except Exception:
#         bad_idx.append(i)
#         print("BAD", i)

In [16]:
def track_remover(split, idx):
    # Return a new dataset for this split that excludes the rows listed in idx.
    bad = set(idx)  # build the set once instead of once per row
    return FMA[split].select(
        [i for i in range(len(FMA[split])) if i not in bad]
    )

In [17]:
#small_train = [3495, 3496, 3497, 3530, 3897, 5611]
#med_train = [146, 7638, 8878, 8879, 8880, 8881, 8882, 8883, 9301, 11215, 11290, 12713]
#med_validation = [817]
#med_test = [87]

split = 'train'
FMA[split] = track_remover(split, [146, 7638, 8878, 8879, 8880, 8881, 8882, 8883, 9301, 11215, 11290, 12713])

split = 'validation'
FMA[split] = track_remover(split, [817])

split = 'test'
FMA[split] = track_remover(split, [87])

FMA


DatasetDict({
    train: Dataset({
        features: ['audio', 'track.genre_top'],
        num_rows: 13510
    })
    validation: Dataset({
        features: ['audio', 'track.genre_top'],
        num_rows: 1704
    })
    test: Dataset({
        features: ['audio', 'track.genre_top'],
        num_rows: 1772
    })
})

In [18]:
#model_id = "openai/whisper-medium"
model_id = "openai/whisper-small"
#model_id = "sanchit-gandhi/whisper-medium-fleurs-lang-id"
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True)

In [19]:
sampling_rate = feature_extractor.sampling_rate
sampling_rate

#This is the sampling rate that the model expects, so we have to make sure we re-sample the audio to this rate.

16000

In [20]:
FMA = FMA.cast_column("audio", Audio(sampling_rate=sampling_rate))
#Otherwise, it will ASSUME the audio is 16kHz, and only use the first ~11s of slowed down audio
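
To see where the "~11s" figure in the comment above comes from, here is the arithmetic. It assumes the source files are stored at 44.1 kHz, which the notebook does not state.

```python
source_rate = 44_100   # assumed original sampling rate in Hz (not stated in the notebook)
model_rate = 16_000    # rate the feature extractor expects
window_s = 30.0        # length of the window the model sees, in seconds

# The model reads 30 s worth of samples at its own rate...
samples_read = int(model_rate * window_s)
# ...but without resampling, those samples only span this much real audio:
real_seconds = samples_read / source_rate
# and the audio is effectively slowed down by this factor:
slowdown = source_rate / model_rate

print(f"{samples_read} samples cover {real_seconds:.1f} s, slowed {slowdown:.2f}x")
```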

In [21]:
max_duration = 30.0 #I'm pretty sure all the audio is close to exactly this long (skipped EDA, lol)


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        #return_attention_mask=True,
    )
    
    return inputs
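
For reference, the `max_length` passed to the feature extractor works out as shown below. The frame count assumes the Whisper feature extractor's default hop length of 160 samples, which is not set anywhere in this notebook.

```python
sampling_rate = 16_000   # value reported by feature_extractor.sampling_rate
max_duration = 30.0      # seconds, as in the cell above
hop_length = 160         # assumed Whisper default, in samples

max_length = int(sampling_rate * max_duration)  # samples kept per clip
n_frames = max_length // hop_length             # spectrogram frames per clip

print(max_length, n_frames)
```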

In [22]:
def rename_labels(batch):
    batch['label'] = [label2id[x] for x in batch["track.genre_top"]]
    return batch

In [23]:
id2label = {
    0: 'International',
    1: 'Blues',
    2: 'Jazz',
    3: 'Classical',
    4: 'Old-Time / Historic',
    5: 'Country',
    6: 'Pop',
    7: 'Rock',
    8: 'Easy Listening',
    9: 'Soul-RnB',
    10: 'Electronic',
    11: 'Folk',
    12: 'Spoken',
    13: 'Hip-Hop',
    14: 'Experimental',
    15: 'Instrumental',
}

id2label.items()

dict_items([(0, 'International'), (1, 'Blues'), (2, 'Jazz'), (3, 'Classical'), (4, 'Old-Time / Historic'), (5, 'Country'), (6, 'Pop'), (7, 'Rock'), (8, 'Easy Listening'), (9, 'Soul-RnB'), (10, 'Electronic'), (11, 'Folk'), (12, 'Spoken'), (13, 'Hip-Hop'), (14, 'Experimental'), (15, 'Instrumental')])

In [24]:
label2id = {v: k for k, v in id2label.items()}

#id2label["7"]
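
As a quick illustration of what the two mappings and `rename_labels` do together, here is a toy version with three of the genres. The batch is made up for the example.

```python
id2label = {0: "International", 1: "Blues", 2: "Jazz"}  # truncated copy for illustration
label2id = {name: idx for idx, name in id2label.items()}


def rename_labels(batch):
    # Replace each genre string with its integer class id.
    batch["label"] = [label2id[x] for x in batch["track.genre_top"]]
    return batch


toy_batch = {"track.genre_top": ["Jazz", "Blues", "Jazz"]}  # hypothetical batch
labels = rename_labels(toy_batch)["label"]
print(labels)  # [2, 1, 2]
```

A genre string that is missing from `label2id` raises a `KeyError`, so the mapping has to cover every value in the `track.genre_top` column.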

In [25]:
FMA = FMA.map(
    rename_labels,
    remove_columns=["track.genre_top"],
    batched=True,
    batch_size=200,
    num_proc=1,
)

In [26]:
FMA_encoded = FMA.map(
    preprocess_function,
    remove_columns=["audio"],
    batched=True,
    batch_size=100,
    num_proc=1,
)
    
FMA_encoded

Map:   0%|          | 0/13510 [00:00<?, ? examples/s]

[src/libmpg123/layer3.c:INT123_do_layer3():1774] error: part2_3_length (3264) too large for available bit count (3224)
[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1844] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!


Map:   0%|          | 0/1704 [00:00<?, ? examples/s]

[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!


Map:   0%|          | 0/1772 [00:00<?, ? examples/s]

[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!


DatasetDict({
    train: Dataset({
        features: ['label', 'input_features'],
        num_rows: 13510
    })
    validation: Dataset({
        features: ['label', 'input_features'],
        num_rows: 1704
    })
    test: Dataset({
        features: ['label', 'input_features'],
        num_rows: 1772
    })
})

In [27]:
num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

Some weights of WhisperForAudioClassification were not initialized from the model checkpoint at openai/whisper-small and are newly initialized: ['model.classifier.bias', 'model.classifier.weight', 'model.projector.bias', 'model.projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
model_name = model_id.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 20

training_args = TrainingArguments(
    f"{model_name}-finetuned-FMA-med",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,  # 5e-5
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
)
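
With these settings the size of the run can be estimated up front. The sketch assumes a single device and uses the 13,510 training rows reported earlier; the warmup step count that `Trainer` actually uses may differ slightly because of rounding.

```python
import math

n_train = 13_510                 # training rows after removing bad tracks
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 20
warmup_ratio = 0.1

# One optimizer step consumes batch_size * gradient_accumulation_steps examples.
steps_per_epoch = math.ceil(n_train / (batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = round(total_steps * warmup_ratio)  # approximate

print(steps_per_epoch, total_steps, warmup_steps)
```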

In [29]:
metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
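
`compute_metrics` takes the arg-max over the class logits and compares it with the reference labels. A dependency-free sketch of the same calculation on made-up logits (`accuracy_from_logits` is a hypothetical helper, not part of the notebook):

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_labels = [1, 0, 1, 0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```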

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=FMA_encoded["train"],
    eval_dataset=FMA_encoded["validation"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()


In [None]:
trainer.evaluate()

In [None]:
trainer.save_model("./best_whisperSmall_model")

## This takes 13+ hours to run on my computer.