<a href="https://colab.research.google.com/github/aayush1693/Music-Genre-Classification-using-Transformers/blob/main/exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step-by-step implementation:


---


Installing required modules
First, install the transformers, accelerate, datasets and evaluate packages in the runtime.

In [1]:
!pip install transformers
!pip install accelerate
!pip install datasets
!pip install evaluate


Importing required libraries

---


Next, import the required Python libraries: NumPy, datasets, transformers and evaluate.

In [2]:
from datasets import load_dataset, Audio
import numpy as np
from transformers import pipeline, AutoFeatureExtractor, AutoModelForAudioClassification, TrainingArguments, Trainer
import evaluate

Loading dataset and Splitting


---


Now we load the GTZAN dataset, which contains 10 music genres, and split it into training and test sets in a 90:10 ratio.
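The 90:10 split can be sketched in plain Python to show where the example counts come from. This is only an illustration, not the `datasets` implementation: the total of 999 clips is inferred from the 899 and 100 example counts reported when the dataset is mapped, the round-up rule for the test size is an assumption chosen to match those counts, and the helper names are hypothetical. The seeded shuffle here will not reproduce the exact permutation that `train_test_split(seed=42)` produces.

```python
import math
import random


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (rounding up the test part)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


def train_test_split_indices(n_samples, test_size, seed):
    """Shuffle the indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(n_samples, test_size)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(999, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))  # 899 100
```

Fixing the seed makes the split repeatable, so accuracy figures from different runs are computed on the same held-out clips.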

In [3]:
gtzan = load_dataset("marsyas/gtzan", "all")
gtzan = gtzan["train"].train_test_split(seed=42, shuffle=True, test_size=0.1)

The repository for marsyas/gtzan contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/marsyas/gtzan.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



Data pre-processing

---


Now we extract features from the audio files with the transformers AutoFeatureExtractor and define a helper function that is applied to batches of audio clips (.wav).

Model and feature initialization
  *   Use a pretrained model from the Hugging Face model hub
  *   Initialize the feature extractor

Load data and perform audio preprocessing

The audio in the GTZAN dataset is preprocessed with the feature extractor. The preprocess_function applies the feature extractor to a list of audio arrays and sets options such as ‘max_length’ and ‘truncation’, so clips longer than max_duration (20 seconds) are cut to that length.
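The effect of ‘max_length’ and ‘truncation’ can be sketched without the library. The 16 kHz sampling rate below is an assumption (the notebook reads the real value from `feature_extractor.sampling_rate`), and `truncate` is a hypothetical helper that only mimics the length cut, not normalization or attention masks.

```python
SAMPLING_RATE = 16_000  # assumed value; the notebook uses feature_extractor.sampling_rate
MAX_DURATION = 20.0     # seconds, same value as max_duration in the notebook

# Maximum number of samples kept per clip
max_length = int(SAMPLING_RATE * MAX_DURATION)


def truncate(waveform, max_length):
    """Keep at most max_length samples from the start of the clip."""
    return waveform[:max_length]


clip_30s = [0.0] * (30 * SAMPLING_RATE)  # stand-in for a 30-second clip
clip_5s = [0.0] * (5 * SAMPLING_RATE)    # shorter clips keep their length

print(max_length)                           # 320000
print(len(truncate(clip_30s, max_length)))  # 320000
print(len(truncate(clip_5s, max_length)))   # 80000
```

Under these assumptions, the last 10 seconds of a 30-second clip are discarded, which trades some information for lower memory use during training.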




In [4]:
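model_id = "ntu-spml/distilhubert"
# Load the feature extractor that matches the pretrained checkpoint
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)
# Resample every clip to the sampling rate the model expects
sampling_rate = feature_extractor.sampling_rate
gtzan = gtzan.cast_column("audio", Audio(sampling_rate=sampling_rate))
# Sanity check: run the feature extractor on a single example
sample = gtzan["train"][0]["audio"]
inputs = feature_extractor(
    sample["array"], sampling_rate=sample["sampling_rate"])
# Clips are truncated to at most this many seconds
max_duration = 20.0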


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs


gtzan_encoded = gtzan.map(
    preprocess_function,
    remove_columns=["audio", "file"],
    batched=True,
    batch_size=25,
    num_proc=1,
)

Map:   0%|          | 0/899 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Encoding dataset:

---


Before the dataset can be fed to the model, the labels need to be encoded.

*   Rename the ‘genre’ column to ‘label’
*   Create the id2label and label2id mappings
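The two mappings can be illustrated with a plain list of genre names. The list and its order below are assumptions for illustration; the notebook obtains the names from the dataset's `genre` feature through `int2str`. As in the notebook, the ids are stored as strings.

```python
# Assumed genre names and order, for illustration only
genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

# String id -> genre name
id2label = {str(i): name for i, name in enumerate(genres)}
# Genre name -> string id (the inverse mapping)
label2id = {name: idx for idx, name in id2label.items()}

print(id2label["0"])     # blues
print(label2id["rock"])  # 9 (stored as the string "9")
```

Passing both dictionaries to the model lets it report genre names instead of bare class indices at inference time.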






In [5]:
gtzan_encoded = gtzan_encoded.rename_column("genre", "label")
id2label_fn = gtzan["train"].features["genre"].int2str
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(gtzan_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}

Classification model



---


Now we use ‘AutoModelForAudioClassification’ for music genre classification. The training arguments can be adjusted to your needs and your machine’s capability.

*   First, we initialize a pretrained audio model for fine-tuning.
*   Then we create a TrainingArguments object that holds the training configuration, such as evaluation strategy, learning rate, batch sizes and logging settings. These settings are used during training.
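To see how the batch size, gradient accumulation and epoch count interact, the number of optimizer steps can be estimated by hand. This sketch assumes a single device and 899 training examples (the count reported when the training split is mapped). The warmup rounding is an assumption, since the library's exact rule is not shown in the notebook.

```python
import math

n_train = 899                     # training examples after the 90:10 split
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
num_train_epochs = 5
warmup_ratio = 0.1

# Batches seen per epoch (the last batch may be smaller)
batches_per_epoch = math.ceil(n_train / per_device_train_batch_size)
# One optimizer update every gradient_accumulation_steps batches
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_train_epochs
# Assumed rounding: nearest integer
warmup_steps = round(total_steps * warmup_ratio)

print(total_steps, warmup_steps)  # 2250 225
```

The total of 2250 steps agrees with the `global_step=2250` value in the training output.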



In [6]:
num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
	model_id,
	num_labels=num_labels,
	label2id=label2id,
	id2label=id2label,
)

model_name = model_id.split("/")[-1]
batch_size = 2
gradient_accumulation_steps = 1
num_train_epochs = 5

training_args = TrainingArguments(
	f"{model_name}-Music classification Finetuned",
	evaluation_strategy="epoch",
	save_strategy="epoch",
	learning_rate=5e-5,
	per_device_train_batch_size=batch_size,
	gradient_accumulation_steps=gradient_accumulation_steps,
	per_device_eval_batch_size=batch_size,
	num_train_epochs=num_train_epochs,
	warmup_ratio=0.1,
	logging_steps=5,
	load_best_model_at_end=True,
	metric_for_best_model="accuracy",
	fp16=True,
)


Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model evaluation
Now we train the model and evaluate it in terms of accuracy.


*   We load the accuracy metric from the Hugging Face evaluate library.
*   compute_metrics takes the model predictions and the reference labels and computes accuracy with the loaded metric.
*   Then we initialize the Trainer and train the model.
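The accuracy computation amounts to taking the highest-scoring class for each example and counting the matches with the reference labels. The following sketch does this in plain Python on made-up logits; the helper names and the numbers are hypothetical and only mirror what `np.argmax` and the accuracy metric do in the notebook.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, label_ids):
    """Fraction of examples whose top-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Made-up scores for 4 examples and 3 classes
logits = [
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.9, 0.8, 0.1],
    [-0.5, 0.0, 2.2],
]
label_ids = [0, 1, 1, 2]

print(accuracy(logits, label_ids))  # {'accuracy': 0.75}
```

Here the third example is misclassified (class 0 instead of class 1), so three of four predictions are correct.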





In [7]:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
	predictions = np.argmax(eval_pred.predictions, axis=1)
	return metric.compute(predictions=predictions, references=eval_pred.label_ids)


trainer = Trainer(
	model,
	training_args,
	train_dataset=gtzan_encoded["train"],
	eval_dataset=gtzan_encoded["test"],
	tokenizer=feature_extractor,
	compute_metrics=compute_metrics,
)

trainer.train()



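| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.4709        | 1.268806        | 0.64     |
| 2     | 1.1448        | 0.949447        | 0.68     |
| 3     | 0.1766        | 0.69365         | 0.78     |
| 4     | 0.3255        | 0.941813        | 0.78     |
| 5     | 0.0988        | 0.789208        | 0.81     |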


TrainOutput(global_step=2250, training_loss=0.8603939904106988, metrics={'train_runtime': 1789.5408, 'train_samples_per_second': 2.512, 'train_steps_per_second': 1.257, 'total_flos': 2.044662758208e+17, 'train_loss': 0.8603939904106988, 'epoch': 5.0})

Saving and loading the model


In [8]:
# Save the model and feature extractor
model.save_pretrained("/content/Saved Model")
feature_extractor.save_pretrained("/content/Saved Model")

model.save_pretrained("/content/drive/MyDrive/Saved Model")
feature_extractor.save_pretrained("/content/drive/MyDrive/Saved Model")

['/content/drive/MyDrive/Saved Model/preprocessor_config.json']

Code for loading the model

In [9]:
# Load the model and feature extractor
loaded_model = AutoModelForAudioClassification.from_pretrained("/content/Saved Model")
loaded_feature_extractor = AutoFeatureExtractor.from_pretrained("/content/Saved Model")

Pipeline

---

With this pipeline you can pass in an audio file and obtain the predicted genre along with its probability score. The example below uses a blues clip.
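The probability score comes from turning the model's raw class scores into probabilities and ranking them. The sketch below shows the usual softmax-then-rank step in plain Python. It is an assumption about what happens inside the pipeline, not the pipeline's code, and the logits, the three-genre mapping and the `top_k` helper are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=5):
    """Return the k most probable labels as {"label", "score"} dictionaries."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"label": id2label[i], "score": probs[i]} for i in ranked[:k]]


# Hypothetical three-genre example
id2label = {0: "blues", 1: "jazz", 2: "rock"}
preds = top_k([4.0, 1.0, 0.5], id2label, k=2)

print(preds[0]["label"])  # blues
```

Each entry has the same `label` and `score` keys that `classify_audio` reads from the pipeline output.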

In [11]:
from transformers import pipeline, AutoFeatureExtractor

pipe = pipeline("audio-classification", model=loaded_model,
				feature_extractor=loaded_feature_extractor)


def classify_audio(filepath):
	preds = pipe(filepath)
	outputs = {}
	for p in preds:
		outputs[p["label"]] = p["score"]
	return outputs


# Provide the input file path
input_file_path = input('Input:')

# Classify the audio file
output = classify_audio(input_file_path)

# Print the output genre
print("Predicted Genre:")
max_key = max(output, key=output.get)

print("The predicted genre is:", max_key)
print("The prediction score is:", output[max_key])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Input:/content/sound-genre-blue.wav
Predicted Genre:
The predicted genre is: blues
The prediction score is: 0.9878857135772705


# Conclusion


---


Music genre classification is a complex and computationally intensive task with applications across many industries. The model built here, a DistilHuBERT-based architecture fine-tuned on the GTZAN dataset, reached an accuracy of 81% on the held-out test split after five epochs. A larger and more diverse dataset could improve the model's generalization and may lead to higher accuracy.