<a href="https://colab.research.google.com/github/dmarx/notebooks/blob/music_representations/Contrastive_Learning_of_Musical_Representations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Contrastive Learning of Musical Representations

This notebook demonstrates how to load an encoder pre-trained with CLMR and train a linear classifier on top of it for automatic music classification. The originally reported result is ~68% accuracy on the GTZAN genre classification task using a linear classifier; the run recorded in this notebook reaches 62.8%.


### Introduction
In this work, we introduce SimCLR to the music domain and contribute a large chain of audio data augmentations, to form a simple framework for self-supervised learning of raw waveforms of music: CLMR. We evaluate the performance of the self-supervised learned representations on the task of music classification.

- We achieve competitive results on the MagnaTagATune and Million Song Datasets relative to fully supervised training, despite only using a linear classifier on self-supervised learned representations, i.e., representations that were learned task-agnostically without any labels.
- CLMR enables efficient classification: with only 1% of the labeled data, we achieve similar scores compared to using 100% of the labeled data.
- CLMR is able to generalise to out-of-domain datasets: when training on entirely different music datasets, it is still able to perform competitively compared to fully supervised training on the target dataset.

<div align="center">
  <img width="50%" alt="CLMR model" src="https://github.com/Spijkervet/CLMR/blob/master/media/clmr_model.png?raw=true">
</div>
<div align="center">
  An illustration of CLMR.
</div>
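
Since CLMR applies the SimCLR recipe to raw audio, its pre-training objective is a contrastive loss over pairs of augmented views of the same clip. The sketch below is a minimal pure-Python version of the NT-Xent loss used in SimCLR, not the CLMR implementation: the function names, the list-based vectors, and the default temperature of 0.5 are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss; z_i[k] and z_j[k] embed two views of the same clip k."""
    n = len(z_i)
    z = z_i + z_j  # 2N embeddings; the positive of index a is (a + n) % (2n)
    total = 0.0
    for a in range(2 * n):
        positive = (a + n) % (2 * n)
        # Scaled similarities of anchor a to every other embedding (self excluded).
        logits = {
            b: cosine_similarity(z[a], z[b]) / temperature
            for b in range(2 * n)
            if b != a
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[positive]
    return total / (2 * n)


# Toy embeddings (hypothetical values): two clips, two views each.
views_a = [[1.0, 0.0], [0.0, 1.0]]
matched = nt_xent_loss(views_a, [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent_loss(views_a, [[0.0, 1.0], [1.0, 0.0]])
```

When the two views of each clip map to similar embeddings, `matched` is small; swapping the pairing, as in `mismatched`, raises the loss.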


The latest checkpoints are accessible in the following GitHub release:
- [clmr_checkpoint_10000](https://github.com/Spijkervet/CLMR/releases/download/2.0/clmr_checkpoint_10000.zip)
- [finetuner_checkpoint_200](https://github.com/Spijkervet/CLMR/releases/download/2.0/finetuner_checkpoint_200.zip)

## Installation

In [2]:
#!git clone https://github.com/spijkervet/clmr
%cd /content/clmr
!pip install . -q

/content/clmr

## Download pre-trained CLMR encoder weights

In [3]:
!wget -nc https://github.com/Spijkervet/CLMR/releases/download/2.1/clmr_magnatagatune_mlp.zip
!unzip -o clmr_magnatagatune_mlp.zip

ENCODER_CHECKPOINT_PATH = "./clmr_magnatagatune_mlp/clmr_epoch=10000.ckpt"
FINETUNER_CHECKPOINT_PATH = "./clmr_magnatagatune_mlp/mlp_epoch=34-step=2589.ckpt"
SAMPLE_RATE = 22050


## Set up dependencies and load config file

In [5]:
import os
import argparse
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchaudio_augmentations import Compose, RandomResizedCrop  # used when building the transforms below
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger
from tqdm import tqdm

from clmr.datasets import get_dataset  # used when initialising the datasets below
from clmr.data import ContrastiveDataset
from clmr.evaluation import evaluate
from clmr.models import SampleCNN
from clmr.modules import ContrastiveLearning, LinearEvaluation
from clmr.utils import (
    yaml_config_hook,
    load_encoder_checkpoint,
    load_finetuner_checkpoint,
)

In [6]:
parser = argparse.ArgumentParser(description="SimCLR")
parser = Trainer.add_argparse_args(parser)

config = yaml_config_hook("./config/config.yaml")
for k, v in config.items():
  parser.add_argument(f"--{k}", default=v, type=type(v))

args = parser.parse_args([])
pl.seed_everything(args.seed)
args.accelerator = None

AttributeError: ignored

## Initialise dataset
The GTZAN dataset is used in this example.
It will download and extract the dataset entirely, which may take a while.

In [None]:
args.dataset = "gtzan"
train_dataset = get_dataset(args.dataset, "./data", subset="train")
test_dataset = get_dataset(args.dataset, "./data", subset="test")

  0%|          | 0.00/1.14G [00:00<?, ?B/s]

### How to use your own folder of audio files for pre-training

In [None]:
from IPython.display import Audio, display
# -p avoids an error when the directory already exists.
!mkdir -p /content/clmr/audio_data
!cd /content/clmr/audio_data && \
  wget -nc https://www2.cs.uic.edu/~i101/SoundFiles/ImperialMarch60.wav && \
  wget -nc https://www2.cs.uic.edu/~i101/SoundFiles/PinkPanther30.wav
# Each `!` command runs in its own subshell, so the notebook's working
# directory is unchanged and no `cd` back is needed.

audio_dataset = get_dataset("audio", "./audio_data", subset="train")
for idx in range(len(audio_dataset)):
  audio_dataset.preprocess(idx, args.sample_rate)

x, _ = audio_dataset[0]
display(Audio(x, rate=args.sample_rate))

x, _ = audio_dataset[1]
display(Audio(x, rate=args.sample_rate))

File ‘ImperialMarch60.wav’ already there; not retrieving.

File ‘PinkPanther30.wav’ already there; not retrieving.


## Initialise ContrastiveDataset
Wrap the common torch Dataset class in our own ContrastiveDataset class, which lets us specify the transformations applied to the input data before it is fed to the network.
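
To make the idea of a crop transform concrete, here is a minimal pure-Python sketch of picking a random fixed-length window from a waveform. It is an illustrative stand-in, not the `torchaudio_augmentations` implementation: `random_crop` and its `rng` parameter are hypothetical names, and the waveform is a plain list rather than a tensor.

```python
import random


def random_crop(waveform, n_samples, rng=random):
    """Return a random contiguous window of n_samples from a 1-D waveform."""
    if len(waveform) < n_samples:
        raise ValueError("waveform is shorter than the requested crop")
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, len(waveform) - n_samples)
    return waveform[start:start + n_samples]


# Toy waveform (hypothetical data): sample values equal their own index,
# which makes it easy to see where the crop came from.
toy_waveform = list(range(100))
crop = random_crop(toy_waveform, 10, rng=random.Random(0))
```

Each call yields a different window of the same track, which is why the train and validation datasets get the transform while the test dataset is left untransformed.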

In [None]:
# Wrap the train and test dataset in our own ContrastiveDataset class, which
# accepts a transform parameter that applies a transformation on the input data.

train_transform = [RandomResizedCrop(n_samples=args.audio_length)]

# ------------
# dataloaders
# ------------
train_dataset = get_dataset(args.dataset, args.dataset_dir, subset="train")
valid_dataset = get_dataset(args.dataset, args.dataset_dir, subset="valid")
test_dataset = get_dataset(args.dataset, args.dataset_dir, subset="test")

contrastive_train_dataset = ContrastiveDataset(
    train_dataset,
    input_shape=(1, args.audio_length),
    transform=Compose(train_transform),
)

contrastive_valid_dataset = ContrastiveDataset(
    valid_dataset,
    input_shape=(1, args.audio_length),
    transform=Compose(train_transform),
)

contrastive_test_dataset = ContrastiveDataset(
    test_dataset,
    input_shape=(1, args.audio_length),
    transform=None,
)

train_loader = DataLoader(
    contrastive_train_dataset,
    batch_size=args.finetuner_batch_size,
    num_workers=args.workers,
    shuffle=True,
)

valid_loader = DataLoader(
    contrastive_valid_dataset,
    batch_size=args.finetuner_batch_size,
    num_workers=args.workers,
    shuffle=False,
)

## Load CLMR pre-training weights to encoder

In [10]:
#args.checkpoint_path = ENCODER_CHECKPOINT_PATH

encoder = SampleCNN(
    strides=[3, 3, 3, 3, 3, 3, 3, 3, 3],
    #supervised=args.supervised,
    supervised=True,
    #out_dim=train_dataset.n_classes,
    out_dim=4,
)

n_features = encoder.fc.in_features  # get dimensions of last fully-connected layer

#state_dict = load_encoder_checkpoint(ENCODER_CHECKPOINT_PATH, train_dataset.n_classes)
state_dict = load_encoder_checkpoint(ENCODER_CHECKPOINT_PATH, 4)
encoder.load_state_dict(state_dict)

#cl = ContrastiveLearning(args, encoder)
#cl.eval()
#cl.freeze()


<All keys matched successfully>
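
The nine strides of 3 passed to `SampleCNN` hint at why the encoder expects a specific clip length. The arithmetic below is a sketch under an assumption this notebook does not confirm: that the encoder also has an initial stride-3 layer (`stem_stride` is a hypothetical name), which would make the total downsampling factor 3^10.

```python
from math import prod

SAMPLE_RATE = 22050  # value set earlier in this notebook
strides = [3, 3, 3, 3, 3, 3, 3, 3, 3]


def downsampling_factor(strides, stem_stride=1):
    """Total factor by which the time axis shrinks through the strided layers."""
    return stem_stride * prod(strides)


factor = downsampling_factor(strides)  # 3 ** 9 = 19683
factor_with_stem = downsampling_factor(strides, stem_stride=3)  # 3 ** 10 = 59049
clip_seconds = factor_with_stem / SAMPLE_RATE  # clip duration that reduces to one time step
```

Under that assumption, a clip of 59049 samples (about 2.68 seconds at 22050 Hz) collapses to a single time step, giving one feature vector per clip.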

## Load linear fine-tuner head module and weights

In [None]:
module = LinearEvaluation(
    args,
    encoder,  # `cl` is commented out in the previous cell, so pass the encoder directly
    hidden_dim=n_features,
    output_dim=train_dataset.n_classes,
)



## Extract representations from our dataset of audio files
Let's extract our representations first, so that we do not have to do this for every iteration in our linear classifier:
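
The benefit of extracting once can be sketched with plain Python. Because the encoder is fixed during linear evaluation, its outputs never change between epochs, so they can be cached. `CountingEncoder` and `cache_representations` below are hypothetical stand-ins, not part of CLMR.

```python
class CountingEncoder:
    """Hypothetical stand-in for a frozen encoder; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [sum(x) / len(x), max(x)]  # toy 2-dimensional "representation"


def cache_representations(encoder, batches):
    """Run the encoder once over all batches and keep (feature, label) pairs."""
    cached = []
    for inputs, labels in batches:
        for x, y in zip(inputs, labels):
            cached.append((encoder(x), y))
    return cached


toy_encoder = CountingEncoder()
toy_batches = [([[0.0, 1.0], [2.0, 4.0]], [0, 1]), ([[1.0, 1.0]], [0])]
features = cache_representations(toy_encoder, toy_batches)

# Several "epochs" over the cached features trigger no further encoder calls.
for epoch in range(5):
    for feature, label in features:
        pass  # a classifier update would go here
```

After five epochs the encoder has still been called only once per example, which is the saving this step is after.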

In [None]:
train_representations_dataset = module.extract_representations(train_loader)
train_loader = DataLoader(
    train_representations_dataset,
    batch_size=args.batch_size,
    num_workers=args.workers,
    shuffle=True,
)

valid_representations_dataset = module.extract_representations(valid_loader)
valid_loader = DataLoader(
    valid_representations_dataset,
    batch_size=args.batch_size,
    num_workers=args.workers,
    shuffle=False,
)

100%|██████████| 2/2 [00:51<00:00, 25.89s/it]
100%|██████████| 1/1 [00:24<00:00, 24.09s/it]


In [None]:
print(f"There are {len(train_representations_dataset)} representations, each with {len(train_representations_dataset[0][0])} dimensions, in the train dataset.")
print(f"There are {len(valid_representations_dataset)} representations, each with {len(valid_representations_dataset[0][0])} dimensions, in the validation dataset.")


There are 443 representations, each with 512 dimensions, in the train dataset.
There are 197 representations, each with 512 dimensions, in the validation dataset.


In [None]:
early_stop_callback = EarlyStopping(
    monitor="Valid/loss", patience=10, verbose=False, mode="min"
)

trainer = Trainer.from_argparse_args(
    args,
    logger=TensorBoardLogger(
        "runs", name="CLMRv2-eval-{}".format(args.dataset)
    ),
    max_epochs=args.finetuner_max_epochs,
    callbacks=[early_stop_callback],
)
trainer.fit(module, train_loader, valid_loader)


GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  "GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`."

  | Name              | Type             | Params
-------------------------------------------------------
0 | encoder           | SampleCNN        | 2.4 M 
1 | model             | Sequential       | 5.1 K 
2 | criterion         | CrossEntropyLoss | 0     
3 | accuracy          | Accuracy         | 0     
4 | average_precision | AveragePrecision | 0     
-------------------------------------------------------
5.1 K     Trainable params
2.4 M     Non-trainable params
2.4 M     Total params
9.495     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 42
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]
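
The `EarlyStopping` callback used above monitors `Valid/loss` with `patience=10` and `mode="min"`. The following is a simplified sketch of that patience rule, not PyTorch Lightning's implementation (which has further options such as a minimum delta); the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the epoch index at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


toy_losses = [1.0, 0.8, 0.9, 0.85, 0.81]  # hypothetical validation losses
stop_with_short_patience = early_stopping_epoch(toy_losses, patience=3)
stop_with_default_patience = early_stopping_epoch(toy_losses)
```

With a patience of 3 the toy run stops at epoch index 4, three epochs after the best loss of 0.8; with the default patience of 10 it never triggers on such a short run.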

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard
%tensorboard --logdir runs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 289), started 0:03:02 ago. (Use '!kill 289' to kill it.)

<IPython.core.display.Javascript object>

## Get accuracy score on test set

In [None]:
device = "cuda:0" if args.gpus else "cpu"
results = evaluate(
  module.encoder,
  module.model,
  contrastive_test_dataset,
  args.dataset,
  args.audio_length,
  device=device,
)
print(results)

100%|██████████| 290/290 [04:48<00:00,  1.01it/s]

{'Accuracy': 0.6275862068965518}





In [None]:
print(f"With a linear classifier, we reach an accuracy of: {results['Accuracy']*100:0.1f}%")

With a linear classifier, we reach an accuracy of: 62.8%
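
As a sanity check on the reported number, accuracy is the fraction of test items whose predicted genre matches the label. The sketch below assumes that each of the 290 progress-bar steps above corresponds to one test item; `accuracy` is a hypothetical helper, not the `clmr.evaluation` code.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal their label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


n_test = 290  # assumed from the progress bar: 290/290
reported_accuracy = 0.6275862068965518  # value printed by evaluate() above

# Number of correctly classified items implied by the reported accuracy.
n_correct = round(reported_accuracy * n_test)

# Toy check of the helper on hypothetical predictions.
toy_accuracy = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
```

Under that assumption, the reported value corresponds to 182 correct predictions out of 290.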


## Conclusion

Using a linear classifier trained on representations extracted from an encoder pre-trained with CLMR, the run recorded above reaches an accuracy of **62.8%**; the originally reported figure for this setup is **~68.3%**.

We encourage everyone to try the code and create their own methods for self-supervised learning in the field of music research. The code and pre-trained weights are available at: https://github.com/spijkervet/clmr

The paper and the supplementary materials can be found on [arXiv](https://arxiv.org/abs/2103.09410).

[![arXiv](https://img.shields.io/badge/arXiv-2103.09410-b31b1b.svg)](https://arxiv.org/abs/2103.09410)
[![Supplementary Material](https://img.shields.io/badge/Supplementary%20Material-2103.09410-blue.svg)](https://github.com/Spijkervet/CLMR/releases/download/2.1/CLMR.-.Supplementary.Material.pdf)


