<a href="https://colab.research.google.com/github/AchrafAsh/diva/blob/main/diva_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TODO — SimCLR

- Update the dataset to create only one sample per song
- Create a test dataset (myself singing the song)
- Create the Dataset with data augmentation
    1. Pick 2 random 10-sec clips from the list of songs
    2. Apply a random (different) transformation to each clip

<img width="400px" src="https://s3.us-west-2.amazonaws.com/secure.notion-static.com/a656c6f8-935a-4249-bdde-9ff8c25ea59d/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220314%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220314T084351Z&X-Amz-Expires=86400&X-Amz-Signature=c26af76493b960ee79043f5c7c63ff1caaaff229c3f76fa5635dc252b3a45bfc&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22Untitled.png%22&x-id=GetObject"/>

- Setup wandb for experiment tracking
- Run on GPU

Pipeline

1. select a song
2. sample 2 random clips from the song
3. apply a random (different) torchaudio transformation to each clip
4. get spectrogram of each clip
5. repeat for every song
6. create batch of 2N samples (N=number of songs)
7. feedforward
8. compute contrastive loss
9. backpropagation
10. rinse and repeat
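
Steps 1 to 6 can be sketched without any audio at all. The snippet below is only an illustration of the batching logic: the clip names, the transform labels and the `sample_positive_pairs` helper are made up, and the two transforms are drawn independently, so they can occasionally coincide.

```python
import random


def sample_positive_pairs(clips_by_song, transforms, rng):
    """Build two aligned lists of (song, clip, transform) views, one pair per song."""
    view_1, view_2 = [], []
    for song, clips in clips_by_song.items():
        # two distinct clips of the same song form a positive pair
        clip_a, clip_b = rng.sample(clips, 2)
        # independent draws: the two transforms may be identical
        t_a, t_b = rng.choice(transforms), rng.choice(transforms)
        view_1.append((song, clip_a, t_a))
        view_2.append((song, clip_b, t_b))
    return view_1, view_2


rng = random.Random(0)
clips_by_song = {  # hypothetical clip lists
    "say_so": ["clip_05.wav", "clip_06.wav", "clip_07.wav"],
    "tap_in": ["clip_05.wav", "clip_06.wav", "clip_07.wav"],
}
transforms = ["speed_0.8", "speed_1.2", "reverb"]  # labels only, no audio processing

view_1, view_2 = sample_positive_pairs(clips_by_song, transforms, rng)
batch = view_1 + view_2  # 2N entries: rows i and i + N come from the same song
```

With N songs the batch holds 2N views; every other view in the batch acts as a negative for a given row.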

## Building the Dataset

Using the top 10 songs on TikTok:
1. Laxed (Siren Beat) by Jawsh 685 and Jason Derulo
2. Savage and Savage Remix by Megan Thee Stallion, featuring Beyoncé
3. OUT WEST by JACKBOYS and Travis Scott, featuring Young Thug
4. WAP by Cardi B, featuring Megan Thee Stallion
5. Say So by Doja Cat
6. Tap In by Saweetie
7. The Box by Roddy Ricch
8. Rags2Riches by Rod Wave, featuring ATR Son Son
9. Supalonely by BENEE, featuring Gus Dapperton
10. What You Know Bout Love by Pop Smoke

### Convert MP3 to WAV

In [10]:
import os
import subprocess
import wave

Convert MP3 files to WAV if needed

In [None]:
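for filename in os.listdir('./songs'):
    # Only convert MP3 files; skip clip directories and already converted WAV files
    if not filename.endswith('.mp3'):
        continue
    song_name = filename[:-len('.mp3')]
    subprocess.call(['ffmpeg', '-i', f'./songs/{song_name}.mp3', f'./songs/{song_name}.wav'])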

Let's listen to a clip

In [None]:
import librosa
import IPython.display as ipd
import numpy as np

In [None]:
ipd.Audio('./songs/say_so.mp3')

In [None]:
x, sr = librosa.load('./songs/say_so/clip_03.wav')
ipd.Audio(x, rate=sr)

### Creating the 10-sec clip samples

In [None]:
# TODO - filter out empty clips (at the beginning/end)

In [None]:
def create_short_clips(root:str, song_filename:str, clip_length=10):
    os.mkdir(f"{root}/{song_filename.replace('.wav', '')}")

    song_duration = librosa.get_duration(filename=f"{root}/{song_filename}")
    start = 0
    end = clip_length
    clip_id = 1

    while end < song_duration:
        # file to extract the snippet from
        with wave.open(f'{root}/{song_filename}', 'rb') as infile:
            # get file data
            nchannels = infile.getnchannels()
            sampwidth = infile.getsampwidth()
            framerate = infile.getframerate()
            # set position in wave to start of segment
            infile.setpos(int(start * framerate))
            # extract data
            data = infile.readframes(int((end - start) * framerate))

        # write the extracted data to a new file
        with wave.open(f"{root}/{song_filename.replace('.wav','')}/clip_{clip_id:02d}.wav", "w") as outfile:
            outfile.setnchannels(nchannels)
            outfile.setsampwidth(sampwidth)
            outfile.setframerate(framerate)
            outfile.setnframes(int(len(data) / sampwidth))
            outfile.writeframes(data)
        
        start, end = end, end+clip_length
        clip_id += 1
    
    print(f"{clip_id-1} clips created in {root}/{song_filename.replace('.wav','')}")
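
To sanity-check how many clips a song should produce, here is a small sketch that reproduces only the start/end bookkeeping of the loop above (no file I/O). The `clip_boundaries` helper and the 157.3 s duration are made up for illustration.

```python
def clip_boundaries(song_duration, clip_length=10):
    """Return the (start, end) second marks of each full clip, like the while loop above."""
    bounds = []
    start, end = 0, clip_length
    while end < song_duration:  # strict '<': a clip ending exactly at the song's end is skipped
        bounds.append((start, end))
        start, end = end, end + clip_length
    return bounds


bounds = clip_boundaries(song_duration=157.3)  # hypothetical duration in seconds
num_clips = len(bounds)
```

Because of the strict comparison, the trailing partial clip is always dropped, and so is a last clip that would end exactly on the final second.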

In [None]:
for filename in os.listdir('./songs'):
    if len(filename.split('.')) > 1 and filename.split('.')[1] != 'wav': continue
    try:
        create_short_clips("./songs", filename)
    except Exception as e:
        print(e)

[Errno 17] File exists: './songs/savage_love'
[Errno 17] File exists: './songs/savage_remix'
[Errno 17] File exists: './songs/say_so'
[Errno 17] File exists: './songs/savage_love'
[Errno 17] File exists: './songs/savage_remix'
[Errno 17] File exists: './songs/out_west'
[Errno 17] File exists: './songs/what_you_know_bout_love'
[Errno 17] File exists: './songs/the_box'
[Errno 17] File exists: './songs/the_box'
[Errno 17] File exists: './songs/what_you_know_bout_love'
[Errno 17] File exists: './songs/out_west'
[Errno 17] File exists: './songs/say_so'
15 clips created in ./songs/tap_in


### Create a metadata dataframe

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(columns=[
                           "song",
                           "clip_id",
                           "clip_path",
                           "start_timestamp",
                           "end_timestamp"
])

In [None]:
CLIP_LENGTH = 10
songs = next(os.walk('./songs'))[1]
clips = []
for song in songs:
    for clip in os.listdir(f"./songs/{song}"):
        try:
            id = int(clip.replace(".wav","").split("_")[1])
            clips.append({
                "song": song,
                "clip_id": id,
                "clip_path": f"songs/{song}/{clip}",
                "start_timestamp": (id-1) * CLIP_LENGTH,
                "end_timestamp": id * CLIP_LENGTH
            })
        except Exception as e:
            print(e)
            print(song, clip)

df = pd.DataFrame(clips)

invalid literal for int() with base 10: 'checkpoints'
say_so .ipynb_checkpoints


### Save the dataset (not persistent in Colab)

In [6]:
!zip -r songs.zip songs


zip error: Invalid command arguments (nothing to select from)


### What is a spectrogram?

<img style="float:left" width="600px" src="https://upload.wikimedia.org/wikipedia/commons/c/c5/Spectrogram-19thC.png"/>

- what are the axes?

> Time on the x-axis and frequencies on the y-axis

- what does the color represent?

> The colors represent the intensity of the signal (dB)

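- how do you produce a spectrogram from an audio signal?

> Short-time Fourier transform (STFT): split the signal into short overlapping windowed frames, compute the FFT of each frame, and keep the (squared) magnitude of each frequency bin. The frames stacked along time form the spectrogram.

- is there any loss of information, i.e., is the process reversible?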

> A spectrogram contains no information about the exact, or even approximate, phase of the signal that it represents. For this reason, it is not possible to reverse the process and generate a copy of the original signal from a spectrogram, though in situations where the exact initial phase is unimportant it may be possible to generate a useful approximation of the original signal —from Wikipedia

Note: Humans perceive pitch roughly logarithmically, which is why we use mel spectrograms (frequency axis warped onto the mel scale) instead of linear-frequency spectrograms
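
To make the mel scale concrete, here is a minimal sketch of the Hz to mel mapping using the common HTK-style formula. The helper names are mine, and I am assuming (not verifying) that this matches what the spectrogram transform uses internally; the 11,025 Hz upper edge is simply half of a 22,050 Hz sample rate.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_mels):
    """Edge frequencies (Hz) of n_mels bands equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(0.0, 11_025.0, 64)
lowest_band_width = edges[1] - edges[0]
highest_band_width = edges[-1] - edges[-2]
```

The bands are narrow at low frequencies and wide at high frequencies, which is the "logarithmic hearing" idea expressed as a filter bank layout.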

### PyTorch Custom Dataset

In [11]:
import os
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

import torchaudio

import matplotlib.pyplot as plt

In [12]:
from typing import List

In [13]:
class SOXEffects():
    def __init__(self, target_sample_rate:int):
        # TODO - pass the effects as argument, this is too many for one transform
        self.effects = [
            ["lowpass", "-1", "300"],  # apply single-pole lowpass filter
            ["speed", "0.8"],  # reduce the speed
            # This only changes sample rate, so it is necessary to
            # add `rate` effect with original sample rate after this.
            ["rate", f"{target_sample_rate}"],
            ["reverb", "-w"],  # Reverberation gives some dramatic feeling
        ]
        self.target_sample_rate = target_sample_rate

    def __call__(self, waveform):
        waveform, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, self.target_sample_rate, self.effects)
        return waveform

In [14]:
class PlaybackSpeed():
    def __init__(self, target_sample_rate:int, speed:float):
        self.effects = [
            ["speed", str(speed)],
            # This only changes sample rate, so it is necessary to
            # add `rate` effect with original sample rate after this.
            ["rate", f"{target_sample_rate}"]
        ]
        self.target_sample_rate = target_sample_rate

    def __call__(self, waveform):
        waveform, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, self.target_sample_rate, self.effects)
        return waveform

In [15]:
class Reverbation():
    def __init__(self, target_sample_rate:int):
        self.effects = [
            ["reverb", "-w"],  # Reverberation gives some dramatic feeling
        ]
        self.target_sample_rate = target_sample_rate

    def __call__(self, waveform):
        waveform, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, self.target_sample_rate, self.effects)
        return waveform

In [16]:
class BackgroundNoise():
    def __init__(self, target_sample_rate:int, snr_db=3, noise_filepath:str="./background_noise.wav"):
        self.noise, _ = torchaudio.load(noise_filepath)
        self.snr_db = snr_db

    def __call__(self, waveform):
        noise = self.noise[:, : waveform.shape[1]]
        noise = F.pad(noise, (waveform.shape[1] - noise.shape[1], 0))

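        # L2 norms of the two signals (amplitudes, not powers)
        input_norm = waveform.norm(p=2)
        noise_norm = noise.norm(p=2)

        # Convert the SNR from dB to an amplitude ratio: snr_db = 20 * log10(ratio)
        snr = 10 ** (self.snr_db / 20)
        scale = snr * noise_norm / input_norm
        output = (scale * waveform + noise) / 2
        return output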
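
A quick way to check a noise-mixing transform is to measure the SNR that actually comes out. The sketch below does this on toy sine waves in plain Python; `mix_at_snr` and the test signals are hypothetical, and it assumes the SNR is defined on L2 norms (amplitudes), so the dB value converts with a factor of 20.

```python
import math


def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))


def mix_at_snr(signal, noise, snr_db):
    """Scale `signal` so that it sits `snr_db` decibels above `noise`, then average the two."""
    gain = 10 ** (snr_db / 20)  # dB -> amplitude ratio
    scale = gain * l2_norm(noise) / l2_norm(signal)
    scaled = [scale * s for s in signal]
    mixed = [(s + n) / 2 for s, n in zip(scaled, noise)]
    return scaled, mixed


# toy signals: a 5-cycle sine as "music" and a quieter 17-cycle cosine as "noise"
signal = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.1 * math.cos(2 * math.pi * 17 * t / 100) for t in range(100)]

scaled, mixed = mix_at_snr(signal, noise, snr_db=3)
measured_snr_db = 20 * math.log10(l2_norm(scaled) / l2_norm(noise))
```

If `measured_snr_db` does not come back equal to the requested value, the dB conversion is the first thing to look at.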

In [96]:
class TopSongDataset(Dataset):
    def __init__(self, rootdir:str, metadata_filepath:str, target_sample_rate:int, num_samples:int):
        self.rootdir = rootdir
        self.metadata = pd.read_csv(f"{rootdir}/{metadata_filepath}")
        self.songs = self.metadata['song'].unique()
        self.target_sample_rate = target_sample_rate
        self.num_samples = num_samples
        self.transforms = [
                           PlaybackSpeed(target_sample_rate, speed=0.8),
                           PlaybackSpeed(target_sample_rate, speed=1.2),
                           Reverbation(target_sample_rate),
                        #    BackgroundNoise(target_sample_rate, snr_db=3),
                        #    BackgroundNoise(target_sample_rate, snr_db=10)
        ]
        self.get_spectrogram = torchaudio.transforms.MelSpectrogram(
            sample_rate=self.target_sample_rate,
            n_fft=1024,
            hop_length=512,
            n_mels=64
        )

    def __len__(self):
        return len(self.songs)
    
    def get_sample(self, filepath:str):
        waveform, sample_rate = torchaudio.load(filepath)
        return waveform, sample_rate

    def preprocess_signal(self, signal, sample_rate):
        # clips may have different sample rates, so resample everything to the target rate
        if sample_rate != self.target_sample_rate:
            resampler = torchaudio.transforms.Resample(sample_rate, self.target_sample_rate)
            signal = resampler(signal)

        # the signal may have several channels (e.g. stereo), so average them down to mono
        if signal.shape[0] > 1:
            signal = torch.mean(signal, dim=0, keepdim=True)
        # force a fixed length: truncate long signals, zero-pad short ones
        if signal.shape[1] > self.num_samples:
            signal = signal[:,:self.num_samples]
        
        if signal.shape[1] < self.num_samples:
            # add padding
            missing_samples = self.num_samples - signal.shape[1]
            signal = F.pad(signal, (0, missing_samples))
        
        return signal

    def __getitem__(self, idx:int):
        song = self.songs[idx]
        # Select 2 random clips
        clips = self.metadata.query(f"song == '{song}'")
        idx_1, idx_2 = np.random.choice(range(4, len(clips)-4), 2, replace=False, p=None) # uniform distribution
        # idx_1, idx_2 = 5, 6 # TODO - move back to random clips
        sample_1, sample_2 = clips.iloc[idx_1].to_dict(), clips.iloc[idx_2].to_dict()
        
        signal_1, sr_1 = self.get_sample(
            os.path.join(self.rootdir, sample_1['clip_path'])
        )
        signal_2, sr_2 = self.get_sample(
            os.path.join(self.rootdir, sample_2['clip_path'])
        )

        # Resample, Trim/Padding, Mix
        signal_1 = self.preprocess_signal(signal_1, sr_1)
        signal_2 = self.preprocess_signal(signal_2, sr_2)

        sample_1['original'] = signal_1
        sample_2['original'] = signal_2

        # Select 2 random data transformations
        transform_1, transform_2 = np.random.choice(self.transforms, 2, replace=True, p=None) # uniform distribution
        signal_1, signal_2 = transform_1(signal_1), transform_2(signal_2)

        signal_1 = self.preprocess_signal(signal_1, self.target_sample_rate)
        signal_2 = self.preprocess_signal(signal_2, self.target_sample_rate)

        spectrogram_1, spectrogram_2 = self.get_spectrogram(signal_1), self.get_spectrogram(signal_2)

        sample_1['transform'] = type(transform_1).__name__
        sample_2['transform'] = type(transform_2).__name__
        sample_1['waveform'] = signal_1
        sample_2['waveform'] = signal_2
        sample_1['spectrogram'] = spectrogram_1
        sample_2['spectrogram'] = spectrogram_2

        return sample_1, sample_2
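
The mono mixdown and trim/pad steps of `preprocess_signal` are easy to get wrong silently, so here is the same logic on plain Python lists as a rough mental model. The `to_mono` and `fix_length` helpers are illustrative only and are not part of the dataset class.

```python
def to_mono(channels):
    """Average the channels sample by sample (channels: list of equal-length lists)."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def fix_length(samples, num_samples):
    """Truncate or right-pad with zeros so the result has exactly num_samples entries."""
    if len(samples) >= num_samples:
        return samples[:num_samples]
    return samples + [0.0] * (num_samples - len(samples))


stereo = [[1.0, 3.0, 5.0], [3.0, 5.0, 7.0]]  # toy 2-channel signal
mono = to_mono(stereo)
padded = fix_length(mono, 5)
trimmed = fix_length(mono, 2)
```

Note that the speed transforms change the clip length, which is why the fixed-length step has to run again after the augmentation.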

### Download dataset from zip

In [8]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [9]:
!unzip -q gdrive/MyDrive/songs.zip -d ./songs

### Instantiate Dataset and Dataloader

In [97]:
BATCH_SIZE = 10
# TODO - find the right values to get a 10-sec clip
SAMPLE_RATE = 22_050
NUM_SAMPLES = 220_500

dataset = TopSongDataset(rootdir="./songs", metadata_filepath="data.csv",
                         target_sample_rate=SAMPLE_RATE, num_samples=NUM_SAMPLES)

dataloader = DataLoader(dataset,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        num_workers=2)
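
Regarding the TODO above: the clip duration is simply `num_samples / sample_rate`, so the values can be checked with a few lines of arithmetic. The frame count assumes the STFT pads the signal so that frames are centred (which I believe is the torchaudio default); `CLIP_SECONDS`, `HOP_LENGTH` and `N_MELS` just restate values used elsewhere in this notebook.

```python
SAMPLE_RATE = 22_050   # samples per second
CLIP_SECONDS = 10      # desired clip duration
HOP_LENGTH = 512       # hop used by the mel spectrogram
N_MELS = 64            # number of mel bands

num_samples = SAMPLE_RATE * CLIP_SECONDS        # 220500 samples for 10 seconds
duration_seconds = num_samples / SAMPLE_RATE    # back to seconds as a sanity check
num_frames = num_samples // HOP_LENGTH + 1      # time steps, assuming centred frames
expected_shape = (1, N_MELS, num_frames)        # (channel, mel bands, time steps)
```

So 22,050 Hz with 220,500 samples does give 10-second clips, provided the file is actually resampled to 22,050 Hz before trimming.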

In [98]:
def fix_filepath(row):
    row["clip_path"] = row["clip_path"].replace("songs/","")
    return row

dataset.metadata = dataset.metadata.apply(fix_filepath, 1)

In [21]:
sample = dataset[0][0]

In [23]:
import IPython.display as ipd

In [24]:
ipd.Audio(sample['original'], rate=SAMPLE_RATE)

In [25]:
ipd.Audio(sample['waveform'], rate=SAMPLE_RATE)

In [None]:
for idx, sample in enumerate(dataloader):
    spectrogram = sample[0]["spectrogram"][0]
    song = sample[0]["song"]
    print(spectrogram.shape)
    for i in range(BATCH_SIZE):
        plt.figure()
        plt.title(f"Song: {song[i]} ({sample[0]['start_timestamp'][i]} - {sample[0]['end_timestamp'][i]})")
        p = plt.imshow(spectrogram.log2()[i,:,:].detach().numpy())
    
    if idx == 0: break

## Building SimCLR baseline

1. Data augmentation
2. ResNet base encoder (backbone)
3. Projection head: Dense-ReLU-Dense
4. Contrastive Loss: Normalized Temperature-scaled Cross Entropy (NT-Xent)

### Model definition

In [26]:
class SimCLR(nn.Module):
    def __init__(self): # TODO - use variables for hidden/output dimensions
        super().__init__()
        self.conv1 = nn.Conv2d(1, 3, kernel_size=(7,7), stride=(2,2), padding=(3,3), bias=False)
        self.backbone = torch.hub.load('pytorch/vision:v0.10.0', 'resnet34', pretrained=True)
        self.backbone.fc = nn.Linear(in_features=512, out_features=1024)

        self.projection = nn.Sequential(
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 124) # TODO - reduce output dimension
        )
    
    def forward(self, x):
        x = self.conv1(x)
        h = self.backbone(x)
        z = self.projection(h)
        return z

#### Quick and dirty debugging of the model

In [27]:
simclr_model = SimCLR()

Downloading: "https://github.com/pytorch/vision/archive/v0.10.0.zip" to /root/.cache/torch/hub/v0.10.0.zip
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth


  0%|          | 0.00/83.3M [00:00<?, ?B/s]

In [None]:
simclr_model(next(iter(dataloader))[0]['spectrogram']).shape

torch.Size([7, 256])

Alright we can do a forward step 🎉

Now let's run a proper training!

### Training Loop

First we need to define the _Normalized Temperature-scaled Cross Entropy Loss (NT-Xent)_:

$$
\ell_{i,j} = -\log\frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k\neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}
$$

$$
\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell_{2k, 2k-1} + \ell_{2k-1, 2k} \right]
$$

In [28]:
# From pytorch lightning examples
def nt_xent_loss(out_1, out_2, temperature, eps=1e-6):
    """
    NT-Xent loss for a batch of positive pairs (out_1[i], out_2[i]).
    Inputs do not need to be normalized; they are L2-normalized below.
    out_1: [batch_size, dim]
    out_2: [batch_size, dim]
    """
    out_1, out_2 = F.normalize(out_1, dim=1), F.normalize(out_2, dim=1)

    out = torch.cat([out_1, out_2], dim=0)
    
    # cov and sim: [2 * batch_size, 2 * batch_size]
    # neg: [2 * batch_size]
    cov = torch.mm(out, out.t().contiguous())
    sim = torch.exp(cov / temperature)
    neg = sim.sum(dim=-1)

    # from each row, subtract e^(1/temp) to remove similarity measure for x1.x1
    # (full_like keeps the tensor on the same device and dtype as neg)
    row_sub = torch.full_like(neg, float(np.exp(1 / temperature)))
    neg = torch.clamp(neg - row_sub, min=eps)  # clamp for numerical stability

    # Positive similarity, pos becomes [2 * batch_size]
    pos = torch.exp(torch.sum(out_1 * out_2, dim=-1) / temperature)
    pos = torch.cat([pos, pos], dim=0)

    loss = -torch.log(pos / (neg + eps)).mean()

    return loss

#### Testing the custom loss on dummy data

In [None]:
z_1 = [0, 1, 0]
z_2 = [0, 0, 1]
z_3 = [0, 1, 0]
z_4 = [0, 0, 1]

tau = 0.1

sim = [
       [1/tau, 0, 1/tau, 0],
       [0, 1/tau, 0, 1/tau],
       [1/tau, 0, 1/.1, 0],
       [0, 1/tau, 0, 1/tau],
]

In [None]:
l = [[
     -np.log(
         np.exp(sim[i][j]) / np.sum([ np.exp(sim[i][k]) if k!=i else 0 for k in range(4) ])
     ) for i in range(4)
] for j in range(4)]

In [None]:
np.sum([ l[k][k+2] + l[k+2][k] for k in range(2) ]) / 4

9.079573746728087e-05

In [None]:
nt_xent_loss(torch.tensor([z_1,z_2], dtype=torch.float), torch.tensor([z_3,z_4], dtype=torch.float), temperature=0.1)

tensor(9.0782e-05)
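
For reference, the same check can be written as a direct, slow transcription of the formula in plain Python. This is only a sketch: `nt_xent_reference` is a made-up helper, and positives are paired as rows `i` and `i + N` (the concatenation order used in `nt_xent_loss`) instead of the `(2k-1, 2k)` indexing of the equation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent_reference(views_1, views_2, temperature):
    """Average NT-Xent loss over all 2N views; views_1[i] and views_2[i] are positives."""
    z = views_1 + views_2
    n = len(views_1)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive for row i
        # denominator: every other view in the batch (k != i), positive included
        denom = sum(
            math.exp(cosine(z[i], z[k]) / temperature) for k in range(2 * n) if k != i
        )
        total += -math.log(math.exp(cosine(z[i], z[j]) / temperature) / denom)
    return total / (2 * n)


loss = nt_xent_reference(
    [[0, 1, 0], [0, 0, 1]], [[0, 1, 0], [0, 0, 1]], temperature=0.1
)
```

For these orthogonal toy vectors every row has one positive with similarity 1 and two negatives with similarity 0, so the loss reduces to `log(1 + 2 * exp(-1 / tau))`, which is tiny for `tau = 0.1`.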

#### Utility functions

In [29]:
from tqdm.auto import tqdm

In [95]:
def build_catalog(model, dataset):
    # TODO - index every clip so I can find the exact match and increase the likelihood of finding the right one
    catalog = {'song':[],'index':[], 'waveform': []}
    for idx, pair_samples in enumerate(dataset):
        v1 = model(pair_samples[0]['spectrogram'].unsqueeze(0))
        v2 = model(pair_samples[1]['spectrogram'].unsqueeze(0))
        # v = torch.mean(torch.cat([v1, v2]), dim=0)
        catalog['song'].append(pair_samples[0]['song'])
        catalog['song'].append(pair_samples[0]['song'])
        catalog['index'].append(v1)
        catalog['index'].append(v2)
        catalog['waveform'].append(pair_samples[0]['original'])
        catalog['waveform'].append(pair_samples[1]['original'])
    return {
        'song': np.array(catalog['song']),
        'waveform': torch.cat(catalog['waveform']),
        'index': F.normalize(torch.cat(catalog['index']))
    }

In [42]:
# Evaluate Easy mode
def fake_evaluation(model, dataset, n_iter=4):
    model.eval()
    recall_k = {'1':[], '2':[], '3':[]}
    
    catalog = build_catalog(model, dataset)
    
    for i in range(len(dataset)):
        sample, _ = dataset[i]
        projection = model(sample['spectrogram'].unsqueeze(0))

        similarities = F.normalize(projection) @ catalog['index'].T # Normalizing to avoid unscaled projections
        sorted_index = torch.argsort(similarities, descending=True)
        sorted_songs = catalog['song'][sorted_index.numpy().flatten()]
        
        recall_k['1'].append(np.isin(sample['song'], sorted_songs[:1]))
        recall_k['2'].append(np.isin(sample['song'], sorted_songs[:2]))
        recall_k['3'].append(np.isin(sample['song'], sorted_songs[:3]))

    return {f'{k}': np.mean(recall_k[f'{k}']) for k in range(1,4)}
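
As a reminder of what Recall@k measures here: for each query, check whether the true song appears among the k most similar catalog entries, then average over queries. A toy version with hand-written rankings (the `recall_at_k` helper and the rankings are hypothetical):

```python
def recall_at_k(ranked_songs_per_query, true_songs, k):
    """Fraction of queries whose true song appears in the top-k ranked results."""
    hits = [true in ranked[:k] for ranked, true in zip(ranked_songs_per_query, true_songs)]
    return sum(hits) / len(hits)


# each row: catalog songs sorted by decreasing similarity to one query
rankings = [
    ["say_so", "tap_in", "the_box"],
    ["the_box", "tap_in", "say_so"],
    ["out_west", "say_so", "tap_in"],
]
truth = ["say_so", "tap_in", "tap_in"]

recalls = {k: recall_at_k(rankings, truth, k) for k in (1, 2, 3)}
```

Since the catalog in `fake_evaluation` is built from the training songs themselves, these numbers are an optimistic "easy mode" estimate, not a held-out test.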

In [31]:
def train(model, optimizer, train_loader, test_loader=None, num_epochs=100, log_step=10):
    for epoch in tqdm(range(num_epochs)):
        model.train()
        
        epoch_loss = 0
        for batch, pair_samples in enumerate(train_loader):
            optimizer.zero_grad()
            
            projections_1, projections_2 = model(pair_samples[0]['spectrogram']), model(pair_samples[1]['spectrogram'])
            loss = nt_xent_loss(projections_1, projections_2, temperature=0.1)

            loss.backward()
            optimizer.step()

            epoch_loss += float(loss.item())

        if (epoch+1) % log_step == 0:
            recall_k = fake_evaluation(model, dataset, n_iter=3)
            print(f"Epoch [{epoch+1}/{num_epochs}]: Loss — {epoch_loss} - Recall@[1,2,3] [{recall_k['1']} - {recall_k['2']} - {recall_k['3']}]")

In [34]:
class LARS(torch.optim.Optimizer):
    def __init__(
        self,
        params,
        lr,
        momentum=0,
        dampening=0,
        weight_decay=0,
        nesterov=False,
        trust_coefficient=0.001,
        eps=1e-8,
    ):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if momentum < 0.0:
            raise ValueError(f"Invalid momentum value: {momentum}")
        if weight_decay < 0.0:
            raise ValueError(f"Invalid weight_decay value: {weight_decay}")

        defaults = dict(
            lr=lr,
            momentum=momentum,
            dampening=dampening,
            weight_decay=weight_decay,
            nesterov=nesterov,
            trust_coefficient=trust_coefficient,
            eps=eps,
        )
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")

        super().__init__(params, defaults)

    def __setstate__(self, state):
        super().__setstate__(state)

        for group in self.param_groups:
            group.setdefault("nesterov", False)

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step.
        Args:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        # exclude scaling for params with 0 weight decay
        for group in self.param_groups:
            weight_decay = group["weight_decay"]
            momentum = group["momentum"]
            dampening = group["dampening"]
            nesterov = group["nesterov"]

            for p in group["params"]:
                if p.grad is None:
                    continue

                d_p = p.grad
                p_norm = torch.norm(p.data)
                g_norm = torch.norm(p.grad.data)

                # lars scaling + weight decay part
                if weight_decay != 0:
                    if p_norm != 0 and g_norm != 0:
                        lars_lr = p_norm / (g_norm + p_norm * weight_decay + group["eps"])
                        lars_lr *= group["trust_coefficient"]

                        d_p = d_p.add(p, alpha=weight_decay)
                        d_p *= lars_lr

                # sgd part
                if momentum != 0:
                    param_state = self.state[p]
                    if "momentum_buffer" not in param_state:
                        buf = param_state["momentum_buffer"] = torch.clone(d_p).detach()
                    else:
                        buf = param_state["momentum_buffer"]
                        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                    if nesterov:
                        d_p = d_p.add(buf, alpha=momentum)
                    else:
                        d_p = buf

                p.add_(d_p, alpha=-group["lr"])

        return loss
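
The core of LARS is the per-layer scaling factor computed in `step`. The sketch below isolates that one formula on plain lists so the effect is easy to see; `lars_local_lr` is an illustrative helper, not part of the optimizer class, and the weights and gradients are toy values.

```python
import math


def lars_local_lr(weights, grads, weight_decay, trust_coefficient=0.001, eps=1e-8):
    """Layer-wise factor: trust_coefficient * ||w|| / (||g|| + weight_decay * ||w|| + eps)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm == 0 or g_norm == 0:
        return 1.0  # no scaling when either norm vanishes
    return trust_coefficient * w_norm / (g_norm + weight_decay * w_norm + eps)


weights = [3.0, 4.0]                       # ||w|| = 5
small_grads = [0.3, 0.4]                   # ||g|| = 0.5
large_grads = [g * 10 for g in small_grads]

lr_small = lars_local_lr(weights, small_grads, weight_decay=1e-6)
lr_large = lars_local_lr(weights, large_grads, weight_decay=1e-6)
```

Layers with large gradients relative to their weights get a smaller factor, which is what lets a global learning rate as high as 0.3 remain usable.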

#### Run training

In [79]:
model = SimCLR()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-6)
optimizer = LARS(model.parameters(), lr=0.3, weight_decay=1e-6)
train(model, optimizer, dataloader, test_loader=None, num_epochs=800, log_step=100)

Using cache found in /root/.cache/torch/hub/pytorch_vision_v0.10.0


  0%|          | 0/800 [00:00<?, ?it/s]

Epoch [100/800]: Loss — 2.1149845123291016 - Recall@[1,2,3] [0.47619047619047616 - 0.5714285714285714 - 0.6666666666666666]
Epoch [200/800]: Loss — 1.3887320756912231 - Recall@[1,2,3] [0.7142857142857143 - 0.7619047619047619 - 0.8571428571428571]
Epoch [300/800]: Loss — 0.6643391847610474 - Recall@[1,2,3] [0.9523809523809523 - 1.0 - 1.0]
Epoch [400/800]: Loss — 0.3392679691314697 - Recall@[1,2,3] [1.0 - 1.0 - 1.0]
Epoch [500/800]: Loss — 0.006332626100629568 - Recall@[1,2,3] [0.9047619047619048 - 0.9047619047619048 - 0.9523809523809523]
Epoch [600/800]: Loss — 0.00023521219554822892 - Recall@[1,2,3] [0.9523809523809523 - 0.9523809523809523 - 1.0]
Epoch [700/800]: Loss — 0.00017056346405297518 - Recall@[1,2,3] [1.0 - 1.0 - 1.0]
Epoch [800/800]: Loss — 0.00011956570961046964 - Recall@[1,2,3] [1.0 - 1.0 - 1.0]


#### Saving the trained model

In [81]:
from datetime import datetime
now = lambda: int(datetime.now().timestamp())

In [121]:
torch.save(model.state_dict(), f"simclr_{now()}.pt")

In [None]:
# Export the model with ONNX
sample = dataset[0][0]['spectrogram'].unsqueeze(0)

torch.onnx.export(
    model,                     # model being run
    sample,           # model input (or a tuple for multiple inputs)
    f"simclr_{now()}.onnx",      # where to save the model (can be a file or file-like object)
    export_params=True,        # store the trained parameter weights inside the model file
    input_names = ['spectrogram'],   # the model's input names
    output_names = ['output']) # the model's output names

### Let's go in debugging mode...

1. Make sure the loss is what's expected with simple projection vectors (1s and 0s)
2. Listen to the audio clips to check if the task is too hard
3. Try using a small untrained resnet

Insights:

1. Look at the signals to find the bad ones: some clips are silent → test whether selecting only (good) clips works better
2. Make sure the sampling rate and number of samples are large enough: I might be getting only 1-sec clips → learn the basics of signal processing

In [None]:
sample_1,sample_2 = dataset[0]

In [None]:
x, sr = librosa.load(os.path.join('./songs', sample_1['clip_path']))
ipd.Audio(x, rate=sr)

In [None]:
x, sr = librosa.load(os.path.join('./songs', sample_2['clip_path']))
ipd.Audio(x, rate=sr)

In [None]:
sample_2

{'clip_id': 15,
 'clip_path': 'tap_in/clip_15.wav',
 'end_timestamp': 150,
 'song': 'tap_in',
 'spectrogram': tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]),
 'start_timestamp': 140}

## Testing the model on new clips

In [51]:
import librosa

In [102]:
# Testing a new audio
subprocess.call(['ffmpeg', '-i', f'./test_say_so.mp3', f'./test_say_so.wav'])

0

In [49]:
# Create a 10-sec clip
def create_test_clip(filepath:str):
    with wave.open(filepath, 'rb') as infile:
        # get file data
        nchannels = infile.getnchannels()
        sampwidth = infile.getsampwidth()
        framerate = infile.getframerate()
        # set position in wave to start of segment
        infile.setpos(int(50 * framerate))
        # extract data
        data = infile.readframes(int((60 - 50) * framerate))

    # write the extracted data to a new file
    with wave.open(f"./test_clip.wav", "w") as outfile:
        outfile.setnchannels(nchannels)
        outfile.setsampwidth(sampwidth)
        outfile.setframerate(framerate)
        outfile.setnframes(int(len(data) / sampwidth))
        outfile.writeframes(data)
    
    # return the player so the notebook actually displays it
    x, sr = librosa.load('./test_clip.wav')
    return ipd.Audio(x, rate=sr)

In [103]:
# create_test_clip("./test_savage_love.wav")
create_test_clip("./test_say_so.wav")

In [104]:
signal, sr = dataset.get_sample("./test_clip.wav")
signal = dataset.preprocess_signal(signal, sr)
spectrogram = dataset.get_spectrogram(signal)

In [105]:
# Running inference!
test_projection = model(spectrogram.unsqueeze(0))
catalog = build_catalog(model, dataset)
similarities = F.normalize(test_projection) @ catalog['index'].T
print(similarities)
sorted_index = torch.argsort(similarities, descending=True)
sorted_songs = catalog['song'][sorted_index.numpy().flatten()]
sorted_songs

tensor([[-0.1643, -0.1468,  0.2992,  0.3459, -0.1709, -0.1947, -0.2716, -0.2512,
         -0.2374, -0.2183, -0.2639, -0.2542,  0.8020,  0.7720]],
       grad_fn=<MmBackward0>)


array(['say_so', 'say_so', 'savage_remix', 'savage_remix', 'tap_in',
       'tap_in', 'savage_love', 'savage_love', 'what_you_know_bout_love',
       'what_you_know_bout_love', 'out_west', 'the_box', 'the_box',
       'out_west'], dtype='<U23')

In [127]:
library = catalog.copy()

In [128]:
library['song'] = library['song'].tolist()

In [129]:
library['waveform'] = library['waveform'].tolist()

In [130]:
library['index'] = library['index'].tolist()

In [132]:
import json

with open('library.json', 'w') as json_file:
    json.dump(library, json_file)  # json.dump writes to the file and returns None

In [106]:
ipd.Audio(catalog['waveform'][sorted_index[0][1]], rate=SAMPLE_RATE)  # catalog waveforms were resampled to SAMPLE_RATE

In [107]:
ipd.Audio(signal, rate=SAMPLE_RATE)  # signal was resampled by preprocess_signal

---

In [None]:
# all imports
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
    const reader = new FileReader()
    reader.onloadend = e => resolve(e.srcElement.result)
    reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
    stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    recorder = new MediaRecorder(stream)
    chunks = []
    recorder.ondataavailable = e => chunks.push(e.data)
    recorder.start()
    await sleep(time)
    recorder.onstop = async ()=>{
        blob = new Blob(chunks)
        text = await b2text(blob)
        resolve(text)
    }
    recorder.stop()
})
"""

def record(sec=3):
    display(Javascript(RECORD))
    s = output.eval_js('record(%d)' % (sec*1000))
    b = b64decode(s.split(',')[1])
    with open('record.wav','wb') as f:
        f.write(b)
    return 'record.wav'  # or webm ?

In [None]:
recording_filepath = record(10)

<IPython.core.display.Javascript object>

In [None]:
def run_shazam(model:nn.Module, filepath:str):
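    # process recording: librosa returns a mono 1-D NumPy array, so convert it
    # to a [1, num_samples] tensor before reusing the dataset preprocessing
    signal, sr = librosa.load(filepath)
    signal = torch.from_numpy(signal).unsqueeze(0)
    signal = dataset.preprocess_signal(signal, sr)
    spectrogram = dataset.get_spectrogram(signal)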
    
    # run inference
    test_projection = model(spectrogram.unsqueeze(0))
    
    # find best match
    catalog = build_catalog(model, dataset)
    similarities = F.softmax(test_projection @ catalog['index'].T, dim=1)
    print(similarities)
    sorted_index = torch.argsort(similarities, descending=True)
    sorted_songs = catalog['song'][sorted_index.numpy().flatten()]
    
    return sorted_songs

NameError: ignored

In [None]:
run_shazam(model, './record.wav')

In [None]:
x, sr = librosa.load('./songs/../record.wav')
ipd.Audio(x, rate=sr)

