# Generate Music With Diffusion Models
In this notebook we implement a Diffusion Probabilistic Model that generates music as raw waveforms.\
A Diffusion Model learns to invert a Markov process that progressively adds Gaussian noise to the signal until the signal is completely destroyed, as shown in the figure.

<center><img src="./images/diff_model.png"></img></center>

The forward process $q(x_t | x_{t-1})$ is fixed, and it can be parametrized as:

$$
q(x_t|x_{t-1}) = N(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_t I)
$$
where $\beta_1,\dots,\beta_T$ are given by a fixed variance schedule (in our case the betas increase linearly from $10^{-4}$ to $0.02$, as in [1]).
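
A minimal NumPy sketch of this schedule and of a single forward step may help make the formula concrete. The number of steps `T = 1000`, the helper name `forward_step`, and the toy sine "waveform" are assumptions for illustration only; they are not taken from the notebook's `params.py`.

```python
import numpy as np

# Hypothetical number of diffusion steps (not stated in this section).
T = 1000

# Linear variance schedule from 1e-4 to 0.02, as described above.
betas = np.linspace(1e-4, 0.02, T)


def forward_step(x_prev, t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I); t is 1-based."""
    beta_t = betas[t - 1]
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise


rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))  # toy stand-in for a waveform
for t in range(1, T + 1):
    x = forward_step(x, t, rng)  # after T steps x is close to pure noise
```

Applying the step repeatedly is only meant to mirror the Markov chain; it is not how training samples would normally be drawn.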

If the steps are small enough, the backward process is also Gaussian, of the form:

$$
p_\theta(x_{t-1}|x_t) = N(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t))
$$
where $\mu_\theta$ and $\Sigma_\theta$ are parametric functions with parameters $\theta$.

Diffusion Models are latent variable models of the form:
$$ p_\theta(x_0) = \int p_\theta(x_{0:T}) dx_{1:T}$$
where $x_1,\dots,x_T$ are latents with the same dimensionality as the data $x_0 \sim q(x_0)$.
However, $p_\theta(x_0)$ cannot be optimized by directly maximizing the log-likelihood, because the integral is intractable. Hence we train the model by maximizing the ELBO:
$$
E_{q(x_{1:T}|x_0)}\left[\log{\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}}\right]
$$

Maximizing the ELBO can be reduced to minimizing a loss that directly compares the learned backward process $p_\theta(x_{t-1}|x_t)$ with the posterior $q(x_{t-1}|x_t,x_0) = N(x_{t-1};\tilde{\mu}_t(x_t,x_0),\tilde{\beta}_tI)$, which is tractable when conditioned on $x_0$.
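
As a sketch of that reduction, assume (as in [1]) that the covariance of the backward process is fixed to $\Sigma_\theta(x_t,t) = \sigma_t^2 I$ rather than learned; whether this notebook makes the same choice is not stated here. Under that assumption the KL divergence between the two Gaussians depends only on their means, so each term of the loss takes the form:

$$
L_{t-1} = E_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t,x_0) - \mu_\theta(x_t,t)\right\|^2\right] + C
$$
where $C$ is a constant that does not depend on $\theta$.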

We adopted the same reparametrization introduced in [1], according to which:
$$
\tilde{\mu}_t(x_t,x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t
$$
$$
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t
$$
$$
\bar{\alpha}_t = \prod_{s=1}^t \alpha_s
$$
$$
\alpha_t = 1-\beta_t
$$
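
The quantities above can be precomputed once for every step. The following NumPy sketch does so under the assumption of `T = 1000` steps (a hypothetical value) and uses the convention $\bar{\alpha}_0 = 1$ (the empty product); the names `coef_x0`, `coef_xt`, `beta_tilde` and `posterior_mean` are illustrative and are not the notebook's own.

```python
import numpy as np

T = 1000  # hypothetical number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)

alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bar = np.cumprod(alphas)       # alpha_bar_t = prod_{s<=t} alpha_s
# alpha_bar_{t-1}, with alpha_bar_0 = 1 (empty product)
alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))

# Coefficients of x_0 and x_t in the posterior mean, and the posterior variance.
coef_x0 = np.sqrt(alpha_bar_prev) * betas / (1.0 - alpha_bar)
coef_xt = np.sqrt(alphas) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas


def posterior_mean(x0, xt, t):
    """Mean of q(x_{t-1} | x_t, x_0) for a 1-based step index t."""
    return coef_x0[t - 1] * x0 + coef_xt[t - 1] * xt
```

At $t = 1$ the coefficients reduce to 1 for $x_0$ and 0 for $x_t$, with $\tilde{\beta}_1 = 0$, which matches the intuition that the posterior over $x_0$ given $x_0$ is deterministic.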


In [1]:
import tensorflow as tf
from params import params
from data.dataset import get_wav
import math
import subprocess



## Dataset
To train the model we used a small portion of the <a href="https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification">GTZAN</a> dataset (about 40 tracks), because training on the whole dataset was too expensive. The GTZAN dataset contains 1000 tracks categorized into 10 genres, although the genre labels are not used in this project.
### Preprocessing
Each sample is a mono (single-channel) audio file, 30 seconds long, with a sampling rate of 22.5 kHz, so each sample is a tensor of shape (675000,1). This sequence length was too long to efficiently train the neural network described later, so each audio file was downsampled by a factor of 3. To prevent aliasing, the signal was passed through a low-pass filter before downsampling; the filter was implemented from scratch so that it is fully integrated in the data pipeline. It is implemented by the method `data.dataset.low_pass(signal,fc,npoints=128)`, which takes a signal of shape (seq_len,n_channels) and a cutoff frequency `fc`, and returns the signal with the frequency components above the cutoff removed. To further reduce the sequence length, each downsampled audio file was split into 6 samples.
The data pipeline is implemented by the method `data.dataset.get_unlabelled_dataset(bs=None,nsplits=1,downsample=1)`, and the parameter values used for training are stored in the file `params.py`.
To train the model efficiently, the dataset was cached in the `dataset/unlabelled_10` directory.
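
The filter, downsample and split steps described above can be sketched in NumPy as follows. This is not the code in `data.dataset`: the windowed-sinc design, the Hamming window, the cutoff placed at the new Nyquist frequency, the truncation of leftover samples, and the names `low_pass_sketch` and `preprocess_sketch` are all assumptions made for illustration.

```python
import numpy as np


def low_pass_sketch(signal, fc, sr, npoints=128):
    """Windowed-sinc FIR low-pass filter.

    signal has shape (seq_len, n_channels); fc and sr are in Hz.
    """
    n = np.arange(npoints + 1) - npoints / 2.0       # symmetric tap indices
    h = 2.0 * fc / sr * np.sinc(2.0 * fc / sr * n)    # ideal low-pass response
    h *= np.hamming(npoints + 1)                      # window to limit ripple
    h /= h.sum()                                      # unit gain at DC
    return np.stack(
        [np.convolve(signal[:, c], h, mode="same") for c in range(signal.shape[1])],
        axis=1,
    )


def preprocess_sketch(signal, sr, downsample=3, nsplits=6):
    """Low-pass, keep every `downsample`-th sample, then split into chunks."""
    fc = sr / (2.0 * downsample)                      # Nyquist frequency after downsampling
    filtered = low_pass_sketch(signal, fc, sr)
    decimated = filtered[::downsample]
    usable = (decimated.shape[0] // nsplits) * nsplits  # drop any remainder
    return decimated[:usable].reshape(nsplits, -1, signal.shape[1])


# Toy input: one second of a 440 Hz tone plus a 9 kHz tone at 22.5 kHz.
sr = 22500
t = np.arange(sr) / sr
audio = (np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 9000 * t))[:, None]
chunks = preprocess_sketch(audio, sr)
print(chunks.shape)  # (6, 1250, 1)
```

In this toy input the 9 kHz component lies above the 3.75 kHz cutoff, so it is strongly attenuated before decimation instead of aliasing into the audible band.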

In [None]:
# Set this counter to the number of batches (4 samples each)
# you want to listen to.
count = 1

# Shape of one cached batch: (batch size, samples per split after downsampling, channels).
seq_len = math.ceil((params["NSAMPLES"] // params["NSPLITS"]) / params["DOWNSAMPLE"])
spec = tf.TensorSpec(shape=(params["BS"], seq_len, 1), dtype=tf.float32)
ds = tf.data.experimental.load("./dataset/unlabelled_10", spec)
for curr,x in enumerate(ds):
    if curr >= count:
        break
    for i in range(x.shape[0]):
        wav = get_wav(x[i],params["SR"]//params["DOWNSAMPLE"])
        subprocess.run(["ffplay","-"],input=wav.numpy())

## Training

## Model

## Results

## Bibliography
- [1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
