<a href="https://colab.research.google.com/github/eloimoliner/CQTdiff/blob/main/notebook/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Solving Audio Inverse Problems with a Diffusion Model

This notebook is a demo of the diffusion-based method for solving audio inverse problems proposed in:

> E. Moliner, J. Lehtinen and V. Välimäki, "Solving audio inverse problems with a diffusion model", submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes, Greece, May 2023.

Listen to our [audio samples](http://research.spa.aalto.fi/publications/papers/icassp23-cqt-diff/)

### Instructions for running:

* Make sure to use a GPU runtime: click __Runtime >> Change Runtime Type >> GPU__
* Press ▶️ on the left of each cell to run it
* View the code: double-click any of the cells
* Hide the code: double-click the right side of the cell


In [None]:
!git clone https://github.com/eloimoliner/CQTdiff.git
%cd CQTdiff
!bash download_weights_and_examples.sh

Cloning into 'CQTdiff'...
remote: Enumerating objects: 352, done.
remote: Counting objects: 100% (352/352), done.
remote: Compressing objects: 100% (237/237), done.
remote: Total 352 (delta 165), reused 281 (delta 106), pack-reused 0
Receiving objects: 100% (352/352), 278.94 KiB | 1.25 MiB/s, done.
Resolving deltas: 100% (165/165), done.
/content/CQTdiff
--2022-10-25 15:14:31--  https://github.com/eloimoliner/CQTdiff/releases/download/weights_and_examples/cqt_weights.pt
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/544841884/fd6c8e11-47e2-44e0-9f1d-6146ccb74457?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221025%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221025T151431Z&X-Amz-Expires=300&X-Amz-Signature=43e19c8c62ac4ba109a4bcc149c88

In [None]:
#@title #Setup environment

#@markdown Execute this cell to set up the environment
#! git clone git@github.com:eloimoliner/CQTdiff.git
#%cd gramophone_noise_synth
#! wget https://github.com/eloimoliner/gramophone_noise_synth/releases/download/gramophonediff/weights-750000.pt
#! mkdir experiments
#! mkdir experiments/trained_model
#! mv weights-750000.pt experiments/trained_model/

!pip install omegaconf
! pip install dotmap
! pip install Ninja

import os
#import hydra
import logging
import torch
import torchaudio
torch.cuda.empty_cache() 
import soundfile as sf

from omegaconf import OmegaConf
from omegaconf.omegaconf import open_dict
import numpy as np
from datetime import date

#from learner import Learner
#from model import UNet
import IPython

from tqdm import tqdm

import scipy.signal


import yaml
from pathlib import Path
from dotmap import DotMap

import glob
from IPython.display import Audio 

args = yaml.safe_load(Path('conf/conf.yaml').read_text())
args = DotMap(args)


device=torch.device("cuda" if torch.cuda.is_available() else "cpu")

dirname = os.getcwd()

#define the path where weights will be loaded and audio samples and other logs will be saved
args.model_dir = os.path.join(dirname, str(args.model_dir))
if not os.path.exists(args.model_dir):
    os.makedirs(args.model_dir)


args.architecture="unet_CQT" 
args.inference.checkpoint="cqt_weights.pt"

args.sample_rate=22050
args.resample_factor=1
args.inference.load.load_mode="from_directory"

#mkdir examples_dir
#copy the files there from somewhere
args.inference.load.data_directory=os.path.join(dirname,"data_dir")
args.inference.load.seg_idx=0

args.inference.load.seg_size=65536            

args.cqt.numocts=7
args.diffusion_parameters.sigma_data=0.057
args.cqt.use_norm=False


#import src.utils.setup as utils_setup
#test_set = utils_setup.get_test_set_for_sampling(args)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting omegaconf
  Downloading omegaconf-2.2.3-py3-none-any.whl (79 kB)
Collecting antlr4-python3-runtime==4.9.*
  Downloading antlr4-python3-runtime-4.9.3.tar.gz (117 kB)
Building wheels for collected packages: antlr4-python3-runtime
  Building wheel for antlr4-python3-runtime (setup.py) ... done
  Created wheel for antlr4-python3-runtime: filename=antlr4_python3_runtime-4.9.3-py3-none-any.whl size=144575 sha256=506371ba31609b0104f352a5768ed5d632869c292ef0ae78e3c429fc194de626
  Stored in directory: /root/.cache/pip/wheels/8b/8d/53/2af8772d9aec614e3fc65e53d4a993ad73c61daa8bbd85a873
Successfully built antlr4-python3-runtime
Installing collected packages: antlr4-python3-runtime, omegaconf
Successfully installed antlr4-python3-runtime-4.9.3 omegaconf-2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dotmap
  Downloading dotmap-1.3.30-py3-none-any.whl (11 kB)
Installing collected packages: dotmap
Successfully installed dotmap-1.3.30
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Ninja
  Downloading ninja-1.10.2.4-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (120 kB)
Installing collected packages: Ninja
Successfully installed Ninja-1.10.2.4


In [None]:
#@title Unconditional synthesis
#@markdown Execute this cell to run unconditional synthesis experiments

args.inference.mode = 'unconditional'
mode=args.inference.mode
args.inference.unconditional.num_samples=1

#@markdown Length of the generated samples (in seconds)
audio_len=4 #@param {type:"slider", min:0.5, max:40, step:0.1}
args.audio_len=int(audio_len*args.sample_rate)


#@markdown Number of discretization steps (recommended: 35)
num_steps = 35 #@param {type:"slider", min:0, max:100, step:1}
args.inference.T=num_steps

#@markdown minimum noise level (recommended: 0.0001)
sigma_min = 0.0001 #@param {type:"number"}
args.diffusion_parameters.sigma_min=sigma_min

#@markdown maximum noise level (recommended: 1)
sigma_max= 1 #@param {type:"number"}
args.diffusion_parameters.sigma_max=sigma_max

#@markdown noise schedule parameter (recommended 13)
rho=12 #@param{type:"slider", min:5, max:20, step:1}
args.diffusion_parameters.ro=rho


#@markdown Stochasticity parameter (recommended 5)
Schurn=8.5 #@param{type:"slider", min:0, max:40, step:0.1}
args.diffusion_parameters.Schurn=Schurn


plot_animation=True

from src.experimenters.exp_unconditional import Exp_Unconditional
exp=Exp_Unconditional(args, plot_animation)

if plot_animation:
  audio_path, fig=exp.conduct_experiment("1")
  fig.show()
else:
  audio_path=exp.conduct_experiment("1")

Audio(audio_path) # load the saved file

  warn("Q-factor too high for frequencies %s"%",".join("%.2f"%fi for fi in f[q >= qneeded]))
100%|██████████| 35/35 [00:55<00:00,  1.59s/it]

Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created (Triggered internally at  ../aten/src/ATen/native/Convolution.cpp:882.)
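
As a rough illustration of how the schedule parameters in the cell above interact, the sketch below computes a noise schedule of the form popularized by Karras et al. (2022): `num_steps` noise levels interpolated between `sigma_max` and `sigma_min` in a `1/rho` power domain. This is an assumption about the kind of schedule the sampler uses, not a copy of the code in `src/`, and the function name `karras_sigmas` is hypothetical.

```python
def karras_sigmas(num_steps, sigma_min=1e-4, sigma_max=1.0, rho=13.0):
    """Return num_steps noise levels decreasing from sigma_max to sigma_min.

    Hypothetical helper: it mirrors the EDM-style schedule and is only an
    assumption about what the notebook's sliders control.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    max_inv = sigma_max ** (1.0 / rho)
    min_inv = sigma_min ** (1.0 / rho)
    # Linear interpolation in the 1/rho power domain, then map back.
    return [
        (max_inv + i / (num_steps - 1) * (min_inv - max_inv)) ** rho
        for i in range(num_steps)
    ]


sigmas = karras_sigmas(35, sigma_min=1e-4, sigma_max=1.0, rho=13.0)
print(f"first: {sigmas[0]:.4g}, last: {sigmas[-1]:.4g}")
```

With this form, a larger `rho` shortens the steps near `sigma_min` at the expense of longer steps near `sigma_max`, which is one way to read the "noise schedule parameter" slider.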



In [None]:
#@title Select audio example
example = 2 #@param {type:"slider", min:0, max:149, step:1}
# Sort the paths so that a given slider value always selects the same file
files=sorted(glob.glob(os.path.join(args.inference.load.data_directory,"*.wav")))
audio_file=files[example]


import soundfile as sf

segnp, fs =sf.read(audio_file)

n=os.path.basename(audio_file)
n=os.path.splitext(n)[0]
args.audio_len=segnp.shape[0]
seg=torch.Tensor(segnp).unsqueeze(0)
Audio(data=segnp, rate=fs) 


In [None]:
#@title Bandwidth Extension
#@markdown Execute this cell to run bandwidth extension experiments
#@markdown ## Diffusion schedule
args.inference.mode = 'bandwidth_extension'
mode=args.inference.mode

audio_len=seg.shape[-1]
#@markdown Number of discretization steps (recommended: 35)
num_steps = 35 #@param {type:"slider", min:0, max:100, step:1}
args.inference.T=num_steps

#@markdown minimum noise level (recommended: 0.0001)
sigma_min = 0.0001 #@param {type:"number"}
args.diffusion_parameters.sigma_min=sigma_min

#@markdown maximum noise level (recommended: 1)
sigma_max= 1 #@param {type:"number"}
args.diffusion_parameters.sigma_max=sigma_max

#@markdown noise schedule parameter (recommended 13)
rho=13 #@param{type:"slider", min:5, max:20, step:1}
args.diffusion_parameters.ro=rho

#@markdown stochasticity parameter (recommended 5)
Schurn=5 #@param{type:"slider", min:0, max:40, step:0.1}
args.diffusion_parameters.Schurn=Schurn

#@markdown ## Conditioning parameters
#@markdown guidance scaling parameter (recommended 0.25).
#@markdown Leave as 0 for no reconstruction guidance, but make sure to activate data consistency 
xi=0.26 #@param{type:"slider", min:0, max:1, step:0.01}
args.inference.xi=xi

#@markdown Choose if you want to apply data consistency steps (only for "firwin" filters)
data_consistency = False #@param {type:"boolean"}
args.inference.data_consistency=data_consistency
plot_animation=False


#@markdown ## Lowpass filter parameters
#filt_type = "firwin" #@param ["firwin", "cheby1", "resample", "decimate"]
#@markdown In this cell, the filter is an FIR, designed using the window method
#@markdown Specify the cutoff frequency (in Hz)
fc=1054 #@param{type:"slider", min:0, max:10000, step:1}
args.inference.bandwidth_extension.filter.fc=fc
#@markdown Specify the order of the filter
order=403 #@param{type:"slider", min:0, max:1000, step:1}
args.inference.bandwidth_extension.filter.order=order

from src.experimenters.exp_bandwidth_extension import Exp_BWE
exp=Exp_BWE(args, plot_animation)

if plot_animation:
  path_degraded, path_result, fig=exp.conduct_experiment(seg,"1")
  fig.show()
else:
  path_degraded, path_result=exp.conduct_experiment(seg, "1")


print("")
print("lowpass filtered:")
IPython.display.display(Audio(path_degraded))
print("bandwidth-extended:")
IPython.display.display(Audio(path_result))


dashape torch.Size([1, 65536])


100%|██████████| 35/35 [02:51<00:00,  4.91s/it]


lowpass filtered:





bandwidth-extended:
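
The lowpass degradation in the cell above is described as an FIR filter designed with the window method and controlled by `fc` and `order`. A minimal NumPy sketch of that idea follows. It assumes a Hamming window, `order + 1` taps, and the notebook's 22050 Hz sample rate; the helper names `design_lowpass_fir` and `lowpass` are hypothetical, and the repository's own filter design may differ in these details.

```python
import numpy as np


def design_lowpass_fir(fc, order, fs=22050):
    """Windowed-sinc lowpass FIR with unit gain at DC (hypothetical helper)."""
    num_taps = order + 1                    # assumption: taps = order + 1
    n = np.arange(num_taps) - order / 2.0   # time index centered on the filter
    h = np.sinc(2.0 * fc / fs * n)          # truncated ideal lowpass response
    h *= np.hamming(num_taps)               # taper to reduce truncation ripple
    return h / np.sum(h)                    # normalize so the DC gain is 1


def lowpass(x, h):
    """Filter x with h, keeping the input length (zero padding at the edges)."""
    return np.convolve(x, h, mode="same")


fs = 22050
h = design_lowpass_fir(fc=1054, order=403, fs=fs)
t = np.arange(fs) / fs
tone_low = np.sin(2 * np.pi * 200 * t)     # below the cutoff: should pass
tone_high = np.sin(2 * np.pi * 5000 * t)   # above the cutoff: should be removed
print(np.std(lowpass(tone_low, h)), np.std(lowpass(tone_high, h)))
```

A higher `order` narrows the transition band around `fc` at the cost of a longer filter, which is the trade-off the two sliders expose.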


In [None]:
#@title Audio Inpainting
#@markdown Execute this cell to run audio inpainting experiments
#@markdown ## Diffusion schedule
args.inference.mode = 'inpainting'
mode=args.inference.mode

audio_len=seg.shape[-1]
#@markdown Number of discretization steps (recommended: 35)
num_steps = 35 #@param {type:"slider", min:0, max:100, step:1}
args.inference.T=num_steps

#@markdown minimum noise level (recommended: 0.0001)
sigma_min = 0.0001 #@param {type:"number"}
args.diffusion_parameters.sigma_min=sigma_min

#@markdown maximum noise level (recommended: 1)
sigma_max= 1 #@param {type:"number"}
args.diffusion_parameters.sigma_max=sigma_max

#@markdown noise schedule parameter (recommended 13)
rho=13 #@param{type:"slider", min:1, max:20, step:1}
args.diffusion_parameters.ro=rho

#@markdown stochasticity parameter (recommended 5)
Schurn=5 #@param{type:"slider", min:0, max:40, step:0.1}
args.diffusion_parameters.Schurn=Schurn

#@markdown ## Conditioning parameters
#@markdown guidance scaling parameter (recommended 0.25).
#@markdown Leave as 0 for no reconstruction guidance, but make sure to activate data consistency 
xi=0.26 #@param{type:"slider", min:0, max:1, step:0.01}
args.inference.xi=xi

#@markdown Choose whether to apply data consistency steps
data_consistency = False #@param {type:"boolean"}
args.inference.data_consistency=data_consistency
plot_animation=False


#@markdown ## Inpainting details
#@markdown length of the gap (in ms)
gap_length=1000 #@param {type:"number"}
args.inference.inpainting.gap_length=gap_length
#@markdown start of the gap (in ms)
start_gap_idx=1000 #@param {type:"number"}
args.inference.inpainting.start_gap_idx=start_gap_idx


from src.experimenters.exp_inpainting import Exp_Inpainting
exp=Exp_Inpainting(args, plot_animation)

if plot_animation:
  path_degraded, path_result, fig=exp.conduct_experiment(seg,"1")
  fig.show()
else:
  path_degraded, path_result=exp.conduct_experiment(seg, "1")


print("")
print("masked:")
IPython.display.display(Audio(path_degraded))
print("reconstructed")
IPython.display.display(Audio(path_result))


  warn("Q-factor too high for frequencies %s"%",".join("%.2f"%fi for fi in f[q >= qneeded]))
100%|██████████| 35/35 [02:58<00:00,  5.10s/it]



masked:


  y_lpf=torch.nn.functional.conv1d(y,B,padding="same")


reconstructed
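
The inpainting cell above specifies the gap in milliseconds, while the audio itself is indexed in samples. The sketch below shows one way to convert those values into a binary mask at the notebook's 22050 Hz sample rate. The helper `gap_mask` is hypothetical and only illustrates the unit conversion; it is not taken from `src/experimenters/exp_inpainting.py`.

```python
import numpy as np


def gap_mask(num_samples, start_ms, length_ms, fs=22050):
    """Binary mask: 1 where samples are kept, 0 inside the gap (hypothetical)."""
    start = int(start_ms * fs / 1000)                 # gap start in samples
    stop = min(start + int(length_ms * fs / 1000), num_samples)
    mask = np.ones(num_samples, dtype=np.float32)
    mask[start:stop] = 0.0                            # zero out the gap
    return mask


mask = gap_mask(num_samples=65536, start_ms=1000, length_ms=1000)
masked_len = int((mask == 0).sum())
print(f"masked samples: {masked_len} of {mask.size}")
# A degraded signal would then be obtained as: degraded = mask * signal
```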


In [None]:
#@title Audio Declipping
#@markdown Execute this cell to run audio declipping experiments
#@markdown ## Diffusion schedule
args.inference.mode = 'declipping'
mode=args.inference.mode

audio_len=seg.shape[-1]
#@markdown Number of discretization steps (recommended: 35)
num_steps = 35 #@param {type:"slider", min:0, max:100, step:1}
args.inference.T=num_steps

#@markdown minimum noise level (recommended: 0.0001)
sigma_min = 0.0001 #@param {type:"number"}
args.diffusion_parameters.sigma_min=sigma_min

#@markdown maximum noise level (recommended: 1)
sigma_max= 1 #@param {type:"number"}
args.diffusion_parameters.sigma_max=sigma_max

#@markdown noise schedule parameter (recommended 13)
rho=13 #@param{type:"slider", min:1, max:20, step:1}
args.diffusion_parameters.ro=rho

#@markdown stochasticity parameter (recommended 5)
Schurn=5 #@param{type:"slider", min:0, max:40, step:0.1}
args.diffusion_parameters.Schurn=Schurn

#@markdown ## Conditioning parameters
#@markdown guidance scaling parameter (recommended 0.25).
xi=0.26 #@param{type:"slider", min:0, max:1, step:0.01}
args.inference.xi=xi

#@markdown This time it is not possible to use data consistency
data_consistency = False
args.inference.data_consistency=data_consistency
plot_animation=False


#@markdown ## Declipping details
#@markdown Specify the Signal-to-Distortion Ratio (in dB) of the clipping distortion
SDR=1 #@param{type:"slider", min:-10, max:30, step:0.1}
args.inference.declipping.SDR=SDR



from src.experimenters.exp_declipping import Exp_Declipping
exp=Exp_Declipping(args, plot_animation)

if plot_animation:
  path_degraded, path_result, fig=exp.conduct_experiment(seg,"1")
  fig.show()
else:
  path_degraded, path_result=exp.conduct_experiment(seg, "1")


print("")
print("clipped:")
IPython.display.display(Audio(path_degraded))
print("reconstructed")
IPython.display.display(Audio(path_result))


  warn("Q-factor too high for frequencies %s"%",".join("%.2f"%fi for fi in f[q >= qneeded]))


65536
/content/CQTdiff/experiments/cqt/declipping25_10_2022/original/1.wav
/content/CQTdiff/experiments/cqt/declipping25_10_2022/original/1.wav


100%|██████████| 35/35 [03:00<00:00,  5.16s/it]
  y_lpf=torch.nn.functional.conv1d(y,B,padding="same")


/content/CQTdiff/experiments/cqt/declipping25_10_2022/original/1.wav

clipped:


reconstructed
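
The declipping cell above parameterizes the degradation by a target Signal-to-Distortion Ratio rather than by a clipping threshold. One way to relate the two, sketched below, is to search for the symmetric hard-clipping threshold that yields the requested SDR. This is an assumption about how such a degradation could be produced, not the repository's implementation; the helpers `sdr_db` and `clip_to_sdr` are hypothetical, and plain hard clipping of this kind can only reach SDR values above 0 dB.

```python
import numpy as np


def sdr_db(reference, degraded):
    """Signal-to-distortion ratio in dB between a reference and a degraded signal."""
    distortion = reference - degraded
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))


def clip_to_sdr(x, target_sdr_db, iters=60):
    """Hard-clip x, choosing the threshold by bisection to match the target SDR."""
    lo, hi = 0.0, float(np.max(np.abs(x)))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sdr_db(x, np.clip(x, -mid, mid)) < target_sdr_db:
            lo = mid   # too much distortion: raise the threshold
        else:
            hi = mid   # too little distortion: lower the threshold
    threshold = 0.5 * (lo + hi)
    return np.clip(x, -threshold, threshold), threshold


t = np.arange(22050) / 22050
x = 0.5 * np.sin(2 * np.pi * 220 * t)
clipped, threshold = clip_to_sdr(x, target_sdr_db=1.0)
print(f"threshold: {threshold:.4f}, achieved SDR: {sdr_db(x, clipped):.2f} dB")
```

The bisection works because the SDR increases monotonically with the threshold: a higher threshold clips fewer samples and therefore introduces less distortion.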


In [None]:
#@title Compressive Sensing
#@markdown Execute this cell to run audio compressive sensing experiments
#@markdown ## Diffusion schedule


args.inference.mode = 'declipping'
mode=args.inference.mode

audio_len=seg.shape[-1]
#@markdown Number of discretization steps (recommended: 35)
num_steps = 35 #@param {type:"slider", min:0, max:100, step:1}
args.inference.T=num_steps

#@markdown minimum noise level (recommended: 0.0001)
sigma_min = 0.0001 #@param {type:"number"}
args.diffusion_parameters.sigma_min=sigma_min

#@markdown maximum noise level (recommended: 1)
sigma_max= 1 #@param {type:"number"}
args.diffusion_parameters.sigma_max=sigma_max

#@markdown noise schedule parameter (recommended 13)
rho=13 #@param{type:"slider", min:1, max:20, step:1}
args.diffusion_parameters.ro=rho

#@markdown stochasticity parameter (recommended 5)
Schurn=5 #@param{type:"slider", min:0, max:40, step:0.1}
args.diffusion_parameters.Schurn=Schurn

#@markdown ## Conditioning parameters
#@markdown guidance scaling parameter (recommended 0.25).
xi=0.26 #@param{type:"slider", min:0, max:1, step:0.01}
args.inference.xi=xi

#@markdown This time it is not possible to use data consistency
data_consistency = False
args.inference.data_consistency=data_consistency
plot_animation=False


#@markdown ## Compressed sensing details
#@markdown Specify the compression ratio, i.e. the percentage of samples that are dropped from the example audio file (suggestion: use high values)
percentage=96 #@param{type:"slider", min:0, max:100, step:0.1}
args.inference.comp_sens.percentage=100-percentage



from src.experimenters.exp_comp_sens import Exp_CompSens
exp=Exp_CompSens(args, plot_animation)

if plot_animation:
  path_degraded, path_result, fig=exp.conduct_experiment(seg,"1")
  fig.show()
else:
  path_degraded, path_result=exp.conduct_experiment(seg, "1")


print("")
print("compressed:")
IPython.display.display(Audio(path_degraded))
print("reconstructed")
IPython.display.display(Audio(path_result))


  warn("Q-factor too high for frequencies %s"%",".join("%.2f"%fi for fi in f[q >= qneeded]))
100%|██████████| 35/35 [02:56<00:00,  5.03s/it]



compressed:


  y_lpf=torch.nn.functional.conv1d(y,B,padding="same")


reconstructed
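
In the cell above, the slider gives the percentage of samples that are dropped, and the configuration receives `100-percentage`, i.e. the percentage that is kept. The sketch below illustrates that relationship with a random binary mask. Drawing the kept positions uniformly at random with a fixed seed is an assumption made for illustration, and the helper `random_sample_mask` is hypothetical rather than part of the repository.

```python
import numpy as np


def random_sample_mask(num_samples, kept_percentage, seed=0):
    """Binary mask keeping a given percentage of samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    num_kept = int(round(num_samples * kept_percentage / 100.0))
    mask = np.zeros(num_samples, dtype=np.float32)
    kept = rng.choice(num_samples, size=num_kept, replace=False)  # unique indices
    mask[kept] = 1.0
    return mask


dropped_percentage = 96
mask = random_sample_mask(65536, kept_percentage=100 - dropped_percentage)
print(f"kept {int(mask.sum())} of {mask.size} samples")
# A degraded signal would then be obtained as: degraded = mask * signal
```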
