This is notebook gives a quick overview of this WaveNet implementation, i.e. creating the model and the data set, training the model and generating samples from it.

In [1]:
#!git clone https://github.com/Vichoko/pytorch-wavenet.git
import sys
sys.path.append("pytorch-wavenet")
import torch
from wavenet_model import *
from audio_data import WavenetDataset
from wavenet_training import *
from model_logging import *

In [2]:
import os
os.makedirs("snapshots", exist_ok=True)

In [None]:
!zip -r pytorch-wavenet.zip /content/pytorch-wavenet/

updating: content/pytorch-wavenet/ (stored 0%)
updating: content/pytorch-wavenet/optimizers.py (deflated 65%)
updating: content/pytorch-wavenet/train_samples/ (stored 0%)
updating: content/pytorch-wavenet/train_samples/.DS_Store (deflated 93%)
updating: content/pytorch-wavenet/train_samples/bach_chaconne/ (stored 0%)
updating: content/pytorch-wavenet/train_samples/bach_chaconne/.DS_Store (deflated 97%)
updating: content/pytorch-wavenet/train_samples/bach_chaconne/dataset.npz (deflated 14%)
updating: content/pytorch-wavenet/wavenet_modules.py (deflated 64%)
updating: content/pytorch-wavenet/.ipynb_checkpoints/ (stored 0%)
updating: content/pytorch-wavenet/.ipynb_checkpoints/WaveNet_demo-checkpoint.ipynb (deflated 72%)
updating: content/pytorch-wavenet/.ipynb_checkpoints/demo-checkpoint.ipynb (deflated 80%)
updating: content/pytorch-wavenet/.ipynb_checkpoints/Generated_Samples-checkpoint.ipynb (deflated 25%)
updating: content/pytorch-wavenet/WaveNet_demo.ipynb (deflated 72%)
updating: co

## Model
This is an implementation of WaveNet as it was described in the original paper (https://arxiv.org/abs/1609.03499). Each layer looks like this:

```
            |----------------------------------------|      *residual*
            |                                        |
            |    |-- conv -- tanh --|                |
 -> dilate -|----|                  * ----|-- 1x1 -- + -->  *input*
                 |-- conv -- sigm --|     |
                                         1x1
                                          |
 ---------------------------------------> + ------------->  *skip*
```

Each layer dilates the input by a factor of two. After each block the dilation is reset and start from one. You can define the number of layers in each block (``layers``) and the number of blocks (``blocks``). The blocks are followed by two 1x1 convolutions and a softmax output function.
Because of the dilation operation, the independent output for multiple successive samples can be calculated efficiently. With ``output_length``, you can define the number these outputs. Empirically, it seems that a large number of skip channels is required.

In [3]:
# initialize cuda option
dtype = torch.FloatTensor # data type
ltype = torch.LongTensor # label type

use_cuda = torch.cuda.is_available()
if use_cuda:
    print('use gpu')
    dtype = torch.cuda.FloatTensor
    ltype = torch.cuda.LongTensor

use gpu


In [7]:
"""model = WaveNetModel(layers=2,
                     blocks=2,
                     dilation_channels=2,
                     residual_channels=2,
                     skip_channels=2,
                     end_channels=2,
                     output_length=16,
                     dtype=dtype,
                     bias=True,
                     )
                     """
# model = load_latest_model_from('snapshots', use_cuda=use_cuda)

print('model: ', model)
print('receptive field: ', model.receptive_field)
print('parameter count: ', model.parameter_count())

UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, [1mdo those steps only if you trust the source of the checkpoint[0m. 
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL torch.nn.modules.container.ModuleList was not an allowed global by default. Please use `torch.serialization.add_safe_globals([ModuleList])` or the `torch.serialization.safe_globals([ModuleList])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

## Data Set
To create the data set, you have to specify a path to a data set file. If this file already exists it will be used, if not it will be generated. If you want to generate the data set file (a ``.npz`` file), you have to specify the directory (``file_location``) in which all the audio files you want to use are located. The attribute ``target_length`` specifies the number of successive samples are used as a target and corresponds to the output length of the model. The ``item_length`` defines the number of samples in each item of the dataset and should always be ``model.receptive_field + model.output_length - 1``.

```
          |----receptive_field----|
                                |--output_length--|
example:  | | | | | | | | | | | | | | | | | | | | |
target:                           | | | | | | | | | |  
```
To create a test set, you should define a ``test_stride``. Then each ``test_stride``th item will be assigned to the test set.

In [None]:
data = WavenetDataset(dataset_file='pytorch-wavenet/train_samples/bach_chaconne/dataset.npz',
                      item_length=model.receptive_field + model.output_length - 1,
                      target_length=model.output_length,
                      file_location='train_samples/bach_chaconne',
                      test_stride=500)
print('the dataset has ' + str(len(data)) + ' items')

one hot input
the dataset has 598467 items


## Training and Logging
This implementation supports logging with TensorBoard (you need to have TensorFlow installed). You can even generate audio samples from the current snapshot of the model during training. This will happen in a background thread on the cpu, so it will not interfere with the actual training but will be rather slow. If you don't have TensorFlow, you can use the standard logger that will print out to the console.
The trainer uses Adam as default optimizer.

In [None]:
def generate_and_log_samples(step):
    sample_length=32000
    gen_model = load_latest_model_from('snapshots', use_cuda=False)
    print("start generating...")
    samples = generate_audio(gen_model,
                             length=sample_length,
                             temperatures=[0.5])
    tf_samples = tf.convert_to_tensor(samples, dtype=tf.float32)
    logger.audio_summary('temperature_0.5', tf_samples, step, sr=16000)

    samples = generate_audio(gen_model,
                             length=sample_length,
                             temperatures=[1.])
    tf_samples = tf.convert_to_tensor(samples, dtype=tf.float32)
    logger.audio_summary('temperature_1.0', tf_samples, step, sr=16000)
    print("audio clips generated")

"""
logger = TensorboardLogger(log_interval=200,
                           validation_interval=400,
                           generate_interval=1000,
                           generate_function=generate_and_log_samples,
                           log_dir="logs/chaconne_model")
"""
logger = Logger(log_interval=200,
                validation_interval=400,
                generate_interval=1000)
# 6000

In [None]:
model.cuda()
trainer = WavenetTrainer(model=model,
                         dataset=data,
                         lr=0.01,
                         snapshot_path='snapshots',
                         snapshot_name='chaconne_model',
                         snapshot_interval=1000,
                         logger=logger,
                         dtype=dtype,
                         ltype=ltype)

print('start training...')
trainer.train(batch_size=6,
              epochs=1)

start training...
epoch 0
one training step does take approximately 0.06068881273269654 seconds)
loss at step 200: tensor(5.3096, device='cuda:0', grad_fn=<DivBackward0>)
loss at step 400: tensor(5.1402, device='cuda:0', grad_fn=<DivBackward0>)
validation loss: 5.277937476634979
validation accuracy: 0.9539199332777314%
loss at step 600: tensor(5.1500, device='cuda:0', grad_fn=<DivBackward0>)
loss at step 800: tensor(5.1469, device='cuda:0', grad_fn=<DivBackward0>)
validation loss: 5.276752121448517
validation accuracy: 0.7662635529608007%
loss at step 1000: tensor(5.1407, device='cuda:0', grad_fn=<DivBackward0>)
loss at step 1200: tensor(5.1367, device='cuda:0', grad_fn=<DivBackward0>)
validation loss: 5.276032621860504
validation accuracy: 0.8288156797331109%
loss at step 1400: tensor(5.1416, device='cuda:0', grad_fn=<DivBackward0>)
loss at step 1600: tensor(5.1415, device='cuda:0', grad_fn=<DivBackward0>)
validation loss: 5.2852328848838805
validation accuracy: 0.9539199332777314%
lo

## Generating
This model has the Fast Wavenet Generation Algorithm (https://arxiv.org/abs/1611.09482) implemented. This might run faster on the cpu. You can give some starting data (of at least the length of receptive field) or let the model generate from zero. In my experience, a temperature between 0.5 and 1.0 yields the best results, but this may depend on the data set.

In [None]:
# Check the length of the dataset
dataset_length = len(data)
print(f"Dataset length: {dataset_length}")

# Ensure the index is within the dataset bounds
index = min(250000, dataset_length - 1)

NameError: name 'data' is not defined

In [None]:
start_data = data[index][0] # use start data from the data set
start_data = torch.max(start_data, 0)[1] # convert one hot vectors to integers
if use_cuda:
    start_data = start_data.cuda()
def prog_callback(step, total_steps):
    print(str(100 * step // total_steps) + "% generated")

generated = model.generate_fast(num_samples=160000,
                                 first_samples=start_data,
                                 progress_callback=prog_callback,
                                 progress_interval=1000,
                                 temperature=1.0,
                                 regularize=0.)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA_scatter__value)

In [None]:
import IPython.display as ipd

ipd.Audio(generated, rate=16000)

In [8]:
from IPython.display import Audio

# Specify the path to the uploaded file
file_path = 'raw_audio.wav'  # Replace with the correct path

# Play the audio
Audio(file_path)

# **My comments**
Even though I trained my model for 1 hour, I was unable to get the model to output audio that sounded similar to music in the end. The quality of my generated sample was not very good, the audio generated sounded like static noise.

Potential improvements will be to train the model for higher epochs so that the model has more opprutunities to adjust its weights and reduce error. Another way to improve the quality of the generated output is to increase the number of layers, dialation channels, residual channels as this would all allow the model to learn more complex features from the training data, however this would mean a long training time so that was not implemented here