### Example: how to train a text-to-music model from scratch (based on musiclm_pytorch)

This is a top-level walkthrough that omits the implementation details of the quantizers (VQ-VAE or RVQ) and the Transformer models. It uses a parametric library of models that can be customized for learning from musical fragments and for generation.

First stage: training the MuLaN model (https://arxiv.org/pdf/2208.12415)

![](https://github.com/MaxMax2016/musiclm-pytorch/blob/main/musiclm.png?raw=true)

MuLan is a new generation of acoustic models that link music audio directly to unconstrained natural-language music descriptions.
It effectively gives us (audio, text) pairs from an audio-only dataset: the audio embedding stands in for the text embedding during training, so the audio files do not need to be labeled.
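
To make the idea of a joint audio-text embedding space more concrete, here is a minimal sketch of a symmetric contrastive objective of the kind used to train such models. It is plain Python on toy 2-D vectors, not the loss implemented in musiclm_pytorch; the function names, the temperature value, and the example embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry in one row of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(audio_embeds, text_embeds, temperature=0.1):
    """Symmetric contrastive loss: the i-th audio should match the i-th text."""
    n = len(audio_embeds)
    sims = [[cosine(a, t) / temperature for t in text_embeds] for a in audio_embeds]
    # audio -> text: each row of the similarity matrix is a classification over texts
    audio_to_text = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    # text -> audio: each column is a classification over audio clips
    text_to_audio = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (audio_to_text + text_to_audio) / 2

# Toy embeddings (made up): matched pairs point the same way, swapped pairs do not.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = contrastive_loss(audio, matched_text)
loss_swapped = contrastive_loss(audio, swapped_text)
```

The loss is small when each audio clip is closest to its own description and large when the pairs are swapped, which is what pushes matching audio and text together in the shared space.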

In [None]:
import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

# based on https://arxiv.org/pdf/2208.12415
mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM
embeds = mulan.get_audio_latents(wavs)  # during training
embeds = mulan.get_text_latents(texts)  # during inference

In [None]:
from musiclm_pytorch import MuLaNEmbedQuantizer

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,                          # pass in trained mulan from above
    conditioning_dims = (1024, 1024, 1024), # say all three transformers have model dimensions of 1024
    namespaces = ('semantic', 'coarse', 'fine'),
)

# now say you want the conditioning embeddings for semantic transformer

wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic') # (2, 8, 1024) - 8 is number of quantizers
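
The "number of quantizers" in the shape above refers to residual vector quantization (RVQ): each quantizer codes whatever the previous ones left over. The following is a minimal plain-Python sketch of that idea, not the code inside `MuLaNEmbedQuantizer`; the helper names and the tiny 2-D codebooks are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vector, codebooks):
    """Quantize `vector` with a stack of codebooks, each coding the leftover residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        # subtract the chosen code so the next quantizer only sees what is left
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual

# Two made-up codebooks: a coarse one and a finer one for the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]
indices, residual = residual_quantize([1.2, 1.0], codebooks)
```

One embedding therefore becomes one index per quantizer, and each index can then be looked up in a learned table to obtain a conditioning vector.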

Hierarchical modeling of semantic and acoustic tokens

![](./img/audiolm-hierarchy-modeling.png)

(1) Semantic tokens that allow the modeling of long-term structure, extracted from models pretrained on audio data with the objective of masked language modeling.

Semantic modeling. The first stage models $p(z_t \mid z_{<t})$, the autoregressive prediction of semantic tokens to capture long-term temporal structure.
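
As a small illustration of autoregressive prediction, the sketch below scores and extends a token sequence with the chain rule, log p(z) = sum over t of log p(z_t | z_<t). The toy probability table is hypothetical and only looks at the previous token, whereas a real semantic transformer conditions on the whole prefix.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the previous token.
START = None
NEXT_TOKEN_PROBS = {
    START: {0: 0.5, 1: 0.5},
    0: {0: 0.1, 1: 0.9},
    1: {0: 0.8, 1: 0.2},
}

def next_token_distribution(prefix):
    """Return p(z_t | z_<t) for the toy model as a dict token -> probability."""
    last = prefix[-1] if prefix else START
    return NEXT_TOKEN_PROBS[last]

def sequence_log_prob(tokens):
    """Chain rule: sum the log-probability of each token given its prefix."""
    return sum(
        math.log(next_token_distribution(tokens[:t])[tokens[t]])
        for t in range(len(tokens))
    )

def greedy_generate(prefix, steps):
    """Extend `prefix` by repeatedly appending the most likely next token."""
    tokens = list(prefix)
    for _ in range(steps):
        dist = next_token_distribution(tokens)
        tokens.append(max(dist, key=dist.get))
    return tokens

log_prob = sequence_log_prob([0, 1, 0])       # log(0.5 * 0.9 * 0.8)
generated = greedy_generate([0], steps=3)
```

Training maximizes this log-probability on real token sequences; generation runs the same loop, usually with sampling instead of the greedy choice shown here.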

In [None]:
import torch
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

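# use Apple's MPS backend when available, otherwise fall back to CUDA or CPU
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)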

wav2vec = HubertWithKmeans(
    checkpoint_path = './models/hubert/hubert_base_ls960.pt',
    kmeans_path = './models/hubert/hubert_base_ls960_L9_km500.bin' 
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    audio_text_condition = True      # this must be set to True (same for CoarseTransformer and FineTransformers)
)

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    audio_conditioner = quantizer,   # pass in the MuLaNEmbedQuantizer instance above
    folder = './dataset/music_data',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1
)

trainer.train()
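
The value `data_max_length = 320 * 32` is a clip length in audio samples. The short calculation below shows what that means in seconds and in semantic tokens. Both constants are assumptions about the HuBERT base model (16 kHz input, one token per 320 samples) and are not stated in this notebook; the token count is nominal and can differ slightly at clip boundaries.

```python
SAMPLE_RATE_HZ = 16_000      # assumption: HuBERT base expects 16 kHz audio
SAMPLES_PER_TOKEN = 320      # assumption: one semantic token per 320 input samples

def clip_stats(data_max_length):
    """Return (duration in seconds, nominal token count) for a clip length in samples."""
    duration_s = data_max_length / SAMPLE_RATE_HZ
    num_tokens = data_max_length // SAMPLES_PER_TOKEN
    return duration_s, num_tokens

duration_s, num_tokens = clip_stats(320 * 32)
```

Under these assumptions each training clip is well under a second long, which suits a smoke test with `num_train_steps = 1` but is far too short for learning musical structure.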

(2) Coarse acoustic modeling conditioned on the semantic tokens. Acoustic tokens, provided by a neural audio codec, capture fine acoustic details. This allows AudioLM to generate coherent and high-quality speech as well as piano music continuations without relying on transcripts or symbolic music representations.

`Coarse acoustic modeling. The second stage proceeds analogously on the acoustic tokens, but it only predicts the acoustic tokens from the coarse Q′ SoundStream quantizers, conditioned on the semantic tokens. Due to residual quantization in SoundStream, the acoustic tokens have a hierarchical structure: tokens from the coarse quantizers recover acoustic properties like speaker identity and recording conditions, while leaving only the fine acoustic details to the fine quantizer tokens, which are modeled by the next stage. We rely on the simple approach of flattening the acoustic tokens in a row-major order to handle their hierarchical structure.`
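
The row-major flattening mentioned in the quote can be sketched as follows: all quantizer tokens of one time step are emitted before moving to the next time step. The function names and token ids are made up for illustration.

```python
def flatten_row_major(frames):
    """Flatten a [time][quantizer] grid of token ids, time step by time step."""
    return [token for frame in frames for token in frame]

def unflatten_row_major(flat, num_quantizers):
    """Inverse of `flatten_row_major`: regroup a flat sequence into frames."""
    assert len(flat) % num_quantizers == 0, "length must be a multiple of num_quantizers"
    return [flat[i:i + num_quantizers] for i in range(0, len(flat), num_quantizers)]

# Three time steps, two coarse quantizers each (made-up token ids).
coarse_tokens = [[11, 21], [12, 22], [13, 23]]
flat = flatten_row_major(coarse_tokens)
restored = unflatten_row_major(flat, num_quantizers=2)
```

With this ordering a standard sequence model can predict the grid one token at a time, and the grid is recovered afterwards by regrouping every `num_quantizers` tokens.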

In [None]:
import torch
from unittest import mock
from audiolm_pytorch import SoundStream, CoarseTransformer, CoarseTransformerTrainer
from audiolm_pytorch import AudioLMSoundStream, MusicLMSoundStream

# soundstream = SoundStream.init_and_load_from('/path/to/trained/soundstream.pt')
soundstream = MusicLMSoundStream()  # untrained codec; load a trained checkpoint (line above) for real use

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024,
    num_coarse_quantizers = 4,
    dim = 1024,
    depth = 6,
    audio_text_condition = True
)

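# the trainer may ask on stdin whether to clear earlier checkpoints and results; answer 'n' automatically
with mock.patch('builtins.input', return_value='n'):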
    trainer = CoarseTransformerTrainer(
        transformer = coarse_transformer,
        codec = soundstream,
        wav2vec = wav2vec,
        audio_conditioner = quantizer,
        folder = './dataset/music_data',
        batch_size = 1,
        data_max_length = 320 * 32,
        num_train_steps = 1
    )
    trainer.train()

(3) Fine acoustic modeling

`The third stage operates on acoustic tokens corresponding to the fine quantizers, using the Q′ coarse tokens as conditioning and modeling the conditional probability distribution p(y_t^q | y^{≤Q′}, y_{<t}^{>Q′}, y_t^{<q}) for q > Q′.`
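
To spell out what that conditional means, the sketch below collects the three groups of tokens that a fine token at time step t and quantizer q may depend on. It uses 0-based indices (quantizers 0 to num_coarse - 1 are the coarse ones), and the function name and token grid are hypothetical.

```python
def fine_context(tokens, t, q, num_coarse):
    """Collect what the fine stage may condition on when predicting tokens[t][q].

    `tokens` is a [time][quantizer] grid; quantizers 0..num_coarse-1 are coarse.
    """
    assert q >= num_coarse, "only fine quantizers are predicted in this stage"
    # all coarse tokens, at every time step
    coarse_all_steps = [frame[:num_coarse] for frame in tokens]
    # fine tokens at time steps before t
    fine_previous_steps = [frame[num_coarse:] for frame in tokens[:t]]
    # tokens at time step t with a lower quantizer index than q
    lower_at_t = tokens[t][:q]
    return coarse_all_steps, fine_previous_steps, lower_at_t

# Two time steps, four quantizers, the first two of them coarse (made-up ids).
grid = [[1, 2, 3, 4], [5, 6, 7, 8]]
coarse_ctx, fine_prev_ctx, lower_ctx = fine_context(grid, t=1, q=3, num_coarse=2)
```

In this toy grid the token being predicted is `grid[1][3]`, and everything returned by `fine_context` is already known when it is generated.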

In [None]:
import torch
from unittest import mock
from audiolm_pytorch import SoundStream, FineTransformer, FineTransformerTrainer
from audiolm_pytorch import AudioLMSoundStream, MusicLMSoundStream

fine_transformer = FineTransformer(
    num_coarse_quantizers = 4,
    num_fine_quantizers = 8,
    codebook_size = 1024,
    dim = 1024,
    depth = 6,
    audio_text_condition = True
)

with mock.patch('builtins.input', return_value='n'):
    trainer = FineTransformerTrainer(
        transformer = fine_transformer,
        codec = soundstream,
        folder = './dataset/music_data',
        batch_size = 1,
        data_max_length = 320 * 32,
        num_train_steps = 1,
        audio_conditioner = quantizer
    )

    trainer.train()

### At this point we have trained the quantizer and three Transformers (Semantic, Coarse, Fine) and are ready to generate audio/music.

In [None]:
from audiolm_pytorch import AudioLM
from musiclm_pytorch import MusicLM

audiolm = AudioLM(
    wav2vec = wav2vec,
    codec = soundstream,
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

musiclm = MusicLM(
    audio_lm = audiolm,
    mulan_embed_quantizer = quantizer
)

In [None]:
music = musiclm('the crystalline sounds of the piano in a ballroom', num_samples = 1) # one sample; with num_samples > 1, several are drawn and the top MuLaN match is kept
torch.save(music, 'generated_music.pt')
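
The comment in the generation cell mentions picking the top match with MuLaN when several samples are drawn. A minimal sketch of that kind of reranking, assuming each candidate has already been embedded into the joint space, could look like this. The embeddings and function names are made up, and this is not the selection code inside `MusicLM`.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pick_top_match(text_embed, audio_embeds):
    """Return the index of the candidate most similar to the text, plus all scores."""
    scores = [cosine_similarity(text_embed, a) for a in audio_embeds]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# One text embedding and four candidate audio embeddings (toy 2-D values).
text_embed = [0.6, 0.8]
candidate_embeds = [[1.0, 0.0], [0.5, 0.9], [-0.6, -0.8], [0.0, 1.0]]
best_index, scores = pick_top_match(text_embed, candidate_embeds)
```

Drawing more candidates costs more generation time but gives the reranker a better chance of finding a clip that fits the prompt.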

In [None]:
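import torchaudio

output_path = "out.wav"
# write at the codec's own sample rate; a hardcoded 44100 may play the clip at the wrong speed
sample_rate = soundstream.target_sample_hz
torchaudio.save(output_path, music.cpu(), sample_rate)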