## Song Generation

Song generation models come in several kinds:
- Music generation (Meta MAGNeT).
- Instrumental music generation (MusicGen model).
- Generation of mixed music and vocals (Jukebox model).
- Vocal/singing generation (Bark model).
- Audio and sound-effect generation (AudioGen model).
- Full song generation (text to lyrics, music, and vocals) (Suno.ai).
- Song composition generation (Microsoft Muzic family of models).

For this task, the Jukebox approach is the most relevant to song generation. To review the generation methods, though, we need to look at each model's architecture, datasets, and training method.

### Architecture and Method of Each Model
- MusicGen model, based on https://arxiv.org/pdf/2306.05284:
  - Architecture:
    - Audio tokenization:
      - Converts audio into quantized tokens using [RVQ](https://arxiv.org/abs/2107.03312) (EnCodec).
      - EnCodec: a convolutional auto-encoder whose latent space is quantized with Residual Vector Quantization (RVQ) and which is trained with an adversarial reconstruction loss. The EnCodec tokenizer uses 4 codebooks sampled at 50 Hz.
      - Four codebooks, each quantizing the residual left by the previous one:
        - Codebook 1: Captures the high-level structure and broad aspects of the audio.
        - Codebook 2: Focuses on intermediate features, refining the details provided by the first codebook.
        - Codebook 3: Provides additional detail, working on the nuances not captured by the previous codebooks.
        - Codebook 4: Adds the final layer of detail, ensuring high-fidelity audio output.
    - A single-stage autoregressive Transformer trained over a 32 kHz EnCodec tokenizer.
    - Transformer decoder: an autoregressive model conditioned on text or melody.
  - Training:
    - 20K hours of licensed music are used to train MusicGen:
      - an internal dataset of 10K high-quality music tracks;
      - the ShutterStock and Pond5 music data.
  - Samples page: https://ai.honu.io/papers/musicgen/
- Bark model (suno.ai): https://github.com/suno-ai/bark
  - Bark (Suno, 2023) is a text-to-speech model that can generate realistic speech and is able to match the tone, pitch, emotion, and prosody of a given voice preset.
  - Architecture:
    - Three Transformer models: text, coarse, and fine. This is the same hierarchical modeling approach described in [AudioLM](https://arxiv.org/pdf/2209.03143).
      - Each Transformer model is based on the nanoGPT transformer.
- Jukebox model (https://jukebox.openai.com/):
  - Architecture:
    - Audio tokenization:
      - Three separate hierarchical [VQ-VAE](https://arxiv.org/pdf/1711.00937) models compress the audio into a lower-dimensional space.
        - The three models form a cascade.
    - Autoregressive Sparse Transformers.
    - Autoregressive upsamplers that recreate the information lost at each level of compression.
  - Training:
    - The music VQ-VAE uses 3 levels of bottlenecks that compress 44 kHz audio in dimensionality by 8x, 32x, and 128x.
    - Codebook size of 2048 at each level.
    - The VQ-VAE has 2 million parameters and is trained on 9-second audio clips on 256 V100 GPUs for 3 days.
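
The residual vector quantization step mentioned for EnCodec above can be illustrated with a toy NumPy sketch. This is not EnCodec's implementation: the codebooks are random, the names `rvq_encode` and `rvq_decode` are hypothetical, and each toy codebook includes the zero vector (an assumption made here) so that adding a stage can never make the reconstruction worse.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks; return one index per codebook."""
    residual = x.astype(np.float64)
    indices = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # The next codebook only sees what this stage failed to explain.
        residual = residual - cb[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected entry of every codebook."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))


rng = np.random.default_rng(0)
dim, codebook_size, n_codebooks = 8, 16, 4

# Toy codebooks (assumption): random entries with a scale that shrinks per stage.
codebooks = []
for k in range(n_codebooks):
    cb = rng.normal(scale=1.0 / (2 ** k), size=(codebook_size, dim))
    cb[0] = 0.0  # zero entry: a stage may choose to add nothing
    codebooks.append(cb)

x = rng.normal(size=dim)
indices = rvq_encode(x, codebooks)

# Reconstruction error when only the first k+1 codebooks are used.
errors = [
    float(np.linalg.norm(x - rvq_decode(indices[: k + 1], codebooks[: k + 1])))
    for k in range(n_codebooks)
]
print(indices, errors)
```

Each audio frame is then represented by 4 small integers (one per codebook) instead of a dense vector, which is what the Transformer models as tokens.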

### How to improve generation quality?

1. Use a text-to-text LLM to generate more precise prompts that carry many more characteristics of the music. A prompt can be made more detailed by including the instrument, tempo, genre, or emotion.
 - text-to-musictags model (each part of the song gets its own set of tags).
 - text-to-lyrics+musictags model (which also determines the composition of the song).

2. Architecture improvements:
 - try different quantization methods; VQ-VAE and RVQ are currently the most popular.
 - use non-autoregressive models (such as MAGNeT), which are reported to be about 7-10x faster at inference.
 - use more codebooks in the VQ-VAE, but in a single stage.
 - use LoRA in the Transformer model to make training faster and cheaper.
 - combine diffusion-model approaches with the Transformer approach.
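
To make the LoRA item above concrete, here is a minimal NumPy sketch of a low-rank adapter on one linear layer. The sizes, `rank`, `alpha`, and the function name `lora_forward` are illustrative assumptions; a real setup would apply this to the attention projections of the Transformer through a training framework.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8  # illustrative sizes (assumptions)

W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero at start


def lora_forward(x, W, A, B, alpha, rank):
    """Frozen layer output plus the scaled low-rank update (alpha / rank) * B @ A."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
y0 = lora_forward(x, W, A, B, alpha, rank)

full_params = W.size            # parameters touched by full fine-tuning
lora_params = A.size + B.size   # parameters trained with LoRA
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `A` and `B` receive gradients, which is where the training savings come from.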

3. Train a lyrics-to-singing / lyrics-to-music model that follows the composition of a song (this problem is mentioned for Jukebox by Ilya Sutskever):

- Introduction/intro;
- Verse;
- Pre-chorus/bridge;
- Chorus;
- Post-chorus/tag;
- Instrumental break;
- Ending/outro.

So we should generate a detailed prompt for each part of the song, and then generate MusicGen and singing samples for it (the Bark model can sing in different languages).
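
A per-part prompt could be assembled from tags roughly as follows. This is only a sketch: the `SectionTags` fields, the `build_prompt` helper, and the prompt wording are hypothetical, and in practice the tags would come from the text-to-musictags model rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass
class SectionTags:
    """Hypothetical tag set for one part of the song."""
    section: str
    genre: str
    tempo_bpm: int
    instruments: list[str]
    emotion: str


def build_prompt(tags: SectionTags) -> str:
    """Turn the tags of one song part into a text prompt for a music model."""
    instruments = ", ".join(tags.instruments)
    return (
        f"{tags.section} of a {tags.genre} song, {tags.tempo_bpm} bpm, "
        f"{instruments}, {tags.emotion} mood"
    )


song_plan = [
    SectionTags("intro", "rock and roll", 120, ["electric guitar", "drums"], "calm"),
    SectionTags("verse", "rock and roll", 120, ["electric guitar", "bass", "drums"], "playful"),
    SectionTags("chorus", "rock and roll", 128, ["electric guitar", "bass", "drums", "piano"], "energetic"),
]

prompts = [build_prompt(tags) for tags in song_plan]
for p in prompts:
    print(p)
```

Each prompt can then be passed to the music model for its own segment, so that tempo, instrumentation, and mood can change between parts of the song.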

4. We should find a method for merging vocals and melody in each part of the song composition. I have reviewed the suno-ai/bark model (based on nanoGPT and the ideas of MusicLM, with 3 Transformer models: text, coarse, and fine), which generates vocals from prompts. It should help.
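
The simplest baseline for merging is to overlay the two waveforms. The sketch below assumes both tracks are mono NumPy arrays that already share one sampling rate (Bark produces 24 kHz audio while MusicGen works at 32 kHz, so one of them would have to be resampled first); `mix_tracks` and its gain values are hypothetical, and the sine waves only stand in for real model output.

```python
import numpy as np


def mix_tracks(vocals, music, vocal_gain=1.0, music_gain=0.7):
    """Overlay two mono waveforms that share a sampling rate."""
    n = max(len(vocals), len(music))
    # Zero-pad the shorter track so both have the same length.
    v = np.pad(vocals.astype(np.float32), (0, n - len(vocals)))
    m = np.pad(music.astype(np.float32), (0, n - len(music)))
    mix = vocal_gain * v + music_gain * m
    # Rescale only if the sum would clip outside [-1, 1].
    peak = float(np.max(np.abs(mix)))
    if peak > 1.0:
        mix = mix / peak
    return mix


sr = 32000  # assumed common sampling rate
t_vocals = np.arange(1 * sr) / sr
t_music = np.arange(2 * sr) / sr
vocals = 0.8 * np.sin(2 * np.pi * 220.0 * t_vocals)  # stand-in for a vocal track
music = 0.8 * np.sin(2 * np.pi * 440.0 * t_music)    # stand-in for an instrumental

mix = mix_tracks(vocals, music)
print(mix.shape, float(np.max(np.abs(mix))))
```

A plain overlay does not align the vocals with the beat or the key of the music, which is the harder part of the problem described above.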


Add...

Vocals can be removed from the data source in two steps: first by filtering on the corresponding tags, and then by applying a state-of-the-art music source separation method, namely the open-source Hybrid Transformer for Music Source Separation ([Demucs](https://github.com/adefossez/demucs)).


## Prerequisites (HF Transformers, Accelerate, and so on)

In [None]:
# MusicGen model prerequisites
!pip install git+https://github.com/huggingface/transformers.git
!pip install accelerate
# MAGNeT model prerequisites
!pip install git+https://github.com/facebookresearch/audiocraft.git
!apt-get install -y ffmpeg



## MusicGen-Small Model

MusicGen models are trained on 30-second chunks of audio, but it is possible to generate longer sequences with a simple windowing approach. Let’s use a fixed 30-second window and slide it in chunks of 10 seconds, keeping the last 20 seconds that were generated as context, to generate 2-minute tracks.

In [None]:
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from IPython.display import Audio, display

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

sampling_rate = model.config.audio_encoder.sampling_rate
frame_rate = model.config.audio_encoder.frame_rate  # audio tokens per second
prompt = "rock and roll, active, dance-style music, elvis presley"

num_steps = 1  # number of windowing steps to run
context_samples = 20 * sampling_rate  # keep the last 20 seconds as context
prev_audio = None  # waveform returned by the previous step

for i in range(num_steps):
    # Use the tail of the previous output as an audio prompt, if there is one.
    context = None if prev_audio is None else [prev_audio[0, 0, -context_samples:].cpu().numpy()]
    inputs = processor(
        audio=context,
        text=[prompt],
        padding=True,
        return_tensors="pt",
        sampling_rate=sampling_rate,
    )
    # The first step fills the 30-second window; later steps add 10 seconds each.
    new_seconds = 30 if prev_audio is None else 10
    prev_audio = model.generate(**inputs, max_new_tokens=int(new_seconds * frame_rate))
    display(Audio(prev_audio[0].cpu().numpy(), rate=sampling_rate))
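
The window arithmetic can be checked without loading the model. The sketch below is one reading of the scheme described above (first window generated from scratch, then 10 new seconds per step on top of 20 seconds of context); `window_schedule` is a hypothetical helper, and the 50 tokens per second figure comes from the EnCodec frame rate quoted earlier.

```python
FRAME_RATE_HZ = 50  # EnCodec tokens per second, as quoted for MusicGen


def window_schedule(total_s, window_s=30, stride_s=10):
    """Return a list of (context_seconds, new_seconds), one pair per generation step."""
    steps = []
    generated = 0
    while generated < total_s:
        if generated == 0:
            # First step: nothing to condition on, fill the whole window.
            context, new = 0, min(window_s, total_s)
        else:
            # Later steps: keep the tail of the window, generate one stride.
            context = window_s - stride_s
            new = min(stride_s, total_s - generated)
        steps.append((context, new))
        generated += new
    return steps


steps = window_schedule(120)  # a 2-minute track
new_tokens = [new * FRAME_RATE_HZ for _, new in steps]

print(len(steps), steps[:3])
print(new_tokens[:3])
```

Under these assumptions a 2-minute track takes 10 steps: one of 1500 new tokens followed by nine of 500 new tokens each.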




## MusicGen-Melody Model

In [None]:
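import numpy as np
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

# Placeholder melody (an assumption for illustration): 5 seconds of a 440 Hz tone at 32 kHz.
# Swap in a real melody, for example a stem separated with Demucs, and pass its
# sampling rate instead (demucs.samplerate for Demucs output).
melody_sampling_rate = 32000
t = np.arange(5 * melody_sampling_rate) / melody_sampling_rate
wav = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

inputs = processor(
    audio=wav,
    sampling_rate=melody_sampling_rate,
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="pt",
)
# 256 new tokens at 50 tokens per second is roughly 5 seconds of audio.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)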


In [None]:
!pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
!pip install git+https://github.com/facebookresearch/audiocraft.git
!apt-get install ffmpeg


## MAGNeT Model

In [None]:
from audiocraft.models import MAGNeT

model = MAGNeT.get_pretrained("facebook/magnet-medium-30secs")

descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

Note: in the recorded run this cell failed with `AttributeError: module 'torch._functorch.eager_transforms' has no attribute 'grad_and_value'`. The xFormers warning printed before it reports that the installed xFormers was built for PyTorch 2.1.0+cu121 and Python 3.10.13, while the runtime had PyTorch 2.3.0+cu121 and Python 3.10.12, so a version mismatch is the likely cause. The warning recommends reinstalling xFormers to match the runtime (see https://github.com/facebookresearch/xformers#installing-xformers).

## References:

Papers: ComputerScience.Sound (cs.SD) https://arxiv.org/list/cs.SD/recent

MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer
- Paper: https://arxiv.org/abs/2401.04577
- ModelCard: https://huggingface.co/models?other=magnet
- Samples: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/

Jukebox: A Generative Model for Music
- Paper: https://arxiv.org/pdf/2005.00341 (2020/OpenAI)
- Samples: https://jukebox.openai.com/?song=804331648

Simple and Controllable Music Generation (MusicGen)
- Paper: https://arxiv.org/pdf/2306.05284 (2023/MetaAI)
- Samples: https://audiocraft.metademolab.com/musicgen.html

MusicLM: Generating Music From Text
- Paper: https://arxiv.org/pdf/2301.11325 (2023/Google)
- Samples: https://google-research.github.io/seanet/musiclm/examples/

Bark/SunoAI:
- Source: https://github.com/suno-ai/bark

Music Transformer
- Paper: https://arxiv.org/pdf/1809.04281 (2018/Google Brain)

### Vocal & Singing
RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text
https://arxiv.org/pdf/2405.20336

### Audio Neural Codecs
High Fidelity Neural Audio Compression (EnCodec) - https://arxiv.org/pdf/2210.13438

### Convolutional Models for Audio Generation
WaveNet: A Generative Model for Raw Audio - https://arxiv.org/pdf/1609.03499

### Compositional Audio
WavJourney: Compositional Audio Creation with Large Language Models - https://arxiv.org/pdf/2307.14335