In [1]:
!git clone https://github.com/ylacombe/musicgen-dreamboothing.git

!pip install -U git+https://github.com/huggingface/transformers

Cloning into 'musicgen-dreamboothing'...
remote: Enumerating objects: 152, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 152 (delta 14), reused 13 (delta 13), pack-reused 135 (from 1)
Receiving objects: 100% (152/152), 5.72 MiB | 5.33 MiB/s, done.
Resolving deltas: 100% (75/75), done.
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-8ulz9_3v
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-8ulz9_3v
  Resolved https://github.com/huggingface/transformers to commit c4e71e8fffcdbcf1144a4e96f2d1f034ffafd4d7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel fo

In [2]:
%pip install transformers peft torch accelerate sentencepiece "datasets[audio]>=2.12.0" wandb evaluate torchaudio soundfile "black~=23.1" "isort>=5.5.4" "ruff>=0.0.241,<=0.0.259" msclap librosa demucs


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
category-encoders 2.7.0 requires scikit-learn<1.6.0,>=1.0.0, but you have scikit-learn 1.6.1 which is incompatible.
cesium 0.12.4 requires numpy<3.0,>=2.0, but you have numpy 1.26.4 which is incompatible.
bigframes 1.42.0 requires rich<14,>=12.4.4, but you have rich 14.0.0 which is incompatible.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import csv

def generate_caption(singer, technique, substyle, filename):
    """
    Generates a caption string for an audio file based on its metadata.

    Args:
        singer (str): The singer's name (e.g., 'Adele').
        technique (str): The vocal technique used (e.g., 'Vibrato').
        substyle (str): The musical substyle (e.g., 'Pop').
        filename (str): The original filename of the audio (e.g., '01_track.wav').

    Returns:
        str: The generated caption string.
    """
    # Extract the base name of the file without its extension.
    base_name = os.path.splitext(filename)[0]

    # Join singer, technique, substyle, and base filename into one caption,
    # replacing underscores with spaces for readability. The singer directory
    # name is used in full, with no characters stripped.
    caption = (
        f"{singer.replace('_', ' ')} "
        f"{technique.replace('_', ' ')} "
        f"{substyle.replace('_', ' ')} "
        f"{base_name.replace('_', ' ')}"
    )
    return caption

def create_musicgen_csv(root_dir, output_csv):
    """
    Scans a specified directory structure for .wav audio files and creates a CSV file.
    Each row in the CSV contains the absolute path to a .wav file and a generated caption
    based on its hierarchical location (singer/technique/substyle).

    The expected directory structure is:
    root_dir/singer_name/technique_name/substyle_name/audio_filename.wav

    Args:
        root_dir (str): The root directory from which to start scanning for audio files.
        output_csv (str): The full path where the output CSV file will be saved.
    """
    with open(output_csv, mode="w", newline="", encoding="utf-8") as csvfile:
        # Initialize the CSV writer.
        # 'delimiter=','': Specifies that fields are separated by commas.
        # 'quotechar='"': Specifies that fields containing special characters (like commas)
        #                should be enclosed in double quotes.
        # 'quoting=csv.QUOTE_ALL': Ensures that *all* fields are enclosed in double quotes.
        #                         This is a crucial step to prevent parsing errors if captions
        #                         themselves contain commas, as it guarantees proper escaping.
        writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

        # Write the header row for the CSV file.
        writer.writerow(["audio", "caption"])

        # Traverse the directory structure.
        # os.listdir() gets the contents of the current directory.
        # os.path.join() constructs platform-independent paths.
        # os.path.isdir() checks if an item is a directory.
        for singer_dir in os.listdir(root_dir):
            singer_path = os.path.join(root_dir, singer_dir)
            if not os.path.isdir(singer_path):
                continue # Skip if it's not a directory (e.g., a file in the root_dir)

            for technique_dir in os.listdir(singer_path):
                tech_path = os.path.join(singer_path, technique_dir)
                if not os.path.isdir(tech_path):
                    continue # Skip if it's not a directory

                for substyle_dir in os.listdir(tech_path):
                    substyle_path = os.path.join(tech_path, substyle_dir)
                    if not os.path.isdir(substyle_path):
                        continue # Skip if it's not a directory

                    for filename in os.listdir(substyle_path):
                        # Process only files that end with '.wav' extension.
                        if filename.endswith(".wav"):
                            # Get the absolute path of the audio file.
                            filepath = os.path.abspath(os.path.join(substyle_path, filename))

                            # Generate the caption for the current audio file.
                            # filename[2:] drops the first two characters of the filename,
                            # which are assumed to be a prefix in this dataset's naming convention.
                            caption = generate_caption(singer_dir, technique_dir, substyle_dir, filename[2:])

                            # Write the file path and its generated caption as a new row in the CSV.
                            writer.writerow([filepath, caption])

# Build the CSV for the dataset (these paths are specific to this Kaggle environment).
create_musicgen_csv("/kaggle/input/humsounds/FULL", "/kaggle/working/acapella.csv")
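To make the caption scheme concrete, here is a minimal sketch that derives a caption from a single path using `pathlib`. It is an illustration rather than a replacement for the cell above: `caption_from_path` and the miniature directory tree are hypothetical, and it assumes the same `singer/technique/substyle/file.wav` layout and the same two-character filename prefix.

```python
import tempfile
from pathlib import Path


def caption_from_path(wav_path: Path, root: Path) -> str:
    """Build a caption from root/singer/technique/substyle/file.wav."""
    singer, technique, substyle, filename = wav_path.relative_to(root).parts
    # Mirror the cell above: drop the first two filename characters,
    # then drop the extension.
    stem = Path(filename[2:]).stem
    parts = (singer, technique, substyle, stem)
    return " ".join(part.replace("_", " ") for part in parts)


# Hypothetical miniature tree, used only to illustrate the naming scheme.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    wav = root / "female_1" / "vibrato" / "slow_pop" / "1_scale_up.wav"
    wav.parent.mkdir(parents=True)
    wav.touch()
    demo_caption = caption_from_path(wav, root)

print(demo_caption)  # female 1 vibrato slow pop scale up
```

If the real filenames do not carry a two-character prefix, the `[2:]` slice would silently eat the start of the name, so it is worth checking a few generated captions by eye.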


In [4]:
from datasets import DatasetDict

dataset = DatasetDict.from_csv({"train":  "/kaggle/working/acapella.csv"})

Generating train split: 0 examples [00:00, ? examples/s]
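Before loading the CSV into `datasets`, it can help to confirm that every row points at a real file and has a non-empty caption. The following is a minimal sketch using only the standard library; `check_manifest` is a hypothetical helper, and the demo rows are made up for illustration.

```python
import csv
import os
import tempfile


def check_manifest(csv_path):
    """Return (row_count, problems) for an audio/caption CSV manifest."""
    problems = []
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Line 1 is the header, so data rows start at line 2.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            count += 1
            if not os.path.isfile(row.get("audio") or ""):
                problems.append((line_no, "missing audio file"))
            if not (row.get("caption") or "").strip():
                problems.append((line_no, "empty caption"))
    return count, problems


# Self-contained demo on a throwaway manifest (hypothetical rows).
with tempfile.TemporaryDirectory() as tmp:
    wav = os.path.join(tmp, "a.wav")
    open(wav, "wb").close()
    manifest = os.path.join(tmp, "demo.csv")
    with open(manifest, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["audio", "caption"])
        writer.writerow([wav, "female vibrato pop scale"])
        writer.writerow([os.path.join(tmp, "gone.wav"), ""])
    demo_count, demo_problems = check_manifest(manifest)

print(demo_count, demo_problems)
```

In this notebook the same function could be pointed at `/kaggle/working/acapella.csv`; an empty `problems` list would suggest the manifest is safe to load.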

In [5]:
from datasets import Audio
dataset = dataset.cast_column("audio", Audio())

In [6]:
# dataset.to_json("/kaggle/working/acapella.csv")

In [7]:
from datasets import load_dataset, Audio

dataset = load_dataset("csv", data_files="/kaggle/working/acapella.csv")
dataset = dataset.cast_column("audio", Audio())

dataset.save_to_disk("/kaggle/working/audio_dataset")


Generating train split: 0 examples [00:00, ? examples/s]

Saving the dataset (0/6 shards):   0%|          | 0/3613 [00:00<?, ? examples/s]

In [8]:
!rm -rf /kaggle/working/musicgen-acapella-output
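The training command in the next cell uses a per-device batch size of 1, 4 gradient accumulation steps, and 15 epochs on the 3613 examples reported above. A rough estimate of the resulting optimizer steps is sketched below. It assumes a single GPU and that the training script keeps every example; `estimate_optimizer_steps` is a hypothetical helper, and the exact rounding per epoch can differ between Trainer versions.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough optimizer-step count; exact rounding varies by Trainer version."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective_batch, steps_per_epoch, total_steps = estimate_optimizer_steps(
    num_examples=3613, per_device_batch=1, grad_accum=4, epochs=15
)
print(effective_batch, steps_per_epoch, total_steps)  # 4 904 13560
```

With `--save_steps 500`, an estimate of roughly 13,500 steps is consistent with the `checkpoint-13000` directory that appears in the archive listing at the end of the notebook.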

In [9]:

!python /kaggle/working/musicgen-dreamboothing/dreambooth_musicgen.py \
--use_lora \
--model_name_or_path "facebook/musicgen-small" \
--dataset_name "/kaggle/working/audio_dataset" \
--pad_token_id 0 \
--decoder_start_token_id 0 \
--text_column_name caption \
--target_audio_column_name audio \
--train_split_name train \
--do_train \
--do_eval False \
--output_dir /kaggle/working/musicgen-acapella-output \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--num_train_epochs 15 \
--logging_steps 10 \
--save_steps 500 \
--save_total_limit 2 \
--generation_max_length 128 \
--report_to none


2025-05-26 22:31:58.019201: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748298718.211475     152 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748298718.271337     152 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Generating train split: 3613 examples [00:04, 876.94 examples/s]
config.json: 100%|█████████████████████████| 7.87k/7.87k [00:00<00:00, 40.1MB/s]
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--facebook--musicgen-small/snapshots/4c8334b02c6ec4e8664a91979669a501ec497792/config.json
Model config MusicgenConfig {
  "architectures": [
    "MusicgenForConditionalGeneration"
  ],
  "audio

In [10]:
from transformers import AutoProcessor, AutoModelForTextToWaveform
import torch
import soundfile as sf

# Load processor and model
processor = AutoProcessor.from_pretrained("/kaggle/working/musicgen-acapella-output")
model = AutoModelForTextToWaveform.from_pretrained("/kaggle/working/musicgen-acapella-output")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")


2025-05-27 02:44:59.798541: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748313899.822301      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748313899.829305      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [11]:
# prompt = "male accapela piano"
# inputs = processor(text=prompt, return_tensors="pt").to(model.device)

# # Generate audio
# with torch.no_grad():
#     generated = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=500)

# # Convert to NumPy
# audio_array = generated.cpu().float().numpy()

# # Save to file
# sf.write("/kaggle/working/musicgen_output.wav", audio_array[0].T, samplerate=model.config.audio_encoder.sampling_rate)


In [12]:
from transformers import AutoProcessor, AutoModelForTextToWaveform
import torch
import soundfile as sf
import IPython.display as ipd

# Load processor and model
processor = AutoProcessor.from_pretrained("/kaggle/working/musicgen-acapella-output")
model = AutoModelForTextToWaveform.from_pretrained(
    "/kaggle/working/musicgen-acapella-output",
    torch_dtype=torch.float16
).to("cuda")

# Generate from prompt
inputs = processor(text="man slow breathy violin", return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, do_sample=True, max_new_tokens=700, guidance_scale=1)

# Convert and save
audio = output[0].cpu().float().numpy()
sr = model.config.audio_encoder.sampling_rate

sf.write("/kaggle/working/musicgen_output.wav", audio.T, sr)

# Playback
ipd.Audio("/kaggle/working/musicgen_output.wav")
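The `max_new_tokens` value controls how long the generated clip is. The sketch below converts a token budget into seconds, assuming the audio codec runs at 50 frames per second, which is the usual value for MusicGen checkpoints. That figure is an assumption here; in the notebook it can be read from `model.config.audio_encoder.frame_rate` instead. `tokens_to_seconds` is a hypothetical helper.

```python
def tokens_to_seconds(max_new_tokens, frame_rate=50):
    """Approximate clip length in seconds for a given token budget.

    frame_rate=50 is an assumed default; prefer the value from
    model.config.audio_encoder.frame_rate when a model is loaded.
    """
    return max_new_tokens / frame_rate


clip_seconds = tokens_to_seconds(700)
print(clip_seconds)  # 14.0
```

Under that assumption, the 700 tokens requested above correspond to about 14 seconds of audio, and the 500 tokens in the commented-out cell to about 10 seconds. Note also that `guidance_scale=1` turns classifier-free guidance off in `generate`; values above 1 enable it.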


In [13]:
!zip -r /kaggle/working/musicgen-acapella-output.zip /kaggle/working/musicgen-acapella-output


  adding: kaggle/working/musicgen-acapella-output/ (stored 0%)
  adding: kaggle/working/musicgen-acapella-output/all_results.json (deflated 43%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/ (stored 0%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/spiece.model (deflated 48%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/preprocessor_config.json (deflated 37%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/adapter_config.json (deflated 59%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/rng_state.pth (deflated 25%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/tokenizer.json (deflated 74%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/optimizer.pt (deflated 8%)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/scheduler.pt (deflated 55%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/adapter_model.safetensors (deflated 8%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/tokenizer_config.json (deflated 94%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/trainer_state.json (deflated 81%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/README.md (deflated 66%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/special_tokens_map.json (deflated 85%)
  adding: kaggle/working/musicgen-acapella-output/checkpoint-13000/training_args.bin (deflated 52%)
  adding: kaggle/working/musicgen-acapella-output/spiece.model (deflated 48%)
  adding: kaggle/working/musicgen-acapella-output/preprocessor_config.json (deflated 37%)
  adding: kaggle/working/musicgen-acapella-output/adapter_config.json (deflated 59%)
  adding: kaggle