In [None]:
!git clone https://github.com/facebookresearch/audiocraft.git
%cd audiocraft
!pip install -e .
!pip install dora-search
!pip install numba

## ⚠️ read the whole notebook first! ⚠️
There's a good chance that your question is already answered somewhere in here. pay attention to details like file paths, and skim through the example code. make sure you understand it before diving face first into a training run!

# data preprocessing etc

### --- this is optional if you've already got resampled 30s audio clips and .json labels ---

This section includes a tool to slice your audio into 30s chunks, resample to 44100hz, and normalize. it also includes a WIP autolabeller based on essentia (https://essentia.upf.edu/models.html, https://colab.research.google.com/drive/1tFInmCYK2uX-PajYemERvtSojkT0vrhF)

If you don't want to autolabel your data but you do want to split it, only run the first cell under this header. The rest is for essentia.
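
To see how the slicing lays out before touching any audio, here's a minimal sketch of the chunk arithmetic. `plan_chunks` is a hypothetical helper (not part of pydub or audiocraft). It assumes the duration is given in milliseconds, which is what `len()` returns for a pydub `AudioSegment`, and it names each chunk after its start second.

```python
import os

def plan_chunks(filename, duration_ms, chunk_ms=30_000):
    """Return (start_ms, end_ms, output_name) for each chunk of one file."""
    stem, _ext = os.path.splitext(filename)  # works for .mp3, .wav and .flac alike
    chunks = []
    for start in range(0, duration_ms, chunk_ms):
        end = min(start + chunk_ms, duration_ms)  # the last chunk may be shorter
        chunks.append((start, end, f"{stem}_chunk{start // 1000}.wav"))
    return chunks

# a 70 second file gives two full chunks and one 10 second leftover
for chunk in plan_chunks("artist name 4.flac", 70_000):
    print(chunk)
```

The leftover chunk at the end is usually shorter than 30s, so drop it if you need uniform clip lengths.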

Tags parsed by essentia:
- genre
- mood/theme
- instrumentation
- key, bpm (these are actually from librosa)
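
The genre model's labels look like `Parent---Style` (for example `Electronic---House`). Here's a rough sketch of one way to flatten them into a unique tag list. `labels_to_tags` is a hypothetical helper, not the function used in the cells below. It splits on `---` only, so a parent genre that contains a comma itself (`Folk, World, & Country`) stays in one piece.

```python
def labels_to_tags(labels):
    """Flatten 'Parent---Style' labels into a de-duplicated, ordered tag list."""
    seen = set()
    tags = []
    for label in labels:
        for tag in label.split("---"):
            if tag not in seen:
                seen.add(tag)
                tags.append(tag)
    return tags

print(labels_to_tags([
    "Electronic---House",
    "Electronic---Techno",
    "Folk, World, & Country---Folk",
]))
```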

This section is very WIP, and it's always better to label your own data. However, that can be a lot of work especially with a lot of samples. this should give you a decent baseline.

> TODO: for best results, the autolabeller will ask for an OpenAI API key so GPT can do some prompt enhancement on the tags essentia creates. Without a key, training uses only the raw tags. Information to be added by GPT:
> - natural language description
> - enhanced tags (eg adding "percussion" if "drums" already exists)

> TODO: add a song recognition API to fill in title and artist info

> TODO: figure out vram requirements, and clean up to leave space for musicgen. for now, save the `.jsonl` files, restart the notebook (your split files should be safe in drive), run everything after this (replacing the `.jsonl` files with the ones you downloaded at the paths mentioned in the code), and you should be okay.

In [None]:

# mount google drive to access dataset
from google.colab import drive
drive.mount('/content/drive')

dataset_path = "/content/drive/MyDrive/samples/train_musicgen_edm_uncond/train/outputs"

Mounted at /content/drive


In [None]:
# split and resample
# don't run this if your audio is already sliced and resampled

import os
from pydub import AudioSegment

original_dir = os.path.join(dataset_path, "original")
os.makedirs(original_dir, exist_ok=True)

for filename in os.listdir(dataset_path):
    if filename.endswith(('.mp3', '.wav', '.flac')):

        # move original file out of the way (full paths, not the current working directory)
        original_path = os.path.join(original_dir, filename)
        os.rename(os.path.join(dataset_path, filename), original_path)
        audio = AudioSegment.from_file(original_path)

        # resample
        audio = audio.set_frame_rate(44100)

        # split into 30-second chunks
        stem = os.path.splitext(filename)[0]  # also handles 5-character extensions like .flac
        for i in range(0, len(audio), 30000):
            chunk = audio[i:i+30000]
            chunk.export(os.path.join(dataset_path, f"{stem}_chunk{i//1000}.wav"), format="wav")


In [None]:
# install essentia and requirements

# this line is needed for the tensorflow modules to install properly!!
!sudo apt-get install build-essential libeigen3-dev libyaml-dev libfftw3-dev libtag1-dev libchromaprint-dev

!pip install -U essentia-tensorflow

In [None]:
# download some essentia model weights

!curl https://essentia.upf.edu/models/classification-heads/genre_discogs400/genre_discogs400-discogs-effnet-1.pb --output genre_discogs400-discogs-effnet-1.pb
!curl https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb --output discogs-effnet-bs64-1.pb
!curl https://essentia.upf.edu/models/classification-heads/mtg_jamendo_moodtheme/mtg_jamendo_moodtheme-discogs-effnet-1.pb --output mtg_jamendo_moodtheme-discogs-effnet-1.pb
!curl https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb --output mtg_jamendo_instrument-discogs-effnet-1.pb



In [None]:
# @title metadata (labels) for essentia - LONG CELL DONT OPEN
genre_labels = [
    "Blues---Boogie Woogie",
    "Blues---Chicago Blues",
    "Blues---Country Blues",
    "Blues---Delta Blues",
    "Blues---Electric Blues",
    "Blues---Harmonica Blues",
    "Blues---Jump Blues",
    "Blues---Louisiana Blues",
    "Blues---Modern Electric Blues",
    "Blues---Piano Blues",
    "Blues---Rhythm & Blues",
    "Blues---Texas Blues",
    "Brass & Military---Brass Band",
    "Brass & Military---Marches",
    "Brass & Military---Military",
    "Children's---Educational",
    "Children's---Nursery Rhymes",
    "Children's---Story",
    "Classical---Baroque",
    "Classical---Choral",
    "Classical---Classical",
    "Classical---Contemporary",
    "Classical---Impressionist",
    "Classical---Medieval",
    "Classical---Modern",
    "Classical---Neo-Classical",
    "Classical---Neo-Romantic",
    "Classical---Opera",
    "Classical---Post-Modern",
    "Classical---Renaissance",
    "Classical---Romantic",
    "Electronic---Abstract",
    "Electronic---Acid",
    "Electronic---Acid House",
    "Electronic---Acid Jazz",
    "Electronic---Ambient",
    "Electronic---Bassline",
    "Electronic---Beatdown",
    "Electronic---Berlin-School",
    "Electronic---Big Beat",
    "Electronic---Bleep",
    "Electronic---Breakbeat",
    "Electronic---Breakcore",
    "Electronic---Breaks",
    "Electronic---Broken Beat",
    "Electronic---Chillwave",
    "Electronic---Chiptune",
    "Electronic---Dance-pop",
    "Electronic---Dark Ambient",
    "Electronic---Darkwave",
    "Electronic---Deep House",
    "Electronic---Deep Techno",
    "Electronic---Disco",
    "Electronic---Disco Polo",
    "Electronic---Donk",
    "Electronic---Downtempo",
    "Electronic---Drone",
    "Electronic---Drum n Bass",
    "Electronic---Dub",
    "Electronic---Dub Techno",
    "Electronic---Dubstep",
    "Electronic---Dungeon Synth",
    "Electronic---EBM",
    "Electronic---Electro",
    "Electronic---Electro House",
    "Electronic---Electroclash",
    "Electronic---Euro House",
    "Electronic---Euro-Disco",
    "Electronic---Eurobeat",
    "Electronic---Eurodance",
    "Electronic---Experimental",
    "Electronic---Freestyle",
    "Electronic---Future Jazz",
    "Electronic---Gabber",
    "Electronic---Garage House",
    "Electronic---Ghetto",
    "Electronic---Ghetto House",
    "Electronic---Glitch",
    "Electronic---Goa Trance",
    "Electronic---Grime",
    "Electronic---Halftime",
    "Electronic---Hands Up",
    "Electronic---Happy Hardcore",
    "Electronic---Hard House",
    "Electronic---Hard Techno",
    "Electronic---Hard Trance",
    "Electronic---Hardcore",
    "Electronic---Hardstyle",
    "Electronic---Hi NRG",
    "Electronic---Hip Hop",
    "Electronic---Hip-House",
    "Electronic---House",
    "Electronic---IDM",
    "Electronic---Illbient",
    "Electronic---Industrial",
    "Electronic---Italo House",
    "Electronic---Italo-Disco",
    "Electronic---Italodance",
    "Electronic---Jazzdance",
    "Electronic---Juke",
    "Electronic---Jumpstyle",
    "Electronic---Jungle",
    "Electronic---Latin",
    "Electronic---Leftfield",
    "Electronic---Makina",
    "Electronic---Minimal",
    "Electronic---Minimal Techno",
    "Electronic---Modern Classical",
    "Electronic---Musique Concrète",
    "Electronic---Neofolk",
    "Electronic---New Age",
    "Electronic---New Beat",
    "Electronic---New Wave",
    "Electronic---Noise",
    "Electronic---Nu-Disco",
    "Electronic---Power Electronics",
    "Electronic---Progressive Breaks",
    "Electronic---Progressive House",
    "Electronic---Progressive Trance",
    "Electronic---Psy-Trance",
    "Electronic---Rhythmic Noise",
    "Electronic---Schranz",
    "Electronic---Sound Collage",
    "Electronic---Speed Garage",
    "Electronic---Speedcore",
    "Electronic---Synth-pop",
    "Electronic---Synthwave",
    "Electronic---Tech House",
    "Electronic---Tech Trance",
    "Electronic---Techno",
    "Electronic---Trance",
    "Electronic---Tribal",
    "Electronic---Tribal House",
    "Electronic---Trip Hop",
    "Electronic---Tropical House",
    "Electronic---UK Garage",
    "Electronic---Vaporwave",
    "Folk, World, & Country---African",
    "Folk, World, & Country---Bluegrass",
    "Folk, World, & Country---Cajun",
    "Folk, World, & Country---Canzone Napoletana",
    "Folk, World, & Country---Catalan Music",
    "Folk, World, & Country---Celtic",
    "Folk, World, & Country---Country",
    "Folk, World, & Country---Fado",
    "Folk, World, & Country---Flamenco",
    "Folk, World, & Country---Folk",
    "Folk, World, & Country---Gospel",
    "Folk, World, & Country---Highlife",
    "Folk, World, & Country---Hillbilly",
    "Folk, World, & Country---Hindustani",
    "Folk, World, & Country---Honky Tonk",
    "Folk, World, & Country---Indian Classical",
    "Folk, World, & Country---Laïkó",
    "Folk, World, & Country---Nordic",
    "Folk, World, & Country---Pacific",
    "Folk, World, & Country---Polka",
    "Folk, World, & Country---Raï",
    "Folk, World, & Country---Romani",
    "Folk, World, & Country---Soukous",
    "Folk, World, & Country---Séga",
    "Folk, World, & Country---Volksmusik",
    "Folk, World, & Country---Zouk",
    "Folk, World, & Country---Éntekhno",
    "Funk / Soul---Afrobeat",
    "Funk / Soul---Boogie",
    "Funk / Soul---Contemporary R&B",
    "Funk / Soul---Disco",
    "Funk / Soul---Free Funk",
    "Funk / Soul---Funk",
    "Funk / Soul---Gospel",
    "Funk / Soul---Neo Soul",
    "Funk / Soul---New Jack Swing",
    "Funk / Soul---P.Funk",
    "Funk / Soul---Psychedelic",
    "Funk / Soul---Rhythm & Blues",
    "Funk / Soul---Soul",
    "Funk / Soul---Swingbeat",
    "Funk / Soul---UK Street Soul",
    "Hip Hop---Bass Music",
    "Hip Hop---Boom Bap",
    "Hip Hop---Bounce",
    "Hip Hop---Britcore",
    "Hip Hop---Cloud Rap",
    "Hip Hop---Conscious",
    "Hip Hop---Crunk",
    "Hip Hop---Cut-up/DJ",
    "Hip Hop---DJ Battle Tool",
    "Hip Hop---Electro",
    "Hip Hop---G-Funk",
    "Hip Hop---Gangsta",
    "Hip Hop---Grime",
    "Hip Hop---Hardcore Hip-Hop",
    "Hip Hop---Horrorcore",
    "Hip Hop---Instrumental",
    "Hip Hop---Jazzy Hip-Hop",
    "Hip Hop---Miami Bass",
    "Hip Hop---Pop Rap",
    "Hip Hop---Ragga HipHop",
    "Hip Hop---RnB/Swing",
    "Hip Hop---Screw",
    "Hip Hop---Thug Rap",
    "Hip Hop---Trap",
    "Hip Hop---Trip Hop",
    "Hip Hop---Turntablism",
    "Jazz---Afro-Cuban Jazz",
    "Jazz---Afrobeat",
    "Jazz---Avant-garde Jazz",
    "Jazz---Big Band",
    "Jazz---Bop",
    "Jazz---Bossa Nova",
    "Jazz---Contemporary Jazz",
    "Jazz---Cool Jazz",
    "Jazz---Dixieland",
    "Jazz---Easy Listening",
    "Jazz---Free Improvisation",
    "Jazz---Free Jazz",
    "Jazz---Fusion",
    "Jazz---Gypsy Jazz",
    "Jazz---Hard Bop",
    "Jazz---Jazz-Funk",
    "Jazz---Jazz-Rock",
    "Jazz---Latin Jazz",
    "Jazz---Modal",
    "Jazz---Post Bop",
    "Jazz---Ragtime",
    "Jazz---Smooth Jazz",
    "Jazz---Soul-Jazz",
    "Jazz---Space-Age",
    "Jazz---Swing",
    "Latin---Afro-Cuban",
    "Latin---Baião",
    "Latin---Batucada",
    "Latin---Beguine",
    "Latin---Bolero",
    "Latin---Boogaloo",
    "Latin---Bossanova",
    "Latin---Cha-Cha",
    "Latin---Charanga",
    "Latin---Compas",
    "Latin---Cubano",
    "Latin---Cumbia",
    "Latin---Descarga",
    "Latin---Forró",
    "Latin---Guaguancó",
    "Latin---Guajira",
    "Latin---Guaracha",
    "Latin---MPB",
    "Latin---Mambo",
    "Latin---Mariachi",
    "Latin---Merengue",
    "Latin---Norteño",
    "Latin---Nueva Cancion",
    "Latin---Pachanga",
    "Latin---Porro",
    "Latin---Ranchera",
    "Latin---Reggaeton",
    "Latin---Rumba",
    "Latin---Salsa",
    "Latin---Samba",
    "Latin---Son",
    "Latin---Son Montuno",
    "Latin---Tango",
    "Latin---Tejano",
    "Latin---Vallenato",
    "Non-Music---Audiobook",
    "Non-Music---Comedy",
    "Non-Music---Dialogue",
    "Non-Music---Education",
    "Non-Music---Field Recording",
    "Non-Music---Interview",
    "Non-Music---Monolog",
    "Non-Music---Poetry",
    "Non-Music---Political",
    "Non-Music---Promotional",
    "Non-Music---Radioplay",
    "Non-Music---Religious",
    "Non-Music---Spoken Word",
    "Pop---Ballad",
    "Pop---Bollywood",
    "Pop---Bubblegum",
    "Pop---Chanson",
    "Pop---City Pop",
    "Pop---Europop",
    "Pop---Indie Pop",
    "Pop---J-pop",
    "Pop---K-pop",
    "Pop---Kayōkyoku",
    "Pop---Light Music",
    "Pop---Music Hall",
    "Pop---Novelty",
    "Pop---Parody",
    "Pop---Schlager",
    "Pop---Vocal",
    "Reggae---Calypso",
    "Reggae---Dancehall",
    "Reggae---Dub",
    "Reggae---Lovers Rock",
    "Reggae---Ragga",
    "Reggae---Reggae",
    "Reggae---Reggae-Pop",
    "Reggae---Rocksteady",
    "Reggae---Roots Reggae",
    "Reggae---Ska",
    "Reggae---Soca",
    "Rock---AOR",
    "Rock---Acid Rock",
    "Rock---Acoustic",
    "Rock---Alternative Rock",
    "Rock---Arena Rock",
    "Rock---Art Rock",
    "Rock---Atmospheric Black Metal",
    "Rock---Avantgarde",
    "Rock---Beat",
    "Rock---Black Metal",
    "Rock---Blues Rock",
    "Rock---Brit Pop",
    "Rock---Classic Rock",
    "Rock---Coldwave",
    "Rock---Country Rock",
    "Rock---Crust",
    "Rock---Death Metal",
    "Rock---Deathcore",
    "Rock---Deathrock",
    "Rock---Depressive Black Metal",
    "Rock---Doo Wop",
    "Rock---Doom Metal",
    "Rock---Dream Pop",
    "Rock---Emo",
    "Rock---Ethereal",
    "Rock---Experimental",
    "Rock---Folk Metal",
    "Rock---Folk Rock",
    "Rock---Funeral Doom Metal",
    "Rock---Funk Metal",
    "Rock---Garage Rock",
    "Rock---Glam",
    "Rock---Goregrind",
    "Rock---Goth Rock",
    "Rock---Gothic Metal",
    "Rock---Grindcore",
    "Rock---Grunge",
    "Rock---Hard Rock",
    "Rock---Hardcore",
    "Rock---Heavy Metal",
    "Rock---Indie Rock",
    "Rock---Industrial",
    "Rock---Krautrock",
    "Rock---Lo-Fi",
    "Rock---Lounge",
    "Rock---Math Rock",
    "Rock---Melodic Death Metal",
    "Rock---Melodic Hardcore",
    "Rock---Metalcore",
    "Rock---Mod",
    "Rock---Neofolk",
    "Rock---New Wave",
    "Rock---No Wave",
    "Rock---Noise",
    "Rock---Noisecore",
    "Rock---Nu Metal",
    "Rock---Oi",
    "Rock---Parody",
    "Rock---Pop Punk",
    "Rock---Pop Rock",
    "Rock---Pornogrind",
    "Rock---Post Rock",
    "Rock---Post-Hardcore",
    "Rock---Post-Metal",
    "Rock---Post-Punk",
    "Rock---Power Metal",
    "Rock---Power Pop",
    "Rock---Power Violence",
    "Rock---Prog Rock",
    "Rock---Progressive Metal",
    "Rock---Psychedelic Rock",
    "Rock---Psychobilly",
    "Rock---Pub Rock",
    "Rock---Punk",
    "Rock---Rock & Roll",
    "Rock---Rockabilly",
    "Rock---Shoegaze",
    "Rock---Ska",
    "Rock---Sludge Metal",
    "Rock---Soft Rock",
    "Rock---Southern Rock",
    "Rock---Space Rock",
    "Rock---Speed Metal",
    "Rock---Stoner Rock",
    "Rock---Surf",
    "Rock---Symphonic Rock",
    "Rock---Technical Death Metal",
    "Rock---Thrash",
    "Rock---Twist",
    "Rock---Viking Metal",
    "Rock---Yé-Yé",
    "Stage & Screen---Musical",
    "Stage & Screen---Score",
    "Stage & Screen---Soundtrack",
    "Stage & Screen---Theme",
]
mood_theme_classes = [
    "action",
    "adventure",
    "advertising",
    "background",
    "ballad",
    "calm",
    "children",
    "christmas",
    "commercial",
    "cool",
    "corporate",
    "dark",
    "deep",
    "documentary",
    "drama",
    "dramatic",
    "dream",
    "emotional",
    "energetic",
    "epic",
    "fast",
    "film",
    "fun",
    "funny",
    "game",
    "groovy",
    "happy",
    "heavy",
    "holiday",
    "hopeful",
    "inspiring",
    "love",
    "meditative",
    "melancholic",
    "melodic",
    "motivational",
    "movie",
    "nature",
    "party",
    "positive",
    "powerful",
    "relaxing",
    "retro",
    "romantic",
    "sad",
    "sexy",
    "slow",
    "soft",
    "soundscape",
    "space",
    "sport",
    "summer",
    "trailer",
    "travel",
    "upbeat",
    "uplifting"
]
instrument_classes = [
    "accordion",
    "acousticbassguitar",
    "acousticguitar",
    "bass",
    "beat",
    "bell",
    "bongo",
    "brass",
    "cello",
    "clarinet",
    "classicalguitar",
    "computer",
    "doublebass",
    "drummachine",
    "drums",
    "electricguitar",
    "electricpiano",
    "flute",
    "guitar",
    "harmonica",
    "harp",
    "horn",
    "keyboard",
    "oboe",
    "orchestra",
    "organ",
    "pad",
    "percussion",
    "piano",
    "pipeorgan",
    "rhodes",
    "sampler",
    "saxophone",
    "strings",
    "synthesizer",
    "trombone",
    "trumpet",
    "viola",
    "violin",
    "voice"
]

In [None]:
from essentia.standard import MonoLoader, TensorflowPredictEffnetDiscogs, TensorflowPredict2D
import numpy as np

def filter_predictions(predictions, class_list, threshold=0.1):
    predictions_mean = np.mean(predictions, axis=0)
    sorted_indices = np.argsort(predictions_mean)[::-1]
    filtered_indices = [i for i in sorted_indices if predictions_mean[i] > threshold]
    filtered_labels = [class_list[i] for i in filtered_indices]
    filtered_values = [predictions_mean[i] for i in filtered_indices]
    return filtered_labels, filtered_values

def make_comma_separated_unique(tags):
    seen_tags = set()
    result = []
    for tag in ', '.join(tags).split(', '):
        if tag not in seen_tags:
            result.append(tag)
            seen_tags.add(tag)
    return ', '.join(result)

def get_audio_features(audio_filename):
    audio = MonoLoader(filename=audio_filename, sampleRate=16000, resampleQuality=4)()
    embedding_model = TensorflowPredictEffnetDiscogs(graphFilename="discogs-effnet-bs64-1.pb", output="PartitionedCall:1")
    embeddings = embedding_model(audio)

    result_dict = {}

    # predict genres
    genre_model = TensorflowPredict2D(graphFilename="genre_discogs400-discogs-effnet-1.pb", input="serving_default_model_Placeholder", output="PartitionedCall:0")
    predictions = genre_model(embeddings)
    filtered_labels, _ = filter_predictions(predictions, genre_labels)
    filtered_labels = ', '.join(filtered_labels).replace("---", ", ").split(', ')
    result_dict['genres'] = make_comma_separated_unique(filtered_labels)

    # predict mood/theme
    mood_model = TensorflowPredict2D(graphFilename="mtg_jamendo_moodtheme-discogs-effnet-1.pb")
    predictions = mood_model(embeddings)
    filtered_labels, _ = filter_predictions(predictions, mood_theme_classes, threshold=0.05)
    result_dict['moods'] = make_comma_separated_unique(filtered_labels)

    # predict instruments
    instrument_model = TensorflowPredict2D(graphFilename="mtg_jamendo_instrument-discogs-effnet-1.pb")
    predictions = instrument_model(embeddings)
    filtered_labels, _ = filter_predictions(predictions, instrument_classes)
    result_dict['instruments'] = filtered_labels

    return result_dict

In [None]:
# Create .jsonl from the extracted features, make a train/test split, and save in the right place.

# set the following variable to True if you want to see a progress bar instead of the printed results:
use_tqdm = False

import os
import json
import random
import librosa
from pydub import AudioSegment
import wave
import re

from functools import partial
from tqdm import tqdm
tqdm = partial(tqdm, position=0, leave=True)

# make sure the .jsonl has a place to go
os.makedirs("/content/audiocraft/egs/train", exist_ok=True)
os.makedirs("/content/audiocraft/egs/eval", exist_ok=True)

train_len = 0
eval_len = 0

with open("/content/audiocraft/egs/train/data.jsonl", "w") as train_file, \
     open("/content/audiocraft/egs/eval/data.jsonl", "w") as eval_file:

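    # only label .wav files (skips the "original" folder), and shuffle the list before wrapping it in tqdm
    files = [f for f in os.listdir(dataset_path) if f.endswith(".wav")]
    random.shuffle(files)
    dset = tqdm(files) if use_tqdm else files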

    for filename in dset:
        result = get_audio_features(os.path.join(dataset_path, filename))

        # TODO: make openai call, populate description and keywords

        # get key and BPM
        y, sr = librosa.load(os.path.join(dataset_path, filename))
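        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        # newer librosa versions return tempo as an array, so unwrap it before rounding
        tempo = int(round(float(np.atleast_1d(tempo)[0]))) # not usually accurate lol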
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        key = np.argmax(np.sum(chroma, axis=1))
        key = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B'][key]
        length = librosa.get_duration(y=y, sr=sr)
        # print(f"{filename}: {result}, detected key {key}, detected bpm {tempo}")

        # THIS IS FOR MY OWN DATASET FORMAT
        # Meant strictly to extract from format: "artist name 4_chunk25.wav"
        # Modify for your own use!!
        def extract_artist_from_filename(filename):
            match = re.search(r'(.+?)\s\d+_chunk\d+\.wav', filename)
            artist = match.group(1) if match else ""
            return artist.replace("mix", "").strip() if "mix" in artist else artist
        artist_name = extract_artist_from_filename(filename)

        # populate json
        entry = {
            "key": f"{key}",
            "artist": artist_name,
            "sample_rate": 44100,
            "file_extension": "wav",
            "description": "",
            "keywords": "",
            "duration": length,
            "bpm": tempo,
            "genre": result.get('genres', ""),
            "title": "",
            "name": "",
            "instrument": result.get('instruments', ""),
            "moods": result.get('moods', []),
            "path": os.path.join(dataset_path, filename),
        }
        # only print entries when the progress bar is off
        if not use_tqdm:
            print(entry)

        # train/test split
        if random.random() < 0.85:
            train_len += 1
            train_file.write(json.dumps(entry) + '\n')
        else:
            eval_len += 1
            eval_file.write(json.dumps(entry) + '\n')

print(train_len)
print(eval_len)

# clear cuda mem for finetuning
from numba import cuda
device = cuda.get_current_device()
device.reset()

# load dataset into musicgen

a dataset for musicgen is:
- a .yaml file with basic info about the audio sample rate and channels, and a pointer to .jsonl files containing the prompt metadata and links to the corresponding audio files.
- the above .jsonl file
- a folder full of audio

it looks for the .yaml at `/content/audiocraft/config/dset/audio/YOUR_TRAINING_RUN.yaml`, which you point to by setting `dset=audio/YOUR_TRAINING_RUN` in the `dora run` command. In the example code, `YOUR_TRAINING_RUN` is `train`.

it looks like this:
```
datasource:
  max_channels: 2
  max_sample_rate: 44100

  evaluate: egs/eval
  generate: egs/train
  train: egs/train
  valid: egs/eval
```
you can use four different datasets, but it works just fine (if you're okay with overfitting!) with all of them pointing to the same one.

The script is currently set up to put 15% of your dataset in `egs/eval` at random. This should help mitigate overfitting.

- `train`: Data for training the model.
- `valid`: Data for hyperparameter tuning and early stopping during training.
- `evaluate`: Data for assessing the model's performance post-training.
- `generate`: Data used for output generation, often the same as `train` but not necessarily.
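
The cells in this notebook do the split with a per-file coin flip (`random.random() < 0.85`), so the exact sizes change from run to run. If you'd rather have a reproducible split with an exact eval fraction, here's a minimal sketch. `split_train_eval` and its `seed` parameter are hypothetical names, not something audiocraft provides.

```python
import random

def split_train_eval(filenames, eval_fraction=0.15, seed=0):
    """Shuffle deterministically, then cut off an exact eval fraction."""
    files = sorted(filenames)            # fixed starting order, independent of os.listdir
    random.Random(seed).shuffle(files)   # private RNG, leaves the global random state alone
    n_eval = round(len(files) * eval_fraction)
    return files[n_eval:], files[:n_eval]

train_files, eval_files = split_train_eval([f"clip_{i}.wav" for i in range(100)])
print(len(train_files), len(eval_files))
```

Note that chunks cut from the same song can still land on both sides of the split, so this only mitigates overfitting, it doesn't rule it out.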

the `egs/YOUR_TRAINING_RUN` is referring to `/content/audiocraft/egs/YOUR_TRAINING_RUN/data.jsonl`. its contents are a long list of lines, one JSON object per line, that each look like this (wrapped here for readability; `path` holds the full path to the audio file, i.e. `dataset_folder` joined with the filename):

```
{
  "key": "", "artist": "", "sample_rate": 44100, "file_extension": "wav",
  "description": "", "keywords": "", "duration": 30.0, "bpm": "", "genre": "",
  "title": "", "name": "", "instrument": "", "moods": [],
  "path": "<dataset_folder>/<filename>.wav"
}
```
any of these fields will be omitted if empty; only "path" is required.
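
As a rough illustration of what "one JSON object per line" means in practice, here's a sketch that builds a single manifest line using only the standard library. `manifest_line` is a hypothetical helper, and it assumes plain PCM `.wav` files, since that's all the stdlib `wave` module can read. The cells below use librosa for the duration instead.

```python
import json
import wave

def manifest_line(path, **tags):
    """Serialize one dataset entry as a single line of JSON (no trailing newline)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate  # seconds, read from the header
    entry = {
        "sample_rate": sample_rate,
        "file_extension": "wav",
        "duration": duration,
        **tags,          # e.g. genre="electronic", bpm=128
        "path": path,
    }
    return json.dumps(entry)

# usage, assuming clip.wav exists:
# with open("data.jsonl", "a") as f:
#     f.write(manifest_line("clip.wav", genre="electronic") + "\n")
```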

If you already have json tags for your dataset (eg: `filename1.wav` `filename1.json`, `filename2.wav` `filename2.json`), set `use_existing_json` to `True` in the following code. Otherwise, it will use a blank json template for all the files, as an example.
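
The sidecar lookup described above boils down to swapping the extension and checking next to the audio file. A minimal sketch, where `load_sidecar` is a hypothetical helper name:

```python
import json
import os

def load_sidecar(audio_path):
    """Return the tags from the same-named .json next to the audio file, or None."""
    json_path = os.path.splitext(audio_path)[0] + ".json"
    if not os.path.exists(json_path):
        return None
    with open(json_path, "r") as f:
        entry = json.load(f)
    entry["path"] = audio_path  # always point at the audio, whatever the sidecar says
    return entry
```

Returning `None` for a missing sidecar lets the caller decide whether to skip the clip or fall back to a blank template.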

## example:

the below code is what I used to construct my dataset. it depends on a few assumptions about it, namely:
- all the audios are 30 seconds long @ 44100 Hz
- they live in `/content/drive/MyDrive/samples/train_musicgen_edm_uncond/train/outputs`
- the name of my training run ("YOUR_TRAINING_RUN") is "train"

In [None]:
# Make the .jsonl

# if autolabelled:
#     print('skip this cell, move on to the .yaml')
#     exit(0)

# if you have a .json file for each audio file sharing the same filename, set this variable to True
use_existing_json = False

# mount google drive to access dataset
from google.colab import drive
drive.mount('/content/drive')

import os
import json
import random
import wave

# make sure the .jsonl has a place to go
os.makedirs("/content/audiocraft/egs/train", exist_ok=True)
os.makedirs("/content/audiocraft/egs/eval", exist_ok=True)

dataset_folder = "/content/drive/MyDrive/samples/train_musicgen_edm_uncond/train/outputs"
train_manifest_path = "/content/audiocraft/egs/train/data.jsonl"
eval_manifest_path = "/content/audiocraft/egs/eval/data.jsonl"

dataset_len = 0
train_len = 0
eval_len = 0

train_file = open(train_manifest_path, 'w')
eval_file = open(eval_manifest_path, 'w')

for filename in os.listdir(dataset_folder):
    if filename.endswith(".wav"):
        dataset_len += 1

        if use_existing_json:
            # look for the .json next to the audio file, not in the current working directory
            json_filepath = os.path.join(dataset_folder, os.path.splitext(filename)[0] + ".json")
            if os.path.exists(json_filepath):
                with open(json_filepath, 'r') as json_file:
                    entry = json.load(json_file)
                    entry["path"] = os.path.join(dataset_folder, filename)
            else:
                print(f'error loading json: could not find {json_filepath}')
                continue  # skip this clip instead of writing a stale or undefined entry
        else:

            # empty fields for now, alter as needed to match your metadata.
            # all this does is make sure each file loads and trains semi-unconditionally
            import librosa
            y, sr = librosa.load(os.path.join(dataset_folder, filename))
            length = librosa.get_duration(y=y, sr=sr)

            entry = {
                "key": "",
                "artist": "",
                "sample_rate": 44100,
                "file_extension": "wav",
                "description": "",
                "keywords": "",
                "duration": length,
                "bpm": "",
                "genre": "electronic",
                "title": "",
                "name": "",
                "instrument": "",
                "moods": [],
                "path": os.path.join(dataset_folder, filename),
            }

        if random.random() < 0.85:
            train_len += 1
            train_file.write(json.dumps(entry) + '\n')
        else:
            eval_len += 1
            eval_file.write(json.dumps(entry) + '\n')

train_file.close()
eval_file.close()

print(f'dataset length: {dataset_len} audio clips')
print(f'train length: {train_len} audio clips')
print(f'eval length: {eval_len} audio clips')

In [None]:
# Make the .yaml

config_path = "/content/audiocraft/config/dset/audio/train.yaml"

# point to the folders that your .jsonl is in
data_path = "egs/train"
eval_data_path = "egs/eval"

package = "package" # yay python not letting me put #@.package in a string :/
yaml_contents = f"""#@{package} __global__

datasource:
  max_channels: 2
  max_sample_rate: 44100

  evaluate: {eval_data_path}
  generate: {data_path}
  train: {data_path}
  valid: {eval_data_path}
"""

with open(config_path, 'w') as yaml_file:
    yaml_file.write(yaml_contents)

# training

musicgen uses a thing called `dora` to launch the training run with a given solver, dataset, hyperparams, etc etc. if you've used anything like `accelerate`, it's like that. the command should be fairly easy to figure out from the given example.
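
The command is really just `dora -P audiocraft run` followed by a list of `key=value` overrides. Here's a small sketch that assembles one from a dict, which can be handier than editing a long string. `build_dora_command` is a hypothetical helper, and the override names are only the ones that appear in the example further down.

```python
def build_dora_command(overrides, package="audiocraft"):
    """Join key=value overrides into a single `dora run` command line."""
    parts = [f"dora -P {package} run"]
    parts += [f"{key}={value}" for key, value in overrides.items()]
    return " ".join(parts)

command = build_dora_command({
    "solver": "musicgen/musicgen_base_32khz",
    "model/lm/model_scale": "small",
    "dset": "audio/train",
    "dataset.batch_size": 2,
    "optim.epochs": 5,
})
print(command)
```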

The below command is starting a finetuning run on the pretrained "small" model. Training from scratch is beyond the scope of this notebook, but shouldn't be too hard, just a *lot* of compute. Resuming from your finetuned checkpoints will be covered later.

Some info about VRAM requirements and training time:
- It appears you can't train "medium" or "large" models on a single colab A100. They both OOM even with a batch size of 1. "melody" doesn't seem to load with anything i can pass to the `model/lm/model_scale=` param. You'll have to use "small"
- here's the VRAM requirements with different batch sizes, using small model, 30s@44100 audios:

```
batch_size: VRAM
1: 13.4 GB
2: 14.6 GB
4: 19.0 GB
8: 27.5 GB
12: 35.1 GB
16: OOM
```
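
If you want to pick a batch size from those numbers, here's a tiny sketch. The measurements are copied from the table above and only hold for my setup (small model, 30s clips). `largest_batch_size` and the 1 GB `headroom_gb` safety margin are my own assumptions.

```python
# batch_size -> VRAM in GB, copied from the table above
MEASURED_VRAM_GB = {1: 13.4, 2: 14.6, 4: 19.0, 8: 27.5, 12: 35.1}

def largest_batch_size(vram_budget_gb, headroom_gb=1.0):
    """Largest measured batch size that fits in the budget with some headroom, or None."""
    fitting = [bs for bs, gb in MEASURED_VRAM_GB.items() if gb + headroom_gb <= vram_budget_gb]
    return max(fitting) if fitting else None

print(largest_batch_size(40.0))  # a 40 GB GPU
print(largest_batch_size(16.0))  # a 16 GB GPU
```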

I did a training run with ~5000 of those samples, and it took roughly 90 minutes for 1 epoch to complete (on an A100). A checkpoint gets saved every epoch; more on this later. With a smaller dataset you might get faster train times, but I'm not sure: the musicgen code says dataset size is disconnected from epoch length for this repo specifically, but they also say 1 epoch is roughly 30 minutes, so I can't take their word for it.

You'll probably need quite a few epochs to get good results, I trained it for 4 epochs (about 6 hours) and the quality is still not great.
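
One way to reason about epoch length, assuming an "epoch" here is a fixed number of optimizer updates (`optim.updates_per_epoch`) rather than a full pass over the data, which is how I read the note in the musicgen code. Under that assumption the number of passes over your dataset per epoch is simple arithmetic. `dataset_passes_per_epoch` is a hypothetical helper.

```python
def dataset_passes_per_epoch(dataset_size, batch_size, updates_per_epoch):
    """Average number of times one 'epoch' walks through the whole dataset."""
    return updates_per_epoch * batch_size / dataset_size

# ~5000 clips, batch size 2, 1000 updates per epoch
print(dataset_passes_per_epoch(5000, 2, 1000))
# the same settings on a 500 clip dataset
print(dataset_passes_per_epoch(500, 2, 1000))
```

If that assumption holds, a smaller dataset doesn't make an epoch faster, it just means each clip gets seen more often per epoch.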

NOTE: These numbers were from a previous test, before I updated the training parameters to be more efficient on a single GPU. There should be a ~50% speedup vs the previous version. musicgen uses the `dadam` optimizer by default, which caches tokens, good for large runs on multiple gpus, but is less efficient on a single colab gpu.

My example code stops after 5 epochs. change this number to suit your needs, I kept it here to mitigate losing my instance on crash. My personal version saves the `.th` to drive and resumes run, so if colab crashes while I'm afk, I can at most lose 5 epochs of compute.

If you're retrying the command and getting the following error: `NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968`, here's a temporary fix:

```py
import locale
locale.getpreferredencoding = lambda: "UTF-8"
```

## stereo training

audiocraft release v1.2.0 comes with models trained to generate pseudo-stereo. they work by altering the delay pattern to generate alternating codebooks for left and right channels: L, R, L, R, etc. The stereo models are called `musicgen-stereo-small`, `musicgen-stereo-medium`, etc. Running inference simply means passing those names into `get_pretrained` instead of the default ones we're all used to.

Training is a bit harder, since you also have to pass info about the new codebook size/shape and delay pattern. [meta's documentation here.](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md#training-stereo-models)
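
To make the "alternating codebooks" idea concrete, here's a toy sketch of the interleaving. It is not audiocraft code. It assumes 4 codebooks per channel and that the left and right copies of codebook k share the same delay k. Check Meta's documentation linked above for the exact overrides your version expects.

```python
def interleave_stereo(left, right):
    """Interleave per-channel codebooks as L0, R0, L1, R1, ..."""
    if len(left) != len(right):
        raise ValueError("both channels need the same number of codebooks")
    return [codebook for pair in zip(left, right) for codebook in pair]

def stereo_delays(n_codebooks_per_channel):
    """Delay per interleaved codebook: both copies of codebook k get delay k (assumption)."""
    return [k for k in range(n_codebooks_per_channel) for _ in range(2)]

print(interleave_stereo(["L0", "L1", "L2", "L3"], ["R0", "R1", "R2", "R3"]))
print(stereo_delays(4))
```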

## example:

In [None]:
%env USER=lyra
# CHANGE THIS

command = (
    "dora -P audiocraft run"
    " solver=musicgen/musicgen_base_32khz"
    " model/lm/model_scale=small"
    " continue_from=//pretrained/facebook/musicgen-small"
    " conditioner=text2music"
    " dset=audio/train"
    " dataset.num_workers=2"
    " dataset.valid.num_samples=1"
    " dataset.batch_size=2" # CHANGE THIS
    " schedule.cosine.warmup=8"
    " optim.optimizer=adamw" # uses dadam by default, which is worse for single-gpu runs
    " optim.lr=1e-4"
    " optim.epochs=5" # stops training after 5 epochs- change this
    " optim.updates_per_epoch=1000" # 2000 by default, change this if you want checkpoints quicker ig
    " optim.adam.weight_decay=0.01"
    " generate.lm.prompted_samples=False" # skip super long generate step
    " generate.lm.gen_gt_samples=True"
)

!{command}

In [None]:
# STEREO TRAINING EXAMPLE

%env USER=lyra

command = (
    "dora -P audiocraft run"
    " solver=musicgen/musicgen_stereo_finetune_32khz" # this config comes with all the stereo settings
    " model/lm/model_scale=small"
    " continue_from=//pretrained/facebook/musicgen-stereo-small" # point to the stereo model here
    " conditioner=text2music"
    " dset=audio/train"
    " dataset.num_workers=2"
    " dataset.valid.num_samples=1"
    " dataset.batch_size=2"
    " schedule.cosine.warmup=8"
    " optim.optimizer=adamw"
    " optim.lr=1e-4"
    " optim.epochs=5"
    " optim.updates_per_epoch=1000"
    " optim.adam.weight_decay=0.01"
    " generate.lm.prompted_samples=False" # skip super long generate step
    " generate.lm.gen_gt_samples=True"
)

!{command}

In [None]:
# clear cuda mem
from numba import cuda
device = cuda.get_current_device()
device.reset()

# save, load, export

Every epoch, the trainer saves a `checkpoint.th` file to `/tmp/audiocraft_USER/xps/YOUR_RUN_SIG/checkpoint.th`.

`USER` is set by `%env USER=lyra` from earlier, and `YOUR_RUN_SIG` should look something like "2312e8a4", and is found in the output of the training command, in a line that looks like this near the top:

```
[08-10 08:17:06][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 2312e8a4
[08-10 08:17:06][flashy.solver][INFO] - All XP logs are stored in /tmp/audiocraft_lyra/xps/2312e8a4
```
see https://github.com/facebookresearch/audiocraft/blob/main/docs/TRAINING.md#a-small-introduction-to-dora for more details on run signatures
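
if you'd rather not copy the signature by hand, here's a small sketch that pulls it out of the log text and builds the checkpoint path. `find_run_sig` and `checkpoint_path` are hypothetical helpers, not part of audiocraft or dora, and the regex assumes signatures are 8 lowercase hex characters like the examples shown here.

```python
import re
from pathlib import PurePosixPath

def find_run_sig(log_text):
    """Return the first 8-character hex XP signature found in the log, or None."""
    match = re.search(r"for XP ([0-9a-f]{8})\b", log_text)
    return match.group(1) if match else None

def checkpoint_path(user, sig):
    """Build the checkpoint location for a given USER and run signature."""
    return PurePosixPath("/tmp") / f"audiocraft_{user}" / "xps" / sig / "checkpoint.th"

log = "[08-10 08:17:06][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 2312e8a4"
sig = find_run_sig(log)
print(sig)                           # 2312e8a4
print(checkpoint_path("lyra", sig))  # /tmp/audiocraft_lyra/xps/2312e8a4/checkpoint.th
```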

the `.th` file can't be loaded into musicgen for inference, but is required to resume a training run.

to get a model that the generator can use, audiocraft comes with an export function. using it requires passing a model signature (unsure about passing a .th), so you'll have to do it in the same runtime as training.

The export function makes two .bin files in the same folder: `state_dict.bin` and `compression_state_dict.bin`. the file sizes (for the small finetunes, at least) are ~800 MB and 1 KB respectively. These seem to load on top of the pretrained models, leading to lower file size than the original checkpoints. For reference, the small base model is ~2.5 GB, and the `checkpoint.th` is ~8.8 GB.

loading the finetune involves passing the folder containing both `.bin` files to `MusicGen.get_pretrained()`, just like loading the base models.
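
before pointing `get_pretrained` at a folder, it can help to confirm both files are actually there. a minimal sketch, assuming the two filenames described above; `missing_export_files` is a hypothetical helper, and the folder path is just the one used in the examples in this notebook.

```python
from pathlib import Path

REQUIRED_FILES = ("state_dict.bin", "compression_state_dict.bin")

def missing_export_files(folder):
    """Return the names of required export files that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]

missing = missing_export_files("/content/checkpoints/finetune")
if missing:
    print("not ready to load, missing:", ", ".join(missing))
else:
    print("both .bin files found")
```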

## the tricky parts

colab's filesystem is temporary. this means if the runtime crashes before you've saved your `.th`, all that training is gone. the training code will continuously run, so there's no easy way to interrupt it to save the checkpoint before continuing. You'll need to monitor your training run, and make sure you have enough compute credits! it's always safer to stop, save, and resume than to hope it just keeps working.

one way to try to get around this is exporting in a try/except block, so if it fails to load an audio file while you're away, at least you get the last checkpoint. (untested!) example at the end.
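
another option is to copy the checkpoint to drive only when it has changed, so the copy can be repeated cheaply (for example between short training runs). this is a sketch under assumptions, not something I've run against a real training job: `backup_if_newer` is a hypothetical helper, and the paths just mirror the examples in this notebook.

```python
import os
import shutil

def backup_if_newer(source_path, destination_dir):
    """Copy source_path into destination_dir if it is missing there or older.

    Returns True when a copy was made, False otherwise.
    """
    if not os.path.isfile(source_path):
        return False  # nothing has been saved yet
    os.makedirs(destination_dir, exist_ok=True)
    target = os.path.join(destination_dir, os.path.basename(source_path))
    if os.path.isfile(target) and os.path.getmtime(target) >= os.path.getmtime(source_path):
        return False  # the backup is already up to date
    shutil.copy2(source_path, target)  # copy2 also copies the modification time
    return True

# hypothetical paths, matching the examples in this notebook
copied = backup_if_newer(
    "/tmp/audiocraft_lyra/xps/aec0903f/checkpoint.th",
    "/content/drive/MyDrive/musicgen_finetunes/checkpoints",
)
print("copied" if copied else "nothing to copy")
```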

Additionally, resuming from SIG will break if you're training a stereo model, but resuming from `checkpoint.th` still works.

## examples:

In [None]:
# Exporting .bin files from a training run:

from audiocraft.utils import export
from audiocraft import train

sig = "aec0903f"

# from https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md#importing--exporting-models
xp = train.main.get_xp_from_sig(sig)
export.export_lm(xp.folder / 'checkpoint.th', '/content/checkpoints/finetune/state_dict.bin')
export.export_pretrained_compression_model('facebook/encodec_32khz', '/content/checkpoints/finetune/compression_state_dict.bin')




In [None]:
# Loading a finetune for inference:

from audiocraft.models import MusicGen
musicgen = MusicGen.get_pretrained('/content/checkpoints/finetune')

In [None]:
# Resuming a run:

sig = "aec0903f"

command = (
    "dora run solver=musicgen/musicgen_base_32khz"
    " model/lm/model_scale=small"

    # pick ONE of the two continue_from options below.

    # option 1: continue a run from its signature, if the filesystem still exists:
    f" continue_from=//SIG/{sig}"

    # option 2: save the .th file, load it in a new runtime, and resume from just it
    # (comment out option 1 and uncomment this line):
    # f" continue_from=/tmp/audiocraft_lyra/xps/{sig}/checkpoint.th"

    " conditioner=text2music"
    " dset=audio/train"
    " dataset.num_workers=2"
    " dataset.valid.num_samples=1"
    " dataset.batch_size=2"
    " schedule.cosine.warmup=8"
    " optim.optimizer=adamw" # uses dadam by default, which is worse for single-gpu runs
    " optim.lr=1e-4"
    " optim.epochs=5" # stops training after 5 epochs- change this
    " optim.adam.weight_decay=0.01"
)

!{command}

In [None]:
# Saving the .th file to google drive for persistence:
# it's ~8.8gb, so downloading from the colab filesystem is annoying. this is much easier
# you can point directly to the checkpoint in google drive when resuming as well

import os
import shutil
sig = "aec0903f"

source_path = f'/tmp/audiocraft_lyra/xps/{sig}/checkpoint.th'
destination_path = '/content/drive/MyDrive/musicgen_finetunes/checkpoints/new'
os.makedirs(destination_path, exist_ok=True)
shutil.copy(source_path, destination_path)

'/content/drive/MyDrive/musicgen_finetunes/checkpoints/new/checkpoint.th'

In [None]:
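# Attempt to save checkpoint on crash!!
import os, shutil
sig = "aec0903f"

# `!` commands don't raise a python exception when the command fails,
# so copy in `finally`: it runs whether training finished, crashed, or was interrupted
try:
    !{command}
finally:
    source_path = f'/tmp/audiocraft_lyra/xps/{sig}/checkpoint.th'
    destination_path = '/content/drive/MyDrive/musicgen_finetunes/checkpoints/'
    os.makedirs(destination_path, exist_ok=True)
    if os.path.exists(source_path):
        shutil.copy(source_path, destination_path)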

# generate

generating has been covered by many notebooks etc, but I'll include a few scripts here for different types of generating, including unconditional, text conditioned, sample continuation, stereo, multiband diffusion, and eventually more.

note that melody is not included: I cannot train the melody model with this script, so it's beyond the scope of this notebook.

for more info on using the multi-band diffusion model, read the docs here:
https://github.com/facebookresearch/audiocraft/blob/main/docs/MBD.md

Note: if you want to load the model from drive so you can use these examples without running the training script first, change `/content/checkpoints/finetune` in the first cell to the path of the drive folder that holds your two `.bin` files, for example `/content/drive/MyDrive/musicgen_finetunes/checkpoints/`.

In [None]:
from audiocraft.data.audio import audio_write
import IPython.display as ipd
from audiocraft.models import MusicGen
import numpy as np

# load your finetune
musicgen = MusicGen.get_pretrained('/content/checkpoints/finetune')
musicgen.set_generation_params(duration=16)

In [None]:
# Example 1: unconditional generation

wavs = musicgen.generate_unconditional(4)

# save and display generated audio
for idx, one_wav in enumerate(wavs):
    audio_write(f'{idx}', one_wav.cpu(), musicgen.sample_rate, strategy="loudness", loudness_compressor=True)
    ipd.display(ipd.Audio(one_wav.cpu(), rate=32000))

In [None]:
# Example 2: text guided generation

wavs = musicgen.generate([
    'disco',
    'slide guitar bluegrass',
    'breakbeat, amen break',
    'epic orchestral strings'
])

# save and display generated audio
for idx, one_wav in enumerate(wavs):
    audio_write(f'{idx}', one_wav.cpu(), musicgen.sample_rate, strategy="loudness", loudness_compressor=True)
    ipd.display(ipd.Audio(one_wav.cpu(), rate=32000))

In [None]:
# helper functions for handling sample input and continuations
# RUN THIS BEFORE RUNNING THE NEXT CELLS!

import julius, torch

def normalize_audio(audio_data):
    max_value = torch.max(torch.abs(audio_data))
    audio_data /= max_value
    return audio_data

def convert_audio_channels(wav: torch.Tensor, channels: int = 2) -> torch.Tensor:
    *shape, src_channels, length = wav.shape
    if src_channels == channels:
        pass
    elif channels == 1:
        wav = wav.mean(dim=-2, keepdim=True)
    elif src_channels == 1:
        wav = wav.expand(*shape, channels, length)
    elif src_channels >= channels:
        wav = wav[..., :channels, :]
    else:
        raise ValueError('The audio file has less channels than requested but is not mono.')
    return wav

def convert_audio(wav: torch.Tensor, from_rate: float, to_rate: float, to_channels: int) -> torch.Tensor:
    wav = julius.resample_frac(wav, int(from_rate), int(to_rate))
    wav = convert_audio_channels(wav, to_channels)
    return wav

# runs musicgen.generate_continuation in 30s chunks and appends them until it reaches generation_length

def generate_audio_continuation(musicgen, sample, generation_length, segment_length=30, overlap=10):
    overlap_samples = overlap * 32000
    segment_samples = segment_length * 32000
    output = np.array([])
    output = np.concatenate((output, sample.cpu().squeeze().numpy().astype(np.float32)))
    init_length = len(output) / 32000

    while len(output) / 32000 < generation_length:
        musicgen.set_generation_params(duration=segment_length)
        prompt = torch.tensor(np.array([output[-overlap_samples:]]), dtype=torch.float32)
        res = musicgen.generate_continuation(prompt=prompt, prompt_sample_rate=32000)
        res = res.cpu().squeeze().numpy().astype(np.float32)
        output = np.concatenate((output, res[overlap_samples:]))

    return output
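
a quick way to estimate how long the loop above will run: each pass asks for `segment_length` seconds, the first `overlap` seconds of the result are dropped, so every pass adds `segment_length - overlap` seconds. here's that arithmetic as a sketch. `continuation_plan` is a hypothetical helper, and it assumes the model returns exactly `segment_length` seconds per pass.

```python
import math

def continuation_plan(init_seconds, target_seconds, segment_length=30, overlap=10):
    """Return (passes, final_seconds) for the overlap-and-append loop."""
    added_per_pass = segment_length - overlap
    if added_per_pass <= 0:
        raise ValueError("segment_length must be larger than overlap")
    remaining = target_seconds - init_seconds
    passes = max(0, math.ceil(remaining / added_per_pass))
    return passes, init_seconds + passes * added_per_pass

# a 16 s clip extended to at least 60 s with the default 30 s segments and 10 s overlap
print(continuation_plan(16, 60))  # (3, 76)
```

so the output usually overshoots `generation_length`; trim it afterwards if you need an exact duration.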

In [None]:
# Example 3: sample continuation
# run the helper functions cell or this won't work!

# upload your file
import torch, torchaudio
from google.colab import files
uploaded = files.upload()

input_audio_filename = next(iter(uploaded.keys()))
sample, sample_sr = torchaudio.load(input_audio_filename)
sample = normalize_audio(sample)
sample = convert_audio(sample, sample_sr, 32000, 1)

# generate (the helper returns a 1-D numpy array)
output = generate_audio_continuation(musicgen, sample, 60)

# save and display generated audio
# audio_write expects a float tensor shaped [channels, samples]
output_tensor = torch.from_numpy(output).float().unsqueeze(0)
audio_write('continuation', output_tensor, musicgen.sample_rate, strategy="loudness", loudness_compressor=True)
ipd.display(ipd.Audio(output, rate=32000))

In [None]:
# Example 4: long generations (self-continuation)
# run the helper functions cell or this won't work!

# this is unconditional for the example, swap in text guidance as needed.
wavs = musicgen.generate_unconditional(4)

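# continuations only work on one sample at a time
for idx, wav in enumerate(wavs):

    # the helper returns a 1-D numpy array
    long_wav = generate_audio_continuation(musicgen, wav, 60)

    # audio_write expects a float tensor shaped [channels, samples]
    long_tensor = torch.from_numpy(long_wav).float().unsqueeze(0)
    audio_write(f'{idx}', long_tensor, musicgen.sample_rate, strategy="loudness", loudness_compressor=True)
    ipd.display(ipd.Audio(long_wav, rate=32000))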

In [None]:
# Example 5: multiband diffusion decoder

from audiocraft.models import MusicGen, MultiBandDiffusion
mbd = MultiBandDiffusion.get_mbd_musicgen()

# use mbd to generate the audio from the codebook tokens
wavs, tokens = musicgen.generate_unconditional(4, return_tokens=True)
wavs_diffusion = mbd.tokens_to_wav(tokens)

# save and display generated audio
for idx, one_wav in enumerate(wavs):
    audio_write(f'{idx}', one_wav.cpu(), musicgen.sample_rate, strategy="loudness", loudness_compressor=True)
    audio_write(f'{idx}_diffusion', wavs_diffusion[idx].cpu(), musicgen.sample_rate, strategy="loudness", loudness_compressor=True)

    print('default decoder:')
    ipd.display(ipd.Audio(one_wav.cpu(), rate=32000))
    print('multiband diffusion:')
    ipd.display(ipd.Audio(wavs_diffusion[idx].cpu(), rate=32000))

In [None]:
# Stereo inference
# This is an example using stock stereo musicgen. loading a custom checkpoint is as usual.
# Note that nothing here is actually any different from the mono models, this example
# is meant to show how easy it is to integrate

from audiocraft.models import MusicGen
import IPython.display as ipd

model = MusicGen.get_pretrained("facebook/musicgen-stereo-medium")
model.set_generation_params(duration=8)

wavs = model.generate([
    'disco',
    'slide guitar bluegrass',
    'breakbeat, amen break',
    'epic orchestral strings'
])

for idx, one_wav in enumerate(wavs):
    ipd.display(ipd.Audio(one_wav.cpu(), rate=model.sample_rate))


# local installation for inference (or training with good enough hardware)

this section assumes you have a GPU with 8gb of vram. it only covers inference, since you cannot train on only 8gb vram. if you have the gpu spec to train, this should get you started but you may have to figure out the details yourself (i cannot test it, my gpu is 8gb). I'll assume you have python installed already.

## step 1 - environment

I use miniconda for managing environments for each project. you can download it here: https://docs.conda.io/en/main/miniconda.html

once it's installed, run:

```
conda create --name musicgen python=3.9
conda activate musicgen
```

make a folder for your musicgen projects to live. you'll want to `cd` to this folder before running the commands in the next step.

## step 2 - audiocraft

now you're in your environment and can start installing.

I'm going to assume you have git installed, if not, https://gitforwindows.org/ (other platforms- find your own. it won't be hard).

You should recognize these commands from the top of this notebook:

```
git clone https://github.com/facebookresearch/audiocraft.git
cd audiocraft
pip install -e .
pip install dora-search julius
```

running these out of the box should install a bunch of dependencies, but it'll give you a pytorch-cpu version with no GPU access by default. You'll need a CUDA version to actually run this.

## step 3 - torch with cuda

You will need cuda for this to work. if you don't have it, go get it here: https://developer.nvidia.com/cuda-downloads

`pip install --upgrade torch --extra-index-url https://download.pytorch.org/whl/cu117`

You can't install torch with cuda *first*, because audiocraft will overwrite it during install.
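
to confirm the CUDA build actually took, a quick check like this can be run inside the environment. it's a minimal sketch: `cuda_status` is a hypothetical helper, and it only reports what torch itself sees.

```python
def cuda_status():
    """Describe whether the installed torch build can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} is installed, but CUDA is not available"
    return (
        f"torch {torch.__version__} with CUDA {torch.version.cuda} "
        f"on {torch.cuda.get_device_name(0)}"
    )

print(cuda_status())
```

if it reports that CUDA is not available, rerun the `pip install --upgrade torch` command above after the audiocraft install.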

## step 4 - make a script and run

at this point, everything is installed and you can do whatever with the environment. here's an example script you can use to make sure everything is loaded and working properly:

```
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

print('loading model...')
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)

print('generating...')
wav = model.generate_unconditional(4)

print('saving...')
for idx, one_wav in enumerate(wav):
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

print('done!')
```

save this as `example.py` and run it, and you should get four files, `0.wav` through `3.wav`, after a little bit. I won't go into too much detail since I figure you know how to run a python script.

if you want to use your finetunes, go ahead and make a folder for them to live (perhaps `checkpoints/finetune_1`, `checkpoints/finetune_2`, etc), and use the examples given above to load them from your folder.

---

If you're trying to train this with `dora` on Windows, you'll need to make some modifications:


> in `audiocraft/utils/cluster.py`: comment out lines 28-40

> in `audiocraft/train.py`: comment out line 110

> in the above code:

```
command = (
    "dora -P audiocraft run"
    ...
```
Note that the above command is expecting to be executed from the audiocraft directory, looking back one folder: `C:/...project_folder/audiocraft>python ../train_musicgen.py` where `train_musicgen.py` is in `project_folder` and contains the above code.


# multi-gpu training on runpod

TBA, see musicgen discord for current progress. documentation soon!

notes:

- run command needs updating with `-d` flag (`"dora -P audiocraft run -d "`) and `" fsdp.use=true"` and `" autocast=false"`

- fsdp allows more efficient memory usage so any size model can be trained with 2xA6000s (48gb vram each, 96gb combined). further documentation assumes this setup. the cost is $1.58/hr for 2xA6000

- if training `large` model, `batch_size=4`. if training `medium`, `batch_size=16`

- runpod used /workspace/ everywhere colab would usually use /content/. something you might want to know for making a notebook

- do scraping and labelling first (locally or colab), bc you dont really need a lot of GPU for that. put the data somewhere downloadable (i have a 13gb .tar.gz in google cloud storage, alongside the `train.jsonl` `test.jsonl` files, all publicly accessible so i can just fetch them from the instance, `!gdown` is also super useful if your data is hosted on google drive)

- it will die on eval after 1 epoch. to get rid of the deadlock, comment out lines 478-487 in `audiocraft/audiocraft/solvers/base.py`

- it will die on checkpoint save if you don't give the instance enough disk space. your dataset should be in a network drive, to leave disk space for the checkpoints. `small` model training checkpoint is 8gb, `medium` is 30gb, and `large` is 68gb. it also uses 2x storage because it writes a temp checkpoint before replacing, so give yourself like 256gb disk if you're trying to train large model. 128 works for medium. you'll need to do this during instance config before its actually launched. it's called "advanced settings" or something in the first screen.

- Several people (me included) have been running into issues exporting and loading finetuned medium/large models. You may have to do some experimenting here to get that to work. I have a hunch it has to do with the fact that it stores `checkpoint.th` *and* `checkpoint.th.1`, but only one gets passed to the export function. you'll end up getting
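
for the disk space note above, here's the arithmetic as a sketch, using the checkpoint sizes listed in these notes and the 2x factor for the temp checkpoint. `min_disk_gb` and `has_room` are hypothetical helpers; treat the result as a lower bound, since logs, samples, and the dataset (if it isn't on a network drive) also take space.

```python
import shutil

# checkpoint sizes in GB, taken from the notes above
CHECKPOINT_GB = {"small": 8, "medium": 30, "large": 68}

def min_disk_gb(model_scale):
    """Two copies of the checkpoint: the temp file plus the one it replaces."""
    return 2 * CHECKPOINT_GB[model_scale]

def has_room(path, model_scale):
    """True if the filesystem holding `path` has enough free space for a save."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_disk_gb(model_scale)

for scale in CHECKPOINT_GB:
    print(scale, min_disk_gb(scale), "GB minimum")
print("room for a small checkpoint here:", has_room(".", "small"))
```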

# training from scratch with DAC

TBA, still being tested. The code below may help you get started, but probably won't figure everything out for you:

```py
!git clone https://github.com/facebookresearch/audiocraft.git
%cd audiocraft
!pip install -e .
!pip install dora-search descript-audio-codec
!pip install -U protobuf
```

```py
%env USER=lyra
command = (
    "dora -P audiocraft run -d"
    " solver=musicgen/musicgen_base_32khz"
        # should probably put all this in a config, its no longer 32k
        # mostly just using this to handle defaults i dont want to worry about setting here
    " model/lm/model_scale=medium"
    " compression_model_checkpoint=//pretrained/dac_44khz"
    " sample_rate=44100"
    " transformer_lm.card=1024"
    " transformer_lm.n_q=9"
    " codebooks_pattern.modeling=delay"
    " codebooks_pattern.delay.delays=[0,1,2,3,4,5,6,7,8]"
    " conditioner=text2music"
    " dset=audio/train"
    " dataset.num_workers=2"
    " dataset.valid.num_samples=1"
    " dataset.batch_size=2"
    " schedule.cosine.warmup=8"
    " optim.optimizer=adamw"
    " optim.lr=1e-4"
    " optim.epochs=5"
    " optim.updates_per_epoch=10"
    " optim.adam.weight_decay=0.01"
    " fsdp.use=true"
    " autocast=false"
)
!{command}
```

DISCLAIMER: Not everything in this notebook is guaranteed to run first try. I do my best to keep it functional, but edge cases and bugs slip thru my fingers frequently. If something is broken or needs changing, please let me know!

I'm `lyraaaa_` on discord, and [@_lyraaaa_](https://twitter.com/_lyraaaa_) on twitter/X.

Community musicgen discord server for development/testing, sharing outputs, and asking questions:

https://discord.gg/h498MAYZ4e

licensing: just go ahead and use stuff from here if you want. credit is cool but optional