CodecLM is a research codebase for codec-token audio language modeling inspired by Moshi and Mimi-style workflows.
It provides three configurable model families:
- `flat_rvq`: audio-only flat-transformer RVQ baseline
- `qwen_flat_joint`: flat text+audio joint modeling with a Qwen2.5-1.5B-Instruct backbone
- `separable_qwen`: temporal Qwen2.5-1.5B-Instruct + depth transformer
Core workflow: prepare cache -> train -> generate samples
- Install dependencies

```bash
pip install torch torchaudio lightning transformers pyyaml
```

- Prepare cache

```bash
python -m audiolm.scripts.prepare_dataset \
  --config configs/experiments/qwen_flat_joint_audio_text.yaml \
  --set data.data_dir=./data \
  --set runtime.codec_device=cuda
```

- Train

```bash
python -m audiolm.scripts.train \
  --config configs/experiments/qwen_flat_joint_audio_text.yaml
```

- Generate samples

```bash
python -m audiolm.scripts.generate_samples \
  --checkpoint ./my_model.ckpt \
  --config configs/experiments/separable_qwen_audio_text.yaml
```

Smoke-test a config quickly on a single device with Lightning's `fast_dev_run`:

```bash
python -m audiolm.scripts.train \
  --config configs/experiments/qwen_flat_joint_audio_text.yaml \
  --set trainer.fast_dev_run=true \
  --set trainer.devices=1 \
  --set runtime.codec_device=cuda
```

| Run | Base model | Total params | LoRA | Data | Setup | Epochs | Best val metric |
|---|---|---|---|---|---|---|---|
| separable_qwen | Qwen2.5-1.5B-Instruct | 1.8B | disabled | LibriSpeech train-clean-360 -> dev-clean | 8 GPU DDP | 10 | val loss = 15 |
Additional notes for this run:
- Full Qwen training (not LoRA-only)
- Best checkpoint selected by minimum validation loss
- Loss weights: `alpha_text=2.0, alpha_cb1=1.0, alpha_depth=5.0, alpha_audio=1.0` (combined as sketched below)
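
For reference, a minimal sketch of how these weights could combine the per-head losses. The variable names and values here are illustrative, not the repo's actual code:

```python
import torch

# Dummy per-head losses for illustration; in the real training step these
# come from the text head, the first-codebook head, the depth transformer,
# and the remaining audio codebooks respectively.
loss_text = torch.tensor(0.31)
loss_cb1 = torch.tensor(0.84)
loss_depth = torch.tensor(1.02)
loss_audio = torch.tensor(0.77)

# Weighted sum using this run's weights:
# alpha_text=2.0, alpha_cb1=1.0, alpha_depth=5.0, alpha_audio=1.0
total_loss = (
    2.0 * loss_text
    + 1.0 * loss_cb1
    + 5.0 * loss_depth
    + 1.0 * loss_audio
)
```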
Curated v0.1.0 samples:
| Model | Prompt | Dataset | Audio |
|---|---|---|---|
| separable_qwen | first two seconds | LibriSpeech dev-clean | sample_00.wav |
| separable_qwen | first two seconds | LibriSpeech dev-clean | sample_01.wav |
| separable_qwen | first two seconds | LibriSpeech dev-clean | sample_02.wav |
| separable_qwen | first two seconds | LibriSpeech dev-clean | sample_03.wav |
| separable_qwen | first two seconds | LibriSpeech dev-clean | sample_04.wav |
| Model | Conditioning | Best for |
|---|---|---|
| flat_rvq | audio_only | smallest audio-only baseline |
| qwen_flat_joint | audio_text | flat joint sequence objective |
| separable_qwen | audio_text | temporal-depth factorization |
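
To make the temporal-depth factorization concrete, here is a conceptual sketch: a temporal transformer runs across frames, and a small depth transformer expands each frame state into per-codebook logits. All class and variable names are illustrative; the actual `separable_qwen` implementation uses the Qwen backbone as the temporal model and lives under `audiolm/model/models/`.

```python
import torch
import torch.nn as nn

class SeparableSketch(nn.Module):
    """Illustrative temporal + depth factorization; causal masks and the
    Qwen backbone are omitted for brevity."""

    def __init__(self, d_model=512, n_codebooks=8, vocab=1024):
        super().__init__()
        self.temporal = nn.TransformerEncoder(  # attends across frames
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.depth = nn.TransformerEncoder(     # attends across codebooks in a frame
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=1)
        self.cb_embed = nn.Embedding(n_codebooks, d_model)
        self.head = nn.Linear(d_model, vocab)
        self.n_codebooks = n_codebooks

    def forward(self, frame_emb):            # (batch, frames, d_model)
        h = self.temporal(frame_emb)         # one hidden state per frame
        b, t, d = h.shape
        # Depth transformer sees the frame state plus a codebook-position embedding.
        cb = self.cb_embed.weight.expand(b * t, -1, -1)   # (b*t, K, d)
        ctx = h.reshape(b * t, 1, d) + cb                 # broadcast frame state over K
        logits = self.head(self.depth(ctx))               # (b*t, K, vocab)
        return logits.view(b, t, self.n_codebooks, -1)

logits = SeparableSketch()(torch.randn(2, 10, 512))  # -> (2, 10, 8, 1024)
```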
Use repeated `--set key=value` flags to override YAML fields without editing files.
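
As a sketch of the mechanics behind these overrides (illustrative only; the repo's actual parser lives in its config-loading code and may differ), each dotted key walks the nested YAML dictionary:

```python
import yaml

def apply_override(cfg: dict, assignment: str) -> None:
    """Apply one --set key=value override to a nested config dict."""
    key, raw = assignment.split("=", 1)
    node = cfg
    parts = key.split(".")
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    # yaml.safe_load coerces "true" -> True, "1" -> 1, and leaves "cuda" a string.
    node[parts[-1]] = yaml.safe_load(raw)

cfg = {"trainer": {"devices": 8}, "runtime": {}}
apply_override(cfg, "trainer.fast_dev_run=true")
apply_override(cfg, "runtime.codec_device=cuda")
# cfg == {"trainer": {"devices": 8, "fast_dev_run": True},
#         "runtime": {"codec_device": "cuda"}}
```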
- Add a new dataset source:
  - implement a datamodule and wire it in `audiolm/data/factory.py`
- Add a new model variant:
  - implement the model class under `audiolm/model/models/`
  - register it in `audiolm/model/factory.py` (see the factory sketch after this list)
- Add a new experiment:
  - copy a config from `configs/experiments/`
  - edit the model/data/optimizer fields
  - run with `python -m audiolm.scripts.train --config <new_file>.yaml`
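
Registration in `audiolm/model/factory.py` presumably maps a config-visible name to a constructor; a hypothetical sketch of that pattern (`MODEL_REGISTRY`, `register_model`, and `build_model` are illustrative names, not the repo's API):

```python
from typing import Callable, Dict, Type

# Hypothetical registry; the real factory in audiolm/model/factory.py may differ.
MODEL_REGISTRY: Dict[str, Type] = {}

def register_model(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a model under a config-visible name."""
    def deco(cls: Type) -> Type:
        MODEL_REGISTRY[name] = cls
        return cls
    return deco

@register_model("my_new_variant")
class MyNewVariant:
    def __init__(self, **model_cfg):
        self.cfg = model_cfg

def build_model(name: str, **model_cfg):
    """Instantiate the model named in the experiment YAML."""
    return MODEL_REGISTRY[name](**model_cfg)
```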
- Data pipeline: docs/DATA_PIPELINE.md
- Codec logic: docs/CODEC.md
- Model logic: docs/MODELS.md
- Config writing: docs/CONFIGS.md
- Troubleshooting: docs/TROUBLESHOOTING.md
- Config folder notes: configs/README.md
- `audiolm/scripts`: entrypoints (`prepare_dataset`, `train`, `generate_samples`)
- `audiolm/data`: alignment, caching, datamodule, collator
- `audiolm/model`: model factory, model implementations, runtime codec helpers
- `configs/experiments`: runnable experiment YAML files
- Add standardized evaluation and expanded metrics table.
- Add additional dataset adapters.
- Add a dual audio stream for full-duplex conversation.
- Add additional LLM backbones.
- Add acoustic delay (similar to Moshi); see the sketch below.
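
For context on the acoustic-delay item: in Moshi, the acoustic codebooks are emitted a fixed number of frames later than the semantic codebook, which eases autoregressive modeling. A minimal sketch of that shift (illustrative, not the planned implementation):

```python
import torch

def apply_acoustic_delay(codes: torch.Tensor, delay: int = 1, pad: int = 0) -> torch.Tensor:
    """Shift codebooks 2..K `delay` frames later than codebook 1.
    codes: (n_codebooks, n_frames) integer codec tokens."""
    k, t = codes.shape
    out = torch.full((k, t + delay), pad, dtype=codes.dtype)
    out[0, :t] = codes[0]        # semantic codebook: no delay
    out[1:, delay:] = codes[1:]  # acoustic codebooks: delayed by `delay` frames
    return out
```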
- Citation metadata: CITATION.cff