CodecLM

CodecLM is a PyTorch research codebase for codec-token audio language modeling, inspired by Moshi- and Mimi-style workflows.

It provides three configurable model families:

  • flat_rvq: audio-only flat-transformer RVQ baseline
  • qwen_flat_joint: flat text+audio joint modeling with a Qwen2.5-1.5B-Instruct backbone
  • separable_qwen: temporal Qwen2.5-1.5B-Instruct backbone plus a depth transformer

Core workflow: prepare cache -> train -> generate samples

Training Pipeline (diagram)

Model Architecture (diagram)

Quick Start

  1. Install dependencies

pip install torch torchaudio lightning transformers pyyaml

  2. Prepare cache

python -m audiolm.scripts.prepare_dataset \
  --config configs/experiments/qwen_flat_joint_audio_text.yaml \
  --set data.data_dir=./data \
  --set runtime.codec_device=cuda

  3. Train

python -m audiolm.scripts.train \
  --config configs/experiments/qwen_flat_joint_audio_text.yaml

  4. Generate samples

python -m audiolm.scripts.generate_samples \
  --checkpoint ./my_model.ckpt \
  --config configs/experiments/separable_qwen_audio_text.yaml

Fast Smoke Run

python -m audiolm.scripts.train \
  --config configs/experiments/qwen_flat_joint_audio_text.yaml \
  --set trainer.fast_dev_run=true \
  --set trainer.devices=1 \
  --set runtime.codec_device=cuda

Preliminary Result (v0.1.0)

Run | Base model | Total params | LoRA | Data | Setup | Epochs | Best val metric
separable_qwen | Qwen2.5-1.5B-Instruct | 1.8B | disabled | LibriSpeech train-clean-360 -> dev-clean | 8 GPU DDP | 10 | val loss = 15

Additional notes for this run:

  • Full Qwen training (not LoRA-only)
  • Best checkpoint selected by minimum validation loss
  • Loss weights: alpha_text=2.0, alpha_cb1=1.0, alpha_depth=5.0, alpha_audio=1.0
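
For orientation, these weights presumably scale a weighted sum of the per-stream losses. The snippet below is a minimal sketch of that combination, assuming one scalar loss per stream; it is illustrative, not CodecLM's actual training code.

# Hypothetical sketch: combine per-stream losses with the weights above.
# The alpha_* values mirror the run config; the real objective may differ.
def total_loss(loss_text, loss_cb1, loss_depth, loss_audio,
               alpha_text=2.0, alpha_cb1=1.0, alpha_depth=5.0, alpha_audio=1.0):
    return (alpha_text * loss_text
            + alpha_cb1 * loss_cb1
            + alpha_depth * loss_depth
            + alpha_audio * loss_audio)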

Samples

Curated v0.1.0 samples:

Model | Prompt | Dataset | Audio
separable_qwen | first two seconds | LibriSpeech dev-clean | sample_00.wav
separable_qwen | first two seconds | LibriSpeech dev-clean | sample_01.wav
separable_qwen | first two seconds | LibriSpeech dev-clean | sample_02.wav
separable_qwen | first two seconds | LibriSpeech dev-clean | sample_03.wav
separable_qwen | first two seconds | LibriSpeech dev-clean | sample_04.wav

Model Choices

Model | Conditioning | Best for
flat_rvq | audio_only | smallest audio-only baseline
qwen_flat_joint | audio_text | flat joint sequence objective
separable_qwen | audio_text | temporal-depth factorization
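
To make "temporal-depth factorization" concrete, here is a minimal, self-contained sketch of the general Moshi-style idea: a large temporal transformer contextualizes one embedding per audio frame, and a small depth transformer then predicts the K RVQ codebook tokens within each frame. All module names and shapes are illustrative assumptions, not CodecLM's actual classes.

# Illustrative temporal-depth factorization (Moshi-style); hypothetical shapes.
import torch
import torch.nn as nn

B, T, K, D, V = 2, 50, 8, 256, 2048   # batch, frames, codebooks, dim, vocab

temporal = nn.TransformerEncoder(      # stand-in for the Qwen backbone
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
depth = nn.TransformerEncoder(         # small per-frame depth transformer
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=1)
head = nn.Linear(D, V)

frame_emb = torch.randn(B, T, D)       # one summary embedding per frame
code_emb = torch.randn(B, T, K, D)     # embeddings of the K codebook slots

h = temporal(frame_emb)                          # (B, T, D) temporal context
x = code_emb + h.unsqueeze(2)                    # broadcast context over depth
logits = head(depth(x.reshape(B * T, K, D)))     # depth attends within a frame
print(logits.shape)                              # torch.Size([100, 8, 2048])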

Use repeated --set key=value flags to override YAML fields without editing files.
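
As a rough illustration of how such dotted overrides could be applied to a loaded YAML config, here is a minimal sketch; the function name apply_override and the parsing details are assumptions, not the repo's actual CLI code.

# Hypothetical sketch: apply an "a.b.c=value" override to a nested config dict.
import yaml

def apply_override(cfg: dict, assignment: str) -> None:
    key, _, raw = assignment.partition("=")
    node = cfg
    parts = key.split(".")
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = yaml.safe_load(raw)  # "true" -> True, "1" -> 1, etc.

cfg = {"trainer": {"devices": 8}}
apply_override(cfg, "trainer.fast_dev_run=true")
apply_override(cfg, "trainer.devices=1")
print(cfg)  # {'trainer': {'devices': 1, 'fast_dev_run': True}}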

5-Minute Extension Guide

  1. Add a new dataset source:

    • implement a datamodule and wire it in audiolm/data/factory.py
  2. Add a new model variant:

    • implement model class under audiolm/model/models/
    • register it in audiolm/model/factory.py (see the sketch after this list)
  3. Add a new experiment:

    • copy a config from configs/experiments/
    • edit model/data/optimizer fields
    • run with python -m audiolm.scripts.train --config <new_file>.yaml
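
A common pattern for step 2 is a name-to-class registry. The sketch below shows one plausible shape for audiolm/model/factory.py; the register_model/build_model names and the decorator API are assumptions, so check the actual factory before copying this.

# Hypothetical registry sketch for audiolm/model/factory.py (names assumed).
MODEL_REGISTRY = {}

def register_model(name):
    def decorator(cls):
        MODEL_REGISTRY[name] = cls   # map the config's model name to its class
        return cls
    return decorator

def build_model(name, **kwargs):
    try:
        return MODEL_REGISTRY[name](**kwargs)
    except KeyError:
        raise ValueError(
            f"Unknown model '{name}'. Available: {sorted(MODEL_REGISTRY)}")

@register_model("my_new_variant")    # name referenced from the experiment YAML
class MyNewVariant:
    def __init__(self, hidden_dim=512):
        self.hidden_dim = hidden_dim

model = build_model("my_new_variant", hidden_dim=256)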

Documentation

Project Structure

  • audiolm/scripts: entrypoints (prepare_dataset, train, generate_samples)
  • audiolm/data: alignment, caching, datamodule, collator
  • audiolm/model: model factory, model implementations, runtime codec helpers
  • configs/experiments: runnable experiment YAML files

Near-Term Roadmap

  • Add standardized evaluation and an expanded metrics table
  • Add additional dataset adapters
  • Add a dual audio stream for full-duplex conversation
  • Add additional LLM backbones
  • Add acoustic delay (similar to Moshi; see the sketch after this list)
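
For context on the last item, Moshi-style acoustic delay shifts codebook k by k frames so coarse tokens are generated before the fine tokens that refine them. The sketch below illustrates the delay pattern in isolation; apply_delay and its padding convention are hypothetical, since this feature is not yet implemented here.

# Hypothetical sketch of a Moshi-style acoustic delay pattern.
import torch

def apply_delay(codes: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # codes: (K, T) RVQ tokens; returns a (K, T + K - 1) delayed grid
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]   # codebook k starts k frames later
    return out

codes = torch.arange(12).reshape(3, 4)   # 3 codebooks, 4 frames
print(apply_delay(codes, pad_id=-1))
# tensor([[ 0,  1,  2,  3, -1, -1],
#         [-1,  4,  5,  6,  7, -1],
#         [-1, -1,  8,  9, 10, 11]])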

Citation
