Syllable-level rhythm correction for rap vocals. RapMap takes a beat, a dry human rap vocal, and lyrics, then deterministically edits the vocal so every syllable lands on the beat. The original voice is preserved -- AI is only used for guide generation and alignment, never for vocal transformation.
The output is a transparent Audacity session with visible clips, labels, and tracks that you can inspect and adjust.
Inputs: backing_track.wav + human_rap.wav + lyrics.txt
|
+-- Phase 0: Normalize -- resample to 48kHz, mono for analysis
+-- Phase 1: Guide -- AI guide vocal or beat-only mode
+-- Phase 2: Syllabify -- lyrics -> CMUdict -> canonical syllables
+-- Phase 3: Align -- MFA forced alignment -> syllable timestamps
+-- Phase 4: Anchors -- human_anchor[i] -> guide_anchor[i]
+-- Phase 5: Group -- safe-boundary clip grouping
+-- Phase 6: Plan -- deterministic edit plan (cut/stretch/crossfade)
+-- Phase 7: Render -- Rubber Band time-stretch -> corrected vocal
+-- Phase 8: Audacity -- mod-script-pipe -> tracks + labels
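Phases 4 through 7 amount to piecewise time-scaling between consecutive anchors. The sketch below illustrates that idea only; it is not taken from RapMap's source. The function name `segment_ratios` and its arguments are hypothetical, and it assumes both anchor lists hold strictly increasing integer sample indices at the same sample rate.

```python
from fractions import Fraction


def segment_ratios(human_anchors, guide_anchors):
    """Return one stretch ratio per segment between consecutive anchors.

    A ratio above 1 means the human segment must be lengthened to reach
    the guide timing; below 1 means it must be shortened.
    (Hypothetical helper, not RapMap's API.)
    """
    if len(human_anchors) != len(guide_anchors):
        raise ValueError("anchor lists must have the same length")

    ratios = []
    for i in range(len(human_anchors) - 1):
        src_len = human_anchors[i + 1] - human_anchors[i]
        dst_len = guide_anchors[i + 1] - guide_anchors[i]
        if src_len <= 0 or dst_len <= 0:
            raise ValueError(f"anchors must be strictly increasing (segment {i})")
        # Exact rational arithmetic avoids float drift when planning edits.
        ratios.append(Fraction(dst_len, src_len))
    return ratios


# Three syllable anchors, in samples at 48 kHz.
human = [0, 12000, 30000]
guide = [0, 12000, 36000]
ratios = segment_ratios(human, guide)
```

Here the first segment is left untouched and the second is stretched by 4/3. Keeping ratios as exact fractions is one way to keep the plan deterministic before any audio is processed.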
Two modes:
- Guide mode -- uses an AI-generated or manually-supplied rap vocal as a timing reference. Syllable anchors in the human vocal are mapped to the guide's timing.
- Beat-only mode -- detects BPM from the backing track and snaps syllable anchors to the beat grid. No guide vocal needed.
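As a rough illustration of beat-only snapping, the sketch below builds an integer-sample grid from a known BPM and moves each anchor toward its nearest grid point. It is a minimal sketch under stated assumptions, not RapMap's implementation: `beat_grid` and `snap` are hypothetical names, the BPM is taken as given rather than detected, and `strength` is assumed to blend linearly between the original position (0.0) and the grid position (1.0).

```python
def beat_grid(bpm, sample_rate, n_samples, subdivisions_per_beat=2, offset=0):
    """Integer sample positions of grid points.

    subdivisions_per_beat=2 gives an eighth-note grid when the beat is a
    quarter note. (Hypothetical helper, not RapMap's API.)
    """
    step = 60.0 * sample_rate / (bpm * subdivisions_per_beat)
    grid = []
    k = 0
    while True:
        pos = offset + round(k * step)
        if pos >= n_samples:
            break
        grid.append(pos)
        k += 1
    return grid


def snap(anchor, grid, strength=1.0):
    """Move an anchor toward its nearest grid point.

    strength=1.0 lands exactly on the grid; strength=0.0 leaves the
    anchor where it was. This reading of "strength" is an assumption.
    """
    nearest = min(grid, key=lambda g: abs(g - anchor))
    return round(anchor + strength * (nearest - anchor))


# 120 BPM at 48 kHz: eighth notes fall every 12000 samples.
grid = beat_grid(bpm=120, sample_rate=48_000, n_samples=48_000)
snapped = [snap(a, grid) for a in (13_000, 23_500, 37_200)]
```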
Requires Python 3.11+ and uv.
# Core dependencies
uv sync
# With beat detection (librosa)
uv sync --extra beat
# With interactive editor (Flask + pywebview)
uv sync --extra editor
# With MFA alignment
uv sync --extra align
# Development (pytest + ruff)
uv sync --extra dev
NLTK's CMUdict and the g2p_en model are downloaded automatically on first use.
uv run rapmap run \
--backing inputs/backing.wav \
--human inputs/human_rap.wav \
--lyrics inputs/lyrics.txt \
--out workdir \
--mode beat-only
uv run rapmap run \
--backing inputs/backing.wav \
--human inputs/human_rap.wav \
--lyrics inputs/lyrics.txt \
--guide inputs/guide_vocal.wav \
--out workdir \
--mode guide
# Phase 0: Normalize
uv run rapmap init --backing inputs/backing.wav --human inputs/human_rap.wav --lyrics inputs/lyrics.txt --out workdir
# Phase 1: Set guide (or skip for beat-only)
uv run rapmap set-guide --project workdir --guide inputs/guide.wav
# Phase 2: Syllabify
uv run rapmap syllabify --project workdir
# Phase 3: Align
uv run rapmap align --project workdir --role guide
uv run rapmap align --project workdir --role human
# Phase 4: Build anchor map
uv run rapmap anchors --project workdir --anchor onset
# Phases 5-6: Group + plan
uv run rapmap plan --project workdir --grouping safe_boundary
# Phase 7: Render
uv run rapmap render --project workdir
# Phase 8: Audacity session
uv run rapmap audacity --project workdir --open
uv run rapmap init --backing inputs/backing.wav --human inputs/human_rap.wav --lyrics inputs/lyrics.txt --out workdir
uv run rapmap syllabify --project workdir
uv run rapmap align --project workdir --role human
uv run rapmap detect-beats --project workdir --subdivision eighth --strength 1.0
uv run rapmap plan --project workdir
uv run rapmap render --project workdir
uv run rapmap audacity --project workdir
Launch a waveform-based syllable timing editor:
uv run rapmap editor --project workdir # native window (pywebview)
uv run rapmap editor --project workdir --browser # fallback: open in browser
Launch Audacity + editor side by side:
uv run rapmap studio --project workdir
All grouping modes produce valid results from the same anchor map:
| Mode | Description |
|---|---|
| `safe_boundary` | Split at acoustically safe points (default) |
| `word` | Split at word boundaries |
| `syllable_with_handles` | One clip per syllable with pre/post handles and crossfades |
| `strict_syllable` | Hard cut per syllable (debug mode) |
| `phrase` | Group by line/breath |
| `bar` | Group by bar/newline |
uv run rapmap plan --project workdir --grouping word
| Strategy | Description |
|---|---|
| `onset` | Syllable onset to guide onset (default, best for rap) |
| `vowel_nucleus` | Vowel center to guide vowel center |
| `end` | Syllable end to guide end |
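The three strategies differ only in which point of a syllable is treated as its anchor. The sketch below shows one way to express that choice; the `Syllable` dataclass, its field names, and `anchor_sample` are hypothetical and not taken from RapMap's code. It assumes alignment yields integer sample boundaries for both the syllable and its vowel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    # All positions are integer sample indices (hypothetical layout).
    start: int        # syllable onset
    end: int          # syllable end
    vowel_start: int  # start of the vowel phone
    vowel_end: int    # end of the vowel phone


def anchor_sample(syl, strategy="onset"):
    """Pick the anchor point of a syllable for the given strategy."""
    if strategy == "onset":
        return syl.start
    if strategy == "vowel_nucleus":
        # Midpoint of the vowel, floored to stay on an integer sample.
        return (syl.vowel_start + syl.vowel_end) // 2
    if strategy == "end":
        return syl.end
    raise ValueError(f"unknown anchor strategy: {strategy!r}")


syl = Syllable(start=4800, end=14400, vowel_start=7200, vowel_end=12000)
anchors = {s: anchor_sample(syl, s) for s in ("onset", "vowel_nucleus", "end")}
```

The same function would be applied to both the human and the guide syllable, so that like is mapped to like (onset to onset, vowel center to vowel center, end to end).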
src/rapmap/
cli.py # CLI entry point (click)
config.py # Config loading and defaults
audio/ # I/O, normalization, time-stretch, rendering
beat/ # Beat detection, grid, syllable quantization
lyrics/ # Parsing, syllabification, CMUdict + g2p
align/ # MFA forced alignment, TextGrid parsing
timing/ # Anchor mapping, confidence scoring
edit/ # Clip grouping, edit planning, operations
guide/ # Guide vocal generation (model-adapter pattern)
audacity/ # Label tracks, script_pipe, session builder
editor/ # Interactive syllable editor (Flask + wavesurfer.js)
studio/ # Studio launcher (Audacity + editor)
Every syllable anchor in the rendered output must land exactly on the guide anchor, at integer sample precision:
For every syllable i:
rendered_anchor_sample[i] == guide_anchor_sample[i]
Zero-sample error. No tolerance. The render fails if this cannot be achieved.
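A check of this invariant could look like the sketch below. It is an illustration only: `AnchorMismatchError` and `check_anchors` are hypothetical names, and the real validation may report errors differently.

```python
class AnchorMismatchError(Exception):
    """Raised when a rendered anchor misses its guide anchor (hypothetical)."""


def check_anchors(rendered_anchor_sample, guide_anchor_sample):
    """Abort unless every rendered anchor equals its guide anchor exactly."""
    if len(rendered_anchor_sample) != len(guide_anchor_sample):
        raise AnchorMismatchError(
            f"anchor count differs: {len(rendered_anchor_sample)} rendered, "
            f"{len(guide_anchor_sample)} guide"
        )
    for i, (r, g) in enumerate(zip(rendered_anchor_sample, guide_anchor_sample)):
        if r != g:  # integer comparison, so no tolerance is involved
            raise AnchorMismatchError(
                f"syllable {i}: rendered {r} != guide {g} (off by {r - g} samples)"
            )


check_anchors([0, 12000, 36000], [0, 12000, 36000])  # passes silently
```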
- Integer sample indices everywhere -- all timing uses 48kHz integer sample indices internally. Seconds are only used for Audacity label export.
- Deterministic Phases 4-8 -- no neural models after alignment. Only cut, stretch, crossfade, and place audio.
- Fail-loud validation -- the pipeline aborts rather than silently producing incorrect edits.
- Pronunciation overrides -- rap slang (tryna, finna, ion, etc.) handled via YAML override file.
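To illustrate the first principle, the sketch below converts integer sample ranges to seconds only at the export boundary, producing lines in Audacity's tab-separated label format (start, end, text). The function name `to_label_lines` and the tuple layout are hypothetical; the six-decimal formatting is an assumption, not something RapMap is known to use.

```python
SAMPLE_RATE = 48_000  # internal timing unit: integer samples at 48 kHz


def to_label_lines(syllables, sample_rate=SAMPLE_RATE):
    """Format (start_sample, end_sample, text) tuples as Audacity label lines."""
    lines = []
    for start, end, text in syllables:
        # Seconds appear only here, at the export boundary.
        lines.append(f"{start / sample_rate:.6f}\t{end / sample_rate:.6f}\t{text}")
    return lines


lines = to_label_lines([(0, 12000, "try"), (12000, 24000, "na")])
```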
uv run pytest tests/
uv run ruff check src/ tests/
The example/ directory contains a backing track (beat.m4a) and lyrics (lyrics.txt) for testing.
TBD