This paper presents MambaVoiceCloning (MVC), a scalable and expressive text-to-speech (TTS) framework that unifies state-space sequence modeling with diffusion-driven style control. Unlike prior diffusion-based models, MVC replaces all self-attention and recurrent components in the TTS pipeline with novel Mamba-based modules: a Bi-Mamba Text Encoder, a Temporal Bi-Mamba Encoder, and an Expressive Mamba Predictor. These modules capture long-range phonetic and prosodic dependencies in linear time, improving efficiency and expressiveness without relying on external reference encoders. While MVC uses a diffusion-based decoder for waveform generation, our contribution is architectural: we introduce the first end-to-end Mamba-integrated TTS backbone. Extensive experiments on LJSpeech and LibriTTS demonstrate that MVC significantly improves naturalness, prosody, intelligibility, and latency over state-of-the-art methods, while maintaining a lightweight footprint of 21M parameters and training 1.6× faster than comparable Transformer-based baselines.
Explore MVC's expressive and high-quality speech synthesis through our audio samples: MVC Audio Demos
- Efficient State-Space Modeling: Utilizes Mamba blocks for linear-time sequence modeling, significantly reducing computation time and memory overhead compared to traditional self-attention mechanisms.
- Lightweight Temporal and Spectrogram Encoders: Includes optimized BiMambaTextEncoder, TemporalBiMambaEncoder, and ExpressiveMambaEncoder with depthwise separable convolutions for a reduced parameter count.
- Dynamic Style Conditioning: Integrates AdaLayerNorm for style modulation, enabling flexible control over prosody and speaker style during synthesis (see the sketch after this list).
- Advanced Gating Mechanisms: Employs grouped convolutional gating for efficient residual connections, minimizing parameter overhead while maintaining expressiveness.
- Optimized Inference Path: Supports gradient checkpointing and efficient feature aggregation, reducing memory usage during both training and inference.
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA-enabled GPU (recommended for training)
- Mamba SSM (required for the Mamba-based encoders)
Clone the repository and install dependencies:

```bash
git clone https://github.com/aiai-9/MVC.git
cd MVC
pip install -r requirements.txt
```

To install the Mamba SSM module, use the following command:
```bash
pip install git+https://github.com/state-spaces/mamba.git
```
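To confirm the environment is ready, a quick sanity check can import the dependencies and run a dummy sequence through a single Mamba block; this snippet is our own sketch (mamba_ssm's selective-scan kernels require a CUDA device):

```python
# Sanity check: core dependencies import and a Mamba block runs on the GPU.
import torch
from mamba_ssm import Mamba

assert torch.cuda.is_available(), "A CUDA-enabled GPU is recommended for training."

block = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")
x = torch.randn(1, 128, 256, device="cuda")  # (batch, seq_len, d_model)
print(block(x).shape)  # torch.Size([1, 128, 256])
```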
First stage training (Text Encoder, Duration Encoder, Prosody Predictor):

```bash
accelerate launch train_first.py --config_path ./configs/config.yml
```

Second stage training (diffusion-based decoder and adversarial refinement):
```bash
python train_second.py --config_path ./configs/config.yml
```

Generate high-quality speech with pre-trained models:
```bash
python inference.py --config_path ./configs/config.yml --input_text "Hello, this is MambaVoiceCloning."
```
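To synthesize several utterances in one go, a simple driver can loop over the CLI shown above. This sketch assumes inference.py accepts one --input_text per invocation and that sentences.txt (a hypothetical file) holds one sentence per line:

```python
# Hypothetical batch driver; assumes inference.py takes one --input_text per call.
import subprocess

with open("sentences.txt", encoding="utf-8") as f:
    for line in f:
        text = line.strip()
        if not text:
            continue
        subprocess.run(
            [
                "python", "inference.py",
                "--config_path", "./configs/config.yml",
                "--input_text", text,
            ],
            check=True,  # stop on the first failed synthesis
        )
```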
MVC consists of three core components (a bidirectional-layer sketch follows the list):

- Bi-Mamba Text Encoder: Efficiently captures phoneme-level context using bidirectional state-space models (SSMs).
- Expressive Mamba Encoder: Enhances prosodic variation and speaker expressiveness.
- Temporal Bi-Mamba Encoder: Models rhythmic structures and duration alignment for natural speech generation.
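For intuition, here is a minimal sketch of how a bidirectional Mamba layer can be built from the mamba_ssm package: one block processes the sequence forward, a second processes the time-reversed sequence, and a linear layer fuses the two passes. The module name and fusion scheme are illustrative assumptions, not the repository's exact design.

```python
# Illustrative bidirectional Mamba layer (an assumption, not the repo's module).
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # requires a CUDA device at runtime

class BiMambaLayer(nn.Module):
    """Fuses a forward and a time-reversed Mamba pass for bidirectional context."""

    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)
        self.bwd = Mamba(d_model=d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        h_fwd = self.fwd(x)
        h_bwd = self.bwd(x.flip(1)).flip(1)  # scan reversed, then restore order
        return self.proj(torch.cat([h_fwd, h_bwd], dim=-1))
```

Running two independent blocks doubles the per-layer cost, but each scan stays strictly causal in its own direction, so the layer preserves Mamba's linear-time complexity.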
Run objective and subjective evaluations using provided scripts:

```bash
python evaluate.py --config_path ./configs/config.yml
```

Subjective evaluation: naturalness (MOS-N) and similarity (MOS-S) mean opinion scores.

| Model | MOS-N (↑) | MOS-S (↑) |
|---|---|---|
| Ground Truth | 4.60 ± 0.09 | 4.35 ± 0.10 |
| VITS | 3.69 ± 0.12 | 3.54 ± 0.13 |
| StyleTTS2 | 4.15 ± 0.11 | 4.03 ± 0.11 |
| MVC (Ours) | 4.22 ± 0.10 | 4.07 ± 0.10 |
Mean opinion scores for in-distribution (MOS_ID) and out-of-distribution (MOS_OOD) inputs:

| Model | MOS_ID (↑) | MOS_OOD (↑) |
|---|---|---|
| Ground Truth | 3.81 ± 0.09 | 3.70 ± 0.11 |
| StyleTTS2 | 3.83 ± 0.08 | 3.87 ± 0.08 |
| JETS | 3.57 ± 0.10 | 3.21 ± 0.12 |
| VITS | 3.44 ± 0.10 | 3.21 ± 0.11 |
| MVC (Ours) | 3.87 ± 0.07 | 3.88 ± 0.09 |
Objective evaluation: F0 root-mean-square error (F0 RMSE), mel-cepstral distortion (MCD), word error rate (WER), and real-time factor (RTF).

| Model | F0 RMSE (↓) | MCD (↓) | WER (↓) | RTF (↓) |
|---|---|---|---|---|
| VITS | 0.667 ± 0.011 | 4.97 ± 0.09 | 7.23% | 0.0211 |
| StyleTTS2 | 0.651 ± 0.013 | 4.93 ± 0.06 | 6.50% | 0.0185 |
| MVC (Ours) | 0.653 ± 0.014 | 4.91 ± 0.07 | 6.52% | 0.0177 |
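For reference, the real-time factor reported above is synthesis wall-clock time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. A minimal measurement sketch, where `synthesize` is a placeholder for any TTS call:

```python
# Minimal RTF measurement sketch; `synthesize` is a placeholder for any TTS
# function that returns a 1-D array of audio samples.
import time

def measure_rtf(synthesize, text: str, sample_rate: int) -> float:
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds  # RTF < 1.0 means faster than real time
```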
- NaN Loss: Ensure the batch size is properly set (e.g., 16 for stable training).
- Out of Memory: Reduce the batch size or sequence length if OOM errors occur (see the gradient-checkpointing sketch after this list).
- Audio Quality Issues: Fine-tune model hyperparameters for specific datasets.
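If OOM errors persist, gradient checkpointing (listed under the features above) trades recomputation for memory. A generic PyTorch sketch, not tied to the repository's modules:

```python
# Generic gradient-checkpointing sketch; `layers` stands in for any stack of
# encoder blocks. Activations are recomputed during the backward pass instead
# of being stored, cutting peak memory at the cost of extra compute.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = checkpoint(layer, x)  # recompute this block's activations on backward
        return x

stack = CheckpointedStack(nn.ModuleList(nn.Linear(256, 256) for _ in range(4)))
x = torch.randn(2, 256, requires_grad=True)  # grads must flow for checkpointing
stack(x).sum().backward()
```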
This project is released under the MIT License. See the LICENSE file for more details.
We welcome contributions! Please read the CONTRIBUTING.md file for guidelines on code style, pull requests, and community support.
MVC builds on prior work from the Mamba, StyleTTS2, and VITS communities. We thank the authors for their foundational contributions to the field of TTS.
