📖 Paper (IJCAI 2026) | 🤗 Datasets | 🖥️ Quick Start | ©️ Citation
This repository provides the official implementation of Formula-Speech, which has been accepted at IJCAI 2026. Formula-Speech is the first end-to-end Large Speech Language Model (LSLM) designed for scientific formula verbalization toward accessible learning. It is built upon GLM-4-Voice and optimized with a lightweight two-stage training framework that combines supervised fine-tuning with reinforcement learning guided by custom rewards.
- LoRA-based adaptation: Efficiently adapts GLM-4-Voice for scientific formula verbalization.
- Two-stage training framework: Combines supervised fine-tuning and GRPO-based reinforcement learning for formula-to-speech alignment.
- Comprehensive evaluation framework: Supports WER, BLEU, LLM-based semantic evaluation, and NMOS audio quality assessment.
- Flexible inference pipeline: Supports both base model and LoRA-adapted model inference with local token-to-audio conversion.
- Cross-disciplinary coverage: Covers scientific formulas across mathematics, physics and chemistry.
FormulaSpeech/
├── README.md # This file
├── requirements.txt # Python dependencies
├── training/ # Training modules
│ ├── sft_training.py # Supervised fine-tuning
│ └── rl_training.py # Reinforcement learning (GRPO)
├── inference/ # Inference module
│ └── inference.py # Model inference engine
├── evaluation/ # Evaluation modules
│ ├── metrics/ # Evaluation metrics
│ │ ├── wer_evaluation.py # Word Error Rate
│ │ ├── bleu_evaluation.py # BLEU score
│ │ ├── llm_evaluation.py # LLM semantic evaluation
│ │ └── nmos_evaluation.py # Audio quality (NMOS)
│ └── utils.py # Evaluation utilities
├── token2audio/ # Local token-to-audio conversion
│ ├── local_token2audio.py # Local decoder implementation
│ ├── flow_inference.py # Flow matching decoder
│ └── cosyvoice/ # CosyVoice foundation
├── configs/ # Configuration files
│ ├── sft_config.yaml # SFT training config
│ ├── rl_config.yaml # RL (GRPO) training config
│ ├── inference_config.yaml # Inference config
│ ├── evaluation_config.yaml # Evaluation config
│ └── paths.py # Path management
├── utils/ # Common utilities
│ ├── common.py # Helper functions
│ └── api_interfaces.py # Abstract API interface layer
├── checkpoints/ # Training checkpoints
│ └── formula-speech-lora/ # Trained LoRA adapter
└── models/ # Pre-trained models (to be downloaded)
# Clone the repository
git clone https://github.com/ai4ed/FormulaSpeech.git
cd FormulaSpeech
# Create conda environment
conda create -n formula-speech python=3.11
conda activate formula-speech
# Install dependencies
pip install -r requirements.txt

Download the following models and place them in the models/ directory:
- GLM-4-Voice: Download from Hugging Face
  models/GLM-4-Voice/glm-4-voice-9b
  models/GLM-4-Voice/glm-4-voice-decoder
  models/GLM-4-Voice/glm-4-voice-tokenizer
- Paraformer (Chinese ASR): Download from Hugging Face
  models/paraformer-zh/
- DNSMOS (Audio Quality): Download from Microsoft DNS Challenge
  models/model_v8.onnx
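
Before moving on, it can help to confirm that the downloaded files sit where the layout above expects them. The following is a minimal sketch and is not part of the repository: the helper name `find_missing_models` and the `EXPECTED_MODEL_PATHS` list are illustrative, and the paths are simply the ones listed above, taken relative to the repository root.

```python
from pathlib import Path

# Relative paths copied from the model list above.
EXPECTED_MODEL_PATHS = [
    "models/GLM-4-Voice/glm-4-voice-9b",
    "models/GLM-4-Voice/glm-4-voice-decoder",
    "models/GLM-4-Voice/glm-4-voice-tokenizer",
    "models/paraformer-zh",
    "models/model_v8.onnx",
]


def find_missing_models(repo_root="."):
    """Return the expected model paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_MODEL_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing_models()
    if missing:
        print("Missing model files or directories:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected model paths are present.")
```

Running this from the repository root before training or inference should surface a misplaced download early, rather than as a load error later.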
Before running any training, inference, or evaluation, configure the paths in configs/paths.py:
# Edit configs/paths.py
MODEL_PATH = "/path/to/your/models/GLM-4-Voice/glm-4-voice-9b"
LORA_PATH = "/path/to/your/checkpoints/formula-speech-lora"
TOKEN2AUDIO_PATH = "/path/to/your/models/GLM-4-Voice/glm-4-voice-decoder"
ASR_MODEL_PATH = "/path/to/your/models/paraformer-zh"
DNSMOS_MODEL_PATH = "/path/to/your/models/model_v8.onnx"

python training/sft_training.py

Key parameters (configured in configs/sft_config.yaml):
| Parameter | Default | Description |
|---|---|---|
| num_train_epochs | 20 | Number of SFT training epochs |
| per_device_train_batch_size | 1 | Batch size per device |
| gradient_accumulation_steps | 4 | Gradient accumulation steps |
| learning_rate | 5e-4 | Learning rate |
| LoRA r | 64 | LoRA rank |
| LoRA lora_alpha | 32 | LoRA alpha |
| LoRA lora_dropout | 0.05 | LoRA dropout |
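
Two quantities follow directly from these defaults. This is a small arithmetic sketch, not code from the repository: `num_devices` is an assumed value for illustration, and the LoRA scaling factor uses the standard `lora_alpha / r` definition, which is assumed to apply here.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def lora_scaling(lora_alpha, r):
    """Standard LoRA scaling applied to the low-rank update (alpha / r)."""
    return lora_alpha / r


# Defaults from the SFT table above; num_devices=1 is an assumption.
print(effective_batch_size(1, 4, num_devices=1))  # 4
print(lora_scaling(lora_alpha=32, r=64))          # 0.5
```

With a batch size of 1 per device, gradient accumulation is what brings each optimizer step up to 4 samples on a single GPU.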
python training/rl_training.py

Key parameters (configured in configs/rl_config.yaml):
| Parameter | Default | Description |
|---|---|---|
| num_train_epochs | 10 | Number of RL training epochs |
| per_device_train_batch_size | 8 | Batch size per device |
| gradient_accumulation_steps | 8 | Gradient accumulation steps |
| learning_rate | 1e-6 | Learning rate |
| num_generations | 8 | Number of sampled responses per prompt |
| top_p | 0.95 | Top-p sampling |
| temperature | 0.6 | Sampling temperature |
| max_new_tokens | 512 | Maximum number of generated tokens |
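
With `num_generations` set to 8, each prompt yields a group of eight sampled responses, and GRPO scores each response relative to its own group. The sketch below shows the commonly used group-normalized advantage (reward minus group mean, divided by group standard deviation). It is illustrative only: the reward values are made up, `group_relative_advantages` is a hypothetical helper, the use of the population standard deviation is an assumption, and the repository's custom rewards and exact normalization may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for the 8 responses sampled for a single prompt.
rewards = [0.2, 0.9, 0.5, 0.5, 0.1, 0.8, 0.6, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged. If every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.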
python inference/inference.py

This loads the base GLM-4-Voice model with your LoRA adapter and runs inference on a sample formula. Edit the messages in inference/inference.py to test with different inputs.
from inference.inference import FormulaSpeechInference

engine = FormulaSpeechInference(
    model_path="/path/to/glm-4-voice-9b",
    lora_path="/path/to/checkpoints/formula-speech-lora",
    token2audio_model_path="/path/to/glm-4-voice-decoder",
    device="cuda"
)

messages = [
    {"role": "system", "content": "User will provide you with a text instruction..."},
    {"role": "user", "content": r"Please read the following formula: $E=mc^2$"}
]

result = engine.inference(messages, max_tokens=512, temperature=0.2)
print(result["clean_text"])  # Verbalized formula text
print(result["audio_file_path"])  # Generated audio file

# Word Error Rate evaluation
python evaluation/metrics/wer_evaluation.py
# BLEU score evaluation
python evaluation/metrics/bleu_evaluation.py
# LLM semantic evaluation
python evaluation/metrics/llm_evaluation.py
# Audio quality evaluation (NMOS)
python evaluation/metrics/nmos_evaluation.py

Evaluation configuration is in configs/evaluation_config.yaml. Supported metrics include:
| Metric | Description | Language Support |
|---|---|---|
| WER | Word Error Rate via ASR | Chinese, English |
| BLEU | N-gram overlap score | Chinese, English |
| LLM Score | LLM-based semantic evaluation of verbalization correctness | Chinese, English |
| NMOS | Automatic speech quality estimation | Chinese, English |
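
For reference, WER is the word-level edit distance between a reference and a hypothesis (here, presumably the ASR transcript of the generated speech), divided by the number of reference words. The following is a generic sketch of that definition, not the code in `evaluation/metrics/wer_evaluation.py`. Whitespace tokenization is an assumption, and Chinese text is typically scored per character instead.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "e equals m c squared"
hypothesis = "e equals m c square"
print(word_error_rate(reference, hypothesis))  # 0.2
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.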
All datasets are publicly available on Hugging Face 🤗:
| Dataset | Description | Link |
|---|---|---|
| EduDialogue | Multi-turn teacher-student educational dialogues with speech and spoken transcripts | 🤗 FormulaSpeech_datasets |
| SciFormula-Math2k | Mathematical formulas and spoken verbalizations | 🤗 FormulaSpeech_datasets |
| SciFormula-Physics1k | Physics formulas and spoken verbalizations | 🤗 FormulaSpeech_datasets |
| SciFormula-Chemistry5k | Chemical formulas and equations with spoken verbalizations | 🤗 FormulaSpeech_datasets |
You can also load the datasets programmatically:
from datasets import load_dataset
edu_dialogue = load_dataset("Stephen-Lee/FormulaSpeech_datasets", "EduDialogue")
math = load_dataset("Stephen-Lee/FormulaSpeech_datasets", "SciFormula")

NOTE: The complete dataset is hosted on HuggingFace.
If you find FormulaSpeech useful in your research, please cite our paper:
@inproceedings{li2026improving,
  title={Improving Scientific Formula Verbalization in Large Speech Language Models for Accessible Learning},
  author={Li, Xueyi and Liu, Tianqiao and Liu, Zitao and Guo, Teng and Wu, Yongdong},
  booktitle={Proceedings of the 35th International Joint Conference on Artificial Intelligence},
  month={August},
  year={2026},
  address={Bremen, Germany}
}

This codebase is built with reference to the following excellent open-source projects. We sincerely thank the authors for their contributions:
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.