Improving Scientific Formula Verbalization in Large Speech Language Models for Accessible Learning

📖 Paper (IJCAI 2026) | 🤗 Datasets | 🖥️ Quick Start | ©️ Citation

This repository provides the official implementation of Formula-Speech, which has been accepted at IJCAI 2026. Formula-Speech is the first end-to-end Large Speech Language Model (LSLM) designed for scientific formula verbalization toward accessible learning. It is built upon GLM-4-Voice and optimized with a lightweight two-stage training framework that combines supervised fine-tuning with reinforcement learning guided by custom rewards.

Highlights

LoRA-based adaptation: Efficiently adapts GLM-4-Voice for scientific formula verbalization.
Two-stage training framework: Combines supervised fine-tuning and GRPO-based reinforcement learning for formula-to-speech alignment.
Comprehensive evaluation framework: Supports WER, BLEU, LLM-based semantic evaluation, and NMOS audio quality assessment.
Flexible inference pipeline: Supports both base model and LoRA-adapted model inference with local token-to-audio conversion.
Cross-disciplinary coverage: Covers scientific formulas across mathematics, physics and chemistry.

Project Structure

FormulaSpeech/
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── training/                           # Training modules
│   ├── sft_training.py                 # Supervised fine-tuning
│   └── rl_training.py                  # Reinforcement learning (GRPO)
├── inference/                          # Inference module
│   └── inference.py                    # Model inference engine
├── evaluation/                         # Evaluation modules
│   ├── metrics/                        # Evaluation metrics
│   │   ├── wer_evaluation.py           # Word Error Rate
│   │   ├── bleu_evaluation.py          # BLEU score
│   │   ├── llm_evaluation.py           # LLM semantic evaluation
│   │   └── nmos_evaluation.py          # Audio quality (NMOS)
│   └── utils.py                        # Evaluation utilities
├── token2audio/                        # Local token-to-audio conversion
│   ├── local_token2audio.py            # Local decoder implementation
│   ├── flow_inference.py               # Flow matching decoder
│   └── cosyvoice/                      # CosyVoice foundation
├── configs/                            # Configuration files
│   ├── sft_config.yaml                 # SFT training config
│   ├── rl_config.yaml                  # RL (GRPO) training config
│   ├── inference_config.yaml            # Inference config
│   ├── evaluation_config.yaml           # Evaluation config
│   └── paths.py                        # Path management
├── utils/                              # Common utilities
│   ├── common.py                       # Helper functions
│   └── api_interfaces.py               # Abstract API interface layer
├── checkpoints/                        # Training checkpoints
│   └── formula-speech-lora/            # Trained LoRA adapter
└── models/                            # Pre-trained models (to be downloaded)

Quick Start

1. Environment Setup

# Clone the repository
git clone https://github.com/ai4ed/FormulaSpeech.git
cd FormulaSpeech

# Create conda environment
conda create -n formula-speech python=3.11
conda activate formula-speech

# Install dependencies
pip install -r requirements.txt

2. Download Pre-trained Models

Download the following models and place them in the models/ directory:

GLM-4-Voice: Download from Hugging Face and place the three components in models/GLM-4-Voice/glm-4-voice-9b, models/GLM-4-Voice/glm-4-voice-decoder, and models/GLM-4-Voice/glm-4-voice-tokenizer.
Paraformer (Chinese ASR): Download from Hugging Face and place in models/paraformer-zh/.
DNSMOS (Audio Quality): Download from the Microsoft DNS Challenge and place at models/model_v8.onnx.

3. Configuration

Before running any training, inference, or evaluation, configure the paths in configs/paths.py:

# Edit configs/paths.py
MODEL_PATH = "/path/to/your/models/GLM-4-Voice/glm-4-voice-9b"
LORA_PATH = "/path/to/your/checkpoints/formula-speech-lora"
TOKEN2AUDIO_PATH = "/path/to/your/models/GLM-4-Voice/glm-4-voice-decoder"
ASR_MODEL_PATH = "/path/to/your/models/paraformer-zh"
DNSMOS_MODEL_PATH = "/path/to/your/models/model_v8.onnx"
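
A wrong path usually only surfaces once a model starts loading, so it can help to check the configured locations first. The following is a minimal sketch and not part of the repository: `find_missing_paths` is a hypothetical helper, and it assumes the constants above are plain strings.

```python
from pathlib import Path


def find_missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names of the entries whose path does not exist on disk."""
    return [name for name, value in paths.items() if not Path(value).exists()]


# Hypothetical usage: copy in the values you set in configs/paths.py.
configured = {
    "MODEL_PATH": "/path/to/your/models/GLM-4-Voice/glm-4-voice-9b",
    "LORA_PATH": "/path/to/your/checkpoints/formula-speech-lora",
    "TOKEN2AUDIO_PATH": "/path/to/your/models/GLM-4-Voice/glm-4-voice-decoder",
    "ASR_MODEL_PATH": "/path/to/your/models/paraformer-zh",
    "DNSMOS_MODEL_PATH": "/path/to/your/models/model_v8.onnx",
}

for name in find_missing_paths(configured):
    print(f"{name} does not exist: {configured[name]}")
```

Any name printed here points at a model or checkpoint that still needs to be downloaded or whose path needs correcting.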

4. Training

Supervised Fine-tuning (SFT)

python training/sft_training.py

Key parameters (configured in configs/sft_config.yaml):

Parameter	Default	Description
`num_train_epochs`	20	Number of SFT training epochs
`per_device_train_batch_size`	1	Batch size per device
`gradient_accumulation_steps`	4	Gradient accumulation steps
`learning_rate`	5e-4	Learning rate
LoRA `r`	64	LoRA rank
LoRA `lora_alpha`	32	LoRA alpha
LoRA `lora_dropout`	0.05	LoRA dropout
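
Two quantities implied by these defaults are worth working out: the effective batch size and the LoRA scaling factor. The sketch below assumes a single GPU and the standard LoRA scaling of `lora_alpha / r` (not a rank-stabilized variant). The function names are illustrative and do not come from the repository.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update under standard LoRA scaling."""
    return lora_alpha / r


# SFT defaults from the table above, assuming one device.
print(effective_batch_size(per_device_batch=1, grad_accum_steps=4))  # 4
print(lora_scaling(lora_alpha=32, r=64))                             # 0.5
```

With `lora_alpha` smaller than `r`, the adapter update is scaled down by half before it is added to the frozen weights.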

Reinforcement Learning (GRPO)

python training/rl_training.py

Key parameters (configured in configs/rl_config.yaml):

Parameter	Default	Description
`num_train_epochs`	10	Number of RL training epochs
`per_device_train_batch_size`	8	Batch size per device
`gradient_accumulation_steps`	8	Gradient accumulation steps
`learning_rate`	1e-6	Learning rate
`num_generations`	8	Number of sampled responses per prompt
`top_p`	0.95	Top-p sampling
`temperature`	0.6	Sampling temperature
`max_new_tokens`	512	Maximum number of generated tokens
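
GRPO samples `num_generations` responses per prompt and scores each one relative to the others in its group. The sketch below illustrates that group-relative normalization only. The reward values are made up, the repository's custom rewards are not reproduced, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt with num_generations = 8 sampled responses (illustrative rewards).
rewards = [0.2, 0.9, 0.5, 0.5, 0.1, 0.7, 0.3, 0.8]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Responses that score above the group mean receive a positive advantage and are reinforced. If every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.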

5. Inference

Quick Inference (Single Sample)

python inference/inference.py

This loads the base GLM-4-Voice model with your LoRA adapter and runs inference on a sample formula. Edit the messages in inference/inference.py to test with different inputs.

Batch Inference

from inference.inference import FormulaSpeechInference

engine = FormulaSpeechInference(
    model_path="/path/to/glm-4-voice-9b",
    lora_path="/path/to/checkpoints/formula-speech-lora",
    token2audio_model_path="/path/to/glm-4-voice-decoder",
    device="cuda"
)

messages = [
    {"role": "system", "content": "User will provide you with a text instruction..."},
    {"role": "user", "content": r"Please read the following formula: $E=mc^2$"}
]

result = engine.inference(messages, max_tokens=512, temperature=0.2)
print(result["clean_text"])       # Verbalized formula text
print(result["audio_file_path"])  # Generated audio file
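
To process several formulas, the same `engine.inference` call can be wrapped in a loop. This is a sketch under stated assumptions: `build_messages` and `run_batch` are hypothetical helpers that do not exist in the repository, and the system prompt is passed in as a parameter because the full prompt text is not shown above.

```python
def build_messages(formula: str, system_prompt: str) -> list[dict[str, str]]:
    """Build one chat request in the message format used above."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Please read the following formula: ${formula}$"},
    ]


def run_batch(engine, formulas: list[str], system_prompt: str, **gen_kwargs) -> list[dict]:
    """Call engine.inference once per formula and collect the result dicts."""
    return [
        engine.inference(build_messages(formula, system_prompt), **gen_kwargs)
        for formula in formulas
    ]


# Hypothetical usage with the engine created above:
# results = run_batch(engine, [r"E=mc^2", r"F=ma"], system_prompt, max_tokens=512, temperature=0.2)
# for result in results:
#     print(result["clean_text"], result["audio_file_path"])
```

The loop runs the formulas one after another. Whether the engine can process several requests in a single forward pass is not stated here.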

6. Evaluation

# Word Error Rate evaluation
python evaluation/metrics/wer_evaluation.py

# BLEU score evaluation
python evaluation/metrics/bleu_evaluation.py

# LLM semantic evaluation
python evaluation/metrics/llm_evaluation.py

# Audio quality evaluation (NMOS)
python evaluation/metrics/nmos_evaluation.py

Evaluation configuration is in configs/evaluation_config.yaml. Supported metrics include:

Metric	Description	Language Support
WER	Word Error Rate via ASR	Chinese, English
BLEU	N-gram overlap score	Chinese, English
LLM Score	LLM-based semantic evaluation of verbalization correctness	Chinese, English
NMOS	Automatic speech quality estimation	Chinese, English
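
For orientation, WER is the word-level edit distance between a reference and the ASR transcript of the generated audio, divided by the reference length. The sketch below covers only that edit-distance step and is not the code in evaluation/metrics/wer_evaluation.py. The tokenization is an assumption: whitespace-separated words for English, and a list of characters could be passed for Chinese.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """(substitutions + deletions + insertions) / number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


reference = "e equals m c squared".split()
hypothesis = "e equals m c square".split()
print(word_error_rate(reference, hypothesis))  # 0.2
```

One substituted word out of five reference words gives a WER of 0.2. Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the reference.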

Datasets

All datasets are publicly available on HuggingFace 🤗:

Dataset	Description	Link
EduDialogue	Multi-turn teacher-student educational dialogues with speech and spoken transcripts	🤗 FormulaSpeech_datasets
SciFormula-Math2k	Mathematical formulas and spoken verbalizations	🤗 FormulaSpeech_datasets
SciFormula-Physics1k	Physics formulas and spoken verbalizations	🤗 FormulaSpeech_datasets
SciFormula-Chemistry5k	Chemical formulas and equations with spoken verbalizations	🤗 FormulaSpeech_datasets

You can also load the datasets programmatically:

from datasets import load_dataset

edu_dialogue = load_dataset("Stephen-Lee/FormulaSpeech_datasets", "EduDialogue")
sci_formula = load_dataset("Stephen-Lee/FormulaSpeech_datasets", "SciFormula")

NOTE: The complete dataset is hosted on HuggingFace.

Citation

If you find FormulaSpeech useful in your research, please cite our paper:

@inproceedings{li2026improving,
  title={Improving Scientific Formula Verbalization in Large Speech Language Models for Accessible Learning},
  author={Li, Xueyi and Liu, Tianqiao and Liu, Zitao and Guo, Teng and Wu, Yongdong},
  booktitle={Proceedings of the 35th International Joint Conference on Artificial Intelligence},
  month = {August},
  year={2026},
  address = {Bremen, Germany}
}

Acknowledgements

This codebase is built with reference to the following excellent open-source projects. We sincerely thank the authors for their contributions:

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
