ai4ed/FormulaSpeech

Improving Scientific Formula Verbalization in Large Speech Language Models for Accessible Learning

📖 Paper (IJCAI 2026)  |    🤗 Datasets  |    🖥️ Quick Start  |    ©️ Citation


This repository provides the official implementation of Formula-Speech, accepted at IJCAI 2026. Formula-Speech is the first end-to-end Large Speech Language Model (LSLM) designed for scientific formula verbalization toward accessible learning. It builds on GLM-4-Voice and is optimized with a lightweight two-stage training framework that combines supervised fine-tuning with reinforcement learning guided by custom rewards.

Highlights

  • LoRA-based adaptation: Efficiently adapts GLM-4-Voice for scientific formula verbalization.
  • Two-stage training framework: Combines supervised fine-tuning and GRPO-based reinforcement learning for formula-to-speech alignment.
  • Comprehensive evaluation framework: Supports WER, BLEU, LLM-based semantic evaluation, and NMOS audio quality assessment.
  • Flexible inference pipeline: Supports both base model and LoRA-adapted model inference with local token-to-audio conversion.
  • Cross-disciplinary coverage: Covers scientific formulas across mathematics, physics and chemistry.

Project Structure

FormulaSpeech/
├── README.md                           # This file
├── requirements.txt                    # Python dependencies
├── training/                           # Training modules
│   ├── sft_training.py                 # Supervised fine-tuning
│   └── rl_training.py                  # Reinforcement learning (GRPO)
├── inference/                          # Inference module
│   └── inference.py                    # Model inference engine
├── evaluation/                         # Evaluation modules
│   ├── metrics/                        # Evaluation metrics
│   │   ├── wer_evaluation.py           # Word Error Rate
│   │   ├── bleu_evaluation.py          # BLEU score
│   │   ├── llm_evaluation.py           # LLM semantic evaluation
│   │   └── nmos_evaluation.py          # Audio quality (NMOS)
│   └── utils.py                        # Evaluation utilities
├── token2audio/                        # Local token-to-audio conversion
│   ├── local_token2audio.py            # Local decoder implementation
│   ├── flow_inference.py               # Flow matching decoder
│   └── cosyvoice/                      # CosyVoice foundation
├── configs/                            # Configuration files
│   ├── sft_config.yaml                 # SFT training config
│   ├── rl_config.yaml                  # RL (GRPO) training config
│   ├── inference_config.yaml           # Inference config
│   ├── evaluation_config.yaml          # Evaluation config
│   └── paths.py                        # Path management
├── utils/                              # Common utilities
│   ├── common.py                       # Helper functions
│   └── api_interfaces.py               # Abstract API interface layer
├── checkpoints/                        # Training checkpoints
│   └── formula-speech-lora/            # Trained LoRA adapter
└── models/                             # Pre-trained models (to be downloaded)

Quick Start

1. Environment Setup

# Clone the repository
git clone https://github.com/ai4ed/FormulaSpeech.git
cd FormulaSpeech

# Create conda environment
conda create -n formula-speech python=3.11
conda activate formula-speech

# Install dependencies
pip install -r requirements.txt

2. Download Pre-trained Models

Download the following models and place them in the models/ directory:

  • GLM-4-Voice: Download from Hugging Face
      models/GLM-4-Voice/glm-4-voice-9b
      models/GLM-4-Voice/glm-4-voice-decoder
      models/GLM-4-Voice/glm-4-voice-tokenizer

  • Paraformer (Chinese ASR): Download from Hugging Face
      models/paraformer-zh/

  • DNSMOS (Audio Quality): Download from Microsoft DNS Challenge
      models/model_v8.onnx

3. Configuration

Before running any training, inference, or evaluation, configure the paths in configs/paths.py:

# Edit configs/paths.py
MODEL_PATH = "/path/to/your/models/GLM-4-Voice/glm-4-voice-9b"
LORA_PATH = "/path/to/your/checkpoints/formula-speech-lora"
TOKEN2AUDIO_PATH = "/path/to/your/models/GLM-4-Voice/glm-4-voice-decoder"
ASR_MODEL_PATH = "/path/to/your/models/paraformer-zh"
DNSMOS_MODEL_PATH = "/path/to/your/models/model_v8.onnx"
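
Because training, inference, and evaluation all read these constants, it can be worth checking them before starting a long job. The following is a minimal sketch and not part of the repository: the helper `find_missing_paths` is hypothetical, and it assumes only that the constants above are plain string paths.

```python
from pathlib import Path


def find_missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names of configured paths that do not exist on disk."""
    return [name for name, value in paths.items() if not Path(value).exists()]


if __name__ == "__main__":
    # Placeholder values copied from the example above; replace with your own.
    configured = {
        "MODEL_PATH": "/path/to/your/models/GLM-4-Voice/glm-4-voice-9b",
        "LORA_PATH": "/path/to/your/checkpoints/formula-speech-lora",
        "TOKEN2AUDIO_PATH": "/path/to/your/models/GLM-4-Voice/glm-4-voice-decoder",
        "ASR_MODEL_PATH": "/path/to/your/models/paraformer-zh",
        "DNSMOS_MODEL_PATH": "/path/to/your/models/model_v8.onnx",
    }
    for name in find_missing_paths(configured):
        print(f"{name} does not exist: {configured[name]}")
```

If the script prints nothing, every configured path was found.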

4. Training

Supervised Fine-tuning (SFT)

python training/sft_training.py

Key parameters (configured in configs/sft_config.yaml):

| Parameter | Default | Description |
| --- | --- | --- |
| num_train_epochs | 20 | Number of SFT training epochs |
| per_device_train_batch_size | 1 | Batch size per device |
| gradient_accumulation_steps | 4 | Gradient accumulation steps |
| learning_rate | 5e-4 | Learning rate |
| LoRA r | 64 | LoRA rank |
| LoRA lora_alpha | 32 | LoRA alpha |
| LoRA lora_dropout | 0.05 | LoRA dropout |
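
Two derived quantities follow from the SFT defaults: the effective batch size per optimizer step and the LoRA scaling factor. The sketch below is illustrative only. It assumes the conventional LoRA scaling of `lora_alpha / r` (not the rank-stabilized variant) and treats the number of devices as an unknown that defaults to 1; both function names are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Conventional LoRA scaling applied to the low-rank update (alpha / r)."""
    return lora_alpha / r


# SFT defaults listed above, on a single device (assumption).
print(effective_batch_size(per_device=1, grad_accum=4))  # 4
print(lora_scaling(lora_alpha=32, r=64))                 # 0.5
```

On more than one GPU, pass `num_devices` to see how the effective batch size grows.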

Reinforcement Learning (GRPO)

python training/rl_training.py

Key parameters (configured in configs/rl_config.yaml):

| Parameter | Default | Description |
| --- | --- | --- |
| num_train_epochs | 10 | Number of RL training epochs |
| per_device_train_batch_size | 8 | Batch size per device |
| gradient_accumulation_steps | 8 | Gradient accumulation steps |
| learning_rate | 1e-6 | Learning rate |
| num_generations | 8 | Number of sampled responses per prompt |
| top_p | 0.95 | Top-p sampling |
| temperature | 0.6 | Sampling temperature |
| max_new_tokens | 512 | Maximum number of generated tokens |
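
To show what `num_generations` controls, here is a minimal sketch of the group-relative advantage at the core of GRPO: each prompt's sampled responses are scored, and every reward is normalized against the mean and standard deviation of its own group. This is not the code in training/rl_training.py. The function name and the reward values are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption, since implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's sampled responses within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the 8 responses sampled for a single prompt.
rewards = [0.9, 0.4, 0.4, 0.7, 0.1, 0.9, 0.4, 0.7]
for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one. When all responses in a group score identically, every advantage is zero and that prompt contributes no learning signal.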

5. Inference

Quick Inference (Single Sample)

python inference/inference.py

This loads the base GLM-4-Voice model with your LoRA adapter and runs inference on a sample formula. Edit the messages in inference/inference.py to test with different inputs.

Batch Inference

from inference.inference import FormulaSpeechInference

engine = FormulaSpeechInference(
    model_path="/path/to/glm-4-voice-9b",
    lora_path="/path/to/checkpoints/formula-speech-lora",
    token2audio_model_path="/path/to/glm-4-voice-decoder",
    device="cuda"
)

messages = [
    {"role": "system", "content": "User will provide you with a text instruction..."},
    {"role": "user", "content": r"Please read the following formula: $E=mc^2$"}
]

result = engine.inference(messages, max_tokens=512, temperature=0.2)
print(result["clean_text"])       # Verbalized formula text
print(result["audio_file_path"])  # Generated audio file
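
The call above handles one conversation at a time. To process several formulas, one option is to loop over them and build a fresh message list for each. The sketch below assumes only the `inference(messages, max_tokens=..., temperature=...)` call and the result keys shown above. The helper `verbalize_formulas` and the `InferenceEngine` protocol are hypothetical, and the system prompt is passed in by the caller because it is elided in the example.

```python
from typing import Any, Protocol


class InferenceEngine(Protocol):
    """Minimal interface assumed from the example above."""

    def inference(
        self, messages: list[dict[str, str]], max_tokens: int, temperature: float
    ) -> dict[str, Any]: ...


def verbalize_formulas(
    engine: InferenceEngine,
    system_prompt: str,
    formulas: list[str],
    max_tokens: int = 512,
    temperature: float = 0.2,
) -> list[dict[str, Any]]:
    """Run the engine once per LaTeX formula and collect the results in order."""
    results = []
    for formula in formulas:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Please read the following formula: ${formula}$"},
        ]
        results.append(
            engine.inference(messages, max_tokens=max_tokens, temperature=temperature)
        )
    return results
```

With the `engine` constructed above, `verbalize_formulas(engine, system_prompt, [r"E=mc^2", r"F=ma"])` would return one result dictionary per formula. Pass formulas as raw strings so that LaTeX backslashes are preserved.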

6. Evaluation

# Word Error Rate evaluation
python evaluation/metrics/wer_evaluation.py

# BLEU score evaluation
python evaluation/metrics/bleu_evaluation.py

# LLM semantic evaluation
python evaluation/metrics/llm_evaluation.py

# Audio quality evaluation (NMOS)
python evaluation/metrics/nmos_evaluation.py

Evaluation configuration is in configs/evaluation_config.yaml. Supported metrics include:

| Metric | Description | Language Support |
| --- | --- | --- |
| WER | Word Error Rate via ASR | Chinese, English |
| BLEU | N-gram overlap score | Chinese, English |
| LLM Score | LLM-based semantic evaluation of verbalization correctness | Chinese, English |
| NMOS | Automatic speech quality estimation | Chinese, English |
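
For reference, WER is the token-level edit distance between a reference transcript and the ASR transcript of the generated audio, divided by the reference length. The sketch below shows that definition only. It is not the logic in evaluation/metrics/wer_evaluation.py, which may normalize and tokenize text differently. The function name is hypothetical, and splitting Chinese text into characters is an assumption.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and hypothesis[:j]; start with the empty reference prefix.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # distance from reference[:i] to the empty hypothesis
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# English: whitespace tokens. One deleted word out of five reference words.
english = word_error_rate("e equals m c squared".split(), "e equals m squared".split())

# Chinese: characters as tokens (assumption). One substitution out of four.
chinese = word_error_rate(list("质能方程"), list("质量方程"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.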

Datasets

All datasets are publicly available on Hugging Face 🤗:

| Dataset | Description | Link |
| --- | --- | --- |
| EduDialogue | Multi-turn teacher-student educational dialogues with speech and spoken transcripts | 🤗 FormulaSpeech_datasets |
| SciFormula-Math2k | Mathematical formulas and spoken verbalizations | 🤗 FormulaSpeech_datasets |
| SciFormula-Physics1k | Physics formulas and spoken verbalizations | 🤗 FormulaSpeech_datasets |
| SciFormula-Chemistry5k | Chemical formulas and equations with spoken verbalizations | 🤗 FormulaSpeech_datasets |

You can also load the datasets programmatically:

from datasets import load_dataset

# Each call selects one configuration of the same Hugging Face dataset repository
edu_dialogue = load_dataset("Stephen-Lee/FormulaSpeech_datasets", "EduDialogue")
sci_formula = load_dataset("Stephen-Lee/FormulaSpeech_datasets", "SciFormula")

NOTE: The complete dataset is hosted on Hugging Face.

Citation

If you find FormulaSpeech useful in your research, please cite our paper:

@inproceedings{li2026improving,
  title={Improving Scientific Formula Verbalization in Large Speech Language Models for Accessible Learning},
  author={Li, Xueyi and Liu, Tianqiao and Liu, Zitao and Guo, Teng and Wu, Yongdong},
  booktitle={Proceedings of the 35th International Joint Conference on Artificial Intelligence},
  month={August},
  year={2026},
  address={Bremen, Germany}
}

Acknowledgements

This codebase is built with reference to the following excellent open-source projects. We sincerely thank the authors for their contributions:

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
