LLM Souping is a framework for creating high-performance language models through weighted averaging of multiple pre-trained model checkpoints. By combining the strengths of different specialized models, this technique produces ensemble models that often outperform individual components across various tasks.
Model souping (also known as model averaging or weight averaging) is a technique that combines multiple trained models by averaging their parameters with specific weights. This approach can:
- Improve performance across diverse evaluation benchmarks
- Reduce overfitting by leveraging multiple training trajectories
- Create robust models that combine specialized capabilities
- Save computational costs compared to ensemble inference
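At its core, souping is just a weighted average of parameter tensors. A minimal sketch in PyTorch (a hypothetical helper for illustration, not the repo's `model_avg.py` implementation):

```python
import torch

def soup_state_dicts(state_dicts, weights):
    """Average compatible state dicts with the given weights (summing to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    souped = {}
    for key in state_dicts[0]:
        # Each souped parameter is a convex combination of the same
        # parameter across all input checkpoints.
        souped[key] = sum(w * sd[key].to(torch.float32)
                          for sd, w in zip(state_dicts, weights))
    return souped
```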
Souper-Model achieves state-of-the-art performance by considering benchmark composition and employing non-uniform weighting strategies. We show:
- Automated Checkpoint Souping: We introduce SoCE, Soup Of Category Experts, a novel model souping technique that leverages benchmark composition through an automatic category-aware expert selection mechanism.
- State-of-the-Art Performance: We demonstrate the efficiency of the proposed method across diverse domains, including state-of-the-art results for the Berkeley Function Calling Leaderboard. Our approach consistently outperforms existing baselines, validating the effectiveness of category-specific model souping.
- Higher Model Consistency: We perform a large-scale empirical analysis to show that model souping enhances performance consistency across benchmark categories. Souped models exhibit significantly higher Pearson correlations between category performances across model populations compared to their unsouped counterparts, indicating improved robustness and coherence across diverse task types.
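To make the selection idea concrete, here is a hedged sketch of picking one "category expert" per benchmark category from a score table; the actual SoCE selection and weight-optimization procedure may differ, and the scores below are purely illustrative:

```python
# Hypothetical per-category scores (illustrative numbers, not real results).
category_scores = {
    "model_a": {"multi_turn": 0.61, "parallel": 0.78, "irrelevance": 0.70},
    "model_b": {"multi_turn": 0.55, "parallel": 0.83, "irrelevance": 0.74},
    "model_c": {"multi_turn": 0.64, "parallel": 0.75, "irrelevance": 0.68},
}

def pick_category_experts(scores):
    """Return the best-scoring checkpoint for each benchmark category."""
    categories = next(iter(scores.values())).keys()
    return {cat: max(scores, key=lambda m: scores[m][cat]) for cat in categories}

print(pick_category_experts(category_scores))
# {'multi_turn': 'model_c', 'parallel': 'model_b', 'irrelevance': 'model_b'}
```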
If you find this work useful, please cite:

```bibtex
@misc{maiti2025soupermodel,
      title={Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance},
      author={Shalini Maiti and Amar Budhiraja and Bhavul Gauri and Gaurav Chaurasia and Anton Protopopov and Alexis Audran-Reiss and Michael Slater and Despoina Magka and Tatiana Shavrina and Roberta Raileanu and Yoram Bachrach},
      year={2025},
      eprint={2511.13254},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.13254},
}
```
The framework currently supports averaging of Llama-based models in two size categories:

8B Models:
- Team-ACE/ToolACE-2-Llama-3.1-8B - Tool-calling specialized model
- Salesforce/Llama-xLAM-2-8b-fc-r - Function-calling model
- watt-ai/watt-tool-8B - Multi-tool reasoning model

70B Models:
- Salesforce/Llama-xLAM-2-70b-fc-r - Large-scale function-calling model
- watt-ai/watt-tool-70B - Advanced tool-reasoning model
- uiuc-convai/CoALM-70B - Conversational AI model
🥕 🥦 🍅 🧅 🍲 🧄 🥬
- Remember that souping works best with models derived from the same pre-trained base model
- You can check for other derivative models of the same size via HuggingFace filters
- Do not soup unaligned checkpoints with aligned ones

Happy souping!
- Python 3.10 or higher
- Conda package manager
- Sufficient disk space (8B models: ~50GB, 70B models: ~400GB)
- CUDA-compatible GPU (recommended for faster processing)
```bash
git clone <repository-url>
cd llm_souping
```

The framework provides a single end-to-end script that handles everything:
```bash
# For 8B models
./run_souping_e2e.sh 8b

# For 70B models
./run_souping_e2e.sh 70b
```

This script will automatically:
- Create a conda environment named `souping` with Python 3.10
- Install dependencies including the HuggingFace CLI, transformers, and sglang
- Download models from HuggingFace Hub to local directories
- Average model weights according to predefined configurations
- Save the final ensemble model to the output directory
In detail, the setup step:

- Creates a conda environment called `souping`
- Installs required packages:
  - `huggingface_hub` - for model downloading
  - `transformers` - for model loading and saving
  - `sglang[all]` - for serving and inference
  - Additional utilities: `orjson`, `pybase64`, `uvicorn`
Models are downloaded to structured directories under `~/magg_checkpoints/magg/souping_experiments/checkpoints/`:

8B Models:

```
~/magg_checkpoints/magg/souping_experiments/checkpoints/
├── m1-8b/   # Team-ACE/ToolACE-2-Llama-3.1-8B
├── m2-8b/   # Salesforce/Llama-xLAM-2-8b-fc-r
└── m3-8b/   # watt-ai/watt-tool-8B
```

70B Models:

```
~/magg_checkpoints/magg/souping_experiments/checkpoints/
├── m1-70b/  # Salesforce/Llama-xLAM-2-70b-fc-r
├── m2-70b/  # watt-ai/watt-tool-70B
└── m3-70b/  # uiuc-convai/CoALM-70B
```
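The end-to-end script handles these downloads for you. If you want to fetch the 8B checkpoints manually, a sketch using `huggingface_hub.snapshot_download` (the script itself may use the HuggingFace CLI instead):

```python
from pathlib import Path
from huggingface_hub import snapshot_download

CKPT = Path.home() / "magg_checkpoints/magg/souping_experiments/checkpoints"

for local_name, repo_id in {
    "m1-8b": "Team-ACE/ToolACE-2-Llama-3.1-8B",
    "m2-8b": "Salesforce/Llama-xLAM-2-8b-fc-r",
    "m3-8b": "watt-ai/watt-tool-8B",
}.items():
    # Mirror the directory layout the framework expects.
    snapshot_download(repo_id=repo_id, local_dir=str(CKPT / local_name))
```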
The default ensembles use the following optimized averaging weights:
8B Ensemble:
- ToolACE-2-Llama-3.1-8B: 20%
- Llama-xLAM-2-8b-fc-r: 70%
- watt-tool-8B: 10%
70B Ensemble:
- Llama-xLAM-2-70b-fc-r: 50%
- watt-tool-70B: 30%
- CoALM-70B: 20%
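For reference, the 8B weights above map onto the configuration format described below roughly as follows (a sketch: the paths are illustrative, and the shipped `ensemble_configs_8b_sota.py` may differ):

```python
from pathlib import Path

CKPT = Path.home() / "magg_checkpoints/magg/souping_experiments/checkpoints"

ensemble_configs = [
    {
        "name": "sota_8b_ensemble",
        "models": {
            str(CKPT / "m1-8b"): 0.2,  # Team-ACE/ToolACE-2-Llama-3.1-8B
            str(CKPT / "m2-8b"): 0.7,  # Salesforce/Llama-xLAM-2-8b-fc-r
            str(CKPT / "m3-8b"): 0.1,  # watt-ai/watt-tool-8B
        },
        "output_dir": str(Path.home() / "souped_8b_sota/"),
    }
]
```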
You can create custom ensemble configurations by modifying the configuration files:

- `souping_experiments/ensemble_configs_8b_sota.py` - 8B model configurations
- `souping_experiments/ensemble_configs_70b_sota.py` - 70B model configurations
Example custom configuration:
```python
from pathlib import Path

ensemble_configs = [
    {
        "name": "custom_8b_ensemble",
        "models": {
            f"{Path.home()}/path/to/model1": 0.4,
            f"{Path.home()}/path/to/model2": 0.6,
        },
        "output_dir": f"{Path.home()}/custom_output_dir/",
    }
]
```

To add support for new models:
- Update the download function in `run_souping_e2e.sh`
- Modify the ensemble configuration files
- Ensure models have compatible architectures (Llama-based)
For faster model souping when you have multiple GPUs available, you can use the parallel execution mode:
```bash
# Run with automatic parallel execution across available GPUs
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b

# Force sequential execution (still uses GPU for faster averaging)
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b --sequential

# Limit parallel workers to specific number
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b --max-workers 4
```

Features:
- Automatic GPU detection and memory estimation
- Parallel processing with one worker per GPU
- GPU-accelerated weight averaging
- Falls back to sequential CPU mode if GPUs unavailable
Requirements: install a GPU management library: `pip install nvidia-ml-py3` or `pip install pynvml`
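As a rough illustration of what GPU detection with `pynvml` looks like (an assumption about `gpu_manager.py`'s approach, not its actual code):

```python
import pynvml

def available_gpus(min_free_gb=40):
    """Return indices of GPUs with at least `min_free_gb` GiB of free memory."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # No NVIDIA driver available -> caller falls back to CPU mode.
    free = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        if mem.free / 1024**3 >= min_free_gb:
            free.append(i)
    pynvml.nvmlShutdown()
    return free
```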
```
llm_souping/
├── run_souping_e2e.sh                  # Main execution script (CPU-based)
├── souping_experiments/
│   ├── model_avg.py                    # Core averaging logic (CPU)
│   ├── example_model_averaging.py      # Averaging orchestration (CPU)
│   ├── model_avg_gpu.py                # GPU-accelerated averaging logic
│   ├── run_parallel_averaging.py       # Parallel execution entry point
│   ├── parallel_runner_mp.py           # Multiprocessing parallel runner
│   ├── gpu_manager.py                  # GPU resource management
│   ├── ensemble_configs_8b_sota.py     # 8B model configurations
│   └── ensemble_configs_70b_sota.py    # 70B model configurations
├── README.md                           # This file
├── LICENSE                             # CC BY-NC 4.0 license
├── CODE_OF_CONDUCT.md                  # Contribution guidelines
└── CONTRIBUTING.md                     # Development guidelines
```
This project is licensed under the CC BY-NC 4.0 License. This means you can use, share, and adapt the material for non-commercial purposes with proper attribution.
We welcome contributions! Please see our Contributing Guidelines and Code of Conduct for details on how to get involved.
