LLM Souping is a framework for creating high-performance language models through weighted averaging of multiple pre-trained model checkpoints. By combining the strengths of different specialized models, this technique produces ensemble models that often outperform individual components across various tasks.
Model souping (also known as model averaging or weight averaging) is a technique that combines multiple trained models by averaging their parameters with specific weights. This approach can:
- Improve performance across diverse evaluation benchmarks
- Reduce overfitting by leveraging multiple training trajectories
- Create robust models that combine specialized capabilities
- Save computational costs compared to ensemble inference
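At its core, souping is just a weighted average of parameter tensors. A minimal sketch in PyTorch (a hypothetical helper for illustration, not the repo's `model_avg.py` implementation):

```python
import torch

def soup_state_dicts(state_dicts, weights):
    """Average compatible state dicts with the given weights (summing to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    souped = {}
    for key in state_dicts[0]:
        # Each souped parameter is a convex combination of the same
        # parameter across all input checkpoints.
        souped[key] = sum(w * sd[key].to(torch.float32)
                          for sd, w in zip(state_dicts, weights))
    return souped
```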
Souper-Model achieves state-of-the-art performance by considering benchmark composition and employing non-uniform weighting strategies. We show:
- Automated Checkpoint Souping: We introduce SoCE, Soup Of Category Experts, a novel model souping technique that leverages benchmark composition through an automatic category-aware expert selection mechanism.
- State-of-the-Art Performance: We demonstrate the efficiency of the proposed method across diverse domains, including state-of-the-art results for the Berkeley Function Calling Leaderboard. Our approach consistently outperforms existing baselines, validating the effectiveness of category-specific model souping.
- Higher Model Consistency: We perform a large-scale empirical analysis to show that model souping enhances performance consistency across benchmark categories. Souped models exhibit significantly higher Pearson correlations between category performances across model populations compared to their unsouped counterparts, indicating improved robustness and coherence across diverse task types.
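To make the selection idea concrete, here is a hedged sketch of picking one "category expert" per benchmark category from a score table; the actual SoCE selection and weight-optimization procedure may differ, and the scores below are purely illustrative:

```python
# Hypothetical per-category scores (illustrative numbers, not real results).
category_scores = {
    "model_a": {"multi_turn": 0.61, "parallel": 0.78, "irrelevance": 0.70},
    "model_b": {"multi_turn": 0.55, "parallel": 0.83, "irrelevance": 0.74},
    "model_c": {"multi_turn": 0.64, "parallel": 0.75, "irrelevance": 0.68},
}

def pick_category_experts(scores):
    """Return the best-scoring checkpoint for each benchmark category."""
    categories = next(iter(scores.values())).keys()
    return {cat: max(scores, key=lambda m: scores[m][cat]) for cat in categories}

print(pick_category_experts(category_scores))
# {'multi_turn': 'model_c', 'parallel': 'model_b', 'irrelevance': 'model_b'}
```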
If you find this work useful, please cite:

```bibtex
@misc{maiti2025soupermodel,
      title={Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance},
      author={Shalini Maiti and Amar Budhiraja and Bhavul Gauri and Gaurav Chaurasia and Anton Protopopov and Alexis Audran-Reiss and Michael Slater and Despoina Magka and Tatiana Shavrina and Roberta Raileanu and Yoram Bachrach},
      year={2025},
      eprint={2511.13254},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.13254},
}
```
The framework currently supports averaging of Llama-based models in two size categories:

8B Models:
- Team-ACE/ToolACE-2-Llama-3.1-8B - Tool-calling specialized model
- Salesforce/Llama-xLAM-2-8b-fc-r - Function-calling model
- watt-ai/watt-tool-8B - Multi-tool reasoning model

70B Models:
- Salesforce/Llama-xLAM-2-70b-fc-r - Large-scale function-calling model
- watt-ai/watt-tool-70B - Advanced tool-reasoning model
- uiuc-convai/CoALM-70B - Conversational AI model
🥕 🥦 🍅 🧅 🍲 🧄 🥬
- Remember that souping works best with models derived from the same pre-trained base model
- You can check for other derivative models of the same size via HuggingFace filters
- Do not soup unaligned checkpoints with aligned ones

Happy souping!
- Python 3.10 or higher
- Conda package manager
- Sufficient disk space (8B models: ~50GB, 70B models: ~400GB)
- CUDA-compatible GPU (recommended for faster processing)
```bash
git clone <repository-url>
cd llm_souping
```

The framework provides a single end-to-end script that handles everything:
```bash
# For 8B models
./run_souping_e2e.sh 8b

# For 70B models
./run_souping_e2e.sh 70b
```

This script will automatically:
- Create a conda environment named `souping` with Python 3.10
- Install dependencies including the HuggingFace CLI, transformers, and sglang
- Download models from HuggingFace Hub to local directories
- Average model weights according to predefined configurations
- Save the final ensemble model to the output directory
In detail, the setup step:

- Creates a conda environment called `souping`
- Installs required packages:
  - `huggingface_hub` - for model downloading
  - `transformers` - for model loading and saving
  - `sglang[all]` - for serving and inference
  - Additional utilities: `orjson`, `pybase64`, `uvicorn`
Models are downloaded to structured directories under `~/magg_checkpoints/magg/souping_experiments/checkpoints/`:

8B Models:

```
~/magg_checkpoints/magg/souping_experiments/checkpoints/
├── m1-8b/   # Team-ACE/ToolACE-2-Llama-3.1-8B
├── m2-8b/   # Salesforce/Llama-xLAM-2-8b-fc-r
└── m3-8b/   # watt-ai/watt-tool-8B
```

70B Models:

```
~/magg_checkpoints/magg/souping_experiments/checkpoints/
├── m1-70b/  # Salesforce/Llama-xLAM-2-70b-fc-r
├── m2-70b/  # watt-ai/watt-tool-70B
└── m3-70b/  # uiuc-convai/CoALM-70B
```
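The end-to-end script handles these downloads for you. If you want to fetch the 8B checkpoints manually, a sketch using `huggingface_hub.snapshot_download` (the script itself may use the HuggingFace CLI instead):

```python
from pathlib import Path
from huggingface_hub import snapshot_download

CKPT = Path.home() / "magg_checkpoints/magg/souping_experiments/checkpoints"

for local_name, repo_id in {
    "m1-8b": "Team-ACE/ToolACE-2-Llama-3.1-8B",
    "m2-8b": "Salesforce/Llama-xLAM-2-8b-fc-r",
    "m3-8b": "watt-ai/watt-tool-8B",
}.items():
    # Mirror the directory layout the framework expects.
    snapshot_download(repo_id=repo_id, local_dir=str(CKPT / local_name))
```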
The default ensembles use the following optimized averaging weights:
8B Ensemble:
- ToolACE-2-Llama-3.1-8B: 20%
- Llama-xLAM-2-8b-fc-r: 70%
- watt-tool-8B: 10%
70B Ensemble:
- Llama-xLAM-2-70b-fc-r: 50%
- watt-tool-70B: 30%
- CoALM-70B: 20%
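For reference, the 8B weights above map onto the configuration format described below roughly as follows (a sketch: the paths are illustrative, and the shipped `ensemble_configs_8b_sota.py` may differ):

```python
from pathlib import Path

CKPT = Path.home() / "magg_checkpoints/magg/souping_experiments/checkpoints"

ensemble_configs = [
    {
        "name": "sota_8b_ensemble",
        "models": {
            str(CKPT / "m1-8b"): 0.2,  # Team-ACE/ToolACE-2-Llama-3.1-8B
            str(CKPT / "m2-8b"): 0.7,  # Salesforce/Llama-xLAM-2-8b-fc-r
            str(CKPT / "m3-8b"): 0.1,  # watt-ai/watt-tool-8B
        },
        "output_dir": str(Path.home() / "souped_8b_sota/"),
    }
]
```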
You can create custom ensemble configurations by modifying the configuration files:

- `souping_experiments/ensemble_configs_8b_sota.py` - 8B model configurations
- `souping_experiments/ensemble_configs_70b_sota.py` - 70B model configurations
Example custom configuration:
```python
from pathlib import Path

ensemble_configs = [
    {
        "name": "custom_8b_ensemble",
        "models": {
            f"{Path.home()}/path/to/model1": 0.4,
            f"{Path.home()}/path/to/model2": 0.6,
        },
        "output_dir": f"{Path.home()}/custom_output_dir/",
    }
]
```

To add support for new models:
- Update the download function in `run_souping_e2e.sh`
- Modify the ensemble configuration files
- Ensure models have compatible architectures (Llama-based)
For faster model souping when you have multiple GPUs available, you can use the parallel execution mode:
```bash
# Run with automatic parallel execution across available GPUs
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b

# Force sequential execution (still uses GPU for faster averaging)
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b --sequential

# Limit parallel workers to specific number
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b --max-workers 4
```

Features:
- Automatic GPU detection and memory estimation
- Parallel processing with one worker per GPU
- GPU-accelerated weight averaging
- Falls back to sequential CPU mode if GPUs unavailable
Requirements: install a GPU management library: `pip install nvidia-ml-py3` or `pip install pynvml`
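As a rough illustration of what GPU detection with `pynvml` looks like (an assumption about `gpu_manager.py`'s approach, not its actual code):

```python
import pynvml

def available_gpus(min_free_gb=40):
    """Return indices of GPUs with at least `min_free_gb` GiB of free memory."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # No NVIDIA driver available -> caller falls back to CPU mode.
    free = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        if mem.free / 1024**3 >= min_free_gb:
            free.append(i)
    pynvml.nvmlShutdown()
    return free
```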
```
llm_souping/
├── run_souping_e2e.sh                  # Main execution script (CPU-based)
├── souping_experiments/
│   ├── model_avg.py                    # Core averaging logic (CPU)
│   ├── example_model_averaging.py      # Averaging orchestration (CPU)
│   ├── model_avg_gpu.py                # GPU-accelerated averaging logic
│   ├── run_parallel_averaging.py       # Parallel execution entry point
│   ├── parallel_runner_mp.py           # Multiprocessing parallel runner
│   ├── gpu_manager.py                  # GPU resource management
│   ├── ensemble_configs_8b_sota.py     # 8B model configurations
│   └── ensemble_configs_70b_sota.py    # 70B model configurations
├── README.md                           # This file
├── LICENSE                             # CC BY-NC 4.0 license
├── CODE_OF_CONDUCT.md                  # Contribution guidelines
└── CONTRIBUTING.md                     # Development guidelines
```
This project is licensed under the CC BY-NC 4.0 License. This means you can use, share, and adapt the material for non-commercial purposes with proper attribution.
We welcome contributions! Please see our Contributing Guidelines and Code of Conduct for details on how to get involved.
