LLM Souping 🍲

Model Weight Averaging for Enhanced Performance

LLM Souping is a framework for creating high-performance language models through weighted averaging of multiple pre-trained model checkpoints. By combining the strengths of different specialized models, this technique produces ensemble models that often outperform individual components across various tasks.

What is Model Souping?

Model souping (also known as model averaging or weight averaging) is a technique that combines multiple trained models by averaging their parameters with specific weights; a minimal sketch of the core operation follows the list below. This approach can:

  • Improve performance across diverse evaluation benchmarks
  • Reduce overfitting by leveraging multiple training trajectories
  • Create robust models that combine specialized capabilities
  • Save computational costs compared to ensemble inference
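As a rough illustration (not the repository's exact implementation, which lives in souping_experiments/model_avg.py), weighted averaging boils down to a weighted sum of the checkpoints' state dicts. The paths and weights below are placeholders:

import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoints and weights; the weights are assumed to sum to 1.0.
checkpoints = {"/path/to/model_a": 0.7, "/path/to/model_b": 0.3}

avg_state = None
for path, weight in checkpoints.items():
    state = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32).state_dict()
    if avg_state is None:
        avg_state = {key: weight * value for key, value in state.items()}
    else:
        for key, value in state.items():
            avg_state[key] += weight * value

# Load the averaged parameters into one of the (architecturally identical) models and save.
souped = AutoModelForCausalLM.from_pretrained("/path/to/model_a", torch_dtype=torch.float32)
souped.load_state_dict(avg_state)
souped.save_pretrained("/path/to/souped_model")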

Souper achieves state-of-the-art performance by considering benchmark composition and employing non-uniform weighting strategies. We show:

  1. Automated Checkpoint Souping: We introduce SoCE, Soup Of Category Experts, a novel model souping technique that leverages benchmark composition through an automatic category-aware expert selection mechanism (a simplified illustration of the category-aware idea follows this list).
  2. State-of-the-Art Performance: We demonstrate the efficiency of the proposed method across diverse domains, including state-of-the-art results for the Berkeley Function Calling Leaderboard. Our approach consistently outperforms existing baselines, validating the effectiveness of category-specific model souping.
  3. Higher Model Consistency: We perform a large-scale empirical analysis to show that model souping enhances performance consistency across benchmark categories. Souped models exhibit significantly higher Pearson correlations between category performances across model populations compared to their unsouped counterparts, indicating improved robustness and coherence across diverse task types.
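The exact SoCE selection and weighting procedure is the one described in the paper cited below; purely as a hypothetical illustration of the category-aware idea (not the paper's algorithm), one could pick the strongest candidate per benchmark category and turn the win counts into non-uniform souping weights:

# Hypothetical per-category scores for three candidate checkpoints (illustrative numbers only).
category_scores = {
    "model_a": {"simple_call": 0.82, "parallel_call": 0.74, "multi_turn": 0.61},
    "model_b": {"simple_call": 0.79, "parallel_call": 0.81, "multi_turn": 0.71},
    "model_c": {"simple_call": 0.70, "parallel_call": 0.72, "multi_turn": 0.69},
}
categories = list(next(iter(category_scores.values())))

# Count, for each candidate, the categories in which it is the best performer.
wins = {model: 0 for model in category_scores}
for category in categories:
    best = max(category_scores, key=lambda model: category_scores[model][category])
    wins[best] += 1

# Turn win counts into non-uniform averaging weights (one simple possible scheme).
total = sum(wins.values())
weights = {model: count / total for model, count in wins.items()}
print(weights)  # {'model_a': 0.333..., 'model_b': 0.666..., 'model_c': 0.0}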

Cite us:

@misc{li2025modelmergingpretraininglarge,
      title={Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance},
      author={Shalini Maiti and Amar Budhiraja and Bhavul Gauri and Gaurav Chaurasia and Anton Protopopov and Alexis Audran-Reiss and Michael Slater and Despoina Magka and Tatiana Shavrina and Roberta Raileanu and Yoram Bachrach},
      year={2025},
      eprint={2511.13254},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.13254},
}

Supported Models

The framework currently supports averaging of Llama-based models in two size categories:

8B Models

  • Team-ACE/ToolACE-2-Llama-3.1-8B
  • Salesforce/Llama-xLAM-2-8b-fc-r
  • watt-ai/watt-tool-8B

70B Models

  • Salesforce/Llama-xLAM-2-70b-fc-r
  • watt-ai/watt-tool-70B
  • uiuc-convai/CoALM-70B

Quick Start: How to soup your models 🥣

🥕 🥦 🍅 🧅 🍲 🧄 🥬

General advice:

  • Souping works best with models derived from the same pre-trained base model.
  • You can find other derivative models of the same size via Hugging Face filters.
  • Do not soup unaligned checkpoints with aligned ones.

Happy souping!

Prerequisites

  • Python 3.10 or higher
  • Conda package manager
  • Sufficient disk space (8B models: ~50GB, 70B models: ~400GB)
  • CUDA-compatible GPU (recommended for faster processing)

1. Clone the Repository

git clone <repository-url>
cd llm_souping

2. Run Model Souping

The framework provides a single end-to-end script that handles everything:

# For 8B models
./run_souping_e2e.sh 8b

# For 70B models
./run_souping_e2e.sh 70b

This script will automatically:

  1. Create a conda environment named souping with Python 3.10
  2. Install dependencies including HuggingFace CLI, transformers, and sglang
  3. Download models from HuggingFace Hub to local directories
  4. Average model weights according to predefined configurations
  5. Save the final ensemble model to the output directory (you can then load it as shown below)
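Once the script completes, the souped checkpoint loads like any other Hugging Face model. The path below is a placeholder for whatever output directory your configuration writes to:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; substitute the output_dir from your ensemble configuration.
souped_dir = "/path/to/souped_output_dir"

tokenizer = AutoTokenizer.from_pretrained(souped_dir)
model = AutoModelForCausalLM.from_pretrained(souped_dir)

prompt = "List three benefits of model weight averaging."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))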

What Happens During Execution

Environment Setup

  • Creates a conda environment called souping
  • Installs required packages:
    • huggingface_hub - For model downloading
    • transformers - For model loading and saving
    • sglang[all] - For serving and inference
    • Additional utilities: orjson, pybase64, uvicorn

Model Download and Storage

Models are downloaded to structured directories under ~/magg_checkpoints/magg/souping_experiments/checkpoints/:

8B Models:

~/magg_checkpoints/magg/souping_experiments/checkpoints/
├── m1-8b/  # Team-ACE/ToolACE-2-Llama-3.1-8B
├── m2-8b/  # Salesforce/Llama-xLAM-2-8b-fc-r
└── m3-8b/  # watt-ai/watt-tool-8B

70B Models:

~/magg_checkpoints/magg/souping_experiments/checkpoints/
├── m1-70b/  # Salesforce/Llama-xLAM-2-70b-fc-r
├── m2-70b/  # watt-ai/watt-tool-70B
└── m3-70b/  # uiuc-convai/CoALM-70B
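The run_souping_e2e.sh script performs these downloads for you; if you want to fetch a single checkpoint manually, a rough equivalent using huggingface_hub (assuming the same directory layout as above) is:

from pathlib import Path
from huggingface_hub import snapshot_download

# Download one of the 8B checkpoints into the expected directory layout.
target = Path.home() / "magg_checkpoints/magg/souping_experiments/checkpoints/m1-8b"
snapshot_download(repo_id="Team-ACE/ToolACE-2-Llama-3.1-8B", local_dir=str(target))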

Model Averaging Configuration

The averaging weights below are the optimized configurations used by the framework (a quick sanity check of these mixtures follows the two lists):

8B Ensemble:

  • ToolACE-2-Llama-3.1-8B: 20%
  • Llama-xLAM-2-8b-fc-r: 70%
  • watt-tool-8B: 10%

70B Ensemble:

  • Llama-xLAM-2-70b-fc-r: 50%
  • watt-tool-70B: 30%
  • CoALM-70B: 20%
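Expressed as the weight dictionaries the configuration files use (model names shortened here), both mixtures sum to 1.0:

# Published mixture weights; each set of coefficients sums to 1.0.
weights_8b = {"ToolACE-2-Llama-3.1-8B": 0.2, "Llama-xLAM-2-8b-fc-r": 0.7, "watt-tool-8B": 0.1}
weights_70b = {"Llama-xLAM-2-70b-fc-r": 0.5, "watt-tool-70B": 0.3, "CoALM-70B": 0.2}

for name, weights in [("8B", weights_8b), ("70B", weights_70b)]:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, f"{name} weights must sum to 1.0"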

Advanced Usage

Custom Configurations

You can create custom ensemble configurations by modifying the configuration files:

  • souping_experiments/ensemble_configs_8b_sota.py - 8B model configurations
  • souping_experiments/ensemble_configs_70b_sota.py - 70B model configurations

Example custom configuration:

from pathlib import Path

ensemble_configs = [
    {
        "name": "custom_8b_ensemble",
        "models": {
            # Averaging weight for each checkpoint; the weights should sum to 1.0.
            f"{Path.home()}/path/to/model1": 0.4,
            f"{Path.home()}/path/to/model2": 0.6,
        },
        "output_dir": f"{Path.home()}/custom_output_dir/",
    }
]

Adding New Models

To add support for new models:

  1. Update the download function in run_souping_e2e.sh
  2. Modify the ensemble configuration files
  3. Ensure models have compatible architectures (Llama-based)

Parallel GPU-Accelerated Souping

For faster model souping when you have multiple GPUs available, you can use the parallel execution mode:

# Run with automatic parallel execution across available GPUs
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b

# Force sequential execution (still uses GPU for faster averaging)
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b --sequential

# Limit parallel workers to specific number
python souping_experiments/run_parallel_averaging.py ensemble_configs_8b_sota experiments_configs_26_08_8b --max-workers 4

Features:

  • Automatic GPU detection and memory estimation
  • Parallel processing with one worker per GPU
  • GPU-accelerated weight averaging
  • Falls back to sequential CPU mode if no GPUs are available

Requirements: Install GPU management library: pip install nvidia-ml-py3 or pip install pynvml
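Since the parallel mode requires pynvml (see above), GPU detection presumably amounts to querying NVML; a minimal stand-alone check (not the repository's exact code) looks like this:

import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {mem.free / 1e9:.1f} GB free of {mem.total / 1e9:.1f} GB total")
finally:
    pynvml.nvmlShutdown()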

Project Structure

llm_souping/
├── run_souping_e2e.sh                    # Main execution script (CPU-based)
├── souping_experiments/
│   ├── model_avg.py                      # Core averaging logic (CPU)
│   ├── example_model_averaging.py        # Averaging orchestration (CPU)
│   ├── model_avg_gpu.py                  # GPU-accelerated averaging logic
│   ├── run_parallel_averaging.py         # Parallel execution entry point
│   ├── parallel_runner_mp.py             # Multiprocessing parallel runner
│   ├── gpu_manager.py                    # GPU resource management
│   ├── ensemble_configs_8b_sota.py       # 8B model configurations
│   └── ensemble_configs_70b_sota.py      # 70B model configurations
├── README.md                             # This file
├── LICENSE                               # CC BY-NC 4.0 license
├── CODE_OF_CONDUCT.md                    # Code of conduct
└── CONTRIBUTING.md                       # Contribution guidelines

License

This project is licensed under the CC BY-NC 4.0 license. This means you can use, share, and adapt the material for non-commercial purposes with proper attribution.

Contributing

We welcome contributions! Please see our Contributing Guidelines and Code of Conduct for details on how to get involved.
