The official code of our CVPR 2025 paper: Gradient-Guided Annealing for Domain Generalization
In this paper we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may be significantly different across the training and test distributions, contrary to the case of i.i.d. data. Conflicts between gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization, by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where gradients are updated towards the same direction for each data distribution present in the training set, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that exhibit improved robustness against domain shifts.
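For intuition, below is a minimal sketch of the two ingredients the description above relies on: measuring gradient alignment across training domains, and proposing annealed weights within a small neighborhood of the current point. This is illustrative only; the per-domain loss handling and the Gaussian form of the perturbation are assumptions, not the repository's implementation.

```python
import copy
import torch
import torch.nn.functional as F


def mean_pairwise_grad_similarity(model, domain_losses):
    """Average pairwise cosine similarity between per-domain gradients."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for loss in domain_losses:  # one scalar loss per training domain (assumed given)
        g = torch.autograd.grad(loss, params, retain_graph=True)
        grads.append(torch.cat([x.reshape(-1) for x in g]))
    sims = [F.cosine_similarity(grads[i], grads[j], dim=0)
            for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return torch.stack(sims).mean()


def propose_annealed_weights(model, neighborhood_size):
    """Randomly perturb the weights within a small neighborhood of the current point."""
    candidate = copy.deepcopy(model)
    with torch.no_grad():
        for p in candidate.parameters():
            p.add_(neighborhood_size * torch.randn_like(p))
    return candidate
```

During the early annealing window (see --start_step / --end_step in the commands below), the idea is to keep candidate weights whose domain gradients point in the same direction; --neighborhoodSize presumably sets the size of the perturbation neighborhood.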
Note that this project is built upon SWAD and DomainBed.
pip install -r requirements.txt
python -m domainbed.scripts.download --data_dir=/my/datasets/path

Environment details used for our study:
Python: 3.10.12
PyTorch: 2.0.1
Torchvision: 0.15.2
CUDA: 11.8
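A quick, optional way to check that a local setup roughly matches these versions:

```python
# Environment sanity check (optional).
import sys
import torch
import torchvision

print("Python:", sys.version.split()[0])        # expected 3.10.x
print("PyTorch:", torch.__version__)            # expected 2.0.1
print("Torchvision:", torchvision.__version__)  # expected 0.15.2
print("CUDA build:", torch.version.cuda, "| available:", torch.cuda.is_available())  # expected 11.8
```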
The train_all.py script conducts multiple leave-one-out cross-validations over all target domains.
python train_all.py exp_name --dataset <dataset> --data_dir /my/datasets/path --trial_seed <seed> --algorithm <algorithm> --checkpoint_freq 100 --lr <lr> --weight_decay 1e-4 --resnet_dropout 0.5 --swad False

We provide instructions to reproduce the main results of the paper (Tables 1 and 2); a seed-sweep sketch follows the per-dataset commands below. Note that differences in the exact environment or uncontrolled randomness may yield slightly different results from those reported in the paper.
- PACS
python train_all.py PACS0 --dataset PACS --data_dir /my/datasets/path --deterministic --trial_seed 0 --algorithm ERM_GGA --checkpoint_freq 100 --lr 3e-5 --weight_decay 1e-4 --resnet_dropout 0.5 --swad False \
--start_step 100 --end_step 200 --neighborhoodSize 0.00001
- VLCS
python train_all.py VLCS0 --dataset VLCS --data_dir /my/datasets/path --deterministic --trial_seed 0 --algorithm ERM_GGA --checkpoint_freq 100 --lr 1e-5 --weight_decay 1e-4 --resnet_dropout 0.5 --swad False \
--start_step 100 --end_step 200 --neighborhoodSize 0.000001
- OfficeHome
python train_all.py OH0 --dataset OfficeHome --data_dir /my/datasets/path --deterministic --trial_seed 0 --algorithm ERM_GGA --checkpoint_freq 100 --lr 1e-5 --weight_decay 1e-4 --resnet_dropout 0.5 --swad False \
--start_step 100 --end_step 200 --neighborhoodSize 0.00001
- TerraIncognita
python train_all.py TR0 --dataset TerraIncognita --data_dir /my/datasets/path --deterministic --trial_seed 0 --algorithm ERM_GGA --checkpoint_freq 100 --lr 1e-5 --weight_decay 1e-4 --resnet_dropout 0.5 --swad False \
--start_step 100 --end_step 200 --neighborhoodSize 0.00001
- DomainNet
python train_all.py DN0 --dataset DomainNet --data_dir /my/datasets/path --deterministic --trial_seed 0 --algorithm ERM_GGA --checkpoint_freq 100 --lr 3e-5 --weight_decay 1e-6 --resnet_dropout 0.5 --swad False \
--start_step 100 --end_step 200 --neighborhoodSize 0.00001
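To average results over multiple runs, one option is to sweep --trial_seed with a small driver script. A minimal sketch for PACS follows; the three-seed convention is borrowed from DomainBed and is an assumption, not a requirement of this repo.

```python
# sweep_seeds.py -- hypothetical helper, not part of the repository.
import subprocess

for seed in (0, 1, 2):  # DomainBed-style trial seeds (assumed)
    subprocess.run([
        "python", "train_all.py", f"PACS{seed}",
        "--dataset", "PACS", "--data_dir", "/my/datasets/path",
        "--deterministic", "--trial_seed", str(seed),
        "--algorithm", "ERM_GGA", "--checkpoint_freq", "100",
        "--lr", "3e-5", "--weight_decay", "1e-4", "--resnet_dropout", "0.5",
        "--swad", "False",
        "--start_step", "100", "--end_step", "200", "--neighborhoodSize", "0.00001",
    ], check=True)
```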
This repo also contains code for an alternative method to GGA, which injects noise directly into the gradient update step rather than modifying the weights separately. This alternative, termed GGA-L, is described in the Supplementary Material of the ArXiv version of the manuscript.
Specifically, in GGA-L we propose injecting dynamic noise, based on domain gradient similarity, directly into the update step:

$$\theta_{t+1} = \theta_t - \eta \left( \nabla_\theta \mathcal{L}(\theta_t) + \lambda_t\, \xi \right)$$

where $\xi$ is noise drawn from a Uniform distribution, $\lambda_t$ is a dynamic scaling factor depending on the average gradient similarity between domains, and $\gamma$ is a hyperparameter controlling the noise intensity.
GGA-L performs similarly to GGA at a considerably lower computational cost, since the gradients for each domain are only computed once per batch. The full GGA-L algorithm is given in the Supplementary Material of the ArXiv version of the manuscript.
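For illustration, here is a minimal sketch of a GGA-L-style step on a flattened parameter vector. The exact form of the scaling factor and the noise range are assumptions, not the repository's implementation.

```python
import torch
import torch.nn.functional as F


def gga_l_step(theta, domain_grads, lr, gamma):
    """One GGA-L-style update: SGD step plus similarity-scaled uniform noise.

    theta:        flattened model parameters (1-D tensor)
    domain_grads: one flattened gradient per training domain
    gamma:        hyperparameter controlling noise intensity (--gga_l_gamma)
    """
    n = len(domain_grads)
    sims = [F.cosine_similarity(domain_grads[i], domain_grads[j], dim=0)
            for i in range(n) for j in range(i + 1, n)]
    avg_sim = torch.stack(sims).mean()
    scale = gamma * (1.0 - avg_sim)                # assumed form: less alignment -> more noise
    xi = torch.empty_like(theta).uniform_(-1, 1)   # uniform noise (range assumed)
    grad = torch.stack(domain_grads).mean(dim=0)   # gradient averaged across domains
    return theta - lr * (grad + scale * xi)
```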
Similar to above, you can run GGA-L as follows:
python train_all.py exp_name --dataset <dataset> --data_dir /my/datasets/path --trial_seed <seed> --algorithm GGA_L --checkpoint_freq 100 --lr <lr> --gga_l_gamma <gamma> --weight_decay 1e-4 --resnet_dropout 0.5 --swad False

GGA boosts the performance of a vanilla model on all 5 datasets.
When applied on top of previously proposed algorithms, GGA is able to boost their performance in most cases.
NOTICE
We have identified and corrected several issues in the original version of this paper, including errors in both the manuscript and the accompanying code. We kindly ask that any comparisons or future references be made using the results and findings presented in the updated ArXiv version.
Please cite this paper if it helps your research:
@inproceedings{ballas2025gradient,
title={Gradient-Guided Annealing for Domain Generalization},
author={Ballas, Aristotelis and Diou, Christos},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={20558--20568},
year={2025}
}
This source code is released under the MIT license, included here.
This project includes some code from DomainBed, also MIT licensed.






