
Improving Energy Conserving Descent for Machine Learning: Theory and Practice

In this repository we collect the implementation of the optimizer ECDSep described in Improving Energy Conserving Descent for Machine Learning: Theory and Practice, together with the code to reproduce the experiments presented in the paper.

Requirements

The requirements are in the file requirements.txt.

Importing the optimizer

The script inflation.py contains the code of ECDSep. To import the optimizer in your Python code, place the file in the folder of your code and add

from inflation import ECDSep

at the beginning of it. The optimizer is instantiated as (a minimal usage sketch follows the parameter list below)

ECDSep(parameters, lr, eta, nu=1e-5, weight_decay=0, F0=0., eps1=1e-10, eps2=1e-40, deltaEn=0., s=1, consEn=True)

where

  • parameters are the parameters to optimize.
  • lr is the learning rate value and is required.
  • nu is the chaos hyperparameter $\nu$.
  • eta is the measure concentration hyperparameter $\eta$ (required). It has to be >= 1. Increasing it concentrates the measure towards the bottom of the basin, which is useful for pure optimization problems where the goal is to find the smallest loss. Tested up to eta = 5.
  • F0 is the $F_0$ value.
  • eps1 and eps2 are two constants that improve stability.
  • deltaEn is the $\delta E$ value.
  • s is the regularization switch $s$.
  • weight_decay is the weight decay value.
  • consEn indicates whether energy is conserved or not during optimization.
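
As a quick reference, here is a minimal usage sketch, assuming ECDSep follows the usual torch.optim interface; the toy model and data are placeholders, not from the paper, and you should check inflation.py for the exact step() signature (e.g. whether it expects the current loss or a closure):

# Minimal usage sketch; assumes inflation.py is in the working directory.
import torch
import torch.nn as nn
from inflation import ECDSep

model = nn.Linear(10, 2)                      # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = ECDSep(model.parameters(), lr=0.1, eta=1.0, nu=1e-5)

inputs = torch.randn(32, 10)                  # random toy batch
targets = torch.randint(0, 2, (32,))

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()                          # adapt if step() needs the loss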

Reproducing the experiments

The paper contains several experiments comparing ECDSep against Adam, AdamW and SGD.

Synthetic

The experiments on the Ackley and Zakharov functions are in the notebooks ECDSep_ackley.ipynb and ECDSep_zakharov.ipynb respectively. To reproduce them, just execute the cells in the notebooks (note that these experiments do not require a GPU). The hyperparameters that give the best results are searched with Optuna for all optimizers.
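
For orientation, below is a minimal sketch (not the notebooks' exact code) of such a search for ECDSep on the Ackley function; the search ranges and budget are illustrative, and the ECDSep call assumes the interface described above:

# Illustrative Optuna search for ECDSep on the 10-dimensional Ackley function.
import math
import torch
import optuna
from inflation import ECDSep

def ackley(x):
    # Standard Ackley function; global minimum 0 at x = 0.
    d = x.numel()
    a = -20.0 * torch.exp(-0.2 * torch.sqrt((x ** 2).sum() / d))
    b = -torch.exp(torch.cos(2 * math.pi * x).sum() / d)
    return a + b + 20.0 + math.e

def objective(trial):
    lr = trial.suggest_float("lr", 1e-3, 1.0, log=True)
    nu = trial.suggest_float("nu", 1e-7, 1e-3, log=True)
    x = torch.full((10,), 3.0, requires_grad=True)   # fixed starting point
    opt = ECDSep([x], lr=lr, eta=1.0, nu=nu)
    for _ in range(500):
        opt.zero_grad()
        ackley(x).backward()
        opt.step()
    return ackley(x).item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)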

Images

CIFAR 100

The folder cifar-swa contains the code for the CIFAR-100 SWA experiments, which use the SWA utilities of PyTorch. It has been adapted from https://github.com/izmailovpavel/torch_swa_examples, with Adam, AdamW and ECDSep added.
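
For context, here is a generic sketch of the PyTorch SWA utilities involved (torch.optim.swa_utils); this is not the train.py code, and the tiny model and data are placeholders:

# Generic torch.optim.swa_utils sketch with placeholder model and data.
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10), nn.Linear(10, 2))
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(8)]
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

swa_model = AveragedModel(model)               # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # cf. --swa_lr in the commands
swa_start = 3                                  # cf. --swa_start in the commands

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)  # recompute BatchNorm statistics for the average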

To run the CIFAR100 experiments, move to the folder and then run train.py as follows.

For SGD:

cd cifar-swa
python3 train.py --dir=cifar_output --dataset=CIFAR100 --data_path=cifar100 --model=WideResNet28x10 --epochs=300 --lr_init=0.05 --swa_lr=0.05 --wd=5e-4 --swa --swa_start=161 --optimizer=sgd --momentum=0.9 --seed=42 

For ECDSep:

cd cifar-swa
python3 train.py --dir=cifar_output --dataset=CIFAR100 --data_path=cifar100 --model=WideResNet28x10 --epochs=300 --lr_init=0.4 --wd=5e-4 --swa --swa_start=161  --nu=5e-5 --eta=1.0 --deltaEn=0.0 --F0=0.0 --optimizer=ECDSep --seed=42 

For Adam:

cd cifar-swa
python3 train.py --dir=cifar_output --dataset=CIFAR100 --data_path=cifar100 --model=WideResNet28x10 --epochs=300 --lr_init=0.0001 --wd=1e-4 --swa --swa_start=161 --swa_lr=0.0001 --optimizer=adam --seed=42

For AdamW:

cd cifar-swa
python3 train.py --dir=cifar_output --dataset=CIFAR100 --data_path=cifar100 --model=WideResNet28x10 --epochs=300 --lr_init=0.0001 --wd=1e-4 --swa --swa_start=161 --swa_lr=0.0001 --optimizer=adamw --seed=42

The hyperparameters that give the best performance (accuracy) are:

  • ECDSep: $\nu = 5\times 10^{-5}$, $\Delta t = 0.4$.
  • SGD: $\alpha = 0.05$, $\beta = 0.9$.
  • Adam: $\alpha = 10^{-4}$, $w_d = 10^{-4}$.
  • AdamW: $\alpha = 10^{-4}$, $w_d = 10^{-4}$.

Imagenet and Tiny Imagenet

For these experiments, the SWA procedure is not applied during training; instead, the networks at each epoch are saved and averaged at the end. This gives more flexibility in choosing the best epoch at which to start the average.

The code for running these experiments is in the subfolder ECDSep-IN, together with the notebooks to perform this analysis. For tiny-IN the code will automatically download the dataset, but for IN it has to be provided by the user. We thank Daniel Kunin for sharing this code.

Tiny-Imagenet

The code to run the experiments for each optimizer is below.

SGD

cd ECDSep-IN
python3 train.py --experiment run-SGD --expid 1 --model-class=tinyimagenet --model resnet18 --dataset tiny-imagenet --wd 1e-4 --train-batch-size 128 --test-batch-size 128 --gpu 0 --lr 0.1 --epochs 100 --optimizer momentum --momentum 0.9 --seed 42 --overwrite 

ECDSep

cd ECDSep-IN
python3 train.py --experiment run-ECD --expid 1 --model-class=tinyimagenet --model resnet18 --dataset tiny-imagenet --wd 1e-4 --nu 5e-5 --train-batch-size 128 --test-batch-size 128 --gpu 0 --lr 0.6 --epochs 100 --eta 1.0 --F0 0.0 --deltaEn 0.0 --optimizer ECDSep --seed 42 --overwrite 

Adam

cd ECDSep-IN
python3 train.py --experiment run-adam --expid 1 --model-class=tinyimagenet --model resnet18 --dataset tiny-imagenet --wd 1e-4  --lr 0.001 --train-batch-size 128 --test-batch-size 128 --gpu 0 --epochs 100 --optimizer adam  --seed 42 --overwrite

AdamW

cd ECDSep-IN
python3 train.py --experiment run-adamw --expid 1 --model-class=tinyimagenet --model resnet18 --dataset tiny-imagenet --wd 1e-4  --lr 0.001 --train-batch-size 128 --test-batch-size 128 --gpu 0 --epochs 100 --optimizer adamw  --seed 42 --overwrite

The parameters that give the best performance (accuracy) for each optimizer are:

  • ECDSep: $\Delta t = 0.6, \nu = 5\times 10^{-5}$.
  • SGD: $\alpha = 0.1, \beta = 0.9$.
  • Adam: $\alpha = 10^{-3}, w_d = 10^{-4}$.
  • AdamW: $\alpha = 10^{-3}, w_d = 10^{-4}$.

After the run is done, the notebook notebooks/averages.ipynb can be used to average the saved networks over an arbitrary number of epochs (starting from the last one) and keep the averaged network with the best test accuracy.
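
The averaging itself amounts to a plain average of the saved state dicts; a rough sketch of the idea follows (not the notebook's code; the checkpoint layout is hypothetical). The notebook then scans how many trailing epochs to include and keeps the averaged network with the best test accuracy.

# Sketch: average the parameters stored in a list of checkpoints.
import copy
import torch

def average_checkpoints(paths):
    # Average the floating-point tensors of the saved state dicts;
    # non-float buffers (e.g. BatchNorm counters) are kept from the first.
    avg = copy.deepcopy(torch.load(paths[0], map_location="cpu"))
    for p in paths[1:]:
        sd = torch.load(p, map_location="cpu")
        for key, t in avg.items():
            if t.is_floating_point():
                avg[key] = t + sd[key]
    for key, t in avg.items():
        if t.is_floating_point():
            avg[key] = t / len(paths)
    return avg

paths = [f"checkpoints/epoch{e}.pt" for e in range(1, 101)]  # hypothetical layout
averaged = average_checkpoints(paths[-20:])   # e.g. average the last 20 epochs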

Imagenet-1K

For Imagenet-1K, the dataset has to be provided. In the commands below replace DATADIR with the directory where the IN-1K dataset is located.

For SGD

cd ECDSep-IN
python3 train.py --experiment run-SGD --expid 1 --model-class=imagenet --model resnet18 --pretrained True --dataset imagenet --data-dir DATADIR --wd 1e-4 --train-batch-size 128 --test-batch-size 128 --gpu 0 --lr 5e-5 --epochs 10 --momentum 0.99 --optimizer momentum --seed 42 --overwrite

For ECDSep

cd ECDSep-IN
python3 train.py --experiment run-ECD --expid 1 --model-class=imagenet --model resnet18 --pretrained True --dataset imagenet --data-dir DATADIR --wd 1e-4 --nu 1e-3 --train-batch-size 128 --test-batch-size 128 --gpu 0 --lr 0.05 --epochs 10 --eta 1.0 --F0 0.0 --deltaEn 0.0 --optimizer ECDSep --seed 42 --overwrite

For Adam

cd ECDSep-IN
python3 train.py --experiment run-adam --expid 1 --model-class=imagenet --model resnet18 --pretrained True --dataset imagenet --data-dir DATADIR --wd 1e-4 --train-batch-size 128 --test-batch-size 128 --gpu 0 --lr 1e-5 --epochs 10 --optimizer adam --seed 42 --overwrite

For AdamW

cd ECDSep-IN
python3 train.py --experiment run-adamw --expid 1 --model-class=imagenet --model resnet18 --pretrained True --dataset imagenet --data-dir DATADIR --wd 1e-4 --train-batch-size 128 --test-batch-size 128 --gpu 0 --lr 1e-5 --epochs 10 --optimizer adamw --seed 42 --overwrite

After the run is done, the notebook notebooks/averagesIN.ipynb can be used to average the saved networks over an arbitrary number of epochs (starting from the last one) and keep the averaged network with the best test accuracy.

The hyperparameters that give the best performance (accuracy) are:

  • ECDSep: $\Delta t = 0.1$, $\nu = 10^{-3}$, $w_d = 10^{-4}$.
  • SGD: $\alpha = 5\times 10^{-5}$, $\beta = 0.99$, $w_d=10^{-4}$.
  • Adam: $\alpha = 10^{-5}$, $w_d = 10^{-4}$.
  • AdamW: $\alpha = 10^{-5}$, $w_d = 10^{-4}$.

The best accuracy results of all optimizers are summarized in the following table. The results are averaged over $4$, $3$ and $2$ runs for CIFAR100, Tiny Imagenet and Imagenet-1K respectively.

                     ECDSep   SGD      Adam     AdamW
CIFAR 100            82.57    82.50    79.01    78.71
Tiny Imagenet        66.44    64.83    61.67    59.84
IN-1K (fine tuning)  70.49    70.49    70.48    70.48

Graphs

The experiments on graphs are collected in the Colab notebooks ECDSep_graphs_arxiv.ipynb and ECDSep_graphs_proteins.ipynb. To reproduce them, just execute the cells in the notebooks, changing the optimizers and the hyperparameters. Note that the required packages are installed automatically when executing the notebooks. The code for the datasets and the training is taken from the Open Graph Benchmark repository and is released under the MIT License. The hyperparameters that give the best results for each optimizer are:

  • For ogbn-arxiv
    • ECDSep: $\Delta t=2.8,\ \eta=4.5,\ \nu=10^{-5},\ w_d=0$ (and the default remaining hyperparameters).
    • SGD: $\alpha=0.1$, $\beta=0.95$, $w_d=10^{-3}$.
    • AdamW: $\alpha=5\times 10^{-3}$, $w_d=0$.
    • Adam: $\alpha=5\times 10^{-3}$, $w_d=0$.
  • For ogbn-proteins
    • ECDSep: $\Delta t=1.8,\ \eta=5,\ \nu=10^{-5},\ w_d=0$ (and the default remaining hyperparameters).
    • SGD: $\alpha=0.1$, $\beta=0.999$, $w_d=10^{-5}$.
    • AdamW: $\alpha=0.01$, $w_d=10^{-5}$.
    • Adam: $\alpha=0.01$, $w_d=0$.

The best performance (accuracy for ogbn-arxiv and ROC-AUC score for ogbn-proteins) of all optimizers is summarized in the following table. For ogbn-arxiv the results are averaged over $10$ runs, while for ogbn-proteins over $5$ runs.

               ECDSep   SGD      Adam     AdamW
ogbn-arxiv     71.55    71.81    72.37    72.41
ogbn-proteins  74.67    65.79    77.42    77.44

Language

The NLP experiments on the GLUE benchmark are contained in the script ECDSep_language_bert.py and are performed using the Hugging Face tools. To reproduce them, just run

python3 ECDSep_language_bert.py

with the following arguments (an example invocation is given after the list):

  • --optimizer: the optimizer used for the experiments, among ECDSep, adam, adamw and sgd. Default is ECDSep.
  • --lr: the value of the learning rate. Default is $0.04$.
  • --momentum: the momentum value for SGD. Default is $0.99$.
  • --nu: the $\nu$ value for ECDSep. Default is $10^{-5}$.
  • --eta: the $\eta$ value for ECDSep. Default is $1.4$.
  • --consEn: whether energy is conserved or not in ECDSep. Default is True.
  • --F0: the $F_0$ value for ECDSep. Default is $0$.
  • --deltaEn: $\delta E$ value for ECDSep. Default is $0$.
  • --s: regularization $s$ value for ECDSep. Default is $1$.
  • --epochs: number of epochs. Default is $3$.
  • --dataset: name of the GLUE dataset. Default "all" (meaning that the training is done over all datasets).
  • --seed: random seed to use in the experiment. Default is $42$.
  • --wd: weight decay value. Default is $0$.
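
For example, a single hypothetical run on QNLI with ECDSep and the default hyperparameters would look as follows (the exact string accepted by --dataset may differ; check the script's argument parsing):

python3 ECDSep_language_bert.py --optimizer ECDSep --dataset qnli --lr 0.04 --eta 1.4 --nu 1e-5 --epochs 3 --seed 42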

The hyperparameters that give the best results for each optimizer are:

  • For MNLI:
    • ECDSep: $w_d = 0$, $\eta = 2$, $\nu = 10^{-4}$.
    • SGD: $w_d = 10^{-3}$, $\alpha = 10^{-5}$, $\beta = 0.99$.
    • AdamW: $w_d = 10^{-2}$, $\alpha = 2\times 10^{-5}$.
    • Adam: $w_d = 0$, $\alpha = 2\times 10^{-5}$.
  • For QQP:
    • ECDSep: $w_d = 0$, $\eta = 2$, $\nu = 10^{-4}$.
    • SGD: $w_d = 10^{-3}$, $\alpha = 10^{-5}$, $\beta = 0.99$.
    • AdamW: $w_d = 10^{-2}$, $\alpha = 2\times 10^{-5}$.
    • Adam: $w_d = 0$, $\alpha = 2 \times 10^{-5}$.
  • For QNLI:
    • ECDSep: $w_d = 0$, $\eta = 1.4$, $\nu = 10^{-5}$.
    • SGD: $w_d = 10^{-3}$, $\alpha = 10^{-5}$, $\beta = 0.99$.
    • AdamW: $w_d = 10^{-2}$, $\alpha = 2\times 10^{-5}$.
    • Adam: $w_d = 0$, $\alpha = 2 \times 10^{-5}$.
  • For SST-2:
    • ECDSep: $w_d = 0$, $\eta = 1$, $\nu = 10^{-4}$.
    • SGD: $w_d = 10^{-3}$, $\alpha = 10^{-5}$, $\beta = 0.99$.
    • AdamW: $w_d = 0$, $\alpha = 2 \times 10^{-5}$.
    • Adam: $w_d = 0$, $\alpha = 2\times 10^{-5}$.
  • For CoLA:
    • ECDSep: $w_d = 10^{-2}$, $\eta = 2$, $\nu = 10^{-5}$.
    • SGD: $w_d = 10^{-3}$, $\alpha = 10^{-4}$, $\beta = 0.9$.
    • AdamW: $w_d = 10^{-3}$, $\alpha = 3\times 10^{-5}$.
    • Adam: $w_d = 10^{-5}$, $\alpha = 2\times 10^{-5}$.
  • For STS-B:
    • ECDSep: $w_d = 10^{-2}$, $\eta = 2$, $\nu = 10^{-5}$.
    • SGD: $w_d = 10^{-2}$, $\alpha = 10^{-5}$, $\beta = 0.99$.
    • AdamW: $w_d = 0$, $\alpha = 3\times 10^{-5}$.
    • Adam: $w_d = 0$, $\alpha = 2\times 10^{-5}$.
  • For MRPC:
    • ECDSep: $w_d = 10^{-3}$, $\eta = 1.4$, $\nu = 10^{-5}$.
    • SGD: $w_d = 10^{-3}$, $\alpha = 10^{-4}$, $\beta = 0.99$.
    • AdamW: $w_d = 0$, $\alpha = 2\times 10^{-5}$.
    • Adam: $w_d = 10^{-2}$, $\alpha = 2\times 10^{-5}$.
  • For RTE:
    • ECDSep: $w_d = 10^{-3}$, $\eta = 1$, $\nu = 10^{-5}$.
    • SGD: $w_d = 10^{-3}$, $\alpha = 10^{-4}$, $\beta = 0.99$.
    • AdamW: $w_d = 10^{-2}$, $\alpha = 3\times 10^{-5}$.
    • Adam: $w_d = 10^{-2}$, $\alpha = 2\times 10^{-5}$.

The best performance (Matthews correlation for CoLA, Spearman's correlation for STS-B, F1 score for MRPC and QQP, and accuracy for the remaining datasets) of all optimizers is summarized in the following table. The results are averaged over $3$ runs.

        MNLI    QQP     QNLI    SST-2   CoLA    STS-B   MRPC    RTE     avg.
ECDSep  84.24   86.70   91.19   92.66   57.91   89.26   90.96   73.16   83.26
SGD     83.31   86.36   91.03   92.17   60.54   89.26   90.88   71.96   83.19
Adam    84.31   88.14   91.39   92.81   59.34   89.02   91.09   71.36   83.43
AdamW   84.41   88.21   91.49   93.03   59.68   89.15   91.13   71.24   83.54
