
question about ema alpha setting #168

Closed
FateScript opened this issue Nov 18, 2021 · 4 comments

FateScript commented Nov 18, 2021

Hi, thanks for your wonderful repo.
In your code for update_model_ema (pycls/pycls/core/net.py, lines 101 to 114 at ee770af):

```python
def update_model_ema(model, model_ema, cur_epoch, cur_iter):
    """Update exponential moving average (ema) of model weights."""
    update_period = cfg.OPTIM.EMA_UPDATE_PERIOD
    if update_period == 0 or cur_iter % update_period != 0:
        return
    # Adjust alpha to be fairly independent of other parameters
    adjust = cfg.TRAIN.BATCH_SIZE / cfg.OPTIM.MAX_EPOCH * update_period
    alpha = min(1.0, cfg.OPTIM.EMA_ALPHA * adjust)
    # During warmup simply copy over weights instead of using ema
    alpha = 1.0 if cur_epoch < cfg.OPTIM.WARMUP_EPOCHS else alpha
    # Take ema of all parameters (not just named parameters)
    params = unwrap_model(model).state_dict()
    for name, param in unwrap_model(model_ema).state_dict().items():
        param.copy_(param * (1.0 - alpha) + params[name] * alpha)
```

I notice that you use a magic line of code,

`adjust = cfg.TRAIN.BATCH_SIZE / cfg.OPTIM.MAX_EPOCH * update_period`

to modify the alpha value. Is there any insight behind doing this? If there is a paper on it, could you please point me to it?
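
Just to make the scale concrete for myself, here's what that adjustment evaluates to with some made-up config values (hypothetical, since I don't know the defaults you use):

```python
# Hypothetical config values, only to illustrate the scale of `adjust`
batch_size = 128      # cfg.TRAIN.BATCH_SIZE
max_epoch = 100       # cfg.OPTIM.MAX_EPOCH
update_period = 32    # cfg.OPTIM.EMA_UPDATE_PERIOD
ema_alpha = 1e-5      # cfg.OPTIM.EMA_ALPHA (made-up base value)

adjust = batch_size / max_epoch * update_period  # 40.96
alpha = min(1.0, ema_alpha * adjust)             # ~4.1e-04 per update
print(adjust, alpha)
```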

Thanks : )

@FateScript changed the title from "questions about ema alpha setting" to "question about ema alpha setting" on Nov 18, 2021

pdollar commented Nov 18, 2021

Hi @FateScript, yeah, good point. We never published this; it's some math I sketched out. I'll share it below with the caveat that I haven't been very careful to verify its full correctness, and it may lack context (e.g., variable meanings). I sketched the math and implemented it, and it works empirically. If you find an embarrassing mistake in it, please lmk! I probably won't have time to go into more detail, but I figured I'd share it in the hope that you can decode it and find it somewhat useful :)


Momentum formulation [α = .999]:
v = α · v + (1 − α) · u

Update formulation [α = .001]:
v = (1 − α) · v + α · u

Two-step update rolled into one, assuming α² ≈ 0 and setting u = (u₀ + u₁)/2:
v₁ = (1 − α) · v₀ + α · u₀
v₂ = (1 − α) · v₁ + α · u₁
v₂ = (1 − α) · ((1 − α) · v₀ + α · u₀) + α · u₁
v₂ ≈ (1 − α) · ((1 − α) · v₀) + α · u₀ + α · u₁   [drop the (1 − α) factor on α · u₀, since α² ≈ 0]
v₂ = (1 − 2α + α²) · v₀ + α · u₀ + α · u₁
v₂ ≈ (1 − 2α) · v₀ + 2α · u
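
As a quick numeric sanity check of that rolled-up step (toy values, nothing repo-specific), the two-step and one-step versions agree up to O(α²):

```python
alpha, v0, u0, u1 = 0.001, 1.0, 0.3, 0.5

# Two explicit ema update steps
v1 = (1 - alpha) * v0 + alpha * u0
v2 = (1 - alpha) * v1 + alpha * u1

# One rolled-up step with doubled alpha and averaged input
u = (u0 + u1) / 2
v2_rolled = (1 - 2 * alpha) * v0 + 2 * alpha * u

print(v2, v2_rolled)                    # 0.9988007 vs 0.9988
assert abs(v2 - v2_rolled) <= alpha**2  # difference is O(alpha^2)
```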

The same holds for n >> 1 updates, not just 2, since for small α with nα << 1 the following holds:
(1 − α)ⁿ = 1 − nα + n(n−1)α²/2! − n(n−1)(n−2)α³/3! + …   [binomial expansion]
(1 − α)ⁿ ≈ 1 − nα + n²α²/2! − n³α³/3! + …   [n >> 1]
(1 − α)ⁿ ≈ 1 − nα   [nα << 1]
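
Same kind of check for the n-step approximation (again, toy numbers):

```python
alpha, n = 0.001, 50    # small alpha, n * alpha = 0.05 << 1

exact = (1 - alpha) ** n    # ~0.95121
approx = 1 - n * alpha      # 0.95
print(exact, approx)        # agree to ~1e-3
```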

Thus, to make the update independent of the batch size n, we specify α* (independent of batch size) and use α in the update step, where:
α = α* · n
This makes the ema behavior roughly independent of the batch size n. Furthermore, it is not necessary to perform an update at every iteration. If we perform the update every k iterations, we effectively do an update after seeing n·k examples, and thus can use:
α = α* · n · k

Finally, to normalize by schedule length, we set:
α = α* · n · k / m
where m = #epochs. Empirically, we find that setting α this way allows a fairly constant α across schedule lengths without needing to carefully tune it for each schedule length. The logic isn't exactly equivalent for this last step; it's more that your "history" is proportional across runs with different epoch lengths. [Note: need to make this explanation more precise.]
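
In code form, the final rule is just the adjust line from update_model_ema, with α* = cfg.OPTIM.EMA_ALPHA, n = batch size, k = update period, and m = #epochs (sketch with made-up values):

```python
def effective_ema_alpha(ema_alpha, batch_size, update_period, max_epoch):
    """alpha = alpha* * n * k / m, clamped at 1.0 (as in update_model_ema)."""
    adjust = batch_size / max_epoch * update_period
    return min(1.0, ema_alpha * adjust)

# Doubling the batch size doubles alpha per update, which compensates for
# doing half as many updates over the same schedule (hypothetical values):
print(effective_ema_alpha(1e-5, 128, 32, 100))  # 4.096e-04
print(effective_ema_alpha(1e-5, 256, 32, 100))  # 8.192e-04
```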


pdollar commented Nov 18, 2021

[Screenshot: "Screen Shot 2021-11-18 at 8.59.02 AM" — the note with formatting preserved]

The formatting got lost in my previous post; here's a screenshot of my note with the formatting preserved.


FateScript commented Nov 19, 2021

Thanks @pdollar, I understand how the magic code works now. It's soooooo kind of you : )

BTW, I want to discuss this issue a bit more.
In my opinion, if the total number of images in the training process is unchanged, then
#iters = #epochs · #images_per_epoch / batch_size
so the value of batch_size / #epochs can be treated as k / #iters, where k is a constant that depends on your dataset (k = #images_per_epoch) and could be absorbed into alpha.
Maybe adjust = update_period / total_iters would be more intuitive? WDYT?
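
To sketch what I mean (hypothetical values; `total_iters` is not an actual pycls cfg field):

```python
# Sketch of the alternative; `total_iters` and these values are hypothetical
images_per_epoch = 1_281_167  # e.g. ImageNet-1k train set
batch_size, max_epoch, update_period = 128, 100, 32

total_iters = max_epoch * images_per_epoch // batch_size
adjust = update_period / total_iters  # differs from batch_size / max_epoch * update_period
                                      # only by the constant k = images_per_epoch
alpha_star = 0.01                     # would absorb k relative to the current EMA_ALPHA
alpha = min(1.0, alpha_star * adjust)
```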


pdollar commented Nov 19, 2021

Hey, thanks for digging in deeper! I don't think I have time to adjust this or think about it more deeply, and we're already using this way of defining the EMA for many models we have trained. I find it works really well, but more importantly, I wouldn't want to break backward compatibility at this stage even if the result were more intuitive! Thanks for the discussion/suggestions tho.
