
question about ema alpha setting #168

Closed
FateScript opened this issue Nov 18, 2021 · 4 comments

FateScript commented Nov 18, 2021

Hi, thanks for your wonderful repo.
In your code for update_model_ema (pycls/pycls/core/net.py, lines 101 to 114 at ee770af):

```python
def update_model_ema(model, model_ema, cur_epoch, cur_iter):
    """Update exponential moving average (ema) of model weights."""
    update_period = cfg.OPTIM.EMA_UPDATE_PERIOD
    if update_period == 0 or cur_iter % update_period != 0:
        return
    # Adjust alpha to be fairly independent of other parameters
    adjust = cfg.TRAIN.BATCH_SIZE / cfg.OPTIM.MAX_EPOCH * update_period
    alpha = min(1.0, cfg.OPTIM.EMA_ALPHA * adjust)
    # During warmup simply copy over weights instead of using ema
    alpha = 1.0 if cur_epoch < cfg.OPTIM.WARMUP_EPOCHS else alpha
    # Take ema of all parameters (not just named parameters)
    params = unwrap_model(model).state_dict()
    for name, param in unwrap_model(model_ema).state_dict().items():
        param.copy_(param * (1.0 - alpha) + params[name] * alpha)
```

I notice that you use a magic line of code,

`adjust = cfg.TRAIN.BATCH_SIZE / cfg.OPTIM.MAX_EPOCH * update_period`

to modify the alpha value. Is there any insight behind doing this? If there is a paper on it, could you please point me to it?
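
Just to make the scale concrete for myself, here's what that adjustment evaluates to with some made-up config values (hypothetical, since I don't know the defaults you use):

```python
# Hypothetical config values, only to illustrate the scale of `adjust`
batch_size = 128      # cfg.TRAIN.BATCH_SIZE
max_epoch = 100       # cfg.OPTIM.MAX_EPOCH
update_period = 32    # cfg.OPTIM.EMA_UPDATE_PERIOD
ema_alpha = 1e-5      # cfg.OPTIM.EMA_ALPHA (made-up base value)

adjust = batch_size / max_epoch * update_period  # 40.96
alpha = min(1.0, ema_alpha * adjust)             # ~4.1e-04 per update
print(adjust, alpha)
```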

Thanks : )

@FateScript changed the title from "questions about ema alpha setting" to "question about ema alpha setting" on Nov 18, 2021

pdollar commented Nov 18, 2021

Hi @FateScript, yeah, good point. We never published this; it's some math I sketched out. I'll share it below with the caveat that I haven't been very careful to verify its full correctness, and it may lack context (e.g., variable meanings). I sketched the math and implemented it, and it works empirically. If you find an embarrassing mistake in it, please lmk! I probably won't have time to go into more detail, but I figured I'd share it in the hope that you can decode it and find it somewhat useful :)


Momentum formulation [α = .999]:
v = α · v + (1 − α) · u

Update formulation [α = .001]:
v = (1 − α) · v + α · u

Two-step update rolled into one, assuming α² ≈ 0 and setting u = (u₀ + u₁)/2:
v₁ = (1 − α) · v₀ + α · u₀
v₂ = (1 − α) · v₁ + α · u₁
v₂ = (1 − α) · ((1 − α) · v₀ + α · u₀) + α · u₁
v₂ ≈ (1 − α) · ((1 − α) · v₀) + α · u₀ + α · u₁   [drop the (1 − α) factor on α · u₀, since α² ≈ 0]
v₂ = (1 − 2α + α²) · v₀ + α · u₀ + α · u₁
v₂ ≈ (1 − 2α) · v₀ + 2α · u
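
As a quick numeric sanity check of that rolled-up step (toy values, nothing repo-specific), the two-step and one-step versions agree up to O(α²):

```python
alpha, v0, u0, u1 = 0.001, 1.0, 0.3, 0.5

# Two explicit ema update steps
v1 = (1 - alpha) * v0 + alpha * u0
v2 = (1 - alpha) * v1 + alpha * u1

# One rolled-up step with doubled alpha and averaged input
u = (u0 + u1) / 2
v2_rolled = (1 - 2 * alpha) * v0 + 2 * alpha * u

print(v2, v2_rolled)                    # 0.9988007 vs 0.9988
assert abs(v2 - v2_rolled) <= alpha**2  # difference is O(alpha^2)
```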

The same holds for n >> 1 updates, not just 2, since for small α with nα << 1 the following holds:
(1 − α)ⁿ = 1 − nα + n(n−1)α²/2! − n(n−1)(n−2)α³/3! + …   [binomial expansion]
(1 − α)ⁿ ≈ 1 − nα + n²α²/2! − n³α³/3! + …   [n >> 1]
(1 − α)ⁿ ≈ 1 − nα   [nα << 1]
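
Same kind of check for the n-step approximation (again, toy numbers):

```python
alpha, n = 0.001, 50    # small alpha, n * alpha = 0.05 << 1

exact = (1 - alpha) ** n    # ~0.95121
approx = 1 - n * alpha      # 0.95
print(exact, approx)        # agree to ~1e-3
```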

Thus, to make the update independent of the batch size n, we specify α* (independent of batch size) and use α in the update step, where:
α = α* · n
This makes the ema behavior roughly independent of the batch size n. Furthermore, it is not necessary to perform an update at every iteration. If we perform the update every k iterations, we effectively do an update after seeing n·k examples, and thus can use:
α = α* · n · k

Finally, to normalize by schedule length, we set:
α = α* · n · k / m
where m = #epochs. Empirically, we find that setting α this way allows a fairly constant α across schedule lengths without needing to carefully tune it for each schedule length. The logic isn't exactly equivalent for this last step; it's more that your "history" is proportional across runs with different epoch lengths. [Note: need to make this explanation more precise.]
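
In code form, the final rule is just the adjust line from update_model_ema, with α* = cfg.OPTIM.EMA_ALPHA, n = batch size, k = update period, and m = #epochs (sketch with made-up values):

```python
def effective_ema_alpha(ema_alpha, batch_size, update_period, max_epoch):
    """alpha = alpha* * n * k / m, clamped at 1.0 (as in update_model_ema)."""
    adjust = batch_size / max_epoch * update_period
    return min(1.0, ema_alpha * adjust)

# Doubling the batch size doubles alpha per update, which compensates for
# doing half as many updates over the same schedule (hypothetical values):
print(effective_ema_alpha(1e-5, 128, 32, 100))  # 4.096e-04
print(effective_ema_alpha(1e-5, 256, 32, 100))  # 8.192e-04
```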


pdollar commented Nov 18, 2021

[Screenshot: "Screen Shot 2021-11-18 at 8.59.02 AM" — the note with formatting preserved]

The formatting got lost in my previous post; here's a screenshot of my note with the formatting preserved.


FateScript commented Nov 19, 2021

Thanks @pdollar, I understand how the magic code works now. It's soooooo kind of you : )

BTW, I want to discuss this issue a bit more.
In my opinion, if the total number of images in the training process is unchanged, then
#iters = #epochs · #images_per_epoch / batch_size
so the value of batch_size / #epochs can be treated as k / #iters, where k is a constant that depends on your dataset (k = #images_per_epoch) and could be absorbed into alpha.
Maybe adjust = update_period / total_iters would be more intuitive? WDYT?
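
To sketch what I mean (hypothetical values; `total_iters` is not an actual pycls cfg field):

```python
# Sketch of the alternative; `total_iters` and these values are hypothetical
images_per_epoch = 1_281_167  # e.g. ImageNet-1k train set
batch_size, max_epoch, update_period = 128, 100, 32

total_iters = max_epoch * images_per_epoch // batch_size
adjust = update_period / total_iters  # differs from batch_size / max_epoch * update_period
                                      # only by the constant k = images_per_epoch
alpha_star = 0.01                     # would absorb k relative to the current EMA_ALPHA
alpha = min(1.0, alpha_star * adjust)
```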


pdollar commented Nov 19, 2021

Hey, thanks for digging in deeper! I don't think I have time to adjust this or think about it more deeply, and we're already using this way of defining the EMA for many models we have trained. I find it works really well, but more importantly, I wouldn't want to break backward compatibility at this stage even if the result were more intuitive! Thanks for the discussion/suggestions tho.
