In the deepspeed docs, the activation memory is calculated by:
XXX: For Transformers is probably around (2* seq * attn_heads + 16 * hidden_size) * sequence * batch/gpu
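For concreteness, that heuristic can be written out as a small sketch. The dimensions below (BERT-base-like: seq=512, 12 heads, hidden=768) are my own illustrative assumptions, not values from the docs, and the comment itself does not say whether the count is per layer or per model:

```python
def deepspeed_activation_elems(seq, attn_heads, hidden_size, batch_per_gpu):
    # Heuristic quoted from the DeepSpeed comment above:
    # (2 * seq * attn_heads + 16 * hidden_size) * seq * batch/gpu
    # Returns a count of elements (multiply by bytes/element for memory).
    return (2 * seq * attn_heads + 16 * hidden_size) * seq * batch_per_gpu

# Assumed BERT-base-like shapes (illustrative only):
elems = deepspeed_activation_elems(seq=512, attn_heads=12,
                                   hidden_size=768, batch_per_gpu=1)
print(elems)            # element count
print(2 * elems / 2**20)  # rough MB at 2 bytes/element (fp16)
```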
In the ZeRO-Infinity paper, Section 3, the activation memory is calculated by:
2 * bsz * seq * hd * nl/ci
where bsz is the batch size, seq is the sequence length, hd is the hidden dimension, nl is the number of Transformer layers, and ci is the number of Transformer blocks between two activation checkpoints.
If we don't use activation checkpointing (ci = 1), this becomes:
2 * bsz * seq * hd * nl
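A sketch of the ZeRO-Infinity formula, under my reading that the leading 2 is bytes per element for fp16 and that the formula counts only the checkpointed activations (this interpretation, and the BERT-base-like dimensions below, are assumptions on my part):

```python
def zero_infinity_ckpt_bytes(bsz, seq, hd, nl, ci=1):
    # ZeRO-Infinity Sec. 3 estimate: 2 bytes/element (fp16)
    # * bsz * seq * hd * (nl / ci) checkpointed activations.
    return 2 * bsz * seq * hd * nl // ci

# Assumed BERT-base-like shapes (illustrative only):
no_ckpt = zero_infinity_ckpt_bytes(bsz=1, seq=512, hd=768, nl=12)        # ci = 1
with_ckpt = zero_infinity_ckpt_bytes(bsz=1, seq=512, hd=768, nl=12, ci=2)
print(no_ckpt, with_ckpt)  # bytes; checkpointing every 2 blocks halves it
```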
Which one is correct? Could anyone please tell me how to calculate the activation memory of a Transformer/BERT layer with multi-head self-attention?