In the deepspeed docs, the activation memory is calculated by:
XXX: For Transformers is probably around (2* seq * attn_heads + 16 * hidden_size) * sequence * batch/gpu
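For concreteness, that heuristic can be written out as a small sketch. The dimensions below (BERT-base-like: seq=512, 12 heads, hidden=768) are my own illustrative assumptions, not values from the docs, and the comment itself does not say whether the count is per layer or per model:

```python
def deepspeed_activation_elems(seq, attn_heads, hidden_size, batch_per_gpu):
    # Heuristic quoted from the DeepSpeed comment above:
    # (2 * seq * attn_heads + 16 * hidden_size) * seq * batch/gpu
    # Returns a count of elements (multiply by bytes/element for memory).
    return (2 * seq * attn_heads + 16 * hidden_size) * seq * batch_per_gpu

# Assumed BERT-base-like shapes (illustrative only):
elems = deepspeed_activation_elems(seq=512, attn_heads=12,
                                   hidden_size=768, batch_per_gpu=1)
print(elems)            # element count
print(2 * elems / 2**20)  # rough MB at 2 bytes/element (fp16)
```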
In the ZeRO-Infinity paper, Section 3, the activation memory is calculated by:
2 * bsz * seq * hd * nl/ci
where bsz is the batch size, seq is the sequence length, hd is the hidden dimension, nl is the number of Transformer layers, and ci is the number of Transformer blocks between two activation checkpoints.
If we don't use activation checkpointing (ci = 1), this becomes:
2 * bsz * seq * hd * nl
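A sketch of the ZeRO-Infinity formula, under my reading that the leading 2 is bytes per element for fp16 and that the formula counts only the checkpointed activations (this interpretation, and the BERT-base-like dimensions below, are assumptions on my part):

```python
def zero_infinity_ckpt_bytes(bsz, seq, hd, nl, ci=1):
    # ZeRO-Infinity Sec. 3 estimate: 2 bytes/element (fp16)
    # * bsz * seq * hd * (nl / ci) checkpointed activations.
    return 2 * bsz * seq * hd * nl // ci

# Assumed BERT-base-like shapes (illustrative only):
no_ckpt = zero_infinity_ckpt_bytes(bsz=1, seq=512, hd=768, nl=12)        # ci = 1
with_ckpt = zero_infinity_ckpt_bytes(bsz=1, seq=512, hd=768, nl=12, ci=2)
print(no_ckpt, with_ckpt)  # bytes; checkpointing every 2 blocks halves it
```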
Which one is correct? Could anyone please tell me how to calculate the activation memory of a Transformer/BERT layer with multi-head self-attention?