How to calculate Transformer/Bert layer's activation memory? #1861

@cailun01

Description

In the deepspeed docs, the activation memory is calculated by:
XXX: For Transformers is probably around (2* seq * attn_heads + 16 * hidden_size) * sequence * batch/gpu
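For concreteness, here is a minimal sketch that plugs BERT-base-like dimensions into that docs formula. The dimensions (hidden_size=768, 12 attention heads, seq=512, batch of 1 per GPU) are assumed example values, not taken from the docs, and the docs do not state the units, so this treats the result as an element count:

```python
# Sketch: evaluate the DeepSpeed docs rule of thumb
#   (2 * seq * attn_heads + 16 * hidden_size) * seq * batch_per_gpu
# with assumed BERT-base-like dimensions.

def deepspeed_activation_elems(seq, attn_heads, hidden_size, batch_per_gpu):
    """Activation estimate per the DeepSpeed docs formula.
    Units are not specified in the docs; interpreted here as elements."""
    return (2 * seq * attn_heads + 16 * hidden_size) * seq * batch_per_gpu

elems = deepspeed_activation_elems(seq=512, attn_heads=12,
                                   hidden_size=768, batch_per_gpu=1)
print(elems)              # 12582912 elements
print(elems * 2 / 2**20)  # 24.0 MB if stored in fp16 (2 bytes/element)
```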

In the ZeRO-Infinity paper, Section 3, the activation memory is calculated by:
2 * bsz * seq * hd * nl/ci
where bsz is the batch size, seq is the sequence length, hd is the hidden dimension, nl is the number of Transformer layers, and ci is the number of Transformer blocks between two activation checkpoints.

If we don't use activation checkpointing (ci = 1), we get:
2 * bsz * seq * hd * nl
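The paper's formula can be sketched the same way with the same assumed BERT-base-like dimensions. Note that the paper's count is in bytes (it assumes fp16 checkpoints, 2 bytes per element) and covers only the seq × hd activation saved at each checkpoint boundary, which is one reason the two estimates differ:

```python
# Sketch: evaluate the ZeRO-Infinity paper formula
#   2 * bsz * seq * hd * nl / ci
# with assumed BERT-base-like dimensions (same assumptions as above).

def zero_infinity_activation_bytes(bsz, seq, hd, nl, ci=1):
    """Checkpointed-activation memory in bytes per the ZeRO-Infinity
    Section 3 formula, assuming fp16 (2 bytes/element) checkpoints."""
    return 2 * bsz * seq * hd * nl // ci

b = zero_infinity_activation_bytes(bsz=1, seq=512, hd=768, nl=12)
print(b, b / 2**20)  # 9437184 bytes, i.e. 9.0 MB
```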

Which one is correct? Could anybody tell me how to calculate the activation memory of a Transformer/BERT layer with multi-head self-attention?
