I’m reading through the Autotuner code and found this function:
It computes `total_gpus = self.exp_num_nodes * self.exp_num_gpus` based on the autotuning config. If ZeRO is enabled, then depending on which stages are enabled, `optimizer_mem`, `gradients_mem`, and/or `params_mem` get sharded across those GPUs. But then, if `self.mp_size()` (for tensor parallelism, right?) is greater than 1, the total memory estimate is divided again by the tensor-parallel degree. So if ZeRO and tensor parallelism are both enabled, this is double-dipping, right? With N GPUs, the per-GPU memory usage can't get any smaller than 1/N of the total (see the sketch after the list below). I'm not sure whether:
- there's a bug here,
- the value of the `num_gpus` flag is supposed to be reduced by the amount of tensor parallelism, or
- I'm not understanding this correctly.
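To make the concern concrete, here is a minimal sketch (not the actual Autotuner code) of how the per-GPU estimate appears to be derived; the function name, argument names, and the `*_mem` inputs are stand-ins for whatever the real function uses:

```python
def estimate_per_gpu_memory(params_mem, gradients_mem, optimizer_mem,
                            activation_mem, num_nodes, num_gpus_per_node,
                            zero_stage, mp_size):
    # All GPUs in the experiment, including any model-parallel ranks.
    total_gpus = num_nodes * num_gpus_per_node

    # ZeRO sharding: each enabled stage divides its state across all GPUs.
    if zero_stage >= 1:
        optimizer_mem = optimizer_mem / total_gpus
    if zero_stage >= 2:
        gradients_mem = gradients_mem / total_gpus
    if zero_stage >= 3:
        params_mem = params_mem / total_gpus

    mem = params_mem + gradients_mem + optimizer_mem + activation_mem

    # Tensor parallelism: the total is divided again by mp_size, even though
    # the model-parallel GPUs were already counted in total_gpus above.
    if mp_size > 1:
        mem = mem / mp_size

    return mem
```

Under these assumptions, with 8 GPUs, ZeRO-3, and `mp_size = 2`, the parameter footprint would be reported as 1/16 of the full model, which seems too small if the model-parallel ranks are already included in `total_gpus`.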