Hello @cli99, thank you very much for open-sourcing your library for analyzing large language models. It is very helpful for understanding various optimization algorithms and parallel configuration strategies. After going through the code, I have a few questions.
In the paper "Reducing Activation Recomputation in Large Transformer Models", there are two LayerNorms in one transformer layer. But in the code:
only one "weight_memory_layernorm_per_layer" is added.
Also, in that paper only the attention and MLP blocks are parallelized with tensor parallelism, but in the code below the LayerNorm weights are also divided by the tensor-parallel size when tensor parallelism is applied.
LayerNorm is not tensor parallelized; thanks for catching this. I removed the division by tp_size.
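For anyone following along, the corrected accounting would look roughly like this. This is a minimal sketch, not the library's actual code; the function and variable names (`hidden_dim`, `dtype_bytes`, etc.) are illustrative placeholders:

```python
# Sketch: per-layer LayerNorm weight memory, counting both LayerNorms in a
# transformer layer and NOT sharding them across tensor-parallel ranks.
def layernorm_weight_memory_per_layer(hidden_dim: int, dtype_bytes: int = 2) -> int:
    # Each LayerNorm has a weight and a bias of size hidden_dim;
    # a transformer layer has two of them (pre-attention and pre-MLP).
    params_per_layernorm = 2 * hidden_dim  # weight + bias
    num_layernorms_per_layer = 2
    # LayerNorm weights are replicated on every tensor-parallel rank,
    # so there is no division by tp_size here.
    return num_layernorms_per_layer * params_per_layernorm * dtype_bytes

# Example: a LLaMA-2-70B-like hidden size of 8192 in fp16
print(layernorm_weight_memory_per_layer(8192))  # 65536 bytes per layer
```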
The llama2 model uses gated linear units, and I recently added a field to ModelConfig so that mlp_gated_linear_units can be set to true in the model JSON files. If the model config is pulled from HF by name, there is no equivalent information to use, so I added https://github.com/cli99/llm-analysis/blob/main/llm_analysis/config.py#L218-L223 and made the assumption that if the expansion ratio (intermediate dim / model dim) is 3.5, the model uses gated linear units. The parameter estimation for llama2 should now be correct.
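The heuristic is along these lines. This is a rough sketch of the idea rather than the exact code at the linked lines, and the field names (`hidden_dim`, `ffn_embed_dim`) are assumptions:

```python
# Sketch of the expansion-ratio heuristic: when a config comes from HF by name
# and does not say whether the MLP uses gated linear units, infer it from the
# ratio of intermediate dim to model dim.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    hidden_dim: int            # model dim (assumed field name)
    ffn_embed_dim: int         # MLP intermediate dim (assumed field name)
    mlp_gated_linear_units: bool = False

def infer_gated_linear_units(config: ModelConfig) -> ModelConfig:
    expansion_ratio = config.ffn_embed_dim / config.hidden_dim
    # Assumption stated above: an expansion ratio of 3.5
    # (e.g. LLaMA-2-70B: 28672 / 8192) indicates gated linear units,
    # which add a third weight matrix to the MLP (gate, up, down).
    if expansion_ratio == 3.5:
        config.mlp_gated_linear_units = True
    return config

# LLaMA-2-70B-like dimensions: 28672 / 8192 == 3.5 -> gated
cfg = infer_gated_linear_units(ModelConfig(hidden_dim=8192, ffn_embed_dim=28672))
print(cfg.mlp_gated_linear_units)  # True
```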
When I print the summary_dict in the provided Python script in the llama2 folder, it gives me the following result:
weight_memory_per_gpu is 55292731392.0 bytes (about 55.3 GB), which is less than 70B params × 2 bytes / 2 = 70 GB.
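For reference, this is the back-of-envelope calculation I am comparing against (a rough sketch assuming fp16 weights and that the 70B parameters are split evenly over 2 devices):

```python
# Back-of-envelope check for weight_memory_per_gpu (assumptions: fp16 weights,
# 70e9 parameters, evenly sharded across 2 devices).
num_params = 70e9          # LLaMA-2-70B parameter count
bytes_per_param = 2        # fp16
parallel_degree = 2
expected = num_params * bytes_per_param / parallel_degree
print(expected)            # 7.0e10 bytes = 70 GB
reported = 55292731392.0   # weight_memory_per_gpu from summary_dict
print(reported / expected) # ~0.79, noticeably smaller than expected
```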
Could you provide more information about this? I look forward to your response; thank you once again.