Switch order of ds init_inference and pipeline construction to save m… #373
The order of `deepspeed.init_inference` and `hf.pipeline` matters. If the HF pipeline is created first, the model ends up fully replicated across ranks when `deepspeed.init_inference` runs (each rank tries to split its own copy via tensor parallelism, resulting in one sharded model per rank).
If `deepspeed.init_inference` happens first, the HF pipeline uses the model as-is, since it has already been fully loaded, and the memory footprint is as expected.
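A minimal sketch of the correct ordering (not runnable without a GPU and a DeepSpeed install; the model name and parallelism settings are illustrative, not from this PR):

```python
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "gpt2"  # placeholder model for illustration
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1) Shard the model across ranks FIRST, so each rank holds only its slice.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,           # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# 2) Only then wrap it in the HF pipeline; the pipeline reuses the
#    already-sharded module instead of loading a full replica per rank.
generator = pipeline(
    "text-generation",
    model=model.module,
    tokenizer=tokenizer,
    device=local_rank,
)
```

Reversing steps 1 and 2 is what triggers the replication described above: `pipeline(...)` loads a full copy on every rank before DeepSpeed gets a chance to shard it.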
It's probably worth raising this with DeepSpeed-MII as well, since they create the HF pipeline first and then call `deepspeed.init_inference`: https://github.com/microsoft/DeepSpeed-MII/blob/main/mii/models/load_models.py