Low CPU memory usage feature #6
Merged
Conversation
jwilles approved these changes on Mar 11, 2024.
jacobthebanana pushed a commit that referenced this pull request on Mar 11, 2024:
Merged changes from low CPU memory usage feature (#6) into jjt/lora-benchmarking

* added changes to implement low cpu mem usage feature
* implemented new ruff linting changes and ran a fix across files
adil-a added a commit that referenced this pull request on Apr 26, 2024:
…ode. (#5)

* Implemented baseline LoRA peft for one Nvidia GPU.
* Added support for saving lora adapters. Added support for non-fsdp models.
* save_utils: added support for non-FSDP optimizers. trainer: replaced clip_grad_norm_ with nn.utils.clip_grad_norm_ for lora compatibility.
* example_lora: highlighted current lora (non-fsdp) limitations.
* Added instructions on LoRA on one GPU.
* Added example script for launching lora.
* Revised instructions on LoRA on one GPU.
* Implemented LoRA FSDP (see the auto-wrap policy sketch after this list). Also see https://github.com/facebookresearch/llama-recipes/blob/674b37ee66f59a7845cbc3868948f4d7fa69c679/src/llama_recipes/utils/fsdp_utils.py#L9
* Reverted automatic formatter changes in README.md.
* Eliminated non-FSDP logic from save_utils. Set model path to local copy of llama-2-7b in example config.
* Moved lora config out of example config.yaml.
* Implemented LoRA benchmarking logic for worker.
* model_utils: refactored get_lora_model to reduce interface width (this method no longer wraps load_model_and_tokenizer). test_modelling: revised base model fixture scope since torch FSDP wrap is in-place. launch_benchmark: added confirmation before launching.
* test_modelling: moved text output to data/.
* Added example yaml config for lora benchmarking.
* launch_benchmark: marked qos flag as optional.
* launch_benchmark: added option to limit number of jobs launched.
* launch_benchmark: implemented torch profiler integration.
* Merged changes from low CPU memory usage feature (#6) into jjt/lora-benchmarking.
* added changes to implement low cpu mem usage feature.
* implemented new ruff linting changes and ran a fix across files.
* Revised launch_benchmark.py to use new profiling path.
* Enabled automatic creation of data/trace folder.
* Added instructions for profiling tools.
* Cleaned up duplicate imports from merge.
* Cleaned up parse_benchmark.py.
* Integrated LoRA logic into llama_example.py.
* Moved lora_configs into train_parameters in config yaml. Adjusted docs/config.md accordingly.
* Revised handling of nproc-per-node in benchmark script.
* Included parameter_count info in benchmark output.
* Implemented basic util for parsing benchmarking output.
* model_utils: enabled low_cpu_mem_usage in auto model from_pretrained by default.
* launch_lora_benchmark.sh: implemented automatic identification of num_gpus. parse_benchmark: implemented option to specify benchmark artifact folder to load.
* requirements.txt: included accelerate to support low_cpu_mem loading.
* benchmark.py: adjusted BenchmarkingDataset to avoid StopIteration exception.
* benchmark.py: added env var flag to toggle export_trace.
* parse_benchmark: included profiler table in output file. launch_benchmark: automated folder creation. launch_lora_benchmark: included model info in slurm output.
* get_lora_model_from_base_model: enabled peft for models loaded via low_cpu_mem. More investigation might be needed.
* model_utils: revised dtype handling for peft-wrapped models.
* parse_benchmark: implemented sorting of profiler table output. launch_benchmark: revised default run time limit.
* Merged example_lora into examples/llama_example.py.
* Added instructions related to parse_benchmark.
* parse_benchmark: implemented aggregation across repeated metrics.
* Implemented non-LoRA profiling and benchmarking.
* Various static typechecking and formatting fixes.
* Implemented restoring LoRA train state from filesystem. During training, the adapter weights are saved to and loaded from the filesystem; the base model weights are loaded separately. Revised reference to optim_state_dict_to_load in load_optimizer.
* Included train step number in LoRA adapter output path.
* Added reference throughput table to documentation.
* Added unit description to reference throughput table. Applied markdown formatting via prettier.
* Benchmark: added option to override max_length of pre-trained model.
* Deleted unused `accelerate` dependency from requirements.txt.
* Benchmark: added comment on max_length.
* Benchmark: added comment on batch size.
* Benchmark: added option to override batch size.
* Benchmark throughput documentation: revised word choices.
* Moved profiling-tracking logic out of Trainer.
* Eliminated hasattr check related to no_sync since FSDP is always enabled.
* Replaced peft fsdp_auto_wrap_policy to eliminate implicit `accelerate` dependency. Eliminated redundant bfloat16 type conversion. Fixed scope of placeholder for `is_peft_adapter_restored`.
* Configured LoRA auto-wrap policy as off by default; enable the policy only when LoRA is required.
* Revised punctuation in lora_requires_grad_policy_fn.
* Renamed declarative `enable_lora` to descriptive `is_lora_enabled`.
* Replaced optimizer.load_state_dict with load_sharded_optimizer_state_dict for the PEFT optimizer. Added LoRA/PEFT documentation.
* benchmarking: deleted unused TypeVar in parse_benchmark.py.
* Replaced config getattr and hasattr with dict methods.
* Deleted redundant lora-specific launch scripts.
* Added launch_benchmark.sh for throughput benchmarks.
* Benchmark: run `makedirs` only `if __name__ == "__main__"`.
* Replaced peft class attributes in Trainer with instance attributes. Added information about benchmarking environment. Additional formatting fixes.

Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
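The LoRA FSDP items above hinge on the auto-wrap policy from the linked llama-recipes fsdp_utils.py: frozen base weights and trainable adapter weights must land in separate FSDP flat-parameter groups, which is achieved by wrapping trainable leaf modules individually. Below is a minimal sketch of that technique, assuming PyTorch 2.x; `get_lora_auto_wrap_policy` and `transformer_layer_cls` are illustrative names (not this repository's API), while `lora_requires_grad_policy_fn` is the name used in the commit log above.

```python
import functools

from torch.distributed.fsdp.wrap import (
    _or_policy,
    lambda_auto_wrap_policy,
    transformer_auto_wrap_policy,
)


def lora_requires_grad_policy_fn(module) -> bool:
    # Wrap leaf modules whose weight is trainable (the LoRA adapters), so
    # frozen base parameters and trainable adapter parameters end up in
    # separate FSDP flat-parameter groups.
    return (
        len(list(module.named_children())) == 0
        and getattr(module, "weight", None) is not None
        and module.weight.requires_grad
    )


def get_lora_auto_wrap_policy(transformer_layer_cls):
    # Illustrative helper: combine the LoRA leaf policy with the usual
    # transformer-block policy; FSDP wraps a module if either one matches.
    lambda_policy = functools.partial(
        lambda_auto_wrap_policy, lambda_fn=lora_requires_grad_policy_fn
    )
    transformer_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={transformer_layer_cls},
    )
    return functools.partial(_or_policy, policies=[lambda_policy, transformer_policy])
```

The returned partial would be passed as `auto_wrap_policy=` when constructing the FSDP-wrapped model, and only when LoRA is enabled, per the "off by default" item above.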
This PR implements the changes required to enable low-CPU-memory loading of a model.
Memory required to load Llama-70B in bf16 (2 bytes per parameter) on 4 GPUs:

* Naive loading, with each of the 4 ranks holding a full copy of the weights in CPU memory: 70 * 2 * 4 = 560 GB
* Low-CPU-memory loading, with a single copy of the weights (sketched below): 70 * 2 = 140 GB
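A minimal sketch of the loading call this PR enables (the model_utils item in the commit log turns this on by default); the checkpoint path is a placeholder, and `low_cpu_mem_usage=True` requires the `accelerate` package. How far the 560 GB figure shrinks toward the single-copy 140 GB also depends on the distributed loading strategy; the sketch shows only the `from_pretrained` flag.

```python
import torch
from transformers import AutoModelForCausalLM

# With low_cpu_mem_usage=True, from_pretrained skips the usual
# randomly-initialized allocation and materializes each weight tensor
# directly from the checkpoint, keeping the per-process peak near one
# copy of the model instead of two.
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/llama-2-70b",  # placeholder path
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
```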