Floating-point ops counting and reloading #40
Conversation
Note this integrates cherry-picked fixes from the main Megatron codebase, most notably changes to integrate deepspeed.
With a few suggestions, looking good. Thank you, @TevenLeScao
I'm not yet familiar with the codebase, so I'm not the best person to review; I'll let the others approve.
btw, this:
will count tied variables multiple times, you probably want:
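(The original snippet and the suggested replacement aren't reproduced here; below is a minimal sketch of the idea, using an illustrative parameter list rather than the actual Megatron model structure.)

```python
import torch

# Two distinct Parameter objects wrapping the same storage, as happens when an
# embedding is tied across pipeline stages (illustrative names, not Megatron's).
shared = torch.randn(10, 4)
first_stage_embed = torch.nn.Parameter(shared)
last_stage_embed = torch.nn.Parameter(shared)   # tied: same underlying storage
head = torch.nn.Parameter(torch.randn(4, 10))

param_list = [first_stage_embed, last_stage_embed, head]

# Naive count: the tied weight is counted once per appearance -> 40 + 40 + 40 = 120.
naive = sum(p.numel() for p in param_list)

# Dedup by data_ptr(): tied copies collapse onto one storage address -> 40 + 40 = 80.
unique = sum({p.data_ptr(): p.numel() for p in param_list}.values())

print(naive, unique)
```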
Also please make sure to rebase to include this fix if you're testing:
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
megatron/training.py
Outdated
@@ -649,7 +649,7 @@ def save_checkpoint_and_time(iteration, model, optimizer, lr_scheduler):
 def train(forward_step_func, model, optimizer, lr_scheduler,
           train_data_iterator, valid_data_iterator):
     """Train the model function."""
-    print(sum(p.numel() for submodel in model for p in submodel.parameters() if p.requires_grad))
+    print(f"Number of trainable parameters: {sum(p.numel() for submodel in model for p in submodel.parameters() if p.requires_grad)}")
and this most likely needs to be fixed to remove double/triple-counting of tied vars:
#40 (comment)
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
megatron/utils.py
Outdated
def param_count_without_doubles(param_list):
    # Key by data_ptr() so tied parameters (which share storage) are counted only once.
    return sum(dict((p.data_ptr(), param_size(p)) for p in param_list).values())
    # sum(dict((p.data_ptr(), param_size(p)) for submodel in model for p in submodel.parameters()).values())
this can be removed then
megatron/utils.py
Outdated
    return parameter.ds_numel if hasattr(parameter, 'ds_id') else parameter.nelement()


def param_count_without_doubles(param_list):
Since these can be triplicates, and more, perhaps a more telling name would be just: unique_param_count?
So originally I wasn't sure I was testing it right (sorry, took a bit more time than I would have liked) but now I am quite certain that they both return the same number, with duplicates. Do we have other leads on this? I am testing by switching comments out:
and looking at the model parameter print-out at the start of training.
I think you're correct, Teven. I experimented some more with it and it looks like in the general case this is actually not needed. I got the original at https://stackoverflow.com/a/62764464/9201239 and I'm pretty sure I used it because I was getting duplicate counts, but perhaps that was in a different scenario. I think these have to be 2 tensors with shared set creation: For example here
Thank you for questioning my suggestion, @TevenLeScao!
Sorry for the misunderstanding, I currently find that both return the number with duplicates! For the same reasons you've laid out, this does not correspond to what I thought the code would do, so I'm still investigating.
Oh, you're saying you are getting wrong numbers with both approaches. I see. Let's look at the code/data. How many shared embeddings do you have in the model? What is the code that does the sharing (or tying)? Let's perhaps see the dump of the model structure as well? The things to consider:
So now you can look at the params that you think are shared and do the above and see which storage the python variables point to and which tensor data ptrs point to.
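For reference, a minimal sketch of that kind of inspection on plain PyTorch parameters (names are illustrative, not taken from the Megatron code):

```python
import torch

embed_weight = torch.nn.Parameter(torch.randn(100, 64))
tied_copy = torch.nn.Parameter(embed_weight.data)            # shares storage
independent = torch.nn.Parameter(embed_weight.data.clone())  # separate storage

def describe(name, p):
    # data_ptr() is the address of the first element; storage().data_ptr()
    # is the address of the underlying buffer. Tied params share both.
    print(name, hex(p.data_ptr()), hex(p.storage().data_ptr()))

describe("embed_weight", embed_weight)
describe("tied_copy", tied_copy)      # same addresses as embed_weight
describe("independent", independent)  # different addresses
```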
Hey, sorry, I looked at the code a bit and I don't think I currently have the bandwidth to adapt it this way. I'd really like this to be functional before we launch the next round of training; here are ways I think we can achieve that:
As I mentioned in my last comment, I will need context to work with to understand the problem and to be able to reproduce it. Once I've finished sorting out the checkpoints I can work on this, if you can explain to me how to reproduce the issue.
I don't think we have enough memory to load the whole model on cpu.
Doesn't hurt to ask
Warnings don't usually work IMHO, but if an incomplete solution works for your needs, and you need it now as-is, merge it and open another issue to track bringing this to completion.
I really haven't had time to come back to this; I'll merge now with a warning and open an issue.
Maybe unrelated, but I found this in DS repo: https://github.com/microsoft/DeepSpeed/blob/big-science/deepspeed/runtime/pipe/engine.py#L117-L126
* initial flo count/logging setup (need to fix model parameter count)
* initial flo count/logging setup (need to fix model parameter count)
* 1B3 parameter setup + flos counting
* 1B3 parameter setup + flos counting
* 1B3 parameter setup + flos counting
* 1B3 parameter setup
* 1B3 parameter setup
* synched with latest 13B script
* synched with latest 13B script
* pipe transformer docstring
* improve DS integration evaluation + logging
* use pp engine even for pp=1 (bigscience-workshop#6)
* removed slurm_examples
* flos re-loading
* Update megatron/training.py (Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>)
* Update megatron/data/gpt_dataset.py (Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>)
* Update megatron/utils.py (Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>)
* Update megatron/utils.py (Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>)
* formatting fix, reserving bug for somewhere else, adding flo-logging to TB groups
* indentation bug
* fixing possible double counts
* tweaks
* warning for double counts

Co-authored-by: Shaden Smith <shaden.smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: TevenLeScao <uhk85as@jean-zay1.idris.fr>
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Hi Teven, This warning isn't great:
as it logs on each GPU, resulting in hundreds of these flooding the log file. Could we please change it to run only on global rank 0? Thank you!
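One way to do that, as a generic torch.distributed sketch (not necessarily how it was ultimately fixed in the repo):

```python
import torch

def warn_rank_0(message):
    # Emit the warning only from global rank 0 so the log is not flooded
    # with one copy per GPU; fall back to a plain print in non-distributed runs.
    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
        print(message, flush=True)
```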
This PR adds flo-based logging of the validation loss. We count gigaflos to avoid overflowing, and because parameter numbers are also counted in billions. We don't count operations linked to the embeddings (cf. Scaling Laws for Neural Language Models, section 2.1). Those are stored in the args as args.gigaflos_no_embeds, and exported with them in the state_dict at checkpoint saving time. Currently, reloading a model trained before this PR assumes its flos are 0, which is similar behaviour to consumed_train_samples.
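A rough sketch of the bookkeeping described above; args.gigaflos_no_embeds is from the PR description, while the helper names and the 6·N·D approximation (the usual scaling-laws convention) are assumptions for illustration only:

```python
def update_gigaflos(args, non_embedding_params, tokens_this_iteration):
    # Scaling-laws style approximation: training compute ~ 6 * N * D FLOs
    # for N non-embedding parameters and D processed tokens.
    flos = 6 * non_embedding_params * tokens_this_iteration
    args.gigaflos_no_embeds += flos / 1e9   # keep the running total in gigaflos

def checkpoint_extras(args):
    # Exported alongside the model state_dict at checkpoint-saving time.
    return {'gigaflos_no_embeds': args.gigaflos_no_embeds}

def load_checkpoint_extras(args, state_dict):
    # Checkpoints written before this PR carry no flos entry: fall back to 0,
    # mirroring how consumed_train_samples is handled.
    args.gigaflos_no_embeds = state_dict.get('gigaflos_no_embeds', 0)
```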