Conversation
|
@stas00 I have a few questions I'd like to discuss.
Thank you! P.S. Very happy that CI is working (though it took 10+ minutes to finish)! |
Yes! Other than the really slow instance booting, the main overhead is this: it takes about 5 minutes to compile the CUDA kernels, which happens on every CI run. If you want to try to speed it up, see: And note that CI only works on non-forked branches. |
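For reference, DeepSpeed's C++/CUDA ops can be prebuilt once at install time instead of JIT-compiled on every run. A sketch (the exact build flags that apply depend on the DeepSpeed version and the CUDA toolchain available on the CI image):

```shell
# Prebuild all DeepSpeed ops at install time rather than JIT-compiling at runtime
DS_BUILD_OPS=1 pip install deepspeed

# Or prebuild only the ops the training run actually uses, e.g. fused Adam
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
```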
This is a good consideration. Here is my take on it: this counter has a different logical purpose. If you were to log every 100 iterations, it would tell you whether any iterations were skipped due to internal logic, so you know something was off - e.g. the loss scale was too big. I.e. this is the framework reporting to the user that it did something the user needs to know about. with
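A minimal sketch of that kind of counter (class and method names here are hypothetical, not Megatron's actual API):

```python
class SkippedIterationCounter:
    """Counts iterations the framework skipped internally (e.g. loss scale too big)
    and reports them to the user at a fixed logging interval."""

    def __init__(self, log_interval=100):
        self.log_interval = log_interval
        self.iteration = 0
        self.skipped = 0

    def step(self, was_skipped):
        self.iteration += 1
        if was_skipped:
            self.skipped += 1
        if self.iteration % self.log_interval == 0:
            # framework-to-user report: something was off in the skipped iterations
            print(f"iteration {self.iteration}: {self.skipped} skipped so far")


counter = SkippedIterationCounter(log_interval=100)
for i in range(200):
    counter.step(was_skipped=i in (3, 7))
```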
The logging is part of the spec, so absolutely yes. But this is not sufficient for testing that it actually works - that would only test that the right branch of code was invoked. As I wrote in the spec, you actually want to test that the correct |
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
|
@stas00 Some updates on the changes I've made since yesterday:
I will spend more time looking at logging and testing moving forward. Thank you! |
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
@stas00 I've added logging and testing based on the specs and your earlier comment. I've also removed some repetition by using
Thank you! |
I'm not 100% sure what you are asking about - the exact format match rather than some sort of regex? No, that's fine.
I'd say let's not worry about it. This is not normal functionality.
Of course, it's in the logs. Example: |
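On the exact-format-vs-regex question above, a loose regex match on the log output is usually more robust than an exact string comparison, since spacing or neighboring fields may change. A sketch with a hypothetical log line in roughly Megatron's "key | key" style:

```python
import re

# hypothetical log line, not the exact format produced by the framework
log_line = "iteration 100 | lm loss: 6.31 | number of skipped iterations: 3 |"

# match only the field under test, tolerating surrounding fields and whitespace
match = re.search(r"skipped iterations:\s*(\d+)", log_line)
assert match is not None
assert int(match.group(1)) == 3
```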
|
Oh, I know what the problem is - |
|
OK, it succeeded. I left the debug code in for now in case we choose to change the logic before we merge this. Please think about this one: As it's critical that we get it right. |
|
I'm trying your PR on a live server to try to salvage the CL training. We have another bug here: it fails if we skip the first iteration. I fixed it with: |
megatron/training.py

    new_samples = mpu.get_data_parallel_world_size() * \
                  args.micro_batch_size * \
                  get_num_microbatches()
    args.consumed_train_samples += new_samples
@jaketae @stas00 I think we should not accumulate args.consumed_train_samples, args.consumed_train_tokens, iteration, and args.iteration here, for two reasons: 1) If we just skip the data but still count the steps, samples, and tokens, it could lead to undesirable behavior for techniques that rely on these stats. For example, curriculum learning relies on the step count to calculate the current seqlen. 2) The DeepSpeed engine itself keeps counting the global step when the train step is called. So if we only increment the step on the user side without calling the train step on the DS engine, it will produce a global step mismatch, which is also an issue.
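The global-step mismatch in point 2 can be illustrated with a toy loop (this is only an illustration, not DeepSpeed's actual API): the user-side iteration keeps advancing through the skipped range, while the engine's global step only advances on a real train step.

```python
user_iteration = 0
engine_global_steps = 0

for it in range(10):
    user_iteration += 1
    if 3 <= it <= 5:
        # skipped range: data is discarded, no train step reaches the engine
        continue
    # the engine increments its global step only when the train step is called
    engine_global_steps += 1

print(user_iteration, engine_global_steps)  # 10 7 -> the counters have drifted
```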
So this is the incarnation that I'm running at the moment on the CL experiment on JZ:
    start, end = args.skip_train_iteration_range.popleft()
    print_rank_0(f"RANGE: {start} {end}")
    print_rank_0(f"iteration {args.iteration}")
    print_rank_0(f"Skipped iterations {start} {end} due to --skip-iterations flag.")
    iteration_for_skipping = args.iteration
    while iteration_for_skipping + 1 <= end:
        try:
            _ = next(train_data_iterator)
        except TypeError:
            pass
        iteration_for_skipping += 1
    continue
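The draining loop above can be sanity-checked in isolation with a plain range iterator standing in for the real train_data_iterator (the helper name and the deque-of-ranges argument are just for this sketch):

```python
from collections import deque

def drain_skipped(train_data_iterator, skip_ranges, iteration):
    """Advance the data iterator across the next skipped [start, end] range
    and return the iteration number after the skip."""
    start, end = skip_ranges.popleft()
    while iteration + 1 <= end:
        try:
            next(train_data_iterator)
        except (TypeError, StopIteration):
            pass  # as above: tolerate a dummy or exhausted iterator
        iteration += 1
    return iteration

data = iter(range(100))
new_iteration = drain_skipped(data, deque([(1, 5)]), iteration=0)
following = next(data)  # first sample actually drawn after the skip
print(new_iteration, following)  # 5 5
```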
|
Reposting some potentially important conversations from Slack for documentation/future reference. |
|
I backported this feature into the |
|
@stas00 I see in the backported commit that you implemented the additional counter discussed in Slack. I'm wondering how it plays with checkpointing, etc. Let me know if there's something I can contribute (i.e. perhaps bringing in the changes in your backported commit into this branch before the final merge)! |
|
Since the new version silently skips the data, that's all there is to it - no need to do anything else. But the problem is that we don't know when the data was skipped. So if down the road we want to extract a sample, it will be incorrect. So we either want to keep a separate counter of the real number of samples drawn (which would require a lot of extra work) or adjust the So perhaps let's do the latter? Of course our elaborate test will have to be cut down to something much simpler now. |
|
@stas00 I agree that the easiest way to examine samples would be to modify
Thank you! |
It has its own counters, so it doesn't affect anything, not even the general
Exactly
Let's start with making simple assumptions and not worry about edge cases here.
Yes please. Bottom line, let's finish this PR with:
|
|
@stas00 Pushed the updates and opened a new issue, #189. Please feel free to edit or comment as you see fit. Thank you! There seems to be an issue with the deepspeed dependency. Traceback from GitHub Actions: Is this something that just happens from time to time? |
|
This is strange indeed. I will ask Jeff. |
|
Jeff fixed the DSE issue, but we have now discovered a new issue in deepspeed@master - it should be fixed soon. Waiting for the merge of deepspeedai/DeepSpeed#1569
* Llama 2 GQA
* llama2 pretrain demo
* GQA minor fix
This PR resolves #175.
continue logic to skip iterations