iteration-time increases linearly (for TP=2, PP=1 & TP=1, PP=2) #22
Comments
Feedback from an anonymous expert ;) I would use the occasion to add profiling decorators to all functions that are expected to be time-consuming. Depending on a flag, the decorator would then profile the function (number of calls, time spent, etc.), which will help you find the culprit. Dumping the state of the CUDA allocator from time to time is also interesting.
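A minimal sketch of what such a profiling decorator could look like, assuming a plain PyTorch training script; the flag name, stats dictionary, and dump helper below are illustrative and not part of the project's code:

```python
# Sketch only: count calls and accumulate wall-clock time per function,
# gated by a flag, plus an occasional CUDA allocator dump.
import functools
import time
from collections import defaultdict

import torch

PROFILE_ENABLED = True  # assumed to be set from a command-line flag
profile_stats = defaultdict(lambda: {"calls": 0, "total_s": 0.0})


def profiled(fn):
    """Record call count and cumulative time for fn when profiling is on."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not PROFILE_ENABLED:
            return fn(*args, **kwargs)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            stats = profile_stats[fn.__qualname__]
            stats["calls"] += 1
            stats["total_s"] += time.perf_counter() - start
    return wrapper


def dump_cuda_allocator_state(step):
    """Occasionally log the CUDA caching-allocator state to spot leaks or fragmentation."""
    if torch.cuda.is_available():
        print(f"--- allocator state at step {step} ---")
        print(torch.cuda.memory_summary(abbreviated=True))
```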
That's interesting, we haven't seen that before on our side. Could it be related to the memory requirements? What GPUs are you using? What's your micro batch size? Does it also occur with TP=4, PP=1 or TP=1, PP=4?
I tried on gpu009 with micro batch size 2 and TP=4, PP=1 and again saw the strange effect. It is remarkable that you can run all your training without problems ;-) The main difference is that I used my local_changes branch, but so far I don't know what in it could cause this. You can run variable-length training on the official instruction-tuning branch with TP=4, PP=1 without problems, right?
Yes, we haven't experienced this issue at all. I have observed that the iteration time fluctuates a bit, but not with this pattern, and it never slows down by 10x or anything close. That said, we haven't been logging the times yet, so we haven't inspected it very closely.
I have also experienced this issue. Have you resolved it? I am using 8 A100 40GB GPUs with training settings TP=4, PP=2:
iteration 10/ 600 | consumed samples: 640 | elapsed time per iteration (ms): 84606.2 | learning rate: 8.000E-06 |
This problem never appeared in our training runs, and with the newly reported issue #60, it is probably because we always use a >1. We will investigate some more, but in the meantime please use
I experienced this when I was using a different nvcr Docker image than the one in the user guide; the problem went away when switching to
Testing with Llama 2 7B and different settings for `--tensor_model_parallel_size` and `--pipeline_model_parallel_size`, I noticed that the elapsed time per iteration increases linearly for both `TP=2, PP=1` and `TP=1, PP=2`, while it stays almost flat for `TP=2, PP=2`. (Please ignore the 'lima' in the names of the runs; both used the same LIMA dropout parameters, and I also tried without LIMA dropout, which made no difference, so the dropout method is unrelated to the increase in iteration time.)
This is the log of a `TP=2, PP=1` run (notice how "elapsed time per iteration (ms)" increases):