
iteration-time increases linearly (for TP=2, PP=1 & TP=1, PP=2) #22

Closed
andreaskoepf opened this issue Aug 10, 2023 · 8 comments · Fixed by #62

Comments

@andreaskoepf
Contributor

andreaskoepf commented Aug 10, 2023

While testing Llama 2 7B with different settings for --tensor_model_parallel_size and --pipeline_model_parallel_size, I noticed that the elapsed time per iteration increases linearly for both TP=2, PP=1 and TP=1, PP=2, while it stays almost flat for TP=2, PP=2.

[Figure: elapsed time per iteration vs. training step for the different runs; it grows roughly linearly for TP=2, PP=1 and TP=1, PP=2 and stays almost flat for TP=2, PP=2]

(Please ignore the 'lima' in the run names; both runs used the same LIMA dropout parameters, and I also tried without LIMA dropout, which made no difference. The dropout method is unrelated to the increase in iteration time.)

This is the log of a TP=2, PP=1 run (notice how "elapsed time per iteration (ms)" increases):

 iteration        5/    2000 | consumed samples:          320 | elapsed time per iteration (ms): 11075.6 | learning rate: 5.000E-07 | global batch size:    64 | lm loss: 1.793634E+00 | loss scale: 1.0 | grad norm: 6.716 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       10/    2000 | consumed samples:          640 | elapsed time per iteration (ms): 14642.7 | learning rate: 1.000E-06 | global batch size:    64 | lm loss: 2.205030E+00 | loss scale: 1.0 | grad norm: 12.834 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       15/    2000 | consumed samples:          960 | elapsed time per iteration (ms): 22576.9 | learning rate: 1.500E-06 | global batch size:    64 | lm loss: 1.436895E+00 | loss scale: 1.0 | grad norm: 5.681 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       20/    2000 | consumed samples:         1280 | elapsed time per iteration (ms): 30626.8 | learning rate: 2.000E-06 | global batch size:    64 | lm loss: 1.524862E+00 | loss scale: 1.0 | grad norm: 5.253 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       25/    2000 | consumed samples:         1600 | elapsed time per iteration (ms): 38109.3 | learning rate: 2.500E-06 | global batch size:    64 | lm loss: 1.534147E+00 | loss scale: 1.0 | grad norm: 6.936 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       30/    2000 | consumed samples:         1920 | elapsed time per iteration (ms): 46237.4 | learning rate: 3.000E-06 | global batch size:    64 | lm loss: 1.294282E+00 | loss scale: 1.0 | grad norm: 3.555 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       35/    2000 | consumed samples:         2240 | elapsed time per iteration (ms): 53443.8 | learning rate: 3.500E-06 | global batch size:    64 | lm loss: 1.346382E+00 | loss scale: 1.0 | grad norm: 3.461 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       40/    2000 | consumed samples:         2560 | elapsed time per iteration (ms): 60785.3 | learning rate: 4.000E-06 | global batch size:    64 | lm loss: 1.323138E+00 | loss scale: 1.0 | grad norm: 2.622 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       45/    2000 | consumed samples:         2880 | elapsed time per iteration (ms): 66910.9 | learning rate: 4.500E-06 | global batch size:    64 | lm loss: 1.446500E+00 | loss scale: 1.0 | grad norm: 4.450 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       50/    2000 | consumed samples:         3200 | elapsed time per iteration (ms): 78089.7 | learning rate: 5.000E-06 | global batch size:    64 | lm loss: 1.180950E+00 | loss scale: 1.0 | grad norm: 2.712 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       55/    2000 | consumed samples:         3520 | elapsed time per iteration (ms): 88459.2 | learning rate: 5.500E-06 | global batch size:    64 | lm loss: 1.315969E+00 | loss scale: 1.0 | grad norm: 2.561 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       60/    2000 | consumed samples:         3840 | elapsed time per iteration (ms): 93434.8 | learning rate: 6.000E-06 | global batch size:    64 | lm loss: 1.255166E+00 | loss scale: 1.0 | grad norm: 2.143 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       65/    2000 | consumed samples:         4160 | elapsed time per iteration (ms): 100035.0 | learning rate: 6.500E-06 | global batch size:    64 | lm loss: 1.182611E+00 | loss scale: 1.0 | grad norm: 2.422 | number of skipped iterations:   0 | number of nan iterations:   0 |

@martinjaggi
Contributor

feedback from anonymous expert ;)

I would use the occasion to add profiling decorators to all functions that are expected to be time consuming; depending on a flag, each decorator would then profile the function (number of calls, time spent, etc.).

That will help you find the culprit.

Dumping the state of the CUDA allocator from time to time is also interesting.
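
For reference, here is a minimal sketch of what such a profiling decorator and periodic allocator dump could look like (the PROFILE flag and the helper names are hypothetical, not part of this repository):

```python
# Minimal sketch, not repository code: PROFILE, profiled and
# dump_allocator_state are hypothetical names used for illustration.
import functools
import time
from collections import defaultdict

import torch

PROFILE = True  # hypothetical flag; could be wired to a CLI argument
_stats = defaultdict(lambda: {"calls": 0, "total_s": 0.0})

def profiled(fn):
    """Accumulate call count and wall-clock time for fn when PROFILE is set."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not PROFILE:
            return fn(*args, **kwargs)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = _stats[fn.__qualname__]
            entry["calls"] += 1
            entry["total_s"] += time.perf_counter() - start
    return wrapper

def dump_allocator_state(iteration, interval=50):
    """Print the CUDA caching-allocator state every `interval` iterations."""
    if torch.cuda.is_available() and iteration % interval == 0:
        print(f"--- CUDA allocator state at iteration {iteration} ---")
        print(torch.cuda.memory_summary(abbreviated=True))
```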

@AleHD
Collaborator

AleHD commented Aug 13, 2023

That's interesting; we haven't seen that before on our side. Could it be related to the memory requirements? What GPUs are you using? What is your micro batch size? Does it also occur with TP=4, PP=1 or TP=1, PP=4?

@andreaskoepf
Contributor Author

I tried on gpu009 with micro batch size 2 and TP=4, PP=1 and again saw the strange effect. It is remarkable that you can run all your training without problems ;-) The main difference is that I used my local_changes branch, but at the moment I don't know what in it could cause this. You can run variable-length training with the official instruction-tuning branch with TP=4, PP=1 without problems, right?

@AleHD
Collaborator

AleHD commented Aug 29, 2023

Yes, we haven't experienced this issue at all. I have observed that the iteration time fluctuates a bit, but not with this pattern, and it never slows down by 10x or anything close. That said, we haven't been logging the times, so we haven't inspected it very closely.

@LlinWing

LlinWing commented Sep 5, 2023

I have also experienced this issue. Have you resolved it?

I am using 8 A100 40GB GPUs with training settings TP=4, PP=2.
Additionally, I have noticed that GPU memory remains unchanged while CPU memory consumption increases linearly, so there might be a memory leak (see the monitoring sketch after the log excerpt below).

iteration 10/ 600 | consumed samples: 640 | elapsed time per iteration (ms): 84606.2 | learning rate: 8.000E-06 |
iteration 15/ 600 | consumed samples: 960 | elapsed time per iteration (ms): 142474.1 | learning rate: 1.200E-05 |
iteration 20/ 600 | consumed samples: 1280 | elapsed time per iteration (ms): 202438.6 | learning rate: 1.600E-05 |
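
One way to confirm the suspected host-side leak is to log the process resident set size next to the iteration time. A rough sketch (psutil is an extra dependency, and the helper and its call site in the training loop are hypothetical):

```python
# Rough sketch for tracking CPU-side memory growth; log_host_memory and its
# placement in the training loop are hypothetical, not repository code.
import os

import psutil

def log_host_memory(iteration, log_interval=5):
    """Print this process's resident set size every `log_interval` iterations."""
    if iteration % log_interval != 0:
        return
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"iteration {iteration:>6} | host RSS: {rss_gb:.2f} GB")
```

If the RSS keeps growing while torch.cuda.memory_allocated() stays flat, that would support the memory-leak hypothesis.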

@martinjaggi
Contributor

This problem never appeared in our training runs; given the newly reported issue #60, that is probably because we always use a micro_batch_size > 1.

We will investigate some more, but in the meantime please use micro_batch_size > 1.

@IgorVasiljevic-TRI

I experienced this when I was using a different nvcr Docker image than the one in the user guide; the problem went away after switching to nvcr.io/nvidia/pytorch:23.07-py3.

AleHD linked a pull request on Sep 6, 2023 that will close this issue.
@martinjaggi
Contributor

Closing, as #60 narrows it down further and #62 provides the fix.
