
iteration-time increases linearly (for TP=2, PP=1 & TP=1, PP=2) #22

Closed
andreaskoepf opened this issue Aug 10, 2023 · 8 comments · Fixed by #62

Comments

@andreaskoepf
Contributor

andreaskoepf commented Aug 10, 2023

While testing Llama 2 7B with different settings for --tensor_model_parallel_size and --pipeline_model_parallel_size, I noticed that the elapsed time per iteration increases linearly for both TP=2, PP=1 and TP=1, PP=2, while it stays almost flat for TP=2, PP=2.

[Figure: elapsed time per iteration vs. training step for the different runs; it grows roughly linearly for TP=2, PP=1 and TP=1, PP=2 and stays almost flat for TP=2, PP=2]

(Please ignore the 'lima' in the run names; both runs used the same LIMA dropout parameters, and I also tried without LIMA dropout, which made no difference. The dropout method is unrelated to the increase in iteration time.)

This is the log of a TP=2, PP=1 run (notice how "elapsed time per iteration (ms)" increases):

 iteration        5/    2000 | consumed samples:          320 | elapsed time per iteration (ms): 11075.6 | learning rate: 5.000E-07 | global batch size:    64 | lm loss: 1.793634E+00 | loss scale: 1.0 | grad norm: 6.716 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       10/    2000 | consumed samples:          640 | elapsed time per iteration (ms): 14642.7 | learning rate: 1.000E-06 | global batch size:    64 | lm loss: 2.205030E+00 | loss scale: 1.0 | grad norm: 12.834 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       15/    2000 | consumed samples:          960 | elapsed time per iteration (ms): 22576.9 | learning rate: 1.500E-06 | global batch size:    64 | lm loss: 1.436895E+00 | loss scale: 1.0 | grad norm: 5.681 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       20/    2000 | consumed samples:         1280 | elapsed time per iteration (ms): 30626.8 | learning rate: 2.000E-06 | global batch size:    64 | lm loss: 1.524862E+00 | loss scale: 1.0 | grad norm: 5.253 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       25/    2000 | consumed samples:         1600 | elapsed time per iteration (ms): 38109.3 | learning rate: 2.500E-06 | global batch size:    64 | lm loss: 1.534147E+00 | loss scale: 1.0 | grad norm: 6.936 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       30/    2000 | consumed samples:         1920 | elapsed time per iteration (ms): 46237.4 | learning rate: 3.000E-06 | global batch size:    64 | lm loss: 1.294282E+00 | loss scale: 1.0 | grad norm: 3.555 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       35/    2000 | consumed samples:         2240 | elapsed time per iteration (ms): 53443.8 | learning rate: 3.500E-06 | global batch size:    64 | lm loss: 1.346382E+00 | loss scale: 1.0 | grad norm: 3.461 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       40/    2000 | consumed samples:         2560 | elapsed time per iteration (ms): 60785.3 | learning rate: 4.000E-06 | global batch size:    64 | lm loss: 1.323138E+00 | loss scale: 1.0 | grad norm: 2.622 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       45/    2000 | consumed samples:         2880 | elapsed time per iteration (ms): 66910.9 | learning rate: 4.500E-06 | global batch size:    64 | lm loss: 1.446500E+00 | loss scale: 1.0 | grad norm: 4.450 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       50/    2000 | consumed samples:         3200 | elapsed time per iteration (ms): 78089.7 | learning rate: 5.000E-06 | global batch size:    64 | lm loss: 1.180950E+00 | loss scale: 1.0 | grad norm: 2.712 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       55/    2000 | consumed samples:         3520 | elapsed time per iteration (ms): 88459.2 | learning rate: 5.500E-06 | global batch size:    64 | lm loss: 1.315969E+00 | loss scale: 1.0 | grad norm: 2.561 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       60/    2000 | consumed samples:         3840 | elapsed time per iteration (ms): 93434.8 | learning rate: 6.000E-06 | global batch size:    64 | lm loss: 1.255166E+00 | loss scale: 1.0 | grad norm: 2.143 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       65/    2000 | consumed samples:         4160 | elapsed time per iteration (ms): 100035.0 | learning rate: 6.500E-06 | global batch size:    64 | lm loss: 1.182611E+00 | loss scale: 1.0 | grad norm: 2.422 | number of skipped iterations:   0 | number of nan iterations:   0 |

@martinjaggi
Contributor

feedback from anonymous expert ;)

I would use the occasion to add profiling decorators to all functions that are expected to be time consuming; depending on a flag, each decorator would then profile the function (number of calls, time spent, etc.).

That will help you find the culprit.

Dumping the state of the CUDA allocator from time to time is also interesting.
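
For reference, here is a minimal sketch of what such a profiling decorator and periodic allocator dump could look like (the PROFILE flag and the helper names are hypothetical, not part of this repository):

```python
# Minimal sketch, not repository code: PROFILE, profiled and
# dump_allocator_state are hypothetical names used for illustration.
import functools
import time
from collections import defaultdict

import torch

PROFILE = True  # hypothetical flag; could be wired to a CLI argument
_stats = defaultdict(lambda: {"calls": 0, "total_s": 0.0})

def profiled(fn):
    """Accumulate call count and wall-clock time for fn when PROFILE is set."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not PROFILE:
            return fn(*args, **kwargs)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = _stats[fn.__qualname__]
            entry["calls"] += 1
            entry["total_s"] += time.perf_counter() - start
    return wrapper

def dump_allocator_state(iteration, interval=50):
    """Print the CUDA caching-allocator state every `interval` iterations."""
    if torch.cuda.is_available() and iteration % interval == 0:
        print(f"--- CUDA allocator state at iteration {iteration} ---")
        print(torch.cuda.memory_summary(abbreviated=True))
```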

@AleHD
Collaborator

AleHD commented Aug 13, 2023

That's interesting; we haven't seen that before on our side. Could it be related to the memory requirements? What GPUs are you using? What is your micro batch size? Does it also occur with TP=4, PP=1 or TP=1, PP=4?

@andreaskoepf
Contributor Author

I tried on gpu009 with micro batch size 2 and TP=4, PP=1 and again saw the strange effect. It is remarkable that you can run all your training without problems ;-) The main difference is that I used my local_changes branch, but at the moment I don't know what in it could cause this. You can run variable-length training with the official instruction-tuning branch with TP=4, PP=1 without problems, right?

@AleHD
Collaborator

AleHD commented Aug 29, 2023

Yes, we haven't experienced this issue at all. I have observed that the iteration time fluctuates a bit, but not with this pattern, and it never slows down by 10x or anything close. That said, we haven't been logging the times, so we haven't inspected it very closely.

@LlinWing

LlinWing commented Sep 5, 2023

I have also experienced this issue. Have you resolved it?

I am using 8 A100 40GB GPUs with training settings TP=4, PP=2.
Additionally, I have noticed that GPU memory remains unchanged while CPU memory consumption increases linearly, so there might be a memory leak (see the monitoring sketch after the log excerpt below).

iteration 10/ 600 | consumed samples: 640 | elapsed time per iteration (ms): 84606.2 | learning rate: 8.000E-06 |
iteration 15/ 600 | consumed samples: 960 | elapsed time per iteration (ms): 142474.1 | learning rate: 1.200E-05 |
iteration 20/ 600 | consumed samples: 1280 | elapsed time per iteration (ms): 202438.6 | learning rate: 1.600E-05 |
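
One way to confirm the suspected host-side leak is to log the process resident set size next to the iteration time. A rough sketch (psutil is an extra dependency, and the helper and its call site in the training loop are hypothetical):

```python
# Rough sketch for tracking CPU-side memory growth; log_host_memory and its
# placement in the training loop are hypothetical, not repository code.
import os

import psutil

def log_host_memory(iteration, log_interval=5):
    """Print this process's resident set size every `log_interval` iterations."""
    if iteration % log_interval != 0:
        return
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"iteration {iteration:>6} | host RSS: {rss_gb:.2f} GB")
```

If the RSS keeps growing while torch.cuda.memory_allocated() stays flat, that would support the memory-leak hypothesis.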

@martinjaggi
Contributor

This problem never appeared in our training runs; given the newly reported issue #60, that is probably because we always use a micro_batch_size > 1.

We will investigate some more, but in the meantime please use micro_batch_size > 1.

@IgorVasiljevic-TRI

I experienced this when I was using a different nvcr Docker image than the one in the user guide; the problem went away after switching to nvcr.io/nvidia/pytorch:23.07-py3.

AleHD linked a pull request on Sep 6, 2023 that will close this issue.
@martinjaggi
Contributor

Closing, as #60 narrows it down further and #62 provides the fix.
