Train detr3d_vovnet_train exceed the memory of 4*RTX3090 #21

synsin0 · 2022-03-09T10:26:22Z

Environment: 4xRTX3090.
Failure: training detr3d with the resnet101 backbone uses about 21 GB of memory on each card. Training detr3d with the vovnet backbone exceeds the memory limit. image_per_gpu is set to 1.
I read in your paper that your experiments used 8xRTX3090. How should I adapt my training setup?

a1600012888 · 2022-03-09T21:10:04Z

Hi synsin0.
The vovnet backbone is too large to fit in the memory of a 3090.
If you want to make it fit on a 3090, you can try the following:

Use fp16 (mixed-precision) training.
Use memory checkpointing, as described in "Training Deep Nets with Sublinear Memory Cost". PyTorch provides a checkpoint implementation, torch.utils.checkpoint.checkpoint; see https://pytorch.org/docs/stable/checkpoint.html?highlight=checkpoint
Freeze some layers of VoVNet, e.g. the first stage.
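The three suggestions above can be sketched together in plain PyTorch. This is only an illustration under stated assumptions, not the DETR3D training code: ToyBackbone, freeze and train_step are hypothetical names, the tiny convolutional model merely stands in for a staged backbone such as VoVNet, and the use_reentrant argument assumes PyTorch 1.11 or newer.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyBackbone(nn.Module):
    """Hypothetical stand-in for a staged backbone (NOT the real VoVNet)."""

    def __init__(self, channels=8, num_stages=3, use_checkpoint=True):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(num_stages)
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            if self.use_checkpoint and self.training:
                # Memory checkpointing: drop this stage's activations after the
                # forward pass and recompute them during backward.
                x = checkpoint(stage, x, use_reentrant=False)
            else:
                x = stage(x)
        return x


def freeze(module):
    """Stop gradient computation (and optimizer updates) for a sub-module."""
    # Note: frozen BatchNorm layers would also need .eval() to stop updating
    # their running statistics; this toy model has none.
    for p in module.parameters():
        p.requires_grad_(False)


def train_step(model, optimizer, scaler, images, device):
    """One mixed-precision step; fp16 autocast is only enabled on CUDA."""
    use_amp = device.type == "cuda"
    amp_dtype = torch.float16 if use_amp else torch.bfloat16
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
        # Placeholder loss, only there to drive a backward pass.
        loss = model(images.to(device)).float().mean()
    scaler.scale(loss).backward()  # loss scaling guards against fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = ToyBackbone().to(device)
    freeze(model.stem)        # freeze the stem ...
    freeze(model.stages[0])   # ... and the first stage
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.01)
    scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")
    print(train_step(model, optimizer, scaler, torch.randn(2, 3, 32, 32), device))
```

Checkpointing trades extra compute for memory, while freezing early stages removes both their gradients and their optimizer state, so the two can be combined. How much memory each option saves for the real vovnet config would have to be measured.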

cgl-cell · 2022-10-15T09:38:09Z

Environment: 4xRTX3090. Failure: train detr3d with resnet101 backbone dominates each card with 21GB memory. Train detr3d with vovnet backbone exceeds the memory limit. image_per_gpu is set to 1. I read from your paper that your experiment uses 8xRTX3090. How should I adjust for adaption of my training process?

Have you solved it?

a1600012888 mentioned this issue Mar 9, 2022

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 71517) #19

Open
