How many GPUs will be used for training the code? #24

Open
xiaofeng-c opened this issue Nov 9, 2021 · 9 comments

Comments

@xiaofeng-c

Thank you for your work. When using your code for training, how many GPUs are needed?

@xiaofeng-c
Author

When I run the training code with 2 GPUs, the following problem occurs:
num classes: 15
2021-11-09 21:32:31 epoch 20/353, processed 291080 samples, lr 0.000333
291144: nGT 155, recall 127, proposals 324, loss: x 4.715607, y 5.864799, w 4.005206, h 3.525136, conf 130.468964, cls 392.400330, class_contrast 1.089527, total 542.069580
291208: nGT 144, recall 129, proposals 356, loss: x 3.931613, y 5.137510, w 6.525736, h 2.192330, conf 89.379707, cls 200.923706, class_contrast 1.220589, total 309.311188
Traceback (most recent call last):
File "tool/train_decoupling_disturbance.py", line 403, in
train(epoch,repeat_time,mask_ratio)
File "tool/train_decoupling_disturbance.py", line 280, in train
output, dynamic_weights = model(data, metax_disturbance, mask_disturbance)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/dio/VSST/zwm/YOLO_Meta/CME-main/tool/darknet/darknet_decoupling.py", line 223, in forward
x = self.detect_forward(x, dynamic_weights)
File "/home/dio/VSST/zwm/YOLO_Meta/CME-main/tool/darknet/darknet_decoupling.py", line 175, in detect_forward
x = self.models[ind]((x, dynamic_weights[dynamic_cnt]))
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/dio/VSST/anaconda3/envs/pytorch1.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, kwargs)
File "/home/dio/VSST/zwm/YOLO_Meta/CME-main/core/dynamic_conv.py", line 163, in forward
self.padding, self.dilation, groups)
RuntimeError: CUDA out of memory. Tried to allocate 422.00 MiB (GPU 0; 10.76 GiB total capacity; 9.72 GiB already allocated; 179.69 MiB free; 84.55 MiB cached)

Do I need to use 4 GPUs to train?

@Bohao-Lee
Owner

In my experiments, I used 2 GPUs for training. You can use more GPUs or reduce the batch size, but that may affect the results.
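
For reference, both suggestions are ordinary PyTorch changes: the traceback above already goes through torch.nn.DataParallel, so using more GPUs means giving that wrapper more devices, and lowering per-GPU memory means shrinking the DataLoader batch size. Below is a minimal sketch with placeholder names (the model, dataset, and the batch size of 32 are illustrative), not the actual CME training script:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and dataset -- substitute the real CME model and dataset.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
dataset = TensorDataset(torch.randn(64, 3, 416, 416))

# Replicate the model over every visible GPU (e.g. 4 instead of 2).
device_ids = list(range(torch.cuda.device_count()))
model = nn.DataParallel(model, device_ids=device_ids).cuda()

# And/or cut the batch size so each replica gets a smaller chunk (was 64).
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for (batch,) in loader:
    output = model(batch.cuda())  # the batch is split across the GPUs
    break
```

Each replica only sees roughly batch_size / len(device_ids) samples per step, which is what lowers the per-GPU memory footprint.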

@xiaofeng-c
Author

OK, thank you very much!

@chenrxi

chenrxi commented Nov 10, 2021

I used four 1080 Tis and reduced the batch size from 64 to 32 during fine-tuning, and the result is not as good as the one reported in the paper: the mAP on VOC split 1 is only 0.385 versus 0.475 in the paper. @Bohao-Lee

@Bohao-Lee
Owner

I have not tried the reduced batch-size setting before, but I can reproduce the performance on two 3080 GPUs.

@chenrxi

chenrxi commented Nov 10, 2021

Thanks, I will try again

@Jxt5671

Jxt5671 commented Dec 9, 2021

I also encountered this error: RuntimeError: CUDA out of memory. Tried to allocate 422.00 MiB (GPU 0; 10.76 GiB total capacity; 9.72 GiB already allocated; 179.69 MiB free; 84.55 MiB cached). But I only have two 2080 Tis, so what should I do? Reduce the batch_size? @xiaofeng-c @Bohao-Lee

@Bohao-Lee
Owner

Maybe reducing the batch size can help, but it may affect performance. @Jxt5671
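
While tuning the batch size, it can also help to see how close each card is to its limit when the OOM occurs. Here is a small hedged helper (not part of the CME code) using standard torch.cuda calls, sketched for the PyTorch 1.1 / Python 3.6 environment shown in the traceback:

```python
import torch

def report_gpu_memory():
    """Print per-GPU allocated/cached memory to judge how much headroom
    remains before lowering the batch size further."""
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        allocated = torch.cuda.memory_allocated(i)
        cached = torch.cuda.memory_cached(i)  # renamed memory_reserved in newer PyTorch
        print(f"GPU {i}: {allocated / 2**20:.0f} MiB allocated, "
              f"{cached / 2**20:.0f} MiB cached, "
              f"{total / 2**30:.2f} GiB total")

# Call this right before the forward pass that runs out of memory,
# e.g. inside the training loop of tool/train_decoupling_disturbance.py.
report_gpu_memory()
```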

@Jxt5671

Jxt5671 commented Dec 20, 2021

I use two 3080 GPUs, but I also encountered this error: RuntimeError: CUDA out of memory. Tried to allocate 422.00 MiB. Could you please tell me which CUDA, torch, and Python versions you use?
Besides, I have also tried reducing the batch size to 32. There is no problem with base training, but there is still a CUDA out-of-memory error during fine-tuning. Should I reduce the batch size to 16? @Bohao-Lee
