
In the middle of the training process, at iteration 2100, it shows this error #5

Closed
tilahun12 opened this issue May 8, 2022 · 24 comments

@tilahun12

(screenshot: the error reported at iteration 2100)

@wymanCV
Contributor

wymanCV commented May 8, 2022

Is this from the VGG-16 backbone? And did you change the batch size?

@tilahun12
Author

Yes, it throws a CUDA OUT OF MEMORY error, so I changed the batch size to 1.

@tilahun12
Author

The backbone is R-50-FPN-RETINANET.

@wymanCV
Contributor

wymanCV commented May 8, 2022

> Yes, it throws a CUDA OUT OF MEMORY error, so I changed the batch size to 1.

Since the cross-image graph-based message propagation (within a batch) is necessary, the batch size should be set to at least 2. We tested batch sizes 2 and 4. Did you change the learning rate for bs=1?

@tilahun12
Author

I didn't change the learning rate, but it still throws a CUDA out of memory error with a batch size of 2.

@wymanCV
Contributor

wymanCV commented May 8, 2022

> I didn't change the learning rate, but it still throws a CUDA out of memory error with a batch size of 2.

We used a 2080 Ti (12 GB) for bs=2 and a V100 (36 GB) for bs=4, and never tried bs=1.

It is common practice to halve the learning rate when you halve the batch size, so for now you can try halving it. We will test bs=1 further if you still face this problem, but we still don't recommend training with bs=1.
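A minimal sketch of that rule in the YAML config (the 0.0025 lr / bs=2 pairing is the one quoted later in this thread; placing BASE_LR under SOLVER is an assumption based on the usual config layout):

SOLVER:
  BASE_LR: 0.00125   # halved from the bs=2 value of 0.0025 when training with bs=1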

@tilahun12
Author

At the beginning it starts well, but at some point in the iterations it shows that error. I will try reducing the learning rate.

@tilahun12 tilahun12 reopened this May 9, 2022
@tilahun12
Author

'CUDA out of memory' even with a learning rate of 0.0005.

@wymanCV
Contributor

wymanCV commented May 9, 2022

> 'CUDA out of memory' even with a learning rate of 0.0005.

It seems that your GPU memory is too small.

Try further reducing the number of sampled nodes by changing the following in the YAML config file. (The number of sampled nodes can increase during training.)

NUM_NODES_PER_LVL_SR: 50
NUM_NODES_PER_LVL_TG: 50

Reduce the node number until the CUDA out of memory error no longer appears, although this may have some negative impact on performance.
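For example (a sketch; 20 is just an arbitrary smaller value, and the keys stay wherever they already sit in your YAML config):

NUM_NODES_PER_LVL_SR: 20   # reduced from 50
NUM_NODES_PER_LVL_TG: 20   # reduced from 50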

@tilahun12
Author

This is the GPU I am using. So can I reduce NUM_NODES_PER_LVL_SR and NUM_NODES_PER_LVL_TG to any number?
(screenshot: the GPU specs; an 8 GB card, per the reply below)

@wymanCV
Contributor

wymanCV commented May 9, 2022

Actually, an 8 GB GPU is a little small for detection tasks.
Sure, you can try any number of nodes, but you'd better not reduce it too much, as shown in Table 4.

@tilahun12
Author

Okay. And isn't there a checkpoint? It starts from scratch every time I restart it, even though it completed many iterations before.

@wymanCV
Contributor

wymanCV commented May 9, 2022

> Okay. And isn't there a checkpoint? It starts from scratch every time I restart it, even though it completed many iterations before.

We automatically start saving checkpoints once the validation result exceeds SOLVER.INITIAL_AP50, to save disk space. You can change SOLVER.INITIAL_AP50 to 0 to save more checkpoints.
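For example, in the YAML config (a sketch; the SOLVER nesting follows the dotted key quoted above):

SOLVER:
  INITIAL_AP50: 0   # save checkpoints regardless of the current validation AP50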

@tilahun12
Author

Let me try applying your suggestions. This issue will stay open until the process finishes.

@tilahun12
Author

Thank you. I will re-open this if an issue is encountered.

@wymanCV
Contributor

wymanCV commented May 9, 2022

Hi, I have reproduced your issue, and it should be addressed in the latest commit.

Since your bs is too small (bs=1), there is an extreme case in which there are only two nodes in the source domain and no nodes in the target domain. SIGMA then splits the source nodes into two parts to train the matching branch, leading to a wrongly sized target-node tensor of [256] instead of [num_node, 256].

(screenshot: the error traceback showing the mismatched node-tensor size)

We fixed this bug by adding these lines, which skip the middle head entirely if there are not enough source nodes.

(screenshot: the lines added in graph_matching_head.py)

Add these lines. Then you can try keeping the original learning rate to train faster; otherwise it will take too long to train the model with bs=1.

@wymanCV
Contributor

wymanCV commented May 10, 2022

> Thank you. I will re-open this if an issue is encountered.

We have updated the README with notes on small-batch-size training for your convenience. The ResNet-50 backbone always gives better results than VGG-16.

@tilahun12
Author

Oh, sorry for the late reply. I see; I'll check out the updates. But now regarding the checkpoints: after more than 24 hours of training there was unfortunately a power interruption, and when I restarted the training it started from scratch and showed the same estimated remaining time as the original run, even though it saved a checkpoint at each step. Here is a screenshot of it, and I also show the saved models in the GIF file. Please kindly check it out.
(screenshots of the restarted training log and a GIF of the saved checkpoint files)

@tilahun12 tilahun12 reopened this May 10, 2022
@wymanCV
Contributor

wymanCV commented May 10, 2022

Hi, that's okay, since the framework will automatically load the latest checkpoint. You can ignore the INFO message; it comes from EPM, which isn't used in our project. You can simply continue training the model and set the warm-up iterations to 0. It seems to work properly now; if you face the previous issue again, you only need to add the lines mentioned above.

I recommend trying to change the learning rate back to 0.0025 to train faster, as I find your model converges too slowly with only bs=1. As noted in the updated README, you need to train for double the iterations if you halve the batch size. Usually, for bs=2 with ResNet-50 (0.0025 lr), it can reach 40+ mAP using only 10,000 iterations.
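A rough sketch of those two changes in the YAML config (the 0.0025 lr and the double-the-iterations rule are quoted above; MAX_ITER is an assumed key name for the iteration budget, so adapt it to your config):

SOLVER:
  BASE_LR: 0.0025   # back to the original learning rate
  MAX_ITER: 20000   # roughly double the 10,000-iteration bs=2 budget when training with bs=1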

@tilahun12
Author

Noted with thanks. So I think I don't have to re-download the repo: the update is in the file 'graph_matching_head.py', plus changing the learning rate to 0.0025 and the batch size to 2. So I can just replace graph_matching_head.py, right?

@wymanCV
Contributor

wymanCV commented May 10, 2022

> Noted with thanks. So I think I don't have to re-download the repo: the update is in the file 'graph_matching_head.py', plus changing the learning rate to 0.0025 and the batch size to 2. So I can just replace graph_matching_head.py, right?

Yes, you only need to replace graph_matching_head.py and change BASE_LR in the YAML config file.

@tilahun12
Author

Dear sir, the 'CUDA out of memory' problem still persists after ten thousand iterations, even though I applied the recommendations provided, so I changed it back to the original. Is there any other recommendation, please?

@tilahun12 tilahun12 reopened this May 11, 2022
@wymanCV
Contributor

wymanCV commented May 12, 2022

> Dear sir, the 'CUDA out of memory' problem still persists after ten thousand iterations, even though I applied the recommendations provided, so I changed it back to the original. Is there any other recommendation, please?

Hi, maybe you can disable the one-to-one (o2o) matching by setting MODEL.MIDDLE_HEAD.GM.MATCHING_CFG to 'none', which will save a lot of CUDA memory. Please try this setting first, thanks!

Besides, we have added some solutions for limited GPU memory to the latest README. Kindly give them a try.
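For example, in the YAML config (a sketch; the nesting follows the dotted key quoted above):

MODEL:
  MIDDLE_HEAD:
    GM:
      MATCHING_CFG: 'none'   # disable one-to-one matching to save CUDA memory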

@tilahun12
Author

Okay, thanks. I will try it.
