Error on using multi GPU setting #37

Closed
saswat0 opened this issue Dec 7, 2022 · 2 comments · Fixed by #44

Comments

Contributor

saswat0 commented Dec 7, 2022

Hey @aosokin,
I tried running training on the provided dataset and hit this error during testing:

TypeError: zip argument #1 must support iteration

Here's how I'm running the training:

python trainval_net.py --mGPUs --cuda --dataset grozi-train --dataset_val grozi-val-new-cl --init_weights /home/user/exp/os2d/baselines/CoAE/experiments/../../../models/resnet101-5d3b4d8f.pth --disp_interval 1 --val_interval 10 --nw 4 --bs 8 --s 1 --epochs 2000 --lr_decay_milestones 1000 1500 --lr 0.01 --lr_decay_gamma 0.1 --lr_reload_best_after_decay True --save_dir /home/user/exp/os2d/baselines/CoAE/output/grozi/coae.0.res101_initPytorch_query192_scale900_ms --net res101 --set DATA_DIR /home/user/exp/os2d/baselines/CoAE/data TRAIN.MAX_SIZE 3000 TEST.MAX_SIZE 3000 TRAIN.query_size 192 TRAIN.SCALES [450,562,720,900,1080,1260,1440] TEST.SCALES [900]

Full trace of error:

Traceback (most recent call last):
  File "trainval_net.py", line 504, in <module>
    mAP = test(args_val, model=fasterRCNN)
  File "/home/user/exp/os2d/baselines/CoAE/test_net.py", line 177, in test
    rois_label, weight = fasterRCNN(im_data, q, im_info, gt_boxes, catgory)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/user/miniconda3/envs/os2d/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

How do I handle this issue? The error doesn't come up when the --mGPUs flag is off.
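
A note on the trace above: the two nested gather_map frames suggest that DataParallel was merging the per-GPU outputs and reached a value that is neither a tensor nor a container. One plausible cause (an assumption, not confirmed in this thread) is that the model returns plain Python numbers such as 0 for some outputs in evaluation mode. The sketch below is a simplified pure-Python stand-in for the recursion shown in the trace, not PyTorch's actual code; FakeTensor and the sample outputs are hypothetical.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, values):
        self.values = list(values)


def gather_map(outputs):
    # Simplified model of the recursion in the trace: `outputs` holds one
    # entry per replica (GPU).
    out = outputs[0]
    if isinstance(out, FakeTensor):
        # Tensors are merged across replicas (here: plain concatenation).
        merged = []
        for t in outputs:
            merged.extend(t.values)
        return FakeTensor(merged)
    if out is None:
        return None
    # Anything else is assumed to be a container and is unpacked with zip.
    return type(out)(map(gather_map, zip(*outputs)))


# Two replicas that return only tensors inside a tuple: gathering works.
ok = gather_map([(FakeTensor([1]),), (FakeTensor([2]),)])

# Two replicas that return (tensor, 0): the int 0 ends up in zip(0, 0),
# which raises TypeError because an int is not iterable.
try:
    gather_map([(FakeTensor([1]), 0), (FakeTensor([2]), 0)])
    failed = False
except TypeError:
    failed = True
```

If this is indeed the cause, returning tensors instead of bare numbers from the model's forward pass in evaluation mode would let the gather step succeed. The exact wording of the TypeError differs between Python versions.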

Owner

aosokin commented Dec 7, 2022

Hi, we've never tried running the CoAE code with multiple GPUs. Please refer to the original implementation for that functionality: https://github.com/timy90022/One-Shot-Object-Detection

Contributor Author

saswat0 commented Dec 8, 2022

The authors' implementation doesn't run validation in its training script, so this case has gone unhandled there. I'll look through some forums to resolve this instead. Thanks.
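
One possible workaround, sketched under assumptions rather than taken from the fix referenced at the top of this issue: since the error goes away without --mGPUs, validation could be run on the single underlying model. torch.nn.DataParallel keeps the wrapped network in its .module attribute; the helper name unwrap_model and the dummy classes below are hypothetical and only there so the sketch runs without PyTorch.

```python
def unwrap_model(model):
    """Return the wrapped network if `model` looks like a DataParallel wrapper.

    torch.nn.DataParallel stores the original network in `.module`;
    a model without that attribute is returned unchanged.
    """
    return getattr(model, "module", model)


class DummyNet:
    """Hypothetical stand-in for the detection network."""


class DummyParallel:
    """Hypothetical stand-in for torch.nn.DataParallel."""

    def __init__(self, module):
        self.module = module


net = DummyNet()
wrapped = DummyParallel(net)

# Both the wrapped and the plain model resolve to the same network.
plain_from_wrapped = unwrap_model(wrapped)
plain_from_plain = unwrap_model(net)
```

In the training script this would amount to something like test(args_val, model=unwrap_model(fasterRCNN)), at the cost of validating on a single GPU.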
