Warning: Moving average ignored a value of inf #359

Open
sdimantsd opened this issue Feb 26, 2020 · 60 comments · May be fixed by #360
Comments

@sdimantsd

sdimantsd commented Feb 26, 2020

Hi, I'm trying to train YOLACT to detect cars with images from COCO.
I took all of the images with cars in them and made a dataset from them.
My config looks like this:
```python
only_cars_coco2017_dataset = dataset_base.copy({
    'name': 'cars COCO 2017',

    # Training images and annotations
    'train_info': '/home/ws/data/COCO/only_cars_train.json',
    'train_images': '/home/ws/data/COCO/train/train2017/',

    # Validation images and annotations.
    'valid_info': '/home/ws/data/COCO/only_cars_val.json',
    'valid_images': '/home/ws/data/COCO/val/val2017/',

    # Note: a single-element tuple needs a trailing comma; ('car') is just the string 'car'.
    'class_names': ('car',),
    'label_map': {1: 1}
})

yolact_im200_coco_cars_config = yolact_base_config.copy({
    'name': 'yolact_im200_coco_cars',

    # Dataset stuff
    'dataset': only_cars_coco2017_dataset,
    'num_classes': len(only_cars_coco2017_dataset.class_names) + 1,

    'masks_to_train': 20,
    'max_num_detections': 20,
    'max_size': 200,
    'backbone': yolact_base_config.backbone.copy({
        'pred_scales': [[int(x[0] / yolact_base_config.max_size * 200)] for x in yolact_base_config.backbone.pred_scales],
    }),
})
```

After a few iterations, my loss goes very high...

Can someone help me with this?

Update:
Also, if I train with the full COCO dataset I get the same error...

@sdimantsd
Author

Update:
It happens with the YOLACT++ version.
The same config and dataset work fine with YOLACT (without the ++).

@jasonkena jasonkena linked a pull request Feb 27, 2020 that will close this issue
@jasonkena

I just made a pull request that should fix the inf errors; you can try merging it locally.

@sdimantsd
Author

Thanks! I will try it next week :)

@sdimantsd
Author

@jasonkena What is the difference between @dbolya's repo and your repo?

@jasonkena

For the most part, I added support for Apex's AMP. One of its features is dynamic loss scaling, so your losses will never overflow. Apex also supports 16-bit precision, so that's a plus.

To enable it, change use_amp in config.py to True.
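
For reference, here is a minimal sketch of what enabling AMP looks like with Apex in a generic PyTorch training loop (illustrative only; the fork already wires this into train.py, and the helper names below are placeholders):

```python
from apex import amp  # NVIDIA Apex


def train_with_amp(model, optimizer, data_loader, compute_loss):
    # Sketch of Apex AMP mixed-precision training; "O1" is the mixed-precision level.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    for images, targets in data_loader:
        optimizer.zero_grad()
        loss = compute_loss(model(images), targets)
        # Dynamic loss scaling: the loss is scaled before backward() so FP16
        # gradients do not overflow; overflowing steps are skipped and the
        # scale is reduced automatically.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
```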

@sdimantsd
Author

OK, thanks.
Why doesn't @dbolya use it?
@jasonkena, are you on dbolya's team?

@jasonkena

Yeah, I just added the pull request about 40 minutes ago, so he might not have read it yet.

Hahaha, I'm not part of his team; I'm just doing this in my spare time.

@sdimantsd
Author

Hahaha, I hope I can do that one day...
Thanks

@sdimantsd
Author

@jasonkena, one more question.
If I started training with YOLACT (not YOLACT++), should I merge from your repo and continue training?

@jasonkena

Yes, that should work, but back up your weights in case anything happens.

@sdimantsd
Author

OK, but will it help my training?

@jasonkena

Yes, your weights shouldn't explode

@sdimantsd
Author

lol
thx

@sdimantsd
Author

@jasonkena,
I'm training with your fork, but I keep getting this error/warning:
Gradient overflow...

@jasonkena

jasonkena commented Mar 3, 2020

Did you set use_amp to True in config.py?
The Gradient Overflow warning is ok, as long as the loss scaler doesn't become 0. The warning means that it is scaling the loss, so it doesn't become infinite.
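
To make the log messages easier to read, here is a conceptual sketch of how dynamic loss scaling reacts to an overflow (this mirrors what the "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to ..." lines report; it is not Apex's actual implementation):

```python
def dynamic_loss_scale_step(loss_scale, grads_overflowed, steps_since_overflow, growth_interval=2000):
    """Conceptual sketch of dynamic loss scaling, not Apex internals."""
    if grads_overflowed:
        # Skip the optimizer step and halve the scale -- exactly what the warning reports.
        return loss_scale / 2.0, False
    if steps_since_overflow >= growth_interval:
        # After enough clean steps, try a larger scale again.
        loss_scale *= 2.0
    return loss_scale, True  # apply the optimizer step
```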

@sdimantsd
Author

Yes

@jasonkena

Does it work? Can you send me a screenshot?

@sdimantsd
Author

sdimantsd commented Mar 3, 2020

```
[ 1] 1440 || B: 5.495 | C: 3.668 | M: 6.919 | S: 1.564 | T: 17.647 || ETA: 197 days, 23:50:39 || timer: 0.298
[ 1] 1450 || B: 5.514 | C: 3.519 | M: 6.986 | S: 1.640 | T: 17.658 || ETA: 197 days, 3:40:56 || timer: 0.291
[ 1] 1460 || B: 5.534 | C: 3.423 | M: 7.028 | S: 1.677 | T: 17.662 || ETA: 197 days, 22:16:54 || timer: 1.842
[ 1] 1470 || B: 5.511 | C: 3.289 | M: 7.060 | S: 1.603 | T: 17.464 || ETA: 197 days, 13:59:05 || timer: 0.328
[ 1] 1480 || B: 5.505 | C: 3.218 | M: 7.148 | S: 1.514 | T: 17.384 || ETA: 197 days, 15:42:25 || timer: 1.852
[ 1] 1490 || B: 5.494 | C: 3.176 | M: 7.190 | S: 1.386 | T: 17.245 || ETA: 197 days, 1:47:05 || timer: 0.303
[ 1] 1500 || B: 5.505 | C: 3.123 | M: 7.184 | S: 1.254 | T: 17.066 || ETA: 197 days, 1:48:12 || timer: 1.197
[ 1] 1510 || B: 5.515 | C: 3.088 | M: 7.223 | S: 1.129 | T: 16.955 || ETA: 196 days, 13:00:18 || timer: 0.310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
[ 1] 1520 || B: 5.535 | C: 3.031 | M: 8.000 | S: 1.054 | T: 17.619 || ETA: 196 days, 23:30:26 || timer: 2.222
[ 1] 1530 || B: 5.557 | C: 2.994 | M: 8.920 | S: 0.971 | T: 18.442 || ETA: 196 days, 14:56:04 || timer: 0.321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
[ 1] 1540 || B: 5.597 | C: 2.949 | M: 21.387 | S: 0.881 | T: 30.813 || ETA: 196 days, 14:42:04 || timer: 0.286
[ 1] 1550 || B: 5.622 | C: 2.922 | M: 37.121 | S: 0.804 | T: 46.469 || ETA: 196 days, 12:09:21 || timer: 0.292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
[ 1] 1560 || B: 5.623 | C: 2.873 | M: 55.813 | S: 0.749 | T: 65.058 || ETA: 196 days, 9:30:19 || timer: 0.550
[ 1] 1570 || B: 5.619 | C: 2.851 | M: 73.614 | S: 0.724 | T: 82.808 || ETA: 196 days, 9:33:39 || timer: 0.296
[ 1] 1580 || B: 5.629 | C: 2.835 | M: 92.922 | S: 0.704 | T: 102.090 || ETA: 196 days, 6:45:42 || timer: 2.162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
[ 1] 1590 || B: 5.653 | C: 2.812 | M: 110.329 | S: 0.695 | T: 119.489 || ETA: 195 days, 21:04:57 || timer: 0.323
[ 1] 1600 || B: 5.685 | C: 2.799 | M: 127.568 | S: 0.679 | T: 136.732 || ETA: 195 days, 23:16:05 || timer: 0.306
[ 1] 1610 || B: 5.674 | C: 2.791 | M: 143.285 | S: 0.651 | T: 152.400 || ETA: 196 days, 6:41:42 || timer: 0.301
[ 1] 1620 || B: 5.635 | C: 2.779 | M: 159.871 | S: 0.625 | T: 168.909 || ETA: 195 days, 18:32:50 || timer: 0.281
[ 1] 1630 || B: 5.642 | C: 2.775 | M: 176.858 | S: 0.627 | T: 185.902 || ETA: 196 days, 1:03:51 || timer: 0.313
[ 1] 1640 || B: 5.665 | C: 2.775 | M: 184.345 | S: 0.645 | T: 193.430 || ETA: 195 days, 12:20:02 || timer: 1.755
[ 1] 1650 || B: 5.683 | C: 2.774 | M: 188.442 | S: 0.637 | T: 197.537 || ETA: 195 days, 20:48:08 || timer: 0.954
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
[ 1] 1660 || B: 5.685 | C: 2.775 | M: 189.106 | S: 0.636 | T: 198.203 || ETA: 195 days, 21:43:24 || timer: 0.859
[ 1] 1670 || B: 5.712 | C: 2.774 | M: 188.970 | S: 0.629 | T: 198.086 || ETA: 196 days, 0:34:48 || timer: 2.194
[ 1] 1680 || B: 5.740 | C: 2.775 | M: 189.250 | S: 0.629 | T: 198.395 || ETA: 195 days, 14:22:12 || timer: 0.339
[ 1] 1690 || B: 5.714 | C: 2.777 | M: 187.945 | S: 0.631 | T: 197.067 || ETA: 195 days, 8:43:53 || timer: 2.327
[ 1] 1700 || B: 5.663 | C: 2.778 | M: 185.333 | S: 0.638 | T: 194.412 || ETA: 194 days, 19:34:24 || timer: 0.311
[ 1] 1710 || B: 5.686 | C: 2.779 | M: 187.430 | S: 0.638 | T: 196.533 || ETA: 194 days, 11:48:10 || timer: 1.437
[ 1] 1720 || B: 5.697 | C: 2.779 | M: 186.670 | S: 0.641 | T: 195.788 || ETA: 194 days, 6:37:38 || timer: 0.316
[ 1] 1730 || B: 5.682 | C: 2.779 | M: 186.148 | S: 0.650 | T: 195.259 || ETA: 194 days, 7:36:17 || timer: 0.339
[ 1] 1740 || B: 5.645 | C: 2.779 | M: 183.526 | S: 0.637 | T: 192.587 || ETA: 194 days, 8:39:09 || timer: 1.945
[ 1] 1750 || B: 5.640 | C: 2.780 | M: 182.646 | S: 0.656 | T: 191.722 || ETA: 194 days, 7:13:58 || timer: 0.335
[ 1] 1760 || B: 5.663 | C: 2.779 | M: 179.543 | S: 0.653 | T: 188.637 || ETA: 194 days, 17:29:50 || timer: 1.747
[ 1] 1770 || B: 5.640 | C: 2.779 | M: 179.012 | S: 0.656 | T: 188.087 || ETA: 194 days, 8:49:16 || timer: 0.316
[ 1] 1780 || B: 5.643 | C: 2.779 | M: 180.269 | S: 0.654 | T: 189.346 || ETA: 194 days, 11:36:07 || timer: 0.297
[ 1] 1790 || B: 5.672 | C: 2.778 | M: 182.306 | S: 0.652 | T: 191.408 || ETA: 194 days, 7:13:17 || timer: 0.299
[ 1] 1800 || B: 5.698 | C: 2.776 | M: 185.035 | S: 0.637 | T: 194.145 || ETA: 194 days, 6:00:46 || timer: 0.285

```

@jasonkena

It should work fine, unless the loss scaler becomes something ridiculous like 1e-40. But if that happens, I'm afraid you'll have to follow the solutions from the other issues:
#318 (comment)

@sdimantsd
Author

OK, thx

@sdimantsd
Author

Currently the loss keeps going up (it started at ~7 and is now ~180).

@jasonkena

The total loss, right?

@sdimantsd
Author

Nope, it's ~200

@sdimantsd
Author

The 'T' loss, right?

@jasonkena

Yes

@sdimantsd
Author

It's ~200

@jasonkena

Yeah, you shouldn't be surprised. Unfortunately, the loss scaler makes the loss readings inaccurate because it multiplies the loss by a factor, so you shouldn't compare losses between different "Gradient Overflow" warnings. If it still doesn't converge, I'm guessing it's either your batch size or learning rate.

@sdimantsd
Author

I didn't change the learning rate, and I'm using a batch size of 32 with 2 GPUs. Is that OK?

@jasonkena

Sorry, I can't help.

@sdimantsd
Author

Thx.

@sdimantsd sdimantsd reopened this Mar 8, 2020
@sdimantsd
Author

@dbolya anything new?

@Auth0rM0rgan

Auth0rM0rgan commented Mar 20, 2020

Hey @jasonkena, @sdimantsd

I wanted to ask about the performance after training with Apex's AMP. Did you get better accuracy, or did it speed up your training process?

Also, I'm curious whether training the model with 16-bit precision is going to impact the inference time. (I mean, if I train with 16-bit precision, am I going to achieve higher FPS? I have achieved ~25 FPS on 1080p video with 32-bit precision.)

Thanks

@jasonkena

@Auth0rM0rgan to be honest, I haven't done any performance/accuracy benchmarks, so I can't say anything for sure. But theoretically, it should improve training time, since 16-bit computation is faster. As for memory consumption, using 16-bit precision saves 1 GB of VRAM with a batch size of 4.

The benchmark should be pretty straightforward, since the AMP branch is compatible with master: you just need to enable use_amp in config.py, then you can run the tests in eval.py just as you did in 32-bit.

@Auth0rM0rgan

Hey @jasonkena,

I'm going to train the model with 16-bit precision and will let you know about the performance. I hope to see an improvement in inference time as well.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 23, 2020

Hey @jasonkena,
I have tried to use your code but I'm getting this error:

Traceback (most recent call last):
  File "train.py", line 696, in <module>
    train()
  File "train.py", line 281, in train
    yolact_net = Yolact()
  File "/home/yolact-amp/yolact.py", line 530, in __init__
    self.backbone = construct_backbone(cfg.backbone, cfg.use_amp)
  File "/home/yolact-amp/backbone.py", line 548, in construct_backbone
    set_amp(use_amp)
NameError: name 'set_amp' is not defined

I fixed the error by importing set_amp like this: from external.DCNv2.dcn_v2 import set_amp

After fixing the error, the model starts to train, but sometimes during training I get Gradient overflow. Is that normal when using AMP?

[ 0] 0 || B: 5.955 | C: 24.126 | M: 5.992 | S: 65.320 | T: 101.393 || ETA: 11 days, 18:19:45 || timer: 2.382
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
[ 0] 10 || B: 6.035 | C: 21.918 | M: 5.816 | S: 57.595 | T: 91.365 || ETA: 4 days, 11:15:02 || timer: 0.747
[ 0] 20 || B: 5.654 | C: 19.253 | M: 5.818 | S: 41.033 | T: 71.758 || ETA: 4 days, 2:41:50 || timer: 0.728
[ 0] 30 || B: 5.576 | C: 17.321 | M: 5.953 | S: 28.553 | T: 57.403 || ETA: 3 days, 23:35:57 || timer: 0.755
[ 0] 40 || B: 5.473 | C: 15.529 | M: 5.935 | S: 22.000 | T: 48.938 || ETA: 3 days, 22:01:17 || timer: 0.748
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
[ 0] 50 || B: 5.403 | C: 14.210 | M: 5.927 | S: 18.069 | T: 43.609 || ETA: 3 days, 21:10:25 || timer: 0.745
[ 0] 60 || B: 5.399 | C: 13.137 | M: 5.981 | S: 15.367 | T: 39.884 || ETA: 3 days, 20:37:17 || timer: 0.776

Thanks

@jasonkena

jasonkena commented Mar 23, 2020

Nice catch!

The Gradient Overflow warning is ok, as long as the loss scaler doesn't become 0. The warning means that it is scaling the loss, so it doesn't become infinite.

Yup, it's perfectly normal; it's Apex AMP's dynamic loss scaling doing its magic.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 23, 2020

@jasonkena Have you tried your code with YOLACT++? It seems the code works fine with YOLACT but not with YOLACT++. I'm getting this error when using the YOLACT++ config file. No idea how to fix it :|

Traceback (most recent call last):
  File "train.py", line 696, in <module>
    train()
  File "train.py", line 347, in train
    yolact_net(torch.zeros(1, 3, cfg.max_size, cfg.max_size).cuda())
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact-amp/yolact.py", line 725, in forward
    outs = self.backbone(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/yolact-amp/backbone.py", line 221, in forward
    x = layer(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home//anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact-amp/backbone.py", line 78, in forward
    out = self.conv2(out)
  File "/home/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact/external/DCNv2/dcn_v2.py", line 128, in forward
    self.deformable_groups)
  File "/home/yolact/external/DCNv2/dcn_v2.py", line 31, in forward
    ctx.deformable_groups)
RuntimeError: expected scalar type Float but found Half (data_ptr at /home/anaconda3/lib/python3.7/site-packages/torch/include/ATen/core/TensorMethods.h:6321)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7fdd22d32627 in /home/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: float at::Tensor::data_ptr() const + 0xf6 (0x7fdd07657b8a in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #2: float at::Tensor::data() const + 0x18 (0x7fdd0765b26a in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #3: dcn_v2_cuda_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0xd48 (0x7fdd076518fd in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #4: dcn_v2_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0x91 (0x7fdd0763f721 in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #5: + 0x36cdb (0x7fdd0764ccdb in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #6: + 0x3351c (0x7fdd0764951c in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
omitting python frames
frame #11: THPFunction_apply(_object, _object) + 0xa0f (0x7fdd559a7a3f in /home/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

Thanks

@jasonkena

Hmm, it seems like you haven't recompiled the DCNv2 module since you switched to my branch.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 23, 2020

I recompiled the DCNv2 module when I switched to your branch, and when I do it again, it says DCNv2 is already installed:

arous@DeepLearning:~/yolact-amp/external/DCNv2$ python setup.py build develop
running build
running build_ext
running develop
running egg_info
writing DCNv2.egg-info/PKG-INFO
writing dependency_links to DCNv2.egg-info/dependency_links.txt
writing top-level names to DCNv2.egg-info/top_level.txt
reading manifest file 'DCNv2.egg-info/SOURCES.txt'
writing manifest file 'DCNv2.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.7/_ext.cpython-37m-x86_64-linux-gnu.so ->
Creating /home/arous/anaconda3/lib/python3.7/site-packages/DCNv2.egg-link (link to .)
DCNv2 0.1 is already the active version in easy-install.pth
Installed /home/arous/yolact-amp/external/DCNv2
Processing dependencies for DCNv2==0.1
Finished processing dependencies for DCNv2==0.1

@jasonkena

You have to delete all the build files before you compile: _ext.cpython*, DCNv2.egg-info/, and build/.

@Auth0rM0rgan

I did that, but I'm still getting the same error :|

@jasonkena

Sorry, I just realized something about the error you mentioned here.
Can you try removing the line you added (from external.DCNv2.dcn_v2 import set_amp), and at the beginning, replace

try:
    from dcn_v2 import DCN, set_amp
except ImportError:

    def DCN(*args, **kwdargs):
        raise Exception(
            "DCN could not be imported. If you want to use YOLACT++ models, compile DCN. Check the README for instructions."
        )

with just from dcn_v2 import DCN, set_amp, so that if the import fails, it raises an error instead?

@Auth0rM0rgan

If I replace the line that I added (from external.DCNv2.dcn_v2 import set_amp) with

try:
    from dcn_v2 import DCN, set_amp
except ImportError:

    def DCN(*args, **kwdargs):
        raise Exception(
            "DCN could not be imported. If you want to use YOLACT++ models, compile DCN. Check the README for instructions."
        )

I'm getting the error name 'set_amp' is not defined. However, if I import DCNv2 like this: from external.DCNv2.dcn_v2 import DCN, set_amp, the code works on YOLACT++ as well :)

Thanks

@jasonkena

jasonkena commented Mar 24, 2020

The reason it works with YOLACT, even though the import fails, is that YOLACT doesn't use DCNv2 at all.
Now, the NameError shows up because the ImportError is swallowed by the try-except block, so can you remove the try-except block and replace from external.DCNv2.dcn_v2 import DCN, set_amp with just from dcn_v2 import DCN, set_amp, so that the ImportError shows up?

I cannot reproduce your error running fresh code on the branch. Can you push all your code to GitHub, so I can diff the changes?

@Auth0rM0rgan

Yes, ImportError: cannot import name 'set_amp' from 'dcn_v2' shows up when I remove the try-except block and use from dcn_v2 import DCN, set_amp. I have to import it as from external.DCNv2.dcn_v2 import DCN, set_amp to be able to run the code.

@jasonkena

jasonkena commented Mar 26, 2020

Sorry, I don't know where the problem is.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 29, 2020

Hey @jasonkena,

as long as the loss scaler doesn't become 0.

Sometimes the I (mask IoU) loss becomes 0 when 'use_amp' is True. Is that OK?

[  0]    1840 || B: 4.008 | C: 6.122 | M: 5.581 | S: 1.371 | I: 0.000 | T: 17.082 || ETA: 6 days, 16:00:06 || timer: 0.439
[  0]    1850 || B: 4.008 | C: 6.092 | M: 5.572 | S: 1.333 | I: 0.000 | T: 17.005 || ETA: 6 days, 16:01:49 

@jasonkena

I'm not sure; it may be that the Mask-Rescoring network has fully converged (but this is unlikely).

But usually, I just disable the Mask-Rescoring loss.
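
For anyone who wants to do the same, here is a minimal sketch of disabling the Mask-Rescoring (mask IoU) loss, assuming the YOLACT++ config exposes the use_maskiou flag:

```python
# Hedged sketch: assumes config.py defines yolact_plus_base_config and a
# 'use_maskiou' option controlling the Mask-Rescoring head and its "I" loss term.
my_yolact_plus_config = yolact_plus_base_config.copy({
    'name': 'yolact_plus_no_maskiou',
    'use_maskiou': False,  # drop the Mask-Rescoring loss entirely
})
```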

@Auth0rM0rgan

Auth0rM0rgan commented Mar 29, 2020

@jasonkena Also, I'm getting a KeyError for 'I' during training when 'use_amp' is True, from this line:

yolact/train.py, line 168 in 092554a:

out[k] = torch.stack([output[k].to(output_device) for output in outputs])

I can get rid of it with a try-except.
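
For context, a hedged sketch of what that try-except workaround could look like around the gather step (key names and structure are illustrative, not the exact train.py code):

```python
import torch

def gather_losses(outputs, output_device):
    # Skip loss keys that are missing from some replicas instead of raising KeyError.
    out = {}
    for k in outputs[0]:
        try:
            out[k] = torch.stack([output[k].to(output_device) for output in outputs])
        except KeyError:
            continue  # e.g. the "I" (mask IoU) term may be absent on some steps
    return out
```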

What would be the impact of disabling the Mask-Rescoring loss on the model's performance? Is it going to hurt the performance?

@jasonkena

Can you try cloning my branch into a completely new directory? @sdimantsd and I didn't get any of your errors running it out of the box.

According to the YOLACT++ paper, the Mask-Rescoring loss improves the performance by 1 mAP.

@Auth0rM0rgan

Auth0rM0rgan commented Apr 1, 2020

Hey @jasonkena, I'm getting this error during testing with eval.py. AMP needs to be set up inside eval.py as well.

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

@jasonkena

Again, please clone my branch from scratch. Neither I nor sdimantsd can reproduce your problem.

You need to install conda for this.
Here are the complete instructions. Follow all of them:

  1. git clone -b amp https://github.com/jasonkena/yolact/
  2. git clone https://github.com/NVIDIA/apex
  3. cd apex and pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ and cd ../yolact
  4. Rename name within environment.yml, then conda env create -f environment.yml (to create a new clean environment)
  5. cd external/DCNv2 and python setup.py build develop
  6. Change use_amp to True in config.py
  7. Set up the rest of the config

@Rm1n90

Rm1n90 commented May 28, 2020

Hey @jasonkena,
I've trained my model with AMP without any problem, but like @Auth0rM0rgan, when trying to evaluate the model on a webcam I'm getting RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same. I've followed the instructions you wrote. Do you know how to solve the problem?
Thanks!
Thanks!

@jasonkena

Can you give the whole traceback?

@Rm1n90

Rm1n90 commented May 28, 2020

Loading model... Done.
Initializing model... Traceback (most recent call last):
  File "eval.py", line 1456, in <module>
    evaluate(net, dataset)
  File "eval.py", line 1191, in evaluate
    evalvideo(net, args.video)
  File "eval.py", line 1079, in evalvideo
    first_batch = eval_network(transform_frame(get_next_frame(vid)))
  File "eval.py", line 961, in eval_network
    out = net(imgs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/yolact.py", line 725, in forward
    outs = self.backbone(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/backbone.py", line 219, in forward
    x = layer(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/backbone.py", line 80, in forward
    out = self.conv3(out)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

I'm not getting this error if I set use_amp=False during eval.

@jasonkena

jasonkena commented May 28, 2020

Sorry @Auth0rM0rgan, I believe you were right. I did not initialize amp within eval.py, which is why the problem only showed up during inference.

@Rm1n90, to fix it I believe you have to add

if args.cuda:
    net = net.cuda()
if cfg.use_amp:
    from apex import amp

    if not args.cuda:
        raise ValueError("amp must be used with CUDA")
    net = amp.initialize(net, opt_level="O1")

before net = CustomDataParallel(net).cuda() (https://github.com/jasonkena/yolact/blob/e1a949445dc0c57eb7c8f10470630faff0ce22e2/eval.py#L913)

I haven't tested it; can you tell me how it turns out?

@Rm1n90

Rm1n90 commented May 28, 2020

@jasonkena, thanks, eval is now working with AMP.
