Warning: Moving average ignored a value of inf #359

Open
sdimantsd opened this issue Feb 26, 2020 · 60 comments · May be fixed by #360
Comments

@sdimantsd

sdimantsd commented Feb 26, 2020

Hi, I'm trying to train YOLACT to detect cars with images from COCO.
I took all of the images with cars in them and made a dataset from them.
My config looks like this:
```python
only_cars_coco2017_dataset = dataset_base.copy({
    'name': 'cars COCO 2017',

    # Training images and annotations
    'train_info': '/home/ws/data/COCO/only_cars_train.json',
    'train_images': '/home/ws/data/COCO/train/train2017/',

    # Validation images and annotations.
    'valid_info': '/home/ws/data/COCO/only_cars_val.json',
    'valid_images': '/home/ws/data/COCO/val/val2017/',

    # Note: a single-element tuple needs a trailing comma; ('car') is just the string 'car'.
    'class_names': ('car',),
    'label_map': {1: 1}
})

yolact_im200_coco_cars_config = yolact_base_config.copy({
    'name': 'yolact_im200_coco_cars',

    # Dataset stuff
    'dataset': only_cars_coco2017_dataset,
    'num_classes': len(only_cars_coco2017_dataset.class_names) + 1,

    'masks_to_train': 20,
    'max_num_detections': 20,
    'max_size': 200,
    'backbone': yolact_base_config.backbone.copy({
        'pred_scales': [[int(x[0] / yolact_base_config.max_size * 200)] for x in yolact_base_config.backbone.pred_scales],
    }),
})
```

After a few iterations, my loss goes very high...

Can someone help me with this?

Update:
Also, if I train with the full COCO dataset I get the same error...

@sdimantsd
Author

Update:
It happens with the YOLACT++ version.
The same config and dataset work fine with YOLACT (without the ++).

@jasonkena jasonkena linked a pull request Feb 27, 2020 that will close this issue
@jasonkena

I just made a pull request that should fix the inf errors; you can try merging it locally.

@sdimantsd
Author

Thanks! I will try it next week :)

@sdimantsd
Author

@jasonkena What is the difference between @dbolya's repo and your repo?

@jasonkena

For the most part, I added support for Apex's AMP. One of its features is dynamic loss scaling, so your losses will never overflow. Apex also supports 16-bit precision, so that's a plus.

To enable it, change use_amp in config.py to True.
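
For reference, here is a minimal sketch of what enabling AMP looks like with Apex in a generic PyTorch training loop (illustrative only; the fork already wires this into train.py, and the helper names below are placeholders):

```python
from apex import amp  # NVIDIA Apex


def train_with_amp(model, optimizer, data_loader, compute_loss):
    # Sketch of Apex AMP mixed-precision training; "O1" is the mixed-precision level.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    for images, targets in data_loader:
        optimizer.zero_grad()
        loss = compute_loss(model(images), targets)
        # Dynamic loss scaling: the loss is scaled before backward() so FP16
        # gradients do not overflow; overflowing steps are skipped and the
        # scale is reduced automatically.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
```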

@sdimantsd
Author

OK, thanks.
Why doesn't @dbolya use it?
@jasonkena, are you on dbolya's team?

@jasonkena

Yeah, I just added the pull request about 40 minutes ago, so he might not have read it yet.

Hahaha, I'm not part of his team; I'm just doing this in my spare time.

@sdimantsd
Author

Hahaha, I hope I can do that one day...
Thanks

@sdimantsd
Author

@jasonkena, one more question.
If I started training with YOLACT (not YOLACT++), should I merge from your repo and continue training?

@jasonkena

Yes, that should work, but back up your weights in case anything happens.

@sdimantsd
Author

OK, but will it help my training?

@jasonkena

Yes, your weights shouldn't explode

@sdimantsd
Author

lol
thx

@sdimantsd
Author

@jasonkena,
I'm training with your fork, but I keep getting this error/warning:
Gradient overflow...

@jasonkena

jasonkena commented Mar 3, 2020

Did you set use_amp to True in config.py?
The Gradient Overflow warning is ok, as long as the loss scaler doesn't become 0. The warning means that it is scaling the loss, so it doesn't become infinite.
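
To make the log messages easier to read, here is a conceptual sketch of how dynamic loss scaling reacts to an overflow (this mirrors what the "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to ..." lines report; it is not Apex's actual implementation):

```python
def dynamic_loss_scale_step(loss_scale, grads_overflowed, steps_since_overflow, growth_interval=2000):
    """Conceptual sketch of dynamic loss scaling, not Apex internals."""
    if grads_overflowed:
        # Skip the optimizer step and halve the scale -- exactly what the warning reports.
        return loss_scale / 2.0, False
    if steps_since_overflow >= growth_interval:
        # After enough clean steps, try a larger scale again.
        loss_scale *= 2.0
    return loss_scale, True  # apply the optimizer step
```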

@sdimantsd
Author

Yes

@jasonkena

Does it work? Can you send me a screenshot?

@sdimantsd
Author

sdimantsd commented Mar 3, 2020

```
[ 1] 1440 || B: 5.495 | C: 3.668 | M: 6.919 | S: 1.564 | T: 17.647 || ETA: 197 days, 23:50:39 || timer: 0.298
[ 1] 1450 || B: 5.514 | C: 3.519 | M: 6.986 | S: 1.640 | T: 17.658 || ETA: 197 days, 3:40:56 || timer: 0.291
[ 1] 1460 || B: 5.534 | C: 3.423 | M: 7.028 | S: 1.677 | T: 17.662 || ETA: 197 days, 22:16:54 || timer: 1.842
[ 1] 1470 || B: 5.511 | C: 3.289 | M: 7.060 | S: 1.603 | T: 17.464 || ETA: 197 days, 13:59:05 || timer: 0.328
[ 1] 1480 || B: 5.505 | C: 3.218 | M: 7.148 | S: 1.514 | T: 17.384 || ETA: 197 days, 15:42:25 || timer: 1.852
[ 1] 1490 || B: 5.494 | C: 3.176 | M: 7.190 | S: 1.386 | T: 17.245 || ETA: 197 days, 1:47:05 || timer: 0.303
[ 1] 1500 || B: 5.505 | C: 3.123 | M: 7.184 | S: 1.254 | T: 17.066 || ETA: 197 days, 1:48:12 || timer: 1.197
[ 1] 1510 || B: 5.515 | C: 3.088 | M: 7.223 | S: 1.129 | T: 16.955 || ETA: 196 days, 13:00:18 || timer: 0.310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
[ 1] 1520 || B: 5.535 | C: 3.031 | M: 8.000 | S: 1.054 | T: 17.619 || ETA: 196 days, 23:30:26 || timer: 2.222
[ 1] 1530 || B: 5.557 | C: 2.994 | M: 8.920 | S: 0.971 | T: 18.442 || ETA: 196 days, 14:56:04 || timer: 0.321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
[ 1] 1540 || B: 5.597 | C: 2.949 | M: 21.387 | S: 0.881 | T: 30.813 || ETA: 196 days, 14:42:04 || timer: 0.286
[ 1] 1550 || B: 5.622 | C: 2.922 | M: 37.121 | S: 0.804 | T: 46.469 || ETA: 196 days, 12:09:21 || timer: 0.292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
[ 1] 1560 || B: 5.623 | C: 2.873 | M: 55.813 | S: 0.749 | T: 65.058 || ETA: 196 days, 9:30:19 || timer: 0.550
[ 1] 1570 || B: 5.619 | C: 2.851 | M: 73.614 | S: 0.724 | T: 82.808 || ETA: 196 days, 9:33:39 || timer: 0.296
[ 1] 1580 || B: 5.629 | C: 2.835 | M: 92.922 | S: 0.704 | T: 102.090 || ETA: 196 days, 6:45:42 || timer: 2.162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
[ 1] 1590 || B: 5.653 | C: 2.812 | M: 110.329 | S: 0.695 | T: 119.489 || ETA: 195 days, 21:04:57 || timer: 0.323
[ 1] 1600 || B: 5.685 | C: 2.799 | M: 127.568 | S: 0.679 | T: 136.732 || ETA: 195 days, 23:16:05 || timer: 0.306
[ 1] 1610 || B: 5.674 | C: 2.791 | M: 143.285 | S: 0.651 | T: 152.400 || ETA: 196 days, 6:41:42 || timer: 0.301
[ 1] 1620 || B: 5.635 | C: 2.779 | M: 159.871 | S: 0.625 | T: 168.909 || ETA: 195 days, 18:32:50 || timer: 0.281
[ 1] 1630 || B: 5.642 | C: 2.775 | M: 176.858 | S: 0.627 | T: 185.902 || ETA: 196 days, 1:03:51 || timer: 0.313
[ 1] 1640 || B: 5.665 | C: 2.775 | M: 184.345 | S: 0.645 | T: 193.430 || ETA: 195 days, 12:20:02 || timer: 1.755
[ 1] 1650 || B: 5.683 | C: 2.774 | M: 188.442 | S: 0.637 | T: 197.537 || ETA: 195 days, 20:48:08 || timer: 0.954
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
[ 1] 1660 || B: 5.685 | C: 2.775 | M: 189.106 | S: 0.636 | T: 198.203 || ETA: 195 days, 21:43:24 || timer: 0.859
[ 1] 1670 || B: 5.712 | C: 2.774 | M: 188.970 | S: 0.629 | T: 198.086 || ETA: 196 days, 0:34:48 || timer: 2.194
[ 1] 1680 || B: 5.740 | C: 2.775 | M: 189.250 | S: 0.629 | T: 198.395 || ETA: 195 days, 14:22:12 || timer: 0.339
[ 1] 1690 || B: 5.714 | C: 2.777 | M: 187.945 | S: 0.631 | T: 197.067 || ETA: 195 days, 8:43:53 || timer: 2.327
[ 1] 1700 || B: 5.663 | C: 2.778 | M: 185.333 | S: 0.638 | T: 194.412 || ETA: 194 days, 19:34:24 || timer: 0.311
[ 1] 1710 || B: 5.686 | C: 2.779 | M: 187.430 | S: 0.638 | T: 196.533 || ETA: 194 days, 11:48:10 || timer: 1.437
[ 1] 1720 || B: 5.697 | C: 2.779 | M: 186.670 | S: 0.641 | T: 195.788 || ETA: 194 days, 6:37:38 || timer: 0.316
[ 1] 1730 || B: 5.682 | C: 2.779 | M: 186.148 | S: 0.650 | T: 195.259 || ETA: 194 days, 7:36:17 || timer: 0.339
[ 1] 1740 || B: 5.645 | C: 2.779 | M: 183.526 | S: 0.637 | T: 192.587 || ETA: 194 days, 8:39:09 || timer: 1.945
[ 1] 1750 || B: 5.640 | C: 2.780 | M: 182.646 | S: 0.656 | T: 191.722 || ETA: 194 days, 7:13:58 || timer: 0.335
[ 1] 1760 || B: 5.663 | C: 2.779 | M: 179.543 | S: 0.653 | T: 188.637 || ETA: 194 days, 17:29:50 || timer: 1.747
[ 1] 1770 || B: 5.640 | C: 2.779 | M: 179.012 | S: 0.656 | T: 188.087 || ETA: 194 days, 8:49:16 || timer: 0.316
[ 1] 1780 || B: 5.643 | C: 2.779 | M: 180.269 | S: 0.654 | T: 189.346 || ETA: 194 days, 11:36:07 || timer: 0.297
[ 1] 1790 || B: 5.672 | C: 2.778 | M: 182.306 | S: 0.652 | T: 191.408 || ETA: 194 days, 7:13:17 || timer: 0.299
[ 1] 1800 || B: 5.698 | C: 2.776 | M: 185.035 | S: 0.637 | T: 194.145 || ETA: 194 days, 6:00:46 || timer: 0.285

```

@jasonkena

It should work fine, unless the loss scaler becomes something ridiculous like 1e-40. But if that happens, I'm afraid you'll have to follow the solutions from the other issues:
#318 (comment)

@sdimantsd
Author

OK, thx

@sdimantsd
Author

Currently the loss keeps going up (it started at ~7 and is now ~180).

@jasonkena

The total loss, right?

@sdimantsd
Author

Nope, it's ~200

@sdimantsd
Author

The 'T' loss, right?

@jasonkena

Yes

@sdimantsd
Author

It's ~200

@jasonkena

Yeah, you shouldn't be surprised. Unfortunately, the loss scaler makes the loss readings inaccurate because it multiplies the loss by a factor, so you shouldn't compare losses between different "Gradient Overflow" warnings. If it still doesn't converge, I'm guessing it's either your batch size or learning rate.

@sdimantsd
Author

I didn't change the learning rate, and I'm using a batch size of 32 with 2 GPUs. Is that OK?

@jasonkena

Sorry, I can't help.

@sdimantsd
Author

Thx.

@sdimantsd sdimantsd reopened this Mar 8, 2020
@sdimantsd
Author

@dbolya anything new?

@Auth0rM0rgan

Auth0rM0rgan commented Mar 20, 2020

Hey @jasonkena, @sdimantsd

I wanted to ask about the performance after training with Apex's AMP. Did you get better accuracy, or did it speed up your training process?

Also, I'm curious whether training the model with 16-bit precision is going to impact the inference time. (I mean, if I train with 16-bit precision, am I going to achieve higher FPS? I have achieved ~25 FPS on 1080p video with 32-bit precision.)

Thanks

@jasonkena

@Auth0rM0rgan to be honest, I haven't done any performance/accuracy benchmarks, so I can't say anything for sure. But theoretically, it should improve training time, since 16-bit computation is faster. As for memory consumption, using 16-bit precision saves 1 GB of VRAM with a batch size of 4.

The benchmark should be pretty straightforward, since the AMP branch is compatible with master: you just need to enable use_amp in config.py, then you can run the tests in eval.py just as you did in 32-bit.

@Auth0rM0rgan

Hey @jasonkena,

I'm going to train the model with 16-bit precision and will let you know about the performance. I hope to see an improvement in inference time as well.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 23, 2020

Hey @jasonkena,
I have tried to use your code but I'm getting this error:

Traceback (most recent call last):
  File "train.py", line 696, in <module>
    train()
  File "train.py", line 281, in train
    yolact_net = Yolact()
  File "/home/yolact-amp/yolact.py", line 530, in __init__
    self.backbone = construct_backbone(cfg.backbone, cfg.use_amp)
  File "/home/yolact-amp/backbone.py", line 548, in construct_backbone
    set_amp(use_amp)
NameError: name 'set_amp' is not defined

I fixed the error by importing set_amp like this: from external.DCNv2.dcn_v2 import set_amp

After fixing the error, the model starts to train, but sometimes during training I get Gradient overflow. Is that normal when using AMP?

[ 0] 0 || B: 5.955 | C: 24.126 | M: 5.992 | S: 65.320 | T: 101.393 || ETA: 11 days, 18:19:45 || timer: 2.382
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
[ 0] 10 || B: 6.035 | C: 21.918 | M: 5.816 | S: 57.595 | T: 91.365 || ETA: 4 days, 11:15:02 || timer: 0.747
[ 0] 20 || B: 5.654 | C: 19.253 | M: 5.818 | S: 41.033 | T: 71.758 || ETA: 4 days, 2:41:50 || timer: 0.728
[ 0] 30 || B: 5.576 | C: 17.321 | M: 5.953 | S: 28.553 | T: 57.403 || ETA: 3 days, 23:35:57 || timer: 0.755
[ 0] 40 || B: 5.473 | C: 15.529 | M: 5.935 | S: 22.000 | T: 48.938 || ETA: 3 days, 22:01:17 || timer: 0.748
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
[ 0] 50 || B: 5.403 | C: 14.210 | M: 5.927 | S: 18.069 | T: 43.609 || ETA: 3 days, 21:10:25 || timer: 0.745
[ 0] 60 || B: 5.399 | C: 13.137 | M: 5.981 | S: 15.367 | T: 39.884 || ETA: 3 days, 20:37:17 || timer: 0.776

Thanks

@jasonkena

jasonkena commented Mar 23, 2020

Nice catch!

The Gradient Overflow warning is ok, as long as the loss scaler doesn't become 0. The warning means that it is scaling the loss, so it doesn't become infinite.

Yup, it's perfectly normal; it's Apex AMP's dynamic loss scaling doing its magic.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 23, 2020

@jasonkena Have you tried your code with YOLACT++? It seems the code works fine with YOLACT but not with YOLACT++. I'm getting this error when using the YOLACT++ config file. No idea how to fix it :|

Traceback (most recent call last):
  File "train.py", line 696, in <module>
    train()
  File "train.py", line 347, in train
    yolact_net(torch.zeros(1, 3, cfg.max_size, cfg.max_size).cuda())
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact-amp/yolact.py", line 725, in forward
    outs = self.backbone(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/yolact-amp/backbone.py", line 221, in forward
    x = layer(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home//anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact-amp/backbone.py", line 78, in forward
    out = self.conv2(out)
  File "/home/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yolact/external/DCNv2/dcn_v2.py", line 128, in forward
    self.deformable_groups)
  File "/home/yolact/external/DCNv2/dcn_v2.py", line 31, in forward
    ctx.deformable_groups)
RuntimeError: expected scalar type Float but found Half (data_ptr at /home/anaconda3/lib/python3.7/site-packages/torch/include/ATen/core/TensorMethods.h:6321)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7fdd22d32627 in /home/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: float at::Tensor::data_ptr() const + 0xf6 (0x7fdd07657b8a in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #2: float at::Tensor::data() const + 0x18 (0x7fdd0765b26a in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #3: dcn_v2_cuda_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0xd48 (0x7fdd076518fd in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #4: dcn_v2_forward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, int, int, int, int, int, int, int, int, int) + 0x91 (0x7fdd0763f721 in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #5: + 0x36cdb (0x7fdd0764ccdb in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
frame #6: + 0x3351c (0x7fdd0764951c in /home/yolact-amp/external/DCNv2/_ext.cpython-37m-x86_64-linux-gnu.so)
omitting python frames
frame #11: THPFunction_apply(_object, _object) + 0xa0f (0x7fdd559a7a3f in /home/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

Thanks

@jasonkena

Hmm, it seems like you haven't recompiled the DCNv2 module since you switched to my branch.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 23, 2020

I recompiled the DCNv2 module when I switched to your branch, and when I do it again, it says DCNv2 is already installed:

arous@DeepLearning:~/yolact-amp/external/DCNv2$ python setup.py build develop
running build
running build_ext
running develop
running egg_info
writing DCNv2.egg-info/PKG-INFO
writing dependency_links to DCNv2.egg-info/dependency_links.txt
writing top-level names to DCNv2.egg-info/top_level.txt
reading manifest file 'DCNv2.egg-info/SOURCES.txt'
writing manifest file 'DCNv2.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.7/_ext.cpython-37m-x86_64-linux-gnu.so ->
Creating /home/arous/anaconda3/lib/python3.7/site-packages/DCNv2.egg-link (link to .)
DCNv2 0.1 is already the active version in easy-install.pth
Installed /home/arous/yolact-amp/external/DCNv2
Processing dependencies for DCNv2==0.1
Finished processing dependencies for DCNv2==0.1

@jasonkena

You have to delete all the build files before you compile: _ext.cpython*, DCNv2.egg-info/, and build/.

@Auth0rM0rgan

I did that, but I'm still getting the same error :|

@jasonkena

Sorry, I just realized something about the error you mentioned here.
Can you try removing the line you added (from external.DCNv2.dcn_v2 import set_amp), and at the beginning, replace

try:
    from dcn_v2 import DCN, set_amp
except ImportError:

    def DCN(*args, **kwdargs):
        raise Exception(
            "DCN could not be imported. If you want to use YOLACT++ models, compile DCN. Check the README for instructions."
        )

with just from dcn_v2 import DCN, set_amp, so that if the import fails, it raises an error instead?

@Auth0rM0rgan

If I replace the line that I added (from external.DCNv2.dcn_v2 import set_amp) with

try:
    from dcn_v2 import DCN, set_amp
except ImportError:

    def DCN(*args, **kwdargs):
        raise Exception(
            "DCN could not be imported. If you want to use YOLACT++ models, compile DCN. Check the README for instructions."
        )

I'm getting the error name 'set_amp' is not defined. However, if I import DCNv2 like this: from external.DCNv2.dcn_v2 import DCN, set_amp, the code works on YOLACT++ as well :)

Thanks

@jasonkena

jasonkena commented Mar 24, 2020

The reason it works with YOLACT, even though the import fails, is that YOLACT doesn't use DCNv2 at all.
Now, the NameError shows up because the ImportError is swallowed by the try-except block, so can you remove the try-except block and replace from external.DCNv2.dcn_v2 import DCN, set_amp with just from dcn_v2 import DCN, set_amp, so that the ImportError shows up?

I cannot reproduce your error running fresh code on the branch. Can you push all your code to GitHub, so I can diff the changes?

@Auth0rM0rgan

Yes, ImportError: cannot import name 'set_amp' from 'dcn_v2' shows up when I remove the try-except block and use from dcn_v2 import DCN, set_amp. I have to import it as from external.DCNv2.dcn_v2 import DCN, set_amp to be able to run the code.

@jasonkena

jasonkena commented Mar 26, 2020

Sorry, I don't know where the problem is.

@Auth0rM0rgan

Auth0rM0rgan commented Mar 29, 2020

Hey @jasonkena,

as long as the loss scaler doesn't become 0.

Sometimes the I (mask IoU) loss becomes 0 when 'use_amp' is True. Is that OK?

[  0]    1840 || B: 4.008 | C: 6.122 | M: 5.581 | S: 1.371 | I: 0.000 | T: 17.082 || ETA: 6 days, 16:00:06 || timer: 0.439
[  0]    1850 || B: 4.008 | C: 6.092 | M: 5.572 | S: 1.333 | I: 0.000 | T: 17.005 || ETA: 6 days, 16:01:49 

@jasonkena

I'm not sure; it may be that the Mask-Rescoring network has fully converged (but this is unlikely).

But usually, I just disable the Mask-Rescoring loss.
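
For anyone who wants to do the same, here is a minimal sketch of disabling the Mask-Rescoring (mask IoU) loss, assuming the YOLACT++ config exposes the use_maskiou flag:

```python
# Hedged sketch: assumes config.py defines yolact_plus_base_config and a
# 'use_maskiou' option controlling the Mask-Rescoring head and its "I" loss term.
my_yolact_plus_config = yolact_plus_base_config.copy({
    'name': 'yolact_plus_no_maskiou',
    'use_maskiou': False,  # drop the Mask-Rescoring loss entirely
})
```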

@Auth0rM0rgan

Auth0rM0rgan commented Mar 29, 2020

@jasonkena Also, I'm getting a KeyError for 'I' during training when 'use_amp' is True, from this line:

yolact/train.py, line 168 in 092554a:

out[k] = torch.stack([output[k].to(output_device) for output in outputs])

I can get rid of it with a try-except.
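
For context, a hedged sketch of what that try-except workaround could look like around the gather step (key names and structure are illustrative, not the exact train.py code):

```python
import torch

def gather_losses(outputs, output_device):
    # Skip loss keys that are missing from some replicas instead of raising KeyError.
    out = {}
    for k in outputs[0]:
        try:
            out[k] = torch.stack([output[k].to(output_device) for output in outputs])
        except KeyError:
            continue  # e.g. the "I" (mask IoU) term may be absent on some steps
    return out
```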

What would be the impact of disabling the Mask-Rescoring loss on the model's performance? Is it going to hurt the performance?

@jasonkena

Can you try cloning my branch into a completely new directory? @sdimantsd and I didn't get any of your errors running it out of the box.

According to the YOLACT++ paper, the Mask-Rescoring loss improves the performance by 1 mAP.

@Auth0rM0rgan

Auth0rM0rgan commented Apr 1, 2020

Hey @jasonkena, I'm getting this error during testing with eval.py. AMP needs to be set up inside eval.py as well.

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

@jasonkena

Again, please clone my branch from scratch. Neither I nor sdimantsd can reproduce your problem.

You need to install conda for this.
Here are the complete instructions. Follow all of them:

  1. git clone -b amp https://github.com/jasonkena/yolact/
  2. git clone https://github.com/NVIDIA/apex
  3. cd apex and pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ and cd ../yolact
  4. Rename name within environment.yml, then conda env create -f environment.yml (to create a new clean environment)
  5. cd external/DCNv2 and python setup.py build develop
  6. Change use_amp to True in config.py
  7. Set up the rest of the config

@Rm1n90

Rm1n90 commented May 28, 2020

Hey @jasonkena,
I've trained my model with AMP without any problem, but like @Auth0rM0rgan, when trying to evaluate the model on a webcam I'm getting RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same. I've followed the instructions you wrote. Do you know how to solve the problem?
Thanks!
Thanks!

@jasonkena

Can you give the whole traceback?

@Rm1n90

Rm1n90 commented May 28, 2020

Loading model... Done.
Initializing model... Traceback (most recent call last):
  File "eval.py", line 1456, in <module>
    evaluate(net, dataset)
  File "eval.py", line 1191, in evaluate
    evalvideo(net, args.video)
  File "eval.py", line 1079, in evalvideo
    first_batch = eval_network(transform_frame(get_next_frame(vid)))
  File "eval.py", line 961, in eval_network
    out = net(imgs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/yolact.py", line 725, in forward
    outs = self.backbone(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/backbone.py", line 219, in forward
    x = layer(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/PycharmProjects/instanceSegmentation/yolact_amp/backbone.py", line 80, in forward
    out = self.conv3(out)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

I'm not getting this error if I set use_amp=False during eval.

@jasonkena

jasonkena commented May 28, 2020

Sorry @Auth0rM0rgan, I believe you were right. I did not initialize amp within eval.py, which is why the problem only showed up during inference.

@Rm1n90, to fix it I believe you have to add

if args.cuda:
    net = net.cuda()
if cfg.use_amp:
    from apex import amp

    if not args.cuda:
        raise ValueError("amp must be used with CUDA")
    net = amp.initialize(net, opt_level="O1")

before net = CustomDataParallel(net).cuda() (https://github.com/jasonkena/yolact/blob/e1a949445dc0c57eb7c8f10470630faff0ce22e2/eval.py#L913)

I haven't tested it; can you tell me how it turns out?

@Rm1n90

Rm1n90 commented May 28, 2020

@jasonkena, thanks, eval is now working with AMP.
