Warning: Moving average ignored a value of inf #359
Update:
I just made a pull request that should fix the inf errors; you can try merging it locally.
Thanks! I will try this next week :)
@jasonkena What is the difference between @dbolya's repo and your repo?
For the most part, I added support for Apex's AMP; one of its features is dynamic loss scaling, so your losses will never overflow. Apex also supports 16-bit precision, so that's a plus. To enable it, change …
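(For later readers: here is a minimal, self-contained sketch of the standard Apex AMP pattern with dynamic loss scaling. It is the generic recipe, not the exact code in the branch, and the toy `nn.Linear` model only stands in for YOLACT.)

```python
# Generic Apex AMP setup: opt_level="O1" enables mixed precision with dynamic
# loss scaling, so gradients are rescaled instead of overflowing to inf in fp16.
import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(128, 10).cuda()            # stand-in for the YOLACT network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

criterion = nn.CrossEntropyLoss()
for step in range(100):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    loss = criterion(model(x), y)

    optimizer.zero_grad()
    # Backward on the scaled loss so fp16 gradients don't underflow/overflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```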
OK, Thx.
Yeah, I just added the pull request about 40 minutes ago, so he might not have read it. Hahaha, I'm not part of his team; I'm just doing this in my spare time.
Hahaha, I hope I can do that one day...
@jasonkena One more question.
Yes, that should work, but back up your weights just in case anything happens.
OK, but may it help my training?
Yes, your weights shouldn't explode.
lol
@jasonkena,
Did you set …
Yes
Does it work? Can you send me a screenshot?
Should work fine, unless the loss scaler becomes something ridiculous like …
OK, thx
Currently the loss is going high (it started from ~7 and is now ~180).
The total loss, right?
Nope, it's ~200.
The 'T' loss, right?
Yes
It's ~200
Yeah, you shouldn't be surprised. Unfortunately the loss scaler makes the loss readings inaccurate, because it multiplies the loss by a factor, so you shouldn't compare losses between different "Gradient Overflow" warnings. If it still doesn't converge, I'm guessing it's either your batch size or learning rate.
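(To make that concrete, a small sketch continuing the generic Apex loop shown earlier, so `loss`, `optimizer`, and `amp` are assumed to already exist. If a training loop happens to log the scaled value, the readings depend on the current loss scale and jump whenever the scale changes.)

```python
# Sketch only: the value yielded by amp.scale_loss is loss * current_loss_scale,
# and the scale changes whenever AMP detects an overflow, so a logged *scaled*
# loss is not comparable across iterations; the raw loss is.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
    print("raw loss:   ", loss.item())         # comparable across iterations
    print("scaled loss:", scaled_loss.item())  # depends on the current loss scale
```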
I didn't change the learning rate, and I'm using a batch size of 32 with 2 GPUs. Is that OK?
Sorry, I can't help.
Thx.
@dbolya anything new?
Hey @jasonkena, @sdimantsd, I wanted to know the performance after training with Apex's AMP. Did you gain better performance, or did it speed up your training process? Also, I'm curious whether it will impact the inference time if I train the model with 16-bit precision. (I mean, if I train with 16-bit precision, am I going to achieve higher FPS? I have achieved ~25 FPS on 1080p video with 32-bit precision.) Thanks
@Auth0rM0rgan to be honest, I haven't done any performance/accuracy benchmarks, so I can't say anything for sure. But theoretically, it should improve training time since 16-bit computation is faster. As for the memory consumption, using 16-bit precision saves 1 GB of VRAM with a batch-size of 4. The benchmark should be pretty straightforward since the AMP branch is compatible with |
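(A rough, self-contained sketch of the kind of benchmark being discussed: time a forward pass and record peak memory in fp32 vs fp16. The stack of `nn.Linear` layers is only a stand-in for the real network, and the sizes are arbitrary.)

```python
import torch
import torch.nn as nn

def bench(dtype):
    """Time an inference-style forward pass and report peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    model = nn.Sequential(*[nn.Linear(2048, 2048) for _ in range(20)]).cuda().to(dtype)
    x = torch.randn(64, 2048, device="cuda", dtype=dtype)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(50):
            model(x)
    end.record()
    torch.cuda.synchronize()

    ms_per_iter = start.elapsed_time(end) / 50
    peak_mib = torch.cuda.max_memory_allocated() / 1024**2
    print(f"{dtype}: {ms_per_iter:.2f} ms/iter, peak {peak_mib:.0f} MiB")

bench(torch.float32)
bench(torch.float16)
```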
Hey @jasonkena, I'm going to train the model with 16-bit precision and will let you know the performance. Hope I can see an improvement in the inference time as well.
Hey @jasonkena,
I fixed the error by importing … After fixing the error, the model starts to train, but sometimes during training I'm getting "Gradient overflow". Is that normal when we use AMP?
Thanks
Nice catch!
Yup, it's perfectly normal; it's Apex's AMP's dynamic loss scaling doing its magic.
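(For reference, a self-contained toy of what dynamic loss scaling does: on an overflow the optimizer step is skipped and the scale is halved, which is exactly when the "Gradient overflow. Skipping step" messages appear; after a stretch of clean steps the scale is doubled again. The constants below follow the usual defaults but are assumptions, not Apex's exact implementation.)

```python
class ToyLossScaler:
    """Toy dynamic loss scaler: skip and halve on overflow, double after a
    stretch of clean steps. Constants are illustrative, not Apex's exact ones."""

    def __init__(self, init_scale=2.0**16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, has_overflow):
        if has_overflow:                 # inf/nan found in the scaled gradients
            self.scale /= 2              # -> "Gradient overflow. Skipping step."
            self.good_steps = 0
            return "skipped"
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= 2              # grow back once training is stable
            self.good_steps = 0
        return "applied"

scaler = ToyLossScaler()
for overflow in (False, False, True, False):
    print(scaler.step(overflow), "scale =", scaler.scale)
```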
@jasonkena Have you tried your code with YOLACT++? It seems the code works fine with YOLACT but not with YOLACT++. I'm getting this error when using the yolact++ config file. No idea how to fix this error :|
Thanks
Hmm, it seems like you haven't recompiled the DCNv2 module since you switched to my branch.
I recompiled the DCNv2 module when I switched to your branch, and when I do it again, it says DCNv2 is already installed.
You have to delete all the build files before you compile, …
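(A sketch of a clean rebuild from the repo root, written as a small Python script; the `external/DCNv2` path and the build artifact names are assumptions about the repo layout, so adapt as needed.)

```python
import shutil
import subprocess
from pathlib import Path

dcn = Path("external/DCNv2")

# Drop stale build artifacts so the extension is compiled from scratch.
for artifact in ("build", "dist", "DCNv2.egg-info"):
    shutil.rmtree(dcn / artifact, ignore_errors=True)

# Recompile and install the DCNv2 extension in-place.
subprocess.run(["python", "setup.py", "build", "develop"], cwd=dcn, check=True)
```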
I did it but am still getting the same error :|
Sorry, I just realized something in the error you mentioned here.
with …
If I replace the line that I added (…), I'm getting this error. Thanks
The reason it works with YOLACT, even though the import fails, is that YOLACT doesn't use DCNv2 at all. I cannot reproduce your error running fresh code on the branch. Can you push all your code to GitHub so I can diff the changes?
Yes, …
Sorry, I don't know where the problem is.
Hey @jasonkena,
Sometimes …
I'm not sure; it may be that the Mask-Rescoring network has fully converged (but this is unlikely). Usually, I just disable the Mask-Rescoring loss.
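(A hedged sketch of what "disable the Mask-Rescoring loss" could look like, assuming your config file defines `yolact_plus_base_config` as in dbolya's `data/config.py` and that the flag controlling the mask-rescoring head is `use_maskiou`; double-check the key name in the version you are running.)

```python
# Goes in data/config.py alongside the other configs; turning off use_maskiou
# removes the MaskIoU (mask-rescoring) head and therefore its loss term.
yolact_plus_no_rescore_config = yolact_plus_base_config.copy({
    'name': 'yolact_plus_no_rescore',
    'use_maskiou': False,
})
```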
@jasonkena Also, I'm getting a keyword error (Line 168 in 092554a), which I got around with a try-except.
What would be the impact of disabling the Mask-Rescoring loss on the model's performance? Is it going to hurt performance?
Can you try cloning my branch into a completely new directory? @sdimantsd and I didn't get any of your errors running it out of the box. According to the YOLACT++ paper, the Mask-Rescoring loss improves performance by 1 mAP.
Hey @jasonkena, I'm getting this error during testing with …
Again, please clone my branch from scratch. Neither I nor sdimantsd can reproduce your problem. You need to install conda for this.
Hey @jasonkena,
Can you give the whole traceback?
I'm not getting this error if I set use_amp=False during eval.
Sorry @Auth0rM0rgan, I believe you were right: I did not initialize …
@Rm1n90, to fix it I believe you have to add … before … I haven't tested it; can you tell me how it turns out?
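(A rough sketch of that suggested fix, untested here: initialize Apex AMP on the network in the eval script before running inference, mirroring what the training script does. Names such as `Yolact`, `args.trained_model`, and `use_amp` follow the discussion and the repo's eval script, but treat them as assumptions rather than the exact patch.)

```python
from apex import amp

net = Yolact()                          # build the model as eval.py normally does
net.load_weights(args.trained_model)
net = net.cuda().eval()

if use_amp:
    # No optimizer is needed at eval time; amp.initialize then returns just the model.
    net = amp.initialize(net, opt_level="O1")
```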
@jasonkena Thanks, eval is now working with AMP.
Hi, I'm trying to train YOLACT to detect cars with images from COCO.
I took all the images with cars in them and made a dataset out of them.
My config looks like this:
```python
only_cars_coco2017_dataset = dataset_base.copy({
    'name': 'cars COCO 2017',
})

yolact_im200_coco_cars_config = yolact_base_config.copy({
    'name': 'yolact_im200_coco_cars',
})
```
After a few iterations, my loss goes very high...
Can someone help me with this?
Update:
Also, if I train with the full COCO dataset, I get the same error...
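(The snippet above may simply have been trimmed when posted, but for reference, a custom-dataset config following YOLACT's usual `dataset_base` / `yolact_base_config` conventions typically fills in fields along these lines. The paths, class list, and `max_size` value below are placeholders and assumptions, not values taken from this issue.)

```python
only_cars_coco2017_dataset = dataset_base.copy({
    'name': 'cars COCO 2017',
    # Placeholder paths: point these at your images and car-only annotation files.
    'train_images': './data/coco/train2017/',
    'train_info':   './data/coco/annotations/cars_train2017.json',
    'valid_images': './data/coco/val2017/',
    'valid_info':   './data/coco/annotations/cars_val2017.json',
    'class_names': ('car',),
})

yolact_im200_coco_cars_config = yolact_base_config.copy({
    'name': 'yolact_im200_coco_cars',
    'dataset': only_cars_coco2017_dataset,   # make the model config use the cars dataset
    'num_classes': len(only_cars_coco2017_dataset.class_names) + 1,  # +1 for background
    'max_size': 200,                          # assumed from the "im200" in the name
})
```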