Some models will calculate 'nan' #36

Closed
LuckGuySam opened this issue Dec 28, 2020 · 9 comments · Fixed by #37
LuckGuySam commented Dec 28, 2020

I tried your code with your example picture on resnet34 and resnet50, and ScoreCAM, SSCAM, and ISCAM all compute 'nan'.
Can you help me solve this problem?

I am running torch 1.5.

frgfm self-assigned this Dec 29, 2020
frgfm added the bug and ext: scripts labels Dec 29, 2020
frgfm (Owner) commented Dec 29, 2020

Hi @LuckGuySam,

Thanks for reporting this! I just tried on my end and managed to reproduce the issue. As soon as I manage to find the source of the problem, I'll let you know!

frgfm added the help wanted label Dec 29, 2020
frgfm added this to the 0.1.3 milestone Dec 29, 2020
frgfm closed this as completed in #37 Dec 29, 2020
frgfm (Owner) commented Dec 29, 2020

Hey there @LuckGuySam,

The problem should be taken care of on master now 👌
I had recently forced models into eval mode in the example script, which does not seem to be a good idea for Score-CAMs. I reverted that change and added a unittest checking for NaNs on all tested CAMs!
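
For reference, the check boils down to something like this (a minimal sketch; the helper name is illustrative, not the exact unittest code):

```python
import torch

def check_cam_has_no_nan(cam: torch.Tensor) -> None:
    # Fail loudly if any element of the extracted CAM is NaN.
    assert not torch.isnan(cam).any().item(), "CAM contains NaN values"
```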

Let me know if you encounter the problem again!

LuckGuySam (Author) commented

Hey @frgfm,
I guess you forgot to update the file "./torchcam/cams/cam.py"!!

Then I have a question about forcing models into eval mode: I think eval mode is necessary, because a lot of models still have dropout layers, like the VGG models. Otherwise, if I use a VGG model, I would need to modify the forward function of the architecture, and that's not a good method, right?
If my thoughts are incorrect, please let me know, thanks!

frgfm (Owner) commented Dec 30, 2020

@LuckGuySam, are you still getting the error? I tried on my end, and the problem is solved.

Generally speaking, the problem with staying in training mode is that some layers update some of their buffers in that mode (batch norm, for instance). As you saw yourself, switching to eval only impacts some methods (namely the ScoreCAMs).

Whenever you can, switch the model to eval mode before extracting the CAM. This isn't a software design choice; it's based on the theoretical aspect, since the mode changes the model's behaviour. But again, it depends on what you're doing:

  • investigating what your model does at inference --> eval mode
  • investigating what your model focuses on before a parameter update during training --> training mode
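
In code, that boils down to (a minimal sketch; resnet50 is just an arbitrary example model here):

```python
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=True)

# Inference analysis: freeze batch norm running stats and disable dropout,
# so the CAM reflects what the model actually does at inference.
model.eval()

# Training-time analysis: keep the training behaviour instead, keeping in
# mind that every forward pass then updates the batch norm running stats.
# model.train()
```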

Additionally, VGG is a cumbersome fellow for old CAM methods (because it lacks a global pooling layer), so you won't be able to use base CAM on it. During my implementations, I had some time to consider the speed of each method, and I'd argue that, using each method's default paper parameters, SmoothGradCAMpp is the best option: no problem with models that lack a global pooling layer, freaking fast, and it doesn't require a bunch of forward passes to fit in memory (like the ScoreCAMs).
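
For instance (a sketch based on the torchcam.cams module discussed in this thread; depending on your version, you may need to pass the target conv layer name to the constructor):

```python
import torch
from torchvision.models import resnet18
from torchcam.cams import SmoothGradCAMpp

model = resnet18(pretrained=True).eval()
extractor = SmoothGradCAMpp(model)

scores = model(torch.rand(1, 3, 224, 224))
# Retrieve the CAM for the top predicted class
cam = extractor(scores.argmax(dim=1).item(), scores)
```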

I hope this helps!

LuckGuySam (Author) commented

@frgfm, I tried your new code, and NaN still happens on resnet34 with SSCAM and ISCAM. Then I looked at your code, and it has a NaN check just on Grad-CAM, is that right? Your files "cam.py" and "gradcam.py" have different modification times; did you update the new "cam.py" file?

Thank you for the answer about the mode choice, I think you are correct on this part!!

frgfm (Owner) commented Jan 1, 2021

@LuckGuySam I'll investigate resnet34 again this weekend, but I had no issues at all with resnet18 & resnet50 (I didn't check back on resnet34, I must admit) after changing the mode forcing in the script!

Not exactly: I added a NaN check for all CAMs in the unittests. The only CAM that had NaNs consistently was gradcam, so I fixed that issue, which was due to normalization.

I'm not sure what you mean by modifying cam.py 😅 If you want to open a PR, I'll review it happily, but I really don't see what you mean. I'll check resnet34, but again, for mobilenet, resnet18 and resnet50, everything is working well on my end (for obvious time constraints, I can't run unittests on each CAM for all torchvision models).

LuckGuySam (Author) commented

@frgfm First of all, thank you for your help! I found where I made my mistake!

Second, can you give some recommendations in case I run into NaN again? For some reasons, I need to test some models that are not official torch ones; is it good practice to simply ignore the NaNs during the calculation?

frgfm (Owner) commented Jan 4, 2021

@LuckGuySam you're very welcome! Don't get me wrong, I prefer GitHub issues that leave room for fixes/improvements rather than praise haha

It strongly depends on the framework being used, but here is how I see things:

  • in the case of pytorch, NaNs usually arise from a division by zero (division by something that underflowed during an FP conversion, no eps in the denominator, etc.)
  • I look for the part of the code performing the division (usually batch norms; you'll notice there aren't many divisions in NNs apart from normalization layers). A hook-based way to hunt for the offending layer is sketched below.
  • either I made a mistake in my implementation, or the NaN case was not handled in the paper (cf. my fix in gradcam). In our case, it is strange that both resnet18 & resnet50 work properly but not resnet34. So most likely, there is an edge case I haven't handled yet.

Again, this heavily relies on the assumption that there is no implementation error.
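
Here is roughly what I mean by hunting for the offending division (a hypothetical debugging helper, not part of torchcam):

```python
import torch
from torch import nn

def find_nan_modules(model: nn.Module, x: torch.Tensor) -> list:
    # Register a forward hook on every submodule and record the names of
    # those whose output contains NaN values.
    culprits = []

    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            culprits.append(type(module).__name__)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    with torch.no_grad():
        model(x)
    for handle in handles:
        handle.remove()
    return culprits
```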

On the question of what to do with NaN in CAMs:

  • if the NaN is produced by a zero division, since CAMs are normalized, it's most likely because the std of the CAM is zero.
  • std=0 means that the CAM is spatially invariant, i.e. it has the same value everywhere.
  • since the CAM is a weighted sum of modified feature maps, a single one of these being NaN will produce NaNs when summed up with anything else.
  • There are very few cases where this is possible: the likelihood of having the exact same value everywhere is freakishly low, apart from the case where all values went to 0 by underflow.
  • In any case, the best course of action, in my opinion, would be to ignore the NaNs (see the guard sketched below).
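
As an illustration, guarding the normalization with an epsilon (a sketch; the actual torchcam normalization may differ):

```python
import torch

def normalize_cam(cam: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # A small epsilon in the denominator makes a spatially constant CAM
    # (std == 0) come out as zeros instead of NaN.
    cam = cam - cam.mean()
    return cam / (cam.std() + eps)
```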

I hope this helped!

FredrikM97 commented
A late note, but I also had this problem. It is caused by the sum() within _cam. Not sure why, but I replaced it with torch.nansum. I think it is caused by an overflow, and instead of returning zeros it shows us NaN.
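
In other words, something like this (a sketch with dummy tensors; torch.nansum requires torch>=1.7):

```python
import torch

# Dummy weighted aggregation of feature maps, mimicking the sum in _cam.
activations = torch.rand(512, 7, 7)  # feature maps (C, H, W)
weights = torch.rand(512)            # per-channel weights

weighted = weights.view(-1, 1, 1) * activations
cam = torch.nansum(weighted, dim=0)  # NaN entries contribute 0 to the sum
```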
