Some models will calculate 'nan' #36

Closed
LuckGuySam opened this issue Dec 28, 2020 · 9 comments · Fixed by #37
LuckGuySam commented Dec 28, 2020

I tried your code with your example picture on resnet34 and resnet50, and ScoreCAM, SSCAM, and ISCAM all compute 'nan'.
Can you help me solve this problem?

I am running torch 1.5.

frgfm self-assigned this Dec 29, 2020
frgfm added the bug and ext: scripts labels Dec 29, 2020
frgfm (Owner) commented Dec 29, 2020

Hi @LuckGuySam,

Thanks for reporting this! I just tried on my end and managed to reproduce the issue. As soon as I manage to find the source of the problem, I'll let you know!

frgfm added the help wanted label Dec 29, 2020
frgfm added this to the 0.1.3 milestone Dec 29, 2020
frgfm closed this as completed in #37 Dec 29, 2020
frgfm (Owner) commented Dec 29, 2020

Hey there @LuckGuySam,

The problem should be taken care of on master now 👌
I had recently forced models into eval mode in the example script, which does not seem to be a good idea for Score-CAMs. I reverted that change and added a unittest checking for NaNs on all tested CAMs!
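
For reference, the check boils down to something like this (a minimal sketch; the helper name is illustrative, not the exact unittest code):

```python
import torch

def check_cam_has_no_nan(cam: torch.Tensor) -> None:
    # Fail loudly if any element of the extracted CAM is NaN.
    assert not torch.isnan(cam).any().item(), "CAM contains NaN values"
```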

Let me know if you encounter the problem again!

LuckGuySam (Author) commented

Hey @frgfm,
I guess you forgot to update the file "./torchcam/cams/cam.py"!!

Then I have a question about forcing models into eval mode: I think eval mode is necessary, because a lot of models still have dropout layers, like the VGG models. Otherwise, if I use a VGG model, I would need to modify the forward function of the architecture, and that's not a good method, right?
If my thoughts are incorrect, please let me know, thanks!

frgfm (Owner) commented Dec 30, 2020

@LuckGuySam, are you still getting the error? I tried on my end, and the problem is solved.

Generally speaking, the problem with staying in training mode is that some layers update some of their buffers in that mode (batch norm, for instance). As you saw yourself, switching to eval only impacts some methods (namely the ScoreCAMs).

Whenever you can, switch the model to eval mode before extracting the CAM. This isn't a software design choice; it's based on the theoretical aspect, since the mode changes the model's behaviour. But again, it depends on what you're doing:

  • investigating what your model does at inference --> eval mode
  • investigating what your model focuses on before a parameter update during training --> training mode
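
In code, that boils down to (a minimal sketch; resnet50 is just an arbitrary example model here):

```python
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=True)

# Inference analysis: freeze batch norm running stats and disable dropout,
# so the CAM reflects what the model actually does at inference.
model.eval()

# Training-time analysis: keep the training behaviour instead, keeping in
# mind that every forward pass then updates the batch norm running stats.
# model.train()
```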

Additionally, VGG is a cumbersome fellow for old CAM methods (because it lacks a global pooling layer), so you won't be able to use base CAM on it. During my implementations, I had some time to consider the speed of each method, and I'd argue that, using each method's default paper parameters, SmoothGradCAMpp is the best option: no problem with models that lack a global pooling layer, freaking fast, and it doesn't require a bunch of forward passes to fit in memory (like the ScoreCAMs).
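
For instance (a sketch based on the torchcam.cams module discussed in this thread; depending on your version, you may need to pass the target conv layer name to the constructor):

```python
import torch
from torchvision.models import resnet18
from torchcam.cams import SmoothGradCAMpp

model = resnet18(pretrained=True).eval()
extractor = SmoothGradCAMpp(model)

scores = model(torch.rand(1, 3, 224, 224))
# Retrieve the CAM for the top predicted class
cam = extractor(scores.argmax(dim=1).item(), scores)
```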

I hope this helps!

LuckGuySam (Author) commented

@frgfm, I tried your new code, and NaN still happens on resnet34 with SSCAM and ISCAM. Then I looked at your code, and it has a NaN check just on Grad-CAM, is that right? Your files "cam.py" and "gradcam.py" have different modification times; did you update the new "cam.py" file?

Thank you for the answer about the mode choice, I think you are correct on this part!!

frgfm (Owner) commented Jan 1, 2021

@LuckGuySam I'll investigate resnet34 again this weekend, but I had no issues at all with resnet18 & resnet50 (I didn't check back on resnet34, I must admit) after changing the mode forcing in the script!

Not exactly: I added a NaN check for all CAMs in the unittests. The only CAM that had NaNs consistently was gradcam, so I fixed that issue, which was due to normalization.

I'm not sure what you mean by modifying cam.py 😅 If you want to open a PR, I'll review it happily, but I really don't see what you mean. I'll check resnet34, but again, for mobilenet, resnet18 and resnet50, everything is working well on my end (for obvious time constraints, I can't run unittests on each CAM for all torchvision models).

LuckGuySam (Author) commented

@frgfm First of all, thank you for your help! I found where I made my mistake!

Second, can you give some recommendations in case I run into NaN again? For some reasons, I need to test some models that are not official torch ones; is it good practice to simply ignore the NaNs during the calculation?

frgfm (Owner) commented Jan 4, 2021

@LuckGuySam you're very welcome! Don't get me wrong, I prefer GitHub issues that leave room for fixes/improvements rather than praise haha

It strongly depends on the framework being used, but here is how I see things:

  • in the case of pytorch, NaNs usually arise from a division by zero (division by something that underflowed during an FP conversion, no eps in the denominator, etc.)
  • I look for the part of the code performing the division (usually batch norms; you'll notice there aren't many divisions in NNs apart from normalization layers). A hook-based way to hunt for the offending layer is sketched below.
  • either I made a mistake in my implementation, or the NaN case was not handled in the paper (cf. my fix in gradcam). In our case, it is strange that both resnet18 & resnet50 work properly but not resnet34. So most likely, there is an edge case I haven't handled yet.

Again, this heavily relies on the assumption that there is no implementation error.
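
Here is roughly what I mean by hunting for the offending division (a hypothetical debugging helper, not part of torchcam):

```python
import torch
from torch import nn

def find_nan_modules(model: nn.Module, x: torch.Tensor) -> list:
    # Register a forward hook on every submodule and record the names of
    # those whose output contains NaN values.
    culprits = []

    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            culprits.append(type(module).__name__)

    handles = [m.register_forward_hook(hook) for m in model.modules()]
    with torch.no_grad():
        model(x)
    for handle in handles:
        handle.remove()
    return culprits
```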

On the question of what to do with NaN in CAMs:

  • if the NaN is produced by a zero division, since CAMs are normalized, it's most likely because the std of the CAM is zero.
  • std=0 means that the CAM is spatially invariant, i.e. it has the same value everywhere.
  • since the CAM is a weighted sum of modified feature maps, a single one of these being NaN will produce NaNs when summed up with anything else.
  • There are very few cases where this is possible: the likelihood of having the exact same value everywhere is freakishly low, apart from the case where all values went to 0 by underflow.
  • In any case, the best course of action, in my opinion, would be to ignore the NaNs (see the guard sketched below).
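
As an illustration, guarding the normalization with an epsilon (a sketch; the actual torchcam normalization may differ):

```python
import torch

def normalize_cam(cam: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # A small epsilon in the denominator makes a spatially constant CAM
    # (std == 0) come out as zeros instead of NaN.
    cam = cam - cam.mean()
    return cam / (cam.std() + eps)
```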

I hope this helped!

FredrikM97 commented
A late note, but I also had this problem. It is caused by the sum() within _cam. Not sure why, but I replaced it with torch.nansum. I think it is caused by an overflow, and instead of returning zeros it shows us NaN.
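
In other words, something like this (a sketch with dummy tensors; torch.nansum requires torch>=1.7):

```python
import torch

# Dummy weighted aggregation of feature maps, mimicking the sum in _cam.
activations = torch.rand(512, 7, 7)  # feature maps (C, H, W)
weights = torch.rand(512)            # per-channel weights

weighted = weights.view(-1, 1, 1) * activations
cam = torch.nansum(weighted, dim=0)  # NaN entries contribute 0 to the sum
```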
