
fastai efficientdet fails on learn.validate() with AttributeError: 'NoneType' object has no attribute 'shape' #638

Closed
potipot opened this issue Feb 11, 2021 · 11 comments · Fixed by #660
Labels
bug Something isn't working

Comments


potipot commented Feb 11, 2021

🐛 Bug

When trying to simply validate metrics for an efficientdet model with fastai, learn.validate() fails with:

KeyError: 'image_id'
AttributeError: 'NoneType' object has no attribute 'shape'

It fails when trying to read the batch size automatically, in AvgLoss.accumulate via find_bs:

class AvgLoss(Metric):
    "Average the losses taking into account potential different batch sizes"
    def reset(self):           self.total,self.count = 0.,0
    def accumulate(self, learn):
        bs = find_bs(learn.yb)
        self.total += learn.to_detach(learn.loss.mean())*bs
        self.count += bs
    @property
    def value(self): return self.total/self.count if self.count != 0 else None
    @property
    def name(self):  return "loss"

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://colab.research.google.com/drive/1i4aXYu4wIKA7eLUK86GwTm7lq7zku_oF?usp=sharing
@potipot potipot added the bug Something isn't working label Feb 11, 2021

potipot commented Feb 11, 2021

I found that applying the following patch:

from fastcore.basics import patch
from fastai.learner import AvgLoss
from fastai.torch_core import find_bs

@patch
def accumulate(self:AvgLoss, learn):
    #bs = find_bs(learn.yb)
    bs = find_bs(learn.xb)
    self.total += learn.to_detach(learn.loss.mean())*bs
    self.count += bs

fixes the issue. I'm not sure, though, how we should store that patch in icevision. Any suggestions?


potipot commented Feb 15, 2021


ganesh3 commented Feb 15, 2021

I too had this issue, which got solved with this patch and online help from @potipot. Thanks a lot. Hopefully, this is pushed as a PR.


potipot commented Feb 23, 2021

@lgvaz this is still an issue. Just wanted to ask for your opinion on where to put this fastai patch?


lgvaz commented Feb 26, 2021

@potipot I get access denied on the Colab notebook, can you change the permissions?


lgvaz commented Mar 4, 2021

Wow, this error is really subtle, especially why it was not happening before.

The error was introduced by #630, but the bug was there since the beginning. To understand it, we first have to look at how find_bs works:

def find_bs(b):
    "Recursively search the batch size of `b`."
    return item_find(b).shape[0]
    
def item_find(x, idx=0):
    "Recursively takes the `idx`-th element of `x`"
    if is_listy(x): return item_find(x[idx])
    if isinstance(x,dict):
        key = list(x.keys())[idx] if isinstance(idx, int) else idx
        return item_find(x[key])
    return x

In our case the x passed to item_find is a dictionary (the effdet target). Note that find_bs calls item_find with idx=0, so key = list(x.keys())[idx] takes the first key of the dict.

And this is what #630 changed. Before, the first key in the dict was bbox:
[screenshot: target dict with bbox as the first key]

Calling find_bs then returns .shape[0] of the first prediction, which in this case is 3. That is incorrect: the actual batch size is 16. So this was a silent bug.

After #630, the first key became img_size:
[screenshot: target dict with img_size as the first key]

Now find_bs is called on None, hence the error.


The solution is to do what @potipot proposed here, but without patching. Instead, we should do the same thing we do for the torchvision models when creating the learner (models/torchvision/fastai/learner.py):

    class RCNNAvgLoss(fastai.AvgLoss):
        def accumulate(self, learn):
            bs = len(learn.yb)
            self.total += fastai.to_detach(learn.loss.mean()) * bs
            self.count += bs

    recorder = [cb for cb in learn.cbs if isinstance(cb, fastai.Recorder)][0]
    recorder.loss = RCNNAvgLoss()

We also have to check if the same happens for the mmdet models.
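Whichever structure the batch size is read from, the underlying bookkeeping is the same: weight each batch's mean loss by its size and divide by the total count at the end. A standalone sketch of that accumulation (AvgLossSketch is a hypothetical illustration class, not fastai's AvgLoss):

```python
class AvgLossSketch:
    """Weighted running average of per-batch mean losses,
    mirroring fastai's AvgLoss but standalone for illustration."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def accumulate(self, mean_loss, bs):
        # Weight the batch's mean loss by the batch size so that
        # a smaller final batch does not skew the epoch average.
        self.total += mean_loss * bs
        self.count += bs

    @property
    def value(self):
        return self.total / self.count if self.count else None

avg = AvgLossSketch()
avg.accumulate(0.5, bs=16)  # full batch
avg.accumulate(0.9, bs=4)   # smaller last batch
print(avg.value)  # (0.5*16 + 0.9*4) / 20 = 0.58
```

This also shows why a wrong bs (3 instead of 16) corrupts the reported loss without raising: the weights are simply wrong, which is the "silent bug" described above.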

@potipot would you like to do a PR for this one?


potipot commented Mar 5, 2021

@lgvaz I'm not sure what the idea behind find_bs is, because in the test cases it is always 1. It seems like accumulate operates on the current batch element, while the actual batch_size is 2. Copying the solution from RCNNAvgLoss works, but now I'm wondering if this isn't in fact some other issue.


potipot commented Mar 5, 2021

In the case of rcnn_learner the results are actually different. I don't know why this is necessary in the case of rcnn, but this is what I get:


In[18]: len(learn.yb), find_bs(learn.yb), learn.dls.bs
Out[18]: (1, 2, 2)


lgvaz commented Mar 5, 2021

Wait, I got a bit confused, what solution actually gives the correct batch size?

For rcnn, yb is a list of dictionaries, so len(yb) will give the number of samples in the current batch. This is different for effdet, where yb is a single dictionary. The implementation would therefore be a bit different for the two cases.
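A minimal illustration of that difference, with placeholder strings standing in for the real tensors (the key names are illustrative, not necessarily icevision's exact ones):

```python
# rcnn-style target: a list with one dict per sample,
# so len() is the number of samples in the batch.
rcnn_yb = [{"boxes": "boxes_0", "labels": "labels_0"},
           {"boxes": "boxes_1", "labels": "labels_1"}]
print(len(rcnn_yb))  # 2 -- the batch size

# effdet-style target: one dict of batched tensors,
# so len() is just the number of keys, not the batch size.
effdet_yb = {"bbox": "batched_boxes", "cls": "batched_cls", "img_size": None}
print(len(effdet_yb))  # 3 -- number of keys, regardless of batch size
```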


potipot commented Mar 5, 2021

The problem is, I'm not sure what this is trying to do; I wasn't able to get values other than 1 there. I guess the idea is to take an average of the loss and weight it by the batch size. We could actually ignore this and just put a constant 1, so no weighting would happen.

When training via tests with bs=2, using both RCNN and EfficientDet, I was always getting a single element (not a batch) from learn.yb.


potipot commented Mar 6, 2021

The problem is that the test dataset (fridge_ds) has only 1 element in the validation set, so I cannot tell which length corresponds to the actual batch size: it is always 1. :D I will increase the validation set to, say, 3 elements and test again.
