Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataBlock.dataloaders does not support the advertised "shuffle" argument #3133

Closed
juliangilbey opened this issue Jan 7, 2021 · 8 comments · Fixed by #3178
Closed

DataBlock.dataloaders does not support the advertised "shuffle" argument #3133

juliangilbey opened this issue Jan 7, 2021 · 8 comments · Fixed by #3178
Assignees
Labels

Comments

@juliangilbey
Copy link

Please confirm you have the latest versions of fastai, fastcore, and nbdev prior to reporting a bug (delete one): YES
fastai 2.2.2, fastcore 1.3.16, nbdev 1.1.6

Describe the bug
The DataBlock documentation describes the https://docs.fast.ai/data.block.html#DataBlock.dataloaders function and explicitly says that there is an optional kwarg shuffle. However, using this results in the error:

TypeError: fastai.data.core.TfmdDL() got multiple values for keyword argument 'shuffle'

To Reproduce
Take the notebook for Chapter 2 of the fastai book, and change the cell which reads dls = bears.dataloaders(path) to read dls = bears.dataloaders(path, shuffle=True) instead.

Expected behavior
It should perform shuffling as requested.

Error with full stack trace

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-cb46ba7faa34> in <module>
----> 1 dls = bears.dataloaders(path, shuffle=True)

~/fast.ai.course/fastai-venv/lib/python3.9/site-packages/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
    113         dsets = self.datasets(source, verbose=verbose)
    114         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
--> 115         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
    116 
    117     _docs = dict(new="Create a new `DataBlock` with other `item_tfms` and `batch_tfms`",

~/fast.ai.course/fastai-venv/lib/python3.9/site-packages/fastai/data/core.py in dataloaders(self, bs, val_bs, shuffle_train, n, path, dl_type, dl_kwargs, device, **kwargs)
    211         if dl_type is None: dl_type = self._dl_type
    212         drop_last = kwargs.pop('drop_last', shuffle_train)
--> 213         dl = dl_type(self.subset(0), bs=bs, shuffle=shuffle_train, drop_last=drop_last, n=n, device=device,
    214                      **merge(kwargs, dl_kwargs[0]))
    215         dls = [dl] + [dl.new(self.subset(i), bs=(bs if val_bs is None else val_bs), shuffle=False, drop_last=False,

TypeError: fastai.data.core.TfmdDL() got multiple values for keyword argument 'shuffle'

Additional context
The cause of the error is clear, though the best way to solve it is not as obvious. The DataBlock.dataloaders() method:

fastai/fastai/data/block.py

Lines 112 to 115 in 373050b

def dataloaders(self, source, path='.', verbose=False, **kwargs):
dsets = self.datasets(source, verbose=verbose)
kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)

just passes most of the keyword arguments through to its Datasets object, which in turn is a subclass of FilteredBase, so inherits the dataloaders() method from there. This method is as follows:

fastai/fastai/data/core.py

Lines 207 to 217 in d1cf81c

def dataloaders(self, bs=64, val_bs=None, shuffle_train=True, n=None, path='.', dl_type=None, dl_kwargs=None,
device=None, **kwargs):
if device is None: device=default_device()
if dl_kwargs is None: dl_kwargs = [{}] * self.n_subsets
if dl_type is None: dl_type = self._dl_type
drop_last = kwargs.pop('drop_last', shuffle_train)
dl = dl_type(self.subset(0), bs=bs, shuffle=shuffle_train, drop_last=drop_last, n=n, device=device,
**merge(kwargs, dl_kwargs[0]))
dls = [dl] + [dl.new(self.subset(i), bs=(bs if val_bs is None else val_bs), shuffle=False, drop_last=False,
n=None, **dl_kwargs[i]) for i in range(1, self.n_subsets)]
return self._dbunch_type(*dls, path=path, device=device)

This has the keyword argument shuffle_train (but not shuffle), but on lines 213-214, calls the dl_type function with an explicit shuffle keyword and then the kwargs passed through (merged with dl_kwargs[0]). So if kwargs includes shuffle, there will be a repeated keyword argument.

I haven't checked whether the same would apply to the other keyword arguments listed on line 213.

One simple way would be to rewrite lines 213-214 as:

        dl = dl_type(self.subset(0),
                     **merge({'bs': bs, 'shuffle': shuffle_train, 'drop_last': drop_last, 'n': n, 'device': device},
                     kwargs, dl_kwargs[0]))
@juliangilbey
Copy link
Author

Actually, I see the same issue appears on lines 215-216 as well, if shuffle appears in dl_kwargs[i].

@marii-moe
Copy link
Collaborator

Was able to reproduce this and taking a look at it now.

@marii-moe marii-moe self-assigned this Jan 8, 2021
@marii-moe marii-moe added the bug label Jan 8, 2021
@tcapelle
Copy link
Contributor

tcapelle commented Jan 8, 2021

just to understand, what are you trying to shuffle?

@juliangilbey
Copy link
Author

I was trying to shuffle the validation set (I had a good reason to), but realised that even if the shuffle argument did as it said, it would still not shuffle the validation set, only the training set. So I found a different (perhaps clunky) way to do what I wanted to do. But if someone didn't want to shuffle the training set for some reason, shuffle=False would not work either, so the bug definitely remains, though it's a pretty small one.

@marii-moe
Copy link
Collaborator

@juliangilbey Just to help with writing this, why are you trying to shuffle the validation set? I wouldn't think we would want to drop the last batch on the validation set so making sure.

@marii-moe
Copy link
Collaborator

marii-moe commented Jan 12, 2021

@jph00 For this one I think it might be good to assume keyword arguments apply to train, and can be overridden for the validation set by starting with 'val_'. The key idea below:

default_kwargs = {'bs':bs,'n':None}
def create_dl(): #place holder for creating dataloaders
val_kwargs={k[4:]:v for k,v in kwargs.items() if k.startswith('val_')} #implmenting this is the key idea
#below validation set only
dls = [ create_dl(**merge(kwargs,default_kwargs,val_kwargs,dl_kwargs[i]) ) for i in range(num_dls) ]

Here is a more complete gist for my idea: https://gist.github.com/marii-moe/7374ee9d59400f85b17dc54b7a2c4af0

Here is the original:

def dataloaders(self, bs=64, val_bs=None, shuffle_train=True, n=None, path='.', dl_type=None, dl_kwargs=None,

This would mean removing 'shuffle_train' which has been part of the api as far back as I could track in fastai_dev/fastai2. kwargs was originally included in creating the validation dataloader, but more recently was removed, so less sure about including that one as part of the merge. Unable to find info on why kwargs is not included as part of creating validation dataloaders, but it could have been to simplfy things. Here is the commit that changed it though: a4e2456#diff-b881fa2851a7d60fe30b52d6cd1bc25cd510f0b9ac30de516cd01c3de05af1cbL166

It is important to put all in a merge, as this was we don't end up passing two of the same keyword argument.

@juliangilbey
Copy link
Author

@juliangilbey Just to help with writing this, why are you trying to shuffle the validation set? I wouldn't think we would want to drop the last batch on the validation set so making sure.

Hi @marii-moe I wanted to run dls.valid.show_batch() to check that things looked OK in the validation set. Unfortunately, the original data is very nicely ordered, so I only see data instances from one class. Being able to shuffle the validation set would help for this.

@jph00
Copy link
Member

jph00 commented Jan 12, 2021

For this one I think it might be good to assume keyword arguments apply to train, and can be overridden for the validation set by starting with 'val_'.

I like it. Let's leave shuffle_train as well for now, but have it raise a deprecation warning if not None, and have it just set shuffle. How does that sound?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment