Better support for iterable datasets #3173

Merged (1 commit) on Feb 8, 2021

@jcaw jcaw commented Jan 18, 2021

`IterableDataset` objects currently don't work when `_one_pass` is called, because this method indexes item `0`, whereas an `IterableDataset` needs to be indexed with `None`. Passing `None` unconditionally, however, is incompatible with indexed datasets. This PR aims to fix both issues.

The following changes are made:

  • Fix `create_item` to be compatible with both iterable and indexed datasets.
  • If a sample item is required, index `do_item` with `None`, not `0`. For iterable datasets this gets the next item; for indexable datasets it gets the first item.
  • Subclasses of `DataLoader` are updated to be compatible with this indexing scheme.
  • Native PyTorch `IterableDataset`s have a stubbed `__getitem__` method (that just raises an error), so the presence of that method alone is not enough to establish the `indexed` property. Check for `IterableDataset` classes/subclasses too.
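
The last three points can be sketched without PyTorch installed. Everything below is an illustrative assumption rather than the code in this PR: `StubIterableDataset` stands in for `torch.utils.data.IterableDataset`, and `is_indexed` and `sample_item` are hypothetical helpers that only mimic the behaviour described in the list.

```python
class StubIterableDataset:
    """Assumed stand-in for torch.utils.data.IterableDataset."""

    def __getitem__(self, idx):
        # Mirrors the stubbed method described above: present, but unusable.
        raise NotImplementedError

    def __iter__(self):
        raise NotImplementedError


class Stream(StubIterableDataset):
    """A tiny iterable dataset that yields 0, 1, 2."""

    def __iter__(self):
        return iter(range(3))


def is_indexed(dataset, iterable_bases=(StubIterableDataset,)):
    """Hypothetical check: indexable only if it is not an iterable dataset."""
    if isinstance(dataset, iterable_bases):
        return False
    return hasattr(dataset, "__getitem__")


def sample_item(dataset):
    """Hypothetical helper: first item if indexed, otherwise the next item."""
    if is_indexed(dataset):
        return dataset[0]
    return next(iter(dataset))


stream = Stream()
# Checking only for the method reports the stream as indexable, which is wrong.
naive_check = hasattr(stream, "__getitem__")
```

Here `naive_check` is true even though indexing `stream` would raise, which is why the class check comes first in `is_indexed`.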

While I've tried to familiarise myself with how `_one_pass` is used, I'm still unsure whether it could be called in a way that consumes the first item of the iterator, after which the same iterator continues to be used by e.g. the training loop (effectively skipping the first item). Is this a problem, or is the iterator always reset by calling `create_batches`?
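
To make the concern concrete, here is a minimal sketch using a plain Python generator. It does not show what fastai actually does; `numbers` is a hypothetical stand-in for an iterable dataset, and the two cases only contrast reusing an iterator with creating a fresh one.

```python
def numbers():
    """Hypothetical stand-in for an iterable dataset's item stream."""
    yield from range(4)


# Case 1: the iterator is shared between the sample step and the loop.
shared = numbers()
sample = next(shared)        # taking a sample consumes the first item
remaining = list(shared)     # a later loop over `shared` starts at the second item

# Case 2: the loop builds its own iterator, so nothing is skipped.
fresh = list(numbers())
```

If the training loop always obtains a new iterator (case 2), taking a sample beforehand is harmless. The first item is only skipped if the same iterator object is reused (case 1).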

@jcaw jcaw requested a review from jph00 as a code owner January 18, 2021 13:36


jcaw commented Jan 18, 2021

Seems like I've pushed some cruft in the notebooks - let me remove that and force-push before this is reviewed.


jph00 commented Feb 8, 2021

Looks good! - Many thanks :)

@jph00 jph00 merged commit 45376f1 into fastai:master Feb 8, 2021
@jph00 jph00 changed the title from "Fix iterable datasets" to "Better support for iterable datasets" Feb 8, 2021