
[BUG] pytorch dataloader index error #2089

Closed
lspinheiro opened this issue Jan 3, 2023 · 7 comments · Fixed by #2092
Labels: bug (Something isn't working)


lspinheiro commented Jan 3, 2023

🐛🐛 Bug Report

I'm trying to understand an issue that makes the PyTorch data loader from deeplake throw an IndexError for some samples unexpectedly. When I fetch the same data directly from the dataset, the error does not occur.

The error first appeared during model training. I was able to reproduce it with the following code:

import numpy as np
import torchvision.transforms as T

# ds is the Deep Lake dataset (loaded beforehand; omitted here).


def deeplake_transform(sample_in, patch_size: int, num_seg_classes: int):
    seg_indices = sample_in["masks/label"]
    partial_mask = sample_in["masks/mask"].astype("float32")
    full_mask = np.zeros((num_seg_classes, patch_size, patch_size), dtype=np.float32)
    for i, idx in enumerate(seg_indices):
        full_mask[idx] = partial_mask[i]

    return dict(
        inputs=dict(image=T.ToTensor()(sample_in["images"])),
        targets=dict(
            segmentations=full_mask,
            classifications=sample_in["labels"].astype("float32"),
        ),
    )


data_loader = ds.pytorch(
    transform=deeplake_transform,
    decode_method={"images": "numpy"},
    batch_size=1,
    num_workers=1,
    transform_kwargs={"num_seg_classes": 67, "patch_size": 512},
)
iter_loader = iter(data_loader)

idx = 0
while True:
    try:
        sample = next(iter_loader)
    except Exception as e:
        print(e)
        break

    idx += 1
    if idx == len(ds):
        print("finished")
        break

The following error is thrown without much context.

Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File ".venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data.append(next(self.dataset_iter))
  File "/home/test/.venv/lib/python3.8/site-packages/deeplake/integrations/pytorch/dataset.py", line 472, in __iter__
    for data in stream:
  File ".venv/lib/python3.8/site-packages/deeplake/core/io.py", line 311, in read
    yield from self.stream(block)
  File "/home/test/.venv/lib/python3.8/site-packages/deeplake/core/io.py", line 355, in stream
    data = engine.read_sample_from_chunk(
  File ".venv/lib/python3.8/site-packages/deeplake/core/chunk_engine.py", line 1528, in read_sample_from_chunk
    return chunk.read_sample(
  File ".venv/lib/python3.8/site-packages/deeplake/core/chunk/uncompressed_chunk.py", line 213, in read_sample
    sb, eb = bps[local_index]
  File ".venv/lib/python3.8/site-packages/deeplake/core/meta/encode/base_encoder.py", line 247, in __getitem__
    self._encoded[row_index], row_index, local_sample_index
IndexError: index 7133 is out of bounds for axis 0 with size 7133

But the following code produces no errors and exhausts the iterator.

for sample in ds:
    try:
        # try to read all the data that is used in the transform
        sample["images"].data()['value']
        sample["masks/mask"].data()['value']
        sample["masks/label"].data()['value']
        sample["labels"].data()['value']
    except Exception as e:
        print(e)
        break

I'm looking for help here since this may be related to the chunk_engine behaviour. It would also help if the internal exception handling were more explicit about what actually went wrong.
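
For illustration, something like the following would already make the failure much easier to diagnose. This is a hypothetical sketch only, not actual deeplake code; read_byte_range and its arguments are made-up stand-ins for the bps[local_index] lookup that fails in the traceback above.

    def read_byte_range(byte_positions, local_index, chunk_name):
        # byte_positions stands in for the per-chunk byte-positions encoder;
        # re-raise the IndexError with enough context to locate the bad chunk.
        try:
            start, end = byte_positions[local_index]
        except IndexError as e:
            raise IndexError(
                f"sample lookup failed in chunk {chunk_name!r}: local index "
                f"{local_index} is out of range for {len(byte_positions)} encoded samples"
            ) from e
        return start, end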

⚙️ Environment

  • Python version(s): 3.8.10
  • OS: Ubuntu 18.04
  • IDE: VS-Code
  • Packages: [torch==1.13.1, deeplake==3.1.7]
lspinheiro added the bug label on Jan 3, 2023

istranic commented Jan 3, 2023

Hey @lspinheiro, thank you for reporting this issue! Were you running this on one of our public datasets? If so, could you please share the link? It will help us reproduce the issue. If you're running on a private dataset, don't worry about it.

lspinheiro (Author) commented:

Hi @istranic. It is a private dataset; sadly, I can't share any of the ingestion scripts.

If it helps, I'm ingesting into deeplake by appending samples with the following format:

    dset_entry = {
        "images": image,
        "labels": classes.astype(np.int32),
        "masks/label": seg_classes,
        "masks/mask": masks.astype(np.bool8),
        "metadata": metadata,
    }

This is my dataset specification:

[screenshot of the dataset specification]
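
For reference, roughly how such a schema and ingestion loop might look; the htypes and the source_samples iterable below are assumptions on my part (the actual specification is in the screenshot above):

    import deeplake
    import numpy as np

    # Assumed schema -- the real htypes are in the screenshot above.
    ds = deeplake.empty("./example_ds", overwrite=True)
    ds.create_tensor("images", htype="image", sample_compression="png")
    ds.create_tensor("labels", htype="class_label")
    ds.create_tensor("masks/label", htype="class_label")
    ds.create_tensor("masks/mask", htype="binary_mask")
    ds.create_tensor("metadata", htype="json")

    with ds:
        for image, classes, seg_classes, masks, metadata in source_samples:  # placeholder iterable
            ds.append({
                "images": image,
                "labels": classes.astype(np.int32),
                "masks/label": seg_classes,
                "masks/mask": masks.astype(np.bool8),
                "metadata": metadata,
            })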

lspinheiro (Author) commented:

@farizrahman4u @istranic, I'm still debugging this. It looks like the data loader starts failing for all samples after some index. I'm guessing it has something to do with how the bytes_positions_encoder is behaving. Can you help me understand how the local sample index and the global sample index work, so that I can investigate further?

The local_index variable has the value 9316, which corresponds to num_samples. Could it be that one of these variables is not resetting during the chunk lookup process?

[screenshot of the debugger output]
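
To make the question concrete, my current mental model is that a global sample index gets resolved to a chunk plus a local index within that chunk, roughly like the simplified sketch below (not deeplake's actual implementation; chunk_sample_counts is a made-up list of per-chunk sample counts):

    def resolve_index(global_index, chunk_sample_counts):
        # Walk the chunks, subtracting each chunk's true sample count,
        # until the global index falls inside one of them.
        offset = 0
        for chunk_id, count in enumerate(chunk_sample_counts):
            if global_index < offset + count:
                return chunk_id, global_index - offset
            offset += count
        raise IndexError(f"global index {global_index} out of range")

If the reader instead assumed every chunk held as many samples as the first one, the computed local index could overrun a smaller chunk, which would be consistent with the IndexError above.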


istranic commented Jan 4, 2023

Hey @lspinheiro, this is out of my wheelhouse, but @farizrahman4u will get back to you tomorrow. Thank you for digging in further!

lspinheiro (Author) commented:

Two updates.

  1. It seems the issue is caused by the num_samples_per_chunk property in the chunk engine (the class with the read_sample method; I can't recall its exact name) not being calculated correctly. The value is computed only once (`if self._num_samples_per_chunk is None: ...`), from the first chunk, but it can vary slightly between chunks, so the error is thrown as soon as a chunk with more samples than expected is processed.
  2. The varying chunk sample size seemed to be caused by my metadata JSON tensor. I processed the dataset again without it and didn't observe the problem after that.

Maybe you can try to reproduce it by generating a dataset with a JSON tensor whose attributes vary in type and length.
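
Something along these lines, for example (a hypothetical reproduction sketch; the path, sizes, and payloads are made up, and I haven't confirmed that this exact snippet triggers the error):

    import deeplake
    import numpy as np

    ds = deeplake.empty("./json_repro_ds", overwrite=True)
    ds.create_tensor("images", htype="image", sample_compression="png")
    ds.create_tensor("metadata", htype="json")

    with ds:
        for i in range(10_000):
            ds.append({
                "images": np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8),
                # JSON payloads whose attributes vary in type and length.
                "metadata": {
                    "id": i,
                    "tags": ["a"] * (i % 7),
                    "extra": None if i % 3 else "x" * (i % 50),
                },
            })

    def touch(sample_in):
        # Read both tensors so their chunks are actually streamed.
        _ = sample_in["metadata"]
        return np.asarray(sample_in["images"])

    loader = ds.pytorch(transform=touch, batch_size=1, num_workers=1,
                        decode_method={"images": "numpy"})
    for _ in loader:
        pass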

farizrahman4u (Contributor) commented:

@lspinheiro Thanks for the ticket and the detailed breakdown of the issue. Would you be able to check whether the problem persists with the fr_json_fixed_shape_fix branch?
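
(Assuming you install from the main activeloopai/deeplake repository, pip install git+https://github.com/activeloopai/deeplake.git@fr_json_fixed_shape_fix should pull in that branch.)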

lspinheiro (Author) commented:

Thanks @farizrahman4u. I'm travelling with poor internet at the moment, but I will give it a try as soon as possible.
