
[BUG] pytorch dataloader index error #2089

Closed
lspinheiro opened this issue Jan 3, 2023 · 7 comments · Fixed by #2092
Labels: bug (Something isn't working)


lspinheiro commented Jan 3, 2023

🐛🐛 Bug Report

I'm trying to understand an issue that makes the PyTorch data loader from deeplake throw an IndexError for some samples unexpectedly. When I fetch the same data directly from the dataset, the error does not occur.

The error first appeared during model training. I was able to reproduce it with the following code:

import numpy as np
import torchvision.transforms as T

# ds is the Deep Lake dataset (loaded beforehand; omitted here).


def deeplake_transform(sample_in, patch_size: int, num_seg_classes: int):
    seg_indices = sample_in["masks/label"]
    partial_mask = sample_in["masks/mask"].astype("float32")
    full_mask = np.zeros((num_seg_classes, patch_size, patch_size), dtype=np.float32)
    for i, idx in enumerate(seg_indices):
        full_mask[idx] = partial_mask[i]

    return dict(
        inputs=dict(image=T.ToTensor()(sample_in["images"])),
        targets=dict(
            segmentations=full_mask,
            classifications=sample_in["labels"].astype("float32"),
        ),
    )


data_loader = ds.pytorch(
    transform=deeplake_transform,
    decode_method={"images": "numpy"},
    batch_size=1,
    num_workers=1,
    transform_kwargs={"num_seg_classes": 67, "patch_size": 512},
)
iter_loader = iter(data_loader)

idx = 0
while True:
    try:
        sample = next(iter_loader)
    except Exception as e:
        print(e)
        break

    idx += 1
    if idx == len(ds):
        print("finished")
        break

The following error is thrown without much context.

Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File ".venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data.append(next(self.dataset_iter))
  File "/home/test/.venv/lib/python3.8/site-packages/deeplake/integrations/pytorch/dataset.py", line 472, in __iter__
    for data in stream:
  File ".venv/lib/python3.8/site-packages/deeplake/core/io.py", line 311, in read
    yield from self.stream(block)
  File "/home/test/.venv/lib/python3.8/site-packages/deeplake/core/io.py", line 355, in stream
    data = engine.read_sample_from_chunk(
  File ".venv/lib/python3.8/site-packages/deeplake/core/chunk_engine.py", line 1528, in read_sample_from_chunk
    return chunk.read_sample(
  File ".venv/lib/python3.8/site-packages/deeplake/core/chunk/uncompressed_chunk.py", line 213, in read_sample
    sb, eb = bps[local_index]
  File ".venv/lib/python3.8/site-packages/deeplake/core/meta/encode/base_encoder.py", line 247, in __getitem__
    self._encoded[row_index], row_index, local_sample_index
IndexError: index 7133 is out of bounds for axis 0 with size 7133

But the following code produces no errors and exhausts the iterator.

for sample in ds:
    try:
        # try to read all the data that is used in the transform
        sample["images"].data()['value']
        sample["masks/mask"].data()['value']
        sample["masks/label"].data()['value']
        sample["labels"].data()['value']
    except Exception as e:
        print(e)
        break

I'm looking for help here since this may be related to the chunk_engine behaviour. It would also help if the internal exception handling were more explicit about what actually went wrong.
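
For illustration, something like the following would already make the failure much easier to diagnose. This is a hypothetical sketch only, not actual deeplake code; read_byte_range and its arguments are made-up stand-ins for the bps[local_index] lookup that fails in the traceback above.

    def read_byte_range(byte_positions, local_index, chunk_name):
        # byte_positions stands in for the per-chunk byte-positions encoder;
        # re-raise the IndexError with enough context to locate the bad chunk.
        try:
            start, end = byte_positions[local_index]
        except IndexError as e:
            raise IndexError(
                f"sample lookup failed in chunk {chunk_name!r}: local index "
                f"{local_index} is out of range for {len(byte_positions)} encoded samples"
            ) from e
        return start, end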

⚙️ Environment

  • Python version(s): 3.8.10
  • OS: Ubuntu 18.04
  • IDE: VS-Code
  • Packages: [torch==1.13.1, deeplake==3.1.7]
lspinheiro added the bug label on Jan 3, 2023

istranic commented Jan 3, 2023

Hey @lspinheiro, thank you for reporting this issue! Were you running this on one of our public datasets? If so, could you please share the link? It will help us reproduce the issue. If you're running on a private dataset, don't worry about it.

lspinheiro (Author) commented:

Hi @istranic. It is a private dataset; sadly, I can't share any of the ingestion scripts.

If it helps, I'm ingesting into deeplake by appending samples with the following format:

    dset_entry = {
        "images": image,
        "labels": classes.astype(np.int32),
        "masks/label": seg_classes,
        "masks/mask": masks.astype(np.bool8),
        "metadata": metadata,
    }

This is my dataset specification:

[screenshot of the dataset specification]
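
For reference, roughly how such a schema and ingestion loop might look; the htypes and the source_samples iterable below are assumptions on my part (the actual specification is in the screenshot above):

    import deeplake
    import numpy as np

    # Assumed schema -- the real htypes are in the screenshot above.
    ds = deeplake.empty("./example_ds", overwrite=True)
    ds.create_tensor("images", htype="image", sample_compression="png")
    ds.create_tensor("labels", htype="class_label")
    ds.create_tensor("masks/label", htype="class_label")
    ds.create_tensor("masks/mask", htype="binary_mask")
    ds.create_tensor("metadata", htype="json")

    with ds:
        for image, classes, seg_classes, masks, metadata in source_samples:  # placeholder iterable
            ds.append({
                "images": image,
                "labels": classes.astype(np.int32),
                "masks/label": seg_classes,
                "masks/mask": masks.astype(np.bool8),
                "metadata": metadata,
            })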

lspinheiro (Author) commented:

@farizrahman4u @istranic, I'm still debugging this. It looks like the data loader starts failing for all samples after some index. I'm guessing it has something to do with how the bytes_positions_encoder is behaving. Can you help me understand how the local sample index and the global sample index work, so that I can investigate further?

The local_index variable has the value 9316, which corresponds to num_samples. Could it be that one of these variables is not resetting during the chunk lookup process?

[screenshot of the debugger output]
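
To make the question concrete, my current mental model is that a global sample index gets resolved to a chunk plus a local index within that chunk, roughly like the simplified sketch below (not deeplake's actual implementation; chunk_sample_counts is a made-up list of per-chunk sample counts):

    def resolve_index(global_index, chunk_sample_counts):
        # Walk the chunks, subtracting each chunk's true sample count,
        # until the global index falls inside one of them.
        offset = 0
        for chunk_id, count in enumerate(chunk_sample_counts):
            if global_index < offset + count:
                return chunk_id, global_index - offset
            offset += count
        raise IndexError(f"global index {global_index} out of range")

If the reader instead assumed every chunk held as many samples as the first one, the computed local index could overrun a smaller chunk, which would be consistent with the IndexError above.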


istranic commented Jan 4, 2023

Hey @lspinheiro, this is out of my wheelhouse, but @farizrahman4u will get back to you tomorrow. Thank you for digging in further!

lspinheiro (Author) commented:

Two updates.

  1. It seems the issue is caused by the num_samples_per_chunk property in the chunk engine (the class with the read_sample method; I can't recall its exact name) not being calculated correctly. The value is computed only once (`if self._num_samples_per_chunk is None: ...`), from the first chunk, but it can vary slightly between chunks, so the error is thrown as soon as a chunk with more samples than expected is processed.
  2. The varying chunk sample size seemed to be caused by my metadata JSON tensor. I processed the dataset again without it and didn't observe the problem after that.

Maybe you can try to reproduce it by generating a dataset with a JSON tensor whose attributes vary in type and length.
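
Something along these lines, for example (a hypothetical reproduction sketch; the path, sizes, and payloads are made up, and I haven't confirmed that this exact snippet triggers the error):

    import deeplake
    import numpy as np

    ds = deeplake.empty("./json_repro_ds", overwrite=True)
    ds.create_tensor("images", htype="image", sample_compression="png")
    ds.create_tensor("metadata", htype="json")

    with ds:
        for i in range(10_000):
            ds.append({
                "images": np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8),
                # JSON payloads whose attributes vary in type and length.
                "metadata": {
                    "id": i,
                    "tags": ["a"] * (i % 7),
                    "extra": None if i % 3 else "x" * (i % 50),
                },
            })

    def touch(sample_in):
        # Read both tensors so their chunks are actually streamed.
        _ = sample_in["metadata"]
        return np.asarray(sample_in["images"])

    loader = ds.pytorch(transform=touch, batch_size=1, num_workers=1,
                        decode_method={"images": "numpy"})
    for _ in loader:
        pass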

farizrahman4u (Contributor) commented:

@lspinheiro Thanks for the ticket and the detailed breakdown of the issue. Would you be able to check whether the problem persists with the fr_json_fixed_shape_fix branch?
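
(Assuming you install from the main activeloopai/deeplake repository, pip install git+https://github.com/activeloopai/deeplake.git@fr_json_fixed_shape_fix should pull in that branch.)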

lspinheiro (Author) commented:

Thanks @farizrahman4u. I'm travelling with poor internet at the moment, but I will give it a try as soon as possible.
