Add PanNuke Dataloader #153
Conversation
anwai98
commented
Oct 5, 2023
- Adding the dataloader for the PanNuke dataset (histopathology domain)
@constantinpape At the current stage, it looks like it's working. It would be great to have a look over this. There's a rather complicated label transformation taking care of a lot of stuff (briefed in the respective function in the file itself). Let me know how it looks.
There are a couple of things that don't look correct to me, I left comments about that.
Another general note: I think it would make more sense to apply the transformation that makes instance and semantic labels out of the masks when creating the data; this makes it easier to later use different kinds of label transforms on the data (e.g. connected components for sub-patches).
torch_em/data/datasets/pannuke.py
Outdated
tmp_name = tmp_fold.split("_")[0] + tmp_fold.split("_")[1]  # name of a particular sub-directory (per fold)
with h5py.File(os.path.join(path, f"pannuke_{tmp_fold}.h5"), "w") as f:
    img_path = glob(os.path.join(path, tmp_fold, "*", "images", tmp_name, "images.npy"))[0]
    gt_path = glob(os.path.join(path, tmp_fold, "*", "masks", tmp_name, "masks.npy"))[0]
You only take a single one of the files here? Is this on purpose?
I would have assumed that the code should look something like this instead:
img_paths = sorted(glob(..., "images.npy"))  # important to do sorted here, so that you get the same order for masks and images
gt_paths = sorted(glob(..., "masks.npy"))
for i, (img_path, gt_path) in enumerate(zip(img_paths, gt_paths)):
    img = np.load(img_path)
    gt = np.load(gt_path)
    assert img.shape == gt.shape  # or similar; but make sure that the shapes match, you might need to account for a different number of channels
    out_path = os.path.join(..., f"pannuke_{tmp_fold}_{i}.h5")
    with h5py.File(out_path, "w") as f:
        ...
That way you would use all the images in a given fold
Ah, it's to provide the option to (only) use specific folds of the dataset (there are 3 in total now; I took the inspiration for this from cremi). Do you think we should do the download and h5 conversion at once for all the folds already?
> Ah, it's to provide the option to (only) use specific folds of the dataset (there are 3 in total now; I took the inspiration for this from cremi).

That's not what I meant. Having the separate folds is ok. But with the glob here, accessing only the first match looks like you're selecting just a single image:
glob(os.path.join(path, tmp_fold, "*", "images", tmp_name, "images.npy"))[0]
Maybe I am also understanding the data organization wrong and it's many images stacked. But if that is the case we should save them in separate files to match the data organization expected by torch_em better.
Ahha yes, because we do the downloads first (for n folds) and then convert them once the download is done for all n folds. For instance, if we take all 3 folds into account, we download all of them first and then individually take care of the h5 conversions.
(However, now that I think about it, I could just do sorted(glob(os.path.join(path, "**", "images.npy"), recursive=True)) (and the same for masks.npy), and that should do the trick for accessing the respective samples, since it does the downloads first and the conversion afterwards. Nice.)
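The recursive-glob pairing described above can be sketched as a small helper. This is a hypothetical illustration (the function name find_image_label_pairs does not exist in the PR); it only assumes the data layout discussed here, with one images.npy and one masks.npy per downloaded fold directory.

```python
import os
from glob import glob


def find_image_label_pairs(path):
    """Pair up images.npy and masks.npy files across all downloaded folds.

    Sorting both glob results guarantees that images and masks line up
    in the same order, as suggested in the review.
    """
    img_paths = sorted(glob(os.path.join(path, "**", "images.npy"), recursive=True))
    gt_paths = sorted(glob(os.path.join(path, "**", "masks.npy"), recursive=True))
    # every image file needs a corresponding mask file
    assert len(img_paths) == len(gt_paths)
    return list(zip(img_paths, gt_paths))
```

The `**` pattern with `recursive=True` descends into arbitrarily nested sub-directories, so the exact per-fold folder structure does not matter.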
torch_em/data/datasets/pannuke.py
Outdated
gt_path = glob(os.path.join(path, tmp_fold, "*", "masks", tmp_name, "masks.npy"))[0]

f.create_dataset("images", data=np.load(img_path).transpose(3, 0, 1, 2))
f.create_dataset("masks", data=np.load(gt_path).transpose(3, 0, 1, 2))
I am a bit confused by how many channels we have here. I would have assumed that the image data has 3 dimensions (2 spatial ones and 1 channel dimension) and that the segmentation has 2 dimensions (only the spatial ones).
Ah never mind about the comment on the segmentations / labels, I saw the label trafo now. Still, I would only expect 3 dimensions, not four.
Just to be sure, the input image and input label dimensions look like this (S x H x W x C)
where,
- S is the number of slices
- C is the number of channels
- for the input images - it's RGB (3)
- for the input labels - it's 6 (5 tissue types and last channel is the background)
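Given that layout, the transpose(3, 0, 1, 2) in the diff just moves the channel axis to the front, turning S x H x W x C into C x S x H x W. A minimal sketch with dummy arrays (the shapes are illustrative, assuming 256 x 256 patches as in the dataset):

```python
import numpy as np

# Dummy arrays in the layout described above: S x H x W x C.
images = np.zeros((10, 256, 256, 3))  # RGB images, 3 channels
masks = np.zeros((10, 256, 256, 6))   # 5 class channels + 1 background channel

# transpose(3, 0, 1, 2) moves the channel axis first: C x S x H x W.
images_t = images.transpose(3, 0, 1, 2)
masks_t = masks.transpose(3, 0, 1, 2)
```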
torch_em/data/datasets/pannuke.py
Outdated
return torch_em.get_data_loader(ds, batch_size=batch_size, **loader_kwargs)


def label_trafo(labels):
I don't think this gives you a correct instance segmentation. Even if it does, it's too complex; you don't need the np.where. I would write it like this:
segmentation = np.zeros(labels.shape[1:])
max_id = 0
for label_channel in labels[:-1]: # from what I understand we can just ignore the last channel because it encodes background
this_labels = vigra.analysis.labelImage(label_channel) # connected components to make sure we have an instance segmentation
foreground = this_labels > 0
segmentation[foreground] = this_labels[foreground] + max_id
max_id = segmentation.max()
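A self-contained, runnable variant of that suggestion, with scipy.ndimage.label swapped in for vigra.analysis.labelImage (the function name merge_channel_instances is an illustrative choice, not the name used in the PR) — the idea is identical: connected components per channel, with each channel's instance ids offset so they never collide:

```python
import numpy as np
from scipy import ndimage


def merge_channel_instances(labels):
    """Merge per-channel binary masks into one instance segmentation.

    labels has shape C x H x W; the last channel encodes background
    and is skipped, as discussed in the review.
    """
    segmentation = np.zeros(labels.shape[1:], dtype="int64")
    max_id = 0
    for label_channel in labels[:-1]:
        # connected components ensure a proper instance segmentation
        this_labels, _ = ndimage.label(label_channel)
        foreground = this_labels > 0
        # offset this channel's ids past everything assigned so far
        segmentation[foreground] = this_labels[foreground] + max_id
        max_id = segmentation.max()
    return segmentation
```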
I refactored the snippets a slight bit with some minor updates (the major fix is on np.where).
torch_em/data/datasets/pannuke.py
Outdated
for img_path, gt_path in zip(img_paths, gt_paths):
    with h5py.File(os.path.join(path, f"pannuke_{fold}.h5"), "w") as f:
        f.create_dataset("images", data=np.load(img_path).transpose(3, 0, 1, 2))
# chunks: (3, 1, 256, 256) - C x 1 x H x W
chunks = (data.shape[-1], 1) + data.shape[1:3]
f.create_dataset(..., compression="gzip", chunks=chunks)
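Filling in the elided pieces, a runnable sketch of that chunking suggestion might look like this (the file name and array sizes are illustrative, assuming the S x H x W x C layout discussed earlier):

```python
import numpy as np
import h5py

# Hypothetical fold data in the layout discussed above: S x H x W x C.
data = np.random.rand(4, 256, 256, 3)

# Chunk per channel and per slice: C x 1 x H x W, matching the
# transposed on-disk layout so single-slice reads stay cheap.
chunks = (data.shape[-1], 1) + data.shape[1:3]  # (3, 1, 256, 256)

with h5py.File("pannuke_example.h5", "w") as f:
    f.create_dataset(
        "images",
        data=data.transpose(3, 0, 1, 2),
        compression="gzip",
        chunks=chunks,
    )
```

Gzip compression plus slice-sized chunks is a reasonable trade-off here: torch_em loaders typically read one 2D patch at a time, so each read touches only a few small chunks.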
This looks good now! I will also check it out later.