Data Loader Support #5

mathpluscode · 2020-06-10T21:33:45Z

Data Loader Support

To improve the user experience, we plan to provide default data loaders for common use scenarios. Currently, Nifti and H5 formats are supported. Other use cases and image formats require a customised data loader (add a link to the tutorial).

Data Format

There are some prerequisites on the data:

Data must be split into train / val / test sets beforehand and stored in separate directories, although the val and test sets are optional.
Each image or label is 3D. An image has shape (width, height, depth); a label has shape (width, height, depth) or (width, height, depth, num_labels).
The data do not have to share the same shape: all will be resized to the same shape before being fed in. To prevent unexpected effects, we recommend pre-processing all images to the desired shape.

Supported scenarios

Unpaired images (e.g. single-modality inter-subject registration)

Case 1-1 multiple independent images.
Case 1-2 multiple independent images and corresponding labels.

Grouped unpaired images (e.g. single-modality intra-subject registration)

Case 2-1 multiple subjects each with multiple images.
Case 2-2 multiple subjects each with multiple images and corresponding labels.

Paired images (e.g. two-modality intra-subject registration)

Case 3-1 multiple paired images.
Case 3-2 multiple paired images and corresponding labels.

Sampling during training

Sampling for multiple labels

Whenever corresponding labels are available and there are multiple types of labels, e.g. segmentations of different organs in a CT image, two options are available:

During one epoch, each image is sampled only once; when there are multiple labels, one label is randomly sampled each time. (Default)
During one epoch, each image is paired with each available label. So, if an image has four types of labels, it is sampled four times, each time with a different label.
When using multiple labels, it is the user's responsibility to ensure the labels are ordered consistently, such that the same label_idx in (width, height, depth, label_idx) refers to the same type of landmark or ROI across all labels.
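
The two label-sampling options above can be sketched as follows. This is a minimal illustration rather than the project's implementation; the function name `sample_index_pairs` and the option values `"sample"` and `"all"` are hypothetical.

```python
import random


def sample_index_pairs(num_labels_per_image, sample_label="sample", seed=None):
    """Return the (image_idx, label_idx) pairs visited in one epoch.

    num_labels_per_image: number of label types available for each image.
    sample_label: "sample" draws one random label per image (the default
                  option), "all" pairs each image with every label.
    """
    rng = random.Random(seed)
    pairs = []
    for image_idx, num_labels in enumerate(num_labels_per_image):
        if sample_label == "sample":
            # each image appears once, with one randomly chosen label
            pairs.append((image_idx, rng.randrange(num_labels)))
        elif sample_label == "all":
            # each image appears once per available label
            pairs.extend((image_idx, label_idx) for label_idx in range(num_labels))
        else:
            raise ValueError(f"unknown sample_label: {sample_label!r}")
    return pairs
```

With the first option the epoch length equals the number of images; with the second it equals the total number of labels across all images.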

Sampling for multiple subjects each with multiple images

When multiple subjects each with multiple images are available, several sampling methods are supported:

Inter-subject: one image is sampled from subject A as the moving image, and another image is sampled from a different subject B as the fixed image.
Intra-subject: two images are sampled from the same subject. In this case, we can specify:
a) the moving image always has a smaller index, e.g. at an earlier time;
b) the moving image always has a larger index, e.g. at a later time; or
c) no constraint on the order.

For the first two options, the intra-subject images will be sorted by name in ascending order to represent ordered sequential images, such as time-series data.
Multiple-label sampling is also supported once an image pair is sampled. If no consistent label types are defined between subjects, an option is available to turn off the label contribution to the loss for those inter-subject image pairs.
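
As a rough sketch of the inter-subject and intra-subject sampling described above (not the project's implementation), the function below returns a moving and a fixed image as (group_idx, image_idx) tuples. The function name and the mode names are hypothetical, and each inner list is assumed to be already sorted by file name.

```python
import random


def sample_pair(groups, mode, rng):
    """Sample (moving, fixed), each a (group_idx, image_idx) tuple.

    groups: list of lists of image names, one inner list per subject,
            each sorted in ascending order.
    mode: "inter", "intra_forward" (moving has the smaller index),
          "intra_backward" (moving has the larger index) or
          "intra_unconstrained".
    """
    if mode == "inter":
        # two different subjects, one image from each
        g_moving, g_fixed = rng.sample(range(len(groups)), 2)
        moving = (g_moving, rng.randrange(len(groups[g_moving])))
        fixed = (g_fixed, rng.randrange(len(groups[g_fixed])))
        return moving, fixed

    # intra-subject: only subjects with at least two images qualify
    candidates = [g for g, images in enumerate(groups) if len(images) >= 2]
    g = rng.choice(candidates)
    low, high = sorted(rng.sample(range(len(groups[g])), 2))
    if mode == "intra_forward":
        moving, fixed = low, high
    elif mode == "intra_backward":
        moving, fixed = high, low
    elif mode == "intra_unconstrained":
        moving, fixed = rng.sample([low, high], 2)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return (g, moving), (g, fixed)
```

For example, `sample_pair(groups, "intra_forward", random.Random(0))` always yields a moving image that precedes the fixed image within the same subject.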

Examples (folder structure and filename requirement)

In the following, we take the train directory as an example to show how the files should be stored.

Nifti Data Format

We assume each .nii.gz file contains only one tensor, which is either an image or a label.

Unpaired data

This is the simplest case. Data are assumed to be stored under train/images and train/labels directories.

Nifti Case 1-1 Images only

We only have images without any labels and all images are considered to be independent samples. So all data should be stored under train/images, e.g.:

train
- images
  - subject1.nii.gz
  - subject2.nii.gz
  - ...

(It is also ok if the data are further grouped into different directories under images as we will directly scan all nifti files under train/images.)

Nifti Case 1-2 Images with labels

In this case, we have both images and labels. So all images should be stored under train/images and all labels should be stored under train/labels. The corresponding image file name and label file name should be exactly the same, e.g.:

train
- images
  - subject1.nii.gz
  - subject2.nii.gz
  - ...
- labels
  - subject1.nii.gz
  - subject2.nii.gz
  - ...
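
A check for the naming requirement above could look like the following sketch. It scans both directories recursively and compares the relative file paths; the function name `match_images_and_labels` is hypothetical and not part of the proposed loaders.

```python
from pathlib import Path


def match_images_and_labels(data_dir):
    """Return the sorted relative paths of .nii.gz files found in both
    data_dir/images and data_dir/labels; raise if the two sets differ."""
    data_dir = Path(data_dir)

    def scan(sub_dir):
        root = data_dir / sub_dir
        # paths relative to images/ or labels/, so nested folders also match
        return {p.relative_to(root).as_posix() for p in root.rglob("*.nii.gz")}

    images, labels = scan("images"), scan("labels")
    if images != labels:
        # symmetric difference: files present on one side only
        raise ValueError(f"unmatched files: {sorted(images ^ labels)}")
    return sorted(images)
```

Calling `match_images_and_labels("train")` on the layout above would return the list of file names shared by both directories.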

Grouped unpaired images

Nifti Case 2-1 Images only

We have images without any labels, but images are grouped under different subjects/groups, e.g. time-series observations for each subject/group. For instance, the data set can be the CT scans of multiple patients (subjects/groups) where each patient has multiple scans acquired at different time points. So all data should be stored under train/images and the leaf directories (directories that do not have sub-directories) must represent different subjects/groups, e.g.:

train
- images
  - subject1
    - obs1.nii.gz
    - obs2.nii.gz
    - ...
  - subject2
    - obs1.nii.gz
    - obs2.nii.gz
    - ...
  - ...

(It is also ok if the data are grouped into different directories, but the leaf directories will be considered as different subjects/groups.)

Nifti Case 2-2 Images with labels

We have both images and labels. So all images should be stored under train/images and all labels should be stored under train/labels. The leaf directories will be considered as different subjects/groups and the corresponding image file name and label file name should be exactly the same, e.g.:

train
- images
  - subject1
    - obs1.nii.gz
    - obs2.nii.gz
    - ...
  - ...
- labels
  - subject1
    - obs1.nii.gz
    - obs2.nii.gz
    - ...
  - ...
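
The leaf-directory convention used for grouped data can be illustrated with a short sketch that groups relative file paths by their parent directory. The function name `group_by_leaf_dir` is hypothetical, and the paths are assumed to be given relative to train/images.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_leaf_dir(relative_paths):
    """Group file paths (relative to train/images) by parent directory."""
    groups = defaultdict(list)
    for path in relative_paths:
        p = PurePosixPath(path)
        # the directory that directly contains the file identifies the group
        groups[p.parent.as_posix()].append(p.name)
    # sort names so that the index order follows the file-name order
    return {group: sorted(names) for group, names in sorted(groups.items())}
```

Because the names are sorted as strings, obs10.nii.gz would come before obs2.nii.gz; zero-padded names such as obs02.nii.gz avoid this when the order matters.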

Paired images

In this case, images are paired, for example as multimodal moving and fixed image pairs to be registered. Data are assumed to be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels directories.

Nifti Case 3-1 Images only

We only have paired images without any labels. So all data should be stored under train/moving_images and train/fixed_images, and the images corresponding to the same subject should have exactly the same name, e.g.:

train
- moving_images
  - subject1.nii.gz
  - subject2.nii.gz
  - ...
- fixed_images
  - subject1.nii.gz
  - subject2.nii.gz
  - ...

(It is ok if the data are further grouped into different directories under train/moving_images and train/fixed_images as we will directly scan all nifti files under them.)

Nifti Case 3-2 Images with labels

We have both images and labels. So all data should be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels. The images and labels corresponding to the same subjects/groups should have exactly the same names, e.g.:

train
- moving_images
  - subject1.nii.gz
  - subject2.nii.gz
  - ...
- fixed_images
  - subject1.nii.gz
  - subject2.nii.gz
  - ...
- moving_labels
  - subject1.nii.gz
  - subject2.nii.gz
  - ...
- fixed_labels
  - subject1.nii.gz
  - subject2.nii.gz
  - ...
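
The requirement that all four directories contain the same names can be checked with a sketch like the one below. The function name `check_paired_names` is hypothetical; it takes the names already listed per directory, so it applies equally to file names and to H5 keys.

```python
def check_paired_names(names_by_dir):
    """Verify that every directory lists exactly the same names.

    names_by_dir: dict mapping a directory name (e.g. "moving_images")
                  to an iterable of file names or keys.
    Returns the sorted common names, or raises ValueError on a mismatch.
    """
    # the first directory serves as the reference for all others
    reference_dir, *other_dirs = names_by_dir
    reference = set(names_by_dir[reference_dir])
    for other in other_dirs:
        diff = reference ^ set(names_by_dir[other])
        if diff:
            raise ValueError(
                f"{reference_dir} and {other} differ in: {sorted(diff)}"
            )
    return sorted(reference)
```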

H5 Data Format

Each .h5 file is similar to a dictionary, having multiple key-value pairs. Hierarchical multi-level h5 indexing is not used. Each value is either an image or a label.

Unpaired images

H5 Case 1-1 Images only

Each key corresponds to one image, e.g. {"subject1": data1, "subject2": data2, ...}. All data should be stored under train/images, in either a single h5 file or multiple h5 files, e.g.:

train
- images
  - part1.h5
  - part2.h5
  - ...

H5 Case 1-2 Images with labels

Each key corresponds to one subject. Data can be stored in two h5 files (one for images and one for labels), and the keys in the two files should be the same, e.g.:

train
- images
  - data.h5 (keys = ["subject1", "subject2", ...])
- labels
  - data.h5 (keys = ["subject1", "subject2", ...])

Grouped unpaired images

H5 Case 2-1 Images only

Similar to case 1-1 above, but the keys in this case must follow a common format such as subject%d-%d, where each %d represents a number. For instance, subject3-2 corresponds to the second observation of subject 3. Otherwise, the file structure is the same as in case 1-1, e.g.:

train
- images
  - part1.h5 (keys = ["subject1-1", "subject1-2", "subject2-1", ...])
  - part2.h5
  - ...
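
Parsing keys of the form subject%d-%d can be sketched as below. This is an illustration only; the function name `group_h5_keys` and the exact regular expression are assumptions, and the keys are passed in as plain strings, for example the key listing of an opened H5 file.

```python
import re
from collections import defaultdict

# subject number, a hyphen, then observation number, e.g. "subject3-2"
KEY_PATTERN = re.compile(r"^subject(\d+)-(\d+)$")


def group_h5_keys(keys):
    """Map each subject number to its sorted observation numbers."""
    groups = defaultdict(list)
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match is None:
            raise ValueError(f"key {key!r} does not follow subject%d-%d")
        subject, observation = map(int, match.groups())
        groups[subject].append(observation)
    return {s: sorted(obs) for s, obs in sorted(groups.items())}
```

Parsing the numbers as integers means observations are ordered numerically, so subject1-10 sorts after subject1-2.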

H5 Case 2-2 Images with labels

Similar to case 1-2 and 2-1 above, the keys have to share the same format like subject%d-%d and the keys for images and labels should be consistent.

train
- images
  - part1.h5 (keys = ["subject1-1", "subject1-2", ...])
  - part2.h5 (keys = ["subject101-1", "subject101-2", ...])
  - ...
- labels
  - part1.h5 (keys = ["subject1-1", "subject1-2", ...])
  - part2.h5 (keys = ["subject101-1", "subject101-2", ...])
  - ...

Paired images

In this case, data are paired. Data are assumed to be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels directories.

H5 Case 3-1 Images only

We only have paired images without any labels. So all data should be stored under train/moving_images and train/fixed_images, and the keys corresponding to the same subject should be the same, e.g.:

train
- moving_images
  - part1.h5 (keys = ["subject1", "subject2", ...])
  - part2.h5
  - ...
- fixed_images
  - part1.h5 (keys = ["subject1", "subject2", ...])
  - part2.h5
  - ...

H5 Case 3-2 Images with labels

We have both images and labels. So all data should be stored under train/moving_images, train/fixed_images, train/moving_labels, and train/fixed_labels. The keys corresponding to the same subject should be the same, e.g.:

train
- moving_images
  - data.h5 (keys = ["subject1", "subject2", ...])
- fixed_images
  - data.h5 (keys = ["subject1", "subject2", ...])
- moving_labels
  - data.h5 (keys = ["subject1", "subject2", ...])
- fixed_labels
  - data.h5 (keys = ["subject1", "subject2", ...])

ucl-candi · 2020-06-12T18:13:29Z

@tvercaut can you review this please?

ucl-candi · 2020-06-14T22:15:05Z

@QianyeYang could you please also review this and comment if any? Thanks!

YipengHu self-assigned this Jun 14, 2020

NMontanaBrown added a commit that referenced this issue Jun 15, 2020

Issue #5: modified travis setup to extend the wait time for tests to …

cff73b5

…30 minutes

NMontanaBrown added a commit that referenced this issue Jun 15, 2020

Issue #5: moved test files into specific folders and modified travis …

fd71b3e

…config

NMontanaBrown added a commit that referenced this issue Jun 15, 2020

Issue #5: updated travis config to run coverage with tests

08bd838

NMontanaBrown added a commit that referenced this issue Jun 15, 2020

Issue #5: updated travis config to remove coverage

675b07f

mathpluscode added a commit that referenced this issue Jun 15, 2020

issue #5 improve yaml parser

72e0ce1

mathpluscode added a commit that referenced this issue Jun 15, 2020

issue #5 correct typo

875ff88

mathpluscode added a commit that referenced this issue Jun 15, 2020

Issue #5: refactor train/predict

0eb2e8c

mathpluscode added a commit that referenced this issue Jun 15, 2020

Issue #5: extend to (un)paired and (un)labeled

52f8336

mathpluscode added a commit that referenced this issue Jun 15, 2020

Issue #5: add err msg for sample label in predict

56b27e9

ucl-candi added this to the Pre-alpha-0-loader milestone Jun 16, 2020

ucl-candi mentioned this issue Jun 16, 2020

Adding h5 data loaders #32

Closed

ucl-candi added the help wanted and question labels Jun 16, 2020

ucl-candi assigned mathpluscode and s-sd Jun 16, 2020

mathpluscode added a commit that referenced this issue Jun 16, 2020

issue #5 refactor ddf code

2bc6cd1

mathpluscode added a commit that referenced this issue Jun 19, 2020

issue #5 format change

18c9ff0

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 rename unpaired loader

bbf9a4d

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 add support on unpaired and unlabeled

773cf7c

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 merge unlabeled and labeled

7819dda

mathpluscode mentioned this issue Jun 21, 2020

5 data loader support #62

Merged

8 tasks

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 update readme

165d105

YipengHu closed this as completed in #62 Jun 21, 2020

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 group unpaired data

0822b83

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 add configs

bc6374c

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 rename subject to group

3783e54

mathpluscode added a commit that referenced this issue Jun 21, 2020

issue #5 regroup data

61f4f5e

mathpluscode added a commit that referenced this issue Jun 22, 2020

issue #5 rename undirected to unconstrained

c1705e1

mathpluscode added a commit that referenced this issue Jun 22, 2020

issue #5 fix config

77d3542

s-sd pushed a commit that referenced this issue Jul 2, 2020

Issue #5: modified travis setup to extend the wait time for tests to …

9289582

…30 minutes

s-sd pushed a commit that referenced this issue Jul 2, 2020

Issue #5: moved test files into specific folders and modified travis …

11f6911

…config

s-sd pushed a commit that referenced this issue Jul 2, 2020

Issue #5: updated travis config to run coverage with tests

70b19e4

s-sd pushed a commit that referenced this issue Jul 2, 2020

Issue #5: updated travis config to remove coverage

59b02e6

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 improve yaml parser

61aa9b3

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 correct typo

30f8122

s-sd pushed a commit that referenced this issue Jul 2, 2020

Issue #5: refactor train/predict

a67b12b

s-sd pushed a commit that referenced this issue Jul 2, 2020

Issue #5: extend to (un)paired and (un)labeled

e54a4c3

s-sd pushed a commit that referenced this issue Jul 2, 2020

Issue #5: add err msg for sample label in predict

dd748d9

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 refactor ddf code

df63991

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 format change

5899a95

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 rename unpaired loader

dc55f20

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 add support on unpaired and unlabeled

f6a794c

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 merge unlabeled and labeled

22a96d7

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 update readme

da05cb7

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 group unpaired data

f12cfa0

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 add configs

5a4cdef

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 rename subject to group

ba632e6

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 regroup data

9ee0d92

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 reduce test network size

82617c8

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 less grouped data

bb73a53

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 share data generator, define sample index generator

881ac5b

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 reduce num samples for mix in grouped case

9ec399c

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5: update grouped data

4f09255

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 correct grouped data loader

b25e9f1

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 add type of data

94039b9

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 rename undirected to unconstrained

eb0f719

s-sd pushed a commit that referenced this issue Jul 2, 2020

issue #5 fix config

de4f449
