This repository has been archived by the owner on Jul 2, 2021. It is now read-only.

Add SliceableDataset #454

Merged
merged 32 commits into from
Apr 17, 2018
Merged

Conversation

@Hakuyume (Member) commented Oct 14, 2017

related to #453

This PR adds AnnotatedImageDatasetMixin.

This mix-in requires the following methods:

  • __len__: returns the length of the dataset
  • get_image: takes an index and returns an image
  • get_annotation: takes an index and returns annotation(s)

This mix-in provides the following method and attribute:

  • __getitem__: takes an index/slice and returns example(s). An example is a tuple of image and annotations. If get_annotation returns (anno0, anno1), an example will be (img, anno0, anno1).
  • annotations: a dataset without images. If get_annotation returns (anno0, anno1), an example of this dataset will be (anno0, anno1).
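As an illustration, a minimal sketch of the described interface might look like this (the mix-in body and the class names below are my own assumptions, not the PR's actual code):

```python
class AnnotatedImageDatasetMixin(object):
    """Hypothetical sketch of the mix-in described above."""

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        anno = self.get_annotation(index)
        if not isinstance(anno, tuple):
            anno = (anno,)
        return (self.get_image(index),) + anno

    @property
    def annotations(self):
        # a view of the dataset that skips image loading
        return _AnnotationView(self)


class _AnnotationView(object):
    def __init__(self, dataset):
        self._dataset = dataset

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, index):
        return self._dataset.get_annotation(index)


class SampleBboxDataset(AnnotatedImageDatasetMixin):
    # a toy dataset implementing the three required methods
    def __len__(self):
        return 10

    def get_image(self, i):
        return 'img_{:d}'.format(i)

    def get_annotation(self, i):
        return 'bbox_{:d}'.format(i), 'label_{:d}'.format(i)


dataset = SampleBboxDataset()
print(dataset[0])              # ('img_0', 'bbox_0', 'label_0')
print(dataset.annotations[0])  # ('bbox_0', 'label_0')
```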

@Hakuyume Hakuyume mentioned this pull request Oct 14, 2017
@yuyu2172 (Member) commented Oct 14, 2017

I like how you kept the design simple.
I think it is worthwhile to discuss the possibility of supporting quick access to annotations without an extra abstraction.

I found two problems.
First, the dataset needs to retrieve all annotations together.
This may become annoying when the annotations are a label and a mask.
Loading the mask will prevent users from quickly accessing the labels.
Second, this may be less efficient when the annotations do not need to be fetched sequentially.
This is the case when retrieving all the labels of ImageNet.
I ran a small benchmark, and there is a noticeable overhead.

import time
import numpy as np

x = np.arange(1000000)   # 1M

start = time.time()
out = list()
for i in range(len(x)):
    out.append(x[i])
print(time.time() - start)  # 0.1742 s

Also, there is another person working on a similar feature in the Chainer project.
chainer/chainer#3252

@Hakuyume (Member Author)

First, the dataset needs to retrieve all annotations together.
This may become annoying when the annotations are a label and a mask.
Loading the mask will prevent users from quickly accessing the labels.

Yes. If the annotations contain large data, my API is not efficient.
For example, my API is not a good fit for semantic segmentation datasets.

Second, this may be less efficient when the annotations do not need to be fetched sequentially.
This is the case when retrieving all the labels of ImageNet.
I ran a small benchmark, and there is a noticeable overhead.

What do you mean?

Also, there is another person making similar feature in Chainer project.
chainer/chainer#3252

I see. That PR is more general than mine. If that PR is merged, we don't need to add labels to BboxDataset APIs.

@yuyu2172 (Member)

Second, this may be less efficient when the annotations do not need to be fetched sequentially.
This is the case when retrieving all the labels of ImageNet.
I ran a small benchmark, and there is a noticeable overhead.

If a dataset has an array labels as an attribute, you can access it instantly (i.e. labels = dataset.labels).
In your design, you need to run a for-loop to get labels, which is slow.
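The difference can be sketched as follows (the class names are hypothetical, just to contrast the two access patterns):

```python
import numpy as np

class AttrDataset(object):
    """Labels stored as an array attribute: access returns the array directly."""
    def __init__(self, n):
        self.labels = np.arange(n)

class LoopDataset(object):
    """Labels behind a per-example getter: collecting them needs a Python loop."""
    def __init__(self, n):
        self._labels = np.arange(n)

    def get_annotation(self, i):
        return self._labels[i]

n = 1000000
fast = AttrDataset(n).labels  # no loop: the whole array at once

dataset = LoopDataset(n)
slow = [dataset.get_annotation(i) for i in range(n)]  # one Python call per example
```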

@Hakuyume (Member Author)

If a dataset has an array labels as an attribute, you can access it instantly (i.e. labels = dataset.labels).
In your design, you need to run a for-loop to get labels, which is slow.

I see. That makes sense. However, if users use only a subset of the dataset and the labels are stored in multiple files, returning a list is not efficient. For example, if I want to filter the first 100 examples of VOCBboxDataset, loading the label of the 101st image is unnecessary.

@yuyu2172 (Member)

If I want to filter the first 100 examples of VOCBboxDataset, loading the label of the 101st image is unnecessary.

I see.

Since the purpose of this PR conflicts with #3252 and this is not unique to Vision, I think the design should be discussed with the Chainer team.
BTW, your design seems to be better because extending it is a lot more intuitive.

@yuyu2172 (Member) commented Oct 14, 2017

Since the purpose of this PR conflicts with #3252 and this is not unique to Vision, I think the design should be discussed with the Chainer team.

Since #3252 does not seem to solve the efficiency problem that we have been discussing, I changed my mind and now think that this dataset is appropriate for ChainerCV.
The problem AnnotatedImageDatasetMixin solves is specific to vision.
In that sense, it is different from datasets like ConcatenatedDataset.
Do you think this distinction is reasonable?

First, the dataset needs to retrieve all annotations together.
This may become annoying when the annotations are a label and a mask.
Loading the mask will prevent users from quickly accessing the labels.

img, label, mask is quite a common scenario (e.g. Instance Segmentation).
Do you have any workarounds in mind?

@Hakuyume (Member Author) commented Oct 15, 2017

Since #3252 does not seem to solve the efficiency problem that we have been discussing, I changed my mind and now think that this dataset is appropriate for ChainerCV.
The problem AnnotatedImageDatasetMixin solves is specific to vision.
In that sense, it is different from datasets like ConcatenatedDataset.
Do you think this distinction is reasonable?

I think it is reasonable to keep our own mix-in. The style of chainer/chainer#3252 is not optimized for ChainerCV. If we use chainer/chainer#3252, we have to implement 5 methods for each dataset.
In terms of efficiency, chainer/chainer#3252 seems to solve the problem. We can override extract_feature to avoid calling get_example.

img, label, mask is quite common scenario (e.g. Instance Segmentation).
Do you have any workarounds in mind?

A simple solution is excluding mask from get_annotation and overriding get_example to return mask. Perhaps get_lightweight_annotation is a proper name.
The code will be as follows.

    def get_lightweight_annotation(self, i):
        ...
        return label

    def get_image(self, i):
        ...
        img = read_image(...)
        return img

    def get_example(self, i):  # the default implementation returns (img, label)
        ...
        mask = read_image(...)
        ...
        label = self.get_lightweight_annotation(i)
        img = self.get_image(i)
        return img, mask, label

This workaround can be used for prob_map of CUB.

@yuyu2172 (Member)

A simple solution is excluding mask from get_annotation and overriding get_example to return mask. Perhaps get_lightweight_annotation is a proper name.
The code will be as follows.

I see two problems.
First, there is a situation where it is more efficient and readable to compute label and mask together.
This happens when the logic to compute the two annotations is similar.
For instance, here: https://github.com/yuyu2172/chainercv/blob/5c11243351ce7dfb95be659ee343a74e262e614d/chainercv/datasets/coco/coco_bbox_dataset.py#L151.

Second, it became more complicated than it was.
get_annotation is obvious about what it returns, but get_lightweight_annotation is not.

How about setting an attribute that configures get_annotation when necessary?
For instance, set an attribute like label_only.
This makes the dataset load only labels when annotations is called.

BTW, shouldn't get_annotation be get_annotations?

@Hakuyume (Member Author)

First, there is a situation where it is more efficient and readable to compute label and mask together.
This happens when the logic to compute the two annotations is similar.
For instance, here: https://github.com/yuyu2172/chainercv/blob/5c11243351ce7dfb95be659ee343a74e262e614d/chainercv/datasets/coco/coco_bbox_dataset.py#L151.

In that case, get_annotation should contain mask because it does not require heavy computation.

Second, it became more complicated than it was.
get_annotation is obvious about what it returns, but get_lightweight_annotation is not.

Yes. That is a problem.

@yuyu2172 (Member)

In that case, get_annotation should contain mask because it does not require heavy computation.

What do you mean?
mask is still heavy to load.
It gets slightly more efficient if computed together with other annotations.
Ideally, it should be possible to load all of them together and load annotations without masks.

@Hakuyume (Member Author)

I'm trying to implement better APIs.

@Hakuyume (Member Author) commented Oct 15, 2017

How about this API? I have implemented PickableDataset.

class MyDataset(PickableDataset):
    def __init__(self):
        super().__init__()

        self.data_names = ('img', 'bbox', 'label', 'mask')
        self.add_getter('img', self.get_image)
        self.add_getter('mask', self.get_mask)
        self.add_getter(('bbox', 'label'), self.get_bbox_label)

    def get_image(self, i):
        print('get_image')
        return 'img_{:d}'.format(i)

    def get_mask(self, i):
        print('get_mask')
        return 'mask_{:d}'.format(i)

    def get_bbox_label(self, i):
        print('get_bbox_label')
        return 'bbox_{:d}'.format(i), 'label_{:d}'.format(i)


dataset = MyDataset()
print(dataset[0])
# get_image
# get_bbox_label
# get_mask
# ('img_0', 'bbox_0', 'label_0', 'mask_0')

picked_dataset = dataset.pick('label')
print(picked_dataset[1])
# get_bbox_label
# label_1

picked_dataset = dataset.pick('img', 'label', 'bbox')
print(picked_dataset[2])
# get_image
# get_bbox_label
# ('img_2', 'label_2', 'bbox_2')
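For reference, one way such a class could work internally (this is my own sketch, not the PR's implementation; getter results are cached per example so that each getter runs at most once per access):

```python
class PickableDataset(object):
    """Hypothetical sketch of the mechanism demonstrated above."""

    def __init__(self):
        self.data_names = ()
        self._getters = {}  # data name -> (getter, index into its return value)

    def add_getter(self, names, getter):
        if isinstance(names, str):
            self._getters[names] = (getter, None)
        else:
            for index, name in enumerate(names):
                self._getters[name] = (getter, index)

    def get_data(self, i, names):
        cache = {}
        values = []
        for name in names:
            getter, index = self._getters[name]
            if getter not in cache:
                cache[getter] = getter(i)  # each getter runs at most once
            result = cache[getter]
            values.append(result if index is None else result[index])
        return tuple(values) if len(values) > 1 else values[0]

    def __getitem__(self, i):
        return self.get_data(i, self.data_names)

    def pick(self, *names):
        return PickedDataset(self, names)


class PickedDataset(object):
    """A view that fetches only the picked data names."""

    def __init__(self, base, names):
        self._base = base
        self._names = names

    def __getitem__(self, i):
        return self._base.get_data(i, self._names)


class MyBboxLabelDataset(PickableDataset):
    # mirrors the usage pattern of MyDataset above (hypothetical names)
    def __init__(self):
        super().__init__()
        self.data_names = ('img', 'bbox', 'label')
        self.add_getter('img', lambda i: 'img_{:d}'.format(i))
        self.add_getter(('bbox', 'label'),
                        lambda i: ('bbox_{:d}'.format(i), 'label_{:d}'.format(i)))


dataset = MyBboxLabelDataset()
print(dataset[0])                            # ('img_0', 'bbox_0', 'label_0')
print(dataset.pick('label')[1])              # label_1
print(dataset.pick('img', 'label', 'bbox')[2])  # ('img_2', 'label_2', 'bbox_2')
```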

@Hakuyume changed the title from "[WIP] Add AnnotatedImageDatasetMixin" to "[WIP] Add PickableDataset" on Oct 15, 2017
@Hakuyume Hakuyume mentioned this pull request Oct 17, 2017
@yuyu2172 (Member)

I am not sure whether we should introduce a new abstraction or stay with the minimal abstraction (DatasetMixin).
Since supporting easy access to labels can be achieved without any new abstraction, I would like to know why this kind of abstraction is needed.

In my opinion, given that the class is intended to be extended by users, it has to be perfect or it should not be used.

I think there is still room for improvement.
For instance, PickableDataset returns a PickedDataset via the method pick.
But the returned object doesn't support pick itself.
Since pick() is somewhat equivalent to a[:, ***] of arrays, it feels more natural if the returned dataset supports the same interface.

@Hakuyume (Member Author)

Since supporting easy access to labels can be achieved without any new abstraction, I would like to know why this kind of abstraction is needed.

I don't think providing partial access only to labels is enough. For example, some users want to filter images by the size of bounding boxes. In this case, partial access to bboxes is required. With this class, we can implement partial access for all data easily. In particular, the return_*** options will become simpler. Please check #457.

In my opinion, given that the class is intended to be extended by users, it has to be perfect or it should not be used.

I agree. I would like to make this class as perfect as possible.

I think there is still room for improvement.
For instance, PickableDataset returns a PickedDataset via the method pick.
But the returned object doesn't support pick itself.
Since pick() is somewhat equivalent to a[:, ***] of arrays, it feels more natural if the returned dataset supports the same interface.

Good idea. I'll fix it.

@yuyu2172 (Member)

I agree. I would like to make this class as perfect as possible.

Ideally, it is beneficial to have a dataset abstraction that never loses functionality when it is wrapped.
For instance, currently, an attribute becomes inaccessible when a dataset is used together with SubDataset or TransformDataset. The problem with pick is similar to this problem.

On top of that, we have another demand: a subset of dataset annotations should be easily and efficiently accessible. This is the problem that we have been discussing from the beginning.

With that in mind, I suggest introducing a new dataset class, possibly different from DatasetMixin.
In short, I suggest making two styles of datasets for Chainer. The first involves thin abstractions (e.g. DatasetMixin and TransformDataset). The second is more complex, but combines all the functionality together. Users can choose based on how much abstraction they want from a base dataset class.

If we take this strategy, chainer/chainer#3343 and chainer/chainer#3252 should be replaced.

I don't think providing partial access only to labels is enough. For example, some users want to filter images by the size of bounding boxes. In this case, partial access to bboxes is required. With this class, we can implement partial access for all data easily. In particular, the return_*** options will become simpler. Please check #457.

We can add bboxes as a property as well.

@Hakuyume (Member Author)

We can add bboxes as a property as well.

Will you add labels, bboxes, labels_bboxes, difficulties, labels_difficulties, bboxes_difficulties ... ?

@yuyu2172 (Member) commented Oct 18, 2017

Will you add labels, bboxes, labels_bboxes, difficulties, labels_difficulties, bboxes_difficulties ... ?

I see what you want to say. If all of them are going to be supported, we need to take a more general approach with a base dataset class.
If an ad hoc approach is taken, I would probably start with labels only, or labels and bboxes.

@yuyu2172 (Member) commented Mar 1, 2018

Can you add at least one code example to the docstring?
Also, it would be better to note in the docstrings that the entire module is planned to be removed.

@yuyu2172 (Member) commented Mar 6, 2018

@Hakuyume
Can you work on this?

@Hakuyume changed the title from "[WIP] Add SliceableDataset" to "Add SliceableDataset" on Apr 6, 2018
class ConcatenatedDataset(SliceableDataset):
"""A sliceable version of :class:`chainer.datasets.ConcatenatedDataset`.

Hew is an example.
Member:

Here

Args:
datasets: The underlying datasets.
Each dataset should inherit
:class:~chainer.datasets.sliceable.Sliceabledataset`.
Member:

no period

Member:

Not chainer.datasets, but chainercv.*



class GetterDataset(SliceableDataset):
"""A sliceable dataset class that defined by getters.
Member:

that is defined with

This ia a dataset class that supports slicing.
A dataset class inheriting this class should implement
three methods: :meth:`__len__`, :meth:`keys`, and
:meth:`get_example_by_keys`.
Member:

Should we recommend users to use GetterDataset?

Member:

I mean the relationship between SlicableDataset and GetterDataset should be clear.

From users perspective,
they would first come and read SlicableDataset. Since this is not intended to be directly touched by users, we should guide them to GetterDataset.

Member Author:

Nice suggestion. I agree with you.

Note that it reuqires :obj:`keys` to determine the names of returned
values.

Hew is an example.
Member:

Here

class TupleDataset(SliceableDataset):
"""A sliceable version of :class:`chainer.datasets.TupleDataset`.

Hew is an example.
Member:

Here


Args:
datasets: The underlying datasets.
Following datasets are acceptable.
Member:

The following datasets


>>> class SliceableLabeledImageDataset(GetterDataset):
>>> def __init__(self, pairs, root='.'):
>>> super().__init__()
Member:

Python 2 would not work.

>>> # get a subset with label = 0, 1, 2
>>> # no images are loaded
>>> indices = [i for i, label in
>>> enumerate(dataset.slice[:, 'label']) if label in {0, 1, 2}]
Member:

the indentation is weird

def slice(self):
return SliceHelper(self)

def __iter__(self):
Member:

Do we need this?

Member Author:

It is not necessary for the sliceable functionality. I added this for convenience.

In the case of calculating statistics of labels, we can write

collections.Counter(dataset[:, 'label'])

Without __iter__, we have to write

collections.Counter(dataset[:, 'label'][:])  # temporary list creation (not good for both speed and memory)
# or 
collections.Counter(dataset[:, 'label'][i] for i in range(len(dataset)))  # lengthy 
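An __iter__ along these lines makes the first form work without materializing a list (a sketch using a stand-in view class, not the PR's code):

```python
import collections

class LabelView(object):
    """Stand-in for a sliced label view such as dataset.slice[:, 'label']."""
    def __init__(self, labels):
        self._labels = labels

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, i):
        return self._labels[i]

    def __iter__(self):
        # yield examples lazily; no temporary list is created
        for i in range(len(self)):
            yield self[i]

view = LabelView([0, 1, 1, 2, 1])
print(collections.Counter(view))  # Counter({1: 3, 0: 1, 2: 1})
```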

img, bbox, label = dataset[0] # get_image(0) and get_annotation(0)

view = dataset.slice[:, 'label']
label = view[1] # get_annotation(0)
Member:

get_annotation(1)

return bbox, label


dataset = SampleBboxdataset()
Member:

typo

from chainercv.datasets import VOCBboxDataset
dataset = VOCBboxDataset()

# the view of label of the first 100 examples
Member:

the view of the labels of the first 100 examples

# the view of first 100 examples
view = dataset.slice[:100]

# the view of last 100 examples
Member:

the last

from chainercv.datasets import VOCBboxDataset
dataset = VOCBboxDataset()

# the view of first 100 examples
Member:

the first

where :method:`DatasetMixin.__getitem__` conducts :method:`get_example` for all required examples.
Users can write efficient code by this view.

This example counts the number of images that contain dogs.
Member:

... contain dogs without loading any images.

We assume that readers have a basic understanding of Chainer dataset (e.g. understand :class:`chainer.dataset.DatasetMixin`).

In ChainerCV, we introduce `sliceable` feature to datasets.
SliceableT datasets support :method:`slice` that returns a sub view of the dataset.
Member:

typo

Member:

a sub view --> a view

In ChainerCV, we introduce `sliceable` feature to datasets.
SliceableT datasets support :method:`slice` that returns a sub view of the dataset.

This example that shows the basic usage.
Member:

This example shows the basic usage.

img, bbox, label = dataset[0] # get_image(0) and get_annotation(0)

view = dataset.slice[:, 'label']
label = view[1] # get_annotation(0)
Member:

get_annotation(1)

class GetterDataset(SliceableDataset):
"""A sliceable dataset class that is defined with getters.

This ia a dataset class with getters.
Member:

This ia --> This is a

Member:

How about adding a comment that lets users know about the tutorial?

Please refer to the tutorial for more detailed explanation.

The following datasets are acceptable.

* An inheritance of \
:class:~chainer.datasets.sliceable.SliceableDataset`.
Member:

does not compile properly

Args:
datasets: The underlying datasets.
Each dataset should inherit
:class:~chainercv.chainer_experimental.datasets.sliceable.Sliceabledataset`
Member:

does not compile properly

@Hakuyume (Member Author)

The result of the benchmark in the tutorial:

w/ slice: 5.822030544281006 secs
632 images contain dogs

w/o slice: 76.81596517562866 secs
632 images contain dogs
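The benchmarked pattern is presumably along these lines (a sketch with stand-in data; the real tutorial code uses VOCBboxDataset and its label names, and the class id below is hypothetical):

```python
import numpy as np

dog = 11  # hypothetical class id for 'dog'

# label_view stands in for dataset.slice[:, 'label']: iterating over it yields
# the per-image label arrays without loading any image data
label_view = [np.array([11, 3]), np.array([5]), np.array([11]), np.array([2])]

n_dogs = sum(1 for label in label_view if dog in label)
print('{} images contain dogs'.format(n_dogs))  # 2 images contain dogs
```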

@yuyu2172 yuyu2172 merged commit 5fda635 into chainer:master Apr 17, 2018
@Hakuyume Hakuyume deleted the annotated-dataset-mixin branch April 17, 2018 12:49
@yuyu2172 yuyu2172 added this to the v0.9 milestone Apr 17, 2018