This repository has been archived by the owner on Jul 2, 2021. It is now read-only.

Support MixUpSoftLabelDataset. #497

Merged on Apr 4, 2018 (14 commits)

Conversation

fukatani
Contributor

Between-class learning is a simple and effective technique, so I think it would be valuable to merge it into ChainerCV.
https://arxiv.org/abs/1711.10284

@yuyu2172
Member

yuyu2172 commented Dec 21, 2017

Hi, thanks for the timely PR. I am interested in supporting this feature.
I checked the BC paper and found that it differs slightly from the more popular MixUp paper, https://arxiv.org/abs/1710.09412.

  1. MixUp also mixes positive pairs (two samples of the same class), but BC only mixes negative pairs (two samples of different classes).
  2. BC proposes two formulas for mixing two samples. The first one is the same as MixUp. The second one is different.

Since MixUp is simpler, it is easier to include in a library.
I would suggest adding only MixUp, and not BC, to ChainerCV.
However, if there is a strong advantage for BC (note: I would like to know this!), it can be integrated into ChainerCV together with MixUp.

Having said that, I have several design comments.

  1. Can you change the name to MixUpSoftLabelDataset? We call a dataset *LabelDataset when it returns an image and an integer. Since MixUp returns an array of shape (n_class,) as the label, it needs to be differentiated. Thus, I would name it *SoftLabelDataset. Also, MixUp is easier to recognize, and I prefer that name.
  2. I found that it is common in CV to fetch two samples from two datasets. Thus, I made a dataset class called SiameseDataset that supports this functionality in a different PR (Add SiameseDataset #505). Could you wait until that one is merged into ChainerCV, and then change your code to use SiameseDataset? Sorry for the inconvenience on this point.
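
For reference, the MixUp mixing rule discussed in this comment can be sketched in NumPy roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the helper name `mixup_pair` and the default `alpha` are made up here, and the weight is drawn from Beta(alpha, alpha) as described in the MixUp paper.

```python
import numpy as np


def mixup_pair(img_0, label_0, img_1, label_1, n_class, alpha=1.0, rng=None):
    """Mix two labeled images into one image and one soft label.

    Hypothetical helper for illustration; not part of ChainerCV.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Mixing weight in [0, 1], drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha)
    mixed_img = lam * img_0 + (1.0 - lam) * img_1
    # Soft label: a float array of shape (n_class,) instead of an integer.
    mixed_label = np.zeros(n_class, dtype=np.float32)
    mixed_label[label_0] += lam
    mixed_label[label_1] += 1.0 - lam
    return mixed_img.astype(np.float32), mixed_label
```

If both samples share a class, the two weights land on the same entry, so the soft label has one nonzero weight instead of two.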


"""

def __init__(self, dataset, dtype=numpy.float32, label_dtype=numpy.float32,
Member

Could you remove dtype and label_dtype from arguments?
By default, images and soft-labels are dtype=np.float32.
If different dtypes are needed, transforms can be used.
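
As a rough illustration of this suggestion, a different dtype could be obtained by applying a transform to each example instead of passing dtype arguments to the constructor. The function name `to_float16` and the target dtype are assumptions chosen for the example.

```python
import numpy as np


def to_float16(in_data):
    """Cast a (mixed image, soft label) example to float16.

    Hypothetical transform; float16 is just an example dtype.
    """
    img, label = in_data
    return img.astype(np.float16), label.astype(np.float16)


# Stand-in for one example returned by the dataset (float32 by default).
example = (np.zeros((3, 8, 8), dtype=np.float32),
           np.array([0.3, 0.7], dtype=np.float32))
img, label = to_float16(example)
```

With Chainer installed, a function like this would typically be passed to `chainer.datasets.TransformDataset(dataset, to_float16)`.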

@fukatani
Contributor Author

Thank you for your quick and attentive response.

OK, I will change BetweenClassLabeledDataset to MixUpSoftLabelDataset and implement it as in the MixUp paper.

After this PR is merged, if I can confirm an advantage of BC, I may suggest BC-like mixing as an option.

> Sorry for inconvenience with this point.

No problem. Actually, it will take me some time to restart this PR.

@fukatani changed the title from "[WIP] Add BetweenClassLabeledImageDatasets." to "Support MixUpSoftLabelDataset." on Mar 3, 2018
@@ -0,0 +1,60 @@
import numpy
Member

We use np in ChainerCV. Could you change it?

@@ -15,6 +15,7 @@
from chainercv.datasets.cub.cub_utils import cub_label_names # NOQA
from chainercv.datasets.directory_parsing_label_dataset import directory_parsing_label_names # NOQA
from chainercv.datasets.directory_parsing_label_dataset import DirectoryParsingLabelDataset # NOQA
from chainercv.datasets.mixup_dataset import MixUpSoftLabelDataset # NOQA
Member

Can you change the name of the file to mixup_soft_label_dataset?
Also, could you change the name of the test file as well?


"""

def __init__(self, dataset, max_label):
Member

How about using n_class instead of max_label?
n_class is widely used in other parts of ChainerCV.

In that case the doc would be

n_class (int): The number of classes in the base dataset.

Member

You can remove `+1` in line 57 if you do this.
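
The relation between the two parameterizations can be illustrated as follows. Labels are zero-based integers, so a dataset whose largest label is `max_label` has `max_label + 1` classes, while a soft label sized by `n_class` needs no `+ 1`. The label values below are made up for illustration.

```python
import numpy as np

labels = np.array([0, 3, 9, 4])  # example integer labels (made up)
max_label = int(labels.max())    # largest label value present
n_class = 10                     # number of classes, passed explicitly

# Sizing the soft label by max_label requires the extra "+ 1" ...
soft_label_a = np.zeros(max_label + 1, dtype=np.float32)
# ... while sizing it by n_class does not.
soft_label_b = np.zeros(n_class, dtype=np.float32)
```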

>>> mixed_image, mixed_label = dataset[0]

Args:
dataset: The underlying dataset. dataset should returns two image
Member

"dataset should return ..." can be written more explicitly.

The dataset returns :obj:`img_0, label_0, img_1, label_1`, which is a tuple containing two pairs of an image and a label.
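
To make the expected return format concrete, here is a minimal sketch of a base dataset that yields such a tuple. `PairDataset` is a hypothetical stand-in written for this example; it is not ChainerCV's `SiameseDataset` and only mimics the four-element return format described above.

```python
import random


class PairDataset:
    """Hypothetical dataset that yields two labeled samples per index."""

    def __init__(self, dataset_0, dataset_1, seed=0):
        self._dataset_0 = dataset_0
        self._dataset_1 = dataset_1
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._dataset_0)

    def __getitem__(self, i):
        img_0, label_0 = self._dataset_0[i]
        # Pair the i-th sample with a randomly chosen second sample.
        j = self._rng.randrange(len(self._dataset_1))
        img_1, label_1 = self._dataset_1[j]
        return img_0, label_0, img_1, label_1


# Toy data: (image, label) pairs, with nested lists standing in for images.
toy = [([[0.0]], 0), ([[1.0]], 1), ([[2.0]], 2)]
pairs = PairDataset(toy, toy)
img_0, label_0, img_1, label_1 = pairs[0]
```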

Args:
dataset: The underlying dataset. dataset should returns two image
and their label. Typically, dataset is `SiameseDataset`.
More over, each element of each dataset should have same shape.
Member

Moreover, ... can be rewritten as...

The shapes of images and labels should be constant.

>>> mnist, _ = get_mnist()
>>> base_dataset = SiameseDataset(mnist, mnist)
>>> dataset = MixUpSoftLabelDataset(base_dataset)
>>> mixed_image, mixed_label = dataset[0]
Member

How about printing the shape? This is an important detail.

>>> print(mixed_label.shape, mixed_label.dtype)   # ((10,), numpy.float32)

self.assertEqual(example[1].dtype, np.float32)
self.assertEqual(example[1].ndim, 1)
self.assertEqual(example[1].shape[0], self.n_class + 1)
self.assertAlmostEqual(example[1].sum(), 1.0)
Member

Can you check that the label is nonnegative?
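
A minimal sketch of the checks discussed here, including the suggested nonnegativity check, applied to a hand-built soft label. The helper `check_soft_label` is hypothetical and is not the test code in this PR.

```python
import numpy as np


def check_soft_label(label, n_class):
    """Assert the properties a MixUp soft label is expected to have."""
    assert label.dtype == np.float32
    assert label.ndim == 1
    assert label.shape[0] == n_class
    # The weights sum to one (up to float32 rounding) ...
    np.testing.assert_almost_equal(label.sum(), 1.0, decimal=5)
    # ... every weight is nonnegative ...
    assert (label >= 0.0).all()
    # ... and at most two weights are nonzero.
    assert np.count_nonzero(label) <= 2


soft_label = np.zeros(10, dtype=np.float32)
soft_label[2], soft_label[5] = 0.25, 0.75
check_soft_label(soft_label, n_class=10)
```

The nonnegativity check matters because a label such as `[-0.25, 1.25]` still sums to one, so the sum check alone would not catch it.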

dataset respectively by weighted average.

Unlike `LabeledImageDatasets`, label is a one-dimensional float array with
at most two nonzero weights (i.e. soft label). The summed weights is one.
Member

The sum of the two weights is one

Unlike `LabeledImageDatasets`, label is a one-dimensional float array with
at most two nonzero weights (i.e. soft label). The summed weights is one.

The base dataset `__getitem__` should return image and label. Please see
Member

This is incorrect. We can remove this.


"""Dataset which returns mixed images and labels for mixup learning[1].

`MixUpSoftLabelDataset` mix two images and labels which is chosen by base
Member

:class:`MixUpSoftLabelDataset` mixes two pairs of labeled images fetched from the base dataset.

@yuyu2172
Member

Thank you for a nice PR!
I am so sorry for the late response.

@yuyu2172 yuyu2172 self-assigned this Mar 21, 2018
>>> from chainercv.datasets import MixUpSoftLabelDataset
>>> mnist, _ = get_mnist()
>>> base_dataset = SiameseDataset(mnist, mnist)
>>> dataset = MixUpSoftLabelDataset(base_dataset)
Member

You need to supply the number of classes to MixUpSoftLabelDataset.

@yuyu2172
Member

The link to the paper is not compiled properly.

[Image: screenshot from 2018-03-21 11 33 06]

@fukatani
Contributor Author

Thank you for your attentive review.
(And sorry for my poor doc)

I addressed your comments.

@fukatani
Contributor Author

fukatani commented Mar 25, 2018

[Image: mix]
I confirmed that the documentation now compiles successfully. (I also changed the reference style to match the other arXiv papers cited in ChainerCV.)

@@ -18,6 +18,10 @@ SiameseDataset
~~~~~~~~~~~~~~
.. autoclass:: SiameseDataset

MixUpSoftLabelDataset
Member

Can you move this before SiameseDataset?
We are ordering docs in alphabetical order.

return len(self._dataset)

def get_example(self, i):
image1, label1, image2, label2 = self._dataset[i]
Member

How about changing the names of the variables to img_0, label_0, img_1, label_1 so that they are consistent with the documentation?

@fukatani
Contributor Author

Thanks! I addressed your comments.

@yuyu2172
Member

yuyu2172 commented Apr 4, 2018

LGTM. Thank you for your contribution!

@yuyu2172 yuyu2172 merged commit 74de140 into chainer:master Apr 4, 2018
@yuyu2172 yuyu2172 added this to the v0.9 milestone Apr 17, 2018
2 participants