Add ImageFolderDataset #271

yuyu2172 · 2017-06-13T10:59:20Z

Following the discussion in #264, I made a dataset that builds an image classification dataset just by parsing the directory tree.

There can be some images that a user does not want to include in the dataset. For example, some datasets organize images in sub-directories according to classes, but they contain both color and depth images. The option check_img_file is helpful in this case.

Any suggestions are welcome, as this class can have a wide range of potential applications that I have probably not covered.

Hakuyume · 2017-06-15T04:08:46Z

chainercv/datasets/image_folder_dataset.py

+def find_label_names(root):
+    label_names = [d for d in os.listdir(root)
+                   if os.path.isdir(os.path.join(root, d))]
+    label_names.sort()


How about supporting numerical sort? Sometimes labels are given as numbers.

What kind of directory structure do you have in mind?

Here is an example.

- root
  - 0
    - a.jpg
    - b.jpg
  - 1
    - c.jpg
  - 2
  ...
  - 10
    - y.jpg
    - z.jpg

In this case, a user may want to sort the directories in numerical order (0, 1, 2, ..., 10). The current code sorts them in alphabetical order (0, 1, 10, 2, ...).
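
To make the difference concrete, here is a minimal sketch of the two orderings. The list of names is made up for illustration and stands in for what os.listdir might return for the tree above.

```python
# Directory names in arbitrary order, as os.listdir may return them.
label_names = ['10', '2', '0', '1']

# Plain string sort compares character by character, so '10' precedes '2'.
string_sorted = sorted(label_names)

# Sorting with key=int compares the numeric values instead.
numeric_sorted = sorted(label_names, key=int)

print(string_sorted)   # ['0', '1', '10', '2']
print(numeric_sorted)  # ['0', '1', '2', '10']
```

Note that key=int only works when every directory name parses as an integer, so it probably has to be opt-in.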

Hakuyume · 2017-06-15T04:18:58Z

chainercv/datasets/image_folder_dataset.py

+    return img_paths, np.array(labels, np.int32)
+
+
+class ImageFolderDataset(chainer.dataset.DatasetMixin):


How about ClassificationDatasetFromDirectory? I think <task>Dataset is consistent with DetectionDataset and SemanticSegmentationDataset. Another choice is DirectoryClassificationDataset, but it sounds as if the task is classifying directories.

Are you worried that this name may mislead users into thinking that there is a task called "ImageFolder"?

Oh. You want to say that this is a ClassificationDataset.

ClassificationDatasetFromDirectory is OK, but it breaks the rule that the names of dataset classes end with "Dataset".

Perhaps, FolderParsingClassificationDataset, DirectoryParsingClassificationDataset or ParsingClassificationDataset.

> Oh. You want to say that this is a ClassificationDataset.

Yes. DirectoryParsingClassificationDataset looks good to me.

Hakuyume · 2017-06-15T04:19:46Z

chainercv/datasets/image_folder_dataset.py

+
+
+class ImageFolderDataset(chainer.dataset.DatasetMixin):
+    """A data loader that loads images arranged in directory by classes.


data loader -> classification dataset?
directory -> directories?

Hakuyume · 2017-06-15T04:21:31Z

chainercv/datasets/image_folder_dataset.py

+                |-- img_0.png
+
+        >>> from chainercv.dataset import ImageFolderDataset
+        >>> dataset = ImageFolderDataset(root)


root -> 'root'

Hakuyume · 2017-06-15T04:23:19Z

chainercv/datasets/image_folder_dataset.py

+        '.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG',
+        '.ppm', '.PPM', '.bmp', '.BMP',
+    ]
+    return any(filename.endswith(extension) for extension in img_extensions)


How about using os.path.splitext and str.lower()?
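
A minimal sketch of what this suggestion could look like. The helper name has_img_extension and the constant IMG_EXTENSIONS are hypothetical; the set holds the lowercase forms of the extensions visible in the diff above.

```python
import os

# Lowercase forms only; the filename's extension is lowercased before lookup.
IMG_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.ppm', '.bmp'}


def has_img_extension(filename):
    # os.path.splitext returns (root, ext), where ext includes the leading dot
    # and is an empty string when the name has no extension.
    ext = os.path.splitext(filename)[1]
    return ext.lower() in IMG_EXTENSIONS


print(has_img_extension('a.JPG'))           # True
print(has_img_extension('dummy_file.XXX'))  # False
```

With this approach the upper-case variants no longer need to be listed, and mixed-case extensions such as '.Jpg' are also accepted.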

yuyu2172 · 2017-06-16T07:19:03Z

I added assert_is_classification_dataset.

Hakuyume · 2017-07-03T10:11:42Z

chainercv/datasets/directory_parsing_classification_dataset.py

+
+    The label names can be used together with
+    :class:`chainercv.datasets.DirectoryParsingClassificationDataset`.
+    An index of the label names correspond to a label id


correspond -> corresponds

Hakuyume · 2017-07-03T10:15:15Z

chainercv/datasets/__init__.py

@@ -5,6 +5,8 @@
 from chainercv.datasets.cub.cub_keypoint_dataset import CUBKeypointDataset  # NOQA
 from chainercv.datasets.cub.cub_label_dataset import CUBLabelDataset  # NOQA
 from chainercv.datasets.cub.cub_utils import cub_label_names  # NOQA
+from chainercv.datasets.directory_parsing_classification_dataset import DirectoryParsingClassificationDataset  # NOQA
+from chainercv.datasets.directory_parsing_classification_dataset import parse_label_names  # NOQA


parse_label_names is too ambiguous. If you want to import it under chainercv.datasets, the name should be more specific. How about directory_parsing_label_names? I think the name should be similar to DirectoryParsingClassificationDataset in order to show the relationship.

Hakuyume · 2017-07-03T10:15:52Z

chainercv/datasets/directory_parsing_classification_dataset.py

+    """Get label names from directories that are named by them.
+
+    The label names are names of the directories that locate a layer below the
+    root.


root -> root directory

Hakuyume · 2017-07-03T10:19:35Z

chainercv/datasets/directory_parsing_classification_dataset.py

+
+def _ends_with_img_ext(filename):
+    img_extensions = ['.jpg', '.jpeg', '.png', '.ppm', '.bmp']
+    return any(os.path.splitext(filename)[1].lower().endswith(extension) for


You can use == instead of endswith.

Hakuyume · 2017-07-03T10:21:25Z

chainercv/datasets/directory_parsing_classification_dataset.py

+    else:
+        label_names = [int(name) for name in label_names]
+        label_names.sort()
+        label_names = [str(name) for name in label_names]


How about using key= of sorted?
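
A small sketch comparing the convert, sort, convert-back pattern from the diff with sorted(..., key=int). The sample names are made up for illustration.

```python
label_names = ['10', '2', '0', '1']

# Pattern from the diff: convert to int, sort, convert back to str.
converted = [int(name) for name in label_names]
converted.sort()
round_trip = [str(name) for name in converted]

# Suggested alternative: sort the original strings by their integer value.
with_key = sorted(label_names, key=int)

print(round_trip == with_key)  # True

# The two differ for zero-padded names: key=int keeps the original strings,
# while the round trip rewrites them.
padded = ['010', '002']
print(sorted(padded, key=int))                     # ['002', '010']
print([str(n) for n in sorted(map(int, padded))])  # ['2', '10']
```

Keeping the original strings matters if the label names are later joined with the root directory, because the rewritten names may no longer match the directory names on disk.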

Hakuyume · 2017-07-03T10:25:07Z

chainercv/datasets/directory_parsing_classification_dataset.py

+def _parse_classification_dataset(root, label_names,
+                                  check_img_file=_ends_with_img_ext):
+    # Use label_name_to_idx for performance.
+    label_name_to_idx = {label_names[i]: i for i in range(len(label_names))}


{label_name: l for l, label_name in enumerate(label_names)}

Hakuyume · 2017-07-03T10:28:40Z

chainercv/datasets/directory_parsing_classification_dataset.py

+        if not os.path.isdir(label_dir):
+            continue
+
+        for cur_dir, _, filenames in sorted(os.walk(label_dir)):


numerical_sort is not used for this sort. I don't think this is bad. However, the current doc sounds as if numerical_sort is used everywhere.

I think it is better to use numerical_sort consistently.

Thanks for your suggestion.

Hakuyume · 2017-07-03T10:36:59Z

chainercv/utils/testing/assertions/assert_is_classification_dataset.py

+    assert isinstance(label, np.int32), \
+        'label must be a numpy.int32.'
+    assert label.ndim == 0, 'The ndim of label must be 0'
+    assert label.min() >= 0 and label.max() < n_class, \


min and max are unnecessary because label is a scalar value.
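
A sketch of the check without min and max. The function name check_label is hypothetical and the range-check message is made up; the first two assertions are taken from the diff above.

```python
import numpy as np


def check_label(label, n_class):
    assert isinstance(label, np.int32), \
        'label must be a numpy.int32.'
    assert label.ndim == 0, 'The ndim of label must be 0'
    # A scalar can be compared directly; no reduction is needed.
    assert 0 <= label < n_class, \
        'label must be in [0, n_class).'


check_label(np.int32(2), n_class=3)  # passes silently
```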

Hakuyume · 2017-07-03T10:37:37Z

docs/source/reference/datasets.rst

@@ -4,6 +4,14 @@ Datasets
 .. module:: chainercv.datasets


+DirectoryParsingClassificationDataset
+-------------------------------------
+.. autofunction:: DirectoryParsingClassificationDataset


autofunction -> autoclass

Hakuyume · 2017-07-03T10:42:05Z

chainercv/datasets/directory_parsing_classification_dataset.py

+
+    img_paths = []
+    labels = []
+    for label_name in os.listdir(root):


Why don't you use for label, label_name in enumerate(label_names)? If we use this, we can remove label_name_to_idx.
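
A rough sketch of the loop written this way. The function name parse_classification_dir is hypothetical, and the image-file check is left out to keep the example short; this is not the code in the PR.

```python
import os
import tempfile


def parse_classification_dir(root, label_names):
    """Collect file paths and labels; a label is an index into label_names."""
    img_filenames = []
    labels = []
    for label, label_name in enumerate(label_names):
        label_dir = os.path.join(root, label_name)
        if not os.path.isdir(label_dir):
            continue
        # os.walk also visits nested sub-directories of each label directory.
        for cur_dir, _, filenames in sorted(os.walk(label_dir)):
            for filename in sorted(filenames):
                img_filenames.append(os.path.join(cur_dir, filename))
                labels.append(label)
    return img_filenames, labels


# Build a tiny throwaway tree to exercise the function.
with tempfile.TemporaryDirectory() as root:
    for name, files in [('cat', ['a.png']), ('dog', ['b.png', 'c.png'])]:
        os.makedirs(os.path.join(root, name))
        for f in files:
            open(os.path.join(root, name, f), 'a').close()
    img_filenames, labels = parse_classification_dir(root, ['cat', 'dog'])

print(labels)  # [0, 1, 1]
```

Because the label comes straight from enumerate, no name-to-index dictionary is needed.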

Hakuyume · 2017-07-03T10:46:36Z

tests/utils_tests/testing_tests/assertions_tests/test_assert_is_classification_dataset.py

+        ]
+    )
+))
+class TestAssertIsSemanticSegmentationDataset(unittest.TestCase):


SemanticSegmentation -> Classification

Hakuyume

I'm sorry for the delay in reviewing.

Hakuyume · 2017-07-28T05:23:35Z

chainercv/datasets/directory_parsing_classification_dataset.py

+    Args:
+        root (str): The root directory.
+        numerical_sort (bool): Label names are sorted numerically.
+            This means that :obj:`'2'` is before :obj:`10`,


:obj:`'2'` -> :obj:`2`

Hakuyume · 2017-07-28T05:45:42Z

tests/datasets_tests/test_directory_parsing_classification_dataset.py

+                    os.path.join(class_dir,
+                                 'img{}.{}'.format(j, self.suffix)),
+                    self.size, self.color)
+            open(os.path.join(class_dir, 'dummy_file.XXX'), 'a').close()


Current code of DirectoryParsingClassificationDataset supports a nested directory tree. How about testing that situation?

Hakuyume · 2017-07-28T05:46:31Z

tests/datasets_tests/test_directory_parsing_classification_dataset.py

+        self.assertEqual(len(dataset), self.n_img_per_class * self.n_class)
+
+        assert_is_classification_dataset(
+            dataset, self.n_class, color=self.color)


How about checking the number of items and their order?

What do you mean by "the number of items"?
By the way, the length of the dataset is checked.

I mean "the length of the dataset". As you pointed out, it is already checked. Sorry.
How about the order of elements?

> How about the order of elements?

Sorry, you have already added it. Thank you.

yuyu2172 · 2017-07-28T09:23:27Z

I stopped using *_paths to represent file paths.
Instead, I used *_filenames, which is used in other parts of the library.

Hakuyume · 2017-08-07T09:42:26Z

tests/datasets_tests/test_directory_parsing_classification_dataset.py

+
+    def test_numerical_sort(self):
+        dataset = DirectoryParsingClassificationDataset(
+            self.tmp_dir, numerical_sort=False)


numerical_sort=False -> numerical_sort=True
With this modification, nosetests will find the bug in DirectoryParsingClassificationDataset.

Hakuyume · 2017-08-07T09:46:47Z

@yuyu2172 I'm sorry I kept you waiting so long. I found a bug in your code, but I couldn't understand why nosetests passed. Now I see that the test code also has a bug. Please check my comment.

yuyu2172 · 2017-08-07T09:47:41Z

OK. I will look into it.
Thank you for checking.

yuyu2172 · 2017-08-07T10:01:14Z

@Hakuyume

For strings except label names (file name and directory other than the top directory), it seems better to avoid numerical sort even if numerical_sort=True.

In order to carry out numerical sort, we need to assume that these values are all numbers, which is usually not the case.

Hakuyume · 2017-08-07T10:04:23Z

> @Hakuyume
>
> For strings except label names (file name and directory other than the top directory), it seems better to avoid numerical sort even if numerical_sort=True.
>
> In order to carry out numerical sort, we need to assume that these values are all numbers, which is usually not the case.

I agree with you. That is the reason why I said applying numerical sort only to label names is not bad (#271 (comment))
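
A short sketch of why numerical sort does not carry over to file names. The sample names are made up for illustration.

```python
label_names = ['10', '2']
filenames = ['img10.png', 'img2.png']

# Label names that are all integers can be sorted numerically.
print(sorted(label_names, key=int))  # ['2', '10']

# Typical file names are not integers, so int() raises ValueError.
try:
    sorted(filenames, key=int)
    numeric_sort_failed = False
except ValueError:
    numeric_sort_failed = True
print(numeric_sort_failed)  # True

# String sort works for any names.
print(sorted(filenames))  # ['img10.png', 'img2.png']
```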

Hakuyume · 2017-08-07T10:47:46Z

chainercv/datasets/directory_parsing_classification_dataset.py

+        numerical_sort (bool): Label names are sorted numerically.
+            This means that label :obj:`2` is before label :obj:`10`,
+            which is not the case when string sort is used.
+            Regardless of this option, non-numerical sort is used for the


You are using both "string sort" and "non-numerical sort". It sounds as if there are three types of sorting.

Hakuyume

LGTM

add ImageFolderDataset

5610133

yuyu2172 force-pushed the image-folder-dataset branch from 40ddcaf to 5610133 Compare June 13, 2017 10:59

make code to work

f746653

yuyu2172 force-pushed the image-folder-dataset branch from 589269a to f746653 Compare June 14, 2017 05:30

use label_name_to_idx for performance

c633852

yuyu2172 force-pushed the image-folder-dataset branch from e0c0996 to c633852 Compare June 14, 2017 06:30

yuyu2172 mentioned this pull request Jun 15, 2017

Add VGG16 #265

Merged

7 tasks

yuyu2172 added the feature label Jun 15, 2017

Hakuyume reviewed Jun 15, 2017

yuyu2172 added 7 commits June 16, 2017 15:03

use directory_parsing_classification_dataset

52f766b

Merge remote-tracking branch 'origin/master' into image-folder-dataset

0ea7607

fix doc

8577ea6

add color option

87d28f9

add a doc for parse_label_names

c381d54

use assert_is_classification_dataset

7a2ee88

use assert_is_classification_dataset

5fe3db4

yuyu2172 force-pushed the image-folder-dataset branch from 3ba9878 to b29c5c1 Compare June 16, 2017 07:18

yuyu2172 changed the title ~~[WIP] Add ImageFolderDataset.~~ Add ImageFolderDataset. Jun 16, 2017

yuyu2172 added this to the v0.6 milestone Jun 16, 2017

yuyu2172 force-pushed the image-folder-dataset branch 2 times, most recently from e1508b8 to f316024 Compare June 16, 2017 07:32

yuyu2172 assigned Hakuyume Jun 16, 2017

fix doc

4a9d989

yuyu2172 force-pushed the image-folder-dataset branch from f316024 to 4a9d989 Compare June 16, 2017 07:36

add numerical sort option

5729d29

yuyu2172 changed the title ~~Add ImageFolderDataset.~~ Add ImageFolderDataset Jun 23, 2017

Hakuyume reviewed Jul 3, 2017

use name directory_parsing_label_names

685cc33

yuyu2172 removed this from the v0.6 milestone Jul 6, 2017

yuyu2172 added this to the v0.7 milestone Jul 14, 2017

Hakuyume reviewed Jul 28, 2017

yuyu2172 added 6 commits July 28, 2017 17:05

fix doc

51bcf5b

img_paths --> img_filenames

f9fc870

sort names of files

543c60f

test on nested directory for DirectoryParsingClassificationDataset

1bbbb0c

fix test to check order of filenames

8ad3e9b

flake8 & move nested function declaration in the beginning

8bc236c

yuyu2172 force-pushed the image-folder-dataset branch 2 times, most recently from 3730dcc to b4e64d5 Compare July 28, 2017 09:28

fix doc

141dd42

yuyu2172 force-pushed the image-folder-dataset branch from b4e64d5 to 141dd42 Compare July 28, 2017 09:29

yuyu2172 added 2 commits July 28, 2017 18:41

fix doc

8147f63

add doc on arg color to assert_is_classification_dataset

eadf001

Hakuyume reviewed Aug 7, 2017

yuyu2172 added 3 commits August 7, 2017 19:09

fix a bug in test

5ed96a1

do not use numerical sort for filenames

6abf884

make multiple nested directories in the test

4858391

Hakuyume reviewed Aug 7, 2017

yuyu2172 added 3 commits August 7, 2017 19:56

non-numerical sort --> string sort

efa5836

fix doc

b0c80b2

fix doc

ba367fa

Hakuyume approved these changes Aug 8, 2017

Hakuyume merged commit d8d45dd into chainer:master Aug 8, 2017
