This repository has been archived by the owner on Jul 2, 2021. It is now read-only.

Add ImageFolderDataset #271

Merged: 36 commits, Aug 8, 2017

@yuyu2172 yuyu2172 commented Jun 13, 2017

Following the discussion in #264, I made a dataset class that builds an image classification dataset simply by parsing a directory tree.

There can be some images that a user does not want to include in the dataset. For example, some datasets organize images in sub-directories according to classes, but they contain both color and depth images. The option check_img_file is helpful in this case.

Any suggestions are welcome, as this class can have a wide range of potential applications that I have probably not covered.

@yuyu2172 yuyu2172 mentioned this pull request Jun 15, 2017
7 tasks
def find_label_names(root):
    label_names = [d for d in os.listdir(root)
                   if os.path.isdir(os.path.join(root, d))]
    label_names.sort()
Member

How about supporting numerical sort? Sometimes labels are given as numbers.

Member Author

What kind of directory structure do you have in mind?

Member (@Hakuyume, Jun 16, 2017)

Here is an example.

- root
    - 0
        - a.jpg
        - b.jpg
    - 1
        - c.jpg
    - 2
...
    - 10
        - y.jpg
        - z.jpg

In this case, the user may want to sort the directories in numerical order (0, 1, 2, ..., 10). The current code sorts them in alphabetical order (0, 1, 10, 2, ...).
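
To illustrate the ordering difference described in this comment, here is a minimal sketch. The list of directory names is made up for illustration and is not taken from the PR's code.

```python
# Hypothetical label directory names, mirroring the example tree.
names = ['0', '1', '2', '10']

# Plain string sort compares character by character, so '10' lands before '2'.
alphabetical = sorted(names)

# Sorting with an integer key gives the numerical order instead.
numerical = sorted(names, key=int)

print(alphabetical)  # ['0', '1', '10', '2']
print(numerical)     # ['0', '1', '2', '10']
```

Note that key=int only works when every directory name is a valid integer literal, which is why this would have to be opt-in.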

Member Author

I see.

Member Author

Done

return img_paths, np.array(labels, np.int32)


class ImageFolderDataset(chainer.dataset.DatasetMixin):
Member

How about ClassificationDatasetFromDirectory? I think <task>Dataset is consistent with DetectionDataset and SemanticSegmentationDataset. Another choice is DirectoryClassificationDataset, but that sounds as if the task is classifying directories.

Member Author

Are you worried that this name may mislead users into thinking that there is a task called "ImageFolder"?

Member Author

Oh. You want to say that this is a ClassificationDataset.

Member Author

ClassificationDatasetFromDirectory is OK, but it breaks the rule that dataset class names end with "Dataset".

Perhaps, FolderParsingClassificationDataset, DirectoryParsingClassificationDataset or ParsingClassificationDataset.

Member

> Oh. You want to say that this is a ClassificationDataset.

Yes. DirectoryParsingClassificationDataset looks good to me.



class ImageFolderDataset(chainer.dataset.DatasetMixin):
"""A data loader that loads images arranged in directory by classes.
Member

data loader -> classification dataset?
directory -> directories?

|-- img_0.png

>>> from chainercv.dataset import ImageFolderDataset
>>> dataset = ImageFolderDataset(root)
Member

root -> 'root'

        '.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG',
        '.ppm', '.PPM', '.bmp', '.BMP',
    ]
    return any(filename.endswith(extension) for extension in img_extensions)
Member

How about using os.path.splitext and str.lower()?
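
A rough sketch of what this suggestion could look like, assuming the same five extensions as the list under review. The names IMG_EXTENSIONS and ends_with_img_ext are placeholders for illustration and are not necessarily what the PR ended up with.

```python
import os

# Assumed extension set: the lowercase forms of the list under review.
IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp')


def ends_with_img_ext(filename):
    # splitext returns (root, ext); ext keeps its leading dot.
    ext = os.path.splitext(filename)[1].lower()
    # Lowercasing first means one entry covers '.jpg', '.JPG', '.Jpg', ...
    return ext in IMG_EXTENSIONS


print(ends_with_img_ext('photo.JPG'))   # True
print(ends_with_img_ext('depth.npy'))   # False
```

One behavioral difference from the explicit upper/lower list: mixed-case extensions such as '.Jpg' are accepted as well.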

@yuyu2172
Member Author

I added assert_is_classification_dataset.

@yuyu2172 yuyu2172 changed the title from "[WIP] Add ImageFolderDataset." to "Add ImageFolderDataset." Jun 16, 2017
@yuyu2172 yuyu2172 added this to the v0.6 milestone Jun 16, 2017
@yuyu2172 yuyu2172 force-pushed the image-folder-dataset branch 2 times, most recently from e1508b8 to f316024 Compare June 16, 2017 07:32
@yuyu2172 yuyu2172 changed the title from "Add ImageFolderDataset." to "Add ImageFolderDataset" Jun 23, 2017

The label names can be used together with
:class:`chainercv.datasets.DirectoryParsingClassificationDataset`.
An index of the label names correspond to a label id
Member

correspond -> corresponds

@@ -5,6 +5,8 @@
from chainercv.datasets.cub.cub_keypoint_dataset import CUBKeypointDataset # NOQA
from chainercv.datasets.cub.cub_label_dataset import CUBLabelDataset # NOQA
from chainercv.datasets.cub.cub_utils import cub_label_names # NOQA
from chainercv.datasets.directory_parsing_classification_dataset import DirectoryParsingClassificationDataset # NOQA
from chainercv.datasets.directory_parsing_classification_dataset import parse_label_names # NOQA
Member

parse_label_names is too ambiguous. If you want to import it under chainercv.datasets, the name should be more specific. How about directory_parsing_label_names? I think the name should be similar to DirectoryParsingClassificationDataset in order to show the relationship.

"""Get label names from directories that are named by them.

The label names are names of the directories that locate a layer below the
root.
Member

root -> root directory


def _ends_with_img_ext(filename):
    img_extensions = ['.jpg', '.jpeg', '.png', '.ppm', '.bmp']
    return any(os.path.splitext(filename)[1].lower().endswith(extension) for
Member

You can use == instead of endswith.

else:
    label_names = [int(name) for name in label_names]
    label_names.sort()
    label_names = [str(name) for name in label_names]
Member

How about using key= of sorted?
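
As a sketch of why key= might be preferable here: the int round-trip in the snippet under review rewrites the strings, whereas key=int only changes the comparison. The sample names below are invented, and the leading-zero case is an assumption about what a directory tree could contain.

```python
# Hypothetical directory names, including one with a leading zero.
label_names = ['10', '2', '01']

# Round-trip through int, as in the snippet under review.
round_trip = [str(n) for n in sorted(int(name) for name in label_names)]

# key= sorts by integer value but keeps the original strings.
with_key = sorted(label_names, key=int)

print(round_trip)  # ['1', '2', '10']
print(with_key)    # ['01', '2', '10']
```

With the round-trip, '01' becomes '1' and no longer matches the directory name on disk; with key=int the names stay usable as paths.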

def _parse_classification_dataset(root, label_names,
                                  check_img_file=_ends_with_img_ext):
    # Use label_name_to_idx for performance.
    label_name_to_idx = {label_names[i]: i for i in range(len(label_names))}
Member

{label_name: l for l, label_name in enumerate(label_names)}

if not os.path.isdir(label_dir):
    continue

for cur_dir, _, filenames in sorted(os.walk(label_dir)):
Member

numerical_sort is not used for this sort. I don't think this is bad. However, the current doc sounds as if numerical_sort is used everywhere.

Member Author

I think it is better to use numerical_sort consistently.

Member Author

Thanks for your suggestion.

assert isinstance(label, np.int32), \
    'label must be a numpy.int32.'
assert label.ndim == 0, 'The ndim of label must be 0'
assert label.min() >= 0 and label.max() < n_class, \
Member

min and max are unnecessary because label is a scalar value.

@@ -4,6 +4,14 @@ Datasets
.. module:: chainercv.datasets


DirectoryParsingClassificationDataset
-------------------------------------
.. autofunction:: DirectoryParsingClassificationDataset
Member

autofunction -> autoclass


img_paths = []
labels = []
for label_name in os.listdir(root):
Member

Why don't you use for label, label_name in enumerate(label_names)? If we use this, we can remove label_name_to_idx.
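
A minimal sketch of the loop this comment proposes, under the assumption that label_names is already sorted and that each name is a directory directly below root. The function name parse_classification_dir and the is_img parameter are hypothetical and only loosely modeled on the snippets quoted in this review.

```python
import os
import tempfile


def parse_classification_dir(root, label_names, is_img=lambda f: True):
    """Hypothetical helper: collect image paths and label ids under root."""
    img_paths = []
    labels = []
    # enumerate yields the label id directly, so no name-to-index dict.
    for label, label_name in enumerate(label_names):
        label_dir = os.path.join(root, label_name)
        if not os.path.isdir(label_dir):
            continue
        # os.walk also descends into nested sub-directories.
        for cur_dir, _, filenames in sorted(os.walk(label_dir)):
            for filename in sorted(filenames):
                if is_img(filename):
                    img_paths.append(os.path.join(cur_dir, filename))
                    labels.append(label)
    return img_paths, labels


# Usage on a throwaway tree: root/0/{a,b}.jpg and root/1/c.jpg.
with tempfile.TemporaryDirectory() as root:
    for name, files in [('0', ['a.jpg', 'b.jpg']), ('1', ['c.jpg'])]:
        os.makedirs(os.path.join(root, name))
        for f in files:
            open(os.path.join(root, name, f), 'a').close()
    paths, labels = parse_classification_dir(root, ['0', '1'])

print(labels)  # [0, 0, 1]
```

Iterating over label_names instead of os.listdir(root) also makes the output order follow the label order rather than the file system's listing order.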

]
)
))
class TestAssertIsSemanticSegmentationDataset(unittest.TestCase):
Member

SemanticSegmentation -> Classification

@yuyu2172 yuyu2172 removed this from the v0.6 milestone Jul 6, 2017
@yuyu2172 yuyu2172 added this to the v0.7 milestone Jul 14, 2017
Member

@Hakuyume Hakuyume left a comment

I'm sorry for the delay in reviewing.

Args:
    root (str): The root directory.
    numerical_sort (bool): Label names are sorted numerically.
        This means that :obj:`'2'` is before :obj:`10`,
Member

:obj:`'2'` -> :obj:`2`

os.path.join(class_dir,
'img{}.{}'.format(j, self.suffix)),
self.size, self.color)
open(os.path.join(class_dir, 'dummy_file.XXX'), 'a').close()
Member

Current code of DirectoryParsingClassificationDataset supports a nested directory tree. How about testing that situation?
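
A test for the nested case could start from a fixture like the sketch below. It uses only the standard library and checks what os.walk finds under one class directory. Wiring it to DirectoryParsingClassificationDataset is left out, since the dataset's exact behavior is not shown in this thread. The directory and file names are invented.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Class directory '0' with one image at the top and one nested deeper.
    nested_dir = os.path.join(root, '0', 'nested')
    os.makedirs(nested_dir)
    for path in (os.path.join(root, '0', 'img0.png'),
                 os.path.join(nested_dir, 'img1.png')):
        open(path, 'a').close()

    # Collect every file below the class directory, relative to root.
    found = sorted(
        os.path.relpath(os.path.join(cur_dir, name), root)
        for cur_dir, _, names in os.walk(os.path.join(root, '0'))
        for name in names)

expected = [os.path.join('0', 'img0.png'),
            os.path.join('0', 'nested', 'img1.png')]
assert found == expected
```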

self.assertEqual(len(dataset), self.n_img_per_class * self.n_class)

assert_is_classification_dataset(
    dataset, self.n_class, color=self.color)
Member

How about checking the number of items and their order?

Member Author

What do you mean by "the number of items"?
By the way, the length of the dataset is checked.

Member

I mean "the length of the dataset". As you pointed out, it is already checked. Sorry.
How about the order of elements?

Member (@Hakuyume, Jul 28, 2017)

> How about the order of elements?

Sorry, you have already added it. Thank you.

@yuyu2172
Member Author

I stopped using *_paths to represent file paths.
Instead, I used *_filenames, which is used in other parts of the library.

@yuyu2172 yuyu2172 force-pushed the image-folder-dataset branch 2 times, most recently from 3730dcc to b4e64d5 Compare July 28, 2017 09:28

def test_numerical_sort(self):
    dataset = DirectoryParsingClassificationDataset(
        self.tmp_dir, numerical_sort=False)
Member

numerical_sort=False -> numerical_sort=True
With this modification, nosetests will find the bug in DirectoryParsingClassificationDataset.

@Hakuyume
Member

Hakuyume commented Aug 7, 2017

@yuyu2172 I'm sorry to have kept you waiting so long. I found a bug in your code, but I couldn't understand why nosetests passed. Now I find that the test code also has a bug. Please check my comment.

@yuyu2172
Member Author

yuyu2172 commented Aug 7, 2017

OK. I will look into it.
Thank you for checking.

@yuyu2172
Member Author

yuyu2172 commented Aug 7, 2017

@Hakuyume

For strings except label names (file name and directory other than the top directory), it seems better to avoid numerical sort even if numerical_sort=True.

In order to carry out numerical sort, we need to assume that these values are all numbers, which is usually not the case.

@Hakuyume
Member

Hakuyume commented Aug 7, 2017

> @Hakuyume
>
> For strings except label names (file name and directory other than the top directory), it seems better to avoid numerical sort even if numerical_sort=True.
>
> In order to carry out numerical sort, we need to assume that these values are all numbers, which is usually not the case.

I agree with you. That is the reason why I said applying numerical sort only to label names is not bad (#271 (comment))
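
The point agreed on here can be sketched as follows. The sample names are invented for illustration; the only claim is that int() rejects typical file names, so a numeric key is safe only for all-digit label directories.

```python
label_names = ['10', '2', '0']           # hypothetical top-level directories
filenames = ['img10.png', 'img2.png']    # hypothetical files in one of them

# Label names are all digits, so an integer key works.
sorted_labels = sorted(label_names, key=int)

# File names are generally not numbers; int() raises ValueError on them.
try:
    sorted(filenames, key=int)
    numeric_ok = True
except ValueError:
    numeric_ok = False

# So file names fall back to plain string sort.
sorted_files = sorted(filenames)

print(sorted_labels)  # ['0', '2', '10']
print(numeric_ok)     # False
print(sorted_files)   # ['img10.png', 'img2.png']
```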

numerical_sort (bool): Label names are sorted numerically.
    This means that label :obj:`2` is before label :obj:`10`,
    which is not the case when string sort is used.
    Regardless of this option, non-numerical sort is used for the
Member

You are using both "string sort" and "non-numerical sort". It sounds as if there are three types of sorting.

Member

@Hakuyume Hakuyume left a comment

LGTM

@Hakuyume Hakuyume merged commit d8d45dd into chainer:master Aug 8, 2017

2 participants