-
Notifications
You must be signed in to change notification settings - Fork 306
Conversation
40ddcaf
to
5610133
Compare
589269a
to
f746653
Compare
e0c0996
to
c633852
Compare
def find_label_names(root): | ||
label_names = [d for d in os.listdir(root) | ||
if os.path.isdir(os.path.join(root, d))] | ||
label_names.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about supporting numerical sort? Sometimes labels are given as numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of directory structure do you have in your mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example.
- root
- 0
- a.jpg
- b.jpg
- 1
- c.jpg
- 2
...
- 10
- y.jpg
- z.jpg
In this case, user may want to sort directories numerical order (0
, 1
, 2
... 10
). Current code sorts them by alphabetical order (0
, 1
, 10
, 2
, ...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return img_paths, np.array(labels, np.int32) | ||
|
||
|
||
class ImageFolderDataset(chainer.dataset.DatasetMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about ClassificationDatasetFromDirectory
? I think <task>Dataset
is consistent with DetectionDataset
and SemanticSegmentationDataset
. Another choice is DirectoryClassificationDataset
but it sounds that the task is classifying directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are worried that this name may confuse users to think that there is a task "ImageFolder"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. You want to say that this is a ClassificationDataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClassificationDatasetFromDirectory
is OK, but this breaks the rule that dataset object class ends with "Dataset".
Perhaps, FolderParsingClassificationDataset
, DirectoryParsingClassificationDataset
or ParsingClassificationDataset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. You want to say that this is a ClassificationDataset.
Yes. DirectoryParsingClassificationDataset
looks good to me.
|
||
|
||
class ImageFolderDataset(chainer.dataset.DatasetMixin): | ||
"""A data loader that loads images arranged in directory by classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data loader
-> classification dataset
?
directory
-> directories
?
|-- img_0.png | ||
|
||
>>> from chainercv.dataset import ImageFolderDataset | ||
>>> dataset = ImageFolderDataset(root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
root
-> 'root'
'.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG', | ||
'.ppm', '.PPM', '.bmp', '.BMP', | ||
] | ||
return any(filename.endswith(extension) for extension in img_extensions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using os.path.splitext
and str.lower()
?
3ba9878
to
b29c5c1
Compare
I added |
e1508b8
to
f316024
Compare
f316024
to
4a9d989
Compare
|
||
The label names can be used together with | ||
:class:`chainercv.datasets.DirectoryParsingClassificationDataset`. | ||
An index of the label names correspond to a label id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correspond
-> corresponds
chainercv/datasets/__init__.py
Outdated
@@ -5,6 +5,8 @@ | |||
from chainercv.datasets.cub.cub_keypoint_dataset import CUBKeypointDataset # NOQA | |||
from chainercv.datasets.cub.cub_label_dataset import CUBLabelDataset # NOQA | |||
from chainercv.datasets.cub.cub_utils import cub_label_names # NOQA | |||
from chainercv.datasets.directory_parsing_classification_dataset import DirectoryParsingClassificationDataset # NOQA | |||
from chainercv.datasets.directory_parsing_classification_dataset import parse_label_names # NOQA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parse_label_names
is too ambiguous. If you want to import it under chainercv.datasets
, the name should be more specific. How about directory_parsing_label_names
? I think the name should be similar to DirectoryParsingClassificationDataset
in order to show the relationship.
"""Get label names from directories that are named by them. | ||
|
||
The label names are names of the directories that locate a layer below the | ||
root. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
root
-> root directory
|
||
def _ends_with_img_ext(filename): | ||
img_extensions = ['.jpg', '.jpeg', '.png', '.ppm', '.bmp'] | ||
return any(os.path.splitext(filename)[1].lower().endswith(extension) for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use ==
instead of endswith
.
else: | ||
label_names = [int(name) for name in label_names] | ||
label_names.sort() | ||
label_names = [str(name) for name in label_names] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using key=
of sorted
?
def _parse_classification_dataset(root, label_names, | ||
check_img_file=_ends_with_img_ext): | ||
# Use label_name_to_idx for performance. | ||
label_name_to_idx = {label_names[i]: i for i in range(len(label_names))} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{label_name: l for l, label_name in enumerate(label_names)}
if not os.path.isdir(label_dir): | ||
continue | ||
|
||
for cur_dir, _, filenames in sorted(os.walk(label_dir)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numerical_sort
is not used for this sort. I don't think this is bad. However, current doc sounds numerical_sort
is used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to use numerical_sort
consistently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion.
assert isinstance(label, np.int32), \ | ||
'label must be a numpy.int32.' | ||
assert label.ndim == 0, 'The ndim of label must be 0' | ||
assert label.min() >= 0 and label.max() < n_class, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min
and max
are unnecessary because label
is a scalar value.
docs/source/reference/datasets.rst
Outdated
@@ -4,6 +4,14 @@ Datasets | |||
.. module:: chainercv.datasets | |||
|
|||
|
|||
DirectoryParsingClassificationDataset | |||
------------------------------------- | |||
.. autofunction:: DirectoryParsingClassificationDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
autofunction
-> autoclass
|
||
img_paths = [] | ||
labels = [] | ||
for label_name in os.listdir(root): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use for label, label_name in enumerate(label_names)
? If we use this, we can remove label_name_to_idx
.
] | ||
) | ||
)) | ||
class TestAssertIsSemanticSegmentationDataset(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SemanticSegmentation
-> Classification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for the delay in reviewing.
Args: | ||
root (str): The root directory. | ||
numerical_sort (bool): Label names are sorted numerically. | ||
This means that :obj:`'2'` is before :obj:`10`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:obj:``2`` -> :obj:`2`
os.path.join(class_dir, | ||
'img{}.{}'.format(j, self.suffix)), | ||
self.size, self.color) | ||
open(os.path.join(class_dir, 'dummy_file.XXX'), 'a').close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current code of DirectoryParsingClassificationDataset
supports a nested directory tree. How about testing that situation?
self.assertEqual(len(dataset), self.n_img_per_class * self.n_class) | ||
|
||
assert_is_classification_dataset( | ||
dataset, self.n_class, color=self.color) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about checking the number of items and their order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "the number of items"?
By the way, the length of the dataset is checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean "the length of the dataset". As you pointed out, it is already checked. Sorry.
How about the order of elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the order of elements?
Sorry, you have already added it. Thank you.
I stopped using |
3730dcc
to
b4e64d5
Compare
b4e64d5
to
141dd42
Compare
|
||
def test_numerical_sort(self): | ||
dataset = DirectoryParsingClassificationDataset( | ||
self.tmp_dir, numerical_sort=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numerical_sort=False
-> numerical_sort=True
With this modification, nosetests will find the bug of DirectoryParsingClassificationDataset
@yuyu2172 I'm sorry, I kept you waiting so long. I found a bug in your code but I couldn't understand why nosetests passed. Now I find test code also has a bug. Please check my comment. |
OK. I will look into it. |
For strings except label names (file name and directory other than the top directory), it seems better to avoid numerical sort even if In order to carry out numerical sort, we need to assume that these values are all numbers, which is usually not the case. |
I agree with you. That is the reason why I said applying numerical sort only to label names is not bad (#271 (comment)) |
numerical_sort (bool): Label names are sorted numerically. | ||
This means that label :obj:`2` is before label :obj:`10`, | ||
which is not the case when string sort is used. | ||
Regardless of this option, non-numerical sort is used for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are using string sort
and non-numerical sort
. It sounds there are three types of sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Following the discussion in #264, I made a dataset that obtains image classification dataset by just parsing the directory tree.
There can be some image that a user do not want to include in the dataset. For example, some datasets organize images in sub-directories according to classes, but they contain both color and depth images. The option
check_img_file
is helpful in this case.Any suggestions are welcome as this class can have wide range of potential applications, which I have not probably covered.