
Stratified Splitter #201

Merged: 14 commits into chainer:master, Jul 3, 2018

Conversation
Conversation

@mottodora (Member) commented Jun 28, 2018

  • Stratified Splitting for classification
  • Stratified Splitting for regression
  • Unit tests for classification
  • Unit tests for regression
  • fix seed
  • Documents
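The core idea of the classification splitter can be sketched with plain NumPy: group indices by class, then let each class contribute proportionally to the train/valid/test sets. The helper name and fractions below are illustrative, not the PR's actual API:

```python
import numpy

def stratified_split_indices(labels, frac_train=0.5, frac_valid=0.3, seed=0):
    """Illustrative stratified split: each class contributes proportionally
    to the train/valid/test index sets (hypothetical helper)."""
    rng = numpy.random.RandomState(seed)
    labels = numpy.asarray(labels)
    train, valid, test = [], [], []
    for cls in numpy.unique(labels):
        # shuffle the indices of this class, then cut proportionally
        cls_idx = rng.permutation(numpy.where(labels == cls)[0])
        n = len(cls_idx)
        n_train = int(numpy.floor(frac_train * n))
        n_valid = int(numpy.floor(frac_valid * n))
        train.extend(cls_idx[:n_train])
        valid.extend(cls_idx[n_train:n_train + n_valid])
        test.extend(cls_idx[n_train + n_valid:])
    return numpy.array(train), numpy.array(valid), numpy.array(test)
```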

@mottodora mottodora mentioned this pull request Jun 28, 2018
3 tasks
@mottodora mottodora changed the title Stratified Splitter [WIP] Stratified Splitter Jun 28, 2018
@codecov-io commented Jun 29, 2018

Codecov Report

Merging #201 into master will increase coverage by 2.42%.
The diff coverage is 98.86%.

@@            Coverage Diff             @@
##           master     #201      +/-   ##
==========================================
+ Coverage   79.01%   81.43%   +2.42%     
==========================================
  Files          95      101       +6     
  Lines        4193     4714     +521     
==========================================
+ Hits         3313     3839     +526     
+ Misses        880      875       -5

@mottodora mottodora changed the title [WIP] Stratified Splitter Stratified Splitter Jun 29, 2018

seed = kwargs.get('seed', None)
labels_feature_id = kwargs.get('labels_feature_id', -1)
Member:

It specifies the axis, so I feel label_axis is better.


seed = kwargs.get('seed', None)
labels_feature_id = kwargs.get('labels_feature_id', -1)
task_id = kwargs.get('task_id', 0)
Member:

How about task_index?

rng = numpy.random.RandomState(seed)

if not isinstance(dataset, NumpyTupleDataset):
raise NotImplementedError
Member:

Let me raise an issue
#206

labels = labels_feature[:, task_id]

if labels.dtype.kind == 'i':
classes, label_indices = numpy.unique(labels, return_inverse=True)
Member:

label_indices -> class_indices may be less confusing?

Member Author:

class_indices is already used. I will use labels.

else:
labels = labels_feature[:, task_id]

if labels.dtype.kind == 'i':
Member:

I think it is better to add an option for choosing the algorithm, 'classification' or 'regression'. The default behavior of inferring it automatically from the dtype is fine, but we may have a regression task with integer-valued labels.

if labels.dtype.kind == 'i':
classes, label_indices = numpy.unique(labels, return_inverse=True)
elif labels.dtype.kind == 'f':
n_bin = 10
Member:

Can you add it as an optional kwarg?
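For the regression path, one way to reuse the classification machinery is to bucket the float labels into equal-frequency bins and then stratify on the bin index, with the bin count exposed as a kwarg as requested above. A minimal sketch; the helper name and binning scheme are assumptions, not necessarily what the PR does:

```python
import numpy

def bin_float_labels(labels, n_bin=10):
    """Bucket continuous labels into n_bin equal-frequency bins so the
    regression case can reuse classification-style stratification
    (illustrative only)."""
    labels = numpy.asarray(labels, dtype=float)
    order = numpy.argsort(labels, kind='stable')
    bin_indices = numpy.empty(len(labels), dtype=int)
    # assign roughly equal-sized bins over the sorted order
    bin_indices[order] = numpy.arange(len(labels)) * n_bin // len(labels)
    return bin_indices
```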

@mottodora (Member Author):

resolve #206

else:
labels = labels[:, task_index]

if task_type == 'infer':
Member:

How about 'auto' instead of 'infer'?
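The 'infer'/'auto' behavior under discussion can be combined with the earlier suggestion of an explicit task-type option, so that an integer-labeled regression task can override the dtype heuristic. A hedged sketch, with assumed names:

```python
import numpy

def resolve_task_type(labels, task_type='auto'):
    """Pick the splitting algorithm. 'auto' infers from dtype; an explicit
    'classification'/'regression' overrides it (sketch only)."""
    if task_type != 'auto':
        if task_type not in ('classification', 'regression'):
            raise ValueError('unknown task_type: {}'.format(task_type))
        return task_type
    labels = numpy.asarray(labels)
    if labels.dtype.kind == 'i':
        return 'classification'
    elif labels.dtype.kind == 'f':
        return 'regression'
    raise ValueError('cannot infer task type from dtype {}'.format(labels.dtype))
```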

from chainer_chemistry.datasets.numpy_tuple_dataset import NumpyTupleDataset


def _approximate_mode(class_counts, n_draws):
Member:

Please add a comment with the URL you referred to.
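`_approximate_mode` distributes a fixed number of draws across classes in proportion to the class counts while keeping the total exact. The idea (mirroring scikit-learn's helper of the same name; tie-breaking is simplified here) can be sketched as:

```python
import numpy

def approximate_mode(class_counts, n_draws, seed=0):
    """Distribute n_draws across classes proportionally to class_counts,
    rounding so the result sums exactly to n_draws (illustrative sketch)."""
    rng = numpy.random.RandomState(seed)
    class_counts = numpy.asarray(class_counts, dtype=float)
    continuous = class_counts / class_counts.sum() * n_draws
    floored = numpy.floor(continuous)
    remainder = int(n_draws - floored.sum())
    if remainder > 0:
        # give leftover draws to the classes with the largest fractional parts,
        # breaking ties with tiny random jitter (simplification)
        fractional = continuous - floored
        order = numpy.argsort(-(fractional + rng.uniform(0, 1e-9, len(fractional))))
        floored[order[:remainder]] += 1
    return floored.astype(int)
```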


n_classes = classes.shape[0]
n_total_valid = int(numpy.floor(frac_valid * len(dataset)))
n_total_test = int(numpy.floor(frac_test * len(dataset)))
Member:

Please put a comment, e.g.:
# n_total_train is the remainder: n - n_total_valid - n_total_test

Member Author:

I wrote the comments in another place.
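The size computation in the snippet above can be restated as a standalone helper: valid and test get floored fractions of the dataset size, and train absorbs the remainder (function name is illustrative):

```python
import numpy

def split_sizes(n, frac_valid=0.1, frac_test=0.1):
    """Compute split sizes: valid and test are floored fractions,
    train is the remainder, so the three always sum to n."""
    n_valid = int(numpy.floor(frac_valid * n))
    n_test = int(numpy.floor(frac_test * n))
    n_train = n - n_valid - n_test
    return n_train, n_valid, n_test
```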


return rng.permutation(train_index),\
rng.permutation(valid_index),\
rng.permutation(test_index),
Member:

I think it is OK to just return the arrays.

>>> c = numpy.random.random((10, 1))
>>> d = NumpyTupleDataset(a, b, c)
>>> splitter = StratifiedSplitter()
>>> splitter.train_valid_test_split()
Member:

remove this line

>>> c = numpy.random.random((10, 1))
>>> d = NumpyTupleDataset(a, b, c)
>>> splitter = StratifiedSplitter()
>>> splitter.train_valid_split()
Member:

remove this line

raise ValueError("Please assign label dataset.")
labels = dataset.features[:, label_axis]

if len(labels.shape) == 1:
Member:

labels.ndim

if not isinstance(dataset, NumpyTupleDataset):
raise ValueError("Please assign label dataset.")
labels = dataset.features[:, label_axis]

Member:

if isinstance(labels, list):
labels = numpy.array(labels)

@@ -0,0 +1,330 @@
[
Member:

Can you remove this file from the commit? Also, please add .pytest_cache to .gitignore.

Member Author:

I already removed this file.

assert train_ind.shape[0] == 15
assert valid_ind.shape[0] == 9
assert test_ind.shape[0] == 6

Member:

How about checking that the number of samples in class 0 and class 1 is larger than some amount, to ensure the splitting is actually stratified?
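The check suggested above could look like this in a test: compare per-split class frequencies against the overall frequencies. Helper name and tolerance are hypothetical:

```python
import numpy

def check_stratified(labels, train_ind, valid_ind, test_ind, tol=0.1):
    """Verify each split preserves the overall class proportions
    within a tolerance (sketch of the suggested test assertion)."""
    labels = numpy.asarray(labels)
    overall = numpy.bincount(labels) / len(labels)
    for ind in (train_ind, valid_ind, test_ind):
        frac = numpy.bincount(labels[ind], minlength=len(overall)) / len(ind)
        assert numpy.all(numpy.abs(frac - overall) <= tol)
```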

assert test_ind.shape[0] == 6


def test_classification_split_by_labels_list(cls_dataset, cls_label):
Member:

How about checking that the number of samples in class 0 and class 1 is larger than some amount, to ensure the splitting is actually stratified?

@mottodora (Member Author):

Updated.

@corochann (Member):

LGTM

@corochann corochann merged commit 33b6f16 into chainer:master Jul 3, 2018
@mottodora mottodora added this to the 0.4.0 milestone Jul 3, 2018