
Stratified Splitter #201

Merged: 14 commits into chainer:master, Jul 3, 2018

Conversation
Conversation

@mottodora (Member) commented Jun 28, 2018

  • Stratified Splitting for classification
  • Stratified Splitting for regression
  • Unit tests for classification
  • Unit tests for regression
  • fix seed
  • Documents
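The core idea of the classification splitter can be sketched with plain NumPy: group indices by class, then let each class contribute proportionally to the train/valid/test sets. The helper name and fractions below are illustrative, not the PR's actual API:

```python
import numpy

def stratified_split_indices(labels, frac_train=0.5, frac_valid=0.3, seed=0):
    """Illustrative stratified split: each class contributes proportionally
    to the train/valid/test index sets (hypothetical helper)."""
    rng = numpy.random.RandomState(seed)
    labels = numpy.asarray(labels)
    train, valid, test = [], [], []
    for cls in numpy.unique(labels):
        # shuffle the indices of this class, then cut proportionally
        cls_idx = rng.permutation(numpy.where(labels == cls)[0])
        n = len(cls_idx)
        n_train = int(numpy.floor(frac_train * n))
        n_valid = int(numpy.floor(frac_valid * n))
        train.extend(cls_idx[:n_train])
        valid.extend(cls_idx[n_train:n_train + n_valid])
        test.extend(cls_idx[n_train + n_valid:])
    return numpy.array(train), numpy.array(valid), numpy.array(test)
```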

@mottodora mottodora mentioned this pull request Jun 28, 2018
3 tasks
@mottodora mottodora changed the title Stratified Splitter [WIP] Stratified Splitter Jun 28, 2018
@codecov-io commented Jun 29, 2018

Codecov Report

Merging #201 into master will increase coverage by 2.42%.
The diff coverage is 98.86%.

@@            Coverage Diff             @@
##           master     #201      +/-   ##
==========================================
+ Coverage   79.01%   81.43%   +2.42%     
==========================================
  Files          95      101       +6     
  Lines        4193     4714     +521     
==========================================
+ Hits         3313     3839     +526     
+ Misses        880      875       -5

@mottodora mottodora changed the title [WIP] Stratified Splitter Stratified Splitter Jun 29, 2018

seed = kwargs.get('seed', None)
labels_feature_id = kwargs.get('labels_feature_id', -1)
Member:

It specifies the axis, so I feel label_axis is better.


seed = kwargs.get('seed', None)
labels_feature_id = kwargs.get('labels_feature_id', -1)
task_id = kwargs.get('task_id', 0)
Member:

How about task_index?

rng = numpy.random.RandomState(seed)

if not isinstance(dataset, NumpyTupleDataset):
raise NotImplementedError
Member:

Let me raise an issue
#206

labels = labels_feature[:, task_id]

if labels.dtype.kind == 'i':
classes, label_indices = numpy.unique(labels, return_inverse=True)
Member:

label_indices -> class_indices may be less confusing?

Member Author:

class_indices is already used. I will use labels.

else:
labels = labels_feature[:, task_id]

if labels.dtype.kind == 'i':
Member:

I think it is better to add an option for choosing the algorithm, 'classification' or 'regression'. The default behavior of inferring it automatically from the dtype is fine, but we may have a regression task with integer-valued labels.

if labels.dtype.kind == 'i':
classes, label_indices = numpy.unique(labels, return_inverse=True)
elif labels.dtype.kind == 'f':
n_bin = 10
Member:

Can you add it as an optional kwarg?
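For the regression path, one way to reuse the classification machinery is to bucket the float labels into equal-frequency bins and then stratify on the bin index, with the bin count exposed as a kwarg as requested above. A minimal sketch; the helper name and binning scheme are assumptions, not necessarily what the PR does:

```python
import numpy

def bin_float_labels(labels, n_bin=10):
    """Bucket continuous labels into n_bin equal-frequency bins so the
    regression case can reuse classification-style stratification
    (illustrative only)."""
    labels = numpy.asarray(labels, dtype=float)
    order = numpy.argsort(labels, kind='stable')
    bin_indices = numpy.empty(len(labels), dtype=int)
    # assign roughly equal-sized bins over the sorted order
    bin_indices[order] = numpy.arange(len(labels)) * n_bin // len(labels)
    return bin_indices
```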

@mottodora (Member Author):

resolve #206

else:
labels = labels[:, task_index]

if task_type == 'infer':
Member:

How about 'auto' instead of 'infer'?
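The 'infer'/'auto' behavior under discussion can be combined with the earlier suggestion of an explicit task-type option, so that an integer-labeled regression task can override the dtype heuristic. A hedged sketch, with assumed names:

```python
import numpy

def resolve_task_type(labels, task_type='auto'):
    """Pick the splitting algorithm. 'auto' infers from dtype; an explicit
    'classification'/'regression' overrides it (sketch only)."""
    if task_type != 'auto':
        if task_type not in ('classification', 'regression'):
            raise ValueError('unknown task_type: {}'.format(task_type))
        return task_type
    labels = numpy.asarray(labels)
    if labels.dtype.kind == 'i':
        return 'classification'
    elif labels.dtype.kind == 'f':
        return 'regression'
    raise ValueError('cannot infer task type from dtype {}'.format(labels.dtype))
```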

from chainer_chemistry.datasets.numpy_tuple_dataset import NumpyTupleDataset


def _approximate_mode(class_counts, n_draws):
Member:

Please add a comment with the URL you referred to.
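`_approximate_mode` distributes a fixed number of draws across classes in proportion to the class counts while keeping the total exact. The idea (mirroring scikit-learn's helper of the same name; tie-breaking is simplified here) can be sketched as:

```python
import numpy

def approximate_mode(class_counts, n_draws, seed=0):
    """Distribute n_draws across classes proportionally to class_counts,
    rounding so the result sums exactly to n_draws (illustrative sketch)."""
    rng = numpy.random.RandomState(seed)
    class_counts = numpy.asarray(class_counts, dtype=float)
    continuous = class_counts / class_counts.sum() * n_draws
    floored = numpy.floor(continuous)
    remainder = int(n_draws - floored.sum())
    if remainder > 0:
        # give leftover draws to the classes with the largest fractional parts,
        # breaking ties with tiny random jitter (simplification)
        fractional = continuous - floored
        order = numpy.argsort(-(fractional + rng.uniform(0, 1e-9, len(fractional))))
        floored[order[:remainder]] += 1
    return floored.astype(int)
```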


n_classes = classes.shape[0]
n_total_valid = int(numpy.floor(frac_valid * len(dataset)))
n_total_test = int(numpy.floor(frac_test * len(dataset)))
Member:

Please put a comment, e.g.:
# n_total_train is the remainder: n - n_total_valid - n_total_test

Member Author:

I wrote the comments in another place.
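The size computation in the snippet above can be restated as a standalone helper: valid and test get floored fractions of the dataset size, and train absorbs the remainder (function name is illustrative):

```python
import numpy

def split_sizes(n, frac_valid=0.1, frac_test=0.1):
    """Compute split sizes: valid and test are floored fractions,
    train is the remainder, so the three always sum to n."""
    n_valid = int(numpy.floor(frac_valid * n))
    n_test = int(numpy.floor(frac_test * n))
    n_train = n - n_valid - n_test
    return n_train, n_valid, n_test
```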


return rng.permutation(train_index),\
rng.permutation(valid_index),\
rng.permutation(test_index),
Member:

I think it is OK to just return the arrays.

>>> c = numpy.random.random((10, 1))
>>> d = NumpyTupleDataset(a, b, c)
>>> splitter = StratifiedSplitter()
>>> splitter.train_valid_test_split()
Member:

remove this line

>>> c = numpy.random.random((10, 1))
>>> d = NumpyTupleDataset(a, b, c)
>>> splitter = StratifiedSplitter()
>>> splitter.train_valid_split()
Member:

remove this line

raise ValueError("Please assign label dataset.")
labels = dataset.features[:, label_axis]

if len(labels.shape) == 1:
Member:

labels.ndim

if not isinstance(dataset, NumpyTupleDataset):
raise ValueError("Please assign label dataset.")
labels = dataset.features[:, label_axis]

Member:

if isinstance(labels, list):
labels = numpy.array(labels)

@@ -0,0 +1,330 @@
[
Member:

Can you remove this file from the commit? Also, please add .pytest_cache to .gitignore.

Member Author:

I already removed this file.

assert train_ind.shape[0] == 15
assert valid_ind.shape[0] == 9
assert test_ind.shape[0] == 6

Member:

How about checking that the number of samples in class 0 and class 1 is larger than some amount, to ensure the splitting is actually stratified?
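The check suggested above could look like this in a test: compare per-split class frequencies against the overall frequencies. Helper name and tolerance are hypothetical:

```python
import numpy

def check_stratified(labels, train_ind, valid_ind, test_ind, tol=0.1):
    """Verify each split preserves the overall class proportions
    within a tolerance (sketch of the suggested test assertion)."""
    labels = numpy.asarray(labels)
    overall = numpy.bincount(labels) / len(labels)
    for ind in (train_ind, valid_ind, test_ind):
        frac = numpy.bincount(labels[ind], minlength=len(overall)) / len(ind)
        assert numpy.all(numpy.abs(frac - overall) <= tol)
```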

assert test_ind.shape[0] == 6


def test_classification_split_by_labels_list(cls_dataset, cls_label):
Member:

How about checking that the number of samples in class 0 and class 1 is larger than some amount, to ensure the splitting is actually stratified?

@mottodora (Member Author):

Updated.

@corochann (Member):

LGTM

@corochann corochann merged commit 33b6f16 into chainer:master Jul 3, 2018
@mottodora mottodora added this to the 0.4.0 milestone Jul 3, 2018