Stratified Splitter #201
Conversation
mottodora commented Jun 28, 2018 (edited)
- Stratified splitting for classification
- Stratified splitting for regression
- Unit tests for classification
- Unit tests for regression
- Fix seed
- Documentation
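The classification variant in the checklist above could be sketched roughly as follows (a sketch only; the function and argument names are illustrative, not this PR's actual API): split each class's indices separately, so every split keeps the original class proportions.

```python
import numpy

def stratified_split_indices(labels, frac_train=0.8, frac_valid=0.1, seed=None):
    """Split dataset indices so each class appears in train/valid/test
    at roughly frac_train / frac_valid / (remainder) ratios."""
    rng = numpy.random.RandomState(seed)
    labels = numpy.asarray(labels)
    train, valid, test = [], [], []
    for cls in numpy.unique(labels):
        # shuffle this class's indices, then cut them proportionally
        idx = rng.permutation(numpy.where(labels == cls)[0])
        n_train = int(numpy.floor(frac_train * len(idx)))
        n_valid = int(numpy.floor(frac_valid * len(idx)))
        train.extend(idx[:n_train])
        valid.extend(idx[n_train:n_train + n_valid])
        test.extend(idx[n_train + n_valid:])
    return (rng.permutation(train), rng.permutation(valid),
            rng.permutation(test))
```

With two classes of 10 samples each and the default fractions, every class contributes 8/1/1 samples to train/valid/test.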
Codecov Report

@@            Coverage Diff             @@
##           master     #201      +/-   ##
==========================================
+ Coverage   79.01%   81.43%   +2.42%
==========================================
  Files          95      101       +6
  Lines        4193     4714     +521
==========================================
+ Hits         3313     3839     +526
+ Misses        880      875       -5
seed = kwargs.get('seed', None)
labels_feature_id = kwargs.get('labels_feature_id', -1)
It specifies the axis, so I feel label_axis is better.
seed = kwargs.get('seed', None)
labels_feature_id = kwargs.get('labels_feature_id', -1)
task_id = kwargs.get('task_id', 0)
How about task_index?
rng = numpy.random.RandomState(seed)

if not isinstance(dataset, NumpyTupleDataset):
    raise NotImplementedError
Let me raise an issue: #206
labels = labels_feature[:, task_id]

if labels.dtype.kind == 'i':
    classes, label_indices = numpy.unique(labels, return_inverse=True)
label_indices -> class_indices may be less confusing?
class_indices is already used. I will use labels.
else:
    labels = labels_feature[:, task_id]

if labels.dtype.kind == 'i':
I think it is better to add an option for which algorithm to use, 'classification' or 'regression'.
Inferring it automatically from dtype is fine as the default behavior, but we may face a regression task whose labels are stored as integers.
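One possible shape for that option (a sketch under assumptions; the kwarg name `task_type` and the value `'auto'` are illustrative, not the merged API): let an explicit value override the dtype-based inference.

```python
import numpy

def resolve_task_type(labels, task_type='auto'):
    """Pick the splitting algorithm. 'auto' keeps dtype-based inference
    as the default; an explicit value covers e.g. regression targets
    that happen to be stored as integers."""
    if task_type == 'auto':
        return 'classification' if labels.dtype.kind == 'i' else 'regression'
    if task_type not in ('classification', 'regression'):
        raise ValueError(
            "task_type must be 'auto', 'classification' or 'regression'")
    return task_type
```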
if labels.dtype.kind == 'i':
    classes, label_indices = numpy.unique(labels, return_inverse=True)
elif labels.dtype.kind == 'f':
    n_bin = 10
Can you add it as an optional kwarg?
resolve #206
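Exposing `n_bin` as a kwarg might look like the sketch below (an assumption about the approach, not the PR's exact code): bucket the continuous labels into quantile bins so the classification-style stratification can be reused.

```python
import numpy

def bin_continuous_labels(labels, n_bin=10):
    """Discretize float labels into n_bin quantile bins; each bin then
    acts as a pseudo-class for stratified splitting."""
    labels = numpy.asarray(labels, dtype=float)
    # interior quantile edges give roughly equal-sized bins
    edges = numpy.percentile(labels, numpy.linspace(0, 100, n_bin + 1)[1:-1])
    return numpy.digitize(labels, edges)
```

Quantile edges (rather than equal-width bins) keep the pseudo-classes balanced even for skewed label distributions.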
else:
    labels = labels[:, task_index]

if task_type == 'infer':
How about auto?
from chainer_chemistry.datasets.numpy_tuple_dataset import NumpyTupleDataset


def _approximate_mode(class_counts, n_draws):
Please add a comment with the URL you referred to.
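For readers of this thread: the helper follows the idea of scikit-learn's `_approximate_mode`. A simplified paraphrase of that idea (my sketch, not the PR's exact code) allocates draws proportionally and hands out rounding leftovers:

```python
import numpy

def approximate_mode(class_counts, n_draws):
    """Allocate n_draws among classes in proportion to class_counts:
    floor each proportional share, then give the remaining draws to the
    classes with the largest fractional parts."""
    continuous = class_counts / class_counts.sum() * n_draws
    floored = numpy.floor(continuous)
    need = int(n_draws - floored.sum())
    if need > 0:
        # one extra draw per class, largest remainders first
        remainder = continuous - floored
        inds = numpy.argsort(remainder)[::-1][:need]
        floored[inds] += 1
    return floored.astype(int)
```

(scikit-learn's version additionally breaks remainder ties with the RNG; this sketch omits that.)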
n_classes = classes.shape[0]
n_total_valid = int(numpy.floor(frac_valid * len(dataset)))
n_total_test = int(numpy.floor(frac_test * len(dataset)))
Please put a comment:
# n_total_train is the remainder: n - n_total_valid - n_total_test
I wrote comments in another place.
return rng.permutation(train_index),\
    rng.permutation(valid_index),\
    rng.permutation(test_index),
I think it is OK to just return the arrays.
>>> c = numpy.random.random((10, 1))
>>> d = NumpyTupleDataset(a, b, c)
>>> splitter = StratifiedSplitter()
>>> splitter.train_valid_test_split()
remove this line
>>> c = numpy.random.random((10, 1))
>>> d = NumpyTupleDataset(a, b, c)
>>> splitter = StratifiedSplitter()
>>> splitter.train_valid_split()
remove this line
    raise ValueError("Please assign label dataset.")
labels = dataset.features[:, label_axis]

if len(labels.shape) == 1:
labels.ndim
if not isinstance(dataset, NumpyTupleDataset):
    raise ValueError("Please assign label dataset.")
labels = dataset.features[:, label_axis]
if isinstance(labels, list):
    labels = numpy.array(labels)
.pytest_cache/v/cache/nodeids (Outdated)

@@ -0,0 +1,330 @@
[
Can you remove this file from the commit? If that is not done, please add .pytest_cache to .gitignore.
I already removed this file.
assert train_ind.shape[0] == 15
assert valid_ind.shape[0] == 9
assert test_ind.shape[0] == 6
How about checking that the counts of class 0 and class 1 are larger than some amount, to ensure the splitting is actually stratified?
assert test_ind.shape[0] == 6


def test_classification_split_by_labels_list(cls_dataset, cls_label):
How about checking that the counts of class 0 and class 1 are larger than some amount, to ensure the splitting is actually stratified?
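A hedged sketch of such a check (the labels and split indices below are toy stand-ins; the real test would use `cls_dataset` and the splitter's output): beyond the split sizes, assert each split keeps roughly the original class balance.

```python
import numpy

def check_class_balance(labels, indices, expected_frac, tol=0.1):
    """Assert the fraction of class 0 inside one split stays within
    tol of the fraction in the full label array."""
    frac = float(numpy.mean(numpy.asarray(labels)[numpy.asarray(indices)] == 0))
    assert abs(frac - expected_frac) < tol, (frac, expected_frac)

# toy example: 2/3 of the labels are class 0, and a stratified split
# should keep that ratio inside each part
labels = numpy.array([0] * 20 + [1] * 10)
train_ind = list(range(10)) + list(range(20, 25))  # 10 class-0, 5 class-1
check_class_balance(labels, train_ind, expected_frac=2.0 / 3.0)
```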
Updated.
LGTM