Conversation

@teresaconc
Contributor

Fixes issue #538: Floating point issues when choosing different holdout set train sizes.

Pass train_size and test_size as integers (numbers of samples) instead of passing floats (ratios). This way we prevent the floating point errors in the multiplications (n_samples * train_size and n_samples * test_size) that happened in some specific cases.
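A minimal sketch of the idea (hypothetical sizes, not the actual auto-sklearn code):

import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 900
train_size = 0.7  # requested ratio

# Before: deriving the test ratio in floating point drifts off 0.3.
test_size = 1 - train_size  # 0.30000000000000004

# After: pass exact integer sample counts instead, so no ratio
# arithmetic is left to floating point.
n_train = int(np.floor(train_size * n_samples))  # 630
n_test = n_samples - n_train                     # 270
cv = ShuffleSplit(n_splits=1, train_size=n_train,
                  test_size=n_test, random_state=1)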

@teresaconc teresaconc changed the title Fix #538 Floating point issue in Fix #538 Nov 30, 2018
@teresaconc teresaconc changed the title Fix #538 Fix floating point issues (Issue #538) Nov 30, 2018
@mfeurer mfeurer changed the base branch from master to development November 30, 2018 12:22
Contributor

@mfeurer mfeurer left a comment

Thanks a lot for your fix. Could you please also add a unit test to test/test_evaluation/test_train_evaluator.py?

@codecov-io

codecov-io commented Nov 30, 2018

Codecov Report

Merging #589 into development will decrease coverage by 0.03%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##           development     #589      +/-   ##
===============================================
- Coverage        78.65%   78.61%   -0.04%     
===============================================
  Files              130      130              
  Lines            10119    10120       +1     
===============================================
- Hits              7959     7956       -3     
- Misses            2160     2164       +4
Impacted Files Coverage Δ
autosklearn/evaluation/train_evaluator.py 93.77% <100%> (+0.02%) ⬆️
..._preprocessing/select_percentile_classification.py 86.2% <0%> (-3.45%) ⬇️
autosklearn/automl.py 81.18% <0%> (-0.37%) ⬇️
...ipeline/components/classification/decision_tree.py 93.75% <0%> (ø) ⬆️
...rn/pipeline/components/regression/decision_tree.py 94.64% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@teresaconc
Contributor Author

@mfeurer I committed some changes and added unit tests.

I am uncertain about the supposed behaviour for multilabel classification tasks.
In the original train_evaluator.py, for the non-shuffle case we have tmp_train_size = int(np.floor(train_size * y.shape[0])). Since y is previously flattened through ravel(), if I have 10 samples with 2 labels, tmp_train_size will be train_size * 20 and not train_size * 10.

Should I keep this behaviour for both the shuffle and non-shuffle versions?
In other words, in this part of the code for MULTILABEL_CLASSIFICATION:

if shuffle:
    cv = ShuffleSplit(n_splits=1, train_size=train_size,
                      test_size=test_size, random_state=1)

should I give the train_size as a ratio of the flattened or non-flattened version?
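
For illustration (hypothetical shapes), the flattening in question:

import numpy as np

y = np.zeros((10, 2))      # 10 samples, 2 labels each
print(y.shape[0])          # 10
print(y.ravel().shape[0])  # 20 -- the flattened version doubles the count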

@mfeurer
Contributor

mfeurer commented Dec 3, 2018

Thanks for adding the tests. Looking at them, they appear to work on rather simple data. Do they fail without your fix? If not, could you please alter the data such that the unit test would fail without your fix?

Also, could you please remove the pytest_cache files from the pull request?

since y is previously flattened through ravel()

y is only ravelled for non-multiclass problems

should I give the train_size as a ratio of the flattened or non-flattened version?

The non-flattened version.

@teresaconc
Contributor Author

Looking at them they appear to work on rather simple data. Do they fail without your fix?

Yes, they fail without my fix. The issue was that for some train sizes the line test_size = 1 - train_size was yielding extra decimal digits; for example, test_size = 1 - 0.7 resulted in 0.30000000000000004, so the number of samples does not need to be particularly large for it to fail. I added two tests with a sample size of 900, but let me know if you would like something more elaborate.
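
A quick demonstration in plain Python (illustrative only, not scikit-learn's exact validation code):

import math

train_size = 0.7
test_size = 1 - train_size
print(test_size)  # 0.30000000000000004, not 0.3

# If sample counts are then derived by flooring/ceiling ratio * n,
# the drift can claim one sample too many:
n = 900
print(math.floor(train_size * n))  # 630
print(math.ceil(test_size * n))    # 271 -- and 630 + 271 > 900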

Also, could you please remove the pytest_cache files from the pull request?

Yeah, sorry about that. Done

y is only ravelled for non-multiclass problems.

y = D.data['Y_train'].ravel() is being used in get_splitter() for every type of problem. I moved it inside the following condition:

if D.info['task'] in CLASSIFICATION_TASKS and \
        D.info['task'] != MULTILABEL_CLASSIFICATION:
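
Roughly, the resulting structure (a sketch reusing the names from the excerpt above):

y = D.data['Y_train']
if D.info['task'] in CLASSIFICATION_TASKS and \
        D.info['task'] != MULTILABEL_CLASSIFICATION:
    y = y.ravel()  # flatten targets only for non-multilabel tasks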

@teresaconc
Contributor Author

I changed the solution to a simpler version that seems to work fine too. Instead of providing the number of samples as the train and test size (integers), I provide only the test_size (float). This way, sklearn won't do any problematic computation and train_size will always be the complement of test_size.

If you would rather go with the previous version I can revert the changes, but I think it is more readable and simpler this way.
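
A sketch of the simpler version (hypothetical ratio); with train_size left unset, scikit-learn takes the training fold as the complement of the test fold, so there is no second ratio computation to disagree with the first:

from sklearn.model_selection import ShuffleSplit

# train_size defaults to None, i.e. the complement of test_size.
cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=1)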

@mfeurer
Contributor

mfeurer commented Dec 4, 2018

Thanks for your comments and code updates. The code looks much simpler now. Could you please fix the PEP8 issues? Then I'd be happy to merge.

@teresaconc
Contributor Author

Done!

@mfeurer
Contributor

mfeurer commented Dec 6, 2018

I just deleted a .c file which I believe sneaked in by accident.

Thanks a lot!

@mfeurer mfeurer merged commit ff2a91c into automl:development Dec 6, 2018