Conversation

@teresaconc
Contributor

Fixes issue #538: Floating point issues when choosing different holdout set train sizes.

Pass train_size and test_size as integers (numbers of samples) instead of passing floats (ratios). This way we prevent the floating point errors in the multiplications (n_samples * train_size and n_samples * test_size) that happened in some specific cases.
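A minimal sketch of the idea (hypothetical sizes, not the actual auto-sklearn code):

import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 900
train_size = 0.7  # requested ratio

# Before: deriving the test ratio in floating point drifts off 0.3.
test_size = 1 - train_size  # 0.30000000000000004

# After: pass exact integer sample counts instead, so no ratio
# arithmetic is left to floating point.
n_train = int(np.floor(train_size * n_samples))  # 630
n_test = n_samples - n_train                     # 270
cv = ShuffleSplit(n_splits=1, train_size=n_train,
                  test_size=n_test, random_state=1)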

@teresaconc teresaconc changed the title Fix #538 Floating point issue in Fix #538 Nov 30, 2018
@teresaconc teresaconc changed the title Fix #538 Fix floating point issues (Issue #538) Nov 30, 2018
@mfeurer mfeurer changed the base branch from master to development November 30, 2018 12:22
Contributor

@mfeurer mfeurer left a comment

Thanks a lot for your fix. Could you please also add a unit test to test/test_evaluation/test_train_evaluator.py?

@codecov-io

codecov-io commented Nov 30, 2018

Codecov Report

Merging #589 into development will decrease coverage by 0.03%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##           development     #589      +/-   ##
===============================================
- Coverage        78.65%   78.61%   -0.04%     
===============================================
  Files              130      130              
  Lines            10119    10120       +1     
===============================================
- Hits              7959     7956       -3     
- Misses            2160     2164       +4
Impacted Files Coverage Δ
autosklearn/evaluation/train_evaluator.py 93.77% <100%> (+0.02%) ⬆️
..._preprocessing/select_percentile_classification.py 86.2% <0%> (-3.45%) ⬇️
autosklearn/automl.py 81.18% <0%> (-0.37%) ⬇️
...ipeline/components/classification/decision_tree.py 93.75% <0%> (ø) ⬆️
...rn/pipeline/components/regression/decision_tree.py 94.64% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@teresaconc
Contributor Author

@mfeurer I committed some changes and added unit tests.

I am uncertain about the supposed behaviour for multilabel classification tasks.
In the original train_evaluator.py, for the non-shuffle case we have tmp_train_size = int(np.floor(train_size * y.shape[0])). Since y is previously flattened through ravel(), if I have 10 samples with 2 labels, tmp_train_size will be train_size * 20 and not train_size * 10.

Should I keep this behaviour for both the shuffle and non-shuffle versions?
In other words, in this part of the code for MULTILABEL_CLASSIFICATION:

if shuffle:
    cv = ShuffleSplit(n_splits=1, train_size=train_size,
                      test_size=test_size, random_state=1)

should I give the train_size as a ratio of the flattened or non-flattened version?
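
For illustration (hypothetical shapes), the flattening in question:

import numpy as np

y = np.zeros((10, 2))      # 10 samples, 2 labels each
print(y.shape[0])          # 10
print(y.ravel().shape[0])  # 20 -- the flattened version doubles the count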

@mfeurer
Contributor

mfeurer commented Dec 3, 2018

Thanks for adding the tests. Looking at them, they appear to work on rather simple data. Do they fail without your fix? If not, could you please alter the data such that the unit test would fail without your fix?

Also, could you please remove the pytest_cache files from the pull request?

since y is previously flattened through ravel()

y is only ravelled for non-multiclass problems

should I give the train_size as a ratio of the flattened or non-flattened version?

The non-flattened version.

@teresaconc
Contributor Author

Looking at them they appear to work on rather simple data. Do they fail without your fix?

Yes, they fail without my fix. The issue was that for some train sizes the line test_size = 1 - train_size was yielding extra decimal digits; for example, test_size = 1 - 0.7 resulted in 0.30000000000000004, so the number of samples does not need to be particularly large for it to fail. I added two tests with a sample size of 900, but let me know if you would like something more elaborate.
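
A quick demonstration in plain Python (illustrative only, not scikit-learn's exact validation code):

import math

train_size = 0.7
test_size = 1 - train_size
print(test_size)  # 0.30000000000000004, not 0.3

# If sample counts are then derived by flooring/ceiling ratio * n,
# the drift can claim one sample too many:
n = 900
print(math.floor(train_size * n))  # 630
print(math.ceil(test_size * n))    # 271 -- and 630 + 271 > 900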

Also, could you please remove the pytest_cache files from the pull request?

Yeah, sorry about that. Done

y is only ravelled for non-multiclass problems.

y = D.data['Y_train'].ravel() is being used in get_splitter() for every type of problem. I moved it inside the following condition:

if D.info['task'] in CLASSIFICATION_TASKS and \
        D.info['task'] != MULTILABEL_CLASSIFICATION:
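
Roughly, the resulting structure (a sketch reusing the names from the excerpt above):

y = D.data['Y_train']
if D.info['task'] in CLASSIFICATION_TASKS and \
        D.info['task'] != MULTILABEL_CLASSIFICATION:
    y = y.ravel()  # flatten targets only for non-multilabel tasks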

@teresaconc
Contributor Author

I changed the solution to a simpler version that seems to work fine too. Instead of providing the number of samples as the train and test size (integers), I provide only the test_size (float). This way, sklearn won't do any problematic computation and train_size will always be the complement of test_size.

If you would rather go with the previous version I can revert the changes, but I think it is more readable and simpler this way.
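
A sketch of the simpler version (hypothetical ratio); with train_size left unset, scikit-learn takes the training fold as the complement of the test fold, so there is no second ratio computation to disagree with the first:

from sklearn.model_selection import ShuffleSplit

# train_size defaults to None, i.e. the complement of test_size.
cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=1)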

@mfeurer
Contributor

mfeurer commented Dec 4, 2018

Thanks for your comments and code updates. The code looks much simpler now. Could you please fix the PEP8 issues? Then I'd be happy to merge.

@teresaconc
Contributor Author

Done!

@mfeurer
Contributor

mfeurer commented Dec 6, 2018

I just deleted a .c file which I believe sneaked in by accident.

Thanks a lot!

@mfeurer mfeurer merged commit ff2a91c into automl:development Dec 6, 2018