
Mb disc space limit for ensemble #874

Merged: 7 commits merged into automl:development on Jul 3, 2020

Conversation

franchuterivera (Contributor)

Allow the user to specify the maximum number of megabytes of disc space that models are allowed to occupy.

The idea is to re-use the existing max-models-on-disc argument: if a float is provided, it is interpreted as the maximum number of megabytes of disc usage allowed. This keeps the control logic simple and improves usability, since it is simpler than adding a new argument.

The functionality is built around the worst-case disc usage per model: we determine the largest disc footprint a single model can incur and divide the user-specified megabyte budget by that number to obtain how many models may be kept (see the sketch after this description).

Test code for this is also added.
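A minimal sketch of that logic, with an illustrative function name and a simplified per-model cost estimate rather than the exact implementation:

import math
import os

def models_allowed_on_disc(max_models_on_disc, model_paths):
    # If an integer (or None) is given, it is already a model count.
    if max_models_on_disc is None or isinstance(max_models_on_disc, int):
        return max_models_on_disc
    # Otherwise treat the float as a megabyte budget: divide it by the
    # worst-case (largest) per-model disc footprint observed so far.
    costs_mb = [os.path.getsize(p) / math.pow(1024, 2) for p in model_paths]
    worst_case_mb = max(costs_mb) if costs_mb else 1.0
    return max(1, int(max_models_on_disc / worst_case_mb))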

Five review threads on autosklearn/ensemble_builder.py (outdated, resolved)
codecov-commenter commented Jun 17, 2020

Codecov Report

Merging #874 into development will increase coverage by 0.52%.
The diff coverage is 92.68%.


@@               Coverage Diff               @@
##           development     #874      +/-   ##
===============================================
+ Coverage        84.12%   84.65%   +0.52%     
===============================================
  Files              127      126       -1     
  Lines             9435     9246     -189     
===============================================
- Hits              7937     7827     -110     
+ Misses            1498     1419      -79     
Impacted Files Coverage Δ
autosklearn/ensemble_builder.py 73.72% <92.68%> (+2.63%) ⬆️
autosklearn/data/abstract_data_manager.py 77.02% <0.00%> (-12.17%) ⬇️
...mponents/feature_preprocessing/nystroem_sampler.py 85.29% <0.00%> (-5.89%) ⬇️
..._preprocessing/select_percentile_classification.py 86.20% <0.00%> (-3.45%) ⬇️
autosklearn/evaluation/__init__.py 80.54% <0.00%> (-2.17%) ⬇️
...ine/components/classification/gradient_boosting.py 91.89% <0.00%> (-0.91%) ⬇️
autosklearn/smbo.py 72.72% <0.00%> (-0.70%) ⬇️
autosklearn/data/competition_data_manager.py
autosklearn/estimators.py 90.41% <0.00%> (+0.05%) ⬆️
autosklearn/metrics/__init__.py 87.28% <0.00%> (+0.10%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d313f26...41ab718.

autosklearn/ensemble_builder.py (outdated, resolved review thread):
# Total on-disc size of this model's files, in bytes
this_model_cost = sum(os.path.getsize(path) for path in paths)

# Convert bytes to megabytes, rounded to two decimals
return round(this_model_cost / math.pow(1024, 2), 2)
Contributor:

Can this become zero? If yes, it's ambiguous with respect to the initial value of self.read_preds[y_ens_fn] and I suggest changing the initial value to -1 or None.

Contributor Author:

None will be the default value. If we fail to read the data structure, the prediction will be ignored from the calculation.
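A minimal sketch of how a None default could be handled when summing disc usage (the dictionary layout and the key "disc_space_cost_mb" are assumptions for illustration, not the actual attribute names):

def total_disc_consumption_mb(read_preds):
    # Entries whose disc cost could not be read stay at None and are
    # simply skipped in the total, as described above.
    total = 0.0
    for metadata in read_preds.values():
        cost_mb = metadata.get("disc_space_cost_mb")
        if cost_mb is None:
            continue
        total += cost_mb
    return total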

Two more review threads on autosklearn/ensemble_builder.py (outdated, resolved)
franchuterivera (Contributor Author)

The review comments have been addressed.
Here I also observed the kernel PCA error, as well as a fit jobs error.

This fit jobs error can also be reproduced on the development branch, so the fit jobs 2 test has to be improved. For instance, on my machine the assertion fails with AssertionError: 73 != 50, since only the top 50 models should exist on disc and the check performs an ls on the directory.

This is something for our todo list, but not related to this feature.
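A minimal sketch of the kind of directory check described above (the directory layout and file pattern are assumptions, not the actual test code):

import glob
import os

def count_models_on_disc(models_dir, pattern="*.model"):
    # Equivalent to running ls on the directory and counting model files;
    # a limit check would assert this count stays within the configured
    # maximum (e.g. 50).
    return len(glob.glob(os.path.join(models_dir, pattern)))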

mfeurer merged commit ffead2b into automl:development on Jul 3, 2020
franchuterivera added a commit to franchuterivera/auto-sklearn that referenced this pull request Aug 21, 2020
* Mb disc space limit for ensemble

* track disc consumption

* Solved artifacts of rebase

* py3.5 compatible print message

* Don't be pessimistic in Gb calc

* Incomporate comments

* Handle failure cases in ensemble disk space