[memo] High memory consumption and the places of doubts #180

Open
nabenabe0928 opened this issue Apr 20, 2021 · 5 comments

nabenabe0928 commented Apr 20, 2021

I am writing down the current memory usage as a memo, in case we encounter memory leak issues in the future.
This post is based on the current implementation.

When we run a dataset with a size of 300B, AutoPyTorch consumes ~1.5 GB, and the following are the major sources of memory consumption:

| Source | Consumption [GB] |
| --- | --- |
| Import modules | 0.35 |
| Dask Client | 0.35 |
| Logger (Thread safe) | 0.4 |
| Running of `context.Process` in the `multiprocessing` module | 0.4 |
| Model | 0 ~ inf |
| Total | 1.5 ~ inf |

When we run a dataset with a size of 300 MB (400,000 instances × 80 features) such as Albert, AutoPyTorch consumes ~2.5 GB, and the following are the major sources of memory consumption:

| Source | Consumption [GB] |
| --- | --- |
| Import modules | 0.35 |
| Dask Client | 0.35 |
| Logger (Thread safe) | 0.4 |
| Dataset itself | 0.3 |
| `self.categories` in InputValidator | 0.3 |
| Running of `context.Process` in the `multiprocessing` module | 0.4 |
| Model (e.g. LightGBM) | 0.4 ~ inf |
| Total | 2.5 ~ inf |

All the information was obtained by:

```
$ mprof run --include-children python -m examples.tabular.20_basics.example_tabular_classification
```

and a logger that I set up for debugging. Note that I also added `time.sleep(0.5)` before and after each line of interest to rule out the influence of other elements, and checked each line in detail.
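
For reference, below is a minimal sketch of the per-line check described above, assuming `psutil` is available to read the process RSS; the helper names are hypothetical and not part of the AutoPyTorch code base:

```python
import time

import psutil


def rss_gb() -> float:
    """Resident set size of the current process in GB (hypothetical helper)."""
    return psutil.Process().memory_info().rss / 1024 ** 3


def check_line(label, fn):
    """Run a single call wrapped in sleeps so it stands out in the mprof timeline."""
    time.sleep(0.5)            # quiet period before the line of interest
    before = rss_gb()
    result = fn()              # the line of interest
    after = rss_gb()
    time.sleep(0.5)            # quiet period after the line of interest
    print(f"{label}: +{after - before:.2f} GB (total {after:.2f} GB)")
    return result


if __name__ == "__main__":
    check_line("import torch", lambda: __import__("torch"))
```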

@ArlindKadra

Interesting :). I think the analysis should in the future also be extended to the following datasets:

https://archive.ics.uci.edu/ml/datasets/covertype
https://archive.ics.uci.edu/ml/datasets/HIGGS
https://archive.ics.uci.edu/ml/datasets/Poker+Hand

They proved tricky.

nabenabe0928 self-assigned this May 5, 2021
@nabenabe0928

FYI, when we use Optuna with a tiny model, it consumes only around 150 MB.
This module is also thread safe.

```python
import optuna


def objective(trial):
    x0 = trial.suggest_uniform('x0', -10, 10)
    x1 = trial.suggest_uniform('x1', -10, 10)
    return x0 ** 2 + x1 ** 2


if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(objective, n_trials=5000, n_jobs=4)
```


nabenabe0928 commented May 20, 2021

I tested the memory usage for the following datasets:

| Dataset name | # of features | # of instances | Approx. data size [MB] |
| --- | --- | --- | --- |
| Covertype | 55 | 581,012 | 60 ~ 240 |
| Higgs | 29 | 98,050 | 5 ~ 20 |
| Poker Hand | 11 | 1,025,009 | 22 ~ 90 |

The details of the memory usage are as follows:

| Source | Covertype [GB] | Higgs [GB] | Poker Hand [GB] |
| --- | --- | --- | --- |
| Import modules | 0.35 | 0.35 | 0.35 |
| Dask Client | 0.35 | 0.35 | 0.35 |
| Logger (Thread safe) | 0.35 | 0.35 | 0.35 |
| Dataset itself | 0.1 | 0.05 | 0.1 |
| `self.categories` in InputValidator | 0 | 0 | 0.02 |
| Running of `context.Process` in the `multiprocessing` module | 0.4 | 0.4 | 0.4 |
| LightGBM | 0.6 | 0.1 | 0.3 |
| CatBoost | 0.8 | 0.1 | 0.6 |
| Random Forest | 1.2 | 0.5 | 1.0 |
| Extra Trees | 1.2 | 0.2 | 1.1 |
| SVM | 0.9 | 0.2 | 0.6 |
| KNN | 0.8 | - | 0.4 |
| Total | 2.0 ~ | 1.5 ~ | 1.7 ~ |

Note that KNN failed on Higgs, and some trainings for each dataset were canceled because of out-of-memory errors.
This time I set memory_limit = 4096, but I somehow got out-of-memory errors at lower values such as 2.5 ~ 3.0 GB.
It is probably better to check whether this works well on the latest branch as well.


nabenabe0928 commented Jun 21, 2021

This is from #259 by @franchuterivera.

  • We should not let the datamanager actively reside in memory when we are not using it. For example, there is no need to have a datamanager in smbo.
  • Also, after search has saved the datamanager to disk, we can delete and garbage collect it.
  • We should also garbage collect it and challenge the need for the datamanager in the evaluator.
  • We should improve the cross-validation handling of the out-of-fold predictions. Rather than having a list that contains the OOF predictions here, we should have a fixed array of n_samples created once at the beginning. OOF predictions from the k-fold model should be added smartly to this pre-existing array, something like `self.Y_optimization[test_indices] = opt_pred`. This way the predictions are sorted and can be used directly by ensemble selection without the need to save this array (see the first sketch after this list).
  • Calculating the train loss should be optional, not done by default here. We should avoid calling predict when it is not strictly needed.
  • As already reported by @nabenabe0928, the biggest contribution comes from importing modules. In particular, just doing `import torch` consumes 2 GB of peak virtual memory, and the majority of the time this happens only for mypy typing. We should encapsulate these calls under `typing.TYPE_CHECKING` and only import the strictly needed classes from PyTorch (see the second sketch below).
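
For the out-of-fold bullet above, here is a minimal sketch of the preallocated-array idea. The splitter and model are just placeholders; only the variable names `Y_optimization`, `test_indices`, and `opt_pred` mirror the ones quoted above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
n_classes = 2

# One fixed (n_samples, n_classes) array created once at the beginning,
# instead of a growing list of per-fold prediction arrays.
Y_optimization = np.full((len(X), n_classes), np.nan)

for train_indices, test_indices in KFold(n_splits=5).split(X):
    model = LogisticRegression().fit(X[train_indices], y[train_indices])
    opt_pred = model.predict_proba(X[test_indices])
    # OOF predictions land directly in their final, sorted positions, so
    # ensemble selection can consume the array without re-sorting or copying.
    Y_optimization[test_indices] = opt_pred
```

And for the last bullet, a sketch of the `typing.TYPE_CHECKING` guard; the module content is illustrative, not a specific AutoPyTorch file:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by mypy / IDEs, never at runtime, so the large peak
    # memory of `import torch` is not paid just for type annotations.
    import torch


def train_step(model: "torch.nn.Module") -> float:
    # The heavy import happens lazily, only when the function is actually called.
    import torch
    return float(torch.zeros(1).item())
```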

@nabenabe0928

Check if we can use a generator instead of np.ndarray (see the sketch below).
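
A minimal sketch of the generator idea, assuming the consumer only needs to iterate over the batches once; the batch producer here is purely illustrative:

```python
from typing import Iterator

import numpy as np


def batches_as_array(n_batches: int, batch_size: int, n_features: int) -> np.ndarray:
    # Materializes every batch at once: memory grows linearly with n_batches.
    return np.random.rand(n_batches, batch_size, n_features)


def batches_as_generator(n_batches: int, batch_size: int, n_features: int) -> Iterator[np.ndarray]:
    # Yields one batch at a time: only a single batch is alive in memory.
    for _ in range(n_batches):
        yield np.random.rand(batch_size, n_features)


if __name__ == "__main__":
    total = 0.0
    for batch in batches_as_generator(1000, 256, 80):
        total += float(batch.sum())  # consume and discard each batch immediately
    print(total)
```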
