[memo] High memory consumption and the places of doubts #180

Open
nabenabe0928 opened this issue Apr 20, 2021 · 5 comments

nabenabe0928 commented Apr 20, 2021

I am writing down the current memory usage as a memo, in case we encounter memory leak issues in the future.
This post is based on the current implementation.

When we run a dataset with a size of 300B, AutoPyTorch consumes ~1.5 GB, and the following are the major sources of memory consumption:

| Source | Consumption [GB] |
| --- | --- |
| Import modules | 0.35 |
| Dask Client | 0.35 |
| Logger (Thread safe) | 0.4 |
| Running of `context.Process` in the `multiprocessing` module | 0.4 |
| Model | 0 ~ inf |
| Total | 1.5 ~ inf |

When we run a dataset with a size of 300 MB (400,000 instances × 80 features) such as Albert, AutoPyTorch consumes ~2.5 GB, and the following are the major sources of memory consumption:

| Source | Consumption [GB] |
| --- | --- |
| Import modules | 0.35 |
| Dask Client | 0.35 |
| Logger (Thread safe) | 0.4 |
| Dataset itself | 0.3 |
| `self.categories` in InputValidator | 0.3 |
| Running of `context.Process` in the `multiprocessing` module | 0.4 |
| Model (e.g. LightGBM) | 0.4 ~ inf |
| Total | 2.5 ~ inf |

All the information was obtained by:

```
$ mprof run --include-children python -m examples.tabular.20_basics.example_tabular_classification
```

and a logger that I set up for debugging. Note that I also added `time.sleep(0.5)` before and after each line of interest to rule out the influence of other elements, and checked each line in detail.
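
For reference, below is a minimal sketch of the per-line check described above, assuming `psutil` is available to read the process RSS; the helper names are hypothetical and not part of the AutoPyTorch code base:

```python
import time

import psutil


def rss_gb() -> float:
    """Resident set size of the current process in GB (hypothetical helper)."""
    return psutil.Process().memory_info().rss / 1024 ** 3


def check_line(label, fn):
    """Run a single call wrapped in sleeps so it stands out in the mprof timeline."""
    time.sleep(0.5)            # quiet period before the line of interest
    before = rss_gb()
    result = fn()              # the line of interest
    after = rss_gb()
    time.sleep(0.5)            # quiet period after the line of interest
    print(f"{label}: +{after - before:.2f} GB (total {after:.2f} GB)")
    return result


if __name__ == "__main__":
    check_line("import torch", lambda: __import__("torch"))
```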

@ArlindKadra

Interesting :). I think the analysis should in the future also be extended to the following datasets:

https://archive.ics.uci.edu/ml/datasets/covertype
https://archive.ics.uci.edu/ml/datasets/HIGGS
https://archive.ics.uci.edu/ml/datasets/Poker+Hand

They proved tricky.

nabenabe0928 self-assigned this May 5, 2021
@nabenabe0928

FYI, when we use Optuna with a tiny model, it consumes only around 150 MB.
This module is also thread safe.

```python
import optuna


def objective(trial):
    x0 = trial.suggest_uniform('x0', -10, 10)
    x1 = trial.suggest_uniform('x1', -10, 10)
    return x0 ** 2 + x1 ** 2


if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(objective, n_trials=5000, n_jobs=4)
```


nabenabe0928 commented May 20, 2021

I tested the memory usage for the following datasets:

| Dataset name | # of features | # of instances | Approx. data size [MB] |
| --- | --- | --- | --- |
| Covertype | 55 | 581,012 | 60 ~ 240 |
| Higgs | 29 | 98,050 | 5 ~ 20 |
| Poker Hand | 11 | 1,025,009 | 22 ~ 90 |

The details of the memory usage are as follows:

| Source | Covertype [GB] | Higgs [GB] | Poker Hand [GB] |
| --- | --- | --- | --- |
| Import modules | 0.35 | 0.35 | 0.35 |
| Dask Client | 0.35 | 0.35 | 0.35 |
| Logger (Thread safe) | 0.35 | 0.35 | 0.35 |
| Dataset itself | 0.1 | 0.05 | 0.1 |
| `self.categories` in InputValidator | 0 | 0 | 0.02 |
| Running of `context.Process` in the `multiprocessing` module | 0.4 | 0.4 | 0.4 |
| LightGBM | 0.6 | 0.1 | 0.3 |
| CatBoost | 0.8 | 0.1 | 0.6 |
| Random Forest | 1.2 | 0.5 | 1.0 |
| Extra Trees | 1.2 | 0.2 | 1.1 |
| SVM | 0.9 | 0.2 | 0.6 |
| KNN | 0.8 | - | 0.4 |
| Total | 2.0 ~ | 1.5 ~ | 1.7 ~ |

Note that KNN failed on Higgs, and some trainings for each dataset were canceled because of out-of-memory errors.
This time I set memory_limit = 4096, but I somehow got out-of-memory errors at lower values such as 2.5 ~ 3.0 GB.
It is probably better to check whether this works well on the latest branch as well.


nabenabe0928 commented Jun 21, 2021

This is from #259 by @franchuterivera.

  • We should not let the datamanager actively reside in memory when we are not using it. For example, there is no need to have a datamanager in smbo.
  • Also, after search has saved the datamanager to disk, we can delete and garbage collect it.
  • We should also garbage collect it and challenge the need for the datamanager in the evaluator.
  • We should improve the cross-validation handling of the out-of-fold predictions. Rather than having a list that contains the OOF predictions here, we should have a fixed array of n_samples created once at the beginning. OOF predictions from the k-fold model should be added smartly to this pre-existing array, something like `self.Y_optimization[test_indices] = opt_pred`. This way the predictions are sorted and can be used directly by ensemble selection without the need to save this array (see the first sketch after this list).
  • Calculating the train loss should be optional, not done by default here. We should avoid calling predict when it is not strictly needed.
  • As already reported by @nabenabe0928, the biggest contribution comes from importing modules. In particular, just doing `import torch` consumes 2 GB of peak virtual memory, and the majority of the time this happens only for mypy typing. We should encapsulate these calls under `typing.TYPE_CHECKING` and only import the strictly needed classes from PyTorch (see the second sketch below).
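
For the out-of-fold bullet above, here is a minimal sketch of the preallocated-array idea. The splitter and model are just placeholders; only the variable names `Y_optimization`, `test_indices`, and `opt_pred` mirror the ones quoted above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
n_classes = 2

# One fixed (n_samples, n_classes) array created once at the beginning,
# instead of a growing list of per-fold prediction arrays.
Y_optimization = np.full((len(X), n_classes), np.nan)

for train_indices, test_indices in KFold(n_splits=5).split(X):
    model = LogisticRegression().fit(X[train_indices], y[train_indices])
    opt_pred = model.predict_proba(X[test_indices])
    # OOF predictions land directly in their final, sorted positions, so
    # ensemble selection can consume the array without re-sorting or copying.
    Y_optimization[test_indices] = opt_pred
```

And for the last bullet, a sketch of the `typing.TYPE_CHECKING` guard; the module content is illustrative, not a specific AutoPyTorch file:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by mypy / IDEs, never at runtime, so the large peak
    # memory of `import torch` is not paid just for type annotations.
    import torch


def train_step(model: "torch.nn.Module") -> float:
    # The heavy import happens lazily, only when the function is actually called.
    import torch
    return float(torch.zeros(1).item())
```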

@nabenabe0928

Check if we can use a generator instead of np.ndarray (see the sketch below).
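
A minimal sketch of the generator idea, assuming the consumer only needs to iterate over the batches once; the batch producer here is purely illustrative:

```python
from typing import Iterator

import numpy as np


def batches_as_array(n_batches: int, batch_size: int, n_features: int) -> np.ndarray:
    # Materializes every batch at once: memory grows linearly with n_batches.
    return np.random.rand(n_batches, batch_size, n_features)


def batches_as_generator(n_batches: int, batch_size: int, n_features: int) -> Iterator[np.ndarray]:
    # Yields one batch at a time: only a single batch is alive in memory.
    for _ in range(n_batches):
        yield np.random.rand(batch_size, n_features)


if __name__ == "__main__":
    total = 0.0
    for batch in batches_as_generator(1000, 256, 80):
        total += float(batch.sum())  # consume and discard each batch immediately
    print(total)
```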
