BLAS : Program is Terminated. Because you tried to allocate too many memory regions. #1020

Closed
Innixma opened this issue Mar 16, 2021 · 7 comments · Fixed by #1722


Innixma commented Mar 16, 2021

When training K-nearest-neighbors (KNN) models, a rare error can occur that crashes the entire process:

BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
Segmentation fault: 11

It has so far only occurred on machines with >300 GB of memory (2 confirmed instances). In both cases, the system was not low on memory and had plenty to complete the task, yet the error still occurred.

@Innixma Innixma added bug Something isn't working help wanted Contributions welcome! urgent module: tabular labels Mar 16, 2021
@Innixma Innixma added this to the 0.2 Release milestone Mar 16, 2021

Innixma commented Mar 16, 2021

Current theory based on OpenMathLib/OpenBLAS#1882:

When sampling KNN with 96 threads, AutoGluon calls .fit many times in quick succession, and each fit call creates 96 threads for parallel fitting/inference. Because this repeats so rapidly, Python cannot clean up the threads faster than they are created, which causes BLAS to error.

@rakshithvasudev

I can confirm this is still happening. Any best practices you recommend to alleviate this bug?

Fitting model: ExtraTreesEntr_BAG_L2 ... Training model for up to 761.1s of the 761.0s of remaining time.
        Time limit exceeded... Skipping ExtraTreesEntr_BAG_L2.
Fitting model: KNeighborsUnif_BAG_L2 ... Training model for up to 448.45s of the 448.35s of remaining time.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
[the line above is repeated 56 times in total]
Segmentation fault (core dumped)


Innixma commented Apr 23, 2021

@rakshithvasudev I suspect it occurs when you have many CPUs. Could you tell me how many CPUs your machine has? One option is to specify your hyperparameters explicitly, and set n_jobs for KNN to a small enough value to stop the error from occurring. This is a work-around until a better solution is implemented.

@rakshithvasudev

@Innixma Thanks for your response. I have 128 cores on 2 CPUs. How do I set n_jobs explicitly for the KNN model only?

I'm very new to autogluon. I'm using the TabularPredictor.

TabularPredictor(label=label, path=save_path, eval_metric="f1").fit(train_data, time_limit=6000, presets='high_quality_fast_inference_only_refit')

Including the full log in case that helps:

        Train Data (Processed) Memory Usage: 904.79 MB (0.2% of available memory)
Data preprocessing and feature engineering runtime = 58.75s ...
AutoGluon will gauge predictive performance using evaluation metric: 'f1'
        To change this, specify the eval_metric argument of fit()
Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 2970.62s of the 5941.24s of remaining time.
        0.7243   = Validation f1 score
        2130.26s         = Training runtime
        7.81s    = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ... Training model for up to 824.57s of the 3795.18s of remaining time.
        Time limit exceeded... Skipping RandomForestEntr_BAG_L1.
Fitting model: ExtraTreesGini_BAG_L1 ... Training model for up to 631.98s of the 3602.59s of remaining time.
        Time limit exceeded... Skipping ExtraTreesGini_BAG_L1.
Fitting model: ExtraTreesEntr_BAG_L1 ... Training model for up to 282.32s of the 3252.94s of remaining time.
        Time limit exceeded... Skipping ExtraTreesEntr_BAG_L1.
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 2873.03s of remaining time.
        0.7243   = Validation f1 score
        0.85s    = Training runtime
        3.35s    = Validation runtime
Fitting model: RandomForestGini_BAG_L2 ... Training model for up to 2868.72s of the 2868.61s of remaining time.
        0.9371   = Validation f1 score
        1622.53s         = Training runtime
        7.18s    = Validation runtime
Fitting model: RandomForestEntr_BAG_L2 ... Training model for up to 1231.22s of the 1231.12s of remaining time.
        Time limit exceeded... Skipping RandomForestEntr_BAG_L2.
Fitting model: ExtraTreesGini_BAG_L2 ... Training model for up to 1089.88s of the 1089.78s of remaining time.
        Time limit exceeded... Skipping ExtraTreesGini_BAG_L2.
Fitting model: ExtraTreesEntr_BAG_L2 ... Training model for up to 761.1s of the 761.0s of remaining time.
        Time limit exceeded... Skipping ExtraTreesEntr_BAG_L2.
Fitting model: KNeighborsUnif_BAG_L2 ... Training model for up to 448.45s of the 448.35s of remaining time.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.


Innixma commented Apr 23, 2021

Thanks for the info!

To understand how to specify custom hyperparameters, refer to the hyperparameters argument documentation: https://auto.gluon.ai/stable/_modules/autogluon/tabular/predictor/predictor.html#TabularPredictor.fit

The default hyperparameters are:

hyperparameters = {
    'NN': {},
    'GBM': [
        {},
        {'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}},
        'GBMLarge',
    ],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [
        {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'mse', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'XT': [
        {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'mse', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'KNN': [
        {'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}},
        {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}},
    ],
}

You would want to edit KNN (example to enforce using only 16 cores):

'KNN': [
    {'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}, 'n_jobs': 16},
    {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}, 'n_jobs': 16},
],
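The KNN edit above can also be applied programmatically, so the rest of the hyperparameters dict stays untouched. This is only a minimal sketch: `cap_knn_n_jobs` is a hypothetical helper name (not part of AutoGluon), and the cap of 16 is simply the example value from this thread.

```python
import copy
import os


def cap_knn_n_jobs(hyperparameters, max_jobs=16):
    """Return a deep copy of `hyperparameters` with `n_jobs` capped on every KNN entry.

    Hypothetical helper for illustration; not an AutoGluon API.
    """
    capped = copy.deepcopy(hyperparameters)
    knn = capped.get("KNN")
    if knn is None:
        return capped
    # 'KNN' may map to a single dict or to a list of dicts.
    entries = knn if isinstance(knn, list) else [knn]
    n_cpus = os.cpu_count() or 1
    for entry in entries:
        # Never request more workers than the cap or than the machine has.
        entry["n_jobs"] = min(max_jobs, n_cpus)
    return capped


# KNN portion of the default hyperparameters quoted in this thread.
hyperparameters = {
    "KNN": [
        {"weights": "uniform", "ag_args": {"name_suffix": "Unif"}},
        {"weights": "distance", "ag_args": {"name_suffix": "Dist"}},
    ],
}

capped = cap_knn_n_jobs(hyperparameters, max_jobs=16)
```

The resulting dict would then be passed through the `hyperparameters` argument of `TabularPredictor.fit`, as described in the documentation linked above.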

Another option is to disable KNN entirely:

excluded_model_types = ['KNN']
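As a rough sketch of how this could be combined with the fit call posted earlier in the thread: the snippet below only assembles keyword arguments, and it assumes that `excluded_model_types` is accepted as a keyword argument of `TabularPredictor.fit`. `build_fit_kwargs` is a hypothetical helper name used for illustration.

```python
def build_fit_kwargs(time_limit, presets, excluded_model_types=("KNN",)):
    """Assemble keyword arguments for a fit call that skips the given model types.

    Hypothetical helper; assumes fit() accepts `excluded_model_types`.
    """
    return {
        "time_limit": time_limit,
        "presets": presets,
        "excluded_model_types": list(excluded_model_types),
    }


# Values taken from the fit call posted earlier in this thread.
fit_kwargs = build_fit_kwargs(6000, "high_quality_fast_inference_only_refit")

# Intended use (not executed here, since it needs AutoGluon and the user's data):
# TabularPredictor(label=label, path=save_path, eval_metric="f1").fit(train_data, **fit_kwargs)
```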


rakshithvasudev commented Apr 24, 2021

Thanks @Innixma, setting n_jobs to 16 works. I ran the job twice and had no problems either time. I believe the current theory holds well :)

@Innixma Innixma modified the milestones: 0.3 Release, 0.4 Release Aug 14, 2021

gebawe commented Sep 23, 2021

This worked for me.
"BLAS stands for Basic Linear Algebra Subprograms. BLAS provides standard interfaces for linear algebra, including BLAS1 (vector-vector operations), BLAS2 (matrix-vector operations), and BLAS3 (matrix-matrix operations).
As per the documentation, if your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Therefore, you must set OpenBLAS to use a single thread.
So, it seems that your application is conflicting with OpenBLAS multi-threading. You need to run the followings on the command line and it should fix the error:"

export OPENBLAS_NUM_THREADS=1
export GOTO_NUM_THREADS=1
export OMP_NUM_THREADS=1

https://www.discoverbits.in/2509/program-terminated-because-tried-allocate-memory-regions
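When the shell is not under your control (for example in a notebook), the same three variables can be set from Python instead. A minimal sketch, assuming it runs before NumPy, scikit-learn, or AutoGluon are imported, since BLAS libraries generally read these variables when they are first loaded; `limit_blas_threads` is a hypothetical helper name.

```python
import os

# The three variables from the export commands above.
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS", "OMP_NUM_THREADS")


def limit_blas_threads(n_threads=1):
    """Set the BLAS/OpenMP thread-count variables for the current process."""
    for name in BLAS_THREAD_VARS:
        os.environ[name] = str(n_threads)


limit_blas_threads(1)

# Import numpy / sklearn / autogluon only after this point so the limits apply.
```

Note that this limits BLAS threading for the whole process, which may slow down other models that rely on multi-threaded linear algebra.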
