numpy runs out of memory #561

Closed
jendrikseipp opened this issue Nov 20, 2019 · 16 comments · Fixed by #875
Labels: documentation (Documentation is needed/added.)

Comments

@jendrikseipp

Description

SMAC runs out of memory for some of our scenarios (it has 3.5 GiB available). It catches this and aborts gracefully, but it would be great if there were some way of reducing the amount of memory that SMAC tries to reserve via numpy.

Here is the error traceback:

Traceback (most recent call last):
  File "/infai/seipp/projects/new-benchmarks/optimization/linear.py", line 703, in <module>
    incumbent = smac.optimize()
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/facade/smac_ac_facade.py", line 542, in optimize
    incumbent = self.solver.run()
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/smbo.py", line 201, in run
    challengers = self.choose_next(X, Y)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/smbo.py", line 277, in choose_next
    random_configuration_chooser=self.random_configuration_chooser
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/ei_optimization.py", line 658, in maximize
    _sorted=True,
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/ei_optimization.py", line 554, in _maximize
    return self._sort_configs_by_acq_value(rand_configs)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/ei_optimization.py", line 137, in _sort_configs_by_acq_value
    acq_values = self.acquisition_function(configs)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/acquisition.py", line 77, in __call__
    acq = self._compute(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/acquisition.py", line 382, in _compute
    m, var_ = self.model.predict_marginalized_over_instances(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/epm/rf_with_instances.py", line 269, in predict_marginalized_over_instances
    mean_, var = self.predict(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/epm/base_epm.py", line 207, in predict
    mean, var = self._predict(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/epm/rf_with_instances.py", line 223, in _predict
    preds_as_array = np.log(np.nanmean(np.exp(preds_as_array), axis=2) + VERY_SMALL_NUMBER)
  File "<__array_function__ internals>", line 6, in nanmean
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/numpy/lib/nanfunctions.py", line 949, in nanmean
    cnt = np.sum(~mask, axis=axis, dtype=np.intp, keepdims=keepdims)
MemoryError: Unable to allocate array with shape (10000, 10, 500) and data type bool

Steps/Code to Reproduce

I don't have a minimal example to reproduce the error, but here are our logs and SMAC output files: https://ai.dmi.unibas.ch/_tmp_files/seipp/smac-numpy-out-of-memory.tar.gz
You can find the stdout and stderr output in run.log and run.err. The smac files are under smac/run_*.

Do you have any suggestions for how to reduce the memory usage?

Versions

0.11.1

@mfeurer (Contributor) commented Nov 20, 2019

Thank you very much for reporting this.

The code fails when SMAC tries to compute the acquisition function for 10000 configurations. A practical solution would be to reduce this number to something like 1000, by passing either --acq_opt_challengers or --acq-opt-challengers to SMAC.
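For a rough sense of scale, here is a back-of-the-envelope calculation based only on the shape reported in the MemoryError above, assuming (per the explanation here) that the first dimension is the number of challenger configurations:

import numpy as np

# Back-of-the-envelope for the allocation in the traceback above; the first
# dimension (10000) is the number of challenger configurations.
shape = (10000, 10, 500)
n = int(np.prod(shape))       # 50,000,000 elements
print(n * 1 / 2**20)          # ~48 MiB for the bool mask that failed to allocate
print(n * 8 / 2**20)          # ~381 MiB for each float64 temporary (np.exp, np.nanmean, ...)
# Reducing the challengers from 10000 to 1000 shrinks all of these arrays by a factor of 10.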

@jendrikseipp (Author)

I'll try that, thanks!

@jendrikseipp (Author)

We changed the code to

scenario = Scenario({
    ....
    "acq_opt_challengers": 1000,
})

but we still get the same error message. Could it be that the setting is not picked up by SMAC? And if we don't set it explicitly, shouldn't the default be 5000 instead of 10000?

@jendrikseipp (Author)

In smac_hpo_facade.py I found the following code snippet:

# better improve acquisition function optimization
# 2. more randomly sampled configurations
self.solver.scenario.acq_opt_challengers = 10000

I think the value should only be overridden if it hasn't been set by the user.
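A minimal sketch of what such a guard could look like; this is hypothetical and not the actual SMAC code, and the default value of 5000 is only assumed from the question above:

# Hypothetical sketch (not the actual SMAC code): only bump the value when the
# scenario still carries its default, so a user-supplied setting survives the
# SMAC4HPO constructor. The default of 5000 is assumed from the discussion above.
DEFAULT_ACQ_OPT_CHALLENGERS = 5000

if self.solver.scenario.acq_opt_challengers == DEFAULT_ACQ_OPT_CHALLENGERS:
    # better improve acquisition function optimization
    # 2. more randomly sampled configurations
    self.solver.scenario.acq_opt_challengers = 10000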

@mlindauer (Contributor)

Hi Jendrik,

I would say that users should not change options such as self.solver.scenario.acq_opt_challengers at all, and I also don't believe that this will fix your memory problem. Looking at the log files, I'm quite confused: was there only a single output of statistics? That would mean there was only a single run of intensification. Considering your small configuration space and the fact that you have no instances, I wonder how SMAC can use so much memory. I worry that something else is broken. Could you please increase the log level to DEBUG and send us the debug output?

Best,
Marius

@jendrikseipp (Author)

Could it be that the problem is that no tested configuration is better than the initial incumbent?

@mlindauer (Contributor)

I don't think so. But I would need either a toy example to reproduce the problem on my machine or, at the very least, a debug output so that I have a chance of guessing the problem.

@jendrikseipp (Author)

I have reduced the run to a toy example (test-numpy.py). When I use "ulimit -Sv 600000" and then "rm -rf smac && ./test-numpy.py", I get "MemoryError: Unable to allocate array with shape (10000, 10, 9) and data type bool" after 4 seconds. Here is the script:

#! /usr/bin/env python3

import argparse
import logging
import sys
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
import numpy as np

from smac.configspace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter
from smac.scenario.scenario import Scenario
from smac.facade.smac_hpo_facade import SMAC4HPO
from smac.initial_design.default_configuration_design import DefaultConfiguration


def evaluate_cfg(cfg):
    logging.info(f"Evaluate configuration {cfg.get_dictionary()}")
    return 10 ** 6


# Build Configuration Space which defines all parameters and their ranges.
cs = ConfigurationSpace()

cs.add_hyperparameters([
    CategoricalHyperparameter("num_machines", [1, 2, 3]),
    CategoricalHyperparameter("wood_factor", [1.0, 1.25, 1.5, 2.0]),
    ])

scenario = Scenario(
    {
        "run_obj": "quality",
        "wallclock_limit": 20 * 60 * 60,
        "cs": cs,
        "deterministic": "true",
        # memory limit for evaluate_cfg (we set the limit ourselves)
        "memory_limit": None,
        # time limit for evaluate_cfg (we cut off planner runs ourselves)
        "cutoff": None,
        "output_dir": "smac",
        # "acq_opt_challengers": 1000,  # Overriden in SMAC4HPO constructor.
    }
)

# Example call of the function
default_cfg = cs.get_default_configuration()
print("Default config:", default_cfg)
# evaluate_cfg(default_cfg)

print("Optimizing...")
# When using SMAC4HPO, the default configuration has to be requested explicitly
# as first design (see https://github.com/automl/SMAC3/issues/533).
smac = SMAC4HPO(
    scenario=scenario,
    initial_design=DefaultConfiguration,
    rng=np.random.RandomState(42),
    tae_runner=evaluate_cfg,
)
# SMAC4HPO overrides the value for acq_opt_challengers in the scenario with
# a fixed value of 10000, so we set it here (see https://github.com/automl/SMAC3/issues/561).
#smac.solver.scenario.acq_opt_challengers = 10 ** 3
incumbent = smac.optimize()

print("Final configuration: {}".format(incumbent.get_dictionary()))
evaluate_cfg(incumbent)

@mlindauer (Contributor)

Hi,

Thank you for the example.

Some comments:

  1. I would not recommend limiting virtual memory, since languages such as Java and Python reserve more virtual memory than they actually use in the end.
  2. Your example has only 12 configurations. SMAC will try all of these and then gets caught in an infinite loop, because SMAC does not recognize that it has already looked at all configurations.
  3. Because of 2., the memory consumption is nearly constant (~160MB on my machine). I would say that 160MB is fine given that we build some ML models and use Python (and not C).

Best,
Marius

@jendrikseipp (Author)

Thanks for your comments!

Reg. 1: I agree that it would be better not to limit the virtual memory, but we have to make sure that the SMAC runs don't use too much memory when we run them in parallel on shared compute nodes of our cluster. Do you know of an alternative way to limit memory in this setting?

Reg. 2: Even if this only occurs for small configuration spaces, I think it would be good if SMAC stopped when it has tried all configurations. This would make debugging much easier.

Reg. 3: Yes, 160MB is definitely fine.

@mlindauer (Contributor)

I completely agree regarding 2., but this is not trivial to implement for complex configuration spaces with conditionals and forbidden constraints; essentially, it amounts to counting all solutions of a constraint satisfaction problem. For simple configuration spaces (without forbidden constraints), this should be feasible, and we will consider implementing a solution for such spaces in a future release.
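To illustrate the simple case: for a purely categorical space without conditionals or forbidden clauses (like the toy space from the script above), the count is just the product of the per-parameter choice counts. A sketch, not existing SMAC functionality:

from smac.configspace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter

# Sketch for the simple case only: a purely categorical space without
# conditionals or forbidden clauses.
cs = ConfigurationSpace()
cs.add_hyperparameters([
    CategoricalHyperparameter("num_machines", [1, 2, 3]),
    CategoricalHyperparameter("wood_factor", [1.0, 1.25, 1.5, 2.0]),
])

n_configs = 1
for hp in cs.get_hyperparameters():
    n_configs *= len(hp.choices)
print(n_configs)  # 12 for the toy space above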

Regarding 1, you could try to use ulimit -m instead of -v.

@jendrikseipp (Author)

Thanks! I'll try that.

@mfeurer (Contributor) commented Dec 11, 2019

I guess we then have a duplicate of #21 and #25? Based on the dates these issues were opened, this doesn't seem to be too high on our priority list, and we could use some help here.

@jendrikseipp (Author)

I just found out that ulimit -m has no effect on modern Linux: https://unix.stackexchange.com/questions/129587/does-ulimit-m-not-work-on-modern-linux

BTW, setting acq_opt_challengers = 1000 removed the numpy memory error for us (even though users shouldn't need to set it).
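For reference, one way to apply this, based on the commented-out line in the reproduction script above (a workaround that touches SMAC internals and may not be officially supported); it reuses the scenario, initial design, and evaluate_cfg objects from that script:

# Workaround sketch: the value has to be set *after* the SMAC4HPO constructor,
# because the constructor overrides the scenario value with 10000.
smac = SMAC4HPO(
    scenario=scenario,
    initial_design=DefaultConfiguration,
    rng=np.random.RandomState(42),
    tae_runner=evaluate_cfg,
)
smac.solver.scenario.acq_opt_challengers = 1000
incumbent = smac.optimize()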

stale bot commented Jun 18, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 18, 2022
@dengdifan dengdifan removed the stale label Jun 23, 2022
stale bot commented Aug 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 31, 2022
@stale stale bot closed this as completed Sep 7, 2022
@renesass renesass added documentation Documentation is needed/added. and removed stale labels Sep 8, 2022
@renesass renesass reopened this Sep 8, 2022
@renesass renesass linked a pull request Sep 8, 2022 that will close this issue