numpy runs out of memory #561

Closed
jendrikseipp opened this issue Nov 20, 2019 · 16 comments · Fixed by #875
Labels: documentation (Documentation is needed/added.)

Comments

@jendrikseipp

Description

SMAC runs out of memory for some of our scenarios (it has 3.5 GiB available). It catches this and aborts gracefully, but it would be great if there were some way of reducing the amount of memory that SMAC tries to reserve via numpy.

Here is the error traceback:

Traceback (most recent call last):
  File "/infai/seipp/projects/new-benchmarks/optimization/linear.py", line 703, in <module>
    incumbent = smac.optimize()
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/facade/smac_ac_facade.py", line 542, in optimize
    incumbent = self.solver.run()
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/smbo.py", line 201, in run
    challengers = self.choose_next(X, Y)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/smbo.py", line 277, in choose_next
    random_configuration_chooser=self.random_configuration_chooser
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/ei_optimization.py", line 658, in maximize
    _sorted=True,
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/ei_optimization.py", line 554, in _maximize
    return self._sort_configs_by_acq_value(rand_configs)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/ei_optimization.py", line 137, in _sort_configs_by_acq_value
    acq_values = self.acquisition_function(configs)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/acquisition.py", line 77, in __call__
    acq = self._compute(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/optimizer/acquisition.py", line 382, in _compute
    m, var_ = self.model.predict_marginalized_over_instances(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/epm/rf_with_instances.py", line 269, in predict_marginalized_over_instances
    mean_, var = self.predict(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/epm/base_epm.py", line 207, in predict
    mean, var = self._predict(X)
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/smac/epm/rf_with_instances.py", line 223, in _predict
    preds_as_array = np.log(np.nanmean(np.exp(preds_as_array), axis=2) + VERY_SMALL_NUMBER)
  File "<__array_function__ internals>", line 6, in nanmean
  File "/infai/seipp/.conda/envs/smac-conda/lib/python3.7/site-packages/numpy/lib/nanfunctions.py", line 949, in nanmean
    cnt = np.sum(~mask, axis=axis, dtype=np.intp, keepdims=keepdims)
MemoryError: Unable to allocate array with shape (10000, 10, 500) and data type bool

Steps/Code to Reproduce

I don't have a minimal example to reproduce the error, but here are our logs and SMAC output files: https://ai.dmi.unibas.ch/_tmp_files/seipp/smac-numpy-out-of-memory.tar.gz
You can find the stdout and stderr output in run.log and run.err. The smac files are under smac/run_*.

Do you have any suggestions for how to reduce the memory usage?

Versions

0.11.1

@mfeurer (Contributor) commented Nov 20, 2019

Thank you very much for reporting this.

The code fails when SMAC tries to compute the acquisition function for 10000 configurations. A practical solution would be to reduce this number to something like 1000, by passing either --acq_opt_challengers or --acq-opt-challengers to SMAC.
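For a rough sense of scale, here is a back-of-the-envelope calculation based only on the shape reported in the MemoryError above, assuming (per the explanation here) that the first dimension is the number of challenger configurations:

import numpy as np

# Back-of-the-envelope for the allocation in the traceback above; the first
# dimension (10000) is the number of challenger configurations.
shape = (10000, 10, 500)
n = int(np.prod(shape))       # 50,000,000 elements
print(n * 1 / 2**20)          # ~48 MiB for the bool mask that failed to allocate
print(n * 8 / 2**20)          # ~381 MiB for each float64 temporary (np.exp, np.nanmean, ...)
# Reducing the challengers from 10000 to 1000 shrinks all of these arrays by a factor of 10.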

@jendrikseipp (Author)

I'll try that, thanks!

@jendrikseipp (Author)

We changed the code to

scenario = Scenario({
    ....
    "acq_opt_challengers": 1000,
})

but we still get the same error message. Could it be that the setting is not picked up by SMAC? And if we don't set it explicitly, shouldn't the default be 5000 instead of 10000?

@jendrikseipp (Author)

In smac_hpo_facade.py I found the following code snippet:

# better improve acquisition function optimization
# 2. more randomly sampled configurations
self.solver.scenario.acq_opt_challengers = 10000

I think the value should only be overridden if it hasn't been set by the user.
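A minimal sketch of what such a guard could look like; this is hypothetical and not the actual SMAC code, and the default value of 5000 is only assumed from the question above:

# Hypothetical sketch (not the actual SMAC code): only bump the value when the
# scenario still carries its default, so a user-supplied setting survives the
# SMAC4HPO constructor. The default of 5000 is assumed from the discussion above.
DEFAULT_ACQ_OPT_CHALLENGERS = 5000

if self.solver.scenario.acq_opt_challengers == DEFAULT_ACQ_OPT_CHALLENGERS:
    # better improve acquisition function optimization
    # 2. more randomly sampled configurations
    self.solver.scenario.acq_opt_challengers = 10000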

@mlindauer (Contributor)

Hi Jendrik,

I would say that users should not change options such as self.solver.scenario.acq_opt_challengers at all, and I also don't believe that this will fix your memory problem. Looking at the log files, I'm quite confused: was there only a single output of statistics? That would mean there was only a single run of intensification. Considering your small configuration space and the fact that you have no instances, I wonder how SMAC can use so much memory. I worry that something else is broken. Could you please increase the log level to DEBUG and send us the debug output?

Best,
Marius

@jendrikseipp (Author)

Could it be that the problem is that no tested configuration is better than the initial incumbent?

@mlindauer (Contributor)

I don't think so. But I would need either a toy example to reproduce the problem on my machine or, at the very least, a debug output so that I have a chance of guessing the problem.

@jendrikseipp (Author)

I have reduced the run to a toy example (test-numpy.py). When I use "ulimit -Sv 600000" and then "rm -rf smac && ./test-numpy.py", I get "MemoryError: Unable to allocate array with shape (10000, 10, 9) and data type bool" after 4 seconds. Here is the script:

#! /usr/bin/env python3

import argparse
import logging
import sys
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
import numpy as np

from smac.configspace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter
from smac.scenario.scenario import Scenario
from smac.facade.smac_hpo_facade import SMAC4HPO
from smac.initial_design.default_configuration_design import DefaultConfiguration


def evaluate_cfg(cfg):
    logging.info(f"Evaluate configuration {cfg.get_dictionary()}")
    return 10 ** 6


# Build Configuration Space which defines all parameters and their ranges.
cs = ConfigurationSpace()

cs.add_hyperparameters([
    CategoricalHyperparameter("num_machines", [1, 2, 3]),
    CategoricalHyperparameter("wood_factor", [1.0, 1.25, 1.5, 2.0]),
    ])

scenario = Scenario(
    {
        "run_obj": "quality",
        "wallclock_limit": 20 * 60 * 60,
        "cs": cs,
        "deterministic": "true",
        # memory limit for evaluate_cfg (we set the limit ourselves)
        "memory_limit": None,
        # time limit for evaluate_cfg (we cut off planner runs ourselves)
        "cutoff": None,
        "output_dir": "smac",
        # "acq_opt_challengers": 1000,  # Overriden in SMAC4HPO constructor.
    }
)

# Example call of the function
default_cfg = cs.get_default_configuration()
print("Default config:", default_cfg)
# evaluate_cfg(default_cfg)

print("Optimizing...")
# When using SMAC4HPO, the default configuration has to be requested explicitly
# as first design (see https://github.com/automl/SMAC3/issues/533).
smac = SMAC4HPO(
    scenario=scenario,
    initial_design=DefaultConfiguration,
    rng=np.random.RandomState(42),
    tae_runner=evaluate_cfg,
)
# SMAC4HPO overrides the value for acq_opt_challengers in the scenario with
# a fixed value of 10000, so we set it here (see https://github.com/automl/SMAC3/issues/561).
#smac.solver.scenario.acq_opt_challengers = 10 ** 3
incumbent = smac.optimize()

print("Final configuration: {}".format(incumbent.get_dictionary()))
evaluate_cfg(incumbent)

@mlindauer (Contributor)

Hi,

Thank you for the example.

Some comments:

  1. I would not recommend limiting virtual memory, since languages such as Java and Python reserve more virtual memory than they actually use in the end.
  2. Your example has only 12 configurations. SMAC will try all of these and then gets caught in an infinite loop, because SMAC does not recognize that it has already looked at all configurations.
  3. Because of 2., the memory consumption is nearly constant (~160MB on my machine). I would say that 160MB is fine given that we build some ML models and use Python (and not C).

Best,
Marius

@jendrikseipp (Author)

Thanks for your comments!

Reg. 1: I agree that it would be better not to limit the virtual memory, but we have to make sure that the SMAC runs don't use too much memory when we run them in parallel on shared compute nodes of our cluster. Do you know of an alternative way to limit memory in this setting?

Reg. 2: Even if this only occurs for small configuration spaces, I think it would be good if SMAC stopped when it has tried all configurations. This would make debugging much easier.

Reg. 3: Yes, 160MB is definitely fine.

@mlindauer (Contributor)

I completely agree regarding 2., but this is not trivial to implement for complex configuration spaces with conditionals and forbidden constraints; essentially, it amounts to counting all solutions of a constraint satisfaction problem. For simple configuration spaces (without forbidden constraints), this should be feasible, and we will consider implementing a solution for such spaces in a future release.
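To illustrate the simple case: for a purely categorical space without conditionals or forbidden clauses (like the toy space from the script above), the count is just the product of the per-parameter choice counts. A sketch, not existing SMAC functionality:

from smac.configspace import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter

# Sketch for the simple case only: a purely categorical space without
# conditionals or forbidden clauses.
cs = ConfigurationSpace()
cs.add_hyperparameters([
    CategoricalHyperparameter("num_machines", [1, 2, 3]),
    CategoricalHyperparameter("wood_factor", [1.0, 1.25, 1.5, 2.0]),
])

n_configs = 1
for hp in cs.get_hyperparameters():
    n_configs *= len(hp.choices)
print(n_configs)  # 12 for the toy space above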

Regarding 1, you could try to use ulimit -m instead of -v.

@jendrikseipp (Author)

Thanks! I'll try that.

@mfeurer (Contributor) commented Dec 11, 2019

I guess we then have a duplicate of #21 and #25? Based on the dates these issues were opened, this doesn't seem to be too high on our priority list, and we could use some help here.

@jendrikseipp (Author)

I just found out that ulimit -m has no effect on modern Linux: https://unix.stackexchange.com/questions/129587/does-ulimit-m-not-work-on-modern-linux

BTW, setting acq_opt_challengers = 1000 removed the numpy memory error for us (even though users shouldn't need to set it).
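For reference, one way to apply this, based on the commented-out line in the reproduction script above (a workaround that touches SMAC internals and may not be officially supported); it reuses the scenario, initial design, and evaluate_cfg objects from that script:

# Workaround sketch: the value has to be set *after* the SMAC4HPO constructor,
# because the constructor overrides the scenario value with 10000.
smac = SMAC4HPO(
    scenario=scenario,
    initial_design=DefaultConfiguration,
    rng=np.random.RandomState(42),
    tae_runner=evaluate_cfg,
)
smac.solver.scenario.acq_opt_challengers = 1000
incumbent = smac.optimize()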

stale bot commented Jun 18, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 18, 2022
@dengdifan dengdifan removed the stale label Jun 23, 2022
stale bot commented Aug 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 31, 2022
@stale stale bot closed this as completed Sep 7, 2022
@renesass renesass added documentation Documentation is needed/added. and removed stale labels Sep 8, 2022
@renesass renesass reopened this Sep 8, 2022
@renesass renesass linked a pull request Sep 8, 2022 that will close this issue