Skip to content

Latest commit

 

History

History
executable file
·
559 lines (420 loc) · 31.5 KB

MapElites.adoc

File metadata and controls

executable file
·
559 lines (420 loc) · 31.5 KB

fcmaes - a Python 3 gradient-free optimization library

Join%20Chat

logo

Map Elites

This tutorial

  • Discusses how to apply QD-algorithms (Quality Diversity) to a an optimization problem from the space flight planning domain which has an extremely rugged solution landscape and a huge number of local optima.

  • Discusses how different configurations / solution-emitters may affect the result.

  • Explains how CVT MAP-Elites needs to be implemented in Python to provide minimal algorithm overhead and maximal scaling with the number of parallel processes.

  • Shows how CMA-ES can be used to enhance the optimization results or to improve existing solutions.

Remark

If you just want to play around with QD-algorithms, there is a simple example arm.py, the simulation of a planar robot arm. Change the parameters / used algorithms to learn how they work. Note that the problem discussed here is "special" because of the extreme "ruggedness" of the solution landscape. For other problems we recommend applying the fcmaes diversifier QD meta-algorithm, see Diversity.adoc. As we see below diversifier also works fine for the problem discussed here.

Background

Back in 2015 an influential book was published: Why Greatness Cannot Be Planned: The Myth of the Objective, review. It inspired a flurry of research activities in the area of quality diversity (QD): See Overview QD, Papers QD and Map-Elites-Overview.

QD-algorithms often find better global solutions compared to traditional algorithms dedicated specifically to find a global optimum. Gradients, if used as indicators directing to better solutions often lead to dead ends. The fcmaes library, although dedicated to diversity by its avoidance of gradients and its different parallel retry mechanism didn’t include a specific QD-algorithm. Recently I invested more time to investigate the topic to evaluate if QD-algorithms are mature enough to be applied even to the hardest optimization problems - for instance to ESAs GTOP problems, were these were never applied before.

Note that the fcmaes library is primarily aimed at problems with dimension ⇐ 1000. Optimizing neural networks may require many thousand decision variables and should preferably be run on GPUs/TPUs. You may try evojax map_elites or evojax diversifier which is derived from fcmaes diversifier instead.

Motivation

Lets start with an example. Suppose we are planning a space exploration mission to Saturn. As long as we cannot produce propellant/fuel in space, it is very expensive to to overcome earths gravity before our mission can start. Which means we should exploit the gravity "assistance" of several planets we pass to save fuel consumption.

There are simplified models describing such a mission, ESA used one of them to formulate a 22 dimensional optimization benchmark. This model uses so called "deep space maneuvers" and is precise enough to produce meaningful results, although the solar wind and details of the 3 body problem describing the planet flybys are ignored.

The accuracy of the model is sufficient to look for interesting planning alternatives which can be previewed for involving other criteria and then feed into a more detailed planning/optimization process.

But there is a problem here: ESAs Cassini2 benchmark evaluates a single optimization result. From a real world perspective such a result is more or less useless:

  • The model is not accurate enough to determine that the solutions shown are really the optimal.

  • No alternative solutions needed for the planning process are required by the benchmark.

The only things we can learn from the results are:

  • The problem is hard - it took more than a year to find the optimal solution back in 2009.

  • We have a "reference solution" which can be used to evaluate better methods / algorithms.

  • There is some "diversity" since six different teams produced slightly different results. But if you need 100 alternatives, do you really want to hire 100 teams?

You could try random solutions, which produces beautiful pictures like

cassSun1Gb

if you use a billion of them. But even the best solutions found this way misses the global optimum by factor 2.8. Even testing a billion random solutions doesn’t help to identify "areas of opportunity" in the two-dimensional feature space - start time / time of flight. A QD-algorithm can produce a picture like

cassBestSun

where we clearly see we should start around day -800, and it makes no sense to fly shorter than 1900 and longer than 3500 days.

Planning Goal

As a mission planner, what do we really want as the result of an optimization algorithm? Perhaps something like this:

cassini 2.cma

This diagram above shows 2159 different solutions to the Cassini2 problem with a fitness value ⇐ 20. The fitness value of Cassini2 is proportional to the propellant/fuel requirements of the whole mission. We can clearly see for which mission start day and overall mission time we have a valid planning alternative to chose from. Additionally, the global optimum around 8.4 is included, so although we computed all these alternatives we still were able to find the real absolute optimum.

Next we would like to have a "fast preview" looking like this:

cassini 2.fast

Less good solutions (1391), but this preview only took about 12 minutes on a modern 16 core desktop CPU. Note that the good solutions are positioned quite similarly, only their fitness values are slightly worse. Whether we really need the "refinement" of this preview shown above is debatable because of the limitations of our model and the preview didn’t miss most of the best planning options representing the real global optimum.

Existing Python Implementations

The Cassini2 2 benchmark is implemented in C++ and wrapped in a Python API. Existing Python implementations fail to cover this use case for mainly two reasons:

  • The C++ wrapper causes issues with the way parallelization is usually implemented in Python - the serialization required for transferring objects between processes fails.

  • If only single fitness calls are parallelized, the overhead for parallelization outweighs its gain for very fast fitness functions like Cassini2.

It is possible to overcome these issues even in Python, but a completely different approach implementing inter-process communication is required: Namely the one fcmaes already uses for parallel optimization which is based on shared memory.

Fortunately regarding the basic algorithm the "fast preview" use case is already covered by CVT Map Elites where a Python reference implementation is given from which we can learn.

We can even learn how to implement an "afterburner" improving existing solutions, which is also able to expand the number of solutions: CMA-ME implemented here integrates CMA-ES with MAP-Elites. Here we can find a JAX based implementation running on GPUs/TPUs.

The main ideas contributing the the success of these ideas are:

  • The fitness function is required to additionally return a behavior / description vector which describes features of an individual solution used to distinguish them - for instance "start day" and "time of flight" for the Cassini2 problem.

  • Application of Voronoi tessellation to tesselate the behavior space into niches. This works even for higher dimensional behavior spaces. Experiments with the Cassini2 problem have shown, that it is advantageous even for two behavior dimensions.

  • Adaption of CMA-ES by sorting solutions not according to their fitness value, but to the fitness-difference to the existing elites of their corresponding niche. Since this sorting determines the reshaping of the search space for each generation we "encourage" the algorithm to search for new solutions.

But what if the underlying model used as basis of the optimization is not completely accurate - as it is the case with ESAs Cassini2 benchmark? Then you probably shouldn’t invest too much time in improving existing solutions. Instead you would filter them using a more accurate model - or considering additional criteria / constraints. Only these will be used further, and could be further optimized.

Existing algorithms don’t support this use case, so we had to create a new one: We simply apply CMA-ES, but this time we modify the fitness differently:

  • We use the new fitness function returning the behavior vector.

  • But instead of returning it we check if we are still in the initial niche.

  • If yes, we return the fitness value, if not we return infinity.

  • Additionally we restrict the box boundaries: We use the minimal/maximal values of the decision variable values for all fitness computations executed during the preliminary Map-Elites run associated with the niche we optimize.

For Cassini2 this method works quite well in improving a specific selection of niches.

Multi Modal Optimization Problems

Most real world optimization problems are multi-modal, which means they have many local minima:

rastrigin me

Often we are not only interested in the best solution, but want to know what are our alternatives. The picture above plots the first two dimensions against the fitness value for the 10-dimensional rastrigin function. You cannot easily enumerate a complete grid of solution variables because the size of such a grid grows exponentially with the number of decision variables. But you could generate millions of random solutions and use these:

from numpy.random import default_rng
from numba import jit
import numpy as np
import math

@jit
def rastrigin(x):
    return 10 * x.shape[0] + (x * x - 10 * np.cos(2 * math.pi * x)).sum()

def random_test(dim = 10, rng = default_rng()):
    xs = rng.uniform(np.full(dim, -5), np.full(dim, 5), (10000000, dim))
    best = math.inf
    for x in xs:
        y = rastrigin(x)
        best = min(y, best)
    print(best)

Note: Never forget to use numba or JAX to speed up your fitness function if you don’t want to wait forever.

As a result you usually will get a fitness optimum between 30 and 40. Looking at the picture above you probably guessed already: It was generated using a better approach. There are many real world fitness functions were your CPU capabilities restrict the number of evaluations even if parallelization is fully exploited. To analyze the optimization result we also could use a 3d view:

rastrigin me3d

Such a 3d representation is better analyzed interactively when you can view it from different angles. Questions:

  • Is there a method which can explore a complex multi-modal fitness function thereby capturing the local minima correctly ?

  • Can it find the global optimum ?

  • Does it work for complex real world applications ?

All these question will be addressed below.

Multi-objective optimization

One approach to solve the problem is to apply multi-objective optimization using additional objectives for the x- and y- axis:

from scipy.optimize import Bounds
from fcmaes modecpp

@jit(cache=True,fastmath=True)
def rastrigin_mo(x):
    return x[0], x[1], 10 * x.shape[0] + (x * x - 10 * np.cos(2 * math.pi * x)).sum()

def mo_test(dim = 10):)
    bounds = Bounds(np.full(dim, -5), np.full(dim, 5))
    xs, ys = modecpp.retry(rastrigin_mo, 3,
                0, bounds, num_retries=32, popsize = 1000, max_evaluations = 5000000, workers=32)

Since fcmaes multi-objective optimization scales very well with the number of cores, on a modern 16-core CPU like the AMD 5950x we can execute 32x5000000 evaluations in less than one minute and get the following picture:

rastrigin mo

We immediately spot the issue: The global optimum was found, but we only see one quadrant of the real solution. What happened? By defining x[0] and x[1] as additional objectives, we "told" the algorithm to prefer solutions having a lower x[0] and x[1] value. The pareto-front computation eliminated all dominated solutions, so we only see solutions with negative x[0] and x[1] values.

If such a prioritization is not intended, we need another approach:

Map-Elites

A few years ago a new approach to this problem was proposed: Map Elites.

For Map-Elites the fitness function returns not only a fitness value, but additionally a list of "behavior/descriptor" values used ensure solution diversity. We aim at finding good solutions / local minima for different "descriptor" values:

def fitness_me(x):
    ...
    return fitness, np.array[descriptor1, descriptor2, ...]

In the rastrigin examples above, the descriptor values are x[0] and x[1] used as x- and y-axis of the diagrams. The returned descriptor vector usually has a lower dimensionality as 'x', so it is easier to tesselate into separate cells. Map-Elites is a well known QD (Quality-Diversity) algorithm which works as follows:

  • Tesselate the descriptor space into n cells called archive.

  • Initialize each archive cell with a random solution and assign 'math.inf' as fitness value.

  • Generate candidate solutions by crossover / mutation or other methods based on a random selection of solutions from the archive.

  • Evaluate the candidates and determine their descriptors applying 'fitness_me'

  • For each candidate determine its cell and replace its content, if the candidate improves its incumbent.

But what if we have more than two descriptor dimensions? Then the "curse of dimensionality" applies also here and tesselation is less trivial. Fortunately CVT Map Elites solves this issue by using Voronoi tessellation. Even better: There is a reference implementation.

Performance Comparison

cvt_rastrigin.py already provides the application of the 10 dimensional rastrigin function shown above. For our experiments we decrease the number of archive cells to 'n_niches=4000', since otherwise the algorithm is dominated by the cost to find the cell associated to a descriptor vector. We increase 'px["dump_period"] = 10000000' to avoid any file writes during optimization. Then we test the performance of the optimization excluding the initialization/archive creation phase. We test both 'px["parallel"] = False' and px["parallel"] = True and both regular fitness and applying numba/@jit.

Table 1. Fitness evaluations per second rastrigin
parallel=False @jit off parallel=False @jit on parallel=True @jit off parallel=True @jit on

reference implementation

11527

13526

9480

9632

fcmaes Map-Elites

64214

90577

755254

950557

  • If we compare the best settings for each implementation we get a 950557 / 13526 = factor 70 speedup - caused by the different algorithm overhead and the different scaling by parallelization.

  • Parallelization reduces performance for the reference implementation.

  • Single threaded we get 90577 / 13526 = factor 6.7 speedup - caused by the algorithm overhead alone.

The reference implementation implements parallelism utilizing 'multiprocessing.Pool.map':

def parallel_eval(evaluate_function, to_evaluate, pool, params):
    if params['parallel'] == True:
        s_list = pool.map(evaluate_function, to_evaluate)
    else:
        s_list = map(evaluate_function, to_evaluate)
    return list(s_list)

This has several disadvantages:

  • A parallel call for each fitness evaluation increases the parallelization overhead

  • 'multiprocessing.Pool.map' uses serialization / pickle to transfer data and uses locks to protect against conflicting access.

  • Serialization causes issues with closures and functions calling C-code.

  • Locks are not necessary if communication is implemented using shared memory instead as fcmaes does.

fcmaes processes a whole chunk of fitness evaluations in the same process to reduce the overhead.

We performed another test using a far more expensive fitness evaluation:

Table 2. Fitness evaluations per second expensive fitness
parallel=False @jit off parallel=True @jit off

reference implementation

12.6

200.6

fcmaes Map-Elites

17.0

304.8

As we can see, in this case the disadvantage using 'multiprocessing.Pool.map' shrinks significantly.

You may argue that real word fitness function are expensive: Examples are complex simulations shown in the FluidDynamics and PowerPlant tutorials. But all these expensive real world fitness functions don’t survive the serialization done by 'multiprocessing.Pool.map'. And often fitness evaluation is very fast if we use numba or implement it directly in C as done in many of the other tutorials.

Space flight mission design

We will use ESAs Cassini2 Mission design benchmark already discusse in SpaceFlight.

It is about the planning of a mission to Saturn involving several planet gravity assist maneuvers. It uses a simplified model involving the start time and velocity, the timings between the planets, the flyby height and angle and the timing of the deep space maneuvers between the planets.

Lets first have a look at the original benchmark which uses a fixed planet sequence and requires 22 decision variables. Although not the hardest of the GTOP problems, it is not easy to solve, even if you are only interested in the global optimum.

Meaningful Map-elites descriptors are the mission start time and the over all time of flight, since we are interested in our mission options for different start and flight times. Note that since there is a clear preference for earlier starts and a shorter flight time multi-objective optimization using the descriptors as additional objectives is a valid alternative here.

CVT MAP-Elites doesn’t work for ESAs Cassini2 benchmark

Lets first try the Python reference implementation of CVT MAP-Elites. We have to normalize both the solutions and the behavior descriptions, but otherwise the implementation is straightforward. We use a fcmaes fitness wrapper to minitor progress end measure the evaluation rate.

import map_elites.cvt as cvt_map_elites
import map_elites.common as cm_map_elites
from fcmaes.astro import Cassini2
from fcmaes.optimizer import wrapper

problem = Cassini2()
bounds = problem.bounds
px = cm_map_elites.default_params.copy()
px["dump_period"] = 2000000
px["batch_size"] = 200
px["min"] = 0
px["max"] = 1
px["parallel"] = False

fun = wrapper(problem.fun)

lb = bounds.lb
scale = bounds.ub - bounds.lb

def fitness(x):
    x = lb + np.multiply(x,scale)  # denormalize
    return -fun(x), get_tof_launch_time(x)

def get_tof_launch_time(x): # normalize
    tof = sum(x[4:9]) / 5000.0
    launch_time = (1000 + x[0]) / 1000.0
    return np.array([tof, launch_time])

def test_cassini2():
    archive = cvt_map_elites.compute(2, 22, fitness, n_niches=4000, max_evals=1e8, log_file=open('cvt.dat', 'w'), params=px)

if __name__ == '__main__':
    test_cassini2()

For the configured 1e8 evaluations my machine (AMD 5950x) needs almost 4 hours. Best fitness value found is 10.85, the optimization performed 7150 fitness evaluations / sec. This is by far too slow to be usable. But otherwise the reference implementation works fine, for 1e8 evaluations this result is expected since we don’t use CMA-ES here.

But with px["parallel"] = True the optimization slows down to about 4500 evaluations / sec because of the additional process creation overhead which by far outweighs the gain.

CVT MAP-Elites does work for ESAs Cassini2 benchmark

The code implementing CVT MAP-Elites optimization using fcmaes is here: elitescass2.py. It performs up to 350000 fitness evaluations / sec which is about factor 50 faster. (factor 78 if you compare with px["parallel"] = True). Additionally, because of the CMA-ES emitter, convergence is also improved and we find solutions around 8.6 in about 15 minutes. Above we show plots of the whole archive which contains a great number of good solutions. The code uses mapelites.py, the fcmaes CVT Map-Elites implementation.

...
def tof(x):
    return sum(x[4:9])

def launch(x):
    return x[0]

class Cassini2_me():
    ''' Map-Elites wrapper for the ESA Cassini2 benchmark problem'''

    def __init__(self, prob):
        ...
        self.bounds = prob.bounds
        min_tof = tof(prob.bounds.lb)
        max_tof = tof(prob.bounds.ub)
        min_launch = launch(prob.bounds.lb)
        max_launch = launch(prob.bounds.ub)
        self.qd_bounds = Bounds([min_tof, min_launch], [max_tof, max_launch])

    def qd_fitness(self, x):
        return self.problem.fun(x), np.array([tof(x), launch(x)])

def run_map_elites():
    problem = Cassini2_me(Cassini2())
    me_params = {'generations':100, 'chunk_size':1000}
    cma_params = {'cma_generations':100, 'best_n':200, 'maxiters':1000, 'miniters':200}
    fitness =  mapelites.wrapper(problem.qd_fitness, problem.qd_dim)

    archive = mapelites.optimize_map_elites(
        fitness, problem.bounds, problem.qd_bounds, niche_num = niche_num,
          iterations = 50, me_params = me_params, cma_params = cma_params)

Here the solution and description space are not normalized, which means that we have to provide the boundaries to the algorithm (problem.bounds, problem.qd_bounds). Note that archive objects have a load and save method to save the current status of the optimization to disk. A loaded archive can be forwarded to mapelites.optimize_map_elites via the optional archive parameter to continue a saved optimization. Reasons for the performance difference to the reference solution:

  • Although both implementations use sklearn.neighbors.KDTree to determine the niche, fcmaes forwards whole chunks of solutions to KDTree which speeds up things significantly.

  • fcmaes applies the same trick for SBX (Simulated binary crossover) and Iso+Line to avoid slow Python loops

  • Since fcmaes uses shared memory for the contents of the archive, processes can compute the fitness of many solutions before they have to synchronize. This way much less process creation/shutdown overhead is created. Note that because of the Python global interpreter lock multi threading is not applicable, and process creation is quite heavyweight compared to thread creation.

  • The CVT MAP-elite implementation doesn’t support CMA-ES update, so there is no direct comparison there. Because of its computationally expensive covariance matrix update CMA-ES may either become very slow for higher solution dimensions, or the underlying matrix library may allocate multiple CPU cores for a single optimization which is counterproductive if the whole optimization is already parallelized. For extremly high dimensions (> 1000) there is no alternative to a JAX based implementation where the matrix operations are delegated to a GPU/TPU. But as the benchmarks results in EvoJax.adoc show: For these high dimensional problems CMA-ES is not the best choice anyway because of its slow convergence. For MAP-Elites which typically is applied to problems with dimension < 1000, a fast C++ CMA-ES implementation as the fcmaes one is the best choice, since it is always executed single threaded and integrates well with parallelization at a higher level.

There is an alternative: Diversifier

diversifier.py is a new alternative to MAP-Elites. It generalizes ideas from CMA-ME to other wrapped algorithms. It uses the archive from CVT MAP-Elites (https://arxiv.org/abs/1610.05729) implemented in mapelites.py.

Note that there is an equivalent implementation dedicated to the machine learning domain: evojax diversifier, able to handle many thousand decision variables. It implements parallelism at a different level: At the fitness function and the optimization algorithm itself.

Both the parallel retry and the archive based modification of the fitness function enhance the diversification of the optimization result. The resulting archive may be stored and can be used to continue the optimization later.

As MAP-Elites it requires a QD-fitness function returning both an fitness value and a behavior vector used to determine the corresponding archive niche using Voronoi tesselation. It returns (or improves) an archive of niche-elites containing also for each niche statistics about the associated solutions.

def run_diversifier():
    name = 'cass2div'
    problem = Cassini2_me(Cassini2())
    opt_params0 = {'solver':'elites', 'popsize':1000, 'workers':16}
    opt_params1 = {'solver':'DE_CPP', 'max_evals':50000, 'popsize':31, 'stall_criterion':3}
    opt_params2 = {'solver':'CMA_CPP', 'max_evals':100000, 'popsize':31, 'stall_criterion':3}
    archive = diversifier.minimize(
         mapelites.wrapper(problem.qd_fitness, 2), problem.bounds, problem.qd_bounds,
         workers = 32, opt_params=[opt_params0, opt_params1, opt_params2], max_evals=2000000*32)
    diversifier.apply_advretry(wrapper(problem.fitness), problem.descriptors, problem.bounds, archive, num_retries=20000)
    archive.save(name)
    plot_archive(archive)

As can be seen by the example code above, it supports the application of arbitrary combination of solvers. Solvers are treated as a sequence with the exception of 'solver':'elites' which is executed in parallel. workers':16 means that 16 of the configured 32 parallel processes are allocated to MAP-Elites, the rest to the other solvers. In a sequence - here it is DE → CMA-ES - the optimal solution found is forwarded to the next solver as initial mean of its solution distribution. max_evals for diversifier.minimize limits the overall number of fitness evaluations. For the optimizers max_evals limits the number of evaluations for a single optimizer run.

Note that the use of solver-sequences is specific to space flight planning tasks. In most cases a combination of CMA-ES or CR-FM-NES and MAP-Elites executed in parallel works very well.

A resulting niche-archive may further be improved by applying a regular non-QD algorithm: the fcmaes advanced parallel retry / smart boundary management meta-algorithm. Archive-contents are transferred to its solution store and then back to the QD-archive. This method is very effective in improving the best solutions found so far, but doesn’t improve the diversity of the solutions much.

The diversifier needs less fitness evaluations for the price of diversity. It may be an interesting alternative, if the evaluation budged is limited and/or the fitness evaluation is expensive. Its results are far more diverse than what can be achieved without a QD-archive of niche-elites.

Conclusion

We have shown that:

  • CVT MAP-elites can handle even the hardest optimization problems.

  • QD algorithms are more useful for real world problems as optimizers returning a single best solution.

  • The CMA-ES emitter improves effectively existing archive solutions during optimization.

  • CMA-ES can be applied after MAP-elites optimization to improve selected niches.

  • Alternatively there is the general diversifier applying a QD-archive to different non-QD algorithms like CR-FM-NES, DE, PGPA and CMA-ES or to sequences of these. This flexibility is specially valuable for application domains were CMA-ES becomes slow (dimension > 1000, as for machine learning tasks).

  • QD-archives and fcmaes parallel retry stores can interchange their solutions to improve results further applying non-QD (meta-)algorithms.

Our CVT MAP elites (with CMA-Emitter) implementation elitescass2.py introduces a number of novelties enhancing its performance:

  • Our archive uses shared memory to reduce inter-process communication overhead.

  • The initial behavior space is generated from uniform behavior samples because random solutions may cover only parts of the behavior space. Some parts may only be reachable by optimization.

  • Fitness computations may be expensive - therefore we avoid the computation of fitness values for the initial solution population.

  • The initial solution space is generated from uniform samples of the solution space. These solutions are never evaluated but serve as initial population for SBX or Iso+LineDD. Their associated fitness value is set to math.inf (infinity). This way we:

    • Avoid computing fitness values for the initial population.

    • Enhance the diversity of initial solutions emitted by SBX or Iso+LineDD.

  • Iso+LineDD (https://arxiv.org/pdf/1804.03906) is implemented but doesn’t work well with extremely ragged solution landscapes. Therefore SBX+mutation is the default setting.

  • SBX (Simulated binary crossover) is combined with mutation. Both spread factors - for crossover and mutation - are randomized for each application.

  • Candidates for CMA-ES are sampled with a bias to better niches.

  • There is a CMA-ES drill down for specific niches - in this mode all solutions outside the niche are rejected. Restricted solution box bounds are used derived from statistics maintained by the archive during the addition of new solution candidates.

Note that the Python reference implementation of CVT MAP-Elites also tesselates the behavior space independent from fitness evaluations of emitted solutions.