This tutorial
-
Shows how to implement the noisy TSP problem aiming at a robust route efficiently using numba.
-
Compares with an alternative implementation.
-
Compares different optimization algorithms applied to both implementations
-
Use optimization to evaluate machine learning based results for another TSP variant
There is a dramatic performance difference > factor 10000 when comparing optimization algorithms and objective function implementations. The purpose of this tutorial is to "sharpen your senses" regarding how to implement this kind of objective function and which optimization algorithm to choose when you selected Python as implementation language. Don’t underestimate the influence of your specific implementation of a problem, not only regarding evaluations/second but also regarding the number of evaluations required applying the same optimization algorithm.
As starting point we use TSP.py from the 'ExpensiveOptimBenchmark' suite from the Technical University of Delft. In Hospital we discuss four other problems from this problem suite, but only for this one we provide our own alternative implementation.
All results were produced using a 16 core / 32 thread AMD CPU 5950x. Some algorithms use only a single thread, others utilize the whole processor.
"When a black-box optimization objective can only be evaluated with costly or noisy measurements, most standard optimization algorithms are unsuited to find the optimal solution. Specialized algorithms that deal with exactly this situation make use of surrogate models."
Such a method - specialized for integer-valued minima - is presented in the paper (IDONE) and applied to a noisy variant of an asymmetric traveling salesman problem (17 cities).
The statement "most standard optimization algorithms are unsuited" is probably true, but what about the other ones? We will identify "suited" standard optimization algorithms and prove that they work fine for TSP.py, the implementation of the objective function used in the paper. Additionally we can show that for an alternative implementation of the same noisy TSP problem (100 iterations, noise=1.0) noisy_tsp.py standard optimization algorithms can solve the noisy 17 cities problem BR17 in about 2 seconds utilizing the parallel threads modern CPUs provide.
The other problem shown in the paper is artificial and based on matrix multiplication using random matrices with objective function:
-
f(x) = (x − x^∗ )^T A(x − x^∗)
applied to binary input values. Here IDONE indeed uses far less function evaluations as all standard algorithms we tried. But what happens if instead dimension = 100 we use dimension = 5000 ? IDONE doesn’t scale very well, I wasn’t able to apply it successfully. A standard algorithm (BiteOpt) uses about 900.000 function evaluations and about 5 hours (single threaded), but at least I got the problem solved. You could argue: Real world problems are too expensive to apply their objective function so often, but on the other hand they usually cannot be solved so exceptionally well using IDONE.
The Traveling Salesman Problem/TSP asks the question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It appears as subproblem in many other areas, is very well studied and there are methods solving it for large numbers of cities.
But what if the distance between cities is not known exactly? This problem can be "simulated" by adding some random noise to the individual distances and determining a "robust" solution: We compute the path length n times and determine the "worst case", the longest path we got for all iterations each time adding different random noise to the individual distances. For all our experiments we use 100 iterations and a random noise between 0 and 1 added to all distances.
Solving this "noisy TSP" is more challenging, standard branch and bound algorithms like tsp_solver cannot be applied.
Lets first have a look at TSP.py, an implementation of an objective function computing the robust path length for continuous input values (mapped to discrete integers) determining the selected tour.
def evaluate(self, x):
robust_total_route_length = 0.0
for iteration in range(self.n_iter):
current = 0
unvisited = list(range(1, self.d+2))
total_route_length = 0.0
for di, i in enumerate(x):
next_up = unvisited.pop(int(round(i)))
total_route_length += self.W[current, next_up]
total_route_length += self.noise_rng.random() * self.noise_factor
current = next_up
last = unvisited.pop()
total_route_length += self.W[current, last]
total_route_length += self.noise_rng.random() * self.noise_factor
total_route_length += self.W[last, 0]
total_route_length += self.noise_rng.random() * self.noise_factor
robust_total_route_length = max(total_route_length, robust_total_route_length)
return robust_total_route_length
This implementation aims at minimizing the number of input variables: The number of cities N minus two.
Since the tour is always a ring you can choose any fixed city as a starting point. When you have
performed N-2 steps there is only one city left, you haven’t really a choice.
next_up = unvisited.pop(int(round(i)))
maps continuous input variables to integer values used
for the selection. At least for continuous optimizers you should use a different approach as we will
show below.
At solvers we find a number of optimization algorithms to test. Many are not applicable for the TSP problem, so we choose the following selection for testing:
We don’t need a huge problem instance to drive these algorithms at and beyond their limits: br17 is a small TSP instance involving only 17 cities. According to ATSP-solutions the shortest path without noise has length 39.
Next lets have a look how fcmaes standard optimization algorithhms perform - also single threaded: We use the Python wrapper of the following three algorithms implemented using Eigen/C++:
The fcmaes CMA variant terminates fast, but provides inconsistent results.
fcmaes supports the parallel retry of optimization algorithms, lets see what happens if we apply 32 optimization runs in parallel using retry.py :
The original algorithms are clearly outperformed, even CMA-ES performs quite well when applied in a parallel retry scenario.
Both the Differential Evolution and multiobjective DE/NSGA (MODE) algorithm got an update:
-
Both algorithms now have specific mixed-integer support. If you tell the algo via a new boolean array
ints
parameter which are your discrete integer variables, convergence will be much faster. This works both for the Python and the C++ variants.ints = [True, True, False]
for instance means that the first two variables are discrete. Usingnp.argsort(x)
together with continuous variables for sequences as shown below is still a valid option. But this trick doesn’t work if the same discrete value can occur in different variables. The original problem is now solvable in about 10 seconds using multiple threads.
The new implementation (noisy_tsp.py) uses numba to speed up the objective function evaluation quite significantly:
@njit(fastmath=True)
def evaluate_tsp(x, W, d, noise_factor, iter_num):
robust_total_route_length = 0
order = np.argsort(x) + 1
for _ in range(iter_num):
total_route_length = 0
total_route_length += W[0, order[0]] + np.random.random() * noise_factor
total_route_length += W[order[d-1], 0] + np.random.random() * noise_factor
for i in range(d-1):
total_route_length += W[order[i], order[i+1]] + np.random.random() * noise_factor
robust_total_route_length = max(total_route_length, robust_total_route_length)
return robust_total_route_length
This implementation uses np.argsort(x)
to determine the order the cities are visited.
The first city is fixed, so we have the number of cities N minus one argument variables x
.
This is one variable more, but it nevertheless works much better with continuous optimization
algorithms. We used this idea also in
Scheduling
and JobShop.
Applying fcmaes standard optimization algorithms to the modified objective function
results in a solution time of about 2 seconds, even for CMA-ES its only about 5 seconds.
Here a table comparing the number of function evaluations per second for all algorithms and objective function variants. numba and the way the new implementation is designed speeds up the computation by about factor 100 thereby also improving convergence:
algorithm | problem | evals/sec | used threads |
---|---|---|---|
idone |
original |
13 |
1 |
MSVRM |
original |
23 |
1 |
CMA |
original |
271 |
1 |
SA |
original |
335 |
1 |
BiteOpt |
original |
11800 |
32 |
fcmaes-CMA |
original |
11600 |
32 |
BiteOpt |
numba based |
1150000 |
32 |
fcmaes-CMA |
numba based |
1190000 |
32 |
What happens if we increase the noise?
Another experiment uses the symmetrical TSP gr17.
This time we configured noise_factor=50
.
According to TSP-solutions the shortest path without noise
has length 2085.
The CMA algorithm shown here is not the one from fcmaes, but the original one:
The CMA results are hard to see, reason is that we get a "termination on tolstagnation" after about 8000 function evaluations. It seems the default termination criteria don’t work for the TSP problem. As we will later see, this is not a problem with the CMA-ES algorithm itself.
Here are the results applying fcmaes-CMA (not the original one) and BiteOpt, both utilizing all 32 threads provided by the processor:
Finally we see a direct comparison of the different objective function implementations for the same optimization algorithm. Beside the speedup (evaluations/sec) we find better robust tours using both algorithms.
We have to be very careful when implementing an objective function representing a specific problem. Not always the implementation requiring the least number of variables wins. Use numba whenever possible for the time critical parts. BiteOpt or Differential Evolution, specially if used with parallel retry, are very good algorithm choices which should be tried early, if the problem is single objective and there are no constraints (which cannot be easily expressed using the weighted sum approach). Algorithms with huge overhead like IDONE and MSVRM should only be applied for very expensive objective functions. Noisy TSP can be evaluated nearly 1.2 million times / sec, so it definitely doesn’t fall into that category.
Lets have a look at ai-for-tsp-competition which contains the Python code for a machine learning competition to solve another TSP variant called TD-OPSWTW. Results and the details of the TSP variant are described in BliekEtAl2021:
"The stochastic travel times between locations are only revealed as the salesman travels in the network. The salesman starts from a depot and must return to the depot at the end of the tour. Moreover, each node (customer) in the network is assigned a prize, representing how important it is to visit a given customer on a given tour. Each node has associated time windows. We consider that a salesman may arrive earlier at a node without compromising its prize, but the salesman must wait until the opening time to serve the customer. Lastly, the tour must not violate a total travel time budget while collecting prizes in the network. The goal is to collect the most prizes in the network while respecting the time windows and the total travel time of a tour allowed to the salesman."
We will focus on the first task described by the paper: A specific 65 node instance of this TSP has to be solved using machine learning techniques. Lets suppose we are the organizer of this competition and want to evaluate the applicability of machine learning for this problem by comparing the winner results with a pure optimization based approach.
We do (as with the first TSP problem above) the following
-
Apply numba to speed up the evaluation of a solution tour
-
Formulate an objective function which needs to be fast to evaluate thereby being accurate enough for the final test - which in this case is computing the average of 10000 noisy tour evaluations.
-
Apply the BiteOpt optimizer with parallel retry.
The central routine evaluating a given tour is tour_check which we have to adapt:
from numba import njit
def tour_check(tour, x, time_matrix, maxT_pen, tw_pen, n_nodes):
return tour_check_numba(np.array(tour, dtype=np.int32), np.array(x, dtype=np.float32),
np.array(time_matrix, dtype=np.int32), maxT_pen, tw_pen, n_nodes)
@njit(fastmath=True)
def tour_check_numba(tour, x, time_matrix, maxT_pen, tw_pen, n_nodes):
numba is very picky regarding argument array types, so we have to wrap it converting into float32 and int32 arrays which make numba happy. This minimal change causes a factor 8 speedup of the evaluation of TSP tours.
The fitness object representing the objective function used for optimization gets an Env object representing a problem instance as argument.
-
Progress is monitored using
mp.RawValue
variables which share their values between processes. -
We define the bounds for all continuous variables as
[0,1]
which insolution
is converted into an integer vector representing the tour using thenp.argsort
sorting trick. -
value
computes the minimal or average score performingn
calls tocheck_solution
which gives a noisy tour evaluation. For smalln
the minimum is a more reliable value. -
The objective function itself
__call__
usesvalue
for incrementaln
values to compute the final evaluation of the tour. This method represents a compromise between accuracy and performance. For lown
penalized noisy values may be overlooked. For good candidates we have to increasen
to get a reliable result which works withn=10000
used for the final test. -
Note that the optimizer could find noisy "outliers" delivering worse results when called again. Therefore we finally use
n=100000
, becausen=10000
inside the objective function didn’t deliver reliable results.
from scipy.optimize import Bounds
import time, math
import ctypes as ct
import multiprocessing as mp
from fcmaes.optimizer import logger, Bite_cpp, dtime
from fcmaes import retry
class fitness:
def __init__(self, env):
self.evals = mp.RawValue(ct.c_long, 0) # writable across python processes
self.best_y = mp.RawValue(ct.c_double, math.inf)
self.t0 = time.perf_counter()
self.env = env
self.d = env.n_nodes
def bounds(self):
return Bounds(np.zeros(self.d), np.array([1]*self.d))
def value_min(self, sol, n):
val = math.inf
for _ in range(n):
_, rewards, pen, _ = self.env.check_solution(sol)
val = min(val, rewards + pen)
return val
def value(self, sol, n):
if n < 1000: # for small n take the minimum instead
return self.value_min(sol, n)
val = 0
for _ in range(n):
_, rewards, pen, _ = self.env.check_solution(sol)
val += rewards + pen
return val/n
def solution(self, x): # disjoined all locations
return [1] + [int(xi) for xi in (np.argsort(x) + 1)]
def __call__(self, x):
self.evals.value += 1
sol = self.solution(x)
n = 10
while n <= 100000:
y = -self.value(sol, n)
if y >= self.best_y.value:
return y
n *= 10
if y < self.best_y.value:
self.best_y.value = y
logger().info("evals = {0}: time = {1:.1f} y = {2:.5f} x= {3:s}"
.format(self.evals.value, dtime(self.t0), y, str(sol)))
return y
def optimize(self):
self.bestY = 1E99
self.bestX = []
ret = retry.minimize(self, self.bounds(), optimizer=Bite_cpp(200000,M=16,stall_iterations=3), num_retries=32)
sol = self.solution(ret.x)
num = 10000
logger().info("val" + str(num) + " = " + str(self.value(sol, num)))
return sol
if __name__ == '__main__':
env = Env(n_nodes=65, seed=6537855)
sol = fitness(env).optimize()
A typical output is:
evals = 32: time = 12.7 y = 72.28609 x= [1, 28, 44, 36, 4, 17, 54, 40, 19, 45, 1, 14, 37, 12, 6, 11, 34, 13, 52, 38, 62, 60, 42, 5, 8, 61, 63, 35, 33, 65, 2, 30, 53, 27, 48, 56, 20, 26, 3, 25, 55, 10, 22, 59, 18, 58, 46, 57, 24, 29, 16, 49, 64, 47, 51, 50, 21, 31, 39, 32, 23, 43, 9, 41, 15, 7]
evals = 36: time = 12.7 y = 67.64000 x= [1, 34, 20, 18, 63, 1, 26, 43, 10, 36, 24, 44, 56, 62, 3, 17, 42, 27, 25, 41, 21, 64, 12, 54, 31, 22, 7, 9, 16, 28, 4, 60, 11, 45, 2, 23, 32, 39, 51, 8, 46, 29, 61, 65, 47, 35, 14, 40, 13, 52, 33, 55, 57, 48, 53, 38, 6, 19, 58, 15, 59, 50, 37, 5, 49, 30]
evals = 56: time = 12.7 y = 66.42925 x= [1, 36, 13, 56, 62, 17, 1, 55, 30, 46, 25, 53, 34, 10, 38, 27, 7, 9, 54, 64, 28, 22, 57, 23, 31, 2, 6, 18, 47, 58, 44, 39, 50, 43, 42, 8, 12, 65, 32, 37, 61, 5, 24, 21, 26, 14, 48, 11, 15, 63, 33, 45, 16, 52, 51, 59, 35, 41, 49, 20, 4, 19, 40, 3, 60, 29]
...
evals = 930879: time = 142.3 y = -5.00259 x= [1, 45, 5, 44, 47, 42, 2, 46, 9, 22, 7, 4, 24, 30, 40, 48, 1, 41, 39, 43, 49, 52, 51, 38, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 50, 37, 33, 35, 3, 6, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 23, 25, 26, 27, 28, 29, 31, 32, 64, 34, 36, 65, 21]
...
evals = 6060526: time = 474.5 y = -11.20464 x= [1, 32, 45, 55, 49, 47, 41, 5, 23, 57, 6, 16, 2, 42, 46, 60, 11, 33, 43, 64, 13, 19, 29, 9, 22, 65, 7, 35, 62, 63, 4, 24, 30, 40, 48, 1, 17, 56, 26, 3, 36, 27, 37, 10, 34, 8, 58, 61, 39, 12, 25, 59, 44, 52, 15, 14, 28, 51, 54, 20, 18, 31, 53, 50, 38, 21]
...
evals = 6186914: time = 492.6 y = -11.31701 x= [1, 32, 45, 55, 5, 49, 41, 47, 44, 23, 6, 57, 16, 33, 60, 46, 42, 2, 11, 64, 19, 43, 13, 29, 65, 9, 22, 35, 7, 62, 63, 4, 24, 30, 40, 48, 1, 21, 8, 61, 14, 50, 54, 17, 38, 59, 27, 28, 26, 36, 15, 58, 20, 56, 34, 37, 18, 12, 31, 51, 10, 25, 3, 53, 39, 52]
evals = 6232151: time = 501.8 y = -11.31998 x= [1, 32, 45, 55, 5, 49, 41, 47, 44, 23, 6, 57, 16, 33, 2, 60, 46, 42, 11, 64, 19, 43, 13, 29, 65, 9, 22, 35, 7, 62, 63, 4, 24, 30, 40, 48, 1, 21, 8, 61, 14, 50, 54, 17, 38, 59, 27, 28, 26, 36, 15, 58, 20, 56, 34, 37, 18, 12, 31, 51, 10, 25, 3, 53, 39, 52]
evals = 6260525: time = 518.0 y = -11.32000 x= [1, 32, 45, 55, 5, 49, 41, 47, 44, 23, 6, 57, 16, 33, 2, 60, 46, 42, 11, 64, 19, 43, 13, 29, 65, 9, 22, 35, 7, 62, 63, 4, 24, 30, 40, 48, 1, 21, 8, 61, 14, 50, 54, 17, 38, 59, 27, 28, 26, 36, 15, 56, 58, 20, 34, 37, 18, 12, 31, 51, 10, 25, 3, 53, 39, 52]
val10000 = 11.319876970495283
Both the machine learning and the optimization approach achieve almost the same result of 11.32
.
Optimization reaches 11.20
consistently after about 400-500 sec, but can need up to 1800 sec to reach 11.32
for the TD-OPSWTW instance from the competition using an AMD-5950x 16 core CPU.
Even for 10000 evaluations there is still some noise, so the final test is slightly worse although we use 100000 evaluations inside the objective function. It seems pure optimization is a valid alternative for TD-OPSWTW and other realistic TSP variants.
Compare yourself with the machine learning based approach of one of the winners: https://github.com/mustelideos/td-opswtw-competition-rl . Which solution do you prefer?