Skip to content
Dennis Ideler edited this page Jun 9, 2014 · 18 revisions

The shortest driving route through all 48 contigous US states (Takes 113 hours) Source: http://t.co/eQpXlo4v8L - pic.twitter.com/sKt2zlPpl3

— Amazing Maps (@Amazing_Maps) June 8, 2014
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

Shortest Hamiltonian path problem?

Touring Singer Problem (TSP)

A musician is expected to make a round trip at all the venues she'll be performing at, before returning home. This musician likes to be efficient, and wants to minimize her time on the road. A short route would help a lot...

NP-Hard

TSP is an NP-hard problem in combinatorial optimization. That's a term academics use to classify its complexity. When a problem is NP-hard, we cannot (yet) prove that a polynomial time solution exists.

A problem is said to be solvable in polynomial time if the number of steps required to complete the algorithm for a given input is O(nk). Polynomial-time algorithms are said to be "fast."

BUT WHAT DOES IT MEAN‽

In layman's terms, when a problem is described as "NP-hard", that means it belongs to a class of problems that very quickly become computationally expensive to solve as the input size grows. These problems are in fact rather easy to solve by enumerating all possible solutions and testing them, but solving them takes a long time on larger instances of the problems.

Optimization

TSP is one of the most intensively studied problems in optimization. It's used as a benchmark for many optimization methods. Even though the problem is computationally difficult, a large number of heuristics and exact methods are known, so that some instances with tens of thousands of cities can be solved completely and even problems with millions of cities can be approximated within a small fraction of error.

Graph

TSP can be modelled as an undirected weighted graph.

  • vertices = cities
  • edges = paths
  • edge weight = path distance

It's a minimization problem starting and finishing at a specified vertex after having visited each other vertex exactly once. Often, the model is a complete graph (i.e. each pair of vertices is connected). If no path exists between two cities, adding an arbitrarily long edge will complete the graph.

Note that the symmetry which undirected graphs provide is often an oversimplification. Traffic collisions, construction, one-way streets, and paid routes are examples of how this symmetry could break down. In such cases a directed graph will be more accurate.

Genetic Algorithm (GA)

A genetic algorithm is a search technique inspired by Darwin's Theory of Evolution and is used in computing to find exact or approximate solutions to optimization and search problems.

GAs are mainly used for problems where satisficing is expected. A good solution is satisfactory, and there is little benefit to be gained by going for an expensive optimal solution.

BUT WHAT DOES IT MEAN‽

Think of it as a guided random search. You create a bunch of random solutions and evaluate each of them. Then you repeatedly choose some good-ish solutions, and modify them. Some are big modifications (crossovers), some are tiny (mutation). For the big modifications, you "combine" the two solutions.

If you were to graph the average fitness of all the solutions you have, you'll see it gradually improve, until at some point it converges. At that point you can modify some parameters, retry, and hope the next run goes better.

Often the most difficult part can be data representation. How do you represent a real world problem so that your GA can work with it? In terms of efficiency, the fitness evaluation can usually be the bottleneck, especially if fitness is "subjective".

Flowchart can look different...

Partially matched crossover (PMX)
  • Two crossover points selected at random.
  • Points give matching selection. Parents are "mapped" to each other.

Cycle crossover (CX)
  • Each element comes from one parent together with its position.
  • Genes selected for inheritance is based on a cycle of connecting genes.
  • Cycle starts with the first gene from the first parent
  • Cycle ends when the last gene inherited is connected to the first gene inherited.

Parameters

There are 6 different (user specified) GA parameters used in this program:

  1. Population Size
  2. Maximum Generations
  3. Crossover Rate or Crossover Probability
  4. Mutation Rate or Mutation Probability
  5. Number of Elites
  6. K value for Tournament Selection

Optimal value ranges for these GA parameters:

  1. 100 ≤ popSize ≤ 500
  2. 100 ≤ maxGen ≤ 300
  3. 0.7 ≤ Pc ≤ 0.9 (lower from 1.0)
  4. 0.1 ≤ Pm ≤ 0.3 (raise from 0.0)
  5. 1 ≤ numElites ≤ 3
  6. 2 ≤ k ≤ 4

TODO

Q&A

**How do you represent the GA in biology terms?

A living organism consists of cells. In each cell lies the same set of chromosomes. Chromosomes are strings of DNA. A chromosome consists of genes (i.e. blocks of DNA). Each gene encodes a particular protein. Each gene encodes a trait (e.g. colour of hair). Possible settings for a trait (e.g. black, brown) are called alleles. Each gene has its own position in the chromosome, called locus. The complete set of genetic material (i.e. all chromosomes) is called genome. A particular set of genes in the genome is called genotype. The organism's characteristics is called phenotype.

In this GA we have:

  • Chromosome = a single solution to the TSP (i.e. a path of cities, each city visited once)

    • represented as a list of cities and its total distance (also the fitness of the solution)
  • Gene = a city

    • represented as a number (integer, float, or double)
  • Allele = a city's coordinates

    • it's x-y coordinates stored in a separate 2D list (used like a map container)
  • Locus = the specific location of a gene (or DNA sequence) on a chromosome

    • represented as the index in the list of cities
  • Genotype = a population (of chromosomes)

    • represented as a list containing lists of cities

Phenotype = the parameter set

  • population size, generations, crossover rate, mutation rate, elitism rate

Q: How to generate the initial population? A: Generate a random population of i chromosomes of size j (i is the population size specified by the user, j is the amount of cities in the input data) Thus the initial population consists of n amount of random, but valid, solutions. Each chromosome should also have its fitness calculated and associated with it.

Q: How to evaluate the fitness of a chromosome (or solution)? A: The fitness evaluation function is the total distance traveled in that path. Distance between cities is measured by Euclidean distance. TSP looks for the shortest path, thus the chromosome with the lowest total distance, is the most fit.

Q: How to do reproduction of parents (i.e. selection, crossover, mutation, elitism)? A: You must select two parents for reproduction. Each parent is selected and reproduced in a single generation (which means the second parent has to wait until the next generation).

The selection process used here is based on Tournament Selection. Select k chromosomes randomly, and from those k, choose the fittest chromosome. (2 <= k <= 4) Unfit chromosomes are left in the population in hope of them becoming useful later on. (The fitter the chromosome, the more times it is likely to be selected to reproduce.)

Now that the parents are selected, the reproduction process comes next. Reproduction consists of inheritance, crossovers, mutations and elitism.

Crossovers include inheritance. Some genes are inherited from the parents and some are crossed. The amount of genes that are inherited is determined randomly. The crossover rate determines how often crossover will be performed (percentage wise). If the crossover rate is 0%, then the offspring is fully inherited (i.e. exact copies of parents); if the crossover rate is 100%, then all offspring is made by crossover. Crossover is performed (optionally) in hope that new chromosomes will have good parts of old chromosomes and maybe the new chromosomes will be better. The crossover operator used here is the Partially Mapped Crossover (PMX). Cycle Crossover (CX) is provided as a secondary crossover.

Mutations are made to prevent the GA from getting stuck in local optimas. It is performed on the offspring. The mutation rate determines how often parts of the chromosome will be mutated. Unfit chromosomes that were left in the population earlier because might become useful after mutation. The mutation operator used here is the Reciprocal Exchange operator.

Elitism is introduced to prevent the loss of the fittest chromosome during reproduction. The n fittest chromosomes are copied into the new population once every generation. This guarantees that the best solution won't be lost and improves the GA performance. Optionally, only replace the parent with the child if the child is better (in fitness).