# Module 4 - Programming Assignment

## General Directions

1. You must follow the Programming Requirements outlined on Canvas.
2. The Notebook should be cleanly and fully executed before submission.
3. You should change the name of this file to be your JHED id. For example, `jsmith299.ipynb` although Canvas will change it to something else...
4. You must follow the Programming Requirements for this course.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You should always read the entire assignment before beginning your work, so that you know in advance what the requested output will be and can plan your implementation accordingly.
    </p>
</div>

<div style="color: white; background: #C83F49; margin:20px; padding: 20px;">
    <strong>Academic Integrity and Copyright</strong>
    <p>You are not permitted to consult outside sources (Stackoverflow, YouTube, ChatGPT, etc.) or use "code assistance" (Co-Pilot, etc.) to complete this assignment. By submitting this assignment for grading, you certify that the submission is 100% your own work, based on course materials, group interactions, and instructor guidance. You agree to comply with the requirements set forth in the Syllabus, including, by reference, the JHU KSAS/WSE Graduate Academic Misconduct Policy.</p>
    <p>Sharing this assignment either directly (e.g., email, github, homework site) or indirectly (e.g., ChatGPT, machine learning platform) is a violation of the copyright. Additionally, all such sharing is a violation of the Graduate Academic Misconduct Policy (facilitating academic dishonesty is itself academic dishonesty), even after you graduate.</p>
    <p>If you have questions or if you're unsure about the policy, ask via Canvas Inbox. In this case, being forgiven is <strong>not</strong> easier than getting permission and ignorance is not an excuse.</p>
    <p>This assignment is copyright (&copy; Johns Hopkins University &amp; Stephyn G. W. Butcher). All rights reserved.</p>
</div>

In [1282]:
from pprint import pprint

## Local Search - Genetic Algorithm

There are some key ideas in the Genetic Algorithm.

First, there is a problem of some kind that either *is* an optimization problem or the solution can be expressed in terms of an optimization problem.
For example, if we wanted to minimize the function

$$f(x) = \sum_{i=1}^{n} (x_i - 0.5)^2$$

where $n = 10$.
This *is* an optimization problem. Normally, optimization problems are much, much harder.
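
As a minimal sketch, the function above can be written in Python as follows. The name `sphere_objective` is hypothetical (it does not come from the course materials), and a candidate is assumed to be a plain list of floats.

```python
from typing import List


def sphere_objective(xs: List[float]) -> float:
    """Sum of squared distances of each x_i from 0.5 (lower is better)."""
    return sum((x - 0.5) ** 2 for x in xs)


# The minimum is 0.0, reached when every component equals 0.5.
best = sphere_objective([0.5] * 10)
worse = sphere_objective([0.0] * 10)  # each of the 10 terms contributes 0.25
```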

![Eggholder](http://www.sfu.ca/~ssurjano/egg.png)

The function we wish to optimize is often called the **objective function**.
The objective function is closely related to the **fitness** function in the GA.
If we have a **maximization** problem, then we can use the objective function directly as a fitness function.
If we have a **minimization** problem, then we need to convert the objective function into a suitable fitness function, since fitness functions must always mean "more is better".
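
One common way to turn a minimization objective into a "more is better" fitness is the reciprocal transform $1 / (1 + f)$, assuming the objective $f$ is never negative. The sketch below is only an illustration, and `to_fitness` is a hypothetical helper name.

```python
def to_fitness(objective_value: float) -> float:
    """Map a non-negative objective (lower is better) into (0, 1] (higher is better)."""
    if objective_value < 0:
        raise ValueError("this transform assumes a non-negative objective")
    return 1.0 / (1.0 + objective_value)


perfect = to_fitness(0.0)  # an objective of 0 maps to the maximum fitness
poor = to_fitness(9.0)     # a larger objective maps to a smaller fitness
```

With this transform, a perfect solution always has fitness 1.0, which makes a convenient stopping test.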

Second, we need to *encode* candidate solutions using an "alphabet" analogous to G, A, T, C in DNA.
This encoding can be quite abstract.
You saw this in the Self Check.
There, a floating point number was encoded as bits, just as in a computer, and a sophisticated decoding scheme was then required.

Sometimes, the encoding need not be very complicated at all.
For example, in the real-valued GA, discussed in the Lectures, we could represent 2.73 as....2.73.
This is similarly true for a string matching problem.
We *could* encode "a" as "a", 97, or '01100001'.
And then "hello" would be:

```
["h", "e", "l", "l", "o"]
```

or

```
[104, 101, 108, 108, 111]
```

or

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```
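
The three encodings above can be produced with a few lines of Python. This is a small sketch; the variable names are illustrative only.

```python
word = "hello"

as_chars = list(word)              # one single-character string per gene
as_codes = [ord(c) for c in word]  # one integer code point per gene
# Eight bits per character, most significant bit first, flattened into one list.
as_bits = [int(b) for c in word for b in format(ord(c), "08b")]
```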

In Genetics terminology, this is the **chromosome** of the individual. And if this individual had the **phenotype** "h" for the first character then they would have the **genotype** for "h" (either as "h", 104, or 01101000).

To keep it straight, think **geno**type is **genes** and **pheno**type is **phenomenon**, the actual thing that the genes express.
So while we might encode a number as 10110110 (genotype), the number itself, 182, is what goes into the fitness function.
The environment operates on zebras, not the genes for stripes.
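
To make the distinction concrete, here is a sketch of decoding the bit genotype from the example into its phenotype. The helper `decode_bits` is hypothetical and assumes an unsigned integer with the most significant bit first.

```python
from typing import List


def decode_bits(genotype: List[int]) -> int:
    """Interpret a list of bits (most significant first) as an unsigned integer."""
    value = 0
    for bit in genotype:
        value = value * 2 + bit
    return value


genotype = [1, 0, 1, 1, 0, 1, 1, 0]
phenotype = decode_bits(genotype)  # the value the fitness function would see
```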

## String Matching

You are going to write a Genetic Algorithm that will solve the problem of matching a target string (at least at the start).
Now, this is kind of silly because in order for this to work, you need to know the target string and if you know the target string, why are you trying to do it?
Well, the problem is *pedagogical*.
It's a fun way of visualizing the GA at work, because as the GA finds better and better candidates, they make more and more sense.

Now, string matching is not *directly* an optimization problem so this falls under the general category of "if we convert the problem into an optimization problem we can solve it with an optimization algorithm" approach to problem solving.
This happens all the time.
We have a problem.
We can't solve it.
We convert it to a problem we *can* solve.
In this case, we're using the GA to solve the optimization part.

And all we need is some sort of measure of the difference between two strings.
The only constraint for our objective function is that it must calculate the score based on element to element (character to character) comparisons with no global transformations of the candidate or target strings.
That measure becomes our **objective function** and we can use it with the Genetic Algorithm.

Since it is probably easier to come up with a score that measures how far apart the two strings are, we will end up with an objective function that represents a **minimization problem**.
Because a fitness function must always be "more is better", we'll need to convert our objective function to a proper fitness function as well.
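
As an illustration of an element-to-element measure, the sketch below counts the positions at which two genotypes differ. It is only one of many possible measures, and `mismatch_count` is a hypothetical name.

```python
from typing import List


def mismatch_count(candidate: List[str], target: List[str]) -> int:
    """Count positions where the two genotypes differ (0 means a perfect match)."""
    return sum(1 for gene, target_gene in zip(candidate, target) if gene != target_gene)


target = list("hello")
far = mismatch_count(list("xdjls"), target)   # only the 'l' at index 3 matches
near = mismatch_count(list("hellp"), target)  # only the last character differs
```

A count like this is a minimization objective: it still has to be converted into a "more is better" fitness before the GA can use it.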

And since this is a GA, we need a **genotype**.
The genotype for this problem is a list of "characters" (individual letters aren't special in Python like they are in some other languages):

```
["h", "e", "l", "l", "o"]
```

and the **phenotype** is the resulting string:

```
"hello"
```

In addition to the generic code and problem specific loss function, you'll need to pick parameters for the run.
These parameters include:

1. population size
2. number of generations
3. probability of crossover
4. probability of mutation

You will also need to pick a selection algorithm, either roulette wheel or tournament selection.
In the latter case, you will need a tournament size.
This is all part of the problem.
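
For comparison with tournament selection, roulette wheel selection picks an individual with probability proportional to its fitness. The sketch below uses `random.choices` from the standard library; `roulette_select` and the fitness values are hypothetical and assume the fitnesses were computed beforehand.

```python
import random
from typing import List


def roulette_select(population: List[List[str]], fitnesses: List[float]) -> List[str]:
    """Pick one individual with probability proportional to its fitness."""
    # random.choices draws with replacement, weighted by the given fitness values.
    return random.choices(population, weights=fitnesses, k=1)[0]


population = [list("abc"), list("abd"), list("xyz")]
fitnesses = [1.0, 0.5, 0.0]  # hypothetical scores; a zero weight is never selected
picked = roulette_select(population, fitnesses)
```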

Every **ten** (10) generations, you should print out the fitness, genotype, and phenotype of the best individual in the population for the specific generation.
The function should return the best individual *of the entire run*, using the same format.

```
{'genotype': ['x', 'd', 'j', 'l', 's'],
  'fitness': 0.041666666666666664,
  'phenotype': 'xdjls'},
```

In [1283]:
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

<a id="genetic_algorithm"></a>
### genetic_algorithm

**(**

You can add as many parameters as you need to `genetic_algorithm`.
The documentation should be filled out according to the programming assignment requirements for the course.
You do not need to unit test this function.
Remember that functions should "only do one thing" and may not be more than 20 lines.
Write helper functions with documentation and unit tests.

A complex function and application like this has a lot of interesting potential:

1. There are a lot of parameters. We could reduce those by using a Dictionary.
2. There are a lot of different possible behaviors, including problem specific behaviors. We could use higher order functions.

Beyond these hints, I leave those decisions to you.


*This is very Mission Impossible. After reading the directions in this Markdown cell, when the time is right, remove them  (everything between and including the parentheses) and replace with your documentation for `genetic_algorithm`! I have started you off.*

**)**

In [1284]:
import random
from typing import List, Dict, Tuple, Callable

<a id="encode_chrarr"></a>
## encode_chrarr

*`encode_chrarr` encode a string as a list of characters (strings of length 1)*  **Used by**: [genetic_algorithm](#genetic_algorithm)

* **pheno** str: phenotype of the gene

**returns** List[str] - genotype of the gene represented as a list of characters

In [1285]:
def encode_chrarr(pheno: str) -> List[str]:
    return list(pheno)

In [1286]:
# Tests
assert encode_chrarr("hello world") == ["h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"] # test 1 - keeps spaces
assert encode_chrarr("") == [] # test 2 - empty string
assert encode_chrarr("🌋🐊⛰") == ["🌋", "🐊", "⛰"] # test 3 - non-ascii chars

<a id="decode_chrarr"></a>
## decode_chrarr

*`decode_chrarr` decode a gene (list of characters) into its phenotype (string)*  **Used by**: [describe_chromosome](#describe_chromosome)

* **geno** List[str] - genotype of the gene represented as a list of characters 

**returns** str - phenotype of gene represented as a string

In [1287]:
def decode_chrarr(geno: List[str]) -> str:
    return "".join(geno)

In [1288]:
# Tests
assert decode_chrarr(["h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]) == "hello world" # test 1 - keeps spaces
assert decode_chrarr([]) == "" # test 2 - empty string
assert decode_chrarr(["🌋", "🐊", "⛰"]) == "🌋🐊⛰" # test 3 - non-ascii chars

<a id="generate_chromosome"></a>
## generate_chromosome

*`generate_chromosome` generate a random chromosome by selecting characters from a provided alphabet*  **Used by**: [generate_population](#generate_population)

* **chromosome_length** int - number of characters in chromosome
* **alphabet** str - alphabet of characters in gene

**returns** List[str] - genotype of the gene represented as a list of characters

In [1289]:
def generate_chromosome(chromosome_length: int, alphabet: str) -> List[str]:
    return [random.choice(alphabet) for _ in range(chromosome_length)]

In [1290]:
assert len(generate_chromosome(4, "abc")) == 4 # test 1 - len of chromosome
assert all(chrom in "abc" for chrom in generate_chromosome(4, "abc")) # test 2 - chromosome contains specified chars
assert len(generate_chromosome(-1, "abc")) == 0 # test 3 - non-positive length returns empty chromosome 

<a id="generate_population"></a>
## generate_population

*`generate_population` generate a list of chromosomes*  **Used by**: [genetic_algorithm](#genetic_algorithm)

* **chromosome_length** int - number of characters in chromosome
* **alphabet** str - alphabet of characters in gene
* **population_size** int - number of chromosomes in the population

**returns** List[List[str]] - list of chromosomes in a population

In [1291]:
def generate_population(chromosome_length: int, alphabet: str, population_size: int) -> List[List[str]]:
    return [generate_chromosome(chromosome_length, alphabet) for _ in range(population_size)]

In [1292]:
# Tests
assert len(generate_population(4, "abc", 2)) == 2 # test 1 - population has the requested size
assert all(chrom in "abc" for inner_list in generate_population(4, "abc", 4) for chrom in inner_list) # test 2 - chromosomes contain only specified chars
assert len(generate_population(-1, "abc", -1)) == 0 # test 3 - non-positive size returns empty population

<a id="fitness_forward"></a>
## fitness_forward

*`fitness_forward` calculate a potential solution chromosome's fitness by comparing it to the target chromosome. The chromosomes are compared gene by gene, i.e. character-wise, by summing up the absolute difference in ASCII value between the solution and target gene. The summed difference $f$ is transformed by $\frac{1}{1+f}$ so that the fitness function values increase as the solution chromosome becomes more similar to the target chromosome*  **Used by**: [genetic_algorithm](#genetic_algorithm)

* **chromosome** List[str] - genotype of potential solution 
* **target** List[str] - genotype of target

**returns** float - fitness score quantifying how similar the two chromosomes are 

In [1293]:
def fitness_forward(chromosome: List[str], target: List[str]) -> float:
    score = 0
    for gene, target_gene in zip(chromosome, target):
        score += abs(ord(gene) - ord(target_gene))
    return 1 / (1 + score)

In [1294]:
# Test
assert fitness_forward(["a", "b", "c"], ["x", "y", "z"]) < fitness_forward(["x", "s", "v"], ["x", "y", "z"]) # test 1 - fitness score increases with similarity
assert fitness_forward(["a", "b", "c"], ["a", "b", "c"]) == 1.0 # test 2 - fitness of perfect match is 1
assert fitness_forward(["#"], ["("]) # no error when non-alpha value is used 

<a id="fitness_reverse"></a>
## fitness_reverse

*`fitness_reverse` calculate a potential solution chromosome's fitness by comparing it to the target chromosome. The chromosomes are compared in reverse order, starting from the first gene in the solution and the last gene of the target. A heuristic is calculated by summing up the absolute difference in ASCII value between the solution and target genes. The summed difference $f$ is transformed by $\frac{1}{1+f}$ so that the fitness function values increase as the solution chromosome becomes more similar to the target chromosome*  **Used by**: [genetic_algorithm](#genetic_algorithm)

* **chromosome** List[str] - genotype of potential solution 
* **target** List[str] - genotype of target

**returns** float - fitness score quantifying how similar the two chromosomes are 

In [1295]:
def fitness_reverse(chromosome: List[str], target: List[str]) -> float:
    score = 0
    chromosome_length = len(chromosome) 
    for index in range(chromosome_length):
        score += abs(ord(chromosome[chromosome_length - index -1]) - ord(target[index]))
    return 1 / (1 + score)

In [1296]:
assert fitness_reverse(["a", "b", "c"], ["x", "y", "z"]) < fitness_reverse(["x", "s", "v"], ["x", "y", "z"]) # test 1 - fitness score increases with similarity
assert fitness_reverse(["c", "b", "a"], ["a", "b", "c"]) == 1.0 # test 2 - fitness of perfect match is 1
assert fitness_reverse(["#"], ["("]) # no error when non-alpha value is used 

<a id="rot13"></a>
## rot13

*`rot13` replace a lowercase letter by the letter 13 characters down the alphabet, wrapping from "z" back to "a"*  **Used by**: [fitness_caeser](#fitness_caeser)

* **char** str - character

**returns** str - character 13 letters down from the input

In [1297]:
def rot13(char: str) -> str:
    return chr((ord(char) - ord("a") + 13) % 26 + ord("a"))

In [1298]:
# Test
assert [rot13(c) for c in "abc"] == ["n", "o", "p"] # test 1 
assert [rot13(c) for c in "nop"] == ["a", "b", "c"] # test 2 
assert [rot13(c) for c in ""] == [] # test 3 empty string

<a id="fitness_caeser"></a>
## fitness_caeser

*`fitness_caeser` calculate a potential solution chromosome's fitness by comparing it to the target chromosome. The chromosomes are compared gene (character) by gene. A heuristic is calculated by summing up the absolute difference in ASCII value between the solution and target genes, after transforming the solution gene using [rot13](#rot13). The summed difference $f$ is transformed by $\frac{1}{1+f}$ so that the fitness function values increase as the solution chromosome becomes more similar to the target chromosome*  **Used by**: [genetic_algorithm](#genetic_algorithm)

* **chromosome** List[str] - genotype of potential solution 
* **target** List[str] - genotype of target

**returns** float - fitness score quantifying how similar the two chromosomes are 

In [1299]:
def fitness_caeser(chromosome: List[str], target: List[str]) -> float:
    score = 0
    for gene, target_gene in zip(chromosome, target):
        score += abs(ord(rot13(gene)) - ord(target_gene))
    return 1 / (1 + score)

In [1300]:
# Tests

assert fitness_caeser(["a", "b", "c"], ["x", "y", "z"]) > fitness_caeser(["x", "s", "v"], ["x", "y", "z"]) # test 1 - fitness score increases with similarity - note that rot13 is happening
assert fitness_caeser(["a", "b", "c"], ["n", "o", "p"]) == 1.0 # test 2 - fitness of perfect match is 1
assert fitness_caeser(["#"], ["("]) # no error when non-alpha value is used 

<a id="sort_population"></a>
## sort_population

*`sort_population` sort a population of chromosomes by fitness*  **Used by**: [tournament](#tournament), [get_solution](#get_solution)

* **population** List[List[str]] - list of chromosomes of potential solution 
* **target** List[str] - genotype of target
* **fitness** Callable - fitness function

**returns** List[List[str]] - list of chromosomes of potential solution in sorted order by fitness

In [1301]:
def sort_population(population: List[List[str]], target: List[str], fitness: Callable) -> List[List[str]]:
    return sorted(population, key=lambda chromosome: fitness(chromosome, target))

In [1302]:
# Tests
pop = [["a", "b", "c"],["d", "e", "f"]]
target = ["a", "b", "c"]

assert sort_population(pop, target, fitness_forward)[-1] == ["a", "b", "c"] 
assert sort_population(pop, target, fitness_forward)[0] == ["d", "e", "f"]
assert sort_population([[]], [], fitness_forward) == [[]]


<a id="tournament"></a>
## tournament

*`tournament` use tournament selection to pick two chromosomes from a population. First, a sub-population is chosen at random; the sub-population is then sorted by fitness and the two chromosomes with the highest fitness are returned*  **Used by**: [genetic_algorithm](#genetic_algorithm)

* **population** List[List[str]] - list of chromosomes of potential solution 
* **target** List[str] - genotype of target
* **sample_size** int - size of tournament sub-population
* **fitness** Callable - fitness function

**returns** Tuple[List[str], List[str]] - two most fit chromosomes from randomly chosen sub-population

In [1303]:
def tournament(population: List[List[str]], target: List[str], sample_size: int, fitness: Callable) -> Tuple[List[str], List[str]]:
    if sample_size < 2 or len(population) < 2:
        return ([], [])
    sub_population = random.sample(population, k=sample_size)
    sub_population = sort_population(sub_population, target, fitness)
    return sub_population[-1], sub_population[-2]

In [1304]:
# Tests
pop = generate_population(3, ALPHABET, 100)
target = ["a", "b", "c"]
assert len(tournament(pop, target, 10, fitness_forward)) == 2

p1, p2 = tournament(pop, target, 10, fitness_forward)
assert fitness_forward(p2, target) <= fitness_forward(p1, target) # ties are possible, so allow equality

assert tournament([[]], [], 0, fitness_forward) ==  ([], [])


<a id="recombine"></a>
## recombine

*`recombine` swap portions of two chromosomes with each other at a specific crossover_index, if crossover_rand is less than crossover_rate. For example, given two chromosomes `#####` and `$$$$$`, a crossover at index 2 would look like: `##$$$`, `$$###`.* **Used by**: [breed](#breed)

* **chromosome1** List[str] - chromosome of potential solution 
* **chromosome2** List[str] - chromosome of potential solution 
* **crossover_index** int - position at which to crossover the two chromosomes
* **crossover_rand** float - random number to select if crossover event occurs
* **crossover_rate** float - rate at which crossover event occurs 

**returns** Tuple[List[str], List[str]] - two possibly recombined chromosomes

In [1305]:
def recombine(chromosome1: List[str], chromosome2: List[str], crossover_index: int, crossover_rand: float, crossover_rate: float) -> Tuple[List[str], List[str]]:
    if crossover_rand >= crossover_rate:
        return chromosome1, chromosome2
    child1 = chromosome1[:crossover_index] + chromosome2[crossover_index:]
    child2 = chromosome2[:crossover_index] + chromosome1[crossover_index:]  # crossover 2 list of strings
    return child1, child2

In [1306]:
# Tests
p1 = ["#","#","#","#","#"]
p2 = ["$","$","$","$","$"]
assert recombine(p1, p2, 2, 0.4, 0.5) == (["#","#","$","$","$"], ["$","$","#","#","#"])
assert recombine(p1, p2, 2, 0.5, 0.5) == (["#","#","#","#","#"], ["$","$","$","$","$"])
assert recombine(p1, p2, 2, 0.6, 0.5) == (["#","#","#","#","#"], ["$","$","$","$","$"])

<a id="mutate"></a>
## mutate

*`mutate` swap character in chromosome with a mutation character at zero-indexed position mutate_index, if mutate_rand is less than mutate_rate. For example, given a chromosome `#####` a mutation at position 2 could be `##$##`.* **Used by**: [breed](#breed)

* **chromosome** List[str] - chromosome of potential solution 
* **mutate_index** int - position at which the mutation event occurs (zero-indexed)
* **mutation** str - character to mutate to
* **mutate_rand** float - random number to select if mutation event occurs
* **mutate_rate** float - rate at which mutation event occurs

**returns** List[str] - possibly mutated chromosomes

In [1307]:
def mutate(chromosome: List[str], mutate_index: int, mutation: str, mutate_rand: float, mutate_rate: float) -> List[str]:
    mutated = chromosome.copy()  # copy so a parent still in the population is never modified in place
    if mutate_rand < mutate_rate:
        mutated[mutate_index] = mutation
    return mutated

In [1308]:
# Tests

assert mutate(["#","#","#","#","#"], 2, "$", 0.4, 0.5) == ["#","#","$","#","#"]
assert mutate(["#","#","#","#","#"], 2, "$", 0.5, 0.5) == ["#","#","#","#","#"]
assert mutate(["#","#","#","#","#"], 2, "$", 0.6, 0.5) == ["#","#","#","#","#"]

<a id="breed"></a>
## breed

*`breed` Create two modified children chromosomes from the parents by selectively performing recombination (based on crossover_rate) and mutation (based on mutate_rate).* **Used by**: [genetic_algorithm](#genetic_algorithm)

* **parent1** List[str] - chromosome of potential solution 
* **parent2** List[str] - chromosome of potential solution 
* **crossover_rate** float - rate at which crossover event occurs 
* **mutate_rate** float - rate at which mutation event occurs

**returns** Tuple[List[str], List[str]] - two possibly recombined or mutated chromosomes

In [1309]:
def breed(parent1: List[str], parent2: List[str], crossover_rate: float, mutate_rate: float) -> Tuple[List[str], List[str]]:
    child1, child2 = recombine(parent1, parent2, random.randint(0, len(parent1) - 1), random.uniform(0, 1), crossover_rate)
    child1 = mutate(child1, random.randint(0, len(child1) - 1), random.choice(ALPHABET), random.uniform(0, 1), mutate_rate)
    child2 = mutate(child2, random.randint(0, len(child2) - 1), random.choice(ALPHABET), random.uniform(0, 1), mutate_rate)
    return child1, child2

In [1310]:
# Tests
p1 = ["#","#","#","#","#"]
p2 = ["$","$","$","$","$"]

assert len(breed(p1, p2, 0.4, 0.4)) == 2  # output is right size
assert breed(p1, p2, 1.0, 0.1)[0] != p1 and breed(p1, p2, 1.0, 0.1)[1] != p2 # recombination happens
assert breed(p1.copy(), p2.copy(), 0.1, 1.0)[0] != p1 and breed(p1.copy(), p2.copy(), 0.1, 1.0)[1] != p2 # mutation happens

<a id="describe_chromosome"></a>
## describe_chromosome

*`describe_chromosome` return a dictionary containing the genotype, phenotype and fitness of a chromosome* **Uses**: [decode_chrarr](#decode_chrarr) **Used by**: [get_solution](#get_solution), [genetic_algorithm](#genetic_algorithm)

* **chromosome** List[str] - chromosome of potential solution 
* **target** List[str] - genotype of target
* **fitness** Callable - fitness function


**returns** Dict - dictionary containing the genotype, phenotype and fitness of a chromosome

In [1311]:
def describe_chromosome(chromosome: List[str], target: List[str], fitness: Callable) -> Dict:
    return {"genotype": chromosome, "fitness": fitness(chromosome, target), "phenotype": decode_chrarr(chromosome)}

In [1312]:
assert describe_chromosome(["a", "b", "c"], ["a", "b", "c"], fitness_forward) == {"genotype": ["a", "b", "c"], "fitness": 1.0, "phenotype": "abc"}
assert describe_chromosome(["a", "b", "c"], ["x", "y", "z"], fitness_forward) == {"genotype": ["a", "b", "c"], "fitness": fitness_forward(["a", "b", "c"], ["x", "y", "z"]), "phenotype": "abc"}
assert describe_chromosome([], [], fitness_forward) == {"genotype": [], "fitness": 1.0, "phenotype": ""}

<a id="get_solution"></a>
## get_solution

*`get_solution` return a dictionary containing the genotype, phenotype and fitness of the most fit chromosome in a population.* **Uses** [sort_population](#sort_population) and [describe_chromosome](#describe_chromosome) **Used by** [genetic_algorithm](#genetic_algorithm):

* **population** List[List[str]] - list of chromosomes of potential solution 
* **target** List[str] - genotype of target
* **fitness** Callable - fitness function

**returns** Dict - dictionary containing the genotype, phenotype and fitness of a chromosome

In [1313]:
def get_solution(population: List[List[str]], target: List[str], fitness: Callable) -> Dict: 
    best_solution = sort_population(population, target, fitness)[-1]
    return describe_chromosome(best_solution, target, fitness)

In [1314]:
# Tests
pop = [["a", "b", "c"],["d", "e", "f"]]
assert get_solution(pop, ["a", "b", "c"], fitness_forward) == {"genotype": ["a", "b", "c"], "fitness": 1.0, "phenotype": "abc"} 
assert get_solution(pop, ["g", "f", "g"], fitness_forward) ==  {"genotype": ["d", "e", "f"], "fitness": fitness_forward(["d", "e", "f"], ["g", "f", "g"]), "phenotype": "def"}
assert get_solution([[]], [], fitness_forward) == {"genotype": [], "fitness": 1.0, "phenotype": ""} # empty chromosome


<a id="genetic_algorithm"></a>
## genetic_algorithm

*`genetic_algorithm` The goal of this function is to demonstrate genetic algorithms by finding words that match a target word. At the core of genetic algorithms is representing solutions to the problem as genes. An array of values (normally binary) is used to represent each solution. By decoding the gene we obtain the phenotype or solution state. Mimicking biology, we assign each genotype a fitness, and apply evolutionary pressure to ensure only solutions with high fitness are selected by the algorithm. In this implementation, the fitness function must return a value between 0 and 1, where 1 represents a perfect match to the target and values near 0 are the furthest from the target. In order to generate new solutions, genes are recombined and mutated. A recombination event is when pieces of two genes are swapped between them. A mutation is when a single allele in a gene is changed at random. With encoding/decoding, fitness and evolutionary pressure, we can produce generations of solutions that tend to include more fit genes within them. The algorithm begins by encoding the input string as a gene. Next, a population of solutions is generated at random. Then we create successive generations of solutions by selecting pairs of parents using tournament selection, randomly applying recombination or mutation to them and then adding them to the next generation. This repeats until a threshold number of generations is reached or the fittest solution in the generation matches the target. In either case the algorithm returns a dictionary containing the genotype, phenotype and fitness of the solution.* **Uses**: [encode_chrarr](#encode_chrarr), [generate_population](#generate_population), [tournament](#tournament), [describe_chromosome](#describe_chromosome), [breed](#breed), and [get_solution](#get_solution)

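* **target_str** str - target string (phenotype) to match
* **fitness** Callable - fitness function
* **parameters** Dict - run parameters with keys `population_size`, `tournament_size`, `max_generation`, `crossover_rate`, `mutate_rate` and `alphabet`
* **log_frequency** int - number of generations between progress printouts (default 10)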

**returns** Dict - dictionary containing the genotype, phenotype and fitness of a chromosome

In [1315]:
def genetic_algorithm(target_str: str, fitness: Callable, parameters: Dict, log_frequency: int = 10) -> Dict:
    population_size, tournament_size, max_generation = parameters["population_size"], parameters["tournament_size"], parameters["max_generation"]
    crossover_rate, mutate_rate, alphabet = parameters["crossover_rate"], parameters["mutate_rate"], parameters["alphabet"]

    target: List[str] = encode_chrarr(target_str)  # encode
    population: List[List[str]] = generate_population(len(target), alphabet, population_size)

    for generation in range(max_generation):
        next_population = []
        
        solution = get_solution(population, target, fitness)
        if solution["fitness"] == 1.0: # stop if we have a match
            return solution
        
        for _ in range(population_size // 2): # fill next generation
            parent1, parent2 = tournament(population, target, tournament_size, fitness)
            child1, child2 = breed(parent1, parent2, crossover_rate, mutate_rate) # mutate + recombine
            next_population.extend([child1, child2])

        population = next_population

        if generation % log_frequency == 0:
            print(f"{generation:5} {get_solution(population, target, fitness)}")

    return get_solution(population, target, fitness)
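
The directions ask for the best individual of the entire run, and a later generation is not guaranteed to contain the best individual seen so far. The sketch below shows one way to track it, assuming individuals are described by dictionaries with a `fitness` key as in the output format shown earlier. `better_of` is a hypothetical helper, not part of the code above, and the sample data is made up.

```python
from typing import Dict, Optional


def better_of(best_so_far: Optional[Dict], candidate: Dict) -> Dict:
    """Return whichever described individual has the higher fitness."""
    if best_so_far is None or candidate["fitness"] > best_so_far["fitness"]:
        return candidate
    return best_so_far


# Hypothetical per-generation bests; the third generation regresses.
generation_bests = [
    {"genotype": list("abd"), "fitness": 0.5, "phenotype": "abd"},
    {"genotype": list("abc"), "fitness": 1.0, "phenotype": "abc"},
    {"genotype": list("abe"), "fitness": 1 / 3, "phenotype": "abe"},
]
best = None
for described in generation_bests:
    best = better_of(best, described)
```

Calling such a helper once per generation inside the main loop would keep the extra bookkeeping to a single line.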

## Problem 1

The target is the string "this is so much fun".
The challenge, aside from implementing the basic algorithm, is deriving a fitness function based on "b" - "p" (for example).
The fitness function should come up with a fitness score based on element to element comparisons between target v. phenotype.
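
Python cannot subtract one string from another, so a difference such as "b" - "p" has to be computed on code points. A small sketch of that idea, with illustrative variable names:

```python
# ord() returns the integer code point of a single character.
difference = ord("b") - ord("p")  # 98 - 112
distance = abs(difference)        # drop the sign so the order of the letters does not matter
```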

In [1316]:
target1 = "this is so much fun"

In [1317]:
# set up if you need it.
params = {"population_size": 100, "tournament_size": 10, "max_generation": 1000, "crossover_rate": 0.9, "mutate_rate": 0.05, "alphabet": ALPHABET}

In [1318]:
result1 = genetic_algorithm(target1, fitness_forward, params)  # do what you need to do for your implementation but don't change the lines above or below.

    0 {'genotype': ['g', 'd', 'k', 't', 'r', 'j', 'y', 'u', 'm', 't', ' ', 'o', 'z', 'f', 'q', ' ', 'i', 'q', 'u'], 'fitness': 0.0041841004184100415, 'phenotype': 'gdktrjyumt ozfq iqu'}
   10 {'genotype': ['n', 'p', 'k', 'r', ' ', 'a', 't', 'e', 'p', 'h', ' ', 'o', 'r', 'c', 'o', ' ', 'i', 'q', 'r'], 'fitness': 0.007751937984496124, 'phenotype': 'npkr ateph orco iqr'}
   20 {'genotype': ['n', 'g', 'k', 'r', ' ', 'a', 't', 'e', 'p', 'h', ' ', 'o', 'r', 'c', 'o', ' ', 'i', 'v', 'q'], 'fitness': 0.00847457627118644, 'phenotype': 'ngkr ateph orco ivq'}
   30 {'genotype': ['t', 'g', 'k', 'r', ' ', 'a', 't', 'e', 'r', 'n', ' ', 'o', 'u', 'c', 'o', ' ', 'h', 'u', 'q'], 'fitness': 0.010101010101010102, 'phenotype': 'tgkr atern ouco huq'}
   40 {'genotype': ['t', 'g', 'i', 'r', ' ', 'k', 't', 'e', 'r', 'n', ' ', 'o', 'u', 'c', 'd', ' ', 'h', 'u', 'o'], 'fitness': 0.011627906976744186, 'phenotype': 'tgir ktern oucd huo'}
   50 {'genotype': ['t', 'g', 'i', 'r', ' ', 'h', 't', ' ', 'r', 'n', ' ', 

In [1319]:
pprint(result1, compact=True)

{'fitness': 1.0,
 'genotype': ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u',
              'c', 'h', ' ', 'f', 'u', 'n'],
 'phenotype': 'this is so much fun'}


## Problem 2

You should have working code now.
The goal here is to think a bit more about fitness functions.
The target string is now, 'nuf hcum os si siht'.
This is obviously target #1 but reversed.
If we just wanted to match the string, this would be trivial.
Instead, in this problem, we want to "decode" the string so that the best individual displays the target forwards.
In order to do this, you'll need to come up with a fitness function that measures how successful candidates are towards this goal.
The constraint is that you may not perform any global operations on the target or individuals.
Your fitness function must still compare a single gene against a single gene.
Your solution will likely not be Pythonic but use indexing.
That's ok.
<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not reverse an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene (one letter against one letter).
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>
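
One way to respect this constraint is to pair position `i` of the candidate with position `n - 1 - i` of the target, reading single characters by index and never building a reversed string. The sketch below only illustrates the index arithmetic; `mirrored_pairs` is a hypothetical name.

```python
from typing import Iterator, List, Tuple


def mirrored_pairs(candidate: List[str], target: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (candidate[i], target[n - 1 - i]) one pair at a time."""
    n = len(target)
    for i in range(n):
        yield candidate[i], target[n - 1 - i]


pairs = list(mirrored_pairs(list("fun"), list("nuf")))
all_match = all(gene == target_gene for gene, target_gene in pairs)
```

Each pair can then be scored with any gene-to-gene measure, just as in the forward case.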

The best individual in the population is the one who expresses this string *forwards*.

"this is so much fun"

In [1320]:
target2 = "nuf hcum os si siht"

In [1321]:
# set up if you need it.
params = {"population_size": 100, "tournament_size": 10, "max_generation": 1000, "crossover_rate": 0.9, "mutate_rate": 0.05, "alphabet": ALPHABET}

In [1322]:
result2 = genetic_algorithm(target2, fitness_reverse, params)  # do what you need to do for your implementation but don't change the lines above or below.

    0 {'genotype': ['m', 'x', 'i', 'u', ' ', 'w', 'y', 'd', 'q', 'p', 'h', 'y', 'w', 'f', 'v', 'g', 'g', 'q', 'm'], 'fitness': 0.003367003367003367, 'phenotype': 'mxiu wydqphywfvggqm'}
   10 {'genotype': ['z', 'e', 'j', 'u', ' ', 'c', 's', ' ', 'q', 'r', ' ', 'k', 'v', 'g', 'g', ' ', 'e', 'q', 'm'], 'fitness': 0.02631578947368421, 'phenotype': 'zeju cs qr kvgg eqm'}
   20 {'genotype': ['v', 'e', 'j', 'u', ' ', 'f', 's', ' ', 'q', 'o', ' ', 'k', 'v', 'g', 'g', ' ', 'e', 'v', 'm'], 'fitness': 0.04, 'phenotype': 'veju fs qo kvgg evm'}
   30 {'genotype': ['v', 'j', 'j', 's', ' ', 'f', 's', ' ', 'q', 'o', ' ', 'k', 'u', 'e', 'g', ' ', 'e', 'v', 'm'], 'fitness': 0.05263157894736842, 'phenotype': 'vjjs fs qo kueg evm'}
   40 {'genotype': ['v', 'j', 'j', 's', ' ', 'f', 's', ' ', 's', 'o', ' ', 'k', 'u', 'e', 'g', ' ', 'e', 'v', 'm'], 'fitness': 0.058823529411764705, 'phenotype': 'vjjs fs so kueg evm'}
   50 {'genotype': ['v', 'j', 'j', 's', ' ', 'j', 's', ' ', 's', 'o', ' ', 'k', 'u', 'e', 'g'

In [1323]:
pprint(result2, compact=True)

{'fitness': 1.0,
 'genotype': ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u',
              'c', 'h', ' ', 'f', 'u', 'n'],
 'phenotype': 'this is so much fun'}


## Problem 3

This is a variation on the theme of Problem 2.
A Caesar cipher replaces each letter of a string with the letter a fixed number of positions down the alphabet; here the shift is 13 (rotating from "z" back to "a" as needed).
This is also known as ROT13 (for "rotate 13").
Latin did not have spaces (and the space is not contiguous with the letters a-z) so we'll remove them from our alphabet.
Again, the goal is to derive a fitness function that compares a single gene against a single gene, without global transformations.
This fitness function assigns higher scores to individuals that correctly decode the target.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not apply ROT13 to an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene.
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual will express the target *decoded*.

"thisissomuchfun"
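
One way to respect the one-gene-against-one-gene rule is to shift a single target letter at the moment it is compared, as in the sketch below. The names `decode_gene` and `fitness_rot13_sketch`, the `(genotype, target)` signature, and the `1 / (1 + mismatches)` scoring are assumptions for illustration only; the notebook's own `fitness_caeser` may work differently.

```python
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"


def decode_gene(gene, shift=13):
    """Shift one letter back by `shift` positions, wrapping around the alphabet."""
    return ALPHABET3[(ALPHABET3.index(gene) - shift) % len(ALPHABET3)]


def fitness_rot13_sketch(genotype, target):
    """Compare each candidate gene with the decoded form of one target gene.

    Hypothetical helper: the signature and the scoring rule are assumptions.
    """
    mismatches = 0
    for i in range(len(target)):
        # Only a single target letter is decoded per comparison.
        if genotype[i] != decode_gene(target[i]):
            mismatches += 1
    return 1.0 / (1.0 + mismatches)


print(fitness_rot13_sketch(list("thisissomuchfun"), "guvfvffbzhpusha"))
```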

In [1324]:
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"

In [1325]:
target3 = "guvfvffbzhpusha"

In [1326]:
# set up if you need it
params = {"population_size": 100, "tournament_size": 10, "max_generation": 1000, "crossover_rate": 0.9, "mutate_rate": 0.05, "alphabet": ALPHABET3}

In [1327]:
result3 = genetic_algorithm(target3, fitness_caeser, params)  # do what you need to do for your implementation but don't change the lines above or below.

    0 {'genotype': ['q', 'k', 'h', 'm', 'h', 'w', 'w', 'q', 'b', 'u', 'd', 'j', 'c', 'v', 'v'], 'fitness': 0.015384615384615385, 'phenotype': 'qkhmhwwqbudjcvv'}
   10 {'genotype': ['t', 'i', 'i', 'o', 'i', 'u', 'p', 'q', 'l', 'u', 'd', 'j', 'd', 'v', 'q'], 'fitness': 0.043478260869565216, 'phenotype': 'tiioiupqludjdvq'}
   20 {'genotype': ['t', 'i', 'i', 's', 'i', 'u', 'p', 'q', 'l', 'u', 'd', 'g', 'f', 'v', 'o'], 'fitness': 0.07142857142857142, 'phenotype': 'tiisiupqludgfvo'}
   30 {'genotype': ['t', 'i', 'i', 's', 'i', 'r', 'u', 'p', 'l', 'u', 'd', 'g', 'f', 'v', ' '], 'fitness': 0.1, 'phenotype': 'tiisirupludgfv '}
   40 {'genotype': ['t', 'i', 'i', 's', 'i', 'r', 'r', 'p', 'l', 'u', 'd', 'g', 'f', 'v', ' '], 'fitness': 0.1111111111111111, 'phenotype': 'tiisirrpludgfv '}
   50 {'genotype': ['t', 'i', 'i', 's', 'i', 'r', 'r', 'p', 'l', 'u', 'd', 'g', 'f', 'v', ' '], 'fitness': 0.1111111111111111, 'phenotype': 'tiisirrpludgfv '}
   60 {'genotype': ['t', 'i', 'i', 's', 'i', 'r', 'r', '

In [1328]:
pprint(result3, compact=True)

{'fitness': 1.0,
 'genotype': ['t', 'h', 'i', 's', 'i', 's', 's', 'o', 'm', 'u', 'c', 'h', 'f',
              'u', ' '],
 'phenotype': 'thisissomuchfu '}


## Problem 4

There is no code for this problem.

In Problem 3, we assumed we knew what the shift was (13, as in ROT13).
What if we didn't?
Describe how you might solve that problem including a description of the solution encoding (chromosome and interpretation) and fitness function. Assume we can add spaces into the message.

**answer here**

To capture the transformation needed to translate between the genotype and the target, we can add another gene (or set of genes) that stores the transformation. For this problem, a single extra gene encodes the amount by which to rotate the string. We would then adjust the fitness function to treat the first n genes as ordinary characters and the last gene as the rotation amount. The usual machinery of the genetic algorithm can then drive the population towards individuals that carry the correct transformation.
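
The encoding described above could be interpreted roughly as follows. This is only a sketch: the names `shift_gene` and `fitness_with_shift_sketch`, the integer representation of the shift gene, the pass-through handling of spaces, and the `1 / (1 + mismatches)` scoring are all assumptions made for illustration.

```python
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"


def shift_gene(gene, shift):
    """Encode one letter by `shift` positions; spaces pass through unchanged."""
    if gene == " ":
        return gene
    return ALPHABET3[(ALPHABET3.index(gene) + shift) % len(ALPHABET3)]


def fitness_with_shift_sketch(chromosome, target):
    """Treat the first n genes as letters and the last gene as the shift."""
    letters, shift = chromosome[:-1], chromosome[-1]
    mismatches = 0
    for i in range(len(target)):
        # Encode one candidate letter with the candidate's own shift gene.
        if shift_gene(letters[i], shift) != target[i]:
            mismatches += 1
    return 1.0 / (1.0 + mismatches)


candidate = list("thisissomuchfun") + [13]
print(fitness_with_shift_sketch(candidate, "guvfvffbzhpusha"))
```

Under this sketch `list("guvfvffbzhpusha") + [0]` also scores 1.0, so agreement with the ciphertext alone cannot pin down the shift. Some additional signal, such as a check that the decoded message contains plausible words, would be needed; that part is not sketched here.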

## Challenge

**You do not need to do this problem and it won't be graded if you do. It's just here if you want to push your understanding.**

The original GA used binary encodings for everything.
We're basically using a Base 27 encoding.
You could, however, write a version of the algorithm that uses an 8-bit encoding for each letter (ignore spaces as they're a bit of a bother).
That is, a 5-letter candidate looks like this:

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

If you wrote your `genetic_algorithm` code generally enough, with higher-order functions, you should be able to implement it using bit strings instead of Latin strings.
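
To make the bit-string encoding concrete, the sketch below converts between letters and 8-bit groups, assuming each group is the letter's ASCII code with the most significant bit first. The helper names `letter_to_bits` and `bits_to_phenotype` are hypothetical and not part of the assignment. Under that assumption, the example candidate above decodes to "hello".

```python
def letter_to_bits(letter):
    """Return the 8-bit ASCII code of one letter, most significant bit first."""
    code = ord(letter)
    return [(code >> k) & 1 for k in range(7, -1, -1)]


def bits_to_phenotype(bits):
    """Decode a flat list of bits into a string, 8 bits per letter."""
    letters = []
    for start in range(0, len(bits), 8):
        code = 0
        for bit in bits[start:start + 8]:
            # Shift the accumulated value left and append the next bit.
            code = (code << 1) | bit
        letters.append(chr(code))
    return "".join(letters)


bits = [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
        0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0,
        0, 1, 1, 0, 1, 1, 1, 1]
print(bits_to_phenotype(bits))
```

With this encoding a mutation flips a single bit rather than replacing a whole letter, so an 8-bit group can decode to a character outside a-z; a fitness function over bit strings would need to tolerate that.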

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty, please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.