# Module 2 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED ID as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, unlike your student ID, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from pprint import pprint

## Local Search - Genetic Algorithm

There are some key ideas in the Genetic Algorithm.

First, there is a problem of some kind that either *is* an optimization problem or the solution can be expressed in terms of an optimization problem.
For example, if we wanted to minimize the function

$$f(x) = \sum_{i=1}^{n} (x_i - 0.5)^2$$

where $n = 10$.
This *is* an optimization problem. Normally, optimization problems are much, much harder.

![Eggholder](http://www.sfu.ca/~ssurjano/egg.png)

The function we wish to optimize is often called the **objective function**.
The objective function is closely related to the **fitness** function in the GA.
If we have a **maximization** problem, then we can use the objective function directly as a fitness function.
If we have a **minimization** problem, then we need to convert the objective function into a suitable fitness function, since fitness functions must always mean "more is better".
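A minimal sketch of both ideas, using the quadratic example above. The names `example_objective` and `to_fitness` are hypothetical, and the `1 / (1 + f)` transform is only one common way to turn a non-negative minimization objective into a "more is better" fitness.

```python
from typing import List

def example_objective(xs: List[float]) -> float:
    # Sum of squared distances of each component from 0.5 (lower is better).
    return sum((x - 0.5) ** 2 for x in xs)

def to_fitness(objective_value: float) -> float:
    # Assumes objective_value >= 0; maps 0 to 1.0 and larger values toward 0.
    return 1.0 / (1.0 + objective_value)

optimum = [0.5] * 10     # every component sits at the minimizer
far_away = [0.0] * 10    # every component is 0.5 away from the minimizer

fitness_optimum = to_fitness(example_objective(optimum))
fitness_far = to_fitness(example_objective(far_away))
```

With this transform the candidate at the minimum gets the highest possible fitness, and worse candidates get fitness values closer to zero.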

Second, we need to *encode* candidate solutions using an "alphabet" analogous to G, A, T, C in DNA.
This encoding can be quite abstract.
You saw this in the Self Check.
There, a floating point number was encoded as bits, just as in a computer, and a sophisticated decoding scheme was then required.

Sometimes, the encoding need not be very complicated at all.
For example, in the real-valued GA, discussed in the Lectures, we could represent 2.73 as....2.73.
This is similarly true for a string matching problem.
We *could* encode "a" as "a", 97, or '01100001'.
And then "hello" would be:

```
["h", "e", "l", "l", "o"]
```

or

```
[104, 101, 108, 108, 111]
```

or

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```
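As a small illustration, the three encodings of "hello" can be produced with built-in functions. This is only a sketch; the variable names are made up for the example.

```python
word = "hello"

# Characters as themselves.
as_chars = list(word)
# Characters as their integer character codes.
as_codes = [ord(c) for c in word]
# Characters as bits: 8 bits per character, most significant bit first.
as_bits = [int(b) for c in word for b in format(ord(c), "08b")]
```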

In Genetics terminology, this is the **chromosome** of the individual. And if this individual had the **phenotype** "h" for the first character then they would have the **genotype** for "h" (either as "h", 104, or 01101000).

To keep it straight, think **geno**type is **genes** and **pheno**type is **phenomenon**, the actual thing that the genes express.
So while we might encode a number as 10110110 (genotype), the number itself, 182, is what goes into the fitness function.
The environment operates on zebras, not the genes for stripes.
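The bit-string example can be sketched as a decoding step from genotype to phenotype. The sketch assumes an unsigned integer stored most significant bit first, which is one possible convention.

```python
genotype = [1, 0, 1, 1, 0, 1, 1, 0]

# Decode the bit list (most significant bit first) into the integer phenotype.
phenotype = int("".join(str(bit) for bit in genotype), 2)
```

Here `phenotype` is the value a fitness function would receive, not the list of bits.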

## String Matching

You are going to write a Genetic Algorithm that will solve the problem of matching a target string (at least at the start).
Now, this is kind of silly because in order for this to work, you need to know the target string and if you know the target string, why are you trying to do it?
Well, the problem is *pedagogical*.
It's a fun way of visualizing the GA at work, because as the GA finds better and better candidates, they make more and more sense.

Now, string matching is not *directly* an optimization problem so this falls under the general category of "if we convert the problem into an optimization problem we can solve it with an optimization algorithm" approach to problem solving.
This happens all the time.
We have a problem.
We can't solve it.
We convert it to a problem we *can* solve.
In this case, we're using the GA to solve the optimization part.

And all we need is some sort of measure of the difference between two strings.
We can use that measure as a **loss function**.
A loss function gives us a score that tells us how different two strings are.
The loss function becomes our objective function and we use the GA to minimize it by converting the objective function to a fitness function.
So that's the first step, come up with the loss/objective function.
The only stipulation is that it must calculate the score based on element to element (character to character) comparisons with no global transformations of the candidate or target strings.
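One simple measure that satisfies the element-to-element stipulation is a count of mismatched positions (a Hamming distance). This is an illustrative sketch only, not necessarily the loss you should use, and `mismatch_count` is a hypothetical name.

```python
def mismatch_count(target: str, candidate: str) -> int:
    # Compare position by position; assumes both strings have the same length.
    return sum(1 for t, c in zip(target, candidate) if t != c)

exact = mismatch_count("hello", "hello")   # no positions differ
close = mismatch_count("hello", "hallo")   # one position differs
```

A count like this treats every wrong character as equally wrong. A loss based on the distance between character codes also says how far off each character is, which gives the GA a smoother signal to follow.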

And since this is a GA, we need a **genotype**.
The genotype for this problem is a list of "characters" (individual letters aren't special in Python like they are in some other languages):

```
["h", "e", "l", "l", "o"]
```

and the **phenotype** is the resulting string:

```
"hello"
```

In addition to the generic code and problem specific loss function, you'll need to pick parameters for the run.
These parameters include:

1. population size
2. number of generations
3. probability of crossover
4. probability of mutation

You will also need to pick a selection algorithm, either roulette wheel or tournament selection.
In the latter case, you will need a tournament size.
This is all part of the problem.
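For comparison with tournament selection, roulette wheel selection can be sketched with `random.choices`, which draws items with probability proportional to the supplied weights. The sketch assumes individuals are `(phenotype, genotype, fitness)` tuples with non-negative fitness values that are not all zero; `roulette_select` is a hypothetical name.

```python
import random
from typing import List, Tuple

Individual = Tuple[str, List[str], float]

def roulette_select(population: List[Individual], rng: random.Random) -> Individual:
    # Selection probability is proportional to fitness (index 2 of each tuple).
    weights = [individual[2] for individual in population]
    return rng.choices(population, weights=weights, k=1)[0]

population = [("aaa", list("aaa"), 0.1), ("abc", list("abc"), 0.9)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
picks = [roulette_select(population, rng)[0] for _ in range(1000)]
```

Over many draws, the individual with fitness 0.9 should be picked far more often than the one with fitness 0.1, but the weaker individual still has a chance.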

Every **ten** (10) generations, you should print out the fitness, genotype, and phenotype of the best individual in the population for the specific generation.
The function should return the best individual *of the entire run*, using the same format.

In [2]:
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
from typing import List, Tuple, Dict, Callable, Any
from copy import deepcopy
import random
from random import random as rand
from random import randint as randi

<a id="generate_random_populations"></a>
## generate_random_populations

`generate_random_populations` creates a list of `N` randomly generated individuals, all with length `str_len` from characters in the alphabet `alph`. The format of an individual in this implementation is `(phenotype, genotype, score)`. The phenotype is a string representation of a chromosome, the genotype is a list representation of a chromosome, and the score is the fitness of the individual (between 0 and 1). `seed` is used to control the randomization and can be omitted outside of testing. **Used by**: [genetic_algorithm](#genetic_algorithm)

* **N**: number of random strings to generate
* **str_len**: length of each new individual string in population
* **alph**: alphabet over which to generate new strings
* **seed**: optional parameter for random seed (only for testing)

**returns** `List[Tuple[str, List[str], float]]`: a list of `N` new individuals (tuples of `str, List[str], float`)

In [3]:
def generate_random_populations(N: int, str_len: int, alph: str, seed = None) -> List[Tuple[str, List[str], float]]:
    # Seed only when a seed is supplied (a seed of 0 is valid).
    if seed is not None: random.seed(seed)
    pops = []
    for _ in range(N):
        # Draw str_len characters from alph with replacement.
        rand_str = ''.join(random.choices(alph, k=str_len))
        pops.append((rand_str, list(rand_str), 0))
    return pops

In [4]:
# assertions/unit tests
N, str_len, seed = 5, 5, 5
rand_pop = generate_random_populations(N, str_len, ALPHABET, seed)
assert rand_pop == [('quvzt', ['q', 'u', 'v', 'z', 't'], 0), 
                    ('yamzr', ['y', 'a', 'm', 'z', 'r'], 0), 
                    ('ydmgo', ['y', 'd', 'm', 'g', 'o'], 0), 
                    ('pafhy', ['p', 'a', 'f', 'h', 'y'], 0), 
                    ('uevdq', ['u', 'e', 'v', 'd', 'q'], 0)]

rand_pop = generate_random_populations(0, str_len, ALPHABET, seed)
assert rand_pop == []

rand_pop = generate_random_populations(N, 0, ALPHABET)
assert rand_pop == [('', [], 0), ('', [], 0), ('', [], 0), ('', [], 0), ('', [], 0)]

<a id="evaluate"></a>
## evaluate

`evaluate` is the fitness function applied over every individual in `population`. It takes `objective` as an argument and converts the minimization problem to a maximization problem by inverting the score from the objective: `fitness = 1 / (1 + score)`. The function updates `population` in place and returns it. **Uses**: [objective_1](#objective_1), [objective_2](#objective_2), [objective_3](#objective_3). **Used by**: [genetic_algorithm](#genetic_algorithm)

* **objective**: a callable function that represents the objective function
* **target**: the target string against which each individual is scored
* **population**: the list of individuals to be evaluated

**returns** `List[Tuple[str, List[str], float]]`: `population` after evaluation

In [5]:
def evaluate(objective: Callable, target: str, population: List[Tuple[str, List[str], float]]) -> List[Tuple[str, List[str], float]]:
    for i, individual in enumerate(population):
        score = objective(target, individual)
        population[i] = (individual[0], individual[1], 1/(1 + score))
    return population

In [6]:
# assertions/unit tests
def objectivetest(target: str, pheno: Tuple[str, List[str], int]) -> int:
    score = 0
    for i, c_tgt in enumerate(target):
        score += abs(ord(c_tgt) - ord(pheno[1][i]))
    return score

population = [('tixlz', ['t', 'i', 'x', 'l', 'z'], 0), 
                    ('w xuq', ['w', ' ', 'x', 'u', 'q'], 0), 
                    ('a oyh', ['a', ' ', 'o', 'y', 'h'], 0), 
                    ('ubfdl', ['u', 'b', 'f', 'd', 'l'], 0), 
                    ('phmrd', ['p', 'h', 'm', 'r', 'd'], 0)]
new_pop = evaluate(objectivetest, "w xuq", population)
assert population[1][2] == 1
assert len(new_pop) == len(population)

new_pop = evaluate(objectivetest, "abcde", [])
assert new_pop == []

<a id="pick_parents"></a>
## pick_parents

`pick_parents` uses a tournament selection algorithm to select two parents from `population` for reproduction. The `tournament_size` is a parameter of the function. By selecting a random group of (`tournament_size`) individuals (a tournament) from `population`, and then picking the best one of that group, we bias selection toward fitter individuals, so their genes are more likely to be passed on to future generations. `seed` is a parameter used only in testing and is omitted outside of testing. **Used by**: [genetic_algorithm](#genetic_algorithm)

* **population**: a list of individuals to select tournaments from
* **tournament_size**: the size of each tournament being run
* **seed**: optional parameter for random seed (only for testing)

**returns** `Tuple[Tuple[str, List[str], float], Tuple[str, List[str], float]]`: a tuple of the two parent individuals chosen by tournament selection

In [7]:
def pick_parents(population: List[Tuple[str, List[str], float]], tournament_size: int, seed = None) -> Tuple[Tuple[str, List[str], float], Tuple[str, List[str], float]]:
    if seed is not None: random.seed(seed)
    tourn_1 = random.sample(population, tournament_size)
    parent1 = max(tourn_1, key = lambda competitor: competitor[2])
    tourn_2 = random.sample(population, tournament_size)
    parent2 = max(tourn_2, key = lambda competitor: competitor[2])
    return (parent1, parent2)

In [8]:
# assertions/unit tests
population = [('tixlz', ['t', 'i', 'x', 'l', 'z'], 5), 
                    ('w xuq', ['w', ' ', 'x', 'u', 'q'], 4), 
                    ('a oyh', ['a', ' ', 'o', 'y', 'h'], 3), 
                    ('ubfdl', ['u', 'b', 'f', 'd', 'l'], 2), 
                    ('phmrd', ['p', 'h', 'm', 'r', 'd'], 1)]
actual_parents = pick_parents(population, 3, 5)
assert actual_parents == (('a oyh', ['a', ' ', 'o', 'y', 'h'], 3), ('tixlz', ['t', 'i', 'x', 'l', 'z'], 5))

actual_parents = pick_parents(population, 1, 5)
assert actual_parents == (('phmrd', ['p', 'h', 'm', 'r', 'd'], 1), ('a oyh', ['a', ' ', 'o', 'y', 'h'], 3))

actual_parents = pick_parents(population, 1, 2)
assert actual_parents== (('tixlz', ['t', 'i', 'x', 'l', 'z'], 5), ('tixlz', ['t', 'i', 'x', 'l', 'z'], 5))

<a id="mutate"></a>
## mutate

`mutate` performs a mutation on the given gene. The function takes a `phenotype`, a probability of mutation (`pm`), a value for the chance of mutation (`pm_value`), and a new gene/new location on which to mutate the `phenotype`. If `pm_value` is less than or equal to `pm`, the function modifies `phenotype` in place at `new_loc` and returns it; otherwise it returns the original `phenotype` unchanged.

Mutation is one of the two ways in which the genetic algorithm produces variance in future generations. Although mutation is much less likely than `crossover` (usually <= 5%), the addition of mutation allows for only specific genes to be improved, rather than changing the entire genome (which is the most extreme case in `crossover`). **Used by**: [reproduce](#reproduce)

* **phenotype**: a list representation of the phenotype to mutate
* **pm**: the probability of mutation as a float
* **pm_value**: the chance of mutation as a float
* **new_gene**: the new character to mutate to
* **new_loc**: the index at which to mutate

**returns** `List[str]`: the mutated phenotype or the original phenotype as a list (depending on `pm_value`)

In [9]:
def mutate(phenotype: List[str], pm: float, pm_value: float, new_gene: str, new_loc: int) -> List[str]:
    if pm_value <= pm:
        phenotype[new_loc] = new_gene
    return phenotype

In [10]:
# assertions/unit tests
parent = list("abcdef")
actual_mutate = mutate(parent, 0.05, 0.01, "a", 3)
assert actual_mutate == list("abcaef")

parent = list("abcdef")
actual_mutate = mutate(parent, 0.05, 0.5, "a", 3)
assert actual_mutate == list("abcdef")

parent = list("abcdef")
actual_mutate = mutate(parent, 0.05, 0.01, "a", 0)
assert actual_mutate == list("abcdef")

<a id="crossover"></a>
## crossover

`crossover` performs the crossover operation on two parents. The function takes two parents (as list representations), a probability of crossover (`pc`), a chance of crossover (`pc_value`), and a crossover index `cross_idx`. If `pc_value` is less than or equal to `pc`, crossover occurs. Otherwise, copies of the two parents are passed on to the next generation unchanged. Crossover is defined as:
* `child1` taking the first (`cross_idx`) elements from parent1 and the remaining elements from parent2
* `child2` taking the first (`cross_idx`) elements from parent2 and the remaining elements from parent1

This operation lets individuals introduce variance into future populations; the probability of crossover is usually between 0.85 and 0.95. This is the main method by which individuals are changed in future generations in the genetic algorithm. **Used by**: [reproduce](#reproduce)

* **parent1**: a list representation of parent1
* **parent2**: a list representation of parent2
* **pc**: the probability of crossover
* **pc_value**: the random chance that crossover occurs
* **cross_idx**: a random index within the bounds of the string at which to perform crossover

**returns** `Tuple[List[str], List[str]]`: a tuple containing the two children.

In [11]:
def crossover(parent1: List[str], parent2: List[str], pc: float, pc_value: float, cross_idx: int) -> Tuple[List[str], List[str]]:
    if pc_value <= pc:
        child1 = parent1[:cross_idx] + parent2[cross_idx:]
        child2 = parent2[:cross_idx] + parent1[cross_idx:]
        return (child1, child2)
    else:
        return (deepcopy(parent1), deepcopy(parent2))

In [12]:
# assertions/unit tests
parent1 = list("abcdef")
parent2 = list("defabc")
parent3 = list("ghijkl")
actual_children = crossover(parent1, parent3, 0.9, 0.6, 3)
assert actual_children == (list("abcjkl"), list("ghidef"))

parent1 = list("abcdef")
parent2 = list("defabc")
actual_children = crossover(parent1, parent2, 0.9, 1, 5)
assert actual_children == (parent1, parent2)

parent1 = list("abcdef")
parent2 = list("defabc")
actual_children = crossover(parent1, parent2, 0.9, 0.3, 0)
assert actual_children == (parent2, parent1)

<a id="reproduce"></a>
## reproduce

`reproduce` takes the probabilities of crossover and mutation, two parents, and an alphabet. The function then generates random variables for crossover and mutation (for both children) to determine whether each operation occurs. Additionally, the function chooses two random symbols from `alphabet` for mutation. First, crossover is computed; then, mutation on each child is computed. The function returns a list of the two children's genotypes (`child1` and `child2`), each a list of characters.

Reproduction is how the genetic algorithm progresses. Through crossover and mutation, the best-fit parents will be bred into future generations, allowing for new variability and potentially better-fit individuals to progress. **Uses**: [crossover](#crossover), [mutate](#mutate). **Used by**: [genetic_algorithm](#genetic_algorithm)

* **pm**: probability of mutation
* **pc**: probability of crossover
* **parent1**: the first parent in reproduction
* **parent2**: the second parent in reproduction
* **alphabet**: the list of characters over which to choose random genes
* **seed**: optional parameter for random seed (only for testing)

**returns** `List[List[str]]`: a list of the two children's genotypes after reproduction

In [13]:
def reproduce(pm: float, pc: float, parent1: Tuple[str, List[str], float], 
              parent2: Tuple[str, List[str], float], alphabet: str, seed = None) -> List[List[str]]:
    if seed is not None: random.seed(seed)
    str_len = len(parent1[0]) - 1
    pc_value, pm_value1, pm_value2  = rand(), rand(), rand()
    cross_idx, new_loc1, new_loc2 = randi(0, str_len), randi(0, str_len), randi(0, str_len)
    
    (child1, child2) = crossover(parent1[1], parent2[1], pc, pc_value, cross_idx)
    
    new_gene1 = random.choice(alphabet)
    child1 = mutate(child1, pm, pm_value1, new_gene1, new_loc1)
    new_gene2 = random.choice(alphabet)
    child2 = mutate(child2, pm, pm_value2, new_gene2, new_loc2)
    return [child1, child2]

In [14]:
# assertions/unit tests
parent1 = ("abcdef", list("abcdef"), 0)
parent2 = ("defabc", list("defabc"), 0)

[child1, child2] = reproduce(0.05, 0.9, parent1, parent2, ALPHABET, 5)
assert child1 == ['a', 'b', 'c', 'd', 'e', 'c']
assert child2 == ['d', 'e', 'f', 'a', 'b', 'f']

[child1, child2] = reproduce(1, 0, parent1, parent2, ALPHABET, 5)
assert child1 == ['a', 'b', 'c', 'd', 'e', 'a']

<a id="curr_and_max_best"></a>
## curr_and_max_best

`curr_and_max_best` takes a population and a `max_best` representing the best individual found in the genetic algorithm and returns a list of two individuals: the `max_best`, as well as the best individual of the current population (the current generation). The comparison is based on the max fitness score. **Used by**: [genetic_algorithm](#genetic_algorithm)

* **population**: a list of individuals with fitness scores
* **max_best**: an individual with the current highest fitness of the algorithm runtime

**returns** `List[Tuple[str, List[str], float]]`: a list with the `max_best` and `curr_best` individuals

In [15]:
def curr_and_max_best(population: List[Tuple[str, List[str], float]], max_best: Tuple[str, List[str], float]) -> List[Tuple[str, List[str], float]]:
    curr_best = max(population, key = lambda score: score[2])
    if not max_best or curr_best[2] > max_best[2]:
        max_best = curr_best
    return [max_best, curr_best]

In [16]:
# assertions/unit tests
population = [('tixlz', ['t', 'i', 'x', 'l', 'z'], 5), 
                    ('w xuq', ['w', ' ', 'x', 'u', 'q'], 4), 
                    ('a oyh', ['a', ' ', 'o', 'y', 'h'], 3), 
                    ('ubfdl', ['u', 'b', 'f', 'd', 'l'], 2), 
                    ('phmrd', ['p', 'h', 'm', 'r', 'd'], 1)]
(actual_max, actual_curr) = curr_and_max_best(population, ('a oyh', ['a', ' ', 'o', 'y', 'h'], 3))
assert actual_max == ('tixlz', ['t', 'i', 'x', 'l', 'z'], 5)
assert actual_curr == actual_max

(actual_max, actual_curr) = curr_and_max_best(population, ('w xuq', ['w', ' ', 'x', 'u', 'q'], 7))
assert actual_max == ('w xuq', ['w', ' ', 'x', 'u', 'q'], 7)
assert actual_curr == ('tixlz', ['t', 'i', 'x', 'l', 'z'], 5)

(actual_max, actual_curr) = curr_and_max_best(population, ())
assert actual_max == actual_curr

<a id="add_children"></a>
## add_children

`add_children` takes two children as lists of characters (genotypes) and a population, and appends the two children as individuals to the population. It returns the population with the children added. **Used by**: [genetic_algorithm](#genetic_algorithm)

* **child1**: the first child genotype to be added to `population`
* **child2**: the second child genotype to be added to `population`
* **population**: a list of individuals with fitness scores

**returns** `List[Tuple[str, List[str], float]]`: the population with `child1`,`child2` added

In [17]:
def add_children(child1: List[str], child2: List[str], population: List[Tuple[str, List[str], float]]) -> List[Tuple[str, List[str], float]]:
    population.append((''.join(child1), child1, 0))
    population.append((''.join(child2), child2, 0))
    return population

In [18]:
# assertions/unit tests
population = []
new_pop = add_children(list("abc"), list("def"), population)
assert new_pop == [("abc", list("abc"), 0), ("def", list("def"), 0)]
assert new_pop == population

new_pop = add_children(list("abc"), [], population)
assert new_pop == [('abc', ['a', 'b', 'c'], 0), 
                   ('def', ['d', 'e', 'f'], 0), 
                   ('abc', ['a', 'b', 'c'], 0), 
                   ('', [], 0)]

<a id="genetic_algorithm"></a>
## genetic_algorithm

`genetic_algorithm` takes a dictionary of parameters (`params`), an `alphabet` of symbols, and an `objective` function and returns the best individual that matches the `target` string. The function generates a list of `N` new individuals (all of the same length as `target`) and iterates on this `population` for `limit` generations.  At each iteration, the current generation is evaluated, bred, and passed on. Once every 10 generations, the current best individual of that generation will be printed. As mentioned above, the individuals in the population are represented as `Tuple[str, List[str], float]`, with the tuple representing (phenotype, genotype, fitness).

The genetic algorithm is meant to mimic genetics/evolution found in nature - the concept of "survival of the fittest" is a key portion of how the genetic algorithm reaches the goal. By preferentially breeding the "most fit" individuals (which is determined by the objective / fitness function in `objective` and `evaluate`), the population tends to converge toward the `target` over successive generations. In this implementation, the function uses a population size of `N = 200` and a limit of `limit = 200` generations, with probabilities of mutation and crossover of `pm = 0.05, pc = 0.9` respectively. Finally, in the tournament selection for picking parents, the `tournament_size` is set to 7. These parameters have been found to converge to the target very often (> 90% of the time). **Uses**: [generate_random_populations](#generate_random_populations), [evaluate](#evaluate), [pick_parents](#pick_parents), [reproduce](#reproduce), [add_children](#add_children), [curr_and_max_best](#curr_and_max_best)

In [19]:
def genetic_algorithm(params: Dict[Any, Any], alphabet: str, objective: Callable) -> Tuple[str, List[str], float]:
    population = generate_random_populations(params["N"], len(params["target"]), alphabet)
    generations, total_best = 0, ()
    while generations < params["limit"]:
        population = evaluate(objective, params["target"], population)
        next_population = []
        for n in range(params["N"] // 2):
            (parent1, parent2) = pick_parents(population, params["tourn_size"])
            (child1, child2) = reproduce(params["pm"], params["pc"], parent1, parent2, alphabet)
            next_population = add_children(child1, child2, next_population)
        population = evaluate(objective, params["target"], next_population)
        generations += 1
        (total_best, gen_best) = curr_and_max_best(population, total_best)
        if generations % 10 == 0 and generations:
            print("Best individual of generation", generations, " :", gen_best)   
    return total_best

## Problem 1

The target is the string "this is so much fun".
The challenge, aside from implementing the basic algorithm, is deriving a fitness function based on "b" - "p" (for example).
The fitness function should come up with a fitness score based on element to element comparisons between target v. phenotype.

In [20]:
target1 = "this is so much fun"

<a id="objective_1"></a>
## objective_1

`objective_1` measures the ASCII string-difference between the `target` and the given `phenotype` character-by-character and returns the sum of the differences. The objective function is inherently a minimization function (the larger the ASCII score, the worse the string is in terms of fitness). The `evaluate` function converts this to a maximization problem for use in the `genetic_algorithm`. This objective allows the `genetic_algorithm` to match strings for Problem 1's specifications. **Used by**: [evaluate](#evaluate), [genetic_algorithm](#genetic_algorithm).

* **target**: `target` string to which the objective measures fitness
* **pheno**: the current phenotype that is being scored

**returns** int: the ASCII score computed character-by-character.

In [21]:
def objective_1(target: str, pheno: Tuple[str, List[str], float]) -> int:
    score = 0
    for i, c_tgt in enumerate(target):
        score += abs(ord(c_tgt) - ord(pheno[1][i]))
    return score

In [22]:
# assertions/unit tests
target = "this is so much fun"
actual_score = objective_1(target, (target, list(target), 0))
assert actual_score == 0

test_pheno = "this is aa much fun"
actual_score = objective_1(target, (test_pheno, list(test_pheno), 0))
assert actual_score == 32

actual_score = objective_1("", ("", [], 0))
assert actual_score == 0

In [23]:
# set up if you need it.
params = {"N": 200, "limit": 200, "target": target1, "pc": 0.90, "pm": 0.05, "tourn_size": 7}

In [24]:
result1 = genetic_algorithm(params, ALPHABET, objective_1) # do what you need to do for your implementation but don't change the lines above or below.

Best individual of generation 10  : ('sjmq gp nm ozag juo', ['s', 'j', 'm', 'q', ' ', 'g', 'p', ' ', 'n', 'm', ' ', 'o', 'z', 'a', 'g', ' ', 'j', 'u', 'o'], 0.02702702702702703)
Best individual of generation 20  : ('sggq gt nm nzcg guo', ['s', 'g', 'g', 'q', ' ', 'g', 't', ' ', 'n', 'm', ' ', 'n', 'z', 'c', 'g', ' ', 'g', 'u', 'o'], 0.038461538461538464)
Best individual of generation 30  : ('sggt gt sn nwcg guo', ['s', 'g', 'g', 't', ' ', 'g', 't', ' ', 's', 'n', ' ', 'n', 'w', 'c', 'g', ' ', 'g', 'u', 'o'], 0.0625)
Best individual of generation 40  : ('sgit gt sn nwcg guo', ['s', 'g', 'i', 't', ' ', 'g', 't', ' ', 's', 'n', ' ', 'n', 'w', 'c', 'g', ' ', 'g', 'u', 'o'], 0.07142857142857142)
Best individual of generation 50  : ('sgit jt sn nwcg guo', ['s', 'g', 'i', 't', ' ', 'j', 't', ' ', 's', 'n', ' ', 'n', 'w', 'c', 'g', ' ', 'g', 'u', 'o'], 0.07692307692307693)
Best individual of generation 60  : ('tgis jt sn ntcg guo', ['t', 'g', 'i', 's', ' ', 'j', 't', ' ', 's', 'n', ' ', 'n', '

In [25]:
pprint(result1, compact=True)

('this is so much fun',
 ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u', 'c', 'h',
  ' ', 'f', 'u', 'n'],
 1.0)


## Problem 2

You should have working code now.
The goal here is to think a bit more about fitness functions.
The target string is now, 'nuf hcum os si siht'.
This is obviously target #1 but reversed.
If we just wanted to match the string, this would be trivial.
Instead, in this problem, we want to "decode" the string so that the best individual displays the target forwards.
In order to do this, you'll need to come up with a fitness function that measures how successful candidates are towards this goal.
The constraint is that you may not perform any global operations on the target or individuals.
Your fitness function must still compare a single gene against a single gene.
Your solution will likely not be Pythonic but use indexing.
That's ok.
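The indexing idea can be sketched on a tiny example: gene `i` of the candidate is paired with the character at the mirrored position of the target, so nothing is ever reversed as a whole. The three-letter strings and variable names here are made up for illustration.

```python
target = "cba"
candidate = ["a", "b", "c"]

last = len(target) - 1
# Pair gene i of the candidate with character (last - i) of the target.
pairs = [(candidate[i], target[last - i]) for i in range(len(candidate))]
# Sum the per-pair character-code differences; 0 means a perfect decode.
total_difference = sum(abs(ord(gene) - ord(char)) for gene, char in pairs)
```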

**NOTE FOR ME: just measure difference between character at i and -i instead of i vs i.**
<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not reverse an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene (one letter against one letter).
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual in the population is the one who expresses this string *forwards*.

In [26]:
target2 = "nuf hcum os si siht"

<a id="objective_2"></a>
## objective_2

`objective_2` measures the ASCII string-difference between the `target` and the given `phenotype` character-by-character and returns the sum of the differences. This objective allows the `genetic_algorithm` to match strings for Problem 2's specifications. Here, the problem is to match to the `target` in reverse order, so each character is compared to its complement (the index computed from the end rather than the start). **Used by**: [evaluate](#evaluate), [genetic_algorithm](#genetic_algorithm).

* **target**: `target` string to which the objective measures fitness
* **pheno**: the current phenotype that is being scored

**returns** int: the ASCII score computed character-by-character.

In [27]:
def objective_2(target: str, pheno: Tuple[str, List[str], float]) -> int:
    score, str_len = 0, len(pheno[1]) - 1
    for i, c_tgt in enumerate(target):
        score += abs(ord(c_tgt) - ord(pheno[1][str_len-i]))
    return score

In [28]:
# assertions/unit tests
target = "this is so much fun"
pheno = "nuf hcum os si siht"
actual_score = objective_2(target, (pheno, list(pheno), 0))
assert actual_score == 0

actual_score = objective_2(target, ("nuf hcum    si siht", list("nuf hcum    si siht"), 0))
assert actual_score == 162

actual_score = objective_2("", ("", (), 0))
assert actual_score == 0

In [29]:
# set up if you need it.
params = {"N": 200, "limit": 200, "target": target2, "pc": 0.90, "pm": 0.05, "tourn_size": 7}

In [30]:
result2 = genetic_algorithm(params, ALPHABET, objective_2) # do what you need to do for your implementation but don't change the lines above or below.

Best individual of generation 10  : ('ykis lt xi iqgg ftt', ['y', 'k', 'i', 's', ' ', 'l', 't', ' ', 'x', 'i', ' ', 'i', 'q', 'g', 'g', ' ', 'f', 't', 't'], 0.022727272727272728)
Best individual of generation 20  : ('seis kt xl iuag ftt', ['s', 'e', 'i', 's', ' ', 'k', 't', ' ', 'x', 'l', ' ', 'i', 'u', 'a', 'g', ' ', 'f', 't', 't'], 0.03333333333333333)
Best individual of generation 30  : ('shis kt qm muag fun', ['s', 'h', 'i', 's', ' ', 'k', 't', ' ', 'q', 'm', ' ', 'm', 'u', 'a', 'g', ' ', 'f', 'u', 'n'], 0.08333333333333333)
Best individual of generation 40  : ('shis it qm muag fun', ['s', 'h', 'i', 's', ' ', 'i', 't', ' ', 'q', 'm', ' ', 'm', 'u', 'a', 'g', ' ', 'f', 'u', 'n'], 0.1)
Best individual of generation 50  : ('shis is qo muag fun', ['s', 'h', 'i', 's', ' ', 'i', 's', ' ', 'q', 'o', ' ', 'm', 'u', 'a', 'g', ' ', 'f', 'u', 'n'], 0.14285714285714285)
Best individual of generation 60  : ('shis is qo muag fun', ['s', 'h', 'i', 's', ' ', 'i', 's', ' ', 'q', 'o', ' ', 'm', 'u',

In [31]:
pprint(result2, compact=True)

('this is so much fun',
 ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u', 'c', 'h',
  ' ', 'f', 'u', 'n'],
 1.0)


## Problem 3

This is a variation on the theme of Problem 2.
The Caesar Cypher replaces each letter of a string with the letter 13 characters down the alphabet (rotating from "z" back to "a" as needed).
This is also known as ROT13 (for "rotate 13").
Latin did not have spaces (and the space is not contiguous with the letters a-z) so we'll remove them from our alphabet.
Again, the goal is to derive a fitness function that compares a single gene against a single gene, without global transformations.
This fitness function assigns higher scores to individuals that correctly decode the target.
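The per-character rotation can be sketched as follows. This is an illustrative helper, assuming a single lowercase letter from "a" to "z" as input; `rot13_char` is a hypothetical name.

```python
def rot13_char(c: str) -> str:
    # Shift by 13 within the 26 lowercase letters, wrapping past "z".
    return chr((ord(c) - ord("a") + 13) % 26 + ord("a"))

shifted = rot13_char("a")                  # 13 letters down from "a"
wrapped = rot13_char("z")                  # wraps around the end of the alphabet
round_trip = rot13_char(rot13_char("g"))   # two shifts of 13 make a full rotation
```

Because 13 is half of 26, applying the shift twice returns the original letter, so the same per-character computation both encodes and decodes.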

**NOTE FOR ME: compare i to i but transform each character to s[i] + 13**
<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not apply ROT13 to an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene.
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual will express the target *decoded*.

In [32]:
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"

In [33]:
target3 = "guvfvffbzhpusha"

<a id="objective_3"></a>
## objective_3

`objective_3` measures the ASCII difference between the decoded `target` and the given phenotype character-by-character and returns the sum of the absolute differences. This objective allows the `genetic_algorithm` to match strings for Problem 3's specifications, where the goal is to match the `target` as decoded by the ROT13 Caesar Cypher. To compute the score, each character in `target` is decoded individually by shifting it 13 places down the alphabet (wrapping around after "z") and converting the result to its ASCII value; a score of 0 means a perfect match. **Used by**: [evaluate](#evaluate), [genetic_algorithm](#genetic_algorithm).

* **target**: `target` string to which the objective measures fitness
* **pheno**: the current individual being scored, a tuple of (phenotype string, list of genes, fitness); only the list of genes is used

**returns** int: the ASCII score computed character-by-character.

In [34]:
def objective_3(target: str, pheno: Tuple[str, List[str], float]) -> int:
    score = 0
    for i, c_tgt in enumerate(target):
        decoded_tgt = ((ord(c_tgt) + 13 - 97) % 26) + 97
        score += abs(decoded_tgt - ord(pheno[1][i]))
    return score

In [35]:
# assertions/unit tests
target = "guvfvffbzhpusha"
pheno = "thisissomuchfun"
actual_score = objective_3(target, (pheno, list(pheno), 0))
assert actual_score == 0

actual_score = objective_3(target, ("thisisaamuchfun", list("thisisaamuchfun"), 0))
assert actual_score == 32

actual_score = objective_3("", ("", (), 0))
assert actual_score == 0

In [36]:
# set up if you need it
params = {"N": 200, "limit": 200, "target": target3, "pc": 0.90, "pm": 0.05, "tourn_size": 7}

In [37]:
result3 = genetic_algorithm(params, ALPHABET3, objective_3) # do what you need to do for your implementation but don't change the lines above or below.

Best individual of generation 10  : ('thjtfstpntbkeun', ['t', 'h', 'j', 't', 'f', 's', 't', 'p', 'n', 't', 'b', 'k', 'e', 'u', 'n'], 0.06666666666666667)
Best individual of generation 20  : ('thjtgsspmtbfeun', ['t', 'h', 'j', 't', 'g', 's', 's', 'p', 'm', 't', 'b', 'f', 'e', 'u', 'n'], 0.09090909090909091)
Best individual of generation 30  : ('thithsspmtcheun', ['t', 'h', 'i', 't', 'h', 's', 's', 'p', 'm', 't', 'c', 'h', 'e', 'u', 'n'], 0.16666666666666666)
Best individual of generation 40  : ('thithsspmtchfun', ['t', 'h', 'i', 't', 'h', 's', 's', 'p', 'm', 't', 'c', 'h', 'f', 'u', 'n'], 0.2)
Best individual of generation 50  : ('thithssomtchfun', ['t', 'h', 'i', 't', 'h', 's', 's', 'o', 'm', 't', 'c', 'h', 'f', 'u', 'n'], 0.25)
Best individual of generation 60  : ('thithssomtchfun', ['t', 'h', 'i', 't', 'h', 's', 's', 'o', 'm', 't', 'c', 'h', 'f', 'u', 'n'], 0.25)
Best individual of generation 70  : ('thishssomtchfun', ['t', 'h', 'i', 's', 'h', 's', 's', 'o', 'm', 't', 'c', 'h', 'f', 

In [38]:
pprint(result3, compact=True)

('thisissomuchfun',
 ['t', 'h', 'i', 's', 'i', 's', 's', 'o', 'm', 'u', 'c', 'h', 'f', 'u', 'n'],
 1.0)
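
The fitness values printed in the generation log above are consistent with mapping the minimization score to `1 / (1 + score)`. The `evaluate` function is not shown in this section, so that mapping is an assumption; `char_distance` and `score_to_fitness` are hypothetical names used only for this check.

```python
def char_distance(decoded: str, candidate: str) -> int:
    """Sum of absolute ASCII differences, position by position."""
    return sum(abs(ord(d) - ord(c)) for d, c in zip(decoded, candidate))


def score_to_fitness(score: int) -> float:
    """Assumed mapping: a score of 0 becomes fitness 1.0, larger scores approach 0."""
    return 1.0 / (1.0 + score)


# Generation 10 individual from the log, compared against the decoded target.
gen10_score = char_distance("thisissomuchfun", "thjtfstpntbkeun")
gen10_fitness = score_to_fitness(gen10_score)
```

Under this assumption the generation 10 individual has a score of 14 and a fitness of 1/15, which matches the 0.0666... reported in the log.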


## Problem 4

There is no code for this problem.

In Problem 3, we assumed we knew what the shift was in ROT-13.
What if we didn't?
Describe how you might solve that problem including a description of the solution encoding (chromosome and interpretation) and fitness function. Assume we can add spaces into the message.

In the case where the shift-value of the cypher is unknown, the simplest solution would be to run the genetic algorithm over and over with varying objective functions. See the below code for the objective function:

In [39]:
def objective_4(target: str, pheno: Tuple[str, List[str], float], shift: int) -> int:
    score = 0
    for i, c_tgt in enumerate(target):
        decoded_tgt = ((ord(c_tgt) + shift - 97) % 26) + 97
        score += abs(decoded_tgt - ord(pheno[1][i]))
    return score

Here, we add another parameter to the function, `shift`, which controls how far a character is shifted when decoding the cypher (1 <= `shift` <= 25 for the 26-letter alphabet used in the code above, or 1 <= `shift` <= 26 once the space is included as a 27th symbol). Running the genetic algorithm in a loop with increasing `shift` values, and keeping track of the absolute best fit, would let us figure out the "most likely" cypher shift.

Additionally, we could re-encode the space character to an ASCII value of 96 to avoid issues with shifting - if we encounter a space, make sure we use 96 instead of 32 as the ASCII value in the decoding (and make sure we subtract/add by 96 in the decoding, as well as use `mod 27`). Then, 'z' would wrap to the space character, and we would be able to properly encode/decode every string.
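
The space re-encoding described above might look like the following sketch for a single character. The name `shift_char_27` is hypothetical, and the sketch assumes the alphabet is exactly the space plus the lowercase letters a-z.

```python
def shift_char_27(c: str, shift: int) -> str:
    """Shift one symbol within a 27-symbol alphabet: space (as 96), then a-z (97..122)."""
    code = 96 if c == " " else ord(c)  # re-encode the space so it sits just before 'a'
    shifted = (code - 96 + shift) % 27 + 96
    return " " if shifted == 96 else chr(shifted)


wrapped = shift_char_27("z", 1)    # 'z' wraps around to the space
after_space = shift_char_27(" ", 1)  # the space shifts forward to 'a'
```

A negative `shift` undoes a positive one, because Python's `%` always returns a non-negative result for a positive modulus.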

The only other change we would have to make would be to run the genetic algorithm in a loop, add a global maximum tracker in the loop along with the shift value it used, and change the objective function prototype to include `shift` - no other changes to the phenotype/genotype would be required. By the end of the loop, we should have a best shift value (which would identify the most accurate cypher shift) as well as the "most correct" string decoding.
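
The loop and global tracker could be sketched as below. This shows only the bookkeeping: `run_for_shift` is a hypothetical wrapper that would call `genetic_algorithm` with `objective_4` bound to one shift value, and `fake_run` is a stand-in with made-up fitness values so the loop can be exercised on its own. The tracker can only separate shifts if the fitness actually differs between them.

```python
from typing import Callable, Iterable, List, Tuple

# (phenotype, genes, fitness), the same shape that pprint shows for the results
Individual = Tuple[str, List[str], float]


def best_over_shifts(run_for_shift: Callable[[int], Individual],
                     shifts: Iterable[int]) -> Tuple[int, Individual]:
    """Run one search per shift and keep the shift whose best individual is fittest."""
    best_shift, best_individual = None, None
    for shift in shifts:
        individual = run_for_shift(shift)
        if best_individual is None or individual[2] > best_individual[2]:
            best_shift, best_individual = shift, individual
    if best_individual is None:
        raise ValueError("no shift values were supplied")
    return best_shift, best_individual


def fake_run(shift: int) -> Individual:
    """Stand-in for a GA run: pretends shift 13 yields the fittest individual."""
    text = "thisissomuchfun"
    return (text, list(text), 1.0 if shift == 13 else 0.5)


best_shift, best_individual = best_over_shifts(fake_run, range(1, 26))
```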

Although this is brute-force and somewhat inefficient, it is an exhaustive way of searching for the best-performing cypher shift, since every possible shift is tried (assuming the genetic algorithm converges on each run).

## Challenge

**You do not need to do this problem and it won't be graded if you do. It's just here if you want to push your understanding.**

The original GA used binary encodings for everything.
We're basically using a Base 27 encoding.
You could, however, write a version of the algorithm that uses an 8 bit encoding for each letter (ignore spaces as they're a bit of a bother).
That is, a 5 letter candidate (40 bits, which decode to "hello" in ASCII) looks like this:

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

If you wrote your `genetic_algorithm` code generally enough, with higher order functions, you should be able to implement it using bit strings instead of Latin strings.
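
As a sketch of how such a bit-string chromosome could be interpreted, the function below reads each group of 8 bits as one ASCII character. `decode_bits` is a hypothetical helper, not part of the assignment's code, and it assumes the most significant bit comes first in each group.

```python
from typing import List


def decode_bits(bits: List[int]) -> str:
    """Interpret each consecutive group of 8 bits as one ASCII character."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    chars = []
    for start in range(0, len(bits), 8):
        byte = bits[start:start + 8]
        value = int("".join(str(b) for b in byte), 2)  # most significant bit first
        chars.append(chr(value))
    return "".join(chars)


# The 40-bit candidate shown above.
candidate_bits = [0, 1, 1, 0, 1, 0, 0, 0,
                  0, 1, 1, 0, 0, 1, 0, 1,
                  0, 1, 1, 0, 1, 1, 0, 0,
                  0, 1, 1, 0, 1, 1, 0, 0,
                  0, 1, 1, 0, 1, 1, 1, 1]
decoded = decode_bits(candidate_bits)  # "hello"
```

With a decoder like this, crossover and mutation can operate on individual bits while the objective still compares decoded characters.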

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.

## Comments - Choice of N

Although the professor mentioned we should be fine with values of `N = 100, limit = 200`, I found that when I ran the algorithm with those parameters, it would sometimes fail to converge to the final string within the number of generations provided. It converged to the right answer about 80% of the time. However, when I increased the value of `N` to 200, the convergence rate was much higher (>90%). Therefore, I thought it would be best to choose the values that converge more reliably rather than those that might not converge in time. Consequently, you might see that the outputs for the problems show a much earlier convergence than the limit, but the choice of `N` was made so that the algorithm performs well over multiple runs, not just a single one.