## Andrew Taylor
## atayl136
### en605.645

# Module 3 - Programming Assignment

## General Directions

1. You must follow the Programming Requirements outlined on Canvas.
2. The Notebook should be cleanly and fully executed before submission.
3. You should change the name of this file to be your JHED id. For example, `jsmith299.ipynb` although Canvas will change it to something else...
4. You must follow the Programming Requirements for this course.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You should always read the entire assignment before beginning your work, so that you know in advance what the requested output will be and can plan your implementation accordingly.
    </p>
</div>

<div style="color: white; background: #C83F49; margin:20px; padding: 20px;">
    <strong>Academic Integrity and Copyright</strong>
    <p>You are not permitted to consult outside sources (Stackoverflow, YouTube, ChatGPT, etc.) or use "code assistance" (Co-Pilot, etc) to complete this assignment. By submitting this assignment for grading, you certify that the submission is 100% your own work, based on course materials, group interactions, instructor guidance. You agree to comply by the requirements set forth in the Syllabus, including, by reference, the JHU KSAS/WSE Graduate Academic Misconduct Policy.</p>
    <p>Sharing this assignment either directly (e.g., email, github, homework site) or indirectly (e.g., ChatGPT, machine learning platform) is a violation of the copyright. Additionally, all such sharing is a violation of the Graduate Academic Misconduct Policy (facilitating academic dishonesty is itself academic dishonesty), even after you graduate.</p>
    <p>If you have questions or if you're unsure about the policy, ask via Canvas Inbox. In this case, being forgiven is <strong>not</strong> easier than getting permission and ignorance is not an excuse.</p>
    <p>This assignment is copyright (&copy; Johns Hopkins University &amp; Stephyn G. W. Butcher). All rights reserved.</p>
</div>

In [391]:
from pprint import pprint

## Local Search - Genetic Algorithm

There are some key ideas in the Genetic Algorithm.

First, there is a problem of some kind that either *is* an optimization problem or the solution can be expressed in terms of an optimization problem.
For example, if we wanted to minimize the function

$$f(x) = \sum_{i=1}^{n} (x_i - 0.5)^2$$

where $n = 10$.
This *is* an optimization problem. Normally, optimization problems are much, much harder.

![Eggholder](http://www.sfu.ca/~ssurjano/egg.png)

The function we wish to optimize is often called the **objective function**.
The objective function is closely related to the **fitness** function in the GA.
If we have a **maximization** problem, then we can use the objective function directly as a fitness function.
If we have a **minimization** problem, then we need to convert the objective function into a suitable fitness function, since fitness functions must always mean "more is better".
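
As a minimal sketch of that conversion, assuming the objective above with $n = 10$: one common choice (an assumption here, not something the assignment prescribes) is `1 / (1 + f(x))`, which maps an objective of 0 to a fitness of 1 and shrinks toward 0 as the objective grows. The names `sphere_objective` and `sphere_fitness` are illustrative.

```python
def sphere_objective(xs):
    """Minimization objective: sum of squared distances from 0.5."""
    return sum((x - 0.5) ** 2 for x in xs)


def sphere_fitness(xs):
    """Convert the minimization objective into a 'more is better' score."""
    return 1.0 / (1.0 + sphere_objective(xs))


best = [0.5] * 10   # the minimizer: every term is zero
worse = [1.0] * 10  # every term contributes 0.25

print(sphere_objective(best), sphere_fitness(best))
print(sphere_objective(worse), sphere_fitness(worse))
```

Any strictly decreasing transformation of the objective would also work; this one simply avoids division by zero at the optimum.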

Second, we need to *encode* candidate solutions using an "alphabet" analogous to G, A, T, C in DNA.
This encoding can be quite abstract.
You saw this in the Self Check.
There, a floating point number was encoded as bits, just as in a computer, and a sophisticated decoding scheme was then required.

Sometimes, the encoding need not be very complicated at all.
For example, in the real-valued GA, discussed in the Lectures, we could represent 2.73 as....2.73.
This is similarly true for a string matching problem.
We *could* encode "a" as "a", 97, or '01100001'.
And then "hello" would be:

```
["h", "e", "l", "l", "o"]
```

or

```
[104, 101, 108, 108, 111]
```

or

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

In Genetics terminology, this is the **chromosome** of the individual. And if this individual had the **phenotype** "h" for the first character then they would have the **genotype** for "h" (either as "h", 104, or 01101000).

To keep it straight, think **geno**type is **genes** and **pheno**type is **phenomenon**, the actual thing that the genes express.
So while we might encode a number as 10110110 (genotype), the number itself, 182, is what goes into the fitness function.
The environment operates on zebras, not the genes for stripes.
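
The three encodings of "hello" and the decoding of 10110110 can be reproduced with a short sketch. The helper names `to_codes`, `to_bits`, and `bits_to_int` are hypothetical and exist only for this illustration.

```python
def to_codes(text):
    """Encode each character as its integer code point."""
    return [ord(ch) for ch in text]


def to_bits(text):
    """Encode each character as 8 bits, most significant bit first."""
    return [int(bit) for ch in text for bit in format(ord(ch), "08b")]


def bits_to_int(bits):
    """Decode a list of bits (genotype) into the integer it expresses (phenotype)."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value


print(list("hello"))      # characters as genes
print(to_codes("hello"))  # integers as genes
print(to_bits("h"))       # bits as genes
print(bits_to_int([1, 0, 1, 1, 0, 1, 1, 0]))
```

In each case the fitness function would score the decoded phenotype, not the raw genes.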

## String Matching

You are going to write a Genetic Algorithm that will solve the problem of matching a target string (at least at the start).
Now, this is kind of silly because in order for this to work, you need to know the target string and if you know the target string, why are you trying to do it?
Well, the problem is *pedagogical*.
It's a fun way of visualizing the GA at work, because as the GA finds better and better candidates, they make more and more sense.

Now, string matching is not *directly* an optimization problem so this falls under the general category of "if we convert the problem into an optimization problem we can solve it with an optimization algorithm" approach to problem solving.
This happens all the time.
We have a problem.
We can't solve it.
We convert it to a problem we *can* solve.
In this case, we're using the GA to solve the optimization part.

And all we need is some sort of measure of the difference between two strings.
The only constraint for our objective function is that it must calculate the score based on element to element (character to character) comparisons with no global transformations of the candidate or target strings.
That measure becomes our **objective function** and we can use it with the Genetic Algorithm.

Since it is probably easier to come up with a score that measures how far apart the two strings are, we will end up with an objective function that represents a **minimization problem**.
Because a fitness function must always be "more is better", we'll need to convert our objective function to a proper fitness function as well.
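
One way to sketch this, assuming the distance between two characters is the absolute difference of their positions in the alphabet (the assignment leaves the exact measure open, so this is only one possible choice). The names `string_objective` and `string_fitness` are illustrative.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def string_objective(candidate, target):
    """Sum of per-position alphabet distances; 0 means a perfect match."""
    return sum(abs(ALPHABET.index(c) - ALPHABET.index(t))
               for c, t in zip(candidate, target))


def string_fitness(candidate, target):
    """Turn the minimization objective into a 'more is better' score."""
    return 1.0 / (1.0 + string_objective(candidate, target))


print(string_objective("hello", "hello"))
print(string_objective("hellp", "hello"))
print(string_fitness("hello", "hello") > string_fitness("hellp", "hello"))
```

Each term compares exactly one candidate gene with one target gene, so no global transformation of either string is involved.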

And since this is a GA, we need a **genotype**.
The genotype for this problem is a list of "characters" (individual letters aren't special in Python like they are in some other languages):

```
["h", "e", "l", "l", "o"]
```

and the **phenotype** is the resulting string:

```
"hello"
```

In addition to the generic code and problem specific loss function, you'll need to pick parameters for the run.
These parameters include:

1. population size
2. number of generations
3. probability of crossover
4. probability of mutation

You will also need to pick a selection algorithm, either roulette wheel or tournament selection.
In the latter case, you will need a tournament size.
This is all part of the problem.
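
For reference, tournament selection can be sketched as follows: draw a few individuals at random and keep the fittest. This is a minimal illustration, not the required implementation; the function name `tournament_select` and the sample data are assumptions.

```python
import random


def tournament_select(population, fitnesses, tournament_size):
    """Draw `tournament_size` distinct contenders at random; the fittest wins."""
    contenders = random.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


population = ["aaa", "bbb", "ccc", "ddd"]
fitnesses = [1, 5, 3, 2]

# With a tournament as large as the population, the fittest individual always wins.
print(tournament_select(population, fitnesses, tournament_size=4))
```

A larger tournament size increases selection pressure; a size of 1 reduces to uniform random selection.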

Every **ten** (10) generations, you should print out the fitness, genotype, and phenotype of the best individual in the population for the specific generation.
The function should return the best individual *of the entire run*, using the same format.

In [392]:
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Helper Functions

### `initialize_population(pop_size, string_length)`

Initialize the population with random individuals.

**Args**:
- `pop_size` (*int*): The size of the population.
- `string_length` (*int*): The length of each individual string.

**Returns**:
- A list of individuals.


In [432]:
import random

def initialize_population(pop_size, string_length):
    population = []
    for _ in range(pop_size):
        individual = ''.join(random.choice(ALPHABET) for _ in range(string_length))
        population.append(individual)
    return population



In [458]:
# Unit Tests

# Test 1: Check if the population has the correct size
pop_size = 10
string_length = 5
population = initialize_population(pop_size, string_length)
assert len(population) == pop_size, "Population size is incorrect"

# Test 2: Verify that individuals have the correct string length
for individual in population:
    assert len(individual) == string_length, "Individual length is incorrect"

# Test 3: Ensure all characters in individuals are from ALPHABET
for individual in population:
    for char in individual:
        assert char in ALPHABET, f"Invalid character '{char}' found in individual"

print("All tests passed.")

All tests passed.


### `fitness_function(individual, target)`

Calculate the fitness of an individual.

Fitness is the number of matching characters in the correct positions.

**Args**:
- `individual` (*str*): The individual string.
- `target` (*str*): The target string.

**Returns**:
- The fitness value.


In [434]:
def fitness_function(individual, target):
    fitness = sum(1 for a, b in zip(individual, target) if a == b)
    return fitness


In [459]:
# Unit Tests

# Test 1: Identical strings
individual = "abcdef"
target = "abcdef"
fitness = fitness_function(individual, target)
assert fitness == len(individual), f"Test 1 Failed: Expected fitness {len(individual)}, got {fitness}"

# Test 2: Completely different strings
individual = "abcdef"
target = "ghijkl"
fitness = fitness_function(individual, target)
assert fitness == 0, f"Test 2 Failed: Expected fitness 0, got {fitness}"

# Test 3: Partially matching strings
individual = "abcxyz"
target = "abcmno"
fitness = fitness_function(individual, target)
expected_fitness = 3  
assert fitness == expected_fitness, f"Test 3 Failed: Expected fitness {expected_fitness}, got {fitness}"

print("All tests passed.")

All tests passed.


### `select_parents(population, fitnesses, method='roulette', tournament_size=5)`

Select two parents from the population.

**Args**:
- `population` (*list*): The list of individuals.
- `fitnesses` (*list*): The list of fitness values.
- `method` (*str*): The selection method (`'roulette'` or `'tournament'`).
- `tournament_size` (*int*): The tournament size (if using tournament selection).

**Returns**:
- Two selected parents.


In [436]:
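def select_parents(population, fitnesses, method='roulette', tournament_size=5):
    if method == 'roulette':
        total_fitness = sum(fitnesses)
        if total_fitness == 0:
            selection_probs = [1 / len(population) for _ in fitnesses]
        else:
            selection_probs = [f / total_fitness for f in fitnesses]
        parent1 = random.choices(population, weights=selection_probs, k=1)[0]
        parent2 = random.choices(population, weights=selection_probs, k=1)[0]
    elif method == 'tournament':
        size = min(tournament_size, len(population))
        parents = []
        for _ in range(2):
            # The fittest of `size` randomly drawn contenders wins the tournament.
            contenders = random.sample(range(len(population)), size)
            parents.append(population[max(contenders, key=lambda i: fitnesses[i])])
        parent1, parent2 = parents
    else:
        raise ValueError(f"Unknown selection method: {method!r}")
    return parent1, parent2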



In [475]:
# Unit Tests

# Test 1: 
...

Ellipsis

### `crossover(parent1, parent2, crossover_prob)`

Perform crossover between two parents.

**Args**:
- `parent1` (*str*): The first parent.
- `parent2` (*str*): The second parent.
- `crossover_prob` (*float*): The probability of crossover.

**Returns**:
- Two offspring strings.


In [396]:
def crossover(parent1, parent2, crossover_prob):
    if random.random() < crossover_prob:
        point = random.randint(1, len(parent1)-1)
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
    else:
        child1 = parent1
        child2 = parent2
    return child1, child2


In [463]:
#  Unit Tests

# Set a fixed seed so the random crossover point is reproducible
random.seed(0)

parent1 = "AAAAAAAAAA"
parent2 = "BBBBBBBBBB"

# Test 1: Check that when crossover occurs, children are combinations of parents
crossover_prob = 1.0  
child1, child2 = crossover(parent1, parent2, crossover_prob)
# Ensure that child1 and child2 are different from both parents
assert child1 != parent1 and child1 != parent2, "Child1 should be a combination of both parents"
assert child2 != parent1 and child2 != parent2, "Child2 should be a combination of both parents"

# Test 2: Check that when crossover does not occur, children are exact copies of parents
crossover_prob = 0.0  
child1, child2 = crossover(parent1, parent2, crossover_prob)
assert child1 == parent1, "When no crossover, child1 should be identical to parent1"
assert child2 == parent2, "When no crossover, child2 should be identical to parent2"

# Test 3: Ensure that crossover results in children that are a combination of both parents
crossover_prob = 1.0  
child1, child2 = crossover(parent1, parent2, crossover_prob)
assert child1.startswith("A") and child1.endswith("B"), "Child1 should have parts from both parents"
assert child2.startswith("B") and child2.endswith("A"), "Child2 should have parts from both parents"

print("All tests passed.")

All tests passed.


### `mutate(individual, mutation_prob)`

Mutate an individual.

**Args**:
- `individual` (*str*): The individual to mutate.
- `mutation_prob` (*float*): The probability of mutation for each gene.

**Returns**:
- The mutated individual.


In [397]:
def mutate(individual, mutation_prob):
    individual = list(individual)  # strings are immutable, so work on a list of genes
    for i in range(len(individual)):
        if random.random() < mutation_prob:
            individual[i] = random.choice(ALPHABET)
    return ''.join(individual)

In [440]:
# Unit Tests

# Set a fixed seed
random.seed(0)

# Test 1: Check that mutation does not occur when probability is 0
individual = "AAAAA"
mutation_prob = 0.0  
mutated_individual = mutate(individual, mutation_prob)
assert mutated_individual == individual, "When mutation_prob is 0, individual should remain unchanged"

# Test 2: Check that mutation happens when probability is 1
individual = "AAAAA"
mutation_prob = 1.0 
mutated_individual = mutate(individual, mutation_prob)
assert mutated_individual != individual, "When mutation_prob is 1, individual changes completely"

# Test 3: Ensure mutation occurs only in some places when mutation prob is between 0 and 1
individual = "AAAAA"
mutation_prob = 0.5 
mutated_individual = mutate(individual, mutation_prob)
assert mutated_individual != individual, "When mutation_prob is 0.5, individual should change partially"
assert any(a == b for a, b in zip(individual, mutated_individual)), "Some characters should remain the same"

print("All tests passed.")

All tests passed.


<a id="genetic_algorithm"></a>
### genetic_algorithm

**(**

You can add as many parameters as you need to `genetic_algorithm`.
The documentation should be filled out according to the programming assignment requirements for the course.
You do not need to unit test this function.
Remember that functions should "only do one thing" and may not be more than 20 lines.
Write helper functions with documentation and unit tests.

A complex function and application like this has a lot of interesting potential:

1. There are a lot of parameters. We could reduce those by using a Dictionary.
2. There are a lot of different possible behaviors, including problem specific behaviors. We could use higher order functions.

Beyond these hints, I leave those decisions to you.
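
Both hints can be illustrated with a small sketch: a parameter dictionary that bundles the settings, and a fitness function passed in as an argument so that problem-specific behavior stays outside the generic code. The dictionary keys and the names `count_matches` and `evaluate` are hypothetical; this is one possible arrangement, not a prescribed design.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def count_matches(individual, target):
    """Problem-specific fitness: number of positions that agree."""
    return sum(1 for a, b in zip(individual, target) if a == b)


def evaluate(population, target, fitness_fn):
    """Apply whichever fitness function was passed in to every individual."""
    return [fitness_fn(individual, target) for individual in population]


# Hypothetical parameter dictionary; the keys are illustrative names.
params = {
    "population_size": 4,
    "string_length": 5,
    "fitness_fn": count_matches,
}

random.seed(0)
population = ["".join(random.choice(ALPHABET) for _ in range(params["string_length"]))
              for _ in range(params["population_size"])]
scores = evaluate(population, "hello", params["fitness_fn"])
print(list(zip(population, scores)))
```

Swapping in a different fitness function then only requires changing one dictionary entry.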


*This is very Mission Impossible. After reading the directions in this Markdown cell, when the time is right, remove them  (everything between and including the parentheses) and replace with your documentation for `genetic_algorithm`! I have started you off.*

**)**



## Genetic Algorithm Documentation

Runs the genetic algorithm to match the target string.

**Args**:
- `population_size` (*int*): The size of the population.
- `generations` (*int*): The number of generations.
- `crossover_prob` (*float*): The probability of crossover.
- `mutation_prob` (*float*): The probability of mutation.
- `selection_method` (*str*): The selection method (`'roulette'` or `'tournament'`).
- `tournament_size` (*int*): The tournament size (if using tournament selection).

**Returns**:
- *dict*: The best individual found, with keys `'fitness'`, `'genotype'`, and `'phenotype'`.

In [398]:
def genetic_algorithm(population_size=100, generations=1000, crossover_prob=0.7, mutation_prob=0.01, selection_method='roulette', tournament_size=5):
    target = target1
    string_length = len(target)
    population = initialize_population(population_size, string_length)
    best_fitness = -1
    best_individual = None

    for generation in range(1, generations + 1):
        fitnesses = [fitness_function(ind, target) for ind in population]
        current_best_fitness, current_best_individual = max(zip(fitnesses, population))
        if current_best_fitness > best_fitness:
            best_fitness = current_best_fitness
            best_individual = current_best_individual
        if generation % 10 == 0:
            print(f"Generation {generation}: Best Fitness = {best_fitness}, Genotype = {list(best_individual)}, Phenotype = {best_individual}")
        if best_fitness == string_length: break
        new_population = []
        while len(new_population) < population_size:
            parent1, parent2 = select_parents(population, fitnesses, selection_method, tournament_size)
            child1, child2 = crossover(parent1, parent2, crossover_prob)
            new_population.extend([mutate(child1, mutation_prob), mutate(child2, mutation_prob)])
        population = new_population[:population_size]
    return {'fitness': best_fitness, 'genotype': list(best_individual), 'phenotype': best_individual}

## Problem 1

The target is the string "this is so much fun".
The challenge, aside from implementing the basic algorithm, is deriving a fitness function based on "b" - "p" (for example).
The fitness function should come up with a fitness score based on element to element comparisons between target v. phenotype.

In [399]:
target1 = "this is so much fun"

In [400]:
# set up if you need it.

In [401]:
result1 = genetic_algorithm() # do what you need to do for your implementation but don't change the lines above or below.

Generation 10: Best Fitness = 7, Genotype = ['t', 'v', 'i', 's', 'w', 'j', 'h', 'n', 'h', 'p', 'l', 'm', 'u', 'n', 'h', 'k', 'f', 's', 'c'], Phenotype = tviswjhnhplmunhkfsc
Generation 20: Best Fitness = 10, Genotype = ['t', 'v', 'i', 's', ' ', 'f', 'k', 'k', 'm', 'j', ' ', 'm', 'u', 'h', 'h', 'x', 'f', 'e', 'n'], Phenotype = tvis fkkmj muhhxfen
Generation 30: Best Fitness = 11, Genotype = ['t', 'v', 'i', 's', 'r', 'x', 's', 'g', 'm', 'o', ' ', 'm', 'u', 'n', 'h', 'x', 'f', 'e', 'n'], Phenotype = tvisrxsgmo munhxfen
Generation 40: Best Fitness = 11, Genotype = ['t', 'v', 'i', 's', 'r', 'x', 's', 'g', 'm', 'o', ' ', 'm', 'u', 'n', 'h', 'x', 'f', 'e', 'n'], Phenotype = tvisrxsgmo munhxfen
Generation 50: Best Fitness = 12, Genotype = ['t', 'v', 'i', 's', 'w', 'x', 's', 'g', 'v', 'o', ' ', 'm', 'u', 'c', 'h', 'x', 'f', 's', 'n'], Phenotype = tviswxsgvo muchxfsn
Generation 60: Best Fitness = 13, Genotype = ['t', 'h', 'i', 's', 'w', 'x', 's', 'j', 'b', 'o', ' ', 'm', 'u', 'c', 'h', 'x', 'f', 

In [402]:
pprint(result1, compact=True)

{'fitness': 19,
 'genotype': ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u',
              'c', 'h', ' ', 'f', 'u', 'n'],
 'phenotype': 'this is so much fun'}


*Comment*: The best individual is the target string.

## Problem 2

You should have working code now.
The goal here is to think a bit more about fitness functions.
The target string is now 'nuf hcum os si siht'.
This is obviously target #1 but reversed.
If we just wanted to match the string, this would be trivial.
Instead, in this problem, we want to "decode" the string so that the best individual displays the target forwards.
In order to do this, you'll need to come up with a fitness function that measures how successful candidates are towards this goal.
The constraint is that you may not perform any global operations on the target or individuals.
Your fitness function must still compare a single gene against a single gene.
Your solution will likely not be Pythonic but use indexing.
That's ok.
<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not reverse an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene (one letter against one letter).
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual in the population is the one who expresses this string *forwards*.

"this is so much fun"

In [403]:
target2 = "nuf hcum os si siht"

# Modification: Fitness Function

Since I cannot reverse the entire string, I reversed the indexing inside the fitness function instead: the candidate gene at index `i` is compared with the target gene at index `length - i - 1`.

In [467]:
def fitness_function(individual, target):
    fitness = 0
    length = len(target)
    for i in range(length):
        if individual[i] == target[length - i - 1]:
            fitness += 1
    return fitness

In [471]:
# Unit Tests

# Test 1: Perfect match when individual is the reverse
individual = "abcde"
target = "edcba"
fitness = fitness_function(individual, target)
assert fitness == len(individual), f"Expected fitness {len(individual)}, but got {fitness}"

# Test 2: No match when individual does not match the reverse
individual = "abcde"
target = "vwxyz"
fitness = fitness_function(individual, target)
assert fitness == 0, f"Expected fitness 0, but got {fitness}"

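# Test 3: Partial match against the reversed target
individual = "abcxy"
target = "edcba"
fitness = fitness_function(individual, target)
assert fitness == 3, f"Expected fitness 3, but got {fitness}"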


print("All tests passed.")

All tests passed.


In [428]:
# set up if you need it.

`genetic_algorithm`

- Changed the target string

In [429]:
def genetic_algorithm(population_size=100, generations=1000, crossover_prob=0.7, mutation_prob=0.01, selection_method='roulette', tournament_size=5):
    target = target2
    string_length = len(target)
    population = initialize_population(population_size, string_length)
    best_fitness = -1
    best_individual = None

    for generation in range(1, generations + 1):
        fitnesses = [fitness_function(ind, target) for ind in population]
        current_best_fitness, current_best_individual = max(zip(fitnesses, population))
        if current_best_fitness > best_fitness:
            best_fitness = current_best_fitness
            best_individual = current_best_individual
        if generation % 10 == 0:
            print(f"Generation {generation}: Best Fitness = {best_fitness}, Genotype = {list(best_individual)}, Phenotype = {best_individual}")
        if best_fitness == string_length: break
        new_population = []
        while len(new_population) < population_size:
            parent1, parent2 = select_parents(population, fitnesses, selection_method, tournament_size)
            child1, child2 = crossover(parent1, parent2, crossover_prob)
            new_population.extend([mutate(child1, mutation_prob), mutate(child2, mutation_prob)])
        population = new_population[:population_size]
    return {'fitness': best_fitness, 'genotype': list(best_individual), 'phenotype': best_individual}

In [430]:
result2 = genetic_algorithm() # do what you need to do for your implementation but don't change the lines above or below.

Generation 10: Best Fitness = 9, Genotype = ['q', 'x', 'f', ' ', 'h', 'c', 'p', 'm', 't', 'l', 's', ' ', 'c', 'x', 'l', 's', 'v', 'x', 't'], Phenotype = qxf hcpmtls cxlsvxt
Generation 20: Best Fitness = 10, Genotype = ['q', 'x', 'f', ' ', 'h', 'c', 'b', 'm', 'q', 'y', 'q', ' ', 'w', 'w', ' ', 's', 'i', ' ', 't'], Phenotype = qxf hcbmqyq ww si t
Generation 30: Best Fitness = 11, Genotype = ['q', 'x', 'f', ' ', 'h', 'c', 'b', 'm', 'w', 'y', 's', 'q', 's', 'w', ' ', 's', 'u', 'h', 't'], Phenotype = qxf hcbmwysqsw suht
Generation 40: Best Fitness = 13, Genotype = ['n', 'x', 'f', ' ', 'h', 'c', 'b', 'm', 'w', 'i', 's', ' ', 's', 'w', ' ', 's', 'u', 'h', 't'], Phenotype = nxf hcbmwis sw suht
Generation 50: Best Fitness = 13, Genotype = ['n', 'x', 'f', ' ', 'h', 'c', 'b', 'm', 'w', 'i', 's', ' ', 's', 'w', ' ', 's', 'u', 'h', 't'], Phenotype = nxf hcbmwis sw suht
Generation 60: Best Fitness = 13, Genotype = ['n', 'x', 'f', ' ', 'h', 'c', 'b', 'm', 'w', 'i', 's', ' ', 's', 'w', ' ', 's', 'u', 

In [407]:
pprint(result2, compact=True)

{'fitness': 19,
 'genotype': ['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 's', 'o', ' ', 'm', 'u',
              'c', 'h', ' ', 'f', 'u', 'n'],
 'phenotype': 'this is so much fun'}


*Comment*: If I understand correctly, this is the best individual, which expresses `target2` *forwards*.

## Problem 3

This is a variation on the theme of Problem 2.
The Caesar Cipher replaces each letter of a string with the letter 13 characters down the alphabet (rotating from "z" back to "a" as needed).
This is also known as ROT13 (for "rotate 13").
Latin did not have spaces (and the space is not contiguous with the letters a-z) so we'll remove them from our alphabet.
Again, the goal is to derive a fitness function that compares a single gene against a single gene, without global transformations.
This fitness function assigns higher scores to individuals that correctly decode the target.
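
As a small worked example of the per-gene rotation, assuming a 26-letter lowercase alphabet (the names `LETTERS` and `rot13_gene` are illustrative): because 13 is half of 26, rotating a letter twice returns the original letter, so the same single-gene operation both encodes and decodes.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def rot13_gene(gene):
    """Rotate a single lowercase letter 13 places, wrapping from 'z' back to 'a'."""
    return LETTERS[(LETTERS.index(gene) + 13) % 26]


print(rot13_gene("a"))
print(rot13_gene("z"))
print(rot13_gene(rot13_gene("t")))
```

A fitness function can therefore rotate one candidate gene at a time and compare it with the corresponding target gene, without ever transforming a whole string.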

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not apply ROT13 to an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene.
        Failure to follow these directions will result in 0 points for the problem.
    </p>
</div>

The best individual will express the target *decoded*.

"thisissomuchfun"

## Helper Functions Documentation

### `encode_rot13(text)`

Encode a string using ROT13.

**Args**:
- `text` (*str*): The original text.

**Returns**:
- The ROT13-encoded text.


In [408]:
def encode_rot13(text):
    letters = ALPHABET[:26]  # only a-z rotate; a trailing space in ALPHABET is skipped
    encoded = ''
    for char in text:
        if char in letters:
            index = letters.index(char)
            rot13_index = (index + 13) % 26
            encoded += letters[rot13_index]
    return encoded


In [442]:
# Unit Tests

# Test 1: Encode a simple string
text = "abc"
encoded_text = encode_rot13(text)
expected_text = "nop"
assert encoded_text == expected_text, f"Expected '{expected_text}', but got '{encoded_text}'"

# Test 2: Encode a string with letters from the end of the alphabet
text = "xyz"
encoded_text = encode_rot13(text)
expected_text = "klm"
assert encoded_text == expected_text, f"Expected '{expected_text}', but got '{encoded_text}'"

# Test 3: Encode a whole word
text = "hello"
encoded_text = encode_rot13(text)
expected_text = "uryyb"
assert encoded_text == expected_text, f"Expected '{expected_text}', but got '{encoded_text}'"

print("All tests passed.")

All tests passed.


### `fitness_function(individual, target)`

Calculate the fitness of an individual.

Fitness is the number of matching characters when ROT13 is applied to each gene
of the candidate and compared to the corresponding gene in the target string.

**Args**:
- `individual` (*str*): The individual string.
- `target` (*str*): The target string (ROT13 encoded).

**Returns**:
- The fitness value.


In [409]:
def fitness_function(individual, target):
    letters = ALPHABET[:26]  # only a-z take part in ROT13, even if ALPHABET still ends with a space
    fitness = 0
    for i in range(len(target)):
        candidate_char = individual[i]
        target_char = target[i]
        if candidate_char in letters:
            # Rotate this single gene by 13 and compare it with the target gene.
            index = letters.index(candidate_char)
            rot13_index = (index + 13) % 26
            candidate_rot13_char = letters[rot13_index]
            if candidate_rot13_char == target_char:
                fitness += 1
    return fitness


In [447]:
# Unit Tests



`initialize_population(pop_size, string_length)`

- updated 

### `initialize_population(pop_size, string_length)`

Initialize the population with random individuals.

**Args**:
- `pop_size` (*int*): The size of the population.
- `string_length` (*int*): The length of each individual string.

**Returns**:
- A list of individuals (strings).


In [410]:
def initialize_population(pop_size, string_length):
    population = []
    for _ in range(pop_size):
        individual = ''.join(random.choice(ALPHABET) for _ in range(string_length))
        population.append(individual)
    return population


In [472]:
# Unit Tests

# Test 1: Check that the population has the correct size
pop_size = 10
string_length = 5
population = initialize_population(pop_size, string_length)
assert len(population) == pop_size, f"Expected population size {pop_size}, but got {len(population)}"

# Test 2: Check that all individuals have the correct string length
for individual in population:
    assert len(individual) == string_length, f"Expected individual length {string_length}, but got {len(individual)}"

# Test 3: Ensure that all characters in the individuals are from ALPHABET
for individual in population:
    for char in individual:
        assert char in ALPHABET, f"Character '{char}' is not in ALPHABET"

print("All tests passed.")

All tests passed.


`mutate(individual, mutation_prob)`

- Updated 

### `mutate(individual, mutation_prob)`

Mutate an individual.

**Args**:
- `individual` (*str*): The individual to mutate.
- `mutation_prob` (*float*): The probability of mutation for each gene.

**Returns**:
- The mutated individual.


In [411]:
def mutate(individual, mutation_prob):
    individual = list(individual)
    for i in range(len(individual)):
        if random.random() < mutation_prob:
            individual[i] = random.choice(ALPHABET)
    return ''.join(individual)


In [451]:
# Unit Tests

# Test 1: Ensure no mutation occurs when mutation probability is 0
individual = "hello"
mutation_prob = 0.0 
mutated_individual = mutate(individual, mutation_prob)
assert mutated_individual == individual, f"Expected individual to remain unchanged, but got {mutated_individual}"

# Test 2: Ensure mutation happens when mutation probability is 1
mutation_prob = 1.0 
mutated_individual = mutate(individual, mutation_prob)
assert mutated_individual != individual, "Expected individual to be completely mutated"

# Test 3: Ensure some mutation occurs when mutation probability is between 0 and 1
mutation_prob = 0.5  
mutated_individual = mutate(individual, mutation_prob)
assert mutated_individual != individual, "Expected individual to be partially mutated"
assert any(a == b for a, b in zip(individual, mutated_individual)), "Expected some characters to remain unchanged"

print("All tests passed.")

All tests passed.


### `crossover(parent1, parent2, crossover_prob)`

Perform crossover between two parents.

**Args**:
- `parent1` (*str*): The first parent.
- `parent2` (*str*): The second parent.
- `crossover_prob` (*float*): The probability of crossover.

**Returns**:
- Two offspring strings.


In [412]:
def crossover(parent1, parent2, crossover_prob):
    if random.random() < crossover_prob:
        point = random.randint(1, len(parent1)-1)
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
    else:
        child1 = parent1
        child2 = parent2
    return child1, child2


In [453]:
#  Unit Tests

# Set a fixed seed
random.seed(0)

# Test 1: Ensure no crossover occurs when crossover probability is 0
parent1 = "AAAAA"
parent2 = "BBBBB"
crossover_prob = 0.0  
child1, child2 = crossover(parent1, parent2, crossover_prob)
assert child1 == parent1 and child2 == parent2, "Expected no crossover, but children are different from parents"

# Test 2: Ensure crossover occurs when crossover probability is 1
crossover_prob = 1.0  
child1, child2 = crossover(parent1, parent2, crossover_prob)
assert child1 != parent1 and child2 != parent2, "Expected crossover to occur, but children are identical to parents"

# Test 3: Ensure children are combinations of both parents when crossover happens
crossover_prob = 1.0 
child1, child2 = crossover(parent1, parent2, crossover_prob)
assert child1.startswith("A") and child1.endswith("B"), "Child1 should be a combination of parent1 and parent2"
assert child2.startswith("B") and child2.endswith("A"), "Child2 should be a combination of parent2 and parent1"

print("All tests passed.")

All tests passed.


### `select_parents(population, fitnesses, method='roulette', tournament_size=5)`

Select two parents from the population.

**Args**:
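- `population` (*list*): The list of individuals.
- `fitnesses` (*list*): The list of fitness values.
- `method` (*str*): The selection method (`'roulette'` or `'tournament'`).
- `tournament_size` (*int*): The tournament size (if using tournament selection).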

**Returns**:
- Two selected parents.


In [473]:
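def select_parents(population, fitnesses, method='roulette', tournament_size=5):
    if method == 'roulette':
        total_fitness = sum(fitnesses)
        if total_fitness == 0:
            selection_probs = [1 / len(population) for _ in fitnesses]
        else:
            selection_probs = [f / total_fitness for f in fitnesses]
        parent1 = random.choices(population, weights=selection_probs, k=1)[0]
        parent2 = random.choices(population, weights=selection_probs, k=1)[0]
    elif method == 'tournament':
        size = min(tournament_size, len(population))
        parents = []
        for _ in range(2):
            # The fittest of `size` randomly drawn contenders wins the tournament.
            contenders = random.sample(range(len(population)), size)
            parents.append(population[max(contenders, key=lambda i: fitnesses[i])])
        parent1, parent2 = parents
    else:
        raise ValueError(f"Unknown selection method: {method!r}")
    return parent1, parent2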


In [474]:
# Unit Tests

# Set a seed
random.seed(0)

# Sample population and fitness
population = ['A', 'B', 'C', 'D', 'E']
fitnesses = [10, 20, 30, 40, 50]

# Test 1: Ensure that parents are selected from population
parent1, parent2 = select_parents(population, fitnesses, method='roulette')
assert parent1 in population and parent2 in population, "Parents must be selected from the population"

# Test 2: Ensure that roulette selection favors individuals with higher fitness
selected = [select_parents(population, fitnesses, method='roulette')[0] for _ in range(1000)]
assert selected.count('E') > selected.count('A'), "Roulette selection should favor individuals with higher fitness"

# Test 3: Check selection when all fitnesses are zero
zero_fitnesses = [0, 0, 0, 0, 0] 
parent1, parent2 = select_parents(population, zero_fitnesses, method='roulette')
assert parent1 in population and parent2 in population, "Parents must be selected even when fitnesses are zero"

print("All tests passed.")

All tests passed.
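
The signature accepts a `tournament_size`, which suggests tournament selection as an alternative to roulette. The sketch below is one possible version, not the notebook's implementation: `tournament_select` is a hypothetical helper, and sampling contestants with replacement is an assumption made to keep it short.

```python
import random

def tournament_select(population, fitnesses, tournament_size=5, rng=random):
    """Pick one parent: sample contestants (with replacement) and keep the fittest."""
    contestants = rng.choices(range(len(population)), k=tournament_size)
    winner = max(contestants, key=lambda i: fitnesses[i])
    return population[winner]

# Usage: fitter individuals should win far more tournaments than weak ones.
rng = random.Random(0)
population = ['A', 'B', 'C', 'D', 'E']
fitnesses = [10, 20, 30, 40, 50]
picks = [tournament_select(population, fitnesses, tournament_size=3, rng=rng)
         for _ in range(1000)]
counts = {ind: picks.count(ind) for ind in population}
print(counts)
```

A larger `tournament_size` increases selection pressure, since the weakest individual can only win when every contestant drawn is that individual.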


# `genetic_algorithm` Documentation

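Evolve a population of random strings toward the global `target3` and return the best individual found.

**Args**:
- `population_size` : The number of individuals in each generation.
- `generations` : The maximum number of generations to run.
- `crossover_prob` : The probability of crossover.
- `mutation_prob` : The mutation probability passed to `mutate`.
- `selection_method` : The selection method passed to `select_parents`.
- `tournament_size` : The tournament size passed to `select_parents`.

**Returns**:
- A dictionary with the best `fitness`, `genotype`, and `phenotype`. The search stops early once the best fitness equals the length of the target.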

In [414]:
def genetic_algorithm(population_size=100, generations=1000, crossover_prob=0.7, mutation_prob=0.01, selection_method='roulette', tournament_size=5):
    target = target3  
    string_length = len(target)
    population = initialize_population(population_size, string_length)
    best_fitness = -1
    best_individual = None

    for generation in range(1, generations + 1):
        fitnesses = [fitness_function(ind, target) for ind in population]
        current_best_fitness, current_best_individual = max(zip(fitnesses, population))
        if current_best_fitness > best_fitness:
            best_fitness = current_best_fitness
            best_individual = current_best_individual
            
        if generation % 10 == 0:
            print(f"Generation {generation}: Best Fitness = {best_fitness}, Genotype = {list(best_individual)}, Phenotype = {best_individual}")
        if best_fitness == string_length:
            break
        new_population = []
        while len(new_population) < population_size:
            parent1, parent2 = select_parents(population, fitnesses, selection_method, tournament_size)
            child1, child2 = crossover(parent1, parent2, crossover_prob)
            new_population.extend([mutate(child1, mutation_prob), mutate(child2, mutation_prob)])
        population = new_population[:population_size]
    return {'fitness': best_fitness, 'genotype': list(best_individual), 'phenotype': best_individual}


In [415]:
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"

In [416]:
target3 = "guvfvffbzhpusha"

In [417]:
# set up if you need it

In [418]:
result3 = genetic_algorithm() # do what you need to do for your implementation but don't change the lines above or below.

Generation 10: Best Fitness = 8, Genotype = ['t', 'c', 'i', 's', 'u', 'u', 's', 'j', 'm', 'g', 'c', 's', 'f', 'u', 'd'], Phenotype = tcisuusjmgcsfud
Generation 20: Best Fitness = 10, Genotype = ['t', 'a', 'i', 's', 'i', 'f', 's', 'o', 'm', 'g', 'c', 'h', 'g', 'u', 'q'], Phenotype = taisifsomgchguq
Generation 30: Best Fitness = 11, Genotype = ['t', 'b', 'i', 's', 'u', 'a', 's', 'o', 'm', 'u', 'c', 'h', 'f', 'u', 'q'], Phenotype = tbisuasomuchfuq
Generation 40: Best Fitness = 12, Genotype = ['t', 'm', 'i', 's', 'i', 'd', 's', 'o', 'm', 'u', 'c', 'h', 'f', 'u', 'q'], Phenotype = tmisidsomuchfuq
Generation 50: Best Fitness = 12, Genotype = ['t', 'm', 'i', 's', 'i', 'd', 's', 'o', 'm', 'u', 'c', 'h', 'f', 'u', 'q'], Phenotype = tmisidsomuchfuq
Generation 60: Best Fitness = 12, Genotype = ['t', 'm', 'i', 's', 'i', 'd', 's', 'o', 'm', 'u', 'c', 'h', 'f', 'u', 'q'], Phenotype = tmisidsomuchfuq
Generation 70: Best Fitness = 12, Genotype = ['t', 'm', 'i', 's', 'i', 'd', 's', 'o', 'm', 'u', 'c', 

In [419]:
pprint(result3, compact=True)

{'fitness': 15,
 'genotype': ['t', 'h', 'i', 's', 'i', 's', 's', 'o', 'm', 'u', 'c', 'h', 'f',
              'u', 'n'],
 'phenotype': 'thisissomuchfun'}


Comment: The phenotype of this individual is `target3` decoded from ROT-13.
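
As a quick check of that comment, the sketch below applies ROT-13 to the reported phenotype and compares the result with `target3`. The `rot13` helper is an assumption written for this illustration; it is not the notebook's fitness function.

```python
import string

ALPHABET = string.ascii_lowercase
# ROT-13 maps each letter to the one 13 places later, wrapping around.
ROT13 = str.maketrans(ALPHABET, ALPHABET[13:] + ALPHABET[:13])

def rot13(text):
    """Apply ROT-13 to lowercase text (hypothetical helper for this check)."""
    return text.translate(ROT13)

print(rot13("thisissomuchfun"))
```

Since 13 is half of 26, applying `rot13` twice returns the original string, so the same function both encodes and decodes.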

## Problem 4

There is no code for this problem.

In Problem 3, we assumed we knew what the shift was in ROT-13.
What if we didn't?
Describe how you might solve that problem including a description of the solution encoding (chromosome and interpretation) and fitness function. Assume we can add spaces into the message.

# Problem 4 Solution

In this problem, the aim is to decode a message encrypted with a Caesar cipher when the shift value is unknown. We can use a genetic algorithm to search for the shift that deciphers the message into readable English text. Spaces are included in the message, so the alphabet consists of the 26 letters plus the space, for 27 symbols in total.

## Solution Encoding

### Chromosome Representation

- **Individual**: Represents a possible shift value for the cipher.
- **Gene**: Each chromosome has a single gene representing the shift.
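- **Shift Range**: The gene can take integer values from 0 to 26, since a shift of 27 wraps all the way around the 27-symbol alphabet and is the same as a shift of 0.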

### Interpretation

- **Shift Value**: The gene's integer value is used to shift characters backwards to decode the message.
- **Decoding**: Each character in the message is shifted backward by the shift value. If the shift goes past 'a', it wraps around to the end of the alphabet.
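
The interpretation step can be sketched as follows. This is a minimal illustration, assuming the space is placed as the 27th symbol after `z`; the names `ALPHABET27`, `decode`, and `encode` are hypothetical.

```python
ALPHABET27 = "abcdefghijklmnopqrstuvwxyz "  # assumption: space is the 27th symbol

def decode(message, shift):
    """Shift every character backward by `shift`, wrapping around the alphabet."""
    n = len(ALPHABET27)
    return "".join(ALPHABET27[(ALPHABET27.index(ch) - shift) % n] for ch in message)

def encode(message, shift):
    """Shift every character forward by `shift` (the inverse of decode)."""
    return decode(message, -shift)

secret = encode("this is fun", 5)
print(repr(secret), "->", repr(decode(secret, 5)))
```

The modulo operation handles the wrap-around: shifting 'a' back by one lands on the last symbol of the alphabet, which under this assumption is the space.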

## Fitness Function

To evaluate how close a decoded message is to English, we define a fitness function that gives higher scores to messages resembling valid sentences.

1. **Word Matching**:
   - **Word Count**: Count the number of valid English words in the decoded message by comparing to a dictionary.
   - **Fitness Score**: The more valid words, the higher the fitness.

2. **Letter Frequency**:
   - **Frequency Comparison**: Compare the letter frequencies in the decoded message to typical English letter frequencies.
   - **Fitness Score**: Calculate a score based on how closely the frequencies match.

3. **Combined Fitness**:
   - **Weighted**: Combine the scores from dictionary matching and letter frequency analysis as a weighted sum.
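
One way the combined fitness could look is sketched below. Everything here is an assumption made for illustration: the tiny `WORDS` set stands in for a real dictionary, `ENGLISH_FREQ` holds rough percentages for a few common letters, and the weights 0.8 and 0.2 are arbitrary.

```python
from collections import Counter

# Assumption: a tiny stand-in word list; a real run would load a full dictionary.
WORDS = {"this", "is", "so", "much", "fun", "the", "a", "and"}

# Assumption: approximate English frequencies (percent) for a few common letters.
ENGLISH_FREQ = {"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5,
                "i": 7.0, "n": 6.7, "s": 6.3, "h": 6.1}

def word_score(text):
    """Fraction of space-separated tokens that are known words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(tok in WORDS for tok in tokens) / len(tokens)

def frequency_score(text):
    """Score in (0, 1]; larger means letter frequencies are closer to English."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    counts = Counter(letters)
    error = sum(abs(100 * counts[ch] / len(letters) - pct)
                for ch, pct in ENGLISH_FREQ.items())
    return 1.0 / (1.0 + error)

def fitness(text, word_weight=0.8, freq_weight=0.2):
    """Weighted sum of the dictionary score and the letter-frequency score."""
    return word_weight * word_score(text) + freq_weight * frequency_score(text)

print(fitness("this is so much fun"), fitness("wklv lv vr pxfk ixq"))
```

The word score does most of the work for short messages; the frequency score mainly helps rank candidates in which no complete words have appeared yet.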

## Conclusion

By representing shift values as chromosomes and using a fitness function that rewards messages resembling good English, we can use a genetic algorithm to decode a Caesar cipher without knowing the shift. This works by evaluating and evolving possible shifts based on how closely the decoded messages match valid language patterns.



## Challenge

**You do not need to do this problem and it won't be graded if you do. It's just here if you want to push your understanding.**

The original GA used binary encodings for everything.
We're basically using a Base 27 encoding.
You could, however, write a version of the algorithm that uses an 8 bit encoding for each letter (ignore spaces as they're a bit of a bother).
That is, a 5-letter candidate (40 bits, here the word "hello") looks like this:

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

If you wrote your `genetic_algorithm` code generally enough, with higher-order functions, you should be able to implement it using bit strings instead of Latin strings.
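
As a rough sketch of the 8-bit encoding, the helpers below convert between strings and flat bit lists using ASCII codes. The names `to_bits` and `from_bits` are hypothetical; the 40-bit list shown above corresponds to the ASCII codes of "hello".

```python
def to_bits(text):
    """Encode each character as 8 bits (its ASCII code), flattened into one list."""
    return [int(bit) for ch in text for bit in format(ord(ch), "08b")]

def from_bits(bits):
    """Decode a flat bit list back into a string, 8 bits per character."""
    chars = []
    for i in range(0, len(bits), 8):
        byte = "".join(str(b) for b in bits[i:i + 8])
        chars.append(chr(int(byte, 2)))
    return "".join(chars)

genotype = to_bits("hello")
print(genotype)
print(from_bits(genotype))
```

With this encoding the genotype is the bit list and the phenotype is the decoded string, so crossover and mutation can operate on bits while fitness is still computed on letters.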

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.

In [420]:
bin(1)[2:].rjust(3, '0')

'001'

In [421]:
# Decoding: each group of 3 bits is one integer.
# [0, 0, 1, 0, 1, 0, 1, 1, 1] => [1, 2, 7]

In [None]:
# Encoding: [4, 2, 5] => [1, 0, 0, 0, 1, 0, 1, 0, 1]
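
The two mappings noted above can be written as a pair of small functions. This is a sketch with hypothetical names (`bits_to_ints`, `ints_to_bits`), assuming a fixed width of 3 bits per integer as in the notes.

```python
def bits_to_ints(bits, width=3):
    """Decode a flat bit list into integers, `width` bits per integer."""
    return [int("".join(map(str, bits[i:i + width])), 2)
            for i in range(0, len(bits), width)]

def ints_to_bits(values, width=3):
    """Encode integers as a flat bit list, zero-padded to `width` bits each."""
    return [int(b) for v in values for b in bin(v)[2:].rjust(width, "0")]

print(bits_to_ints([0, 0, 1, 0, 1, 0, 1, 1, 1]))
print(ints_to_bits([4, 2, 5]))
```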

In [None]:
population = [
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 0, 1]
]

In [None]:
population = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 1, 0]
]

0. do we crossover? rand() < 0.9
1. pick a locus/index at random...
2. cross

```
mom    010111   011001000
dad    101001   011110110

child1 010111   011110110
child2 101001   011001000
```
0. do we mutate? rand() < 0.05 if TRUE
1. pick a locus/index at random... 
2. pick a symbol from the alphabet at random.. [0, 1]

```
child1 010111010110110
child2 101001011001000
```
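
The mutation steps above can be sketched for bit lists as follows. This is one reading of the notes, not the notebook's `mutate` function: `mutate_bits` is a hypothetical name, and it changes at most one locus per call.

```python
import random

def mutate_bits(bits, mutation_prob=0.05, rng=random):
    """With probability mutation_prob, overwrite one random locus with a random symbol."""
    child = list(bits)  # copy so the parent is left untouched
    if rng.random() < mutation_prob:
        locus = rng.randrange(len(child))   # 1. pick a locus at random
        child[locus] = rng.choice([0, 1])   # 2. pick a symbol from the alphabet
    return child

parent = [0, 1, 0, 1, 1, 1]
print(mutate_bits(parent, mutation_prob=1.0, rng=random.Random(1)))
```

Because the new symbol is drawn from the whole alphabet, it can match the old one, so a "mutation" leaves the individual unchanged about half the time for a binary alphabet.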


In [None]:
ord("a")

In [None]:
chr(97)

In [None]:
# ["h", "e", "l", "l", "o"] -> genotype
# "hello" -> phenotype

In [None]:
# ["b", "r", "e", "a", "d"] ---> target
#
# F(target) -> f(genotype) -> score
#
# ["r", "l", "e", "k", "m"] ---> genotype
#
# f(genotype) -> score