# Selection on Diploid Genotypes

Deterministic (infinite population) model for exploring how **viability selection** shapes allele frequencies at a single locus.

Each generation:
1. Random mating → Hardy-Weinberg genotype frequencies
2. Viability selection acts on diploid genotypes (each has a fitness *w*)
3. New allele frequencies derived from post-selection genotype frequencies

No drift, no mutation, no migration — **pure selection dynamics**.

We track:
- **Allele frequencies** over time
- **Average excess of fitness**: how much better (or worse) an allele's carriers do compared to the population mean. This is what drives allele frequency change: $\Delta p_i = p_i \cdot a_i / \bar{w}$
- **Genotype frequencies** and **mean fitness** ($\bar{w}$)
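
The three steps of each generation can be sketched in a few lines of NumPy. This is a minimal illustration of the recursion described above, not the code inside `popgen_sim`; the function name `one_generation` and the matrix layout `W[i, j]` are assumptions made for this example.

```python
import numpy as np

def one_generation(p, W):
    """Advance allele frequencies by one generation of viability selection.

    p : 1-D array of allele frequencies (sums to 1)
    W : symmetric matrix, W[i, j] = fitness of genotype (i, j)
    (Hypothetical helper for illustration only.)
    """
    p = np.asarray(p, dtype=float)
    W = np.asarray(W, dtype=float)
    marginal = W @ p               # w_i = sum_j p_j * W[i, j] under random mating
    w_bar = p @ marginal           # mean population fitness
    avg_excess = marginal - w_bar  # a_i = w_i - w_bar
    delta_p = p * avg_excess / w_bar
    return p + delta_p, avg_excess, w_bar

# Two alleles with additive fitnesses (AA = 1.0, Aa = 0.75, aa = 0.5)
W_additive = np.array([[1.0, 0.75],
                       [0.75, 0.5]])
p_next, a, w_bar = one_generation([0.5, 0.5], W_additive)
print(p_next, a, w_bar)
```

Because the average excesses are weighted by allele frequency, the changes `delta_p` sum to zero and the new frequencies still sum to 1.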

In [None]:
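from popgen_sim import *
import matplotlib.pyplot as plt  # used by the plt.show() calls below; imported explicitly in case the star import does not export it
print('Module loaded!')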

---
## Quick Reference

```python
SelectionParams(
    n_alleles=3,                       # number of alleles
    freqs=[0.33, 0.34, 0.33],          # initial allele frequencies (must sum to 1)
    allele_labels=['A1', 'A2', 'A3'],  # optional custom names
    fitness={                           # diploid genotype fitnesses
        'A1A1': 1.0, 'A1A2': 0.9, 'A1A3': 0.85,
                      'A2A2': 0.7, 'A2A3': 0.6,
                                    'A3A3': 0.5,
    },
    n_generations=200,
)
```
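
The triangular layout of the `fitness` dict lists each unordered genotype once, which corresponds to the upper triangle of a symmetric fitness matrix. The sketch below shows that correspondence. It illustrates the data layout only: `fitness_matrix` is a hypothetical helper, and how `popgen_sim` stores fitnesses internally is not shown here.

```python
import numpy as np

def fitness_matrix(labels, fitness):
    """Build a symmetric matrix W from a genotype -> fitness dict.

    Assumes each key is two allele labels concatenated in the order the
    labels are listed (e.g. 'A1A2', never 'A2A1'). Hypothetical helper.
    """
    n = len(labels)
    W = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            # Heterozygote fitness is the same regardless of allele order
            W[i, j] = W[j, i] = fitness[labels[i] + labels[j]]
    return W

labels = ['A1', 'A2', 'A3']
fitness = {
    'A1A1': 1.0, 'A1A2': 0.9, 'A1A3': 0.85,
                 'A2A2': 0.7, 'A2A3': 0.6,
                              'A3A3': 0.5,
}
W = fitness_matrix(labels, fitness)
print(W)
```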

**Functions:**
- `simulate_selection(params)` → `SelectionResult`
- `make_selection_player(result)` → interactive widget (play/pause through generations)
- `plot_selection(result)` → 4-panel static figure
- `plot_selection_trajectory(result)` → ternary simplex plot (3 alleles only)

**SelectionResult fields:**
- `.allele_freqs[gen]` — array of allele frequencies at generation *gen*
- `.avg_excess[gen]` — average excess of fitness ($\bar{w}_i - \bar{w}$) per allele
- `.marginal_fitness[gen]` — marginal fitness $\bar{w}_i$ per allele
- `.w_bar[gen]` — mean population fitness
- `.delta_p[gen]` — per-generation change in allele frequency

---
# Part 1: Two-Allele Models

We start with the classic two-allele, one-locus cases to build intuition for how dominance and fitness relationships determine the dynamics.

In all four cases below, both alleles start at frequency 0.5.

## 1a. Additive Selection

The heterozygote fitness is exactly midway between the two homozygotes. A is favored and will fix.

| Genotype | AA | Aa | aa |
|----------|----|----|----|
| Fitness  | 1.0 | 0.75 | 0.5 |

In [None]:
params_additive = SelectionParams(
    n_alleles=2,
    allele_labels=['A', 'a'],
    freqs=[0.5, 0.5],
    fitness={'AA': 1.0, 'Aa': 0.75, 'aa': 0.5},
    n_generations=100,
)

result_additive = simulate_selection(params_additive)
make_selection_player(result_additive)

In [None]:
plot_selection(result_additive)
plt.show()

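**Notice:** The average excess of A is positive (carriers do better than average) and the average excess of a is negative. As A approaches fixation, the average excess of A shrinks to zero because nearly every individual is AA. The average excess of a stays negative, approaching $w_{Aa} - w_{AA} = -0.25$, but allele frequencies stop changing anyway because $p_a \to 0$: there is no more variation for selection to act on.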
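
As a worked check of the update rule at the starting frequencies ($p = q = 0.5$), using the fitnesses in the table above (hand arithmetic, not simulator output):

$$\bar{w}_A = p\,w_{AA} + q\,w_{Aa} = 0.5(1.0) + 0.5(0.75) = 0.875$$

$$\bar{w}_a = p\,w_{Aa} + q\,w_{aa} = 0.5(0.75) + 0.5(0.5) = 0.625$$

$$\bar{w} = p\,\bar{w}_A + q\,\bar{w}_a = 0.75, \qquad a_A = \bar{w}_A - \bar{w} = 0.125, \qquad a_a = \bar{w}_a - \bar{w} = -0.125$$

$$\Delta p_A = \frac{p\,a_A}{\bar{w}} = \frac{0.5 \times 0.125}{0.75} \approx 0.083$$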

## 1b. Dominant Selection

A is completely dominant: the heterozygote has the same fitness as the AA homozygote.

| Genotype | AA | Aa | aa |
|----------|----|----|----|
| Fitness  | 1.0 | 1.0 | 0.5 |

In [None]:
params_dominant = SelectionParams(
    n_alleles=2,
    allele_labels=['A', 'a'],
    freqs=[0.5, 0.5],
    fitness={'AA': 1.0, 'Aa': 1.0, 'aa': 0.5},
    n_generations=100,
)

result_dominant = simulate_selection(params_dominant)
make_selection_player(result_dominant)

In [None]:
plot_selection(result_dominant)
plt.show()

**Notice:** Selection is very effective early on: when a is common, many a alleles are in aa homozygotes, which are visible to selection. But it slows dramatically as a becomes rare, because the remaining a alleles are mostly hidden in Aa heterozygotes, which have full fitness. This is why it's hard to completely eliminate a recessive deleterious allele.

Compare the average excess trajectory to the additive case — can you see how dominance changes the dynamics?
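
The slowdown can be seen directly from Hardy-Weinberg proportions. The sketch below, with a hypothetical helper `dominant_case`, computes what share of a alleles sit in aa homozygotes and the relative per-generation change in the frequency of a, for the fitnesses in the table above. It is a standalone calculation and does not use `popgen_sim`.

```python
def dominant_case(q, w_dom=1.0, w_rec=0.5):
    """Return (share of a alleles found in aa homozygotes, delta_q / q).

    q is the frequency of the recessive allele a; AA and Aa share fitness
    w_dom, aa has fitness w_rec. Hypothetical helper for illustration.
    """
    p = 1.0 - q
    # aa individuals (frequency q^2) carry two a alleles; Aa (2pq) carry one
    share_in_aa = (2 * q * q) / (2 * q * q + 2 * p * q)  # simplifies to q
    w_A = w_dom                      # A always sits in a full-fitness genotype
    w_a = p * w_dom + q * w_rec      # marginal fitness of a
    w_bar = p * w_A + q * w_a
    rel_change = (w_a - w_bar) / w_bar
    return share_in_aa, rel_change

for q in (0.5, 0.1, 0.01):
    share, rel = dominant_case(q)
    print(f"q={q:<5} share of a in aa={share:.3f}  relative change per generation={rel:.5f}")
```

The share of a alleles exposed in homozygotes equals q itself, so the relative rate of removal falls roughly in proportion to q as the allele becomes rare.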

## 1c. Overdominance (Heterozygote Advantage)

The heterozygote is *more* fit than either homozygote. This produces a **stable, balanced polymorphism**.

| Genotype | AA | Aa | aa |
|----------|----|----|----|
| Fitness  | 0.6 | 1.0 | 0.4 |

In [None]:
params_overdominance = SelectionParams(
    n_alleles=2,
    allele_labels=['A', 'a'],
    freqs=[0.5, 0.5],
    fitness={'AA': 0.6, 'Aa': 1.0, 'aa': 0.4},
    n_generations=200,
)

result_overdominance = simulate_selection(params_overdominance)
make_selection_player(result_overdominance)

In [None]:
plot_selection(result_overdominance)
plt.show()

**Notice:** At equilibrium, the average excess of both alleles is exactly zero — neither allele's carriers do better or worse than the population mean. This is the signature of a balanced polymorphism: selection has no net force in either direction.

The equilibrium frequency depends on the relative fitness costs of the two homozygotes:

$$\hat{p}_A = \frac{w_{Aa} - w_{aa}}{2 w_{Aa} - w_{AA} - w_{aa}} = \frac{1.0 - 0.4}{2(1.0) - 0.6 - 0.4} = 0.6$$

Try different starting frequencies — do you always converge to the same equilibrium?
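
One way to try this without the widget is to iterate the two-allele recursion by hand from several starting points. This is a minimal sketch with a hypothetical function `iterate_two_allele`; it assumes the same fitnesses as the table above and does not use `popgen_sim`.

```python
def iterate_two_allele(p, w_AA, w_Aa, w_aa, n_generations=200):
    """Iterate p' = p * w_A / w_bar and return the final frequency of A."""
    for _ in range(n_generations):
        q = 1.0 - p
        w_A = p * w_AA + q * w_Aa    # marginal fitness of A
        w_a = p * w_Aa + q * w_aa    # marginal fitness of a
        w_bar = p * w_A + q * w_a    # mean fitness
        p = p * w_A / w_bar
    return p

# Overdominance: AA = 0.6, Aa = 1.0, aa = 0.4
finals = {p0: iterate_two_allele(p0, 0.6, 1.0, 0.4) for p0 in (0.01, 0.3, 0.9, 0.99)}
for p0, p_final in finals.items():
    print(f"start p_A = {p0:<5} -> final p_A = {p_final:.4f}")
```

Compare the printed values with $\hat{p}_A$ from the formula above.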

## 1d. Underdominance (Heterozygote Disadvantage)

The heterozygote is *less* fit than either homozygote. This produces an **unstable equilibrium** — the outcome depends on starting frequency.

| Genotype | AA | Aa | aa |
|----------|----|----|----|
| Fitness  | 0.9 | 0.5 | 0.7 |

In [None]:
params_underdominance = SelectionParams(
    n_alleles=2,
    allele_labels=['A', 'a'],
    freqs=[0.6, 0.4],
    fitness={'AA': 0.9, 'Aa': 0.5, 'aa': 0.7},
    n_generations=200,
)

result_underdominance = simulate_selection(params_underdominance)
make_selection_player(result_underdominance)

In [None]:
plot_selection(result_underdominance)
plt.show()

**Try it:** Change `freqs=[0.4, 0.6]` — does the other allele fix instead? Where is the tipping point?

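**Notice** that mean fitness never decreases along the trajectory. The unstable equilibrium sits at the bottom of a fitness valley (the minimum of $\bar{w}$), and selection pushes the population uphill on whichever side of the valley it starts. A population that starts on the wrong side therefore climbs to the lower peak (fixation of a, $\bar{w} = 0.7$) and cannot cross the valley to reach the higher one (fixation of A, $\bar{w} = 0.9$).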
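
To locate the tipping point numerically, one option is to run the recursion from a range of starting frequencies and record which allele ends up fixed. The sketch below assumes the fitnesses from the table above; `final_frequency` is a hypothetical helper, and the 0.5 cutoff used to label the outcome is an arbitrary choice that works here because the recursion has settled near 0 or 1 after 300 generations for these starts.

```python
def final_frequency(p, w_AA=0.9, w_Aa=0.5, w_aa=0.7, n_generations=300):
    """Frequency of A after n_generations of viability selection."""
    for _ in range(n_generations):
        q = 1.0 - p
        w_A = p * w_AA + q * w_Aa
        w_a = p * w_Aa + q * w_aa
        w_bar = p * w_A + q * w_a
        p = p * w_A / w_bar
    return p

for p0 in (0.2, 0.3, 0.4, 0.6):
    outcome = 'A fixes' if final_frequency(p0) > 0.5 else 'a fixes'
    print(f"start p_A = {p0}: {outcome}")
```

Narrowing the grid between the last start where a fixes and the first where A fixes brackets the unstable equilibrium.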

---
# Part 2: Three Alleles — Hemoglobin and Malaria

Now we move to three alleles at the β-globin locus, inspired by the classic malaria story:
- **A** — normal hemoglobin (HbA)
- **S** — sickle hemoglobin (HbS)
- **C** — hemoglobin C (HbC)

The fitness landscape (in a malaria-endemic environment):

| | A | S | C |
|---|---|---|---|
| **A** | 0.9 | 1.0 | 0.9 |
| **S** | 1.0 | 0.2 | 0.7 |
| **C** | 0.9 | 0.7 | 1.3 |

Key features:
- **AS heterozygotes** have high fitness (sickle-cell trait = malaria resistance)
- **SS homozygotes** are strongly deleterious (sickle-cell disease)
- **CC homozygotes** are actually the most fit genotype overall (protection without disease cost)
- But CC requires C to be common enough to form CC homozygotes at appreciable frequency

The critical question: **does the outcome depend on where C starts?**

## 2a. C is Very Rare → C is Lost (Balanced A/S Polymorphism)

Starting frequencies: A = 0.99, S = 0.005, C = 0.005

When C is this rare, almost all C alleles are in AC heterozygotes (fitness 0.9 — same as AA). Meanwhile S benefits enormously from the AS heterozygote advantage (fitness 1.0). What happens to C?
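
The reasoning above can be checked with a short standalone calculation. Under random mating, a given C allele is paired with allele *j* with probability $p_j$, so its marginal fitness is the frequency-weighted average of the C row of the fitness table. The variable names below are chosen for this sketch and are not part of `popgen_sim`.

```python
import numpy as np

labels = ['A', 'S', 'C']
p = np.array([0.99, 0.005, 0.005])        # starting frequencies in this scenario
W = np.array([[0.9, 1.0, 0.9],            # rows/columns ordered A, S, C
              [1.0, 0.2, 0.7],
              [0.9, 0.7, 1.3]])

c = labels.index('C')
for j, label in enumerate(labels):
    # Probability that a C allele ends up in genotype C + label
    print(f"C paired with {label}: probability {p[j]:.3f}, genotype fitness {W[c, j]}")

marginal = W @ p                 # marginal fitness of each allele
w_bar = p @ marginal             # mean fitness
excess_C = marginal[c] - w_bar   # average excess of C
print(f"marginal fitness of C = {marginal[c]:.4f}, mean fitness = {w_bar:.4f}")
print(f"average excess of C   = {excess_C:.6f}")
```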

In [None]:
params_malaria1 = SelectionParams(
    n_alleles=3,
    allele_labels=['A', 'S', 'C'],
    freqs=[0.99, 0.005, 0.005],
    fitness={
        'AA': 0.9,  'AS': 1.0,  'AC': 0.9,
                     'SS': 0.2,  'SC': 0.7,
                                  'CC': 1.3,
    },
    n_generations=500,
)

result_malaria1 = simulate_selection(params_malaria1)

# log_scale=True makes the decline of C easier to see. Because the model assumes an
# infinite population, the frequency of C approaches 0 but never reaches it exactly.
make_selection_player(result_malaria1, log_scale=True)

In [None]:
plot_selection(result_malaria1, log_scale=True)
plt.show()

In [None]:
plot_selection_trajectory(result_malaria1)
plt.show()

**Result:** C is lost — it can't gain traction when rare because its advantage comes from the CC homozygote, which is vanishingly rare at low C frequency. The population settles into the classic A/S balanced polymorphism.

Watch the average excess plot: C's average excess starts near zero (its carriers are mostly in AC heterozygotes, which have the same fitness as AA), then turns negative as S rises and SC heterozygotes (fitness 0.7) drag C's marginal fitness down.
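
A rough invasion check makes the same point. Assuming C is rare enough that CC homozygotes can be ignored, and that A and S sit at their two-allele equilibrium (the overdominance formula from Part 1c applied to the A/S pair), the marginal fitness of C can be compared with the mean fitness of the resident population. This is a back-of-the-envelope sketch, not simulator output; the names `s_hat`, `a_hat`, and `w_C_rare` are introduced only for this example.

```python
w = {'AA': 0.9, 'AS': 1.0, 'AC': 0.9, 'SS': 0.2, 'SC': 0.7}

# Equilibrium frequency of S in a population with only A and S
s_hat = (w['AS'] - w['AA']) / (2 * w['AS'] - w['AA'] - w['SS'])
a_hat = 1.0 - s_hat

# Mean fitness of the resident A/S population at that equilibrium
w_A = a_hat * w['AA'] + s_hat * w['AS']
w_S = a_hat * w['AS'] + s_hat * w['SS']
w_bar = a_hat * w_A + s_hat * w_S

# Marginal fitness of a rare C allele (almost never paired with another C)
w_C_rare = a_hat * w['AC'] + s_hat * w['SC']

print(f"S equilibrium frequency: {s_hat:.4f}")
print(f"resident mean fitness:   {w_bar:.4f}")
print(f"rare-C marginal fitness: {w_C_rare:.4f}")
print("C can invade" if w_C_rare > w_bar else "C cannot invade while rare")
```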

## 2b. C is More Common → C Fixes (Transient Polymorphism)

Starting frequencies: A = 0.95, S = 0.005, C = 0.045

Now C starts at a high enough frequency that CC homozygotes (fitness 1.3!) are no longer negligible. Does this change the outcome?

Note: the actual starting frequency of C in West Africa is unknown, so the timing here (the exact number of generations) should not be expected to match the empirical case.

In [None]:
params_malaria2 = SelectionParams(
    n_alleles=3,
    allele_labels=['A', 'S', 'C'],
    freqs=[0.95, 0.005, 0.045],
    fitness={
        'AA': 0.9,  'AS': 1.0,  'AC': 0.9,
                     'SS': 0.2,  'SC': 0.7,
                                  'CC': 1.3,
    },
    n_generations=500,
)

result_malaria2 = simulate_selection(params_malaria2)
make_selection_player(result_malaria2, log_scale=False)

In [None]:
plot_selection(result_malaria2, log_scale=True)
plt.show()

In [None]:
plot_selection_trajectory(result_malaria2)
plt.show()

**Result:** C sweeps to fixation! Both A and S are eliminated. This is a **transient polymorphism** — the system passes through a temporary A/S/C coexistence before C wins.

The key insight: **the fate of an allele depends not just on its fitness effects but on its frequency.** The CC homozygote is the most fit genotype in this system, but C can only "access" that fitness when it's common enough for CC homozygotes to form at appreciable frequency. The genotype fitnesses themselves are constant here; it is the *marginal* fitness of each allele that depends on frequency, because an allele's fitness is averaged over the genotypes it finds itself in.
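
To see where frequency enters, write out the marginal fitness of C under random mating and plug in the two sets of starting frequencies used above (hand arithmetic, not simulator output):

$$\bar{w}_C = p_A\,w_{AC} + p_S\,w_{SC} + p_C\,w_{CC}$$

$$\text{Scenario 2a: } \bar{w}_C = 0.99(0.9) + 0.005(0.7) + 0.005(1.3) = 0.901$$

$$\text{Scenario 2b: } \bar{w}_C = 0.95(0.9) + 0.005(0.7) + 0.045(1.3) = 0.917$$

Only the last term rewards C for being common, while the middle term penalizes it as S rises.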

## 2c. Compare the Two Scenarios Side by Side

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for col, result, title in [
    (0, result_malaria1, 'C starts rare (0.5%)'),
    (1, result_malaria2, 'C starts higher (4.5%)'),
]:
    labels = result.params.allele_labels
    gens = result.generations
    
    # Top row: allele frequencies
    ax = axes[0, col]
    for i in range(3):
        freqs_i = [af[i] for af in result.allele_freqs]
        color = COLORS_ALLELES[i % len(COLORS_ALLELES)]
        ax.plot(gens, freqs_i, color=color, lw=2, label=labels[i])
    ax.set_xlabel('Generation')
    ax.set_ylabel('Allele frequency')
    ax.set_title(title)
    ax.set_ylim(-0.02, 1.02)
    ax.legend()
    
    # Bottom row: average excess
    ax = axes[1, col]
    for i in range(3):
        ae_i = [ae[i] for ae in result.avg_excess]
        color = COLORS_ALLELES[i % len(COLORS_ALLELES)]
        ax.plot(gens, ae_i, color=color, lw=2, label=labels[i])
    ax.axhline(0, color='gray', ls='--', lw=0.5)
    ax.set_xlabel('Generation')
    ax.set_ylabel('Average excess (w̄ᵢ − w̄)')
    ax.set_title('Average excess: ' + title)
    ax.legend()

fig.suptitle('Same fitness landscape, different starting frequencies → different outcomes', fontsize=13)
plt.tight_layout()
plt.show()

---
## Questions to Explore

1. **Where is the threshold?** At what starting frequency of C does the outcome switch from "C lost" to "C fixed"? Try values between 0.005 and 0.045.

2. **What if CC fitness were lower?** Try `CC: 1.1` or `CC: 0.95`. How does this change both scenarios?

3. **Role of the SC heterozygote:** The SC fitness (0.7) is relatively low. What happens if you raise it to 1.0? Does C still get lost in scenario 1?

4. **Average excess interpretation:** In scenario 1, watch how C's average excess evolves. Can you explain *why* it goes negative even though CC is the most fit genotype?
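
For question 1, scanning many starting frequencies is easier with a plain loop than with the widget. The sketch below is a NumPy-only reimplementation of the recursion, written for this example rather than taken from `popgen_sim`. The helpers `final_freqs` and `c_outcome` are hypothetical, S is held at a starting frequency of 0.005 as in both scenarios, and the 0.5 cutoff is a heuristic: starts very close to the threshold may need more generations to resolve.

```python
import numpy as np

# Fitness matrix with rows/columns ordered A, S, C
W = np.array([[0.9, 1.0, 0.9],
              [1.0, 0.2, 0.7],
              [0.9, 0.7, 1.3]])

def final_freqs(p0, W, n_generations=2000):
    """Iterate p_i' = p_i * w_i / w_bar and return the final frequencies."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_generations):
        marginal = W @ p
        p = p * marginal / (p @ marginal)
    return p

def c_outcome(c0, s0=0.005):
    """Label the long-run fate of C for a given starting frequency c0."""
    p = final_freqs([1.0 - s0 - c0, s0, c0], W)
    return 'C fixed' if p[2] > 0.5 else 'C lost'

for c0 in np.linspace(0.005, 0.045, 9):
    print(f"C starts at {c0:.3f}: {c_outcome(c0)}")
```

Once the switch is bracketed between two grid points, rerun with a finer grid over that interval.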

In [None]:
# Space for your own experiments!

params_explore = SelectionParams(
    n_alleles=3,
    allele_labels=['A', 'S', 'C'],
    freqs=[0.99, 0.005, 0.005],
    fitness={
        'AA': 0.9,  'AS': 1.0,  'AC': 0.9,
                     'SS': 0.2,  'SC': 0.7,
                                  'CC': 1.3,
    },
    n_generations=500,
)

result_explore = simulate_selection(params_explore)
plot_selection(result_explore)
plt.show()