# HSC
A Markov process with a fixed-size population of k types, where type 0 is the wild type with growth rate `B0`.

A cell can acquire a mutation conferring a proliferative advantage upon cell division. We model this process as a Bernoulli trial with success probability `u`, in units of mutations per division. For the symmetric division case, `u = MU0 / (B0 * NCELLS)`.
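
As a rough numerical illustration of this formula, the sketch below computes `u` and draws one Bernoulli trial per division for one year of divisions. The values of `MU0`, `B0` and `NCELLS` mirror those set later in this notebook. The helper `draw_fit_mutations` is hypothetical and uses Python's standard `random` module; it is not how the Rust simulator is implemented.

```python
import random

MU0 = 2  # fit mutations arising per year in the whole population
B0 = 1  # wild-type division rate [division / (year * cell)]
NCELLS = 200_000  # fixed population size

# probability that a single division produces a fit mutation
u = MU0 / (B0 * NCELLS)


def draw_fit_mutations(divisions: int, prob: float, seed: int = 0) -> int:
    """Count successes among `divisions` independent Bernoulli trials (hypothetical helper)."""
    rng = random.Random(seed)
    return sum(rng.random() < prob for _ in range(divisions))


# B0 * NCELLS divisions take place in one year, so MU0 successes are expected on average
divisions_per_year = B0 * NCELLS
print(f"u = {u}")
print("fit mutations in one simulated year:", draw_fit_mutations(divisions_per_year, u))
```

With these values, the expected number of fit mutations per year is `u * B0 * NCELLS = MU0`.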

For now, all clones of type k > 0 have the same proliferative advantage.

**Entropy:** based on the code they [developed](https://github.com/emily-mitchell/normal_haematopoiesis/blob/23d221e8d125d78c1e8bcbe05d41d0f3594b0cfb/4_phylogeny_analysis/scripts/shannon_diversity.Rmd#L147), I think they define entropy as described [here](http://math.bu.edu/people/mkon/J6A.pdf) using the phylogenetic tree.
We instead compute the entropy from the number of cells: a class is the set of cells with the same number of mutations, and we compute the abundance of each class.
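
A minimal sketch of this class-based entropy, assuming the Shannon form with natural logarithm (the log base is an assumption, it is not stated above). The function name `burden_entropy` is hypothetical and is not part of `hscpy`.

```python
import math
from collections import Counter
from typing import Iterable


def burden_entropy(mutations_per_cell: Iterable[int]) -> float:
    """Shannon entropy (natural log) of the classes of cells sharing the same burden."""
    # one class per distinct number of mutations, with its abundance in cells
    counts = Counter(mutations_per_cell)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one cell")
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# toy example: two classes (3 and 4 mutations) with two cells each
print(burden_entropy([3, 3, 4, 4]))
```

Two equally abundant classes give an entropy of ln 2, and a population in which all cells carry the same number of mutations gives zero.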

## How to use it
Install Python 3.11 or later, then install `seaborn`, `scipy`, `pandas` and `ipykernel` with pip.
Then install `futils` and `hscpy` in editable mode.
Finally, on the cluster, make this environment available as an IPython kernel.
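
One way to check that the environment matches these requirements before running the notebook is sketched below. The helper names `python_is_recent_enough` and `missing_packages` are hypothetical; the check only tests whether each package can be found, not which version is installed.

```python
import importlib.util
import sys

# packages listed in the installation steps above
REQUIRED = ("seaborn", "scipy", "pandas", "ipykernel", "futils", "hscpy")


def python_is_recent_enough(minimum=(3, 11)) -> bool:
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_packages(names=REQUIRED) -> list:
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("python ok:", python_is_recent_enough())
print("missing packages:", missing_packages())
```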

In [None]:
%%bash
cd ../hsc/
git pull
cargo b --release

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import socket
import seaborn as sns
import sys

from pathlib import Path

from hscpy.figures import burden as burden_figures
from hscpy.figures import sfs as sfs_figures
from hscpy.figures import variant as variant_figures
from hscpy.figures import options
from hscpy.sfs import compute_variants, Correction
from hscpy import burden, get_idx_timepoint_from_age

from futils import parse_version

PATH2BIN = Path("~").expanduser() / "hsc/target/release"
assert PATH2BIN.is_dir()

YEARS = 100
YEARS_ENTROPY = 1
RUNS = 64
NB_TIMEPOINTS = 20
DETECTION_THRESH = 0.01
SUBCLONES = 60
USE_SCRATCH = True
mitchell_ages = (0, 29, 38, 48, 63, 75, 81)

SAVE = True
BIGLABELS = False
FIGSIZE = [5, 3] if BIGLABELS else [6.4, 4.8]  # default matplotlib
PDF = True
EXTENSION = ".pdf" if PDF else ".png"
NAIVE = True # TODO remove
PLOT_OPTIONS = options.PlotOptions(figsize=FIGSIZE, extension=EXTENSION, save=SAVE)

In [None]:
if socket.gethostname() == "5X9ZYD3":
    PATH2SIMS = Path("/mnt/c/Users/terenz01/Documents/SwitchDrive/PhD/hsc")
elif socket.gethostname() == "LAPTOP-CEKCHJ4C":
    PATH2SIMS = Path("/mnt/c/Users/fra_t/Documents/PhD/hsc")
else:
    PATH2SIMS = Path("~").expanduser()

PATH2SIMS /= Path("totalVariantFracTime.csv")
assert PATH2SIMS.is_file(), f"cannot find totalVariantFracTime.csv from {PATH2SIMS}"

In [None]:
%%bash -s "$PATH2BIN" --out version
$1/hsc --version

In [None]:
VERSION = parse_version(version)
if USE_SCRATCH:
    if NAIVE:
        # PATH2SAVE = Path(f"/data/scratch/hfx923/hsc-draft/naive/{VERSION}")
        # todo fix regression in naive: job arrays do not take the version
        PATH2SAVE = Path(f"/data/scratch/hfx923/hsc-draft/naive/")
    else:
        PATH2SAVE = Path(f"/data/scratch/hfx923/hsc-draft/{VERSION}")
else:
    PATH2SAVE = Path(f"./{VERSION}")

print("Running hsc with version:", VERSION)

## Mitchell's data

In [None]:
summary = pd.read_csv(PATH2SIMS.parent / "Summary_cut.csv", index_col=0)
summary.cell_type = summary.cell_type.astype("category")
summary.sample_type = summary.sample_type.astype("category")
summary.sort_values(by="age", inplace=True)
summary.reset_index(inplace=True)
ages = summary.age.unique()
closest_age = dict.fromkeys(ages)
# neglect some duplicated colonies e.g. summary.colony_ID == "11_E07"
summary = summary.merge(
    summary[["donor_id", "age"]]
    .groupby("donor_id")
    .count()
    .reset_index()
    .rename(columns={"age": "cells"}),
    on="donor_id",
    validate="many_to_one",
    how="left",
)
summary.dtypes

In [None]:
print(summary.describe())
print(f"\n\ncell types: \n{summary.cell_type.value_counts()}")
print(f"\n\nsample types: \n{summary.sample_type.value_counts()}")
print(f"\n\ntimepoints: \n{summary.timepoint.value_counts()}")
print(
    f'\n\nages and cells: \n{summary[["donor_id", "cells", "age"]].drop_duplicates()}'
)
print(
    f'\n\nmutations per donor: \n{summary[["donor_id", "number_mutations"]].groupby("donor_id").sum()}'
)

In [None]:
for i in summary.donor_id.unique():
    fig, ax = plt.subplots(1, 1, tight_layout=True, figsize=(6, 4))
    sns.histplot(
        data=summary[summary.donor_id == i],
        x="number_mutations",
        hue="donor_id",
        kde=True,
        bins=50,
        ax=ax,
        stat="count",
    )
    if SAVE:
        plt.savefig(f"./{i}_burden{EXTENSION}")
    plt.show()

In [None]:
descr = summary.loc[summary.age == 0, ["donor_id", "number_mutations"]].groupby("donor_id").describe()
descr

In [None]:
descr[("number_mutations", "mean")].mean()/ ( 2 * np.log(200_000 - 2))

In [None]:
descr[("number_mutations", "std")]**2

## Simulations

In [None]:
SAMPLE_STRONG = int(summary.cells.mean())
SAMPLE_WEAK =  SAMPLE_STRONG * 10
NCELLS = 200_000
# mean of the Bernoulli trial (prob of success) to get an asymmetric
# division upon cell division, units are [1 asymmetric division / division]
P_ASYMMETRIC = 0

# division rate for the wild-type in units of [division / (year * cell)]
# Welch, J.S. et al. (2012) ‘The Origin and Evolution of Mutations in Acute Myeloid Leukemia’,
# Cell, 150(2), pp. 264–278
B0 = 1  # TODO: double check this, should be between 2 and 20?

## NEUTRAL RATES with units [mut/(year * cell)]
# 
# Abascal, F. et al. (2021) ‘Somatic mutation landscapes at single-molecule resolution’,
# Nature, 593(7859), pp. 405–410. fig. 2b
# see also fig 1b of Mitchell, E. et al.
# (2022) ‘Clonal dynamics of haematopoiesis across the human lifespan’,
# Nature, 606(7913), pp. 343–350
# Mitchell's mutation rate estimate is 16.8
# we disentangle the mutation rate into a background and a division
# component such that their sum is 16.8
MU_BACKGROUND = 15.5
MU_DIVISION = 1.3
MU_EXP = 1.3  

## FIT CLONES
# 
# avg fit mutations arising in 1 year, units are [mutations/year]
# from ABC's inference
MU0 = 2
# proliferative advantage conferred by fit mutations, all clones
# have the same proliferative advantage for now. Units are
# [mutation / division]
S = 0.11
# mean of the Bernoulli trial (prob of success) to get a fit variant upon
# cell division, units are [1 mutation/division]
u = MU0 / (B0 * NCELLS)
# should be 2.0 × 10^-3 per HSC per year according to Mitchell, E. et al.
# (2022) ‘Clonal dynamics of haematopoiesis across the human lifespan’,
# Nature, 606(7913), pp. 343–350:
# driver mutations enter the HSC compartment at 2.0 × 10^-3 per HSC per year
print(f"average success rate of occurrence of 1 fit mutation upon cell division u={u}")

### Positive selection

#### Rust simulations

We run the simulations with and without subsampling at the same time.
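
To illustrate what subsampling does to the site frequency spectrum (SFS), here is a toy sketch with cells represented as sets of variant ids. The helpers `sfs_from_cells` and `subsample_cells` are hypothetical and only mimic the idea; the actual subsampling happens inside the Rust binary and may differ.

```python
import random
from collections import Counter
from typing import Dict, List, Set


def sfs_from_cells(cells: List[Set[int]]) -> Dict[int, int]:
    """Map j -> number of variants carried by exactly j cells."""
    cells_per_variant = Counter(variant for cell in cells for variant in cell)
    return dict(Counter(cells_per_variant.values()))


def subsample_cells(cells: List[Set[int]], sample_size: int, seed: int = 0) -> List[Set[int]]:
    """Draw `sample_size` cells without replacement."""
    rng = random.Random(seed)
    return rng.sample(cells, sample_size)


# toy population of 100 cells: variant 0 is shared by all cells, the others are private
population = [{0, i} for i in range(1, 101)]
full = sfs_from_cells(population)
sub = sfs_from_cells(subsample_cells(population, 10))
print("population SFS:", full)
print("sample SFS:", sub)
```

In this toy case the clonal variant is found in every sampled cell, so its frequency class moves from 100 cells to 10 cells, while only the private variants of the sampled cells remain.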

In [None]:
sim_options_population = options.SimulationOptions(
    runs=RUNS,
    cells=NCELLS,
    sample=NCELLS,
    path2save=PATH2SAVE,
    neutral_rate=MU_BACKGROUND,
    nb_timepoints=NB_TIMEPOINTS,
    last_timepoint_years=YEARS,
    nb_subclones=SUBCLONES,
    s=S,
)

sim_options_subsampling_strong = options.SimulationOptions(
    runs=RUNS,
    cells=NCELLS,
    sample=SAMPLE_STRONG,
    path2save=PATH2SAVE,
    neutral_rate=MU_BACKGROUND,
    nb_timepoints=NB_TIMEPOINTS,
    last_timepoint_years=YEARS,
    nb_subclones=SUBCLONES,
    s=S,
)

sim_options_subsampling_weak = options.SimulationOptions(
    runs=RUNS,
    cells=NCELLS,
    sample=SAMPLE_WEAK,
    path2save=PATH2SAVE,
    neutral_rate=MU_BACKGROUND,
    nb_timepoints=NB_TIMEPOINTS,
    last_timepoint_years=YEARS,
    nb_subclones=SUBCLONES,
    s=S,
)

In [None]:
%%bash -s "$PATH2BIN" "$sim_options_population.path2save" "$B0" "$MU0" "$sim_options_population.neutral_rate" "$sim_options_population.s" "$P_ASYMMETRIC" "$sim_options_population.runs" "$sim_options_population.cells" "$YEARS" "$sim_options_population.nb_timepoints" "$sim_options_subsampling_strong.sample" "$sim_options_subsampling_weak.sample" "$YEARS_ENTROPY"
rm -rf $2
$1/hsc -c $9 -y ${10} -r $8 --b0 $3 --mu0 $4 --neutral-rate $5 --p-asymmetric $7 --snapshot-entropy ${14} --subsample ${12} ${13} --snapshots ${11} --mean-std 0.1 0.03 --exponential $2

In [None]:
for i, f in enumerate((PATH2SAVE / "rates/").iterdir()):
    fig, ax = plt.subplots(1, 1)
    pd.read_csv(f, header=None).squeeze().plot(kind="hist", ax=ax, bins=15)
    ax.set_xlim(0.95, 1.25)  # TODO?
    ax.set_title(f"simulation id: {i}")
    plt.show()

In [None]:
donors = sfs_figures.donors_from_mitchell(summary, sim_options_population)

In [None]:
fig, ax = plt.subplots(1, 1, tight_layout=True, figsize=FIGSIZE)
sns.histplot(
    data=summary,
    x="number_mutations",
    hue="donor_id",
    kde=True,
    binwidth=10,
    ax=ax,
    stat="percent",
)
sns.move_legend(ax, bbox_to_anchor=(1.01, 1), loc="upper left", frameon=False)
if SAVE:
    plt.savefig(f"./mitchell_burden{EXTENSION}")
plt.show()

# for later
mean_mutations = (
    summary[["donor_id", "number_mutations"]]
    .groupby("donor_id")
    .mean()
    .reset_index()
    .merge(
        summary[["donor_id", "age"]].drop_duplicates(),
        on="donor_id",
        how="inner",
        validate="one_to_one",
    )
    .sort_values(by="age")
)

# fit only the neutral ones (without clonal exp)
x = mean_mutations[mean_mutations.age < 70].age.to_numpy()
y = mean_mutations[mean_mutations.age < 70].number_mutations.to_numpy()
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]

In [None]:
%%time
# compute the correction for the SFS with sampled distributions from
# https://www.biorxiv.org/content/10.1101/2022.11.07.515470v2
corrected_variants_one_over_1_squared = dict()
for donor in donors:
    print(
        f"apply sampling correction to SFS of donor {donor.name} with age {donor.age} mapped to closest_age {donor.closest_age}"
    )
    corrected_variants_one_over_1_squared[donor.name] = compute_variants(
        Correction.ONE_OVER_F_SQUARED,
        pop_size=sim_options_subsampling_strong.cells,
        sample_size=donor.cells,
    )

In [None]:
age

#### no subsampling

In [None]:
my_sfs = dict()
age_simulations = np.linspace(0, sim_options_population.last_timepoint_years, sim_options_population.nb_timepoints)[::-1]

for i in range(1, sim_options_population.nb_timepoints + 1):
    my_sfs[i] = dict()
    for file in (sim_options_population.path2save / f"{sim_options_population.sample}cells/sfs/{i}").iterdir():
        my_sfs[i][file.stem] = burden.load(file)

In [None]:
donor2id = {ele.donor_id: get_idx_timepoint_from_age(ele.age, years=sim_options_population.last_timepoint_years, nb_timepoints=sim_options_population.nb_timepoints, verbosity=False)[0] for ele in summary[["donor_id", "age"]].drop_duplicates().itertuples()}
donor2id

In [None]:
from hscpy.figures import mitchell
from hscpy import sfs
from typing import Dict

In [None]:
def plot_sfs_patient(ax, donor: sfs_figures.Donor, path2mitchell: Path, remove_indels: bool, normalise: bool, options: options.PlotOptions, **kwargs):
    if remove_indels:
        filtered_matrix = mitchell.filter_mutations(
            *mitchell.load_patient(
                donor.name,
                path2mitchell / f"mutMatrix{donor.name}.csv",
                path2mitchell / f"mutType{donor.name}.csv",
            )
        )
    else:
        filtered_matrix = mitchell.load_patient(
            donor.name,
            path2mitchell / f"mutMatrix{donor.name}.csv",
            path2mitchell / f"mutType{donor.name}.csv",
        )[0]

    sfs_donor = filtered_matrix.sum(axis=1).value_counts(normalize=True)
    sfs_donor.drop(index=sfs_donor[sfs_donor.index == 0].index, inplace=True)
    x_sfs = sfs_donor.index.to_numpy(dtype=int)
    
    my_sfs = {x: y for x,y in zip(x_sfs, sfs_donor.to_numpy())}
    return plot_sfs(ax, my_sfs, normalise, options, **kwargs)

In [None]:
def plot_sfs_correction(ax, correction: sfs.CorrectedVariants, cells: int, normalise: bool, options: options.PlotOptions, **kwargs):
    x = correction.frequencies[: cells]
    f_sampled = correction.corrected_variants
    my_sfs = {xx: f for xx, f in zip(x, f_sampled)}
    return plot_sfs(ax, my_sfs, normalise, options, **kwargs)

In [None]:
def plot_sfs_sim_with_id(ax, my_sfs: sfs.Sfs, sims: options.SimulationOptions, normalise: bool, options: options.PlotOptions, **kwargs):
    return plot_sfs(ax, my_sfs, normalise=normalise, options=options, **kwargs)

In [None]:
def plot_sfs(ax, my_sfs: sfs.Sfs, normalise: bool, options: options.PlotOptions, **kwargs):
    # always define `jmuts`, otherwise `normalise=False` raises a NameError
    jmuts = list(my_sfs.values())
    if normalise:
        max_ = max(jmuts)
        jmuts = [ele / max_ for ele in jmuts]
    jcells = list(my_sfs.keys())
    ax.plot(jcells, jmuts, **kwargs)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_ylabel("density of variants with j cells", fontsize="x-large")
    ax.set_xlabel("nb of j cells", fontsize="x-large")
    ax.tick_params(axis='both', which='both', labelsize=14)
    return ax

In [None]:
def plot_sfs_avg(ax, my_sfs: Dict[str, sfs.Sfs], sims: options.SimulationOptions, age: int, options: options.PlotOptions, **kwargs):
    pooled = burden.pooled_burden(my_sfs)
    ax = plot_sfs(ax, pooled, normalise=True, options=options, **kwargs)
    return ax

In [None]:
options_plot = options.PlotOptions(figsize=(6,4), extension=EXTENSION, save=SAVE)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
ax = plot_sfs_sim_with_id(
    ax, 
    my_sfs[donor2id['CB001']]['560'], 
    sim_options_population,normalise=True, 
    options=options_plot,
    color="yellowgreen", marker="d", linestyle="", 
    alpha=0.6, label="1 run with id 560"
)
ax = plot_sfs_avg(
    ax, 
    my_sfs[donor2id['CB001']], sim_options_population, age=0, options=options_plot,
    color="blue", marker=".", linestyle="", label=f"avg of {sim_options_population.runs} runs", alpha=0.6
)
ax = plot_sfs_correction(
    ax, corrected_variants_one_over_1_squared['CB001'], sim_options_population.sample, normalise=True, options=options_plot,
    color="grey", label=r"$1/f^2$ sampled"
)
ax = plot_sfs_patient(
    ax, donors[0], PATH2SIMS.parent, remove_indels=True, normalise=True, options=options_plot,
    color="purple", label=f"{donors[0].name}, age: {donors[0].age:.0f}", marker="x", linestyle=""
)
ax.legend()
plt.show()

In [None]:
for donor_id, idx in donor2id.items():
    # timepoint `idx` is 1-based and maps to `age_simulations[idx - 1]`
    age = age_simulations[idx - 1]
    fig, ax = plt.subplots(1, 1)
    pooled = burden.pooled_burden(my_sfs[idx])
    jmuts = list(pooled.values())
    max_ = max(jmuts)
    ax.plot(list(pooled.keys()), [ele / max_ for ele in jmuts], linestyle="", marker=".")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_ylabel("density of variants with j cells", fontsize="x-large")
    ax.set_xlabel("nb of j cells", fontsize="x-large")
    ax.tick_params(axis='both', which='both', labelsize=14)
    ax.set_title(f"SFS computed from {RUNS} runs for age {age:.0f}")
    plt.tight_layout()
    if SAVE:
        plt.savefig(f"SFS_{sim_options_population.sample}cells_{int(round(age, 0))}age{EXTENSION}")
    plt.show()

In [None]:
my_burden = dict()
for i in range(1, sim_options_population.nb_timepoints + 1):
    my_burden[i] = burden.load_burden(sim_options_population.path2save, sim_options_population.runs, sim_options_population.sample, timepoint=i)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=FIGSIZE)
ax.plot(
    summary["age"], summary["number_mutations"], linestyle="", marker="o", 
    alpha=0.4, label="Mitchell's", mew=2,
)
ax.plot(mean_mutations.age, mean_mutations.number_mutations, "x", c="orange", label="avg Mitchell's", mew=2)
ax.plot(age_simulations, m * age_simulations + c, linestyle="--", c="orange", label=f"fit Mitchell's, m={m:.1f}, c={c:.1f}")

means = []
for idx, age in enumerate(age_simulations, 1):
    # plot only the snvs
    pooled = burden.pooled_burden(my_burden[idx])
    mean = burden.compute_mean_variance(pooled)[0]
    means.append(mean)
    snvs = [k for k, i in pooled.items() if i > 0.0]
    ax.plot([age] * len(snvs), snvs, "o", c="grey", alpha=0.01, label="sims" if idx == 1 else None)
ax.plot(age_simulations, means, "v", label=f"avg from {sim_options_population.runs} sims", c="yellowgreen", mew=2)

A_sims = np.vstack([age_simulations, np.ones(len(age_simulations))]).T
m_sims, c_sims = np.linalg.lstsq(A_sims, means, rcond=None)[0]
ax.plot(age_simulations, m_sims * age_simulations + c_sims, linestyle="--", c="yellowgreen", 
        label=f"fit sim's, m={m_sims:.1f}, c={c_sims:.1f}")

ax.set_xlabel("age [years]", fontsize="xx-large")
ax.set_ylabel("number of SNVs", fontsize="xx-large")
ax.tick_params(axis='both', which='major', labelsize=15)
leg = ax.legend(prop={'size': 11}, fancybox=False)
for lh in leg.legend_handles: 
    lh.set_alpha(0.6)
fig.tight_layout()
if SAVE:
    plt.savefig(f"burden{EXTENSION}")
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, tight_layout=True, figsize=(6, 4))
sns.histplot(
    data=summary[summary.age == 0],
    x="number_mutations",
    hue="donor_id",
    kde=True,
    bins=50,
    ax=ax,
    stat="density",
)
pooled = burden.pooled_burden(my_burden[len(age_simulations)])
ax.bar(x=list(pooled.keys()), height=list(pooled.values()), width=1, color="purple", alpha=0.3, edgecolor="black", align="edge", label=f"{RUNS} sims")
mean = sum([k * v for k, v in pooled.items()])
ax.vlines(x=mean, ymin=0, ymax=0.06, color="purple", linestyle="--")
ax.set_xlabel("number of SNVs", fontsize="xx-large")
ax.set_ylabel("density", fontsize="xx-large")
ax.tick_params(axis='both', which='major', labelsize=15)
ax.legend(["CB001", "CB002", f"simulation's mean={mean:.2f}"], prop={'size': 12}, fancybox=False)
if SAVE:
    plt.savefig(f"burden_year0{EXTENSION}")
plt.show()

In [None]:
variant_figures.show_variant_plots(
    sim_options_population, PLOT_OPTIONS, PATH2SIMS, DETECTION_THRESH
)

#### weak subsampling

In [None]:
%%time
# load simulated sfs for all the ages of the donors present in the data
sfs_age_simulations = sfs_figures.load_sfs_simulations(
    donors, sim_options_subsampling_weak
)

In [None]:
my_sfs = dict()
for i in range(1, sim_options_subsampling_weak.nb_timepoints + 1):
    my_sfs[i] = dict()
    for file in (sim_options_subsampling_weak.path2save / f"{sim_options_subsampling_weak.sample}cells/sfs/{i}").iterdir():
        my_sfs[i][file.stem] = burden.load(file)

In [None]:
for idx, age in enumerate(age_simulations, 1):
    fig, ax = plt.subplots(1, 1)
    pooled = burden.pooled_burden(my_sfs[idx])
    jmuts = list(pooled.values())
    max_ = max(jmuts)
    ax.plot(list(pooled.keys()), [ele / max_ for ele in jmuts], linestyle="", marker=".")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_ylabel("density of variants with j cells", fontsize="x-large")
    ax.set_xlabel("nb of j cells", fontsize="x-large")
    ax.tick_params(axis='both', which='both', labelsize=14)
    ax.set_title(f"SFS computed from {RUNS} runs for age {age:.0f}")
    plt.tight_layout()
    if SAVE:
        plt.savefig(f"SFS_{sim_options_subsampling_weak.sample}cells_{int(round(age, 0))}age{EXTENSION}")
    plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
ax = plot_sfs_sim_with_id(
    ax, 
    my_sfs[donor2id['CB001']]['560'], 
    sim_options_subsampling_weak,normalise=True, 
    options=options_plot,
    color="yellowgreen", marker="d", linestyle="", 
    alpha=0.6, label="1 run with id 560"
)
ax = plot_sfs_avg(
    ax, 
    my_sfs[donor2id['CB001']], sim_options_subsampling_weak, age=0, options=options_plot,
    color="blue", marker=".", linestyle="", label=f"avg of {sim_options_subsampling_weak.runs} runs", alpha=0.6
)
ax = plot_sfs_correction(
    ax, corrected_variants_one_over_1_squared['CB001'], sim_options_subsampling_weak.sample, normalise=True, options=options_plot,
    color="grey", label=r"$1/f^2$ sampled"
)
ax = plot_sfs_patient(
    ax, donors[0], PATH2SIMS.parent, remove_indels=True, normalise=True, options=options_plot,
    color="purple", label=f"{donors[0].name}, age: {donors[0].age:.0f}", marker="x", linestyle=""
)
ax.legend()
plt.show()

In [None]:
my_burden = dict()
for i in range(1, sim_options_subsampling_weak.nb_timepoints + 1):
    my_burden[i] = burden.load_burden(sim_options_subsampling_weak.path2save, sim_options_subsampling_weak.runs, sim_options_subsampling_weak.sample, timepoint=i)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=FIGSIZE)
ax.plot(
    summary["age"], summary["number_mutations"], linestyle="", marker="o", 
    alpha=0.4, label="Mitchell's", mew=2,
)
ax.plot(x, y, "x", c="orange", label="avg Mitchell's", mew=2)
age_simulations = np.linspace(0, sim_options_subsampling_weak.last_timepoint_years, sim_options_subsampling_weak.nb_timepoints)[::-1]
ax.plot(age_simulations, m * age_simulations + c, linestyle="--", c="orange", label=f"fit m={m:.1f}, c={c:.1f}")

means = []
for idx, age in enumerate(age_simulations, 1):
    # plot only the snvs
    pooled = burden.pooled_burden(my_burden[idx])
    mean = burden.compute_mean_variance(pooled)[0]
    means.append(mean)
    snvs = [k for k, i in pooled.items() if i > 0.0]
    ax.plot([age] * len(snvs), snvs, "o", c="grey", alpha=0.01, label="simulations" if idx == 1 else None)
ax.plot(age_simulations, means, "v", label=f"avg from {sim_options_subsampling_weak.runs} runs", c="yellowgreen", mew=2)

A_sims = np.vstack([age_simulations, np.ones(len(age_simulations))]).T
m_sims, c_sims = np.linalg.lstsq(A_sims, means, rcond=None)[0]
ax.plot(age_simulations, m_sims * age_simulations + c_sims, linestyle="--", c="yellowgreen", 
        label=f"fit m={m_sims:.1f}, c={c_sims:.1f}")

ax.set_xlabel("age [years]", fontsize="xx-large")
ax.set_ylabel("number of SNVs", fontsize="xx-large")
ax.tick_params(axis='both', which='major', labelsize=15)
leg = ax.legend(prop={'size': 13}, fancybox=False)
for lh in leg.legend_handles: 
    lh.set_alpha(0.6)
fig.tight_layout()
if SAVE:
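    # include the sample size so the non-subsampled burden figure is not overwritten
    plt.savefig(f"burden_{sim_options_subsampling_weak.sample}cells{EXTENSION}")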
plt.show()

#### strong subsampling

In [None]:
%%time
# load simulated sfs for all the ages of the donors present in the data
sfs_age_simulations = sfs_figures.load_sfs_simulations(
    donors, sim_options_subsampling_strong
)

In [None]:
my_sfs = dict()
for i in range(1, sim_options_subsampling_strong.nb_timepoints + 1):
    my_sfs[i] = dict()
    for file in (sim_options_subsampling_strong.path2save / f"{sim_options_subsampling_strong.sample}cells/sfs/{i}").iterdir():
        my_sfs[i][file.stem] = burden.load(file)

In [None]:
for idx, age in enumerate(age_simulations, 1):
    fig, ax = plt.subplots(1, 1)
    pooled = burden.pooled_burden(my_sfs[idx])
    jmuts = list(pooled.values())
    max_ = max(jmuts)
    ax.plot(list(pooled.keys()), [ele / max_ for ele in jmuts], linestyle="", marker=".")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_ylabel("density of variants with j cells", fontsize="x-large")
    ax.set_xlabel("nb of j cells", fontsize="x-large")
    ax.tick_params(axis='both', which='both', labelsize=14)
    ax.set_title(f"SFS computed from {RUNS} runs for age {age:.0f}")
    plt.tight_layout()
    if SAVE:
        plt.savefig(f"SFS_{sim_options_subsampling_strong.sample}cells_{int(round(age, 0))}age{EXTENSION}")
    plt.show()

In [None]:
donors

In [None]:
fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
donor = donors[7]
ax = plot_sfs_patient(
    ax, donor, PATH2SIMS.parent, remove_indels=False, normalise=True, options=options_plot,
    color="purple", label=f"{donor.name}, age: {donor.age:.0f}", marker="x", linestyle=""
)
ax.legend()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
donor = donors[6]
ax = plot_sfs_patient(
    ax, donor, PATH2SIMS.parent, remove_indels=False, normalise=True, options=options_plot,
    color="purple", label=f"{donor.name}, age: {donor.age:.0f}", marker="x", linestyle=""
)
ax.legend()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
donor = donors[6]
ax = plot_sfs_patient(
    ax, donor, PATH2SIMS.parent, remove_indels=True, normalise=True, options=options_plot,
    color="purple", label=f"{donor.name}, age: {donor.age:.0f}", marker="x", linestyle=""
)
ax.legend()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
donor = donors[5]
ax = plot_sfs_patient(
    ax, donor, PATH2SIMS.parent, remove_indels=True, normalise=True, options=options_plot,
    color="purple", label=f"{donor.name}, age: {donor.age:.0f}", marker="x", linestyle=""
)
ax.legend()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
donor = donors[5]
ax = plot_sfs_patient(
    ax, donor, PATH2SIMS.parent, remove_indels=False, normalise=True, options=options_plot,
    color="purple", label=f"{donor.name}, age: {donor.age:.0f}", marker="x", linestyle=""
)
ax.legend()
plt.show()

In [None]:
idx2show = '560'
for donor in donors:
    fig, ax = plt.subplots(1, 1, figsize=options_plot.figsize, layout="tight")
    ax = plot_sfs_sim_with_id(
        ax, 
        my_sfs[donor2id[donor.name]][idx2show], 
        sim_options_subsampling_strong,normalise=True, 
        options=options_plot,
        color="yellowgreen", marker="d", linestyle="", 
        alpha=0.6, label=f"1 run with id {idx2show}"
    )
    ax = plot_sfs_avg(
        ax, 
        my_sfs[donor2id[donor.name]], sim_options_subsampling_strong, age=0, options=options_plot,
        color="blue", linestyle="-", label=f"avg of {sim_options_subsampling_strong.runs} runs", alpha=0.6
    )
    ax = plot_sfs_correction(
        ax, corrected_variants_one_over_1_squared[donor.name], sim_options_subsampling_strong.sample, normalise=True, options=options_plot,
        color="grey", label=r"$1/f^2$ sampled"
    )
    ax = plot_sfs_patient(
        ax, donor, PATH2SIMS.parent, remove_indels=False, normalise=True, options=options_plot,
        color="purple", label=f"{donor.name}, age: {donor.age:.0f}", marker="x", linestyle=""
    )
    ax.legend()
    plt.tight_layout()
    if options_plot.save:
        plt.savefig(f"SFS_{sim_options_subsampling_strong.sample}cells_{donor.name}{EXTENSION}")
    plt.show()

### Competition vs neutral vs 1 clone (logistic fn)

In [None]:
# TODO 1 clone logistic fn

### Neutral scenario

In [None]:
%%bash -s "$PATH2BIN" "$sim_options_subsampling_strong.path2save" "$B0" "$MU0" "$sim_options_population.neutral_rate" "$sim_options_population.s" "$P_ASYMMETRIC" "$sim_options_population.runs" "$sim_options_population.cells" "$YEARS" "$sim_options_population.nb_timepoints" "$sim_options_subsampling_strong.sample" "$sim_options_subsampling_weak.sample" "$YEARS_ENTROPY"
rm -rf $2
$1/hsc -c $9 -y ${10} -r $8 --b0 $3 --mu0 $4 --neutral-rate $5 --p-asymmetric $7 --snapshot-entropy ${14} --subsample ${12} ${13} --snapshots ${11} --neutral --exponential $2

In [None]:
sfs_figures.show_entropy_plots(
    sim_options_population, PLOT_OPTIONS, mitchell_ages, early_variants_only=True
)

sfs_figures.show_entropy_plots(
    sim_options_population, PLOT_OPTIONS, mitchell_ages, early_variants_only=False
)

In [None]:
sfs_figures.show_entropy_plots(
    sim_options_subsampling_weak, PLOT_OPTIONS, mitchell_ages, early_variants_only=True
)

sfs_figures.show_entropy_plots(
    sim_options_subsampling_weak, PLOT_OPTIONS, mitchell_ages, early_variants_only=False
)

In [None]:
sfs_figures.show_entropy_plots(
    sim_options_subsampling_strong,
    PLOT_OPTIONS,
    mitchell_ages,
    early_variants_only=True,
)

sfs_figures.show_entropy_plots(
    sim_options_subsampling_strong,
    PLOT_OPTIONS,
    mitchell_ages,
    early_variants_only=False,
)