Quickstart
==

This notebook demonstrates how to use the Python module `mushi` for...

## Inferring mutation spectrum history (and demography)

We will use `mushi` to infer the history of the mutation process, which we can think of as the mutation rate function over time for each triplet mutation type.
In `mushi`, we use coalescent theory and optimization techniques to learn about this history from the $k$-SFS.

We first import the `kSFS` class from the `mushi.ksfs` module, and a few other packages.

In [None]:
from mushi.ksfs import kSFS

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

### Load $3$-SFS from the 1000 Genomes Finnish population (previously computed with the [`mutyper`](https://github.com/harrispopgen/mutyper) package)
Load the $k$-SFS

In [None]:
ksfs = kSFS(file='../example_data/3-SFS.EUR.FIN.tsv')

Plot the population variant spectrum (summing the $k$-SFS over sample frequency)

In [None]:
ksfs.as_df().sum(0).plot.bar(figsize=(17, 3))
plt.xticks(family='monospace')
plt.ylabel('number of variants')
plt.show()

Plot the total SFS (summing the $k$-SFS over mutation types)

In [None]:
ksfs.plot_total()
plt.yscale('log')
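
The two plots above are the two marginal sums of the $k$-SFS matrix. As a minimal sketch of that idea, assuming a toy matrix with sample frequencies in rows and mutation types in columns (the counts and the three type labels are made up, and no `mushi` call is used):

```python
import numpy as np

# Toy k-SFS: rows are sample frequencies 1..4, columns are mutation types.
# The counts and the choice of three types are made up for illustration.
mutation_types = ['ACA>AAA', 'TCC>TTC', 'ACG>ATG']
X = np.array([[120, 300, 90],
              [60, 140, 40],
              [35, 80, 25],
              [20, 50, 15]])

# Summing over sample frequency (axis 0) gives one count per mutation type.
variant_spectrum = X.sum(axis=0)

# Summing over mutation types (axis 1) gives the total SFS.
total_sfs = X.sum(axis=1)

print(dict(zip(mutation_types, variant_spectrum.tolist())))
print(total_sfs.tolist())
```

Both marginals count the same variants, so they sum to the same total.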

Plot the $k$-SFS composition as a scatter plot (a color for each mutation type)

In [None]:
ksfs.plot(clr=True)
plt.show()
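
The `clr=True` flag is not explained in this notebook. Assuming it refers to the centered log-ratio transform commonly used for compositional data (an assumption, not something confirmed here), a rough standalone sketch of that transform on made-up counts looks like this. The helper name `clr` below is hypothetical and is not the `mushi` implementation.

```python
import numpy as np


def clr(counts):
    """Centered log-ratio transform of each row of a positive count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Normalize each row to a composition (relative frequencies).
    composition = counts / counts.sum(axis=1, keepdims=True)
    log_comp = np.log(composition)
    # Subtracting the row-wise mean log divides by the geometric mean.
    return log_comp - log_comp.mean(axis=1, keepdims=True)


# Made-up counts: two sample frequencies (rows), three mutation types (columns).
toy = np.array([[120, 300, 90],
                [60, 140, 40]])
Z = clr(toy)
print(Z.round(3))
```

Each transformed row sums to zero, and rescaling a row of counts leaves its transform unchanged, which is why the transform is convenient for comparing compositions across frequency classes with very different totals.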

...and as a heatmap (a column for each mutation type)

In [None]:
g = ksfs.clustermap(figsize=(17, 7), col_cluster=False, xticklabels=True, cmap='RdBu_r', rasterized=True, robust=True)
g.ax_heatmap.set_xticklabels(g.ax_heatmap.get_xmajorticklabels(), fontsize=9, family='monospace')
plt.show()

We will also need the masked genome size for each mutation type, which we've also previously computed with `mutyper targets`. This defines mutational target sizes.

In [None]:
masked_genome_size = pd.read_csv('../example_data/masked_size.tsv', sep='\t', header=None, index_col=0)
masked_genome_size.index.name = 'mutation type'

masked_genome_size.plot.bar(figsize=(6, 3), legend=False)
plt.xticks(family='monospace')
plt.ylabel('mutational target size (sites)')
plt.show()

With this we can compute the number of SNPs per target in each mutation type. Notice the enrichment of C>T transitions at CpG sites.

In [None]:
normalized_hit_rates = ksfs.as_df().sum(0).to_frame(name='variant count')
normalized_hit_rates['target size'] = [int(masked_genome_size.loc[context, 1])
                                       for context, _ in normalized_hit_rates['variant count'].index.str.split('>')]

(normalized_hit_rates['variant count'] /
 normalized_hit_rates['target size']).plot.bar(figsize=(17, 3), legend=False)
plt.xticks(family='monospace')
plt.ylabel('variants per target')
plt.show()
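
The normalization above matches each mutation type to the target size of its ancestral triplet, which is the part of the label before `>`. A minimal sketch of that lookup with plain dictionaries, using made-up counts and target sizes rather than the example data:

```python
# Hypothetical variant counts and target sizes, for illustration only.
variant_counts = {'ACG>ATG': 900, 'ACA>ATA': 400, 'TCC>TTC': 600}
target_sizes = {'ACG': 1_000, 'ACA': 8_000, 'TCC': 6_000}

variants_per_target = {}
for mutation_type, count in variant_counts.items():
    # 'ACG>ATG' splits into ancestral context 'ACG' and derived context 'ATG'.
    ancestral, derived = mutation_type.split('>')
    variants_per_target[mutation_type] = count / target_sizes[ancestral]

for mutation_type, rate in variants_per_target.items():
    print(f'{mutation_type}: {rate:.3f}')
```

In this toy input the CpG type `ACG>ATG` has the fewest targets but the highest count per target, which is the kind of enrichment the plot is meant to reveal.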

To compute the total mutation rate in units of mutations per masked genome per generation, we multiply an estimate of the site-wise rate by the total target size.

In [None]:
μ0 = 1.25e-8 * masked_genome_size[1].sum()
μ0
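
To make the units concrete, here is the same arithmetic on hypothetical target sizes (the three sizes below are invented and add up to two billion sites; the real values come from `masked_size.tsv`):

```python
# Per-site, per-generation mutation rate used in the cell above.
site_rate = 1.25e-8

# Hypothetical target sizes (sites) for three triplet contexts.
target_sizes = {'ACA': 8.0e8, 'ACG': 2.0e8, 'TCC': 1.0e9}

# Total number of sites that can mutate in the masked genome.
total_sites = sum(target_sizes.values())

# Expected mutations per masked genome per generation.
mu0_toy = site_rate * total_sites
print(mu0_toy)
```

With these invented sizes the total rate comes out at about 25 new mutations per masked genome per generation.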

To render time in years rather than generations, we use an estimate of the generation time

In [None]:
t_gen = 29

### Joint coalescent inference of demography and mutation spectrum history

To access time-calibrated mutation spectrum histories, we first need to estimate the demographic history, since this defines the diffusion timescale of the coalescent process.

We first define a grid of times on which we will represent history, measured retrospectively from the present in units of Wright-Fisher generations.

In [None]:
t = np.logspace(np.log10(1), np.log10(200000), 200)
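
As a quick check on what this grid looks like, the sketch below rebuilds it and converts it to years with the generation time set earlier. The names `ratios` and `t_years` are introduced here only for illustration.

```python
import numpy as np

t_gen = 29  # years per generation, as set above
t = np.logspace(np.log10(1), np.log10(200000), 200)

# Log spacing means consecutive grid points differ by a constant ratio,
# so the grid is dense near the present and sparse in the deep past.
ratios = t[1:] / t[:-1]

# Convert the grid from generations to years.
t_years = t_gen * t
print(len(t), t[0], t[-1], t_years[-1])
```

The grid runs from 1 generation to 200,000 generations, or roughly 29 years to 5.8 million years before the present.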

We now run the optimization, setting a few parameters to control how complicated we let the histories look.

In [None]:
ksfs.infer_history(t, μ0, alpha_tv=1e2, alpha_spline=3e3, alpha_ridge=1e-10,
                   beta_rank=1e1, beta_tv=7e1, beta_spline=1e1, beta_ridge=1e-10,
                   tol=1e-11)
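
The notebook does not define these parameters. Their names suggest that `alpha_tv` and `beta_tv` weight total-variation penalties and that `alpha_ridge` and `beta_ridge` weight ridge penalties; that reading is an assumption based on the names alone. Under that assumption, the following sketch shows why a total-variation penalty favors histories with few jumps while a ridge penalty does not. The helper functions and the two toy histories are hypothetical and are not part of `mushi`.

```python
import numpy as np


def total_variation(y):
    """Sum of absolute jumps between neighboring grid points."""
    return np.abs(np.diff(y)).sum()


def ridge(y):
    """Sum of squared values."""
    return np.sum(np.square(y))


# Two made-up histories on a 6-point time grid.
smooth = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])  # one jump
wiggly = np.array([1.0, 2.0, 1.0, 2.0, 1.0, 2.0])  # five jumps

print(total_variation(smooth), total_variation(wiggly))
print(ridge(smooth), ridge(wiggly))
```

Both histories contain the same values, so the ridge penalty cannot tell them apart, whereas the total-variation penalty is five times larger for the wiggly one. Raising the weight on such a penalty would therefore push the fit toward simpler, more piecewise-constant histories.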

Hopefully you agree that was fast 🏎

We'll now check that the demography has a few features we expect in the Finnish population: the out-of-Africa bottleneck shared by all Eurasians, a later bottleneck associated with northward migration, and exponential population growth toward the present.

- The plot on the left will show the fit to the SFS
- The plot on the right will show the inferred haploid effective population size history

In [None]:
plt.figure(figsize=(10, 5))
plt.subplot(121)
ksfs.plot_total()
plt.yscale('log')
plt.subplot(122)
ksfs.eta.plot(t_gen=t_gen)
plt.xlim([1e3, 1e6])
plt.show()

Now let's take a look at the inferred mutation spectrum history (MuSH).
- The plot on the left will show the measured $k$-SFS composition (points) and the fit from `mushi` (lines)
- The plot on the right will show the inferred MuSH

In [None]:
plt.figure(figsize=(16, 5))
plt.subplot(121)
ksfs.plot(clr=True)
plt.subplot(122)
ksfs.μ.plot(t_gen=t_gen, clr=True, alpha=0.75)
ksfs.μ.plot(('TCC>TTC',), t_gen=t_gen, clr=True, lw=5)
plt.xscale('log')
plt.xlim([1e3, 1e6])
plt.show()

We can also plot the MuSH as a heatmap with the y axis representing time.

In [None]:
g = ksfs.μ.clustermap(t_gen=t_gen, figsize=(17, 7), col_cluster=True, xticklabels=True, robust=False, cmap='RdBu_r')
g.ax_heatmap.set_xticklabels(g.ax_heatmap.get_xmajorticklabels(), fontsize=9, family='monospace')
g.ax_heatmap.set_ylim([172, 58])
plt.show()

Now that you have a MuSH, you can start answering questions about mutation spectrum history! 🤸