In [None]:
import sfacts as sf

Load the simulated metagenotype and filter

- positions by a minimum minor allele frequency
- samples by a minimum horizontal coverage (fraction of sites with counts)
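
To make the two filters concrete, here is a minimal NumPy sketch of what they compute on a toy counts array. This is not the `sfacts filter_mgen` implementation: the function name `filter_metagenotype`, the `(sample, position, allele)` array layout, and the order of operations (positions first, then samples) are all assumptions made for illustration.

```python
import numpy as np


def filter_metagenotype(counts, min_minor_allele_freq, min_horizontal_cvrg):
    """Return boolean masks (keep_sample, keep_position).

    counts has shape (sample, position, allele) with two alleles per site.
    Hypothetical helper for illustration only.
    """
    # Pooled allele frequency at each position, summed over all samples.
    allele_totals = counts.sum(axis=0)  # (position, allele)
    depth = allele_totals.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_totals[:, 0] / depth
    minor_freq = np.nan_to_num(np.minimum(freq, 1 - freq), nan=0.0)
    keep_position = minor_freq >= min_minor_allele_freq

    # Horizontal coverage: fraction of retained positions with any counts.
    covered = counts[:, keep_position, :].sum(axis=2) > 0
    keep_sample = covered.mean(axis=1) >= min_horizontal_cvrg
    return keep_sample, keep_position


# Toy data: 3 samples, 4 positions, 2 alleles.
counts = np.zeros((3, 4, 2), dtype=int)
counts[1, :, 0] = 5   # sample 1 carries the first allele everywhere
counts[2, 1:, 1] = 5  # sample 2 carries the second allele at positions 1-3
# Sample 0 has no counts at all; position 0 is invariant.

keep_sample, keep_position = filter_metagenotype(counts, 0.05, 0.15)
filtered = counts[keep_sample][:, keep_position]
print(keep_sample, keep_position, filtered.shape)
```

In this toy example the invariant position and the uncovered sample are the ones that fail the thresholds.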

In [None]:
!sfacts simulate \
    --num-strains=10 --num-samples=50 --num-positions=1000 \
    --hyperparameters pi_hyper=0.1 mu_hyper_mean=10.0 epsilon_hyper_mode=0.01 \
    --random-seed=0 \
    sim.world.nc

In [None]:
!sfacts filter_mgen \
    --min-minor-allele-freq 0.05 \
    --min-horizontal-cvrg 0.15 \
    --random-seed 0 \
    sim.mgen.nc sim.filt.mgen.nc

In [None]:
!sfacts data_info sim.filt.mgen.nc

Now subsample the positions into 10 blocks (numbered 0 through 9) of 500 positions each.

In [None]:
%%bash
for i in `seq 0 9`
do
    sfacts sample_mgen --verbose \
        --num-positions 500 \
        --block-number $i \
        --random-seed 0 \
        sim.filt.mgen.nc sim.filt.ss-$i.mgen.nc
done
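
One way to picture block-wise subsampling is as slices of a single seeded shuffle, so that the same seed with different block numbers yields non-overlapping sets of positions. The sketch below assumes that behavior; whether `sfacts sample_mgen` works this way is not confirmed here, and `sample_position_block` and the toy total of 5000 positions are hypothetical.

```python
import numpy as np


def sample_position_block(n_positions, block_size, block_number, seed=0):
    """Return sorted position indices for one block of a seeded shuffle.

    Hypothetical helper: the same seed always produces the same shuffle,
    so different block numbers select disjoint slices of it.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_positions)
    start = block_number * block_size
    return np.sort(order[start:start + block_size])


# Ten blocks of 500 positions drawn from a toy total of 5000 positions.
blocks = [sample_position_block(5000, 500, i) for i in range(10)]
print([len(b) for b in blocks])
```

Because every block comes from the same shuffle, rerunning one block number reproduces exactly the same positions.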

Before we apply the full-strength StrainFacts model to our data,
let's try out two approximations, NMF and clustering, and see what they say about genotypes and communities in these data.
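
The idea behind the NMF approximation can be sketched as factorizing a samples-by-positions allele-frequency matrix into a non-negative community matrix times a non-negative genotype matrix. The block below is a generic multiplicative-update NMF in NumPy, not the code behind `sfacts nmf_init`; the function name `nmf_sketch`, the row normalization, and the clipping of genotypes to [0, 1] are assumptions for illustration.

```python
import numpy as np


def nmf_sketch(freq, n_strains, n_iter=500, seed=0, eps=1e-9):
    """Approximate freq (samples x positions) as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared-error objective.
    Hypothetical helper, not the sfacts implementation.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_positions = freq.shape
    W = rng.uniform(0.1, 1.0, size=(n_samples, n_strains))
    H = rng.uniform(0.1, 1.0, size=(n_strains, n_positions))
    for _ in range(n_iter):
        H *= (W.T @ freq) / (W.T @ W @ H + eps)
        W *= (freq @ H.T) / (W @ H @ H.T + eps)
    # Crude post-processing (an assumption): rescale each sample's weights
    # to sum to one and clip genotype values into [0, 1].
    community = W / np.maximum(W.sum(axis=1, keepdims=True), eps)
    genotype = np.clip(H, 0.0, 1.0)
    return community, genotype


# Toy data: 20 samples, 3 true strains, 40 positions.
rng = np.random.default_rng(1)
true_comm = rng.dirichlet(np.ones(3) * 0.5, size=20)
true_geno = rng.integers(0, 2, size=(3, 40)).astype(float)
freq = true_comm @ true_geno

community, genotype = nmf_sketch(freq, n_strains=5)
print(community.shape, genotype.shape)
```

As in the cells that follow, the sketch asks for more strains than the toy data contain.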

In [None]:
!sfacts nmf_init \
    --verbose \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.ss-0.mgen.nc sim.filt.ss-0.approx-nmf.world.nc

In [None]:
!sfacts cluster_init \
    --verbose \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.ss-0.mgen.nc sim.filt.ss-0.approx-clust.world.nc

In [None]:
approx_nmf = sf.World.load('sim.filt.ss-0.approx-nmf.world.nc')
approx_clust = sf.World.load('sim.filt.ss-0.approx-clust.world.nc')

sf.plot.plot_community(
    approx_nmf,
    col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
)
sf.plot.plot_community(
    approx_clust,
    col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
)

In [None]:
sf.plot.plot_genotype(
    approx_nmf,
    col_linkage_func=lambda w: w.metagenotype.linkage("position"),
)
sf.plot.plot_genotype(
    approx_clust,
    col_linkage_func=lambda w: w.metagenotype.linkage("position"),
)

By eye, NMF seems to do a better job here. We can use these genotypes as an initialization point for a more refined analysis.

Now the fun part: fitting the StrainFacts model to these data.

Let's take a look at the details of the default StrainFacts model.

We'll leave all of the hyperparameters set to their default values for this model.
In addition, we explicitly fit 15 strains (5 more than the simulation actually had),
and we set a random seed for reproducibility.

In [None]:
!sfacts fit \
    --verbose \
    --init-from sim.filt.ss-0.approx-nmf.world.nc --init-vars genotype \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.ss-0.mgen.nc sim.filt.ss-0.fit2.world.nc

Sometimes we may want to re-estimate genotypes
based on this initial estimate of strain relative abundances.
This can be useful when there are more SNP positions than we can fit at once with the available computational resources.

Here, we're going to fit the first 1000 simulated positions. By specifying a `--block-size` and `--block-number`, we can
divide the computation into parallel processes in a split-apply-combine workflow.
Within each process, `--chunk-size` further divides the computation serially.
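
The split-apply-combine idea can be sketched in a few lines: hold the community fixed, estimate genotypes independently for each chunk of positions, then concatenate the chunks. StrainFacts fits a probabilistic model here; the block below swaps in ordinary least squares purely for illustration, and `refit_genotype_chunk` and `refit_genotype` are hypothetical names.

```python
import numpy as np


def refit_genotype_chunk(community, freq_chunk):
    """Least-squares genotype for one chunk of positions, community fixed.

    A stand-in for the model fit, chosen only because it is easy to check.
    """
    geno, *_ = np.linalg.lstsq(community, freq_chunk, rcond=None)
    return np.clip(geno, 0.0, 1.0)


def refit_genotype(community, freq, chunk_size):
    """Split positions into chunks, refit each, and concatenate the results."""
    chunks = [
        refit_genotype_chunk(community, freq[:, start:start + chunk_size])
        for start in range(0, freq.shape[1], chunk_size)
    ]
    return np.concatenate(chunks, axis=1)


# Toy data: 20 samples, 3 strains, 50 positions, noise-free frequencies.
rng = np.random.default_rng(0)
community = rng.dirichlet(np.ones(3), size=20)
true_genotype = rng.integers(0, 2, size=(3, 50)).astype(float)
freq = community @ true_genotype

refit = refit_genotype(community, freq, chunk_size=10)
print(refit.shape)
```

Because positions are independent once the community is fixed, the chunk size changes memory use and scheduling but not the result in this sketch.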

Several hyperparameters are set to the defaults for this model.
For this refitting we have explicitly set the regularization parameter, $\gamma^*$ / `gamma_hyper`, to 1.01,
which removes the bias towards discrete genotypes.
The result is that our genotype estimates will be "fuzzy",
incorporating more uncertainty.

In [None]:
!sfacts fit_geno \
    --verbose \
    --hyperparameters gamma_hyper=1.01 \
    --chunk-size=250 \
    --random-seed=0 \
    sim.filt.ss-0.fit2.world.nc sim.filt.ss-0.mgen.nc sim.filt.ss-0.refit-0.geno.nc

In [None]:
%%bash
for i in `seq 1 9`
do
    sfacts fit_geno \
        --verbose \
        --hyperparameters gamma_hyper=1.01 \
        --chunk-size=500 \
        --random-seed=0 \
        sim.filt.ss-0.fit2.world.nc sim.filt.ss-$i.mgen.nc sim.filt.ss-0.refit-$i.geno.nc
done

In [None]:
%%bash
sfacts concat_geno \
            --metagenotype sim.filt.mgen.nc \
            --community sim.filt.ss-0.fit2.world.nc \
            --outpath sim.filt.fit2.refit.world.nc \
            sim.filt.ss-0.refit-{0,1,2,3,4,5,6,7,8,9}.geno.nc

`concatenate_genotype_chunks` then recombines one or more genotype blocks refit in this step with the observed
metagenotype data and original community inference to build a new world file.

When we visualize these refit genotypes, we see that they look similar, but slightly "fuzzier"
than the original fit.

In [None]:
# TODO: Evaluate in-bag vs. out-of-bag refitting performance.

In [None]:
!sfacts evaluate_fit --outpath sim.filt.fit.eval.tsv sim.world.nc sim.world.nc sim.filt.ss-0.fit2.world.nc sim.filt.ss-0.refit-0.geno.nc
!column -t sim.filt.fit.eval.tsv

In the next example, we'll compare this fit to the simulated ground-truth
in order to evaluate our performance.
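
As a rough picture of what a comparison against ground truth involves, the sketch below scores genotypes by matching each true strain to its closest fitted strain, and scores communities by comparing pairwise Bray-Curtis dissimilarities between samples, which does not depend on how strains are labeled. These are illustrative metrics under stated assumptions, not necessarily the ones `sfacts evaluate_fit` reports; both function names are hypothetical.

```python
import numpy as np


def best_match_genotype_error(true_geno, fit_geno):
    """For each true strain, the smallest mean absolute difference
    to any fitted strain (both arrays are strains x positions)."""
    dist = np.abs(true_geno[:, None, :] - fit_geno[None, :, :]).mean(axis=2)
    return dist.min(axis=1)


def pairwise_braycurtis(comm):
    """Bray-Curtis between samples; for rows summing to one this is
    half the L1 distance."""
    return 0.5 * np.abs(comm[:, None, :] - comm[None, :, :]).sum(axis=2)


def community_error(true_comm, fit_comm):
    """Mean absolute difference between the two off-diagonal
    sample-by-sample dissimilarity matrices."""
    diff = np.abs(pairwise_braycurtis(true_comm) - pairwise_braycurtis(fit_comm))
    off_diagonal = ~np.eye(diff.shape[0], dtype=bool)
    return diff[off_diagonal].mean()


# Toy case: the fit recovers both true strains (in a different order)
# plus one extra strain that is absent from every sample.
true_geno = np.array([[0, 0, 1, 1], [1, 0, 1, 0]], dtype=float)
fit_geno = np.array([[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 1, 1]], dtype=float)
true_comm = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
fit_comm = np.array([[0.1, 0.9, 0.0], [0.8, 0.2, 0.0], [0.5, 0.5, 0.0]])

geno_err = best_match_genotype_error(true_geno, fit_geno)
comm_err = community_error(true_comm, fit_comm)
print(geno_err, comm_err)
```

Both scores tolerate a fit with more strains than the simulation, which matters here because the fit used 15 strains for 10 simulated ones.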