In [None]:
import sfacts as sf

Load the simulated metagenotype and filter

- positions by a minimum minor allele frequency
- samples by a minimum horizontal coverage (fraction of sites with counts)
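
To make the two criteria concrete, here is a minimal NumPy sketch on a made-up count array. It is not the `sfacts filter_mgen` implementation: the array layout, the helper name `filter_toy`, pooling allele frequencies across samples, and computing coverage over all positions (rather than only the retained ones) are assumptions made for illustration.

```python
import numpy as np

# Toy metagenotype counts[sample, position, allele]; allele 0 = reference, 1 = alternative.
counts = np.array([
    [[5, 5], [10, 0], [0, 0], [3, 7]],  # sample A
    [[4, 6], [9, 0], [2, 2], [0, 0]],   # sample B
    [[0, 0], [0, 0], [0, 0], [1, 0]],   # sample C (sparsely covered)
])


def filter_toy(counts, min_maf, min_hcvrg):
    """Hypothetical helper: return boolean masks (keep_sample, keep_position)."""
    # Pool counts over samples, then take the less common allele's frequency.
    allele_totals = counts.sum(axis=0)           # shape (position, allele)
    depth = allele_totals.sum(axis=1)            # shape (position,)
    freq = allele_totals / np.maximum(depth, 1)[:, None]
    minor_allele_freq = freq.min(axis=1)
    keep_position = minor_allele_freq >= min_maf

    # Horizontal coverage: fraction of positions with at least one count.
    covered = counts.sum(axis=2) > 0             # shape (sample, position)
    horizontal_cvrg = covered.mean(axis=1)
    keep_sample = horizontal_cvrg >= min_hcvrg
    return keep_sample, keep_position


keep_sample, keep_position = filter_toy(counts, min_maf=0.05, min_hcvrg=0.5)
print(keep_sample)    # sample C is covered at only 1 of 4 positions
print(keep_position)  # position 1 has no alternative-allele counts
```

In this toy example the second position is dropped because it is monomorphic, and the third sample is dropped because only a quarter of its positions have any counts.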

In [None]:
!sfacts filter_mgen \
    --min-minor-allele-freq 0.05 \
    --min-horizontal-cvrg 0.15 \
    --random-seed 0 \
    sim.mgen.nc sim.filt.mgen.nc

In [None]:
mgen_raw = sf.data.Metagenotypes.load('sim.mgen.nc')
mgen_filt = sf.data.Metagenotypes.load('sim.filt.mgen.nc')
print(mgen_raw.sizes)
print(mgen_filt.sizes)

We can see that this did not reduce the size of the data; that's because our simulation had
uniform coverage and plenty of polymorphism at each site.
(Real data will be filtered much more than this.)

Plotting the metagenotypes again:

In [None]:
sf.plot.plot_metagenotype(mgen_filt.to_world())

Now the fun part: fitting the StrainFacts model to these data.

Let's take a look at the default hyperparameters for our model.

In [None]:
!sfacts describe default

We'll leave all of the hyperparameters set to their default values for this model.
In addition, we explicitly fit 15 strains (5 more than the simulation actually had),
and we set a random seed for reproducibility.

In [None]:
!sfacts fit \
    --verbose \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.mgen.nc sim.filt.fit.world.nc

A model of this size on this dataset should fit relatively quickly (~1 minute on my computer).

When run on the command-line, several pieces of information are printed to the screen,
thanks to the `--verbose` flag.

The result of this fit is a "world" with a point estimate for _all_ of the parameters.

Let's load this into Python and plot the inferred genotypes and relative abundances.

In [None]:
fit = sf.data.World.load('sim.filt.fit.world.nc')

# Plot inferred relative abundances for each sample (the "community").
sf.plot.plot_community(
    fit,
    col_linkage_func=lambda w: w.metagenotypes.linkage('sample'),
    row_linkage_func=lambda w: w.genotypes.linkage('strain'),
)

# Plot the inferred genotypes (15 fitted strains; the simulation had 10).
sf.plot.plot_genotype(
    fit,
    row_linkage_func=lambda w: w.genotypes.linkage("strain"),
    col_linkage_func=lambda w: w.metagenotypes.linkage("position"),
)

Sometimes we may want to re-estimate genotypes
based on this initial estimate of strain relative abundances.
This can be useful when we have more SNP positions than our computational resources can handle in a single fit.

Here, we refit all 500 simulated positions at once.
If we had more, we could split it up serially (`--chunk-size`) or across multiple processes (`--block-size` and `--block-number`).
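
As a rough picture of what splitting across processes means, the sketch below partitions position indices into contiguous blocks. The helper `block_slice` is hypothetical, and the assumption that a block number is a zero-based index into consecutive blocks of `block_size` positions is an interpretation of the flags, not something confirmed from the StrainFacts source.

```python
def block_slice(n_positions, block_size, block_number):
    """Hypothetical helper: position indices covered by one zero-based block."""
    start = block_number * block_size
    stop = min(start + block_size, n_positions)
    return range(start, stop)


n_positions = 500

# One block of 500 covers everything, as in the refit run in this notebook.
whole = block_slice(n_positions, block_size=500, block_number=0)

# With smaller blocks, each process would take a different block number.
blocks = [block_slice(n_positions, block_size=200, block_number=i) for i in range(3)]
for i, block in enumerate(blocks):
    print(i, block.start, block.stop, len(block))
```

Under this reading, the last block is simply shorter when the number of positions is not a multiple of the block size, and each block would produce its own genotype file to be recombined afterwards.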

Several hyperparameters are set to the defaults for this model.
For this refitting we have explicitly set the regularization parameter, $\gamma^*$ / `gamma_hyper`, to 1.0,
which removes the bias towards discrete genotypes.
The result is that our genotype estimates will be more "fuzzy",
incorporating more uncertainty.
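
For intuition about "discrete" versus "fuzzy" values on the unit interval, the sketch below uses a symmetric Beta distribution as a loose analogy. This is explicitly not the StrainFacts genotype prior; the Beta family, the concentration values, and the 0.1 cutoff are all assumptions chosen only to show how a smaller concentration pushes mass towards 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)


def fraction_near_ends(concentration, n_draws=10_000, cutoff=0.1):
    """Fraction of Beta(c, c) draws within `cutoff` of either 0 or 1."""
    draws = rng.beta(concentration, concentration, size=n_draws)
    return float(np.mean((draws <= cutoff) | (draws >= 1 - cutoff)))


# Concentration below 1 favors near-discrete values; 1.0 is uniform on [0, 1].
near_discrete = fraction_near_ends(0.1)
uniform_like = fraction_near_ends(1.0)
print(near_discrete, uniform_like)
```

With a concentration of 1.0 the draws are spread evenly, which is the sense in which removing a bias towards 0 and 1 leaves more room for intermediate values.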

In [None]:
!sfacts fit_geno \
    --verbose \
    --model-structure model2 \
    --hyperparameters gamma_hyper=1.0 \
    --block-size=500 \
    --chunk-size=500 \
    --block-number=0 \
    --random-seed=0 \
    sim.filt.fit.world.nc sim.filt.mgen.nc sim.filt.fit.refit-0.geno.nc

In [None]:
!sfacts concat_geno \
    --metagenotype sim.filt.mgen.nc \
    --community sim.filt.fit.world.nc \
    --outpath sim.filt.fit.refit.world.nc \
    sim.filt.fit.refit-0.geno.nc

The `concat_geno` command above recombines one or more refit genotype blocks with the observed
metagenotype data and the original community inference to build a new world file.

When we visualize these refit genotypes, we see that they look similar, but slightly "fuzzier"
than the original fit.

In [None]:
refit = sf.data.World.load('sim.filt.fit.refit.world.nc')
refit = refit.sel(position=fit.position.astype(str))


# Plot the 15 inferred genotypes of the 10 simulated strains.
sf.plot.plot_genotype(
    fit,
    row_linkage_func=lambda w: fit.genotypes.linkage("strain"),
    col_linkage_func=lambda w: w.metagenotypes.linkage("position"),
)

# We can see that we get approximately the same genotypes, but more fuzzy this time.
sf.plot.plot_genotype(
    refit,
    row_linkage_func=lambda w: fit.genotypes.linkage("strain"),
    col_linkage_func=lambda w: w.metagenotypes.linkage("position"),
)


Finally, we'll dump the relative abundance and genotype inferences out to TSV files,
which are now ready to be processed by downstream tools.

Note that the genotypes for each strain in each position are encoded as a float,
where 0.0 means entirely reference and 1.0 means entirely alternative allele.
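
Downstream tools often want allele calls rather than floats. The sketch below shows one way to threshold such values; the toy array, the helper name `call_alleles`, and the 0.25 margin are illustrative assumptions, not part of StrainFacts or its output format.

```python
import numpy as np

# Toy genotype estimates with shape (strain, position); values lie in [0, 1].
geno = np.array([
    [0.02, 0.97, 0.55],
    [0.99, 0.10, 0.48],
])


def call_alleles(geno, margin=0.25):
    """Hypothetical helper: label values near 0 as 'ref', near 1 as 'alt'."""
    calls = np.full(geno.shape, "ambiguous", dtype=object)
    calls[geno <= margin] = "ref"
    calls[geno >= 1 - margin] = "alt"
    return calls


calls = call_alleles(geno)
print(calls)
```

Values in the middle of the range are left as "ambiguous" rather than rounded, since a fuzzy estimate near 0.5 carries little information about the allele.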

In [None]:
!sfacts dump sim.filt.fit.refit.world.nc \
    --genotype sim.filt.fit.refit.geno.tsv \
    --community sim.filt.fit.refit.comm.tsv

In [None]:
!head sim.filt.fit.refit.geno.tsv sim.filt.fit.refit.comm.tsv

In the next example, we'll compare this fit to the simulated ground-truth
in order to evaluate our performance.