# Advanced Model Fitting

Sometimes we may want to initialize our fitting procedure with an approximate solution.
Here we'll run initialization procedures using both non-negative matrix factorization (NMF) and clustering,
and see what each says about genotypes and communities in these data.

In [None]:
!sfacts nmf_init \
    --verbose \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.ss-0.mgen.nc sim.filt.ss-0.approx-nmf.world.nc

In [None]:
!sfacts clust_init \
    --verbose \
    --num-strains 15 \
    sim.filt.ss-0.mgen.nc sim.filt.ss-0.approx-clust.world.nc

In [None]:
import sfacts as sf

approx_nmf = sf.World.load('sim.filt.ss-0.approx-nmf.world.nc')
approx_clust = sf.World.load('sim.filt.ss-0.approx-clust.world.nc')

sf.plot.plot_community(
    approx_nmf,
    col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
)
sf.plot.plot_community(
    approx_clust,
    col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
)

In [None]:
sf.plot.plot_genotype(
    approx_nmf,
    col_linkage_func=lambda w: w.metagenotype.linkage("position"),
)
sf.plot.plot_genotype(
    approx_clust,
    col_linkage_func=lambda w: w.metagenotype.linkage("position"),
)

By eye, NMF seems to do a better job here. We can use these genotypes as an initialization point for a more refined analysis.
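For readers unfamiliar with NMF, here is a minimal sketch of the general idea using NumPy: a non-negative matrix is approximated as the product of two smaller non-negative matrices. This is not the code behind `sfacts nmf_init`, whose internals are not shown here. The function `nmf`, the toy data, and the reading of the two factors as "community-like" and "genotype-like" are assumptions made for illustration only.

```python
import numpy as np

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate non-negative v (n x m) as w (n x rank) @ h (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = v.shape
    w = rng.random((n, rank)) + eps
    h = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for squared error; they keep w and h non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h

def relative_error(v, w, h):
    return np.linalg.norm(v - w @ h) / np.linalg.norm(v)

# Toy data (hypothetical): 6 "samples" x 8 "features" built from 2 latent components.
rng = np.random.default_rng(1)
v = rng.random((6, 2)) @ rng.random((2, 8))

w0, h0 = nmf(v, rank=2, n_iter=0)   # random starting point only
w, h = nmf(v, rank=2)               # same starting point, then 200 updates
err_init = relative_error(v, w0, h0)
err_fit = relative_error(v, w, h)

# Row-normalizing w gives per-sample proportions, loosely analogous to a community matrix.
community = w / w.sum(axis=1, keepdims=True)
print(f"relative error: {err_init:.3f} -> {err_fit:.3f}")
```

The factorization is only an approximate, unconstrained starting point, which is why it is refined by fitting the full model afterwards.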

Now the fun part: fitting the StrainFacts model to these data.

Let's take a look at the details of the default StrainFacts model.

We'll leave all of the hyperparameters set to their default values for this model.
We initialize the genotypes from the NMF approximation (`--init-from` together with `--init-vars genotype`),
explicitly fit 15 strains (5 more than the simulation actually had),
and set a random seed for reproducibility.

In [None]:
!sfacts fit \
    --verbose \
    --init-from sim.filt.ss-0.approx-nmf.world.nc --init-vars genotype \
    --num-strains 15 \
    --random-seed 0 \
    sim.filt.ss-0.mgen.nc sim.filt.ss-0.fit2.world.nc

In [None]:
# Load the NMF-initialized fit written by the previous cell.
fit = sf.World.load('sim.filt.ss-0.fit2.world.nc')

sf.plot.plot_community(
    fit,
    col_linkage_func=lambda w: w.metagenotype.linkage("sample"),
    row_linkage_func=lambda w: w.genotype.linkage("strain")
)
sf.plot.plot_genotype(
    fit,
    row_linkage_func=lambda w: w.genotype.linkage("strain")
)

Sometimes we may want to re-estimate genotypes
based on this initial estimate of strain relative abundances.
This can be useful if we have more SNP positions than we can afford to fit jointly with the available computational resources.
Alternatively, as in this case, we can use the refitting procedure to
get a feel for how precisely specified our genotype estimates are.
We do this by explicitly setting the regularization parameter, $\gamma^*$ (`gamma_hyper`),
to 1.01, which removes the bias towards discrete genotypes.
The result is that our genotype estimates will be "fuzzy",
incorporating more uncertainty.

All other hyperparameters are set to default values.

Here, we're going to refit the original 1000 simulated positions.
Remember that we already subsampled this file into two
non-overlapping blocks of 500 positions and used the first for the initial fitting.
Now, we'll fix the strain relative abundances ("community") and use gradient
descent to estimate genotypes.
Since conditioning on the community makes every position
in the genotype separable, we can easily parallelize this refitting procedure
by running blocks in parallel processes.
Here we'll also further divide the computation of each block into serial chunks using `--chunk-size`.
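The two levels of division can be sketched in plain Python. This only illustrates the bookkeeping and is not how StrainFacts schedules its work. The helper `split_ranges` is hypothetical, and contiguous index ranges are assumed for simplicity, whereas the actual blocks come from subsampling the positions.

```python
def split_ranges(n_items, size):
    """Return (start, stop) pairs covering range(n_items) in pieces of at most `size`."""
    return [(start, min(start + size, n_items)) for start in range(0, n_items, size)]

n_positions = 1000
block_size = 500   # one block per process (run in parallel)
chunk_size = 250   # chunks within a block (run one after another)

blocks = split_ranges(n_positions, block_size)

# For each block, list the chunks that one process would work through serially.
plan = {
    block: [
        (block[0] + start, block[0] + stop)
        for start, stop in split_ranges(block[1] - block[0], chunk_size)
    ]
    for block in blocks
}

for block, chunks in plan.items():
    print(block, "->", chunks)
```

Because positions are independent once the community is fixed, the chunking affects memory use and run time rather than which positions get fit.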

In [None]:
!sfacts fit_geno \
    --verbose \
    --hyperparameters gamma_hyper=1.01 \
    --chunk-size=250 \
    --random-seed=0 \
    sim.filt.ss-0.fit2.world.nc sim.filt.ss-0.mgen.nc sim.filt.ss-0.fit3.geno.nc

In [None]:
!sfacts fit_geno \
    --verbose \
    --hyperparameters gamma_hyper=1.01 \
    --chunk-size=250 \
    --random-seed=0 \
    sim.filt.ss-0.fit2.world.nc sim.filt.ss-1.mgen.nc sim.filt.ss-1.fit3.geno.nc

In [None]:
!sfacts concat_geno \
    --metagenotype sim.filt.mgen.nc \
    --community sim.filt.ss-0.fit2.world.nc \
    --outpath sim.filt.fit3.world.nc \
    sim.filt.ss-0.fit3.geno.nc sim.filt.ss-1.fit3.geno.nc

`concat_geno` recombines one or more genotype blocks refit in the previous step with the observed
metagenotype data and the original community inference to build a new world file.
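Conceptually, recombining blocks amounts to joining strain-by-position tables along the position axis. This works because every block was refit against the same community and therefore describes the same strains in the same order. The sketch below uses plain nested lists and a hypothetical helper, `concat_genotype_blocks`. It does not read or write the NetCDF files used above.

```python
def concat_genotype_blocks(blocks):
    """Join strain-by-position blocks along the position axis."""
    n_strains = len(blocks[0])
    if any(len(block) != n_strains for block in blocks):
        raise ValueError("all blocks must describe the same strains")
    return [
        [value for block in blocks for value in block[strain]]
        for strain in range(n_strains)
    ]

# Toy blocks (hypothetical): 2 strains, with 2 and 3 positions respectively.
block_0 = [[0.0, 1.0], [1.0, 1.0]]
block_1 = [[1.0, 0.0, 0.5], [0.0, 0.0, 1.0]]

combined = concat_genotype_blocks([block_0, block_1])
print(combined)
```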

When we visualize these refit genotypes, we see that they look similar, but slightly "fuzzier"
than the original fit.
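One way to put a number on "fuzzier" is to score how far genotype values sit from the discrete extremes. The sketch below uses mean binary entropy as that score. It assumes genotype values lie between 0 and 1, with 0 and 1 representing the two alleles. The function names and toy matrices are hypothetical, and this metric is an illustration rather than something StrainFacts reports.

```python
import math

def binary_entropy(g):
    """Entropy in bits of a value in [0, 1]; 0 for discrete calls, 1 at g = 0.5."""
    if g <= 0.0 or g >= 1.0:
        return 0.0
    return -(g * math.log2(g) + (1 - g) * math.log2(1 - g))

def mean_fuzziness(genotype):
    """Average binary entropy over a strain-by-position table of values in [0, 1]."""
    values = [g for strain in genotype for g in strain]
    return sum(binary_entropy(g) for g in values) / len(values)

# Toy genotype tables (hypothetical): 2 strains x 3 positions.
discrete = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
fuzzy = [[0.1, 0.8, 0.9], [0.7, 0.3, 0.95]]

print(mean_fuzziness(discrete), mean_fuzziness(fuzzy))
```

A higher score for the refit genotypes than for the original fit would be consistent with the visual impression described above.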

In the next example, we'll compare these fits to the simulated ground-truth
in order to evaluate our performance.