# Tutorial: applying strainFlye to the "SheepGut" dataset

This tutorial will walk you through the various commands strainFlye offers.

## 0. Following this tutorial

Depending on your dataset and goals, you could either walk through the entirety of this document or skip around a bit. You can "jump to" many steps of the pipeline without trouble; for example, if you already have a BCF file of single-nucleotide mutations, you may prefer to skip all of the naïve mutation calling / FDR estimation steps and jump to the phasing analyses.

Note that many of the steps of this pipeline will take a while (in my experience: for the full SheepGut dataset on our cluster, "a while" usually means around a day). Smaller datasets and/or faster machines should of course decrease this duration. Step 5 of this tutorial ("**Optional: Filter the FASTA file in order to focus on certain contigs**") discusses how you can subset your dataset, after the `align` step, to certain contigs of interest; this will also speed things up.

([Obligatory xkcd link.](https://xkcd.com/1343/))

## 1. The two main strainFlye inputs

The strainFlye pipeline takes as input two main types of data:

1. A __set of reads__ (in FASTA / FASTQ format).

2. A __set of contigs__ (in FASTA format) assembled from these reads.

This assumes that you're starting the pipeline at the beginning (with `align`). Later pipeline steps will require other inputs that can be produced by intermediate pipeline steps.

### 1.1. Details about these types of inputs

**Regarding reads:** We designed strainFlye in the context of PacBio Circular Consensus Sequencing (CCS) "HiFi" reads ([Wenger & Peluso _et al._, 2019](https://www.nature.com/articles/s41587-019-0217-9)). However, in theory it should still work with other reasonably long and accurate reads.

**Regarding contigs:** We don't impose any restriction on the assembler you use to construct these. We have tested strainFlye on [metaFlye](https://github.com/fenderglass/Flye) ([Kolmogorov _et al._, 2020](https://www.nature.com/articles/s41592-020-00971-x)) and [hifiasm-meta](https://github.com/xfengnefx/hifiasm-meta) ([Feng _et al._, 2022](https://www.nature.com/articles/s41592-022-01478-3)) output, but it should in theory work with the outputs of other HiFi assemblers.

### 1.2. The SheepGut dataset

We'll be applying strainFlye to the SheepGut dataset that is shown in our paper. (This dataset has also been described in [Kolmogorov _et al._, 2020](https://www.nature.com/articles/s41592-020-00971-x) and [Bickhart & Kolmogorov _et al._, 2022](https://www.nature.com/articles/s41587-021-01130-z).) Please see the strainFlye paper's "Data access" section for details about acquiring reads and contigs for the SheepGut dataset.

Note that the "contigs" we use for the SheepGut dataset really correspond to edge sequences in the `assembly_graph.gfa` file produced by metaFlye. These edge sequences may be slightly different from the file of contigs / scaffolds in the `assembly.fasta` file produced by metaFlye: see [Flye's manual](https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md#output) for more information. (You could use either type of sequence with strainFlye, although I personally recommend using edges: it's useful to have context about where exactly in the assembly graph a sequence is, and things like gaps in scaffolds [represented by `N`s] will cause strainFlye to complain.)
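
If you do want to start from scaffolds, it may help to check up front which sequences contain gap characters. The snippet below is a minimal sketch of such a check, not part of strainFlye: the function name `find_non_acgt` and the toy sequences are made up for illustration, and it simply counts characters other than A, C, G, and T in each FASTA record.

```python
def find_non_acgt(fasta_lines):
    """Map sequence name -> number of characters other than A, C, G, T.

    Only sequences with at least one such character are returned.
    (Hypothetical helper for illustration; not a strainFlye function.)
    """
    bad_counts = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            bad_counts[name] = 0
        elif name is not None:
            bad_counts[name] += sum(1 for ch in line.upper() if ch not in "ACGT")
    return {n: c for n, c in bad_counts.items() if c > 0}


# Toy input: one clean edge sequence, one scaffold with a run of Ns
example = [">edge_1", "ACGTACGT", ">scaffold_2", "ACGTNNNN", "NNACGT"]
print(find_non_acgt(example))
```

To run this on a real file, you could pass an open file handle (for example, `find_non_acgt(open("assembly.fasta"))`) instead of a list of lines.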

## 2. Introduction

Let's take care of a few things before the tutorial starts.

### 2.1. Installing strainFlye

Before following along with this tutorial, we assume that you have already installed strainFlye (and have activated the corresponding conda environment). Please see [strainFlye's README](https://github.com/fedarko/strainFlye) for installation instructions.

### 2.2. What commands are available through strainFlye?

In [30]:
!strainFlye

Usage: strainFlye [OPTIONS] COMMAND [ARGS]...

  Pipeline for the analysis of rare mutations in metagenomes.

  Please consult https://github.com/fedarko/strainFlye if you have any
  questions, comments, etc. about strainFlye. Thank you for using this tool!

Options:
  -v, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  align   Align reads to contigs, and filter the resulting alignment.
  call    [+] Call mutations in contigs naïvely & compute diversity indices.
  fdr     [+] Estimate and fix FDRs for contigs' naïve mutation calls.
  spot    [+] Identify putative mutational hotspots or coldspots in contigs.
  smooth  [+] Create and assemble smoothed and virtual reads.
  link    [+] Create link graphs showing co-occurring alleles.
  dynam   [+] Compute simple information about growth dynamics.
  utils   [+] Various utility commands provided with strainFlye.


### 2.3. Importing and configuring some utilities

You shouldn't need to do much (if any) programming to use strainFlye's commands; that said, we will be using Python to help with a few small tasks throughout this tutorial, and you will probably need to do some programming to plot the results we produce (this philosophy was inspired partially by [Schloss 2020](https://journals.asm.org/doi/full/10.1128/AEM.02343-19)). We'll import some useful Python packages here to reduce clutter in this notebook.

(If you prefer, you could of course use another language instead of Python.)

In [15]:
import time
import skbio
import pandas as pd

## 3. Convert the assembly graph GFA file to a FASTA file of contigs

**You can skip this step if:** you already have a FASTA file describing contigs in your assembly graph.

Our assembly graph (the GFA file) contains the sequences of the contigs that we will use in many downstream analyses, but we'll need to have a FASTA file that just describes these contigs' sequences (independent of the assembly graph topology).

There are some [bash one-liners](https://www.biostars.org/p/169516/#169530) you can use to convert a GFA 1 file to a FASTA file, but strainFlye also provides a utility command (`strainFlye utils gfa-to-fasta`) to do this for you. We'll use this here. (Our solution may be a bit slower than a bash one-liner, but it performs some useful sanity checking on the GFA file.)

In [46]:
!strainFlye utils gfa-to-fasta \
    --graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
    --output-fasta /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta

--------
strainFlye utils gfa-to-fasta @ 0.00 sec: Starting...
Input GFA file: /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa
Output FASTA file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
--------
strainFlye utils gfa-to-fasta @ 14.97 sec: Done.
Output FASTA file contains 78,793 sequences.
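
For intuition about what this conversion involves, here is a minimal Python sketch of the core idea: copy the name and sequence from each segment (`S`) line of a GFA 1 file into a FASTA record. This is an illustration only, not strainFlye's implementation; the function name `gfa_segments_to_fasta` and the toy graph are hypothetical, and the sanity checks that `strainFlye utils gfa-to-fasta` performs are not reproduced here.

```python
import io


def gfa_segments_to_fasta(gfa_handle, fasta_handle):
    """Write one FASTA record per GFA 1 segment (S) line; return the record count.

    (Hypothetical helper for illustration; not strainFlye's code.)
    """
    num_written = 0
    for line in gfa_handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S":
            # Skip header, link, path, etc. lines
            continue
        if len(fields) < 3 or fields[2] == "*":
            # GFA allows "*" as a placeholder for an omitted sequence
            raise ValueError(f"Segment line lacks a sequence: {line!r}")
        fasta_handle.write(f">{fields[1]}\n{fields[2]}\n")
        num_written += 1
    return num_written


# Toy graph with two segments and one link
gfa_text = (
    "H\tVN:Z:1.0\n"
    "S\tedge_1\tACGT\tdp:i:30\n"
    "S\tedge_2\tGGCC\n"
    "L\tedge_1\t+\tedge_2\t-\t0M\n"
)
out = io.StringIO()
n = gfa_segments_to_fasta(io.StringIO(gfa_text), out)
print(f"Wrote {n} sequences.")
print(out.getvalue(), end="")
```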


## 4. Align reads to contigs; filter the resulting alignment

**You can skip this step if:** you already have a BAM file representing an alignment of reads to contigs, and this BAM file does not contain secondary alignments / partially-mapped reads / overlapping supplementary alignments (these all may cause problems in downstream analyses).

We'll need to align reads back to these contigs. The resulting alignment, and/or the mutations that we call from it, will be used in pretty much all downstream steps—so it's important to make sure that it is of good quality!

The `strainFlye align` command uses minimap2 to perform alignment, and then does some extra filtering on the resulting alignment.

Note that this command, in particular, may take a while to run. Sequence alignment is computationally expensive! On our cluster, `strainFlye align` ran on the full SheepGut dataset in 62,941.21 seconds (aka about 17.5 hours).

In [2]:
!strainFlye align

Usage: strainFlye align [OPTIONS] READS...

  Align reads to contigs, and filter the resulting alignment.

  Files of reads should be in the FASTA or FASTQ formats; GZIP'd files are
  allowed.

  This command involves multiple steps, including:

    1) Align reads to contigs (using minimap2) to generate a SAM file
    2) Convert this SAM file to a sorted and indexed BAM file
    3) Filter overlapping supplementary alignments (OSAs) from this BAM file
    4) Filter partially-mapped reads from this BAM file

  Note that we only sort the alignment file once, although we do re-index it
  after the two filtering steps. This decision is motivated by
  https://www.biostars.org/p/131333/#131335.

Options:
  -c, --contigs PATH              FASTA file of contigs to which reads will be
                                  aligned.  [required]
  -g, --graph PATH                GFA 1-formatted file describing an assembly
                                  graph of the contigs. Thi

In [4]:
# For --contigs, we can use the FASTA file we just generated above.
# Reads file(s) are specified last, after all of the other parameters.
# (These comments are kept outside of the command: a comment placed between
# backslash-continued lines would cut the command short.)
!strainFlye align \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
    --graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/alignment \
    /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz \
    /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/*.fasta.gz

This generates a BAM file (`final.bam`) and BAM index file (`final.bam.bai`) in the specified output directory.

We can use this BAM file for many analyses downstream—the first of these will be mutation calling.

## 5. Optional: Filter the FASTA file in order to focus on certain contigs

We just aligned our dataset's reads against *all* contigs in the assembly graph. This is standard practice (see, e.g., [this tutorial](https://astrobiomike.github.io/genomics/metagen_anvio#mapping-our-reads-to-the-assembly-they-built)); aligning reads against all contigs probably yields a more accurate alignment than just aligning reads against a subset of these contigs (although proving whether this is "best practice" is a challenging question, and one that I will sidestep right now).

However, now that we have this alignment, we don't necessarily need to perform mutation calling, phasing, etc. on all contigs (although, if you want to, you could!). To speed up the rest of this tutorial, **we will focus solely on the "long" contigs in this dataset**: here, we will define a contig as "long" if its length is at least 1 Mbp (aka 1,000,000 bp). In theory, these long contigs represent putative metagenome-assembled genomes (MAGs).

Of course, if you prefer, you could apply more sophisticated criteria to pick which contigs to focus on—maybe you'd like to also focus on contigs with high coverages, or maybe on contigs with good [CheckM](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484387/) completeness or contamination values. Or maybe you'd like to keep considering all contigs in the full dataset! Your decision should depend on your goals, and your dataset.

In any case, how do we "focus on" certain contigs? **We can filter our FASTA file to a subset of contigs present in the full dataset**, and use this filtered FASTA file for all downstream analyses. As an example of this, we will use Python (in particular, the [scikit-bio](http://scikit-bio.org/) library) to filter our FASTA file to all long contigs.

In [15]:
# Produce a filtered FASTA file containing only contigs >= 1 Mbp long
# (This uses scikit-bio; see http://scikit-bio.org/ for more details.)

input_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta"
output_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta"
len_threshold = 1000000

t0 = time.time()
num_long_contigs = 0
with open(output_contigs_fp, "w") as of:
    for contig in skbio.io.read(input_contigs_fp, format="fasta", constructor=skbio.DNA):
        if len(contig) >= len_threshold:
            skbio.io.write(contig, format="fasta", into=of)
            num_long_contigs += 1
t1 = time.time()

print(f"Found {num_long_contigs:,} contigs with lengths \u2265 {len_threshold:,} bp. Took {t1 - t0:,.2f} sec.")

Found 468 contigs with lengths ≥ 1,000,000 bp. Took 12.30 sec.


The remainder of this tutorial will focus on these 468 long contigs.

In any case, now we can get on to some more interesting stuff!

## 6. Perform naïve mutation calling, then estimate and fix mutation calls' FDRs

**You can skip this step if:** You already have a BCF file describing single-nucleotide, non-multi-allelic mutations in your contigs.

The analyses downstream of this step (hotspot/coldspot identification, phasing) take as input a set of identified single-nucleotide mutations (or, if you prefer to use different terminology, "called variants", "called SNVs", ...) in which we have some confidence. How does strainFlye identify these mutations?

There are a few steps (as our paper describes). First, we will **naïvely call mutations** using a simple threshold-based method (referred to as "NaiveFreq" in the paper). We can then **estimate the false-discovery rates (FDRs) of the mutations called for each contig** using the target-decoy approach. If desired, we can then adjust the called mutations to **fix the estimated FDRs of these mutation calls** below a specified threshold.

### 6.1. $p$-mutations and $r$-mutations?

So, our first step will be performing this simple threshold-based calling. What do we mean by "threshold" here?

strainFlye supports calling two basic types of mutations: $p$-mutations and $r$-mutations. The docs explain the difference between these two types best:

In [3]:
!strainFlye call

Usage: strainFlye call [OPTIONS] COMMAND [ARGS]...

  [+] Call mutations in contigs naïvely & compute diversity indices.

  Consider a position "pos" in a contig. Using the alignment, we can count how
  many reads have a (mis)match operation to "pos" with one of the four
  nucleotides (A, C, G, T; we ignore degenerate nucleotides in reads). We
  represent these four nucleotides' counts at pos as follows:

      N1 = # reads of the most-common aligned nucleotide at pos,
      N2 = # reads of the second-most-common aligned nucleotide at pos,
      N3 = # reads of the third-most-common aligned nucleotide at pos,
      N4 = # reads of the fourth-most-common aligned nucleotide at pos.

  (We break ties arbitrarily.)

  strainFlye supports two types of naïve mutation calling based on these
  counts: p-mutations and r-mutations. These are described below.

  p-mutations (naïve percentage-based mutation calling)
  -----------------------------------------------------

  T
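
The help text is cut off above, so here is a rough sketch of the two kinds of thresholds, built only on the N1 to N4 counts it defines. It assumes that a $p$-mutation compares the frequency of the second-most-common nucleotide, N2 / (N1 + N2 + N3 + N4), against $p$, and that an $r$-mutation compares the raw count N2 against $r$. The full help text and the paper are the authoritative definitions, and any extra checks strainFlye applies are not shown. All function names here are hypothetical.

```python
def ranked_counts(counts):
    """Return (N1, N2, N3, N4): the A/C/G/T counts sorted from most to least common."""
    return tuple(sorted((counts.get(nt, 0) for nt in "ACGT"), reverse=True))


def is_p_mutation(counts, p_percent):
    """Assumed rule: the second-most-common nucleotide's frequency is >= p (in percent)."""
    n1, n2, n3, n4 = ranked_counts(counts)
    total = n1 + n2 + n3 + n4
    if total == 0:
        return False
    return 100 * n2 / total >= p_percent


def is_r_mutation(counts, r):
    """Assumed rule: the second-most-common nucleotide is supported by >= r reads."""
    return ranked_counts(counts)[1] >= r


# Toy position: 1,000 aligned reads, 8 of which support the second-most-common nucleotide
pos_counts = {"A": 990, "C": 8, "G": 2, "T": 0}
print(is_p_mutation(pos_counts, 0.5))  # frequency is 0.8%, compared against p = 0.5%
print(is_p_mutation(pos_counts, 1))    # frequency is 0.8%, compared against p = 1%
print(is_r_mutation(pos_counts, 5))    # N2 = 8, compared against r = 5
```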

### 6.2. Understanding these (sub)commands

First off, note that `strainFlye call` doesn't do anything besides show help info if you run it by itself. This is because, unlike `strainFlye align`, `strainFlye call` has two subcommands: `p-mutation` and `r-mutation`. Which of these you use will depend on how you want to naïvely call mutations. You can invoke one of these subcommands by writing out the full chain of commands: for example, `strainFlye call p-mutation`.

#### 6.2.1. Input and output

Probably the most important parameter at this step is the *minimum threshold*. Both of these subcommands, `strainFlye call p-mutation` and `strainFlye call r-mutation`, take as input a minimum value for their corresponding threshold (either `--min-p` or `--min-r`).

These commands each output:

1. A __BCF (binary [variant call format](https://samtools.github.io/hts-specs/VCFv4.3.pdf)) file__ describing all mutations called naïvely across the contigs, based on the minimum $p$ or $r$ threshold set (`--min-p` or `--min-r`).

2. A __TSV ([tab separated values](https://en.wikipedia.org/wiki/Tab-separated_values)) file__ describing the contigs' computed diversity indices, for various values of $p$ or $r$ (configurable using the `--div-index-p-list` or `--div-index-r-list` parameters).
  - Long story short, diversity indices indicate how many of a contig's "sufficiently-covered" positions have called mutations: in general, higher diversity indices imply higher mutation rates.
  - If enough positions in a contig are not "sufficiently-covered," then we will not compute the diversity index for this contig. Please see the Supplemental Material "Diversity index details" in the strainFlye paper for details.
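
To make the diversity index idea a bit more concrete, here is a loose sketch based only on the description above: among a contig's sufficiently-covered positions, what fraction have a called mutation? The parameters `min_cov` and `min_frac_covered` (and the default of 0.5) are hypothetical placeholders; the actual coverage cutoffs and the exact formula are given in the paper's Supplemental Material, so please treat this as an illustration rather than strainFlye's computation.

```python
def diversity_index(coverages, mutated, min_cov, min_frac_covered=0.5):
    """Fraction of sufficiently-covered positions that have a called mutation.

    coverages: per-position coverage values for one contig
    mutated: set of 0-indexed positions with a called mutation
    Returns None if too few positions are sufficiently covered.
    (min_cov and min_frac_covered are hypothetical parameters.)
    """
    covered = [i for i, cov in enumerate(coverages) if cov >= min_cov]
    if not covered or len(covered) < min_frac_covered * len(coverages):
        return None
    return sum(1 for i in covered if i in mutated) / len(covered)


# Toy contig with five positions; positions 1 and 2 have called mutations
coverages = [100, 120, 5, 110, 130]
mutated = {1, 2}
print(diversity_index(coverages, mutated, min_cov=50))   # 4 covered positions, 1 mutated
print(diversity_index(coverages, mutated, min_cov=125))  # only 1 of 5 positions covered
```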

#### 6.2.2. Interpreting the output

The default minimum value of $p$ (or $r$) used in these commands is fairly low. As you might expect, using such a low threshold for calling a position as a mutation may yield many false positives: we will almost certainly identify many real mutations, but also many "false" mutations that happen to occur as the result of sequencing errors, alignment errors, etc. Viewed another way, the __[false discovery rate (FDR)](https://en.wikipedia.org/wiki/False_discovery_rate)__ (defined as the ratio of false positives to total true + false positives) of the mutation calls generated at this step will probably be high—although this depends on a variety of factors, including which contig(s) we are focusing on mutations in, what our goals are in the first place, etc.

##### FDR estimation and fixing

After we run this command, we can use strainFlye's FDR estimation and fixing functionality to attempt to address this problem. This process involves adjusting the "minimum" value of $p$ (or $r$) used for each contig to reduce the FDR as needed.

Depending on your dataset and your goals, you may or may not want to do this: as of writing, many of the analyses in the strainFlye paper don't perform FDR fixing on their input mutations, and instead just make use of "un-fixed" naïvely called $p$-mutations. (That being said, this is mostly a historical artifact of us implementing the FDR fixing code close to the end of the project.)

In general, if you are going to perform FDR estimation / fixing, we recommend only doing this for $p$-mutations (and not $r$-mutations); this is discussed in the strainFlye paper, in the Supplemental Material section named "Identifying mutations based solely on read counts."

### 6.3. Naïvely call $p$-mutations ($p = 0.15\%$) and compute diversity indices for various values of $p$

Now that we know what we're doing, we're ready to call mutations and compute diversity indices! We'll do $p$-mutation calling at a minimum $p$ of $0.15\%$, which matches what we used for Figure 2 in the paper. The default diversity index values of $p$ (ranging from $0.5\%$ to $50\%$) should be good for us.

Like alignment, this command will also take a while—we need to check each position in the alignment for each of the input contigs. If you'd like, you can use the `--verbose` flag to display some extra information while this command is running, to make the wait more tolerable (and assure you that it isn't frozen somewhere).

In [4]:
!strainFlye call p-mutation

Usage: strainFlye call p-mutation [OPTIONS]

  Call p-mutations and compute diversity indices.

  The primary parameter for this command is the lower bound of p, defined by
  --min-p. The BCF output will include "mutations" for all positions that pass
  this (likely very low) threshold; this BCF can be filtered using the
  utilities contained in the "strainFlye fdr" module.

Options:
  -c, --contigs PATH              FASTA file of contigs in which to naïvely
                                  call mutations. All contigs in this FASTA
                                  file should also be contained in the BAM
                                  file; it's ok if the BAM file contains
                                  contigs not in this FASTA file (we'll ignore
                                  them).  [required]
  -b, --bam PATH                  Sorted and indexed BAM file representing an
                                  alignment of reads to contigs.  [required]
  --min-

In [None]:
!strainFlye call p-mutation \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
    --bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
    --min-p 15 \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/call-p15
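
A quick note on units: this section uses $p = 0.15\%$, while the command above passes `--min-p 15`. This suggests that `--min-p` is expressed in hundredths of a percent; please check the full `--min-p` help text to confirm, since it is cut off above. Under that assumption, the conversion looks like the following sketch (both helper names are hypothetical):

```python
def percent_to_min_p(p_percent):
    """Convert p in percent (e.g. 0.15) to the assumed --min-p scale (e.g. 15)."""
    return round(p_percent * 100)


def min_p_to_percent(min_p):
    """Convert a value on the assumed --min-p scale (e.g. 15) back to percent (e.g. 0.15)."""
    return min_p / 100


print(percent_to_min_p(0.15))
print(min_p_to_percent(15))
```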

### 6.4. Estimating FDRs using the target-decoy approach

We now have both our initial mutation calls (which, as we've discussed, probably have a high FDR) and information about our contigs' diversity indices. We will use the __target-decoy approach__ to attempt to estimate and thus control the FDR of our mutation calls. This is done by the `strainFlye fdr estimate` and `strainFlye fdr fix` commands.

As discussed in our paper, we can select one of our $C$ contigs as a __decoy contig__ (a.k.a. a decoy genome), and compute a mutation rate for it ($\text{rate}_{\text{decoy}}$). For each of the other $C - 1$ __target contigs__, we can estimate the FDR of identified mutations in this contig as $\dfrac{\text{rate}_{\text{decoy}}}{\text{rate}_{\text{target}}}$.
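
As a small worked example of this ratio, the sketch below computes an FDR estimate for one target contig. It assumes, for illustration, that a contig's mutation rate is simply its number of called mutations divided by its length (the paper gives the exact rate definition), and the numbers are made up. The function names, and the convention of returning `None` for a target contig without any mutations, are hypothetical.

```python
def mutation_rate(num_mutations, length):
    """Assumed rate definition: called mutations per position."""
    return num_mutations / length


def estimate_fdr(decoy_rate, target_rate):
    """Target-decoy FDR estimate: rate_decoy / rate_target.

    Returns None if the target contig has no called mutations, since the
    ratio is undefined in that case (a convention chosen for this sketch).
    """
    if target_rate == 0:
        return None
    return decoy_rate / target_rate


# Made-up numbers: a decoy contig with few mutations, and a more mutated target contig
rate_decoy = mutation_rate(20, 2_000_000)
rate_target = mutation_rate(3_000, 1_500_000)
fdr = estimate_fdr(rate_decoy, rate_target)
print(f"Estimated FDR: {fdr:.2%}")
```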

#### 6.4.1. Manually selecting a decoy contig

If you'd like, we could go through the diversity indices produced by `strainFlye call p-mutation` ourselves, in an attempt to select a reasonable-seeming decoy contig. **[This notebook](https://nbviewer.org/github/fedarko/strainFlye/blob/main/docs/AnalyzingDiversityIndices.ipynb)** demonstrates this sort of process.

#### 6.4.2. Letting `strainFlye fdr estimate` automatically select a decoy contig

The optional notebook discussed above shows that `edge_6104` is probably a good decoy contig, so we could if desired just pass it to `strainFlye fdr estimate` using that command's `-dc` or `--decoy-contig` option. However, to illustrate another option, we'll instead pass our diversity index TSV file to `strainFlye fdr estimate` and let it do the job of selecting a decoy contig. (Spoiler alert: it'll select `edge_6104` anyway.)

The Methods section of our paper provides details about how the automatic decoy contig selection algorithm works. If you'd prefer, you can also go through the exact source code for it, which is reasonably well-documented: see the `autoselect_decoy()` function in **[this file](https://github.com/fedarko/strainFlye/blob/main/strainflye/fdr_utils.py#L101)**.

In [5]:
!strainFlye fdr

Usage: strainFlye fdr [OPTIONS] COMMAND [ARGS]...

  [+] Estimate and fix FDRs for contigs' naïve mutation calls.

Options:
  -h, --help  Show this message and exit.

Commands:
  estimate  Estimate the FDRs of contigs' mutation calls.
  fix       Fix contigs' mutation calls' estimated FDRs to an upper limit.


In [6]:
!strainFlye fdr estimate

Usage: strainFlye fdr estimate [OPTIONS]

  Estimate the FDRs of contigs' mutation calls.

  We do this using the target-decoy approach (TDA). Given a set of C contigs,
  we select a "decoy contig" with relatively few called mutations. We then
  compute a mutation rate for this decoy contig, and use this mutation rate
  (along with the mutation rates of the other C - 1 "target" contigs) to
  estimate the FDRs of all of these target contigs' mutation calls.

  We can produce multiple FDR estimates for a single target contig's calls by
  varying the p or r threshold used (from the --min-p or --min-r threshold
  used to generate the input BCF file, up to the --high-p or --high-r
  threshold given here). Using this information (and information about the
  numbers of mutations called per megabase), we can plot an FDR curve for a
  given target contig's mutation calls.

  This command accepts an input BCF file of p- or r-mutations; however, in
  general we recommend using p

In [7]:
!strainFlye fdr estimate \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
    --bam /Poppy/mfedarko/sheepgut/main-workflow/output/fully-filtered-and-sorted-aln.bam \
    --bcf /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf \
    --diversity-indices /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv \
    --decoy-contexts Everything \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info

--------
strainFlye fdr estimate @ 0.00 sec: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta
Input BAM file: /Poppy/mfedarko/sheepgut/main-workflow/output/fully-filtered-and-sorted-aln.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf
Input diversity indices file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv
Input manually-set decoy contig: None
Input decoy contig context-dependent position / mutation type(s): ('Full', 'CP2', 'Tv', 'Nonsyn', 'Nonsense', 'CP2Tv', 'CP2Nonsyn', 'CP2Nonsense', 'TvNonsyn', 'TvNonsense', 'CP2TvNonsense')
Input high p threshold (only used if the BCF describes p-mutations): 500
Input high r threshold (only used if the BCF describes r-mutations): 100
Input min length of a potential decoy contig (only used if diversity indices are specified): 1000000
Input min average coverage of a potential decoy contig (only used if diversity indices are spec

Let's check on the TSV files that got written to the output directory. We should see one file for every decoy context, indicating the FDR estimates for each target contig for this context; and one lone "number of mutations per Mb" file, indicating the number of mutations per megabase for each target contig.

In general, we can plot these as FDR curves by using the FDR estimates as x-axis values and the "number of mutations per Mb" values as y-axis values.

In [7]:
!ls /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info/

fdr-CP2Nonsense.tsv    fdr-CP2Tv.tsv	 fdr-TvNonsense.tsv
fdr-CP2Nonsyn.tsv      fdr-Full.tsv	 fdr-TvNonsyn.tsv
fdr-CP2.tsv	       fdr-Nonsense.tsv  fdr-Tv.tsv
fdr-CP2TvNonsense.tsv  fdr-Nonsyn.tsv	 num-mutations-per-mb.tsv
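
As a starting point for plotting, here is a sketch of how one of the FDR estimate files and the "number of mutations per Mb" file could be paired up into (x, y) points for a single target contig. The table layout used here (one row per contig, a `Contig` column, and one column per threshold such as `p15`) is an assumption made for illustration, and the numbers are made up; please inspect your actual TSV files and adjust accordingly.

```python
import csv
import io


def load_table(handle, index_col="Contig"):
    """Load a TSV into {contig: {column: float}} (assumed layout; see the text above)."""
    rows = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row.pop(index_col)
        rows[name] = {col: float(val) for col, val in row.items()}
    return rows


def fdr_curve_points(fdr_row, per_mb_row):
    """Pair each threshold's FDR estimate (x) with its number of mutations per Mb (y)."""
    return [(fdr_row[col], per_mb_row[col]) for col in fdr_row if col in per_mb_row]


# Made-up stand-ins for, e.g., fdr-Full.tsv and num-mutations-per-mb.tsv
fdr_text = "Contig\tp15\tp50\tp100\nedge_1\t8.0\t2.0\t0.5\n"
per_mb_text = "Contig\tp15\tp50\tp100\nedge_1\t900.0\t400.0\t150.0\n"

fdrs = load_table(io.StringIO(fdr_text))
per_mb = load_table(io.StringIO(per_mb_text))
points = fdr_curve_points(fdrs["edge_1"], per_mb["edge_1"])
print(points)
```

These points could then be passed to the plotting library of your choice.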


### 6.5. Plotting FDR curves

**[This notebook (in the analysis code repository)](https://nbviewer.org/github/fedarko/sheepgut/blob/main/sf-analyses/sheep/3-PlotFDRCurves.ipynb)** demonstrates how we can plot some or all of these FDR estimates for our target contigs. (I put a decent amount of work into making those plots look fancy for the paper, but you don't need to make things that complicated unless you want to!)

### 6.6. Fixing mutation calls' FDRs to an upper limit of $1\%$

FDR estimates are nice, but we can take things a step further. strainFlye has the ability to _fix_ the estimated FDR for each target contig's mutation calls to a specified upper limit.

Since we already have FDR curves showing how, as $p$ varies, the estimated FDR for each target contig varies, fixing the estimated FDR to an upper limit $F$ amounts to choosing (for each target contig) a value of $p$ that yields an FDR ≤ $F$. The `strainFlye fdr fix` command takes care of this for us.
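
The selection step can be sketched in a few lines of Python. This is a simplified illustration, not strainFlye's code: it assumes we have, for one target contig, a mapping from each candidate threshold to its estimated FDR, and it picks the lowest threshold whose estimated FDR is at most the limit (the command's help text describes choosing the lowest such parameter). The function name and the numbers are made up, and the handling of "indisputable" mutations is not shown.

```python
def lowest_threshold_meeting_fdr(fdr_by_threshold, max_fdr):
    """Return the lowest threshold with an estimated FDR <= max_fdr, or None if there is none."""
    for threshold in sorted(fdr_by_threshold):
        if fdr_by_threshold[threshold] <= max_fdr:
            return threshold
    return None


# Made-up FDR estimates (in percent) for one target contig, keyed by threshold
fdr_by_p = {15: 8.0, 50: 2.0, 100: 0.9, 200: 0.1}
print(lowest_threshold_meeting_fdr(fdr_by_p, max_fdr=1))
print(lowest_threshold_meeting_fdr(fdr_by_p, max_fdr=0.05))
```

Choosing the lowest acceptable threshold keeps as many mutation calls as possible while still respecting the FDR limit. For some contigs no threshold may qualify, which is why the sketch can return `None`.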

In [8]:
!strainFlye fdr fix

Usage: strainFlye fdr fix [OPTIONS]

  Fix contigs' mutation calls' estimated FDRs to an upper limit.

  This takes as input the estimated FDRs from "strainFlye fdr estimate" (if
  you used multiple decoy contexts, then you will need to choose which set of
  FDR estimates to use here) to guide us on how to fix the FDR for each
  contig. Note that mutations that passed the "high" p or r threshold
  specified for "strainFlye fdr estimate", and thus were not used for FDR
  estimation, will all be included in the output BCF file from this command;
  these mutations are considered "indisputable."

  We include indisputable mutations from the decoy contig and from all target
  contigs in our output BCF file. We will only consider including non-
  indisputable mutations from the target contigs: the decision of which non-
  indisputable mutations will be included is based on the lowest p or r
  parameter for a target contig that yields an estimated FDR ≤ the fixed FDR
  given 

In [30]:
!strainFlye fdr fix \
    --bcf /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf \
    --fdr-info /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info.tsv \
    --fdr 1 \
    --output-bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf

--------
strainFlye fdr fix @ 0.00 sec: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf
Input FDR estimate file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info.tsv
Input FDR to fix mutation calls at: 1.0
Verbose?: No
Output BCF file with mutation calls at the fixed FDR: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
--------
strainFlye fdr fix @ 0.00 sec: Loading and checking BCF and TSV files...
strainFlye fdr fix @ 14.93 sec: Looks good so far; decoy contig seems to be edge_6104.
strainFlye fdr fix @ 14.93 sec: Looks like the cutoff for "indisputable" mutations was p = 500.
strainFlye fdr fix @ 14.93 sec: All mutations passing this cutoff will be included in the output BCF file.
--------
strainFlye fdr fix @ 14.93 sec: Based on the FDR information, finding optimal values of p for each contig...
strainFlye fdr fix @ 14.95 sec: Done.
strainFlye fdr fix @ 14.95 sec: For 155 / 467 contigs, there exist values of p (at least, cons

It took us a few steps, but we have now generated a file (`p15-fdr1pct.bcf`) of $p$-mutation calls at a fixed (estimated) FDR of 1%.

Although our methodology has a few limitations (e.g. we don't support calling multi-allelic mutations yet), this BCF file can be used downstream for many types of analyses. In the next sections of the tutorial we'll demonstrate the additional commands supported by strainFlye, most of which make use of these mutation calls in some way.

## 7. Identify hotspots and coldspots

We've called mutations and estimated these calls' FDRs. Now we can get to the fun part: what's going on with these mutations?

Often, we're interested in analyzing mutations' locations in the contigs. Are there any particular "hotspot" regions where there are surprisingly many mutations? Are there any "coldspot" regions where there are, surprisingly, no or few mutations?

We've kept strainFlye's functionality for identifying these types of regions fairly minimal at the moment. Here we'll demonstrate identifying very basic hotspots and coldspots using `strainFlye spot`'s commands.

In [9]:
!strainFlye spot

Usage: strainFlye spot [OPTIONS] COMMAND [ARGS]...

  [+] Identify putative mutational hotspots or coldspots in contigs.

  Many methods exist for identifying these sorts of hotspots or coldspots;
  strainFlye's implementations of these methods are intended mostly as a quick
  proof-of-concept for replicating the results shown in our paper, and are not
  extremely "feature-rich" quite yet.

Options:
  -h, --help  Show this message and exit.

Commands:
  hot-features  Identify hotspot features (for example, genes).
  cold-gaps     Identify long coldspot "gaps" without any mutations.
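
To give a feel for what a coldspot "gap" is, here is a minimal sketch that finds stretches of a contig without any mutated positions. It treats the contig as linear, uses 1-indexed inclusive coordinates, and expects a minimum gap length of at least 1; these choices, the function name, and the toy positions are assumptions for illustration, and `strainFlye spot cold-gaps` may handle details (for example, circular contigs) differently.

```python
def find_cold_gaps(mutated_positions, contig_length, min_length):
    """Return (start, end) gaps (1-indexed, inclusive) with no mutations and length >= min_length."""
    gaps = []
    prev = 0  # sentinel "position" just before the start of the contig
    # Add a sentinel just past the end of the contig so that the final gap is considered too
    for pos in sorted(set(mutated_positions)) + [contig_length + 1]:
        start, end = prev + 1, pos - 1
        if end - start + 1 >= min_length:
            gaps.append((start, end))
        prev = pos
    return gaps


# Toy contig of length 30 with mutations at positions 5, 6, and 20
print(find_cold_gaps([5, 6, 20], contig_length=30, min_length=5))
```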


### 7.1. Identify hotspot features

In [10]:
!strainFlye spot hot-features

Usage: strainFlye spot hot-features [OPTIONS]

  Identify hotspot features (for example, genes).

  By "feature", we refer to a single continuous region within a contig, as
  described in the file given for --features. These regions could describe
  anything: predicted protein-coding genes, introns or exons, intergenic
  regions of interest, etc. For now, we treat each feature independently (e.g.
  we don't lump together exons from the same "Parent" gene; each feature is
  considered separately as a potential "hotspot").

  You can configure whether or not we classify a feature as a hotspot by
  adjusting the --min-num-mutations and --min-perc-mutations parameters; at
  least one of these parameters must be specified. If both parameters are
  specified, then both checks (number of mutations in a feature, and
  percentage of mutations in a feature) will need to pass in order for us to
  label a feature as a hotspot.

Options:
  -b, --bcf PATH                  Indexed 
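
The way these two checks combine can be sketched as follows. This mirrors the logic described in the help text above (at least one threshold must be given; if both are given, both must pass), but it is only an illustration: the function name is hypothetical, and the percentage is assumed here to mean the number of mutations in the feature divided by the feature's length.

```python
def is_hotspot(num_mutations, feature_length,
               min_num_mutations=None, min_perc_mutations=None):
    """Decide whether a feature is a hotspot based on one or both thresholds."""
    if min_num_mutations is None and min_perc_mutations is None:
        raise ValueError("At least one threshold must be specified.")
    if min_num_mutations is not None and num_mutations < min_num_mutations:
        return False
    if min_perc_mutations is not None:
        # Assumed definition: percentage of the feature's positions that are mutated
        perc = 100 * num_mutations / feature_length
        if perc < min_perc_mutations:
            return False
    return True


# Toy features with 12 mutations each
print(is_hotspot(12, 900, min_num_mutations=10))
print(is_hotspot(12, 900, min_num_mutations=10, min_perc_mutations=2))
print(is_hotspot(12, 300, min_num_mutations=10, min_perc_mutations=2))
```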

#### 7.1.1. A note about "features"

Although we should be familiar with the FASTA and BCF input files by this point, the `-f` / `--features` input (a GFF3 file) may be surprising. strainFlye leaves the task of creating this file up to the user.

Predicted genes' coordinates are probably the most obvious type of "feature" for which we could look for hotspots. If you don't have gene predictions for your contigs yet, [Prodigal](https://github.com/hyattpd/Prodigal) is a good option (and should have been installed along with strainFlye, since it's used internally elsewhere in strainFlye). Here, we'll use it to predict protein-coding genes across all contigs.

It's important to note that Prodigal does not predict eukaryotic genes (i.e. genes that are split up into introns and exons). These genes will thus not be a perfect representation of all protein-coding genes in all contigs in the dataset, since we know that this sample does contain at least some eukaryotic genomes. (However, if you use another tool for predicting eukaryotic genes that produces GFF3 output, then these should also be usable as "features" for this command.)

In [None]:
# See https://github.com/hyattpd/Prodigal/wiki/cheat-sheet for details about these options.
#
# Note that, for the paper, I ran Prodigal in "normal" mode on certain contigs individually
# (https://github.com/fedarko/sheepgut/blob/main/inspect-seqs/prodigal.py), but here we just
# run Prodigal in "anonymous" mode on all contigs at once. The results should be fairly similar,
# although there'll probably be some differences.
!prodigal \
    -i /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
    -f gff \
    -c \
    -p meta \
    -o /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous.gff

#### 7.1.2. Running the command to identify hotspot features (hotspot predicted genes)

Now that we have our gene predictions, let's see if any of them have a lot of mutations. (Based on our findings in the paper, we know that these sorts of hotspots do exist in this dataset.)

`strainFlye spot hot-features` supports two types of basic thresholds for labelling a feature as a hotspot, `--min-num-mutations` and `--min-perc-mutations`. We'll use both here, and label a feature a "hotspot" if it meets both of the following criteria:

1. it contains at least 5 mutations, and
2. at least 2% of its positions have mutations.

In [20]:
!strainFlye spot hot-features \
    --bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
    --features /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous.gff \
    --min-num-mutations 5 \
    --min-perc-mutations 2 \
    --output-hotspots /Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv

Using strainFlye version "0.1.0-dev".
--------
spot hot-features @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Input feature file: /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous.gff
Input minimum number of mutations needed to call a feature a hotspot: 5
Input minimum % of mutations needed to call a feature a hotspot: 2.0
Output file describing hotspot features: /Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv
--------
spot hot-features @ 0.00s: Loading and checking the BCF file...
spot hot-features @ 12.20s: Looks good so far.
--------
spot hot-features @ 12.20s: Going through features in the GFF3 file and identifying hotspot features...
spot hot-features @ 100.13s: Identified 57,188 hotspot feature(s) across all 468 contigs in the BCF file.
--------
spot hot-features @ 100.13s: Writing out this information to a TSV file...
spot hot-features @ 100.29s: Done.
--------
spot hot-features @ 100.30s: Done.


The output of this command isn't anything special: it's a TSV file in which each row describes an identified hotspot feature, defining "hotspots" based on the `--min-num-mutations` and `--min-perc-mutations` options we set earlier. Let's load this file using `pandas.read_csv()` and get a brief sense of what it looks like:

In [21]:
hotspots = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv", sep="\t")
hotspots.head()

Unnamed: 0,Contig,FeatureID,FeatureStart_1IndexedInclusive,FeatureEnd_1IndexedInclusive,NumberMutatedPositions,PercentMutatedPositions
0,edge_8,8_765,865350,866369,32,3.14
1,edge_8,8_781,873414,873626,5,2.35
2,edge_8,8_787,878293,879429,48,4.22
3,edge_8,8_788,879444,881552,101,4.79
4,edge_8,8_789,881718,882956,42,3.39
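As a quick sanity check on these columns, the sketch below recomputes `PercentMutatedPositions` from a feature's coordinates and applies the two thresholds we passed to the command. It assumes that the percentage is the number of mutated positions divided by the feature's length (end - start + 1), which agrees with the rows shown above. The `percent_mutated` and `is_hotspot` helpers are hypothetical, not strainFlye's own code.

```python
def percent_mutated(start: int, end: int, num_mutated: int) -> float:
    """Percentage of positions in [start, end] (1-indexed, inclusive) that are mutated."""
    length = end - start + 1
    return 100 * num_mutated / length


def is_hotspot(start, end, num_mutated, min_num_mutations=None, min_perc_mutations=None):
    """Apply whichever thresholds were given; every given threshold must pass."""
    if min_num_mutations is None and min_perc_mutations is None:
        raise ValueError("At least one threshold must be specified.")
    if min_num_mutations is not None and num_mutated < min_num_mutations:
        return False
    if min_perc_mutations is not None:
        if percent_mutated(start, end, num_mutated) < min_perc_mutations:
            return False
    return True


# First two rows of the table above: (start, end, number of mutated positions).
for start, end, n in [(865350, 866369, 32), (873414, 873626, 5)]:
    print(round(percent_mutated(start, end, n), 2), is_hotspot(start, end, n, 5, 2))
```

For these two rows the recomputed percentages round to 3.14 and 2.35, matching the table, and both features pass the thresholds of 5 mutations and 2%.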


Depending on your goals, you could focus on, for example, the hotspot genes with the highest mutation rates in a contig. We can find these for `edge_1671` ("BACT1", as we name it in our paper) by filtering and then sorting the DataFrame:

In [23]:
bact1_hotspots = hotspots[hotspots["Contig"] == "edge_1671"]

# Sort all the "hotspot genes" in BACT1 from high to low mutation rates (% of positions mutated).
# This is similar to the table of highly-mutated genes in the Supplemental Material of our paper,
# although unlike that table this uses FDR-fixed mutations (and it also uses different gene
# predictions, as discussed above).
bact1_hotspots.sort_values(["PercentMutatedPositions"], ascending=False)

Unnamed: 0,Contig,FeatureID,FeatureStart_1IndexedInclusive,FeatureEnd_1IndexedInclusive,NumberMutatedPositions,PercentMutatedPositions
12437,edge_1671,1671_860,1041656,1042084,86,20.05
12473,edge_1671,1671_1217,1460034,1460204,34,19.88
12373,edge_1671,1671_183,206606,207304,131,18.74
12467,edge_1671,1671_1168,1402353,1402718,61,16.67
12456,edge_1671,1671_1102,1326288,1327073,120,15.27
...,...,...,...,...,...,...
12422,edge_1671,1671_751,916664,918493,37,2.02
12375,edge_1671,1671_248,272253,273047,16,2.01
12461,edge_1671,1671_1131,1361175,1361672,10,2.01
12421,edge_1671,1671_750,914486,916576,42,2.01


### 7.2. Identify coldspot gaps

Similarly, strainFlye supports the identification of basic "coldspots"—here, defined as long gaps without any mutations. The main parameter is the minimum length needed to define a gap as a "coldspot." Let's test this out on the SheepGut dataset:

In [11]:
!strainFlye spot cold-gaps

Usage: strainFlye spot cold-gaps [OPTIONS]

  Identify long coldspot "gaps" without any mutations.

  To clarify, we define a "gap" of length L on a contig as a collection of
  continuous positions [N, N + 1, ... N + L - 2, N + L - 1] in which no
  positions are mutations (based on the input BCF file).

  If the --circular flag is specified, then we can loop around the contig from
  right to left; otherwise, the left and right sides of the contig are hard
  boundaries. To give an example of this, consider a 9-nucleotide contig that
  has mutations at positions 4 and 6:

                             Mut.    Mut.
                  1   2   3   4   5   6   7   8   9

  If --circular is specified, then this contig has two gaps: one gap of length
  1 (covering just position 5, between the two mutations), and another gap of
  length 6 (starting at position 7 and looping around to position 3: [7, 8, 9,
  1, 2, 3]).

  If --circular is not specified, then this contig has th

In [18]:
!strainFlye spot cold-gaps \
    --bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
    --output-coldspots /Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv

Using strainFlye version "0.1.0-dev".
--------
spot cold-gaps @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Minimum coldspot gap length: 5,000 bp
Check for circular coldspot gaps?: No
Compute exact longest-gap p-values?: Yes
Output file describing coldspot gaps: /Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv
--------
spot cold-gaps @ 0.00s: Loading and checking the BCF file...
spot cold-gaps @ 12.10s: Looks good so far.
--------
spot cold-gaps @ 12.10s: Going through contigs and identifying coldspot gaps...
spot cold-gaps @ 15.71s: Identified 15,258 coldspot gap(s) across all 468 contigs in the BCF file.
--------
spot cold-gaps @ 15.71s: Writing out this information to a TSV file...
spot cold-gaps @ 15.72s: Done.
--------
spot cold-gaps @ 15.72s: Done.


Again, `cold-gaps` will output a simple TSV file describing its identified coldspots:

In [19]:
coldspots = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv", sep="\t")
coldspots.head()

Unnamed: 0,Contig,Start_1IndexedInclusive,End_1IndexedInclusive,Length,LongestGap_P_Value
0,edge_8,383672,681281,297610,
1,edge_8,681283,715603,34321,
2,edge_8,717621,865477,147857,
3,edge_8,885778,906394,20617,
4,edge_8,913216,920714,7499,


There are many ways we could make use of this information. As in the hotspot example, we could sort these gaps from longest to shortest for the BACT1 contig; this is shown below. (Of the gaps that we identify, the longest one in each contig is assigned a _p_-value; please see the Supplemental Material of our paper for details on how we compute this, and the assumptions made.)

In [20]:
coldspots[coldspots["Contig"] == "edge_1671"].sort_values(["Length"], ascending=False)

Unnamed: 0,Contig,Start_1IndexedInclusive,End_1IndexedInclusive,Length,LongestGap_P_Value
2387,edge_1671,1216892,1239536,22645,1.168921e-103
2388,edge_1671,1618448,1637614,19167,
2389,edge_1671,1797824,1813581,15758,
2385,edge_1671,1194740,1207381,12642,
2390,edge_1671,2140896,2153394,12499,
2383,edge_1671,979572,990495,10924,
2380,edge_1671,740701,750399,9699,
2386,edge_1671,1207881,1216890,9010,
2384,edge_1671,1183003,1191716,8714,
2382,edge_1671,972772,979570,6799,


## 8. Phasing analyses

### 8.1. Generating "smoothed haplotypes"

Given our called mutations, we can attempt to generate haplotypes that respect these mutations using strainFlye's `smooth` module.

The details and motivation for this are explained in depth in our paper. To briefly summarize, we will convert the reads aligned to each contig into "smoothed reads," which only contain our called mutations with no other variations. We will then (optionally, depending on the `--virtual-reads` parameter) construct "virtual reads" to fill in low-coverage regions. We will then assemble these reads using [LJA](https://github.com/AntonBankevich/LJA/) to construct "smoothed haplotypes."
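To illustrate the idea of a "smoothed read" (and only the idea), here is a toy sketch. It is not strainFlye's implementation: it ignores indels, virtual reads, and how reads with unexpected nucleotides are handled, and the `smooth_read` function and its inputs are hypothetical. It assumes we already know the contig sequence, the span a read aligns to, and the nucleotides that read has at called mutated positions.

```python
def smooth_read(contig_seq, aln_start, aln_end, read_nts_at_mutations):
    """Build a toy "smoothed read" for one aligned read.

    contig_seq: contig sequence (str).
    aln_start, aln_end: 1-indexed inclusive span the read aligns to.
    read_nts_at_mutations: dict mapping called mutated positions (1-indexed)
        to the nucleotide this read has there.

    The result matches the contig everywhere except at called mutated
    positions, where it carries the read's nucleotide.
    """
    smoothed = list(contig_seq[aln_start - 1:aln_end])
    for pos, nt in read_nts_at_mutations.items():
        if aln_start <= pos <= aln_end:
            smoothed[pos - aln_start] = nt
    return "".join(smoothed)


# Made-up example: a 10-nt contig with one called mutation at position 5.
contig = "ACGTACGTAC"
print(smooth_read(contig, 3, 8, {5: "T"}))
```

Any differences between the read and the contig at positions that were not called as mutations never enter the smoothed read, which is the "no other variations" property described above.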

In [21]:
!strainFlye smooth

Usage: strainFlye smooth [OPTIONS] COMMAND [ARGS]...

  [+] Create and assemble smoothed and virtual reads.

Options:
  -h, --help  Show this message and exit.

Commands:
  create    Create smoothed and virtual reads for each contig.
  assemble  Assemble contigs' smoothed and virtual reads using LJA.


#### 8.1.1. Create smoothed and virtual reads

In [22]:
!strainFlye smooth create

Usage: strainFlye smooth create [OPTIONS]

  Create smoothed and virtual reads for each contig.

Options:
  -c, --contigs PATH              FASTA file of contigs for which we will
                                  create smoothed and virtual reads. All
                                  contigs in this FASTA file should also be
                                  contained in the BAM and BCF files; it's ok
                                  if the BAM or BCF files contain contigs not
                                  in this FASTA file (we'll ignore them).
                                  [required]
  --bam PATH                      Sorted and indexed BAM file representing an
                                  alignment of reads to contigs.  [required]
  --bcf PATH                      Indexed BCF file describing single-
                                  nucleotide mutations in a set of contigs.
                                  [required]
  -di, --diversity-indices PATH  

In [None]:
!strainFlye smooth create \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
    --bam /Poppy/mfedarko/sheepgut/main-workflow/output/fully-filtered-and-sorted-aln.bam \
    --bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
    --diversity-indices /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/reads

#### 8.1.2. Assemble these reads using LJA

This step assumes that we have already installed [LJA](https://github.com/AntonBankevich/LJA/), in particular the `simple_ec` branch of it. Please see [LJA's manual](https://github.com/AntonBankevich/LJA/blob/main/docs/lja_manual.md) for installation instructions.

I have installed LJA into a specific location on our cluster. So that strainFlye can easily run LJA for each contig's reads files, we pass the location of LJA's binary executable to strainFlye below using the `--lja-bin` parameter.

In [23]:
!strainFlye smooth assemble

Usage: strainFlye smooth assemble [OPTIONS]

  Assemble contigs' smoothed and virtual reads using LJA.

  Please note that this command relies on the "simple_ec" branch of LJA being
  installed on your system. See strainFlye's README (and/or LJA's manual) for
  details on installing LJA.

Options:
  -r, --reads-dir DIRECTORY   Directory produced by "strainFlye smooth create"
                              containing smoothed (and optionally virtual)
                              reads. We will use LJA to assemble each
                              *.fasta.gz file in this directory (representing
                              reads from different contigs) independently.
                              [required]
  -p, --lja-params TEXT       Additional parameters to pass to LJA, besides
                              the --reads and --output-dir parameters. To
                              explain our defaults: the --simpleec flag is
                              currently 

In [None]:
!strainFlye smooth assemble \
    --reads-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/reads \
    --lja-bin /home/mfedarko/software/LJA-branch/bin/lja \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies

Once you've created these assemblies, you can then analyze them like you would any other assembly of a MAG. You may want to visualize the assembly graphs, for example—[Bandage](https://github.com/rrwick/Bandage), [AGB](https://github.com/almiheenko/AGB/), and [MetagenomeScope](https://github.com/marbl/MetagenomeScope) are a few tools that can do this.

For information about LJA-specific details of the outputs (e.g. what files mean what), please see [LJA's manual](https://github.com/AntonBankevich/LJA/blob/main/docs/lja_manual.md#output-of-la-jolla-assembler).

### 8.2. Constructing link graphs

As a related analysis, we can construct _link graphs_ showing "how much" of a contig we can potentially phase. These graphs are described in detail in the Supplemental Material of our paper.

Long story short, creating link graphs in strainFlye has two steps (like performing smoothed-read assembly above). First, we'll compute nucleotide (co-)occurrence data for each contig; then, we'll convert this data into link graphs.
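As a rough picture of what "(co-)occurrence data" means, the sketch below counts how often each nucleotide is seen at each mutated position, and how often pairs of nucleotides at two different mutated positions are seen on the same read. This is a simplified illustration with made-up reads; the `count_nt_cooccurrences` function and its data structures are hypothetical and do not describe the files that `strainFlye link nt` writes.

```python
from collections import Counter
from itertools import combinations


def count_nt_cooccurrences(reads):
    """Count nucleotides, and pairs of nucleotides seen on the same read.

    reads: iterable of dicts, each mapping a mutated position (1-indexed) to
        the nucleotide one read has at that position.
    Returns (pos_nt_counts, pair_counts), where
        pos_nt_counts[(pos, nt)] is the number of reads with nt at pos, and
        pair_counts[((pos_i, nt_i), (pos_j, nt_j))] (with pos_i < pos_j) is the
        number of reads that have both.
    """
    pos_nt_counts = Counter()
    pair_counts = Counter()
    for read in reads:
        alleles = sorted(read.items())
        pos_nt_counts.update(alleles)
        pair_counts.update(combinations(alleles, 2))
    return pos_nt_counts, pair_counts


# Made-up reads covering two mutated positions (10 and 25).
reads = [{10: "A", 25: "G"}, {10: "A", 25: "G"}, {10: "T", 25: "C"}, {10: "A"}]
pos_nt, pairs = count_nt_cooccurrences(reads)
print(pos_nt[(10, "A")], pairs[((10, "A"), (25, "G"))])
```

Counts like these are the kind of input a graph-building step can work from: alleles become nodes, and pairs that co-occur on enough reads become links.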

In [24]:
!strainFlye link

Usage: strainFlye link [OPTIONS] COMMAND [ARGS]...

  [+] Create link graphs showing co-occurring alleles.

  The "nt" command should be run first; this generates information about the
  counts and co-occurrences of nucleotides at mutated positions in a contig.

  The "graph" command takes as input the information produced by "nt" and
  creates link graphs (one link graph per contig) from it. There are many
  parameters that impact the graph creation, so we have split this into two
  steps in order to make it easy to rerun the "graph" step with different
  parameter settings if needed (the "nt" command will likely take longer to
  run).

Options:
  -h, --help  Show this message and exit.

Commands:
  nt     Compute (co-)occurrence information for mutations' nucleotides.
  graph  Convert (co-)occurrence information into a link graph structure.


#### 8.2.1. Compute nucleotide (co-)occurrence information

##### A brief note about filtering
Note that, in the paper's Supplemental Material, we limited our focus to mutated positions with coverages of at least 1,000x. We do not do this here. If you'd like to only consider mutations occurring at positions with at least a certain coverage, you'll need to filter the BCF file yourself. (You can probably use [bcftools](https://samtools.github.io/bcftools/howtos/filtering.html) to do this; if you are using a BCF file produced by strainFlye, then you can filter based on the `MDP` info tag, which describes the number of matching + mismatching reads at a mutated position.)
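One possible way to do this filtering is sketched below. It only builds the bcftools command (using the `-i`, `-O`, and `-o` options of `bcftools view`) so that you can inspect it before running it. The file names are hypothetical, and the filter expression assumes that `MDP` is stored as a single integer per record, so please check your BCF file's header first.

```python
import subprocess


def mdp_filter_command(in_bcf, out_bcf, min_mdp=1000):
    """Build a bcftools command keeping records whose INFO/MDP is >= min_mdp."""
    return [
        "bcftools", "view",
        "-i", f"INFO/MDP>={min_mdp}",  # include only records passing this expression
        "-O", "b",                     # write compressed BCF
        "-o", out_bcf,
        in_bcf,
    ]


# Hypothetical file names; replace these with your own paths.
cmd = mdp_filter_command("mutations.bcf", "mutations-mdp1000.bcf")
print(" ".join(cmd))

# Uncomment to actually run the filter and then index the result
# (the strainFlye commands in this tutorial expect an indexed BCF file):
# subprocess.run(cmd, check=True)
# subprocess.run(["bcftools", "index", "mutations-mdp1000.bcf"], check=True)
```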

In [25]:
!strainFlye link nt

Usage: strainFlye link nt [OPTIONS]

  Compute (co-)occurrence information for mutations' nucleotides.

Options:
  -c, --contigs PATH          FASTA file of contigs for which we will create
                              link graphs. All contigs in this FASTA file
                              should also be contained in the BAM and BCF
                              files; it's ok if the BAM or BCF files contain
                              contigs not in this FASTA file (we'll ignore
                              them).  [required]
  --bam PATH                  Sorted and indexed BAM file representing an
                              alignment of reads to contigs.  [required]
  --bcf PATH                  Indexed BCF file describing single-nucleotide
                              mutations in a set of contigs.  [required]
  -o, --output-dir DIRECTORY  Directory to which information about nucleotide
                              frequencies at the mutated positions in e

In [None]:
!strainFlye link nt \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
    --bam /Poppy/mfedarko/sheepgut/main-workflow/output/fully-filtered-and-sorted-aln.bam \
    --bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/link/nt

Using strainFlye version "0.1.0-dev".
--------
link nt @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta
Input BAM file: /Poppy/mfedarko/sheepgut/main-workflow/output/fully-filtered-and-sorted-aln.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Verbose?: No
Output directory: /Poppy/mfedarko/sftests/tutorial-output/link/nt
--------
link nt @ 0.00s: Loading and checking FASTA, BAM, and BCF files...
link nt @ 4.56s: The FASTA file describes 468 contig(s).
link nt @ 4.64s: All FASTA contig(s) are included in the BAM file (this BAM file has 78,793 reference(s)).
link nt @ 15.89s: All FASTA contig(s) are included in the BCF file (the header of this BCF file describes 468 contig(s)).
link nt @ 15.90s: The lengths of all contig(s) in the FASTA file match the corresponding lengths in the BAM and BCF files.
link nt @ 15.90s: So far, these files seem good.
--------
link nt @ 15.90s: Going through contigs

#### 8.2.2. Convert this information to create link graphs

This next command directly takes as input the information we just computed. You can experiment with various parameters of graph construction here, if you'd like; the defaults match what we have set in the paper. (The help text for this command, shown below, explains these parameters in detail.)

In [None]:
!strainFlye link graph

In [None]:
!strainFlye link graph \
    --nt-dir /Poppy/mfedarko/sftests/tutorial-output/link/nt \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/link/graph

## 9. Create codon/amino acid mutation matrices

_(This part of the pipeline is not finished yet, sorry. See [the README](https://github.com/fedarko/strainFlye) for information on the ad hoc code from the other repository you can use to create/visualize these matrices.)_

In [None]:
!strainFlye matrix

## 10. Growth dynamics analyses

One of the nice things about having complete or nearly-complete MAGs is that we can take a look at both their GC skews and coverages.

GC skews can sometimes be indicative of where the origin and terminus of replication are in a genome (this is discussed extensively in [Chapter 1 of the _Bioinformatics Algorithms_ textbook](https://www.bioinformaticsalgorithms.org/bioinformatics-chapter-1)). Similarly, we know from ([Korem, Zeevi, Suez _et al._, 2015](https://www.science.org/doi/full/10.1126/science.aac4812)) that the coverage throughout a MAG can also be indicative of the origin of replication.

In our paper, we demonstrate the easy applicability of HiFi reads to this sort of analysis by showing plots of coverage vs. skew for three MAGs in the SheepGut dataset, demonstrating that—as we might expect—these values are anti-correlated. (Why are they _anti-correlated_, specifically? The skew minimum indicates the origin of replication, and coverage for a genome undergoing replication should be at its maximum near the origin of replication.)

The `strainFlye dynam covskew` command can help us ~~replicate~~ reproduce this sort of analysis. Rather than plotting coverage and skew for each position within a MAG, we usually want to create "bins" of positions within a MAG and then plot coverage and skew for these bins: strainFlye can give us this sort of information for a contig in an easy-to-read TSV file that we can use as input for plotting coverage and skew.
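The binning and normalization steps can be sketched in a few lines. This is a simplified illustration rather than strainFlye's code: the function names are hypothetical, the skew formula is the conventional (G - C) / (G + C) (an assumption on my part about what `covskew` computes), and the clamping controlled by `--norm-coverage-epsilon` is left out because its exact behavior isn't covered here.

```python
from statistics import median


def bin_ranges(contig_length, bin_length):
    """Split [1, contig_length] into consecutive bins of bin_length positions.

    Returns 1-indexed inclusive (start, end) pairs; the last bin is shorter if
    contig_length is not divisible by bin_length.
    """
    return [
        (start, min(start + bin_length - 1, contig_length))
        for start in range(1, contig_length + 1, bin_length)
    ]


def gc_skew(seq):
    """Conventional GC skew, (G - C) / (G + C); 0 if the sequence has no G or C."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def normalized_bin_coverages(coverages, bins):
    """Median coverage per bin, divided by the median of those medians (assumed nonzero)."""
    medians = [median(coverages[s - 1:e]) for s, e in bins]
    m = median(medians)
    return [x / m for x in medians]


# edge_8 is 1,710,962 bp long; with 10,000 bp bins we expect 171 full bins
# plus one smaller bin of 962 bp.
bins = bin_ranges(1_710_962, 10_000)
print(len(bins), bins[-1][1] - bins[-1][0] + 1)
```

For `edge_8`, this gives 172 bins in total, the last of which is 962 bp long; this matches what `covskew` reports for this contig when run with `--verbose`.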

In [31]:
!strainFlye dynam covskew

Usage: strainFlye dynam covskew [OPTIONS]

  Compare coverage and GC skew within contigs.

  Bins
  ----
  We split each contig into bin(s) of a fixed number of positions:
  starting at the leftmost end of the contig, we combine the next
  --bin-length positions into a single bin, continuing until we run out of
  positions. The rightmost bin will contain fewer positions than --bin-length
  if the contig length is not divisible by --bin-length. (If the contig's
  length is ≤ --bin-length, we'll just create one bin for this contig.)

  How we compute bin coverages
  ----------------------------
  For each bin, we compute the median coverage of all positions in this bin.
  We then compute M, the median of these medians. We then compute
  "normalized" coverages by dividing each bin's median coverage by M. We then
  clamp these normalized coverages using --norm-coverage-epsilon.

  How we compute GC skews
  -----------------------
  For each bin, we compute the skew (G

In [29]:
!strainFlye dynam covskew \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
    --bam /Poppy/mfedarko/sheepgut/main-workflow/output/fully-filtered-and-sorted-aln.bam \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/covskew \
    --verbose

Using strainFlye version "0.1.0-dev".
--------
dynam covskew @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta
Input BAM file: /Poppy/mfedarko/sheepgut/main-workflow/output/fully-filtered-and-sorted-aln.bam
Bin length: 10,000 bp
Normalized coverage epsilon: 0.3
Output directory: /Poppy/mfedarko/sftests/tutorial-output/covskew
--------
dynam covskew @ 0.00s: Loading and checking FASTA and BAM files...
dynam covskew @ 4.73s: The FASTA file describes 468 contig(s).
dynam covskew @ 4.73s: All of these are included in the BAM file (which has 78,793 reference(s)), with the same lengths.
--------
dynam covskew @ 4.73s: Going through contigs and computing coverage/skew information...
dynam covskew @ 4.73s: On contig edge_8 (1,710,962 bp) (1 / 468 contigs = 0.21%).
dynam covskew @ 4.73s: Creating 171 bins of length 10,000 bp and 1 smaller bin of length 962 bp for contig edge_8...
dynam covskew @ 25.45s: On contig edge_10 (1,556,9

dynam covskew @ 1,020.13s: On contig edge_1343 (1,150,856 bp) (37 / 468 contigs = 7.91%).
dynam covskew @ 1,020.13s: Creating 115 bins of length 10,000 bp and 1 smaller bin of length 856 bp for contig edge_1343...
dynam covskew @ 1,029.61s: On contig edge_1345 (1,148,683 bp) (38 / 468 contigs = 8.12%).
dynam covskew @ 1,029.61s: Creating 114 bins of length 10,000 bp and 1 smaller bin of length 8,683 bp for contig edge_1345...
dynam covskew @ 1,039.21s: On contig edge_1349 (1,008,193 bp) (39 / 468 contigs = 8.33%).
dynam covskew @ 1,039.21s: Creating 100 bins of length 10,000 bp and 1 smaller bin of length 8,193 bp for contig edge_1349...
dynam covskew @ 1,048.51s: On contig edge_1350 (2,038,177 bp) (40 / 468 contigs = 8.55%).
dynam covskew @ 1,048.51s: Creating 203 bins of length 10,000 bp and 1 smaller bin of length 8,177 bp for contig edge_1350...
dynam covskew @ 1,089.87s: On contig edge_1371 (1,634,973 bp) (41 / 468 contigs = 8.76%).
dynam covskew @ 1,089.87s: Creating 163 bins of 

dynam covskew @ 2,537.11s: On contig edge_1736 (1,047,335 bp) (75 / 468 contigs = 16.03%).
dynam covskew @ 2,537.11s: Creating 104 bins of length 10,000 bp and 1 smaller bin of length 7,335 bp for contig edge_1736...
dynam covskew @ 2,545.87s: On contig edge_1738 (2,962,047 bp) (76 / 468 contigs = 16.24%).
dynam covskew @ 2,545.87s: Creating 296 bins of length 10,000 bp and 1 smaller bin of length 2,047 bp for contig edge_1738...
dynam covskew @ 2,586.47s: On contig edge_1749 (2,138,318 bp) (77 / 468 contigs = 16.45%).
dynam covskew @ 2,586.47s: Creating 213 bins of length 10,000 bp and 1 smaller bin of length 8,318 bp for contig edge_1749...
dynam covskew @ 2,605.98s: On contig edge_1770 (1,048,294 bp) (78 / 468 contigs = 16.67%).
dynam covskew @ 2,605.98s: Creating 104 bins of length 10,000 bp and 1 smaller bin of length 8,294 bp for contig edge_1770...
dynam covskew @ 2,615.29s: On contig edge_1778 (1,708,996 bp) (79 / 468 contigs = 16.88%).
dynam covskew @ 2,615.29s: Creating 170 b

dynam covskew @ 3,411.11s: On contig edge_2195 (1,213,600 bp) (113 / 468 contigs = 24.15%).
dynam covskew @ 3,411.11s: Creating 121 bins of length 10,000 bp and 1 smaller bin of length 3,600 bp for contig edge_2195...
dynam covskew @ 3,422.92s: On contig edge_2209 (1,741,203 bp) (114 / 468 contigs = 24.36%).
dynam covskew @ 3,422.92s: Creating 174 bins of length 10,000 bp and 1 smaller bin of length 1,203 bp for contig edge_2209...
dynam covskew @ 3,440.86s: On contig edge_2214 (1,382,691 bp) (115 / 468 contigs = 24.57%).
dynam covskew @ 3,440.86s: Creating 138 bins of length 10,000 bp and 1 smaller bin of length 2,691 bp for contig edge_2214...
dynam covskew @ 3,453.07s: On contig edge_2218 (3,016,620 bp) (116 / 468 contigs = 24.79%).
dynam covskew @ 3,453.07s: Creating 301 bins of length 10,000 bp and 1 smaller bin of length 6,620 bp for contig edge_2218...
dynam covskew @ 3,488.47s: On contig edge_2289 (2,279,686 bp) (117 / 468 contigs = 25.00%).
dynam covskew @ 3,488.47s: Creating 

dynam covskew @ 5,067.52s: On contig edge_2722 (1,222,305 bp) (151 / 468 contigs = 32.26%).
dynam covskew @ 5,067.52s: Creating 122 bins of length 10,000 bp and 1 smaller bin of length 2,305 bp for contig edge_2722...
dynam covskew @ 5,080.22s: On contig edge_2731 (1,143,015 bp) (152 / 468 contigs = 32.48%).
dynam covskew @ 5,080.22s: Creating 114 bins of length 10,000 bp and 1 smaller bin of length 3,015 bp for contig edge_2731...
dynam covskew @ 5,090.09s: On contig edge_2742 (2,400,363 bp) (153 / 468 contigs = 32.69%).
dynam covskew @ 5,090.09s: Creating 240 bins of length 10,000 bp and 1 smaller bin of length 363 bp for contig edge_2742...
dynam covskew @ 5,146.55s: On contig edge_2747 (3,550,098 bp) (154 / 468 contigs = 32.91%).
dynam covskew @ 5,146.55s: Creating 355 bins of length 10,000 bp and 1 smaller bin of length 98 bp for contig edge_2747...
dynam covskew @ 5,177.16s: On contig edge_2759 (2,653,209 bp) (155 / 468 contigs = 33.12%).
dynam covskew @ 5,177.16s: Creating 265 b

dynam covskew @ 5,907.67s: On contig edge_3258 (1,977,317 bp) (189 / 468 contigs = 40.38%).
dynam covskew @ 5,907.67s: Creating 197 bins of length 10,000 bp and 1 smaller bin of length 7,317 bp for contig edge_3258...
dynam covskew @ 5,926.24s: On contig edge_3278 (1,687,629 bp) (190 / 468 contigs = 40.60%).
dynam covskew @ 5,926.24s: Creating 168 bins of length 10,000 bp and 1 smaller bin of length 7,629 bp for contig edge_3278...
dynam covskew @ 5,942.03s: On contig edge_3285 (1,702,011 bp) (191 / 468 contigs = 40.81%).
dynam covskew @ 5,942.03s: Creating 170 bins of length 10,000 bp and 1 smaller bin of length 2,011 bp for contig edge_3285...
dynam covskew @ 5,958.14s: On contig edge_3306 (1,884,912 bp) (192 / 468 contigs = 41.03%).
dynam covskew @ 5,958.14s: Creating 188 bins of length 10,000 bp and 1 smaller bin of length 4,912 bp for contig edge_3306...
dynam covskew @ 5,977.52s: On contig edge_3312 (3,305,680 bp) (193 / 468 contigs = 41.24%).
dynam covskew @ 5,977.52s: Creating 

dynam covskew @ 6,754.99s: On contig edge_4234 (1,505,663 bp) (227 / 468 contigs = 48.50%).
dynam covskew @ 6,754.99s: Creating 150 bins of length 10,000 bp and 1 smaller bin of length 5,663 bp for contig edge_4234...
dynam covskew @ 6,775.25s: On contig edge_4321 (2,185,649 bp) (228 / 468 contigs = 48.72%).
dynam covskew @ 6,775.25s: Creating 218 bins of length 10,000 bp and 1 smaller bin of length 5,649 bp for contig edge_4321...
dynam covskew @ 6,793.89s: On contig edge_4371 (1,389,684 bp) (229 / 468 contigs = 48.93%).
dynam covskew @ 6,793.89s: Creating 138 bins of length 10,000 bp and 1 smaller bin of length 9,684 bp for contig edge_4371...
dynam covskew @ 6,806.87s: On contig edge_4380 (2,156,131 bp) (230 / 468 contigs = 49.15%).
dynam covskew @ 6,806.87s: Creating 215 bins of length 10,000 bp and 1 smaller bin of length 6,131 bp for contig edge_4380...
dynam covskew @ 6,826.66s: On contig edge_4382 (1,060,200 bp) (231 / 468 contigs = 49.36%).
dynam covskew @ 6,826.66s: Creating 

dynam covskew @ 7,541.22s: On contig edge_5295 (2,982,162 bp) (265 / 468 contigs = 56.62%).
dynam covskew @ 7,541.22s: Creating 298 bins of length 10,000 bp and 1 smaller bin of length 2,162 bp for contig edge_5295...
dynam covskew @ 7,570.89s: On contig edge_5308 (1,008,425 bp) (266 / 468 contigs = 56.84%).
dynam covskew @ 7,570.89s: Creating 100 bins of length 10,000 bp and 1 smaller bin of length 8,425 bp for contig edge_5308...
dynam covskew @ 7,580.92s: On contig edge_5343 (2,281,253 bp) (267 / 468 contigs = 57.05%).
dynam covskew @ 7,580.92s: Creating 228 bins of length 10,000 bp and 1 smaller bin of length 1,253 bp for contig edge_5343...
dynam covskew @ 7,610.52s: On contig edge_5367 (1,021,210 bp) (268 / 468 contigs = 57.26%).
dynam covskew @ 7,610.52s: Creating 102 bins of length 10,000 bp and 1 smaller bin of length 1,210 bp for contig edge_5367...
dynam covskew @ 7,620.36s: On contig edge_5376 (2,918,945 bp) (269 / 468 contigs = 57.48%).
dynam covskew @ 7,620.36s: Creating 

dynam covskew @ 9,304.15s: On contig edge_9371 (4,228,047 bp) (303 / 468 contigs = 64.74%).
dynam covskew @ 9,304.15s: Creating 422 bins of length 10,000 bp and 1 smaller bin of length 8,047 bp for contig edge_9371...
dynam covskew @ 9,344.87s: On contig edge_9374 (1,185,231 bp) (304 / 468 contigs = 64.96%).
dynam covskew @ 9,344.87s: Creating 118 bins of length 10,000 bp and 1 smaller bin of length 5,231 bp for contig edge_9374...
dynam covskew @ 9,358.67s: On contig edge_9387 (2,626,914 bp) (305 / 468 contigs = 65.17%).
dynam covskew @ 9,358.67s: Creating 262 bins of length 10,000 bp and 1 smaller bin of length 6,914 bp for contig edge_9387...
dynam covskew @ 9,385.34s: On contig edge_10090 (1,769,386 bp) (306 / 468 contigs = 65.38%).
dynam covskew @ 9,385.34s: Creating 176 bins of length 10,000 bp and 1 smaller bin of length 9,386 bp for contig edge_10090...
dynam covskew @ 9,408.13s: On contig edge_10177 (1,515,080 bp) (307 / 468 contigs = 65.60%).
dynam covskew @ 9,408.13s: Creati

dynam covskew @ 10,240.37s: On contig edge_15667 (1,119,524 bp) (341 / 468 contigs = 72.86%).
dynam covskew @ 10,240.37s: Creating 111 bins of length 10,000 bp and 1 smaller bin of length 9,524 bp for contig edge_15667...
dynam covskew @ 10,254.47s: On contig edge_15668 (2,895,208 bp) (342 / 468 contigs = 73.08%).
dynam covskew @ 10,254.47s: Creating 289 bins of length 10,000 bp and 1 smaller bin of length 5,208 bp for contig edge_15668...
dynam covskew @ 10,283.56s: On contig edge_15908 (1,260,399 bp) (343 / 468 contigs = 73.29%).
dynam covskew @ 10,283.56s: Creating 126 bins of length 10,000 bp and 1 smaller bin of length 399 bp for contig edge_15908...
dynam covskew @ 10,300.78s: On contig edge_15931 (1,072,997 bp) (344 / 468 contigs = 73.50%).
dynam covskew @ 10,300.78s: Creating 107 bins of length 10,000 bp and 1 smaller bin of length 2,997 bp for contig edge_15931...
dynam covskew @ 10,313.89s: On contig edge_16157 (1,522,273 bp) (345 / 468 contigs = 73.72%).
...
dynam covskew @ 11,268.32s: On contig edge_25370 (2,057,969 bp) (378 / 468 contigs = 80.77%).
dynam covskew @ 11,268.32s: Creating 205 bins of length 10,000 bp and 1 smaller bin of length 7,969 bp for contig edge_25370...
dynam covskew @ 11,287.98s: On contig edge_25431 (2,021,231 bp) (379 / 468 contigs = 80.98%).
dynam covskew @ 11,287.98s: Creating 202 bins of length 10,000 bp and 1 smaller bin of length 1,231 bp for contig edge_25431...
dynam covskew @ 11,306.41s: On contig edge_25588 (1,065,404 bp) (380 / 468 contigs = 81.20%).
dynam covskew @ 11,306.41s: Creating 106 bins of length 10,000 bp and 1 smaller bin of length 5,404 bp for contig edge_25588...
dynam covskew @ 11,319.38s: On contig edge_25602 (1,358,794 bp) (381 / 468 contigs = 81.41%).
dynam covskew @ 11,319.38s: Creating 135 bins of length 10,000 bp and 1 smaller bin of length 8,794 bp for contig edge_25602...
dynam covskew @ 11,333.83s: On contig edge_25994 (1,708,089 bp) (382 / 468 contigs = 81.62%).
...
dynam covskew @ 11,854.66s: On contig edge_35678 (1,111,762 bp) (415 / 468 contigs = 88.68%).
dynam covskew @ 11,854.66s: Creating 111 bins of length 10,000 bp and 1 smaller bin of length 1,762 bp for contig edge_35678...
dynam covskew @ 11,866.35s: On contig edge_35728 (1,244,682 bp) (416 / 468 contigs = 88.89%).
dynam covskew @ 11,866.35s: Creating 124 bins of length 10,000 bp and 1 smaller bin of length 4,682 bp for contig edge_35728...
dynam covskew @ 11,879.09s: On contig edge_35733 (1,234,543 bp) (417 / 468 contigs = 89.10%).
dynam covskew @ 11,879.09s: Creating 123 bins of length 10,000 bp and 1 smaller bin of length 4,543 bp for contig edge_35733...
dynam covskew @ 11,892.32s: On contig edge_35924 (1,526,048 bp) (418 / 468 contigs = 89.32%).
dynam covskew @ 11,892.32s: Creating 152 bins of length 10,000 bp and 1 smaller bin of length 6,048 bp for contig edge_35924...
dynam covskew @ 11,913.16s: On contig edge_35958 (2,276,057 bp) (419 / 468 contigs = 89.53%).
...
dynam covskew @ 12,508.93s: On contig edge_59408 (1,117,271 bp) (452 / 468 contigs = 96.58%).
dynam covskew @ 12,508.93s: Creating 111 bins of length 10,000 bp and 1 smaller bin of length 7,271 bp for contig edge_59408...
dynam covskew @ 12,520.47s: On contig edge_59951 (1,199,680 bp) (453 / 468 contigs = 96.79%).
dynam covskew @ 12,520.47s: Creating 119 bins of length 10,000 bp and 1 smaller bin of length 9,680 bp for contig edge_59951...
dynam covskew @ 12,532.76s: On contig edge_60444 (1,209,109 bp) (454 / 468 contigs = 97.01%).
dynam covskew @ 12,532.76s: Creating 120 bins of length 10,000 bp and 1 smaller bin of length 9,109 bp for contig edge_60444...
dynam covskew @ 12,544.70s: On contig edge_61525 (1,179,977 bp) (455 / 468 contigs = 97.22%).
dynam covskew @ 12,544.70s: Creating 117 bins of length 10,000 bp and 1 smaller bin of length 9,977 bp for contig edge_61525...
dynam covskew @ 12,556.47s: On contig edge_61852 (1,216,223 bp) (456 / 468 contigs = 97.44%).
dynam covskew @ 12