# Tutorial: applying strainFlye to a subset of the SheepGut dataset

This tutorial walks you through some of the analyses that strainFlye can perform.

Here we will be using the same SheepGut dataset that is used in our paper, but feel free to follow along with another dataset.

The pipeline takes as input two primary types of data:

1. A __set of reads__ (in FASTA / FASTQ format) generated using PacBio HiFi sequencing.

2. An __assembly graph__ (in GFA 1 format) produced by running metaFlye (or another assembler) on these reads.

Please see the paper's "Data access" section for details about acquiring both of these types of data for the SheepGut dataset.

## The commands available through strainFlye

In [3]:
!strainFlye

Usage: strainFlye [OPTIONS] COMMAND [ARGS]...

  Pipeline for the analysis of rare mutations in metagenomes.

  Please consult https://github.com/fedarko/strainFlye if you have any
  questions, comments, etc. about strainFlye. Thank you for using this tool!

Options:
  -h, --help  Show this message and exit.

Commands:
  align  Aligns reads to contigs, then filters this alignment.
  call   [+] Naïve mutation calling and diversity index computation.
  fdr    FDR estimation and fixing for contigs' mutation calls.
  utils  [+] Various utility commands provided with strainFlye.


## 0. Convert the assembly graph to a FASTA file of contigs

**You can skip this step if:** you already have a FASTA file exactly matching the sequences of the segments in your GFA file.

Our assembly graph (the GFA file) contains the sequences of the contigs that we will use in many downstream analyses, but it will be helpful to have a FASTA file that just describes these contigs' sequences (independent of the assembly graph topology).

There are some [bash one-liners](https://www.biostars.org/p/169516/#169530) you can use to convert a GFA file to a FASTA file, but strainFlye also provides a utility command (`strainFlye utils gfa-to-fasta`) to do this for you. We'll use this here. (Our solution may be a bit slower than a bash one-liner, but it performs some useful sanity checking on the GFA file.)

In [10]:
!strainFlye utils gfa-to-fasta \
    --graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
    --output-fasta /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta

--------
strainFlye utils gfa-to-fasta @ 0.00 sec: Starting...
Input GFA file: /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa
Output FASTA file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
--------
strainFlye utils gfa-to-fasta @ 11.77 sec: Done.
Output FASTA file contains 78,793 sequences.
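
For intuition, here is a minimal Python sketch of the core of such a conversion. It is not strainFlye's implementation and skips the sanity checks mentioned above; it only assumes the GFA 1 convention that segment lines start with `S`, followed by a tab-separated name and sequence. The function name and the example graph are hypothetical.

```python
import io


def gfa_segments_to_fasta(gfa_lines):
    """Return FASTA text holding one record per segment (S) line of a GFA 1 file."""
    records = []
    for line in gfa_lines:
        # Only segment lines carry sequences; headers (H), links (L), etc. are skipped.
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        # GFA 1 allows "*" as a placeholder when the sequence is not stored in the file.
        if seq == "*":
            raise ValueError(f"Segment {name} has no sequence in the GFA file")
        records.append(f">{name}\n{seq}\n")
    return "".join(records)


# Hypothetical two-segment graph, used only to exercise the function.
example_gfa = io.StringIO(
    "H\tVN:Z:1.0\n"
    "S\tedge_1\tACGT\tdp:i:10\n"
    "S\tedge_2\tGGCA\n"
    "L\tedge_1\t+\tedge_2\t+\t0M\n"
)
fasta_text = gfa_segments_to_fasta(example_gfa)
print(fasta_text)
```

For real data, `strainFlye utils gfa-to-fasta` remains the safer choice, since it validates the graph as it converts.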


## 1. Performing alignment

**You can skip this step if:** you already have a BAM file representing an alignment of reads to contigs, and this BAM file does not contain secondary alignments / partially-mapped reads / overlapping supplementary alignments (these all may cause problems in downstream analyses).
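
If you are unsure whether an existing BAM file meets these requirements, one rough first check is to look at the FLAG field of its records. The sketch below is an assumption-laden illustration, not part of strainFlye: it reads lines of SAM text (for example, as printed by `samtools view`) and counts records whose FLAG has the secondary (`0x100`) or supplementary (`0x800`) bit set. It does not detect partially-mapped reads or *overlapping* supplementary alignments. The function name and the example records are hypothetical.

```python
SECONDARY = 0x100      # SAM FLAG bit marking a secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit marking a supplementary alignment


def count_flagged_alignments(sam_lines):
    """Count secondary and supplementary records in an iterable of SAM text lines."""
    counts = {"secondary": 0, "supplementary": 0}
    for line in sam_lines:
        # Header lines start with "@"; blank lines carry no record.
        if line.startswith("@") or not line.strip():
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & SECONDARY:
            counts["secondary"] += 1
        if flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
    return counts


# Hypothetical records with FLAG values 0, 256 (secondary), 2048 (supplementary), and 16.
example_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tedge_1\t1\t60\t4M\t*\t0\t0\tACGT\t*",
    "read1\t256\tedge_2\t5\t0\t4M\t*\t0\t0\tACGT\t*",
    "read2\t2048\tedge_1\t9\t60\t2S2M\t*\t0\t0\tACGT\t*",
    "read3\t16\tedge_2\t1\t60\t4M\t*\t0\t0\tACGT\t*",
]
flag_counts = count_flagged_alignments(example_sam)
print(flag_counts)
```

A nonzero secondary count means the file does not meet the requirements above. Supplementary alignments are only a problem when they overlap, which this sketch cannot tell you.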

We'll need to align reads back to these contigs. The resulting alignment will be used in pretty much all downstream steps, so it's important to make sure that it is of good quality.

The `strainFlye align` command produces a BAM file corresponding to such an alignment. 

Note that this command, in particular, may take a while to run. Sequence alignment is computationally expensive! On our cluster, `strainFlye align` ran on the full SheepGut dataset in 62,941.21 seconds (aka about 17.5 hours).

In [4]:
!strainFlye align

Usage: strainFlye align [OPTIONS] READS...

  Aligns reads to contigs, then filters this alignment.

  Files of reads should be in the FASTA or FASTQ formats; GZIP'd files are
  allowed.

  This command involves multiple steps, including:

    1) Align reads to contigs (using minimap2) to generate a SAM file
    2) Convert this SAM file to a sorted and indexed BAM file
    3) Filter overlapping supplementary alignments within this BAM file
    4) Filter partially-mapped reads within this BAM file

  Note that we only sort the alignment file once, although we do re-index it
  after the two filtering steps. This decision is motivated by
  https://www.biostars.org/p/131333/#131335.

Options:
  -c, --contigs PATH          FASTA file of contigs to which reads will be
                              aligned.  [required]
  -g, --graph PATH            GFA 1-formatted file describing an assembly
                              graph of the contigs. This is used in the
       

In [4]:
# We can use the FASTA file we just generated above as --contigs.
# The reads file(s) are specified last, after all of the other parameters.
!strainFlye align \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
    --graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/alignment \
    /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz \
    /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/*.fasta.gz

This generates a BAM file (`final.bam`) and BAM index file (`final.bam.bai`) in the specified output directory.
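
Before moving on, it can be worth confirming that both files were actually written. The helper below is a small hypothetical convenience, not a strainFlye command; the only thing it takes from this tutorial is the pair of file names `final.bam` and `final.bam.bai`.

```python
import tempfile
from pathlib import Path

EXPECTED_OUTPUTS = ("final.bam", "final.bam.bai")


def missing_alignment_outputs(output_dir):
    """Return the expected output file names that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]


# Demonstration in a throwaway directory that holds only an (empty) final.bam.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "final.bam").touch()
    missing = missing_alignment_outputs(tmp)
print(missing)
```

On real output you would pass the directory given to `--output-dir`; an empty list means both files are present.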

We can use this BAM file for many analyses downstream—the first of these will be mutation calling.

## 2. Perform naïve mutation calling; estimate and fix FDRs

**You can skip this step if:** you already have a VCF file indicating single-nucleotide variant calls in these contigs.

The analyses downstream of this step take as input a set of identified single-nucleotide mutations (or, if you prefer to use different terminology, "called variants", "called SNVs", ...) in which we have some confidence. Here we call mutations and estimate their false-discovery rates using the target-decoy approach, as described in our paper.

### 2.1. $p$-mutations and $r$-mutations?

strainFlye supports calling two basic types of mutations: $p$-mutations and $r$-mutations. The docs explain the difference between these two types best:

In [5]:
!strainFlye call

Usage: strainFlye call [OPTIONS] COMMAND [ARGS]...

  [+] Naïve mutation calling and diversity index computation.

  Consider a position "pos" in a contig. A given read with a (mis)match
  operation at "pos" must have one of four nucleotides (A, C, G, T) aligned to
  pos. We represent these nucleotides' counts at pos as follows:

      N1 = # reads of the most-common aligned nucleotide at pos,
      N2 = # reads of the second-most-common aligned nucleotide at pos,
      N3 = # reads of the third-most-common aligned nucleotide at pos,
      N4 = # reads of the fourth-most-common aligned nucleotide at pos.

  (We break ties arbitrarily.)

  strainFlye supports two types of naïve mutation calling based on these
  counts: p-mutations and r-mutations. These are described below.

  p-mutations (naïve percentage-based mutation calling)
  -----------------------------------------------------

  This takes as input some percentage p in the range (0%, 50%]. Define
  freq(po
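
To make the count definitions in this help text concrete, here is a short Python sketch that ranks the nucleotide counts at one position into N1 through N4. The function name and the example counts are hypothetical. The sketch deliberately stops at the counts, because the help text above is cut off before the definition of a $p$-mutation; run `strainFlye call` yourself for the full text.

```python
def ranked_counts(counts):
    """Return (N1, N2, N3, N4): the A/C/G/T read counts at a position, largest first.

    Ties are ordered arbitrarily, as in the help text.
    """
    return tuple(sorted((counts.get(nt, 0) for nt in "ACGT"), reverse=True))


# Hypothetical position covered by 100 reads with a (mis)match operation.
n1, n2, n3, n4 = ranked_counts({"A": 95, "C": 0, "G": 4, "T": 1})
print(n1, n2, n3, n4)
```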

### 2.2. More details about $p$-mutation calling...

Let's view the documentation of the $p$-mutation-calling command, in particular, in more detail:

In [6]:
!strainFlye call p-mutation

Usage: strainFlye call p-mutation [OPTIONS]

  Calls p-mutations and computes diversity indices.

  The primary parameter for this command is the lower bound of p, defined by
  --min-p. The VCF output will include "mutations" for all positions that pass
  this (likely very low) threshold, but this VCF should be adjusted using the
  utilities contained in the "strainFlye fdr" module.

Options:
  -c, --contigs PATH              FASTA file of contigs in which to naïvely
                                  call mutations.  [required]
  -b, --bam PATH                  BAM file representing an alignment of reads
                                  to contigs.  [required]
  --min-p INTEGER RANGE           Minimum value of p for which to call
                                  p-mutations. This is scaled up by 100 (i.e.
                                  the default of 50 corresponds to 50 / 100 =
                                  0.5%) in order to bypass floating-point
           

#### Understanding the output of these commands

Both of these commands take as input a *minimum* version of their threshold (either `--min-p` or `--min-r`). We only need to specify a minimum threshold because, if some position _pos_ is classified as a $p$-mutation for $p = p_0$, this implies that _pos_ is also a $p$-mutation for all $p \leq p_0$. The mutations called at the minimum threshold therefore include every mutation that would be called at any higher threshold. This also holds for $r$-mutations.
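
The nesting of call sets can be illustrated with a toy Python sketch. It rests on a simplifying assumption that is not taken from strainFlye's code: each position has a single alternate-nucleotide frequency (in percent), and a position is called at threshold $p$ whenever that frequency is at least $p$. The positions and frequencies are made up.

```python
def called_positions(alt_freq_pct, p):
    """Return the positions whose (hypothetical) alternate frequency, in percent, is >= p."""
    return {pos for pos, freq in alt_freq_pct.items() if freq >= p}


# Made-up alternate-nucleotide frequencies (in percent) for four positions.
alt_freq_pct = {101: 0.2, 250: 0.6, 999: 3.0, 1500: 0.1}

at_min = called_positions(alt_freq_pct, 0.15)  # calls made at the minimum threshold
at_half = called_positions(alt_freq_pct, 0.5)
at_two = called_positions(alt_freq_pct, 2.0)

# Every stricter call set is contained in the call set made at the minimum threshold.
print(at_two <= at_half <= at_min)
```

This is why one run at a low minimum threshold is enough: call sets for stricter thresholds can be obtained later by filtering, without re-scanning the alignment.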

These commands each output:

1. A __VCF ([variant call format](https://samtools.github.io/hts-specs/VCFv4.3.pdf)) file__ describing all mutations called naïvely across the contigs, based on the minimum $p$ or $r$ threshold set.


2. A __TSV ([tab separated values](https://en.wikipedia.org/wiki/Tab-separated_values)) file__ describing the contigs' computed diversity indices, for various values of $p$ or $r$ (controllable using the `--div-index-p-list` or `--div-index-r-list` parameters). 

The default minimum value of $p$ (or $r$) used in these commands is fairly low. As you might expect, using such a low threshold for calling a position as a mutation will yield many false positives: we will definitely identify some real mutations, but also many "false" mutations that occur as the result of sequencing errors, alignment errors, etc. Viewed another way, the __false discovery rate (FDR)__ of our identified mutations (the fraction of called mutations that are false positives, i.e. false positives divided by the sum of true and false positives) will probably be unacceptably high.

After we run this command, we'll use strainFlye's FDR estimation and fixing functionality to address this problem. The second output of these commands (the TSV file containing diversity index information) will help us with that.
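
Once you have the VCF output, a quick way to get a feel for it is to count how many mutations were called in each contig. The Python sketch below relies only on two general VCF conventions: header lines start with `#`, and the first tab-separated column of each record names the contig. The function name and the example records are hypothetical, not real strainFlye output.

```python
import io
from collections import Counter


def mutations_per_contig(vcf_lines):
    """Count VCF records per contig (the first, CHROM, column), skipping header lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        contig = line.split("\t", 1)[0]
        counts[contig] += 1
    return counts


# Hypothetical miniature VCF with three records across two contigs.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.3\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "edge_1\t10\t.\tA\tG\t.\t.\t.\n"
    "edge_1\t250\t.\tC\tT\t.\t.\t.\n"
    "edge_2\t7\t.\tG\tA\t.\t.\t.\n"
)
per_contig = mutations_per_contig(example_vcf)
print(dict(per_contig))
```

For a real file, pass an open file handle instead, e.g. `mutations_per_contig(open(path))`.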

### 2.3. Running $p$-mutation calling for $p \geq 0.15\%$

This matches what we used for Figure 2 in the paper.

In [None]:
!strainFlye call p-mutation \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
    --bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
    --min-p 15 \
    --output-vcf /Poppy/mfedarko/sftests/tutorial-output/p0.15.vcf \
    --output-diversity-indices /Poppy/mfedarko/sftests/tutorial-output/p-div-indices.tsv
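
Note the scaling of `--min-p`: as the help text above explains, the integer passed on the command line is the percentage multiplied by 100, so `--min-p 15` corresponds to $p = 0.15\%$. The two hypothetical helpers below (not part of strainFlye) spell out the conversion in both directions.

```python
def min_p_to_percent(min_p):
    """Convert a --min-p integer into the percentage it represents (e.g. 50 -> 0.5)."""
    return min_p / 100


def percent_to_min_p(percent):
    """Convert a percentage into the integer expected by --min-p (e.g. 0.5 -> 50)."""
    # Rounding guards against floating-point noise in products such as 0.15 * 100.
    return round(percent * 100)


print(min_p_to_percent(15), percent_to_min_p(0.15))
```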

### 2.4. Estimating and fixing FDRs using the target-decoy approach

We now have both our initial mutation calls (which, as we've discussed, probably have a high FDR) and information about our contigs' diversity indices. We will use the __target-decoy approach__ to attempt to estimate and thus control the FDR of our mutation calls.

As discussed in our paper, we will select one of our $C$ contigs as a __decoy contig__ (a.k.a. a decoy genome), and compute a mutation rate for it ($\text{rate}_{\text{decoy}}$). For each of the other $C - 1$ __target contigs__, we can estimate the FDR of identified mutations in this contig as $\dfrac{\text{rate}_{\text{decoy}}}{\text{rate}_{\text{target}}}$.
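
As a worked illustration of this ratio, the Python sketch below computes an FDR estimate from two mutation rates. For simplicity it assumes each rate is "called mutations per position"; the normalization strainFlye actually uses may differ (see the paper), and all numbers and function names are made up.

```python
def mutation_rate(num_mutations, contig_length):
    """Hypothetical mutation rate: called mutations per position in the contig."""
    return num_mutations / contig_length


def estimate_fdr(rate_decoy, rate_target):
    """Estimate a target contig's FDR as rate_decoy / rate_target."""
    if rate_target <= 0:
        raise ValueError("The target contig has no called mutations, so the ratio is undefined")
    return rate_decoy / rate_target


# Made-up example: 5 calls in a 1 Mbp decoy contig, 500 calls in a 1 Mbp target contig.
rate_decoy = mutation_rate(5, 1_000_000)
rate_target = mutation_rate(500, 1_000_000)
fdr = estimate_fdr(rate_decoy, rate_target)
print(f"Estimated FDR: {fdr:.2%}")
```

The intuition is that calls in the decoy are presumed to be errors, so the decoy's rate approximates how often errors alone produce a call. Here the decoy's rate is a hundredth of the target's, so roughly one in a hundred of the target's calls would be expected to be false.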

The first hurdle we'll need to surmount is selecting a decoy contig. What would make a good decoy contig? We can define some criteria:

- A low number of "real" mutations
- High coverage
- Long length

The first two of these criteria match up pretty well with the diversity indices we computed earlier, which—unlike the naïve mutations described in the VCF file—take coverage information into account.
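
To show how these criteria could be combined, here is a deliberately simple Python sketch. It is not the selection procedure strainFlye implements: it assumes that a lower diversity index indicates fewer mutations among well-covered positions, and it uses a hypothetical minimum-length cutoff of 1 Mbp. The class, function, and contig values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContigStats:
    name: str
    length: int             # contig length, in bp
    diversity_index: float  # assumed here: lower means fewer mutations at well-covered positions


def pick_decoy(contigs, min_length=1_000_000):
    """Pick the sufficiently long contig with the lowest diversity index (toy heuristic)."""
    eligible = [c for c in contigs if c.length >= min_length]
    if not eligible:
        raise ValueError("No contig is long enough to serve as a decoy")
    return min(eligible, key=lambda c: c.diversity_index)


# Made-up contigs: edge_3 has the lowest diversity index but is too short to qualify.
contigs = [
    ContigStats("edge_1", 2_100_000, 0.0040),
    ContigStats("edge_2", 1_300_000, 0.0005),
    ContigStats("edge_3", 40_000, 0.0001),
]
decoy = pick_decoy(contigs)
print(decoy.name)
```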

We can provide these diversity indices to strainFlye's `fdr` command, and it will use them to select a suitable-seeming decoy contig from our dataset. This command will then estimate and fix mutation calls' FDRs (to a default maximum FDR of $1\%$).

In [7]:
!strainFlye fdr

Usage: strainFlye fdr [OPTIONS]

  FDR estimation and fixing for contigs' mutation calls.

  Does this using the target-decoy approach (TDA). Given a set of C contigs,
  we select a "decoy contig" with relatively few called mutations. We then
  compute a mutation rate for this decoy contig, and use this mutation rate
  (along with the mutation rates of the other C - 1 contigs) to estimate the
  FDRs of all of the other contigs.

  By varying p or r, we can plot a FDR curve for each of the C - 1 non-decoy
  (target) contigs; and, given a fixed FDR, we can choose the "optimal" p or r
  parameter for each contig that results in a FDR ≤ this FDR.

Options:
  -v, --vcf PATH                  VCF file describing naïvely called p- or
                                  r-mutations.  [required]
  -di, --diversity-indices PATH   TSV file describing the diversity indices of
                                  a set of contigs. Used to automatically
                                 

In [None]:
!strainFlye fdr \
    --vcf /Poppy/mfedarko/sftests/tutorial-output/p0.15.vcf \
    --diversity-indices /Poppy/mfedarko/sftests/tutorial-output/p-div-indices.tsv \
    --output-fdr-info /Poppy/mfedarko/sftests/tutorial-output/p-fdr-info.tsv \
    --output-vcf /Poppy/mfedarko/sftests/tutorial-output/p-fdr-1-pct-fixed.vcf

(**From Marcus:** this is as far as the tutorial goes now. Given FDR estimates, we can identify the optimal threshold ($p$ or $r$) value for each contig, then use that to refine the naïve mutation calls into a more reliable VCF file. From there, replicating all of the downstream analyses (hotspots, phasing, ...) should be straightforward.)
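
As a rough sketch of that threshold-selection step, the Python snippet below picks, for one contig, the lowest threshold whose estimated FDR does not exceed a fixed maximum. It treats "optimal" as "lowest qualifying threshold" on the assumption that a lower threshold retains more mutation calls. The function name and the FDR values are hypothetical and do not reflect the format of strainFlye's FDR output.

```python
def lowest_threshold_within_fdr(fdr_by_threshold, max_fdr=0.01):
    """Return the lowest threshold whose estimated FDR is <= max_fdr, or None if none qualifies."""
    qualifying = [t for t, fdr in fdr_by_threshold.items() if fdr <= max_fdr]
    return min(qualifying) if qualifying else None


# Made-up FDR estimates for one target contig, keyed by p (in percent).
fdr_by_p = {0.15: 0.20, 0.5: 0.04, 1.0: 0.008, 2.0: 0.001}
best_p = lowest_threshold_within_fdr(fdr_by_p, max_fdr=0.01)
print(best_p)
```

A contig for which no threshold qualifies gets `None`; how such contigs should be handled is left open here.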