# Tutorial: applying strainFlye to a subset of the SheepGut dataset

This tutorial walk you through some of the analyses that strainFlye can perform.

Here we will be using the same SheepGut dataset that is used in our paper, but feel free to follow along with another dataset.

The pipeline takes as input two primary types of data:

1. A __set of reads__ (in FASTA / FASTQ format) generated using PacBio HiFi sequencing.

2. An __assembly graph__ (in GFA 1 format) produced by running metaFlye (or another assembler) on these reads.

Please see the paper's "Data access" section for more details about acquiring both of these types of data for the SheepGut dataset.

## The various commands available through strainFlye

In [13]:
!strainFlye

Usage: strainFlye [OPTIONS] COMMAND [ARGS]...

  Pipeline for the analysis of rare mutations in metagenomes.

Options:
  -h, --help  Show this message and exit.

Commands:
  align       Aligns reads to contigs, then filters this alignment.
  call        Methods for naïve mutation calling.
  diversity   Computes the diversity index for MAGs.
  spots       Identifies hot- and/or cold-spots in MAGs.
  covskew     Visualizes MAG coverage and GC skew.
  matrix      Computes mutation matrices of a MAG.
  link-graph  Constructs the link graph structure for a MAG.
  smooth      Generates smoothed haplotypes.
  utils       Various utility commands provided with strainFlye.


## 0. Convert the assembly graph to a FASTA file of contigs

**You can skip this step if:** you already have a FASTA file exactly matching the sequences of the segments in your GFA file.

Our assembly graph (the GFA file) contains the sequences of the contigs that we will use in many downstream analyses, but it will be helpful to have a FASTA file that just describes these contigs' sequences (independent of the assembly graph topology).

There are some [bash one-liners](https://www.biostars.org/p/169516/#169530) you can use to convert a GFA file to a FASTA file, but strainFlye also provides a utility command (`strainFlye utils gfa-to-fasta`) to do this for you. We'll use this here. (Our solution may be a bit slower than a bash one-liner, but it performs some useful sanity checking on the GFA file.)

In [10]:
!strainFlye utils gfa-to-fasta \
    --graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
    --output-fasta /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta

--------
strainFlye utils gfa-to-fasta @ 0.00 sec: Starting...
Input GFA file: /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa
Output FASTA file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
--------
strainFlye utils gfa-to-fasta @ 11.77 sec: Done.
Output FASTA file contains 78,793 sequences.


## 1. Performing alignment

**You can skip this step if:** you already have a BAM file representing an alignment of reads to contigs, and this BAM file does not contain secondary alignments / partially-mapped reads / overlapping supplementary alignments (these all may cause problems in downstream analyses).

We'll need to align reads back to these contigs. The resulting alignment will be used in pretty much all downstream steps, so it's important to make sure that it is of good quality.

The `strainFlye align` command produces a BAM file corresponding to such an alignment. 

Note that this command, in particular, may take a while to run. Sequence alignment is computationally expensive! On our cluster, `strainFlye align` ran on the full SheepGut dataset in 62,941.21 seconds (aka about 17.5 hours).

In [11]:
!strainFlye align \
    # We can use the FASTA file we just generated above.
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
    --graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
    --output-dir /Poppy/mfedarko/sftests/tutorial-output/alignment \
    # Reads file(s) are specified here, after all of the other parameters:
    /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz \
    /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/*.fasta.qz

Usage: strainFlye align [OPTIONS] READS...

  Aligns reads to contigs, then filters this alignment.

  Files of reads should be in the FASTA or FASTQ formats; GZIP'd files are
  allowed.

  This command involves multiple steps, including:

    1) Align reads to contigs (using minimap2) to generate a SAM file
    2) Convert this SAM file to a sorted and indexed BAM file
    3) Filter overlapping supplementary alignments within this BAM file
    4) Filter partially-mapped reads within this BAM file

  Note that we only sort the alignment file once, although we do re-index it
  after the two filtering steps. This decision is motivated by
  https://www.biostars.org/p/131333/#131335.

Options:
  -c, --contigs PATH          FASTA file of contigs to which reads will be
                              aligned.  [required]
  -g, --graph PATH            GFA 1-formatted file describing an assembly
                              graph of the contigs. This is used in the
       

This generates a BAM file (`final.bam`) and BAM index file (`final.bam.bai`) in the specified output directory.

We can use this BAM file for many analyses downstream—the first of these will be mutation calling.

## 2. Perform naïve mutation calling

**You can skip this step if:** you already have a VCF file indicating single-nucleotide variant calls in these contigs.

The analyses downstream of this step take as input a set of identified single-nucleotide mutations (or, if you prefer to use different terminology, "called variants", "called SNVs", ...) in which we have some confidence. Here we call mutations and estimate their FDR using the target-decoy approach, as described in our paper.

strainFlye supports calling two basic types of mutations: $p$-mutations and $r$-mutations. The docs explain the difference between these two types best:

In [17]:
!strainFlye call

Usage: strainFlye call [OPTIONS] COMMAND [ARGS]...

  Methods for naïve mutation calling.

  Consider a position "pos" in a contig. A given read with a (mis)match
  operation at "pos" must have one of four nucleotides (A, C, G, T) aligned to
  pos. We represent these nucleotides' counts at pos as follows:

      N1 = # reads of the most-common aligned nucleotide at pos,
      N2 = # reads of the second-most-common aligned nucleotide at pos,
      N3 = # reads of the third-most-common aligned nucleotide at pos,
      N4 = # reads of the fourth-most-common aligned nucleotide at pos.

  (We break ties arbitrarily.)

  strainFlye supports two types of naïve mutation calling based on these
  counts: p-mutations and r-mutations. These are described below.

  p-mutations (naïve percentage-based mutation calling)
  -----------------------------------------------------

  This takes as input some percentage p in the range (0%, 50%]. Define
  freq(pos) = N2 / (N1 + N2 + N3 

Notably, we call mutations for multiple values of $p$ (or $r$) at once. We do this in order to simplify the creation of FDR curves based on varying $p$ (or $r$).

### $p$-mutation calling for $p \in [0.15\%, 2\%]$

This matches what we used for Figure 2 in the paper.

We note that the `--min-p`, `--max-p`, and `--delta-p` values are all scaled up by 100. So `--min-p 15` means "set the minimum value of $p$ to 15 / 100 = 0.15%."

In [16]:
!strainflye call p-mutation \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
    --bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
    --min-p 15 \
    --max-p 200 \
    --delta-p 1 \
    --output-vcf /Poppy/mfedarko/sftests/tutorial-output/sheepgut_pmuts_0.15to2.vcf

Usage: strainflye call p-mutation [OPTIONS]

  Performs naïve percentage-based mutation (p-mutation) calling.

  We consider multiple values of the p parameter (defined by the min, max, and
  delta p options), and our output VCF file represents each of the resulting
  values of p tested as a separate FILTER.

  Note that --min-p, --max-p, and --delta-p must all be integers -- these will
  be converted to floating-point percentages later, but for now keeping them
  as integers makes the computation of ranges of values of p simpler.

Options:
  -c, --contigs PATH           FASTA file of contigs in which to naïvely call
                               mutations.  [required]
  -b, --bam PATH               BAM file representing an alignment of reads to
                               contigs.  [required]
  --min-p INTEGER RANGE        Minimum value of p for which to call
                               p-mutations. This is scaled up by 100 (i.e. the
                         

### $r$-mutation calling for $r \in [5, 100]$

While we're at it, we'll compare these with the p-mutations we got earlier.

In [None]:
!strainflye call r-mutation \
    --contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
    --bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
    --min-r 1 \
    --max-p 100 \
    --delta-r 1 \
    --output-vcf /Poppy/mfedarko/sftests/tutorial-output/sheepgut_rmuts_1to100.vcf