2024 ICPPB Sourmash LIN demo
===

# Analyzing Metagenome Composition with sourmash using the LIN taxonomic framework

Tessa Pierce Ward

July 2024

requires sourmash v4.8+

This tutorial uses the `sourmash taxonomy` module, which was introduced in a [blog post](https://bluegenes.github.io/sourmash-tax/)
and was recently shown to perform well for taxonomic profiling of long (and short) reads in [Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://link.springer.com/article/10.1186/s12859-022-05103-0), Portik et al., 2022.


In this tutorial, we'll use sourmash gather to analyze metagenomes using the [LIN taxonomic framework](https://dl.acm.org/doi/pdf/10.1145/3535508.3545546).
Specifically, we will analyze plant metagenomes for the presence of _Ralstonia solanacearum_.
The goal is to see if we can correctly assign the sequence in each file to the correct phylogenetic group, distinguishing between pathogenic and non-pathogenic strains.

**Simulated Samples: Ralstonia + tomato host:**
- `Sample-0` - no Ralstonia
- `Sample-II` - Ralstonia solanacearum, Phylotype IIB
- `Sample-IV` - Ralstonia solanacearum, Phylotype IV

**Infected Field Sample (nanopore):**
- `barcode16`

## Setup

In [1]:
## Download Ralstonia 32-genome database and corresponding taxonomy files

# database
!curl -JLO https://osf.io/download/wxtk3/
!mv ralstonia.sc1000.zip ralstonia.zip

# taxonomy csv
!curl -JLO https://osf.io/download/sj2z7/
!mv ralstonia32.lin-taxonomy.csv ralstonia.lin-taxonomy.csv

# lingroup csv
!curl -JLO https://osf.io/download/nqms2/
!mv LINgroups.csv ralstonia.lingroups.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0    801      0 --:--:-- --:--:-- --:--:--   803
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 4274k  100 4274k    0     0  2222k      0  0:00:01  0:00:01 --:--:-- 2222k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0   1759      0 --:--:-- --:--:-- --:--:--  1761
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  4109  100  4109    0     0   4038      0  0:00:01  0:00:01 --:--:--  4038
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0   2419      0 --:--:

In [2]:
### Next, download pre-made sourmash signatures made from the input metagenomes

# download Sample-0 signature
!curl -JLO https://osf.io/download/dvyt9/

# download Sample-II signature
!curl -JLO https://osf.io/download/agwdu/

# download Sample-IV signature
!curl -JLO https://osf.io/download/rngjq/

# move downloaded signatures to ./inputs (create the directory if needed)
!mkdir -p inputs
!mv Sample*.zip ./inputs

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0   2553      0 --:--:-- --:--:-- --:--:--  2564
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  821k  100  821k    0     0   923k      0 --:--:-- --:--:-- --:--:--  923k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0   1448      0 --:--:-- --:--:-- --:--:--  1450
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  839k  100  839k    0     0   924k      0 --:--:-- --:--:-- --:--:--  924k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0    757      0 --:--:

### Look at the signatures

Let's start with the `Sample-II` sample

By running `sourmash sig fileinfo`, we can see information on the signatures available within the zip file.

Here, you can see I've generated the metagenome signature with `scaled=1000` and built three ksizes: `k=21`, `k=31` and `k=51`.

In [3]:
!sourmash sig fileinfo ./inputs/Sample-II.sc1000.zip

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

** loading from './inputs/Sample-II.sc1000.zip'
path filetype: ZipFileLinearIndex
location: /Users/ntward/dib-lab/2024-icppb/inputs/Sample-II.sc1000.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 105825
summary of sketches:
   1 sketches with DNA, k=21, scaled=1000, abund      33335 total hashes
   1 sketches with DNA, k=31, scaled=1000, abund      35516 total hashes
   1 sketches with DNA, k=51, scaled=1000, abund      36974 total hashes


### Look at the database

Here, you can see I've generated the database with `scaled=1000` and built three ksizes, `k=21`, `k=31` and `k=51`

In [4]:
!sourmash sig fileinfo ralstonia.zip

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

** loading from 'ralstonia.zip'
path filetype: ZipFileLinearIndex
location: /Users/ntward/dib-lab/2024-icppb/ralstonia.zip
is database? yes
has manifest? yes
num signatures: 96
** examining manifest...
total hashes: 524340
summary of sketches:
   32 sketches with DNA, k=21, scaled=1000            174967 total hashes
   32 sketches with DNA, k=51, scaled=1000            174975 total hashes
   32 sketches with DNA, k=31, scaled=1000            174398 total hashes


> There are a lot of things to digest in this output, but the two main ones are:
> * there are 32 genomes represented in this database, each of which is sketched at k=21, k=31, and k=51
> * this database represents ~524 *million* k-mers (multiply the number of hashes by the scaled value)
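
The k-mer estimate in the second point is simple arithmetic on the `fileinfo` output. Here is a minimal sketch, assuming that each retained hash stands in for roughly `scaled` k-mers on average; the variable names are illustrative only.

```python
# Numbers copied from the `sourmash sig fileinfo ralstonia.zip` output above.
total_hashes = 524340   # "total hashes", summed over all three k-mer sizes
scaled = 1000           # each retained hash represents ~`scaled` k-mers

estimated_kmers = total_hashes * scaled
print(f"~{estimated_kmers / 1e6:.0f} million k-mers")

# The same estimate for the k=31 sketches only:
k31_hashes = 174398
k31_kmers = k31_hashes * scaled
print(f"k=31: ~{k31_kmers / 1e6:.0f} million k-mers")
```

Note that the total counts each genome three times, once per k-mer size.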


## Build a taxonomic profile of Sample II

First, let's run `sourmash gather` to find the closest reference genome(s) in the database.
If you want to read more about what sourmash is doing, please see [Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), Irber et al., 2022.

In [5]:
!sourmash gather inputs/Sample-II.sc1000.zip \
                 ralstonia.zip -k 31 \
                --output Sample-II.k31.gather.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

selecting specified query k=31
loaded query: Sample-II... (k=31, DNA)
loading from 'ralstonia.zip'...
loaded 96 total signatures from 1 locations.
after selecting signatures compatible with search, 32 remain.

Starting prefetch sweep across databases.
Prefetch found 31 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
1.3 Mbp        3.9%   26.6%       1.2    GCF_001373295.1 Ralstonia solanacear...
found less than 50.0 kbp in common. => exiting

found 1 matches total;
the recovered matches hit 3.9% of the abundance-weighted query.
the recovered matches hit 3.6% of the query k-mers (unweighted).



> The first step of gather ("prefetch") found all potential matches with at least 50 kbp of matching sequence (31 of 32 total database genomes). Then, the greedy algorithm narrowed this to a single best match, `GCF_001373295.1`, which shared an estimated 1.3 Mbp with the metagenome (~3.9% of the total query dataset). We can visualize this by looking at a Venn diagram of the shared k-mers between the metagenome sample and the top match. The yellow intersection represents <4% of the metagenome and ~26.6% of the Ralstonia RS2 reference genome. This small match percentage is expected, though, since the dataset is a simulated plant metagenome with an in silico Ralstonia spike-in, and we are just searching for `Ralstonia` here.
![sampleii.venn](https://hackmd.io/_uploads/SyssO_QPC.png)

### Add taxonomic information and summarize up lingroups

`sourmash gather` finds the smallest set of reference genomes that contains all the known information (k-mers) in the metagenome. In most cases, `gather` will find many metagenome matches. Here, we're only looking for `Ralstonia` matches and we only have a single gather result. Regardless, let's use `sourmash tax metagenome` to add taxonomic information and see if we've correctly assigned the pathogenic sequence.
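
To make the "smallest set of reference genomes" idea concrete, here is a toy greedy cover over plain Python sets. It is an illustration of the general approach only, not sourmash's implementation: sourmash works on FracMinHash sketches and tracks abundances, and the function name, genome names, and hash values below are all hypothetical.

```python
def greedy_gather(query, references, threshold=1):
    """Toy greedy minimum-set-cover over sets of hashes (illustration only)."""
    remaining = set(query)
    results = []
    while remaining:
        # Pick the reference sharing the most *still-unassigned* hashes.
        name, overlap = max(
            ((n, len(remaining & hashes)) for n, hashes in references.items()),
            key=lambda item: item[1],
        )
        if overlap < threshold:
            break  # nothing left above the detection threshold
        results.append((name, overlap))
        remaining -= references[name]  # assigned hashes cannot match again
    return results


# Hypothetical hash sets; real sketches hold thousands of 64-bit hashes.
query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
references = {
    "genomeA": {1, 2, 3, 4, 5, 6},
    "genomeB": {4, 5, 6, 7},   # overlaps genomeA heavily
    "genomeC": {100, 101},     # shares nothing with the query
}
matches = greedy_gather(query, references)
print(matches)  # [('genomeA', 6), ('genomeB', 1)]
```

Because `genomeA` claims hashes 4 to 6 first, `genomeB` is credited only with the one hash it alone explains. This is why closely related genomes in the database can collapse to a single gather result.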

#### First, let's look at the relevant taxonomy files.

These commands will show the first few lines of each file. If you prefer, you can look at a more human-friendly view by opening the files in a spreadsheet program.

- **taxonomy_csv:** `ralstonia.lin-taxonomy.csv`
  - the essential columns are `lin` (`14;1;0;...`) and `accession` (`GCF_00`...)
- **lingroups information:** `ralstonia.lingroups.csv`
  - both columns are essential (`name`, `lin`)

Look at the taxonomy file:

In [6]:
!head -n 5 ralstonia.lin-taxonomy.csv

lin,species,strain,filename,accession
14;1;0;0;0;0;0;0;0;0;6;0;1;0;1;0;0;0;0;0,Ralstonia solanacearum,OE1_1,GCF_001879565.1_ASM187956v1_genomic.fna,GCF_001879565.1
14;1;0;0;0;0;0;0;0;0;6;0;1;0;0;0;0;0;0;0,Ralstonia solanacearum,PSS1308,GCF_001870805.1_ASM187080v1_genomic.fna,GCF_001870805.1
14;1;0;0;0;0;0;0;0;0;2;1;0;0;0;0;0;0;0;0,Ralstonia solanacearum,FJAT_1458,GCF_001887535.1_ASM188753v1_genomic.fna,GCF_001887535.1
14;1;0;0;0;0;0;0;0;0;2;0;0;4;4;0;0;0;0;0,Ralstonia solanacearum,Pe_13,GCF_012062595.1_ASM1206259v1_genomic.fna,GCF_012062595.1


> The key columns are:
> - `accession`, containing identifiers matching the database sketches
> - `lin`, containing the LIN taxonomic information.

Now, let's look at the lingroups file:

In [7]:
!head -n5 ralstonia.lingroups.csv

name,lin
A_Total_reads;B_PhylI,14;1;0;0;0;0;0;0;0;0
A_Total_reads;B_PhylI;C_seq14,14;1;0;0;0;0;0;0;0;0;3
A_Total_reads;B_PhylI;C_seq15,14;1;0;0;0;0;0;0;0;0;2
A_Total_reads;B_PhylI;C_seq34,14;1;0;0;0;0;0;0;0;0;6


> Here, we have two columns:
> - `name` - the name for each lingroup. 
> - `lin` - the LIN prefix corresponding to each group.

### Now, run `sourmash tax metagenome` to integrate taxonomic information into `gather` results

Using the `gather` output we generated above, we can integrate taxonomic information and summarize up "ranks" (lin positions). We can produce several different types of outputs, including a `lingroup` report.

`lingroup` format summarizes the taxonomic information at each `lingroup`, and produces a report with 4 columns: 
- `name` (from lingroups file)
- `lin` (from lingroups file)
- `percent_containment` - total % of the file matched to this lingroup
- `num_bp_contained` - estimated number of bp matched to this lingroup

> Since sourmash assigns all k-mers to individual genomes, no reads/base pairs are "assigned" to higher taxonomic ranks or lingroups (as with Kraken-style LCA). Here, "percent_containment" and "num_bp_contained" are calculated by summarizing the assignments made to all genomes in a lingroup. This is akin to the "contained" information in Kraken-style reports.
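
The summarization described here can be sketched in a few lines. This is a simplified illustration, not sourmash's code: the helper names are hypothetical, the lingroup prefixes and LINs are copied from the two files shown above, and the matched bp values and query size are made up.

```python
def in_lingroup(lin, prefix):
    """True if `lin` starts with `prefix`, compared position by position."""
    lin_positions = lin.split(";")
    prefix_positions = prefix.split(";")
    return lin_positions[: len(prefix_positions)] == prefix_positions


def summarize_lingroups(matches, lingroups, query_bp):
    """Sum matched bp over every genome whose LIN falls inside each lingroup."""
    report = []
    for name, prefix in lingroups.items():
        bp = sum(m_bp for lin, m_bp in matches if in_lingroup(lin, prefix))
        if bp:
            report.append((name, prefix, round(100 * bp / query_bp, 2), bp))
    return report


# Lingroup prefixes copied from ralstonia.lingroups.csv (shown above).
lingroups = {
    "A_Total_reads;B_PhylI": "14;1;0;0;0;0;0;0;0;0",
    "A_Total_reads;B_PhylI;C_seq14": "14;1;0;0;0;0;0;0;0;0;3",
    "A_Total_reads;B_PhylI;C_seq15": "14;1;0;0;0;0;0;0;0;0;2",
    "A_Total_reads;B_PhylI;C_seq34": "14;1;0;0;0;0;0;0;0;0;6",
}
# (LIN, matched bp): LINs are from the taxonomy file; bp values are made up.
matches = [
    ("14;1;0;0;0;0;0;0;0;0;6;0;1;0;1;0;0;0;0;0", 300_000),
    ("14;1;0;0;0;0;0;0;0;0;2;1;0;0;0;0;0;0;0;0", 100_000),
]
report = summarize_lingroups(matches, lingroups, query_bp=10_000_000)
for row in report:
    print(*row, sep="\t")
```

Comparing split positions rather than raw strings avoids false matches such as the prefix `14;1` matching a LIN that starts with `14;10`. The parent lingroup `B_PhylI` receives the sum of both genomes, while `C_seq14` has no matches and is left out.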

Run `tax metagenome`:

In [8]:
!sourmash tax metagenome -g Sample-II.k31.gather.csv \
                        -t ralstonia.lin-taxonomy.csv \
                        --lins --lingroup ralstonia.lingroups.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

Trying to read LIN taxonomy assignments.
loaded 1 gather results from 'Sample-II.k31.gather.csv'.
loaded results for 1 queries from 1 gather CSVs
Read 20 lingroup rows and found 20 distinct lingroup prefixes.
name	lin	percent_containment	num_bp_contained
A_Total_reads;B_PhylII	14;1;0;0;0;3;0	3.94	1464000
A_Total_reads;B_PhylII;C_IIB	14;1;0;0;0;3;0;0	3.94	1464000
A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2	14;1;0;0;0;3;0;0;0;0;1;0;0;0;0	3.94	1464000
A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1	14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0	3.94	1464000


> Here, the most specific lingroup we assign to is `A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1`, which means this is in **phylotype IIB, sequevar 1**. This is the USDA select agent!

## Interlude: simulated samples can be too easy
These samples were generated via read simulation from Ralstonia genomes, and it turns out this one was created from the RS2 genome we are matching here. Let's try excluding this specific genome to see if we still find the same results without an exact database match. This should be more realistic.

We have a gather command-line option for just this situation, `--exclude-db-pattern`. Let's run it and predict taxonomy.

In [9]:
!sourmash gather inputs/Sample-II.sc1000.zip ralstonia.zip \
                -k 31 -o Sample-II.k31.gather.noRS2.csv \
                --exclude-db-pattern "RS2"

!sourmash tax metagenome -g Sample-II.k31.gather.noRS2.csv -t ralstonia.lin-taxonomy.csv \
                        --lins --lingroup ralstonia.lingroups.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

selecting specified query k=31
loaded query: Sample-II... (k=31, DNA)
loading from 'ralstonia.zip'...
loaded 96 total signatures from 1 locations.
after selecting signatures compatible with search, 32 remain.

Starting prefetch sweep across databases.
Prefetch found 30 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
1.2 Mbp        3.9%   24.0%       1.2    GCF_002251655.1 Ralstonia solanacear...
found less than 50.0 kbp in common. => exiting

found 1 matches total;
the recovered matches hit 3.9% of the abundance-weighted query.
the recovered matches hit 3.5% of the query k-mers (unweighted).

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

Trying to read LIN taxonomy ass

> Without the exact genome the reads were generated from, we get a slightly smaller sequence overlap (1.2Mbp instead of 1.3Mbp). However, when we add the LINgroup information, we still find the right group, `A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1`, or Phylotype IIB, sequevar 1 (pathogenic).

## Repeat for Remaining Samples

### Sample-0

In [10]:
# gather
!sourmash gather inputs/Sample-0.sc1000.zip \
                ralstonia.zip -k 31 \
                -o Sample-0.dna.k31.gather.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

selecting specified query k=31
loaded query: Sample-0... (k=31, DNA)
loading from 'ralstonia.zip'...
loaded 96 total signatures from 1 locations.
after selecting signatures compatible with search, 32 remain.

Starting prefetch sweep across databases.
Prefetch found 0 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.
found less than 50.0 kbp in common. => exiting

No matches found for --threshold-bp at 50.0 kbp.



#### gather found no sequence matches! Let's try lowering the detection threshold:

In [11]:
!sourmash gather inputs/Sample-0.sc1000.zip \
                 ralstonia.zip -k 31 \
                 --threshold-bp 3000 \
                 -o Sample-0.k31.gather.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

selecting specified query k=31
loaded query: Sample-0... (k=31, DNA)
loading from 'ralstonia.zip'...
loaded 96 total signatures from 1 locations.
after selecting signatures compatible with search, 32 remain.

Starting prefetch sweep across databases.
Prefetch found 0 signatures with overlap >= 3.0 kbp.
Doing gather to generate minimum metagenome cover.
found less than 3.0 kbp in common. => exiting

No matches found for --threshold-bp at 3.0 kbp.



> We *still* didn't find anything! It turns out that Sample-0 is a control sample that does not contain any *Ralstonia*.
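
To get a feel for what these thresholds mean, it helps to translate `--threshold-bp` into a number of shared hashes. The sketch below assumes the threshold is, roughly, divided by the `scaled` value to get a minimum hash count; the helper name is hypothetical.

```python
def min_shared_hashes(threshold_bp, scaled):
    """Approximate number of shared hashes needed to reach `threshold_bp`."""
    return threshold_bp / scaled


# The default threshold and the lowered one, for sketches built at scaled=1000.
for threshold_bp in (50_000, 3_000):
    n = min_shared_hashes(threshold_bp, scaled=1000)
    print(f"--threshold-bp {threshold_bp}: ~{n:.0f} shared hashes")
```

At `scaled=1000`, a 3 kbp threshold corresponds to only about three shared hashes, so finding nothing even at this level is good evidence that the sample shares essentially no k-mers with the database.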


### Sample-IV

In [12]:
 !sourmash gather inputs/Sample-IV.sc1000.zip \
                  ralstonia.zip -k 31 \
                  -o Sample-IV.k31.gather.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

selecting specified query k=31
loaded query: Sample-IV... (k=31, DNA)
loading from 'ralstonia.zip'...
loaded 96 total signatures from 1 locations.
after selecting signatures compatible with search, 32 remain.

Starting prefetch sweep across databases.
Prefetch found 31 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
1.2 Mbp        3.7%   22.1%       1.2    GCF_003515185.1 Ralstonia solanacear...
found less than 50.0 kbp in common. => exiting

found 1 matches total;
the recovered matches hit 3.7% of the abundance-weighted query.
the recovered matches hit 3.4% of the query k-mers (unweighted).



> We can look directly at the k-mer overlap between Sample-IV and this Ralstonia genome:
![sampleiv.venn](https://hackmd.io/_uploads/rk1FLdQDC.png)

In [13]:
!sourmash tax metagenome -g Sample-IV.k31.gather.csv -t ralstonia.lin-taxonomy.csv \
                        --lins --lingroup ralstonia.lingroups.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

Trying to read LIN taxonomy assignments.
loaded 1 gather results from 'Sample-IV.k31.gather.csv'.
loaded results for 1 queries from 1 gather CSVs
Read 20 lingroup rows and found 20 distinct lingroup prefixes.
name	lin	percent_containment	num_bp_contained
A_Total_reads;B_PhylIV	14;1;0;0;0;2;0;0;0	3.72	1404000
A_Total_reads;B_PhylIV;C_seq10	14;1;0;0;0;2;0;0;0;0;0;0	3.72	1404000


> We find that this sample falls in the **Phylotype IV, seq10** group (non-pathogenic).

## Infected field sample ("barcode16"; nanopore)

Let's run the infected field sample for a more realistic example.

In [14]:
# Download the sample
!curl -JLO https://osf.io/download/s2q83/
!mv bc16.scaled1000.zip ./inputs

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0   3050      0 --:--:-- --:--:-- --:--:--  3040
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 21.9M  100 21.9M    0     0  7543k      0  0:00:02  0:00:02 --:--:-- 10.3M


In [15]:
!sourmash gather inputs/bc16.scaled1000.zip ralstonia.zip \
                -k 31 --output barcode16.k31.gather.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

selecting specified query k=31
loaded query: bc16... (k=31, DNA)
loading from 'ralstonia.zip'...
loaded 96 total signatures from 1 locations.
after selecting signatures compatible with search, 32 remain.

Starting prefetch sweep across databases.
Prefetch found 32 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
5.1 Mbp       17.8%   97.2%      57.5    GCF_002251605.2 Ralstonia solanacear...
2.7 Mbp        0.3%    4.0%      21.2    GCF_000825845.1 Ralstonia solanacear...
2.2 Mbp        0.0%    2.1%       5.0    GCF_001644815.1 Ralstonia solanacear...
0.8 Mbp        0.0%    1.6%       2.0    GCF_012062595.1 Ralstonia solanacear...
4.9 Mbp        0.2%    1.2%      44.8    GCF_002251695.1 Ralstonia solanacear...
2.1 Mbp        0.0%    1.0%   

> The initial search (prefetch) found that all 32 genomes had shared sequence with our query. The minimum metagenome cover shows 6 genomes with non-overlapping matches. Nearly 18% of the abundance-weighted query matched to the GCF_002251605.2 Ralstonia solanacearum UW700 genome.
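
The same information is saved in the gather CSV, which is easier to filter programmatically than the terminal table. Below is a minimal sketch using Python's standard `csv` module. The toy CSV keeps only three of the many columns sourmash writes (`intersect_bp`, `f_match`, `name`), its values are rounded from the table above rather than copied from the real file, and `well_covered` is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for a gather CSV; values are rounded from the displayed table.
toy_csv = """intersect_bp,f_match,name
5100000,0.972,GCF_002251605.2 Ralstonia solanacearum
2700000,0.040,GCF_000825845.1 Ralstonia solanacearum
2200000,0.021,GCF_001644815.1 Ralstonia solanacearum
"""


def well_covered(handle, min_f_match=0.5):
    """Return (accession, f_match) for matches with f_match >= min_f_match."""
    hits = []
    for row in csv.DictReader(handle):
        f_match = float(row["f_match"])
        if f_match >= min_f_match:
            # The accession is the first whitespace-separated token of `name`.
            hits.append((row["name"].split()[0], f_match))
    return hits


hits = well_covered(io.StringIO(toy_csv))
print(hits)  # [('GCF_002251605.2', 0.972)]
```

To try this on the real results, pass `open("barcode16.k31.gather.csv", newline="")` instead of the `StringIO` object. Filtering on `f_match` highlights genomes that are almost entirely present in the sample, as opposed to genomes that contribute only a small number of k-mers.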

### Interlude: Can we match host k-mers?
Just for fun, let's try adding a random tomato genome to the database, to see if we can match the host k-mers:

In [16]:
# download signature from a tomato genome 
!curl -JLO https://osf.io/download/28pjz/

!sourmash gather inputs/bc16.scaled1000.zip ralstonia.zip \
                GCF_000188115.5_SL3.1.zip -k 31 \
                -o barcode16.k31.gather-w-host.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   377  100   377    0     0   2091      0 --:--:-- --:--:-- --:--:--  2094
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (23) Failed writing header

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

selecting specified query k=31
loaded query: bc16... (k=31, DNA)
loading from 'GCF_000188115.5_SL3.1.zip'...
loaded 99 total signatures from 2 locations.
after selecting signatures compatible with search, 33 remain.

Starting prefetch sweep across databases.
Prefetch found 33 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match avg_abund
---------   ------- ------- -----

> We can plot the k-mer overlap between this sample and the two top matches: Ralstonia UW700 and the Heinz 1706 tomato genome. Here, we see that while the Ralstonia k-mers are a small portion of the overall file (0.6%; small circle on the left), nearly the entire reference genome is present in the barcode16 sample (97.2%). This tomato genome, in contrast, shares a little less than half its content with the sample. It was randomly chosen and may not reflect the cultivar that bc16 was sampled from :).
![bc16.venn](https://hackmd.io/_uploads/H1UMrdXDR.png)
> plotted with the sourmash_plugin_venn library https://github.com/sourmash-bio/sourmash_plugin_venn

#### run tax metagenome
We could run tax metagenome with these results. However, since we don't have LIN or LINgroup information for the tomato genome, the results will only include the Ralstonia matches anyway. Since the tomato genome did not share any k-mers with the Ralstonia genomes, it will not impact Ralstonia taxonomic assignment.

In [17]:
!sourmash tax metagenome -g barcode16.k31.gather.csv \
                        -t ralstonia.lin-taxonomy.csv \
                        --lins --lingroup ralstonia.lingroups.csv

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

Trying to read LIN taxonomy assignments.
loaded 1 gather results from 'barcode16.k31.gather.csv'.
loaded results for 1 queries from 1 gather CSVs
Read 20 lingroup rows and found 20 distinct lingroup prefixes.
name	lin	percent_containment	num_bp_contained
A_Total_reads;B_PhylI	14;1;0;0;0;0;0;0;0;0	0.01	182000
A_Total_reads;B_PhylI;C_seq15	14;1;0;0;0;0;0;0;0;0;2	0.01	182000
A_Total_reads;B_PhylII	14;1;0;0;0;3;0	18.33	300671000
A_Total_reads;B_PhylII;C_IIA	14;1;0;0;0;3;0;1	0.28	4629000
A_Total_reads;B_PhylII;C_IIC	14;1;0;0;0;3;0;2	18.01	295389000
A_Total_reads;B_PhylII;C_IIB	14;1;0;0;0;3;0;0	0.04	653000
A_Total_reads;B_PhylII;C_IIB;D_seq4	14;1;0;0;0;3;0;0;1;0;0;0;0;0	0.04	579000


> 18% of the sample matched to `A_Total_reads;B_PhylII;C_IIC`, so we find that this sample is in **Phylotype IIC**, which is not the pathogenic lineage.
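
With several lingroups reported, one way to read the table is to walk down the hierarchy and follow the best-supported child at each level. The sketch below does this for the `name` and `percent_containment` values copied from the report above; `dominant_path` is a hypothetical helper, not part of sourmash.

```python
# Rows copied from the lingroup report above: (name, percent_containment).
report = [
    ("A_Total_reads;B_PhylI", 0.01),
    ("A_Total_reads;B_PhylI;C_seq15", 0.01),
    ("A_Total_reads;B_PhylII", 18.33),
    ("A_Total_reads;B_PhylII;C_IIA", 0.28),
    ("A_Total_reads;B_PhylII;C_IIC", 18.01),
    ("A_Total_reads;B_PhylII;C_IIB", 0.04),
    ("A_Total_reads;B_PhylII;C_IIB;D_seq4", 0.04),
]


def dominant_path(report, root="A_Total_reads"):
    """Walk down the lingroup names, taking the best-supported child each time."""
    current, path = root, []
    while True:
        depth = current.count(";") + 1
        # Direct children: names one level deeper that extend the current name.
        children = [
            (pct, name)
            for name, pct in report
            if name.startswith(current + ";") and name.count(";") == depth
        ]
        if not children:
            return path
        pct, current = max(children)
        path.append((current, pct))


path = dominant_path(report)
for name, pct in path:
    print(f"{pct:6.2f}%  {name}")
```

The walk stops at `C_IIC` because the lingroups file defines no finer group beneath it with matches. The small percentages assigned to other groups (IIA, IIB, Phylotype I) are left for manual inspection rather than discarded.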

## Summary and concluding thoughts

The LIN taxonomic framework may be useful for distinguishing groups below the species level, and we can use LINs and LINgroups with `sourmash tax`. For low-level matches, the greedy gather
approach can struggle. In cases where two reference genomes have an identical % match, the reported match is selected at random. We are working on ways to better warn users about places where this behavior occurs and welcome
feedback and suggestions on our [issue tracker](https://github.com/sourmash-bio/sourmash/issues/new).

We typically recommend running at `scaled=1000` (our default), as this works for most microbial use cases. However, for smaller samples and databases or for distinguishing between highly related genomes, you may want to run at higher resolution (lower scaled), e.g. scaled=100 or lower. Note, higher resolution signatures are larger and take longer to build and search. See more information on scaled and thresholds [here](https://sourmash.readthedocs.io/en/latest/faq.html#what-scaled-values-should-i-use-with-sourmash).

Sourmash taxonomy can also be used with NCBI, GTDB, and ICTV taxonomies. For a walkthrough using GTDB and sample-specific assemblies with an environmental metagenome, see [here](https://sourmash.readthedocs.io/en/latest/tutorial-lemonade.html).

NOTE: We're in the process of upgrading sourmash commands for multithreading and faster processing. If you get comfortable with these commands and want to process more samples, please check out the **[branchwater plugin](https://github.com/sourmash-bio/sourmash_plugin_branchwater/tree/main/doc)**, e.g. the `fastmultigather` command, for faster execution and to run multiple samples at once.