# metagenome analysis with `sourmash gather`

* simple algorithm to implement: report best containment match to query, subtract from query, iterate
* this gives a compositional breakdown of a metagenome
* this is a greedy combinatorial search for collections of k-mers
* (demonstration, and reimplementation)
* benchmarking and discussion is part of @luizirber's thesis, but I have been authorized to make the following statement:

> we followed the CAMI recommendations for benchmarking sourmash with other taxonomic profiling tools, and we have better recall and precision using a fraction of the computational resources, and can scale to two orders of magnitude more reference datasets than other tools can support

* explore some results [here](https://luizirber.github.io/2020-cami/cami_ii_mg/opal_output_all/results.html)
* does not fail due to saturation of LCA taxonomy (c.f. [Nasko et al., 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6206640/)
* gives decent strain level resolution (again, b/c of combinatorics))


In [1]:
ls data

[34mMAGs[m[m/                           gtdb-release89-k31.lca.json.gz
Makefile                        gtdb-release89-k31.sbt.zip
README.md                       iHMP-PSM7J4EF.sig
[34mTara-MS[m[m/                        podar-lineage.csv
akker-reads.abundtrim.gz        shew-reads.abundtrim.gz
[34mbak[m[m/                            twofoo.fq.gz
[34mgenomes[m[m/


In [2]:
!sourmash sig describe data/iHMP-PSM7J4EF.sig

[K
== This is sourmash version 3.2.4.dev5+g6484e78f. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Kloaded 3 signatures total.ata/iHMP-PSM7J4EF.sig...
---
signature filename: data/iHMP-PSM7J4EF.sig
signature: outputs/abundtrim/PSM7J4EF.abundtrim.fq.gz
source file: outputs/abundtrim/PSM7J4EF.abundtrim.fq.gz
md5: 9a540c534967433dc55b89a7d21e0369
k=21 molecule=DNA num=0 scaled=2000 seed=42 track_abundance=1
size: 24371
signature license: CC0

---
signature filename: data/iHMP-PSM7J4EF.sig
signature: outputs/abundtrim/PSM7J4EF.abundtrim.fq.gz
source file: outputs/abundtrim/PSM7J4EF.abundtrim.fq.gz
md5: 75c3a04d70b7220a3aae46cb343e6361
k=31 molecule=DNA num=0 scaled=2000 seed=42 track_abundance=1
size: 23058
signature license: CC0

---
signature filename: data/iHMP-PSM7J4EF.sig
signature: outputs/abundtrim/PSM7J4EF.abundtrim.fq.gz
source file: outputs/abundtrim/PSM7J4EF.abundtrim.fq.gz
md5: 282f85f9323c946898cc01bc791ed9ea
k=51 molecule=DNA num=0 scaled=2000 s

In [3]:
!sourmash gather data/iHMP-PSM7J4EF.sig data/gtdb-release89-k31.sbt.zip

[K
== This is sourmash version 3.2.4.dev5+g6484e78f. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Kselecting default query k=31.
[Kloaded query: outputs/abundtrim/PSM7J4EF.abu... (k=31, DNA)
[Kloaded 1 databases.                                                            


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.1 Mbp       20.1%   72.0%       6.5    GCF_000156075 s__Bacteroides_B dorei
2.2 Mbp        4.7%   44.3%       2.8    GCF_000025985 s__Bacteroides fragilis
2.1 Mbp        3.5%   42.1%       2.2    GCF_000690815 s__Escherichia coli
1.9 Mbp        3.7%   29.3%       2.6    GCF_001314995 s__Bacteroides ovatus
1.5 Mbp        2.6%   31.8%       2.4    GCF_000154205 s__Bacteroides uniformis
1.2 Mbp        2.4%   43.0%       2.7    GCF_900112995 s__Lachnospira rogosae
1.0 Mbp        2.0%   48.2%       2.6    GCA_000980495 s__Parasutterella sp000...
0.9 Mbp        2.0%   27.2%       3.0    GCF_000020605 s__Agathoba

In [9]:
!sourmash gather -h

[K
== This is sourmash version 3.2.4.dev5+g6484e78f. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

usage:  gather [-h] [-q] [-d] [--traverse-directory] [-o FILE]
               [--save-matches FILE] [--threshold-bp REAL]
               [--output-unassigned FILE] [--scaled FLOAT]
               [--ignore-abundance] [-k K] [--protein] [--no-protein]
               [--dayhoff] [--no-dayhoff] [--hp] [--no-hp] [--dna] [--no-dna]
               query databases [databases ...]

positional arguments:
  query                 query signature
  databases             signatures/SBTs to search

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           suppress non-error output
  -d, --debug
  --traverse-directory  search all signatures underneath directories
  -o FILE, --output FILE
                        output CSV containing matches to this file
  --save-matches FILE   save the matched signatures from 

In [10]:
!sourmash gather data/iHMP-PSM7J4EF.sig data/gtdb-release89-k31.sbt.zip \
    --scaled=10000 --save-matches=iHMP-matches.sig

[K
== This is sourmash version 3.2.4.dev5+g6484e78f. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Kselecting default query k=31.
[Kloaded query: outputs/abundtrim/PSM7J4EF.abu... (k=31, DNA)
[Kdownsampling query from scaled=2000 to 10000
[Kloaded 1 databases.                                                            


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
4.0 Mbp       20.2%   71.2%       6.5    GCF_000156075 s__Bacteroides_B dorei
2.3 Mbp        3.7%   44.7%       2.2    GCF_000690815 s__Escherichia coli
2.1 Mbp        4.5%   42.0%       2.8    GCF_000025985 s__Bacteroides fragilis
2.0 Mbp        4.1%   31.6%       2.6    GCF_001314995 s__Bacteroides ovatus
1.4 Mbp        2.5%   31.1%       2.3    GCF_000154205 s__Bacteroides uniformis
1.4 Mbp        3.0%   45.9%       2.9    GCF_900112995 s__Lachnospira rogosae
1.1 Mbp        2.5%   27.2%       3.2    GCF_000020605 s__Agathobacter rectale
1.0 Mbp        2.5%  

In [12]:
import sourmash

In [18]:
list_of_sigs = sourmash.load_signatures('iHMP-matches.sig')

query_sig = sourmash.load_one_signature('data/iHMP-PSM7J4EF.sig', ksize=31)

def best_match(q, los):
    best_score = 0
    best_sig = None
    for subj in los:
        score = subj.contained_by(q, downsample=True)
        if score > best_score:
            best_score = score
            best_sig = subj
            
    return best_score, best_sig

this_mh = query_sig.minhash.copy_and_clear()
this_mh.add_many(query_sig.minhash.get_mins())
this_sig = sourmash.SourmashSignature(this_mh)

while 1:
    score, match = best_match(this_sig, list_of_sigs)
    
    if not match:
        break
    print(score, match.name())
   
    # remove the best match from the query signature, and then iterate
    query_mh = this_sig.minhash
    query_mh.remove_many(match.minhash.get_mins())
    this_sig = sourmash.SourmashSignature(query_mh)
    




0.3729956268221574 GCF_000156075 s__Bacteroides_B dorei
