# sourmash: working with private collections of signatures

## download a bunch of genomes

In [None]:
!mkdir -p big_genomes
!curl -L https://osf.io/8uxj9/?action=download | (cd big_genomes && tar xzf -)

## compute signatures for each file

In [None]:
!cd big_genomes/ && sourmash compute -k 31 --scaled=1000 --name-from-first *.fa
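
The `--scaled=1000` option above asks sourmash to keep roughly one in every 1000 distinct k-mer hashes. The sketch below illustrates that idea in plain Python under some assumptions: `hash64` uses `hashlib.blake2b` as a stand-in (sourmash itself uses MurmurHash3), reverse complements are ignored, and the "genome" is random sequence. The function names are hypothetical and not part of sourmash.

```python
import hashlib
import random

def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash64(kmer):
    """Stand-in 64-bit hash (assumption: sourmash itself uses MurmurHash3)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def scaled_sketch(seq, k=31, scaled=1000):
    """Keep only hashes below 2**64 / scaled, i.e. about 1/scaled of all k-mers."""
    max_hash = 2**64 // scaled
    return {h for h in map(hash64, kmers(seq, k)) if h < max_hash}

# Random toy sequence standing in for a genome (hypothetical data).
random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(200_000))
sketch = scaled_sketch(genome, k=31, scaled=1000)
n_kmers = len(genome) - 31 + 1
print(f"{n_kmers} k-mers -> {len(sketch)} hashes kept")
```

Because the cutoff depends only on the hash value, two sketches built with the same `k` and `scaled` can be compared directly, which is what the later compare, search, and gather steps rely on.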

## Compare them all

In [None]:
!sourmash compare big_genomes/*.sig -o compare_all.mat
!sourmash plot compare_all.mat

In [None]:
from IPython.display import Image
Image(filename='compare_all.mat.matrix.png') 

## make a fast(er) search database for all of them

In [None]:
!sourmash index -k 31 all-genomes big_genomes/*.sig

You can now use this database for search and gather.

In [None]:
!sourmash search shew_os185.fa.sig all-genomes --threshold=0.001
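
To make the `--threshold` option concrete, here is a minimal sketch of a similarity search over sets of hashes. It assumes Jaccard similarity as the score and small integer sets as stand-ins for real signatures. The `jaccard` and `search` functions and the toy database are hypothetical, and the real command does more than this (for example, it uses the index to avoid comparing against every signature).

```python
def jaccard(a, b):
    """Jaccard similarity of two hash sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def search(query, database, threshold=0.001):
    """Return (similarity, name) pairs at or above threshold, best first."""
    hits = [(jaccard(query, hashes), name) for name, hashes in database.items()]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)

# Toy hash sets standing in for real signatures (hypothetical data).
database = {
    "genome_A": {1, 2, 3, 4, 5, 6},
    "genome_B": {5, 6, 7, 8},
    "genome_C": {100, 101},
}
query = {1, 2, 3, 4, 5}

results = search(query, database, threshold=0.001)
for similarity, name in results:
    print(f"{similarity:.3f}  {name}")
```

A low threshold such as 0.001 lets through even very weak matches, like `genome_B` here, while a genome with no shared hashes at all is still dropped.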

In [None]:
# (make fake metagenome again, just in case)
!cat genomes/*.fa > fake-metagenome.fa
!sourmash compute -k 31 --scaled=1000 fake-metagenome.fa

In [None]:
!sourmash gather fake-metagenome.fa.sig all-genomes
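
Where search ranks individual matches, gather tries to explain the whole metagenome as a combination of genomes. A rough sketch of the greedy idea: pick the genome that covers the most not-yet-explained hashes, remove those hashes, and repeat. This is an illustration on toy sets, not sourmash's implementation. The `gather` function, its `min_overlap` parameter, and the data are all hypothetical.

```python
def gather(query, database, min_overlap=1):
    """Greedy decomposition: repeatedly take the genome covering the most remaining hashes."""
    remaining = set(query)
    results = []
    while remaining:
        # Best candidate against what is still unexplained.
        name, overlap = max(
            ((n, len(remaining & h)) for n, h in database.items()),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if overlap < min_overlap:
            break
        # Report the newly explained fraction of the original query.
        results.append((name, overlap / len(query)))
        remaining -= database[name]
    return results, remaining

# Toy hash sets standing in for real signatures (hypothetical data).
database = {
    "genome_A": {1, 2, 3, 4, 5, 6},
    "genome_B": {5, 6, 7, 8},
    "genome_C": {100, 101},
}
metagenome = {1, 2, 3, 4, 5, 6, 7, 8, 50}

found, unexplained = gather(metagenome, database)
for name, fraction in found:
    print(f"{fraction:.1%}  {name}")
print("unexplained hashes:", len(unexplained))
```

Note that `genome_B` is credited only for hashes 7 and 8. Hashes 5 and 6 were already assigned to `genome_A`, so shared content is not counted twice.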

# build a database with taxonomic information

For this, we need to provide a metadata file that maps each accession to its taxonomic lineage.

In [None]:
import pandas
df = pandas.read_csv('podar-lineage.csv')
df

In [None]:
!sourmash lca index podar-lineage.csv taxdb big_genomes/*.sig -C 3 --split-identifiers
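
The signatures were named with `--name-from-first`, so each name is a full FASTA header, while the CSV is keyed by accession. `--split-identifiers` bridges the two. The sketch below shows the matching as I understand it: take the first whitespace-separated token of the name and drop any version suffix after the dot. Treat that behavior as an assumption and check `sourmash lca index --help` for your version. The header, the accession, and the lineage table are made up for illustration.

```python
def split_identifier(sig_name, keep_version=False):
    """Reduce a signature name to a bare accession (assumed behavior)."""
    token = sig_name.split(" ")[0]
    if not keep_version:
        token = token.split(".")[0]
    return token

# Hypothetical lineage table keyed by accession, as loaded from a CSV.
lineages = {
    "XX_000001": ("Bacteria", "Examplephylum", "Examplegenus"),
}

# Hypothetical signature name, as produced from the first FASTA header.
sig_name = "XX_000001.1 Example organism strain 1, complete genome"

accession = split_identifier(sig_name)
print(accession, "->", lineages.get(accession, "no lineage found"))
```

If the lookup fails for many signatures, the usual suspects are a mismatch between the names and the first CSV column, or a wrong starting column for the lineage fields.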

This database 'taxdb.lca.json' can be used for search and gather as above:

In [None]:
!sourmash gather fake-metagenome.fa.sig taxdb.lca.json

...but can also be used for taxonomic summarization:

In [None]:
!sourmash lca summarize --query fake-metagenome.fa.sig --db taxdb.lca.json
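
The idea behind taxonomic summarization can be sketched in a few lines: each hash in the query is looked up in the database, the lowest common ancestor (LCA) of all lineages containing that hash is computed, and hashes are counted per taxon. The code below is a toy illustration under those assumptions, with three-rank lineages and invented hash assignments. The function names are hypothetical and the real command's output and counting rules may differ.

```python
from collections import Counter

def lowest_common_ancestor(lineages):
    """Longest shared prefix of several lineages (tuples of rank names)."""
    lca = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca.append(ranks[0])
    return tuple(lca)

def summarize(query_hashes, hash_to_lineages):
    """Count query hashes at each LCA, crediting every ancestor as well."""
    counts = Counter()
    for h in query_hashes:
        lineages = hash_to_lineages.get(h)
        if not lineages:
            continue  # hash not present in the database
        lca = lowest_common_ancestor(lineages)
        for depth in range(1, len(lca) + 1):
            counts[lca[:depth]] += 1
    return counts

# Toy lineages and hash assignments (hypothetical data).
shew = ("Bacteria", "Proteobacteria", "Shewanella")
ecoli = ("Bacteria", "Proteobacteria", "Escherichia")
arch = ("Archaea", "Euryarchaeota", "Methanococcus")
hash_to_lineages = {1: [shew], 2: [shew, ecoli], 3: [arch], 4: [shew, arch]}

counts = summarize({1, 2, 3, 4, 99}, hash_to_lineages)
for lineage, n in sorted(counts.items()):
    print(n, ";".join(lineage))
```

Hash 2 is shared by two genera, so it can only be assigned at the phylum level. Hash 4 is shared across superkingdoms and contributes nothing, which is why ambiguous k-mers push counts toward higher ranks.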