kProcessor/kSpider/sourmash thoughts for downsampling/containment matrix #90

mr-eyes · 2021-10-08T19:00:45Z

Extending sourmash-bio/sourmash#1750, I am copying a conversation between @ctb and @drtamermansour, to be broken down into tasks later.

Slack Conversation

Tamer Mansour 8:38 AM
@titus @taylorreiter We can solve this problem VERY efficiently by using kProcessor/kSpider.

Titus Brown 8:39 AM
it’s not hard to do in sourmash, either, although I suspect kProcessor/kSpider has specialized lookup tables that make it even faster.
8:39
the issue is integrating it into the CLI.

Tamer Mansour 8:41 AM
kProcessor/kSpider can compute the pairwise shared k-mers for 20k genes in a couple of GB of space and a few minutes
8:42
We can prepare the output in a format that sourmash can read
8:44
All we need is to implement a simple parser for sourmash signature files
8:45
In the CLI, I think sourmash can call kProcessor/kSpider under the hood

Titus Brown 8:46 AM
so, two issues here -
Rob is talking about (potentially) very large metagenomes, and the sourmash downsampling approach is probably quite important for his application, since metagenomes can be so much larger than transcriptomes. e.g. “a couple GB space” rapidly becomes 100s of GB.
As Taylor experienced, sourmash isn’t doing a good job with sparse matrices, either, so if you have 20k x 20k queries you run out of memory regardless :)
In this case, kProcessor/kSpider would be completely replacing sourmash, not producing output for it to read.
I think what we want is something similar to what you suggested - a parser for kProcessor/kSpider to read sourmash sig files.
8:46
Integrating kProcessor into sourmash is not simple or easy, especially since we moved sourmash over to rust.
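To illustrate the sparse-matrix point above: a minimal sketch, not the actual kSpider or sourmash implementation, of counting shared hashes through an inverted index so that only pairs sharing at least one hash are ever stored. The function name `shared_hash_counts` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_hash_counts(samples):
    """Return {(name_a, name_b): n_shared} for pairs sharing >= 1 hash.

    `samples` maps a sample name to a set of integer hash values.
    Pairs with no shared hashes are never stored, so the result grows
    with the number of overlapping pairs rather than with n * n.
    """
    # Inverted index: hash value -> names of the samples containing it.
    index = defaultdict(list)
    for name in sorted(samples):
        for hashval in samples[name]:
            index[hashval].append(name)

    counts = defaultdict(int)
    for names in index.values():
        # Names were appended in sorted order, so every pair is (a, b) with a < b.
        for a, b in combinations(names, 2):
            counts[(a, b)] += 1
    return dict(counts)


# Toy data (made up): s1 and s2 share two hashes, s3 shares none.
toy_samples = {
    "s1": {1, 2, 3, 4},
    "s2": {3, 4, 5},
    "s3": {99},
}
print(shared_hash_counts(toy_samples))
```

With 20k samples a dense 20k x 20k matrix of 8-byte floats is about 3.2 GB, whereas a dictionary like this only pays for pairs that actually overlap.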

Tamer Mansour 8:50 AM
Sourmash does not need to integrate kProcessor; it can just use it as a third-party tool. Sourmash would do the downsampling, prepare the signature files, call the kProcessor/kSpider module, and finally present the output through the sourmash visualization scripts.

Titus Brown 8:50 AM
> call kProcessor/kSpider module
8:51
if we did it that way, sourmash would now include kProcessor/kSpider as a dependency 🙂

Tamer Mansour 8:51 AM
yes
8:51
it is a python package
8:52
This is what kProcessor is made for 🙂

Titus Brown 8:54 AM
I don’t think that’s a good idea; sourmash is pretty strict about versioning and dependencies.
It should be pretty straightforward to have kProcessor read sourmash sig files (it’s just k-mers and hashes!), do the comparison, and output a numpy matrix that can be read by sourmash plot. In this case I’m pretty sure Rob (and Taylor) don’t want to use the viz tools, anyway, which won’t scale to that number of samples.
8:55
If you put together a demo of the 20kx transcript query somewhere, we can suggest it, even without the downsampling.
8:56
If we had an extensions framework in sourmash, could do it that way, too.
8:56
But I don’t think we want sourmash to have kProcessor as a required dependency.
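The "read hashes, do the comparison, output a matrix" step can be sketched roughly as follows. This is an illustration only: it assumes each sketch has already been reduced to a set of hash values with the same ksize and scaled, and the names `containment` and `containment_matrix` are hypothetical. The nested list converts directly with `numpy.array(matrix)`; the exact file layout that `sourmash plot` expects is not shown here.

```python
def containment(a, b):
    """Fraction of the hashes in `a` that are also in `b` (0.0 if `a` is empty)."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def containment_matrix(sketches):
    """Return (labels, matrix) with matrix[i][j] = containment of i in j.

    `sketches` maps a label to a set of integer hash values.
    Containment is asymmetric, so the matrix is generally not symmetric.
    """
    labels = sorted(sketches)
    matrix = [
        [containment(sketches[row], sketches[col]) for col in labels]
        for row in labels
    ]
    return labels, matrix


# Toy data (made up): "b" is fully contained in "a"; "c" overlaps nothing.
toy_sketches = {"a": {1, 2, 3, 4}, "b": {3, 4}, "c": {7}}
labels, matrix = containment_matrix(toy_sketches)
for label, row in zip(labels, matrix):
    print(label, row)
```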

Tamer Mansour 8:57 AM
Sure, we can prep a demo. But our current indexing cannot handle 20k whole datasets.
8:57
Mostafa is working on a new indexing algorithm to do so in the near future

Titus Brown 8:58 AM
So this seems like a good use case for future development, but it’s not something we should suggest to Rob right now ;)

Tamer Mansour 8:59 AM
We can implement downsampling in kProcessor. That should be even easier than developing a parser for sourmash signatures
9:00
I can make something to try next week

Titus Brown 9:00 AM
yes, it’s easy, but now you have distinct code bases and you don’t get the advantage of all the sourmash signature manipulation utilities. If it’s simple to parse sourmash signatures (which it should be - they’re “just” JSON, plus we have a sourmash API to load it) then might as well add that too.
9:01
If nothing else, it’s a good way to test your downsampling code in kProcessor by running the same operations in kProcessor directly vs loading with sourmash.
9:02
(the downsampling code is now pretty simple in sourmash, but it took a while to get there, and we have a LOT of tests for it. so it’s pretty robustly tested. No reason to discard that.)
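As a rough illustration of the "just JSON" point, here is a minimal standalone reader that uses only the standard library. The field names (`signatures`, `mins`, `abundances`, `ksize`, `molecule`, `num`, `max_hash`) are assumptions about the signature JSON layout and should be checked against a real file; `iter_sketches` is a hypothetical name, and the sourmash loading API remains the safer route.

```python
import gzip
import json
import os
import tempfile


def iter_sketches(path):
    """Yield one dict per sketch found in a sourmash-style JSON signature file."""
    # Signature files may be gzip-compressed; check the two gzip magic bytes.
    with open(path, "rb") as fp:
        is_gzip = fp.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as fp:
        records = json.load(fp)

    for record in records:  # one entry per signature
        for sketch in record.get("signatures", []):  # one entry per sketch
            yield {
                "name": record.get("name", ""),
                "ksize": sketch["ksize"],
                "moltype": sketch.get("molecule", "DNA"),
                "num": sketch.get("num", 0),
                "max_hash": sketch.get("max_hash", 0),
                "hashes": list(sketch["mins"]),
                # None when the sketch was built without abundance tracking.
                "abundances": sketch.get("abundances"),
            }


# Toy file with made-up values: one signature holding two sketches.
toy_records = [{
    "name": "toy",
    "signatures": [
        {"ksize": 31, "molecule": "DNA", "num": 0, "max_hash": 1000,
         "mins": [10, 20], "abundances": [1, 3]},
        {"ksize": 21, "molecule": "DNA", "num": 0, "max_hash": 1000,
         "mins": [5]},
    ],
}]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.sig")
    with open(toy_path, "w", encoding="utf-8") as fp:
        json.dump(toy_records, fp)
    parsed = list(iter_sketches(toy_path))

for sketch in parsed:
    print(sketch["ksize"], sketch["hashes"], sketch["abundances"])
```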

Tamer Mansour 9:03 AM
This is also a good solution

Titus Brown 9:03 AM
also check out the sourmash sketch documentation. Soooo maaaaany oppppppptions to implement. Ugh.

Tamer Mansour 9:05 AM
I was thinking of a very simple approach: keeping one k-mer out of every 1000 while reading the input sequences. That is it 😄
👍
1

9:05
But using sourmash is a better idea

Titus Brown 9:06 AM
I see value in both, TBH. No reason to force people to jump through hoops either way, and as you say, the code is simple for scaled downsampling.
👍
1
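The two downsampling ideas discussed above can be contrasted in a small sketch. Everything here is an assumption for illustration: `hash_kmer` uses BLAKE2b as a stand-in (sourmash itself uses MurmurHash3), reverse-complement canonicalization is ignored, and the function names are hypothetical.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every k-mer of `seq` in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Stand-in 64-bit hash; NOT the hash function sourmash uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def every_nth(seq, k, n):
    """Positional downsampling: keep the hash of every n-th k-mer read."""
    return {hash_kmer(km) for i, km in enumerate(kmers(seq, k)) if i % n == 0}


def scaled_sample(seq, k, scaled):
    """Hash-threshold downsampling: keep hashes below MAX_HASH / scaled."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


# Same sequence, read with a one-base offset.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(5000))
shifted = "A" + seq
k = 21

print("positional overlap:",
      len(every_nth(seq, k, 100) & every_nth(shifted, k, 100)))
print("scaled sample preserved:",
      scaled_sample(seq, k, 100) <= scaled_sample(shifted, k, 100))
```

Because the threshold depends only on the hash value, a k-mer shared by two datasets is kept or dropped consistently in both, so the samples stay comparable; positional sampling gives no such guarantee, since the selection depends on where reading starts.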

Titus Brown 9:28 AM
One other thought - the seq-to-hashes stuff that @mr-eyes implemented for sourmash could be used directly by kProcessor to build scaled=1 dataframes in DNA space (which you already have working) as well as translated and protein queries. Again, ultimately you probably want this in kProcessor directly, but it’s a pretty simple call to sourmash to get the functionality working right now.

ctb · 2021-10-09T12:09:44Z

note that sourmash is lower case 😁

ctb · 2021-10-11T15:29:37Z

Some sourmash code - haven't tested it, but it should mostly work :)

Load signatures from ...anything - a .sig file, a .zip file, a directory:

>>> import sourmash
>>> loaded_sigs = sourmash.load_file_as_signatures(fpath)

You might want to select out only the scaled signatures, since num signatures are a different beast and can't really be used the way we would like:

>>> loaded_sigs = loaded_sigs.select(scaled=True)

Retrieve sketches:

>>> for ss in loaded_sigs:
...   mh = ss.minhash

Retrieve ksize and moltype and scaled/num from the sketches:

>>> ksize = mh.ksize
>>> moltype = mh.moltype # 'DNA', 'protein', 'dayhoff', 'hp'
>>> scaled = mh.scaled # if 0, this is a 'num' sketch

Get actual hashes:

>>> for hashval in mh.hashes:
...    print(hashval)

Retrieve abundances:

>>> for hashval, abund in mh.hashes.items():
...    print(hashval, abund)

🎉

drtamermansour · 2021-10-12T00:40:42Z

@ctb
The same sourmash signature might have k-mers of different sizes, right?

ctb · 2021-10-12T02:03:46Z

it's complicated but tl;dr the code above will work, because each signature will be made to have exactly one sketch.

(the signature creation and save code does things slightly differently; see sourmash-bio/sourmash#1647 and sourmash-bio/sourmash#616 esp., but the load code splits it out so it's one signature for one sketch, and each sketch has one ksize.)

you might also be getting confused because a single .sig file can contain many different signatures, as well as signatures with multiple sketches.

so, as I said, confusing. but the code above will work.
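Since each loaded sketch carries exactly one ksize, a downstream tool would presumably want to compare only sketches with matching parameters. A minimal sketch of that grouping, assuming each sketch has been reduced to a `(name, ksize, moltype, hashes)` tuple (a hypothetical record layout, as is the name `group_by_params`):

```python
from collections import defaultdict


def group_by_params(sketches):
    """Group (name, ksize, moltype, hashes) records by (ksize, moltype).

    Hashes computed with different k-mer sizes or molecule types are not
    comparable, so each group can be compared independently.
    """
    groups = defaultdict(dict)
    for name, ksize, moltype, hashes in sketches:
        groups[(ksize, moltype)][name] = set(hashes)
    return dict(groups)


# Toy records (made up): two genomes, each sketched at k=21 and k=31.
toy_records = [
    ("g1", 21, "DNA", [1, 2, 3]),
    ("g1", 31, "DNA", [10, 11]),
    ("g2", 21, "DNA", [2, 3, 4]),
    ("g2", 31, "DNA", [11, 12]),
]
grouped = group_by_params(toy_records)
for params, members in sorted(grouped.items()):
    print(params, sorted(members))
```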

mr-eyes changed the title ~~kProcessor/kSpider/Sourmash thoughts for downsampling/containment matrix~~ kProcessor/kSpider/sourmash thoughts for downsampling/containment matrix Oct 9, 2021
