Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kProcessor/kSpider/sourmash thoughts for downsampling/containment matrix #90

Open
mr-eyes opened this issue Oct 8, 2021 · 4 comments
Open

Comments

@mr-eyes
Copy link
Member

mr-eyes commented Oct 8, 2021

Extending sourmash-bio/sourmash#1750, I am copying a conversation between @ctb and @drtamermansour to be detailed later into tasks.

Slack Conversation

Tamer Mansour 8:38 AM
@titus @taylorreiter We can solve this problem VERY efficiently by using kProcessor/kSpider.

Titus Brown:speech_balloon: 8:39 AM
it’s not hard to do in sourmash, either, although I suspect kProcessor/kSpider has specialized lookup tables that make it even faster.
8:39
the issue is integrating it into the CLI.

Tamer Mansour 8:41 AM
kProcessor/kSpider can perform pairwise calculation of shared kmers for 20k genes in a couple GB space and few minutes
8:42
We can prepare the output in a format that sourmash can read
8:44
All what we need is to implement a simple parser for sourmash signature files
8:45
In the CLI, i think sourmash can call kProcessor/kSpider under the hood

Titus Brown:speech_balloon: 8:46 AM
so, two issues here -
Rob is talking about (potentially) very large metagenomes, and the sourmash downsampling approach is probably quite important for his application, since metagenomes can be so much larger than transcriptomes. e.g. “a couple GB space” rapidly becomes 100s of GB.
As Taylor experienced, sourmash isn’t doing a good job with sparse matrices, either, so if you have 20k x 20k queries you run out of memory regardless :)
In this case, kProcessor/kSpider would be completely replacing sourmash, not producing output for it to read.
I think what we want is something similar to what you suggested - a parser for kProcessor/kSpider to read sourmash sig files. (edited)
8:46
Integrating kProcessor into sourmash is not simple or easy, especially since we moved sourmash over to rust.

Tamer Mansour 8:50 AM
Sourmash does not need to integrate kProcessor. Just use it as a third party tool. Sourmash has to do the downsampling, prep the signature files, call kProcessor/kSpider module, and finally present the output through the sourmash visualization scripts (edited)

Titus Brown:speech_balloon: 8:50 AM
call kProcessor/kSpider module
8:51
if we did it that way, sourmash would now include kProcessor/kSpider as a dependency 🙂

Tamer Mansour 8:51 AM
yes
8:51
it is a python package
8:52
This is what kProcessor is made for 🙂

Titus Brown:speech_balloon: 8:54 AM
I don’t think that’s a good idea; sourmash is pretty strict about versioning and dependencies.
It should be pretty straightforward to have kProcessor read sourmash sig files (it’s just k-mers and hashes!), do the comparison, and output a numpy matrix that can be read by sourmash plot. In this case I’m pretty sure Rob (and Taylor) don’t want to use the viz tools, anyway, which won’t scale to that number of samples.
8:55
If you put together a demo of the 20kx transcript query somewhere, we can suggest it, even without the downsampling.
8:56
If we had an extensions framework in sourmash, could do it that way, too.
8:56
But I don’t think we want sourmash to have kProcessor as a required dependency.

Tamer Mansour 8:57 AM
Sure we can prep a demo. But our current indexing can not make 20k whole datasets.
8:57
Mostafa is working on the new indexing algorithm to do so in the soon future

Titus Brown:speech_balloon: 8:58 AM
So this seems like a good use case for future development, but it’s not something we should suggest to Rob right now ;)

Tamer Mansour 8:59 AM
We can implement downsampling in kProcessor. That should be even easier than developing a parser for sourmash signatures
9:00
I can make something to try next week

Titus Brown:speech_balloon: 9:00 AM
yes, it’s easy, but now you have distinct code bases and you don’t get the advantage of all the sourmash signature manipulation utilities. If it’s simple to parse sourmash signatures (which it should be - they’re “just” JSON, plus we have a sourmash API to load it) then might as well add that too.
9:01
If nothing else, it’s a good way to test your downsampling code in kProcessor by running the same operations in kProcessor directly vs loading with sourmash.
9:02
(the downsampling code is now pretty simple in sourmash, but it took a while to get there, and we have a LOT of tests for it. so it’s pretty robustly tested. No reason to discard that.)

Tamer Mansour 9:03 AM
This is also a good solution

Titus Brown:speech_balloon: 9:03 AM
also check out the sourmash sketch documentation. Soooo maaaaany oppppppptions to implement. Ugh.

Tamer Mansour 9:05 AM
I was thinking of a very simple approach. Incorporating one kmer every 1000 while reading the input sequences. That is it 😄
👍
1

9:05
But using sourmash is a better idea

Titus Brown:speech_balloon: 9:06 AM
I see value in both, TBH. No reason to force people to jump through hoops either way, and as you say, the code is simple for scaled downsampling.
👍
1

Titus Brown:speech_balloon: 9:28 AM
One other thought - the seq-to-hashes stuff that @mr-eyes implemented for sourmash could be used directly by kProcessor to build scaled=1 dataframes in DNA space (which you already have working) as well as translated and protein queries. Again, ultimately you probably want this in kProcessor directly, but it’s a pretty simple call to sourmash to get the functionality working right now.

@ctb
Copy link
Member

ctb commented Oct 9, 2021

note that sourmash is lower case 😁

@mr-eyes mr-eyes changed the title kProcessor/kSpider/Sourmash thoughts for downsampling/containment matrix kProcessor/kSpider/sourmash thoughts for downsampling/containment matrix Oct 9, 2021
@ctb
Copy link
Member

ctb commented Oct 11, 2021

Some sourmash code - haven't tested it, but it should mostly work :)

Load signatures from ...anything - a .sig file, a .zip file, a directory:

>>> loaded_sigs = sourmash.load_file_as_signatures(fpath)

You might want to select out only the scaled signatures, since num signatures are a different beast and can't really be used the way we would like:

>>> loaded_sigs = loaded_sigs.select(scaled=True)

Retrieve sketches:

>>> for ss in loaded_sigs:
...   mh = ss.minhash

Retrieve ksize and moltype and scaled/num from the sketches:

>>> ksize = mh.ksize
>>> moltype = mh.moltype # 'DNA', 'protein', 'dayhoff', 'hp'
>>> scaled = mh.scaled # if 0, this is a 'num' sketch

Get actual hashes:

>>> for hashval in mh.hashes:
...    print(hashval)

Retrieve abundances:

>>> for hashval, abund in mh.hashes.items()
...    print(hashval, abund)

🎉

@drtamermansour
Copy link
Member

@ctb
The same sourmash signature might have kmers of different sizes, is not it?

@ctb
Copy link
Member

ctb commented Oct 12, 2021

it's complicated but tl;dr the code above will work, because each signature will be made to have exactly one sketch.

(the signature creation and save code does things slightly differently; see sourmash-bio/sourmash#1647 and sourmash-bio/sourmash#616 esp), but the load code splits it out so it's one signature for one sketch, and each sketch has one ksize.)

you might also be getting confused because a single .sig file can contain many different signatures, as well as signatures with multiple sketches.

so, as I said, confusing. but the code above will work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants