Is there a way to go from the hash value to the k-mer sequence? #483
On Tue, May 29, 2018 at 10:36:02PM -0700, Luiz Irber wrote:
No (yet). See #211
There is a recent PR with a similar structure (hashes to `ID` in reads) in #477, but it's not going to be merged. #482 implements the changes needed in the C++/Cython layers.
But since it's something people are asking more and more, probably it's a good idea to implement =]
I've needed this occasionally myself, but not frequently enough to make
it part of 'sourmash compute' which is what #477 does. Curious what
the use case is here? A good reason to do this would help immensely.
well let's see
right! (1) we have a script for, somewhere... @taylorreiter do you remember where this is? (The project really needs to figure out a little utility script solution... see #201)
For both (1) and (2), I think it's important to realize that the power of MinHash comes from the fact that these k-mers don't mean anything. So looking at them directly can be useful for developing intuition but is not so useful for anything else. In particular, they are definitely not seeds and shouldn't be used as such.
For example, there are two reasons you might get a k-mer match: one is that there are lots of exact k-mer matches due to a long stretch of high similarity! the other is that there is a lot of low similarity and statistically you're going to find a few k-mers with high similarity. In either case, there is no guarantee of locality (or non-locality) in the k-mers that are chosen by MinHash, so they could be right next to each other and ignoring patches of k-mers elsewhere, OR they could be "seeding" patches of high similarity that are spread throughout.
There are other, better ways to look systematically at all shared k-mers (this is enabled by khmer and other projects) and quickly do alignments (mashmap and nucmer before that). That software is very fast once you've identified which samples need to be aligned, and it's what we've used.
tl;dr? use sourmash to figure out which samples to compare using a more detailed approach, but please don't pay any attention at all to which specific k-mers are chosen for comparison purposes - they are intentionally meaningless.
What about for gene expression analyses? I'm hashing RNA expression signatures and it would be useful to map those k-mers back to genes, even probabilistically like kallisto/salmon's "transcript equivalency" count values.
On Mon, Jul 16, 2018 at 09:31:48AM -0700, Olga Botvinnik wrote:
What about for gene expression analyses? I'm hashing RNA expression signatures and it would be useful to map those k-mers back to genes, even probabilistically like kallisto/salmon's "transcript equivalency" count values.
right - this has come up before ;). since sourmash does not do any kind
of batch correction and many transcripts go unsampled, I think the right
answer is to use kallisto or salmon for this...
I continue to be resistant to the notion that we treat the particular k-mers
chosen at random as meaningful or representative in any specific way, and I
think it's actively misleading to do so. but my resistance is being worn
down because so many people want to do it and it's not my responsibility to
police other people :).
from a practical perspective, then, I would like to have verbatim k-mers
be optional in signatures, in order to keep the file size down when we
are building large databases.
As @luizirber mentioned, he implemented a way to save the k-mer sequence underlying each hash in #477. We tried this on a metagenome and it "worked", but it wasn't very helpful: because the signatures are scaled down to 1/2000 (or 1/10000 etc.), the retained sequences give only a very limited picture.
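To make the "limited picture" concrete, here is a minimal sketch of scaled subsampling on a random toy sequence. It assumes a stand-in hash built from `hashlib` (this is NOT the hash sourmash uses), and the names `toy_hash`, `kmers_of`, and `scaled_sketch` are hypothetical helpers for illustration only. It also skips canonical (strand-independent) k-mers for brevity.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of a 64-bit hash space


def toy_hash(kmer: str) -> int:
    """Stand-in 64-bit hash; NOT the hash function sourmash uses."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def kmers_of(seq: str, k: int):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def scaled_sketch(seq: str, k: int, scaled: int) -> dict:
    """Keep k-mers whose hash is below MAX_HASH / scaled, as hash -> k-mer."""
    threshold = MAX_HASH // scaled
    kept = {}
    for kmer in kmers_of(seq, k):
        h = toy_hash(kmer)
        if h < threshold:
            kept[h] = kmer
    return kept


rng = random.Random(0)  # fixed seed so the toy sequence is reproducible
genome = "".join(rng.choices("ACGT", k=100_000))
sketch = scaled_sketch(genome, k=21, scaled=2000)
print(f"{len(sketch)} of {len(genome) - 21 + 1} k-mers retained")
```

With `scaled=2000`, roughly one k-mer in 2000 survives, and which ones survive depends only on their hash values, not on where they sit in the sequence.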
Not entirely on topic, but: here is a script that outputs reads or sequences containing any hashes from a signature.
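The linked script is not reproduced here. As a rough illustration of the idea only, a filter of this kind could look like the sketch below. It assumes reads arrive as `(name, sequence)` pairs and uses a stand-in `hashlib`-based hash rather than the signature's real hash function; `toy_hash` and `reads_with_hashes` are hypothetical names.

```python
import hashlib


def toy_hash(kmer: str) -> int:
    """Stand-in 64-bit hash; a real script would use the signature's own hash."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def reads_with_hashes(reads, query_hashes, k):
    """Yield (name, seq) for each read with a k-mer whose hash is in query_hashes."""
    for name, seq in reads:
        for i in range(len(seq) - k + 1):
            if toy_hash(seq[i:i + k]) in query_hashes:
                yield name, seq
                break  # one hit is enough; move on to the next read


reads = [
    ("read1", "ACGTACGTTGCA"),
    ("read2", "TTTTTTTTTTTT"),
    ("read3", "GGACGTTGCAGG"),
]
query = {toy_hash("CGTTGCA")}  # pretend this hash came from a signature
for name, seq in reads_with_hashes(reads, query, k=7):
    print(name, seq)
```

Because hashing is one-way, the filter has to re-hash every k-mer of every read and check set membership. It cannot start from the hash and jump to the sequence.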
Related: #678 Here's a potential use case: single-cell ATAC-seq data shows open chromatin regions, and people are often interested in "what TF motif(s) are available in this cell type?" The processing is rather long and laborious, involving alignment, then peak calling, and then using the peak calls to find similarities. I think the alignment could be skipped and the sequences could be used directly to find cell-cell similarities and cluster them. Given the output k-mers, one could find the most abundant k-mers and match them to e.g. JASPAR or other databases to find related TFs. ... something to think about!
hi olga, re the use case - do you see subsampling as being problematic here?
my current approach with @taylorreiter is this: we have started using spacegraphcats [ref] to index large data sets and retrieve the neighborhoods of specific hashes. For an identified hash-of-interest, we can get a guarantee on retrieval (has to do with recovering everything within the cDBG within some radius).
but, this is not what you're proposing above, because we are ok with the hashes of interest in our case being a subsample of all possible hashes of interest. in your case it seems like you'd really want to guarantee that you had all interesting k-mers. you might be interested in looking at kevlar [ref] for ideas.
in sum, there's lots of uses for k-mers and De Bruijn graphs and cDBGs, but sourmash is more about subsampling than k-mers and graphs. Maybe we need to revisit these functions in libraries like khmer instead?
True, the subsampling may be problematic: to find the TF binding sites, we'd want all the relevant reads. One option is to use the hashes to find cell-cell similarities, build a KNN graph and do Leiden clustering, then extract hashes present in only certain clusters, and extract the reads containing these hashes. Re finding reads, I think I'm doing something wrong because when I use
hi @olgabot I can look into this but could you give some guidance as to a quick test case that is problematic? The notebook looks like it might take a long time to run. (Please do just tell me if it's not and I'll go ahead and do it :)
It shouldn't take too long! The fastqs are single cells so they're pretty
small.
…On Fri, Aug 23, 2019, 17:39 C. Titus Brown ***@***.***> wrote:
hi @olgabot <https://github.com/olgabot> I can look into this but could
you give some guidance as to a quick test case that is problematic? The
notebook looks like it might take a long time to run. (Please do just tell
me if it's not and I'll go ahead and do it :)
See #724 for scripts that convert between k-mers and hashes. Still WIP.
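The scripts in #724 are not shown in this thread. The general technique can be sketched as follows: going from k-mer to hash is a direct computation, while going from hash to k-mer needs a lookup table built by hashing every k-mer in some reference sequence. The sketch assumes canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) and a stand-in `hashlib` hash; all function names are hypothetical and none of this is sourmash's API.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def toy_hash(kmer: str) -> int:
    """Stand-in 64-bit hash; NOT the hash function sourmash uses."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def hash_to_kmer_table(seq: str, k: int) -> dict:
    """Build a reverse lookup: hash value -> canonical k-mer found in seq."""
    table = {}
    for i in range(len(seq) - k + 1):
        kmer = canonical(seq[i:i + k])
        table[toy_hash(kmer)] = kmer
    return table


table = hash_to_kmer_table("ACGTTGCAAC", k=5)
wanted = toy_hash(canonical("GTTGC"))  # k-mer -> hash is direct
print(table.get(wanted))  # hash -> k-mer only works via the table
```

The table can only answer for hashes whose k-mers appeared in the sequences it was built from, so a hash from a signature maps back to a k-mer only if the original (or a matching) sequence is still available.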
I.e. any switch that tracks which k-mer sequence is present in the signature?
Thank you.