Is there a way to go from the hash value to the k-mer sequence? #483

phiweger · 2018-05-30T04:04:03Z

I.e. any switch that tracks which k-mer sequence is present in the signature?

Thank you.

luizirber · 2018-05-30T05:36:02Z

No (yet). See #211

There is a recent PR with a similar structure (hashes to ID in reads) in #477, but it's not going to be merged. #482 implements the changes needed in the C++/Cython layers.

But since it's something people are asking more and more, probably it's a good idea to implement =]

ctb · 2018-05-30T16:38:26Z

On Tue, May 29, 2018 at 10:36:02PM -0700, Luiz Irber wrote: No (yet). See #211 There is a recent PR with a similar structure (hashes to `ID` in reads) in #477, but it's not going to be merged. #482 implements the changes needed in the C++/Cython layers. But since it's something people are asking more and more, probably it's a good idea to implement =]

I've needed this occasionally myself, but not frequently enough to make it part of 'sourmash compute' which is what #477 does. Curious what the use case is here? A good reason to do this would help immensely.

phiweger · 2018-05-31T04:32:55Z

well let's see

it would be nice to align the kmers to the target genome to get a feeling for MinHash "kmer sampling" -- this is probably just instructional
I am looking into pangenomes at the moment, trying to quickly match up genomes on shared hashes, and use more costly aligners around those seeds -- the idea being that you not only want the similarity of two genomes (or 1000) but what actually it is (proteins, ncRNA ...) that is similar. However, you'd need to know where the minhashed kmer is on the genome to do this.

ctb · 2018-05-31T13:07:49Z

right! (1) we have a script for, somewhere... @taylorreiter do you remember where this is? (The project really needs to figure out a little utility script solution... see #201)

For both (1) and (2), I think it's important to realize that the power of MinHash comes from the fact that these k-mers don't mean anything. So looking at them directly can be useful for developing intuition but is not so useful for anything else. In particular, they are definitely not seeds and shouldn't be used as such.

For example, there are two reasons you might get a k-mer match: one is that there are lots of exact k-mer matches due to a long stretch of high similarity! the other is that there is a lot of low similarity and statistically you're going to find a few k-mers with high similarity. In either case, there is no guarantee of locality (or non-locality) in the k-mers that are chosen by MinHash, so they could be right next to each other and ignoring patches of k-mers elsewhere, OR they could be "seeding" patches of high similarity that are spread throughout.

There are other, better ways to look systematically at all shared k-mers (this is enabled by khmer and other projects) and quickly do alignments (mashmap and nucmer before that). That software is very fast once you've identified which samples need to be aligned, and it's what we've used.

tl;dr? use sourmash to figure out which samples to compare using a more detailed approach, but please don't pay any attention at all to which specific k-mers are chosen for comparison purposes - they are intentionally meaningless.

olgabot · 2018-07-16T16:31:48Z

What about for gene expression analyses? I'm hashing RNA expression signatures and it would be useful to map those kmers back to genes, even probabilistically like kalliso/salmon's "transcript equivalency" count values.

ctb · 2018-07-17T11:25:27Z

On Mon, Jul 16, 2018 at 09:31:48AM -0700, Olga Botvinnik wrote: What about for gene expression analyses? I'm hashing RNA expression signatures and it would be useful to map those kmers back to genes, even probabilistically like kalliso/salmon's "transcript equivalency" count values.

right - this has come up before ;). since sourmash does not do any kind of batch correction and many transcripts go unsampled, I think the right answer is to use kallisto or salmon for this... I continue to be resistant to the notion that we treat the particular k-mers chosen at random as meaningful or representative in any specific way, and I think it's actively misleading to do so. but my resistance is being worn down because so many people want to do it and it's not my responsibility to police other people :). from a practical perspective, then, I would like to have verbatim k-mers be optional in signatures, in order to keep the file size down when we are building large databases.

taylorreiter · 2018-07-17T16:19:47Z

As @luizirber mentioned, he had implemented a way to save which k-mer sequence is underlying a hash in #477. We tried this on a metagenome and it "worked", but it wasn't incredibly helpful. Because the signatures are scaled down to 1/2000 (or 1/10000 etc), knowing these sequences wasn't terribly helpful because it gave such a limited picture.

I wasn't able to find the script mentioned by @ctb here.

ctb · 2018-07-18T13:48:49Z

Not entirely on topic, but: here is a script that outputs reads or sequences containing any hashes from a signature.

output-sequences-with-scaled-hashes.txt

olgabot · 2019-07-31T21:10:39Z

Related: #678

Here's a potential use case:

In single-cell ATAC seq data shows open chromatin regions, and often people are interested in "what TF motif(s) are available in this cell type?" The processing is rather long and laborious, involving alignment and then calling peaks and then using the peak calls to find similarities. I think the alignment could be skipped and the sequences could be used directly to find cell-cell similarities and cluster them.

Given the output k-mers, one could find most abundant k-mers and match them to e.g. JASPAR or other databases to find related TFs.

... something to think about!

ctb · 2019-08-02T11:34:34Z

hi olga, re the use case - do you see subsampling as being problematic here?

my current approach with @taylorreiter is this: we have started using spacegraphcats [ref] to index large data sets and retrieve the neighborhoods of specific hashes. For an identified hash-of-interest, we can get a guarantee on retrieval (has to do with recovering everything within the cDBG within some radius).

but, this is not what you're proposing above, because we are ok with the hashes of interest in our case being a subsample of all possible hashes of interest. in your case it seems like you'd really want to guarantee that you had all interesting k-mers.

you might be interested in looking at kevlar [ref] for ideas.

in sum, there's lots of uses for k-mers and De Bruijn graphs and cDBGs, but sourmash is more about subsampling than k-mers and graphs. Maybe we need to revisit these functions in libraries like khmer instead?

olgabot · 2019-08-21T19:19:36Z

hi olga, re the use case - do you see subsampling as being problematic here?

True, the subsampling may be problematic as to find the TF binding sites, we'd want all the relevant reads. One option is to use the hashes to find cell-cell similarities, build a KNN graph and do leiden clustering, then extract hashes present in only certain clusters, and extract the reads containing these hashes.

Re finding reads, I think I'm doing something wrong because when I use hash_murmur directly on extracted k-mers to hash, they don't match what minhash.add_sequence outputs! I'm trying to understand why ignore_abundance=True and ignore_abundance=False produce vastly different k-nearest neighbor (KNN) graphs, and whether there's difference in coverage of the same k-mers from sample to sample.

`ignore_abundance=True`

This is very close to what we expect for these celltypes, which are very different.

`ignore_abundance=False`

However, if we include abundance, then we get mixing of the celltypes!

the code

Here is the notebook, and here is the most relevant excerpt, which is based on output-sequences-with-scaled-hashes.txt above.

%%time
query_hashes = set(hashes_of_interest.index)
minhash = sig1.minhash.copy_and_clear()


# now, iterate over the input sequences and output those that add
# hashes!
n = 0
m = 0

NOTIFY_EVERY_BP=1e6

all_kmers_to_hashes_of_interest = {}

watermark = NOTIFY_EVERY_BP

with open("bladder_test_cells_different_hashes.fasta", 'w') as f:

    for fastq in fastqs:
        sample_id = os.path.basename(fastq).split("_R")[0]
        for record in screed.open(fastq):
            n += len(record.sequence)
            while n >= watermark:
                sys.stderr.write('... {} {}\r'.format(watermark, fastq))
                watermark += NOTIFY_EVERY_BP

            minhash.add_sequence(record.sequence, force=True)
            hashes = set(minhash.get_hashes())
            match = hashes.intersection(query_hashes)
            if match:
                n_kmers = len(record.sequence) - ksize + 1
                kmers = [record.sequence[i:(i+ksize)] for i in range(n_kmers)]
                hashed_kmers = [hash_murmur(kmer) for kmer in kmers]
                kmer_to_hash = dict(zip(kmers, hashed_kmers))

                kmer_to_hashes_of_interest = {kmer: h for kmer, h in kmer_to_hash.items() if h in query_hashes}
                if len(kmer_to_hashes_of_interest) != len(match):
                    print("Something is weird! Not all the hashes were found...")
                    print(match)
                    break
                all_kmers_to_hashes_of_interest.update(kmer_to_hashes_of_interest)
            if len(all_kmers_to_hashes_of_interest) == len(query_hashes):
                print("Found all hashes! Exiting.")
                break

Is there anything obviously wrong going on here? The kmer size and molecule type are the same, and the default seed of 42 was used.

ctb · 2019-08-24T00:39:45Z

hi @olgabot I can look into this but could you give some guidance as to a quick test case that is problematic? The notebook looks like it might take a long time to run. (Please do just tell me if it's not and I'll go ahead and do it :)

olgabot · 2019-08-26T16:20:45Z

It shouldn't take too long! The fastqs are single cells so they're pretty small.

…

On Fri, Aug 23, 2019, 17:39 C. Titus Brown ***@***.***> wrote: hi @olgabot <https://github.com/olgabot> I can look into this but could you give some guidance as to a quick test case that is problematic? The notebook looks like it might take a long time to run. (Please do just tell me if it's not and I'll go ahead and do it :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#483>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAGE24CVCTKNUGHHNZ3WCGTQGB7NFANCNFSM4FCKILMA> .

ctb · 2019-09-01T22:53:03Z

See #724 for scripts that convert between k-mers and hashes. Still WIP.

ctb mentioned this issue Sep 1, 2019

[WIP] Add utility scripts for converting hash values to k-mers. #724

Closed

5 tasks

ctb mentioned this issue Sep 10, 2019

Better k-mer / hashval extraction functionality. #731

Closed

olgabot mentioned this issue Apr 9, 2020

Input hashes + protein sequences czbiohub-sf/nf-predictorthologs#25

Merged

8 tasks

luizirber mentioned this issue Oct 22, 2020

Expose a function to produce a list of all hashes from a sequence #1214

Closed

ctb mentioned this issue Jan 25, 2021

explain p_match and p_query in sourmash documentation #1289

Closed

ctb mentioned this issue Aug 2, 2021

[MRG] add MinHash.kmers_and_hashes(...) and sourmash sig kmers #1695

Merged

6 tasks

ctb closed this as completed in #1695 Aug 7, 2021

ctb mentioned this issue Jun 2, 2022

Return k-mers in addition to sketch/hash values #2073

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there a way to go from the hash value to the k-mer sequence? #483

Is there a way to go from the hash value to the k-mer sequence? #483

phiweger commented May 30, 2018

luizirber commented May 30, 2018

ctb commented May 30, 2018 via email

phiweger commented May 31, 2018

ctb commented May 31, 2018

olgabot commented Jul 16, 2018

ctb commented Jul 17, 2018 via email

taylorreiter commented Jul 17, 2018

ctb commented Jul 18, 2018

olgabot commented Jul 31, 2019

ctb commented Aug 2, 2019

olgabot commented Aug 21, 2019

ctb commented Aug 24, 2019

olgabot commented Aug 26, 2019 via email

ctb commented Sep 1, 2019

Is there a way to go from the hash value to the k-mer sequence? #483

Is there a way to go from the hash value to the k-mer sequence? #483

Comments

phiweger commented May 30, 2018

luizirber commented May 30, 2018

ctb commented May 30, 2018 via email

phiweger commented May 31, 2018

ctb commented May 31, 2018

olgabot commented Jul 16, 2018

ctb commented Jul 17, 2018 via email

taylorreiter commented Jul 17, 2018

ctb commented Jul 18, 2018

olgabot commented Jul 31, 2019

ctb commented Aug 2, 2019

olgabot commented Aug 21, 2019

ignore_abundance=True

ignore_abundance=False

the code

ctb commented Aug 24, 2019

olgabot commented Aug 26, 2019 via email

ctb commented Sep 1, 2019

`ignore_abundance=True`

`ignore_abundance=False`