with --protein, translate whole sequence and then emit k-mers #664

olgabot · 2019-04-07T20:13:06Z

This may be a solution for #659

Instead of emitting DNA k-mers and then translating them, which for all but frame 0 can leave an incomplete codon at the end that is translated as X, what about translating the entire read/sequence in 6 frames and then emitting k-mers? This would output fewer X amino acids and could be more meaningful for translated nucleotide sequences.


olgabot · 2019-04-07T20:44:56Z

I.e., instead of making DNA k-mers and then translating them, translate the whole sequence and get k-mers from that.
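To make the difference concrete, here is a minimal sketch of the two orderings on the forward strand only. It is not sourmash code: the function names, the example sequence, and the choice to pad an incomplete trailing codon with X are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1) in the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, pad_incomplete):
    """Translate from position 0. A trailing partial codon becomes 'X'
    when pad_incomplete is True and is dropped otherwise (an assumption)."""
    protein = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        if len(codon) < 3:
            if pad_incomplete:
                protein.append("X")
            break
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


def kmers_then_translate(dna, aa_ksize):
    """Hypothetical 'old' ordering: cut DNA k-mers, then translate each in 3 frames."""
    dna_ksize = 3 * aa_ksize
    out = set()
    for start in range(len(dna) - dna_ksize + 1):
        dna_kmer = dna[start:start + dna_ksize]
        for frame in range(3):
            # Frames 1 and 2 leave a partial codon at the end of the DNA k-mer.
            out.add(translate(dna_kmer[frame:], pad_incomplete=True))
    return out


def translate_then_kmers(dna, aa_ksize):
    """Proposed ordering: translate the whole sequence per frame, then cut k-mers."""
    out = set()
    for frame in range(3):
        protein = translate(dna[frame:], pad_incomplete=False)
        for i in range(len(protein) - aa_ksize + 1):
            out.add(protein[i:i + aa_ksize])
    return out


seq = "ATGGCCATTGTAATGGGCCGC"  # made-up example sequence
old_style = kmers_then_translate(seq, 3)
new_style = translate_then_kmers(seq, 3)
print(sorted(k for k in old_style if "X" in k))  # k-mers polluted by partial codons
print(sorted(new_style))
```

In this sketch the X-free k-mers from the first ordering are the same set as the k-mers from the second ordering; the first ordering only adds extra k-mers that end in X.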

ctb · 2021-01-10T14:55:58Z

I implemented this for spacegraphcats, on a trial basis - see spacegraphcats/spacegraphcats#379, script query-unitigs-prot.py.

ctb · 2021-01-10T14:56:31Z

(and it was quite simple and I liked it :)

bluegenes · 2021-06-09T16:54:32Z

@luizirber and I talked through the relevant Rust code -- it seems we're already translating the entire sequence first and then emitting k-mers.

Here we find the three forward frames:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L96-L105

translate to amino acids:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L106

then kmerize and hash:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L108-L111

Then the three revcomp frames are done in the same fashion:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L113-L119

Luiz pointed out that we could improve efficiency by checking for each k-mer (e.g. in a hash table) before hashing (= modify lines 108/116), but otherwise the code is already doing what is suggested here!
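For readers who don't want to open the Rust source, the flow described above can be sketched roughly as follows. This is a Python illustration, not a port of `signature.rs`: the function names are hypothetical, `stand_in_hash` is a placeholder rather than the hash sourmash uses, and the `seen` set is just one possible reading of the "check each k-mer before hashing" idea.

```python
import hashlib
from itertools import product

# Standard genetic code (NCBI translation table 1) in the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(dna):
    """Reverse complement of an uppercase ACGT string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, frame):
    """Translate every complete codon starting at offset `frame`."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def stand_in_hash(kmer):
    """Placeholder 64-bit hash for illustration only (assumption)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def protein_hashes(dna, ksize):
    """Translate 3 forward and 3 revcomp frames, then kmerize and hash.
    K-mers already seen are skipped before hashing."""
    seen = set()
    hashes = set()
    for strand in (dna, revcomp(dna)):
        for frame in range(3):
            protein = translate_frame(strand, frame)
            for i in range(len(protein) - ksize + 1):
                kmer = protein[i:i + ksize]
                if kmer in seen:
                    continue  # skip the hash call for repeated k-mers
                seen.add(kmer)
                hashes.add(stand_in_hash(kmer))
    return hashes, seen


hashes, kmers = protein_hashes("ATGGCCATTGTAATGGGCCGC", ksize=3)
print(len(kmers), "distinct protein k-mers across 6 frames")
```

Whether the lookup in `seen` is actually cheaper than hashing would depend on the hash function and the amount of k-mer repetition, so treat the dedup step as something to benchmark rather than a guaranteed win.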

ctb · 2021-06-25T20:54:33Z

closing, since it seems like this is already happening!

luizirber added enhancement idea labels May 1, 2019

ctb mentioned this issue May 25, 2020

new behavior for protein k-mer size calculations - gathering the issues together. #999

Closed

ctb added the 5.0 issues to address for a 5.0 release label Jan 10, 2021

ctb mentioned this issue May 15, 2021

summary: further improvements to protein handling in sourmash #1525

Open

ctb closed this as completed Jun 25, 2021
