with --protein, translate whole sequence and then emit k-mers #664

Closed
olgabot opened this issue Apr 7, 2019 · 5 comments
Labels
5.0 (issues to address for a 5.0 release), enhancement, idea

Comments

@olgabot
Collaborator

olgabot commented Apr 7, 2019

This may be a solution for #659

Instead of emitting DNA k-mers and then translating them, which for all but frame 0 can leave an incomplete codon at the end that gets translated to X, what about translating the entire read/sequence in 6 frames, and then emitting k-mers? This would output fewer X amino acids and potentially be more meaningful for translated nucleotide sequence.

@olgabot
Collaborator Author

olgabot commented Apr 7, 2019

I.e. instead of making DNA kmers and then translating them:

[Image: slice_dna_then_make_protein@2x]

Translate the whole sequence and get k-mers from that:

[Image: translate_protein_then_kmers@2x]
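
A minimal Python sketch of the two orderings, for illustration only. It is not sourmash's code: the helper names (`translate`, `slice_then_translate`, `translate_then_slice`) are hypothetical, only the three forward frames are shown, and mapping a trailing partial codon to `X` is an assumption made to mirror the behaviour described above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate starting at `frame`; a trailing partial codon becomes 'X' (assumption)."""
    return "".join(
        CODONS.get(dna[i:i + 3], "X") for i in range(frame, len(dna), 3)
    )


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def slice_then_translate(dna, k):
    """Cut DNA k-mers of length 3*k first, then translate each one in frames 0-2."""
    return [translate(window, f) for window in kmers(dna, 3 * k) for f in range(3)]


def translate_then_slice(dna, k):
    """Translate the whole sequence in frames 0-2 first, then cut protein k-mers."""
    out = []
    for f in range(3):
        out.extend(kmers(translate(dna, f), k))
    return out


if __name__ == "__main__":
    dna = "ATGGCCATTGTAATGGGCCGC"
    for fn in (slice_then_translate, translate_then_slice):
        result = fn(dna, 3)
        with_x = sum("X" in km for km in result)
        print(f"{fn.__name__}: {with_x} of {len(result)} k-mers contain X")
```

In this toy model, slicing first adds an `X` to every frame 1 and frame 2 k-mer, whereas translating first confines `X` to the last k-mer of a frame; the X-free k-mers are the same either way.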

@ctb
Contributor

ctb commented Jan 10, 2021

I implemented this for spacegraphcats, on a trial basis - see spacegraphcats/spacegraphcats#379, script query-unitigs-prot.py.

@ctb ctb added the 5.0 issues to address for a 5.0 release label Jan 10, 2021
@ctb
Contributor

ctb commented Jan 10, 2021

(and it was quite simple and I liked it :)

@bluegenes
Contributor

@luizirber and I talked through the relevant Rust code -- it seems we're already translating the entire sequence first and then emitting k-mers.

Here we find the forward 3-frames:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L96-L105

translate to amino acid
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L106

then kmerize and hash.
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L108-L111

Then, the revcomp 3 frames are done in the same fashion:
https://github.com/dib-lab/sourmash/blob/eb2b210d40b1441a1467ec507e920339e0bfe437/src/core/src/signature.rs#L113-L119

Luiz pointed out that we could improve efficiency by checking for each k-mer (e.g. in a hash table) before hashing (= modify lines 108/116), but otherwise the code is already doing what is suggested here!
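
The flow described above (three forward frames, translate, k-merize and hash, then the same for the reverse complement) can be sketched in Python roughly as follows. This is an illustrative sketch, not a port of `signature.rs`: the function names are hypothetical, `stand_in_hash` uses `hashlib` purely as a placeholder for the real hash function, dropping a trailing partial codon is an assumption, and the `seen` set is one possible reading of the "check each k-mer before hashing" idea.

```python
import hashlib
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(dna):
    """Reverse complement of an uppercase ACGT string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only (assumption); unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def stand_in_hash(kmer):
    """Placeholder 64-bit hash; NOT the hash function the real code uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def six_frame_hashes(dna, k):
    """Translate all six frames, emit protein k-mers, hash each distinct k-mer once."""
    seen = set()  # k-mers already hashed, so repeats skip the hash call
    hashes = []
    for strand in (dna, revcomp(dna)):
        for frame in range(3):
            protein = translate(strand[frame:])
            for i in range(len(protein) - k + 1):
                kmer = protein[i:i + k]
                if kmer in seen:
                    continue
                seen.add(kmer)
                hashes.append(stand_in_hash(kmer))
    return hashes


if __name__ == "__main__":
    print(len(six_frame_hashes("ATGGCCATTGTAATGGGCCGC", 3)))
```

Whether the lookup is actually cheaper than rehashing depends on the cost of the hash function and how repetitive the input is, so the `seen` set here is a design option to measure, not a guaranteed win.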

@ctb
Contributor

ctb commented Jun 25, 2021

closing, since it seems like this is already happening!

@ctb ctb closed this as completed Jun 25, 2021