with --protein, translate whole sequence and then emit k-mers #664
Comments
I implemented this for spacegraphcats, on a trial basis - see spacegraphcats/spacegraphcats#379, script
(and it was quite simple and I liked it :)
@luizirber and I talked through the relevant Rust code -- it seems we're already translating the entire sequence first and then emitting k-mers. For the forward 3 frames, we translate to amino acids, then k-merize and hash. The 3 revcomp frames are then handled in the same fashion. Luiz pointed out that we could improve efficiency by checking for each k-mer (e.g. in a hash table) before hashing (= modify lines 108/116), but otherwise the code is already doing what is suggested here!
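For readers without the Rust source open, here is a minimal Python sketch of the flow described in this comment: translate each of the 6 frames in full, k-merize the protein, and check a hash table before hashing. This is not the sourmash implementation. The function names are hypothetical, SHA-1 is only a stand-in for whatever hash the real code uses, and dropping the trailing partial codon is an assumption.

```python
import hashlib
from itertools import product

# Standard genetic code (NCBI translation table 1), in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only (assumption: trailing 1-2 bases are dropped)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Yield the 3 forward-frame proteins, then the 3 reverse-complement ones."""
    dna = dna.upper()
    revcomp = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, revcomp):
        for frame in range(3):
            yield translate(strand[frame:])


def protein_kmer_hashes(dna, ksize):
    """Map each distinct protein k-mer to a 64-bit hash, hashing each k-mer once."""
    seen = {}
    for protein in six_frame_translations(dna):
        for i in range(len(protein) - ksize + 1):
            kmer = protein[i:i + ksize]
            if kmer not in seen:  # hash-table check before hashing
                digest = hashlib.sha1(kmer.encode()).digest()  # stand-in hash
                seen[kmer] = int.from_bytes(digest[:8], "big")
    return seen
```

For example, `list(six_frame_translations("ATGGCCAAATTT"))` gives `["MAKF", "WPN", "GQI", "KFGH", "NLA", "IWP"]`, with no `X` residues, since each frame is translated in full before any k-mer is cut.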
closing, since it seems like this is already happening!
This may be a solution for #659
Instead of emitting DNA k-mers and then translating them, which for all but the 0 frame can run into issues of creating an `X` codon at the end, what about translating the entire read/sequence in 6 frames, and then emitting k-mers? This would have fewer `X` amino acids output and potentially be more meaningful for translated nucleotide sequence.
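To make the difference concrete, here is a small Python sketch contrasting the two orders of operations for frame 1. It is illustrative only: the function names are hypothetical, and padding an incomplete trailing codon to `X` is an assumption about how the k-mer-first approach behaves, based on the description above.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, pad_incomplete):
    """Translate codon by codon; optionally emit 'X' for a trailing partial codon."""
    residues = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        if len(codon) < 3:
            if pad_incomplete:
                residues.append("X")  # assumed behaviour of the k-mer-first path
            continue
        residues.append(CODON_TABLE.get(codon, "X"))
    return "".join(residues)


def kmer_then_translate(dna_kmer, frame):
    """Translate one fixed-length DNA k-mer starting at the given frame offset."""
    return translate(dna_kmer[frame:], pad_incomplete=True)


def translate_then_kmerize(dna, ksize, frame):
    """Translate the whole sequence in one frame, then cut protein k-mers."""
    protein = translate(dna[frame:], pad_incomplete=False)
    return [protein[i:i + ksize] for i in range(len(protein) - ksize + 1)]


dna = "ATGGCCAAATTTGGG"
print(kmer_then_translate(dna[:9], frame=1))          # WPX
print(translate_then_kmerize(dna, ksize=3, frame=1))  # ['WPN', 'PNL']
```

In the first call the 9-base DNA k-mer loses a base to the frame shift, so its last codon is incomplete and becomes `X`. In the second call the bases needed to complete that codon are still available, so the same position is translated as `N`.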