YPares changed the title from "While does count_kmers not return k-mers that are split between two records?" to "Why does count_kmers not return k-mers that are split between two records?" on Feb 4, 2016
Our k-mer counters will only count k-mers contained entirely in a single read or contig. We do this because two reads come from different DNA fragments, and thus the k-mers gained by merging two reads together (e.g., the AAGACTTTAT k-mer from your example) are not guaranteed to actually exist in the DNA that we sequenced. Even if the reads were paired and thus from the same fragment, we would not want to merge the reads when producing k-mers because there is typically an unknown (but estimable) insert between the two reads. For this insert, the fragment sequence is unknown.
Let me know if this was unclear and I can put together a more concrete example.
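To make the point above concrete, here is a minimal sketch in plain Python (not ADAM's actual implementation). The two read sequences are hypothetical, chosen only so that the first ends in AAGA and the second starts with CTTTAT, and the helper names `kmers` and `count_kmers_per_read` are made up for illustration. It counts 10-mers read by read, then shows which extra 10-mers would appear if the two reads were naively concatenated.

```python
from collections import Counter


def kmers(seq, k):
    """Return every k-mer (substring of length k) contained in one sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers_per_read(reads, k):
    """Count k-mers read by read, so no k-mer ever spans two reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Hypothetical reads (assumption): the first ends in AAGA,
# the second starts with CTTTAT.
reads = ["GGCTAAGTCCAAGA", "CTTTATGCGTACCA"]

per_read = count_kmers_per_read(reads, 10)

# Naive alternative: glue the reads together before extracting k-mers.
concatenated = Counter(kmers("".join(reads), 10))

# k-mers that exist only because of the artificial junction between reads.
spurious = set(concatenated) - set(per_read)

print("AAGACTTTAT" in per_read)   # False
print("AAGACTTTAT" in spurious)   # True
```

Every k-mer in `spurious` straddles the junction between the two reads, so nothing in the sequencing data supports it; counting per read avoids reporting such k-mers.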
Thanks for your answer @fnothaft! A follow-up question: is this the case only for count_kmers? I mean, is that kind of treatment (processing data spread over two reads) done elsewhere in ADAM, or is this a general behaviour of the framework?
(Sorry if I'm not clear, I'm only a beginner in genomics analysis. I just have a little experience with Spark, and was under the impression that this kind of treatment was made impossible by the nature of Spark.)
I ran count_kmers (like in the first example in the README.md), and the .sam file in input contains two consecutive records like:
However, when I ask for the 10-mers, I don't get for instance AAGACTTTAT (the last four nucleotides of the first record followed by the first six of the second). Is it intentional?