Why does count_kmers not return k-mers that are split between two records? #930

YPares · 2016-02-04T17:46:59Z

I ran count_kmers (as in the first example in README.md), and the input .sam file contains two consecutive records like:

simread:1:26472783:false    16  1   26472784    60  75M *   0   0   GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA *   NM:i:0  AS:i:75 XS:i:0
simread:1:240997787:true    0   1   240997788   60  75M *   0   0   CTTTATTTTTATTTTTAAGGTTTTTTTTGTTTGTTTGTTTTGAGATGGAGTCTCGCTCCACCGCCCAGACTGGAG *   NM:i:0  AS:i:75 XS:i:39

However, when I ask for the 10-mers, I don't get for instance AAGACTTTAT (the last four nucleotides of the first record followed by the first six of the second).

Is it intentional?


fnothaft · 2016-02-04T17:57:05Z

Hi @YPares !

Our k-mer counters will only count k-mers contained entirely in a single read or contig. We do this because two reads come from different DNA fragments, and thus the k-mers gained by merging two reads together (e.g., the AAGACTTTAT k-mer from your example) are not guaranteed to actually exist in the DNA that we sequenced. Even if the reads were paired and thus from the same fragment, we would not want to merge the reads when producing k-mers because there is typically an unknown (but estimable) insert between the two reads. For this insert, the fragment sequence is unknown.
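To make the per-read behaviour concrete, here is a minimal Python sketch. It is not ADAM's implementation, and the helper names `kmers_in_read` and `count_kmers_per_read` are hypothetical, introduced only for illustration. It slides a window of length k inside each read independently, so no window ever crosses from one record into the next:

```python
from collections import Counter


def kmers_in_read(seq, k):
    """Yield every k-mer that lies entirely within a single sequence."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


def count_kmers_per_read(reads, k):
    """Count k-mers read by read; windows never span two reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers_in_read(read, k))
    return counts


# The two read sequences from the SAM records quoted in the question.
reads = [
    "GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA",
    "CTTTATTTTTATTTTTAAGGTTTTTTTTGTTTGTTTGTTTTGAGATGGAGTCTCGCTCCACCGCCCAGACTGGAG",
]

counts = count_kmers_per_read(reads, 10)

# AAGACTTTAT only appears if the two reads are concatenated,
# which the per-read counter never does.
spans_boundary = "AAGACTTTAT" in (reads[0] + reads[1])
counted = "AAGACTTTAT" in counts
```

In this sketch `spans_boundary` is true while `counted` is false: the k-mer exists only in the artificial concatenation of the two reads, not in either read on its own.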

Let me know if this was unclear and I can put together a more concrete example.

YPares · 2016-02-05T08:39:29Z

Thanks for your answer @fnothaft! A follow-up question: does this apply only to count_kmers? I mean, is that kind of processing (operating on data spread over two reads) done elsewhere in ADAM, or is this general behaviour in the framework?
(Sorry if I'm not clear, I'm only a beginner in genomics analysis. I just have a little experience with Spark, and was under the impression that this kind of processing was made impossible by the nature of Spark.)

fnothaft · 2016-07-06T15:54:17Z

Closing as won't fix.

YPares changed the title ~~While does count_kmers not return k-mers that are split between two records?~~ Why does count_kmers not return k-mers that are split between two records? Feb 4, 2016

fnothaft added the wontfix label Jul 6, 2016

fnothaft closed this as completed Jul 6, 2016
