Why does count_kmers not return k-mers that are split between two records? #930

Closed
YPares opened this Issue Feb 4, 2016 · 3 comments


YPares commented Feb 4, 2016

I ran count_kmers (as in the first example in README.md), and the input .sam file contains two consecutive records:

simread:1:26472783:false    16  1   26472784    60  75M *   0   0   GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA *   NM:i:0  AS:i:75 XS:i:0
simread:1:240997787:true    0   1   240997788   60  75M *   0   0   CTTTATTTTTATTTTTAAGGTTTTTTTTGTTTGTTTGTTTTGAGATGGAGTCTCGCTCCACCGCCCAGACTGGAG *   NM:i:0  AS:i:75 XS:i:39

However, when I ask for the 10-mers, I don't get, for instance, AAGACTTTAT (the last four nucleotides of the first record followed by the first six of the second).

Is it intentional?

@YPares YPares changed the title from While does count_kmers not return k-mers that are split between two records? to Why does count_kmers not return k-mers that are split between two records? Feb 4, 2016


Member

fnothaft commented Feb 4, 2016

Hi @YPares !

Our k-mer counters will only count k-mers contained entirely in a single read or contig. We do this because two reads come from different DNA fragments, and thus the k-mers gained by merging two reads together (e.g., the AAGACTTTAT k-mer from your example) are not guaranteed to actually exist in the DNA that we sequenced. Even if the reads were paired and thus from the same fragment, we would not want to merge the reads when producing k-mers because there is typically an unknown (but estimable) insert between the two reads. For this insert, the fragment sequence is unknown.

Let me know if this was unclear and I can put together a more concrete example.
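To make the explanation above concrete, here is a minimal Python sketch of per-read k-mer counting, using the two read sequences from the question. It is not ADAM's implementation; the function names `kmers` and `count_kmers_per_read` are hypothetical and exist only for illustration. It shows that the 10-mer AAGACTTTAT appears only if the two reads are naively concatenated, and never when each read is processed on its own.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every k-mer that lies entirely inside one sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers_per_read(reads, k):
    """Count k-mers read by read, never spanning a read boundary.

    Illustrative helper only (an assumption, not ADAM's API).
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# The two read sequences from the SAM records in the question.
reads = [
    "GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA",
    "CTTTATTTTTATTTTTAAGGTTTTTTTTGTTTGTTTGTTTTGAGATGGAGTCTCGCTCCACCGCCCAGACTGGAG",
]

k = 10
counts = count_kmers_per_read(reads, k)

# Last four bases of the first read followed by the first six of the second.
boundary = reads[0][-4:] + reads[1][:6]

# Naive concatenation invents k-mers that straddle the boundary.
concatenated_counts = Counter(kmers(reads[0] + reads[1], k))

print(boundary, "per-read count:", counts[boundary])
print(boundary, "concatenated count:", concatenated_counts[boundary])
```

The per-read count for the boundary k-mer is zero because neither read contains it by itself. The concatenated count is non-zero only because joining the strings fabricates a sequence that was never observed in a single DNA fragment.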


YPares commented Feb 5, 2016

Thanks for your answer @fnothaft! A follow-up question: is this the case only for count_kmers? I mean, is that kind of treatment (processing data spread over two reads) done elsewhere in ADAM, or is it a general behaviour of the framework?
(Sorry if I'm not clear, I'm only a beginner in genomics analysis. I just have a little experience with Spark, and was under the impression that this kind of treatment was made impossible by the nature of Spark.)

@fnothaft fnothaft added the wontfix label Jul 6, 2016

Member

fnothaft commented Jul 6, 2016

Closing as won't fix.

@fnothaft fnothaft closed this Jul 6, 2016
