Why does count_kmers not return k-mers that are split between two records? #930

Closed
YPares opened this issue Feb 4, 2016 · 3 comments

YPares commented Feb 4, 2016

I ran count_kmers (as in the first example in README.md), and the input .sam file contains two consecutive records like:

simread:1:26472783:false    16  1   26472784    60  75M *   0   0   GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA *   NM:i:0  AS:i:75 XS:i:0
simread:1:240997787:true    0   1   240997788   60  75M *   0   0   CTTTATTTTTATTTTTAAGGTTTTTTTTGTTTGTTTGTTTTGAGATGGAGTCTCGCTCCACCGCCCAGACTGGAG *   NM:i:0  AS:i:75 XS:i:39

However, when I ask for the 10-mers, I don't get, for instance, AAGACTTTAT (the last four nucleotides of the first record followed by the first six of the second).

Is it intentional?

@YPares YPares changed the title While does count_kmers not return k-mers that are split between two records? Why does count_kmers not return k-mers that are split between two records? Feb 4, 2016
fnothaft (Member) commented Feb 4, 2016

Hi @YPares!

Our k-mer counters will only count k-mers contained entirely in a single read or contig. We do this because two reads come from different DNA fragments, and thus the k-mers gained by merging two reads together (e.g., the AAGACTTTAT k-mer from your example) are not guaranteed to actually exist in the DNA that we sequenced. Even if the reads were paired and thus from the same fragment, we would not want to merge the reads when producing k-mers because there is typically an unknown (but estimable) insert between the two reads. For this insert, the fragment sequence is unknown.
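
To illustrate the point, here is a minimal Python sketch of per-read k-mer counting. It is not ADAM's implementation (ADAM is written in Scala on Spark); the function name `count_kmers_per_read` and the plain list of sequences are assumptions made for this example. It uses the two sequences from the records above and shows that AAGACTTTAT only appears if the reads are artificially concatenated.

```python
from collections import Counter

# SEQ fields of the two SAM records quoted in the question.
reads = [
    "GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA",
    "CTTTATTTTTATTTTTAAGGTTTTTTTTGTTTGTTTGTTTTGAGATGGAGTCTCGCTCCACCGCCCAGACTGGAG",
]


def count_kmers_per_read(reads, k):
    """Count k-mers, never letting a window cross a read boundary.

    Hypothetical helper for illustration only, not an ADAM API.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of length k over this read alone.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


k = 10
per_read = count_kmers_per_read(reads, k)

# Naive alternative: glue the reads together first. This creates
# k - 1 extra windows that span the junction between the two reads.
joined = count_kmers_per_read(["".join(reads)], k)

print("AAGACTTTAT" in per_read)  # False
print("AAGACTTTAT" in joined)    # True
print(sum(joined.values()) - sum(per_read.values()))  # 9
```

The nine extra k-mers in the concatenated case all straddle the junction, so none of them is supported by a single sequenced fragment.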

Let me know if this was unclear and I can put together a more concrete example.

YPares (Author) commented Feb 5, 2016

Thanks for your answer @fnothaft! A follow-up question: is this the case only for count_kmers? I mean, is that kind of treatment (processing data spread over two reads) done elsewhere in ADAM, or is this the general behaviour in the framework?
(Sorry if I'm not clear, I'm only a beginner in genomics analysis. I have a little experience with Spark, and was under the impression that this kind of treatment was made impossible by the nature of Spark.)

fnothaft (Member) commented Jul 6, 2016

Closing as won't fix.

@fnothaft fnothaft closed this as completed Jul 6, 2016