Why does the confirmation for number of kmers NOT always pop-up? #37

Closed
ashishdamania opened this issue Jun 20, 2016 · 3 comments

Comments

@ashishdamania

Below is the attached screenshot. Also, it seems that the number of kmers shown is not equal to the theoretical maximum: kmers = Total Length + 1 - K (kmer length).
Can you please check it? I can upload my test file if required.

Thanks for making VizBin.

[Screenshot: screen shot 2016-06-19 at 5 37 14 pm]

@ashishdamania ashishdamania changed the title Why does the program does not always pop up confirmation for number of kmers? Why does the confirmation for number of kmers NOT always pop-up? Jun 20, 2016
@claczny
Owner

claczny commented Jun 20, 2016

Hi,

this particular dialogue is meant to inform the user about the amount of non-default-DNA-alphabet letters in the provided sequences.
Specifically, how many kmers were affected by such letters, e.g., N in ACGNT.
Accordingly, the dialogue will only appear if there is at least one non-default-DNA-alphabet letter in your retained sequences, i.e., sequences equal to or longer than the specified minimum length (default of 1,000 nt).
More generally speaking, each such letter will affect up to k k-mers, i.e., every k-mer window that overlaps it.
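
To illustrate the worst-case statement above, here is a minimal Python sketch that counts how many k-mer windows of a sequence contain at least one letter outside A, C, G, T. The function name `count_kmers` and the case-insensitive handling are assumptions made for illustration; this is not VizBin's actual code.

```python
def count_kmers(seq, k, alphabet="ACGT"):
    """Return (total, affected) for one sequence.

    total    -- number of k-mer windows, i.e. len(seq) - k + 1 (never negative)
    affected -- windows containing at least one letter outside `alphabet`
    """
    seq = seq.upper()  # assumption: lowercase bases count as regular DNA letters
    allowed = set(alphabet)
    total = max(len(seq) - k + 1, 0)
    affected = sum(
        1
        for start in range(total)
        if any(base not in allowed for base in seq[start:start + k])
    )
    return total, affected


# An N in the middle of a sequence touches k windows (here k = 3).
print(count_kmers("ACGTNACGT", 3))  # (7, 3)
# An N at the very start touches only one window, hence "worst case".
print(count_kmers("NACGT", 3))      # (3, 1)
```

A letter close to either end of a sequence overlaps fewer than k windows, which is why k is only an upper bound.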

It is not unusual for some kmers to be ignored, and in your particular case the frequency is really low, so everything should be fine.

Regarding

the number of kmers shown is not equal to theoretical max: kmers=Total Length +1 -K (Kmer length)

this formula is correct. If I understand your question correctly, "Total Length" should then represent the cumulative length of the retained sequences in your provided FASTA file.
Does it not match that number in your case?

Hope that helps and thank you for your interest in VizBin.

Best,

Cedric

@ashishdamania
Author

Hi Cedric,
Thanks for the detailed and prompt response. Now it makes sense why I do not get that pop up information.

  1. Is it possible to add the information about kmers to the log or pop-up regardless of whether the sequences contain N or other unexpected characters?
  2. Also, I see that the number of kmers reported in the pop-up is not consistent with the formula above.
    For example, I tried EssentialGenes.fa from the data directory and added two N in the sequence so that I could get a pop-up, and
    I see that there are 571706 kmers with K=5, but the sequence length is 574002, which should give us 574002+1-5=573998.

I calculated the length of the EssentialGenes.fa as follows:

grep -v ">"  EssentialGenes.fa > EssentialGenes_reformatted.fa

bioawk -c fastx '{print $name,length($seq)}' < EssentialGenes_reformatted.fa 
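
For reference, a minimal Python sketch of a comparable length calculation that keeps one length per FASTA record instead of a single total. The function name `fasta_lengths` is hypothetical; it is not part of VizBin or bioawk, and it assumes plain FASTA input with `>` header lines.

```python
def fasta_lengths(lines):
    """Return a list of (record_name, sequence_length) tuples.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Sequences may be wrapped over several lines.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line[1:].strip(), 0])  # start a new record
        elif records:
            records[-1][1] += len(line)  # extend the current record
    return [(name, length) for name, length in records]


example = [">seq1", "ACGT", "AC", ">seq2", "NNACG"]
lengths = fasta_lengths(example)
print(len(lengths), sum(length for _, length in lengths))  # 2 11
```

Passing an open file handle instead of the example list reports the number of records alongside the cumulative length.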

Does it discount kmers based on some criteria? Sorry, I am not getting this total correctly. Again, thanks for the response and for making VizBin.

Ashish

[Screenshot: screen shot 2016-06-20 at 9 18 58 am]

@claczny
Owner

claczny commented Jun 20, 2016

Hi Ashish,

  1. Is it possible to add the information about kmers to the log or pop-up regardless of whether the sequences contain N or other unexpected characters?

Frankly speaking, this feature is meant as a small reminder that something in the data might need consideration. It is not meant to serve as a proper, full-scale validity check, which, in any case, should occur prior to binning the data.
If there is no pop-up, all the better ;)

  2. Also, I see that the number of kmers reported in the pop-up is not consistent with the formula above.
    For example, I tried EssentialGenes.fa from the data directory and added two N in the sequence so that I could get a pop-up, and
    I see that there are 571706 kmers with K=5, but the sequence length is 574002, which should give us 574002+1-5=573998.

This calculation would be correct if we had a single sequence of that length.
However, we have 574 separate sequences in EssentialGenes.fa. Hence, the formula should be

(1000+1-5)*574

which equals 571704, since every sequence is 1,000 bp long. If one were to add 2 N's, each added letter would lengthen a sequence by one position and thus add one k-mer window, so the number would be 571706, i.e., as displayed by VizBin. So everything is in order there.
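
The arithmetic above can be checked with a short Python sketch that applies the formula per sequence and sums the results. The function name `total_kmers` is hypothetical, the `min_length` default mirrors the 1,000 nt minimum mentioned earlier in this thread, and placing both N's in the same sequence is an assumption (splitting them across two sequences gives the same total).

```python
def total_kmers(lengths, k, min_length=1000):
    """Sum (length - k + 1) over all sequences that are retained.

    A sequence is retained if it is at least `min_length` long; sequences
    shorter than k contribute no k-mers.
    """
    return sum(
        n - k + 1
        for n in lengths
        if n >= min_length and n >= k
    )


original = [1000] * 574            # 574 sequences of 1,000 bp each
with_ns = [1000] * 573 + [1002]    # two N's added to one sequence (assumption)

print(total_kmers(original, 5))    # 571704
print(total_kmers(with_ns, 5))     # 571706
print(total_kmers([574002], 5))    # 573998, the single-sequence figure
```

The gap between 573998 and 571706 is (574 - 1) * (5 - 1) = 2292, because every sequence boundary removes k - 1 windows compared with one concatenated sequence.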

Consequently, I consider this issue closed. Feel free to continue posting questions/comments if they are related to this issue. Otherwise, please open a new ticket describing the situation at hand.

Let me know if there is anything else I can be of help with.

Best,

Cedric
