Possible bug in A-Phased Repeat Finder #2
In that same function, it also looks like, if the list of A-tracts ends with minATracts or more that are close enough to each other, the corresponding A-phased repeat is not added to the arep[] array.
Looking at getAtracts() I think it will have a similar problem -- if the final A-tract reaches the end of the sequence, it doesn't get collected into the pAPRs[] array.
I'm basing that on the assumption that the dna[] array doesn't incorporate some kind of sentinel. For example, if dna[total_bases-1] is not 'a' or 't' then getAtracts() doesn't have the problem I think it does.
Another bug: at line 90, this test I don't know if that miscounting results in any problems or not.
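To illustrate the pattern I suspect, here is a minimal sketch of a run collector. It is not the code from findAPR.c: the `Run` struct, `is_at()`, and `collect_at_runs()` are hypothetical names, and it assumes lowercase bases with no sentinel at the end of the sequence. The point is the flush after the loop. Without it, a run that reaches the last base is never closed and never recorded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one run of 'a'/'t' bases, half-open: [start, end). */
typedef struct { size_t start, end; } Run;

static int is_at(char c) { return c == 'a' || c == 't'; }

/* Collect maximal runs of 'a'/'t' in dna[0..n).
 * Stores at most max_runs runs in out and returns the number of runs found. */
static size_t collect_at_runs(const char *dna, size_t n, Run *out, size_t max_runs)
{
    assert(dna != NULL && out != NULL);

    size_t count = 0, run_start = 0;
    int in_run = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_at(dna[i])) {
            if (!in_run) { run_start = i; in_run = 1; }
        } else if (in_run) {
            /* A non-a/t base closes the current run. */
            if (count < max_runs) out[count] = (Run){run_start, i};
            count++;
            in_run = 0;
        }
    }

    /* Flush: a run that reaches the end of the sequence is never closed
     * inside the loop, so it has to be recorded here. */
    if (in_run) {
        if (count < max_runs) out[count] = (Run){run_start, n};
        count++;
    }
    return count;
}
```

The same reasoning would apply one level up: if the list of A-tracts ends while a candidate repeat is still open, that repeat also needs to be recorded after the loop finishes.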
Edited to add: Rereading what I wrote below, it might be perceived as criticism. That was not my intent. I wanted to convey what my purpose was in looking through the source code. I didn't come into this code to look for bugs. What I was, and am, looking for is a clear definition of what is considered an A-tract, because I've been asked by someone to annotate the A-tracts in the output of this program. That is, within an A-phased repeat, they want to know which subintervals are part of the A-tracts and which are not.
The first definition I was pointed to was the one in the caption of Table 1 in the 2010 paper, "Non-B DB: a database of predicted non-B DNA-forming motifs in mammalian genomes." I couldn't come up with an interpretation of that definition that matched the motifs I saw reported. Then I was pointed to this repository, and I noticed the 2013 paper in the README, "Non-B DB v2.0: a database of predicted non-B DNA-forming motifs and its associated tools." That gives the definition "A-phased motifs are defined as three or more tracts of four to nine adenines or adenines followed by thymines, with centers separated by 11–12 nucleotides." But that still disagreed with the motifs I saw reported. I note that while the 2013 paper says an A-tract needs a minimum of 4 As (or, I think, one of AAAT, AATT, ATTT also fits that description), the default value in the code for minAPRlen is 3, not 4.
I thought I would be able to figure it out from the source code. There is a definition at the top of findAPR.c that looks promising, but it isn't clear to me that it is what getAtracts actually implements. I'm not saying it doesn't. I'm having difficulty understanding the effect of the "go through each nucleotide" loop and what events it is actually counting. And as evidenced in this thread, I've come across a lot of bugs just while trying to understand how this one function works.
The description at the top of findAPR.c adds the important detail that the center of the A-tract is the center of its longest run of consecutive As (at least that's what I think it is saying). I didn't see that as part of the description in either of the two papers.
The CAPTCHA issue is now fixed in the web version of non-BMST.
So I am really befuddled as to what definition of an A-tract is implemented. To try to understand this, I've added code to findAPR so that whenever it adds an A-phased repeat to its list, it writes out the component A-tracts to a file. Below I list several of the A-tracts thus reported, and their positions in hg19 chr22. For some (many) of these, I don't see how they fit the A-tract definition at the top of findAPR.c. In particular, that definition seems to count a run of As (or, I guess, a run of As plus a run of Ts right after the As) and a run of Ts, and an A-tract is then supposed to have at least four more in the first run than in the second. Granting that where that definition says "4", it means whatever value minAT is (which is 3), how can it report anything that isn't at least 4 long?
Found another bug. The user options minATractSep and maxATractSep aren't used anywhere. Instead, the default values are essentially hardwired into this line:
Breaking this out into its own issue (see #4).
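For what it's worth, here is a rough sketch of what a separation test driven by the user options could look like. This is not the line from findAPR.c, which I'm not reproducing here. The function name `centers_in_phase()` is hypothetical, the centers are assumed to be stored as doubles, and treating both bounds as inclusive is my assumption.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if the distance between two A-tract centers
 * lies within [minATractSep, maxATractSep] (both bounds inclusive, by assumption),
 * and 0 otherwise. */
static int centers_in_phase(double center_a, double center_b,
                            double minATractSep, double maxATractSep)
{
    assert(minATractSep <= maxATractSep);

    double d = center_b - center_a;
    if (d < 0) d = -d;   /* the order of the two centers should not matter */
    return d >= minATractSep && d <= maxATractSep;
}
```

With the 11–12 nucleotide range quoted from the 2013 paper, centers at 10.0 and 21.5 would pass and centers at 10.0 and 20.0 would not. Passing the option values in as arguments means a user who changes them gets different results, which is the behavior the options imply.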
In findAPR(), line 197 has this loop:
for (i = 0; i < nProcessedATs - (minATracts + 1); i++)
Shouldn't that be
for (i = 0; i < nProcessedATs - (minATracts - 1); i++)
For example, if nProcessedATs=10 and minATracts=3, the loop termination becomes i<6. If I understand the code correctly, that means the last A-tract triplet it will consider is 5,6,7, and it doesn't consider 6,7,8 or 7,8,9.
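The arithmetic can be checked with a small standalone sketch. The helper `count_window_starts()` is a hypothetical name and is not part of findAPR.c. It only mimics the loop header, counting how many starting indices the loop visits and which one is last.

```c
#include <assert.h>
#include <stddef.h>

/* Count the starting indices i visited by
 *   for (i = 0; i < n - offset; i++)
 * The last index visited is stored in *last_start (-1 if the loop never runs). */
static int count_window_starts(int n, int offset, int *last_start)
{
    assert(last_start != NULL);

    int count = 0;
    *last_start = -1;
    for (int i = 0; i < n - offset; i++) {
        *last_start = i;
        count++;
    }
    return count;
}
```

Calling it with n = 10 and offset = minATracts + 1 = 4 gives 6 starts, the last at i = 5 (triplet 5,6,7). With offset = minATracts - 1 = 2 it gives 8 starts, the last at i = 7 (triplet 7,8,9), which is the final triplet that fits in 10 A-tracts.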