Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug in A-Phased Repeat Finder #2

Closed
rsharris opened this issue May 4, 2020 · 9 comments
Closed

Possible bug in A-Phased Repeat Finder #2

rsharris opened this issue May 4, 2020 · 9 comments
Assignees

Comments

@rsharris
Copy link

rsharris commented May 4, 2020

In findAPR(), line 197 has this loop:
for (i = 0; i < nProcessedATs - (minATracts + 1); i++)
Shouldn't that be
for (i = 0; i < nProcessedATs - (minATracts - 1); i++)

For example, if nProcessedATs=10 and minATracts=3, the loop termination becomes i<6. If I understand the code correctly, that means the last A-tract triplet it will consider is 5,6,7, and it doesn't consider 6,7,8 or 7,8,9.

@rsharris
Copy link
Author

rsharris commented May 4, 2020

In that same function, it also looks like, if the list of A-tracts ends with minATracts or more that are close enough to each other, the corresponding A-phased repeat is not added to the arep[] array.

@rsharris
Copy link
Author

rsharris commented May 4, 2020

Looking at getAtracts() I think it will have a similar problem -- if the final A-tract reaches the end of the sequence, it doesn't get collected into the pAPRs[] array.

@rsharris
Copy link
Author

rsharris commented May 4, 2020

I'm basing that on the assumption that the dna[] array doesn't incorporate some kind of sentinel.

For example, if dna[total_bases-1] is not 'a' or 't' then getAtracts() doesn't have the problem I think it does.

@rsharris
Copy link
Author

rsharris commented May 4, 2020

Another bug: at line 90, this test
if (dna[n - 1] == 't')
could cause an segfault if n is 0 (i.e. if the sequence begins with a run of 4 or more As). Moreover, if n is not zero and dna[n-1] is not a T, the code assumes it is an A. But this is never an A the first time through the loop.

I don't know if that miscounting results in any problems or not.

@rsharris
Copy link
Author

rsharris commented May 4, 2020

Edited to add: Rereading what I wrote below, it might be perceived as criticism. This was not my intent. I wanted to convey what my purpose was in looking through the course code.

I didn't come into this code to look for bugs. What I was/am looking for is a clear definition of what is considered an A-tract. Because I've been asked by someone to annotate the A-tracts in the output of this program. I.e. in an A-phased repeat, they want to know what subintervals are part of the A-tracts, and what subintervals are not.

The first definition I was pointed to was the one in the caption of Table 1 in the 2010 paper: "Non-B DB: a database of predicted non-B DNA-forming motifs in mammalian genomes." I couldn't come up with an interpretation of that definition that matched the motifs I saw reported.

Then I was pointed to this repository, and I noticed the 2013 paper in the README: "Non-B DB v2.0: a database of predicted non-B DNA-forming motifs and its associated tools." That gives the definition "A-phased motifs are defined as three or more tracts of four to nine adenines or adenines followed by thymines, with centers separated by 11–12 nucleotides." But that still disagreed with the motifs I saw reported.

I note that while the 2013 paper says an A-tract needs a minimum of 4 As (or, I think one of AAAT, AATT, ATTT is also indicated by that description), the default value in the code for minAPRlen is 3, not 4.

I thought I would be able to figure it out from the source code. There is a definition at the top of findAPR.c that looks promising, but it isn't clear to me that that is what getAtracts actually implements. I'm not saying it doesn't. I'm having difficulty understanding the effect of "go through each nucleotide" loop, what events it is actually counting. And as evidenced in this thread, I've come across a lot of bugs just while trying to understand how this one function works.

The description at the top of findAPR.c adds the important detail that the center of the A-tract is the center of its longest run of consecutive As (at least that's what I think it is saying). I didn't see that as part of the description in either of two papers.

@Brian-Luke
Copy link

The CAPTCHA issue is now fixed in the web version of non-BMST.

@rsharris
Copy link
Author

rsharris commented May 8, 2020

So I am really befuddled as to what definition of an A-tract is implemented. To try to understand this, I've added code to findAPR so that whenever it adds an A-phased repeat to its list, it writes out the component A-tracts to a file.

Below I list several of the A-tracts thus reported, and their positions in hg19 chr22.

For some (many) of these, I don't see how they fit the A-tract definition at the top of findAPR.c. In particular, that definition seems to count a run of As (or, I guess, a run of As plus a run of Ts right after the As) and a run of Ts, then an A-tract is supposed to have at least four more in the first run as in the second. Granting that where that defintion says "4", it means whatever value minAT is (which is 3), how can it report anything that isn't at least 4 long?

att       chr22:16060836-16060838
ttt       chr22:16060846-16060848
aaa       chr22:16060857-16060859
aaa       chr22:16061285-16061287
aaat      chr22:16061295-16061298
ataaatt   chr22:16061303-16061309
atttt     chr22:16068563-16068567
ataaaa    chr22:16068572-16068577
aaa       chr22:16068585-16068587
aaaaaatt  chr22:16069995-16070002
aaaat     chr22:16070007-16070011
aaa       chr22:16070019-16070021
aaat      chr22:16073085-16073088
aaaa      chr22:16073095-16073098
aaaatt    chr22:16073104-16073109
atttttta  chr22:16081290-16081297
attt      chr22:16081302-16081305
aaatatata chr22:16081313-16081321
ttt       chr22:16100276-16100278
atttt     chr22:16100286-16100290
tatttatt  chr22:16100292-16100299
aaa       chr22:16100568-16100570
aaaaa     chr22:16100577-16100581
aat       chr22:16100589-16100591
aaaa      chr22:16101118-16101121
att       chr22:16101129-16101131
ttt       chr22:16101140-16101142
aaat      chr22:16103192-16103195
ttt       chr22:16103203-16103205
ttaaaaaa  chr22:16103210-16103217
ttt       chr22:16110233-16110235
ttt       chr22:16110243-16110245
atttatttt chr22:16110248-16110256
taaaat    chr22:16114610-16114615
aaaa      chr22:16114622-16114625
attt      chr22:16114632-16114635
att       chr22:16120045-16120047
tttt      chr22:16120055-16120058
aat       chr22:16120066-16120068
aaa       chr22:16125821-16125823
aata      chr22:16125832-16125835
ttt       chr22:16125842-16125844
aataa     chr22:16130130-16130134
aaa       chr22:16130140-16130142
aata      chr22:16130151-16130154
aaat      chr22:16134881-16134884
ataaattat chr22:16134888-16134896

@rsharris
Copy link
Author

Found another bug. The user options minATractSep and maxATractSep aren't used anywhere. Instead, the default values are essentially hardwired into this line:
if ((distToNext <= 11.1) && (distToNext >= 9.9))

@johnsonra
Copy link
Collaborator

Found another bug. The user options minATractSep and maxATractSep aren't used anywhere. Instead, the default values are essentially hardwired into this line:
if ((distToNext <= 11.1) && (distToNext >= 9.9))

Breaking this out into it's own issue, (see #4).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants