-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Need some advice for annotating *.gd file with unassigned junctions #255
Comments
There's a lot going on here. Interpreting junctions can be tricky. Here are my best guesses: The three pairs highlighted in purple are insertions of mobile elements. breseq can't tell the difference between ends of the same element located at different sites in the genome (they have the exact same sequence as far as the reads spanning the new junction can tell), so it ends up picking one of them somewhat arbitrarily. If you looked at those regions, you would see that they are identical, or at least identical on their margins on the scale of your read length. The orange highlighting of those rows tells you that side of the junction is matched by multiple sites in the genome. You can help breseq give you more precise output in these cases if you can can annotate all locations of the IS element as The duplication related to the MOB you highlighted in cyan and the coverage plot likely involves this chain of events:
The 1st and 2nd entries look like they are integration of your plasmid into the chromosome near the origin of your circular genome. There also seems to be some kind of duplication in the chromosome associated with the integration. The way to read the 1st entry is that a read spanning the junction starts at low coords and goes up to position 200 in the plasmid, then it jumps to 367287 in the genome and starts going up in coordinates. The 2nd entry is for starting at 245 in the plasmid and going up connected to 368508 and then going down in the chromosome. So bases 201-244 in the plasmid were deleted. The rest of the plasmid was inserted between a target site duplication of bases 367287-368508 in the genome in a forward orientation (base 245 going up then wrapping around and going up to 201). The duplication may have occurred during whatever recombination event put the plasmid into the genome. I would predict that there is excess coverage of bases 367287-368508 if you plotted coverage in that region. |
|
Replying as I've also had a reference genome with un-annotated IS regions
and experienced the same issue with junctions.
"Is this error due to the positions in one entry (gene[...]123456..234567) not being consecutive"
Sort of, not 'consecutive' exactly, but for a gene on the forward strand
(i.e.strand 1), start must be smaller than end. For a gene on the reverse
strand (i.e. strand -1), end must be smaller than start. You've given
arbitrary numbers here, but I suppose you are aware that these are the gene
coordinates (in the genome) in genbank format? It seems these coordinates
are not being parsed correctly for your gene, as your error message says
that start = 1 and end = 0 (which fails the start<end check).
Danna
…On Fri, Jan 15, 2021 at 11:59 AM AMVeenstraSkirl ***@***.***> wrote:
Hi Jeffrey,
I have been busy annotating my genome with the "mobile_element" entries,
and some other things that were missing. I have tried to run it just now
and got this error:
[image: 210115_error_invalidStartEnd_coordinates]
<https://user-images.githubusercontent.com/35833695/104723391-7c92b180-572f-11eb-8bab-69d0580405cb.png>
Something with the start and end positions is not right. Most likely due
to some of the positions I entered.
The error just specifies the line of code in the reference_sequence.h
header file.
I searched for a similar issue on here and found one of your files that
defines the error message and some of the variables (start_1 and end_1)
that are involved in this. However, I could not decifer the initialization
of the "end_1" variable in the piece of code you have on here to get an
indea on where to look at in my file.
Is this error due to the positions in one entry (gene[...]123456..234567)
not being consecutive, or because of the order of the entries, or other
conflicts between entries? Or all together?
Thanks!
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#255 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACBVJ3JGSFNRNL35DVGWRMTS2AU33ANCNFSM4VNG7NNA>
.
|
Hi Danna! Yes, I was indeed aware that the arbitrary numbers are standing for the start and end of a gene and are actually not consecutive. Sorry for the weird wording. I meant what you said:
Ok, so I will have to search the entries in my file again for position combiantions that are "non-complying" to this criterium. |
Hi, For anyone who encountered a similar problem, my final remarks on this: The check on all the positions in the new entries, revealed that a typo in one of the entries created a position number greater than the total number of my genome. Check with something like HOWEVER, after fixing this mistake the file would still yield the original error. Turns out, when creating the new version of my file, the actual DNA sequence at the bottom of the file was not copied together with the rest of the entries. Where it should say something like this: Fixed the issue with a copy-paste-job. Thanks for your help! |
Thanks for adding the explanations. I've tried to add some more feedback about exactly which feature has these coordinate problems to the code, but I'm not seeing this exact error checking code being triggered in my text cases. If you care to share the reference file that was giving the "invalid start-end" error with me by email, I can see about trying to make that error message more informative. |
Hi Jeffrey,
Thanks a lot for your work and support on Breseq! I have been trying to find my question in your papers already, spefically the one together with Daniel from 2014 (Deatherage&Barrick, 2014; abbr. D&B,2014), that contains detailed instructions on how to incorporate unassigned junctions into the output.gd file. The instructions and different steps for your example seem clear to me, so I want to apply them to my own project.
In my case, there was no new junction reported in the predicted mutations, however there are 9 entries in the unassigned junction table. See here:
In general the values of the scores, skew and frequency seem legit enough for me to take them as serious calls and find out what has happened there (except for entry 5). The first two are not the issue. It's the other six entries that I am wracking my brains about.
When looking closely at the positions, there are three pairs of entries connected to three sites on the genome. I have marked them with dark purple on the left hand side.
Repeat sequences are involved in every entry suggesting some mobile element action. However, the two repeats in every pair are different from each other - in every pair. So, I think I cannot link it to a single jump of one mobile element for any of the three sites, as has been described for the example in points 3.2.10 and 3.2.11 in D&B, 2014.
I know a little bit what is going on for the entry marked in cyan. There seems to be a huge duplication of appr. 200,000 bp:
The duplication in the cit+ E. coli in the tutorial of the breseq documentation contains only one single entry. So, I am puzzling over the meaning of the second entry
My first idea was to check, whether the reads are mixed, which is the case for the white rows of each entry, while the reads of the orange rows span over both sides. Only one example for one white row:
I am not sure what to do with this information. Does this translate into mixed reads as shown in figure 5e of D&B, 2014? This would indicate a tandem duplication, which I can only find for this one particular site. The coverage plots follow average read depth for the other two sites.
The perfect way to find out, seems to me as proposed in D&B, 2014 by mutating the reference, and have a go with breseq at it again. But I am not sure how to write the lines, as given by the entries. In the paper, you could combine two entries from the unassigned table into one line in the *.gd-file. I am not sure on how to do this, for my examples, with two different repeat ends.
Do you see a way on how to solve this and could you give me a hint/advice on how to approach this further?
Thanks in advance!
Annie
The text was updated successfully, but these errors were encountered: