Make simpler CSV/TSV output table from gdtools COMPARE #274
Comments
I think this could be accomplished downstream in R with a dplyr::select() call, without having to change the current code. For backwards compatibility, perhaps the new output flag could be -TSV-simple and the original stay as-is? One thing that would be nice to simplify is the annotation of gene_name and gene_inactivated for large deletions. The former gives a range (e.g. mutS–rpoS), whereas the latter lists all the intervening genes. It would be handy to have all the relevant genes in one column, for batch processing and summary purposes in e.g. R. |
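For anyone doing this downstream outside of R, here is a minimal Python sketch of the same idea: keep a few columns and add one combined gene column. Only gene_name and gene_inactivated are column names taken from this thread. The other column names, the genes_affected column, the simplify_rows helper, and the comma separator inside gene_inactivated are assumptions for illustration, not the actual gdtools output format.

```python
import csv
import io

# Hypothetical subset of columns to keep. Only gene_name and gene_inactivated
# are mentioned in this thread; "type" and "position" are placeholders.
KEEP = ["type", "position", "gene_name", "gene_inactivated"]


def simplify_rows(tsv_text, keep=KEEP, sep=","):
    """Keep selected columns and add a combined 'genes_affected' column.

    Assumes gene_inactivated holds a sep-separated list of gene names.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    simplified = []
    for row in reader:
        # Missing or empty cells become empty strings.
        slim = {col: (row.get(col) or "") for col in keep}
        inactivated = [
            gene.strip()
            for gene in (row.get("gene_inactivated") or "").split(sep)
            if gene.strip()
        ]
        # Prefer the explicit gene list; fall back to gene_name when it is empty.
        if inactivated:
            slim["genes_affected"] = sep.join(inactivated)
        else:
            slim["genes_affected"] = row.get("gene_name") or ""
        simplified.append(slim)
    return simplified
```

Reading a real file would be simplify_rows(open(path, newline="").read()), after adjusting KEEP and sep to whatever the exported table actually contains.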
Sure, I'll keep the original format names/options as-is. Thanks for the input. I am working on adding the simplified version as Regarding the (Definitely need to document these... in which case, it would also be easy to add an option for having |
Hi, is it possible that the genes are actually not classified as inactivated for some MOB mutations? From your comment above I understand that all IS elements should be classified as inactivated, but this is not the case for the Breseq output in my data. I copy an example of the different column values of the CSV table for one of these cases below. del_end 2 |
Yes, you're correct, that's the way things operate currently. It looks like only DEL mutations will add things to the inactivated list and others will get added as overlapping. There's a note in the code questioning whether this is the right thing to do for other mutation types. It does seem reasonable to me that a MOB landing in a gene could also be counted as inactivating (according to this rough way of classifying things). |
Thanks! This does not seem to explain what I am seeing, though. Could you explain the difference between my other example and this other MOB, which does have genes classified as inactivated? (NaN categories not shown) duplication_size 9 |
I'm not sure why that is. Looking more closely at the code, a MOB should be inactivating if it is entirely within the gene, and overlapping if the duplicated region overlaps the end but is not completely contained within a gene. That logic doesn't seem to be working for the first example you posted. If you want to share the input GD file that you converted to CSV/TSV plus the reference sequence, I can try to follow what is happening. |
Thanks. Sure, I have 2 reference files because my reference genome ($ref_file1 = keio_parent_sequence.gb) has an inserted cassette ($ref_file2 = KanR_NeoR_FRTs.gbk). The line of code I used is
I additionally attach another example where I also do not understand why a +C insertion is not classified as inactivating the gene: |
Hi @gabypetrungaro, I figured out what is happening. The code is working as designed, but the design has a hidden option that only marks overlapping mutations as inactivating if they hit within the first 80% of the length of the gene. This seems to be the case for both of these. It is confusing, and right now it is hard-coded, so you can't change it as a command-line option. I'll add this as an option that can be set at the command line and also explain that it's set at 80% by default, so this will be flexible and not mysterious in the next version! If you can compile breseq on your own, you can change this line to = 1.0 right now to get the expected behavior:
|
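The behavior described in the comments above can be sketched roughly as follows. This is not the breseq source. The classify function, its parameter names, and the choice to measure the 5'-most mutated base against the gene length are assumptions. Only the three ideas come from the thread: a mutation entirely within a gene can be inactivating, one that spills past the gene end is overlapping, and the inactivating call is limited to the first 80% of the gene by default.

```python
def classify(mut_start, mut_end, gene_start, gene_end, strand=1, max_fraction=0.8):
    """Classify one mutation/gene pair as 'inactivating', 'overlapping', or 'none'.

    Coordinates are 1-based and inclusive. max_fraction mirrors the 80%
    cutoff mentioned in the thread; 1.0 disables the cutoff.
    """
    # No shared bases at all.
    if mut_end < gene_start or mut_start > gene_end:
        return "none"

    # A mutation extending past either gene boundary only overlaps the gene.
    contained = gene_start <= mut_start and mut_end <= gene_end
    if not contained:
        return "overlapping"

    gene_length = gene_end - gene_start + 1
    # Assumption: use the 5'-most mutated base, counted from the gene start
    # on the gene's own strand.
    if strand >= 0:
        offset = mut_start - gene_start + 1
    else:
        offset = gene_end - mut_end + 1

    if offset / gene_length <= max_fraction:
        return "inactivating"
    # Contained, but too close to the 3' end to count as inactivating.
    return "overlapping"
```

Under these assumptions, a single-base insertion at position 90 of a 100-base forward-strand gene comes back as overlapping with the default cutoff and as inactivating with max_fraction=1.0, which matches the kind of surprise reported above.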
Hi @jeffreybarrick, |
Yes, that's correct. Here's a write-up of the rules that will be part of the
I just committed the code. |
The current table has an overwhelming number of columns in an arbitrary order. Recommend making a simple version (-f TSV) that resembles the HTML table in order and content and a full version (-f TSV-ALL?). Also, add CSV outputs or switch to this entirely.