Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make simpler CSV/TSV output table from gdtools COMPARE #274

Closed
jeffreybarrick opened this issue Jul 22, 2021 · 10 comments
Closed

Make simpler CSV/TSV output table from gdtools COMPARE #274

jeffreybarrick opened this issue Jul 22, 2021 · 10 comments

Comments

@jeffreybarrick
Copy link
Contributor

The current table has an overwhelming number of columns in an arbitrary order. Recommend making a simple version (-f TSV) that resembles the HTML table in order and content and a full version (-f TSV-ALL?). Also, add CSV outputs or switch to this entirely.

@dannagifford
Copy link

I think this could be accomplished downstream in R with a dplyr::select() call without having to change the current code?

For backwards compatibility, recommend perhaps the new output flag be -TSV-simple and the original stay as-is?

One thing that would be nice to simplify is the annotation of gene_name and gene_inactivated for large deletions. The former gives it as a range (eg mutS–rpoS) whereas the latter gives all the intervening genes. Would be handy to have all the relevant genes in one column perhaps? For batch processing and summary purposes in eg R.

@jeffreybarrick
Copy link
Contributor Author

Sure, I'll keep the original format names/options as-is. Thanks for the input.

I am working on adding the simplified version as -f TABLE. This will also add code for constructing (unicode) text versions of the HTML columns that describe the mutation. I will add these to the TSV output format as additional columns.

Regarding the gene_name and genes_inactivated columns... this clearly needs to be documented better, but gene_name is meant to be a human-readable summary type field. The genes_inactivated and genes_overlapping columns should (together) include all of the names of the individual genes that are affected by the mutation. They get classified as inactivating if they clearly destroy gene function (nonsense mutations, out-of-frame indels, big deletions, IS elements) and as overlapping otherwise. Maybe you'd like for there to be another column that has all of these genes together... gene_list?

(Definitely need to document these... in which case, it would also be easy to add an option for having gdtools ANNOTATE/COMPARE output only a subset of user-selected annotation columns.)

@gabypetrungaro
Copy link

Hi, is it possible that the genes are actually not classified as inactivated for some MOB mutations? From your comment above I understand that all IS elements should be classified as inactivated, but this is not the case for the Breseq output in my data. I copy an example of the different column values of the CSV table for one of these cases below.

del_end 2
duplication_size 2
frequency 1
gene_name nfsA
gene_position coding (674-675/723 nt)
gene_product nitroreductase A, NADPH-dependent, FMN-dependent
gene_strand >
genes_inactivated NaN
genes_overlapping nfsA
genes_promoter NaN
ins_start NaN
locus_tag BW25113_0851
locus_tags_inactivated NaN
locus_tags_overlapping BW25113_0851
locus_tags_promoter NaN
mutation_category mobile_element_insertion
position 887313
position_end 887314
position_start 887313
ref_seq GC
repeat_name IS30
repeat_size 1221
seq_id CP009273
snp_type NaN
strand -1
time -1
title 8962_830_E3_seqA
type MOB

@jeffreybarrick
Copy link
Contributor Author

Yes, you're correct, that's the way things operate currently.

It looks like only DEL mutations will add things to the inactivated list and others will get added as overlapping.

There's a note in the code questioning whether this is the right thing to do for other mutation types. It does seem reasonable to me that a MOB landing in a gene could also be counted as inactivating (according to this rough way of classifying things).

@gabypetrungaro
Copy link

Thanks! This does not seem to explain what I am seeing, though.. could you maybe explain what is the difference between my other example and this other MOB which actually does have genes classified as inactivated? (NaN categories not shown)

duplication_size 9
frequency 1
gene_name ybjC
gene_position coding (172-180/288 nt)
gene_product DUF1418 family protein
gene_strand >
genes_inactivated ybjC
locus_tag BW25113_0850
locus_tags_inactivated BW25113_0850
mutation_category mobile_element_insertion
position 886540
position_end 886548
position_start 886540
ref_seq CCCGCTGCG
repeat_name IS1
repeat_size 768
seq_id CP009273
strand -1
time -1
title 8961_828_H1_fis
type MOB

@jeffreybarrick
Copy link
Contributor Author

I'm not sure why that is.

Looking more closely at the code, a MOB should be inactivating if it is entirely within the gene, and overlapping if the duplicated region overlaps the end but is not completely contained within a gene.

That logic doesn't seem to be working for the first example you posted.

If you want to share the input GD file that you converted to CSV/TSV plus the reference sequence, I can try to follow what is happening.

@gabypetrungaro
Copy link

Thanks. Sure, I have 2 reference files because my reference genome ($ref_file1 = keio_parent_sequence.gb) has an inserted cassette ($ref_file2 = KanR_NeoR_FRTs.gbk).

The line of code I used is
gdtools ANNOTATE -o output_table.csv -f CSV -r $ref_file1 -r $ref_file2 $files

$files is a longer list of .gd files but the 2 examples I just posted are the ones attached.

I additionally attach another example where I also do not understand why a +C insertion is not classified as inactivating the gene:
frequency 1
gene_name nfsB
gene_position coding (563/654 nt)
gene_product dihydropteridine reductase, NAD(P)H-dependent,...
gene_strand <
genes_overlapping nfsB
insert_position 1
locus_tag BW25113_0578
locus_tags_overlapping BW25113_0578
mutation_category small_indel
new_seq C
position 600318
position_end 600318
position_start 600318
ref_seq A
seq_id CP009273
time -1
title 8962_830_F9_tolC
type INS

refs_and_example_gd_files.zip

@jeffreybarrick
Copy link
Contributor Author

Hi @gabypetrungaro,

I figured out what is happening. The code is working as designed, but the design has a hidden option that only marks overlapping mutations as inactivating if they hit within the first 80% of the length of the gene. This seems to be the case for both of these. It is confusing, and right now it is hard-coded, so you can't change it as a command-line option.

I'll add this as an option that can be set at the command line and also explain that it's set at 80% by default, so this will be flexible and not mysterious in the next version!

If you can compile breseq on your own, you can change this line to = 1.0 right now to get the expected behavior:

 const double cReferenceSequences::k_inactivate_overlap_fraction = 0.8;

@gabypetrungaro
Copy link

Hi @jeffreybarrick,
Thanks much for your answer! this clarifies the issue and helps me understand better how Breseq works :) I guess that this is the case not only for MOBs but for other type of mutations then, right? If so, this would also explain the INS mutation in my last example.
Thanks again and kind regards.

@jeffreybarrick
Copy link
Contributor Author

Yes, that's correct. Here's a write-up of the rules that will be part of the gdtools COMPARE/ANNOTATE command-line help in the next version:

--inactivating-overlap-fraction  <arg>                                                                                                                            

Mutations within this fraction of the length of a gene from its beginning are 
assigned to the 'genes_inactivating' versus 'the genes_overlapping' list if they 
fulfill other criteria. (DEFAULT=0.8)

--inactivating-size <arg>        

INS, DEL, and SUB mutations in genes that are longer than this length cutoff are 
always assigned to the 'genes_inactivating' list, even if they are in-frame or 
noncoding genes. (DEFAULT=15)

--promoter-distance <arg>        

Mutations upstream and within this distance of the beginning of a gene have it 
added to their 'genes_promoter' list. (DEFAULT=150)

MUTATION EFFECTS CLASSIFICATION

Each mutation has a 'genes_overlapping' list assigned based
on the genes it overlaps. If the mutation affects a position
within --inactivating-overlap-fraction of the length of an
overlapping gene from its start, the gene is moved to the
'genes_inactivated' list if it is also a MOB, SNP causing a
nonsense mutation, or an INS, DEL, or SUB that results in a
size change that is <= the --inactivating-size and results
in a frameshift in a protein-coding gene or an INS, DEL, SUB
with a size change > the --innactivating-size for any type
of gene even if it is in-frame in a protein-coding gene. If
there are no 'genes_overlapping' or 'genes_inactivated', a
mutation has a 'genes_promoter' list assigned to all genes
that are <= the --promoter_cutoff bp upstream of a gene.
There can be multiple qualifying genes assigned to each
lists, but each gene will only be in one of the lists.

I just committed the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants