The gff_cleaner utility performs various manipulations on a GFF file to "clean" it.
General usage syntax:
gff_cleaner [OPTIONS] <file>.gff
Options:
--version
show program's version number and exit
-h, --help
show the help message and exit
-o OUTPUT_GFF
Name of output GFF file (default is <file>_clean.gff)
--prepend=PREPEND_STR
String to prepend to seqname in first column
--clean
Perform all the 'cleaning' manipulations on the input data (equivalent to specifying all of --clean-score, --clean-replace-attributes, --clean-exclude-attributes and --clean-group-sgds)
--clean-score
Replace Anc_* and blanks in the score field with zeroes
--clean-replace-attributes
Replace ID, Gene, Parent and Name attributes with the value of the SGD attribute, if present
--clean-exclude-attributes
Remove the kaks, kaks2 and ncbi attributes (to remove arbitrary attributes, see the --remove-attribute=... option)
--clean-group-sgds
Group features with the same SGD by adding unique numbers to the ID attributes; IDs will have the form CDS:<SGD>:<n> (where n is a unique number for a given SGD)
--report-duplicates
Report duplicate SGD names and write list to <file>_duplicates.gff with line numbers, chromosome, start coordinate and strand.
--resolve-duplicates=MAPPING_FILE
Resolve duplicate SGDs by matching against 'best' genes in the supplied mapping file; other non-matching genes are discarded and written to <file>_discarded.gff.
--discard-unresolved
Discard any unresolved duplicates, which are written to <file>_unresolved.gff.
--insert-missing=GENE_FILE
Insert genes from gene file with SGD names that don't appear in the input GFF. If GENE_FILE is blank (the '=' must still be present) then the mapping file supplied with the --resolve-duplicates option will be used instead.
--add-exon-ids
For exon features without an ID attribute, construct and insert an ID of the form exon:<Parent>:<n> (where n is a unique number).
--add-missing-ids
For features without an ID attribute, construct and insert a generated ID of the form <feature>:<Parent>:<n> (where n is a unique number).
--no-percent-encoding
Convert encoded attributes to the correct characters in the output GFF.
Warning: This may result in a non-canonical GFF that can't be read correctly by this or other programs.
--remove-attribute=RM_ATTR
Remove attribute RM_ATTR from the list of attributes for all records in the GFF file (can be specified multiple times)
--strict-attributes
Remove attributes that don't conform to the KEY=VALUE format
--debug
Print debugging information
Output files:
<file>_clean.gff: 'cleaned' version of input
<file>_duplicates.txt: list of duplicated SGD names and the lines they appear on in the input file, along with chromosome, start coordinate and strand
<file>_discarded.gff: genes rejected by --resolve-duplicates
<file>_unresolved.gff: unresolved duplicates rejected by --discard-unresolved
The following steps outline the procedure for using the program, with each step being run on the output from the previous one:
Clean the chromosome names in the file by adding a prefix (`--prepend` option)
Creates a copy of the input file with the chromosome names updated with a specified prefix.
E.g. --prepend=chr will add chr to the start of each chromosome name in the file, which is useful if the chromosome is denoted by a number and needs the prefix for consistency with a mapping file.
Clean the GFF score and attribute data (`--clean` options)
The "clean" options perform the following operations:
--clean-score: the data in the score column is cleaned up by replacing Anc_* and blanks with '0's.
The attribute field of the GFF can contain various semicolon-separated key-value pairs:
--clean-replace-attributes: if one of these is a non-blank SGD then the Gene, Parent and Name values are updated to be the same as the SGD name.
--clean-exclude-attributes: attributes called kaks, kaks2 and ncbi are removed (n.b. to remove arbitrary attributes, use the more general --remove-attribute=... option).
If multiple features share the same SGD name then --clean-replace-attributes can result in them also sharing the same ID; to deal with this:
--clean-group-sgds: update the ID attribute to group neighbouring lines that have the same SGD (see the description of SGD grouping below).
A single --clean can be specified which performs all these operations automatically.
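The score and attribute cleaning described above can be sketched in Python as follows. This is an illustrative sketch only, not the program's actual code: the function names are hypothetical, `Anc_*` is assumed to mean any score value beginning with `Anc_`, and the set of replaced attributes (ID, Gene, Parent, Name) is taken from the option summary.

```python
EXCLUDED = ("kaks", "kaks2", "ncbi")          # removed by --clean-exclude-attributes
REPLACED = ("ID", "Gene", "Parent", "Name")   # overwritten with the SGD value

def clean_score(score):
    """Return '0' for a blank score or one starting with 'Anc_' (assumed meaning of Anc_*)."""
    score = score.strip()
    if not score or score.startswith("Anc_"):
        return "0"
    return score

def parse_attributes(field):
    """Split 'key=value;key=value' into a dict, preserving the original order."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def clean_attributes(attrs):
    """Drop the excluded attributes and copy a non-blank SGD into the replaced ones."""
    sgd = attrs.get("SGD", "")
    cleaned = {}
    for key, value in attrs.items():
        if key in EXCLUDED:
            continue
        if sgd and key in REPLACED:
            value = sgd
        cleaned[key] = value
    return cleaned

def clean_line(line):
    """Clean the score (column 6) and attributes (column 9) of one tab-delimited GFF line."""
    fields = line.rstrip("\n").split("\t")
    fields[5] = clean_score(fields[5])
    attrs = clean_attributes(parse_attributes(fields[8]))
    fields[8] = ";".join(f"{key}={value}" for key, value in attrs.items())
    return "\t".join(fields)
```

For a line with score `Anc_1.2` and attributes `ID=x;SGD=YEL0W;kaks=0.5;Name=foo`, this sketch would give a score of `0` and attributes `ID=YEL0W;SGD=YEL0W;Name=YEL0W`.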
Detect duplicate SGDs (`--report-duplicates` option)
Report duplicate SGD names found in the input file.
This option writes a list of the duplicates to a 'duplicates' file.
It also reports the number of 'trivial' duplicates, i.e. lines having the same SGD because they are part of the same gene.
Resolve duplicate SGDs using a mapping file (`--resolve-duplicates` option)
Attempt to resolve duplicates by referring to a list of "best" genes given in a mapping file. For each duplicated name the resolution procedure is:
- Find mapping gene(s) with the same name
- For each mapping gene, keep duplicates which match chromosome, strand and which overlap with the start and end of the gene (see the overlap criteria below). For SGD groups the mapping gene must overlap the whole group for it to match; mapping genes and duplicates which don't have matches are removed from the process.
- At the end of the matching procedure the duplication is resolved if there is one SGD (or SGD group) matched to one mapping gene. Otherwise the duplication remains unresolved.
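The resolution procedure for a single duplicated name can be sketched as below. This is a simplified illustration under stated assumptions, not the program's implementation: `Span`, `matches` and `resolve` are hypothetical names, each duplicate is represented by one span (for an SGD group, the span of the whole group), and `mapping_genes` is assumed to already contain only the mapping genes with the same name.

```python
from collections import namedtuple

# Hypothetical record type: one feature, or the overall span of an SGD group.
Span = namedtuple("Span", "chrom start end strand")

def matches(dup, gene, margin=1000):
    """True if dup has the same chromosome and strand as gene and lies within
    the gene's extent widened by `margin` bases on each side."""
    return (dup.chrom == gene.chrom
            and dup.strand == gene.strand
            and gene.start - margin <= dup.start
            and dup.end <= gene.end + margin)

def resolve(duplicates, mapping_genes):
    """Return the single duplicate to keep, or None if the duplication is unresolved."""
    # Drop mapping genes that match no duplicate, then duplicates that match no gene.
    kept_genes = [g for g in mapping_genes
                  if any(matches(d, g) for d in duplicates)]
    kept_dups = [d for d in duplicates
                 if any(matches(d, g) for g in kept_genes)]
    # Resolved only when exactly one duplicate is matched to exactly one gene.
    if len(kept_genes) == 1 and len(kept_dups) == 1:
        return kept_dups[0]
    return None
```

In this sketch a return value of `None` covers both the case where nothing matches and the case where several candidates remain.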
When duplicates are resolved, the non-matching duplicates are discarded; otherwise by default all unresolved duplicates are kept. However if the --discard-unresolved option is also specified then all unresolved duplicates are removed before output; the --insert-missing option can then be used to add them back in.
Note that the --discard-unresolved option cannot get rid of 'trivial' duplicates (i.e. lines having the same SGD because they are part of the same gene).
Add missing genes (`--insert-missing` option)
Adds genes from a list of "best" genes given in a mapping file which have names not found in the input GFF.
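Selecting which genes to insert amounts to a set difference on names. The sketch below assumes each mapping gene is a tuple whose first element is its name; `missing_genes` is a hypothetical helper, and how the program builds the inserted GFF lines is not covered here.

```python
def missing_genes(gff_sgd_names, mapping_genes):
    """Return the mapping genes whose name does not occur among the GFF's SGD names.

    gff_sgd_names: iterable of SGD names found in the input GFF
    mapping_genes: list of (name, chrom, start, end, strand) tuples
    """
    present = set(gff_sgd_names)
    return [gene for gene in mapping_genes if gene[0] not in present]
```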
As part of setting the ID attribute of GFF lines, the "clean" option also attempts to group neighbouring lines which have the same SGD name.
The ID attribute is updated to the form:
ID=CDS:<sgd_name>:<i>
where <sgd_name> is a gene or transcript name (e.g. YEL0W) and <i> is an integer index which starts from 1. Groupings are indicated by subsequent lines having the same <sgd_name> but monotonically increasing indices, for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:4;SGD=YEL0W
When determining a grouping the program looks ahead from each line for subsequent lines (up to five) which have the same SGD value. So groupings can also accommodate "breaks", for example:
chr1 Test CDS 34525 35262 0 - 0 ID=CDS:YEL0W:1;SGD=YEL0W
chr1 Test CDS 35823 37004 0 - 0 ID=CDS:YEL0W:2;SGD=YEL0W
chr1 Test CDS 38050 38120 0 - 0 ID=CDS:YEL0X:1;SGD=YEL0X
chr1 Test CDS 39195 39569 0 - 0 ID=CDS:YEL0W:3;SGD=YEL0W
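The look-ahead grouping can be sketched as follows. This is one possible reading of the description rather than the program's code: `group_sgds` is a hypothetical function working on a plain list of SGD values (one per GFF line), the five-line window is assumed to be measured from the most recent member of the group, and a separate group that starts beyond the window is assumed to restart its index at 1.

```python
LOOKAHEAD = 5  # number of following lines inspected, as stated in the text

def group_sgds(sgds, lookahead=LOOKAHEAD):
    """Return one ID per input line (None for blank SGDs)."""
    ids = [None] * len(sgds)
    for i, sgd in enumerate(sgds):
        if not sgd or ids[i] is not None:
            continue  # blank SGD, or already assigned to an earlier group
        index = 1
        ids[i] = f"CDS:{sgd}:{index}"
        last = i  # position of the most recent group member
        j = i + 1
        # Scan forward; the window moves on each time a member is found.
        while j < len(sgds) and j <= last + lookahead:
            if sgds[j] == sgd and ids[j] is None:
                index += 1
                ids[j] = f"CDS:{sgd}:{index}"
                last = j
            j += 1
    return ids
```

Called with `["YEL0W", "YEL0W", "YEL0X", "YEL0W"]`, this sketch reproduces the IDs of the "break" example above.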
The mapping file is a tab-delimited text file with lines of the form:
name chr start end strand
<name> is used to match against the SGD names in the input GFF file.
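A mapping file in this format could be read with Python's standard `csv` module, as in the sketch below. `MappingGene` and `read_mapping` are hypothetical names, and skipping rows that are short or have non-numeric coordinates (for example a header line) is an assumption, not documented behaviour.

```python
import csv
from collections import namedtuple

MappingGene = namedtuple("MappingGene", "name chrom start end strand")

def read_mapping(stream):
    """Parse tab-delimited 'name chr start end strand' lines from an open text stream."""
    genes = []
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) < 5:
            continue  # assumption: ignore blank or incomplete lines
        name, chrom, start, end, strand = row[:5]
        if not (start.isdigit() and end.isdigit()):
            continue  # assumption: ignore rows without numeric coordinates
        genes.append(MappingGene(name, chrom, int(start), int(end), strand))
    return genes
```

For a file on disk this would be used as `with open(path, newline="") as fp: genes = read_mapping(fp)`.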
Aside from matching chromosome and strand, one of the criteria for a mapping gene to match a duplicate from the GFF file is that the two must overlap.
An overlap is counted as the duplicate from the GFF having start/end positions such that it lies inside the start/end positions of the mapping gene extended by 1kb, i.e. between start - 1000 and end + 1000.
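Expressed as code, the overlap test is a containment check against the widened gene. The sketch below uses hypothetical function names and assumes the boundaries are inclusive, which the text does not specify; for an SGD group, the whole group (first start to last end) has to fit inside the window.

```python
MARGIN = 1000  # the 1kb extension applied to each end of the mapping gene

def overlaps(dup_start, dup_end, gene_start, gene_end, margin=MARGIN):
    """True if the duplicate lies between gene_start - margin and gene_end + margin."""
    return gene_start - margin <= dup_start and dup_end <= gene_end + margin

def group_overlaps(group, gene_start, gene_end, margin=MARGIN):
    """Apply the same test to a whole SGD group, given as a list of (start, end) pairs."""
    group_start = min(start for start, _ in group)
    group_end = max(end for _, end in group)
    return overlaps(group_start, group_end, gene_start, gene_end, margin)
```

With a mapping gene spanning 34525 to 39569, a duplicate at 33600 to 35262 passes the test, while one starting at 33000 does not.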