DESCRIPTION bin/fix_genic_features.py
fix_genic_features.py
makes in total three passes through transcript
annotation to arrive at the final genome wide annotation.
In the first pass, for every distinct gene_name
attribute in GTF file we
store all of the corresponding features from the GTF input file (i.e. all
exons, and also all start and stop codons in case of coding genes).
In the second pass, we define the six types of genomic features. Features can be overlapping at first.
- upstream genic regions
- 3'UTRs
- CDS
- introns
- 5'UTRs
- downstream genic regions
In the third and final pass, we go through the list of features and resolve ambiguities of overlapping features using following principles:
-
Any exon outside of outermost start codon is labelled
utr5p
, outside of the outermost stop codon is labelledutr3p
and middle exons are labelled simplyexon
. -
If exon overlaps with an intron, we label such region
exon
. -
Upstream and downstream regions of any gene are initially defined in the second pass as 1 kb up- and downstream. During the third pass regions that receive final
upstream
anddownstream
annotation can be shorter than 1kb: if we detect that the upstream region of geneA overlaps withutr5p
,utr3p
,exon
orintron
of geneB, then only the non-overlapping fraction of the upstream regions of geneA is going to be labelledupstream
in the final annotation. In the extreme cases theupstream
anddownstream
recieve length of 0 nt. For example, this can happen when geneA is located within a long intron of geneB and the intial 1 kb upstream and downstream regions of geneA happen to be entirely contained within the long intron of geneB.
Exons are defined in the first pass from the information in the input GTF file.
Regions between consecutive exons are all labelled as intron
in the second
pass. Some exons will be re-labelled into utr5p
and utr3p
in the second
pass. Note regions of introns (or sometimes entire introns) lose intron
label
in the third pass if an overlap with exonic features is detected.
UTR features of coding genes are defined in the second pass after we store
coordinates of all exons, start and stop codons. In the second pass we select a
single start and a single stop codon position per gene. Then, these two
selected positions are used as stable reference points for assigning UTR
annotation to exon
features of all transcripts of that gene. When gene
encodes multiple transcript isoforms that differ in positions of start and stop
codons, we choose the positions of the outermost codons as the reference
points. Exons, or parts of exons, that are outside of the reference points
receive utr5p
and utr3p
labels, while middle exons remain with exon
label. In case of non-coding genes, i.e. genes without start_codon
and
stop_codon
entries in the input GTF file, all exons remain with the exon
label.
Upstream and downstream regions are defined as 1kb regions outside of outermost exons in the second pass of the program. They will be contracted in the third pass if overlaps with genic features is detected.
Input data is passed to fix_genic_features.py
via three arguments. Command
line flags for passing these arguments to the program, and description of input
format is given below.
To correctly annotate regions outside of genes (i.e. regions that recieve
intergenic
label) it is required to know chromosome lengths. This file
consists of lines with two tab-delimited fields. First field is chromsome
name, the second field is chromsome length.
Example of input genome file for Drosophila melanogaster dm3 genome assembly:
chr2L 23011544
chr2LHet 368872
chr2R 21146708
chr2RHet 3288761
chr3L 24543557
chr3LHet 2555491
chr3R 27905053
chr3RHet 2517507
chr4 1351857
chrM 19517
chrU 10049037
chrUextra 29004656
chrX 22422827
chrXHet 204112
chrYHet 347038
Note: given that we have the genome sequence file in FASTA format, the table with chromosome names and lengths can be generated with get_chr_lengths.pl
Input transcript annotation is required to be in GTF format GTF description. GTF RefSeq transcript annotation can be obtained via UCSC Table Browser.
In the default GTF file only exon
, stop_codon
and start_codon
features are
important. Other features (such as CDS
) can be present in the GTF input file,
but they are ignored by fix_genic_features.py
. Things like intron are
inferred based exon
position.
For it's function of per-gene annotation of genomes, fix_genic_features.py
requires presence of the gene_name
attribute in the input GTF file, since
annotation is done on a per-gene basis. Unfortunately the default RefSeq GTF
table as obtained from UCSC does not include gene_name
attributes, to these
attributes have to be added. Also, since we are interested in discriminating
between types of non-coding RNA (snoRNA, miRNA, lincRNA and etc.) we mannually
curate annotation of non-coding RNAs in the default GTF files. Description
of these procedures is in
curating_RefSeq_GTF
Fortunately, however, the relationship between RefSeq transcript_id
and
gene_name
field is preserved when UCSC Table browser is instructed to output
all fields
. Therefore it easy to supplement the GTF table with gene_name
attributes via Perl snippet
add_gene_names_to_gtf.pl.
In later stages of the analysis, performed by
summarize_features.py
we distribute reads between overlapping genes. Overlaps of two protein coding
genes are treated differently from overlaps of protein-coding and non-coding
genes (this is because some highly abundant non-coding RNAs are artifacts of
cloning see documentation for
summarize_features).
Therefore we manually curate GTF files from UCSC Table browser and add
gene_type
attribute to each feature (e.g. protein_coding
,
AR_miRBase_hairpin
, and etc.). The curation of the UCSC RefSeq GTF files is
described
here.
Example input GTF file with added gene_name
and gene_type
attributes:
...
chr1 mm10_refGene exon 3421702 3421901 0.000000 - . gene_id "NM_001011874"; transcript_id "NM_001011874"; gene_name "Xkr4" ; gene_type "protein_coding";
chr1 00RM00000000 exon 3428386 3428449 . + . gene_id "NR_RM:SRP"; transcript_id "NR_RM:SRP"; gene_name "NR_RM:7SLRNA_dup1"; gene_type "AR_SRP";
chr1 00RM00000000 exon 3529769 3529806 . + . gene_id "NR_RM:tRNA"; transcript_id "NR_RM:tRNA"; gene_name "NR_RM:tRNA-Gly-GGA_dup1"; gene_type "AR_tRNA";
chr1 mm10_refGene CDS 3670552 3671348 0.000000 - 0 gene_id "NM_001011874"; transcript_id "NM_001011874"; gene_name "Xkr4" ; gene_type "protein_coding";
...
The gene_type
attribute is concatenated to gene_name
attribute via #
character in the output file.
Since fix_genic_features.py
uses gene_name
attribute as a key in a
dictionary to store exon information, it is a problem if several genes share
the same name. Luckily, the UCSC RefSeq GTF file contains transcript_id
attribute that is strictly unique per-chromosome: when copies of the same
sequence present as different genes on one chromosome in the GTF file they are
distinguished by _dupN
suffix. Therefore, we reasoned that the easiest way
to make sure that gene_name
attributes are unique, fix_genic_features.py
simply moves _dupN
suffix from transcript_id
onto gene_name
whenever the
suffix is present.
When we manually curate the GTF file and add some non-coding transcripts to it
(description of curating procedure is
here),
we make sure that gene_name
is unique to start with by adding _dupN
suffx
to the gene_name
straight away.
During the first pass of fix_genic_features.py
, every genic feature on any
non-coding gene belonging to selected gene_type
categories is expanded by 25
nt from both ends. Currently this procedure is performed for AR_miRBase_hairpin
,
AR_snRNA
, AR_snoRNA
.
This extension of features of non-coding genes is done since during subsequent
counting stage (counting is performed by
rpkm_genic_features.bxt5.py)
we really do not want to count any of the reads overlapping with non-coding
genes as if they belong to a coding gene. Therefore we artificially expand
coordinates of features of non-coding genes by 25 nt. Note, that even though
both exon
and intron
may originally expanded, only exon
features remain
expanded in the final output, since exon
trumps intron
during the third
pass of fix_genic_features.py
when ambiguities of overlapping features are
resolved (see PROCEDURE
section above for more details).