uORF Annotator is a tool for annotating the functional impact of genetic variants in upstream open reading frames (uORFs) in the human genome. The uORFs were manually annotated in 3641 OMIM genes based on publicly available Ribo-seq and other data types.
Install all dependencies from requirements.yml as a new conda environment.
conda env create -f requirements.yml
- VCF file of variants for further annotation
- BED file with available uORFs (sorted.v4.bed in this repository)
- GTF file with genomic features annotation
- [optional] TSV file with gene-level gnomAD constraint statistics
It is highly recommended to run the tool with a GTF file containing uORF-matching transcript isoforms for each gene (combined_uorf.v4.gtf in this repository). If the GTF file lacks matching transcript IDs, the main CDS effect will not be annotated properly.
python uORF_annotator.py \
-i <input_variants.vcf> \
-b <input_uORFs.bed> \
-g <input_annotation.gtf> \
-f <human_genome.fasta> \
-gc <gnomad_constraint> \
-out <output_file_prefix> \
-utr
Two TSV outputs are generated: one for ATG-started uORFs and one for non-ATG-started uORFs. Each row represents the annotation of a single variant in a particular uORF (per-uORF annotation). The fields in each file are as follows:
- #CHROM - contig name
- POS - position
- REF - reference allele
- ALT - alternative allele
- orf_start - uORF start position
- orf_end - uORF end position
- strand - strand direction, defined as + (forward) or - (reverse)
- gene_name - gene symbol from the GTF annotation
- transcript - transcript ID from uORF file
- codon_type - ATG or non-ATG start codon in uORF
- overlapping_type - type of uORF (overlapping/non-overlapping) from the input .bed file
- dist_from_orf_to_snp - distance from start of uORF to variant position
- consequence - type of effect
- main_cds_effect - effect of uORF variant on the main CDS (annotated for frameshift and stop lost variants only)
- in_known_ORF - a binary flag indicating if a variant affects the main ORF of the gene
- pLI score (if gnomAD constraint annotation is provided)
- LOEUF score (if gnomAD constraint annotation is provided)
- INFO - original INFO field from the input .vcf file
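As a minimal sketch of how the per-uORF table might be consumed downstream, the snippet below groups rows by variant using only the Python standard library. It assumes the file is tab-delimited with a header row carrying the column names listed above. The sample rows, the gene name, and consequence values such as frameshift are invented for illustration and are not real tool output.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_variant(handle):
    """Map (chrom, pos, ref, alt) to the list of per-uORF annotation rows."""
    # Assumption: tab-delimited file whose header uses the field names above.
    reader = csv.DictReader(handle, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        key = (row["#CHROM"], int(row["POS"]), row["REF"], row["ALT"])
        grouped[key].append(row)
    return dict(grouped)


# Hypothetical excerpt: one variant that falls into two uORFs.
example_tsv = (
    "#CHROM\tPOS\tREF\tALT\torf_start\torf_end\tstrand\tgene_name\tconsequence\n"
    "chr1\t1000\tA\tG\t990\t1050\t+\tGENE1\tmissense\n"
    "chr1\t1000\tA\tG\t970\t1020\t+\tGENE1\tframeshift\n"
)

variants = group_rows_by_variant(io.StringIO(example_tsv))
for variant, rows in variants.items():
    print(variant, [r["consequence"] for r in rows])
```

For a real run, the in-memory string would be replaced with an opened output file, for example open(path, newline="").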
The generated VCF output contains all variants affecting uORF sequences. Each variant is annotated with the following INFO fields: uORFs, uORFs_ATG, uORFs_eff. The fields are described below:
- uORFs - a full consequence annotation for each variant-uORF combination. Format: 'ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|main_cds_effect|in_known_CDS|in_known_ORF|utid|overlapping_type|dominance_type|codon_type'
- uORFs_ATG - a flag indicating if a variant falls within at least one ATG-starting uORF.
- uORFs_eff - a short notation of how a change in uORF structure resulting from a variant affects the main coding part (CDS) of a gene: ext - N-terminal extension, overl - out-of-frame overlap, activ - overlap removal with possible main ORF activation, unaff - no effect on main CDS. If the variant falls into more than one uORF, the effects on each of them are listed, separated by &.
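The pipe-delimited uORFs value and the &-separated uORFs_eff value can be unpacked with plain string splitting. The sketch below is illustrative only: the key order follows the format string above, but the separator between multiple variant-uORF records inside one uORFs value is not documented here, so the comma default (record_sep) is an assumption. The example values are invented.

```python
UORF_KEYS = [
    "ORF_START", "ORF_END", "ORF_SYMB", "ORF_CONSEQ", "main_cds_effect",
    "in_known_CDS", "in_known_ORF", "utid", "overlapping_type",
    "dominance_type", "codon_type",
]


def parse_uorfs(value, record_sep=","):
    """Split a uORFs INFO value into one dict per variant-uORF combination.

    record_sep is an assumption: the separator between multiple records
    within a single INFO value is not stated in the README.
    """
    records = []
    for chunk in value.split(record_sep):
        parts = chunk.split("|")
        if len(parts) != len(UORF_KEYS):
            raise ValueError(
                f"expected {len(UORF_KEYS)} fields, got {len(parts)}: {chunk!r}"
            )
        records.append(dict(zip(UORF_KEYS, parts)))
    return records


def parse_uorfs_eff(value):
    """Split a uORFs_eff value such as 'overl&unaff' into a list of effects."""
    return value.split("&")


# Invented example values, for illustration only.
example = (
    "990|1050|GENE1|frameshift|overl|False|False|"
    "ENST0000001|non-overlapping|dominant|ATG"
)
parsed = parse_uorfs(example)
effects = parse_uorfs_eff("overl&unaff")
print(parsed[0]["ORF_CONSEQ"], effects)
```

Raising on an unexpected field count is a deliberate choice here: it surfaces a wrong separator guess immediately instead of silently misaligning keys and values.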
uORF Annotator generates two BED files with uORFs affected by variants that alter the uORF length (one file contains ATG uORFs and the other contains non-ATG-started uORFs). Both BED files contain two entries for each affected uORF:
- initial uORF, with name field format uORF_unique_number-gene_name|uORF_type|start_codon_type(ATG/non-ATG), filled with black color;
- resulting uORF after introduction of a variant, with name field format uORF_unique_number-gene_name|variant|variant_type|main_CDS_effect, filled with a color that depends on the effect. Color legend:
- Grey features - cases when the variant does not change the overlap between uORF and main CDS.
- Orange features - cases when (a) a uORF-truncating variant eliminates the existing overlap between uORF and main CDS; or (b) the variant leads to the production of a chimeric protein product of the gene, possessing an extension at the N-terminus resulting from uORF translation.
- Red features - cases where variant leads to the appearance of a new overlapping segment between uORF and main gene CDS, with the two sequences translated in different frames.
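The two name field formats described above differ in the number of pipe-separated parts, which is enough to tell initial and resulting entries apart when post-processing the BED files. The following is a rough sketch under two assumptions: the uORF unique number itself contains no hyphen, and the variant string contains no pipe character. The example names, including the way the variant is written, are invented.

```python
def parse_uorf_bed_name(name):
    """Parse the BED name field of an initial or a resulting uORF entry."""
    head, *rest = name.split("|")
    # Split on the first hyphen only, since gene symbols may contain hyphens.
    uorf_number, gene_name = head.split("-", 1)
    if len(rest) == 2:
        kind = "initial"
        keys = ("uORF_type", "start_codon_type")
    elif len(rest) == 3:
        kind = "resulting"
        keys = ("variant", "variant_type", "main_CDS_effect")
    else:
        raise ValueError(f"unrecognised name field: {name!r}")
    record = {
        "kind": kind,
        "uORF_unique_number": uorf_number,
        "gene_name": gene_name,
    }
    record.update(zip(keys, rest))
    return record


# Invented example names, for illustration only.
initial = parse_uorf_bed_name("12-HLA-A|non-overlapping|ATG")
resulting = parse_uorf_bed_name("12-HLA-A|chr6:100:A>G|stop_lost|ext")
print(initial["gene_name"], resulting["main_CDS_effect"])
```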
This repository contains two additional files:
- Annotated_uORFs_and_alt.CDS.starts_v4.8.bed - Manually annotated alternative open reading frames (including non-overlapping and overlapping uORFs, CDS_extensions and CDS_truncations) found in 3641 human genes from the OMIM database. BED file; the "name" field contains 'gene_name|isoform|type_of_ORF|start_codon'.
- High-confidence_uORFs_v2.bed - List of high-confidence uORFs found in 3641 human genes from the OMIM database. This list includes only uORFs predicted in at least two out of four different studies (the present study, Ji et al. (PMID: 26687005), McGillivray et al. (PMID: 29562350), Scholz et al. (PMID: 31513641)). BED file; the "name" field contains 'gene_name|isoform|type_of_uORF|type_of_start_codon(ATG/non-ATG)'.
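To get a quick overview of either file, the name field can be tallied by ORF type and start codon. This is a sketch only: it assumes standard tab-separated BED with the name in the fourth column and exactly four pipe-separated parts in that name. The example lines and the type labels in them are invented and may not match the labels used in the actual files.

```python
import io
from collections import Counter


def count_orf_types(handle):
    """Count BED entries per (type_of_ORF, start_codon) pair."""
    counts = Counter()
    for line in handle:
        # Skip blank lines and optional BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        columns = line.rstrip("\n").split("\t")
        # Assumption: name is 'gene_name|isoform|type_of_ORF|start_codon'.
        gene_name, isoform, orf_type, start_codon = columns[3].split("|")
        counts[(orf_type, start_codon)] += 1
    return counts


# Invented BED excerpt, for illustration only.
example_bed = (
    "chr1\t100\t160\tGENE1|ENST0000001|uORF|ATG\t0\t+\n"
    "chr1\t300\t390\tGENE2|ENST0000002|uORF|ATG\t0\t-\n"
    "chr2\t500\t620\tGENE3|ENST0000003|CDS_extension|non-ATG\t0\t+\n"
)

counts = count_orf_types(io.StringIO(example_bed))
for (orf_type, start_codon), n in sorted(counts.items()):
    print(orf_type, start_codon, n)
```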