This repository contains utilities for the generation of gene models and annotations used in the RNA-seq and eQTL pipelines.
Gene-level expression and eQTLs from the GTEx project are calculated based on a collapsed gene model (i.e., combining all isoforms of a gene into a single transcript), according to the following rules:
1. Transcripts annotated as “retained_intron” or “read_through” are excluded. Additionally, transcripts that overlap with annotated read-through transcripts may be blacklisted (blacklists for GENCODE v19, 24 & 25 are provided in this repository; no transcripts were blacklisted for v26).
2. The union of all exon intervals of each gene is calculated.
3. Overlapping intervals between genes are excluded from all genes.
The purpose of step 3 is primarily to exclude overlapping regions from genes annotated on both strands, which can't be unambiguously quantified from unstranded RNA-seq (GTEx samples were sequenced using an unstranded protocol). For stranded protocols, this step can be skipped by adding the --collapse_only flag.
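
The union and overlap-exclusion rules can be illustrated with a minimal Python sketch. It is not the code in collapse_annotation.py: the function names (`merge_intervals`, `subtract_intervals`, `collapse`) and the `collapse_only` keyword are hypothetical. It assumes all genes lie on one chromosome, ignores strand, and uses 1-based inclusive coordinates as in GTF.

```python
def merge_intervals(intervals):
    """Union of closed integer intervals, returned as sorted disjoint [start, end] pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or directly adjacent: extend the previous interval.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def subtract_intervals(intervals, remove):
    """Drop every position covered by `remove` from `intervals` (both sorted and disjoint)."""
    result = []
    for start, end in intervals:
        cur = start
        for r_start, r_end in remove:
            if r_end < cur:
                continue
            if r_start > end:
                break
            if r_start > cur:
                result.append([cur, r_start - 1])
            cur = max(cur, r_end + 1)
            if cur > end:
                break
        if cur <= end:
            result.append([cur, end])
    return result


def collapse(exons_by_gene, collapse_only=False):
    """Hypothetical helper: per-gene exon union, then removal of regions shared between genes."""
    # Union of all exon intervals of each gene.
    unions = {gene: merge_intervals(exons) for gene, exons in exons_by_gene.items()}
    if collapse_only:
        return unions

    # Sweep over interval boundaries to find positions covered by two or more genes.
    events = []
    for intervals in unions.values():
        for start, end in intervals:
            events.append((start, 1))
            events.append((end + 1, -1))
    events.sort()  # at equal positions, closings (-1) sort before openings (+1)

    shared, depth, shared_start = [], 0, None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < 2 <= depth:
            shared_start = pos
        elif depth < 2 <= previous:
            shared.append([shared_start, pos - 1])
    shared = merge_intervals(shared)

    # Exclude the shared regions from every gene.
    return {gene: subtract_intervals(iv, shared) for gene, iv in unions.items()}


# Toy input: gene A has two overlapping exons plus a third; gene B overlaps both A regions.
exons = {"A": [(100, 200), (150, 300), (400, 500)], "B": [(250, 450)]}
print(collapse(exons))                      # {'A': [[100, 249], [451, 500]], 'B': [[301, 399]]}
print(collapse(exons, collapse_only=True))  # {'A': [[100, 300], [400, 500]], 'B': [[250, 450]]}
```

In the toy example, positions 250-300 and 400-450 are covered by both genes, so they are removed from both. With `collapse_only=True` only the per-gene union is computed, mirroring the behavior described for stranded protocols.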
Command:
    python3 collapse_annotation.py gencode.v26.GRCh38.annotation.gtf gencode.v26.GRCh38.genes.gtf

where gencode.v26.GRCh38.annotation.gtf is the GTF from GENCODE.
Further documentation is available on the GTEx Portal.