RNA-seq pipeline for the GTEx Consortium
This repository contains all components of the RNA-seq pipeline used by the GTEx Consortium, including alignment, expression quantification, and quality control.
The GTEx RNA-seq pipeline is provided as a Docker image, available at https://hub.docker.com/r/broadinstitute/gtex_rnaseq/
To download the image, run:
docker pull broadinstitute/gtex_rnaseq:V10
The following tools are included in the Docker image:
- SamToFastq: BAM to FASTQ conversion
- FastQC: sequencing quality control
- STAR: spliced alignment of RNA-seq reads
- Picard MarkDuplicates: mark duplicate reads
- RSEM: transcript expression quantification
- bamsync: utility for transferring QC flags and re-generating read group IDs when realigning BAMs
- RNA-SeQC: RNA-seq quality control (metrics and gene-level expression quantification)
Versions used across GTEx releases*:
| | V7 | V8 | V10 |
|---|---|---|---|
| STAR | v2.4.2a | v2.5.3a | v2.7.10a |
| RSEM | v1.2.22 | v1.3.0 | v1.3.3 |
| RNA-SeQC | v1.1.8 | v1.1.9 | v2.4.2 |
| Genome | GRCh37 | GRCh38 | GRCh38 |
| GENCODE | v19 | v26 | v39 |
*V9 did not include any RNA-seq updates
Reference indexes for STAR and RSEM are needed to run the pipeline. All reference files are available at gs://gtex-resources.
GTEx releases from V8 onward are based on the GRCh38/hg38 reference genome. See TOPMed_RNAseq_pipeline.md for details and links to this reference. Releases up to V7 were based on the GRCh37/hg19 reference genome.
For hg19-based analyses, the GENCODE annotation should be patched to use Ensembl chromosome names:
zcat gencode.v19.annotation.gtf.gz | \
sed 's/chrM/chrMT/;s/chr//' > gencode.v19.annotation.patched_contigs.gtf
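To see what the two substitutions do, the same sed expression can be applied to a few sample records. This is an illustrative sketch only: the `patch_contigs` function name and the abbreviated records are made up for the example and are not part of the pipeline.

```shell
# Hypothetical wrapper around the sed expression used above.
# Each substitution applies only to the first match on a line,
# which for GTF records is the contig name in column 1.
patch_contigs() {
    sed 's/chrM/chrMT/;s/chr//'
}

# Abbreviated, made-up GTF records (real records have 9 columns)
printf 'chr1\tHAVANA\tgene\nchrX\tHAVANA\tgene\nchrM\tENSEMBL\tgene\n' | patch_contigs
```

With this input, `chr1` and `chrX` lose their `chr` prefix, and `chrM` is first renamed to `chrMT` and then stripped to `MT`, matching the Ensembl naming.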
The collapsed version for RNA-SeQC was generated with:
python collapse_annotation.py --transcript_blacklist gencode19_unannotated_readthrough_blacklist.txt \
gencode.v19.annotation.patched_contigs.gtf gencode.v19.annotation.patched_contigs.collapsed.gtf
The STAR index should be built to match the sequencing read length, which is set through the sjdbOverhang parameter (ideally the read length minus 1). GTEx samples were sequenced using a 2x76 bp paired-end protocol, so the matching sjdbOverhang is 75.
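For data with a different read length, the overhang can be derived rather than hard-coded. The following is a minimal sketch, not part of the pipeline: the `sjdb_overhang` function and the `read_length` variable are hypothetical, and it assumes the "read length minus 1" rule, which is consistent with 76 bp reads and an overhang of 75.

```shell
# Hypothetical helper (not part of the GTEx pipeline): derive the
# sjdbOverhang value from the read (mate) length, assuming the
# "read length - 1" rule.
sjdb_overhang() {
    case "$1" in
        # reject empty input, non-digits, zero, and leading zeros
        ''|*[!0-9]*|0*)
            echo "read length must be a positive integer" >&2
            return 1
            ;;
    esac
    echo $(( $1 - 1 ))
}

# GTEx protocol: 2x76 bp paired-end reads
read_length=76
echo "star_index_oh$(sjdb_overhang "$read_length")"
```

The printed name follows the `star_index_oh75` directory naming used in this README, so indexes built for different read lengths can sit side by side.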
# build the STAR index:
mkdir $path_to_references/star_index_oh75
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "STAR \
--runMode genomeGenerate \
--genomeDir /data/star_index_oh75 \
--genomeFastaFiles /data/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
--sjdbGTFfile /data/gencode.v39.GRCh38.annotation.gtf \
--sjdbOverhang 75 \
--runThreadN 4"
# build the RSEM index (the output directory must exist):
mkdir -p $path_to_references/rsem_reference
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "rsem-prepare-reference \
/data/Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta \
/data/rsem_reference/rsem_reference \
--gtf /data/gencode.v39.GRCh38.annotation.gtf \
--num-threads 4"
Individual components of the pipeline can be run using the commands below. It is assumed that the $path_to_data directory contains the input data and reference indexes.
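Before running the individual steps, it can help to confirm that the expected inputs are in place. The following is a minimal sketch, not part of the pipeline: the `check_inputs` function is hypothetical, and it assumes the layout implied by the commands in this README (`star_index_oh75`, `rsem_reference`, and the input BAM directly under `$path_to_data`).

```shell
# Hypothetical pre-flight check (not part of the GTEx pipeline).
# Usage: check_inputs <path_to_data> <input_bam>
# The body runs in a subshell so its variables do not leak into the caller.
check_inputs() (
    data_dir="$1"
    bam="$2"
    status=0
    # reference indexes assumed to sit directly under the data directory
    for d in star_index_oh75 rsem_reference; do
        if [ ! -d "$data_dir/$d" ]; then
            echo "missing reference directory: $data_dir/$d" >&2
            status=1
        fi
    done
    if [ ! -f "$data_dir/$bam" ]; then
        echo "missing input BAM: $data_dir/$bam" >&2
        status=1
    fi
    # exit leaves only the subshell and becomes the function's return status
    exit "$status"
)
```

A typical call would be `check_inputs "$path_to_data" "$input_bam" && echo "inputs OK"`; all missing items are reported in one pass rather than stopping at the first.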
# BAM to FASTQ conversion
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "/src/run_SamToFastq.py /data/$input_bam -p ${sample_id} -o /data"
# STAR alignment
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "/src/run_STAR.py \
/data/star_index_oh75 \
/data/${sample_id}_1.fastq.gz \
/data/${sample_id}_2.fastq.gz \
${sample_id} \
--threads 4 \
--output_dir /tmp/star_out && mv /tmp/star_out /data/star_out"
# sync BAMs (optional; copy QC flags and read group IDs)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "/src/run_bamsync.sh \
/data/$input_bam \
/data/star_out/${sample_id}.Aligned.sortedByCoord.out.bam \
/data/star_out/${sample_id}"
# mark duplicates (Picard)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "/src/run_MarkDuplicates.py \
/data/star_out/${sample_id}.Aligned.sortedByCoord.out.patched.bam \
${sample_id}.Aligned.sortedByCoord.out.patched.md \
--output_dir /data"
# RNA-SeQC
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "/src/run_rnaseqc.py \
/data/${sample_id}.Aligned.sortedByCoord.out.patched.md.bam \
${genes_gtf} \
${genome_fasta} \
${sample_id} \
--output_dir /data"
# RSEM transcript quantification
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "/src/run_RSEM.py \
/data/rsem_reference \
/data/star_out/${sample_id}.Aligned.toTranscriptome.out.bam \
/data/${sample_id} \
--threads 4"
Sample-level outputs in GCT format can be concatenated using combine_GCTs.py:
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V10 \
/bin/bash -c "python3 /src/combine_GCTs.py \
${rnaseqc_tpm_gcts} ${sample_set_id}.rnaseqc_tpm"
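One way to prepare the `${rnaseqc_tpm_gcts}` input is to collect the per-sample GCT paths as they appear inside the container. The sketch below is illustrative only: the `list_gcts` helper is hypothetical, the `.gene_tpm.gct.gz` suffix is an assumed naming for the RNA-SeQC TPM outputs, and it assumes (without confirmation from this README) that combine_GCTs.py accepts a text file listing one GCT path per line. If the script only takes paths as arguments, the same list can be expanded on the command line instead.

```shell
# Hypothetical helper (not part of the GTEx pipeline): print one path per
# matching file, rewritten to the directory as seen inside the container.
# Usage: list_gcts <host_directory> <file_suffix> <container_directory>
list_gcts() {
    for f in "$1"/*"$2"; do
        # skip the unexpanded pattern when nothing matches
        [ -f "$f" ] || continue
        echo "$3/$(basename "$f")"
    done
}
```

For example, `list_gcts "$path_to_data" .gene_tpm.gct.gz /data > "$path_to_data/rnaseqc_tpm_gcts.txt"` writes the list into the mounted directory, after which `rnaseqc_tpm_gcts=/data/rnaseqc_tpm_gcts.txt` could be passed to the command above.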