Skip to content
Branch: master
Find file History
Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
bamsync Adding README for bamsync Oct 10, 2018
src Post-processing script for STAR junctions Aug 30, 2019
Dockerfile Fixed numpy dependency for pyBigWig Oct 28, 2018
README.md Updated GENCODE links Nov 12, 2018
bamsync.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
markduplicates.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
rnaseq_pipeline_bam.wdl Updated pipeline WDLs to import tasks Jul 15, 2018
rnaseq_pipeline_fastq.wdl Updated pipeline WDLs to import tasks Jul 15, 2018
rnaseqc_aggregate.wdl Combined aggregation task for RNA-SeQC May 1, 2018
rnaseqc_counts.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
rnaseqc_metrics.wdl New WLDs for RNA-SeQC (count quantification & QC) Feb 15, 2017
rsem.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
rsem_aggregate.wdl Updated pipeline WDLs to import tasks Jul 15, 2018
rsem_reference.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
samtofastq.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
samtools_view.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
star.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017
star_index.wdl Changed to pull images from gcr.io; harmonized workflows and individu… Dec 3, 2017

README.md

RNA-seq pipeline for the GTEx Consortium

This repository contains all components of the RNA-seq pipeline used by the GTEx Consortium, including alignment, expression quantification, and quality control.

Docker image

The GTEx RNA-seq pipeline is provided as a Docker image, available at https://hub.docker.com/r/broadinstitute/gtex_rnaseq/

To download the image, run:

docker pull broadinstitute/gtex_rnaseq:V8

Image contents and pipeline components

The following tools are included in the Docker image (V8 pipeline):

  • SamToFastq: BAM to FASTQ conversion
  • STAR: spliced alignment of RNA sequence reads (v2.5.3a)
  • Picard MarkDuplicates: mark duplicate reads
  • RSEM transcript expression quantification (v1.3.0)
  • bamsync: utility for transferring QC flags from the input BAM and for re-generating read group IDs
  • RNA-SeQC: QC metrics and gene-level expression quantification (v1.1.9)

Version in V7 pipeline:

  • STAR v2.4.2a
  • RSEM v1.2.22
  • RNA-SeQC v1.1.8

Setup steps

Reference genome and annotation

Reference indexes for STAR and RSEM are needed to run the pipeline.

GTEx releases up to V7 were based on the GRCh37/hg19 genome reference (download). Releases from V8 onward are based on GRCh38/hg38. Please see TOPMed_RNAseq_pipeline.md for details and links for this reference.

Releases V6/V6p and V7 were based on the GENCODE v19 annotation; release V8 will be based on GENCODE v26.

For hg19-based analyses, the GENCODE annotation should be patched to use Ensembl chromosome names:

zcat gencode.v19.annotation.gtf.gz | \
    sed 's/chrM/chrMT/;s/chr//' > gencode.v19.annotation.patched_contigs.gtf

Building the indexes

The STAR index should be built to match the sequencing read length, specified by the sjdbOverhang parameter. GTEx samples were sequenced using a 2x76 bp paired-end sequencing protocol, and the matching sjdbOverhang is 75.

# build the STAR index:
mkdir $path_to_references/star_index_oh75
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "STAR \
        --runMode genomeGenerate \
        --genomeDir /data/star_index_oh75 \
        --genomeFastaFiles /data/Homo_sapiens_assembly19.fasta \
        --sjdbGTFfile /data/gencode.v19.annotation.patched_contigs.gtf \
        --sjdbOverhang 75 \
        --runThreadN 4"

# build the RSEM index:
docker run --rm -v $path_to_references:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "rsem-prepare-reference \
        /data/Homo_sapiens_assembly19.fasta \
        /data/rsem_reference/rsem_reference \
        --gtf /data/gencode.v19.annotation.patched_contigs.gtf \
        --num-threads 4"

Running the pipeline

Individual components of the pipeline can be run using the commands below. It is assumed that the $path_to_data directory contains the input data and reference indexes.

# BAM to FASTQ conversion
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq \
    /bin/bash -c "/src/run_SamToFastq.py /data/$input_bam -p ${sample_id} -o /data"

# STAR alignment
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_STAR.py \
        /data/star_index_oh75 \
        /data/${sample_id}_1.fastq.gz \
        /data/${sample_id}_2.fastq.gz \
        ${sample_id} \
        --threads 4 \
        --output_dir /tmp/star_out && mv /tmp/star_out /data/star_out"

# sync BAMs (optional; copy QC flags and read group IDs)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_bamsync.sh \
        /data/$input_bam \
        /data/star_out/${sample_id}.Aligned.sortedByCoord.out.bam \
        /data/star_out/${sample_id}"

# mark duplicates (Picard)
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_MarkDuplicates.py \
        /data/star_out/${sample_id}.Aligned.sortedByCoord.out.patched.bam \
        ${sample_id}.Aligned.sortedByCoord.out.patched.md \
        --output_dir /data"

# RNA-SeQC
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_rnaseqc.py \
    ${sample_id}.Aligned.sortedByCoord.out.patched.md.bam \
    ${genes_gtf} \
    ${genome_fasta} \
    ${sample_id} \
    --output_dir /data"

# RSEM transcript quantification
docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "/src/run_RSEM.py \
        /data/rsem_reference \
        /data/star_out/${sample_id}.Aligned.toTranscriptome.out.bam \
        /data/${sample_id} \
        --threads 4"

Aggregating outputs

Sample-level outputs in GCT format can be concatenated using combine_GCTs.py:

docker run --rm -v $path_to_data:/data -t broadinstitute/gtex_rnaseq:V8 \
    /bin/bash -c "python3 /src/combine_GCTs.py \
        ${rnaseqc_rpkm_gcts} ${sample_set_id}.rnaseqc_rpkm"
You can’t perform that action at this time.