Gathering input files

There are dozens of input files needed for these pipelines. These include things like reference genomes, aligner indices, gene annotations, and known variants. This guide aims to be a comprehensive walkthrough of where to find those files and/or how to create them. Code snippets are included for both mouse and human, using GRCh38/GRCm38 and Ensembl version 95. (We welcome contributions of additional species)

If you just want to grab a pre-built blob containing all of these files, visit the links below (work in progress)

To create them yourself, follow the links below, each labelled with key pipelines for which they're needed

- Alignment
- Germline variant calling
- Somatic variant calling (exome/WGS)
- RNAseq (bulk)
- Bisulfite

Before running any snippets, make a directory where you'd like this cache to be stored, and set the BASEDIR variable to point to it:

BASEDIR=/path/to/annotation_data_grch38_ens95

Reference genome

fasta

human

Note that we use a slightly modified version of the GRCh38 reference which N-masks a single contig. That contig contained a duplication of the U2AF1 gene locus, resulting in multi-mapping and loss of variant calls in that important cancer gene. The coordinate system is preserved and alignments should be otherwise unaffected.

mkdir -p $BASEDIR/reference_genome
wget -O $BASEDIR/reference_genome/all_sequences.fa.gz https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/reference_GRCh38/all_sequences.fa.gz
gunzip $BASEDIR/reference_genome/all_sequences.fa.gz

mouse

mkdir -p $BASEDIR/reference_genome
wget -O $BASEDIR/reference_genome/all_sequences.fa.gz https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/reference_GRCm38/all_sequences.fa.gz
gunzip $BASEDIR/reference_genome/all_sequences.fa.gz

fasta index

human or mouse

From docker image `mgibio/samtools-cwl:1.0.0`

/opt/samtools/bin/samtools faidx $BASEDIR/reference_genome/all_sequences.fa

Sequence dictionary

human or mouse

From docker image `mgibio/picard-cwl:2.18.1`

java -jar /opt/picard/picard.jar CreateSequenceDictionary R=$BASEDIR/reference_genome/all_sequences.fa O=$BASEDIR/reference_genome/all_sequences.dict

Contig lengths

human or mouse

 perl -ne 'print "$1\t$2\n" if $_ =~ /SN:([^\s]+).+LN:(\d+)/' $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/reference_genome/all_sequences.genome

Ensembl/VEP annotation files

VEP cache

human

Select the most recent [docker image from your desired ensembl version](https://hub.docker.com/r/ensemblorg/ensembl-vep/tags). Our current release uses `ensemblorg/ensembl-vep:release_95.0` Inside that container, run the following (mount the appropriate volumes if needed):

mkdir $BASEDIR/vep_cache
/usr/bin/perl /opt/vep/src/ensembl-vep/INSTALL.pl --CACHEDIR $BASEDIR/vep_cache --AUTO cf --SPECIES homo_sapiens --ASSEMBLY GRCh38

If for some reason the desired cache version and docker image version in use do not match, add --CACHE_VERSION <version>

mouse

Select the most recent [docker image from your desired ensembl version](https://hub.docker.com/r/ensemblorg/ensembl-vep/tags). Our current release uses `ensemblorg/ensembl-vep:release_95.0` Inside that container, run the following (mount the appropriate volumes if needed):

mkdir $BASEDIR/vep_cache
/usr/bin/perl /opt/vep/src/ensembl-vep/INSTALL.pl --CACHEDIR $BASEDIR/vep_cache --AUTO cf --SPECIES mus_musculus --ASSEMBLY GRCm38

If for some reason the desired cache version and docker image version in use do not match, add --CACHE_VERSION <version>

Genes/Transcripts GTF

human

mkdir -p $BASEDIR/rna_seq_annotation
cd $BASEDIR/rna_seq_annotation
wget -O $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz
gunzip Homo_sapiens.GRCh38.95.gtf.gz

The reference uses "chr" prefixes (chr1, chr2, chr3), while the Ensembl gtf does not (1, 2, 3), so we need to convert the gtf:

cd $BASEDIR/rna_seq_annotation/
wget https://raw.githubusercontent.com/chrisamiller/convertEnsemblGTF/master/convertEnsemblGTF.pl
mv $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig
perl convertEnsemblGTF.pl $BASEDIR/reference_genome/all_sequences.dict $BASEDIR/vep_cache/homo_sapiens/95_GRCh38/chr_synonyms.txt 
$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf && rm -f $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig

mouse

mkdir -p $BASEDIR/rna_seq_annotation
wget -O $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/mus_musculus/Mus_musculus.GRCm38.95.gtf.gz
gunzip $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf.gz

Transcriptome cDNA reference

human

Note that we're not going to convert these coordinates from [1,2,3] to [chr1,chr2,chr3], since Kallisto doesn't include coordinates in it's output. If pseudobams from kallisto are desired, that conversion would need to be done.

wget -O $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.gz http://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

mouse

wget -O $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.gz ftp://ftp.ensembl.org/pub/release-95/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

Aligner indices

bwa mem index

human or mouse

From docker image: `mgibio/alignment_helper-cwl:1.0.0`

mkdir -p $BASEDIR/aligner_indices/bwamem_0.7.15 
cd $BASEDIR/aligner_indices/bwamem_0.7.15
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do 
    ln -s $BASEDIR/reference_genome/$i .
done
/usr/local/bin/bwa index all_sequences.fa

STAR-fusion index

human

In the `trinityctat/starfusion` docker container:

mkdir -p $BASEDIR/aligner_indices/star_fusion 
cd $BASEDIR/aligner_indices/star_fusion
/usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl --CPU 10 --genome_fa $BASEDIR/reference_genome/all_sequences.fa --gtf $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf --pfam_db current --dfam_db human --output_dir $BASEDIR/aligner_indices/star_fusion"

mouse

In the `trinityctat/starfusion` docker container:

mkdir -p $BASEDIR/aligner_indices/star_fusion 
cd $BASEDIR/aligner_indices/star_fusion
/usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl --CPU 10 --genome_fa $BASEDIR/reference_genome/all_sequences.fa --gtf $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf --pfam_db current --dfam_db mouse --output_dir $BASEDIR/aligner_indices/star_fusion"

HISAT Index (for older pipeline versions only)

human or mouse

From docker image: mgibio/rnaseq

mkdir -p $BASEDIR/aligner_indices/hisat_2.1.0
cd $BASEDIR/aligner_indices/hisat_2.1.0
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do 
    ln -s $BASEDIR/reference_genome/$i .
done
echo "This is a placeholder file to pass to CWL to refer to all the index files and be a parameter to HISAT2." >GRCm38
/opt/hisat2/hisat2-2.1.0/hisat2-build -p 8 all_sequences.fa GRCm38"

biscuit index

human or mouse

From docker image: `mgibio/biscuit:0.3.8`

mkdir -p $BASEDIR/aligner_indices/biscuit_0.3.8
cd $BASEDIR/aligner_indices/biscuit_0.3.8
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do 
    ln -s $BASEDIR/reference_genome/$i .
done
/usr/bin/biscuit index all_sequences.fa

Known Variants (these are all human-only)

known snps and indels

human

dbSNP/indels from the Mills/1000G/etc datasets for use with GATK/Mutect from the `google/cloud-sdk` image:

mkdir $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi $BASEDIR/known_variants

DoCM cancer variants

human

Variants from the [Database of Canonical Mutations](http://docm.info) used for hotspot checking in somatic pipelines

wget -O $BASEDIR/known_variants/docm.vcf https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/docm.GRCh38.vcf

Gnomad population frequencies

human

This command downloads the V3 of the genome [Gnomad Database](https://gnomad.broadinstitute.org) which has the allele frequencies from 70k WGS samples aligned to GRCh38. This file is ~236gb.

wget -O $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz
wget -O $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz.tbi https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz.tbi

lite version

Can create a lite version containing the population allele frequencies from the full vcf using using bcftools. Tested using bcftools version 1.9.

bcftools annotate -x ^INFO/AF,INFO/AF_afr,INFO/AF_amr,INFO/AF_asj,INFO/AF_eas,INFO/AF_fin,INFO/AF_nfe,INFO/AF_oth,INFO/AF_sas --threads $THREADS --output-type z -o $BASEDIR/known_variants/gnomad-lite.genomes.r3.0.sites.vcf.gz $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz

Common SNPs for somalier concordance

human

wget -O $BASEDIR/known_variants/somalier_GRCh38.vcf.gz https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz

Other Transcriptome files

Kallisto transcriptome index

human

From docker image `mgibio/rnaseq`

kallisto index -i $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.kallisto.idx $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.gz

mouse

From docker image `mgibio/rnaseq`

kallisto index -i $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.kallisto.idx $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.gz

Kallisto gene translation table

human

From docker image `quay.io/biocontainers/bioconductor-biomart:2.42.0--r36_0`

cd $BASEDIR/rna_seq_annotation/
Rscript -e 'library(biomaRt);mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host="http://jan2019.archive.ensembl.org/");t2g <- getBM(attributes=c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart=mart);t2g <- t2g[order(t2g$ensembl_gene_id), ]; write.table(t2g,"ensembl95.transcriptToGene.tsv",sep="\t",row.names=F,quote=F)'

mouse

From docker image `quay.io/biocontainers/bioconductor-biomart:2.42.0--r36_0`

cd $BASEDIR/rna_seq_annotation/
Rscript -e 'library(biomaRt);mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl", host="http://jan2019.archive.ensembl.org/");t2g <- getBM(attributes=c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart=mart);t2g <- t2g[order(t2g$ensembl_gene_id), ]; write.table(t2g,"ensembl95.transcriptToGene.tsv",sep="\t",row.names=F,quote=F)'

Refflat genes for QC

human

In docker container: `quay.io/biocontainers/ucsc-gtftogenepred:377--h35c10e6_2`

gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf /dev/stdout | awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.refFlat.txt

mouse

In docker container: `quay.io/biocontainers/ucsc-gtftogenepred:377--h35c10e6_2`

gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf /dev/stdout | awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' >$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.refFlat.txt

Ribosomal genes for QC

human

cat $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.ribo_intervals
grep 'gene_biotype "rRNA"' $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf | awk '$3 == "exon"' | cut -f1,4,5,7,9 | perl -lane '/transcript_id "([^"]+)"/ or die "no transcript_id on $.";print join "\t", (@F[0,1,2,3], $1)' | sort -k1V -k2n -k3n >>$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.ribo_intervals

mouse

cat $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.ribo_intervals
grep 'gene_biotype "rRNA"' $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf | awk '$3 == "exon"' | cut -f1,4,5,7,9 | perl -lane '/transcript_id "([^"]+)"/ or die "no transcript_id on $.";print join "\t", (@F[0,1,2,3], $1)' | sort -k1V -k2n -k3n >>$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.ribo_intervals

Other

Illumina adapters

human or mouse

mkdir $BASEDIR/miscellaneous
cd $BASEDIR/miscellaneous
wget https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/illumina_multiplex.fa

QC annotation files

human

cd $BASEDIR/aligner_indices/biscuit_0.3.8
wget http://zwdzwd.io/BISCUITqc/hg38_QC_assets.zip
unzip hg38_QC_assets.zip

mouse

cd $BASEDIR/aligner_indices/biscuit_0.3.8
wget http://zwdzwd.io/BISCUITqc/mm10_QC_assets.zip
unzip mm10_QC_assets.zip

Want to contribute to this Wiki?

Fork it and send a pull request.

Return to Wiki Home
Return to analysis-workflows

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gathering input files

Reference genome

fasta

fasta index

Sequence dictionary

Contig lengths

Ensembl/VEP annotation files

VEP cache

Genes/Transcripts GTF

Transcriptome cDNA reference

Aligner indices

bwa mem index

STAR-fusion index

HISAT Index (for older pipeline versions only)

biscuit index

Known Variants (these are all human-only)

known snps and indels

DoCM cancer variants

Gnomad population frequencies

Common SNPs for somalier concordance

Other Transcriptome files

Kallisto transcriptome index

Kallisto gene translation table

Refflat genes for QC

Ribosomal genes for QC

Other

Illumina adapters

QC annotation files

Want to contribute to this Wiki?

Clone this wiki locally