Skip to content

Gathering input files

Chris Miller edited this page Jan 28, 2022 · 39 revisions

There are dozens of input files needed for these pipelines. These include things like reference genomes, aligner indices, gene annotations, and known variants. This guide aims to be a comprehensive walkthrough of where to find those files and/or how to create them. Code snippets are included for both mouse and human, using GRCh38/GRCm38 and Ensembl version 95. (We welcome contributions of additional species)

If you just want to grab a pre-built blob containing all of these files, visit the links below (work in progress)

To create them yourself, follow the links below, each labelled with key pipelines for which they're needed

  • Alignment - Alignment
  • Germline - Germline variant calling
  • Somatic - Somatic variant calling (exome/WGS)
  • RNAseq - RNAseq (bulk)
  • Bisulfite - Bisulfite

Before running any snippets, make a directory where you'd like this cache to be stored, and set the BASEDIR variable to point to it:

BASEDIR=/path/to/annotation_data_grch38_ens95

Reference genome

fasta AlignmentGermlineSomaticBisulfiteRNAseq

human

Note that we use a slightly modified version of the GRCh38 reference which N-masks a single contig. That contig contained a duplication of the U2AF1 gene locus, resulting in multi-mapping and loss of variant calls in that important cancer gene. The coordinate system is preserved and alignments should be otherwise unaffected.

mkdir -p $BASEDIR/reference_genome
wget -O $BASEDIR/reference_genome/all_sequences.fa.gz https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/reference_GRCh38/all_sequences.fa.gz
gunzip $BASEDIR/reference_genome/all_sequences.fa.gz
mouse

mkdir -p $BASEDIR/reference_genome
wget -O $BASEDIR/reference_genome/all_sequences.fa.gz https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/reference_GRCm38/all_sequences.fa.gz
gunzip $BASEDIR/reference_genome/all_sequences.fa.gz

fasta index AlignmentGermlineSomaticBisulfiteRNAseq

human or mouse

From docker image `mgibio/samtools-cwl:1.0.0`

/opt/samtools/bin/samtools faidx $BASEDIR/reference_genome/all_sequences.fa

Sequence dictionary AlignmentGermlineSomaticBisulfiteRNAseq

human or mouse

From docker image `mgibio/picard-cwl:2.18.1`

java -jar /opt/picard/picard.jar CreateSequenceDictionary R=$BASEDIR/reference_genome/all_sequences.fa O=$BASEDIR/reference_genome/all_sequences.dict

Contig lengths Bisulfite

human or mouse

 perl -ne 'print "$1\t$2\n" if $_ =~ /SN:([^\s]+).+LN:(\d+)/' $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/reference_genome/all_sequences.genome

Ensembl/VEP annotation files

VEP cache GermlineSomaticRNAseq

human

Select the most recent [docker image from your desired ensembl version](https://hub.docker.com/r/ensemblorg/ensembl-vep/tags). Our current release uses `ensemblorg/ensembl-vep:release_95.0` Inside that container, run the following (mount the appropriate volumes if needed):

mkdir $BASEDIR/vep_cache
/usr/bin/perl /opt/vep/src/ensembl-vep/INSTALL.pl --CACHEDIR $BASEDIR/vep_cache --AUTO cf --SPECIES homo_sapiens --ASSEMBLY GRCh38

If for some reason the desired cache version and docker image version in use do not match, add --CACHE_VERSION <version>

mouse

Select the most recent [docker image from your desired ensembl version](https://hub.docker.com/r/ensemblorg/ensembl-vep/tags). Our current release uses `ensemblorg/ensembl-vep:release_95.0` Inside that container, run the following (mount the appropriate volumes if needed):

mkdir $BASEDIR/vep_cache
/usr/bin/perl /opt/vep/src/ensembl-vep/INSTALL.pl --CACHEDIR $BASEDIR/vep_cache --AUTO cf --SPECIES mus_musculus --ASSEMBLY GRCm38

If for some reason the desired cache version and docker image version in use do not match, add --CACHE_VERSION <version>

Genes/Transcripts GTF RNAseq

human

mkdir -p $BASEDIR/rna_seq_annotation
cd $BASEDIR/rna_seq_annotation
wget -O $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz
gunzip Homo_sapiens.GRCh38.95.gtf.gz

The reference uses "chr" prefixes (chr1, chr2, chr3), while the Ensembl gtf does not (1, 2, 3), so we need to convert the gtf:

cd $BASEDIR/rna_seq_annotation/
wget https://raw.githubusercontent.com/chrisamiller/convertEnsemblGTF/master/convertEnsemblGTF.pl
mv $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig
perl convertEnsemblGTF.pl $BASEDIR/reference_genome/all_sequences.dict $BASEDIR/vep_cache/homo_sapiens/95_GRCh38/chr_synonyms.txt 
$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf && rm -f $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig
mouse

mkdir -p $BASEDIR/rna_seq_annotation
wget -O $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/mus_musculus/Mus_musculus.GRCm38.95.gtf.gz
gunzip $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf.gz

Transcriptome cDNA reference RNAseq

human

Note that we're not going to convert these coordinates from [1,2,3] to [chr1,chr2,chr3], since Kallisto doesn't include coordinates in it's output. If pseudobams from kallisto are desired, that conversion would need to be done.

wget -O $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.gz http://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
mouse

wget -O $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.gz ftp://ftp.ensembl.org/pub/release-95/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

Aligner indices

bwa mem index AlignmentGermlineSomatic

human or mouse

From docker image: `mgibio/alignment_helper-cwl:1.0.0`

mkdir -p $BASEDIR/aligner_indices/bwamem_0.7.15 
cd $BASEDIR/aligner_indices/bwamem_0.7.15
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do 
    ln -s $BASEDIR/reference_genome/$i .
done
/usr/local/bin/bwa index all_sequences.fa

STAR-fusion index RNAseq

human

In the `trinityctat/starfusion` docker container:

mkdir -p $BASEDIR/aligner_indices/star_fusion 
cd $BASEDIR/aligner_indices/star_fusion
/usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl --CPU 10 --genome_fa $BASEDIR/reference_genome/all_sequences.fa --gtf $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf --pfam_db current --dfam_db human --output_dir $BASEDIR/aligner_indices/star_fusion"
mouse

In the `trinityctat/starfusion` docker container:

mkdir -p $BASEDIR/aligner_indices/star_fusion 
cd $BASEDIR/aligner_indices/star_fusion
/usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl --CPU 10 --genome_fa $BASEDIR/reference_genome/all_sequences.fa --gtf $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf --pfam_db current --dfam_db mouse --output_dir $BASEDIR/aligner_indices/star_fusion"

HISAT Index (for older pipeline versions only) RNAseq

human or mouse

From docker image: mgibio/rnaseq

mkdir -p $BASEDIR/aligner_indices/hisat_2.1.0
cd $BASEDIR/aligner_indices/hisat_2.1.0
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do 
    ln -s $BASEDIR/reference_genome/$i .
done
echo "This is a placeholder file to pass to CWL to refer to all the index files and be a parameter to HISAT2." >GRCm38
/opt/hisat2/hisat2-2.1.0/hisat2-build -p 8 all_sequences.fa GRCm38"

biscuit index Bisulfite

human or mouse

From docker image: `mgibio/biscuit:0.3.8`

mkdir -p $BASEDIR/aligner_indices/biscuit_0.3.8
cd $BASEDIR/aligner_indices/biscuit_0.3.8
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do 
    ln -s $BASEDIR/reference_genome/$i .
done
/usr/bin/biscuit index all_sequences.fa

Known Variants (these are all human-only)

known snps and indels AlignmentGermlineSomatic

human

dbSNP/indels from the Mills/1000G/etc datasets for use with GATK/Mutect from the `google/cloud-sdk` image:

mkdir $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi $BASEDIR/known_variants

DoCM cancer variants Somatic

human

Variants from the [Database of Canonical Mutations](http://docm.info) used for hotspot checking in somatic pipelines

wget -O $BASEDIR/known_variants/docm.vcf https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/docm.GRCh38.vcf

Gnomad population frequencies GermlineSomatic

human

This command downloads the V3 of the genome [Gnomad Database](https://gnomad.broadinstitute.org) which has the allele frequencies from 70k WGS samples aligned to GRCh38. This file is ~236gb.

wget -O $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz
wget -O $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz.tbi https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz.tbi

lite version Can create a lite version containing the population allele frequencies from the full vcf using using bcftools. Tested using bcftools version 1.9.
bcftools annotate -x ^INFO/AF,INFO/AF_afr,INFO/AF_amr,INFO/AF_asj,INFO/AF_eas,INFO/AF_fin,INFO/AF_nfe,INFO/AF_oth,INFO/AF_sas --threads $THREADS --output-type z -o $BASEDIR/known_variants/gnomad-lite.genomes.r3.0.sites.vcf.gz $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz 

Common SNPs for somalier concordance Somatic

human

wget -O $BASEDIR/known_variants/somalier_GRCh38.vcf.gz https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz

Other Transcriptome files

Kallisto transcriptome index RNAseq

human

From docker image `mgibio/rnaseq`

kallisto index -i $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.kallisto.idx $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.gz
mouse

From docker image `mgibio/rnaseq`

kallisto index -i $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.kallisto.idx $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.gz

Kallisto gene translation table RNAseq

human

From docker image `quay.io/biocontainers/bioconductor-biomart:2.42.0--r36_0`

cd $BASEDIR/rna_seq_annotation/
Rscript -e 'library(biomaRt);mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host="http://jan2019.archive.ensembl.org/");t2g <- getBM(attributes=c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart=mart);t2g <- t2g[order(t2g$ensembl_gene_id), ]; write.table(t2g,"ensembl95.transcriptToGene.tsv",sep="\t",row.names=F,quote=F)'
mouse

From docker image `quay.io/biocontainers/bioconductor-biomart:2.42.0--r36_0`

cd $BASEDIR/rna_seq_annotation/
Rscript -e 'library(biomaRt);mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl", host="http://jan2019.archive.ensembl.org/");t2g <- getBM(attributes=c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart=mart);t2g <- t2g[order(t2g$ensembl_gene_id), ]; write.table(t2g,"ensembl95.transcriptToGene.tsv",sep="\t",row.names=F,quote=F)'

Refflat genes for QC RNAseq

human

In docker container: `quay.io/biocontainers/ucsc-gtftogenepred:377--h35c10e6_2`

gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf /dev/stdout | awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.refFlat.txt
mouse

In docker container: `quay.io/biocontainers/ucsc-gtftogenepred:377--h35c10e6_2`

gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf /dev/stdout | awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' >$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.refFlat.txt

Ribosomal genes for QC RNAseq

human

cat $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.ribo_intervals
grep 'gene_biotype "rRNA"' $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf | awk '$3 == "exon"' | cut -f1,4,5,7,9 | perl -lane '/transcript_id "([^"]+)"/ or die "no transcript_id on $.";print join "\t", (@F[0,1,2,3], $1)' | sort -k1V -k2n -k3n >>$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.ribo_intervals
mouse

cat $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.ribo_intervals
grep 'gene_biotype "rRNA"' $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf | awk '$3 == "exon"' | cut -f1,4,5,7,9 | perl -lane '/transcript_id "([^"]+)"/ or die "no transcript_id on $.";print join "\t", (@F[0,1,2,3], $1)' | sort -k1V -k2n -k3n >>$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.ribo_intervals

Other

Illumina adapters Bisulfite

human or mouse

mkdir $BASEDIR/miscellaneous
cd $BASEDIR/miscellaneous
wget https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/illumina_multiplex.fa

QC annotation files Bisulfite

human

cd $BASEDIR/aligner_indices/biscuit_0.3.8
wget http://zwdzwd.io/BISCUITqc/hg38_QC_assets.zip
unzip hg38_QC_assets.zip
mouse

cd $BASEDIR/aligner_indices/biscuit_0.3.8
wget http://zwdzwd.io/BISCUITqc/mm10_QC_assets.zip
unzip mm10_QC_assets.zip
Clone this wiki locally