
#Getting started

##Recipes

For the DNA-seq pipeline recipe go here.

For the RNA-seq pipeline recipe go here.

##Preparing reference for Halvade

To run Halvade, a reference is needed for both GATK and BWA, and a SNP database is required for the base quality score recalibration step. Halvade uses the hg19 reference; the files can be found in the GATK resource bundle. GATK requires the fasta file containing all separate chromosomes together with the corresponding fasta index and dictionary file. The files needed are:

  • ucsc.hg19.dict.gz
  • ucsc.hg19.fasta.fai.gz
  • ucsc.hg19.fasta.gz

All files downloaded from the GATK resource bundle have a corresponding md5 hash to check for file errors. These files are compressed with gzip and should be decompressed with gunzip <filename>.
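
As an illustration, the verification and decompression could be scripted as follows; this is only a sketch, and it assumes the .md5 files from the resource bundle are available next to the downloads for comparison:

```bash
# Sketch: verify the downloads against the bundle's md5 hashes and decompress them.
for f in ucsc.hg19.dict.gz ucsc.hg19.fasta.fai.gz ucsc.hg19.fasta.gz; do
    md5sum "$f"        # compare this hash with the one listed in the bundle's .md5 file
    gunzip "$f"        # yields ucsc.hg19.dict, ucsc.hg19.fasta.fai and ucsc.hg19.fasta
done
```

After the fasta file is decompressed, a reference for BWA can be built using the following command: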

bwa index ucsc.hg19.fasta

This will create 5 files which are all needed for BWA:

  • ucsc.hg19.fasta.amb
  • ucsc.hg19.fasta.ann
  • ucsc.hg19.fasta.bwt
  • ucsc.hg19.fasta.pac
  • ucsc.hg19.fasta.sa

For base quality score recalibration a SNP database is required, which is included in the GATK resource bundle:

  • dbsnp_138.hg19.vcf.gz
  • dbsnp_138.hg19.vcf.idx.gz

These files can remain compressed, as Halvade will automatically decompress them when needed. One additional file required by Halvade is an archive containing all binaries used by Halvade; this archive is provided with each release of Halvade:

  • bin.tar.gz

These files all need to be on either HDFS or S3 so Halvade can access them when needed. Halvade requires that all BWA reference files start with the same prefix as the fasta reference used by GATK, which is why the files need to be in the same folder. How to put this data on HDFS or S3 is described in the sections on HDFS and on Amazon S3 below.

###on HDFS

To put files on HDFS, the Hadoop (version 2.0 or newer) hdfs dfs command can be used as follows:

hdfs dfs -put /path/to/local/filename /path/on/hdfs/
hdfs dfs -put /path/to/local/filename /path/on/hdfs/custom_filename

If you want to make a new folder that will contain the data, this command can be used:

hdfs dfs -mkdir /path/to/new/folder/
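
As a concrete illustration, using a hypothetical HDFS folder /user/halvade/ref/, the reference, dbSNP and binary files could be staged as follows:

```bash
# Hypothetical example: put all Halvade reference files in a single HDFS folder,
# so the BWA index files share the prefix of the GATK fasta reference.
hdfs dfs -mkdir -p /user/halvade/ref/
hdfs dfs -put ucsc.hg19.fasta ucsc.hg19.fasta.fai ucsc.hg19.dict /user/halvade/ref/
hdfs dfs -put ucsc.hg19.fasta.amb ucsc.hg19.fasta.ann ucsc.hg19.fasta.bwt /user/halvade/ref/
hdfs dfs -put ucsc.hg19.fasta.pac ucsc.hg19.fasta.sa /user/halvade/ref/
hdfs dfs -put dbsnp_138.hg19.vcf.gz dbsnp_138.hg19.vcf.idx.gz bin.tar.gz /user/halvade/ref/
```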

###on Amazon S3

To put files on Amazon S3, a bucket has to be created first; instructions can be found on Amazon. Once a bucket has been made, files can be uploaded using the Amazon console (instructions from Amazon). An alternative way to upload files to S3 is s3cmd. It can be downloaded here and instructions on how to use s3cmd can be found here.
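
For example, with s3cmd the upload could look like this (hypothetical bucket and folder names):

```bash
# Hypothetical example: create a bucket and upload the reference files with s3cmd.
s3cmd mb s3://my-halvade-bucket
s3cmd put ucsc.hg19.fasta ucsc.hg19.fasta.fai ucsc.hg19.dict s3://my-halvade-bucket/ref/
s3cmd put ucsc.hg19.fasta.amb ucsc.hg19.fasta.ann ucsc.hg19.fasta.bwt s3://my-halvade-bucket/ref/
s3cmd put ucsc.hg19.fasta.pac ucsc.hg19.fasta.sa s3://my-halvade-bucket/ref/
s3cmd put dbsnp_138.hg19.vcf.gz dbsnp_138.hg19.vcf.idx.gz bin.tar.gz s3://my-halvade-bucket/ref/
```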

##Preprocessing

To preprocess the input data (paired-end FASTQ files), a tool is provided with Halvade. This tool, HalvadeUploader.jar, will preprocess the input data and upload it onto HDFS or S3, depending on the output directory. For more information on how to run the preprocessing tool, [go here](Halvade-Preprocessing).

##Main script

To run the program, the script runHalvade.py is provided; it reads its configuration from two files: halvade.conf and halvade_run.conf.

To set an option, remove the # before the line and add an argument (between "..." if the option is a string) if necessary. After all options are set, run runHalvade.py and wait until completion.
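
For instance, enabling an option means turning a commented template line into an uncommented one with a value. This is illustrative only, assuming a simple key="value" layout; the actual lines are defined by the configuration files shipped with Halvade:

```bash
# disabled option, as shipped in the template:
#nodes="..."
# enabled option, with a value filled in:
nodes="4"
```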

###Local cluster

To configure a cluster, the options in halvade.conf need to be set; for a local cluster these are:

  • nodes: sets the number of nodes in the cluster
  • vcores: sets the number of threads that can run on each node
  • B: sets the absolute path to the bin.tar.gz file on HDFS
  • D: sets the absolute path to the SNP database file on HDFS
  • R: sets the absolute path of the fasta file (without extension) of the reference on HDFS, all other reference files should be in the same folder with the same prefix

Make sure that all options for Amazon EMR are disabled by commenting out the corresponding lines (add # before the line).
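
Putting the local-cluster options together, a halvade.conf might look like the following sketch. The paths and values are hypothetical and the key="value" layout is an assumption, so check the file shipped with the release:

```bash
# Sketch of halvade.conf for a local cluster (hypothetical values and HDFS paths).
nodes="4"
vcores="24"
B="/user/halvade/ref/bin.tar.gz"
D="/user/halvade/ref/dbsnp_138.hg19.vcf.gz"
R="/user/halvade/ref/ucsc.hg19"
# Amazon EMR options remain commented out on a local cluster:
#emr_jar="..."
#emr_type="..."
```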

Once this is set for your cluster, you only need to change halvade_run.conf for the jobs you want to run. Two mandatory options are the input I, which gives the path to the input directory, and the output O, which gives the path to the output directory. With this, all options are set and Halvade can be run.
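
A matching halvade_run.conf could then be as small as this sketch (hypothetical HDFS paths, same assumed layout as above):

```bash
# Sketch of halvade_run.conf: only the mandatory input and output directories.
I="/user/halvade/in/"
O="/user/halvade/out/"
```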

###Amazon EMR

To run on Amazon EMR, the Amazon EMR command line interface (instructions from Amazon) needs to be installed. To configure a cluster, the options in halvade.conf need to be set; for Amazon EMR these are:

  • nodes: sets the number of nodes in the cluster
  • vcores: sets the number of threads that can run on each node
  • B: sets the absolute path to the bin.tar.gz file on S3
  • D: sets the absolute path to the SNP database file on S3
  • R: sets the absolute path of the fasta file of the reference on S3, all other reference files should be in the same folder with the same prefix
  • emr_jar: sets the absolute path of _HalvadeWithLibs.jar on S3
  • emr_script: sets the absolute path of halvade_bootstrap.sh on S3
  • emr_type: sets the Amazon EMR instance type (e.g. "c3.8xlarge")
  • emr_ami_v: sets the AMI number for Amazon EMR, should be set to "3.1.0" or newer
  • tmp: this should be set to "/mnt/halvade/"; this directory is created by the bootstrap script. If you change the tmp directory, make sure it exists on the nodes

For locations on S3, a URI of the following form should be used: s3://bucketname/directory/to/file
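
Combining the options above, a halvade.conf for Amazon EMR might look like this sketch. The bucket name, values and key="value" layout are assumptions; check the file shipped with the release:

```bash
# Sketch of halvade.conf for Amazon EMR (hypothetical values and S3 URIs).
nodes="8"
vcores="32"
B="s3://my-halvade-bucket/ref/bin.tar.gz"
D="s3://my-halvade-bucket/ref/dbsnp_138.hg19.vcf.gz"
R="s3://my-halvade-bucket/ref/ucsc.hg19.fasta"
emr_jar="s3://my-halvade-bucket/HalvadeWithLibs.jar"
emr_script="s3://my-halvade-bucket/halvade_bootstrap.sh"
emr_type="c3.8xlarge"
emr_ami_v="3.1.0"
tmp="/mnt/halvade/"
```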

Once this is set for your cluster, you only need to change halvade_run.conf for the jobs you want to run. Two mandatory options are the input I, which gives the path to the input directory, and the output O, which gives the path to the output directory. With this, all options are set and Halvade can be started by executing ./runHalvade.py.