
DNA-seq variant calling with Halvade on a local Hadoop cluster

Step 1: Downloading the required data

Dataset

The dataset used in this example is NA12878, which consists of two gzipped FASTQ files with paired-end reads, available for download here:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz

Halvade

Every Halvade release is available on GitHub. Download and extract the archive; for version v1.2.0:

wget https://github.com/ddcap/halvade/releases/download/v1.2.0/Halvade_v1.2.0.tar.gz
tar xvf Halvade_v1.2.0.tar.gz

Reference

The reference consists of a FASTA file containing the human genome, an index and a dictionary file, available for download here:

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.fai.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.dict.gz
gunzip ucsc.hg19*.gz

For the BQSR step, GATK needs a dbSNP file, which is available for download here:

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.idx.gz
gunzip dbsnp_138.hg19*.gz

While the dbSNP database is downloading, the BWA index can be created. To do this, untar the bin.tar.gz file (included in the Halvade release) to get the BWA binary and use it to index the (unzipped) FASTA file you just downloaded:

tar -xvf bin.tar.gz
./bin/bwa index ucsc.hg19.fasta

This creates five index files, which will be uploaded to HDFS in the last step so Halvade can use them.
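As a quick check, bwa index should have produced five new files alongside the FASTA (the .fai and .dict files were downloaded earlier):

ls ucsc.hg19.fasta.amb ucsc.hg19.fasta.ann ucsc.hg19.fasta.bwt ucsc.hg19.fasta.pac ucsc.hg19.fasta.sa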

GATK

Halvade uses GATK, so its binary needs to be present in the bin.tar.gz file. The bin.tar.gz file was already extracted above, which produced a bin/ directory. Add the GenomeAnalysisTK.jar file to this directory and repack it into a new bin.tar.gz:

cp GenomeAnalysisTK.jar bin/
rm bin.tar.gz
tar -cvzf bin.tar.gz bin/*
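To confirm that the jar was packed, the archive contents can be listed:

tar -tzf bin.tar.gz | grep GenomeAnalysisTK.jar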

Step 2: Configuring Hadoop for Halvade

To use Halvade, Hadoop needs to be installed; this tutorial uses the Cloudera distribution of Hadoop, CDH 5. A detailed description of how to install CDH 5 on your cluster can be found here. Make sure each node of your cluster has Java 1.7 installed and uses it as the default Java instance; on Ubuntu, use these commands:

sudo apt-get install openjdk-7-jre
sudo update-alternatives --config java
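You can verify that the correct version is now the default:

java -version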

After CDH 5 is installed, it needs to be configured so that Halvade can use all available resources. In Halvade each task processes a portion of the input data, but the execution time can vary to a certain degree, so the task timeout needs to be set high enough. In mapred-site.xml, set this property to 30 minutes (1,800,000 ms):

<property>
  <name>mapreduce.task.timeout</name>
  <value>1800000</value>
</property>

Next, CDH 5 needs to know how many cores and how much memory are available on the nodes; this is set in yarn-site.xml and determines how many tasks will be started on the cluster. This example uses nodes with 128 GB of memory (131072 MB) and a dual-socket CPU setup with 24 cores in total. Because some tools benefit from the hyperthreading capabilities of the CPU, the number of vcores is set to 48:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>131072</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>48</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>131072</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>48</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>

After this the configuration needs to be pushed to all nodes:

scp *-site.xml myuser@myCDHnode-<n>.mycompany.com:/etc/hadoop/conf.my_cluster/
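If your node hostnames follow the pattern above, the copy can be scripted; a minimal sketch assuming 16 nodes named myCDHnode-1 through myCDHnode-16:

# adjust the hostname pattern and node count to your cluster
for n in $(seq 1 16); do
  scp *-site.xml myuser@myCDHnode-${n}.mycompany.com:/etc/hadoop/conf.my_cluster/
done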

And the MapReduce service needs to be restarted:

On the Resource Manager:

sudo service hadoop-yarn-resourcemanager restart

On each NodeManager:

sudo service hadoop-yarn-nodemanager restart

On the JobHistory server:

sudo service hadoop-mapreduce-historyserver restart
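On a larger cluster the NodeManager restarts can also be scripted over ssh, using the same hypothetical hostnames as above:

for n in $(seq 1 16); do
  ssh myuser@myCDHnode-${n}.mycompany.com 'sudo service hadoop-yarn-nodemanager restart'
done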

Intel’s Hadoop Adapter for Lustre

When using Lustre as the filesystem instead of HDFS, Intel's Hadoop Adapter for Lustre will increase the performance of Halvade. To enable it, you need to change some settings in your Hadoop installation. In core-site.xml you need to point to the location of Lustre and set the Lustre FileSystem class; if Lustre is mounted on /mnt/lustre/, add these properties to the file:

<property>
    <name>fs.defaultFS</name>
    <value>lustre:///</value>
</property>
<property>
    <name>fs.lustre.impl</name>
    <value>org.apache.hadoop.fs.LustreFileSystem</value>
</property>
<property>
    <name>fs.AbstractFileSystem.lustre.impl</name>
    <value>org.apache.hadoop.fs.LustreFileSystem$LustreFs</value>
</property>
<property>
    <name>fs.root.dir</name>
    <value>/mnt/lustre/hadoop</value>
</property>

Additionally, you need to set the Shuffle class in mapred-site.xml:

<property>
    <name>mapreduce.job.map.output.collector.class</name>
    <value>org.apache.hadoop.mapred.SharedFsPlugins$MapOutputBuffer</value>
</property>
<property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>org.apache.hadoop.mapred.SharedFsPlugins$Shuffle</value>
</property>

After adding these settings to the configuration, the files need to be pushed to all nodes again and all services restarted, as described above. Additionally, the jar containing Intel's Adapter for Lustre should be available on all nodes and added to the Hadoop classpath. To do this, find the directories that are currently in your Hadoop classpath and add the jar to one of them on every node; to list the directories, run:

hadoop classpath
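For example, if /usr/lib/hadoop/lib appears in that output, the adapter jar could be copied there on every node; both the directory and the jar filename below are assumptions, so adjust them to your installation:

# hypothetical jar name and classpath directory
sudo cp hadoop-lustre-plugin.jar /usr/lib/hadoop/lib/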

Step 3: Running Halvade

Prepare the data

First, make the directories on HDFS where the reference, dbSNP and input files will be stored (change the username accordingly):

hdfs dfs -mkdir -p /user/<username>/halvade/ref/dbsnp/
hdfs dfs -mkdir /user/<username>/halvade/in/

Next, the reference, dbSNP and bin.tar.gz files need to be copied to HDFS:

hdfs dfs -put ucsc.hg19.* /user/<username>/halvade/ref/
hdfs dfs -put dbsnp_138.hg19.vcf* /user/<username>/halvade/ref/dbsnp/
hdfs dfs -put bin.tar.gz /user/<username>/halvade/
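You can verify the uploads by listing the directories:

hdfs dfs -ls /user/<username>/halvade/ref/
hdfs dfs -ls /user/<username>/halvade/ref/dbsnp/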

The input data needs to be preprocessed and stored on HDFS; this is done with HalvadeUploaderWithLibs.jar. For better performance it is advised to increase the Java heap memory for the hadoop command. Run these commands:

export HADOOP_HEAPSIZE=32768
hadoop jar HalvadeUploaderWithLibs.jar -1 ERR194147_1.fastq.gz -2 ERR194147_2.fastq.gz -O /user/<username>/halvade/in/ -t 8
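When the uploader finishes, the preprocessed input chunks should be visible on HDFS:

hdfs dfs -ls /user/<username>/halvade/in/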

If a different distributed filesystem such as Lustre or GPFS is used instead of HDFS, and you want Halvade to read the files directly from there, you can add some empty marker files so that Halvade can locate the reference files.

For the BWA and GATK reference you need to make these two files, named after the FASTA filename, e.g. ucsc.hg19.fasta:

touch ucsc.hg19.bwa_ref
touch ucsc.hg19.gatk_ref

For the dbSNP database, a similar file can be made in the directory containing the database, e.g. /lustre/<username>/halvade/ref/dbsnp/:

touch /lustre/<username>/halvade/ref/dbsnp/.dbsnp

The same applies if you distributed the reference to each node before executing the job: make sure these files are present on every node so Halvade can find the local files.

Configure Halvade

Now some options need to be set to run Halvade properly. The basic options are provided in example.config; copy this file and adjust it accordingly. Assuming a 16-node cluster where each node has 128 GB of memory and 24 cores with simultaneous multithreading (SMT):

N=16
M=128
C=24
B="/user/<username>/halvade/bin.tar.gz"
D="/user/<username>/halvade/ref/dbsnp/dbsnp_138.hg19.vcf"
R="/user/<username>/halvade/ref/ucsc.hg19.fasta"
I="/user/<username>/halvade/in/"
O="/user/<username>/halvade/out/"
smt

If your reference is distributed on each node or accessible by each node in a different distributed file system like Lustre or GPFS, then add this line to the config file:

refdir="/lustre/<username>/halvade/ref/"

Here /lustre/<username>/halvade/ref/ contains the BWA reference, the FASTA reference and the dbSNP file.

Run Halvade

Now everything has been set up, and Halvade can be started with this command:

python runHalvade.py example.config

When Halvade is finished, a VCF file called /user/<username>/halvade/out/merge/HalvadeCombined.vcf will be created, containing all called variants.
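To copy the merged VCF from HDFS to the local filesystem for inspection:

hdfs dfs -get /user/<username>/halvade/out/merge/HalvadeCombined.vcf .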