
Where is BwaMemAligner.java? #3186

Open
hliang opened this issue Jun 28, 2017 · 15 comments
@hliang

hliang commented Jun 28, 2017

Got a NullPointerException while trying to run BwaAndMarkDuplicatesPipelineSpark.

Running:
    /opt/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://ln16:7077 --conf spark.driver.userClassPathFirst=true --conf spark.io.compression.codec=lzf --conf spark.driver.maxResultSize=0 --conf spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.kryoserializer.buffer.max=512m --conf spark.yarn.executor.memoryOverhead=600 --conf spark.cores.max=20 --executor-cores 4 --executor-memory 10g --conf spark.driver.memory=10g /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar BwaAndMarkDuplicatesPipelineSpark -I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam -O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam -R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit --bwamemIndexImage hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa.img --disableSequenceDictionaryValidation --sparkMaster spark://ln16:7077
...
...
Caused by: java.lang.NullPointerException
        at org.broadinstitute.hellbender.utils.bwa.BwaMemAligner.<init>(BwaMemAligner.java:25)
        at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine$ReadAligner.apply(BwaSparkEngine.java:93)
        at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine.lambda$align$ed6f731$1(BwaSparkEngine.java:56)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I also notice that gatk/src/main/java/org/broadinstitute/hellbender/utils/bwa/BwaMemAligner.java doesn't exist, and there is no class file gatk/build/classes/main/org/broadinstitute/hellbender/utils/bwa/BwaMemAligner.class either. Is that what's causing this error?

@magicDGS
Contributor

This should be in the gatk-bwamem-jni dependency. Maybe the artifact is not correctly packaged... but it is listed in build.gradle...

@hliang
Author

hliang commented Jun 28, 2017

You are right. It's built and packaged in gatk/build/libs/gatk-spark.jar.
Any clue why the program failed?

@magicDGS
Contributor

I'm not familiar with this code, so I cannot help fix the problem. After exploring the code a bit, it looks like the image file is null. Because there is a single instance of the index file, it seems that at some point it was closed and thus the global instance is null. Maybe this will be helpful for debugging/fixing this problem...

@SHuang-Broad
Contributor

@hliang , I broke it into several lines:

    /opt/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://ln16:7077 
    --conf spark.driver.userClassPathFirst=true 
    --conf spark.io.compression.codec=lzf 
    --conf spark.driver.maxResultSize=0 
    --conf spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  
    --conf spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  
    --conf spark.kryoserializer.buffer.max=512m 
    --conf spark.yarn.executor.memoryOverhead=600 
    --conf spark.cores.max=20 --executor-cores 4 --executor-memory 10g 
    --conf spark.driver.memory=10g 
    /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar 
    BwaAndMarkDuplicatesPipelineSpark 
    -I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam 
    -O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam 
    -R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
    --bwamemIndexImage hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa.img 
    --disableSequenceDictionaryValidation 
    --sparkMaster spark://ln16:7077

And the line

--bwamemIndexImage hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa.img

is the problem. The image file has to live on a regular file system on all the worker nodes, NOT on HDFS.
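A minimal sketch of staging the index image locally on every worker (the hostnames and paths below are hypothetical; adapt them to your cluster):

```shell
# Stage the bwa-mem index image at the same local path on every worker node.
# WORKERS, IMG, and DEST are placeholders for your own cluster layout.
WORKERS="node01 node02 node03"
IMG=/data/ref/GRCh37.fa.img
DEST=/local/ref

for w in $WORKERS; do
  # Print each copy command; drop the 'echo' to actually run the copies.
  echo scp "$IMG" "$w:$DEST/"
done
```

After staging, point --bwamemIndexImage at the local path (e.g. /local/ref/GRCh37.fa.img) instead of an hdfs:// URL.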

@SHuang-Broad
Contributor

So the null pointer is caused by bwa mem complaining that it cannot load the index from HDFS.

@magicDGS
Contributor

Could it be possible to read the file from a java.nio.Path in gatk-bwamem-jni, @SHuang-Broad? It seems to be a constraint of the native code, but it would be nice to be able to have just one index image in HDFS accessible to all the nodes...

@hliang
Author

hliang commented Jun 30, 2017

Thank you @SHuang-Broad. The error was gone after I copied the bwa index image file to the Lustre file system, which can be accessed by all worker nodes.
The new problem is that the program started but didn't give any informative message/progress (see log below). It was stopped (Ctrl-C) after 16 hours. The sequence data is a regular human exome, which could be mapped in 1-2 hours in our traditional pipeline.

../gatk/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
-I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam 
-O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /TEST/hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=720 
--executor-cores 20 
--executor-memory 50g 
--conf spark.driver.memory=50g
Using GATK jar /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar
Running:
    /opt/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://ln16:7077 --conf spark.driver.userClassPathFirst=true --conf spark.io.compression.codec=lzf --conf spark.driver.maxResultSize=0 --conf spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.kryoserializer.buffer.max=512m --conf spark.yarn.executor.memoryOverhead=600 --conf spark.cores.max=720 --executor-cores 20 --executor-memory 50g --conf spark.driver.memory=50g /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar BwaAndMarkDuplicatesPipelineSpark -I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam -O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam -R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit --bwamemIndexImage /hadoop/myname/GRCh37.fa.img --disableSequenceDictionaryValidation --sparkMaster spark://ln16:7077
16:55:20.195 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar!/com/intel/gkl/native/libgkl_compression.so
[June 29, 2017 4:55:20 PM CST] BwaAndMarkDuplicatesPipelineSpark  --bwamemIndexImage /hadoop/myname/GRCh37.fa.img --output hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam --reference hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit --input hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam --disableSequenceDictionaryValidation true --sparkMaster spark://ln16:7077  --duplicates_scoring_strategy SUM_OF_BASE_QUALITIES --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
[June 29, 2017 4:55:20 PM CST] Executing as myname@ln14 on Linux 3.10.0-514.16.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Version: 4.alpha.2-1125-g27b5190-SNAPSHOT
16:55:20.229 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 1
16:55:20.229 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:55:20.229 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Deflater: IntelDeflater
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Inflater: IntelInflater
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Initializing engine
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Done initializing engine
log4j:ERROR A "org.apache.log4j.ConsoleAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.ConsoleAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "console".
log4j:ERROR A "org.apache.log4j.ConsoleAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.ConsoleAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "console".
log4j:ERROR A "org.apache.log4j.varia.NullAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.varia.NullAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "NullAppender".
log4j:ERROR A "org.apache.log4j.ConsoleAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.ConsoleAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "console".
log4j:ERROR A "org.apache.log4j.varia.NullAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.varia.NullAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "NullAppender".
log4j:ERROR A "org.apache.log4j.varia.NullAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.varia.NullAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "NullAppender".
^C
####################### Ctrl-C after 16 hours ##############

@mwalker174
Contributor

This likely has to do with your spark configuration. Check on the Spark job's progress through the web interface, which should be something like http://<driver_address>:4040 (see https://spark.apache.org/docs/latest/monitoring.html).

If your BAM is very small, you can also try increasing the number of partitions by reducing --bamPartitionSize.

@SHuang-Broad
Contributor

@magicDGS, I am afraid this is not easy. I didn't write the binding (@tedsharpe did), but I would assume the limitation comes from bwa mem itself, not the binding, as the binding is a thin wrapper that delegates the loading of the index files (or, in this case, the image that combines all 5 index files) to bwa.

The SV team here has a script (scripts/sv/default_init.sh) that distributes the image file to all worker nodes when the Spark cluster is created and initialized. Spark clusters other than Google's Dataproc would probably allow you to provide scripts as initialization actions as well. On the other hand, there seems to be a --files argument that you can append to your command line, which YARN will parse to distribute the provided local file to all nodes, though in this case it would be very inefficient considering the image file's size.
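For illustration, the --files route might look like the following on a YARN cluster (the jar name and HDFS paths here are placeholders; on YARN, --files localizes the listed local file into each container's working directory, so the tool can then refer to it by its bare name):

```shell
spark-submit \
  --master yarn \
  --files /local/ref/GRCh37.fa.img \
  gatk-package-spark.jar BwaAndMarkDuplicatesPipelineSpark \
  -I hdfs:///user/myname/input.unaligned.bam \
  -O hdfs:///user/myname/output.bam \
  -R hdfs:///user/myname/GRCh37.2bit \
  --bwamemIndexImage GRCh37.fa.img
```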

@SHuang-Broad
Contributor

@hliang, the suggestion by @mwalker174 might be your solution.
Note that those log4j errors are known and on our radar to be fixed (they won't prevent real work from being done, in my experience; they're just annoying).

@magicDGS
Contributor

magicDGS commented Jul 3, 2017

Thanks for the answer @SHuang-Broad. It would be nice if the bwa-mem C library had the option to accept streams instead of files for the index, which would allow passing both in-memory and file-based (in whatever file system abstraction) indexes. I will try to look at the code and see if I can submit a patch, but I need to refresh my C++ for that...

@hliang
Author

hliang commented Jul 4, 2017

Thank you @mwalker174. The input BAM file is about 7 GB. If no --bamPartitionSize is specified, the job gets stuck at the first collect step at ReadsSparkSource.java:220, until we kill it. So I tried --bamPartitionSize 4000000, and it went through, but the Spark web interface showed errors in the sortByKey steps.
And the program eventually failed:

18:24:57.885 INFO  BwaAndMarkDuplicatesPipelineSpark - Shutting down engine
[July 3, 2017 6:24:57 PM CST] org.broadinstitute.hellbender.tools.spark.pipelines.BwaAndMarkDuplicatesPipelineSpark done. Elapsed time: 269.29 minutes.
Runtime.totalMemory()=4172283904
org.apache.spark.SparkException: Job aborted due to stage failure: Task 607 in stage 3.0 failed 4 times, most recent failure: Lost task 607.13 in stage 3.0 (TID 14832, 12.9.68.0, executor 24): ExecutorLostFailure (executor 24 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 169939 ms
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
        at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:152)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
        at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:936)
        at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortReads(SparkUtils.java:153)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:228)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:153)
        at org.broadinstitute.hellbender.tools.spark.pipelines.BwaAndMarkDuplicatesPipelineSpark.runTool(BwaAndMarkDuplicatesPipelineSpark.java:62)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:230)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have to look more into BwaAndMarkDuplicatesPipelineSpark. The good news is that we at least have BwaSpark working now: with --bamPartitionSize 4000000 or 64000000, it finishes in less than 20 minutes without error. (It used to stall if no --bamPartitionSize was specified.)

@mwalker174
Contributor

@hliang I see that many of the tasks are failing and it looks like one of the executors crashed. To find the cause, you can check the error logs of these tasks through the web UI.

I suspect increasing executor memory will fix the problem. Heartbeat timeouts usually occur when an executor JVM runs out of memory or requests more memory than the node will allow.
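As a sketch (the values here are illustrative, not a recommendation), raising executor memory and relaxing the heartbeat and network timeouts would mean adding flags like these to the spark-submit line:

```shell
--executor-memory 50g \
--conf spark.yarn.executor.memoryOverhead=4096 \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.network.timeout=600s
```

Note that spark.network.timeout must stay larger than the heartbeat interval.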

@hliang
Author

hliang commented Jul 10, 2017

Thank you @mwalker174 for the suggestions. I ended up writing for loops to test which configurations work: driver memory 2-50g; executor memory 2-50g; executor cores 1-20; bamPartitionSize 1-64m. Some combinations failed in minutes, some failed in hours, and some finished without errors. Below are three that work for ~33X WGS data:

../gatk-4.beta.1/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
--bamPartitionSize 4000000 
-I hdfs://bigdata/user/myname/gatk4test/wgs.sub4.unaligned.bam 
-O hdfs://bigdata/user/myname/gatk4test/wgs.sub4.BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://bigdata/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=600 
--executor-cores 20 
--executor-memory 10g 
--conf spark.driver.memory=50g

../gatk-4.beta.1/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
--bamPartitionSize 4000000 
-I hdfs://bigdata/user/myname/gatk4test/wgs.sub4.unaligned.bam 
-O hdfs://bigdata/user/myname/gatk4test/wgs.sub4.BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://bigdata/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=600 
--executor-cores 5 
--executor-memory 50g 
--conf spark.driver.memory=50g

../gatk-4.beta.1/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
--bamPartitionSize 64000000 
-I hdfs://bigdata/user/myname/gatk4test/wgs.sub4.unaligned.bam 
-O hdfs://bigdata/user/myname/gatk4test/wgs.sub4.BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://bigdata/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=600 
--executor-cores 2 
--executor-memory 50g 
--conf spark.driver.memory=50g

Hope someone will find this helpful.

@hliang
Author

hliang commented Jul 18, 2017

Just found that the log4j errors can be fixed by editing gatk-launch to set spark.driver.userClassPathFirst to false (it was true by default), or by adding --conf spark.driver.userClassPathFirst=false to the gatk-launch command line.
Not sure if that has any unexpected side effects, though.
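For reference, the workaround as a command-line flag might look like this (the tool arguments are elided; the flag goes after the `--` separator so gatk-launch forwards it to spark-submit):

```shell
../gatk/gatk-launch BwaAndMarkDuplicatesPipelineSpark \
  ... \
  -- --sparkRunner SPARK --sparkMaster spark://ln16:7077 \
  --conf spark.driver.userClassPathFirst=false
```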
