
Where is BwaMemAligner.java? #3186

Open
hliang opened this issue Jun 28, 2017 · 15 comments
@hliang

hliang commented Jun 28, 2017

Got a NullPointerException while trying to run BwaAndMarkDuplicatesPipelineSpark.

Running:
    /opt/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://ln16:7077 --conf spark.driver.userClassPathFirst=true --conf spark.io.compression.codec=lzf --conf spark.driver.maxResultSize=0 --conf spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.kryoserializer.buffer.max=512m --conf spark.yarn.executor.memoryOverhead=600 --conf spark.cores.max=20 --executor-cores 4 --executor-memory 10g --conf spark.driver.memory=10g /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar BwaAndMarkDuplicatesPipelineSpark -I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam -O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam -R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit --bwamemIndexImage hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa.img --disableSequenceDictionaryValidation --sparkMaster spark://ln16:7077
...
...
Caused by: java.lang.NullPointerException
        at org.broadinstitute.hellbender.utils.bwa.BwaMemAligner.<init>(BwaMemAligner.java:25)
        at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine$ReadAligner.apply(BwaSparkEngine.java:93)
        at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine.lambda$align$ed6f731$1(BwaSparkEngine.java:56)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I also notice that gatk/src/main/java/org/broadinstitute/hellbender/utils/bwa/BwaMemAligner.java doesn't exist, and there is no class file gatk/build/classes/main/org/broadinstitute/hellbender/utils/bwa/BwaMemAligner.class either. Is that what's causing this error?

@magicDGS
Contributor

This should be in the gatk-bwamem-jni dependency. Maybe the artifact is not correctly packaged... but it is listed in build.gradle...

@hliang
Author

hliang commented Jun 28, 2017

You are right. It's built and packaged in gatk/build/libs/gatk-spark.jar.
Any clue why the program failed?

@magicDGS
Contributor

I'm not familiar with this code, so I cannot help fix the problem. After exploring the code a bit, it looks like the image file is null. Because there is a single instance of the index file, it seems that at some point it was closed and thus the global instance is null. Maybe this will be helpful for debugging/fixing this problem...

@SHuang-Broad
Contributor

@hliang , I broke it into several lines:

    /opt/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://ln16:7077 
    --conf spark.driver.userClassPathFirst=true 
    --conf spark.io.compression.codec=lzf 
    --conf spark.driver.maxResultSize=0 
    --conf spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  
    --conf spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  
    --conf spark.kryoserializer.buffer.max=512m 
    --conf spark.yarn.executor.memoryOverhead=600 
    --conf spark.cores.max=20 --executor-cores 4 --executor-memory 10g 
    --conf spark.driver.memory=10g 
    /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar 
    BwaAndMarkDuplicatesPipelineSpark 
    -I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam 
    -O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam 
    -R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
    --bwamemIndexImage hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa.img 
    --disableSequenceDictionaryValidation 
    --sparkMaster spark://ln16:7077

And the line

--bwamemIndexImage hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa.img

is the problem. The image file has to live on a regular file system on all the worker nodes, NOT on HDFS.
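A minimal sketch of staging the index image locally on every worker (the hostnames and paths below are hypothetical; adapt them to your cluster):

```shell
# Stage the bwa-mem index image at the same local path on every worker node.
# WORKERS, IMG, and DEST are placeholders for your own cluster layout.
WORKERS="node01 node02 node03"
IMG=/data/ref/GRCh37.fa.img
DEST=/local/ref

for w in $WORKERS; do
  # Print each copy command; drop the 'echo' to actually run the copies.
  echo scp "$IMG" "$w:$DEST/"
done
```

After staging, point --bwamemIndexImage at the local path (e.g. /local/ref/GRCh37.fa.img) instead of an hdfs:// URL.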

@SHuang-Broad
Contributor

So the null pointer is caused by bwa mem complaining that it cannot load the index from HDFS.

@magicDGS
Contributor

Could it be possible to read the file from a java.nio.Path in gatk-bwamem-jni, @SHuang-Broad? It seems to be a constraint of the native code, but it would be nice to be able to have just one index image in HDFS accessible to all the nodes...

@hliang
Author

hliang commented Jun 30, 2017

Thank you @SHuang-Broad. The error was gone after I copied the bwa index image file to the Lustre file system, which can be accessed by all worker nodes.
The new problem is that the program started but didn't give any informative message/progress (see log below). It was stopped (Ctrl-C) after 16 hours. The sequence data is a regular human exome, which could be mapped in 1-2 hours in our traditional pipeline.

../gatk/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
-I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam 
-O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /TEST/hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=720 
--executor-cores 20 
--executor-memory 50g 
--conf spark.driver.memory=50g
Using GATK jar /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar
Running:
    /opt/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://ln16:7077 --conf spark.driver.userClassPathFirst=true --conf spark.io.compression.codec=lzf --conf spark.driver.maxResultSize=0 --conf spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true  --conf spark.kryoserializer.buffer.max=512m --conf spark.yarn.executor.memoryOverhead=600 --conf spark.cores.max=720 --executor-cores 20 --executor-memory 50g --conf spark.driver.memory=50g /home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar BwaAndMarkDuplicatesPipelineSpark -I hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam -O hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam -R hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit --bwamemIndexImage /hadoop/myname/GRCh37.fa.img --disableSequenceDictionaryValidation --sparkMaster spark://ln16:7077
16:55:20.195 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/myname/gatk4/gatk/build/libs/gatk-package-4.alpha.2-1125-g27b5190-SNAPSHOT-spark.jar!/com/intel/gkl/native/libgkl_compression.so
[June 29, 2017 4:55:20 PM CST] BwaAndMarkDuplicatesPipelineSpark  --bwamemIndexImage /hadoop/myname/GRCh37.fa.img --output hdfs://ln16/user/myname/gatk4test/BwaAndMarkDuplicatesPipelineSpark_out.bam --reference hdfs://ln16/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit --input hdfs://ln16/user/myname/NA12878/wes/NA12878-NGv3-LAB1360-A.unaligned.bam --disableSequenceDictionaryValidation true --sparkMaster spark://ln16:7077  --duplicates_scoring_strategy SUM_OF_BASE_QUALITIES --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
[June 29, 2017 4:55:20 PM CST] Executing as myname@ln14 on Linux 3.10.0-514.16.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_112-b15; Version: 4.alpha.2-1125-g27b5190-SNAPSHOT
16:55:20.229 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.COMPRESSION_LEVEL : 1
16:55:20.229 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:55:20.229 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Deflater: IntelDeflater
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Inflater: IntelInflater
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Initializing engine
16:55:20.230 INFO  BwaAndMarkDuplicatesPipelineSpark - Done initializing engine
log4j:ERROR A "org.apache.log4j.ConsoleAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.ConsoleAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "console".
log4j:ERROR A "org.apache.log4j.ConsoleAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.ConsoleAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "console".
log4j:ERROR A "org.apache.log4j.varia.NullAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.varia.NullAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "NullAppender".
log4j:ERROR A "org.apache.log4j.ConsoleAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.ConsoleAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "console".
log4j:ERROR A "org.apache.log4j.varia.NullAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.varia.NullAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "NullAppender".
log4j:ERROR A "org.apache.log4j.varia.NullAppender" object is not assignable to a "org.apache.log4j.Appender" variable.
log4j:ERROR The class "org.apache.log4j.Appender" was loaded by
log4j:ERROR [sun.misc.Launcher$AppClassLoader@53d8d10a] whereas object of type
log4j:ERROR "org.apache.log4j.varia.NullAppender" was loaded by [org.apache.spark.util.ChildFirstURLClassLoader@18a70f16].
log4j:ERROR Could not instantiate appender named "NullAppender".
^C
####################### Ctrl-C after 16 hours ##############

@mwalker174
Contributor

This likely has to do with your spark configuration. Check on the Spark job's progress through the web interface, which should be something like http://<driver_address>:4040 (see https://spark.apache.org/docs/latest/monitoring.html).

If your BAM is very small, you can also try increasing the number of partitions by reducing --bamPartitionSize.

@SHuang-Broad
Contributor

@magicDGS, I am afraid this is not easy. I didn't write the binding (@tedsharpe did), but I would assume the limitation comes from bwa mem itself, not the binding, as the binding is a thin wrapper that delegates the loading of the index files (or, in this case, the image that combines all 5 index files) to bwa.

The SV team here has a script (scripts/sv/default_init.sh) that distributes the image file to all worker nodes when the Spark cluster is created and initialized. Spark clusters other than Google's Dataproc would probably allow you to provide scripts as initialization actions as well. On the other hand, there seems to be a --files argument that you can append to your command line, which YARN will parse to distribute the provided local file to all nodes, though in this case it would be very inefficient considering the image file's size.
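For illustration, the --files route might look like the following on a YARN cluster (the jar name and HDFS paths here are placeholders; on YARN, --files localizes the listed local file into each container's working directory, so the tool can then refer to it by its bare name):

```shell
spark-submit \
  --master yarn \
  --files /local/ref/GRCh37.fa.img \
  gatk-package-spark.jar BwaAndMarkDuplicatesPipelineSpark \
  -I hdfs:///user/myname/input.unaligned.bam \
  -O hdfs:///user/myname/output.bam \
  -R hdfs:///user/myname/GRCh37.2bit \
  --bwamemIndexImage GRCh37.fa.img
```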

@SHuang-Broad
Contributor

@hliang, the suggestion by @mwalker174 might be your solution.
Note that those log4j errors are known and on our radar to be fixed (they won't prevent real work from being done, in my experience; they're just annoying).

@magicDGS
Contributor

magicDGS commented Jul 3, 2017

Thanks for the answer @SHuang-Broad. It would be nice if the bwa-mem C library had the option to accept streams instead of files for the index, which would allow passing both in-memory and file-based (in whatever file system abstraction) indexes. I will try to look at the code and see if I can submit a patch, but I need to refresh my C++ for that...

@hliang
Author

hliang commented Jul 4, 2017

Thank you @mwalker174. The input BAM file is about 7 GB. If no --bamPartitionSize is specified, the job gets stuck at the first collect step at ReadsSparkSource.java:220, until we kill it. So I tried --bamPartitionSize 4000000, and it went through, but the Spark web interface showed errors in the sortByKey steps.
And the program eventually failed:

18:24:57.885 INFO  BwaAndMarkDuplicatesPipelineSpark - Shutting down engine
[July 3, 2017 6:24:57 PM CST] org.broadinstitute.hellbender.tools.spark.pipelines.BwaAndMarkDuplicatesPipelineSpark done. Elapsed time: 269.29 minutes.
Runtime.totalMemory()=4172283904
org.apache.spark.SparkException: Job aborted due to stage failure: Task 607 in stage 3.0 failed 4 times, most recent failure: Lost task 607.13 in stage 3.0 (TID 14832, 12.9.68.0, executor 24): ExecutorLostFailure (executor 24 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 169939 ms
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
        at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:152)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
        at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:936)
        at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortReads(SparkUtils.java:153)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:228)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:153)
        at org.broadinstitute.hellbender.tools.spark.pipelines.BwaAndMarkDuplicatesPipelineSpark.runTool(BwaAndMarkDuplicatesPipelineSpark.java:62)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:230)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have to look more into BwaAndMarkDuplicatesPipelineSpark. The good news is that we at least have BwaSpark working now: with --bamPartitionSize 4000000 or 64000000, it finishes in less than 20 minutes without error. (It used to stall if no --bamPartitionSize was specified.)

@mwalker174
Contributor

@hliang I see that many of the tasks are failing and it looks like one of the executors crashed. To find the cause, you can check the error logs of these tasks through the web UI.

I suspect increasing executor memory will fix the problem. Heartbeat timeouts usually occur when an executor JVM runs out of memory or requests more memory than the node will allow.
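As a sketch (the values here are illustrative, not a recommendation), raising executor memory and relaxing the heartbeat and network timeouts would mean adding flags like these to the spark-submit line:

```shell
--executor-memory 50g \
--conf spark.yarn.executor.memoryOverhead=4096 \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.network.timeout=600s
```

Note that spark.network.timeout must stay larger than the heartbeat interval.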

@hliang
Author

hliang commented Jul 10, 2017

Thank you @mwalker174 for the suggestions. I ended up writing for loops to test which configurations work: driver memory 2-50g; executor memory 2-50g; executor cores 1-20; bamPartitionSize 1-64m. Some combinations failed in minutes, some failed in hours, and some finished without errors. Below are three that work for ~33X WGS data:

../gatk-4.beta.1/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
--bamPartitionSize 4000000 
-I hdfs://bigdata/user/myname/gatk4test/wgs.sub4.unaligned.bam 
-O hdfs://bigdata/user/myname/gatk4test/wgs.sub4.BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://bigdata/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=600 
--executor-cores 20 
--executor-memory 10g 
--conf spark.driver.memory=50g

../gatk-4.beta.1/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
--bamPartitionSize 4000000 
-I hdfs://bigdata/user/myname/gatk4test/wgs.sub4.unaligned.bam 
-O hdfs://bigdata/user/myname/gatk4test/wgs.sub4.BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://bigdata/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=600 
--executor-cores 5 
--executor-memory 50g 
--conf spark.driver.memory=50g

../gatk-4.beta.1/gatk-launch BwaAndMarkDuplicatesPipelineSpark 
--bamPartitionSize 64000000 
-I hdfs://bigdata/user/myname/gatk4test/wgs.sub4.unaligned.bam 
-O hdfs://bigdata/user/myname/gatk4test/wgs.sub4.BwaAndMarkDuplicatesPipelineSpark_out.bam 
-R hdfs://bigdata/user/myname/genomes/Hsapiens/GRCh37/seq/GRCh37.2bit 
--bwamemIndexImage /hadoop/myname/GRCh37.fa.img 
--disableSequenceDictionaryValidation 
-- --sparkRunner SPARK 
--sparkMaster spark://ln16:7077 
--conf spark.cores.max=600 
--executor-cores 2 
--executor-memory 50g 
--conf spark.driver.memory=50g

Hope someone will find this helpful.

@hliang
Author

hliang commented Jul 18, 2017

Just found that the log4j errors can be fixed by editing gatk-launch to set spark.driver.userClassPathFirst to false (it was true by default), or by adding --conf spark.driver.userClassPathFirst=false to the gatk-launch command line.
Not sure if that has any unexpected side effects, though.
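For reference, the workaround as a command-line flag might look like this (the tool arguments are elided; the flag goes after the `--` separator so gatk-launch forwards it to spark-submit):

```shell
../gatk/gatk-launch BwaAndMarkDuplicatesPipelineSpark \
  ... \
  -- --sparkRunner SPARK --sparkMaster spark://ln16:7077 \
  --conf spark.driver.userClassPathFirst=false
```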
