Unable to load bwa index error when running BwaSpark under version alpha.2-45-ga30af5a #2171

Closed
huangk3 opened this issue Sep 16, 2016 · 13 comments


@huangk3

huangk3 commented Sep 16, 2016

Below are the contents of my reference folder. The index is there, but I don't know why the tool can't recognize it. Please help, thanks!
kh3@rgcaahauva08091 ~/Resources/genome_b37 $> ls -l genome.*
-rw-rw---- 1 kh3 kh3 784809415 Sep 16 10:16 genome.2bit
-rw-rw---- 1 kh3 kh3 3168829906 Feb 4 2014 genome.fa
-rw-r----- 1 kh3 kh3 106669 Sep 16 11:32 genome.fa.amb
-rw-r----- 1 kh3 kh3 3276 Sep 16 11:32 genome.fa.ann
-rw-r----- 1 kh3 kh3 3137454592 Sep 16 11:31 genome.fa.bwt
-rw-rw---- 1 kh3 kh3 2984 Feb 4 2014 genome.fa.fai
-rw-rw---- 1 kh3 kh3 2984 Sep 16 13:18 genome.fai
-rw-r----- 1 kh3 kh3 784363628 Sep 16 11:32 genome.fa.pac
-rw-r----- 1 kh3 kh3 1568727304 Sep 16 11:44 genome.fa.sa

Using GATK wrapper script /home/kh3/Softwares/gatk/build/install/gatk/bin/gatk
Running:
/home/kh3/Softwares/gatk/build/install/gatk/bin/gatk BwaAndMarkDuplicatesPipelineSpark -I /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam -R /home/kh3/Resources/genome_b37/genome.2bit --disableSequenceDictionaryValidation true -t 16 -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam
15:47:28.760 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/Softwares/gatk/build/install/gatk/lib/gkl-0.1.2.jar!/com/intel/gkl/native/libIntelGKL.so
15:47:28.809 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 16, 2016 3:47:28 PM EDT] org.broadinstitute.hellbender.tools.spark.pipelines.BwaAndMarkDuplicatesPipelineSpark --threads 16 --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam --reference /home/kh3/Resources/genome_b37/genome.2bit --input /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --disableSequenceDictionaryValidation true --fixedChunkSize 100000 --duplicates_scoring_strategy SUM_OF_BASE_QUALITIES --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false --disableAllReadFilters false
[September 16, 2016 3:47:28 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.2-45-ga30af5a-SNAPSHOT
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.BUFFER_SIZE : 131072
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.COMPRESSION_LEVEL : 1
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.CREATE_INDEX : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.CREATE_MD5 : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.CUSTOM_READER_FACTORY :
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.REFERENCE_FASTA : null
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Deflater IntelDeflater
15:47:28.836 INFO BwaAndMarkDuplicatesPipelineSpark - Initializing engine
15:47:28.836 INFO BwaAndMarkDuplicatesPipelineSpark - Done initializing engine
15:47:29.287 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[E::bwa_idx_load_from_disk] fail to locate the index files
[previous line repeated 15 more times]
15:47:34.944 ERROR Executor:95 - Exception in task 5.0 in stage 0.0 (TID 5)
org.broadinstitute.hellbender.exceptions.GATKException: Cannot run BWA-MEM
at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine.lambda$null$1(BwaSparkEngine.java:113)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:159)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:159)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: bwa_idx_load failed
at com.github.lindenb.jbwa.jni.BwaIndex._open(Native Method)
at com.github.lindenb.jbwa.jni.BwaIndex.&lt;init&gt;(BwaIndex.java:216)
at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine.lambda$null$1(BwaSparkEngine.java:109)
... 32 more

@lbergelson
Member

Hi @huangk3. This is a known bug. The Spark pipeline uses a 2bit reference, but bwa needs a fasta reference. There's a branch in PR #1981 that adds a bwa_reference parameter as a workaround. The tests are failing, but to the best of my knowledge it's the tests that are wrong and not the code. I need to update the tests in that branch and get it merged, but if you want to play around with it, it's a good place to start (use with caution for now, though).
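For anyone hitting the same error: bwa derives its index filenames by appending fixed extensions to whatever path it is handed, so with -R genome.2bit it looks for genome.2bit.amb and friends, which don't exist in the listing above. A minimal shell sketch of that lookup (the helper name and the scratch directory are illustrative, not part of GATK or bwa):

```shell
# bwa appends .amb/.ann/.bwt/.pac/.sa to the reference path it is given.
# This helper reports how many of those files are absent for a prefix.
check_bwa_index() {
  ref="$1"; missing=0
  for ext in amb ann bwt pac sa; do
    if [ ! -f "${ref}.${ext}" ]; then
      echo "missing: ${ref}.${ext}"
      missing=$((missing + 1))
    fi
  done
  echo "${missing} index file(s) missing for ${ref}"
}

# Demo in a scratch directory standing in for genome_b37: only
# genome.fa.* exists, so the .2bit prefix finds nothing.
demo=$(mktemp -d)
for ext in amb ann bwt pac sa; do touch "${demo}/genome.fa.${ext}"; done
check_bwa_index "${demo}/genome.fa"    # last line: 0 index file(s) missing ...
check_bwa_index "${demo}/genome.2bit"  # last line: 5 index file(s) missing ...
```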

@lbergelson
Member

Actually, on further consideration, I think I've given you the wrong advice. Try directly using a fasta instead of a 2bit in the --reference parameter. I think the 2bit is only necessary for the complete reads pipeline.

@huangk3
Author

huangk3 commented Sep 16, 2016

Hi @lbergelson, thanks for the quick response! I just used the fasta as the reference, and it still doesn't work. The log only says "null".

Using GATK wrapper script /home/kh3/Softwares/gatk/build/install/gatk/bin/gatk
Running:
/home/kh3/Softwares/gatk/build/install/gatk/bin/gatk BwaSpark -I /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam -R /home/kh3/Resources/genome_b37/genome.fa --disableSequenceDictionaryValidation true -t 16 -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam
16:55:32.261 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/Softwares/gatk/build/install/gatk/lib/gkl-0.1.2.jar!/com/intel/gkl/native/libIntelGKL.so
16:55:32.310 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 16, 2016 4:55:32 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam --threads 16 --reference /home/kh3/Resources/genome_b37/genome.fa --input /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --disableSequenceDictionaryValidation true --fixedChunkSize 100000 --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false --disableAllReadFilters false
[September 16, 2016 4:55:32 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.2-45-ga30af5a-SNAPSHOT
16:55:32.335 INFO BwaSpark - Defaults.BUFFER_SIZE : 131072
16:55:32.335 INFO BwaSpark - Defaults.COMPRESSION_LEVEL : 1
16:55:32.335 INFO BwaSpark - Defaults.CREATE_INDEX : false
16:55:32.335 INFO BwaSpark - Defaults.CREATE_MD5 : false
16:55:32.335 INFO BwaSpark - Defaults.CUSTOM_READER_FACTORY :
16:55:32.335 INFO BwaSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
16:55:32.335 INFO BwaSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
16:55:32.335 INFO BwaSpark - Defaults.REFERENCE_FASTA : null
16:55:32.335 INFO BwaSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
16:55:32.336 INFO BwaSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:55:32.336 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:55:32.336 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:55:32.336 INFO BwaSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
16:55:32.336 INFO BwaSpark - Deflater IntelDeflater
16:55:32.336 INFO BwaSpark - Initializing engine
16:55:32.336 INFO BwaSpark - Done initializing engine
16:55:32.757 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:55:34.473 INFO BwaSpark - Shutting down engine
[September 16, 2016 4:55:34 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=499646464


null


@lbergelson
Member

Hmm, that might be due to bad default filters. Could you try with --disableAllReadFilters?

@huangk3
Author

huangk3 commented Sep 17, 2016

Hi @lbergelson, I added --disableAllReadFilters, and the log still says "null".

Using GATK wrapper script /home/kh3/Softwares/gatk/build/install/gatk/bin/gatk
Running:
/home/kh3/Softwares/gatk/build/install/gatk/bin/gatk BwaSpark -I /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam -R /home/kh3/Resources/genome_b37/genome.fa --disableAllReadFilters --disableSequenceDictionaryValidation true -t 16 -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam
12:08:44.856 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/Softwares/gatk/build/install/gatk/lib/gkl-0.1.2.jar!/com/intel/gkl/native/libIntelGKL.so
12:08:44.905 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 17, 2016 12:08:44 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam --threads 16 --reference /home/kh3/Resources/genome_b37/genome.fa --input /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --disableSequenceDictionaryValidation true --disableAllReadFilters true --fixedChunkSize 100000 --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false
[September 17, 2016 12:08:44 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.2-45-ga30af5a-SNAPSHOT
12:08:44.930 INFO BwaSpark - Defaults.BUFFER_SIZE : 131072
12:08:44.930 INFO BwaSpark - Defaults.COMPRESSION_LEVEL : 1
12:08:44.930 INFO BwaSpark - Defaults.CREATE_INDEX : false
12:08:44.930 INFO BwaSpark - Defaults.CREATE_MD5 : false
12:08:44.930 INFO BwaSpark - Defaults.CUSTOM_READER_FACTORY :
12:08:44.930 INFO BwaSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
12:08:44.930 INFO BwaSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
12:08:44.930 INFO BwaSpark - Defaults.REFERENCE_FASTA : null
12:08:44.930 INFO BwaSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
12:08:44.930 INFO BwaSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:08:44.930 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:08:44.931 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:08:44.931 INFO BwaSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
12:08:44.931 INFO BwaSpark - Deflater IntelDeflater
12:08:44.931 INFO BwaSpark - Initializing engine
12:08:44.931 INFO BwaSpark - Done initializing engine
12:08:45.439 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12:08:47.488 INFO BwaSpark - Shutting down engine
[September 17, 2016 12:08:47 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=499646464


null


@lbergelson
Member

@huangk3 Hmm. I'm not sure what's going on then. I'd have to take some time to look into it. Unfortunately, this is my last week before I head out for several weeks off, and I don't know if I'll be able to get to it before I leave.

BWA support is still very experimental, and it seems like we have quite a few existing issues with it. We're going to be putting a lot of effort into the Spark tools next quarter, but until then it may not get the attention it needs.

@sooheelee
Contributor

@huangk3 When you replaced the 2bit reference with the fasta reference, did you also have matched index files (.amb, .ann, .bwt, .pac, and .sa) in the same directory as the fasta reference? I believe these index files, and also the .fai index file, all need to be regenerated for the specific reference file.
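To make the names line up, the index files have to be rebuilt against the exact fasta passed via -R. A sketch of the rebuild, with a DRY_RUN guard since `bwa index` on a whole genome takes a while; the helper name is made up, but `bwa index` and `samtools faidx` are the standard commands:

```shell
# Rebuild the BWA index files and the .fai so their names match the
# fasta passed via -R. With DRY_RUN=1 (the default here) the commands
# are only printed; unset it to execute them for real (bwa and samtools
# must be on PATH).
reindex_reference() {
  ref="$1"
  run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
  run bwa index "$ref"       # writes $ref.amb .ann .bwt .pac .sa
  run samtools faidx "$ref"  # writes $ref.fai
}

reindex_reference /home/kh3/Resources/genome_b37/genome.fa
```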

@huangk3
Author

huangk3 commented Oct 5, 2016

Hi @sooheelee, I did try using BWA to index the 2bit reference, but that doesn't work either. @lbergelson Do you think the reference loading issue can be fixed soon, like this month?

@sooheelee
Contributor

It's not likely that the reference loading issue will be fixed this month, @huangk3. I'm answering in @lbergelson's stead, as he's out for a few weeks. This is on the team's radar, so there is some chance it could happen, but we can't say for sure.

@droazen
Collaborator

droazen commented Mar 22, 2017

@tedsharpe Think this one is fixed with your new BWA bindings?

@tedsharpe
Contributor

Yes, but...

It looks to me as if the index files, which appear to be in the master node's Linux file system in this failing example, are probably not available to the worker nodes. You'd have to copy each of the five index files to each of the workers, putting them in the same location on each.

The same problem would occur with the new version: The single-image index file will still need to be available to all workers. You could distribute this file with:
--conf spark.yarn.dist.files=<location of the image file>
which will copy it from your local machine to all workers each time you run the program. This isn't optimal, because the image file is pretty large.

So, instead, you could copy it to a fixed path, identical on each worker, once up front, and then run your alignment jobs to your heart's content. The new version is a little simpler, because there's just one index file, but otherwise suffers from the same issue: bwa mem only knows how to deal with ordinary file system files -- not HDFS, not GCS -- and so the file must be copied to each worker machine in the cluster.
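The one-time copy to a fixed path can be scripted. Everything below (hostnames, paths, the image filename, the helper name) is illustrative, and with DRY_RUN=1 the scp commands are only printed rather than run:

```shell
# Push the single-image bwa index to an identical path on every worker,
# once up front, so alignment jobs can then find it on the local file
# system. DRY_RUN=1 (the default here) prints the scp commands instead
# of executing them.
distribute_index() {
  image="$1"; dest="$2"; shift 2
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "+ scp ${image} ${host}:${dest}"
    else
      scp "${image}" "${host}:${dest}"
    fi
  done
}

distribute_index /refs/genome_b37/genome.fa.img /opt/refs/genome.fa.img \
  worker1 worker2 worker3
```

The spark.yarn.dist.files route mentioned above avoids this setup step, at the cost of re-shipping the large image on every run.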

@sooheelee
Contributor

Did you solve this issue @huangk3?

@droazen droazen added this to the Engine-4.0 milestone Aug 1, 2017
@lbergelson
Member

Closing this since there hasn't been any response in a long time. Feel free to re-open if there are updates.
