Unable to load bwa index error when running BwaSpark under version alpha.2-45-ga30af5a #2171

Closed
huangk3 opened this issue Sep 16, 2016 · 13 comments


@huangk3

huangk3 commented Sep 16, 2016

Below are the contents of my reference folder. The index is there, but I don't know why the tool can't recognize it. Please help, thanks!
kh3@rgcaahauva08091 ~/Resources/genome_b37 $> ls -l genome.*
-rw-rw---- 1 kh3 kh3 784809415 Sep 16 10:16 genome.2bit
-rw-rw---- 1 kh3 kh3 3168829906 Feb 4 2014 genome.fa
-rw-r----- 1 kh3 kh3 106669 Sep 16 11:32 genome.fa.amb
-rw-r----- 1 kh3 kh3 3276 Sep 16 11:32 genome.fa.ann
-rw-r----- 1 kh3 kh3 3137454592 Sep 16 11:31 genome.fa.bwt
-rw-rw---- 1 kh3 kh3 2984 Feb 4 2014 genome.fa.fai
-rw-rw---- 1 kh3 kh3 2984 Sep 16 13:18 genome.fai
-rw-r----- 1 kh3 kh3 784363628 Sep 16 11:32 genome.fa.pac
-rw-r----- 1 kh3 kh3 1568727304 Sep 16 11:44 genome.fa.sa

Using GATK wrapper script /home/kh3/Softwares/gatk/build/install/gatk/bin/gatk
Running:
/home/kh3/Softwares/gatk/build/install/gatk/bin/gatk BwaAndMarkDuplicatesPipelineSpark -I /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam -R /home/kh3/Resources/genome_b37/genome.2bit --disableSequenceDictionaryValidation true -t 16 -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam
15:47:28.760 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/Softwares/gatk/build/install/gatk/lib/gkl-0.1.2.jar!/com/intel/gkl/native/libIntelGKL.so
15:47:28.809 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 16, 2016 3:47:28 PM EDT] org.broadinstitute.hellbender.tools.spark.pipelines.BwaAndMarkDuplicatesPipelineSpark --threads 16 --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam --reference /home/kh3/Resources/genome_b37/genome.2bit --input /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --disableSequenceDictionaryValidation true --fixedChunkSize 100000 --duplicates_scoring_strategy SUM_OF_BASE_QUALITIES --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false --disableAllReadFilters false
[September 16, 2016 3:47:28 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.2-45-ga30af5a-SNAPSHOT
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.BUFFER_SIZE : 131072
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.COMPRESSION_LEVEL : 1
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.CREATE_INDEX : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.CREATE_MD5 : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.CUSTOM_READER_FACTORY :
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.REFERENCE_FASTA : null
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
15:47:28.835 INFO BwaAndMarkDuplicatesPipelineSpark - Deflater IntelDeflater
15:47:28.836 INFO BwaAndMarkDuplicatesPipelineSpark - Initializing engine
15:47:28.836 INFO BwaAndMarkDuplicatesPipelineSpark - Done initializing engine
15:47:29.287 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[E::bwa_idx_load_from_disk] fail to locate the index files
[previous line repeated 15 more times]
15:47:34.944 ERROR Executor:95 - Exception in task 5.0 in stage 0.0 (TID 5)
org.broadinstitute.hellbender.exceptions.GATKException: Cannot run BWA-MEM
at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine.lambda$null$1(BwaSparkEngine.java:113)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:159)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:159)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: bwa_idx_load failed
at com.github.lindenb.jbwa.jni.BwaIndex._open(Native Method)
at com.github.lindenb.jbwa.jni.BwaIndex.&lt;init&gt;(BwaIndex.java:216)
at org.broadinstitute.hellbender.tools.spark.bwa.BwaSparkEngine.lambda$null$1(BwaSparkEngine.java:109)
... 32 more

@lbergelson
Member

Hi @huangk3. This is a known bug. The Spark pipeline uses a 2bit reference, but bwa needs a fasta reference. There's a branch in PR #1981 that adds a bwa_reference parameter as a workaround. The tests are failing, but to the best of my knowledge it's the tests that are wrong and not the code. I need to update the tests in that branch and get it merged, but if you want to play around with it, it's a good place to start (use with caution for now, though).
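For anyone hitting the same error: bwa derives its index filenames by appending fixed extensions to whatever path it is handed, so with -R genome.2bit it looks for genome.2bit.amb and friends, which don't exist in the listing above. A minimal shell sketch of that lookup (the helper name and the scratch directory are illustrative, not part of GATK or bwa):

```shell
# bwa appends .amb/.ann/.bwt/.pac/.sa to the reference path it is given.
# This helper reports how many of those files are absent for a prefix.
check_bwa_index() {
  ref="$1"; missing=0
  for ext in amb ann bwt pac sa; do
    if [ ! -f "${ref}.${ext}" ]; then
      echo "missing: ${ref}.${ext}"
      missing=$((missing + 1))
    fi
  done
  echo "${missing} index file(s) missing for ${ref}"
}

# Demo in a scratch directory standing in for genome_b37: only
# genome.fa.* exists, so the .2bit prefix finds nothing.
demo=$(mktemp -d)
for ext in amb ann bwt pac sa; do touch "${demo}/genome.fa.${ext}"; done
check_bwa_index "${demo}/genome.fa"    # last line: 0 index file(s) missing ...
check_bwa_index "${demo}/genome.2bit"  # last line: 5 index file(s) missing ...
```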

@lbergelson
Member

Actually, on further consideration, I think I've given you the wrong advice. Try directly using a fasta instead of a 2bit in the --reference parameter. I think the 2bit is only necessary for the complete reads pipeline.

@huangk3
Author

huangk3 commented Sep 16, 2016

Hi @lbergelson, thanks for the quick response! I just used the fasta as the reference, and it still doesn't work. The log only says "null".

Using GATK wrapper script /home/kh3/Softwares/gatk/build/install/gatk/bin/gatk
Running:
/home/kh3/Softwares/gatk/build/install/gatk/bin/gatk BwaSpark -I /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam -R /home/kh3/Resources/genome_b37/genome.fa --disableSequenceDictionaryValidation true -t 16 -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam
16:55:32.261 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/Softwares/gatk/build/install/gatk/lib/gkl-0.1.2.jar!/com/intel/gkl/native/libIntelGKL.so
16:55:32.310 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 16, 2016 4:55:32 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam --threads 16 --reference /home/kh3/Resources/genome_b37/genome.fa --input /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --disableSequenceDictionaryValidation true --fixedChunkSize 100000 --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false --disableAllReadFilters false
[September 16, 2016 4:55:32 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.2-45-ga30af5a-SNAPSHOT
16:55:32.335 INFO BwaSpark - Defaults.BUFFER_SIZE : 131072
16:55:32.335 INFO BwaSpark - Defaults.COMPRESSION_LEVEL : 1
16:55:32.335 INFO BwaSpark - Defaults.CREATE_INDEX : false
16:55:32.335 INFO BwaSpark - Defaults.CREATE_MD5 : false
16:55:32.335 INFO BwaSpark - Defaults.CUSTOM_READER_FACTORY :
16:55:32.335 INFO BwaSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
16:55:32.335 INFO BwaSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
16:55:32.335 INFO BwaSpark - Defaults.REFERENCE_FASTA : null
16:55:32.335 INFO BwaSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
16:55:32.336 INFO BwaSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:55:32.336 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:55:32.336 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:55:32.336 INFO BwaSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
16:55:32.336 INFO BwaSpark - Deflater IntelDeflater
16:55:32.336 INFO BwaSpark - Initializing engine
16:55:32.336 INFO BwaSpark - Done initializing engine
16:55:32.757 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:55:34.473 INFO BwaSpark - Shutting down engine
[September 16, 2016 4:55:34 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=499646464


null


@lbergelson
Member

Hmm, that might be due to bad default filters. Could you try with --disableAllReadFilters?

@huangk3
Author

huangk3 commented Sep 17, 2016

Hi @lbergelson, I added --disableAllReadFilters, and the log still says "null".

Using GATK wrapper script /home/kh3/Softwares/gatk/build/install/gatk/bin/gatk
Running:
/home/kh3/Softwares/gatk/build/install/gatk/bin/gatk BwaSpark -I /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam -R /home/kh3/Resources/genome_b37/genome.fa --disableAllReadFilters --disableSequenceDictionaryValidation true -t 16 -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam
12:08:44.856 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/Softwares/gatk/build/install/gatk/lib/gkl-0.1.2.jar!/com/intel/gkl/native/libIntelGKL.so
12:08:44.905 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 17, 2016 12:08:44 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.aligned.bam --threads 16 --reference /home/kh3/Resources/genome_b37/genome.fa --input /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --disableSequenceDictionaryValidation true --disableAllReadFilters true --fixedChunkSize 100000 --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false
[September 17, 2016 12:08:44 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.2-45-ga30af5a-SNAPSHOT
12:08:44.930 INFO BwaSpark - Defaults.BUFFER_SIZE : 131072
12:08:44.930 INFO BwaSpark - Defaults.COMPRESSION_LEVEL : 1
12:08:44.930 INFO BwaSpark - Defaults.CREATE_INDEX : false
12:08:44.930 INFO BwaSpark - Defaults.CREATE_MD5 : false
12:08:44.930 INFO BwaSpark - Defaults.CUSTOM_READER_FACTORY :
12:08:44.930 INFO BwaSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
12:08:44.930 INFO BwaSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
12:08:44.930 INFO BwaSpark - Defaults.REFERENCE_FASTA : null
12:08:44.930 INFO BwaSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
12:08:44.930 INFO BwaSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:08:44.930 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:08:44.931 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:08:44.931 INFO BwaSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
12:08:44.931 INFO BwaSpark - Deflater IntelDeflater
12:08:44.931 INFO BwaSpark - Initializing engine
12:08:44.931 INFO BwaSpark - Done initializing engine
12:08:45.439 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12:08:47.488 INFO BwaSpark - Shutting down engine
[September 17, 2016 12:08:47 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=499646464


null


@lbergelson
Member

@huangk3 Hmm. I'm not sure what's going on then. I'd have to take some time to look into it. Unfortunately, this is my last week before I head out for several weeks off, and I don't know if I'll be able to get to it before I leave.

BWA support is still very experimental, and it seems like we have quite a few existing issues with it. We're going to be putting a lot of effort into the Spark tools next quarter, but until then it may not get the attention it needs.

@sooheelee
Contributor

@huangk3 When you replaced the 2bit reference with the fasta reference, did you also have matched index files (.amb, .ann, .bwt, .pac, and .sa) in the same directory as the fasta reference? I believe these index files, and also the .fai index file, all need to be regenerated for the specific reference file.
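To make the names line up, the index files have to be rebuilt against the exact fasta passed via -R. A sketch of the rebuild, with a DRY_RUN guard since `bwa index` on a whole genome takes a while; the helper name is made up, but `bwa index` and `samtools faidx` are the standard commands:

```shell
# Rebuild the BWA index files and the .fai so their names match the
# fasta passed via -R. With DRY_RUN=1 (the default here) the commands
# are only printed; unset it to execute them for real (bwa and samtools
# must be on PATH).
reindex_reference() {
  ref="$1"
  run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
  run bwa index "$ref"       # writes $ref.amb .ann .bwt .pac .sa
  run samtools faidx "$ref"  # writes $ref.fai
}

reindex_reference /home/kh3/Resources/genome_b37/genome.fa
```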

@huangk3
Author

huangk3 commented Oct 5, 2016

Hi @sooheelee, I did try using BWA to index the 2bit reference, but that doesn't work either. @lbergelson Do you think the reference loading issue can be fixed soon, like this month?

@sooheelee
Contributor

It's not likely that the reference loading issue will be fixed this month, @huangk3. I'm answering in @lbergelson's stead, as he's out for a few weeks. This is on the team's radar, so there is some chance it could happen, but we can't say for sure.

@droazen
Collaborator

droazen commented Mar 22, 2017

@tedsharpe Think this one is fixed with your new BWA bindings?

@tedsharpe
Contributor

Yes, but...

It looks to me as if the index files, which appear to be in the master node's Linux file system in this failing example, are probably not available to the worker nodes. You'd have to copy each of the five index files to each of the workers, putting them in the same location on each.

The same problem would occur with the new version: The single-image index file will still need to be available to all workers. You could distribute this file with:
--conf spark.yarn.dist.files=<location of the image file>
which will copy it from your local machine to all workers each time you run the program. This isn't optimal, because the image file is pretty large.

So, instead, you could copy it to a fixed path, identical on each worker, once up front, and then run your alignment jobs to your heart's content. The new version is a little simpler, because there's just one index file, but otherwise suffers from the same issue: bwa mem only knows how to deal with ordinary file system files -- not HDFS, not GCS -- and so the file must be copied to each worker machine in the cluster.
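The one-time copy to a fixed path can be scripted. Everything below (hostnames, paths, the image filename, the helper name) is illustrative, and with DRY_RUN=1 the scp commands are only printed rather than run:

```shell
# Push the single-image bwa index to an identical path on every worker,
# once up front, so alignment jobs can then find it on the local file
# system. DRY_RUN=1 (the default here) prints the scp commands instead
# of executing them.
distribute_index() {
  image="$1"; dest="$2"; shift 2
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "+ scp ${image} ${host}:${dest}"
    else
      scp "${image}" "${host}:${dest}"
    fi
  done
}

distribute_index /refs/genome_b37/genome.fa.img /opt/refs/genome.fa.img \
  worker1 worker2 worker3
```

The spark.yarn.dist.files route mentioned above avoids this setup step, at the cost of re-shipping the large image on every run.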

@sooheelee
Contributor

Did you solve this issue @huangk3?

@droazen droazen added this to the Engine-4.0 milestone Aug 1, 2017
@lbergelson
Member

Closing this since there hasn't been any response in a long time. Feel free to re-open if there are updates.
