
ReadsPipelineSpark (HaplotypeCallerSpark) unable to resolve hdfs path. #6730

Open

108anup opened this issue Jul 29, 2020 · 4 comments

108anup commented Jul 29, 2020

Bug Report

Affected tool(s) or class(es)

ReadsPipelineSpark (HaplotypeCallerSpark) when running on a Spark cluster

Affected version(s)

  • Latest public release version [GATK v4.1.8.1]

Description

Tools used:

  • latest Docker image from broadinstitute/gatk
  • latest Hadoop (3.3.0)
  • Spark 2.3.1 ("without Hadoop" build), which can use the custom Hadoop setup

Steps to reproduce

Script run:

#!/bin/bash

export HADOOP_CONF_DIR=/etc/hadoop
export HADOOP_HOME=/mnt/hadoop-latest
export JAVA_HOME=/mnt/jre1.8.0_192
export SPARK_HOME=/mnt/spark-2.3.1-bin-without-hadoop
export HADOOP_USER_NAME=hadoop

# export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)

TEST_DIR="hdfs://cromwellhadooptest:8020/user/hadoop/gatk/small"
COMMON_DIR="hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common"
INPUT_DIR="$TEST_DIR/input"
OUTPUT_DIR="$TEST_DIR/output"

input_bam="$INPUT_DIR/small_CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam"
output_vcf_basename="$OUTPUT_DIR/CEUTrio.HiSeq.WGS.b37.NA12878.20.21"

ref_fasta="$COMMON_DIR/human_g1k_v37.20.21.fasta"
known_sites="$COMMON_DIR/dbsnp_138.b37.20.21.vcf"

gatk ReadsPipelineSpark \
    -R ${ref_fasta} \
    -I ${input_bam} \
    -O ${output_vcf_basename}.vcf \
    --known-sites ${known_sites} \
    -pairHMM AVX_LOGLESS_CACHING \
    --spark-verbosity DEBUG \
    -- --spark-runner SPARK --spark-master yarn-cluster
    # --conf 'spark.submit.deployMode=cluster'

Expected behavior

ReadsPipelineSpark should be able to resolve the hdfs file path: hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta

Actual behavior

The tool tries to access: file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta even when the input is: hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta

I verified that the file is accessible through HDFS:

(gatk) root@2e738717b9c1:/gatk/mnt# $HADOOP_HOME/bin/hdfs dfs -ls hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
-rw-r--r--   3 hadoop supergroup  113008112 2020-07-29 15:54 hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
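
For reference, the equivalent check can be done programmatically through Hadoop's FileSystem API. A minimal sketch (the class name here is hypothetical, and this assumes the Hadoop client jars are on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExistsCheck {
    public static void main(String[] args) throws Exception {
        Path fasta = new Path(
            "hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");
        // FileSystem.get(URI, conf) selects DistributedFileSystem for the
        // hdfs:// scheme instead of the default (local) file system.
        FileSystem fs = FileSystem.get(fasta.toUri(), new Configuration());
        System.out.println(fs.exists(fasta)); // prints true if the path resolves
    }
}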

When I specify input as: hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta, (i.e. without the port) I get the same error.

Stack trace for this:

***********************************************************************
A USER ERROR has occurred: The specified fasta file (file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta) does not exist.
***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException$MissingReference: The specified fasta file (file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta) does not exist.
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.checkFastaPath(CachingIndexedFastaSequenceFile.java:173)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:143)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:125)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:110)
at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.processAssemblyRegions(HaplotypeCallerSpark.java:148)
at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.callVariantsWithHaplotypeCallerAndWriteOutput(HaplotypeCallerSpark.java:277)
at org.broadinstitute.hellbender.tools.spark.pipelines.ReadsPipelineSpark.runTool(ReadsPipelineSpark.java:224)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721) 

When I specify input as: hdfs:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta, the tool tries to access hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta.

Stack trace for this:

java.lang.IllegalArgumentException: Wrong FS: hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta, expected: hdfs://cromwellhadooptest
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:776)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:247)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1725)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1722)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1737)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1729)
at hdfs.jsr203.HadoopFileSystem.checkAccess(HadoopFileSystem.java:937)
at hdfs.jsr203.HadoopFileSystemProvider.checkAccess(HadoopFileSystemProvider.java:75)
at java.nio.file.Files.exists(Files.java:2385)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.canCreateIndexedFastaReader(ReferenceSequenceFileFactory.java:165)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:137)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:122)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:111)
at org.broadinstitute.hellbender.engine.spark.datasources.ReferenceHadoopSparkSource.getReferenceSequenceDictionary(ReferenceHadoopSparkSource.java:41)
at org.broadinstitute.hellbender.engine.spark.datasources.ReferenceMultiSparkSource.getReferenceSequenceDictionary(ReferenceMultiSparkSource.java:93)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReference(GATKSparkTool.java:604)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:553)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:544)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721) 

The above makes sense; that's why I added the hostname and port for the namenode in the first place.
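
A plausible mechanism for the :-1, not confirmed against GATK's source: java.net.URI#getPort() returns -1 when the authority carries no explicit port, so any code that rebuilds a URI from its components bakes that sentinel in. (For hdfs:/// the host presumably gets filled in from fs.defaultFS while the port stays unset.) A minimal sketch:

import java.net.URI;

public class PortSentinelDemo {
    public static void main(String[] args) {
        URI noPort = URI.create("hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");
        System.out.println(noPort.getPort()); // -1: no explicit port in the authority
        // Naively reconstructing the URI from its components keeps the sentinel:
        String rebuilt = noPort.getScheme() + "://" + noPort.getHost()
                + ":" + noPort.getPort() + noPort.getPath();
        System.out.println(rebuilt);
        // -> hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
    }
}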

It seems that, after verifying that the file hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta exists, some code transforms this path into file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta.

108anup changed the title from "ReadsPipelineSpark unable to resolve hdfs path." to "ReadsPipelineSpark (HaplotypeCallerSpark) unable to resolve hdfs path." on Jul 29, 2020
cmnbroad (Collaborator) commented Aug 3, 2020

At a minimum, this code in GATKSparkTool throws away the file system (URI scheme) info. There are probably still other code paths that assume the reference is local.
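
A minimal illustration of that failure mode (hypothetical, not the actual GATK code): going through URI#getPath() drops both the scheme and the authority, so re-resolving the bare path falls back to the local file system.

import java.io.File;
import java.net.URI;

public class SchemeDropDemo {
    public static void main(String[] args) {
        URI ref = URI.create("hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");
        // getPath() keeps only the path component:
        String pathOnly = ref.getPath(); // "/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta"
        // Re-resolving that bare string defaults to the local file system:
        System.out.println(new File(pathOnly).toURI());
        // -> file:/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
    }
}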

cmnbroad (Collaborator) commented Aug 4, 2020

Unfortunately, fixing this will require adding some new tests, but the tests are all currently disabled due to #5680.

droazen (Collaborator) commented Aug 10, 2020

Theory from @cmnbroad is below:

I think this is happening because we're trying to serialize the class loader (sun.misc.Launcher$AppClassLoader), which appears to be reached through the object graph by way of https://github.com/damiencarol/jsr203-hadoop/blob/master/src/main/java/hdfs/jsr203/HadoopFileSystem.java#L82. We probably need to short-circuit that with a custom serializer for one of these:

Serialization trace:
classes (sun.misc.Launcher$AppClassLoader)
classLoader (org.apache.hadoop.conf.Configuration)
conf (org.apache.hadoop.hdfs.DistributedFileSystem)
fs (hdfs.jsr203.HadoopFileSystem)
hdfs (hdfs.jsr203.HadoopPath)
path (htsjdk.samtools.seekablestream.SeekablePathStream)
seekableStream (htsjdk.tribble.TribbleIndexedFeatureReader)
featureReader (org.broadinstitute.hellbender.engine.FeatureDataSource)
featureSources (org.broadinstitute.hellbender.engine.FeatureManager)

See, for instance, dbpedia/distributed-extraction-framework#9.
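
One possible shape for such a short-circuit, assuming Spark's Kryo serializer and targeting the Configuration link in the chain (a sketch, not a tested fix; the class name and choice of target are hypothetical): serialize Configuration through its own Writable wire format so Kryo never walks its classLoader field.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.serializer.KryoRegistrator;

public class HadoopConfRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(Configuration.class, new Serializer<Configuration>() {
            @Override
            public void write(Kryo k, Output output, Configuration conf) {
                try {
                    // Configuration implements Writable; use its own wire
                    // format instead of field-by-field reflection.
                    conf.write(new DataOutputStream(output));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            @Override
            public Configuration read(Kryo k, Input input, Class<Configuration> type) {
                Configuration conf = new Configuration(false);
                try {
                    conf.readFields(new DataInputStream(input));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
                return conf;
            }
        });
    }
}

Spark would pick this up via spark.kryo.registrator=HadoopConfRegistrator, assuming the class is on the executor classpath.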

cmnbroad (Collaborator) commented

Just for clarity, the last suggestion above is for the fix to #5680. The fix for this issue is separate, but fixing #5680 will enable us to re-enable the tests, which is a prerequisite to fixing this issue.
