
ReadsPipelineSpark (HaplotypeCallerSpark) unable to resolve hdfs path. #6730

Open

108anup opened this issue Jul 29, 2020 · 4 comments

108anup commented Jul 29, 2020

Bug Report

Affected tool(s) or class(es)

ReadsPipelineSpark (HaplotypeCallerSpark) when running on a Spark cluster

Affected version(s)

  • Latest public release version [GATK v4.1.8.1]

Description

Tools used:

  • latest Docker image from broadinstitute/gatk
  • latest Hadoop (3.3.0)
  • Spark 2.3.1 ("without Hadoop" build), which can use the custom Hadoop setup

Steps to reproduce

Script run:

#!/bin/bash

export HADOOP_CONF_DIR=/etc/hadoop
export HADOOP_HOME=/mnt/hadoop-latest
export JAVA_HOME=/mnt/jre1.8.0_192
export SPARK_HOME=/mnt/spark-2.3.1-bin-without-hadoop
export HADOOP_USER_NAME=hadoop

# export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)

TEST_DIR="hdfs://cromwellhadooptest:8020/user/hadoop/gatk/small"
COMMON_DIR="hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common"
INPUT_DIR="$TEST_DIR/input"
OUTPUT_DIR="$TEST_DIR/output"

input_bam="$INPUT_DIR/small_CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam"
output_vcf_basename="$OUTPUT_DIR/CEUTrio.HiSeq.WGS.b37.NA12878.20.21"

ref_fasta="$COMMON_DIR/human_g1k_v37.20.21.fasta"
known_sites="$COMMON_DIR/dbsnp_138.b37.20.21.vcf"

gatk ReadsPipelineSpark \
    -R ${ref_fasta} \
    -I ${input_bam} \
    -O ${output_vcf_basename}.vcf \
    --known-sites ${known_sites} \
    -pairHMM AVX_LOGLESS_CACHING \
    --spark-verbosity DEBUG \
    -- --spark-runner SPARK --spark-master yarn-cluster
    # --conf 'spark.submit.deployMode=cluster'

Expected behavior

ReadsPipelineSpark should be able to resolve the hdfs file path: hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta

Actual behavior

The tool tries to access: file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta even when the input is: hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta

I verified that the file is accessible through HDFS:

(gatk) root@2e738717b9c1:/gatk/mnt# $HADOOP_HOME/bin/hdfs dfs -ls hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
-rw-r--r--   3 hadoop supergroup  113008112 2020-07-29 15:54 hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
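
For reference, the equivalent check can be done programmatically through Hadoop's FileSystem API. A minimal sketch (the class name here is hypothetical, and this assumes the Hadoop client jars are on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExistsCheck {
    public static void main(String[] args) throws Exception {
        Path fasta = new Path(
            "hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");
        // FileSystem.get(URI, conf) selects DistributedFileSystem for the
        // hdfs:// scheme instead of the default (local) file system.
        FileSystem fs = FileSystem.get(fasta.toUri(), new Configuration());
        System.out.println(fs.exists(fasta)); // prints true if the path resolves
    }
}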

When I specify input as: hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta, (i.e. without the port) I get the same error.

Stack trace for this:

***********************************************************************
A USER ERROR has occurred: The specified fasta file (file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta) does not exist.
***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException$MissingReference: The specified fasta file (file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta) does not exist.
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.checkFastaPath(CachingIndexedFastaSequenceFile.java:173)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:143)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:125)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:110)
at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.processAssemblyRegions(HaplotypeCallerSpark.java:148)
at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.callVariantsWithHaplotypeCallerAndWriteOutput(HaplotypeCallerSpark.java:277)
at org.broadinstitute.hellbender.tools.spark.pipelines.ReadsPipelineSpark.runTool(ReadsPipelineSpark.java:224)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721) 

When I specify input as: hdfs:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta, the tool tries to access hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta.

Stack trace for this:

java.lang.IllegalArgumentException: Wrong FS: hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta, expected: hdfs://cromwellhadooptest
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:776)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:247)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1725)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1722)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1737)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1729)
at hdfs.jsr203.HadoopFileSystem.checkAccess(HadoopFileSystem.java:937)
at hdfs.jsr203.HadoopFileSystemProvider.checkAccess(HadoopFileSystemProvider.java:75)
at java.nio.file.Files.exists(Files.java:2385)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.canCreateIndexedFastaReader(ReferenceSequenceFileFactory.java:165)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:137)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:122)
at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:111)
at org.broadinstitute.hellbender.engine.spark.datasources.ReferenceHadoopSparkSource.getReferenceSequenceDictionary(ReferenceHadoopSparkSource.java:41)
at org.broadinstitute.hellbender.engine.spark.datasources.ReferenceMultiSparkSource.getReferenceSequenceDictionary(ReferenceMultiSparkSource.java:93)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReference(GATKSparkTool.java:604)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:553)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:544)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721) 

The above makes sense; that's why I added the hostname and port for the namenode in the first place.
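
A plausible mechanism for the :-1, not confirmed against GATK's source: java.net.URI#getPort() returns -1 when the authority carries no explicit port, so any code that rebuilds a URI from its components bakes that sentinel in. (For hdfs:/// the host presumably gets filled in from fs.defaultFS while the port stays unset.) A minimal sketch:

import java.net.URI;

public class PortSentinelDemo {
    public static void main(String[] args) {
        URI noPort = URI.create("hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");
        System.out.println(noPort.getPort()); // -1: no explicit port in the authority
        // Naively reconstructing the URI from its components keeps the sentinel:
        String rebuilt = noPort.getScheme() + "://" + noPort.getHost()
                + ":" + noPort.getPort() + noPort.getPath();
        System.out.println(rebuilt);
        // -> hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
    }
}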

It seems that, after verifying that the file hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta exists, some code transforms this path into file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta.

108anup changed the title from "ReadsPipelineSpark unable to resolve hdfs path." to "ReadsPipelineSpark (HaplotypeCallerSpark) unable to resolve hdfs path." on Jul 29, 2020
cmnbroad (Collaborator) commented Aug 3, 2020

At a minimum, this code in GATKSparkTool throws away the file system (URI scheme) info. There are probably still other code paths that assume the reference is local.
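
A minimal illustration of that failure mode (hypothetical, not the actual GATK code): going through URI#getPath() drops both the scheme and the authority, so re-resolving the bare path falls back to the local file system.

import java.io.File;
import java.net.URI;

public class SchemeDropDemo {
    public static void main(String[] args) {
        URI ref = URI.create("hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");
        // getPath() keeps only the path component:
        String pathOnly = ref.getPath(); // "/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta"
        // Re-resolving that bare string defaults to the local file system:
        System.out.println(new File(pathOnly).toURI());
        // -> file:/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
    }
}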

cmnbroad (Collaborator) commented Aug 4, 2020

Unfortunately, fixing this will require adding some new tests, but the tests are all currently disabled due to #5680.

droazen (Collaborator) commented Aug 10, 2020

Theory from @cmnbroad is below:

I think this is happening because we're trying to serialize the class loader (sun.misc.Launcher$AppClassLoader), which appears to be reached through the object graph by way of https://github.com/damiencarol/jsr203-hadoop/blob/master/src/main/java/hdfs/jsr203/HadoopFileSystem.java#L82. We probably need to short-circuit that with a custom serializer for one of these:

Serialization trace:
classes (sun.misc.Launcher$AppClassLoader)
classLoader (org.apache.hadoop.conf.Configuration)
conf (org.apache.hadoop.hdfs.DistributedFileSystem)
fs (hdfs.jsr203.HadoopFileSystem)
hdfs (hdfs.jsr203.HadoopPath)
path (htsjdk.samtools.seekablestream.SeekablePathStream)
seekableStream (htsjdk.tribble.TribbleIndexedFeatureReader)
featureReader (org.broadinstitute.hellbender.engine.FeatureDataSource)
featureSources (org.broadinstitute.hellbender.engine.FeatureManager)

See, for instance, dbpedia/distributed-extraction-framework#9.
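
One possible shape for such a short-circuit, assuming Spark's Kryo serializer and targeting the Configuration link in the chain (a sketch, not a tested fix; the class name and choice of target are hypothetical): serialize Configuration through its own Writable wire format so Kryo never walks its classLoader field.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.serializer.KryoRegistrator;

public class HadoopConfRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(Configuration.class, new Serializer<Configuration>() {
            @Override
            public void write(Kryo k, Output output, Configuration conf) {
                try {
                    // Configuration implements Writable; use its own wire
                    // format instead of field-by-field reflection.
                    conf.write(new DataOutputStream(output));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            @Override
            public Configuration read(Kryo k, Input input, Class<Configuration> type) {
                Configuration conf = new Configuration(false);
                try {
                    conf.readFields(new DataInputStream(input));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
                return conf;
            }
        });
    }
}

Spark would pick this up via spark.kryo.registrator=HadoopConfRegistrator, assuming the class is on the executor classpath.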

cmnbroad (Collaborator) commented

Just for clarity, the last suggestion above is for the fix to #5680. The fix for this issue is separate, but fixing #5680 will enable us to re-enable the tests, which is a prerequisite to fixing this issue.
