ReadsPipelineSpark (HaplotypeCallerSpark) unable to resolve hdfs path. #6730
At a minimum, this code in GATKSparkTool throws away the file system (URI scheme) info. There are probably still other code paths that assume the reference is local.
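For illustration, a minimal sketch (my own reconstruction, not the actual GATKSparkTool code) of how dropping the scheme from the reference URI sends the lookup to the local filesystem:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class SchemeLossDemo {
    public static void main(String[] args) throws Exception {
        URI ref = URI.create(
                "hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");

        // Keeping only the path component discards the scheme and authority.
        String bare = ref.getPath();  // "/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta"

        // When the bare string is later re-wrapped in a Hadoop Path, it is
        // qualified against the default filesystem (file:/// unless
        // fs.defaultFS says otherwise), reproducing the reported behavior.
        Configuration conf = new Configuration();
        Path rewrapped = new Path(bare);
        System.out.println(rewrapped.getFileSystem(conf).makeQualified(rewrapped));
        // -> file:/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
    }
}
```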
Unfortunately, fixing this will require adding some new tests, but those tests are all currently disabled due to #5680.
Theory from @cmnbroad is below:
Bug Report
Affected tool(s) or class(es)
ReadsPipelineSpark (HaplotypeCallerSpark) when running over a spark cluster
Affected version(s)
Description
Tools used:
- latest Docker image from broadinstitute/gatk
- latest Hadoop (3.3.0)
- Spark 2.3.1 ("without Hadoop" build, so it can use the custom Hadoop setup)
Steps to reproduce
Script run:
Expected behavior
ReadsPipelineSpark should be able to resolve the hdfs file path:
hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
Actual behavior
The tool tries to access:
file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
even when the input is: hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
Verified that the file is accessible through HDFS:
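For reference, an equivalent programmatic check via the Hadoop FileSystem API (a sketch of mine, not the command that was actually run) binds the filesystem to the URI's own scheme and authority before testing for the file:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        URI ref = URI.create(
                "hdfs://cromwellhadooptest:8020/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");

        // FileSystem.get(URI, conf) picks the filesystem from the URI's
        // scheme/authority instead of fs.defaultFS, so the hdfs:// prefix
        // is honored here.
        try (FileSystem fs = FileSystem.get(ref, new Configuration())) {
            System.out.println(fs.exists(new Path(ref)));  // true when reachable
        }
    }
}
```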
When I specify the input as hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta (i.e., without the port), I get the same error. Stack trace for this:
When I specify the input as hdfs:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta, the tool tries to access hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta. Stack trace for this:
The above makes sense; that's why I added the hostname and port for the namenode.
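A plausible source of the :-1 (an assumption of mine, not verified against the GATK code): java.net.URI reports -1 for an unspecified port, so code that fills in a default host but stringifies the port unchecked would produce exactly the URI seen above:

```java
import java.net.URI;

public class PortDemo {
    public static void main(String[] args) {
        URI u = URI.create("hdfs:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta");

        // With no authority in the URI, java.net.URI reports null/-1, so a
        // naive host + ":" + port concatenation yields the observed ":-1".
        System.out.println(u.getHost());  // null
        System.out.println(u.getPort());  // -1
        System.out.println("hdfs://cromwellhadooptest:" + u.getPort() + u.getPath());
        // -> hdfs://cromwellhadooptest:-1/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta
    }
}
```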
It seems that, after verifying that the file hdfs://cromwellhadooptest/user/hadoop/gatk/common/human_g1k_v37.20.21.fasta exists, some code transforms the path into file:///user/hadoop/gatk/common/human_g1k_v37.20.21.fasta.