-
Notifications
You must be signed in to change notification settings - Fork 587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GATKPathSpecifier URI class and update FeatureInput. #5526
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5526 +/- ##
==============================================
- Coverage 87.037% 87.008% -0.03%
- Complexity 31537 31604 +67
==============================================
Files 1930 1934 +4
Lines 145455 145678 +223
Branches 16090 16108 +18
==============================================
+ Hits 126600 126751 +151
- Misses 12996 13043 +47
- Partials 5859 5884 +25
|
@bbimber FYI - when this PR goes in I think it will require some minor changes to VariantQC if you use tagging syntax. |
@cmnbroad thanks and we expect to need to update our tool w/ GATK changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Looks good overall but I have some comments.
@@ -321,7 +192,11 @@ public boolean equals(final Object other) { | |||
} | |||
|
|||
final FeatureInput<?> otherFeature = (FeatureInput<?>)other; | |||
return name.equals(otherFeature.name) && featureFile.equals(otherFeature.featureFile); | |||
if (!Objects.equals(getTag(), otherFeature.getTag())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a mismatch between equals
and hashCode
. super.hashCode
uses getTagAttributes
in the hash code, and equals ignore it. So a pair of FeatureInputs that have the same tag but different tagAttributes will be equal but have different hashCodes which is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good catch.
|
||
@Override | ||
public Path toPath() { | ||
// special case GCS, in case the filesystem provider wasn't installed properly but is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this use IOUtils.getPath()
? That also handles the case for non-gcs user provided filesystems in spark which this doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My plan was to (eventually) eliminate all of the call sites that depend on IOUtils.getPath
, and have this be the one true way to path resolution. But that will take several PRs (i.e, after this one I'll probably do one to clean up all the genomicsDB code paths, then another one to make reads/reference inputs and outputs use URI, and then one for the remaining uses). In the meantime I don't want this to call intoIOUtils.getPath
because that creates bogus Paths for things that are not backed by a provider, and there are code paths in gatk that depend on that (I added a comment to the code where we do that as part of this PR that you commented on).
Path p = BucketUtils.getPathOnGcs(getURIString()); | ||
inputStream = Files.newInputStream(p); | ||
} else if (getURI().getScheme().equals(HDFS_SCHEME)) { | ||
org.apache.hadoop.fs.Path file = new org.apache.hadoop.fs.Path(getURIString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this hardcoded support for hadoop Path or do we just want to rely on the admittedly fairly untested hadoop NIO provider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I included this because its what BucketUtils.openFile
does (and thus what I needed for my bigger version of this branch where I used the URI class for all reads and reference inputs). But maybe it isn't necessary ? Somehow I thought the jsr203hadoop provider was flaky, but I'm not sure where I got that from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be flakey, we haven't done any sort of stress testing on it and it's not really maintained as far as I know. My guess is that's it's fine for getting a stream though.
|
||
try { | ||
InputStream inputStream; | ||
if (getURI().getScheme().equals(GCS_SCHEME)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check for google-ness seems unnecessary, why not call toPath() which already has this special cased.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think you're right, for the GCS one.
} | ||
} | ||
|
||
//TODO: should this wrap the stream in a .gz in a BlockCompressedInputStream like IOUtils does? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should... A lot of places expect raw compressed streams and we need a way of accessing that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - this comment is pretty old.
|
||
public class PathSpecifierUnitTest { | ||
|
||
final static String FS_SEPARATOR = FileSystems.getDefault().getSeparator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for windows purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
|
||
//***************************************************************************************** | ||
// Reference that contain characters that require URI-encoding. If the input string is presented | ||
// without no scheme, it will be be automatically encoded by PathSpecifier, otherwise it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo without no
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
{"file:///path/to/localFile.bam", "file:///path/to/localFile.bam", true, true}, // empty authority | ||
|
||
//***************************************************************************** | ||
// Valid URIs which are NOT valid NIO paths (no installed file system provider) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing to me, it seems like a bunch of the ones following this are valid and have paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sigh. True, but its because only I have multiple branches with this code in different repos, and I'm trying to keep them as in-sync as possible in order to make propagating code review changes easier. In htjsdk-next, there are no providers for these so they're not valid there; when I moved them here I just changed the expected boolean to reflect that to make them pass. I'll have to reconcile them somehow.
{"file:/project/gvcf-pcr/23232_1#1/1.g.vcf.gz"}, // not encoded | ||
{"file:project/gvcf-pcr/23232_1#1/1.g.vcf.gz"}, // scheme-specific part is not hierarchical | ||
|
||
// The hadoop file system provider explicitly throws an NPE if no host is specified and HDFS is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this issue caused all sorts of confusion in the early days of spark before dataproc.
@Test | ||
public void testStdIn() throws IOException { | ||
final PathURI htsURI = new PathSpecifier( | ||
SystemUtils.IS_OS_WINDOWS ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh god, I feel your pain much more after looking through these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular code is actually bogus; some Windows apps use "-" to mean "stdout", but its not standard or built-in to the file system namespace, so this doesn't work in most cases. I've changed this to throw a SkipException
on Windows.
I think I responded to everything. Renamed isNIO to hasFileSystemProvider and keeping it for now. back to @lbergelson. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad One comment, I think there's still an issue with getPath. Let me know if you disagree with my interpretation of it.
Otherwise 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me now. Merge when tests pass.
@lbergelson I added the ClassLoader fallback code in to |
Also, once the tests pass, I'm going to rebase on master and then run them one more time. |
3b8232f
to
873032c
Compare
873032c
to
184e0b4
Compare
URI class, fixes #4343 for feature inputs.
NOTE: this changes the command line argument tagging syntax by moving the tags and attributes from being part of the argument value to being part of the argument name. i.e.
--resource known,known=true,prior=10.0:myFile
becomes:
--resource:known,known=true,prior=10.0 myFile