GATKPathSpecifier URI class and update FeatureInput. #5526

Merged: cmnbroad merged 1 commit into master from cn_uri_feature_inputs on Jan 29, 2019

Collaborator

@cmnbroad cmnbroad commented Dec 17, 2018

URI class, fixes #4343 for feature inputs.

NOTE: this changes the command-line argument tagging syntax: the tags and attributes move from being part of the argument value to being part of the argument name. For example:

--resource known,known=true,prior=10.0:myFile

becomes:

--resource:known,known=true,prior=10.0 myFile
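
To make the new shape concrete, here is a minimal sketch of splitting a tagged argument name into its name, tag, and attributes. It is only an illustration: `TaggedArgumentSketch`, `TaggedName`, and `parse` are hypothetical names, not the parser GATK actually uses, and the sketch assumes the first comma-separated item after the colon is the tag and the rest are `key=value` attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TaggedArgumentSketch {

    /** Hypothetical holder for the pieces of a tagged argument name. */
    static final class TaggedName {
        final String argumentName;
        final String tag; // null when the name carries no tag
        final Map<String, String> attributes = new LinkedHashMap<>();

        TaggedName(final String argumentName, final String tag) {
            this.argumentName = argumentName;
            this.tag = tag;
        }
    }

    /** Splits "name:tag,key=value,..." into its parts (assumed layout, see lead-in). */
    static TaggedName parse(final String rawName) {
        final int colon = rawName.indexOf(':');
        if (colon < 0) {
            return new TaggedName(rawName, null);
        }
        final String[] parts = rawName.substring(colon + 1).split(",");
        final TaggedName result = new TaggedName(rawName.substring(0, colon), parts[0]);
        for (int i = 1; i < parts.length; i++) {
            final int eq = parts[i].indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value but found: " + parts[i]);
            }
            result.attributes.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
        }
        return result;
    }

    public static void main(final String[] args) {
        // The argument value (myFile) is now separate from the tag and attributes.
        final TaggedName t = parse("resource:known,known=true,prior=10.0");
        System.out.println(t.argumentName + " tag=" + t.tag + " attributes=" + t.attributes);
    }
}
```

With the new syntax the value (`myFile`) no longer needs to be inspected for tags at all, which is what lets it be treated as a plain path or URI.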

codecov-io commented Dec 17, 2018

Codecov Report

Merging #5526 into master will decrease coverage by 0.03%.
The diff coverage is 78.67%.

@@              Coverage Diff               @@
##              master     #5526      +/-   ##
==============================================
- Coverage     87.037%   87.008%   -0.03%     
- Complexity     31537     31604      +67     
==============================================
  Files           1930      1934       +4     
  Lines         145455    145678     +223     
  Branches       16090     16108      +18     
==============================================
+ Hits          126600    126751     +151     
- Misses         12996     13043      +47     
- Partials        5859      5884      +25
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...tools/walkers/haplotypecaller/HaplotypeCaller.java | 84.211% <ø> (ø) | 23 <0> (ø) | ⬇️ |
| ...llbender/engine/FeatureSupportIntegrationTest.java | 100% <ø> (ø) | 9 <0> (ø) | ⬇️ |
| ...e/hellbender/engine/FeatureDataSourceUnitTest.java | 91.022% <ø> (ø) | 41 <0> (ø) | ⬇️ |
| ...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 78.523% <100%> (+0.292%) | 40 <2> (+1) | ⬆️ |
| ...lkers/vqsr/VariantRecalibratorIntegrationTest.java | 98.969% <100%> (ø) | 18 <0> (ø) | ⬇️ |
| ...der/tools/walkers/CombineGVCFsIntegrationTest.java | 87.448% <100%> (ø) | 24 <0> (ø) | ⬇️ |
| .../org/broadinstitute/hellbender/engine/PathURI.java | 100% <100%> (ø) | 1 <1> (?) | |
| ...kers/filters/VariantFiltrationIntegrationTest.java | 100% <100%> (ø) | 26 <0> (ø) | ⬇️ |
| ...alkers/varianteval/VariantEvalIntegrationTest.java | 100% <100%> (ø) | 42 <0> (ø) | ⬇️ |
| ...institute/hellbender/utils/io/IOUtilsUnitTest.java | 79.268% <100%> (+0.063%) | 47 <0> (+1) | ⬆️ |

... and 37 more

Collaborator Author

cmnbroad commented Jan 7, 2019

@bbimber FYI - when this PR goes in I think it will require some minor changes to VariantQC if you use tagging syntax.

Contributor

bbimber commented Jan 7, 2019

@cmnbroad Thanks. We expect to need to update our tool along with GATK changes.

Member

@lbergelson lbergelson left a comment

@cmnbroad Looks good overall but I have some comments.

@@ -321,7 +192,11 @@ public boolean equals(final Object other) {
}

final FeatureInput<?> otherFeature = (FeatureInput<?>)other;
return name.equals(otherFeature.name) && featureFile.equals(otherFeature.featureFile);
if (!Objects.equals(getTag(), otherFeature.getTag())) {
Member

I think there is a mismatch between equals and hashCode. super.hashCode uses getTagAttributes in the hash code, and equals ignores it. So a pair of FeatureInputs that have the same tag but different tagAttributes will be equal but have different hashCodes, which is broken.

Collaborator Author

Yeah, good catch.
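
For readers following along, the contract at issue is that any field consulted by `hashCode` must also be consulted by `equals` (or neither should use it). A minimal sketch, assuming a hypothetical `TaggedInputSketch` class with a path, a tag, and tag attributes; this is not the actual `FeatureInput` code, and the thread does not say which of the two methods was changed to resolve the mismatch:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class TaggedInputSketch {
    private final String path;
    private final String tag;
    private final Map<String, String> tagAttributes;

    public TaggedInputSketch(final String path, final String tag, final Map<String, String> tagAttributes) {
        this.path = path;
        this.tag = tag;
        // Defensive copy so later changes to the caller's map cannot alter equality.
        this.tagAttributes = Collections.unmodifiableMap(new HashMap<>(tagAttributes));
    }

    @Override
    public boolean equals(final Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof TaggedInputSketch)) {
            return false;
        }
        final TaggedInputSketch that = (TaggedInputSketch) other;
        // Compare exactly the fields that hashCode() hashes.
        return Objects.equals(path, that.path)
                && Objects.equals(tag, that.tag)
                && Objects.equals(tagAttributes, that.tagAttributes);
    }

    @Override
    public int hashCode() {
        return Objects.hash(path, tag, tagAttributes);
    }

    public static void main(final String[] args) {
        final Map<String, String> attrs = new HashMap<>();
        attrs.put("prior", "10.0");
        final TaggedInputSketch a = new TaggedInputSketch("myFile", "known", attrs);
        final TaggedInputSketch b = new TaggedInputSketch("myFile", "known", attrs);
        final TaggedInputSketch c = new TaggedInputSketch("myFile", "known", new HashMap<>());
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // equal objects, equal hashes
        System.out.println(a.equals(c));                                  // attributes differ
    }
}
```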


@Override
public Path toPath() {
// special case GCS, in case the filesystem provider wasn't installed properly but is available.
Member

Shouldn't this use IOUtils.getPath()? That also handles the case for non-gcs user provided filesystems in spark which this doesn't.

Collaborator Author

@cmnbroad cmnbroad Jan 10, 2019

My plan was to (eventually) eliminate all of the call sites that depend on IOUtils.getPath, and have this be the one true way to do path resolution. But that will take several PRs (i.e., after this one I'll probably do one to clean up all the genomicsDB code paths, then another one to make reads/reference inputs and outputs use URI, and then one for the remaining uses). In the meantime I don't want this to call into IOUtils.getPath because that creates bogus Paths for things that are not backed by a provider, and there are code paths in gatk that depend on that (I added a comment to the code where we do that as part of this PR that you commented on).
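
As an illustration of what "backed by a provider" means here, the sketch below checks whether a URI's scheme matches one of the installed NIO file system providers. `ProviderCheckSketch` and `hasInstalledProvider` are hypothetical names; this is an assumption about the general idea, not the implementation in this PR, and it deliberately ignores providers that are only reachable through a non-system class loader.

```java
import java.net.URI;
import java.nio.file.spi.FileSystemProvider;

public class ProviderCheckSketch {

    /** Returns true if an installed NIO provider claims the URI's scheme. */
    static boolean hasInstalledProvider(final URI uri) {
        final String scheme = uri.getScheme();
        if (scheme == null) {
            return false; // a relative reference has no scheme to look up
        }
        for (final FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            if (provider.getScheme().equalsIgnoreCase(scheme)) {
                return true;
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        // The "file" provider is always installed; the second scheme is made up.
        System.out.println(hasInstalledProvider(URI.create("file:///path/to/localFile.bam")));
        System.out.println(hasInstalledProvider(URI.create("nosuchscheme://bucket/localFile.bam")));
    }
}
```

A check like this lets a caller distinguish "syntactically valid URI" from "URI that can actually be turned into a Path", rather than manufacturing a Path that nothing can open.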

Path p = BucketUtils.getPathOnGcs(getURIString());
inputStream = Files.newInputStream(p);
} else if (getURI().getScheme().equals(HDFS_SCHEME)) {
org.apache.hadoop.fs.Path file = new org.apache.hadoop.fs.Path(getURIString());
Member

Do we want this hardcoded support for hadoop Path or do we just want to rely on the admittedly fairly untested hadoop NIO provider?

Collaborator Author

@cmnbroad cmnbroad Jan 10, 2019

Good question. I included this because it's what BucketUtils.openFile does (and thus what I needed for my bigger version of this branch where I used the URI class for all reads and reference inputs). But maybe it isn't necessary? Somehow I thought the jsr203hadoop provider was flaky, but I'm not sure where I got that from.

Member

It might be flaky; we haven't done any sort of stress testing on it and it's not really maintained as far as I know. My guess is that it's fine for getting a stream though.


try {
InputStream inputStream;
if (getURI().getScheme().equals(GCS_SCHEME)) {
Member

This check for google-ness seems unnecessary. Why not call toPath(), which already has this special-cased?

Collaborator Author

Yeah, I think you're right, for the GCS one.

}
}

//TODO: should this wrap the stream in a .gz in a BlockCompressedInputStream like IOUtils does?
Member

I don't think it should... A lot of places expect raw compressed streams and we need a way of accessing that...

Collaborator Author

Agreed - this comment is pretty old.
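
The distinction discussed above can be sketched as two separate accessors: one that returns the bytes exactly as stored, and one that the caller opts into when it wants decompression. This is a sketch under assumptions: `RawStreamSketch`, `openRaw`, and `openDecompressed` are hypothetical names, and `java.util.zip.GZIPInputStream` stands in for htsjdk's `BlockCompressedInputStream` so that the example stays self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class RawStreamSketch {

    /** Returns the bytes exactly as stored, compressed or not. */
    static InputStream openRaw(final Path path) throws IOException {
        return Files.newInputStream(path);
    }

    /** Decompression is an explicit choice made by the caller, layered on the raw stream. */
    static InputStream openDecompressed(final Path path) throws IOException {
        return new GZIPInputStream(openRaw(path));
    }

    /** Reads a stream to the end (avoids relying on newer JDK convenience methods). */
    static byte[] readAll(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final Path tmp = Files.createTempFile("raw-stream-sketch", ".gz");
        try {
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
                out.write("hello".getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream raw = openRaw(tmp)) {
                // The first two raw bytes are the gzip magic number.
                System.out.printf("raw header: %02x %02x%n", raw.read(), raw.read());
            }
            try (InputStream in = openDecompressed(tmp)) {
                System.out.println(new String(readAll(in), StandardCharsets.UTF_8));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```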


public class PathSpecifierUnitTest {

final static String FS_SEPARATOR = FileSystems.getDefault().getSeparator();
Member

This is for windows purposes?

Collaborator Author

Yes.


//*****************************************************************************************
// Reference that contain characters that require URI-encoding. If the input string is presented
// without no scheme, it will be be automatically encoded by PathSpecifier, otherwise it
Member

Typo: "without no".

Collaborator Author

Done.
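
The encoding behavior described in that test comment can be reproduced with `java.net.URI` alone. The sketch below is an illustration of the underlying JDK behavior, not of PathSpecifier itself; the class and method names are hypothetical, and the path is the one used in the test data for this PR. The multi-argument `URI` constructor quotes characters that are illegal in a path, while parsing an already-assembled string treats `#` as the start of a fragment.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HashEncodingSketch {

    /** Builds a file URI from an unencoded path; illegal characters such as '#' are quoted. */
    static URI fromUnencodedPath(final String path) throws URISyntaxException {
        return new URI("file", null, path, null);
    }

    /** Parses a string that is assumed to already be a well-formed URI. */
    static URI fromUriString(final String uriString) {
        return URI.create(uriString);
    }

    public static void main(final String[] args) throws URISyntaxException {
        final String rawPath = "/project/gvcf-pcr/23232_1#1/1.g.vcf.gz";

        final URI encoded = fromUnencodedPath(rawPath);
        System.out.println(encoded);           // the '#' is written as %23
        System.out.println(encoded.getPath()); // decoding restores the original path

        final URI parsed = fromUriString("file:" + rawPath);
        System.out.println(parsed.getPath());     // path stops before the '#'
        System.out.println(parsed.getFragment()); // the rest is treated as a fragment

        // Without a leading '/', the scheme-specific part is not hierarchical.
        final URI opaque = fromUriString("file:project/gvcf-pcr/23232_1#1/1.g.vcf.gz");
        System.out.println(opaque.isOpaque() + " " + opaque.getPath());
    }
}
```

This is why the same characters are handled differently depending on whether the input arrives with a scheme: a schemeless string can be safely encoded, whereas a string that already claims to be a URI has to be taken at face value.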

{"file:///path/to/localFile.bam", "file:///path/to/localFile.bam", true, true}, // empty authority

//*****************************************************************************
// Valid URIs which are NOT valid NIO paths (no installed file system provider)
Member

This is confusing to me; it seems like a bunch of the ones following this are valid and have paths.

Collaborator Author

@cmnbroad cmnbroad Jan 10, 2019

Sigh. True, but it's only because I have multiple branches with this code in different repos, and I'm trying to keep them as in-sync as possible in order to make propagating code review changes easier. In htsjdk-next, there are no providers for these so they're not valid there; when I moved them here I just changed the expected boolean to reflect that to make them pass. I'll have to reconcile them somehow.

{"file:/project/gvcf-pcr/23232_1#1/1.g.vcf.gz"}, // not encoded
{"file:project/gvcf-pcr/23232_1#1/1.g.vcf.gz"}, // scheme-specific part is not hierarchical

// The hadoop file system provider explicitly throws an NPE if no host is specified and HDFS is not
Member

I feel like this issue caused all sorts of confusion in the early days of spark before dataproc.

@Test
public void testStdIn() throws IOException {
final PathURI htsURI = new PathSpecifier(
SystemUtils.IS_OS_WINDOWS ?
Member

oh god, I feel your pain much more after looking through these tests.

Collaborator Author

This particular code is actually bogus; some Windows apps use "-" to mean "stdout", but it's not standard or built into the file system namespace, so this doesn't work in most cases. I've changed this to throw a SkipException on Windows.

@lbergelson lbergelson assigned cmnbroad and unassigned lbergelson Jan 8, 2019
@cmnbroad
Collaborator Author

I think I responded to everything. Renamed isNIO to hasFileSystemProvider and keeping it for now. Back to @lbergelson.

Member

@lbergelson lbergelson left a comment

@cmnbroad One comment, I think there's still an issue with getPath. Let me know if you disagree with my interpretation of it.

Otherwise 👍

Member

@lbergelson lbergelson left a comment

👍 Looks good to me now. Merge when tests pass.

@cmnbroad
Collaborator Author

@lbergelson I added the ClassLoader fallback code into GATKPathSpecifier. In doing so, I expected some of the PathSpecifier negative unit tests to produce different exceptions, but they didn't. This was because the tests were all using the PathSpecifier base class (because I took them from htsjdk-next, which has only the base implementation). So I updated the tests to use GATKPathSpecifier (they all passed), then integrated the fallback code and changed 3 negative tests that went from FileSystemNotFoundException to ProviderNotFoundException. Finally, I renamed the test class file to correctly reflect the class being tested. Sorry this was such a pain.
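
The change in exception type follows from which JDK entry point ends up failing. A minimal sketch of a class-loader fallback, assuming a hypothetical `ProviderFallbackSketch.resolve` helper; this is not the GATKPathSpecifier code, only an illustration of the two standard NIO calls involved: `Paths.get(URI)` reports a missing provider as `FileSystemNotFoundException`, while `FileSystems.newFileSystem(URI, Map, ClassLoader)` reports it as `ProviderNotFoundException`.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.ProviderNotFoundException;
import java.util.Collections;

public class ProviderFallbackSketch {

    /**
     * Resolves a URI to a Path, retrying with the context class loader when the
     * installed providers cannot handle the scheme.
     */
    static Path resolve(final URI uri) throws IOException {
        try {
            return Paths.get(uri);
        } catch (final FileSystemNotFoundException e) {
            // Providers bundled with the application may be invisible to the system
            // class loader, so search the context class loader as well.
            final ClassLoader loader = Thread.currentThread().getContextClassLoader();
            return FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap(), loader)
                    .provider()
                    .getPath(uri);
        }
    }

    public static void main(final String[] args) throws IOException {
        System.out.println(resolve(Paths.get("example.vcf").toUri()));
        try {
            resolve(URI.create("nosuchscheme://host/example.vcf")); // made-up scheme
        } catch (final ProviderNotFoundException e) {
            // With the fallback in place, the failure surfaces from newFileSystem.
            System.out.println("ProviderNotFoundException: " + e.getMessage());
        }
    }
}
```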

@cmnbroad
Collaborator Author

Also, once the tests pass, I'm going to rebase on master and then run them one more time.

@cmnbroad cmnbroad merged commit 0238d2a into master Jan 29, 2019
@lbergelson lbergelson deleted the cn_uri_feature_inputs branch January 29, 2019 15:42
Successfully merging this pull request may close these issues.

# in file names converted to %23 resulting in file not found
5 participants