Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow explicit index file paths for inputs to CrosscheckFingerprints #1823

Merged
merged 16 commits into from
Sep 8, 2022

Conversation

rickymagner
Copy link
Contributor

@rickymagner rickymagner commented Aug 16, 2022

Description

This PR addresses an issue regarding data storage solutions where indexed files can be separated from their indices. CrosscheckFingerprints will now allow for an INPUT_INDEX_MAP tsv file with two columns: the first being the list of input files, and the latter being the corresponding index file locations. One can also do the same for the SECOND_INPUT arguments using SECOND_INPUT_INDEX_MAP.

Some tests have been added to test this functionality. This involved copying some test files into a separate dir from index files which were generated in a different directory. The test generates the INPUT_INDEX_MAP file on the fly so as to not rely on any absolute file paths.

Although the tool still runs if an index is not provided, a new warning will be written if providing an index map without mapping each input to an index. If an index file path is typed incorrectly in the map, an error should be thrown by some dependency classes about not being able to import a valid index from the given path.


Checklist

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

@rickymagner
Copy link
Contributor Author

Here is the update to CrosscheckFingerprints as promised. I would appreciate a review from anyone who has time to take a look, especially any number of: @droazen, @yfarjoun, @meganshand, @kachulis, @lbergelson. Thanks!

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rickymagner!

@rickymagner
Copy link
Contributor Author

Added another commit to allow for an INPUT_FORCE_INDEX argument. When toggled on, this checks that all files read for fingerprinting are using indices, either from those provided in INPUT_INDEX_MAP or found by the usual methods. The corresponding test for this PR has been updated to toggle this on (as it's off by default). The SECOND_INPUTs have the same option added.

I'm working on one more test to make sure when this option is on and inputs don't have a valid index, then the right exception should be thrown.

@rickymagner
Copy link
Contributor Author

In the meantime, I refactored the SAM file parsing a bit at this line. Previously this had a hot fix to address this issue.

It would be nice if someone with a bit more knowledge about SamInputResource could comment as to whether my code would fall into the same problem, or if it's avoided by using this different htsjdk code. Maybe one of @yfarjoun / @lbergelson / @cmnbroad if they have a moment? Thanks

@rickymagner
Copy link
Contributor Author

Added the last test as promised above for the INPUT_INDEX_FORCE option, and addressed the comments above. The only thing I'm waiting on is a bit more review + someone potentially addressing this concern.

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much cleaner and thank you for adding the additional tests!

private SamReader getSamReader(final Path samFile, final Path indexPath, final boolean forceIndex) {
final SamInputResource samResource = SamInputResource.of(samFile);
if (indexPath != null) {
samResource.index(indexPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
samResource.index(indexPath);
samResource.index(indexPath, seekableChannelFunction);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just implemented this change and checked this should be equivalent to the old code. I left a comment pointing to the relevant issue in case someone tries to delete the seekableChannelFunction wrapper.

Copy link
Contributor

@kachulis kachulis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of small things, other wise looks good. Thanks @rickymagner!

@@ -672,27 +684,19 @@ public Map<FingerprintIdDetails, Fingerprint> fingerprintFiles(final Collection<
final Map<FingerprintIdDetails, Fingerprint> oneFileFingerprints;
log.debug("Processed file: " + p.toUri().toString() + " (" + filesRead.get() + ")");

// Determine whether valid index path given for this file
final Path indexPath = (indexPathMap != null && indexPathMap.containsKey(p)) ? indexPathMap.get(p) : null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to check that p is in indexPathMap. indexPathMap.get(p) will return null if it can't find p in the map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, removed the second boolean expression for the ternary check

src/main/java/picard/fingerprint/FingerprintChecker.java Outdated Show resolved Hide resolved
public Map<String, Fingerprint> loadFingerprints(final Path fingerprintFile, final String specificSample) {
final VCFFileReader reader = new VCFFileReader(fingerprintFile, false);
public Map<String, Fingerprint> loadFingerprints(final Path fingerprintFile, final Path indexPath, final boolean forceIndex, final String specificSample) {
final VCFFileReader reader = getVCFReader(fingerprintFile, indexPath, forceIndex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about

final VCFFileReader reader = new VCFFileReader(fingerprintFile, indexPath, forceIndex)

it looks like that constructor should handle a null indexPath correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually causes a bunch of tests to fail (apparently due to some null pointer exception). Although it seems like the method accepts null pointers, it looks like htsjdk handles that by assuming the index is next to the input file with extension .tbi which many of our test files don't have. I think it might be safer to leave it as is for now. This also allows for the custom PicardException to display when the user forces index file paths but they cannot be found.

final SamInputResource samResource = SamInputResource.of(samFile);
if (indexPath != null) {
// Use of seekableChannelFunction here avoids issue: https://github.com/broadinstitute/picard/issues/1175
samResource.index(indexPath, seekableChannelFunction);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked about this offline, but it looks like when using a SamInputResource, you need to explicitly give it an index if you want one to be used. So you need to find the index if you haven't been given it explicitly. See, as an example, the SamReader open method which takes a Path. Would also be good to add a test to confirm that the index is used when it is available if not explicitly given as input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed and also added a test that should address this.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rickymagner Sorry for the delay. I have some comments. The ones about the UI you can ignore if you feel like this is the cleanest way, but I think having a single input that can have a 3rd column with matched indexes like what David added recently in GATK (and we can maybe steal the loading code for) would be clearer.

I also am not sure I understand why you need 2 force index arguments, it seems extremely error prone. I will 100% forget to force the second set of indexes every time.

@lbergelson
Copy link
Member

@rickymagner Thanks for addressing those. Some comments are more aspirational than practical. Looks good to me.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@kachulis kachulis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @rickymagner!

@rickymagner rickymagner merged commit fc87aed into master Sep 8, 2022
@rickymagner rickymagner deleted the rm_fingerprint_allow_index_paths branch September 8, 2022 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants