Enabled streaming of counts.tsv/counts.tsv.gz files in gCNV CLIs. #6266

Merged (7 commits) on Dec 2, 2019

Conversation

samuelklee
Contributor

This is to substantially reduce disk costs in high-resolution WGS gCNV runs, per #5716. As discussed elsewhere, we can enable indexing/gzipping/streaming in the gCNV WDLs themselves, but this should happen after updating to WDL 1.0 (which we need for optional localization).

This PR only partially addresses that issue, since we could make more sweeping changes in the abstract CNV collection classes. However, I did make a small change to TableReader that allows all TSV/CNV collection files to be gzipped.

I fixed format specification in the CollectReadCounts WDL task, which was kind of wonky and incorrect. It's still kind of wonky (due to WDL limitations), but it should be correct. Some exception handling is now done in bash.

I also had to fix some missing newlines at EOFs. One such missing newline in the test counts file caused indexing of the gzipped version of the file to miss the last count upon querying during initial testing. Although probably unnecessary, I changed JSON writing in gCNV to include such newlines.
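
For context, the TableReader change amounts to opening the input through a gzip-aware reader; the sketch below illustrates the general approach only (it is not the PR's actual code, and the openPossiblyGzippedReader helper name is hypothetical):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: returns a reader over a TSV, transparently decompressing
// when the file name ends in ".gz". Since bgzip output is valid gzip, this also
// reads block-compressed files.
static BufferedReader openPossiblyGzippedReader(final Path path) throws IOException {
    InputStream in = Files.newInputStream(path);
    if (path.toString().endsWith(".gz")) {
        in = new GZIPInputStream(in);
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
}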

@@ -224,6 +273,7 @@ task CollectCounts {
output {
String entity_id = base_filename
File counts = counts_filename
File counts_idx = if enable_indexing_ then "${base_filename}.${counts_index_filename_extension}" else "/dev/null"
Contributor Author

Not sure if there is a better way to handle optional outputs... can the reviewer check?

Contributor

WDL doesn't really handle optional outputs from tasks, so ideally I would avoid it altogether. Otherwise I would lean towards using a blank file, since that wouldn't cause any downstream problems when no index is expected. I'm not sure what /dev/null will produce, but it may be fine if you tested it and it runs.

Contributor Author

Oops, forgot to do this one.

Contributor Author
@samuelklee Nov 26, 2019

Hmm, let me just remove this output for now. Downstream WGS gCNV WDLs can restore it and force indexing for the time being. We can try to move towards tsv.gz + indexing being the only behavior allowed by this task, if not the Java code paths.

@samuelklee
Contributor Author

@mwalker174 for some reason you don't pop up in the list of reviewers (maybe related to the recent GitHub snafu)? In any case, think you could take a look?

Contributor
@mwalker174 left a comment

I have some mostly minor comments. As I think we discussed in a meeting, we can hold off on changing the WDL at this point. Seeing how complicated it is with different formats makes me want to just choose one (probably compressed .tsv, since it permits streaming and therefore offers the most functionality), unless there is a strong case for hdf5 performance-wise.

Adding GCS functionality to the counts collection class seems like it could be organized a little better. I have a suggestion or two about this that you can do now, but long-term I think we should aim to have a single code path for reading TSV files, whether local or on GCS. That will require some substantial refactoring, but I think we should be using the engine-level Feature functionality instead of the TableReader for Locatable-type I/O.

Edit: as an afterthought, what about a LocatableTableReader class?

@@ -182,6 +182,7 @@ task CollectCounts {
File ref_fasta
File ref_fasta_fai
File ref_fasta_dict
Boolean? enable_indexing
Contributor

You can change this to just Boolean enable_indexing = false and forego the select_first below

Contributor Author

I think we decided, as a matter of style across all CNV WDLs, not to specify values for optional parameters in this way, so that default values for parameters that are simply passed through to the command line are located in the command-line invocation itself. Maybe we can change everything consistently in the WDL 1.0 update PR if it makes sense to do so.

Contributor Author

Wait, also see #2858 (comment).

Contributor

Interesting. I am unsure how the optional works in this case, i.e., what the difference is between Boolean enable_indexing = false and Boolean? enable_indexing = false. Both are allowed, and we use the former in the SV pipeline, although I have not tested that it can be overridden in the JSON.

@@ -182,6 +182,7 @@ task CollectCounts {
File ref_fasta
File ref_fasta_fai
File ref_fasta_dict
Boolean? enable_indexing
String? format
Contributor

String format = "HDF5"

Contributor Author

Same thing here.

/**
* Determines the integer ploidy state of all contigs for germline samples given counts data. These should be either
* HDF5 or TSV count files generated by {@link CollectReadCounts}.
* HDF5 or TSV count files generated by {@link CollectReadCounts}; TSV files may gzipped, but must then have filenames
Contributor

I would mention block compression specifically. Also, is the index always expected, or just when streaming?

Contributor Author

I think gzipped implies block compression, but I'm not actually sure which implementation HTSJDK is compatible with. Perhaps I'll specify compression with bgzip, since that is presumably what we will use? The index is only expected when streaming.

Contributor

Yes sorry, I meant block compression in the way that bgzip implements it. Specifying bgzip would be clearer.
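
As an aside, htsjdk can check for BGZF block compression directly if we ever want to validate inputs; a minimal sketch (the file path is illustrative):

import htsjdk.samtools.util.BlockCompressedInputStream;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class BgzfCheck {
    public static void main(final String[] args) throws IOException {
        // Illustrative path; a real counts file would come from CollectReadCounts.
        try (final InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get("sample.counts.tsv.gz")))) {
            // isValidFile peeks at the BGZF magic bytes and requires a mark-supporting
            // stream, hence the BufferedInputStream wrapper.
            System.out.println("BGZF block-compressed: " + BlockCompressedInputStream.isValidFile(in));
        }
    }
}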

Utils.validate(BucketUtils.isCloudStorageUrl(path), "Read-count path must be a Google Cloud Storage URL.");
Utils.validate(new SimpleCountCodec().canDecode(path), String.format(
"Read-count file extension must be one of the following: [%s]",
String.join(",", SimpleCountCodec.SIMPLE_COUNT_CODEC_EXTENSIONS)));
Contributor

You don't need to address this in this PR, but I think we should move away from requiring the .counts.tsv suffix and use just .tsv.

Contributor Author

OK, no action. I'm a little hesitant since .tsv is such a common file extension, and I would want to make sure that both the code and formats don't ever get to a place where it's easy to fail silently.

Comment on lines +114 to +115
? SimpleCountCollection.readFromGCS(readCountPath)
: SimpleCountCollection.read(new File(readCountPath));
Contributor

I think we should try to simplify reading counts files a bit by only exposing the read and readAndSubset functions and taking care of the GCS logic inside, i.e., have public read(String path) and public readAndSubset(String path, final List<SimpleInterval> overlapIntervals), but make the GCS functions private.
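
Roughly this shape, reusing the existing readFromGCS/readOverlappingSubsetFromGCS and read(File) paths shown above (an untested sketch meant to live inside SimpleCountCollection; the local subsetting helper is hypothetical):

// Single public entry point; GCS vs. local dispatch happens internally.
public static SimpleCountCollection read(final String path) {
    return BucketUtils.isCloudStorageUrl(path)
            ? readFromGCS(path)          // would become private
            : read(new File(path));      // existing local HDF5/TSV path
}

// Analogous overlapping-subset entry point; the GCS-specific method would also become private.
public static SimpleCountCollection readAndSubset(final String path,
                                                  final List<SimpleInterval> overlapIntervals) {
    return BucketUtils.isCloudStorageUrl(path)
            ? readOverlappingSubsetFromGCS(path, overlapIntervals)
            : subsetLocally(read(new File(path)), overlapIntervals);   // hypothetical local helper
}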

Contributor Author

I thought about this and agree it would be cleaner, but since we have to do the extra step of merging intervals in the GCS case, I think it's OK to encourage the caller to do this step externally (so we don't repeat this work in the loop where these subset methods are called, inexpensive as it may typically be). Unfortunately, this encouragement is only given via the method docs and not by any actual validation.

Contributor

I'm okay with this as long as there are tests for both code paths.

@@ -23,17 +31,23 @@
* @author Samuel Lee <slee@broadinstitute.org>
*/
public final class SimpleCountCollection extends AbstractSampleLocatableCollection<SimpleCount> {
private static final int DEFAULT_FEATURE_QUERY_LOOKAHEAD_IN_BP = 1_000_000;
Contributor

Just curious if you did any optimization here.

Contributor Author

Nope! I think this just mirrors the default value.

@@ -60,13 +74,37 @@ public SimpleCountCollection(final SampleLocatableMetadata metadata,
super(metadata, simpleCounts, SimpleCountCollection.SimpleCountTableColumn.COLUMNS, SIMPLE_COUNT_RECORD_FROM_DATA_LINE_DECODER, SIMPLE_COUNT_RECORD_TO_DATA_LINE_ENCODER);
}

/**
* Read all counts from a file (HDF5 or TSV).
*/
public static SimpleCountCollection read(final File file) {
IOUtils.canReadFile(file);
Contributor

This line is unnecessary now.

Contributor Author
@samuelklee Nov 26, 2019

Same thing about redundant validation here.

Comment on lines +132 to +136
IOUtils.assertFileIsReadable(IOUtils.getPath(path));
Utils.validate(BucketUtils.isCloudStorageUrl(path), "Read-count path must be a Google Cloud Storage URL.");
Utils.validate(new SimpleCountCodec().canDecode(path), String.format(
"Read-count file extension must be one of the following: [%s]",
String.join(",", SimpleCountCodec.SIMPLE_COUNT_CODEC_EXTENSIONS)));
Contributor

These lines aren't necessary, since you do them again in readOverlappingSubsetFromGCS.

Contributor Author

I think we generally tend towards redundant parameter validation as long as it is cheap. It violates DRY, but is more robust to refactoring, etc.

Contributor Author

Actually, I did notice a non-null check that feels redundant in resolveIntervals; I'll remove that.

* list of the original intervals desired to be strictly coincident; this merged list can then be used with this method.
* @param overlapIntervals if {@code null} or empty, all counts will be returned; must be sorted and non-overlapping otherwise
*/
public static SimpleCountCollection readOverlappingSubsetFromGCS(final String path,
Contributor

I would LOVE for all code paths for reading counts to use the same FeatureDataSource code path, as you have done here. I guess this isn't currently possible with HDF5... but we should be able to use FeatureDataSource for local TSVs as well. This doesn't need to be addressed in this PR, but it should be a goal.
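
For reference, the core of the streaming path is just an engine-level FeatureDataSource plus per-interval index-backed queries; roughly (an untested sketch, assuming SimpleCount is decodable as a Feature via SimpleCountCodec and using the lookahead constant from above, not the PR's exact code):

try (final FeatureDataSource<SimpleCount> source =
             new FeatureDataSource<>(path, null, DEFAULT_FEATURE_QUERY_LOOKAHEAD_IN_BP, SimpleCount.class)) {
    final List<SimpleCount> counts = overlapIntervals.stream()
            .map(source::queryAndPrefetch)     // each query pulls only the overlapping records via the index
            .flatMap(List::stream)
            .collect(Collectors.toList());
    // ...then build the SimpleCountCollection from the decoded header metadata and these counts.
}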

Contributor Author

Yes, let's consider phasing out HDF5 count files. They really only give a significant speedup for WGS somatic PoN building; it might be possible to achieve this by just optimizing TableReader instead. See #2858 (comment) for some old numbers.

Contributor Author
@samuelklee commented Nov 26, 2019

Thanks @mwalker174! I think I responded to or addressed everything.

The code paths for reading TSVs all go through the abstract CNV collection classes. Those require a bit of boilerplate, but were IMO a huge improvement over the horrorshow of utility methods from the old code... Happy to discuss possible further refactoring and improvement (and there are already catch-all issues open), if needed.

If we decide to stream other locatable collections, we can start to extract more of these streaming/subsetting methods to AbstractLocatableCollection, which would give us something like the LocatableTableReader you're envisioning in your edit. We've discussed using @jonn-smith's XSVLocatableTable machinery as well. I think the only downsides are the conventional reliance on extensions/config files for decoding, as well as the need to accommodate CNV headers. Encoding is also not handled. We also still need to represent non-Locatable TSVs, ideally with a minimal number of code paths, although that probably won't present any major refactoring issues. Also recall that we discussed moving from Files -> Paths in previous PRs, so we should instead go from Files -> FeatureDataSources where it makes sense.

Contributor
@mwalker174 left a comment

Thanks @samuelklee. If we can, we should aim to remove HDF5 entirely. XsvLocatableTableCodec looks interesting but I agree we would need to figure out how to do away with a config file.

Merge at will!
