Enabled streaming of counts.tsv/counts.tsv.gz files in gCNV CLIs. #6266

Merged (7 commits) on Dec 2, 2019

Conversation

samuelklee
Contributor

This is to substantially reduce disk costs in high-resolution WGS gCNV runs, per #5716. As discussed elsewhere, we can enable indexing/gzipping/streaming in the gCNV WDLs themselves, but this should happen after updating to WDL 1.0 (which we need for optional localization).

This PR only partially addresses that issue, since we could make more sweeping changes in the abstract CNV collection classes. However, I did make a small change to TableReader that allows all TSV/CNV collection files to be gzipped.

I fixed format specification in the CollectReadCounts WDL task, which was kind of wonky and incorrect. It's still kind of wonky (due to WDL limitations), but it should be correct. Some exception handling is now done in bash.

I also had to fix some missing newlines at EOFs. One such missing newline in the test counts file caused indexing of the gzipped version of the file to miss the last count upon querying during initial testing. Although probably unnecessary, I changed JSON writing in gCNV to include such newlines.
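
For context, the TableReader change amounts to opening the input through a gzip-aware reader; the sketch below illustrates the general approach only (it is not the PR's actual code, and the openPossiblyGzippedReader helper name is hypothetical):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: returns a reader over a TSV, transparently decompressing
// when the file name ends in ".gz". Since bgzip output is valid gzip, this also
// reads block-compressed files.
static BufferedReader openPossiblyGzippedReader(final Path path) throws IOException {
    InputStream in = Files.newInputStream(path);
    if (path.toString().endsWith(".gz")) {
        in = new GZIPInputStream(in);
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
}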

@@ -224,6 +273,7 @@ task CollectCounts {
output {
String entity_id = base_filename
File counts = counts_filename
File counts_idx = if enable_indexing_ then "${base_filename}.${counts_index_filename_extension}" else "/dev/null"
Contributor Author

Not sure if there is a better way to handle optional outputs... can the reviewer check?

Contributor

WDL doesn't really handle optional outputs from tasks, so ideally I would avoid it altogether. Otherwise I would lean towards using a blank file, since that wouldn't cause any downstream problems when no index is expected. I'm not sure what /dev/null will produce, but it may be fine if you tested it and it runs.

Contributor Author

Oops, forgot to do this one.

Contributor Author
@samuelklee Nov 26, 2019

Hmm, let me just remove this output for now. Downstream WGS gCNV WDLs can restore it and force indexing for the time being. We can try to move towards tsv.gz + indexing being the only behavior allowed by this task, if not the Java code paths.

@samuelklee
Contributor Author

@mwalker174 for some reason you don't pop up in the list of reviewers (maybe related to the recent GitHub snafu)? In any case, think you could take a look?

Contributor
@mwalker174 left a comment

I have some mostly minor comments. As I think we discussed in a meeting, we can hold off on changing the WDL at this point. Seeing how complicated it is with different formats makes me want to just choose one (probably compressed .tsv, since it permits streaming and therefore offers the most functionality), unless there is a strong case for hdf5 performance-wise.

Adding GCS functionality to the counts collection class seems like it could be organized a little better. I have a suggestion or two about this that you can do now, but long-term I think we should aim to have a single code path for reading TSV files, whether local or on GCS. That will require some substantial refactoring, but I think we should be using the engine-level Feature functionality instead of the TableReader for Locatable-type I/O.

Edit: as an afterthought, what about a LocatableTableReader class?

@@ -182,6 +182,7 @@ task CollectCounts {
File ref_fasta
File ref_fasta_fai
File ref_fasta_dict
Boolean? enable_indexing
Contributor

You can change this to just Boolean enable_indexing = false and forego the select_first below

Contributor Author

I think we decided, as a matter of style across all CNV WDLs, not to specify values for optional parameters in this way, so that default values for parameters that are simply passed through to the command line are located in the command-line invocation itself. Maybe we can change everything consistently in the WDL 1.0 update PR if it makes sense to do so.

Contributor Author

Wait, also see #2858 (comment).

Contributor

Interesting. I am unsure how the optional works in this case, i.e., what the difference is between Boolean enable_indexing = false and Boolean? enable_indexing = false. Both are allowed, and we use the former in the SV pipeline, although I have not tested that it can be overridden in the JSON.

@@ -182,6 +182,7 @@ task CollectCounts {
File ref_fasta
File ref_fasta_fai
File ref_fasta_dict
Boolean? enable_indexing
String? format
Contributor

String format = "HDF5"

Contributor Author

Same thing here.

/**
* Determines the integer ploidy state of all contigs for germline samples given counts data. These should be either
* HDF5 or TSV count files generated by {@link CollectReadCounts}.
* HDF5 or TSV count files generated by {@link CollectReadCounts}; TSV files may gzipped, but must then have filenames
Contributor

I would mention block compression specifically. Also, is the index always expected, or just when streaming?

Contributor Author

I think gzipped implies block compression, but I'm not actually sure which implementation HTSJDK is compatible with. Perhaps I'll specify compression with bgzip, since that is presumably what we will use? The index is only expected when streaming.

Contributor

Yes sorry, I meant block compression in the way that bgzip implements it. Specifying bgzip would be clearer.
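
As an aside, htsjdk can check for BGZF block compression directly if we ever want to validate inputs; a minimal sketch (the file path is illustrative):

import htsjdk.samtools.util.BlockCompressedInputStream;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class BgzfCheck {
    public static void main(final String[] args) throws IOException {
        // Illustrative path; a real counts file would come from CollectReadCounts.
        try (final InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get("sample.counts.tsv.gz")))) {
            // isValidFile peeks at the BGZF magic bytes and requires a mark-supporting
            // stream, hence the BufferedInputStream wrapper.
            System.out.println("BGZF block-compressed: " + BlockCompressedInputStream.isValidFile(in));
        }
    }
}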

Utils.validate(BucketUtils.isCloudStorageUrl(path), "Read-count path must be a Google Cloud Storage URL.");
Utils.validate(new SimpleCountCodec().canDecode(path), String.format(
"Read-count file extension must be one of the following: [%s]",
String.join(",", SimpleCountCodec.SIMPLE_COUNT_CODEC_EXTENSIONS)));
Contributor

You don't need to address this in this PR, but I think we should move away from requiring the .counts.tsv suffix and use just .tsv.

Contributor Author

OK, no action. I'm a little hesitant since .tsv is such a common file extension, and I would want to make sure that both the code and formats don't ever get to a place where it's easy to fail silently.

Comment on lines +114 to +115
? SimpleCountCollection.readFromGCS(readCountPath)
: SimpleCountCollection.read(new File(readCountPath));
Contributor

I think we should try to simplify reading counts files a bit by only exposing the read and readAndSubset functions and taking care of the GCS logic inside, i.e., have public read(String path) and public readAndSubset(String path, final List<SimpleInterval> overlapIntervals), but make the GCS functions private.
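
Roughly this shape, reusing the existing readFromGCS/readOverlappingSubsetFromGCS and read(File) paths shown above (an untested sketch meant to live inside SimpleCountCollection; the local subsetting helper is hypothetical):

// Single public entry point; GCS vs. local dispatch happens internally.
public static SimpleCountCollection read(final String path) {
    return BucketUtils.isCloudStorageUrl(path)
            ? readFromGCS(path)          // would become private
            : read(new File(path));      // existing local HDF5/TSV path
}

// Analogous overlapping-subset entry point; the GCS-specific method would also become private.
public static SimpleCountCollection readAndSubset(final String path,
                                                  final List<SimpleInterval> overlapIntervals) {
    return BucketUtils.isCloudStorageUrl(path)
            ? readOverlappingSubsetFromGCS(path, overlapIntervals)
            : subsetLocally(read(new File(path)), overlapIntervals);   // hypothetical local helper
}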

Contributor Author

I thought about this and agree it would be cleaner, but since we have to do the extra step of merging intervals in the GCS case, I think it's OK to encourage the caller to do this step externally (so we don't repeat this work in the loop where these subset methods are called, inexpensive as it may typically be). Unfortunately, this encouragement is only given via the method docs and not by any actual validation.

Contributor

I'm okay with this as long as there are tests for both code paths.

@@ -23,17 +31,23 @@
* @author Samuel Lee <slee@broadinstitute.org>
*/
public final class SimpleCountCollection extends AbstractSampleLocatableCollection<SimpleCount> {
private static final int DEFAULT_FEATURE_QUERY_LOOKAHEAD_IN_BP = 1_000_000;
Contributor

Just curious if you did any optimization here.

Contributor Author

Nope! I think this just mirrors the default value.

@@ -60,13 +74,37 @@ public SimpleCountCollection(final SampleLocatableMetadata metadata,
super(metadata, simpleCounts, SimpleCountCollection.SimpleCountTableColumn.COLUMNS, SIMPLE_COUNT_RECORD_FROM_DATA_LINE_DECODER, SIMPLE_COUNT_RECORD_TO_DATA_LINE_ENCODER);
}

/**
* Read all counts from a file (HDF5 or TSV).
*/
public static SimpleCountCollection read(final File file) {
IOUtils.canReadFile(file);
Contributor

This line is unnecessary now.

Contributor Author
@samuelklee Nov 26, 2019

Same thing about redundant validation here.

Comment on lines +132 to +136
IOUtils.assertFileIsReadable(IOUtils.getPath(path));
Utils.validate(BucketUtils.isCloudStorageUrl(path), "Read-count path must be a Google Cloud Storage URL.");
Utils.validate(new SimpleCountCodec().canDecode(path), String.format(
"Read-count file extension must be one of the following: [%s]",
String.join(",", SimpleCountCodec.SIMPLE_COUNT_CODEC_EXTENSIONS)));
Contributor

These lines aren't necessary, since you do them again in readOverlappingSubsetFromGCS.

Contributor Author

I think we generally tend towards redundant parameter validation as long as it is cheap. It violates DRY, but is more robust to refactoring, etc.

Contributor Author

Actually, I did notice a non-null check that feels redundant in resolveIntervals; I'll remove that.

* list of the original intervals desired to be strictly coincident; this merged list can then be used with this method.
* @param overlapIntervals if {@code null} or empty, all counts will be returned; must be sorted and non-overlapping otherwise
*/
public static SimpleCountCollection readOverlappingSubsetFromGCS(final String path,
Contributor

I would LOVE for all code paths for reading counts to use the same FeatureDataSource code path, as you have done here. I guess this isn't currently possible with HDF5... but we should be able to use FeatureDataSource for local TSVs as well. This doesn't need to be addressed in this PR, but it should be a goal.
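
For reference, the core of the streaming path is just an engine-level FeatureDataSource plus per-interval index-backed queries; roughly (an untested sketch, assuming SimpleCount is decodable as a Feature via SimpleCountCodec and using the lookahead constant from above, not the PR's exact code):

try (final FeatureDataSource<SimpleCount> source =
             new FeatureDataSource<>(path, null, DEFAULT_FEATURE_QUERY_LOOKAHEAD_IN_BP, SimpleCount.class)) {
    final List<SimpleCount> counts = overlapIntervals.stream()
            .map(source::queryAndPrefetch)     // each query pulls only the overlapping records via the index
            .flatMap(List::stream)
            .collect(Collectors.toList());
    // ...then build the SimpleCountCollection from the decoded header metadata and these counts.
}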

Contributor Author

Yes, let's consider phasing out HDF5 count files. They really only give a significant speedup for WGS somatic PoN building; it might be possible to achieve this by just optimizing TableReader instead. See #2858 (comment) for some old numbers.

Contributor Author
@samuelklee commented Nov 26, 2019

Thanks @mwalker174! I think I responded to or addressed everything.

The code paths for reading TSVs all go through the abstract CNV collection classes. Those require a bit of boilerplate, but were IMO a huge improvement over the horrorshow of utility methods from the old code... Happy to discuss possible further refactoring and improvement (and there are already catch-all issues open), if needed.

If we decide to stream other locatable collections, we can start to extract more of these streaming/subsetting methods to AbstractLocatableCollection, which would give us something like the LocatableTableReader you're envisioning in your edit. We've discussed using @jonn-smith's XSVLocatableTable machinery as well. I think the only downsides are the conventional reliance on extensions/config files for decoding, as well as the need to accommodate CNV headers. Encoding is also not handled. We also still need to represent non-Locatable TSVs, ideally with a minimal number of code paths, although that probably won't present any major refactoring issues. Also recall that we discussed moving from Files -> Paths in previous PRs, so we should instead go from Files -> FeatureDataSources where it makes sense.

Contributor
@mwalker174 left a comment

Thanks @samuelklee. If we can, we should aim to remove HDF5 entirely. XsvLocatableTableCodec looks interesting but I agree we would need to figure out how to do away with a config file.

Merge at will!
