Enabled streaming of counts.tsv/counts.tsv.gz files in gCNV CLIs. #6266
Conversation
…cordWriter to RecordCollectionWriter for consistency.
…d methods for streaming and subsetting counts.
(force-pushed from a784117 to b3b437d)
…L task and disabled indexing in CNV WDLs.
(force-pushed from b3b437d to b5c74ff)
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
@@ -224,6 +273,7 @@ task CollectCounts {
    output {
        String entity_id = base_filename
        File counts = counts_filename
        File counts_idx = if enable_indexing_ then "${base_filename}.${counts_index_filename_extension}" else "/dev/null"
Not sure if there is a better way to handle optional outputs...can reviewer check?
WDL doesn't really handle optional outputs from tasks, so I would avoid it altogether ideally. Otherwise I would lean towards using a blank file, since that wouldn't cause any downstream problems when no index is expected. I'm not sure what /dev/null will produce, but it may be fine if you tested it and it runs.
Oops, forgot to do this one.
Hmm, let me just remove this output for now. Downstream WGS gCNV WDLs can restore it and force indexing for the time being. We can try to move towards tsv.gz + indexing being the only behavior allowed by this task, if not the Java code paths.
@mwalker174 for some reason you don't pop up in the list of reviewers (maybe related to the recent GitHub snafu)? In any case, think you could take a look?
I have some mostly minor comments. As I think we discussed in a meeting, we can hold off on changing the WDL at this point. Seeing how complicated it is with different formats makes me want to just choose one (probably compressed .tsv, since it permits streaming and therefore offers the most functionality), unless there is a strong case for HDF5 performance-wise.
Adding GCS functionality to the counts collection class seems like it could be organized a little better. I have a suggestion or two about this you can do now, but long-term I think we should aim to have a single code path for reading TSV files, whether local or on GCS. That will require some substantial refactoring, but I think we should be using the engine-level Feature functionality instead of the TableReader for Locatable type I/O.

Edit: an afterthought, what about a LocatableTableReader class?
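To make the LocatableTableReader idea concrete, here is a minimal sketch of what such a class might look like. All names below (`LocatableTableReaderSketch`, `SimpleCount`, `decode`, `query`) are invented for illustration; this is not GATK code, and an indexed implementation would seek rather than scan.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of a "LocatableTableReader": parse TSV data lines into
 *  records that carry genomic coordinates and support overlap queries, so TSV
 *  I/O and Feature-based I/O could share one code path. */
public final class LocatableTableReaderSketch {
    public record SimpleCount(String contig, int start, int end, int count) {}

    /** Decode TSV data lines (CONTIG\tSTART\tEND\tCOUNT) into locatable records. */
    public static List<SimpleCount> decode(final List<String> dataLines) {
        return dataLines.stream()
                .map(line -> line.split("\t"))
                .map(f -> new SimpleCount(f[0],
                        Integer.parseInt(f[1]), Integer.parseInt(f[2]), Integer.parseInt(f[3])))
                .collect(Collectors.toList());
    }

    /** Naive overlap query over decoded records; an indexed reader would seek instead. */
    public static List<SimpleCount> query(final List<SimpleCount> records,
                                          final String contig, final int start, final int end) {
        return records.stream()
                .filter(r -> r.contig().equals(contig) && r.start() <= end && r.end() >= start)
                .collect(Collectors.toList());
    }
}
```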
@@ -182,6 +182,7 @@ task CollectCounts {
        File ref_fasta
        File ref_fasta_fai
        File ref_fasta_dict
        Boolean? enable_indexing
You can change this to just Boolean enable_indexing = false and forego the select_first below.
I think we decided on not specifying values for optional parameters in this way as a matter of style across all CNV WDLs, so that default values for parameters that are simply passed through to the command line would be located in the command-line invocation itself. Maybe we can change everything consistently in the WDL 1.0 update PR if it makes sense to do so.
Wait, also see #2858 (comment).
Interesting. I am unsure how the use of optional works in this case, i.e. what is the difference between Boolean enable_indexing = false and Boolean? enable_indexing = false. Both are allowed, and we use the former in the SV pipeline, although I have not tested that it can be overridden in the JSON.
@@ -182,6 +182,7 @@ task CollectCounts {
        File ref_fasta
        File ref_fasta_fai
        File ref_fasta_dict
        Boolean? enable_indexing
        String? format
String format = "HDF5"
Same thing here.
/**
 * Determines the integer ploidy state of all contigs for germline samples given counts data. These should be either
 * HDF5 or TSV count files generated by {@link CollectReadCounts}.
 * HDF5 or TSV count files generated by {@link CollectReadCounts}; TSV files may be gzipped, but must then have filenames
I would mention block compression specifically. Also is the index always expected or just when streaming?
I think gzipped implies block compression but I'm not actually sure which implementation HTSJDK is compatible with. Perhaps I'll specify compression with bgzip, since that is what we will presumably use? Index is only expected when streaming.
Yes sorry, I meant block compression in the way that bgzip implements it. Specifying bgzip would be clearer.
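On the gzip-compatibility question above: BGZF (bgzip's format) is a series of concatenated independent gzip members, and java.util.zip.GZIPInputStream decodes multi-member streams, which is one way to sanity-check that plain-gzip readers can consume block-compressed data. A minimal sketch (not GATK/HTSJDK code; names are invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch: concatenated gzip members (the structure BGZF uses) decode as a
 *  single logical stream with the standard GZIPInputStream. */
public final class MultiMemberGzipSketch {
    /** Compress one string into a complete, standalone gzip member. */
    static byte[] gzipMember(final String s) throws IOException {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    /** Decompress a byte array that may contain several concatenated members. */
    static String decompressAll(final byte[] concatenatedMembers) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(concatenatedMembers))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Note this only shows decompression compatibility; the virtual-offset indexing that makes BGZF seekable is bgzip-specific.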
Utils.validate(BucketUtils.isCloudStorageUrl(path), "Read-count path must be a Google Cloud Storage URL.");
Utils.validate(new SimpleCountCodec().canDecode(path), String.format(
        "Read-count file extension must be one of the following: [%s]",
        String.join(",", SimpleCountCodec.SIMPLE_COUNT_CODEC_EXTENSIONS)));
You don't need to address this in this PR, but I think we should move away from requiring the .counts.tsv suffix and use just .tsv.
OK, no action. I'm a little hesitant since .tsv is such a common file extension, and I would want to make sure that both the code and formats don't ever get to a place where it's easy to fail silently.
? SimpleCountCollection.readFromGCS(readCountPath)
: SimpleCountCollection.read(new File(readCountPath));
I think we should try to simplify reading counts files a bit by only exposing the read and readAndSubset functions, and taking care of the GCS logic inside, i.e. have public read(String path) and public readAndSubset(String path, final List<SimpleInterval> overlapIntervals), but make the GCS functions private.
I thought about this and agree it would be cleaner, but since we have to do the extra step of merging intervals in the GCS case, I think it's OK to encourage the caller to do this step externally (so we don't repeat this work in the loop where these subset methods are called, inexpensive as it may typically be). Unfortunately, this encouragement is only given via the method docs and not by any actual validation.
I'm okay with this as long as there are tests for both code paths
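The interval-merging step the caller is expected to perform before the subset query can be sketched as follows. This is a minimal illustration, not the GATK implementation; `Interval` and `mergeSorted` are hypothetical names, and the input is assumed to be sorted by contig and start.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: merge a sorted list of possibly-overlapping intervals into the
 *  sorted, non-overlapping list that a subset-reading method requires. */
public final class IntervalMergeSketch {
    public record Interval(String contig, int start, int end) {}

    public static List<Interval> mergeSorted(final List<Interval> sorted) {
        final List<Interval> merged = new ArrayList<>();
        for (final Interval interval : sorted) {
            if (!merged.isEmpty()) {
                final Interval last = merged.get(merged.size() - 1);
                // merge when on the same contig and overlapping or directly adjacent
                if (last.contig().equals(interval.contig()) && interval.start() <= last.end() + 1) {
                    merged.set(merged.size() - 1, new Interval(
                            last.contig(), last.start(), Math.max(last.end(), interval.end())));
                    continue;
                }
            }
            merged.add(interval);
        }
        return merged;
    }
}
```

Doing this once up front, rather than inside each subset call, is exactly the duplicated work the comment above is trying to avoid.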
@@ -23,17 +31,23 @@
 * @author Samuel Lee <slee@broadinstitute.org>
 */
public final class SimpleCountCollection extends AbstractSampleLocatableCollection<SimpleCount> {
    private static final int DEFAULT_FEATURE_QUERY_LOOKAHEAD_IN_BP = 1_000_000;
Just curious if you did any optimization here
Nope! I think this just mirrors the default value.
@@ -60,13 +74,37 @@ public SimpleCountCollection(final SampleLocatableMetadata metadata,
        super(metadata, simpleCounts, SimpleCountCollection.SimpleCountTableColumn.COLUMNS, SIMPLE_COUNT_RECORD_FROM_DATA_LINE_DECODER, SIMPLE_COUNT_RECORD_TO_DATA_LINE_ENCODER);
    }

    /**
     * Read all counts from a file (HDF5 or TSV).
     */
    public static SimpleCountCollection read(final File file) {
        IOUtils.canReadFile(file);
This line is unnecessary now.
Same thing about redundant validation here.
IOUtils.assertFileIsReadable(IOUtils.getPath(path));
Utils.validate(BucketUtils.isCloudStorageUrl(path), "Read-count path must be a Google Cloud Storage URL.");
Utils.validate(new SimpleCountCodec().canDecode(path), String.format(
        "Read-count file extension must be one of the following: [%s]",
        String.join(",", SimpleCountCodec.SIMPLE_COUNT_CODEC_EXTENSIONS)));
These lines aren't necessary since you do them again in readOverlappingSubsetFromGCS
I think we generally tend towards redundant parameter validation as long as it is cheap. Violates DRY but is more robust to refactoring, etc.
Actually, I did notice a non-null check that feels redundant in resolveIntervals, will remove that.
 * list of the original intervals desired to be strictly coincident; this merged list can then be used with this method.
 * @param overlapIntervals if {@code null} or empty, all counts will be returned; must be sorted and non-overlapping otherwise
 */
public static SimpleCountCollection readOverlappingSubsetFromGCS(final String path,
I would LOVE for all code paths for reading counts to use the same code path with FeatureDataSource, as you have done here. I guess this isn't possible currently with HDF5... but we should be able to use FeatureDataSource for local TSVs as well. Don't need to address for this PR, but this should be a goal.
Yes, let's consider phasing out HDF5 count files. They really only give a significant speedup for WGS somatic PoN building; it might be possible to achieve this by just optimizing TableReader instead. See #2858 (comment) for some old numbers.
Thanks @mwalker174! I think I responded to or addressed everything. The code paths for reading TSVs all go through the abstract CNV collection classes. Those require a bit of boilerplate, but were IMO a huge improvement over the horror show of utility methods in the old code... Happy to discuss possible further refactoring and improvement (and there are already catch-all issues open), if needed. If we decide to stream other locatable collections, we can start to extract more of these streaming/subsetting methods to…
Thanks @samuelklee. If we can, we should aim to remove HDF5 entirely. XsvLocatableTableCodec looks interesting, but I agree we would need to figure out how to do away with a config file.
Merge at will!
This is to substantially reduce disk costs in high resolution WGS gCNV runs per #5716. As discussed elsewhere, we can enable indexing/gzipping/streaming in the gCNV WDLs themselves, but this should happen after updating to WDL 1.0 (which we need for optional localization).
This PR only partially addresses that issue, since we could make more sweeping changes in the abstract CNV collection classes. However, I did make a small change to TableReader that allows all TSV/CNV collection files to be gzipped.
I fixed format specification in the CollectReadCounts WDL task, which was kind of wonky and incorrect. It's still kind of wonky (due to WDL limitations), but it should be correct. Some exception handling is now done in bash.
I also had to fix some missing newlines at EOFs. One such missing newline in the test counts file caused indexing of the gzipped version of the file to miss the last count upon querying during initial testing. Although probably unnecessary, I changed JSON writing in gCNV to include such newlines.
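The TableReader change mentioned above (allowing all TSV/CNV collection files to be gzipped) can be sketched as an extension-based choice of input stream. This is a simplified illustration, not the actual GATK code; `TsvOpenSketch` and `open` are invented names:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/** Sketch: transparently open a possibly-gzipped TSV for line-oriented reading,
 *  choosing the stream type from the filename extension. */
public final class TsvOpenSketch {
    public static BufferedReader open(final File file) throws IOException {
        final InputStream in = file.getName().endsWith(".gz")
                ? new GZIPInputStream(new FileInputStream(file))  // decompress on the fly
                : new FileInputStream(file);                      // plain TSV
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

A reader built this way lets the downstream column-parsing logic stay identical for .tsv and .tsv.gz inputs.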