New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL scripts, unit tests, integration tests) #4396
Conversation
8e33396
to
282352d
Compare
@sooheelee, could you please review the documentation of |
@asmirnov239 @sooheelee Let's hold off on reviewing until the other PR (#4335) upon which this is rebased is merged. At that point, @mbabadi can rebase (it might also be helpful to split test files into their own commit, since there are a lot that have been changed) and then we can review. We can go ahead and start evaluations on this branch, though! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with the review @mbabadi! Looks really good, thank you for finishing the postrocessing
@@ -57,6 +58,12 @@ def get_sample_name_from_txt_file(input_path: str) -> str: | |||
return line.strip() | |||
|
|||
|
|||
def write_sample_name_to_txt_file(output_path: str, sample_name: str): | |||
"""Writes sample name to a text file.""" | |||
with open(os.path.join(output_path, io_consts.default_sample_name_txt_filename), 'w') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you don't want to open it in append mode to not override the previously written lines in case in the future the first line will not be the sample name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used for writing sample_name.txt
to the root of a sample calls shard. This text file is intended to a sample name one-liner.
# regular expression for matching sample name from header comment line | ||
sample_name_header_regexp = "^@RG.*SM:(.*)[\t]*.*$" | ||
|
||
# prefix for adding sample name as a header comment line | ||
sample_name_header_prefix = "RG\tID:GATKCopyNumber\tSM:" | ||
sample_name_sam_header_prefix = "RG\tID:GATKCopyNumber\tSM:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to derive these tags from a vcf package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:D
def export_ndarray_tc_with_copy_number_header(sample_posterior_path: str, | ||
ndarray_tc: np.ndarray, | ||
output_file_name: str, | ||
delimiter='\t', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you extract the delimiter and comment prefix to a constant from this method and methods below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will move the values of all delimiter='\t'
and comment='@'
keyword args to io_consts.py
.
default_class_log_posterior_tsv_filename = "log_q_tau_tk.tsv" | ||
default_baseline_copy_number_tsv_filename = "baseline_copy_number_t.tsv" | ||
default_copy_number_segments_tsv_filename = "copy_number_segments.tsv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to eventually export these constants to some json file so that they can be shared between java and python codebases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. However, there's logistic problem: gcnvkernel
is not entirely a java resource in the codebase (only the scripts). So, we have to include a duplicate copy of the JSON as a resource. Any thoughts?
@@ -170,24 +200,49 @@ def __init__(self, | |||
self.denoising_model_approx = denoising_model_approx | |||
self.input_calls_path = input_calls_path | |||
|
|||
@staticmethod | |||
def import_ndarray_tc_with_copy_number_header(sample_posterior_path: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does tc here stand for? Can you document it somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added documentation. Throughout gcnvkernel
code, t
refers to interval index, c
refers to copy-number state index, and j
refers to contig index.
@@ -17,36 +18,13 @@ | |||
* An allele representation of this copy number state (used for VCF creation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this - I forgot to :(
* | ||
* @author Mehrtash Babadi <mehrtash@broadinstitute.org> | ||
*/ | ||
public class IntegerCopyNumberSegmentCollectionUnitTest extends GATKBaseTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
.boxed() | ||
.collect(Collectors.toMap(IntegerCopyNumberState::new, | ||
cn -> TEST_INVALID_LOG_POSTERIOR_VECTOR[cn]))); | ||
final LocatableCopyNumberPosteriorDistribution locatablePosteriorRecord = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception should already be thrown by now, no need for these 2 lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
*/ | ||
public class GermlineCNVSegmentVariantComposerUnitTest extends GATKBaseTest { | ||
@Test(dataProvider = "variantCompositionSettings") | ||
public void testVariantComposition(final int refAutosomalCopyNumber, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also add a more universal test that tests that creates an actual segment VCF and checks its correctness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a test for the correctness of the generated VariantContext
. Testing an arbitrary VariantContext
is written correctly to VCF is the responsibility of the author of VariantContextWriter
.
String sample_directory = "SAMPLE_${sample_index}" #this is a hardcoded convention in gcnvkernel | ||
String vcf_filename = "${entity_id}.vcf.gz" | ||
String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz" | ||
String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not want to provide an option in WDL to not create segments VCF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think most users would actually care about the segments VCF file anyway (only more hardcore analysts might care about the intervals VCF).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just output both for now.
@samuelklee ooops I didn't see your comment until now.. |
improvement of segment quality calculation methods incorporated small gcnvkernel changes from PR #4396
07a39cc
to
5c4d167
Compare
… (python) CLI script (#4335) * Viterbi segmentation and segment quality calculation in gcnvkernel saving log emission posteriors to disk put __init__ files back in ... Viterbi decoder w/ theano.scan doc updates for Viterbi skeletons for HMM segmentation quality calculation theano-based HMM log constrained probability calculation fixed a notorious bug due to theano fancy indexing ... left and right end point quality calculation for a segment exact quality viterbi API update SAM sequence dictionary parsing interval ordering using SAM sequence dictionary lazy initialization of denoising workspace variables to make it efficient and re-usable for Viterbi segmentation exporting baseline copy number for each sample (for java post-processing) some I/O refactorings loading configs from JSON Viterbi segmentation engine scattered model/calls assembly some refactoring of denoising model Viterbi segmentation and quality calculation finished fixed numerical instability issues with segment quality calculation Viterbi segmentation engine complete Viterbi segmentation python script some gcnvkernel refactoring removed SAM sequence dictionary code fixed the bug causing travis failure single-sample segmentation as opposed to all in one shot PR review: - code improvement of theano forward-backward and viterbi - refactoring of math utils - improvement of segment quality calculation methods - incorporated small gcnvkernel changes from PR #4396 - doc update
5c4d167
to
d050e27
Compare
Codecov Report
@@ Coverage Diff @@
## master #4396 +/- ##
===============================================
- Coverage 79.815% 79.173% -0.642%
- Complexity 16933 17160 +227
===============================================
Files 1058 1053 -5
Lines 61408 61790 +382
Branches 9967 10343 +376
===============================================
- Hits 49013 48921 -92
- Misses 8512 8981 +469
- Partials 3883 3888 +5
|
@samuelklee I'll write unit tests for segment quality calculation in a separate PR (issue #4464). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good for the most part! Added some suggestions for minor refactoring.
String sample_directory = "SAMPLE_${sample_index}" #this is a hardcoded convention in gcnvkernel | ||
String vcf_filename = "${entity_id}.vcf.gz" | ||
String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz" | ||
String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just output both for now.
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
mkdir extracted-contig-ploidy-calls | ||
tar xzf ${contig_ploidy_calls_tar} -C extracted-contig-ploidy-calls | ||
|
||
allosomal_contigs_array=(${sep=" " allosomal_contigs}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the bash loop here, since we are not appending any indices to the array elements or running tar. You can just use --allosomal-contig ${sep=" --allosomal-contig " allosomal_contigs}
in the command line below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -210,7 +210,7 @@ | |||
private RunMode runMode; | |||
|
|||
@Argument( | |||
doc = "Input contig-ploidy calls directory (output of DetermlineGermlineContigPloidy).", | |||
doc = "Input contig-ploidy calls directory (output of DetermineGermlineContigPloidy).", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you address #4403 while you're here, for both GermlineCNVCaller and DetermineGermlineContigPloidy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The choice for -I
and -O
is arbitrary when there are multiple mandatory inputs and/or outputs, and this is the case for both tools. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion---this comment applies to a line above that is not in the diff.
We current use --input
to denote the read counts for DetermineGermlineContigPloidy and GermlineCNVCaller (which seems natural, as they are arguably the "primary" input). We also use --output
to denote the output directory, as we do for most CNV tools. However, in these two tools, we neglect to add the standard short names for these arguments. All that needs to be done to close that issue is to add the short names, so we can use -I/-O
as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
fullName = DRY_RUN_LONG_NAME, | ||
optional = true | ||
) | ||
private boolean dryRun = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a dry-run mode? I'd rather not have it if it's only used for debugging.
In general, the WDL task for a tool should expose every parameter. Since we want to only maintain a single version of the WDL, each task should just be a one-to-one wrapper around each tool. Otherwise we may get into scenarios where we or other developers need to go back and expose things that are not exposed in the "official" version of the WDL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
* according order of the intervals in corresponding chunks. The VCF will contain an ALT allele for every | ||
* non-reference copy number state specified in the posterior files. The reference allele corresponds to copy number 2. | ||
* </p> | ||
* <p>Depending on the arguments, this tool either generates a single "intervals" VCF or additionally, performs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If most users will want the segmented VCF, then let's just always output both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
runToolForSingleSample(callShards, modelShards, sampleIndex, | ||
actualIntervalsOutputVCF, actualSegmentsOutputVCF, | ||
ALLOSOMAL_CONTIGS, AUTOSOMAL_REF_COPY_NUMBER, false); | ||
IntegrationTestSpec.assertEqualTextFiles(actualIntervalsOutputVCF, expectedIntervalsOutputVCF); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, interesting. I've used FileUtils.contentEquals
in the past. I'd probably stick with that to avoid any interaction with IntegrationTestSpec.assertEqualTextFiles
, but I leave it up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally ended up using IntegrationTestSpec.assertEqualTextFiles
because it takes a comment prefix (helpful for changing VCF info field descriptions without needing to recreate the test resources).
|
||
/** | ||
* Integration test for {@link PostprocessGermlineCNVCalls} | ||
* | ||
* @author Mehrtash Babadi <mehrtash@broadinstitute.org> | ||
* @author Andrey Smirnov <asmirnov@broadinstitute.org> | ||
*/ | ||
public class PostprocessGermlineCNVCallsIntegrationTest extends CommandLineProgramTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can make all test classes final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* | ||
* @author Mehrtash Babadi <mehrtash@broadinstitute.org> | ||
*/ | ||
public class IntegerCopyNumberSegmentCollectionUnitTest extends GATKBaseTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding these tests! I have been a little lazy about testing these collections classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:)
|
||
/* assert correctness of quality metrics */ | ||
Assert.assertEquals( | ||
(int)(long)gen.getExtendedAttribute(GermlineCNVSegmentVariantComposer.QS), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just cast the first quantity to long? Also, white space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Assert.assertEquals(var.getEnd(), segment.getEnd()); | ||
Assert.assertEquals(var.getAlleles(), GermlineCNVSegmentVariantComposer.ALL_ALLELES); | ||
|
||
final Genotype gen = var.getGenotype(IntegerCopyNumberSegmentCollectionUnitTest.EXPECTED_SAMPLE_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gen -> gt or genotype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bdc3c72
to
ad46a1c
Compare
@samuelklee back to you -- the revision is on the longer side, though, no surprises. |
ad46a1c
to
a4028f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one or two minor things, otherwise good to go!
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
@@ -256,6 +256,7 @@ task PostprocessGermlineCNVCalls { | |||
|
|||
String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz" | |||
String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz" | |||
Boolean allosomal_contigs_specified = defined(allosomal_contigs) && length(select_first([allosomal_contigs, []])) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, does just defined(allosomal_contigs) && length(allosomal_contigs) > 0
work (i.e., are these evaluated left-to-right with short circuiting)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no short circuiting, therefore select_first([allosomal_contigs, []])
:)
@@ -409,46 +339,23 @@ private String getShardSampleName(final int shardIndex) { | |||
} | |||
|
|||
/** | |||
* Returns a list of {@link LocatableCopyNumberPosteriorDistribution} for {@link #sampleIndex} from a | |||
* Returns a list of {@link IntervalCopyNumberGenotypingData} for {@link #sampleIndex} from a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't like conflating Posteriors and Data...I'm find with IntervalCopyNumberPosterior if PosteriorDistribution is too verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Is it because you store the baseline CN? I'll leave it up to you.
/** | ||
* Extracts column names from a TSV file | ||
*/ | ||
List<String> extractCopyNumberColumnsFromHeader(final File inputFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
…s, unit tests, integration tests)
a4028f5
to
861f302
Compare
This PR adds segments VCF writing to
PostprocessGermlineCNVCalls
. Segmentation (Viterbi) and segment quality calculation are performed bygcnvkernel
.This PR introduces the following additional features:
<DUP>
or<DEL>
alleles (in place ofCN_x
alleles), depending on whether the most likely copy-number call is below or above thecontig baseline. The contig baseline state is whatever the user has specified for autosomal contigs, and the contig ploidy state on sex chromosomes (from the output of
DetermineGermlineContigPloidy
).