
CNNScoreVariants out of beta #5548

Merged
merged 1 commit on Jan 29, 2019

Conversation

lucidtronix (Contributor)

Add PEP8 Python style with type hints, and use model directories instead of separate arguments for config and weights.


codecov-io commented Dec 21, 2018

Codecov Report

Merging #5548 into master will increase coverage by 35.717%.
The diff coverage is 96.032%.

@@               Coverage Diff                @@
##              master     #5548        +/-   ##
================================================
+ Coverage     36.838%   72.555%   +35.717%     
- Complexity     17409     26182      +8773     
================================================
  Files           1934      1934                
  Lines         145691    145752        +61     
  Branches       16103     16106         +3     
================================================
+ Hits           53670    105751     +52081     
+ Misses         87181     34866     -52315     
- Partials        4840      5135       +295
Impacted Files Coverage Δ Complexity Δ
...der/tools/walkers/vqsr/CNNVariantPipelineTest.java 100% <100%> (ø) 8 <1> (ø) ⬇️
.../walkers/vqsr/CNNScoreVariantsIntegrationTest.java 100% <100%> (ø) 13 <7> (+3) ⬆️
...ellbender/tools/walkers/vqsr/CNNScoreVariants.java 80.444% <89.583%> (+6.736%) 45 <15> (+4) ⬆️
...nder/tools/copynumber/utils/TagGermlineEvents.java 0% <0%> (-100%) 0% <0%> (-3%)
...r/tools/spark/pathseq/PSBwaArgumentCollection.java 0% <0%> (-100%) 0% <0%> (-1%)
...ender/tools/readersplitters/ReadGroupSplitter.java 0% <0%> (-100%) 0% <0%> (-3%)
...ools/funcotator/filtrationRules/ClinVarFilter.java 0% <0%> (-100%) 0% <0%> (-5%)
...ls/walkers/varianteval/stratifications/Sample.java 0% <0%> (-100%) 0% <0%> (-4%)
...nes/metrics/QualityYieldMetricsCollectorSpark.java 0% <0%> (-100%) 0% <0%> (-7%)
...lkers/varianteval/util/SortableJexlVCMatchExp.java 0% <0%> (-100%) 0% <0%> (-2%)
... and 1381 more

cmnbroad (Collaborator) left a comment:

Checkpointing part one of this review - I still have some more comments to come on the python code and test code, but am saving what I have so far so as not to hold things up. The type hints are a big improvement toward readability, though!

@@ -90,11 +95,9 @@
* -inference-batch-size 2 \
* -transfer-batch-size 2 \
* -tensor-type read-tensor \
* -architecture path/to/my_model.json \
Collaborator:

Examples above still use inference/transfer batch sizes of 2. These (size) arguments are @Advanced, so the basic javadoc examples shouldn't refer to them.

Collaborator:

Also, the javadoc should have an example using the simplest, all-default-args case, and we should have a corresponding test case for that (I think there is one that is very close).
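
For reference, the all-default-args case would look something like this (a sketch with placeholder filenames; only the variants, reference, and output are required):

```
gatk CNNScoreVariants \
   -V input.vcf.gz \
   -R reference.fasta \
   -O annotated.vcf.gz
```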

Collaborator:

The javadoc should explain that the tool is intended for single-sample use only, and, since we only warn on multiple samples, say something about what results to expect when multiple samples are used.

Contributor Author:

OK, the issue is that the best batch size for 1D is different from the best batch size for 2D. I'll automatically set them to different defaults when they are not supplied on the command line.

The first example is the simplest possible way to run the tool.

Added a comment about single-sample.
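
A minimal sketch of the batch-size defaulting described above, assuming hypothetical constant names and a nullable command-line value (the real tool's fields differ):

```java
enum TensorType { reference, read_tensor }

class BatchSizeDefaults {
    // Hypothetical defaults: 1D reference tensors are cheap, so batches can be large;
    // 2D read tensors are memory-heavy, so the default batch is smaller.
    static final int DEFAULT_BATCH_SIZE_1D = 256;
    static final int DEFAULT_BATCH_SIZE_2D = 64;

    static int resolveBatchSize(final TensorType tensorType, final Integer userSuppliedSize) {
        if (userSuppliedSize != null) {
            return userSuppliedSize; // an explicit command-line value always wins
        }
        return tensorType == TensorType.reference ? DEFAULT_BATCH_SIZE_1D : DEFAULT_BATCH_SIZE_2D;
    }
}
```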


@Argument(fullName = "weights", shortName = "weights", doc = "Keras model HD5 file with neural net weights.", optional = true)
private String weights;
@Argument(fullName = "model-dir", shortName = "model", doc = "Directory containing Neural Net architecture and configuration json file", optional = true)
Collaborator:

We should add something that says "if not supplied the default model is used" or some such.

Contributor Author:

done


@Argument(fullName = "tensor-type", shortName = "tensor-type", doc = "Name of the tensors to generate, reference for 1D reference tensors and read_tensor for 2D tensors.", optional = true)
private TensorType tensorType = TensorType.reference;

@Argument(fullName = "annotation-set", shortName = "annotation-set", doc = "Name of the set of annotations to use", optional = true)
private String annotationSet = DEFAULT_ANNOTATION_SET;
Collaborator:

As it stands, nobody will know what to do with this argument. The other valid values either need to be documented (perhaps with a ClpEnum), and tests added, or else this arg should be removed, or at least @Hidden. If we do document any of the other sets, we'll need to add a test for them.
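
For illustration, the ClpEnum route (not ultimately taken; see the reply below) might look like this sketch, where the set name, help text, and import path are assumptions based on Barclay's documented-enum support:

```java
import org.broadinstitute.barclay.argparser.CommandLineParser;

// Each documented value shows up in the generated help for the argument.
enum AnnotationSet implements CommandLineParser.ClpEnum {
    BEST_PRACTICES("Annotations used by the default best-practices models"); // hypothetical set

    private final String helpDoc;

    AnnotationSet(final String helpDoc) {
        this.helpDoc = helpDoc;
    }

    @Override
    public String getHelpDoc() {
        return helpDoc;
    }
}
```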

Contributor Author:

Removed it; the annotation set is now controlled only by the Java code.

@Argument(fullName = "weights", shortName = "weights", doc = "Keras model HD5 file with neural net weights.", optional = true)
private String weights;
@Argument(fullName = "model-dir", shortName = "model", doc = "Directory containing Neural Net architecture and configuration json file", optional = true)
private String modelDir;

@Argument(fullName = "tensor-type", shortName = "tensor-type", doc = "Name of the tensors to generate, reference for 1D reference tensors and read_tensor for 2D tensors.", optional = true)
private TensorType tensorType = TensorType.reference;
Collaborator:

Are we keeping 1D as the default? If we decide to change to 2D, then the examples in the javadoc above will have to change to reflect that.

@@ -209,7 +217,7 @@
return new String[]{"Inference batch size must be less than or equal to transfer batch size."};
}

if (weights == null && architecture == null){
if (modelDir == null){
Collaborator:

Shouldn't the tensor-type test below be unconditional (model == null is irrelevant)?

Contributor Author:

For now, yes, but this way we can support new tensor types without requiring them to have a default model.

curBatchSize,
inferenceBatchSize,
tensorType,
annotationSet,
Collaborator:

The Python code relies on the name of the (predefined) annotation set passed here, but the Java code serializes the set of annotations defined by the annotationKeys list from the command line. These two arguments are overlapping/redundant and need to be consolidated or kept in sync somehow. It would be very easy to construct a command line that causes them to be out of sync.

Contributor Author:

fixed

def score_and_write_batch(args, model, file_out, batch_size, python_batch_size, tensor_dir):
'''Score a batch of variants with a CNN model. Write tab delimited temp file with scores.
def score_and_write_batch(model: keras.Model,
file_out: TextIO,
Collaborator:

When this code is called by the Java code, the Python statement retrieves the values model and file_out from the global namespace, since they're originally stored there when start_session_get_args_and_model is called. It's a bit weird to have to pass them in each time like this.

One option would be to create an instance of an inference wrapper class (see what I did for the FIFO code in tool.py), have an init function that creates it and stores it in the global namespace, and have score and close functions that delegate to the global instance. I'm not sure how pythonic it is, but all of the code called by Java would be in a single module, and there would be only a single variable in the global namespace, with no call redundancies.
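
A rough sketch of that wrapper idea, with hypothetical function names (not the PR's actual code):

```python
_inference = None  # the single module-level instance the Java-issued calls delegate to

class InferenceWrapper:
    def __init__(self, model, file_out):
        self.model = model
        self.file_out = file_out

    def score_and_write_batch(self, batch_size, python_batch_size, tensor_dir):
        ...  # delegate to the existing scoring code using self.model / self.file_out

def init_inference(model, file_out):
    global _inference
    _inference = InferenceWrapper(model, file_out)

def score_and_write_batch(batch_size, python_batch_size, tensor_dir):
    # Java calls this module-level function; no model/file_out arguments needed each time.
    _inference.score_and_write_batch(batch_size, python_batch_size, tensor_dir)

def close_inference():
    global _inference
    _inference = None
```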

Collaborator:

BTW, thanks for adding the type hints - they definitely improve readability...

"CNNScoreVariantsWorkflow.bam_file": "/home/travis/build/broadinstitute/gatk/src/test/resources/large/VQSR/g94982_chr20_1m_10m_bamout.bam",
"CNNScoreVariantsWorkflow.bam_file_index": "/home/travis/build/broadinstitute/gatk/src/test/resources/large/VQSR/g94982_chr20_1m_10m_bamout.bai",
"CNNScoreVariantsWorkflow.bam_file": "/home/travis/build/broadinstitute/gatk/src/test/resources/large/VQSR/g94982_b37_chr20_1m_895_bamout.bam",
"CNNScoreVariantsWorkflow.bam_file_index": "/home/travis/build/broadinstitute/gatk/src/test/resources/large/VQSR/g94982_b37_chr20_1m_895_bamout.bai",
Collaborator:

Do any tests for these exist?

Contributor Author:

Yes, these are tested in the cnn_variant Cromwell job.

@Argument(fullName = "window-size", shortName = "window-size", doc = "Neural Net input window size", minValue = 0, optional = true)
private int windowSize = 128;

@Argument(fullName = "read-limit", shortName = "read-limit", doc = "Maximum number of reads to encode in a tensor, for 2D models only.", minValue = 0, optional = true)
private int readLimit = 128;
Collaborator:

Is there a better name we can use for this arg/variable that matches the other tools that do this (maybe downsample-reads or something)?

@@ -452,9 +465,13 @@ private String getVariantInfoString(final VariantContext variant) {

private void executePythonCommand() {
final String pythonCommand = String.format(
"vqsr_cnn.score_and_write_batch(args, model, tempFile, %d, %d, '%s')",
"vqsr_cnn.score_and_write_batch(model, tempFile, %d, %d, '%s', '%s', %d, %d, '%s')",
Collaborator:

See my comments in the python code.

cmnbroad (Collaborator) left a comment:

One additional comment - we need to think about how to version the models.

sooheelee (Contributor):

@lucidtronix @cmnbroad, I see for v4.0.12.0, CNNScoreVariants falls under the EXPERIMENTAL Tool label. When you say the tool will come out of beta, do you mean there will be a change in this label or something else? I'm writing a document that links to the CNN workflow and need to be clear on the status of the workflow. Thanks.

cmnbroad (Collaborator) commented Jan 9, 2019:

@sooheelee Right now the tool is marked @Experimental. The goal is to get this tool into production status for 4.1, with no @Experimental or @Beta tags.
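
In code terms, the status is just an annotation on the tool class; a sketch of the promotion, assuming Barclay's annotation names (the class here is a placeholder):

```java
import org.broadinstitute.barclay.argparser.ExperimentalFeature;

// Before promotion the tool class carries a status annotation; deleting it
// (and any @BetaFeature) is what "production status" means here.
@ExperimentalFeature
public final class SomeExperimentalTool {
    // tool implementation omitted
}
```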

sooheelee (Contributor):

Thanks for clarifying, @cmnbroad. I have to say, skipping @Beta and going directly from experimental to production is unusual. Congratulations.

lucidtronix (Contributor Author):

@cmnbroad back to you

cmnbroad (Collaborator) left a comment:

Ok, we're getting closer. There is still a fair amount of code cleanup that should be done longer term, especially on the Python side - but I commented on what I think is the minimum for now, keeping just to the inference code. In particular, where there are hardcoded magic numbers and values that have to be kept in sync between Java and Python, those should be moved to constants on both sides and commented to make that relationship explicit. Back to @lucidtronix.

@@ -116,30 +118,34 @@
" If you have an older (pre-1.6) version of TensorFlow installed that does not require AVX you may attempt to re-run the tool with the %s argument to bypass this check.\n" +
" Note that such configurations are not officially supported.";

private static final String DEFAULT_ANNOTATION_SET = "best_practices";
Collaborator:

This is unused now, and can be removed.

Contributor Author:

done

transferBatchSize = Math.max(transferBatchSize, MAX_BATCH_SIZE_1D);
inferenceBatchSize = Math.max(inferenceBatchSize, MAX_BATCH_SIZE_1D);
}

Collaborator:

Should this also require that the transfer size be an integral multiple of the inference size? It would probably work without that, but would be inefficient, since the Python side would wind up doing some smaller batches. If you think that's right, the doc for those args should be updated to mention that.
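
The check being proposed would amount to something like this sketch (not adopted in the PR; see the reply below), with the surrounding validation-method shape assumed:

```java
class BatchSizeValidation {
    // Returns validation error messages, or null if the sizes are acceptable.
    static String[] checkBatchSizes(final int inferenceBatchSize, final int transferBatchSize) {
        if (transferBatchSize % inferenceBatchSize != 0) {
            return new String[]{"Transfer batch size should be an integral multiple of the inference batch size."};
        }
        return null; // no validation errors
    }
}
```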

Contributor Author:

Since this will only impact one batch of Python inference, I think we won't notice the inefficiency.

@@ -273,6 +293,7 @@ public void onTraversalStart() {
pythonExecutor.sendSynchronousCommand("import vqsr_cnn" + NL);

scoreKey = getScoreKeyAndCheckModelAndReadsHarmony();
annotationSetString = this.annotationKeys.toString().replace(" ", "").replace("[", "").replace("]", "");
Collaborator:

Suggest the more idiomatic:

Suggested change:
- annotationSetString = this.annotationKeys.toString().replace(" ", "").replace("[", "").replace("]", "");
+ annotationSetString = annotationKeys.stream().collect(Collectors.joining(","));

Also, the "joining" arg should be in a constant, with a comment saying that the value has to be kept in sync with the corresponding Python constant that is used to parse these lines. The corresponding constant will have to be added to the Python code, maybe in defines.py.
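
Put together, the suggestion amounts to something like this sketch (class and constant names are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

class AnnotationSerialization {
    // Keep in sync with the corresponding separator constant on the Python side
    // (e.g. something added to defines.py) that is used to parse these lines.
    static final String ANNOTATION_SEPARATOR = ",";

    static String joinAnnotationKeys(final List<String> annotationKeys) {
        return annotationKeys.stream().collect(Collectors.joining(ANNOTATION_SEPARATOR));
    }
}
```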

Collaborator:

We should also do the same thing (add symbolic constants) for the hardcoded values like "\t" used by GATKReadToString, getVariantDataString, etc.

Contributor Author:

That is much better, thanks.

Added constants for the comma, tab, equals, and semi-colon. Also simplified the getVariantInfoString function, which was needlessly sending single-valued annotations as if they were lists.

tensorType,
annotationSet,
windowSize,
readLimit,
Collaborator:

PR is #5594; once it's reviewed, we'll add code to this tool to call it.

}

@Test(groups = {"python"})
public void testInferenceWithWeightsOnly() throws IOException{
public void testInferenceWithWeightOverride() throws IOException {
Collaborator:

Does this test differ from the testInference test?

Collaborator:

Let's rename this to testInferenceWithModelOverride.

Contributor Author:

It didn't, but I removed the model override from testInference so now it does, and renamed it.

'D':[0.333,0,0.333,0.334], 'X':[0.25,0.25,0.25,0.25], 'N':[0.25,0.25,0.25,0.25]
'K': [0, 0, 0.5, 0.5], 'M': [0.5, 0.5, 0, 0], 'R': [0.5, 0, 0, 0.5], 'Y': [0, 0.5, 0.5, 0], 'S': [0, 0.5, 0, 0.5],
'W': [0.5, 0, 0.5, 0], 'B': [0, 0.333, 0.333, 0.334], 'V': [0.333, 0.333, 0, 0.334], 'H': [0.333, 0.333, 0.334, 0],
'D': [0.333, 0, 0.333, 0.334], 'X': [0.25, 0.25, 0.25, 0.25], 'N': [0.25, 0.25, 0.25, 0.25]
Collaborator:

It would be helpful to add comments describing what the values in the dictionaries mean, and rename AMBIGUITY_CODES to reflect that use.

Contributor Author:

I added more comments explaining, but since these are defined by IUPAC as ambiguity codes, I want to keep that as the dictionary name.
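
The clarifying comments might read something like this (an illustrative sketch, not the PR's exact text; the vector ordering follows this codebase's base encoding, which from the values above appears to be A, C, T, G):

```python
# Each IUPAC ambiguity code maps to a probability vector over the encoded bases,
# splitting probability mass evenly across the bases the code can represent.
AMBIGUITY_CODES = {
    'K': [0, 0, 0.5, 0.5],          # K (keto): G or T
    'M': [0.5, 0.5, 0, 0],          # M (amino): A or C
    'R': [0.5, 0, 0, 0.5],          # R (purine): A or G
    'Y': [0, 0.5, 0.5, 0],          # Y (pyrimidine): C or T
    'N': [0.25, 0.25, 0.25, 0.25],  # N: any base
}
```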

'D':[0.333,0,0.333,0.334], 'X':[0.25,0.25,0.25,0.25], 'N':[0.25,0.25,0.25,0.25]
'K': [0, 0, 0.5, 0.5], 'M': [0.5, 0.5, 0, 0], 'R': [0.5, 0, 0, 0.5], 'Y': [0, 0.5, 0.5, 0], 'S': [0, 0.5, 0, 0.5],
'W': [0.5, 0, 0.5, 0], 'B': [0, 0.333, 0.333, 0.334], 'V': [0.333, 0.333, 0, 0.334], 'H': [0.333, 0.333, 0.334, 0],
'D': [0.333, 0, 0.333, 0.334], 'X': [0.25, 0.25, 0.25, 0.25], 'N': [0.25, 0.25, 0.25, 0.25]
}


# Annotation sets
ANNOTATIONS = {
Collaborator:

This would be better named ANNOTATIONS_SETS.

Contributor Author:

Done


CODE2CIGAR = 'MIDNSHP=XB'
CIGAR2CODE = dict([y, x] for x, y in enumerate(CODE2CIGAR))
CIGAR_CODE = {'M':0, 'I':1, 'D':2, 'N':3, 'S':4}
CIGAR_CODE = {'M': 0, 'I': 1, 'D': 2, 'N': 3, 'S': 4}
CIGAR_REGEX = re.compile("(\d+)([MIDNSHP=XB])")

SKIP_CHAR = '~'
INDEL_CHAR = '*'
SEPARATOR_CHAR = '\t'
Collaborator:

SEPARATOR_CHAR is, I think, intended to be the FIFO separator char, and would be better named to reflect that.

Contributor Author:

Done

@@ -59,40 +72,40 @@ def score_and_write_batch(args, model, file_out, batch_size, python_batch_size,

variant_data.append(fifo_data[0] + '\t' + fifo_data[1] + '\t' + fifo_data[2] + '\t' + fifo_data[3])
Collaborator:

The index values should be replaced with symbolic constants named for the fields, with comments added saying these need to be kept in sync with the code on the Java side.
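
A sketch of what those constants could look like, with illustrative names; the values must mirror the field order the Java side writes to the FIFO:

```python
FIFO_SEPARATOR = '\t'

# Field order written by the Java side: contig pos ref alt reference_string annotation variant_type
CONTIG_INDEX, POS_INDEX, REF_INDEX, ALT_INDEX = 0, 1, 2, 3
REF_STRING_INDEX, ANNOTATION_INDEX, VARIANT_TYPE_INDEX = 4, 5, 6
READ_FIELDS_START_INDEX = 7  # per-read data for 2D tensors begins here

def variant_data_key(fifo_data):
    """Join the locus fields of one FIFO record into a tab-delimited key."""
    return FIFO_SEPARATOR.join(
        fifo_data[i] for i in (CONTIG_INDEX, POS_INDEX, REF_INDEX, ALT_INDEX))
```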

Contributor Author:

done

variant_types.append(fifo_data[6].strip())

fidx = 7 # 7 Because above we parsed: contig pos ref alt reference_string annotation variant_type
if args.tensor_name in defines.TENSOR_MAPS_2D and len(fifo_data) > fidx:
fidx = 7 # 7 Because above we parsed: contig pos ref alt reference_string annotation variant_type
Collaborator:

add a symbolic constant

Contributor Author:

done

lucidtronix (Contributor Author):

@cmnbroad Thanks for the review, back to you!

cmnbroad (Collaborator):

Looks like we're mostly there, except for adding a model version, which @lucidtronix is working on, and Java downsampling, which probably won't make it.

cmnbroad (Collaborator) left a comment:

Thanks @lucidtronix. We should hold off merging until tests pass on this branch (current failures are unrelated) and master is back up (currently master is failing too). Then if we get a chance we can add Java downsampling if #5594 gets approved.
