Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. #7261

samuelklee · 2021-05-19T17:09:02Z

@fleharty this is a rebased version of the 17cadfa branch used to generate the Pf7 CNV call set. There are two minor changes: a) removing spurious negative dCR estimates reported by gCNV, which were negatively affecting genotyping of HRP2/3 deletions, and b) updating sklearn to the version used for clustering, so that we can reproduce everything exactly using just the GATK Docker. The latter change probably isn't absolutely necessary, but it doesn't seem to break anything, so I'm going to go ahead with it. We might want to update to an even more recent version later on (especially if we make any breaking/non-refactoring improvements to the malaria genotyping code after the initial PR), but unfortunately doing so slightly changes the clustering assignment for a few samples.

@mwalker174 @asmirnov239 we discussed the first change some time ago, but just a heads up.

samuelklee · 2021-06-09T14:49:09Z

@fleharty want to get this in before release?

fleharty

Looks good to me, though I would prefer to have more tests in the python code.

fleharty · 2021-06-11T03:36:47Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/models/model_denoising_calling.py

@@ -786,8 +786,7 @@ def __init__(self,
        # the expected number of erroneously mapped reads
        mean_mapping_error_correction_s = eps_mapping * read_depth_s * shared_workspace.average_ploidy_s

-        denoised_copy_ratio_st = ((shared_workspace.n_st - mean_mapping_error_correction_s.dimshuffle(0, 'x'))
-                                  / ((1.0 - eps_mapping) * read_depth_s.dimshuffle(0, 'x') * bias_st))
+        denoised_copy_ratio_st = shared_workspace.n_st / (read_depth_s.dimshuffle(0, 'x') * bias_st)


I feel like this change should have broken a test, but it looks like the only tests test the Python environment. Should we add tests (maybe not for this PR)?
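
To make the effect of the diff above concrete, here is a minimal sketch of the two estimates in plain Python rather than the Theano tensors used in gcnvkernel. The function names and the toy values (counts, read depths, ploidies, `eps_mapping = 0.01`) are assumptions chosen for illustration and are not taken from the GATK code base. It shows how subtracting the expected number of erroneously mapped reads can push the estimate below zero when an interval has very few reads, which is what would be expected over a deletion.

```python
def dcr_with_mapping_error(n_st, read_depth_s, bias_st, average_ploidy_s, eps_mapping):
    """Previous estimate: subtract the expected erroneously mapped reads, then rescale."""
    result = []
    for s, counts_t in enumerate(n_st):
        # per-sample correction, broadcast over intervals (the dimshuffle(0, 'x') in the diff)
        correction = eps_mapping * read_depth_s[s] * average_ploidy_s[s]
        result.append([
            (n - correction) / ((1.0 - eps_mapping) * read_depth_s[s] * bias_st[s][t])
            for t, n in enumerate(counts_t)
        ])
    return result


def dcr_without_mapping_error(n_st, read_depth_s, bias_st):
    """New estimate: counts divided by (read depth * bias), with no mapping-error term."""
    return [
        [n / (read_depth_s[s] * bias_st[s][t]) for t, n in enumerate(counts_t)]
        for s, counts_t in enumerate(n_st)
    ]


# Toy data (hypothetical): 2 samples x 3 intervals; the first interval of sample 0 has no reads.
n_st = [[0.0, 2.0, 100.0],
        [1.0, 50.0, 120.0]]
read_depth_s = [50.0, 60.0]
bias_st = [[1.0, 1.0, 1.0],
           [1.0, 1.0, 1.0]]
average_ploidy_s = [1.0, 1.0]
eps_mapping = 0.01

old_dcr = dcr_with_mapping_error(n_st, read_depth_s, bias_st, average_ploidy_s, eps_mapping)
new_dcr = dcr_without_mapping_error(n_st, read_depth_s, bias_st)

print("old estimate, zero-count interval:", old_dcr[0][0])  # slightly negative
print("new estimate, zero-count interval:", new_dcr[0][0])  # zero
```

With non-negative counts and positive depth and bias, the new estimate cannot be negative, whereas the old one is negative whenever the count falls below `eps_mapping * read_depth * average_ploidy`.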

samuelklee · 2021-06-11T12:57:51Z

Yup, as you might recall from some discussions with @bhanugandham and @mwalker174, getting automated pipeline-level CNV evaluations up and running was highest on my list before I handed over the role and went on paternity leave. I think these tests would be more useful than unit/integration tests for correctness, but they would almost certainly have to run on CARROT.

That said, the current level of unit/integration test coverage is a bit different from that for the somatic tools, because 1) it's difficult to run gCNV in any useful way on Travis infrastructure, and 2) we hadn't decided on a framework/convention for Python unit tests at the time the production code went in (although I think @ldgauthier has added some Python unit tests by now).

So we currently only have plumbing WDL/integration tests on very small data for gCNV, and these only test that the tools run, not that the results are correct. For somatic CNV, we have unit tests for correctness on small simulated data (e.g., for things like segmentation and modeling classes), but the integration tests don't cover correctness (and it would be pretty redundant to use the same simulated data for integration, so I'd rather put effort towards pipeline-level tests on real data).

It might be good for you and @mwalker174 to review the current level of testing coverage and understand where things need to be shored up. Happy to discuss more.
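
As a rough illustration of what a correctness-style unit test on small simulated data could look like for the dCR change, here is a pytest-style sketch. It is not a test from the GATK repository: `denoised_copy_ratio`, `simulate_counts`, and both test names are hypothetical, the formula is re-implemented in plain Python instead of Theano, and the simulated counts are noise-free, so it only checks the arithmetic of the estimate rather than the full model.

```python
import random


def denoised_copy_ratio(n_st, read_depth_s, bias_st):
    """Plain-Python stand-in for the updated estimate: n_st / (read_depth_s * bias_st)."""
    return [
        [n / (read_depth_s[s] * bias_st[s][t]) for t, n in enumerate(counts_t)]
        for s, counts_t in enumerate(n_st)
    ]


def simulate_counts(copy_ratio_st, read_depth_s, bias_st):
    """Noise-free expected counts for a given copy ratio (an assumption for this sketch)."""
    return [
        [read_depth_s[s] * bias_st[s][t] * c for t, c in enumerate(row)]
        for s, row in enumerate(copy_ratio_st)
    ]


def test_recovers_simulated_copy_ratios():
    rng = random.Random(0)  # fixed seed so the test is deterministic
    num_samples, num_intervals = 4, 10
    read_depth_s = [rng.uniform(20.0, 100.0) for _ in range(num_samples)]
    bias_st = [[rng.uniform(0.5, 1.5) for _ in range(num_intervals)]
               for _ in range(num_samples)]
    copy_ratio_st = [[rng.choice([0.0, 1.0, 2.0]) for _ in range(num_intervals)]
                     for _ in range(num_samples)]

    n_st = simulate_counts(copy_ratio_st, read_depth_s, bias_st)
    dcr_st = denoised_copy_ratio(n_st, read_depth_s, bias_st)

    for truth_t, dcr_t in zip(copy_ratio_st, dcr_st):
        for truth, dcr in zip(truth_t, dcr_t):
            assert abs(dcr - truth) < 1e-9


def test_zero_counts_give_zero_not_negative():
    # A full deletion (no reads) should yield exactly zero, never a negative value.
    dcr_st = denoised_copy_ratio([[0.0, 0.0]], [50.0], [[1.0, 1.2]])
    assert dcr_st == [[0.0, 0.0]]
```

Tests like these would catch a regression in the formula itself, but they would not replace the pipeline-level evaluations on real data described above.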

* Added a new suite of tools for variant filtering based on site-level annotations. (#7954)
* Adds wdl that tests joint VCF filtering tools (#7932)
* adding filtering wdl
* renaming pipeline
* addressing comments
* added bash
* renaming json
* adding glob to extract for extra files
* changing dollar signs
* small comments
* Added changes for specifying model backend and other tweaks to WDLs and environment.
* Added classes for representing a collection of labeled variant annotations.
* Added interfaces for modeling and scoring backends.
* Added a new suite of tools for variant filtering based on site-level annotations.
* Added integration tests.
* Added test resources and expected results.
* Miscellaneous changes.
* Removed non-ASCII characters.
* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>

* Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. (#8049)
* Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027)
* Small changes to JointVCFFiltering WDL
* making default for use_allele_specific_annotations
* addressing comments
* first stab
* wire through WDL changes
* fixed typo
* set model_backend input value
* add gatk_override to JointVcfFiltering call
* typo in indel_annotations
* make model_backend optional
* tabs and spaces
* make all model_backends optional
* use gatk 4.3.0
* no point in changing the table names as this is a POC
* adding new branch to dockstore
* adding in branching logic for classic VQSR vs VQSR-Lite
* implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files
* passing classic flag to indel run of CreateFilteringFiles
* Update GvsCreateFilterSet.wdl cleaning up verbiage
* Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)
* cleanup up sloppy comment

---------

Co-authored-by: samuelklee <samuelklee@users.noreply.github.com>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: Rebecca Asch <rasch@broadinstitute.org>

Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn.

cb2ab69

samuelklee force-pushed the sl_gcnv_dcr_error_rate branch from 17cadfa to cb2ab69 Compare May 19, 2021 17:37

samuelklee requested a review from fleharty May 19, 2021 17:42

fleharty approved these changes Jun 11, 2021

samuelklee merged commit ce9dccb into master Jun 11, 2021

samuelklee deleted the sl_gcnv_dcr_error_rate branch June 11, 2021 12:58

samuelklee mentioned this pull request Sep 17, 2021

Expose number of samples for emitting denoised copy ratios in gCNV. #5754

Closed

koncheto-broad pushed a commit that referenced this pull request Jan 23, 2023

Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)

bfb1a28
