Expose number of samples for emitting denoised copy ratios in gCNV. #5754

samuelklee · 2019-03-05T11:52:29Z

This is set to an unexposed 250 samples. Since the sampling and posterior estimation is done online to prevent the entire samples x intervals x dCR samples matrix from causing OOM, I think we suffer from lack of vectorization.

However, the good news is that we can probably get by with far fewer samples, say ~20) if all we want are estimates of posterior mean and standard deviation---not even sure if the latter will be used by downstream analyses anytime soon.

samuelklee · 2019-03-11T17:09:45Z

Sampling method optimized in #5781.

samuelklee · 2021-09-01T12:07:12Z

Default changed from 250 -> 20 in #5699 and exposed in #7450. Unfortunately, I don't recall if we did any benchmarking to spot check runtime/output changes, but I would expect 20 samples to be sufficient for estimating posterior means and standard deviations to the level needed for most downstream use cases.

fleharty · 2021-09-01T12:15:22Z

@samuelklee ahem... aren't you on vacation?

samuelklee · 2021-09-01T12:30:01Z

A bit of work on vacation here and there usually means I can take more vacations later!

mwalker174 · 2021-09-17T16:05:23Z

@samuelklee Can you briefly explain how the sampling is used in dCR approximations and how they are used downstream? I would expect only 20 samples to give noisy estimates of mean and variance…

samuelklee · 2021-09-17T17:14:31Z

The only place I’ve used them is in the malaria genotyping method, where we calculate very simple annotations for each set of breakpoints—mostly summary statistics of the posterior means, like the minimum in a segment or the segment-level mean. We also calculate a single changepoint statistic using the posterior means. Didn’t see any indicators that posterior sampling noise was causing any issues—the only real issue I had was fixed in #7261.

I think the initial reason we started emitting these is because Talkowski folks wanted to use them for plotting/visualization, in which case I would be even less worried about sampling noise. But maybe you all are using them for something else now? Even then, not sure if anyone is using the posterior variances.

If we ever move towards emitting this for somatic/mosaicism at the single-bin, percent level, then I might be a bit more worried—in which case, good for us for exposing it! But if you’d like to increase the default and it doesn’t seem to add significant runtime, go for it! Might also be good to get some understanding of the sampling noise level for typical data, if only for a few different parameter values.

mwalker174 · 2021-09-17T19:06:06Z

Thanks for that explanation. I'd rather set it to a safer value like 100 as a default - @asmirnov239 do you have any sense of how much this affects run time?

samuelklee · 2021-09-20T15:31:08Z

Looking at this again, plots in #5764 might be useful. Probably OK to bump things up since the rest of inference will take up the bulk of runtime. Although I think one other consideration might be the WDL tests, since they run slow on Travis and already push the time limit. Perhaps set a lower value for those, if necessary? In any case, please double check all of this!

mwalker174 · 2021-09-20T15:51:34Z

Interesting. We fixed the memory spike due to sampling correct? However, I wouldn't want to use more samples if it increases peak memory.

samuelklee · 2021-09-20T15:53:39Z

Yup! The sampling and calculation of means/stdevs is done online (I think @asmirnov239 added this) so it shouldn’t take additional memory to use more samples.

asmirnov239 · 2021-10-15T04:53:59Z

I have benchmarked several runs with various parameters for number of samples for emitting denoised copy ratio. In particular, with num_samples = 20, 200, 500.
The average shard run time for num_samples=20 was 2h34m, for num_samples=200 also 2h34m and for num_samples=500 it was 2h35m, so there is really no significant difference in the runtime.

Here are the concordance plots for different parameters for 3 1KG exome samples:

Since there is no effect on the runtime, and there is poor concordance for high denoised copy ratio values it makes sense to raise the default parameter to 200.

samuelklee · 2021-10-15T10:59:54Z

Thanks for making and documenting those plots, @asmirnov239! Does seem worthwhile to sample more if it makes no difference in the runtime. Just curious, what was the shard size?

Slightly counterintuitive that the high end is more noisy, but I guess it must be due to sampling noise of low bias---I'd expect more competition from reduced noise due to higher counts. Mind sharing the num_samples = 20 and 200 dCR files for at least one sample so I can take a quick look?

Might also be nice to see the posterior standard deviations, but if it's too much work to compile those it's fine. Perhaps we should just concatenate them anyway (I don't recall why we didn't originally add this in #5823, but maybe we had a reason).

In any case, the noise doesn't look crazy in the CR regime that would've been important for the CNV genotyping stuff---phew!

asmirnov239 · 2021-10-18T02:19:44Z

The shard size was what we usually use for exomes: 12500. By the way I put a cutoff on all sites with dCR > 50 for plotting these.

It makes sense to increase the default to 200 so I changed in in the PR #7450.

samuelklee · 2021-10-18T12:52:34Z

Thanks for that info and for sharing the files, @asmirnov239.

I suspect that there are essentially two types of bins: "nice" and "not so nice". The sampling noise in the former is determined by Poisson observation noise, whereas that in the latter is determined by uncertainty in the bias posteriors. This is a bit hard to see in the plots above, and even in this version where I tried to adjust the point size and alpha:

However, plotting a measure of the difference in the dCRs (from 20 and 200 posterior samples) vs. the dCR is more suggestive:

As are the dCR histograms:

I would guess that the nice spike around CR ~ 2 and the fatter base extending up to dCR ~ 100 are distinct populations of bins. So the punchline would be that differences at high dCR are probably just noise within the noise. For "nice" bins at dCR ~ few, the sampling noise looks to be <1%. Not really sure what's going on at very high dCR, but I think it's safe to say that these are "not so nice" bins!

I've seen this pattern in other WES cohorts when plotting the posterior means vs. std devs for the biases; tried to dig up the plots on Slack, but I can't find them at the moment. Perhaps something along those lines might be worth visualizing in your model-criticism notebooks, if you don't already?

Again, hard to say this is indeed the case from the dCRs alone, but if so, it might be worth baking this sort of mixture into future versions of the model or coming up with other strategies to deal with such bins.

asmirnov239 · 2021-10-19T17:46:06Z

@samuelklee that's interesting thanks for making these. I have only observed seen this behavior for 1kG samples so far. I can add the extra plot you mentioned to the plotting notebook. It would be good to characterize these bins and see if it possible to devise an a priori filter.

asmirnov239 · 2021-10-19T17:46:28Z

Closed by #7450

samuelklee added Germline CNV Copy Number tools labels Mar 5, 2019

samuelklee added this to Improvements / Miscellaneous in Germline Copy Number Variants Mar 5, 2019

This was referenced Mar 6, 2019

Investigate and address memory leak in gCNV. #5764

Closed

Improved memory usage in gCNV. #5781

Merged

samuelklee changed the title ~~Expose number of samples for emitting denoised copy ratios in gCNV or optimize sampling method.~~ Expose number of samples for emitting denoised copy ratios in gCNV. Mar 11, 2019

asmirnov239 self-assigned this Jan 8, 2020

asmirnov239 mentioned this issue Sep 1, 2021

Exposed number of samples used for estimating denoised copy ratios in gCNV #7450

Merged

asmirnov239 closed this as completed Oct 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose number of samples for emitting denoised copy ratios in gCNV. #5754

Expose number of samples for emitting denoised copy ratios in gCNV. #5754

samuelklee commented Mar 5, 2019 •

edited

samuelklee commented Mar 11, 2019

samuelklee commented Sep 1, 2021

fleharty commented Sep 1, 2021

samuelklee commented Sep 1, 2021

mwalker174 commented Sep 17, 2021 •

edited

samuelklee commented Sep 17, 2021 •

edited

mwalker174 commented Sep 17, 2021

samuelklee commented Sep 20, 2021

mwalker174 commented Sep 20, 2021

samuelklee commented Sep 20, 2021

asmirnov239 commented Oct 15, 2021

samuelklee commented Oct 15, 2021

asmirnov239 commented Oct 18, 2021

samuelklee commented Oct 18, 2021 •

edited

asmirnov239 commented Oct 19, 2021

asmirnov239 commented Oct 19, 2021

Expose number of samples for emitting denoised copy ratios in gCNV. #5754

Expose number of samples for emitting denoised copy ratios in gCNV. #5754

Comments

samuelklee commented Mar 5, 2019 • edited

samuelklee commented Mar 11, 2019

samuelklee commented Sep 1, 2021

fleharty commented Sep 1, 2021

samuelklee commented Sep 1, 2021

mwalker174 commented Sep 17, 2021 • edited

samuelklee commented Sep 17, 2021 • edited

mwalker174 commented Sep 17, 2021

samuelklee commented Sep 20, 2021

mwalker174 commented Sep 20, 2021

samuelklee commented Sep 20, 2021

asmirnov239 commented Oct 15, 2021

samuelklee commented Oct 15, 2021

asmirnov239 commented Oct 18, 2021

samuelklee commented Oct 18, 2021 • edited

asmirnov239 commented Oct 19, 2021

asmirnov239 commented Oct 19, 2021

samuelklee commented Mar 5, 2019 •

edited

mwalker174 commented Sep 17, 2021 •

edited

samuelklee commented Sep 17, 2021 •

edited

samuelklee commented Oct 18, 2021 •

edited