Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose number of samples for emitting denoised copy ratios in gCNV. #5754

Closed
samuelklee opened this issue Mar 5, 2019 · 16 comments
Closed

Comments

@samuelklee
Copy link
Contributor

samuelklee commented Mar 5, 2019

This is set to an unexposed 250 samples. Since the sampling and posterior estimation is done online to prevent the entire samples x intervals x dCR samples matrix from causing OOM, I think we suffer from lack of vectorization.

However, the good news is that we can probably get by with far fewer samples, say ~20) if all we want are estimates of posterior mean and standard deviation---not even sure if the latter will be used by downstream analyses anytime soon.

@samuelklee samuelklee added this to Improvements / Miscellaneous in Germline Copy Number Variants Mar 5, 2019
@samuelklee samuelklee changed the title Expose number of samples for emitting denoised copy ratios in gCNV or optimize sampling method. Expose number of samples for emitting denoised copy ratios in gCNV. Mar 11, 2019
@samuelklee
Copy link
Contributor Author

Sampling method optimized in #5781.

@samuelklee
Copy link
Contributor Author

Default changed from 250 -> 20 in #5699 and exposed in #7450. Unfortunately, I don't recall if we did any benchmarking to spot check runtime/output changes, but I would expect 20 samples to be sufficient for estimating posterior means and standard deviations to the level needed for most downstream use cases.

@fleharty
Copy link
Contributor

fleharty commented Sep 1, 2021

@samuelklee ahem... aren't you on vacation?

@samuelklee
Copy link
Contributor Author

A bit of work on vacation here and there usually means I can take more vacations later!

@mwalker174
Copy link
Contributor

mwalker174 commented Sep 17, 2021

@samuelklee Can you briefly explain how the sampling is used in dCR approximations and how they are used downstream? I would expect only 20 samples to give noisy estimates of mean and variance…

@samuelklee
Copy link
Contributor Author

samuelklee commented Sep 17, 2021

The only place I’ve used them is in the malaria genotyping method, where we calculate very simple annotations for each set of breakpoints—mostly summary statistics of the posterior means, like the minimum in a segment or the segment-level mean. We also calculate a single changepoint statistic using the posterior means. Didn’t see any indicators that posterior sampling noise was causing any issues—the only real issue I had was fixed in #7261.

I think the initial reason we started emitting these is because Talkowski folks wanted to use them for plotting/visualization, in which case I would be even less worried about sampling noise. But maybe you all are using them for something else now? Even then, not sure if anyone is using the posterior variances.

If we ever move towards emitting this for somatic/mosaicism at the single-bin, percent level, then I might be a bit more worried—in which case, good for us for exposing it! But if you’d like to increase the default and it doesn’t seem to add significant runtime, go for it! Might also be good to get some understanding of the sampling noise level for typical data, if only for a few different parameter values.

@mwalker174
Copy link
Contributor

Thanks for that explanation. I'd rather set it to a safer value like 100 as a default - @asmirnov239 do you have any sense of how much this affects run time?

@samuelklee
Copy link
Contributor Author

Looking at this again, plots in #5764 might be useful. Probably OK to bump things up since the rest of inference will take up the bulk of runtime. Although I think one other consideration might be the WDL tests, since they run slow on Travis and already push the time limit. Perhaps set a lower value for those, if necessary? In any case, please double check all of this!

@mwalker174
Copy link
Contributor

Interesting. We fixed the memory spike due to sampling correct? However, I wouldn't want to use more samples if it increases peak memory.

@samuelklee
Copy link
Contributor Author

Yup! The sampling and calculation of means/stdevs is done online (I think @asmirnov239 added this) so it shouldn’t take additional memory to use more samples.

@asmirnov239
Copy link
Collaborator

I have benchmarked several runs with various parameters for number of samples for emitting denoised copy ratio. In particular, with num_samples = 20, 200, 500.
The average shard run time for num_samples=20 was 2h34m, for num_samples=200 also 2h34m and for num_samples=500 it was 2h35m, so there is really no significant difference in the runtime.

Here are the concordance plots for different parameters for 3 1KG exome samples:
image

image

image

image

image

image

Since there is no effect on the runtime, and there is poor concordance for high denoised copy ratio values it makes sense to raise the default parameter to 200.

@samuelklee
Copy link
Contributor Author

Thanks for making and documenting those plots, @asmirnov239! Does seem worthwhile to sample more if it makes no difference in the runtime. Just curious, what was the shard size?

Slightly counterintuitive that the high end is more noisy, but I guess it must be due to sampling noise of low bias---I'd expect more competition from reduced noise due to higher counts. Mind sharing the num_samples = 20 and 200 dCR files for at least one sample so I can take a quick look?

Might also be nice to see the posterior standard deviations, but if it's too much work to compile those it's fine. Perhaps we should just concatenate them anyway (I don't recall why we didn't originally add this in #5823, but maybe we had a reason).

In any case, the noise doesn't look crazy in the CR regime that would've been important for the CNV genotyping stuff---phew!

@asmirnov239
Copy link
Collaborator

The shard size was what we usually use for exomes: 12500. By the way I put a cutoff on all sites with dCR > 50 for plotting these.

It makes sense to increase the default to 200 so I changed in in the PR #7450.

@samuelklee
Copy link
Contributor Author

samuelklee commented Oct 18, 2021

Thanks for that info and for sharing the files, @asmirnov239.

I suspect that there are essentially two types of bins: "nice" and "not so nice". The sampling noise in the former is determined by Poisson observation noise, whereas that in the latter is determined by uncertainty in the bias posteriors. This is a bit hard to see in the plots above, and even in this version where I tried to adjust the point size and alpha:

image

However, plotting a measure of the difference in the dCRs (from 20 and 200 posterior samples) vs. the dCR is more suggestive:

image

As are the dCR histograms:

image

I would guess that the nice spike around CR ~ 2 and the fatter base extending up to dCR ~ 100 are distinct populations of bins. So the punchline would be that differences at high dCR are probably just noise within the noise. For "nice" bins at dCR ~ few, the sampling noise looks to be <1%. Not really sure what's going on at very high dCR, but I think it's safe to say that these are "not so nice" bins!

I've seen this pattern in other WES cohorts when plotting the posterior means vs. std devs for the biases; tried to dig up the plots on Slack, but I can't find them at the moment. Perhaps something along those lines might be worth visualizing in your model-criticism notebooks, if you don't already?

Again, hard to say this is indeed the case from the dCRs alone, but if so, it might be worth baking this sort of mixture into future versions of the model or coming up with other strategies to deal with such bins.

@asmirnov239
Copy link
Collaborator

@samuelklee that's interesting thanks for making these. I have only observed seen this behavior for 1kG samples so far. I can add the extra plot you mentioned to the plotting notebook. It would be good to characterize these bins and see if it possible to devise an a priori filter.

@asmirnov239
Copy link
Collaborator

Closed by #7450

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Germline Copy Number Variants
Improvements / Miscellaneous
Development

No branches or pull requests

4 participants