
remove-background version 0.2.0 #71

Merged: 3 commits from sf_removebkg_v2.1 into master on Oct 16, 2020

Conversation

sjfleming (Member)

This PR represents a large change for remove-background: an intermediate step on our way to more finalized code, which will be published in an upcoming paper. This version is actively being used in many ongoing research projects.

Major changes from v0.1.0:

  • The model has been modified so that cell counts are drawn from a negative binomial and noise counts are Poisson-distributed. Observed count data is then explicitly the sum of cell counts and noise counts.
  • The output count matrix is constructed by posterior sampling of the real cell counts (which are marginalized over during training). This guarantees that the output counts never exceed the input counts for any entry in the count matrix.
  • A new parameter --fpr now controls the tradeoff between signal and noise that is implicit in any noise-removal procedure. Posterior regularization is used to generate an output count matrix that targets this "false positive rate" specified by the user. A "false positive" in this context is a real cell count that is erroneously removed. FPR of 0.01 means that the tool will remove as much noise as possible while controlling the removal of real signal to be no more than 1% in expectation.

Many other implementation details have been modified as well.
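
To make the first two bullets concrete, here is a rough generative sketch (illustrative only: the names and parameter values are placeholders, not the actual remove-background model code):

import pyro
import pyro.distributions as dist
import torch

def toy_generative_sketch():
    # Placeholder parameters for a single gene in a single droplet.
    mu = torch.tensor(5.0)     # mean of the real cell counts
    alpha = torch.tensor(2.0)  # negative binomial overdispersion
    lam = torch.tensor(0.5)    # mean of the background noise counts

    # Real cell counts: negative binomial (Gamma-Poisson parameterization).
    # In the actual model these are marginalized over during training.
    counts_cell = pyro.sample("counts_cell",
                              dist.GammaPoisson(concentration=alpha,
                                                rate=alpha / mu))

    # Background noise counts: Poisson.
    counts_noise = pyro.sample("counts_noise", dist.Poisson(lam))

    # The observed count is explicitly the sum of the two, so a posterior
    # estimate of counts_cell can never exceed the observed count.
    return counts_cell + counts_noise

On the command line, the signal/noise trade-off in the third bullet is exposed through the new --fpr flag, e.g. cellbender remove-background --input raw_feature_bc_matrix.h5 --output output.h5 --fpr 0.01 (the file names here are placeholders).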

@sjfleming sjfleming added this to the v0.2 milestone Sep 25, 2020
@sjfleming sjfleming self-assigned this Sep 25, 2020
mbabadi (Collaborator) commented Sep 30, 2020

@sjfleming could you squash and rebase before I review? I'd be totally fine with a giant commit, but feel free to cherry-pick and break it up into several commits if you think it makes sense.
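
(For reference, one way to do this locally, assuming master is the base branch: git fetch origin, then git rebase -i origin/master and mark all but the first commit as squash, then git push --force-with-lease.)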

mbabadi (Collaborator) commented Sep 30, 2020

@sjfleming actually, never mind -- I can do it on GitHub

mbabadi (Collaborator) commented Sep 30, 2020

@sjfleming mmmm nah, it doesn't work like that... you should do it yourself

mbabadi (Collaborator) left a comment:

Very nice code as usual :-) made a few mostly stylistic comments here and there.


# Prior on rho, the swapping fraction: the two concentration parameters alpha and beta.
RHO_ALPHA_PRIOR = 18.
mbabadi (Collaborator):

This is quite the magic number! :D how did you choose this?

sjfleming (Member, Author):

Admittedly these are rather arbitrary and need to be revisited for 0.2.1... the thinking was that around 8% swapping was close to the largest rate we'd typically see [18/(18+200)] (again, I'm lacking a real justification for that number), and that the relatively tight prior would encourage the model to explain as much noise as possible via swapping.
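
For reference, a quick numerical check of what this prior implies (a minimal sketch; RHO_BETA_PRIOR = 200. is assumed from the discussion above and is not shown in the diff):

import torch
from torch.distributions import Beta

rho_prior = Beta(torch.tensor(18.), torch.tensor(200.))
print(rho_prior.mean)    # ~0.083, i.e. roughly 8% swapping on average [18/(18+200)]
print(rho_prior.stddev)  # ~0.019, so the prior is indeed fairly tight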

mbabadi (Collaborator):

I see -- please make a GitHub issue about it so we won't forget to take care of it later.


# Prior on epsilon, the RT efficiency concentration parameter [Gamma(alpha, alpha)].
EPSILON_PRIOR = 500.
mbabadi (Collaborator):

Do you think the prior could be estimated from the surely-empty droplets? Just a thought -- maybe make an issue for future follow-up?

sjfleming (Member, Author):

I think that is a great idea, yes, I am experimenting with that now and I think it should be part of 0.2.1. I do think it makes sense mechanistically that epsilon is responsible for the dispersion in the empty droplet UMI counts, rather than differences in physical droplet size (which I think is pretty tightly controlled by the instrument).
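
For reference, a Gamma(alpha, alpha) prior with alpha = 500 pins epsilon near 1 with a small spread (a minimal sketch):

import torch
from torch.distributions import Gamma

eps_prior = Gamma(torch.tensor(500.), torch.tensor(500.))
print(eps_prior.mean)    # 1.0: RT efficiency is centered at 1
print(eps_prior.stddev)  # ~0.045 = 1/sqrt(500); most mass lies within ~[0.87, 1.13]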

mbabadi (Collaborator):

Please make a GitHub issue about it so we won't forget to take care of it later.

Resolved review threads:
  • cellbender/remove_background/data/dataprep.py (outdated)
  • cellbender/remove_background/data/dataset.py (outdated)
  • cellbender/remove_background/model.py (outdated)
  • cellbender/remove_background/train.py (outdated)
  • cellbender/remove_background/vae/encoder.py (three threads, two outdated)
mbabadi (Collaborator) commented Oct 1, 2020

@sjfleming back to you! :-)

sjfleming (Member, Author)

@mbabadi back to you!


if USE_NBPC:
    if self.use_exact_log_prob:

        # Sample gene expression from our Negative Binomial Poisson
        # Convolution distribution, and compare with observed data.
        c = pyro.sample("obs", NBPC(mu=mu_cell + 1e-10,
mbabadi (Collaborator):

1e-10 => consts (e.g. NBPC_MU_EPS_SAFEGUARD)
1e-10 => consts (e.g. NBPC_ALPHA_EPS_SAFEGUARD)
1e-10 => consts (e.g. NBPC_LAM_EPS_SAFEGUARD)

sjfleming (Member, Author):

Will do
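
A minimal sketch of the suggested change (the module path and final constant names are assumptions, following the naming proposed above):

# In cellbender/remove_background/consts.py (path assumed):
NBPC_MU_EPS_SAFEGUARD = 1e-10     # added to mu before constructing NBPC
NBPC_ALPHA_EPS_SAFEGUARD = 1e-10  # added to alpha
NBPC_LAM_EPS_SAFEGUARD = 1e-10    # added to lam

with the literal 1e-10 shifts in the model code replaced by references to these named constants.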

mbabadi (Collaborator) commented Oct 10, 2020

@sjfleming everything is looking good! Just made a small comment to move the 1e-10 safeguarding shifts in NBPC to consts; I have been bitten by various "small" epsilons in the past too many times -- it's good to have all of those in consts, just as a reminder that you are shifting things. Ideally, you want to make sure these shifts do not affect the results within a given accepted error tolerance. Let's make an issue to further investigate NBPC approximation edge cases. If we decide to move to a pure Poisson model, we can close that issue.

If internal tests are passing, feel free to rebase and merge (on github) :-)

Commits:

Added swapping model back in

Intermediate commit

Produces poster results for v2

Updates to include epsilon inference

Working status as of 20200114

Still a bug when working with some v3-chemistry / CellRanger v2 input files, where genes seem to get re-ordered. Still slow.

Faster and improved log prob, learning rate schedule

Rho as a latent variable, epsilon set to 1

Approx log_prob = big speedup; better cell prob encoder

Big speedup from an approximate log_prob computation. Important changes to the encoder and regularizers lead to better performance on difficult test data.

Prevent NaNs late in training

It was noticed that late in training (presumably due to underflow), alpha can include zero values, which produce NaNs. Also put tougher constraints on the rho parameters.

Solid on benchmarks

Code to include gene_ids, feature_types, antibodies

Data reading and writing

Write to CellRanger format matching the input file. Include gene_ids, and include everything in v3 needed for scanpy to read the output using sc.read_10x_h5().
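
(A minimal read-back check, with a placeholder file name:

import scanpy as sc
adata = sc.read_10x_h5("output.h5")  # reads the CellRanger v3-format output
print(adata)
)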

Fix "simple" model errors

Add z to latent encoder, and fixes for "simple" model

Stale import, and a gene_ids MTX fix

Output dataset in same format (v2 or v3) as input

Fix typo

Fix gene_id parse error for CellRanger v2 h5

Extend the range of epsilon, correct an error in lam

Calculation of lambda had an error in its use of epsilon.  Epsilon previously had very limited range.  Most of the responsibility was on d.  Now most of the responsibility is shifted back to epsilon.  This improves noise removal especially in the case of swapping.

Update posterior inference to FPR calculation

updates to lambda multiplier handling

global overdispersion, no Dirichlet

Eliminate use of deprecated scheduler epoch param, preclude lambda infinite loop

Remove cooldown from LR scheduler, as it is deprecated

Code cleanup, no substance changes

Default learning rate 1e-4

Minor logging tweak

Tiny change: make a constant

Cosmetic code cleanup

More code cleanup

Move values to consts
Update README.rst
@sjfleming sjfleming merged commit f93d67b into master Oct 16, 2020
@sjfleming sjfleming deleted the sf_removebkg_v2.1 branch October 16, 2020 22:50
Labels: none
Projects: remove-background (Awaiting triage)
2 participants