remove-background version 0.2.0 #71
Conversation
@sjfleming could you squash and rebase before I review? I'd be totally fine with a giant commit, but feel free to cherry-pick and break it up into several commits if you think it makes sense.
@sjfleming actually, nvm -- I can do it on github
@sjfleming mmmm nah, it doesn't work like that... you should do it yourself
Very nice code as usual :-) made a few mostly stylistic comments here and there.
# Prior on rho, the swapping fraction: the two concentration parameters alpha and beta.
RHO_ALPHA_PRIOR = 18.
This is quite the magic number! :D how did you choose this?
Admittedly these are rather arbitrary and need to be revisited for 0.2.1. The thinking was that around 8% swapping is close to the largest rate we'd typically see [18/(18+200)] (again, I'm lacking a real justification for that number), and that the relatively tight prior would encourage the model to explain as much noise as possible via swapping.
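As a quick sanity check on that choice, the closed-form Beta moments reproduce the numbers quoted above. This is a minimal sketch: `RHO_BETA_PRIOR = 200.` is inferred from the 18/(18+200) figure in the comment, not read directly from the diff.

```python
import math

# Concentration parameters of the Beta prior on rho, the swapping fraction.
# RHO_ALPHA_PRIOR mirrors the diff; RHO_BETA_PRIOR is taken from the
# 18/(18+200) figure quoted in the discussion.
RHO_ALPHA_PRIOR = 18.
RHO_BETA_PRIOR = 200.

def beta_mean(a: float, b: float) -> float:
    # Mean of Beta(a, b) is a / (a + b).
    return a / (a + b)

def beta_std(a: float, b: float) -> float:
    # Std of Beta(a, b) is sqrt(a*b / ((a+b)^2 * (a+b+1))).
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

mean = beta_mean(RHO_ALPHA_PRIOR, RHO_BETA_PRIOR)  # ~0.083, i.e. ~8% swapping
std = beta_std(RHO_ALPHA_PRIOR, RHO_BETA_PRIOR)    # ~0.019, a fairly tight prior
```

So the prior mass sits narrowly around an 8% swapping fraction, which is what "relatively tight prior" refers to.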
I see -- please make a github issue about it so we won't forget to take care of it later.
# Prior on epsilon, the RT efficiency concentration parameter [Gamma(alpha, alpha)].
EPSILON_PRIOR = 500.
Do you think the prior could be estimated from the surely empties? Just a thought -- maybe make an issue for future follow up?
I think that is a great idea, yes. I am experimenting with that now, and I think it should be part of 0.2.1. I do think it makes sense mechanistically that epsilon is responsible for the dispersion in the empty-droplet UMI counts, rather than differences in physical droplet size (which I think is pretty tightly controlled by the instrument).
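For context on what `EPSILON_PRIOR = 500.` encodes: a Gamma(alpha, alpha) prior has mean exactly 1 and standard deviation 1/sqrt(alpha). The sketch below also includes a hypothetical method-of-moments helper for the "estimate it from the surely empties" idea; `alpha_from_cv` is illustrative only, not part of the cellbender API.

```python
import math

# Gamma(shape=alpha, rate=alpha) prior on epsilon, the per-droplet
# RT-efficiency scaling factor.
EPSILON_PRIOR = 500.

def gamma_aa_mean(alpha: float) -> float:
    # Mean of Gamma(shape, rate) is shape / rate, so here it is exactly 1.
    return alpha / alpha

def gamma_aa_std(alpha: float) -> float:
    # Variance of Gamma(shape, rate) is shape / rate^2 = 1 / alpha.
    return math.sqrt(1.0 / alpha)

# Hypothetical sketch of the reviewer's suggestion: if epsilon drives the
# overdispersion of empty-droplet totals, its concentration could be set by
# method of moments from an empirical coefficient of variation, alpha = 1/CV^2.
def alpha_from_cv(cv: float) -> float:
    return 1.0 / cv ** 2
```

With alpha = 500, the prior says epsilon is 1 give or take about 4.5%, i.e. a fairly narrow spread of RT efficiencies across droplets.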
please make a github issue about it so we won't forget to take care of it later.
@sjfleming back to you! :-)
@mbabadi back to you!
if USE_NBPC:
    if self.use_exact_log_prob:

# Sample gene expression from our Negative Binomial Poisson
# Convolution distribution, and compare with observed data.
c = pyro.sample("obs", NBPC(mu=mu_cell + 1e-10,
1e-10 => consts (e.g. NBPC_MU_EPS_SAFEGUARD)
1e-10 => consts (e.g. NBPC_ALPHA_EPS_SAFEGUARD)
1e-10 => consts (e.g. NBPC_LAM_EPS_SAFEGUARD)
Will do
@sjfleming everything is looking good! Just made a small comment to move the 1e-10 safeguarding shifts in NBPC to consts; I have been bitten by various "small" epsilons too many times in the past -- it's good to have all of those in consts, as a reminder that you are shifting things. Ideally, you want to make sure these shifts do not affect the results within a given accepted error tolerance. Let's make an issue to further investigate NBPC approximation edge cases. If we decide to move to a pure Poisson model, we can close that issue. If internal tests are passing, feel free to rebase and merge (on github) :-)
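One possible shape for the requested change, collecting the 1e-10 shifts in a consts module. Names follow the reviewer's suggestion with spelling normalized; this is an illustrative sketch, not the actual cellbender consts module.

```python
# Small positive shifts that keep NBPC parameters strictly positive before
# log_prob evaluation. Centralizing them here documents that values are
# being shifted, and makes the shift sizes easy to audit.
NBPC_MU_EPS_SAFEGUARD = 1e-10     # shift on the negative binomial mean, mu
NBPC_ALPHA_EPS_SAFEGUARD = 1e-10  # shift on the NB concentration, alpha
NBPC_LAM_EPS_SAFEGUARD = 1e-10    # shift on the Poisson rate, lam

def safeguarded(value: float, eps: float) -> float:
    """Shift a nonnegative parameter away from zero (hypothetical helper)."""
    return value + eps
```

The call site would then read e.g. `mu=mu_cell + NBPC_MU_EPS_SAFEGUARD` instead of a bare `+ 1e-10`, so the shift is named and greppable.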
Added swapping model back in
Intermediate commit
Produces poster results for v2
Updates to include epsilon inference
Working status as of 20200114 (still a bug working with some v3 chem v2 cellranger input files where genes seem to get re-ordered? Still slow)
Faster and improved log prob, learning rate schedule
Rho as a latent variable, epsilon set to 1
Approx log_prob = big speedup; better cell prob encoder (big speedup from an approximate log_prob computation; important changes with the encoder and regularizers lead to better performance on difficult test data)
Prevent NaNs late in training (it was noticed that, presumably due to underflow, alpha can include zero values late in training, which become NaN; also put tougher constraints on rho params)
Solid on benchmarks
Code to include gene_ids, feature_types, antibodies
Data reading and writing (write to cellranger format matching input file; include gene_ids and all the things in v3 needed for scanpy to read it using sc.read_10x_h5())
Fix "simple" model errors
Add z to latent encoder, and fixes for "simple" model
Stale import, and a gene_ids MTX fix
Output dataset in same format (v2 or v3) as input
Fix typo
Fix gene_id parse error for CellRanger v2 h5
Extend the range of epsilon, correct an error in lam (calculation of lambda had an error in its use of epsilon; epsilon previously had a very limited range and most of the responsibility was on d; now most of the responsibility is shifted back to epsilon, which improves noise removal especially in the case of swapping)
Update posterior inference to FPR calculation
updates to lambda multiplier handling
global overdispersion, no Dirichlet
Eliminate use of deprecated scheduler epoch param, preclude lambda infinite loop
Remove cooldown from LR scheduler, as it is deprecated
Code cleanup, no substance changes
Default learning rate 1e-4
Minor logging tweak
Tiny change: make a constant
Cosmetic code cleanup
More code cleanup
Move values to consts
Update README.rst
An intermediate step on our way to more-finalized code which will be published in an upcoming paper, this PR represents a large change for remove-background, and this version is actively being used in many ongoing research projects.

Major changes from v0.1.0:

--fpr now controls the tradeoff between signal and noise that is implicit in any noise-removal procedure. Posterior regularization is used to generate an output count matrix that targets this "false positive rate" specified by the user. A "false positive" in this context is a real cell count that is erroneously removed. An FPR of 0.01 means that the tool will remove as much noise as possible while controlling the removal of real signal to be no more than 1% in expectation.

Many other implementation details have been modified as well.
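A toy illustration of the FPR semantics described above; `realized_fpr` is a hypothetical helper for exposition, not part of the remove-background CLI or codebase.

```python
# "False positive rate" here means: of the counts that truly belong to a
# cell's real signal, what fraction did the noise-removal step erroneously
# remove? The --fpr flag targets this quantity in expectation.
def realized_fpr(true_signal_counts: int, real_counts_removed: int) -> float:
    # Fraction of real counts removed; 0.01 corresponds to --fpr 0.01.
    return real_counts_removed / true_signal_counts

# Example: a cell with 10,000 true signal counts from which 100 real counts
# were removed has a realized FPR of 1%.
fpr = realized_fpr(10_000, 100)
```

In other words, --fpr does not bound the noise left behind; it bounds the expected collateral loss of real signal while the tool removes as much noise as it can.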
now controls the tradeoff between signal and noise that is implicit in any noise-removal procedure. Posterior regularization is used to generate an output count matrix that targets this "false positive rate" specified by the user. A "false positive" in this context is a real cell count that is erroneously removed. FPR of 0.01 means that the tool will remove as much noise as possible while controlling the removal of real signal to be no more than 1% in expectation.Many other implementation details have been modified as well.