
remove-background version 0.2.0 #71

Merged: 3 commits from sf_removebkg_v2.1 into master on Oct 16, 2020

Conversation

sjfleming (Member)

This PR represents a large change for remove-background: an intermediate step on our way to more finalized code, which will be published in an upcoming paper. This version is actively being used in many ongoing research projects.

Major changes from v0.1.0:

  • The model has been modified so that cell counts are drawn from a negative binomial and noise counts are Poisson-distributed. Observed count data is then explicitly the sum of cell counts and noise counts.
  • The output count matrix is constructed by posterior sampling of the real cell counts (which are marginalized over during training). This guarantees that the output counts never exceed the input counts for any entry in the count matrix.
  • A new parameter --fpr now controls the tradeoff between signal and noise that is implicit in any noise-removal procedure. Posterior regularization is used to generate an output count matrix that targets this "false positive rate" specified by the user. A "false positive" in this context is a real cell count that is erroneously removed. FPR of 0.01 means that the tool will remove as much noise as possible while controlling the removal of real signal to be no more than 1% in expectation.

Many other implementation details have been modified as well.
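
To make the first two bullets concrete, here is a rough generative sketch (illustrative only: the names and parameter values are placeholders, not the actual remove-background model code):

import pyro
import pyro.distributions as dist
import torch

def toy_generative_sketch():
    # Placeholder parameters for a single gene in a single droplet.
    mu = torch.tensor(5.0)     # mean of the real cell counts
    alpha = torch.tensor(2.0)  # negative binomial overdispersion
    lam = torch.tensor(0.5)    # mean of the background noise counts

    # Real cell counts: negative binomial (Gamma-Poisson parameterization).
    # In the actual model these are marginalized over during training.
    counts_cell = pyro.sample("counts_cell",
                              dist.GammaPoisson(concentration=alpha,
                                                rate=alpha / mu))

    # Background noise counts: Poisson.
    counts_noise = pyro.sample("counts_noise", dist.Poisson(lam))

    # The observed count is explicitly the sum of the two, so a posterior
    # estimate of counts_cell can never exceed the observed count.
    return counts_cell + counts_noise

On the command line, the signal/noise trade-off in the third bullet is exposed through the new --fpr flag, e.g. cellbender remove-background --input raw_feature_bc_matrix.h5 --output output.h5 --fpr 0.01 (the file names here are placeholders).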

@sjfleming sjfleming added this to the v0.2 milestone Sep 25, 2020
@sjfleming sjfleming self-assigned this Sep 25, 2020
mbabadi (Collaborator) commented Sep 30, 2020

@sjfleming could you squash and rebase before I review? I'd be totally fine with a giant commit, but feel free to cherry-pick and break it up into several commits if you think it makes sense.
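
(For reference, one way to do this locally, assuming master is the base branch: git fetch origin, then git rebase -i origin/master and mark all but the first commit as squash, then git push --force-with-lease.)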

mbabadi (Collaborator) commented Sep 30, 2020

@sjfleming actually, never mind -- I can do it on GitHub

mbabadi (Collaborator) commented Sep 30, 2020

@sjfleming mmmm nah, it doesn't work like that... you should do it yourself

mbabadi (Collaborator) left a comment:

Very nice code as usual :-) made a few mostly stylistic comments here and there.


# Prior on rho, the swapping fraction: the two concentration parameters alpha and beta.
RHO_ALPHA_PRIOR = 18.
mbabadi (Collaborator):

This is quite the magic number! :D how did you choose this?

sjfleming (Member, Author):

Admittedly these are rather arbitrary and need to be revisited for 0.2.1... the thinking was that around 8% swapping was close to the largest rate we'd typically see [18/(18+200)] (again, I'm lacking a real justification for that number), and that the relatively tight prior would encourage the model to explain as much noise as possible via swapping.
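
For reference, a quick numerical check of what this prior implies (a minimal sketch; RHO_BETA_PRIOR = 200. is assumed from the discussion above and is not shown in the diff):

import torch
from torch.distributions import Beta

rho_prior = Beta(torch.tensor(18.), torch.tensor(200.))
print(rho_prior.mean)    # ~0.083, i.e. roughly 8% swapping on average [18/(18+200)]
print(rho_prior.stddev)  # ~0.019, so the prior is indeed fairly tight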

mbabadi (Collaborator):

I see -- please make a GitHub issue about it so we won't forget to take care of it later.


# Prior on epsilon, the RT efficiency concentration parameter [Gamma(alpha, alpha)].
EPSILON_PRIOR = 500.
mbabadi (Collaborator):

Do you think the prior could be estimated from the surely-empty droplets? Just a thought -- maybe make an issue for future follow-up?

sjfleming (Member, Author):

I think that is a great idea, yes, I am experimenting with that now and I think it should be part of 0.2.1. I do think it makes sense mechanistically that epsilon is responsible for the dispersion in the empty droplet UMI counts, rather than differences in physical droplet size (which I think is pretty tightly controlled by the instrument).
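
For reference, a Gamma(alpha, alpha) prior with alpha = 500 pins epsilon near 1 with a small spread (a minimal sketch):

import torch
from torch.distributions import Gamma

eps_prior = Gamma(torch.tensor(500.), torch.tensor(500.))
print(eps_prior.mean)    # 1.0: RT efficiency is centered at 1
print(eps_prior.stddev)  # ~0.045 = 1/sqrt(500); most mass lies within ~[0.87, 1.13]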

mbabadi (Collaborator):

Please make a GitHub issue about it so we won't forget to take care of it later.

Resolved review threads:
  • cellbender/remove_background/data/dataprep.py (outdated)
  • cellbender/remove_background/data/dataset.py (outdated)
  • cellbender/remove_background/model.py (outdated)
  • cellbender/remove_background/train.py (outdated)
  • cellbender/remove_background/vae/encoder.py (three threads, two outdated)
mbabadi (Collaborator) commented Oct 1, 2020

@sjfleming back to you! :-)

sjfleming (Member, Author)

@mbabadi back to you!


if USE_NBPC:
    if self.use_exact_log_prob:

        # Sample gene expression from our Negative Binomial Poisson
        # Convolution distribution, and compare with observed data.
        c = pyro.sample("obs", NBPC(mu=mu_cell + 1e-10,
mbabadi (Collaborator):

1e-10 => consts (e.g. NBPC_MU_EPS_SAFEGUARD)
1e-10 => consts (e.g. NBPC_ALPHA_EPS_SAFEGUARD)
1e-10 => consts (e.g. NBPC_LAM_EPS_SAFEGUARD)

sjfleming (Member, Author):

Will do
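
A minimal sketch of the suggested change (the module path and final constant names are assumptions, following the naming proposed above):

# In cellbender/remove_background/consts.py (path assumed):
NBPC_MU_EPS_SAFEGUARD = 1e-10     # added to mu before constructing NBPC
NBPC_ALPHA_EPS_SAFEGUARD = 1e-10  # added to alpha
NBPC_LAM_EPS_SAFEGUARD = 1e-10    # added to lam

with the literal 1e-10 shifts in the model code replaced by references to these named constants.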

mbabadi (Collaborator) commented Oct 10, 2020

@sjfleming everything is looking good! Just made a small comment to move the 1e-10 safeguarding shifts in NBPC to consts; I have been bitten by various "small" epsilons in the past too many times -- it's good to have all of those in consts, just as a reminder that you are shifting things. Ideally, you want to make sure these shifts do not affect the results within a given accepted error tolerance. Let's make an issue to further investigate NBPC approximation edge cases. If we decide to move to a pure Poisson model, we can close that issue.

If internal tests are passing, feel free to rebase and merge (on github) :-)

Commits:

Added swapping model back in

Intermediate commit

Produces poster results for v2

Updates to include epsilon inference

Working status as of 20200114

Still a bug when working with some v3-chemistry / CellRanger v2 input files, where genes seem to get re-ordered. Still slow.

Faster and improved log prob, learning rate schedule

Rho as a latent variable, epsilon set to 1

Approx log_prob = big speedup; better cell prob encoder

Big speedup from an approximate log_prob computation. Important changes to the encoder and regularizers lead to better performance on difficult test data.

Prevent NaNs late in training

It was noticed that late in training (presumably due to underflow), alpha can include zero values, which produce NaNs. Also put tougher constraints on the rho parameters.

Solid on benchmarks

Code to include gene_ids, feature_types, antibodies

Data reading and writing

Write to CellRanger format matching the input file. Include gene_ids, and include everything in v3 needed for scanpy to read the output using sc.read_10x_h5().
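
(A minimal read-back check, with a placeholder file name:

import scanpy as sc
adata = sc.read_10x_h5("output.h5")  # reads the CellRanger v3-format output
print(adata)
)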

Fix "simple" model errors

Add z to latent encoder, and fixes for "simple" model

Stale import, and a gene_ids MTX fix

Output dataset in same format (v2 or v3) as input

Fix typo

Fix gene_id parse error for CellRanger v2 h5

Extend the range of epsilon, correct an error in lam

Calculation of lambda had an error in its use of epsilon.  Epsilon previously had very limited range.  Most of the responsibility was on d.  Now most of the responsibility is shifted back to epsilon.  This improves noise removal especially in the case of swapping.

Update posterior inference to FPR calculation

updates to lambda multiplier handling

global overdispersion, no Dirichlet

Eliminate use of deprecated scheduler epoch param, preclude lambda infinite loop

Remove cooldown from LR scheduler, as it is deprecated

Code cleanup, no substance changes

Default learning rate 1e-4

Minor logging tweak

Tiny change: make a constant

Cosmetic code cleanup

More code cleanup

Move values to consts
Update README.rst
@sjfleming sjfleming merged commit f93d67b into master Oct 16, 2020
@sjfleming sjfleming deleted the sf_removebkg_v2.1 branch October 16, 2020 22:50
Labels: none
Projects: remove-background (Awaiting triage)
2 participants