Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange results on 10X hgmm10k_v3 dataset #51

Closed
nh3 opened this issue Jan 29, 2020 · 9 comments · Fixed by #53
Closed

Strange results on 10X hgmm10k_v3 dataset #51

nh3 opened this issue Jan 29, 2020 · 9 comments · Fixed by #53
Labels
bug Something isn't working

Comments

@nh3
Copy link

nh3 commented Jan 29, 2020

Hi,

While CellBender works as expected on 10X hgmm12k (v2), on 10X hgmm10k (v3), it strangely removes large mouse gene counts and adds large human gene counts to mouse cells. 10X hgmm5k (v3) gives similar unexpected results as hgmm10k (v3). Please see logs and plots (hgmm12k and hgmm10k only) below:

hgmm12k, v2

  1. Log:
cellbender:remove-background: Command:                                                                                                                                                                                                                                   
cellbender remove-background --input data/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5 --output data/cellbender/hgmm_12k_raw_gene_bc_matrices_h5.cellbender.h5 --expected-cells 12000 --total-droplets-included 22000 --epochs 150 --cuda
cellbender:remove-background: 2020-01-29 12:36:14
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5
cellbender:remove-background: CellRanger v2 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 199
cellbender:remove-background: Prior on counts for cells is 13864
cellbender:remove-background: Excluding barcodes with counts below 159
cellbender:remove-background: Using 12000 probable cell barcodes, plus an additional 10000 barcodes, and 48062 empty droplets.
  1. Elbow plot, vertical line marks --expected-cells and --total-droplets-included:
    image

  2. Before correction (called cells):
    image

  3. After correction (called cells):
    image

  4. Convergence:
    image

hgmm10k, v3

  1. Log:
cellbender:remove-background: Command:                                                                                                                                                                                                                                   
cellbender remove-background --input data/hgmm_10k/hgmm_10k_v3_raw_feature_bc_matrix.h5 --output data/cellbender/hgmm_10k_v3_raw_feature_bc_matrix.cellbender.h5 --expected-cells 10000 --total-droplets-included 20000 --epochs 150 --cuda
cellbender:remove-background: 2020-01-29 09:31:14
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data/hgmm_10k/hgmm_10k_v3_raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 444
cellbender:remove-background: Prior on counts for cells is 19036
cellbender:remove-background: Excluding barcodes with counts below 355
cellbender:remove-background: Using 10000 probable cell barcodes, plus an additional 10000 barcodes, and 56957 empty droplets.
  1. Elbow plot, vertical line marks --expected-cells and --total-droplets-included:
    image

  2. Before correction (called cells):
    image

  3. After correction (called cells):
    image

  4. Convergence:
    image

@sjfleming
Copy link
Member

sjfleming commented Feb 4, 2020

I believe I have a parsing error in the newer v3 format HDF5 files from CellRanger that involve multiple genomes! I will track down this bug asap.

Thanks for reporting. I think what you're seeing there is essentially a garbled output due to an input parsing error.

@sjfleming
Copy link
Member

For an urgent workaround, I believe you can input your data using the CellRanger mtx directory format, and then even the v3 multiple-genome data should be parsed correctly. But this is a hunch, and I still need to try it myself. Either way, I will be working on fixing that bug soon.

@nh3
Copy link
Author

nh3 commented Feb 4, 2020

Thank you for the reply! If what I saw was simply garbled output, I would expect to see some cells with high mouse gene counts. The fact that 1) 90% of mouse gene counts are removed from all cells, 2) tens of thousands human gene counts are added to cells that originally had only hundreds, and 3) the inferred priors and cutoffs look correct, makes me suspect it might be due to something else.

@nh3
Copy link
Author

nh3 commented Feb 4, 2020

Tried supplying mtx and got exactly the same result.

The code for loading/parsing input seems alright. Though it doesn't read "/matrix/features/genome", the inference shouldn't care about an extra gene label, should it?

elif cellranger_version == 3:
# Read in data for this genome, and put it into a
# scipy.sparse.csc.csc_matrix
barcodes = getattr(f.root.matrix, 'barcodes').read()
data = getattr(f.root.matrix, 'data').read()
indices = getattr(f.root.matrix, 'indices').read()
indptr = getattr(f.root.matrix, 'indptr').read()
shape = getattr(f.root.matrix, 'shape').read()
csc_list.append(sp.csc_matrix((data, indices, indptr),
shape=shape))
# Read in 'feature' information
feature_group = f.get_node(f.root.matrix, 'features')
feature_types = getattr(feature_group,
'feature_type').read()
feature_names = getattr(feature_group, 'name').read()
feature_ids = getattr(feature_group, 'id').read()
# The only 'feature' we want is 'Gene Expression'
is_gene_expression = (feature_types == b'Gene Expression')
gene_names.extend(feature_names[is_gene_expression])
gene_ids.extend(feature_ids[is_gene_expression])
# Excise other 'features' from the count matrix
gene_feature_inds = np.where(is_gene_expression)[0]
csc_list[-1] = csc_list[-1][gene_feature_inds, :]

What I notice is that in hgmm5k_v3 and hgmm10k_v3, nUMI per cell is distinctively lower in mouse cells than human cells, whereas in hgmm12k_v2 the difference is smaller. See plots below (blue: human, green: mouse, green: empty droplets)

  1. hgmm10k_v3
    image

  2. hgmm5k_v3
    image

  3. hgmm12k_v2
    image

Could it be that this distribution somehow confused the method to (partially) model empty droplets out of mouse cells?

@sjfleming
Copy link
Member

I will look into that, but the fact that you gave it the --expected-cells parameter should enable it to figure out a good prior on cell counts that can cover both human and mouse...

@sjfleming sjfleming added the bug Something isn't working label Feb 4, 2020
@mtvector
Copy link

mtvector commented Feb 4, 2020

Just wanted to add that the HgMm mix datasets are sometimes problematic in assessing ambient RNAs, as I've found 0%-3% of UMIs (depending on cell type) will end up in the wrong species cells simply due to mismapping thanks to genome, and annotation and sequencing error.

@sjfleming
Copy link
Member

sjfleming commented Feb 5, 2020

Found it. It was coming from the use of the datatype uint16 to store gene indices during the creation of the output sparse count matrix...
I guess at some point way back, I thought, "There won't be transcriptomes with more than 65k genes, right?" Not right.

I will push a fix for this soon.

@sjfleming
Copy link
Member

This run might not have totally converged, but this is the result of running

cellbender remove-background --input 10k_hgmm_v3_nextgem_raw_feature_bc_matrix.h5 --output 10k_hgmm_v3_nextgem_out.h5 --cuda --expected-cells 10000 --total-droplets-included 20000 --epochs 300 --z-dim 20 --z-layers 100

image

image

image

@nh3
Copy link
Author

nh3 commented Feb 7, 2020

Thank you for the quick fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants