sequence depth issues in algorithms of LSI, LSA, LDA, PCA for dimension reduction #60

Closed
wangmeijiao opened this issue Mar 27, 2020 · 1 comment

wangmeijiao commented Mar 27, 2020

Hi all,

The sequencing depth of single cells can be an important factor that hinders true discoveries such as cell type identification and pseudo-time path calculation. As far as I know, many scATAC tools (cisTopic with LDA, Signac with LSI, cellranger-atac with LSA, episcanpy with PCA) have difficulty producing a reliable dimension-reduced clustering space without pre-filtering low-depth cells (correct me if I am missing something).
However, in some cases cells may genuinely show fewer ATAC fragments (or lower UMI counts) for biological reasons. Precisely distinguishing those cells from broken cells is therefore a real challenge. There is very little information about this issue (it is mentioned in stuart-lab/signac#106), and I think it is an important question that many researchers will be interested in.
In my case, I compared the UMAP plots before and after removing the first four dimensions (the first dimension is indeed correlated with sequencing depth; I excluded the first four to be safe). The shape of the scatterplot looks similar, and the positions of the low-depth cell clusters (not too low, at least 3k fragments per cell after pre-filtering) remain largely unchanged.
To summarize my question: how should cells with low depth be handled to avoid false positive results while keeping real cells? Any suggestions are welcome.

@DaneseAnna
Collaborator

Hi,

This is a very interesting and complex question. I will try to give an answer/discussion.

Let me break your question down into smaller ones:
a) How to distinguish actual cell types with very little open chromatin from low-quality barcodes (these may not even be cells, so I am not calling them cells in this case).
b) How to identify these low-depth cell types when the first component seems to be library size.

Technical low depth in barcodes/cells can be caused by two things:
- how many reads were sequenced from the library
- how many insertions the transposase actually managed to make in a cell.
Biological low depth can be due to:
- a very closed chromatin state in the cell type
- a cell type/nucleus that does not withstand the protocol, or whose chromatin is harder for the transposase to insert into

About a):
All cells, independently of cell type, have some regions that should systematically be open (RNApol2, other ubiquitous genes). So we can expect a minimum number of insertions per cell.
This can be explored by looking at the QC and checking, for example, TSS enrichment at housekeeping genes.
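
As a rough illustration of the TSS enrichment check, here is a minimal sketch. It assumes you have already aggregated insertion counts per base position in a window centred on the TSSs of interest (for one barcode or a group of barcodes). The function name `tss_enrichment` and the parameters `flank` and `center_half` are hypothetical, and pipelines differ in how exactly they define this score.

```python
import numpy as np

def tss_enrichment(profile, flank=100, center_half=50):
    """Ratio of mean signal around the window centre to mean signal in the flanks.

    profile: 1-D array of insertion counts per position, centred on the TSS
             (assumed to be pre-aggregated over the chosen TSSs).
    flank: number of positions at each end used as background (assumption).
    center_half: half-width of the central window (assumption).
    """
    profile = np.asarray(profile, dtype=float)
    # Background: the outermost `flank` positions on both sides of the window
    background = np.concatenate([profile[:flank], profile[-flank:]]).mean()
    if background == 0:
        return float("nan")  # no background signal, score is undefined
    center = len(profile) // 2
    signal = profile[center - center_half:center + center_half + 1].mean()
    return signal / background

# Toy profile: flat background with a plateau around the TSS
toy = np.ones(4001)
toy[1900:2101] = 5.0
print(tss_enrichment(toy))
```

A barcode with very few fragments but a clear enrichment at housekeeping TSSs is more likely to be a real cell than one whose profile is flat, although any threshold has to be chosen per dataset.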

About b):

Usually, if you are using peaks, you will identify peaks from highly covered cells, so you will lose the low-depth cells as noisy (having more than x percent of their reads outside of peaks). You can try to use an annotation-based feature space to keep some of the biological signal.
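
To make the "x percent of reads outside of peaks" filter concrete, here is a small sketch. It assumes a cell-by-peak count matrix and a per-cell total fragment count are already available; the names `fraction_in_peaks` and `flag_noisy` and the cutoff `max_outside` are hypothetical, not part of any particular tool.

```python
import numpy as np

def fraction_in_peaks(peak_counts, total_fragments):
    """Per-cell fraction of fragments that fall inside the peak set."""
    peak_counts = np.asarray(peak_counts, dtype=float)
    total = np.asarray(total_fragments, dtype=float)
    in_peaks = peak_counts.sum(axis=1)
    # Cells with zero fragments get a fraction of 0 instead of a division error
    return np.divide(in_peaks, total, out=np.zeros_like(total), where=total > 0)

def flag_noisy(frip, max_outside=0.7):
    """True for cells with more than `max_outside` of their fragments outside peaks."""
    return (1.0 - np.asarray(frip, dtype=float)) > max_outside

# Toy example: 3 cells, 2 peaks
counts = np.array([[2, 3], [0, 1], [0, 0]])
totals = np.array([10, 10, 0])
frip = fraction_in_peaks(counts, totals)
print(frip, flag_noisy(frip))
```

If the peak set was called mostly from highly covered cells, a rare low-depth cell type can be flagged here simply because its specific regions are not in the peak set, which is why an annotation-based feature space may be worth trying.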

You can also focus on the lowly covered cells and try a different feature space, like promoter regions or small windows, to see if some regions are enriched in the lowly covered cells and might be cell type specific. Once you have done that, you can decide on a feature space containing the cell-type-specific regions and look at all the cells together.
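
One possible way to look for regions enriched in the lowly covered cells is sketched below. It assumes a binarized cell-by-feature matrix (promoters or windows) and a per-cell depth vector; `low_depth_enrichment`, `depth_cutoff` and `pseudo` are hypothetical names, and the score is only a crude ranking, not a statistical test.

```python
import numpy as np

def low_depth_enrichment(binary, depth, depth_cutoff, pseudo=1e-3):
    """log2 ratio of relative feature openness in low-depth vs other cells.

    Each group's openness is divided by that group's average openness, so the
    overall depth difference between the groups is roughly cancelled out.
    """
    binary = np.asarray(binary, dtype=float)
    low = np.asarray(depth) < depth_cutoff
    if low.sum() == 0 or (~low).sum() == 0:
        raise ValueError("need cells on both sides of depth_cutoff")
    open_low = binary[low].mean(axis=0)    # fraction of low-depth cells open per feature
    open_high = binary[~low].mean(axis=0)  # same for the remaining cells
    rel_low = (open_low + pseudo) / (open_low.mean() + pseudo)
    rel_high = (open_high + pseudo) / (open_high.mean() + pseudo)
    return np.log2(rel_low / rel_high)

# Toy example: feature 0 is open only in the two low-depth cells
binary = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [0, 1, 1]])
depth = np.array([1000, 1200, 9000, 12000])
print(low_depth_enrichment(binary, depth, depth_cutoff=3000))
```

Features with high scores are candidates for regions specific to the low-depth population and could be added to the feature space used for all cells.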

So, to some extent, you can salvage the biologically low-depth cells from the technically lowly covered cells. However, you will still have the library size effect. It is a big technical artefact, and it does not disappear even after excluding PC1 and/or doing library size correction.

You can check the relationship between library size (or any other technical artifact) and the principal components using the function correlation_pc, which is very useful for identifying artifacts in the data. However, we would not recommend removing the first four PCs, as you would be removing a lot of the biological variation present in the data. Library size is mainly correlated with PC1; to check how much it explains the other PCs, you can again use correlation_pc.
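
For readers who want to see what such a check computes, here is a plain NumPy sketch of the idea. It is not episcanpy's correlation_pc implementation; `pc_depth_correlation` is a hypothetical stand-in that runs PCA via SVD on a cell-by-feature matrix and returns the Pearson correlation of each PC with a per-cell covariate such as library size (pass log-transformed depth if you prefer).

```python
import numpy as np

def pc_depth_correlation(X, covariate, n_pcs=10):
    """Pearson correlation between each of the first PCs and a per-cell covariate."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    n_pcs = min(n_pcs, len(S))
    scores = U[:, :n_pcs] * S[:n_pcs]            # PC scores per cell
    cov = np.asarray(covariate, dtype=float)
    return np.array([np.corrcoef(scores[:, k], cov)[0, 1] for k in range(n_pcs)])

# Toy data: two features scale with depth, a third carries independent signal
depth = np.linspace(3000, 30000, 50)
other = np.where(np.arange(50) % 2 == 0, 1.0, -1.0)
X = np.outer(depth, [1.0, 2.0, 0.0]) + np.outer(other, [0.0, 0.0, 500.0])
print(np.abs(pc_depth_correlation(X, depth, n_pcs=2)))
```

The sign of a PC is arbitrary, so the absolute value of the correlation is what matters. If only PC1 shows a strong correlation with library size, dropping further components is likely to remove biological signal rather than the technical effect.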
