
Chunking cells in large dataset #185

Closed
Elhl93 opened this issue Jul 17, 2020 · 1 comment
Labels: results (Question about pySCENIC results)

Comments


Elhl93 commented Jul 17, 2020

Hi,

I have a huge single-cell dataset (>600k cells) from different animals and different conditions/timepoints. After filtering, I have ~15k genes to work with.

Although I work on an HPC, the dataset is big enough that I need to split it into chunks to improve runtime (referring to #99). As I have no pre-made lists as in #99, I need to run GRNBoost2 and cisTarget myself.
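For reference, this is roughly how I run the GRN step per chunk via the arboreto Python API (an untested sketch; the file names and TF list are placeholders):

```python
# Minimal per-chunk GRNBoost2 sketch; file names are placeholders.
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

expr = pd.read_csv("expr_chunk1.csv", index_col=0)  # cells x genes expression matrix
tf_names = load_tf_names("allTFs.txt")              # one TF symbol per line

# Fixing the seed makes repeated runs comparable.
adjacencies = grnboost2(expression_data=expr, tf_names=tf_names, seed=777)
adjacencies.to_csv("adj_chunk1.csv", index=False)
```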

I did a trial in which I took one cell type and split it into two chunks of 40k cells each, with an overlap of ~10k cells. My goal was to understand whether the RSS scores identified for the ~10k overlapping cells are identical in both chunks, or whether the score is relative, depending on the other cells in the dataset.
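Concretely, I compared the two chunks along these lines (an untested sketch using pySCENIC's `regulon_specificity_scores`; the AUCell matrices, label files, and the cell-type name are placeholders):

```python
# Compare RSS between two overlapping chunks; file names are placeholders.
import pandas as pd
from scipy.stats import pearsonr
from pyscenic.rss import regulon_specificity_scores

auc1 = pd.read_csv("auc_chunk1.csv", index_col=0)   # cells x regulons (AUCell)
auc2 = pd.read_csv("auc_chunk2.csv", index_col=0)
labels1 = pd.read_csv("labels_chunk1.csv", index_col=0).squeeze("columns")
labels2 = pd.read_csv("labels_chunk2.csv", index_col=0).squeeze("columns")

rss1 = regulon_specificity_scores(auc1, labels1)    # cell types x regulons
rss2 = regulon_specificity_scores(auc2, labels2)

# Correlate RSS for the regulons found in both chunks, for one cell type.
shared = rss1.columns.intersection(rss2.columns)
r, p = pearsonr(rss1.loc["celltype_A", shared], rss2.loc["celltype_A", shared])
print(f"RSS correlation across chunks: r={r:.2f} (p={p:.2g})")
```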

  1. Do you recommend splitting the dataset by sample (the dataset comprises >100 samples), or e.g. by cell type (we detect 15 cell types) or condition, in case the results depend on the other cells in the dataset?
  2. Correlating the RSS scores of the ~10k overlapping cells described above gives r = 0.51. Is that expected? You described that there is variability due to the probabilistic nature of GRNBoost2. Do you then recommend running it multiple times (e.g. n = 5) and only considering, say, the recurring top 20 (see the sketch after this list)?
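What I had in mind for the multi-run aggregation is something like this (a hypothetical sketch; the file names, cell-type label, and thresholds are placeholders):

```python
# Keep regulons that land in the top 20 (by RSS for one cell type) in most runs.
import pandas as pd
from collections import Counter

n_runs, top_k, min_recurrence = 5, 20, 4
counts = Counter()
for i in range(n_runs):
    rss = pd.read_csv(f"rss_run{i}.csv", index_col=0)   # cell types x regulons
    counts.update(rss.loc["celltype_A"].nlargest(top_k).index)

stable = [reg for reg, c in counts.items() if c >= min_recurrence]
print(f"{len(stable)} regulons recur in >= {min_recurrence}/{n_runs} runs")
```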

Thanks for your thoughts!


cflerin (Contributor) commented Jul 27, 2020

Hi @Elhl93 ,

Interesting question, and I think you have the right idea about downsampling. If you have a few conditions/timepoints, normally I would run these separately, and then also run the combined full dataset. But with a dataset this large, if you try to use the full 600k cells, the GRNBoost2 step will take quite some time to run -- probably on the order of days to weeks, even on an HPC with many cores.

So I would first run the conditions separately, downsampling to maybe 100k cells if necessary (see the sketch below). Splitting by cell type isn't a good approach because it makes it harder to pick out TFs/regulons that are differential across cell types; this could be a reason that your RSS correlation is "low". Running the conditions separately will already give a good idea of the regulons present in your datasets.
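Per-condition downsampling could look something like this (an untested sketch assuming an AnnData object loaded via scanpy; the file and column names are placeholders):

```python
# Downsample each condition to at most 100k cells; names are placeholders.
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("full_dataset.h5ad")
max_cells = 100_000
rng = np.random.default_rng(0)

for cond, idx in adata.obs.groupby("condition").groups.items():
    keep = rng.choice(np.asarray(idx), size=min(len(idx), max_cells), replace=False)
    adata[keep].write_h5ad(f"expr_{cond}.h5ad")  # one pySCENIC input per condition
```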

Then you can decide if running the full dataset is worth the computation time. I would start with a random 100k cells on the combined data first. Then if you find something particularly interesting, like a few regulons, you could run the full dataset, and only specify those TFs as an input (for instance).
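In code, the two-stage idea could look like this (an untested sketch; the TF shortlist and file names are hypothetical):

```python
# Pilot on a random 100k cells, then a full run restricted to selected TFs.
import pandas as pd
from arboreto.algo import grnboost2

expr = pd.read_csv("expr_full.csv", index_col=0)   # cells x genes, all ~600k cells
pilot = expr.sample(n=100_000, random_state=0)     # random 100k-cell pilot
# ... run the standard pipeline on `pilot` and pick out regulons of interest ...

tfs_of_interest = ["Tf1", "Tf2", "Tf3"]            # hypothetical shortlist from the pilot
adj_full = grnboost2(expression_data=expr, tf_names=tfs_of_interest, seed=777)
adj_full.to_csv("adj_full_selected_tfs.csv", index=False)
```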

In general, running multiple times is helpful for refining the list of regulons and target genes, but computation time could be an issue with a dataset this size, and I don't think it would affect the RSS all that much unless you prune the target genes really aggressively (see the sketch below). It's more likely that the differences in RSS come from the definition (cell composition) of the clusters that you're comparing.
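If you do go the multi-run route, pruning could be as simple as keeping only the TF-target links that recur across runs (a hypothetical sketch; file names and thresholds are placeholders):

```python
# Keep TF-target links appearing in at least k of n GRNBoost2 adjacency lists.
import pandas as pd

n_runs, k = 5, 3
adjs = [pd.read_csv(f"adj_run{i}.csv") for i in range(n_runs)]  # columns: TF, target, importance
links = pd.concat(a[["TF", "target"]] for a in adjs)

recurrence = links.value_counts()                  # count per (TF, target) pair
stable_links = recurrence[recurrence >= k].reset_index()
print(f"{len(stable_links)} links recur in >= {k}/{n_runs} runs")
```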
