
Chunking cells in large dataset #185

Closed
Elhl93 opened this issue Jul 17, 2020 · 1 comment
Labels: results (Question about pySCENIC results)

Comments


Elhl93 commented Jul 17, 2020

Hi,

I have a huge single-cell dataset (>600k cells) from different animals and different conditions/timepoints. After filtering, I have ~15k genes to work with.

Although I work on an HPC, the dataset is big enough that I need to split it into chunks to improve runtime (referring to #99). As I have no pre-made lists as in #99, I need to run GRNBoost2 and cisTarget myself.
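For reference, this is roughly how I run the GRN step per chunk via the arboreto Python API (an untested sketch; the file names and TF list are placeholders):

```python
# Minimal per-chunk GRNBoost2 sketch; file names are placeholders.
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

expr = pd.read_csv("expr_chunk1.csv", index_col=0)  # cells x genes expression matrix
tf_names = load_tf_names("allTFs.txt")              # one TF symbol per line

# Fixing the seed makes repeated runs comparable.
adjacencies = grnboost2(expression_data=expr, tf_names=tf_names, seed=777)
adjacencies.to_csv("adj_chunk1.csv", index=False)
```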

I did a trial in which I took one cell type and split it into two chunks of 40k cells each, with an overlap of ~10k cells. My goal was to understand whether the RSS scores identified for the ~10k overlapping cells are identical in both chunks, or whether the score is relative, depending on the other cells in the dataset.
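Concretely, I compared the two chunks along these lines (an untested sketch using pySCENIC's `regulon_specificity_scores`; the AUCell matrices, label files, and the cell-type name are placeholders):

```python
# Compare RSS between two overlapping chunks; file names are placeholders.
import pandas as pd
from scipy.stats import pearsonr
from pyscenic.rss import regulon_specificity_scores

auc1 = pd.read_csv("auc_chunk1.csv", index_col=0)   # cells x regulons (AUCell)
auc2 = pd.read_csv("auc_chunk2.csv", index_col=0)
labels1 = pd.read_csv("labels_chunk1.csv", index_col=0).squeeze("columns")
labels2 = pd.read_csv("labels_chunk2.csv", index_col=0).squeeze("columns")

rss1 = regulon_specificity_scores(auc1, labels1)    # cell types x regulons
rss2 = regulon_specificity_scores(auc2, labels2)

# Correlate RSS for the regulons found in both chunks, for one cell type.
shared = rss1.columns.intersection(rss2.columns)
r, p = pearsonr(rss1.loc["celltype_A", shared], rss2.loc["celltype_A", shared])
print(f"RSS correlation across chunks: r={r:.2f} (p={p:.2g})")
```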

  1. Do you recommend splitting the dataset by sample (the dataset comprises >100 samples), or e.g. by cell type (we detect 15 cell types) or condition, in case the results depend on the other cells in the dataset?
  2. Correlating the RSS scores of the ~10k overlapping cells described above gives r = 0.51. Is that expected? You described that there is variability due to the probabilistic nature of GRNBoost2. Do you then recommend running it multiple times (e.g. n = 5) and only considering, say, the recurring top 20 (see the sketch after this list)?
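What I had in mind for the multi-run aggregation is something like this (a hypothetical sketch; the file names, cell-type label, and thresholds are placeholders):

```python
# Keep regulons that land in the top 20 (by RSS for one cell type) in most runs.
import pandas as pd
from collections import Counter

n_runs, top_k, min_recurrence = 5, 20, 4
counts = Counter()
for i in range(n_runs):
    rss = pd.read_csv(f"rss_run{i}.csv", index_col=0)   # cell types x regulons
    counts.update(rss.loc["celltype_A"].nlargest(top_k).index)

stable = [reg for reg, c in counts.items() if c >= min_recurrence]
print(f"{len(stable)} regulons recur in >= {min_recurrence}/{n_runs} runs")
```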

Thanks for your thoughts!


cflerin (Contributor) commented Jul 27, 2020

Hi @Elhl93 ,

Interesting question, and I think you have the right idea about downsampling. If you have a few conditions/timepoints, normally I would run these separately, and then also run the combined full dataset. But with a dataset this large, if you try to use the full 600k cells, the GRNBoost2 step will take quite some time to run -- probably on the order of days to weeks, even on an HPC with many cores.

So I would first run the conditions separately, downsampling to maybe 100k cells if necessary (see the sketch below). Splitting by cell type isn't a good approach because it makes it harder to pick out TFs/regulons that are differential across cell types; this could be a reason that your RSS correlation is "low". Running the conditions separately will already give a good idea of the regulons present in your datasets.
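Per-condition downsampling could look something like this (an untested sketch assuming an AnnData object loaded via scanpy; the file and column names are placeholders):

```python
# Downsample each condition to at most 100k cells; names are placeholders.
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("full_dataset.h5ad")
max_cells = 100_000
rng = np.random.default_rng(0)

for cond, idx in adata.obs.groupby("condition").groups.items():
    keep = rng.choice(np.asarray(idx), size=min(len(idx), max_cells), replace=False)
    adata[keep].write_h5ad(f"expr_{cond}.h5ad")  # one pySCENIC input per condition
```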

Then you can decide if running the full dataset is worth the computation time. I would start with a random 100k cells on the combined data first. Then if you find something particularly interesting, like a few regulons, you could run the full dataset, and only specify those TFs as an input (for instance).
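In code, the two-stage idea could look like this (an untested sketch; the TF shortlist and file names are hypothetical):

```python
# Pilot on a random 100k cells, then a full run restricted to selected TFs.
import pandas as pd
from arboreto.algo import grnboost2

expr = pd.read_csv("expr_full.csv", index_col=0)   # cells x genes, all ~600k cells
pilot = expr.sample(n=100_000, random_state=0)     # random 100k-cell pilot
# ... run the standard pipeline on `pilot` and pick out regulons of interest ...

tfs_of_interest = ["Tf1", "Tf2", "Tf3"]            # hypothetical shortlist from the pilot
adj_full = grnboost2(expression_data=expr, tf_names=tfs_of_interest, seed=777)
adj_full.to_csv("adj_full_selected_tfs.csv", index=False)
```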

In general, running multiple times is helpful for refining the list of regulons and target genes, but computation time could be an issue with a dataset this size, and I don't think it would affect the RSS all that much unless you prune the target genes really aggressively (see the sketch below). It's more likely that the differences in RSS come from the definition (cell composition) of the clusters that you're comparing.
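If you do go the multi-run route, pruning could be as simple as keeping only the TF-target links that recur across runs (a hypothetical sketch; file names and thresholds are placeholders):

```python
# Keep TF-target links appearing in at least k of n GRNBoost2 adjacency lists.
import pandas as pd

n_runs, k = 5, 3
adjs = [pd.read_csv(f"adj_run{i}.csv") for i in range(n_runs)]  # columns: TF, target, importance
links = pd.concat(a[["TF", "target"]] for a in adjs)

recurrence = links.value_counts()                  # count per (TF, target) pair
stable_links = recurrence[recurrence >= k].reset_index()
print(f"{len(stable_links)} links recur in >= {k}/{n_runs} runs")
```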
