
Number of topics and peak using questions. Thank you. #31

Closed
helenhuangmath opened this issue Oct 9, 2019 · 2 comments

Comments

@helenhuangmath

Hi cisTopic team,

Thank you for developing the cisTopic software. We're trying it on our scATAC-seq data and have found some promising results, but we have several concerns about them. It would be great if you could provide some suggestions.

  1. We tested different numbers of topics, but the results showed that the more topics we used, the more stable the model was on our data (attached figure 1). Do you have any idea why this happens?
    We also noticed that some of the topics are similar to each other. Is there a good way to merge such similar topics? Is it OK to average the z-scores or probabilities for these topics? Or do you think we should manually select a lower number of topics in the selectModel() step?

  2. Do you have any idea how many topics each peak/region meaningfully contributes to in general? I noticed that when the algorithm builds region scores, almost all peaks seem to be used, yet some peaks have very limited contributions. After running binarizecisTopics() to binarize the topics, only about 20% of the peaks passed the cutoff, were saved in object@binarized.cisTopics, and were used for downstream functional and pathway analysis; the remaining 80% of the peaks do not contribute meaningfully to any topic (attached figure 2). Meanwhile, some peaks contribute to more than 15 different topics. Is that normal? How should we interpret this result?

Thank you so much!!

cisTopic_issue_figure1
cisTopic_issue_figure2

@cbravo93
Member

Hi @helenhuangmath !

I think I just answered this by email, but here it goes:

0- Based on the top-left plot (likelihood per iteration), I would increase the number of burn-in iterations. For the models with a higher number of topics (from 35 on), your likelihood has not stabilized by the time sampling starts (this can affect your model selection and results). Maybe 300 burn-in iterations, or even a bit more, would be better.

1a- We have seen that in some datasets the curve reaches a stable likelihood rather than going down again after a maximum. In that case, you can pick the simplest model that has a similar likelihood. For your data, I would maybe add some models (10, 20, 30 topics) to make sure the optimum is not around there.
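The "simplest model with a similar likelihood" heuristic is easy to automate. A minimal sketch in Python (not cisTopic's R API; the log-likelihood values below are made up for illustration, in a real run you would take the final log-likelihood of each fitted model):

```python
import numpy as np

# Hypothetical final log-likelihoods for models with different topic numbers
topic_counts = np.array([10, 20, 30, 35, 40, 45, 50])
log_likelihoods = np.array([-9.80e6, -9.55e6, -9.42e6, -9.40e6,
                            -9.39e6, -9.39e6, -9.38e6])

def simplest_within_tolerance(topics, loglik, tol=0.005):
    """Pick the smallest topic number whose log-likelihood is within
    a relative tolerance `tol` of the best model's log-likelihood."""
    best = loglik.max()
    # relative gap to the best model (log-likelihoods are negative)
    gap = (best - loglik) / abs(best)
    candidates = topics[gap <= tol]
    return int(candidates.min())

print(simplest_within_tolerance(topic_counts, log_likelihoods))  # → 30
```

The tolerance is a judgment call; the point is simply to stop paying for extra topics once the likelihood curve has flattened.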

1b- I would check the correlation between the scores (topic-cell, region-topic) before merging, and would be careful with downstream analyses (for these correlated topics, how much do their regions overlap?). If you opt for merging them, do it on the assignment matrices [for topic-cell and region-topic, respectively: object@selected.model$document_expects & object@selected.model$topics], and the rest of the functions/normalizations should still work. I can help with the code for this, just let me know :).
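Merging on the assignment matrices amounts to summing the counts of the two topics and dropping one row. A minimal sketch in Python/NumPy with random stand-in matrices (cisTopic itself is R; `document_expects` and `topics` are the real slot names from the comment above, everything else here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the assignment-count matrices:
# topic_cell   ~ object@selected.model$document_expects (topics x cells)
# topic_region ~ object@selected.model$topics           (topics x regions)
n_topics, n_cells, n_regions = 5, 100, 2000
topic_cell = rng.poisson(5, size=(n_topics, n_cells)).astype(float)
topic_region = rng.poisson(2, size=(n_topics, n_regions)).astype(float)

# Check topic-topic correlation on the cell scores before deciding to merge
corr = np.corrcoef(topic_cell)

def merge_topics(mat, i, j):
    """Merge topic j into topic i by summing their assignment counts,
    then drop row j. Works for both topic-cell and topic-region matrices."""
    mat = mat.copy()
    mat[i] += mat[j]
    return np.delete(mat, j, axis=0)

merged_tc = merge_topics(topic_cell, 0, 1)
merged_tr = merge_topics(topic_region, 0, 1)
print(merged_tc.shape, merged_tr.shape)  # one topic fewer in each
```

Summing the raw counts (rather than averaging normalized scores) keeps the column totals intact, so downstream normalizations still behave as expected.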

2- Normally, binarised topics contain a couple of thousand regions, but this depends on the binarisation thresholds you choose. I don't like thresholds and prefer to work with the probabilities themselves when possible. Regions that do not end up in any topic are usually lowly accessible (can you check the number of cells in which topic regions vs. non-topic regions are accessible? And also check the binarisation plots). If you prefer to work with differentially accessible regions, you can also use the predictive distribution matrix (with the probabilities of the regions in each cell) and run e.g. a Wilcoxon test between whatever groups. With this matrix you can also look at the accessibility probability of any regions of interest (whether they are in a topic or not, which is quite interesting :P).
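The per-region Wilcoxon test between two cell groups can be sketched like this in Python (a generic illustration, not cisTopic code; the matrix here is simulated, in practice it would come from the predictive distribution, and the unpaired Wilcoxon rank-sum test is `mannwhitneyu` in SciPy):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Simulated stand-in for the predictive distribution matrix:
# accessibility probability of each region in each cell (regions x cells)
n_regions, n_cells = 50, 60
prob = rng.beta(2, 20, size=(n_regions, n_cells))
groups = np.array([0] * 30 + [1] * 30)  # two cell groups of interest
# make the first 10 regions more accessible in group 1
prob[:10, groups == 1] = np.clip(prob[:10, groups == 1] + 0.3, 0.0, 1.0)

# Unpaired Wilcoxon (Mann-Whitney U) test per region between the two groups
pvals = np.array([
    mannwhitneyu(prob[r, groups == 0], prob[r, groups == 1]).pvalue
    for r in range(n_regions)
])
print((pvals < 0.01).sum())  # regions called differentially accessible
```

For a real analysis you would also correct the p-values for multiple testing (e.g. Benjamini-Hochberg) across all regions.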

I hope this is useful, and let me know if you have more questions :)!

C

@helenhuangmath
Author

Thank you so much again! It's quite helpful!
