Topics has been calculated to number 350, but the loglikelihood is still not optimal. #114

Open
YH-Zheng opened this issue Mar 8, 2024 · 3 comments

Comments

@YH-Zheng

YH-Zheng commented Mar 8, 2024

I randomly sampled 1k cells from each cell type in a dataset of 4 million cells, which gave an ATAC matrix of 55243 cells x 165804 peaks. However, when I run topic modelling, the log-likelihood has still not reached an optimum when the number of topics reaches 350. Does this mean I need to increase the number of topics? 350 is already a large value compared with the example, so how do I pick the optimal number of topics?

from pycisTopic.lda_models import run_cgs_models_mallet

# path_to_mallet_binary, cistopic_obj, tmp_dir and work_dir are defined earlier
models = run_cgs_models_mallet(
    path_to_mallet_binary,
    cistopic_obj,
    n_topics=[200, 250, 300, 350],
    n_cpu=55,
    n_iter=150,
    random_state=555,
    alpha=50,
    alpha_by_topic=True,
    eta=0.1,
    eta_by_topic=False,
    tmp_path=tmp_dir,  # Use SCRATCH if many models or big data set
    save_path=work_dir,
    reuse_corpus=True,
)

[Image attachment: download-3]

@SeppeDeWinter
Collaborator

Hi @YH-Zheng

Indeed, choosing the correct number of topics can be a bit tricky and subjective. I would not run models with a larger number of topics. In your case I would choose the model with 200 topics, since that is the point where most metrics are maximised.

After selecting the model, you should check whether your topics represent your cell types well. First, plot the cell-topic probabilities: do you have a topic that is specific for each cell type? Second, look at motif enrichment: are the regions in the topics enriched for the motifs that you are expecting?

All the best,

Seppe
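
As a rough illustration of the first check above (is there a topic that is specific for each cell type?), here is a minimal sketch in plain Python. It does not use the pycisTopic API. The toy `cell_topic` matrix, the `cell_types` labels, the helper names and the `min_margin` threshold are all hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical toy data: one row of topic probabilities per cell,
# plus one cell type label per cell. Not taken from the issue.
cell_topic = [
    [0.80, 0.10, 0.10],  # B cell
    [0.70, 0.20, 0.10],  # B cell
    [0.10, 0.80, 0.10],  # CD4 T cell
    [0.20, 0.70, 0.10],  # CD4 T cell
    [0.45, 0.35, 0.20],  # Monocyte
    [0.40, 0.35, 0.25],  # Monocyte
]
cell_types = ["B", "B", "CD4T", "CD4T", "Mono", "Mono"]


def mean_topic_prob_per_cell_type(cell_topic, cell_types):
    """Average the topic probabilities over the cells of each cell type."""
    grouped = defaultdict(list)
    for probs, cell_type in zip(cell_topic, cell_types):
        grouped[cell_type].append(probs)
    return {
        cell_type: [sum(col) / len(rows) for col in zip(*rows)]
        for cell_type, rows in grouped.items()
    }


def best_topic_per_cell_type(means, min_margin=0.2):
    """For each cell type, return (top topic index, is_specific).

    A topic counts as specific here if its mean probability in this cell
    type exceeds its mean in every other cell type by at least
    `min_margin` (an arbitrary, hypothetical threshold).
    """
    report = {}
    for cell_type, probs in means.items():
        top = max(range(len(probs)), key=probs.__getitem__)
        others = [m[top] for ct, m in means.items() if ct != cell_type]
        margin = probs[top] - max(others)
        report[cell_type] = (top, margin >= min_margin)
    return report


means = mean_topic_prob_per_cell_type(cell_topic, cell_types)
report = best_topic_per_cell_type(means)
for cell_type, (topic, specific) in report.items():
    print(f"{cell_type}: top topic {topic}, specific={specific}")
```

In this toy data the monocytes have no topic of their own (their strongest topic is the B cell topic), which is the kind of result that would suggest the chosen model does not separate the cell types well.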

@YH-Zheng
Author

Hi @SeppeDeWinter

Thanks for your reply. If I understand correctly, the appropriate number of topics is the one where all four metrics are as large as possible. However, for both of the metrics described in the tutorial (Arun_2010, Cao_Juan_2009), the better the model, the lower the metric:

Arun_2010: Uses a density-based metric as in Arun et al (2010) using the topic-region distribution, the cell-topic distribution and the cell coverage. The better the model, the lower the metric.
Cao_Juan_2009: Uses a divergence-based metric as in Cao Juan et al (2009) using the topic-region distribution. The better the model, the lower the metric.

If the chosen number of topics does not separate my ATAC data according to my cell type annotation, would it be better to divide all cells into subsets and run topic modelling separately on each (e.g., B cells, CD4 T cells, and the many smaller subsets within these large groups of cells)? Or should I increase the number of topics?

Best wishes,

Yuhui

@SeppeDeWinter
Collaborator

Hi @YH-Zheng

You are correct about those two metrics. However, for plotting we invert their values (hence the "inv" prefix), so in the plot higher is better for all metrics.
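
To make the inversion concrete, here is a minimal sketch in plain Python. It is one possible way to put all metrics on a "higher is better" scale and is not necessarily how pycisTopic computes its plotted values. The metric values are made up, the `Minmo_2011` name and its direction are assumptions about the remaining metrics, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical metric values per number of topics (not from the issue).
n_topics = [200, 250, 300, 350]
metrics = {
    "Arun_2010": [10.0, 12.0, 15.0, 20.0],              # lower is better
    "Cao_Juan_2009": [0.10, 0.12, 0.11, 0.15],          # lower is better
    "Minmo_2011": [5.0, 4.5, 4.0, 3.0],                 # assumed: higher is better
    "loglikelihood": [-900.0, -880.0, -870.0, -865.0],  # higher is better
}
LOWER_IS_BETTER = {"Arun_2010", "Cao_Juan_2009"}


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant metric maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def oriented_scores(metrics):
    """Scale every metric to [0, 1] and flip the lower-is-better ones,
    so that a higher score always means a better model."""
    scores = {}
    for name, values in metrics.items():
        scaled = min_max_scale(values)
        if name in LOWER_IS_BETTER:
            scores["Inv_" + name] = [1.0 - s for s in scaled]
        else:
            scores[name] = scaled
    return scores


scores = oriented_scores(metrics)

# For each metric, find the number of topics with the highest score,
# then count how many metrics favour each candidate.
best = {
    name: n_topics[max(range(len(vals)), key=vals.__getitem__)]
    for name, vals in scores.items()
}
votes = Counter(best.values())
winner = votes.most_common(1)[0][0]
print(best)
print("Most metrics are maximised at", winner, "topics")
```

With these made-up numbers, three of the four oriented metrics peak at 200 topics while the log-likelihood keeps rising, which mirrors the situation described in this thread.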

I would not run topic modelling separately per cell type, because you need the background of the other cell types to be able to identify cell type specific regions. In that case I would indeed increase the number of topics.

All the best,

Seppe
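
The point about needing the other cell types as background can be illustrated with a toy contrast score. This is only a sketch of the reasoning, not how topic modelling works internally. The region names, the accessibility fractions and the `contrast` helper are hypothetical.

```python
# Hypothetical fraction of cells in which each region is accessible.
accessibility = {
    "region_pan_B": {"naive_B": 0.80, "memory_B": 0.82, "CD4T": 0.05, "Mono": 0.04},
    "region_memory_B": {"naive_B": 0.10, "memory_B": 0.75, "CD4T": 0.05, "Mono": 0.06},
}


def contrast(region, group, cell_types):
    """Mean accessibility of `region` in `group` minus its mean in the
    remaining `cell_types`. Returns None when there is no background."""
    values = accessibility[region]
    rest = [ct for ct in cell_types if ct not in group]
    if not rest:
        return None  # nothing to compare against
    in_group = sum(values[ct] for ct in group) / len(group)
    background = sum(values[ct] for ct in rest) / len(rest)
    return in_group - background


all_types = ["naive_B", "memory_B", "CD4T", "Mono"]
b_only = ["naive_B", "memory_B"]
b_lineage = ["naive_B", "memory_B"]

# With every cell type present, the pan-B region clearly stands out.
full = contrast("region_pan_B", b_lineage, all_types)

# In a B cell only subset there is no background for the B lineage at all,
subset_lineage = contrast("region_pan_B", b_lineage, b_only)
# and between B cell subtypes the pan-B region looks unremarkable.
subset_subtype = contrast("region_pan_B", ["memory_B"], b_only)

print(full, subset_lineage, subset_subtype)
```

In this toy example the pan-B region is only recognisable as B cell specific when T cells and monocytes are part of the same analysis. Within the B cell subset it is open everywhere, so it carries almost no contrast.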
