Topics has been calculated to number 350, but the loglikelihood is still not optimal. #114

YH-Zheng · 2024-03-08T06:06:56Z

I randomly sampled 1k cells from each cell type in a data set of 4 million cells, which gave an ATAC matrix of 55,243 cells x 165,804 peaks. However, when I run topic modeling, the log-likelihood has still not reached an optimum at 350 topics. Does this mean I need to increase the number of topics further? 350 is already large compared with the example, so how do I pick the optimal number of topics?

from pycisTopic.lda_models import run_cgs_models_mallet

# Fit one LDA model per candidate number of topics with Mallet
models = run_cgs_models_mallet(
    path_to_mallet_binary,
    cistopic_obj,
    n_topics=[200, 250, 300, 350],
    n_cpu=55,
    n_iter=150,
    random_state=555,
    alpha=50,
    alpha_by_topic=True,
    eta=0.1,
    eta_by_topic=False,
    tmp_path=tmp_dir,  # Use SCRATCH if many models or big data set
    save_path=work_dir,
    reuse_corpus=True,
)


SeppeDeWinter · 2024-03-11T08:29:29Z

Hi @YH-Zheng

Indeed, choosing the correct number of topics can be a bit tricky and subjective. I would not run models with a larger number of topics. In your case I would choose the model with 200 topics, since that is the point where most metrics are maximised.

After selecting the model, you should check whether your topics represent your cell types well. Based on plotting the cell-topic probabilities: do you have a topic that is specific for each cell type? And based on motif enrichment: are the regions in the topics enriched for the motifs that you expect?
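The first check above (is there a topic specific to each cell type?) can also be sketched numerically. This is a minimal, library-free illustration and not pycisTopic code: the function names, the toy cell-topic probabilities, and the "share" score are all assumptions made up for the example.

```python
from collections import defaultdict


def mean_topic_prob_per_cell_type(cell_topic, cell_types):
    """Average each topic's probability over the cells of every cell type.

    cell_topic: list of per-cell lists of topic probabilities (cells x topics).
    cell_types: list of cell type labels, one per cell.
    """
    sums = {}
    counts = defaultdict(int)
    for probs, label in zip(cell_topic, cell_types):
        if label not in sums:
            sums[label] = [0.0] * len(probs)
        for t, p in enumerate(probs):
            sums[label][t] += p
        counts[label] += 1
    return {
        label: [s / counts[label] for s in totals]
        for label, totals in sums.items()
    }


def best_topic_per_cell_type(means):
    """For each cell type, find the topic it holds the largest share of.

    The share of a cell type in a topic is its mean probability for that
    topic divided by the sum of the means over all cell types.
    Returns {cell_type: (topic_index, share)}.
    """
    n_topics = len(next(iter(means.values())))
    totals = [sum(m[t] for m in means.values()) for t in range(n_topics)]
    result = {}
    for label, m in means.items():
        shares = [m[t] / totals[t] if totals[t] > 0 else 0.0 for t in range(n_topics)]
        top = max(range(n_topics), key=lambda t: shares[t])
        result[label] = (top, shares[top])
    return result


# Toy data (hypothetical): 4 cells x 3 topics, two cell types.
cell_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]
cell_types = ["B", "B", "T", "T"]

means = mean_topic_prob_per_cell_type(cell_topic, cell_types)
best = best_topic_per_cell_type(means)
for label, (topic, share) in best.items():
    print(f"{label}: topic {topic} (share {share:.2f})")
```

A cell type whose best share stays close to 1 / (number of cell types) has no topic of its own, which is the situation where more topics may help. The motif enrichment check still has to be done separately.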

All the best,

Seppe

YH-Zheng · 2024-03-11T09:12:43Z

Hi @SeppeDeWinter

Thanks for your reply. If I understand correctly, the appropriate number of topics is the one where all four metrics are as large as possible. However, for two of the metrics described in the tutorial (Arun_2010 and Cao_Juan_2009), the better the model, the lower the metric:

Arun_2010: Uses a density-based metric as in Arun et al (2010) using the topic-region distribution, the cell-topic distribution and the cell coverage. The better the model, the lower the metric.
Cao_Juan_2009: Uses a divergence-based metric as in Cao Juan et al (2009) using the topic-region distribution. The better the model, the lower the metric.

If the chosen number of topics does not separate my ATAC data according to my cell type annotation, would it be better to split the cells into subsets and run topic modeling separately on each (e.g., B cells, CD4 T cells, and the many smaller subsets within these large groups)? Or should I increase the number of topics?

Best wishes,

Yuhui

SeppeDeWinter · 2024-03-11T13:16:07Z

Hi @YH-Zheng

You are correct about those two metrics. However, for plotting we invert their values (hence the "inv" prefix), so in the plots higher is better for them as well.
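As a rough sketch of what such an inversion makes possible: once lower-is-better metrics are flipped, every metric can be read as "higher is better" and compared across models. The transform below (negate, then min-max rescale) is an assumption for illustration, not necessarily what pycisTopic does internally, and the metric values are made up rather than taken from this issue.

```python
def rescale(values, lower_is_better=False):
    """Min-max rescale to [0, 1]; negate first when lower is better."""
    signed = [-v for v in values] if lower_is_better else list(values)
    lo, hi = min(signed), max(signed)
    if hi == lo:
        # All models score the same, so the metric cannot rank them.
        return [0.0 for _ in signed]
    return [(v - lo) / (hi - lo) for v in signed]


n_topics = [200, 250, 300, 350]

# Hypothetical values, one per entry of n_topics: (values, lower_is_better).
metrics = {
    "Arun_2010": ([12.0, 13.5, 15.0, 18.0], True),
    "Cao_Juan_2009": ([0.10, 0.12, 0.15, 0.20], True),
    "loglikelihood": ([-9.0e7, -8.8e7, -8.7e7, -8.65e7], False),
}

rescaled = {
    name: rescale(values, lower_is_better)
    for name, (values, lower_is_better) in metrics.items()
}

# Number of topics at which each rescaled metric peaks.
best_n = {
    name: n_topics[scores.index(max(scores))]
    for name, scores in rescaled.items()
}
print(best_n)
```

With these made-up numbers the two inverted metrics peak at 200 topics while the log-likelihood keeps rising to 350, which mirrors the situation in this thread: the metrics need not agree, and the choice remains a judgment call.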

I would not run topic modelling separately per cell type: you need the background of the other cell types to be able to identify cell-type-specific regions. If the topics do not separate your cell types, I would indeed increase the number of topics.

All the best,

Seppe

AmosFong1 mentioned this issue Jul 23, 2024

Guidance on topic model selection #150

Closed
