
pyLDAvis topic IDs doesn't correspond to gensim topic IDs #127

Open
aburkov opened this issue Jun 28, 2018 · 8 comments

aburkov commented Jun 28, 2018

When used with a gensim model, pyLDAvis's topic 1 is not the same as gensim's topic 1, pyLDAvis's topic 2 is not the same as gensim's topic 2, and so on.

Is there any way to find out gensim's ID for pyLDAvis's topic 15?


a087861 commented Jun 28, 2018

I noticed the same issue when preparing an analysis on a gensim LDA model. Any insight on this topic is greatly appreciated. For ease of searching and additional analysis, it would be awesome if the visualization used the same model indexing as the underlying LDA model (index starting at zero).


jdm2980 commented Jul 30, 2018

You can use `topic_order` to recover the gensim topic IDs, or pass `sort_topics=False` so the topics are not reordered by size.

```python
vis_data = pyLDAvis.gensim.prepare(model, corpus, dictionary)

# Print the original gensim topic for each pyLDAvis topic.
# Note: gensim starts indexing at 0 and pyLDAvis starts at 1.
print(vis_data.topic_order)

# Alternatively, keep the same ordering as gensim:
vis_data = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
print(vis_data.topic_order)  # [1, 2, 3, 4, ...]
```
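To answer the original question directly: `topic_order[k - 1]` holds the 1-based original topic number that pyLDAvis displays as topic `k` (that convention is visible in the `topic.order` arrays quoted later in this thread), so subtracting 1 yields the 0-based gensim ID. A minimal sketch of the inverse mapping, using an example `topic_order` list rather than a fitted model:

```python
def pyldavis_to_gensim(pyldavis_topic, topic_order):
    """Map a 1-based pyLDAvis topic ID to a 0-based gensim topic ID.

    topic_order[k - 1] is the 1-based original topic number shown
    as topic k in the visualization (vis_data.topic_order).
    """
    return topic_order[pyldavis_topic - 1] - 1

# Example topic_order as prepare() might produce (largest topic first):
topic_order = [19, 1, 11, 17, 14]

print(pyldavis_to_gensim(1, topic_order))  # pyLDAvis topic 1 -> gensim topic 18
print(pyldavis_to_gensim(2, topic_order))  # pyLDAvis topic 2 -> gensim topic 0
```

So for the question above, gensim's ID of pyLDAvis's topic 15 would be `vis_data.topic_order[14] - 1`.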


mileserickson commented Jan 2, 2020

The same thing happens when using an sklearn model. This shuffling of topic IDs without warning is a very, very confusing behavior, and I struggle to comprehend why it occurs by default.

If topics are to be sorted by prevalence, could it be helpful to assign different topic IDs that include a mapping back to the original topic ID? For example, if topic 9 is the most prevalent, why not call it "Topic A-09"? If the purpose of this visualization tool is to facilitate comprehension and labeling of topics discovered through unsupervised learning, how is it helpful to create labels that can't be mapped back to the model?
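A labeling scheme like that could be built outside pyLDAvis itself. Here is a hypothetical helper (the function name and the prevalence values are made up for illustration) that sorts topics by prevalence but keeps the original 0-based topic ID in each label, in the "Topic A-09" style suggested above:

```python
def prevalence_labels(prevalences):
    """Given per-topic prevalences indexed by original topic ID,
    return labels sorted by descending prevalence that keep the
    original ID, e.g. 'Topic A-09' if topic 9 is most prevalent.
    """
    order = sorted(range(len(prevalences)),
                   key=lambda t: prevalences[t], reverse=True)
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return ["Topic %s-%02d" % (letters[rank], topic)
            for rank, topic in enumerate(order)]

prevalences = [0.10, 0.05, 0.25, 0.60]  # topic 3 is most prevalent
print(prevalence_labels(prevalences))
# ['Topic A-03', 'Topic B-02', 'Topic C-00', 'Topic D-01']
```

Each label then carries both the display rank and a direct pointer back to the model's own topic index.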

Unfortunately, the docstrings and method signatures that users are most likely to read (pyLDAvis.sklearn.prepare and pyLDAvis.gensim.prepare) don't warn the user about this behavior, nor do they mention the sort_topics keyword argument:

Signature: pyLDAvis.sklearn.prepare(lda_model, dtm, vectorizer, **kwargs)
Docstring:
Create Prepared Data from sklearn's LatentDirichletAllocation and CountVectorizer.

Parameters
----------
lda_model : sklearn.decomposition.LatentDirichletAllocation.
    Latent Dirichlet Allocation model from sklearn fitted with `dtm`

dtm : array-like or sparse matrix, shape=(n_samples, n_features)
    Document-term matrix used to fit on LatentDirichletAllocation model (`lda_model`)

vectorizer : sklearn.feature_extraction.text.(CountVectorizer, TfIdfVectorizer).
    vectorizer used to convert raw documents to document-term matrix (`dtm`)

**kwargs: Keyword argument to be passed to pyLDAvis.prepare()


Returns
-------
prepared_data : PreparedData
      the data structures used in the visualization


Example
--------
For example usage please see this notebook:
http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb

See
------
See `pyLDAvis.prepare` for **kwargs.
File:      ~/.local/lib/python3.5/site-packages/pyLDAvis/sklearn.py
Type:      function

@nickhamlin

This issue had me tied up for hours. The crux of the issue is exactly what @mileserickson highlights - there's no way to easily discover that this exists. Perhaps an easy fix would be to change the prepare method such that sort_topics=False by default? That way, users could make the (natural) assumption that indexes would match, but would still retain the option of overriding that if they wanted to.

@francis-de-ladu

I had the same issue and thought there was a problem during topic inference. Sorting topics by prevalence shouldn't be the default.

@cmastronardo

The same issue happened to me; this is very confusing.

@alexeyegorov

I'd also like to add that this is really not something I expected, and it led me to wrong results because no tutorial mentions it. I would really appreciate changing the default to False.

Collaborator

msusol commented Mar 14, 2021

Making this change would break the unit tests, which rely on R output data. Users who produce the data with the R package but use pyLDAvis for the visualization would then hit the opposite problem.

```
>       assert_array_equal(np.array(expected['topic.order']), np.array(output['topic.order']))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 20 / 20 (100%)
E       Max absolute difference: 18
E       Max relative difference: 18.
E        x: array([19,  1, 11, 17, 14, 15, 13, 10,  5,  7, 20, 16, 18,  9,  3,  8,  2,
E              12,  4,  6])
E        y: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
E              18, 19, 20])

tests/pyLDAvis/test_prepare.py:44: AssertionError
```

@bmabey Thoughts?

@msusol msusol self-assigned this Mar 14, 2021
jonaschn added a commit to jonaschn/tomotopy that referenced this issue May 28, 2021
Override non-intuitive parameters with more appropriate values to better match the expectations of tomotopy users who are not familiar with the internals of pyLDAvis
A similar issue is discussed here: bmabey/pyLDAvis#127

9 participants