
pyLDAvis topic IDs doesn't correspond to gensim topic IDs #127

Open
aburkov opened this issue Jun 28, 2018 · 8 comments

aburkov commented Jun 28, 2018

When used with a gensim model, pyLDAvis's topic 1 is not the same as gensim's topic 1, pyLDAvis's topic 2 is not the same as gensim's topic 2, and so on.

Is there any way to find out gensim's ID for pyLDAvis's topic 15?


a087861 commented Jun 28, 2018

I noticed the same issue when preparing an analysis on a gensim LDA model. Any insight on this topic is greatly appreciated. For ease of searching and additional analysis, it would be awesome if the visualization used the same model indexing as the underlying LDA model (index starting at zero).


jdm2980 commented Jul 30, 2018

You can use `topic_order` to recover the gensim topic IDs, or pass `sort_topics=False` so the topics are not reordered by size.

```python
vis_data = pyLDAvis.gensim.prepare(model, corpus, dictionary)

# Print the original gensim topic for each pyLDAvis topic.
# Note: gensim starts indexing at 0 and pyLDAvis starts at 1.
print(vis_data.topic_order)

# Alternatively, keep the same ordering as gensim:
vis_data = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
print(vis_data.topic_order)  # [1, 2, 3, 4, ...]
```
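To answer the original question directly: `topic_order[k - 1]` holds the 1-based original topic number that pyLDAvis displays as topic `k` (that convention is visible in the `topic.order` arrays quoted later in this thread), so subtracting 1 yields the 0-based gensim ID. A minimal sketch of the inverse mapping, using an example `topic_order` list rather than a fitted model:

```python
def pyldavis_to_gensim(pyldavis_topic, topic_order):
    """Map a 1-based pyLDAvis topic ID to a 0-based gensim topic ID.

    topic_order[k - 1] is the 1-based original topic number shown
    as topic k in the visualization (vis_data.topic_order).
    """
    return topic_order[pyldavis_topic - 1] - 1

# Example topic_order as prepare() might produce (largest topic first):
topic_order = [19, 1, 11, 17, 14]

print(pyldavis_to_gensim(1, topic_order))  # pyLDAvis topic 1 -> gensim topic 18
print(pyldavis_to_gensim(2, topic_order))  # pyLDAvis topic 2 -> gensim topic 0
```

So for the question above, gensim's ID of pyLDAvis's topic 15 would be `vis_data.topic_order[14] - 1`.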


mileserickson commented Jan 2, 2020

The same thing happens when using an sklearn model. This shuffling of topic IDs without warning is a very, very confusing behavior, and I struggle to comprehend why it occurs by default.

If topics are to be sorted by prevalence, could it be helpful to assign different topic IDs that include a mapping back to the original topic ID? For example, if topic 9 is the most prevalent, why not call it "Topic A-09"? If the purpose of this visualization tool is to facilitate comprehension and labeling of topics discovered through unsupervised learning, how is it helpful to create labels that can't be mapped back to the model?
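A labeling scheme like that could be built outside pyLDAvis itself. Here is a hypothetical helper (the function name and the prevalence values are made up for illustration) that sorts topics by prevalence but keeps the original 0-based topic ID in each label, in the "Topic A-09" style suggested above:

```python
def prevalence_labels(prevalences):
    """Given per-topic prevalences indexed by original topic ID,
    return labels sorted by descending prevalence that keep the
    original ID, e.g. 'Topic A-09' if topic 9 is most prevalent.
    """
    order = sorted(range(len(prevalences)),
                   key=lambda t: prevalences[t], reverse=True)
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return ["Topic %s-%02d" % (letters[rank], topic)
            for rank, topic in enumerate(order)]

prevalences = [0.10, 0.05, 0.25, 0.60]  # topic 3 is most prevalent
print(prevalence_labels(prevalences))
# ['Topic A-03', 'Topic B-02', 'Topic C-00', 'Topic D-01']
```

Each label then carries both the display rank and a direct pointer back to the model's own topic index.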

Unfortunately, the docstrings and method signatures that users are most likely to read (pyLDAvis.sklearn.prepare and pyLDAvis.gensim.prepare) don't warn the user about this behavior, nor do they mention the sort_topics keyword argument:

Signature: pyLDAvis.sklearn.prepare(lda_model, dtm, vectorizer, **kwargs)
Docstring:
Create Prepared Data from sklearn's LatentDirichletAllocation and CountVectorizer.

Parameters
----------
lda_model : sklearn.decomposition.LatentDirichletAllocation.
    Latent Dirichlet Allocation model from sklearn fitted with `dtm`

dtm : array-like or sparse matrix, shape=(n_samples, n_features)
    Document-term matrix used to fit on LatentDirichletAllocation model (`lda_model`)

vectorizer : sklearn.feature_extraction.text.(CountVectorizer, TfIdfVectorizer).
    vectorizer used to convert raw documents to document-term matrix (`dtm`)

**kwargs: Keyword argument to be passed to pyLDAvis.prepare()


Returns
-------
prepared_data : PreparedData
      the data structures used in the visualization


Example
--------
For example usage please see this notebook:
http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb

See
------
See `pyLDAvis.prepare` for **kwargs.
File:      ~/.local/lib/python3.5/site-packages/pyLDAvis/sklearn.py
Type:      function

@nickhamlin

This issue had me tied up for hours. The crux of the issue is exactly what @mileserickson highlights - there's no way to easily discover that this exists. Perhaps an easy fix would be to change the prepare method such that sort_topics=False by default? That way, users could make the (natural) assumption that indexes would match, but would still retain the option of overriding that if they wanted to.

@francis-de-ladu

I had the same issue and thought there was a problem during topic inference. Sorting topics by prevalence shouldn't be the default.

@cmastronardo

The same issue happened to me; this is very confusing.

@alexeyegorov

I'd also like to add that this is really not something I expected, and it led me to wrong results because no tutorial mentions it. I would really appreciate changing the default to False.

Collaborator

msusol commented Mar 14, 2021

Making this change would break the unit tests, which rely on R output data. Users who produce the data with the R package but use pyLDAvis for the visualization would then hit the opposite problem.

```
>       assert_array_equal(np.array(expected['topic.order']), np.array(output['topic.order']))
E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 20 / 20 (100%)
E       Max absolute difference: 18
E       Max relative difference: 18.
E        x: array([19,  1, 11, 17, 14, 15, 13, 10,  5,  7, 20, 16, 18,  9,  3,  8,  2,
E              12,  4,  6])
E        y: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
E              18, 19, 20])

tests/pyLDAvis/test_prepare.py:44: AssertionError
```

@bmabey Thoughts?

@msusol msusol self-assigned this Mar 14, 2021
jonaschn added a commit to jonaschn/tomotopy that referenced this issue May 28, 2021
Override non-intuitive parameters with more appropriate values to better match the expectations of tomotopy users who are not familiar with the internals of pyLDAvis
A similar issue is discussed here: bmabey/pyLDAvis#127

9 participants