Support for the Hierarchical Dirichlet Process from Gensim #63
Conversation
Thanks! I am not familiar with the HDP from gensim either, so before we merge this in I think we should do two things: …
Hi! Okay, I will try to add some tests. I wanted to check that LDA was still working, but it seems the … I will look for the LDA/HDP tests in the meantime :)
Yeah, it looks like the build failed on this PR due to a newer version of pandas. I'll fix that and let you know when to rebase.

BTW, I've just been running the tests locally with …

Okay, I fixed the issue, so try rebasing and pushing this PR again to get a clean Travis build.

Oh, and if you really want to go the extra mile, adding an HDP example to the gensim notebook would be great. :)

@tmylk should be able to help here. Better model visualisations are definitely on our roadmap with gensim :)

Ok, no problem, I will try to rebase, add some tests, and add the example to the gensim notebook. Thanks @piskvorky for reviewing :)
I guess there is still a problem; I still have an issue with …

Am I missing something?

The problem is that you don't have GitHub's LFS (Large File Storage) git extension. Either install that, or just run the tests you are writing and let Travis run them all on push.
Oh ok, thanks, I didn't know the repo required that extension.

I just added some tests to ensure the … I will do the notebook part later; I can't do it now.
I have done the notebook part. I hope it shows how simple it is to use HDP models with pyLDAvis. I also noticed a problem with HDP models: when applying principal component analysis, it seems that some eigenvalues are negative. I think it is distorting the visualization a little, but I don't know if it's really a problem. Correct me if I am wrong, but it seems HDP models are very sensitive to noise (and you need a lot more data, I guess)? From what I have dealt with in my own projects, I had to remove as much noise as possible to get topics that are relevant from a human point of view.
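The negative-eigenvalue observation can be reproduced outside pyLDAvis: classical MDS (PCoA), the kind of decomposition used to lay topics out in 2-D, double-centers the squared distance matrix, and whenever the dissimilarities are not Euclidean the centered matrix picks up negative eigenvalues. A self-contained numpy sketch (the toy distance matrix below is made up purely for illustration):

```python
import numpy as np

def pcoa_eigenvalues(D):
    """Eigenvalues of the double-centered squared distance matrix.

    If the dissimilarities are not Euclidean (e.g. some inter-topic
    divergences), negative eigenvalues appear, which is what distorts
    the 2-D layout discussed above.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gower's double centering
    return np.sort(np.linalg.eigvalsh(B))[::-1]

# Toy dissimilarities that violate the triangle inequality
# (1 + 1 < 3), so they cannot be embedded in any Euclidean space.
D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

eigs = pcoa_eigenvalues(D)
print(eigs)  # the smallest eigenvalue is negative
```

With non-metric dissimilarities the embedding simply has no exact Euclidean representation, so a negative eigenvalue is expected behaviour of the decomposition rather than necessarily a bug in the HDP model itself.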
```python
num_topics=2)

data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'index_lda.html')
```
Looks good... maybe write the html file to /tmp or delete it to have proper cleanup.
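The cleanup being suggested can be sketched with the standard tempfile module; the HTML string here is just a stand-in for what pyLDAvis.save_html would write:

```python
import os
import tempfile

# Write the visualization HTML to a temporary file and remove it
# afterwards, instead of leaving index_lda.html behind in the repo.
html = "<html><body>LDA visualization</body></html>"  # stand-in content

fd, path = tempfile.mkstemp(suffix=".html")
try:
    with os.fdopen(fd, "w") as f:
        f.write(html)
    # ... open/inspect the file here ...
finally:
    os.remove(path)  # proper cleanup: no stray HTML files left over
```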
Done!
Yeah, that visualization doesn't look good... Do all of the gensim HDP ones look like that? BTW, could you rerun the notebook so the cache takes effect and we don't have as much in the cell output (i.e. cell 2).
Hi, I have this kind of visualization for my own project: http://puu.sh/oSvaV/05c9e050b6.png, so I think they are not always like that. If it uses decomposition, maybe HDP didn't get the topics correctly (for some reason) and the noise is prevalent in all the topics, making one or two axes dominant, hence this visualization. Got it for the notebook.
Hi, I updated the notebook and cleared the outputs. There seems to be one final problem with the dependencies: if I add gensim to the test-requirements dependencies, it throws a version-conflict exception for … I tested putting the gensim dependency first in the requirements, and it works, but that's not very clean... The best would be to remove the version requirement from … What should we do?
I agree, let's relax the version requirement for 'smart-open' and see if that fixes it while preserving the functionality provided by 'smart-open'.

Perhaps we should comment about this known problem in the notebook... caveat emptor. We could also point them to this Gibbs-sampler HDP lib that may be easier to use (albeit slower): http://datamicroscopes.github.io/topic.html Thanks for all these changes, BTW. The notebook and tests give me enough confidence, so we can skip the gensim code review if no one steps up.
The requests version pin was removed in the latest version of smart_open, released today.

Gave it a quick review. The use of gensim looks good.

Thanks @tmylk!
Cool! Thanks for reviewing as well, @tmylk :) Concerning the caveat emptor, I am not sure I understood: do you want me to comment on the eigenvalue part? On the visualization that "fails" because HDP can be a little more sensitive to noise (I am still not sure of that; someone with more experience should comment on this)? For the datamicroscopes part, I don't see why it would be easier: running LDA seems to require 4 lines of code that don't seem very clear at first sight.
Easier to fit, I should have said. I've never run into these odd-looking visualizations with an HDP Gibbs sampler, but I have with gensim's HDP before. I don't know if that is related to gensim's implementation or to the approach taken (variational Bayes, I assume, but I don't know).
Yeah, maybe it could also come from the implementation (maybe unlikely) or the approach. Still, I am not sure what you want me to do; can you be more explicit? I see that Travis has problems with the 3.3 version of Python. I use Python 3.5 myself without any problem. Do you have an idea where it could come from? Maybe I am missing something, but for Python 3.4 we get this output: …

Anaconda is installing Python 3.5.1, so I am not sure Python 3.4 is tested. The job for Python 3.3 is also installing Python 3.5.1 in the virtualenv... Is that what you want to do? What's weird then is that both virtualenvs should have Python 3.5.1, yet the Python 3.3 job is failing with a syntax error in boto 2.4... I am not very used to conda, so it's very likely I am missing something again :P
I'm confused by Travis as well. Squash (rebase) your commits and I'll merge it in and take it from here. Thanks!

Ok, done! I made 4 commits with specific tags; I hope that's okay, or do you want me to squash them all into one commit? My pleasure :)

Thanks. Please squash the first three commits into one. That way we have a single clean commit with code, deps, and test changes. The notebook commit is good to keep separate.
…models. [requirements] added gensim packages to test-requirements. [tests] added gensim tests to ensure prepare/save_html functions still works with lda and hdp models.
…t the interface is the same.
Alright! Done.

I'll report back once I've resolved the CI issues and have cut a release (it may be a week or so before I get to it).
Hi,

I was playing with HDP models and I wanted to visualise them with pyLDAvis. Unfortunately, that wasn't natively supported, so I made a few fixes to enable it.

I simply look for the attributes `lda_alpha` and `lda_beta`, as they are specific to the HDP model. I also changed the `__num_dist_rows__` function, because even though the matrix was normalized, `doc_topic_dists` made the asserts cry (maybe NaN values?). I didn't look too much into it, but this fix is working. I am not sure if the `lda_beta` parameter is exactly the same as the `state.get_lambda()` parameter, but it needed the topic-term distribution, so I thought it was ok...

I am using it and it's working. Tell me if the PR seems right to you! :)