
Support for the Hierarchical Dirichlet Process from Gensim #63

Merged: 2 commits into bmabey:master on May 18, 2016

Conversation

bloody76

Hi,

I was playing with HDP models and I wanted to visualise them with pyLDAvis. Unfortunately they weren't natively supported, so I made a few fixes to add support.

I simply look up the lda_alpha and lda_beta attributes, as they are specific to the HDP model. I also changed the __num_dist_rows__ function because, even though the matrix was normalized, doc_topic_dists still tripped the asserts (NaN values, maybe?). I didn't look too deeply into it, but this fix works.

I am not sure whether the lda_beta parameter is exactly the same as state.get_lambda(), but prepare needs the topic-term distribution, so I thought it was okay...
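Roughly, the idea is something like this (a minimal sketch, not the exact patch; it assumes the lda_alpha/lda_beta attributes named above):

import numpy as np

def _topic_term_dists(model):
    # HDP models carry lda_beta (per this PR); LDA models keep state.get_lambda().
    if hasattr(model, 'lda_beta'):
        dists = np.asarray(model.lda_beta)
    else:
        dists = model.state.get_lambda()
    # Renormalize each row so every topic is a proper distribution over terms;
    # this also guards against the assert failures mentioned above.
    return dists / dists.sum(axis=1)[:, None]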

I am using it and it works. Tell me if the PR looks right to you! :)

@bmabey (Owner) commented May 12, 2016

Thanks! I am not familiar with gensim's HDP either, so before we merge this in I think we should do two things:

  • Have a code review from someone familiar with gensim... @piskvorky, would you mind taking a look or suggesting someone else?
  • Write some tests to make sure both the standard LDA and the HDP models run. This is largely to ensure that the LDA model still works and that we guard against regressions. See "add tests for gensim prepare!" (#46).

@bloody76 (Author)

Hi!

Okay, I will try to add some tests. I wanted to check that LDA still works, but it seems the make test command is broken, and Travis too. Could you fix it? I don't think it has anything to do with my PR; when I remove my patch it still fails...

I will work on the LDA/HDP tests in the meantime :)

@bmabey (Owner) commented May 12, 2016

Yeah, it looks like the build failed on this PR due to a newer version of pandas. I'll fix that and let you know when to rebase.

@bmabey (Owner) commented May 12, 2016

BTW, I've just been running the tests locally with py.test.

@bmabey (Owner) commented May 12, 2016

Okay, I fixed the issue so try rebasing and pushing this PR again to get a clean Travis build.

@bmabey (Owner) commented May 12, 2016

Oh, and if you really want to go the extra mile, adding an HDP example to the gensim notebook would be great. :)

@piskvorky

@tmylk should be able to help here. Better model visualisations are definitely on our roadmap with gensim :)

@bloody76 (Author)

Okay, no problem. I will try to rebase, add some tests, and add the example to the gensim notebook. Thanks @piskvorky for reviewing :)

@bloody76 (Author)

I guess there is still a problem; I still have an issue with py.test:

$ py.test
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /home/apis/nlp_core, inifile: tox.ini
plugins: cov-2.2.1
collected 1 items 

tests/pyLDAvis/test_prepare.py F
=================================================================================================== FAILURES ====================================================================================================
________________________________________________________________________________________ test_end_to_end_with_R_examples ________________________________________________________________________________________

    def test_end_to_end_with_R_examples():
>       data_input, expected = load_dataset('movie_reviews')

tests/pyLDAvis/test_prepare.py:33: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/pyLDAvis/test_prepare.py:21: in load_dataset
    data_input = json.load(j)
/usr/lib/python3.5/json/__init__.py:268: in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
/usr/lib/python3.5/json/__init__.py:319: in loads
    return _default_decoder.decode(s)
/usr/lib/python3.5/json/decoder.py:339: in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <json.decoder.JSONDecoder object at 0x69e934b5ecc0>, s = 'version https://git-lfs.github.com/spec/v1\noid sha256:0f8753ac5b6e89031fc56623e9f71a61ebaca0e3382956944ad05c0844580298\nsize 7087084\n'
idx = 0

    def raw_decode(self, s, idx=0):
        """Decode a JSON document from ``s`` (a ``str`` beginning with
            a JSON document) and return a 2-tuple of the Python
            representation and the index in ``s`` where the document ended.

            This can be used to decode a JSON document from a string that may
            have extraneous data at the end.

            """
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
>           raise JSONDecodeError("Expecting value", s, err.value) from None
E           json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

/usr/lib/python3.5/json/decoder.py:357: JSONDecodeError

Am I missing something?

@bmabey (Owner) commented May 13, 2016

The problem is that you don't have GitHub's LFS (Large File Storage) git extension:
https://git-lfs.github.com

Either install that or just run the tests you are writing and let Travis run them all on push.


@bloody76 (Author)

Oh okay, thanks! I didn't know the repo required that extension.

@bloody76 (Author) commented May 13, 2016

I just added some tests to ensure the prepare function still works with the LDA and HDP models. Is this how you imagined it?

I will do the notebook part later; I can't do it now.
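Roughly, they are smoke tests of this shape (a sketch, not the exact test code; the corpus and dictionary arguments stand in for the actual test fixtures):

import gensim
import pyLDAvis
import pyLDAvis.gensim

def test_prepare_runs_for_lda_and_hdp(corpus, dictionary):
    # Fit one model of each kind on the (hypothetical) fixture data.
    lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
    hdp = gensim.models.HdpModel(corpus=corpus, id2word=dictionary)
    # prepare() should succeed for both model types without raising.
    for model in (lda, hdp):
        data = pyLDAvis.gensim.prepare(model, corpus, dictionary)
        assert data is not None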

@bloody76 (Author)

I have done the notebook part. I hope it shows how simple it is to use HDP models with pyLDAvis.

Also, I noticed a problem with HDP models: when applying principal component analysis, it seems some eigenvalues are negative. I think it skews the visualization a bit, but I don't know if it's really a problem.

Correct me if I am wrong, but it seems HDP models are very sensitive to noise (and need a lot more data, I guess)? In my own projects, I had to remove as much noise as possible to get topics that are relevant from a human point of view.
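To illustrate where negative eigenvalues can come from (just an illustration, not code from the PR): with classical scaling, they appear whenever the pairwise topic distances are not Euclidean, and any 2-D projection then distorts the layout:

import numpy as np

def scaling_eigenvalues(pairwise_dists):
    # Double-center the squared distance matrix (classical MDS / PCoA).
    n = pairwise_dists.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (pairwise_dists ** 2) @ J
    # Negative eigenvalues mean the distances are not Euclidean, so a
    # 2-D embedding of the topics cannot reproduce them exactly.
    return np.linalg.eigvalsh(B)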

Inline review comment on this diff excerpt:

num_topics=2)

data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'index_lda.html')
@bmabey (Owner):

Looks good... maybe write the HTML file to /tmp or delete it afterwards to have proper cleanup.
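Something along these lines would do it (a sketch; lda, corpus, and dictionary are as in the diff above):

import os
import tempfile

import pyLDAvis
import pyLDAvis.gensim

# Write the HTML into a throwaway directory so the test leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(data, os.path.join(tmp, 'index_lda.html'))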

@bloody76 (Author):

Done!

@bmabey (Owner) commented May 15, 2016

Yeah, that visualization doesn't look good... Do all of the gensim HDP ones look like that?

BTW, could you rerun the notebook so the cache takes effect and we don't have as much output in the cells (i.e. cell 2)?

@bloody76 (Author)

Hi,

I get this kind of visualization for my own project: http://puu.sh/oSvaV/05c9e050b6.png, so I think they are not always like that.

Since it uses a decomposition, maybe HDP didn't recover the topics correctly (for some reason) and noise is prevalent across all the topics, making only one or two axes important, hence this visualization.

Got it for the notebook.

@bloody76 (Author)

Hi,

I updated the notebook and cleared the outputs.

There seems to be one final problem with the dependencies: if I add gensim to the test-requirements, it throws a conflict for the requests package. I dug in a little, and it seems to come from the smart-open package, which requires a specific version of requests while we install the latest one via the requirements.

I tested putting the gensim dependency first in the requirements and it works, but that's not very clean... The best fix would be to remove the version pin from smart-open, I guess.

What should we do ?

@bmabey (Owner) commented May 15, 2016

I agree, let's relax the version requirement for 'smart-open' and see if that fixes it while preserving the functionality 'smart-open' provides.


@bmabey (Owner) commented May 15, 2016

Perhaps we should comment about this known problem in the notebook... caveat emptor.

We could also point them to this Gibbs sampler HDP lib that may be easier to use (albeit slower):

http://datamicroscopes.github.io/topic.html

Thanks for all these changes BTW. The notebook and tests give me enough confidence that we can skip the gensim code review if no one steps up.


@tmylk commented May 16, 2016

The requests version pin was removed in the latest version of smart_open released today.

@tmylk commented May 16, 2016

Gave it a quick review. The use of gensim looks good.

@bmabey (Owner) commented May 16, 2016

Thanks @tmylk!

@bloody76 (Author)

Cool! Thanks for reviewing as well, @tmylk :)

Concerning the caveat emptor, I am not sure I understood: do you want me to comment on the eigenvalue part? Or on the visualization that "fails" because HDP can be a little more sensitive to noise (I am still not sure of that; someone with more experience should comment on it)?

For the datamicroscopes part, I don't quite see why it would be easier; running LDA seems to require four lines of code that don't look very clear at first sight.

@bmabey (Owner) commented May 16, 2016

"Easier to fit," I should have said... I've never run into these odd-looking visualizations with an HDP Gibbs sampler, but I have with gensim's HDP before. I don't know whether that is related to gensim's implementation or to the approach taken (variational Bayes, I assume, but I don't know).

@bloody76 (Author) commented May 17, 2016

Yeah, maybe it could also come from the implementation (though that seems unlikely) or from the approach. Still, I am not sure what you want me to do; can you be more explicit?

I see there are problems on Travis with Python 3.3. I use Python 3.5 myself without any problem. Do you have an idea where they could come from?

Maybe I am missing something, but for Python 3.4 we get this output:

$ python --version
Python 3.4.2
$ bash miniconda.sh -b -p $HOME/miniconda
PREFIX=/home/travis/miniconda
installing: _cache-0.0-py35_x0 ...
installing: python-3.5.1-0 ...

Miniconda is installing Python 3.5.1, so I am not sure Python 3.4 is actually tested. The job for Python 3.3 is also installing Python 3.5.1 in the virtualenv... Is that what you want to do?

And what's weird is that both virtualenvs should then have Python 3.5.1, yet the Python 3.3 job fails with a syntax error in boto 2.4...

I am not very used to conda, so it's very likely I am missing something again :P

@bmabey (Owner) commented May 17, 2016

I'm confused by Travis as well. Squash (rebase) your commits and I'll merge it in and take it from here. Thanks!


@bloody76 (Author)

Okay, done! I made 4 commits with specific tags; I hope that's okay, or do you want me to squash them all into one commit?

My pleasure :)

@bmabey (Owner) commented May 18, 2016

Thanks. Please squash the first three commits into one. That way we have a single clean commit with code, deps, and test changes. The notebook commit is good to keep separate.

Commit messages pushed to the branch:

…models.

[requirements] added gensim packages to test-requirements.

[tests] added gensim tests to ensure prepare/save_html functions still works with lda and hdp models.
@bloody76 (Author)

Alright! Done.

@bmabey merged commit c228ab0 into bmabey:master on May 18, 2016
@bmabey (Owner) commented May 18, 2016

I'll report back once I've resolved the CI issues and have cut a release (it may be a week or so before I get to it).
