
Support for the Hierarchical Dirichlet Process from Gensim #63

Merged: 2 commits into bmabey:master on May 18, 2016

Conversation

bloody76

Hi,

I was playing with HDP models and I wanted to visualise them with pyLDAvis. Unfortunately they weren't natively supported, so I made a few fixes to add support.

I simply look up the lda_alpha and lda_beta attributes, as they are specific to the HDP model. I also changed the __num_dist_rows__ function because, even though the matrix was normalized, doc_topic_dists still tripped the asserts (NaN values, maybe?). I didn't look too deeply into it, but this fix works.

I am not sure whether the lda_beta parameter is exactly the same as state.get_lambda(), but prepare needs the topic-term distribution, so I thought it was okay...
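Roughly, the idea is something like this (a minimal sketch, not the exact patch; it assumes the lda_alpha/lda_beta attributes named above):

import numpy as np

def _topic_term_dists(model):
    # HDP models carry lda_beta (per this PR); LDA models keep state.get_lambda().
    if hasattr(model, 'lda_beta'):
        dists = np.asarray(model.lda_beta)
    else:
        dists = model.state.get_lambda()
    # Renormalize each row so every topic is a proper distribution over terms;
    # this also guards against the assert failures mentioned above.
    return dists / dists.sum(axis=1)[:, None]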

I am using it and it works. Tell me if the PR looks right to you! :)

@bmabey (Owner) commented May 12, 2016

Thanks! I am not familiar with gensim's HDP either, so before we merge this in I think we should do two things:

  • Have a code review from someone familiar with gensim... @piskvorky, would you mind taking a look or suggesting someone else?
  • Write some tests to make sure both the standard LDA and the HDP models run. This is largely to ensure that the LDA model still works and that we guard against regressions. See "add tests for gensim prepare!" (#46).

@bloody76 (Author)

Hi!

Okay, I will try to add some tests. I wanted to check that LDA still works, but it seems the make test command is broken, and Travis too. Could you fix it? I don't think it has anything to do with my PR; when I remove my patch it still fails...

I will work on the LDA/HDP tests in the meantime :)

@bmabey (Owner) commented May 12, 2016

Yeah, it looks like the build failed on this PR due to a newer version of pandas. I'll fix that and let you know when to rebase.

@bmabey (Owner) commented May 12, 2016

BTW, I've just been running the tests locally with py.test.

@bmabey (Owner) commented May 12, 2016

Okay, I fixed the issue so try rebasing and pushing this PR again to get a clean Travis build.

@bmabey (Owner) commented May 12, 2016

Oh, and if you really want to go the extra mile, adding an HDP example to the gensim notebook would be great. :)

@piskvorky

@tmylk should be able to help here. Better model visualisations are definitely on our roadmap with gensim :)

@bloody76 (Author)

Okay, no problem. I will try to rebase, add some tests, and add the example to the gensim notebook. Thanks @piskvorky for reviewing :)

@bloody76 (Author)

I guess there is still a problem; I still have an issue with py.test:

$ py.test
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.5.1, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /home/apis/nlp_core, inifile: tox.ini
plugins: cov-2.2.1
collected 1 items 

tests/pyLDAvis/test_prepare.py F
=================================================================================================== FAILURES ====================================================================================================
________________________________________________________________________________________ test_end_to_end_with_R_examples ________________________________________________________________________________________

    def test_end_to_end_with_R_examples():
>       data_input, expected = load_dataset('movie_reviews')

tests/pyLDAvis/test_prepare.py:33: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/pyLDAvis/test_prepare.py:21: in load_dataset
    data_input = json.load(j)
/usr/lib/python3.5/json/__init__.py:268: in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
/usr/lib/python3.5/json/__init__.py:319: in loads
    return _default_decoder.decode(s)
/usr/lib/python3.5/json/decoder.py:339: in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <json.decoder.JSONDecoder object at 0x69e934b5ecc0>, s = 'version https://git-lfs.github.com/spec/v1\noid sha256:0f8753ac5b6e89031fc56623e9f71a61ebaca0e3382956944ad05c0844580298\nsize 7087084\n'
idx = 0

    def raw_decode(self, s, idx=0):
        """Decode a JSON document from ``s`` (a ``str`` beginning with
            a JSON document) and return a 2-tuple of the Python
            representation and the index in ``s`` where the document ended.

            This can be used to decode a JSON document from a string that may
            have extraneous data at the end.

            """
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
>           raise JSONDecodeError("Expecting value", s, err.value) from None
E           json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

/usr/lib/python3.5/json/decoder.py:357: JSONDecodeError

Am I missing something?

@bmabey (Owner) commented May 13, 2016

The problem is that you don't have GitHub's LFS (Large File Storage) git extension:
https://git-lfs.github.com

Either install that or just run the tests you are writing and let Travis run them all on push.


@bloody76 (Author)

Oh okay, thanks! I didn't know the repo required that extension.

@bloody76 (Author) commented May 13, 2016

I just added some tests to ensure the prepare function still works with the LDA and HDP models. Is this how you imagined it?

I will do the notebook part later; I can't do it now.
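Roughly, they are smoke tests of this shape (a sketch, not the exact test code; the corpus and dictionary arguments stand in for the actual test fixtures):

import gensim
import pyLDAvis
import pyLDAvis.gensim

def test_prepare_runs_for_lda_and_hdp(corpus, dictionary):
    # Fit one model of each kind on the (hypothetical) fixture data.
    lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
    hdp = gensim.models.HdpModel(corpus=corpus, id2word=dictionary)
    # prepare() should succeed for both model types without raising.
    for model in (lda, hdp):
        data = pyLDAvis.gensim.prepare(model, corpus, dictionary)
        assert data is not None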

@bloody76 (Author)

I have done the notebook part. I hope it shows how simple it is to use HDP models with pyLDAvis.

Also, I noticed a problem with HDP models: when applying principal component analysis, it seems some eigenvalues are negative. I think it skews the visualization a bit, but I don't know if it's really a problem.

Correct me if I am wrong, but it seems HDP models are very sensitive to noise (and need a lot more data, I guess)? In my own projects, I had to remove as much noise as possible to get topics that are relevant from a human point of view.
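To illustrate where negative eigenvalues can come from (just an illustration, not code from the PR): with classical scaling, they appear whenever the pairwise topic distances are not Euclidean, and any 2-D projection then distorts the layout:

import numpy as np

def scaling_eigenvalues(pairwise_dists):
    # Double-center the squared distance matrix (classical MDS / PCoA).
    n = pairwise_dists.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (pairwise_dists ** 2) @ J
    # Negative eigenvalues mean the distances are not Euclidean, so a
    # 2-D embedding of the topics cannot reproduce them exactly.
    return np.linalg.eigvalsh(B)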

Inline review comment on this diff excerpt:

num_topics=2)

data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(data, 'index_lda.html')
@bmabey (Owner):

Looks good... maybe write the HTML file to /tmp or delete it afterwards to have proper cleanup.
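Something along these lines would do it (a sketch; lda, corpus, and dictionary are as in the diff above):

import os
import tempfile

import pyLDAvis
import pyLDAvis.gensim

# Write the HTML into a throwaway directory so the test leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(data, os.path.join(tmp, 'index_lda.html'))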

@bloody76 (Author):

Done!

@bmabey (Owner) commented May 15, 2016

Yeah, that visualization doesn't look good... Do all of the gensim HDP ones look like that?

BTW, could you rerun the notebook so the cache takes effect and we don't have as much output in the cells (i.e. cell 2)?

@bloody76 (Author)

Hi,

I get this kind of visualization for my own project: http://puu.sh/oSvaV/05c9e050b6.png, so I think they are not always like that.

Since it uses a decomposition, maybe HDP didn't recover the topics correctly (for some reason) and noise is prevalent across all the topics, making only one or two axes important, hence this visualization.

Got it for the notebook.

@bloody76 (Author)

Hi,

I updated the notebook and cleared the outputs.

There seems to be one final problem with the dependencies: if I add gensim to the test-requirements, it throws a conflict for the requests package. I dug in a little, and it seems to come from the smart-open package, which requires a specific version of requests while we install the latest one via the requirements.

I tested putting the gensim dependency first in the requirements and it works, but that's not very clean... The best fix would be to remove the version pin from smart-open, I guess.

What should we do ?

@bmabey (Owner) commented May 15, 2016

I agree, let's relax the version requirement for 'smart-open' and see if that fixes it while preserving the functionality 'smart-open' provides.


@bmabey (Owner) commented May 15, 2016

Perhaps we should comment about this known problem in the notebook... caveat emptor.

We could also point them to this Gibbs sampler HDP lib that may be easier to use (albeit slower):

http://datamicroscopes.github.io/topic.html

Thanks for all these changes BTW. The notebook and tests give me enough confidence that we can skip the gensim code review if no one steps up.


@tmylk commented May 16, 2016

The requests version pin was removed in the latest version of smart_open released today.

@tmylk commented May 16, 2016

Gave it a quick review. The use of gensim looks good.

@bmabey (Owner) commented May 16, 2016

Thanks @tmylk!

@bloody76 (Author)

Cool! Thanks for reviewing as well, @tmylk :)

Concerning the caveat emptor, I am not sure I understood: do you want me to comment on the eigenvalue part? Or on the visualization that "fails" because HDP can be a little more sensitive to noise (I am still not sure of that; someone with more experience should comment on it)?

For the datamicroscopes part, I don't quite see why it would be easier; running LDA seems to require four lines of code that don't look very clear at first sight.

@bmabey (Owner) commented May 16, 2016

"Easier to fit," I should have said... I've never run into these odd-looking visualizations with an HDP Gibbs sampler, but I have with gensim's HDP before. I don't know whether that is related to gensim's implementation or to the approach taken (variational Bayes, I assume, but I don't know).

@bloody76 (Author) commented May 17, 2016

Yeah, maybe it could also come from the implementation (though that seems unlikely) or from the approach. Still, I am not sure what you want me to do; can you be more explicit?

I see there are problems on Travis with Python 3.3. I use Python 3.5 myself without any problem. Do you have an idea where they could come from?

Maybe I am missing something, but for Python 3.4 we get this output:

$ python --version
Python 3.4.2
$ bash miniconda.sh -b -p $HOME/miniconda
PREFIX=/home/travis/miniconda
installing: _cache-0.0-py35_x0 ...
installing: python-3.5.1-0 ...

Miniconda is installing Python 3.5.1, so I am not sure Python 3.4 is actually tested. The job for Python 3.3 is also installing Python 3.5.1 in the virtualenv... Is that what you want to do?

And what's weird is that both virtualenvs should then have Python 3.5.1, yet the Python 3.3 job fails with a syntax error in boto 2.4...

I am not very used to conda, so it's very likely I am missing something again :P

@bmabey (Owner) commented May 17, 2016

I'm confused by Travis as well. Squash (rebase) your commits and I'll merge it in and take it from here. Thanks!


@bloody76 (Author)

Okay, done! I made 4 commits with specific tags; I hope that's okay, or do you want me to squash them all into one commit?

My pleasure :)

@bmabey (Owner) commented May 18, 2016

Thanks. Please squash the first three commits into one. That way we have a single clean commit with code, deps, and test changes. The notebook commit is good to keep separate.

Commit messages pushed to the branch:

…models.

[requirements] added gensim packages to test-requirements.

[tests] added gensim tests to ensure prepare/save_html functions still works with lda and hdp models.
@bloody76 (Author)

Alright! Done.

@bmabey merged commit c228ab0 into bmabey:master on May 18, 2016
@bmabey (Owner) commented May 18, 2016

I'll report back once I've resolved the CI issues and have cut a release (it may be a week or so before I get to it).
