Error with saving/loading PLDAModel #204

@juhopaak

Description

When I train a PLDAModel and then save and load it, the model's properties change after loading.

For instance,

from tomotopy import PLDAModel
docs = [['foo'], ['bar'], ['baz'], ['foo', 'bar'], ['baz', 'bar']]
mdl = PLDAModel(latent_topics=2)
for doc in docs:
    mdl.add_doc(doc)
mdl.train(100)
print(mdl.summary())
print(mdl.perplexity)

produces

<Basic Info>
| PLDAModel (current version: 0.12.4)
| 5 docs, 7 words
| Total Vocabs: 3, Used Vocabs: 3
| Entropy of words: 1.07899
| Entropy of term-weighted words: 1.07899
| Removed Vocabs: <NA>
| Label of docs and its distribution
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -1.94159
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| latent_topics: 2 (the number of latent topics, which are shared to all documents, between 1 ~ 32767)
| topics_per_label: 1 (the number of topics per label between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 3261328688 (random seed)
| trained in version 0.12.4
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [3.0139716 7.3531275]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| Latent 0 (#0) (2) : foo bar baz
| Latent 1 (#1) (5) : bar baz foo

6.985572470333207

but after calling

mdl.save('model.bin', full=True)
mdl = PLDAModel.load('model.bin')
print(mdl.summary())
print(mdl.perplexity)

I get

<Basic Info>
| PLDAModel (current version: 0.12.4)
| 5 docs, 7 words
| Total Vocabs: 3, Used Vocabs: 3
| Entropy of words: 1.07899
| Entropy of term-weighted words: 1.07899
| Removed Vocabs: <NA>
| Label of docs and its distribution
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -2.19768
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| latent_topics: 2 (the number of latent topics, which are shared to all documents, between 1 ~ 32767)
| topics_per_label: 1 (the number of topics per label between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 3666141070 (random seed)
| trained in version 0.12.4
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.1 0.1]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| Latent 0 (#0) (2) : foo bar baz
| Latent 1 (#1) (5) : bar baz foo
|

9.004082581035151

The model's log-likelihood per word and perplexity diverge before and after saving/loading (and the `alpha` parameter reverts from the optimized values to the initial `[0.1 0.1]`). I've tried this with both full=True and full=False, and with repeated saves/loads, but the issue persists.
