Deterministic behavior? #86

tploetz · 2020-12-05T17:18:20Z

Hi,

Really interesting paper and great that you are sharing the code. Much appreciated!

I am playing around with the code a bit and am seemingly running into some non-deterministic behavior, which I would like to hear your thoughts on.

I have a relatively small corpus of documents (3000, each with some 30 sentences; some documents may be duplicated or at least partially overlap).

I noticed that when running top2vec multiple times (no matter which encoder I am using; basically following the tutorial), I get ever so slightly different results. In other words, the behavior of top2vec does not seem to be deterministic.

What would be the reason for this non-determinism and is there a way to make the behavior deterministic? I would need the same output every time I run the model on the same data.

Thanks!

ddangelov · 2020-12-06T00:11:22Z

If you are using doc2vec as the embedding_model, then that will not be deterministic, since training the neural network is stochastic. Additionally, UMAP is also a stochastic algorithm, and it is at the heart of Top2Vec. There are no plans at the moment for trying to make it deterministic.

tploetz · 2020-12-10T21:44:36Z

So, I am using the universal-sentence-encoder and thus the doc2vec randomness should not apply here, which only (?) leaves the UMAP stochasticity. Digging a bit deeper into the UMAP documentation, it seems like it is possible to set a random seed for it:

https://umap-learn.readthedocs.io/en/latest/reproducibility.html
--> umap.UMAP(random_state=42).fit(data)

Any chance you could give access to this through the top2vec API?

Thanks!

tploetz · 2020-12-11T17:36:41Z

Ok, confirmed. If you change

    umap_model = umap.UMAP(n_neighbors=15,
                           n_components=5,
                           metric='cosine').fit(self._get_document_vectors(norm=False))

to, for example:

    umap_model = umap.UMAP(n_neighbors=15,
                           n_components=5,
                           metric='cosine',
                           random_state=42).fit(self._get_document_vectors(norm=False))

(lines 330ff.)

then the behavior is deterministic (when choosing pretrained embeddings, such as universal-sentence-encoder). I get exactly the same results when running the algorithm multiple times.

I could imagine exposing this in the top2vec API through the constructor (line 161) with an optional parameter (like 'verbose'), checking it right before the call to umap.UMAP, and then having two ways of fitting the data (with or without a specified random seed).
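A minimal sketch of that idea, for illustration only. The helper name `build_umap_args` and its `random_state` parameter are hypothetical and not part of top2vec; the UMAP settings are copied from the snippet quoted above. Building the keyword arguments once means a single call to umap.UMAP covers both the seeded and the unseeded case.

```python
from typing import Any, Dict, Optional


def build_umap_args(random_state: Optional[int] = None) -> Dict[str, Any]:
    """Return keyword arguments for umap.UMAP, adding a seed only if one is given.

    Hypothetical helper: it is not part of the top2vec API.
    """
    # Same settings as in the snippet quoted in this thread.
    umap_args: Dict[str, Any] = {
        "n_neighbors": 15,
        "n_components": 5,
        "metric": "cosine",
    }
    # Leave random_state out entirely when no seed is requested,
    # so the default (stochastic) behavior stays unchanged.
    if random_state is not None:
        umap_args["random_state"] = random_state
    return umap_args


# Assumed use inside the model, where `seed` would come from the constructor:
#   umap_model = umap.UMAP(**build_umap_args(seed)).fit(
#       self._get_document_vectors(norm=False))
```

With `random_state=None` as the default, existing callers would see no change. The UMAP reproducibility page linked above also notes that fixing the seed comes at some cost in performance, which is one reason to keep it optional.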

Do you want me to create a pull request or would that be too much overhead?

Thanks!

ddangelov · 2020-12-13T22:59:44Z

Thank you for the pull request offer and for the research and suggestions. However, making it deterministic is not a priority at the moment, but perhaps it will be revisited in the future.

ddangelov closed this as completed Dec 6, 2020

ddangelov mentioned this issue Dec 9, 2020

Reproducible results #88

Closed

tploetz mentioned this issue Dec 11, 2020

Deterministic Behavior (follow-up to #86) #93

Closed

ddangelov mentioned this issue Mar 5, 2021

Each time i pass in the same keyword, why do i land up with different topic numbers, documents and words?? #143

Closed
