Deterministic behavior? #86

Closed
tploetz opened this issue Dec 5, 2020 · 4 comments

@tploetz

tploetz commented Dec 5, 2020

Hi,

Really interesting paper and great that you are sharing the code. Much appreciated!

I am playing around with the code a bit and seem to be running into some non-deterministic behavior, which I would like to hear your thoughts on.

I have a relatively small corpus of documents (3000, each with some 30 sentences; some documents may be duplicated or at least partially overlap).

I noticed that when running top2vec multiple times (no matter which encoder I am using; basically following the tutorial), I get ever so slightly different results. That is, the behavior of top2vec does not seem to be deterministic.

What would be the reason for this non-determinism and is there a way to make the behavior deterministic? I would need the same output every time I run the model on the same data.

Thanks!

@ddangelov
Owner

If you are using doc2vec as the embedding_model, then that will not be deterministic, since training the neural network is stochastic. Additionally, UMAP is also a stochastic algorithm and it is at the heart of Top2Vec. There are no plans at the moment to make it deterministic.

@tploetz
Author

tploetz commented Dec 10, 2020

So, I am using the universal-sentence-encoder and thus the doc2vec randomness should not apply here, which only (?) leaves the UMAP stochasticity. Digging a bit deeper into the UMAP documentation, it seems like it is possible to set a random seed for it:

https://umap-learn.readthedocs.io/en/latest/reproducibility.html
--> umap.UMAP(random_state=42).fit(data)
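
One way to check the reproducibility claim is to fit twice with the same seed and compare the resulting embeddings. The following is only a minimal sketch: the helper names `is_reproducible` and `fit_umap` are illustrative assumptions and not part of top2vec or UMAP, and `fit_umap` assumes `umap-learn` is installed.

```python
from typing import Callable, Sequence


def is_reproducible(fit: Callable[[], Sequence[Sequence[float]]], runs: int = 2) -> bool:
    """Return True if calling `fit()` gives exactly the same rows on every run."""
    first = [[float(x) for x in row] for row in fit()]
    for _ in range(runs - 1):
        other = [[float(x) for x in row] for row in fit()]
        if other != first:
            return False
    return True


def fit_umap(data, random_state=None):
    """Fit UMAP on `data` and return the low-dimensional embedding.

    Hypothetical helper; `umap` is imported lazily so that
    `is_reproducible` can be used without umap-learn installed.
    """
    import umap

    return umap.UMAP(random_state=random_state).fit(data).embedding_
```

With some array `data` of document vectors, `is_reproducible(lambda: fit_umap(data, random_state=42))` would then compare two seeded fits. The UMAP documentation describes seeded runs as reproducible; whether results also match across machines or library versions is not something this sketch checks.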

Any chance you could give access to this through the top2vec API?

Thanks!

@tploetz
Author

tploetz commented Dec 11, 2020

Ok, confirmed. If you change

umap_model = umap.UMAP(n_neighbors=15,
                       n_components=5,
                       metric='cosine').fit(self._get_document_vectors(norm=False))

to, for example:

umap_model = umap.UMAP(n_neighbors=15,
                       n_components=5,
                       metric='cosine',
                       random_state=42).fit(self._get_document_vectors(norm=False))

(lines 330ff.)

then the behavior is deterministic (when choosing pretrained embeddings, such as universal-sentence-encoder). I get exactly the same results when running the algorithm multiple times.

I could imagine exposing this in the top2vec API through the constructor (line 161) as an optional parameter (like 'verbose'), then checking it right before the call to umap.UMAP and fitting the data one of two ways (with or without a specified random seed).
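
A rough sketch of what that could look like follows. `SeededReducer` and `build_umap_kwargs` are hypothetical names used only for illustration; this is not the actual Top2Vec constructor, and the `fit` method assumes `umap-learn` is installed.

```python
from typing import Any, Dict, Optional


def build_umap_kwargs(random_state: Optional[int] = None) -> Dict[str, Any]:
    """Keyword arguments for umap.UMAP.

    The first three values match the call quoted above; `random_state`
    is the proposed optional addition (None means unseeded).
    """
    return {
        "n_neighbors": 15,
        "n_components": 5,
        "metric": "cosine",
        "random_state": random_state,
    }


class SeededReducer:
    """Hypothetical stand-in for the part of Top2Vec that calls umap.UMAP."""

    def __init__(self, random_state: Optional[int] = None):
        # Stored on the instance, like other optional constructor flags.
        self.random_state = random_state

    def fit(self, document_vectors):
        import umap  # lazy import: only needed when actually fitting

        return umap.UMAP(**build_umap_kwargs(self.random_state)).fit(document_vectors)
```

Since `umap.UMAP` already defaults to `random_state=None`, passing the value straight through may be enough, so a separate branch for the unseeded case might not be needed.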

Do you want me to create a pull request or would that be too much overhead?

Thanks!

@ddangelov
Owner

Thank you for the pull request offer and for the research and suggestions. However, making it deterministic is not a priority at the moment, but perhaps it will be revisited in the future.
