Deterministic behavior? #86
Comments
If you are using
So, I am using the universal-sentence-encoder, and thus the doc2vec randomness should not apply here, which only (?) leaves the UMAP stochasticity. Digging a bit deeper into the UMAP documentation, it seems it is possible to set a random seed for it: https://umap-learn.readthedocs.io/en/latest/reproducibility.html Any chance you could give access to this through the top2vec API? Thanks!
OK, confirmed. If you change umap_model = umap.UMAP(n_neighbors=15, to, for example, umap_model = umap.UMAP(n_neighbors=15, random_state=42, (lines 330ff.), then the behavior is deterministic (when choosing pretrained embeddings such as universal-sentence-encoder). I get exactly the same results when running the algorithm multiple times. I could imagine exposing this through the top2vec constructor (line 161) with an optional parameter (like 'verbose'), checking it right before the call to umap.UMAP, and then having two ways of fitting the data (with or without a specified random seed). Do you want me to create a pull request, or would that be too much overhead? Thanks!
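As a minimal sketch of the proposed change (the helper name and defaults below are illustrative, not Top2Vec's actual code or API): an optional seed parameter on the constructor could be threaded through to the umap.UMAP call, falling back to the current non-deterministic behavior when it is not given.

```python
# Hypothetical sketch of wiring an optional random seed through to UMAP.
# build_umap_kwargs and its defaults are illustrative, not Top2Vec's real code.

def build_umap_kwargs(n_neighbors=15, random_state=None):
    """Assemble keyword arguments for umap.UMAP, optionally pinning a seed."""
    kwargs = {"n_neighbors": n_neighbors, "n_components": 5, "metric": "cosine"}
    if random_state is not None:
        # Per UMAP's reproducibility docs, a fixed random_state makes the
        # embedding deterministic (at the cost of single-threaded layout).
        kwargs["random_state"] = random_state
    return kwargs

# Usage inside the fitting step would then be:
#   umap_model = umap.UMAP(**build_umap_kwargs(random_state=42))
print(build_umap_kwargs(random_state=42))
```

This keeps a single code path: when the seed is None the dict simply omits random_state, so UMAP behaves exactly as it does today.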
Thank you for the pull request offer and for the research and suggestions. However, making it deterministic is not a priority at the moment; perhaps it will be revisited in the future.
Hi,
Really interesting paper and great that you are sharing the code. Much appreciated!
I am playing around with the code a bit and am seemingly running into some non-deterministic behavior, which I would like to hear your thoughts on.
I have a relatively small corpus of documents (3000, each with some 30 sentences; some documents may be duplicated or at least partially overlap).
I noticed that when running top2vec multiple times (no matter which encoder I use; basically following the tutorial), I get ever so slightly different results. That means the behavior of top2vec does not seem to be deterministic.
What would be the reason for this non-determinism, and is there a way to make the behavior deterministic? I would need the same output every time I run the model on the same data.
Thanks!
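To illustrate the general point with a minimal stand-in (this is not Top2Vec itself): any pipeline whose stochastic steps draw from a seeded RNG becomes repeatable, which is exactly what fixing the seeds of the stochastic stages (doc2vec training, the UMAP layout) would give you.

```python
import random

def embed(texts, seed=None):
    # Stand-in for a stochastic embedding step; in Top2Vec the real sources
    # of randomness are doc2vec training and the UMAP projection.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in texts]

docs = ["doc one", "doc two"]
# Seeding pins every draw, so repeated runs produce identical vectors;
# unseeded runs are virtually certain to differ.
assert embed(docs, seed=42) == embed(docs, seed=42)
```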