default parameter values #14

Open
michalovadek opened this issue Jan 8, 2021 · 1 comment

michalovadek commented Jan 8, 2021

Finally, a solid doc2vec implementation in R. Many thanks! I have a relatively minor suggestion: I feel the default parameter values might be underselling the power of this method. I know anyone can change the defaults, but in reality most users just want to "press play". When I look at doc2vec applications in Python, the go-to text analysis language for most people, they tend to use more demanding settings. For example, the Top2Vec module uses roughly the following default parameter values (from https://github.com/ddangelov/Top2Vec/blob/master/top2vec/Top2Vec.py), written here as a paragraph2vec call:

model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 300, iter = 40, hs = TRUE, window = 15, negative = 0, sample = 0.00001)

These values are surely not 100% scientific, but I think the authors experimented quite a bit before arriving at them, so they make a useful starting point.

The current default values make training very fast, but the resulting embeddings may often be quite poor. Negative sampling, in particular, has in some contexts been associated with hurting the quality of the semantic space; the settings above turn it off (negative = 0) and use hierarchical softmax (hs = TRUE) instead. I can also say that in my use case the default settings are not ideal, while the ones above yield pretty solid results within a reasonable time. Just a suggestion.

@jwijffels
Collaborator

Interesting. I think (I don't remember for certain) I took some of the default settings from https://github.com/hiyijian/doc2vec/blob/master/test/TestTrain.cpp#L8, but I may also have looked at the gensim defaults at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec.py#L165. I would certainly be interested in links to other default settings.

@jwijffels jwijffels mentioned this issue Feb 4, 2021