
added support word2vec training with additional data #18636

Closed
LeoIV wants to merge 2 commits

Conversation

@LeoIV commented Jul 14, 2017

What changes were proposed in this pull request?

Word2Vec is trained unsupervised: the more data it is trained on, the more accurate the resulting word vectors tend to be. Hence, Word2Vec should support being fit on additional data.

How was this patch tested?

Additional unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@MLnick (Contributor) commented Sep 19, 2017

Hi there - I don't see the value of adding a few words in a String array to the training. You're effectively adding a second (non-distributed, and therefore limited in size) corpus to the training.

Word2Vec is aimed at training on a large corpus of text. If you want more accuracy, train on a larger training set.

Could you close this PR please?

@LeoIV (Author) commented Sep 19, 2017

At the moment, it is not possible to improve a model's accuracy by incorporating additional data. I think this should be supported, since it can increase a classifier's performance significantly. With this implementation, I was able to train unsupervised on a Wikipedia dump, which is pretty large. However, distributing the additional set is a good point.

@MLnick (Contributor) commented Sep 19, 2017

I'm sorry, but I still don't understand the intention here. You can already train on a Wikipedia dump (or any other dataset) by passing that dataset as the input DataFrame to Word2Vec.

If you want to "incorporate additional data" why not just union the additional sentences / documents together with your other training set?

@LeoIV (Author) commented Sep 20, 2017

The problem emerges in cases where you have built a whole pipeline. You have a set of documents you want to classify; these documents have some additional features, and they are preprocessed in the pipeline. When the pipeline reaches Word2Vec, you want to vectorize your documents. However, you see poor performance from your word vectors and want to tune them by adding additional documents. You don't want these documents to be part of the whole pipeline, because they cannot pass the previous preprocessing steps.

That was my intention in adding this. It is probably a very rare use case, though; I don't know.
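
One possible workaround for this use case, sketched under the assumption that the extra documents can be tokenized outside the pipeline: fit Word2Vec separately on the combined corpus, then place the fitted model into the pipeline, since a fitted `Word2VecModel` is a `Transformer` and therefore a valid pipeline stage. The DataFrame and stage names below (`pipelineDocs`, `extraDocs`, `tokenizer`, `classifier`, `trainingData`) are hypothetical:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Word2Vec

// `pipelineDocs` and `extraDocs` are assumed DataFrames that both expose a
// tokenized "words" column; `tokenizer` and `classifier` stand in for the
// other (hypothetical) stages of the existing pipeline.
val w2vModel = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(pipelineDocs.union(extraDocs))

// A fitted Word2VecModel is a Transformer, so it can replace the unfitted
// Word2Vec estimator as a pipeline stage; the extra documents never have
// to pass through the earlier preprocessing steps.
val pipeline = new Pipeline().setStages(Array(tokenizer, w2vModel, classifier))
val pipelineModel = pipeline.fit(trainingData)
```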

@AmplabJenkins commented
Can one of the admins verify this patch?
