
added support word2vec training with additional data #18636

Closed
LeoIV wants to merge 2 commits

Conversation

@LeoIV commented Jul 14, 2017

What changes were proposed in this pull request?

Word2Vec is trained unsupervised: the more data it is trained on, the more accurate the resulting word vectors tend to be. Hence, Word2Vec should support being fit on additional data.

How was this patch tested?

Additional unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@MLnick (Contributor) commented Sep 19, 2017

Hi there - I don't see the value of adding a few words in a String array to the training. You're effectively adding a second (non-distributed, and therefore limited in size) corpus to the training.

Word2Vec is aimed at training on a large corpus of text. If you want more accuracy, train on a larger training set.

Could you close this PR please?

@LeoIV (Author) commented Sep 19, 2017

At the moment, it is not possible to improve a model's accuracy by incorporating additional data. I think this should be supported, since it can increase a classifier's performance significantly. With this implementation, I was able to train unsupervised on a Wikipedia dump, which is pretty large. However, distributing the additional set is a good point.

@MLnick (Contributor) commented Sep 19, 2017

I'm sorry, but I still don't understand the intention here. You can already train on a Wikipedia dump (or any other dataset) by passing that dataset as the input DataFrame to Word2Vec.

If you want to "incorporate additional data" why not just union the additional sentences / documents together with your other training set?

@LeoIV (Author) commented Sep 20, 2017

The problem emerges in cases where you have built a whole pipeline. You have a set of documents you want to classify; these documents have some additional features, and they are preprocessed in the pipeline. When the pipeline reaches Word2Vec, you want to vectorize your documents. However, you see poor performance from your word vectors and want to tune them by adding additional documents. You don't want these documents to be part of the whole pipeline, because they cannot pass the previous preprocessing steps.

That was my intention in adding this. It is probably a very rare use case, though; I don't know.
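
One possible workaround for this use case, sketched under the assumption that the extra documents can be tokenized outside the pipeline: fit Word2Vec separately on the combined corpus, then place the fitted model into the pipeline, since a fitted `Word2VecModel` is a `Transformer` and therefore a valid pipeline stage. The DataFrame and stage names below (`pipelineDocs`, `extraDocs`, `tokenizer`, `classifier`, `trainingData`) are hypothetical:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Word2Vec

// `pipelineDocs` and `extraDocs` are assumed DataFrames that both expose a
// tokenized "words" column; `tokenizer` and `classifier` stand in for the
// other (hypothetical) stages of the existing pipeline.
val w2vModel = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(pipelineDocs.union(extraDocs))

// A fitted Word2VecModel is a Transformer, so it can replace the unfitted
// Word2Vec estimator as a pipeline stage; the extra documents never have
// to pass through the earlier preprocessing steps.
val pipeline = new Pipeline().setStages(Array(tokenizer, w2vModel, classifier))
val pipelineModel = pipeline.fit(trainingData)
```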

@AmplabJenkins commented
Can one of the admins verify this patch?
