
[FLINK-2094] [ml] implements Word2Vec for FlinkML #2735

Closed
wants to merge 1 commit

Conversation

kalmanchapman

This PR implements Word2Vec for FlinkML, addressing JIRA issue FLINK-2094.

Word2Vec is a word embedding algorithm that generates vectors
to reflect the contextual and semantic values of words in a text.

You can find more detail about word2vec here:
https://arxiv.org/pdf/1411.2738v4.pdf

This implementation uses an abstracted embedding algorithm,
which I've called a ContextEmbedder (based on the original Word2Vec algorithms),
to allow users to extend embedding to problems outside of words.
Word2Vec is an implementation of the ContextEmbedder.

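For illustration, here is a rough sketch of how this could be used from a FlinkML pipeline; the parameter setters shown in comments are placeholders, not necessarily the final API:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.nlp.Word2Vec

val env = ExecutionEnvironment.getExecutionEnvironment

// each element is one "sentence", i.e. an Iterable[String] of tokens
val sentences: DataSet[Seq[String]] = env.fromElements(
  Seq("flink", "processes", "streams"),
  Seq("flink", "processes", "batches"))

val w2v = Word2Vec()
// w2v.setIterations(10)        // placeholder setter names
// w2v.setMinWordFrequency(1)

w2v.fit(sentences)                      // learn the word vectors
val vectors = w2v.transform(sentences)  // map words to their embeddings
```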
@thvasilo

thvasilo commented Nov 3, 2016

Thank you for your contribution Kalman!

I just took a brief look; this is a big PR, so it will probably take some time to review.

For now a few things that jump to mind:

  • We'll need to add docs for the algorithm, which should be example-heavy. Here's a simple example for another pre-processing algorithm. I see you already have extensive ScalaDocs; we could probably replicate those in the docs.
  • Have you tested it on a relatively large-scale dataset? Ideally in a distributed setting where the input files are on HDFS. That way we test the scalability of the implementation, which is where problems usually arise.
  • Have you compared the output with a reference implementation? My knowledge of word2vec is not very deep, but as far as I understand the output is non-deterministic, so we would need some sort of proxy to evaluate the correctness of the implementation.
  • Finally, I see this introduces a new nlp package. I'm not sure how to treat this (and related algorithms, say TF-IDF), as they are not necessarily NLP-specific: even though they stem from that field, you could treat any sequence of objects as a "sentence" and embed them. I would favor including them as pre-processing steps, and hence inheriting from the Transformer interface, perhaps under a feature pre-processing package.

Regards,
Theodore

EDIT: Do you mind adding the [ml] tag to the PR title? It helps with filtering outstanding FlinkML PRs.

@kalmanchapman kalmanchapman changed the title [FLINK-2094] implements Word2Vec for FlinkML [FLINK-2094] [ml] implements Word2Vec for FlinkML Nov 3, 2016
@kalmanchapman
Author

Hey Theodore,
Thanks for taking a look at my PR!

  • I'll add docs shortly, per the examples you posted.
  • I've tested against datasets in the hundreds-of-megabytes range (using the preprocessed Wikipedia articles available here) in a distributed, HDFS-backed environment. The implementation held up well as the scale of the data increased, although I ran into some frustrating memory issues as I increased the number of iterations.
  • I can show that the generated vectors behave along the lines of the original paper: semantically similar words have high cosine similarity, and difference vectors can be used to form 'analogy' relationships that make sense (see the sketch below). But you're right that the output is non-deterministic, and surveying how it's tested in other libraries is inconclusive. I've included some toy datasets in the integration tests that show good results and exercise these qualities.
  • I know what you mean about the new package. I included it because the feature request was specifically for Word2Vec. But, similar to your suggestion, the class in the nlp package is really just a wrapper around a generic embedding algorithm that can operate on any data that is word-in-sentence-like. The ContextEmbedder class, in the optimization package, is where the actual embedding happens.
    That said, optimization might not be the right home either (although we are optimizing toward some minimum).
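For illustration, the kind of check I have in mind looks roughly like this (plain Scala over an in-memory word-to-vector map, not the PR's API):

```scala
// cosine similarity between two learned vectors
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
}

// "a is to b as c is to ?": the word closest to vec(b) - vec(a) + vec(c),
// e.g. king - man + woman should land near queen
def analogy(vectors: Map[String, Array[Double]],
            a: String, b: String, c: String): Option[String] =
  for {
    va <- vectors.get(a)
    vb <- vectors.get(b)
    vc <- vectors.get(c)
    target = vb.indices.map(i => vb(i) - va(i) + vc(i)).toArray
  } yield vectors
    .filter { case (word, _) => word != a && word != b && word != c }
    .maxBy { case (_, vec) => cosine(target, vec) }
    ._1
```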

Best,
Kalman


@kateri1 kateri1 left a comment


@kalmanchapman, @thvasilo I have started reviewing the current PR and have the following questions:
Do we really need to implement word2vec one more time when the open source community has already implemented it in Java? Why don't we consider a library like deeplearning4j (https://deeplearning4j.org/word2vec)? It already provides an implemented, checked, and tested word2vec with performance optimizations and some additional functionality like t-SNE. The library is also meant to be integrated with Spark (https://deeplearning4j.org/spark); why can't we integrate it with Flink?

* sets the number of global iterations the training set is passed through - essentially looping on
* the whole set, leveraging Flink's iteration operator (Default value: '''10''')
*
* - [[org.apache.flink.ml.nlp.Word2Vec.TargetCount]]

@kateri1 kateri1 Jan 31, 2017


The name of this field is not self-describing; please consider something like MinWordFrequency instead of TargetCount, which is misleading. The same applies to the corresponding get/set methods.

Author


Hi @kateri1, I'll update this to use a self-describing name, roughly as sketched below.
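Something along these lines, following the usual FlinkML parameter pattern (the default value is just a placeholder):

```scala
import org.apache.flink.ml.common.Parameter

// words occurring fewer times than this are dropped from the vocabulary
case object MinWordFrequency extends Parameter[Int] {
  val defaultValue: Option[Int] = Some(5)
}
```

The case object would sit in the Word2Vec companion object, as TargetCount does now, together with matching setMinWordFrequency/getMinWordFrequency methods that delegate to the parameter map.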

* @tparam T Subtype of Iterable[String]
* @return
*/
implicit def learnWordVectors[T <: Iterable[String]] = {

I would like to propose a more typical name like fitWord2Vec instead of learnWordVectors, because it has clearer semantics.

Author


Similarly, I will update this name.

: Unit = {
val resultingParameters = instance.parameters ++ fitParameters

val skipGrams = input

There is no check on the input. What happens if the incoming dataset is empty? In that case training is not required at all. Maybe it would be better to throw an exception notifying the user, or at least to log a warning.
Running the full encoding for such a trivial input would consume resources without producing anything useful.

Author


@kateri1, I am reluctant to test the size of the dataset here, but we can log a warning message in the case that the word2vec vocabulary size is < 1 after initialization. I agree that this should only warrant a notification.

I'm not sure what you mean by "Such a trivial encoding could be resource consuming, but not efficient".


By that I mean that your code will perform many initialization steps for a trivial input of size 0; for example, it will go on to initialize the ContextEmbedder via the method .createInitialWeightsDS(instance.wordVectors, skipGrams) and beyond, which is unnecessary in the trivial case. Please add a check for these trivial situations.
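Roughly what I mean, as a sketch (hypothetical names; I realize DataSet#count() triggers an extra job, which is the cost you mention):

```scala
// hypothetical early exit before the expensive ContextEmbedder initialization
val vocabularySize: Long = skipGrams.count()   // DataSet#count() runs a separate job

if (vocabularySize == 0) {
  log.warn("Word2Vec received an empty input; skipping training.")
} else {
  // createInitialWeightsDS(instance.wordVectors, skipGrams) and the
  // iterative training then only run for non-trivial input
}
```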

val resultingParameters = instance.parameters ++ fitParameters

val skipGrams = input
.flatMap(x =>

Copy-pasted code is used in the learnWordVectors and words2Vecs methods; consider creating a function for this repeated code to simplify potential refactoring.

Author


No problem, I will extract this into a function that can be reused in both methods, roughly as sketched below.
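Roughly along these lines (the exact skip-gram construction may end up differing; the window size and names are illustrative):

```scala
import org.apache.flink.api.scala._

// shared helper: turn each sentence into (word, context word) pairs within a window
def extractSkipGrams[T <: Iterable[String]](
    input: DataSet[T],
    windowSize: Int): DataSet[(String, String)] =
  input.flatMap { sentence =>
    val words = sentence.toIndexedSeq
    for {
      i <- words.indices
      j <- math.max(0, i - windowSize) to math.min(words.size - 1, i + windowSize)
      if i != j
    } yield (words(i), words(j))
  }
```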

* @tparam T subtype of Iterable[String]
* @return
*/
implicit def words2Vecs[T <: Iterable[String]] = {

Please consider a more self-describing name than words2Vecs, e.g. transformWords2Vecs.

Author


👍 will do


learnedVectors
.flatMap(_.fetchVectors)
case None =>

@kateri1 kateri1 Jan 31, 2017


Transformation could be performed multiple times over the same model, so I don't think it is OK to throw an exception every time a word set comes in for encoding just because the model was once trained incorrectly. Maybe we should consider some trivial default value instead of heavyweight exception handling?

Author


Hey @kateri1, I'm not 100% sure what you mean here.
We do check that the word vectors have been initialized via learnWordVectors/fitWord2Vec, and throw an exception if the initialization has not been run. This is similar to the logic used in the MultipleLinearRegression fitter and other methods that require an initial fitting step; the pattern is sketched below.
Let me know if this answers your question.
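In plain form, the pattern is roughly the following (the names follow the surrounding diff; the body is only illustrative):

```scala
// the wordVectors Option is only populated once learnWordVectors/fit has run
def lookupVectors(
    wordVectors: Option[Map[String, Array[Double]]],
    words: Seq[String]): Seq[Array[Double]] =
  wordVectors match {
    case Some(vectors) =>
      words.flatMap(vectors.get)   // encode each known word with its learned vector
    case None =>
      throw new RuntimeException(
        "The Word2Vec model has not been fitted to data; run fit before transform.")
  }
```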


OK, let it be; let's follow the existing approach.


@kateri1 kateri1 left a comment


Update: the integration with DL4J is a separate topic, out of scope for the current commit, and will be discussed first. The current implementation will be built on Flink's data structures.

@kalmanchapman
Author

@kateri1 - I agree that seeking a solution with Flink's data structures is valuable.

I also think that FlinkML is in a unique position to provide streaming-first, iterative implementations of this algorithm. These are still fairly novel, but in theory something along those lines has been implemented in Gensim's word2vec.

Having an initial, offline implementation of word2vec in Flink could serve as a foundation for an online word2vec, which Flink would be uniquely positioned to implement and which would be of great use to anyone in the community looking for a scalable solution to this class of problem.


@kateri1 kateri1 left a comment


I don't see any API that allows the user to evaluate the quality of the trained transformer. Is there at least some API to extract word synonyms or to bulk-test embeddings? That is necessary for debugging the trained transformer.
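For the synonym case, I mean a helper along these lines (hypothetical, not part of the PR):

```scala
// top-k most similar words to `word` by cosine similarity over the learned vectors
def synonyms(vectors: Map[String, Array[Double]],
             word: String, k: Int): Seq[(String, Double)] = {
  def cosine(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum /
      (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))

  vectors.get(word).toSeq.flatMap { wv =>
    vectors.toSeq
      .filterNot { case (other, _) => other == word }
      .map { case (other, ov) => (other, cosine(wv, ov)) }
      .sortBy { case (_, sim) => -sim }
      .take(k)
  }
}
```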


instance.wordVectors match {
case Some(vectors) =>
val skipGrams = input

Why is only the skip-gram model considered, and why was CBOW not included?


And why is only hierarchical softmax supported? What about negative sampling; why was it not included?
Could you please provide results from testing your implementation on some public dataset, perhaps compared against a model from TensorFlow or DL4J?

* @tparam T Subtype of Iterable[String]
* @return
*/
implicit def learnWordVectors[T <: Iterable[String]] = {

What do you propose for text data preprocessing? What should be used in this pipeline for tokenization and stop-word removal?
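For context, I mean a minimal preprocessing step like this (purely illustrative; rawText stands for a DataSet[String] with one line of text per element, and the stop-word list is only an example):

```scala
import org.apache.flink.api.scala._

val stopWords = Set("the", "a", "an", "and", "of", "to", "in")

val sentences: DataSet[Seq[String]] = rawText
  .map(_.toLowerCase.replaceAll("[^a-z\\s]", " "))                 // strip punctuation and digits
  .map(_.split("\\s+").toSeq.filter(w => w.nonEmpty && !stopWords(w)))
  .filter(_.nonEmpty)                                              // drop empty sentences
```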

@zentol
Contributor

zentol commented Feb 28, 2019

Closing since flink-ml is effectively frozen.
