[FLINK-2094] [ml] implements Word2Vec for FlinkML #2735
Conversation
Word2Vec is a word embedding algorithm that generates vectors reflecting the contextual and semantic values of words in a text. This implementation uses an abstracted embedding algorithm - based on the original Word2Vec algorithms - that allows users to extend embedding to problems outside of words.
Thank you for your contribution Kalman! I just took a brief look; this is a big PR, so it will probably take some time to review. For now, a few things that jump to mind:
Regards, EDIT: Do you mind adding the [ml] tag to the PR title? It helps with filtering outstanding FlinkML PRs.
Hey Theodore,
Best,
@kalmanchapman, @thvasilo I have started to perform the review of current PR, and got the following questions:
Do we really need to implement word2vec one more time when the open source community has already implemented it in Java? Why don't we consider a library like deeplearning4j (https://deeplearning4j.org/word2vec)? It has already implemented, checked, and tried w2Vec with performance optimizations and some additional preprocessing like t-SNE and so on. This library is also proposed to be integrated with Spark (https://deeplearning4j.org/spark); why can't we integrate it with Flink?
* sets the number of global iterations the training set is passed through - essentially looping on
* whole set, leveraging flink's iteration operator (Default value: '''10''')
*
* - [[org.apache.flink.ml.nlp.Word2Vec.TargetCount]]
The name of this field is not self-describing; please consider something like MinWordFrequency instead of TargetCount - "target count" is misleading. The same applies to the corresponding get/set methods.
hi @kateri1 - I'll be updating this to use a self-describing name
* @tparam T Subtype of Iterable[String]
* @return
*/
implicit def learnWordVectors[T <: Iterable[String]] = {
I would like to propose a more typical name like fitWord2Vec instead of learnWordVectors, because it has clearer semantics.
Similarly, I will be updating this name.
: Unit = {
  val resultingParameters = instance.parameters ++ fitParameters

  val skipGrams = input
There is no check on the input. What happens if the incoming DataSet is empty? In that case training is not required. It may be better to throw an exception notifying the user about that, or at least to write a warning.
Encoding such a trivial input could be resource-consuming without being useful.
@kateri1 - I am reluctant to test the size of the dataset here, but we can emit a warning message in the case that the word2vec vocabulary size is < 1 after initialization. I agree that this should only warrant a notification.
I'm not sure what you mean by "Encoding such a trivial input could be resource-consuming without being useful."
By that I mean that your code will perform many initialization steps even for a trivial input of size 0 - for example, it will go on to initialize the ContextEmbedder, call .createInitialWeightsDS(instance.wordVectors, skipGrams), and so on - but this is unnecessary in the trivial case. Please add a check for trivial situations.
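The guard being discussed could be as small as the following sketch; the `vocab` and `log` identifiers are assumptions, since the PR's actual field names are not visible here:

```scala
// Sketch of the proposed trivial-input guard (names `vocab`, `log` assumed).
// DataSet.count() triggers a Flink job, which is acceptable for a one-off check.
val vocabSize = vocab.count()

if (vocabSize < 1) {
  // warn rather than throw, per the discussion above
  log.warn("Word2Vec: vocabulary is empty after initialization; skipping training.")
} else {
  // ... proceed to ContextEmbedder.createInitialWeightsDS(...) and training
}
```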
val resultingParameters = instance.parameters ++ fitParameters

val skipGrams = input
  .flatMap(x =>
Copy-pasted code is used in the methods learnWordVectors and words2Vecs; consider creating a function for this repeated code to simplify potential refactoring.
No problem - I will be including this as a function that can be reused in both methods.
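One possible shape for the shared helper is sketched below. This is a hypothetical extraction: the PR's actual skip-gram representation, parameter plumbing, and method signature may differ.

```scala
import org.apache.flink.api.scala._

// Hypothetical shared helper: turn each sentence into (center, context)
// skip-gram pairs within a window, so both learnWordVectors and
// words2Vecs can call it instead of duplicating the flatMap.
private def toSkipGrams[T <: Iterable[String]](
    input: DataSet[T],
    windowSize: Int): DataSet[(String, String)] = {
  input.flatMap { sentence =>
    val words = sentence.toIndexedSeq
    for {
      (center, i) <- words.zipWithIndex
      j <- math.max(0, i - windowSize) until math.min(words.length, i + windowSize + 1)
      if j != i
    } yield (center, words(j))
  }
}
```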
* @tparam T subtype of Iterable[String]
* @return
*/
implicit def words2Vecs[T <: Iterable[String]] = {
Please consider a more self-describing name than words2Vecs, e.g. transformWords2Vecs.
👍 will do
learnedVectors
  .flatMap(_.fetchVectors)
case None =>
Transformation can be performed multiple times over the same model, so I don't think it's OK to throw an exception on every incoming word set submitted for encoding just because the model was once trained incorrectly. Maybe we should consider some trivial default value instead of heavy exception processing?
Hey @kateri1 - I'm not 100% sure what you mean here. We do check that the word vectors have been initialized via learnWordVectors/fitWord2Vec, and throw an exception if that initialization has not been run. This is similar to the logic used in the MultipleLinearRegression fitter and other methods that require an initial fitting step. Let me know if this answers your question.
OK, let it be; let's follow the existing approach.
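For reference, the fitted-model check under discussion follows the usual FlinkML pattern of matching on the optional model state; the sketch below is illustrative, and `transformWithVectors` is a hypothetical helper name:

```scala
// Sketch of the agreed pattern: refuse to transform with an unfitted model.
instance.wordVectors match {
  case Some(vectors) =>
    // model is fitted: encode the incoming words with the learned vectors
    transformWithVectors(input, vectors) // hypothetical helper
  case None =>
    throw new RuntimeException(
      "The Word2Vec model has not been fitted to data. " +
        "It has to be trained before words can be transformed.")
}
```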
Update: the integration with DL4J is a separate topic, out of the scope of the current commit. Integration with DL4J will be discussed first. The current implementation will be built on Flink's data structures.
@kateri1 - I agree that seeking a solution built on Flink's data structures is valuable. I also think that FlinkML is in a unique position to deliver streaming-first, iterative implementations of this algorithm. These are fairly novel on the web, though in theory they have been implemented in Gensim's word2vec. An initial, offline implementation of word2vec in Flink could serve as a foundation for an online word2vec - something Flink is uniquely positioned to implement, and of great use to the community looking for a scalable solution to this class of problem.
I don't see any API that allows the user to evaluate the quality of the trained transformer - at minimum, some API to extract word synonyms or to bulk-test embeddings? That is necessary for debugging a trained transformer.
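A debugging helper of the kind requested could be as simple as ranking the vocabulary by cosine similarity to a query word. The sketch below is hypothetical (no such method exists in the PR as reviewed) and assumes the learned vectors have been collected locally into a Map:

```scala
// Hypothetical synonym lookup for debugging a trained model:
// rank vocabulary words by cosine similarity to a query word.
def synonyms(word: String,
             vectors: Map[String, Array[Double]],
             topK: Int = 10): Seq[(String, Double)] = {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
  }
  vectors.get(word) match {
    case Some(v) =>
      (vectors - word).toSeq
        .map { case (w, u) => (w, cosine(v, u)) }
        .sortBy(-_._2)
        .take(topK)
    case None => Seq.empty
  }
}
```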
instance.wordVectors match {
  case Some(vectors) =>
    val skipGrams = input
Why is only the skip-gram model considered? Why was CBOW not included?
And why is only hierarchical softmax supported? What about negative sampling - why was it not supported?
Could you please provide the results of testing your implementation on some public dataset? Maybe compare it with a model from TensorFlow or DL4J?
* @tparam T Subtype of Iterable[String]
* @return
*/
implicit def learnWordVectors[T <: Iterable[String]] = {
What do you propose for text data preprocessing? What should be used in this pipeline for tokenization and stop-word removal?
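One possible preprocessing step ahead of fitting is sketched below. This is an assumption about a reasonable pipeline, not part of the PR; the stop-word list is a tiny illustrative sample:

```scala
// Hypothetical preprocessing: lowercase, split on non-letter runs,
// then drop stop words and single-character tokens.
val stopWords = Set("the", "a", "an", "and", "or", "of", "to", "in", "is")

def tokenize(line: String): Seq[String] =
  line.toLowerCase
    .split("[^\\p{L}]+")
    .filter(w => w.length > 1 && !stopWords.contains(w))
    .toSeq
```

In practice one might instead plug in a dedicated tokenizer (e.g. from OpenNLP or Lucene) before handing sentences to the fitter.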
Closing since flink-ml is effectively frozen.
This PR implements Word2Vec for FlinkML, addressing Jira issue FLINK-2094.
Word2Vec is a word embedding algorithm that generates vectors
reflecting the contextual and semantic values of words in a text.
Find out more detail about word2vec here:
https://arxiv.org/pdf/1411.2738v4.pdf
This implementation uses an abstracted embedding algorithm,
which I've called a ContextEmbedder - based on the original Word2Vec algorithms -
to allow users to extend embedding to problems outside of words.
Word2Vec is an implementation of the ContextEmbedder.
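End-to-end usage might look like the following sketch. All API names here (`Word2Vec()`, `setIterations`, `setMinWordFrequency`, `fit`, `transform`) are assumptions pieced together from the review discussion and FlinkML's Estimator/Transformer conventions; the merged PR may differ:

```scala
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// each element is one "sentence": an Iterable[String] of tokens
val corpus: DataSet[Seq[String]] = env.fromElements(
  Seq("flink", "processes", "streams"),
  Seq("word2vec", "embeds", "words"))

val w2v = Word2Vec()           // hypothetical constructor
  .setIterations(10)           // global passes over the training set
  .setMinWordFrequency(1)      // parameter name proposed during review

w2v.fit(corpus)                // learn the word vectors
val embedded = w2v.transform(corpus) // map words to their learned vectors
```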