
[FLINK-2094] [ml] implements Word2Vec for FlinkML #2735

Closed
wants to merge 1 commit

Conversation

kalmanchapman

This PR implements Word2Vec for FlinkML, addressing JIRA issue FLINK-2094.

Word2Vec is a word embedding algorithm that generates vectors
to reflect the contextual and semantic values of words in a text.

You can find more detail about word2vec here:
https://arxiv.org/pdf/1411.2738v4.pdf

This implementation uses an abstracted embedding algorithm,
which I've called a ContextEmbedder (based on the original Word2Vec algorithms),
to allow users to extend embedding to problems outside of words.
Word2Vec is an implementation of the ContextEmbedder.

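For illustration, here is a rough sketch of how this could be used from a FlinkML pipeline; the parameter setters shown in comments are placeholders, not necessarily the final API:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.nlp.Word2Vec

val env = ExecutionEnvironment.getExecutionEnvironment

// each element is one "sentence", i.e. an Iterable[String] of tokens
val sentences: DataSet[Seq[String]] = env.fromElements(
  Seq("flink", "processes", "streams"),
  Seq("flink", "processes", "batches"))

val w2v = Word2Vec()
// w2v.setIterations(10)        // placeholder setter names
// w2v.setMinWordFrequency(1)

w2v.fit(sentences)                      // learn the word vectors
val vectors = w2v.transform(sentences)  // map words to their embeddings
```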
@thvasilo

thvasilo commented Nov 3, 2016

Thank you for your contribution Kalman!

I just took a brief look; this is a big PR, so it will probably take some time to review.

For now a few things that jump to mind:

  • We'll need to add docs for the algorithm, which should be example-heavy. Here's a simple example for another pre-processing algorithm. I see you already have extensive ScalaDocs; we could probably replicate those in the docs.
  • Have you tested it on a relatively large-scale dataset? Ideally in a distributed setting where the input files are on HDFS. That way we test the scalability of the implementation, which is where problems usually arise.
  • Have you compared the output with a reference implementation? My knowledge of word2vec is not very deep, but as far as I understand the output is non-deterministic, so we would need some sort of proxy to evaluate the correctness of the implementation.
  • Finally, I see this introduces a new nlp package. I'm not sure how to treat this (and related algorithms, say TF-IDF), as they are not necessarily NLP-specific: even though they stem from that field, you could treat any sequence of objects as a "sentence" and embed them. I would favor including them as pre-processing steps, and hence inheriting from the Transformer interface, perhaps under a feature pre-processing package.

Regards,
Theodore

EDIT: Do you mind adding the [ml] tag to the PR title? It helps with filtering outstanding FlinkML PRs.

@kalmanchapman kalmanchapman changed the title [FLINK-2094] implements Word2Vec for FlinkML [FLINK-2094] [ml] implements Word2Vec for FlinkML Nov 3, 2016
@kalmanchapman
Author

Hey Theodore,
Thanks for taking a look at my PR!

  • I'll add docs shortly, per the examples you posted.
  • I've tested against datasets in the hundreds-of-megabytes range (using the preprocessed Wikipedia articles available here) in a distributed, HDFS-backed environment. The implementation held up well as the scale of the data increased, although I ran into some frustrating memory issues as I increased the number of iterations.
  • I can show that the generated vectors behave along the lines of the original paper: semantically similar words have high cosine similarity, and difference vectors can be used to form 'analogy' relationships that make sense (see the sketch below). But you're right that the output is non-deterministic, and surveying how it's tested in other libraries is inconclusive. I've included some toy datasets in the integration tests that show good results and exercise these qualities.
  • I know what you mean about the new package. I included it because the feature request was specifically for Word2Vec. But, similar to your suggestion, the class in the nlp package is really just a wrapper around a generic embedding algorithm that can operate on any data that is word-in-sentence-like. The ContextEmbedder class, in the optimization package, is where the actual embedding happens.
    That said, optimization might not be the right home either (although we are optimizing toward some minimum).
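For illustration, the kind of check I have in mind looks roughly like this (plain Scala over an in-memory word-to-vector map, not the PR's API):

```scala
// cosine similarity between two learned vectors
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
}

// "a is to b as c is to ?": the word closest to vec(b) - vec(a) + vec(c),
// e.g. king - man + woman should land near queen
def analogy(vectors: Map[String, Array[Double]],
            a: String, b: String, c: String): Option[String] =
  for {
    va <- vectors.get(a)
    vb <- vectors.get(b)
    vc <- vectors.get(c)
    target = vb.indices.map(i => vb(i) - va(i) + vc(i)).toArray
  } yield vectors
    .filter { case (word, _) => word != a && word != b && word != c }
    .maxBy { case (_, vec) => cosine(target, vec) }
    ._1
```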

Best,
Kalman


@kateri1 kateri1 left a comment


@kalmanchapman, @thvasilo I have started reviewing the current PR and have the following questions:
Do we really need to implement word2vec one more time when the open source community has already implemented it in Java? Why don't we consider a library like deeplearning4j (https://deeplearning4j.org/word2vec)? It already provides an implemented, checked, and tested word2vec with performance optimizations and some additional functionality like t-SNE. The library is also meant to be integrated with Spark (https://deeplearning4j.org/spark); why can't we integrate it with Flink?

* sets the number of global iterations the training set is passed through - essentially looping on
* the whole set, leveraging Flink's iteration operator (Default value: '''10''')
*
* - [[org.apache.flink.ml.nlp.Word2Vec.TargetCount]]

@kateri1 kateri1 Jan 31, 2017


The name of this field is not self-describing; please consider something like MinWordFrequency instead of TargetCount, which is misleading. The same applies to the corresponding get/set methods.

Author


Hi @kateri1, I'll update this to use a self-describing name, roughly as sketched below.
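Something along these lines, following the usual FlinkML parameter pattern (the default value is just a placeholder):

```scala
import org.apache.flink.ml.common.Parameter

// words occurring fewer times than this are dropped from the vocabulary
case object MinWordFrequency extends Parameter[Int] {
  val defaultValue: Option[Int] = Some(5)
}
```

The case object would sit in the Word2Vec companion object, as TargetCount does now, together with matching setMinWordFrequency/getMinWordFrequency methods that delegate to the parameter map.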

* @tparam T Subtype of Iterable[String]
* @return
*/
implicit def learnWordVectors[T <: Iterable[String]] = {

I would like to propose a more typical name like fitWord2Vec instead of learnWordVectors, because it has clearer semantics.

Author


Similarly, I will update this name.

: Unit = {
val resultingParameters = instance.parameters ++ fitParameters

val skipGrams = input

There is no check on the input. What happens if the incoming dataset is empty? In that case training is not required at all. Maybe it would be better to throw an exception notifying the user, or at least to log a warning.
Running the full encoding for such a trivial input would consume resources without producing anything useful.

Author


@kateri1, I am reluctant to test the size of the dataset here, but we can log a warning message in the case that the word2vec vocabulary size is < 1 after initialization. I agree that this should only warrant a notification.

I'm not sure what you mean by "Such a trivial encoding could be resource consuming, but not efficient".


By that I mean that your code will perform many initialization steps for a trivial input of size 0; for example, it will go on to initialize the ContextEmbedder via the method .createInitialWeightsDS(instance.wordVectors, skipGrams) and beyond, which is unnecessary in the trivial case. Please add a check for these trivial situations.
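Roughly what I mean, as a sketch (hypothetical names; I realize DataSet#count() triggers an extra job, which is the cost you mention):

```scala
// hypothetical early exit before the expensive ContextEmbedder initialization
val vocabularySize: Long = skipGrams.count()   // DataSet#count() runs a separate job

if (vocabularySize == 0) {
  log.warn("Word2Vec received an empty input; skipping training.")
} else {
  // createInitialWeightsDS(instance.wordVectors, skipGrams) and the
  // iterative training then only run for non-trivial input
}
```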

val resultingParameters = instance.parameters ++ fitParameters

val skipGrams = input
.flatMap(x =>

Copy-pasted code is used in the learnWordVectors and words2Vecs methods; consider creating a function for this repeated code to simplify potential refactoring.

Author


No problem, I will extract this into a function that can be reused in both methods, roughly as sketched below.
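Roughly along these lines (the exact skip-gram construction may end up differing; the window size and names are illustrative):

```scala
import org.apache.flink.api.scala._

// shared helper: turn each sentence into (word, context word) pairs within a window
def extractSkipGrams[T <: Iterable[String]](
    input: DataSet[T],
    windowSize: Int): DataSet[(String, String)] =
  input.flatMap { sentence =>
    val words = sentence.toIndexedSeq
    for {
      i <- words.indices
      j <- math.max(0, i - windowSize) to math.min(words.size - 1, i + windowSize)
      if i != j
    } yield (words(i), words(j))
  }
```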

* @tparam T subtype of Iterable[String]
* @return
*/
implicit def words2Vecs[T <: Iterable[String]] = {

Please consider a more self-describing name than words2Vecs, e.g. transformWords2Vecs.

Author


👍 will do


learnedVectors
.flatMap(_.fetchVectors)
case None =>

@kateri1 kateri1 Jan 31, 2017


Transformation could be performed multiple times over the same model, so I don't think it is OK to throw an exception every time a word set comes in for encoding just because the model was once trained incorrectly. Maybe we should consider some trivial default value instead of heavyweight exception handling?

Author


Hey @kateri1, I'm not 100% sure what you mean here.
We do check that the word vectors have been initialized via learnWordVectors/fitWord2Vec, and throw an exception if the initialization has not been run. This is similar to the logic used in the MultipleLinearRegression fitter and other methods that require an initial fitting step; the pattern is sketched below.
Let me know if this answers your question.
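In plain form, the pattern is roughly the following (the names follow the surrounding diff; the body is only illustrative):

```scala
// the wordVectors Option is only populated once learnWordVectors/fit has run
def lookupVectors(
    wordVectors: Option[Map[String, Array[Double]]],
    words: Seq[String]): Seq[Array[Double]] =
  wordVectors match {
    case Some(vectors) =>
      words.flatMap(vectors.get)   // encode each known word with its learned vector
    case None =>
      throw new RuntimeException(
        "The Word2Vec model has not been fitted to data; run fit before transform.")
  }
```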


OK, let it be; let's follow the existing approach.


@kateri1 kateri1 left a comment


Update: the integration with DL4J is a separate topic, out of scope for the current commit, and will be discussed first. The current implementation will be built on Flink's data structures.

@kalmanchapman
Author

@kateri1 - I agree that seeking a solution with Flink's data structures is valuable.

I also think that FlinkML is in a unique position to provide streaming-first, iterative implementations of this algorithm. These are still fairly novel, but in theory something along those lines has been implemented in Gensim's word2vec.

Having an initial, offline implementation of word2vec in Flink could serve as a foundation for an online word2vec, which Flink would be uniquely positioned to implement and which would be of great use to anyone in the community looking for a scalable solution to this class of problem.


@kateri1 kateri1 left a comment


I don't see any API that allows the user to evaluate the quality of the trained transformer. Is there at least some API to extract word synonyms or to bulk-test embeddings? That is necessary for debugging the trained transformer.
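For the synonym case, I mean a helper along these lines (hypothetical, not part of the PR):

```scala
// top-k most similar words to `word` by cosine similarity over the learned vectors
def synonyms(vectors: Map[String, Array[Double]],
             word: String, k: Int): Seq[(String, Double)] = {
  def cosine(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum /
      (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))

  vectors.get(word).toSeq.flatMap { wv =>
    vectors.toSeq
      .filterNot { case (other, _) => other == word }
      .map { case (other, ov) => (other, cosine(wv, ov)) }
      .sortBy { case (_, sim) => -sim }
      .take(k)
  }
}
```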


instance.wordVectors match {
case Some(vectors) =>
val skipGrams = input

Why is only the skip-gram model considered, and why was CBOW not included?


And why is only hierarchical softmax supported? What about negative sampling; why was it not included?
Could you please provide results from testing your implementation on some public dataset, perhaps compared against a model from TensorFlow or DL4J?

* @tparam T Subtype of Iterable[String]
* @return
*/
implicit def learnWordVectors[T <: Iterable[String]] = {

What do you propose for text data preprocessing? What should be used in this pipeline for tokenization and stop-word removal?
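For context, I mean a minimal preprocessing step like this (purely illustrative; rawText stands for a DataSet[String] with one line of text per element, and the stop-word list is only an example):

```scala
import org.apache.flink.api.scala._

val stopWords = Set("the", "a", "an", "and", "of", "to", "in")

val sentences: DataSet[Seq[String]] = rawText
  .map(_.toLowerCase.replaceAll("[^a-z\\s]", " "))                 // strip punctuation and digits
  .map(_.split("\\s+").toSeq.filter(w => w.nonEmpty && !stopWords(w)))
  .filter(_.nonEmpty)                                              // drop empty sentences
```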

@zentol
Contributor

zentol commented Feb 28, 2019

Closing since flink-ml is effectively frozen.
