[SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large #26722
Conversation
Test build #114651 has finished for PR 26722 at commit
Test build #114652 has finished for PR 26722 at commit
retest this please.
Test build #114655 has finished for PR 26722 at commit
Hm, why is it valid to normalize these here? Is there any reference that helps explain why it's OK? The magnitude of the embeddings generally does matter.
Hm, it seems the original paper did not say much about how distributed training should be done. The original multi-threading implementation simply overwrites the training weights. Currently we simply sum up the weights trained in each partition; I think that is the issue. For the final embedding, since we only care about cosine similarity later, the magnitude of the word vectors should not matter. In the training code, the magnitude of a training vector is involved in computing the gradient, but changing the magnitude should not change the direction of the gradient, only its magnitude. Alternatively, I tried dividing the trained weights when summing them up, like:
val synAgg = partial.reduceByKey { case (v1, v2) =>
  blas.saxpy(vectorSize, 1.0f / numPartitions, v2, 1, v1, 1)
  v1
}.collect()
There are still big values remaining, although no infinity.
Hm, I don't think it's only cosine similarity that matters; these are often used as general-purpose embeddings for neural nets or something. Changing the norms individually changes their relative magnitudes. It doesn't mean the answer is wrong per se, it's just not clear that losing that info isn't costing something - in training too. I do think it's more valid to divide everything through by a constant, which could arbitrarily be the number of partitions, or of words. I'd love to find a more reliable reference for that kind of thing. I haven't looked at the impl in a long time, but I'm also trying to figure out why this happens. What in the code makes it not scale with the size of the input? Conceptually it should not.
The original paper/implementation does not cover a distributed training case like this. The multi-threading implementation does not seem to consider it either: each thread trains on part of the data and simply writes to the same network weights. We instead add up the trained weights from each partition, which I think is not correct either. Like normalization, it can be seen as a change to the magnitude of the embeddings compared with single-threaded training. I agree that dividing seems more valid, though there are still big values in the trained embedding. I will look for a reference for distributed Word2Vec training.
I see, so are you saying the weights are effectively N times larger with N partitions than with 1? That might be worth a sense check. If so, and the implementation is inadvertently scaling weights by N in this case, then dividing by the number of partitions before aggregation sounds good.
The implementations/papers, including the original, "Parallelizing Word2Vec in Shared and Distributed Memory", and word2vec++, all seem to follow a Hogwild-style approach that simply ignores conflicts when updating the weights. We train each partition separately, so I don't think we can do the same. For now I do not see a good reference for how to aggregate updated weights from partitions. Let me do an additional check to see how the weights scale with N.
I ran the test shown in the description, with the number of partitions increasing from 1:
numPartitions = 1:
numPartitions = 5:
...
numPartitions = 20:
numPartitions = 25, infinity vectors show up:
BTW what are the exponents on these figures -- can you print more, or print their magnitude?
This is the magnitude:
If divided by the number of partitions when aggregating weight vectors:
Test build #114687 has finished for PR 26722 at commit
Test build #114688 has finished for PR 26722 at commit
Hm, that's a crazy result. Something is wrong, to be sure. I can't imagine why just 5 partitions would make such a difference. I don't know word2vec well, but it looks kind of like it adds the new vectors per word instead of taking one of them arbitrarily (a la Hogwild). But I might be misreading it. And even if it were adding them, you'd imagine that 5 partitions might make the result 5x larger than usual, not 10^17. Do you see info log output showing what alpha is? I'm curious about what happens in the line:
and also how this might become large:
It kind of feels like some multiplier which is supposed to be in [0,1] is becoming significantly negative, and that makes it grow out of control.
That's a good point! I checked the alpha value during fitting with 5 partitions. At the end of fitting, alpha becomes a very small value like 3.131027051017643E-6. I think the current alpha value is also not computed correctly. Originally, alpha is updated as in https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L397:
In Spark's Word2Vec, it is updated as:
Here, by multiplying by numPartitions, we may update alpha to a significantly negative value.
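For readers following along, here is a rough side-by-side of the two learning-rate schedules being compared. The formulas are paraphrased from memory of the C reference implementation and Spark's mllib code, not quoted verbatim, so treat the exact names and shapes as assumptions:

```scala
// Sketch only: paraphrased learning-rate schedules, not verbatim from either codebase.
object AlphaSchedules {
  // Original C word2vec: decays over the whole run, with the iteration count
  // folded into the denominator, and is floored at 1e-4 of the starting rate.
  def cAlpha(startingAlpha: Double, wordCountActual: Long,
             numIterations: Int, trainWordsCount: Long): Double = {
    val a = startingAlpha *
      (1.0 - wordCountActual.toDouble / (numIterations.toLong * trainWordsCount + 1))
    math.max(a, startingAlpha * 0.0001)
  }

  // Spark's schedule (roughly): the extra numPartitions factor makes the
  // subtracted term grow much faster, so alpha hits its floor very early.
  def sparkAlpha(learningRate: Double, numPartitions: Int,
                 wordCount: Long, trainWordsCount: Long): Double = {
    val a = learningRate *
      (1.0 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
    math.max(a, learningRate * 0.0001)
  }
}
```

Note that the floor is why the observed alpha is merely tiny rather than negative, as the next comment points out.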
Hm, that value isn't negative though, just very small. The next line, perhaps accidentally, would handle negative values. But yes, the update rule looks different. I'm kind of wondering about this line in the C code: I don't quite see its equivalent here. What I don't really understand is why it would be 'triggered' by the number of partitions rather than by iterations here, or why it doesn't seem to show up otherwise. It's possible that it's really iterations driving this, and numPartitions isn't helping. Hm, what about sticking in a normalization by that factor above as a hack to see what happens?
That part is for the CBOW architecture in Word2Vec. As we support only Skip-gram, I think the update rule is different there. For the Skip-gram training beginning at https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L495, I think we do the same thing when updating weights. It does not look like an issue to me.
Specifically, I looked into the moment when aggregating the weight vectors first produces an Infinity value, at:
val synAgg = partial.reduceByKey { case (v1, v2) =>
  blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1)
  v1
}.collect() When it first produces any Infinity values among weights: alpha: 0.025 v1 (before blas.saxpy): 3.7545144E37, 9.609645E37, -8.751438E37, -1.6193201E38, 1.1736736E38, 3.2835947E38, 8.1553495E37, -1.6691325E38, -7.576555E37, -5.648573E37, -1.9869322E37, -1.6807897E37, -5.7600233E37, -6.2470694E37, -1.4104866E38, -1.4680707E38, -3.1782221E37, 1.8944205E38, 1.5494958E38, -2.1342228E38, -6.157935E37, 3.9677284E37, 1.1558841E37, 4.331978E37, -8.0626774E36, -5.8198486E36, 8.500153E37, -5.662092E36, -4.009228E37, -1.9031902E38, -2.4923412E38, 7.174913E37, 5.1235664E37, -5.5351527E37, 5.5978614E37, -1.8525286E38, 1.066509E37, 1.5285991E37, -2.0523789E38, 8.57768E37, -9.894086E37, -1.8595572E38, 2.0450045E37, 7.084625E37, 1.7256363E38, 1.7746238E37, 1.4823289E37, 1.2560103E38, -1.910456E38, -5.6934737E37, 3.9446576E37, 1.9320926E38, 5.9035325E37, -1.2072379E38, 7.4097296E37, -8.0367785E37, 1.9674684E38, 5.9296644E37, -1.8741689E38, -1.4480887E38, -2.933689E37, -6.161533E37, 1.02056735E36, 2.3885107E38 v1 (after blas.saxpy): 4.022694E37, 1.0296049E38, -9.376541E37, -1.734986E38, 1.2575074E38, Infinity, 8.737874E37, -1.7883562E38, -8.1177373E37, -6.0520423E37, -2.128856E37, -1.8008461E37, -6.1714535E37, -6.6932885E37, -1.5112356E38, -1.5729329E38, -3.405238E37, 2.0297362E38, 1.660174E38, -2.2866673E38, -6.597787E37, 4.2511375E37, 1.2384472E37, 4.641405E37, -8.638583E36, -6.235552E36, 9.107307E37, -6.066527E36, -4.2956014E37, -2.0391324E38, -2.6703656E38, 7.687407E37, 5.4895356E37, -5.930521E37, 5.997709E37, -1.984852E38, 1.1426883E37, 1.6377848E37, -2.1989773E38, 9.190372E37, -1.0600806E38, -1.9923827E38, 2.1910763E37, 7.5906695E37, 1.848896E38, 1.9013826E37, 1.5882095E37, 1.3457253E38, -2.046917E38, -6.10015E37, 4.2264189E37, 2.0700992E38, 6.3252134E37, -1.2934692E38, 7.938996E37, -8.610834E37, 2.1080017E38, 6.353212E37, -2.008038E38, -1.5515236E38, -3.1432383E37, -6.6016424E37, 1.093465E36, 2.5591186E38 v2: 2.681796E36, 6.8640314E36, -6.251026E36, -1.1566574E37, 8.3833835E36, 2.3454244E37, 5.825249E36, -1.1922373E37, -5.411826E36, -4.034695E36, -1.419237E36, -1.2005642E36, -4.114303E36, -4.4621932E36, -1.0074906E37, -1.048622E37, -2.2701585E36, 1.3531576E37, 1.1067827E37, -1.524445E37, -4.3985254E36, 2.834092E36, 8.256315E35, 3.09427E36, -5.759055E35, -4.1570347E35, 6.071537E36, -4.044352E35, -2.8637343E36, -1.3594217E37, -1.7802438E37, 5.1249374E36, 3.65969E36, -3.9536805E36, 3.9984723E36, -1.3232347E37, 7.6179225E35, 1.0918563E36, -1.465985E37, 6.126915E36, -7.067205E36, -1.3282551E37, 1.4607173E36, 5.0604463E36, 1.2325974E37, 1.2675882E36, 1.05880635E36, 8.971503E36, -1.3646113E37, -4.066767E36, 2.817613E36, 1.3800664E37, 4.2168083E36, -8.623128 There are extremely positive/negative values that exceeds the range of float. So infinity/-infinity weights produced there. |
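As a minimal standalone check of that failure mode (a sketch assuming netlib-java's BLAS on the classpath, as Spark 2.x's mllib uses), adding two values of the sizes quoted above already exceeds Float.MaxValue:

```scala
import com.github.fommil.netlib.BLAS.{getInstance => blas}

// Two single-element "vectors" with magnitudes like those in the dump above.
val v1 = Array(3.2835947e38f)
val v2 = Array(2.3454244e37f)

// Same call shape as the aggregation: v1 := 1.0f * v2 + v1
blas.saxpy(1, 1.0f, v2, 1, v1, 1)

println(v1(0)) // Infinity: 3.28e38 + 2.35e37 > Float.MaxValue (~3.40e38)
```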
Both the number of partitions and the number of iterations affect this. Reducing iterations to 10 (from 30), with 5 partitions, you won't see the infinite magnitudes that appear in the 30-iteration case.
The alpha value above looks normal when the infinity values appear. For now it looks like we keep adding up weights from the partitions, iteration after iteration, until they are too large (positive or negative) to hold in a float. Thus increasing partitions can cause it, and increasing iterations can too.
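A back-of-the-envelope model of that compounding (an illustration of the mechanism described above, not code from Spark): if a word appears in N partitions and the per-partition copies of its vector are summed rather than averaged, the base weight gets scaled by roughly N every iteration, so magnitudes grow like N^iterations and overflow a float quickly.

```scala
// Toy model only: assumes each iteration sums N partition copies of a
// similar base vector, so the base component scales by ~N per iteration.
val numPartitions = 25
val numIterations = 30
var w = 0.01f
for (_ <- 1 to numIterations) {
  w = w * numPartitions
}
println(w) // Infinity: 0.01 * 25^30 far exceeds Float.MaxValue
```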
Ah right, disregard my previous comment. Am I right that the original implementation, being single-threaded, computes just one updated vector per word per iteration, while the Spark implementation comes up with several, because the word may appear in multiple partitions? Then adding them doesn't make sense; it would make sense to average them. That's not quite the same as dividing by the number of partitions, as the word may not appear in all partitions. You could accumulate a simple count in reduceByKey and then divide the sum through by the count?
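A sketch of that count-and-divide aggregation (an illustration of the suggestion, reusing the `partial`, `vectorSize`, and `blas` names from the snippets above; not the exact diff that was merged):

```scala
val synAgg = partial
  .mapValues(v => (v, 1))
  .reduceByKey { case ((v1, c1), (v2, c2)) =>
    // v1 := v2 + v1, while counting how many partition copies contributed
    blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1)
    (v1, c1 + c2)
  }
  .mapValues { case (v, count) =>
    // Average instead of sum: scale by 1/count of the contributing copies
    if (count > 1) blas.sscal(vectorSize, 1.0f / count, v, 1)
    v
  }
  .collect()
```

This keeps the magnitudes comparable to a single-partition run without rescaling words that only ever appeared in one partition.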
Test build #114746 has finished for PR 26722 at commit
Test build #114748 has finished for PR 26722 at commit
Test build #114801 has finished for PR 26722 at commit
PS does this also solve your problem? This change sounds OK to me.
Yeah, I reran the test manually and it does not produce infinity vectors.
@srowen do you have more comments on this? If not, I will go ahead and merge it.
Looks OK for 3.0 if tests pass and it solves your problem.
It'd be great to get a test for it, but it sounds like it can only happen with huge data? Or would enough iterations do it?
Let me check whether it is also reproducible with a large number of partitions plus iterations on the existing mllib test. If not, I think it cannot be reproduced, or is at least hard to reproduce, on a small dataset.
OK. I tried increasing the number of partitions and iterations of the existing test, but could not reproduce the infinity weights. The concerns with adding a reproducible test case are that a big dataset is needed, and also the fitting time.
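For context, the kind of check such a regression test would make looks roughly like the following sketch against the mllib API. The `sentences: RDD[Seq[String]]` input and the parameter values are assumptions, and as noted above this does not actually trigger the bug on the small dataset used by the existing suite:

```scala
import org.apache.spark.mllib.feature.Word2Vec

val model = new Word2Vec()
  .setVectorSize(10)
  .setNumPartitions(10)
  .setNumIterations(50)
  .setSeed(42L)
  .fit(sentences) // sentences: RDD[Seq[String]], e.g. from the existing suite

// Assert that every component of every word vector is finite.
assert(model.getVectors.values.forall(_.forall(x => !x.isNaN && !x.isInfinity)))
```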
Referenced commit: [SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large. Closes #26722 from viirya/SPARK-24666-2. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>, co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>, signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>. (cherry picked from commit 755d889.) The commit message matches the pull request description below.
Thanks! Merging to master and 2.4.
I'm OK with putting it in 2.4, I think. It's a minor behavior change, but it also appears to be more correct IMHO and fixes a bug.
OK, very good!
What changes were proposed in this pull request?
This patch adds normalization to word vectors when fitting a dataset in Word2Vec.
Why are the changes needed?
Running Word2Vec on some datasets, when numIterations is large, can produce infinity word vectors.
Does this PR introduce any user-facing change?
Yes. After this patch, Word2Vec won't produce infinity word vectors.
How was this patch tested?
Manually. This issue is not always reproducible on any dataset. The dataset known to reproduce it is too large (925M) to upload.

```scala
case class Sentences(name: String, words: Array[String])

val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")

val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(30)
val model = word2Vec.fit(dataset)
model.getVectors.show()
```

Before:

```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-Infinity,-Infin...|
| Talent|[-Infinity,Infini...|
| Hourglass|[2.02805806500023...|
|Nickelodeon's|[-4.2918617120906...|
| Priests|[-1.3570403355926...|
| Religion:|[-6.7049072282803...|
| Bu|[5.05591774315586...|
| Totoro:|[-1.0539840178632...|
| Trouble,|[-3.5363592836003...|
| Hatter|[4.90413981352826...|
| '79|[7.50436471285412...|
| Vile|[-2.9147142985312...|
| 9/11|[-Infinity,Infini...|
| Santino|[1.30005911270850...|
| Motives|[-1.2538958306253...|
| '13|[-4.5040152427657...|
| Fierce|[Infinity,Infinit...|
| Stover|[-2.6326895394029...|
| 'It|[1.66574533864436...|
| Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
```

After:

```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-0.0454501919448...|
| Talent|[-0.2657704949378...|
| Hourglass|[-0.1399687677621...|
|Nickelodeon's|[-0.1767119318246...|
| Priests|[-0.0047509293071...|
| Religion:|[-0.0411605164408...|
| Bu|[0.11837736517190...|
| Totoro:|[0.05258282646536...|
| Trouble,|[0.09482011198997...|
| Hatter|[0.06040831282734...|
| '79|[0.04783720895648...|
| Vile|[-0.0017210749210...|
| 9/11|[-0.0713915303349...|
| Santino|[-0.0412711687386...|
| Motives|[-0.0492418706417...|
| '13|[-0.0073119504377...|
| Fierce|[-0.0565455369651...|
| Stover|[0.06938160210847...|
| 'It|[0.01117012929171...|
| Butts|[0.05374567210674...|
+-------------+--------------------+
only showing top 20 rows
```