
[SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large #26722

Closed

Conversation

viirya
Member

@viirya viirya commented Nov 30, 2019

What changes were proposed in this pull request?

This patch adds normalization to word vectors when fitting a dataset in Word2Vec.

Why are the changes needed?

Running Word2Vec on some datasets, when numIterations is large, can produce infinity word vectors.

Does this PR introduce any user-facing change?

Yes. After this patch, Word2Vec won't produce infinity word vectors.

How was this patch tested?

Manually. This issue does not reproduce on every dataset, and the dataset known to reproduce it is too large (925M) to upload.

case class Sentences(name: String, words: Array[String])
val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")
val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(30)
val model = word2Vec.fit(dataset)
model.getVectors.show()

Before:

Training model...                    
+-------------+--------------------+                                                                                                                                         
|         word|              vector|                                                                                                                                         
+-------------+--------------------+                                                                                                                                         
|     Unspoken|[-Infinity,-Infin...|                                                                                                                                         
|       Talent|[-Infinity,Infini...|                                                                                                                                         
|    Hourglass|[2.02805806500023...|                                                                                                                                         
|Nickelodeon's|[-4.2918617120906...|                                                                                                                                         
|      Priests|[-1.3570403355926...|                                                                                                                                         
|    Religion:|[-6.7049072282803...|                                                                                                                                         
|           Bu|[5.05591774315586...|                                                                                                                                         
|      Totoro:|[-1.0539840178632...|                                                                                                                                         
|     Trouble,|[-3.5363592836003...|                                                                                                                                         
|       Hatter|[4.90413981352826...|                                                                                                                                         
|          '79|[7.50436471285412...|                                                                                                                                         
|         Vile|[-2.9147142985312...|                                                                                                                                         
|         9/11|[-Infinity,Infini...|                                                                                                                                         
|      Santino|[1.30005911270850...|                                                                                                                                         
|      Motives|[-1.2538958306253...|                                                                                                                                         
|          '13|[-4.5040152427657...|                                                                                                                                         
|       Fierce|[Infinity,Infinit...|                                                                                                                                         
|       Stover|[-2.6326895394029...|                                                                                                                                         
|          'It|[1.66574533864436...|                                                                                                                                         
|        Butts|[Infinity,Infinit...|                                                                      
+-------------+--------------------+                                                                                                                                         
only showing top 20 rows              

After:

Training model...                    
+-------------+--------------------+                                  
|         word|              vector|
+-------------+--------------------+                  
|     Unspoken|[-0.0454501919448...|                           
|       Talent|[-0.2657704949378...|
|    Hourglass|[-0.1399687677621...|
|Nickelodeon's|[-0.1767119318246...|
|      Priests|[-0.0047509293071...|
|    Religion:|[-0.0411605164408...|
|           Bu|[0.11837736517190...|
|      Totoro:|[0.05258282646536...|
|     Trouble,|[0.09482011198997...|
|       Hatter|[0.06040831282734...|
|          '79|[0.04783720895648...|
|         Vile|[-0.0017210749210...|
|         9/11|[-0.0713915303349...|
|      Santino|[-0.0412711687386...|
|      Motives|[-0.0492418706417...|
|          '13|[-0.0073119504377...|
|       Fierce|[-0.0565455369651...|
|       Stover|[0.06938160210847...|
|          'It|[0.01117012929171...|
|        Butts|[0.05374567210674...|
+-------------+--------------------+
only showing top 20 rows
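
For context, "normalization" here means scaling each word vector. A minimal sketch of L2 (unit-length) normalization, illustrative only and not the exact patch code:

```scala
// Illustrative only: scale a weight vector to unit L2 length,
// guarding against the all-zero vector.
def normalize(v: Array[Float]): Array[Float] = {
  val norm = math.sqrt(v.map(x => x.toDouble * x).sum)
  if (norm == 0.0) v else v.map(x => (x / norm).toFloat)
}
```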

@SparkQA

SparkQA commented Nov 30, 2019

Test build #114651 has finished for PR 26722 at commit d08fab9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 30, 2019

Test build #114652 has finished for PR 26722 at commit 933cd5d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Nov 30, 2019

retest this please.

@SparkQA

SparkQA commented Nov 30, 2019

Test build #114655 has finished for PR 26722 at commit 933cd5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


Hm, why is it valid to normalize these here? Is there any reference that helps explain why it's OK? The magnitude of the embeddings generally does matter.

@viirya
Member Author

viirya commented Nov 30, 2019

Hm, it seems the original paper did not say much about how distributed training should be done. The original multi-threaded implementation simply overwrites the training weights.

Currently we simply sum up the weights trained in each partition. I think that is the issue.

For the final embedding, because we care about cosine similarity later, the magnitude of the word vectors should not matter.

In the training code, the magnitude of the training vector is involved in computing the gradient. Changing the magnitude should not change the direction of the gradient, only its magnitude.

Alternatively, I tried dividing the trained weights by the number of partitions when summing them up, like:

val synAgg = partial.reduceByKey { case (v1, v2) =>
  blas.saxpy(vectorSize, 1.0f / numPartitions, v2, 1, v1, 1)
  v1
}.collect()
Training model...
+-------------+--------------------+                                                                                                                                         
|         word|              vector|                                                                                                                                         
+-------------+--------------------+                                                                                                                                         
|     Unspoken|[-5745.0419921875...|                                                                                                                                         
|       Talent|[-1.6445698E7,-33...|                                                                                                                                         
|    Hourglass|[-16.266916275024...|                                  
|Nickelodeon's|[-44.274482727050...|                           
|      Priests|[-9.2923946380615...|
|    Religion:|[-0.4360949993133...|                                                                                                                                         
|           Bu|[-0.1929319202899...|                                     
|      Totoro:|[-0.4030513465404...|                                                              
|     Trouble,|[3.99134278297424...|
|       Hatter|[-4.4252877235412...|                                                                                                                                         
|          '79|[-1.7150703668594...|                                     
|         Vile|[-270.66030883789...|          
|         9/11|[-2927296.0,85244...|
|      Santino|[0.80541723966598...|                                                    
|      Motives|[35.5695533752441...|                                                                  
|          '13|[2.82274293899536...|                                                                                                                                        
|       Fierce|[29019.884765625,...|
|       Stover|[-3.0654258728027...|  
|          'It|[1.14618539810180...|
|        Butts|[2.0819126E7,-3.0...| 
+-------------+--------------------+                              
only showing top 20 rows

Big values remain, although no infinities.

@srowen
Member

srowen commented Nov 30, 2019

Hm, I don't think it's only cosine similarity that matters; these embeddings are often used as general-purpose features for neural nets or the like. Changing the norms individually changes their relative magnitudes. It doesn't mean the answer is wrong per se, just not clear that losing that info is not costing something - indeed in training too.

I do think it's more valid to divide through everything by a constant, which could arbitrarily be the number of partitions, or words. I'd love to find a more reliable reference for that kind of thing.

I haven't looked at the impl in a long time but I'm also trying to figure out why it happens. What in the code makes it not scale with the size of the input? Because conceptually it should not.

@viirya
Member Author

viirya commented Nov 30, 2019

The original paper/implementation does not cover a distributed training case like this. The multi-threaded implementation does not seem to consider it either: each thread, training on part of the data, simply writes to the same network weights.

We add up the trained weights from each partition. I think this is not correct either. Like normalization, it can be seen as a change to the magnitude of the embedding compared with single-threaded training.

I agree that dividing seems more valid, though big values remain in the trained embedding.

I will look for a reference for distributed Word2Vec training.

@srowen
Member

srowen commented Nov 30, 2019

I see, so are you saying the weights are effectively N times larger with N partitions than with 1? That might be worth a sanity check. If so, and the implementation is inadvertently scaling weights by N in this case, then dividing by the number of partitions before aggregation sounds good.

@viirya
Member Author

viirya commented Nov 30, 2019

The implementations/papers, including the original, "Parallelizing Word2Vec in Shared and Distributed Memory", and word2vec++, all seem to follow the Hogwild-style approach that simply ignores conflicts when updating the weights.

We train each partition separately, so I think we can't do it like that. For now I do not see a good reference for how to aggregate updated weights from partitions.

Let me do an additional check to see how the weights scale with N.

@viirya
Member Author

viirya commented Dec 1, 2019

I ran the test shown in the description, increasing the number of partitions from 1:

numPartitions = 1:

Training model...                   
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-1.1236536502838...|
|       Talent|[0.25907281041145...|
|    Hourglass|[-0.1529812067747...|
|Nickelodeon's|[0.56294345855712...|
|      Priests|[-0.6632155776023...|
|    Religion:|[-0.2995840609073...|
|           Bu|[-0.1570600569248...|
|      Totoro:|[-0.6866843700408...|
|     Trouble,|[-0.0878739282488...|
|       Hatter|[0.30810326337814...|
|          '79|[0.30140322446823...|
|         Vile|[0.24635089933872...|
|         9/11|[-0.0068172565661...|
|      Santino|[0.02316198870539...|
|      Motives|[0.01498097740113...|
|          '13|[0.41122323274612...|
|       Fierce|[-0.5384349822998...|
|       Stover|[-0.1143656522035...|
|          'It|[0.64874994754791...|
|        Butts|[0.30401131510734...|
+-------------+--------------------+
only showing top 20 rows

numPartitions = 5:

Training model...
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-6.7764405751741...|
|       Talent|[-9.6067479920091...|
|    Hourglass|[-5.7962791272710...|
|Nickelodeon's|[-5.5408793368190...|
|      Priests|[-3.4655685116755...|
|    Religion:|[1.75421539703848...|
|           Bu|[5.605338120192E1...|
|      Totoro:|[-7.3925109040147...|
|     Trouble,|[-8.3033904E7,1.2...|
|       Hatter|[8.05359369958017...|
|          '79|[6.48125380885584...|
|         Vile|[-1.1937402140810...|
|         9/11|[7.41007681173124...|
|      Santino|[2.34601249830338...|
|      Motives|[1.62917771495709...|
|          '13|[2.42768495215902...|
|       Fierce|[-2.9470496032303...|
|       Stover|[-1.6529095739310...|
|          'It|[-2.644077182976E...|
|        Butts|[1.79399782848582...|
+-------------+--------------------+
only showing top 20 rows

...

numPartitions = 20:

Training model...
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-2.6176356140015...|
|       Talent|[-5.5199353538505...|
|    Hourglass|[1.86132916555005...|
|Nickelodeon's|[-9.0021568288662...|
|      Priests|[-7.5406975123998...|
|    Religion:|[5.01159335080765...|
|           Bu|[9.12173977905292...|
|      Totoro:|[2.22915976975771...|
|     Trouble,|[2.96316624467027...|
|       Hatter|[1.68119056598216...|
|          '79|[3.15082808602809...|
|         Vile|[1.25104734842133...|
|         9/11|[-1.3190123545398...|
|      Santino|[-8.8219865210421...|
|      Motives|[1.21626889307140...|
|          '13|[-3.9497276817527...|
|       Fierce|[-5.7803542675530...|
|       Stover|[3.02013414112860...|
|          'It|[5.61268932083837...|
|        Butts|[1.12267731121987...|
+-------------+--------------------+
only showing top 20 rows

numPartitions = 25, infinity vectors show up:

Training model...
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-1.1421569937574...|
|       Talent|[-Infinity,-Infin...|
|    Hourglass|[-9.5144596513321...|
|Nickelodeon's|[-2.6736808943986...|
|      Priests|[-3.2282427842550...|
|    Religion:|[5.17546632302999...|
|           Bu|[1.75799700304267...|
|      Totoro:|[-7.4863996312756...|
|     Trouble,|[-5.0377472139952...|
|       Hatter|[-2.3868325167321...|
|          '79|[6.10459795780245...|
|         Vile|[-3.5907428812103...|
|         9/11|[-Infinity,Infini...|
|      Santino|[-3.9457649799987...|
|      Motives|[-5.8161171272514...|
|          '13|[-5.9268523384381...|
|       Fierce|[-Infinity,-Infin...|
|       Stover|[-7.4122975091097...|
|          'It|[1.19725408831668...|
|        Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows

@srowen
Member

srowen commented Dec 1, 2019

BTW what are the exponents on these figures -- can you print more? or print their magnitude?

@viirya
Member Author

viirya commented Dec 1, 2019

Here are the magnitudes:

Training model..., numParts = 1                                                                                                                                              
word: Martha's, magnitude: 3.743278710659449                                                                    
word: Marta, magnitude: 3.0524119611411527                                                                                                                                   
word: Marvel's, magnitude: 3.7116662524570962                         
word: Arlovski, magnitude: 5.367418309208839                                                                                                                                 
word: Nation:, magnitude: 2.3689783957491817                                                                                                                                
word: Stock, magnitude: 3.6421844678790065                                                                                                                                  
word: #9:, magnitude: 3.592700041933447                                                    
word: Chayon-Ryu, magnitude: 4.190948205327908                                                                                                                              
word: (Fifth, magnitude: 5.577325501834904                                        
word: Shiver, magnitude: 2.924427895961321                                                                                                                                  
word: Porcupine, magnitude: 2.979338865483149                                    
word: Whiteman, magnitude: 2.9768204064161585                                                                                                                                
word: Baldpate, magnitude: 4.443153970183657                                                               
word: Einstein, magnitude: 2.534842496715387                                          
word: Neapolitan, magnitude: 3.364223370424205
word: Vi, magnitude: 1.6161103277786248
word: Tallest, magnitude: 2.959881355124488
word: Novak, magnitude: 3.6077390984642848
word: Park', magnitude: 3.699203539517963
word: #28:, magnitude: 3.9433840429206155
Training model..., numParts = 5
word: Martha's, magnitude: Infinity
word: Marta, magnitude: 3.8229410743441402E17
word: Marvel's, magnitude: Infinity
word: Arlovski, magnitude: Infinity
word: Nation:, magnitude: Infinity
word: Stock, magnitude: Infinity
word: #9:, magnitude: Infinity
word: Chayon-Ryu, magnitude: Infinity
word: (Fifth, magnitude: Infinity
word: Shiver, magnitude: 3.4865850155309619E17
word: Porcupine, magnitude: Infinity
word: Whiteman, magnitude: 2.9038712510688166E17
word: Baldpate, magnitude: Infinity
word: Einstein, magnitude: Infinity
word: Neapolitan, magnitude: Infinity
word: Vi, magnitude: 3.476331244503146
word: Tallest, magnitude: 3.243632130386425E17
word: Novak, magnitude: Infinity
word: Park', magnitude: 4.3857814193649882E17
word: #28:, magnitude: 1.442267607046858E14
Training model..., numParts = 50                                                        
word: Martha's, magnitude: Infinity                                                                       
word: Marta, magnitude: Infinity                                                                                                                                             
word: Marvel's, magnitude: Infinity                                                     
word: Arlovski, magnitude: Infinity                                                                                                                                          
word: Nation:, magnitude: Infinity                                               
word: Stock, magnitude: Infinity                                                                                                                                             
word: #9:, magnitude: Infinity                                                       
word: Chayon-Ryu, magnitude: Infinity                                                                                                                                        
word: (Fifth, magnitude: Infinity                                            
word: Shiver, magnitude: Infinity                                                                                                                                            
word: Porcupine, magnitude: Infinity                                                                                                                                        
word: Whiteman, magnitude: Infinity                                                                                                                                          
word: Baldpate, magnitude: Infinity                                
word: Einstein, magnitude: Infinity                                                                                                                                          
word: Neapolitan, magnitude: Infinity                                                 
word: Vi, magnitude: 2.313728521691476                                                                                                                                      
word: Tallest, magnitude: Infinity                                              
word: Novak, magnitude: Infinity                                                                                                                                            
word: Park', magnitude: Infinity                                                  
word: #28:, magnitude: Infinity 

If we divide by the number of partitions when aggregating the weight vectors:

Training model..., numParts = 50                                              
word: Martha's, magnitude: 1837.5938071293122                                                                                                                                
word: Marta, magnitude: 222.50718913947478                                                                                                                                   
word: Marvel's, magnitude: 26457.30749717363                             
word: Arlovski, magnitude: 38.445068805110594                                                     
word: Nation:, magnitude: 965137.5871076621                                                                                                                                  
word: Stock, magnitude: 7012249.419810451                                                                                                                                    
word: #9:, magnitude: 2.0486049390084442E8                                
word: Chayon-Ryu, magnitude: 45.13060068472458
word: (Fifth, magnitude: 202.46650957627534                                                                                                                                  
word: Shiver, magnitude: 200.8438739886034                                                       
word: Porcupine, magnitude: 78.98423745042425                                           
word: Whiteman, magnitude: 17.939542328473102                                                               
word: Baldpate, magnitude: 36.798928419350744                                                                                                                                
word: Einstein, magnitude: 6474.188443349482                                         
word: Neapolitan, magnitude: 607.5522611265635                                          
word: Vi, magnitude: 6.211080604679997                                                                    
word: Tallest, magnitude: 23.01892240254132                                                                                                                                  
word: Novak, magnitude: 416.79496982929146                                              
word: Park', magnitude: 29.33652629236024                                                                                                                                    
word: #28:, magnitude: 29.42300764548468   
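
For reference, a sketch of how such magnitudes can be printed from a fitted model (the exact script used was not posted; this assumes the ml.Word2VecModel API):

```scala
import org.apache.spark.ml.linalg.Vector

// Print the L2 norm (magnitude) of each learned word vector.
// model.getVectors is a DataFrame with columns "word" and "vector".
model.getVectors.collect().foreach { row =>
  val word = row.getString(0)
  val vec = row.getAs[Vector]("vector")
  val magnitude = math.sqrt(vec.toArray.map(x => x * x).sum)
  println(s"word: $word, magnitude: $magnitude")
}
```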

@SparkQA

SparkQA commented Dec 1, 2019

Test build #114687 has finished for PR 26722 at commit 4ab9906.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2019

Test build #114688 has finished for PR 26722 at commit fefae3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Dec 1, 2019

Hm, that's a crazy result. Something is wrong, to be sure. I can't imagine why just 5 partitions would make such a difference. I don't know word2vec well, but it looks kind of like it adds the new vectors per word instead of taking one of them arbitrarily (a la Hogwild). But I might be misreading it. And even if it were adding them, you'd imagine that 5 partitions might make the result 5x larger than usual, not 10^17.

Do you see info log output showing what alpha is? I'm curious about what happens in the line:

alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
    totalWordsCounts)

and also how this might become large:

val g = ((1 - bcVocab.value(word).code(d) - f) * alpha).toFloat

It kind of feels like some multiplier which is supposed to be in [0, 1] is becoming significantly negative, making it grow out of control.

@viirya
Member Author

viirya commented Dec 2, 2019

That's a good point!

I checked the alpha value during fitting with 5 partitions. At the end of fitting, alpha becomes a very small value like 3.131027051017643E-6.

I think the current alpha value is also not computed correctly.

Originally, the alpha is updated like https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L397:

alpha = starting_alpha * (1 - word_count_actual / (real)(iter * train_words + 1));

In Spark's Word2Vec, it is updated as:

alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
    totalWordsCounts)

Here, by multiplying by numPartitions, we may update alpha to a significantly negative value.
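
To illustrate with made-up numbers (not from an actual run), the pre-clamping value goes negative once numPartitions * wordCount overshoots totalWordsCounts:

```scala
// Made-up numbers, for illustration only.
val learningRate = 0.025
val numPartitions = 50
val wordCount = 300000L                          // words seen so far in this partition
val numWordsProcessedInPreviousIterations = 0L
val totalWordsCounts = 10000000L

val alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
    totalWordsCounts)
println(alpha) // -0.0125, negative before the learningRate * 0.0001 floor is applied
```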

@srowen
Member

srowen commented Dec 2, 2019

Hm, that value isn't negative though, just very small. The next line, perhaps accidentally, would handle negative values: if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001. And starting_alpha is learningRate in the Spark code, so that much looks OK.

But yes, the update rule looks different. numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations is a heuristic to estimate the total number of words processed by all partitions. The denominator looks like it's the same though. I would guess this could be negative, but then choosing a very small alpha is "OK". I'm not sure that's the issue.

I'm kind of wondering about this line in the C code:
https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L448

I don't quite see its equivalent here. syn0 is basically used for neu1, but it's missing some normalization by cw, which I believe is 2 * windowSize + 1 - 2 * b here. That's up to a factor of about 9 if windowSize is 4. That feeds, I think, directly into the size of g, as it makes the magnitude of the dot product that feeds f a lot larger.

What I don't really understand is why it would be 'triggered' by the number of partitions rather than iterations here, or why it doesn't seem to show up otherwise. It's possible that it's really iterations driving this, and numPartitions isn't helping.

Hm, what about sticking in a normalization by that factor above as a hack to see what happens?

@viirya
Member Author

viirya commented Dec 2, 2019

I'm kind of wondering about this line in the C code:
https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L448

I don't quite see its equivalent here. syn0 is basically used for neu1, but it's missing some normalization by cw, which I believe is 2 * windowSize + 1 - 2 * b here. That's up to a factor of about 9 if windowSize is 4. That feeds, I think, directly into the size of g, as it makes the magnitude of the dot product that feeds f a lot larger.

This part is for the CBOW architecture in Word2Vec. As we support only skip-gram, I think the update rule is different.

For the skip-gram training beginning at https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L495, I think we do the same thing when updating weights. It does not look like an issue to me.

@viirya
Member Author

viirya commented Dec 2, 2019

Specifically, I looked into the moment when any Infinity value is first produced while aggregating weight vectors at:

val synAgg = partial.reduceByKey { case (v1, v2) =>
  blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1) // v1 += v2 in place
  v1
}.collect()

When the first Infinity values appear among the weights:

alpha: 0.025

v1 (before blas.saxpy): 3.7545144E37, 9.609645E37, -8.751438E37, -1.6193201E38, 1.1736736E38, 3.2835947E38, 8.1553495E37, -1.6691325E38, -7.576555E37, -5.648573E37, -1.9869322E37, -1.6807897E37, -5.7600233E37, -6.2470694E37, -1.4104866E38, -1.4680707E38, -3.1782221E37, 1.8944205E38, 1.5494958E38, -2.1342228E38, -6.157935E37, 3.9677284E37, 1.1558841E37, 4.331978E37, -8.0626774E36, -5.8198486E36, 8.500153E37, -5.662092E36, -4.009228E37, -1.9031902E38, -2.4923412E38, 7.174913E37, 5.1235664E37, -5.5351527E37, 5.5978614E37, -1.8525286E38, 1.066509E37, 1.5285991E37, -2.0523789E38, 8.57768E37, -9.894086E37, -1.8595572E38, 2.0450045E37, 7.084625E37, 1.7256363E38, 1.7746238E37, 1.4823289E37, 1.2560103E38, -1.910456E38, -5.6934737E37, 3.9446576E37, 1.9320926E38, 5.9035325E37, -1.2072379E38, 7.4097296E37, -8.0367785E37, 1.9674684E38, 5.9296644E37, -1.8741689E38, -1.4480887E38, -2.933689E37, -6.161533E37, 1.02056735E36, 2.3885107E38

v1 (after blas.saxpy): 4.022694E37, 1.0296049E38, -9.376541E37, -1.734986E38, 1.2575074E38, Infinity, 8.737874E37, -1.7883562E38, -8.1177373E37, -6.0520423E37, -2.128856E37, -1.8008461E37, -6.1714535E37, -6.6932885E37, -1.5112356E38, -1.5729329E38, -3.405238E37, 2.0297362E38, 1.660174E38, -2.2866673E38, -6.597787E37, 4.2511375E37, 1.2384472E37, 4.641405E37, -8.638583E36, -6.235552E36, 9.107307E37, -6.066527E36, -4.2956014E37, -2.0391324E38, -2.6703656E38, 7.687407E37, 5.4895356E37, -5.930521E37, 5.997709E37, -1.984852E38, 1.1426883E37, 1.6377848E37, -2.1989773E38, 9.190372E37, -1.0600806E38, -1.9923827E38, 2.1910763E37, 7.5906695E37, 1.848896E38, 1.9013826E37, 1.5882095E37, 1.3457253E38, -2.046917E38, -6.10015E37, 4.2264189E37, 2.0700992E38, 6.3252134E37, -1.2934692E38, 7.938996E37, -8.610834E37, 2.1080017E38, 6.353212E37, -2.008038E38, -1.5515236E38, -3.1432383E37, -6.6016424E37, 1.093465E36, 2.5591186E38

v2: 2.681796E36, 6.8640314E36, -6.251026E36, -1.1566574E37, 8.3833835E36, 2.3454244E37, 5.825249E36, -1.1922373E37, -5.411826E36, -4.034695E36, -1.419237E36, -1.2005642E36, -4.114303E36, -4.4621932E36, -1.0074906E37, -1.048622E37, -2.2701585E36, 1.3531576E37, 1.1067827E37, -1.524445E37, -4.3985254E36, 2.834092E36, 8.256315E35, 3.09427E36, -5.759055E35, -4.1570347E35, 6.071537E36, -4.044352E35, -2.8637343E36, -1.3594217E37, -1.7802438E37, 5.1249374E36, 3.65969E36, -3.9536805E36, 3.9984723E36, -1.3232347E37, 7.6179225E35, 1.0918563E36, -1.465985E37, 6.126915E36, -7.067205E36, -1.3282551E37, 1.4607173E36, 5.0604463E36, 1.2325974E37, 1.2675882E36, 1.05880635E36, 8.971503E36, -1.3646113E37, -4.066767E36, 2.817613E36, 1.3800664E37, 4.2168083E36, -8.623128E36, 5.2926634E36, -5.7405564E36, 1.4053346E37, 4.2354745E36, -1.3386921E37, -1.03434895E37, -2.0954919E36, -4.401094E36, 7.2897654E34, 1.7060787E37

There are extremely large positive/negative values that exceed the range of a float, so Infinity/-Infinity weights are produced there.
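
A tiny demonstration of the overflow itself, using two of the component values above (Float.MaxValue is about 3.4028235E38):

```scala
// The 6th components of v1 and v2 above: their sum exceeds Float.MaxValue,
// so the float addition saturates to Infinity.
val a = 3.2835947e38f
val b = 2.3454244e37f
println(a + b)          // Infinity
println(Float.MaxValue) // 3.4028235E38
```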

@viirya
Member Author

viirya commented Dec 2, 2019

What I don't really understand is why it would be 'triggered' by the number of partitions rather than iterations here, or why it doesn't seem to show up otherwise. It's possible that it's really iterations driving this, and numPartitions isn't helping.

Both the number of partitions and the number of iterations affect this. By reducing iterations to 10 (from 30), with 5 partitions, you won't see the infinite magnitudes that appear in the 30-iteration case.

Training model..., numParts = 5
word: Martha's, magnitude: 3256804.914837854
word: Marta, magnitude: 353497.2433499305
word: Marvel's, magnitude: 4069358.2203939725
word: Arlovski, magnitude: 7118886.591379862
word: Nation:, magnitude: 6296374.743896999
word: Stock, magnitude: 4837719.561042786
word: #9:, magnitude: 2.2051319577420667E7
word: Chayon-Ryu, magnitude: 2965298.379611738
word: (Fifth, magnitude: 1.3429820237455452E7
word: Shiver, magnitude: 319441.2914574445
word: Porcupine, magnitude: 2267350.3638493987
word: Whiteman, magnitude: 260164.35322311163
word: Baldpate, magnitude: 2392710.378421927
word: Einstein, magnitude: 4124620.3929244205
word: Neapolitan, magnitude: 2840700.767267119
word: Vi, magnitude: 4.443334263321898
word: Tallest, magnitude: 285155.5317927394
word: Novak, magnitude: 2646500.711022009
word: Park', magnitude: 387377.63594714657
word: #28:, magnitude: 40106.517375608666

@viirya
Member Author

viirya commented Dec 2, 2019

The above alpha value looks normal when the infinity values appear. It looks like we keep adding weights from partitions across iterations until the values become too large (positive or negative) to hold in a float. Thus increasing partitions can cause it, and so can increasing iterations.

@srowen
Member

srowen commented Dec 2, 2019

Ah right, disregard my previous comment. Am I right that the original implementation, being single-threaded, computes just one updated vector per word per iteration, while the Spark implementation comes up with several, because the word may appear in multiple partitions? Then adding them doesn't make sense; it would make sense to average them. That's not quite the same as dividing by the number of partitions, as the word may not appear in all partitions. You could accumulate a simple count in reduceByKey, then divide the sum through by the count?
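
For illustration, a minimal sketch of that count-and-average aggregation (variable names assumed from the surrounding discussion; not necessarily the exact patch code):

```scala
// Pair each per-partition vector with a count of 1, sum vectors and counts,
// then divide each summed vector through by its count.
// `partial` is assumed to be an RDD[(Int, Array[Float])] keyed by word index.
val synAgg = partial
  .mapValues(v => (v, 1))
  .reduceByKey { case ((v1, c1), (v2, c2)) =>
    blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1) // v1 += v2
    (v1, c1 + c2)
  }
  .mapValues { case (v, count) =>
    blas.sscal(vectorSize, 1.0f / count, v, 1)  // v /= count
    v
  }
  .collect()
```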

@SparkQA

SparkQA commented Dec 3, 2019

Test build #114746 has finished for PR 26722 at commit bcd1aa7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 3, 2019

Test build #114748 has finished for PR 26722 at commit 236b0fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 3, 2019

Test build #114801 has finished for PR 26722 at commit 21c6b84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Dec 4, 2019

PS does this also solve your problem? This change sounds OK to me.

@viirya
Member Author

viirya commented Dec 4, 2019

Yeah, I reran the test manually and it does not produce infinity vectors.

@viirya
Member Author

viirya commented Dec 5, 2019

@srowen do you have more comments on this? If not, I will merge this.

Member

@srowen srowen left a comment


Looks OK for 3.0 if tests pass and it solves your problem.
It'd be great to get a test for it, but it sounds like it can only happen with huge data? Or would enough iterations do it?

@viirya
Member Author

viirya commented Dec 5, 2019

Let me check whether it is also reproducible with a large number of partitions + iterations in an existing mllib test.

If not, I think it cannot be reproduced, or is at least hard to reproduce, on a small dataset.

@viirya
Member Author

viirya commented Dec 5, 2019

OK. I tried increasing the number of partitions and iterations in an existing test, but cannot reproduce the infinity weights. For adding a reproducible test case, one concern is the big dataset needed; another is the fitting time.

@viirya viirya closed this in 755d889 Dec 6, 2019
viirya added a commit that referenced this pull request Dec 6, 2019
…terations are large

Closes #26722 from viirya/SPARK-24666-2.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
(cherry picked from commit 755d889)
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
@viirya
Member Author

viirya commented Dec 6, 2019

Thanks! Merging to master and 2.4.

@srowen
Member

srowen commented Dec 6, 2019

I'm OK with putting it in 2.4, I think. It's a minor behavior change, but it also appears to be more correct IMHO, and fixes a bug.

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…terations are large

Closes apache#26722 from viirya/SPARK-24666-2.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
@WngMen

WngMen commented Jan 20, 2020

OK, very good!

mattsills pushed a commit to mattsills/spark that referenced this pull request Jun 1, 2020
…terations are large

Closes apache#26722 from viirya/SPARK-24666-2.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
@viirya viirya deleted the SPARK-24666-2 branch December 27, 2023 18:23