[SPARK-22156][MLLIB] Fix update equation of learning rate in Word2Vec.scala #19372

nzw0301 · 2017-09-27T18:24:35Z

What changes were proposed in this pull request?

The current learning-rate update equation is incorrect when numIterations > 1.
This PR changes it to follow the original C code.

cc: @mengxr

How was this patch tested?

manual tests

I modified this example code.

`numIterations=1`

Code

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

Result

2 0.175856813788414
0 0.10971353203058243
4 0.09818313270807266
3 0.012947646901011467
9 -0.09881238639354706

`numIterations=5`

Code

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
word2vec.setNumIterations(5)

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

Result

0 0.9898583889007568
2 0.9808019399642944
4 0.9794934391975403
3 0.9506527781486511
9 -0.9065656661987305

srowen · 2017-09-28T07:07:18Z

You should make a JIRA as it's a non-trivial behavior change.
I agree that the intent seems to be to follow the original code. I also see that it was intentionally not used. I think the author would have to weigh in.

nzw0301 · 2017-09-28T09:13:19Z

Thank you for your comment, @srowen.
I'll create an issue on JIRA.

LowikC · 2017-09-28T11:35:11Z

I think the PR is incorrect:

The original C code decreases the learning rate linearly from starting_alpha to 0 across all iterations:
new_alpha = starting_alpha * (1 - progress), where progress = word_count_actual / (numIterations * numWordsPerIteration + 1).
Note that word_count_actual counts the number of words processed so far, across all iterations.
In the PR, word_count_actual is called wordCount, but wordCount is reset at the beginning of each iteration (see https://github.com/nzw0301/spark/blob/e2a7d393e141405f658a68f99bc4a1f53816db95/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L365).
The current Spark code is also different: it decreases the learning rate linearly from learningRate to 0 within a single iteration. At the beginning of the next iteration, the value is learningRate again.
The "right" formula (to keep the behavior of the original C code) should be:

val numWordsToProcess = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
val numWordsProcessedInCurrentIteration = numPartitions * wordCount.toDouble
val progress = (numWordsProcessedInPreviousIterations + numWordsProcessedInCurrentIteration) / numWordsToProcess
alpha = learningRate * (1 - progress)
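
As a rough illustration of the two schedules described above, here is a minimal standalone sketch (not the Spark source). The object and method names are hypothetical, `wordsInCurrentIteration` stands in for `numPartitions * wordCount`, and the per-iteration variant is only a model of the restart behavior described in this comment.

```scala
object LearningRateSchedules {
  // Hypothetical model of a schedule that restarts every iteration:
  // alpha falls from learningRate towards 0 within one pass over the data.
  def perIterationAlpha(learningRate: Double,
                        wordsInCurrentIteration: Long,
                        trainWordsCount: Long): Double =
    learningRate * (1 - wordsInCurrentIteration.toDouble / (trainWordsCount + 1))

  // Schedule following the formula in this comment: progress is measured
  // against the total number of words over all iterations (k is 1-based).
  def globalAlpha(learningRate: Double,
                  k: Int,
                  wordsInCurrentIteration: Long,
                  trainWordsCount: Long,
                  numIterations: Int): Double = {
    val numWordsToProcess = numIterations.toLong * trainWordsCount + 1
    val numWordsProcessedInPreviousIterations = (k - 1).toLong * trainWordsCount
    val progress =
      (numWordsProcessedInPreviousIterations + wordsInCurrentIteration).toDouble /
        numWordsToProcess
    learningRate * (1 - progress)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative values only; they are not taken from the PR.
    val learningRate = 0.025
    val trainWordsCount = 1000L
    val numIterations = 5
    for (k <- 1 to numIterations) {
      val restart = perIterationAlpha(learningRate, 0L, trainWordsCount)
      val global = globalAlpha(learningRate, k, 0L, trainWordsCount, numIterations)
      println(s"iteration $k: per-iteration start = $restart, global start = $global")
    }
  }
}
```

In this sketch the per-iteration schedule starts every iteration at `learningRate`, while the global schedule starts each later iteration at a strictly smaller value.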

nzw0301 · 2017-09-28T12:41:13Z

Thank you for your comment, @LowikC.
You are right, my PR code is incorrect.

The correct update formula, based on your comment, is:

alpha = learningRate *
  (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
    (numIterations * trainWordsCount + 1))

LowikC

Correct parentheses

LowikC · 2017-09-28T12:50:23Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

-                learningRate *
-                  (1 - numPartitions * wordCount.toDouble / (numIterations * trainWordsCount + 1))
+              alpha = learningRate *
+                (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /


You need numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount inside parentheses:

alpha = learningRate * (1 - (numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount) / (numIterations * trainWordsCount + 1))

oh... Thanks! I fixed it.
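
To see why the parentheses matter, here is a small sketch that evaluates both forms of the expression from this thread. The object and method names are hypothetical, and the parameter types (`Int` for `numPartitions`, `k`, and `numIterations`; `Long` for `wordCount` and `trainWordsCount`) are assumptions chosen for illustration.

```scala
object ParenthesesCheck {
  // Without the inner parentheses, division binds tighter than addition, so only
  // (k - 1) * trainWordsCount is divided. With the integral types assumed here,
  // that division is also an integer division.
  def withoutParens(learningRate: Double, numPartitions: Int, wordCount: Long,
                    k: Int, trainWordsCount: Long, numIterations: Int): Double =
    learningRate *
      (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
        (numIterations * trainWordsCount + 1))

  // With the parentheses, the whole processed-word count is divided by the total.
  def withParens(learningRate: Double, numPartitions: Int, wordCount: Long,
                 k: Int, trainWordsCount: Long, numIterations: Int): Double =
    learningRate *
      (1 - (numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount) /
        (numIterations * trainWordsCount + 1))

  def main(args: Array[String]): Unit = {
    // Illustrative values only.
    println(withoutParens(0.025, 1, 500L, 2, 1000L, 5))
    println(withParens(0.025, 1, 500L, 2, 1000L, 5))
  }
}
```

With these illustrative inputs the unparenthesized form goes negative, whereas the parenthesized form stays between 0 and `learningRate`.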

srowen · 2017-09-30T08:59:08Z

What do you think @LowikC ?

SparkQA · 2017-09-30T10:07:28Z

Test build #3939 has finished for PR 19372 at commit 90735a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

LowikC · 2017-10-02T07:58:15Z

Looks good to me.
Maybe @nzw0301 could split the formula for readability?

nzw0301 · 2017-10-02T09:15:15Z

Thank you for your reviews, @LowikC.

Like this?

val totalWordsCounts = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount

alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
    totalWordsCounts)
if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
  ", alpha = " + alpha)
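
The lower bound in the snippet above can be sketched in isolation. This is a hypothetical helper rather than the Spark code: `wordsProcessed` stands in for `numPartitions * wordCount + numWordsProcessedInPreviousIterations`, and the overshoot case assumes that this product is only an estimate of the true global count and could therefore exceed the total.

```scala
object AlphaFloor {
  // Linear decay with the same floor as in the thread: alpha never drops
  // below learningRate * 0.0001, even if the raw value would be negative.
  def clampedAlpha(learningRate: Double,
                   wordsProcessed: Long,
                   totalWordsCounts: Long): Double = {
    val raw = learningRate * (1 - wordsProcessed.toDouble / totalWordsCounts)
    math.max(raw, learningRate * 0.0001)
  }

  def main(args: Array[String]): Unit = {
    val learningRate = 0.025
    val totalWordsCounts = 5001L // illustrative: 5 iterations * 1000 words + 1
    println(clampedAlpha(learningRate, 0L, totalWordsCounts))    // start of training
    println(clampedAlpha(learningRate, 6000L, totalWordsCounts)) // estimate overshoots the total
  }
}
```

Using `math.max` is equivalent to the `if (alpha < ...) alpha = ...` form and avoids reassigning a `var` in this standalone sketch.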

nzw0301 · 2017-10-02T12:30:37Z

I updated the results of the word2vec example in the first comment based on this PR.

srowen · 2017-10-03T07:19:19Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

              if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
-              logInfo("wordCount = " + wordCount + ", alpha = " + alpha)
+              logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +


If you update this again, you can use string interpolation: logInfo(s"wordCount = ${wordCount + ...}, alpha = $alpha")

@srowen Done.
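
For reference, the two message styles discussed here produce the same text. The sketch below uses hypothetical method names and returns the strings instead of calling `logInfo`, so it runs without Spark.

```scala
object LogMessageFormat {
  // Concatenation style, as in the original logInfo call.
  def concatenated(wordCount: Long, numWordsProcessedInPreviousIterations: Long,
                   alpha: Double): String =
    "wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
      ", alpha = " + alpha

  // String interpolation style, as suggested in the review.
  def interpolated(wordCount: Long, numWordsProcessedInPreviousIterations: Long,
                   alpha: Double): String =
    s"wordCount = ${wordCount + numWordsProcessedInPreviousIterations}, alpha = $alpha"

  def main(args: Array[String]): Unit = {
    // Illustrative values only.
    println(concatenated(500L, 1000L, 0.0175))
    println(interpolated(500L, 1000L, 0.0175))
  }
}
```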

srowen

OK with me if OK with @LowikC

LowikC

ok for me

SparkQA · 2017-10-06T21:51:59Z

Test build #3943 has finished for PR 19372 at commit 2ea3f18.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

srowen · 2017-10-07T07:30:58Z

Merged to master

nzw0301 · 2017-10-07T07:34:53Z

Thank you for your kind reviews!

Update equation of lr

e2a7d39

nzw0301 changed the title ~~[MLLIB] Fix update equation of learning rate in Word2Vec.scala~~ [SPARK-22156][MLLIB] Fix update equation of learning rate in Word2Vec.scala Sep 28, 2017

Fix update equation

ed846c3

LowikC reviewed Sep 28, 2017

nzw0301 added 2 commits September 28, 2017 21:58

Correct parentheses

b7db7d0

Update logInfo for multiple iterations

90735a9

Use variable for readability

db6c8c8

srowen reviewed Oct 3, 2017

Use string interpolation

2ea3f18

srowen approved these changes Oct 4, 2017

LowikC approved these changes Oct 6, 2017

asfgit closed this in 5eacc3b Oct 7, 2017

GulajavaMinistudio mentioned this pull request Oct 8, 2017

Update upstream GulajavaMinistudio/spark#182

Merged
