[SPARK-22156][MLLIB] Fix update equation of learning rate in Word2Vec.scala #19372

Closed

nzw0301
Member

@nzw0301 nzw0301 commented Sep 27, 2017

What changes were proposed in this pull request?

The current update equation for the learning rate is incorrect when numIterations > 1.
This PR changes it to follow the original C code.

cc: @mengxr

How was this patch tested?

manual tests

I modified this example code.

numIterations = 1

Code

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

Result

2 0.175856813788414
0 0.10971353203058243
4 0.09818313270807266
3 0.012947646901011467
9 -0.09881238639354706

numIterations = 5

Code

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
word2vec.setNumIterations(5)

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

Result

0 0.9898583889007568
2 0.9808019399642944
4 0.9794934391975403
3 0.9506527781486511
9 -0.9065656661987305

@srowen
Member

srowen commented Sep 28, 2017

You should make a JIRA as it's a non-trivial behavior change.
I agree that the intent seems to be to follow the original code. I also see that it was intentionally not used. I think the author would have to weigh in.

@nzw0301
Member Author

nzw0301 commented Sep 28, 2017

Thank you for your comment, @srowen.
I'll create an issue on JIRA.

@nzw0301 nzw0301 changed the title from "[MLLIB] Fix update equation of learning rate in Word2Vec.scala" to "[SPARK-22156][MLLIB] Fix update equation of learning rate in Word2Vec.scala" Sep 28, 2017
@LowikC

LowikC commented Sep 28, 2017

I think the PR is incorrect:

  • the original C code decreases the learning rate linearly from starting_alpha to 0, across all iterations
    new_alpha = starting_alpha * (1 - progress), progress = word_count_actual / (numIterations * numWordsPerIteration + 1)
    Note that word_count_actual counts the number of words processed so far, in all iterations.

  • in the PR, word_count_actual is called wordCount, but wordCount is reset at the beginning of each iteration (see https://github.com/nzw0301/spark/blob/e2a7d393e141405f658a68f99bc4a1f53816db95/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L365).

  • the current Spark code is also different: it decreases the learning rate linearly from learningRate to 0 within a single iteration. At the beginning of the next iteration, the value is reset to learningRate.

  • the "right" formula (to keep the behavior of the original C code) should be:

val numWordsToProcess = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
val numWordsProcessedInCurrentIteration = numPartitions * wordCount.toDouble
val progress = (numWordsProcessedInPreviousIterations + numWordsProcessedInCurrentIteration) / numWordsToProcess
alpha = learningRate * (1 - progress)
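
To make the schedule above concrete, here is a minimal standalone sketch of the proposed formula, written without Spark. The object name `AlphaSchedule`, the parameter `wordsSeenThisIteration` (standing in for `numPartitions * wordCount`), and the sample numbers in the note below are illustrative assumptions, not names or values from Word2Vec.scala.

```scala
object AlphaSchedule {
  /** Learning rate after `wordsSeenThisIteration` words of iteration `k` (1-based).
    * Progress is measured against the total work of all iterations,
    * so the rate never jumps back up when a new iteration starts.
    */
  def alpha(
      learningRate: Double,
      k: Int,
      numIterations: Int,
      trainWordsCount: Long,
      wordsSeenThisIteration: Double): Double = {
    val numWordsToProcess = numIterations * trainWordsCount + 1
    val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount
    val progress =
      (numWordsProcessedInPreviousIterations + wordsSeenThisIteration) / numWordsToProcess
    learningRate * (1 - progress)
  }
}
```

For example, with learningRate = 0.025, numIterations = 5 and trainWordsCount = 1000, `alpha` starts at 0.025, is close to half of that midway through iteration 3, and the value at the start of iteration 4 is no higher than the value at the end of iteration 3.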

@nzw0301
Member Author

nzw0301 commented Sep 28, 2017

Thank you for your comment, @LowikC.
You are right, my PR code is incorrect.

The corrected update formula, based on your comment, is:

alpha = learningRate *
  (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
    (numIterations * trainWordsCount + 1))


@LowikC LowikC left a comment

Correct parentheses

learningRate *
(1 - numPartitions * wordCount.toDouble / (numIterations * trainWordsCount + 1))
alpha = learningRate *
(1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /

You need parentheses around numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount:

alpha = learningRate * (1 - (numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount) / (numIterations * trainWordsCount + 1))
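
The effect of the missing parentheses can be checked with a small standalone sketch. The object name `PrecedenceCheck`, the sample values, and the types (`Int` for `k`, `numIterations` and `numPartitions`; `Long` for `wordCount` and `trainWordsCount`) are assumptions made for illustration.

```scala
object PrecedenceCheck {
  val learningRate = 0.025
  val numPartitions = 1
  val wordCount = 500L
  val k = 3
  val numIterations = 5
  val trainWordsCount = 1000L

  // Without the inner parentheses, `/` binds tighter than `+` and `-`,
  // so only (k - 1) * trainWordsCount is divided by the total. With the
  // integer types assumed here, that quotient also truncates to 0.
  val unparenthesized = learningRate *
    (1 - numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount /
      (numIterations * trainWordsCount + 1))

  // With the inner parentheses, the whole processed-word count is divided.
  val parenthesized = learningRate *
    (1 - (numPartitions * wordCount.toDouble + (k - 1) * trainWordsCount) /
      (numIterations * trainWordsCount + 1))
}
```

Under these assumptions the unparenthesized expression is negative, while the parenthesized one stays between 0 and `learningRate`.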

Member Author

@nzw0301 nzw0301 Sep 28, 2017

oh... Thanks! I fixed it.

@srowen
Member

srowen commented Sep 30, 2017

What do you think @LowikC ?

@SparkQA

SparkQA commented Sep 30, 2017

Test build #3939 has finished for PR 19372 at commit 90735a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LowikC

LowikC commented Oct 2, 2017

Looks good to me.
Maybe @nzw0301 could split the formula for readability?

@nzw0301
Member Author

nzw0301 commented Oct 2, 2017

Thank you for your reviews, @LowikC.

Like this?

val totalWordsCounts = numIterations * trainWordsCount + 1
val numWordsProcessedInPreviousIterations = (k - 1) * trainWordsCount

alpha = learningRate *
  (1 - (numPartitions * wordCount.toDouble + numWordsProcessedInPreviousIterations) /
    totalWordsCounts)
if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
  ", alpha = " + alpha)
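
The lower bound on `alpha` in the snippet above can be sketched in isolation. `AlphaFloor` and `flooredAlpha` are hypothetical names, and `math.max` is used here as an equivalent way to write the `if` check rather than as the form used in the PR.

```scala
object AlphaFloor {
  /** Linear decay with a floor at learningRate * 0.0001.
    * `progress` is the fraction of all training words processed so far.
    */
  def flooredAlpha(learningRate: Double, progress: Double): Double = {
    val raw = learningRate * (1 - progress)
    // Same effect as: if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
    math.max(raw, learningRate * 0.0001)
  }
}
```

The floor only matters near the end of training, when `1 - progress` drops below 0.0001, or if `progress` overshoots 1.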

@nzw0301
Member Author

nzw0301 commented Oct 2, 2017

I updated the results of the word2vec example in the first comment to reflect this PR.

if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
logInfo("wordCount = " + wordCount + ", alpha = " + alpha)
logInfo("wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
Member

If you update this again, you can use string interpolation: logInfo(s"wordCount = ${wordCount + ...}, alpha = $alpha")
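
As a quick check of this suggestion, the sketch below builds the message both ways with made-up values. The object name `LogMessage` and the numbers are assumptions, and the strings are built directly rather than passed to `logInfo`.

```scala
object LogMessage {
  val wordCount = 500L
  val numWordsProcessedInPreviousIterations = 2000L
  val alpha = 0.0125

  // Concatenation, as in the snippet under review.
  val concatenated = "wordCount = " + (wordCount + numWordsProcessedInPreviousIterations) +
    ", alpha = " + alpha

  // The same message with an s-interpolator; ${...} embeds an arbitrary expression.
  val interpolated =
    s"wordCount = ${wordCount + numWordsProcessedInPreviousIterations}, alpha = $alpha"
}
```

Both values hold the same string, so the change affects readability only.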

Member Author

@srowen Done.

Member

@srowen srowen left a comment

OK with me if OK with @LowikC


@LowikC LowikC left a comment

ok for me

@SparkQA

SparkQA commented Oct 6, 2017

Test build #3943 has finished for PR 19372 at commit 2ea3f18.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 7, 2017

Merged to master

@asfgit asfgit closed this in 5eacc3b Oct 7, 2017
@nzw0301
Member Author

nzw0301 commented Oct 7, 2017

Thank you for your kind reviews!

4 participants