[SPARK-28081][ML] Handle large vocab counts in word2vec #24893

srowen · 2019-06-17T14:43:34Z

What changes were proposed in this pull request?

The word2vec logic fails if a corpus has a word with count > 1e9. We should handle very large counts better in general by using longs to count.
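To illustrate why counting with longs is the safer default, here is a minimal sketch of Int accumulation wrapping around while Long accumulation does not. The object and helper names (`CountOverflowSketch`, `totalAsInt`, `totalAsLong`) are hypothetical and not part of Word2Vec.scala, and the sketch does not model the specific 1e9 threshold mentioned above; it only shows the general hazard of 32-bit counters.

```scala
object CountOverflowSketch {
  // Hypothetical helper: sums partial counts for one word using Int arithmetic.
  // Int addition wraps silently once the total exceeds Int.MaxValue (2147483647).
  def totalAsInt(partials: Seq[Int]): Int = partials.sum

  // Same sum, but each partial count is widened to Long before accumulating.
  def totalAsLong(partials: Seq[Int]): Long = partials.map(_.toLong).sum

  def main(args: Array[String]): Unit = {
    // Two partial counts of 1.5e9 each, i.e. 3e9 occurrences in total.
    val partials = Seq(1500000000, 1500000000)
    println(totalAsInt(partials))  // wraps around to a negative value
    println(totalAsLong(partials)) // 3000000000
  }
}
```

The Long version costs four extra bytes per counter, which is usually negligible next to the vectors stored per vocabulary word.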

This takes over #24814

How was this patch tested?

Existing tests.

srowen · 2019-06-17T14:44:17Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

        min2i = pos2
        pos2 += 1
      }
+      assert(count(min1i) < Long.MaxValue)


I may remove these asserts before we commit; just a double check
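For readers unfamiliar with this part of the code, the following is a rough sketch of what such an assert can guard, not the actual Word2Vec.scala implementation. It assumes (the excerpt above does not show this) that unfilled slots of the `count` array hold `Long.MaxValue` as a placeholder, so a selected minimum must always be strictly smaller than it. `SentinelSketch` and `rootWeight` are hypothetical names.

```scala
object SentinelSketch {
  // Hypothetical: Huffman-style merging of the two smallest weights.
  // Leaves occupy count(0 until n) in descending order; merged nodes are
  // appended from index n upward. Unfilled slots hold Long.MaxValue.
  def rootWeight(wordCounts: Array[Long]): Long = {
    val n = wordCounts.length
    require(n >= 2, "need at least two words")
    val sorted = wordCounts.sorted(Ordering[Long].reverse)
    val count = Array.fill[Long](2 * n - 1)(Long.MaxValue)
    Array.copy(sorted, 0, count, 0, n)
    var pos1 = n - 1 // next unconsumed leaf, smallest first
    var pos2 = n     // next unconsumed merged node
    for (a <- 0 until n - 1) {
      // Take whichever of the two candidates has the smaller weight.
      def take(): Int =
        if (pos1 >= 0 && count(pos1) < count(pos2)) { val i = pos1; pos1 -= 1; i }
        else { val i = pos2; pos2 += 1; i }
      val min1i = take()
      assert(count(min1i) < Long.MaxValue) // a placeholder was never selected
      val min2i = take()
      assert(count(min2i) < Long.MaxValue)
      count(n + a) = count(min1i) + count(min2i)
    }
    count(2 * n - 2) // weight of the root equals the sum of all counts
  }
}
```

Under this assumption, a real count above 1e9 (for example `3000000000L`) stays distinguishable from the placeholder, which would not hold if the placeholder were a smaller constant.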

SparkQA · 2019-06-17T16:02:08Z

Test build #106590 has finished for PR 24893 at commit 8d74927.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

The word2vec logic fails if a corpus has a word with count > 1e9. We should handle very large counts better in general by using longs to count.

This takes over #24814

## How was this patch tested?

Existing tests.

Closes #24893 from srowen/SPARK-28081.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit e96dd82)
Signed-off-by: Sean Owen <sean.owen@databricks.com>

srowen · 2019-06-19T01:29:37Z

Merged to master/2.4/2.3

Handle large vocab counts in word2vec

8d74927

srowen self-assigned this Jun 17, 2019

srowen mentioned this pull request Jun 17, 2019

set Int MaxValue #24814

Closed

dongjoon-hyun added the ML label Jun 17, 2019

dongjoon-hyun approved these changes Jun 17, 2019

srowen closed this in e96dd82 Jun 19, 2019

srowen deleted the SPARK-28081 branch June 19, 2019 13:55
