[SPARK-11898] [MLlib] Use broadcast for the global tables in Word2Vec #9878

hhbyyh · 2015-11-21T03:51:28Z

jira: https://issues.apache.org/jira/browse/SPARK-11898
syn0Global and syn1Global in Word2Vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to the workers using basic task serialization.

Using broadcast can greatly improve performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, changing to broadcast can:

decrease worker memory consumption by 45%.
decrease running time by 40%.

This will also help extend the upper limit for Word2Vec.
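To put the size formula in concrete terms, here is a rough back-of-the-envelope sketch for the benchmark configuration (1M vocabulary, vectorSize 100). The object and method names are hypothetical, and the 8 bytes per element is taken from the description as-is; pass a different `bytesPerElement` if the element type differs.

```scala
object TableSizeEstimate {
  // Bytes for one global table, using the formula quoted in the description:
  // vocab * vectorSize * bytesPerElement.
  def tableBytes(vocabSize: Long, vectorSize: Int, bytesPerElement: Int = 8): Long =
    vocabSize * vectorSize * bytesPerElement

  def main(args: Array[String]): Unit = {
    val perTable = tableBytes(1000000L, 100)
    println(s"one table: $perTable bytes")
    println(s"syn0Global + syn1Global: ${2 * perTable} bytes")
  }
}
```

With plain task serialization, a payload of this size travels with every task, whereas a broadcast is fetched once per executor and shared by the tasks running there.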

SparkQA · 2015-11-21T04:40:33Z

Test build #46468 has finished for PR 9878 at commit cee80c0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-11-21T14:46:13Z

I think this looks good.

jkbradley · 2015-11-23T01:12:04Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

Broadcasting is a good idea, but these values are modified (based on aggregated updates) on each iteration. You'll need to make a new broadcast variable on each iteration.

This may also change the results of your timing tests; would you be able to re-run them?

Hi Joseph, I actually found that a broadcast variable somehow seems to get updated automatically.
An example:

    import scala.util.Random

    val arr = (0 to 16).toArray
    val bc = sc.broadcast(arr)
    val rdd = sc.parallelize(1 to 8)
    for (w <- 1 to 10) {
      // Read element 2 through the broadcast on every task
      val result = rdd.map(i => bc.value(2)).collect().mkString(", ")
      println(result)
      // Mutate the driver-side array after the broadcast was created
      arr(2) = new Random().nextInt()
    }

The code prints different numbers across the 10 iterations.
I'm not sure if it's by design.

@hhbyyh Joseph is correct. What you see only happens to work since you are running locally in one JVM.

I see where I went wrong.
On the cluster I had tested a different version:

    val value = bc.value
    val result = rdd.map(i => value(2)).collect().mkString(", ")

Anyway, you are correct. Thanks
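The behaviour discussed in this thread can be illustrated without a cluster. The sketch below is a simplified stand-in that uses plain Java serialization rather than Spark's broadcast machinery (all names are hypothetical): once a value has been serialized for shipping to another JVM, later mutations on the sending side are not visible in the shipped copy. Within a single JVM no such copy is involved, which is why the local experiment appeared to "update" the broadcast.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SnapshotSketch {
  // Serialize an object to bytes, as would happen before sending it to another JVM.
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  // Rebuild the object from bytes, as the receiving JVM would.
  def deserialize[T](data: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val arr = (0 to 16).toArray
    val shipped = serialize(arr)                   // snapshot taken here
    arr(2) = 42                                    // driver-side mutation afterwards
    val remote = deserialize[Array[Int]](shipped)  // what a remote worker would see
    println(s"driver sees ${arr(2)}, remote copy sees ${remote(2)}")
    // prints "driver sees 42, remote copy sees 2"
  }
}
```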

hhbyyh · 2015-11-24T10:24:15Z

I changed it to creating a new broadcast variable in each iteration.
I ran the benchmark with executor-core = 4 and iterations = 5. The result is quite similar to the one before.
Memory usage decreased from 55% to 29%.
Time cost decreased from 524 to 304.
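As a rough illustration of the "new broadcast variable in each iteration" pattern, here is a minimal sketch. It is a toy update loop with hypothetical names, not the Word2Vec code from this patch, and it assumes a local SparkContext: each record proposes (current weight + 1) and the driver averages the proposals.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RebroadcastSketch {
  // Hypothetical toy loop: the weights change on the driver after every
  // iteration, so each iteration broadcasts the current values again.
  def run(sc: SparkContext, iterations: Int): Array[Float] = {
    val data = sc.parallelize(1 to 8)
    val count = data.count().toFloat
    var weights = Array(0.0f, 0.0f)
    for (_ <- 1 to iterations) {
      // Fresh broadcast of this iteration's values; reusing an older
      // broadcast would hand the executors stale weights.
      val bcWeights = sc.broadcast(weights)
      val summed = data
        .map(_ => bcWeights.value.map(_ + 1.0f))
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      // Aggregate on the driver to produce the next iteration's weights.
      weights = summed.map(_ / count)
    }
    weights
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rebroadcast-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc, 3).mkString(", ")) finally sc.stop()
  }
}
```

If the broadcast were created once outside the loop, executors in other JVMs would keep reading the initial weights on every iteration.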

SparkQA · 2015-11-24T11:19:34Z

Test build #46599 has finished for PR 9878 at commit b1c65a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-11-25T13:57:16Z

LGTM again -- good thing @jkbradley reviewed so I will pause a bit for him to give an OK, if possible.

jkbradley · 2015-11-25T18:14:22Z

Yes, I think it's correct now. My only question is if we should explicitly unpersist the broadcast vars after synAgg is created (by a collect, so it should be safe to unpersist then). I don't really know how quickly they would be cleaned up after going out of scope. (Do you?)

hhbyyh · 2015-11-26T02:27:15Z

@jkbradley I tried adding unpersist(false) at the end, but it seems to have made no difference.
I'll run some experiments to confirm.

hhbyyh · 2015-11-26T08:23:37Z

I still see no difference. Even so, I think it's reasonable to add the unpersist.
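For reference, the unpersist placement discussed above can be sketched as follows. This is a minimal, hypothetical example assuming a local SparkContext, not the code from this patch: the broadcast is released only after an action (here `collect`) has returned, so no running task can still be reading it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnpersistSketch {
  // Hypothetical helper: look up every element of `table` through a broadcast,
  // then release the executor-side copies once the job has finished.
  def sumWithBroadcast(sc: SparkContext, table: Array[Int]): Int = {
    val bcTable = sc.broadcast(table)
    // collect() is an action, so all tasks have finished reading bcTable
    // by the time it returns.
    val picked = sc.parallelize(0 until table.length)
      .map(i => bcTable.value(i))
      .collect()
    // Non-blocking request to drop the cached copies on the executors.
    bcTable.unpersist(blocking = false)
    picked.sum
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-sketch")
    val sc = new SparkContext(conf)
    try println(sumWithBroadcast(sc, Array(1, 2, 3, 4))) finally sc.stop()
  }
}
```

Unlike `destroy()`, `unpersist` leaves the broadcast usable: if it is referenced again, the data is re-sent to the executors.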

SparkQA · 2015-11-26T09:11:33Z

Test build #46759 has finished for PR 9878 at commit cf4b9e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-11-28T09:22:54Z

@jkbradley OK with merging?

srowen · 2015-12-01T09:27:19Z

Merged to master

jkbradley · 2015-12-01T18:10:03Z

Thanks for merging it. Yes, it looks good to me. Thanks @hhbyyh

broadcast global table

cee80c0

jkbradley reviewed Nov 23, 2015
Merge remote-tracking branch 'upstream/master' into w2vBC

c9d3b2b

create broadcast in iterations for W2V

b1c65a9

add broadcast unpersist

cf4b9e7

asfgit closed this in a0af0e3 Dec 1, 2015
