Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Nov 21, 2015

jira: https://issues.apache.org/jira/browse/SPARK-11898
syn0Global and sync1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to worker using basic task serialization.

Use broadcast can greatly improve the performance. My benchmark shows that, for 1M vocabulary and default vectorSize 100, changing to broadcast can help,

  1. decrease the worker memory consumption by 45%.
  2. decrease running time by 40%.

This will also help extend the upper limit for Word2Vec.

@SparkQA
Copy link

SparkQA commented Nov 21, 2015

Test build #46468 has finished for PR 9878 at commit cee80c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Nov 21, 2015

I think this looks good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadcasting is a good idea, but these values are modified (based on aggregated updates) on each iteration. You'll need to make a new broadcast variable on each iteration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may also change the results of your timing tests; would you be able to re-run them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Joseph, actually I found broadcast variable can somehow automatically get updated...
an example

val arr = (0 to 16).toArray
    val bc = sc.broadcast(arr)
    val rdd = sc.parallelize(1 to 8)
    for(w <- 1 to 10){
      val result = rdd.map(i => bc.value(2)).collect().mkString(", ")
      println(result)
      arr(2) = new Random().nextInt()
    }

The code will print different numbers in the 10 iterations.
I'm not sure if it's by design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hhbyyh Joseph is correct. What you see only happens to work since you are running locally in one JVM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got where it went wrong.
I tested on the cluster with a different edition

val value = bc.value
      val result = rdd.map(i => value(2)).collect().mkString(", ")

Anyway, you are correct. Thanks

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Nov 24, 2015

I changed it to creating a new broadcast variable in each iteration.
I ran the benchmark with executor-core = 4 and iterations = 5. The result is quite similar with the one before.
Memory usage decrease from 55% to 29%
time cost decrease from 524 to 304.

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46599 has finished for PR 9878 at commit b1c65a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Nov 25, 2015

LGTM again -- good thing @jkbradley reviewed so I will pause a bit for him to give an OK, if possible.

@jkbradley
Copy link
Member

Yes, I think it's correct now. My only question is if we should explicitly unpersist the broadcast vars after synAgg is created (by a collect, so it should be safe to unpersist then). I don't really know how quickly they would be cleaned up after going out of scope. (Do you?)

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Nov 26, 2015

@jkbradley I tried to add unpersist(false) at the end but it seems made no difference.
I'll run some experiment to confirm.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Nov 26, 2015

Still I get no difference. Yet I think it's still reasonable to add the unpersist.

@SparkQA
Copy link

SparkQA commented Nov 26, 2015

Test build #46759 has finished for PR 9878 at commit cf4b9e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Nov 28, 2015

@jkbradley OK with merging?

@srowen
Copy link
Member

srowen commented Dec 1, 2015

Merged to master

@asfgit asfgit closed this in a0af0e3 Dec 1, 2015
@jkbradley
Copy link
Member

Thanks for merging it. Yes, it looks good to me. Thanks @hhbyyh

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants