[SPARK-33609][ML] word2vec reduce broadcast size #30548
```diff
@@ -502,22 +502,15 @@ class Word2VecModel private[spark] (
   private val vectorSize = wordVectors.length / numWords

   // wordList: Ordered list of words obtained from wordIndex.
-  private val wordList: Array[String] = {
-    val (wl, _) = wordIndex.toSeq.sortBy(_._2).unzip
-    wl.toArray
+  private lazy val wordList: Array[String] = {
+    wordIndex.toSeq.sortBy(_._2).iterator.map(_._1).toArray
   }

   // wordVecNorms: Array of length numWords, each value being the Euclidean norm
   // of the wordVector.
-  private val wordVecNorms: Array[Float] = {
-    val wordVecNorms = new Array[Float](numWords)
-    var i = 0
-    while (i < numWords) {
-      val vec = wordVectors.slice(i * vectorSize, i * vectorSize + vectorSize)
-      wordVecNorms(i) = blas.snrm2(vectorSize, vec, 1)
-      i += 1
-    }
-    wordVecNorms
+  private lazy val wordVecNorms: Array[Float] = {
+    val size = vectorSize
+    Array.tabulate(numWords)(i => blas.snrm2(size, wordVectors, i * size, 1))
   }

   @Since("1.5.0")
```

Review thread on this hunk:
- Author (on the old slice-based loop): avoid this slicing
- Reviewer: How much does this save, if it only happens once and has to happen to use the model?
- Author: this var
- Reviewer: OK, fair enough. Are there use cases here that would never need this calculated?
- Author: I am wrong. this var
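To make the allocation difference concrete, here is a small standalone sketch (not Spark code; hypothetical sample data, and a plain loop standing in for the netlib `blas.snrm2` call) comparing the slice-per-word approach with reading directly at an offset into the flat array:

```scala
object NormSketch {
  // Flat layout: numWords vectors of length vectorSize stored contiguously,
  // mirroring how Word2VecModel stores wordVectors.
  val vectorSize = 3
  val numWords = 2
  val wordVectors: Array[Float] = Array(3f, 4f, 0f, 1f, 2f, 2f)

  // Old approach: slice copies vectorSize floats per word before computing the norm.
  def normsWithSlice: Array[Float] =
    Array.tabulate(numWords) { i =>
      val vec = wordVectors.slice(i * vectorSize, i * vectorSize + vectorSize)
      math.sqrt(vec.map(v => v.toDouble * v).sum).toFloat
    }

  // New approach: read at an offset, no temporary array per word.
  def normsWithOffset: Array[Float] =
    Array.tabulate(numWords) { i =>
      var sum = 0.0
      var j = 0
      while (j < vectorSize) {
        val v = wordVectors(i * vectorSize + j)
        sum += v.toDouble * v
        j += 1
      }
      math.sqrt(sum).toFloat
    }

  def main(args: Array[String]): Unit = {
    // Both approaches produce identical norms: 5.0 for (3,4,0), 3.0 for (1,2,2).
    assert(normsWithSlice.sameElements(normsWithOffset))
    println(normsWithOffset.mkString(", "))
  }
}
```

The offset-taking `snrm2(n, x, offset, incx)` overload in the actual patch achieves the same thing without any hand-written loop.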
```diff
@@ -538,9 +531,13 @@ class Word2VecModel private[spark] (
   @Since("1.1.0")
   def transform(word: String): Vector = {
     wordIndex.get(word) match {
-      case Some(ind) =>
-        val vec = wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize)
-        Vectors.dense(vec.map(_.toDouble))
+      case Some(index) =>
+        val size = vectorSize
+        val offset = index * size
+        val array = Array.ofDim[Double](size)
+        var i = 0
+        while (i < size) { array(i) = wordVectors(offset + i); i += 1 }
+        Vectors.dense(array)
       case None =>
         throw new IllegalStateException(s"$word not in vocabulary")
     }
```

Review thread on this hunk:
- Author (on the old slice): avoid this slicing
- Reviewer: Is this actually more efficient than slice? Likewise above.
- Author: I guess so, I will do a simple test.
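The efficiency question comes down to allocations: `slice` copies into a temporary `Array[Float]`, and `map(_.toDouble)` then allocates a second array, while the while-loop fills a single `Array[Double]` in one pass. A standalone sketch of the two paths (plain Scala with hypothetical sample data, not using Spark's `Vectors`):

```scala
object TransformSketch {
  val vectorSize = 3
  val wordVectors: Array[Float] = Array(3f, 4f, 0f, 1f, 2f, 2f)

  // Old path: two allocations (the Float slice, then the mapped Double array).
  def vecSlice(ind: Int): Array[Double] =
    wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize).map(_.toDouble)

  // New path: one allocation, one pass over the data.
  def vecOffset(index: Int): Array[Double] = {
    val offset = index * vectorSize
    val array = Array.ofDim[Double](vectorSize)
    var i = 0
    while (i < vectorSize) {
      array(i) = wordVectors(offset + i)
      i += 1
    }
    array
  }

  def main(args: Array[String]): Unit = {
    // Both paths yield the same vector for word index 1: (1.0, 2.0, 2.0).
    assert(vecSlice(1).sameElements(vecOffset(1)))
    println(vecOffset(1).mkString(", "))
  }
}
```

For a single lookup the difference is small; it matters more when `transform` is called once per word over a large corpus.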
Reviewer:
At first glance this makes more sense. But we can't call bcModel.destroy() at the end here anyway, so we have a broadcast we can't explicitly close no matter what. And now, I guess, this would re-broadcast every time? That could be bad. What do you think? I know this is not consistent in the code either way.

Author:
I agree. I prefer not using broadcasting in transform, but this may need more discussion; we can keep the current behavior for now. GBT models are also broadcast in this way for performance, since SPARK-7127.

Reviewer:
Looks good, but I'd back out this part of the change.