[SPARK-28421][ML] SparseVector.apply performance optimization #25178
zhengruifeng wants to merge 2 commits into apache:master
Conversation
Test build #107784 has finished for PR 25178 at commit
```
throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
}

if (indices.isEmpty || i < indices(0) || i > indices(indices.length - 1)) {
```
Isn't this case already covered by the binarySearch call below? If it's not found for any reason, you get a negative result.
1. The impl of Arrays.binarySearch does not check the range:
```
public static int binarySearch(int[] a, int key) {
    return binarySearch0(a, 0, a.length, key);
}

// Like public version, but without range checks.
private static int binarySearch0(int[] a, int fromIndex, int toIndex,
                                 int key) {
    int low = fromIndex;
    int high = toIndex - 1;
    while (low <= high) {
        int mid = (low + high) >>> 1;
        int midVal = a[mid];
        if (midVal < key)
            low = mid + 1;
        else if (midVal > key)
            high = mid - 1;
        else
            return mid; // key found
    }
    return -(low + 1);  // key not found.
}
```
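For concreteness, here is a small illustrative Scala snippet (not from the PR; the array contents are made up) showing what binarySearch returns for keys inside and outside a sorted indices array:

```
import java.util.Arrays

val indices = Array(2, 5, 9)
Arrays.binarySearch(indices, 5)   //  1: key found at offset 1
Arrays.binarySearch(indices, 1)   // -1: not found, insertion point 0, i.e. -(0 + 1)
Arrays.binarySearch(indices, 6)   // -3: not found, insertion point 2, i.e. -(2 + 1)
Arrays.binarySearch(indices, 10)  // -4: not found, insertion point 3 (past the end)
```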
2. In breeze.collection.mutable.SparseArray, the findOffset function (called in apply to perform the binary search) takes the special case that the key is out of range into account:
```
if (used == 0) {
  // empty list do nothing
  -1
} else {
  val index = this.index
  if (i > index(used - 1)) {
    // special case for end of list - this is a big win for growing sparse arrays
    ~used
  // ...
```
so I added those simple checks before the binary search.
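For reference, a minimal Scala sketch of what that variant looks like, mirroring the diff excerpt above; this is an illustration with a hypothetical class name (SparseLookup), not the exact patch. The benchmark below calls this variant apply3:

```
import java.util.Arrays

// Illustrative sketch (hypothetical class, not the exact patch) of the lookup
// with the extra range check -- the variant benchmarked as apply3 below.
class SparseLookup(size: Int, indices: Array[Int], values: Array[Double]) {
  def apply(i: Int): Double = {
    if (i < 0 || i >= size) {
      throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
    }
    if (indices.isEmpty || i < indices(0) || i > indices(indices.length - 1)) {
      0.0  // outside the stored index range: the value is implicitly zero
    } else {
      val j = Arrays.binarySearch(indices, i)
      if (j < 0) 0.0 else values(j)  // negative result: index not stored
    }
  }
}
```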
Yes, you need the check that the index is < 0 or >= length; keep that.
But binarySearch already handles the case that the query index is >= 0 but before the first actual index:
```
scala> java.util.Arrays.binarySearch(Array(2,3), 1)
res0: Int = -1

scala> java.util.Arrays.binarySearch(Array(2,3), 4)
res1: Int = -3
```
Why repeat that part?
I see. This performance improvement comes from avoiding the binary search when a key is not included in the given sorted array. It makes sense to me.
One question: what do you mean by avoiding internal conversion in the description?
Yeah, but you're also always paying the cost of these two checks. It depends on the access pattern, but assuming a pretty uniform distribution, the checks will rarely save work and always add a few comparisons. It seems simpler to avoid them unless there's a clear case where it's a win.
@srowen I added the checks just because the impl of findOffset in breeze.collection.mutable.SparseArray says // special case for end of list - this is a big win for growing sparse arrays, and I think it is reasonable.
@zhengruifeng Would it be possible to show the performance comparison for the case that @srowen mentions? In other words, when most of the keys exist in indices. I hope the overhead of adding the three extra checks would be negligible.
Where does a conversion happen? This is just avoiding binarySearch, no?
The existing SparseVector does not override the apply method inherited from Vector:
```
/**
 * Gets the value of the ith element.
 * @param i index
 */
@Since("2.0.0")
def apply(i: Int): Double = asBreeze(i)
```
So a spark.ml.linalg.SparseVector is first converted to a breeze.collection.mutable.SparseArray and then wrapped in a breeze.linalg.SparseVector on every apply call.
As to the range check, I think it is just a tiny optimization.
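To make the core idea concrete, here is a minimal Scala sketch of the direct lookup this change is about, written as a hypothetical standalone helper (sparseApply) rather than the exact merged method; the benchmark below refers to this idea as apply2:

```
import java.util.Arrays

// Hypothetical standalone helper showing the direct lookup idea benchmarked
// as apply2 below: binary-search the indices array, no breeze conversion.
def sparseApply(size: Int, indices: Array[Int], values: Array[Double], i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }
  val j = Arrays.binarySearch(indices, i)
  if (j < 0) 0.0 else values(j)  // absent index means an implicit zero
}
```

Compared with the inherited apply, this avoids allocating breeze structures on every element access, which is where the 2.5X ~ 5X win reported in the description comes from.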
Oh right, I see. That's a big win.
Well, I'm OK with it, though it's still not clear the extra range checks are an optimization.
Retest this please.
Test build #108024 has finished for PR 25178 at commit
I added the extra range check just because of the special-case comment in breeze's findOffset. @srowen @dongjoon-hyun @kiszk I will do another simple test to find out whether the extra range check helps.
I tested the performance of the current impl, the new impl, and the new impl with the extra range check (the test suite is similar to the one above). The version without the range check is faster, so I will update the PR.
The expected cost without range check is … the previous test suite uses …
Test build #108034 has finished for PR 25178 at commit
LGTM
srowen left a comment
That's convincing, thanks for checking!
Merged to master

@srowen How about backporting it to 2.X?

I'm OK with it. It should be a pretty safe optimization.

Hm, I can't seem to back-port a merged PR with the merge script right now. I've seen this before. @dongjoon-hyun are you seeing problems like "not mergeable in its current form" if you try the merge script on this one again to backport it?

Yes, @srowen. I thought it was the current behavior of our script.
## What changes were proposed in this pull request?
Optimize `SparseVector.apply` by avoiding internal conversion.
Since the speed-up is significant (2.5X ~ 5X) and this method is widely used in ML, I suggest backporting.
| size | nnz | apply (old, ms) | apply2 (new impl, ms) | apply3 (new impl with extra range check, ms) |
|------|-----|-----------------|-----------------------|----------------------------------------------|
| 10000000 | 100 | 75294 | 12208 | 18682 |
| 10000000 | 10000 | 75616 | 23132 | 32932 |
| 10000000 | 1000000 | 92949 | 42529 | 48821 |
## How was this patch tested?
existing tests
using the following code to test performance (here the new impl is named `apply2`, and another impl with an extra range check is named `apply3`):
```
import scala.util.Random
import org.apache.spark.ml.linalg._
val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
val rng = new Random(123)
val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted
val values = Array.fill(nnz)(rng.nextDouble)
val vec = Vectors.sparse(size, indices, values).toSparse
val tic1 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec(i); i+=1} };
val toc1 = System.currentTimeMillis;
val tic2 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply2(i); i+=1} };
val toc2 = System.currentTimeMillis;
val tic3 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply3(i); i+=1} };
val toc3 = System.currentTimeMillis;
println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```
Closes #25178 from zhengruifeng/sparse_vec_apply.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
OK, also manually merged to 2.4 in a285c0d

Thank you, @srowen!