[SPARK-28421][ML] SparseVector.apply performance optimization #25178

Closed
zhengruifeng wants to merge 2 commits into apache:master from zhengruifeng:sparse_vec_apply

Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Jul 17, 2019

What changes were proposed in this pull request?

Optimize `SparseVector.apply` by avoiding the internal conversion to Breeze (a sketch of the change follows the table below).
Since the speedup is significant (2.5x ~ 5x) and this method is widely used in ML, I suggest backporting it.

(times in ms; 100 full sweeps of the vector per measurement)

| size | nnz | apply (old) | apply2 (new impl) | apply3 (new impl with extra range check) |
|------|-----|-------------|-------------------|------------------------------------------|
| 10000000 | 100 | 75294 | 12208 | 18682 |
| 10000000 | 10000 | 75616 | 23132 | 32932 |
| 10000000 | 1000000 | 92949 | 42529 | 48821 |
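
The shape of the change, as a minimal sketch (assuming it mirrors the `apply2` variant benchmarked below; the exact override is in the diff):

```
// Sketch: binary-search the local indices/values arrays directly instead of
// delegating to the Breeze representation via asBreeze(i).
override def apply(i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }
  val j = java.util.Arrays.binarySearch(indices, i)
  // a negative result means the key is absent, i.e. the value is the implicit 0.0
  if (j < 0) 0.0 else values(j)
}
```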

How was this patch tested?

existing tests

Using the following code to test performance (here the new impl is named `apply2`, and another impl with an extra range check is named `apply3`):

```
import scala.util.Random
import org.apache.spark.ml.linalg._

val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
  val rng = new Random(123)
  // nnz distinct, sorted random indices in [0, size)
  val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted
  val values = Array.fill(nnz)(rng.nextDouble)
  val vec = Vectors.sparse(size, indices, values).toSparse

  // 100 full sweeps of the vector per implementation; times are wall-clock ms
  val tic1 = System.currentTimeMillis
  (0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec(i); i+=1} }
  val toc1 = System.currentTimeMillis

  val tic2 = System.currentTimeMillis
  (0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply2(i); i+=1} }
  val toc2 = System.currentTimeMillis

  val tic3 = System.currentTimeMillis
  (0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply3(i); i+=1} }
  val toc3 = System.currentTimeMillis

  println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```

@SparkQA

SparkQA commented Jul 17, 2019

Test build #107784 has finished for PR 25178 at commit 1484602.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review thread on this hunk of the new implementation:

```
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }

  if (indices.isEmpty || i < indices(0) || i > indices(indices.length - 1)) {
```
Member

Isn't this case already covered by the binarySearch call below? If it's not found for any reason you get a negative result.

Contributor Author

1. The impl of `Arrays.binarySearch` does not check the range:

```
public static int binarySearch(int[] a, int key) {
    return binarySearch0(a, 0, a.length, key);
}

// Like public version, but without range checks.
private static int binarySearch0(int[] a, int fromIndex, int toIndex,
                                 int key) {
    int low = fromIndex;
    int high = toIndex - 1;

    while (low <= high) {
        int mid = (low + high) >>> 1;
        int midVal = a[mid];

        if (midVal < key)
            low = mid + 1;
        else if (midVal > key)
            high = mid - 1;
        else
            return mid; // key found
    }
    return -(low + 1);  // key not found.
}
```

2. In `breeze.collection.mutable.SparseArray`, the `findOffset` function (called in `apply` to perform the binary search) takes into account the special case that the key is out of range:

```
if (used == 0) {
  // empty list do nothing
  -1
} else {
  val index = this.index
  if (i > index(used - 1)) {
    // special case for end of list - this is a big win for growing sparse arrays
    ~used
  // ... (rest of findOffset elided)
```

So I added those simple checks before the binary search.

Member

Yes, you need the check that the index is < 0 or >= length; keep that.
But binarySearch already handles the case where the query index is >= 0 but before the first actual index:

```
scala> java.util.Arrays.binarySearch(Array(2,3), 1)
res0: Int = -1

scala> java.util.Arrays.binarySearch(Array(2,3), 4)
res1: Int = -3
```

Why repeat that part?
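
For reference, a minimal sketch of how the negative return value encodes the insertion point (illustrative values only, not part of the patch):

```
val indices = Array(2, 3)
val values  = Array(10.0, 20.0)
// binarySearch returns -(insertionPoint) - 1 on a miss, so any absent key --
// below, between, or above the stored indices -- yields a negative value.
val j = java.util.Arrays.binarySearch(indices, 1)  // -1: would be inserted at position 0
val insertionPoint = -j - 1                        // 0
val result = if (j < 0) 0.0 else values(j)         // every miss maps to the implicit 0.0
```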

Member

I see. This performance improvement comes from avoiding the binary-tree walk when a key is not included in the given sorted array. It makes sense to me.

One question: what do you mean by avoiding internal conversion in the description?

Member

Yeah, but you're also always paying the cost of these two checks. It depends on the access pattern, but assuming a pretty uniform distribution, the check will rarely save a search and always adds a few comparisons. It seems simpler to avoid it unless there's a clear case where it's a win.

Contributor Author

@srowen I added the checks just because the impl of findOffset in breeze.collection.mutable.SparseArray
says // special case for end of list - this is a big win for growing sparse arrays, and I thought that was reasonable.

Member

@zhengruifeng Would it be possible to show the performance comparison for the case that @srowen mentions? In other words, where most of the keys exist in indices. I hope the overhead of the three extra tests would be negligible.

Member

Where does a conversion happen? This is just avoiding binarySearch, no?

Contributor Author

The existing SparseVector does not override the apply method inherited from Vector:

```
/**
 * Gets the value of the ith element.
 * @param i index
 */
@Since("2.0.0")
def apply(i: Int): Double = asBreeze(i)
```

So a spark.ml.linalg.SparseVector will first be converted to a breeze.collection.mutable.SparseArray and then to a breeze.linalg.SparseVector.

As to the range check, I think it is just a tiny optimization.

Member

Oh right, I see. That's a big win.
Well, I'm OK with it, though it's still not clear the extra range checks are an optimization.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jul 22, 2019

Test build #108024 has finished for PR 25178 at commit 1484602.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

I added the extra range check just because in breeze.collection.mutable.SparseArray, the findOffset function does this, with the comment // special case for end of list - this is a big win for growing sparse arrays.

@srowen @dongjoon-hyun @kiszk I will do another simple test to see whether the extra range check helps.

@zhengruifeng
Contributor Author

zhengruifeng commented Jul 23, 2019

I tested the performance of the current impl (apply), direct binary search (apply2), and binary search with an extra range check (apply3):

```
def apply2(i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }

  val j = util.Arrays.binarySearch(indices, i)
  if (j < 0) 0.0 else values(j)
}

def apply3(i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }

  // range check: skip the binary search when the key cannot be present
  if (indices.isEmpty || i < indices(0) || i > indices(indices.length - 1)) {
    0.0
  } else {
    val j = util.Arrays.binarySearch(indices, i)
    if (j < 0) 0.0 else values(j)
  }
}
```
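
A quick usage sketch (assuming `apply2`/`apply3` above are patched into `SparseVector`; the vector is made up for illustration):

```
val vec = Vectors.sparse(10, Array(2, 5), Array(1.0, 2.0)).toSparse
vec.apply2(5)  // 2.0: hit found by the binary search
vec.apply2(3)  // 0.0: miss inside the index range (binarySearch returns a negative value)
vec.apply3(9)  // 0.0: miss short-circuited by the range check, no binary search needed
```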

The test suite is similar to the one above:

```
import scala.util.Random
import org.apache.spark.ml.linalg._

val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
  val rng = new Random(123)
  val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted
  val values = Array.fill(nnz)(rng.nextDouble)
  val vec = Vectors.sparse(size, indices, values).toSparse

  val tic1 = System.currentTimeMillis
  (0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec(i); i+=1} }
  val toc1 = System.currentTimeMillis

  val tic2 = System.currentTimeMillis
  (0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply2(i); i+=1} }
  val toc2 = System.currentTimeMillis

  val tic3 = System.currentTimeMillis
  (0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply3(i); i+=1} }
  val toc3 = System.currentTimeMillis

  println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```
| size | nnz | apply (old) | apply2 | apply3 |
|------|-----|-------------|--------|--------|
| 10000000 | 100 | 75294 | 12208 | 18682 |
| 10000000 | 10000 | 75616 | 23132 | 32932 |
| 10000000 | 1000000 | 92949 | 42529 | 48821 |

So the version without the range check is faster; I will update the PR.

@zhengruifeng
Contributor Author

zhengruifeng commented Jul 23, 2019

The expected cost without the range check is E(cost(apply2)) = log(NNZ),
while the cost with the range check is E(cost(apply3)) = 2 + P(key in range) * log(NNZ).
The difference is E(cost(apply3) - cost(apply2)) = 2 - P(key out of range) * log(NNZ), so the value of the optimization depends strongly on the key distribution and on NNZ.
The above suite assumes the input keys follow a uniform distribution, and shows that if NNZ is small the range check adds about 10% extra cost; otherwise the range check saves about 50% of the cost.
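
A back-of-the-envelope check of that model (assumed values, counting comparisons only):

```
// cost(apply2) ~ log2(nnz); cost(apply3) ~ 2 + P(key in range) * log2(nnz)
val nnz  = 100
val lg   = math.log(nnz) / math.log(2)  // ~6.6 comparisons per binary search
val pOut = 0.0                          // assumed: uniform keys, indices spanning the full range
val diff = 2 - pOut * lg                // +2 comparisons: the range check only adds cost
// With the biased generator discussed below, pOut is roughly 0.5:
// diff = 2 - 0.5 * 6.6 ~ -1.3, so there the range check wins.
```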

The previous test suite used val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.sorted.take(nnz) to generate the indices, which is biased (it keeps only the smallest indices, so many uniform keys fall out of range).
I just changed it to val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted.
Now the version without the range check is faster, since P(key out of range) should in most cases be near 0%.

@SparkQA

SparkQA commented Jul 23, 2019

Test build #108034 has finished for PR 25178 at commit 99dfe7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

LGTM

Member

@srowen srowen left a comment

That's convincing, thanks for checking!

@srowen
Member

srowen commented Jul 24, 2019

Merged to master

@srowen srowen closed this in a3bbc37 Jul 24, 2019
@zhengruifeng
Contributor Author

@srowen How about backporting it to 2.X?

@srowen
Member

srowen commented Jul 24, 2019

I'm OK with it. It should be a pretty safe optimization.

@srowen
Member

srowen commented Jul 24, 2019

Hm, I can't seem to back-port a merged PR with the merge script right now. I've seen this before. @dongjoon-hyun are you seeing problems like "not mergeable in its current form" if you try the merge script on this one again to backport it?

@dongjoon-hyun
Member

Yes, @srowen. I thought that was the current behavior of our script.
For later backporting, I've always done manual cherry-picking.

srowen pushed a commit that referenced this pull request Jul 25, 2019
@srowen
Member

srowen commented Jul 25, 2019

OK, also manually merged to 2.4 in a285c0d

@dongjoon-hyun
Member

Thank you, @srowen !

@zhengruifeng zhengruifeng deleted the sparse_vec_apply branch July 26, 2019 01:32
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019