
[SPARK-19368][MLlib] BlockMatrix.toIndexedRowMatrix() optimization for sparse matrices #16732

Closed · wants to merge 7 commits

Conversation

@uzadude (Contributor) commented Jan 29, 2017

What changes were proposed in this pull request?

Optimization [SPARK-12869] was made for dense matrices but caused a severe performance regression for sparse matrices, because manipulating them element by element is very inefficient. When manipulating sparse matrices in Breeze, it is better to use a VectorBuilder.
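For context, a minimal sketch of the pattern in Breeze (sizes and values here are illustrative, not from the PR): a VectorBuilder accumulates (index, value) pairs cheaply and materializes the sparse vector once at the end, rather than updating a SparseVector in place, which can shift its internal arrays on every write.

    import breeze.linalg.{SparseVector => BSV, VectorBuilder}

    // Illustrative only: accumulate active entries, then build once.
    val cols = 1000000
    val builder = VectorBuilder.zeros[Double](cols)
    builder.add(42, 1.0)     // cheap append per active entry
    builder.add(99999, 2.0)
    val sv: BSV[Double] = builder.toSparseVector // single materialization at the end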

How was this patch tested?

Checked against one of our use cases that, after moving to Spark 2, took 6.5 hours instead of 20 minutes. After this change it is back to 20 minutes.

val wholeVectorBuf = VectorBuilder.zeros[Double](cols)
vectors.foreach { case (blockColIdx: Int, vec: BV[Double]) =>
  val offset = colsPerBlock * blockColIdx
  vec.activeIterator.foreach { case (colIdx: Int, value: Double) => wholeVectorBuf.add(offset + colIdx, value) }
}
Member (review comment):
This line includes more than 100 characters.

@@ -280,17 +278,24 @@ class BlockMatrix @Since("1.3.0") (
}.groupByKey().map { case (rowIdx, vectors) =>
Member (review comment):

At line 276, do we need to call asBreeze on the vector? I think (blockIndex, vector) is fine and avoids the object creation inside asBreeze.
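For reference, a sketch of the allocation being discussed (an assumption about how SparseVector.asBreeze behaves, not verbatim Spark code): the conversion wraps the existing arrays in a new Breeze object on every call.

    import breeze.linalg.{SparseVector => BSV, Vector => BV}
    import org.apache.spark.mllib.linalg.SparseVector

    // Sketch: a fresh wrapper object is allocated per vector, even though the
    // underlying index/value arrays are shared rather than copied.
    def asBreezeSketch(v: SparseVector): BV[Double] =
      new BSV[Double](v.indices, v.values, v.size)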

@kiszk (Member) commented Jan 29, 2017

Could you please add performance results with and without this PR?

@kiszk (Member) commented Jan 29, 2017

Jenkins, retest this please

@uzadude (Contributor, author) commented Jan 29, 2017

Hi,
I removed the .asBreeze call for the sparse case.
Regarding the performance question, for this sample code:

    val n = 20000
    val rndEntryList = GraphTestUtils.getRandomMatrixEntries(n, n, density = 0.001, seed = 123)
      .map { case (i, j, d) => (i, (j, d)) }
    val entries: RDD[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 4)
    val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, Vectors.sparse(n, e._2.toSeq)))
    val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n).toBlockMatrix(1000, 1000).cache()

    mat.blocks.count()

    // 1) conversion via CoordinateMatrix
    val t1 = System.nanoTime()
    println(mat.toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
    val t2 = System.nanoTime()
    println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
    println("============================================================")
    // 2) this PR's optimized direct conversion
    println(mat.toIndexedRowMatrixNew().rows.map(_.vector.numActives).sum())
    val t3 = System.nanoTime()
    println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
    println("============================================================")
    // 3) the current (pre-PR) direct conversion
    //println(BlockMatrixFixedOps.toIndexedRowMatrix(mat).rows.map(_.vector.numActives).sum())
    println(mat.toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
    val t4 = System.nanoTime()
    println("took: " + (t4 - t3) / 1000 / 1000 + " ms")
    println("============================================================")

I get:

took: 1242 ms   (toCoordinateMatrix().toIndexedRowMatrix())
took: 2772 ms   (toIndexedRowMatrixNew(), this PR)
took: 19647 ms  (current toIndexedRowMatrix())
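(GraphTestUtils.getRandomMatrixEntries above is not part of Spark; a minimal stand-in for reproducing the benchmark, under that assumption, could look like this:)

    import scala.collection.mutable.LinkedHashSet
    import scala.util.Random

    // Hypothetical stand-in for the author's private GraphTestUtils helper:
    // draws about rows * cols * density entries at distinct (i, j) positions,
    // so the later Vectors.sparse call sees unique column indices per row.
    def getRandomMatrixEntries(rows: Int, cols: Int, density: Double, seed: Long)
        : Seq[(Int, Int, Double)] = {
      val rnd = new Random(seed)
      val numEntries = (rows.toLong * cols * density).toInt
      val positions = LinkedHashSet.empty[(Int, Int)]
      while (positions.size < numEntries) {
        positions += ((rnd.nextInt(rows), rnd.nextInt(cols)))
      }
      positions.toSeq.map { case (i, j) => (i, j, rnd.nextDouble()) }
    }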

@kiszk (Member) commented Jan 30, 2017

Thank you.

If I understand correctly, matSparse in "toIndexedRowMatrix" does not exercise the new path.
If so, it would be good to add a sparser matrix to validate the new path.

@uzadude (Contributor, author) commented Jan 30, 2017

Not sure I understand: matSparse has the 1/10th nnz needed for line 282:

if (numberNonZeroPerRow <= 0.1) { // Sparse at 1/10th nnz
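(A sketch of the assumed shape of that dispatch; the names and the exact ratio computation are assumptions about the surrounding code in BlockMatrix.scala, not verbatim PR code:)

    // Assumed shape: the ratio of active entries to total columns in the
    // assembled row decides which construction path runs.
    val numberNonZero = vectors.map(_._2.numActives).sum
    val numberNonZeroPerRow = numberNonZero.toDouble / cols
    if (numberNonZeroPerRow <= 0.1) { // Sparse at 1/10th nnz
      // sparse path: build the row with a VectorBuilder
    } else {
      // dense path: keep the SPARK-12869 dense construction
    }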

@kiszk (Member) commented Jan 30, 2017

Sorry, as you pointed out, matSparse has 0.1 for numberNonZeroPerRow.

@akaltsikis commented

Has the issue been resolved?

@kiszk (Member) commented Apr 19, 2017

Jenkins, test this please

@akaltsikis commented

Hey guys, I implemented this as an external function in a jar to work with Spark 2.1.1. Even though @uzadude improved the performance of the 2.x implementation by creating two separate cases for sparse and dense matrices, BlockMatrix.toCoordinateMatrix().toIndexedRowMatrix() still runs faster according to my recent benchmarks. I would suggest we use that path when numberNonZeroPerRow < 0.1, until the sparse-matrix function outperforms the double conversion; see the sketch below.
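A sketch of the suggested dispatch (a hypothetical wrapper, not code from this PR):

    import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRowMatrix}

    // Hypothetical: route very sparse matrices through the CoordinateMatrix
    // conversion, everything else through the direct path.
    def toIndexedRows(mat: BlockMatrix, nnzPerRowFraction: Double): IndexedRowMatrix =
      if (nnzPerRowFraction < 0.1) mat.toCoordinateMatrix().toIndexedRowMatrix()
      else mat.toIndexedRowMatrix()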

@srowen (Member) commented Nov 14, 2018

Looks good @uzadude; just saw this very old PR. However, what about @akaltsikis's comment?

@akaltsikis commented

> Looks good @uzadude; just saw this very old PR. However, what about @akaltsikis's comment?

@srowen To be honest, after a year and a half I really can't recall many details.
My guess: given the better performance of BlockMatrix.toCoordinateMatrix().toIndexedRowMatrix() for matrices sparser than numberNonZeroPerRow < 0.1, we should use that instead of the current function, as it should give the same result in half the time.

@uzadude (Contributor, author) commented Nov 15, 2018

After running some more experiments, I was able to reduce the runtime by another factor of 1.5x. Now "toCoordinateMatrix().toIndexedRowMatrix()" is only slightly better, and only in the extreme case where the block matrix size is somewhat misconfigured (as above: 1000x1000 blocks with density 1/1000), meaning many rows contain only a single value; there the gain comes purely from shuffling primitives instead of Vectors. So I generally think this approach is better.
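A quick back-of-the-envelope for the benchmark configuration above shows why that case is degenerate:

    // Worked numbers from the benchmark in this thread
    // (n = 20000, density = 0.001, 1000x1000 blocks):
    val n = 20000; val density = 0.001; val blockSize = 1000
    val nnzPerMatrixRow = n * density                  // ~20 non-zeros per full row
    val colBlocks = n / blockSize                      // 20 column blocks
    val nnzPerBlockRow = nnzPerMatrixRow / colBlocks   // ~1 value per row per block

With roughly one active value per 1000-column block row, almost all per-row work is overhead, which is exactly where shuffling bare primitives wins.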

@akaltsikis commented

> After running some more experiments, I was able to reduce the runtime by another factor of 1.5x. [...]

Agreed. Good job as always, my friend @uzadude.

@SparkQA commented Nov 15, 2018

Test build #4426 has finished for PR 16732 at commit 5531806.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

val arrBufferIndices = new ArrayBuffer[Int](numberNonZero)
val arrBufferValues = new ArrayBuffer[Double](numberNonZero)

vectors.foreach { case (blockColIdx: Int, vec: SparseVector) =>
Member (review comment):
I think this has to be Vector and not SparseVector
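A sketch of that loop with the suggested type (names taken from the diff above; foreachActive is the Spark Vector method that covers both dense and sparse implementations):

    import org.apache.spark.mllib.linalg.Vector

    // Match on Vector, not SparseVector: dense blocks can reach this code too.
    vectors.foreach { case (blockColIdx: Int, vec: Vector) =>
      val offset = colsPerBlock * blockColIdx
      vec.foreachActive { (colIdx: Int, value: Double) =>
        arrBufferIndices += offset + colIdx
        arrBufferValues += value
      }
    }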

-        blockRowIdx * rowsPerBlock + rowIdx -> ((blockColIdx, vector.asBreeze))
-      }
+        blockRowIdx * rowsPerBlock + rowIdx -> ((blockColIdx, vector))
+      }.filter(_._2._2.size > 0)
Member (review comment):
I suppose you could filter just before the map too, but it won't matter much.
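(A sketch of that alternative ordering, with names assumed from the diff rather than taken from the final patch:)

    // Filter out empty row vectors before keying them, instead of after the map.
    mat.rowIter.zipWithIndex
      .filter { case (vector, _) => vector.size > 0 }
      .map { case (vector, rowIdx) =>
        blockRowIdx * rowsPerBlock + rowIdx -> ((blockColIdx, vector))
      }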

Member (review comment):
You can drop the second filter now, right?

@uzadude (Contributor, author) commented Nov 15, 2018

@srowen - you're right. I've fixed it.

@srowen (Member) left a review:

Otherwise fine.


@SparkQA commented Nov 22, 2018

Test build #4438 has finished for PR 16732 at commit 540b261.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 22, 2018

Test build #4439 has finished for PR 16732 at commit 540b261.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 22, 2018

Merged to master

@asfgit closed this in d81d95a on Nov 22, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request on Feb 18, 2019

[SPARK-19368][MLlib] BlockMatrix.toIndexedRowMatrix() optimization for sparse matrices

## What changes were proposed in this pull request?

Optimization [SPARK-12869] was made for dense matrices but caused a severe performance regression for sparse matrices, because manipulating them element by element is very inefficient. When manipulating sparse matrices in Breeze, it is better to use a VectorBuilder.

## How was this patch tested?

Checked against one of our use cases that, after moving to Spark 2, took 6.5 hours instead of 20 minutes. After this change it is back to 20 minutes.

Closes apache#16732 from uzadude/SparseVector_optimization.

Authored-by: oraviv <oraviv@paypal.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>