
[SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem #22893

Closed · wants to merge 1 commit

Conversation

@KyleLi1985 (Contributor) commented Oct 30, 2018

What changes were proposed in this pull request?

Fix a performance problem in fastSquaredDistance for the dense-dense case, and at the same time improve the calculation accuracy.

How was this patch tested?

Tested from several angles: after applying this patch, the dense-dense case performance is improved, and the other cases (sparse-sparse, sparse-dense) are not noticeably affected.

For the calculation logic test
Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases.

Test results:
First we define the branch-path logic for the sparse-sparse and sparse-dense cases:
if precisionBound1 is met, we call it LOGIC1
if precisionBound1 is not met and precisionBound2 is not met, we call it LOGIC2
if precisionBound1 is not met but precisionBound2 is met, we call it LOGIC3
(There is a trick: you can manually change the precision value to force each of the above situations.)
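To make these labels concrete, here is a minimal sketch of how they map onto the branches of fastSquaredDistance (a sketch only; `dot`, `EPSILON`, and the surrounding values are the ones defined inside MLUtils, as shown in full later in this thread):

```scala
// Sketch: LOGIC1/2/3 mapped onto the non-dense branches of fastSquaredDistance.
if (precisionBound1 < precision) {
  // LOGIC1: the cheap norm-based bound is tight enough; a single dot product suffices.
  sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else {
  val dotValue = dot(v1, v2)
  sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
  val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) / (sqDist + EPSILON)
  if (precisionBound2 > precision) {
    // LOGIC3: the dot-product result is not accurate enough, so sqdist is recomputed on top.
    sqDist = Vectors.sqdist(v1, v2)
  }
  // LOGIC2: the dot-product result is kept as-is (no sqdist recompute).
}
```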

sparse-sparse case time cost (milliseconds)
LOGIC1
Before patch: 7786, 7970, 8086
After patch: 7729, 7653, 7903
LOGIC2
Before patch: 8412, 9029, 8606
After patch: 8603, 8724, 9024
LOGIC3
Before patch: 19365, 19146, 19351
After patch: 18917, 19007, 19074

sparse-dense case time cost (milliseconds)
LOGIC1
Before patch: 4195, 4014, 4409
After patch: 4081, 3971, 4151
LOGIC2
Before patch: 4968, 5579, 5080
After patch: 4980, 5472, 5148
LOGIC3
Before patch: 11848, 12077, 12168
After patch: 11718, 11874, 11743

And for the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.

dense-dense case time cost (milliseconds)
Before patch: 7340, 7816, 7672
After patch: 5752, 5800, 5753

For the real-world data test
My test data: I use
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extract the files (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data.

Total instances: 13230
Attributes per line: 6000

Result for the sparse-sparse situation, time cost (milliseconds)
Before enhancement: 7670, 7704, 7652
After enhancement: 7634, 7729, 7645

@HyukjinKwon (Member)

Please fix the PR title as described in https://spark.apache.org/contributing.html and read it.

@HyukjinKwon (Member)

How much performance does it gain in an end-to-end test, and how does it provide better performance?

@KyleLi1985 KyleLi1985 changed the title One part of Spark MLlib Kmean Logic Performance problem [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem Oct 30, 2018
@KyleLi1985 (Contributor, Author)

End-to-end test:
The following code was used for the test:
```scala
test("kmeanproblem") {
  val rdd = sc
    .textFile("/Users/liliang/Desktop/inputdata.txt")
    .map(f => f.split(",").map(f => f.toDouble))

  val vectorRdd = rdd.map(f => Vectors.dense(f))
  val startTime = System.currentTimeMillis()
  for (i <- 0 until 20) {
    val model = new KMeans()
      .setK(8)
      .setMaxIterations(100)
      .setInitializationMode(K_MEANS_PARALLEL)
      .run(vectorRdd)
  }
  val endTime = System.currentTimeMillis()

  // scalastyle:off println
  println("cost time: " + (endTime - startTime))
  // scalastyle:on println
}
```
Input data:
57216 items extracted from http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals to form the test input data.
Test result:
Before: 297686 milliseconds (worst-case situation)
After adding the patch: 180544 milliseconds (worst-case situation)

Function test:
Only the function fastSquaredDistance is tested, in the following way:
call fastSquaredDistance 100000000 times, before and after the patch respectively.

Input data:
1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
Test result:
Before: 8395 milliseconds
After adding the patch: 5448 milliseconds

So according to the tests above, we can conclude that the patch gives better performance for the function fastSquaredDistance in Spark's k-means.
(Furthermore, sqDist = Vectors.sqdist(v1, v2)
is faster than
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
in calculation performance.)
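As a rough standalone illustration of that claim (my own sketch, not the attached test file; the vector size, contents, and iteration count are arbitrary, and manualDot stands in for the private BLAS dot used inside MLUtils):

```scala
import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors

object SqdistVsDotSketch {
  // Plain loop dot product, standing in for the private BLAS.dot used by MLUtils.
  def manualDot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  def main(args: Array[String]): Unit = {
    val a1 = Array.fill(1000)(Random.nextDouble())
    val a2 = Array.fill(1000)(Random.nextDouble())
    val v1 = Vectors.dense(a1)
    val v2 = Vectors.dense(a2)
    val norm1 = Vectors.norm(v1, 2.0)
    val norm2 = Vectors.norm(v2, 2.0)
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2

    // Path the patch uses for dense-dense: element-wise squared distance.
    var t = System.currentTimeMillis()
    var d1 = 0.0
    for (_ <- 0 until 1000000) d1 = Vectors.sqdist(v1, v2)
    println(s"sqdist:        ${System.currentTimeMillis() - t} ms (last value $d1)")

    // Old dense-dense path: precomputed norms minus twice the dot product.
    t = System.currentTimeMillis()
    var d2 = 0.0
    for (_ <- 0 until 1000000) d2 = sumSquaredNorm - 2.0 * manualDot(a1, a2)
    println(s"norms - 2*dot: ${System.currentTimeMillis() - t} ms (last value $d2)")
  }
}
```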

@mgaido91 (Contributor)

@KyleLi1985 do you have native BLAS installed?

@KyleLi1985 (Contributor, Author)

> @KyleLi1985 do you have native BLAS installed?

As the code says: // For level-1 routines, we use Java implementation.

@mgaido91 (Contributor)

Then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.

@KyleLi1985 (Contributor, Author)

> Then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.

OK, for a fair result I will try it.

@srowen (Member) commented Oct 31, 2018

I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).

Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.

And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.

(If you did remove this check you could remove some other values that get computed to check this bound, like precision1)

@KyleLi1985 (Contributor, Author)

> Then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.

This part only uses the f2j implementation, as I said in my last comment, so the performance is not influenced by native BLAS.

@KyleLi1985 (Contributor, Author) commented Nov 1, 2018

> I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
>
> Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
>
> And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
>
> (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)

We use only dense vectors ("Vectors Dense"); here is the test file:
SparkMLlibTest.txt
I extracted the relevant part from the code and compared the performance. The results show that in the dense-vector situation sqdist is faster.
And for the end-to-end test, I considered the worst situation: the input vectors are all dense and the precision bound is not OK.

```scala
if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
    || !v2.isInstanceOf[DenseVector])) {
  sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
  val dotValue = dot(v1, v2)
  sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
  val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
    (sqDist + EPSILON)
  if (precisionBound2 > precision) {
    sqDist = Vectors.sqdist(v1, v2)
  }
} else {
  sqDist = Vectors.sqdist(v1, v2)
}
```

With the logic shown above, the dense-dense case only uses sqdist to calculate the distance.

@srowen (Member) commented Nov 1, 2018

Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.

@KyleLi1985 (Contributor, Author)

> Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.

Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases:
SparkMLlibTest.txt

Test results:

First we define the branch-path logic for the sparse-sparse and sparse-dense cases:
if precisionBound1 is met, we call it LOGIC1
if precisionBound1 is not met and precisionBound2 is not met, we call it LOGIC2
if precisionBound1 is not met but precisionBound2 is met, we call it LOGIC3
(There is a trick: you can manually change the precision value to force each of the above situations.)

sparse-sparse case time cost (milliseconds)
LOGIC1
Before patch: 7786, 7970, 8086
After patch: 7729, 7653, 7903
LOGIC2
Before patch: 8412, 9029, 8606
After patch: 8603, 8724, 9024
LOGIC3
Before patch: 19365, 19146, 19351
After patch: 18917, 19007, 19074

sparse-dense case time cost (milliseconds)
LOGIC1
Before patch: 4195, 4014, 4409
After patch: 4081, 3971, 4151
LOGIC2
Before patch: 4968, 5579, 5080
After patch: 4980, 5472, 5148
LOGIC3
Before patch: 11848, 12077, 12168
After patch: 11718, 11874, 11743

And for the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.

dense-dense case time cost (milliseconds)
Before patch: 7340, 7816, 7672
After patch: 5752, 5800, 5753

The above results are based on a comparison between the original fastSquaredDistance and the enhanced fastSquaredDistance, shown below.

```scala
private[mllib] def fastSquaredDistance(
    v1: Vector,
    norm1: Double,
    v2: Vector,
    norm2: Double,
    precision: Double = 1e-6): Double = {
  val n = v1.size
  require(v2.size == n)
  require(norm1 >= 0.0 && norm2 >= 0.0)
  val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
  val normDiff = norm1 - norm2
  var sqDist = 0.0
  /*
   * The relative error is
   * <pre>
   * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
   * </pre>
   * which is bounded by
   * <pre>
   * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
   * </pre>
   * The bound doesn't need the inner product, so we can use it as a sufficient condition to
   * check quickly whether the inner product approach is accurate.
   */

  if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
    sqDist = Vectors.sqdist(v1, v2)
  } else {
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)

    if (precisionBound1 < precision) {
      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    } else {
      val dotValue = dot(v1, v2)
      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
        (sqDist + EPSILON)

      if (precisionBound2 > precision) {
        sqDist = Vectors.sqdist(v1, v2)
      }
    }
  }

  sqDist
}
```

@srowen (Member) commented Nov 2, 2018

So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.

Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.

How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?

This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
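For what it's worth, a small sketch of what such a more representative sparse input could look like (an illustration of the suggestion only, using the public Vectors.sparse API; the count, dimension, and density values are arbitrary):

```scala
import scala.util.Random
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object RandomSparseInput {
  // Sketch: a batch of genuinely sparse random vectors of a few hundred dimensions,
  // so a benchmark exercises many different inputs rather than one repeated vector.
  def randomSparseVectors(count: Int, dim: Int = 500, density: Double = 0.05): Seq[Vector] =
    Seq.fill(count) {
      val indices = (0 until dim).filter(_ => Random.nextDouble() < density).toArray
      val values = Array.fill(indices.length)(Random.nextDouble())
      Vectors.sparse(dim, indices, values)
    }
}
```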

@KyleLi1985 (Contributor, Author)

> So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.
>
> Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.
>
> How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?
>
> This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.

  1. Why use Spark? No special reason; it is just the tool I commonly use.

  2. About the vectors: I did a test with more representative input, and I show the result below.

  3. About the precision: it is a trick, you can meet your goal (drive the calculation into whichever branch you want) by manually changing it. As I said in my last comment, take LOGIC2 for example: manually change the precision to -10000 in (precisionBound1 < precision) and to 10000 in (precisionBound2 > precision), and the calculation will go into the LOGIC2 situation (see the sketch after this list). It is like a code-coverage exercise. Anyway, the goal is to show that the performance does not change within the same calculation logic, before and after the enhancement, for the sparse-sparse and sparse-dense situations.
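To illustrate that trick, a hypothetical branch-forcing edit (for coverage-style benchmarking only, not part of the patch; the constants are arbitrary):

```scala
// Forcing LOGIC2: temporarily replace the `precision` comparisons with constants so that
// neither the dot-only shortcut (bound1) nor the sqdist fallback (bound2) is taken.
if (precisionBound1 < -10000.0) {            // always false: skip LOGIC1
  sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else {
  val dotValue = dot(v1, v2)
  sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
  val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) / (sqDist + EPSILON)
  if (precisionBound2 > 10000.0) {            // effectively never true: stay in LOGIC2
    sqDist = Vectors.sqdist(v1, v2)
  }
}
```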

Here is my test file
SparkMLlibTest.txt

My test data: I use
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extract the files (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data.

Total instances: 13230
Attributes per line: 6000

Result for the sparse-sparse situation, time cost (milliseconds)
Before enhancement: 7670, 7704, 7652
After enhancement: 7634, 7729, 7645

@srowen (Member) commented Nov 3, 2018

OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.

This doesn't show an improvement, right?

@KyleLi1985 (Contributor, Author) commented Nov 3, 2018

> OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.
>
> This doesn't show an improvement, right?

About the test, I agree with you.

There is no influence for the sparse case, and there is an improvement for the dense case.

You can also check this data, as I showed in a previous comment; it is a dense case:
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals

Before enhancement: 28800, 28190, 28320
After enhancement: 15693, 16034, 16322

Review comment on these diff lines:

```scala
    sqDist = Vectors.sqdist(v1, v2)
  } else {
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
```
(Member)

Shouldn't you pull computation of things like normDiff into this block, then? That would improve the dense case more, I suppose.

I'd still like to see your current final test code to understand what it's testing. If it's a win on some types of vectors, on realistic data, and doesn't hurt performance otherwise, this could be OK.
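For reference, a sketch of what that hoisting could look like (an illustration only, not the committed diff; `dot` and `EPSILON` are the existing MLUtils helpers):

```scala
private[mllib] def fastSquaredDistance(
    v1: Vector,
    norm1: Double,
    v2: Vector,
    norm2: Double,
    precision: Double = 1e-6): Double = {
  val n = v1.size
  require(v2.size == n)
  require(norm1 >= 0.0 && norm2 >= 0.0)
  var sqDist = 0.0
  if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
    // Dense-dense: go straight to the element-wise squared distance.
    sqDist = Vectors.sqdist(v1, v2)
  } else {
    // Hoisted here: only the non-dense paths need the norm-based bookkeeping.
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
    val normDiff = norm1 - norm2
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    if (precisionBound1 < precision) {
      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    } else {
      val dotValue = dot(v1, v2)
      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
        (sqDist + EPSILON)
      if (precisionBound2 > precision) {
        sqDist = Vectors.sqdist(v1, v2)
      }
    }
  }
  sqDist
}
```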

(Contributor, Author)

@srowen Thanks for the review, I will update with a new commit and the related test.

@KyleLi1985 (Contributor, Author)

I put together the final test cases for the sparse case and the dense case on realistic data to test the new commit:
SparkMLlibTest.txt

For the dense case, we use data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
and use all the dense data files from this realistic dataset.

Dense case test result, time cost (milliseconds)
Before enhancement: 211878, 210845, 215375
After enhancement: 140827, 149282, 130691

For the sparse case, we use data from
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extract all the sparse data files (PS1, PS2, PS3, PS4, PS5, PS6) from this realistic dataset.

Sparse case test result, time cost (milliseconds)
Before enhancement: 108080, 103582, 103586
After enhancement: 107652, 107145, 104768

@srowen (Member) left a comment

OK, I'm reasonably convinced.

@KyleLi1985 (Contributor, Author)

@AmplabJenkins test this please

@SparkQA commented Nov 9, 2018

Test build #4420 has finished for PR 22893 at commit 762755d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 9, 2018

Heh, as a side effect, this made the output of computeCost more accurate in one Pyspark test. It prints "2.0" rather than "2.000..." I think you can change the three instances that failed to just expect "2.0".

@KyleLi1985 (Contributor, Author)

It seems the related file spark/python/pyspark/ml/clustering.py has been changed during these days. My local latest commit is still at "bfe60fc on 30 Jul". Do I need to re-fork Spark and open another pull request, or is there another method?

@srowen (Member) commented Nov 10, 2018

There's no merge conflict right now. You can just update the file and push the commit to your branch. If there were a merge conflict, you'd just rebase on apache/master, resolve the conflict, and force-push the branch.

@KyleLi1985 (Contributor, Author)

@SparkQA test this please

@mgaido91 (Contributor)

retest this please

@SparkQA commented Nov 10, 2018

Test build #4423 has finished for PR 22893 at commit dc30bac.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on these diff lines:

```python
@@ -88,6 +88,14 @@ def clusterSizes(self):
        """
        return self._call_java("clusterSizes")

    @property
```
(Member)

Why is this now added, by mistake? This change should just be the one above and the test change.

(Contributor, Author)

It was my mistake; I have already updated the commit to expect the value "2.0"
in spark/python/pyspark/ml/clustering.py and spark/python/pyspark/mllib/clustering.py.

@KyleLi1985 (Contributor, Author)

@SparkQA retest this please

@SparkQA commented Nov 12, 2018

Test build #4424 has finished for PR 22893 at commit 50ef296.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 14, 2018

Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.

@srowen (Member) commented Nov 14, 2018

Merged to master

@asfgit asfgit closed this in e503065 Nov 14, 2018
@KyleLi1985 (Contributor, Author)

> Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.

@srowen @HyukjinKwon @mgaido91 Thanks for the review. It is my pleasure.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem

Closes apache#22893 from KyleLi1985/updatekmeanpatch.

Authored-by: 李亮 <liang.li.work@outlook.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>