
[SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem #22893

Closed · wants to merge 1 commit

Conversation

@KyleLi1985 (Contributor) commented Oct 30, 2018

What changes were proposed in this pull request?

Fix a performance problem in fastSquaredDistance for the dense-dense case, and at the same time improve the calculation accuracy.

How was this patch tested?

Tested from several angles: after applying this patch, the dense-dense case performance is improved, and the other cases (sparse-sparse, sparse-dense) are not noticeably affected.

For the calculation logic test
Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases.

Test results:
First we define the branch-path logic for the sparse-sparse and sparse-dense cases:
if precisionBound1 is met, we call it LOGIC1
if precisionBound1 is not met and precisionBound2 is not met, we call it LOGIC2
if precisionBound1 is not met but precisionBound2 is met, we call it LOGIC3
(There is a trick: you can manually change the precision value to force each of the above situations.)
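To make these labels concrete, here is a minimal sketch of how they map onto the branches of fastSquaredDistance (a sketch only; `dot`, `EPSILON`, and the surrounding values are the ones defined inside MLUtils, as shown in full later in this thread):

```scala
// Sketch: LOGIC1/2/3 mapped onto the non-dense branches of fastSquaredDistance.
if (precisionBound1 < precision) {
  // LOGIC1: the cheap norm-based bound is tight enough; a single dot product suffices.
  sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else {
  val dotValue = dot(v1, v2)
  sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
  val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) / (sqDist + EPSILON)
  if (precisionBound2 > precision) {
    // LOGIC3: the dot-product result is not accurate enough, so sqdist is recomputed on top.
    sqDist = Vectors.sqdist(v1, v2)
  }
  // LOGIC2: the dot-product result is kept as-is (no sqdist recompute).
}
```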

sparse-sparse case time cost (milliseconds)
LOGIC1
Before patch: 7786, 7970, 8086
After patch: 7729, 7653, 7903
LOGIC2
Before patch: 8412, 9029, 8606
After patch: 8603, 8724, 9024
LOGIC3
Before patch: 19365, 19146, 19351
After patch: 18917, 19007, 19074

sparse-dense case time cost (milliseconds)
LOGIC1
Before patch: 4195, 4014, 4409
After patch: 4081, 3971, 4151
LOGIC2
Before patch: 4968, 5579, 5080
After patch: 4980, 5472, 5148
LOGIC3
Before patch: 11848, 12077, 12168
After patch: 11718, 11874, 11743

And for the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.

dense-dense case time cost (milliseconds)
Before patch: 7340, 7816, 7672
After patch: 5752, 5800, 5753

For the real-world data test
My test data: I use
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extract the files (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data.

Total instances: 13230
Attributes per line: 6000

Result for the sparse-sparse situation, time cost (milliseconds)
Before enhancement: 7670, 7704, 7652
After enhancement: 7634, 7729, 7645

@HyukjinKwon (Member)

Please fix the PR title as described in https://spark.apache.org/contributing.html and read it.

@HyukjinKwon (Member)

How much performance does it gain in an end-to-end test, and how does it provide better performance?

@KyleLi1985 KyleLi1985 changed the title One part of Spark MLlib Kmean Logic Performance problem [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem Oct 30, 2018
@KyleLi1985 (Contributor, Author)

End-to-end test:
The following code was used for the test:
```scala
test("kmeanproblem") {
  val rdd = sc
    .textFile("/Users/liliang/Desktop/inputdata.txt")
    .map(f => f.split(",").map(f => f.toDouble))

  val vectorRdd = rdd.map(f => Vectors.dense(f))
  val startTime = System.currentTimeMillis()
  for (i <- 0 until 20) {
    val model = new KMeans()
      .setK(8)
      .setMaxIterations(100)
      .setInitializationMode(K_MEANS_PARALLEL)
      .run(vectorRdd)
  }
  val endTime = System.currentTimeMillis()

  // scalastyle:off println
  println("cost time: " + (endTime - startTime))
  // scalastyle:on println
}
```
Input data:
57216 items extracted from http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals to form the test input data.
Test result:
Before: 297686 milliseconds (worst-case situation)
After adding the patch: 180544 milliseconds (worst-case situation)

Function test:
Only the function fastSquaredDistance is tested, in the following way:
call fastSquaredDistance 100000000 times, before and after the patch respectively.

Input data:
1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5
4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
Test result:
Before: 8395 milliseconds
After adding the patch: 5448 milliseconds

So according to the tests above, we can conclude that the patch gives better performance for the function fastSquaredDistance in Spark's k-means.
(Furthermore, sqDist = Vectors.sqdist(v1, v2)
is faster than
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
in calculation performance.)
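As a rough standalone illustration of that claim (my own sketch, not the attached test file; the vector size, contents, and iteration count are arbitrary, and manualDot stands in for the private BLAS dot used inside MLUtils):

```scala
import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors

object SqdistVsDotSketch {
  // Plain loop dot product, standing in for the private BLAS.dot used by MLUtils.
  def manualDot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  def main(args: Array[String]): Unit = {
    val a1 = Array.fill(1000)(Random.nextDouble())
    val a2 = Array.fill(1000)(Random.nextDouble())
    val v1 = Vectors.dense(a1)
    val v2 = Vectors.dense(a2)
    val norm1 = Vectors.norm(v1, 2.0)
    val norm2 = Vectors.norm(v2, 2.0)
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2

    // Path the patch uses for dense-dense: element-wise squared distance.
    var t = System.currentTimeMillis()
    var d1 = 0.0
    for (_ <- 0 until 1000000) d1 = Vectors.sqdist(v1, v2)
    println(s"sqdist:        ${System.currentTimeMillis() - t} ms (last value $d1)")

    // Old dense-dense path: precomputed norms minus twice the dot product.
    t = System.currentTimeMillis()
    var d2 = 0.0
    for (_ <- 0 until 1000000) d2 = sumSquaredNorm - 2.0 * manualDot(a1, a2)
    println(s"norms - 2*dot: ${System.currentTimeMillis() - t} ms (last value $d2)")
  }
}
```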

@mgaido91 (Contributor)

@KyleLi1985 do you have native BLAS installed?

@KyleLi1985 (Contributor, Author)

> @KyleLi1985 do you have native BLAS installed?

As the code says: // For level-1 routines, we use Java implementation.

@mgaido91 (Contributor)

Then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.

@KyleLi1985 (Contributor, Author)

> Then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.

OK, for a fair result I will try it.

@srowen (Member) commented Oct 31, 2018

I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).

Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.

And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.

(If you did remove this check you could remove some other values that get computed to check this bound, like precision1)

@KyleLi1985 (Contributor, Author)

> Then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO.

This part only uses the f2j implementation, as I said in my last comment, so the performance is not influenced by native BLAS.

@KyleLi1985 (Contributor, Author) commented Nov 1, 2018

> I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
>
> Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
>
> And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
>
> (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)

We use only dense vectors ("Vectors Dense"); here is the test file:
SparkMLlibTest.txt
I extracted the relevant part from the code and compared the performance. The results show that in the dense-vector situation sqdist is faster.
And for the end-to-end test, I considered the worst situation: the input vectors are all dense and the precision bound is not OK.

```scala
if (precisionBound1 < precision && (!v1.isInstanceOf[DenseVector]
    || !v2.isInstanceOf[DenseVector])) {
  sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
  val dotValue = dot(v1, v2)
  sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
  val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
    (sqDist + EPSILON)
  if (precisionBound2 > precision) {
    sqDist = Vectors.sqdist(v1, v2)
  }
} else {
  sqDist = Vectors.sqdist(v1, v2)
}
```

With the logic shown above, the dense-dense case only uses sqdist to calculate the distance.

@srowen (Member) commented Nov 1, 2018

Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.

@KyleLi1985 (Contributor, Author)

> Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this is not impacting other cases significantly, and I think it should. Consider the sparse-sparse case.

Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases:
SparkMLlibTest.txt

Test results:

First we define the branch-path logic for the sparse-sparse and sparse-dense cases:
if precisionBound1 is met, we call it LOGIC1
if precisionBound1 is not met and precisionBound2 is not met, we call it LOGIC2
if precisionBound1 is not met but precisionBound2 is met, we call it LOGIC3
(There is a trick: you can manually change the precision value to force each of the above situations.)

sparse-sparse case time cost (milliseconds)
LOGIC1
Before patch: 7786, 7970, 8086
After patch: 7729, 7653, 7903
LOGIC2
Before patch: 8412, 9029, 8606
After patch: 8603, 8724, 9024
LOGIC3
Before patch: 19365, 19146, 19351
After patch: 18917, 19007, 19074

sparse-dense case time cost (milliseconds)
LOGIC1
Before patch: 4195, 4014, 4409
After patch: 4081, 3971, 4151
LOGIC2
Before patch: 4968, 5579, 5080
After patch: 4980, 5472, 5148
LOGIC3
Before patch: 11848, 12077, 12168
After patch: 11718, 11874, 11743

And for the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.

dense-dense case time cost (milliseconds)
Before patch: 7340, 7816, 7672
After patch: 5752, 5800, 5753

The above results are based on a comparison between the original fastSquaredDistance and the enhanced fastSquaredDistance, shown below.

```scala
private[mllib] def fastSquaredDistance(
    v1: Vector,
    norm1: Double,
    v2: Vector,
    norm2: Double,
    precision: Double = 1e-6): Double = {
  val n = v1.size
  require(v2.size == n)
  require(norm1 >= 0.0 && norm2 >= 0.0)
  val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
  val normDiff = norm1 - norm2
  var sqDist = 0.0
  /*
   * The relative error is
   * <pre>
   * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
   * </pre>
   * which is bounded by
   * <pre>
   * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
   * </pre>
   * The bound doesn't need the inner product, so we can use it as a sufficient condition to
   * check quickly whether the inner product approach is accurate.
   */

  if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
    sqDist = Vectors.sqdist(v1, v2)
  } else {
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)

    if (precisionBound1 < precision) {
      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    } else {
      val dotValue = dot(v1, v2)
      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
        (sqDist + EPSILON)

      if (precisionBound2 > precision) {
        sqDist = Vectors.sqdist(v1, v2)
      }
    }
  }

  sqDist
}
```

@srowen (Member) commented Nov 2, 2018

So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.

Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.

How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?

This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
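For what it's worth, a small sketch of what such a more representative sparse input could look like (an illustration of the suggestion only, using the public Vectors.sparse API; the count, dimension, and density values are arbitrary):

```scala
import scala.util.Random
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object RandomSparseInput {
  // Sketch: a batch of genuinely sparse random vectors of a few hundred dimensions,
  // so a benchmark exercises many different inputs rather than one repeated vector.
  def randomSparseVectors(count: Int, dim: Int = 500, density: Double = 0.05): Seq[Vector] =
    Seq.fill(count) {
      val indices = (0 until dim).filter(_ => Random.nextDouble() < density).toArray
      val values = Array.fill(indices.length)(Random.nextDouble())
      Vectors.sparse(dim, indices, values)
    }
}
```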

@KyleLi1985 (Contributor, Author)

> So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and sqdist is faster than a dot product. This doesn't make sense mathematically as it should be more math, but stranger things have happened.
>
> Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? and it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse.
>
> How about more representative input -- larger vectors (100s of elements, probably), more sparse sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here?
>
> This may be a good change but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.

  1. Why use Spark? No special reason; it is just the tool I commonly use.

  2. About the vectors: I did a test with more representative input, and I show the result below.

  3. About the precision: it is a trick, you can meet your goal (drive the calculation into whichever branch you want) by manually changing it. As I said in my last comment, take LOGIC2 for example: manually change the precision to -10000 in (precisionBound1 < precision) and to 10000 in (precisionBound2 > precision), and the calculation will go into the LOGIC2 situation (see the sketch after this list). It is like a code-coverage exercise. Anyway, the goal is to show that the performance does not change within the same calculation logic, before and after the enhancement, for the sparse-sparse and sparse-dense situations.
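To illustrate that trick, a hypothetical branch-forcing edit (for coverage-style benchmarking only, not part of the patch; the constants are arbitrary):

```scala
// Forcing LOGIC2: temporarily replace the `precision` comparisons with constants so that
// neither the dot-only shortcut (bound1) nor the sqdist fallback (bound2) is taken.
if (precisionBound1 < -10000.0) {            // always false: skip LOGIC1
  sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
} else {
  val dotValue = dot(v1, v2)
  sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
  val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) / (sqDist + EPSILON)
  if (precisionBound2 > 10000.0) {            // effectively never true: stay in LOGIC2
    sqDist = Vectors.sqdist(v1, v2)
  }
}
```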

Here is my test file
SparkMLlibTest.txt

My test data: I use
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extract the files (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data.

Total instances: 13230
Attributes per line: 6000

Result for the sparse-sparse situation, time cost (milliseconds)
Before enhancement: 7670, 7704, 7652
After enhancement: 7634, 7729, 7645

@srowen (Member) commented Nov 3, 2018

OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.

This doesn't show an improvement, right?

@KyleLi1985 (Contributor, Author) commented Nov 3, 2018

> OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here I think it's important to understand exactly what you're measuring and exactly what you're running.
>
> This doesn't show an improvement, right?

About the test, I agree with you.

There is no influence for the sparse case, and there is an improvement for the dense case.

You can also check this data, as I showed in a previous comment; it is a dense case:
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals

Before enhancement: 28800, 28190, 28320
After enhancement: 15693, 16034, 16322

Review comment on these diff lines:

```scala
    sqDist = Vectors.sqdist(v1, v2)
  } else {
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
```
(Member)

Shouldn't you pull computation of things like normDiff into this block, then? That would improve the dense case more, I suppose.

I'd still like to see your current final test code to understand what it's testing. If it's a win on some types of vectors, on realistic data, and doesn't hurt performance otherwise, this could be OK.
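For reference, a sketch of what that hoisting could look like (an illustration only, not the committed diff; `dot` and `EPSILON` are the existing MLUtils helpers):

```scala
private[mllib] def fastSquaredDistance(
    v1: Vector,
    norm1: Double,
    v2: Vector,
    norm2: Double,
    precision: Double = 1e-6): Double = {
  val n = v1.size
  require(v2.size == n)
  require(norm1 >= 0.0 && norm2 >= 0.0)
  var sqDist = 0.0
  if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
    // Dense-dense: go straight to the element-wise squared distance.
    sqDist = Vectors.sqdist(v1, v2)
  } else {
    // Hoisted here: only the non-dense paths need the norm-based bookkeeping.
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
    val normDiff = norm1 - norm2
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    if (precisionBound1 < precision) {
      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    } else {
      val dotValue = dot(v1, v2)
      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
        (sqDist + EPSILON)
      if (precisionBound2 > precision) {
        sqDist = Vectors.sqdist(v1, v2)
      }
    }
  }
  sqDist
}
```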

(Contributor, Author)

@srowen Thanks for the review, I will update with a new commit and the related test.

@KyleLi1985 (Contributor, Author)

I put together the final test cases for the sparse case and the dense case on realistic data to test the new commit:
SparkMLlibTest.txt

For the dense case, we use data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
and use all the dense data files from this realistic dataset.

Dense case test result, time cost (milliseconds)
Before enhancement: 211878, 210845, 215375
After enhancement: 140827, 149282, 130691

For the sparse case, we use data from
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extract all the sparse data files (PS1, PS2, PS3, PS4, PS5, PS6) from this realistic dataset.

Sparse case test result, time cost (milliseconds)
Before enhancement: 108080, 103582, 103586
After enhancement: 107652, 107145, 104768

@srowen (Member) left a comment

OK, I'm reasonably convinced.

@KyleLi1985 (Contributor, Author)

@AmplabJenkins test this please

@SparkQA commented Nov 9, 2018

Test build #4420 has finished for PR 22893 at commit 762755d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 9, 2018

Heh, as a side effect, this made the output of computeCost more accurate in one Pyspark test. It prints "2.0" rather than "2.000..." I think you can change the three instances that failed to just expect "2.0".

@KyleLi1985 (Contributor, Author)

It seems the related file spark/python/pyspark/ml/clustering.py has been changed during these days. My local latest commit is still at "bfe60fc on 30 Jul". Do I need to re-fork Spark and open another pull request, or is there another method?

@srowen (Member) commented Nov 10, 2018

There's no merge conflict right now. You can just update the file and push the commit to your branch. If there were a merge conflict, you'd just rebase on apache/master, resolve the conflict, and force-push the branch.

@KyleLi1985 (Contributor, Author)

@SparkQA test this please

@mgaido91 (Contributor)

retest this please

@SparkQA commented Nov 10, 2018

Test build #4423 has finished for PR 22893 at commit dc30bac.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on these diff lines:

```python
@@ -88,6 +88,14 @@ def clusterSizes(self):
        """
        return self._call_java("clusterSizes")

    @property
```
(Member)

Why is this now added, by mistake? This change should just be the one above and the test change.

(Contributor, Author)

It was my mistake; I have already updated the commit to expect the value "2.0"
in spark/python/pyspark/ml/clustering.py and spark/python/pyspark/mllib/clustering.py.

@KyleLi1985 (Contributor, Author)

@SparkQA retest this please

@SparkQA commented Nov 12, 2018

Test build #4424 has finished for PR 22893 at commit 50ef296.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Nov 14, 2018

Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.

@srowen (Member) commented Nov 14, 2018

Merged to master

@asfgit asfgit closed this in e503065 Nov 14, 2018
@KyleLi1985 (Contributor, Author)

> Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation.

@srowen @HyukjinKwon @mgaido91 Thanks for the review. It is my pleasure.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem

Closes apache#22893 from KyleLi1985/updatekmeanpatch.

Authored-by: 李亮 <liang.li.work@outlook.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>