[SPARK-25868][MLlib] One part of Spark MLlib KMeans logic performance problem #22893
Conversation
Please read https://spark.apache.org/contributing.html and fix the PR title as described there.
How much performance does it gain in an end-to-end test, and where does the improvement come from?
End-to-end test situation:

(end-to-end test results, function test results, and input data attached)

So according to the above tests, we can conclude that the patch gives better performance for the function fastSquaredDistance in Spark k-means.
@KyleLi1985 do you have native BLAS installed?
Then I think you have to try with native BLAS installed; otherwise the results are not valid, IMHO.
I don't think BLAS matters here, as these are all vector-vector operations and f2jblas is used directly (i.e. it stays in the JVM). Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here, as it ought to be a little more math. The sparse-dense case might come out differently, note. And I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met), because it is now handled by the sparse-sparse if case in this code, which definitely does a dot plus more work. (If you did remove this check, you could also remove some other values that get computed only to check this bound, like precisionBound1.)
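For readers following along, the two code paths being compared can be sketched standalone. This is a Python rendering for brevity; the real implementation is the Scala `MLUtils.fastSquaredDistance`, so the names and the `1e-6` default here are illustrative, not the actual source:

```python
import sys

EPSILON = sys.float_info.epsilon  # machine epsilon, analogous to MLUtils' EPSILON

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sqdist(a, b):
    # exact element-wise squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fast_squared_distance(v1, norm1, v2, norm2, precision=1e-6):
    """Sketch of the branch structure: a norm-based shortcut guarded by a
    precision bound, with a fallback to the exact element-wise distance."""
    sum_squared_norm = norm1 * norm1 + norm2 * norm2
    norm_diff = norm1 - norm2
    precision_bound1 = 2.0 * EPSILON * sum_squared_norm / (norm_diff * norm_diff + EPSILON)
    if precision_bound1 < precision:
        # cheap path: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b (one dot product)
        return sum_squared_norm - 2.0 * dot(v1, v2)
    return sqdist(v1, v2)  # exact path: slightly more math per element
```

The debate above is whether, for dense-dense inputs, the shortcut branch ever beats calling sqdist directly.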
We use only dense vectors; here is the test file (attached). In the logic presented above, we only use sqdist to calculate the distance.
Hm, actually that's the best case: you're exercising exactly the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where the branch you deleted helps. As I say, you'd have to show this does not significantly impact the other cases, and I think it would. Consider the sparse-sparse case.
Here are my tests for the sparse-sparse, dense-dense, and sparse-dense cases, with results. First we need to define some branch-path logic for the sparse-sparse and sparse-dense cases. The time costs (milliseconds) for the sparse-sparse and sparse-dense cases are attached. For the dense-dense case, as we already discussed in the comments, we only use sqdist to calculate the distance; its time costs (milliseconds) are attached as well. The above results are based on a comparison between the original fastSquaredDistance and the enhanced fastSquaredDistance shown below (code attached).
So the pull request right now doesn't reflect what you tested; you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and that sqdist is faster than a dot product. This doesn't make sense mathematically, as it should be more math, but stranger things have happened. Still, I don't follow your test code here. You parallelize one vector, map it, collect it: why use Spark? And it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse. How about more representative input: larger vectors (100s of elements, probably), sparser sparse vectors, and a large set of different inputs. I also don't see where the precision bound is changed here. This may be a good change, but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
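A microbenchmark along the suggested lines (many distinct, larger vectors, no Spark in the timing loop) might look like this sketch. It is not the author's actual harness, and the sizes are arbitrary assumptions:

```python
import random
import time

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def norm_based(a, norm_a, b, norm_b):
    # ||a||^2 + ||b||^2 - 2 a.b, with the norms precomputed outside the loop
    d = sum(x * y for x, y in zip(a, b))
    return norm_a * norm_a + norm_b * norm_b - 2.0 * d

def bench(n_vectors=1000, dim=600, n_pairs=2000, seed=42):
    """Time both distance formulations over many distinct random pairs."""
    rng = random.Random(seed)
    vecs = [[rng.random() for _ in range(dim)] for _ in range(n_vectors)]
    norms = [sum(x * x for x in v) ** 0.5 for v in vecs]
    pairs = [(rng.randrange(n_vectors), rng.randrange(n_vectors))
             for _ in range(n_pairs)]

    t0 = time.perf_counter()
    d_sq = [sqdist(vecs[i], vecs[j]) for i, j in pairs]
    t1 = time.perf_counter()
    d_nb = [norm_based(vecs[i], norms[i], vecs[j], norms[j]) for i, j in pairs]
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, d_sq, d_nb
```

Both formulations should agree to floating-point tolerance; only the timings differ.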
Here is my test file (attached) and my test data situation: the total number of instances is 13230. Results for the sparse-sparse situation, time cost (milliseconds), are attached.
OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I understand you manually modified it. Because the test is so central here, I think it's important to understand exactly what you're measuring and exactly what you're running. This doesn't show an improvement, right?
From the tests, I agree with you: there is no influence on the sparse case, and there is an improvement for the dense case. You can also check the data I showed in a previous comment; it is a dense case, and before the enhancement the times were 28800, 28190, 28320.
```scala
      sqDist = Vectors.sqdist(v1, v2)
    } else {
      val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
```
Shouldn't you pull computation of things like normDiff into this block, then? That would improve the dense case more, I suppose.
I'd still like to see your current final test code to understand what it's testing. If it's a win on some types of vectors, on realistic data, and doesn't hurt performance otherwise, this could be OK.
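The suggested restructuring, computing normDiff and the bound only on the path that needs them, could take roughly this shape. Again a Python sketch of the idea; `is_dense` stands in for Spark's DenseVector/SparseVector dispatch and is not the real signature:

```python
import sys

EPSILON = sys.float_info.epsilon

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fast_squared_distance_revised(v1, norm1, v2, norm2, is_dense,
                                  precision=1e-6):
    if is_dense:
        # dense-dense: straight to the element-wise distance; normDiff and
        # the precision bound are never computed on this path
        return sqdist(v1, v2)
    # only the sparse paths pay for the bound computation
    sum_squared_norm = norm1 * norm1 + norm2 * norm2
    norm_diff = norm1 - norm2
    bound1 = 2.0 * EPSILON * sum_squared_norm / (norm_diff * norm_diff + EPSILON)
    if bound1 < precision:
        d = sum(x * y for x, y in zip(v1, v2))
        return sum_squared_norm - 2.0 * d
    return sqdist(v1, v2)
```

Both paths return the same value up to rounding; the point is purely what work the dense-dense branch avoids.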
@srowen Thanks for the review. I will update with a new commit and the related tests.
I formed the final test cases for the sparse case and the dense case on realistic data to test the new commit. The test results (time cost in milliseconds) for both the dense case and the sparse case are attached.
OK, I'm reasonably convinced.
@AmplabJenkins test this please
Test build #4420 has finished for PR 22893 at commit
Heh, as a side effect, this made the output of computeCost more accurate in one Pyspark test. It prints "2.0" rather than "2.000..." I think you can change the three instances that failed to just expect "2.0". |
It seems the related file spark/python/pyspark/ml/clustering.py has been changed in the last few days. My local latest commit is still "bfe60fc on 30 Jul". Do I need to re-fork Spark and open another pull request, or is there another way?
There's no merge conflict right now. You can just update the file and push the commit to your branch. If there were a merge conflict, you'd just rebase on apache/master, resolve the conflict, and force-push the branch. |
@SparkQA test this please
retest this please |
Test build #4423 has finished for PR 22893 at commit
python/pyspark/ml/clustering.py (outdated)

```diff
@@ -88,6 +88,14 @@ def clusterSizes(self):
         """
         return self._call_java("clusterSizes")
 
+    @property
```
Why is this now added? By mistake? This change should just be the one above plus the test change.
It was my mistake; I have already updated the commit to expect the value "2.0" in spark/python/pyspark/ml/clustering.py and spark/python/pyspark/mllib/clustering.py.
@SparkQA retest this please
Test build #4424 has finished for PR 22893 at commit
Thanks @KyleLi1985, this looks like a nice win in the end. Thanks for your investigation.
Merged to master |
@srowen @HyukjinKwon @mgaido91 Thanks for the review. It is my pleasure.
…problem

Closes apache#22893 from KyleLi1985/updatekmeanpatch.

Authored-by: 李亮 <liang.li.work@outlook.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
Fix a performance problem in fastSquaredDistance for the dense-dense case, and at the same time improve the calculation accuracy.
How was this patch tested?
Tested from several angles. After adding this patch, performance in the dense-dense case is improved, without negatively affecting the other cases (sparse-sparse, sparse-dense).
For calculation logic test
Here are my tests for the sparse-sparse, dense-dense, and sparse-dense cases, with the results below.
First we need to define the branch-path logic for the sparse-sparse and sparse-dense cases:

- if precisionBound1 is met, we define it as LOGIC1
- if precisionBound1 is not met and precisionBound2 is not met, we define it as LOGIC2
- if precisionBound1 is not met but precisionBound2 is met, we define it as LOGIC3

(Trick: you can manually change the precision value to reach each of the above situations.)
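The routing to LOGIC1/2/3 can be sketched as follows. Python for illustration only; the exact form of precisionBound2 follows my reading of the Scala MLUtils code and should be treated as an assumption:

```python
import sys

EPSILON = sys.float_info.epsilon

def classify_branch(norm1, norm2, dot_based_sqdist, precision=1e-6):
    """Return which LOGIC path a sparse vector pair would take.
    dot_based_sqdist is the shortcut estimate ||a||^2 + ||b||^2 - 2 a.b."""
    sum_squared_norm = norm1 * norm1 + norm2 * norm2
    norm_diff = norm1 - norm2
    bound1 = 2.0 * EPSILON * sum_squared_norm / (norm_diff * norm_diff + EPSILON)
    if bound1 < precision:
        return "LOGIC1"  # norm-based shortcut is precise enough
    bound2 = (EPSILON * (sum_squared_norm + 2.0 * abs(dot_based_sqdist))
              / (dot_based_sqdist + EPSILON))
    if bound2 > precision:
        return "LOGIC3"  # shortcut too imprecise: recompute with exact sqdist
    return "LOGIC2"      # dot-based estimate accepted
```

Manually shrinking `precision` (the "trick" above) forces pairs out of LOGIC1 into the later branches.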
sparse-sparse case time cost (milliseconds)
LOGIC1
Before add patch: 7786, 7970, 8086
After add patch: 7729, 7653, 7903
LOGIC2
Before add patch: 8412, 9029, 8606
After add patch: 8603, 8724, 9024
LOGIC3
Before add patch: 19365, 19146, 19351
After add patch: 18917, 19007, 19074
sparse-dense case time cost (milliseconds)
LOGIC1
Before add patch: 4195, 4014, 4409
After add patch: 4081, 3971, 4151
LOGIC2
Before add patch: 4968, 5579, 5080
After add patch: 4980, 5472, 5148
LOGIC3
Before add patch: 11848, 12077, 12168
After add patch: 11718, 11874, 11743
And for the dense-dense case, as we already discussed in the comments, we only use sqdist to calculate the distance.
dense-dense case time cost (milliseconds)
Before add patch: 7340, 7816, 7672
After add patch: 5752, 5800, 5753
For real world data test
My test data situation: I used the dataset
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
and extracted the files (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data.
The total number of instances is 13230, and each instance has 6000 attributes.
Result for the sparse-sparse situation, time cost (milliseconds):
Before Enhance: 7670, 7704, 7652
After Enhance: 7634, 7729, 7645