
[SPARK-30120][ML] Use BoundedPriorityQueue for small dataset in LSH approxNearestNeighbors #26858

Closed · wants to merge 1 commit

Conversation

huaxingao (Contributor)

What changes were proposed in this pull request?

Use BoundedPriorityQueue for small datasets in LSH.approxNearestNeighbors

Why are the changes needed?

For small datasets, we can get the exact result instead of using approxQuantile.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Use existing unit tests
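The idea, as a minimal sketch (modelDatasetWithDist, distCol, and numNearestNeighbors are the surrounding names in LSH.scala; BoundedPriorityQueue is Spark-internal (private[spark]), so this only compiles inside Spark's own source tree; the exact integration differs in the patch under review):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.util.BoundedPriorityQueue

// Collect the distances of the (small) dataset on the driver, keep only the
// numNearestNeighbors smallest of them, and filter with that exact threshold.
val queue = new BoundedPriorityQueue[Double](numNearestNeighbors)(Ordering[Double].reverse)
modelDatasetWithDist.select(distCol).collect().foreach {
  case Row(dist: Double) => queue += dist
}
val thresholdDist = queue.max  // the k-th smallest distance overall
modelDatasetWithDist.filter(col(distCol) <= thresholdDist)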

if (approxQuantile >= 1) {
modelDatasetWithDist
// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
huaxingao (Contributor, Author):

What is a good number to use here?

@huaxingao (Contributor, Author)

cc @zhengruifeng @srowen

@SparkQA commented Dec 12, 2019

Test build #115214 has finished for PR 26858 at commit 48a91ee.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor)

I am afraid this PR is wrong.

Use BoundedPriorityQueue for small dataset in LSH approxNearestNeighbors

Not for a small dataset but for a small numNearestNeighbors. Only when numNearestNeighbors is small can we use a max-heap (BoundedPriorityQueue) to directly obtain the top entries.
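A small illustration of that point (BoundedPriorityQueue is Spark's internal bounded heap; this sketch assumes it is run inside Spark's source tree, where it is accessible):

import org.apache.spark.util.BoundedPriorityQueue

// Keep only the 3 smallest values ever inserted: under the reversed ordering,
// numerically small values count as "large", so the bounded queue retains them.
val q = new BoundedPriorityQueue[Double](3)(Ordering[Double].reverse)
(1 to 1000000).foreach(i => q += i.toDouble)  // memory stays O(3), not O(n)
println(q.toArray.sorted.mkString(", "))      // prints: 1.0, 2.0, 3.0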

@srowen (Member) left a comment:

Hm, yeah, it might be worthwhile to optimize this case. I'd keep the size limit small. How much does it speed things up, though?

modelDatasetWithDist
// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
val queue = new BoundedPriorityQueue[Double](count.toInt)(Ordering[Double])
srowen (Member):

Shouldn't the queue just need numNearestNeighbors elements?

huaxingao (Contributor, Author):

Sorry, I should use numNearestNeighbors. Will fix this.
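A sketch of the corrected construction (note that keeping the k smallest distances also needs the reversed ordering, as in the treeAggregate suggestion later in this thread):

// Bound the queue by the number of neighbors requested, not the dataset size;
// the reversed ordering makes it retain minima rather than maxima.
val queue = new BoundedPriorityQueue[Double](numNearestNeighbors)(Ordering[Double].reverse)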

}
var sortedDistCol = queue.toArray.sorted(Ordering[Double])
queue.clear()
modelDatasetWithDist.filter(col(distCol) <= sortedDistCol(numNearestNeighbors - 1))
srowen (Member):

I might pull sortedDistCol(numNearestNeighbors - 1) into a val or else this has to send the whole array. (You don't have to clear the queue)
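A sketch of the suggested change, hoisting the threshold into a val:

// Extract the single Double threshold first instead of referencing the
// sorted array inside the filter expression.
val thresholdDist = sortedDistCol(numNearestNeighbors - 1)
modelDatasetWithDist.filter(col(distCol) <= thresholdDist)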

huaxingao (Contributor, Author):

Will fix this. Thanks!

// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
val queue = new BoundedPriorityQueue[Double](count.toInt)(Ordering[Double])
modelDatasetWithDist.collect().foreach { case Row(keys, values, distCol: Double) =>
srowen (Member):

keys and values can be _
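That is, a sketch (assuming the three-column row layout from the excerpt above, where only the distance is needed):

modelDatasetWithDist.collect().foreach { case Row(_, _, dist: Double) =>
  queue += dist  // the key and value columns are matched with _ and ignored
}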

@srowen (Member) commented Dec 12, 2019

@zhengruifeng I think the logic does have to depend on the size of the dataset; you don't want to collect 10M elements to find 10 nearest neighbors. But I do think the priority queue size needs to be numNearestNeighbors, yes, if that's what you mean.

modelDatasetWithDist
// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
val queue = new BoundedPriorityQueue[Double](count.toInt)(Ordering[Double])
zhengruifeng (Contributor):

This place should be something like:

val exactThreshold = modelDatasetWithDist
  .select(distCol)
  .as[Double]
  .rdd
  // Each partition keeps only its numNearestNeighbors smallest distances
  // (the reversed ordering makes the bounded queue retain minima); the
  // per-partition queues are merged tree-wise, and the largest retained
  // value is the exact k-th smallest distance overall.
  .treeAggregate(new BoundedPriorityQueue[Double](numNearestNeighbors)(Ordering[Double].reverse))(
    seqOp = (q, v) => q += v,
    combOp = (q1, q2) => q1 ++= q2,
    depth = 2
  ).toArray.max

And this implementation has no dependency on the size of the dataset.

zhengruifeng (Contributor):

This only depends on numNearestNeighbors, when it is small (maybe < 10000?). For example, with numNearestNeighbors = 10: on each partition, collect the minimum 10 values, merge them via treeAggregate to get the global minimum 10 values, and the max value among them is the threshold.

zhengruifeng (Contributor):

BoundedPriorityQueue only maintains the top-k entries, so it is safe for it to absorb a lot of entries.

srowen (Member):

approxQuantile already kind of works this way, so I think the point of this PR is avoiding several passes of Spark jobs for the tree reduce in this case.

However, it's a fair point. I wonder if, overall, this approach is faster than approxQuantile? It already does something like what you're suggesting.
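For reference, a sketch of the existing approxQuantile-based path this PR would bypass (the quantile and relativeError values shown are assumptions, not the exact expressions in LSH.scala):

// Estimate the distance at roughly the k/n quantile within a relative error,
// then use that estimate as the neighbor-selection threshold.
val quantile = numNearestNeighbors.toDouble / count
val Array(approxThreshold) = modelDatasetWithDist.stat.approxQuantile(
  distCol, Array(quantile), 0.001 /* hypothetical relativeError */)
modelDatasetWithDist.filter(col(distCol) <= approxThreshold)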

huaxingao (Contributor, Author):

I tried a small test. I used the existing BucketedRandomProjectionLSHSuite but made the dataset bigger:

val data = for (i <- -200 until 200; j <- -200 until 200) yield Vectors.dense(i * 10, j * 10)
dataset = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys")

So the dataset count is 160000, and I tested 10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, and 1000 nearest neighbors. I didn't see any performance gain from using BoundedPriorityQueue.
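A minimal sketch of the kind of timing harness behind such a comparison (the time helper is hypothetical; model and key are assumed to be a fitted BucketedRandomProjectionLSHModel and a query vector):

// Hypothetical micro-benchmark helper: wall-clock a block and print the label.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

for (k <- Seq(10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000)) {
  time(s"k = $k") {
    model.approxNearestNeighbors(dataset, key, k).collect()  // force evaluation
  }
}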

zhengruifeng (Contributor):

I wrongly thought that approxNearestNeighbors only returns an approximate threshold, in which case we could use top-k to obtain an exact threshold. Since approxNearestNeighbors already guarantees a sufficient threshold that takes the relative error into account, I guess we no longer need a top-k solution.
A BoundedPriorityQueue only maintains the top-k entries, so it should be much smaller than a QuantileSummaries; however, since there is only one column to process, there should be no performance gain.


zhengruifeng (Contributor):

A slight performance gain may come from the fact that BoundedPriorityQueue does not need a count job to compute the approxQuantile variable.

srowen (Member):

I agree that implementing this in terms of BoundedPriorityQueue is also a viable solution. I don't know which one is faster. I had assumed the approximate quantile would be, as I think it does less work overall, but I haven't tested it. That is, I think it will hang on to fewer than k entries per partition.

zhengruifeng (Contributor):

I am trying to do some performance tests; however, I found that in the public approxNearestNeighbors methods, singleProbe is always true for now, so our changes in LSH cannot be reached in this code path.

@zhengruifeng (Contributor)

I guess we do not need BoundedPriorityQueue any more; maybe it is OK to close this PR?

@srowen (Member) commented Dec 26, 2019

I think the question here is different from the one in #26948: is it worthwhile to avoid the distributed operation entirely if the dataset is small? It may or may not be faster, but it's more exact. I guess I'm neutral on it, just because it adds some complexity for a little gain, but I am not against it.

@huaxingao (Contributor, Author)

I will close this PR. Thanks for reviewing!

@huaxingao huaxingao closed this Dec 26, 2019
@huaxingao huaxingao deleted the spark-30120 branch December 26, 2019 22:49