
[SPARK-30120][ML] Use BoundedPriorityQueue for small dataset in LSH approxNearestNeighbors #26858

Closed · wants to merge 1 commit

Conversation

huaxingao (Contributor)

What changes were proposed in this pull request?

Use BoundedPriorityQueue for small datasets in LSH.approxNearestNeighbors

Why are the changes needed?

For small datasets, we can get the exact result instead of using approxQuantile.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Use existing unit tests
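The idea, as a minimal sketch (modelDatasetWithDist, distCol, and numNearestNeighbors are the surrounding names in LSH.scala; BoundedPriorityQueue is Spark-internal (private[spark]), so this only compiles inside Spark's own source tree; the exact integration differs in the patch under review):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.util.BoundedPriorityQueue

// Collect the distances of the (small) dataset on the driver, keep only the
// numNearestNeighbors smallest of them, and filter with that exact threshold.
val queue = new BoundedPriorityQueue[Double](numNearestNeighbors)(Ordering[Double].reverse)
modelDatasetWithDist.select(distCol).collect().foreach {
  case Row(dist: Double) => queue += dist
}
val thresholdDist = queue.max  // the k-th smallest distance overall
modelDatasetWithDist.filter(col(distCol) <= thresholdDist)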

if (approxQuantile >= 1) {
modelDatasetWithDist
// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
huaxingao (Contributor, Author):

What is a good number to use here?

@huaxingao (Contributor, Author)

cc @zhengruifeng @srowen

@SparkQA commented Dec 12, 2019

Test build #115214 has finished for PR 26858 at commit 48a91ee.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor)

I am afraid this PR is wrong.

Use BoundedPriorityQueue for small dataset in LSH approxNearestNeighbors

Not for a small dataset but for a small numNearestNeighbors. Only when numNearestNeighbors is small can we use a max-heap (BoundedPriorityQueue) to directly obtain the top entries.
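A small illustration of that point (BoundedPriorityQueue is Spark's internal bounded heap; this sketch assumes it is run inside Spark's source tree, where it is accessible):

import org.apache.spark.util.BoundedPriorityQueue

// Keep only the 3 smallest values ever inserted: under the reversed ordering,
// numerically small values count as "large", so the bounded queue retains them.
val q = new BoundedPriorityQueue[Double](3)(Ordering[Double].reverse)
(1 to 1000000).foreach(i => q += i.toDouble)  // memory stays O(3), not O(n)
println(q.toArray.sorted.mkString(", "))      // prints: 1.0, 2.0, 3.0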

@srowen (Member) left a comment:

Hm, yeah, it might be worthwhile to optimize this case. I'd keep the size limit small. How much does it speed things up, though?

modelDatasetWithDist
// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
val queue = new BoundedPriorityQueue[Double](count.toInt)(Ordering[Double])
srowen (Member):

Shouldn't the queue just need numNearestNeighbors elements?

huaxingao (Contributor, Author):

Sorry, I should use numNearestNeighbors. Will fix this.
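A sketch of the corrected construction (note that keeping the k smallest distances also needs the reversed ordering, as in the treeAggregate suggestion later in this thread):

// Bound the queue by the number of neighbors requested, not the dataset size;
// the reversed ordering makes it retain minima rather than maxima.
val queue = new BoundedPriorityQueue[Double](numNearestNeighbors)(Ordering[Double].reverse)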

}
var sortedDistCol = queue.toArray.sorted(Ordering[Double])
queue.clear()
modelDatasetWithDist.filter(col(distCol) <= sortedDistCol(numNearestNeighbors - 1))
srowen (Member):

I might pull sortedDistCol(numNearestNeighbors - 1) into a val or else this has to send the whole array. (You don't have to clear the queue)
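A sketch of the suggested change, hoisting the threshold into a val:

// Extract the single Double threshold first instead of referencing the
// sorted array inside the filter expression.
val thresholdDist = sortedDistCol(numNearestNeighbors - 1)
modelDatasetWithDist.filter(col(distCol) <= thresholdDist)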

huaxingao (Contributor, Author):

Will fix this. Thanks!

// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
val queue = new BoundedPriorityQueue[Double](count.toInt)(Ordering[Double])
modelDatasetWithDist.collect().foreach { case Row(keys, values, distCol: Double) =>
srowen (Member):

keys and values can be _
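That is, a sketch (assuming the three-column row layout from the excerpt above, where only the distance is needed):

modelDatasetWithDist.collect().foreach { case Row(_, _, dist: Double) =>
  queue += dist  // the key and value columns are matched with _ and ignored
}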

@srowen (Member) commented Dec 12, 2019

@zhengruifeng I think the logic does have to depend on the size of the dataset; you don't want to collect 10M elements to find 10 nearest neighbors. But I do think the priority queue size needs to be numNearestNeighbors, yes, if that's what you mean.

modelDatasetWithDist
// for a small dataset, use BoundedPriorityQueue
if (count < 1000) {
val queue = new BoundedPriorityQueue[Double](count.toInt)(Ordering[Double])
zhengruifeng (Contributor):

This place should be something like:

val exactThreshold = modelDatasetWithDist
  .select(distCol)
  .as[Double]
  .rdd
  // Each partition keeps only its numNearestNeighbors smallest distances
  // (the reversed ordering makes the bounded queue retain minima); the
  // per-partition queues are merged tree-wise, and the largest retained
  // value is the exact k-th smallest distance overall.
  .treeAggregate(new BoundedPriorityQueue[Double](numNearestNeighbors)(Ordering[Double].reverse))(
    seqOp = (q, v) => q += v,
    combOp = (q1, q2) => q1 ++= q2,
    depth = 2
  ).toArray.max

And this implementation has no dependency on the size of the dataset.

zhengruifeng (Contributor):

This only depends on numNearestNeighbors, when it is small (maybe < 10000?). For example, with numNearestNeighbors = 10: on each partition, collect the minimum 10 values, merge them via treeAggregate to get the global minimum 10 values, and the max value among them is the threshold.

zhengruifeng (Contributor):

BoundedPriorityQueue only maintains the top-k entries, so it is safe for it to absorb a lot of entries.

srowen (Member):

approxQuantile already kind of works this way, so I think the point of this PR is avoiding several passes of Spark jobs for the tree reduce in this case.

However, it's a fair point. I wonder if, overall, this approach is faster than approxQuantile? It already does something like what you're suggesting.
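For reference, a sketch of the existing approxQuantile-based path this PR would bypass (the quantile and relativeError values shown are assumptions, not the exact expressions in LSH.scala):

// Estimate the distance at roughly the k/n quantile within a relative error,
// then use that estimate as the neighbor-selection threshold.
val quantile = numNearestNeighbors.toDouble / count
val Array(approxThreshold) = modelDatasetWithDist.stat.approxQuantile(
  distCol, Array(quantile), 0.001 /* hypothetical relativeError */)
modelDatasetWithDist.filter(col(distCol) <= approxThreshold)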

huaxingao (Contributor, Author):

I tried a small test. I used the existing BucketedRandomProjectionLSHSuite but made the dataset bigger:

val data = for (i <- -200 until 200; j <- -200 until 200) yield Vectors.dense(i * 10, j * 10)
dataset = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys")

So the dataset count is 160000, and I tested 10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, and 1000 nearest neighbors. I didn't see any performance gain from using BoundedPriorityQueue.
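A minimal sketch of the kind of timing harness behind such a comparison (the time helper is hypothetical; model and key are assumed to be a fitted BucketedRandomProjectionLSHModel and a query vector):

// Hypothetical micro-benchmark helper: wall-clock a block and print the label.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

for (k <- Seq(10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000)) {
  time(s"k = $k") {
    model.approxNearestNeighbors(dataset, key, k).collect()  // force evaluation
  }
}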

zhengruifeng (Contributor):

I wrongly thought that approxNearestNeighbors only returns an approximate threshold, in which case we could use top-k to obtain an exact threshold. Since approxNearestNeighbors already guarantees a sufficient threshold that takes the relative error into account, I guess we no longer need a top-k solution.
A BoundedPriorityQueue only maintains the top-k entries, so it should be much smaller than a QuantileSummaries; however, since there is only one column to process, there should be no performance gain.


zhengruifeng (Contributor):

A slight performance gain may come from the fact that BoundedPriorityQueue does not need a count job to compute the approxQuantile variable.

srowen (Member):

I agree that implementing this in terms of BoundedPriorityQueue is also a viable solution. I don't know which one is faster. I had assumed the approximate quantile would be, as I think it does less work overall, but I haven't tested it. That is, I think it will hang on to fewer than k entries per partition.

zhengruifeng (Contributor):

I am trying to do some performance tests; however, I found that in the public approxNearestNeighbors methods, singleProbe is always true for now, so our changes in LSH cannot be reached in this code path.

@zhengruifeng (Contributor)

I guess we do not need BoundedPriorityQueue any more; maybe it is OK to close this PR?

@srowen (Member) commented Dec 26, 2019

I think the question here is different from the one in #26948: is it worthwhile to avoid the distributed operation entirely if the dataset is small? It may or may not be faster, but it's more exact. I guess I'm neutral on it, just because it adds some complexity for a little gain, but I am not against it.

@huaxingao (Contributor, Author)

I will close this PR. Thanks for reviewing!

@huaxingao huaxingao closed this Dec 26, 2019
@huaxingao huaxingao deleted the spark-30120 branch December 26, 2019 22:49