
[SPARK-8682][SQL][WIP] Range Join #7379

Closed
wants to merge 8 commits

Conversation

hvanhovell
Contributor

...copied from JIRA (SPARK-8682):

Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered Cartesian Join) when it has to execute the following range query:

SELECT A.*,
       B.*
FROM   tableA A
       JOIN tableB B
        ON A.start <= B.end
         AND A.end > B.start

This is horribly inefficient. The performance of this query can be greatly improved, when one of the tables can be broadcast, by creating a range index. A range index is basically a sorted map containing the rows of the smaller table, indexed by both the high and low keys. Using this structure, the complexity of the query drops from O(N * M) to O(N * 2 * LOG(M)), where N is the number of records in the larger table and M is the number of records in the smaller (indexed) table.
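A minimal sketch of the range-index idea in plain Scala (the `Row` class and all names here are hypothetical, not the PR's actual implementation, and this single-keyed version is a simplification of the double-keyed index described above): keep the smaller table sorted by its low key, binary-search for the last candidate row, and filter the candidates on the high key.

```scala
// Hypothetical sketch, not the PR's code: a range index over the
// broadcast side of the join. Rows carry an interval [start, end].
case class Row(start: Int, end: Int, payload: String)

class RangeIndex(rows: Seq[Row]) {
  // Rows sorted by their low key, so a binary search can locate
  // the candidate range instead of scanning the whole table.
  private val byStart: Array[Row] = rows.sortBy(_.start).toArray
  private val starts: Array[Int]  = byStart.map(_.start)

  // Index of the first row whose start is > key (binary search).
  private def upperBound(key: Int): Int = {
    var lo = 0
    var hi = starts.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (starts(mid) <= key) lo = mid + 1 else hi = mid
    }
    lo
  }

  // All indexed rows overlapping the probe interval, i.e. rows
  // satisfying start <= probeEnd && end > probeStart -- the same
  // shape as the join predicate in the query above.
  def probe(probeStart: Int, probeEnd: Int): Seq[Row] =
    byStart.take(upperBound(probeEnd)).filter(_.end > probeStart).toSeq
}
```

Each probe then costs one O(LOG(M)) search plus a pass over the candidates, which is where the O(N * LOG(M))-style bound in the description comes from.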

This is currently a work in progress. I will be adding more tests and a small benchmark in the next couple of days. If you want to try this out, set the spark.sql.planner.rangeJoin option to true in the SQL configuration.

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37182 has finished for PR 7379 at commit d2bd793.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastRangeJoin(

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37193 has finished for PR 7379 at commit 6727807.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastRangeJoin(

@hvanhovell
Contributor Author

Current test errors are a bit weird. They shouldn't have been caused by this change, because the functionality is disabled by default.

Rebased onto the most recent master; let's see if this helps.

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37229 has finished for PR 7379 at commit 773c009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Least(children: Seq[Expression]) extends Expression
    • case class Greatest(children: Seq[Expression]) extends Expression
    • case class BroadcastRangeJoin(

@marmbrus
Contributor

This looks pretty cool! I can try to do a more thorough review in a bit, but a few testing suggestions:

It would be great to add a test for the query planner in PlannerSuite.
I would also add some unit tests for the operator itself, probably in a new class modeled after OuterJoinSuite.

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37448 has finished for PR 7379 at commit b405e45.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastRangeJoin(

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37456 has finished for PR 7379 at commit 8204eae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastRangeJoin(

private[this] var iterator: Iterator[InternalRow] = Iterator.empty

override final def hasNext: Boolean = {
  var result = iterator.hasNext
Contributor
It's a very rare case, but we cannot assume that the user will not call hasNext multiple times before calling next().

Contributor Author

Multiple calls to hasNext shouldn't be a problem. Granted, the first call can have a side effect (updating the state of the iterator), but the subsequent ones won't.

A problem will occur when next is called without calling hasNext first. I was inspired by the HashedRelation class in the same package when writing this.

@chenghao-intel
Contributor

This is a very interesting optimization, but would it be more general if we considered combining it with SortMergeJoin? As well as cases like:

SELECT A.*,
       B.*
FROM   tableA A
       JOIN tableB B
        ON A.start <= B.start

@hvanhovell
Contributor Author

The <= case is quite easy to implement.

This implementation is currently targeted at range joining a rather small (broadcastable) table to an arbitrarily large table. I don't think this matches the use case of SMJ, i.e. equi-joining arbitrarily large tables. But I might be missing something?

override final def hasNext: Boolean = {
  var result = iterator.hasNext
  while (!result && stream.hasNext) {
    row = stream.next()
Contributor

Sorry, what I actually meant here is that we may skip some rows if hasNext is called multiple times.

Contributor Author

Ah, I see. The current (WIP) implementation only allows inner joins, and we drop rows if they don't have a match in the index. Outer joins are possible; however, build-side outer joins will require a bit of bookkeeping.

@chenghao-intel
Contributor

Sorry, I shouldn't have used the word SMJ.
I mean that if we are planning to improve the performance of range joins, we can probably think about it in a more general way: not just binary conjunctive predicates, but also arbitrary n-ary predicates.

e.g.
join on a.key < b.key
join on a.key < b.key and a.key2 > b.key2 and a.key3 >= b.key3 ...

I just have a feeling that it might be more helpful/easier if we specified the same RangePartitioner (with the range key as the sort key) for the table(s), and then took the Cartesian product of the partitions.

The binary BroadcastRangeJoin would then just be a specific case, and lots of code could definitely be shared (like the binary search for the closest tuple).

Sorry if I misunderstood something.

@hvanhovell
Contributor Author

No problem.

Supporting N-Ary Predicates

In order to make the range join work we need the predicates to define a single interval for each side of the join. For instance the clause: a.low < b.high && b.low < a.high implies that there are two intervals: [a.low, a.high] & [b.low, b.high]. An open interval, for instance a.low < b.high, would also work.
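For concreteness, the two clauses together are exactly the classic interval-overlap test. A tiny illustration in Scala (the `Interval` class is hypothetical, introduced only for this example):

```scala
// Hypothetical illustration: the pair of range-join clauses
// a.low < b.high && b.low < a.high is the standard test for
// whether the intervals [a.low, a.high] and [b.low, b.high] overlap.
case class Interval(low: Int, high: Int)

def overlaps(a: Interval, b: Interval): Boolean =
  a.low < b.high && b.low < a.high
```

Note that with strict `<` two intervals that merely touch at an endpoint do not count as overlapping; swapping in `<=` gives the closed-interval variant.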

When we use more than two clauses, we can potentially have multiple intervals. In your example, for instance, a.key < b.key and a.key2 > b.key2 and a.key3 >= b.key3 would yield the following intervals: [a.key1, a.key2], [a.key1, a.key3], [b.key2, b.key1] & [b.key2, b.key3]. Creating a working index that can deal with the (partially) uncorrelated intervals will be quite a challenge (I haven't really looked into this yet). We could of course pick one pair of intervals to join on and use filtering to take care of the rest.

I think the Unary and Binary cases are the most common. Let's start there, and see if there is demand for N-ary designs.

Generalization

If you consider the fact that we are joining intervals (ranges, if you will), range partitioning will not work, because it assumes each interval falls entirely within a single partition (intervals can span multiple partitions). When dealing with larger tables we would have to use a special interval-aware partitioning: this would create partitions for a number of fully covering, non-overlapping intervals, and would multicast each row to every interval it belongs to. The subsequent step would be using an index or doing a Cartesian/BNL join.
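The multicast step could be sketched as follows (a hypothetical illustration with fixed split points, not anything from this PR): each row is sent to every partition its interval overlaps, so two overlapping intervals are guaranteed to meet in at least one partition.

```scala
// Hypothetical sketch of interval-aware partitioning. The key space is
// split at fixed boundaries into partitions
//   (-inf, b0), [b0, b1), ..., [b_last, +inf)
// and a row with interval [low, high) is multicast to every partition
// whose range it overlaps.
def partitionsFor(low: Int, high: Int, bounds: Seq[Int]): Seq[Int] = {
  val n = bounds.length
  (0 to n).filter { p =>
    val pLow  = if (p == 0) Int.MinValue else bounds(p - 1)
    val pHigh = if (p == n) Int.MaxValue else bounds(p)
    low < pHigh && pLow < high // interval overlaps this partition's range
  }
}
```

A wide interval lands in several partitions, which is exactly the duplication ("multicast") that plain range partitioning cannot express.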

Doing a Cartesian join in a single partition performs horribly. I thought it wouldn't be a problem either, but it completely killed the performance of an analysis I was doing for a client (account balances at specific dates).

I do see opportunities for code re-use, but this would be by generalizing HashedRelation and the broadcast join family.

val ordering = TypeUtils.getOrdering(buildKeys.head.dataType)

// Note that we use .execute().collect() because we don't want to convert data to Scala types
// TODO find out if the result of a sort and a collect is still sorted.
Contributor

it should be

@marmbrus
Contributor

marmbrus commented Sep 3, 2015

@hvanhovell thanks for working on this! To keep the PR queue manageable I propose we close this issue for now until you have time to bring it up to date and remove the WIP tag.

@asfgit asfgit closed this in 804a012 Sep 4, 2015
@saj1th

saj1th commented Jan 19, 2017

Facing huge performance issues with range joins. Hoping to see this implemented.

@zzeekk

zzeekk commented Feb 16, 2017

Same here. A workaround is to build blocks and add them as an equi-join condition. But then you need to do an additional join on the following block and coalesce the results.

@IceMan81

IceMan81 commented Jun 6, 2017

@zzeekk Would you mind explaining how your workaround works?

A Workaround is to build blocks and add them as equi-join condition

Not sure I understand what you are suggesting here.

@marmbrus The inability to do range joins efficiently results in very poor performance. Are there plans to address this directly in an upcoming release? I have scenarios where the optimizer sorts the results into a single partition for the join (all other partitions are empty) because the sort does not include the columns in the range condition. That task runs for more than a day, while a forced broadcast version of it runs in 3 hours. And here I'm only able to do the broadcast because I'm using a smaller data set on one side of the join.

@zzeekk

zzeekk commented Jun 9, 2017

@IceMan81 Here is an abstract example of our workaround, building blocks as additional equi-join conditions.
The task is to join points (df1) on a line back to segments of the same line (df2): df1.id_line = df2.id_line and df1.position >= df2.position_from and df1.position < df2.position_to.
By building blocks for the position, we can add a more detailed equi-join condition, but we need to join twice and coalesce the results to catch the case where position_from is in a different block than position. Additional note: the block size (/10) must be adapted to the expected data:

  df1.as("df1")
  .join( df2.as("df2a"), $"df1.id_line"===$"df2a.id_line" and floor($"df1.position"/10)===floor($"df2a.position_from"/10) 
                           and $"df1.position">=$"df2a.position_from" and $"df1.position"<$"df2a.position_to", "left")
  .join( df2.as("df2b"), $"df1.id_line"===$"df2b.id_line" and floor($"df1.position"/10)===floor($"df2b.position_to"/10) 
                           and $"df1.position">=$"df2b.position_from" and $"df1.position"<$"df2b.position_to", "left")

@IceMan81

@zzeekk Okay, I get the idea. But what would you do for timestamp ranges; how would you get additional equi-join conditions? The idea of floor($"df1.position"/10) === floor($"df2a.position_from"/10) (or, in the case of timestamps, $"df1.timestamp" - interval === $"df2a.timestamp" - interval) wouldn't apply, as you may not have timestamps that match that condition.

@zzeekk

zzeekk commented Jun 19, 2017

Hello @IceMan81, you need to truncate your timestamps to days, hours, or minutes depending on your use case, and use that for the additional equi-join condition.
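As a hypothetical sketch in plain Scala (java.time, not Spark's API), truncating to the hour gives the coarse block key, analogous to floor(position / 10) above; the granularity has to be tuned to the expected interval widths:

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical sketch: derive a coarse equi-join block key from a
// timestamp by truncating it to the hour. Two timestamps get the same
// key iff they fall in the same hour, so as in the position/10 example
// above, the blocked equi-join must still be done twice (current and
// neighbouring block) and the results coalesced.
def hourBlock(ts: Instant): Long =
  ts.truncatedTo(ChronoUnit.HOURS).getEpochSecond / 3600
```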
