This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[HIVEMALL-44][SPARK] Implement a prototype of Join with TopK for DataFrame/Spark #33

Closed
wants to merge 5 commits into apache:master from maropu:HIVEMALL-44

Conversation

maropu
Member

@maropu maropu commented Jan 31, 2017

What changes were proposed in this pull request?

This PR implements a prototype of Top-K joins. In Hivemall, each_top_k is useful for many practical use cases. On the other hand, there are cases where we need to join tables and then compute Top-K entries. We can express such a query with a regular join followed by each_top_k, but there is room to improve it further: we can compute the Top-K entries while processing the join itself, which avoids a substantial amount of I/O (a sketch of the join + rank baseline follows the example below).

An example query is as follows:

val inputDf = Seq(
  ("user1", 1, 0.3, 0.5),
  ("user2", 2, 0.1, 0.1),
  ("user3", 3, 0.8, 0.0),
  ("user4", 1, 0.9, 0.9),
  ("user5", 3, 0.7, 0.2),
  ("user6", 1, 0.5, 0.4),
  ("user7", 2, 0.6, 0.8)
).toDF("userId", "group", "x", "y")

val masterDf = Seq(
  (1, "pos1-1", 0.5, 0.1),
  (1, "pos1-2", 0.0, 0.0),
  (1, "pos1-3", 0.3, 0.3),
  (2, "pos2-3", 0.1, 0.3),
  (2, "pos2-3", 0.8, 0.8),
  (3, "pos3-1", 0.1, 0.7),
  (3, "pos3-1", 0.7, 0.1),
  (3, "pos3-1", 0.9, 0.0),
  (3, "pos3-1", 0.1, 0.3)
).toDF("group", "position", "x", "y")

// Compute top-1 rows for each group
val distance = sqrt(
  pow(inputDf("x") - masterDf("x"), lit(2.0)) +
  pow(inputDf("y") - masterDf("y"), lit(2.0))
)

val top1Df = inputDf.join_top_k(
  lit(1), masterDf, inputDf("group") === masterDf("group"),
  distance.as("score")
)
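
For comparison, the join + rank baseline in the benchmark below computes the same result with a regular join followed by a window rank. A minimal sketch, reusing inputDf and masterDf from above (illustrative only, not part of this PR; the descending order mirrors each_top_k's convention of keeping the highest-scored rows):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Join first, then score every matched pair
val joined = inputDf.join(masterDf, inputDf("group") === masterDf("group"))
  .withColumn("score",
    sqrt(
      pow(inputDf("x") - masterDf("x"), lit(2.0)) +
      pow(inputDf("y") - masterDf("y"), lit(2.0))))

// Rank the joined rows within each group and keep the top-1
val w = Window.partitionBy(inputDf("group")).orderBy(col("score").desc)
val top1Baseline = joined
  .withColumn("rank", rank().over(w))
  .where(col("rank") <= 1)
  .drop("rank")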

A quick benchmark is as follows:

    /**
     * Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13 on Mac OS X 10.10.2
     * Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
     *
     * top-k join (k=3):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
     * -------------------------------------------------------------------------------
     * join + rank                65959 / 71324          0.0      503223.9       1.0X
     * join + each_top_k          66093 / 78864          0.0      504247.3       1.0X
     * join_top_k                   5013 / 5431          0.0       38249.3      13.2X
     */
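
The gap comes from never materializing the full join output before ranking it. As a rough illustration of the idea (plain Scala collections, not the actual physical operator added by this PR; topKHashJoin is a hypothetical helper), a hash join can keep only a bounded heap of the k best-scored matches per join key while probing:

import scala.collection.mutable

def topKHashJoin[L, R, K](k: Int)(
    left: Seq[L], leftKey: L => K,
    right: Seq[R], rightKey: R => K,
    score: (L, R) => Double): Seq[(K, L, R, Double)] = {
  // Build phase: hash the (smaller) build side by join key
  val buildSide = right.groupBy(rightKey)
  // One bounded min-heap per join key; the head is the weakest of the current top-k
  val heaps = mutable.Map.empty[K, mutable.PriorityQueue[(K, L, R, Double)]]
  for (l <- left; r <- buildSide.getOrElse(leftKey(l), Nil)) {
    val key = leftKey(l)
    val heap = heaps.getOrElseUpdate(key,
      mutable.PriorityQueue.empty(Ordering.by[(K, L, R, Double), Double](_._4).reverse))
    val s = score(l, r)
    if (heap.size < k) heap.enqueue((key, l, r, s))
    else if (s > heap.head._4) { heap.dequeue(); heap.enqueue((key, l, r, s)) }
  }
  heaps.values.flatten.toSeq
}

At most k candidate pairs per key are kept at any time, so the full per-key cross product is never buffered or spilled.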

The APIs added in this PR are unstable and we might change them in follow-up work.

What type of PR is it?

Improvement

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-44

How was this patch tested?

Added tests in HivemallOpsSuite

@coveralls

coveralls commented Jan 31, 2017

Coverage Status

Coverage remained the same at 35.842% when pulling a7c70c9 on maropu:HIVEMALL-44 into ed0d139 on apache:master.

@myui
Member

myui commented Jan 31, 2017

The output of top1Df should be explained in the description. Also, a note about tail-k computation should be documented. Please add docs/gitbook/spark/topkjoin.md explaining the usage of this function.

BTW, I personally prefer top_k_join instead of join_top_k.

@coveralls

coveralls commented Jan 31, 2017

Coverage Status

Coverage remained the same at 35.842% when pulling 5ed10f2 on maropu:HIVEMALL-44 into ed0d139 on apache:master.

@coveralls

Coverage Status

Coverage increased (+0.3%) to 36.14% when pulling 5ed10f2 on maropu:HIVEMALL-44 into ed0d139 on apache:master.

@myui
Member

myui commented Jan 31, 2017

@maropu Waiting for the markdown doc to be included in this PR :-)

@maropu
Member Author

maropu commented Jan 31, 2017

Will do

@maropu
Member Author

maropu commented Feb 1, 2017

@myui Added a doc.

@coveralls

coveralls commented Feb 1, 2017

Coverage Status

Coverage remained the same at 35.842% when pulling c7a9838 on maropu:HIVEMALL-44 into c837e51 on apache:master.

@coveralls

Coverage Status

Coverage increased (+0.3%) to 36.14% when pulling c7a9838 on maropu:HIVEMALL-44 into c837e51 on apache:master.

@maropu
Member Author

maropu commented Feb 2, 2017

@myui ping

@asfgit asfgit closed this in b2032af Feb 2, 2017
@myui
Member

myui commented Feb 2, 2017

@maropu Thanks! Merged.

@maropu
Member Author

maropu commented Feb 2, 2017

@myui Many Thanks!
