This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[HIVEMALL-44][SPARK] Implement a prototype of Join with TopK for DataFrame/Spark #33

Closed
wants to merge 5 commits into apache:master from maropu:HIVEMALL-44

Conversation

maropu
Member

@maropu maropu commented Jan 31, 2017

What changes were proposed in this pull request?

This PR implements a prototype of Top-K joins. In Hivemall, each_top_k is useful for many practical use cases. On the other hand, there are cases where we need to join tables and then compute Top-K entries. We can express such a query with a regular join followed by each_top_k, but there is room to improve it further: we can compute the Top-K entries while processing the join itself, which avoids a substantial amount of I/O (a sketch of the join + rank baseline follows the example below).

An example query is as follows:

val inputDf = Seq(
  ("user1", 1, 0.3, 0.5),
  ("user2", 2, 0.1, 0.1),
  ("user3", 3, 0.8, 0.0),
  ("user4", 1, 0.9, 0.9),
  ("user5", 3, 0.7, 0.2),
  ("user6", 1, 0.5, 0.4),
  ("user7", 2, 0.6, 0.8)
).toDF("userId", "group", "x", "y")

val masterDf = Seq(
  (1, "pos1-1", 0.5, 0.1),
  (1, "pos1-2", 0.0, 0.0),
  (1, "pos1-3", 0.3, 0.3),
  (2, "pos2-3", 0.1, 0.3),
  (2, "pos2-3", 0.8, 0.8),
  (3, "pos3-1", 0.1, 0.7),
  (3, "pos3-1", 0.7, 0.1),
  (3, "pos3-1", 0.9, 0.0),
  (3, "pos3-1", 0.1, 0.3)
).toDF("group", "position", "x", "y")

// Compute top-1 rows for each group
val distance = sqrt(
  pow(inputDf("x") - masterDf("x"), lit(2.0)) +
  pow(inputDf("y") - masterDf("y"), lit(2.0))
)

val top1Df = inputDf.join_top_k(
  lit(1), masterDf, inputDf("group") === masterDf("group"),
  distance.as("score")
)
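
For comparison, the join + rank baseline in the benchmark below computes the same result with a regular join followed by a window rank. A minimal sketch, reusing inputDf and masterDf from above (illustrative only, not part of this PR; the descending order mirrors each_top_k's convention of keeping the highest-scored rows):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Join first, then score every matched pair
val joined = inputDf.join(masterDf, inputDf("group") === masterDf("group"))
  .withColumn("score",
    sqrt(
      pow(inputDf("x") - masterDf("x"), lit(2.0)) +
      pow(inputDf("y") - masterDf("y"), lit(2.0))))

// Rank the joined rows within each group and keep the top-1
val w = Window.partitionBy(inputDf("group")).orderBy(col("score").desc)
val top1Baseline = joined
  .withColumn("rank", rank().over(w))
  .where(col("rank") <= 1)
  .drop("rank")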

A quick benchmark is as follows:

    /**
     * Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13 on Mac OS X 10.10.2
     * Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
     *
     * top-k join (k=3):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
     * -------------------------------------------------------------------------------
     * join + rank                65959 / 71324          0.0      503223.9       1.0X
     * join + each_top_k          66093 / 78864          0.0      504247.3       1.0X
     * join_top_k                   5013 / 5431          0.0       38249.3      13.2X
     */
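
The gap comes from never materializing the full join output before ranking it. As a rough illustration of the idea (plain Scala collections, not the actual physical operator added by this PR; topKHashJoin is a hypothetical helper), a hash join can keep only a bounded heap of the k best-scored matches per join key while probing:

import scala.collection.mutable

def topKHashJoin[L, R, K](k: Int)(
    left: Seq[L], leftKey: L => K,
    right: Seq[R], rightKey: R => K,
    score: (L, R) => Double): Seq[(K, L, R, Double)] = {
  // Build phase: hash the (smaller) build side by join key
  val buildSide = right.groupBy(rightKey)
  // One bounded min-heap per join key; the head is the weakest of the current top-k
  val heaps = mutable.Map.empty[K, mutable.PriorityQueue[(K, L, R, Double)]]
  for (l <- left; r <- buildSide.getOrElse(leftKey(l), Nil)) {
    val key = leftKey(l)
    val heap = heaps.getOrElseUpdate(key,
      mutable.PriorityQueue.empty(Ordering.by[(K, L, R, Double), Double](_._4).reverse))
    val s = score(l, r)
    if (heap.size < k) heap.enqueue((key, l, r, s))
    else if (s > heap.head._4) { heap.dequeue(); heap.enqueue((key, l, r, s)) }
  }
  heaps.values.flatten.toSeq
}

At most k candidate pairs per key are kept at any time, so the full per-key cross product is never buffered or spilled.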

The APIs added in this PR are unstable and we might change them in follow-up work.

What type of PR is it?

Improvement

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-44

How was this patch tested?

Added tests in HivemallOpsSuite

@coveralls

coveralls commented Jan 31, 2017

Coverage Status

Coverage remained the same at 35.842% when pulling a7c70c9 on maropu:HIVEMALL-44 into ed0d139 on apache:master.

@myui
Member

myui commented Jan 31, 2017

The output of top1Df should be explained in the description. Also, a note about tail-k computation should be documented. Please add docs/gitbook/spark/topkjoin.md explaining the usage of this function.

BTW, I personally prefer top_k_join instead of join_top_k.

@coveralls

coveralls commented Jan 31, 2017

Coverage Status

Coverage remained the same at 35.842% when pulling 5ed10f2 on maropu:HIVEMALL-44 into ed0d139 on apache:master.

@coveralls

Coverage Status

Coverage increased (+0.3%) to 36.14% when pulling 5ed10f2 on maropu:HIVEMALL-44 into ed0d139 on apache:master.

@myui
Member

myui commented Jan 31, 2017

@maropu Waiting for the markdown doc to be included in this PR :-)

@maropu
Member Author

maropu commented Jan 31, 2017

Will do

@maropu
Member Author

maropu commented Feb 1, 2017

@myui Added a doc.

@coveralls

coveralls commented Feb 1, 2017

Coverage Status

Coverage remained the same at 35.842% when pulling c7a9838 on maropu:HIVEMALL-44 into c837e51 on apache:master.

@coveralls

Coverage Status

Coverage increased (+0.3%) to 36.14% when pulling c7a9838 on maropu:HIVEMALL-44 into c837e51 on apache:master.

@maropu
Member Author

maropu commented Feb 2, 2017

@myui ping

@asfgit asfgit closed this in b2032af Feb 2, 2017
@myui
Member

myui commented Feb 2, 2017

@maropu Thanks! Merged.

@maropu
Member Author

maropu commented Feb 2, 2017

@myui Many Thanks!
