Introduce 'LOOKUP' Transform Function #6383

cbalci · 2020-12-25T00:07:53Z

Description

This PR introduces a new transform function, LookupTransformFunction, as part of the Join project described in Lookup UDF Join In Pinot. It is a follow-up to the addition of DimensionTableDataManager in #6346.

LOOKUP is a regular transform function which uses the previously added DimensionTableDataManager to execute a lookup from a Dimension table. The call signature is as follows:

LOOKUP(TableName, ColumnName, JoinKey, JoinValue [, JoinKey2, JoinValue2 ...])

TableName: name of the dimension table to use
ColumnName: name of the dimension table column to look up
JoinKey: primary key column name of the dimension table. Note: only primary key columns are supported for JoinKey
JoinValue: primary key value
*If the dimension table has more than one primary key column (composite PK), you can add more keys and values in the remaining args: JoinKey2, JoinValue2, etc.

Example Query:

SELECT
    baseballStats.playerName,
    baseballStats.teamID,
    LOOKUP('dimBaseballTeams', 'teamName', 'teamID', baseballStats.teamID)
FROM
    baseballStats
LIMIT 10

The example above joins the dimension table 'dimBaseballTeams' to the regular table 'baseballStats' on the 'teamID' key. The lookup function returns the value of the column 'teamName'.
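As a rough mental model of these semantics (not Pinot's implementation), LOOKUP behaves like a hash-map probe keyed by the dimension table's primary key values. The following is a minimal, self-contained Java sketch; the class LookupSketch, its methods, and the sample team row are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LookupSketch {
  // Hypothetical stand-in for a dimension table:
  // primary key values -> row (column name -> value).
  private final Map<List<Object>, Map<String, Object>> rowsByPrimaryKey = new HashMap<>();

  public void addRow(List<Object> primaryKey, Map<String, Object> row) {
    rowsByPrimaryKey.put(primaryKey, row);
  }

  // Returns the requested column of the row matching the key values,
  // or null when no row matches.
  public Object lookup(String columnName, Object... primaryKeyValues) {
    Map<String, Object> row = rowsByPrimaryKey.get(Arrays.asList(primaryKeyValues));
    return row == null ? null : row.get(columnName);
  }

  public static void main(String[] args) {
    LookupSketch dimBaseballTeams = new LookupSketch();
    // Illustrative sample row, not taken from the quickstart data.
    dimBaseballTeams.addRow(List.of("BOS"), Map.of("teamID", "BOS", "teamName", "Boston Red Sox"));

    System.out.println(dimBaseballTeams.lookup("teamName", "BOS")); // matching key
    System.out.println(dimBaseballTeams.lookup("teamName", "XXX")); // no matching key
  }
}
```

A composite primary key fits the same shape: pass every key value, in primary key order, as the trailing arguments.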

To see the function in action, you can also launch JoinQuickstart and run the example query above.

Testing

Unit tests are included to cover basic functionality, and I also added a sample usage (table creation + query) in JoinQuickstart.
End-to-end functionality was also tested in a previous POC pull request here.

Documentation

Comprehensive documentation explaining the usage of 'Dimension' tables is being worked on and will be added to the https://github.com/pinot-contrib/pinot-docs repository in a separate PR.
LookupTransformFunction usage and a sample query are documented in the class Javadoc.

codecov-io · 2020-12-25T00:45:07Z

Codecov Report

Merging #6383 (4577c21) into master (1beaab5) will decrease coverage by 0.95%.
The diff coverage is 56.80%.

@@            Coverage Diff             @@
##           master    #6383      +/-   ##
==========================================
- Coverage   66.44%   65.49%   -0.96%     
==========================================
  Files        1075     1315     +240     
  Lines       54773    63797    +9024     
  Branches     8168     9293    +1125     
==========================================
+ Hits        36396    41781    +5385     
- Misses      15700    19034    +3334     
- Partials     2677     2982     +305

Flag	Coverage Δ
unittests	`65.49% <56.80%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...e/pinot/broker/api/resources/PinotBrokerDebug.java	`0.00% <0.00%> (-79.32%)`	⬇️
...ot/broker/broker/AllowAllAccessControlFactory.java	`71.42% <ø> (-28.58%)`	⬇️
.../helix/BrokerUserDefinedMessageHandlerFactory.java	`33.96% <0.00%> (-32.71%)`	⬇️
...ker/routing/instanceselector/InstanceSelector.java	`100.00% <ø> (ø)`
...ava/org/apache/pinot/client/AbstractResultSet.java	`66.66% <0.00%> (+9.52%)`	⬆️
.../main/java/org/apache/pinot/client/Connection.java	`35.55% <0.00%> (-13.29%)`	⬇️
...inot/client/JsonAsyncHttpPinotClientTransport.java	`10.90% <0.00%> (-51.10%)`	⬇️
...not/common/assignment/InstancePartitionsUtils.java	`73.80% <ø> (+0.63%)`	⬆️
...common/config/tuner/NoOpTableTableConfigTuner.java	`100.00% <ø> (ø)`
...ot/common/config/tuner/RealTimeAutoIndexTuner.java	`100.00% <ø> (ø)`
... and 1175 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33de6dc...4577c21. Read the comment docs.

xiangfu0 · 2020-12-25T23:30:03Z

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

+ */
+public class LookupTransformFunction extends BaseTransformFunction {
+    public static final String FUNCTION_NAME = "lookUp";
+    private static final String TABLE_NAME_SUFFIX = "_OFFLINE";


Do we need to assume a dim table is always an offline table?

Hi Xiang, yes, you're right. In the current design, a Dimension Table has the following constraints:

Must be of OFFLINE type

Must have a primary key (we support lookups by primary key for now)

Must have ingestion type REFRESH

Please check out #6286 and #6346 to see implementation details.
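The three constraints above can be read as a simple validation check. The following is a minimal sketch only: TableSpec and violations are hypothetical names invented for this illustration and are not Pinot's TableConfig, Schema, or validation code.

```java
import java.util.ArrayList;
import java.util.List;

public class DimTableConstraintsSketch {
  // Hypothetical, simplified table description; not Pinot's TableConfig or Schema.
  static final class TableSpec {
    final String tableType;
    final List<String> primaryKeyColumns;
    final String ingestionType;

    TableSpec(String tableType, List<String> primaryKeyColumns, String ingestionType) {
      this.tableType = tableType;
      this.primaryKeyColumns = primaryKeyColumns;
      this.ingestionType = ingestionType;
    }
  }

  // Returns one message per violated constraint; an empty list means the spec qualifies.
  static List<String> violations(TableSpec spec) {
    List<String> problems = new ArrayList<>();
    if (!"OFFLINE".equals(spec.tableType)) {
      problems.add("dimension table must be of OFFLINE type");
    }
    if (spec.primaryKeyColumns == null || spec.primaryKeyColumns.isEmpty()) {
      problems.add("dimension table must have a primary key");
    }
    if (!"REFRESH".equals(spec.ingestionType)) {
      problems.add("dimension table must use ingestion type REFRESH");
    }
    return problems;
  }

  public static void main(String[] args) {
    TableSpec ok = new TableSpec("OFFLINE", List.of("teamID"), "REFRESH");
    TableSpec bad = new TableSpec("REALTIME", List.of(), "APPEND");
    System.out.println(violations(ok));  // no violations
    System.out.println(violations(bad)); // one message per broken constraint
  }
}
```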

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

yupeng9 · 2020-12-27T23:05:41Z

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

+                    pkColumns[i] = ArrayUtils.toObject(tf.transformToLongValuesSV(projectionBlock));
+                    break;
+                default:
+                    throw new IllegalStateException("Unknown column type for primary key");


how about byte?

Updated to support all types: INT, LONG, FLOAT, DOUBLE, STRING, BYTES

FYI, doubles are hard to compare for equality.

For lookupByPrimaryKey we are relying on the PrimaryKey.hashCode implementation you added here. Here is the dimension table HashMap which is keyed by 'PrimaryKey's. Let me know if you see any potential issues with this usage.
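To make the concern about doubles concrete: a hash-based lookup is an exact match, so a double key is found only when it is bit-for-bit the value that was stored. The following sketch uses a plain HashMap<Double, String> as an assumed stand-in for the dimension table map; it does not use Pinot's PrimaryKey class.

```java
import java.util.HashMap;
import java.util.Map;

public class DoubleKeySketch {
  // Looks up a double key in a tiny map that contains a single entry keyed by 0.3.
  static String find(double key) {
    Map<Double, String> table = new HashMap<>();
    table.put(0.3, "row for 0.3");
    // Double.equals compares the exact bit patterns of the two values.
    return table.get(key);
  }

  public static void main(String[] args) {
    System.out.println(find(0.3));       // found: same bit pattern as the stored key
    System.out.println(find(0.1 + 0.2)); // null: the sum is 0.30000000000000004, not 0.3
  }
}
```

Keys that are stored and read back unchanged will match; keys produced by floating-point arithmetic may not.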

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

* Add DimensionTableData manager
* Address review comments.
* CLose reader after using.
* Revisit javadocs.
* Release segment after use.
* Touch up instance instantiation.
* Cleanup segment in test.
* Release segments in "finally" block.
* Update logs.
* Add TableConfig validations for Dim Tables.
* Seperate IngestionConfigTests for dim tables.
* Remove defensive null checks.
* Fix github action profile name.
* Fix ingestionTest dependencies.
* Undo the gihub-actions mvn profile name fix.

...core/src/main/java/org/apache/pinot/core/data/manager/offline/DimensionTableDataManager.java

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

Jackie-Jiang · 2020-12-29T20:25:33Z

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

+          break;
+        case BYTES:
+          byte[][] primitiveValues = tf.transformToBytesValuesSV(projectionBlock);
+          pkColumns[i] = new Byte[numDocuments][];


pkColumns should be stored as ByteArray[numDocuments]. Please add a test for all these data types to ensure they work.

Double checked and can't see how this is wrong. What we are doing here is translating the output of the transform function from type byte[numDocuments][] to Byte[numDocuments][] so it can be passed back as Object[]. Second [] is indicating that we simply have 'byte arrays' for each row/entry. I'm updating the loop index variable name to make it a bit more clear.
We already have test coverage for this behavior here (as well as other types).
Let me know if I missed something.

The equals() in PrimaryKey won't do a deep comparison for arrays, so Byte[] values won't be compared correctly (only the references are compared). We use ByteArray as a wrapper to bypass this problem; it stores the byte[] internally.

Oh, I see what you mean, great catch. It looks like the unit tests didn't catch this because I was mocking lookupRowByPrimaryKey to match byte-array PKs by their string representation 🤦. Fixed it and revamped all the test cases to match only by the 'hashCode' of the PK instance. Thanks 👍
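The pitfall discussed in this thread can be reproduced with plain Java arrays. In the sketch below, BytesKey is a hypothetical minimal wrapper written for this illustration; it is not Pinot's ByteArray class. The sketch assumes the key compares its values with a shallow, element-wise Arrays.equals, as the review comment describes.

```java
import java.util.Arrays;

public class ByteKeySketch {
  // Hypothetical wrapper giving a byte[] content-based equals and hashCode.
  static final class BytesKey {
    private final byte[] bytes;

    BytesKey(byte[] bytes) {
      this.bytes = bytes;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(bytes);
    }
  }

  // Key values holding raw Byte[] elements: a shallow comparison calls
  // equals() on each element, and arrays only compare by reference.
  static boolean rawArrayKeysEqual() {
    Object[] a = {new Byte[] {(byte) 1, (byte) 2}};
    Object[] b = {new Byte[] {(byte) 1, (byte) 2}};
    return Arrays.equals(a, b);
  }

  // The same key values wrapped: element equals() now compares contents.
  static boolean wrappedKeysEqual() {
    Object[] a = {new BytesKey(new byte[] {1, 2})};
    Object[] b = {new BytesKey(new byte[] {1, 2})};
    return Arrays.equals(a, b) && Arrays.hashCode(a) == Arrays.hashCode(b);
  }

  public static void main(String[] args) {
    System.out.println(rawArrayKeysEqual()); // false
    System.out.println(wrappedKeysEqual());  // true
  }
}
```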

Jackie-Jiang · 2020-12-29T20:27:34Z

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

+    Object[] resultSet = new Object[numDocuments];
+    for (int i = 0; i < numDocuments; i++) {
+      // prepare pk
+      Object[] pkValues = new Object[numPkColumns];


Reuse this Object[] instead of creating a new one per iteration

I can do something like:

Object[] resultSet = new Object[numDocuments];
Object[] pkValues = new Object[numPkColumns];
for (int i = 0; i < numDocuments; i++) {
  // prepare pk
  for (int j = 0; j < numPkColumns; j++) {
    pkValues[j] = pkColumns[j][i];
  }
  // lookup
  GenericRow row = _dataManager.lookupRowByPrimaryKey(new PrimaryKey(pkValues));
  if (row != null) {
    resultSet[i] = row.getValue(_dimColumnName);
  }
}

I don't see much point in doing the same for PrimaryKey though. We will only be reusing the pointer which is not really helpful.

It will reuse the object instead of creating one per doc:

Object[] resultSet = new Object[numDocuments];
Object[] pkValues = new Object[numPkColumns];
PrimaryKey primaryKey = new PrimaryKey(pkValues);
for (int i = 0; i < numDocuments; i++) {
  ...

I don't think this will work. We have to create a new 'PrimaryKey' instance per document since the values (pkValues) are going to be different.

You can modify pkValues directly without replacing primaryKey, since the key holds a reference to the same array.
This is not critical, so both ways are fine. It just saves a little garbage.
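The reuse suggested in this thread can be sketched as follows. Key is a hypothetical class written for this example; it assumes, as the thread implies for PrimaryKey, that the key wraps the values array without copying it and computes its hash on every call. The sample rows are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ReusedKeySketch {
  // Hypothetical key that wraps (does not copy) its values array.
  static final class Key {
    private final Object[] values;

    Key(Object[] values) {
      this.values = values;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Key && Arrays.equals(values, ((Key) o).values);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(values); // recomputed each call, so it tracks array changes
    }
  }

  // pkColumns is column-major: pkColumns[j][i] is key column j of document i.
  static String[] lookupAll(Map<Key, String> table, Object[][] pkColumns, int numDocuments) {
    String[] results = new String[numDocuments];
    Object[] pkValues = new Object[pkColumns.length];
    Key reusedKey = new Key(pkValues); // allocated once, outside the loop
    for (int i = 0; i < numDocuments; i++) {
      for (int j = 0; j < pkColumns.length; j++) {
        pkValues[j] = pkColumns[j][i]; // mutates the array the key points to
      }
      results[i] = table.get(reusedKey); // read-only probe with the reused key
    }
    return results;
  }

  static String[] demo() {
    Map<Key, String> table = new HashMap<>();
    table.put(new Key(new Object[] {"BOS", 2020}), "row A"); // stored keys own their arrays
    table.put(new Key(new Object[] {"NYA", 2020}), "row B");
    Object[][] pkColumns = {{"NYA", "BOS", "XXX"}, {2020, 2020, 2020}};
    return lookupAll(table, pkColumns, 3);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(demo())); // [row B, row A, null]
  }
}
```

This is safe only for read-only probes. A mutable, reused key must never be inserted into the map, because changing its array afterwards would leave the entry filed under a stale hash.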

Jackie-Jiang · 2020-12-29T20:28:19Z

...src/main/java/org/apache/pinot/core/operator/transform/function/LookupTransformFunction.java

+        pkValues[j] = pkColumns[j][i];
+      }
+      // lookup
+      GenericRow row = _dataManager.lookupRowByPrimaryKey(new PrimaryKey(pkValues));


PrimaryKey object can also be reused

Replied above

cbalci · 2021-01-08T00:35:40Z

Thanks for the review @Jackie-Jiang and @yupeng9 !
@kishoreg , @mcvsubbu, @siddharthteotia please let me know if you would like to add anything, otherwise I'd like to proceed with merging the branch and start addressing some small remaining items such as 'quota config' and documentation.

mcvsubbu

lgtm thanks

cbalci added 2 commits December 23, 2020 19:35

Add 'lookUp' transform function

fec8d81

Add sample lookup query to join quickstart

1b141d8

siddharthteotia self-requested a review December 25, 2020 02:06

xiangfu0 requested review from mcvsubbu, Jackie-Jiang, xiangfu0 and kishoreg December 25, 2020 23:28

xiangfu0 reviewed Dec 25, 2020

yupeng9 reviewed Dec 27, 2020

cbalci added 2 commits December 27, 2020 17:24

Fix formatting and whitespace

345b1f7

Small fix.

fc5a0ee

Jackie-Jiang reviewed Dec 28, 2020

cbalci added 3 commits December 28, 2020 14:22

Support all possible PK types

cee23d7

Address review comments.

91b635c

Small fix

1bc88ca

Jackie-Jiang reviewed Dec 29, 2020

cbalci added 2 commits December 29, 2020 15:39

Address review comments.

77e5a77

Fix PK hash match issue for byte[] type columns

4577c21

Jackie-Jiang approved these changes Dec 30, 2020

yupeng9 approved these changes Jan 5, 2021

mcvsubbu reviewed Jan 8, 2021

Jackie-Jiang merged commit d04785c into apache:master Jan 8, 2021
