Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce 'LOOKUP' Transform Function #6383

Merged
merged 9 commits into from
Jan 8, 2021

Conversation

cbalci
Copy link
Contributor

@cbalci cbalci commented Dec 25, 2020

Description

Introducing a new transform function; LookupTransformFunction as a part of Join project as described in Lookup UDF Join In Pinot. This is a followup to the addition of DimensionTableManager in #6346.

LOOKUP is a regular transform function which uses the previously added DimensionTableDataManager to execute a lookup from a Dimension table. Call signature is as follows:

LOOKUP(TableName, ColumnName, JoinKey, JoinValue [, JoinKey2, JoinValue2 ...])

TableName: name of the dimension table which will be used
ColumnName: column name from the dimension table to look up
JoinKey: primary key column name for the dimension table. Note: Only primary key is supported for JoinKey
JoinValue: primary key value
*If the dimension table has more then one primary keys (composite PK), you can add more keys and values for the rest of the args: JoinKey2, JoinValue2 ... etc.

Example Query:

SELECT
    baseballStats.playerName,
    baseballStats.teamID,
    LOOKUP('dimBaseballTeams', 'teamName', 'teamID', baseballStats.teamID)
FROM
    baseballStats
LIMIT 10

Above example joins the dimension table 'baseballTeams' into regular table 'baseballStats' on 'teamID' key. Lookup function returns the value of the column 'teamName'.

To see the function in action you can also fire JoinQuickstart and test it as follows:
Screen Shot 2020-12-23 at 8 04 46 PM

Testing

  • Unit tests are included to cover basic functionality and I also added a sample usage (table creation + query) in JoinQuickstart.
  • End to end functionality is also tested in a previous POC pull request here.

Documentation

  • Comprehensive documentation explaining the usage of 'Dimension' tables is being worked on and will be added to the https://github.com/pinot-contrib/pinot-docs repository in a separate PR.
  • LookupTransformFuncion usage and a sample query is added as a JavaDoc .

@codecov-io
Copy link

codecov-io commented Dec 25, 2020

Codecov Report

Merging #6383 (4577c21) into master (1beaab5) will decrease coverage by 0.95%.
The diff coverage is 56.80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6383      +/-   ##
==========================================
- Coverage   66.44%   65.49%   -0.96%     
==========================================
  Files        1075     1315     +240     
  Lines       54773    63797    +9024     
  Branches     8168     9293    +1125     
==========================================
+ Hits        36396    41781    +5385     
- Misses      15700    19034    +3334     
- Partials     2677     2982     +305     
Flag Coverage Δ
unittests 65.49% <56.80%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...e/pinot/broker/api/resources/PinotBrokerDebug.java 0.00% <0.00%> (-79.32%) ⬇️
...ot/broker/broker/AllowAllAccessControlFactory.java 71.42% <ø> (-28.58%) ⬇️
.../helix/BrokerUserDefinedMessageHandlerFactory.java 33.96% <0.00%> (-32.71%) ⬇️
...ker/routing/instanceselector/InstanceSelector.java 100.00% <ø> (ø)
...ava/org/apache/pinot/client/AbstractResultSet.java 66.66% <0.00%> (+9.52%) ⬆️
.../main/java/org/apache/pinot/client/Connection.java 35.55% <0.00%> (-13.29%) ⬇️
...inot/client/JsonAsyncHttpPinotClientTransport.java 10.90% <0.00%> (-51.10%) ⬇️
...not/common/assignment/InstancePartitionsUtils.java 73.80% <ø> (+0.63%) ⬆️
...common/config/tuner/NoOpTableTableConfigTuner.java 100.00% <ø> (ø)
...ot/common/config/tuner/RealTimeAutoIndexTuner.java 100.00% <ø> (ø)
... and 1175 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33de6dc...4577c21. Read the comment docs.

*/
public class LookupTransformFunction extends BaseTransformFunction {
public static final String FUNCTION_NAME = "lookUp";
private static final String TABLE_NAME_SUFFIX = "_OFFLINE";
Copy link
Contributor

@xiangfu0 xiangfu0 Dec 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to assume a dim table is always an offline table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Xiang, yes you're right. In current design Dimension Table has the following constraints:

  • Must be of OFFLINE type
  • Must have a primary key (we support lookups by primary key for now)
  • Must have ingestion type REFRESH

Please check out #6286 and #6346 to see implementation details.

pkColumns[i] = ArrayUtils.toObject(tf.transformToLongValuesSV(projectionBlock));
break;
default:
throw new IllegalStateException("Unknown column type for primary key");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about byte?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to support all types: INT, LONG, FLOAT, DOUBLE, STRING, BYTES

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, double is hard to check equality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For lookupByPrimaryKey we are relying on the PrimaryKey.hashCode implementation you added here. Here is the dimension table HashMap which is keyed by 'PrimaryKey's. Let me know if you see any potential issues with this usage.

cbalci referenced this pull request Dec 29, 2020
* Add DimensionTableData manager

* Address review comments.

* CLose reader after using.

* Revisit javadocs.

* Release segment after use.

* Touch up instance instantiation.

* Cleanup segment in test.

* Release segments in "finally" block.

* Update logs.

* Add TableConfig validations for Dim Tables.

* Seperate IngestionConfigTests for dim tables.

* Remove defensive null checks.

* Fix github action profile name.

* Fix ingestionTest dependencies.

* Undo the gihub-actions mvn profile name fix.
break;
case BYTES:
byte[][] primitiveValues = tf.transformToBytesValuesSV(projectionBlock);
pkColumns[i] = new Byte[numDocuments][];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkColumns should be stored as ByteArray[numDocuments]. Please add a test for all these data types to ensure them working

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checked and can't see how this is wrong. What we are doing here is translating the output of the transform function from type byte[numDocuments][] to Byte[numDocuments][] so it can be passed back as Object[]. Second [] is indicating that we simply have 'byte arrays' for each row/entry. I'm updating the loop index variable name to make it a bit more clear.
We already have test coverage for this behavior here (as well as other types).
Let me know if I missed something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The equals() in PrimaryKey won't do deep comparison for array, and Byte[] won't be compared correctly (it will only compare the references). We use ByteArray as a wrapper to bypass this problem. It is used to store byte[] internally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see what you mean, great catch. Looks like the reason unit tests didn't catch this was, I was mocking the lookupRowByPrimaryKey to match byte array type PKs by their string representation 🤦. Fixed it and revamped all the test cases to match only by the 'hashCode' of the PK instance. Thanks 👍

Object[] resultSet = new Object[numDocuments];
for (int i = 0; i < numDocuments; i++) {
// prepare pk
Object[] pkValues = new Object[numPkColumns];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reuse this Object[] instead of creating a new one per iteration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do something like:

  Object[] resultSet = new Object[numDocuments];
  Object[] pkValues = new Object[numPkColumns];
  for (int i = 0; i < numDocuments; i++) {
    // prepare pk
    for (int j = 0; j < numPkColumns; j++) {
      pkValues[j] = pkColumns[j][i];
    }
    // lookup
    GenericRow row = _dataManager.lookupRowByPrimaryKey(new PrimaryKey(pkValues));
    if (row != null) {
      resultSet[i] = row.getValue(_dimColumnName);
    }
  }

I don't see much point in doing the same for PrimaryKey though. We will only be reusing the pointer which is not really helpful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will reuse the object instead of creating one per doc:

  Object[] resultSet = new Object[numDocuments];
  Object[] pkValues = new Object[numPkColumns];
  PrimaryKey primaryKey = new PrimaryKey(pkValues);
  for (int i = 0; i < numDocuments; i++) {
    ...

Copy link
Contributor Author

@cbalci cbalci Dec 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work. We have to create a new 'PrimaryKey' instance per document since the values (pkValues) are going to be different.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang Dec 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can directly modify pkValues without changing primaryKey. They share the same reference.
This is not critical, so both ways are fine. It just saves minor garbages

pkValues[j] = pkColumns[j][i];
}
// lookup
GenericRow row = _dataManager.lookupRowByPrimaryKey(new PrimaryKey(pkValues));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrimaryKey object can also be reused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied above

@cbalci
Copy link
Contributor Author

cbalci commented Jan 8, 2021

Thanks for the review @Jackie-Jiang and @yupeng9 !
@kishoreg , @mcvsubbu, @siddharthteotia please let me know if you would like to add anything, otherwise I'd like to proceed with merging the branch and start addressing some small remaining items such as 'quota config' and documentation.

Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks

@Jackie-Jiang Jackie-Jiang merged commit d04785c into apache:master Jan 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants