Evaluate schema transform expressions during ingestion #5238

Merged · 17 commits · Apr 15, 2020

Conversation

@npawar (Contributor) commented Apr 10, 2020

This PR adds support for executing transform functions defined in the schema. The transformation happens during ingestion. A detailed design doc listing all changes and the design: Column Transformation during ingestion in Pinot.

The main changes are:

  1. A transformFunction can be written in the schema for any fieldSpec using Groovy. The convention to follow is:
    "transformFunction": "FunctionType({function}, argument1, argument2,...argumentN)"
    For example:
    "transformFunction" : "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
  2. The RecordReader will provide the Pinot schema to the SourceFieldNamesExtractor utility to obtain the source field names.
  3. A RecordExtractor interface is introduced, one implementation per input format. The RecordReader passes the source field names and the source record to the RecordExtractor, which produces the destination GenericRow.
  4. The ExpressionTransformer creates an ExpressionEvaluator for each transform function and executes the functions (a minimal evaluation sketch follows this list).
  5. The ExpressionTransformer runs before all other RecordTransformers, so that every other transformer sees the real values.
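For illustration, here is a minimal, hypothetical Java sketch of what evaluating one such Groovy expression could look like, using groovy.lang.GroovyShell directly. This is not the PR's actual ExpressionEvaluator, and it assumes the "Groovy({...}, args)" wrapper has already been parsed into a script body and argument names:

    import groovy.lang.Binding;
    import groovy.lang.GroovyShell;
    import groovy.lang.Script;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical evaluator for one fieldSpec's transform, e.g. the script body
    // "firstName + ' ' + lastName" with arguments [firstName, lastName].
    public class GroovyExpressionEvaluatorSketch {
      private final Script _script;       // compiled once, reused for every record
      private final String[] _arguments;  // source column names to bind

      public GroovyExpressionEvaluatorSketch(String scriptBody, String... arguments) {
        _script = new GroovyShell().parse(scriptBody);
        _arguments = arguments;
      }

      // Binds the source values and runs the script, returning the destination value.
      public Object evaluate(Map<String, Object> sourceRecord) {
        Binding binding = new Binding();
        for (String argument : _arguments) {
          binding.setVariable(argument, sourceRecord.get(argument));
        }
        _script.setBinding(binding);
        return _script.run();
      }

      public static void main(String[] args) {
        GroovyExpressionEvaluatorSketch evaluator =
            new GroovyExpressionEvaluatorSketch("firstName + ' ' + lastName", "firstName", "lastName");
        Map<String, Object> row = new HashMap<>();
        row.put("firstName", "John");
        row.put("lastName", "Doe");
        System.out.println(evaluator.evaluate(row));  // prints "John Doe"
      }
    }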

I'll be happy to break this down into smaller PRs if this is getting too big to review, but I'm finding it hard to split because the AvroRecordExtractor is already used in the realtime decoders.

Pending

  1. Add transform functions to an integration test
  2. JMH benchmarks

Some open questions

  1. We no longer know the data type of the source fields. This is a problem for CSV and JSON.
    CSV
    a) Everything is read as a String until the DataTypeTransformer, so the function has to handle type conversion.
    b) Cannot distinguish a multi-value column containing a single value from a single-value column; the function has to handle this.
    c) All empty values become null; cannot distinguish a genuine "" from null in a String column.
    JSON
    a) Cannot distinguish between INT/LONG and DOUBLE/FLOAT until the DataTypeTransformer.
  2. What should we do if any of the inputs to the transform function is null? Currently, the evaluation is skipped (see the sketch after this list). But should we make it the responsibility of the function to handle this?
  3. The KafkaJSONDecoder needs to create a JSONRecordExtractor by default, but we cannot access the JSONRecordExtractor of the input-format module from the stream-ingestion module. We did not face this problem with Avro, because everything is in input-format.
  4. Before the ExpressionTransformer, the GenericRecord contains only source columns. After the ExpressionTransformer, it contains source + destination columns, all the way up to indexing. Should we introduce a transformer that creates a new GenericRecord with only the destination columns, to avoid the memory consumed by the extra columns?
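Regarding question 2, a tiny illustrative sketch of the current skip-on-null behavior described above (hypothetical names, not the PR's code):

    import java.util.Map;
    import java.util.function.Function;

    final class NullSkippingTransformSketch {
      // If any input to the transform is null, skip evaluation entirely and leave the
      // destination column unset, so later transformers / default values can handle it.
      static Object evaluateOrSkip(Function<Map<String, Object>, Object> transform,
          String[] sourceColumns, Map<String, Object> row) {
        for (String column : sourceColumns) {
          if (row.get(column) == null) {
            return null;
          }
        }
        return transform.apply(row);
      }
    }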

@mcvsubbu (Contributor) left a comment

Yes, splitting it into a couple of PRs may help -- since you offered. Perhaps starting with the record extractor, but you decide.
Regarding your open questions:
1. We face this problem because JSON and CSV don't have any schema. Should we introduce the concept of an input schema? Should we pick the schema from the first record? (What if it has null for some field, and we find the field later? Should we dictate that the first record MUST have all the fields they ever expect to see in the input, and take that to be the schema?) Just adding some ideas for consideration. Some related discussion in PR #4968.
2. Can we assume Pinot defaults if an input to the transform function is null? This will work until we support more input types than Pinot itself supports -- far-fetched, I think.
3. ?
4. Since we are clearing and re-using GenericRecord, it should be OK I think.

@kishoreg (Member) left a comment

This is amazing! Minor comments, mostly around refactoring.

@Jackie-Jiang (Contributor) left a comment

It will be nice if we can completely decouple the schema from RecordReader and RecordExtractor, which will make future development much easier.

@npawar (Contributor, Author) commented Apr 13, 2020

> It will be nice if we can completely decouple the schema from RecordReader and RecordExtractor, which will make future development much easier.

It's already decoupled from RecordExtractor. Do you mean pull up the schema even more, such that RecordReader also doesn't need the Schema? What would we achieve by doing that?

@Jackie-Jiang (Contributor) commented Apr 13, 2020

> It will be nice if we can completely decouple the schema from RecordReader and RecordExtractor, which will make future development much easier.

> It's already decoupled from RecordExtractor. Do you mean pull up the schema even more, such that RecordReader also doesn't need the Schema? What would we achieve by doing that?

@npawar RecordExtractor is for stream ingestion, and RecordReader is for batch ingestion. Think of users trying to add a new record reader: they don't need to understand what a schema is, they only need to know which fields should be read.
This might be a bigger change, so we can add a TODO and address it separately.
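To illustrate the suggestion, a rough sketch of what a schema-free extractor contract could look like (hypothetical types, not the interfaces in this PR): the caller resolves the Pinot Schema into plain field names (e.g. via SourceFieldNamesExtractor), and the extractor never sees the Schema itself.

    import java.util.Set;

    // Hypothetical stand-in for Pinot's GenericRow, just to keep the sketch self-contained.
    interface RowSink {
      void putValue(String fieldName, Object value);
    }

    // Hypothetical shape of a schema-free extractor: it only knows which source fields to read.
    interface FieldBasedRecordExtractor<T> {
      void init(Set<String> sourceFieldNames);
      void extract(T sourceRecord, RowSink destination);
    }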

@Jackie-Jiang (Contributor) left a comment

For the tests, can we hardcode more than one record, to catch problems such as the ORC case?

@mcvsubbu (Contributor) left a comment

Partial

@npawar (Contributor, Author) commented Apr 14, 2020

> It will be nice if we can completely decouple the schema from RecordReader and RecordExtractor, which will make future development much easier.

> It's already decoupled from RecordExtractor. Do you mean pull up the schema even more, such that RecordReader also doesn't need the Schema? What would we achieve by doing that?

> @npawar RecordExtractor is for stream ingestion, and RecordReader is for batch ingestion. Think of users trying to add a new record reader: they don't need to understand what a schema is, they only need to know which fields should be read.
> This might be a bigger change, so we can add a TODO and address it separately.

As per the offline sync-up, StreamMessageDecoder is the entry point for realtime and RecordReader is the entry point for batch; the RecordExtractor is expected to be common to both. There is a picture of this in the design doc linked in the description.
I've also added a TODO in the RecordReader class to further pull out the Schema. For consistency, we should then do the same in StreamMessageDecoder as well. Since this is a bigger change, I'll leave it out of the scope of this PR.
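For reference, a compact hypothetical sketch of that layering (none of these are the actual Pinot interfaces): both entry points hand their raw records, plus the source field names, to one shared extractor.

    import java.util.Map;
    import java.util.Set;

    // Hypothetical common extractor shared by both ingestion paths.
    interface CommonRecordExtractorSketch<T> {
      Map<String, Object> extract(T rawRecord, Set<String> sourceFieldNames);
    }

    // Batch path (RecordReader-like): reads rows from a file and delegates extraction.
    class BatchReaderSketch<T> {
      Map<String, Object> next(T fileRecord, CommonRecordExtractorSketch<T> extractor, Set<String> fields) {
        return extractor.extract(fileRecord, fields);
      }
    }

    // Realtime path (StreamMessageDecoder-like): decodes stream messages and delegates extraction.
    class RealtimeDecoderSketch<T> {
      Map<String, Object> decode(T streamMessage, CommonRecordExtractorSketch<T> extractor, Set<String> fields) {
        return extractor.extract(streamMessage, fields);
      }
    }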

@npawar (Contributor, Author) commented Apr 14, 2020

> For the tests, can we hardcode more than one record, to catch problems such as the ORC case?

The RecordExtractor tests use more than one record.

@Jackie-Jiang (Contributor) left a comment

LGTM otherwise, great work

@mayankshriv (Contributor) commented

We missed discussing the pros/cons of using Groovy in the design doc, apologies. One thing to note is that opening the door to executing an external script/code is a security risk. For example, we need to protect against malicious intent such as System.exit(), while (true), forced OOM, etc. Some of these can be prevented by using Protected/Privileged mode in Groovy. But we should consider whether there are easier alternatives that don't require inventing another DSL.

@npawar force-pushed the expression_parser_evaluation branch from e12c035 to 2570bf7 on April 14, 2020 at 22:14
@Jackie-Jiang (Contributor) left a comment

LGTM

@npawar (Contributor, Author) commented Apr 15, 2020

> We missed discussing the pros/cons of using Groovy in the design doc, apologies. One thing to note is that opening the door to executing an external script/code is a security risk. For example, we need to protect against malicious intent such as System.exit(), while (true), forced OOM, etc. Some of these can be prevented by using Protected/Privileged mode in Groovy. But we should consider whether there are easier alternatives that don't require inventing another DSL.

Let's continue the discussion on the design doc. We can always remove the Groovy function type if we decide against it; the rest of the changes would remain the same.
Regarding Protected/Privileged mode -- can you point me to some docs? I can't seem to find what you're referring to.
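For reference, one compile-time sandboxing option that ships with Groovy is SecureASTCustomizer. Below is a minimal sketch (method names as in Groovy 2.x; this only illustrates the kind of restrictions possible and is not the Protected/Privileged mode mentioned above -- it checks the AST at compile time and does not, by itself, bound CPU or memory use):

    import java.util.Arrays;
    import org.codehaus.groovy.ast.stmt.WhileStatement;
    import org.codehaus.groovy.control.CompilerConfiguration;
    import org.codehaus.groovy.control.customizers.SecureASTCustomizer;
    import groovy.lang.GroovyShell;

    public class SandboxedGroovySketch {
      public static void main(String[] args) {
        SecureASTCustomizer customizer = new SecureASTCustomizer();
        customizer.setMethodDefinitionAllowed(false);                            // expressions only, no method definitions
        customizer.setImportsWhitelist(Arrays.asList());                         // disallow all imports
        customizer.setStarImportsWhitelist(Arrays.asList());
        customizer.setReceiversClassesBlackList(Arrays.asList(System.class));    // e.g. reject System.exit()
        customizer.setStatementsBlacklist(Arrays.asList(WhileStatement.class));  // e.g. reject while (true) loops

        CompilerConfiguration config = new CompilerConfiguration();
        config.addCompilationCustomizers(customizer);

        GroovyShell shell = new GroovyShell(config);
        System.out.println(shell.evaluate("'John' + ' ' + 'Doe'"));  // allowed
        // shell.evaluate("System.exit(0)");  // would be rejected by the customizer
      }
    }

A transform expression that only touches its bound arguments compiles fine under these restrictions, while the blacklisted constructs fail at script-compilation time.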

@npawar (Contributor, Author) commented Apr 15, 2020

#5135

@npawar (Contributor, Author) commented Apr 15, 2020

> Yes, splitting it into a couple of PRs may help -- since you offered. Perhaps starting with the record extractor, but you decide.
> Regarding your open questions:
> 1. We face this problem because JSON and CSV don't have any schema. Should we introduce the concept of an input schema? Should we pick the schema from the first record? (What if it has null for some field, and we find the field later? Should we dictate that the first record MUST have all the fields they ever expect to see in the input, and take that to be the schema?) Just adding some ideas for consideration. Some related discussion in PR #4968.

An InputSchema seems like the cleanest solution, but that's more configuration and more scope for user error. Giving it more thought, I think the current implementation will work fine, because most often the same input format is used for all pushes to the same table. The problem arises only if someone keeps switching between input formats for the data.

> 2. Can we assume Pinot defaults if an input to the transform function is null? This will work until we support more input types than Pinot itself supports -- far-fetched, I think.

Right now, I think the best way forward is to let the function handle it. The function definition would only need to change if someone pushes data with different input formats to the same schema (e.g. they use Avro once and it succeeds, but then they use CSV and the function fails) -- which is not a common practical case.
