Evaluate schema transform expressions during ingestion #5238
Conversation
Yes, splitting it into a couple of PRs may help -- since you offered. Perhaps start with the record extractor, but you decide.
Regarding your open questions:
1. We face this problem because JSON and CSV don't have any schema. Should we introduce the concept of an input schema? Should we pick the schema from the first record? (What if it has null for some field, and we find the field later?) Should we dictate that the first record MUST have all the fields they ever expect to see in the input, and take that to be the schema? Just adding some ideas for consideration. There is some related discussion in PR #4968.
2. Can we assume Pinot defaults if the input to a transform function is null? This will work until we support more input types than Pinot itself supports -- far-fetched, I think.
3. ?
4. Since we are clearing and re-using the GenericRecord, it should be OK, I think.
This is amazing! Minor comments, mostly around refactoring.
It will be nice if we can completely decouple the schema from RecordReader and RecordExtractor, which will make future development much easier.
It's already decoupled from RecordExtractor. Do you mean pulling the schema up even more, so that RecordReader also doesn't need the Schema? What would we achieve by doing that?
@npawar RecordExtractor is for stream ingestion, and RecordReader is for batch ingestion. Think of users trying to add a new record reader: they don't need to understand what a schema is; they only need to know which fields should be read.
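A rough sketch of what that decoupled contract could look like -- names, packages and signatures here are illustrative assumptions, not the final Pinot API:

```java
import java.util.Set;

import org.apache.pinot.spi.data.readers.GenericRow;  // assumption: GenericRow lives in the spi readers package

/**
 * Illustrative sketch only: an extractor that is initialized with just the source
 * field names it should read, with no dependency on the Pinot Schema object.
 *
 * @param <T> the input-format-specific record type (e.g. Avro GenericRecord)
 */
public interface RecordExtractor<T> {

  /** Tell the extractor which source fields to pull out of each incoming record. */
  void init(Set<String> sourceFieldNames);

  /** Copy the configured fields from the source record into the reusable GenericRow. */
  GenericRow extract(T sourceRecord, GenericRow destination);
}
```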
For the tests, can we hardcode more than one record, to catch problems such as the ORC case?
Partial
As per the offline sync-up, StreamMessageDecoder is the entry point for realtime, and RecordReader is the entry point for batch. The RecordExtractor is expected to be common to both of them. See the picture in the design doc linked in the description.
The RecordExtractor tests already cover more than one record.
LGTM otherwise, great work
We missed discussing the pros/cons of using Groovy in the design doc, apologies. One thing to note is that opening the door to executing an external script/code is a security risk. For example, we need to protect against malicious intent.
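As a purely illustrative sketch of that risk (this expression is mine, not quoted from the PR, and `anyColumn` is a made-up argument), an unsandboxed Groovy transform could reach straight into the host OS:

```json
{
  "name": "maliciousColumn",
  "dataType": "STRING",
  "transformFunction": "Groovy({'cat /etc/passwd'.execute().text}, anyColumn)"
}
```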
LGTM
Let's continue the discussion on the design doc. We can always remove the Groovy function type if we decide against it. The rest of the changes should remain the same.
InputSchema seems like the cleanest solution, but that's more configuration and more room for error for the user. Giving it more thought, I think the current implementation will work fine, because most often the same input format will be used for all pushes to the same table. The problem arises only if someone keeps switching between input formats for the data.
Right now, I think the best way forward is to let the function handle it. The function definition will need to change only if someone pushes data with different input formats to the same schema (e.g. they use Avro once and it succeeds, but then they use CSV and the function fails) -- which is not a common practical case.
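To make that concrete, a transform can be written defensively so that it tolerates both a typed Avro value and the all-String values coming from CSV. The column names below are made up for illustration:

```json
{
  "name": "epochSeconds",
  "dataType": "LONG",
  "transformFunction": "Groovy({Long.valueOf(millisSinceEpoch.toString()).intdiv(1000)}, millisSinceEpoch)"
}
```

The `toString()` call makes the expression indifferent to whether `millisSinceEpoch` arrives as a Long (Avro) or as a String (CSV).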
This PR adds the ability to execute transform functions written in the schema. The transformation happens during ingestion. Detailed design doc listing all the changes and the design: Column Transformation during ingestion in Pinot.
The main changes are:
- A `transformFunction` can be written in the schema for any fieldSpec, using Groovy. The convention to follow is: `"transformFunction": "FunctionType({function}, argument1, argument2, ... argumentN)"`. For example: `"transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"` (a fuller fieldSpec snippet is sketched after this list).
- The `RecordReader` will provide the Pinot schema to the `SourceFieldNamesExtractor` utility to get the source field names.
- A `RecordExtractor` interface is introduced, one per input format. The `RecordReader` will pass the source field names and the source record to the `RecordExtractor`, which will provide the destination `GenericRow`.
- The `ExpressionTransformer` will create an `ExpressionEvaluator` for each transform function and execute the functions. The `ExpressionTransformer` will run before all other `RecordTransformers`, so that every other transformer sees the real values.

I'll be happy to break this down into smaller PRs if this is getting too big to review. I'm finding it hard to break it down, because the `AvroRecordExtractor` is already used in the realtime decoders.
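For reference, a minimal sketch of a fieldSpec showing where such an expression would live in the schema; the column `fullName` and the source fields `firstName`/`lastName` are just the example from above:

```json
{
  "dimensionFieldSpecs": [
    {
      "name": "fullName",
      "dataType": "STRING",
      "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
    }
  ]
}
```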
Pending
Some open questions
CSV
a) Everything is read as a String, right up until the DataTypeTransformer. The function will have to take care of the type conversion (see the sketch after this list).
b) Cannot distinguish between an MV column holding a single value and a genuine single-value column. The function will have to take care of this.
c) All empty values will be null values. Cannot distinguish between a genuine "" and null for Strings.
JSON
a) Cannot distinguish between INT/LONG and DOUBLE/FLOAT until the DataTypeTransformer.
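As an illustration of letting the function take care of it (a made-up expression; `tagsCsv` is a hypothetical source field), a Groovy transform can parse the raw CSV String itself, e.g. splitting a delimited field into a multi-value column:

```json
{
  "name": "tags",
  "dataType": "STRING",
  "singleValueField": false,
  "transformFunction": "Groovy({tagsCsv.toString().split(';') as List}, tagsCsv)"
}
```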
`KafkaJSONDecoder` needs to create a `JSONRecordExtractor` by default. But we cannot access the `JSONRecordExtractor` of the `input-format` module in the `stream-ingestion` module. We did not face this problem with Avro, because everything is in `input-format`.