Allow transform expressions on same column as source and sink. #8413
amrishlal wants to merge 1 commit into apache:master
Conversation
Codecov Report
```
@@             Coverage Diff              @@
##             master    #8413      +/-   ##
============================================
- Coverage     70.77%   64.05%    -6.73%
- Complexity     4278     4281       +3
============================================
  Files          1655     1610      -45
  Lines         86607    84739    -1868
  Branches      13064    12866     -198
============================================
- Hits          61296    54278    -7018
- Misses        21068    26540    +5472
+ Partials      4243      3921     -322
```
Jackie-Jiang left a comment:
Let's hold a little bit on merging this and have some high level discussion first.
We intentionally reject ingestion transform with same input and output column because it is not idempotent, and can cause unexpected behavior if by any chance the same record is transformed twice. Also, in certain scenarios, the input data might already have the final column generated, and we just skip the transform.
I would be super careful on this change because we need to ensure the record is never transformed twice. Another concern is that if the ingestion transform changes, there is no way to re-generate the derived column because the original values are already changed.
IMO, loosening this restriction can easily cause unexpected behavior and might not be worth it.
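To make the idempotency concern concrete, here is a minimal, hypothetical sketch (not Pinot code; the column name `ts` and both transforms are illustrative): an in-place transform that truncates a timestamp is safe to apply twice, while a seconds-to-milliseconds conversion corrupts the value on the second pass.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class IdempotencyDemo {
  // Apply an in-place transform on column "ts" (hypothetical column name).
  static void transform(Map<String, Object> record, UnaryOperator<Long> fn) {
    record.put("ts", fn.apply((Long) record.get("ts")));
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("ts", 1_000L);

    // Idempotent transform: truncate to the nearest hour (3600-second buckets).
    UnaryOperator<Long> truncate = ts -> ts - ts % 3600;
    transform(record, truncate);
    transform(record, truncate);          // second pass changes nothing
    System.out.println(record.get("ts")); // 0

    // Non-idempotent transform: seconds -> millis.
    record.put("ts", 1_000L);
    UnaryOperator<Long> toMillis = ts -> ts * 1000;
    transform(record, toMillis);
    transform(record, toMillis);          // second pass corrupts the value
    System.out.println(record.get("ts")); // 1000000000, not 1000000
  }
}
```

If the same record can reach such a transform twice (e.g. during a reload or a minion re-ingestion), only the idempotent variant survives unchanged.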
The problem we are running into is that, for GDPR etc., we need to be able to purge records from a segment based on the values of a particular field. If we change the name of the field that is being ingested into Pinot, then we lose the information that column 'x' in the Pinot table actually came from field 'y' in the Kafka event / Avro schema, and hence cannot purge records automatically in the minion based on the original Avro schema field name 'y'. Definitely open to suggestions and discussion, but my understanding is that ingestion transform functions are applied only during ingestion, where the original field is in Kafka/Avro and the transformed value goes into the Pinot column, so this should be safe, right? If you have any particular use case that may not be safe, I can try it out.
If you transform the value within column 'x', even if you can find the column, the value is no longer the original value, so how do you apply the purge logic?
Ingestion transforms can be used during ingestion and also during reload to generate derived columns. Also, on the minion side, a segment can be read as a source file and fed into the ingestion engine again, which may transform the records a second time. We have to take extra care to make this right if the transform is not idempotent.
@Jackie-Jiang Let me look into it a bit more. Will ping you offline.
Closing as #8426 has been merged. |
Description
This PR allows a column in Pinot to have the same name as the corresponding column in an incoming online or offline (Avro) dataset even after an ingestion transform function is applied. To do this, we modify the `ExpressionTransformer.topologicalSort` function so that the transform dependency is not considered cyclic when an ingestion transform function has the same column name as both source and sink. Note that there is no actual cyclic dependency here, since the function can still be safely evaluated without getting into an infinite loop.
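The relaxed cycle check can be sketched as follows. This is a hypothetical standalone illustration, not the actual `ExpressionTransformer` code: a DFS over the transform dependency graph that tolerates a self-edge (e.g. a transform config mapping column `person` through `jsonFormat(person)`) while still rejecting genuine cycles between distinct columns.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TransformCycleCheck {
  // deps maps each output column to the input columns its expression reads.
  static boolean hasCycle(Map<String, List<String>> deps) {
    Set<String> done = new HashSet<>();
    Set<String> inStack = new HashSet<>();
    for (String col : deps.keySet()) {
      if (dfs(col, deps, done, inStack)) {
        return true;
      }
    }
    return false;
  }

  static boolean dfs(String col, Map<String, List<String>> deps, Set<String> done, Set<String> inStack) {
    if (done.contains(col)) {
      return false;
    }
    inStack.add(col);
    for (String dep : deps.getOrDefault(col, List.of())) {
      if (dep.equals(col)) {
        continue; // self-edge: source == sink, allowed after this change
      }
      if (inStack.contains(dep) || dfs(dep, deps, done, inStack)) {
        return true; // genuine cycle through another column
      }
    }
    inStack.remove(col);
    done.add(col);
    return false;
  }

  public static void main(String[] args) {
    // Column x reads only itself: no longer flagged as cyclic.
    System.out.println(hasCycle(Map.of("x", List.of("x"))));                    // false
    // a -> b -> a: still rejected as a real cycle.
    System.out.println(hasCycle(Map.of("a", List.of("b"), "b", List.of("a")))); // true
  }
}
```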
- `ExpressionTransformer.java` has changes to allow specifying ingestion transform functions where the source and sink column names are the same.
- `ExpressionTransformTest.java` has unit tests validating the change.
- `JsonIngestionFromAvroQueriesTest.java` was added as a real use-case test. It ingests Avro complex-type fields into a JSON column, using an ingestion transform function to map each Avro complex-type field to a Pinot JSON column of the same name.
- `AvroIngestionSchemaValidator.java` changes allow validating type compatibility between Avro complex-type fields and JSON columns.

Upgrade Notes
Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)
(If so, label as `backward-incompat`, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?
(If so, label as `backward-incompat`, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:
(If so, label as `release-notes` and complete the section on Release Notes)

Release Notes
Documentation