Objective: Pinot cannot generate columns from other columns. The current workaround is for users to pre-compute the derived values as part of the input data. This adds an extra processing step: a Hadoop job for offline tables and a stream processing job for realtime tables.
Solution: Pinot currently supports this for the time spec. The TimeFieldSpec allows one to specify an incoming field spec and an outgoing field spec. We can generalize this to other field types as well (metrics and dimensions), e.g. generate an HLL value from userId (derived metric), or generate country from IP (derived dimension).
Option 1
Introduce a new FieldSpec in the schema: DerivedFieldSpec. This can look very similar to the existing FieldSpec, with a fieldType enum that can be DIMENSION, METRIC, or TIME.
Option 2
Retain the DimensionFieldSpec, MetricFieldSpec, and DateTimeFieldSpec, and simply add a boolean isDerived to the base FieldSpec class.
I prefer Option 2, since we only add a simple boolean and the rest of the query execution does not need to change.
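To make Option 2 concrete, here is a minimal sketch of what the flag could look like. The class names below (FieldSpecSketch and its subclasses) are hypothetical, simplified stand-ins and not Pinot's actual classes; only the isDerived boolean comes from the proposal.

```java
// Hypothetical, simplified stand-in for the base FieldSpec class (not Pinot's real code).
abstract class FieldSpecSketch {
    enum FieldType { DIMENSION, METRIC, TIME }

    private final String name;
    // The proposed flag: true when the column is computed from another column.
    private final boolean derived;

    FieldSpecSketch(String name, boolean derived) {
        this.name = name;
        this.derived = derived;
    }

    String getName() {
        return name;
    }

    boolean isDerived() {
        return derived;
    }

    abstract FieldType getFieldType();
}

// Existing spec types are retained; they only pass the new flag up to the base class.
final class DimensionFieldSpecSketch extends FieldSpecSketch {
    DimensionFieldSpecSketch(String name, boolean derived) {
        super(name, derived);
    }

    @Override
    FieldType getFieldType() {
        return FieldType.DIMENSION;
    }
}

final class MetricFieldSpecSketch extends FieldSpecSketch {
    MetricFieldSpecSketch(String name, boolean derived) {
        super(name, derived);
    }

    @Override
    FieldType getFieldType() {
        return FieldType.METRIC;
    }
}
```

In this sketch, code that only cares about the field type keeps calling getFieldType() and is unaffected; only the ingestion path would need to check isDerived().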
In addition to the above, we need the following to compute the derived column value:
- sourceFieldName: source field from which we can compute the derived field value.
- transformFunction: the function that needs to be applied to the source value to compute the derived value.
- transformFunctionParameters: parameters to pass to the transform function.
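The three settings above could be applied at ingestion time roughly as follows. This is only a sketch under stated assumptions: DerivedColumnConfig, DerivedColumnSketch, the function registry, and the "prefix" and "toUpperCase" functions are all hypothetical names made up for illustration, and a row is modeled as a plain Map rather than any Pinot type.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical holder for the three proposed settings plus the derived column's name.
final class DerivedColumnConfig {
    final String columnName;
    final String sourceFieldName;
    final String transformFunction;
    final List<String> transformFunctionParameters;

    DerivedColumnConfig(String columnName, String sourceFieldName,
                        String transformFunction, List<String> transformFunctionParameters) {
        this.columnName = columnName;
        this.sourceFieldName = sourceFieldName;
        this.transformFunction = transformFunction;
        this.transformFunctionParameters = transformFunctionParameters;
    }
}

final class DerivedColumnSketch {
    // Hypothetical registry mapping a transformFunction name to (sourceValue, parameters) -> derivedValue.
    static final Map<String, BiFunction<Object, List<String>, Object>> FUNCTIONS = new HashMap<>();

    static {
        // Upper-cases the source value; ignores parameters.
        FUNCTIONS.put("toUpperCase", (value, params) -> String.valueOf(value).toUpperCase(Locale.ROOT));
        // Keeps the first N characters of the source value, where N is the first parameter.
        FUNCTIONS.put("prefix", (value, params) -> {
            String s = String.valueOf(value);
            int length = Integer.parseInt(params.get(0));
            return s.substring(0, Math.min(length, s.length()));
        });
    }

    // Returns a copy of the row with the derived column added; the input row is not modified.
    static Map<String, Object> apply(Map<String, Object> row, DerivedColumnConfig config) {
        BiFunction<Object, List<String>, Object> fn = FUNCTIONS.get(config.transformFunction);
        if (fn == null) {
            throw new IllegalArgumentException("Unknown transform function: " + config.transformFunction);
        }
        Map<String, Object> result = new HashMap<>(row);
        Object sourceValue = row.get(config.sourceFieldName);
        result.put(config.columnName, fn.apply(sourceValue, config.transformFunctionParameters));
        return result;
    }
}
```

A real transform such as IP-to-country or userId-to-HLL would plug into the same shape: look up the function by name, read the source field, and write the result under the derived column's name.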
Here again, we have two options for where to store this configuration: the Schema or the TableConfig.
Thoughts/comments?