Provide partial unnesting for handling complex types. #12803

@rseetham

Currently, if we use the ComplexTypeTransformer, an Avro record or map field is completely unnested. There is no control over how far to unnest. It would be great to have an option to control the level of unnesting.
If we don't use the ComplexTypeTransformer, an Avro record or map field can be converted to a Pinot JSON field. This is done in the DataTypeTransformer.

Most of our use cases want to convert Map types into JSON fields. If the input Kafka topic contains both a record field and a map field, we cannot choose to unnest the record field and convert the map field to JSON using the ComplexTypeTransformer. Also, the way the ComplexTypeTransformer handles map fields means it can only be used to extract values for keys you already know ahead of time. For example, if a Map field in the input is a simple string key-value map and you have multiple messages that look like:
{ map1: { key1: value1 } } | { map1: { key2: value2 } } | { map1: { key3: value3 } }
When using the ComplexTypeTransformer, you have to set the Pinot table column names to map1.key1 or map1.key2, and each column is populated only if the input message has that key. If we set map1 as the Pinot table column, it will be set to null.

This is how the code works.
The decoder converts both map and record fields to a value of type Map<String, Object> in the GenericRow output; there is no difference between the two. Each Object can itself be a nested map of the same type. The ComplexTypeTransformer then checks only the value type: if it is a Map, it unnests the entire map, joining the keys with the delimiter to generate the column names.
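
The behaviour described above can be modelled with a small sketch. This is not Pinot's code: `flatten` is a hypothetical helper, Python dicts stand in for the decoder's Map<String, Object> values, and the delimiter is assumed to be '.'. It only illustrates why every nested key becomes its own column and why a column named map1 never receives a value.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], delimiter: str = ".", prefix: str = "") -> Dict[str, Any]:
    """Fully unnest nested dicts, joining key paths with the delimiter.

    Hypothetical model of the full unnesting described in this issue,
    not the actual ComplexTypeTransformer implementation.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict):
            # A nested map is always descended into; there is no way to stop early.
            flat.update(flatten(value, delimiter, name))
        else:
            flat[name] = value
    return flat


# The three example messages from the issue.
messages = [
    {"map1": {"key1": "value1"}},
    {"map1": {"key2": "value2"}},
    {"map1": {"key3": "value3"}},
]
flattened_rows = [flatten(m) for m in messages]

# Each row carries only the column for the key it happened to contain,
# and none of them has a top-level "map1" column.
for row in flattened_rows:
    print(row)
```

In this model, a table column named map1.key1 is filled only by the first message, and a column named map1 finds nothing to read in any row.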

Just as we have options to control how collection fields are unnested (collectionNotUnnestedToJson, fieldsToUnnest), it would be great to have options for Map fields as well.
We could have an option called recordNotUnnestedToJson. By default it would be set to false, keeping the current behaviour.
If set to true, unnesting of a Map/Record would stop as soon as the key path built so far matches a Pinot column name.
For example, if we have an input Map/Record field of the form:

key1: {
  key11: {
    key111: val1
  }
  key12: {
    key121: val2
  }
}

The delimiter is '.' and the Pinot table schema is

{
key1.key11.key111: string
key1.key12: json
}

Then unnesting stops when we reach the key path key1.key12, and the DataTypeTransformer converts its value to JSON.
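
A minimal sketch of the proposed stopping rule, under stated assumptions: `partial_flatten` and `schema_columns` are hypothetical names, Python dicts stand in for Map<String, Object>, and `json.dumps` stands in for the JSON conversion that the DataTypeTransformer would perform. It is an illustration of the idea, not a proposed implementation for Pinot.

```python
import json
from typing import Any, Dict, Set


def partial_flatten(
    record: Dict[str, Any],
    schema_columns: Set[str],
    delimiter: str = ".",
    prefix: str = "",
) -> Dict[str, Any]:
    """Unnest nested dicts, but stop descending once the key path
    matches a column name in the table schema (hypothetical rule)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict) and name not in schema_columns:
            # No column with this name: keep unnesting, as today.
            flat.update(partial_flatten(value, schema_columns, delimiter, name))
        else:
            # Either a leaf value, or a map whose path is itself a column.
            flat[name] = value
    return flat


# Input field and schema column names from the example above.
row = {"key1": {"key11": {"key111": "val1"}, "key12": {"key121": "val2"}}}
schema_columns = {"key1.key11.key111", "key1.key12"}

result = partial_flatten(row, schema_columns)

# Stand-in for the later JSON conversion: serialize any value that is still a map.
as_columns = {
    name: json.dumps(value) if isinstance(value, dict) else value
    for name, value in result.items()
}
print(as_columns)
```

With the option off, the equivalent of an empty `schema_columns` set reproduces full unnesting, which matches the suggestion that the default keeps the current behaviour.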

The same result can be achieved with transform functions, but with many input columns that means writing many functions. The proposed option would be a cleaner way to provide the same functionality.

Labels: feature (New functionality)
