Schema resolution at query time #4610

@haibow

Description

Right now in BrokerReduceService, a basic check verifies whether there are conflicting schemas across servers. If the column names don't match, or a column fails the type-compatibility check, it will simply drop those segments from the result.
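To make the surprising behavior concrete, here is a simplified, hypothetical sketch (not actual Pinot code) of the kind of check described above: responses whose schema does not match the first one seen are silently dropped. The class and method names are illustrative only.

```java
import java.util.*;

// Simplified sketch of a broker-side schema conflict check: keep only
// server responses whose schema matches the first schema encountered.
public class SchemaConflictCheck {
  // A "schema" here is just a column-name -> type map.
  public static List<Map<String, String>> keepMatching(List<Map<String, String>> serverSchemas) {
    List<Map<String, String>> kept = new ArrayList<>();
    Map<String, String> reference = null;
    for (Map<String, String> schema : serverSchemas) {
      if (reference == null) {
        reference = schema;
      }
      if (schema.equals(reference)) {
        kept.add(schema);  // schema matches: keep this server's rows
      }
      // otherwise the response is silently dropped, which is the
      // surprising behavior this issue describes
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, String> v1 = Map.of("userId", "LONG");
    Map<String, String> v2 = Map.of("userId", "LONG", "country", "STRING");
    List<Map<String, String>> kept = keepMatching(List.of(v1, v2, v1));
    System.out.println(kept.size());  // the v2 response is dropped; prints 2
  }
}
```

Note that the v2 response is dropped even though it is strictly richer than v1, which is why schema evolution triggers the problems below.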

This has caused segments to be dropped unexpectedly due to known bugs or operational errors, especially during schema evolution:

  • When updating the schema (e.g. adding a new column) of a REALTIME table, even if we call RELOAD on the table, the segment being consumed at the time of the schema update ends up with the old schema while other old and new segments have the new schema, so that single segment is dropped at query time due to the schema mismatch (Make Pinot schema evolution easier #4225 (comment))
  • When updating the schema of an OFFLINE table with a daily append cadence, if the user does not backfill old data after changing the schema and devs do not RELOAD the table/old segments, the segments end up with a mix of old and new schemas, and some segments are likewise dropped at query time.

Possible solutions:

  1. Resolve the schema at query time. When segments have different but potentially compatible schemas (e.g. schemas v1 and v2 are found in dataTableMap, and v2 has one extra column compared to v1), try to merge all segments in the result, e.g. fill in default values for the new column on the fly for segments with schema v1. This adds query latency overhead and must be done on every query, since the schema inconsistency in the underlying segments is still not fixed.
  2. Resolve the schema in an async manner. Keep the existing behavior of dropping segments with conflicting schemas, but when this happens, call the controller to trigger RELOAD on the affected segments or the whole table, in an attempt to fix potential schema issues for future queries. Trigger conditions could be:
    • when not all segments have the same schema.
    • when the schema of some segments differs from the table schema. (I may have missed something, but it seems that at query time the schema is inferred from the segments, and the table schema in ZK is not referenced at all.)
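Option 1 could look roughly like the following sketch, which merges rows under the wider schema by filling defaults for missing columns. All names here (mergeRows, DEFAULT_VALUE) are illustrative, not Pinot APIs; a real implementation would use the column's configured default value rather than null.

```java
import java.util.*;

// Hypothetical sketch of option 1: project every row onto the merged
// (wider) column list, filling a default for columns the segment's
// schema did not have.
public class QueryTimeSchemaMerge {
  static final Object DEFAULT_VALUE = null;  // stand-in for the column's configured default

  // mergedColumns is the union of columns across schemas (e.g. v2's columns);
  // each row is a column -> value map produced by a segment.
  public static List<Map<String, Object>> mergeRows(
      List<Map<String, Object>> rows, List<String> mergedColumns) {
    List<Map<String, Object>> out = new ArrayList<>();
    for (Map<String, Object> row : rows) {
      Map<String, Object> merged = new LinkedHashMap<>();
      for (String col : mergedColumns) {
        // keep the row's value if the segment had the column,
        // otherwise fill in the default on the fly
        merged.put(col, row.getOrDefault(col, DEFAULT_VALUE));
      }
      out.add(merged);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> v1Row = new LinkedHashMap<>();
    v1Row.put("userId", 42L);  // segment with old schema: no "country" column
    List<Map<String, Object>> merged =
        mergeRows(List.of(v1Row), List.of("userId", "country"));
    System.out.println(merged.get(0).keySet());  // prints [userId, country]
  }
}
```

As the issue notes, this work is repeated on every query until the underlying segments are actually fixed.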

For option 2, we might need some extra checks to make sure reloading unnecessarily or too frequently does not cause memory pressure, e.g. only trigger a reload in scenarios where the differing schemas are compatible (backward compatible, type compatible) and we are confident that reloading would resolve the schema inconsistencies as a one-time effort.
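The trigger condition for option 2 could be sketched as follows. This is an assumed design, not existing Pinot code: a reload is requested only when every segment schema is a backward-compatible subset of the table schema (so a single reload should converge), and at least one segment is stale. The names isBackwardCompatible and shouldTriggerReload are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of option 2's guard: only ask the controller to
// RELOAD when a reload can plausibly fix the inconsistency.
public class AsyncReloadTrigger {
  // A segment schema is backward compatible if every column it has also
  // exists in the table schema with the same type (the table may add columns).
  static boolean isBackwardCompatible(Map<String, String> segmentSchema,
                                      Map<String, String> tableSchema) {
    for (Map.Entry<String, String> e : segmentSchema.entrySet()) {
      if (!e.getValue().equals(tableSchema.get(e.getKey()))) {
        return false;
      }
    }
    return true;
  }

  static boolean shouldTriggerReload(List<Map<String, String>> segmentSchemas,
                                     Map<String, String> tableSchema) {
    boolean anyStale = false;
    for (Map<String, String> s : segmentSchemas) {
      if (!isBackwardCompatible(s, tableSchema)) {
        return false;  // incompatible change: reloading would not fix it
      }
      if (!s.equals(tableSchema)) {
        anyStale = true;  // stale but compatible: a reload should fix it
      }
    }
    return anyStale;
  }

  public static void main(String[] args) {
    Map<String, String> table = Map.of("userId", "LONG", "country", "STRING");
    Map<String, String> stale = Map.of("userId", "LONG");
    System.out.println(shouldTriggerReload(List.of(stale, table), table));  // prints true
  }
}
```

Returning false for incompatible changes is what keeps the reload a one-time effort rather than a retry loop on segments that can never converge.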
