
GroupBy and Timeseries queries to realtime segments mis-apply filters when Schema Auto-Discovery is enabled #15191

@dave-the-tech-guy


Affected Version

27.0.0

Impact

This issue can be reliably reproduced by executing a single-dimension, single-filter native Druid query on any string dimension ingested by a Kinesis task whose spec uses Schema Auto-Discovery, as long as the data has not yet been handed off. The issue resolves once segments are handed off to Historicals.

Expected Result

GroupBy and Timeseries queries against actively ingested single-dimension values are filtered consistently, regardless of data residency (realtime vs. fully persisted segments).

Actual Result

GroupBy and Timeseries queries against actively ingested single-dimension values temporarily ignore or mis-apply filters until data segments are persisted, at which point filters are applied correctly.

Description

My team operates multiple large-scale Druid clusters with roughly identical base configurations. Pertinent details are as follows:

  • Ingestion method: Kinesis
  • Segment granularity: 1 hour
  • Lookback period: 3 hours (a small portion of our data is late-arriving)
  • Relevant Middle Manager architecture: ARM processors, statically defined hardware, dedicated to Kinesis ingestion tasks
    • Other Middle Manager tasks, such as compaction, are delegated to a separate Middle Manager tier

As part of our Schema Auto-Discovery migration, we migrated one of our regions to a new schema spec in which we explicitly define only a few legacy list dimensions (to retain them as MVDs) and our aggregations; the rest of our fields are ingested via discovery. In total, we produce records with ~100-150 fields, and the data types do appear to align correctly post-migration.
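The relevant portion of the spec looks roughly like the following sketch (the legacy_tags dimension name is an illustrative placeholder, not one of our real fields):

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensions": [
      { "type": "string", "name": "legacy_tags", "multiValueHandling": "SORTED_ARRAY" }
    ]
  }
}
```

With useSchemaDiscovery enabled, any column not listed explicitly is type-discovered at ingestion time; the explicitly listed string dimensions keep their legacy MVD handling.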

In the process of migrating, we stumbled across a perplexing issue with GroupBy and Timeseries queries. Whenever we perform a single-dimension query that involves data still on the Middle Managers (in our case, queries that touch the most recent 3 hours), the results are nonsensical: the filter appears to be either inconsistently applied or not applied at all, and other dimension values 'leak' into the results despite being ruled out by the filter. This behavior is almost reminiscent of some sort of MVD edge case, but the fields experiencing the issue are strictly single string values (and, as mentioned further down, the behavior changes between different points of the segment's lifecycle - indicating some sort of discrepancy based on query path / segment state).

Consider the following minimally-reproducible query, a GroupBy that groups and filters by an example_field dimension.

{
  "queryType": "groupBy",
  "dataSource": "Example_Records",
  "granularity": "all",
  "filter": {
      "type": "selector",
      "dimension": "example_field",
      "value": "expected_value"
   },
  "dimensions": ["example_field"],
  "intervals": [
    "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
  ]
}

Assuming example_field is guaranteed to be a simple string value, this query should return at most one row - the value expected_value. However, that is not what happens.

  • When executed on a data range that still resides on Middle Managers, this query returns between 20-40 different rows with miscellaneous values for example_field.
  • When executed on a data range that has been successfully handed off to Historicals, this query returns the correct / expected value of only expected_value.
  • When the same query is executed twice with a 3-hour delay between runs, it will first return the nonsensical result - and then later return the expected result - indicating a behavior change between the comparable Middle Manager and Historical queries.

Oddly enough, a modification to the original query appears to fix it. If an additional dimension - even one that doesn't exist - is added to the query (ordering does not matter), it returns the expected result 100% of the time:

{
  "queryType": "groupBy",
  "dataSource": "Sample_Sessions",
  "granularity": "all",
  "filter": {
      "type": "selector",
      "dimension": "example_field",
      "value": "expected_value"
   },
  "dimensions": ["example_field", "oof"],
  "intervals": [
    "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
  ]
}

The above query always returns one row, with an example_field value of expected_value and an oof value of null, somehow avoiding the nonsensical results of the first query.
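Until the underlying bug is fixed, this workaround can be applied mechanically on the client side before submitting a query. A minimal Python sketch (the add_phantom_dimension helper and the "oof" placeholder name are ours, not part of any Druid API):

```python
import copy
import json

def add_phantom_dimension(query: dict, phantom: str = "oof") -> dict:
    """Return a copy of a native groupBy query with an extra (possibly
    nonexistent) dimension appended -- the workaround that sidesteps the
    realtime-segment filtering bug described above."""
    patched = copy.deepcopy(query)
    patched.setdefault("dimensions", [])
    if phantom not in patched["dimensions"]:
        patched["dimensions"].append(phantom)
    return patched

original = {
    "queryType": "groupBy",
    "dataSource": "Example_Records",
    "granularity": "all",
    "filter": {
        "type": "selector",
        "dimension": "example_field",
        "value": "expected_value",
    },
    "dimensions": ["example_field"],
    "intervals": ["2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"],
}

patched = add_phantom_dimension(original)
print(json.dumps(patched["dimensions"]))  # ["example_field", "oof"]
```

The original query dict is left untouched, so the patched payload can be generated per-request until the fix lands.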
