
GroupBy and Timeseries queries to realtime segments mis-apply filters when Schema Auto-Discovery is enabled #15191

@dave-the-tech-guy


Affected Version

27.0.0

Impact

This issue can be reliably reproduced by executing a single-dimension, single-filter native Druid query on any string dimension ingested by a Kinesis task whose spec uses Schema Auto-Discovery, as long as the data has not yet been handed off. The issue resolves once segments are handed off to Historicals.

Expected Result

GroupBy and Timeseries queries against actively ingested single-dimension values are filtered consistently, regardless of data residency (realtime vs. fully persisted segments).

Actual Result

GroupBy and Timeseries queries against actively ingested single-dimension values temporarily ignore or mis-apply filters until data segments are persisted, at which point filters are applied correctly.

Description

My team operates multiple large-scale Druid clusters with roughly identical base configurations. Pertinent details are as follows:

  • Ingestion method: Kinesis
  • Segment granularity: 1 hour
  • Lookback period: 3 hours (a small portion of our data is late-arriving)
  • Relevant Middle Manager architecture: ARM processors, statically defined hardware, dedicated to Kinesis ingestion tasks
    • Other Middle Manager tasks, such as compaction, are delegated to a separate Middle Manager tier

As part of our Schema Auto-Discovery migration, we migrated one of our regions to a new schema spec in which we explicitly define only a few legacy list dimensions (to retain them as MVDs) and our aggregations; the rest of our fields are ingested via discovery. In total, we produce records with ~100-150 fields, and the data types do appear to align correctly post-migration.
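The relevant portion of the spec looks roughly like the following sketch (the legacy_tags dimension name is an illustrative placeholder, not one of our real fields):

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensions": [
      { "type": "string", "name": "legacy_tags", "multiValueHandling": "SORTED_ARRAY" }
    ]
  }
}
```

With useSchemaDiscovery enabled, any column not listed explicitly is type-discovered at ingestion time; the explicitly listed string dimensions keep their legacy MVD handling.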

In the process of migrating, we stumbled across a perplexing issue with GroupBy and Timeseries queries. Whenever we perform a single-dimension query that involves data still on the Middle Managers (in our case, queries that touch the most recent 3 hours), the results are nonsensical: the filter appears to be either inconsistently applied or not applied at all, and other dimension values 'leak' into the results despite being ruled out by the filter. This behavior is almost reminiscent of some sort of MVD edge case, but the fields experiencing the issue are strictly single string values (and, as mentioned further down, the behavior changes between different points of the segment's lifecycle - indicating some sort of discrepancy based on query path / segment state).

Consider the following minimally-reproducible query, a GroupBy that groups and filters by an example_field dimension.

{
  "queryType": "groupBy",
  "dataSource": "Example_Records",
  "granularity": "all",
  "filter": {
      "type": "selector",
      "dimension": "example_field",
      "value": "expected_value"
   },
  "dimensions": ["example_field"],
  "intervals": [
    "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
  ]
}

Assuming example_field is guaranteed to be a simple string value, this query should return at most one row - the value expected_value. However, that is not what happens.

  • When executed on a data range that still resides on Middle Managers, this query returns between 20-40 different rows with miscellaneous values for example_field.
  • When executed on a data range that has been successfully handed off to Historicals, this query returns the correct / expected value of only expected_value.
  • When the same query is executed twice with a 3-hour delay between runs, it will first return the nonsensical result - and then later return the expected result - indicating a behavior change between the comparable Middle Manager and Historical queries.

Oddly enough, a modification to the original query appears to fix it. If an additional dimension - even one that doesn't exist - is added to the query (ordering does not matter), it returns the expected result 100% of the time:

{
  "queryType": "groupBy",
  "dataSource": "Sample_Sessions",
  "granularity": "all",
  "filter": {
      "type": "selector",
      "dimension": "example_field",
      "value": "expected_value"
   },
  "dimensions": ["example_field", "oof"],
  "intervals": [
    "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
  ]
}

The above query always returns one row, with an example_field value of expected_value and an oof value of null, somehow avoiding the nonsensical results of the first query.
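Until the underlying bug is fixed, this workaround can be applied mechanically on the client side before submitting a query. A minimal Python sketch (the add_phantom_dimension helper and the "oof" placeholder name are ours, not part of any Druid API):

```python
import copy
import json

def add_phantom_dimension(query: dict, phantom: str = "oof") -> dict:
    """Return a copy of a native groupBy query with an extra (possibly
    nonexistent) dimension appended -- the workaround that sidesteps the
    realtime-segment filtering bug described above."""
    patched = copy.deepcopy(query)
    patched.setdefault("dimensions", [])
    if phantom not in patched["dimensions"]:
        patched["dimensions"].append(phantom)
    return patched

original = {
    "queryType": "groupBy",
    "dataSource": "Example_Records",
    "granularity": "all",
    "filter": {
        "type": "selector",
        "dimension": "example_field",
        "value": "expected_value",
    },
    "dimensions": ["example_field"],
    "intervals": ["2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"],
}

patched = add_phantom_dimension(original)
print(json.dumps(patched["dimensions"]))  # ["example_field", "oof"]
```

The original query dict is left untouched, so the patched payload can be generated per-request until the fix lands.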
