
Preserve row signature column order when column analysis has errors #19162

Merged
abhishekrb19 merged 3 commits into apache:master from abhishekrb19:fix_analysis_order
Mar 18, 2026

@abhishekrb19 abhishekrb19 commented Mar 16, 2026

Fixes #18437 and relates to #18966.

When column analysis encounters errors during fold, the current behavior can cause row signatures to flap on the Brokers, which in turn leads to sporadic query failures or incorrect query results, since query plans rely on the Broker’s segment metadata cache. This issue is more pronounced during segment analysis on realtime servers with JSON columns, where the fold may sometimes produce column analysis errors, presumably due to type coercion.

This patch ensures that columns are not skipped when such errors occur, which preserves the column order of the row signature.
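
For illustration only, here is a minimal sketch of why skipping an errored column changes the signature while keeping it does not. It is not the code in this patch: `RowSignatureSketch`, `Analysis`, `skipErrors`, and `keepErrors` are hypothetical names, and the STRING fallback is assumed from the discussion in this PR rather than taken from Druid's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; not Druid's RowSignature or ColumnAnalysis classes.
class RowSignatureSketch {
  /** Stand-in for one column's analysis result; type is null when analysis failed. */
  static final class Analysis {
    final String type;
    final boolean error;

    Analysis(String type, boolean error) {
      this.type = type;
      this.error = error;
    }
  }

  /** Drops errored columns, so every later column moves up one position. */
  static List<String> skipErrors(Map<String, Analysis> columns) {
    List<String> signature = new ArrayList<>();
    for (Map.Entry<String, Analysis> entry : columns.entrySet()) {
      if (!entry.getValue().error) {
        signature.add(entry.getKey() + ":" + entry.getValue().type);
      }
    }
    return signature;
  }

  /** Keeps errored columns in their slot, substituting an assumed fallback type. */
  static List<String> keepErrors(Map<String, Analysis> columns, String fallbackType) {
    List<String> signature = new ArrayList<>();
    for (Map.Entry<String, Analysis> entry : columns.entrySet()) {
      Analysis analysis = entry.getValue();
      signature.add(entry.getKey() + ":" + (analysis.error ? fallbackType : analysis.type));
    }
    return signature;
  }

  public static void main(String[] args) {
    // LinkedHashMap preserves insertion order, standing in for the segment's column order.
    Map<String, Analysis> columns = new LinkedHashMap<>();
    columns.put("__time", new Analysis("LONG", false));
    columns.put("payload", new Analysis(null, true)); // analysis failed for this column
    columns.put("country", new Analysis("STRING", false));

    System.out.println(skipErrors(columns));           // payload is missing, country shifts up
    System.out.println(keepErrors(columns, "STRING")); // payload keeps its position
  }
}
```

If the error only shows up on some analyses, the first variant alternates between a two-column and a three-column signature, which is the flapping described above; the second variant always yields the same columns in the same order.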

Release note

Preserve row signature column order when column analysis encounters errors, preventing schema flapping and sporadic query failures or incorrect results (fixes #18437).

Note: #19176 may still occur: when such errors are encountered, the column type currently falls back to string.

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

When column analysis has errors, the current behavior causes signatures
to flap, causing invalid query plans and errors / incorrect query results.
This patch ensures that columns aren't skipped on such errors.
Comment on lines 994 to 996
ColumnType valueType = entry.getValue().getTypeSignature();

// this shouldn't happen, but if it does, first try to fall back to legacy type information field in case
Contributor Author

I think it’s also possible to simplify the surrounding code as the comments note. However, it may be better to do that separately but happy to bundle it into this bug fix as well if folks think it's safe to do so.

Contributor

@gianm gianm left a comment

Interesting that this is what turned out to be causing unstable signatures. I suppose this change is fine. It would be good to fix the underlying problem too. Single-column analyses really shouldn't have errors.

{
  final RowSignature.Builder rowSignatureBuilder = RowSignature.builder();
  for (Map.Entry<String, ColumnAnalysis> entry : analysis.getColumns().entrySet()) {
    if (entry.getValue().isError()) {
Contributor

I'm surprised you're facing analysis errors when merge is off. Most analysis errors I've seen are related to incompatible types in fold, which won't happen here since we aren't doing merge. I wonder if we should fix the underlying problem that causes the errors to happen… do you have more info about that?

Contributor

In particular, on your historicals or realtime tasks, do you see warnings from SegmentAnalyzer like "Error analyzing column" or "Unknown column type"? If we can fix those then the type would be correct here too, rather than getting reset to STRING.

Contributor Author

@abhishekrb19 abhishekrb19 Mar 17, 2026

Thanks for taking a look!

Yes, these analysis errors are indeed coming from fold. I think I got tripped up by the error logs I had added: Folding column[xyz] is an error[error:cannot_merge_diff_types: [json] and [STRING]] for mergeStrategy[strict] dataSources[abc].

As for why this happens here, I’m not entirely sure....but these were happening only on the realtime tasks in our case. So I suspected that the data for some of these JSON columns in the ingestion spec was numeric strings, nulls, or sparse, and that this contributed to these analysis errors (error:cannot_merge_diff_types: [json] and [STRING]).

  1. For the unstable signature issue, this can cause spurious query failures or return no data. Unfortunately, there’s no good workaround other than forcing a reorder of the columns in the ingestion spec, so this fix should help address it.
  2. For any type-related correctness issues, I think users can likely work around them with some form of casting in the queries. (I hadn't seen this issue reported by users so far)
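
To make the error above concrete, here is a rough sketch of a fold that turns two analyses with different types into an error result. It is a simplified model, not Druid's actual fold: `FoldSketch`, `ColumnResult`, `ok`, and `error` are hypothetical names, and only the message format is taken from the log line quoted above.

```java
import java.util.Objects;

// Hypothetical, simplified model of folding per-column analyses; not Druid's code.
class FoldSketch {
  static final class ColumnResult {
    final String type;         // null for an error result
    final String errorMessage; // null when the analysis succeeded

    private ColumnResult(String type, String errorMessage) {
      this.type = type;
      this.errorMessage = errorMessage;
    }

    static ColumnResult ok(String type) {
      return new ColumnResult(type, null);
    }

    static ColumnResult error(String reason) {
      return new ColumnResult(null, "error:" + reason);
    }

    boolean isError() {
      return errorMessage != null;
    }
  }

  /** Combines two analyses of the same column; an error on either side is carried forward. */
  static ColumnResult fold(ColumnResult a, ColumnResult b) {
    if (a.isError()) {
      return a;
    }
    if (b.isError()) {
      return b;
    }
    if (!Objects.equals(a.type, b.type)) {
      // Differing types cannot be reconciled in this sketch, so the result is an error.
      return ColumnResult.error("cannot_merge_diff_types: [" + a.type + "] and [" + b.type + "]");
    }
    return a;
  }

  public static void main(String[] args) {
    ColumnResult folded = fold(ColumnResult.ok("json"), ColumnResult.ok("STRING"));
    System.out.println(folded.errorMessage);
  }
}
```

In this model the folded result carries no type at all, so whatever builds the row signature afterwards has to either drop the column or substitute a fallback type.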

Contributor

As for why this happens here, I’m not entirely sure....but these were happening only on the realtime tasks in our case.

Ah, they're probably trying to merge the analyses from different persists. Are you using auto typing? Either type: auto for a dimension spec or useSchemaDiscovery: true? I wonder if there is some issue when different persists select different types for the same column. That kind of thing is generally reconciled at query time and also reconciled when the segment was built and merged, but maybe we missed a place where it needs to be reconciled in the segmentMetadata path.

Contributor Author

Are you using auto typing? Either type: auto for a dimension spec or useSchemaDiscovery: true?

We don't have auto discovery enabled. The columns are a mix of the json type ("formatVersion": v5) and the primitive types string and long (no auto typing).

I wonder if there is some issue when different persists select different types for the same column. That kind of thing is generally reconciled at query time and also reconciled when the segment was built and merged, but maybe we missed a place where it needs to be reconciled in the segmentMetadata path

Hmm, I see. I wonder if the json type follows a similar code path as the auto-type indexer. Also created an issue to track the type correctness issue - #19176.

}

/**
 * Verifies that columns with analysis errors are included in the row signature with {@link ColumnType#UNKNOWN_COMPLEX}
Contributor

I believe errors show up as STRING not COMPLEX.

Contributor

@gianm gianm left a comment

Approved but I'm interested in more information about your setup that leads to these errors. There might be another thing we can fix that would cause the types to be detected more correctly.

@abhishekrb19
Contributor Author

Approved but I'm interested in more information about your setup that leads to these errors. There might be another thing we can fix that would cause the types to be detected more correctly.

Thanks @gianm. I’ve responded in the discussion with some details of our setup #19162 (comment) (and #19176 to detect types more accurately in such cases).

@abhishekrb19 abhishekrb19 merged commit 746cae6 into apache:master Mar 18, 2026
37 checks passed
@abhishekrb19 abhishekrb19 deleted the fix_analysis_order branch March 18, 2026 03:42
@github-actions github-actions bot added this to the 37.0.0 milestone Mar 18, 2026

Successfully merging this pull request may close these issues.

Frequent row signature changes causing queries to return incorrect results
