Conversation

@nik9000 (Member) commented Sep 22, 2025

Works around a problem with ESQL planning around new types. Without this, if you have either aggregate_metric_double or dense_vector fields and any node running < 9.2.0, ESQL will detect those fields for queries like FROM * and request them. The old nodes don't know about the data type, so they'll claim they can't parse the request and fail. That's bad because in a pure 9.1.x cluster FROM * would have returned null for those fields rather than failing. And, come on, FROM * shouldn't fail during an upgrade.

And it's not just during an upgrade: a remote cluster that's on 9.1 will hit the same problem when a 9.2 coordinating node sends it FROM remote:* over cross-cluster search.

The fix we'd like to make is to get the version of every node that will receive the request and only enable the type if all of those versions support it. That's complex to implement, but it's easy to think about, explain, and handle in the planner.

We thought about a different fix - downgrading these fields to unsupported when serializing to an old version - but that's difficult to reason about and quite likely to fail in a long tail of weird ways.

This is a simpler version of the "get all the versions and disable unsupported fields" fix that uses a cute hack to avoid fetching the versions, which is the complex part. Instead, it scrapes the query for the TS command or for functions like KNN or TO_AGGREGATE_METRIC_DOUBLE and enables these field types only if those are present. This works because folks who want dense_vector will always be using KNN in 9.2.0, and folks who want aggregate_metric_double will use TS. Those constructs fail outright unless the entire cluster and all remote nodes are on 9.2.0, because they don't exist on earlier versions. So users tell us, by writing things that require 9.2.0, that all the nodes are on 9.2.0.
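
A minimal sketch of that heuristic, assuming a plain string scan (illustrative only - the real implementation works on the parsed query, and these names are not the actual planner API):

```java
import java.util.Locale;

// Illustrative only: scan the query text for constructs that exist solely in
// 9.2.0. If one is present, the query would fail on older nodes anyway, so it
// is safe to enable the new field types.
final class NewTypeHeuristic {
    static boolean queryRequiresAllNodesOn920(String esql) {
        String q = esql.toUpperCase(Locale.ROOT);
        return q.startsWith("TS ")                     // TS source command
            || q.contains("KNN(")                      // implies dense_vector support
            || q.contains("TO_AGGREGATE_METRIC_DOUBLE(");
    }
}
```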

We'll implement the transport-version-based approach in a follow-up. But, for now, we'll use this trick.

Fix #135193

@nik9000 added the >test (Issues or PRs that are addressing/adding tests), :Analytics/ES|QL (AKA ESQL), and v9.2.0 labels Sep 22, 2025
@elasticsearchmachine added the Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)) label Sep 22, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000 marked this pull request as draft September 23, 2025 01:51

@nik9000 (Member, Author) commented Sep 23, 2025

Flipping back to draft because this now contains a potential fix.

It also needs to be expanded to the multi-cluster tests.

```java
import org.junit.ClassRule;

@ThreadLeakFilters(filters = TestClustersThreadFilter.class)
public class AllSupportedFieldsIT extends AllSupportedFieldsTestCase {
```
Contributor:

Same question for single-node. This is not going to serialize at all.

@nik9000 (Member, Author):

It's useful here because it serves as a smoke test for the baseline "it works" state. Kind of a self-test for the test.


```java
nodeToInfo = new TreeMap<>();
Request getIds = new Request("GET", "_cat/nodes");
```
Contributor:

Table output is a bit tricky to parse. We should use the JSON alternative instead. Here's an existing example that gets node names; the same objects have the id and version as well:

```java
private List<String> getNodeNames() throws Exception {
    final var request = new Request("GET", "/_nodes");
    final var response = client().performRequest(request);
    Map<String, Object> responseMap = responseAsMap(response);
    Map<String, Map<String, Object>> nodes = extractValue(responseMap, "nodes");
    final List<String> nodeNames = new ArrayList<>();
    for (Map.Entry<String, Map<String, Object>> nodeInfoEntry : nodes.entrySet()) {
        final String nodeName = extractValue(nodeInfoEntry.getValue(), "name");
        nodeNames.add(nodeName);
    }
    return nodeNames;
}
```
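
A rough sketch of what that suggestion could look like for the nodeToInfo map above (hedged: responseAsMap and extractValue are the test helpers from the example, and mapping node id to version is an assumption about what the test needs):

```java
// Sketch only: build nodeToInfo from the JSON /_nodes response instead of
// parsing the _cat/nodes table. The node id is the key of each entry in the
// "nodes" object, and "version" is a field of the same object.
Map<String, Object> responseMap = responseAsMap(client().performRequest(new Request("GET", "/_nodes")));
Map<String, Map<String, Object>> nodes = extractValue(responseMap, "nodes");
Map<String, String> nodeToInfo = new TreeMap<>();
for (Map.Entry<String, Map<String, Object>> entry : nodes.entrySet()) {
    String nodeId = entry.getKey();
    String version = extractValue(entry.getValue(), "version");
    nodeToInfo.put(nodeId, version);
}
```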

@nik9000 (Member, Author):

I'm not sure that's better, but I can do it.

@alex-spies (Contributor) left a comment:

Hm, I think the fix could work for some queries, but we'll have to work through all kinds of different scenarios and add tests for them.

This only addresses the actual data type during serialization. It is not the same as planning the query with the knowledge that any field attribute may actually have an unsupported data type on a remote node.

Things that come to mind:

  • What if a previously unsupported type is used in union types?
  • What if a previously unsupported type is used in a lookup join? In particular, now we can have the fun situation where the lookup happens on the data node, but coordinator + lookup nodes already know the new data type while the data node doesn't.
  • Same goes for ENRICH.
  • Are all our commands actually okay with the pages they receive? This is the first time that in some of the pages, we have null blocks, and in others, we have blocks of the expected data type.
  • When old data nodes get sent their local plan, we may not touch a field of a previously unsupported type there - but we may touch it further downstream, on the coordinator, where it's supposed to be supported. The "no touchy unsupported field" validation is also done entirely on the coordinator, which doesn't know that some of the data nodes are unaware of one of the data types used.

@nik9000 (Member, Author) commented Sep 23, 2025:

> Are all our commands actually okay with the pages they receive? This is the first time that in some of the pages, we have null blocks, and in others, we have blocks of the expected data type.

It's possible some things don't handle all-null blocks, but they should. All-null is valid for every block type and is commonly produced for unmapped fields: if you query two indices and only one maps the field, the other will produce all-null blocks. It's possible a trick like this brings the null block to more places, but I don't think so.
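
A toy model of the invariant being described (hypothetical types, not the actual compute-engine API): every block must be able to report nulls per position, so an all-null block is substitutable wherever a typed block is expected.

```java
// Hypothetical illustration, not Elasticsearch's real Block API: an all-null
// block satisfies the same contract as any typed block, which is why unmapped
// fields can always be represented this way.
interface Block {
    int positionCount();
    boolean isNull(int position);
}

record ConstantNullBlock(int positionCount) implements Block {
    @Override
    public boolean isNull(int position) {
        return true; // every position is null, regardless of the expected type
    }
}
```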

@nik9000 (Member, Author) commented Sep 23, 2025:

> When old data nodes get sent their local plan, we may not touch a field of a previously unsupported type there - but we may touch it further downstream, on the coordinator, where it's supposed to be supported. The "no touchy unsupported field" validation is also done entirely on the coordinator, which doesn't know that some of the data nodes are unaware of one of the data types used.

Does that mean that, with this hack, you can do stuff like:

```esql
FROM test
| WHERE unsupported IS NULL
```

and the data nodes running on an old version will run the IS NULL check?

@alex-spies (Contributor) commented Sep 23, 2025:

> and the data nodes running on an old version will run the IS NULL check?

I'd have to confirm to say for sure, but they may just attempt to run the check. The verifier only runs after the analyzer, and both run only on the coordinator. So the verifier will not pick up the unsupported attributes and notice that we're trying to use them in places where we aren't allowed to.

If you look at ComputeService#runCompute, you'll see that we take the received physical plan as-is and throw it straight into the local optimizers. In any case, we'd be trying to run a query that would previously have been considered invalid.
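
Restating that flow as a toy sketch (all names besides ComputeService#runCompute are hypothetical, not the real code): the data node only applies local optimizers to whatever plan it receives; analysis and verification happened solely on the coordinator.

```java
// Toy restatement of the flow described above; not the actual ComputeService.
final class DataNodeComputeSketch {
    interface PhysicalPlan {}
    interface LocalOptimizer {
        PhysicalPlan optimize(PhysicalPlan plan);
    }

    static PhysicalPlan runCompute(PhysicalPlan received, LocalOptimizer localOptimizer) {
        // No analyzer and no verifier run here: the received plan is trusted
        // as-is, so a plan the coordinator considered valid executes even if
        // this node would have rejected the same query itself.
        return localOptimizer.optimize(received);
    }
}
```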

@alex-spies (Contributor) left a comment:

I think we need to slap the test-release label on and see what we get. I think the test expectations are currently wrongly built on the SNAPSHOT behavior - see below.

I think this is a great improvement and can be merged, because it saves us from the problem that old nodes will just error out on existing queries due to serialization problems.

That said, I think it's incorrect to allow 9.1 nodes to participate in knn-using queries. Yes, they can deserialize the dense_vector data type, but the type is still disabled in 9.1 release builds. This implicitly adds support for 9.2/9.1 mixed/remote cluster situations. That may be fine! Or it may not be. In general, IMO a query should error out if we try to have a remote node use an unsupported type, and 9.1 doesn't support dense_vector yet. The same holds, analogously, for aggregate_metric_double. And _tsid.

But we can tackle this in a separate PR, as long as it's before 9.2.0's release.

```java
case GEO_SHAPE -> equalTo("POINT (-71.34 41.12)");
case NULL -> nullValue();
case AGGREGATE_METRIC_DOUBLE -> {
    if (minVersion.onOrAfter(Version.V_9_2_0)) {
```
Contributor:

I think by default 9.1.5 nodes shouldn't send aggregate_metric_double data; the data type was still under construction then. The same goes for dense_vector.

If we know that it's safe to have that data sent over, this is fine. But I don't know that, and it means we're letting 9.1 nodes silently participate in queries that they may never have been tested for.

This goes back to my other #135215 (comment); in general, I think it's unsafe to treat a type as supported as soon as it can be deserialized; it's only safe to treat it as supported once it leaves the "under construction" stage.
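
One way to picture the distinction being drawn here (a sketch under assumed names - only the Version/onOrAfter usage is taken from the test snippets in this thread): being deserializable alone shouldn't make a type supported.

```java
// Sketch: a type is only "supported" once it has left the under-construction
// stage AND the node is on a version at or after the type's release version.
// Merely having serialization code for the type is not enough.
enum TypeStage { UNDER_CONSTRUCTION, RELEASED }

static boolean isSupported(TypeStage stage, Version nodeVersion, Version releasedIn) {
    return stage == TypeStage.RELEASED && nodeVersion.onOrAfter(releasedIn);
}
```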

```java
// Currently unsupported without TS command or KNN function
case AGGREGATE_METRIC_DOUBLE, DENSE_VECTOR -> {
    if (false == minVersion.onOrAfter(Version.V_9_2_0)) {
        yield either(equalTo("unsupported")).or(equalTo(type.esType()));
```
Contributor:

Oh! Hold on. With this change, the output should be stable overall when we have, e.g., 9.1/9.2 mixed clusters in release builds. It shouldn't depend on which node we send a query to - if the query isn't a knn or TS query (and so on), the expected output is dictated by the output of the old nodes.

Both should tell you "we don't support this type" and fill the column with nulls.

Are we looking at the right build? The new types have been enabled in SNAPSHOT builds since 9.0/9.1 (resp. 8.18/8.19), but they're disabled in release builds.

I went and ran `from k8s-downsampled | keep network.eth0.rx | limit 1` against a local SNAPSHOT build and a local release build. For the former, you get values and the right type in the column. For the latter, you get:

```json
{
  "took" : 184,
  "is_partial" : false,
  "documents_found" : 1,
  "values_loaded" : 0,
  "columns" : [
    {
      "name" : "network.eth0.rx",
      "type" : "unsupported",
      "original_types" : [
        "aggregate_metric_double"
      ],
      "suggested_cast" : "aggregate_metric_double"
    }
  ],
  "values" : [
    [
      null
    ]
  ]
}
```

This test's behavior is important mostly for release builds. Maybe this should be a release-build-only test? Including SNAPSHOT builds will muddy the waters and make this hard to reason about.

```java
if (false == minVersion.onOrAfter(Version.V_9_2_0)) {
    yield either(equalTo("unsupported")).or(equalTo(type.esType()));
}
yield equalTo("unsupported");
```
Contributor:

Should we also test that the originalType is correctly reported in the output? That's something we could mess up when tinkering with the under-construction types.

@nik9000 added the test-release (Trigger CI checks against release build) and test-full-bwc (Trigger full BWC version matrix tests) labels Sep 25, 2025
```groovy
task.skipTest("ml/sparse_vector_search/Search on a sparse_vector field with dots in the field names", "Vectors are no longer returned by default")
task.skipTest("ml/sparse_vector_search/Search on a nested sparse_vector field with dots in the field names and conflicting child fields", "Vectors are no longer returned by default")
task.skipTest("esql/190_lookup_join/lookup-no-key-only-key", "Requires the fix")
task.skipTest("esql/40_tsdb/aggregate_metric_double unsortable", "Extra function required to enable the field type")
```
Contributor:

Hm, if we don't rename the now-skipped tests, does that mean the FWC tests will skip them forever? Would it be safer to update the names of all the changed YAML tests, e.g. by adding a v2 suffix?

@nik9000 (Member, Author):

Oh shoot. Uh. I don't know.

It won't be needed after I fix it to use the version. We can just un-skip it. I think.

But I'm not sure.

@nik9000 merged commit 94c1ba4 into elastic:main Sep 26, 2025 - 30 checks passed
alex-spies added a commit that referenced this pull request Oct 1, 2025
- Fix the test expectations for our AllSupportedFieldsITs according to #135215 (comment).
- Clean up our mechanism and docs for "under construction" data types:
  - For SNAPSHOT builds, go back to the old status quo where all types are considered supported and we just skip tests based on capabilities.
  - Re-brand the CreatedVersion for data types to SupportedVersion. For release builds, it doesn't matter when serialization code for a type was introduced; it only matters when the type was actually released, which is generally later than when the serialization code was written.
  - Remove the old feature-flag based system for marking types as "under construction", since we plan to use minimum transport versions, not feature flags, to determine whether a remote node/cluster supports a type.
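
A hedged sketch of that minimum-version idea (helper names assumed; the real code works on transport versions rather than the Version class from the test snippets): a type counts as supported for a query only if even the oldest participating node released it.

```java
import java.util.Collection;
import java.util.Collections;

final class MinVersionSupport {
    // Sketch only: enable a data type for a query when the minimum version
    // across all participating nodes (local and remote) is at or after the
    // version in which the type was released. Assumes Version is Comparable.
    static boolean typeSupportedEverywhere(Version releasedIn, Collection<Version> nodeVersions) {
        Version oldest = Collections.min(nodeVersions);
        return oldest.onOrAfter(releasedIn);
    }
}
```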