ESQL: Opt in to support for new aggregate_metric_double and dense_vector using query constructs #135215
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics)
Flipping back to draft because this now contains a potential fix. It also needs expanding to the multi-cluster tests.
...e/src/javaRestTest/java/org/elasticsearch/xpack/esql/qa/multi_node/AllSupportedFieldsIT.java
import org.junit.ClassRule;

@ThreadLeakFilters(filters = TestClustersThreadFilter.class)
public class AllSupportedFieldsIT extends AllSupportedFieldsTestCase {
Same question for single-node. This is not going to serialize at all.
It's useful here because it serves as a smoke test for the ground "it works" state. Kind of a self test for the test.
}

nodeToInfo = new TreeMap<>();
Request getIds = new Request("GET", "_cat/nodes");
Table output is a bit tricky to parse.
We should use the JSON alternative instead. There is an example of getting a node name; the same object should have the id and version as well:
Lines 121 to 133 in 885a551
private List<String> getNodeNames() throws Exception {
    final var request = new Request("GET", "/_nodes");
    final var response = client().performRequest(request);
    Map<String, Object> responseMap = responseAsMap(response);
    Map<String, Map<String, Object>> nodes = extractValue(responseMap, "nodes");
    final List<String> nodeNames = new ArrayList<>();
    for (Map.Entry<String, Map<String, Object>> nodeInfoEntry : nodes.entrySet()) {
        final String nodeName = extractValue(nodeInfoEntry.getValue(), "name");
        nodeNames.add(nodeName);
    }
    return nodeNames;
}
I'm not sure that's better, but I can do it.
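For reference, a rough sketch of how the JSON-based version could also pick up the id and version, reusing the responseAsMap/extractValue helpers from the example above; the NodeInfo record and the method name are made up for illustration:

// Hypothetical sketch, not the actual test code.
record NodeInfo(String id, String version) {}

private Map<String, NodeInfo> fetchNodeInfo() throws Exception {
    final var request = new Request("GET", "/_nodes");
    final var response = client().performRequest(request);
    Map<String, Object> responseMap = responseAsMap(response);
    Map<String, Map<String, Object>> nodes = extractValue(responseMap, "nodes");
    Map<String, NodeInfo> nodeToInfo = new TreeMap<>();
    for (Map.Entry<String, Map<String, Object>> entry : nodes.entrySet()) {
        String nodeId = entry.getKey();                         // the node id is the map key
        String name = extractValue(entry.getValue(), "name");
        String version = extractValue(entry.getValue(), "version");
        nodeToInfo.put(name, new NodeInfo(nodeId, version));
    }
    return nodeToInfo;
}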
Hm, I think the fix could work for some queries, but we'll have to work through all kinds of different scenarios and add tests for them.
This only addresses the actual data type during serialization. That is not the same as planning the query with the knowledge that any field attribute may actually have an unsupported data type on a remote node.
Things that come to mind:
- What if a previously unsupported type is used in union types?
- What if a previously unsupported type is used in a lookup join? In particular, now we can have the fun situation where the lookup happens on the data node, but coordinator + lookup nodes already know the new data type while the data node doesn't.
- Same goes for ENRICH.
- Are all our commands actually okay with the pages they receive? This is the first time that in some of the pages, we have null blocks, and in others, we have blocks of the expected data type.
- When old data nodes get sent their local plan, we may not touch a field of previously unsupported type there - but we may touch it further downstream, on the coordinator, where it's supposed to be supported. The validation for "no touchy unsupported field" is also fully done on the coordinator, which doesn't know about the fact that some of the data nodes don't know about one of the data types used.
x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/type/DataType.java
It's possible some things don't handle all-null blocks, but they should handle them. All-null is valid for all block types and is commonly produced for unmapped fields. For example, if you query two indices and only one maps the field, the other will produce all-null blocks. It's possible a trick like this would bring the null block to more places, but I don't think so.
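To make the existing unmapped-field case concrete, here's a hedged sketch (the index and field names are made up; it reuses the REST test client helpers seen above):

// Query two indices where only one maps the field. Rows coming from the index without the
// mapping come back as null - the same all-null blocks discussed above.
Request query = new Request("POST", "/_query");
query.setJsonEntity("""
    { "query": "FROM index_with_field,index_without_field | KEEP maybe_mapped | LIMIT 5" }
    """);
Response response = client().performRequest(query);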
Does that mean that, with this hack, you can do stuff like:
and the data nodes running on an old version will run the …
Would have to confirm to say for sure, but they may just attempt to run the check. The verifier only runs after the analyzer, and both run only on the coordinator. So the verifier will not pick up the unsupported attributes and see that we're trying to use them in places that we aren't allowed to. If you look at …
I think we need to slap the test-release label on and see what we get. I think the test expectations are currently wrongly built on the SNAPSHOT behavior - see below.

I think this is a great improvement and can be merged because it saves us from the problem that old nodes will just error out from existing queries due to serialization problems.

That said, I think it's incorrect to allow 9.1 nodes to participate in knn-using queries. Yes, they can deserialize the dense_vector data type. But the type is still disabled in 9.1 release builds. This will implicitly add support for 9.2/9.1 mixed/remote cluster situations. This may be fine! Or it may not be. In general, IMO a query should error out if we try to have a remote node use an unsupported type. And 9.1 doesn't support dense_vector yet. The same holds, analogously, for aggregate_metric_double. And _tsid.

But we can tackle this in a separate PR, as long as it's before 9.2.0's release.
case GEO_SHAPE -> equalTo("POINT (-71.34 41.12)");
case NULL -> nullValue();
case AGGREGATE_METRIC_DOUBLE -> {
    if (minVersion.onOrAfter(Version.V_9_2_0)) {
I think that, by default, 9.1.5 nodes shouldn't send aggregate_metric_double data; the data type was still under construction then. Same for dense_vector.
If we know that it's safe to get that data sent over, this is fine. But I don't know that and this means we're letting 9.1 nodes silently participate in queries that they may never have been tested for.
This goes back to my other #135215 (comment); in general, I think it's unsafe to treat a type as supported as soon as it can be deserialized; it's only safe to treat it as supported once it leaves the "under construction" stage.
// Currently unsupported without TS command or KNN function
case AGGREGATE_METRIC_DOUBLE, DENSE_VECTOR -> {
    if (false == minVersion.onOrAfter(Version.V_9_2_0)) {
        yield either(equalTo("unsupported")).or(equalTo(type.esType()));
Oh! Hold on. With this change, the output should be stable overall when we have e.g. 9.1/9.2 mixed clusters in release builds. It shouldn't depend on which node we send a query to - if the query isn't a knn or TS query (and so on), the expected output is dictated by the output of the old nodes. Both should tell you "we don't support this type" and fill the column with nulls.

Are we looking at the right build? The new types are already enabled in SNAPSHOT builds since 9.0/9.1 (resp. 8.18/8.19), but disabled in release builds.
I went and ran from k8s-downsampled | keep network.eth0.rx | limit 1 against a local SNAPSHOT build and a local release build. For the former, you'll get values and the right type in the column. For the latter, you'll get
{
"took" : 184,
"is_partial" : false,
"documents_found" : 1,
"values_loaded" : 0,
"columns" : [
{
"name" : "network.eth0.rx",
"type" : "unsupported",
"original_types" : [
"aggregate_metric_double"
],
"suggested_cast" : "aggregate_metric_double"
}
],
"values" : [
[
null
]
]
}
This test's behavior is important mostly for release builds. Maybe this should be a release-build-only test? Including SNAPSHOT builds will muddy the waters and make this hard to reason about.
if (false == minVersion.onOrAfter(Version.V_9_2_0)) {
    yield either(equalTo("unsupported")).or(equalTo(type.esType()));
}
yield equalTo("unsupported");
Should we also test that the originalType is correctly reported in the output? That's something we could mess up when tinkering with the under-construction types.
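Something along these lines, maybe - a sketch only, assuming the test has the response columns as maps (columnByName is a made-up helper; the expected values mirror the release-build response shown above):

// Sketch: an unsupported column should still report its original type and suggested cast.
Map<String, Object> column = columnByName(columns, "network.eth0.rx");
assertThat(column.get("type"), equalTo("unsupported"));
assertThat(column.get("original_types"), equalTo(List.of("aggregate_metric_double")));
assertThat(column.get("suggested_cast"), equalTo("aggregate_metric_double"));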
x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/type/DataType.java
...esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/LookupJoinTypesIT.java
task.skipTest("ml/sparse_vector_search/Search on a sparse_vector field with dots in the field names", "Vectors are no longer returned by default") | ||
task.skipTest("ml/sparse_vector_search/Search on a nested sparse_vector field with dots in the field names and conflicting child fields", "Vectors are no longer returned by default") | ||
task.skipTest("esql/190_lookup_join/lookup-no-key-only-key", "Requires the fix") | ||
task.skipTest("esql/40_tsdb/aggregate_metric_double unsortable", "Extra function required to enable the field type") |
Hm, if we don't rename the now-skipped tests, does that mean that the fwc test will be skipped for them forever? Would it be safer to update all the names of the changed yaml tests, e.g. by adding a v2 suffix?
Oh shoot. Uh. I don't know.
It won't be needed after I fix it to use the version. We can just un-skip it. I think.
But I'm not sure.
- Fix the test expectations for our AllSupportedFieldsITs according to #135215 (comment).
- Clean up our mechanism and docs for "under construction" data types:
  - For SNAPSHOT builds, go back to the old status quo where all types are considered supported and we just skip tests based on capabilities.
  - Re-brand the CreatedVersion for data types to SupportedVersion. For release builds, it doesn't matter when serialization code for a type was introduced; it matters only when it was actually released, and this is generally later than when the serialization code for the type was written.
  - Remove the old feature-flag based system for marking types as "under construction", as we don't plan to use feature flags, but minimum transport versions, to determine if a remote node/cluster supports a type.
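A very rough sketch of where the SupportedVersion re-branding could land; the names and signatures here are guesses, not the actual code:

// Hypothetical sketch only: a type counts as supported for a query if every node that can
// see the request is on a transport version at or after the version the type shipped in.
public record SupportedVersion(TransportVersion releasedIn) {
    public boolean supportedOn(TransportVersion minimumVersionInQuery) {
        return minimumVersionInQuery.onOrAfter(releasedIn);
    }
}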
Works around a problem with ESQL planning around new types. Without this, if you have either aggregate_metric_double or dense_vector fields on a node running < 9.2.0, ESQL will attempt to detect those for queries like FROM * and request those fields. The old nodes don't know about the data type, so they'll claim they can't parse the request and fail. That's bad because in a pure 9.1.x cluster FROM * would have returned null for those fields rather than failing. And, come on, FROM * shouldn't fail during an upgrade.

And it's not just during an upgrade. A mixed remote cluster being sent FROM remote:* over cross-cluster search will have the same problem if it is 9.1 and the coordinating node is 9.2.

The fix we'd like to do is get the version of all nodes that'll handle the request and only enable the type if all versions support it. That's complex, but it is easy to think about, explain, and handle in the planner.

We thought about a different fix - downgrading these fields to unsupported on write to an old version - but that's difficult to reason about and quite likely to fail in a long tail of weird ways.

This is a simpler version of the "get all the versions and disable unsupported fields" fix that uses a cute hack to avoid having to fetch the version, which is the complex part. Instead, it scrapes the query for the TS command or for functions like KNN or TO_AGGREGATE_METRIC_DOUBLE and enables these fields if those are there. This works because folks who want dense_vector will always be using KNN in 9.2.0, and folks who want aggregate_metric_double will use TS. These things will fail if the entire cluster and all remote nodes aren't on 9.2.0 because they don't exist on those versions. So users will tell us, by writing things that require 9.2.0, that all the nodes are on 9.2.0.

We'll implement the transport version based approach in a follow up. But, for now, we'll use this trick.
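A minimal sketch of the "scrape the query" opt-in under these assumptions - the class names (TS, Knn, ToAggregateMetricDouble) and traversal helpers are approximations of the ESQL plan APIs, not the exact implementation:

// Sketch: enable the new types only if the query already contains a construct that requires
// 9.2.0 everywhere (the TS command, KNN, or TO_AGGREGATE_METRIC_DOUBLE).
static boolean queryOptsInToNewTypes(LogicalPlan plan) {
    Holder<Boolean> found = new Holder<>(false);
    plan.forEachDown(p -> {
        if (p instanceof TS) {
            found.set(true);
        }
        p.forEachExpression(Function.class, f -> {
            if (f instanceof Knn || f instanceof ToAggregateMetricDouble) {
                found.set(true);
            }
        });
    });
    return found.get();
}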
Fix #135193