Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ESQL] Binary Comparison Serialization #107921

Merged

Conversation

not-napoleon
Copy link
Member

Prior to this PR, serializing a binary comparison in ES|QL depended on the enum BinaryComparisonProcessor.BinaryComparisonOperator from the QL binary comparison code. That put some distance between the ESQL classes and their serialization logic, while also limiting our ability to make adjustments to that logic (since doing so would have ramifications for SQL and EQL)

This PR introduces a new ESQL specific enum for binary comparisons, which has a Writer and a Reader built in, and which implements the standard Writable interface. This enum is constructed in such a way as to be wire-compatible with the existing enum, thus not requiring a transport version change (although any future changes to this probably will require a transport version change).

A side effect of this change is removing Null Equals from ESQL serialization. We never actually implemented Null Equals, and the existing class is a stub. I infer that it was only created to allow use of the QL BinaryComparisonOperator enum, which specifies a Null Equals. I did not include it in the ESQL specific enum I just added, and as such removed it from places that reference that enum.

There is also a "shim" mapping from the new ESQL specific enum to the general QL enum. This is necessary for passing up to the parent BinaryOperation class. Changing the argument for that to use an interface like ArithmeticOperation does would require some non-trivial changes to how QL does serialization, which would dramatically increase the surface area of this PR. Medium term, I would like to change EsqlBinaryComparison to inherit directly from BinaryOperator, which will remove the need for that shim. Unfortunately, doing so proved non-trivial, and so I'm saving that for follow up work.

Follow up work:
- Remove remaining references to Null Equals, and the ESQL Null Equals class.
- Move PlanNamedTypes.writeBinComparison and PlanNamedTypes.readBinComparison into EsqlBinaryComparison, and make EsqlBinaryComparison Writable. This will finish putting the serialization logic next to the object being serialized, for binary comparisons.
- Remove the "shim" by changing EsqlBinaryComparison to inherit directly from BinaryOperation

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 25, 2024
}
}
throw new IOException("No BinaryComparisonOperation found for id [" + id + "]");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a more normal way to serialize enums and let's us implement wire compatibility more easily. It's annoying to type it out for every single enum, but through hard won experience, we found it to be better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of having individual almost identical readFromStream/writeTo methods for every enum, as opposed to the current, centralized (in StreamInput/StreamOuput) approach?

I see the serialization tests make use of the transport version (maybe this already exists now in "old" serialization code, not sure) as a novelty, but can't we do the same in ESQL (outside StreamOutput/StreamInput) and, also, have a common Enum serialization/deserialization code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can explain. We used to serialize enums the way that QL does - via the ordinals. It means fairly simple things like changing the order of the enums can break serialization. Or inserting a new enum. And adding wire backwards compatibility because a problem. You see here that we have to skip a number for wire compatibility. That's easy enough with the id member but requires an extra unused member. We broke enough enum serialization without realizing it that we decided we should be pedantic about it and make a method like this on each enum.

So, partly it's to remove the concerns around sorting enums or inserting new entries. And partly its to make wire backwards compatibility easier in the future. Well, in this case, now. And, partly, it's to make QLs enums look like the other well behaved serializable enums in Elasticsearch.

FWIW, I tend to use a switch statement instead of the loop. But that's not a big deal either way. I'd have done it like this:

        @Override
        public void writeTo(StreamOutput out) throws IOException {
            out.writeByte(code);
        }

        public static Role readFrom(StreamInput in) throws IOException {
            return switch (in.readByte()) {
                case 0 -> DEFAULT;
                case 1 -> INDEX_ONLY;
                case 2 -> SEARCH_ONLY;
                default -> throw new IllegalStateException("unknown role");
            };
        }

But it doesn't matter.

FWIW there are about 200 calls to StreamOutput#writeEnum and StreamOutput#writeOptionalEnum which does the ordinal thing. Spot checking enums in core yields a bunch that delegate to that and a bunch that do the switch or loop way. I think its not as settled of a thing as I make it out to be. But it really makes me feel good knowing I don't have to worry about changing the order of the constants breaking things over the wire.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nik9000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened another issue to talk about removing the enum serialization this way. There's some discussion around doing some other third thing instead. I dunno. I was more sure about this a few hours ago. I mean, I'm sure it's an improvement over what we have, but I'm not sure there aren't other good ways too. OTOH, better is better.

@@ -334,15 +331,15 @@ public void testBinComparisonSimple() throws IOException {
var orig = new Equals(Source.EMPTY, field("foo", DataTypes.DOUBLE), field("bar", DataTypes.DOUBLE));
BytesStreamOutput bso = new BytesStreamOutput();
PlanStreamOutput out = new PlanStreamOutput(bso, planNameRegistry);
out.writeNamed(BinaryComparison.class, orig);
var deser = (Equals) planStreamInput(bso).readNamed(BinaryComparison.class);
out.writeNamed(EsqlBinaryComparison.class, orig);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cause us trouble with the name? I know PlanNamedTypes uses the class name sometimes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the PlanNameRegistry only uses the plain class name, so as long as Equals extends BinaryComparison is replaced by Equals extends EsqlBinaryComparison - with the same name Equals, there's a chance this works without issue.

Changing the "category class" EsqlBinaryComparison under which we register Equals could in theory break things, but in practice when we read/write expressions, the PlanNamedTypes only cares that we have a subclass of Expression (which is unchanged).

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'd wait for a review by Costin/Andrei though since they're more familiar with the QL project.

BinaryComparisonOperation(
int id,
String symbol,
BinaryComparisonProcessor.BinaryComparisonOperation shim,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit: maybe a short comment on why this is needed would help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really hoped it wouldn't live long enough to need such a comment, but you're probably right. I'll add one.

@@ -334,15 +331,15 @@ public void testBinComparisonSimple() throws IOException {
var orig = new Equals(Source.EMPTY, field("foo", DataTypes.DOUBLE), field("bar", DataTypes.DOUBLE));
BytesStreamOutput bso = new BytesStreamOutput();
PlanStreamOutput out = new PlanStreamOutput(bso, planNameRegistry);
out.writeNamed(BinaryComparison.class, orig);
var deser = (Equals) planStreamInput(bso).readNamed(BinaryComparison.class);
out.writeNamed(EsqlBinaryComparison.class, orig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the PlanNameRegistry only uses the plain class name, so as long as Equals extends BinaryComparison is replaced by Equals extends EsqlBinaryComparison - with the same name Equals, there's a chance this works without issue.

Changing the "category class" EsqlBinaryComparison under which we register Equals could in theory break things, but in practice when we read/write expressions, the PlanNamedTypes only cares that we have a subclass of Expression (which is unchanged).

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the disconnect from QL and I agree with it by creating our own EsqlBinaryComparison, but I fail to see the benefits of duplicating serialization logic for enums.

var left = in.readExpression();
var right = in.readExpression();
// TODO: Remove zoneId entirely
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this TODO?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, isn't there the possibility of having date > '2024-01-01T01:00:00` on a specific timezone (see Kibana) sometime in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO is because it's a field we don't currently use or test; none of the actual ES|QL implementations actually use it, so it only looks like it does something, while actually doing nothing. I find that confusing, since I tend to not expect to carry around unused fields. In this particular case, you can see I am not even passing this value on to the constructor.

Beyond that, I don't see the reason that binary comparisons, and only binary comparisons, need to track the time zone they are operating in. If a date is time zone aware, that information should be encoded in the date (2024-04-25T09:15:00+06:00 or similar). Carrying the timezone as part of the operator makes that very confusing. What does it mean to say date > 2024-04-25T09:15:00+06:00 in CET? And if that does have a meaning, why do we not also need it for (e.g.) subtraction of dates?

Ultimately, ES|QL doesn't have timezone support yet, and we don't know what that will look like. Maybe it will be the case that when building that, we decided we need to add back in a timezone here. But until we have a plan for that, I don't see why we should carry around a confusing, unused value.

}
}
throw new IOException("No BinaryComparisonOperation found for id [" + id + "]");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of having individual almost identical readFromStream/writeTo methods for every enum, as opposed to the current, centralized (in StreamInput/StreamOuput) approach?

I see the serialization tests make use of the transport version (maybe this already exists now in "old" serialization code, not sure) as a novelty, but can't we do the same in ESQL (outside StreamOutput/StreamInput) and, also, have a common Enum serialization/deserialization code?

@astefan astefan self-requested a review April 26, 2024 17:45
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@not-napoleon
Copy link
Member Author

@elasticmachine update branch

@not-napoleon not-napoleon merged commit 4664ced into elastic:main Apr 26, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants