
feat: plan serialization jackson annotations #3520

Merged
4 commits merged into confluentinc:master from jackson-annotations-redux on Nov 20, 2019

Conversation

@rodesai (Contributor) commented Oct 9, 2019

This patch adds the Jackson annotations needed to serialize execution plans:
- query context
- DDLs
- execution steps
- name types

It also adds annotations for a few types that the above types depend on:
- timestamp extractors
- format pojos needed for serdes
These will be cleaned up in a later change.

Finally, this patch includes a tool for generating a JsonSchema spec from the
annotated types, a first cut of this schema, and a test for checking that the
schema isn't changed.
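For context, the annotation pattern applied across these types looks roughly like the following (a minimal sketch; the class and property names here are illustrative, not actual ksql types):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

// Illustrative only: shows the @JsonCreator/@JsonProperty style used on the plan types.
public final class ExamplePlanStep {

  private final String sourceName;
  private final String planSummary;

  @JsonCreator
  public ExamplePlanStep(
      @JsonProperty(value = "sourceName", required = true) final String sourceName,
      @JsonProperty("planSummary") final String planSummary) {
    this.sourceName = sourceName;
    this.planSummary = planSummary;
  }

  public String getSourceName() {
    return sourceName;
  }

  public String getPlanSummary() {
    return planSummary;
  }
}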

@rodesai requested a review from a team as a code owner on October 9, 2019 21:59
@purplefox (Contributor)

Hey @rodesai,

I am curious about the general approach here. I notice the fields of the objects that are serialized are specified using Jackson annotations.

Does that not mean the serialized form will change even if there is a single change to any of the fields that are annotated? How do we then maintain compatibility between versions of ksql?

E.g. if node 1 is running a slightly different version of ksql to node 2, then the serialized form shared between the nodes won't be compatible unless there have been no changes to any of the annotated objects.

If we are defining a serialization format that we want to be usable between different versions of ksql, I suspect we want to define it elsewhere and lock it down.

@rodesai (Contributor, Author) commented Oct 27, 2019

> Does that not mean the serialized form will change even if there is a single change to any of the fields that are annotated? How do we then maintain compatibility between versions of ksql? E.g. if node 1 is running a slightly different version of ksql to node 2 then the serialized form shared between the nodes won't be compatible unless there have been no changes to any of the annotated objects.

To make sure we're on the same page: the current design is for these plans to be backwards-compatible - new versions of KSQL should be able to read and execute old plans, and old versions should be able to detect new, incompatible plans (and then halt command topic execution).

To make sure this happens, we'll modify the query translation tests to write out and run against the execution plans KSQL generates. We will also extend these to fail when the plan changes (even in a compatible way). If the plan changes in an incompatible way the tests will fail to deserialize or fail test cases, and the developer should fix their change. If the plan changes in a compatible way, we'll have a tool for generating and writing out a new set of plans (for the cases that need it, alongside the old plans).

> If we are defining a serialization format that we want to be usable between different versions of ksql, I suspect we want to define it elsewhere and lock it down.

I (maybe?) agree here. I'll be moving these into their own module (ksql-models) in a follow-up. The testing described above should be enough to keep them compatible. Not sure if this goes as far as what you had in mind.

@purplefox (Contributor)

> I (maybe?) agree here. I'll be moving these into their own module (ksql-models) in a follow-up.

I think, quite possibly, that I don't understand the overall plan well enough.

But I kind of hear alarm bells going off if we're exposing part of the implementation as something that is set in stone and can't evolve over time. In general, any approach where we tie the implementation down to a specific form sounds like a set of ankle shackles for the future, and that scares me.

It reminds me a bit of the horror of Java serialization, where users would serialize parts of their internal object model, which was then deserialized by other instances. For that to work across versions, the internal objects that were serialized could never change (or at least could only change in a compatible way); either way, it was a bit of a nightmare.

Let's talk about this in person. As I said, it's quite possible I haven't fully understood the proposal yet :)

@rodesai changed the title from "DRAFT feat: plan serialization" to "feat: plan serialization jackson annotations" on Nov 1, 2019
@big-andy-coates (Contributor)

Some quick thoughts on the serialized form:

Stream-Stream Join:
https://gist.github.com/rodesai/19ff1591a5c0d17cccd2936da4d7704d

  • $.ddlCommand.sqlExpression duplicates $.statementText, which seems unnecessary.
  • $.ddlCommand.topic.ksqlSink is implicit from the command, so likely not needed
  • $.queryPlan.physicalPlan.physicalPlan.. properties.schema is likewise generated by the plan node, i.e. given a known input schema and a known version of the plan node, the output schema is deterministic. So I don't think we need to include this.
  • $.queryPlan.physicalPlan.physicalPlan.source.source.[leftFormats & rightFormats] duplicates data held in $.queryPlan.physicalPlan.physicalPlan.source.source.[left & right].formats.
  • 'sourceSchema' in the left and right nodes looks to be unnecessary too
  • why do we serialize planSummary?

I think the issue here is that you're serializing the physical model and ddl command as-is, rather than only serializing the minimum data we need to rebuild that physical plan. If you got rid of all the duplication and unnecessary/implicit data, then the serialized form would be much cleaner. Of course, this means you can't just deserialize each type directly. You'll need something that can pull the necessary information together, e.g. to build the ddl command you'd need the statement text from the top level object and the details within the ddlCommand node.

This comes back to my comment on a previous PR on the same topic: we need to be careful how we expose this internal state, so that we're as loosely coupled to it as we can be (as Tim alludes to above).

@rodesai (Contributor, Author) commented Nov 6, 2019

> $.queryPlan.physicalPlan.physicalPlan.. properties.schema is likewise generated by the plan node, i.e. given a known input schema and a known version of the plan node, the output schema is deterministic. So I don't think we need to include this.

I agree, and plan to remove this in a follow-up as discussed on #3722

> $.ddlCommand.topic.ksqlSink is implicit from the command, so likely not needed

I've got a follow-up to clean this up. I'll add a @JsonIgnore to it for now.

> $.queryPlan.physicalPlan.physicalPlan.source.source.[leftFormats & rightFormats] duplicates data held in $.queryPlan.physicalPlan.physicalPlan.source.source.[left & right].formats.

Yes, but I think this is actually useful - we might want to use a different serialization format when repartitioning or materializing a stream/table. That said, maybe there's some information we can clean up from this (e.g. windowing info)

> why do we serialize planSummary?

It's a hold-over for compatibility reasons from before the major version bump. Will clean this up.

@big-andy-coates (Contributor) left a comment:

Thanks @rodesai

General comments:

  • We should be marking public factory methods as @JsonCreator, not private constructors.
  • We should avoid marking private fields as @JsonProperty wherever possible. It breaks encapsulation.
  • We should mark @JsonProperty parameters as required where they are not optional. (See the sketch after this list.)
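A minimal sketch of the preferred pattern (illustrative type, not one of the actual ksql classes):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

// Illustrative: @JsonCreator sits on the public factory method, the private field and
// constructor carry no Jackson annotations, and the non-optional property is required.
public final class ExampleName {

  private final String name;

  @JsonCreator
  public static ExampleName of(
      @JsonProperty(value = "name", required = true) final String name) {
    return new ExampleName(name);
  }

  private ExampleName(final String name) {
    this.name = name;
  }

  public String getName() {
    return name;
  }
}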

Having said that, I'm still going to block this PR for two reasons:

  1. There are no tests that any of these types serialize to/from JSON correctly. If we're going this route, it's crucial to have such tests.
  2. I still have reservations about the approach. Details below.

So... design-wise, I understand that the idea of serializing the JSON representation of our query plan to the command topic is to allow us to:
a) Change our SQL syntax in non-backwards-compatible ways.
b) Change the query plan a specific SQL query is built as, e.g. avoiding an unnecessary repartition step.

However, the current approach still feels like we're strongly coupling our implementation with our serialized form. @purplefox had reservations about this too. At the moment, it feels like too much of our internal implementation is being exposed in the JSON, because the JSON is being built from our internal model. This makes it much harder to change that internal model. For example, I plan on enhancing LogicalSchema to know if the key is windowed or not (after all, this is part of the schema!). This will mean we no longer need KeyFormat, but KeyFormat is part of our JSON model.

So, what's the solution? I'm not 100% sure. My gut is that we need to decouple the JSON from many of the types that have JSON annotations added in this PR. Maybe we should limit the classes that get serialized to JSON to those in the ksql-execution module and the io.confluent.ksql.execution package, plus maybe the *Name types, or something like that? We could add new POJO types to this package to hold the info we need for a specific part of the process, and then later use these POJOs to build the actual types we need. This might be more work, but it would provide a better design, less coupling and more flexibility. Another alternative would be to use custom serde classes, though this tends to lead to less readable code. Yet another option would be to define the schema as a JSON Schema and code-gen classes from it.

@purplefox might have some ideas too. Maybe we can have a three-way discussion? I'm sure we can come up with something!

As I said, I'm blocking the PR for this reason. Feel free to reach out on Slack. I sometimes leave Slack messages unread if I'm concentrating on code. If it's urgent, just call me on Slack.

@@ -51,6 +52,7 @@ public static ColumnName of(final String name) {
return new ColumnName(name);
}

@JsonCreator
Reviewer comment: Move to the public factory method.

@@ -27,6 +28,7 @@ public static FunctionName of(final String name) {
return new FunctionName(name);
}

@JsonCreator
Reviewer comment: Move to the public factory method.

@@ -28,7 +29,7 @@
*/
@Immutable
public abstract class Name<T extends Name<?>> {

@JsonValue
Reviewer comment: Move to the public name getter. (Though you may need to rename the method to getName.)

@@ -27,6 +28,7 @@ public static SourceName of(final String name) {
return new SourceName(name);
}

@JsonCreator
Reviewer comment: Move to the public factory method.

@@ -41,10 +42,11 @@ public static FormatInfo of(
return new FormatInfo(format, avroFullSchemaName, valueDelimiter);
}

@JsonCreator
Reviewer comment: Move to the public factory method.

}

public List<String> getContext() {
return context;
}

@JsonValue
public String formatContext() {
Reviewer comment:

Suggested change:
-  public String formatContext() {
+  public String toString() {

???

Reviewer comment: ^^^ Outstanding suggestion from last review.


@JsonCreator
private QueryContext(final String context) {
this(Arrays.asList(context.split(DELIMITER)));
Reviewer comment:

Suggested change:
-    this(Arrays.asList(context.split(DELIMITER)));
+    this(ImmutableList.copyOf(context.split(DELIMITER)));

@JsonProperty("sourceName") final SourceName sourceName,
@JsonProperty("schema") final LogicalSchema schema,
@JsonProperty("keyField") final Optional<ColumnName> keyField,
@JsonProperty("timestampExtractionPolicy") final TimestampExtractionPolicy extractionPolicy,
Reviewer comment:

Suggested change:
-    @JsonProperty("timestampExtractionPolicy") final TimestampExtractionPolicy extractionPolicy,
+    @JsonProperty("timestampExtractor") final TimestampExtractionPolicy timestampExtractor,

@rodesai (Contributor, Author) commented Nov 18, 2019

Summarizing our discussion from last week:

  • The main concerns with the change in this PR are:

    • the serialized model is too tightly coupled with KSQL's implementation. For example, serializing the format info types makes it hard to change how we represent windows internally.
    • using Jackson annotations on KSQL's types is perhaps a brittle way of defining the plan schema. For example, we have Jackson annotations on other types and developers routinely break them in incompatible ways.
  • Getting some serialization implementation merged will help a lot because it unlocks progress that can be made in parallel with the work to trim down the model. Until we actually write the plans out anywhere, we can change the format however we want.

  • So, to move forward, we'll do the following:

    • To address the second concern, we'll generate a JsonSchema specification from the current model (multiple tools exist to do this automatically), check it in as part of this PR, and add a test that asserts it is not changed (a sketch of such a test follows this list). This way devs are forced to think through whether their change is backwards-compatible. This tooling can be extended to check for compatibility at the time the new schema is committed (rather than just for equality at test time).
    • Any changes that persist the serialized format anywhere will be flagged off by default.
    • I'll do some follow-up work to clean up the serialized model.
    • The team will do a review of the schema before turning on changes that write out the plans.
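A rough sketch of what that schema check could look like (the resource path and test wiring here are assumptions for illustration, not the actual test in this PR; the KsqlPlanV1 import is omitted for brevity):

import static org.junit.Assert.assertEquals;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.kjetland.jackson.jsonSchema.JsonSchemaGenerator;
import java.io.InputStream;
import org.junit.Test;

public class PlanSchemaTest {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  @Test
  public void shouldNotChangeCheckedInSchema() throws Exception {
    // Load the schema that was generated and checked in (path is illustrative).
    final JsonNode checkedIn;
    try (InputStream in = PlanSchemaTest.class.getResourceAsStream("/ksql-plan-schema/schema.json")) {
      checkedIn = MAPPER.readTree(in);
    }

    // Regenerate the schema from the annotated types and require it to be identical,
    // so any change to the plan model forces a conscious schema update.
    final JsonNode regenerated = new JsonSchemaGenerator(MAPPER).generateJsonSchema(KsqlPlanV1.class);
    assertEquals(checkedIn, regenerated);
  }
}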

@rodesai force-pushed the jackson-annotations-redux branch 2 times, most recently from fa55570 to 0aa3b17, on November 19, 2019 00:54
@@ -113,6 +113,20 @@
<scope>test</scope>
</dependency>

<dependency>
<groupId>com.kjetland</groupId>
<artifactId>mbknor-jackson-jsonschema_2.12</artifactId>
Author comment:

This is the JSON schema generation library. I played with a few, and this was the only one that correctly handles interfaces and Jackson annotations. It uses the MIT license.
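For reference, basic usage of the library looks roughly like this (a sketch with a stand-in type; the actual generation tool in this PR configures more, e.g. the type remapping discussed later in this conversation):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.kjetland.jackson.jsonSchema.JsonSchemaGenerator;

public final class SchemaGenExample {

  // Stand-in annotated type; the real tool runs over the plan types (e.g. KsqlPlanV1).
  public static final class ExampleStep {
    public String sourceName;
    public String planSummary;
  }

  public static void main(final String[] args) throws Exception {
    final ObjectMapper mapper = new ObjectMapper();

    // Generate a JSON schema from the Jackson view of the type and print it.
    final JsonNode schema = new JsonSchemaGenerator(mapper).generateJsonSchema(ExampleStep.class);
    System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(schema));
  }
}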

<dependency>
<groupId>javax.validation</groupId>
<artifactId>validation-api</artifactId>
<version>${javax-validation.version}</version>
Author comment: The schema generator relies on a newer version of javax-validation than we pull in by default.

@@ -0,0 +1,1062 @@
{
Author comment:

Currently, the schema includes:

Plan Nodes:
KsqlPlanV1
QueryPlan
PhysicalPlan[Object]
StreamAggregate
DefaultExecutionStepProperties
StreamFilter
StreamFlatMap
StreamGroupBy
StreamGroupByKey
StreamMapValues
StreamSelectKey
StreamSink
StreamSource
WindowedStreamSource
StreamStreamJoin
StreamTableJoin
StreamToTable
StreamWindowedAggregate
TableAggregate
TableFilter
TableGroupBy
TableMapValues
TableSink
TableTableJoin
ExecutionStep

DDL commands:
CreateStreamCommand
KsqlTopic
CreateTableCommand
RegisterTypeCommand
DropSourceCommand
DropTypeCommand

Timestamp Extractors:
MetadataTimestampExtractionPolicy
StringTimestampExtractionPolicy
LongColumnTimestampExtractionPolicy

Formats:
KeyFormat
FormatInfo
WindowInfo

I'm working on follow-ups to clean up the timestamp extractors and format types from this schema.

Reviewer comment: We should remove KsqlTopic from this list for sure. That's pure implementation detail.

@big-andy-coates (Contributor) left a comment:

Thanks @rodesai

The fact that there are no tests verifying that these types can be serialized to/from JSON seems very wrong to me. The PR is all about adding the annotations to serialize to/from JSON, but we don't know if the code is right. IMHO, each annotated class should have suitable unit tests to ensure correct serialization and deserialization, along the lines of the sketch below.
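As an illustration (not code from this PR), a round-trip test for an annotated type could look like this, reusing the illustrative ExamplePlanStep type sketched earlier:

import static org.junit.Assert.assertEquals;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.Test;

public class ExamplePlanStepSerializationTest {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  @Test
  public void shouldRoundTripThroughJson() throws Exception {
    // Given an instance of the annotated type (assumes equals()/hashCode() are implemented):
    final ExamplePlanStep original = new ExamplePlanStep("my_source", "summary");

    // When it is written to JSON and read back:
    final String json = MAPPER.writeValueAsString(original);
    final ExamplePlanStep deserialized = MAPPER.readValue(json, ExamplePlanStep.class);

    // Then the deserialized instance matches the original:
    assertEquals(original, deserialized);
  }
}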

Aside from that, the PR LGTM.

@@ -23,7 +26,6 @@
*/
@Immutable
public final class ValueFormat {

Reviewer comment: ^^^ Outstanding question/nit from last review.

@@ -41,14 +43,16 @@ public static KeyFormat windowed(
return new KeyFormat(format, Optional.of(windowInfo));
}

@JsonCreator
Reviewer comment: ^^^ Outstanding question/issue from the last review.

private final WindowType type;
private final Optional<Duration> size;

public static WindowInfo of(final WindowType type, final Optional<Duration> size) {
return new WindowInfo(type, size);
}

private WindowInfo(final WindowType type, final Optional<Duration> size) {
@JsonCreator
Reviewer comment: Move annotations to the public factory method, please.


private final ColumnRef timestampField;
@JsonProperty(FORMAT)
Reviewer comment: ^^^ Outstanding issue from last review.

@JsonProperty(value = "physicalPlan", required = true) final ExecutionStep<T> physicalPlan,
@JsonProperty(value = "planSummary", required = true) final String planSummary
) {
this(queryId, physicalPlan, planSummary, Optional.empty());
Reviewer comment: How come we're not serializing the key field?

Author comment: It's not needed - it's already in the DDL. Cleaning this up in an upcoming PR.

"$ref" : "#/definitions/FormatInfo"
},
"ksqlSink" : {
"type" : "boolean"
Reviewer comment: We definitely shouldn't be exposing this. (Follow-up PR to fix.)

"$ref" : "#/definitions/FormatInfo"
},
"windowInfo" : {
"$ref" : "#/definitions/WindowInfo"
Reviewer comment: Meh... this is going to make it hard to move window info to the schema...

"delimiter" : {
"type" : "string"
},
"avroFullSchemaName" : {
Reviewer comment: We should make this just fullSchemaName so that it's not Avro-specific. (Follow-up PR.)

"type" : "string"
}
},
"required" : [ "format", "delimiter" ]
Reviewer comment: delimiter is NOT required - it's only required for the DELIMITED format.

},
"required" : [ "sources", "sink", "physicalPlan" ]
},
"PhysicalPlan[Object]" : {
Reviewer comment: We should remove the generics on PhysicalPlan so that we drop the [Object] here. The generics aren't actually of any use.

Collections.emptySet(),
// the schema generator doesn't play nice with custom serializers, so we add a
// config to remap the custom-serialized types to their underlying primitive
new ImmutableMap.Builder<Class<?>, Class<?>>()
Reviewer comment: Is this only a temporary thing? What does this mean in terms of:

a) the schema generated: are we losing schema validation on these types?
b) the serialized form: how do these types look when serialized?

@rodesai (Contributor, Author) replied Nov 19, 2019

> a) the schema generated: are we losing schema validation on these types?

Yes - from a JSON Schema POV we can't do much better than defining these schemas as strings. I think we can add a contentMediaType that describes what the string contains (e.g. text/sqlexpression, text/sqlschema, etc.). Let me give that a go.

> b) the serialized form: how do these types look when serialized?

These are all serialized as text parseable by the ksql parser.

@big-andy-coates (Contributor) left a comment:

I'm approving purely to unblock future work. Personally, I still feel strongly that each annotated type should have associated unit tests to test serialization to/from JSON.

Please also get approval from @agavra or @purplefox

@rodesai (Contributor, Author) commented Nov 19, 2019

> Personally, I still feel strongly that each annotated type should have associated unit tests to test serialization to/from JSON.

I definitely agree we need these tests. I'm just proposing we put them in when the final format is settled, but before actually writing these models out anywhere.

@rodesai merged commit 2736d77 into confluentinc:master on Nov 20, 2019