
KAFKA-3209: KIP-66: single message transforms #2299

Closed
wants to merge 36 commits into from

Conversation

@shikhar (Contributor) commented Jan 3, 2017

Besides API and runtime changes, this PR also includes 2 data transformations (`InsertField`, `HoistToStruct`) and 1 routing transformation (`TimestampRouter`).

There is some gnarliness in `ConnectorConfig` / `ConfigDef` around creating, parsing and validating a dynamic `ConfigDef`.
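
For a concrete picture of the feature, here is a hedged sketch of a connector configuration using a transformation chain. The transforms.<alias>. prefix pattern matches the ConnectorConfig changes reviewed below, but the connector, alias names, and per-transform option keys are illustrative assumptions rather than taken verbatim from this PR:

name=local-file-source
connector.class=FileStreamSource
topic=events
# Comma-separated list of transformation aliases, applied in order.
transforms=insertSource, routeByTime
# Each alias is configured under the "transforms.<alias>." prefix.
transforms.insertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.insertSource.static.field=data_source
transforms.insertSource.static.value=local-file
transforms.routeByTime.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.routeByTime.topic.format=${topic}-${timestamp}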

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/446/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/445/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/444/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/456/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/455/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/461/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/457/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/467/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/459/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/466/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/465/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/460/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/494/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/493/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/495/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 5, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/507/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 5, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/508/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 5, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/506/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Jan 10, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/705/
Test FAILed (JDK 7 and Scala 2.10).

*/
public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {

/** Apply transformation to the {@code record} and return another record object (which may be {@code record} itself). Must be thread-safe. **/

We will only support simple 1:{0,1} transformations – i.e. map and filter operations

I think we should add in the javadoc that the returned record can be null.

Contributor

@shikhar Trivial change, but this seems like a good idea so folks know they can do filtering.
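
Since filtering by returning null is the intended idiom, here is a minimal hedged sketch of such a transformation against the interface as it appears in this diff. The class name and config key are hypothetical, and the era's interface may declare additional methods (e.g. a config() method) that are elided here:

import java.util.Map;

import org.apache.kafka.connect.connector.ConnectRecord;

// Hypothetical example: drop every record from one configured topic.
public class DropTopic<R extends ConnectRecord<R>> implements Transformation<R> {

    private String topic;

    @Override
    public void configure(Map<String, ?> configs) {
        topic = (String) configs.get("topic"); // illustrative config key
    }

    @Override
    public R apply(R record) {
        // Returning null filters the record out -- the 1:{0,1} contract.
        return record.topic().equals(topic) ? null : record;
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }
}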

@ggrossetie

Nice work!

Currently I'm using Kafka Connect to consume data from Kafka and push it to Elasticsearch using kafka-connect-elasticsearch.
We've defined a ConsumerInterceptor to validate and apply minor transformations on records (i.e. map and filter operations). As far as I understand, this new feature can be used to do the same thing?

In our use case there's one limitation with this implementation.
Our users can send either a JSON object (one value) or a JSON array (multiple values). In other words, a record can contain one or more values.

Basically what we are doing is:

  1. Iterate over the list of records passed to ConsumerRecords onConsume(ConsumerRecords records)
  2. For each record, we deserialize the data (we now have a List<ObjectNode> containing zero, one or more ObjectNode)
  3. For each ObjectNode, we check that no required fields are missing and we also apply minor transformations
  4. We return a list of records containing zero, one or more records.

Could we change the apply method in the Transformation interface to return a list of records?

-R apply(R record);
+List<R> apply(R record);

@ewencp (Contributor) commented Jan 11, 2017

Yes, you can do this with SMTs instead. The difference is when the transformation occurs: SMTs are generic and occur before the data hits any serialization-format-specific changes.

@ewencp (Contributor) left a comment

@shikhar A few more nits, but looks like we're almost there.

@@ -55,6 +55,11 @@ public TimestampType timestampType() {
}

@Override
public SinkRecord newRecord(String topic, Schema keySchema, Object key, Schema valueSchema, Object value, Long timestamp) {
Contributor

I just noticed that the partition is omitted from this API. Was this intentional? For sinks it would definitely be weird to modify. Sources can technically specify the partition. Is there any chance we'd want to include the Kafka partition in this API as well?


for (String alias : new LinkedHashSet<>(transformAliases)) {
final String prefix = TRANSFORMS_CONFIG + "." + alias + ".";
final String groupPrefix = TRANSFORMS_GROUP + ": " + alias;
Contributor

nit: this isn't really a prefix, it's just a new group

Contributor Author

As an arg to newDef.embed(), it can be a prefix (if a transformation's ConfigDef key spec has a group name; otherwise it acts as the group). But it makes sense to just call it group here, so will do.
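
To illustrate the embedding being discussed, a hedged sketch follows; ConfigDef.embed(keyPrefix, groupPrefix, startingOrd, child) is the method this thread refers to, while the class, method, and helper names here are hypothetical:

import java.util.LinkedHashSet;
import java.util.List;

import org.apache.kafka.common.config.ConfigDef;

// Hedged sketch: fold each transformation's ConfigDef into the connector's
// ConfigDef under its alias so the composite config can be parsed and validated.
class TransformConfigEmbedding {
    static ConfigDef withTransformConfigs(ConfigDef base, List<String> transformAliases) {
        final ConfigDef newDef = new ConfigDef(base);
        int ord = 0;
        for (String alias : new LinkedHashSet<>(transformAliases)) {
            final String keyPrefix = "transforms." + alias + ".";
            final String group = "Transforms: " + alias;
            // In the real code the child ConfigDef comes from the Transformation
            // class named by "transforms.<alias>.type"; stubbed out here.
            final ConfigDef child = configDefFor(alias);
            newDef.embed(keyPrefix, group, ord++, child);
        }
        return newDef;
    }

    // Hypothetical helper standing in for transformation config lookup.
    static ConfigDef configDefFor(String alias) {
        return new ConfigDef();
    }
}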

Collections.sort(connectorPlugins, new Comparator<ConnectorPluginInfo>() {
@Override
public int compare(ConnectorPluginInfo a, ConnectorPluginInfo b) {
return a.clazz().compareTo(b.clazz());
Contributor

The Transformations are sorted by comparing the canonical names of the classes. Should we do that here as well?

Contributor Author

ConnectorPluginInfo.clazz() is the canonical class name

@@ -219,7 +230,7 @@ public void onCompletion(RecordMetadata recordMetadata, Exception e) {
log.trace("Wrote record successfully: topic {} partition {} offset {}",
recordMetadata.topic(), recordMetadata.partition(),
recordMetadata.offset());
-commitTaskRecord(record);
+commitTaskRecord(preTransformRecord);
Contributor

This is interesting, I don't think we really considered the potentially 2x memory usage increase this can cause.

Contributor Author

It shouldn't be a 2x increase since we are not closing over the transformed record in this callback.

Contributor

The callback uses both the pre- and post-transformed record which is why I was saying 2x. I'm not too worried about it, it's just something we didn't realize during the review.

.define(FIELD_CONFIG, ConfigDef.Type.STRING, ConfigDef.NO_DEFAULT_VALUE, ConfigDef.Importance.MEDIUM,
"Field name for the single field that will be created in the resulting Struct.");

private Cache<Schema, Schema> schemaUpdateCache;
Contributor

final and initialize in constructor instead? Doesn't seem to depend on the config at all.

Contributor Author

The Cache API does not provide a clear() method, and I'm relying on nulling out the field in close(). So creating it in init() seems appropriate.
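
A hedged sketch of that lifecycle, using the org.apache.kafka.common.cache utilities this PR relies on; configure() stands in for whatever init hook the transformation uses, and the cache size of 16 is an arbitrary illustrative value:

import java.util.Map;

import org.apache.kafka.common.cache.Cache;
import org.apache.kafka.common.cache.LRUCache;
import org.apache.kafka.common.cache.SynchronizedCache;
import org.apache.kafka.connect.data.Schema;

// Hedged sketch of the cache lifecycle described above.
class SchemaCacheLifecycle {
    private Cache<Schema, Schema> schemaUpdateCache;

    public void configure(Map<String, ?> configs) {
        // (Re)create the cache on configuration rather than in the constructor.
        schemaUpdateCache = new SynchronizedCache<>(new LRUCache<>(16));
    }

    public void close() {
        // Cache exposes no clear(), so null out the field to release entries.
        schemaUpdateCache = null;
    }
}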

/**
* This transformation allows inserting configured attributes of the record metadata as fields in the record key.
* It also allows adding a static data field.
* The record key is required to be of type {@link Schema.Type#STRUCT}.
Contributor

I think this is outdated based on the addition of schemaless support.

/**
* This transformation allows inserting configured attributes of the record metadata as fields in the record value.
* It also allows adding a static data field.
* The record value is required to be of type {@link Schema.Type#STRUCT}.
Contributor

Also out of date here.

*/
public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {

/** Apply transformation to the {@code record} and return another record object (which may be {@code record} itself). Must be thread-safe. **/
Contributor

Is the thread-safe comment actually true? Won't these be instantiated per-task and only execute in that task's thread?

Contributor Author

Presently, this is true. However, it makes room for potentially operating over records in parallel -- easier to have the contract now than to add it later. I'm open to not doing this, since it allows simplifications (e.g. not needing SynchronizedCache), if you don't think finer-grained parallelism in the workers is a likely direction.

@ewencp (Contributor) commented Jan 11, 2017

@Mogztter Sorry, I missed the last part of your question. Transformations are 1:1 or 1:0 only for a reason. Connect tracks offsets for connectors automatically, and allowing a flatMap-like transformation would at best make handling those offsets (which are defined by the connector in the case of sources) a lot harder, or at worst break the guarantees the framework can provide. The intent of this feature is to support only lightweight transformations; Connect is still focused on moving data between systems and is not intended to be a fully-featured transformation engine. The basic transformations are being added just to support things like removing PII and doing very basic per-message data cleanup, to avoid having to make extra copies of the data. If you need more complicated transformations, you should take a look at Kafka Streams, which is designed for that.
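
For contrast, the kind of 1:N expansion being asked about is a natural fit for the Streams DSL. A hedged sketch against the current API follows; the topic names and newline-based splitting are illustrative:

import java.util.Arrays;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

// Hedged sketch: a flatMap-style 1:N expansion, which SMTs deliberately do
// not support, expressed in Kafka Streams instead.
class ExplodeJsonArrays {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("raw-json-batches");
        input.flatMapValues(value -> Arrays.asList(value.split("\n"))) // one record in, many values out
             .to("exploded-json");
        // builder.build() would then be handed to a KafkaStreams instance.
    }
}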

@shikhar (Contributor, Author) commented Jan 11, 2017

@Mogztter what @ewencp said; sorry, I should have included this as a 'Rejected alternative' in the KIP.

@asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/754/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/756/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/754/
Test FAILed (JDK 7 and Scala 2.10).

@shikhar (Contributor, Author) commented Jan 12, 2017

@ewencp this should be ready for a final pass. I've also run systests and fixed one issue that showed up (2628156).

@asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/804/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/806/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/804/
Test PASSed (JDK 7 and Scala 2.10).


@ewencp (Contributor) commented Jan 13, 2017

LGTM. Going to merge this now, but we'll have to remember to follow up and close the JIRA since JIRA is currently down -- although I guess we still have the remaining transformations to tackle anyway.

@shikhar (Contributor, Author) commented Jan 13, 2017

Thanks @ewencp for the thorough review!

@shikhar deleted the smt-2017 branch January 13, 2017 00:40
asfgit pushed a commit that referenced this pull request Jan 13, 2017
…ourceTask

Followup to #2299 for KAFKA-3209

Author: Shikhar Bhushan <shikhar@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes #2365 from shikhar/2299-followup
soenkeliebau pushed a commit to soenkeliebau/kafka that referenced this pull request Feb 7, 2017
Besides API and runtime changes, this PR also includes 2 data transformations (`InsertField`, `HoistToStruct`) and 1 routing transformation (`TimestampRouter`).

There is some gnarliness in `ConnectorConfig` / `ConfigDef` around creating, parsing and validating a dynamic `ConfigDef`.

Author: Shikhar Bhushan <shikhar@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes apache#2299 from shikhar/smt-2017
soenkeliebau pushed a commit to soenkeliebau/kafka that referenced this pull request Feb 7, 2017
…ourceTask

Followup to apache#2299 for KAFKA-3209

Author: Shikhar Bhushan <shikhar@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes apache#2365 from shikhar/2299-followup