
KAFKA-3209: KIP-66: single message transforms #2299

Closed
wants to merge 36 commits into from

Conversation

@shikhar (Contributor) commented Jan 3, 2017

Besides API and runtime changes, this PR also includes 2 data transformations (`InsertField`, `HoistToStruct`) and 1 routing transformation (`TimestampRouter`).

There is some gnarliness in `ConnectorConfig` / `ConfigDef` around creating, parsing and validating a dynamic `ConfigDef`.
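
For a concrete picture of the feature, here is a hedged sketch of a connector configuration using a transformation chain. The transforms.<alias>. prefix pattern matches the ConnectorConfig changes reviewed below, but the connector, alias names, and per-transform option keys are illustrative assumptions rather than taken verbatim from this PR:

name=local-file-source
connector.class=FileStreamSource
topic=events
# Comma-separated list of transformation aliases, applied in order.
transforms=insertSource, routeByTime
# Each alias is configured under the "transforms.<alias>." prefix.
transforms.insertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.insertSource.static.field=data_source
transforms.insertSource.static.value=local-file
transforms.routeByTime.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.routeByTime.topic.format=${topic}-${timestamp}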

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/446/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/445/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/444/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot commented Jan 3, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/456/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/455/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/461/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/457/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/467/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/459/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/466/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/465/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/460/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/494/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/493/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Jan 4, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/495/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Jan 5, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/507/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 5, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/508/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 5, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/506/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Jan 10, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/705/
Test FAILed (JDK 7 and Scala 2.10).

*/
public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {

/** Apply transformation to the {@code record} and return another record object (which may be {@code record} itself). Must be thread-safe. **/

We will only support simple 1:{0,1} transformations – i.e. map and filter operations

I think we should add in the javadoc that the returned record can be null.

Contributor

@shikhar Trivial change, but this seems like a good idea so folks know they can do filtering.
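
Since filtering by returning null is the intended idiom, here is a minimal hedged sketch of such a transformation against the interface as it appears in this diff. The class name and config key are hypothetical, and the era's interface may declare additional methods (e.g. a config() method) that are elided here:

import java.util.Map;

import org.apache.kafka.connect.connector.ConnectRecord;

// Hypothetical example: drop every record from one configured topic.
public class DropTopic<R extends ConnectRecord<R>> implements Transformation<R> {

    private String topic;

    @Override
    public void configure(Map<String, ?> configs) {
        topic = (String) configs.get("topic"); // illustrative config key
    }

    @Override
    public R apply(R record) {
        // Returning null filters the record out -- the 1:{0,1} contract.
        return record.topic().equals(topic) ? null : record;
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }
}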

@ggrossetie

Nice work!

Currently I'm using Kafka Connect to consume data from Kafka and push it to Elasticsearch using kafka-connect-elasticsearch.
We've defined a ConsumerInterceptor to validate and apply minor transformations on records (i.e. map and filter operations). As far as I understand, this new feature can be used to do the same thing?

In our use case there's one limitation with this implementation.
Our users can send either a JSON object (one value) or a JSON array (multiple values). In other words, a record can contain one or more values.

Basically what we are doing is:

  1. Iterate over the list of records passed to ConsumerRecords onConsume(ConsumerRecords records)
  2. For each record, we deserialize the data (we now have a List<ObjectNode> containing zero, one or more ObjectNode)
  3. For each ObjectNode, we check that no required fields are missing and we also apply minor transformations
  4. We return a list of records containing zero, one or more records.

Could we change the apply method in the Transformation interface to return a list of records?

-R apply(R record);
+List<R> apply(R record);

@ewencp (Contributor) commented Jan 11, 2017

Yes, you can do this with SMTs instead. The difference is when the transformation occurs: SMTs are generic and occur before the data hits any serialization-format-specific changes.

@ewencp (Contributor) left a comment

@shikhar A few more nits, but looks like we're almost there.

@@ -55,6 +55,11 @@ public TimestampType timestampType() {
}

@Override
public SinkRecord newRecord(String topic, Schema keySchema, Object key, Schema valueSchema, Object value, Long timestamp) {
Contributor

I just noticed that the partition is omitted from this API. Was this intentional? For sinks it would definitely be weird to modify. Sources can technically specify the partition. Is there any chance we'd want to include the Kafka partition in this API as well?


for (String alias : new LinkedHashSet<>(transformAliases)) {
final String prefix = TRANSFORMS_CONFIG + "." + alias + ".";
final String groupPrefix = TRANSFORMS_GROUP + ": " + alias;
Contributor

nit: this isn't really a prefix, it's just a new group

Contributor Author

As an arg to newDef.embed(), it can be a prefix (if a transformation's ConfigDef key spec has a group name; otherwise it acts as the group). But it makes sense to just call it group here, so will do.
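
To illustrate the embedding being discussed, a hedged sketch follows; ConfigDef.embed(keyPrefix, groupPrefix, startingOrd, child) is the method this thread refers to, while the class, method, and helper names here are hypothetical:

import java.util.LinkedHashSet;
import java.util.List;

import org.apache.kafka.common.config.ConfigDef;

// Hedged sketch: fold each transformation's ConfigDef into the connector's
// ConfigDef under its alias so the composite config can be parsed and validated.
class TransformConfigEmbedding {
    static ConfigDef withTransformConfigs(ConfigDef base, List<String> transformAliases) {
        final ConfigDef newDef = new ConfigDef(base);
        int ord = 0;
        for (String alias : new LinkedHashSet<>(transformAliases)) {
            final String keyPrefix = "transforms." + alias + ".";
            final String group = "Transforms: " + alias;
            // In the real code the child ConfigDef comes from the Transformation
            // class named by "transforms.<alias>.type"; stubbed out here.
            final ConfigDef child = configDefFor(alias);
            newDef.embed(keyPrefix, group, ord++, child);
        }
        return newDef;
    }

    // Hypothetical helper standing in for transformation config lookup.
    static ConfigDef configDefFor(String alias) {
        return new ConfigDef();
    }
}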

Collections.sort(connectorPlugins, new Comparator<ConnectorPluginInfo>() {
@Override
public int compare(ConnectorPluginInfo a, ConnectorPluginInfo b) {
return a.clazz().compareTo(b.clazz());
Contributor

The Transformations are sorted by comparing the canonical names of the classes. Should we do that here as well?

Contributor Author

ConnectorPluginInfo.clazz() is the canonical class name

@@ -219,7 +230,7 @@ public void onCompletion(RecordMetadata recordMetadata, Exception e) {
log.trace("Wrote record successfully: topic {} partition {} offset {}",
recordMetadata.topic(), recordMetadata.partition(),
recordMetadata.offset());
-commitTaskRecord(record);
+commitTaskRecord(preTransformRecord);
Contributor

This is interesting, I don't think we really considered the potentially 2x memory usage increase this can cause.

Contributor Author

It shouldn't be a 2x increase since we are not closing over the transformed record in this callback.

Contributor

The callback uses both the pre- and post-transformed record which is why I was saying 2x. I'm not too worried about it, it's just something we didn't realize during the review.

.define(FIELD_CONFIG, ConfigDef.Type.STRING, ConfigDef.NO_DEFAULT_VALUE, ConfigDef.Importance.MEDIUM,
"Field name for the single field that will be created in the resulting Struct.");

private Cache<Schema, Schema> schemaUpdateCache;
Contributor

final and initialize in constructor instead? Doesn't seem to depend on the config at all.

Contributor Author

The Cache API does not provide a clear() method, and I'm relying on nulling out the field in close(). So creating it in init() seems appropriate.
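
A hedged sketch of that lifecycle, using the org.apache.kafka.common.cache utilities this PR relies on; configure() stands in for whatever init hook the transformation uses, and the cache size of 16 is an arbitrary illustrative value:

import java.util.Map;

import org.apache.kafka.common.cache.Cache;
import org.apache.kafka.common.cache.LRUCache;
import org.apache.kafka.common.cache.SynchronizedCache;
import org.apache.kafka.connect.data.Schema;

// Hedged sketch of the cache lifecycle described above.
class SchemaCacheLifecycle {
    private Cache<Schema, Schema> schemaUpdateCache;

    public void configure(Map<String, ?> configs) {
        // (Re)create the cache on configuration rather than in the constructor.
        schemaUpdateCache = new SynchronizedCache<>(new LRUCache<>(16));
    }

    public void close() {
        // Cache exposes no clear(), so null out the field to release entries.
        schemaUpdateCache = null;
    }
}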

/**
* This transformation allows inserting configured attributes of the record metadata as fields in the record key.
* It also allows adding a static data field.
* The record key is required to be of type {@link Schema.Type#STRUCT}.
Contributor

I think this is outdated based on the addition of schemaless support.

/**
* This transformation allows inserting configured attributes of the record metadata as fields in the record value.
* It also allows adding a static data field.
* The record value is required to be of type {@link Schema.Type#STRUCT}.
Contributor

Also out of date here.

*/
public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {

/** Apply transformation to the {@code record} and return another record object (which may be {@code record} itself). Must be thread-safe. **/
Contributor

Is the thread-safe comment actually true? Won't these be instantiated per-task and only execute in that task's thread?

Contributor Author

Presently, this is true. However, it makes room for potentially operating over records in parallel -- easier to have the contract now than to add it later. I'm open to not doing this, since it allows simplifications (e.g. not needing SynchronizedCache), if you don't think finer-grained parallelism in the workers is a likely direction.

@ewencp (Contributor) commented Jan 11, 2017

@Mogztter Sorry, I missed the last part of your question. Transformations are 1:1 or 1:0 only for a reason. Connect tracks offsets for connectors automatically, and allowing a flatMap-like transformation would at best make handling those offsets (which are defined by the connector in the case of sources) a lot harder, or at worst break the guarantees the framework can provide. The intent of this feature is to support only lightweight transformations; Connect is still focused on moving data between systems and is not intended to be a fully-featured transformation engine. The basic transformations are being added just to support things like removing PII and doing very basic per-message data cleanup, to avoid having to make extra copies of the data. If you need more complicated transformations, you should take a look at Kafka Streams, which is designed for that.
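
For contrast, the kind of 1:N expansion being asked about is a natural fit for the Streams DSL. A hedged sketch against the current API follows; the topic names and newline-based splitting are illustrative:

import java.util.Arrays;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

// Hedged sketch: a flatMap-style 1:N expansion, which SMTs deliberately do
// not support, expressed in Kafka Streams instead.
class ExplodeJsonArrays {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("raw-json-batches");
        input.flatMapValues(value -> Arrays.asList(value.split("\n"))) // one record in, many values out
             .to("exploded-json");
        // builder.build() would then be handed to a KafkaStreams instance.
    }
}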

@shikhar (Contributor, Author) commented Jan 11, 2017

@Mogztter what @ewencp said; sorry, I should have included this as a 'Rejected alternative' in the KIP.

@asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/754/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/756/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/754/
Test FAILed (JDK 7 and Scala 2.10).

@shikhar (Contributor, Author) commented Jan 12, 2017

@ewencp this should be ready for a final pass. I've also run systests and fixed one issue that showed up (2628156).

@asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/804/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/806/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot commented Jan 12, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/804/
Test PASSed (JDK 7 and Scala 2.10).


@ewencp (Contributor) commented Jan 13, 2017

LGTM. Going to merge this now, but we'll have to remember to follow up and close the JIRA since JIRA is currently down -- although I guess we still have the remaining transformations to tackle anyway.

@shikhar (Contributor, Author) commented Jan 13, 2017

Thanks @ewencp for the thorough review!

@shikhar deleted the smt-2017 branch January 13, 2017 00:40
asfgit pushed a commit that referenced this pull request Jan 13, 2017
…ourceTask

Followup to #2299 for KAFKA-3209

Author: Shikhar Bhushan <shikhar@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes #2365 from shikhar/2299-followup
soenkeliebau pushed a commit to soenkeliebau/kafka that referenced this pull request Feb 7, 2017
Besides API and runtime changes, this PR also includes 2 data transformations (`InsertField`, `HoistToStruct`) and 1 routing transformation (`TimestampRouter`).

There is some gnarliness in `ConnectorConfig` / `ConfigDef` around creating, parsing and validating a dynamic `ConfigDef`.

Author: Shikhar Bhushan <shikhar@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes apache#2299 from shikhar/smt-2017
soenkeliebau pushed a commit to soenkeliebau/kafka that referenced this pull request Feb 7, 2017
…ourceTask

Followup to apache#2299 for KAFKA-3209

Author: Shikhar Bhushan <shikhar@confluent.io>

Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>

Closes apache#2365 from shikhar/2299-followup