
KAFKA-5671: Add StreamsBuilder and Deprecate KStreamBuilder #3602

Closed
wants to merge 3 commits into from

Conversation

@mjsax mjsax commented Jul 31, 2017

No description provided.

@mjsax (Member, Author) commented Jul 31, 2017

Call for review @bbejeck @guozhangwang @enothereska @dguy

@asfgit commented Jul 31, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6439/
Test PASSed (JDK 7 and Scala 2.11).

@asfgit commented Jul 31, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6424/
Test PASSed (JDK 8 and Scala 2.12).

@dguy (Contributor) left a comment:

Thanks @mjsax, my main concern is with the use of reflection. I'd like to think we can somehow avoid it, but I think the package structure etc. is working against us quite a bit.

* Long valueForKey = localStore.get(key);
* }</pre>
* Note that {@link GlobalKTable} always applies {@code "auto.offset.reset"} strategy {@code "earliest"}
* regardless of the speci
Contributor:

Do we need addStateStore on the DSL?

Member Author:

This is what we decided on (cf. the KIP). Strictly speaking it's not required, as people could do builder.build().addStateStore(), but that is quite clumsy. And you need to add stores manually for transform/transformValues/process.
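The convenience argument above can be sketched with simplified stand-in classes (names like MiniStreamsBuilder and MiniTopology are hypothetical, not the real Kafka Streams API): the DSL builder delegates addStateStore to its topology, so users registering stores for transform()/process() never have to call build() early.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Topology: collects registered state stores.
class MiniTopology {
    final List<String> stateStores = new ArrayList<>();

    MiniTopology addStateStore(final String storeName) {
        stateStores.add(storeName);
        return this;
    }
}

// Simplified stand-in for StreamsBuilder: delegates for convenience,
// avoiding the clumsy builder.build().addStateStore(...) pattern.
public class MiniStreamsBuilder {
    private final MiniTopology topology = new MiniTopology();

    public MiniStreamsBuilder addStateStore(final String storeName) {
        topology.addStateStore(storeName);
        return this;
    }

    public MiniTopology build() {
        return topology;
    }

    public static void main(final String[] args) {
        final MiniStreamsBuilder builder = new MiniStreamsBuilder();
        // Stores must be added manually before use in transform()/process().
        builder.addStateStore("my-store");
        System.out.println(builder.build().stateStores);
    }
}
```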

* Long valueForKey = localStore.get(key);
* }</pre>
* Note that {@link GlobalKTable} always applies {@code "auto.offset.reset"} strategy {@code "earliest"}
* regardless of the speci
Contributor:

Not sure we need this on the DSL? It isn't used anywhere and seems more like it should be on Topology?

Member Author:

Both are available in Topology, too. And we add to the DSL for convenience (cf. my other comment). Also, one might want to add a global store for use in transform/transformValues/process, too.

* Long valueForKey = localStore.get(key);
* }</pre>
* Note that {@link GlobalKTable} always applies {@code "auto.offset.reset"} strategy {@code "earliest"}
* regardless of the speci
Contributor:

Ditto

}

public KStreamBuilder() {
// TODO: we should refactor this to avoid usage of reflection
Contributor:

We need to think about how we can avoid this. The package structure appears to be working against us.

Member Author:

I don't like this either; to me, it should be a temporary workaround only. The PR is already huge and I think it's ok to accept this for now to finish the KIP. Refactoring will be internal cleanup only and can be done as a follow-up. WDYT?

Contributor:

Actually I'm not clear why we cannot keep the current impl of KStreamBuilder as is, i.e. leverage the old deprecated TopologyBuilder's APIs in the processor package rather than introducing an InternalStreamsBuilder and reflection?

Member Author:

As discussed, we can't keep it as-is because, e.g., KStreamImpl was updated to accept an InternalStreamsBuilder and no longer accepts a KStreamBuilder.

private final AtomicInteger index = new AtomicInteger(0);

public InternalStreamsBuilder() {
// TODO: we should refactor this to avoid usage of reflection
Contributor:

Again, I think we need to sort this out. Seems to me that somehow we need to pass in both InternalTopologyBuilder and Topology. Using reflection for this is a bit grim.

Contributor:

I think we should be able to just call Topology's public APIs to construct the DAG, e.g. topology.addSource instead of internalTopologyBuilder.addSource, if we believe the DSL could be "independently" layered on top of the PAPI layer instead of integrated with it.

I understand that for some other classes like KStreamImpl there are some places where the public APIs of Topology are not sufficient, and we need to access its internal topology builder for getting internal topics / marker co-partitions / etc. I will leave separate comments on those places. But for this class it seems we do not need to access any of its internal functions.

Member Author:

I agree in general. The goal is to layer the DSL on top of Topology -- but there are still some other "leaking" abstractions in the way of the separation as it "should be". For now, it seems ok to me to work with InternalTopologyBuilder to get access to those "internal methods". I would prefer to do some more refactoring as a follow-up.

}

@Test
public void shouldAddTimestampExtractorToStreamWithKeyValSerdePerSource() throws Exception {
Contributor:

Not sure about the name? In particular the WithKeyValSerde part -- we aren't adding any serdes?

}

@Test
public void shouldAddTimestampExtractorToStreamWithOffsetResetPerSource() throws Exception {
Contributor:

Same here. Also, is this really any different from the test above?

@@ -67,11 +68,24 @@ public void setUp() throws IOException {
stateDir = TestUtils.tempDirectory("kafka-test");
}

private void doTestJoin(final KStreamBuilder builder,
public static Collection<Set<String>> getCopartitionedGroups(StreamsBuilder builder) {
Contributor:

As above. Seems something is not quite right if we are resorting to reflection.

Contributor:

Will try to remove it in a follow-up PR.

final StreamsBuilder builder = new StreamsBuilder();

// TODO: we should refactor this to avoid usage of reflection
final Field internalStreamsBuilderField = builder.getClass().getDeclaredField("internalStreamsBuilder");
Contributor:

Is this the same as in StandbyTaskTest?

final StreamsBuilder builder = new StreamsBuilder();

// TODO: we should refactor this to avoid usage of reflection
final Field internalStreamsBuilderField = builder.getClass().getDeclaredField("internalStreamsBuilder");
Contributor:

Same as above block of code?
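Since the same reflection block appears in several tests, the duplication could be factored into one test helper. A minimal sketch (the Holder class and helper name are hypothetical; the reflection calls mirror the getDeclaredField/setAccessible pattern quoted above):

```java
import java.lang.reflect.Field;

// Stand-in for StreamsBuilder: a class with a private field the tests reach into.
class Holder {
    @SuppressWarnings("unused")
    private final String internalStreamsBuilder = "internal";
}

public class ReflectionHelper {
    // Reads a private field by name, as the duplicated test blocks do
    // with getDeclaredField + setAccessible.
    static Object getPrivateField(final Object target, final String fieldName) throws Exception {
        final Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        return field.get(target);
    }

    public static void main(final String[] args) throws Exception {
        System.out.println(getPrivateField(new Holder(), "internalStreamsBuilder"));
    }
}
```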

*
* @param offsetReset the {@code "auto.offset.reset"} policy to use for the specified topic if no valid committed
* offsets are available
* @param topic the topic name; cannot be {@code null}
@bbejeck (Contributor) Jul 31, 2017:

nit: missing javadoc for timestampExtractor parameter

*
* @param offsetReset the {@code "auto.offset.reset"} policy to use for the specified topic if no valid committed
* offsets are available
* @param keySerde key serde used to send key-value pairs,
Contributor:

nit: missing javadoc param tag for timestampExtractor

* @param keySerde key serde used to send key-value pairs,
* if not specified the default key serde defined in the configuration will be used
* @param valueSerde value serde used to send key-value pairs,
* if not specified the default value serde defined in the configuration will be used
Contributor:

nit: missing javadoc param tag for timestampExtractor


addSource(offsetReset, name, timestampExtractor, keySerde == null ? null : keySerde.deserializer(), valSerde == null ? null : valSerde.deserializer(), topics);
internalTopologyBuilder.addSource(translateAutoOffsetReset(offsetReset), name, timestampExtractor, keySerde == null ? null : keySerde.deserializer(), valSerde == null ? null : valSerde.deserializer(), topics);
Contributor:

nit: with this many parameters including ternary statements maybe place the params on a separate line for readability?


addSource(offsetReset, name, timestampExtractor, keySerde == null ? null : keySerde.deserializer(), valSerde == null ? null : valSerde.deserializer(), topicPattern);
internalTopologyBuilder.addSource(translateAutoOffsetReset(offsetReset), name, timestampExtractor, keySerde == null ? null : keySerde.deserializer(), valSerde == null ? null : valSerde.deserializer(), topicPattern);
Contributor:

nit: same as above

throw new TopologyBuilderException(message);
builder.internalTopologyBuilder.addProcessor(name, new KStreamPrint<>(new PrintForeachAction(printWriter, defaultKeyValueMapper, label), keySerde, valSerde), this.name);
} catch (final FileNotFoundException | UnsupportedEncodingException e) {
throw new TopologyException("Unable to write stream to file at [" + filePath + "] " + e.getMessage());
Contributor:

nit: use String.format for error message
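The nit above can be illustrated with a small self-contained sketch (method and class names hypothetical): String.format keeps the message template in one readable piece instead of concatenation.

```java
public class ErrorMessageFormat {
    // String.format keeps the template and its arguments together,
    // instead of splicing them with '+' as in the quoted code.
    static String unableToWrite(final String filePath, final String cause) {
        return String.format("Unable to write stream to file at [%s] %s", filePath, cause);
    }

    public static void main(final String[] args) {
        System.out.println(unableToWrite("/tmp/out.txt", "Permission denied"));
    }
}
```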

@mjsax (Member, Author) commented Jul 31, 2017

Updated to address @dguy and @bbejeck comments.

@guozhangwang (Contributor) left a comment:

Leave two general comments on trying to remove reflections.

@@ -1094,8 +1094,8 @@
//
// OR
//
KStreamBuilder builder = ...; // when using the Kafka Streams DSL
Topology topology = builder.topology();
StreamsBuilder builder = ...; // when using the Kafka Streams DSL
Contributor:

As mentioned in the previous PR, we should add the paragraph for using describe() as well as building incrementally with the describe -> build -> describe -> build pattern.

Contributor:

Ack.

* {@link KafkaStreams#store(String, QueryableStoreType) KafkaStreams#store(...)}:
* <pre>{@code
* KafkaStreams streams = ...
* ReadOnlyKeyValueStore<String,Long> localStore = streams.store(queryableStoreName, QueryableStoreTypes.<String, Long>keyValueStore());
Contributor:

nit: space after the comma; ditto in similar javadoc elsewhere.

@@ -36,8 +37,8 @@
* For example a user X might buy two items I1 and I2, and thus there might be two records {@code <K:I1>, <K:I2>}
* in the stream.
* <p>
* A {@code KStream} is either {@link KStreamBuilder#stream(String...) defined from one or multiple Kafka topics} that
* are consumed message by message or the result of a {@code KStream} transformation.
* A {@code KStream} is either {@link org.apache.kafka.streams.StreamsBuilder#stream(String...) defined from one or
Contributor:

Do we need to reference the whole path here, since we already import the class above? Ditto elsewhere.


: Collections.singleton(sourceName),
storeSupplier.name(),
isQueryable);
builder.internalTopologyBuilder.addProcessor(aggFunctionName, aggregateSupplier, sourceName);
Contributor:

Same here. Could we just use InternalStreamsBuilder#topology#addProcessor instead?

Member Author:

I don't think the code would be simpler with the suggested change. ATM, the code has a nested "hierarchy" from top level to inner: Topology -> InternalTopologyBuilder -- thus, InternalTopologyBuilder does not even know it's "wrapped" by Topology.

Also, IMHO, in internal classes it's ok to use InternalTopologyBuilder.

Contributor:

Makes sense. Thanks for the explanation.

@@ -132,18 +131,18 @@ private void determineIsQueryable(final String queryableStoreName) {
ChangedDeserializer<? extends V> changedValueDeserializer = new ChangedDeserializer<>(valueDeserializer);

// send the aggregate key-value pairs to the intermediate topic for partitioning
topology.addInternalTopic(topic);
topology.addSink(sinkName, topic, keySerializer, changedValueSerializer, this.name);
builder.internalTopologyBuilder.addInternalTopic(topic);
Contributor:

Okay, this is the place I mentioned where Topology's public API is not sufficient. One way we can work around it is to introduce a constructor in Topology which takes an InternalTopologyBuilder, while the current constructor explicitly creates one:

final InternalTopologyBuilder internalTopologyBuilder = new InternalTopologyBuilder();

The added constructor would also be package-private so it is not exposed to users, but o.a.k.streams.StreamsBuilder can use it: it can create the internalTopologyBuilder and pass it to the Topology in its own constructor, so that it can then hold on to its reference and use it in places like here.

BTW, as I mentioned above, for classes that do not really need to access the internalTopologyBuilder we should restrict ourselves to only calling Topology's public APIs.
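The constructor arrangement suggested above can be sketched with simplified stand-ins (all class names here are hypothetical, not the real Kafka Streams classes): the topology gains a package-private constructor taking the internal builder, the DSL builder creates the internal builder itself and keeps a reference, and no reflection is needed.

```java
// Stand-in for InternalTopologyBuilder: exposes an internal-only operation.
class InternalBuilderSketch {
    int sourceCount = 0;

    void addSource(final String name) {
        sourceCount++;
    }
}

// Stand-in for Topology.
class TopologySketch {
    final InternalBuilderSketch internalTopologyBuilder;

    // The existing public constructor now creates the internal builder explicitly.
    public TopologySketch() {
        this(new InternalBuilderSketch());
    }

    // Package-private: hidden from users, visible to the DSL builder.
    TopologySketch(final InternalBuilderSketch internalTopologyBuilder) {
        this.internalTopologyBuilder = internalTopologyBuilder;
    }
}

// Stand-in for StreamsBuilder: creates the internal builder, hands it to the
// topology, and keeps its own reference for internal-only calls.
public class BuilderSketch {
    final InternalBuilderSketch internal = new InternalBuilderSketch();
    final TopologySketch topology = new TopologySketch(internal);

    public void stream(final String topic) {
        // Internal-only call, reachable without reflection.
        internal.addSource(topic);
    }

    public static void main(final String[] args) {
        final BuilderSketch builder = new BuilderSketch();
        builder.stream("topic-a");
        System.out.println(builder.topology.internalTopologyBuilder.sourceCount);
    }
}
```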

Member Author:

I did some of the refactoring you suggested already. But it's easier and cleaner to go from Topology to InternalTopologyBuilder in one place. For internals, we don't lose anything if we use the internal topology builder IMHO.

@asfgit commented Jul 31, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6448/
Test PASSed (JDK 7 and Scala 2.11).

@asfgit commented Jul 31, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6433/
Test PASSed (JDK 8 and Scala 2.12).

@asfgit commented Jul 31, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6449/
Test PASSed (JDK 7 and Scala 2.11).

@asfgit commented Jul 31, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6434/
Test PASSed (JDK 8 and Scala 2.12).

@guozhangwang (Contributor) left a comment:

Follow-up comments that will be wrapped in a separate PR.

@@ -51,7 +52,7 @@
* @param <V> Type of values
* @see KTable
* @see KGroupedStream
* @see KStreamBuilder#stream(String...)
* @see org.apache.kafka.streams.StreamsBuilder#stream(String...)
Contributor:

We can just use the class name itself.


/**
* Create a {@link GlobalKTable} for the specified topic.
* The default {@link TimestampExtractor} as specified
Contributor:

We can call the static function of KStreamImpl directly and get rid of the additional function in InternalStreamsBuilder.



private static final String PROCESSOR_NAME = "KSTREAM-PROCESSOR-";

private static final String PRINTING_NAME = "KSTREAM-PRINTER-";

private static final String KEY_SELECT_NAME = "KSTREAM-KEY-SELECT-";

public static final String SINK_NAME = "KSTREAM-SINK-";
static final String SINK_NAME = "KSTREAM-SINK-";
Contributor:

Ideally we should make SOURCE_NAME package-private as well. I'll see if that's doable.

public void to(final Serde<K> keySerde, final Serde<V> valSerde, StreamPartitioner<? super K, ? super V> partitioner, final String topic) {
public void to(final Serde<K> keySerde,
final Serde<V> valSerde,
StreamPartitioner<? super K, ? super V> partitioner,
Contributor:

This is not introduced by this PR: we should not reassign the passed-in reference but just pass a different object, if needed, in addSink.
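The point above can be sketched as follows (names hypothetical; the real code substitutes a default partitioner in to()/addSink): rather than reassigning the parameter, derive a distinct local value so the substitution is explicit.

```java
public class PartitionerChoice {
    interface Partitioner {
        int partition(String key);
    }

    // Assumed stand-in for whatever default the real code falls back to.
    static final Partitioner DEFAULT = key -> Math.abs(key.hashCode()) % 4;

    // Reassigning the 'partitioner' parameter inside the caller would work,
    // but hides the mutation; returning a distinct value makes it explicit.
    static Partitioner effectivePartitioner(final Partitioner partitioner) {
        return partitioner != null ? partitioner : DEFAULT;
    }

    public static void main(final String[] args) {
        System.out.println(effectivePartitioner(null) == DEFAULT);
    }
}
```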

@@ -218,7 +218,7 @@ public Integer apply(final String aggKey, final Integer aggOne, final Integer ag
return aggOne + aggTwo;
}
}, SessionWindows.with(30), Serdes.Integer(), "session-store");
table.foreach(new ForeachAction<Windowed<String>, Integer>() {
table.toStream().foreach(new ForeachAction<Windowed<String>, Integer>() {
Contributor:

Good catch!


final KStreamBuilder builder = new KStreamBuilder();
builder.stream("topic").groupByKey().count("my-store");
final ProcessorTopology topology = builder.setApplicationId(applicationId).build(0);
final InternalStreamsBuilder builder = new InternalStreamsBuilder(new InternalTopologyBuilder());
Contributor:

Ditto, we can create the internal topology builder directly.

builder.setApplicationId(applicationId);

// TODO: we should refactor this to avoid usage of reflection
final Field internalTopologyBuilderField = internalStreamsBuilder.getClass().getDeclaredField("internalTopologyBuilder");
Contributor:

Ditto.

@@ -83,6 +87,58 @@ public KStreamTestDriver(final KStreamBuilder builder,
initTopology(topology, topology.stateStores());
}

public KStreamTestDriver(final StreamsBuilder builder) {
Contributor:

I think we should mark the above constructors as deprecated.
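The deprecation suggestion looks like this in miniature (class and constructor shapes are simplified stand-ins, not the real KStreamTestDriver): annotate the old KStreamBuilder-based constructors so callers get a compiler warning steering them to the StreamsBuilder-based one.

```java
public class MiniTestDriver {
    final String source;

    // Old KStreamBuilder-based path: kept for compatibility, marked deprecated.
    @Deprecated
    public MiniTestDriver(final Object oldStyleBuilder) {
        this.source = "deprecated";
    }

    // New StreamsBuilder-based path (parameter type simplified here).
    public MiniTestDriver(final String newStyleBuilder) {
        this.source = "current";
    }

    public static void main(final String[] args) {
        // Overload resolution picks the most specific (non-deprecated) constructor.
        System.out.println(new MiniTestDriver("builder").source);
    }
}
```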

@guozhangwang (Contributor) commented:

Merged to trunk.

@asfgit asfgit closed this in da22055 Jul 31, 2017
@mjsax mjsax deleted the kafka-5671-add-streamsbuilder branch June 5, 2018 23:50
5 participants