KAFKA-6936: Implicit Materialized for aggregates #5066

joan38 · 2018-05-22T20:48:04Z

In #4919 we propagate the SerDes for each of these aggregation operators.

As @guozhangwang mentioned in that PR:

reduce: inherit the key and value serdes from the parent XXImpl class.
count: inherit the key serdes, enforce setting the Serdes.Long() for value serdes.
aggregate: inherit the key serdes, do not set for value serdes internally.

Although it's all good for reduce and count, it is quite unsafe to have aggregate without a Materialized given. In fact I don't see why we would not give a Materialized for the aggregate since the result type will always be different (otherwise use reduce) and also the value Serde is simply not propagated.

This has been discussed previously in a broader PR, but I believe for aggregate we could implicitly pass a Materialized the same way we pass a Joined, just to avoid the stupid case. Then if the user wants to specialize, they can give their own Materialized.
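A minimal sketch of the idea, assuming toy stand-ins rather than the real Kafka Streams classes: `Serde`, `Materialized`, `materializedFromSerde` and the `Map`-based `aggregate` below are all hypothetical and only illustrate how an implicit `Materialized` could be derived from the serdes in scope once the result type is known.

```scala
object ImplicitMaterializedSketch {
  // Hypothetical stand-in for a serde; NOT the Kafka Serde interface.
  trait Serde[T] { def name: String }

  object Serde {
    // Placed in the companion so they are found without an extra import.
    implicit val stringSerde: Serde[String] = new Serde[String] { val name = "string" }
    implicit val longSerde: Serde[Long] = new Serde[Long] { val name = "long" }
  }

  // Hypothetical stand-in for Materialized; it only records the two serdes.
  final case class Materialized[K, V](keySerde: Serde[K], valueSerde: Serde[V])

  object Materialized {
    // Derives a Materialized from whichever key and value serdes are implicitly available.
    implicit def materializedFromSerde[K, V](implicit k: Serde[K], v: Serde[V]): Materialized[K, V] =
      Materialized(k, v)
  }

  // Toy aggregate over an in-memory Map. The result type VR is fixed by `init`,
  // so the compiler looks for a Materialized[K, VR], i.e. a serde for the NEW value type.
  def aggregate[K, V, VR](input: Map[K, Seq[V]])(init: => VR)(agg: (K, V, VR) => VR)(
      implicit materialized: Materialized[K, VR]): (Map[K, VR], Materialized[K, VR]) = {
    val out = input.map { case (k, vs) => k -> vs.foldLeft(init)((acc, v) => agg(k, v, acc)) }
    (out, materialized)
  }
}
```

In this sketch, aggregating `String` values into a `Long` picks up the `Long` serde automatically, and a caller who wants something else can still pass an explicit `Materialized` in the last parameter list.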

@guozhangwang @debasishg @seglo Let me know your thoughts.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

debasishg · 2018-05-22T21:12:06Z

Regarding the following ..

aggregate: inherit the key serdes, do not set for value serdes internally.

For the aggregate method that does not take a Materialized, can't we propagate the value serde as well? If not, then this PR makes sense to me.

joan38 · 2018-05-22T21:39:33Z

@debasishg How can you propagate the Serde if the resulting type of the aggregate is not the same as the previous one? This would propagate the wrong Serde.
reduce is safe since it enforces the same return type as the inputs and count is a Long.

guozhangwang · 2018-05-24T03:54:22Z

@debasishg We are removing the deprecated APIs in Java that only take a serde as a parameter, so now the aggregate operators only have two overloaded variants: with and without a Materialized.

@joan38 Currently there is no semantic difference between the two overloaded functions with and without the Materialized, since even in the latter case our internal implementation will still create a materialized store, which is just not exposed to users for queries. But moving forward we have plans to optimize the topology such that in some cases, for example in aggregations when a Materialized is not passed in, we will not create a materialized store. I think by that time we could think about other ways to get around this issue, so I'm good with this change.

guozhangwang · 2018-05-24T03:42:36Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/ImplicitConversions.scala

-  TimeWindowedKStream => TimeWindowedKStreamJ,
-  KGroupedTable => KGroupedTableJ, _}
-
+import org.apache.kafka.streams.kstream.{KGroupedStream => KGroupedStreamJ, KGroupedTable => KGroupedTableJ, KStream => KStreamJ, KTable => KTableJ, SessionWindowedKStream => SessionWindowedKStreamJ, TimeWindowedKStream => TimeWindowedKStreamJ, _}


Is it intentional to put them on a single line? Multiple lines are more human-readable.

No, this change has been introduced by my IDE and doesn't make sense. I will revert it.

guozhangwang

This is a meta comment: to keep the aggregate, reduce and count APIs consistent, I'd still prefer to either maintain two overloads for each or one for each, instead of two for reduce and count and one for aggregate. Personally I'd prefer two for each, as it will help in the future when we optimize away physical materialization, but I'm not sure if we can disambiguate aggregate(..)(implicit Materialized) from aggregate(..). Is it possible?

On the other hand, we could consider giving reduce and count only one overload each as well. E.g. in KGroupedTable:

def count(implicit materialized: Materialized[K, Long, ByteArrayKeyValueStore]): KTable[K, Long]

def reduce(adder: (V, V) => V)(subtractor: (V, V) => V)(implicit materialized: Materialized[K, V, ByteArrayKeyValueStore]): KTable[K, V]

Note for reduce I used the same pattern as in the other PR, trying to leverage type inference, but again I'm not 100% sure if it would work.

@debasishg @joan38 wdyt?

debasishg · 2018-05-25T06:26:52Z

@guozhangwang I agree in principle. I am on a vacation right now and don't have the infrastructure to code. It would be good if @joan38 could verify your concerns.

joan38 · 2018-05-25T08:28:48Z

@guozhangwang The first approach is what I tried; unfortunately the compiler is not able to disambiguate the two.

I think the other option, unifying reduce and count with aggregate, is a good idea.
I will do the change today.

joan38 · 2018-05-27T22:50:39Z

Should we do count instead of count()?

joan38 · 2018-05-30T07:18:50Z

@guozhangwang @debasishg Any news on this?

guozhangwang · 2018-05-30T18:02:55Z

Sorry for the late reply.

If we cannot disambiguate the two overloaded functions for count, then let's make the other two, aggregate and reduce, also have one function only.

I'd prefer count() over count since, by Scala convention, if a method has any side effects internally we should keep the parentheses. In our case, this API internally adds a stateful processor node to the topology, so count() would be preferred.
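To illustrate the convention, here is a small sketch with a hypothetical toy `Topology` class (not the Kafka Streams one): the pure accessor is declared without parentheses, while the method that mutates state is declared and called with them.

```scala
import scala.collection.mutable.ListBuffer

object ParensConventionSketch {
  // Hypothetical toy topology that only tracks node names.
  final class Topology {
    private val nodes = ListBuffer.empty[String]

    // Pure accessor: no side effects, so no parentheses.
    def nodeNames: List[String] = nodes.toList

    // Side-effecting: registers a new node, so it keeps the parentheses.
    def count(): Topology = {
      nodes += "count-processor"
      this
    }
  }
}
```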

joan38 · 2018-05-30T18:06:39Z

Ok so the first point is done.

I will revert back to count() since there is a side effect.

Thanks

guozhangwang · 2018-05-30T20:57:10Z

Note the jenkins failures are relevant:

07:20:40 /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk8-scala2.11/streams/streams-scala/src/test/scala/org/apache/kafka/streams/scala/TopologyTest.scala:94: not enough arguments for method count: (implicit materialized: org.apache.kafka.streams.kstream.Materialized[String,Long,org.apache.kafka.streams.scala.ByteArrayKeyValueStore])org.apache.kafka.streams.scala.kstream.KTable[String,Long].
07:20:40 Unspecified value parameter materialized.
07:20:40           .count()
07:20:40                 ^
07:20:40 /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk8-scala2.11/streams/streams-scala/src/test/scala/org/apache/kafka/streams/scala/WordCountTest.scala:122: not enough arguments for method count: (implicit materialized: org.apache.kafka.streams.kstream.Materialized[String,Long,org.apache.kafka.streams.scala.ByteArrayKeyValueStore])org.apache.kafka.streams.scala.kstream.KTable[String,Long].
07:20:40 Unspecified value parameter materialized.
07:20:40         .count()(Materialized.as("word-count"))
07:20:40               ^
07:20:40 two errors found
07:20:40
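For context, this is the error Scala 2 reports when a method declared with only an implicit parameter list is called with empty parentheses: the `()` is read as an explicit, empty argument list for the implicit parameter. A minimal sketch of the difference, using a hypothetical stand-in `Mat` type rather than the real `Materialized`; declaring an empty explicit parameter list before the implicit one is one way to make `count()` compile:

```scala
object CountParensSketch {
  // Hypothetical stand-in for Materialized; it only carries a store name.
  final case class Mat(storeName: String)

  object Mat {
    // Default instance used when the caller passes nothing explicitly.
    implicit val default: Mat = Mat("default-store")
  }

  class Grouped {
    // Only an implicit parameter list. In Scala 2, calling `countNoParens()`
    // fails with "not enough arguments", so it must be called without parentheses.
    def countNoParens(implicit materialized: Mat): String = materialized.storeName

    // Empty explicit list followed by the implicit list: `count()` compiles,
    // and `count()(Mat("x"))` overrides the default.
    def count()(implicit materialized: Mat): String = materialized.storeName
  }
}
```

With the second shape, both `count()` and `count()(Mat("word-count"))` are accepted, which matches the two call styles that appear in the failing tests.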

joan38 · 2018-05-31T16:13:56Z

@guozhangwang The errors are fixed now. This should be ready to go I guess?

guozhangwang

One nit comment, otherwise LGTM.

guozhangwang · 2018-05-31T22:06:43Z

streams/streams-scala/src/test/scala/org/apache/kafka/streams/scala/WordCountTest.scala

-        .groupBy((k, v) => v)
-        .count()
+        .groupBy((_, v) => v)
+        .count


We should call count() here?

Arf, missed this one. Thanks for picking it up.

joan38 · 2018-06-01T08:10:02Z

Thanks @guozhangwang

@guozhangwang

…ache#5066) In apache#4919 we propagate the SerDes for each of these aggregation operators. As @guozhangwang mentioned in that PR: ``` reduce: inherit the key and value serdes from the parent XXImpl class. count: inherit the key serdes, enforce setting the Serdes.Long() for value serdes. aggregate: inherit the key serdes, do not set for value serdes internally. ``` Although it's all good for reduce and count, it is quiet unsafe to have aggregate without Materialized given. In fact I don't see why we would not give a Materialized for the aggregate since the result type will always be different (otherwise use reduce) and also the value Serde is simply not propagated. This has been discussed previously in a broader PR before but I believe for aggregate we could pass implicitly a Materialized the same way we pass a Joined, just to avoid the stupid case. Then if the user wants to specialize, he can give his own Materialized. Reviewers: Debasish Ghosh <dghosh@acm.org>, Guozhang Wang <guozhang@confluent.io>

joan38 changed the title ~~Implicit materialized for aggregates~~ Implicit Materialized for aggregates May 22, 2018

joan38 force-pushed the materialized branch from 6dd567a to 6e151b3 Compare May 22, 2018 21:53

guozhangwang reviewed May 24, 2018

Danny02 mentioned this pull request May 24, 2018

Kafka 6936 - Use implicit serializer for KGroupedStream aggregate #5073

Closed

3 tasks

joan38 force-pushed the materialized branch 2 times, most recently from c90332e to 1cd7718 Compare May 24, 2018 13:37

guozhangwang changed the title ~~Implicit Materialized for aggregates~~ KAFKA-6936: Implicit Materialized for aggregates May 24, 2018

guozhangwang reviewed May 24, 2018

joan38 force-pushed the materialized branch 2 times, most recently from c95686a to fa72082 Compare May 25, 2018 21:11

joan38 force-pushed the materialized branch 2 times, most recently from df38a90 to 7f7be17 Compare May 30, 2018 07:17

joan38 force-pushed the materialized branch from 7f7be17 to e03dc56 Compare May 30, 2018 22:09

guozhangwang reviewed May 31, 2018

Implicit materialized for aggregate, count and reduce

1bbeba4

joan38 force-pushed the materialized branch from e03dc56 to 1bbeba4 Compare May 31, 2018 23:41

guozhangwang merged commit ad56f04 into apache:trunk Jun 1, 2018

joan38 deleted the materialized branch August 13, 2018 23:38

joan38 mentioned this pull request Aug 17, 2018

KAFKA-7301: Fix streams Scala join ambiguous overload #5502

Merged
