Fix type inference on joins and aggregates on Scala API #5019

joan38 · 2018-05-15T15:33:05Z

Type inference currently doesn't work for the join functions in Scala, because the compiler doesn't yet know the types of the given KStream[K, V] or KTable[K, V] when it checks the joiner function.

The fix here is to curry the joiner function. I personally prefer this notation, but it also means the Scala API differs more from the Java API.
I believe the difference from the Java API is worth it in this case, as it not only solves the type inference but also fits the Scala way of coding better (e.g. fold).

Moreover, any Scala dev will get stuck and spend some time on these functions trying to understand why the type inference is not working, and then get frustrated at being obliged to be explicit where it's not harmful for the types to be inferred.

The change is fairly straightforward but is also breaking. The good news is that we haven't released the Scala API yet, so this is the perfect time to make this change.

This would also need a documentation update, which I'm happy to do if there is positive feedback on this.

Thanks

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

joan38 · 2018-05-15T15:38:43Z

@ijuma @debasishg @guozhangwang Let me know what you think about this.

debasishg · 2018-05-15T16:15:05Z

One of the points we tried to adhere to was to keep the diff with the Java API to a minimum. There may be more scope of such optimizations (or rather conciseness or making code more idiomatic Scala) in the Scala API, which we intentionally didn't do.

And personally I am not sure which version is more readable. The one with type inference is concise no doubt, but very often I find myself struggling to see what are the types of the parameters to the lambda.

// without type inference
.leftJoin(userRegionsTable, (clicks: Long, region: String) => (if (region == null) "UNKNOWN" else region, clicks))

// type inferred
.leftJoin(userRegionsTable)((clicks, region) => (if (region == null) "UNKNOWN" else region, clicks))

On the whole I am +0 on this change, wouldn't mind, if done. But I would leave it to @guozhangwang and @ijuma for the final call.

joan38 · 2018-05-15T16:25:15Z

Indeed, there are cases where explicit types are better and other cases where they are too much information.
The question here is: should the API restrict this choice and not give developers the liberty they are used to in the Scala collections or even in the other Kafka Streams APIs?

Apart from this question, it took me some time to understand why my code wasn't compiling, as all the params matched the documentation. I had to add all the types explicitly to finally understand what was going on, and then work out which ones I could remove to leave only the required ones.
This user experience (maybe isolated? maybe not?) is not great IMHO, and forcing explicit typing feels more like a "bug" than a feature.

joan38 · 2018-05-16T22:01:55Z

I added the change for aggregate too. Let's see what you guys think @guozhangwang @ijuma

guozhangwang · 2018-05-17T15:48:09Z

...ams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KGroupedStream.scala

-    aggregator: (K, V, VR) => VR,
-    materialized: Materialized[K, VR, ByteArrayKeyValueStore]): KTable[K, VR] =
-    inner.aggregate(initializer.asInitializer, aggregator.asAggregator, materialized)
+  def aggregate[VR](initializer: => VR)(aggregator: (K, V, VR) => VR,


This is a meta comment: why do we want to separate the initializer from the other parameters in all these places?

@guozhangwang - As I mentioned in my comment #5019 (comment), this has been done to help the Scala compiler do better type inference. The Scala compiler infers types from left to right, one parameter group at a time. So if you place the initializer in a separate group, you get better type inference when specifying the initializer at the call site. There are a few areas where we can do this for better type inference. I have some reservations about this, as I mentioned in the comment I linked earlier. Maybe you or @ijuma can make the call on this.

Look at the signature of foldLeft in Scala:

def foldLeft[B](z: B)(op: (B, A) => B): B

The zero (initializer) is curried away from the aggregator function.

Many think this is a style choice, but it's not. If we implemented it with a single parameter group as:

def foldLeft[B](z: B, op: (B, A) => B): B

The type parameter B would not be "fixed" by the type inference before we get to the aggregator function.
With currying, you apply the first parameter group first, so that the compiler knows which B we are talking about, and then you apply the function.

Exactly the same happens here with aggregate: the type inference is not able to work out from the initializer what the types of the function will be, and therefore asks you to write all the types explicitly. Forcing such a thing is not very common in Scala APIs.
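
To make the point above concrete, here is a minimal sketch of the two signature shapes. `Grouped`, `aggregateFlat` and the sample data are hypothetical stand-ins invented for illustration, not the Kafka Streams classes. The single-list call spells out the lambda's parameter types, which is what Scala 2 generally requires here; the curried call leaves them to the compiler.

```scala
object CurriedInference {
  // Hypothetical stand-in for a grouped stream; not the Kafka Streams class.
  final case class Grouped[K, V](records: List[(K, V)]) {

    // Single parameter list: VR is not yet fixed when the lambda is typed,
    // so callers usually have to annotate the lambda's parameters.
    def aggregateFlat[VR](initializer: VR, aggregator: (K, V, VR) => VR): Map[K, VR] =
      records.foldLeft(Map.empty[K, VR]) { case (acc, (k, v)) =>
        acc.updated(k, aggregator(k, v, acc.getOrElse(k, initializer)))
      }

    // Curried: VR is fixed by the first parameter list, so the lambda's
    // parameter types can be inferred in the second one.
    def aggregate[VR](initializer: => VR)(aggregator: (K, V, VR) => VR): Map[K, VR] =
      aggregateFlat[VR](initializer, aggregator)
  }

  val clicks: Grouped[String, Long] =
    Grouped(List("alice" -> 3L, "bob" -> 5L, "alice" -> 4L))

  // Explicit lambda parameter types.
  val flat: Map[String, Long] =
    clicks.aggregateFlat(0L, (key: String, v: Long, total: Long) => total + v)

  // Lambda parameter types inferred.
  val curried: Map[String, Long] =
    clicks.aggregate(0L)((_, v, total) => total + v)
}
```

Both calls produce the same per-key totals; the only difference is how much the caller has to write.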

guozhangwang · 2018-05-17T15:49:27Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KStream.scala

-    windows: JoinWindows)(implicit joined: Joined[K, V, VO]): KStream[K, VR] =
-      inner.join[VO, VR](otherStream.inner, joiner.asValueJoiner, windows, joined)
+   */
+  def join[VO, VR](otherStream: KStream[K, VO], windows: JoinWindows)(


Note in Java API we have joiner before the windows. Any specific reasons to switch the ordering here?

Same as my last comment ..

This could also be:

def join[VO, VR](otherStream: KStream[K, VO])(joiner: (V, VO) => VR, windows: JoinWindows)

And in fact now that I wrote it I think it would be better to match the Java API.

This seems different from what you proposed:

def join[VO, VR](otherStream: KStream[K, VO])(joiner: (V, VO) => VR, windows: JoinWindows)

guozhangwang · 2018-05-17T15:49:57Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KStream.scala

-    joiner: (V, VT) => VR)(implicit joined: Joined[K, V, VT]): KStream[K, VR] =
-      inner.join[VT, VR](table.inner, joiner.asValueJoiner, joined)
+   */
+  def join[VT, VR](table: KTable[K, VT])(joiner: (V, VT) => VR)(implicit joined: Joined[K, V, VT]): KStream[K, VR] =


Similarly, not sure what's the rationale of the refactoring here.

Same as my last comment ..

guozhangwang · 2018-05-17T15:50:54Z

streams/streams-scala/src/test/scala/org/apache/kafka/streams/scala/TopologyTest.scala

@@ -142,7 +142,7 @@ class TopologyTest extends JUnitSuite {

      val clicksPerRegion: KTable[String, Long] =
        userClicksStream
-          .leftJoin(userRegionsTable, (clicks: Long, region: String) => (if (region == null) "UNKNOWN" else region, clicks))
+          .leftJoin(userRegionsTable)((clicks, region) => (if (region == null) "UNKNOWN" else region, clicks))


As @debasishg mentioned, we want to keep the Scala API as consistent as possible with the Java API. Are there specific reasons for the changes here?

See my comment above ..

Note the ability to remove (or keep) the types.
Also, this doesn't stray too far from the Java API, since it's just about currying parameters, and it has the benefit of bringing the API closer to the Scala "way of doing".

guozhangwang · 2018-05-17T17:37:32Z

@joan38 @debasishg Thanks for your detailed explanations in follow-up, now I got it finally.

I think I'm +0.5 on the proposed changes, except that for stream-stream windowed joins I'd prefer join[VO, VR](otherStream: KStream[K, VO])(joiner: .. I think such type inference is worthwhile to have, at the cost of a bit more difference from the Java API.

joan38 · 2018-05-17T17:59:31Z

Thanks for your thoughts @guozhangwang.
I just pushed again, bringing back the original parameter order:

def join[VO, VR](otherStream: KStream[K, VO])(joiner: (V, VO) => VR, windows: JoinWindows)

This essentially reduces the change to simple currying, without reordering the parameters relative to the Java API.

guozhangwang

@joan38 thanks. Left some more comments.

guozhangwang · 2018-05-18T00:03:59Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KStream.scala

-    windows: JoinWindows)(implicit joined: Joined[K, V, VO]): KStream[K, VR] =
-      inner.join[VO, VR](otherStream.inner, joiner.asValueJoiner, windows, joined)
+   */
+  def join[VO, VR](otherStream: KStream[K, VO], windows: JoinWindows)(


This seems different from what you proposed:

def join[VO, VR](otherStream: KStream[K, VO])(joiner: (V, VO) => VR, windows: JoinWindows)

guozhangwang · 2018-05-18T00:04:21Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KStream.scala

-    windows: JoinWindows)(implicit joined: Joined[K, V, VO]): KStream[K, VR] =
-      inner.leftJoin[VO, VR](otherStream.inner, joiner.asValueJoiner, windows, joined)
+  def leftJoin[VO, VR](otherStream: KStream[K, VO], windows: JoinWindows)(
+    joiner: (V, VO) => VR


Ditto here.

Indeed I forgot this one

Hmm... this seems still not correct to me? Should it be

(otherStream: KStream[K, VO])(joiner: (V, VO) => VR, windows: JoinWindows)

Now it's good

guozhangwang · 2018-05-18T00:05:52Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KTable.scala

-  def outerJoin[VO, VR](other: KTable[K, VO],
-    joiner: (V, VO) => VR,
-    materialized: Materialized[K, VR, ByteArrayKeyValueStore]): KTable[K, VR] =
+  def outerJoin[VO, VR](other: KTable[K, VO], materialized: Materialized[K, VR, ByteArrayKeyValueStore])(


Ditto here? Also the parameters are re-ordered.

guozhangwang · 2018-05-18T00:05:57Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KTable.scala

-  def leftJoin[VO, VR](other: KTable[K, VO],
-    joiner: (V, VO) => VR,
-    materialized: Materialized[K, VR, ByteArrayKeyValueStore]): KTable[K, VR] =
+  def leftJoin[VO, VR](other: KTable[K, VO], materialized: Materialized[K, VR, ByteArrayKeyValueStore])(


Ditto here? Also the parameters are re-ordered.

Indeed I forgot this one

guozhangwang · 2018-05-18T00:06:03Z

streams/streams-scala/src/main/scala/org/apache/kafka/streams/scala/kstream/KTable.scala

-  def join[VO, VR](other: KTable[K, VO],
-    joiner: (V, VO) => VR,
-    materialized: Materialized[K, VR, ByteArrayKeyValueStore]): KTable[K, VR] =
+  def join[VO, VR](other: KTable[K, VO], materialized: Materialized[K, VR, ByteArrayKeyValueStore])(


Ditto here? Also the parameters are re-ordered.

Indeed I forgot this one

joan38 · 2018-05-18T08:54:34Z

Addressed the comments.
Please note also that the function () => initializer becomes a call by name => initializer.
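
As a rough illustration of that note, the sketch below contrasts an explicit zero-argument function with a by-name parameter. The names `withThunk`, `withByName` and `nextInitial` are hypothetical and exist only for this example; the counter shows that both forms defer evaluation until the method body uses the value.

```scala
object ByNameInitializer {
  // Counts how often the initial value is computed.
  var evaluations: Int = 0

  def nextInitial(): Long = {
    evaluations += 1
    0L
  }

  // Before: the caller wraps the value in an explicit zero-argument function.
  def withThunk[VR](initializer: () => VR): VR = initializer()

  // After: a by-name parameter; the caller passes a plain expression,
  // which is evaluated each time `initializer` is referenced in the body.
  def withByName[VR](initializer: => VR): VR = initializer

  val fromThunk: Long = withThunk(() => nextInitial())
  val fromByName: Long = withByName(nextInitial())
}
```

At the call site the by-name form reads like passing a value, while still being evaluated lazily inside the method.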

ijuma · 2018-05-18T20:34:56Z

My general opinion is that it makes sense to help the inferencer if it can be done in a reasonably consistent manner in the Scala API and if the resulting code doesn't become harder to read (i.e. if you provided the types before/after, the after case should not be harder to read, ideally).

ijuma · 2018-05-18T20:35:55Z

I'm not too concerned about the APIs looking a bit different from the Java APIs; it's more important to make the Scala API as good as it can be while maintaining the general style (IMO).

ijuma · 2018-05-18T20:44:27Z

...ala/org/apache/kafka/streams/scala/StreamToTableJoinScalaIntegrationTestImplicitSerdes.scala

@@ -88,7 +88,7 @@ class StreamToTableJoinScalaIntegrationTestImplicitSerdes extends JUnitSuite
      userClicksStream

        // Join the stream against the table.
-        .leftJoin(userRegionsTable, (clicks: Long, region: String) => (if (region == null) "UNKNOWN" else region, clicks))
+        .leftJoin(userRegionsTable)((clicks, region) => (if (region == null) "UNKNOWN" else region, clicks))


Btw, why don't we just use Option(region).getOrElse("UNKNOWN")?

No idea but I can change this.

@ijuma It was taken from a Java example and hence was not changed. Thought by not doing Option we can save an allocation :-)
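
For reference, a small sketch of the two null-handling styles discussed here. The helper names `withNullCheck` and `withOption` are made up for this example. `Option(x)` yields `None` when `x` is null, so both helpers return the same result; the `Option` version allocates a wrapper for non-null values, which is the cost mentioned above.

```scala
object NullSafeRegion {
  // Style taken over from the Java example: explicit null check.
  def withNullCheck(region: String): String =
    if (region == null) "UNKNOWN" else region

  // Suggested alternative: Option(null) is None, so getOrElse supplies the default.
  def withOption(region: String): String =
    Option(region).getOrElse("UNKNOWN")
}
```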

joan38 · 2018-05-19T09:34:48Z

@debasishg @ijuma @guozhangwang what's the general opinion on this? Should we go forward?

guozhangwang · 2018-05-20T01:38:34Z

Please note also that the function () => initializer becomes a call by name => initializer.

That looks fine to me.

guozhangwang · 2018-05-20T01:40:02Z

@joan38 Seems people are not against proceeding with this change. I am retriggering another Jenkins test and will merge after it is fixed.

guozhangwang · 2018-05-20T01:40:08Z

retest this please

guozhangwang · 2018-05-20T17:24:41Z

retest this please

joan38 · 2018-05-20T20:14:26Z

All good @guozhangwang 👍

guozhangwang · 2018-05-20T23:25:41Z

Merged to trunk. @joan38 as you mentioned, are there further documentation updates that you'd want to make?

Type inference currently doesn't work for the join functions in Scala, because the compiler doesn't yet know the types of the given KStream[K, V] or KTable[K, V]. The fix here is to curry the joiner function. I personally prefer this notation, but it also means the Scala API differs more from the Java API. I believe the difference from the Java API is worth it in this case, as it not only solves the type inference but also fits the Scala way of coding better (e.g. fold). Moreover, any Scala dev will get stuck and spend some time on these functions trying to understand why the type inference is not working, and then get frustrated at being obliged to be explicit where it's not harmful for the types to be inferred. Reviewers: Debasish Ghosh <dghosh@acm.org>, Guozhang Wang <guozhang@confluent.io>, Ismael Juma <ismael@juma.me.uk>

mowczare · 2018-08-13T19:16:42Z

Excuse me guys, how could this pull request get merged and released in version 2.0.0 without any tests?
And it's been 3 months already, I hope I am just blind, can't use search and there are already some issues pointing it out, but still, if not and I am the only user of Kafka Streams Scala API in the world...

Let's take a look at your changes:
file KTable.scala:

def join[VO, VR](other: KTable[K, VO])(joiner: (V, VO) => VR): KTable[K, VR]
def join[VO, VR](other: KTable[K, VO])(
   joiner: (V, VO) => VR,
   materialized: Materialized[K, VR, ByteArrayKeyValueStore]
): KTable[K, VR]

Currying is a nice feature of Scala indeed. Let's use it, shall we?

join(kTable)(joiner)
error: ambiguous reference to overloaded definition

Wait. Oh, maybe somehow I need to pass materializer:

join(kTable)(joiner, materializer)
error: ambiguous reference to overloaded definition

Wait... Did you just create an interface that cannot be used in any way other than reflection? As a user, I don't really want to use scala.reflect.runtime.universe.runtimeMirror(getClass.getClassLoader) every time I need to join my KTable.

To be more generic, if you have:

abstract class B {
  def a(a: String)(b: Double, c: String): Unit
  def a(a: String)(b: Double): Unit
}
val b: B

Then you cannot call:

b.a("niceFunction")
b.a("niceFunction")(69.0, "wowSuchCurrying")
b.a("lolThisDoesNotWorkEither")(42.0)

because every time the compiler will fail with:
error: ambiguous reference to overloaded definition
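
One way around this kind of ambiguity, sketched here under assumptions and not necessarily what any follow-up fix does, is to make the overloads differ in their first parameter list, since that list is what overload resolution looks at. `Table` and the `storeName` parameter are hypothetical stand-ins for illustration only.

```scala
object OverloadSketch {
  // Hypothetical stand-in for a table type; not the Kafka Streams KTable.
  final class Table[V](val values: List[V]) {

    // Overload 1: the first parameter list takes one argument.
    def join[VO, VR](other: Table[VO])(joiner: (V, VO) => VR): List[VR] =
      values.zip(other.values).map { case (v, vo) => joiner(v, vo) }

    // Overload 2: the extra argument lives in the FIRST parameter list,
    // so the compiler can tell the two overloads apart at the call site.
    def join[VO, VR](other: Table[VO], storeName: String)(
      joiner: (V, VO) => VR
    ): (String, List[VR]) =
      (storeName, join(other)(joiner))
  }

  val left = new Table(List(1, 2))
  val right = new Table(List("a", "b"))

  // Both calls resolve, and the lambda parameter types are still inferred.
  val plain: List[String] = left.join(right)((n, s) => s * n)
  val named: (String, List[String]) = left.join(right, "store")((n, s) => s"$n$s")
}
```

The joiner stays in its own curried list, so the type-inference benefit of the original change is kept.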

Even if I'm wrong and there is some other magical way of calling those methods, it is disappointing that I spent 2 hours figuring out how to use your new interfaces and the only solution I came up with was reflection.

Tomorrow I will open a pull request reverting to the old Lightbend interfaces; until then I'm waiting for you to prove me wrong.

@joan38
@debasishg
@ijuma
@guozhangwang

joan38 · 2018-08-13T19:31:45Z

Hi @mowczare,

Thanks for taking the time to find the original PR and reporting this issue here.
I haven't even had time to look at the newly released Kafka 2.0.0 myself since it came out 😄. Let me have a look at it.

mowczare · 2018-08-13T19:47:31Z

I see, time is a precious thing, indeed. One shall not simply waste it for tests I guess.

joan38 · 2018-08-13T20:13:09Z

@mowczare I know, right? I live in London, it's 21h, I had a long day at work, and I spend my evenings helping the Scala community on various open source projects, all of that for free, just because I'm passionate about it.
I just find it scandalous that I made a mistake by not spending more time on tests, you are right.

EDIT: I truly agree with you, but maybe you should try to be a bit nicer in your messages.

mowczare · 2018-08-13T20:15:53Z

It's not your fault obviously, no need to take it personally. I still hope that it works and it's just me and my lack of experience. However, I'm really concerned, if I'm right, how this could go on 2.0.0 without proper testing.

joan38 · 2018-08-13T20:21:27Z

Unfortunately, I think you are right both that it's broken and that it got out without testing.
I'm surprised the compiler is OK with this code.

I will raise a PR now to fix this (with tests 😉).
Meanwhile, I can only see a monkey patch or a revert to Lightbend's interfaces, unless you have other ideas?
I'll keep you updated.

mowczare · 2018-08-13T20:28:23Z

Wonderful! Your time will be much appreciated and hopefully, a 2.0.1 version will be up soon.
As for the compiler, this has been a known issue for 9 years, thanks to Mr. Odersky:
https://issues.scala-lang.org/browse/SI-2628

ijuma · 2018-08-18T17:59:02Z

Thanks for submitting a fix so quickly @joan38. It's true that there was a gap in testing here and it's unfortunate that it was not noticed before the PR was merged and the release was published. We should strive to do better in the future. Irrespective of that, we are thankful that @joan38 has spent his personal time to improve the usability of the Scala API for everyone else.

@mowczare

Join in the Scala streams API is currently unusable in 2.0.0, as reported by @mowczare: #5019 (comment). This is due to an overload of it with the same signature in the first curried parameter list. See the compiler issue that explains why it wasn't caught: https://issues.scala-lang.org/browse/SI-2628 Reviewers: Debasish Ghosh <dghosh@acm.org>, Guozhang Wang <guozhang@confluent.io>, John Roesler <john@confluent.io>


joan38 force-pushed the joins branch from 37d4a17 to 09110de Compare May 15, 2018 15:43

joan38 changed the title ~~Fix type inference on joins for Scala~~ Fix type inference on joins on Scala API May 15, 2018

joan38 force-pushed the joins branch from 09110de to dd6c450 Compare May 16, 2018 17:30

guozhangwang reviewed May 17, 2018


joan38 force-pushed the joins branch from dd6c450 to fb69775 Compare May 17, 2018 17:54

guozhangwang reviewed May 18, 2018


guozhangwang mentioned this pull request May 18, 2018

KAFKA-6849: add transformValues methods to KTable. #4959

Merged

joan38 force-pushed the joins branch 2 times, most recently from 0c02520 to f964ed0 Compare May 18, 2018 08:53

Fix type inference on joins and aggregates

0d55f3b

joan38 force-pushed the joins branch from f964ed0 to 0d55f3b Compare May 18, 2018 17:20

ijuma reviewed May 18, 2018


guozhangwang merged commit 96cda0e into apache:trunk May 20, 2018

joan38 changed the title ~~Fix type inference on joins on Scala API~~ Fix type inference on joins and aggregates on Scala API May 21, 2018

joan38 mentioned this pull request Aug 14, 2018

KAFKA-7301: Fix streams Scala join ambiguous overload #5502

Merged

mjsax added the streams label Aug 21, 2018
