Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming Scala API #275

Closed
wants to merge 23 commits into from
Closed

Streaming Scala API #275

wants to merge 23 commits into from

Conversation

gyfora
Copy link
Contributor

@gyfora gyfora commented Dec 21, 2014

This PR contains the commits for the Scala api for Flink Streaming. Most functionality is already implemented and I think it is ready to merge after maybe some slight refactoring.

I would appreciate if someone with more scala experience could look it through, that code works but it might not be very pretty everywhere.

Also If anyone can please run a scala formatter on the code since it seems impossible to get line wrapping settings in eclipse so I had to break the lines manually to make travis pass.

What's missing:

  • Connected data stream functionality, I will add this soon. This is trivial.
  • Proper scala code-style formatting, see above
  • Some tests, since I hand tested everything so far
  • I am not sure about the package structure, any comments on how I did it?

Feel free to play around and test the features, I am sure that there will be bugs to fix :)

@gyfora gyfora force-pushed the scala-api branch 2 times, most recently from d54d226 to ca2af20 Compare December 21, 2014 12:10
@mbalassi
Copy link
Contributor

Looks great, @gyfora. The windowed join on named fields of case classes is simply beautiful.

I like the project structure, adding the streaming scala code under flink-scala. Another somewhat viable option could be to open a flink-streaming-scala project (and rename flink-streaming-core to f-s-java, or even have separate f-s-core and f-s-java projects).

Would like to merge it soon, because it not only includes the streaming scala API but some much needed refactor of our java API.

Also would like to try out the scala API by implementing the examples with a couple of volunteers. The streaming examples are fairly short and straightforward compared to the batch ones, so a handful of people is more than enough in that department.

@senorcarbone
Copy link
Contributor

+1
Nice job!
@marton I can also help with the scala examples, will have some free time
during holidays.

On Sun, Dec 21, 2014 at 5:45 PM, Márton Balassi notifications@github.com
wrote:

Looks great, @gyfora https://github.com/gyfora. The windowed join on
named fields of case classes is simply beautiful.

I like the project structure, adding the streaming scala code under
flink-scala. Another somewhat viable option could be to open a
flink-streaming-scala project (and rename flink-streaming-core to f-s-java,
or even have separate f-s-core and f-s-java projects).

Would like to merge it soon, because it not only includes the streaming
scala API but some much needed refactor of our java API.

Also would like to try out the scala API by implementing the examples with
a couple of volunteers. The streaming examples are fairly short and
straightforward compared to the batch ones, so a handful of people is more
than enough in that department.


Reply to this email directly or view it on GitHub
#275 (comment)
.

@gyfora
Copy link
Contributor Author

gyfora commented Dec 22, 2014

I suggest to merge this asap after the package structure and code formatting is sorted out. And we can add features as we go. As Marton said, I also refactored some parts of the java api and we would like to build on top of that.

@StephanEwen or @aljoscha can you guys please help me with this?

@aljoscha
Copy link
Contributor

Why does Streaming use it's own classes for field selectors? I.e. FieldsKeySelector, CaseClassKeySelector. Streaming also has it's own selector classes in the Java API.

@gyfora
Copy link
Contributor Author

gyfora commented Dec 22, 2014

The Java api FieldsKeySelector works on array, java tuple and pojo types as well, but it doesnt work on scala tuples.

The Scala FieldsKeySelector works on Java and Scala tuples, and the CaseClassKeySelector works on case classes.

@gyfora
Copy link
Contributor Author

gyfora commented Dec 22, 2014

It should be possible to put the two scala selectors together in a ScalaFieldsKeySelector but I think we need to have them in some form

@aljoscha
Copy link
Contributor

But in the batch API we don't have special KeySelectors, everything is handled uniformly in Keys.java. The field keys and expression keys support Java Tuples, Scala Tuples, Pojos and Case Classes.

@aljoscha
Copy link
Contributor

Also, as @mbalassi mentioned, we should really think about the directory structure now. Having Java Streaming in addons, Scala Streaming in flink-scala and the examples also scattered all over the place does not make things very manageable.

@aljoscha
Copy link
Contributor

I like the work, overall. It's just some nitpicking here and there. 😄

@gyfora
Copy link
Contributor Author

gyfora commented Dec 22, 2014

What I don't get is how to use the Keys to get the actual key (like I do in the selectors), maybe I am missing something trivial here. If I knew I would have used them trust me :)

@aljoscha
Copy link
Contributor

Yes, @rmetzger added some magic when he unified Pojo and Tuple keys. TypeComparator has a method extractKeys() that does exactly that. It extracts the key fields of a tuple or object and stores them in an array. I think all the infrastructure is there. You can just use Keys.java and create TypeComparators, those you can use for everything else. Maybe @rmetzger can have a look and make some suggestions for how this could best be implemented in the Streaming API.

@gyfora
Copy link
Contributor Author

gyfora commented Dec 22, 2014

You are right I also use the comparators for pojo types. I didnt use comparators for everythibg else because it was trivial to extract keys from tuples and arrays. But i guess its better to have the Keys approach I will implement it.

@aljoscha
Copy link
Contributor

Plus, you get all the support for nesting tuples, pojos and case classes that is already there. 😸

@fhueske
Copy link
Contributor

fhueske commented Dec 22, 2014

+1 for having a clear code structure. IMO, we should either move the whole flink-streaming module to the root folder or keep everything under flink-addons.
Having it as a module of the root folder would mark it as reasonably stable (API & runtime-wise).
I haven't touched the streaming part yet, so I cannot tell but would of course be fine with moving it if the streaming contributors feel it is ready ;-)

@rmetzger
Copy link
Contributor

I would suggest to use the Keys class also for Tuples.
If you have a Tuple3<Tuple2<String, Int>, Int, Int> and you select key 0 (which is a Tuple2), we internally handle this as keys[] = {0,1}.
If you really want to select only the String field of the Tuple2, you have to use string -field section "f0.f0" and you'll get 0.

If you just use the "trivial" extraction (= directly using the user input), you might cause unexpected behavior.

@gyfora
Copy link
Contributor Author

gyfora commented Dec 22, 2014

Okay I will fix this when I have some time. Today or tomorrow :)

@mbalassi
Copy link
Contributor

@aljoscha and @fhueske on the package structure:

We hope that the point in time when streaming can "graduate" from the flink-addons project is near, but to be honest some work is still needed there. If prioritized that that can be achived for the 0.9 release, if not it is safe to say that it can be done for the 1.0.

@gyfora
Copy link
Contributor Author

gyfora commented Dec 23, 2014

Okay as @aljoscha and @rmetzger suggested I reworked our grouping and removed the extra classes and substituted them with the use of Keys from the batch API, works well and looks good :)

The only thing I left in is the ArrayKeySelector which we use in the Streaming api to allow grouping and aggregating on array types as well.

Let's merge this soon because I am also pushing out a rework of the streaming sources for better parallelism handling and also on the connectors to other systems which builds on this PR.

@StephanEwen
Copy link
Contributor

Concerning the project structure: For the batch part, there are discussions
to combine the "flink-core", "flink-java" projects, possibly also the
"flink-scala" project. We are starting to see too many interdependencies.

May be good to keep in mind when you decide on a project structure.

We could, in the next version, go for something like

  • flink-core (core and batch, java & scala)
  • flink-streaming (java & scala)
  • flink-runtime
  • ...
    Am 23.12.2014 21:52 schrieb "Gyula Fora" notifications@github.com:

Okay as @aljoscha https://github.com/aljoscha and @rmetzger
https://github.com/rmetzger suggested I reworked our grouping and
removed the extra classes and substituted them with the use of Keys from
the batch API, works well and looks good :)

The only thing I left in is the ArrayKeySelector which we use in the
Streaming api to allow grouping and aggregating on array types as well.

Let's merge this soon because I am also pushing out a rework of the
streaming sources for better parallelism handling and also on the
connectors to other systems which builds on this PR.


Reply to this email directly or view it on GitHub
#275 (comment)
.

@gyfora gyfora force-pushed the scala-api branch 2 times, most recently from c8eb777 to d629cc3 Compare January 2, 2015 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants