New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-2576] Add Outer Join operator to Optimizer and APIs #1138
Conversation
@jkovacs and @r-pogalz, thank you very much for this PR and the detailed description! Nonetheless, we can start a discussion about the handling of projection for outer joins. By changing the type information to |
…ion and optimization
b66b1b0
to
da30f62
Compare
Thanks @fhueske, that's a good point I haven't considered. Another idea that occurred to me was to convert the result tuple types to I pushed this change as jkovacs@f682baa to a different branch to test it Also rebased branch onto current master and resolved conflicts (Failing test is some YARN integration test). |
To partly answer my own question: One big drawback of downgrading the tuple field types to One obvious way to work around this is to only downgrade the fields that are actually nullable, and keep the original types of the definitely non-null fields (i.e. the types from the outer side of a left or right outer join). This way the user can still group/join/sort efficiently on the non-null fields, while preserving null safety for the other fields. I pushed another commit for this to my temporary branch for review, if this makes sense: jkovacs/flink@feature/FLINK-2576...jkovacs:feature/FLINK-2576-projection-types As you can see I was really hoping to make the projection joins work properly :-) but if you feel that the effort isn't worth it or I'm missing something else entirely, we can for sure simply scrap that and throw an |
da30f62
to
45b515a
Compare
Hi @jkovacs, thanks for all your efforts to make the projection work. Going for a |
Agreed with Fabian. For now, let's require join functions. Future work would be to use Tuples with Options in Scala. In Java, we should probably add an option type as well (and teach the TypeExtractor to use them). Java core adds an Option type only in Java 8, unfortunately. We could add one for Java 7 and deprecate it later. |
//no more elements | ||
return false; | ||
} | ||
} else if (currLeftIterator.hasNext() && !currRightIterator.hasNext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might yield NPE, if currLeftIterator == null && currRightIterator != null && currRightIterator.hasNext()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically true, but I believe from the control flow that scenario is impossible, since either both iterators get reassigned something non-null at the same time, or both remain null and the method returns false
(no more elements). @r-pogalz can you confirm that or can we make this more explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkovacs is right, it is not possible that currLeftIterator
or currRightIterator
are null at this point, as they are just wrappers around the subsets and always get assigned. In case that a subset is null and hasNext()
is called, the wrapper will return false.
There are also tests in OuterJoinOperatorBaseTest
which cover the cases where one side of the outer join is empty.
We have a couple of unit tests to check the correctness of the API, i.e., check that valid use is working and invalid use throws an early exceptions. See for example |
} | ||
|
||
@Test(expected = InvalidProgramException.class) | ||
public void testDefaultJoin() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should be done as a unit test (as mentioned in my other comment).
As I said in my previous comments, I would prefer to skip support for projection joins initially. If we don't allow DefaultJoin and ProjectJoin for outer joins, we can also revert the corresponding changes. |
Thanks @fhueske and @StephanEwen for the comprehensive review and additional details on Flink internals!, I agree that we should rather wait to implement the projection join correctly at a later point. |
45b515a
to
da30f62
Compare
…actJoinOperatorBase to JoinOperatorBase
…t API (Java, Scala) and optimizer. This closes apache#1138
Looks good. I added one commit to restore binary compatibility. The code is not super nice, but it allows to runt programs which have been previously compiled without the need to recompile. We can still clean up the code later if we decide to do so. Final tests are running, will merge after they passed. |
Oh, just realized we did not update the documentation. |
…t API (Java, Scala) and optimizer. This closes apache#1138
…t API (Java, Scala) and optimizer. This closes apache#1138
This PR implements FLINK-2576 (Adding the outer join operator to the optimizer and Java/Scala APIs, previously part of FLINK-2106).
For reference, the revious pull requests for the outer join implementation were #907 and #1052.
First of all thanks for the help we received in person and on the mailing list.
I designed the API as per the consensus on the mailing list and tried reusing as much code from the join operator api as possible.
This PR contributes the following:
Usage & Implementation:
In both APIs we prohibit using the default join functionality for outer joins. The user is required
to specify a custom join function that combines the (potentially
null
) left and right side tuples.In the Java API we support the projection join functionality for outer joins. (Projection joins are not yet implemented in the Scala API for inner joins, therefore no changes there.)
Important to note is that when the user performs a projection join, the type information is lost.
This is also the case for the inner projection join. Additionally, we explicitly "downgrade" the result type information of an outer projection join to a Tuple of
GenericTypeInfo<>(Object.class)
, in order to be able to serializenull
values.A nicer way to do this would be to use an
Optional<T>
type to represent nullable tuple values, but because we can't rely on Java 8 types, nor did I want to hardcode a dependency to a 3rd partyOptional
type (e.g. from guava) into the API, we went this route, for now.