Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-5104] Bipartite graph validation #2985

Closed
wants to merge 4 commits into from

Conversation

mushketyk
Copy link
Contributor

Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the How To Contribute guide.
In addition to going through the list, please provide a meaningful description of your changes.

  • General

    • The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
    • The pull request addresses only one issue
    • Each commit in the PR has a meaningful commit message (including the JIRA id)
  • Documentation

    • Documentation has been added for new functionality
    • Old documentation affected by the pull request has been updated
    • JavaDoc for public methods has been added
  • Tests & Build

    • Functionality added by the pull request is covered by tests
    • mvn clean verify has been executed successfully locally or a Travis build has passed

This PR duplicates some the user documentation for the graph added in #2984 but I think this won't be an issue when documentation PR is pushed to the master.

@mushketyk
Copy link
Contributor Author

I noticed that Graph has the following implementation of the validate method:

public Boolean validate(GraphValidator<K, VV, EV> validator) throws Exception {
	return validator.validate(this);
}

What is the reason for returning Boolean instead of boolean here (GraphValidator returns boolean)?
Shouldn't we change the return value to boolean?

Copy link
Contributor

@greghogan greghogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mushketyk, with a few comments for your review. I would change Graph.validate to return boolean unboxed as we aren't promissing ABI compatibility and the use of Boolean is vestigial from an old implementation which returned DataSet<Boolean>.

* limitations under the License.
*/

package org.apache.flink.graph.validation;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to org.apache.flink.graph.bipartite.validation?

DataSet<KT> invalidTopIds = invalidIds(bipartiteGraph.getTopVertices(), edgesTopIds);
DataSet<KB> invalidBottomIds = invalidIds(bipartiteGraph.getBottomVertices(), edgesBottomIds);

return invalidTopIds.count() == 0 && invalidBottomIds.count() == 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count() executes the program so shouldn't be called twice. Could use an accumulator in RichOutputFormat instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these two datasets have different types I couldn't figure out how to use accumulators and RichOutputFormat to run collect() only once, but I've replaced it with two outer joins and one count() call.
Is it a feasible approach?

}

@ForwardedFields("f0")
private class GetTopIdsMap<KT, KB, EV> implements MapFunction<BipartiteEdge<KT,KB,EV>, Tuple1<KT>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inner Flink functions should be static to prevent serialization of the outer class.


@Override
public boolean validate(BipartiteGraph<KT, KB, VVT, VVB, EV> bipartiteGraph) throws Exception {
DataSet<Tuple1<KT>> edgesTopIds = bipartiteGraph.getEdges().map(new GetTopIdsMap<KT, KB, EV>());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can GetTopIdsMap be replaced with .<Tuple1<KT>>project(0)?

@mushketyk
Copy link
Contributor Author

Hi @greghogan ,

Thank you for your feedback. I've updated the PR accordingly. The only thing that I did differently is that I've replaced projections and two count() executions with joins and one count() call. What do you think about this?

@greghogan
Copy link
Contributor

@mushketyk the build is failing due to unused imports (IntelliJ can be configured to automatically remove these).

There are so many ways to implement the validator. Projections would reduce the data to be transmitted and sorted. I like to document when we are performing a different kind of join (here, an anti-join) for the day when these are available in Flink. I don't know how counting in the FlatJoinFunction followed by two DiscardingOutputFormat compares with the current implementation of a union and count (which is a dummy OutputFormat).

We'll need to rebase this PR once FLINK-5311 has been committed.

@mushketyk
Copy link
Contributor Author

Sorry for the failing build. I've removed the unused imports so it should be fine now.

@mushketyk
Copy link
Contributor Author

@greghogan Thank you for the review.
Can you give me an example of how you document joins?

@greghogan
Copy link
Contributor

@mushketyk I had meant just with a simple comment noting that with an anti-join we could also remove the FlatJoinFunction.

@mushketyk
Copy link
Contributor Author

Rebased this PR on top of master.

@aljoscha
Copy link
Contributor

I'm closing this as "Abandoned", since there is no more activity and the code base has moved on quite a bit. Please re-open this if you feel otherwise and work should continue.

@aljoscha aljoscha closed this Oct 15, 2019
@aljoscha aljoscha self-assigned this Oct 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants