@lindong28
Member

What is the purpose of the change

This PR adds a few public classes proposed in FLIP-175. These classes enable users to compose an Estimator/Model/AlgoOperator from a DAG of Estimator/Model/AlgoOperator.

Brief change log

This PR adds the public classes Graph, GraphModel, GraphBuilder and TableId. Users can use these classes to compose an Estimator/Model/AlgoOperator from a DAG of Estimator/Model/AlgoOperator.

This PR also adds the package private classes GraphNode and GraphReadyNodes to simplify the implementation of the above public classes.

Verifying this change

The changes are tested by unit tests in GraphTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (Java doc)

@lindong28 lindong28 force-pushed the FLINK-23959 branch 2 times, most recently from e09157b to 13dc485 Compare November 2, 2021 02:19
@lindong28 lindong28 force-pushed the FLINK-23959 branch 2 times, most recently from 40d3d9f to 0701737 Compare November 20, 2021 03:07
@lindong28
Member Author

@gaoyunhaii Could you help review this PR when you get time? Thanks!

@gaoyunhaii
Contributor

@lindong28 Hello~ could you rebase the PR to the latest master~?

@lindong28
Member Author

@gaoyunhaii Sure. The PR has been rebased to the latest master head. Thanks!

@PublicEvolving
public final class GraphBuilder {

    private int maxOutputLength = 20;
Contributor

Could you elaborate a bit on why we need maxOutputLength = 20 here? Why don't we let users specify the number of outputs when adding nodes?

Member Author

maxOutputLength is needed by methods such as addEstimator(...) and getModelDataFromEstimator(...) to determine the length of the TableId array returned by these methods. These TableIds correspond to the Tables returned by fit(), transform() and getModelData().

It is feasible to let users specify the number of outputs when adding nodes and getting model data. But this alternative approach has inferior usability. Here are the benefits of the current approach:

  1. The current approach simplifies the experience in the common case.

In the common case, the number of tables returned by getModelData(), fit() and transform() is far fewer than 20. It is nicer not to ask users to explicitly specify the number of tables when they call addEstimator(...) and getModelDataFromEstimator(...).

  2. The current approach makes the experience of using GraphBuilder similar to that of calling fit/transform/getModelData directly.

Users don't need to explicitly specify the number of outputs when calling fit/transform/getModelData.
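
To make the over-allocation idea concrete, here is a minimal, self-contained sketch. It is not the code in this PR: SketchTableId, SketchGraphBuilder, addStage() and setMaxOutputLength() are hypothetical stand-ins, and only the default of 20 comes from the discussion above. The point it illustrates is that the caller never passes an output count; the builder simply hands back a fixed-size array of placeholder ids.

```java
/** Hypothetical stand-in for TableId; the real class in this PR may differ. */
final class SketchTableId {
    final int id;

    SketchTableId(int id) {
        this.id = id;
    }
}

/** Hypothetical, simplified builder that only illustrates placeholder over-allocation. */
final class SketchGraphBuilder {
    // Default taken from the thread; everything else here is an assumption.
    private int maxOutputLength = 20;
    private int nextTableId = 0;

    SketchGraphBuilder setMaxOutputLength(int maxOutputLength) {
        this.maxOutputLength = maxOutputLength;
        return this;
    }

    /** Returns maxOutputLength fresh placeholder ids; the caller does not pass a count. */
    SketchTableId[] addStage() {
        SketchTableId[] outputs = new SketchTableId[maxOutputLength];
        for (int i = 0; i < outputs.length; i++) {
            outputs[i] = new SketchTableId(nextTableId++);
        }
        return outputs;
    }
}
```

With this shape, a caller would typically use only the first one or two entries of the returned array, much as it would index into the Table[] returned by fit() or transform().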

What do you think?

Contributor

Thanks @lindong28 for the detailed explanation! In the long run, perhaps we could make the number of outputs a property of the stage, so that we would not need to assume a maximum number of outputs. But since that would not affect the API of this part, I think we can keep the interfaces as they are for now.


public void setTables(TableId[] tableIds, Table[] tables) {
    Preconditions.checkArgument(
            tableIds.length >= tables.length,
Contributor

When would tableIds.length > tables.length?

Member Author

The length of tableIds could be greater than the length of tables because we over-allocate TableIds as placeholders for a stage's outputs when building the Graph.

For example, say we have constructed a Graph where algoOperatorA's outputs are used as algoOperatorB's input.

When Graph::fit(...) is invoked, we run algoOperatorA::fit(...) to get an array of Tables, and update GraphExecutionHelper to map algoOperatorA's output TableIds to these Tables by calling executionHelper.setTables(node.outputIds, nodeOutputs). In this case, the length of node.outputIds would be 20 by default, but the length of nodeOutputs is usually less than 20.

I have added the following comment above this code to clarify it. Does this address the concern?

// The length of tableIds could be larger than the length of tables because we over-allocate
// the number of tableIds (which is 20 by default) as placeholder of the stage's output
// tables when adding a stage in GraphBuilder.
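
The mapping step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the GraphExecutionHelper from this PR: SketchExecutionHelper is a hypothetical class, plain int ids stand in for TableId, and String values stand in for Table. It shows why the precondition is `>=` rather than `==`: only the first tables.length placeholders get bound, and the surplus ids stay unmapped.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; ints stand in for TableId and Strings stand in for Table. */
final class SketchExecutionHelper {
    private final Map<Integer, String> tablesById = new HashMap<>();

    /** Binds the first tables.length placeholder ids to the stage's actual outputs. */
    void setTables(int[] tableIds, String[] tables) {
        // There may be more placeholder ids than tables, because ids are over-allocated.
        if (tableIds.length < tables.length) {
            throw new IllegalArgumentException(
                    "Got " + tables.length + " tables but only " + tableIds.length + " ids");
        }
        for (int i = 0; i < tables.length; i++) {
            tablesById.put(tableIds[i], tables[i]);
        }
    }

    /** Returns the table bound to the given id, or null if the id was never bound. */
    String getTable(int tableId) {
        return tablesById.get(tableId);
    }
}
```

In this sketch, looking up one of the surplus placeholder ids returns null; how the real implementation treats an unbound TableId is not shown in this thread.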

@lindong28 lindong28 force-pushed the FLINK-23959 branch 3 times, most recently from 774bb28 to 9225e2c Compare December 20, 2021 08:54
Contributor

@gaoyunhaii gaoyunhaii left a comment

Thanks @lindong28 for opening the PR! LGTM~

@lindong28 lindong28 deleted the FLINK-23959 branch December 24, 2021 00:04