-
Notifications
You must be signed in to change notification settings - Fork 90
[FLINK-23959][FLIP-175] Compose Estimator/Model/AlgoOperator from DAG of Estimator/Model/AlgoOperator #20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e09157b to
13dc485
Compare
40d3d9f to
0701737
Compare
|
@gaoyunhaii Could you help review this PR when you get time? Thanks! |
|
@lindong28 Hello~ could you rebase the PR to the latest master~? |
0701737 to
e0203b4
Compare
|
@gaoyunhaii Sure. The PR has been rebased to the latest master head. Thanks! |
flink-ml-core/src/main/java/org/apache/flink/ml/builder/Graph.java
Outdated
Show resolved
Hide resolved
flink-ml-core/src/main/java/org/apache/flink/ml/builder/GraphExecutionHelper.java
Outdated
Show resolved
Hide resolved
flink-ml-core/src/main/java/org/apache/flink/ml/builder/GraphExecutionHelper.java
Outdated
Show resolved
Hide resolved
flink-ml-core/src/main/java/org/apache/flink/ml/builder/GraphExecutionHelper.java
Show resolved
Hide resolved
flink-ml-core/src/main/java/org/apache/flink/ml/builder/GraphExecutionHelper.java
Outdated
Show resolved
Hide resolved
flink-ml-core/src/main/java/org/apache/flink/ml/builder/GraphBuilder.java
Show resolved
Hide resolved
| @PublicEvolving | ||
| public final class GraphBuilder { | ||
|
|
||
| private int maxOutputLength = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate me a bit why we need maxOutputLength = 20 here ? Why don't we let users to specify the number of outputs when adding nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maxOutputLength is needed by methods such as addEstimator(...) and getModelDataFromEstimator(..) to determine the length of TableId array returned by these methods. These tableIds corresponds to the Tables returned by the fit(), transform() and getModelData().
It is feasible to let users specify the number of outputs when adding nodes and getting model data. But this alternative approach has inferior usability. Here are the benefits of the current approach:
- The current approach simplifies experience in the common case.
In the common case, the number of tables returned by getModelData(), fit() and transform() is much less than 20. It will be nice not asking users to explicitly specify the number of tables when they call addEstimator(...) and getModelDataFromEstimator(..).
- The current approach makes the experience of using
GraphBuildersimilar to that of calling fit/transform/getModelData directly.
Users don't need to explicitly specify number of outputs when calling fit/transform/getModelData.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lindong28 for the detailed explanation! For the long run perhaps we could make the number of outputs a property of the stage, then we could not need to assume the maximum possible outputs. But since it would not affect the API of this part, I think we could first keep the interfaces as now.
|
|
||
| public void setTables(TableId[] tableIds, Table[] tables) { | ||
| Preconditions.checkArgument( | ||
| tableIds.length >= tables.length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would tableIds.length > tables.length ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The length of tableIds could be higher than the length of tables because we over-allocate TableIds as placeholders of stage's outputs when building the Graph.
For example, say we have constructed a Graph where algoOperatorA's outputs are used as algoOperatorB's input.
When Graph::fit(...) is invoked, what happens is that we will run algoOperatorA::fit(...) to get an array of Tables, and updates GraphExecutionHelper to map algoOperatorA's output TableIds to these Tables, by calling executionHelper.setTables(node.outputIds, nodeOutputs). In this case, the length of node.outputIds would be 20 by default, by the length of nodeOutputs is usually less than 20.
I have added the following comment above this code to clarify it. Does this address the concern?
// The length of tableIds could be larger than the length of tables because we over-allocate
// the number of tableIds (which is 20 by default) as placeholder of the stage's output
// tables when adding a stage in GraphBuilder.
774bb28 to
9225e2c
Compare
… of Estimator/Model/AlgoOperator
9225e2c to
38c7700
Compare
gaoyunhaii
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lindong28 for opening the PR! LGTM~
What is the purpose of the change
This PR adds a few public classes proposed in FLIP-175. These classes enable users to compose Estimator/Model/AlgoOperator from DAG of Estimator/Model/AlgoOperator.
Brief change log
This PR adds public classes
Graph,GraphModel,GraphBuilderandTableId. Users can use these classes to compose compose Estimator/Model/AlgoOperator from DAG of Estimator/Model/AlgoOperator.This PR also adds the package private classes
GraphNodeandGraphReadyNodesto simplify the implementation of the above public classes.Verifying this change
The changes are tested by unit tests in
GraphTest.Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes)Documentation