[FLINK-13339][ml] Add an implementation of pipeline's api #9184

xuyang1706 · 2019-07-20T01:41:23Z

What is the purpose of the change

Add an implementation of pipeline's api

Brief change log

Add implementations of PipelineStage, Estimator, Transformer, and Model.
Add MLSession to hold the execution environment and other session-shared variables.
Add AlgoOperator for the implementation of algorithms.
Add BatchOperator and StreamOperator based on AlgoOperator
Add TableSourceBatchOp and TableSourceStreamOp

Verifying this change

This change added tests and can be verified as follows:

Run the test cases and confirm they pass.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (yes)
If yes, how is the feature documented? (JavaDocs)

flinkbot · 2019-07-20T01:43:47Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 7505153 (Wed Oct 16 08:42:08 UTC 2019)

Warnings:

1 pom.xml files were touched: Check for build and licensing issues.
No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2019-07-20T01:49:31Z

CI report:

bd7ade5 : FAILURE Build
6a18792 : FAILURE Build
4f2afd3 : FAILURE Build
c4fc690 : FAILURE Build
b1855d5 : FAILURE Build
54210a0 : SUCCESS Build
b089ac0 : SUCCESS Build
67f96bd : FAILURE Build
54c4c47 : SUCCESS Build
2983c5d : SUCCESS Build
576ff18 : SUCCESS Build
16f1ec8 : SUCCESS Build
acb4479 : SUCCESS Build
a13f249 : SUCCESS Build
7505153 : SUCCESS Build

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build

becketqin

@xuyang1706 Thanks for the patch. I left some comments. In general, it would be good to avoid class name collisions. We may also want to consider whether the packaging can change a little bit. For example:

org.apache.flink.ml.
  .algooperator
      .AlgoOperator.java
      .stream
          .StreamAlgoOperator.java
          .source
               .TableSourceStreamAlgoOperator.java
      .batch
          .BatchAlgoOperator.java
          .source
               TableSourceBatchAlgoOperator.java

becketqin · 2019-09-06T02:56:21Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/Estimator.java

+ * @param <E> The class type of the {@link Estimator} implementation itself
+ * @param <M> class type of the {@link Model} this Estimator produces.
+ */
+public abstract class Estimator<E extends Estimator <E, M>, M extends Model <M>>


It is generally an anti-pattern to have duplicate class names. Can we change this to something like EstimatorBase?

Thanks, changed to EstimatorBase.

becketqin · 2019-09-06T03:00:56Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/Estimator.java

+import org.apache.flink.table.api.java.StreamTableEnvironment;
+
+/**
+ * Abstract class for a estimator that fit a {@link Model}.


This JavaDoc does not seem to provide much information to code readers. How about changing it to the following:

The base class for estimator implementations. It sets a global static context of `MLSession` and prepares the input of the estimator for either batch execution or stream execution.

Thanks, changed.

becketqin · 2019-09-06T03:11:54Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/Estimator.java

+	}
+
+	/**
+	 * Train and produce a {@link Model} which fits the records in the given {@link Table}.


No need to copy the java doc from the parent class if there is no additional information.

becketqin · 2019-09-06T03:12:54Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/Estimator.java

+	 * @param input the table with records to train the Model.
+	 * @return a model trained to fit on the given Table.
+	 */
+	public M fit(Table input) {


Does this method have to be public? If it has to be public, this method assumes that the MLSession has already been set up. Should this be mentioned in the JavaDoc?

It is better to be public. We have refined the JavaDoc.

becketqin · 2019-09-06T03:37:35Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/Estimator.java

+	 */
+	@Override
+	public M fit(TableEnvironment tEnv, Table input) {
+		MLSession.setTableEnvironment(tEnv, input);


The global static MLSession requires that the environment is only set once. I am not sure if this restriction is too strong. From the ML pipeline API's perspective, there is no such restriction. At the very least, we should document this.

Thanks for your advice and the offline discussion. We defined MLEnvironmentFactory to replace the global static MLSession.

becketqin · 2019-09-06T06:27:32Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/PipelineStage.java

+ * @param <S> The class type of the {@link PipelineStage} implementation itself, used by {@link
+ *            org.apache.flink.ml.api.misc.param.WithParams} and Cloneable.
+ */
+public abstract class PipelineStage<S extends PipelineStage <S>>


Same here, PipelineStageBase? This class does not implement the org.apache.flink.ml.api.core.PipelineStage, is this intended?

I am not sure this class is a must. I think each type of pipeline stage individually can set its own base class. see: https://stackoverflow.com/a/3599379/11332462

It has been renamed. PipelineStage is the standard concept in a pipeline, and if it implements WithParams, the subclasses do not need to care about WithParams.

becketqin · 2019-09-06T06:28:15Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/Transformer.java

+ * @param <T> The class type of the {@link Transformer} implementation itself, used by {@link
+ *            org.apache.flink.ml.api.misc.param.WithParams}
+ */
+public abstract class Transformer<T extends Transformer <T>>


Same here, TransformerBase?

thx, changed

becketqin · 2019-09-06T06:34:32Z

...-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/streamoperator/StreamOperator.java

+/**
+ * Base class of streaming algorithm operators.
+ */
+public abstract class StreamOperator<T extends StreamOperator <T>> extends AlgoOperator<T> {


The name StreamOperator is colliding with org.apache.flink.streaming.api.operators.StreamOperator which is a widely established class in Flink. Can we change it to something like StreamAlgoOperator? We may need to change the BatchOperator to BatchAlgoOperator as well.

Thanks for your advice and offline discussion. In our ML algorithm implementations and applications, we need not use org.apache.flink.streaming.api.operators.StreamOperator, and we'd like to use BatchOperator in ML lib.

becketqin · 2019-09-06T06:39:30Z

...-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/streamoperator/StreamOperator.java

+		return getOutput().toString();
+	}
+
+	public <S extends StreamOperator> S link(S next) {


I am curious why link() and linkTo() methods require a template type while linkFrom() does not need that?

link() and linkTo() return the next AlgoOp, while linkFrom() returns the AlgoOp itself.
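The difference in return types can be illustrated with a minimal sketch. The classes below (`Op`, `SourceOp`, `ScaleOp`) are hypothetical stand-ins, not the actual Flink ML classes; they only mirror the self-typed generic pattern visible in the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the operator classes discussed in this thread.
final class LinkReturnTypeSketch {

    // Self-typed base class: T is the concrete subclass, as in AlgoOperator<T>.
    abstract static class Op<T extends Op<T>> {
        final List<Op<?>> inputs = new ArrayList<>();

        // linkFrom mutates this operator, so it can return the self type T.
        abstract T linkFrom(Op<?>... ins);

        // link returns the next operator, whose type is unrelated to T,
        // so the method needs its own type parameter S.
        <S extends Op<?>> S link(S next) {
            next.linkFrom(this);
            return next;
        }
    }

    static final class SourceOp extends Op<SourceOp> {
        @Override
        SourceOp linkFrom(Op<?>... ins) {
            throw new UnsupportedOperationException("A source has no upstream.");
        }
    }

    static final class ScaleOp extends Op<ScaleOp> {
        double factor = 1.0;

        ScaleOp setFactor(double f) {
            this.factor = f;
            return this;
        }

        @Override
        ScaleOp linkFrom(Op<?>... ins) {
            inputs.clear();
            for (Op<?> in : ins) {
                inputs.add(in);
            }
            return this;
        }
    }

    public static void main(String[] args) {
        SourceOp source = new SourceOp();
        ScaleOp scale = new ScaleOp();
        // The result is typed as ScaleOp, so ScaleOp-specific setters still chain.
        ScaleOp linked = source.link(scale).setFactor(2.0);
        System.out.println(linked == scale);
    }
}
```

Without the method-level type parameter, `link` could only return the base type, and the chained `setFactor` call would not compile.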

becketqin · 2019-09-06T06:44:22Z

...-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/streamoperator/StreamOperator.java

+import java.util.List;
+
+/**
+ * Base class of streaming algorithm operators.


This class seems to need some JavaDoc for its public methods.

ex00

Hi @xuyang1706,
thanks for your work. I've just left a few comments; please take a look when you have time.

ex00 · 2019-09-06T09:54:51Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+		if (null != ins && ins.size() == 1) {
+			return linkFrom(ins.get(0));
+		} else {
+			throw new RuntimeException("Not support more than 1 inputs!");


The implementations of linkFrom(BatchOperator, BatchOperator), linkFrom(BatchOperator, BatchOperator, BatchOperator), and linkFrom(List) look inconsistent.
The multi-operator overloads build a list with more than one element, but linkFrom(List) then throws an exception saying the number of elements must be 1.

We define the new method linkFrom(BatchOperator…) to replace them.

ex00 · 2019-09-06T09:55:24Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+	}
+
+	public T linkFrom(List <BatchOperator> ins) {
+		if (null != ins && ins.size() == 1) {


What is the difference between this case and linkFrom(BatchOperator)?

We define a new method linkFrom(BatchOperator…) to replace both of them.

xuyang1706 · 2019-09-12T12:46:52Z

Hi @xuyang1706,
thanks for your work. I've just left a few comments; please take a look when you have time.

Hi @ex00 , thanks for your comments. We defined a new method, linkFrom(BatchOperator…), which supports one, two, or many inputs.
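A rough sketch of how a single varargs method can replace the fixed-arity overloads. `Operator`, `SourceOp`, and `UnionOp` are hypothetical names for illustration. The name `checkMinOpSize` appears later in this PR, but its body here is an assumption, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; not the actual Flink ML classes.
final class VarargsLinkSketch {

    abstract static class Operator<T extends Operator<T>> {
        // One entry point for one, two, or many upstream operators.
        abstract T linkFrom(Operator<?>... inputs);

        // Assumed helper: reject calls that pass too few inputs.
        static void checkMinOpSize(int size, Operator<?>... inputs) {
            if (inputs.length < size) {
                throw new IllegalArgumentException(
                    "Expected at least " + size + " input(s), current: " + inputs.length);
            }
        }
    }

    static final class SourceOp extends Operator<SourceOp> {
        @Override
        SourceOp linkFrom(Operator<?>... inputs) {
            throw new UnsupportedOperationException(
                "Table source operator should not have any upstream to link from");
        }
    }

    static final class UnionOp extends Operator<UnionOp> {
        final List<Operator<?>> upstream = new ArrayList<>();

        @Override
        UnionOp linkFrom(Operator<?>... inputs) {
            checkMinOpSize(1, inputs);
            upstream.clear();
            upstream.addAll(Arrays.asList(inputs));
            return this;
        }
    }

    public static void main(String[] args) {
        SourceOp a = new SourceOp();
        SourceOp b = new SourceOp();
        System.out.println(new UnionOp().linkFrom(a).upstream.size());
        System.out.println(new UnionOp().linkFrom(a, b).upstream.size());
    }
}
```

Each concrete operator decides how many inputs it accepts, so the arity check lives in the subclass rather than in a list-based overload of the base class.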

walterddr

Thanks for the contribution @xuyang1706 and sorry for joining the discussion late. Please see some of my comments. In general I think there are lots of added APIs that we should carefully document. Please let me know what you guys think ;-) thanks -Rong

walterddr · 2019-09-13T16:56:04Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/AlgoOperator.java

+ * Base class for algorithm operators.
+ * @param <T> The class type of the {@link AlgoOperator} implementation itself
+ */
+public abstract class AlgoOperator<T extends AlgoOperator <T>> implements WithParams<T>, Serializable {


Another question: based on my understanding of this PR, shouldn't it be

public abstract class AlgoOperator<T> extends PipelineStage<T> { // ... }

walterddr · 2019-09-13T17:05:29Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+		return next;
+	}
+
+	public abstract T linkFrom(BatchOperator<?>... inputs);


This is a new abstract API. Please provide JavaDoc

Thanks, added intro and example.
Since there is no reply area for the question above, I put the answer here:
PipelineStage only supports single input and single output, it is the basic unit for pipeline. AlgoOperator supports multi-input and multi-output. We’d like to implement the algorithm with AlgoOperator, and PipelineStage’s fit and transform function can call the AlgoOperator.
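The division of labor described above might look roughly like the following sketch. All names (`Table`, `AddOp`, `AddConstantStage`) are invented for illustration and the "table" is just an array of doubles; the real classes in this PR work on Flink tables.

```java
import java.util.Arrays;

// Hypothetical stand-ins; not the actual Flink ML classes.
final class StageDelegationSketch {

    // Stand-in for a table: a fixed array of values.
    static final class Table {
        final double[] values;

        Table(double... values) {
            this.values = values.clone();
        }
    }

    // Operator: the algorithm lives here and may take several inputs.
    // Assumes all inputs have the same length.
    static final class AddOp {
        private Table output;

        AddOp linkFrom(Table... inputs) {
            if (inputs.length == 0) {
                throw new IllegalArgumentException("At least one input is required.");
            }
            double[] sum = new double[inputs[0].values.length];
            for (Table in : inputs) {
                for (int i = 0; i < sum.length; i++) {
                    sum[i] += in.values[i];
                }
            }
            this.output = new Table(sum);
            return this;
        }

        Table getOutput() {
            return output;
        }
    }

    // Pipeline stage: single input, single output; it delegates to the operator.
    static final class AddConstantStage {
        private final double constant;

        AddConstantStage(double constant) {
            this.constant = constant;
        }

        Table transform(Table input) {
            double[] c = new double[input.values.length];
            Arrays.fill(c, constant);
            return new AddOp().linkFrom(input, new Table(c)).getOutput();
        }
    }

    public static void main(String[] args) {
        Table out = new AddConstantStage(3.0).transform(new Table(1.0, 2.0));
        System.out.println(Arrays.toString(out.values));
    }
}
```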

walterddr · 2019-09-13T17:07:03Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+/**
+ * Base class of batch algorithm operators.
+ */
+public abstract class BatchOperator<T extends BatchOperator<T>> extends AlgoOperator<T> {


Please provide JavaDoc for link, linkTo, and linkFrom.

Thanks, provided.

walterddr · 2019-09-13T17:08:35Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+	}
+
+	public <B extends BatchOperator<?>> B linkTo(B next) {
+		next.linkFrom(this);


What's the rationale for making the link in this direction? What if a downstream operator calls linkFrom twice with the same upstream, or twice with different upstreams?

thx

With this method, operators can be linked from source to sink when every operator in the DAG has a single input and a single output, and the code is straightforward to read because execution proceeds in order from front to back.

It is not recommended for an operator to linkFrom itself or to link the same group of inputs twice (added to the JavaDoc); the implementation of each operator defines the behavior in those cases.

walterddr · 2019-09-13T17:09:18Z

.../flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/source/TableSourceBatchOp.java

+
+	@Override
+	public TableSourceBatchOp linkFrom(BatchOperator<?>... inputs) {
+		throw new UnsupportedOperationException("Not supported.");


more concise error message: "Table source operator should not have any upstream to link from"

Thanks, this message is more concise.

walterddr · 2019-09-13T17:16:13Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/PipelineStage.java

+ * @param <S> The class type of the {@link PipelineStage} implementation itself, used by {@link
+ *            org.apache.flink.ml.api.misc.param.WithParams} and Cloneable.
+ */
+public abstract class PipelineStage<S extends PipelineStage <S>>


I am not sure this class is a must. I think each type of pipeline stage individually can set its own base class. see: https://stackoverflow.com/a/3599379/11332462

…MLEnvironmentFactory that create the new MLEnvironment

xuyang1706 · 2019-09-19T12:34:13Z

@xuyang1706 Thanks for the patch. I left some comments. In general, it would be good to avoid class name collisions. We may also want to consider whether the packaging can change a little bit. For example:
org.apache.flink.ml.
  .algooperator
      .AlgoOperator.java
      .stream
          .StreamAlgoOperator.java
          .source
               .TableSourceStreamAlgoOperator.java
      .batch
          .BatchAlgoOperator.java
          .source
               TableSourceBatchAlgoOperator.java

@becketqin , thanks for your advice. We defined MLEnvironmentFactory to replace the global static MLSession and to support multiple environments, refactored the class names, and refined the JavaDoc.

xuyang1706 · 2019-09-19T12:44:09Z

Thanks for the contribution @xuyang1706 and sorry for joining the discussion late. Please see some of my comments. In general I think there are lots of added APIs that we should carefully document. Please let me know what you guys think ;-) thanks -Rong

@walterddr , thanks for your comments. We added more description on the concepts and APIs, and refactored the core functions of link, linkTo and linkFrom.

ex00

Hi @xuyang1706 thanks for update,
I've left additional comments; please take a look at them.

ex00 · 2019-09-20T14:52:44Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/AlgoOperator.java

+	 */
+	public static AlgoOperator<?> sourceFrom(Table table) {
+		if (((TableImpl) table).getTableEnvironment() instanceof StreamTableEnvironment) {
+			return new TableSourceStreamOp(table);


the cyclic dependency

org.apache.flink.ml.streamoperator.source.TableSourceStreamOp extends org.apache.flink.ml.streamoperator.StreamOperator
org.apache.flink.ml.streamoperator.StreamOperator<T extends StreamOperator<T>> extends org.apache.flink.ml.common.AlgoOperator
in org.apache.flink.ml.common.AlgoOperator: import org.apache.flink.ml.streamoperator.source.TableSourceStreamOp

Thanks, removed.

ex00 · 2019-09-20T14:58:02Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/AlgoOperator.java

+	/**
+	 * Construct the operator with empty Params.
+	 */
+	protected AlgoOperator() {


We probably need only one constructor, with a non-nullable Table argument and nullable Params.

A large part of the methods in this class depend on the output field.

public AlgoOperator(Params params, Table table) {
    if (null == table) {
        throw new IllegalArgumentException("Table should be non null");
    }
    this.output = table;
    if (null == params) {
        this.params = new Params();
    } else {
        this.params = params.clone();
    }
}

The Table is the output of the operator. In most scenarios it is set in linkFrom, and the next linked operator calls getOutput to obtain it. So both AlgoOperator() and AlgoOperator(Params params) should exist.
TableSourceBatchOp and TableSourceStreamOp are the only two cases that construct the operator from a table. These operators are special cases, so there is no need to add a new constructor to the base class for them.

I think the main point here is that the public AlgoOperator() API is not used or tested, so we can add it in later PRs. Correct me if I am wrong, @ex00.
If that's the case, then yes, I do agree that any newly added APIs that are not tested should be part of later PRs.

Yes @walterddr, you are right. I don't see how this constructor is used here.
Also, personally, I find it a little confusing that the logic in the parent class uses the output field when that field is not marked as required.

In later PRs, these APIs will be used and tested.

To show how this constructor is used, I'd like to give an example.
SplitBatchOp is widely used in ML data preprocessing. It splits one dataset into two datasets: a training set and a validation set. It is very convenient for us to write code like this:
new SplitBatchOp().setSplitRatio(0.9)

As for the output field: an AlgoOperator may have one or more result tables, and in most cases it has only one. The output is the main result table of the operation, and the other results are kept in sideOutputs.
For example, 2 AlgoOperators: AlgoA and AlgoB, AlgoB takes the AlgoA’s results as its inputs, we can write the code like this:
AlgoB.linkFrom(AlgoA)
AlgoA.getOutput() provides the main result of AlgoA, and AlgoA.getSideOutputs() provides the other results of AlgoA. AlgoB will take the AlgoA’s results as its inputs by calling AlgoA.getOutput() and AlgoA.getSideOutputs().

new SplitBatchOp().setSplitRatio(0.9)

How would this be applied to a data source, and how would the output result be defined? Could you explain more, please?

Thanks, we have the setOutput and can set the output in any method of the operator

Does that mean it will look like new SplitBatchOp().setSplitRatio(0.9).setOutput(table)?
If that assumption is right, whoever writes the code needs to know that setOutput must be called. If someone follows your example, it is easy to miss this call, and the exception will be thrown only at run time.

AlgoA = new SplitBatchOp().setSplitRatio(0.9)
AlgoB.linkFrom(AlgoA)
AlgoA.getSchema() // NPE: output is null

But if we define SplitBatchOp(table), the developer does not have to think about additional method calls and will get a compile-time error in the wrong case.
And as a result, we would need to implement a constructor with a table parameter in each operator, so why not do it in the parent class?

Yes, there must be one or more data sources. We have defined some common *SourceBatchOp classes to get the source data conveniently, and the algorithm operators can be linked from the source BatchOp.
Here is a complete example:

CsvSourceBatchOp algoA = new CsvSourceBatchOp()
    .setFilePath("http://alink-dataset.cn-hangzhou.oss.aliyun-inc.com/csv/iris.csv")
    .setSchemaStr("s_length double, s_width double, p_length double, p_width double, category string");
System.out.println("schema of algoA:");
System.out.println(algoA.getSchema());
SplitBatchOp algoB = new SplitBatchOp().setFraction(0.8);
algoB.linkFrom(algoA);
System.out.println("schema of algoB's main result:");
System.out.println(algoB.getSchema());
System.out.println("schema of algoB's side output result:");
System.out.println(algoB.getSideOutput().getSchema());

And the print results:

schema of algoA:
root
 |-- s_length: Double
 |-- s_width: Double
 |-- p_length: Double
 |-- p_width: Double
 |-- category: String
schema of algoB's main result:
root
 |-- s_length: Double
 |-- s_width: Double
 |-- p_length: Double
 |-- p_width: Double
 |-- category: String
schema of algoB's side output result:
root
 |-- s_length: Double
 |-- s_width: Double
 |-- p_length: Double
 |-- p_width: Double
 |-- category: String

now I see, thanks for the explanation

ex00 · 2019-09-20T15:07:53Z

.../flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/source/TableSourceBatchOp.java

+ */
+public final class TableSourceBatchOp extends BatchOperator<TableSourceBatchOp> {
+
+	public TableSourceBatchOp(Table table) {


This logic could be moved to the parent class; it is the same as in TableSourceStreamOp.

refer to #9184 (comment)

ex00 · 2019-09-20T15:30:38Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/AlgoOperator.java

+	@Override
+	public Params getParams() {
+		if (null == this.params) {
+			this.params = new Params();


This duplicates logic from the constructor.
If params can never be set to null, we don't need to check it every time.

Thanks for your advice, removed the check.

…n getParams

xuyang1706 · 2019-09-23T03:47:14Z

Hi @xuyang1706 thanks for update,
I've left additional comments; please take a look at them.

@ex00 , thanks for your advice, I have refactored the sourceFrom() method and removed the unnecessary check process.

add more comments in MLEnvironmentFactory
move the initialize of MLEnvironment to getNewMLEnvironmentId in MLEnvironmentFactory

walterddr

Thanks for the follow-up @xuyang1706. Overall the changes look good to me. I left some minor comments. I also noticed that some large chunks of code are missing tests. Are we going to add individual tests in the future, or are they covered by MLSessionTest?

walterddr · 2019-09-25T15:33:55Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+ *
+ * <p>This class is extended to support the data transmission between the BatchOperators.
+ */
+public abstract class BatchOperator<T extends BatchOperator<T>> extends AlgoOperator<T> {


Thanks @becketqin , @walterddr , @ex00 , I have refactored the package paths according to your suggestions.

walterddr · 2019-09-25T15:38:21Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/MLEnvironment.java

+					this.streamEnv = ((StreamTableEnvironmentImpl) streamTableEnv).execEnv();
+				}
+			}
+		} else {


I would still put a check here

else if (tEnv instanceof BatchTableEnvironment) {
    //...
} else {
    throw new IllegalArgumentException(...);
}

Thanks for your advice, added.
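A minimal sketch of the suggested pattern: test each supported environment type explicitly and fail loudly on anything else, instead of treating the final else branch as "must be batch". The types `TableEnv`, `StreamTableEnv`, `BatchTableEnv`, and `OtherTableEnv` are hypothetical placeholders, not Flink classes.

```java
// Hypothetical stand-ins; not the actual Flink table environment classes.
final class EnvKindSketch {

    interface TableEnv {}

    static final class StreamTableEnv implements TableEnv {}

    static final class BatchTableEnv implements TableEnv {}

    static final class OtherTableEnv implements TableEnv {}

    // Returns a label for the environment kind, rejecting unknown kinds.
    static String classify(TableEnv tEnv) {
        if (tEnv instanceof StreamTableEnv) {
            return "stream";
        } else if (tEnv instanceof BatchTableEnv) {
            return "batch";
        } else {
            String name = tEnv == null ? "null" : tEnv.getClass().getName();
            throw new IllegalArgumentException("Unsupported table environment: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(new StreamTableEnv()));
        System.out.println(classify(new BatchTableEnv()));
    }
}
```

With an unchecked else branch, a new or unexpected environment type would silently be handled as batch; the explicit throw surfaces that mistake at the call site.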

walterddr · 2019-09-25T15:41:10Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/EstimatorBase.java

+ * @param <E> The class type of the {@link EstimatorBase} implementation itself
+ * @param <M> class type of the {@link ModelBase} this Estimator produces.
+ */
+public abstract class EstimatorBase<E extends EstimatorBase<E, M>, M extends ModelBase<M>>


I might be wrong, but it looks like EstimatorBase is not tested?

same as other Pipeline stages.

Thanks, we have added some test cases, and we will add more along with the extending classes.

walterddr · 2019-09-25T15:44:10Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/AlgoOperator.java

+	/**
+	 * Construct the operator with empty Params.
+	 */
+	protected AlgoOperator() {


I think the main point here is that the public AlgoOperator() API is not used or tested, so we can add it in later PRs. Correct me if I am wrong, @ex00.
If that's the case, then yes, I do agree that any newly added APIs that are not tested should be part of later PRs.

add more case
add more comments in AlgoOperator

xuyang1706 · 2019-09-26T08:39:25Z

Thanks for the follow-up @xuyang1706. Overall the changes look good to me. I left some minor comments. I also noticed that some large chunks of code are missing tests. Are we going to add individual tests in the future, or are they covered by MLSessionTest?

Thanks @walterddr , I refactored the package paths, added some test cases and refined the JavaDoc with examples.

becketqin

@xuyang1706 Thanks for updating the patch. I left a few more comments. Can you take a look?

becketqin · 2019-09-20T09:32:54Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+/**
+ * Base class of batch algorithm operators.
+ *
+ * <p>This class is extended to support the data transmission between the BatchOperators.


This class extends {@link AlgoOperator} to support data transmission between BatchOperators.

Thanks, changed

becketqin · 2019-09-23T06:31:28Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+	/**
+	 * Abbreviation of {@link #linkTo(BatchOperator)}.
+	 */
+	public <B extends BatchOperator<?>> B link(B next) {


Do we need to have this alias method?

Thanks, removed the linkTo

becketqin · 2019-09-27T07:59:47Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+	 * }
+	 * </pre>
+	 *
+	 * <p>the <code>c</code> in upper code is the linked


The BatchOperator <code>c</code> in the above code is the same instance as <code>b</code> which takes <code>a</code> as its input. Note that BatchOperator <code>b</code> will be changed to link from BatchOperator <code>a</code>.

Thanks, changed

becketqin · 2019-09-27T08:01:14Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+	 * <p>the <code>c</code> in upper code is the linked
+	 * <code>b</code> which use <code>a</code> as input.
+	 *
+	 * @param next the linked BatchOperator


The operator that will be modified to add this operator to its input.

Thanks, changed

becketqin · 2019-09-27T08:02:21Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/batchoperator/BatchOperator.java

+	/**
+	 * Link to another {@link BatchOperator}.
+	 *
+	 * <p>Link the <code>next</code> to BatchOperator using this as its input.


Link the <code>next</code> BatchOperator using this BatchOperator as its input.

Thanks, changed

becketqin · 2019-09-27T14:49:08Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/EstimatorBase.java

+	 * @param input the table with records to train the Model.
+	 * @return a model trained to fit on the given Table.
+	 */
+	public M fit(Table input) {


Does this method have to be public?

Thanks, this method is more convenient with the MLEnvironmentFactory, so I think it should be public

I am actually wondering whether, with the MLEnvironmentFactory, we should replace the void fit(TableEnvironment, Table) method with this one. That would make the API much more consistent and easier to understand. Given that there are no users of the MLPipeline interface yet, now looks like the best time to do it. But we do need a FLIP for such an API change.

becketqin · 2019-09-27T14:51:22Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/PipelineStageBase.java

+import org.apache.flink.ml.params.shared.HasMLEnvironmentId;
+
+/**
+ * The base class for a stage in a pipeline, either an [[Estimator]] or a [[Transformer]].


It seems the JavaDoc format for class references is not quite consistent. Sometimes it is {@link Class}, sometimes <code>Class</code>, and here it is [[Class]]. Can we make them consistent?

Thanks, changed

becketqin · 2019-09-27T14:54:40Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/PipelineStageBase.java

+/**
+ * The base class for a stage in a pipeline, either an [[Estimator]] or a [[Transformer]].
+ *
+ * <p>Each pipeline stage is with parameters, and requires a public empty constructor for


The PipelineStageBase maintains the parameters for the stage. A default constructor is needed in order to restore a pipeline stage.

Thanks, changed

becketqin · 2019-09-27T14:56:25Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/TransformerBase.java

+/**
+ * The base class for transformer implementations.
+ *
+ * @param <T> The class type of the {@link TransformerBase} implementation itself, used by {@link


The class type of the {@link TransformerBase} implementation itself =>

A subclass of {@link TransformerBase}, used by ...

Thanks, changed

becketqin · 2019-09-27T14:57:53Z

flink-ml-parent/flink-ml-lib/src/test/java/org/apache/flink/ml/common/MLEnvironmentTest.java

+/**
+ * Test cases for MLEnvironment.
+ */
+public class MLEnvironmentTest {


It looks like there are no assertions in this test class.

Thanks, changed

…eckMinOpSize

…nt in MLEnvironmentFactory

xuyang1706 · 2019-09-29T09:38:31Z

@xuyang1706 Thanks for updating the patch. I left a few more comments. Can you take a look?

Thanks @becketqin, I have refined the code according to your comments.

becketqin

@xuyang1706 Thanks for updating the patch. I left a few more comments. Most of them are minor. One major thing is I am not sure whether we should completely remove fit(TableEnvironment, Table) and transform(TableEnvironment, Table) API, because they are completely bypassed in the current implementation given the introduction of MLEnvironmentFactory.

becketqin · 2019-10-10T14:23:06Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/MLEnvironmentFactory.java

+	public static final Long DEFAULT_ML_ENVIRONMENT_ID = 0L;
+
+	/**
+	 * A 'id' is a unique identifier of a MLEnvironment.


A monotonically increasing id for the MLEnvironments. Each id uniquely identifies an MLEnvironment.
Nit: Maybe we can change the variable name to nextId?

Thanks, changed

becketqin · 2019-10-10T14:25:41Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/MLEnvironment.java

+	 */
+	public StreamExecutionEnvironment getStreamExecutionEnvironment() {
+		if (null == streamEnv) {
+			streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();


Can we set the private instance variables in the constructor? This way all the instance variables can be final.

Thanks. As we discussed offline, because this depends on an unfixed bug, we will keep the current implementation as a temporary workaround.

becketqin · 2019-10-10T14:32:26Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/MLEnvironmentFactory.java

+			if (mlEnvId.equals(DEFAULT_ML_ENVIRONMENT_ID)) {
+				setDefault(new MLEnvironment());
+			} else {
+				throw new IllegalArgumentException("There is no Environment in factory. " +


Can we create the default ML Session in the static block? This will help simplify the logic here.

Should we also include the EnvId in the error message? Something like:
String.format("Cannot find MLEnvironment for MLEnvironmentId %s. Did you get the MLEnvironmentId by calling getNewMLEnvironmentId?", mlEnvId)

Thanks for your advice, we refactored code to create the default ML Session in the static block and changed the exception message.
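The two suggestions above (register the default environment in a static block, and name the missing id in the error message) could be combined along these lines. This is only a sketch: `Env`, `get`, and `getNewId` are illustrative names, and the real MLEnvironmentFactory may differ in structure and signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the factory discussed in this thread.
final class EnvFactorySketch {

    // Placeholder for an environment object.
    static final class Env {}

    static final Long DEFAULT_ID = 0L;

    private static final Map<Long, Env> ENVS = new HashMap<>();

    // Monotonically increasing id; each id identifies exactly one Env.
    private static long nextId = 1L;

    static {
        // The default environment always exists, so get() needs no special case.
        ENVS.put(DEFAULT_ID, new Env());
    }

    static synchronized Env get(Long id) {
        Env env = ENVS.get(id);
        if (env == null) {
            throw new IllegalArgumentException(String.format(
                "Cannot find MLEnvironment for MLEnvironmentId %s. "
                    + "Did you get the MLEnvironmentId by calling getNewMLEnvironmentId?", id));
        }
        return env;
    }

    static synchronized Long getNewId() {
        Long id = nextId++;
        ENVS.put(id, new Env());
        return id;
    }

    public static void main(String[] args) {
        Long id = getNewId();
        System.out.println(get(id) != get(DEFAULT_ID));
    }
}
```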

becketqin · 2019-10-10T14:39:08Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/MLEnvironmentFactory.java

+	 *
+	 * @param env the MLEnvironment
+	 */
+	public static synchronized void setDefault(MLEnvironment env) {


Is custom default MLEnvironment necessary? What if users used one default MLEnv and later on changed to another? This could introduce some unexpected problems.

Should users just create their own MLEnvironmentID and MLEnvironment in that case?

Thanks, I have added a check for an existing default MLEnv, so users can set the default MLEnv only once.

becketqin · 2019-10-10T14:46:13Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/operator/AlgoOperator.java

+	/**
+	 * Returns the column names of the output table.
+	 */
+	public String[] getColNames() {


Just curious, how would users get the schema of the side output tables?

Thanks, added

becketqin · 2019-10-10T14:59:07Z

...ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/operator/stream/StreamOperator.java

+			+ size + ", current: " + inputs.length);
+	}
+
+	protected void checkMinOpSize(int size, StreamOperator<?>... inputs) {


Ditto above.

Thanks, changed

becketqin · 2019-10-10T15:05:01Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/EstimatorBase.java

+	 * @param input the table with records to train the Model.
+	 * @return a model trained to fit on the given Table.
+	 */
+	public M fit(Table input) {


I am actually wondering whether, with the MLEnvironmentFactory, we should replace the fit(TableEnvironment, Table) method with this one. That would make the API much more consistent and easier to understand. Given that there are no users of the MLPipeline interface yet, now looks like the best time to do it. But we do need a FLIP for such an API change.

becketqin · 2019-10-10T19:12:38Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/TransformerBase.java

+	}
+
+	@Override
+	public Table transform(TableEnvironment tEnv, Table input) {


Should we do a sanity check to make sure the tEnv passed in is the same table environment used by the input table?

I agree with replacing the fit(TableEnvironment, Table) method by relying on the MLEnvironmentFactory, and I suggest we discuss and change the API in another PR.

becketqin · 2019-10-10T19:13:20Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/pipeline/EstimatorBase.java

+	}
+
+	@Override
+	public M fit(TableEnvironment tEnv, Table input) {


Should we do a sanity check to make sure the tEnv passed in is the same one used by the input table?

Ditto above.
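
For illustration, the sanity check asked about in these two comments could look roughly like the sketch below. It assumes the table can report the environment that created it; `SketchTableEnv`, `SketchTable`, `getOwner` and `EnvSanityCheck` are hypothetical stand-ins, not Flink classes, and nothing in this thread confirms that such a check was added.

```java
/** Hypothetical stand-in for a table environment. */
final class SketchTableEnv {
}

/** Hypothetical stand-in for a table that remembers the environment that created it. */
final class SketchTable {
	private final SketchTableEnv owner;

	SketchTable(SketchTableEnv owner) {
		this.owner = owner;
	}

	SketchTableEnv getOwner() {
		return owner;
	}
}

final class EnvSanityCheck {
	private EnvSanityCheck() {
	}

	/** Fails fast if the table was created by a different environment than the one passed in. */
	static void checkSameEnvironment(SketchTableEnv tEnv, SketchTable input) {
		if (tEnv == null || input == null) {
			throw new IllegalArgumentException("The table environment and the input table must not be null.");
		}
		// Identity comparison: two distinct environments are never interchangeable.
		if (input.getOwner() != tEnv) {
			throw new IllegalArgumentException(
				"The input table was not created by the given table environment.");
		}
	}
}
```

A fit or transform implementation would call `checkSameEnvironment(tEnv, input)` as its first statement.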

becketqin · 2019-10-10T19:17:06Z

flink-ml-parent/flink-ml-lib/src/test/java/org/apache/flink/ml/common/MLEnvironmentTest.java

+	public void testConstructWithBatchEnv() {
+		ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment();
+		BatchTableEnvironment batchTableEnvironment = BatchTableEnvironment.create(executionEnvironment);
+
+		MLEnvironment mlEnvironment = new MLEnvironment(executionEnvironment, batchTableEnvironment, null, null);
+
+		Assert.assertSame(mlEnvironment.getExecutionEnvironment(), executionEnvironment);
+		Assert.assertSame(mlEnvironment.getBatchTableEnvironment(), batchTableEnvironment);
+	}
+
+	@Test
+	public void testConstructWithStreamEnv() {
+		StreamExecutionEnvironment streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
+		StreamTableEnvironment streamTableEnvironment = StreamTableEnvironment.create(streamExecutionEnvironment);
+
+		MLEnvironment mlEnvironment = new MLEnvironment(null, null, streamExecutionEnvironment, streamTableEnvironment);
+
+		Assert.assertSame(mlEnvironment.getStreamExecutionEnvironment(), streamExecutionEnvironment);
+		Assert.assertSame(mlEnvironment.getStreamTableEnvironment(), streamTableEnvironment);
+	}
+}


If we only expect the MLEnvironment to be constructed in these two ways, maybe we can just have those two specific constructors?

Thanks, changed
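
A rough sketch of the "two specific constructors" idea, using hypothetical stub types (`BatchExecStub`, `BatchTableStub`, `StreamExecStub`, `StreamTableStub`, `PairedEnvironment`) in place of the Flink environments so that it compiles on its own. The lazy getters mirror the null check shown in the getStreamExecutionEnvironment snippet earlier in this review; the remaining details are assumptions.

```java
final class BatchExecStub {
}

final class BatchTableStub {
}

final class StreamExecStub {
}

final class StreamTableStub {
}

/** Sketch: one constructor per supported way of building the environment. */
final class PairedEnvironment {
	private BatchExecStub batchEnv;
	private BatchTableStub batchTableEnv;
	private StreamExecStub streamEnv;
	private StreamTableStub streamTableEnv;

	/** Batch-only construction; the stream side is created lazily on first access. */
	PairedEnvironment(BatchExecStub batchEnv, BatchTableStub batchTableEnv) {
		this.batchEnv = batchEnv;
		this.batchTableEnv = batchTableEnv;
	}

	/** Stream-only construction; the batch side is created lazily on first access. */
	PairedEnvironment(StreamExecStub streamEnv, StreamTableStub streamTableEnv) {
		this.streamEnv = streamEnv;
		this.streamTableEnv = streamTableEnv;
	}

	BatchExecStub getBatchEnv() {
		if (batchEnv == null) {
			batchEnv = new BatchExecStub();
		}
		return batchEnv;
	}

	BatchTableStub getBatchTableEnv() {
		if (batchTableEnv == null) {
			batchTableEnv = new BatchTableStub();
		}
		return batchTableEnv;
	}

	StreamExecStub getStreamEnv() {
		if (streamEnv == null) {
			streamEnv = new StreamExecStub();
		}
		return streamEnv;
	}

	StreamTableStub getStreamTableEnv() {
		if (streamTableEnv == null) {
			streamTableEnv = new StreamTableStub();
		}
		return streamTableEnv;
	}
}
```

Callers no longer pass `null` placeholders for the side they do not use, which is what the four-argument form in the test above required.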

xuyang1706 · 2019-10-11T11:42:45Z

@xuyang1706 Thanks for updating the patch. I left a few more comments. Most of them are minor. One major thing is I am not sure whether we should completely remove fit(TableEnvironment, Table) and transform(TableEnvironment, Table) API, because they are completely bypassed in the current implementation given the introduction of MLEnvironmentFactory.

Thanks, @becketqin. I agree with replacing the fit(TableEnvironment, Table) method by relying on the MLEnvironmentFactory, and I suggest we discuss and change the API in another PR.

becketqin · 2019-10-12T21:26:18Z

@xuyang1706 Thanks for updating the patch. The patch LGTM overall. I created a PR with some minor improvements against your branch. Can you check if that makes sense? If so, feel free to merge it to your branch.

@walterddr Do you also want to take another look? I am thinking of merging the patch either on Sunday or Monday. Thanks.

walterddr

Thanks for the contribution @xuyang1706 and the reviews @becketqin. Overall it looks good to me. I only left a minor comment. Please kindly take a look.
+1 to merge to unblock future developments.

walterddr · 2019-10-12T22:01:09Z

flink-ml-parent/flink-ml-lib/src/main/java/org/apache/flink/ml/common/MLEnvironmentFactory.java

+	 * @param mlEnvId the id.
+	 * @return the removed MLEnvironment
+	 */
+	public static synchronized MLEnvironment remove(Long mlEnvId) {


When would this API be called? It seems like it is not necessary.

MLEnvironmentFactory can create multiple MLEnvironments. If some of them are only used for a limited time, users can call this API to release the resources. In most cases, users just use the default MLEnvironment, so this API is rarely used.

There is one minor bug here. The default MLEnvironment should never be removed. I'll fix this when checking in the code. Thanks.
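
One way to guard the default entry, sketched with a hypothetical `RemovableEnvRegistry` that stores plain `Object` values instead of MLEnvironment instances. Returning the default without removing it is an assumption made for illustration; throwing an exception would be another reasonable choice, and this thread does not show which behaviour was checked in.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a registry whose default entry can never be removed. */
final class RemovableEnvRegistry {
	static final Long DEFAULT_ID = 0L;

	private static final Map<Long, Object> ENVS = new HashMap<>();
	private static long nextId = 1L;

	static {
		ENVS.put(DEFAULT_ID, new Object());
	}

	private RemovableEnvRegistry() {
	}

	/** Registers an environment and returns its new id. */
	static synchronized Long register(Object env) {
		Long id = nextId++;
		ENVS.put(id, env);
		return id;
	}

	/** Returns the environment for the id, or null if none is registered. */
	static synchronized Object get(Long id) {
		return ENVS.get(id);
	}

	/** Removes and returns the environment for the id; the default entry is returned but kept. */
	static synchronized Object remove(Long id) {
		if (id == null) {
			throw new IllegalArgumentException("The environment id cannot be null.");
		}
		if (DEFAULT_ID.equals(id)) {
			// Assumption: silently keep the default rather than failing the call.
			return ENVS.get(DEFAULT_ID);
		}
		return ENVS.remove(id);
	}
}
```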

xuyang1706 · 2019-10-13T03:03:58Z

Thanks for the contribution @xuyang1706 and the reviews @becketqin. Overall it looks good to me. I only left a minor comment. Please kindly take a look.
+1 to merge to unblock future developments.

Thanks @walterddr. Yes, this API is rarely used, but if a user wants to release unused MLEnvironments in the process, this is the only API that can be called. Thus, I prefer to keep it.

xuyang1706 · 2019-10-13T03:44:43Z

@xuyang1706 Thanks for updating the patch. The patch LGTM overall. I created a PR with some minor improvements against your branch. Can you check if that makes sense? If so, feel free to merge it to your branch.

@walterddr Do you also want to take another look? I am thinking of merging the patch either on Sunday or Monday. Thanks.

Thanks for your help @becketqin. I have merged your "minor improvement PR" to my branch.

becketqin · 2019-10-14T07:58:18Z

Merged to master.

rmetzger added the review=description? label Jul 20, 2019

rmetzger added the component=Library/MachineLearning label Jul 20, 2019

xuyang1706 force-pushed the pipelinestages branch 3 times, most recently from 4f2afd3 to c4fc690 Compare August 2, 2019 08:57

becketqin reviewed Sep 6, 2019

ex00 reviewed Sep 6, 2019

[FLINK-13339][ml] Add an implementation of pipeline's api

2365f80

[hotfix] use varargs in linkFrom of operators

b1855d5

xuyang1706 force-pushed the pipelinestages branch from c4fc690 to b1855d5 Compare September 12, 2019 12:59

walterddr reviewed Sep 13, 2019

[hotfix] Use Base as the suffix of Pipeline Api's implements and add MLEnvironmentFactory that create the new MLEnvironment

54210a0

[hotfix] fix some typo

b089ac0

ex00 reviewed Sep 20, 2019

[hotfix] remove the sourceFrom in AlgoOperator and remove the check in getParams

67f96bd

xuyang1706 added 2 commits September 25, 2019 16:13

[hotfix] remove linkTo in BatchOperator&&StreamOperator

3e759a4

add more comments in MLEnvironmentFactory move the initialize of MLEnvironment to getNewMLEnvironmentId in MLEnvironmentFactory

[hotfix] remove the needless change

54c4c47

walterddr reviewed Sep 25, 2019

[hotfix] move the operators to package named operator

2983c5d

add more case add more comments in AlgoOperator

becketqin reviewed Sep 27, 2019

xuyang1706 added 3 commits September 28, 2019 17:40

[hotfix] refine the comments and rename the checkRequiredOpSize to checkMinOpSize

9991bfb

[hotfix] make the MLEnvironment immutable and add registerMLEnvironment in MLEnvironmentFactory

94ba2d0

[hotfix] rename the sourceFrom to fromTable and fix the ut

576ff18

[hotfix] fix some typos

16f1ec8

walterddr mentioned this pull request Sep 29, 2019

[FLINK-13596][ml] Add two utils for Table transformations #9373

Closed

becketqin reviewed Oct 10, 2019

[hotfix] add the constructor in MLEnvironment, change the preconditions in operators to static, add the getter of side-outputs's schema

acb4479

xuyang1706 and others added 2 commits October 12, 2019 21:07

[hotfix] remove the setDefault in MLEnvironmentFactory

a13f249

Some minor changes to the PR-9184

4544989

walterddr reviewed Oct 12, 2019

Merge pull request #3 from becketqin/minor_fix_pr_9184

7505153

Minor fix to PR 9184

becketqin closed this Oct 14, 2019
