[FLINK-13676][ml] Add class of Vector to Columns mapper #9413

xuyang1706 · 2019-08-09T15:33:00Z

What is the purpose of the change

VectorToColumnsMapper maps one vector to multiple columns. The number of columns equals the vector size.
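As a rough illustration of the idea described above (not the code in this PR), the following self-contained sketch flattens a dense vector into a fixed number of column values. The class name `VectorToColumnsSketch`, the method `denseToColumns`, and the choice to leave missing slots as null are assumptions made for the example; the real mapper works on Flink `Row` and vector types.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; it does not use any Flink classes.
class VectorToColumnsSketch {

    /**
     * Copies the first colSize entries of a dense vector into a row of
     * boxed doubles, one entry per output column. If the vector is shorter
     * than colSize, the remaining slots stay null (an assumption of this sketch).
     */
    public static Double[] denseToColumns(double[] vector, int colSize) {
        Double[] row = new Double[colSize];
        int n = Math.min(vector.length, colSize);
        for (int i = 0; i < n; i++) {
            row[i] = vector[i];
        }
        return row;
    }

    public static void main(String[] args) {
        // One vector with three entries becomes three column values.
        Double[] row = denseToColumns(new double[] {3.0, 4.0, 5.0}, 3);
        System.out.println(Arrays.toString(row));
    }
}
```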

Brief change log

Add VectorToColumnsMapper for the mapping operation.
Add VectorToColumnsParams for parameters.
Add VectorToColumnsMapperTest for unit test.

Verifying this change

This change added tests and can be verified as follows:

Run the unit tests in VectorToColumnsMapperTest and check that they pass.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (yes)
If yes, how is the feature documented? (JavaDocs)

flinkbot · 2019-08-09T15:35:11Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit e399e94 (Fri Sep 06 09:10:47 UTC 2019)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!
This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2019-08-09T15:45:37Z

CI report:

66c62a8 Azure: SUCCESS Unknown: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

This closes apache#9413.

walterddr

@xuyang1706 overall it looks good. I changed the RowCollector test bug in walterddr@2cd2906 to get Travis pass. I just left minor comments. please kindly take a look.

walterddr · 2020-01-15T01:32:51Z

...k-ml-lib/src/main/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapper.java

+ * under the License.
+ */
+
+package org.apache.flink.ml.common.dataproc.vector;


I was wondering if this is the right package to put this utility function in, maybe:
org.apache.flink.ml.common.utils.dataproc makes more sense?

This is especially confusing since we already have org.apache.flink.ml.common.mapper base class.

This is not a utility class, it's a data preprocessing mapper. So I think it shouldn't be in a util package.

I see. I am assuming this pre-processing is similar to, say feature normalization, or some specific pre-processing algorithm such as: Word Embedding algorithm?

yes, but this is a pretty simple one.

in this case I think the location is spot on :-) Thanks

walterddr · 2020-01-15T01:34:38Z

...k-ml-lib/src/main/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapper.java

+				if (indices[i] < colSize) {
+					result.setField(indices[i], values[i]);
+				} else {
+					break;


int i = 0;
while (i < indices.length && indices[i] < colSize) {
	result.setField(indices[i], values[i]);
	i++;
}

Also, do we always assume that sparse vector indices at or beyond colSize can be ignored?

Yes, this mapper flattens the idx-th vector column of the input row into the output row. Maybe log a warning when the vector is truncated? But I worry there may be too many such logs.

Hmm, that's probably right. I think explaining this in the JavaDoc might suffice.
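The truncation behavior discussed in this thread can be sketched in isolation as follows. This is a minimal example under stated assumptions, not the PR's implementation: the class `SparseFlattenSketch` and method `sparseToColumns` are hypothetical, the indices are assumed to be sorted in ascending order (which the early exit relies on), and unset columns default to 0.0 here only because a primitive array is used.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; it does not use any Flink classes.
class SparseFlattenSketch {

    /**
     * Writes each sparse entry into its column and silently ignores
     * entries whose index is at or beyond colSize. Assumes indices are
     * sorted ascending, so the loop can stop at the first index that is
     * out of range. The length check keeps the loop from running past
     * the end of the arrays when every index is in range.
     */
    public static double[] sparseToColumns(int[] indices, double[] values, int colSize) {
        double[] row = new double[colSize]; // unset columns stay 0.0 in this sketch
        for (int i = 0; i < indices.length && indices[i] < colSize; i++) {
            row[indices[i]] = values[i];
        }
        return row;
    }

    public static void main(String[] args) {
        // The entry at index 5 is dropped because only 3 columns are requested.
        double[] row = sparseToColumns(new int[] {0, 2, 5}, new double[] {1.0, 2.0, 9.0}, 3);
        System.out.println(Arrays.toString(row));
    }
}
```

Dropping out-of-range entries without logging matches the outcome of the discussion, where documenting the behavior in the JavaDoc was preferred over emitting warnings.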

walterddr · 2020-01-15T01:36:31Z

...-lib/src/test/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapperTest.java

+ */
+public class VectorToColumnsMapperTest {
+	@Test
+	public void test1() throws Exception {


needs more informative test naming.

Thanks, renamed the test case.

walterddr · 2020-01-15T01:36:37Z

...-lib/src/test/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapperTest.java

+	}
+
+	@Test
+	public void test2() throws Exception {


Thanks, renamed the test case.

walterddr · 2020-01-15T01:36:41Z

...-lib/src/test/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapperTest.java

+	}
+
+	@Test
+	public void test3() throws Exception {


Thanks, renamed the test case.

walterddr

Thanks for the feedback @qiuxiafei. I have some follow-ups, please let me know if they make sense. Otherwise the patch looks good to go.

walterddr · 2020-01-15T16:22:51Z

...k-ml-lib/src/main/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapper.java

+ * under the License.
+ */
+
+package org.apache.flink.ml.common.dataproc.vector;


I see. I am assuming this pre-processing is similar to, say feature normalization, or some specific pre-processing algorithm such as: Word Embedding algorithm?

walterddr · 2020-01-15T16:23:19Z

...k-ml-lib/src/main/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapper.java

+import org.apache.flink.types.Row;
+
+/**
+ * This mapper maps vector to table columns.


I think if this is a particular pre-processing algorithm/method, it needs better Javadoc.

Thanks, refined the JavaDoc

walterddr · 2020-01-15T16:24:37Z

...k-ml-lib/src/main/java/org/apache/flink/ml/common/dataproc/vector/VectorToColumnsMapper.java

+				if (indices[i] < colSize) {
+					result.setField(indices[i], values[i]);
+				} else {
+					break;


Hmm, that's probably right. I think explaining this in the JavaDoc might suffice.


xuyang1706 · 2020-01-16T09:55:09Z

@xuyang1706 overall it looks good. I changed the RowCollector test bug in walterddr@2cd2906 to get Travis pass. I just left minor comments. please kindly take a look.

Thanks @walterddr for your comments and discussion. I have refined the code, renamed the test cases and added more JavaDoc.

ex00 · 2020-01-16T11:50:04Z

...src/main/java/org/apache/flink/ml/operator/common/dataproc/vector/VectorToColumnsMapper.java

+	private int idx;
+	private OutputColsHelper outputColsHelper;
+
+	public VectorToColumnsMapper(TableSchema dataSchema, Params params) {


Could you please add a JavaDoc describing which parameters must be set to create a new object successfully?

walterddr

thanks for the quick turnaround @xuyang1706. I left some additional minor comments.

walterddr · 2020-01-16T15:48:42Z

...src/main/java/org/apache/flink/ml/operator/common/dataproc/vector/VectorToColumnsMapper.java

+import java.util.Arrays;
+
+/**
+ * This mapper maps vector to table columns, and the table is created with the first


Suggested change

- * This mapper maps vector to table columns, and the table is created with the first
+ * This is a data preprocessing function that transforms {@link Vector}s into {@link Table} columns.
+ *
+ * <p>Table is created with the first colSize value of the vector.
+ *
+ * <p>For sparse vector without given size, it will be treated as vector with infinite size.
+ * ...

walterddr · 2020-01-16T15:54:53Z

...src/main/java/org/apache/flink/ml/operator/common/dataproc/vector/VectorToColumnsMapper.java

+		idx = TableUtil.findColIndex(dataSchema.getFieldNames(), selectedColName);
+		Preconditions.checkArgument(idx >= 0, "Can not find column: " + selectedColName);
+		String[] outputColNames = this.params.get(VectorToColumnsParams.OUTPUT_COLS);
+		Preconditions.checkArgument(null != outputColNames,
+			"VectorToTable: outputColNames must set.");
+		this.colSize = outputColNames.length;
+		TypeInformation[] types = new TypeInformation[colSize];
+		Arrays.fill(types, Types.DOUBLE);
+		this.outputColsHelper = new OutputColsHelper(dataSchema, outputColNames, types,
+			this.params.get(VectorToColumnsParams.RESERVED_COLS));


Let's refactor this part out into a private static function, constructOutputColsHelper. I can see @ex00's concern that the constructor is too complex. In principle the constructor should be simple and easy to understand. One option is to write it as:

public VectorToColumnsMapper(TableSchema dataSchema, Params params) {
	this(dataSchema, params, constructOutputColsHelper(dataSchema, params));
}

public VectorToColumnsMapper(TableSchema dataSchema, Params params, OutputColsHelper outputColsHelper) {
	super(dataSchema, params);
	this.outputColsHelper = outputColsHelper;
}
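The constructor-delegation pattern suggested here can be tried out without Flink using the sketch below. Every type in it is a hypothetical stand-in: `OutputHelper` plays the role of OutputColsHelper, a plain `Map` replaces Params, and the key `"outputCols"` is invented for the example. It only shows that validation and helper construction can move into a private static method that is called in the `this(...)` argument list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; names and types are not Flink's.
class ConstructorDelegationSketch {

    /** Stand-in for OutputColsHelper; it only records the output column names. */
    static final class OutputHelper {
        final List<String> outputCols;

        OutputHelper(List<String> outputCols) {
            this.outputCols = outputCols;
        }
    }

    static final class Mapper {
        private final OutputHelper helper;
        private final int colSize;

        /** Public entry point: stays a one-liner by delegating. */
        Mapper(Map<String, String[]> params) {
            this(buildHelper(params));
        }

        /** Second constructor takes the prepared helper, which also eases testing. */
        Mapper(OutputHelper helper) {
            this.helper = helper;
            this.colSize = helper.outputCols.size();
        }

        /** Static, so it may be called before the object is initialized. */
        private static OutputHelper buildHelper(Map<String, String[]> params) {
            String[] names = params.get("outputCols");
            if (names == null) {
                throw new IllegalArgumentException("outputCols must be set");
            }
            return new OutputHelper(Arrays.asList(names));
        }

        int colSize() {
            return colSize;
        }
    }
}
```

One design consequence worth noting: any field derived from the parameters, such as the column count, has to be set in the second constructor or derived from the helper, since the first constructor no longer does any work of its own.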

walterddr · 2020-01-16T15:55:41Z

...test/java/org/apache/flink/ml/operator/common/dataproc/vector/VectorToColumnsMapperTest.java

+/**
+ * Unit test for VectorToColumnsMapper.
+ */
+public class VectorToColumnsMapperTest {


Test case looks good! thanks for refining @xuyang1706

rmetzger added the review=description? label Aug 9, 2019

rmetzger added the component=Library/MachineLearning label Aug 9, 2019

xuyang1706 mentioned this pull request Oct 9, 2019

[FLINK-13577][ml] Add an util class to build result row and generate … #9355

Closed

walterddr pushed a commit to walterddr/flink that referenced this pull request Nov 2, 2019

[FLINK-13676][ml] Add class of Vector to Columns mapper

d1424a2

This closes apache#9413.

walterddr reviewed Jan 15, 2020


xuyang1706 added 3 commits January 16, 2020 16:57

[FLINK-13676][ml] Add class of Vector to Columns mapper

3e9a4ea

[hotfix] use VectorTypes in VectorToColumnsMapper

de45129

[hotfix] check the scale of indices when the vector size is given and…

66c62a8

… add more java doc

xuyang1706 force-pushed the vectortocolumns branch from e399e94 to 66c62a8 Compare January 16, 2020 09:47

ex00 reviewed Jan 16, 2020


walterddr reviewed Jan 16, 2020


zentol closed this May 17, 2022

