-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-13676][ml] Add class of Vector to Columns mapper #9413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit e399e94 (Fri Sep 06 09:10:47 UTC 2019) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xuyang1706 overall it looks good. I changed the RowCollector
test bug in walterddr@2cd2906 to get Travis pass. I just left minor comments. please kindly take a look.
* under the License. | ||
*/ | ||
|
||
package org.apache.flink.ml.common.dataproc.vector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if this is the right package to put this utility function in, maybe:
org.apache.flink.ml.common.utils.dataproc
makes more sense?
This is especially confusing since we already have org.apache.flink.ml.common.mapper
base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a utility class, it's a data preprocess mapper. So I think it shouldn't be in a util package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I am assuming this pre-processing is similar to, say feature normalization, or some specific pre-processing algorithm such as: Word Embedding algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but this is a pretty simple one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case I think the location is spot on :-) Thanks
if (indices[i] < colSize) { | ||
result.setField(indices[i], values[i]); | ||
} else { | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int i = 0;
while (indices[i] < colSize) {
result.setField(indices[i], values[i]);
I++;
}
also, are we always assume users' sparse vector indices larger than colSize can be ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this mapper flattens the idx
th vector column in the input row to the output row. Maybe some warning log if truncated? But I just worry there maybe too much such logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. that's probably right. I think explaining this in the JavaDoc might be suffice.
*/ | ||
public class VectorToColumnsMapperTest { | ||
@Test | ||
public void test1() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs more informative test naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, renamed the test case.
} | ||
|
||
@Test | ||
public void test2() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, renamed the test case.
} | ||
|
||
@Test | ||
public void test3() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, renamed the test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @qiuxiafei . I have some follow ups, please let me know if they make sense. otherwise the patch looks good to go.
* under the License. | ||
*/ | ||
|
||
package org.apache.flink.ml.common.dataproc.vector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I am assuming this pre-processing is similar to, say feature normalization, or some specific pre-processing algorithm such as: Word Embedding algorithm?
import org.apache.flink.types.Row; | ||
|
||
/** | ||
* This mapper maps vector to table columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if this is a particular pre-processing algorithm/method, it needs better Javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, refined the JavaDoc
if (indices[i] < colSize) { | ||
result.setField(indices[i], values[i]); | ||
} else { | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. that's probably right. I think explaining this in the JavaDoc might be suffice.
e399e94
to
66c62a8
Compare
Thanks @walterddr for your comments and discussion. I have refined the code, renamed the test cases and added more JavaDoc. |
private int idx; | ||
private OutputColsHelper outputColsHelper; | ||
|
||
public VectorToColumnsMapper(TableSchema dataSchema, Params params) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you let a java doc with description which parameters should set successful creating new object, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the quick turnaround @xuyang1706. I left some additional minor comments.
import java.util.Arrays; | ||
|
||
/** | ||
* This mapper maps vector to table columns, and the table is created with the first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* This mapper maps vector to table columns, and the table is created with the first | |
* This is a data preprocessing function that transforms {@link Vector}s into {@link Table} columns. | |
* | |
* <p>Table is created with the first colSize value of the vector. | |
* | |
* <p>For sparse vector without given size, it will be treated as vector with infinite size. | |
* ... |
idx = TableUtil.findColIndex(dataSchema.getFieldNames(), selectedColName); | ||
Preconditions.checkArgument(idx >= 0, "Can not find column: " + selectedColName); | ||
String[] outputColNames = this.params.get(VectorToColumnsParams.OUTPUT_COLS); | ||
Preconditions.checkArgument(null != outputColNames, | ||
"VectorToTable: outputColNames must set."); | ||
this.colSize = outputColNames.length; | ||
TypeInformation[] types = new TypeInformation[colSize]; | ||
Arrays.fill(types, Types.DOUBLE); | ||
this.outputColsHelper = new OutputColsHelper(dataSchema, outputColNames, types, | ||
this.params.get(VectorToColumnsParams.RESERVED_COLS)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's refactor this part out as a private static function: constructOutputColsHelper
? I can see @ex00 's concerns that the constructor is way too complex. In principle the constructor should be simpler and easy to understand. One example is to have this as:
public VectorToColumnsMapper(TableSchema dataSchema, Params params) {
this(dataSchema, params, constructOutputColsHelper(dataSchema, params));
}
public VectorToColumnsMapper(TableSchema dataSchema, Params params, OutputColsHelper outputColsHelper) {
super(dataSchema, params);
this.outputColsHelper = outputColsHelper;
}
/** | ||
* Unit test for VectorToColumnsMapper. | ||
*/ | ||
public class VectorToColumnsMapperTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test case looks good! thanks for refining @xuyang1706
What is the purpose of the change
VectorToColumnsMapper maps one vector to many column objects. The number of columns equals to vector size.
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (no)Documentation