
Conversation

xuyang1706
Contributor

What is the purpose of the change

VectorToColumnsMapper maps one vector to many column objects. The number of columns equals the vector size.
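A minimal plain-Java sketch of this idea (hypothetical code, not the actual Flink implementation; the class and method names here are invented for illustration):

```java
// Hypothetical sketch of the mapping idea: each entry of a dense vector
// becomes one output column value, so the number of columns equals the
// vector size. Not the Flink implementation; names are invented.
public class VectorToColumnsSketch {

    /** Flattens a dense vector into one Object per column. */
    static Object[] toColumns(double[] vector) {
        Object[] columns = new Object[vector.length];
        for (int i = 0; i < vector.length; i++) {
            columns[i] = vector[i];
        }
        return columns;
    }

    public static void main(String[] args) {
        Object[] cols = toColumns(new double[] {0.5, 1.5, 2.5});
        // Three vector entries yield three columns.
        System.out.println(cols.length);
    }
}
```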

Brief change log

  • Add VectorToColumnsMapper for mapping Op.
  • Add VectorToColumnsParams for parameters.
  • Add VectorToColumnsMapperTest for unit test.

Verifying this change

This change added tests and can be verified as follows:

  • Run the added unit tests and verify that they pass.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@flinkbot
Collaborator

flinkbot commented Aug 9, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit e399e94 (Fri Sep 06 09:10:47 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Aug 9, 2019

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Contributor

@walterddr walterddr left a comment

@xuyang1706 overall it looks good. I fixed the RowCollector test bug in walterddr@2cd2906 to get Travis to pass. I just left minor comments. Please take a look.

* under the License.
*/

package org.apache.flink.ml.common.dataproc.vector;
Contributor

I was wondering if this is the right package to put this utility function in, maybe:
org.apache.flink.ml.common.utils.dataproc makes more sense?

This is especially confusing since we already have org.apache.flink.ml.common.mapper base class.


This is not a utility class; it's a data preprocessing mapper, so I think it shouldn't be in a util package.

Contributor

I see. I am assuming this pre-processing is similar to, say, feature normalization, or some specific pre-processing algorithm such as word embedding?


Yes, but this is a pretty simple one.

Contributor

In this case I think the location is spot on :-) Thanks

if (indices[i] < colSize) {
result.setField(indices[i], values[i]);
} else {
break;
Contributor

@walterddr walterddr Jan 15, 2020

int i = 0;
while (i < indices.length && indices[i] < colSize) {
  result.setField(indices[i], values[i]);
  i++;
}

also, do we always assume that sparse vector indices larger than colSize can be ignored?


Yes, this mapper flattens the idxth vector column of the input row into the output row. Maybe log a warning when truncation happens? But I worry there may be too many such logs.

Contributor

hmm. that's probably right. I think explaining this in the JavaDoc should suffice.
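The truncation rule discussed in this thread can be sketched in plain Java (a hypothetical illustration with invented names; it assumes the sparse indices are sorted ascending, as the loop in the patch does):

```java
import java.util.Arrays;

// Hypothetical sketch of the truncation behavior under discussion: sparse
// entries whose index is >= colSize are silently dropped. Assumes the
// indices array is sorted ascending. Names here are invented.
public class SparseTruncationSketch {

    static double[] sparseToColumns(int[] indices, double[] values, int colSize) {
        double[] result = new double[colSize]; // absent entries default to 0.0
        int i = 0;
        while (i < indices.length && indices[i] < colSize) {
            result[indices[i]] = values[i];
            i++;
        }
        return result;
    }

    public static void main(String[] args) {
        // The entry at index 5 lies beyond colSize = 3 and is ignored.
        double[] cols = sparseToColumns(
            new int[] {0, 2, 5}, new double[] {1.0, 2.0, 3.0}, 3);
        System.out.println(Arrays.toString(cols)); // prints [1.0, 0.0, 2.0]
    }
}
```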

*/
public class VectorToColumnsMapperTest {
@Test
public void test1() throws Exception {
Contributor

needs more informative test naming.

Contributor Author

Thanks, renamed the test case.

}

@Test
public void test2() throws Exception {
Contributor

ditto

Contributor Author

Thanks, renamed the test case.

}

@Test
public void test3() throws Exception {
Contributor

ditto

Contributor Author

Thanks, renamed the test case.

Contributor

@walterddr walterddr left a comment

Thanks for the feedback @qiuxiafei. I have some follow-ups; please let me know if they make sense. Otherwise the patch looks good to go.

import org.apache.flink.types.Row;

/**
* This mapper maps vector to table columns.
Contributor

I think if this is a particular pre-processing algorithm/method, it needs better Javadoc.

Contributor Author

Thanks, refined the JavaDoc


@xuyang1706
Contributor Author

@xuyang1706 overall it looks good. I fixed the RowCollector test bug in walterddr@2cd2906 to get Travis to pass. I just left minor comments. Please take a look.

Thanks @walterddr for your comments and discussion. I have refined the code, renamed the test cases and added more JavaDoc.

private int idx;
private OutputColsHelper outputColsHelper;

public VectorToColumnsMapper(TableSchema dataSchema, Params params) {

Could you add a JavaDoc describing which parameters must be set to successfully create a new object, please?

Contributor

@walterddr walterddr left a comment

Thanks for the quick turnaround @xuyang1706. I left some additional minor comments.

import java.util.Arrays;

/**
* This mapper maps vector to table columns, and the table is created with the first
Contributor

Suggested change
* This mapper maps vector to table columns, and the table is created with the first
* This is a data preprocessing function that transforms {@link Vector}s into {@link Table} columns.
*
* <p>Table is created with the first colSize value of the vector.
*
* <p>For sparse vector without given size, it will be treated as vector with infinite size.
* ...

Comment on lines +54 to +63
idx = TableUtil.findColIndex(dataSchema.getFieldNames(), selectedColName);
Preconditions.checkArgument(idx >= 0, "Can not find column: " + selectedColName);
String[] outputColNames = this.params.get(VectorToColumnsParams.OUTPUT_COLS);
Preconditions.checkArgument(null != outputColNames,
"VectorToTable: outputColNames must set.");
this.colSize = outputColNames.length;
TypeInformation[] types = new TypeInformation[colSize];
Arrays.fill(types, Types.DOUBLE);
this.outputColsHelper = new OutputColsHelper(dataSchema, outputColNames, types,
this.params.get(VectorToColumnsParams.RESERVED_COLS));
Contributor

Let's refactor this part out as a private static function: constructOutputColsHelper? I can see @ex00's concern that the constructor is way too complex. In principle the constructor should be simpler and easier to understand. One example is to have this as:

public VectorToColumnsMapper(TableSchema dataSchema, Params params) {
  this(dataSchema, params, constructOutputColsHelper(dataSchema, params));
}

public VectorToColumnsMapper(TableSchema dataSchema, Params params, OutputColsHelper outputColsHelper) {
  super(dataSchema, params);
  this.outputColsHelper = outputColsHelper;
}
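For illustration, the suggested delegation can be sketched generically in plain Java (placeholder types invented here, since only fragments of the Flink classes appear in this review):

```java
// Generic sketch of the suggested refactoring: the public constructor
// delegates to a private one, and the complex field construction moves
// into a private static factory. All type names here are placeholders.
public class DelegatingConstructorSketch {

    /** Stand-in for OutputColsHelper: holds derived output metadata. */
    static class Helper {
        final int colSize;
        Helper(int colSize) { this.colSize = colSize; }
    }

    private final Helper helper;

    /** The public constructor stays trivial and delegates. */
    public DelegatingConstructorSketch(String[] outputColNames) {
        this(constructHelper(outputColNames));
    }

    private DelegatingConstructorSketch(Helper helper) {
        this.helper = helper;
    }

    /** Validation and wiring live here, not in the constructor body. */
    private static Helper constructHelper(String[] outputColNames) {
        if (outputColNames == null) {
            throw new IllegalArgumentException("outputColNames must be set.");
        }
        return new Helper(outputColNames.length);
    }

    int colSize() {
        return helper.colSize;
    }

    public static void main(String[] args) {
        DelegatingConstructorSketch mapper =
            new DelegatingConstructorSketch(new String[] {"f0", "f1", "f2"});
        System.out.println(mapper.colSize()); // prints 3
    }
}
```

The design benefit is that the public constructor reads as a single delegation, while all precondition checks sit in one named helper.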

/**
* Unit test for VectorToColumnsMapper.
*/
public class VectorToColumnsMapperTest {
Contributor

Test case looks good! thanks for refining @xuyang1706

@zentol zentol closed this May 17, 2022