Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5991] Fix RDDCustomColumnsSortPartitioner's RepartitionRecords #8300

Merged
merged 7 commits into from
Apr 16, 2023

Conversation

wuwenchi
Copy link
Contributor

@wuwenchi wuwenchi commented Mar 27, 2023

Change Logs

When obtaining multiple specified fields, the return value is actually an array, but here it is directly obtained as an object:
Object recordValue = record.getColumnValues(...)

So it is converted into a string later: StringUtils.objToString(recordValue),
in fact, the address of the previous array is obtained, resulting in a sorting error.

Impact

Modify the sort function.

Risk level (write none, low medium or high below)

low.

Documentation Update

none

Contributor's checklist

  • [ x ] Read through contributor's guide
  • [ x ] Change Logs and Impact were stated clearly
  • [ x ] Adequate tests were added if applicable
  • [ x ] CI passed

@wuwenchi
Copy link
Contributor Author

wuwenchi commented Apr 6, 2023

@danny0405 Can you help review it? Thanks!

@danny0405
Copy link
Contributor

Thanks for the contribution, I have reviewed and attached a patch which is based on the latest master:
5991.patch.zip

@danny0405 danny0405 self-assigned this Apr 10, 2023
@danny0405 danny0405 added spark Issues related to spark writer-core Issues relating to core transactions/write actions labels Apr 10, 2023
@wuwenchi
Copy link
Contributor Author

Thanks for the contribution, I have reviewed and attached a patch which is based on the latest master: 5991.patch.zip

@danny0405
In the patch, put multiple fields together and compare them, so this may happen:
image
image

So I think it may be more reasonable to use a list, put each field value in the list, and then use the comparison function of each item alone.

return StringUtils.objToString(recordValue);
}
Object[] columnValues = record.getColumnValues(schema.get(), sortColumns, consistentLogicalTimestampEnabled);
return FlatLists.ofComparableArray(columnValues);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the NULL still obeying the null_first order? Could the nulls throw exception here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not asking you to throw exception for nulls here, the original code has the NULL_FIRST semantics, that means a null is always greater than any other non_nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No exception will be thrown here, it's null_first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default behavior is null_last, the original comment is wrong, it returned empty string for nulls, empty string should be always smaller than non empty strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fix JavaCustomColumnsSortPartitioner too.

@danny0405 danny0405 added the priority:critical production down; pipelines stalled; Need help asap. label Apr 12, 2023
@wuwenchi wuwenchi requested a review from danny0405 April 12, 2023 07:19
return values1.toString().compareTo(values2.toString());
FlatLists.ComparableList<Comparable> cmp1 = FlatLists.ofComparableArray(
HoodieAvroUtils.getRecordColumnValues((HoodieAvroRecord) o1, sortColumnNames, schema, consistentLogicalTimestampEnabled)
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmp1 -> values1, cmp2 -> values2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}, (o1, o2) -> {
FlatLists.ComparableList obj1 = FlatLists.ofComparableArray(o1.toArray());
FlatLists.ComparableList obj2 = FlatLists.ofComparableArray(o2.toArray());
return obj1.compareTo(obj2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obj1 -> values1, obj2 -> values2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit 2759878 into apache:master Apr 16, 2023
15 checks passed
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
…pache#8300)

The concatenation of multiple field strings could introudce ambiguity, for e.g, the comparison of '21,3' and '2,13' does not return the correct order. This patch fixes the comparison in form of ComparableList.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
…pache#8300)

The concatenation of multiple field strings could introudce ambiguity, for e.g, the comparison of '21,3' and '2,13' does not return the correct order. This patch fixes the comparison in form of ComparableList.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:critical production down; pipelines stalled; Need help asap. spark Issues related to spark writer-core Issues relating to core transactions/write actions
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants