[MINOR] Make ordering deterministic in small file selection #11008
base: master
@hudi-bot run azure
if (table.getIndex().canIndexLogFiles()) {
  return table.getSliceView()
      .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false)
      .filter(this::isSmallFile)
      .sorted(comparator)
      .collect(Collectors.toList());
Should we just fix the tests? Do we have gains for the sort in production?
I think it makes sense to prefer the smallest files first as candidates, to minimize IO.
Fine, just fix all the test failures and let's see what use cases are affected.
@danny0405 the failing test is flaky. It tries to test a Spark exception, but it is non-deterministic.
Looks like the test failure will take some time to fix. Is it easy to make the tests deterministic?
I've been trying, but I do not understand why these tests are coupled to small file handling. It seems strange for the tests to rely on that feature in order to test something in delta streamer, for example.
Change Logs
Makes the ordering deterministic to get consistent results and avoid any issues in tests
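A minimal sketch of what a deterministic, smallest-first ordering could look like. The PR excerpt does not show how `comparator` is defined, so everything here is an assumption for illustration: `FileInfo`, `SMALLEST_FIRST`, and `selectSmallFiles` are hypothetical names, not Hudi classes or methods. The idea is to sort by size and break ties on the file ID, so that equal-sized files always come back in the same order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SmallFileOrdering {

  // Hypothetical stand-in for a file slice; not a Hudi class.
  static final class FileInfo {
    final String fileId;
    final long sizeBytes;

    FileInfo(String fileId, long sizeBytes) {
      this.fileId = fileId;
      this.sizeBytes = sizeBytes;
    }
  }

  // Smallest files first; ties are broken by file ID so the order is total
  // and does not depend on the order in which files were listed.
  static final Comparator<FileInfo> SMALLEST_FIRST =
      Comparator.comparingLong((FileInfo f) -> f.sizeBytes)
          .thenComparing(f -> f.fileId);

  // Keeps files below the (assumed) size limit and returns their IDs in sorted order.
  static List<String> selectSmallFiles(List<FileInfo> files, long smallFileLimitBytes) {
    return files.stream()
        .filter(f -> f.sizeBytes < smallFileLimitBytes)
        .sorted(SMALLEST_FIRST)
        .map(f -> f.fileId)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<FileInfo> files = Arrays.asList(
        new FileInfo("file-c", 40L),
        new FileInfo("file-a", 40L),
        new FileInfo("file-b", 10L),
        new FileInfo("file-d", 500L));
    // prints [file-b, file-a, file-c]
    System.out.println(selectSmallFiles(files, 100L));
  }
}
```

Without the tie-breaker, two files of equal size could still appear in either order depending on the input, which is the kind of non-determinism that makes tests flaky.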
Impact
Small file selection is now consistent across runs (mostly helps tests be consistent).
Risk level (write none, low, medium or high below)
None
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".