Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINOR] Make ordering deterministic in small file selection #11008

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

the-other-tim-brown
Copy link
Contributor

Change Logs

Makes the ordering deterministic to get consistent results and avoid any issues in tests

Impact

Small file selection is consistent (mostly helps tests be consistent)

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@the-other-tim-brown
Copy link
Contributor Author

@hudi-bot run azure

if (table.getIndex().canIndexLogFiles()) {
return table.getSliceView()
.getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false)
.filter(this::isSmallFile)
.sorted(comparator)
.collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just fix the tests? Do we have gains for the sort in production?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to prefer the smallest files first as candidates to minimize IO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine, just fix all the test failures and let's see what use cases are affected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 the test failing is flakey. It is try to test some spark exception but it is non-deterministic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the test failure takes some time to fix, is it easy to make the tests deterministic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying but do not understand why these tests are coupled to small file handling. I think that the approach to testing is a bit strange to rely on that feature to test something in delta streamer for example.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:S PR with lines of changes in (10, 100]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants