[HUDI-2950] Addressing performance traps in Bulk Insert/Layout Optimization #4234

Merged on Jan 11, 2022 (14 commits)


alexeykudinkin
Contributor


What is the purpose of the pull request

NOTE: This is stacked on top of #4106

This PR addresses performance traps found while benchmarking the Layout Optimization implementation.

Brief change log

  • Reduced small-object churn caused by Scala/Java conversions by re-using RowFactory and passing Object[]
  • Replaced OverwriteAvroPayloadRecord with RewriteRecordPayload to avoid an unnecessary Avro ser/de loop
  • Added PathCachingFileName to avoid recomputing substrings every time the file name is fetched
  • Drastically reduced the size of the ArrayDeque allocated by ObjectSizeCalculator
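
The last item can be illustrated with a minimal sketch. It assumes the saving comes from giving the traversal worklist a small initial capacity instead of a large one; `PendingQueueSketch`, `Node`, and the capacities used are hypothetical and are not taken from `ObjectSizeCalculator` itself.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PendingQueueSketch {

    // Hypothetical node type standing in for "objects still to be visited".
    static final class Node {
        final Node[] children;

        Node(Node... children) {
            this.children = children;
        }
    }

    // Counts the nodes of a tree using an explicit worklist.
    // ArrayDeque grows on demand, so a small initial capacity only costs
    // a few resizes on deep graphs, while a large one is paid on every call.
    static int countNodes(Node root, int initialCapacity) {
        Deque<Node> pending = new ArrayDeque<>(initialCapacity);
        pending.push(root);
        int count = 0;
        while (!pending.isEmpty()) {
            Node current = pending.pop();
            count++;
            for (Node child : current.children) {
                pending.push(child);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Node tree = new Node(new Node(), new Node(new Node()));
        // Same result either way; only the up-front allocation differs.
        System.out.println(countNodes(tree, 16));
        System.out.println(countNodes(tree, 64 * 1024));
    }
}
```

When such a calculator is invoked once per record, the up-front array allocation is repeated for every call, which is why the initial capacity matters more than it would for a long-lived queue.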

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.
This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@vinothchandar
Member

@alexeykudinkin Is this ready to be reviewed?

@alexeykudinkin
Contributor Author

@vinothchandar yes, just need to rebase on master since this one was stacked

@alexeykudinkin alexeykudinkin changed the title from "[HUDI-2950][Stacked 4106] Addressing performance traps in Bulk Insert/Layout Optimization" to "[HUDI-2950] Addressing performance traps in Bulk Insert/Layout Optimization" on Dec 20, 2021
@vinothchandar vinothchandar self-assigned this Dec 25, 2021
@vinothchandar vinothchandar moved this from Ready for Review to Nearing Landing in PR Tracker Board Dec 25, 2021
Member

@vinothchandar vinothchandar left a comment

@xiarixiaoyao do you want to review this once?

*/
public class PathCachingFileName extends Path {

// NOTE: volatile keyword is redundant here and put mostly for reader notice, since all
Member

remove this comment? Don't want to overload the code with these. I'd expect people to understand volatile

Contributor Author

But its goal is the opposite: volatile isn't strictly required here, and I added the comment for anyone who might decide it isn't needed and remove it. The purpose of volatile here is to hint to the reader that reads and writes of the ref don't need to be synchronized.
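
A minimal sketch of the lazy-caching idea discussed in this thread, not the code from the PR: `SimplePath` is a hypothetical stand-in for Hadoop's `Path`, and `NameCachingPath` is an illustrative name. It assumes the cached value is an immutable `String` and that recomputing it is idempotent, which is what would make an unsynchronized read/write of the reference harmless.

```java
public class CachedNameSketch {

    // Hypothetical stand-in for a path type whose getName()
    // extracts a substring on every call.
    static class SimplePath {
        private final String full;

        SimplePath(String full) {
            this.full = full;
        }

        String getName() {
            return full.substring(full.lastIndexOf('/') + 1);
        }
    }

    static class NameCachingPath extends SimplePath {
        // Assumption: volatile is kept as a hint to the reader. Racing threads
        // would at worst each compute an equal String and overwrite the field.
        private volatile String cachedName;

        NameCachingPath(String full) {
            super(full);
        }

        @Override
        String getName() {
            String name = cachedName; // read the field once
            if (name == null) {
                name = super.getName();
                cachedName = name;
            }
            return name;
        }
    }

    public static void main(String[] args) {
        SimplePath path = new NameCachingPath("/tmp/table/partition/file.parquet");
        System.out.println(path.getName());                   // file.parquet
        System.out.println(path.getName() == path.getName()); // true: same cached instance
    }
}
```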

/**
* NOTE: This class is thread-safe
*/
public class PathCachingFileName extends Path {
Member

rename: FileNameCachingPath or NameCachedPath, ending with Path

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@hudi-bot

hudi-bot commented Jan 7, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@vinothchandar vinothchandar merged commit f1e3762 into apache:master Jan 11, 2022
PR Tracker Board automation moved this from Nearing Landing to Done Jan 11, 2022
@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
…zation (apache#4234)

* Cleaned up Z-curve/Hilbert ordering seqs:
  - Streamlined flow
  - Removed unnecessary operations (double-mapping, boxing, etc)
Updated `CollectionUtils::combine` to avoid `ArrayList` resizing

* Tidying up

* Reducing small objects churn due to Scala/Java conversions by re-using `RowFactory`, passing `Object[]`

* Fixing name resolution (disambiguation overloads)

* `lint`

* Replaced `OverwriteAvroPayloadRecord` w/ `RewriteRecordPayload` to avoid unnecessary Avro ser/de loop

* Added `PathCachingFileName` to avoid fetching substrings every time file-name is fetched;
Inject `PathCachingFileName` into `HoodieWrapperFileSystem.convertPathWithScheme`

* Drastically reducing size of the `ArrayDeque` allocated by `ObjectSizeCalculator`

* XXX

* Missing license

* Fixed refs (after rebase)

* Fixing compilation failure in Scala 2.11

* `PathCachingFileName` > `FileNameCachingPath`

* Tidying up
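
As an illustration of the `CollectionUtils::combine` item in the commit message above, here is a minimal sketch of combining two lists without intermediate resizing. The signature is an assumption made for the example and may not match Hudi's actual `CollectionUtils`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineSketch {

    // Hypothetical signature. Pre-sizing the backing array to the final
    // element count means addAll() never has to grow and copy it.
    static <E> List<E> combine(List<E> one, List<E> another) {
        ArrayList<E> combined = new ArrayList<>(one.size() + another.size());
        combined.addAll(one);
        combined.addAll(another);
        return combined;
    }

    public static void main(String[] args) {
        List<Integer> merged = combine(Arrays.asList(1, 2), Arrays.asList(3));
        System.out.println(merged); // [1, 2, 3]
    }
}
```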
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022