[HUDI-1554] Introduced buffering for streams in HUDI.#2496
[HUDI-1554] Introduced buffering for streams in HUDI.#2496prashantwason wants to merge 1 commit intoapache:masterfrom
Conversation
|
@n3nash Please review as this may provide benefits for HDFS workloads. |
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/TimedFSInputStream.java
Outdated
Show resolved
Hide resolved
n3nash
left a comment
There was a problem hiding this comment.
@prashantwason left some comments
Codecov Report
@@ Coverage Diff @@
## master #2496 +/- ##
============================================
+ Coverage 50.28% 50.77% +0.48%
- Complexity 3120 3182 +62
============================================
Files 430 436 +6
Lines 19565 19907 +342
Branches 2004 2041 +37
============================================
+ Hits 9838 10107 +269
- Misses 8924 8979 +55
- Partials 803 821 +18
Flags with carried forward coverage won't be shown. Click here to find out more. |
vinothchandar
left a comment
There was a problem hiding this comment.
High level question , should we always buffer or make this configurable for HDFS only?
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
Outdated
Show resolved
Hide resolved
5b58495 to
5a928ec
Compare
I don't have idea about other file systems and their inherent buffering. You can decide. I did not see an easy way to restrict this as HoodieWrapperFileSystem currently does not take any properties. |
f6f692d to
b43ddd3
Compare
|
cc @umehrot2 would this additional buffering pose inefficiencies for S3 FileSystem? TL;DR HDFS's |
vinothchandar
left a comment
There was a problem hiding this comment.
LGTM. Minor comments. I think the 16MB buffer is okay to do regardless of DFS.
On passing configs, the way I can think of is to transfer the values from writeConfig to the hadoop configuration object. We should not make this layer aware of HoodieWriteConfig etc. For now, these constants may be ok.
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
|
Once we fix CI and the minor stuff, we can land |
b43ddd3 to
f070eb9
Compare
Implemented this. @vinothchandar PTAL |
f070eb9 to
82387c6
Compare
|
@n3nash I have implemented the on/off config. PTAL and approve. |
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
n3nash
left a comment
There was a problem hiding this comment.
@prashantwason Thanks for adding the configurable option. This will help to rollout this feature with confidence that we can turn it off if we see issues. Left 1 comment, rest LGTM can merge once addressed.
nsivabalan
left a comment
There was a problem hiding this comment.
Hey @prashantwason : I don't have a lot of knowledge on buffering. Can you please link some references to understandbuffering in fileSystem if you have some. Also, if you have come across any links that talks about when to enable buffering and when to not enable, would be good as well. Will benefit folks like me who are trying to understand how buffering works and will also benefit users who are looking to decide whether to enable or not.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
can we please try to be uniform. for getter methods, we have suffixed "ing" for buffer, but here we haven't. was it a conscious decision?
There was a problem hiding this comment.
Buffering is a verb. So when asking "is buffering enabled" I have used the ing form. The size of the buffer is in bytes. So for getting the size of the buffer, I have used "buffer".
Buffering is enabled/disabled. Buffering does not have a size. Buffer has a size.
nsivabalan
left a comment
There was a problem hiding this comment.
A high level question.
Do we need to enable metrics FS(time aware, size aware) as well by default whenever buffering is enabled. Why not we make this on configurable too. Wondering if there will be any overhead, as we might be measuring metrics for very read/write calls to the FileSystem. Please correct me if my understanding is wrong. Users may not be interested in these metrics unless they want to debug something IMO. Let me know your thoughts.
There was a problem hiding this comment.
Do we have unit tests for this FS?
There was a problem hiding this comment.
This is being tested by all the unit tests which read data.
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/TimedSizeAwareOutputStream.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieEngineContext.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Why still name the class TimedFSInputStream based on the fact that the written bytes size is also recorded ?
There was a problem hiding this comment.
This class does not record the written bytes.
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
Outdated
Show resolved
Hide resolved
There are two parts to metrics in HUDI:
Metrics within HUDI are implemented using Registry which simply maintains the key-value metric pairs in memory. Each metric itself is a AtomicLong held in a in-memory Hash-map. Therefore the overhead of incrementing a metrics is:
So this should be negligible overhead on modern processors unless we are maintain millions of metrics. I feel the checks of metrics enable everywhere (if-metrics-enabled-then-do-something) tend to make the code ugly and they dont provide any performance benefits. |
1d4120a to
113d4b1
Compare
2e1a1be to
b03b269
Compare
|
@prashantwason can we get the PR to pass tests? I can take a final pass for landing. it'd be good to get this in |
There was a problem hiding this comment.
why this change for this PR?
Input and Output streams created in HUDI through calls to HoodieWrapperFileSystem do not include any buffering unless the underlying file system implements buffering. This patch introduces buffering at the HoodieWrapperFileSystem level so that all types of reads and writes benefit from buffering. The buffering can be controlled by the following properties: hoodie.fs.io.buffer.enabled (default true) enable/disable buffering hoodie.fs.io.buffer.data.min.size (default 16MB) Minimum buffer size of data files and log files which are generally large in size hoodie.fs.io.buffer.min.size (default 1MB) Minimum buffer size of non-data files which are generally smaller in size
b03b269 to
ba72d3e
Compare
|
@prashantwason I rebased this against master. still have some test failures. could you please take a look, so we can land this |
|
@prashantwason ping! |
|
@vinothchandar With a re-test on HDFS, I have been able to verify that this patch reduces the total number of API calls but did not find any significant difference in performance. Were you able to test this on S3? My main aim was to improve performance so cannot prove it right now. Should we mark this WIP for now till I can get more tests done? |
|
I have not been able to test this on S3. let me pick it up later next week. |
|
cc @nsivabalan @codope do any of you have cycles to test this PR out on top of s3 and see if any perf improvements happen (My guess is no). |
|
@vinothchandar @nsivabalan @xushiyan if I understand correctly based on the discussion, this PR is ready to land after fixing the tests. The performance test on S3 is a plus. Or could we close this PR for now given other improvements like metadata table, record-level index, and log compaction? |
|
closing this due to inactivity.. reopen if interested |
What is the purpose of the pull request
Input and Output streams created in HUDI through calls to HoodieWrapperFileSystem do not include any buffering unless the underlying file system implements buffering.
This patch introduces buffering at the HoodieWrapperFileSystem level so that all types of reads and writes benefit from buffering.
Brief change log
HoodieWrapperFileSystem changed to introduce BufferedStreams.
Verify this pull request
This pull request is already covered by existing tests.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.