[DOCS] Edits to the Hudi Tech specs #6408
Merged
vinothchandar merged 2 commits into apache:asf-site on Aug 17, 2022
- Consistent terminology: tables vs datasets, management vs maintenance
- Fixed a few omissions around meta fields, added more rationale
- Clarified partitioning a bit more
- Formatting, typos
prasannarajaperumal
approved these changes
Aug 16, 2022
vinothchandar
commented
Aug 16, 2022
| | Merge Efficiency | Query Efficiency |
| ------------------- | ------------------- | ------------------- |
| Copy on Write (COW) | **Tunable** <br />The COW table type creates a new file slice in the file group for every batch of updates. Write amplification can be quite high when the updates are spread across multiple file groups. The cost involved can be high over a time period, especially on tables with low data latency requirements. | **Optimal** <br />COW table types create whole readable data files in open source columnar file formats on each merge batch, so there is minimal overhead per record in the query engine. Query engines are fairly optimized for accessing files directly in cloud storage. |
| Merge on Read (MOR) | **Efficient** <br />The MOR table type batches the updates to the file slice in a separate optimized log file, so write amplification is amortized over time when sufficient updates are batched. The merge cost involved will be lower than COW since the churn on the records rewritten for every update is much lower. | **Inefficient** <br />The MOR table type requires record-level merging during query. Although there are techniques to make this merge as efficient as possible, there is still a record-level overhead to apply the updates batched up for the file slice. The merge cost applies on every query until compaction applies the updates and creates a new file slice. |
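To make the trade-off in the table concrete, here is a rough toy model in Python. It is not Hudi code: the function names, the `Costs` record, and the `compact_every` parameter are hypothetical, and it assumes an update-only workload on a single file group whose base file size stays constant. It only counts records written and the log records a reader would still have to merge.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Costs:
    records_written: int = 0      # total records physically written by the writer
    merge_work_at_query: int = 0  # un-compacted log records a reader must merge


def copy_on_write(base_records: int, batches: list[int]) -> Costs:
    """Toy COW: every update batch rewrites the whole base file as a new file slice."""
    costs = Costs()
    for _ in batches:
        # Full rewrite of the base file, regardless of how small the batch is.
        costs.records_written += base_records
    # Readers see fully merged files, so merge_work_at_query stays 0.
    return costs


def merge_on_read(base_records: int, batches: list[int], compact_every: int) -> Costs:
    """Toy MOR: updates are appended to a log; compaction rewrites the base file."""
    costs = Costs()
    pending = 0  # log records not yet folded into the base file
    for i, batch in enumerate(batches, start=1):
        costs.records_written += batch  # only the updates are written
        pending += batch
        if i % compact_every == 0:
            # Compaction (hypothetical fixed schedule) creates a new file slice.
            costs.records_written += base_records
            pending = 0
    costs.merge_work_at_query = pending
    return costs


if __name__ == "__main__":
    batches = [10] * 5  # five small update batches against a 1000-record base file
    print("COW:", copy_on_write(1000, batches))
    print("MOR:", merge_on_read(1000, batches, compact_every=4))
```

In this sketch the COW writer pays the full rewrite on every batch while readers pay nothing, and the MOR writer pays mostly for the updates themselves while readers carry the pending log records until the next compaction.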
Member
Author
@prasannarajaperumal I made this tunable vs optimal. CoW is optimal for reads, for example, while you can tune merge by over-provisioning writers. This is probably a better way to talk about this?
Contributor
Hmm everything is tunable then? Not sure if tunable is the right word. We could talk about cost maybe
Member
Author
@prasannarajaperumal Not sure if
Change Logs
Impact
none
Contributor's checklist