[DOCS] Edits to the Hudi Tech specs #6408

Merged
vinothchandar merged 2 commits into apache:asf-site from vinothchandar:asf-site
Aug 17, 2022

vinothchandar (Member) commented Aug 16, 2022

Change Logs

  • Consistent terminology; tables vs datasets, management vs maintenance
  • Fixed a few omissions around meta fields, added more rationale
  • Clarified partitioning a bit more
  • Formatting, typos.

Impact

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

| | Merge Efficiency | Query Efficiency |
| --- | --- | --- |
| Copy on Write (COW) | **Tunable** <br />COW table type creates a new File slice in the file-group for every batch of updates. Write amplification can be quite high when the update is spread across multiple file groups. The cost involved can be high over a time period especially on tables with low data latency requirements. | **Optimal** <br />COW table types create whole readable data files in open source columnar file formats on each merge batch, there is minimal overhead per record in the query engine. Query engines are fairly optimized for accessing files directly in cloud storage. |
| Merge on Read (MOR) | **Efficient** <br />MOR table type batches the updates to the file slice in a separate optimized Log file, write amplification is amortized over time when sufficient updates are batched. The merge cost involved will be lower than COW since the churn on the records re-written for every update is much lower. | **Inefficient**<br />MOR Table type required record level merging during query. Although there are techniques to make this merge as efficient as possible, there is still a record level overhead to apply the updates batched up for the file slice. The merge cost applies on every query until the compaction applies the updates and creates a new file slice. |

Member Author

@prasannarajaperumal I made this tunable vs. optimal. CoW is optimal for reads, for example, while you can tune merge by over-provisioning writers. Is this probably a better way to talk about this?

Contributor

Hmm everything is tunable then? Not sure if tunable is the right word. We could talk about cost maybe
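
For reference, the cost framing suggested here can be made concrete with a toy model. This is only a sketch under stated assumptions, not Hudi's actual accounting: a single file group, update-only batches (so the base file size stays constant), one query after every batch, and MOR compaction every `compact_every` batches. The names `Cost`, `copy_on_write`, `merge_on_read`, and `compact_every` are hypothetical and do not come from the spec.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    records_written: int = 0         # records physically written by the writer
    records_merged_at_read: int = 0  # log records queries had to merge


def copy_on_write(base_records, batch_sizes):
    """Toy COW model: every batch rewrites the whole file slice."""
    cost = Cost()
    for _ in batch_sizes:
        cost.records_written += base_records
        # Queries read a fully merged base file, so nothing is merged at read time.
    return cost


def merge_on_read(base_records, batch_sizes, compact_every):
    """Toy MOR model: batches append to a log; compaction rewrites the base file."""
    cost = Cost()
    pending = 0  # log records not yet compacted into the base file
    for i, updates in enumerate(batch_sizes, start=1):
        cost.records_written += updates          # append the batch to the log
        pending += updates
        cost.records_merged_at_read += pending   # one query merges all pending log records
        if i % compact_every == 0:
            cost.records_written += base_records  # compaction creates a new file slice
            pending = 0
    return cost


cow = copy_on_write(1000, [10] * 4)
mor = merge_on_read(1000, [10] * 4, compact_every=4)
print(cow)
print(mor)
```

With a 1000-record base file and four batches of 10 updates, the COW model writes 4000 records and merges none at read time, while the MOR model writes 1040 records (40 log records plus one compaction) and merges 100 log records across the four queries. Under these assumptions the write cost is what varies with table type, and MOR's read cost persists only until compaction runs.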

vinothchandar merged commit 9735364 into apache:asf-site on Aug 17, 2022
vinothchandar (Member Author) commented Aug 17, 2022

@prasannarajaperumal Not sure if tunable is the right word either. Not married to it. Landed for now; let's keep looking and update if we find something better. I was approaching it from: we fix X, and then we can vary Y based on perf.
