[DOCS] Edits to the Hudi Tech specs by vinothchandar · Pull Request #6408 · apache/hudi

vinothchandar · 2022-08-16T11:36:16Z

Change Logs

Consistent terminology: tables vs datasets, management vs maintenance
Fixed a few omissions around meta fields, added more rationale
Clarified partitioning a bit more
Formatting, typos.

Impact

none

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

vinothchandar · 2022-08-16T13:11:53Z

website/src/pages/tech-specs.md

-| Merge on Read (MOR) | **Efficient** <br />MOR table type batches the updates to the file slice in a separate optimized Log file, write amplification is amortized over time when sufficient updates are batched. The merge cost involved will be lower than COW since the churn on the records re-written for every update is much lower.  | **Inefficient**<br />MOR Table type required record level merging during query. Although there are techniques to make this merge as efficient as possible, there is still a record level overhead to apply the updates batched up for the file slice. The merge cost applies on every query until the compaction applies the updates and creates a new file slice. |
+|                     | Merge Efficiency                                                                                                                                                                                                                                                                                                  | Query Efficiency                                                                                                                                                                                                                                                                                                                                               |
+| ------------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Copy on Write (COW) | **Tunable** <br />COW table type creates a new File slice in the file-group for every batch of updates. Write amplification can be quite high when the update is spread across multiple file groups. The cost involved can be high over a time period especially on tables with low data latency requirements.    | **Optimal** <br />COW table types create whole readable data files in open source columnar file formats on each merge batch, there is minimal overhead per record in the query engine. Query engines are fairly optimized for accessing files directly in cloud storage.                                                                                       |


@prasannarajaperumal I made this tunable vs optimal. CoW is optimal for reads, for example, while you can tune merge by over-provisioning writers. This is probably a better way to talk about this?

prasannarajaperumal · 2022-08-16T15:38:38Z

Hmm, everything is tunable then? Not sure if tunable is the right word. We could talk about cost, maybe.

vinothchandar · 2022-08-17T03:15:36Z

@prasannarajaperumal Not sure if tunable is the right word either. Not married to it. Landed for now; let's keep looking and update if we find something better. I was approaching it from the angle that we fix X and then we can vary Y based on perf.
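For readers following the "tunable vs cost" discussion, the trade-off described in the table row can be sketched as a toy cost model. This is only an illustration under stated assumptions: the function names, parameters, and numbers below are hypothetical and are not Hudi APIs or measurements, and the model ignores file growth, compression, and indexing.

```python
# Toy write-amplification model (hypothetical; not part of Hudi).
# Sizes are in arbitrary units, e.g. MB.

def cow_bytes_written(batches, touched_file_groups, base_file_size):
    """COW: every batch rewrites each touched base file as a new file slice."""
    return batches * touched_file_groups * base_file_size


def mor_bytes_written(batches, touched_file_groups, base_file_size,
                      update_size_per_group, batches_per_compaction):
    """MOR: every batch appends only the updates to log files;
    a compaction rewrites the base files once every N batches."""
    log_bytes = batches * touched_file_groups * update_size_per_group
    compactions = batches // batches_per_compaction
    compaction_bytes = compactions * touched_file_groups * base_file_size
    return log_bytes + compaction_bytes


# Assumed workload: 10 update batches, each touching 4 file groups of size 100,
# with 1 unit of updates per group and a compaction every 5 batches.
cow = cow_bytes_written(10, 4, 100)
mor = mor_bytes_written(10, 4, 100, 1, 5)
print(cow, mor)
```

In this simplified model the write cost of COW grows with every batch, while MOR defers most of it to compaction. Until compaction runs, queries pay the record-level merge cost instead, which is the other column of the table.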

[DOCS] Edits to the Hudi Tech specs

a21a8bb

- Consistent terminology: tables vs datasets, management vs maintenance
- Fixed a few omissions around meta fields, added more rationale
- Clarified partitioning a bit more
- Formatting, typos.

vinothchandar assigned prasannarajaperumal Aug 16, 2022

prasannarajaperumal approved these changes Aug 16, 2022

Fixing more typos, grammar + few rewording in concurrency control/tab…

99ada36

…le tradeoffs

vinothchandar commented Aug 16, 2022

prasannarajaperumal approved these changes Aug 16, 2022

vinothchandar merged commit 9735364 into apache:asf-site Aug 17, 2022

vinothchandar merged 2 commits into apache:asf-site from vinothchandar:asf-site
