[HUDI-8126] Use union to parallelize data and error table writes #12813
nsivabalan merged 5 commits into apache:master
@the-other-tim-brown @rmahindra: Can you folks review this? Once everything looks good and CI is green, lmk. I can help land the patch.
.defaultValue(ErrorWriteFailureStrategy.ROLLBACK_COMMIT.name())
.withDocumentation("The config specifies the failure strategy if error table write fails. "
    + "Use one of - " + Arrays.toString(ErrorWriteFailureStrategy.values()));
public static final ConfigProperty<Boolean> ENABLE_ERROR_TABLE_WRITE_UNIFICATION = ConfigProperty
I am wondering if we really need a flag for this. This seems like it will be more performant for users with the error table writer enabled.
An error table writer implementation can't take part in the union if it implements the upsertAndCommit method, so we still need to keep this behind a feature flag to avoid breaking things for existing users.
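The opt-in behavior discussed here can be sketched in plain Java. This is only an illustration, not Hudi's actual code: the property key is the one named in the PR description, the default of false reflects "disabled by default", and the class, the helper names, and the "union"/"sequential" labels are hypothetical.

```java
import java.util.Properties;

public class ErrorTableUnionFlagSketch {
    // Key as given in the PR description; everything else here is hypothetical.
    static final String UNION_ENABLE_KEY = "hoodie.errortable.write.union.enable";

    // Reads the flag, falling back to "false" so existing users keep the old path.
    static boolean isUnionEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(UNION_ENABLE_KEY, "false"));
    }

    // Hypothetical dispatcher: names the write path that would be taken.
    static String chooseWritePath(Properties props) {
        return isUnionEnabled(props) ? "union" : "sequential";
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(chooseWritePath(props)); // flag unset, old behavior

        props.setProperty(UNION_ENABLE_KEY, "true");
        System.out.println(chooseWritePath(props)); // flag set, union path
    }
}
```

Keeping the default at false means a writer that relies on upsertAndCommit is unaffected unless the user opts in.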
d72d93b to 1189123
@nsivabalan How do I see the logs for the Azure run? Can we re-trigger the CI run if possible?
This seems to be because of
Have retriggered.
1189123 to cf313c0
Rebased with latest master.
Jacoco failures.
Test failures.
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
@hudi-bot run azure
…che#12813) (cherry picked from commit cb32e5e)
Change Logs
Enable writing the error table and the data table in parallel. This behavior is disabled by default and can be enabled by setting the error table config property hoodie.errortable.write.union.enable to true.
Impact
The DAGs for the data table and the error table run sequentially today. This change executes them as a union to better utilize Spark executor resources.
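The union idea can be illustrated without Spark. The sketch below is an analogy in plain Java streams, not the code from this PR: records bound for the two tables are tagged with their target, concatenated into one stream, processed in a single parallel pass, and split back out by target afterwards. All class, enum, and method names are hypothetical, and write() is a stand-in for the real per-record work.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UnionWriteSketch {
    enum Target { DATA_TABLE, ERROR_TABLE }

    // A record tagged with the table it is destined for.
    static final class Tagged {
        final Target target;
        final String payload;

        Tagged(Target target, String payload) {
            this.target = target;
            this.payload = payload;
        }

        Target target() {
            return target;
        }
    }

    // Stand-in for the per-record write work (hypothetical).
    static String write(Tagged r) {
        return r.target + ":" + r.payload;
    }

    // One pass over the union instead of two passes run back to back.
    static Map<Target, List<String>> writeAsUnion(List<String> dataRecords,
                                                  List<String> errorRecords) {
        Stream<Tagged> union = Stream.concat(
            dataRecords.stream().map(p -> new Tagged(Target.DATA_TABLE, p)),
            errorRecords.stream().map(p -> new Tagged(Target.ERROR_TABLE, p)));

        // Both kinds of records share the same pool of workers.
        return union.parallel().collect(
            Collectors.groupingBy(
                Tagged::target,
                Collectors.mapping(UnionWriteSketch::write, Collectors.toList())));
    }

    public static void main(String[] args) {
        Map<Target, List<String>> written =
            writeAsUnion(List.of("a", "b", "c"), List.of("bad-1"));
        System.out.println(written);
    }
}
```

In the sequential version, the workers sit idle on the small error batch until it finishes; in the union version, both batches are scheduled together. Note that a target with no records simply has no entry in the returned map.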
Risk level (write none, low medium or high below)
Low.
Documentation Update