[HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer #6003

zhangyue19921010 · 2022-06-29T08:30:34Z

Tips

Thank you very much for contributing to Apache Hudi.
Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added HoodieClientWriteTest to verify the change.
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

zhangyue19921010 · 2022-07-04T10:03:09Z

@hudi-bot run azure

hudi-bot · 2022-07-04T10:48:42Z

CI report:

6069fdc Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

yihua

At high-level, the approach LGTM. Let’s finalize the details in a few spots.

yihua · 2022-07-06T01:05:19Z

rfc/rfc-56/rfc-56.md

+
+

nit: remove redundant empty lines

Sure, removed.

yihua · 2022-07-06T01:07:58Z

rfc/rfc-56/rfc-56.md

+
+## Abstract
+
+At present, Hudi implements an optimized occ mechanism based on timeline to ensure data consistency, integrity and 


Let's spell out OCC: optimized occ mechanism -> OCC (Optimistic Concurrency Control)

yihua · 2022-07-06T01:26:39Z

rfc/rfc-56/rfc-56.md

+after the data writing is completed. If this detection was failed, it would lead to a waste of cluster resources 
+because computing and writing were finished already. To solve this problem, this RFC design an early conflict detection 
+mechanism based on the existing Hudi marker mechanism. This new mechanism will do conflict checking before the writers 
+creating markers and before starting to write corresponding data file. So that the writing conflicts can be detected as 


This new mechanism will do conflict checking before the writers creating markers and before starting to write corresponding data file.

This only applies to the direct markers in the new design. To be accurate, for the timeline-server based markers, based on the design, the conflicts are periodically checked and both writers may still write the data files of the same file slice, until the conflict is detected in the next round of checking.

nice catch! changed.

yihua · 2022-07-07T05:33:50Z

rfc/rfc-56/rfc-56.md

+
+
+Based on marker and heartbeat, this RFC design a new conflict detection: Early Conflict Detection. 
+Before the writer creates the marker and before it starts to write the file, Hudi will perform this new conflict 


Similar here for inaccurate description.

yihua · 2022-07-07T05:42:29Z

rfc/rfc-56/rfc-56.md

+
+
+In addition, during checking conflicts, we do not need to list all marker files in all directories, we can prune based 
+on the current partition Path and fileID to avoid unnecessary list operations.


You can give an example of how the direct marker path is constructed for checking whether it exists, instead of listing all markers to save FS calls.

yihua · 2022-07-07T05:49:57Z

rfc/rfc-56/rfc-56.md

+
+![](figure3.png)
+
+As shown on figure3. For the client side, it will call this check-marker-conflict api to get the latest value 


Does the timeline server batch process the check-marker-conflict as well? Why not using the existing marker creation requests so that the timeline server returns false instead of true as the response? In that case, no new endpoint is introduced.

Nice catch Ethan. I'd like more discuss here.
A separate api is added here for better abstraction. Because the follow-up may not only support async checking, but also support loop polling based on each writer's time line server. And coupling checking and create together may lose some scalability.

yihua · 2022-07-07T05:52:54Z

rfc/rfc-56/rfc-56.md

+Fortunately, we can adjust the frequency of the schedule according to the actual situation to reduce the occurrence of 
+underreported.
+
+Secondly, the marker checker runnable scheduled and selecting the candidate active instants, only the instant smaller than 


+1. This logic can be generalized for early conflict detection and pre-commit conflict detection.

yihua · 2022-07-07T05:55:32Z

rfc/rfc-56/rfc-56.md

+of hasConflict.
+
+We did a performance test for async conflict checker based on 1500+ inflight markers located in MARKER0~MARKER19 
+and it will take 179ms to finish checking conflict.


Great to have the benchmarking numbers here! Let’s also add the setup, e.g., how many concurrent writers, the file system type, etc. and discuss the impact on the latency of ongoing requests.

yihua · 2022-07-07T06:03:15Z

rfc/rfc-56/rfc-56.md

+
+### [3] Try to resolve conflict
+
+For now, the default behavior is throwing an `HoodieWriteConflictException`


More details should be added here around how the transaction is ended. Does the driver or executor throw the exception? How to gracefully end the current write operation and possibly retry it in the same Spark job with DeltaStreamer?

pratyakshsharma · 2022-07-12T06:47:12Z

rfc/rfc-56/rfc-56.md

+
+At present, Hudi implements an optimized OCC (Optimistic Concurrency Control) based on timeline to ensure data
+consistency, integrity and correctness between multi-writers. However, the related conflict detection is performed
+before commit metadata and after the data writing is completed. If this detection was failed, it would lead to a waste


nit: "If this detection was failed" -> "if a conflict is detected"

I guess this makes it more clear.

Changed. Thanks a lot for your review and attention :)

pratyakshsharma

minor comments. Thank you for writing this.

rfc/rfc-56/rfc-56.md

pratyakshsharma

Let us wait to see if @yihua has any more concerns.

nbalajee · 2022-07-19T01:51:05Z

rfc/rfc-56/rfc-56.md

+early as possible. Both writers may still write the data files of the same file slice, until the conflict is detected
+in the next round of checking.
+
+What's more? Hoodie can stop writing earlier because of early conflict detection and release the resources to cluster,


What are the assumptions about the workload (batch of records being ingested), when the early conflict detection strategy is employed on two are more jobs? In other words, if two jobs A and B concurrently working on the hudi dataset, are they expected to perform "upsert/update" operations only?

Early conflict detection may not be able to catch conflicts in some scenarios:
a) Both jobs A and B are performing an "insert" operation (seeing the record for the first time) and same record key is present on both batches. A might chose fileId F1 and B might choose fileid F2 and no conflict will be flagged. (If all are inserts, with no dupes, there is no need for conflict resolution).
b) Job A received an insert for a record R1 and Job B received an update for record R1. Both A and B might map R1 to different fileIds and conflict won't be detected.
c) If both jobs A and B are not "ingestion" jobs (say A is ingesting to partition P1, B is clustering on partition P1), then early conflict detection strategy will work.

Thanks @nbalajee for your attention.

What are the assumptions about the workload (batch of records being ingested), when the early conflict detection strategy is employed on two are more jobs? In other words, if two jobs A and B concurrently working on the hudi dataset, are they expected to perform "upsert/update" operations only?

I had to say, the key point for early conflict detection is to detect if multi-writers modify the same file group or not.

Current occ-based detection can only detect this conflict(which is file group level not record level as you mentioned a), b) and c)) at the end of ingestion.

This new workflow is trying to detected this kind of conflict and fail jobs as early as possible.

@nbalajee Thanks for bringing this up. We can add the scope of conflict detection, i.e., at file group level, to the RFC for clarification.

…sues

rfc/rfc-56/rfc-56.md

yihua · 2022-09-29T19:05:07Z

rfc/rfc-56/rfc-56.md

+early as possible. Both writers may still write the data files of the same file slice, until the conflict is detected
+in the next round of checking.
+
+What's more? Hoodie can stop writing earlier because of early conflict detection and release the resources to cluster,


@nbalajee Thanks for bringing this up. We can add the scope of conflict detection, i.e., at file group level, to the RFC for clarification.

rfc/rfc-56/rfc-56.md

yihua

LGTM. @zhangyue19921010 Thanks for putting this up and taking care of all design details! The early conflict detection is going to be a good optimization and resource saver for multi-writer scenarios.

I helped you address a few minor writing issues in this RFC.

yihua · 2022-09-29T20:34:57Z

rfc/rfc-56/rfc-56.md

+which are running on executors. These tasks then fail and the Hudi write transaction retries.
+
+As for HoodieDeltaStreamer, when we detect marker conflicts, corresponding writing task fails and retries. If the retry
+reaches a certain number of times we set, the current stage will fail. At this time, this behavior is consistent with


We should think about how to abort the Hudi write transaction as a whole if one executor fails to create a marker due to early conflict detection, instead of relying on Spark's failure retries.

Sure, Ethan! For now when hudi detected a conflict, it will throw a HoodieEarlyConflictDetectionException on executor ==> current task failed directly

If we want to control this failed ourselves, there is the way i am thinking:

we implement a SparkListener for SparkListenerTaskEnd and register to spark context. If there is any task failed due to HoodieEarlyConflictDetectionException then failed current write transaction

We can do sth like this and take it as a follow-up. Have you measured how much additional time this can save, i.e., how much time do retries take?

If there is a conflict and users use AsyncTimelineMarkerEarlyConflictDetectionStrategy, it will take about 150 ms to detect this conflict and fail current task.

spark.task.maxFailures with default value 4. It may takes 4* 150 = 600ms to fail current write transaction

Sg. Let's follow up with proper benchmarking on this. We should show how this contributes to the overall write latency of a transaction. Spark retries the whole job/stage which contains multiple tasks. Considering Spark's overhead, it can take some time for all tasks to fail and retry.

yihua · 2022-09-29T20:36:56Z

@prasannarajaperumal could you take another look at this RFC before we merge it? I'm going to move on to reviewing the implementation in #6133 .

prasannarajaperumal

I feel this is a critical improvement for multiple writers. Thanks for writing this up.
Looks good overall. Few comments.

prasannarajaperumal · 2022-09-30T10:12:29Z

rfc/rfc-56/rfc-56.md

+detection and conflict resolution. And we provide `AsyncTimelineMarkerEarlyConflictDetectionStrategy` for
+TimelineServerBasedWriteMarkers to perform corresponding conflict detection and conflict resolution
+
+#### DirectWriteMarkers related strategy


Might make sense to put a probabilistic data structure (bloom or cuckoo) to frontend the check and speed up checking marker conflicts with all active writers especially for File based markers.

Nice Idea! If I understand correctly, we can implement another strategy:

When write handler create marker it will create or refresh a bloom index file for current writer.

When early conflict detect works it will check this bloom index file first to speed up this check.

Yeah - I think lets bring this as a default into the design. This should speed up early OCC check when there are no conflicts which is really the case we want to optimize for.

Add new strategies named BloomDirectMarkerBasedEarlyConflictDetectionStrategy and BloomAsyncTimelineMarkerEarlyConflictDetectionStrategy which design to read pre-created marker bloom files firstly to pick out potentially conflicting marker files as quick as possible.

Given that the timeline-server-based markers are the default now, should we rely more on the timeline-server-based early conflict detection for better performance and make the Bloom filter based marker conflict detection as a nice-to-have in the first cut?

For the Bloom filter based marker conflict detection, given we're writing additional metadata beyond just the markers, we need to design it properly. That's why I'm thinking we can have it in the second cut.

Given that the timeline-server-based markers are the default now, should we rely more on the timeline-server-based early conflict detection for better performance and make the Bloom filter based marker conflict detection as a nice-to-have in the first cut?

+1

Because of hoodie using timeline-server-based markers as default. And there only 20(MARKER0~MARKER19) marker files under .temp, so maybe this bloom filters are not a very urgent blocker which need more careful design on both marker creator and marker reader :)

prasannarajaperumal · 2022-09-30T10:46:20Z

rfc/rfc-56/rfc-56.md

+extreme cases, the delay of early conflict detection will also occur. Fortunately, we can adjust the frequency of the
+schedule according to the actual situation.
+
+Secondly, the marker checker runnable only looks at the instants smaller than the current writer for conflict detection.


It is not clear what this means to me. For every single write - Do we not have to check with all the existing markers by all the active writers to detect a conflict?.

It makes sense to have some ordering to break ties between concurrent writes trying to create markers - but that is not what we are doing here though if I understand this right.

The design here is to prevent two writers from detecting each other and all failing both.
For example writer1 start at 9:00 and writer2 start at 10:00 which conflict each other.

Without this limit, writer1 detect conflict with writer2 and failed. On the other hand writer2 also detect this conflict and failed either.

With this limit, only writer2 failed.

Also we can design a priority strategy as follow up :)

@prasannarajaperumal I think what @zhangyue19921010 means here is that if two writers have a conflict, one of the writers can succeed and the other fails. This is consistent with the current pre-commit conflict detection, i.e., only the commits in the Hudi timeline snapshot captured at the beginning of the write transaction are going to be checked for conflict. If writer2 starts in the middle of the writer1's transaction, at pre-commit time, writer1 does not see and check the writer2's inflight commit even though it is in the Hudi timeline (because at the beginning of writer1's transaction, the snapshot of the timeline does not have the writer2's commit).

prasannarajaperumal · 2022-09-30T10:48:46Z

rfc/rfc-56/rfc-56.md

+![](figure4.png)
+
+Writer1 starts writing data at time t1 and finishes at time t3. Writer2 starts writing at time t2, and updates a file
+already updated by writer1 at time t4. Since all markers of writer1 have been deleted at time t4, such conflict cannot


Is this what you mean at time t4?
"at time t4 - writer2 tries to create marker for a file already updated by writer1"

Yeap => at time t4 - writer2 tries to create marker for a file fileA which already updated by writer1 to fileAa(updated)

Can you update the RFC to word it this way for more clarity? Thanks

prasannarajaperumal · 2022-09-30T10:52:01Z

rfc/rfc-56/rfc-56.md

+conflict detection, it is necessary to add the steps of checking commit conflict during the detection.
+
+```
+1. Get and cache all instants before init write handler as set1.


Naming is a bit confusing as set1 used here and set1 used in Figure 2 is not the same - but readers may think it is :)
We could name it functionally like committed_instants_before and committed_instants_after and the diff is committed_instants_during.

Nice catch! Changed!

prasannarajaperumal · 2022-09-30T10:54:26Z

rfc/rfc-56/rfc-56.md

+2. Create and reload a new active timeline after finishing marker conflict checking as set2.
+3. Get the diff instants between set1 and set2 as set3.
+4. Read all the changed fileIds contained in set3 as fileId set1
+5. Check if current writing fileID set2 is overlapped with fileId set1


also since OCC Early detection is pure optimization play - Can we skip on these complex logic and keep the design simple. Worst case there are some false positives that could go through till pre-commit check.

Personally, I would rather have OCC early detection logic very simple and catch 95% of the cases than being very complicated and catch 99% of the cases.

Maybe we can set a functional flag default as false.
For common users they don't need to care about this flag.
For users who are more sensitive to this early conflict detect, they can enable this hard check.
Hi @yihua what do you think here :)

Thats sounds okay to me.

Without checking the metadata of completed commits from other concurrent writers, we miss the type of conflicts happening between t3 and t10 in the case below, which I think happens much more than 5% of the cases if there are conflicts, and lead to resource waste. For timeline-server-based markers, since the timeline server is going to do such detection asynchronously, it should not add latency to the hot path of writer data files.

writer 1: t1-----t3 writer 2: t2---------------<conflict>------------------t10

IMO, given the commit metadata check is sth pre-commit conflict detection also does, we can reuse the same logic and generalize it for the early conflict detection as well. I agree that we can add a feature flag for such logic, but we can keep the default as true for timeline-server-based markers and false for direct markers. If the user really sees performance issues, they can turn it off. What do you guys think?

zhangyue19921010 · 2022-10-03T03:31:41Z

Hi @yihua and @pratyakshsharma . Really appreciate for your attention here! Address the comments. PTAL :)

yihua

@zhangyue19921010 Thanks for addressing the comments. Why are some code changes involved? Could you properly rebase this PR on the latest master?

zhangyue19921010 · 2022-10-05T11:45:38Z

@zhangyue19921010 Thanks for addressing the comments. Why are some code changes involved? Could you properly rebase this PR on the latest master?

ooooh. Sorry for that. Just rebased the master branch. PTAL

prasannarajaperumal

Looks good to land the RFC. Thanks for taking care of my comments.

zhangyue19921010 · 2022-10-14T10:05:26Z

Thanks all for your efforts here!

…#6003) Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net> Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [DOCS] Fix Slack invite link in README.md (apache#6648) * [HUDI-3558] Consistent bucket index: bucket resizing (split&merge) & concurrent write during resizing (apache#4958) RFC-42 implementation - Implement bucket resizing for consistent hashing index. - Support concurrent write during bucket resizing. This change added tests and can be verified as follows: - The test of the consistent bucket index is enhanced to include the case of bucket resizing. - Tests of different bucket resizing cases. - Tests of concurrent resizing, and concurrent writes during resizing. * [MINOR] Add dev setup and spark 3.3 profile to readme (apache#6656) * [HUDI-4831] Fix AWSDmsAvroPayload#getInsertValue,combineAndGetUpdateValue to invoke correct api (apache#6637) Co-authored-by: Rahil Chertara <rchertar@amazon.com> * [HUDI-4806] Use Avro version from the root pom for Flink bundle (apache#6628) Co-authored-by: Shawn Chang <yxchang@amazon.com> * [HUDI-4833] Add Postgres Schema Name to Postgres Debezium Source (apache#6616) * [HUDI-4825] Remove redundant fields in serialized commit metadata in JSON (apache#6646) * [MINOR] Insert should call validateInsertSchema in HoodieFlinkWriteClient (apache#5919) Co-authored-by: 徐帅 <xushuai@MacBook-Pro-6.local> * [HUDI-3879] Suppress exceptions that are not fatal in HoodieMetadataTableValidator (apache#5344) Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning (apache#5478) - The last completed commit timestamp is used to calculate how many commit have been completed since the last clean. we might need to save this w/ clean plan so that next time when we trigger clean, we can start calculating from that. * [HUDI-3994] - Added support for initializing DeltaStreamer without a defined Spark Master (apache#5630) That will enable the usage of DeltaStreamer on environments such as AWS Glue or other serverless environments where the spark master is inherited and we do not have access to it. Co-authored-by: Angel Conde Manjon <acmanjon@amazon.com> * [HUDI-4628] Hudi-flink support GLOBAL_BLOOM，GLOBAL_SIMPLE，BUCKET index type (apache#6406) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4814] Schedules new clustering plan based on latest clustering instant (apache#6574) * Keep a clustering running at the same time * Simplify filtering logic Co-authored-by: dongsj <dongsj@asiainfo.com> * [HUDI-4817] Delete markers after full-record bootstrap operation (apache#6667) * [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module (apache#6550) As part of adding support for Spark 3.3 in Hudi 0.12, a lot of the logic from Spark 3.2 module has been simply copied over. This PR is rectifying that by: 1. Creating new module "hudi-spark3.2plus-common" (that is shared across Spark 3.2 and Spark 3.3) 2. Moving shared components under "hudi-spark3.2plus-common" * [HUDI-4752] Add dedup support for MOR table in cli (apache#6608) * [HUDI-4837] Stop sleeping where it is not necessary after the success (apache#6270) Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4843] Delete the useless timer in BaseRollbackActionExecutor (apache#6671) Co-authored-by: 吴文池 <wuwenchi@deepexi.com> * [HUDI-4780] hoodie.logfile.max.size It does not take effect, causing the log file to be too large (apache#6602) * hoodie.logfile.max.size It does not take effect, causing the log file to be too large Co-authored-by: 854194341@qq.com <loukey_7821> * [HUDI-4844] Skip partition value resolving when the field does not exists for MergeOnReadInputFormat#getReader (apache#6678) * [MINOR] Fix the Spark job status description for metadata-only bootstrap operation (apache#6666) * [HUDI-3403] Ensure keygen props are set for bootstrap (apache#6645) * [HUDI-4193] Upgrade Protobuf to 3.21.5 (apache#5784) * [HUDI-4785] Fix partition discovery in bootstrap operation (apache#6673) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4706] Fix InternalSchemaChangeApplier#applyAddChange error to add nest type (apache#6486) InternalSchemaChangeApplier#applyAddChange forget to remove parent name when calling ColumnAddChange#addColumns * [HUDI-4851] Fixing CSI not handling `InSet` operator properly (apache#6685) * [HUDI-4796] MetricsReporter stop bug (apache#6619) * [HUDI-3861] update tblp 'path' when rename table (apache#5320) * [HUDI-4853] Get field by name for OverwriteNonDefaultsWithLatestAvroPayload to avoid schema mismatch (apache#6689) * [HUDI-4813] Fix infer keygen not work in sparksql side issue (apache#6634) * [HUDI-4813] Fix infer keygen not work in sparksql side issue Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4856] Missing option for HoodieCatalogFactory (apache#6693) * [HUDI-4864] Fix AWSDmsAvroPayload#combineAndGetUpdateValue when using MOR snapshot query after delete operations with test (apache#6688) Co-authored-by: Rahil Chertara <rchertar@amazon.com> * [HUDI-4841] Fix sort idempotency issue (apache#6669) * [HUDI-4865] Optimize HoodieAvroUtils#isMetadataField to use O(1) complexity (apache#6702) * [HUDI-4736] Fix inflight clean action preventing clean service to continue when multiple cleans are not allowed (apache#6536) * [HUDI-4842] Support compaction strategy based on delta log file num (apache#6670) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-4282] Repair IOException in CHDFS when check block corrupted in HoodieLogFileReader (apache#6031) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4757] Create pyspark examples (apache#6672) * [HUDI-3959] Rename class name for spark rdd reader (apache#5409) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4828] Fix the extraction of record keys which may be cut out (apache#6650) Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4873] Report number of messages to be processed via metrics (apache#6271) Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4870] Improve compaction config description (apache#6706) * [HUDI-3304] Support partial update payload (apache#4676) Co-authored-by: jian.feng <jian.feng@shopee.com> * [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo… (apache#6630) * [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in log file issue Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4485] Bump spring shell to 2.1.1 in CLI (apache#6489) Bumped spring shell to 2.1.1 and updated the default value for show fsview all `pathRegex` parameter. * [minor] following 3304, some code refactoring (apache#6713) * [HUDI-4832] Fix drop partition meta sync (apache#6662) * [HUDI-4810] Fix log4j imports to use bridge API (apache#6710) Co-authored-by: dongsj <dongsj@asiainfo.com> * [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (apache#6717) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool (apache#5920) - This pull request fix [SUPPORT] Hudi spark datasource error after migrate from 0.8 to 0.11 apache#5861* - The issue is caused by after changing the table to spark data source table, the table SerDeInfo is missing. * Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> * [MINOR] fix indent to make build pass (apache#6721) * [HUDI-3478] Implement CDC Write in Spark (apache#6697) * [HUDI-4326] Fix hive sync serde properties (apache#6722) * [HUDI-4875] Fix NoSuchTableException when dropping temporary view after applied HoodieSparkSessionExtension in Spark 3.2 (apache#6709) * [DOCS] Improve the quick start guide for Kafka Connect Sink (apache#6708) * [HUDI-4729] Fix file group pending compaction cannot be queried when query _ro table (apache#6516) File group in pending compaction can not be queried when query _ro table with spark. This commit fixes that. Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> * [HUDI-3983] Fix ClassNotFoundException when using hudi-spark-bundle to write table with hbase index (apache#6715) * [HUDI-4758] Add validations to java spark examples (apache#6615) * [HUDI-4792] Batch clean files to delete (apache#6580) This patch makes use of batch call to get fileGroup to delete during cleaning instead of 1 call per partition. This limit the number of call to the view and should fix the trouble with metadata table in context of lot of partitions. Fixes issue apache#6373 Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4363] Support Clustering row writer to improve performance (apache#6046) * [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data (apache#6734) * [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator (apache#6739) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-3901] Correct the description of hoodie.index.type (apache#6749) * [MINOR] Add .mvn directory to gitignore (apache#6746) Co-authored-by: Rahil Chertara <rchertar@amazon.com> * add support for unraveling proto schemas * fix some compile issues * [HUDI-4901] Add avro.version to Flink profiles (apache#6757) * Add avro.version to Flink profiles Co-authored-by: Shawn Chang <yxchang@amazon.com> * [HUDI-4559] Support hiveSync command based on Call Produce Command (apache#6322) * [HUDI-4883] Supporting delete savepoint for MOR (apache#6744) Users could delete unnecessary savepoints and unblock archival for MOR table. * [HUDI-4897] Refactor the merge handle in CDC mode (apache#6740) * [HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031) * Revert "[HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031)" (apache#6768) This reverts commit 092375f. * [HUDI-3523] Introduce AddPrimitiveColumnSchemaPostProcessor to support add new primitive column to the end of a schema (apache#6769) * [HUDI-4903] Fix TestHoodieLogFormat`s minor typo (apache#6762) * [MINOR] Drastically reducing concurrency level (to avoid CI flakiness) (apache#6754) * Update HoodieIndex.java Fix a typo * [HUDI-4906] Fix the local tests for hudi-flink (apache#6763) * [HUDI-4899] Fixing compatibility w/ Spark 3.2.2 (apache#6755) * [HUDI-4892] Fix hudi-spark3-bundle (apache#6735) * [MINOR] Fix a few typos in HoodieIndex (apache#6784) Co-authored-by: xingjunwang <xingjunwang@tencent.com> * [HUDI-4412] Fix multi writer INSERT_OVERWRITE NPE bug (apache#6130) There are two minor issues fixed here: 1. When the insert_overwrite operation is performed, the clusteringPlan in the requestedReplaceMetadata will be null. Calling getFileIdsFromRequestedReplaceMetadata will cause NPE. 2. When insert_overwrite operation, inflightCommitMetadata!=null, getOperationType should be obtained from getHoodieInflightReplaceMetadata, the original code will have a null pointer. * [MINOR] retain avro's namespace (apache#6783) * [MINOR] Simple logging fix in LockManager (apache#6765) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-4433] hudi-cli repair deduplicate not working with non-partitioned dataset (apache#6349) When using the repair deduplicate command with hudi-cli, there is no way to run it on the unpartitioned dataset, so modify the cli parameter. Co-authored-by: Xingjun Wang <wongxingjun@126.com> * [RFC-51][HUDI-3478] Update RFC: CDC support (apache#6256) * [HUDI-4915] improve avro serializer/deserializer (apache#6788) * [HUDI-3478] Implement CDC Read in Spark (apache#6727) * naming and style updates * [HUDI-4830] Fix testNoGlobalConfFileConfigured when add hudi-defaults.conf in default dir (apache#6652) * make test data random, reuse code * [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering (apache#6561) - Apparently in clustering, data file creations are triggered twice since we don't cache the write status and for doing some validation, we do isEmpty on JavaRDD which ended up retriggering the action. Fixing the double de-referencing in this patch. * [HUDI-4914] Managed memory weight should be set when sort clustering is enabled (apache#6792) * [HUDI-4910] Fix unknown variable or type "Cast" (apache#6778) * [HUDI-4918] Fix bugs about when trying to show the non -existing key from env, NullPointException occurs. (apache#6794) * [HUDI-4718] Add Kerberos kinit command support. (apache#6719) * add test for 2 different recursion depths, fix schema cache key * add unsigned long support * better handle other types * rebase on 4904 * get all tests working * fix oneof expected schema, update tests after rebase * [HUDI-4902] Set default partitioner for SIMPLE BUCKET index (apache#6759) * [MINOR] Update PR template with documentation update (apache#6748) * revert scala binary change * try a different method to avoid avro version * [HUDI-4904] Add support for unraveling proto schemas in ProtoClassBasedSchemaProvider (apache#6761) If a user provides a recursive proto schema, it will fail when we write to parquet. We need to allow the user to specify how many levels of recursion they want before truncating the remaining data. Main changes to existing code: ProtoClassBasedSchemaProvider tracks number of times a message descriptor is seen within a branch of the schema traversal once the number of times that descriptor is seen exceeds the user provided limit, set the field to preset record that will contain two fields: 1) the remaining data serialized as a proto byte array, 2) the descriptors full name for context about what is in that byte array Converting from a proto to an avro now accounts for this truncation of the input * delete unused file * [HUDI-4907] Prevent single commit multi instant issue (apache#6766) Co-authored-by: TengHuo <teng_huo@outlook.com> Co-authored-by: yuzhao.cyz <yuzhao.cyz@gmail.com> * [HUDI-4923] Fix flaky TestHoodieReadClient.testReadFilterExistAfterBulkInsertPrepped (apache#6801) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4848] Fixing repair deprecated partition tool (apache#6731) * [HUDI-4913] Fix HoodieSnapshotExporter for writing to a different S3 bucket or FS (apache#6785) * address PR feedback, update decimal precision * fix isNullable issue, check if class is Int64value * checkstyle fix * change wrapper descriptor set initialization * add in testing for unsigned long to BigInteger conversion * [HUDI-4453] Fix schema to include partition columns in bootstrap operation (apache#6676) Turn off the type inference of the partition column to be consistent with existing behavior. Add notes around partition column type inference. * [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data (apache#4015) Co-authored-by: huangjing02 <huangjing02@bilibili.com> Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4924] Auto-tune dedup parallelism (apache#6802) * [HUDI-4687] Avoid setAccessible which breaks strong encapsulation (apache#6657) Use JOL GraphLayout for estimating deep size. * [MINOR] fixing validate async operations to poll completed clean instances (apache#6814) * [HUDI-4734] Deltastreamer table config change validation (apache#6753) Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4934] Revert batch clean files (apache#6813) * Revert "[HUDI-4792] Batch clean files to delete (apache#6580)" This reverts commit cbf9b83. * [HUDI-4722] Added locking metrics for Hudi (apache#6502) * [HUDI-4936] Fix `as.of.instant` not recognized as hoodie config (apache#5616) Co-authored-by: leon <leon@leondeMacBook-Pro.local> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4861] Relaxing `MERGE INTO` constraints to permit limited casting operations w/in matched-on conditions (apache#6820) * [HUDI-4885] Adding org.apache.avro to hudi-hive-sync bundle (apache#6729) * [HUDI-4951] Fix incorrect use of Long.getLong() (apache#6828) * [MINOR] Use base path URI in ITTestDataStreamWrite (apache#6826) * [HUDI-4308] READ_OPTIMIZED read mode will temporary loss of data when compaction (apache#6664) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4237] Fixing empty partition-values being sync'd to HMS (apache#6821) Co-authored-by: dujunling <dujunling@bytedance.com> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (apache#6355) Co-authored-by: jian.feng <jian.feng@shopee.com> * [HUDI-4850] Add incremental source from GCS to Hudi (apache#6665) Adds an incremental source from GCS based on a similar design as https://hudi.apache.org/blog/2021/08/23/s3-events-source * [HUDI-4957] Shade JOL in bundles to fix NoClassDefFoundError:GraphLayout (apache#6839) * [HUDI-4718] Add Kerberos kdestroy command support (apache#6810) * [HUDI-4916] Implement change log feed for Flink (apache#6840) * [HUDI-4769] Option read.streaming.skip_compaction skips delta commit (apache#6848) * [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row (apache#6805) * [HUDI-4966] Add a partition extractor to handle partition values with slashes (apache#6851) * [MINOR] Fix testUpdateRejectForClustering (apache#6852) * [HUDI-4962] Move cloud dependencies to cloud modules (apache#6846) * [HOTFIX] Fix source release validate script (apache#6865) * [HUDI-4980] Calculate avg record size using commit only (apache#6864) Calculate average record size for Spark upsert partitioner based on commit instants only. Previously it's based on commit and replacecommit, of which the latter may be created by clustering which has inaccurately smaller average record sizes, which could result in OOM due to size underestimation. * shade protobuf dependency * Revert "[HUDI-4915] improve avro serializer/deserializer (apache#6788)" (apache#6809) This reverts commit 79b3e2b. * [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create (apache#6857) * Enhancing README for multi-writer tests (apache#6870) * [MINOR] Fix deploy script for flink 1.15 (apache#6872) * [HUDI-4992] Fixing invalid min/max record key stats in Parquet metadata (apache#6883) * Revert "shade protobuf dependency" This reverts commit f03f961. * [HUDI-4972] Fixes to make unit tests work on m1 mac (apache#6751) * [HUDI-2786] Docker demo on mac aarch64 (apache#6859) * [HUDI-4971] Fix shading kryo-shaded with reusing configs (apache#6873) * [HUDI-3900] [UBER] Support log compaction action for MOR tables (apache#5958) - Adding log compaction support to MOR table. subsequent log blocks can now be compacted into larger log blocks without needing to go for full compaction (by merging w/ base file). - New timeline action is introduced for the purpose. Co-authored-by: sivabalan <n.siva.b@gmail.com> * Relocate apache http package (apache#6874) * [HUDI-4975] Fix datahub bundle dependency (apache#6896) * [HUDI-4999] Refactor FlinkOptions#allOptions and CatalogOptions#allOptions (apache#6901) * [MINOR] Update GitHub setting for merge button (apache#6922) Only allow squash and merge. Disable merge and rebase * [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool (apache#6885) * [MINOR] Fix name spelling for RunBootstrapProcedure * [HUDI-4754] Add compliance check in github actions (apache#6575) * [HUDI-4963] Extend InProcessLockProvider to support multiple table ingestion (apache#6847) Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> * [HUDI-4994] Fix bug that prevents re-ingestion of soft-deleted Datahub entities (apache#6886) * Implement Create/Drop/Show/Refresh Secondary Index (apache#5933) * [MINOR] Moved readme from .github to the workflows folder (apache#6932) * [HUDI-4952] Fixing reading from metadata table when there are no inflight commits (apache#6836) * Fixing reading from metadata table when there are no inflight commits * Fixing reading from metadata if not fully built out * addressing minor comments * fixing sql conf and options interplay * addressing minor refactoring * [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer (apache#6003) Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net> Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5006] Use the same wrapper for timestamp type metadata for parquet and log files (apache#6918) Before this patch, for timestamp type, we use LongWrapper for parquet and TimestampMicrosWrapper for avro log, they may keep different precision val here, for example, with timestamp(3), LongWrapper keeps the val as a millisecond long from EPOCH instant, while TimestampMicrosWrapper keeps the val as micro-seconds. For spark, it uses micro-seconds internally for timestamp type value, while flink uses the TimestampData internally, we better keeps the same precision for better compatibility here. * [HUDI-5016] Flink clustering does not reserve commit metadata (apache#6929) * [HUDI-3900] Fixing hdfs setup and tear down in tests to avoid flakiness (apache#6912) * [HUDI-5002] Remove deprecated API usage in SparkHoodieHBaseIndex#generateStatement (apache#6909) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5010] Fix flink hive catalog external config not work (apache#6923) * fix flink catalog external config not work * [HUDI-4948] Improve CDC Write (apache#6818) * improve cdc write to support multiple log files * update: use map to store the cdc stats * [HUDI-5030] Fix TestPartialUpdateAvroPayload.testUseLatestRecordMetaValue(apache#6948) * [HUDI-5033] Fix Broken Link In MultipleSparkJobExecutionStrategy (apache#6951) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5037] Upgrade org.apache.thrift:libthrift to 0.14.0 (apache#6941) * [MINOR] Fixing verbosity of docker set up (apache#6944) * [HUDI-5022] Make better error messages for pr compliance (apache#6934) * [HUDI-5003] Fix the type of InLineFileSystem`startOffset to long (apache#6916) * [HUDI-4855] Add missing table configs for bootstrap in Deltastreamer (apache#6694) * [MINOR] Handling null event time (apache#6876) * [MINOR] Update DOAP with 0.12.1 Release (apache#6988) * [MINOR] Increase maxParameters size in scalastyle (apache#6987) * [HUDI-3900] Closing resources in TestHoodieLogRecord (apache#6995) * [MINOR] Test case for hoodie.merge.allow.duplicate.on.inserts (apache#6949) * [HUDI-4982] Add validation job for spark bundles in GitHub Actions (apache#6954) * [HUDI-5041] Fix lock metric register confict error (apache#6968) Co-authored-by: hbg <bingeng.huang@shopee.com> * [HUDI-4998] Infer partition extractor class first from meta sync partition fields (apache#6899) * [HUDI-4781] Allow omit metadata fields for hive sync (apache#6471) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4997] Use jackson-v2 import instead of jackson-v1 (apache#6893) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-3900] Fixing tempDir usage in TestHoodieLogFormat (apache#6981) * [HUDI-4995] Relocate httpcomponents (apache#6906) * [MINOR] Update GitHub setting for branch protection (apache#7008) - require at least 1 approving review * [HUDI-4960] Upgrade jetty version for timeline server (apache#6844) Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5046] Support all the hive sync options for flink sql (apache#6985) * [MINOR] fix cdc flake ut (apache#7016) * [MINOR] Remove redundant space in PR compliance check (apache#7022) * [HUDI-5063] Enabling run time stats to be serialized with commit metadata (apache#7006) * [HUDI-5070] Adding lock provider to testCleaner tests since async cleaning is invoked (apache#7023) * [HUDI-5070] Move flaky cleaner tests to separate class (apache#7034) * [HUDI-4971] Remove direct use of kryo from `SerDeUtils` (apache#7014) Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> * [HUDI-5081] Tests clean up in hudi-utilities (apache#7033) * [HUDI-5027] Replace hardcoded hbase config keys with constant variables (apache#6946) * [MINOR] add commit_action output in show_commits (apache#7012) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception (apache#7001) Co-authored-by: liufangqi.chenfeng <liufangqi.chenfeng@BYTEDANCE.COM> * [MINOR] Skip loading last completed txn for single writer (apache#6660) Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4281] Using hudi to build a large number of tables in spark on hive causes OOM (apache#5903) * [HUDI-5042] Fix clustering schedule problem in flink when enable schedule clustering and disable async clustering (apache#6976) Co-authored-by: hbg <bingeng.huang@shopee.com> * [HUDI-4753] more accurate record size estimation for log writing and spillable map (apache#6632) * [HUDI-4201] Cli tool to get warned about empty non-completed instants from timeline (apache#6867) * [HUDI-5038] Increase default num_instants to fetch for incremental source (apache#6955) * [HUDI-5049] Supports dropPartition for Flink catalog (apache#6991) * for both dfs and hms catalogs * [HUDI-4809] glue support drop partitions (apache#7007) Co-authored-by: xxhua <xxhua@freewheel.tv> * [HUDI-5057] Fix msck repair hudi table (apache#6999) * [HUDI-4959] Fixing Avro's `Utf8` serialization in Kryo (apache#7024) * temp_view_support (apache#6990) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-4982] Add Utilities and Utilities Slim + Spark Bundle testing to GH Actions (apache#7005) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5085]When a flink job has multiple sink tables, the index loading status is abnormal (apache#7051) * [HUDI-5089] Refactor HoodieCommitMetadata deserialization (apache#7055) * [HUDI-5058] Fix flink catalog read spark table error : primary key col can not be nullable (apache#7009) * [HUDI-5087] Fix incorrect merging sequence for Column Stats Record in `HoodieMetadataPayload` (apache#7053) * [HUDI-5087]Fix incorrect maxValue getting from metatable [HUDI-5087]Fix incorrect maxValue getting from metatable * Fixed `HoodieMetadataPayload` merging seq; Added test * Fixing handling of deletes; Added tests for handling deletes; * Added tests for combining partition files-list record Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> * [HUDI-4946] fix merge into with no preCombineField having dup row by only insert (apache#6824) * [HUDI-5072] Extract `ExecutionStrategy#transform` duplicate code (apache#7030) * [HUDI-3287] Remove hudi-spark dependencies from hudi-kafka-connect-bundle (apache#6079) * [HUDI-5000] Support schema evolution for Hive/presto (apache#6989) Co-authored-by: z00484332 <zhaolong36@huawei.com> * [HUDI-4716] Avoid parquet-hadoop-bundle in hudi-hadoop-mr (apache#6930) * [HUDI-5035] Remove usage of deprecated HoodieTimer constructor (apache#6952) Co-authored-by: slfan1989 <louj1988@@> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5083]Fixed a bug when schema evolution (apache#7045) * [HUDI-5102] source operator(monitor and reader) support user uid (apache#7085) * Update HoodieTableSource.java Co-authored-by: chenzhiming <chenzhm@chinatelecom.cn> * [HUDI-5057] Fix msck repair external hudi table (apache#7084) * [MINOR] Fix typos in Spark client related classes (apache#7083) * [HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796) Co-authored-by: jian.feng <jian.feng@shopee.com> * [MINOR] use default maven version since it already fix the warnings recently (apache#6863) Co-authored-by: jian.feng <jian.feng@shopee.com> * Revert "[HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796)" (apache#7090) This reverts commit e222693. * [MINOR] Fix doc of org.apache.hudi.sink.meta.CkpMetadata#bootstrap (apache#7048) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4799] improve analyzer exception tip when cannot resolve expression (apache#6625) * [HUDI-5096] Upgrade jcommander to 1.78 (apache#7068) - resolves security vulnerability - resolves NPE issues with HiveSyncTool args parsing Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql (apache#7091) * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql * remove pr compliance from open source * fix test issues * fix bad merge files * ignoring Spark3DDL tests, as they are failing in OSS master too against spark3.2 , scala2.12 * remove flakey test case * Update HoodieMultiTableCommitStatsManager when creating job info (apache#122) * Update HoodieMultiTableCommitStatsManager when creating job info * Tidying up Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> Co-authored-by: Yuwei XIAO <ywxiaozero@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com> Co-authored-by: Rahil Chertara <rchertar@amazon.com> Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com> Co-authored-by: Shawn Chang <yxchang@amazon.com> Co-authored-by: Abhishek Modi <modi@makenotion.com> Co-authored-by: shuai.xu <chiggics@gmail.com> Co-authored-by: 徐帅 <xushuai@MacBook-Pro-6.local> Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com> Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: 董可伦 <dongkelun01@inspur.com> Co-authored-by: Angel Conde <neuw84@gmail.com> Co-authored-by: Angel Conde Manjon <acmanjon@amazon.com> Co-authored-by: FocusComputing <xiaoxingstack@gmail.com> Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> Co-authored-by: eric9204 <90449228+eric9204@users.noreply.github.com> Co-authored-by: dongsj <dongsj@asiainfo.com> Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> Co-authored-by: Manu <36392121+xicm@users.noreply.github.com> Co-authored-by: Volodymyr Burenin <vburenin@gmail.com> Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: wuwenchi <wuwenchihdu@hotmail.com> Co-authored-by: 吴文池 <wuwenchi@deepexi.com> Co-authored-by: luokey <loukey.j@gmail.com> Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com> Co-authored-by: Sylwester Lachiewicz <slachiewicz@apache.org> Co-authored-by: komao <masterwangzx@gmail.com> Co-authored-by: KnightChess <981159963@qq.com> Co-authored-by: voonhous <voonhousu@gmail.com> Co-authored-by: 苏承祥 <scx_white@aliyun.com> Co-authored-by: 苏承祥 <sucx@tuya.com> Co-authored-by: 5herhom <543872547@qq.com> Co-authored-by: Jon Vexler <jon@onehouse.ai> Co-authored-by: simonsssu <barley0806@gmail.com> Co-authored-by: y0908105023 <283999377@qq.com> Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com> Co-authored-by: 冯健 <fengjian428@gmail.com> Co-authored-by: jian.feng <jian.feng@shopee.com> Co-authored-by: Paul Zhang <xzhangyao@126.com> Co-authored-by: Kyle Zhike Chen <zk.chan007@gmail.com> Co-authored-by: Yann Byron <biyan900116@gmail.com> Co-authored-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com> Co-authored-by: dohongdayi <dohongdayi@126.com> Co-authored-by: shaoxiong.zhan <31836510+microbearz@users.noreply.github.com> Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com> Co-authored-by: Nicolas Paris <nicolas.paris@riseup.net> Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: RexAn <bonean131@gmail.com> Co-authored-by: ForwardXu <forwardxu315@gmail.com> Co-authored-by: wangxianghu <wangxianghu@apache.org> Co-authored-by: wulei <wulei.1023@bytedance.com> Co-authored-by: Xingjun Wang <wongxingjun@126.com> Co-authored-by: Prasanna Rajaperumal <prasanna.raj@live.com> Co-authored-by: xingjunwang <xingjunwang@tencent.com> Co-authored-by: liujinhui <965147871@qq.com> Co-authored-by: ChanKyeong Won <brightwon.dev@gmail.com> Co-authored-by: Zouxxyy <zouxxyy@qq.com> Co-authored-by: Nicholas Jiang <programgeek@163.com> Co-authored-by: Forus <70357858+Forus0322@users.noreply.github.com> Co-authored-by: TengHuo <teng_huo@outlook.com> Co-authored-by: hj2016 <hj3245459@163.com> Co-authored-by: huangjing02 <huangjing02@bilibili.com> Co-authored-by: jsbali <jsbali@uber.com> Co-authored-by: Leon Tsao <31072303+gnailJC@users.noreply.github.com> Co-authored-by: leon <leon@leondeMacBook-Pro.local> Co-authored-by: 申胜利 <48829688+shenshengli@users.noreply.github.com> Co-authored-by: aiden.dong <782112163@qq.com> Co-authored-by: dujunling <dujunling@bytedance.com> Co-authored-by: Pramod Biligiri <pramodbiligiri@gmail.com> Co-authored-by: Zouxxyy <zouxinyu.zxy@alibaba-inc.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Co-authored-by: Surya Prasanna <syalla@uber.com> Co-authored-by: Rajesh Mahindra <76502047+rmahindra123@users.noreply.github.com> Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: huberylee <shibei.lh@foxmail.com> Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net> Co-authored-by: slfan1989 <55643692+slfan1989@users.noreply.github.com> Co-authored-by: slfan1989 <louj1988@@> Co-authored-by: 吴祥平 <408317717@qq.com> Co-authored-by: wangzeyu <hameizi369@gmail.com> Co-authored-by: vvsd <40269480+vvsd@users.noreply.github.com> Co-authored-by: Zhaojing Yu <yuzhaojing@bytedance.com> Co-authored-by: Bingeng Huang <304979636@qq.com> Co-authored-by: hbg <bingeng.huang@shopee.com> Co-authored-by: that's cool <1059023054@qq.com> Co-authored-by: liufangqi.chenfeng <liufangqi.chenfeng@BYTEDANCE.COM> Co-authored-by: gavin <zhangrenhuaman@163.com> Co-authored-by: Jon Vexler <jbvexler@gmail.com> Co-authored-by: Xixi Hua <smilecrazy1h@gmail.com> Co-authored-by: xxhua <xxhua@freewheel.tv> Co-authored-by: YangXiao <919869387@qq.com> Co-authored-by: chao chen <59957056+waywtdcc@users.noreply.github.com> Co-authored-by: Zhangshunyu <zhangshunyu1990@126.com> Co-authored-by: Long Zhao <294514940@qq.com> Co-authored-by: z00484332 <zhaolong36@huawei.com> Co-authored-by: 矛始 <1032851561@qq.com> Co-authored-by: chenzhiming <chenzhm@chinatelecom.cn> Co-authored-by: lvhu-goodluck <81349721+lvhu-goodluck@users.noreply.github.com> Co-authored-by: harshal patil <harshal.j.patil@gmail.com> Co-authored-by: Vinish Reddy <vinishreddypannala@infinilake.com>

* [HUDI-4282] Repair IOException in CHDFS when check block corrupted in HoodieLogFileReader (apache#6031) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4757] Create pyspark examples (apache#6672) * [HUDI-3959] Rename class name for spark rdd reader (apache#5409) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4828] Fix the extraction of record keys which may be cut out (apache#6650) Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4873] Report number of messages to be processed via metrics (apache#6271) Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4870] Improve compaction config description (apache#6706) * [HUDI-3304] Support partial update payload (apache#4676) Co-authored-by: jian.feng <jian.feng@shopee.com> * [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo… (apache#6630) * [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in log file issue Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4485] Bump spring shell to 2.1.1 in CLI (apache#6489) Bumped spring shell to 2.1.1 and updated the default value for show fsview all `pathRegex` parameter. * [minor] following 3304, some code refactoring (apache#6713) * [HUDI-4832] Fix drop partition meta sync (apache#6662) * [HUDI-4810] Fix log4j imports to use bridge API (apache#6710) Co-authored-by: dongsj <dongsj@asiainfo.com> * [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (apache#6717) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool (apache#5920) - This pull request fix [SUPPORT] Hudi spark datasource error after migrate from 0.8 to 0.11 apache#5861* - The issue is caused by after changing the table to spark data source table, the table SerDeInfo is missing. * Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> * [MINOR] fix indent to make build pass (apache#6721) * [HUDI-3478] Implement CDC Write in Spark (apache#6697) * [HUDI-4326] Fix hive sync serde properties (apache#6722) * [HUDI-4875] Fix NoSuchTableException when dropping temporary view after applied HoodieSparkSessionExtension in Spark 3.2 (apache#6709) * [DOCS] Improve the quick start guide for Kafka Connect Sink (apache#6708) * [HUDI-4729] Fix file group pending compaction cannot be queried when query _ro table (apache#6516) File group in pending compaction can not be queried when query _ro table with spark. This commit fixes that. Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> * [HUDI-3983] Fix ClassNotFoundException when using hudi-spark-bundle to write table with hbase index (apache#6715) * [HUDI-4758] Add validations to java spark examples (apache#6615) * [HUDI-4792] Batch clean files to delete (apache#6580) This patch makes use of batch call to get fileGroup to delete during cleaning instead of 1 call per partition. This limit the number of call to the view and should fix the trouble with metadata table in context of lot of partitions. Fixes issue apache#6373 Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4363] Support Clustering row writer to improve performance (apache#6046) * [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data (apache#6734) * [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator (apache#6739) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-3901] Correct the description of hoodie.index.type (apache#6749) * [MINOR] Add .mvn directory to gitignore (apache#6746) Co-authored-by: Rahil Chertara <rchertar@amazon.com> * add support for unraveling proto schemas * fix some compile issues * [HUDI-4901] Add avro.version to Flink profiles (apache#6757) * Add avro.version to Flink profiles Co-authored-by: Shawn Chang <yxchang@amazon.com> * [HUDI-4559] Support hiveSync command based on Call Produce Command (apache#6322) * [HUDI-4883] Supporting delete savepoint for MOR (apache#6744) Users could delete unnecessary savepoints and unblock archival for MOR table. * [HUDI-4897] Refactor the merge handle in CDC mode (apache#6740) * [HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031) * Revert "[HUDI-3523] Introduce AddColumnSchemaPostProcessor to support add columns to the end of a schema (apache#5031)" (apache#6768) This reverts commit 092375f. * [HUDI-3523] Introduce AddPrimitiveColumnSchemaPostProcessor to support add new primitive column to the end of a schema (apache#6769) * [HUDI-4903] Fix TestHoodieLogFormat`s minor typo (apache#6762) * [MINOR] Drastically reducing concurrency level (to avoid CI flakiness) (apache#6754) * Update HoodieIndex.java Fix a typo * [HUDI-4906] Fix the local tests for hudi-flink (apache#6763) * [HUDI-4899] Fixing compatibility w/ Spark 3.2.2 (apache#6755) * [HUDI-4892] Fix hudi-spark3-bundle (apache#6735) * [MINOR] Fix a few typos in HoodieIndex (apache#6784) Co-authored-by: xingjunwang <xingjunwang@tencent.com> * [HUDI-4412] Fix multi writer INSERT_OVERWRITE NPE bug (apache#6130) There are two minor issues fixed here: 1. When the insert_overwrite operation is performed, the clusteringPlan in the requestedReplaceMetadata will be null. Calling getFileIdsFromRequestedReplaceMetadata will cause NPE. 2. When insert_overwrite operation, inflightCommitMetadata!=null, getOperationType should be obtained from getHoodieInflightReplaceMetadata, the original code will have a null pointer. * [MINOR] retain avro's namespace (apache#6783) * [MINOR] Simple logging fix in LockManager (apache#6765) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-4433] hudi-cli repair deduplicate not working with non-partitioned dataset (apache#6349) When using the repair deduplicate command with hudi-cli, there is no way to run it on the unpartitioned dataset, so modify the cli parameter. Co-authored-by: Xingjun Wang <wongxingjun@126.com> * [RFC-51][HUDI-3478] Update RFC: CDC support (apache#6256) * [HUDI-4915] improve avro serializer/deserializer (apache#6788) * [HUDI-3478] Implement CDC Read in Spark (apache#6727) * naming and style updates * [HUDI-4830] Fix testNoGlobalConfFileConfigured when add hudi-defaults.conf in default dir (apache#6652) * make test data random, reuse code * [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering (apache#6561) - Apparently in clustering, data file creations are triggered twice since we don't cache the write status and for doing some validation, we do isEmpty on JavaRDD which ended up retriggering the action. Fixing the double de-referencing in this patch. * [HUDI-4914] Managed memory weight should be set when sort clustering is enabled (apache#6792) * [HUDI-4910] Fix unknown variable or type "Cast" (apache#6778) * [HUDI-4918] Fix bugs about when trying to show the non -existing key from env, NullPointException occurs. (apache#6794) * [HUDI-4718] Add Kerberos kinit command support. (apache#6719) * add test for 2 different recursion depths, fix schema cache key * add unsigned long support * better handle other types * rebase on 4904 * get all tests working * fix oneof expected schema, update tests after rebase * [HUDI-4902] Set default partitioner for SIMPLE BUCKET index (apache#6759) * [MINOR] Update PR template with documentation update (apache#6748) * revert scala binary change * try a different method to avoid avro version * [HUDI-4904] Add support for unraveling proto schemas in ProtoClassBasedSchemaProvider (apache#6761) If a user provides a recursive proto schema, it will fail when we write to parquet. We need to allow the user to specify how many levels of recursion they want before truncating the remaining data. Main changes to existing code: ProtoClassBasedSchemaProvider tracks number of times a message descriptor is seen within a branch of the schema traversal once the number of times that descriptor is seen exceeds the user provided limit, set the field to preset record that will contain two fields: 1) the remaining data serialized as a proto byte array, 2) the descriptors full name for context about what is in that byte array Converting from a proto to an avro now accounts for this truncation of the input * delete unused file * [HUDI-4907] Prevent single commit multi instant issue (apache#6766) Co-authored-by: TengHuo <teng_huo@outlook.com> Co-authored-by: yuzhao.cyz <yuzhao.cyz@gmail.com> * [HUDI-4923] Fix flaky TestHoodieReadClient.testReadFilterExistAfterBulkInsertPrepped (apache#6801) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4848] Fixing repair deprecated partition tool (apache#6731) * [HUDI-4913] Fix HoodieSnapshotExporter for writing to a different S3 bucket or FS (apache#6785) * address PR feedback, update decimal precision * fix isNullable issue, check if class is Int64value * checkstyle fix * change wrapper descriptor set initialization * add in testing for unsigned long to BigInteger conversion * [HUDI-4453] Fix schema to include partition columns in bootstrap operation (apache#6676) Turn off the type inference of the partition column to be consistent with existing behavior. Add notes around partition column type inference. * [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data (apache#4015) Co-authored-by: huangjing02 <huangjing02@bilibili.com> Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4924] Auto-tune dedup parallelism (apache#6802) * [HUDI-4687] Avoid setAccessible which breaks strong encapsulation (apache#6657) Use JOL GraphLayout for estimating deep size. * [MINOR] fixing validate async operations to poll completed clean instances (apache#6814) * [HUDI-4734] Deltastreamer table config change validation (apache#6753) Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4934] Revert batch clean files (apache#6813) * Revert "[HUDI-4792] Batch clean files to delete (apache#6580)" This reverts commit cbf9b83. * [HUDI-4722] Added locking metrics for Hudi (apache#6502) * [HUDI-4936] Fix `as.of.instant` not recognized as hoodie config (apache#5616) Co-authored-by: leon <leon@leondeMacBook-Pro.local> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4861] Relaxing `MERGE INTO` constraints to permit limited casting operations w/in matched-on conditions (apache#6820) * [HUDI-4885] Adding org.apache.avro to hudi-hive-sync bundle (apache#6729) * [HUDI-4951] Fix incorrect use of Long.getLong() (apache#6828) * [MINOR] Use base path URI in ITTestDataStreamWrite (apache#6826) * [HUDI-4308] READ_OPTIMIZED read mode will temporary loss of data when compaction (apache#6664) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-4237] Fixing empty partition-values being sync'd to HMS (apache#6821) Co-authored-by: dujunling <dujunling@bytedance.com> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (apache#6355) Co-authored-by: jian.feng <jian.feng@shopee.com> * [HUDI-4850] Add incremental source from GCS to Hudi (apache#6665) Adds an incremental source from GCS based on a similar design as https://hudi.apache.org/blog/2021/08/23/s3-events-source * [HUDI-4957] Shade JOL in bundles to fix NoClassDefFoundError:GraphLayout (apache#6839) * [HUDI-4718] Add Kerberos kdestroy command support (apache#6810) * [HUDI-4916] Implement change log feed for Flink (apache#6840) * [HUDI-4769] Option read.streaming.skip_compaction skips delta commit (apache#6848) * [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row (apache#6805) * [HUDI-4966] Add a partition extractor to handle partition values with slashes (apache#6851) * [MINOR] Fix testUpdateRejectForClustering (apache#6852) * [HUDI-4962] Move cloud dependencies to cloud modules (apache#6846) * [HOTFIX] Fix source release validate script (apache#6865) * [HUDI-4980] Calculate avg record size using commit only (apache#6864) Calculate average record size for Spark upsert partitioner based on commit instants only. Previously it's based on commit and replacecommit, of which the latter may be created by clustering which has inaccurately smaller average record sizes, which could result in OOM due to size underestimation. * shade protobuf dependency * Revert "[HUDI-4915] improve avro serializer/deserializer (apache#6788)" (apache#6809) This reverts commit 79b3e2b. * [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create (apache#6857) * Enhancing README for multi-writer tests (apache#6870) * [MINOR] Fix deploy script for flink 1.15 (apache#6872) * [HUDI-4992] Fixing invalid min/max record key stats in Parquet metadata (apache#6883) * Revert "shade protobuf dependency" This reverts commit f03f961. * [HUDI-4972] Fixes to make unit tests work on m1 mac (apache#6751) * [HUDI-2786] Docker demo on mac aarch64 (apache#6859) * [HUDI-4971] Fix shading kryo-shaded with reusing configs (apache#6873) * [HUDI-3900] [UBER] Support log compaction action for MOR tables (apache#5958) - Adding log compaction support to MOR table. subsequent log blocks can now be compacted into larger log blocks without needing to go for full compaction (by merging w/ base file). - New timeline action is introduced for the purpose. Co-authored-by: sivabalan <n.siva.b@gmail.com> * Relocate apache http package (apache#6874) * [HUDI-4975] Fix datahub bundle dependency (apache#6896) * [HUDI-4999] Refactor FlinkOptions#allOptions and CatalogOptions#allOptions (apache#6901) * [MINOR] Update GitHub setting for merge button (apache#6922) Only allow squash and merge. Disable merge and rebase * [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool (apache#6885) * [MINOR] Fix name spelling for RunBootstrapProcedure * [HUDI-4754] Add compliance check in github actions (apache#6575) * [HUDI-4963] Extend InProcessLockProvider to support multiple table ingestion (apache#6847) Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> * [HUDI-4994] Fix bug that prevents re-ingestion of soft-deleted Datahub entities (apache#6886) * Implement Create/Drop/Show/Refresh Secondary Index (apache#5933) * [MINOR] Moved readme from .github to the workflows folder (apache#6932) * [HUDI-4952] Fixing reading from metadata table when there are no inflight commits (apache#6836) * Fixing reading from metadata table when there are no inflight commits * Fixing reading from metadata if not fully built out * addressing minor comments * fixing sql conf and options interplay * addressing minor refactoring * [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer (apache#6003) Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net> Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5006] Use the same wrapper for timestamp type metadata for parquet and log files (apache#6918) Before this patch, for timestamp type, we use LongWrapper for parquet and TimestampMicrosWrapper for avro log, they may keep different precision val here, for example, with timestamp(3), LongWrapper keeps the val as a millisecond long from EPOCH instant, while TimestampMicrosWrapper keeps the val as micro-seconds. For spark, it uses micro-seconds internally for timestamp type value, while flink uses the TimestampData internally, we better keeps the same precision for better compatibility here. * [HUDI-5016] Flink clustering does not reserve commit metadata (apache#6929) * [HUDI-3900] Fixing hdfs setup and tear down in tests to avoid flakiness (apache#6912) * [HUDI-5002] Remove deprecated API usage in SparkHoodieHBaseIndex#generateStatement (apache#6909) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5010] Fix flink hive catalog external config not work (apache#6923) * fix flink catalog external config not work * [HUDI-4948] Improve CDC Write (apache#6818) * improve cdc write to support multiple log files * update: use map to store the cdc stats * [HUDI-5030] Fix TestPartialUpdateAvroPayload.testUseLatestRecordMetaValue(apache#6948) * [HUDI-5033] Fix Broken Link In MultipleSparkJobExecutionStrategy (apache#6951) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5037] Upgrade org.apache.thrift:libthrift to 0.14.0 (apache#6941) * [MINOR] Fixing verbosity of docker set up (apache#6944) * [HUDI-5022] Make better error messages for pr compliance (apache#6934) * [HUDI-5003] Fix the type of InLineFileSystem`startOffset to long (apache#6916) * [HUDI-4855] Add missing table configs for bootstrap in Deltastreamer (apache#6694) * [MINOR] Handling null event time (apache#6876) * [MINOR] Update DOAP with 0.12.1 Release (apache#6988) * [MINOR] Increase maxParameters size in scalastyle (apache#6987) * [HUDI-3900] Closing resources in TestHoodieLogRecord (apache#6995) * [MINOR] Test case for hoodie.merge.allow.duplicate.on.inserts (apache#6949) * [HUDI-4982] Add validation job for spark bundles in GitHub Actions (apache#6954) * [HUDI-5041] Fix lock metric register confict error (apache#6968) Co-authored-by: hbg <bingeng.huang@shopee.com> * [HUDI-4998] Infer partition extractor class first from meta sync partition fields (apache#6899) * [HUDI-4781] Allow omit metadata fields for hive sync (apache#6471) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-4997] Use jackson-v2 import instead of jackson-v1 (apache#6893) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-3900] Fixing tempDir usage in TestHoodieLogFormat (apache#6981) * [HUDI-4995] Relocate httpcomponents (apache#6906) * [MINOR] Update GitHub setting for branch protection (apache#7008) - require at least 1 approving review * [HUDI-4960] Upgrade jetty version for timeline server (apache#6844) Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5046] Support all the hive sync options for flink sql (apache#6985) * [MINOR] fix cdc flake ut (apache#7016) * [MINOR] Remove redundant space in PR compliance check (apache#7022) * [HUDI-5063] Enabling run time stats to be serialized with commit metadata (apache#7006) * [HUDI-5070] Adding lock provider to testCleaner tests since async cleaning is invoked (apache#7023) * [HUDI-5070] Move flaky cleaner tests to separate class (apache#7034) * [HUDI-4971] Remove direct use of kryo from `SerDeUtils` (apache#7014) Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> * [HUDI-5081] Tests clean up in hudi-utilities (apache#7033) * [HUDI-5027] Replace hardcoded hbase config keys with constant variables (apache#6946) * [MINOR] add commit_action output in show_commits (apache#7012) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-5061] bulk insert operation don't throw other exception except IOE Exception (apache#7001) Co-authored-by: liufangqi.chenfeng <liufangqi.chenfeng@BYTEDANCE.COM> * [MINOR] Skip loading last completed txn for single writer (apache#6660) Co-authored-by: sivabalan <n.siva.b@gmail.com> * [HUDI-4281] Using hudi to build a large number of tables in spark on hive causes OOM (apache#5903) * [HUDI-5042] Fix clustering schedule problem in flink when enable schedule clustering and disable async clustering (apache#6976) Co-authored-by: hbg <bingeng.huang@shopee.com> * [HUDI-4753] more accurate record size estimation for log writing and spillable map (apache#6632) * [HUDI-4201] Cli tool to get warned about empty non-completed instants from timeline (apache#6867) * [HUDI-5038] Increase default num_instants to fetch for incremental source (apache#6955) * [HUDI-5049] Supports dropPartition for Flink catalog (apache#6991) * for both dfs and hms catalogs * [HUDI-4809] glue support drop partitions (apache#7007) Co-authored-by: xxhua <xxhua@freewheel.tv> * [HUDI-5057] Fix msck repair hudi table (apache#6999) * [HUDI-4959] Fixing Avro's `Utf8` serialization in Kryo (apache#7024) * temp_view_support (apache#6990) Co-authored-by: 苏承祥 <sucx@tuya.com> * [HUDI-4982] Add Utilities and Utilities Slim + Spark Bundle testing to GH Actions (apache#7005) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5085]When a flink job has multiple sink tables, the index loading status is abnormal (apache#7051) * [HUDI-5089] Refactor HoodieCommitMetadata deserialization (apache#7055) * [HUDI-5058] Fix flink catalog read spark table error : primary key col can not be nullable (apache#7009) * [HUDI-5087] Fix incorrect merging sequence for Column Stats Record in `HoodieMetadataPayload` (apache#7053) * [HUDI-5087]Fix incorrect maxValue getting from metatable [HUDI-5087]Fix incorrect maxValue getting from metatable * Fixed `HoodieMetadataPayload` merging seq; Added test * Fixing handling of deletes; Added tests for handling deletes; * Added tests for combining partition files-list record Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> * [HUDI-4946] fix merge into with no preCombineField having dup row by only insert (apache#6824) * [HUDI-5072] Extract `ExecutionStrategy#transform` duplicate code (apache#7030) * [HUDI-3287] Remove hudi-spark dependencies from hudi-kafka-connect-bundle (apache#6079) * [HUDI-5000] Support schema evolution for Hive/presto (apache#6989) Co-authored-by: z00484332 <zhaolong36@huawei.com> * [HUDI-4716] Avoid parquet-hadoop-bundle in hudi-hadoop-mr (apache#6930) * [HUDI-5035] Remove usage of deprecated HoodieTimer constructor (apache#6952) Co-authored-by: slfan1989 <louj1988@@> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-5083]Fixed a bug when schema evolution (apache#7045) * [HUDI-5102] source operator(monitor and reader) support user uid (apache#7085) * Update HoodieTableSource.java Co-authored-by: chenzhiming <chenzhm@chinatelecom.cn> * [HUDI-5057] Fix msck repair external hudi table (apache#7084) * [MINOR] Fix typos in Spark client related classes (apache#7083) * [HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796) Co-authored-by: jian.feng <jian.feng@shopee.com> * [MINOR] use default maven version since it already fix the warnings recently (apache#6863) Co-authored-by: jian.feng <jian.feng@shopee.com> * Revert "[HUDI-4741] hotfix to avoid partial failover cause restored subtask timeout (apache#6796)" (apache#7090) This reverts commit e222693. * [MINOR] Fix doc of org.apache.hudi.sink.meta.CkpMetadata#bootstrap (apache#7048) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [HUDI-4799] improve analyzer exception tip when cannot resolve expression (apache#6625) * [HUDI-5096] Upgrade jcommander to 1.78 (apache#7068) - resolves security vulnerability - resolves NPE issues with HiveSyncTool args parsing Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql (apache#7091) * [HUDI-5105] Add Call show_commit_extra_metadata for spark sql * [HUDI-5107] Fix hadoop config in DirectWriteMarkers, HoodieFlinkEngineContext and StreamerUtil are not consistent issue (apache#7094) Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> * [MINOR] Fix OverwriteWithLatestAvroPayload full class name (apache#7096) * [HUDI-5074] Warn if table for metastore sync has capitals in it (apache#7077) Co-authored-by: Jonathan Vexler <=> * [HUDI-5124] Fix HoodieInternalRowFileWriter#canWrite error return tag. (apache#7107) Co-authored-by: slfan1989 <louj1988@@> * [MINOR] update commons-codec:commons-codec 1.4 to 1.13 (apache#6959) * [HUDI-5148] Claim RFC-63 for Index on Function and Logical Partitioning (apache#7114) * [HUDI-5065] Call close on SparkRDDWriteClient in HoodieCleaner (apache#7101) Co-authored-by: Jonathan Vexler <=> * [HUDI-4624] Implement Closable for S3EventsSource (apache#7086) Co-authored-by: Jonathan Vexler <=> * [HUDI-5045] Adding support to configure index type with integ tests (apache#6982) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> * [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency (apache#5416) https://issues.apache.org/jira/browse/HUDI-3963 RFC design : apache#5567 Add Lock-Free executor to improve hoodie writing throughput and optimize execution efficiency. Disruptor linked: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction. Existing BoundedInMemory is the default. Users can enable on a need basis. Co-authored-by: yuezhang <yuezhang@freewheel.tv> * [HUDI-5076] Fixing non serializable path used in engineContext with metadata table intialization (apache#7036) * [HUDI-5032] Add archive to cli (apache#7076) Adding archiving capability to cli. Co-authored-by: Jonathan Vexler <=> * [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task (apache#6733) * [HUDI-5147] Flink data skipping doesn't work when HepPlanner calls copy()… (apache#7113) * [HUDI-5147] Flink data skipping doesn't work when HepPlanner calls copy() on HoodieTableSource * [MINOR] Fixing broken test (apache#7123) * [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table (apache#6741) * [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table * Update HiveAvroSerializer.java otherwise payload string type combine field will cause cast exception * [HUDI-5126] Delete duplicate configuration items PAYLOAD_CLASS_NAME (apache#7103) * [HUDI-4989] Fixing deltastreamer init failures (apache#6862) Fixing handling missing hoodie.properties * [MINOR] Fix flaky test in ITTestHoodieDataSource (apache#7134) * [HUDI-4071] Remove default value for mandatory record key field (apache#6681) * [HUDI-5088]Fix bug:Failed to synchronize the hive metadata of the Flink table (apache#7056) * sync `_hoodie_operation` meta field if changelog mode is enabled. * [MINOR] Removing spark2 scala12 combinations from readme (apache#7112) * [HUDI-5153] Fix the write token name resolution of cdc log file (apache#7128) * [HUDI-5066] Support flink hoodie source metaclient cache (apache#7017) * [HUDI-5132] Add hadoop-mr bundle validation (apache#7157) * [HUDI-2673] Add kafka connect bundle to validation test (apache#7131) * [HUDI-5082] Improve the cdc log file name format (apache#7042) * [HUDI-5154] Improve hudi-spark-client Lambada writing (apache#7127) Co-authored-by: slfan1989 <louj1988@@> * [HUDI-5178] Add Call show_table_properties for spark sql (apache#7161) * [HUDI-5067] Merge the columns stats of multiple log blocks from the same log file (apache#7018) * [HUDI-5025] Rollback failed with log file not found when rollOver in rollback process (apache#6939) * fix rollback file not found * [HUDI-4526] Improve spillableMapBasePath when disk directory is full (apache#6284) * [minor] Refactor the code for CkpMetadata (apache#7166) * [HUDI-5111] Improve integration test coverage (apache#7092) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com> * [HUDI-5187] Remove the preCondition check of BucketAssigner assign state (apache#7170) * [HUDI-5145] Avoid starting HDFS in hudi-utilities tests (apache#7171) * [MINOR] Performance improvement of flink ITs with reused miniCluster (apache#7151) * implement MiniCluster extension compatible with junit5 * Make local build work * Delete files removed in OSS * Fix bug in testing * Upgrade to version release-v0.10.0 Co-authored-by: 5herhom <543872547@qq.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com> Co-authored-by: Jon Vexler <jon@onehouse.ai> Co-authored-by: simonsssu <barley0806@gmail.com> Co-authored-by: y0908105023 <283999377@qq.com> Co-authored-by: yangshuo3 <yangshuo3@kingsoft.com> Co-authored-by: Volodymyr Burenin <vburenin@gmail.com> Co-authored-by: Volodymyr Burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: 冯健 <fengjian428@gmail.com> Co-authored-by: jian.feng <jian.feng@shopee.com> Co-authored-by: FocusComputing <xiaoxingstack@gmail.com> Co-authored-by: xiaoxingstack <xiaoxingstack@didiglobal.com> Co-authored-by: Paul Zhang <xzhangyao@126.com> Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: eric9204 <90449228+eric9204@users.noreply.github.com> Co-authored-by: dongsj <dongsj@asiainfo.com> Co-authored-by: Kyle Zhike Chen <zk.chan007@gmail.com> Co-authored-by: Yann Byron <biyan900116@gmail.com> Co-authored-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com> Co-authored-by: dohongdayi <dohongdayi@126.com> Co-authored-by: shaoxiong.zhan <31836510+microbearz@users.noreply.github.com> Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com> Co-authored-by: Manu <36392121+xicm@users.noreply.github.com> Co-authored-by: Nicolas Paris <nicolas.paris@riseup.net> Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: RexAn <bonean131@gmail.com> Co-authored-by: Alexey Kudinkin <alexey@infinilake.com> Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com> Co-authored-by: Rahil Chertara <rchertar@amazon.com> Co-authored-by: Timothy Brown <tim@onehouse.ai> Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com> Co-authored-by: Shawn Chang <yxchang@amazon.com> Co-authored-by: ForwardXu <forwardxu315@gmail.com> Co-authored-by: wangxianghu <wangxianghu@apache.org> Co-authored-by: wulei <wulei.1023@bytedance.com> Co-authored-by: Xingjun Wang <wongxingjun@126.com> Co-authored-by: Prasanna Rajaperumal <prasanna.raj@live.com> Co-authored-by: xingjunwang <xingjunwang@tencent.com> Co-authored-by: liujinhui <965147871@qq.com> Co-authored-by: 苏承祥 <scx_white@aliyun.com> Co-authored-by: 苏承祥 <sucx@tuya.com> Co-authored-by: ChanKyeong Won <brightwon.dev@gmail.com> Co-authored-by: Zouxxyy <zouxxyy@qq.com> Co-authored-by: Nicholas Jiang <programgeek@163.com> Co-authored-by: KnightChess <981159963@qq.com> Co-authored-by: Forus <70357858+Forus0322@users.noreply.github.com> Co-authored-by: voonhous <voonhousu@gmail.com> Co-authored-by: TengHuo <teng_huo@outlook.com> Co-authored-by: hj2016 <hj3245459@163.com> Co-authored-by: huangjing02 <huangjing02@bilibili.com> Co-authored-by: jsbali <jsbali@uber.com> Co-authored-by: Leon Tsao <31072303+gnailJC@users.noreply.github.com> Co-authored-by: leon <leon@leondeMacBook-Pro.local> Co-authored-by: 申胜利 <48829688+shenshengli@users.noreply.github.com> Co-authored-by: aiden.dong <782112163@qq.com> Co-authored-by: dujunling <dujunling@bytedance.com> Co-authored-by: Pramod Biligiri <pramodbiligiri@gmail.com> Co-authored-by: Zouxxyy <zouxinyu.zxy@alibaba-inc.com> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Co-authored-by: Surya Prasanna <syalla@uber.com> Co-authored-by: Rajesh Mahindra <76502047+rmahindra123@users.noreply.github.com> Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local> Co-authored-by: huberylee <shibei.lh@foxmail.com> Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com> Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net> Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: slfan1989 <55643692+slfan1989@users.noreply.github.com> Co-authored-by: slfan1989 <louj1988@@> Co-authored-by: 吴祥平 <408317717@qq.com> Co-authored-by: wangzeyu <hameizi369@gmail.com> Co-authored-by: vvsd <40269480+vvsd@users.noreply.github.com> Co-authored-by: Zhaojing Yu <yuzhaojing@bytedance.com> Co-authored-by: Bingeng Huang <304979636@qq.com> Co-authored-by: hbg <bingeng.huang@shopee.com> Co-authored-by: that's cool <1059023054@qq.com> Co-authored-by: liufangqi.chenfeng <liufangqi.chenfeng@BYTEDANCE.COM> Co-authored-by: Yuwei XIAO <ywxiaozero@gmail.com> Co-authored-by: gavin <zhangrenhuaman@163.com> Co-authored-by: Jon Vexler <jbvexler@gmail.com> Co-authored-by: Xixi Hua <smilecrazy1h@gmail.com> Co-authored-by: xxhua <xxhua@freewheel.tv> Co-authored-by: YangXiao <919869387@qq.com> Co-authored-by: chao chen <59957056+waywtdcc@users.noreply.github.com> Co-authored-by: Zhangshunyu <zhangshunyu1990@126.com> Co-authored-by: Long Zhao <294514940@qq.com> Co-authored-by: z00484332 <zhaolong36@huawei.com> Co-authored-by: 矛始 <1032851561@qq.com> Co-authored-by: chenzhiming <chenzhm@chinatelecom.cn> Co-authored-by: lvhu-goodluck <81349721+lvhu-goodluck@users.noreply.github.com> Co-authored-by: alberic <cnuliuweiren@gmail.com> Co-authored-by: lxxyyds <114218541+lxxawfl@users.noreply.github.com> Co-authored-by: Alexander Trushev <42293632+trushev@users.noreply.github.com> Co-authored-by: xiarixiaoyao <mengtao0326@qq.com> Co-authored-by: windWheel <1817802738@qq.com> Co-authored-by: Alexander Trushev <trushev.alex@gmail.com> Co-authored-by: Shizhi Chen <107476116+chenshzh@users.noreply.github.com>

yihua self-assigned this Jun 29, 2022

yihua changed the title ~~[HUDI-1575] RFC-56: Early Conflict Detection For Multi-writer~~ [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer Jun 29, 2022

yihua added writer-core Issues relating to core transactions/write actions rfc multi-writer priority:blocker big-needle-movers labels Jun 29, 2022

yihua added this to Under Discussion PRs in PR Tracker Board via automation Jun 29, 2022

yihua reviewed Jul 7, 2022

View reviewed changes

yihua assigned vinothchandar Jul 7, 2022

zhangyue19921010 mentioned this pull request Jul 7, 2022

[HUDI-1575] Early Conflict Detection For Multi-writer #6059

Closed

5 tasks

pratyakshsharma reviewed Jul 12, 2022

View reviewed changes

pratyakshsharma requested changes Jul 12, 2022

View reviewed changes

PR Tracker Board automation moved this from Under Discussion PRs to Nearing Landing Jul 12, 2022

pratyakshsharma approved these changes Jul 13, 2022

View reviewed changes

zhangyue19921010 mentioned this pull request Jul 18, 2022

[HUDI-1575] Early Conflict Detection For Multi-writer #6133

Merged

4 tasks

nbalajee reviewed Jul 19, 2022

View reviewed changes

codope added priority:critical production down; pipelines stalled; Need help asap. and removed priority:blocker labels Jul 20, 2022

yihua added priority:blocker and removed priority:critical production down; pipelines stalled; Need help asap. labels Sep 13, 2022

yuezhang and others added 7 commits September 29, 2022 11:55

finish rfc design && ready to review

0112d0b

ready to review

927587b

address comments

4158976

address comments

ca84c09

update flow1.png

d623989

tbd

86ac36b

tbd

25115cb

yuezhang and others added 3 commits September 29, 2022 11:55

need more change

86324a8

review

310b4f4

Improve writing based on the reivew comments and fix writing style is…

1a83c20

…sues

yihua reviewed Sep 29, 2022

View reviewed changes

yihua force-pushed the rfc-56 branch from 3b3f797 to 1a83c20 Compare September 29, 2022 20:25

yihua approved these changes Sep 29, 2022

View reviewed changes

yihua reviewed Sep 29, 2022

View reviewed changes

yihua requested a review from prasannarajaperumal September 29, 2022 20:37

prasannarajaperumal reviewed Sep 30, 2022

View reviewed changes

yuezhang added 2 commits October 3, 2022 10:22

merge from master

4ab6547

review

3a281f3

yihua reviewed Oct 4, 2022

View reviewed changes

yuezhang added 4 commits October 5, 2022 18:22

rebase master

96800b8

address comments

b46498e

address comments

2d35be0

address comments

bd73241

address comments

d2e4ecd

prasannarajaperumal approved these changes Oct 9, 2022

View reviewed changes

yihua merged commit 0a9a6b8 into apache:master Oct 12, 2022

PR Tracker Board automation moved this from Nearing Landing to Done Oct 12, 2022


		## Abstract

		At present, Hudi implements an optimized occ mechanism based on timeline to ensure data consistency, integrity and



		Based on marker and heartbeat, this RFC design a new conflict detection: Early Conflict Detection.
		Before the writer creates the marker and before it starts to write the file, Hudi will perform this new conflict



		In addition, during checking conflicts, we do not need to list all marker files in all directories, we can prune based
		on the current partition Path and fileID to avoid unnecessary list operations.


		![](figure3.png)

		As shown on figure3. For the client side, it will call this check-marker-conflict api to get the latest value


		### [3] Try to resolve conflict

		For now, the default behavior is throwing an `HoodieWriteConflictException`

[HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer #6003

[HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer #6003

Conversation

zhangyue19921010 commented Jun 29, 2022

Tips

What is the purpose of the pull request

Brief change log

Verify this pull request

Committer checklist

zhangyue19921010 commented Jul 4, 2022

hudi-bot commented Jul 4, 2022

CI report:

yihua left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pratyakshsharma left a comment

Choose a reason for hiding this comment

pratyakshsharma left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yihua left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yihua commented Sep 29, 2022

prasannarajaperumal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhangyue19921010 Oct 3, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhangyue19921010 commented Oct 3, 2022

yihua left a comment

Choose a reason for hiding this comment

zhangyue19921010 commented Oct 5, 2022

zhangyue19921010 Oct 3, 2022 •

edited