
[HUDI-3730] Improve meta sync class design and hierarchies #5854

Merged
merged 51 commits into apache:master from HUDI-3730-metasyncrefactoring on Jul 3, 2022

Conversation

xushiyan
Member

@xushiyan xushiyan commented Jun 14, 2022

Implementation of #5695 (class design section).

@xushiyan xushiyan changed the base branch from release-feature-rfc55 to master June 18, 2022 06:59
@xushiyan xushiyan force-pushed the HUDI-3730-metasyncrefactoring branch from 1bfc141 to 7fec7ff Compare June 18, 2022 07:45
@xushiyan xushiyan changed the title [WIP] [HUDI-3730] Improve meta sync class design and hierarchies [HUDI-3730] Improve meta sync class design and hierarchies Jun 19, 2022
@xushiyan xushiyan force-pushed the HUDI-3730-metasyncrefactoring branch 2 times, most recently from 3b1ff06 to 5b49711 Compare June 24, 2022 11:18
@xushiyan xushiyan added this to Under Discussion PRs in PR Tracker Board via automation Jun 25, 2022

@xushiyan xushiyan force-pushed the HUDI-3730-metasyncrefactoring branch 2 times, most recently from 333eb20 to b923d9d Compare June 26, 2022 15:22
@apache apache deleted a comment from hudi-bot Jun 26, 2022
@xushiyan xushiyan marked this pull request as ready for review June 26, 2022 15:31
@xushiyan xushiyan force-pushed the HUDI-3730-metasyncrefactoring branch from 7550054 to d0b466c Compare June 30, 2022 03:19
@codope codope left a comment (Member)

@fengjian428 @xushiyan Thanks for driving this work. The class design looks much better. Just a few minor comments/clarifications.
At a high level, I want to check two things with you:

  1. Should we extract any config default value changes into a separate PR stacked on top of this one?
  2. I have tested locally with Hive. Have we tested the changes against AWS Glue and GCP BigQuery?

props.setPropertyIfNonNull(META_SYNC_DATABASE_NAME.key(), databaseName);
props.setPropertyIfNonNull(META_SYNC_TABLE_NAME.key(), tableName);
props.setPropertyIfNonNull(META_SYNC_BASE_FILE_FORMAT.key(), baseFileFormat);
props.setPropertyIfNonNull(META_SYNC_PARTITION_FIELDS.key(), StringUtils.join(",", partitionFields));
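
For context, the null-guarding setter used above can be sketched roughly as follows (a minimal sketch; the exact signature of the helper in this PR may differ):

// Minimal sketch of a null-guarding property setter: only set the key when
// the value is present, so absent configs stay unset instead of storing a
// placeholder. The signature here is an assumption for illustration.
public void setPropertyIfNonNull(String key, Object value) {
  if (value != null) {
    setProperty(key, value.toString());
  }
}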
Member

Previously, this was set to an empty list when the config was not set by the user. Now it will be null, which could have side effects downstream. Why change the default here?

Member Author

I can keep all default value changes out of this PR, but I would like to set proper defaults. In this case, it won't be null; it's just not set, and the default value for the config is still "". The proper default should be noDefaultValue().

Member Author

"" just looks like a weird placeholder

Member

Agreed, the proper default should be noDefaultValue(). I am OK with a separate PR for the default changes; actually, that's preferable.

Member Author

So we keep this change as is, right? The default is still "", and here the property is simply not set when partitionFields is null.
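
To illustrate the distinction discussed above, here is how a config declared with an empty-string default compares to one declared with noDefaultValue() (a hedged sketch using Hudi's ConfigProperty builder; the key string and documentation text are illustrative):

import org.apache.hudi.common.config.ConfigProperty;

// With defaultValue(""), an unset config resolves to the empty-string
// placeholder; with noDefaultValue(), it stays truly unset and callers
// must handle absence explicitly.
public static final ConfigProperty<String> META_SYNC_PARTITION_FIELDS = ConfigProperty
    .key("hoodie.datasource.hive_sync.partition_fields")
    .defaultValue("")
    .withDocumentation("Fields in the table to use for determining hive partition columns.");

public static final ConfigProperty<String> META_SYNC_PARTITION_FIELDS_NO_DEFAULT = ConfigProperty
    .key("hoodie.datasource.hive_sync.partition_fields")
    .noDefaultValue()
    .withDocumentation("Same config declared without a default value.");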

Comment on lines +138 to +156
if (!storagePartitionValues.isEmpty()) {
  String storageValue = String.join(", ", storagePartitionValues);
  if (!paths.containsKey(storageValue)) {
    events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
  } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
    events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
  }
}
Member

This duplicates code in HoodieAdbJdbcClient. Let's extract this block into a method and reuse it in the subclass as well.

Member Author

The bigger problem is that org.apache.hudi.sync.adb.HoodieAdbJdbcClient#getPartitionEvents deviates from org.apache.hudi.sync.common.HoodieSyncClient#getPartitionEvents.

By rights, HoodieAdbJdbcClient should not need to implement this at all.

I noted a TODO in HoodieAdbJdbcClient to merge these two methods and will tackle it soon.

Member Author

@codope refactoring org.apache.hudi.sync.adb.HoodieAdbJdbcClient#getPartitionEvents is a rabbit hole: in ADB sync, Map<List<String>, String> tablePartitions is used throughout the code path. We should tackle it in a separate refactoring PR. This part is isolated to ADB sync, so we should be good.
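
For reference, the extraction suggested earlier in this thread might look roughly like this (a hypothetical sketch; the method name and parameter shapes are assumptions, not the merged code):

// Hypothetical shared helper in HoodieSyncClient: classify one storage
// partition as an add or an update by comparing it against the partitions
// already known to the metastore (keyed by their joined partition values).
protected void addPartitionEventIfChanged(List<PartitionEvent> events,
                                          Map<String, String> paths,
                                          String storagePartition,
                                          List<String> storagePartitionValues,
                                          String fullStoragePartitionPath) {
  if (!storagePartitionValues.isEmpty()) {
    String storageValue = String.join(", ", storagePartitionValues);
    if (!paths.containsKey(storageValue)) {
      events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
    } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
      events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
    }
  }
}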


@BeforeEach
public void setUp() throws Exception {
  localCluster.forceCreateDb(DB_NAME);
Member

Since we're creating the db for each test here, why not drop the db (with cascade) after each test in the clear method?

Member Author

This forceCreateDb() does drop the db first.
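
A plausible shape for forceCreateDb, assuming the test cluster executes Hive QL statements (the method body and the executeHiveSql helper are assumptions for illustration, not the actual test utility):

// Hypothetical sketch: drop the database with CASCADE if it exists, then
// recreate it, so every test starts from a clean slate without a separate
// tear-down step.
public void forceCreateDb(String dbName) throws Exception {
  executeHiveSql("DROP DATABASE IF EXISTS " + dbName + " CASCADE");
  executeHiveSql("CREATE DATABASE " + dbName);
}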


import java.util.Objects;

public class FieldSchema {
Member

I see this is used in meta sync operations like getStorageFieldSchemas, and the result is passed around in many places. Should this implement Serializable?

Member Author

I don't see a need to serialize these models. This and the other POJOs are used in a single thread when syncing with the metastore. A pre-existing example is PartitionEvent.
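
For context, a FieldSchema-style POJO of the kind discussed here, with value semantics via java.util.Objects but no Serializable, might look like this (a sketch; any fields beyond what the snippet shows are assumptions):

import java.util.Objects;

// Sketch of a simple, non-Serializable model used only within the
// single-threaded sync path: field name and type with value-based equality.
public class FieldSchema {
  private final String name;
  private final String type;

  public FieldSchema(String name, String type) {
    this.name = name;
    this.type = type;
  }

  public String getName() {
    return name;
  }

  public String getType() {
    return type;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof FieldSchema)) {
      return false;
    }
    FieldSchema that = (FieldSchema) o;
    return Objects.equals(name, that.name) && Objects.equals(type, that.type);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, type);
  }
}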

@xushiyan xushiyan force-pushed the HUDI-3730-metasyncrefactoring branch from 4ce5cef to 55b7bc1 Compare July 3, 2022 04:25

xushiyan commented Jul 3, 2022

@codope manual testing discovered 1 bug, fixed in 55b7bc1.
Manually tested GCP, DataHub, and Glue sync; all passed. The last 4 commits are the new changes.

@apache apache deleted a comment from hudi-bot Jul 3, 2022

hudi-bot commented Jul 3, 2022

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codope codope left a comment (Member)

LGTM.

PR Tracker Board automation moved this from Under Discussion PRs to Nearing Landing Jul 3, 2022
@codope codope merged commit c0e1587 into apache:master Jul 3, 2022
PR Tracker Board automation moved this from Nearing Landing to Done Jul 3, 2022
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
…che#37)

* [MINOR] Update alter rename command class type for pattern matching (apache#5381)

* [HUDI-3977] Flink hudi table with date type partition path throws HoodieNotSupportedException (apache#5432)

* Claim RFC 52 for Introduce Secondary Index to Improve HUDI Query Performance (apache#5441)

* [HUDI-3945] After the async compaction operation is complete, the task should exit. (apache#5391)

Co-authored-by: y00617041 <yangxuan42@huawei.com>

* [HUDI-3815] Fix docs description of metadata.compaction.delta_commits default value error (apache#5368)

Co-authored-by: pusheng.li01 <pusheng.li01@liulishuo.com>

* [HUDI-3943] Some description fixes for 0.10.1 docs (apache#5447)

* [MINOR] support different cleaning policy for flink (apache#5459)

* [HUDI-3758] Fix duplicate fileId error in MOR table type with flink bucket hash Index  (apache#5185)

* fix duplicate fileId with bucket Index
* replace to load FileGroup from FileSystemView

* [MINOR] Fix CI by ignoring SparkContext error (apache#5468)

Sets spark.driver.allowMultipleContexts = true when constructing Spark conf in UtilHelpers

* [HUDI-3862] Fix default configurations of HoodieHBaseIndexConfig (apache#5308)

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-3978] Fix use of partition path field as hive partition field in flink (apache#5434)

* Fix partition path fields as hive sync partition fields error

* [MINOR] Update DOAP for release 0.11.0 (apache#5467)

* [HUDI-3211][RFC-44] Add RFC for Hudi Connector for Presto (apache#4563)

* Add RFC doc

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* Add note regarding catalog naming

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [MINOR] Update RFC status (apache#5486)

* [HUDI-4005] Update release scripts to help validation (apache#5479)

* [HUDI-4031] Avoid clustering update handling when no pending replacecommit (apache#5487)

* [HUDI-3667] Run unit tests of hudi-integ-tests in CI (apache#5078)

* [MINOR] Optimize code logic (apache#5499)

* [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor exit gracefully (apache#4264)

* [HUDI-4042] Support truncate-partition for Spark-3.2 (apache#5506)

* [HUDI-4017] Improve spark sql coverage in CI (apache#5512)

Add GitHub actions tasks to run spark sql UTs under spark 3.1 and 3.2.

* [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode (apache#5073)

- Added a postWriteTerminationStrategy to deltastreamer continuous mode. One can enable it by setting the appropriate termination strategy using DeltastreamerConfig.postWriteTerminationStrategyClass. If not set, continuous mode is expected to run forever.
- Added one concrete impl of the termination strategy, NoNewDataTerminationStrategy, which shuts down deltastreamer if there is no new data to consume from the source for N consecutive rounds.

* [HUDI-3849] AvroDeserializer supports AVRO_REBASE_MODE_IN_READ configuration (apache#5287)

* [MINOR] Fixing class not found when using flink and enable metadata table (apache#5527)

* [MINOR] fixing flaky tests in deltastreamer tests (apache#5521)

* [HUDI-4055]refactor ratelimiter to avoid stack overflow (apache#5530)

* [MINOR] Fixing close for HoodieCatalog's test (apache#5531)

* [MINOR] Fixing close for HoodieCatalog's test

* [HUDI-4053] Flaky ITTestHoodieDataSource.testStreamWriteBatchReadOpti… (apache#5526)

* [HUDI-4053] Flaky ITTestHoodieDataSource.testStreamWriteBatchReadOptimized

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-3995] Making perf optimizations for bulk insert row writer path (apache#5462)

- Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen.
- Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord.
- Other minor fixes around using static values instead of looking up hashmap.

* [HUDI-4044] When reading data from flink-hudi to external storage, the … (apache#5516)


Co-authored-by: aliceyyan <aliceyyan@tencent.com>

* [HUDI-4003] Try to read all the log file to parse schema (apache#5473)

* [HUDI-4038] Avoid calling `getDataSize` after every record written (apache#5497)

- getDataSize has non-trivial overhead in the current ParquetWriter impl, requiring traversal of already composed Column Groups in memory. Instead we can sample these calls to getDataSize to amortize its cost.

Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4079] Supports showing table comment for hudi with spark3 (apache#5546)

* [HUDI-4085] Fixing flakiness with parquet empty batch tests in TestHoodieDeltaStreamer (apache#5559)

* [HUDI-3963][Claim RFC number 53] Use Lock-Free Message Queue Improving Hoodie Writing Efficiency. (apache#5562)


Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests (apache#5501)

- Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it.
- Added delete_partition support to integ test framework using spark-datasource.
- Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions)
- Added tests for 4 concurrent spark datasource writers (multi-writer tests).
- Fixed readme w/ sample commands for multi-writer.

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink (apache#5528)

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink

* [MINOR] Fix a NPE for Option (apache#5461)

* [HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compact… (apache#5545)

* [HUDI-4078][HUDI-FLINK]BootstrapOperator contains the pending compaction files

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink (apache#5574)

* [HUDI-3336][HUDI-FLINK]Support custom hadoop config for flink

* [HUDI-4072] Fix NULL schema for empty batches in deltastreamer (apache#5543)

* [HUDI-4097] add table info to jobStatus (apache#5529)


Co-authored-by: wqwl611 <wqwl611@gmail.com>

* [HUDI-3980] Suport kerberos hbase index (apache#5464)

- Add configurations in HoodieHBaseIndexConfig.java to support kerberos hbase connection.

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4001] Filter the properties should not be used when create table for Spark SQL (apache#5495)

* fix hive sync no partition table error (apache#5585)

* [HUDI-3123] consistent hashing index: basic write path (upsert/insert) (apache#4480)

 1. basic write path(insert/upsert) implementation
 2. adapt simple bucket index

* [HUDI-4098] Metadata table heartbeat for instant has expired, last heartbeat 0 (apache#5583)

* [HUDI-4103] [HUDI-4001] Filter the properties should not be used when create table for Spark SQL

* [HUDI-3654] Preparations for hudi metastore. (apache#5572)

* [HUDI-3654] Preparations for hudi metastore.

Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>

* [HUDI-4104] DeltaWriteProfile includes the pending compaction file slice when deciding small buckets (apache#5594)

* [HUDI-4101] BucketIndexPartitioner should take partition path for better dispersion (apache#5590)

* [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand (apache#5564)

* [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand

* Set hoodie.query.as.ro.table in serde properties

* [HUDI-4110] Clean the marker files for flink compaction (apache#5604)

* [MINOR] Fixing spark long running yaml for non-partitioned (apache#5607)

* [minor] Some code refactoring for LogFileComparator and Instant instantiation (apache#5600)

* [HUDI-4109] Copy the old record directly when it is chosen for merging (apache#5603)

* Clean the marker files for flink compaction (apache#5611)

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-3942] [RFC-50] Improve Timeline Server (apache#5392)

* [HUDI-4111] Bump ANTLR runtime version in Spark 3.x (apache#5606)

* Revert "[HUDI-3870] Add timeout rollback for flink online compaction (apache#5314)" (apache#5622)

This reverts commit 6f9b02d.

* [HUDI-4116] Unify clustering/compaction related procedures' output type (apache#5620)

* Unify clustering/compaction related procedures' output type

* Address review comments

* [HUDI-4114] Remove the unnecessary fs view sync for BaseWriteClient#initTable (apache#5617)

No need to #sync actively because the table instance is instantiated freshly;
its view manager has empty view instances, and the fs view would be synced lazily
when it is requested.

* [HUDI-4119] the first read result is incorrect when Flink upsert- Kafka connector is used in HUDi (apache#5626)

* HUDI-4119 the first read result is incorrect when Flink upsert- Kafka connector is used in HUDi

Co-authored-by: aliceyyan <aliceyyan@tencent.com>

* [HUDI-4130] Remove the upgrade/downgrade for flink #initTable (apache#5642)

* [HUDI-3985] Refactor DLASyncTool to support read hoodie table as spark datasource table (apache#5532)

* [MINOR] Minor fixes to exception log and removing unwanted metrics flush in integ test (apache#5646)

* [HUDI-4122] Fix NPE caused by adding kafka nodes (apache#5632)

* [MINOR] remove unused gson test dependency (apache#5652)

* [HUDI-3858] Shade javax.servlet for Spark bundle jar (apache#5295)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4100] CTAS failed to clean up when given an illegal MANAGED table definition (apache#5588)

* [HUDI-3890] fix rat plugin issue with sql files (apache#5644)

* [HUDI-4051] Allow nested field as primary key and preCombineField in spark sql (apache#5517)

* [HUDI-4051] Allow nested field as preCombineField in spark sql

* relax validation for primary key

* [HUDI-4129] Initializes a new fs view for WriteProfile#reload (apache#5640)

Co-authored-by: zhangyuang <zhangyuang@corp.netease.com>

* [HUDI-4142] Claim RFC-54 for new table APIs (apache#5665)

* [HUDI-3933] Add UT cases to cover different key gen (apache#5638)

* [MINOR] Removing redundant semicolons and line breaks (apache#5662)

* [HUDI-4134] Fix Method naming consistency issues in FSUtils (apache#5655)

* [HUDI-4084] Add support to test async table services with integ test suite framework (apache#5557)

* Add support to test async table services with integ test suite framework

* Make await time for validation configurable

* [HUDI-4138] Fix the concurrency modification of hoodie table config for flink (apache#5660)

* Remove the metadata cleaning strategy for flink, that means the multi-modal index may be affected
* Improve the HoodieTable#clearMetadataTablePartitionsConfig to only update table config when necessary
* Remove the modification of read code path in HoodieTableConfig

* [HUDI-2473] Fixing compaction write operation in commit metadata (apache#5203)

* [HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (apache#5669)

* [HUDI-4135] remove netty and netty-all (apache#5663)

* [HUDI-2207] Support independent flink hudi clustering function

* [HUDI-4132] Fixing determining target table schema for delta sync with empty batch (apache#5648)

* [MINOR] Fix a potential NPE and some finer points of hudi cli (apache#5656)

* [HUDI-4146] Claim RFC-55 for Improve Hive/Meta sync class design and hierachies (apache#5682)

* [HUDI-3193] Decouple hudi-aws from hudi-client-common (apache#5666)

Move HoodieMetricsCloudWatchConfig to hudi-client-common

* [HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (part2) (apache#5676)

* [HUDI-4040] Bulk insert Support CustomColumnsSortPartitioner with Row (apache#5502)

* Along the lines of RDDCustomColumnsSortPartitioner but for Row

* [HUDI-4023] Decouple hudi-spark from hudi-utilities-slim-bundle (apache#5641)

* [HUDI-4124] Add valid check in Spark Datasource configs (apache#5637)



Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>

* [HUDI-3963][RFC-53] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency (apache#5567)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4162] Fixed some constant mapping issues. (apache#5700)

Co-authored-by: y00617041 <yangxuan42@huawei.com>

* [HUDI-4161] Make sure partition values are taken from partition path (apache#5699)

* [MINOR] Fix the issue when handling conf hoodie.datasource.write.operation=bulk_insert in sql mode (apache#5679)



Co-authored-by: Rex An <bonean131@gmail.com>

* [HUDI-4151] flink split_reader supports rocksdb (apache#5675)

* [HUDI-4151] flink split_reader supports rocksdb

* [HUDI-4160] Make database regex of MaxwellJsonKafkaSourcePostProcessor optional (apache#5697)

* [MINOR] Fix Hive and meta sync config for sql statement (apache#5316)

* [HUDI-4166] Added SimpleClient plugin for integ test (apache#5710)

* [HUDI-3551] Add the Oracle Cloud Infrastructure (oci) Object Storage URI scheme (apache#4952)

* [HUDI-3551] Fix testStorageSchemes for oci storage (apache#5711)

* [HUDI-4086] Use CustomizedThreadFactory in async compaction and clustering (apache#5563)

Co-authored-by: 苏承祥 <sucx@tuya.com>

* [HUDI-4163] Catch general exception instead of IOException while fetching rollback plan during rollback (apache#5703)

If the avro file is corrupted, an InvalidAvroMagicException is thrown.

* [HUDI-4149] Drop-Table fails when underlying table directory is broken (apache#5672)

* [HUDI-4107] Added --sync-tool-classes config option in HoodieMultiTableDeltaStreamer (apache#5597)

* added --sync-tool-classes config option in multitable delta streamer

* added a testcase to assert if syncClientToolClassNames is getting picked to the deltastreamer execution context

* [HUDI-4174] Add hive conf dir option for flink sink (apache#5725)

* [HUDI-4011] Add hudi-aws-bundle (apache#5674)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-3670] free temp views in sql transformers (apache#5080)

* [HUDI-4167] Remove the timeline refresh with initializing hoodie table (apache#5716)

The timeline refresh on table initialization invokes the fs view #sync, which has two actions now:

1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata
2. if this is a local fs view, clear all the local states; if this is a remote fs view, send a request to sync the remote fs view

But consider the construction: the meta client is instantiated freshly, so the timeline is already the latest, and the table is also constructed freshly, so the fs view has no local states. That means the #sync is entirely unnecessary.

In this patch, the metadata lifecycle and the dataset fs view are kept in sync: when the fs view is refreshed, the underlying metadata is also refreshed synchronously. The freshness of the metadata follows the same rules as the data fs view:

1. if the fs view is local, the visibility is based on the table metadata client's latest commit
2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server-local timeline

From the perspective of the client, there is no need to care about the refresh action anymore, whether or not the metadata table is enabled. That makes the client logic clearer and less error-prone.

Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view. If all the clients sent requests to #sync the remote fs view, the server would encounter conflicts and the clients would get response errors.

* [HUDI-4179] Cluster with sort cloumns invalid (apache#5739)

* [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (apache#5743)

* [HUDI-4187] Fix partition order in aws glue sync (apache#5731)

* [HUDI-4168] Add Call Procedure for marker deletion (apache#5738)

* Add Call Procedure for marker deletion

* [HUDI-4190] Include hbase-protocol for shading in the bundles (apache#5750)

* [HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw NullPointerException (apache#5755)

SeekTo top cells avoid NullPointerException

* [HUDI-4188] Fix flaky ITTestDataSTreamWrite.testWriteCopyOnWrite (apache#5749)

* [HUDI-4195] Bulk insert should use right keygen for non-partitioned table (apache#5759)

* [HUDI-4101] When BucketIndexPartitioner take partition path for dispersion may cause the fileID of the task to not be loaded correctly (apache#5763)

Co-authored-by: john.wick <john.wick@vipshop.com>

* [HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing (apache#5733)

As outlined in HUDI-4176, we hit a roadblock while testing Hudi on a large dataset (~1Tb) with pretty fat commits, where Hudi's commit metadata could reach into the 100s of MBs.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop Path allocations
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaned up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy

* [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys (apache#5664)

The bulk insert row writer code path had a gap with respect to hive-style partitioning and the default partition when virtual keys are enabled with SimpleKeyGen. This patch fixes the issue.

* [HUDI-4197] Fix Async indexer to support building FILES partition (apache#5766)

- When the async indexer is invoked only with the "FILES" partition, it fails; fixing it to work with the async indexer. Also, if the metadata table itself is not initialized and someone wants to build indexes via AsyncIndexer, they are expected to index the "FILES" partition first, followed by other partitions. In general, we have a limitation of building only one index at a time with AsyncIndexer, and hence guards have been added to ensure these conditions are met.

* [HUDI-4171] Fixing Non partitioned with virtual keys in read path (apache#5747)

- When the non-partitioned key generator is used with virtual keys, the read path could break since the partition path may not exist.

* [MINOR] Mark AWSGlueCatalogSyncClient experimental (apache#5775)

* [MINOR][RFC-53] Fix typos (apache#5764)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4200] Fixing sorting of keys fetched from metadata table (apache#5773)

- Keys fetched from the metadata table, especially from the base file reader, are not sorted, and hence may result in an NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). Fixing the same in this patch. This is not an issue with log blocks, since sorting is taken care of within HoodieHFileDataBlock.
- Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix apache#5208

* [HUDI-4198] Fix hive config for AWSGlueClientFactory (apache#5768)

* HiveConf needs to load fs conf to allow instantiation via AWSGlueClientFactory

* Resolve metastore uri config before loading fs conf

* Skip hiveql due to CI issue

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (apache#5737)

There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs which are not implemented at the moment; instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting the DSv2 APIs and making Spark use V1, except for the schema evaluation logic.

* [MINOR][DOCS] Update the README.md file in hudi-examples (apache#5803)

* [MINOR] FlinkStateBackendConverter add more  exception message (apache#5809)

* [MINOR] FlinkStateBackendConverter add more  exception message

* [HUDI-4213] Infer keygen clazz for Spark SQL (apache#5815)

* [HUDI-4139]improvement for flink write operator name to identify tables easily (apache#5744)


Co-authored-by: yanenze <yanenze@keytop.com.cn>

* [HUDI-3889] Do not validate table config if save mode is set to Overwrite (apache#5619)


Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata (apache#5829)

* [HUDI-4223] Fix NullPointerException from getLogRecordScanner when reading metadata table (apache#5840)

When explicitly specifying the metadata table path for reading in spark, the "hoodie.metadata.enable" is overwritten to true for proper read behavior.

* [HUDI-4205] Fix NullPointerException in HFile reader creation (apache#5841)

Replace SerializableConfiguration with SerializableWritable for broadcasting the hadoop configuration before initializing HFile readers

* [HUDI-4224] Fix CI issues (apache#5842)

- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve azure pipeline configs

Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [MINOR]  fix AvroSchemaConverter duplicate branch in 'switch' (apache#5813)

* Strip extra spaces when creating new configuration (apache#5849)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-3682] testReaderFilterRowKeys fails in TestHoodieOrcReaderWriter (apache#5790)

TestReaderFilterRowKeys needs to get the key from RECORD_KEY_METADATA_FIELD, but the writer in the current UT does not populate the meta field and the schema does not contain meta fields.

This fix writes data with a schema that contains meta fields and calls writeAvroWithMetadata for writing.

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-3863] Add UT for drop partition column in deltastreamer testsuite (apache#5727)

* [HUDI-4006] failOnDataLoss on delta-streamer kafka sources (apache#5718)

Add new config key hoodie.deltastreamer.source.kafka.enable.failOnDataLoss.
When failOnDataLoss=false (current behaviour, the default), log a warning instead of silently seeking to earliest.
When failOnDataLoss is set, fail explicitly.

* [HUDI-4207] HoodieFlinkWriteClient.getOrCreateWriteHandle throws an e… (apache#5788)

Adding more logs to assist in debugging with HoodieFlinkWriteClient.getOrCreateWriteHandle throwing exception

* [MINOR] Fix typo of DisruptorExecutor in RFC 53 (apache#5860)

* [minor] Following HUDI-4207, remote the new wrapper #init method (apache#5865)

* [HUDI-4255] Make the flink merge and replace handle intermediate file visible (apache#5866)

* [HUDI-3499] Add Call Procedure for show rollbacks (apache#5848)

* Add Call Procedure for show rollbacks

* fix

* add ut for show_rollback_detail and exception handle

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4218] [HUDI-4218] Expose the real exception information when an exception occurs in the tableExists method (apache#5827)

* [HUDI-4217] improve repeat init object in ExpressionPayload (apache#5825)

* [HUDI-4214] improve repeat init write schema in ExpressionPayload (apache#5820)

* [HUDI-4214] improve repeat init write schema in ExpressionPayload

* [HUDI-4265] Deprecate useless targetTableName parameter in HoodieMultiTableDeltaStreamer (apache#5883)

* [HUDI-4165] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL (apache#5761)

* Support Create/Drop/Show/Refresh Index Syntax for Spark SQL

* [HUDI-3507] Support export command based on Call Produce Command (apache#5901)

* [HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (apache#5894)

* [MINOR] Add "spillable_map_path" in FlinkCompactionConfig. To avoid the disk space of "/tmp" full when compacting offline. (apache#5905)

* [HUDI-4277] supoort flink table source with computed column (apache#5897)

Co-authored-by: chenshizhi <chenshizhi@bilibili.com>

* fix remove redundant Variable (apache#5806)

* [HUDI-4259]  Flink create avro schema not conformance to standards (apache#5878)

* flink create avro schema not conformance to standards

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-4258] Fix when HoodieTable removes data file before the end of Flink job (apache#5876)

* [HUDI-4258] Fix when HoodieTable removes data file before the end of Flink job

* [MINOR] Update DOAP with 0.11.1 Release (apache#5908)

* [HUDI-4173] Fix wrong results if the user read no base files hudi table by glob paths (apache#5723)

* [HUDI-4251] Fix the problem that the command 'commits sync' description does not match. (apache#5881)

* [HUDI-4177] Fix hudi-cli rollback with rollbackUsingMarkers method call (apache#5734)

* Fix hudi-cli rollback with rollbackUsingMarkers method call
* Add test for hudi-cli rollbackUsingMarkers

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4270] Bootstrap op data loading missing (apache#5888)

* [HUDI-3475] Initialize hudi table management module.

* udate

* Revert master (apache#5925)

* Revert "udate"

This reverts commit 092e35c.

* Revert "[HUDI-3475] Initialize hudi table management module."

This reverts commit 4640a3b.

* [HUDI-4279] Strength the remote fs view lagging check when latest commit refresh is enabled (apache#5917)

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [minor] following 4270, add unit tests for the keys lost case (apache#5918)

* [HUDI-3508] Add call procedure for FileSystemViewCommand (apache#5929)

* [HUDI-3508] Add call procedure for FileSystemView

* minor

Co-authored-by: jiimmyzhan <jiimmyzhan@tencent.com>

* [HUDI-4299] Fix problem about hudi-example-java run failed on idea. (apache#5936)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups (apache#5941)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups

* Separate out incremental sync fsview test with clustering

* [HUDI-3509] Add call procedure for HoodieLogFileCommand (apache#5949)

Co-authored-by: zhanshaoxiong <jiimmyzhan@tencent.com>

* [HUDI-4273] Support inline schedule clustering for Flink stream (apache#5890)

* [HUDI-4273] Support inline schedule clustering for Flink stream

* delete deprecated clustering plan strategy and add clustering ITTest

* [HUDI-3735] TestHoodieSparkMergeOnReadTableRollback is flaky (apache#5874)

* [HUDI-4260] Change KEYGEN_CLASS_NAME without default value (apache#5877)

* Change KEYGEN_CLASS_NAME without default value

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-3512] Add call procedure for StatsCommand (apache#5955)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* [TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)

* Revert "[TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)" (apache#5971)

This reverts commit e8fbd4d.

* [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (apache#5966)

* Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)

* [HUDI-4296] Fix the bug that TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky (apache#5973)

* [HUDI-3502] Support hdfs parquet import command based on Call Produce Command (apache#5956)

* [MINOR] Remove -T option from CI build (apache#5972)

* [HUDI-5246] Bumping mysql connector version due to security vulnerability (apache#5851)

* [HUDI-4309] Spark3.2 custom parser should not throw exception (apache#5947)

* [HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (apache#5959)

* [HUDI-4315] Do not throw exception in BaseSpark3Adapter#toTableIdentifier (apache#5957)

* [HUDI-3504] Support bootstrap command based on Call Produce Command (apache#5977)

* [HUDI-4311] Fix Flink lose data on some rollback scene (apache#5950)

* [HUDI-4291] Fix flaky TestCleanPlanExecutor#testKeepLatestFileVersions (apache#5930)

* [HUDI-3506] Add call procedure for CommitsCommand (apache#5974)

* [HUDI-3506] Add call procedure for CommitsCommand

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4325] fix spark sql procedure cause ParseException with semicolon (apache#5982)

* [HUDI-4325] fix saprk sql procedure cause ParseException with semicolon

* [HUDI-4333] fix HoodieFileIndex's listFiles method log print skipping percent NaN (apache#5990)

* [HUDI-4332] The current instant may be wrong under some extreme conditions in AppendWriteFunction. (apache#5988)

* [HUDI-4320] Make sure `HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED` could be specified by the writer (apache#5970)

Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller

* [HUDI-1176] Upgrade hudi to log4j2 (apache#5366)

* Move to log4j2

cr: https://code.amazon.com/reviews/CR-71010705

* Upgrade unit tests to log4j2

* update exclusion

Co-authored-by: Brandon Scheller <bschelle@amazon.com>

* [HUDI-4334] close SparkRDDWriteClient after usage in Create/Delete/RollbackSavepointsProcedure (apache#5994)

* [HUDI-1575] Claim RFC-56: Early Conflict Detection For Multi-writer (apache#6002)

Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>

* [MINOR] Make CLI 'commit rollback' using rollbackUsingMarkers false as default (apache#5174)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4331] Allow loading external config file from class loader (apache#5987)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4336] Fix records overwritten bug with binary primary key (apache#5996)

* [MINOR] Following apache#2070, Fix BindException when running tests on shared machines. (apache#5951)

* [HUDI-4346] Fix params not update BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED (apache#5999)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria… (apache#5907)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer

* add ut

Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>

* [HUDI-3984] Remove mandatory check of partiton path for cli command (apache#5458)

* [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (apache#5048)

Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file

* [HUDI-3953]Flink Hudi module should support low-level source and sink api (apache#5445)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [HUDI-4353] Column stats data skipping for flink (apache#6026)

* [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (apache#6012)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5854)

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-3511] Add call procedure for MetadataCommand (apache#6018)

* [HUDI-3730] Add ConfigTool#toMap UT (apache#6035)

Co-authored-by: voonhou.su <voonhou.su@shopee.com>

* [MINOR] Improve variable names (apache#6039)

* [HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (apache#4459)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (apache#6043)

* [HUDI-3836] Improve the way of fetching metadata partitions from table (apache#5286)

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (apache#6042)

* [HUDI-4356] Fix the error when sync hive in CTAS (apache#6029)

* [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (apache#5828)

* [HUDI-4357] Support flink 1.15.x (apache#6050)

* [HUDI-4152] Flink offline compaction support compacting multi compaction plan at once (apache#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provider UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

* [HUDI-4309] fix spark32 repartition error (apache#6033)

* [HUDI-4366] Synchronous cleaning for flink bounded source (apache#6051)

* [minor] following 4152, refactor the clazz about plan selection strategy (apache#6060)

* [HUDI-4367] Support copyToTable on call (apache#6054)

* [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (apache#5995)

* fix for updateTableParameters which is not excluding partition columns and updateTableProperties boolean check

* Fix serde parameters getting overridden on table property update

* removing stale syncConfig

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (apache#6017)

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>

* [HUDI-3500] Add call procedure for RepairsCommand (apache#6053)

* [HUDI-2150] Rename/Restructure configs for better modularity (apache#6061)

- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig

* [MINOR] Bump xalan from 2.7.1 to 2.7.2 (apache#6062)

Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)

* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead

* [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (apache#5695)

* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4323] Make database table names optional in sync tool (apache#6073)

* [HUDI-4323] Make database table names optional in sync tool
* Infer these properties from the table config

* [MINOR] Update RFCs status (apache#6078)

* [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (apache#5937)

* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (apache#6080)

* [HUDI-4391] Incremental read from archived commits for flink (apache#6096)

* [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (apache#5436)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4393] Add marker file for target file when flink merge handle rolls over (apache#6103)

* [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (apache#6112)

* [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (apache#6106)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [MINOR] Disable TestHiveSyncGlobalCommitTool (apache#6119)

* [HUDI-4403] Fix the end input metadata for bounded source (apache#6116)

* [HUDI-4408] Reuse old rollover file as base file for flink merge handle (apache#6120)

* [HUDI-3503]  Add call procedure for CleanCommand (apache#6065)

* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <simonssu@tencent.com>

* [HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily  (apache#5855)

* [HUDI-4170] Make user can use hoodie.datasource.read.paths to read necessary files (apache#5722)

* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug

* Remove a few files that were removed in upstream master

* Fix build issues

Co-authored-by: KnightChess <981159963@qq.com>
Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com>
Co-authored-by: huberylee <shibei.lh@foxmail.com>
Co-authored-by: watermelon12138 <49849410+watermelon12138@users.noreply.github.com>
Co-authored-by: y00617041 <yangxuan42@huawei.com>
Co-authored-by: Ibson <pushengli@163.com>
Co-authored-by: pusheng.li01 <pusheng.li01@liulishuo.com>
Co-authored-by: LiChuang <64473732+CodeCooker17@users.noreply.github.com>
Co-authored-by: Gary Li <yanjia.gary.li@gmail.com>
Co-authored-by: 吴祥平 <408317717@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: xicm <36392121+xicm@users.noreply.github.com>
Co-authored-by: xicm <xicm@asiainfo.com>
Co-authored-by: Wangyh <763941163@qq.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
Co-authored-by: Todd Gao <todd.gao.2013@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: qianchutao <72595723+qianchutao@users.noreply.github.com>
Co-authored-by: guanziyue <30882822+guanziyue@users.noreply.github.com>
Co-authored-by: Jin Xing <jinxing.corey@gmail.com>
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: BruceLin <brucekellan@gmail.com>
Co-authored-by: ForwardXu <forwardxu315@gmail.com>
Co-authored-by: aliceyyan <104287562+aliceyyan@users.noreply.github.com>
Co-authored-by: aliceyyan <aliceyyan@tencent.com>
Co-authored-by: Lanyuanxiaoyao <lanyuanxiaoyao@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com>
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Bo Cui <cuibo0108@163.com>
Co-authored-by: Xingcan Cui <xcui@wealthsimple.com>
Co-authored-by: wqwl611 <67826098+wqwl611@users.noreply.github.com>
Co-authored-by: wqwl611 <wqwl611@gmail.com>
Co-authored-by: 董可伦 <dongkelun01@inspur.com>
Co-authored-by: 陈浩 <bettermouse94@gmail.com>
Co-authored-by: Yuwei XIAO <ywxiaozero@gmail.com>
Co-authored-by: Shawy Geng <gengxiaoyu1996@gmail.com>
Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>
Co-authored-by: luokey <854194341@qq.com>
Co-authored-by: Zhaojing Yu <yuzhaojing@bytedance.com>
Co-authored-by: wangxianghu <wangxianghu@apache.org>
Co-authored-by: uday08bce <uday08bce@gmail.com>
Co-authored-by: YuangZhang <z_yuang@foxmail.com>
Co-authored-by: zhangyuang <zhangyuang@corp.netease.com>
Co-authored-by: felixYyu <felix2003@live.cn>
Co-authored-by: Heap <35054152+h1ap@users.noreply.github.com>
Co-authored-by: liujinhui <965147871@qq.com>
Co-authored-by: luoyajun <luoyajun1010@gmail.com>
Co-authored-by: 冯健 <fengjian428@gmail.com>
Co-authored-by: RexAn <anh131@126.com>
Co-authored-by: komao <masterwangzx@gmail.com>
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
Co-authored-by: Rex An <bonean131@gmail.com>
Co-authored-by: Carter Shanklin <cartershanklin@users.noreply.github.com>
Co-authored-by: 苏承祥 <scx_white@aliyun.com>
Co-authored-by: 苏承祥 <sucx@tuya.com>
Co-authored-by: Kumud Kumar Srivatsava Tirupati <kumudkumartirupati@users.noreply.github.com>
Co-authored-by: Qi Ji <qjqqyy@users.noreply.github.com>
Co-authored-by: leesf <490081539@qq.com>
Co-authored-by: Nicolas Paris <nicolas.paris@riseup.net>
Co-authored-by: Saisai Shao <sai.sai.shao@gmail.com>
Co-authored-by: marchpure <marchpure@126.com>
Co-authored-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: john.wick <john.wick@vipshop.com>
Co-authored-by: liuzhuang2017 <95120044+liuzhuang2017@users.noreply.github.com>
Co-authored-by: sandyfog <154525105@qq.com>
Co-authored-by: yanenze <34880077+yanenze@users.noreply.github.com>
Co-authored-by: yanenze <yanenze@keytop.com.cn>
Co-authored-by: superche <73096722+hechao-ustc@users.noreply.github.com>
Co-authored-by: superche <superche@tencent.com>
Co-authored-by: 5herhom <35916131+5herhom@users.noreply.github.com>
Co-authored-by: Shizhi Chen <107476116+chenshzh@users.noreply.github.com>
Co-authored-by: chenshizhi <chenshizhi@bilibili.com>
Co-authored-by: Alexander Trushev <42293632+trushev@users.noreply.github.com>
Co-authored-by: Forus <70357858+Forus0322@users.noreply.github.com>
Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com>
Co-authored-by: Shawn Chang <yxchang@amazon.com>
Co-authored-by: jiz <31836510+microbearz@users.noreply.github.com>
Co-authored-by: jiimmyzhan <jiimmyzhan@tencent.com>
Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>
Co-authored-by: xiarixiaoyao <mengtao0326@qq.com>
Co-authored-by: bschell <bdscheller@gmail.com>
Co-authored-by: Brandon Scheller <bschelle@amazon.com>
Co-authored-by: Teng <teng_huo@outlook.com>
Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>
Co-authored-by: wenningd <wenningding95@gmail.com>
Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: miomiocat <284487410@qq.com>
Co-authored-by: JerryYue-M <272614347@qq.com>
Co-authored-by: jerryyue <jerryyue@didiglobal.com>
Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: voonhous <voonhousu@gmail.com>
Co-authored-by: voonhou.su <voonhou.su@shopee.com>
Co-authored-by: shenjiayu17 <54424149+shenjiayu17@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Luning (Lucas) Wang <rsl4@foxmail.com>
Co-authored-by: Yann Byron <biyan900116@gmail.com>
Co-authored-by: Tim Brown <tim.brown126@gmail.com>
Co-authored-by: simonsssu <barley0806@gmail.com>
Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
* Revert master (apache#5925)

* Revert "udate"

This reverts commit 092e35c.

* Revert "[HUDI-3475] Initialize hudi table management module."

This reverts commit 4640a3b.

* [HUDI-4279] Strength the remote fs view lagging check when latest commit refresh is enabled (apache#5917)

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [minor] following 4270, add unit tests for the keys lost case (apache#5918)

* [HUDI-3508] Add call procedure for FileSystemViewCommand (apache#5929)

* [HUDI-3508] Add call procedure for FileSystemView

* minor

Co-authored-by: jiimmyzhan <jiimmyzhan@tencent.com>

* [HUDI-4299] Fix problem about hudi-example-java run failed on idea. (apache#5936)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups (apache#5941)

* [HUDI-4290] Fix fetchLatestBaseFiles to filter replaced filegroups

* Separate out incremental sync fsview test with clustering

* [HUDI-3509] Add call procedure for HoodieLogFileCommand (apache#5949)

Co-authored-by: zhanshaoxiong <jiimmyzhan@tencent.com>

* [HUDI-4273] Support inline schedule clustering for Flink stream (apache#5890)

* [HUDI-4273] Support inline schedule clustering for Flink stream

* delete deprecated clustering plan strategy and add clustering ITTest

* [HUDI-3735] TestHoodieSparkMergeOnReadTableRollback is flaky (apache#5874)

* [HUDI-4260] Change KEYGEN_CLASS_NAME without default value (apache#5877)

* Change KEYGEN_CLASS_NAME without default value

Co-authored-by: 854194341@qq.com <loukey_7821>

* [HUDI-3512] Add call procedure for StatsCommand (apache#5955)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* [TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)

* Revert "[TEST][DO_NOT_MERGE]fix random failed for ci (apache#5948)" (apache#5971)

This reverts commit e8fbd4d.

* [HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (apache#5966)

* Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)

* [HUDI-4296] Fix the bug that TestHoodieSparkSqlWriter.testSchemaEvolutionForTableType is flaky (apache#5973)

* [HUDI-3502] Support hdfs parquet import command based on Call Produce Command (apache#5956)

* [MINOR] Remove -T option from CI build (apache#5972)

* [HUDI-5246] Bumping mysql connector version due to security vulnerability (apache#5851)

* [HUDI-4309] Spark3.2 custom parser should not throw exception (apache#5947)

* [HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (apache#5959)

* [HUDI-4315] Do not throw exception in BaseSpark3Adapter#toTableIdentifier (apache#5957)

* [HUDI-3504] Support bootstrap command based on Call Produce Command (apache#5977)

* [HUDI-4311] Fix Flink lose data on some rollback scene (apache#5950)

* [HUDI-4291] Fix flaky TestCleanPlanExecutor#testKeepLatestFileVersions (apache#5930)

* [HUDI-3506] Add call procedure for CommitsCommand (apache#5974)

* [HUDI-3506] Add call procedure for CommitsCommand

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4325] fix spark sql procedure cause ParseException with semicolon (apache#5982)

* [HUDI-4325] fix saprk sql procedure cause ParseException with semicolon

* [HUDI-4333] fix HoodieFileIndex's listFiles method log print skipping percent NaN (apache#5990)

* [HUDI-4332] The current instant may be wrong under some extreme conditions in AppendWriteFunction. (apache#5988)

* [HUDI-4320] Make sure `HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED` could be specified by the writer (apache#5970)

Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller

* [HUDI-1176] Upgrade hudi to log4j2 (apache#5366)

* Move to log4j2

cr: https://code.amazon.com/reviews/CR-71010705

* Upgrade unit tests to log4j2

* update exclusion

Co-authored-by: Brandon Scheller <bschelle@amazon.com>

* [HUDI-4334] close SparkRDDWriteClient after usage in Create/Delete/RollbackSavepointsProcedure (apache#5994)

* [HUDI-1575] Claim RFC-56: Early Conflict Detection For Multi-writer (apache#6002)

Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>

* [MINOR] Make CLI 'commit rollback' using rollbackUsingMarkers false as default (apache#5174)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4331] Allow loading external config file from class loader (apache#5987)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4336] Fix records overwritten bug with binary primary key (apache#5996)

* [MINOR] Following apache#2070, Fix BindException when running tests on shared machines. (apache#5951)

* [HUDI-4346] Fix params not update BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED (apache#5999)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria… (apache#5907)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer

* add ut

Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>

* [HUDI-3984] Remove mandatory check of partiton path for cli command (apache#5458)

* [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (apache#5048)

Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file

* [HUDI-3953]Flink Hudi module should support low-level source and sink api (apache#5445)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [HUDI-4353] Column stats data skipping for flink (apache#6026)

* [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (apache#6012)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5854)

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-3511] Add call procedure for MetadataCommand (apache#6018)

* [HUDI-3730] Add ConfigTool#toMap UT (apache#6035)

Co-authored-by: voonhou.su <voonhou.su@shopee.com>

* [MINOR] Improve variable names (apache#6039)

* [HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (apache#4459)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (apache#6043)

* [HUDI-3836] Improve the way of fetching metadata partitions from table (apache#5286)

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (apache#6042)

* [HUDI-4356] Fix the error when sync hive in CTAS (apache#6029)

* [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (apache#5828)

* [HUDI-4357] Support flink 1.15.x (apache#6050)

* [HUDI-4152] Flink offline compaction supports compacting multiple compaction plans at once (apache#5677)

* [HUDI-4152] Flink offline compaction allows compacting multiple compaction plans at once

* [HUDI-4152] Fix exception for duplicated uid when multiple compaction plans are compacted

* [HUDI-4152] Provide UT & IT for compacting multiple compaction plans

* [HUDI-4152] Put multiple compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allows multiple instants by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

* [HUDI-4309] fix spark32 repartition error (apache#6033)

* [HUDI-4366] Synchronous cleaning for flink bounded source (apache#6051)

* [minor] following 4152, refactor the clazz about plan selection strategy (apache#6060)

* [HUDI-4367] Support copyToTable on call (apache#6054)

* [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (apache#5995)

* Fix updateTableParameters, which was not excluding partition columns, and the updateTableProperties boolean check

* Fix serde parameters getting overridden on table property update

* Remove stale syncConfig

* [HUDI-4276] Reconcile schema: inject null values for missing fields and add new fields (apache#6017)

* [HUDI-4276] Reconcile schema: inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>

* [HUDI-3500] Add call procedure for RepairsCommand (apache#6053)

* [HUDI-2150] Rename/Restructure configs for better modularity (apache#6061)

- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- Move hoodie.compaction.payload.class to HoodiePayloadConfig (see the sketch below)
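
A minimal sketch of what the move means in practice, assuming the config keys themselves are unchanged and only their owning config classes moved; the keys below are standard Hudi write configs:

```java
import java.util.Properties;

// Sketch: a key-based setup keeps working across the class restructure,
// since only the owning config classes changed, not the keys (assumed).
public class RestructuredConfigsExample {
  public static Properties buildWriteProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.clean.automatic", "true");        // now lives in HoodieCleanConfig
    props.setProperty("hoodie.cleaner.commits.retained", "10"); // now lives in HoodieCleanConfig
    props.setProperty("hoodie.keep.min.commits", "20");         // now lives in HoodieArchivalConfig
    props.setProperty("hoodie.keep.max.commits", "30");         // now lives in HoodieArchivalConfig
    props.setProperty("hoodie.compaction.payload.class",        // now lives in HoodiePayloadConfig
        "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload");
    return props;
  }
}
```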

* [MINOR] Bump xalan from 2.7.1 to 2.7.2 (apache#6062)

Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)

* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead
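
A hedged sketch of the replacement: the sync client is selected via the hive-sync mode option (values hms, jdbc, hiveql). Table names and paths below are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HiveSyncModeExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hive-sync-mode").master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/src_table"); // placeholder source
    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.hive_sync.enable", "true")
        // Replaces the removed use_jdbc flag: choose hms, jdbc, or hiveql.
        .option("hoodie.datasource.hive_sync.mode", "hms")
        .mode(SaveMode.Append)
        .save("/tmp/hudi_table");
    spark.stop();
  }
}
```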

* [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (apache#5695)

* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4323] Make database table names optional in sync tool (apache#6073)

* [HUDI-4323] Make database table names optional in sync tool
* Infer these properties from the table config

* [MINOR] Update RFCs status (apache#6078)

* [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT, Unabl… (apache#5937)

* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (apache#6080)

* [HUDI-4391] Incremental read from archived commits for flink (apache#6096)

* [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (apache#5436)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4393] Add marker file for target file when flink merge handle rolls over (apache#6103)

* [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (apache#6112)

* [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (apache#6106)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [MINOR] Disable TestHiveSyncGlobalCommitTool (apache#6119)

* [HUDI-4403] Fix the end input metadata for bounded source (apache#6116)

* [HUDI-4408] Reuse old rollover file as base file for flink merge handle (apache#6120)

* [HUDI-3503]  Add call procedure for CleanCommand (apache#6065)

* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <simonssu@tencent.com>

* [HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily  (apache#5855)

* [HUDI-4170] Let users use hoodie.datasource.read.paths to read necessary files (apache#5722)

* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug

* Fix file group count issue with metadata partitions (apache#5892)

* [HUDI-4098] Support HMS for flink HudiCatalog (apache#6082)

* [HUDI-4098] Support HMS for flink HudiCatalog

* [HUDI-4409] Improve LockManager wait logic when catching exceptions (apache#6122)

* [HUDI-4065] Add FileBasedLockProvider (apache#6071)

* [HUDI-4416] Default database path for hoodie hive catalog (apache#6136)

* [HUDI-4372] Enable metadata table by default for flink (apache#6066)

* [HUDI-4401] Skip HBase version check (apache#6114)

* Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

* [HUDI-4427] Add a computed column IT test (apache#6150)

* [HUDI-4146][RFC-55] Update config changes proposal (apache#6162)

* [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (apache#5428)

Currently, all Hudi relations bear a performance gap relative to Spark's HadoopFsRelation,
because the SchemaPruning optimization rule (pruning nested schemas)
is unfortunately predicated on the usage of HadoopFsRelation, meaning that it's
not applied when any other relation is used.

This change ports the rule to Hudi relations (MOR, Incremental, etc.)
by leveraging the HoodieSparkSessionExtensions mechanism,
injecting a modified version of the original SchemaPruning rule
adapted to work w/ Hudi's custom relations (see the sketch after the list below).

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation
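
For context, a minimal sketch (not part of this change) of enabling the extensions so the injected rules take effect; the table path is a placeholder:

```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch: registering Hudi's Spark session extensions so injected
// analysis/optimizer rules (such as the ported nested-schema-pruning rule)
// apply to Hudi's custom relations.
public class ExtensionsExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-extensions-demo")
        .master("local[*]")
        .config("spark.sql.extensions", "org.apache.hudi.HoodieSparkSessionExtension")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    spark.read().format("hudi").load("/tmp/hudi_table").createOrReplaceTempView("t");
    // Referencing only a nested leaf should now allow pruning the rest of the struct.
    spark.sql("SELECT address.city FROM t").explain(true);
    spark.stop();
  }
}
```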

* [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (apache#4915)

Currently, Hudi queries w/ Spark won't load the external
configurations. Say customers enabled metadata listing in their
global config file; they would then actually query w/o the
metadata feature enabled. This PR fixes this issue and allows
loading global configs during the Hudi reading phase
(see the sketch below).

Co-authored-by: Wenning Ding <wenningd@amazon.com>
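
A sketch of the resulting behavior, assuming the usual external config file convention (hudi-defaults.conf under the directory pointed to by the HUDI_CONF_DIR environment variable); the table path is a placeholder:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch (assumed semantics of this change): options in an external config
// file, e.g. $HUDI_CONF_DIR/hudi-defaults.conf containing
//     hoodie.metadata.enable=true
// are now also honored on the query side, so the reader below benefits from
// metadata-table-based file listing without passing the option explicitly.
public class GlobalConfigReadExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-global-config-read").master("local[*]").getOrCreate();
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/hudi_table");
    df.show();
    spark.stop();
  }
}
```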

* [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (apache#5470)

* [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (apache#6161)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (apache#6113)

Fixes missing bloom filters in the metadata table for non-partitioned tables, caused by incorrect record key generation: the wrong file names were used when generating the metadata payload for the bloom filter.

* [HUDI-4204] Fixing NPE with row writer path and with OCC (apache#5850)

* [HUDI-4247] Upgrading protocol buffers version for presto bundle (apache#5852)

* [MINOR] Fix result missing information issue in commits_compare Procedure (apache#6165)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4404] Fix insert into dynamic partition write misalignment (apache#6124)

* [MINOR] Fallback to default for hive-style partitioning, url-encoding configs (apache#6175)

- Fixes broken ITTestHoodieDemo#testParquetDemo

* [MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)

* [HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (apache#5523)

This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)

* [MINOR] Disable Flink compactor IT test (apache#6189)

* Revert "[MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)" (apache#6192)

This reverts commit d5c904e.

* [HUDI-3979] Optimize out mandatory columns when no merging is performed (apache#5430)

For MOR, when no merging is performed there is no point in reading either primary-key or pre-combine-key values (unless the query references them). Avoiding reading these can save the substantial resources otherwise wasted on them.

* [HUDI-4303] Use Hive sentinel value as partition default to avoid type cast issues (apache#5954)

* Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)" (apache#6160)

This reverts commit 046044c.

* [HUDI-4435] Fix Avro field not found issue introduced by Avro 1.10 (apache#6155)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4437] Fix test conflicts by clearing file system cache (apache#6123)

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4436] Invalidate cached table in Spark after write (apache#6159)

Co-authored-by: Ryan Pifer <rmpifer@umich.edu>

* [MINOR] Fix Call Procedure code style (apache#6186)

* Fix Call Procedure code style.
Co-authored-by: superche <superche@tencent.com>

* [MINOR] Bump CI timeout to 150m (apache#6198)

* [HUDI-4440] Treat bootstrapped table as non-partitioned in HudiFileIndex if partition column is missing from schema (apache#6163)

Co-authored-by: Ryan Pifer <rmpifer@umich.edu>

* [HUDI-4071] Make NONE sort mode the default for bulk insert (apache#6195)

* [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations  (apache#5708)

* [HUDI-4448] Remove the latest commit refresh for timeline server (apache#6179)

* [HUDI-4450] Revert the checkpoint abort notification (apache#6181)

* [HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (apache#6164)

Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4348] Fix merge into SQL data quality issues in concurrent scenarios (apache#6020)

* [HUDI-3510] Add sync validate procedure (apache#6200)

* [HUDI-3510] Add sync validate procedure

Co-authored-by: simonssu <simonssu@tencent.com>

* [MINOR] Fix typos in Spark client related classes (apache#6204)

* [HUDI-4456] Close FileSystem in SparkClientFunctionalTestHarness  (apache#6201)

* [MINOR] Only log stdout output for non-zero exit from commands in IT (apache#6199)

* [HUDI-4458] Add a converter cache for flink ColumnStatsIndices (apache#6205)

* [HUDI-4071] Match ROLLBACK_USING_MARKERS_ENABLE in sql as datasource (apache#6206)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4455] Improve test classes for TestHiveSyncTool (apache#6202)

Improve HiveTestService, HiveTestUtil, and related classes.

* [HUDI-4456] Clean up test resources (apache#6203)

* [HUDI-3884] Support archival beyond savepoint commits (apache#5837)


Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping  (apache#5746)

We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.

* [HUDI-4471] Relocate AWSDmsAvroPayload class to hudi-common

* [HUDI-4474] Infer metasync configs (apache#6217)

- Infer repeated sync configs from the original configs (see the sketch after this list):
  - `META_SYNC_BASE_FILE_FORMAT`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
  - `META_SYNC_ASSUME_DATE_PARTITION`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
  - `META_SYNC_DECODE_PARTITION`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`

As proposed in https://github.com/apache/hudi/blob/master/rfc/rfc-55/rfc-55.md#compatible-changes
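
An illustrative sketch of the inference pattern, not Hudi's actual implementation; the sync key below is a placeholder, and hoodie.table.base.file.format is assumed to be the key behind HoodieTableConfig.BASE_FILE_FORMAT:

```java
import java.util.Properties;

// Sketch: a sync config falls back to the corresponding table config
// when the user has not set it explicitly; explicit settings always win.
public class InferSyncConfig {
  static final String META_SYNC_BASE_FILE_FORMAT = "meta.sync.base.file.format"; // placeholder key
  static final String TABLE_BASE_FILE_FORMAT = "hoodie.table.base.file.format";  // assumed table config key

  static void inferBaseFileFormat(Properties props) {
    // Only infer when the sync-specific key is absent.
    if (!props.containsKey(META_SYNC_BASE_FILE_FORMAT) && props.containsKey(TABLE_BASE_FILE_FORMAT)) {
      props.setProperty(META_SYNC_BASE_FILE_FORMAT, props.getProperty(TABLE_BASE_FILE_FORMAT));
    }
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(TABLE_BASE_FILE_FORMAT, "PARQUET");
    inferBaseFileFormat(props);
    System.out.println(props.getProperty(META_SYNC_BASE_FILE_FORMAT)); // PARQUET
  }
}
```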

* [HUDI-4210] Create custom hbase index to solve data skew issue on hbase regions (apache#5797)

* [HUDI-3730] Keep metasync configs backward compatible (apache#6221)

* [HUDI-4469] Flip reuse flag to true in HoodieBackedTableMetadata to improve file listing (apache#6214)

* [HUDI-4186] Support Hudi with Spark 3.3.0 (apache#5943)

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4126] Disable file splits for Bootstrap real time queries (via InputFormat) (apache#6219)


Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4490] Make AWSDmsAvroPayload class backwards compatible (apache#6229)

Co-authored-by: Rahil Chertara <rchertar@amazon.com>

* [HUDI-4484] Add default lock config options for flink metadata table (apache#6222)

* [HUDI-4494] keep the fields' order when data is written out of order (apache#6233)

* [MINOR] Minor changes around Spark 3.3 support (apache#6231)

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4081][HUDI-4472] Addressing Spark SQL vs Spark DS performance gap (apache#6213)

* [HUDI-4495] Fix handling of S3 paths incompatible with java URI standards (apache#6237)

* [HUDI-4499] Tweak default retry times for flink metadata table lock (apache#6238)

* [HUDI-4221] Optimizing getAllPartitionPaths (apache#6234)

- Leveraging Spark parallelism for directory processing

* Moving to 0.13.0-SNAPSHOT on master branch.

* [HUDI-4504] Disable metadata table by default for flink (apache#6241)

* [HUDI-4505] Return instead of throwing if the lock file exists for FileSystemBasedLockProvider (apache#6242)

To avoid unnecessary exception throws

* [HUDI-4507] Improve file name extraction logic in metadata utils (apache#6250)

* [MINOR] Fix convertPathWithScheme tests (apache#6251)

* [MINOR] Add license header (apache#6247)

Add license header to TestConfigUtils

* [HUDI-4025] Add Presto and Trino query node to validate queries (apache#5578)

* Add Presto and Trino query nodes to hudi-integ-test
* Add yamls for query validation
* Add presto-jdbc and trino-jdbc to integ-test-bundle

* [HUDI-4518] Free lock if allocated but not acquired (apache#6272)

If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue apache#5702
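
A simplified sketch of the idea behind the fix; LockProvider below is a hypothetical stand-in for Hudi's lock provider API:

```java
// Hypothetical provider interface standing in for Hudi's lock provider API.
interface LockProvider {
  boolean tryLock(); // may allocate internal state even when acquisition fails
  void unlock();     // frees whatever was allocated
}

// Retry loop illustrating the fix: whenever an attempt does not end in the
// ACQUIRED state, free the allocation before retrying; previously the stale
// allocation made every subsequent retry fail.
public class LockRetrySketch {
  static boolean lockWithRetries(LockProvider provider, int maxRetries) {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      boolean acquired = false;
      try {
        acquired = provider.tryLock();
        if (acquired) {
          return true;
        }
      } finally {
        if (!acquired) {
          provider.unlock(); // the de-allocation this change adds
        }
      }
    }
    return false;
  }
}
```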

* [HUDI-4510] Fix config "hive_sync.metastore.uris" not taking effect in flink sql hive schema sync (apache#6257)

* [HUDI-3848] Fixing minor bug in listing based rollback request generation (apache#6244)

* [HUDI-4512][HUDI-4513] Fix bundle name for spark3 profile (apache#6261)

* [HUDI-4501] Throw exception when restore is attempted while hoodie.archive.beyond.savepoint is enabled (apache#6239)

* [HUDI-4516] Fix Task not serializable error when running HoodieCleaner after a failure (apache#6265)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* remove test resources (apache#6147)

Co-authored-by: root <root@TCN1004532-1.tcent.cn>

* [HUDI-4477] Adjust partition number of flink sink task (apache#6218)

Co-authored-by: lewinma <lewinma@tencent.com>

* [HUDI-4298] MOR table reading for base and log files loses the sequence of events (apache#6286)

* [HUDI-4298] MOR table reading for base and log files loses the sequence of events

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4525] Fixing Spark 3.3 `AvroSerializer` implementation (apache#6279)

* [HUDI-4447] Fix non-partitioned path extractor error when syncing meta (apache#6263)

* [HUDI-4520] Support qualified table 'db.table' in call procedures (apache#6274)

* [HUDI-4531] Wrong partition path for flink hive catalog when the partition fields are not last (apache#6292)

* [HUDI-4487] Support creating ro/rt table by spark sql (apache#6262)

* [HUDI-4533] Fix RunCleanProcedure's ArrayIndexOutOfBoundsException (apache#6293)

* [HUDI-4536] ClusteringOperator causes a NullPointerException when writing with BulkInsertWriterHelper in clustering (apache#6298)

* [HUDI-4385] Support online compaction in the flink batch mode write (apache#6093)

* [HUDI-4385] Support online compaction in the flink batch mode write

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4530] Fix default payload class in MOR being different from COW (apache#6288)

* [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (apache#6306)

* [HUDI-4544] Support hours-retained cleaning policy for flink (apache#6300)

* [HUDI-4547] Fix SortOperatorGen sort indices (apache#6309)

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4470] Remove spark dataPrefetch disabled prop in DefaultSource

* [HUDI-4540] Cover different table types in functional tests of Spark structured streaming (apache#6317)

* [HUDI-4514] optimize CTAS to adapt to saveAsTable api in different modes (apache#6295)

* [HUDI-4474] Fix inferring props for meta sync (apache#6310)

- HoodieConfig#setDefaults looks up declared fields, so it
  should be passed the static class for reflection; otherwise,
  subclasses of HoodieSyncConfig won't set defaults properly
  (see the sketch after this list).
- Pass all write client configs of deltastreamer to meta sync
- Make org.apache.hudi.hive.MultiPartKeysValueExtractor 
  default for deltastreamer, to align with SQL and flink
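
A minimal, self-contained demonstration of the reflection pitfall: Class#getDeclaredFields() returns only fields declared directly on the given class, so scanning a subclass misses the parent's config fields:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class DeclaredFieldsDemo {
  static class BaseConfig { public static final String BASE_KEY = "base"; }
  static class SubConfig extends BaseConfig { public static final String SUB_KEY = "sub"; }

  static void listStaticFields(Class<?> clazz) {
    System.out.println("Fields declared on " + clazz.getSimpleName() + ":");
    for (Field f : clazz.getDeclaredFields()) {
      if (Modifier.isStatic(f.getModifiers())) {
        System.out.println("  " + f.getName());
      }
    }
  }

  public static void main(String[] args) {
    listStaticFields(SubConfig.class);  // only SUB_KEY: the parent's BASE_KEY is invisible here
    listStaticFields(BaseConfig.class); // BASE_KEY shows up when the parent class itself is scanned
  }
}
```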

* [HUDI-4550] Fallback to listing based rollback for completed instant (apache#6313)

Ideally, rollback is not triggered for completed instants.
However, if it gets triggered due to some extraneous condition,
or is forced while the rollback strategy is still configured to be marker-based,
then fall back to listing-based rollback instead of failing, as sketched after the list below.

- CTOR changes in rollback plan and action executors.
- Change in condition to determine whether to use marker-based rollback.
- Added UT to cover the scenario.
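
An illustrative sketch of the decision change; the names are placeholders, not Hudi's exact API:

```java
// Markers exist only while an instant is inflight and are removed once it
// completes, so marker-based rollback is only attempted for inflight
// instants; completed instants fall back to listing-based rollback.
public class RollbackStrategySketch {
  static boolean shouldUseMarkerBasedRollback(boolean markerStrategyConfigured,
                                              boolean instantCompleted) {
    return markerStrategyConfigured && !instantCompleted;
  }

  public static void main(String[] args) {
    // Completed instant: even with the marker strategy configured, use listing-based rollback.
    System.out.println(shouldUseMarkerBasedRollback(true, true));  // false
    System.out.println(shouldUseMarkerBasedRollback(true, false)); // true
  }
}
```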

* [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value (apache#6248)

- Added FourToFiveUpgradeHandler to detect hudi tables with a "default" partition and throw an exception.
- Added a new write config ("hoodie.skip.default.partition.validation") which, when enabled, bypasses the above validation. If users have a hudi table where the "default" partition was created intentionally and not as a sentinel, they can enable this config to get past the validation (example below).
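
A hedged example of opting out of the check using the config key quoted above; all other options and paths are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SkipDefaultPartitionValidationExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("skip-default-partition-validation").master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();
    // Placeholder source data; any Dataset<Row> with the expected fields works.
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/hudi_table");
    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.recordkey.field", "uuid") // placeholder field
        .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder field
        // Bypass the 4-to-5 upgrade validation for an intentional "default" partition.
        .option("hoodie.skip.default.partition.validation", "true")
        .mode(SaveMode.Append)
        .save("/tmp/hudi_table");
    spark.stop();
  }
}
```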

* [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis (apache#6307)

* [HUDI-4534] Fixing upgrade to reload Metaclient for deltastreamer writes (apache#6296)

* [HUDI-4517] If no marker type file, fallback to timeline based marker (apache#6266)

- If the MARKERS.type file is not present, the logic assumes that direct markers are stored, which causes read failures in certain cases even when timeline-server-based markers are enabled. This PR handles the failure by falling back to timeline-based markers in such cases (sketched below).
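
An illustrative sketch of the fallback; the types and methods are placeholders, not Hudi's marker API:

```java
import java.io.IOException;
import java.util.Set;

public class MarkerFallbackSketch {
  interface MarkerReader {
    Set<String> readAll() throws IOException;
  }

  // When MARKERS.type is absent and reading direct markers fails, retry with
  // the timeline-server-based mechanism instead of surfacing the failure.
  static Set<String> readMarkers(MarkerReader direct, MarkerReader timelineBased) throws IOException {
    try {
      return direct.readAll();
    } catch (IOException e) {
      // The type file may simply be missing while timeline-server-based
      // markers are actually in use.
      return timelineBased.readAll();
    }
  }
}
```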

* [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)

- Adding request retry to RemoteHoodieTableFileSystemView. Users can enable it using the newly added configs.

* [HUDI-4464] Clear warnings in Azure CI (apache#6210)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* [MINOR] Update PR description template (apache#6323)

* [HUDI-4508] Repair the exception when reading optimized query for mor in hive and presto/trino (apache#6254)

In a MOR table, a file slice may have only log files and no base file
before the file slice is compacted. In this case, a read-optimized
query will match the condition !baseFileOpt.isPresent() in HoodieCopyOnWriteTableInputFormat.createFileStatusUnchecked()
and throw IllegalStateException.

Instead of throwing an exception,
it is more suitable to return nothing for such a file slice (sketched below).

Co-authored-by: sivabalan <n.siva.b@gmail.com>
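
A simplified sketch of the behavioral change, using placeholder types:

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// For a file slice that only has log files, a read-optimized query now
// yields no file statuses instead of throwing IllegalStateException.
public class ReadOptimizedSketch {
  static <T> List<T> fileStatusesForSlice(Optional<T> baseFileOpt) {
    if (!baseFileOpt.isPresent()) {
      // Before the fix: throw new IllegalStateException(...);
      return Collections.emptyList(); // uncompacted slice contributes nothing to the RO view
    }
    return Collections.singletonList(baseFileOpt.get());
  }

  public static void main(String[] args) {
    System.out.println(fileStatusesForSlice(Optional.empty()));            // []
    System.out.println(fileStatusesForSlice(Optional.of("base.parquet"))); // [base.parquet]
  }
}
```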

* [HUDI-4548] Unpack the column max/min to string instead of Utf8 for MOR table (apache#6311)

* [HUDI-4447] Fix SQL metasync when performing delete table operation (apache#6180)

* [HUDI-4424] Add new compaction trigger strategy: NUM_COMMITS_AFTER_REQ… (apache#6144)

* [MINOR] improve flink dummySink's parallelism (apache#6325)

* [HUDI-4568] Shade dropwizard metrics-core in hudi-aws-bundle (apache#6327)

* [HUDI-4572] Fix 'Not a valid schema field: ts' error in HoodieFlinkCompactor if precombine field is not ts (apache#6331)

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4570] Fix hive sync path error due to reuse of storage descriptors. (apache#6329)

* [HUDI-4571] Fix partition extractor infer function when partition field mismatch (apache#6333)

Infer META_SYNC_PARTITION_FIELDS and 
META_SYNC_PARTITION_EXTRACTOR_CLASS 
from hoodie.table.partition.fields first. 
If not set, then from hoodie.datasource.write.partitionpath.field.

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
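
A sketch of the inference order described above; the keys are copied from the text, while the helper itself is illustrative, not Hudi's code:

```java
import java.util.Properties;

// Prefer the table config's partition fields, then the writer's
// partition path field.
public class PartitionFieldsInference {
  static String inferPartitionFields(Properties props) {
    String fromTableConfig = props.getProperty("hoodie.table.partition.fields");
    if (fromTableConfig != null && !fromTableConfig.isEmpty()) {
      return fromTableConfig;
    }
    return props.getProperty("hoodie.datasource.write.partitionpath.field", "");
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.datasource.write.partitionpath.field", "region,dt");
    System.out.println(inferPartitionFields(props)); // region,dt
  }
}
```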

* [HUDI-4570] Add test for updating multiple partitions in hive sync (apache#6340)

* [MINOR] Fix wrong key to determine sync sql cascade (apache#6339)

* [HUDI-4581] Claim RFC-58 for data skipping integration with query engines (apache#6346)

* [HUDI-4577] Adding test coverage for `DELETE FROM`, Spark Quickstart guide (apache#6318)

* [HUDI-4556] Improve functional test coverage of column stats index (apache#6319)

* [HUDI-4558] Fix 'hoodie.table.keygenerator.class' lost in hoodie.properties (apache#6320)

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4543] Support natural order when table schema contains a field named 'ts' (apache#6246)

* be able to disable precombine field when table schema contains a field named ts

Co-authored-by: jian yonghua <jianyonghua@163.com>

* [HUDI-4569][RFC-58] Claim RFC-58 for adding a new feature named 'Multiple event_time Fields Latest Verification in a Single Table' for Hudi (apache#6328)

Co-authored-by: XinyaoTian <leontian1024@gmail.com>

* [HUDI-3503] Support more features in call procedure CleanCommand (apache#6353)

* [HUDI-4590] Add hudi-aws dependency to hudi-flink-bundle. (apache#6356)

* [MINOR] Fix potential NPE in spark writer (apache#6363)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* fix bug in cli show fsview all (apache#6314)

* [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency (apache#6228)

* [HUDI-4611] Fix the duplicate creation of config in HoodieFlinkStreamer (apache#6369)

Co-authored-by: linfey <linfey2021@gmail.com>

* [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or archived for MOR table (apache#6141)

* Spark supports reading archived commits from MOR tables for incremental queries

* [MINOR] Fix progress field calculation logic in HoodieLogRecordReader (apache#6291)

* [HUDI-4608] Fix upgrade command in Hudi CLI (apache#6374)

* [HUDI-4609] Improve usability of upgrade/downgrade commands in Hudi CLI (apache#6377)

* [HUDI-4574] Fixed timeline based marker thread safety issue (apache#6383)

* fixed timeline-based markers thread safety issue
* add documentation for TimelineBasedMarkers thread safety issues

* [HUDI-4621] Add validation that bucket index fields should be subset of primary keys (apache#6396)

* check bucket index fields

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4354] Add --force-empty-sync flag to deltastreamer (apache#6027)

* [HUDI-4601] Read error from MOR table after compaction with timestamp partitioning (apache#6365)

* read error from mor after compaction

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [MINOR] Update DOAP with 0.12.0 Release (apache#6413)

* [HUDI-4529] Tweak some default config options for flink (apache#6287)

* [HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)

* [HUDI-4551] Tweak the default parallelism of flink pipeline to execution env  parallelism (apache#6312)

* [MINOR] Improve code style of CLI Command classes (apache#6427)

* [HUDI-3625] Claim RFC-60 for Federated Storage Layer (apache#6440)

* [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar (apache#6386)

- Adding PulsarSource to DeltaStreamer to support ingesting from Apache Pulsar.
- The current implementation of PulsarSource relies on "pulsar-spark-connector" to ingest using Spark instead of building a similar pipeline from scratch.

* [HUDI-3579] Add timeline commands in hudi-cli (apache#5139)

* [HUDI-4638] Rename payload clazz and preCombine field options for flink sql (apache#6434)

* Revert "[HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)" (apache#6449)

This reverts commit 9055b2f.

* [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set (apache#6443)

* [HUDI-4644] Change default flink profile to 1.15.x (apache#6445)

* [HUDI-4678] Claim RFC-61 for Snapshot view management (apache#6461)

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC (apache#6459)

* [HUDI-4676] infer cleaner policy when write concurrency mode is OCC
Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4683] Use enum class value for default value in flink options (apache#6453)

* [HUDI-4584] Cleaning up Spark utilities (apache#6351)

Cleans up Spark utilities and removes duplication

* [HUDI-4686] Flip option 'write.ignore.failed' to default false (apache#6467)

Also fix the flaky test

* [HUDI-4515] Fix savepoints being cleaned under the keep-latest-versions policy (apache#6267)

* [HUDI-4637] Release thread in RateLimiter doesn't get terminated (apache#6433)

* [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core (apache#6481)

* [HUDI-4687] Add show_invalid_parquet procedure (apache#6480)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* [HUDI-4584] Fixing `SQLConf` not being propagated to executor (apache#6352)

Fixes `HoodieSparkUtils.createRDD` to make sure `SQLConf` is properly propagated to the executor (required by `AvroSerializer`)

* [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies (apache#6170)

* Merge OSS master

* resolve build issues

* fix checkstyle issue

* [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false (apache#6450)

* [HUDI-4713] Fix flaky ITTestHoodieDataSource#testAppendWrite (apache#6490)

* add back in internal customization for s3EventsHoodieIncrSource

* [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (apache#6494)

* Revert "[HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)" (apache#6501)

This reverts commit 660177b.

Signed-off-by: LinMingQiang <1356469429@qq.com>
Signed-off-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: Zhaojing Yu <yuzhaojing@bytedance.com>
Co-authored-by: LinMingQiang <1356469429@qq.com>
Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com>
Co-authored-by: jiz <31836510+microbearz@users.noreply.github.com>
Co-authored-by: jiimmyzhan <jiimmyzhan@tencent.com>
Co-authored-by: Forus <70357858+Forus0322@users.noreply.github.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: xi chaomin <36392121+xicm@users.noreply.github.com>
Co-authored-by: luokey <854194341@qq.com>
Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>
Co-authored-by: xiarixiaoyao <mengtao0326@qq.com>
Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: ForwardXu <forwardxu315@gmail.com>
Co-authored-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: leesf <490081539@qq.com>
Co-authored-by: 吴祥平 <408317717@qq.com>
Co-authored-by: superche <73096722+hechao-ustc@users.noreply.github.com>
Co-authored-by: superche <superche@tencent.com>
Co-authored-by: KnightChess <981159963@qq.com>
Co-authored-by: BruceLin <brucekellan@gmail.com>
Co-authored-by: bschell <bdscheller@gmail.com>
Co-authored-by: Brandon Scheller <bschelle@amazon.com>
Co-authored-by: Teng <teng_huo@outlook.com>
Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com>
Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: wenningd <wenningding95@gmail.com>
Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: luoyajun <luoyajun1010@gmail.com>
Co-authored-by: RexAn <bonean131@gmail.com>
Co-authored-by: komao <masterwangzx@gmail.com>
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
Co-authored-by: miomiocat <284487410@qq.com>
Co-authored-by: JerryYue-M <272614347@qq.com>
Co-authored-by: jerryyue <jerryyue@didiglobal.com>
Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: voonhous <voonhousu@gmail.com>
Co-authored-by: voonhou.su <voonhou.su@shopee.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: xicm <xicm@asiainfo.com>
Co-authored-by: 董可伦 <dongkelun01@inspur.com>
Co-authored-by: shenjiayu17 <54424149+shenjiayu17@users.noreply.github.com>
Co-authored-by: Lanyuanxiaoyao <lanyuanxiaoyao@gmail.com>
Co-authored-by: 苏承祥 <scx_white@aliyun.com>
Co-authored-by: Kumud Kumar Srivatsava Tirupati <kumudkumartirupati@users.noreply.github.com>
Co-authored-by: liujinhui <965147871@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 冯健 <fengjian428@gmail.com>
Co-authored-by: Luning (Lucas) Wang <rsl4@foxmail.com>
Co-authored-by: Yann Byron <biyan900116@gmail.com>
Co-authored-by: simonsssu <barley0806@gmail.com>
Co-authored-by: Bo Cui <cuibo0108@163.com>
Co-authored-by: Rahil Chertara <rchertar@amazon.com>
Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com>
Co-authored-by: Ryan Pifer <rmpifer@umich.edu>
Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: simonssu <simonssu@tencent.com>
Co-authored-by: Vander <30547463+vanderzh@users.noreply.github.com>
Co-authored-by: Dongwook Kwon <dongwook@amazon.com>
Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com>
Co-authored-by: Shawn Chang <yxchang@amazon.com>
Co-authored-by: 5herhom <35916131+5herhom@users.noreply.github.com>
Co-authored-by: root <root@TCN1004532-1.tcent.cn>
Co-authored-by: F7753 <mabiaocas@gmail.com>
Co-authored-by: lewinma <lewinma@tencent.com>
Co-authored-by: Nicholas Jiang <programgeek@163.com>
Co-authored-by: Yonghua Jian_deepnova <47289660@qq.com>
Co-authored-by: 5herhom <543872547@qq.com>
Co-authored-by: RexXiong <lvshuang.tb@gmail.com>
Co-authored-by: Pratyaksh Sharma <pratyaksh13@gmail.com>
Co-authored-by: wuwenchi <wuwenchihdu@hotmail.com>
Co-authored-by: 吴文池 <wuwenchi@deepexi.com>
Co-authored-by: jian yonghua <jianyonghua@163.com>
Co-authored-by: Xinyao Tian (Richard) <31195026+XinyaoTian@users.noreply.github.com>
Co-authored-by: XinyaoTian <leontian1024@gmail.com>
Co-authored-by: vamshigv <107005799+vamshigv@users.noreply.github.com>
Co-authored-by: feiyang_deepnova <736320652@qq.com>
Co-authored-by: linfey <linfey2021@gmail.com>
Co-authored-by: novisfff <62633257+novisfff@users.noreply.github.com>
Co-authored-by: Qi Ji <qjqqyy@users.noreply.github.com>
Co-authored-by: hehuiyuan <471627698@qq.com>
Co-authored-by: Zouxxyy <zouxxyy@qq.com>
vinishjail97 added a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
* [HUDI-3984] Remove mandatory check of partiton path for cli command (apache#5458)

* [HUDI-3634] Could read empty or partial HoodieCommitMetaData in downstream if using HDFS (apache#5048)

Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file

* [HUDI-3953]Flink Hudi module should support low-level source and sink api (apache#5445)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [HUDI-4353] Column stats data skipping for flink (apache#6026)

* [HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (apache#6012)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5854)

* [HUDI-3730] Improve meta sync class design and hierarchies (apache#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-3511] Add call procedure for MetadataCommand (apache#6018)

* [HUDI-3730] Add ConfigTool#toMap UT (apache#6035)

Co-authored-by: voonhou.su <voonhou.su@shopee.com>

* [MINOR] Improve variable names (apache#6039)

* [HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table partitions through a standalone job. (apache#4459)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (apache#6043)

* [HUDI-3836] Improve the way of fetching metadata partitions from table (apache#5286)

Co-authored-by: xicm <xicm@asiainfo.com>

* [HUDI-4359] Support show_fs_path_detail command on Call Produce Command (apache#6042)

* [HUDI-4356] Fix the error when sync hive in CTAS (apache#6029)

* [HUDI-4219] Merge Into when update expression "col=s.col+2" on precombine cause exception (apache#5828)

* [HUDI-4357] Support flink 1.15.x (apache#6050)

* [HUDI-4152] Flink offline compaction support compacting multi compaction plan at once (apache#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provider UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

* [HUDI-4309] fix spark32 repartition error (apache#6033)

* [HUDI-4366] Synchronous cleaning for flink bounded source (apache#6051)

* [minor] following 4152, refactor the clazz about plan selection strategy (apache#6060)

* [HUDI-4367] Support copyToTable on call (apache#6054)

* [HUDI-4335] Bug fixes in AWSGlueCatalogSyncClient post schema evolution. (apache#5995)

* fix for updateTableParameters which is not excluding partition columns and updateTableProperties boolean check

* Fix - serde parameters getting overrided on table property update

* removing stale syncConfig

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (apache#6017)

* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>

* [HUDI-3500] Add call procedure for RepairsCommand (apache#6053)

* [HUDI-2150] Rename/Restructure configs for better modularity (apache#6061)

- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- hoodie.compaction.payload.class move this to HoodiePayloadConfig

* [MINOR] Bump xalan from 2.7.1 to 2.7.2 (apache#6062)

Bumps xalan from 2.7.1 to 2.7.2.

---
updated-dependencies:
- dependency-name: xalan:xalan
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)

* [HUDI-4324] Remove use_jdbc config from hudi sync
* Users should use HIVE_SYNC_MODE instead

* [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (apache#5695)

* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies

Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4323] Make database table names optional in sync tool (apache#6073)

* [HUDI-4323] Make database table names optional in sync tool
* Infer from these properties from the table config

* [MINOR] Update RFCs status (apache#6078)

* [HUDI-4298] When reading the mor table with QUERY_TYPE_SNAPSHOT,Unabl… (apache#5937)

* [HUDI-4298] Add test case for reading mor table

Signed-off-by: LinMingQiang <1356469429@qq.com>

* [HUDI-4379] Bump Flink versions to 1.14.5 and 1.15.1 (apache#6080)

* [HUDI-4391] Incremental read from archived commits for flink (apache#6096)

* [RFC-51] [HUDI-3478] Hudi to support Change-Data-Capture (apache#5436)



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4393] Add marker file for target file when flink merge handle rolls over (apache#6103)

* [HUDI-4399][RFC-57] Claim RFC 57 for DeltaStreamer proto support (apache#6112)

* [HUDI-4397] Flink Inline Cluster and Compact plan distribute strategy changed from rebalance to hash to avoid potential multiple threads accessing the same file (apache#6106)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>

* [MINOR] Disable TestHiveSyncGlobalCommitTool (apache#6119)

* [HUDI-4403] Fix the end input metadata for bounded source (apache#6116)

* [HUDI-4408] Reuse old rollover file as base file for flink merge handle (apache#6120)

* [HUDI-3503]  Add call procedure for CleanCommand (apache#6065)

* [HUDI-3503] Add call procedure for CleanCommand
Co-authored-by: simonssu <simonssu@tencent.com>

* [HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily  (apache#5855)

* [HUDI-4170] Make user can use hoodie.datasource.read.paths to read necessary files (apache#5722)

* Rebase codes

* Move listFileSlices to HoodieBaseRelation

* Fix review

* Fix style

* Fix bug

* Fix file group count issue with metadata partitions (apache#5892)

* [HUDI-4098] Support HMS for flink HudiCatalog (apache#6082)

* [HUDI-4098]Support HMS for flink HudiCatalog

* [HUDI-4409] Improve LockManager wait logic when catch exception (apache#6122)

* [HUDI-4065] Add FileBasedLockProvider (apache#6071)

* [HUDI-4416] Default database path for hoodie hive catalog (apache#6136)

* [HUDI-4372] Enable matadata table by default for flink (apache#6066)

* [HUDI-4401] Skip HBase version check (apache#6114)

* Disable EmrFS file metadata caching and EMR Spark's data prefetcher feature

* [HUDI-4427] Add a computed column IT test (apache#6150)

* [HUDI-4146][RFC-55] Update config changes proposal (apache#6162)

* [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (apache#5428)

Currently, all Hudi Relations bear performance gap relative to Spark's HadoopFsRelation 
and the reason to that is SchemaPruning optimization rule (pruning nested schemas) 
that is unfortunately predicated on usage of HadoopFsRelation, meaning that it's 
not applied in cases when any other relation is used.

This change is porting this rule to Hudi relations (MOR, Incremental, etc) 
by the virtue of leveraging HoodieSparkSessionExtensions mechanism 
injecting modified version of the original SchemaPruning rule 
that is adopted to work w/ Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation

* [HUDI-3764] Allow loading external configs while querying Hudi tables with Spark (apache#4915)

Currently when doing Hudi queries w/ Spark, it won't 
load the external configurations. Say if customers enabled 
metadata listing in their global config file, then this would 
let them actually query w/o metadata feature enabled. 
This PR fixes this issue and allows loading global 
configs during the Hudi reading phase.

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (apache#5470)

* [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (apache#6161)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (apache#6113)

Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.

* [HUDI-4204] Fixing NPE with row writer path and with OCC (apache#5850)

* [HUDI-4247] Upgrading protocol buffers version for presto bundle (apache#5852)

* [MINOR] Fix result missing information issue in commits_compare Procedure (apache#6165)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4404] Fix insert into dynamic partition write misalignment (apache#6124)

* [MINOR] Fallback to default for hive-style partitioning, url-encoding configs (apache#6175)

- Fixes broken ITTestHoodieDemo#testParquetDemo

* [MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)

* [HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (apache#5523)

This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)

* [MINOR] Disable Flink compactor IT test (apache#6189)

* Revert "[MINOR] Fix CI issue with TestHiveSyncTool (apache#6110)" (apache#6192)

This reverts commit d5c904e.

* [HUDI-3979] Optimize out mandatory columns when no merging is performed (apache#5430)

For MOR, when no merging is performed there is no point in reading either primary-key or pre-combine-key values (unless query is referencing these). Avoiding reading these allows to potentially save substantial resources wasted for reading it out.

* [HUDI-4303] Use Hive sentinel value as partition default to avoid type caste issues (apache#5954)

* Revert "[HUDI-4324] Remove use_jdbc config from hudi sync (apache#6072)" (apache#6160)

This reverts commit 046044c.

* [HUDI-4435] Fix Avro field not found issue introduced by Avro 1.10 (apache#6155)

Co-authored-by: Wenning Ding <wenningd@amazon.com>

* [HUDI-4437] Fix test conflicts by clearing file system cache (apache#6123)

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4436] Invalidate cached table in Spark after write (apache#6159)

Co-authored-by: Ryan Pifer <rmpifer@umich.edu>

* [MINOR] Fix Call Procedure code style (apache#6186)

* Fix Call Procedure code style.
Co-authored-by: superche <superche@tencent.com>

* [MINOR] Bump CI timeout to 150m (apache#6198)

* [HUDI-4440] Treat boostrapped table as non-partitioned in HudiFileIndex if partition column is missing from schema (apache#6163)

Co-authored-by: Ryan Pifer <rmpifer@umich.edu>

* [HUDI-4071] Make NONE sort mode as default for bulk insert (apache#6195)

* [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations  (apache#5708)

* [HUDI-4448] Remove the latest commit refresh for timeline server (apache#6179)

* [HUDI-4450] Revert the checkpoint abort notification (apache#6181)

* [HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (apache#6164)

Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

* [HUDI-4348] fix merge into sql data quality in concurrent scene (apache#6020)

* [HUDI-3510] Add sync validate procedure (apache#6200)

* [HUDI-3510] Add sync validate procedure

Co-authored-by: simonssu <simonssu@tencent.com>

* [MINOR] Fix typos in Spark client related classes (apache#6204)

* [HUDI-4456] Close FileSystem in SparkClientFunctionalTestHarness  (apache#6201)

* [MINOR] Only log stdout output for non-zero exit from commands in IT (apache#6199)

* [HUDI-4458] Add a converter cache for flink ColumnStatsIndices (apache#6205)

* [HUDI-4071] Match ROLLBACK_USING_MARKERS_ENABLE in sql as datasource (apache#6206)

Co-authored-by: superche <superche@tencent.com>

* [HUDI-4455] Improve test classes for TestHiveSyncTool (apache#6202)

Improve HiveTestService, HiveTestUtil, and related classes.

* [HUDI-4456] Clean up test resources (apache#6203)

* [HUDI-3884] Support archival beyond savepoint commits (apache#5837)


Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping  (apache#5746)

We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.

* [HUDI-4471] Relocate AWSDmsAvroPayload class to hudi-common

* [HUDI-4474] Infer metasync configs (apache#6217)

- infer repeated sync configs from original configs
  - `META_SYNC_BASE_FILE_FORMAT`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
  - `META_SYNC_ASSUME_DATE_PARTITION`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
  - `META_SYNC_DECODE_PARTITION`
    - infer from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`
    - infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`

As proposed in https://github.com/apache/hudi/blob/master/rfc/rfc-55/rfc-55.md#compatible-changes

* [HUDI-4210] Create custom hbase index to solve data skew issue on hbase regions (apache#5797)

* [HUDI-3730] Keep metasync configs backward compatible (apache#6221)

* [HUDI-4469] Flip reuse flag to true in HoodieBackedTableMetadata to improve file listing (apache#6214)

* [HUDI-4186] Support Hudi with Spark 3.3.0 (apache#5943)

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4126] Disable file splits for Bootstrap real time queries (via InputFormat) (apache#6219)


Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4490] Make AWSDmsAvroPayload class backwards compatible (apache#6229)

Co-authored-by: Rahil Chertara <rchertar@amazon.com>

* [HUDI-4484] Add default lock config options for flink metadata table (apache#6222)

* [HUDI-4494] keep the fields' order when data is written out of order (apache#6233)

* [MINOR] Minor changes around Spark 3.3 support (apache#6231)

Co-authored-by: Shawn Chang <yxchang@amazon.com>

* [HUDI-4081][HUDI-4472] Addressing Spark SQL vs Spark DS performance gap (apache#6213)

* [HUDI-4495] Fix handling of S3 paths incompatible with java URI standards (apache#6237)

* [HUDI-4499] Tweak default retry times for flink metadata table lock (apache#6238)

* [HUDI-4221] Optimzing getAllPartitionPaths  (apache#6234)

- Levering spark par for dir processing

* Moving to 0.13.0-SNAPSHOT on master branch.

* [HUDI-4504] Disable metadata table by default for flink (apache#6241)

* [HUDI-4505] Returns instead of throws if lock file exists for FileSystemBasedLockProvider (apache#6242)

To avoid unnecessary exception throws

* [HUDI-4507] Improve file name extraction logic in metadata utils (apache#6250)

* [MINOR] Fix convertPathWithScheme tests (apache#6251)

* [MINOR] Add license header (apache#6247)

Add license header to TestConfigUtils

* [HUDI-4025] Add Presto and Trino query node to validate queries (apache#5578)

* Add Presto and Trino query nodes to hudi-integ-test
* Add yamls for query validation
* Add presto-jdbc and trino-jdbc to integ-test-bundle

* [HUDI-4518] Free lock if allocated but not acquired (apache#6272)

If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue apache#5702

* [HUDI-4510] Repair config "hive_sync.metastore.uris" in flink sql hive schema sync is not effective (apache#6257)

* [HUDI-3848] Fixing minor bug in listing based rollback request generation (apache#6244)

* [HUDI-4512][HUDI-4513] Fix bundle name for spark3 profile (apache#6261)

* [HUDI-4501] Throwing exception when restore is attempted with hoodie.arhive.beyond.savepoint is enabled (apache#6239)

* [HUDI-4516] fix Task not serializable error when run HoodieCleaner after one failure (apache#6265)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* remove test resources (apache#6147)

Co-authored-by: root <root@TCN1004532-1.tcent.cn>

* [HUDI-4477] Adjust partition number of flink sink task (apache#6218)

Co-authored-by: lewinma <lewinma@tencent.com>

* [HUDI-4298] Mor table reading for base and log files lost sequence of events (apache#6286)

* [HUDI-4298] Mor table reading for base and log files lost sequence of events

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4525] Fixing Spark 3.3 `AvroSerializer` implementation (apache#6279)

* [HUDI-4447] fix no partitioned path extractor error when sync meta (apache#6263)

* [HUDI-4520] Support qualified table 'db.table' in call procedures (apache#6274)

* [HUDI-4531] Wrong partition path for flink hive catalog when the partition fields are not in the last (apache#6292)

* [HUDI-4487] support to create ro/rt table by spark sql (apache#6262)

* [HUDI-4533] Fix RunCleanProcedure's ArrayIndexOutOfBoundsException (apache#6293)

* [HUDI-4536] ClusteringOperator causes the NullPointerException when writing with BulkInsertWriterHelper in clustering (apache#6298)

* [HUDI-4385] Support online compaction in the flink batch mode write (apache#6093)

* [HUDI-4385] Support online compaction in the flink batch mode write

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4530] fix default payloadclass in mor is different with cow (apache#6288)

* [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (apache#6306)

* [HUDI-4544] support retain hour cleaning policy for flink (apache#6300)

* [HUDI-4547] Fix SortOperatorGen sort indices (apache#6309)

Signed-off-by: HunterXHunter <1356469429@qq.com>

* [HUDI-4470] Remove spark dataPrefetch disabled prop in DefaultSource

* [HUDI-4540] Cover different table types in functional tests of Spark structured streaming (apache#6317)

* [HUDI-4514] optimize CTAS to adapt to saveAsTable api in different modes (apache#6295)

* [HUDI-4474] Fix inferring props for meta sync (apache#6310)

- HoodieConfig#setDefaults looks up declared fields, so 
  should pass static class for reflection, otherwise, subclasses 
  of HoodieSyncConfig won't set defaults properly.
- Pass all write client configs of deltastreamer to meta sync
- Make org.apache.hudi.hive.MultiPartKeysValueExtractor 
  default for deltastreamer, to align with SQL and flink

* [HUDI-4550] Fallback to listing based rollback for completed instant (apache#6313)

Ideally, rollback is not triggered for completed instants. 
However, if it gets triggered due to some extraneous condition 
or forced while rollback strategy still configured to be marker-based, 
then fallback to listing-based rollback instead of failing.

- CTOR changes in rollback plan and action executors.
- Change in condition to determine whether to use marker-based rollback.
- Added UT to cover the scenario.

* [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value (apache#6248)

- Added FourToFiveUpgradeHandler to detect hudi tables with "default" partition and throwing exception.
- Added a new write config ("hoodie.skip.default.partition.validation") when enabled, will bypass the above validation. If users have a hudi table where "default" partition was created intentionally and not as sentinel, they can enable this config to get past the validation.

* [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis (apache#6307)

* [HUDI-4534] Fixing upgrade to reload Metaclient for deltastreamer writes (apache#6296)

* [HUDI-4517] If no marker type file, fallback to timeline based marker (apache#6266)

- If MARKERS.type file is not present, the logic assumes that the direct markers are stored, which causes the read failure in certain cases even where timeline server based marker is enabled. This PR handles the failure by falling back to timeline based marker in such cases.

* [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)

- Adding request retry to RemoteHoodieTableFileSystemView. Users can enable using the new configs added.

* [HUDI-4464] Clear warnings in Azure CI (apache#6210)


Co-authored-by: jian.feng <jian.feng@shopee.com>

* [MINOR] Update PR description template (apache#6323)

* [HUDI-4508] Repair the exception when reading optimized query for mor in hive and presto/trino (apache#6254)

In MOR table, file slice may just have log file but no base file, 
before the file slice is compacted. In this case, read-optimized 
query will match the condition !baseFileOpt.isPresent() in HoodieCopyOnWriteTableInputFormat.createFileStatusUnchecked() 
and throw IllegalStateException.

Instead of throwing exception, 
it is more suitable to query nothing in the file slice.

Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4548] Unpack the column max/min to string instead of Utf8 for Mor table (apache#6311)

* [HUDI-4447] fix SQL metasync when perform delete table operation (apache#6180)

* [HUDI-4424] Add new compactoin trigger stratgy: NUM_COMMITS_AFTER_REQ… (apache#6144)

* [MINOR] improve flink dummySink's parallelism (apache#6325)

* [HUDI-4568] Shade dropwizard metrics-core in hudi-aws-bundle (apache#6327)

* [HUDI-4572] Fix 'Not a valid schema field: ts' error in HoodieFlinkCompactor if precombine field is not ts (apache#6331)

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4570] Fix hive sync path error due to reuse of storage descriptors. (apache#6329)

* [HUDI-4571] Fix partition extractor infer function when partition field mismatch (apache#6333)

Infer META_SYNC_PARTITION_FIELDS and 
META_SYNC_PARTITION_EXTRACTOR_CLASS 
from hoodie.table.partition.fields first. 
If not set, then from hoodie.datasource.write.partitionpath.field.

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4570] Add test for updating multiple partitions in hive sync (apache#6340)

* [MINOR] Fix wrong key to determine sync sql cascade (apache#6339)

* [HUDI-4581] Claim RFC-58 for data skipping integration with query engines (apache#6346)

* [HUDI-4577] Adding test coverage for `DELETE FROM`, Spark Quickstart guide (apache#6318)

* [HUDI-4556] Improve functional test coverage of column stats index (apache#6319)

* [HUDI-4558] lost 'hoodie.table.keygenerator.class' in hoodie.properties (apache#6320)

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4543] Support natural order when table schema contains a field named 'ts' (apache#6246)

* be able to disable precombine field when table schema contains a field named ts

Co-authored-by: jian yonghua <jianyonghua@163.com>

* [HUDI-4569][RFC-58] Claim RFC-58 for adding a new feature named 'Multiple event_time Fields Latest Verification in a Single Table' for Hudi (apache#6328)

Co-authored-by: XinyaoTian <leontian1024@gmail.com>

* [HUDI-3503] Support more feature to call procedure CleanCommand (apache#6353)

* [HUDI-4590] Add hudi-aws dependency to hudi-flink-bundle. (apache#6356)

* [MINOR] fix potential npe in spark writer (apache#6363)

Co-authored-by: zhanshaoxiong <shaoxiong0001@@gmail.com>

* fix bug in cli show fsview all (apache#6314)

* [HUDI-4488] Improve S3EventsHoodieIncrSource efficiency (apache#6228)

* [HUDI-4611] Fix the duplicate creation of config in HoodieFlinkStreamer (apache#6369)

Co-authored-by: linfey <linfey2021@gmail.com>

* [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table (apache#6141)

* Spark support MOR read archived commits for incremental query

* [MINOR] fix progress field calculate logic in HoodieLogRecordReader (apache#6291)

* [HUDI-4608] Fix upgrade command in Hudi CLI (apache#6374)

* [HUDI-4609] Improve usability of upgrade/downgrade commands in Hudi CLI (apache#6377)

* [HUDI-4574] Fixed timeline based marker thread safety issue (apache#6383)

* fixed timeline based markers thread safety issue
* add document for TimelineBasedMarkers thread safety issues

* [HUDI-4621] Add validation that bucket index fields should be subset of primary keys (apache#6396)

* check bucket index fields

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4354] Add --force-empty-sync flag to deltastreamer (apache#6027)

* [HUDI-4601] Read error from MOR table after compaction with timestamp partitioning (apache#6365)

* read error from mor after compaction

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [MINOR] Update DOAP with 0.12.0 Release (apache#6413)

* [HUDI-4529] Tweak some default config options for flink (apache#6287)

* [HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)

* [HUDI-4551] Tweak the default parallelism of flink pipeline to execution env  parallelism (apache#6312)

* [MINOR] Improve code style of CLI Command classes (apache#6427)

* [HUDI-3625] Claim RFC-60 for Federated Storage Layer (apache#6440)

* [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar (apache#6386)

- Adding PulsarSource to DeltaStreamer to support ingesting from Apache Pulsar.
- The current implementation of PulsarSource relies on "pulsar-spark-connector" to ingest using Spark instead of building a similar pipeline from scratch.

* [HUDI-3579] Add timeline commands in hudi-cli (apache#5139)

* [HUDI-4638] Rename payload clazz and preCombine field options for flink sql (apache#6434)

* Revert "[HUDI-4632] Remove the force active property for flink1.14 profile (apache#6415)" (apache#6449)

This reverts commit 9055b2f.

* [HUDI-4643] MergeInto syntax WHEN MATCHED is optional but must be set (apache#6443)

* [HUDI-4644] Change default flink profile to 1.15.x (apache#6445)

* [HUDI-4678] Claim RFC-61 for Snapshot view management (apache#6461)

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4676] Infer cleaner policy when write concurrency mode is OCC (apache#6459)

* [HUDI-4676] Infer cleaner policy when write concurrency mode is OCC
Co-authored-by: jian.feng <jian.feng@shopee.com>
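
A minimal sketch of this kind of inference, assuming the intent is to default the failed-writes cleaning policy to LAZY under OCC (eager cleaning could roll back another writer's inflight commit); the config keys are real Hudi options, the helper itself is illustrative.

```java
import java.util.Properties;

public class CleanerPolicyInference {
  static void inferFailedWritesPolicy(Properties props) {
    boolean occ = "optimistic_concurrency_control"
        .equalsIgnoreCase(props.getProperty("hoodie.write.concurrency.mode", "single_writer"));
    if (occ && !props.containsKey("hoodie.cleaner.policy.failed.writes")) {
      // Assumption: under multi-writer, failed writes must be cleaned lazily.
      props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
    }
  }
}
```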

* [HUDI-4683] Use enum class value for default value in flink options (apache#6453)

* [HUDI-4584] Cleaning up Spark utilities (apache#6351)

Cleans up Spark utilities and removes duplication

* [HUDI-4686] Flip option 'write.ignore.failed' to default false (apache#6467)

Also fix the flaky test

* [HUDI-4515] Fix savepoints being cleaned under the keep-latest-versions policy (apache#6267)

* [HUDI-4637] Fix release thread in RateLimiter not being terminated (apache#6433)

* [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core (apache#6481)

* [HUDI-4687] Add show_invalid_parquet procedure (apache#6480)

Co-authored-by: zhanshaoxiong <shaoxiong0001@gmail.com>

* [HUDI-4584] Fixing `SQLConf` not being propagated to executor (apache#6352)

Fixes `HoodieSparkUtils.createRDD` to make sure `SQLConf` is properly propagated to the executor (required by `AvroSerializer`)

* [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies (apache#6170)

* [HUDI-4665] Flipping default for "ignore failed batch" config in streaming sink to false (apache#6450)

* [HUDI-4713] Fix flaky ITTestHoodieDataSource#testAppendWrite (apache#6490)

* [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (apache#6494)

* Revert "[HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (apache#5884)" (apache#6501)

This reverts commit 660177b.

* [Stacked on 6386] Fixing `DebeziumSource` to properly commit offsets; (apache#6416)

* [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer (apache#6111)

* [HUDI-4703] Use the historical schema to respond to time travel queries (apache#6499)

* [HUDI-4703] Use the historical schema to respond to time travel queries

* [HUDI-4549] Remove avro from hudi-hive-sync-bundle and hudi-aws-bundle (apache#6472)

* Remove avro shading from hudi-hive-sync-bundle and hudi-aws-bundle.

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>

* [HUDI-4482] Remove Guava and use Caffeine instead for caching (apache#6240)

* [HUDI-4483] Fix checkstyle in integ-test module (apache#6523)

* [HUDI-4340] Fix 'not parsable text' DateTimeParseException by adding a parseDateFromInstantTimeSafely method for parsing timestamps when outputting metrics (apache#6000)

* [DOCS] Add docs about javax.security.auth.login.LoginException when starting Hudi Sink Connector (apache#6255)

* [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive) (apache#6533)

* [HUDI-4730] Fix batch jobs being unable to clean old commit files (apache#6515)

* [HUDI-4370] Fix batch jobs being unable to clean old commit files

Co-authored-by: jian.feng <jian.feng@shopee.com>

* [HUDI-4740] Add metadata fields for hive catalog #createTable (apache#6541)

* [HUDI-4695] Fixing flaky TestInlineCompaction#testCompactionRetryOnFailureBasedOnTime (apache#6534)

* [HUDI-4193] Change protoc version to unblock Hudi compilation on M1 Mac (apache#6535)

* [HUDI-4438] Fix flaky TestCopyOnWriteActionExecutor#testPartitionMetafileFormat (apache#6546)

* [MINOR] Fix typo in HoodieArchivalConfig (apache#6542)

* [HUDI-4582] Support batch synchronization of partitions to HMS to avoid timeout (apache#6347)


Co-authored-by: xxhua <xxhua@freewheel.tv>
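
A sketch of the batching idea behind HUDI-4582 above, with a hypothetical `syncBatch` callback standing in for the metastore client call; batch-size handling in the actual commit may differ.

```java
import java.util.List;
import java.util.function.Consumer;

public class BatchedPartitionSync {
  // Push partitions to HMS in fixed-size chunks so that no single RPC
  // carries enough partitions to hit the client timeout.
  static void syncInBatches(List<String> partitions, int batchSize, Consumer<List<String>> syncBatch) {
    for (int i = 0; i < partitions.size(); i += batchSize) {
      syncBatch.accept(partitions.subList(i, Math.min(i + batchSize, partitions.size())));
    }
  }
}
```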

* [HUDI-4742] Fix wrong AWS Glue partition location when updating partitions (apache#6545)

Co-authored-by: xxhua <xxhua@freewheel.tv>

* [HUDI-4418] Add support for ProtoKafkaSource (apache#6135)

- Adds PROTO to Source.SourceType enum.
- Handles PROTO type in SourceFormatAdapter by converting to Avro from proto Message objects. 
   Conversion to Row goes Proto -> Avro -> Row currently.
- Added ProtoClassBasedSchemaProvider to generate schemas for a proto class that is currently on the classpath.
- Added ProtoKafkaSource which parses byte[] into a class that is on the path.
- Added ProtoConversionUtil which exposes methods for creating schemas and 
   translating from Proto messages to Avro GenericRecords.
- Added KafkaSource which provides a base class for the other Kafka sources to use.
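
The schema-derivation step can be sketched with the avro-protobuf module's `ProtobufData` (an assumption here; the commit's `ProtoClassBasedSchemaProvider` and `ProtoConversionUtil` may be implemented differently). `com.google.protobuf.Timestamp` stands in for any compiled proto class on the classpath.

```java
import com.google.protobuf.Timestamp;
import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

public class ProtoSchemaSketch {
  public static void main(String[] args) {
    // Derive an Avro schema from a compiled protobuf Message class.
    Schema avroSchema = ProtobufData.get().getSchema(Timestamp.class);
    System.out.println(avroSchema.toString(true));
  }
}
```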

* [HUDI-4642] Adding support to hudi-cli to repair deprecated partition (apache#6438)

* [HUDI-4751] Fix owner instants for transaction manager api callers (apache#6549)

* [HUDI-4739] Wrong value returned when key's length equals 1 (apache#6539)

* extracts key fields

Co-authored-by: 吴文池 <wuwenchi@deepexi.com>

* [HUDI-4528] Add diff tool to compare commit metadata (apache#6485)

* Add diff tool to compare commit metadata
* Add partition level info to commits and compaction command
* Partition support for compaction archived timeline
* Add diff command test

* [HUDI-4648] Support rename partition through CLI (apache#6569)

* [HUDI-4775] Fixing incremental source for MOR table (apache#6587)

* Fixing incremental source for MOR table

* Remove unused import

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

* [HUDI-4694] Print testcase running time for CI jobs (apache#6586)

* [RFC] Claim RFC-62 for Diagnostic Reporter (apache#6599)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

* [MINOR] Following HUDI-4739, fix the extraction for simple record keys (apache#6594)

* [HUDI-4619] Add a remote request retry mechanism for 'RemoteHoodieTableFileSystemView'. (apache#6393)

* [HUDI-4720] Fix HoodieInternalRow returning wrong number of fields when source does not contain meta fields (apache#6500)

Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>

* [HUDI-4389] Make HoodieStreamingSink idempotent (apache#6098)

* Support checkpoint and idempotent writes in HoodieStreamingSink

- Use batchId as the checkpoint key and add to commit metadata
- Support multi-writer for checkpoint data model

* Walk back previous commits until checkpoint is found

* Handle delete operation and fix test
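
A simplified sketch of the walk-back described above: scan completed commits from newest to oldest and skip the micro-batch if its id was already recorded. The metadata key name is a hypothetical placeholder, not the key HoodieStreamingSink actually uses.

```java
import java.util.Map;

public class StreamingSinkIdempotency {
  // Hypothetical metadata key, for illustration only.
  static final String BATCH_ID_KEY = "streaming.sink.batch.id";

  // completedCommits maps instant time -> commit extra metadata, newest first.
  static boolean alreadyCommitted(Map<String, Map<String, String>> completedCommits, long batchId) {
    for (Map<String, String> extraMetadata : completedCommits.values()) {
      if (String.valueOf(batchId).equals(extraMetadata.get(BATCH_ID_KEY))) {
        return true; // this batch was committed before a restart; skip the write
      }
    }
    return false;
  }
}
```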

* [MINOR] Remove redundant braces (apache#6604)

* [HUDI-4618] Separate log word for CommitUtils class (apache#6392)

* [HUDI-4776] Fix MERGE INTO using unresolved assignment (apache#6589)

* [HUDI-4795] Fix KryoException when bulk inserting into a non-bucket-index Hudi table

Co-authored-by: hbg <bingeng.huang@shopee.com>

* [HUDI-4615] Return checkpoint as null for empty data from events queue.  (apache#6387)


Co-authored-by: sivabalan <n.siva.b@gmail.com>

* [HUDI-4782] Support TIMESTAMP_LTZ type for flink (apache#6607)

* [HUDI-4731] Shutdown CloudWatch reporter when query completes (apache#6468)

* [HUDI-4793] Fixing ScalaTest tests to properly respect Log4j2 configs (apache#6617)

* [HUDI-4766] Strengthen flink clustering job (apache#6566)

* Allow rollbacks if required during clustering
* Allow size to be defined in Long instead of Integer
* Fix bug where clustering will produce files of 120MB in the same filegroup
* Added clean task
* Fix scheduling config to be consistent with that with compaction
* Fix filter mode getting ignored issue
* Add --instant-time parameter
* Prevent 'no execute() calls' exception from being thrown (clustering & compaction)

* Apply upstream changes

* Fix compilation issues

* Fix checkstyle

Signed-off-by: LinMingQiang <1356469429@qq.com>
Signed-off-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: miomiocat <284487410@qq.com>
Co-authored-by: RexAn <bonean131@gmail.com>
Co-authored-by: JerryYue-M <272614347@qq.com>
Co-authored-by: jerryyue <jerryyue@didiglobal.com>
Co-authored-by: Danny Chan <yuzhao.cyz@gmail.com>
Co-authored-by: superche <73096722+hechao-ustc@users.noreply.github.com>
Co-authored-by: superche <superche@tencent.com>
Co-authored-by: Shiyan Xu <2701446+xushiyan@users.noreply.github.com>
Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: voonhous <voonhousu@gmail.com>
Co-authored-by: voonhou.su <voonhou.su@shopee.com>
Co-authored-by: YueZhang <69956021+zhangyue19921010@users.noreply.github.com>
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: xi chaomin <36392121+xicm@users.noreply.github.com>
Co-authored-by: xicm <xicm@asiainfo.com>
Co-authored-by: ForwardXu <forwardxu315@gmail.com>
Co-authored-by: 董可伦 <dongkelun01@inspur.com>
Co-authored-by: shenjiayu17 <54424149+shenjiayu17@users.noreply.github.com>
Co-authored-by: Lanyuanxiaoyao <lanyuanxiaoyao@gmail.com>
Co-authored-by: KnightChess <981159963@qq.com>
Co-authored-by: 苏承祥 <scx_white@aliyun.com>
Co-authored-by: Kumud Kumar Srivatsava Tirupati <kumudkumartirupati@users.noreply.github.com>
Co-authored-by: xiarixiaoyao <mengtao0326@qq.com>
Co-authored-by: liujinhui <965147871@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 冯健 <fengjian428@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: HunterXHunter <1356469429@qq.com>
Co-authored-by: Luning (Lucas) Wang <rsl4@foxmail.com>
Co-authored-by: Yann Byron <biyan900116@gmail.com>
Co-authored-by: Tim Brown <tim.brown126@gmail.com>
Co-authored-by: simonsssu <barley0806@gmail.com>
Co-authored-by: Alexey Kudinkin <alexey@infinilake.com>
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
Co-authored-by: Bo Cui <cuibo0108@163.com>
Co-authored-by: Rahil Chertara <rchertar@amazon.com>
Co-authored-by: wenningd <wenningding95@gmail.com>
Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: Rahil C <32500120+rahil-c@users.noreply.github.com>
Co-authored-by: Ryan Pifer <rmpifer@umich.edu>
Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: simonssu <simonssu@tencent.com>
Co-authored-by: Vander <30547463+vanderzh@users.noreply.github.com>
Co-authored-by: Tim Brown <tim@onehouse.ai>
Co-authored-by: Dongwook Kwon <dongwook@amazon.com>
Co-authored-by: Shawn Chang <42792772+CTTY@users.noreply.github.com>
Co-authored-by: Shawn Chang <yxchang@amazon.com>
Co-authored-by: 5herhom <35916131+5herhom@users.noreply.github.com>
Co-authored-by: 吴祥平 <408317717@qq.com>
Co-authored-by: root <root@TCN1004532-1.tcent.cn>
Co-authored-by: F7753 <mabiaocas@gmail.com>
Co-authored-by: lewinma <lewinma@tencent.com>
Co-authored-by: shaoxiong.zhan <31836510+microbearz@users.noreply.github.com>
Co-authored-by: Nicholas Jiang <programgeek@163.com>
Co-authored-by: Yonghua Jian_deepnova <47289660@qq.com>
Co-authored-by: leesf <490081539@qq.com>
Co-authored-by: 5herhom <543872547@qq.com>
Co-authored-by: RexXiong <lvshuang.tb@gmail.com>
Co-authored-by: BruceLin <brucekellan@gmail.com>
Co-authored-by: Pratyaksh Sharma <pratyaksh13@gmail.com>
Co-authored-by: wuwenchi <wuwenchihdu@hotmail.com>
Co-authored-by: 吴文池 <wuwenchi@deepexi.com>
Co-authored-by: jian yonghua <jianyonghua@163.com>
Co-authored-by: Xinyao Tian (Richard) <31195026+XinyaoTian@users.noreply.github.com>
Co-authored-by: XinyaoTian <leontian1024@gmail.com>
Co-authored-by: zhanshaoxiong <shaoxiong0001@gmail.com>
Co-authored-by: vamshigv <107005799+vamshigv@users.noreply.github.com>
Co-authored-by: feiyang_deepnova <736320652@qq.com>
Co-authored-by: linfey <linfey2021@gmail.com>
Co-authored-by: novisfff <62633257+novisfff@users.noreply.github.com>
Co-authored-by: Qi Ji <qjqqyy@users.noreply.github.com>
Co-authored-by: hehuiyuan <471627698@qq.com>
Co-authored-by: Zouxxyy <zouxxyy@qq.com>
Co-authored-by: Teng <teng_huo@outlook.com>
Co-authored-by: leandro-rouberte <37634317+leandro-rouberte@users.noreply.github.com>
Co-authored-by: Jon Vexler <jbvexler@gmail.com>
Co-authored-by: smilecrazy <smilecrazy1h@gmail.com>
Co-authored-by: xxhua <xxhua@freewheel.tv>
Co-authored-by: komao <masterwangzx@gmail.com>
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
Co-authored-by: felixYyu <felix2003@live.cn>
Co-authored-by: Bingeng Huang <304979636@qq.com>
Co-authored-by: hbg <bingeng.huang@shopee.com>
Co-authored-by: Vinish Reddy <vinishreddygunner17@gmail.com>
Co-authored-by: junyuc25 <10862251+junyuc25@users.noreply.github.com>
Co-authored-by: rmahindra123 <rmahindra@Rajeshs-MacBook-Pro.local>