
[HUDI-5375] Fixing reusing file readers with Metadata reader within FileIndex #7438

Closed
nsivabalan wants to merge 39 commits into apache:release-0.12.2 from nsivabalan:release-0.12.2_fileIndex_mdt_reuse

Conversation

@nsivabalan
Contributor

@nsivabalan nsivabalan commented Dec 12, 2022

Change Logs

Fixing reuse of file readers for metadata table in FileIndex.

Impact

Fixing reuse of file readers for the metadata table in FileIndex. Readers were reused from the start, but an earlier refactoring reverted this behavior (offending commit). Master already contains the fix, but since we are not pulling that large patch into 0.12.2, this PR applies the fix against 0.12.2.

Risk level (write none, low, medium, or high below)

low.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added priority:blocker Production down; release blocker release-0.12.2 Patches targeted for 0.12.2 labels Dec 12, 2022
Contributor

@alexeykudinkin alexeykudinkin left a comment


Build fails b/c of this change: #7175

@danny0405
Contributor

Thanks for the fix. What is the effect without this fix? Is the user unable to query the latest result set if the file index is cached?

@nsivabalan nsivabalan force-pushed the release-0.12.2_fileIndex_mdt_reuse branch from 5c821f5 to 0e1dfcc Compare December 13, 2022 04:21
@nsivabalan
Contributor Author

@danny0405 : no issues from a querying standpoint. There might be some performance hit, but no correctness issues or failures.

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

YannByron and others added 22 commits December 13, 2022 07:09
…ache#7334)

Addressing an invalid semantic of MOR iterators not being actually idempotent: i.e., calling `hasNext` multiple times was actually advancing the iterator, therefore potentially skipping elements, for example in cases like:

```
// [[isEmpty]] will invoke [[hasNext]] to check whether Iterator has any elements
if (iter.isEmpty) {
  // ...
} else {
  // Here [[map]] will invoke [[hasNext]] again, therefore skipping the elements
  iter.map { /* ... */ }
}
```
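The fix for such non-idempotent iterators can be sketched as a standalone wrapper (a hypothetical illustration, not Hudi's actual class): buffer the next element so that repeated `hasNext` calls never advance the underlying iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper: hasNext() pulls at most one element into a buffer,
// so calling it repeatedly is safe; next() hands the buffered element out.
final class IdempotentIterator<T> implements Iterator<T> {
  private final Iterator<T> inner;
  private T buffered;
  private boolean hasBuffered;

  IdempotentIterator(Iterator<T> inner) {
    this.inner = inner;
  }

  @Override
  public boolean hasNext() {
    if (!hasBuffered && inner.hasNext()) {
      buffered = inner.next(); // advance the inner iterator at most once per element
      hasBuffered = true;
    }
    return hasBuffered;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    hasBuffered = false;
    return buffered;
  }
}
```

With such a wrapper, the `isEmpty`-then-`map` pattern above is safe, because the `hasNext` invoked by `isEmpty` no longer consumes an element.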
…eIterator (apache#7340)

* Unify RecordIterator and HoodieParquetReader with ClosableIterator
* Add a factory class for RecordIterator
* Add more documentation
## Change Logs

Actually disable spark-sql core flow tests
* Add schema set with stream api.

Co-authored-by: superche <superche@tencent.com>
* add call help procedure

Co-authored-by: 苏承祥 <sucx@tuya.com>
jonvex and others added 17 commits December 13, 2022 10:57
…from occurring (apache#7367)


Co-authored-by: Jonathan Vexler <=>
…entation (apache#7395)

* Removing `SqlTypedRecord`

* Rebasing `ExpressionPayloadCodeGen` to execute against Avro payload directly

* Cleaned up `ExpressionPayload`s CodeGen to interface exclusively w/ `InternalRow`, abstracting away Catalyst <> Avro conversion

* Make sure input's Schema is included into the cache's key

* Fixing missing code-gen state

* Revisited code-gen evaluation to make sure Avro > Catalyst conversion is not performed multiple times

* Fixed `ExpressionPayload` accidentally pinning whole Avro record in memory

* Replaced unnecessary bespoke code-gen w/ Spark's `SafeProjection`;
Removed `ExpressionCodeGen`
This addresses an NPE while handling column stats within the HoodieAppendHandle
…ce gaps (apache#7370)

This PR addresses some of the performance traps detected while stress-testing Spark SQL's Create Table As Select (CTAS) command:

  • Avoids reordering of the columns within CTAS (there's no need for it; InsertIntoTableCommand will resolve the columns anyway)
  • Fixes the validation sequence within InsertIntoTableCommand to first resolve the columns and then run validation (currently it's done the other way around)
  • Propagates properties specified in CTAS to HoodieSparkSqlWriter (for example, currently there's no way to disable the metadata table when using CTAS precisely because these properties are not propagated)

Additionally, the following improvement to HoodieBulkInsertHelper was made:

  • If meta-fields are disabled, we no longer dereference the incoming Dataset into an RDD and instead simply add stubbed-out meta-fields through an additional Projection
Co-authored-by: hbg <bingeng.huang@shopee.com>
…ache#7420)

This PR fixes the validation logic inside cleaner tests in TestCleanerInsertAndCleanByCommits.

In the tests, the KEEP_LATEST_COMMITS cleaner policy is used. This policy first figures out the earliest commit to retain based on the config of the number of retained commits (hoodie.cleaner.commits.retained). Then, for each file group, one more version before the earliest commit to retain is also kept from cleaning. The commit for the version can be different among file groups. For example, given the following commits, with x denoting a base file written for the corresponding commit and file group,

         c1  c2  c3  c4  c5
fg1:     x        x   x  x
fg2:          x   x   x  x
with hoodie.cleaner.commits.retained=3, no files are cleaned.

However, the current validation logic statically picks the one commit before the earliest commit to retain in the Hudi timeline, i.e., c2 in the above example, for all file groups, marking the base file written by c1 for fg1 as expected to be cleaned, which does not match the KEEP_LATEST_COMMITS cleaner policy. The validation logic is fixed according to the expected cleaner behavior.
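The retention semantics described above can be illustrated with a small standalone sketch (hypothetical helper names, not Hudi's implementation), assuming commit times sort lexicographically: keep every version at or after the earliest retained commit, plus, per file group, the latest version written before that boundary.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of KEEP_LATEST_COMMITS: the boundary commit is derived
// from the number of retained commits; per file group, one extra version
// before the boundary is also kept.
final class CleanerSketch {
  static Set<String> retainedVersions(List<String> commits,    // globally ordered commit times
                                      List<String> fgVersions, // commits that wrote this file group
                                      int retained) {
    // Earliest commit to retain, based on hoodie.cleaner.commits.retained.
    String earliest = commits.get(Math.max(0, commits.size() - retained));
    Set<String> keep = new TreeSet<>();
    String latestBefore = null;
    for (String c : fgVersions) {
      if (c.compareTo(earliest) >= 0) {
        keep.add(c); // within the retained window
      } else {
        latestBefore = c; // track the newest version before the boundary
      }
    }
    if (latestBefore != null) {
      keep.add(latestBefore); // one extra version kept per file group
    }
    return keep;
  }
}
```

For the example above with `hoodie.cleaner.commits.retained=3`, the boundary is c3, and fg1 additionally keeps its c1 version, so no files are cleaned, matching the PR description.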
…tition path (apache#7402)

This PR adds two new bulk insert sort modes, PARTITION_PATH_REPARTITION and PARTITION_PATH_REPARTITION_AND_SORT, which do the following:

  • For a physically partitioned table, repartition the input records based on the partition path, limiting the shuffle parallelism to the specified outputSparkPartitions. For PARTITION_PATH_REPARTITION_AND_SORT, the records are additionally sorted by partition path within each Spark partition.
  • For a physically non-partitioned table, simply coalesce the input rows to outputSparkPartitions.

New unit tests are added to verify the added functionality.
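The core routing idea behind partition-path repartitioning can be sketched without Spark (the class and method names here are illustrative, not Hudi's API): hash each record's partition path into one of outputSparkPartitions buckets, so all records sharing a partition path land in the same bucket and hence the same set of output files.

```java
// Hypothetical sketch: deterministic bucket assignment by partition path,
// analogous to what a repartition keyed on partition path achieves.
final class PartitionPathRouter {
  static int bucketFor(String partitionPath, int outputSparkPartitions) {
    // floorMod keeps the bucket non-negative even for negative hash codes
    return Math.floorMod(partitionPath.hashCode(), outputSparkPartitions);
  }
}
```

Because the assignment is a pure function of the partition path, two records with the same path always co-locate, which is what limits small-file proliferation across partitions.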
…nsert (apache#7396)

This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
…g it non-serializable (apache#7424)

- Internal state (cached records, writer schemas) is removed to make
   sure that `ExpressionPayload` objects are serializable at all times.
- `ExpressionPayload` caches are scoped down to `ThreadLocal` since
   some of the re-used components (AvroSerializer, AvroDeserializer, SafeProjection)
   have internal mutable state and are therefore not thread-safe
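The ThreadLocal scoping pattern referenced above can be sketched as follows, using `SimpleDateFormat` as a stand-in for the non-thread-safe serializers (this is illustrative, not Hudi's code): each thread lazily gets and reuses its own instance, so no mutable state is shared across threads or dragged into serialization.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical sketch: a per-thread cache for a stateful, non-thread-safe
// component, mirroring the ThreadLocal scoping applied to ExpressionPayload's
// serializer caches.
final class PerThreadCache {
  private static final ThreadLocal<SimpleDateFormat> FORMAT =
      ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

  static String format(Date d) {
    return FORMAT.get().format(d); // each thread reuses its own instance
  }
}
```

Since the `ThreadLocal` field is static, it is also never captured in the serialized state of the enclosing object, which is exactly the property the fix relies on.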
@nsivabalan nsivabalan force-pushed the release-0.12.2_fileIndex_mdt_reuse branch from 0e1dfcc to 3489a10 Compare December 13, 2022 23:56
@nsivabalan
Contributor Author

closing in favor of #7450

@nsivabalan nsivabalan closed this Dec 13, 2022
