
[HUDI-5375] Fixing reusing file readers with Metadata reader within FileIndex #7438

Closed
nsivabalan wants to merge 39 commits into apache:release-0.12.2 from nsivabalan:release-0.12.2_fileIndex_mdt_reuse

Conversation

@nsivabalan
Contributor

@nsivabalan nsivabalan commented Dec 12, 2022

Change Logs

Fixing reuse of file readers for metadata table in FileIndex.

Impact

Fixing reuse of file readers for the metadata table in FileIndex. Readers were reused from the start, but an earlier refactoring reverted this behavior (offending commit). Master already contains the fix, but since we are not pulling that large patch into 0.12.2, this PR applies the fix against 0.12.2.

Risk level (write none, low, medium, or high below)

low.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added priority:blocker Production down; release blocker release-0.12.2 Patches targeted for 0.12.2 labels Dec 12, 2022
Contributor

@alexeykudinkin alexeykudinkin left a comment


Build fails b/c of this change: #7175

@danny0405
Contributor

Thanks for the fix. What is the effect without this fix? Is the user unable to query the latest result set if the file index is cached?

@nsivabalan nsivabalan force-pushed the release-0.12.2_fileIndex_mdt_reuse branch from 5c821f5 to 0e1dfcc Compare December 13, 2022 04:21
@nsivabalan
Contributor Author

@danny0405 : no issues from a querying standpoint. There might be some performance hit, but no correctness issues or failures.

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

YannByron and others added 22 commits December 13, 2022 07:09
…ache#7334)

Addressing an invalid semantic of MOR iterators not being actually idempotent: i.e., calling `hasNext` multiple times was actually advancing the iterator, therefore potentially skipping elements, for example in cases like:

```
// [[isEmpty]] will invoke [[hasNext]] to check whether Iterator has any elements
if (iter.isEmpty) {
  // ...
} else {
  // Here [[map]] will invoke [[hasNext]] again, therefore skipping the elements
  iter.map { /* ... */ }
}
```
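The fix for such non-idempotent iterators can be sketched as a standalone wrapper (a hypothetical illustration, not Hudi's actual class): buffer the next element so that repeated `hasNext` calls never advance the underlying iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper: hasNext() pulls at most one element into a buffer,
// so calling it repeatedly is safe; next() hands the buffered element out.
final class IdempotentIterator<T> implements Iterator<T> {
  private final Iterator<T> inner;
  private T buffered;
  private boolean hasBuffered;

  IdempotentIterator(Iterator<T> inner) {
    this.inner = inner;
  }

  @Override
  public boolean hasNext() {
    if (!hasBuffered && inner.hasNext()) {
      buffered = inner.next(); // advance the inner iterator at most once per element
      hasBuffered = true;
    }
    return hasBuffered;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    hasBuffered = false;
    return buffered;
  }
}
```

With such a wrapper, the `isEmpty`-then-`map` pattern above is safe, because the `hasNext` invoked by `isEmpty` no longer consumes an element.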
…eIterator (apache#7340)

* Unify RecordIterator and HoodieParquetReader with ClosableIterator
* Add a factory class for RecordIterator
* Add more documentation
## Change Logs

Actually disable spark-sql core flow tests
* Add schema set with stream api.

Co-authored-by: superche <superche@tencent.com>
* add call help procedure

Co-authored-by: 苏承祥 <sucx@tuya.com>
jonvex and others added 17 commits December 13, 2022 10:57
…from occurring (apache#7367)


Co-authored-by: Jonathan Vexler <=>
…entation (apache#7395)

* Removing `SqlTypedRecord`

* Rebasing `ExpressionPayloadCodeGen` to execute against Avro payload directly

* Cleaned up `ExpressionPayload`s CodeGen to interface exclusively w/ `InternalRow`, abstracting away Catalyst <> Avro conversion

* Make sure input's Schema is included into the cache's key

* Fixing missing code-gen state

* Revisited code-gen evaluation to make sure Avro > Catalyst conversion is not performed multiple times

* Fixed `ExpressionPayload` accidentally pinning whole Avro record in memory

* Replaced unnecessary bespoke code-gen w/ Spark's `SafeProjection`;
Removed `ExpressionCodeGen`
This addresses an NPE while handling column stats within the HoodieAppendHandle
…ce gaps (apache#7370)

This PR addresses some of the performance traps detected while stress-testing Spark SQL's Create Table As Select (CTAS) command:

  • Avoids reordering of the columns within CTAS (there's no need for it; InsertIntoTableCommand will resolve the columns anyway)
  • Fixes the validation sequence within InsertIntoTableCommand to first resolve the columns and then run validation (currently it's done the other way around)
  • Propagates properties specified in CTAS to HoodieSparkSqlWriter (for example, currently there's no way to disable the metadata table when using CTAS precisely because these properties are not propagated)

Additionally, the following improvement to HoodieBulkInsertHelper was made:

  • If meta-fields are disabled, we no longer dereference the incoming Dataset into an RDD and instead simply add stubbed-out meta-fields through an additional Projection
Co-authored-by: hbg <bingeng.huang@shopee.com>
…ache#7420)

This PR fixes the validation logic inside cleaner tests in TestCleanerInsertAndCleanByCommits.

In the tests, the KEEP_LATEST_COMMITS cleaner policy is used. This policy first figures out the earliest commit to retain based on the config of the number of retained commits (hoodie.cleaner.commits.retained). Then, for each file group, one more version before the earliest commit to retain is also kept from cleaning. The commit for the version can be different among file groups. For example, given the following commits, with x denoting a base file written for the corresponding commit and file group,

         c1  c2  c3  c4  c5
fg1:     x        x   x  x
fg2:          x   x   x  x
with hoodie.cleaner.commits.retained=3, no files are cleaned.

However, the current validation logic statically picks the one commit before the earliest commit to retain in the Hudi timeline, i.e., c2 in the above example, for all file groups, marking the base file written by c1 for fg1 as expected to be cleaned, which does not match the KEEP_LATEST_COMMITS cleaner policy. The validation logic is fixed according to the expected cleaner behavior.
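The retention semantics described above can be illustrated with a small standalone sketch (hypothetical helper names, not Hudi's implementation), assuming commit times sort lexicographically: keep every version at or after the earliest retained commit, plus, per file group, the latest version written before that boundary.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of KEEP_LATEST_COMMITS: the boundary commit is derived
// from the number of retained commits; per file group, one extra version
// before the boundary is also kept.
final class CleanerSketch {
  static Set<String> retainedVersions(List<String> commits,    // globally ordered commit times
                                      List<String> fgVersions, // commits that wrote this file group
                                      int retained) {
    // Earliest commit to retain, based on hoodie.cleaner.commits.retained.
    String earliest = commits.get(Math.max(0, commits.size() - retained));
    Set<String> keep = new TreeSet<>();
    String latestBefore = null;
    for (String c : fgVersions) {
      if (c.compareTo(earliest) >= 0) {
        keep.add(c); // within the retained window
      } else {
        latestBefore = c; // track the newest version before the boundary
      }
    }
    if (latestBefore != null) {
      keep.add(latestBefore); // one extra version kept per file group
    }
    return keep;
  }
}
```

For the example above with `hoodie.cleaner.commits.retained=3`, the boundary is c3, and fg1 additionally keeps its c1 version, so no files are cleaned, matching the PR description.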
…tition path (apache#7402)

This PR adds two new bulk insert sort modes, PARTITION_PATH_REPARTITION and PARTITION_PATH_REPARTITION_AND_SORT, which do the following:

  • For a physically partitioned table, repartition the input records based on the partition path, limiting the shuffle parallelism to the specified outputSparkPartitions. For PARTITION_PATH_REPARTITION_AND_SORT, the records are additionally sorted by partition path within each Spark partition.
  • For a physically non-partitioned table, simply coalesce the input rows to outputSparkPartitions.

New unit tests are added to verify the added functionality.
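The core routing idea behind partition-path repartitioning can be sketched without Spark (the class and method names here are illustrative, not Hudi's API): hash each record's partition path into one of outputSparkPartitions buckets, so all records sharing a partition path land in the same bucket and hence the same set of output files.

```java
// Hypothetical sketch: deterministic bucket assignment by partition path,
// analogous to what a repartition keyed on partition path achieves.
final class PartitionPathRouter {
  static int bucketFor(String partitionPath, int outputSparkPartitions) {
    // floorMod keeps the bucket non-negative even for negative hash codes
    return Math.floorMod(partitionPath.hashCode(), outputSparkPartitions);
  }
}
```

Because the assignment is a pure function of the partition path, two records with the same path always co-locate, which is what limits small-file proliferation across partitions.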
…nsert (apache#7396)

This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
…g it non-serializable (apache#7424)

- Internal state (cached records, writer schemas) is removed to make
   sure that `ExpressionPayload` objects are serializable at all times.
- `ExpressionPayload` caches are scoped down to `ThreadLocal` since
   some of the re-used components (AvroSerializer, AvroDeserializer, SafeProjection)
   have internal mutable state and are therefore not thread-safe
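The ThreadLocal scoping pattern referenced above can be sketched as follows, using `SimpleDateFormat` as a stand-in for the non-thread-safe serializers (this is illustrative, not Hudi's code): each thread lazily gets and reuses its own instance, so no mutable state is shared across threads or dragged into serialization.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical sketch: a per-thread cache for a stateful, non-thread-safe
// component, mirroring the ThreadLocal scoping applied to ExpressionPayload's
// serializer caches.
final class PerThreadCache {
  private static final ThreadLocal<SimpleDateFormat> FORMAT =
      ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

  static String format(Date d) {
    return FORMAT.get().format(d); // each thread reuses its own instance
  }
}
```

Since the `ThreadLocal` field is static, it is also never captured in the serialized state of the enclosing object, which is exactly the property the fix relies on.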
@nsivabalan nsivabalan force-pushed the release-0.12.2_fileIndex_mdt_reuse branch from 0e1dfcc to 3489a10 Compare December 13, 2022 23:56
@nsivabalan
Contributor Author

closing in favor of #7450

@nsivabalan nsivabalan closed this Dec 13, 2022
