[HUDI-5375] Fixing reusing file readers with Metadata reader within FileIndex #7438
Closed
nsivabalan wants to merge 39 commits into apache:release-0.12.2
Conversation
alexeykudinkin approved these changes on Dec 12, 2022
Contributor
Thanks for the fix. What is the effect without this fix? Can the user not query the latest result set if the file index is cached?
Force-pushed from 5c821f5 to 0e1dfcc
Contributor
Author
@danny0405: no issues from a querying standpoint. There might be some perf hit, but no correctness issues or failures.
…ache#7334) Addressing an invalid semantic of MOR iterators not actually being idempotent: i.e., calling `hasNext` multiple times was actually advancing the iterator, therefore potentially skipping elements, for example in cases like:

```scala
// [[isEmpty]] will invoke [[hasNext]] to check whether the Iterator has any elements
if (iter.isEmpty) {
  // ...
} else {
  // Here [[map]] will invoke [[hasNext]] again, therefore skipping the elements
  iter.map { /* ... */ }
}
```
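The fix described in that commit can be sketched as a buffering wrapper whose `hasNext` caches the next element, so repeated calls never advance the underlying iterator. This is a minimal illustration with hypothetical names, not Hudi's actual implementation:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: makes hasNext() idempotent by buffering one element,
// so calling it repeatedly never advances the wrapped iterator.
class IdempotentIterator<T> implements Iterator<T> {
  private final Iterator<T> inner;
  private T buffered;
  private boolean hasBuffered;

  IdempotentIterator(Iterator<T> inner) {
    this.inner = inner;
  }

  @Override
  public boolean hasNext() {
    if (!hasBuffered && inner.hasNext()) {
      buffered = inner.next(); // the inner iterator advances at most once per element
      hasBuffered = true;
    }
    return hasBuffered;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    hasBuffered = false;
    return buffered;
  }
}
```

With such a wrapper, an `isEmpty` check no longer consumes the first element before a subsequent `map` runs.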
…eIterator (apache#7340) * Unify RecordIterator and HoodieParquetReader with ClosableIterator * Add a factory clazz for RecordIterator * Add more documentation
… deprecated soon (apache#7347)" (apache#7350) This reverts commit eb6da96.
…yValue contains ',' (apache#7342)
## Change Logs Actually disable spark-sql core flow tests
* Add schema set with stream api. Co-authored-by: superche <superche@tencent.com>
…TestDataGenerator (apache#7215)
* add call help procedure Co-authored-by: 苏承祥 <sucx@tuya.com>
Upgrade necessary to include fix for https://www.oscs1024.com/hd/CVE-2021-22569
* Fixes https://www.oscs1024.com/hd/CVE-2015-5237 Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
…from occurring (apache#7367) Co-authored-by: Jonathan Vexler <=>
…entation (apache#7395) * Removing `SqlTypedRecord` * Rebasing `ExpressionPayloadCodeGen` to execute against Avro payload directly * Cleaned up `ExpressionPayload`'s CodeGen to interface exclusively w/ `InternalRow`, abstracting away Catalyst <> Avro conversion * Make sure input's Schema is included into the cache's key * Fixing missing code-gen state * Revisited code-gen evaluation to make sure Avro > Catalyst conversion is not performed multiple times * Fixed `ExpressionPayload` accidentally pinning whole Avro record in memory * Replaced unnecessary bespoke code-gen w/ Spark's `SafeProjection`; Removed `ExpressionCodeGen`
This is addressing an NPE while handling column stats w/in the HoodieAppendHandle
…ce gaps (apache#7370) This PR is addressing some of the performance traps detected while stress-testing Spark SQL's Create Table As Select command: * Avoids reordering of the columns w/in CTAS (there's no need for it, InsertIntoTableCommand will be resolving columns anyway) * Fixes the validation sequence w/in InsertIntoTableCommand to first resolve the columns and then run validation (currently it's done the other way around) * Propagates properties specified in CTAS to HoodieSparkSqlWriter (for ex, currently there's no way to disable MT when using CTAS precisely b/c these properties are not propagated) Additionally, the following improvement to HoodieBulkInsertHelper was made: if meta-fields are disabled, we no longer dereference the incoming Dataset into an RDD and instead simply add stubbed-out meta-fields through an additional Projection
Co-authored-by: hbg <bingeng.huang@shopee.com>
…ache#7420) This PR fixes the validation logic inside the cleaner tests in TestCleanerInsertAndCleanByCommits. The tests use the KEEP_LATEST_COMMITS cleaner policy. This policy first figures out the earliest commit to retain based on the configured number of retained commits (hoodie.cleaner.commits.retained). Then, for each file group, one more version before the earliest commit to retain is also kept from cleaning. The commit for that version can differ among file groups. For example, given the following commits, with x denoting a base file written for the corresponding commit and file group,

     c1 c2 c3 c4 c5
fg1:  x     x  x  x
fg2:     x  x  x  x

with hoodie.cleaner.commits.retained=3, no files are cleaned. However, the current validation logic statically picks the one commit before the earliest commit to retain in the Hudi timeline, i.e., c2 in the above example, for all file groups, marking the base file written by c1 for fg1 as expected to be cleaned, which does not match the KEEP_LATEST_COMMITS cleaner policy. The validation logic is fixed according to the expected cleaner behavior.
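The per-file-group retention just described can be sketched as follows; `filesToClean` is a hypothetical helper, not the actual cleaner code. For each file group, every version at or after the earliest commit to retain is kept, plus the single latest version before it; only older versions are cleaning candidates:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of KEEP_LATEST_COMMITS retention for one file group:
// versions is the ascending list of commit times of the group's base files.
class CleanerSketch {
  static List<String> filesToClean(List<String> versions, String earliestCommitToRetain) {
    List<String> toClean = new ArrayList<>();
    String keptBefore = null; // latest version seen so far before the retained window
    for (String v : versions) {
      if (v.compareTo(earliestCommitToRetain) < 0) {
        if (keptBefore != null) {
          toClean.add(keptBefore); // superseded by a newer pre-window version
        }
        keptBefore = v;
      }
    }
    return toClean;
  }
}
```

In the example with hoodie.cleaner.commits.retained=3 and earliest commit to retain c3, fg1's only pre-window version is c1 and fg2's is c2, so neither group has anything to clean.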
…tition path (apache#7402) This PR adds two new bulk insert sort modes, PARTITION_PATH_REPARTITION and PARTITION_PATH_REPARTITION_AND_SORT, which do the following: for a physically partitioned table, repartition the input records based on the partition path, limiting the shuffle parallelism to the specified outputSparkPartitions; for PARTITION_PATH_REPARTITION_AND_SORT, an additional step of sorting the records by partition path within each Spark partition is performed; for a physically non-partitioned table, simply coalesce the input rows to outputSparkPartitions. New unit tests verify the added functionality.
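The two modes can be illustrated with a plain-Java sketch of the bucketing step (hash-partitioning records by their partition path, optionally sorting within each bucket). Names and the hash scheme are illustrative, not the actual Spark implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: bucket records by hash of their partition path into
// outputPartitions buckets (PARTITION_PATH_REPARTITION); optionally sort each
// bucket by partition path (PARTITION_PATH_REPARTITION_AND_SORT).
class RepartitionSketch {
  static List<List<String>> repartition(List<String> partitionPaths, int outputPartitions, boolean sortWithin) {
    List<List<String>> buckets = new ArrayList<>();
    for (int i = 0; i < outputPartitions; i++) {
      buckets.add(new ArrayList<>());
    }
    for (String path : partitionPaths) {
      // records with the same partition path always land in the same bucket
      buckets.get(Math.floorMod(path.hashCode(), outputPartitions)).add(path);
    }
    if (sortWithin) {
      for (List<String> b : buckets) {
        Collections.sort(b);
      }
    }
    return buckets;
  }
}
```

Grouping records of one partition path into one bucket is what limits the number of files written per physical partition.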
…nsert (apache#7396) This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
…g it non-serializable (apache#7424) - Internal state (cached records, writer schemas) are removed to make sure that `ExpressionPayload` object is serializable at all times. - `ExpressionPayload` caches are scoped down to `ThreadLocal` since some of the re-used components (AvroSerializer, AvroDeserializer, SafeProjection) have internal mutable state and therefore are not thread-safe
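The `ThreadLocal` scoping can be sketched as follows; the class and cache below are hypothetical, illustrating the pattern rather than the actual `ExpressionPayload` code. The cache lives in a static `ThreadLocal`, so it is never serialized with the payload instance, and each thread gets its own copies of the non-thread-safe components:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the payload stays serializable because the cache of
// non-thread-safe components lives in a static ThreadLocal, not in instance state.
class PayloadSketch implements Serializable {
  private static final ThreadLocal<Map<String, StringBuilder>> COMPONENT_CACHE =
      ThreadLocal.withInitial(HashMap::new);

  // StringBuilder stands in for a mutable, non-thread-safe component
  // (e.g. a serializer or projection in the real code).
  StringBuilder componentFor(String key) {
    return COMPONENT_CACHE.get().computeIfAbsent(key, k -> new StringBuilder(k));
  }
}
```

Within one thread the component is created once and reused; concurrent threads never share an instance, so internal mutable state is safe.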
Force-pushed from 0e1dfcc to 3489a10
Contributor
Author
Closing in favor of #7450.
Change Logs
Fixing reuse of file readers for the metadata table in FileIndex.
Impact
Reuse of file readers for the metadata table in FileIndex was in place from the start, but one of the earlier refactorings reverted it (offending commit). Master is already fixed, but since we are not pulling that large patch into 0.12.2, here is the fix against 0.12.2.
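The reuse itself amounts to keeping a reader cache keyed by file path, so repeated metadata lookups within the FileIndex do not reopen the same file. The sketch below uses a plain `BufferedReader` as a stand-in for Hudi's metadata file readers; class and method names are hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cache one reader per file path and reuse it across lookups.
class ReaderCacheSketch implements AutoCloseable {
  private final Map<String, BufferedReader> openReaders = new HashMap<>();

  BufferedReader getOrOpen(String path) throws IOException {
    BufferedReader cached = openReaders.get(path);
    if (cached != null) {
      return cached; // reuse: the file is not opened a second time
    }
    BufferedReader reader = Files.newBufferedReader(Paths.get(path));
    openReaders.put(path, reader);
    return reader;
  }

  @Override
  public void close() throws IOException {
    for (BufferedReader reader : openReaders.values()) {
      reader.close();
    }
    openReaders.clear();
  }
}
```

Losing this reuse does not affect correctness, only the cost of reopening readers on every lookup, which matches the "perf hit, no correctness issues" characterization above.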
Risk level (write none, low, medium or high below)
Low.
Documentation Update
N/A
Contributor's checklist