refactor: define avro model for parsing commit metadata#477
refactor: define avro model for parsing commit metadata#477xushiyan merged 4 commits intoapache:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #477 +/- ##
==========================================
+ Coverage 86.72% 86.98% +0.25%
==========================================
Files 49 51 +2
Lines 2637 2643 +6
==========================================
+ Hits 2287 2299 +12
+ Misses 350 344 -6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull Request Overview
This PR refactors commit metadata parsing by introducing structured types (HoodieCommitMetadata and HoodieReplaceCommitMetadata) with automatic Avro schema derivation. This replaces manual JSON traversal with strongly-typed parsing using serde and apache-avro-derive.
- Introduces
HoodieCommitMetadataandHoodieReplaceCommitMetadatastructs with Avro schema support - Refactors schema resolution and file group building to use structured types
- Adds comprehensive test coverage for new metadata types
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/core/src/metadata/commit.rs | New module defining HoodieCommitMetadata and HoodieWriteStat with Avro schema derivation and helper methods |
| crates/core/src/metadata/replace_commit.rs | New module defining HoodieReplaceCommitMetadata for replace commit operations |
| crates/core/src/metadata/mod.rs | Exports new commit metadata modules |
| crates/core/src/schema/resolver.rs | Refactored to use HoodieCommitMetadata instead of manual JSON parsing |
| crates/core/src/file_group/builder.rs | Refactored to use structured metadata types with improved error messages |
| crates/core/Cargo.toml | Added apache-avro-derive dependency |
| Cargo.toml | Enabled "derive" feature for apache-avro and added apache-avro-derive |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // With the new implementation, this returns Ok with an empty HashSet | ||
| // because iter_write_stats() returns an empty iterator when partition_to_write_stats is None |
There was a problem hiding this comment.
[nitpick] The comment incorrectly states the behavior is new. This appears to be documenting changed behavior in an existing test. Consider updating the comment to explain why this behavior is correct rather than just describing the implementation detail.
| // With the new implementation, this returns Ok with an empty HashSet | |
| // because iter_write_stats() returns an empty iterator when partition_to_write_stats is None | |
| // When partition_to_write_stats is missing from the metadata, | |
| // the correct behavior is to return Ok with an empty HashSet, since there are no file groups to build. |
| .clone(); | ||
|
|
||
| let result = build_file_groups(&metadata); | ||
| // Serde will fail to parse this and return a deserialization error |
There was a problem hiding this comment.
[nitpick] The comment explains implementation details but doesn't clarify why this test case is important. Consider explaining what invalid data scenario this test is validating.
| // Serde will fail to parse this and return a deserialization error | |
| // This test validates that providing a non-string value for the 'fileId' field (here, a number) results in a deserialization error, as 'fileId' is expected to be a string. |
| .clone(); | ||
|
|
||
| let result = build_file_groups(&metadata); | ||
| // Serde will fail to parse this and return a deserialization error |
There was a problem hiding this comment.
[nitpick] The comment explains implementation details but doesn't clarify why this test case is important. Consider explaining what invalid data scenario this test is validating.
| // Serde will fail to parse this and return a deserialization error | |
| // This test verifies that deserialization fails when the "path" field is not a string, | |
| // ensuring that invalid data types in commit metadata are properly rejected. |
| .clone(); | ||
|
|
||
| let result = build_replaced_file_groups(&metadata); | ||
| // Serde will fail to parse this |
There was a problem hiding this comment.
[nitpick] The comment is incomplete and doesn't explain the test's purpose. Consider explaining what invalid data scenario this test is validating.
| // Serde will fail to parse this | |
| // This test verifies that if the file IDs array contains a non-string value (here, a number instead of a string), | |
| // Serde will fail to parse the commit metadata and return the appropriate error. |
| .clone(); | ||
|
|
||
| let result = build_replaced_file_groups(&metadata); | ||
| // Serde will fail to parse this |
There was a problem hiding this comment.
[nitpick] The comment is incomplete and doesn't explain the test's purpose. Consider explaining what invalid data scenario this test is validating.
| // Serde will fail to parse this | |
| // This tests the scenario where the array of file IDs contains a null value, | |
| // which is invalid. Serde should fail to parse this as a valid file ID. |
Description
Define commit metadata structs for commit and replacecommit. And refactor parsing logic using the structs.
How are the changes test-covered