[HUDI-1771] Propagate CDC format for hoodie #2854
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2854 +/- ##
=============================================
+ Coverage 52.58% 69.79% +17.20%
+ Complexity 3708 373 -3335
=============================================
Files 485 54 -431
Lines 23227 1993 -21234
Branches 2466 235 -2231
=============================================
- Hits 12215 1391 -10824
+ Misses 9934 471 -9463
+ Partials 1078 131 -947
Flags with carried forward coverage won't be shown.
@danny0405 should I start with HoodieDataSourceITCase
to understand the big picture first? I feel I need to spend a weekend playing with all this good stuff so I am fully caught up, and can start asking decent questions :)
Yes,
Maybe I am getting it wrong, but it looks like the "_hoodie_cdc_operation" field is not set properly when it's flushed to files.
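For illustration only, here is a minimal sketch (not this PR's code) of how a writer could derive a "_hoodie_cdc_operation" value from Flink's RowKind before a record is flushed; the helper class, method name, and the operation string mapping are assumptions:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.types.RowKind;

/**
 * Illustrative helper: stamps a CDC operation string onto the Avro record
 * derived from a Flink RowData's RowKind before the record is flushed.
 * The field name comes from the discussion above; the mapping itself is
 * an assumption, not the PR's actual code.
 */
public final class CdcOperationStamper {

  public static final String CDC_OPERATION_FIELD = "_hoodie_cdc_operation";

  private CdcOperationStamper() {
  }

  public static void stampOperation(GenericRecord record, RowKind kind) {
    final String op;
    switch (kind) {
      case INSERT:
        op = "I";
        break;
      case UPDATE_BEFORE:
        op = "-U";
        break;
      case UPDATE_AFTER:
        op = "+U";
        break;
      case DELETE:
        op = "D";
        break;
      default:
        op = null;
    }
    // If the writer never calls this, the field silently stays null,
    // which matches the symptom reported above.
    record.put(CDC_OPERATION_FIELD, op);
  }
}
```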
Quick update here. We should totally do this. Just mulling over the schema evolution aspects. Need some time to ensure backwards compatibility, or at least to set expectations.
You are right, when I wrote the code there was no
Totally agree, in order to keep backwards compatibility we may need some test cases in the hoodie core.
@danny0405 Some good progress on the schema backwards compat. @codope did some tests around schema evolution and I think we can introduce this as a top-level nullable field without breaking things. Need a bit more testing, but it seems promising. A few questions on Flink/Dynamic tables.
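To make the backwards-compatibility argument concrete, here is a rough sketch (trimmed-down schema, assumed field names, not the PR's code) that uses Avro's SchemaCompatibility check to show that a reader schema adding a nullable "_hoodie_cdc_operation" field with a null default can still read data written with the old schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;

public class CdcFieldCompatibilityCheck {

  public static void main(String[] args) {
    // Old writer schema: a trimmed-down stand-in for the existing record schema.
    Schema oldSchema = SchemaBuilder.record("HoodieRecord").fields()
        .requiredString("_hoodie_record_key")
        .requiredString("_hoodie_partition_path")
        .endRecord();

    // New reader schema: same fields plus the nullable CDC operation column,
    // defaulted to null so records written without it still resolve.
    Schema newSchema = SchemaBuilder.record("HoodieRecord").fields()
        .requiredString("_hoodie_record_key")
        .requiredString("_hoodie_partition_path")
        .optionalString("_hoodie_cdc_operation")
        .endRecord();

    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);

    // Expected: COMPATIBLE, since the added field is a union with null and has a default.
    System.out.println(result.getType());
  }
}
```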
Flink batch also works.
I was reading through the batch execution mode. It felt like it expects bounded streams and perfect watermarks. What I have in mind is triggering either a batch or streaming job, say, every 1-2 minutes, which resumes processing where it left off. When commits to tables only happen every minute or so, running continuous queries may be expensive? Thoughts?
That's true, we may need some optimization for the stream processing operators and runtime.
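As background on the batch execution mode discussed above, a minimal generic Flink 1.12+ sketch (not code from this PR) that runs a bounded program in BATCH runtime mode instead of a long-running continuous query:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeExample {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // BATCH mode expects bounded sources; a periodically triggered job could
    // run in this mode and resume from where the previous run left off,
    // rather than keeping a continuous STREAMING query alive.
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);

    // Bounded toy source just to make the sketch self-contained.
    env.fromElements("a", "b", "c").print();

    env.execute("bounded-run");
  }
}
```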
My colleague is taking over this PR, so I will close this one and let's move to the new one.
Tips
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.