
[HUDI-1771] Propagate CDC format for hoodie #2854

Closed
wants to merge 1 commit

Conversation

danny0405
Contributor


What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter

Codecov Report

Merging #2854 (aeca8e7) into master (9a288cc) will increase coverage by 17.20%.
The diff coverage is n/a.


@@              Coverage Diff              @@
##             master    #2854       +/-   ##
=============================================
+ Coverage     52.58%   69.79%   +17.20%     
+ Complexity     3708      373     -3335     
=============================================
  Files           485       54      -431     
  Lines         23227     1993    -21234     
  Branches       2466      235     -2231     
=============================================
- Hits          12215     1391    -10824     
+ Misses         9934      471     -9463     
+ Partials       1078      131      -947     
Flag                  Coverage Δ       Complexity Δ
hudicli               ?                ?
hudiclient            ?                ?
hudicommon            ?                ?
hudiflink             ?                ?
hudihadoopmr          ?                ?
hudisparkdatasource   ?                ?
hudisync              ?                ?
huditimelineservice   ?                ?
hudiutilities         69.79% <ø> (ø)   373.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...ava/org/apache/hudi/cli/commands/UtilsCommand.java
...hadoop/realtime/RealtimeCompactedRecordReader.java
...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java
.../hudi/common/table/timeline/dto/FileStatusDTO.java
.../org/apache/hudi/MergeOnReadSnapshotRelation.scala
...g/apache/hudi/common/config/LockConfiguration.java
...spark/src/main/scala/org/apache/hudi/package.scala
...ache/hudi/common/fs/SizeAwareDataOutputStream.java
...hudi/common/config/DFSPropertiesConfiguration.java
.../hadoop/realtime/AbstractRealtimeRecordReader.java
... and 414 more

@vinothchandar vinothchandar added this to Review in progress in PR Tracker Board Apr 24, 2021
@vinothchandar vinothchandar self-assigned this Apr 24, 2021
@vinothchandar vinothchandar moved this from Nearing Landing to Ready for Review in PR Tracker Board May 11, 2021
@vinothchandar vinothchandar moved this from Ready for Review to Opened PRs in PR Tracker Board May 11, 2021
Member

@vinothchandar vinothchandar left a comment


@danny0405 should I start with HoodieDataSourceITCase to understand the big picture first? I feel I need to spend a weekend playing with all this good stuff so I am fully caught up, and can start asking decent questions :)

@nsivabalan nsivabalan added the priority:minor everything else; usability gaps; questions; feature reqs label May 11, 2021
@danny0405
Contributor Author

> @danny0405 should I start with HoodieDataSourceITCase to understand the big picture first? I feel I need to spend a weekend playing with all this good stuff so I am fully caught up, and can start asking decent questions :)

Yes, HoodieDataSourceITCase is a good start; it shows how a Flink SQL job connects to Hoodie tables. I'm glad we can have some good questions and discussion ~
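For readers following along, the kind of Flink SQL job that test exercises looks roughly like the following. This is an illustrative sketch, not taken from the test itself: the table name, columns, and path are hypothetical, and only the basic connector options are shown.

```sql
-- Illustrative only: table name, columns, and path are hypothetical.
CREATE TABLE hoodie_sink (
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hoodie_sink',  -- hypothetical path
  'table.type' = 'MERGE_ON_READ'
);

-- A streaming or batch INSERT then writes through the Hoodie connector.
INSERT INTO hoodie_sink SELECT uuid, name, ts FROM source_table;
```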

@Xuehai-Chen

Xuehai-Chen commented Jun 5, 2021

Maybe I am getting it wrong, but it looks like the "_hoodie_cdc_operation" field is not set properly when flushing to files.
I think the problem is that when StreamWriteFunction.bufferRecord converts a HoodieRecord to a DataItem, the operation field is lost.
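The field loss described here can be sketched in isolation. In this sketch, Record and DataItem are hypothetical stand-ins for HoodieRecord and the DataItem buffered by StreamWriteFunction, reduced to the two fields that matter; the point is that the conversion must copy the operation flag through so it is still present at flush time.

```java
// Illustrative stand-ins only, not Hudi classes.
class CdcBufferSketch {

    // Hypothetical simplification of a HoodieRecord carrying a CDC operation.
    static class Record {
        final String key;
        final String operation; // would back the _hoodie_cdc_operation field
        Record(String key, String operation) {
            this.key = key;
            this.operation = operation;
        }
    }

    // Hypothetical simplification of the buffered DataItem.
    static class DataItem {
        final String key;
        final String operation;
        DataItem(String key, String operation) {
            this.key = key;
            this.operation = operation;
        }
    }

    // The conversion must carry the operation through; dropping it here is
    // exactly the bug being described: the flag is gone by flush time.
    static DataItem toDataItem(Record record) {
        return new DataItem(record.key, record.operation);
    }

    public static void main(String[] args) {
        DataItem item = toDataItem(new Record("uuid-1", "U"));
        System.out.println(item.key + ":" + item.operation); // uuid-1:U
    }
}
```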

@vinothchandar
Member

Quick update here. We should totally do this; just mulling over the schema evolution aspects. We need some time to ensure backwards compatibility, or at least set expectations.

@hudi-bot

hudi-bot commented Jun 15, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@danny0405
Contributor Author

> Maybe I am getting it wrong, but it looks like the "_hoodie_cdc_operation" field is not set properly when flushing to files.
> I think the problem is that when StreamWriteFunction.bufferRecord converts a HoodieRecord to a DataItem, the operation field is lost.

You are right; when I wrote the code, there was no DataItem in StreamWriteFunction yet. You need to test with just this branch's code.

@danny0405
Contributor Author

> Quick update here. We should totally do this; just mulling over the schema evolution aspects. We need some time to ensure backwards compatibility, or at least set expectations.

Totally agree. In order to keep backwards compatibility, we may need some test cases in the Hoodie core.

@vinothchandar
Member

@danny0405 Some good progress on the schema backwards compatibility.

@codope did some tests around schema evolution, and I think we can introduce this as a top-level nullable field without breaking things. It needs a bit more testing, but seems promising.

A few questions on Flink/Dynamic tables:

  • This will just run as a streaming job, right? Or is it possible to leverage these flags even with the Flink batch APIs?
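A minimal sketch of the compatibility argument above, assuming Avro-style schema resolution (nothing here is Hudi code; the Map-backed record is a hypothetical stand-in): when the new field is nullable with a null default, records written before the field existed simply read back as null, so old files stay readable.

```java
// Illustrative only: simulates reading old and new records under a schema
// that gained a top-level nullable field with a null default.
import java.util.HashMap;
import java.util.Map;

class SchemaCompatSketch {
    static final String CDC_OP_FIELD = "_hoodie_cdc_operation";

    // Missing fields fall back to null, mimicking Avro schema resolution
    // for a nullable field whose default is null.
    static String readCdcOperation(Map<String, String> record) {
        return record.getOrDefault(CDC_OP_FIELD, null);
    }

    public static void main(String[] args) {
        Map<String, String> oldRecord = new HashMap<>(); // written before the field existed
        Map<String, String> newRecord = new HashMap<>();
        newRecord.put(CDC_OP_FIELD, "U");

        System.out.println(readCdcOperation(oldRecord)); // null
        System.out.println(readCdcOperation(newRecord)); // U
    }
}
```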

@danny0405
Contributor Author

> @danny0405 Some good progress on the schema backwards compatibility.
>
> @codope did some tests around schema evolution, and I think we can introduce this as a top-level nullable field without breaking things. It needs a bit more testing, but seems promising.
>
> A few questions on Flink/Dynamic tables:
>
>   • This will just run as a streaming job, right? Or is it possible to leverage these flags even with the Flink batch APIs?

Flink batch also works.

@vinothchandar
Member

I was reading through the batch execution mode; it felt like it expects bounded streams and perfect watermarks.

What I have in mind is triggering either a batch or a streaming job, say every 1-2 minutes, that resumes processing where it left off. When commits to the tables only happen every minute or so, running continuous queries may be expensive? Thoughts?

@danny0405
Contributor Author

> running continuous queries may be expensive

That's true; we may need some optimizations for the stream processing operators and runtime.

@danny0405
Contributor Author

My colleague is taking over this PR, so I will close this one; let's move to the other one.

@danny0405 danny0405 closed this Jul 16, 2021
PR Tracker Board automation moved this from Under Discussion PRs to Done Jul 16, 2021