
[HUDI-1771] Propagate CDC format for hoodie #2854

Closed
wants to merge 1 commit

Conversation

danny0405
Contributor


What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter

Codecov Report

Merging #2854 (aeca8e7) into master (9a288cc) will increase coverage by 17.20%.
The diff coverage is n/a.


@@              Coverage Diff              @@
##             master    #2854       +/-   ##
=============================================
+ Coverage     52.58%   69.79%   +17.20%     
+ Complexity     3708      373     -3335     
=============================================
  Files           485       54      -431     
  Lines         23227     1993    -21234     
  Branches       2466      235     -2231     
=============================================
- Hits          12215     1391    -10824     
+ Misses         9934      471     -9463     
+ Partials       1078      131      -947     
Flag                  Coverage Δ       Complexity Δ
hudicli               ?                ?
hudiclient            ?                ?
hudicommon            ?                ?
hudiflink             ?                ?
hudihadoopmr          ?                ?
hudisparkdatasource   ?                ?
hudisync              ?                ?
huditimelineservice   ?                ?
hudiutilities         69.79% <ø> (ø)   373.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...ava/org/apache/hudi/cli/commands/UtilsCommand.java
...hadoop/realtime/RealtimeCompactedRecordReader.java
...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java
.../hudi/common/table/timeline/dto/FileStatusDTO.java
.../org/apache/hudi/MergeOnReadSnapshotRelation.scala
...g/apache/hudi/common/config/LockConfiguration.java
...spark/src/main/scala/org/apache/hudi/package.scala
...ache/hudi/common/fs/SizeAwareDataOutputStream.java
...hudi/common/config/DFSPropertiesConfiguration.java
.../hadoop/realtime/AbstractRealtimeRecordReader.java
... and 414 more

@vinothchandar vinothchandar added this to Review in progress in PR Tracker Board Apr 24, 2021
@vinothchandar vinothchandar self-assigned this Apr 24, 2021
@vinothchandar vinothchandar moved this from Nearing Landing to Ready for Review in PR Tracker Board May 11, 2021
@vinothchandar vinothchandar moved this from Ready for Review to Opened PRs in PR Tracker Board May 11, 2021
Member

@vinothchandar vinothchandar left a comment


@danny0405 should I start with HoodieDataSourceITCase to understand the big picture first? I feel I need to spend a weekend playing with all this good stuff so I am fully caught up, and can start asking decent questions :)

@nsivabalan nsivabalan added the priority:minor everything else; usability gaps; questions; feature reqs label May 11, 2021
@danny0405
Contributor Author

> @danny0405 should I start with HoodieDataSourceITCase to understand the big picture first? I feel I need to spend a weekend playing with all this good stuff so I am fully caught up, and can start asking decent questions :)

Yes, HoodieDataSourceITCase is a good start; it shows how a Flink SQL job connects to Hoodie tables. I'm glad we can have some good questions and discussion ~
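For readers following along, the kind of Flink SQL job that test exercises looks roughly like the following. This is an illustrative sketch, not taken from the test itself: the table name, columns, and path are hypothetical, and only the basic connector options are shown.

```sql
-- Illustrative only: table name, columns, and path are hypothetical.
CREATE TABLE hoodie_sink (
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hoodie_sink',  -- hypothetical path
  'table.type' = 'MERGE_ON_READ'
);

-- A streaming or batch INSERT then writes through the Hoodie connector.
INSERT INTO hoodie_sink SELECT uuid, name, ts FROM source_table;
```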

@Xuehai-Chen

Xuehai-Chen commented Jun 5, 2021

Maybe I am getting it wrong, but it looks like the "_hoodie_cdc_operation" field is not set properly when flushing to files.
I think the problem is that when StreamWriteFunction.bufferRecord converts a HoodieRecord to a DataItem, the operation field is lost.
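The field loss described here can be sketched in isolation. In this sketch, Record and DataItem are hypothetical stand-ins for HoodieRecord and the DataItem buffered by StreamWriteFunction, reduced to the two fields that matter; the point is that the conversion must copy the operation flag through so it is still present at flush time.

```java
// Illustrative stand-ins only, not Hudi classes.
class CdcBufferSketch {

    // Hypothetical simplification of a HoodieRecord carrying a CDC operation.
    static class Record {
        final String key;
        final String operation; // would back the _hoodie_cdc_operation field
        Record(String key, String operation) {
            this.key = key;
            this.operation = operation;
        }
    }

    // Hypothetical simplification of the buffered DataItem.
    static class DataItem {
        final String key;
        final String operation;
        DataItem(String key, String operation) {
            this.key = key;
            this.operation = operation;
        }
    }

    // The conversion must carry the operation through; dropping it here is
    // exactly the bug being described: the flag is gone by flush time.
    static DataItem toDataItem(Record record) {
        return new DataItem(record.key, record.operation);
    }

    public static void main(String[] args) {
        DataItem item = toDataItem(new Record("uuid-1", "U"));
        System.out.println(item.key + ":" + item.operation); // uuid-1:U
    }
}
```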

@vinothchandar
Member

Quick update here. We should totally do this; just mulling over the schema evolution aspects. We need some time to ensure backwards compatibility, or at least set expectations.

@hudi-bot

hudi-bot commented Jun 15, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@danny0405
Contributor Author

> Maybe I am getting it wrong, but it looks like the "_hoodie_cdc_operation" field is not set properly when flushing to files.
> I think the problem is that when StreamWriteFunction.bufferRecord converts a HoodieRecord to a DataItem, the operation field is lost.

You are right; when I wrote the code, there was no DataItem in StreamWriteFunction yet. You need to test with just this branch's code.

@danny0405
Contributor Author

> Quick update here. We should totally do this; just mulling over the schema evolution aspects. We need some time to ensure backwards compatibility, or at least set expectations.

Totally agree. In order to keep backwards compatibility, we may need some test cases in the Hoodie core.

@vinothchandar
Member

@danny0405 Some good progress on the schema backwards compatibility.

@codope did some tests around schema evolution, and I think we can introduce this as a top-level nullable field without breaking things. It needs a bit more testing, but seems promising.

A few questions on Flink/Dynamic tables:

  • This will just run as a streaming job, right? Or is it possible to leverage these flags even with the Flink batch APIs?
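A minimal sketch of the compatibility argument above, assuming Avro-style schema resolution (nothing here is Hudi code; the Map-backed record is a hypothetical stand-in): when the new field is nullable with a null default, records written before the field existed simply read back as null, so old files stay readable.

```java
// Illustrative only: simulates reading old and new records under a schema
// that gained a top-level nullable field with a null default.
import java.util.HashMap;
import java.util.Map;

class SchemaCompatSketch {
    static final String CDC_OP_FIELD = "_hoodie_cdc_operation";

    // Missing fields fall back to null, mimicking Avro schema resolution
    // for a nullable field whose default is null.
    static String readCdcOperation(Map<String, String> record) {
        return record.getOrDefault(CDC_OP_FIELD, null);
    }

    public static void main(String[] args) {
        Map<String, String> oldRecord = new HashMap<>(); // written before the field existed
        Map<String, String> newRecord = new HashMap<>();
        newRecord.put(CDC_OP_FIELD, "U");

        System.out.println(readCdcOperation(oldRecord)); // null
        System.out.println(readCdcOperation(newRecord)); // U
    }
}
```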

@danny0405
Contributor Author

> @danny0405 Some good progress on the schema backwards compatibility.
>
> @codope did some tests around schema evolution, and I think we can introduce this as a top-level nullable field without breaking things. It needs a bit more testing, but seems promising.
>
> A few questions on Flink/Dynamic tables:
>
>   • This will just run as a streaming job, right? Or is it possible to leverage these flags even with the Flink batch APIs?

Flink batch also works.

@vinothchandar
Member

I was reading through the batch execution mode; it felt like it expects bounded streams and perfect watermarks.

What I have in mind is triggering either a batch or a streaming job, say every 1-2 minutes, that resumes processing where it left off. When commits to the tables only happen every minute or so, running continuous queries may be expensive? Thoughts?

@danny0405
Contributor Author

> running continuous queries may be expensive

That's true; we may need some optimizations for the stream processing operators and runtime.

@danny0405
Contributor Author

My colleague is taking over this PR, so I will close this one; let's move to the other one.

@danny0405 danny0405 closed this Jul 16, 2021
PR Tracker Board automation moved this from Under Discussion PRs to Done Jul 16, 2021