[ARCTIC-994] Introduce design of TransactionId generation to resolving data conflicts #1010

Merged: 49 commits, Feb 2, 2023

wangtaohz
Contributor

@wangtaohz wangtaohz commented Jan 12, 2023

Why are the changes needed?

fix #994

Brief change log

  1. Generate the TransactionId from the change snapshot sequence.
  2. A Spark operation begins a transaction with a new TransactionId.
  3. A Flink operation does not begin a transaction; it uses the snapshot sequence as the TransactionId when committing. The TxId in the file name is set to 0, and in this case the correct TransactionId must be read from the Iceberg metadata.
  4. Flink incremental pull uses TableEntriesScan instead of the Iceberg TableScan, so that the sequence number can be read from the Iceberg metadata when the TxId in the file name is 0.
  5. Add a UUID to the file name to distinguish files that share TxId = 0, in CommonOutputFileFactory and AdaptHiveOutputFileFactory.
  6. Introduce com.netease.arctic.io.FileNameHandle to resolve file names.
  7. Introduce com.netease.arctic.data.file.ContentFileWithSequence to wrap ContentFile.
  8. Make com.netease.arctic.scan.BaseChangeTableIncrementalScan able to return the SequenceNumber.
  9. In UpdateTool, when AMS starts, keep the snapshot-sequence-based TransactionId compatible with the old TransactionId by making sure that snapshot sequence >= old transactionId.
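The fallback described in items 3 and 4 can be sketched roughly as follows. This is only an illustration, not code from this patch: the class `TxIdResolver` and the method `effectiveTransactionId` are hypothetical names, and the sketch assumes nothing more than what the change log states, namely that a file name carries a TxId that may be 0 and that the Iceberg metadata supplies a sequence number for the file.

```java
// Hypothetical sketch; these names do not exist in Arctic.
final class TxIdResolver {
  private TxIdResolver() {}

  /**
   * Returns the TxId parsed from the file name when it is positive
   * (for example, a file written inside a Spark transaction). When the
   * file name carries 0 (for example, a Flink write), falls back to the
   * sequence number read from the Iceberg metadata.
   */
  static long effectiveTransactionId(long txIdInFileName, long sequenceNumber) {
    if (txIdInFileName < 0 || sequenceNumber < 0) {
      throw new IllegalArgumentException("ids must be non-negative");
    }
    return txIdInFileName > 0 ? txIdInFileName : sequenceNumber;
  }

  public static void main(String[] args) {
    System.out.println(effectiveTransactionId(0L, 17L)); // prints 17
    System.out.println(effectiveTransactionId(5L, 17L)); // prints 5
  }
}
```

Under this assumption a reader never needs to know which engine wrote a file: a TxId of 0 in the name is simply the signal to consult the metadata.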

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@github-actions github-actions bot added the module:core Core module label Jan 12, 2023
@wangtaohz
Contributor Author

When the TransactionId is generated on commit, the TransactionId in the file name is 0, so we have to add a UUID to the file name to distinguish files.
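A minimal sketch of why the UUID is needed, assuming a made-up file name layout. The class `UuidFileNames`, the method `newFileName`, and the `partition-task-txId-uuid.extension` pattern are all hypothetical and are not the actual naming scheme of CommonOutputFileFactory or AdaptHiveOutputFileFactory.

```java
import java.util.UUID;

// Hypothetical sketch; the name layout below is an assumption for illustration.
final class UuidFileNames {
  private UuidFileNames() {}

  /**
   * Builds a file name that embeds a random UUID, so that two files written
   * with the same TxId (for example, 0) by the same task still get distinct names.
   */
  static String newFileName(int partitionId, long taskId, long txId, String extension) {
    String unique = UUID.randomUUID().toString();
    return String.format("%d-%d-%d-%s.%s", partitionId, taskId, txId, unique, extension);
  }

  public static void main(String[] args) {
    // Both names share the prefix "1-2-0-" but differ in the UUID part.
    System.out.println(newFileName(1, 2L, 0L, "parquet"));
    System.out.println(newFileName(1, 2L, 0L, "parquet"));
  }
}
```

Without the UUID part, both calls above would produce the same name, because every other component, including the TxId of 0, is identical.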

@github-actions github-actions bot added the module:mixed-hive Hive moduel for Mixed Format label Jan 16, 2023
@github-actions github-actions bot added the module:mixed-spark Spark module for Mixed Format label Jan 16, 2023
@github-actions github-actions bot added module:ams-server Ams server module module:ams-dashboard Ams dashboard module module:mixed-flink Flink moduel for Mixed Format module:mixed-trino trino module for Mixed Format labels Jan 16, 2023
Contributor

@YesOrNo828 YesOrNo828 left a comment

@wangtaohz Thanks for your contribution. I left some comments.

@zstraw Do you have time to take a look?

@wangtaohz
Contributor Author

This PR also fixes #1045

@zhoujinsong zhoujinsong linked an issue Feb 2, 2023 that may be closed by this pull request
1 task
try (TaskWriter<Record> writer = GenericTaskWriters.builderFor(table)
    .withTransactionId(txId)
    .withChangeAction(action)
    .buildChangeWriter()) {
Contributor

Why remove transactionId here?

Contributor Author

This code only writes some data into the ChangeStore, so there is no need to begin a transaction here. That is why I removed the transactionId.

@zhoujinsong zhoujinsong merged commit f7ce44e into apache:master Feb 2, 2023
Contributor

@YesOrNo828 YesOrNo828 left a comment

LGTM

@wangtaohz wangtaohz deleted the fix-994-1 branch February 17, 2023 03:23
zhoujinsong pushed a commit that referenced this pull request May 31, 2023
…ing data conflicts (#1010)

* generate transaction from change table sequence number

* remove gap

* add operationId into OutputFileFactory

* remove allocateTransactionId from TablePropertyUtil for spark

* remove useless import

* flink not begin transaction

* remove begin transaction when write into change store

* add UpdateTool to be compatible with old transactionId

* fix unit test

* Adapt new Transaction model

* Adapt new Transaction model

* Adapt new Transaction model

* Adapt new Transaction model

* Adapt new Transaction model

* Adapt new Transaction model

* fix checkstyle

* fix compile error

* 1.fix flink incremental pull in 1.12
2.fix unit test in 1.12

* 1.fix flink incremental pull in 1.14
2.fix unit test in 1.14

* 1.fix flink incremental pull in 1.15
2.fix unit test in 1.15

* remove useless import

* Adapt new Transaction model

* Adapt new Transaction model

* fix ams unit test

* fix hive commit target files serialization

* fix compile error

* add summary to empty snapshot for Transaction begin and add some comment

* Adapt new Transaction model

* fix parse file for tracer and refactor some methods of FileNameHandle"

* remove useless code in flink

* 1. add more comment for FileNameHandle
2.fix unit test of tracer in core

* add comment

* remove useless import

* refactor FileNameHandle to FileNameGenerator

* add unit test for #1045

* 1.spark remove txId for operations on UnkeyedTable
2.support generate hive sub dir without txId

* remove FileName transaction > 0 check

---------

Co-authored-by: shidayang <530847445@qq.com>
ShawHee pushed a commit to ShawHee/arctic that referenced this pull request Dec 29, 2023
Labels
module:ams-dashboard Ams dashboard module module:ams-server Ams server module module:core Core module module:mixed-flink Flink moduel for Mixed Format module:mixed-hive Hive moduel for Mixed Format module:mixed-spark Spark module for Mixed Format module:mixed-trino trino module for Mixed Format
6 participants