
[MINOR] support log index #10143

Closed
watermelon12138 wants to merge 1 commit into apache:master from watermelon12138:SupportLogIndex

Conversation

@watermelon12138
Contributor

@watermelon12138 watermelon12138 commented Nov 20, 2023

Change Logs

A log index is added to speed up log file reading in Flink.

Impact

None. This feature is disabled by default.

Risk level (write none, low medium or high below)

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405
Contributor

Can you write up a general design of the changes, so that it is easier for us to reach consensus on the general direction?

@watermelon12138
Contributor Author

Can you write up a general design of the changes, so that it is easier for us to reach consensus on the general direction?

@danny0405 OK, I will summarize the overall changes and design ideas after I finish adding UTs.

@watermelon12138
Contributor Author

@danny0405
Hello, Danny.
I would like to ask why data with the same primary key is written to different log files (with the same FileId but different timestamps) in upsert mode. As a result, I cannot write a UT to test the LogIndex capability. My test code is as follows:

```java
public void testHoodiePipelineBuilderSource() throws Exception {
  // create a StreamExecutionEnvironment instance.
  StreamExecutionEnvironment execEnv = StreamExecutionEnvironment.getExecutionEnvironment();
  execEnv.getConfig().disableObjectReuse();
  execEnv.setParallelism(1);
  // set up checkpoint interval
  execEnv.enableCheckpointing(4000, CheckpointingMode.EXACTLY_ONCE);
  execEnv.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

  Configuration conf = TestConfigurations.getDefaultConf(tempFile.toURI().toString());
  conf.setString(FlinkOptions.TABLE_NAME, "t1");
  conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");
  conf.setString(FlinkOptions.INDEX_TYPE, "BUCKET");
  conf.setInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, 1);
  conf.setBoolean(FlinkOptions.LOG_INDEX_ENABLE, true);
  conf.setString(FlinkOptions.PRECOMBINE_FIELD, "ts");
  conf.setString(FlinkOptions.RECORD_KEY_FIELD, "uuid");
  conf.setBoolean(FlinkOptions.PRE_COMBINE, true);
  conf.setString(FlinkOptions.OPERATION, "upsert");

  // write 3 batches of data set
  TestData.writeData(TestData.dataSetInsert(1), conf);
  TestData.writeData(TestData.dataSetInsert(1), conf);
  TestData.writeData(TestData.dataSetInsert(1), conf);
}
```

@watermelon12138
Contributor Author


@ad1happy2go
Hi! Can you help me resolve this? Thank you very much.

@danny0405
Contributor

I would like to ask why data with the same primary key is written to different log files (with the same FileId but different timestamps) in upsert mode?

The primary key's lifecycle is maintained within one FileGroup; different log files may hold multiple changes to one key, scattered across multiple commits.
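This point can be sketched with a toy model: a file group is identified by a fileId, and each commit that touches keys in that group appends a new log file with the same fileId but a new instant time, so one key's updates end up scattered across several log files of the same file group. The file-name format and class below are illustrative assumptions, not Hudi's actual naming or file-system code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FileGroupSketch {
  // Group toy log-file names of the form "<fileId>_<instantTime>.log"
  // by their fileId. Illustrative only, not Hudi's real naming scheme.
  static Map<String, List<String>> groupByFileId(List<String> logFiles) {
    Map<String, List<String>> byFileGroup = new TreeMap<>();
    for (String f : logFiles) {
      String fileId = f.substring(0, f.indexOf('_'));
      byFileGroup.computeIfAbsent(fileId, k -> new ArrayList<>()).add(f);
    }
    return byFileGroup;
  }

  public static void main(String[] args) {
    List<String> logFiles = Arrays.asList(
        "fg-001_20231120101010.log",   // first upsert of key 'uuid-1'
        "fg-001_20231120101155.log",   // later upsert of the same key
        "fg-002_20231120101010.log");  // a different file group
    // One key's changes land in several log files of the SAME file
    // group, one per commit instant.
    System.out.println(groupByFileId(logFiles));
  }
}
```

Reading a key back therefore means consulting every log file of its file group across commits, which is the scan a log index would help shortcut.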

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2024
