Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Flink's SpeculativeExecution in batch execution mode - Backport of PR #10548 #10776

Merged
merged 4 commits into from
Aug 1, 2024

Conversation

venkata91
Copy link
Contributor

Summary

Add support for Flink's Speculative Execution in batch execution mode

Details

Backport from 1.19 to 1.18 required some manual conflict resolution only in the tests. They are as follows:

  1. Instead of getRuntimeContext().getTaskInfo().getIndexOfThisSubtask() and getRuntimeContext().getTaskInfo().getAttemptNumber() use getRuntimeContext().getIndexOfThisSubtask() and getRuntimeContext().getAttemptNumber()
  2. Add dropDatabase() helper method in TestBase which required adding DEFAULT_CATALOG_NAME as a constant in FlinkCatalogFactory.

Backport from 1.18 to 1.17 went through cleanly.

Testing

Added an integration test to verify the expected speculative execution behavior with IcebergSource

@github-actions github-actions bot added the flink label Jul 24, 2024
@venkata91
Copy link
Contributor Author

cc @pvary for review.

@pvary
Copy link
Contributor

pvary commented Jul 24, 2024

The 1st point is fine.
@rodmeneses: How did we handle the drop database stuff in previous Flink releases?

Thanks, Peter

@rodmeneses
Copy link
Contributor

The 1st point is fine. @rodmeneses: How did we handle the drop database stuff in previous Flink releases?

Thanks, Peter

We didn't have dropDatabase on 1.17 and 1.18. Only dropCatalog existed there.

@pvary
Copy link
Contributor

pvary commented Jul 26, 2024

@rodmeneses: IIRC we faced an issue in the tests when we were introducing Flink 1.19. This was solved by adding the dropDatabase to the tests. We found that it would be good to have a better cleanup method for every tests which uses FlinkSQL, but decided against adding that along with the 1.19 PR. This basically resulted in "inconsistencies" between the 1.17-1.18/1.19 tests.

If my recollection above is correct, then we have 2 tasks ahead of us:

  1. This PR doesn't need to add the dropDatabase related changes to the backports
  2. We still need to come up with a better cleanup for the SQL tests

@rodmeneses: Could you please confirm?

Thanks,
Peter

@rodmeneses
Copy link
Contributor

@rodmeneses: IIRC we faced an issue in the tests when we were introducing Flink 1.19. This was solved by adding the dropDatabase to the tests. We found that it would be good to have a better cleanup method for every tests which uses FlinkSQL, but decided against adding that along with the 1.19 PR. This basically resulted in "inconsistencies" between the 1.17-1.18/1.19 tests.

If my recollection above is correct, then we have 2 tasks ahead of us:

  1. This PR doesn't need to add the dropDatabase related changes to the backports
  2. We still need to come up with a better cleanup for the SQL tests

@rodmeneses: Could you please confirm?

Thanks, Peter

that is correct @pvary. We discussed about having a more thorough cleaning logic for all of FlinkSQL related unit tests, but I didn't have the time nor I wanted to keep adding more complexity on my previous PR.

@venkata91
Copy link
Contributor Author

venkata91 commented Jul 26, 2024

This basically resulted in "inconsistencies" between the 1.17-1.18/1.19 tests.

@pvary Can you clarify what do you mean by the "inconsistencies" here?

@pvary
Copy link
Contributor

pvary commented Jul 29, 2024

This basically resulted in "inconsistencies" between the 1.17-1.18/1.19 tests.

@pvary Can you clarify what do you mean by the "inconsistencies" here?

Differences would have been a better word.

Based on the discussion above: @venkata91: Could you please remove the dropDatabase from the tests for now? Would tests work without it?

Thanks,
Peter

@venkata91
Copy link
Contributor Author

This basically resulted in "inconsistencies" between the 1.17-1.18/1.19 tests.

@pvary Can you clarify what do you mean by the "inconsistencies" here?

Differences would have been a better word.

Based on the discussion above: @venkata91: Could you please remove the dropDatabase from the tests for now? Would tests work without it?

Thanks, Peter

Okay removed the dropDatabase for 1.17 and 1.18 from TestBase and currently dropping the database as part of after().

@venkata91
Copy link
Contributor Author

@pvary Gentle ping

@pvary
Copy link
Contributor

pvary commented Jul 31, 2024

@venkata91: Could you please check the failure?
Otherwise LGTM

@venkata91
Copy link
Contributor Author

venkata91 commented Jul 31, 2024

@venkata91: Could you please check the failure? Otherwise LGTM

@pvary I looked at it earlier and it seems to be unrelated. Test failure is in org.apache.iceberg.flink.TestFlinkAnonymousTable and I ran it locally and it passed as well. Could it be a transient test failure?

@venkata91
Copy link
Contributor Author

@pvary Btw, I merged the main branch changes to my local branch in order to trigger the tests again. Ideally, if it is transient issue, hopefully this should solve it.

@venkata91
Copy link
Contributor Author

@pvary Btw, I merged the main branch changes to my local branch in order to trigger the tests again. Ideally, if it is transient issue, hopefully this should solve it.

@pvary looks like all the checks passed now.

@pvary pvary merged commit 84c9125 into apache:main Aug 1, 2024
24 checks passed
@pvary
Copy link
Contributor

pvary commented Aug 1, 2024

Merged to main.
Thanks @venkata91 for the backport!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants