Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-29198][test] Fail after maximum RetryOnException #20805

Merged
merged 3 commits into from
Nov 4, 2022

Conversation

RyanSkraba
Copy link
Contributor

What is the purpose of the change

This pull request causes tests annotated with @RetryOnException to fail after the given number of retries when used with the JUnit5 RetryExtension. This is the old behaviour we saw with the JUnit4 RetryRule.

Currently, the test is retried for the maximum number of executions, but the "expected" exception is never thrown even on the last try. This causes the test to appear skipped instead of failed.

Brief change log

  • If the test doesn't succeed on the last possible retry, the exception is propagated as a test failure.
  • I reverted some overeager JUnit5 migration on test cases for the JUnit4 RetryRule. This code is appropriately duplicated for RetryExtension.

Verifying this change

This change is already covered by existing tests, such as RetryOnExceptionExtensionTest.

I've manually tested that the following test now fails after the expected number of retries (instead of being skipped):

    @TestTemplate
    @RetryOnException(times = NUMBER_OF_RUNS, exception = IllegalArgumentException.class)
    void testThrowExceptionForever() {
        throw new IllegalArgumentException();
    }

In JUnit5, it's very tricky to specify that a test is supposed to fail, but it's possible. I started down this route with a technique similar to ExpectToFail but it ends up needing to be very aware of the internals of our RetryExtension. I'm not sure it's worth adding test extension code to test extension code, but I can continue down that route!

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Copy link
Collaborator

flinkbot commented Sep 9, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@snuyanzin
Copy link
Contributor

Thanks for the contribution.
I think it would be great to have a test here confirming that the change fixes the issue

@RyanSkraba
Copy link
Contributor Author

Thanks for the contribution. I think it would be great to have a test here confirming that the change fixes the issue

Thanks Sergey, especially for the collaboration! This was a tricky thing to test; it gets a bit unclear about how far to go when testing test code :D But since this is such a widely used TestExtension, it's good to have extra coverage -- especially when the symptom is otherwise silently skipping tests that should have failed.

Copy link
Contributor

@snuyanzin snuyanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some tiny comments: add Error and Throwable to the tests

@snuyanzin
Copy link
Contributor

@flinkbot run azure

@snuyanzin
Copy link
Contributor

lgtm from my side

//cc @XComp

Copy link
Contributor

@XComp XComp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this PR, @RyanSkraba, and @snuyanzin for the initial review. I looked into it and added a few comments. Please find them below.

Comment on lines +43 to +47
// Failed when reach the total retry times
if (attemptIndex >= totalTimes) {
LOG.error("Test Failed at the last retry.", throwable);
throw throwable;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we should move this code into AbstractRetryStrategy instead of using the same code snippet in both implementing classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! I thought about this too -- the way the code is expressed, there could be other implementations of AbstractRetryStrategy that apply different strategies (for example, to continue retrying for max period of time regardless of the number of attempts).

My opinion isn't strong on this, but the logic isn't really complicated enough to need to be abstracted away.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point - but your example would probably get its own implementation of RetryStrategy instead of relying on AbstractRetryStrategy as the parent. A counter-argument would be to not try to optimize code for future feature to keep the code as small as possible. If someone decides to come up with a feature that doesn't fit the current implementation, it's time to refactor. Having code duplicates usually doesn't help but introduces potential problems, e.g. bugs only being fixed in one version but not the other.

Anyway, I still see your point...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave your suggestion a try, but creating and delegating to a new method in the abstract strategy takes more code and makes the final implementation non-obvious.

It might have been a bit premature to have introduced the strategy pattern with only two alternatives, but I'm not too worried about repeating this pretty simple if explicitly! Do you have a strong opinion about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave this up to you. No strong opinion here since, as you said, it's a small code change. So, I consider it an edge case. But in general, my comment from above still applies: Duplicate code creates the risk of fixing bugs only in one location but not the other.

Copy link
Contributor

@XComp XComp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. nice catch with the accidental JUnit4->5 migration. Could you reorganize/squash the commits. Ideally, reverting the JUnit4->5 migration should be in a separate commit pointing to the relevant ticket.

@RyanSkraba RyanSkraba force-pushed the rskraba/FLINK-29198-retry-exception branch from c29037a to 7fa5d53 Compare October 17, 2022 13:56
@RyanSkraba RyanSkraba requested a review from XComp October 17, 2022 14:03
@RyanSkraba
Copy link
Contributor Author

Sorry for the delay! I've merge/squashed all of the commits (with @snuyanzin now as co-author, thanks!). The only changes were 1 javadoc clarification and the small requested simplification in the unit test. Can you take a look?

Copy link
Contributor

@XComp XComp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RyanSkraba for addressing my comments. The change looks good now. Could you rebase the PR and we do a final PR run. It would also help to update the commit message and move the JavaDoc update into its own commit.

@RyanSkraba RyanSkraba force-pushed the rskraba/FLINK-29198-retry-exception branch from 7fa5d53 to bde4719 Compare October 20, 2022 07:42
@RyanSkraba
Copy link
Contributor Author

Rebased again and edited the commit messages!

@RyanSkraba RyanSkraba force-pushed the rskraba/FLINK-29198-retry-exception branch from bde4719 to 4235eb4 Compare October 21, 2022 15:38
@RyanSkraba
Copy link
Contributor Author

Rebased again and moved some modifications to their own separate commit

@RyanSkraba RyanSkraba requested review from XComp and snuyanzin and removed request for XComp October 24, 2022 09:34
RyanSkraba and others added 3 commits November 3, 2022 09:50
Co-authored-by: Sergey Nuyanzin <sergey.nuyanzin@aiven.io>
It's actually meant to test the JUnit4 RetryRule feature. The corresponding JUnit5 Extension is tested in `RetryOnFailureExtensionTest`.
@RyanSkraba RyanSkraba force-pushed the rskraba/FLINK-29198-retry-exception branch from 4235eb4 to fa21097 Compare November 3, 2022 08:51
@RyanSkraba
Copy link
Contributor Author

Rebased and changed the commit messages to the suggested text. Thanks for your attention!

In fact, for testing correctness, I would suggest that all three commits be cherry-picked back into branch-1.16

@XComp
Copy link
Contributor

XComp commented Nov 3, 2022

@flinkbot run azure

@XComp
Copy link
Contributor

XComp commented Nov 4, 2022

Failure caused by FLINK-24119

@XComp
Copy link
Contributor

XComp commented Nov 4, 2022

@flinkbot run azure

1 similar comment
@XComp
Copy link
Contributor

XComp commented Nov 4, 2022

@flinkbot run azure

@XComp XComp merged commit b9c5799 into apache:master Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants