GEODE-9642: skip RedundancyRecovery for colocated regions, if colocation is not completed #7114
Conversation
| "create gateway-sender --id=parallelPositions --remote-distributed-system-id=1 --enable-persistence=true --disk-store-name=data --parallel=true") | ||
| .statusIsSuccess(); | ||
|
|
||
| GeodeAwaitility.await().until(() -> { |
Shouldn't you add an atMost() condition here?
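For illustration, a minimal sketch of what an atMost() bound could look like here; the class and condition are placeholders, not the actual test code:

    import java.time.Duration;

    import org.apache.geode.test.awaitility.GeodeAwaitility;

    class AwaitBoundSketch {
      // Placeholder for whatever condition the test actually polls.
      private static boolean conditionHolds() {
        return true;
      }

      static void waitWithExplicitBound() {
        // atMost() overrides GeodeAwaitility's default timeout for this wait only.
        GeodeAwaitility.await()
            .atMost(Duration.ofMinutes(2)) // illustrative bound, not from the PR
            .until(AwaitBoundSketch::conditionHolds);
      }
    }

Note that GeodeAwaitility.await() already applies a default timeout, which may be why an explicit atMost() was ultimately not added.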
In the previous PR this was commented on, and it was updated accordingly.
I agree with you.
     * immediately in all servers.
     */
    @Test
    public void alterInitializedRegionWithGwSenderOnManyServersDoesNotTakeTooLong() {
This test case seems to pass on my laptop without the fix. Should more servers be added?
Also, this is the test that reproduced the fault in the old PR. If you repeat the test case several times, it will fail.
For sure, this would fail with a changed timeout, but that cannot be changed. For details, see the comments in the previous PR.
albertogpz left a comment
I have a general comment about how the tests have been modified.
I think the problem with the modified test cases is that they no longer check whether recovery is started in at least one server. They only check that if recovery is started, it also finishes, but it may happen (because it is not checked) that recovery is not started in any server.
I suggest an alternative change in which it is checked that recovery finishes in the last server started.
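For illustration, a rough sketch of that suggested check; MemberVM is from Geode's DUnit rules, while the class and the recoveryFinished() hook are hypothetical stand-ins for whatever observer the test installs:

    import org.apache.geode.test.awaitility.GeodeAwaitility;
    import org.apache.geode.test.dunit.rules.MemberVM;

    class LastServerRecoverySketch {
      // Hypothetical observer hook recording that redundancy recovery finished.
      private static boolean recoveryFinished() {
        return true;
      }

      // With sequential creation, colocation completes only on the member that
      // creates the region last, so recovery is expected to finish there.
      static void assertRecoveryFinishedOnLastServer(MemberVM lastServer) {
        lastServer.invoke(() -> GeodeAwaitility.await()
            .until(LastServerRecoverySketch::recoveryFinished));
      }
    }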
But recovery could be started on more servers than just the last one. In real situations, when a region is created, recovery will be started on almost all servers. So I am not sure that checking recovery on only one server is a valid test.
It only depends on whether colocation is completed.
If servers are started sequentially, colocation will not be completed until the last one is started, and that's why recovery will only run on it. As almost all test cases start the servers sequentially, I suggested checking recovery on the last one.
We are talking about the creation of the child region on the servers, not about starting the servers. The normal behavior when you create a region is that it is created on all servers in parallel.
You are right. I meant region creation on each server, not server start.
It would be good if a code owner could comment on this issue and give their opinion on the new solution and on what is expected from the test.
…is not triggered if colocation is not completed
b21005c to 8d5d016
.../src/distributedTest/java/org/apache/geode/internal/cache/execute/PRColocationDUnitTest.java
mhansonp left a comment
There seems to be some debate here about the behavior changes. Waiting to see the result.
e595393 to f431983
@albertogpz Did your concerns get addressed?
I think the modified tests still need to check that recovery was started in at least one of the servers.
As far as I understand the tests, any time …
    private boolean getRecoveryStatus() {
      return recoveryExecuted;
    }
This can probably be inlined.
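For illustration, what the inlining might look like at a call site, assuming the callers are awaits of roughly this shape:

    // Before: the wait goes through the one-line private accessor.
    GeodeAwaitility.await().until(() -> getRecoveryStatus());

    // After inlining: read the flag directly and delete the accessor.
    // (If the flag is written from another thread, it should be volatile.)
    GeodeAwaitility.await().until(() -> recoveryExecuted);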
That's right. I had not seen that check.
My concerns have been addressed. The check that recovery started in at least one server is now covered.
mhansonp left a comment
See comment. Looks like you should use an await in one case.
    public void waitForRegion(Region region, long timeout) throws InterruptedException {
      long start = System.currentTimeMillis();
      synchronized (this) {
        while (!recoveryStartedOnRegions.contains(region)) {
This should be simplified to an await().until(...).
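For illustration, a minimal sketch of that simplification, assuming recoveryStartedOnRegions is (or is made) a thread-safe set so the condition can be polled without holding the lock; GeodeAwaitility's default timeout also replaces the explicit timeout parameter:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.geode.cache.Region;
    import org.apache.geode.test.awaitility.GeodeAwaitility;

    class RecoveryObserverSketch {
      // Assumed thread-safe so the await can poll it lock-free.
      private final Set<Region<?, ?>> recoveryStartedOnRegions =
          ConcurrentHashMap.newKeySet();

      // Replaces the synchronized/while wait loop with a polled condition.
      public void waitForRegion(Region<?, ?> region) {
        GeodeAwaitility.await()
            .until(() -> recoveryStartedOnRegions.contains(region));
      }
    }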
…ion is not completed
Co-authored-by: albertogpz <alberto.gomez@est.tech>
In this solution, for each server, if colocation is not completed (for any possible reason), we skip RedundancyRecovery. So, when the last server registers the partitioned region, it will set colocation as completed, notify all the other servers, and trigger the creation of buckets on all of them.
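For illustration, a hypothetical sketch of that control flow; the method names below (isColocationComplete, notifyColocationComplete, scheduleRedundancyRecovery) are stand-ins, not the actual Geode internals:

    // Hypothetical flow on each member when a colocated partitioned region
    // is registered; not the real Geode code.
    void onColocatedRegionRegistered(PartitionedRegion region) {
      if (!isColocationComplete(region)) {
        // Colocation chain not complete on this member yet (for whatever reason):
        // skip RedundancyRecovery; the member that completes the chain triggers it.
        return;
      }
      // Last member to register the region: mark colocation complete, notify the
      // other members, and schedule recovery, which creates buckets everywhere.
      notifyColocationComplete(region);
      scheduleRedundancyRecovery(region);
    }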
For all changes:
[*] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
[*] Has your PR been rebased against the latest commit within the target branch (typically develop)?
[*] Is your initial contribution a single, squashed commit?
[*] Does gradlew build run cleanly?
[*] Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?