
Conversation

@mivanac (Contributor) commented Nov 16, 2021

…ion is not completed

Co-authored-by: albertogpz alberto.gomez@est.tech

In this solution, for each server, if colocation is not completed (for any possible reason), we skip RedundancyRecovery. When the last server registers the partitioned region, it sets colocation as completed, notifies all other servers, and triggers the creation of buckets on all other servers.
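The gating logic described above can be sketched in plain Java. This is an illustrative model, not Geode's actual API: `ColocationTracker`, `registerRegion`, and the other names here are hypothetical.

```java
// Hypothetical sketch of the colocation-gated recovery decision: each
// registration is skipped for recovery until the last colocated region
// is registered, at which point recovery is triggered for everyone.
import java.util.ArrayList;
import java.util.List;

class ColocationTracker {
    private final int expectedRegions;  // colocated regions expected on this member
    private final List<String> registered = new ArrayList<>();
    private boolean recoveryTriggered = false;

    ColocationTracker(int expectedRegions) {
        this.expectedRegions = expectedRegions;
    }

    // Called when one of the colocated partitioned regions is registered.
    void registerRegion(String regionName) {
        registered.add(regionName);
        if (!isColocationCompleted()) {
            return; // skip redundancy recovery until colocation completes
        }
        // The last registration completes colocation: this is where the
        // member would notify the others and trigger bucket creation.
        recoveryTriggered = true;
    }

    boolean isColocationCompleted() {
        return registered.size() == expectedRegions;
    }

    boolean isRecoveryTriggered() {
        return recoveryTriggered;
    }
}
```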

For all changes:

  • [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

  • [x] Has your PR been rebased against the latest commit within the target branch (typically develop)?

  • [x] Is your initial contribution a single, squashed commit?

  • [x] Does gradlew build run cleanly?

  • [x] Have you written or updated unit tests to verify your changes?

  • [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

"create gateway-sender --id=parallelPositions --remote-distributed-system-id=1 --enable-persistence=true --disk-store-name=data --parallel=true")
.statusIsSuccess();

GeodeAwaitility.await().until(() -> {
Contributor
Shouldn't you add an atMost() condition here?
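For context, the bounded wait that `atMost()` provides (Awaitility's `await().atMost(...).until(...)`) can be sketched with the standard library alone: poll a condition, but give up after a deadline instead of hanging forever. `BoundedAwait` and its parameters are illustrative names, not part of Geode or Awaitility.

```java
// Minimal stdlib sketch of a bounded wait: returns true if the condition
// became true before the deadline, false otherwise.
import java.time.Duration;
import java.util.function.BooleanSupplier;

class BoundedAwait {
    static boolean awaitAtMost(Duration timeout, Duration pollInterval,
                               BooleanSupplier condition) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;               // condition met within the bound
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return condition.getAsBoolean();   // one final check at the deadline
    }
}
```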

@mivanac (Contributor Author) Nov 16, 2021

This was commented on in the previous PR, and updated accordingly.

Contributor Author

I agree with you.

* immediately in all servers.
*/
@Test
public void alterInitializedRegionWithGwSenderOnManyServersDoesNotTakeTooLong() {
Contributor

This test case seems to pass on my laptop without the fix. Should more servers be added?

Contributor Author

Also, this test reproduces the fault from the old PR. If you repeat the test case several times, it will fail.

Contributor Author

This would certainly fail with a changed timeout, but the timeout cannot be changed. For details, see the comments in the previous PR.

@albertogpz (Contributor) left a comment

I have a general comment about how the tests have been modified.

I think the problem with the modified test cases is that they no longer check whether recovery is started in at least one server. They only check that if recovery is started, it is also finished; it may happen (because it is not checked) that recovery is never started in any server.

I suggest an alternative change that checks that recovery finishes in the last server started.

@mivanac (Contributor Author) commented Nov 17, 2021

But recovery could be started on more servers than just the last one. In real situations, when a region is created, recovery will be started on almost all servers. So I am not sure that checking recovery on only one server is a valid test.

@mivanac (Contributor Author) commented Nov 17, 2021

It only depends on whether colocation is completed.

@albertogpz (Contributor)

> It only depends on whether colocation is completed.

If servers are started sequentially, colocation will not be completed until the last one is started, which is why recovery will only be run on it.
If servers are started in parallel, it must be checked that recovery is executed in at least one of them.

As almost all test cases start the servers sequentially, I suggested checking recovery on the last one.
For test cases that start the servers in parallel, it must be checked that recovery is started in at least one of them; alternatively, the startup could be changed to start all the servers in parallel except for one, and then check that recovery was started on that last one.

@mivanac (Contributor Author) commented Nov 17, 2021

We are talking about the creation of the child region on servers, not about starting servers. The normal behavior when you create a region is that it is created on all servers in parallel.

@albertogpz (Contributor)

> We are talking about the creation of the child region on servers, not about starting servers. The normal behavior when you create a region is that it is created on all servers in parallel.

You are right. I meant region creation on each server and not server start.

@mivanac (Contributor Author) commented Nov 17, 2021

It would be good if a code owner could comment on this issue and give their opinion on the new solution and on what is expected from the test.

…is not triggered if colocation is not completed
@mivanac mivanac force-pushed the newfeature/GEODE-9642 branch from b21005c to 8d5d016 on November 17, 2021 at 20:51
@mhansonp (Contributor) left a comment

There seems to be some debate here about the behavior changes. Waiting to see the result.

@mhansonp mhansonp self-requested a review November 17, 2021 23:26
@mivanac mivanac force-pushed the newfeature/GEODE-9642 branch from e595393 to f431983 on November 18, 2021 at 14:47
@mhansonp (Contributor)

@albertogpz Did your concerns get addressed?

@albertogpz (Contributor)

> @albertogpz Did your concerns get addressed?

I think the modified tests still need to check that recovery was started at least in one of the servers.

@DonalEvans (Contributor)

> > @albertogpz Did your concerns get addressed?
>
> I think the modified tests still need to check that recovery was started at least in one of the servers.

As far as I understand the tests, any time `checkIfRecoveryExecuted()` is called, it's implicitly checking that recovery was started, because if recovery is never started on any servers, then the calls to `recoveryStarted.await()` will time out and `recoveryExecuted` will not be set to true.
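The implicit check described above can be sketched with a `CountDownLatch`: the executed flag can only become true if the latch was counted down, i.e. if recovery actually started somewhere. `RecoveryObserver` and `onRecoveryStarted` are illustrative names assumed for this sketch, not the test's exact code.

```java
// Sketch: a timed latch await makes "recovery finished" imply
// "recovery started", because the flag is only set if the latch fires.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class RecoveryObserver {
    private final CountDownLatch recoveryStarted = new CountDownLatch(1);
    private volatile boolean recoveryExecuted = false;

    // Called from the recovery hook on any server.
    void onRecoveryStarted() {
        recoveryStarted.countDown();
    }

    boolean checkIfRecoveryExecuted(long timeout, TimeUnit unit)
            throws InterruptedException {
        // If recovery never starts, await(...) times out and
        // recoveryExecuted stays false.
        if (recoveryStarted.await(timeout, unit)) {
            recoveryExecuted = true;
        }
        return recoveryExecuted;
    }
}
```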

Comment on lines +2178 to +2180
private boolean getRecoveryStatus() {
return recoveryExecuted;
}
Contributor

This can probably be inlined.

@albertogpz (Contributor) commented Dec 2, 2021

> > > @albertogpz Did your concerns get addressed?
> >
> > I think the modified tests still need to check that recovery was started at least in one of the servers.
>
> As far as I understand the tests, any time `checkIfRecoveryExecuted()` is called, it's implicitly checking that recovery was started, because if recovery is never started on any servers, then the calls to `recoveryStarted.await()` will time out and `recoveryExecuted` will not be set to true.

That's right. I had not seen that check.

@albertogpz (Contributor)

> > @albertogpz Did your concerns get addressed?
>
> I think the modified tests still need to check that recovery was started at least in one of the servers.

My concerns have been addressed. The check that recovery started in at least one server is now covered.

@mhansonp (Contributor) left a comment

See comment. It looks like you should use an await in one case.

public void waitForRegion(Region region, long timeout) throws InterruptedException {
long start = System.currentTimeMillis();
synchronized (this) {
while (!recoveryStartedOnRegions.contains(region)) {
Contributor

This should be simplified to an `await().until(...)`.
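The suggestion above, replacing the hand-rolled `synchronized`/`wait` loop with a bounded poll, can be sketched as follows. To keep the sketch self-contained, `Region` is replaced by a plain `String` key and the class name `RecoveryTracker` is hypothetical; in the actual test this would be the `GeodeAwaitility.await().until(...)` form.

```java
// Sketch: bounded polling instead of a hand-rolled synchronized wait loop.
// Returns true if recovery started for the region before the timeout.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class RecoveryTracker {
    private final Set<String> recoveryStartedOnRegions = ConcurrentHashMap.newKeySet();

    void markStarted(String region) {
        recoveryStartedOnRegions.add(region);
    }

    // Rough equivalent of: await().atMost(timeoutMillis).until(() -> started)
    boolean waitForRegion(String region, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!recoveryStartedOnRegions.contains(region)) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // timed out waiting for recovery to start
            }
            Thread.sleep(10); // poll interval
        }
        return true;
    }
}
```

A concurrent set avoids the explicit `synchronized` block, and the deadline makes the wait bounded, which is the property the reviewer is asking for.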

@mivanac mivanac closed this Mar 15, 2022
4 participants