IGNITE-9913 #6942

Closed
wants to merge 67 commits

Conversation

anton-vinogradov
Contributor

Signed-off-by: Anton Vinogradov <av@apache.org>

Comment on lines +106 to +114
GridAffinityAssignmentCache aff = grp.affinity();

Set<Integer> failedPrimaries = aff.primaryPartitions(fut.exchangeId().eventNode().id(), aff.lastVersion());
Set<Integer> locBackups = aff.backupPartitions(fut.sharedContext().localNodeId(), aff.lastVersion());

for (int part : failedPrimaries) {
    if (locBackups.contains(part))
        return true;
}
Contributor

This code does not look correct. A node may OWN a partition, but it will not be reported by aff.backupPartitions() because that method returns the current assignment, not the local partition state. GridDhtPartitionTopology should be read for this information. A corresponding test needs to be added.
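A minimal sketch of the check the comment asks for, reading the local partition state from GridDhtPartitionTopology instead of the assignment (an assumption-based illustration using localPartitions()/id()/state(), not the actual PR code):

GridAffinityAssignmentCache aff = grp.affinity();

Set<Integer> failedPrimaries = aff.primaryPartitions(fut.exchangeId().eventNode().id(), aff.lastVersion());

// Check what the local node actually owns, not what the assignment says it should back up.
for (GridDhtLocalPartition part : grp.topology().localPartitions()) {
    if (part.state() == GridDhtPartitionState.OWNING && failedPrimaries.contains(part.id()))
        return true;
}

return false;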

Contributor Author

Copying the answer from the issue.
Do you mean we also have to serve MOVING and RENTING partitions?

So, Set failedPrimaries = aff.primaryPartitions(fut.exchangeId().eventNode().id(), aff.lastVersion()); is a correct calculation, but
Set locBackups = aff.backupPartitions(fut.sharedContext().localNodeId(), aff.lastVersion()); should be replaced with dht.localPartitions() usage?

if (compatibilityNode || (crd && fut.localJoinExchange())) {
    GridDiscoveryManager disco = fut.sharedContext().discovery();

    if (protocolVer > 2 && fut.exchangeId().isLeft() && fut.isEventNodeInBaseline() &&
Member

This protocol works only if the cluster is in-memory and baseline auto-adjust is enabled.
How does it work when persistence is enabled, with and without baseline auto-adjust?

Contributor Author

What protocol do you mean?
protocolVer == 3 was introduced by me.

Member

Yes, I mean the protocol you introduced.

Contributor Author

It works when you have a baseline ... fut.isEventNodeInBaseline()
I can't understand the question :(

Member

disco.baselineChanged(fut.sharedContext().exchange().readyAffinityVersion(), fut.initialVersion())
This code will return true only if the baseline was changed by the NODE_LEFT event itself. The baseline is changed automatically in the discovery cache only if it is an in-memory cluster and auto-adjust is enabled. In other cases the baseline is changed manually (if auto-adjust is disabled) or by a separate event (ChangeGlobalState) right after the topology event (if auto-adjust is enabled).

Contributor Author

That's exactly the case covered by this fix :)

Member

But you run this PME only if the baseline is changed: disco.baselineChanged(fut.sharedContext().exchange().readyAffinityVersion(), fut.initialVersion())

Contributor Author

My bad :)

public boolean baselineChanged(AffinityTopologyVersion topVer1, AffinityTopologyVersion topVer2) {
    ...
    return Objects.equals(baseline1, baseline2);
}

Contributor Author

Typo fixed.

Contributor Author

Issue solved?

Comment on lines 1031 to 1035
public boolean isEventNodeInBaseline() {
    BaselineTopology top = cctx.discovery().discoCache().state().baselineTopology();

    return top != null && top.consistentIds().contains(firstDiscoEvt.eventNode().consistentId());
}
Contributor

Looks like there is a race here. discovery().discoCache() returns the current snapshot, which is updated in the discovery thread, while this method is executed in the exchange thread. The returned disco cache may have a topology version different from the exchange future's topology version.

Contributor

BTW, what happens when exchanges are merged? Two failed nodes -> do we need to check both? Failed and joined -> we should not allow the fast-path exchange.

Contributor Author

  1. Is the correct fix to use a versioned disco cache?
    cctx.discovery().discoCache(topVer)... correct?

  2. We never merge a left event with anything:
    if (baselineNodeLeft) baselineNodeLeft = true; merge = false;
    so even a cascade of failures will be processed one by one.

Contributor

  1. Seems you can use firstEvtDiscoCache
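A hedged sketch of that suggestion, reading the baseline from the disco cache captured for the exchange's first event instead of the current discovery snapshot (assuming firstEvtDiscoCache is the exchange future's field holding that snapshot):

public boolean isEventNodeInBaseline() {
    // The snapshot captured for the first event is consistent with the exchange topology
    // version, so the race with the discovery thread goes away.
    BaselineTopology top = firstEvtDiscoCache.state().baselineTopology();

    return top != null && top.consistentIds().contains(firstDiscoEvt.eventNode().consistentId());
}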

/**
* @throws IgniteCheckedException If failed.
*/
private void waitRecovery() throws IgniteCheckedException {
Contributor

The code in this method is very similar to waitPartitionRelease(). Can we avoid the duplication?

Contributor Author

Sure, the code was deliberately duplicated to simplify the PoC review.
I will compact it before the final review request.
The same applies to finishLocalTxs/recoverLocalTxsByNode.
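One possible way to compact it (an illustrative sketch under assumed names, not the PR code) is to extract the shared wait-and-log loop into a helper that both waitPartitionRelease() and waitRecovery() call with their own future and description:

/** Waits for the given future, logging a warning on every timeout (illustrative helper; the name is an assumption). */
private void waitWithTimeoutWarnings(IgniteInternalFuture<?> releaseFut, String what) throws IgniteCheckedException {
    long timeout = cctx.kernalContext().config().getNetworkTimeout();

    while (true) {
        try {
            releaseFut.get(timeout, TimeUnit.MILLISECONDS);

            return;
        }
        catch (IgniteFutureTimeoutCheckedException ignored) {
            // Diagnostics could be dumped here before retrying.
            U.warn(log, "Failed to wait for " + what + " within timeout [timeout=" + timeout + ']');
        }
    }
}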

@Override public void applyx(CacheGroupDescriptor desc) throws IgniteCheckedException {
    CacheGroupHolder cache = getOrCreateGroupHolder(fut.initialVersion(), desc);

    cache.aff.reinitializeWithoutOfflineNodes(fut.initialVersion());
Member

This is not a correct affinity initialization.
You don't take into account that the cluster may have MOVING partitions. In this case a primary node can be chosen from a MOVING backup, and affinity is not recalculated after the PME ends. You can see a correct affinity recalculation in org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerLeftWithExchangeMergeProtocol.

Contributor Author

Since I'm using assignment (not idealAssignment), which takes MOVING into account (correct?), shouldn't this be the correct initialization?

Member

There is no guarantee that the next backup will have the partition in OWNING state.
Case 1:
Suppose you have a partition with 2 backups. The primary is OWNING, the other nodes are MOVING:
OMM
The second backup completes rebalancing and now has the partition in OWNING state:
OMO
Now the primary node leaves and you have:
MO in the affinity distribution. A MOVING partition becomes primary, which is incorrect behavior.
Case 2:
You lost your last OWNING partition in the affinity. In this case detectLostPartitions should be triggered.

You also nullify WaitRebalanceInfo. You shouldn't do this, because if rebalancing is currently running, late affinity assignment will not be triggered after it completes.

Contributor Author

  1. Let's clarify: aff.assignment contains only OWNING partitions (correct?)
    I'm just shrinking the previous assignment, without recalculation.
    I see no MOVING issues here, am I missing something?

  2. Rebalance will be restarted in case of MOVING partitions.
    I see no problem here :(

Member

  1. The affinity assignment contains both OWNING and MOVING partitions.
  2. Rebalance is started, but late affinity assignment is not, because of the nullified WaitRebalanceInfo.
  3. detectLostPartitions should be triggered after PME, because if there are no OWNING copies left the remaining partitions should be marked LOST according to the loss policy.

/**
* @throws IgniteCheckedException If failed.
*/
private void waitRecovery() throws IgniteCheckedException {
Member

I see no tests in the PR :(
The following 2 cases need to be tested explicitly, to make sure that your optimization works correctly (a rough fragment for the message-blocking step is sketched after the list):
Case 1:

  1. Create a transaction that touches a primary partition belonging to the node that will leave.
  2. Make sure that the transaction is prepared on primary and backups and then block the finish request to backup.
  3. Make sure that new transactions can be started and committed on nodes that don't contain affected partitions
  4. Make sure that new transactions are blocked on nodes that contain affected partitions.
  5. Unblock finish request messages.
  6. Make sure that transactions waiting for local recovery are committed and PME on corresponding nodes finished.
  7. Make sure that new transactions are unblocked and committed on nodes where PME is finished.

Case 2:

  1. Create a transaction that touches a primary partition belonging to the node that will leave.
  2. Make sure that the transaction is prepared on primary and backups and then block the finish request to backup.
  3. Kill node, make sure that transaction recovery is started and then block recovery messages.
  4. Make sure that PME on node left is finished on nodes that don't contain affected partitions.
  5. Make sure that new transactions can be started and committed on nodes that don't contain affected partitions
  6. Make sure that new transactions are blocked on nodes that contain affected partitions.
  7. Unblock recovery messages.
  8. Make sure that transactions are recovered and committed and PME on corresponding nodes finished.
  9. Make sure that new transactions are unblocked and committed on nodes where PME is finished.
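For step 2 of both cases, blocking the finish request in a test is typically done with TestRecordingCommunicationSpi (a hedged fragment; it assumes the SPI is configured as the communication SPI of the failing node, and failingNode/backupNodeName are placeholder names):

// Hold GridDhtTxFinishRequest messages addressed to the backup until stopBlock() is called.
TestRecordingCommunicationSpi spi = TestRecordingCommunicationSpi.spi(failingNode);

spi.blockMessages(GridDhtTxFinishRequest.class, backupNodeName);

// ... stop the failing node, run the checks from steps 3-6 ...

// Unblock the held messages (step 5 of Case 1 / step 7 of Case 2).
spi.stopBlock();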

Comment on lines 1808 to 1819
public boolean baselineChanged(AffinityTopologyVersion topVer1, AffinityTopologyVersion topVer2) {
    assert !topVer1.equals(topVer2);

    DiscoCache disco1 = discoCache(topVer1);
    DiscoCache disco2 = discoCache(topVer2);

    BaselineTopology baseline1 = disco1 != null ? disco1.state().baselineTopology() : null;
    BaselineTopology baseline2 = disco2 != null ? disco2.state().baselineTopology() : null;

    return !Objects.equals(baseline1, baseline2);
}

Contributor

Looks like we only care about the baseline change compared to the previous version. Better to add a flag to DiscoCache indicating whether the baseline has changed compared to the previous version, and get rid of the internal collections comparison.
See org.apache.ignite.internal.managers.discovery.GridDiscoveryManager#createDiscoCache.

Contributor Author

Looks like it's possible to just use firstEvtDiscoCache.state().localBaselineAutoAdjustment()?
Updated the PR.

if (compatibilityNode || (crd && fut.localJoinExchange())) {
    GridDiscoveryManager disco = fut.sharedContext().discovery();

    if (protocolVer > 2 && fut.exchangeId().isLeft() && fut.isFirstEventNodeInBaseline() &&
Contributor

Protocol v3 should be disabled if the previous assignment is not ideal. Otherwise it means rebalancing is not finished, and some nodes may see different local partition states at the moment of assignment calculation.

GridAffinityAssignmentCache aff = grp.affinity();

Set<Integer> failedPrimaries = aff.primaryPartitions(fut.exchangeId().eventNode().id(), aff.lastVersion());
Set<Integer> loc = grp.topology().localPartitionMap().keySet();
Contributor

You should only take OWNING partitions into account here (skip RENTING partitions; MOVING partitions are not possible because the previous assignment is ideal).

@@ -1357,7 +1370,9 @@ private ExchangeType onServerNodeEvent(boolean crd) throws IgniteCheckedExceptio

exchCtx.events().warnNoAffinityNodes(cctx);

centralizedAff = cctx.affinity().onCentralizedAffinityChange(this, crd);
centralizedAff = exchCtx.baselineNodeLeft() ?
Contributor

centralizedAff is a part of the old protocol. It's better not to mix the old and new protocols; some refactoring is needed.

@@ -2225,7 +2351,7 @@ private String exchangeTimingsLogMessage(String header, List<String> timings) {

boolean locNodeNotCrd = crd == null || !crd.isLocal();

if (locNodeNotCrd && (serverNodeDiscoveryEvent() || localJoinExchange()))
if ((locNodeNotCrd && (serverNodeDiscoveryEvent() || localJoinExchange())) || exchCtx.baselineNodeLeft())
Contributor

Make sure the partition loss policy works well with the new protocol. I suggest removing nodes one by one under load until some subset of partitions has no owners left.

AffinityTopologyVersion topVer = sharedContext().exchange().readyAffinityVersion();

// Failed node's primary partitions or just all local backups in case of possible exchange merge.
Set<Integer> parts = nodeId != null ?
Contributor

Is this intended as an optimization?
It is not really needed, because at this point all possible updates have been received and all remaining gaps can be safely removed.

# Conflicts:
#	modules/core/src/main/java/org/apache/ignite/internal/processors/cache/CacheAffinitySharedManager.java
@anton-vinogradov
Contributor Author

continued at #7069
