Skip to content
This repository has been archived by the owner on Jun 7, 2021. It is now read-only.

TRAFODION-2940 In HA env, one node lose network, when recover, trafci can't use #1427

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mashengchen
Copy link
Contributor

when there loses network for a long time ,and then network recover, there will trigger zookeeper session expired, at this time ,check whether current dcsmaster is leader, if not unbind this node's floating ip, and make dcsmaster init status, then rerun dcs master.

@Traf-Jenkins
Copy link

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2399/

@Traf-Jenkins
Copy link

Copy link
Contributor

@DaveBirdsall DaveBirdsall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one question about the changes.

LOG.info("Connected to ZooKeeper successful, restart DCS Master.");
// reset lock
isLeader = new CountDownLatch(1);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this logic. Do we sit inside one of these method calls during normal processing? Or does tmpZkc.connect() and close() complete immediately? If so, it looks like we just loop around the while loop and do it again, over and over.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic is for when dcsmaster return with network error situation.
in the logic , it will try to connect to zk
if it can't conn ( tmpZkc.connect(); ) , there will in catch block and do retry
if it connect to zk, close the connect, then dcsmaster will run call() method again, in the time dcsmaster rework ,there must hava another backup-master working ( because there must one dcs master work and current master lose network ,then backupmaster take over the role) , so when dcsmaster rework , it will set value in zk /rootpath/dcs/leader/ then hang by lock "isLeader = new CountDownLatch(1);"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, Zookeeper connection lost has been covered by session expired event, so this loop is useless.

@moscowgentalman
Copy link
Contributor

Hope some DCS experts to take a review. @mashengchen please invite proper DCS experts. I cannot understand these changes well.

@mashengchen
Copy link
Contributor Author

@arvind-narain @kevinxu021 can you help to take a look

@hegdean
Copy link
Contributor

hegdean commented Feb 9, 2018

Can you please describe with more details regarding the original issue in the JIRA. It is unclear to me the scenario with 2 floating IPs. Let's hold off on the merge until we understand the original issue that will help us understand the changes done

@mashengchen
Copy link
Contributor Author

  1. config ha env
  2. sqstart
  3. use iptables to down master node's network, (iptables -I INPUT -s hostname -j DROP)
  4. sleep for 300 seconds
  5. backup-master take over the master role
  6. recover network (iptables -I INPUT -s hostname -d DROP)
  7. pdsh $MY_NODES ifconfig|grep 23400 will have 2 results, one is the down dcsmaster and another is the backup-master
    actually , the old dcsmaster is still in while loop when network recover, so there will have 2 working dcsmaster.

@DaveBirdsall
Copy link
Contributor

@hegdean, are you happy with this change now? Should I merge it?

Copy link
Contributor

@svarnau svarnau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not versed in networking enough to judge teh whole change. Just comment on duplicated code.

This code was tested against the test scenario you described? And it solved the duplicate IP problem that the old code did not?

unbindlb=`echo "$myifport"|awk '{print $NF}'`
echo "Virtual ip $gv_float_external_ip is in use on node $HOSTNAME bound to interface $myinterface($unbindlb) - unbinding..."
sudo /sbin/ip addr del $unbindip dev $myinterface
status=$?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unnecessary to duplicate entire function just to change whether commands run locally or remotely. That can be done with a variable, so that we don't have two copies of this code to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is a little different from the one using PDSH, the ouput of $SQ_PDSH $MY_NODES /sbin/ip addr show is different from /sbin/ip link show , so the cut step after the " | " (pipe) is different.

@mashengchen
Copy link
Contributor Author

@svarnau yes , I had done the test, and it solved the duplicate IP problem

Copy link
Contributor

@uamgo uamgo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some fixes needed. Waiting for the changes.

LOG.info("Connected to ZooKeeper successful, restart DCS Master.");
// reset lock
isLeader = new CountDownLatch(1);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, Zookeeper connection lost has been covered by session expired event, so this loop is useless.

LOG.error(e);
System.exit(1);
LOG.error(e.getMessage(), e);
return 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close zk connection while any connection exception happens. it's better to use specific Exception for zookeeper.

@@ -203,8 +244,8 @@ public void run() {
} catch (KeeperException.NodeExistsException e) {
// do nothing...some other server has created znodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be better to add warning for this exception that we could what happened. Any events happened should be reflected in log files.

}
}
return 1;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add try-catch for each progress.

if (scriptContext.getExitCode() == 0)
LOG.info("Unbind Floating IP successful");
else
LOG.error("Unbind Floating IP failed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to print exit code.

@@ -93,6 +93,7 @@ public void run() {
try {
queue.wait();
} catch (InterruptedException e) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log message for what happened by debug level.

setSessionRecoverSuccessful(false);
}
} catch (KeeperException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log messages in file.

connect(true);
} catch (IOException e) {
setSessionRecoverSuccessful(false);
LOG.error("session expired and throw IOException while do reconnect: " + e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use 2 params API for log. If it doesn't matter for whatever the message is, it should be LOG.warn.

setSessionRecoverSuccessful(false);
LOG.error("session expired and throw IOException while do reconnect: " + e.getMessage());
} catch (InterruptedException e) {
LOG.error("session expired and throw InterruptedException while do reconnect: " + e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

// session expired, may be never happending. but if it happen there
// need to close old client and rebuild new client
try {
connect(true);
Copy link
Contributor

@uamgo uamgo Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In connect(...), I saw it was catched for zookeeper exception, how can we get exception here?

@Traf-Jenkins
Copy link

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2475/

@Traf-Jenkins
Copy link

@Traf-Jenkins
Copy link

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2691/

@Traf-Jenkins
Copy link

@hegdean
Copy link
Contributor

hegdean commented May 31, 2018

@arvind can u please review this

@arvind
Copy link

arvind commented May 31, 2018

I think the wrong Arvind was tagged :)

@hegdean
Copy link
Contributor

hegdean commented May 31, 2018

@arvind-narain can u please review

@Traf-Jenkins
Copy link

Check Test Started: https://jenkins.esgyn.com/job/Check-PR-master/2887/

@Traf-Jenkins
Copy link

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
8 participants