Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-1546: Adding Read and Write Aggregations for Pacemaker to make it HA compatible #1710

Merged
merged 15 commits into from
Oct 5, 2016

Conversation

knusbaum
Copy link
Contributor

No description provided.

@knusbaum knusbaum closed this Sep 23, 2016
@knusbaum knusbaum reopened this Sep 23, 2016
Copy link

@d2r d2r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor clean-up comments.

We do need to update the Pacemaker documentation since we are replacing the old host config that is documented there.

@@ -276,6 +276,7 @@ resource.aware.scheduler.priority.strategy: "org.apache.storm.scheduler.resource
dev.zookeeper.path: "/tmp/dev-storm-zookeeper"

pacemaker.host: "localhost"
pacemaker.servers: []
Copy link

@d2r d2r Sep 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove pacemaker.host if it is no longer used.


public class PacemakerClientPool {

private static final Logger LOG = LoggerFactory.getLogger(PacemakerClientPool.class);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check indentation

}
return client;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: extra line

@@ -160,13 +179,23 @@ public void set_worker_hb(String path, byte[] data, List<ACL> acls) {
int retry = maxRetries;
while (true) {
try {
HashSet<String> retSet = new HashSet<>();
int latest_time_secs = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest_time_sec not used?

@@ -18,48 +18,22 @@
package org.apache.storm.cluster;

import org.apache.storm.pacemaker.PacemakerClient;
import org.apache.storm.pacemaker.PacemakerClientPool;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove PacemakerClient import, as it's not used?

@@ -69,7 +69,7 @@ else if(evm instanceof HBMessage) {

@Override
public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent event) {
LOG.error("Connection to pacemaker failed", event.getCause());
LOG.error("Connection to pacemaker failed. Trying to reconnect {}", event.getCause().getMessage());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to log the stack here, only the message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want the stack. It ends up absolutely filling Nimbus logs with relatively useless messages when a Pacemaker node goes down. I'm open to ideas, though, because a trace for an NPE or something would be nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know which exceptions we want to ignore, perhaps just IOExceptions, then we can explicitly blacklist them, and even log them under warn, instead of error.


private PacemakerClient clientProxy;
public class PaceMakerStateStorageFactoryTest {
private static Logger LOG = LoggerFactory.getLogger(PaceMakerStateStorageFactoryTest.class);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to be logging anything in this test. Do we need the Logger?

@knusbaum
Copy link
Contributor Author

Addressed your comments. Not sure why they have not disappeared.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, just a nit, and possibly a backwards compatibility issue.

Also I think we need to update the docs to explain HA and how it works/how to configure it.

@isString
public static final String PACEMAKER_HOST = "pacemaker.host";
@isStringList
public static final String PACEMAKER_SERVERS = "pacemaker.servers";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change that would be good to go back to 1.x as well. I am OK with this, because I don't think anyone is really using pacemaker, but I would like to hear others opinions on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of backward compatibility change for this, like nimbus.host and nimbus.seeds. For 2.0 we don't need to keep nimbus.host and pacemaker.host but 1.x still need to have it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeartSaVioR Yep, the 1.x-branch PR will leave pacemaker.host in, but have it deprecated.

@@ -69,7 +69,7 @@ else if(evm instanceof HBMessage) {

@Override
public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent event) {
LOG.error("Connection to pacemaker failed", event.getCause());
LOG.error("Connection to pacemaker failed. Trying to reconnect {}", event.getCause().getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know which exceptions we want to ignore, perhaps just IOExceptions, then we can explicitly blacklist them, and even log them under warn, instead of error.

this.config = config;
List<String> serverList = (List<String>)config.get(Config.PACEMAKER_SERVERS);
if(serverList == null) {
serverList = new ArrayList<String>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think <String> is needed just <>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and in the following few lines too.


public List<HBMessage> sendAll(HBMessage m) throws PacemakerConnectionException {
List<HBMessage> responses = new ArrayList<HBMessage>();
LOG.info("Using servers: {}", servers);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be debug (especially if we are worried about filling the logs)

if(ret == null) {
throw new HBExecutionException("Failed to get a response.");
}
LOG.debug("Successful get_worker_hb");
return ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the ret == null check before this? Is it possible to get more then one response and all of them are bogus?

Copy link
Contributor Author

@knusbaum knusbaum Sep 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wasn't quite sure how we wanted to handle this, NULL responses are valid, and right now the code treats mismatched responses as if they were NULL. We can distinguish between them if we want the behavior to differ whether the pacemakers returned NULL details or returned an invalid message. Throwing on null is not what we wanted, though, since it will cause Nimbus to crash when nimbus tries to read a topology's heartbeats before workers have started sending them.

Maybe we want to add some logic so that if ALL pacemakers return an invalid response, then we throw?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is best, if all of them return something invalid or we get nothing back at all then we throw.

@knusbaum
Copy link
Contributor Author

@d2r @revans2 Any additional comments?

@revans2
Copy link
Contributor

revans2 commented Oct 3, 2016

+1

@knusbaum
Copy link
Contributor Author

knusbaum commented Oct 4, 2016

Documentation updated.

@d2r
Copy link

d2r commented Oct 5, 2016

Thanks for doc update. +1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants