-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-1546: Adding Read and Write Aggregations for Pacemaker to make it HA compatible #1710
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly minor clean-up comments.
We do need to update the Pacemaker documentation since we are replacing the old host config that is documented there.
@@ -276,6 +276,7 @@ resource.aware.scheduler.priority.strategy: "org.apache.storm.scheduler.resource | |||
dev.zookeeper.path: "/tmp/dev-storm-zookeeper" | |||
|
|||
pacemaker.host: "localhost" | |||
pacemaker.servers: [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove pacemaker.host
if it is no longer used.
|
||
public class PacemakerClientPool { | ||
|
||
private static final Logger LOG = LoggerFactory.getLogger(PacemakerClientPool.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check indentation
} | ||
return client; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: extra line
@@ -160,13 +179,23 @@ public void set_worker_hb(String path, byte[] data, List<ACL> acls) { | |||
int retry = maxRetries; | |||
while (true) { | |||
try { | |||
HashSet<String> retSet = new HashSet<>(); | |||
int latest_time_secs = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
latest_time_sec
not used?
@@ -18,48 +18,22 @@ | |||
package org.apache.storm.cluster; | |||
|
|||
import org.apache.storm.pacemaker.PacemakerClient; | |||
import org.apache.storm.pacemaker.PacemakerClientPool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove PacemakerClient import, as it's not used?
@@ -69,7 +69,7 @@ else if(evm instanceof HBMessage) { | |||
|
|||
@Override | |||
public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent event) { | |||
LOG.error("Connection to pacemaker failed", event.getCause()); | |||
LOG.error("Connection to pacemaker failed. Trying to reconnect {}", event.getCause().getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to log the stack here, only the message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want the stack. It ends up absolutely filling Nimbus logs with relatively useless messages when a Pacemaker node goes down. I'm open to ideas, though, because a trace for an NPE or something would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we know which exceptions we want to ignore, perhaps just IOExceptions, then we can explicitly blacklist them, and even log them under warn, instead of error.
|
||
private PacemakerClient clientProxy; | ||
public class PaceMakerStateStorageFactoryTest { | ||
private static Logger LOG = LoggerFactory.getLogger(PaceMakerStateStorageFactoryTest.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to be logging anything in this test. Do we need the Logger?
Addressed your comments. Not sure why they have not disappeared. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, just a nit, and possibly a backwards compatibility issue.
Also I think we need to update the docs to explain HA and how it works/how to configure it.
@isString | ||
public static final String PACEMAKER_HOST = "pacemaker.host"; | ||
@isStringList | ||
public static final String PACEMAKER_SERVERS = "pacemaker.servers"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change that would be good to go back to 1.x as well. I am OK with this, because I don't think anyone is really using pacemaker, but I would like to hear others opinions on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of backward compatibility change for this, like nimbus.host and nimbus.seeds. For 2.0 we don't need to keep nimbus.host and pacemaker.host but 1.x still need to have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeartSaVioR Yep, the 1.x-branch PR will leave pacemaker.host in, but have it deprecated.
@@ -69,7 +69,7 @@ else if(evm instanceof HBMessage) { | |||
|
|||
@Override | |||
public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent event) { | |||
LOG.error("Connection to pacemaker failed", event.getCause()); | |||
LOG.error("Connection to pacemaker failed. Trying to reconnect {}", event.getCause().getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we know which exceptions we want to ignore, perhaps just IOExceptions, then we can explicitly blacklist them, and even log them under warn, instead of error.
this.config = config; | ||
List<String> serverList = (List<String>)config.get(Config.PACEMAKER_SERVERS); | ||
if(serverList == null) { | ||
serverList = new ArrayList<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think <String>
is needed just <>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and in the following few lines too.
|
||
public List<HBMessage> sendAll(HBMessage m) throws PacemakerConnectionException { | ||
List<HBMessage> responses = new ArrayList<HBMessage>(); | ||
LOG.info("Using servers: {}", servers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should be debug (especially if we are worried about filling the logs)
if(ret == null) { | ||
throw new HBExecutionException("Failed to get a response."); | ||
} | ||
LOG.debug("Successful get_worker_hb"); | ||
return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the ret == null check before this? Is it possible to get more then one response and all of them are bogus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I wasn't quite sure how we wanted to handle this, NULL responses are valid, and right now the code treats mismatched responses as if they were NULL. We can distinguish between them if we want the behavior to differ whether the pacemakers returned NULL details or returned an invalid message. Throwing on null is not what we wanted, though, since it will cause Nimbus to crash when nimbus tries to read a topology's heartbeats before workers have started sending them.
Maybe we want to add some logic so that if ALL pacemakers return an invalid response, then we throw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is best, if all of them return something invalid or we get nothing back at all then we throw.
+1 |
Documentation updated. |
Thanks for doc update. +1 |
No description provided.