-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-3379: Fix intermittent NPE during worker boot in local mode #2998
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. Left minor comments.
// Add new connections atomically | ||
cachedNodeToPortSocket.getAndUpdate(prev -> { | ||
Map<NodeInfo, IConnection> next = new HashMap<>(prev); | ||
for (NodeInfo nodeInfo : newConnections) { | ||
next.put(nodeInfo, | ||
mqContext.connect( | ||
topologyId, | ||
assignment.get_node_host().get(nodeInfo.get_node()), // Host | ||
nodeHost.get(nodeInfo.get_node()), // Host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could throw NPE when assignment is null since nodeHost will be null. If there's some assumption this line will not be evaluated when assignment is null, may be better to leave a comment describing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment. NodeHost can't be null here, since we're iterating over newConnections. newConnections is only non-empty if assignment is not null.
I looked at rewriting this to make that clearer, but didn't like how it turned out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's OK once we make clear it doesn't throw NPE. Thanks for explaining here.
@@ -112,6 +112,8 @@ | |||
private ThriftServer thriftServer; | |||
//used for local cluster heartbeating | |||
private Nimbus.Iface localNimbus; | |||
//Passed to workers in local clusters, exposed by thrift server in distributed mode | |||
private org.apache.storm.generated.Supervisor.Iface superviserThriftInterface; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: superviserThriftInterface -> supervisorThriftInterface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, nice catch
Will also put +1 once everything is sorted out in #2996. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Squashed and rebased on top of STORM-3376. Will merge shortly. Thanks for the reviews. |
Contains #2996, please review that first.
This issue is caused by a race. When Nimbus assigns executors to a supervisor, it writes them into Zookeeper. The supervisor is also told via Thrift (in local mode via a handle directly to Nimbus). When the worker starts and initializes WorkerState, it normally uses Thrift to ask the supervisor about the assignment. If that fails it tries Zookeeper. The attempt to read from Zookeeper is just a best-effort attempt to start the worker, it is inherently racy.
In local mode, the supervisor Thrift server isn't running, so the WorkerState just skips straight to reading from Zookeeper. Since the supervisor is told by Nimbus about assignment directly, there is no guarantee that the assignment becomes visible in Zookeeper before the WorkerState tries reading it. This causes the NPE.
The fix is to provide the Worker with a handle to the supervisor Thrift interface in local mode, similar to how the supervisor gets a handle to Nimbus. WorkerState can then ask the supervisor directly about the assignment.
I ran the tests in TestPlanCompiler (which runs a local cluster) a few hundred times, and they appear stable with this fix and STORM-3376.