Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-3379: Fix intermittent NPE during worker boot in local mode #2998

Merged
merged 2 commits into from
May 5, 2019

Conversation

srdo
Copy link
Contributor

@srdo srdo commented Apr 14, 2019

Contains #2996, please review that first.

This issue is caused by a race. When Nimbus assigns executors to a supervisor, it writes them into Zookeeper. The supervisor is also told via Thrift (in local mode via a handle directly to Nimbus). When the worker starts and initializes WorkerState, it normally uses Thrift to ask the supervisor about the assignment. If that fails it tries Zookeeper. The attempt to read from Zookeeper is just a best-effort attempt to start the worker, it is inherently racy.

In local mode, the supervisor Thrift server isn't running, so the WorkerState just skips straight to reading from Zookeeper. Since the supervisor is told by Nimbus about assignment directly, there is no guarantee that the assignment becomes visible in Zookeeper before the WorkerState tries reading it. This causes the NPE.

The fix is to provide the Worker with a handle to the supervisor Thrift interface in local mode, similar to how the supervisor gets a handle to Nimbus. WorkerState can then ask the supervisor directly about the assignment.

I ran the tests in TestPlanCompiler (which runs a local cluster) a few hundred times, and they appear stable with this fix and STORM-3376.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Left minor comments.

// Add new connections atomically
cachedNodeToPortSocket.getAndUpdate(prev -> {
Map<NodeInfo, IConnection> next = new HashMap<>(prev);
for (NodeInfo nodeInfo : newConnections) {
next.put(nodeInfo,
mqContext.connect(
topologyId,
assignment.get_node_host().get(nodeInfo.get_node()), // Host
nodeHost.get(nodeInfo.get_node()), // Host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could throw NPE when assignment is null since nodeHost will be null. If there's some assumption this line will not be evaluated when assignment is null, may be better to leave a comment describing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. NodeHost can't be null here, since we're iterating over newConnections. newConnections is only non-empty if assignment is not null.

I looked at rewriting this to make that clearer, but didn't like how it turned out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's OK once we make clear it doesn't throw NPE. Thanks for explaining here.

@@ -112,6 +112,8 @@
private ThriftServer thriftServer;
//used for local cluster heartbeating
private Nimbus.Iface localNimbus;
//Passed to workers in local clusters, exposed by thrift server in distributed mode
private org.apache.storm.generated.Supervisor.Iface superviserThriftInterface;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: superviserThriftInterface -> supervisorThriftInterface

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, nice catch

@HeartSaVioR
Copy link
Contributor

Will also put +1 once everything is sorted out in #2996.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@srdo
Copy link
Contributor Author

srdo commented May 5, 2019

Squashed and rebased on top of STORM-3376. Will merge shortly. Thanks for the reviews.

@asfgit asfgit merged commit c985695 into apache:master May 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants