STORM-3321: Fix race in LocalCluster regarding Nimbus leadership, reduce poll timers for Nimbus and supervisor to speed up tests and avoid timeouts #2945

Merged 1 commit into apache:master on Mar 7, 2019

Conversation

@srdo (Contributor) commented Jan 21, 2019

Reduce poll timers for Nimbus and supervisor to speed up tests and avoid timeouts.

https://issues.apache.org/jira/browse/STORM-3321

I also updated an error message to print how long Slots take to shut down; previously it would print 1000 ms no matter how long the shutdown actually took.
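
For illustration, a minimal sketch of the kind of change described, reporting the measured elapsed time instead of a fixed 1000 ms constant (the identifiers and message text here are assumptions, not the actual diff):

    long start = System.currentTimeMillis();
    // ... wait for the Slot to finish shutting down ...
    long elapsedMs = System.currentTimeMillis() - start;
    LOG.error("Slot took {} ms to shut down", elapsedMs);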

The timer interval reductions in LocalCluster only happen if time simulation is disabled. Some tests (e.g. nimbus_test.clj) use time simulation and want those timers to be longer, so it seemed easier to just leave them alone for those tests.
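
As a side note, a minimal sketch of how a test opts into time simulation (and therefore keeps the longer default timers), assuming the Time.SimulatedTime try-with-resources helper from org.apache.storm.utils.Time:

    try (Time.SimulatedTime simulated = new Time.SimulatedTime()) {
        try (LocalCluster cluster = new LocalCluster.Builder().build()) {
            // Time.isSimulating() is true here, so LocalCluster leaves the
            // Nimbus/supervisor poll intervals at their defaults and the test
            // advances the clock manually (e.g. via Time.advanceTimeSecs).
        }
    }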

@HeartSaVioR (Contributor) left a comment:

Thanks for looking into this, and sorry for reviewing it so late. I left some minor comments, but it looks good overall.

@@ -231,6 +233,10 @@ private LocalCluster(Builder builder) throws Exception {
        } else {
            this.clusterState = builder.clusterState;
        }
        if (!Time.isSimulating()) {
            //Ensure Nimbus assigns topologies as quickly as possible
            conf.put(DaemonConfig.NIMBUS_MONITOR_FREQ_SECS, 1);
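
The snippet above is cut off; presumably the guarded block closes right after the conf.put, roughly like this sketch (the closing brace and exact surroundings are assumptions):

    if (!Time.isSimulating()) {
        // Ensure Nimbus assigns topologies as quickly as possible
        conf.put(DaemonConfig.NIMBUS_MONITOR_FREQ_SECS, 1);
    }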
HeartSaVioR (Contributor) commented:
Is this change also useful for other (non-test) cases? If we're unsure about that, could we set up a global config for non-time-simulating tests and add this there instead?

srdo (Contributor, Author) replied:
I think this is still useful outside testing, since the topology will start faster, and topologies will usually be submitted right after the local Nimbus boots up. Are we intending for people to use LocalCluster for anything other than testing, though?

srdo (Contributor, Author) replied:
By "testing" I mean both automatic tests and manual tests

HeartSaVioR (Contributor) replied:
OK, makes sense. Thinking about it again, it doesn't seem like a big deal.

@@ -690,6 +700,10 @@ public synchronized Supervisor addSupervisor(Number ports, Map<String, Object> c
        }
        superConf.put(Config.STORM_LOCAL_DIR, tmpDir.getPath());
        superConf.put(DaemonConfig.SUPERVISOR_SLOTS_PORTS, portNumbers);
        if (!Time.isSimulating()) {
            //Monitor for assignment changes as often as possible, so e.g. shutdown happens as fast as possible.
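
This snippet is also truncated; presumably it continues by lowering the supervisor's poll interval, along these lines (the exact config key, DaemonConfig.SUPERVISOR_MONITOR_FREQUENCY_SECS, is an assumption, not taken from the diff):

    if (!Time.isSimulating()) {
        // Monitor for assignment changes as often as possible, so e.g. shutdown happens as fast as possible.
        superConf.put(DaemonConfig.SUPERVISOR_MONITOR_FREQUENCY_SECS, 1);
    }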
HeartSaVioR (Contributor) commented:
same here

     * @return true if leadership was acquired, false otherwise
     */
    @VisibleForTesting
    boolean awaitLeadership(long timeout, TimeUnit timeUnit) throws InterruptedException;
HeartSaVioR (Contributor) commented:
I assume we don't consider ILeaderElector a user-facing (public) API, so we don't need to worry about backward compatibility.

srdo (Contributor, Author) replied:
No, I don't think this API is supposed to be used by anyone other than Nimbus.
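
For readers following along, a hedged sketch of how the race fix can use this new method: wait for the local Nimbus to actually acquire leadership before topologies are submitted. The accessor and timeout below are illustrative only, not taken from the diff:

    // getLeaderElector() is a hypothetical accessor, shown for illustration only.
    ILeaderElector elector = nimbus.getLeaderElector();
    if (!elector.awaitLeadership(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("Local Nimbus did not become leader in time");
    }
    // Only now is it safe to submit topologies without racing leadership acquisition.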

@HeartSaVioR (Contributor) left a comment:
+1

@@ -131,8 +130,6 @@ public void nextTuple() {
            String id = UUID.randomUUID().toString();
            _pending.put(id, ft);
            _collector.emit(ft.stream, ft.values, id);
        } else {
            Utils.sleep(100);

HeartSaVioR (Contributor) commented:
Why remove this sleep? Can this cause a CPU hotspot if _serverTuples is always empty?

srdo (Contributor, Author) replied:
Storm already sleeps between nextTuple calls if nothing is emitted, based on the configured spout wait strategy. The extra sleep here doesn't do much for us, I think, and manually sleeping in nextTuple is a bad habit I'd like to discourage (it blocks other message processing for the spout, e.g. ack handling).

See: spoutWaitStrategy(reachedMaxSpoutPending, emptyStretch);
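
To make the recommendation concrete, a minimal sketch of the pattern: emit when data is available and simply return otherwise, letting the executor's configured wait strategy (e.g. topology.sleep.spout.wait.strategy.time.ms) handle backoff. The poll() helper and FeederTuple type are illustrative, not from this PR:

    @Override
    public void nextTuple() {
        FeederTuple ft = poll();          // hypothetical source of pending tuples
        if (ft != null) {
            String id = UUID.randomUUID().toString();
            _pending.put(id, ft);
            _collector.emit(ft.stream, ft.values, id);
        }
        // No Utils.sleep(100) on the empty path: when nothing is emitted, the
        // executor applies the spout wait strategy, so the spout thread stays
        // responsive to acks and fails instead of blocking here.
    }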

@asfgit merged commit da12d89 into apache:master on Mar 7, 2019