
STORM-2190: reduce contention between submission and scheduling #1764

Merged
4 commits merged into apache:master on Dec 20, 2016

Conversation

@revans2 (Contributor) commented Nov 7, 2016

No description provided.

@@ -90,7 +90,9 @@ public static void main(String[] args) throws Exception {
     if (args != null && args.length > 0) {
       conf.setNumWorkers(3);

-      StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
+      for (String name: args) {
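
The hunk is truncated above; based on the removed single-submit line, the added loop presumably submits one topology per command-line argument, along these lines (the loop body is an inference, not quoted from the PR):

    for (String name: args) {
        StormSubmitter.submitTopologyWithProgressBar(name, conf, builder.createTopology());
    }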
Member commented on this diff:
See comment on #1765. Is this change an artifact of manual testing?

@ptgoetz (Member) commented Nov 30, 2016

One question about the for loop in WordCountTopology. I'm +1 once that's answered/addressed.

Diff context from the nimbus scheduling code (Clojure):

(into {})
(.assignSlots inimbus topologies)))
(log-message "not a leader, skipping assignments")))
(locking (:sched-lock nimbus)
@jerrypeng (Contributor) commented on this diff:
Couldn't a wrong ordering of events happen? We take the lock while calculating a schedule, release it, and then take it again to upload the new schedule. For example:

T0: submit
T1: rebalance
T2: rebalance - calculate new scheduling
T3: submit - calculate new scheduling
T4: rebalance - upload new scheduling to zk
T5: submit - upload new scheduling to zk

We should end up with the schedule calculated by the rebalance, but we end up with the schedule calculated by the original submit.
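
To make the window concrete, here is a minimal sketch of the pattern being described, with hypothetical names (not the actual nimbus code): the lock is dropped between the compute and upload steps, so two callers can interleave exactly as in the T0..T5 timeline.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch (names are not from the Storm code base) of the racy
    // pattern described above: the lock is released between computing a schedule
    // and uploading it, so a submit and a rebalance can interleave as in T0..T5.
    class RacySchedulingSketch {
        private final Object schedLock = new Object();

        void mkAssignments() {
            Map<String, String> schedule;
            synchronized (schedLock) {
                schedule = computeNewSchedule();  // T2 / T3: compute under the lock
            }
            // Lock released: another caller may compute AND upload in this window.
            synchronized (schedLock) {
                uploadScheduleToZk(schedule);     // T4 / T5: may publish a stale schedule
            }
        }

        private Map<String, String> computeNewSchedule() {
            return new HashMap<>();               // stand-in for the real scheduler
        }

        private void uploadScheduleToZk(Map<String, String> schedule) {
            // stand-in for the real ZooKeeper write
        }
    }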

@revans2 (Contributor, author) replied:
@jerrypeng You are correct that this could happen. I don't really think it is that likely to happen in practice, but I'll think about it and see if we can fix it.

@jerrypeng (Contributor) commented Dec 1, 2016
Maybe it's also time to start thinking about decentralized scheduling mechanisms, since certain scheduling strategies may take a while to compute a schedule; that would require a major overhaul of nimbus, though.

@revans2 (Contributor, author) replied:

I agree, and now that nimbus is in Java we can look at doing some refactoring along those lines. If you feel we need to do it now and that this is a blocker, I can spend some time looking into how to do it better.

@revans2 (Contributor, author) commented Dec 1, 2016

Just so we don't miss the comment from @jerrypeng:

> Couldn't a wrong ordering of events happen? We take the lock while calculating a schedule, release it, and then take it again to upload the new schedule. For example:
>
> T0: submit
> T1: rebalance
> T2: rebalance - calculate new scheduling
> T3: submit - calculate new scheduling
> T4: rebalance - upload new scheduling to zk
> T5: submit - upload new scheduling to zk
>
> We should end up with the schedule calculated by the rebalance, but we end up with the schedule calculated by the original submit.

Yes, that is correct. We should do something here. He also suggested that, perhaps as part of a refactor of Nimbus, we should look at support for long-running scheduling.

In the short term I think I might make scheduling and writing to ZK atomic, but long term I think I will file a JIRA to look at better scheduling.
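
A minimal sketch of that short-term fix, under the assumption that one lock can simply be held across both steps (names are hypothetical, not the actual nimbus code):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the short-term fix: hold the scheduling lock across
    // both the computation and the ZooKeeper write, so a concurrent submit and
    // rebalance can no longer interleave their compute/upload phases.
    class AtomicSchedulingSketch {
        private final Object schedLock = new Object();

        void mkAssignments() {
            synchronized (schedLock) {
                Map<String, String> schedule = computeNewSchedule(); // may be slow
                uploadScheduleToZk(schedule);  // published under the same lock
            }
        }

        private Map<String, String> computeNewSchedule() {
            return new HashMap<>();            // stand-in for the real scheduler
        }

        private void uploadScheduleToZk(Map<String, String> schedule) {
            // stand-in for the real ZooKeeper write
        }
    }

The trade-off is that submissions now block for the full duration of scheduling, which is what motivates the longer-term JIRA for long-running scheduling.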

@revans2 (Contributor, author) commented Dec 2, 2016

@ptgoetz @jerrypeng

I made a few changes to things. I fixed the race condition and addressed the review comments, and I also put in some optimizations to StormSubmitter. We were literally calling getClusterInfo three or more times for each topology submission, and because the ultimate goal of STORM-2190 is to make submission more scalable, this helps a lot. There is still some lock contention, but it is much better than it was before.
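
As an illustration of the kind of optimization described (a sketch only; the interface below is hypothetical, not Storm's actual submitter code), the idea is to fetch the cluster summary once per submission and pass it to each check, rather than issuing a fresh getClusterInfo RPC for every check:

    // Hypothetical sketch: one getClusterInfo() call per submission, with the
    // result passed to each validation step, instead of 3+ separate RPCs.
    class SubmitterSketch {
        interface NimbusClient {
            ClusterSummary getClusterInfo(); // one RPC to nimbus
        }

        static class ClusterSummary { /* topology and supervisor summaries */ }

        void submit(NimbusClient client, String name) {
            ClusterSummary summary = client.getClusterInfo(); // fetched once
            checkTopologyNameUnused(summary, name);
            checkClusterCapacity(summary);
            // ...continue the submission using the cached summary...
        }

        private void checkTopologyNameUnused(ClusterSummary summary, String name) { }

        private void checkClusterCapacity(ClusterSummary summary) { }
    }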

If things look good here I will backport the changes to my other pull request.

@ptgoetz (Member) commented Dec 2, 2016

+1

@HeartSaVioR (Contributor) commented:
+1. Please apply the optimization to #1765 as well.

@jerrypeng (Contributor) commented:
LGTM, +1. @revans2, thanks for making the optimizations.

@revans2 (Contributor, author) commented Dec 3, 2016

@HeartSaVioR I plan on doing that.

@HeartSaVioR (Contributor) commented:
+1 again.

@asfgit merged commit 8021038 into apache:master on Dec 20, 2016