[STORM-3112] Incremental scheduling supports #2723

Closed

danny0405 wants to merge 4 commits into apache:master from danny0405:delta-s

Conversation

@danny0405 danny0405 commented Jun 18, 2018

What is this patch for

Incremental scheduling support.

As described in STORM-3093, a scheduling round currently performs a complete scan and computation over all topologies on the cluster, which becomes very heavy once the number of topologies grows into the hundreds.

This patch refactors the scheduling logic so that it only considers the topologies that actually need scheduling.

Improvements

  1. Cache the id -> StormBase mapping, which reduces the pressure on ZooKeeper.
  2. Only schedule the topologies that need it: those with dead executors or not enough running workers.
  3. Some schedulers, e.g. the IsolationScheduler, still need a full scheduling pass.
  4. Reuse the scheduling resources across scheduling rounds, i.e. nodeId -> used slots, nodeId -> used resources, nodeId -> total resources.

Since STORM-3093 already caches the storm-id -> executors mapping, a scheduling round now does the following:

  1. Scan all the active storm bases (cached) and the local storm-conf/storm-topology, then refresh the heartbeats cache; this tells us which topologies need scheduling (a small sketch of this check follows below).
  2. Compute scheduleAssignment only for the topologies that need scheduling.
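
For illustration only, here is a minimal sketch of the "needs scheduling" decision in step 1, with hypothetical names that do not match the actual Nimbus code:

```java
// Hypothetical sketch of the "needs scheduling" check; names are illustrative only.
import java.util.Map;
import java.util.Set;

class NeedsSchedulingCheck {
    /**
     * A topology needs scheduling when fewer workers are running than requested,
     * or when some of its executors have no live heartbeat (dead executors).
     */
    static boolean needsScheduling(Set<String> executors,
                                   Map<String, Long> lastHeartbeatMs,
                                   int runningWorkers,
                                   int requestedWorkers,
                                   long nowMs,
                                   long timeoutMs) {
        if (runningWorkers < requestedWorkers) {
            return true;
        }
        for (String executor : executors) {
            Long lastBeat = lastHeartbeatMs.get(executor);
            if (lastBeat == null || nowMs - lastBeat > timeoutMs) {
                return true; // missing or stale heartbeat => dead executor
            }
        }
        return false;
    }
}
```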

About Robustness When Nimbus Restarts

  1. The cached storm bases are taken care of by the ILocalAssignmentsBackend.
  2. The scheduling cache is refreshed on the first scheduling round after a restart by running a full scheduling over all topologies.

The New Scheduling Mode

[Diagram: the new scheduling workflow]

Test data

This is the time-cost data for scheduling one incremental topology, produced by LargeAmountsOfTopologiesSchedulingTest.java. I also added an ISchedulerTracer in case you want more fine-grained time-cost data during scheduling.

I need to emphasize that this data already benefits from removing the storm-base access time and the re-computation of the id -> executors mapping compared to the old scheduling mode, yet we can still see a remarkable improvement and very good performance for the new scheduling mode.

[Chart: storm_sche_cost — scheduling time cost comparison]

We can see that incremental scheduling is very lightweight and fast.

JIRA: STORM-3112.

@danny0405 (Author)

@revans2 @HeartSaVioR
Hi Bobby, HeartSaVioR, please help review this if you have time. Thanks in advance.


revans2 commented Jun 20, 2018

@danny0405 There is a lot of code here and I am trying to understand what exactly this is doing, because it is hard to get it from just the code. I can see the extra caching that you put in for the StormBase, which looks great. The JIRA says that we are only scheduling topologies that need to be scheduled, but I see that you had to change the isolation scheduler to disable that feature. What exactly are the differences that the scheduler sees?

I am also a little confused about how this, besides the caching, improves performance. All of the schedulers will loop through the list of topologies and only schedule the ones that need to be scheduled. It is a tiny check that takes a few ms at the most.


danny0405 commented Jun 21, 2018

[Diagram: old-schedule — the scheduling workflow before this patch]
@revans2
I drew a diagram to describe what the workflow was before this patch. I will add the new workflow for this patch soon.

Then the reasons why this patch improves performance:

  1. The storm-base fetching time is eliminated entirely because of the cache.
  2. We refresh the heartbeats cache first, then quickly diff it against the full id -> executors mapping to decide which topologies need to be scheduled; if none do, we skip the round quickly (as sketched below).
  3. Because of step 2, we only need to compute the required structures and resource mappings for very few topologies, which is very fast.
  4. Although we already cache the storm-topology and storm-conf for all topologies, the useless computation avoided in step 3 is still considerable, including the computation in Nimbus#mkAssignments and the repeated calculation in all kinds of schedulers.
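
As a rough illustration of step 2 (hypothetical names, not the actual implementation), the diff between the refreshed heartbeat view and the cached id -> executors mapping could look like this:

```java
// Illustrative sketch only: diff the set of alive executors against the cached
// id -> executors mapping to find the topologies that need scheduling.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IncrementalRound {
    static Set<String> topologiesToSchedule(Map<String, Set<String>> idToExecutors,
                                            Set<String> aliveExecutors) {
        Set<String> needScheduling = new HashSet<>();
        for (Map.Entry<String, Set<String>> entry : idToExecutors.entrySet()) {
            if (!aliveExecutors.containsAll(entry.getValue())) {
                needScheduling.add(entry.getKey());
            }
        }
        return needScheduling; // an empty set means the whole round can be skipped
    }
}
```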

The reason why the IsolationScheduler needs a full scheduling:

  1. It needs the details of all topologies in order to know which ones are isolated, so it can assign slots based on that information.

@HeartSaVioR (Contributor)

@danny0405
Providing some numbers would help persuade others, if they are easy to measure. If it is not easy to get the numbers, I'm OK with skipping them and comparing the difference through the diagram/code.


danny0405 commented Jun 22, 2018

@HeartSaVioR @revans2
Hi, I have updated the design of the new scheduling mode, but I'm very sorry that I do not have a large cluster to test this patch on; I tried on my laptop, but it got stuck because too many processes were opened.

Could you run a benchmark test for this patch? I would really appreciate it.

@HeartSaVioR (Contributor)

@danny0405 Unfortunately I also can't run a large cluster. I could run Storm on 3~4 VM nodes, but I think the patch will not show a benefit on such a small cluster. I'll take a look at the design difference and comment soon.

@HeartSaVioR (Contributor)

@danny0405
I'm sorry, but I feel I need to focus on the current milestone, Storm 2.0.0, and for Storm 2.0.0 this is an improvement and a non-blocker.
There are some issues/PRs still open for getting Storm 2.0.0 out, so for me those are the first things to review. Smaller issues can be handled along with that.

@danny0405 (Author)

@HeartSaVioR
Thanks a lot. BTW, I can help fix any remaining bugs for the Storm 2.0.0 release.

@HeartSaVioR (Contributor)

@danny0405
The remaining tasks are mostly reviewing existing PRs. I'll sort them out and add them to the epic issue for Storm 2.0.0.


revans2 commented Jun 25, 2018

@danny0405 I have spent some time looking at your patch. I have not found any issues with the code itself, but I easily could have missed something. My biggest problem is that I just cannot get past the backwards incompatibility imposed by NeedsFullTopologiesScheduler. I also don't want to merge in a performance improvement without any actual numbers to back it up.

To get the performance numbers, you really only want to know how long it takes to schedule. You don't actually need to run a full cluster. The simplest way to make that happen is to fake out the heartbeats for the supervisors and the workers. You could do it as a stand-alone application, but it might be nice to have a bit more control over it so you can simulate workers that don't come up or workers that crash.
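
A rough sketch of what such a stand-alone driver could look like (the fake cluster state and the timed scheduling step are placeholders, not Storm's real test utilities):

```java
// Stand-alone timing sketch: build fake heartbeat state in memory and time only the
// scheduling step. The Supplier below is a placeholder for whatever computes assignments.
import java.util.function.Supplier;

public class SchedulingTimer {
    static <T> T timed(String label, Supplier<T> scheduleOnce) {
        long start = System.nanoTime();
        T result = scheduleOnce.get();
        System.out.println(label + " took " + (System.nanoTime() - start) / 1_000_000 + " ms");
        return result;
    }

    public static void main(String[] args) {
        // In a real harness, scheduleOnce would build a Cluster from faked
        // supervisor/worker heartbeats (including workers that never come up or
        // that crash) and run the scheduler over it.
        timed("one scheduling round", () -> "fake assignments");
    }
}
```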

Once you have that working I really would like to see a breakdown of how much time is being spent computing the different parts that go into creating the new Cluster.

As for NeedsFullTopologiesScheduler, I would either like to see this switched so that schedulers opt in to getting less information, or, even better, have us cache the fully computed inputs to Cluster and just update the cache incrementally instead of leaving things out.
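
A minimal sketch of the "update the cache incrementally" idea, with hypothetical names, just to make the suggestion concrete:

```java
// Hypothetical sketch: keep the fully computed per-topology inputs to Cluster and
// recompute entries only for topologies that changed since the last round.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class TopologyInputCache<T> {
    private final Map<String, T> cache = new HashMap<>();

    /** Drop removed topologies, recompute changed or missing ones, keep the rest. */
    Map<String, T> refresh(Set<String> activeIds,
                           Set<String> changedIds,
                           Function<String, T> compute) {
        cache.keySet().retainAll(activeIds);
        for (String id : activeIds) {
            if (changedIds.contains(id) || !cache.containsKey(id)) {
                cache.put(id, compute.apply(id));
            }
        }
        return cache;
    }
}
```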


danny0405 commented Jun 30, 2018

@revans2
Hi Bobby, I have updated the test data for this patch.

As for NeedsFullTopologiesScheduler, I agree to remove it. We can either tweak the IsolationScheduler or cache the topologies data; either is okay.

But I am inclined to choose the first option, because we do not have a powerful, reliable cache for the master right now; everything we cache needs a recovery strategy, and too many caches would make things a little messy.

For the first option, I think more refactoring of the code will be needed, so I will file another JIRA to solve that problem.

For this patch, what we have is enough.

@danny0405 (Author)

@revans2 @HeartSaVioR
Could you help review this again? Thanks very much.


HeartSaVioR commented Jul 9, 2018

@danny0405
Could you address @revans2's comment, mostly backing up your proposed patch with numbers? You may want to use a well-known library like JMH to measure the numbers correctly.
If you're not familiar with JMH, this blog post may help: https://blog.codecentric.de/en/2017/10/performance-measurement-with-jmh-java-microbenchmark-harness/
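
A minimal JMH skeleton, assuming the faked cluster state is built in the @Setup method (the benchmark body here is only a placeholder for the scheduling step):

```java
// Minimal JMH sketch; the benchmarked body is a stand-in for the scheduling round.
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class SchedulingBench {
    private int[] fakeClusterState;

    @Setup
    public void setUp() {
        // Build the faked topologies/heartbeats once, outside the measured code.
        fakeClusterState = new int[100_000];
    }

    @Benchmark
    public long scheduleRound() {
        // Placeholder for the scheduling round being measured.
        long acc = 0;
        for (int v : fakeClusterState) {
            acc += v;
        }
        return acc;
    }
}
```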

I'd say showing the numbers is more persuasive than letting others dive into the code to find the value, especially since this patch is not addressing a Storm performance issue we already know about. STORM-2693 addressed a long-lived issue, so it needed less persuasion, but this doesn't look like that case.

@HeartSaVioR (Contributor)

I'd like to say this doesn't mean I don't plan to review it. Rather, other issues that are directly coupled with releases (mostly Storm 2.0.0) should be reviewed before this one, unless you show interesting numbers that make us want to include this in Storm 2.0.0.

@danny0405 (Author)

@revans2
Hi Bobby, do you have time to review this patch now? It has been quite a while and I am starting to forget the details of this patch. Thanks in advance.

@danny0405 (Author)

@srdo @revans2 @HeartSaVioR Can you please review this for me? It has been a long time since this patch was proposed; I would really appreciate it if you could review it.

@danny0405 danny0405 closed this Jun 28, 2020