
STORM-3040: Improve scheduler performance #2647

Merged · 3 commits · May 14, 2018
Conversation

@revans2 (Contributor) commented Apr 26, 2018

There are a lot of different scheduler improvements here. Mostly they involve caching, storing data in multiple ways so we can look it up quickly, and lazily sorting the nodes in a rack only when it is needed, instead of all ahead of time.
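As a rough illustration of the lazy-sorting idea (a minimal sketch with hypothetical `Node` and `LazilySortedRack` types, not the patch's actual code), a rack can defer sorting its nodes until a scheduling strategy actually asks for them:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the scheduler's node abstraction.
class Node {
    final String id;
    final double availableCpu;

    Node(String id, double availableCpu) {
        this.id = id;
        this.availableCpu = availableCpu;
    }
}

// Sorts a rack's nodes on first access and memoizes the result,
// so racks that are never considered never pay the sort cost.
class LazilySortedRack {
    private final List<Node> nodes;
    private final Comparator<Node> order;
    private List<Node> sorted; // null until first requested

    LazilySortedRack(List<Node> nodes, Comparator<Node> order) {
        this.nodes = nodes;
        this.order = order;
    }

    List<Node> sortedNodes() {
        if (sorted == null) {
            sorted = new ArrayList<>(nodes);
            sorted.sort(order);
        }
        return sorted;
    }
}
```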

I also added performance tests. They currently pass on Travis, but I would like to hear from others on whether this solution looks good or if there is a better way for us to do performance testing.
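For context, one common shape for such a test is to assert a generous wall-clock budget, so the test fails loudly on a regression but tolerates slow CI machines. This is only a sketch with a hypothetical `scheduleLargeTopology` helper, not the tests added in this PR:

```java
import static org.junit.Assert.assertTrue;

import java.util.concurrent.TimeUnit;
import org.junit.Test;

public class SchedulerPerfTest {

    @Test
    public void largeTopologySchedulesWithinBudget() {
        long start = System.nanoTime();
        scheduleLargeTopology();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // Budget is deliberately loose to avoid flaky failures on shared CI hosts.
        assertTrue("scheduling took " + elapsedMs + " ms", elapsedMs < 30_000);
    }

    private void scheduleLargeTopology() {
        // Stand-in for building a large cluster and topology and running
        // one full scheduling round against it.
    }
}
```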

@danny0405 commented:

@revans2
Overall, this is a good direction for improvement.

So this improvement mainly focuses on avoiding repetitive computation and some eagerly computed data structures, right?

What is the average performance improvement with this patch applied?

@kishorvpatil (Contributor) left a comment:

👍 LGTM

@revans2 (Contributor, Author) commented May 4, 2018

@danny0405 In my tests TestResourceAwareScheduler.testLargeTopologiesCommon went from about 7 minutes to about 7 seconds. For TestResourceAwareScheduler.testLargeTopologiesOnLargeClustersGras I don't have a before value because I killed it after an hour; the after is about 7 seconds per topology, or about a minute and a half total.

@revans2 (Contributor, Author) commented May 7, 2018

@danny0405 @kishorvpatil With some recent changes to master, my patch started to fail with some checkstyle issues. I have rebased and fixed all of them. Please take another look, specifically at the second commit, and let me know.

@revans2 (Contributor, Author) commented May 7, 2018

Oh, I forgot: I also added anti-affinity back into GRAS, which I had accidentally removed earlier.

@danny0405 commented:

@revans2
I fully approve of your improvement.

The only concern is the number of different caches we use here; Storm now has many caches, not just for scheduling. I wonder whether we could build a disk-storage backend for all of the caches the master needs. A disk cache has better fault tolerance and is much cheaper than memory, but that is another direction for improvement and has nothing to do with this patch.

BTW: thanks for your nice work.

@@ -48,6 +49,9 @@

public class Cluster implements ISchedulingState {
private static final Logger LOG = LoggerFactory.getLogger(Cluster.class);
private static final Function<String, Set<WorkerSlot>> MAKE_SET = (x) -> new HashSet<>();
private static final Function<String, Map<WorkerSlot, NormalizedResourceRequest>> MAKE_MAP = (x) -> new HashMap<>();

@danny0405 commented on the diff:

HashSet is OK for the current single-daemon scheduling; we may need to make it thread safe when we want to support parallel scheduling. Can we add a comment about thread safety here?

@revans2 (Contributor, Author) replied:

I am happy to add a comment to Cluster itself about this, as none of Cluster is currently thread safe.

As for parallel scheduling, the plan we had been considering was more around scheduling multiple topologies in parallel, rather than trying to make a single scheduler strategy multi-threaded, but both approaches have advantages and disadvantages.
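For context on why `MAKE_SET` and `MAKE_MAP` are stored as constants in the diff above: factory lambdas like these are typically passed to `Map.computeIfAbsent`, and keeping a single shared instance avoids allocating a fresh lambda object on every lookup. A minimal sketch (illustrative names, not the exact patch code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class SlotIndex {
    // Allocated once and reused for every computeIfAbsent call.
    private static final Function<String, Set<String>> MAKE_SET = k -> new HashSet<>();

    private final Map<String, Set<String>> slotsByNode = new HashMap<>();

    void addSlot(String nodeId, String slot) {
        // Creates the per-node set only the first time the node is seen.
        slotsByNode.computeIfAbsent(nodeId, MAKE_SET).add(slot);
    }
}
```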

@@ -763,6 +773,7 @@ public void setAssignments(
assertValidTopologyForModification(assignment.getTopologyId());
}
assignments.clear();
totalResourcesPerNodeCache.clear();

@danny0405 commented on the diff:

Actually, we could reuse these caches for the next scheduling round once we bring in a central master cache.

@revans2 (Contributor, Author) replied:

I tried that, but it didn't have the performance boost I was hoping for. The vast majority of the performance problem came from recomputing the value each time we wanted to sort, which for GRAS is once per executor. So without the cache, for a large topology we were recomputing things hundreds of thousands of times. With the cache, the number of recomputations is only the number of nodes in the cluster, which ends up being relatively small. In practice the noise between runs drowned out any improvement, so I opted not to make the change.
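The shape of the cache being discussed, as a minimal sketch (illustrative names, not Storm's actual API): memoize the per-node total once, reuse it across the many comparisons a sort performs, and invalidate whenever assignments change:

```java
import java.util.HashMap;
import java.util.Map;

class NodeResourceCache {
    private final Map<String, Double> totalResourcesPerNode = new HashMap<>();

    // Returns the cached total, computing it at most once per node
    // between invalidations.
    double totalFor(String nodeId) {
        return totalResourcesPerNode.computeIfAbsent(nodeId, this::recompute);
    }

    // Called whenever assignments change, mirroring the
    // totalResourcesPerNodeCache.clear() in the diff above.
    void onAssignmentsChanged() {
        totalResourcesPerNode.clear();
    }

    private double recompute(String nodeId) {
        // Stand-in for the expensive aggregation over a node's assignments.
        return 0.0;
    }
}
```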

@kishorvpatil (Contributor) left a comment:

Still 👍

@Ethanlm (Contributor) left a comment:

👍

@revans2 (Contributor, Author) commented May 8, 2018

@danny0405 I added the comments about thread safety as you suggested.

@Ethanlm (Contributor) left a comment:

+1

@danny0405 commented:

@revans2
Thanks for your work. The storm-core Travis check still fails; we should fix that.

@revans2 (Contributor, Author) commented May 9, 2018

@danny0405 the failure is a known race condition around netty and is not related to this change.

asfgit merged commit 558e9b6 into apache:master on May 14, 2018

asfgit pushed a commit that referenced this pull request on May 14, 2018:

… into STORM-3040

STORM-3040: Improve scheduler performance

This closes #2647