
STORM-2306 : Messaging subsystem redesign. #2241

Closed
wants to merge 34 commits into from

Conversation

roshannaik
Contributor

Having spent a lot of time on this, I am happy to share some good news and some even better news with you.

Before venturing further, I must add that, to limit the scope of this PR, no attempt was made to improve ACKing-mode perf. Although there are some big latency gains seen in ACK mode, these are a side effect of the new messaging design, and work remains to be done to improve ACK mode.

Please see the design docs posted on the STORM-2306 JIRA for details on what is being done.

So, first the good news .. a quick competitive evaluation:

1) Competitive perf evaluation:

Here is a quick comparison of Storm numbers taken on my laptop against numbers for similar/identical topologies published by Heron, Flink and Apex. I shall provide just a rolled-up summary here and leave the detailed analysis for later.

Storm numbers here were run on my MacBook Pro (2015) with 16 GB RAM and a single 4-core Intel i7 chip.

A) Compared To Heron and Flink:


Heron recently published this blog about the big perf improvements (~4-6x) they achieved.
https://blog.twitter.com/engineering/en_us/topics/open-source/2017/optimizing-twitter-heron.html
They ran it on dual 12-core Intel Xeon chips (didn't say how many machines).

They use a simplified word count topology that I have emulated for comparison purposes and included it as part of this PR (SimplifiedWordCountTopo).

Flink also publishes numbers for a similar setup here
https://flink.apache.org/features.html#streaming

Below are per core throughput numbers.

[:HERON:]
Acking Disabled: per core = ~475 k/sec.
Acking Enabled: per core = ~150 k/sec. Latency = 30ms

[:FLINK:]
Per core: ~1 mill/sec

[:STORM:]
Acking Disabled: per core = 2 mill/sec. (1 spout & 1 counter bolt)
Acking Enabled: per core = 0.6 mill/sec, Latency = 0.73 ms (+1 acker)

Takeaways:

B) Compared to Apex:


Apex appears to be the best performer among the open-source lot, by a reasonably good margin. Some numbers they published in their early days (when it was called DataTorrent) were misleading/dubious IMO, but the newer numbers appear credible.

Here we look at how fast inter spout/bolt communication can be achieved using an ultra-minimalist topology.
A ConstSpout emits a short string to a DevNull bolt that discards the tuples it receives. This topo has been in storm-perf for some time now.

Apex provides numbers for an identical setup ... what they call "container local" performance here:
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/

Other than the fact that the Storm numbers were taken on my laptop, these numbers are a good apples-to-apples comparison.

[:APEX:]
Container local Throughput : ~4.2 mill/sec

[:STORM:]
Worker local throughput : 8.1 mill/sec

2) Core messaging Performance

Now for the better news. The redesigned messaging system is actually much faster and able to move messages between threads at an astounding rate:

  • 120 mill/sec (batchSz=1000, 2 producers writing to 1 consumer).
  • 67 mill/sec (batchSz=1000, 1 producer writing to 1 consumer).

I have included JCQueuePerfTest.java in this PR to help get quick measurements from within the IDE.

That naturally raises the question: why is Storm pushing only 8.1 mill/sec between a ConstSpout and a DevNullBolt? The short answer is that there are big bottlenecks in other parts of the code. In this PR I have tackled some such bottlenecks, but many still remain. We are faster than the competition, but still have room to be much, much faster. If anyone is interested in pursuing these to push Storm's perf to the next level, I am happy to point them in the right direction.

Again, please refer to the design docs in the JIRA for details on the new design and the rationale behind it.

Member
@satishd satishd left a comment

Nice work, Roshan!
Had an initial look at the code and left a few comments (mostly minor).

@@ -146,7 +149,7 @@ supervisor.run.worker.as.user: false
#how long supervisor will wait to ensure that a worker process is started
supervisor.worker.start.timeout.secs: 120
#how long between heartbeats until supervisor considers that worker dead and tries to restart it
supervisor.worker.timeout.secs: 30
supervisor.worker.timeout.secs: 30000
Member

Is this really a deliberate change?

Contributor Author

no... i need to revert that.

topology.transfer.batch.size: 10
topology.executor.receive.buffer.size: 50000
topology.producer.batch.size: 1000 # TODO: Roshan: rename
topology.flush.tuple.freq.millis: 5000
Member

nit: Better to add a comment describing this property.

Contributor Author

Was planning to document it somewhere. Do we have a place to document Storm settings ?

Contributor

I'm OK if they're added to Config / DaemonConfig with javadoc on each field. Other configurations are also documented in those classes.

@@ -304,6 +303,7 @@ storm.cgroup.resources:
storm.cgroup.hierarchy.name: "storm"
storm.supervisor.cgroup.rootdir: "storm"
storm.cgroup.cgexec.cmd: "/bin/cgexec"
storm.cgroup.cgexec.cmd: "/bin/cgexec"
Member

may be an accidental copy, needs to be removed.

Contributor Author

thanks!

@@ -30,4 +31,5 @@
void ack(Tuple input);
void fail(Tuple input);
void resetTimeout(Tuple input);
void flush();
Member

May want to add some javadoc about this. It seems we are ready to break the APIs with the new set of changes in this redesign.

Contributor Author

This is not an external-facing API.

Member

It is kind of an external-facing API, as it is given to bolt implementers in the prepare API. IOutputCollector can be wrapped by users to do custom processing before tuples are emitted to target tasks.


(Random passer-by user: E.g. we do this for tests a lot, yes. On the other hand, this is for Storm 2.0, so API breakages are kinda expected. Then again, at least it should be thought through / discussed.)

Contributor Author

I don't think they implement that interface, so it may not be an issue. At most, we may need folks to recompile their topos with Storm 2.0.

Contributor

I think this is effectively a public API, but given that we ship the change in a major version only, I guess that might be fine. It would be better if we had alternatives, though.

Contributor Author

Do we support end users writing custom output collectors? This is public for internal use only, I think.
Right, there was no easy alternative.

LOG.warn("Received invalid messages for unknown tasks. Dropping... ");
private void transferLocalBatch(List<AddressedTuple> tupleBatch) {
try {
for (int i = 0; i < tupleBatch.size(); i++) {
Member

Does foreach have a significant perf issue?

Contributor Author

See 3.8 in the design doc.

Contributor

For non-random-access lists the perf will get worse, so change the signature of transferLocalBatch to accept ArrayList<AddressedTuple>.
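A minimal sketch of the suggested signature change, assuming the rest of the method body stays as in the diff above (only the parameter type and the comments are new):

private void transferLocalBatch(ArrayList<AddressedTuple> tupleBatch) {
    for (int i = 0; i < tupleBatch.size(); i++) {
        AddressedTuple tuple = tupleBatch.get(i);   // guaranteed O(1) random access on ArrayList
        // ... deliver the tuple to its local receive queue, as in the existing body
    }
}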

Contributor Author

yes indeed!

Contributor

Nice observation @roshannaik and @arunmahadevan!

}

public boolean transferLocal(AddressedTuple tuple) throws InterruptedException {
workerData.checkSerialize(serializer, tuple);
Member

I guess this should be done only when it is determined that we are sending to a local task. So this should be pushed to just before queue.publish(tuple) below.

Contributor Author

Good catch. Thanks!
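For illustration, a sketch of the suggested reordering; the queue-lookup step is an assumption borrowed from transferLocalBatch, not the actual method body:

public boolean transferLocal(AddressedTuple tuple) throws InterruptedException {
    JCQueue queue = shortExecutorReceiveQueueMap.get(tuple.dest);
    if (queue == null) {
        return false;                                  // not a local task; skip the serialization check
    }
    workerData.checkSerialize(serializer, tuple);      // only pay this cost for local delivery
    queue.publish(tuple);
    return true;
}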

private ArrayList<List<Integer>> choices;
private AtomicInteger current;
private int current = 0 ;
Member
@satishd satishd Jul 25, 2017

You changed this to be non-thread-safe. Is this instance not shared by multiple executors?

Contributor

chooseTasks is implemented based on the assumption of thread safety, so we should check this again. If it's thread-safe, it is better to leave a comment about this.

}
}

public TupleImpl(GeneralTopologyContext context, List<Object> values, int taskId, String streamId, MessageId id) {
public TupleImpl(GeneralTopologyContext context, List<Object> values, String srcComponent, int taskId, String streamId, MessageId id) {
Member

Why do we need this to be passed in the constructor when it can be derived from the existing arguments, as below? This constructor change is spread across all usages.
srcComponent = context.getComponentId(taskId)

Contributor Author

Because it's a hashmap lookup; this change was based on profiling.
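To illustrate the trade-off (the names below are illustrative, not the actual executor code): the caller looks the component id up once at setup and then reuses it on the hot path, instead of TupleImpl doing a map lookup per tuple.

// once, at executor setup: the HashMap lookup happens here only
String srcComponent = workerTopologyContext.getComponentId(taskId);

// per tuple, on the hot path: no lookup
TupleImpl tuple = new TupleImpl(workerTopologyContext, values, srcComponent, taskId, streamId, msgId);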

// Cache the msgs grouped by destination node
public void add(TaskMessage taskMsg) {
int destId = taskMsg.task();
ArrayList<TaskMessage> msgs = bundles.get(destId);
Member

We can use bundles.computeIfAbsent(destId, integer -> new ArrayList<>()), removing the null check, the list creation, and the put into the bundles map.

Contributor Author

This is in the critical path... that lambda argument will result in allocating a new (lambda) object.


Non-capturing lambdas only have a single instance in a generated private static field, so the allocation cost is only paid once.

Of course, there is some extra cost in the indirection ... then again, computeIfAbsent() is optimized in OpenJDK to only do a single hash + table lookup, whereas if (get() == null) put() does two of each if the get() fails. Overall the difference will be nonexistent / heavily dependent on the prevailing branch.
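For reference, a minimal sketch of the two idioms being discussed, assuming bundles is a HashMap<Integer, ArrayList<TaskMessage>> as in the diff above:

// Idiom 1: explicit null check -- two hash lookups (get + put) when the key is missing
ArrayList<TaskMessage> msgs = bundles.get(destId);
if (msgs == null) {
    msgs = new ArrayList<>();
    bundles.put(destId, msgs);
}
msgs.add(taskMsg);

// Idiom 2: computeIfAbsent -- one lookup; the non-capturing lambda is allocated once per class, not per call
bundles.computeIfAbsent(destId, k -> new ArrayList<>()).add(taskMsg);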

Contributor

2 cents: either is fine to me.

Contributor Author

Thanks for that sweet tip! Just to be sure, I would feel comfortable verifying with the profiler first... given this is in the critical path.

addListRefToMap(this.bundles, entry.getKey(), entry.getValue());

// Cache the msgs grouped by destination node
public void add(TaskMessage taskMsg) {
Member
@satishd satishd Jul 25, 2017

It seems add and send are always invoked on the same thread (from JCQueue.Consumer's accept and flush), and there is no contention. Is that right?

Contributor Author

right.

@avermeer
Contributor

avermeer commented Jul 25, 2017 via email

@satishd
Member

satishd commented Jul 25, 2017

@roshannaik Did you run any Trident topologies in distributed mode with these changes?
You may want to look at the Travis failures; some of them look to be related.

@roshannaik
Contributor Author

roshannaik commented Jul 25, 2017

@avermeer Looks like the SuperChief blog is relaying the same basic claims that Heron has marketed. Since you ask, I will share my opinions w.r.t. Heron's claims.

  • Heron has never been a player in the high-performance club. They have been smart about not comparing themselves with the real top performers of the day. I only included them here because they have made much noise against Storm; otherwise they are not interesting in a perf shootout.
  • They are smart about not mentioning which version of Storm they are comparing with (how does a paper with such critical info missing get accepted?). That creates an illusion that their perf claims apply to all versions of Storm in general... even if Storm publishes new perf numbers comparing itself to a prior version.
  • Heron's threading model (1 thread per process, based on what I gather from their articles) is really primitive for this application domain. I don't recommend it, but by setting 'topology.workers' equal to the number of spout & bolt instances, Storm can be run in Heron mode.
  • I find it much easier to debug a process with multiple components using a debugger rather than starting separate debuggers for every instance of a spout/bolt. Also, I would imagine having so many processes means you have an explosion of log files to deal with when triaging.
  • It is unclear why the recovery model (when a worker process crashes) is any better... the same kind of replay from the spout would be required. The gains may be minor if any. Making minor optimizations to the failure path while penalizing the normal operation path is backwards.
  • Can't get a stack from a Storm worker? That's clearly false. Try it yourself; I do it all the time. Heap dumps, on the other hand, can stall the worker, and if the heap size is really large the supervisor might decide the worker is having a problem. There are timeouts you can increase so that the supervisor waits longer. I can't imagine that Heron doesn't monitor their workers and restart them if they are not responsive.
  • Heron's backpressure model is simply too heavyweight, but is marketed as a novel idea.
  • A quick read of their latest perf blog (noted in the comparison above) made it evident that they missed recognizing their real perf problem.

@HeartSaVioR
Contributor

@roshannaik
First of all, great work! The numbers are impressive.

Before going into this deeply, I would like to see a comparison between the current master branch and this patch (say, before and after) so that we can see the impact of the patch clearly. IMO this kind of comparison is basically required for every performance patch.

@revans2 brought awesome analysis when introducing disruptor batching.
#765 (comment)
#765 (comment)

It would be great if we could see a similar analysis for this patch too, if you don't mind. You can put your detailed analysis there.

@ptgoetz
Member

ptgoetz commented Jul 25, 2017

Let's not be too hard on the Heron community. Yes, in the past they've not been exactly friendly in terms of marketing and technical claims, but now that Heron is incubating as an Apache project (full disclosure: I'm a mentor), there's a chance that that might change. A collaborative relationship with the Heron community has potential benefit, an adversarial one not so much.

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 25, 2017

@avermeer
As @roshannaik stated, we addressed many performance bottlenecks in Storm 1.0.0. The paper about Heron and the blog post on SuperChief referred to an old version of Storm, so their claims don't even hold for 1.0.0.
(Yes, I understand they couldn't pick 1.0.0 to compare against, since 1.0.0 was only released in Apr. 2016.)

I think we learnt a lot and adopted some parts, like back pressure, from Heron's paper (though we implemented it in a different way), but I personally also wonder about their claimed sources of performance gain, especially 1 computational thread per worker. Messages must be serialized even between workers on the same node, and the cost of serialization/deserialization is not that cheap.

Unlike TPC in the SQL world, I think there's no generally accepted way to benchmark streaming, so claiming that A is N times faster than B can hide too much detail. Most frameworks have their own characteristics and hence excel on a specific use case or performance test. Moreover, developers from A don't know B deeply, so most of the time they fail to set the variables optimized specifically for the test.

That's why I much prefer comparing against our own history instead of comparing with others.

@roshannaik
Contributor Author

roshannaik commented Jul 26, 2017

@HeartSaVioR the current master clocks a little over 3 mill/sec for the ConstSpoutNullBolt topo without ACKing. On the latency end, with ACK enabled I saw numbers in multiple seconds with batchSz=1 for master, and as low as ~30 microseconds with the new patch.

More analysis? I thought I might have spent too much time providing measurements of various kinds, both in the design docs and here. :-)

Actually, I have only presented a few pruned-down numbers that I felt were interesting, and for those I was fairly meticulous about recording the numbers. Given the scope of this patch, meticulously recording and cataloging all the perf tests that I have done (both high-level profiler data & low-level hardware perf counters), and performing the many more tests that I could have done, would take forever. I felt the ROI was negative to rigorously pursue some of those measurements any further.

If you feel some other critical measurements are necessary, bring it up.

Batching is a small aspect of this patch. I had done some quick tests to determine a default that appears to be a modest sweet spot for high-throughput, low-latency topos, but not enough to provide a rigorous analysis. As batch size is very dependent on topology components, deployment size and ack mode, IMO the search for a more generally optimal batch size is not important right now.

I think we will have more publish worthy analysis once these changes stabilize and get more testing.

On a related note, about latency and ACK mode: during my recent measurements I noticed an issue in ACK mode, in both the current master and the optimized version, where perf degrades with time (for certain batch sizes)... something I had not noticed on the older version of master on which I had based my changes for a long time. I will share these observations soon.
This needs to be looked into before we have an official release.

Contributor
@arunmahadevan arunmahadevan left a comment

Did a quick pass and left a few comments. Will go through the design doc and do an in-depth review later.

topology.disruptor.wait.timeout.millis: 1000 # TODO: Roshan: not used, but we may/not want this behavior
topology.transfer.buffer.size: 50000
topology.transfer.batch.size: 10
topology.executor.receive.buffer.size: 50000
Contributor

Seems you are reusing this config as the JCQueue size? It's good to introduce a different config, since users might have overridden this value and it may not be an apt value for the JCQueue size.

Contributor Author

The name is still good and means the same thing. Some other defaults are also likely to change in Storm 2.0... I didn't see a good reason to rename it. I have renamed any setting with the name 'disruptor' in it.

Contributor

Agreed that the name looks good. If the value users might have overridden matters, let's document a new starting point for tuning in the release announcement doc or another doc.

Contributor Author

FYI, I do have notes in the javadoc comments as of now. Agreed that we need to call out such settings, for which older/existing values may need review, along with the release announcement doc.

@@ -257,7 +258,7 @@
<!--Note - the version would be inherited-->
<configuration>
<excludes>**/generated/**</excludes>
<maxAllowedViolations>10785</maxAllowedViolations>
<maxAllowedViolations>50785</maxAllowedViolations>
Contributor

There's no point in having a standard if we keep violating it. Please fix the checkstyle issues with the new code.

Contributor Author

The intent was to get a PR out quickly for others to begin reviewing. These fixes can wait for the PR updates; they will surely get fixed.

Contributor

Kind reminder, given that we are getting close.

Contributor Author

yes. fixed internally. will be part of next update.

@@ -137,7 +137,7 @@ public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<I
public static class FieldsGrouper implements CustomStreamGrouping {

private Fields outFields;
private List<Integer> targetTasks;
private ArrayList<List<Integer> > targetTasks;
Contributor

Unless you are using some specific method of the implementation, it's always better to write code against the interfaces.

try {
for (int i = 0; i < tupleBatch.size(); i++) {
AddressedTuple tuple = tupleBatch.get(i);
JCQueue queue = shortExecutorReceiveQueueMap.get(tuple.dest);
Contributor

Since you are optimizing the critical path, you might want to use an ArrayList or array for index-based lookups rather than a Map.
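A sketch of the suggested lookup-table approach, assuming task ids are small non-negative integers (maxTaskId is a hypothetical bound); the array would have to be rebuilt whenever the map is refreshed:

// built (and rebuilt) whenever shortExecutorReceiveQueueMap changes:
JCQueue[] receiveQueueByTask = new JCQueue[maxTaskId + 1];
for (Map.Entry<Integer, JCQueue> e : shortExecutorReceiveQueueMap.entrySet()) {
    receiveQueueByTask[e.getKey()] = e.getValue();
}

// hot path: plain array index instead of a hash lookup
JCQueue queue = receiveQueueByTask[tuple.dest];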

Contributor Author

Yes, it did show up as a bottleneck during profiling. I recall trying to replace it with an ArrayList... and then backing out as it started getting a bit complicated... I think it had something to do with this map getting periodically refreshed. Let me take a look at it again.

LOG.warn("Received invalid messages for unknown tasks. Dropping... ");
private void transferLocalBatch(List<AddressedTuple> tupleBatch) {
try {
for (int i = 0; i < tupleBatch.size(); i++) {
Contributor

For non-random-access lists the perf will get worse, so change the signature of transferLocalBatch to accept ArrayList<AddressedTuple>.

sendSize, waitTimeOutMs, batchSize, batchTimeOutMs);
// Called by flush-tuple-timer thread
public boolean publishFlushTuple() {
TupleImpl tuple = new TupleImpl(workerTopologyContext, new Values(), Constants.SYSTEM_COMPONENT_ID,
Contributor

Should this be constructed newly for each call? If this is a fixed value you can make it a static final value.

Contributor Author

Sure... although this allocation doesn't matter much for perf, as it's not in the critical path. It's invoked once in a while by the timer.

public boolean publishFlushTuple() {
TupleImpl tuple = new TupleImpl(workerTopologyContext, new Values(), Constants.SYSTEM_COMPONENT_ID,
(int) Constants.SYSTEM_TASK_ID, Constants.SYSTEM_FLUSH_STREAM_ID);
AddressedTuple flushTuple = new AddressedTuple(AddressedTuple.BROADCAST_DEST, tuple);
Contributor

Should this be constructed newly for each call? If this is a fixed value you can make it a static final value.

Contributor Author

Right... I don't think it is necessary.

(int) Constants.SYSTEM_TASK_ID, Constants.SYSTEM_FLUSH_STREAM_ID);
AddressedTuple flushTuple = new AddressedTuple(AddressedTuple.BROADCAST_DEST, tuple);
if( receiveQueue.tryPublish(flushTuple) ) {
LOG.debug("Published Flush tuple to: {} ", getComponentId());
Contributor

nit: getComponentId will get evaluated irrespective of the log level

Contributor Author

That's fine... it's not doing any lookups, nor is it in the critical path.

return true;
}
else {
LOG.debug("RecvQ is currently full, will retry later. Unable to publish Flush tuple to : ", getComponentId());
Contributor

nit: getComponentId will get evaluated irrespective of the log level

Contributor

{} is missing in log message.
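The corrected call would look something like this (sketch only), so that the component id actually appears in the message:

LOG.debug("RecvQ is currently full, will retry later. Unable to publish Flush tuple to: {}", getComponentId());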

Contributor Author

will fix

@HeartSaVioR
Contributor

@roshannaik
I didn't mean to put additional load on you. I just would like to see a comparison against Storm itself with the variables controlled, not a competition with papers, to see the impact of this patch clearly.
I'll try to run performance tests (maybe ThroughputVsLatency or something in the perf module) and share the results.

I'll go through the code first.

public void add(Object obj) throws InterruptedException {
boolean inserted = q.publishInternal(obj);
while(!inserted) {
Thread.yield();
Member

It may be better to busy-spin a few times and then yield.
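A minimal sketch of the spin-then-yield idea, assuming the add() method shown in the diff above; the spin limit is an illustrative constant, not a tuned value:

public void add(Object obj) throws InterruptedException {
    final int spinLimit = 100;                 // illustrative; tune based on expected contention
    int attempts = 0;
    boolean inserted = q.publishInternal(obj);
    while (!inserted) {
        if (attempts++ >= spinLimit) {
            Thread.yield();                    // back off only after busy-spinning stops paying off
        }
        inserted = q.publishInternal(obj);
    }
}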

Contributor Author

some other changes coming to that code soon.

Contributor
@HeartSaVioR HeartSaVioR left a comment

I am done with the first pass.
Lots of style violations observed, but I guess we can address them after having some +1s on merging.
I didn't look at the classes regarding the JCQueue implementation since I don't have any knowledge of JCQueue. I'll try to look at it later, but feel OK skipping it given that there are at least two other reviewers reviewing this thoughtfully.

@@ -49,6 +49,9 @@ storm.nimbus.retry.times: 5
storm.nimbus.retry.interval.millis: 2000
storm.nimbus.retry.intervalceiling.millis: 60000
storm.auth.simple-white-list.users: []
storm.auth.simple-acl.users: []
Contributor

The following three lines seem not relevant to this patch.

Contributor Author

Thanks... I think that got introduced somehow during manual reconciling/rebasing to the latest master.

@@ -231,16 +228,13 @@ topology.multilang.serializer: "org.apache.storm.multilang.JsonSerializer"
topology.shellbolt.max.pending: 100
topology.skip.missing.kryo.registrations: false
topology.max.task.parallelism: null
topology.max.spout.pending: null
topology.max.spout.pending: null # TODO: We dont need this any more
Contributor

Looks like you're removing max spout pending and also the configuration for backpressure. Does this patch also touch the backpressure mechanism?

And let's either address this in the patch or file an issue and remove the TODO.

Contributor Author

Yes, back-pressure is addressed; it is described in the design doc. A couple of little things are left related to BP: 1) ensure metrics get updated when retrying to publish to a downstream queue that is full, and 2) remove max.spout.pending.

Contributor

I guess it utilizes a blocking queue, which looks natural for the definition of back pressure. If so, we need to be sure there's no livelock or deadlock. AFAIK, Storm used a blocking queue before and ended up adding an overflow buffer because of deadlock (dining philosophers).

Contributor Author

The critical code path is now lock-free, so there should be no room for live/deadlocks. To be precise, it's not exactly a blocking queue, as there is a retry loop that we can break out of at any time. There are some changes coming soon in this PR w.r.t. what happens when the queue is full; I will also add that info to the design doc.

topology.state.synchronization.timeout.secs: 60
topology.stats.sample.rate: 0.05
topology.stats.sample.rate: 0.001
Contributor

Is this change intended?
It is OK to change it while doing performance tests (especially micro perf), but 0.1% might be too small for production. Users have even complained that only 5% of tuples are measured.
Other frameworks using Dropwizard metrics don't even sample for meters, and I guess we will follow that rule for the new metrics.

Contributor Author

It was definitely impacting perf. There may be some tweaking needed in the sampler code (not sure). Also, sampling 5 out of every 100 seemed a bit excessive even for topos doing several thousand per second. Does 1 out of every 100 seem more reasonable?

Contributor

As I said, some frameworks don't even sample and measure everything for meters. JStorm is one example, and they (Alibaba) said "every day the JStorm cluster handles 1.5 PB / 2 trillion messages" in 2015, which is about 23,000,000 messages per second. Yes, we don't know how many servers they use, but they also publish their performance on the website http://jstorm.io:8080/Performance/ and metrics don't seem to be a considerable bottleneck.

One thing to consider is that the current metrics implementation may be much slower than the Dropwizard one, so we may want to adjust the rate if we believe it is really a bottleneck.

Btw, in most cases users would want operational convenience over performance. We should consider first what users are facing day by day.

Contributor Author

Sure, I agree... that's why I ask. Does 1 out of every 100 seem reasonable? Or do you feel we should continue to sample 5 out of 100? This change is based on the impact observed during perf runs... but if it negatively impacts operational convenience then it's worth reverting.

Contributor

Yes, I don't feel we need to change it if it's only for performance. We already know the sampling rate affects performance, and we tolerate it.
Btw, as I commented, I think we have touched too many variables unrelated to the messaging subsystem and JCQueue. How can you prove that introducing JCQueue brings better throughput and/or lower latency in this patch? How about the messaging subsystem? Let's not touch those variables and address them in a separate issue or patch.

Contributor Author

OK, I personally feel that sampling can be reduced a bit, but if you have a preference towards retaining it... I am OK with that. JCQueue is the core of the messaging system, not separate from it; the numbers demonstrate the improvement.

topology.disruptor.batch.size: 100
topology.disruptor.batch.timeout.millis: 1
topology.disable.loadaware.messaging: false
topology.bolts.outgoing.overflow.buffer.enable: false # TODO: Roshan : Whats this ?
Contributor

In 0.9.x we didn't have an overflow buffer for bolts, and that could cause deadlock (dining philosophers problem), so an overflow buffer was introduced; but that opens the chance of OOM, so it is set to false by default.

After introducing disruptor batching and back pressure, we already made the queue unbounded, so we don't need the configuration any more.

Contributor Author

ok sounds like we can remove this in the PR.

topology.disruptor.wait.timeout.millis: 1000 # TODO: Roshan: not used, but we may/not want this behavior
topology.transfer.buffer.size: 50000
topology.transfer.batch.size: 10
topology.executor.receive.buffer.size: 50000
Contributor

Agreed that the name looks good. If the value users might have overridden matters, let's document a new starting point for tuning in the release announcement doc or another doc.

if (s==null) // then stop running it
break;
if (s>0)
Thread.sleep(s); // TODO: Roshan : need to do something about sleep strategy
Contributor

Use Time.sleepSecs() to be compatible with time simulation. And please check the unit: it was seconds, but the unit is effectively changed to milliseconds in this change.
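A sketch of the suggestion, assuming the interval s is in seconds and the surrounding loop looks like the diff above; Time here refers to org.apache.storm.utils.Time, whose sleeps respect simulated time in tests:

if (s == null)            // then stop running it
    break;
if (s > 0)
    Time.sleepSecs(s);    // simulation-aware sleep, keeping the original unit of seconds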

@@ -291,7 +291,7 @@ private static TupleWindow makeTupleWindow(ArrayList<Tuple>... streams) {
MockContext mockContext = new MockContext(fieldNames);

for (Object[] record : data) {
TupleImpl rec = new TupleImpl(mockContext, Arrays.asList(record), 0, streamName);
TupleImpl rec = new TupleImpl(mockContext, Arrays.asList(record), "testSrc", 0, streamName);
Contributor

Let's pass the component id as a parameter of makeStream if possible.

@@ -313,7 +313,7 @@ private static TupleWindow makeTupleWindow(ArrayList<Tuple>... streams) {

ArrayList<Object> tupleValues = new ArrayList<>(1);
tupleValues.add(recordMap);
TupleImpl tuple = new TupleImpl(mockContext, tupleValues, 0, streamName);
TupleImpl tuple = new TupleImpl(mockContext, tupleValues, "testSrc", 0, streamName);
Contributor

Let's pass the component id as a parameter of makeNestedEventsStream if possible.

@@ -78,7 +78,7 @@ public Fields getComponentOutputFields(String componentId, String streamId) {
}

private Tuple getTuple(String streamId, final Fields fields, Values values) {
return new TupleImpl(getContext(fields), values, 1, streamId) {
return new TupleImpl(getContext(fields), values, "testSrc", 1, streamId) {
Contributor

Let's pass the component id as a parameter of getTuple if possible.

@@ -712,6 +712,6 @@ public static Tuple testTuple(List<Object> values, MkTupleParam param) {
new HashMap<>(),
new HashMap<>(),
new AtomicBoolean(false));
return new TupleImpl(context, values, 1, stream);
return new TupleImpl(context, values, "testSrc", 1, stream);
Contributor

Let's use component instead of "testSrc".

@HeartSaVioR
Contributor

Btw, we should be careful if we drop something we already provide. This patch seems focused on performance optimization (especially micro), and for some spots it might not always be better. One example is disabling load-aware messaging by default.

@revans2
Contributor

revans2 commented Jul 26, 2017

-1

Perhaps I am running into some odd issues here, so if I can be corrected I would be happy to change my vote, but nothing I have run with this patch is better in any way. Are all of the results from micro-benchmarks? Did anyone run a real topology with this patch before posting all of these wonderful results to Twitter? I am not one to swear, but WTF?

I built a stock 2.0.0-SNAPSHOT build (450ed63) and compared it to the exact same release with this patch merged on top of it (which was a clean merge). I am running

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Sierra 10.12.6
MacBook Pro (Retina, 15-inch, Mid 2015)
2.8 GHz Intel Core i7
16 GB 1600 MHz DDR3

I ran the ThroughputVsLatency topology with several different options and no changes at all to the default storm.yaml.

With this patch I found that:

  1. Running a topology with more than one worker appears not to be able to send any messages between workers (or it takes so long that most of the messages time out). So I switched all of my tests to a single worker.
  2. When processing a nearly idle topology (500 sentences/second) the CPU utilization was everything that my box could give it (8 cores fully utilized), compared to about half of one core used by stock Storm.
  3. The latency is absolutely horrible. The minimum latency for a somewhat idle topology was 1 to 4 seconds to do a word count. For a topology processing 10,000 sentences per second it dropped to 800 ms. The maximum latency was 15 seconds for all of these cases. Compare that to stock Storm, which has a min latency of around 3 to 4 ms for the normal case.
  4. The system bolt metrics do not work, or at least I was not able to get any of them back. I tried to compare memory and CPU usage through top, which worked out OK.
  5. Memory usage is insane. The resident memory was 2 GB for almost all of the workers no matter the throughput. That is 1 GB more than stock Storm for the same 10,000 sentences per second.
  6. The maximum throughput is about 1/4th of what it is on stock Storm. I was able to do 100,000 sentences per second on my laptop, both with a single worker and with 4 workers (although the CPU usage was higher in the latter case). With this patch I was able to do 30,000 sentences per second in the best case, but on average it could only do about 25,000.

I am happy to make all of my numbers public. I also plan to run them on a Linux box to see if it is any different.

Contributor
@revans2 revans2 left a comment

Not done with the code review, but until my other concerns are addressed I don't see much reason to do more.

topology.flush.tuple.freq.millis: 5000
topology.spout.recvq.skips: 3 # Check recvQ once every N invocations of Spout's nextTuple() [when ACKs disabled]

topology.disable.loadaware.messaging: true # load aware messaging reduces throughput by ~20%.
Contributor

I agree that turning it off by default may not be what we want. It really helps on heterogeneous clusters. And quite honestly, a lot of the benchmarks that we have been running are nowhere near what I have seen happen in production. It gives you a nice feeling to say we can do X million tuples per second, but these are for bolts that do close to nothing. Most real bolts I have seen actually take some time to process.

@@ -76,6 +76,8 @@ public static StormTopology getTopology(Map<String, Object> conf) {
public static void main(String[] args) throws Exception {
int runTime = -1;
Config topoConf = new Config();
topoConf.put(Config.TOPOLOGY_SPOUT_RECVQ_SKIPS, 8);
Contributor

If we are setting this for all of the perf tests, then why is it not the default? Having a benchmark that does not show what people will get out of the box feels a bit fishy to me, even though everyone does it.

Contributor Author

Higher numbers for that setting are only useful for very low-latency spouts. The default assumes that you are likely to be using Kafka/HDFS-style spouts, which have higher latencies (amount of time spent inside nextTuple()).

*/
@isPowerOf2
public static final String TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE="topology.executor.receive.buffer.size";
Contributor

If this does not need to be a power of 2, that is fine, but we still want an annotation to say what type it is. I think @isPositiveNumber is probably good enough. You might want to make it @isInteger too if you want to limit it to 32 bits.
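Applied to the field above, the suggestion would look roughly like this (assuming Storm's config-validation annotations, as used on other entries in Config):

@isPositiveNumber
@isInteger
public static final String TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE = "topology.executor.receive.buffer.size";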

Contributor Author

indeed! my bad.

@@ -97,6 +97,8 @@ public void run() {
// events.
Time.sleep(1000);
}
if(Thread.interrupted())
Contributor

@HeartSaVioR InterruptedException in most places in storm is only used to shut the process down cleanly. This looks fine to me.

for (LoadAwareCustomStreamGrouping grouper : streamComponentToGrouper.get(stream).values()) {
ArrayList<Integer> outTasks = new ArrayList<>();

// TODO: PERF: expensive hashtable lookup in critical path
Contributor

So, as with all TODOs in the code, we either need to fix the TODO, remove the TODO because we are not going to do it, or remove the TODO and file a follow-on JIRA to do it later.

new WorkerState(conf, context, topologyId, assignmentId, port, workerId, topologyConf, stateStorage,
return loadWorker(topologyConf, stateStorage, stormClusterState, initCreds, initialCredentials);
}
}); // Subject.doAs(...)
Contributor

nit: If we are changing this, can we just make it a lambda instead?

Subject.doAs(subject, () -> loadWorker(topologyConf, stateStorage, stormClusterState, initCreds, initialCredentials));

Contributor

+1 on the nit, because it is simpler and not in the critical path.


private void setupFlushTupleTimer(final List<IRunningExecutor> executors) {
// StormTimer timerTask = workerState.getUserTimer();
Contributor

+1

public void run() {
for (int i = 0; i < executors.size(); i++) {
IRunningExecutor exec = executors.get(i);
if(exec.getExecutorId().get(0) != -1) // dont send to system bolt
Contributor

+1

return transferQueue;
}

public StormTimer getUserTimer() {
return userTimer;
}

final DisruptorQueue transferQueue;
final JCQueue transferQueue; //TODO: Roshan: transferQueue also needs to avoid double batching
Contributor

Is this done? Looks kind of important.

@revans2
Contributor

revans2 commented Jul 26, 2017

I ran again with this exact version (5c0db92) and got the same results.

@revans2
Contributor

revans2 commented Jul 26, 2017

I tried on Linux too and got very similar results. The CPU and memory usage of the topology was lower but the actual throughput and latency of the topology was very similar.

@harshach
Contributor

@revans2 Do you mind posting your storm.yaml, or are you running with defaults? We will try to see if we can reproduce this same behavior on our side. If there are any bugs we will work to fix them, but it shows great potential on the perf improvements.
Regarding posting to Twitter: yes, we are very excited about the patch and definitely want to share the results with the community. Not sure why you are getting upset about it. It's important that we make these perf improvements and also let the community know that there are continuous improvements in Storm. If you found a bug, that's great; that's why we have the PR and review process in place.

@revans2
Contributor

revans2 commented Jul 26, 2017

@harshach I am running with defaults in all cases

I build: mvn clean install -DskipTests
Package: cd storm-dist/binary; mvn clean package
Untar the result: tar -xzvf ./final-package/target/apache-storm-2.0.0-SNAPSHOT.tar.gz
Move the directory so I can save it if I want to switch back and forth: mv apache-storm-2.0.0-SNAPSHOT under-test
Then bring up the cluster:

cd under-test;
./bin/storm dev-zookeeper &
./bin/storm nimbus &
./bin/storm supervisor &
./bin/storm ui &
./bin/storm logviewer &

Wait for everything to come up, and I can see the ui.

Then I run some tests (I pull in the storm starter from the build because the packaged one does not build on its own).

./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency -c topology.workers=1
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency 10000 -c topology.workers=1
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency 10000 -c topology.workers=1 -c topology.max.spout.pending=500
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency 10000 -c topology.workers=1 -c topology.max.spout.pending=1000

@HeartSaVioR
Contributor

First of all, I think I should be more careful about likes/retweets. Sorry about that; I have just reverted them all.

Regarding tweets, I think it heavily depends on which accounts (who) tweet it.

If users or contributors are tweeting about this PR, we can take it as gossip and there is no problem with it.

If committers or PMC members are doing it, that could be seen as a kind of publicizing, especially since this PR compares with other frameworks and claims Storm will be the no. 1 performer. If PMC members tweet about unsafe or not-yet-stable source, and it turns out otherwise, someone could feel we (the Storm community) are hyping, regardless of intention.
If we say this PR could increase throughput by N times from before, that would be less problematic. I don't say there is no problem, but much less of one.

If we are doing it with the official account (@apachestorm), that really matters, because the account could be seen as speaking on behalf of the Storm community, or at least the PMC members. I was the one wondering why the official account retweeted a non-Storm-related tweet like https://twitter.com/BoredElonMusk/status/889935279213223936 .
If we haven't talked about which things the official account should (not) tweet, we may need to find time to talk about it.

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 26, 2017

What I can suggest now is breaking STORM-2306 down into "redesign messaging subsystem" and "switch to JCTools queues", and trying out the latter part first.
This patch mixes everything up (micro-optimization, switching the queue, messaging architecture change) in one, so it is hard to point out where an issue starts. Let's just have a new pull request that only replaces Disruptor with JCQueue and see if it helps. I'd rather lean toward putting in minimal effort and making it better incrementally.
And let's separate the micro-optimizations (converting map usage to array list, avoiding iterators, etc.) into another issue and not address them in other issues. I think we should see better throughput even without them if the newer subsystem is better and/or JCTools queues are better than Disruptor in our case.

@harshach
Contributor

@HeartSaVioR let's keep this discussion to reviews. This is not a forum to discuss what one should or should not tweet; that's up to individuals. Nobody is trying to promote something that's not feasible; let's not try to be a moral authority here and suggest what one can or cannot do.
Regarding breaking this into multiple PRs addressing different subsystems, that's a reasonable ask. But let's wait before we go down that path; we need to look into the issues raised here and reproduce the case.
I am running a few tests myself and I'll report my findings.

@HeartSaVioR
Contributor

@harshach
As I commented multiple times, this patch is a mix of replacing the queue, changing the messaging subsystem, and micro-optimization. Three committers already had a look in the first pass (excluding @revans2, since he stopped reviewing), but we couldn't identify any critical problems until @revans2 shared his test results. If we go on with this patch, we won't even know which part of the change contributes to the performance gain and by how much.

If we had a PR only replacing the queue with JCTools, it would be easy to review and identify the benefit, or the pros/cons of replacing Disruptor with JCTools. The patch would be fairly simple to review and verify, and most of the issues raised here shouldn't come up there (because I guess most of them are not from the queue replacement).

@harshach
Contributor

@HeartSaVioR I don't mind breaking this into multiple PRs if it helps reviewing and merging. It's up to @roshannaik.

@harshach
Contributor

harshach commented Jul 27, 2017

@revans2 @HeartSaVioR
Here are my findings https://docs.google.com/spreadsheets/d/1wPpC3YXp-vTIelRTUVoLxuxIYxUekiIHUpZ2ZEysC4Y/edit#gid=1644511...

  1. Looking at ThroughputvsLatency I found some issues:

I did the following changes:

  • Disable the HTTP and Logging bolts
  • Disable throttling spout, we want spout to run as fast as it can
  • reduced the executor counts

If you look at lines 51-75:

Apache master clearly couldn't handle the faster spout and starts timing out; perf degrades considerably and very quickly. Whereas STORM-2306 was not only able to handle the faster spout, it delivered stable processing that started out 10x faster and improved to 35x faster compared to master.

  2. Also ran the storm-perf topologies ConstSpoutIdNullBoltIdTopo and ConstSpoutNullBoltTopo. These topologies try to measure the message throughput and latency when there are only 2 components involved, without any external dependencies. Essentially, they test the messaging system.

From lines 3-45 you can see that with this patch we are getting under 10 ms (depending on the topology) compared to an avg of 250 ms+ (with batchSize=1000).

  3. Also ran the storm-examples ThroughputVsLatency with 2 workers. Here there is clearly a bug which is preventing inter-worker communication, so I don't have comparative numbers.

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 27, 2017

@harshach
For ThroughputvsLatency, throttling spout is intended. We set desired throughput and see histogram of latency and other metrics. (CPU, GC, etc.) There're 3 components in topology which parallelism would be set to 4 * worker count so total 12 executor threads per worker. I think we can parameterize the magic number 4 and adjust it while testing too.

I have also done with some performance tests, without modifying TVL topology. The reason is that we should also care about non-performance-maximized topology. For benchmarking performance maximized topology we also have ConstSpoutIdBoltNullBoltTopo, so let's not modify TVL and verify this patch works with all the cases.

Since this patch doesn't seem to handle inter-worker communication properly, the set of tests we can run for now is very limited.

Here's my machine spec used for performance test:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
OS: Ubuntu 17.04
CPU: AMD Ryzen 5 1600 3.2Ghz 6 core (with hyper-thread = 12 logical cores)
RAM: Samsung DDR4 32G 19200
SSD: Samsung 850 Evo

and here's my number (just pasted as raw number):
https://docs.google.com/spreadsheets/d/1J3S4R68CsazlINF60rQn4XCy2Hx5QNzqVoWSO9qo2tc/edit?usp=sharing

My observation is that this patch looks impressive with a performance-maximized topology, but it also looks really bad (not acceptable) with a relatively idle topology.
I've observed all the things @revans2 observed with the TVL tests. One example: with the default parameters of TVL (or a rate of 10000, I don't remember), a worker for this patch showed over 300% (around 340%) CPU usage, whereas a worker for stock Storm showed less than 70%.
But surprisingly this patch looks stable with ConstSpoutIdBoltNullBoltTopo and even CPU usage seems lower than stock in this test.

While we often publicize micro-benchmark results, in practice users run much more idle topologies.
I'm OK if things can be stabilized by adjusting parameters (in that case I think the default values should be set to support such cases), but if not, this should be addressed before accepting the patch. I would be -1 until the TVL results at various rates are stable.

@harshach
Contributor

harshach commented Jul 27, 2017

@HeartSaVioR It's not 12 executors per worker. If you don't pass a command-line argument, the parallelism variable is set to 4 here https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/ThroughputVsLatency.java#L277
and multiplied by 4 again here https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/ThroughputVsLatency.java#L359 , so each component ends up with a parallelism of 16. With 3 components that's 48, plus 2 metrics consumers and 1 acker = 51.
This has nothing to do with how many workers you have.
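
For reference, the arithmetic above spelled out as a sketch (illustrative only; the variable names are not taken from ThroughputVsLatency.java):

int parallelism = 4;                          // default when no command-line argument is passed
int perComponent = 4 * parallelism;           // multiplied by 4 again when building the topology -> 16
int componentExecutors = 3 * perComponent;    // 3 components -> 48
int totalExecutors = componentExecutors + 2 /* metrics consumers */ + 1 /* acker */;   // = 51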

I still need to look into the TVL topology in more detail; before doing so I want to write a real-world topology that reads from Kafka, writes to another Kafka topic, and test that out. The issue here is not with an idle topology but with a topology that throttles the input rate and is not idle. We have two metrics consumers with parallelism set to 1. How confident are we that the metrics consumers are not causing the issue? I'll look into these in more detail and update my findings here.

@HeartSaVioR
Contributor

@harshach The second argument effectively represents the worker count: you can see that the topology sets the worker count to parallelism. I agree that the name is really misleading; even I ran my tests with topology.workers instead of passing the second argument. (Need to run the tests again...)

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 27, 2017

Let me share a quick test result from passing 10000 1 as the TVL parameters:

STORM-2306

uptime:   30 acked:   144,000 acked/sec:   4,800.00 failed:        0 99%:   3,070,230,527 99.9%:   3,221,225,471 min:      63,897,600 max:   3,380,609,023 mean: 1,299,365,069.36 stddev:  685,287,508.16 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:   60 acked:   303,000 acked/sec:  10,100.00 failed:        0 99%:   3,011,510,271 99.9%:   3,200,253,951 min:      28,540,928 max:   3,303,014,399 mean: 1,283,728,691.41 stddev:  671,791,145.42 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:   90 acked:   297,000 acked/sec:   9,900.00 failed:        0 99%:   3,047,161,855 99.9%:   3,307,208,703 min:      62,980,096 max:   3,737,124,863 mean: 1,283,141,447.64 stddev:  675,126,086.16 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  120 acked:   303,000 acked/sec:  10,100.00 failed:        0 99%:   3,047,161,855 99.9%:   3,206,545,407 min:      31,965,184 max:   3,347,054,591 mean: 1,284,140,763.79 stddev:  690,625,730.54 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  150 acked:   299,000 acked/sec:   9,966.67 failed:        0 99%:   3,072,327,679 99.9%:   3,231,711,231 min:      16,703,488 max:   3,414,163,455 mean: 1,320,620,493.23 stddev:  693,327,734.87 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  180 acked:   300,000 acked/sec:  10,000.00 failed:        0 99%:   3,042,967,551 99.9%:   3,248,488,447 min:      48,005,120 max:   3,846,176,767 mean: 1,313,068,274.86 stddev:  671,810,427.83 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  210 acked:   301,000 acked/sec:  10,033.33 failed:        0 99%:   3,061,841,919 99.9%:   3,363,831,807 min:      51,347,456 max:   3,802,136,575 mean: 1,297,807,219.57 stddev:  678,980,965.35 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  240 acked:   301,000 acked/sec:  10,033.33 failed:        0 99%:   3,019,898,879 99.9%:   3,208,642,559 min:      36,962,304 max:   3,363,831,807 mean: 1,315,037,518.24 stddev:  676,620,121.79 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  270 acked:   297,000 acked/sec:   9,900.00 failed:        0 99%:   3,026,190,335 99.9%:   3,200,253,951 min:      52,363,264 max:   3,349,151,743 mean: 1,308,161,023.51 stddev:  680,121,348.29 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  300 acked:   300,000 acked/sec:  10,000.00 failed:        0 99%:   3,021,996,031 99.9%:   3,200,253,951 min:      49,348,608 max:   3,317,694,463 mean: 1,335,928,012.31 stddev:  667,642,145.32 user:          0 sys:          0 gc:          0 mem:       0.00

CPU usage was around 150-250%, mostly around 160%, which seemed a bit more stable, but it was still fluctuating at the small rate of 10000.

current

uptime:   30 acked:   140,440 acked/sec:   4,681.33 failed:        0 99%:      14,016,511 99.9%:      26,558,463 min:       2,449,408 max:      52,035,583 mean:    7,800,556.68 stddev:    1,790,982.79 user:     28,620 sys:      2,340 gc:          0 mem:      49.94
uptime:   60 acked:   301,860 acked/sec:  10,062.00 failed:        0 99%:      11,141,119 99.9%:      15,351,807 min:       3,233,792 max:      26,181,631 mean:    7,479,081.72 stddev:    1,175,253.40 user:     30,270 sys:      6,800 gc:        190 mem:      54.88
uptime:   90 acked:   301,600 acked/sec:  10,053.33 failed:        0 99%:      10,813,439 99.9%:      13,197,311 min:       3,246,080 max:      16,138,239 mean:    7,375,841.06 stddev:    1,112,541.35 user:     31,660 sys:      7,160 gc:        194 mem:      54.68
uptime:  120 acked:   301,460 acked/sec:  10,048.67 failed:        0 99%:      11,042,815 99.9%:      13,828,095 min:       3,266,560 max:      17,285,119 mean:    7,400,672.94 stddev:    1,130,409.32 user:     29,650 sys:      7,330 gc:        200 mem:      47.80
uptime:  150 acked:   301,500 acked/sec:  10,050.00 failed:        0 99%:      10,911,743 99.9%:      13,246,463 min:       3,248,128 max:      15,654,911 mean:    7,399,041.82 stddev:    1,118,368.85 user:     29,920 sys:      7,370 gc:        199 mem:      41.95
uptime:  180 acked:   301,540 acked/sec:  10,051.33 failed:        0 99%:      10,969,087 99.9%:      13,598,719 min:       3,233,792 max:      16,302,079 mean:    7,390,435.62 stddev:    1,129,976.16 user:     29,840 sys:      7,190 gc:        201 mem:      41.16
uptime:  210 acked:   301,540 acked/sec:  10,051.33 failed:        0 99%:      11,182,079 99.9%:      14,557,183 min:       3,246,080 max:      19,513,343 mean:    7,382,121.55 stddev:    1,161,863.92 user:     29,620 sys:      7,460 gc:        206 mem:     120.40
uptime:  240 acked:   301,560 acked/sec:  10,052.00 failed:        0 99%:      10,903,551 99.9%:      13,303,807 min:       3,262,464 max:      15,187,967 mean:    7,361,940.07 stddev:    1,119,275.14 user:     29,830 sys:      7,150 gc:        205 mem:      63.19
uptime:  270 acked:   301,360 acked/sec:  10,045.33 failed:        0 99%:      10,944,511 99.9%:      13,541,375 min:       3,274,752 max:      16,891,903 mean:    7,349,283.96 stddev:    1,132,254.05 user:     29,530 sys:      7,220 gc:        197 mem:     107.99
uptime:  300 acked:   301,420 acked/sec:  10,047.33 failed:        0 99%:      11,132,927 99.9%:      13,967,359 min:       3,250,176 max:      16,957,439 mean:    7,315,955.89 stddev:    1,158,914.73 user:     29,220 sys:      7,590 gc:        210 mem:      49.34

CPU usage was around 120%, no outstanding fluctuation (+- 10%).

@revans2
Contributor

revans2 commented Nov 6, 2017

@roshannaik any update on this? Is there anything I can do to help get this done for 2.x?

@roshannaik
Contributor Author

roshannaik commented Nov 6, 2017

@revans2 Thanks very much for the offer to help; I think it would be useful for getting past the issue that is blocking this.

Just updated the PR with these two key changes.

  • Discovered that workers need to communicate their internal back pressure situation to other workers, so that they stop/resume sending messages to components experiencing BP.
  • Bolts now also use non-blocking emits, so that under BP they can continue to process metrics ticks (see the sketch below).
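
A minimal sketch of the non-blocking emit idea (illustrative only; apart from JCQueue.tryPublish() and the pendingEmits queue mentioned elsewhere in this PR, the names below are assumptions, not the patch's code):

// If the destination queue is full (back pressure), park the tuple in pendingEmits
// instead of blocking the executor thread, and retry it on a later iteration.
private final Queue<AddressedTuple> pendingEmits = new ArrayDeque<>();   // java.util

void emitNonBlocking(AddressedTuple tuple, JCQueue destQueue) {
    if (!destQueue.tryPublish(tuple)) {   // non-blocking; false => downstream queue is full
        pendingEmits.add(tuple);          // remember the tuple; keep the executor loop (and metrics ticks) running
    }
}

void retryPendingEmits(JCQueue destQueue) {
    while (!pendingEmits.isEmpty() && destQueue.tryPublish(pendingEmits.peek())) {
        pendingEmits.poll();              // successfully drained one parked tuple
    }
}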

Current status:

  • Some new metrics for monitoring overflow need to be introduced. Also, some minor TODOs mentioned in the code need to be addressed.
  • During scale testing, I noticed an inexplicable performance drop under certain multi-worker scenarios. The simplest way to reproduce the issue appears to be a 2-worker setup with spouts and bolts distributed so that messaging occurs in both directions. I don't have a good grip on the issue yet; unfortunately, for the last 3 weeks I had to switch my attention to some other pressing release-centric tasks at work.

May be useful to sync up with you offline soon and see if looking into the issue together helps.

…ication

- Fixed some issues found during multi-worker testing
- Fixes to perf topos to pick up cmd line args correctly
- Removing support for single producer mode in JCQueue
- Almost all msg transfers are non-blocking now (other than metrics and credentials-change notifications). Allows spouts and bolts to process metrics ticks even if the output path is choked due to backpressure
- Some temporary debug logs that need to be removed after testing
- Needs more multiworker mode testing
Roshan Naik added 3 commits November 30, 2017 02:19
- Adding workerID to BpStatus for better debuggability
- Logging the length of an idle stretch for BP  & max.spout.pending wait situations
- Changes to defaults:  topology.executor.receive.buffer.size=32k (rounding up to power of 2),   topology.flush.tuple.freq.millis=1 (same as master)
- minor fixes and improvements
Roshan Naik added 2 commits December 4, 2017 17:12
…it Strategy settings + topology.flush.tuple.freq.millis -> topology.batch.flush.interval.millis )
- Added BackPressure unit tests
@HeartSaVioR HeartSaVioR left a comment

Another partial review done. I finished reviewing up to Task.java, and will continue reviewing as soon as I can find more time.

topology.batch.flush.interval.millis: 1 # Flush tuples are disabled if this is set to 0 or if (topology.producer.batch.size=1 and topology.transfer.batch.size=1).
topology.spout.recvq.skips: 3 # Check recvQ once every N invocations of Spout's nextTuple() [when ACKs disabled]

topology.disable.loadaware.messaging: false # load aware messaging can degrade throughput
Contributor

It may be better to describe the cases where we recommend using load-aware messaging versus disabling it. The current comment may mislead users into thinking it is always better to disable it.

Contributor Author

Yes, that would be good, but I don't have a good enough handle on loadAware's perf characteristics (other than noticing the throughput degradation with the ConstSpout* topos) to make solid recommendations. I had initially disabled it, but since there was (well-founded) opposition from Bobby w.r.t. disabling it, I reverted that and left the short comment.

Bobby mentioned some interesting improvements coming to LoadAware (like scheduling communicating tasks within the same worker instead of distributing them evenly across workers)... so hopefully this situation may change.

Since this file is not conducive to verbose recommendations, perhaps we can put better guidance in Performance.md... if anyone has concrete input on it.

Contributor

Bobby mentioned some interesting improvements coming to LoadAware (like scheduling communicating tasks within the same worker instead of distributing them evenly across workers)... so hopefully this situation may change.

Yes, basic locality awareness is already applied to the load-aware shuffle grouping, and I also expect improvements in determining locality to become available in the future.

Since this file is not conducive to verbose recommendations, perhaps we can put better guidance in Performance.md... if anyone has concrete input on it.

Agreed. Details could be put into Performance.md.
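
For what it's worth, a minimal sketch (not from the patch) of how a topology could toggle the option while benchmarking; the config key is the one shown in the defaults snippet above, and the class/method names here are purely illustrative:

import org.apache.storm.Config;

public class LoadAwareToggleExample {
    // Returns a per-topology Config that disables (or re-enables) load-aware messaging.
    public static Config benchmarkConf(boolean disableLoadAware) {
        Config conf = new Config();
        conf.put("topology.disable.loadaware.messaging", disableLoadAware);
        return conf;
    }
}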



## Max.spout.pending
The back pressure mechanism no longer requires `topology.max.spout.pending`. It is recommend to set this to null (default).
Contributor

According to the report posted in the issue, setting this option still yields better results, and hence the tests in the report use it. Do we want to document that as well, or is it still better to recommend that users set this to null?

Contributor Author

Thanks for catching that. This comment is no longer true. Will fix.

@Override
public void run() {
long start = System.currentTimeMillis();
// while (!Thread.interrupted()) {
Contributor

Is this expected to be toggled by uncommenting it (and commenting out while (!halt) {)? If not, let's just remove this line.

Contributor Author

will fix

@Override
public void run() {
long start = System.currentTimeMillis();
// while (!Thread.interrupted()) {
Contributor

Same here.

Contributor Author

fixed.

public void run() {
Handler handler = new Handler();
long start = System.currentTimeMillis();
// while(!Thread.interrupted()) {
Contributor

Same here, and for the commented-out lines below.

Contributor Author

fixed.

if (grouper == GrouperFactory.DIRECT) {
throw new IllegalArgumentException("Cannot do regular emit to direct stream");
}
List<Integer> compTasks = grouper.chooseTasks(taskId, values, loadMapping);
outTasks.addAll(compTasks);
outTasks.addAll(compTasks); // TODO: PERF: this is a perf hit
Contributor

Same here: it may be better to decide one way or the other.

IMHO I'm still not convinced that this can introduce a performance hit. The costs I can imagine are allocating the backing array (only once per method call) and expanding the array, but unless we use fan-out in a huge topology, outTasks is expected to be small.

@@ -177,6 +196,35 @@ public BuiltinMetrics getBuiltInMetrics() {
return builtInMetrics;
}


// Non Blocking call. If cannot emmit to destination immediately, such tuples will be added to `pendingEmits` argument
Contributor

nit: emmit -> emit

Contributor Author

fixed.

/**
* Send sampled data to the eventlogger if the global or component level debug flag is set (via nimbus api).
*/
public void sendToEventLogger(Executor executor, List values,
Contributor

Will this method be called on every emit when the number of event-logger executors is more than 0? If so, we may want to apply more optimizations here (not in this issue but in a follow-up issue; others can take it up).

Contributor Author

Agreed. Since topology.eventlogger.executors=0 (disabled) by default, I didn't spend time on this.

@@ -192,7 +240,7 @@ private TopologyContext mkTopologyContext(StormTopology topology) throws IOExcep
ConfigUtils.supervisorStormDistRoot(conf, workerData.getTopologyId())),
ConfigUtils.workerPidsRoot(conf, workerData.getWorkerId()),
taskId,
workerData.getPort(), workerData.getTaskIds(),
workerData.getPort(), workerData.getLocalTaskIds(),
Contributor

The renamed method is clearer about the meaning. Nice improvement.

@@ -186,9 +201,11 @@ public Runnable getSuicideCallback() {
final AtomicBoolean isTopologyActive;
final AtomicReference<Map<String, DebugOptions>> stormComponentToDebug;

// executors and taskIds running in this worker
// executors and localTaskIds running in this worker
final Set<List<Long>> executors;
Contributor

Based on the comment, this could also be named localExecutors.

@HeartSaVioR HeartSaVioR left a comment

Still in the process of reviewing... Will continue from SpoutOutputCollectorImpl.java.

itr.remove();
changed = true;
} else {
// LOG.info("Task = {}, OverflowCount = {}, Q = {}", entry.getKey(), entry.getValue().getOverflowCount(), entry.getValue().getQueuedCount() );
Contributor

Would we want to leave it with DEBUG, or remove the comment?

Contributor Author

fixed.

new WorkerState(conf, context, topologyId, assignmentId, port, workerId, topologyConf, stateStorage,
return loadWorker(topologyConf, stateStorage, stormClusterState, initCreds, initialCredentials);
}
}); // Subject.doAs(...)
Contributor

+1 on the nit, because it is simpler and not on the critical path.

@@ -193,6 +210,24 @@ public void run() {
});
}

/**
* Schedule a function to run recurrently
* @param delayMs the number of seconds to delay before running the function
Contributor

the unit is not seconds but milliseconds.

Contributor Author

indeed.

private void setupFlushTupleTimer(final Map<String, Object> topologyConf, final List<IRunningExecutor> executors) {
final Integer producerBatchSize = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_PRODUCER_BATCH_SIZE));
final Integer xferBatchSize = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_TRANSFER_BATCH_SIZE));
final Long flushIntervalMicros = ObjectReader.getLong(topologyConf.get(Config.TOPOLOGY_BATCH_FLUSH_INTERVAL_MILLIS));
Contributor

The config name is in milliseconds, whereas the variable name says microseconds. If you want to convert it later, it would be better to do so here, or just rename it to milliseconds.

It looks like it is used as milliseconds below, so I guess the variable name and the log messages below in this method are wrong.


private void setupBackPressureCheckTimer(final Map<String, Object> topologyConf) {
final Integer workerCount = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_WORKERS));
if (workerCount == 0) {
Contributor

Don't we need to change it to if (workerCount <= 1) to cover workerCount == 1?

Contributor Author

Yes, that was the intent.
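
Presumably the intended guard looks something like the following (only a sketch; the body of the check is not shown in the diff, and the early return assumes the enclosing setup method is void):

final Integer workerCount = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_WORKERS));
if (workerCount <= 1) {
    return;   // single-worker topology: no remote workers to notify, so skip the back-pressure check timer
}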

if (consumeIdleCounter==0) {
LOG.debug("Invoking consume wait strategy");
}
consumeIdleCounter = consumeWaitStrategy.idle(consumeIdleCounter);
Contributor

Could we let wait strategy instances handle relevant counters (consumeIdleCounter, bpIdleCount)?

Contributor Author

I think it is possible, but the counters have to be reset, and checked for == 0, from outside... so there didn't seem to be much benefit in abstracting them inside the strategy.

Contributor

OK. Let's revisit this if it gets confusing again.

boltFailInfo.applyOn(taskData.getUserContext());
if (delta >= 0) {
boltFailInfo.applyOn(task.getUserContext());
if (delta != 0) {
Contributor

Maybe you only intended to address the occurrence above; this one should stay the same.

(when (<= 0 delta)
(stats/bolt-failed-tuple! executor-stats
(.getSourceComponent tuple)
(.getSourceStreamId tuple)
delta))))

final boolean isActive = stormActive.get();
public Long call() throws Exception {
int receiveCount = 0;
if (i++ == recvqCheckSkipCount) {
Contributor

minor: we could use a longer but better variable name than i, since it's not a short-lived temporary variable.

Contributor Author

yes makes sense.

rmspCount = 0;
}

if ( receiveCount>1 ) {
Contributor

Is it intentional to compare this with 1? It would seem more natural to compare with 0, but you may have a reason for doing it this way.

Contributor Author

Good observation. It actually used to be 0. Then I noticed it was a fairly common case for there to be just one item in the recvQ (a metrics tick, flush tuple, etc.) during idle stretches. Once we consume that, we idle immediately rather than checking again; a minor optimization.
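
An illustrative sketch of that decision (the consume call and its return value are assumptions, not the patch's code; the idle() call follows the wait-strategy pattern shown earlier in this review):

int receiveCount = receiveQ.consume(handler);     // hypothetical: number of items consumed in this pass
if (receiveCount > 1) {
    consumeIdleCounter = 0;                       // real traffic arrived: reset the idle/backoff counter
} else {
    // 0 or 1 items (typically just a metrics tick or flush tuple): idle right away instead of re-polling
    consumeIdleCounter = consumeWaitStrategy.idle(consumeIdleCounter);
}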

if (!transferQueue.isFull() && !throttleOn && !reachedMaxSpoutPending) {
for (ISpout spout : spouts) {
spout.nextTuple();
boolean pendingEmitsIsEmpty = tryFlushPendingEmits();
Contributor

minor: The logic for when the spout is active is now somewhat long, with lots of nested ifs and returns. It may be better to refactor it a bit.

Contributor Author

Yes, I tried that once and rolled it back due to the two levels of callables. Will try again.

@HeartSaVioR HeartSaVioR left a comment

Forgot one comment:
I have seen this kind of line, idToTask.get(taskId - idToTaskBase), so many times. Given that it is not easy to understand unless we indicate how idToTask is built, it would be really nice if we could find a way to abstract it.
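
One possible abstraction (purely illustrative, not part of the patch): hide the base-offset arithmetic behind a small accessor on the owning class.

private Task getTaskById(int taskId) {
    return idToTask.get(taskId - idToTaskBase);   // idToTask is an ArrayList indexed from idToTaskBase
}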

@HeartSaVioR HeartSaVioR left a comment

Just finished reviewing, except for JCQueue and the related tests. I'll see if I can review them afterwards.

writeMessage(channel, batch);
}
} catch (IOException e) {
dropMessages(msgs);
Contributor

Is !channel.isConnected() the only cause of an IOException here? If so, this doesn't swallow the exception quietly, but it will if there are other possible causes of IOException.

Contributor Author

writeMessage() is the only call that throws that IOException (after checking !channel.isConnected()). I think I should add a log message as well.
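
Something along these lines, perhaps (only a sketch of the catch block above; msgs, channel, dropMessages, and LOG are the names visible in the surrounding class):

} catch (IOException e) {
    LOG.warn("Dropping messages because channel {} is not connected", channel, e);   // don't swallow silently
    dropMessages(msgs);
}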

@@ -39,7 +38,7 @@ public KryoTupleDeserializer(final Map<String, Object> conf, final GeneralTopolo
_kryoInput = new Input(1);
}

public Tuple deserialize(byte[] ser) {
public TupleImpl deserialize(byte[] ser) {
Contributor

Just a reminder: Let's add @Override, and just revert back if we don't leverage the explicit implementation type.

@@ -30,4 +31,5 @@
void ack(Tuple input);
void fail(Tuple input);
void resetTimeout(Tuple input);
void flush();
Contributor

I think this is effectively a public API, but given that we're shipping the change only in a major version, that is probably fine. It would be better if we had alternatives, though.


@Override
public void flush() {
//NOOP //TODO: Roshan: validate if this is OK
Contributor

Please remove TODO if it is addressed, or address this before merging.

Contributor Author

Not sure yet; I wanted others' opinions. Would you know if it's OK for this to be a no-op?

Contributor

We are adding this in this patch, so I'm also not sure about that.

public TupleImpl(GeneralTopologyContext context, List<Object> values, int taskId, String streamId) {
this(context, values, taskId, streamId, MessageId.makeUnanchored());

//// Perf Note: Disabling these checks as they are a throughput bottleneck
Contributor

Friendly reminder: if we are OK with dropping schema validation, let's just remove the code. If we still want to validate the schema, let's leave it as it is.
(Maybe we could pass the schema in as well, for the same reason as srcComponent.)

if (s==null) // then stop running it
break;
if (s>0)
Thread.sleep(s);
Contributor

Please use Time.sleepSecs() for simulated time.
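
That would make the loop body above look roughly like this (a sketch; it assumes s is in seconds as the suggestion implies -- if it is actually milliseconds, Time.sleep(s) would be the equivalent):

if (s == null) {        // then stop running it
    break;
}
if (s > 0) {
    Time.sleepSecs(s);  // org.apache.storm.utils.Time honors simulated time in tests, unlike Thread.sleep()
}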

@@ -1505,4 +1507,20 @@ public static StormTopology addVersions(StormTopology topology) {
}
return defaultsConf;
}

public static <V> ArrayList<V> convertToArray(Map<Integer, V> srcMap, int start) {
Set<Integer> executorIds = srcMap.keySet();
Contributor

Let's use a more generic variable name: ids would be enough.

Contributor Author

indeed.
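
For context, the helper presumably does something like the following (a sketch reconstructed from its signature and from the idToTask.get(taskId - idToTaskBase) usage mentioned later in this review, not the literal patch code; java.util imports assumed):

public static <V> ArrayList<V> convertToArray(Map<Integer, V> srcMap, int start) {
    int largestId = srcMap.keySet().stream().max(Integer::compareTo).get();
    ArrayList<V> result = new ArrayList<>(largestId - start + 1);
    for (int i = 0; i <= largestId - start; i++) {
        result.add(null);                                        // pre-fill with nulls so set() can index freely
    }
    for (Map.Entry<Integer, V> entry : srcMap.entrySet()) {
        result.set(entry.getKey() - start, entry.getValue());    // place each value at index (id - start)
    }
    return result;
}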

@@ -79,6 +79,7 @@
protected final long lowMemoryThresholdMB;
protected final long mediumMemoryThresholdMb;
protected final long mediumMemoryGracePeriodMs;
private static int port = 5006; // TODO: Roshan: remove this after stabilization
Contributor

Friendly reminder: this should be addressed before merging.

Contributor Author

Yes, will do this last; it's needed for debugging.

@@ -747,6 +748,16 @@ protected String javaCmd(String cmd) {
commandList.add("-Dstorm.conf.file=" + topoConfFile);
commandList.add("-Dstorm.options=" + stormOptions);
commandList.add("-Djava.io.tmpdir=" + workerTmpDir);
if (_topoConf.get("attach.debugger") != null) { // TODO: Roshan: remove this after stabilization
Contributor

Friendly reminder: this should be addressed before merging.

@HeartSaVioR HeartSaVioR left a comment

Also reviewed the JCQueue-related files: I've finished another pass of reviewing.

}
} // class DirectInserter

private static class BatchInserter implements Inserter {
Contributor

minor: it may be better to leave a comment noting that the class is not thread-safe. Even better would be to also add a comment on the ThreadLocal variable.

Contributor Author

good point.

if (producerBatchSz > 1) {
inserter = thdLocalBatcher.get();
if (inserter == null) {
BatchInserter b = new BatchInserter(this, producerBatchSz);
Contributor

Just curious: is there a specific reason to have the temporary local variable b instead of directly using the local inserter variable?

Contributor Author

To avoid casting: thdLocalBatcher.set(b) expects a BatchInserter, and inserter is of the base type.

}


public interface ExitCondition {
Contributor

A Predicate with properly named variables might let us get rid of this interface.

Contributor Author

Yes, I had it that way originally, but then felt that the call to keepRunning() made things more readable (it makes it clear that if it returns true we keep running, as opposed to quitting). I can go back to a Predicate-style interface if you feel this is not desirable.

Contributor

OK makes sense. Let's keep it as it is.
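
For readers following along, a sketch of the trade-off being discussed (the interface shape is assumed; only the ExitCondition name and keepRunning() come from the diff):

public interface ExitCondition {
    boolean keepRunning();                 // named for intent: true => keep consuming
}

// The consumer loop then reads as intent:
//     while (exitCondition.keepRunning()) { ... consume from the queue ... }
// The java.util.function alternative would be a BooleanSupplier with a well-named variable:
//     while (shouldKeepRunning.getAsBoolean()) { ... }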

for (int i = 0; i < producerNum; i++) {
producerThreads[i].join(TIMEOUT);
assertFalse("producer "+i+" is still alive", producerThreads[i].isAlive());
}
queue.haltWithInterrupt();
consumerThread.join(TIMEOUT);
assertFalse("consumer is still alive", consumerThread.isAlive());
//TODO need to fix this... assertFalse("consumer is still alive", consumerThread.isAlive());
Contributor

Is it fixed, or does it still need fixing?

Contributor Author

Yes, it's fixed. Will remove the comment.

}


// check that tryPublish() & tryOverflowPublish() work as expected
Contributor

Did you miss adding @Test here, or is that intended?

Contributor Author

thanks. will fix.

Roshan Naik added 2 commits December 28, 2017 01:15
@roshannaik roshannaik left a comment

  • Thanks @HeartSaVioR. Addressed most of your comments; will get back to you on the rest.

for (LoadAwareCustomStreamGrouping grouper : streamComponentToGrouper.get(stream).values()) {
ArrayList<Integer> outTasks = new ArrayList<>();

// TODO: PERF: expensive hashtable lookup in critical path
Contributor Author

It may be possible to fix, but it needs some work... one way is to see whether streamIds can be represented internally as ints.
Filed an issue.

if (grouper == GrouperFactory.DIRECT) {
throw new IllegalArgumentException("Cannot do regular emit to direct stream");
}
List<Integer> compTasks = grouper.chooseTasks(taskId, values, loadMapping);
outTasks.addAll(compTasks);
outTasks.addAll(compTasks); // TODO: PERF: this is a perf hit
Contributor Author

I understand your position, as I was surprised too. I have filed an issue and explained in it how I originally discovered this. I didn't pursue it much, so I don't know why it is a perf hit (or whether it still is), but I felt it was a good idea to record the observation for later.

@@ -177,6 +196,35 @@ public BuiltinMetrics getBuiltInMetrics() {
return builtInMetrics;
}


// Non Blocking call. If cannot emmit to destination immediately, such tuples will be added to `pendingEmits` argument
Contributor Author

fixed.

/**
* Send sampled data to the eventlogger if the global or component level debug flag is set (via nimbus api).
*/
public void sendToEventLogger(Executor executor, List values,
Contributor Author

Agreed. Since topology.eventlogger.executors=0 (disabled) by default, I didn't spend time on this.

itr.remove();
changed = true;
} else {
// LOG.info("Task = {}, OverflowCount = {}, Q = {}", entry.getKey(), entry.getValue().getOverflowCount(), entry.getValue().getQueuedCount() );
Contributor Author

fixed.


@Override
public void flush() {
//NOOP //TODO: Roshan: validate if this is OK
Contributor Author

Not sure yet; I wanted others' opinions. Would you know if it's OK for this to be a no-op?

@Override
public void run() {
long start = System.currentTimeMillis();
// while (!Thread.interrupted()) {
Contributor Author

fixed.

public void run() {
Handler handler = new Handler();
long start = System.currentTimeMillis();
// while(!Thread.interrupted()) {
Contributor Author

fixed.

if (workerData.tryTransferRemote(addressedTuple, pendingEmits)) {
++remotesBatchSz;
if (remotesBatchSz >= producerBatchSz) {
workerData.tryFlushRemotes();
Contributor Author

Need to take a more careful look; will get back on that.

try {
Task task = new Task(executor, taskId);
executor.sendUnanchored(
task, StormCommon.SYSTEM_STREAM_ID, new Values("startup"), executor.getExecutorTransfer());
task.sendUnanchored( StormCommon.SYSTEM_STREAM_ID, new Values("startup"), executor.getExecutorTransfer(), null); // TODO: Roshan: does this get delivered/handled anywhere ?
Contributor Author

@HeartSaVioR, @revans2, others: would you know why we have this call to sendUnanchored? I recall verifying on master (and on 2306) that it doesn't get delivered anywhere. We could remove it if it's not needed.

@revans2
Contributor

revans2 commented Jan 4, 2018

Sorry it has taken me so long to get back to this. I did a quick pass through the code, but I need to spend some time to go more in depth. I expect to be doing this this afternoon and tomorrow.

My biggest concern right now is around up-merging and getting the unit tests to pass. I don't see a reason to have JCQueueTest fail 100% of the time.

I am also concerned that a lot of other unit tests fail 100% of the time with this patch too.

@roshannaik
Contributor Author

This PR page has become unusable. Most of the time it doesn't load, and when it does load it is too slow to use... even for just typing a comment. Let's stop using this one and switch to the new PR (thanks for the suggestion @HeartSaVioR). There are a few recent unaddressed comments on this PR that I will track and address in the new PR.

@roshannaik roshannaik closed this Jan 5, 2018